# Tips for running large risk calculations¶

Scenario risk calculations usually do not pose a performance problem, since they involve a single rupture and a limited geography for analysis. Some event-based risk calculations, however, may involve millions of ruptures and exposures spanning entire countries or even larger regions. This section offers some practical tips for running large event based risk calculations, especially ones involving large logic trees, and proposes techniques that might be used to make an intractable calculation tractable.

## Understanding the hazard¶

Event-based calculations are typically dominated by the hazard component (unless there are lots of assets aggregated on few hazard sites) and therefore the first thing to do is to estimate the size of the hazard, i.e. the number of GMFs that will be produced. Since we are talking about a large calculation, first of all we need reduce it to a size that is guaranteed to run quickly. The simplest way to do that is to reduce the parameters directly affecting the number of ruptures generated, i.e.

- investigation_time
- ses_per_logic_tree_path
- number_of_logic_tree_samples

For instance, if you have `ses_per_logic_tree_path = 10,000`

reduce
it to 10, run the calculation and you will see in the log something
like this:

```
[2018-12-16 09:09:57,689 #35263 INFO] Received
{'gmfdata': '752.18 MB', 'hcurves': '224 B', 'indices': '29.42 MB'}
```

The amount of GMFs generated for the reduced calculation is 752.18 MB; and since the calculation has been reduced by a factor of 1,000, the full computation is likely to generate around 750 GB of GMFs. Even if you have sufficient disk space to store this large quantity of GMFs, most likely you will run out of memory. Even if the hazard part of the calculation manages to run to completion, the risk part of the calculation is very likely to fail — managing 750 GB of GMFs is beyond the current capabilities of the engine. Thus, you will have to find ways to reduce the size of the computation.

A good start would be to carefully set the parameters
`minimum_magnitude`

and `minimum_intensity`

:

`minimum_magnitude`

is a scalar or a dictionary keyed by tectonic region; the engine will discard ruptures with magnitudes below the given threshoulds`minimum_intensity`

is a scalar or a dictionary keyed by the intensity measure type; the engine will discard GMFs below the given intensity threshoulds

Choosing reasonable cutoff thresholds with these parameters can significantly reduce the size of your computation when there are a large number of small magnitude ruptures or low intensity GMFs being generated, which may have a negligible impact on the damage or losses, and thus could be safely discarded.

## Collapsing of branches¶

When one is not interested so much in the uncertainty around the loss estimates, but more interested simply in the mean estimates, all of the source model branches can be “collapsed” into one branch. Using the collapsed source model should yield the same mean hazard or loss estimates as using the full source model logic tree and then computing the weighted mean of the individual branch results.

Similarly, the GMPE logic tree for each tectonic region can also be “collapsed” into a single branch. Using a single collapsed GMPE for each TRT should also yield the same mean hazard estimates as using the full GMPE logic tree and then computing the weighted mean of the individual branch results. This has become possible through the introduction of AvgGMPE feature in version 3.9.

## Splitting the calculation into subregions¶

If one is interested in propagating the full uncertainty in the source models or ground motion models to the hazard or loss estimates, collapsing the logic trees into a single branch to reduce computational expense is not an option. But before going through the effort of trimming the logic trees, there is an interim step that must be explored, at least for large regions like the entire continental United States. This step is to geographically divide the large region into logical smaller subregions, such that the contribution to the hazard or losses in one subregion from the other subregions is negligibly small or even zero. The effective realizations in each of the subregions will then be much fewer than when trying to cover the entire large region in a single calculation.

## Trimming of the logic-trees or sampling of the branches¶

Trimming or sampling may be necessary if the following two conditions hold:

- You are interested in propagating the full uncertainty to the hazard and loss estimates; only the mean or quantile results are not sufficient for your analysis requirements, AND
- The region of interest cannot be logically divided further as described above; the logic-tree for your chosen region of interest still leads to a very large number of effective realizations.

Sampling is the easier of the two options now. You only need to ensure that you sample a sufficient number of branches to capture the underlying distribution of the hazard or loss results you are interested in. The drawback of random sampling is that you may still need to sample hundreds of branches to capture well the underlying distribution of the results.

Trimming can be much more efficient than sampling, because you pick a few branches such that the distribution of the hazard or loss results obtained from a full-enumeration of these branches is nearly the same as the distribution of the hazard or loss results obtained from a full-enumeration of the entire logic-tree.

## Disabling the propagation of vulnerability uncertainty to losses¶

The vulnerability functions using continuous distributions (such as the lognormal distribution or beta distribution) to characterize the uncertainty in the loss ratio conditional on the shaking intensity level, specify the mean loss ratios and the corresponding coefficients of variation for a set of intensity levels. They are used to build the so called epsilon matrix within the engine, which is how loss ratios are sampled from the distribution for each asset.

There is clearly a performance penalty associated with the propagation of uncertainty in the vulnerability to losses. The epsilon matrix has to be computed and stored, and then the worker processes have to read it, which involves large quantities of data transfer and memory usage.

Setting

`ignore_covs = true`

in your job.ini file will result in the engine using just the mean loss ratio conditioned on the shaking intensity and ignoring the uncertainty. This tradeoff of not propagating the vulnerabilty uncertainty to the loss estimates can lead to a significant boost in performance and tractability.

## The ebrisk calculator¶

Even with all the tricks in the book, some problems cannot be solved
with the traditional `event_based_risk`

calculator, in particular
when there are too many hazard sites. Suppose for instance that you
have a very detailed exposure for Canada with 462,000 hazard sites,
and a corresponding site model covering all of the sites.
It would be a pity to lose such detailed information by aggregating
the assets onto a coarser grid, but this may be the only viable option
for the traditional `event_based_risk`

calculator.

The issue is that the `event_based_risk`

cannot work well with
so many sites, unless you reduce your investigation_time considerably.
If the investigation_time is long enough for a reasonable computation,
you will most likely run into issues such as:

- running out of memory when computing the GMFs
- running out of disk space when saving the GMFs
- running out of memory when reading the GMFs
- having an impossibly slow risk calculation

The solution - in theory - would be to split Canada in regions, but this comes with its own problems. For instance,

- one has to compute the ruptures for all Canada in a single run, to make sure that the random seeds are consistent for all regions
- then one has to run several calculations starting from the ruptures, one per sub-region
- finally one has to carefully aggregate the results from the different calculations

Such steps are tedious, time consuming and very much error prone.

In order to solve such issues a new calculator, tentatively called `ebrisk`

,
has been introduced in engine 3.4. For small calculations the `ebrisk`

calculator
will not be much better than the `event_based_risk`

calculator, but
the larger your calculation is, the better it will work, and in situations
like the Canada example here it can be orders of
magnitude more efficient, both in speed and memory consumption.
The reason why the `ebrisk`

calculator is so efficient is that
it computes the GMFs in memory instead of reading them from the datastore.

The older `event_based_risk`

calculator
works by storing the GMFs in the hazard phase of the calculation and
by reading them in the risk phase. For small to medium sized risk
calculations, this approach has the following advantages:

- if the GMFs calculation is expensive, it is good to avoid repeating it when you change a risk parameter without changing the hazard parameters
- it is convenient to have the GMFs saved on disk to debug issues with the calculation
- except for huge calculations, writing and reading the GMFs is fast, since they stored in a very optimized HDF5 format

On the other hand, there are other considerations for large national or continental risk calculations:

- these larger risk calculations are typically dominated by the reading time of the GMFs, which happens concurrently
- saving disk space matters, as such large calculations can generate hundreds of gigabytes or even terabytes of GMFs that cannot be stored conveniently

So, in practice, in very large calculations the strategy of computing the
GMFs on-the-fly wins over over the strategy of saving them to disk and this is
why the `ebrisk`

calculator exists.

## Differences with the event_based_risk calculator¶

The `event_based_risk`

calculator parallelizes by hazard sites: it splits
the exposure in spatial blocks and then each task reads the GMFs for each site
in the block it gets.

The `ebrisk`

calculator instead parallelizes by ruptures: it splits
the ruptures in blocks and then each task generates the corresponding GMFs
on the fly.

Since the amount of data in ruptures form is typically two orders of
magnitude smaller than the amount of data in GMFs, and since the GMF-generation
is fast, the `ebrisk`

calculator is able to beat the `event_based_risk`

calculator.

Moreover, since each task in the `ebrisk`

calculator gets sent the entire
exposure, it is able to aggregate the losses without problems, while the
`event_based_risk`

calculator has troubles doing that — even if each task
has access to all events, it only receives a subset of the exposure, so it
cannot aggregate on the assets. Starting from engine 3.11 the
`event_based_risk`

calculator can compute aggregate losses and aggregate
loss curves, but in an inefficient way, by collection partial returns
from all tasks and aggregating them. That means that your calculation
can easily run out of memory or can be extremely slow, while it would work
much better by setting `calculation_mode=ebrisk`

.

Aggregation of average annual losses, is computed simply by
summing the component values. The algorithm is linear and both the
`event_based_risk`

calculator and the `ebrisk`

calculator are capable of
this aggregation, but the first calculator is more efficient.

## The asset loss table and the agg_loss_table¶

When performing an event based risk (or ebrisk) calculation the engine keeps in memory a table with the losses for each asset and each event, for each loss type. It is usually impossible to fully store such table, because it is extremely large; for instance, for 1 million assets, 1 million events, 2 loss types and 4 bytes per loss ~8 TB of disk space would be required. It is true that many events will produce zero losses because of the maximum_distance and minimum_intensity parameters, but still the asset loss table is prohibitively large and for many years could not be stored. In engine 3.8 we made a breakthrough: we decided to store a partial asset loss table, obtained by discarding small losses, by leveraging on the fact that loss curves for long enough return periods are dominated by extreme events, i.e. there is no point in saving all the small losses.

To that aim,the engine honors a parameter called
`minimum_asset_loss`

which determine how many losses are discarded
when storing the asset loss table. The rule is simple: losses below
`minimum_asset_loss`

are discarded. By choosing the threshold
properly in an ideal world

- the vast majority of the losses would be discarded, thus making the asset loss table storable;
- the loss curves would still be nearly identical to the ones without discarding any loss, except for small return periods.

It is the job of the user to verify if 1 and 2 are true in the real world.
He can assess that by playing with the `minimum_asset_loss`

in a small
calculation, finding a good value for it, and then extending to the large
calculation. Clearly it is a matter of compromise: by sacrificing precision
it is possible to reduce enourmously the size of the stored asset loss table
and to make an impossible calculation possible.

NB: starting from engine 3.11 the asset loss table is stored if the user specifies

`aggregate_by = id`

in the job.ini file. In large calculations it extremely easy to run out of
memory or the make the calculation extremely slow, so we recommend
not to store the asset loss table. The functionality is there for the sole
purpose of debugging small calculations, for instance to see the effect
of the `minimum_asset_loss`

approximation at the asset level.

For large calculations usually one is interested in the aggregate loss
table, which contains the losses per event and per aggregation tag (or
multi-tag). For instance, the tag `occupancy`

has the three values
“Residential”, “Industrial” and “Commercial” and by setting

`aggregate_by = occupancy`

the engine will store a pandas DataFrame with field `agg_id`

with 4
possible value: 0 for “Residential”, 1 for “Industrial”, 2 for “Commercial”
and 3 for the full aggregation.

NB: if the parameter `aggregate_by`

is not specified, the engine will
still compute the `agg_loss_table`

but then the `agg_id`

field will
have a single value 0 corresponding to the total portfolio losses.

## The Probable Maximum Loss (PML) and the loss curves¶

Given an effective investigation time, a return period and an
`agg_loss_table`

, the engine is able to compute a PML for each
aggregation tag. It does so by using the function
`openquake.risklib.scientific.losses_by_period`

which takes in input
an array of cumulative losses associated to the aggregation tag, a
list of or return periods, and the effective investigation time. If
there is a single return period the function returns the PML; if there are
multiple return periods it returns the loss curve. The two concepts
are essentially the same thing, since a loss curve is just an array of
PMLs, one for each return period. For instance

computes the Probably Maximum Loss at 500 years for the given losses
with an effective investigation time of 1000 years. The algorithm works
by ordering the losses (suppose there are E > 1 losses) generating E time
periods `eff_time/E, eff_time/(E-1), ... eff_time/1`

and log-interpolating
the loss at the return period. Of course this works only if the condition

`eff_time/E < return_period < eff_time`

is respected. In this example there are E=16 losses, so the return period must be in the range 62.5 .. 1000 years. If the return period is too small the PML will be zero

```
>>> losses_by_period(losses, [50], eff_time=1000)
array([0.])
```

while if the return period is outside the investigation range we will refuse the temptation to extrapolate and we will return NaN instead:

```
>>> losses_by_period(losses, [1500], eff_time=1000)
array([nan])
```

The rules above are the reason while you will see zeros or NaNs in the loss curves generated by the engine sometimes, especially when there are too few events: the valid range will be small and some return periods may slip outside the range.

In order to compute aggregate loss curves you must
set the `aggregate_by`

parameter in the `job.ini`

to one or more tags
over which you wish to perform the aggregation. Your exposure must contain
the specified tags with values for each asset.
We have an example for Nepal in our event based risk demo.
The exposure for this demo contains various tags and in particular a geographic
tag called NAME1 with values “Mid-Western”, “Far-Western”, “West”, “East”,
“Central”, and the `job_eb.ini`

file defines

`aggregate_by = NAME_1`

When running the calculation you will see something like this:

```
Calculation 1 finished correctly in 17 seconds
id | name
9 | Aggregate Event Losses
1 | Aggregate Loss Curves
2 | Aggregate Loss Curves Statistics
3 | Aggregate Losses
4 | Aggregate Losses Statistics
5 | Average Asset Losses Statistics
11 | Earthquake Ruptures
6 | Events
7 | Full Report
8 | Input Files
10 | Realizations
12 | Total Loss Curves
13 | Total Loss Curves Statistics
14 | Total Losses
15 | Total Losses Statistics
```

Exporting the *Aggregate Loss Curves Statistics* output will give
you the mean and quantile loss curves in a format like the following one:

```
annual_frequency_of_exceedence,return_period,loss_type,loss_value,loss_ratio
5.00000E-01,2,nonstructural,0.00000E+00,0.00000E+00
5.00000E-01,2,structural,0.00000E+00,0.00000E+00
2.00000E-01,5,nonstructural,0.00000E+00,0.00000E+00
2.00000E-01,5,structural,0.00000E+00,0.00000E+00
1.00000E-01,10,nonstructural,0.00000E+00,0.00000E+00
1.00000E-01,10,structural,0.00000E+00,0.00000E+00
5.00000E-02,20,nonstructural,0.00000E+00,0.00000E+00
5.00000E-02,20,structural,0.00000E+00,0.00000E+00
2.00000E-02,50,nonstructural,0.00000E+00,0.00000E+00
2.00000E-02,50,structural,0.00000E+00,0.00000E+00
1.00000E-02,100,nonstructural,0.00000E+00,0.00000E+00
1.00000E-02,100,structural,0.00000E+00,0.00000E+00
5.00000E-03,200,nonstructural,1.35279E+05,1.26664E-06
5.00000E-03,200,structural,2.36901E+05,9.02027E-03
2.00000E-03,500,nonstructural,1.74918E+06,1.63779E-05
2.00000E-03,500,structural,2.99670E+06,1.14103E-01
1.00000E-03,1000,nonstructural,6.92401E+06,6.48308E-05
1.00000E-03,1000,structural,1.15148E+07,4.38439E-01
```

If you do not set the `aggregate_by`

parameter
you will still able to compute the total loss curve
(for the entire portfolio of assets), and the total average losses.