Tips for running large risk calculations

Scenario risk calculations usually do not pose a performance problem, since they involve a single rupture and a limited geography for analysis. Some event-based risk calculations, however, may involve millions of ruptures and exposures spanning entire countries or even larger regions. This section offers some practical tips for running large event-based risk calculations, especially ones involving large logic trees, and proposes techniques that might be used to make an intractable calculation tractable.

Understanding the hazard

Event-based calculations are typically dominated by the hazard component (unless there are lots of assets aggregated on few hazard sites) and therefore the first thing to do is to estimate the size of the hazard, i.e. the number of GMFs that will be produced. Since we are talking about a large calculation, first of all we need to reduce it to a size that is guaranteed to run quickly. The simplest way to do that is to reduce the parameters directly affecting the number of ruptures generated, i.e.

  • investigation_time
  • ses_per_logic_tree_path
  • number_of_logic_tree_samples

For instance, if you have ses_per_logic_tree_path = 10,000, reduce it to 10, run the calculation, and you will see in the log something like this:

[2018-12-16 09:09:57,689 #35263 INFO] Received
{'gmfdata': '752.18 MB', 'hcurves': '224 B', 'indices': '29.42 MB'}

The amount of GMF data generated by the reduced calculation is 752.18 MB; since the calculation has been reduced by a factor of 1,000, the full computation is likely to generate around 750 GB of GMFs. Even if you have sufficient disk space to store this large quantity of GMFs, most likely you will run out of memory. Even if the hazard part of the calculation manages to run to completion, the risk part is very likely to fail: managing 750 GB of GMFs is beyond the current capabilities of the engine. Thus, you will have to find ways to reduce the size of the computation.
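To make that extrapolation explicit, here is a back-of-the-envelope sketch using the numbers reported in the log above:

# scale the 'gmfdata' size of the reduced run back up to the full run
reduced_gmfdata_mb = 752.18        # size reported in the log of the reduced run
reduction_factor = 10_000 / 10     # original ses_per_logic_tree_path / reduced value
full_gmfdata_gb = reduced_gmfdata_mb * reduction_factor / 1000
print(f"expected GMF size: ~{full_gmfdata_gb:.0f} GB")  # ~752 GB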

A good start would be to carefully set the parameters minimum_magnitude and minimum_intensity:

  • minimum_magnitude is a scalar or a dictionary keyed by tectonic region; the engine will discard ruptures with magnitudes below the given thresholds
  • minimum_intensity is a scalar or a dictionary keyed by the intensity measure type; the engine will discard GMFs below the given intensity thresholds

Choosing reasonable cutoff thresholds with these parameters can significantly reduce the size of your computation when there are a large number of small magnitude ruptures or low intensity GMFs being generated, which may have a negligible impact on the damage or losses, and thus could be safely discarded.
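For example, one could start from conservative scalar cutoffs in the job.ini and tighten them after checking their impact on the results in a reduced calculation; the values below are purely illustrative and must be tuned to the specific model:

minimum_magnitude = 4.5
minimum_intensity = 0.05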

Collapsing of branches

When one is not interested so much in the uncertainty around the loss estimates, but more interested simply in the mean estimates, all of the source model branches can be “collapsed” into one branch. Using the collapsed source model should yield the same mean hazard or loss estimates as using the full source model logic tree and then computing the weighted mean of the individual branch results.

Similarly, the GMPE logic tree for each tectonic region can also be “collapsed” into a single branch. Using a single collapsed GMPE for each TRT should also yield the same mean hazard estimates as using the full GMPE logic tree and then computing the weighted mean of the individual branch results. This has become possible through the introduction of the AvgGMPE feature in engine 3.9.

Splitting the calculation into subregions

If one is interested in propagating the full uncertainty in the source models or ground motion models to the hazard or loss estimates, collapsing the logic trees into a single branch to reduce computational expense is not an option. But before going through the effort of trimming the logic trees, there is an interim step worth exploring, at least for large regions like the entire continental United States. This step is to geographically divide the large region into smaller logical subregions, such that the contribution to the hazard or losses in one subregion from the other subregions is negligibly small or even zero. The number of effective realizations in each of the subregions will then be much smaller than when trying to cover the entire large region in a single calculation.

Trimming of the logic-trees or sampling of the branches

Trimming or sampling may be necessary if the following two conditions hold:

  1. You are interested in propagating the full uncertainty to the hazard and loss estimates; the mean or quantile results alone are not sufficient for your analysis requirements, AND
  2. The region of interest cannot be logically divided further as described above; the logic-tree for your chosen region of interest still leads to a very large number of effective realizations.

Sampling is the easier of the two options. You only need to ensure that you sample a sufficient number of branches to capture the underlying distribution of the hazard or loss results you are interested in. The drawback of random sampling is that you may still need to sample hundreds of branches to capture the underlying distribution of the results well.
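For instance, assuming (hypothetically) that 500 samples are judged sufficient to capture the distribution of the results of interest, sampling is enabled in the job.ini simply by setting:

number_of_logic_tree_samples = 500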

Trimming can be much more efficient than sampling, because you pick a few branches such that the distribution of the hazard or loss results obtained from a full-enumeration of these branches is nearly the same as the distribution of the hazard or loss results obtained from a full-enumeration of the entire logic-tree.

Disabling the propagation of vulnerability uncertainty to losses

Vulnerability functions that use continuous distributions (such as the lognormal or beta distribution) to characterize the uncertainty in the loss ratio conditional on the shaking intensity level specify the mean loss ratios and the corresponding coefficients of variation for a set of intensity levels. These are used to build the so-called epsilon matrix within the engine, which is how loss ratios are sampled from the distribution for each asset.
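As a minimal illustration of the idea (this is only a sketch, not the engine's actual implementation), loss ratios with a given mean and coefficient of variation could be sampled from a lognormal distribution as follows:

import numpy as np

def sample_lognormal_loss_ratios(mean_ratio, cov, num_samples, seed=0):
    # parameters of the underlying normal distribution for the given mean and CoV
    sigma2 = np.log(1.0 + cov ** 2)
    mu = np.log(mean_ratio) - sigma2 / 2.0
    rng = np.random.default_rng(seed)
    return rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=num_samples)

# e.g. a mean loss ratio of 0.15 with a 60% coefficient of variation
ratios = sample_lognormal_loss_ratios(0.15, 0.6, num_samples=10_000)
print(ratios.mean())  # close to 0.15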

There is clearly a performance penalty associated with the propagation of uncertainty in the vulnerability to losses. The epsilon matrix has to be computed and stored, and then the worker processes have to read it, which involves large quantities of data transfer and memory usage.

Setting

ignore_covs = true

in your job.ini file will result in the engine using just the mean loss ratio conditioned on the shaking intensity and ignoring the uncertainty. This tradeoff of not propagating the vulnerability uncertainty to the loss estimates can lead to a significant boost in performance and tractability.

The ebrisk calculator

Even with all the tricks in the book, some problems cannot be solved with the traditional event_based_risk calculator, in particular when there are too many hazard sites. Suppose for instance that you have a very detailed exposure for Canada with 462,000 hazard sites, and a corresponding site model covering all of the sites. It would be a pity to lose such detailed information by aggregating the assets onto a coarser grid, but this may be the only viable option for the traditional event_based_risk calculator.

The issue is that the event_based_risk calculator cannot work well with so many sites, unless you reduce your investigation_time considerably. If the investigation_time is long enough for a reasonable computation, you will most likely run into issues such as:

  1. running out of memory when computing the GMFs
  2. running out of disk space when saving the GMFs
  3. running out of memory when reading the GMFs
  4. having an impossibly slow risk calculation

The solution - in theory - would be to split Canada into regions, but this comes with its own problems. For instance,

  1. one has to compute the ruptures for all of Canada in a single run, to make sure that the random seeds are consistent for all regions
  2. then one has to run several calculations starting from the ruptures, one per sub-region
  3. finally one has to carefully aggregate the results from the different calculations

Such steps are tedious, time consuming and error prone.

In order to solve such issues a new calculator, tentatively called ebrisk, has been introduced in engine 3.4. For small calculations the ebrisk calculator will not be much better than the event_based_risk calculator, but the larger your calculation is, the better it will work, and in situations like the Canada example here it can be orders of magnitude more efficient, both in speed and memory consumption. The reason why the ebrisk calculator is so efficient is that it computes the GMFs in memory instead of reading them from the datastore.

The older event_based_risk calculator works by storing the GMFs in the hazard phase of the calculation and by reading them in the risk phase. For small to medium sized risk calculations, this approach has the following advantages:

  1. if the GMF calculation is expensive, it is good to avoid repeating it when you change a risk parameter without changing the hazard parameters
  2. it is convenient to have the GMFs saved on disk to debug issues with the calculation
  3. except for huge calculations, writing and reading the GMFs is fast, since they are stored in a very optimized HDF5 format

On the other hand, there are other considerations for large national or continental risk calculations:

  1. these larger risk calculations are typically dominated by the time spent reading the GMFs, which is done concurrently by the worker processes
  2. saving disk space matters, as such large calculations can generate hundreds of gigabytes or even terabytes of GMFs that cannot be stored conveniently

So, in practice, in very large calculations the strategy of computing the GMFs on the fly wins over the strategy of saving them to disk, and this is why the ebrisk calculator exists.

Differences with the event_based_risk calculator

The event_based_risk calculator parallelizes by hazard sites: it splits the exposure into spatial blocks and then each task reads the GMFs for the sites in the block it gets.

The ebrisk calculator instead parallelizes by ruptures: it splits the ruptures into blocks and then each task generates the corresponding GMFs on the fly.

Since the amount of data in rupture form is typically two orders of magnitude smaller than the amount of data in the corresponding GMFs, and since the GMF generation is fast, the ebrisk calculator is able to beat the event_based_risk calculator.

Moreover, since each task in the ebrisk calculator gets sent the entire exposure, it is able to aggregate the losses without problems, while the event_based_risk calculator has trouble doing that: even if each task has access to all events, it only receives a subset of the exposure, so it cannot aggregate on the assets. Starting from engine 3.11 the event_based_risk calculator can compute aggregate losses and aggregate loss curves, but in an inefficient way, by collecting partial results from all tasks and aggregating them. That means that your calculation can easily run out of memory or be extremely slow, while it would work much better by setting calculation_mode=ebrisk.

Aggregation of average annual losses is computed simply by summing the component values. The algorithm is linear and both the event_based_risk calculator and the ebrisk calculator are capable of this aggregation, but the first calculator is more efficient.

The asset loss table and the agg_loss_table

When performing an event-based risk (or ebrisk) calculation the engine keeps in memory a table with the losses for each asset and each event, for each loss type. It is usually impossible to fully store such a table, because it is extremely large; for instance, for 1 million assets, 1 million events, 2 loss types and 4 bytes per loss, ~8 TB of disk space would be required. It is true that many events will produce zero losses because of the maximum_distance and minimum_intensity parameters, but still the asset loss table is prohibitively large and for many years could not be stored. In engine 3.8 we made a breakthrough: we decided to store a partial asset loss table, obtained by discarding small losses, leveraging the fact that loss curves for long enough return periods are dominated by extreme events, i.e. there is no point in saving all the small losses.
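The arithmetic behind that estimate is straightforward:

num_assets = 1_000_000
num_events = 1_000_000
num_loss_types = 2
bytes_per_loss = 4
size_tb = num_assets * num_events * num_loss_types * bytes_per_loss / 1e12
print(f"~{size_tb:.0f} TB")  # ~8 TB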

To keep the stored table manageable, the engine honors a parameter called minimum_asset_loss, which determines which losses are discarded when storing the asset loss table. The rule is simple: losses below minimum_asset_loss are discarded. By choosing the threshold properly, in an ideal world

  1. the vast majority of the losses would be discarded, thus making the asset loss table storable;
  2. the loss curves would still be nearly identical to the ones without discarding any loss, except for small return periods.

It is the job of the user to verify whether 1 and 2 are true in the real world. This can be assessed by playing with the minimum_asset_loss in a small calculation, finding a good value for it, and then extending it to the large calculation. Clearly it is a matter of compromise: by sacrificing precision it is possible to reduce enormously the size of the stored asset loss table and to make an impossible calculation possible.

NB: starting from engine 3.11 the asset loss table is stored if the user specifies

aggregate_by = id

in the job.ini file. In large calculations it is extremely easy to run out of memory or to make the calculation extremely slow, so we recommend not storing the asset loss table. The functionality is there for the sole purpose of debugging small calculations, for instance to see the effect of the minimum_asset_loss approximation at the asset level.

For large calculations usually one is interested in the aggregate loss table, which contains the losses per event and per aggregation tag (or multi-tag). For instance, the tag occupancy has the three values “Residential”, “Industrial” and “Commercial”, and by setting

aggregate_by = occupancy

the engine will store a pandas DataFrame with a field agg_id with 4 possible values: 0 for “Residential”, 1 for “Industrial”, 2 for “Commercial” and 3 for the full aggregation.
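As a purely illustrative sketch of what such a table looks like (the agg_id encoding follows the description above, while the other column names and the numbers are invented for the example and are not necessarily the engine's exact schema), consider:

import pandas as pd

agg_keys = {0: 'Residential', 1: 'Industrial', 2: 'Commercial', 3: 'total'}
df = pd.DataFrame({'event_id': [0, 0, 0, 0, 1, 1],
                   'agg_id':   [0, 1, 2, 3, 0, 3],
                   'loss':     [1.2e6, 3.4e5, 8.9e5, 2.43e6, 5.1e5, 5.1e5]})
print(df.assign(occupancy=df['agg_id'].map(agg_keys)))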

NB: if the parameter aggregate_by is not specified, the engine will still compute the agg_loss_table but then the agg_id field will have a single value 0 corresponding to the total portfolio losses.

The Probable Maximum Loss (PML) and the loss curves

Given an effective investigation time, a return period and an agg_loss_table, the engine is able to compute a PML for each aggregation tag. It does so by using the function openquake.risklib.scientific.losses_by_period, which takes as input an array of cumulative losses associated with the aggregation tag, a list of return periods, and the effective investigation time. If there is a single return period the function returns the PML; if there are multiple return periods it returns the loss curve. The two concepts are essentially the same thing, since a loss curve is just an array of PMLs, one for each return period.
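For instance, given an illustrative array of 16 losses (the values below are arbitrary; since 500 years is exactly eff_time/2, the algorithm described below returns the second-largest loss), the call

>>> from openquake.risklib.scientific import losses_by_period
>>> losses = [1, 2, 2, 3, 3, 3.5, 4, 4, 5, 6, 7, 8, 9, 11, 13, 23]  # arbitrary values
>>> losses_by_period(losses, [500], eff_time=1000)
array([13.])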

computes the Probable Maximum Loss at 500 years for the given losses with an effective investigation time of 1000 years. The algorithm works by ordering the losses (suppose there are E > 1 losses), generating E time periods eff_time/E, eff_time/(E-1), ..., eff_time/1, and log-interpolating the loss at the return period. Of course this works only if the condition

eff_time/E < return_period < eff_time

is respected. In this example there are E=16 losses, so the return period must be in the range 62.5 .. 1000 years. If the return period is too small, the PML will be zero:

>>> losses_by_period(losses, [50], eff_time=1000)
array([0.])

while if the return period is outside the investigation range we will refuse the temptation to extrapolate and we will return NaN instead:

>>> losses_by_period(losses, [1500], eff_time=1000)
array([nan])

The rules above are the reason why you will sometimes see zeros or NaNs in the loss curves generated by the engine, especially when there are too few events: the valid range will be small and some return periods may slip outside the range.

In order to compute aggregate loss curves you must set the aggregate_by parameter in the job.ini to one or more tags over which you wish to perform the aggregation. Your exposure must contain the specified tags with values for each asset. We have an example for Nepal in our event-based risk demo. The exposure for this demo contains various tags and in particular a geographic tag called NAME_1 with values “Mid-Western”, “Far-Western”, “West”, “East”, “Central”, and the job_eb.ini file defines

aggregate_by = NAME_1

When running the calculation you will see something like this:

 Calculation 1 finished correctly in 17 seconds
id | name
 9 | Aggregate Event Losses
 1 | Aggregate Loss Curves
 2 | Aggregate Loss Curves Statistics
 3 | Aggregate Losses
 4 | Aggregate Losses Statistics
 5 | Average Asset Losses Statistics
11 | Earthquake Ruptures
 6 | Events
 7 | Full Report
 8 | Input Files
10 | Realizations
12 | Total Loss Curves
13 | Total Loss Curves Statistics
14 | Total Losses
15 | Total Losses Statistics

Exporting the Aggregate Loss Curves Statistics output will give you the mean and quantile loss curves in a format like the following one:

annual_frequency_of_exceedence,return_period,loss_type,loss_value,loss_ratio
5.00000E-01,2,nonstructural,0.00000E+00,0.00000E+00
5.00000E-01,2,structural,0.00000E+00,0.00000E+00
2.00000E-01,5,nonstructural,0.00000E+00,0.00000E+00
2.00000E-01,5,structural,0.00000E+00,0.00000E+00
1.00000E-01,10,nonstructural,0.00000E+00,0.00000E+00
1.00000E-01,10,structural,0.00000E+00,0.00000E+00
5.00000E-02,20,nonstructural,0.00000E+00,0.00000E+00
5.00000E-02,20,structural,0.00000E+00,0.00000E+00
2.00000E-02,50,nonstructural,0.00000E+00,0.00000E+00
2.00000E-02,50,structural,0.00000E+00,0.00000E+00
1.00000E-02,100,nonstructural,0.00000E+00,0.00000E+00
1.00000E-02,100,structural,0.00000E+00,0.00000E+00
5.00000E-03,200,nonstructural,1.35279E+05,1.26664E-06
5.00000E-03,200,structural,2.36901E+05,9.02027E-03
2.00000E-03,500,nonstructural,1.74918E+06,1.63779E-05
2.00000E-03,500,structural,2.99670E+06,1.14103E-01
1.00000E-03,1000,nonstructural,6.92401E+06,6.48308E-05
1.00000E-03,1000,structural,1.15148E+07,4.38439E-01

If you do not set the aggregate_by parameter you will still be able to compute the total loss curve (for the entire portfolio of assets), and the total average losses.