Event Based ======================================== Scenario risk calculations usually do not pose a performance problem, since they involve a single rupture and a limited geography for analysis. Some event-based risk calculations, however, may involve millions of ruptures and exposures spanning entire countries or even larger regions. This section offers some practical tips for running large event based risk calculations, especially ones involving large logic trees, and proposes techniques that might be used to make an intractable calculation tractable. Understanding the hazard ------------------------ Event-based calculations are typically dominated by the hazard component (unless there are lots of assets aggregated on few hazard sites) and therefore the first thing to do is to estimate the size of the hazard, i.e. the number of GMFs that will be produced. Since we are talking about a large calculation, first of all we need reduce it to a size that is guaranteed to run quickly. The simplest way to do that is to reduce the parameters directly affecting the number of ruptures generated, i.e. - investigation_time - ses_per_logic_tree_path - number_of_logic_tree_samples For instance, if you have ses_per_logic_tree_path = 10,000 reduce it to 10, run the calculation and you will see in the log something like this:: [2018-12-16 09:09:57,689 #35263 INFO] Received {'gmfdata': '752.18 MB', 'hcurves': '224 B', 'indices': '29.42 MB'} The amount of GMFs generated for the reduced calculation is 752.18 MB; and since the calculation has been reduced by a factor of 1,000, the full computation is likely to generate around 750 GB of GMFs. Even if you have sufficient disk space to store this large quantity of GMFs, most likely you will run out of memory. Even if the hazard part of the calculation manages to run to completion, the risk part of the calculation is very likely to fail — managing 750 GB of GMFs is beyond the current capabilities of the engine. Thus, you will have to find ways to reduce the size of the computation. A good start would be to carefully set the parameters minimum_magnitude and minimum_intensity: - minimum_magnitude is a scalar or a dictionary keyed by tectonic region; the engine will discard ruptures with magnitudes below the given threshoulds - minimum_intensity is a scalar or a dictionary keyed by the intensity measure type; the engine will discard GMFs below the given intensity threshoulds Choosing reasonable cutoff thresholds with these parameters can significantly reduce the size of your computation when there are a large number of small magnitude ruptures or low intensity GMFs being generated, which may have a negligible impact on the damage or losses, and thus could be safely discarded. region_grid_spacing --------------------- In our experience, the most common error made by out users is to compute the hazard at the sites of the exposure. The issue is that it very possible to have exposures with millions of assets on millions of distinct hazard sites. Computing the GMFs for millions of sites is hard or even impossible (there is a limit of 4 billion rows on the size of the GMF table in the datastore). Even in the cases when computing the hazard is possible, then computing the risk starting from an extremely large amount of GMFs will likely be impossible, due to memory/runtime constraints. The second most common error is to use an extremely fine grid for the site model. Remember that if you have a resolution of 250 meters, a square of 250 km x 250 km will contain one million sites, which is definitely too much. The engine when designed when the site models had resolutions around 5-10 km, i.e. of the same order of the hazard grid, while nowadays the vs30 fields have a much larger resolution. Both problems can be solved in a simple way by specifying the region_grid_spacing parameter. Make it large enough that the resulting number of sites becomes reasonable and you are done. You will loose some precision, but that is preferable to not being able to run the calculation. You will need to run a sensitivity analysis with different values of region_grid_spacing parameter to make sure that you get consistent results, but that's it. Once a region_grid_spacing is specified, the engine computes the convex hull of the exposure sites and builds a grid of hazard sites, associating the site parameters from the closest site in the site model and discarding sites in region where there are no assets (i.e. more distant than region_grid_spacing * sqrt(2)). The precise logic is encoded in the function openquake.commonlib.readinput.get_sitecol_assetcol, if you want to know the nitty-gritty details. Our recommendation is to use the command oq prepare_site_model to apply such logic before starting a calculation and thus producing a custom site model file tailored to your exposure (see the section :ref:prepare_site_model). Collapsing of branches ---------------------- When one is not interested in the uncertainty around the loss estimates and cares more about the mean estimates, all of the source model branches can be "collapsed" into one branch. Using the collapsed source model should yield the same mean hazard or loss estimates as using the full source model logic tree and then computing the weighted mean of the individual branch results. Similarly, the GMPE logic tree for each tectonic region can also be "collapsed" into a single branch. Using a single collapsed GMPE for each TRT should also yield the same mean hazard estimates as using the full GMPE logic tree and then computing the weighted mean of the individual branch results. This has become possible through the introduction of AvgGMPE feature _ in version 3.9. Using collect_rlzs=true in the risk calculation --------------------------------------------------- Since version 3.12 the engine recognizes a flag collect_rlzs in the risk configuration file, which by default is false. When the flag is set to true, then the hazard realizations are collected together when computing the risk results and considered as one. This is possible only when the weights of the realizations are all equal, otherwise the engine raises an error. Collecting the realizations makes the calculation of the losses and loss curves much faster and more memory efficient. It is the recommended way to proceed when you are interested only in mean results. Note 1: when using sampling, collect_rlzs is implicitly set to True, so if you want to export the individual results per realization you must set explictly collect_rlzs=false. Note 2: collect_rlzs is not the inverse of the individual rlsz flag. The two flags are completely independent, one refers to risk and the other to hazard calculations. Note 3: collect_rlzs is completely ignored in the hazard part of the calculation, i.e. it does not affect at all the computation of the GMFs, only the computation of the risk curves. Splitting the calculation into subregions ----------------------------------------- If one is interested in propagating the full uncertainty in the source models or ground motion models to the hazard or loss estimates, collapsing the logic trees into a single branch to reduce computational expense is not an option. But before going through the effort of trimming the logic trees, there is an interim step that must be explored, at least for large regions like the entire continental United States. This step is to geographically divide the large region into logical smaller subregions, such that the contribution to the hazard or losses in one subregion from the other subregions is negligibly small or even zero. The effective realizations in each of the subregions will then be much fewer than when trying to cover the entire large region in a single calculation. Trimming of the logic-trees or sampling of the branches ------------------------------------------------------- Trimming or sampling may be necessary if the following two conditions hold: 1. You are interested in propagating the full uncertainty to the hazard and loss estimates; only the mean or quantile results are not sufficient for your analysis requirements, AND 2. The region of interest cannot be logically divided further as described above; the logic-tree for your chosen region of interest still leads to a very large number of effective realizations. Sampling is the easier of the two options now. You only need to ensure that you sample a sufficient number of branches to capture the underlying distribution of the hazard or loss results you are interested in. The drawback of random sampling is that you may still need to sample hundreds of branches to capture well the underlying distribution of the results. Trimming can be much more efficient than sampling, because you pick a few branches such that the distribution of the hazard or loss results obtained from a full-enumeration of these branches is nearly the same as the distribution of the hazard or loss results obtained from a full-enumeration of the entire logic-tree. ignore_covs vs ignore_master_seed ---------------------------------------------------------------- The vulnerability functions using continuous distributions (lognormal/beta) to characterize the uncertainty in the loss ratio, specify the mean loss ratios and the corresponding coefficients of variation for a set of intensity levels. There is clearly a performance/memory penalty associated with the propagation of uncertainty in the vulnerability to losses. You can completely remove it by setting ignore_covs = true in the job.ini file. Then the engine would compute just the mean loss ratios by ignoring the uncertainty i.e. the coefficients of variation. Since engine 3.12 there is a better solution: setting ignore_master_seed = true in the job.ini file. Then the engine will compute the mean loss ratios but also store information about the uncertainty of the results in the asset loss table, in the column "variance", by using the formulae .. math:: variance &= \Sigma_i \sigma_i^2 \ for\ asset\_correl=0 \\ variance &= (\Sigma_i \sigma_i)^2 \ for\ asset\_correl=1 in terms of the variance of each asset for the event and intensity level in consideration, extracted from the asset loss and the coefficients of variation. People interested in the details should look at the implementation in https://github.com/gem/oq-engine/blob/engine-3.15/openquake/risklib/scientific.py. The asset loss table ==================== When performing an event based risk calculation the engine keeps in memory a table with the losses for each asset and each event, for each loss type. It is usually impossible to fully store such table, because it is extremely large; for instance, for 1 million assets, 1 million events, 2 loss types and 4 bytes per loss ~8 TB of disk space would be required. It is true that many events will produce zero losses because of the maximum_distance and minimum_intensity parameters, but still the asset loss table is prohibitively large and for many years could not be stored. In engine 3.8 we made a breakthrough: we decided to store a partial asset loss table, obtained by discarding small losses, by leveraging on the fact that loss curves for long enough return periods are dominated by extreme events, i.e. there is no point in saving all the small losses. To that aim,the engine honors a parameter called minimum_asset_loss which determine how many losses are discarded when storing the asset loss table. The rule is simple: losses below minimum_asset_loss are discarded. By choosing the threshold properly in an ideal world 1. the vast majority of the losses would be discarded, thus making the asset loss table storable; 2. the loss curves would still be nearly identical to the ones without discarding any loss, except for small return periods. It is the job of the user to verify if 1 and 2 are true in the real world. He can assess that by playing with the minimum_asset_loss in a small calculation, finding a good value for it, and then extending to the large calculation. Clearly it is a matter of compromise: by sacrificing precision it is possible to reduce enourmously the size of the stored asset loss table and to make an impossible calculation possible. Starting from engine 3.11 the asset loss table is stored if the user specifies aggregate_by = id in the job.ini file. In large calculations it extremely easy to run out of memory or the make the calculation extremely slow, so we recommend not to store the asset loss table. The functionality is there for the sole purpose of debugging small calculations, for instance to see the effect of the minimum_asset_loss approximation at the asset level. For large calculations usually one is interested in the aggregate loss table, which contains the losses per event and per aggregation tag (or multi-tag). For instance, the tag occupancy has the three values "Residential", "Industrial" and "Commercial" and by setting aggregate_by = occupancy the engine will store a pandas DataFrame called risk_by_event with a field agg_id with 4 possible value: 0 for "Residential", 1 for "Industrial", 2 for "Commercial" and 3 for the full aggregation. NB: if the parameter aggregate_by is not specified, the engine will still compute the aggregate loss table but then the agg_id field will have a single value 0 corresponding to the total portfolio losses. The Probable Maximum Loss (PML) and the loss curves --------------------------------------------------- Given an effective investigation time and a return period, the engine is able to compute a PML for each aggregation tag. It does so by using the function openquake.risklib.scientific.losses_by_period which takes in input an array of cumulative losses associated to the aggregation tag, a list of or return periods, and the effective investigation time. If there is a single return period the function returns the PML; if there are multiple return periods it returns the loss curve. The two concepts are essentially the same thing, since a loss curve is just an array of PMLs, one for each return period. For instance .. code-block:: python >>> from openquake.risklib.scientific import losses_by_period >>> losses = [3, 2, 3.5, 4, 3, 23, 11, 2, 1, 4, 5, 7, 8, 9, 13, 0] >>> [PML_500y] = losses_by_period(losses, [500], eff_time=1000) >>> PML_500y 13.0 computes the Probably Maximum Loss at 500 years for the given losses with an effective investigation time of 1000 years. The algorithm works by ordering the losses (suppose there are E > 1 losses) generating E time periods eff_time/E, eff_time/(E-1), ... eff_time/1 and log-interpolating the loss at the return period. Of course this works only if the condition eff_time/E < return_period < eff_time is respected. In this example there are E=16 losses, so the return period must be in the range 62.5 .. 1000 years. If the return period is too small the PML will be zero >>> losses_by_period(losses, [50], eff_time=1000) array([0.]) while if the return period is outside the investigation range we will refuse the temptation to extrapolate and we will return NaN instead: >>> losses_by_period(losses, [1500], eff_time=1000) array([nan]) The rules above are the reason while you will see zeros or NaNs in the loss curves generated by the engine sometimes, especially when there are too few events: the valid range will be small and some return periods may slip outside the range. In order to compute aggregate loss curves you must set the aggregate_by parameter in the job.ini to one or more tags over which you wish to perform the aggregation. Your exposure must contain the specified tags with values for each asset. We have an example for Nepal in our event based risk demo. The exposure for this demo contains various tags and in particular a geographic tag called NAME1 with values "Mid-Western", "Far-Western", "West", "East", "Central", and the job_eb.ini file defines aggregate_by = NAME_1 When running the calculation you will see something like this:: Calculation 1 finished correctly in 17 seconds id | name 9 | Aggregate Event Losses 1 | Aggregate Loss Curves 2 | Aggregate Loss Curves Statistics 3 | Aggregate Losses 4 | Aggregate Losses Statistics 5 | Average Asset Losses Statistics 11 | Earthquake Ruptures 6 | Events 7 | Full Report 8 | Input Files 10 | Realizations 12 | Total Loss Curves 13 | Total Loss Curves Statistics 14 | Total Losses 15 | Total Losses Statistics Exporting the *Aggregate Loss Curves Statistics* output will give you the mean and quantile loss curves in a format like the following one:: annual_frequency_of_exceedence,return_period,loss_type,loss_value,loss_ratio 5.00000E-01,2,nonstructural,0.00000E+00,0.00000E+00 5.00000E-01,2,structural,0.00000E+00,0.00000E+00 2.00000E-01,5,nonstructural,0.00000E+00,0.00000E+00 2.00000E-01,5,structural,0.00000E+00,0.00000E+00 1.00000E-01,10,nonstructural,0.00000E+00,0.00000E+00 1.00000E-01,10,structural,0.00000E+00,0.00000E+00 5.00000E-02,20,nonstructural,0.00000E+00,0.00000E+00 5.00000E-02,20,structural,0.00000E+00,0.00000E+00 2.00000E-02,50,nonstructural,0.00000E+00,0.00000E+00 2.00000E-02,50,structural,0.00000E+00,0.00000E+00 1.00000E-02,100,nonstructural,0.00000E+00,0.00000E+00 1.00000E-02,100,structural,0.00000E+00,0.00000E+00 5.00000E-03,200,nonstructural,1.35279E+05,1.26664E-06 5.00000E-03,200,structural,2.36901E+05,9.02027E-03 2.00000E-03,500,nonstructural,1.74918E+06,1.63779E-05 2.00000E-03,500,structural,2.99670E+06,1.14103E-01 1.00000E-03,1000,nonstructural,6.92401E+06,6.48308E-05 1.00000E-03,1000,structural,1.15148E+07,4.38439E-01 If you do not set the aggregate_by parameter you will still able to compute the total loss curve (for the entire portfolio of assets), and the total average losses. Aggregating by multiple tags ---------------------------- The engine also supports aggregation my multiple tags. For instance the second event based risk demo (the file job_eb.ini) has a line aggregate_by = NAME_1, taxonomy and it is able to aggregate both on geographic region (NAME_1) and on taxonomy. There are 25 possible combinations, that you can see with the command:: $oq show agg_keys | NAME_1_ | taxonomy_ | NAME_1 | taxonomy | +---------+-----------+-------------+----------------------------+ | 1 | 1 | Mid-Western | Wood | | 1 | 2 | Mid-Western | Adobe | | 1 | 3 | Mid-Western | Stone-Masonry | | 1 | 4 | Mid-Western | Unreinforced-Brick-Masonry | | 1 | 5 | Mid-Western | Concrete | | 2 | 1 | Far-Western | Wood | | 2 | 2 | Far-Western | Adobe | | 2 | 3 | Far-Western | Stone-Masonry | | 2 | 4 | Far-Western | Unreinforced-Brick-Masonry | | 2 | 5 | Far-Western | Concrete | | 3 | 1 | West | Wood | | 3 | 2 | West | Adobe | | 3 | 3 | West | Stone-Masonry | | 3 | 4 | West | Unreinforced-Brick-Masonry | | 3 | 5 | West | Concrete | | 4 | 1 | East | Wood | | 4 | 2 | East | Adobe | | 4 | 3 | East | Stone-Masonry | | 4 | 4 | East | Unreinforced-Brick-Masonry | | 4 | 5 | East | Concrete | | 5 | 1 | Central | Wood | | 5 | 2 | Central | Adobe | | 5 | 3 | Central | Stone-Masonry | | 5 | 4 | Central | Unreinforced-Brick-Masonry | | 5 | 5 | Central | Concrete | The lines in this table are associated to the *generalized aggregation ID*, agg_id which is an index going from 0 (meaning aggregate assets with NAME_1=*Mid-Western* and taxonomy=*Wood*) to 24 (meaning aggregate assets with NAME_1=*Mid-Western* and taxonomy=*Wood*); moreover agg_id=25 means full aggregation. The agg_id field enters in risk_by_event and in outputs like the aggregate losses; for instance::$ oq show agg_losses-rlzs | agg_id | rlz | loss_type | value | +--------+-----+---------------+-------------+ | 0 | 0 | nonstructural | 2_327_008 | | 0 | 0 | structural | 937_852 | +--------+-----+---------------+-------------+ | ... + ... + ... + ... + +--------+-----+---------------+-------------+ | 25 | 1 | nonstructural | 100_199_448 | | 25 | 1 | structural | 157_885_648 | The exporter (oq export agg_losses-rlzs) converts back the agg_id to the proper combination of tags; agg_id=25, i.e. full aggregation, is replaced with the string *total*. By knowing the number of events, the number of aggregation keys and the number of loss types, it is possible to give an upper limit to the size of risk_by_event. In the demo there are 1703 events, 26 aggregation keys and 2 loss types, so risk_by_event contains at most 1703 * 26 * 2 = 88,556 rows This is an upper limit, since some combination can produce zero losses and are not stored, especially if the minimum_asset_loss feature is used. In the case of the demo actually only 20,877 rows are nonzero:: $oq show risk_by_event event_id agg_id loss_id loss variance ... [20877 rows x 5 columns] Rupture sampling: how does it work? =================================== In this section we explain how the sampling of ruptures in event based calculations works, at least for the case of Poissonian sources. As an example, consider the following point source: >>> from openquake.hazardlib import nrml >>> src = nrml.get('''\ ... ... ... 179.5 0 ... 0 ... 10 ... ... WC1994 ... 1.5 ... ... ... ... ... ... ... ... ''', investigation_time=1, width_of_mfd_bin=1.0) The source here is particularly simple, with only one seismogenic depth and one nodal plane. It generates two ruptures, because with a width_of_mfd_bin of 1 there are only two magnitudes in the range from 5 to 7: >>> [(mag1, rate1), (mag2, rate2)] = src.get_annual_occurrence_rates() >>> mag1 5.5 >>> mag2 6.5 The occurrence rates are respectively 0.009 and 0.0009. So, if we set the number of stochastic event sets to 1,000,000 >>> num_ses = 1_000_000 we would expect the first rupture (the one with magnitude 5.5) to occur around 9,000 times and the second rupture (the one with magnitude 6.5) to occur around 900 times. Clearly the exact numbers will depend on the stochastic seed; if we set >>> import numpy.random >>> numpy.random.seed(42) then we will have (for investigation_time = 1) >>> numpy.random.poisson(rate1 * num_ses * 1) 8966 >>> numpy.random.poisson(rate2 * num_ses * 1) 921 These are the number of occurrences of each rupture in the effective investigation time, i.e. the investigation time multiplied by the number of stochastic event sets and the number of realizations (here we assumed 1 realization). The total number of events generated by the source will be number_of_events = sum(n_occ for each rupture) i.e. 8,966 + 921 = 9,887, with ~91% of the events associated to the first rupture and ~9% of the events associated to the second rupture. Since the details of the seed algorithm can change with updates to the the engine, if you run an event based calculation with the same parameters with different versions of the engine, you may not get exactly the same number of events, but something close given a reasonably long effective investigation time. After running the calculation, inside the datastore, in the ruptures dataset you will find the two ruptures, their occurrence rates and their integer number of occurrences (n_occ). If the effective investigation time is large enough the relation n_occ ~ occurrence_rate * eff_investigation_time will hold. If the effective investigation time is not large enough, or the occurrence rate is extremely small, then you should expect to see larger differences between the expected number of occurrences and n_occ, as well as a strong seed dependency. It is important to notice than in order to determine the effective investigation time, the engine takes into account also the ground motion logic tree and the correct formula to use is eff_investigation_time = investigation_time * num_ses * num_rlzs where num_rlzs is the number of realizations in the ground motion logic tree. Just to be concrete, if you run a calculation with the same parameters as described before, but with two GMPEs instead of one (and number_of_logic_tree_samples = 0), then the total number of paths admitted by the logic tree will be 2 and you should expect to get about twice the number of occurrences for each rupture. Users wanting to know the nitty-gritty details should look at the code, inside hazardlib/source/base.py, in the method src.sample_ruptures(eff_num_ses, ses_seed). The case of multiple tectonic region types and realizations ----------------------------------------------------------- Since engine 3.13 hazardlib contains some helper functions that allow users to compute stochastic event sets manually. Such functions are in the module openquake.hazardlib.calc.stochastic. Internally, the engine does not use directly such functions, since it needs to follow a slightly more complex logic in order to make the calculations parallelizable. Also, the engine is able to manage general source model logic trees, while the helper functions are meant to work in a situation with a single source model and a trivial source model logic tree. However, in spirit, the idea is the same. As a concrete example, consider the event based logic tree demo which is part of the engine distribution (search for demos/hazard/EventBasedPSHA). This is a case with a trivial source model logic tree, a source model with two tectonic region types and a GSIM logic tree generating 2x2 = 4 realizations with weights .36, .24, .24, .16 respectively. The effective investigation time is eff_time = 50 years x 250 ses x 4 rlz = 50,000 years You can sample the ruptures with the following commands, assuming you are inside the demo directory:: >> from openquake.hazardlib.contexts import ContextMaker >> from openquake.commonlib import readinput >> from openquake.hazardlib.calc.stochastic import sample_ebruptures >> oq = readinput.get_oqparam('job.ini') >> gsim_lt = readinput.get_gsim_lt(oq) >> csm = readinput.get_composite_source_model(oq) >> rlzs_by_gsim_trt = gsim_lt.get_rlzs_by_gsim_trt( .. oq.number_of_logic_tree_samples, oq.random_seed) >> cmakerdict = {trt: ContextMaker(trt, rbg, vars(oq)) .. for trt, rbg in rlzs_by_gsim_trt.items()} >> ebruptures = sample_ebruptures(csm.src_groups, cmakerdict) Then you can extract the events associated to the ruptures with the function get_ebr_df which returns a DataFrame:: >> from openquake.hazardlib.calc.stochastic import get_ebr_df >> ebr_df = get_ebr_df(ebruptures, cmakerdict) This DataFrame has fields eid (event ID) and rlz (realization number) and it is indexed by the ordinal of the rupture. For instance it can be used to determine the number of events per realization:: >> ebr_df.groupby('rlz').count() eid rlz 0 7842 1 7709 2 7893 3 7856 Notice that the number of events is more or less the same for each realization. This is a general fact, valid also in the case of sampling, a consequence of the random algorithm used to associate the events to the realizations. The difference between full enumeration and sampling -------------------------------------------------------------- Users are often confused about the difference between full enumeration and sampling. For this reason the engine distribution comes with a pedagogical example that considers an extremely simplified situation comprising a single site, a single rupture, and only two GMPEs. You can find the example in the engine repository under the directory openquake/qa_tests_data/event_based/case_3. If you look at the ground motion logic tree file, the two GMPEs are AkkarBommer2010 (with weight 0.9) and SadighEtAl1997 (with weight 0.1). The parameters in the job.ini are:: investigation_time = 1 ses_per_logic_tree_path = 5_000 number_of_logic_tree_paths = 0 Since there are 2 realizations, the effective investigation time is 10,000 years. If you run the calculation, you will generate (at least with version 3.13 of the engine, though the details may change with the version) 10,121 events, since the occurrence rate of the rupture was chosen to be 1. Roughly half of the events will be associated with the first GMPE (AkkarBommer2010) and half with the second GMPE (SadighEtAl1997). Actually, if you look at the test, the precise numbers will be 5,191 and 4,930 events, i.e. 51% and 49% rather than 50% and 50%, but this is expected and by increasing the investigation time you can get closer to the ideal equipartition. Therefore, even if the AkkarBommer2010 GMPE is assigned a relative weight that is 9 times greater than SadighEtAl1997, *this is not reflected in the simulated event set*. It means that when performing a computation (for instance to compute the mean ground motion field, or the average loss) one has to keep the two realizations distinct, and only at the end to perform the weighted average. The situation is the opposite when sampling is used. In order to get the same effective investigation time of 10,000 years you should change the parameters in the job.ini to:: investigation_time = 1 ses_per_logic_tree_path = 1 number_of_logic_tree_paths = 10_000 Now there are 10,000 realizations, not 2, and they *all have the same weight .0001*. The number of events per realization is still roughly constant (around 1) and there are still 10,121 events, however now *the original weights are reflected in the event set*. In particular there are 9,130 events associated to the AkkarBommer2010 GMPE and 991 events associated to the SadighEtAl1997 GMPE. There is no need to keep the realizations separated: since they have all the same weigths, you can trivially compute average quantities. AkkarBommer2010 will count more than SadighEtAl1997 simply because there are 9 times more events for it (actually 9130/991 = 9.2, but the rate will tend to 9 when the effective time will tend to infinity). NB: just to be clear, normally realizations are not in one-to-one correspondence with GMPEs. In this example, it is true because there is a single tectonic region type. However, usually there are multiple tectonic region types, and a realization is associated to a tuple of GMPEs. Extra tips specific to event based calculations =============================================== Event based calculations differ from classical calculations because they produce visible ruptures, which can be exported and made accessible to the user. In classical calculations, instead, the underlying ruptures only live in memory and are normally not saved in the datastore, nor are exportable. The limitation is fundamentally a technical one: in the case of an event based calculation only a small fraction of the ruptures contained in a source are actually generated, so it is possible to store them. In a classical calculation *all* ruptures are generated and there are so many millions of them that it is impractical to save them, unless there are very few sites. For this reason they live in memory, they are used to produce the hazard curves and immediately discarded right after. The exception if for the case of few sites, i.e. if the number of sites is less than the parameter max_sites_disagg which by default is 10. Sampling of the logic tree ---------------------------------------------------- There are real life examples of very large logic trees, like the model for South Africa which features 3,194,799,993,706,229,268,480 branches. In such situations it is impossible to perform a computation with full enumeration. However, the engine allows to sample the branches of the complete logic tree. More precisely, for each branch sampled from the source model logic tree, a branch of the GMPE logic tree is chosen randomly, by taking into account the weights in the GMPE logic tree file. It should be noticed that even if source model path is sampled several times, the model is parsed and sent to the workers *only once*. In particular if there is a single source model (like for South America) and number_of_logic_tree_samples =100, we generate effectively 1 source model realization and not 100 equivalent source model realizations, as we did in past (actually in the engine version 1.3). The engine keeps track of how many times a model has been sampled (say Ns) and in the event based case it produce ruptures (*with different seeds*) by calling the appropriate hazardlib function Ns times. This is done inside the worker nodes. In the classical case, all the ruptures are identical and there are no seeds, so the computation is done only once, in an efficient way. Convergency of the GMFs for non-trivial logic trees --------------------------------------------------------------------------- In theory, the hazard curves produced by an event based calculation should converge to the curves produced by an equivalent classical calculation. In practice, if the parameters number_of_logic_tree_samples and ses_per_logic_tree_path (the product of them is the relevant one) are not large enough they may be different. The engine is able to compare the mean hazard curves and to see how well they converge. This is done automatically if the option mean_hazard_curves = true is set. Here is an example of how to generate and plot the curves for one of our QA tests (a case with bad convergence was chosen on purpose)::$ oq engine --run event_based/case_7/job.ini WARNING:root:Relative difference with the classical mean curves for IMT=SA(0.1): 51% WARNING:root:Relative difference with the classical mean curves for IMT=PGA: 49% $oq plot /tmp/cl/hazard.pik /tmp/hazard.pik --sites=0,1,2 .. image:: ebcl-convergency.png The relative difference between the classical and event based curves is computed by computing the relative difference between each point of the curves for each curve, and by taking the maximum, at least for probabilities of exceedence larger than 1% (for low values of the probability the convergency may be bad). For the details I suggest you to look at the code. The concept of "mean" ground motion field ============================================ The engine has at least three different kinds of *mean ground motion field*, computed differently and used in different situations: 1. *Mean ground motion field by GMPE*, used to reduce disk space and make risk calculations faster. 2. *Mean ground motion field by event*, used for debugging/plotting purposes. 3. *Single-rupture hazardlib mean ground motion field*, used for analysis/plotting purposes. Mean ground motion field by GMPE -------------------------------- This is the most useful concept for people doing risk calculations. To be concrete, suppose you are running a scenario_risk calculation on a region where you have a very fine site model (say at 1 km resolution) and a sophisticated hazard model (say with 16 different GMPEs): then you can easily end up with a pretty large calculation. For instance one of our users was doing such a calculation with an exposure of 1.2 million assets, 50,000+ hazard sites, 5 intensity measure levels and 1000 simulations, corresponding to 16,000 events given that there are 16 GMPEs. Given that each ground motion value needs 4 bytes to be stored as a 32 bit float, the math tells us that such calculation will generate 50000 x 16000 x 5 x 4 ~ 15 GB of data (it could be a but less by using the minimum_intensity feature, but you get the order of magnitude). This is very little for the engine that can store such an amount of data in less than 1 minute, but it is a huge amount of data for a database. If you a (re)insurance company and your workflow requires ingesting the GMFs in a database to compute the financial losses, that's a big issue. The engine could compute the hazard in just an hour, but the risk part could easily take 8 days. This is a no-go for most companies. They have deadlines and cannot way 8 days to perform a single analysis. At the end they are interested only in the mean losses, so they would like to have a single effective mean field producing something close to the mean losses that more correctly would be obtained by considering all 16 realizations. With a single effective realization the data storage would drop under 1 GB and more significantly the financial model software would complete the calculation in 12 hours instead of 8 days, something a lot more reasonable. For this kind of situations hazardlib provides an AvgGMPE class, that allows to replace a set of GMPEs with a single effective GMPE. More specifically, the method AvgGMPE.get_means_and_stddevs calls the methods .get_means_and_stddevs on the underlying GMPEs and performs a weighted average of the means and a weighted average of the variances using the usual formulas: .. math:: \mu &= \Sigma_i w_i \mu_i \\ \sigma^2 &= \Sigma_i w_i (\sigma_i)^2 where the weights sum up to 1. It is up to the user to check how big is the difference in the risk between the complete calculation and the mean field calculation. A factor of 2 discrepancies would not be surprising, but we have also seen situations where there is no difference within the uncertainty due to the random seed choice. Mean ground motion field by event --------------------------------- Using the AvgGMPE trick does not solve the issue of visualizing the ground motion fields, since for each site there are still 1000 events. A plotting tool has still to download 1 GB of data and then one has to decide which event to plot. The situation is the same if you are doing a sensitivity analysis, i.e. you are changing some parameter (it could be a parameter of the underlying rupture, or even the random seed) and you are studying how the ground motion fields change. It is hard to compare two sets of data of 1 GB each. Instead, it is a lot easier to define a "mean" ground motion field obtained by averaging on the events and then compare the mean fields of the two calculations: if they are very different, it is clear that the calculation is very sensitive to the parameter being studied. Still, the tool performing the comparison will need to consider 1000 times less data and will be 1000 times faster, also downloding 1000 times less data from the remote server where the calculation has been performed. For this kind of analysis the engine provides an internal output avg_gmf that can be plotted with the command oq plot avg_gmf . It is also possible to compare two calculations with the command $ oq compare avg_gmf imt  Since avg_gmf is meant for internal usage and for debugging it is not exported by default and it is not visible in the WebUI. It is also not guaranteed to stay the same across engine versions. It is available starting from version 3.11. It should be noted that, consistently with how the AvgGMPE works, the avg_gmf output *is computed in log space*, i.e. it is geometric mean, not the usual mean. If the distribution was exactly lognormal that would also coincide with the median field. However, you should remember that in order to reduce the data transfer and to save disk space the engine discards ground motion values below a certain minimum intensity, determined explicitly by the user or inferred from the vulnerability functions when performing a risk calculation: there is no point in considering ground motion values below the minimum in the vulnerability functions, since they would generate zero losses. Discarding the values below the threshould breaks the log normal distribution. To be concrete, consider a case with a single site, and single intensity measure type (PGA) and a minimum_intensity of 0.05g. Suppose there are 1000 simulations and that you have a normal distribution of the logaritms with :math:\mu=-2 and :math:\sigma=.5; then the ground motion values that you could obtain would be as follows: >>> import numpy >>> numpy.random.seed(42) # fix the seed >>> gmvs = numpy.random.lognormal(mean=-2.0, sigma=.5, size=1000) As expected, the variability of the values is rather large, spanning more than one order of magnitude: >>> gmvs.min(), numpy.median(gmvs), gmvs.max() (0.026765710489091852, 0.1370582013790309, 0.9290114132955762) Also mean and standard deviation of the logarithms are very close to the expected values :math:\mu=-2 and :math:\sigma=.5: >>> numpy.log(gmvs).mean() -1.9903339720888376 >>> numpy.log(gmvs).std() 0.4893631038736771 The geometric mean of the values (i.e. the exponential of the mean of the logarithms) is very close to the median, as expected for a lognormal distribution: >>> numpy.exp(numpy.log(gmvs).mean()) 0.13664978061122787 All these properties are broken when the ground motion values are truncated below the minimum_intensity:: >> gmvs[gmvs < .05] = .05 >> numpy.log(gmvs).mean() -1.9876078473466177 >> numpy.log(gmvs).std() 0.48280630467779523 >> numpy.exp(numpy.log(gmvs).mean()) 0.13702281319482504 In this case the difference is minor, but if the number of simulations is small and/or the :math:\sigma is large the mean and standard deviation obtained from the logarithms of the ground motion fields could be quite different from the expected ones. Finally, it should be noticed that the geometric mean can be orders of magnitude different from the usual mean and it is purely a coincidence that in this case they are close (~0.137 vs ~0.155). Single-rupture estimated median ground motion field --------------------------------------------------- The mean ground motion field by event discussed above is an *a posteriori* output: *after* performing the calculation, some statistics are performed on the stored ground motion fields. However, in the case of a single rupture it is possible to estimate the geometric mean and the geometric standard deviation *a priori*, using hazardlib and without performing a full calculation. However, there are some limitations to this approach: 1. it only works when there is a single rupture 2. you have to manage the minimum_intensity manually if you want to compare with a concrete engine output 3. it is good for estimates, it gives you the theoretical ground ground motion field but not the ones concretely generated by the engine fixed a specific seed It should also be noticed that there is a shortcut to compute the single-rupture hazardlib "mean" ground motion field without writing any code; just set in your job.ini the following values:: truncation_level = 0 ground_motion_fields = 1 Setting truncation_level = 0 effectively replaces the lognormal distribution with a delta function, so the generated ground motion fields will be all equal, with the same value for all events: this is why you can set ground_motion_fields = 1, since you would just waste time and space by generating multiple copies. Finally let's warn again on the term hazardlib "mean" ground motion field: in log space it is truly a mean, but in terms of the original GMFs it is a geometric mean - which is the same as the median since the distribution is lognormal - so you can also call this the hazardlib *median* ground motion field. How the hazard sites are determined =================================== There are several ways to specify the hazard sites in an engine calculation. 1. The user can specify the sites directly in the job.ini using the sites parameter (e.g. sites = -122.4194 37.7749, -118.2437 34.0522, -117.1611 32.7157). This method is perhaps most useful when the analysis is limited to a handful of sites. 2. Otherwise the user can specify the list of sites in a CSV file (i.e. sites_csv = sites.csv). 3. Otherwise the user can specify a grid via the region and region_grid_spacing parameters. 4. Otherwise the sites can be inferred from the exposure, if any, in two different ways: 1. if region_grid_spacing is specified, a grid is implicitly generated from the convex hull of the exposure and used 2. otherwise the locations of the assets are used as hazard sites 5. Otherwise the sites can be inferred from the site model file, if any. It must be noted that the engine rounds longitudes and latitudes to 5 decimal places (or approximately 1 meter spatial resolution), so sites that differ only at the 6th decimal place or beyond will end up being considered as duplicated sites by the engine, and this will be flagged as an error. Having determined the sites, a SiteCollection object is generated by associating the closest parameters from the site model (if any) or using the global site parameters, if any. If the site model is specified, but the closest site parameters are too distant from the sites, a warning is logged for each site. There are a number of error situations: 1. If both site model and global site parameters are missing, the engine raises an error. 2. If both site model and global site parameters are specified, the engine raises an error. 3. Specifying both the sites.csv and a grid is an error. 4. Specifying both the sites.csv and a site_model.csv is an error. If you are in such situation you should consider using the command oq prepare_site_model to manually prepare a site model on the location of the sites. 5. Having duplicates (i.e. rows with identical lon, lat up to 5 digits) in the site model is an error. If you want to compute the hazard on the locations specified by the site model and not on the exposure locations, you can split the calculation in two files: job_haz.ini containing the site model and job_risk.ini containing the exposure. Then the risk calculator will find the closest hazard to each asset and use it. However, if the closest hazard is more distant than the asset_hazard_distance parameter (default 15 km) an error is raised. Scenarios from ShakeMaps ======================== Beginning with version 3.1, the engine is able to perform scenario_risk and scenario_damage calculations starting from the GeoJSON feed for ShakeMaps_ provided by the United States Geological Survey (USGS). Furthermore, starting from version 3.12 it is possible to use ShakeMaps from other sources like the local filesystem or a custom URL. .. _ShakeMaps: https://earthquake.usgs.gov/data/shakemap/ Running the Calculation ======================= In order to enable this functionality one has to prepare a parent calculation containing the exposure and risk functions for the region of interest, say Peru. To that aim the user will need to write a prepare_job.ini file like this one:: [general] description = Peru - Preloading exposure and vulnerability calculation_mode = scenario exposure_file = exposure_model.xml structural_vulnerability_file = structural_vulnerability_model.xml By running the calculation $oq engine --run prepare_job.ini The exposure and the risk functions will be imported in the datastore. This example only includes vulnerability functions for the loss type structural, but one could also have in this preparatory job file the functions for nonstructural components and contents, and occupants, or fragility functions if damage calculations are of interest. It is essential that each fragility/vulnerability function in the risk model should be conditioned on one of the intensity measure types that are supported by the ShakeMap service – MMI, PGV, PGA, SA(0.3), SA(1.0), and SA(3.0). If your fragility/vulnerability functions involves an intensity measure type which is not supported by the ShakeMap system (for instance SA(0.6)) the calculation will terminate with an error. Let's suppose that the calculation ID of this 'pre' calculation is 1000. We can now run the risk calculation starting from a ShakeMap. For that, one need a job.ini file like the following:: [general] description = Peru - 2007 M8.0 Pisco earthquake losses calculation_mode = scenario_risk number_of_ground_motion_fields = 10 truncation_level = 3 shakemap_id = usp000fjta spatial_correlation = yes cross_correlation = yes This example refers to the 2007 Mw8.0 Pisco earthquake in Peru (see https://earthquake.usgs.gov/earthquakes/eventpage/usp000fjta#shakemap). The risk can be computed by running the risk job file against the prepared calculation::$ oq engine --run job.ini --hc 1000 Starting from version 3.12 it is also possible to specify the following sources instead of a shakemap_id:: # (1) from local files: shakemap_uri = { "kind": "usgs_xml", "grid_url": "relative/path/file.xml", "uncertainty_url": "relative/path/file.xml" } # (2) from remote files: shakemap_uri = { "kind": "usgs_xml", "grid_url": "https://url.to/grid.xml", "uncertainty_url": "https://url.to/uncertainty.zip" } # (3) both files in a single archive # containing grid.xml, uncertainty.xml: shakemap_uri = { "kind": "usgs_xml", "grid_url": "relative/path/grid.zip" } While it is also possible to define absolute paths, it is advised not to do so since using absolute paths will make your calculation not portable across different machines. The files must be valid .xml USGS ShakeMaps (1). One or both files can also be passed as .zip archives containing a single valid xml ShakeMap (2). If both files are in the same .zip, the archived files must be named grid.xml and uncertainty.xml. Also starting from version 3.12 it is possible to use ESRI Shapefiles in the same manner as ShakeMaps. Polygons define areas with the same intensity levels and assets/sites will be associated to a polygon if contained by the latter. Sites outside of a polygon will be discarded. Shapefile inputs can be specified similar to ShakeMaps:: shakemap_uri = { "kind": "shapefile", "fname": "path_to/file.shp" } It is only necessary to specify one of the available files, and the rest of the files will be expected to be in the same location. It is also possible to have them contained together in a *.zip file. There are at least a *.shp-main file and a *.dbf-dBASE file required. The record field names, intensity measure types and units all need to be the same as with regular USGS ShakeMaps. Irrespective of the input, the engine will perform the following operations: 1. download the ShakeMap and convert it into a format suitable for further processing, i.e. a ShakeMaps array with lon, lat fields 2. the ShakeMap array will be associated to the hazard sites in the region covered by the ShakeMap 3. by using the parameters truncation_level and number_of_ground_motion_fields a set of ground motion fields (GMFs) following the truncated Gaussian distribution will be generated and stored in the datastore 4. a regular risk calculation will be performed by using such GMFs and the assets within the region covered by the shakemap. Correlation =========== By default the engine tries to compute both the spatial correlation and the cross correlation between different intensity measure types. Please note that if you are using MMI as intensity measure type in your vulnerability model, it is not possible to apply correlations since those are based on physical measures. For each kind of correlation you have three choices, that you can set in the job.ini, for a total of nine combinations:: - spatial_correlation = yes, cross_correlation = yes # the default - spatial_correlation = no, cross_correlation = no # disable everything - spatial_correlation = yes, cross_correlation = no - spatial_correlation = no, cross_correlation = yes - spatial_correlation = full, cross_correlation = full - spatial_correlation = yes, cross_correlation = full - spatial_correlation = no, cross_correlation = full - spatial_correlation = full, cross_correlation = no - spatial_correlation = full, cross_correlation = yes yes means using the correlation matrix of the Silva-Horspool_ paper; no mean using no correlation; full means using an all-ones correlation matrix. .. _Silva-Horspool: https://onlinelibrary.wiley.com/doi/abs/10.1002/eqe.3154 Apart from performance considerations, disabling either the spatial correlation or the cross correlation (or both) might be useful to see how significant the effect of the correlation is on the damage/loss estimates. In particular, due to numeric errors, the spatial correlation matrix - that by construction contains only positive numbers - can still produce small negative eigenvalues (of the order of -1E-15) and the calculation fails with an error message saying that the correlation matrix is not positive defined. Welcome to the world of floating point approximation! Rather than magically discarding negative eigenvalues the engine raises an error and the user has two choices: either disable the spatial correlation or reduce the number of sites because that can make the numerical instability go away. The easiest way to reduce the number of sites is setting a region_grid_spacing parameter in the prepare_job.ini file, then the engine will automatically put the assets on a grid. The larger the grid spacing, the fewer the number of points, and the closer the calculation will be to tractability. Performance Considerations ========================== The performance of the calculation will be crucially determined by the number of hazard sites. For instance, in the case of the Pisco earthquake the ShakeMap has 506,142 sites, which is a significantly large number of sites. However, the extent of the ShakeMap in longitude and latitude is about 6 degrees, with a step of 10 km the grid contains around 65 x 65 sites; most of the sites are without assets because most of the grid is on the sea or on high mountains, so actually there are around ~500 effective sites. Computing a correlation matrix of size 500 x 500 is feasible, so the risk computation can be performed. Clearly in situations in which the number of hazard sites is too large, approximations will have to be made such as using a larger region_grid_spacing. Disabling spatial AND cross correlation makes it possible run much larger calculations. The performance can be further increased by not using a truncation_level. When applying correlation, a soft cap on the size of the calculations is defined. This is done and modifiable through the parameter cholesky_limit which refers to the number of sites multiplied by the number of intensity measure types used in the vulnerability model. Raising that limit is at your own peril, as you might run out of memory during calculation or may encounter instabilities in the calculations as described above. If the ground motion values or the standard deviations are particularly large, the user will get a warning about suspicious GMFs. Moreover, especially for old ShakeMaps, the USGS can provide them in a format that the engine cannot read. Thus, this feature is not expected to work in all cases. Extended consequences ===================== Scenario damage calculations produce damage distributions, i.e. arrays containing the number of buildings in each damage state defined in the fragility functions. There is a damage distribution per each asset, event and loss type, so you can easily produce *billions* of damage distributions. This is why the engine provide facilities to compute results based on aggregating the damage distributions, possibly multiplied by suitable coefficients, i.e. *consequences*. For instance, from the probability of being in the collapsed damage state, one may estimate the number of fatalities, given the right multiplicative coefficient. Another commonly computed consequence is the economic loss; in order to estimated it, one need a different multiplicative coefficient for each damage state and for each taxonomy. The table of coefficients, a.k.a. the *consequence model*, can be represented as a CSV file like the following: =================== ============ ============ ======== ========== =========== ========== taxonomy consequence loss_type slight moderate extensive complete ------------------- ------------ ------------ -------- ---------- ----------- ---------- CR_LFINF-DUH_H2 losses structural 0.05 0.25 0.6 1 CR_LFINF-DUH_H4 losses structural 0.05 0.25 0.6 1 MCF_LWAL-DNO_H3 losses structural 0.05 0.25 0.6 1 MR_LWAL-DNO_H1 losses structural 0.05 0.25 0.6 1 MR_LWAL-DNO_H2 losses structural 0.05 0.25 0.6 1 MUR_LWAL-DNO_H1 losses structural 0.05 0.25 0.6 1 W-WS_LPB-DNO_H1 losses structural 0.05 0.25 0.6 1 W-WWD_LWAL-DNO_H1 losses structural 0.05 0.25 0.6 1 MR_LWAL-DNO_H3 losses structural 0.05 0.25 0.6 1 =================== ============ ============ ======== ========== =========== ========== The first field in the header is the name of a tag in the exposure; in this case it is the taxonomy but it could be any other tag — for instance, for volcanic ash-fall consequences, the roof-type might be more relevant, and for recovery time estimates, the occupancy class might be more relevant. The consequence framework is meant to be used for generic consequences, not necessarily limited to earthquakes, because since version 3.6 the engine provides a multi-hazard risk calculator. The second field of the header, the consequence, is a string identifying the kind of consequence we are considering. It is important because it is associated to the name of the function to use to compute the consequence. It is rather easy to write an additional function in case one needed to support a new kind of consequence. You can show the list of consequences by the version of the engine that you have installed with the command:: $oq info consequences # in version 3.12 The following 5 consequences are implemented: losses collapsed injured fatalities homeless The other fields in the header are the loss type and the damage states. For instance the coefficient 0.25 for "moderate" means that the cost to bring a structure in "moderate damage" back to its undamaged state is 25% of the total replacement value of the asset. The loss type refers to the fragility model, i.e. structural will mean that the coefficients apply to damage distributions obtained from the fragility functions defined in the file structural_fragility_model.xml. discrete_damage_distribution ---------------------------- Damage distributions are called discrete when the number of buildings in each damage is an integer, and continuous when the number of buildings in each damage state is a floating point number. Continuous distributions are a lot more efficient to compute and therefore that is the default behavior of the engine, at least starting from version 3.13. You can ask the engine to use discrete damage distribution by setting the flag in the job.ini file discrete_damage_distribution = true However, it should be noticed that setting discrete_damage_distribution = true will raise an error if the exposure contains a floating point number of buildings for some asset. Having a floating point number of buildings in the exposure is quite common since the "number" field is often estimated as an average. Even if the exposure contains only integers and you have set discrete_damage_distribution = true in the job.ini, the aggregate damage distributions will normally contains floating point numbers, since they are obtained by summing integer distributions for all seismic events of a given hazard realization and dividing by the number of events of that realization. By summing the number of buildings in each damage state one will get the total number of buildings for the given aggregation level; if the exposure contains integer numbers than the sum of the numbers will be an integer, apart from minor differences due to numeric errors, since the engine stores even discrete distributions as floating point numbers. The EventBasedDamage demo ---------------------------------------------------------------- Given a source model, a logic tree, an exposure, a set of fragility functions and a set of consequence functions, the event_based_damage calculator is able to compute results such as average consequences and average consequence curves. The scenario_damage calculator does the same, except it does not start from a source model and a logic tree, but rather from a set of predetermined ruptures or ground motion fields, and the averages are performed on the input parameter number_of_ground_motion_fields and not on the effective investigation time. In the engine distribution, in the folders demos/risk/EventBasedDamage and demos/risk/ScenarioDamage there are examples of how to use the calculators. Let's start with the EventBasedDamage demo. The source model, the exposure and the fragility functions are much simplified and you should not consider them realistic for the Nepal, but they permit very fast hazard and risk calculations. The effective investigation time is eff_time = 1 (year) x 1000 (ses) x 50 (rlzs) = 50,000 years and the calculation is using sampling of the logic tree. Since all the realizations have the same weight, on the risk side we can effectively consider all of them together. This is why there will be a single output (for the effective risk realization) and not 50 outputs (one for each hazard realization) as it would happen for an event_based_risk calculation. Normally the engine does not store the damage distributions for each asset (unless you specify aggregate_by=id in the job.ini file). By default it stores the aggregate damage distributions by summing on all the assets in the exposure. If you are interested only in partial sums, i.e. in aggregating only the distributions associated to a certain tag combination, you can produce the partial sums by specifying the tags. For instance aggregate_by = taxonomy will aggregate by taxonomy, aggregate_by = taxonomy, region will aggregate by taxonomy and region, etc. The aggregated damage distributions (and aggregated consequences, if any) will be stored in a table called risk_by_event which can be accessed with pandas. The corresponding DataFrame will have fields event_id, agg_id (integer referring to which kind of aggregation you are considering), loss_id (integer referring to the loss type in consideration), a column named dmg_X for each damage state and a column for each consequence. In the EventBasedDamage demo the exposure has a field called NAME_1 and representing a geographic region in Nepal (i.e. "East" or "Mid-Western") and there is an aggregate_by = NAME_1, taxonomy in the job.ini. Since the demo has 4 taxonomies ("Wood", "Adobe", "Stone-Masonry", "Unreinforced-Brick-Masonry") there 4 x 2 = 8 possible aggregations; actually, there is also a 9th possibility corresponding to aggregating on all assets by disregarding the tags. You can see the possible values of the the agg_id field with the following command::$ oq show agg_id taxonomy NAME_1 agg_id 0 Wood East 1 Wood Mid-Western 2 Adobe East 3 Adobe Mid-Western 4 Stone-Masonry East 5 Stone-Masonry Mid-Western 6 Unreinforced-Brick-Masonry East 7 Unreinforced-Brick-Masonry Mid-Western 8 *total* *total* Armed with that knowledge it is pretty easy to understand the risk_by_event table:: >> from openquake.commonlib.datastore import read >> dstore = read(-1) # the latest calculation >> df = dstore.read_df('risk_by_event', 'event_id') agg_id loss_id dmg_1 dmg_2 dmg_3 dmg_4 losses event_id 472 0 0 0.0 1.0 0.0 0.0 5260.828125 472 8 0 0.0 1.0 0.0 0.0 5260.828125 477 0 0 2.0 0.0 1.0 0.0 6368.788574 477 8 0 2.0 0.0 1.0 0.0 6368.788574 478 0 0 3.0 1.0 1.0 0.0 5453.355469 ... ... ... ... ... ... ... ... 30687 8 0 56.0 53.0 26.0 16.0 634266.187500 30688 0 0 3.0 6.0 1.0 0.0 14515.125000 30688 8 0 3.0 6.0 1.0 0.0 14515.125000 30690 0 0 2.0 0.0 1.0 0.0 5709.204102 30690 8 0 2.0 0.0 1.0 0.0 5709.204102 [8066 rows x 7 columns] The number of buildings in each damage state is integer (even if stored as a float) because the exposure contains only integers and the job.ini is setting explicitly discrete_damage_distribution = true. It should be noted that while there is a CSV exporter for the risk_by_event table, it is designed to export only the total aggregation component (i.e. agg_id=9 in this example) for reasons of backward compatibility with the past, the time when the only aggregation the engine could perform was the total aggregation. Since the risk_by_event table can be rather large, it is recommmended to interact with it with pandas and not to export in CSV. There is instead a CSV exporter for the aggregated damage distributions (together with the aggregated consequences) that you may call with the command oq export aggrisk; you can also see the distributions directly:: $oq show aggrisk agg_id rlz_id loss_id dmg_0 dmg_1 dmg_2 dmg_3 dmg_4 losses 0 0 0 0 18.841061 0.077873 0.052915 0.018116 0.010036 459.162567 1 3 0 0 172.107361 0.329445 0.591998 0.422925 0.548271 11213.121094 2 5 0 0 1.981786 0.003877 0.005539 0.004203 0.004594 104.431755 3 6 0 0 797.826111 1.593724 1.680134 0.926167 0.973836 23901.496094 4 7 0 0 48.648529 0.120687 0.122120 0.060278 0.048386 1420.059448 5 8 0 0 1039.404907 2.125607 2.452706 1.431690 1.585123 37098.269531 By summing on the damage states one gets the total number of buildings for each aggregation level:: agg_id dmg_0 + dmg_1 + dmg_2 + dmg_3 + dmg_4 aggkeys 0 19.000039 ~ 19 Wood,East 3 173.999639 ~ 174 Wood,Mid-Western 5 2.000004 ~ 2 Stone-Masonry,Mid-Western 6 802.999853 ~ 803 Unreinforced-Brick-Masonry,East 7 48.999971 ~ 49 Unreinforced-Brick-Masonry,Mid-Western 8 1046.995130 ~ 1047 Total The ScenarioDamage demo ---------------------------------------------------------------- The demo in demos/risk/ScenarioDamage is similar to the EventBasedDemo (it still refers to Nepal) but it uses a much large exposure with 9063 assets and 5,365,761 building. Moreover the configuration file is split in two: first you should run job_hazard.ini and then run job_risk.ini with the --hc option. The first calculation will produce 2 sets of 100 ground motion fields each (since job_hazard.ini contains number_of_ground_motion_fields = 100 and the gsim logic tree file contains two GMPEs). The second calculation will use such GMFs to compute aggregated damage distributions. Contrarily to event based damage calculations, scenario damage calculations normally use full enumeration, since there are very few realizations (only two in this example), thus the scenario damage calculator is able to distinguish the results by realization. The main output of a scenario_damage calculation is still the risk_by_event table which has exactly the same form as for the EventBasedDamage demo. However there is a difference when considering the aggrisk output: since we are using full enumeration we will produce a damage distribution for each realization::$ oq show aggrisk agg_id rlz_id loss_id dmg_0 ... dmg_4 losses 0 0 0 0 4173405.75 ... 452433.40625 7.779261e+09 1 0 1 0 3596234.00 ... 633638.37500 1.123458e+10 The sum over the damage states will still produce the total number of buildings, which will be independent from the realization:: rlz_id dmg_0 + dmg_1 + dmg_2 + dmg_3 + dmg_4 0 5365761.0 1 5365761.0 In this demo there is no aggregate_by specified, so the only aggregation which is performed is the total aggregation. You are invited to specify aggregate_by and study how aggrisk changes. Taxonomy mapping --------------------------------- In an ideal world, for every building type represented in the exposure model, there would be a unique matching function in the vulnerability or fragility models. However, often it may not be possible to have a one-to-one mapping of the taxonomy strings in the exposure and those in the vulnerability or fragility models. For cases where the exposure model has richer detail, many taxonomy strings in the exposure would need to be mapped onto a single vulnerability or fragility function. In other cases where building classes in the exposure are more generic and the fragility or vulnerability functions are available for more specific building types, a modeller may wish to assign more than one vulnerability or fragility function to the same building type in the exposure with different weights. We may encode such information into a taxonomy_mapping.csv file like the following: =========== =========== taxonomy conversion ----------- ----------- Wood Type A Wood Wood Type B Wood Wood Type C Wood =========== =========== Using an external file is convenient, because we can avoid changing the original exposure. If in the future we will be able to get specific risk functions, then we will just remove the taxonomy mapping. This usage of the taxonomy mapping (use proxies for missing risk functions) is pretty useful, but there is also another usage which is even more interesting. Consider a situation where there are doubts about the precise composition of the exposure. For instance we may know than in a given geographic region 20% of the building of type "Wood" are of "Wood Type A", 30% of "Wood Type B" and 50% of "Wood Type C", corresponding to different risk functions, but do not know building per building what it its precise taxonomy, so we just use a generic "Wood" taxonomy in the exposure. We may encode the weight information into a taxonomy_mapping.csv file like the following: ========= ============ ======= taxonomy conversion weight --------- ------------ ------- Wood Wood Type A 0.2 Wood Wood Type B 0.3 Wood Wood Type C 0.5 ========= ============ ======= The engine will read this mapping file and when performing the risk calculation will use all three kinds of risk functions to compute a single result with a weighted mean algorithm. The sums of the weights must be 1 for each exposure taxonomy, otherwise the engine will raise an error. In this case the taxonomy mapping file works like a risk logic tree. Internally both the first usage and the second usage are treated in the same way, since the first usage is a special case of the second when all the weights are equal to 1. Risk profiles ============= The OpenQuake engine can produce risk profiles, i.e. estimates of average losses and maximum probable losses for all countries in the world. Even if you are interested in a single country, you can still use this feature to compute risk profiles for each province in your country. However, the calculation of the risk profiles is tricky and there are actually several different ways to do it. 1. The least-recommended way is to run independent calculations, one for each country. The issue with this approach is that even if the hazard model is the same for all the countries (say you are interested in the 13 countries of South America), due to the nature of event based calculations, different ruptures will be sampled in different countries. In practice, when comparing Chile with Peru you will see differences due to the fact that the random sampling picked different ruptures in the two contries and not real differences. In theory, the effect should disappear if the calculations have sufficiently long investigation times, when all possible ruptures are sampled, but in practice, for finite investigation times there will always be different ruptures. 2. To avoid such issues, the country-specific calculations should ideally all start from the same set of precomputed ruptures. You can compute the whole stochastic event set by running an event based calculation without specifying the sites and with the parameter ground_motion_fields set to false. Currently, one must specify a few global site parameters in the precalculation to make the engine checker happy, but they will not be used since the ground motion fields will not be generated in the precalculation. The ground motion fields will be generated on-the-fly in the subsequent individual country calculations, but not stored in the file system. This approach is fine if you do not have a lot of disk space at your disposal, but it is still inefficient since it is quite prone to the slow tasks issue. 3. If you have plenty of disk space it is better to also generate the ground motion fields in the precalculation, and then run the contry-specific risk calculations starting from there. This is particularly convenient if you foresee the need to run the risk part of the calculations multiple times, while the hazard part remains unchanged. Using a precomputed set of GMFs removes the need to rerun the hazard part of the calculations each time. 4. If you have a really powerful machine, the most efficient way is to run a single calculation considering all countries in a single job.ini file. The risk profiles can be obtained by using the aggregate_by and reaggregate_by parameters. This approach can be much faster than the previous ones. However, approaches #2 and #3 are cloud-friendly and can be preferred if you have access to cloud-computing resources, since then you can spawn a different machine for each country and parallelize horizontally. Here are some tips on how to prepare the required job.ini files: When using approach #1 you will have 13 different files (in the example of South America) with a format like the following:: $cat job_Argentina.ini calculation_mode = event_based_risk source_model_logic_tree_file = ssmLT.xml gsim_logic_tree_file = gmmLTrisk.xml site_model_file = Site_model_Argentina.csv exposure_file = Exposure_Argentina.xml ...$ cat job_Bolivia.ini calculation_mode = event_based_risk source_model_logic_tree_file = ssmLT.xml gsim_logic_tree_file = gmmLTrisk.xml site_model_file = Site_model_Bolivia.csv exposure_file = Exposure_Bolivia.xml ... Notice that the source_model_logic_tree_file and gsim_logic_tree_file will be the same for all countries since the hazard model is the same; the same sources will be read 13 times and the ruptures will be sampled and filtered 13 times. This is inefficient. Also, hazard parameters like :: truncation_level = 3 investigation_time = 1 number_of_logic_tree_samples = 1000 ses_per_logic_tree_path = 100 maximum_distance = 300 must be the same in all 13 files to ensure the consistency of the calculation. Ensuring this consistency can be prone to human error. When using approach #2 you will have 14 different files: 13 files for the individual countries and a special file for precomputing the ruptures:: $cat job_rup.ini calculation_mode = event_based source_model_logic_tree_file = ssmLT.xml gsim_logic_tree_file = gmmLTrisk.xml reference_vs30_value = 760 reference_depth_to_1pt0km_per_sec = 440 ground_motion_fields = false ... The files for the individual countries will be as before, except for the parameter source_model_logic_tree_file which should be removed. That will avoid reading 13 times the same source model files, which are useless anyway, since the calculation now starts from precomputed ruptures. There are still a lot of repetitions in the files and the potential for making mistakes. Approach #3 is very similar to approach #2: the only differences will be in the initial file, the one used to precompute the GMFs. Obviously it will require setting ground_motion_fields = true; moreover, it will require specifying the full site model as follows:: site_model_file = Site_model_Argentina.csv Site_model_Bolivia.csv ... The engine will automatically concatenate the site model files for all 13 countries and produce a single site collection. The site parameters will be extracted from such files, so the dummy global parameters reference_vs30_value, reference_depth_to_1pt0km_per_sec, etc can be removed. It is FUNDAMENTAL FOR PERFORMANCE to have reasonable site model files, i.e. the number of sites must be relatively small, let's say below 100,000 sites. For calculations with large high-definition exposure models, trying to calculate the hazard at the location of every single asset can easily generate millions of sites, making the calculation intractable in terms of both memory and disk space occupation. The engine provides a command oq prepare_site_model which is meant to generate sensible site model files starting from the country exposures and the global USGS vs30 grid. It works by using a hazard grid so that the number of sites can be reduced to a manageable number. Please refer to the manual in the section about the oq commands to see how to use it, or try oq prepare_site_model --help. Approach #4 is the best, since there is only a single file, thus avoiding entirely the possibily of having inconsistent parameters in different files. It is also the faster approach, not to mention the most convenient one, since you have to manage a single calculation and not 13. That makes the task of managing any kind of post-processing a lot simpler. Unfortunately, it is also the option that requires more memory and it can be infeasable if the model is too large and you do not have enough computing resources. In that case your best bet might be to go back to options #2 or #3. If you have access to multiple small machines, approaches #2 and #3 can be more attractive than #4, since then you can scale horizontally. If you decide to use approach #4, in the single file you must specify the site_model_file as done in approach #3, and also the exposure_file as follows:: exposure_file = Exposure_Argentina.xml Exposure_Bolivia.xml ... The engine will automatically build a single asset collection for the entire continent of South America. In order to use this approach, you need to collect all the vulnerability functions in a single file and the taxonomy mapping file must cover the entire exposure for all countries. Moreover, the exposure must contain the associations between asset<->country; in GEM's exposure models, this is typically encoded in a field called ID_0. Then the aggregation by country can be done with the option :: aggregate_by = ID_0 Sometimes, one is interested in finer aggregations, for instance by country and also by occupancy (Residential, Industrial or Commercial); then you have to set :: aggregate_by = ID_0, OCCUPANCY reaggregate_by = ID_0 reaggregate_by is a new feature of engine 3.13 which allows to go from a finer aggregation (i.e. one with more tags, in this example 2) to a coarser aggregation (i.e. one with fewer tags, in this example 1). Actually the command oq reaggregate has been there for more than one year; the new feature is that it is automatically called at the end of a calculation, by spawning a subcalculation to compute the reaggregation. Without reaggregate_by the aggregation by country would be lost, since only the result of the finer aggregation would be stored. Single-line commands -------------------- When using approach #1 your can run all of the required calculations with the command::$ oq engine --multi --run job_Argentina.csv job_Bolivia.csv ... When using approach #2 your can run all of the required calculations with the command:: $oq engine --run job_rup.ini job_Argentina.csv job_Bolivia.csv ... When using approach #3 your can run all of the required calculations with the command::$ oq engine --run job_gmf.ini job_Argentina.csv job_Bolivia.csv ... When using approach #4 your can run all of the required calculations with the command:: $oq engine --run job_all.ini Here job_XXX.ini are the country specific configuration files, job_rup.ini is the file generating the ruptures, job_rup.ini is the file generating the ruptures, job_gmf.ini is the file generating the ground motion files and job_all.ini is the file encompassing all countries. Finally, if you have a file job_haz.ini generating the full GMFs, a file job_weak.ini generating the losses with a weak building code and a file job_strong.ini generating the losses with a strong building code, you can run the entire an analysis with a single command as follows::$ oq engine --run job_haz.ini job_weak.ini job_strong.ini This will generate three calculations and the GMFs will be reused. This is as efficient as possible for this kind of problem. Caveat: GMFs are split-dependent -------------------------------- You should not expect the results of approach #4 to match exactly the results of approaches #3 or #2, since splitting a calculation by countries is a tricky operation. In general, if you have a set of sites and you split it in disjoint subsets, and then you compute the ground motion fields for each subset, you will get different results than if you do not split. To be concrete, if you run a calculation for Chile and then one for Argentina, you will get different results than running a single calculation for Chile+Argentina, *even if you have precomputed the ruptures for both countries, even if the random seeds are the same and even if there is no spatial correlation*. Many users are surprised but this fact, but it is obvious if you know how the GMFs are computed. Suppose you are considering 3 sites in Chile and 2 sites in Argentina, and that the value of the random seed in 123456: if you split, assuming there is a single event, you will produce the following 3+2 normally distributed random numbers: >>> numpy.random.default_rng(123456).normal(size=3) # for Chile array([ 0.1928212 , -0.06550702, 0.43550665]) >>> numpy.random.default_rng(123456).normal(size=2) # for Argentina array([ 0.1928212 , -0.06550702]) If you do not split, you will generate the following 5 random numbers instead: >>> numpy.random.default_rng(123456).normal(size=5) array([ 0.1928212 , -0.06550702, 0.43550665, 0.88235875, 0.37132785]) They are unavoidably different. You may argue that not splitting is the correct way of proceeding, since the splitting causes some random numbers to be repeated (the numbers 0.1928212 and -0.0655070 in this example) and actually breaks the normal distribution. In practice, if there is a sufficiently large event-set and if you are interested in statistical quantities, things work out and you should see similar results with and without splitting. But you will *never produce identical results*. Only the classical calculator does not depend on the splitting of the sites, for event based and scenario calculations there is no way out. Special features of the engine =============================== There are a few less frequently used features of the engine that are not documented in the general user's manual, since their usage is quite specific. They are documented here. Sensitivity analysis -------------------------- Running a sensitivity analysis study means to run multiple calculations by changing a parameter and to study how the results change. For instance, it is interesting to study the random seed dependency when running a calculation using sampling of the logic tree, or it is interesting to study the impact of the truncation level on the PoEs. The engine offers a special syntax to run a sensitivity analysis with respect to one (or even more than one) parameter; you can find examples in the demos, see for instance the MultiPointClassicalPSHA demo or the EventBasedDamage demo. It is enough to write in the job.ini a dictionary of lists like the following:: sensitivity_analysis = {"random_seed": [100, 200, 300]} sensitivity_analysis = {'truncation_level': [2, 3]} The first example with run 3 calculations, the second 2 calculations. The calculations will be sequential unless you specify the --many flag in oq engine --run --many job.ini. The descriptions of the spawned calculation will be extended to include the parameter, so you could have descriptions as follows:: Multipoint demo {'truncation_level': 2} Multipoint demo {'truncation_level': 3} The custom_site_id ---------------------- Since engine 3.13, it is possible to assign 6-character ASCII strings as unique identifiers for the sites (8-characters since engine 3.15). This can be convenient in various situations, especially when splitting a calculation in geographic regions. The way to enable it is to add a field called custom_site_id to the site model file, which must be unique for each site. The hazard curve and ground motion field exporters have been modified to export the custom_site_id instead of the site_id (if present). We used this feature to split the ESHM20 model in two parts (Northern Europe and Southern Europe). Then creating the full hazard map was as trivial as joining the generated CSV files. Without the custom_site_id the site IDs would overlap, thus making impossible to join the outputs. A geohash string (see https://en.wikipedia.org/wiki/Geohash) makes a good custom_site_id since it can enable the unique identification of all potential sites across the globe. The minimum_distance parameter ---------------------------------- GMPEs often have a prescribed range of validity. In particular they may give unexpected results for points too close to ruptures. To avoid this problem the engine recognizes a minimum_distance parameter: if it is set, then for distances below the specified minimum distance, the GMPEs return the ground-motion value at the minimum distance. This avoids producing extremely large (and physically unrealistic) ground-motion values at small distances. The minimum distance is somewhat heuristic. It may be useful to experiment with different values of the minimum_distance, to see how the hazard and risk change. GMPE logic trees with weighted IMTs ----------------------------------- In order to support Canada's 6th Generation seismic hazard model, the engine now has the ability to manage GMPE logic trees where the weight assigned to each GMPE may be different for each IMT. For instance you could have a particular GMPE applied to PGA with a certain weight, to SA(0.1) with a different weight, and to SA(1.0) with yet another weight. The user may want to assign a higher weight to the IMTs where the GMPE has a small uncertainty and a lower weight to the IMTs with a large uncertainty. Moreover a particular GMPE may not be applicable for some periods, and in that case the user can assign to a zero weight for those periods, in which case the engine will ignore it entirely for those IMTs. This is useful when you have a logic tree with multiple GMPEs per branchset, some of which are applicable for some IMTs and not for others. Here is an example: .. code-block:: xml BooreEtAl1997GeometricMean 0.33 0.25 0.5 0.5 0.5 SadighEtAl1997 0.33 0.25 0.5 0.5 0.5 MunsonThurber1997Hawaii 0.34 0.25 0.0 0.0 0.0 Campbell1997 0.0 0.25 0.0 0.0 0.0 Clearly the weights for each IMT must sum up to 1, otherwise the engine will complain. Note that this feature only works for the classical and disaggregation calculators: in the event based case only the default uncertaintyWeight (i.e. the first in the list of weights, the one without imt attribute) would be taken for all IMTs. Equivalent Epicenter Distance Approximation ------------------------------------------- The equivalent epicenter distance approximation (reqv for short) was introduced in engine 3.2 to enable the comparison of the OpenQuake engine with time-honored Fortran codes using the same approximation. You can enable it in the engine by adding a [reqv] section to the job.ini, like in our example in openquake/qa_tests_data/classical/case_2/job.ini:: reqv_hdf5 = {'active shallow crust': 'lookup_asc.hdf5', 'stable shallow crust': 'lookup_sta.hdf5'} For each tectonic region type to which the approximation should be applied, the user must provide a lookup table in .hdf5 format containing arrays mags of shape M, repi of shape N and reqv of shape (M, N). The examples in openquake/qa_tests_data/classical/case_2 will give you the exact format required. M is the number of magnitudes (in the examples there are 26 magnitudes ranging from 6.05 to 8.55) and N is the number of epicenter distances (in the examples ranging from 1 km to 1000 km). Depending on the tectonic region type and rupture magnitude, the engine converts the epicentral distance repi into an equivalent distance by looking at the lookup table and use it to determine the rjb and rrup distances, instead of the regular routines. This means that within this approximation ruptures are treated as pointwise and not rectangular as the engine usually does. Notice that the equivalent epicenter distance approximation only applies to ruptures coming from PointSources/AreaSources/MultiPointSources, fault sources are untouched. Ruptures in CSV format ------------------------------------------- Since engine v3.10 there is a way to serialize ruptures in CSV format. The command to give is:: \$ oq extract "ruptures?min_mag="  For instance, assuming there is an event based calculation with ID 42, we can extract the ruptures in the datastore with magnitude larger than 6 with oq extract "ruptures?min_mag=6" 42: this will generate a CSV file. Then it is possible to run scenario calculations starting from that rupture by simply setting rupture_model_file = ruptures-min_mag=6_42.csv in the job.ini file. The format is provisional and may change in the future, but it will stay a CSV with JSON fields. Here is an example for a planar rupture, i.e. a rupture generated by a point source:: #,,,,,,,,,,"trts=['Active Shallow Crust']" seed,mag,rake,lon,lat,dep,multiplicity,trt,kind,mesh,extra 24,5.050000E+00,0.000000E+00,0.08456,0.15503,5.000000E+00,1,Active Shallow Crust,ParametricProbabilisticRupture PlanarSurface,"[[[[0.08456, 0.08456, 0.08456, 0.08456]], [[0.13861, 0.17145, 0.13861, 0.17145]], [[3.17413, 3.17413, 6.82587, 6.82587]]]]","{""occurrence_rate"": 4e-05}" The format is meant to support all kind of ruptures, including ruptures generated by simple and complex fault sources, characteristic sources, nonparametric sources and new kind of sources that could be introduced in the engine in the future. The header will be the same for all kind of ruptures that will be stored in the same CSV. Here is description of the fields as they are named now (engine 3.11): seed the random seed used to compute the GMFs generated by the rupture mag the magnitude of the rupture rake the rake angle of the rupture surface in degrees lon the longitude of the hypocenter in degrees lat the latitude of the hypocenter in degrees dep the depth of the hypocenter in km multiplicity the number of occurrences of the rupture (i.e. number of events) trt the tectonic region type of the rupture; must be consistent with the trts listed in the pre-header of the file kind a space-separated string listing the rupture class and the surface class used in the engine mesh 3 times nested list with lon, lat, dep of the points of the discretized rupture geometry for each underlying surface extra extra parameters of the rupture as a JSON dictionary, for instance the rupture occurrence rate Notice that using a CSV file generated with an old version of the engine is inherently risky: for instance if we changed the ParametricProbabilisticRupture class or the PlanarSurface classes in an incompatible way with the past, then a scenario calculation starting with the CSV would give different results in the new version of the engine. We never changed the rupture classes or the surface classes, but we changed the seed algorithm often, and that too would cause different numbers to be generated (hopefully, statistically consistent). A bug fix or change of logic in the calculator can also change the numbers across engine versions. max_sites_disagg -------------------------------- There is a parameter in the job.ini called max_sites_disagg, with a default value of 10. This parameter controls the maximum number of sites on which it is possible to run a disaggregation. If you need to run a disaggregation on a large number of sites you will have to increase that parameter. Notice that there are technical limits: trying to disaggregate 100 sites will likely succeed, trying to disaggregate 100,000 sites will most likely cause your system to go out of memory or out of disk space, and the calculation will be terribly slow. If you have a really large number of sites to disaggregate, you will have to split the calculation and it will be challenging to complete all the subcalculations. The parameter max_sites_disagg is extremely important not only for disaggregation, but also for classical calculations. Depending on its value and then number of sites (N) your calculation can be in the *few sites* regime or the *many sites regime*. In the *few sites regime* (N <= max_sites_disagg) the engine stores information for each rupture in the model (in particular the distances for each site) and therefore uses more disk space. The problem is mitigated since the engine uses a relatively aggressive strategy to collapse ruptures, but that requires more RAM available. In the *many sites regime* (N > max_sites_disagg) the engine does not store rupture information (otherwise it would immediately run out of disk space, since typical hazard models have tens of millions of ruptures) and uses a much less aggressive strategy to collapse ruptures, which has the advantage of requiring less RAM. Correlation of Ground Motion Fields ========================================= There are multiple different kind of correlation on the engine, so it is extremely easy to get confused. Here I will list all possibilities, in historical order. 1. Spatial correlation of ground motion fields has been a feature of the engine from day one. The available models are JB2009 and HM2018. 2. Cross correlation in ShakeMaps has been available for a few years. The model used there is hard-coded an the user cannot change it, only disable it. The models list below (3. and 4.) *have no effect on ShakeMaps*. 3. Since version 3.13 the engine provides the BakerJayaram2008 cross correlation model, however at the moment it is used only in the conditional spectrum calculator. 4. Since version 3.13 the engine provides the GodaAtkinson2009 cross correlation model and the FullCrossCorrelation model which can be used in scenario and event based calculations. Earthquake theory tells us that ground motion fields depend on two different lognormal distributions with parameters (:math:\mu, :math:\tau) and (:math:\mu, :math:\phi) respectively, which are determined by the GMPE (Ground Motion Prediction Equal). Given a rupture, a set of M intensity measure types and a collection of N sites, the parameters :math:\mu, :math:\tau and :math:\phi are arrays of shape (M, N). :math:\mu is the mean of the logarithms and :math:\tau the between-event standard deviation, associated to the cross correlation, while :math:\phi is the within-event standard deviation, associated to the spatial correlation. math:\tau and :math:\phi are normally N-independent, i.e. each array of shape (M, N) actually contains N copies of the same M values read from the coefficient table of the GMPE. In the OpenQuake engine each rupture has associated a random seed generated from the parameter ses_seed given in the job.ini file, therefore given a fixed number E of events it is possible to generate a deterministic distribution of ground motion fields, i.e. an array of shape (M, N, E). Technically such feature is implemented in the class openquake.hazardlib.calc.gmf.GmfComputer. The algorithm used there is to generate two arrays of normally distributed numbers called :math:\epsilon_\tau (of shape (M, E)) and :math:\epsilon_\phi (of shape (M, N, E)), one using the between-event standard deviation :math:\tau and the other using the within-event standard deviation :math:\phi, while keeping the same mean :math:\mu. Then the ground motion fields are generated as an array of shape (M, N, E) with the formula .. math:: gmf = exp(\mu + crosscorrel(\epsilon_\tau) + spatialcorrel(\epsilon\phi)) The details depend on the form of the cross correlation model and of the spatial correlation model and you have to study the source code if you really want to understand how it works, in particular how the correlation matrices are extracted from the correlation models. By default, if no cross correlation nor spatial correlation are specified, then there are no correlation matrices and :math:crosscorrel(\epsilon_\tau) and :math:spatialcorrel(\epsilon\phi) are computed by using scipy.stats.truncnorm. Otherwise scipy.stats.multivariate_normal with a correlation matrix of shape (M, M) is used for cross correlation and scipy.stats.multivariate_normal` distribution with a matrix of shape (N, N) is used for spatial correlation. Notice that the truncation feature is lost if you use correlation, since scipy does not offer at truncated multivariate_normal distribution. Not truncating the normal distribution can easily generated non-physical fields, but even if the truncation is on it is very possible to generate exceedingly large ground motion fields, so the user has to be *very* careful. Correlation is important because its presence normally causes the risk to increase, i.e. ignoring the correlation will under-estimate the risk. The best way to play with the correlation is to consider a scenario_risk calculation with a single rupture and to change the cross and spatial correlation models. Possibilities are to specify in the job.ini all possible combinations of cross_correlation = FullCrossCorrelation cross_correlation = GodaAtkinson2009 ground_motion_correlation_model = JB2009 ground_motion_correlation_model = HM2018 including removing one or the other or all correlations.