# Stochastic Event Based Probabilistic Seismic Hazard Analysis#

Analysis Input data for the Event-Based PSHA - as in the case of the Classical Probabilistic Seismic Hazard Analysis calculator - consists of a PSHA Input Model and a set of calculation settings.

The main calculators used to perform this analysis are:

1. Logic Tree Processor

The Logic Tree Processor works in the same way described in the description of the Classical Probabilistic Seismic Hazard Analysis workflow (see Section Classical PSHA).

2. Earthquake Rupture Forecast Calculator

The Earthquake Rupture Forecast Calculator was already introduced in the description of the PSHA workflow (see Section Classical PSHA).

3. Stochastic Event Set Calculator

The Stochastic Event Set Calculator generates a collection of stochastic event sets by sampling the ruptures contained in the ERF according to their probability of occurrence.

A Stochastic Event Set (SES) thus represents a potential realisation of the seismicity (i.e. a list of ruptures) produced by the set of seismic sources considered in the analysis over the time span fixed for the calculation of hazard.

4. Ground Motion Field Calculator

The Ground Motion Field Calculator computes for each event contained in a Stochastic Event Set a realization of the geographic distribution of the shaking by taking into account the aleatory uncertainties in the ground- motion model. Eventually, the Ground Motion Field calculator can consider the spatial correlation of the ground-motion during the generation of the Ground Motion Field.

5. Event-based PSHA Calculator

The event-based PSHA calculator takes a (large) set of ground-motion fields representative of the possible shaking scenarios that the investigated area can experience over a (long) time span and for each site computes the corresponding hazard curve.

This procedure is computationally intensive and is not recommended for investigating the hazard over large areas.

## Rupture sampling: how does it work?#

In this section we explain how the sampling of ruptures in event based calculations works, at least for the case of Poissonian sources. As an example, consider the following point source:

```>>> from openquake.hazardlib import nrml
>>> src = nrml.get('''\
... <pointSource id="1" name="Point Source"
...             tectonicRegion="Active Shallow Crust">
...    <pointGeometry>
...        <gml:Point><gml:pos>179.5 0</gml:pos></gml:Point>
...        <upperSeismoDepth>0</upperSeismoDepth>
...        <lowerSeismoDepth>10</lowerSeismoDepth>
...    </pointGeometry>
...    <magScaleRel>WC1994</magScaleRel>
...    <ruptAspectRatio>1.5</ruptAspectRatio>
...    <truncGutenbergRichterMFD aValue="3" bValue="1" minMag="5" maxMag="7"/>
...    <nodalPlaneDist>
...        <nodalPlane dip="30" probability="1" strike="45" rake="90" />
...    </nodalPlaneDist>
...    <hypoDepthDist>
...        <hypoDepth depth="4" probability="1"/>
...    </hypoDepthDist>
...</pointSource>''', investigation_time=1, width_of_mfd_bin=1.0)
```

The source here is particularly simple, with only one seismogenic depth and one nodal plane. It generates two ruptures, because with a `width_of_mfd_bin` of 1 there are only two magnitudes in the range from 5 to 7:

```>>> [(mag1, rate1), (mag2, rate2)] = src.get_annual_occurrence_rates()
>>> mag1
5.5
>>> mag2
6.5
```

The occurrence rates are respectively 0.009 and 0.0009. So, if we set the number of stochastic event sets to 1,000,000:

```>>> num_ses = 1_000_000
```

we would expect the first rupture (the one with magnitude 5.5) to occur around 9,000 times and the second rupture (the one with magnitude 6.5) to occur around 900 times. Clearly the exact numbers will depend on the stochastic seed; if we set:

```>>> np.random.seed(42)
```

then we will have (for `investigation_time = 1`):

```>>> np.random.poisson(rate1 * num_ses * 1)
8966
>>> np.random.poisson(rate2 * num_ses * 1)
921
```

These are the number of occurrences of each rupture in the effective investigation time, i.e. the investigation time multiplied by the number of stochastic event sets and the number of realizations (here we assumed 1 realization).

The total number of events generated by the source will be `number_of_events = sum(n_occ for each rupture)` i.e. 8,966 + 921 = 9,887, with ~91% of the events associated to the first rupture and ~9% of the events associated to the second rupture.

Since the details of the seed algorithm can change with updates to the the engine, if you run an event based calculation with the same parameters with different versions of the engine, you may not get exactly the same number of events, but something close given a reasonably long effective investigation time. After running the calculation, inside the datastore, in the `ruptures` dataset you will find the two ruptures, their occurrence rates and their integer number of occurrences (`n_occ`). If the effective investigation time is large enough then the relation `n_occ ~ occurrence_rate * eff_investigation_time` will hold. If the effective investigation time is not large enough, or the occurrence rate is extremely small, then you should expect to see larger differences between the expected number of occurrences and `n_occ`, as well as a strong seed dependency.

It is important to notice than in order to determine the effective investigation time, the engine takes into account also the ground motion logic tree and the correct formula to use is `eff_investigation_time = investigation_time * num_ses * num_rlzs` where `num_rlzs` is the number of realizations in the ground motion logic tree.

Just to be concrete, if you run a calculation with the same parameters as described before, but with two GMPEs instead of one (and `number_of_logic_tree_samples = 0`), then the total number of paths admitted by the logic tree will be 2 and you should expect to get about twice the number of occurrences for each rupture. Users wanting to know the nitty-gritty details should look at the code, inside hazardlib/source/base.py, in the method `src.sample_ruptures(eff_num_ses, ses_seed)`.

### The difference between full enumeration and sampling#

Users are often confused about the difference between full enumeration and sampling. For this reason the engine distribution comes with a pedagogical example that considers an extremely simplified situation comprising a single site, a single rupture, and only two GMPEs. You can find the example in the engine repository under the directory openquake/qa_tests_data/event_based/case_3. If you look at the ground motion logic tree file, the two GMPEs are AkkarBommer2010 (with weight 0.9) and SadighEtAl1997 (with weight 0.1).

The parameters in the job.ini are:

```investigation_time = 1
ses_per_logic_tree_path = 5_000
number_of_logic_tree_paths = 0
```

Since there are 2 realizations, the effective investigation time is 10,000 years. If you run the calculation, you will generate (at least with version 3.13 of the engine, though the details may change with the version) 10,121 events, since the occurrence rate of the rupture was chosen to be 1. Roughly half of the events will be associated with the first GMPE (AkkarBommer2010) and half with the second GMPE (SadighEtAl1997). Actually, if you look at the test, the precise numbers will be 5,191 and 4,930 events, i.e. 51% and 49% rather than 50% and 50%, but this is expected and by increasing the investigation time you can get closer to the ideal equipartition. Therefore, even if the AkkarBommer2010 GMPE is assigned a relative weight that is 9 times greater than SadighEtAl1997, this is not reflected in the simulated event set. It means that when performing a computation (for instance to compute the mean ground motion field, or the average loss) one has to keep the two realizations distinct, and only at the end to perform the weighted average.

The situation is the opposite when sampling is used. In order to get the same effective investigation time of 10,000 years you should change the parameters in the job.ini to:

```investigation_time = 1
ses_per_logic_tree_path = 1
number_of_logic_tree_paths = 10_000
```

Now there are 10,000 realizations, not 2, and they all have the same weight .0001. The number of events per realization is still roughly constant (around 1) and there are still 10,121 events, however now the original weights are reflected in the event set. In particular there are 9,130 events associated to the AkkarBommer2010 GMPE and 991 events associated to the SadighEtAl1997 GMPE. There is no need to keep the realizations separated: since they have all the same weigths, you can trivially compute average quantities. AkkarBommer2010 will count more than SadighEtAl1997 simply because there are 9 times more events for it (actually 9130/991 = 9.2, but the rate will tend to 9 when the effective time will tend to infinity).

NB: just to be clear, normally realizations are not in one-to-one correspondence with GMPEs. In this example, it is true because there is a single tectonic region type. However, usually there are multiple tectonic region types, and a realization is associated to a tuple of GMPEs.

## Rupture sampling: how to get it wrong#

Rupture samplings is much more complex than one could expect and in many respects surprising. In the many years of existence of the engine, multiple approached were tried and you can expect the details of the rupture sampling mechanism to be different nearly at every version of the engine.

Here we will discuss some tricky points that may help you understand why different versions of the engine may give different results and also why the comparison between the engine and other software performing rupture sampling is nontrivial.

We will start with the first subtlety, the interaction between sampling and filtering. The short version is that you should first sample and then filter.

Here is the long version. Consider the following code emulating rupture sampling for poissonian ruptures:

```import numpy

class FakeRupture:
def __init__(self, mag, rate):
self.mag = mag
self.rate = rate

def calc_n_occ(ruptures, eff_time, seed):
rates = numpy.array([rup.rate for rup in ruptures])
return numpy.random.default_rng(seed).poisson(rates * eff_time)

mag_rates = [(5.0, 1e-5), (5.1, 2e-5), (5.2, 1e-5), (5.3, 2e-5),
(5.4, 1e-5), (5.5, 2e-5), (5.6, 1e-5), (5.7, 2e-5)]
fake_ruptures = [FakeRupture(mag, rate) for mag, rate in mag_rates]
eff_time = 50 * 10_000
seed = 42
```

Running this code will give you the following numbers of occurrence for the 8 ruptures considered:

```>> calc_n_occ(fake_ruptures, eff_time, seed)
[ 8  9  6 13  7  6  6 10]
```

Here we did not consider the fact that engine has a `minimum_magnitude` feature and it is able to discard ruptures below the minimum magnitude. But how should it work? The natural approach to follow, for performance-oriented applications, would be to first discard the low magnitudes and then perform the sampling. However, that would have effects that would be surprising for most users. Consider the following two alternative:

```def calc_n_occ_after_filtering(ruptures, eff_time, seed, min_mag):
mags = numpy.array([rup.mag for rup in ruptures])
rates = numpy.array([rup.rate for rup in ruptures])
return numpy.random.default_rng(seed).poisson(
rates[mags >= min_mag] * eff_time)

def calc_n_occ_before_filtering(ruptures, eff_time, seed, min_mag):
mags = numpy.array([rup.mag for rup in ruptures])
rates = numpy.array([rup.rate for rup in ruptures])
n_occ = numpy.random.default_rng(seed).poisson(rates * eff_time)
return n_occ[mags >= min_mag]
```

Most users would expect that removing a little number of ruptures has a little effect; for instance, if we set `min_mag = 5.1` such that only the first rupture is removed from the total 8 ruptures, we would expect a minor change. However, if we follow the filter-early approach the user would get completely different occupation numbers:

```>> calc_n_occ_after_filtering(fake_ruptures, eff_time, seed, min_mag)
[13  6  9  6 13  7  6]
```

It is only by using the filter-late approach that the occupation numbers are consistent with the no-filtering case:

```>> calc_n_occ_before_filtering(fake_ruptures, eff_time, seed, min_mag)
[ 9  6 13  7  6  6 10]
```

The problem with the filtering is absolutely general and not restricted only to the magnitude filtering: it is exactly the same also for distance filtering. Suppose you have a `maximum_distance` of 300 km and than you decide that you want to increase it to 301 km. One would expect this change to have a minor impact; instead, you may end up sampling a very different set of ruptures.

It is true that average quantities like the hazard curves obtained from the ground motion fields will converge for long enough effective time, however in practice you are always in situations were

1. you cannot perform the calculation for a long enough effective time since it would be computationally prohibitive

2. you are interested on quantities which are strongly sensitive to aany change, like the Maximum Probable Loss at some return period

In such situations changing the site collection (or changing the maximum distance which is akin to changing the site collection) can change the sampling of the ruptures significantly, at least for engine versions lower than 3.17.

Users wanting to compare the GMFs or the risk on different site collections should be aware of this effect; the solution is to first sample the ruptures without setting any site collection (i.e. disabling the filtering) and then perform the calculation with the site collection starting from the sampled ruptures.