# Stochastic Event Based Probabilistic Seismic Hazard Analysis

Input data for the Event-Based PSHA - as in the case of the Classical *Probabilistic Seismic Hazard Analysis*
calculator - consist of a PSHA Input Model and a set of calculation settings.

The main calculators used to perform this analysis are:

- *Logic Tree Processor* - The Logic Tree Processor works in the same way described in the Classical *Probabilistic Seismic Hazard Analysis* workflow (see Section Classical PSHA).

- *Earthquake Rupture Forecast Calculator* - The Earthquake Rupture Forecast Calculator was already introduced in the description of the Classical PSHA workflow (see Section Classical PSHA).

- *Stochastic Event Set Calculator* - The Stochastic Event Set Calculator generates a collection of stochastic event sets by sampling the ruptures contained in the ERF according to their probability of occurrence. A Stochastic Event Set (SES) thus represents a potential realisation of the seismicity (i.e. a list of ruptures) produced by the set of seismic sources considered in the analysis, over the time span fixed for the calculation of hazard.

- *Ground Motion Field Calculator* - The Ground Motion Field Calculator computes, for each event contained in a Stochastic Event Set, a realization of the geographic distribution of the shaking, taking into account the aleatory uncertainties in the ground-motion model. Optionally, the Ground Motion Field Calculator can consider the spatial correlation of the ground motion during the generation of the Ground Motion Field.

- *Event-based PSHA Calculator* - The event-based PSHA calculator takes a (large) set of ground-motion fields representative of the possible shaking scenarios that the investigated area can experience over a (long) time span and, for each site, computes the corresponding hazard curve.

This procedure is computationally intensive and is not recommended for investigating the hazard over large areas.

## Rupture sampling: how does it work?

In this section we explain how the sampling of ruptures in event based calculations works, at least for the case of Poissonian sources. As an example, consider the following point source:

```
>>> from openquake.hazardlib import nrml
>>> src = nrml.get('''\
... <pointSource id="1" name="Point Source"
...              tectonicRegion="Active Shallow Crust">
...   <pointGeometry>
...     <gml:Point><gml:pos>179.5 0</gml:pos></gml:Point>
...     <upperSeismoDepth>0</upperSeismoDepth>
...     <lowerSeismoDepth>10</lowerSeismoDepth>
...   </pointGeometry>
...   <magScaleRel>WC1994</magScaleRel>
...   <ruptAspectRatio>1.5</ruptAspectRatio>
...   <truncGutenbergRichterMFD aValue="3" bValue="1" minMag="5" maxMag="7"/>
...   <nodalPlaneDist>
...     <nodalPlane dip="30" probability="1" strike="45" rake="90" />
...   </nodalPlaneDist>
...   <hypoDepthDist>
...     <hypoDepth depth="4" probability="1"/>
...   </hypoDepthDist>
... </pointSource>''', investigation_time=1, width_of_mfd_bin=1.0)
```

The source here is particularly simple, with only one seismogenic depth and one nodal plane. It generates two ruptures, because with a `width_of_mfd_bin` of 1 there are only two magnitudes in the range from 5 to 7:

```
>>> [(mag1, rate1), (mag2, rate2)] = src.get_annual_occurrence_rates()
>>> mag1
5.5
>>> mag2
6.5
```
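Those two magnitudes sit at the centres of the bins [5, 6) and [6, 7), and the annual rate of each bin follows from the truncated Gutenberg-Richter relation. As a quick sketch (using only the `aValue` and `bValue` of the source above, not engine code):

```python
# Annual occurrence rates of the two one-magnitude-unit bins of the
# truncated Gutenberg-Richter MFD above (aValue=3, bValue=1,
# minMag=5, maxMag=7): rate(bin) = 10**(a - b*m_lo) - 10**(a - b*m_hi)
a, b = 3.0, 1.0
rate1 = 10 ** (a - b * 5) - 10 ** (a - b * 6)  # bin [5, 6), mag 5.5
rate2 = 10 ** (a - b * 6) - 10 ** (a - b * 7)  # bin [6, 7), mag 6.5
print(round(rate1, 6), round(rate2, 6))  # 0.009 0.0009
```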

The occurrence rates are respectively 0.009 and 0.0009. So, if we set the number of stochastic event sets to 1,000,000:

```
>>> num_ses = 1_000_000
```

we would expect the first rupture (the one with magnitude 5.5) to occur around 9,000 times and the second rupture (the one with magnitude 6.5) to occur around 900 times. Clearly the exact numbers will depend on the stochastic seed; if we set:

```
>>> import numpy as np
>>> np.random.seed(42)
```

then we will have (for `investigation_time = 1`):

```
>>> np.random.poisson(rate1 * num_ses * 1)
8966
>>> np.random.poisson(rate2 * num_ses * 1)
921
```

These are the number of occurrences of each rupture in the effective investigation time, i.e. the investigation time multiplied by the number of stochastic event sets and the number of realizations (here we assumed 1 realization).

The total number of events generated by the source will be `number_of_events = sum(n_occ for each rupture)`, i.e. 8,966 + 921 = 9,887, with ~91% of the events associated with the first rupture and ~9% with the second rupture.

Since the details of the seed algorithm can change with updates to the engine, running an event based calculation
with the same parameters on different versions of the engine may not give exactly the same number of events, but
something close, given a reasonably long effective investigation time.

After running the calculation, inside the datastore, in the `ruptures` dataset you will find the two ruptures, their occurrence rates and their integer numbers of occurrences (`n_occ`). If the effective investigation time is large enough, the relation `n_occ ~ occurrence_rate * eff_investigation_time` will hold. If the effective investigation time is not large enough, or the occurrence rate is extremely small, you should expect larger differences between the expected number of occurrences and `n_occ`, as well as a strong seed dependency.
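The dependency on the effective time can be illustrated with a toy Poisson simulation (plain numpy, not engine code; the rate 0.009 is the one of the first rupture above):

```python
import numpy

rate = 0.009  # annual occurrence rate of the first rupture above
for eff_time in (1_000, 10_000_000):
    # draw the number of occurrences with a few different seeds
    draws = [numpy.random.default_rng(seed).poisson(rate * eff_time)
             for seed in range(5)]
    expected = rate * eff_time
    # the relative spread across seeds shrinks as eff_time grows
    spread = (max(draws) - min(draws)) / expected
    print(eff_time, expected, draws, round(spread, 3))
```

For a short effective time the counts vary strongly with the seed; for a long one they cluster tightly around `rate * eff_time`.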

It is important to note that, in order to determine the effective investigation time, the engine also takes into account the ground motion logic tree: the correct formula is `eff_investigation_time = investigation_time * num_ses * num_rlzs`, where `num_rlzs` is the number of realizations in the ground motion logic tree.
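Plugging in the numbers used in this example gives back the effective time behind the Poisson draws above (a trivial check, pure arithmetic):

```python
investigation_time = 1
num_ses = 1_000_000
num_rlzs = 1  # a single ground motion logic tree realization
eff_investigation_time = investigation_time * num_ses * num_rlzs
print(eff_investigation_time)  # 1000000
```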

Just to be concrete, if you run a calculation with the same parameters as described before, but with two GMPEs instead of one (and `number_of_logic_tree_samples = 0`), then the total number of paths admitted by the logic tree will be 2 and you should expect about twice the number of occurrences for each rupture. Users wanting to know the nitty-gritty details should look at the code, inside hazardlib/source/base.py, in the method `src.sample_ruptures(eff_num_ses, ses_seed)`.

### The difference between full enumeration and sampling

Users are often confused about the difference between full enumeration and sampling. For this reason the engine distribution comes with a pedagogical example that considers an extremely simplified situation comprising a single site, a single rupture, and only two GMPEs. You can find the example in the engine repository under the directory openquake/qa_tests_data/event_based/case_3. If you look at the ground motion logic tree file, the two GMPEs are AkkarBommer2010 (with weight 0.9) and SadighEtAl1997 (with weight 0.1).

The parameters in the job.ini are:

```
investigation_time = 1
ses_per_logic_tree_path = 5_000
number_of_logic_tree_samples = 0
```

Since there are 2 realizations, the effective investigation time is 10,000 years. If you run the calculation, you will
generate (at least with version 3.13 of the engine, though the details may change with the version) 10,121 events, since
the occurrence rate of the rupture was chosen to be 1. Roughly half of the events will be associated with the first
GMPE (AkkarBommer2010) and half with the second GMPE (SadighEtAl1997). Actually, if you look at the test, the precise
numbers will be 5,191 and 4,930 events, i.e. 51% and 49% rather than 50% and 50%, but this is expected and by increasing
the investigation time you can get closer to the ideal equipartition. Therefore, even if the AkkarBommer2010 GMPE is
assigned a relative weight that is 9 times greater than SadighEtAl1997, *this is not reflected in the simulated event set.*
It means that when performing a computation (for instance to compute the mean ground motion field, or the average loss)
one has to keep the two realizations distinct, and only perform the weighted average at the end.
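Keeping the realizations distinct and averaging only at the end can be sketched as follows; the per-realization mean losses here are made-up numbers, used purely for illustration:

```python
# logic tree weights of the two GMPEs in the example
weights = {'AkkarBommer2010': 0.9, 'SadighEtAl1997': 0.1}
# hypothetical mean losses, computed separately for each realization
mean_loss = {'AkkarBommer2010': 100.0, 'SadighEtAl1997': 140.0}
# the weighted average is performed only at the very end
avg = sum(weights[g] * mean_loss[g] for g in weights)
print(round(avg, 2))  # 104.0
```

Averaging the raw event sets directly would instead weight the two GMPEs equally, which is exactly the mistake described above.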

The situation is the opposite when sampling is used. In order to get the same effective investigation time of 10,000 years you should change the parameters in the job.ini to:

```
investigation_time = 1
ses_per_logic_tree_path = 1
number_of_logic_tree_samples = 10_000
```

Now there are 10,000 realizations, not 2, and they all have the same weight 0.0001. The number of events per realization
is still roughly constant (around 1) and there are still 10,121 events; however, now *the original weights are reflected
in the event set.* In particular, there are 9,130 events associated with the AkkarBommer2010 GMPE and 991 events associated
with the SadighEtAl1997 GMPE. There is no need to keep the realizations separated: since they all have the same weight,
you can trivially compute average quantities. AkkarBommer2010 will count more than SadighEtAl1997 simply because there
are 9 times more events for it (actually 9130/991 = 9.2, but the ratio will tend to 9 as the effective time tends
to infinity).
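The mechanism can be emulated by sampling one GMPE per realization according to the logic tree weights (a toy numpy sketch, not the engine's actual sampling algorithm):

```python
import numpy

rng = numpy.random.default_rng(42)
gmpes = ['AkkarBommer2010', 'SadighEtAl1997']
# one GMPE per realization, sampled with the logic tree weights
sampled = rng.choice(gmpes, size=10_000, p=[0.9, 0.1])
n_ab = int((sampled == 'AkkarBommer2010').sum())
n_sa = 10_000 - n_ab
print(n_ab, n_sa, round(n_ab / n_sa, 1))  # roughly a 9:1 split
```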

NB: just to be clear, normally realizations are not in one-to-one correspondence with GMPEs. In this example it is true because there is a single tectonic region type. However, usually there are multiple tectonic region types, and a realization is associated with a tuple of GMPEs.

## Rupture sampling: how to get it wrong

Rupture sampling is *much more complex than one could expect* and in many respects *surprising*. Over the many years of
existence of the engine, multiple approaches were tried, and you can expect the details of the rupture sampling
mechanism to differ in nearly every version of the engine.

Here we will discuss some tricky points that may help you understand why different versions of the engine may give different results and also why the comparison between the engine and other software performing rupture sampling is nontrivial.

We will start with the first subtlety, the *interaction between sampling and filtering*. The short version is that you
should *first sample and then filter*.

Here is the long version. Consider the following code emulating rupture sampling for poissonian ruptures:

```
import numpy

class FakeRupture:
    def __init__(self, mag, rate):
        self.mag = mag
        self.rate = rate

def calc_n_occ(ruptures, eff_time, seed):
    rates = numpy.array([rup.rate for rup in ruptures])
    return numpy.random.default_rng(seed).poisson(rates * eff_time)

mag_rates = [(5.0, 1e-5), (5.1, 2e-5), (5.2, 1e-5), (5.3, 2e-5),
             (5.4, 1e-5), (5.5, 2e-5), (5.6, 1e-5), (5.7, 2e-5)]
fake_ruptures = [FakeRupture(mag, rate) for mag, rate in mag_rates]
eff_time = 50 * 10_000
seed = 42
```

Running this code will give you the following numbers of occurrence for the 8 ruptures considered:

```
>> calc_n_occ(fake_ruptures, eff_time, seed)
[ 8 9 6 13 7 6 6 10]
```

Here we did not consider the fact that the engine has a `minimum_magnitude` feature and is able to discard ruptures
below the minimum magnitude. But how should that work? The natural approach to follow, for performance-oriented
applications, would be to first discard the low magnitudes and then perform the sampling. However, that would
have effects that would surprise most users. Consider the following two alternatives:

```
def calc_n_occ_after_filtering(ruptures, eff_time, seed, min_mag):
    mags = numpy.array([rup.mag for rup in ruptures])
    rates = numpy.array([rup.rate for rup in ruptures])
    return numpy.random.default_rng(seed).poisson(
        rates[mags >= min_mag] * eff_time)

def calc_n_occ_before_filtering(ruptures, eff_time, seed, min_mag):
    mags = numpy.array([rup.mag for rup in ruptures])
    rates = numpy.array([rup.rate for rup in ruptures])
    n_occ = numpy.random.default_rng(seed).poisson(rates * eff_time)
    return n_occ[mags >= min_mag]
```

Most users would expect that removing a small number of ruptures has a small effect; for instance, if we set
`min_mag = 5.1`, so that only the first of the 8 ruptures is removed, we would expect a minor change.
However, with the filter-early approach the user gets completely different occurrence numbers:

```
>> calc_n_occ_after_filtering(fake_ruptures, eff_time, seed, min_mag)
[13 6 9 6 13 7 6]
```

It is only by using the filter-late approach that the occurrence numbers are consistent with the no-filtering case:

```
>> calc_n_occ_before_filtering(fake_ruptures, eff_time, seed, min_mag)
[ 9 6 13 7 6 6 10]
```

The problem with the filtering is absolutely general and not restricted to magnitude filtering: it is exactly
the same for distance filtering. Suppose you have a `maximum_distance` of 300 km and then you decide to
increase it to 301 km. One would expect this change to have a minor impact; instead, you may end up sampling a
very different set of ruptures.

It is true that average quantities like the hazard curves obtained from the ground motion fields will converge for a long enough effective time; however, in practice you are always in situations where:

- you cannot perform the calculation for a long enough effective time, since it would be computationally prohibitive;

- you are interested in quantities which are strongly sensitive to any change, like the Maximum Probable Loss at some return period.

In such situations changing the site collection (or changing the maximum distance which is akin to changing the site collection) can change the sampling of the ruptures significantly, at least for engine versions lower than 3.17.

Users wanting to compare the GMFs or the risk on different site collections should be aware of this effect; the solution is to first sample the ruptures without setting any site collection (i.e. disabling the filtering) and then perform the calculation with the site collection starting from the sampled ruptures.