.. _scenario-hazard-intro:

Scenario Hazard
===============

In case of Scenario Based Seismic Hazard Analysis, the engine simulates a set of ground motion fields (GMFs) at the 
target sites for the requested set of intensity measure types. This set of GMFs can then be used in :ref:`Scenario Damage 
Assessment <scenario-damage-intro>` and :ref:`Scenario Risk Assessment <scenario-risk-intro>` to estimate the 
distribution of potential damage, economic losses, fatalities, and other consequences. The scenario calculator is 
useful for simulating both historical and hypothetical earthquakes.

In case of Scenario Based Seismic Hazard Analysis, the input data consist of a single earthquake rupture model and one 
or more ground-motion models (GSIMs). Using the Ground Motion Field Calculator, multiple realizations of ground shaking 
can be computed, each realization sampling the aleatory uncertainties in the ground-motion model. The main calculator 
used to perform this analysis is the Ground Motion Field Calculator, which was already introduced during the description 
of the event based PSHA workflow (see Section :ref:`Event based PSHA <event-based-psha-intro>`).

As the scenario calculator does not need to determine the probability of occurrence of the specific rupture, but only 
sufficient information to parameterise the location (as a three-dimensional surface), the magnitude and the 
style-of-faulting of the rupture, a more simplified NRML structure is sufficient compared to the source model structures 
described in :ref:`source typologies <source-typologies>`. A rupture model XML can be defined as presented in the 
section :ref:`rupture-model`.

Conditioning ground shaking to observations
-------------------------------------------

Starting from OpenQuake engine v3.16, it is possible to condition the ground shaking to observations, such as ground 
motion recordings and macroseismic intensity observations. The simulated ground motion fields are cross-spatially 
correlated, and can reduce considerably the uncertainty and bias in the resulting loss and damage estimates. The 
implementation of the conditioning of ground motion fields in the engine was performed following closely the procedure 
proposed by Engler et al. (2022).

To conditioning of ground shaking to observations in OpenQuake-engine, it is necessary to provide additional input files 
and adjustments to the configuration file, as presented in the :ref:`input-models` section.


The concept of “mean” ground motion field
-----------------------------------------

The engine has at least three different kinds of mean ground motion field, computed 
differently and used in different situations:

	- Mean ground motion field by GMPE, used to reduce disk space and make risk calculations faster.
	- Mean ground motion field by event, used for debugging/plotting purposes.
	- Single-rupture hazardlib mean ground motion field, used for analysis/plotting purposes.

Mean ground motion field by GMPE
********************************

This is the most useful concept for people doing risk calculations. To be concrete, 
suppose you are running a *scenario_risk* calculation on a region where you have a 
very fine site model (say at 1 km resolution) and a sophisticated hazard model 
(say with 16 different GMPEs): then you can easily end up with a pretty large 
calculation. For instance one of our users was doing such a calculation with an 
exposure of 1.2 million assets, 50,000+ hazard sites, 5 intensity measure levels 
and 1000 simulations, corresponding to 16,000 events given that there are 16 GMPEs. 
Given that each ground motion value needs 4 bytes to be stored as a 32 bit float, 
the math tells us that such calculation will generate 50000 x 16000 x 5 x 4 ~ 15 
GB of data (it could be a but less by using the ``minimum_intensity`` feature, but 
you get the order of magnitude). This is very little for the engine that can 
store such an amount of data in less than 1 minute, but it is a huge amount of 
data for a database. If you a (re)insurance company and your workflow requires 
ingesting the GMFs in a database to compute the financial losses, that’s a big 
issue. The engine could compute the hazard in just an hour, but the risk part 
could easily take 8 days. This is a no-go for most companies. They have deadlines 
and cannot way 8 days to perform a single analysis. At the end they are interested 
only in the mean losses, so they would like to have a single effective mean field 
producing something close to the mean losses that more correctly would be obtained 
by considering all 16 realizations. With a single effective realization the data 
storage would drop under 1 GB and more significantly the financial model software 
would complete the calculation in 12 hours instead of 8 days, something a lot 
more reasonable.

For this kind of situations hazardlib provides an ``AvgGMPE`` class, that allows to 
replace a set of GMPEs with a single effective GMPE. More specifically, the 
method ``AvgGMPE.get_means_and_stddevs`` calls the methods ``.get_means_and_stddevs`` 
on the underlying GMPEs and performs a weighted average of the means and a weighted 
average of the variances using the usual formulas:

.. math::

 \mu = \sum_{i}\omega_{i}\mu_{i}

.. math::

 \sigma^2 = \sum_{i}\omega_{i}(\sigma_{i})^2

where the weights sum up to 1. It is up to the user to check how big is the 
difference in the risk between the complete calculation and the mean field 
calculation. A factor of 2 discrepancies would not be surprising, but we have 
also seen situations where there is no difference within the uncertainty due to 
the random seed choice.

Mean ground motion field by event
*********************************

Using the *AvgGMPE* trick does not solve the issue of visualizing the ground motion 
fields, since for each site there are still 1000 events. A plotting tool has still 
to download 1 GB of data and then one has to decide which event to plot. The 
situation is the same if you are doing a sensitivity analysis, i.e. you are 
changing some parameter (it could be a parameter of the underlying rupture, or 
even the random seed) and you are studying how the ground motion fields change. 
It is hard to compare two sets of data of 1 GB each. Instead, it is a lot easier 
to define a “mean” ground motion field obtained by averaging on the events and then 
compare the mean fields of the two calculations: if they are very different, it is 
clear that the calculation is very sensitive to the parameter being studied. Still, 
the tool performing the comparison will need to consider 1000 times less data and 
will be 1000 times faster, also downloding 1000 times less data from the remote 
server where the calculation has been performed.

For this kind of analysis the engine provides an internal output ``avg_gmf`` that can 
be plotted with the command ``oq plot avg_gmf <calc_id>``. It is also possible to 
compare two calculations with the command ``$ oq compare avg_gmf imt <calc1> <calc2>``
Since ``avg_gmf`` is meant for internal usage and for debugging it is not exported 
by default and it is not visible in the WebUI. It is also not guaranteed to stay 
the same across engine versions. It is available starting from version 3.11. It 
should be noted that, consistently with how the ``AvgGMPE`` works, the ``avg_gmf`` 
output is computed in *log space*, i.e. it is geometric mean, not the usual mean. 
If the distribution was exactly lognormal that would also coincide with the median 
field.

However, you should remember that in order to reduce the data transfer and to 
save disk space the engine discards ground motion values below a certain minimum 
intensity, determined explicitly by the user or inferred from the vulnerability 
functions when performing a risk calculation: there is no point in considering 
ground motion values below the minimum in the vulnerability functions, since they 
would generate zero losses. Discarding the values below the threshould breaks the 
log normal distribution.

To be concrete, consider a case with a single site, and single intensity measure 
type (PGA) and a ``minimum_intensity`` of 0.05g. Suppose there are 1000 simulations 
and that you have a normal distribution of the logarithms with :math:`\mu = -2.0,  \sigma=.5`; 
then the ground motion values that you could obtain would be as follows::

	>>> import numpy
	>>> np.random.seed(42) # fix the seed
	>>> gmvs = np.random.lognormal(mean=-2.0, sigma=.5, size=1000)

As expected, the variability of the values is rather large, spanning more than 
one order of magnitude::

	>>> numpy.round([gmvs.min(), np.median(gmvs), gmvs.max()], 6)
	array([0.026766, 0.137058, 0.929011])

Also mean and standard deviation of the logarithms are very close to the expected 
values :math:`\mu = -2.0,  \sigma=.5`::

	>>> round(np.log(gmvs).mean(), 6)
	-1.990334
	>>> round(np.log(gmvs).std(), 6)
	0.489363

The geometric mean of the values (i.e. the exponential of the mean of the 
logarithms) is very close to the median, as expected for a lognormal distribution::

	>>> round(np.exp(np.log(gmvs).mean()), 6)
	0.13665

All these properties are broken when the ground motion values are truncated 
below the ``minimum_intensity``::

	>>> gmvs[gmvs < .05] = .05
	>>> round(np.log(gmvs).mean(), 6)
	-1.987608
	>>> round(np.log(gmvs).std(), 6)
	0.4828063
	>>> round(np.exp(np.log(gmvs).mean()), 6)
	0.137023

In this case the difference is minor, but if the number of simulations is small 
and/or the :math:`\sigma` is large the mean and standard deviation obtained from 
the logarithms of the ground motion fields could be quite different from the 
expected ones.

Finally, it should be noticed that the geometric mean can be orders of magnitude 
different from the usual mean and it is purely a coincidence that in this case 
they are close (~0.137 vs ~0.155).

Single-rupture estimated median ground motion field
***************************************************

The mean ground motion field by event discussed above is an a posteriori output: 
after performing the calculation, some statistics are performed on the stored 
ground motion fields. However, in the case of a single rupture it is possible to 
estimate the geometric mean and the geometric standard deviation a priori, using 
hazardlib and without performing a full calculation. However, there are some 
limitations to this approach:

1. it only works when there is a single rupture
2. you have to manage the ``minimum_intensity`` manually if you want to compare with a concrete engine output
3. it is good for estimates, it gives you the theoretical ground ground motion field but not the ones concretely generated by the engine fixed a specific seed

It should also be noticed that there is a shortcut to compute the single-rupture 
hazardlib “mean” ground motion field without writing any code; just set in your 
``job.ini`` the following values::

	truncation_level = 0
	ground_motion_fields = 1

Setting ``truncation_level = 0`` effectively replaces the lognormal distribution 
with a delta function, so the generated ground motion fields will be all equal, 
with the same value for all events: this is why you can set ``ground_motion_fields = 1``, 
since you would just waste time and space by generating multiple copies.

Finally let’s warn again on the term hazardlib “mean” ground motion field: in log 
space it is truly a mean, but in terms of the original GMFs it is a geometric mean 
- which is the same as the median since the distribution is lognormal - so you can 
also call this the hazardlib median ground motion field.

Case study: GMFs for California
*******************************

We had an user asking for the GMFs of California on 707,920 hazard sites, using 
the UCERF mean model and an investigation time of 100,000 years. Is this feasible 
or not? Some back of the envelope calculations suggests that it is unfeasible, 
but reality can be different.

The relevant parameters are the following::

	N = 707,920 hazard sites
	E = 10^5 estimated events of magnitude greater then 5.5 in the investigation
	    time of 100,000 years
	B = 1 number of branches in the UCERF logic tree
	G = 5 number of GSIMS in the GMPE logic tree
	I = 6 number of intensity measure types
	S1 = 13 number of bytes used by the engine to store a single GMV

The maximum size of generated GMFs is ``N * E * B * G * I * S1 = 25 TB (terabytes)``
Storing and sharing 25 TB of data is a big issue, so the problem seems without 
solution. However, most of the ground motion values are zero, because there is a 
maximum distance of 300 km and a rupture cannot affect all of the sites. So the 
size of the GMFs should be less than 25 TB. Moreover, if you want to use such GMFs 
for a damage analysis, you may want to discard very small shaking that will not 
cause any damage to your buildings. The engine has a parameter to discard all 
GMFs below a minimum threshold, the ``minimum_intensity`` parameter. The higher the 
threshold, the smaller the size of the GMFs. By playing with that parameter you 
can reduce the size of the output by orders of magnitudes. Terabytes could easily 
become gigabytes with a well chosen threshold.

In practice, we were able to run the full 707,920 sites by splitting the sites in 
70 tiles and by using a minimum intensity of 0.1 g. This was the limit 
configuration for our cluster which has 5 machines with 128 GB of RAM each.

The full calculation was completed in only 4 hours because our calculators are 
highly optimized. The total size of the generated HDF5 files was of 400 GB. This 
is a lot less than 25 TB, but still too large for sharing purposes.

Another way to reduce the output is to reduce the number of intensity measure 
types. Currently in your calculations there are 6 of them (PGA, SA(0.1), SA(0.2), 
SA(0.5), SA(1.0), SA(2.0)) but if you restrict yourself to only PGA the 
computation and the output will become 6 times smaller. Also, there are 5 GMPEs: 
if you restrict yourself to 1 GMPE you gain a factor of 5. Similarly, you can 
reduce the investigation period from 100,000 year to 10,000 years, thus gaining 
another order of magnitude. Also, raising the minimum magnitude reduces the 
number of events significantly.

But the best approach is to be smart. For instance, we know from experience that 
if the final goal is to estimate the total loss for a given exposure, the correct 
way to do that is to aggregate the exposure on a smaller number of hazard sites. 
For instance, instead of the original 707,920 hazard sites we could aggregate on 
only ~7,000 hazard sites and we would a calculation which is 100 times faster, 
produces 100 times less GMFs and still produces a good estimate for the total loss.

In short, risk calculations for the mean field UCERF model are routines now, in 
spite of what the naive expectations could be.

.. _scenarios-from-shakemaps:

Scenarios from ShakeMaps
------------------------

Beginning with version 3.1, the engine is able to perform *scenario_risk* and 
*scenario_damage* calculations starting from the GeoJSON feed for `ShakeMaps <https://earthquake.usgs.gov/data/shakemap/>`__
provided by the United States Geological Survey (USGS). Furthermore, starting 
from version 3.12 it is possible to use ShakeMaps from other sources like the 
local filesystem or a custom URL.

Running the Calculation
***********************

In order to enable this functionality one has to prepare a parent calculation 
containing the exposure and risk functions for the region of interest, say Peru. 
To that aim the user will need to write a prepare ``job.ini`` file like this one::

	[general]
	description = Peru - Preloading exposure and vulnerability
	calculation_mode = scenario
	exposure_file = exposure_model.xml
	structural_vulnerability_file = structural_vulnerability_model.xml

By running the calculation::

	$ oq engine --run prepare_job.ini

The exposure and the risk functions will be imported in the datastore.

This example only includes vulnerability functions for the loss type ``structural``, 
but one could also have in this preparatory job file the functions for 
nonstructural components and contents, and occupants, or fragility functions if 
damage calculations are of interest.

It is essential that each fragility/vulnerability function in the risk model 
should be conditioned on one of the intensity measure types that are supported 
by the ShakeMap service – MMI, PGV, PGA, SA(0.3), SA(1.0), and SA(3.0). If your 
fragility/vulnerability functions involves an intensity measure type which is 
not supported by the ShakeMap system (for instance SA(0.6)) the calculation will 
terminate with an error.

Let’s suppose that the calculation ID of this ‘pre’ calculation is 1000. We can 
now run the risk calculation starting from a ShakeMap. For that, one need a ``job.ini`` 
file like the following::

	[general]
	description = Peru - 2007 M8.0 Pisco earthquake losses
	calculation_mode = scenario_risk
	number_of_ground_motion_fields = 10
	truncation_level = 3
	shakemap_id = usp000fjta
	spatial_correlation = yes
	cross_correlation = yes

This example refers to the 2007 Mw8.0 Pisco earthquake in Peru 
(see https://earthquake.usgs.gov/earthquakes/eventpage/usp000fjta#shakemap). 
The risk can be computed by running the risk job file against the prepared 
calculation::

	$ oq engine --run job.ini --hc 1000

Starting from version 3.12 it is also possible to specify the following sources 
instead of a *shakemap_id*::

	# (1) from local files:
	shakemap_uri = {
	      "kind": "usgs_xml",
	      "grid_url": "relative/path/file.xml",
	      "uncertainty_url": "relative/path/file.xml"
	      }
	
	# (2) from remote files:
	shakemap_uri = {
	      "kind": "usgs_xml",
	      "grid_url": "https://url.to/grid.xml",
	      "uncertainty_url": "https://url.to/uncertainty.zip"
	      }
	
	# (3) both files in a single archive
	# containing grid.xml, uncertainty.xml:
	shakemap_uri = {
	      "kind": "usgs_xml",
	      "grid_url": "relative/path/grid.zip"
	      }

While it is also possible to define absolute paths, it is advised not to do so 
since using absolute paths will make your calculation not portable across 
different machines.

The files must be valid *.xml* USGS ShakeMaps (1). One or both files can also be 
passed as *.zip* archives containing a single valid xml ShakeMap (2). If both files 
are in the same *.zip*, the archived files *must* be named ``grid.xml`` and ``uncertainty.xml``.

Also starting from version 3.12 it is possible to use ESRI Shapefiles in the same 
manner as ShakeMaps. Polygons define areas with the same intensity levels and 
assets/sites will be associated to a polygon if contained by the latter. Sites 
outside of a polygon will be discarded. Shapefile inputs can be specified similar 
to ShakeMaps::

	shakemap_uri = {
	   "kind": "shapefile",
	   "fname": "path_to/file.shp"
	}

It is only necessary to specify one of the available files, and the rest of the 
files will be expected to be in the same location. It is also possible to have 
them contained together in a *.zip* file. There are at least a *.shp-main* file 
and a *.dbf-dBASE* file required. The record field names, intensity measure types 
and units all need to be the same as with regular USGS ShakeMaps.

Irrespective of the input, the engine will perform the following operations:

1. download the ShakeMap and convert it into a format suitable for further processing, i.e. a ShakeMaps array with lon, lat fields
2. the ShakeMap array will be associated to the hazard sites in the region covered by the ShakeMap
3. by using the parameters ``truncation_level`` and ``number_of_ground_motion_fields`` a set of ground motion fields (GMFs) following the truncated Gaussian distribution will be generated and stored in the datastore
4. a regular risk calculation will be performed by using such GMFs and the assets within the region covered by the shakemap.

Correlation
***********

By default the engine tries to compute both the spatial correlation and the cross 
correlation between different intensity measure types. Please note that if you 
are using MMI as intensity measure type in your vulnerability model, it is not 
possible to apply correlations since those are based on physical measures.

For each kind of correlation you have three choices, that you can set in the 
*job.ini*, for a total of nine combinations::

	- spatial_correlation = yes, cross_correlation = yes  # the default
	- spatial_correlation = no, cross_correlation = no   # disable everything
	- spatial_correlation = yes, cross_correlation = no
	- spatial_correlation = no, cross_correlation = yes
	- spatial_correlation = full, cross_correlation = full
	- spatial_correlation = yes, cross_correlation = full
	- spatial_correlation = no, cross_correlation = full
	- spatial_correlation = full, cross_correlation = no
	- spatial_correlation = full, cross_correlation = yes

yes means using the correlation matrix of the `Silva-Horspool <https://onlinelibrary.wiley.com/doi/abs/10.1002/eqe.3154>`__
paper; *no* mean using no correlation; *full* means using an all-ones correlation 
matrix.

Apart from performance considerations, disabling either the spatial correlation 
or the cross correlation (or both) might be useful to see how significant the 
effect of the correlation is on the damage/loss estimates.

In particular, due to numeric errors, the spatial correlation matrix - that by 
construction contains only positive numbers - can still produce small negative 
eigenvalues (of the order of -1E-15) and the calculation fails with an error 
message saying that the correlation matrix is not positive defined. Welcome to 
the world of floating point approximation! Rather than magically discarding 
negative eigenvalues the engine raises an error and the user has two choices: 
either disable the spatial correlation or reduce the number of sites because 
that can make the numerical instability go away. The easiest way to reduce the 
number of sites is setting a *region_grid_spacing* parameter in the *prepare_job.ini* 
file, then the engine will automatically put the assets on a grid. The larger the 
grid spacing, the fewer the number of points, and the closer the calculation will 
be to tractability.

Performance Considerations
**************************

The performance of the calculation will be crucially determined by the number of 
hazard sites. For instance, in the case of the Pisco earthquake the ShakeMap has 
506,142 sites, which is a significantly large number of sites. However, the extent 
of the ShakeMap in longitude and latitude is about 6 degrees, with a step of 10 
km the grid contains around 65 x 65 sites; most of the sites are without assets 
because most of the grid is on the sea or on high mountains, so actually there 
are around ~500 effective sites. Computing a correlation matrix of size 500 x 500 
is feasible, so the risk computation can be performed.

Clearly in situations in which the number of hazard sites is too large, 
approximations will have to be made such as using a larger *region_grid_spacing*. 
Disabling spatial AND cross correlation makes it possible run much larger 
calculations. The performance can be further increased by not using a ``truncation_level``.

When applying correlation, a soft cap on the size of the calculations is defined. 
This is done and modifiable through the parameter ``cholesky_limit`` which refers to 
the number of sites multiplied by the number of intensity measure types used in 
the vulnerability model. Raising that limit is at your own peril, as you might 
run out of memory during calculation or may encounter instabilities in the 
calculations as described above.

If the ground motion values or the standard deviations are particularly large, 
the user will get a warning about suspicious GMFs.

Moreover, especially for old ShakeMaps, the USGS can provide them in a format 
that the engine cannot read.

Thus, this feature is not expected to work in all cases.

The concept of “mean” ground motion field
The engine has at least three different kinds of mean ground motion field, computed differently and used in different situations:

Mean ground motion field by GMPE, used to reduce disk space and make risk calculations faster.

Mean ground motion field by event, used for debugging/plotting purposes.

Single-rupture hazardlib mean ground motion field, used for analysis/plotting purposes.