Event Based Damage and Risk
===========================

The asset loss table
--------------------

When performing an event based risk calculation the engine
keeps in memory a table with the losses for each asset and each event,
for each loss type. It is usually impossible to store such a table in full,
because it is extremely large; for instance, for 1 million assets, 1
million events, 2 loss types and 4 bytes per loss, ~8 TB of disk space
would be required. It is true that many events produce zero losses
because of the ``maximum_distance`` and ``minimum_intensity`` parameters,
but still, the asset loss table is prohibitively large and for many years
could not be stored. In engine 3.8 we made a breakthrough: we decided to
store a partial asset loss table, obtained by discarding small losses,
leveraging the fact that loss curves for long enough return periods
are dominated by extreme events, i.e. there is no point in saving all
the small losses.
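
To get a feeling for the numbers, the size of the full table can be
estimated with a couple of lines of Python (a back-of-the-envelope
sketch, using 4 bytes per loss as in the figures above):

.. code-block:: python

   >>> num_assets = 1_000_000
   >>> num_events = 1_000_000
   >>> num_loss_types = 2
   >>> bytes_per_loss = 4  # 32 bit floats
   >>> num_assets * num_events * num_loss_types * bytes_per_loss / 1e12  # terabytes
   8.0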

To that end, the engine honours a parameter called
``minimum_asset_loss`` which controls how many losses are discarded
when storing the asset loss table. The rule is simple: losses below
``minimum_asset_loss`` are discarded. By choosing the threshold
properly, in an ideal world:

1. the vast majority of the losses would be discarded, thus making the
   asset loss table storable;
2. the loss curves would still be nearly identical to the ones without
   discarding any loss, except for small return periods.

It is the job of the user to verify that 1 and 2 hold in the real world.
This can be assessed by playing with ``minimum_asset_loss`` in a small
calculation, finding a good value for it, and then using that value in the
large calculation. Clearly, it is a matter of compromise: by sacrificing some
precision it is possible to reduce enormously the size of the stored asset
loss table and to make an otherwise impossible calculation possible.
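
The effect of the thresholding rule is easy to visualize with a toy example
(this is only an illustration with made-up numbers, not the engine's internal
implementation):

.. code-block:: python

   >>> import numpy as np
   >>> rng = np.random.default_rng(42)
   >>> losses = rng.pareto(2, size=1_000_000) * 1_000  # heavy-tailed toy losses
   >>> threshold = 5_000                   # hypothetical minimum_asset_loss value
   >>> kept = losses[losses >= threshold]  # the rule: drop losses below the threshold
   >>> fraction_stored = len(kept) / len(losses)  # only a few percent survive

With a heavy-tailed loss distribution a well chosen threshold removes the
vast majority of the rows, while the large losses, which dominate the long
return period part of the loss curves, are kept.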

Starting from engine 3.11 the asset loss table is stored if the user
specifies

``aggregate_by = id``

in the ``job.ini`` file. In large calculations it is extremely easy to run out
of memory or to make the calculation extremely slow, so we recommend
not storing the asset loss table. The functionality is there for the sole
purpose of debugging small calculations, for instance, to see the effect
of the ``minimum_asset_loss`` approximation at the asset level.
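
For such a debugging run the two parameters discussed so far could be
combined in the ``job.ini``; the fragment below is purely illustrative and
the exact syntax accepted for ``minimum_asset_loss`` (a mapping from loss
type to threshold) should be checked against the manual of the engine
version in use::

   aggregate_by = id
   minimum_asset_loss = {'structural': 500, 'nonstructural': 500}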

For large calculations one is usually interested in the aggregate loss
table, which contains the losses per event and per aggregation tag (or
multi-tag). For instance, the tag ``occupancy`` has the three values
"Residential", "Industrial" and "Commercial", and by setting

``aggregate_by = occupancy``

the engine will store a pandas DataFrame called ``risk_by_event`` with a
field ``agg_id`` with 4 possible values: 0 for "Residential", 1 for
"Industrial", 2 for "Commercial" and 3 for the full aggregation.

NB: if the parameter ``aggregate_by`` is not specified, the engine will
still compute the aggregate loss table but then the ``agg_id`` field will
have a single value of 0 corresponding to the total portfolio losses.
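
The role of the ``agg_id`` field is easy to see with a toy table (the numbers
below are invented and do not come from a real calculation; the real
``risk_by_event`` table is richer, but the aggregation logic is the same):

.. code-block:: python

   >>> import pandas as pd
   >>> # toy risk_by_event-like table: 3 events, with the occupancy mapping of
   >>> # the example above (0=Residential, 1=Industrial, 2=Commercial)
   >>> risk_by_event = pd.DataFrame({
   ...     'event_id': [0, 0, 0, 1, 1, 2],
   ...     'agg_id':   [0, 1, 2, 0, 2, 1],
   ...     'loss':     [100., 50., 20., 30., 10., 5.]})
   >>> # total loss per aggregation tag: 130 (Residential), 55 (Industrial),
   >>> # 30 (Commercial)
   >>> totals = risk_by_event.groupby('agg_id').loss.sum()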

The Probable Maximum Loss (PML) and the loss curves
---------------------------------------------------

Given an effective investigation time and a return period,
the engine is able to compute a PML for each
aggregation tag. It does so by using the function
``openquake.risklib.scientific.losses_by_period`` which takes as input
an array of cumulative losses associated with the aggregation tag, a
list of the return periods, and the effective investigation time. If
there is a single return period the function returns the PML; if there are
multiple return periods it returns the loss curve. The two concepts
are essentially the same thing, since a loss curve is just an array of
PMLs, one for each return period. 

For instance:

.. code-block:: python

   >>> from openquake.risklib.scientific import losses_by_period
   >>> losses = [3, 2, 3.5, 4, 3, 23, 11, 2, 1, 4, 5, 7, 8, 9, 13, 0]
   >>> [PML_500y] = losses_by_period(losses, [500], eff_time=1000)
   >>> PML_500y
   13.0

computes the Probable Maximum Loss at 500 years for the given losses
with an effective investigation time of 1000 years. The algorithm works
by ordering the losses (suppose there are E losses, with E > 1), generating
E time periods ``eff_time/E, eff_time/(E-1), ..., eff_time/1``, and
log-interpolating the loss at the given return period. Of course this works
only if the condition ``eff_time/E < return_period < eff_time`` is respected.

In this example there are E=16 losses, so the return period
must be in the range 62.5 .. 1000 years. If the return period is too
small, the PML will be zero:

>>> losses_by_period(losses, [50], eff_time=1000)
array([0.])

while if the return period is longer than the effective investigation time,
we refuse the temptation to extrapolate and return NaN instead:

>>> losses_by_period(losses, [1500], eff_time=1000)
array([nan])

The rules above are the reason why you will sometimes see zeros or NaNs
in the loss curves generated by the engine, especially when there are
too few events (the valid range will be small and some return periods
may fall outside of it).
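
The algorithm described above is simple enough to sketch in a few lines of
NumPy. This is only an illustration of the idea, not the engine's
implementation, which lives in ``openquake.risklib.scientific`` and handles
more cases:

.. code-block:: python

   >>> import numpy as np
   >>> def naive_losses_by_period(losses, return_periods, eff_time):
   ...     """Sketch of the interpolation described in the text."""
   ...     losses = np.sort(losses)                   # ascending order
   ...     E = len(losses)
   ...     periods = eff_time / np.arange(E, 0., -1)  # eff_time/E ... eff_time/1
   ...     curve = np.zeros(len(return_periods))
   ...     for i, rp in enumerate(return_periods):
   ...         if rp < periods[0]:     # below the valid range -> zero
   ...             curve[i] = 0
   ...         elif rp > periods[-1]:  # above the valid range -> no extrapolation
   ...             curve[i] = np.nan
   ...         else:                   # log-interpolate the sorted losses
   ...             curve[i] = np.interp(np.log(rp), np.log(periods), losses)
   ...     return curve
   >>> losses = [3, 2, 3.5, 4, 3, 23, 11, 2, 1, 4, 5, 7, 8, 9, 13, 0]
   >>> naive_losses_by_period(losses, [50, 500, 1500], eff_time=1000)
   array([ 0., 13., nan])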

Aggregate loss curves
~~~~~~~~~~~~~~~~~~~~~

In some cases the computation of the PML is particularly simple and
you can do it by hand: this happens when the ratio
``eff_time/return_period`` is an integer. Consider for instance a case with
``eff_time=10,000`` years and ``return_period=2,000`` years;
suppose there are the following 10 losses aggregating the commercial
and residential buildings of an exposure:

>>> import numpy as np
>>> losses_COM = np.array([123, 0, 400, 0, 1500, 200, 350, 0, 700, 600])
>>> losses_RES = np.array([0, 800, 200, 0, 500, 1200, 250, 600, 300, 150])

The loss curve associates the highest loss with 10,000 years, the second
highest with 10,000 / 2 years, the third highest with 10,000 / 3 years, the
fourth highest with 10,000 / 4 years, the fifth highest with 10,000 / 5 years,
and so on, until the lowest loss is associated with 10,000 / 10 years.
Since the return period is 2,000 = 10,000 / 5, to compute the PML it is
enough to take the fifth loss in descending order. The commercial losses
sorted in descending order are ``[1500, 700, 600, 400, 350, 200, 123, 0, 0, 0]``
and the residential ones are ``[1200, 800, 600, 500, 300, 250, 200, 150, 0, 0]``,
therefore

>>> PML_COM = sorted(losses_COM, reverse=True)[4]  # fifth highest loss: 350
>>> PML_RES = sorted(losses_RES, reverse=True)[4]  # fifth highest loss: 300

Given this algorithm, it is clear why the PML cannot be additive, i.e.
PML(COM + RES) != PML(COM) + PML(RES): summing the losses before or
after ordering them gives different results. In this example,
by taking the fifth loss of the sorted sums
>>> sorted(losses_COM + losses_RES, reverse=True)
[2000, 1400, 1000, 800, 750, 600, 600, 600, 123, 0]

one gets ``PML(COM + RES) = 750``, which is different from
``PML(COM) + PML(RES) = 350 + 300 = 650``.
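
Assuming the interpolation behaves as described above (with exact hits at
return periods of the form ``eff_time/n``), the hand computation can be
cross-checked with the same function used before; the expected values are
given in the comments rather than asserted, since they come from the manual
reasoning above:

.. code-block:: python

   >>> from openquake.risklib.scientific import losses_by_period
   >>> pml_com = losses_by_period(losses_COM, [2000], eff_time=10_000)  # ~350
   >>> pml_res = losses_by_period(losses_RES, [2000], eff_time=10_000)  # ~300
   >>> pml_tot = losses_by_period(losses_COM + losses_RES, [2000],
   ...                            eff_time=10_000)                      # ~750, not 650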

The engine is able to compute aggregate loss curves correctly, i.e.
by doing the sums before the ordering phase. In order to perform
aggregations, you need to set the ``aggregate_by`` parameter in the
``job.ini`` by specifying tags over which you wish to perform the
aggregation. Your exposure must contain the specified tags
for each asset. We have an example for Nepal in our event based risk
demo. The exposure for this demo contains various tags, in
particular a geographic tag called ``NAME_1`` with values "Mid-Western",
"Far-Western", "West", "East" and "Central", and the ``job_eb.ini`` file
defines

``aggregate_by = NAME_1``

When running the calculation you will see something like this::

   Calculation 1 finished correctly in 17 seconds
  id | name
   9 | Aggregate Event Losses
   1 | Aggregate Loss Curves
   2 | Aggregate Loss Curves Statistics
   3 | Aggregate Losses
   4 | Aggregate Losses Statistics
   5 | Average Asset Losses Statistics
  11 | Earthquake Ruptures
   6 | Events
   7 | Full Report
   8 | Input Files
  10 | Realizations
  12 | Total Loss Curves
  13 | Total Loss Curves Statistics
  14 | Total Losses
  15 | Total Losses Statistics

Exporting the *Aggregate Loss Curves Statistics* output will give
you the mean and quantile loss curves in a format like the following one::

    annual_frequency_of_exceedence,return_period,loss_type,loss_value,loss_ratio
    5.00000E-01,2,nonstructural,0.00000E+00,0.00000E+00
    5.00000E-01,2,structural,0.00000E+00,0.00000E+00
    2.00000E-01,5,nonstructural,0.00000E+00,0.00000E+00
    2.00000E-01,5,structural,0.00000E+00,0.00000E+00
    1.00000E-01,10,nonstructural,0.00000E+00,0.00000E+00
    1.00000E-01,10,structural,0.00000E+00,0.00000E+00
    5.00000E-02,20,nonstructural,0.00000E+00,0.00000E+00
    5.00000E-02,20,structural,0.00000E+00,0.00000E+00
    2.00000E-02,50,nonstructural,0.00000E+00,0.00000E+00
    2.00000E-02,50,structural,0.00000E+00,0.00000E+00
    1.00000E-02,100,nonstructural,0.00000E+00,0.00000E+00
    1.00000E-02,100,structural,0.00000E+00,0.00000E+00
    5.00000E-03,200,nonstructural,1.35279E+05,1.26664E-06
    5.00000E-03,200,structural,2.36901E+05,9.02027E-03
    2.00000E-03,500,nonstructural,1.74918E+06,1.63779E-05
    2.00000E-03,500,structural,2.99670E+06,1.14103E-01
    1.00000E-03,1000,nonstructural,6.92401E+06,6.48308E-05
    1.00000E-03,1000,structural,1.15148E+07,4.38439E-01
    
If you do not set the ``aggregate_by`` parameter
you will still be able to compute the total loss curve
(for the entire portfolio of assets) and the total average losses.
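
Once exported, the loss curve CSVs can be conveniently inspected with pandas.
The snippet below is just a sketch: the file name is hypothetical and,
depending on the engine version, the exported file may start with comment
lines (handled here via the ``comment`` argument of ``read_csv``):

.. code-block:: python

   import pandas as pd

   # hypothetical file name: adjust it to the path of the exported CSV
   df = pd.read_csv('aggregate_loss_curves_statistics.csv', comment='#')

   # structural loss curve: one loss value per return period
   structural = df[df.loss_type == 'structural']
   print(structural[['return_period', 'loss_value']])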