EM-TEST is a testing framework for EM-DAT public data, built on the pandas and pandera Python packages.
Important
This version of EM-TEST has been built for EM-DAT public data on 2024/08/26. Some tests might fail for versions prior to this date. EM-TEST is not suitable for EM-DAT versions prior to September 26, 2023.
The EM-DAT database is a long-running project that started in 1988. In the past, data were encoded manually, sometimes using free-text fields without constraints. The EM-TEST framework was initially developed to identify issues in the data prior to the redesign of the database. EM-TEST is to some extent redundant with many existing constraints in the EM-DAT database. So, why use EM-TEST? Here are five reasons:
- EM-DAT database constraints are invisible to end users, while EM-TEST is open source and transparently documents the implemented tests.
- EM-DAT changes over time based on projects and feedback. EM-TEST allows you to check whether two or more EM-DAT files are compatible.
- EM-TEST reports exceptions to existing standards, such as ISO3 codes, country names, and region names that were used in the past but are not included in today's reference.
- Some EM-TEST cases may not be implemented in the database constraints because they are better suited to validating a pandas DataFrame in Python. Even if EM-TEST is run periodically by the EM-DAT team, you might be interested in running it yourself to validate recent changes in the database.
- Need more rigorous testing, or need to test a former archive of EM-DAT? You can download EM-TEST and quickly customize the validation schema to fit your own needs.
EM-TEST was developed using Python 3.11 with the following dependencies:
openpyxl~=3.1
pandas~=2.2
pandera~=0.20
toml~=0.10
- Download the project or clone the repository to your local machine using Git
git clone https://github.com/em-dat/EM-TEST.git
- Navigate to the project's directory
- Create a Python virtual environment (or alternatives)
On macOS and Linux:
python3 -m venv env # Create virtual environment
source env/bin/activate # Activate virtual environment
On Windows:
python -m venv env # Create virtual environment
.\env\Scripts\activate # Activate virtual environment
- Install the project and its dependencies
pip install -r requirements.txt # Install the dependencies
python setup.py install # Install the project
Check out the subsequent sections to understand how to use the project.
First, import pandas and the emdat_schema defined in emtest using
pandera. EM-DAT data can be loaded and parsed into a pandas.DataFrame, then
validated using the emdat_schema.validate method.
import pandas as pd
from emtest import emdat_schema
emdat = pd.read_excel(
    PATH_TO_EMDAT_XLSX_FILE,  # Replace with your file
    index_col='DisNo.',
    parse_dates=['Entry Date', 'Last Update']
)
emdat_schema.validate(emdat)
See the "examples" folder of this repository.
The table below shows the type constraints applied by the implemented validation schema.
| Column | Type | Nullable | Unique |
|---|---|---|---|
| DisNo. | str | False | True |
| Historic | str | False | False |
| Classification Key | str | False | False |
| Disaster Group | str | False | False |
| Disaster Subgroup | str | False | False |
| Disaster Type | str | False | False |
| Disaster Subtype | str | False | False |
| External IDs | str | True | False |
| Event Name | str | True | False |
| ISO | str | False | False |
| Country | str | False | False |
| Subregion | str | False | False |
| Region | str | False | False |
| Location | str | True | False |
| Origin | str | True | False |
| Associated Types | str | True | False |
| OFDA/BHA Response | str | False | False |
| Appeal | str | False | False |
| Declaration | str | False | False |
| AID Contribution ('000 US$) | float | True | False |
| Magnitude | float | True | False |
| Magnitude Scale | str | True | False |
| Latitude | float | True | False |
| Longitude | float | True | False |
| River Basin | str | True | False |
| Start Year | int | False | False |
| Start Month | float[^1] | True | False |
| Start Day | float[^1] | True | False |
| End Year | int | False | False |
| End Month | float[^1] | True | False |
| End Day | float[^1] | True | False |
| Total Deaths | float[^1] | True | False |
| No. Injured | float[^1] | True | False |
| No. Affected | float[^1] | True | False |
| No. Homeless | float[^1] | True | False |
| Total Affected | float[^1] | True | False |
| Reconstruction Costs ('000 US$) | float | True | False |
| Reconstruction Costs, Adjusted ('000 US$) | float | True | False |
| Insured Damage ('000 US$) | float | True | False |
| Insured Damage, Adjusted ('000 US$) | float | True | False |
| Total Damage ('000 US$) | float | True | False |
| Total Damage, Adjusted ('000 US$) | float | True | False |
| CPI | float | True | False |
| Admin Units | str | True | False |
| Entry Date | Timestamp | False | False |
| Last Update | Timestamp | False | False |
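The float typing of integer-like but nullable columns (such as Start Month or Total Deaths) follows from how pandas stores missing values by default. A minimal illustration with made-up values:

```python
import pandas as pd

# Made-up month values with one missing entry.
start_month = pd.Series([1, None, 12], name="Start Month")

# The missing value is stored as NaN, which forces a float dtype.
print(start_month.dtype)

# The non-missing values are still whole numbers.
whole_numbers = (start_month.dropna() % 1 == 0).all()
print(whole_numbers)
```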
For each column in EM-DAT, pandera makes it possible to test and validate the column's content with Checks.
The current checks are listed in the table below.
Most tests are implemented as errors, meaning that they must pass for the EM-DAT dataset to be considered valid. Warnings, instead, flag values that are abnormal yet possible. Column checks can raise four warnings, on the 'ISO', 'Country', 'Start Year', and 'CPI' columns.
EM-DAT may contain ISO country codes or country names that are not listed in the currently used reference (see EM-DAT Documentation). EM-DAT has a few exceptions for overseas territories and historical countries not included in the current reference. This is to be expected given political changes throughout history; the implemented warning lets EM-TEST explicitly flag which of these cases are present in EM-DAT.
Regarding the warning that compares the Start Year to the year included in the
DisNo., check_disno_vs_start_year, both years should be identical. However,
the official start date may have been updated in the final reference used to
describe the disaster event. Such a change is more likely for slow-onset
disasters like droughts. For this reason, the test is implemented as a
warning that retrieves these cases, as well as potential errors in the
year definition.
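The comparison can be sketched as follows. This is not the EM-TEST implementation: it assumes that a DisNo. begins with a four-digit year (for example a 'YYYY-NNNN-ISO' layout), and the identifiers and years are made up.

```python
def disno_year(disno: str) -> int:
    """Return the leading four-digit year of a DisNo. (assumed layout)."""
    return int(disno[:4])


# Hypothetical DisNo. -> Start Year pairs.
records = {
    "2019-0123-BEL": 2019,  # consistent
    "2011-0456-KEN": 2010,  # start year revised after the DisNo. was assigned
}

# Keep only the records where the two years disagree.
mismatches = {
    disno: start_year
    for disno, start_year in records.items()
    if disno_year(disno) != start_year
}
print(mismatches)
```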
Finally, we test that the 'CPI' value lies in the range 0-110. Technically, the CPI only needs to be greater than 0. EM-DAT rescales the CPI values such that the last-year CPI is set to 100, which makes values above 100 very unlikely because they would correspond to a year of deflation (see EM-DAT Documentation).
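To see why rescaled values rarely exceed 100, consider a simple proportional rescaling. The index values below are hypothetical, and the exact procedure used by EM-DAT is an assumption here.

```python
# Hypothetical raw price-index values per year (not real CPI figures).
raw_cpi = {2021: 270.0, 2022: 292.0, 2023: 304.0}

# Assumption: proportional rescaling so that the most recent year equals 100.
last_year = max(raw_cpi)
rescaled_cpi = {
    year: 100.0 * value / raw_cpi[last_year]
    for year, value in raw_cpi.items()
}

for year, value in sorted(rescaled_cpi.items()):
    print(year, round(value, 1))
```

As long as prices rise over time, every earlier year ends up below 100. A value above 100 would require prices in that year to be higher than in the last year.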
| Column | Test Name | Test Description | Test Type |
|---|---|---|---|
| DisNo. | check_disno | Validate value using regular expression | Error |
| Historic | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| Classification Key | check_classification_key | Test whether value is in the reference list | Error |
| Disaster Group | check_group | Test whether value is in the reference list | Error |
| Disaster Subgroup | check_subgroup | Test whether value is in the reference list | Error |
| Disaster Type | check_type | Test whether value is in the reference list | Error |
| Disaster Subtype | check_subtype | Test whether value is in the reference list | Error |
| External IDs | validate_external_id | Validate values using regular expressions | Error |
| Event Name | - | - | - |
| ISO | validate_iso3_code | Validate values using regular expressions | Error |
| | check_iso3_code | Test whether value is in the reference list | Warning |
| Country | check_country | Test whether value is in the reference list | Warning |
| Subregion | check_subregion | Test whether value is in the reference list | Error |
| Region | check_region | Test whether value is in the reference list | Error |
| Location | - | - | - |
| Origin | - | - | - |
| Associated Types | - | - | - |
| OFDA/BHA Response | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| Appeal | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| Declaration | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| AID Contribution ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Magnitude | - | - | - |
| Magnitude Scale | check_magnitude_unit | Test whether value is in the reference list | Error |
| Latitude | in_range(-90., 90.) | Test whether value is within range -90 to 90 | Error |
| Longitude | in_range(-180., 180.) | Test whether value is within range -180 to 180 | Error |
| River Basin | - | - | - |
| Start Year | in_range(1900, CURRENT_YEAR) | Test whether value is within range 1900-{CURRENT_YEAR} | Error |
| | check_disno_vs_start_year[^2] | Test that start year is the same as in DisNo. | Warning |
| Start Month | check_month | Test whether value is a valid month number (1-12) | Error |
| Start Day | check_day | Test whether value is a valid day number (1-31) | Error |
| End Year | in_range(1900, CURRENT_YEAR) | Test whether value is within range 1900-{CURRENT_YEAR} | Error |
| End Month | check_month | Test whether value is a valid month number (1-12) | Error |
| End Day | check_day | Test whether value is a valid day number (1-31) | Error |
| Total Deaths | greater_than(0.) | Test whether value is greater than 0 | Error |
| No. Injured | greater_than(0.) | Test whether value is greater than 0 | Error |
| No. Affected | greater_than(0.) | Test whether value is greater than 0 | Error |
| No. Homeless | greater_than(0.) | Test whether value is greater than 0 | Error |
| Total Affected | greater_than(0.) | Test whether value is greater than 0 | Error |
| Reconstruction Costs ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Reconstruction Costs, Adjusted ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Insured Damage ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Insured Damage, Adjusted ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Total Damage ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Total Damage, Adjusted ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| CPI | in_range(0., 110.) | Test whether value is within range 0-110. | Warning |
| Admin Units | is_valid_json | Test whether value is a json string | Error |
| Entry Date | in_range(1988/1/1, CURRENT_DATE) | Test whether value is within valid date range | Error |
| Last Update | in_range(1988/1/1, CURRENT_DATE) | Test whether value is within valid date range | Error |
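Several of the custom checks above reduce to small predicate functions. As an illustration, a JSON validity test for a column such as Admin Units could look like the sketch below. It reuses the name `is_valid_json` from the table but is not the EM-TEST implementation, and the sample strings are made up.

```python
import json


def is_valid_json(value: str) -> bool:
    """Return True if value parses as JSON, False otherwise."""
    try:
        json.loads(value)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False
    return True


# Made-up samples: the second uses single quotes, which JSON does not allow.
print(is_valid_json('[{"adm1_code": 1, "adm1_name": "Example"}]'))
print(is_valid_json("[{'adm1_code': 1}]"))
```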
The pandera package enables
defining Wide Checks
at the DataFrame level, allowing multi-column checks. The currently implemented
multi-column checks are listed below.
| Columns | Test Name | Test Description | Test Type |
|---|---|---|---|
| Latitude, Longitude | check_both_lat_lon_coordinates | Test whether latitude and longitude coordinates are either both defined or undefined | Error |
| Start Month, Start Day | check_no_start_day_if_no_month | Test that Start Day is not set when Start Month is missing | Error |
| End Month, End Day | check_no_end_day_if_no_month | Test that End Day is not set when End Month is missing | Error |
| Start Year, End Year | check_start_end_year_consistency | Test whether start year is prior or equal to end year | Error |
| Start Year, Start Month, End Year, End Month | check_start_end_month_consistency | Test whether start date is prior or equal to end date at the month resolution | Error |
| Start Year, Start Month, Start Day, End Year, End Month, End Day | check_start_end_day_consistency | Test whether start date is prior or equal to end date at the day resolution | Error |
| Disaster Subtype, Magnitude | check_coldwave_magnitude | Test whether cold wave magnitude is in realistic range (<=10°C) | Error |
| Disaster Type, Magnitude | check_earthquake_magnitude | Test whether earthquake magnitude is in realistic range (3 to 10) | Error |
| Disaster Subtype, Magnitude | check_heatwave_magnitude | Test whether heatwave magnitude is in realistic range (>=25°C) | Error |
| Disaster Type, Magnitude | check_other_magnitude | Test whether disasters other than earthquakes, cold waves, and heat waves have a magnitude above zero | Error |
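The row-wise logic behind two of these multi-column checks can be sketched with plain pandas. The function names are hypothetical and this is not the EM-TEST code. Treating rows with a missing month as passing is an assumption. A function of this shape, taking a DataFrame and returning a boolean Series, is what a pandera DataFrame-level `Check` expects.

```python
import pandas as pd


def both_or_neither_coordinates(df: pd.DataFrame) -> pd.Series:
    """True where Latitude and Longitude are both set or both missing."""
    return df["Latitude"].isna() == df["Longitude"].isna()


def start_not_after_end_month(df: pd.DataFrame) -> pd.Series:
    """True where the start (year, month) is not later than the end one."""
    start = df["Start Year"] * 12 + df["Start Month"]
    end = df["End Year"] * 12 + df["End Month"]
    undetermined = start.isna() | end.isna()  # assumption: a missing month passes
    return undetermined | (start <= end)


# Made-up rows.
toy = pd.DataFrame({
    "Latitude": [50.8, None, 12.3],
    "Longitude": [4.4, None, None],
    "Start Year": [2020, 2021, 2019],
    "Start Month": [5, 9, None],
    "End Year": [2020, 2021, 2019],
    "End Month": [7, 3, 2],
})

coordinates_ok = both_or_neither_coordinates(toy).tolist()
months_ok = start_not_after_end_month(toy).tolist()
print(coordinates_ok)
print(months_ok)
```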
If you notice an anomaly in the EM-DAT public data that could be prevented using the EM-TEST framework, we encourage you to submit an issue using the GitHub issue tracker. We will consider adding new tests to update the framework.
If you use EM-TEST in your research and activities, you may use the
citation below or the citation metadata file CITATION.cff.
@software{delforge_2024_14275790,
author = {Delforge, Damien and
Wathelet, Valentin},
title = {EM-TEST: A Testing Framework for the EM-DAT Data},
month = dec,
year = 2024,
publisher = {Zenodo},
version = {2024.12.0},
doi = {10.5281/zenodo.14275790},
url = {https://doi.org/10.5281/zenodo.14275790}
}
Or
Delforge, D., & Wathelet, V. (2024). EM-TEST: A Testing Framework for the EM-DAT Data (2024.12.0). Zenodo. https://doi.org/10.5281/zenodo.14275790
- EM-DAT Project Website
- EM-DAT Project Documentation
- EM-DAT Data Download Portal
- Pandera Documentation: the Open-source Framework for Precision Data Testing
Footnotes
[^1]: The default integer dtype used by pandas cannot hold missing values; therefore, these nullable columns are typed as float numbers.
[^2]: Comparing Start Year to DisNo. is not considered a multi-column check because DisNo. is the index of the Start Year column Series in pandera. Hence, the test is not implemented at the DataFrame level.