EM-TEST

EM-TEST is a testing framework for EM-DAT public data, built on the pandas and pandera Python packages.

Important

This version of EM-TEST was built against the EM-DAT public data as of 2024/08/26. Some tests may fail on versions released before that date. EM-TEST is not suitable for EM-DAT versions prior to September 26, 2023.

Why Use EM-TEST?

The EM-DAT database is a long-running project that started in 1988. In the past, data were sometimes encoded manually using unconstrained free-text fields. The EM-TEST framework was initially developed to identify issues in the data prior to the redesign of the database. EM-TEST is, to some extent, redundant with many existing constraints in the EM-DAT database. So, why use EM-TEST? Here are five reasons:

  1. EM-DAT database constraints are invisible to end users, while EM-TEST is open source and shows transparently which tests are implemented.
  2. EM-DAT changes over time based on projects and feedback. EM-TEST lets you check whether two or more EM-DAT files are compatible.
  3. EM-TEST reports exceptions to existing standards, such as ISO3 codes, country names, and region names that were used in the past but are not included in today's reference.
  4. Some EM-TEST cases may not be implemented in the database constraints because they are better suited to validating a pandas DataFrame in Python. Even if the EM-DAT team runs EM-TEST periodically, you may want to run it yourself to validate recent changes in the database.
  5. Do you need more rigorous testing, or need to test a former archive of EM-DAT? You can download EM-TEST and quickly customize the validation schema to fit your own needs.

How to Use?

Prerequisites

EM-TEST was developed using Python 3.11 with the following dependencies:

openpyxl~=3.1
pandas~=2.2
pandera~=0.20
toml~=0.10

Installation

  1. Download the project or clone the repository to your local machine using Git:

         git clone https://github.com/em-dat/EM-TEST.git

  2. Navigate to the project's directory.
  3. Create a Python virtual environment (or an alternative).

     On macOS and Linux:

         python3 -m venv env       # Create virtual environment
         source env/bin/activate   # Activate virtual environment

     On Windows:

         python -m venv env            # Create virtual environment
         .\env\Scripts\activate        # Activate virtual environment

  4. Install the project and its dependencies:

         pip install -r requirements.txt    # Install the dependencies
         python setup.py install            # Install the project

Check out the subsequent sections to understand how to use the project.

Validate EM-DAT Content

First, import pandas and emdat_schema, the pandera schema defined in emtest. EM-DAT data can then be loaded and parsed into a pandas.DataFrame and validated using the emdat_schema.validate method.

import pandas as pd
from emtest import emdat_schema

emdat = pd.read_excel(
    PATH_TO_EMDAT_XLSX_FILE,  # Replace with your file
    index_col='DisNo.',
    parse_dates=['Entry Date', 'Last Update']
)
emdat_schema.validate(emdat)

See the "examples" folder of this repository.

EM-DAT Validation Schema

Data Type Validation

The table below shows the type constraints applied by the implemented validation schema.

| Column | Type | Nullable | Unique |
| --- | --- | --- | --- |
| DisNo. | str | False | True |
| Historic | str | False | False |
| Classification Key | str | False | False |
| Disaster Group | str | False | False |
| Disaster Subgroup | str | False | False |
| Disaster Type | str | False | False |
| Disaster Subtype | str | False | False |
| External IDs | str | True | False |
| Event Name | str | True | False |
| ISO | str | False | False |
| Country | str | False | False |
| Subregion | str | False | False |
| Region | str | False | False |
| Location | str | True | False |
| Origin | str | True | False |
| Associated Types | str | True | False |
| OFDA/BHA Response | str | False | False |
| Appeal | str | False | False |
| Declaration | str | False | False |
| AID Contribution ('000 US$) | float | True | False |
| Magnitude | float | True | False |
| Magnitude Scale | str | True | False |
| Latitude | float | True | False |
| Longitude | float | True | False |
| River Basin | str | True | False |
| Start Year | int | False | False |
| Start Month | float [1] | True | False |
| Start Day | float [1] | True | False |
| End Year | int | False | False |
| End Month | float [1] | True | False |
| End Day | float [1] | True | False |
| Total Deaths | float [1] | True | False |
| No. Injured | float [1] | True | False |
| No. Affected | float [1] | True | False |
| No. Homeless | float [1] | True | False |
| Total Affected | float [1] | True | False |
| Reconstruction Costs ('000 US$) | float | True | False |
| Reconstruction Costs, Adjusted ('000 US$) | float | True | False |
| Insured Damage ('000 US$) | float | True | False |
| Insured Damage, Adjusted ('000 US$) | float | True | False |
| Total Damage ('000 US$) | float | True | False |
| Total Damage, Adjusted ('000 US$) | float | True | False |
| CPI | float | True | False |
| Admin Units | str | True | False |
| Entry Date | Timestamp | False | False |
| Last Update | Timestamp | False | False |
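The float typing of nullable whole-number columns such as Start Month follows from how pandas stores missing values. The short sketch below, on made-up values, shows the dtype pandas infers with and without a gap. It also shows the nullable "Int64" extension dtype, mentioned here only as a pandas alternative and not as something EM-TEST uses.

```python
import pandas as pd

# A complete month column keeps an integer dtype...
complete = pd.Series([1, 12, 7])
# ...but a single missing value makes pandas fall back to float64.
with_gap = pd.Series([1, None, 7])
# pandas' nullable "Int64" extension dtype can hold the gap as <NA>.
nullable = pd.Series([1, None, 7], dtype="Int64")

print(complete.dtype, with_gap.dtype, nullable.dtype)
```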

Column Checks

For each column in EM-DAT, pandera makes it possible to test and validate the column's content with Checks. The current checks are listed in the table below.

Most tests are implemented as errors, meaning that they must pass for the EM-DAT dataset to be considered valid. Warnings, by contrast, flag a value that is abnormal yet possible. Column checks have four possible warnings, on the 'ISO', 'Country', 'Start Year', and 'CPI' columns.

EM-DAT may contain ISO country codes or country names that are not listed in the currently used reference (see the EM-DAT Documentation). These are a few exceptions for overseas territories and historical countries not included in the current reference. This is to be expected given political changes throughout history, but the implemented warnings let EM-TEST flag explicitly which of these cases are present in EM-DAT.
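The idea behind such a reference-list warning can be sketched with plain pandas. The reference set CURRENT_ISO3 and the helper codes_outside_reference below are hypothetical names, and the reference is heavily truncated for illustration. This is not EM-TEST's implementation of check_iso3_code.

```python
import pandas as pd

# Hypothetical, heavily truncated reference list of current ISO3 codes.
CURRENT_ISO3 = {"BEL", "FRA", "RUS", "UKR"}


def codes_outside_reference(iso: pd.Series, reference: set) -> pd.Series:
    """Return the entries whose code is absent from the reference list."""
    return iso[~iso.isin(reference)]


# Made-up records; "SUN" is the withdrawn code of the former Soviet Union.
iso = pd.Series(["BEL", "SUN", "FRA"], index=["A", "B", "C"], name="ISO")
flagged = codes_outside_reference(iso, CURRENT_ISO3)
print(flagged.to_dict())  # {'B': 'SUN'}
```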

The check_disno_vs_start_year warning compares the Start Year to the year included in the DisNo. Both years should be identical. However, the official start date may have been updated in the final reference used to describe the disaster event. Such a change is more likely for slow-onset disasters like droughts. For this reason, the test is implemented as a warning that retrieves these cases, as well as potential errors in the year definition.
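A minimal sketch of this comparison is shown below. It assumes that the DataFrame index holds the DisNo. and that each identifier begins with a four-digit year, as in the made-up "2015-0001-XXX". The helper name disno_year_mismatch is hypothetical and the code is not taken from EM-TEST.

```python
import pandas as pd


def disno_year_mismatch(df: pd.DataFrame) -> pd.Series:
    """Flag rows whose Start Year differs from the year prefix of the index.

    Assumes the index holds identifiers that begin with a four-digit year.
    """
    disno_year = df.index.str[:4].astype(int)
    return pd.Series(disno_year != df["Start Year"].to_numpy(), index=df.index)


# Made-up records: the second event is filed under 2015 but starts in 2016.
toy = pd.DataFrame(
    {"Start Year": [2015, 2016]},
    index=pd.Index(["2015-0001-XXX", "2015-0002-XXX"], name="DisNo."),
)
mismatch = disno_year_mismatch(toy)
print(toy[mismatch])
```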

Finally, we test whether the 'CPI' value is in the range 0-110. Technically, the CPI only needs to be greater than 0. EM-DAT rescales the CPI values such that the last year's CPI is set to 100, which makes values above 100 very unlikely, because they would refer to a year of deflation (see the EM-DAT Documentation).

| Column | Test Name | Test Description | Test Type |
| --- | --- | --- | --- |
| DisNo. | check_disno | Validate value using regular expression | Error |
| Historic | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| Classification Key | check_classification_key | Test whether value is in the reference list | Error |
| Disaster Group | check_group | Test whether value is in the reference list | Error |
| Disaster Subgroup | check_subgroup | Test whether value is in the reference list | Error |
| Disaster Type | check_type | Test whether value is in the reference list | Error |
| Disaster Subtype | check_subtype | Test whether value is in the reference list | Error |
| External IDs | validate_external_id | Validate values using regular expressions | Error |
| Event Name | - | - | - |
| ISO | validate_iso3_code | Validate values using regular expressions | Error |
| | check_iso3_code | Test whether value is in the reference list | Warning |
| Country | check_country | Test whether value is in the reference list | Warning |
| Subregion | check_subregion | Test whether value is in the reference list | Error |
| Region | check_region | Test whether value is in the reference list | Error |
| Location | - | - | - |
| Origin | - | - | - |
| Associated Types | - | - | - |
| OFDA/BHA Response | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| Appeal | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| Declaration | check_yes_no | Test whether value is either 'Yes' or 'No' | Error |
| AID Contribution ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Magnitude | - | - | - |
| Magnitude Scale | check_magnitude_unit | Test whether value is in the reference list | Error |
| Latitude | in_range(-90., 90.) | Test whether value is within range -90 to 90 | Error |
| Longitude | in_range(-180., 180.) | Test whether value is within range -180 to 180 | Error |
| River Basin | - | - | - |
| Start Year | in_range(1900, CURRENT_YEAR) | Test whether value is within range 1900-{CURRENT_YEAR} | Error |
| | check_disno_vs_start_year [2] | Test that start year is the same as in DisNo. | Warning |
| Start Month | check_month | Test whether value is a valid month number (1-12) | Error |
| Start Day | check_day | Test whether value is a valid day number (1-31) | Error |
| End Year | in_range(1900, CURRENT_YEAR) | Test whether value is within range 1900-{CURRENT_YEAR} | Error |
| End Month | check_month | Test whether value is a valid month number (1-12) | Error |
| End Day | check_day | Test whether value is a valid day number (1-31) | Error |
| Total Deaths | greater_than(0.) | Test whether value is greater than 0 | Error |
| No. Injured | greater_than(0.) | Test whether value is greater than 0 | Error |
| No. Affected | greater_than(0.) | Test whether value is greater than 0 | Error |
| No. Homeless | greater_than(0.) | Test whether value is greater than 0 | Error |
| Total Affected | greater_than(0.) | Test whether value is greater than 0 | Error |
| Reconstruction Costs ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Reconstruction Costs, Adjusted ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Insured Damage ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Insured Damage, Adjusted ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Total Damage ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| Total Damage, Adjusted ('000 US$) | greater_than(0.) | Test whether value is greater than 0 | Error |
| CPI | in_range(0., 110.) | Test whether value is within range 0-110 | Warning |
| Admin Units | is_valid_json | Test whether value is a JSON string | Error |
| Entry Date | in_range(1988/1/1, CURRENT_DATE) | Test whether value is within valid date range | Error |
| Last Update | in_range(1988/1/1, CURRENT_DATE) | Test whether value is within valid date range | Error |
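As an illustration of how a custom element-level test such as is_valid_json might work, the sketch below relies only on the standard library. It is an assumption about the general approach, not the code shipped in EM-TEST, and the sample strings are made up.

```python
import json


def is_valid_json(value) -> bool:
    """Return True if `value` parses as JSON (a sketch, not EM-TEST's code)."""
    try:
        json.loads(value)
    except (TypeError, ValueError):
        # ValueError covers json.JSONDecodeError; TypeError covers
        # non-string inputs such as a missing value stored as float NaN.
        return False
    return True


print(is_valid_json('[{"adm1_code": 1, "adm1_name": "Somewhere"}]'))
print(is_valid_json("[{'adm1_code': 1}]"))  # single quotes are not valid JSON
```

A function of this shape could be wrapped in a pandera Check with element_wise=True to apply it to each non-null value of a column.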

Multi-column Checks

The pandera package enables defining Wide Checks at the DataFrame level, allowing multi-column checks. The currently implemented multi-column checks are listed below.

| Columns | Test Name | Test Description | Test Type |
| --- | --- | --- | --- |
| Latitude, Longitude | check_both_lat_lon_coordinates | Test whether latitude and longitude coordinates are either both defined or both undefined | Error |
| Start Month, Start Day | check_no_start_day_if_no_month | Test that Start Day is not set when Start Month is missing | Error |
| End Month, End Day | check_no_end_day_if_no_month | Test that End Day is not set when End Month is missing | Error |
| Start Year, End Year | check_start_end_year_consistency | Test whether start year is prior or equal to end year | Error |
| Start Year, Start Month, End Year, End Month | check_start_end_month_consistency | Test whether start date is prior or equal to end date at the month resolution | Error |
| Start Year, Start Month, Start Day, End Year, End Month, End Day | check_start_end_day_consistency | Test whether start date is prior or equal to end date at the day resolution | Error |
| Disaster Subtype, Magnitude | check_coldwave_magnitude | Test whether cold wave magnitude is in realistic range (<=10°C) | Error |
| Disaster Type, Magnitude | check_earthquake_magnitude | Test whether earthquake magnitude is in realistic range (3 to 10) | Error |
| Disaster Subtype, Magnitude | check_heatwave_magnitude | Test whether heatwave magnitude is in realistic range (>=25°C) | Error |
| Disaster Type, Magnitude | check_other_magnitude | Test whether disasters other than earthquakes, cold waves, and heat waves have a magnitude above zero | Error |
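The logic of two of these multi-column checks can be sketched as plain pandas functions that return one boolean per row. The helper names both_or_neither and start_not_after_end_month are hypothetical, the data are made up, and the handling of missing months (treated as passing) is an assumption rather than EM-TEST's documented behaviour.

```python
import pandas as pd


def both_or_neither(df: pd.DataFrame, a: str, b: str) -> pd.Series:
    """True where columns `a` and `b` are both set or both missing."""
    return df[a].isna() == df[b].isna()


def start_not_after_end_month(df: pd.DataFrame) -> pd.Series:
    """True where (Start Year, Start Month) <= (End Year, End Month).

    Rows with a missing month are not tested at this resolution and pass.
    """
    start = df["Start Year"] * 12 + df["Start Month"]
    end = df["End Year"] * 12 + df["End Month"]
    return (start <= end) | start.isna() | end.isna()


toy = pd.DataFrame({
    "Latitude": [50.8, None, 12.0],
    "Longitude": [4.4, None, None],
    "Start Year": [2020, 2021, 2021],
    "Start Month": [12.0, 3.0, None],
    "End Year": [2021, 2021, 2021],
    "End Month": [1.0, 2.0, 5.0],
})

coords_ok = both_or_neither(toy, "Latitude", "Longitude")
months_ok = start_not_after_end_month(toy)
print(pd.DataFrame({"coords_ok": coords_ok, "months_ok": months_ok}))
```

Functions like these take the whole DataFrame as input, which is the shape pandera expects for a Check passed through the checks argument of a DataFrameSchema.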

How to Contribute?

If you notice an anomaly in the EM-DAT public data that could be prevented using the EM-TEST framework, we encourage you to submit an issue using the GitHub issue tracker. We will consider adding new tests to update the framework.

How to Cite?

If you use EM-TEST in your research and activities, you may use the citation below or the citation metadata file CITATION.cff.

@software{delforge_2024_14275790,
  author       = {Delforge, Damien and
                  Wathelet, Valentin},
  title        = {EM-TEST: A Testing Framework for the EM-DAT Data},
  month        = dec,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {2024.12.0},
  doi          = {10.5281/zenodo.14275790},
  url          = {https://doi.org/10.5281/zenodo.14275790}
}

Or

Delforge, D., & Wathelet, V. (2024). EM-TEST: A Testing Framework for the EM-DAT Data (2024.12.0). Zenodo. https://doi.org/10.5281/zenodo.14275790

Footnotes

  1. The default integer dtype in pandas cannot represent missing values; therefore, these nullable columns are typed as float numbers.

  2. Comparing Start Year to DisNo. is not considered a multi-column check because DisNo. is the index of the column Series in pandera. Hence, the test is not implemented at the DataFrame level.
