Uli-Z/dataset-eucast-mic

EUCAST MIC Dataset & Scraper

This repository bundles a machine-readable export of the EUCAST MIC database plus the scripts required to regenerate it from the official source. EUCAST publishes a comprehensive MIC dataset, but only via interactive HTML tables, which hinders automated analysis. The tooling here downloads those pages, converts them into normalized CSVs, and documents every intermediate step. All scripts were built through agentic coding with GPT‑5.1‑Codex, using a multi-stage workflow (scrape → transform → validate) that first captured intermediate artifacts, then simplified the pipeline for ongoing use.

Contents

| Path | Description |
| --- | --- |
| species_eucast_raw.csv | Raw species list scraped from https://mic.eucast.org/search/ (EUCAST ID + display name). |
| species_eucast_with_amr.csv | Species list mapped to AMR mo codes, including automatically derived phenotypes. |
| eucast_mic_html/ | All downloaded MIC HTML pages (one per species ID). |
| mic_combinations.csv | Final species/antibiotic metadata table (see format below). |
| mic_values.csv | Final MIC distribution table (one row per dilution). |
| 01_fetch_species_amr_mapping.py | Scrapes the species dropdown, resolves AMR mo codes via the AMR R package (rpy2), and infers phenotypes from name differences. |
| 02_download_html.py | Downloads every MIC page with a fixed 0.5 s throttle and logs progress. |
| 03_convert_html_to_csv.py | Parses the HTML tables and emits the normalized CSV outputs. |
| 04_validate_outputs.py | Checks structural integrity (all dilutions present, counts sum up, ECOFF annotations match distribution counts). |
| 06_export_amr_hierarchies.py | Exports microorganism/antibiotic hierarchies (one row per species_id subtype for MOs). |
| 07_compute_ecoff_fractions.py | Computes ecoff_fraction (share ≤ ECOFF) per combination. |
| 08_add_species_ids_to_combinations.py | Adds species_id to mic_combinations.csv (join from species_eucast_with_amr.csv). |
| visualization/ | Standalone hierarchy filter + ECOFF matrix (see below). |
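
The throttled download step can be sketched as follows. This is a minimal illustration, not the code in 02_download_html.py: the function name `download_all`, the `<species_id>.html` file naming, and the skip-if-already-downloaded behaviour are assumptions made for the example. The HTTP call is passed in as `fetch`, because the URL pattern of the MIC pages is not stated here.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def download_all(
    species_ids: Iterable[str],
    fetch: Callable[[str], str],
    out_dir: Path,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Fetch one HTML page per species ID, pausing delay_s between requests.

    `fetch` is a caller-supplied function that returns the HTML for one
    species ID (hypothetical; wrap whatever HTTP client you use).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    made_request = False
    for species_id in species_ids:
        target = out_dir / f"{species_id}.html"  # assumed naming scheme
        if target.exists():
            continue  # already on disk: no request, no delay
        if made_request:
            sleep(delay_s)  # fixed throttle between consecutive requests
        html = fetch(species_id)
        made_request = True
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```

Injecting `fetch` and `sleep` keeps the loop testable without network access; in real use `sleep` stays at its default and `fetch` performs the actual request.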

Data Format

mic_combinations.csv

One row per species/antibiotic combination.

| Column | Description |
| --- | --- |
| combo_id | Technical primary key, referenced by mic_values.csv. |
| species_amr_mo | AMR mo code for the microorganism. |
| species_name | EUCAST display name. |
| phenotype | Automatically derived suffix (e.g. MRSA, beta-lactamase pos). |
| antibiotic_name | EUCAST antibiotic label. |
| antibiotic_amr_code | AMR ab code resolved from antibiotics.csv or AMR::as_ab. |
| distribution_count | Number of MIC distributions aggregated by EUCAST. |
| observation_count | Total isolates underlying the distribution. |
| ecoff_value | Parsed (T)ECOFF value (parentheses removed). |
| ecoff_annotation | Interpretation of the ECOFF marker (value, tentative_ecoff, forced_ecoff, less_than_three, invalid, missing). |
| confidence_lower / confidence_upper | Parsed bounds from the “Confidence interval” column. |
| ecoff_fraction | Share of isolates with MIC ≤ ECOFF (computed by 07_compute_ecoff_fractions.py). |
| species_id | Dataset-specific organism ID (subtype) joined from species_eucast_with_amr.csv. |

mic_values.csv

One row per combination and dilution.

| Column | Description |
| --- | --- |
| combo_id | Foreign key to mic_combinations.csv. |
| dilution_mg_l | MIC dilution value as shown in the EUCAST table header. |
| count | Number of isolates observed at that dilution. |
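
To show how the two tables fit together, here is a small sketch that joins them on combo_id and derives the share of isolates with MIC ≤ ECOFF. It is not the code of 07_compute_ecoff_fractions.py. The sample rows are invented, the function name `ecoff_fractions` is hypothetical, and two points are assumptions: that dilution_mg_l and ecoff_value parse as plain numbers, and that the denominator is the sum of the per-dilution counts rather than observation_count.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; only the columns needed for the join are included.
COMBINATIONS_CSV = """combo_id,species_name,antibiotic_name,ecoff_value
1,Example species,Example drug,0.5
2,Example species,Other drug,
"""

VALUES_CSV = """combo_id,dilution_mg_l,count
1,0.125,40
1,0.25,30
1,0.5,25
1,1,4
1,2,1
2,0.5,10
"""


def ecoff_fractions(combinations_file, values_file):
    """Return {combo_id: share of isolates with MIC <= ECOFF}."""
    # Group the per-dilution counts by combination (the foreign-key side).
    distributions = defaultdict(list)
    for row in csv.DictReader(values_file):
        distributions[row["combo_id"]].append(
            (float(row["dilution_mg_l"]), int(row["count"]))
        )

    fractions = {}
    for row in csv.DictReader(combinations_file):
        ecoff_text = (row["ecoff_value"] or "").strip()
        distribution = distributions.get(row["combo_id"], [])
        total = sum(count for _, count in distribution)
        if not ecoff_text or total == 0:
            continue  # no ECOFF or no isolates: nothing to compute
        ecoff = float(ecoff_text)
        at_or_below = sum(count for mic, count in distribution if mic <= ecoff)
        fractions[row["combo_id"]] = at_or_below / total
    return fractions


fractions = ecoff_fractions(io.StringIO(COMBINATIONS_CSV), io.StringIO(VALUES_CSV))
```

For the real files, pass open file handles for mic_combinations.csv and mic_values.csv instead of the in-memory samples.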

Running the Pipeline

python3 dataset-eucast-mic/01_fetch_species_amr_mapping.py
python3 dataset-eucast-mic/02_download_html.py
python3 dataset-eucast-mic/03_convert_html_to_csv.py
python3 dataset-eucast-mic/04_validate_outputs.py
# Optional downstream steps for visualization and stats
python3 dataset-eucast-mic/06_export_amr_hierarchies.py
python3 dataset-eucast-mic/07_compute_ecoff_fractions.py
python3 dataset-eucast-mic/08_add_species_ids_to_combinations.py
./visualization/build_filter_html.py

Each script persists its output and can be re-run independently. The validator should report no new errors; the known upstream discrepancies are listed below for awareness.

Visualization (interactive filter + ECOFF matrix)

We ship a standalone HTML-based explorer under visualization/. The build step produces a single HTML file that embeds the microorganism/antibiotic hierarchies and ECOFF fractions, so you can filter by organism/antibiotic and inspect an ECOFF matrix (share of isolates with MIC ≤ ECOFF).

Steps:

  1. Generate the hierarchies (amr_exports/*) and ECOFF fractions:
    python3 06_export_amr_hierarchies.py
    python3 07_compute_ecoff_fractions.py
    python3 08_add_species_ids_to_combinations.py
  2. Build the HTML:
    ./visualization/build_filter_html.py
    This produces visualization/filter.html (uses visualization/demo_filter.html as template).
  3. Open visualization/filter.html in your browser. Use the organism/antibiotic trees to filter; the matrix shows percent ≤ ECOFF (color-coded: red <95%, yellow 95–98%, green ≥99%).

Notes:

  • Organism filtering operates on dataset species_id (subtypes). Multiple subtypes with the same AMR code remain distinct; ECOFF fractions come from mic_combinations.csv (ecoff_fraction column).
  • The default percentiles are 10% (organisms) and 20% (antibiotics); adjust with the sliders as needed.

Notes & Disclaimer

  • The scripts download live content from mic.eucast.org; the fixed 0.5 s delay reflects good citizenship, but you should still follow EUCAST’s terms of use.
  • Species/antibiotic mappings rely on the AMR R package (as_mo, as_ab) and may require occasional manual review when the website changes.
  • ECOFF semantics (-, ( ), (( ))) follow EUCAST’s own descriptions; ensure downstream tools respect those meanings.
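
As an illustration of how the ECOFF markers could be turned into the ecoff_value / ecoff_annotation pair, here is a hedged sketch. The function `parse_ecoff` is hypothetical and is not taken from 03_convert_html_to_csv.py. The mapping is an assumption inferred from this README: "-" is treated as less_than_three (consistent with the validator messages), single parentheses as tentative_ecoff, and double parentheses as forced_ecoff. Check EUCAST's own legend before relying on it.

```python
import re
from typing import Optional, Tuple

# Optional opening parentheses, a plain decimal number, optional closing ones.
_ECOFF_PATTERN = re.compile(r"(\(*)\s*([0-9]*\.?[0-9]+)\s*(\)*)")

# Assumed mapping from parenthesis depth to annotation label.
_DEPTH_TO_ANNOTATION = ("value", "tentative_ecoff", "forced_ecoff")


def parse_ecoff(cell: str) -> Tuple[Optional[str], str]:
    """Return (ecoff_value, ecoff_annotation) for one raw table cell."""
    text = cell.strip()
    if not text:
        return None, "missing"
    if text == "-":
        return None, "less_than_three"  # assumption, see lead-in
    match = _ECOFF_PATTERN.fullmatch(text)
    if match is None:
        return None, "invalid"
    opening, value, closing = match.groups()
    # Unbalanced or unexpectedly deep parentheses are reported as invalid.
    if len(opening) != len(closing) or len(opening) > 2:
        return None, "invalid"
    return value, _DEPTH_TO_ANNOTATION[len(opening)]
```

The value is returned as text with the parentheses removed, so the caller decides how to convert it to a number.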

The validation script reports certain inconsistencies (e.g., MIC sums not matching observation_count, or - shown despite ≥3 distributions). These issues are already present in the original source and are not altered by this pipeline; treat the validator output as a heads-up for manual review. For reference, the current run surfaces the following upstream discrepancies:

 - Combo 71 (Bacteroides fragilis + Moxifloxacin): sum of MIC counts 2238 != observation_count 2237
 - Combo 301 (Enterococcus faecalis + Linezolid): sum of MIC counts 31415 != observation_count 31441
 - Combo 347 (Enterococcus faecium + Linezolid): sum of MIC counts 14392 != observation_count 14404
 - Combo 440 (Escherichia coli + Ciprofloxacin): sum of MIC counts 15813 != observation_count 15667
 - Combo 517 (Haemophilus influenzae + Moxifloxacin): sum of MIC counts 11365 != observation_count 15011
 - Combo 612 (Klebsiella pneumoniae + Ciprofloxacin): sum of MIC counts 3778 != observation_count 3788
 - Combo 717 (Moraxella catarrhalis + Moxifloxacin): sum of MIC counts 3835 != observation_count 4036
 - Combo 919 (Pseudomonas aeruginosa + Ciprofloxacin): sum of MIC counts 26990 != observation_count 26996
 - Combo 928 (Pseudomonas aeruginosa + Gatifloxacin): sum of MIC counts 6482 != observation_count 6465
 - Combo 936 (Pseudomonas aeruginosa + Moxifloxacin): sum of MIC counts 3065 != observation_count 5089
 - Combo 1087 (Staphylococcus aureus + Ciprofloxacin): sum of MIC counts 41721 != observation_count 41812
 - Combo 1101 (Staphylococcus aureus + Gatifloxacin): sum of MIC counts 2020 != observation_count 2021
 - Combo 1108 (Staphylococcus aureus + Linezolid): sum of MIC counts 66761 != observation_count 67705
 - Combo 1231 (Staphylococcus epidermidis + Moxifloxacin): sum of MIC counts 9776 != observation_count 10014
 - Combo 1332 (Staphylococcus saprophyticus + Ciprofloxacin): sum of MIC counts 742 != observation_count 739
 - Combo 1348 (Stenotrophomonas maltophilia + Ciprofloxacin): sum of MIC counts 2961 != observation_count 2962
 - Combo 1521 (Streptococcus pneumoniae + Benzylpenicillin): sum of MIC counts 15170 != observation_count 15161
 - Combo 1533 (Streptococcus pneumoniae + Ciprofloxacin): sum of MIC counts 73053 != observation_count 73054
 - Combo 1540 (Streptococcus pneumoniae + Erythromycin): sum of MIC counts 39854 != observation_count 39847
 - Combo 1541 (Streptococcus pneumoniae + Gatifloxacin): sum of MIC counts 14709 != observation_count 14704
 - Combo 1545 (Streptococcus pneumoniae + Linezolid): sum of MIC counts 60180 != observation_count 60207
 - Combo 1548 (Streptococcus pneumoniae + Moxifloxacin): sum of MIC counts 26858 != observation_count 27471
 - Combo 135 (Campylobacter jejuni + Sulfamethoxazole): ECOFF annotation less_than_three but distribution_count=5
 - Combo 344 (Enterococcus faecium + Lasalocid): ECOFF annotation less_than_three but distribution_count=5
 - Combo 349 (Enterococcus faecium + Monensin): ECOFF annotation less_than_three but distribution_count=5
 - Combo 355 (Enterococcus faecium + Salinomycin): ECOFF annotation less_than_three but distribution_count=13
 - Combo 656 (Lactobacillus rhamnosus + Streptomycin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 759 (Mycobacterium avium ATCC 700898 + Amikacin): ECOFF annotation less_than_three but distribution_count=4
 - Combo 760 (Mycobacterium avium ATCC 700898 + Clarithromycin): ECOFF annotation less_than_three but distribution_count=4
 - Combo 761 (Mycobacterium avium ATCC 700898 + Ethambutol): ECOFF annotation less_than_three but distribution_count=4
 - Combo 762 (Mycobacterium avium ATCC 700898 + Linezolid): ECOFF annotation less_than_three but distribution_count=4
 - Combo 763 (Mycobacterium avium ATCC 700898 + Moxifloxacin): ECOFF annotation less_than_three but distribution_count=4
 - Combo 764 (Mycobacterium avium ATCC 700898 + Rifabutin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 765 (Mycobacterium avium ATCC 700898 + Rifampicin): ECOFF annotation less_than_three but distribution_count=4
 - Combo 766 (Mycobacterium avium ATCC 700898 + Trimethoprim-sulfamethoxazole): ECOFF annotation less_than_three but distribution_count=4
 - Combo 924 (Pseudomonas aeruginosa + Enrofloxacin): ECOFF annotation less_than_three but distribution_count=4
 - Combo 1003 (Salmonella enterica + Enrofloxacin): ECOFF annotation less_than_three but distribution_count=6
 - Combo 1044 (Serratia marcescens + Cefiderocol): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1147 (Staphylococcus aureus ATCC 29213 + Vancomycin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1215 (Staphylococcus coagulase negative + Pirlimycin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1349 (Stenotrophomonas maltophilia + Colistin): ECOFF annotation less_than_three but distribution_count=5
 - Combo 1408 (Streptococcus anginosus + Dalbavancin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1418 (Streptococcus bovis + Lefamulin): ECOFF annotation less_than_three but distribution_count=4
 - Combo 1421 (Streptococcus canis + Enrofloxacin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1426 (Streptococcus canis + Pirlimycin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1434 (Streptococcus constellatus + Delafloxacin): ECOFF annotation less_than_three but distribution_count=3
 - Combo 1614 (Streptococcus suis + Enrofloxacin): ECOFF annotation less_than_three but distribution_count=5
 - Combo 1618 (Streptococcus suis + Pirlimycin): ECOFF annotation less_than_three but distribution_count=6
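
The two kinds of findings listed above can be reproduced with checks along the following lines. This is a sketch of the idea, not the code of 04_validate_outputs.py: the function name `check_combination` and the dictionary layout are assumptions, and the example combination is invented.

```python
from typing import Dict, List


def check_combination(combo: Dict, dilution_counts: List[int]) -> List[str]:
    """Return human-readable findings for one species/antibiotic combination.

    `combo` holds the mic_combinations.csv fields (already converted to
    int where numeric); `dilution_counts` are the matching `count` values
    from mic_values.csv.
    """
    findings = []
    label = "Combo {} ({} + {})".format(
        combo["combo_id"], combo["species_name"], combo["antibiotic_name"]
    )

    # Check 1: the per-dilution counts should add up to observation_count.
    total = sum(dilution_counts)
    if total != combo["observation_count"]:
        findings.append(
            "{}: sum of MIC counts {} != observation_count {}".format(
                label, total, combo["observation_count"]
            )
        )

    # Check 2: "less than three distributions" contradicts a count of 3 or more.
    if (
        combo["ecoff_annotation"] == "less_than_three"
        and combo["distribution_count"] >= 3
    ):
        findings.append(
            "{}: ECOFF annotation less_than_three but distribution_count={}".format(
                label, combo["distribution_count"]
            )
        )
    return findings


# Invented example that triggers both findings.
example = {
    "combo_id": 0,
    "species_name": "Example species",
    "antibiotic_name": "Example drug",
    "observation_count": 100,
    "distribution_count": 3,
    "ecoff_annotation": "less_than_three",
}
for finding in check_combination(example, [60, 39]):
    print(finding)
```

Returning the findings as a list instead of raising lets a caller collect every discrepancy in one pass and decide which ones are already known from upstream.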

Legal & Scientific Disclaimer

This project is provided strictly for didactic/demonstration purposes. The data and scripts do not support conclusions about real-world resistance prevalence or clinical decision-making. No guarantees are made regarding the accuracy, completeness, or suitability of the code, intermediate artifacts, or derived datasets. By using this repository, you acknowledge that you—and not the authors—bear all responsibility for validation, compliance, and downstream use. The authors disclaim any liability for direct or indirect consequences arising from its use.
