Releases · tatonetti-lab/onsides

22 Apr 22:35

ntatonetti

v3.1.1

50b0319

OnSIDES v3.1.1

Updated April 2026 with the latest drug labels from all four regulatory sources.

Summary

Type	v3.1.0	v3.1.1
Products	42,268	41,119
Adverse effects	5,472,613*	6,928,666
Ingredients	1,955	1,866

* The v3.1.0 release notes reported 5.4M adverse effects (the count after thresholding in the database), but the shipped CSV file actually contained 28.1 million rows because the confidence threshold (pred1 > 3.258) was not applied before export. This means v3.1.0 included ~20.5M low-confidence predictions that should have been filtered out. v3.1.1 correctly applies the threshold before export.

Labels processed

Source	Labels parsed	Labels in database	Adverse effects
US (DailyMed)	51,026	35,340	6,540,259
UK (EMC)	7,578	2,874	81,978
EU (EMA)	1,782	1,021	76,869
JP (KEGG)	8,924	1,884	229,560

Japan labels are scored by string matching only (no BERT model). Of ~20,000 JP labels downloaded, 8,924 contained a side-effects section; the remainder are supplements, diagnostics, and other products without listed adverse reactions.

Data quality fix: confidence thresholding

The most significant change in this release is that the BERT confidence threshold is now correctly applied. In v3.1.0, the product_adverse_effect table was exported with all 28.1M string match candidates, including ~20.5M low-confidence matches (pred1 ≤ 3.258) that the model flagged as unlikely true adverse effects. In v3.1.1, these are properly filtered, leaving 6.9M high-confidence adverse effect associations. Users of v3.1.0 data who did not filter on pred1 themselves may have been working with a high false-positive rate.
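For anyone still holding the v3.1.0 CSV, the missing filter can be approximated after the fact. The following is a minimal sketch, assuming the exported file has a numeric pred1 column as described above; the other column names and the sample rows are hypothetical, and the notes do not say how rows without a BERT score (for example the string-matched Japanese labels) are stored, so this sketch does not handle them.

```python
import csv
import io

# Threshold quoted in the release notes: rows with pred1 > 3.258 are kept.
PRED1_THRESHOLD = 3.258


def filter_high_confidence(rows, threshold=PRED1_THRESHOLD):
    """Yield only rows whose pred1 score is strictly above the threshold."""
    for row in rows:
        if float(row["pred1"]) > threshold:
            yield row


# Tiny stand-in for the exported table. Only pred1 is named in the
# release notes; the other columns are hypothetical.
sample_csv = io.StringIO(
    "product_label_id,effect_meddra_id,pred1\n"
    "1,10019211,4.912\n"
    "1,10028813,3.258\n"
    "2,10012735,-1.204\n"
)

kept = list(filter_high_confidence(csv.DictReader(sample_csv)))
print(len(kept))
```

The comparison is strict, so a row scoring exactly 3.258 is dropped, matching the "pred1 ≤ 3.258" description of the low-confidence set.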

Pipeline fixes

This release includes several fixes to the data pipeline discovered during the update process:

Japanese filename handling: Reduced the filename byte-length limit from 240 to 220 bytes to prevent ENAMETOOLONG errors when the parse step appends derived suffixes (e.g. .side_effects.table.NN.csv) to Japanese multi-byte filenames.
Japanese labels without side effects: Labels lacking a side-effects section (par-11) now produce an empty output file instead of crashing, allowing the pipeline to continue gracefully.
MRCONSO.RRF encoding: Added ignore_errors = true to DuckDB's CSV reader for the UMLS MRCONSO file, which contains some non-UTF-8 bytes in non-English rows (e.g. Swedish). The skipped rows do not affect English or Japanese MedDRA vocabulary extraction.
MRCONSO.RRF filename case: Fixed all references from lowercase mrconso.rrf to uppercase MRCONSO.RRF to match the actual file distributed by UMLS.
MedDRA vocabulary deduplication: Fixed a UNIQUE constraint violation in the MedDRA vocab table by using ROW_NUMBER() to select one entry per meddra_id when the same ID maps to multiple name/type combinations across OMOP and MRCONSO sources.
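The filename fix depends on counting bytes rather than characters, since each Japanese character usually takes three bytes in UTF-8. The sketch below illustrates the idea only and is not the pipeline's code: the function name truncate_utf8 and the sample stem are hypothetical, and the 255-byte ceiling is the per-filename limit on common Linux filesystems.

```python
def truncate_utf8(name: str, max_bytes: int = 220) -> str:
    """Shorten name so its UTF-8 encoding fits in max_bytes.

    The cut never lands inside a multi-byte character.
    """
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # Slicing bytes can leave a partial character at the end;
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical stem: 100 characters, 300 bytes in UTF-8.
stem = "錠" * 100
short = truncate_utf8(stem)

# A derived suffix of the kind mentioned in the release notes.
suffix = ".side_effects.table.01.csv"
full = short + suffix

print(len(short.encode("utf-8")), len(full.encode("utf-8")))
```

With a 220-byte stem limit, a 26-byte suffix like the one above still leaves the full name under 255 bytes, whereas a 240-byte stem would not.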

New documentation

Added AI-assisted update guides for running future data releases with an AI coding agent (e.g. Claude Code):
- HUMAN_UPDATE_GUIDE.md — operator-facing guide
- AGENT_UPDATE_GUIDE.md — technical reference for the AI agent

08 May 17:51

zietzm

v3.1.0

ccd893d

OnSIDES v3.1.0

Updated May 2025.

Fixes a major error in the vocab_rxnorm_* and high_confidence tables.
Adds a CLI for convenience when releasing (build-zip --version v3.1.0).

Summary

Because we prune the database for orphans, this update reduces the number of items in the database. However, we believe that this update is significantly more trustworthy than the previous one.

Type	v3.0.0	v3.1.0
Products	51,460	42,268
Adverse effects	7,134,660	5,472,613
Ingredients	10,794	1,955

Explanation

In v3.0.0, products were mapped to the wrong ingredients. From v3.0.0 onwards, OnSIDES uses products as the main unit. Products have labels with side effects, and products are mapped to RxNorm. Products are then mapped to their respective ingredients using relationships from RxNorm. These mappings are also used for the "high confidence" set.

Previously (v3.0.0), I did the product-to-ingredient mapping using the OMOP CONCEPT and CONCEPT_RELATIONSHIP tables, without conditioning on the types of edges or intermediate types, just on path length. This was wrong. In the new version (v3.1.0), we use the RxNorm default paths (see https://lhncbc.nlm.nih.gov/RxNav/applications/RxNavViews.html for details and rxnorm_ingredients.py for implementation). This adds complexity but fixes the mapping. As an example, to go from "branded dose form" to ingredient, the correct path is SBDF => SCDF => IN, or "branded dose form" -> "clinical dose form" -> "ingredient".
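To make the difference concrete, here is a toy sketch contrasting a walk limited only by path length with a walk that follows a fixed sequence of term types. This is not the code in rxnorm_ingredients.py: the graph, the concept IDs, and both function names are hypothetical, and only the SBDF => SCDF => IN path comes from the text above.

```python
# Toy relationship graph: source -> list of (target, target term type).
# All concept IDs are made up for illustration.
EDGES = {
    "sbdf:1": [("scdf:1", "SCDF"), ("bn:1", "BN")],
    "scdf:1": [("in:1", "IN"), ("df:1", "DF")],
    "bn:1": [("in:1", "IN"), ("in:2", "IN")],
}


def ingredients_within(start, max_hops):
    """Collect every IN concept reachable within max_hops, ignoring edge types."""
    found, frontier = set(), {start}
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for target, term_type in EDGES.get(node, []):
                if term_type == "IN":
                    found.add(target)
                next_frontier.add(target)
        frontier = next_frontier
    return found


def follow_path(start, type_path):
    """Collect concepts reached by visiting exactly the term types in type_path, in order."""
    frontier = {start}
    for wanted in type_path:
        frontier = {
            target
            for node in frontier
            for target, term_type in EDGES.get(node, [])
            if term_type == wanted
        }
    return frontier


print(sorted(ingredients_within("sbdf:1", 2)))
print(sorted(follow_path("sbdf:1", ["SCDF", "IN"])))
```

In this toy graph the untyped walk picks up an extra ingredient through the brand-name node, while the typed path returns only the ingredient reached through the clinical dose form.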

Check

-- On average, how many ingredients do drug products have?
WITH n_ingredients_per_label AS (
    SELECT
        label_id,
        COUNT(DISTINCT ingredient_id) AS n_ingredients
    FROM
        product_label
        INNER JOIN product_to_rxnorm USING (label_id)
        INNER JOIN vocab_rxnorm_ingredient_to_product ON rxnorm_product_id = product_id
    GROUP BY
        label_id
)
SELECT
    AVG(n_ingredients)
FROM
    n_ingredients_per_label;

In v3.0.0, this returns 1343 (implausibly high)
In v3.1.0, this returns 1.13 (realistic)
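The same check can be run from Python against a SQLite copy of the database. The sketch below builds a tiny in-memory stand-in instead of the real release, so the table contents are invented and the column layout is inferred only from the query above; the real schema has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Minimal stand-in tables with only the columns the check query touches.
conn.executescript(
    """
    CREATE TABLE product_label (label_id INTEGER PRIMARY KEY);
    CREATE TABLE product_to_rxnorm (label_id INTEGER, rxnorm_product_id TEXT);
    CREATE TABLE vocab_rxnorm_ingredient_to_product (
        product_id TEXT,
        ingredient_id TEXT
    );
    INSERT INTO product_label VALUES (1), (2);
    INSERT INTO product_to_rxnorm VALUES (1, 'p10'), (2, 'p20');
    INSERT INTO vocab_rxnorm_ingredient_to_product VALUES
        ('p10', 'i100'), ('p10', 'i101'), ('p20', 'i200');
    """
)

CHECK_QUERY = """
WITH n_ingredients_per_label AS (
    SELECT
        label_id,
        COUNT(DISTINCT ingredient_id) AS n_ingredients
    FROM
        product_label
        INNER JOIN product_to_rxnorm USING (label_id)
        INNER JOIN vocab_rxnorm_ingredient_to_product
            ON rxnorm_product_id = product_id
    GROUP BY
        label_id
)
SELECT AVG(n_ingredients) FROM n_ingredients_per_label;
"""

# One label has two ingredients and the other has one.
avg_ingredients = conn.execute(CHECK_QUERY).fetchone()[0]
print(avg_ingredients)
```

Against a real release, replacing the in-memory connection with a connection to the loaded database should be enough, assuming the tables were loaded under the same names.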

24 Apr 23:38

zietzm

v3.0.0

550aed8

OnSIDES v3.0.0

Updated April 2025.

Major overhaul of the OnSIDES codebase. Updates include:

Unified processing for all English labels
New database schema
Reproducible database generation (Snakemake)
Schemas + loading scripts for common databases
Include manual annotations in the release file
Produce a high confidence set (adverse effects labeled in all sources)

The data ZIP file (onsides-v3.0.0.zip) has the following layout:

├── annotations   # Manual annotations
├── csv  # Main data tables
├── database_scripts  # Bash scripts for loading tables into a database
└── schema   # Database schema files for MySQL, Postgres, and SQLite

Direct questions to @ntatonetti or @zietzm

Contributors

ntatonetti and zietzm

04 Feb 23:57

zietzm

v2.1.1

fb367bd

Publication Version of Record

v2.1.1

feat(intl/eu): 75x speedup text data formatting

26 Sep 19:33

ntatonetti

v2.1.0-20240925

fc36f00

OnSIDES Data Release 2.1.0-20240925

September 25th, 2024 Data Release for OnSIDES v2.1.0. These data were generated using available structured product labels (SPLs) from DailyMed as of September 25th, 2024. This is the first release that includes the Warnings and Precautions section in addition to the Adverse Reactions and the Boxed Warnings sections.

Data releases are tagged with the release version of the code used to generate them followed by the date in YYYYMMDD format.
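A tag in this scheme can be split back into its code version and snapshot date. The following is a small sketch under that assumption; the function name parse_release_tag is hypothetical, and the leading "v" is treated as optional because release titles sometimes omit it.

```python
import re
from datetime import date, datetime

# <code version>-<YYYYMMDD>, with an optional leading "v".
TAG_RE = re.compile(r"^v?(?P<code>\d+\.\d+\.\d+)-(?P<stamp>\d{8})$")


def parse_release_tag(tag: str) -> tuple[str, date]:
    """Split a data-release tag into (code version, snapshot date)."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a dated data-release tag: {tag!r}")
    snapshot = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    return match.group("code"), snapshot


code_version, snapshot_date = parse_release_tag("v2.1.0-20240925")
print(code_version, snapshot_date.isoformat())
```

Tags without a date suffix, such as v3.1.1, do not follow this scheme and raise ValueError here.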

28 May 20:49

ntatonetti

v2.0.0-20240312

fc36f00

OnSIDES Data Release v2.0.0-20240312

March 12th, 2024 Data Release for OnSIDES v2.0.0. These data were generated using available structured product labels (SPLs) from DailyMed as of March 12th, 2024.

Data releases are tagged with the release version of the code used to generate them followed by the date in YYYYMMDD format.

14 Nov 21:48

ntatonetti

v2.0.0-20231113

403d235

OnSIDES Data Release v2.0.0-20231113

November 13th, 2023 Data Release for OnSIDES v2.0.0. These data were generated using available structured product labels (SPLs) from DailyMed as of November 13th, 2023.

Data releases are tagged with the release version of the code used to generate them followed by the date in YYYYMMDD format.

29 Jun 19:33

ntatonetti

v2.0.0-20230629

cdd6c24

OnSIDES Data Release v2.0.0-20230629

June 29, 2023 Data Release for OnSIDES v2.0.0. These data were generated using available structured product labels (SPLs) from DailyMed as of June 29, 2023.

Data releases are tagged with the release version of the code used to generate them followed by the date in YYYYMMDD format.

09 Mar 17:01

ntatonetti

v2.0.0-20230309

22b8ab3

OnSIDES Data Release v2.0.0-20230309

March 9, 2023 Data Release for OnSIDES v2.0.0. These data were generated using available structured product labels (SPLs) from DailyMed as of March 9, 2023.

Data releases are tagged with the release version of the code used to generate them followed by the date in YYYYMMDD format.

04 Feb 19:22

ntatonetti

v2.0.0-20230203

f5dd375

OnSIDES Data Release v2.0.0-20230203

February 4, 2023 Data Release for OnSIDES v2.0.0. These data were generated using available structured product labels (SPLs) from DailyMed as of Feb 4, 2023.

Data releases are tagged with the release version of the code used to generate them followed by the date in YYYYMMDD format.
