💫 This repository provides an automated data extraction and matrix construction tool for LCI databases such as ecoinvent. It processes native EcoSpold v2 (`.spold`) files directly, without relying on commercial software, automatically handling data parsing, cleaning, and standardization, and ultimately generates a sparse flow × process matrix.
With this workflow, researchers can:

- efficiently extract and integrate LCI data, avoiding the complexity and errors of manual handling;
- obtain a high-quality numerical matrix covering tens of thousands of processes and flows;
- retain rich textual information to support semantic modeling and machine learning tasks;
- build a solid data foundation for missing-data prediction and automated LCA analysis.
Input
- EcoSpold v2 `.spold` datasets from an ecoinvent 3.11 release (drop them under `data/spold/` or point `ECOSPOLD_ROOT` to another directory).
- `FilenameToActivityLookup.csv` (semicolon-separated), which maps file prefixes to activity names and locations (defaults to `data/FilenameToActivityLookup.csv`, override with `FILENAME_LOOKUP`).
- `batch_number.txt`, which stores the rolling batch ID (created automatically in `outputs/` unless you set `LCA_BATCH_FILE`).
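Each `.spold` input is an EcoSpold v2 XML document. As a rough illustration of what the parsing stage has to do (not the repository's actual code — the element names `intermediateExchange`, `elementaryExchange`, `name`, and the `amount` attribute follow the usual EcoSpold v2 layout, and the function name is ours), extracting exchanges from one file might look like:

```python
import xml.etree.ElementTree as ET

def extract_exchanges(spold_path):
    """Sketch: pull exchange names and amounts from an EcoSpold v2 file.

    Assumes the common EcoSpold v2 layout, where intermediate and
    elementary exchanges carry an `amount` attribute and a child
    <name> element. Illustrative only.
    """
    root = ET.parse(spold_path).getroot()
    exchanges = []
    for tag in ("intermediateExchange", "elementaryExchange"):
        # "{*}" matches any XML namespace (supported since Python 3.8)
        for ex in root.findall(f".//{{*}}{tag}"):
            name_el = ex.find("{*}name")
            exchanges.append({
                "type": tag,
                "name": name_el.text if name_el is not None else None,
                "amount": float(ex.get("amount", "0")),
            })
    return exchanges
```

The namespace wildcard keeps the sketch independent of the exact EcoSpold schema URI declared in each file.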
Output
Each run creates a timestamped batch folder (<output_root>/<MMDD>_<batch_number>/) containing:
- A CSV per `.spold` file with cleaned intermediate and elementary exchanges (same basename as the source file).
- Logs and diagnostics: `processing_debug.txt`, `summary.txt`, `failed_files.txt`, and the spotlight lists `non1_amount_files.csv` and `neg1_amount_files.csv`.
- `global_activity_mapping.csv`, recording every activity ID discovered across the dataset.
The helper `build_lca_matrix` step then consolidates the per-activity CSVs into a sparse flow × process matrix and writes it to `LCA_MATRIX_TARGET` (defaults to the batch folder).
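Conceptually, the consolidation step treats each per-activity CSV as one matrix column and each unique flow as one row. A minimal sketch of that idea (the column headers `flow_name` and `amount` are illustrative assumptions, not the real CSV schema):

```python
import csv
from pathlib import Path

def build_coo_matrix(batch_dir):
    """Sketch: one matrix column per activity CSV, one row per unique
    flow, stored as COO triplets (rows, cols, vals)."""
    flow_index, rows, cols, vals, processes = {}, [], [], [], []
    for col, csv_path in enumerate(sorted(Path(batch_dir).glob("*.csv"))):
        processes.append(csv_path.stem)  # process name = file basename
        with open(csv_path, newline="") as f:
            for rec in csv.DictReader(f):
                # assign each new flow the next free row index
                row = flow_index.setdefault(rec["flow_name"], len(flow_index))
                rows.append(row)
                cols.append(col)
                vals.append(float(rec["amount"]))
    return flow_index, processes, (rows, cols, vals)
```

The triplets map directly onto `scipy.sparse.coo_matrix((vals, (rows, cols)))` when a SciPy matrix is wanted downstream.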
- `ECOSPOLD_ROOT` — directory that holds raw `.spold` files (default: `data/spold/`).
- `FILENAME_LOOKUP` — path to the lookup CSV (default: `data/FilenameToActivityLookup.csv`).
- `LCA_OUTPUT_ROOT` — parent directory for batch outputs (default: `outputs/`).
- `LCA_BATCH_FILE` — custom location for the batch counter file.
- `LCA_MATRIX_SOURCE` / `LCA_MATRIX_TARGET` — override the matrix builder's input/output paths.
- `COMPARE_DIR1` / `COMPARE_DIR2` — set when using the directory comparison helper.
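These variables are presumably resolved with plain `os.environ` lookups inside the notebook; a minimal sketch of that pattern, using the defaults listed above:

```python
import os
from pathlib import Path

# Defaults mirror the environment variables documented above; export any
# of them before launching Jupyter to override a path.
ECOSPOLD_ROOT = Path(os.environ.get("ECOSPOLD_ROOT", "data/spold/"))
FILENAME_LOOKUP = Path(os.environ.get("FILENAME_LOOKUP",
                                      "data/FilenameToActivityLookup.csv"))
LCA_OUTPUT_ROOT = Path(os.environ.get("LCA_OUTPUT_ROOT", "outputs/"))
LCA_BATCH_FILE = os.environ.get("LCA_BATCH_FILE")  # optional, may stay None
```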
We recommend using a virtual environment.
```bash
git clone https://github.com/IceLab-X/AI4LCA_LCI_Data_Extraction_from_ecoinvent.git
cd AI4LCA_LCI_Data_Extraction_from_ecoinvent
python3 -m venv .venv && source .venv/bin/activate   # Linux/macOS
# On Windows: python -m venv .venv && .venv\Scripts\activate
pip install -r requirements.txt
```
- Place licensed `.spold` files under `data/spold/` (or export `ECOSPOLD_ROOT`).
- Copy `FilenameToActivityLookup.csv` into `data/` (or set `FILENAME_LOOKUP`).
- Launch Jupyter (`JUPYTER_CONFIG_DIR=. jupyter lab` on Linux, or `jupyter lab` on Windows) and run `Automated_LCI_Data_Extraction_Protocol.ipynb` top to bottom. Processing the full ecoinvent 3.11 dataset takes 2-3 hours; monitor the Python 3 (ipykernel) status in Jupyter to track progress.
- Inspect the new batch folder under `outputs/` for CSVs, logs, and diagnostics. (The consolidated CSV for ecoinvent 3.11 exceeds 2 GB, which is beyond Microsoft Excel's limits; open it with VS Code or another tool instead.)
- (Optional) Rerun the matrix builder with `LCA_MATRIX_SOURCE` / `LCA_MATRIX_TARGET` to aggregate a specific batch elsewhere.
To run this project, you need to download the ecoinvent 3.11 datasets (ecoSpold02 format). Go to the ecoinvent website and download one of the following archives (choose the system model you need):
- `ecoinvent 3.11_cutoff_ecoSpold02.7z`
- `ecoinvent 3.11_consequential_ecoSpold02.7z`
- `ecoinvent 3.11_apos_ecoSpold02.7z`
⚠️ Download only one of the three, depending on your application (cut-off, consequential, or APOS).
After extracting, you will get a folder named like:
- `ecoinvent 3.11_cutoff_ecoSpold02`
- `ecoinvent 3.11_consequential_ecoSpold02`
- `ecoinvent 3.11_apos_ecoSpold02`
Inside this folder you should see at least two subfolders:
- `datasets/` (contains `.spold` files)
- `MasterData/` (contains `.xml` files)
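A quick sanity check that the archive extracted correctly can save a failed multi-hour run later. This helper is not part of the repository — just a sketch of the layout check described above:

```python
from pathlib import Path

def check_ecoinvent_layout(root):
    """Return True if `root` looks like an extracted ecoinvent archive:
    both expected subfolders exist and datasets/ holds .spold files."""
    root = Path(root)
    datasets = root / "datasets"
    master = root / "MasterData"
    if not datasets.is_dir() or not master.is_dir():
        return False
    return any(datasets.glob("*.spold"))
```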
Place the extracted folder under the data/ directory of this repository, for example:
```
project_root/
│
├─ data/
│   ├─ spold/                               # (symlink here if desired)
│   ├─ ecoinvent 3.11_cutoff_ecoSpold02/
│   │   ├─ datasets/
│   │   └─ MasterData/
│
└─ outputs/
```
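The tree above notes that `data/spold/` can be a symlink into the extracted archive instead of a copy. One way to create it, assuming the cut-off folder name shown above (the helper name is ours; on Windows, creating symlinks may require elevated privileges or Developer Mode):

```python
from pathlib import Path

def link_spold_dir(project_root, model="cutoff"):
    """Point data/spold at the extracted datasets folder via a symlink.

    The archive folder name matches the ecoinvent 3.11 naming shown
    above; pass model="consequential" or "apos" for the other releases.
    """
    data = Path(project_root) / "data"
    target = data / f"ecoinvent 3.11_{model}_ecoSpold02" / "datasets"
    link = data / "spold"
    if not link.exists():
        link.symlink_to(target)
    return link
```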
Now you can run the processing scripts as described above.
- Reporting bugs. To report a bug, open an issue in the GitHub Issues.
- Suggesting enhancements. To suggest an enhancement, whether a completely new feature or a minor improvement to an existing one, open an issue in the GitHub Issues.
- Pull requests. If you have improved this project, fixed a bug, or added a new example, feel free to send us a pull request.
- Asking questions. To get help on using this tool or its functionality, open a discussion in the GitHub Discussions.
💥 Please cite our paper if you find it helpful :) *SemaNet: Bridging Words and Numbers For Predicting Missing Environmental Data in Life Cycle Assessment*. DOI: https://doi.org/10.1021/acs.est.5c07557