Skip to content

IceLab-X/AI4LCA_LCI_Data_Extraction_from_ecoinvent

Repository files navigation

Automated LCI Data Extraction

💫This repository provides an automated data extraction and matrix construction tool for databases such as Ecoinvent. It can directly process native EcoSpold v2 (.spold) files without relying on commercial software, automatically handling data parsing, cleaning, and standardization, and ultimately generating a sparse flow × process matrix.

With this workflow, researchers can:

Efficiently extract and integrate LCI data, avoiding the complexity and errors of manual handling;

Obtain a high-quality numerical matrix consisting of tens of thousands of processes and flows ;

Retain rich textual information to support semantic modeling and machine learning tasks;

Build a solid data foundation for missing data prediction and automated LCA analysis.

Input and Output

Input

  • EcoSpold v2 .spold datasets from the Ecoinvent 3.11 various release (drop them under data/spold/ or point ECOSPOLD_ROOT to another directory).
  • FilenameToActivityLookup.csv (semicolon separated) that maps file prefixes to activity names and locations (defaults to data/FilenameToActivityLookup.csv, override with FILENAME_LOOKUP).
  • batch_number.txt, which stores the rolling batch ID (created automatically in outputs/ unless you set LCA_BATCH_FILE).

Output Each run creates a timestamped batch folder (<output_root>/<MMDD>_<batch_number>/) containing:

  • A CSV per .spold file with cleaned intermediate and elementary exchanges (same basename as the source file).
  • Logs and diagnostics: processing_debug.txt, summary.txt, failed_files.txt, and the spotlight lists non1_amount_files.csv and neg1_amount_files.csv.
  • global_activity_mapping.csv, recording every activity ID discovered across the dataset.

The helper build_lca_matrix step then consolidates the per-activity CSVs into a sparse flow × process matrix and writes it to LCA_MATRIX_TARGET (defaults to the batch folder).

Environment overrides

  • ECOSPOLD_ROOT — directory that holds raw .spold files (default: data/spold/).
  • FILENAME_LOOKUP — path to the lookup CSV (default: data/FilenameToActivityLookup.csv).
  • LCA_OUTPUT_ROOT — parent directory for batch outputs (default: outputs/).
  • LCA_BATCH_FILE — custom location for the batch counter file.
  • LCA_MATRIX_SOURCE/LCA_MATRIX_TARGET — override the matrix builder’s input/output paths.
  • COMPARE_DIR1/COMPARE_DIR2 — set when using the directory comparison helper.

📝Run

⛏️Installation

We recommend using a virtual environment.

git clone https://github.com/IceLab-X/AI4LCA_LCI_Data_Extraction_from_ecoinvent.git
cd Automated-LCI-Data-Extraction-Protocol
python3 -m venv .venv && source .venv/bin/activate  (in Linux, or in Windows: python -m venv .venv && .venv\Scripts\activate)
pip install -r requirements.txt

▶️ Running the pipeline

  1. Place licensed .spold files under data/spold/ (or export ECOSPOLD_ROOT).
  2. Copy FilenameToActivityLookup.csv into data/ (or set FILENAME_LOOKUP).
  3. Launch the notebook (JUPYTER_CONFIG_DIR=. jupyter lab in Linux, or jupyter lab' in Windows) and run Automated_LCI_Data_Extraction_Protocol.ipynb` top to bottom.
  4. It takes time (for ecoinvent 3.11 dataset, takes 2-3 hours)to run the code, Please check Python 3 (ipykernel) in the jupyter for the operation state.
  5. Inspect the new batch folder under outputs/ for CSVs (the CSV file in Ecoinvent 3.11 is more than 2GB and hence it has exceed the limit of Microsoft Excel, try VS code or any other application), logs, and diagnostics.
  6. (Optional) Rerun the matrix builder with LCA_MATRIX_SOURCE/LCA_MATRIX_TARGET to aggregate a specific batch elsewhere.

Data Preparation

To run this project, you need to download the ecoinvent 3.11 datasets (ecoSpold02 format). Go to the ecoinvent website and download one of the following archives (choose the system model you need):

  • ecoinvent 3.11_cutoff_ecoSpold02.7z
  • ecoinvent 3.11_consequential_ecoSpold02.7z
  • ecoinvent 3.11_apos_ecoSpold02.7z

⚠️ Download only one of the three, depending on your application (cut-off, consequential, or APOS).

After extracting, you will get a folder named like:

  • ecoinvent 3.11_cutoff_ecoSpold02
  • ecoinvent 3.11_consequential_ecoSpold02
  • ecoinvent 3.11_apos_ecoSpold02

Inside this folder you should see at least two subfolders:

  • datasets/ (contains .spold files)
  • MasterData/ (contains .xml files)

Place the extracted folder under the data/ directory of this repository, for example:

project_root/
│
├─ data/
│   ├─ spold/                        # (symlink here if desired)
│   ├─ ecoinvent 3.11_cutoff_ecoSpold02/
│   │    ├─ datasets/
│   │    └─ MasterData/
│
└─ outputs/

Now you can run the processing scripts as described above.

💐Contributing to Automated LCI Data Extraction Protocol

  • Reporting bugs. To report a bug, simply open an issue in the GitHub Issues.
  • Suggesting enhancements. To submit an enhancement suggestion, including completely new features or minor improvements on existing features, please open an issue in the GitHub Issues.
  • Pull requests. If you made improvements to FidelityFusion, fixed a bug, or had a new example, feel free to send us a pull-request.
  • Asking questions. To get help on how to use FidelityFusion or its functionalities, you can open a discussion in the GitHub.

🤗Citation

💥Please cite our paper if you find it helpful :) SemaNet: Bridging Words and Numbers For Predicting Missing Environmental Data in Life Cycle Assessment. DOI: https://doi.org/10.1021/acs.est.5c07557

About

Codes for extracting LCI data into matrix. See paper Bridging Words and Numbers For Predicting Missing Environmental Data in Life Cycle Assessment Codes for more details.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors