SNCF Safety Knowledge Extraction Pipeline

AI TINKERERS Hackaton 10/12/25

This project extracts structured railway safety knowledge from unstructured incident documents and prepares it for ingestion into a Neo4J graph-based safety system.
Its purpose is to help railway safety personnel analyze incidents, identify root causes, and justify maintenance or budget decisions using a connected safety knowledge graph. The data is from the SNCF opendata and can be viewed here: https://eu.ftp.opendatasoft.com/sncf/rapports/rapport-securite-2023.PDF This a basic version of the graph, it should be enriched to demonstrate its full capability.

Overview

1. PDF → Markdown

Documents are converted to markdown using Docling. This provides a uniform text format for analysis.

2. Markdown Chunking

The markdown document is segmented into sections based on titles (#, ##, etc.).
Each chunk is processed independently.

3. Three Classification Modules

A. Incident Classification

Outputs:

incident_type
possible_impacts
possible_root_causes (with HIGH/LOW likelihood)
Cypher defining:
- Incident → Impact
- Incident → RootCause
- Impact → RootCause

B. Cause Classification

Outputs:

root cause (chosen from a fixed taxonomy)
normalized causal category
Cypher linking:
- Category → RootCause

C. Impact Classification

Outputs:

impact category
normalized sub-impact subtype
optional inferred root cause
Cypher linking:
- SubImpact → Impact

4. JSON Outputs

Each classifier produces a JSON file:

├── incidents.json ├── causes.json └── impacts.json

All files are ready for Neo4J ingestion.

FastAPI Endpoints

Convert PDF to Markdown

POST /convert_pdf?pdf_path=path/to/file.pdf

Process Incidents

POST /process_incidents?md_path=path/to/file.md

Process Causes

POST /process_causes?md_path=path/to/file.md

Process Impacts

POST /process_impacts?md_path=path/to/file.md

Neo4J Integration

The pipeline produces structured JSON files (incidents.json, causes.json, impacts.json) that already contain fully generated Cypher statements.
These Cypher statements express all required graph relationships for a safety-knowledge model.

What Gets Written to Neo4J

The system generates and ingests four major types of nodes:

IncidentType
Impact
SubImpact
RootCause
SubCause

And several relationship types:

(:IncidentType)-[:HAS_IMPACT]->(:Impact)
(:IncidentType)-[:HAS_CAUSE {likelihood}]->(:RootCause)
(:Impact)-[:LINKED_TO_CAUSE]->(:RootCause)
(:SubImpact)-[:EST_SOUS_IMPACT_DE]->(:Impact)

These relations form a multi-layered safety graph connecting:

Incident → Impact → Cause

This creates a navigable structure where analysts can:

explore causal chains
compare incident families
trace common failure patterns
justify maintenance decisions
support risk scoring and prioritization

How Ingestion Works

The ingestion route (/neo4j_ingest) reads the three JSON files and extracts every Cypher statement.

driver.execute_query(cypher)

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
__pycache__		__pycache__
.gitignore		.gitignore
Knowledge Graph.png		Knowledge Graph.png
app.py		app.py
causes.json		causes.json
graphDB.JPG		graphDB.JPG
graphDB2.JPG		graphDB2.JPG
graphDB3.JPG		graphDB3.JPG
graphDB3.jpg		graphDB3.jpg
graphDB4.JPG		graphDB4.JPG
impacts.json		impacts.json
incidents.json		incidents.json
pipeline.py		pipeline.py
rapport_securite.md		rapport_securite.md
rapport_securite_distilled.md		rapport_securite_distilled.md
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SNCF Safety Knowledge Extraction Pipeline

Overview

1. PDF → Markdown

2. Markdown Chunking

3. Three Classification Modules

A. Incident Classification

B. Cause Classification

C. Impact Classification

4. JSON Outputs

FastAPI Endpoints

Convert PDF to Markdown

Process Incidents

Process Causes

Process Impacts

Neo4J Integration

What Gets Written to Neo4J

How Ingestion Works

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SNCF Safety Knowledge Extraction Pipeline

Overview

1. PDF → Markdown

2. Markdown Chunking

3. Three Classification Modules

A. Incident Classification

B. Cause Classification

C. Impact Classification

4. JSON Outputs

FastAPI Endpoints

Convert PDF to Markdown

Process Incidents

Process Causes

Process Impacts

Neo4J Integration

What Gets Written to Neo4J

How Ingestion Works

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages