MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

This is the official repository of the short paper "MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes".

MiNER is a two-stage framework for automatic metadata extraction from municipal meeting minutes, combining Question Answering (QA) for segment boundary detection and Named Entity Recognition (NER) for fine-grained metadata extraction.
This repository supports the experiments presented in the accompanying paper and provides the complete codebase used in our experiments. The full dataset is available at https://github.com/INESCTEC/citilink-dataset, and the trained models are released on Hugging Face.


1. Project Overview

This project introduces a unified pipeline for identifying and structuring metadata information in municipal meeting minutes.
The pipeline operates in two stages:

  1. QA-based segmentation – detects the opening and closing segments of each document.
  2. NER-based metadata extraction – extracts structured metadata fields such as date, location, meeting number and type, start and end times, and participants (as well as their presence).

Together, these components enable large-scale analysis of municipal records and serve as a benchmark for information extraction in long, formal administrative texts.
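As a rough illustration of how the two stages connect, the sketch below slices the opening and closing segments out of a document once the QA stage has returned its two boundary sentences. The function name `slice_segments` and the plain string matching are assumptions made for this example; the repository's own implementation may differ.

```python
def slice_segments(full_text, opening_last_sentence, closing_first_sentence):
    """Return (opening_segment, closing_segment) from two QA answers.

    Hypothetical helper: an empty answer, or one that cannot be found
    in the text, yields an empty segment.
    """
    opening, closing = "", ""
    if opening_last_sentence:
        idx = full_text.find(opening_last_sentence)
        if idx != -1:
            # Opening segment runs from the start of the document
            # through the end of the predicted sentence.
            opening = full_text[: idx + len(opening_last_sentence)]
    if closing_first_sentence:
        # Search from the right, since the closing segment sits at the end.
        idx = full_text.rfind(closing_first_sentence)
        if idx != -1:
            closing = full_text[idx:]
    return opening, closing


text = (
    "Minute no. 5. The meeting began at 10:00. "
    "Item one was discussed. "
    "The meeting ended at 12:00. Signed by the chair."
)
opening, closing = slice_segments(
    text, "The meeting began at 10:00.", "The meeting ended at 12:00."
)
print(opening)
print(closing)
```

Only these two segments would then be passed to the NER stage, which keeps the input short compared with the full minute.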

[Figure: MiNER flash card]

2. Key Features

  • Two-Stage Architecture: Combines QA-based document segmentation with Transformer-based NER.
  • Bilingual Dataset: Portuguese originals and English translations from six municipalities.
  • Structured Metadata Extraction: Supports the following entities: minute_id, date, location, meeting_type, participants, begin_time, and end_time.
  • Efficiency & Performance: Fine-tuned models outperform large generative LLMs while being orders of magnitude faster and greener.
  • Open Resources: All datasets and code will be released for reproducibility. For now, this repository contains only a sample of the data.

3. Project Status

The core components of the MiNER framework are fully implemented and validated. The system is considered stable for research use, and the codebase is actively maintained to ensure reproducibility.
Minor improvements and refactoring are ongoing, particularly concerning dataset expansion and model evaluation consistency.


4. Technology Stack

Language

  • Python 3.10+

Core Frameworks

  • PyTorch – Deep learning backend for model training and inference
  • Transformers (Hugging Face) – Pre-trained language models for QA and NER
  • Datasets & Evaluate (Hugging Face) – Dataset management, preprocessing, and metric computation
  • spaCy – Sentence segmentation and preprocessing for QA dataset creation
  • Faker – Lexical masking and synthetic data generation for data augmentation

Utilities

  • NumPy – Numerical operations and data manipulation
  • LangChain – Text chunking and utility functions for preprocessing
  • tqdm – Progress tracking and visualization during training

Development Tools

  • Git – Version control and collaboration
  • JSON – Data serialization and storage format
  • Markdown – Documentation and reporting

Hardware

  • All experiments were run on an NVIDIA GeForce RTX 5070 Ti.

5. Dependencies

All dependencies required to reproduce the experiments are listed in the requirements.txt file.
The main libraries and minimum versions are:

  • transformers >= 4.30.0
  • datasets >= 2.14.0
  • torch >= 2.0.0
  • evaluate >= 0.4.0
  • spacy >= 3.5.0
  • faker >= 19.0.0
  • langchain >= 0.1.0
  • numpy >= 1.24.0
  • tqdm >= 4.65.0

6. Usage

This section describes how to build datasets, train models, and reproduce the results for both stages of the MiNER pipeline.


Question Answering (QA)

Goal: Detect the opening and closing segments of each meeting minute, where metadata is typically concentrated.

Build the QA Dataset

python3 build_qa_dataset.py --lang pt
python3 build_qa_dataset.py --lang en
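For orientation, here is a minimal sketch of what a single SQuAD v2-style record could look like for this task. The field layout (`id`, `context`, `question`, `answers` with `text` and `answer_start`) follows the public SQuAD v2 convention. The helper `make_qa_example` and the ID string are hypothetical, and whether build_qa_dataset.py produces exactly this shape is an assumption.

```python
OPENING_QUESTION = (
    "At the beginning of the minutes there is an opening segment. "
    "What is the last sentence of that opening segment?"
)


def make_qa_example(example_id, context, question, answer_text):
    """Build one SQuAD v2-style record (hypothetical helper).

    If the answer is empty or not found in the context, the record is
    marked unanswerable by leaving the answer lists empty.
    """
    start = context.find(answer_text) if answer_text else -1
    if start == -1:
        answers = {"text": [], "answer_start": []}
    else:
        answers = {"text": [answer_text], "answer_start": [start]}
    return {
        "id": example_id,
        "context": context,
        "question": question,
        "answers": answers,
    }


record = make_qa_example("doc1-opening", "A. B. C.", OPENING_QUESTION, "B.")
print(record["answers"])
```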

Train Model

python3 train_qa.py \
  --lang pt \
  --num_train_epochs 3 \
  --per_device_train_batch_size 8 \
  --fp16

Named Entity Recognition (NER)

Goal: Extract structured metadata entities from minutes.

Convert Metadata → BIO

python3 process_new.py
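The idea behind the BIO conversion can be sketched as follows, assuming annotations arrive as character-offset spans and tokens are split on whitespace. This is an illustrative stand-in, not the contents of process_new.py; the function name `spans_to_bio` and the `(start, end, label)` input format are assumptions.

```python
import re


def spans_to_bio(text, spans):
    """Convert character spans to BIO tags over whitespace tokens.

    spans: list of (start, end, label) with end exclusive (assumed format).
    """
    tokens, tags = [], []
    started = set()  # indices of spans that already received a B- tag
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for i, (start, end, label) in enumerate(spans):
            if tok_start < end and tok_end > start:  # token overlaps the span
                prefix = "I" if i in started else "B"
                started.add(i)
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Meeting held on 12 May 2022 in Porto."
date_start = text.index("12 May 2022")
loc_start = text.index("Porto")
spans = [
    (date_start, date_start + len("12 May 2022"), "date"),
    (loc_start, loc_start + len("Porto"), "location"),
]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```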

Tokenize & Align Labels

python3 transform_dataset.py
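Label alignment is commonly done by labeling only the first subword of each word and masking the rest. The sketch below assumes that convention and takes as input the list returned by a fast Hugging Face tokenizer's `word_ids()`. The function name `align_labels` is hypothetical, and transform_dataset.py may use a different strategy.

```python
def align_labels(word_ids, word_labels, label2id, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    word_ids: one entry per subword token; None marks special tokens.
    Only the first subword of each word keeps its label; the others get
    ignore_index, which PyTorch's cross-entropy loss skips by default.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special token such as <s> or </s>
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first subword
        else:
            aligned.append(ignore_index)  # continuation subword
        previous = word_id
    return aligned


label2id = {"O": 0, "B-date": 1, "I-date": 2}
word_ids = [None, 0, 1, 1, 2, None]  # word 1 was split into two subwords
aligned = align_labels(word_ids, ["O", "B-date", "I-date"], label2id)
print(aligned)
```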

Train NER Model

python3 train_model.py

7. Dataset Description

⚠️ Dataset Access Notice

The full dataset statistics are presented below; however, only one sample per municipality is publicly available in this repository. The full data is available at https://github.com/INESCTEC/citilink-dataset

Two interactive demos are provided to test the models directly:

Overview

| Attribute | Description |
|---|---|
| Dataset Name | Council Metadata Corpus |
| Languages | Portuguese and English |
| Documents | 6 municipalities × 20 minutes (2021–2024) |
| Total Tokens | 1,170,417 |
| Tokens in Metadata Segments | 32,364 |
| Total Metadata Segments | 180 |
| Annotation Fields | date, minute_id, meeting_type, begin_time, end_time, location, participants (with role, presence, and party), opening_segment, closing_segment |

Format

Each file (dataset_metadata_[lang]/municipality.json) follows the structure below:

{
  "documents": {
    "Municipality_Name": {
      "Municipality_cm_XXX_YYYY-MM-DD": {
        "document_id": "...",
        "full_text": "...",
        "metadata": {
          "minute_id": {...},
          "date": {...},
          "location": {...},
          "meeting_type": {...},
          "begin_time": {...},
          "end_time": {...},
          "participants": [...],
          "opening_segment": {...},
          "closing_segment": {...}
        }
      }
    }
  }
}
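A minimal sketch of how this nested structure might be traversed is shown below. The helpers `load_corpus` and `iter_minutes` are hypothetical names. The contents of the individual metadata fields are elided above, so the example only lists which fields are present.

```python
import json
from pathlib import Path


def load_corpus(path):
    """Read one municipality JSON file (path supplied by the caller)."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def iter_minutes(corpus):
    """Yield (municipality, minute_key, record) for every minute."""
    for municipality, minutes in corpus["documents"].items():
        for minute_key, record in minutes.items():
            yield municipality, minute_key, record


# Tiny in-memory stand-in that mirrors the layout above (placeholder values).
example = {
    "documents": {
        "Municipality_Name": {
            "Municipality_cm_XXX_YYYY-MM-DD": {
                "document_id": "...",
                "full_text": "...",
                "metadata": {"date": {}, "location": {}, "participants": []},
            }
        }
    }
}

rows = list(iter_minutes(example))
for municipality, minute_key, record in rows:
    print(municipality, minute_key, sorted(record["metadata"]))
```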

Data Files

The data files for the Council Metadata Corpus are located in the data/ directory:

  • dataset_metadata_en – English version (1 file with 1 document)
  • dataset_metadata_pt – Portuguese version (1 file with 1 document)
  • split – Train/val/test split information

Each JSON file corresponds to one municipality and contains the full text of the meeting minute, along with manually annotated metadata fields.


Annotation Process

  • Source: Official municipal meeting minutes provided by the respective municipalities (2021–2024)
  • Annotation Tool: INCEpTION
  • Annotation Guidelines: Annotators labeled only mentions of the metadata fields listed under Annotation Fields in the Overview above

Dataset Characteristics

Challenges

  • Domain Specificity: Contains formal administrative language and municipality-specific jargon.
  • Long Segments: Average segment length exceeds the context window of most Transformer architectures.
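One common way to handle inputs longer than a model's context window is to split them into overlapping windows. The sketch below shows that idea on a plain token list. The values `max_len=512` and `stride=128` are illustrative assumptions, not settings taken from this repository. Hugging Face tokenizers offer a comparable mechanism through `return_overflowing_tokens=True` together with `stride`.

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Split tokens into windows of at most max_len items.

    Consecutive windows overlap by `stride` tokens, so an entity cut at
    one window border appears whole in the neighbouring window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must be non-negative and smaller than max_len")
    step = max_len - stride
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return windows


windows = sliding_windows(list(range(10)), max_len=4, stride=2)
for window in windows:
    print(window)
```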

Advantages

  • Authentic Data: Based on real-world municipal records rather than synthetic text.
  • Bilingual Design: Enables cross-lingual and translation-based evaluation (Portuguese ↔ English).
  • Multi-Municipality Coverage: Facilitates generalization studies using Leave-One-Municipality-Out validation.
  • Fine-Grained Annotations: Includes both segment-level (QA) and token-level (NER) labels for multi-task evaluation.
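Leave-One-Municipality-Out validation can be sketched as a generator that holds out one municipality per fold and trains on the rest. The function name and the dictionary input are assumptions for illustration; the actual folds used in the experiments are defined by the split files in the data/ directory.

```python
def leave_one_municipality_out(docs_by_municipality):
    """Yield (held_out, train_docs, test_docs), one fold per municipality."""
    for held_out in sorted(docs_by_municipality):
        train = [
            doc
            for name, docs in sorted(docs_by_municipality.items())
            if name != held_out
            for doc in docs
        ]
        test = list(docs_by_municipality[held_out])
        yield held_out, train, test


# Placeholder document IDs; M1 to M3 stand in for real municipalities.
corpus = {"M1": ["m1_a", "m1_b"], "M2": ["m2_a"], "M3": ["m3_a"]}
folds = list(leave_one_municipality_out(corpus))
for held_out, train, test in folds:
    print(held_out, train, test)
```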

8. Architecture

Component Descriptions

The MiNER framework is composed of two core components: a Question Answering (QA) model for segment detection and a Named Entity Recognition (NER) model for structured metadata extraction.

[Figure: MiNER pipeline overview]

Stage 1 – Question Answering (QA)

  • Model: deepset/xlm-roberta-large-squad2
  • Objective: Identify the opening and closing segments of each document that contain relevant metadata.
  • Prompts: For this task we built two prompts:
    • "At the beginning of the minutes there is an opening segment. What is the last sentence of that opening segment?"
    • "At the end of the minutes there is a closing segment. What is the first sentence of that closing segment?"
  • Training Data: SQuAD v2-style dataset manually built from annotated municipal minutes.
  • Evaluation Metrics:
    • F1-score: Measures token-level overlap between predicted and gold-standard answers.
    • Exact Match (EM): Percentage of predictions that exactly match the gold reference span.
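The two QA metrics can be illustrated with a simplified version of the usual SQuAD-style computation. The normalization here (lowercasing and stripping ASCII punctuation) is an assumption made for brevity; the official SQuAD script also removes English articles, and the evaluation code in this repository may normalize differently.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty (unanswerable, correctly predicted) counts as a match.
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The meeting ended.", "the meeting ended"))
print(token_f1("the meeting ended at noon", "meeting ended at noon"))
```

In the second call the prediction has five tokens, four of which match the gold answer, so precision is 4/5, recall is 4/4, and F1 is 8/9.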

Stage 2 – Named Entity Recognition (NER)

  • Models:
  • Objective: Extract token-level metadata entities (e.g., date, location, meeting type, participants, begin/end times).
  • Evaluation Metrics:
    • Precision (P) – ratio of correctly predicted entities to all predicted entities
    • Recall (R) – ratio of correctly predicted entities to all true entities
    • F1-score (F1) – harmonic mean of precision and recall, computed using the seqeval library.
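To make the entity-level definitions concrete, the sketch below computes precision, recall, and F1 for a single BIO-tagged sequence, counting an entity as correct only when its label and both boundaries match. It is a simplified pure-Python stand-in for what seqeval computes over a whole corpus, written for illustration only; the helper names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (label, start, end) entities from BIO tags."""
    entities = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            entities.add((label, start, i))  # end index is exclusive
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]  # a stray I- tag opens a new entity
    return entities


def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one sequence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-date", "I-date", "O", "B-location"]
pred = ["B-date", "I-date", "O", "O"]
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

Here the model finds the date but misses the location, so precision is 1/1, recall is 1/2, and F1 is 2/3.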

9. LLM Experiments

We also conducted a comparison between our trained models and two Large Language Models (LLMs): one open-weight model (Phi) and one closed-weight model (Gemini).
The prompt used for these experiments is provided in assets/prompt_en.txt.

10. Reporting Issues

Please report any issues or bugs through the GitHub repository issue tracker:
Repository URL

When reporting an issue, please include the following details:

  • Python version
  • CUDA and PyTorch version
  • Complete error message and stack trace
  • Minimal reproducible example (if applicable)

Providing this information helps ensure faster and more accurate debugging.


11. License

This project is licensed under CC-BY-ND 4.0 (Creative Commons Attribution–NoDerivatives 4.0 International).

You are free to:

  • Share: Copy and redistribute the material in any medium or format

Under the following terms:

  • Attribution: You must give appropriate credit.
  • No Derivatives: If you remix, transform, or build upon the material, you may not distribute the modified version.

For details, see the LICENSE file.

Dataset License

The Council Metadata Corpus is derived from public municipal meeting minutes and is provided strictly for research purposes only.
Original documents remain the copyright of their respective municipal governments.


12. Resources

Models

Pre-trained models fine-tuned within the MiNER framework are available on the Hugging Face Model Hub:

External Resources


13. Acknowledgments

  • Municipal governments of M1–M6 for providing access to meeting minutes
  • The INCEpTION Project for the annotation platform
  • Hugging Face for hosting models and providing the Transformers library

Last updated: January 9, 2026
