A structured approach to opinion mining — extracting who expresses what sentiment toward whom, with a polarity label.
- Problem Statement
- Architecture
- Methodology
- Baseline System
- Features
- Datasets
- Results
- Installation
- Usage
- Project Structure
- Further Work
Structured Sentiment Analysis (SSA) extends traditional sentiment analysis by extracting complete sentiment graphs from raw text. Instead of simply classifying a text as positive or negative, SSA identifies all the structured opinion tuples present in a sentence.
Given a text, SSA extracts every opinion tuple $O = (h, e, t, p)$ expressed in it, where each element is defined as:

| Symbol | Role | Description |
|---|---|---|
| $h$ | Holder | The entity who expresses the opinion |
| $e$ | Expression | The polar expression (words conveying sentiment) |
| $t$ | Target | The entity toward which the opinion is directed |
| $p$ | Polarity | The sentiment label: positive, negative, or neutral |
Sentence: "Even though the price is decent for Paris, I would not recommend this hotel."
Opinion Tuple 1:
Holder → "I"
Expression → "would not recommend"
Target → "this hotel"
Polarity → NEGATIVE
Opinion Tuple 2:
Holder → (implicit)
Expression → "decent"
Target → "the price"
Polarity → POSITIVE
Unlike standard sentiment analysis, SSA captures complex, overlapping, and multi-target opinions within a single sentence — making it far more expressive and practically useful.
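As a minimal sketch of what such a tuple looks like in code, the two opinions from the example above can be held in a small record type. The `Opinion` class and its field names are illustrative assumptions, not part of the project code; an implicit holder is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    """One structured opinion tuple; holder or target may be implicit (None)."""
    holder: Optional[str]
    expression: str
    target: Optional[str]
    polarity: str  # "positive", "negative", or "neutral"


sentence = "Even though the price is decent for Paris, I would not recommend this hotel."

opinions = [
    Opinion(holder="I", expression="would not recommend",
            target="this hotel", polarity="negative"),
    Opinion(holder=None, expression="decent",
            target="the price", polarity="positive"),
]

# Every explicit span is a substring of the sentence it was extracted from
for op in opinions:
    for span in (op.holder, op.expression, op.target):
        assert span is None or span in sentence
```

A single sentence therefore yields a list of tuples rather than one label, which is the difference from sentence-level classification.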
Our model follows a unified end-to-end architecture built on top of a pretrained BERT-based encoder, enhanced with a CRF decoding layer and a separate polarity head.
┌──────────────────────────────────────────┐
│ Input Text │
└───────────────────┬──────────────────────┘
│
┌───────────────────▼──────────────────────┐
│ [CLS] w₁ w₂ ... wₙ [SEP] │
│ Token Embedding Layer │
└───────────────────┬──────────────────────┘
│
┌───────────────────▼──────────────────────┐
│ Norwegian BERT Encoder │
│ (NbAiLab / nb-bert-base) │
│ Contextual Hidden State Representations │
└──────────┬────────────────────┬──────────┘
│ │
┌───────────────▼──────┐ ┌─────────▼────────────┐
│ CRF for BIO Tagging│ │ Polarity Head │
│ │ │ ([CLS] token) │
│ B-Source I-Source │ │ │
│ B-Target I-Target │ │ ┌──────────────┐ │
│ B-Polar I-Polar │ │ │ Positive │ │
│ O │ │ │ Negative │ │
└──────────┬───────────┘ │ │ Neutral │ │
│ │ └──────────────┘ │
│ └─────────┬────────────┘
│ │
┌──────────▼─────────────────────────▼──────────┐
│ Structured Opinion Generation │
│ │
│ Merge CRF spans + polarity predictions │
│ Apply offset mapping & span filtering │
│ │
│ Output: (Source, Target, Expression, Polarity)│
└─────────────────────────────────────────────────┘
| Component | Description | Technology |
|---|---|---|
| Encoder | Contextual representation of input tokens | RoBERTa / BERT (multilingual) |
| BIO Tagger | Sequence labeling for span identification | Linear Projection |
| CRF Decoder | Structured decoding with label constraints | Conditional Random Field |
| Polarity Head | Classifies sentiment of identified spans | Multi-layer Classifier on [CLS] |
| Opinion Builder | Assembles final structured output tuples | Post-processing pipeline |
- SSA extends basic sentiment analysis by extracting four key components: Holder, Target, Expression, Polarity
- We use a pre-trained RoBERTa model (`NbAiLab/nb-bert-base`, [Liu et al., 2019]) as our encoder
- The encoder transforms raw input text into rich contextual embeddings for each token
- We benchmarked RoBERTa against `bert-base` and `bert-medium`; RoBERTa consistently outperforms both

# Encoder produces hidden states for every token position
hidden_states = roberta_encoder(input_ids, attention_mask)
# Shape: [batch_size, seq_len, hidden_dim]

- A linear projection layer maps contextual embeddings to BIO tag logits
- We use the BIO (Beginning-Inside-Outside) tagging scheme to represent the boundaries of each opinion component
O → Outside any opinion component
B-Source → Beginning of a Holder span
I-Source → Inside a Holder span
B-Target → Beginning of a Target span
I-Target → Inside a Target span
B-Polar → Beginning of a Polar Expression span
I-Polar → Inside a Polar Expression span
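The tag scheme above can be turned back into spans with a single left-to-right pass. The following is a minimal sketch, not the project's own decoder; the function name `bio_to_spans` and the choice to treat a stray `I-` tag as the start of a new span are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) token spans, with end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        # Close the open span when the current tag does not continue it
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        # Open a new span on B-, or on an I- tag that has no open span to continue
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["I", "would", "not", "recommend", "this", "hotel", "."]
tags = ["B-Source", "B-Polar", "I-Polar", "I-Polar", "B-Target", "I-Target", "O"]
spans = bio_to_spans(tags)
# spans == [("Source", 0, 1), ("Polar", 1, 4), ("Target", 4, 6)]
```

Token indices are used here; mapping them to character offsets is a separate step.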
After obtaining token-level hidden states from BERT/RoBERTa, a CRF layer is applied to decode BIO tag sequences in a globally-consistent manner.
- The CRF learns transition weights between adjacent tags
- This enforces structural constraints (e.g., `I-Source` cannot follow `B-Target`)
- Enables span detection while explicitly modeling tag dependencies
Key Insight: CRF outperforms a plain softmax decoder because it captures inter-tag dependencies critical for multi-span structured extraction.
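To make the effect of tag dependencies concrete, here is a simplified sketch of constrained Viterbi decoding in plain Python. It is not the trained CRF: learned transition weights are replaced by hard allow/forbid rules, and the emission scores are made-up numbers. The names `allowed` and `viterbi` are hypothetical.

```python
from typing import Dict, List

TAGS = ["O", "B-Source", "I-Source", "B-Target", "I-Target", "B-Polar", "I-Polar"]
NEG_INF = float("-inf")


def allowed(prev: str, curr: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Best tag sequence under emission scores plus hard transition constraints."""
    # A sequence may not start with an I- tag; missing emission scores count as 0.0
    score = {t: (NEG_INF if t.startswith("I-") else emissions[0].get(t, 0.0))
             for t in TAGS}
    backptrs = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for curr in TAGS:
            best_prev = max(
                TAGS, key=lambda p: score[p] if allowed(p, curr) else NEG_INF
            )
            base = score[best_prev] if allowed(best_prev, curr) else NEG_INF
            new_score[curr] = base + em.get(curr, 0.0)
            ptr[curr] = best_prev
        score = new_score
        backptrs.append(ptr)
    # Follow back-pointers from the best final tag
    path = [max(TAGS, key=lambda t: score[t])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


emissions = [
    {"B-Target": 2.0, "O": 1.0},
    {"I-Source": 3.0, "I-Target": 2.5, "O": 0.5},
]
greedy = [max(em, key=em.get) for em in emissions]  # per-token argmax
decoded = viterbi(emissions)
```

In this toy input the per-token argmax yields `B-Target, I-Source`, which is not a valid span, while the constrained decoder returns `B-Target, I-Target`.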
- A separate multi-layer classification head takes the `[CLS]` token representation as input
- Predicts one of three sentiment labels:
Sentiment Labels:
✅ Positive → Favorable / approving sentiment
❌ Negative → Critical / disapproving sentiment
➖ Neutral → Factual / non-opinionated expression
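- The model struggles with sentences whose polarity is mixed at the span level, such as:
"The food was great, but the service was terrible."
Because the `[CLS]` head predicts a single label per sentence, it cannot assign "great" and "terrible" different polarities. This is a key challenge that motivates future work on span-level polarity.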
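The limitation can be illustrated with a few lines of Python. This is only a sketch of the failure mode under the assumption that one sentence-level label is copied onto every extracted expression; the function name and the example predictions are hypothetical.

```python
from typing import List, Tuple


def attach_sentence_polarity(expressions: List[str],
                             sentence_polarity: str) -> List[Tuple[str, str]]:
    """Pair every extracted expression with the single sentence-level label."""
    return [(expr, sentence_polarity) for expr in expressions]


# Hypothetical outputs for "The food was great, but the service was terrible."
expressions = ["great", "terrible"]
predicted = attach_sentence_polarity(expressions, "negative")
gold = [("great", "positive"), ("terrible", "negative")]

# Whichever single label is chosen, at most one of the two opinions can be right
mismatches = [p for p, g in zip(predicted, gold) if p != g]
print(mismatches)  # [('great', 'negative')]
```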
In the final stage, we combine CRF-predicted spans with polarity predictions to produce the complete structured opinion quadruple (holder, expression, target, polarity).
Post-processing pipeline:
- Convert span boundaries to character offsets using BERT's offset mapping with adjacent token merging
- Filter short spans (< 3 characters) to remove noise
- Assemble final output as JSON-formatted opinion tuples
⚠️ This filtering step loses a few percent of true predictions, because some genuine spans are shorter than three characters (for example, the one-character holder "I").
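The post-processing steps above could be sketched as follows. This is an illustration under stated assumptions rather than the project's pipeline: the word-level `offsets` stand in for a tokenizer's offset mapping, the helper names are hypothetical, and an omitted role is written as `[[], []]`, which is assumed to be how the data format represents a missing span.

```python
import json
from typing import List, Optional, Tuple

MIN_SPAN_CHARS = 3  # threshold from the filtering step above


def tokens_to_char_span(offsets: List[Tuple[int, int]],
                        start: int, end: int) -> Tuple[int, int]:
    """Merge the character offsets of tokens [start, end) into one span."""
    return offsets[start][0], offsets[end - 1][1]


def build_field(text: str, offsets: List[Tuple[int, int]],
                token_span: Optional[Tuple[int, int]]) -> List[List[str]]:
    """Format one role as [[span text], ["start:end"]]; empty if missing or too short."""
    if token_span is None:
        return [[], []]
    c_start, c_end = tokens_to_char_span(offsets, *token_span)
    if c_end - c_start < MIN_SPAN_CHARS:
        return [[], []]
    return [[text[c_start:c_end]], [f"{c_start}:{c_end}"]]


text = "I would not recommend this hotel"
# (start, end) character offsets per token, end exclusive
offsets = [(0, 1), (2, 7), (8, 11), (12, 21), (22, 26), (27, 32)]

opinion = {
    "Source": build_field(text, offsets, (0, 1)),            # "I" is filtered out
    "Target": build_field(text, offsets, (4, 6)),
    "Polar_expression": build_field(text, offsets, (1, 4)),
    "Polarity": "negative",
}
print(json.dumps(opinion, indent=2))
```

In this example the holder "I" falls below the three-character threshold and is dropped, which shows how the filter can remove a correct prediction.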
As a comparison point, we implement a Sequence Labeling + Relation Classification pipeline using BiLSTM models.
Step 1: Span Extraction
├── BiLSTM → Extract Holders
├── BiLSTM → Extract Targets
└── BiLSTM → Extract Polar Expressions
Step 2: Relation Prediction
└── BiLSTM + Max Pooling
├── Full text representation
├── Holder / Target representation
└── Expression representation
└── Concatenate → Linear → Sigmoid
→ Binary: has_relation? (threshold = 0.5)
Step 3: Assemble tuples → (holder, target, expression, polarity)
The baseline uses GloVe / FastText embeddings and trains three separate BiLSTMs, one per annotation type, followed by a relation prediction model.
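The relation-prediction step (max pooling, concatenation, linear layer, sigmoid, threshold) can be sketched in plain Python. The vectors, weights, and bias below are made-up numbers standing in for BiLSTM states and trained parameters, and the function names are hypothetical; whether the real baseline uses a strict or non-strict comparison at 0.5 is not stated in this document.

```python
import math
from typing import List, Sequence


def max_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over a sequence of equal-length vectors."""
    return [max(dim) for dim in zip(*vectors)]


def relation_probability(text_states, entity_states, expr_states,
                         weights: Sequence[float], bias: float) -> float:
    """Concatenate pooled representations, apply a linear layer and a sigmoid."""
    features = max_pool(text_states) + max_pool(entity_states) + max_pool(expr_states)
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def has_relation(prob: float, threshold: float = 0.5) -> bool:
    """Binary decision; a strict comparison is assumed here."""
    return prob > threshold


# Toy 2-dimensional states: full text, holder/target span, expression span
text_states = [[0.1, 0.4], [0.3, 0.2]]
entity_states = [[0.5, 0.1]]
expr_states = [[0.2, 0.9]]
weights = [1.0] * 6

prob = relation_probability(text_states, entity_states, expr_states, weights, bias=-1.0)
```

With these numbers the pooled features sum to 2.4, so the logit is 1.4 and the pair is classified as related.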
| Feature | Description |
|---|---|
| 🌍 Multilingual Support | Works across Norwegian, English, Spanish, Catalan, and Basque |
| 🏷️ BIO Sequence Labeling | Precise span-level identification using structured tagging |
| 🔗 CRF Decoding | Globally-consistent tag sequence prediction with structural constraints |
| 🎭 Polarity Classification | 3-class (Positive / Negative / Neutral) sentiment head |
| 🧩 Quadruple Extraction | Complete (holder, target, expression, polarity) output |
| 📊 Weighted F1 Evaluation | Partial-overlap scoring using token-level Jaccard intersection |
| 🔄 Cross-lingual Transfer | Train on English, evaluate on low-resource target languages |
| 📦 Modular Architecture | Encoder, CRF, and classifier heads are independently configurable |
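As a rough illustration of the partial-overlap scoring listed in the table, the overlap between a predicted and a gold span can be measured on their token index sets. This is a sketch of a Jaccard-style overlap only; the official `evaluate.py` may normalize or weight the overlap differently, and the function name is an assumption.

```python
from typing import Set


def token_overlap(pred: Set[int], gold: Set[int]) -> float:
    """Jaccard overlap between predicted and gold token index sets."""
    if not pred and not gold:
        return 1.0  # both empty: treated as a perfect match
    return len(pred & gold) / len(pred | gold)


# Predicted "this hotel" (tokens 4, 5) against gold "hotel" (token 5)
partial = token_overlap({4, 5}, {5})
exact = token_overlap({4, 5}, {4, 5})
missed = token_overlap({0}, {4, 5})
```

A partially correct span thus earns partial credit (0.5 here) instead of being scored as a complete miss.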
| Dataset | Language | Domain |
|---|---|---|
| `norec` | Norwegian | Professional reviews (multi-domain) |
| `opener_en` | English | Hotel reviews |
| `opener_es` | Spanish | Hotel reviews |
| `multibooked_ca` | Catalan | Hotel reviews |
| `multibooked_eu` | Basque | Hotel reviews |
| `darmstadt_unis` | English | University reviews (online) |
| `mpqa` | English | News (opinion annotations) |
Trained on high-resource English data, evaluated on:
| Test Dataset | Language |
|---|---|
| `opener_es` | Spanish |
| `multibooked_ca` | Catalan |
| `multibooked_eu` | Basque |
Each dataset is in JSON format with the following schema:
{
"sent_id": "opener/en/hotel/english00164-6",
"text": "Even though the price is decent for Paris, I would not recommend this hotel.",
"opinions": [
{
"Source": [["I"], ["44:45"]],
"Target": [["this hotel"], ["66:76"]],
"Polar_expression": [["would not recommend"], ["46:65"]],
"Polarity": "negative",
"Intensity": "average"
}
]
}

Average across all 7 datasets
| Metric | Score |
|---|---|
| SF1 (Sentiment F1) | 0.46 |
| SP (Sentiment Precision) | 0.62 |
| SR (Sentiment Recall) | 0.65 |
Per-dataset Breakdown:
| Dataset | SF1 | SP | SR |
|---|---|---|---|
| Opener_en | 0.41 | 0.37 | 0.47 |
| Opener_es | 0.35 | 0.33 | 0.38 |
| NoReC | 0.23 | 0.30 | 0.18 |
| Multibooked_ca | 0.57 | 0.53 | 0.63 |
| Multibooked_eu | 0.53 | 0.40 | 0.71 |
| darmstadt_unis | 0.55 | 0.58 | 0.00 |
| MPQA | 0.52 | 0.55 | 0.00 |
Average across 3 target language datasets
| Metric | Score |
|---|---|
| SF1 (Sentiment F1) | 0.35 |
| SP (Sentiment Precision) | 0.85 |
| SR (Sentiment Recall) | 0.63 |
Per-dataset Breakdown:
| Dataset | SF1 | SP | SR |
|---|---|---|---|
| Opener_es | 0.000 | 0.000 | 0.000 |
| Multibooked_ca | 0.481 | 0.461 | 0.503 |
| Multibooked_eu | 0.671 | 0.618 | 0.733 |
- 🟢 BERT significantly improves span extraction results over CRF baseline alone — especially for English and language-similar corpora
- 🟡 `Multibooked_eu` performs best in cross-lingual settings, likely due to the smaller dataset size and consistent hotel-review characteristics
- 🔴 Complex/ambiguous expressions (different polarity in different contexts) present a challenge across all datasets
- 📌 Character-level BERT representations outperform word-level representations as a comparison baseline
- Python ≥ 3.8
- PyTorch ≥ 1.9
- CUDA (optional, for GPU acceleration)
git clone https://github.com/your-username/Structured-Sentiment-Analysis-IIITD-NLP-PROJECT.git
cd Structured-Sentiment-Analysis-IIITD-NLP-PROJECT

pip install torch torchvision transformers
pip install nltk scikit-learn tqdm gensim

pip install -r baseline/sequence_labeling/requirements.txt

pip install -r data/requirements.txt

MPQA 2.0 - Download from the MPQA website and run:

cd data/mpqa && bash process_mpqa.sh

Darmstadt Service Review Corpus - Download from TU Darmstadt and run:

cd data/darmstadt_unis && bash process_darmstadt.sh

Run the official evaluation script on model predictions:

python evaluate.py <input_dir> <output_dir>

Where:
- `<input_dir>/res/` contains your `predictions.json` per dataset
- `<input_dir>/ref/data/` contains the gold test files
# Train all BiLSTM baseline models across datasets
cd baseline/sequence_labeling
bash get_baselines.sh

# Run inference on a specific dataset and split
python baseline/sequence_labeling/inference.py \
--DATADIR opener_en \
--FILE dev.json

Output will be saved to:
baseline/sequence_labeling/saved_models/relation_prediction/<DATADIR>/prediction.json
Prediction files must match the gold data format. Each entry should look like:
{
"sent_id": "unique-sentence-id",
"text": "Raw input sentence here.",
"opinions": [
{
"Source": [["holder text"], ["start:end"]],
"Target": [["target text"], ["start:end"]],
"Polar_expression": [["expression text"], ["start:end"]],
"Polarity": "positive"
}
]
}

Structured-Sentiment-Analysis-IIITD-NLP-PROJECT/
│
├── 📄 evaluate.py ← Official evaluation script (SF1 / SP / SR)
│
├── 📁 baseline/
│ └── sequence_labeling/
│ ├── extraction_module.py ← BiLSTM span extractor (Holder / Target / Expr)
│ ├── relation_prediction_module.py ← BiLSTM relation classifier
│ ├── inference.py ← End-to-end inference pipeline
│ ├── convert_to_bio.py ← Convert JSON data to BIO format
│ ├── convert_to_rels.py ← Convert predictions to relation pairs
│ ├── utils.py ← Data loading & vocabulary utilities
│ ├── WordVecs.py ← Pretrained word embedding loader
│ ├── get_baselines.sh ← Script to train all baseline models
│ └── requirements.txt
│
├── 📁 data/
│ ├── norec/ ← Norwegian multi-domain reviews
│ │ ├── train.json
│ │ ├── dev.json
│ │ └── test.json
│ ├── opener_en/ ← English hotel reviews
│ ├── opener_es/ ← Spanish hotel reviews
│ ├── multibooked_ca/ ← Catalan hotel reviews
│ ├── multibooked_eu/ ← Basque hotel reviews
│ ├── mpqa/ ← MPQA news corpus
│ ├── darmstadt_unis/ ← English university reviews
│ └── README.md ← Data format documentation
│
└── 📁 predictions/
└── norec/
└── predictions.json ← Sample model predictions
We identify several promising directions to build upon this work:
In the future, we would likely move to a dependency graph parsing approach for structured sentiment, augmenting each token's representation with that of its head in a dependency tree (Kurtz et al., 2020). This allows richer relational reasoning between opinion components.
- Exploring multi-task learning across languages to better leverage shared structure in multilingual sentiment graphs
- Joint training on all monolingual datasets with language-specific adapters
- Point Network (Samuel & Straka, 2020) — a strong graph parser for SSA
- PERIN — a permutation-invariant structured prediction framework
- Currently, polarity is predicted globally per sentence via the `[CLS]` token
- Moving to span-level polarity prediction would handle cases like "The food was great, but the service was terrible"
- Explore whether large pre-trained models (e.g., GPT-4, LLaMA) can directly predict structured opinion tuples via in-context learning or fine-tuning, without an explicit CRF layer
Barnes, J. et al. (2022). SemEval 2022 Task 10: Structured Sentiment Analysis.
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022).
Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
arXiv:1907.11692.
Kurtz, et al. (2020). Improving Low-Resource NMT through Relevance Based Linguistic Features.
ACL 2020.
Samuel, D. & Straka, M. (2020). ÚFAL at MRP 2020: Permutation-Invariant Semantic Parsing.
CoNLL 2020 Shared Task.
Made with ❤️ at IIIT Delhi | NLP Course Project
For questions or contributions, please open an issue or pull request.