This project develops a sentiment analysis model that uses transfer learning for cross-domain sentiment classification. The model is trained on one domain (e.g., product reviews) and then adapted to a different domain (e.g., social media posts) using only a small amount of labeled target data.
Sentiment analysis models often perform well in the domain they were trained on but struggle when applied to different domains with unique linguistic characteristics. This project addresses this challenge through transfer learning and domain adaptation techniques.
- Pre-trained BERT-based sentiment analysis model
- Transfer learning for cross-domain adaptation
- Multiple adaptation strategies (fine-tuning, gradual unfreezing, adversarial training)
- Comprehensive evaluation and comparison framework
- Visualization tools for model analysis
project/
├── data/
│   ├── source_domain/
│   │   ├── raw/          # Raw source domain data
│   │   └── processed/    # Processed source domain data
│   └── target_domain/
│       ├── raw/          # Raw target domain data
│       └── processed/    # Processed target domain data
├── models/
│   ├── baseline/         # Model trained on source domain
│   ├── adapted/          # Model adapted to target domain
│   └── target_only/      # Model trained only on target domain
├── src/
│   ├── preprocess.py     # Data preprocessing utilities
│   ├── train.py          # Model training functions
│   ├── adapt.py          # Domain adaptation techniques
│   ├── evaluate.py       # Model evaluation tools
│   └── utils.py          # General utility functions
├── notebooks/
│   ├── exploratory_analysis.ipynb   # Data exploration
│   └── model_training.ipynb         # Model training and evaluation
├── requirements.txt
└── README.md
- Python 3.8+
- TensorFlow 2.x or PyTorch
- Hugging Face Transformers library
- Recommended: CUDA-capable GPU for faster training
- Clone the repository:
  git clone https://github.com/aagams2910/Cross-Domain-Sentiment-Analysis-with-Transfer-Learning.git
  cd Cross-Domain-Sentiment-Analysis-with-Transfer-Learning
- Install dependencies:
  pip install -r requirements.txt
- Download and place datasets in the appropriate directories:
  - Place source domain data in data/source_domain/raw/
  - Place target domain data in data/target_domain/raw/
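Before copying datasets in, it can help to confirm that the expected folders exist. The following is a minimal sketch using only the standard library; `ensure_data_dirs` and `list_raw_csvs` are hypothetical helpers for illustration and are not part of `src/`.

```python
from pathlib import Path

# Directory names follow the project layout; the helpers are hypothetical.
DOMAINS = ("source_domain", "target_domain")
STAGES = ("raw", "processed")


def ensure_data_dirs(root="data"):
    """Create <root>/<domain>/<stage>/ folders if missing and return them."""
    created = []
    for domain in DOMAINS:
        for stage in STAGES:
            path = Path(root) / domain / stage
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


def list_raw_csvs(root="data"):
    """Map each domain to the sorted CSV file names found in its raw/ folder."""
    return {
        domain: sorted(p.name for p in (Path(root) / domain / "raw").glob("*.csv"))
        for domain in DOMAINS
    }
```

Calling `ensure_data_dirs()` from the repository root and then `list_raw_csvs()` after copying the files gives a quick check that each domain has at least one raw CSV.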
The expected format for data files is CSV with at least two columns:
- A text column (can be named: 'text', 'review', 'content', 'tweet', 'comment')
- A label column (can be named: 'label', 'sentiment', 'class', 'target')
For binary sentiment classification, labels should be binary (0/1, positive/negative, etc.).
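As an illustration of the format described above, the sketch below reads a CSV, picks the text and label columns from the accepted names, and normalizes labels to 0/1. It uses only the standard library and is not the logic in `src/preprocess.py`; the helper names and the `LABEL_MAP` (negative to 0, positive to 1) are assumptions.

```python
import csv
import io

# Column-name candidates come from the list above; the label mapping is an assumption.
TEXT_COLUMNS = ("text", "review", "content", "tweet", "comment")
LABEL_COLUMNS = ("label", "sentiment", "class", "target")
LABEL_MAP = {"0": 0, "1": 1, "negative": 0, "positive": 1}


def find_column(fieldnames, candidates):
    """Return the first header matching a candidate name (case-insensitive)."""
    lowered = {name.strip().lower(): name for name in (fieldnames or [])}
    for candidate in candidates:
        if candidate in lowered:
            return lowered[candidate]
    raise ValueError(f"none of {candidates} found in {list(lowered)}")


def read_examples(handle):
    """Yield (text, label) pairs with labels normalized to 0/1."""
    reader = csv.DictReader(handle)
    text_col = find_column(reader.fieldnames, TEXT_COLUMNS)
    label_col = find_column(reader.fieldnames, LABEL_COLUMNS)
    for row in reader:
        raw = row[label_col].strip().lower()
        if raw not in LABEL_MAP:
            raise ValueError(f"unrecognized label: {row[label_col]!r}")
        yield row[text_col], LABEL_MAP[raw]


# In-memory sample standing in for a real file.
sample = io.StringIO("review,sentiment\nGreat phone,positive\nBroke in a week,negative\n")
examples = list(read_examples(sample))
```

Datasets with other label encodings would need their own entries in the mapping before they pass this check.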
To preprocess the datasets:
from src.preprocess import DataPreprocessor
preprocessor = DataPreprocessor(max_length=128, tokenizer_name="bert-base-uncased")
preprocessor.load_and_preprocess("data/source_domain/raw/source_data.csv", domain="source")
preprocessor.load_and_preprocess("data/target_domain/raw/target_data.csv", domain="target")
To train a model on the source domain:
from src.train import SentimentModelTrainer
trainer = SentimentModelTrainer(model_name="bert-base-uncased", num_labels=2)
model, history = trainer.train_source_model(train_dataset, val_dataset, epochs=3, batch_size=16)
To adapt a source-trained model to the target domain:
from src.train import SentimentModelTrainer
trainer = SentimentModelTrainer(model_name="bert-base-uncased", num_labels=2)
adapted_model, history = trainer.adapt_model(
    source_model, target_train_dataset, target_val_dataset,
    epochs=3, batch_size=16, strategy='fine_tune'
)
To evaluate model performance:
from src.evaluate import ModelEvaluator
evaluator = ModelEvaluator()
metrics, predictions = evaluator.evaluate_model(model, test_dataset, model_name="my_model")
print(f"Accuracy: {metrics['accuracy']}, F1 score: {metrics['f1']}")
The project implements several transfer learning strategies:
- Fine-tuning: Further train the entire pre-trained model on the target domain data
- Gradual Unfreezing: Gradually unfreeze layers of the model during adaptation
- Adversarial Training: Use domain-adversarial training to learn domain-invariant features
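To make the last two strategies more concrete, here is a framework-independent sketch of the schedules they typically rely on. It is not taken from `src/adapt.py`. The function names, the top-down unfreezing order, and `layers_per_epoch=4` are assumptions chosen for illustration. The 12-layer default matches `bert-base-uncased`, and the adversarial weight uses the sigmoid-shaped ramp `2 / (1 + exp(-gamma * p)) - 1` that is common in domain-adversarial training.

```python
import math


def trainable_layers(epoch, num_layers=12, layers_per_epoch=4):
    """Indices of encoder layers that are trainable at a given epoch (0-based).

    Unfreezing starts at the top of the stack, closest to the classifier,
    and moves toward the embeddings as training progresses.
    """
    unfrozen = min(num_layers, (epoch + 1) * layers_per_epoch)
    return list(range(num_layers - unfrozen, num_layers))


def adversarial_weight(progress, gamma=10.0):
    """Weight on the domain-classifier loss, ramping from 0 toward 1.

    `progress` is the fraction of training completed, in [0, 1].
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


# Show which layers would be unfrozen in each of three adaptation epochs.
for epoch in range(3):
    print(epoch, trainable_layers(epoch))
```

In a training loop, the indices returned by `trainable_layers` would decide which encoder layers have gradients enabled for that epoch, and `adversarial_weight` would scale the domain-classifier loss so that early, noisy domain predictions have little influence on the shared features.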
The project includes Jupyter notebooks for:
- Exploratory data analysis of source and target domains
- Model training, adaptation, and evaluation
Suggested datasets for experimentation:
- IMDb Movie Reviews (sentiment analysis on movie reviews)
- Amazon Product Reviews (sentiment analysis on product reviews)
- Twitter Sentiment140 (sentiment analysis on tweets)
- Reddit Comments (sentiment analysis on social media posts)