Priority 1
- Use a morphological analyzer for preprocessing before building the bag-of-words (BoW) representation.
- Use n-grams at the character, word, and syntactic (POS n-gram) levels.
- Use TF-IDF or TF-IGF
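
A minimal sketch of the n-gram and TF-IDF items above, using only the standard library. The helper names (`char_ngrams`, `word_ngrams`, `tf_idf`) and the toy corpus are made up for illustration, and the weighting shown is the plain unsmoothed variant (relative term frequency times log(N / df)), which is only one of several common definitions. POS n-grams would reuse `word_ngrams` on a tag sequence produced by whatever tagger is chosen.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return overlapping n-grams over any token sequence (words or POS tags)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Weight features per document.

    docs: list of feature lists (one list per document).
    Returns one {feature: weight} dict per document, where
    weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for feats in docs:
        df.update(set(feats))  # count each feature once per document
    weights = []
    for feats in docs:
        counts = Counter(feats)
        total = sum(counts.values()) or 1  # avoid division by zero on empty docs
        weights.append(
            {f: (c / total) * math.log(n_docs / df[f]) for f, c in counts.items()}
        )
    return weights


# Toy corpus, invented for illustration only.
texts = ["the cat sat on the mat", "the dog sat on the log"]
docs = [word_ngrams(t.split(), 2) for t in texts]
weights = tf_idf(docs)
```

With this weighting, bigrams shared by every document (such as `("sat", "on")` here) get weight 0, while bigrams unique to one document get a positive weight.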
Priority 2
- Clustering according to word level and phrase level.
- Comparing topical distribution (LDA, BERTopic)
- Trying the syntactic language model (probabilistic context-free grammars)
- Comparing the sentiment
- Cluster minima and maxima using vector semantics
- Experimenting with attention
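
One possible way to make "comparing topical distribution" concrete is a divergence between two per-author topic vectors. The sketch below uses Jensen-Shannon divergence, which is an assumption on my part and not something the notes commit to. The input vectors are assumed to be probability distributions of equal length (for example, per-document topic proportions from LDA), and the numbers are invented.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two equal-length distributions.

    Symmetric, and bounded in [0, 1] when the logarithm is base 2.
    """
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic proportions for two authors over three topics.
author_a = [0.7, 0.2, 0.1]
author_b = [0.1, 0.3, 0.6]
score = js_divergence(author_a, author_b)
```

A score of 0 means identical topic mixtures; values closer to 1 mean the two authors concentrate on largely different topics.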
Priority 3
- Analyzing the different styles that capture people's attention or influence them.
- Explainable AI
- Local Explanations: Techniques like LIME approximate the behavior of complex models with simpler, interpretable models to highlight factors influencing a specific prediction.
- Example-Based Explanations: These explanations show similar examples from the data to provide context and help understand the model's actions.
- Feature Importance: Methods like SHAP use game theory to assign importance values to input features, showing which ones most influenced a decision.
- Counterfactuals: These explanations describe the minimum changes to the input that would lead to a different outcome, helping users understand how to alter a decision.
- Trying language diffusion models in stylometry (if LLaDA models destroy the syntactic structure)
- Try Anomaly Detection
- Words are similar in certain dimensions. Can I focus on certain dimensions while clustering?
- How to identify those dimensions?
- Vector semantics and embeddings from SLP
- Stylometry Analysis
- Appendix G in SLP
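
On the question of focusing on certain dimensions while clustering: one simple option is to compute similarity only over a chosen subset of embedding dimensions and feed that into the clustering step. The sketch below shows the similarity part only. `cosine_on_dims` is a hypothetical helper, the vectors are toy values, and how to pick the dimensions is left open, as in the notes.

```python
import math


def cosine_on_dims(u, v, dims=None):
    """Cosine similarity of u and v restricted to the index set `dims`.

    dims=None uses every dimension. Returns 0.0 if either restricted
    vector has zero length.
    """
    if dims is None:
        dims = range(len(u))
    dims = list(dims)
    dot = sum(u[i] * v[i] for i in dims)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in dims))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in dims))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors: they agree on dimension 0 and disagree on dimension 2.
u = [1.0, 0.0, 5.0]
v = [1.0, 0.0, -5.0]
full = cosine_on_dims(u, v)             # uses all three dimensions
partial = cosine_on_dims(u, v, [0, 1])  # ignores dimension 2
```

Here the two vectors look dissimilar over the full space but identical once dimension 2 is masked out, which is the effect the question is after.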