A research-driven project that blends classic ML for large-scale review analysis with a multi-agent LLM layer (LangGraph + OpenAI) to simulate a virtual user-board session. Result: product teams get high-quality, data-grounded insights in hours—not weeks—slashing the time and cost of early-stage user research.
Read the full article here: https://open.substack.com/pub/vvk93/p/from-reviews-to-roadmap-building?r=t37oj&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
```
.
├── Vladimir_Kovtunovskiy/homework2-userboard-simulation/
│   ├── cluster_outputs/            # Output directory for review clustering results
│   │   └── clusters_data.json      # JSON file containing clustered review data and keywords
│   ├── multiagent_outputs/         # Output directory for the user board simulation
│   │   ├── board_session.log       # Detailed log file for the simulation run
│   │   └── userboard_report.md     # Final markdown report summarizing the simulation
│   ├── data_types.py               # Defines shared data structures (Persona, FeatureProposal)
│   ├── persona_generator.py        # Generates user personas from cluster data using an LLM
│   ├── review_prep_pipeline.py     # Processes, cleans, embeds, and clusters user reviews
│   ├── board_simulation.py         # Core logic for simulating the multi-agent discussion
│   ├── userboard_pipeline.py       # Main script orchestrating the pipeline (clustering -> personas -> simulation)
│   ├── requirements.txt            # Python package dependencies
│   ├── spotify_reviews.csv         # Input dataset of Spotify user reviews
│   └── README.md                   # This file
└── ... (other project files/folders)
```
- `spotify_reviews.csv`: The raw input data containing user reviews for Spotify.
- `requirements.txt`: Lists the Python libraries required to run the project.
- `data_types.py`: Contains Python `dataclass` definitions for `Persona` and `FeatureProposal`, ensuring consistent data handling across modules.
- `review_prep_pipeline.py`:
  - Loads reviews from `spotify_reviews.csv`.
  - Cleans and preprocesses the review text.
  - Calculates sentiment scores (positive, neutral, negative).
  - Generates text embeddings using Sentence Transformers (optimized for MPS).
  - Performs dimensionality reduction using UMAP.
  - Clusters the reviews using K-Means, determining the optimal `k`.
  - Extracts relevant keywords for each cluster using TF-IDF.
  - Saves the clustering results, including keywords and sample reviews, to `cluster_outputs/clusters_data.json`.
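To illustrate the TF-IDF keyword step, here is a minimal dependency-free sketch. `top_keywords_per_cluster` is a hypothetical helper written for this README, not the pipeline's actual code, which would typically use scikit-learn:

```python
import math
from collections import Counter

def top_keywords_per_cluster(clusters, top_n=5):
    """Rank words per cluster by TF-IDF, treating each cluster's
    concatenated reviews as one document. A simplified stand-in for
    the pipeline's keyword-extraction step."""
    docs = [" ".join(reviews).lower().split() for reviews in clusters.values()]
    n_docs = len(docs)
    # Document frequency: number of clusters each word appears in.
    df = Counter(w for doc in docs for w in set(doc))
    result = {}
    for name, doc in zip(clusters, docs):
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        result[name] = [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
    return result
```

Words that dominate one cluster but are rare elsewhere (e.g. "ads" in an ads-complaint cluster) rank highest, which is why TF-IDF keywords make good cluster labels.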
- `persona_generator.py`:
  - Reads the `clusters_data.json` file.
  - Uses an LLM (e.g., GPT-4) to generate distinct user `Persona` objects based on the characteristics and feedback within selected clusters.
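The LLM call itself requires API access, but the prompt construction can be sketched. `build_persona_prompt`, its wording, and the assumed cluster schema (`keywords` plus `sample_reviews`) are illustrative assumptions, not the module's actual code:

```python
def build_persona_prompt(cluster: dict) -> str:
    """Turn one entry from clusters_data.json into a persona-generation
    prompt. The cluster schema shown here (keywords + sample_reviews)
    is an assumption about the JSON layout."""
    keywords = ", ".join(cluster.get("keywords", []))
    samples = "\n".join(f"- {r}" for r in cluster.get("sample_reviews", []))
    return (
        "You are helping with user research for a music-streaming app.\n"
        f"Cluster keywords: {keywords}\n"
        f"Sample reviews:\n{samples}\n"
        "Create one realistic user persona (name, goals, frustrations) "
        "grounded strictly in this feedback."
    )
```

Grounding the prompt in cluster keywords and verbatim sample reviews is what keeps the generated personas tied to real user feedback rather than LLM stereotypes.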
- `board_simulation.py`:
  - Takes generated `Persona` objects and proposed `FeatureProposal`s as input.
  - Initializes AI agents for each persona, plus a facilitator agent, using LangChain/LangGraph.
  - Simulates a structured discussion over several rounds, where the facilitator asks questions about the features and the personas respond based on their profiles.
  - Captures the entire discussion transcript.
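Stripped of the LangChain/LangGraph machinery, the round structure reduces to a loop like the following sketch, where `respond` stands in for the LLM-backed persona agent:

```python
def run_discussion(personas, features, respond, rounds=2):
    """Simulate facilitator-led rounds. respond(persona, question) is a
    stand-in for the per-persona LLM agent call; the real module wires
    this through LangChain/LangGraph agents instead."""
    transcript = []
    for rnd in range(1, rounds + 1):
        for feature in features:
            question = f"Round {rnd}: What do you think of '{feature}'?"
            transcript.append(("Facilitator", question))
            for persona in personas:
                transcript.append((persona, respond(persona, question)))
    return transcript
```

Injecting `respond` as a parameter also makes the loop testable with a stub before any API keys are involved.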
- `userboard_pipeline.py`:
  - Acts as the main entry point and orchestrator.
  - Sequentially runs persona generation, feature ideation (based on clusters), and the board simulation. (Note: it currently relies on a pre-existing `clusters_data.json` rather than running the review preparation step itself.)
  - Uses LangGraph to manage the state and flow between these steps.
  - Generates a final summary report (`userboard_report.md`) in the `multiagent_outputs` directory.
- `cluster_outputs/`: Directory where the output of `review_prep_pipeline.py` is stored.
- `multiagent_outputs/`: Directory where the outputs of `userboard_pipeline.py` (logs and the final report) are stored.
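Conceptually, the LangGraph orchestration threads a shared state through each step. Ignoring the framework, the flow looks roughly like the sketch below; the step functions are placeholders, not the pipeline's real implementations:

```python
def run_pipeline(state: dict, steps) -> dict:
    """Thread a shared state dict through an ordered list of
    (name, step) pairs, mimicking how LangGraph passes state between
    nodes. Each step returns a partial state update."""
    for name, step in steps:
        state = {**state, **step(state)}
    return state

# Hypothetical step functions mirroring the stages described above:
steps = [
    ("personas", lambda s: {"personas": [f"persona for {c}" for c in s["clusters"]]}),
    ("features", lambda s: {"features": [f"fix {c}" for c in s["clusters"]]}),
    ("report",   lambda s: {"report": f"{len(s['personas'])} personas discussed {len(s['features'])} features"}),
]
```

LangGraph adds conditional edges, retries, and checkpointing on top of this basic state-passing pattern.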
- Clone the repository:

  ```shell
  git clone <your-repo-url>
  cd <your-repo-directory>/board-simulation
  ```

- Create a virtual environment (recommended):

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Download NLTK data (if not already present). Run Python and execute:

  ```python
  import nltk
  nltk.download('vader_lexicon')
  nltk.download('stopwords')
  ```

- Set up your OpenAI API key. Create a `.env` file in the `BoardSimulation` directory and add:

  ```
  OPENAI_API_KEY='your_openai_api_key_here'
  ```

  Alternatively, set it as an environment variable.
The main pipeline can be executed by running the `userboard_pipeline.py` script:

```shell
python userboard_pipeline.py
```

This script will:

- Load cluster data from `cluster_outputs/clusters_data.json`. (Note: it assumes this file exists; the pipeline does not trigger clustering automatically, so run `review_prep_pipeline.py` separately first if needed.)

  ```shell
  # To generate cluster data (if needed):
  python review_prep_pipeline.py
  ```

- Select top clusters based on negative sentiment.
- Generate feature ideas based on the selected clusters using an LLM.
- Generate user personas based on the selected clusters using an LLM.
- Run the board simulation with the generated personas and features.
- Generate a summary report (`userboard_report.md`) and log file (`board_session.log`) in the `multiagent_outputs` directory.
You can also run the `review_prep_pipeline.py` script independently if you only need to perform the review clustering:

```shell
# Uses default input ./spotify_reviews.csv and output ./cluster_outputs/
python review_prep_pipeline.py

# Specify input/output
python review_prep_pipeline.py --csv path/to/reviews.csv --out path/to/output_dir
```

- AI Agent Simulation: Leverages LLMs to simulate realistic user personas and discussions.
- Data-Driven Personas: Personas are grounded in real user feedback clusters.
- Automated Insight Generation: Streamlines the process of understanding user sentiment and potential feature reception.
- Modular Pipeline: Code is organized into distinct, reusable modules.
- MPS Acceleration: Utilizes Apple Silicon GPUs for faster embedding generation in the clustering pipeline.
- LangGraph Orchestration: Uses LangGraph for managing the multi-step simulation pipeline state.
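The `--csv`/`--out` flags used when running `review_prep_pipeline.py` standalone could be wired with `argparse` roughly as follows; this is a sketch of the interface, not the script's actual code:

```python
import argparse

def parse_args(argv=None):
    """Mirror the CLI shown in the usage section: --csv for the input
    reviews file, --out for the clustering output directory. Defaults
    match the README; the real script's option handling may differ."""
    parser = argparse.ArgumentParser(description="Cluster app-store reviews")
    parser.add_argument("--csv", default="./spotify_reviews.csv",
                        help="Path to the input reviews CSV")
    parser.add_argument("--out", default="./cluster_outputs",
                        help="Directory for clustering results")
    return parser.parse_args(argv)
```

Passing `argv=None` lets `argparse` fall back to `sys.argv`, while tests can supply an explicit list of flags.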