Automated pipeline for extracting and visualizing conversation flows with metadata from customer service dialogues.
This repository extends the original Dialog2Flow methodology with:
- Semantic search to find relevant conversations
- Metadata preservation (escalation levels, churn risk, empathy scores, etc.)
- Speaker-separated clustering (Agent vs Customer)
- LLM-based cluster labeling using Ollama
- Interactive directed graph visualizations
- Comprehensive metadata tracking at utterance and cluster levels
```bash
# 1. Clone and navigate
git clone <your-repo-url>
cd dialog2flow

# 2. Run automated setup
bash SETUP_ENVIRONMENT.sh

# 3. Run pipeline
python3 integrated_pipeline.py \
    --query "escalation issues" \
    --domain "Banking" \
    --distance-threshold 0.4 \
    --formats json graphml html \
    -l -lm llama3:8b
```

Results: Open `output/graph_visualization.html` in your browser to see the interactive graph!
- Find Top Conversations → Semantic search across your dataset
- Prepare Data → Convert to Dialog2Flow format with metadata
- Extract Trajectories → Cluster utterances (Agent & Customer separately)
- Generate Labels → Use LLM to name each cluster
- Build Graphs → Create directed flow graphs with metadata
- Visualize → Interactive HTML graphs with hover tooltips
Input:
- Query: "escalation issues"
- Domain: "Banking"
- 3 conversations, 115 utterances
Output:
- 40 clusters (21 Agent + 19 Customer)
- All clusters labeled by LLM:
  - "Agent: Identify self and request assistance"
  - "Customer: Compare financial services"
  - "Agent: Request fee waiver"
- Directed graph showing conversation flow
- Metadata: escalation levels, churn risk, empathy scores
- Python 3.10+
- Ollama (for LLM cluster labeling)
- NVIDIA GPU (optional, for faster processing)
```bash
bash SETUP_ENVIRONMENT.sh
```

This script handles everything: virtual environment, dependencies, Ollama, and model downloads.
```bash
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install faiss-cpu ollama networkx

# Install and start Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull llama3:8b
```

See SETUP.md for detailed instructions.
Basic usage:

```bash
python3 integrated_pipeline.py \
    --query "your search query" \
    --domain "Banking" \
    --distance-threshold 0.4
```

Full pipeline with LLM labels:

```bash
python3 integrated_pipeline.py \
    --query "escalation issues" \
    --domain "Banking" \
    --distance-threshold 0.4 \
    --formats json graphml html \
    -l \
    -lm llama3:8b
```

All options (reference):

```bash
python3 integrated_pipeline.py
    --query "escalation issues"                         # Search query
    --domain "Banking"                                  # Filter by domain
    --distance-threshold 0.4                            # Clustering threshold (0.3-0.6)
    --formats json graphml html                         # Output formats
    --model sergioburdisso/dialog2flow-joint-bert-base  # Embedding model
    -l                                                  # Enable LLM labels
    -lm llama3:8b                                       # LLM model
    --output-dir ./output                               # Output directory
```

Your data may include domains such as Banking, Flight, Hotel, Retail, Telecom, Insurance, etc.
Check available domains:

```bash
python3 -c "
import json
with open('data/final_json_for_d2f.json') as f:
    domains = sorted(set(t.get('domain', '') for t in json.load(f)))
print('\n'.join(domains))
"
```

Output structure:

```
output/
├── trajectories_with_metadata.json   # Extracted trajectories + clusters
├── graph_with_metadata.json          # Graph in JSON format
├── graph_with_metadata.graphml       # Graph in GraphML format
├── graph_visualization.html          # Interactive visualization
└── graph_dialog2flow/
    ├── graph.graphml                 # Dialog2Flow format
    ├── graph.png                     # PNG visualization
    └── visualization/graph.html      # Alternative HTML view
```
Bonus: Top 20 conversations saved in data/top_K/ with full metadata.
- Uses `sentence-transformers/all-mpnet-base-v2` for embeddings
- FAISS index for fast similarity search
- Finds top K most relevant conversations for any query
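At its core, this retrieval step is nearest-neighbor search over normalized utterance embeddings. A minimal NumPy sketch (toy vectors stand in for the real sentence-transformer embeddings and FAISS index; `top_k` is a hypothetical helper, not the pipeline's API):

```python
# Minimal sketch of top-K retrieval by cosine similarity. The real
# pipeline embeds utterances with all-mpnet-base-v2 and searches a
# FAISS index; tiny hand-made vectors stand in for embeddings here.
import numpy as np

def top_k(query_vec, corpus_vecs, k=2):
    """Return (indices, scores) of the k nearest rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity per utterance
    order = np.argsort(-scores)[:k]    # best matches first
    return order, scores[order]

corpus = np.array([[1.0, 0.0],   # e.g. "dispute this overdraft fee"
                   [0.9, 0.1],   # e.g. "transfer you to a supervisor"
                   [0.0, 1.0]])  # e.g. "current mortgage rate"
query = np.array([1.0, 0.05])    # e.g. "escalation issues"

ids, scores = top_k(query, corpus)
print(ids)  # indices of the two most similar utterances
```

With L2-normalized vectors, inner product equals cosine similarity, which is exactly why a FAISS `IndexFlatIP` over normalized embeddings works at scale.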
- Utterance-level: `escalation_level`, `churn_risk_score`, `empathy_score`, `intents_emotions`, `dialogue_acts`, `action_type`, `escalation_reason_tags`
- Cluster-level: aggregated statistics (mean, std, min, max)
- Full tracking: transcript_id + turn_idx for every utterance
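The cluster-level aggregation above can be sketched as follows; the sample records are invented and `aggregate` is an illustrative helper, not the pipeline's actual function:

```python
# Sketch of per-cluster metadata aggregation: a numeric utterance
# field is summarized per cluster with mean/std/min/max. Field name
# matches the utterance-level metadata; records are made up.
import statistics

utterances = [
    {"cluster_id": "A0", "churn_risk_score": 0.8},
    {"cluster_id": "A0", "churn_risk_score": 0.6},
    {"cluster_id": "A1", "churn_risk_score": 0.1},
]

def aggregate(utts, field):
    by_cluster = {}
    for u in utts:
        by_cluster.setdefault(u["cluster_id"], []).append(u[field])
    return {
        cid: {"mean": statistics.mean(vals),
              "std": statistics.pstdev(vals),
              "min": min(vals),
              "max": max(vals)}
        for cid, vals in by_cluster.items()
    }

stats = aggregate(utterances, "churn_risk_score")
print(stats["A0"])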
- Agent and Customer utterances clustered separately
- Prevents mixing of Agent/Customer in same cluster
- Follows Dialog2Flow methodology
- Uses Ollama (llama3:8b, mistral, gemma, etc.)
- Generates canonical labels for each cluster
- Example: "Agent: Identify self and request assistance"
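Labeling amounts to prompting the local model with a cluster's utterances and asking for one canonical action. A sketch with the `ollama` Python client (`build_prompt` is a hypothetical helper, not the pipeline's actual prompt; the live call is gated because it needs `ollama serve` running):

```python
# Build a labeling prompt for one cluster and (optionally) send it to
# a local Ollama model. `build_prompt` is illustrative only.
def build_prompt(speaker, cluster_utterances):
    joined = "\n".join(f"- {u}" for u in cluster_utterances)
    return (
        f"These {speaker} utterances belong to one cluster:\n{joined}\n"
        f'Reply with a short canonical label, e.g. "{speaker}: Request fee waiver".'
    )

prompt = build_prompt("Agent", [
    "Hi, my name is Dana. How can I help you today?",
    "Hello, this is Sam speaking. What can I do for you?",
])

RUN_LIVE = False  # flip on with a running `ollama serve`
if RUN_LIVE:
    import ollama
    reply = ollama.chat(model="llama3:8b",
                        messages=[{"role": "user", "content": prompt}])
    print(reply["message"]["content"])
```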
- NetworkX DiGraph with visual arrow markers
- Interactive D3.js visualization
- Hover tooltips show full metadata
- Node size reflects number of utterances
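Graph construction reduces to counting cluster-to-cluster transitions within each conversation. A NetworkX sketch (the trajectories and attribute names are illustrative, not the pipeline's exact schema):

```python
# Build a directed flow graph: nodes are clusters, edge weights count
# observed transitions, and each node stores its utterance count
# (which drives node size in the HTML view). Trajectories are made up.
from collections import Counter
import networkx as nx

trajectories = {
    "conv_1": ["A0", "C0", "A1", "C1"],
    "conv_2": ["A0", "C0", "A2"],
}

counts = Counter(c for path in trajectories.values() for c in path)

G = nx.DiGraph()
for cluster, n in counts.items():
    G.add_node(cluster, utterances=n)
for path in trajectories.values():
    for src, dst in zip(path, path[1:]):
        w = G.edges[src, dst]["weight"] + 1 if G.has_edge(src, dst) else 1
        G.add_edge(src, dst, weight=w)

print(G.number_of_nodes(), G.number_of_edges())
```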
After running the pipeline, verify outputs:
```bash
# Check graph statistics
python3 -c "
import json
with open('output/graph_with_metadata.json') as f:
    data = json.load(f)
print(f'Nodes: {len(data[\"nodes\"])}')
print(f'Edges: {len(data[\"edges\"])}')
"

# Check LLM labels
python3 -c "
import json
with open('output/trajectories_with_metadata.json') as f:
    data = json.load(f)
labels = data.get('cluster_labels', {})
print(f'Labels: {len(labels)}')
for cid, label in list(labels.items())[:5]:
    print(f'  {cid}: {label}')
"

# Open visualization
xdg-open output/graph_visualization.html   # Linux
# open output/graph_visualization.html     # macOS
```

- SETUP.md - Detailed setup guide with troubleshooting
- PAPER.md - Research paper reference
- FILES_ANALYSIS.txt - File structure analysis
```
Step 1: Find Top 20 Conversations
├── Load data from data/final_json_for_d2f.json
├── Filter by domain
├── Create utterance embeddings
├── Build FAISS index
├── Search with query
└── Save top 20 to data/top_K/

Step 2: Prepare for Dialog2Flow
├── Convert to simplified text format
└── Save metadata to data/example/

Step 3: Extract Trajectories
├── Load Dialog2Flow model
├── Cluster Agent utterances separately
├── Cluster Customer utterances separately
├── Generate LLM labels (if enabled)
├── Aggregate metadata per cluster
└── Save to output/trajectories_with_metadata.json

Step 4: Build Graphs
├── Build metadata-enhanced graph
├── Build Dialog2Flow action flow graph
├── Export as JSON, GraphML, HTML
└── Generate interactive visualizations
```
This work extends the Dialog2Flow methodology:
```bibtex
@inproceedings{burdisso2024dialog2flow,
  title={Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction},
  author={Burdisso, Sergio and Madikeri, Srikanth and Motlicek, Petr},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```

Original repository: https://github.com/sergioburdisso/dialog2flow
```bash
# Start Ollama
ollama serve &

# Verify
ollama list
```

```bash
# Download models
python3 -c "
from sentence_transformers import SentenceTransformer
SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
SentenceTransformer('sergioburdisso/dialog2flow-joint-bert-base')
"
```

```bash
# Check available domains
python3 -c "
import json
with open('data/final_json_for_d2f.json') as f:
    domains = set(t.get('domain') for t in json.load(f))
print(sorted(domains))
"
```

See SETUP.md for more troubleshooting.
MIT License - see LICENSE for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For questions or issues:
- Open an issue on GitHub
- Check SETUP.md for troubleshooting
- Review existing issues for solutions
- Support for more LLM providers (OpenAI, Anthropic, etc.)
- Multi-domain comparison graphs
- Time-based flow analysis
- Export to Gephi/Cytoscape formats
- Real-time conversation analysis API
Happy analyzing!