Dialog2Flow: Metadata-Enhanced Conversation Flow Analysis

Python 3.10+ | License: MIT

Automated pipeline for extracting and visualizing conversation flows with metadata from customer service dialogues.

This repository extends the original Dialog2Flow methodology with:

  • ✅ Semantic search to find relevant conversations
  • ✅ Metadata preservation (escalation levels, churn risk, empathy scores, etc.)
  • ✅ Speaker-separated clustering (Agent vs Customer)
  • ✅ LLM-based cluster labeling using Ollama
  • ✅ Interactive directed graph visualizations
  • ✅ Comprehensive metadata tracking at utterance and cluster levels

🚀 Quick Start (3 Steps)

# 1. Clone and navigate
git clone <your-repo-url>
cd dialog2flow

# 2. Run automated setup
bash SETUP_ENVIRONMENT.sh

# 3. Run pipeline
python3 integrated_pipeline.py \
  --query "escalation issues" \
  --domain "Banking" \
  --distance-threshold 0.4 \
  --formats json graphml html \
  -l -lm llama3:8b

Results: Open output/graph_visualization.html in your browser to see the interactive graph!


📋 What This Pipeline Does

  1. Find Top Conversations → Semantic search across your dataset
  2. Prepare Data → Convert to Dialog2Flow format with metadata
  3. Extract Trajectories → Cluster utterances (Agent & Customer separately)
  4. Generate Labels → Use LLM to name each cluster
  5. Build Graphs → Create directed flow graphs with metadata
  6. Visualize → Interactive HTML graphs with hover tooltips

📊 Example Output

Input:

  • Query: "escalation issues"
  • Domain: "Banking"
  • 3 conversations, 115 utterances

Output:

  • 40 clusters (21 Agent + 19 Customer)
  • All clusters labeled by LLM:
    • "Agent: Identify self and request assistance"
    • "Customer: Compare financial services"
    • "Agent: Request fee waiver"
  • Directed graph showing conversation flow
  • Metadata: escalation levels, churn risk, empathy scores

πŸ› οΈ Installation

Prerequisites

  • Python 3.10+
  • Ollama (for LLM cluster labeling)
  • NVIDIA GPU (optional, for faster processing)

Automated Setup (Recommended)

bash SETUP_ENVIRONMENT.sh

This script handles everything: virtual environment, dependencies, Ollama, and model downloads.

Manual Setup

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install faiss-cpu ollama networkx

# Install and start Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull llama3:8b

See SETUP.md for detailed instructions.


📖 Usage

Basic Command

python3 integrated_pipeline.py \
  --query "your search query" \
  --domain "Banking" \
  --distance-threshold 0.4

With LLM Labels (Recommended)

python3 integrated_pipeline.py \
  --query "escalation issues" \
  --domain "Banking" \
  --distance-threshold 0.4 \
  --formats json graphml html \
  -l \
  -lm llama3:8b

All Options

python3 integrated_pipeline.py \
  --query "escalation issues" \
  --domain "Banking" \
  --distance-threshold 0.4 \
  --formats json graphml html \
  --model sergioburdisso/dialog2flow-joint-bert-base \
  -l \
  -lm llama3:8b \
  --output-dir ./output

# --query              Search query
# --domain             Filter by domain
# --distance-threshold Clustering threshold (0.3-0.6)
# --formats            Output formats
# --model              Embedding model
# -l                   Enable LLM labels
# -lm                  LLM model
# --output-dir         Output directory

Available Domains

Your data may include: Banking, Flight, Hotel, Retail, Telecom, Insurance, etc.

Check available domains:

python3 -c "
import json
with open('data/final_json_for_d2f.json') as f:
    domains = sorted(set(t.get('domain', '') for t in json.load(f)))
print('\n'.join(domains))
"

πŸ“ Output Files

output/
├── trajectories_with_metadata.json     # Extracted trajectories + clusters
├── graph_with_metadata.json            # Graph in JSON format
├── graph_with_metadata.graphml         # Graph in GraphML format
├── graph_visualization.html            # Interactive visualization ⭐
└── graph_dialog2flow/
    ├── graph.graphml                   # Dialog2Flow format
    ├── graph.png                       # PNG visualization
    └── visualization/graph.html        # Alternative HTML view

Bonus: Top 20 conversations saved in data/top_K/ with full metadata.


🎯 Core Features

1. Semantic Search

  • Uses sentence-transformers/all-mpnet-base-v2 for embeddings
  • FAISS index for fast similarity search
  • Finds top K most relevant conversations for any query

2. Metadata Preservation

  • Utterance-level: escalation_level, churn_risk_score, empathy_score, intents_emotions, dialogue_acts, action_type, escalation_reason_tags
  • Cluster-level: Aggregated statistics (mean, std, min, max)
  • Full tracking: transcript_id + turn_idx for every utterance

3. Speaker-Separated Clustering

  • Agent and Customer utterances clustered separately
  • Prevents mixing of Agent/Customer in same cluster
  • Follows Dialog2Flow methodology

4. LLM Cluster Labeling

  • Uses Ollama (llama3:8b, mistral, gemma, etc.)
  • Generates canonical labels for each cluster
  • Example: "Agent: Identify self and request assistance"

5. Directed Graph Visualization

  • NetworkX DiGraph with visual arrow markers
  • Interactive D3.js visualization
  • Hover tooltips show full metadata
  • Node size reflects number of utterances

🧪 Verification

After running the pipeline, verify outputs:

# Check graph statistics
python3 -c "
import json
with open('output/graph_with_metadata.json') as f:
    data = json.load(f)
print(f'Nodes: {len(data[\"nodes\"])}')
print(f'Edges: {len(data[\"edges\"])}')
"

# Check LLM labels
python3 -c "
import json
with open('output/trajectories_with_metadata.json') as f:
    data = json.load(f)
labels = data.get('cluster_labels', {})
print(f'Labels: {len(labels)}')
for cid, label in list(labels.items())[:5]:
    print(f'  {cid}: {label}')
"

# Open visualization
xdg-open output/graph_visualization.html  # Linux
# open output/graph_visualization.html    # macOS


🔧 Pipeline Architecture

Step 1: Find Top 20 Conversations
  ├─ Load data from data/final_json_for_d2f.json
  ├─ Filter by domain
  ├─ Create utterance embeddings
  ├─ Build FAISS index
  ├─ Search with query
  └─ Save top 20 to data/top_K/

Step 2: Prepare for Dialog2Flow
  ├─ Convert to simplified text format
  └─ Save metadata to data/example/

Step 3: Extract Trajectories
  ├─ Load Dialog2Flow model
  ├─ Cluster Agent utterances separately
  ├─ Cluster Customer utterances separately
  ├─ Generate LLM labels (if enabled)
  ├─ Aggregate metadata per cluster
  └─ Save to output/trajectories_with_metadata.json

Step 4: Build Graphs
  ├─ Build metadata-enhanced graph
  ├─ Build Dialog2Flow action flow graph
  ├─ Export as JSON, GraphML, HTML
  └─ Generate interactive visualizations

🎓 Research Citation

This work extends the Dialog2Flow methodology:

@inproceedings{burdisso-etal-2024-dialog2flow,
  title={Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction},
  author={Burdisso, Sergio and Madikeri, Srikanth and Motlicek, Petr},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2024}
}

Original repository: https://github.com/sergioburdisso/dialog2flow


πŸ› Troubleshooting

Ollama Not Running

# Start Ollama
ollama serve &

# Verify
ollama list

Missing Models

# Download models
python3 -c "
from sentence_transformers import SentenceTransformer
SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
SentenceTransformer('sergioburdisso/dialog2flow-joint-bert-base')
"

No Domain Found

# Check available domains
python3 -c "
import json
with open('data/final_json_for_d2f.json') as f:
    domains = set(t.get('domain', '') for t in json.load(f))
print(sorted(domains))
"

See SETUP.md for more troubleshooting.


πŸ“ License

MIT License - see LICENSE for details.


🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📧 Contact

For questions or issues:

  • Open an issue on GitHub
  • Check SETUP.md for troubleshooting
  • Review existing issues for solutions

⭐ Features in Progress

  • Support for more LLM providers (OpenAI, Anthropic, etc.)
  • Multi-domain comparison graphs
  • Time-based flow analysis
  • Export to Gephi/Cytoscape formats
  • Real-time conversation analysis API

Happy analyzing! 🎉
