A complete pipeline for analyzing cowpea genetic data from PCR markers. It covers descriptive statistics, genetic diversity assessment, distance matrix calculation, and cluster analysis.
cowpea_analysis/
├── data/                          # Raw data files
│   └── Data.csv
├── scripts/                       # Analysis scripts
│   ├── 00_setup_environment.py
│   ├── 01_data_loading.py
│   ├── 02_data_preprocessing.py
│   ├── 03_descriptive_statistics.py
│   ├── 04_band_analysis.py
│   ├── 05_visualization.py
│   ├── 06_genetic_diversity.py
│   ├── 07_distance_matrix.py
│   ├── 08_cluster_analysis.py
│   ├── 09_export_results.py
│   └── utils/                     # Helper functions
│       ├── __init__.py
│       └── helpers.py
├── notebooks/                     # Jupyter notebooks
│   ├── Cowpea_analyses.ipynb
│   └── exploratory_analysis.ipynb
├── config/                        # Configuration files
│   └── settings.py
├── output/                        # Generated outputs
│   ├── plots/                     # Visualization plots
│   ├── tables/                    # Statistical tables
│   ├── matrices/                  # Distance matrices
│   └── reports/                   # Summary reports
├── logs/                          # Pipeline logs
├── run_pipeline.py                # Main pipeline script
├── requirements.txt               # Python dependencies
├── .gitignore                     # Git ignore rules
└── README.md                      # This file
# Clone repository
git clone <repository-url>
cd cowpea_analysis
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# Mac/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the complete pipeline
python run_pipeline.py
# Run individual steps
# Setup environment
python scripts/00_setup_environment.py
# Load data
python scripts/01_data_loading.py
# Preprocess data
python scripts/02_data_preprocessing.py
# Calculate statistics
python scripts/03_descriptive_statistics.py
# Analyze band patterns
python scripts/04_band_analysis.py
# Create visualizations
python scripts/05_visualization.py
# Calculate genetic diversity
python scripts/06_genetic_diversity.py
# Calculate distance matrices
python scripts/07_distance_matrix.py
# Perform cluster analysis
python scripts/08_cluster_analysis.py
# Export results
python scripts/09_export_results.py
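The per-step commands above run the numbered scripts in order. As a rough illustration of how a runner could chain them, here is a minimal sketch. It assumes that every step is a standalone script named with a two-digit prefix; the helper names `ordered_steps` and `run_steps` are hypothetical, and the real run_pipeline.py may work differently (for example, by importing the steps or writing to logs/).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def ordered_steps(scripts_dir):
    """Return the numbered step scripts (00_*.py, 01_*.py, ...) in run order."""
    return sorted(Path(scripts_dir).glob("[0-9][0-9]_*.py"))


def run_steps(scripts_dir):
    """Run each step with the current interpreter; stop at the first failure."""
    completed = []
    for script in ordered_steps(scripts_dir):
        # check=True raises CalledProcessError if a step exits non-zero
        subprocess.run([sys.executable, str(script)], check=True)
        completed.append(script.name)
    return completed


# Demo on a throwaway directory with two stand-in scripts
with tempfile.TemporaryDirectory() as tmp:
    for name in ("01_data_loading.py", "00_setup_environment.py"):
        Path(tmp, name).write_text("print('ok')\n")
    Path(tmp, "helpers.py").write_text("")  # no numeric prefix, so it is skipped
    completed = run_steps(tmp)

print(completed)
```

Sorting by file name is enough here because the two-digit prefixes already encode the intended order.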
Analysis Steps
Step 1: Data Loading
Load Data.csv from data directory
Validate data structure and quality
Extract metadata
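The loading and validation step can be sketched as follows. The layout of Data.csv is not documented here, so this example assumes one row per sample, a first column with the sample ID, and one column per band scored as 1 (present), 0 (absent), or empty (missing). The function name `load_band_matrix` and the inline demo data are hypothetical.

```python
import csv
import io


def load_band_matrix(handle):
    """Read a presence/absence table: first column = sample ID, rest = bands."""
    reader = csv.reader(handle)
    header = next(reader)
    bands = header[1:]
    samples, rows = [], []
    for line_no, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(record)}"
            )
        values = []
        for cell in record[1:]:
            cell = cell.strip()
            if cell == "":
                values.append(None)  # missing score
            elif cell in ("0", "1"):
                values.append(int(cell))
            else:
                raise ValueError(f"line {line_no}: unexpected band score {cell!r}")
        samples.append(record[0])
        rows.append(values)
    return samples, bands, rows


# Hypothetical three-sample table standing in for data/Data.csv
demo = io.StringIO("sample,b100,b250,b400\nS1,1,0,1\nS2,1,,0\nS3,0,1,1\n")
samples, bands, rows = load_band_matrix(demo)
n_missing = sum(value is None for row in rows for value in row)
print(samples, bands, n_missing)
```

Failing early on ragged rows or unexpected scores keeps later steps from silently computing statistics on malformed input.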
Step 2: Data Preprocessing
Handle missing values
Remove invariant markers
Clean and standardize data
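As an illustration of removing invariant markers, the sketch below drops every band that is present in all scored samples or absent from all of them. It assumes missing scores are stored as NaN and ignored when judging invariance, which is only one possible policy; the function name `drop_invariant_markers` is hypothetical.

```python
import numpy as np


def drop_invariant_markers(matrix, band_names):
    """Keep only bands where both 0 and 1 are observed (NaN = missing)."""
    matrix = np.asarray(matrix, dtype=float)
    n_observed = (~np.isnan(matrix)).sum(axis=0)  # scored samples per band
    n_present = np.nansum(matrix, axis=0)         # samples carrying the band
    keep = (n_present > 0) & (n_present < n_observed)
    kept_names = [name for name, flag in zip(band_names, keep) if flag]
    return matrix[:, keep], kept_names


raw = [
    [1, 0, 1, 1],
    [1, 1, np.nan, 0],
    [1, 0, 1, 1],
]
filtered, kept = drop_invariant_markers(raw, ["b100", "b250", "b400", "b550"])
print(kept, filtered.shape)
```

In this demo b100 is present everywhere and b400 is present in every sample where it was scored, so both are removed.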
Step 3: Descriptive Statistics
Band frequency analysis
Band richness calculation
Basic statistical summaries
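For a complete 0/1 matrix with samples in rows and bands in columns, band frequency is the column mean and band richness is the row sum. A minimal sketch, with a hypothetical function name and made-up demo data:

```python
import numpy as np


def describe_bands(matrix):
    """Return per-band frequency and per-sample richness for a 0/1 matrix."""
    x = np.asarray(matrix, dtype=int)
    return {
        "band_frequency": x.mean(axis=0),  # share of samples showing each band
        "band_richness": x.sum(axis=1),    # number of bands in each sample
        "mean_richness": float(x.sum(axis=1).mean()),
    }


demo = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
stats = describe_bands(demo)
print(stats["band_frequency"], stats["band_richness"], stats["mean_richness"])
```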
Step 4: Band Analysis
Unique band pattern identification
Band sharing analysis
Diagnostic marker identification
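The three band analyses can be sketched in plain Python. Here a "diagnostic" band is assumed to mean a band present in exactly one sample, which may be narrower than the definition used in 04_band_analysis.py. All function names are hypothetical.

```python
from collections import Counter


def band_patterns(matrix):
    """Count how many samples share each presence/absence profile."""
    return Counter(tuple(row) for row in matrix)


def private_bands(matrix, sample_ids, band_names):
    """Map each band seen in exactly one sample to that sample."""
    private = {}
    for j, band in enumerate(band_names):
        carriers = [sid for sid, row in zip(sample_ids, matrix) if row[j] == 1]
        if len(carriers) == 1:
            private[band] = carriers[0]
    return private


def shared_band_count(row_a, row_b):
    """Number of bands present in both samples."""
    return sum(a == 1 and b == 1 for a, b in zip(row_a, row_b))


sample_ids = ["S1", "S2", "S3", "S4"]
band_names = ["b100", "b250", "b400"]
matrix = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
patterns = band_patterns(matrix)
private = private_bands(matrix, sample_ids, band_names)
shared = shared_band_count(matrix[0], matrix[2])
print(patterns, private, shared)
```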
Step 5: Visualization
Band frequency bar plots
Richness distribution histograms
Heatmaps of band presence/absence
Cumulative frequency plots
Step 6: Genetic Diversity
Nei's gene diversity (He)
Shannon diversity index
Polymorphism Information Content (PIC)
Marker Index and Resolving Power
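For dominant presence/absence markers, one common convention treats each band as a two-state locus with band frequency p and q = 1 - p. Under that assumption, Nei's gene diversity per band is 1 - p² - q², the Shannon index is -(p ln p + q ln q), and band informativeness is 1 - 2|0.5 - p|, summed over bands to give the resolving power. The sketch below follows that convention; the script 06_genetic_diversity.py may use different estimators, and the function names are hypothetical.

```python
import numpy as np


def _xlogx(x):
    """Return x * ln(x) elementwise, taking 0 * ln(0) as 0."""
    out = np.zeros_like(x, dtype=float)
    positive = x > 0
    out[positive] = x[positive] * np.log(x[positive])
    return out


def diversity_per_band(matrix):
    """Per-band diversity indices for a complete 0/1 matrix (samples x bands)."""
    p = np.asarray(matrix, dtype=float).mean(axis=0)  # band frequency
    q = 1.0 - p
    he = 1.0 - p**2 - q**2                  # Nei's gene diversity
    shannon = -(_xlogx(p) + _xlogx(q))      # Shannon index, natural log
    informativeness = 1.0 - 2.0 * np.abs(0.5 - p)
    return {
        "he": he,
        "shannon": shannon,
        "resolving_power": float(informativeness.sum()),
    }


demo = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
diversity = diversity_per_band(demo)
print(diversity)
```

With two states per band, 1 - p² - q² equals 2pq, which is also a form often used for the PIC of dominant markers, so the two indices coincide under this convention. Marker Index additionally needs per-primer information and is left out of the sketch.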
Step 7: Distance Matrices
Jaccard distance matrix
Simple Matching Coefficient (SMC)
Euclidean distance
Similarity matrices
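Both binary coefficients can be derived from the same pairwise counts: a (band present in both samples), b + c (present in only one), and d (absent from both). Jaccard similarity is a / (a + b + c), SMC is (a + d) / n, and each distance is one minus the similarity. A minimal NumPy sketch with a hypothetical function name; it assumes a complete 0/1 matrix and, by convention, treats two samples with no bands at all as identical under Jaccard.

```python
import numpy as np


def binary_distances(matrix):
    """Return (Jaccard distance, simple matching distance) matrices."""
    x = np.asarray(matrix, dtype=int)
    n_bands = x.shape[1]
    both = x @ x.T                                      # a
    counts = x.sum(axis=1)
    either = counts[:, None] + counts[None, :] - both   # a + b + c
    neither = n_bands - either                          # d
    # Guard against 0/0 when neither sample has any band
    jaccard_sim = np.where(either > 0, both / np.maximum(either, 1), 1.0)
    smc = (both + neither) / n_bands
    return 1.0 - jaccard_sim, 1.0 - smc


demo = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
jaccard_dist, sm_dist = binary_distances(demo)
print(jaccard_dist.round(3))
print(sm_dist.round(3))
```

The two measures differ only in whether shared absences (d) count as agreement, which is why SMC distances tend to be smaller when most bands are rare.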
Step 8: Cluster Analysis
Hierarchical clustering
K-means clustering
Consensus clustering
Dimensionality reduction (PCA, MDS)
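As an example of the hierarchical step, the sketch below builds an average-linkage (UPGMA) tree from Jaccard distances with SciPy and cuts it into a fixed number of clusters. The choice of average linkage and the function name `upgma_clusters` are assumptions for illustration; 08_cluster_analysis.py may use another linkage method or a different way of choosing the number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def upgma_clusters(matrix, n_clusters):
    """Average-linkage clustering of samples on Jaccard distance."""
    presence = np.asarray(matrix, dtype=bool)
    condensed = pdist(presence, metric="jaccard")  # pairwise distances
    tree = linkage(condensed, method="average")    # UPGMA
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels


demo = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
tree, labels = upgma_clusters(demo, n_clusters=2)
print(labels)
```

The returned `tree` is a standard SciPy linkage matrix, so it can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.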
Step 9: Results Export
Comprehensive summary reports
Excel files with multiple sheets
JSON format for programmatic access
Markdown documentation
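The JSON and CSV exports can be sketched with the standard library alone. The file names follow the output listing in this README, but the function name `export_tables`, the dictionary keys, and the demo values are hypothetical; the Excel and Markdown outputs are not covered.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_tables(summary, band_frequency, out_dir):
    """Write the summary as JSON and band frequencies as CSV under tables/."""
    tables = Path(out_dir) / "tables"
    tables.mkdir(parents=True, exist_ok=True)

    json_path = tables / "descriptive_statistics.json"
    json_path.write_text(json.dumps(summary, indent=2))

    csv_path = tables / "band_frequency.csv"
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["band", "frequency"])
        writer.writerows(band_frequency.items())
    return json_path, csv_path


summary = {"n_samples": 4, "n_bands": 3, "mean_richness": 2.0}
band_frequency = {"b100": 1.0, "b250": 0.5, "b400": 0.5}

# Write to a temporary directory and read the files back as a round-trip check
with tempfile.TemporaryDirectory() as tmp:
    json_path, csv_path = export_tables(summary, band_frequency, tmp)
    reloaded = json.loads(json_path.read_text())
    csv_lines = csv_path.read_text().splitlines()

print(reloaded, csv_lines)
```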
Output Files
Plots (output/plots/)
01_band_frequency.pdf - Band frequency by fragment size
02_band_richness_distribution.pdf - Distribution of band counts
03_cumulative_frequency.pdf - Cumulative band frequency
04_band_richness_boxplot.pdf - Boxplot of band richness
05_band_presence_heatmap.pdf - Heatmap of band presence/absence
Tables (output/tables/)
band_frequency.csv - Frequency of each band
band_richness.csv - Band counts per sample
descriptive_statistics.json - Complete statistical summary
genetic_diversity_metrics.csv - Diversity indices
cluster_assignments.csv - Sample cluster assignments
Matrices (output/matrices/)
jaccard_distance_matrix.csv - Jaccard distance matrix
smc_distance_matrix.csv - Distance matrix based on the Simple Matching Coefficient (SMC)
Similarity matrices in various formats
Reports (output/reports/)
summary_report.md - Comprehensive markdown report
detailed_results.xlsx - Excel workbook with all results
complete_results.pkl - Python pickle file with all data
Configuration
Edit config/settings.py to customize:
Analysis parameters
Visualization settings
Genetic analysis thresholds
Output formats
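The contents of config/settings.py are not shown in this README. Purely as an illustration of how such settings could be grouped and validated, here is a sketch in which every field name and default value is a hypothetical placeholder, apart from the paths and the PDF plot format that appear elsewhere in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # All names and defaults below are hypothetical placeholders
    data_file: str = "data/Data.csv"
    output_dir: str = "output"
    max_missing_rate: float = 0.2   # genetic analysis threshold
    n_clusters: int = 3             # analysis parameter
    linkage_method: str = "average"
    plot_format: str = "pdf"        # visualization setting
    export_formats: tuple = ("csv", "json", "xlsx")

    def validate(self):
        """Raise ValueError if a setting is outside its sensible range."""
        if not 0.0 <= self.max_missing_rate <= 1.0:
            raise ValueError("max_missing_rate must be between 0 and 1")
        if self.n_clusters < 2:
            raise ValueError("n_clusters must be at least 2")


settings = Settings(n_clusters=4)
settings.validate()
print(settings)
```

A frozen dataclass keeps the configuration read-only once the pipeline has started, so a step cannot change a parameter that an earlier step already used.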
Troubleshooting
Common Issues:
Missing data file: Ensure Data.csv is in the data/ directory
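Package conflicts: Use a virtual environment and install the exact versions pinned in requirements.txt
Memory issues: For large datasets, analyze a subset of samples or markers, or run on a machine with more RAM
Plotting errors: On machines without a display, select a non-interactive matplotlib backend (for example, set the MPLBACKEND environment variable to Agg)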
View Logs:
Check logs/ directory for detailed execution logs.
References
Genetic Diversity Metrics:
Nei, M. (1973). Analysis of gene diversity in subdivided populations.
Shannon, C. E. (1948). A mathematical theory of communication.
Distance Metrics:
Jaccard, P. (1901). Distribution de la flore alpine.
Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships.
Cluster Analysis:
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.
Contributing
Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For questions or support, please contact:
Name: Benjamin Narh-Madey
Email: narhmadey@wisc.edu
*Pipeline developed for cowpea genetic analysis - Version 1.0*