K-nie/cowpea_analysis

Cowpea Genetic Analysis Pipeline

Overview

A complete pipeline for analyzing cowpea (Vigna unguiculata) genetic data from PCR markers. It computes descriptive statistics, assesses genetic diversity, calculates distance matrices, and runs cluster analysis.

Project Structure

cowpea_analysis/
├── data/                          # Raw data files
│   └── Data.csv
├── scripts/                       # Analysis scripts
│   ├── 00_setup_environment.py
│   ├── 01_data_loading.py
│   ├── 02_data_preprocessing.py
│   ├── 03_descriptive_statistics.py
│   ├── 04_band_analysis.py
│   ├── 05_visualization.py
│   ├── 06_genetic_diversity.py
│   ├── 07_distance_matrix.py
│   ├── 08_cluster_analysis.py
│   ├── 09_export_results.py
│   └── utils/                     # Helper functions
│       ├── __init__.py
│       └── helpers.py
├── notebooks/                     # Jupyter notebooks
│   ├── Cowpea_analyses.ipynb
│   └── exploratory_analysis.ipynb
├── config/                        # Configuration files
│   └── settings.py
├── output/                        # Generated outputs
│   ├── plots/                     # Visualization plots
│   ├── tables/                    # Statistical tables
│   ├── matrices/                  # Distance matrices
│   └── reports/                   # Summary reports
├── logs/                          # Pipeline logs
├── run_pipeline.py                # Main pipeline script
├── requirements.txt               # Python dependencies
├── .gitignore                     # Git ignore rules
└── README.md                      # This file

Quick Start

1. Installation

# Clone repository
git clone <repository-url>
cd cowpea_analysis

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Mac/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt


2. Run the Complete Pipeline

python run_pipeline.py

3. Run Individual Steps

# Setup environment
python scripts/00_setup_environment.py

# Load data
python scripts/01_data_loading.py

# Preprocess data
python scripts/02_data_preprocessing.py

# Calculate statistics
python scripts/03_descriptive_statistics.py

# Analyze band patterns
python scripts/04_band_analysis.py

# Create visualizations
python scripts/05_visualization.py

# Calculate genetic diversity
python scripts/06_genetic_diversity.py

# Calculate distance matrices
python scripts/07_distance_matrix.py

# Perform cluster analysis
python scripts/08_cluster_analysis.py

# Export results
python scripts/09_export_results.py


Analysis Steps
Step 1: Data Loading
Load Data.csv from data directory

Validate data structure and quality

Extract metadata
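
A minimal sketch of what loading and validation could look like, using only the standard library. The layout of Data.csv is not documented here, so the sketch assumes the first column holds a sample ID and every other column is a band scored as 1 (present), 0 (absent), or empty (missing). The function name `load_band_matrix` and the demo column names are hypothetical.

```python
import csv
import io


def load_band_matrix(handle):
    """Read a presence/absence table: first column = sample ID, rest = bands.

    Assumed layout (not confirmed by the project): cells are "1", "0", or empty.
    """
    reader = csv.reader(handle)
    header = next(reader)
    markers = header[1:]
    samples, rows = [], []
    for line_no, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(record)}"
            )
        samples.append(record[0])
        row = []
        for cell in record[1:]:
            cell = cell.strip()
            if cell in ("0", "1"):
                row.append(int(cell))
            elif cell == "":
                row.append(None)  # missing score
            else:
                raise ValueError(f"line {line_no}: unexpected value {cell!r}")
        rows.append(row)
    return samples, markers, rows


# Small in-memory stand-in for data/Data.csv (illustrative values only).
demo = io.StringIO("sample,band_100,band_250,band_400\nA1,1,0,1\nA2,1,,0\n")
samples, markers, rows = load_band_matrix(demo)
print(samples, markers, rows)
```

To read the real file, the same function could be called with `open("data/Data.csv", newline="")`, provided the file follows the assumed layout.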

Step 2: Data Preprocessing
Handle missing values

Remove invariant markers

Clean and standardize data
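
Removing invariant markers can be sketched as follows. A marker is treated as invariant when all scored samples share the same value, and missing scores (`None`) are ignored when making that call. This is one reasonable reading of the step, not necessarily what 02_data_preprocessing.py does, and `drop_invariant_markers` is a hypothetical name.

```python
def drop_invariant_markers(markers, rows):
    """Keep only markers whose scored values (ignoring None) vary across samples."""
    keep = []
    for j in range(len(markers)):
        observed = {row[j] for row in rows if row[j] is not None}
        if len(observed) > 1:
            keep.append(j)
    kept_markers = [markers[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_markers, kept_rows


# b1 is present in every sample, so it carries no information and is dropped.
demo_markers = ["b1", "b2", "b3"]
demo_rows = [[1, 0, 1], [1, 1, None], [1, 0, 0]]
kept_markers, kept_rows = drop_invariant_markers(demo_markers, demo_rows)
print(kept_markers, kept_rows)
```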

Step 3: Descriptive Statistics
Band frequency analysis

Band richness calculation

Basic statistical summaries
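
As an illustration of the two core quantities, the sketch below takes band frequency to be the fraction of scored samples in which a band is present, and band richness to be the number of bands present in a sample. Function names and demo data are assumptions made for the example.

```python
import statistics


def band_frequencies(markers, rows):
    """Fraction of scored samples (ignoring None) that carry each band."""
    freqs = {}
    for j, marker in enumerate(markers):
        scored = [row[j] for row in rows if row[j] is not None]
        freqs[marker] = sum(scored) / len(scored) if scored else float("nan")
    return freqs


def band_richness(samples, rows):
    """Number of bands present in each sample."""
    return {
        name: sum(v for v in row if v is not None)
        for name, row in zip(samples, rows)
    }


demo_samples = ["A1", "A2", "A3", "A4"]
demo_markers = ["b1", "b2", "b3"]
demo_rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]

freqs = band_frequencies(demo_markers, demo_rows)
richness = band_richness(demo_samples, demo_rows)
mean_richness = statistics.mean(richness.values())
print(freqs, richness, mean_richness)
```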

Step 4: Band Analysis
Unique band pattern identification

Band sharing analysis

Diagnostic marker identification
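
The sketch below shows one way to group samples by identical band profile and to flag candidate diagnostic bands. "Diagnostic" is interpreted here as a band present in exactly one sample. That definition is an assumption, as are the function names.

```python
def band_patterns(samples, rows):
    """Map each distinct band profile to the samples that share it."""
    groups = {}
    for name, row in zip(samples, rows):
        groups.setdefault(tuple(row), []).append(name)
    return groups


def private_bands(samples, markers, rows):
    """Bands present in exactly one sample (assumed meaning of 'diagnostic')."""
    found = {}
    for j, marker in enumerate(markers):
        carriers = [s for s, row in zip(samples, rows) if row[j] == 1]
        if len(carriers) == 1:
            found[marker] = carriers[0]
    return found


demo_samples = ["A", "B", "C"]
demo_markers = ["b1", "b2", "b3"]
demo_rows = [[1, 0, 1], [1, 0, 1], [0, 1, 1]]

patterns = band_patterns(demo_samples, demo_rows)
private = private_bands(demo_samples, demo_markers, demo_rows)
print(patterns)
print(private)
```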

Step 5: Visualization
Band frequency bar plots

Richness distribution histograms

Heatmaps of band presence/absence

Cumulative frequency plots

Step 6: Genetic Diversity
Nei's gene diversity (He)

Shannon diversity index

Polymorphism Information Content (PIC)

Marker Index and Resolving Power
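
For presence/absence data, several of these indices can be written directly in terms of the band frequency p. The sketch below uses a common simplification that treats band presence and absence as two states with frequencies p and 1 - p. Whether 06_genetic_diversity.py makes the same choice is not stated here, and the function names are hypothetical. Resolving power is computed as the sum of band informativeness, 1 - 2|0.5 - p|, over bands.

```python
import math


def gene_diversity(p):
    """Two-state diversity 1 - p^2 - q^2, which equals 2pq."""
    q = 1.0 - p
    return 1.0 - p * p - q * q


def shannon_index(p):
    """Shannon index -(p ln p + q ln q), with 0 * ln 0 taken as 0."""
    total = sum(x * math.log(x) for x in (p, 1.0 - p) if x > 0)
    return -total if total else 0.0


def resolving_power(freqs):
    """Sum of band informativeness 1 - 2|0.5 - p| over all bands."""
    return sum(1.0 - 2.0 * abs(0.5 - p) for p in freqs)


demo_freqs = [0.5, 0.25, 1.0]
for p in demo_freqs:
    print(p, gene_diversity(p), shannon_index(p))
print("Rp =", resolving_power(demo_freqs))
```

Both diversity measures peak at p = 0.5 and fall to zero for a monomorphic band, which is why invariant markers add nothing to these indices.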

Step 7: Distance Matrices
Jaccard distance matrix

Simple Matching Coefficient (SMC)

Euclidean distance

Similarity matrices
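
The three distances differ in how they treat shared absences. With a = bands present in both samples, b and c = bands present in only one, and d = bands absent in both, Jaccard distance is 1 - a / (a + b + c), the SMC-based distance is (b + c) / (a + b + c + d), and Euclidean distance on binary data reduces to sqrt(b + c). A standard-library sketch follows, with hypothetical function names. Positions with a missing score are skipped, and two all-absent profiles are given a Jaccard distance of 0 by convention. Both are choices made for the example.

```python
import math


def match_counts(x, y):
    """Return (a, b, c, d) for two binary profiles, skipping missing scores."""
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue
        if xi and yi:
            a += 1
        elif xi:
            b += 1
        elif yi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard_distance(x, y):
    a, b, c, _ = match_counts(x, y)
    return 0.0 if a + b + c == 0 else 1.0 - a / (a + b + c)


def smc_distance(x, y):
    a, b, c, d = match_counts(x, y)
    n = a + b + c + d
    if n == 0:
        raise ValueError("no positions scored in both profiles")
    return (b + c) / n


def euclidean_distance(x, y):
    _, b, c, _ = match_counts(x, y)
    return math.sqrt(b + c)


def distance_matrix(rows, metric):
    """Full symmetric matrix of pairwise distances."""
    return [[metric(r, s) for s in rows] for r in rows]


x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(jaccard_distance(x, y), smc_distance(x, y), euclidean_distance(x, y))
matrix = distance_matrix([x, y], jaccard_distance)
print(matrix)
```

Because Jaccard ignores shared absences, it is often preferred for dominant markers, where a shared absent band is weak evidence of similarity. SMC counts shared absences as matches.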

Step 8: Cluster Analysis
Hierarchical clustering

K-means clustering

Consensus clustering

Dimensionality reduction (PCA, MDS)
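
To show how hierarchical clustering consumes a distance matrix, here is a deliberately naive agglomerative sketch using average linkage (UPGMA-style). It is not Ward's method, and it is not necessarily the linkage used in 08_cluster_analysis.py. The function name and demo matrix are assumptions. In practice a library routine would be faster and would also return the full dendrogram.

```python
def average_linkage(dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    dist is a full symmetric matrix (list of lists); the distance between
    two clusters is the mean of all pairwise member distances.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Illustrative distances: samples 0 and 1 are close, as are samples 2 and 3.
demo_dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.85, 0.8],
    [0.8, 0.85, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
groups = average_linkage(demo_dist, n_clusters=2)
print(groups)
```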

Step 9: Results Export
Comprehensive summary reports

Excel files with multiple sheets

JSON format for programmatic access

Markdown documentation

Output Files
Plots (output/plots/)
01_band_frequency.pdf - Band frequency by fragment size

02_band_richness_distribution.pdf - Distribution of band counts

03_cumulative_frequency.pdf - Cumulative band frequency

04_band_richness_boxplot.pdf - Boxplot of band richness

05_band_presence_heatmap.pdf - Heatmap of band presence/absence

Tables (output/tables/)
band_frequency.csv - Frequency of each band

band_richness.csv - Band counts per sample

descriptive_statistics.json - Complete statistical summary

genetic_diversity_metrics.csv - Diversity indices

cluster_assignments.csv - Sample cluster assignments

Matrices (output/matrices/)
jaccard_distance_matrix.csv - Jaccard distance matrix

smc_distance_matrix.csv - Simple Matching Coefficient matrix

Similarity matrices in various formats

Reports (output/reports/)
summary_report.md - Comprehensive markdown report

detailed_results.xlsx - Excel workbook with all results

complete_results.pkl - Python pickle file with all data

Configuration
Edit config/settings.py to customize:

Analysis parameters

Visualization settings

Genetic analysis thresholds

Output formats
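
The contents of config/settings.py are not shown in this README, so the following is only a hypothetical sketch of how such a file might be organized. Every variable name and value below is an assumption made for illustration. Only the paths data/Data.csv and output/ come from the project structure above.

```python
from pathlib import Path

# Hypothetical layout; names and values are illustrative, not the project's.
BASE_DIR = Path(".")
DATA_FILE = BASE_DIR / "data" / "Data.csv"
OUTPUT_DIR = BASE_DIR / "output"

# Analysis parameters and genetic analysis thresholds (assumed examples).
ANALYSIS = {
    "distance_metric": "jaccard",
    "n_clusters": 3,
    "min_band_frequency": 0.05,
}

# Visualization settings (assumed examples).
VISUALIZATION = {"dpi": 300, "figure_format": "pdf"}

# Output formats (assumed examples).
OUTPUT_FORMATS = ["csv", "xlsx", "json"]
```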

Troubleshooting
Common Issues:
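Missing data file: Ensure Data.csv is in the data/ directory.

Package conflicts: Use a virtual environment and install the exact versions listed in requirements.txt.

Memory issues: For large datasets, analyze a subset of samples or run on a machine with more memory.

Plotting errors: Ensure the matplotlib backend is configured correctly; on a machine without a display, a non-interactive backend such as Agg is usually needed.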

View Logs:
Check logs/ directory for detailed execution logs.

References
Genetic Diversity Metrics:

Nei, M. (1973). Analysis of gene diversity in subdivided populations.

Shannon, C. E. (1948). A mathematical theory of communication.

Distance Metrics:

Jaccard, P. (1901). Distribution de la flore alpine.

Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships.

Cluster Analysis:

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.

Contributing
Fork the repository

Create a feature branch

Make your changes

Add tests if applicable

Submit a pull request

License
This project is licensed under the MIT License - see the LICENSE file for details.

Contact
For questions or support, please contact:

Name: Benjamin Narh-Madey

Email: narhmadey@wisc.edu


*Pipeline developed for cowpea genetic analysis - Version 1.0*

About

This project provides a comprehensive analysis pipeline for cowpea (Vigna unguiculata) genetic data. The modular system processes genetic marker data to extract insights about genetic diversity, population structure, and relationships between different cowpea accessions.
