- Introduction
- High-Level Architecture
- Data Pipeline Components
- Project Structure
- Getting Started
- Further Documentation
- Glossary
## Introduction

Welcome to Sales Vision 360, an end-to-end data analytics solution designed to provide comprehensive insights into sales performance. This project integrates various technologies to create a robust data pipeline, from raw data acquisition to advanced analytics and interactive visualization. The primary goal is to empower businesses with data-driven decision-making capabilities for sales forecasting and customer segmentation.
Sales Vision 360 automates the entire process, ensuring data quality, consistency, and accessibility, ultimately transforming raw sales data into actionable business intelligence.
## High-Level Architecture

The Sales Vision 360 architecture is designed for scalability and maintainability, leveraging cloud-native services and modern data stack tools. The pipeline orchestrates data flow from external sources, through a centralized data warehouse, into transformed models, and finally into analytical and visualization layers.
```mermaid
graph TD
    A[External Data Sources] --> B{Data Ingestion};
    B -- Web Scraper --> C[Raw Data Storage];
    B -- Snowflake Connector --> C;
    C --> D[Snowflake Data Warehouse];
    D -- dbt Transformations --> E[Transformed Data Models];
    E -- ML Pipelines --> F[Forecasting & Segmentation Outputs];
    F --> G[Streamlit Dashboard];
    E --> H[Power BI Dashboard];

    subgraph Data Flow
        C; D; E; F;
    end

    subgraph Analytics & Visualization
        G; H;
    end
```
## Data Pipeline Components

### Web Scraper

The initial stage of the pipeline gathers product data from various online marketplaces.

- Purpose: To collect product information, including titles, prices, ratings, and reviews, from e-commerce platforms such as Amazon and Flipkart.
- Technology: Python-based web scraper utilizing `playwright` for browser automation.
- Output: Raw data is initially saved in `.txt` files (JSONL format) and then converted into structured `.csv` files. This includes `products_catalog.csv` and `sales_transactions.csv`, which are enriched with synthetic sales transaction data.
- Relevant Files:
  - `sales_pipeline_scrapper.py`: Python script for scraping and synthetic data generation.
  - `data/sales_raw.txt`: Raw scraped data.
  - `data/products_catalog.csv`: Cleaned product catalog.
  - `data/sales_transactions.csv`: Generated synthetic sales transactions.
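The scraping itself needs a live browser session, but the JSONL-to-CSV conversion step can be sketched with the standard library alone. The `jsonl_to_csv` helper and its field names below are illustrative, not the actual functions in `sales_pipeline_scrapper.py`:

```python
import csv
import io
import json

def jsonl_to_csv(jsonl_text: str, fieldnames: list) -> str:
    """Convert raw JSONL scrape output (one JSON object per line) to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines in the raw dump
            writer.writerow(json.loads(line))
    return out.getvalue()

raw = (
    '{"title": "Widget", "price": 19.99, "rating": 4.3}\n'
    '{"title": "Gadget", "price": 5.49, "rating": 3.8}\n'
)
csv_text = jsonl_to_csv(raw, ["title", "price", "rating"])
```

`extrasaction="ignore"` quietly drops any fields scraped for one marketplace but not the other, which keeps the CSV schema stable across sources.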
### Snowflake Data Warehouse

Snowflake serves as the central data warehouse for all raw and transformed data.

- Purpose: To store large volumes of structured and semi-structured data, providing a scalable, high-performance environment for analytics.
- Technology: Snowflake Cloud Data Warehouse.
- Process: The `sales_transactions.csv` generated by the scraper is uploaded to Snowflake into the `MYDATABASE.ABSSALESVISION360.SALES_TRANSACTIONS` table.
- Relevant Files:
  - `snowflake_connector.py`: Python script to connect to Snowflake and upload CSV data.
  - `csv_download.py`: Python script to connect to Snowflake and download data (e.g., `SALES_TRANS.csv`).
  - `snowflake_credits_checker.py`: Utility script to check Snowflake connection details and available resources.
  - `.env`: Stores Snowflake connection credentials securely.
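A CSV upload to a Snowflake table typically means a `PUT` to the table's stage followed by a `COPY INTO`. The sketch below builds the `COPY INTO` statement with a hypothetical helper (`copy_into_sql` is not in the repo); the connection flow, which requires `snowflake-connector-python` and valid credentials, is shown only in comments so the runnable part stays dependency-free:

```python
def copy_into_sql(table: str, staged_file: str) -> str:
    """Build the COPY INTO statement run after PUT-ing the CSV to a stage."""
    return (
        f"COPY INTO {table} FROM {staged_file} "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )

sql = copy_into_sql(
    "MYDATABASE.ABSSALESVISION360.SALES_TRANSACTIONS",
    "@%SALES_TRANSACTIONS/sales_transactions.csv.gz",
)

# Hypothetical upload flow (requires snowflake-connector-python and a valid .env):
# import os
# import snowflake.connector
# conn = snowflake.connector.connect(
#     user=os.environ["SNOWFLAKE_USER"],
#     password=os.environ["SNOWFLAKE_PASSWORD"],
#     account=os.environ["SNOWFLAKE_ACCOUNT"],
# )
# conn.cursor().execute("PUT file://data/sales_transactions.csv @%SALES_TRANSACTIONS")
# conn.cursor().execute(sql)
```

`@%SALES_TRANSACTIONS` is the table's built-in stage, which avoids creating a named stage just for this one load.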
### dbt Transformations

dbt (data build tool) transforms raw data in Snowflake into clean, consistent, analysis-ready data models.

- Purpose: To define, manage, and execute data transformations, ensuring data quality and creating a single source of truth for downstream analytics.
- Technology: dbt with the Snowflake adapter.
- Process: Raw `SALES_TRANSACTIONS` data is transformed into dimension and fact tables such as `CUSTOMER_DIM`, `PRODUCT_DIM`, `ORDER_DIM`, `MARKETING_ENGAGEMENT_DIM`, `SHIPPING_FULFILLMENT_DIM`, `METADATA`, and `SALES_FACT`.
- Relevant Files:
  - `sales_vision_360/dbt_project.yml`: dbt project configuration.
  - `sales_vision_360/models/sales_transactions_transform.sql`: Core transformation model.
  - `sales_vision_360/models/dimensions/`: Directory containing dimension models.
  - `sales_vision_360/models/fact/financial_fact.sql`: Fact model for sales.
  - `sales_vision_360/models/source/sources.yml`: dbt source definition for raw sales data.
  - `sales_vision_360/logs/dbt.log`: dbt execution logs.
  - `sales_vision_360/README.md`: dbt-project-specific README.
### Machine Learning Pipelines

Machine learning models are integrated into the pipeline to generate predictive insights.

- Purpose: To perform sales forecasting and customer segmentation on the transformed data. Churn prediction, though initially explored, was dropped because the data carried no predictive signal for churn.
- Technology: Python with `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `seaborn`, and `plotly`.
- Process:
  - Sales Forecasting: Uses `GradientBoostingRegressor` to predict daily revenue from historical trends and calendar features.
  - Customer Segmentation: Employs `KMeans` clustering on RFM (Recency, Frequency, Monetary) and other customer behavioral features to identify distinct customer segments (e.g., High-Value, Dormant, Frequent, Regular).
- Output: Forecasting results (`forecast_results.csv`), customer segments (`customer_segments.csv`), and a performance summary (`performance_summary.csv`) are saved. Visualizations such as `forecast_plot.png` and `segmentation_plot.png` are also generated.
- Relevant Files:
  - `modeling/ml_pipelines_all.py`: Python script implementing the ML pipelines.
  - `modeling/churn_eda_diagnostic.py`: EDA script (note: churn prediction is not part of the final ML pipeline).
  - `modeling/outputs/`: Directory for ML outputs (CSVs, PNGs, joblib models).
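The two modeling steps can be illustrated end-to-end on synthetic data. This is a minimal sketch assuming only `numpy` and `scikit-learn`; the real pipeline in `modeling/ml_pipelines_all.py` engineers far richer features, and the feature choices below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)

# --- Sales forecasting: daily revenue with trend + weekly seasonality ---
days = np.arange(365)
revenue = 1000 + 2 * days + 200 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 50, 365)

# Calendar features: linear trend and day-of-week
X = np.column_stack([days, days % 7])
model = GradientBoostingRegressor(random_state=42).fit(X[:300], revenue[:300])
pred = model.predict(X[300:])
mape = float(np.mean(np.abs((revenue[300:] - pred) / revenue[300:]))) * 100

# --- Customer segmentation: KMeans on RFM features ---
rfm = np.column_stack([
    rng.integers(1, 365, 500),   # Recency: days since last order
    rng.integers(1, 50, 500),    # Frequency: order count
    rng.uniform(10, 5000, 500),  # Monetary: total spend
])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(rfm)
```

Note that tree ensembles cannot extrapolate a trend beyond the training range, which is one reason the real pipeline's calendar features matter: here the hold-out error grows toward the end of the horizon as the trend keeps climbing past the last training day.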
### Dashboards & Visualization

The final stage presents the insights through interactive dashboards.

- Purpose: To provide business users with an intuitive interface for exploring sales forecasts, customer segments, and key performance indicators.
- Technology: Streamlit for interactive web dashboards and Power BI for detailed business intelligence reports.
- Relevant Files:
  - `modeling/dashboard_app.py`: Streamlit application for visualizing ML outputs.
  - `Sales vision 360 dashboard.pbix`: Power BI Desktop file for detailed sales analytics.
## Project Structure

The project repository is organized as follows:
```mermaid
graph TD
    A[Sales Vision 360 Project] --> B[Data Layer]
    A --> C[Transformation Layer]
    A --> D[ML & Analytics Layer]
    A --> E[Visualization Layer]
    A --> F[Infrastructure]

    B --> B1[data/]
    B1 --> B2[products_catalog.csv]
    B1 --> B3[sales_raw.txt]
    B1 --> B4[sales_transactions.csv]

    C --> C1[sales_vision_360/ - dbt]
    C1 --> C2[dimensions/]
    C1 --> C3[fact/]
    C1 --> C4[sales_transactions_transform.sql]
    C2 --> C5[customer_dim.sql]
    C2 --> C6[product_dim.sql]
    C2 --> C7[order_dim.sql]
    C3 --> C8[financial_fact.sql]

    D --> D1[modeling/]
    D1 --> D2[ml_pipelines_all.py]
    D1 --> D3[churn_eda_diagnostic.py]
    D1 --> D4[dashboard_app.py]
    D1 --> D5[outputs/]
    D5 --> D6[Models & Results]

    E --> E1[Sales vision 360 dashboard.pbix]
    E --> E2[Streamlit Dashboard]

    F --> F1[csv_download.py]
    F --> F2[snowflake_connector.py]
    F --> F3[snowflake_credits_checker.py]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#e8f5e9
    style D fill:#f3e5f5
    style E fill:#fce4ec
    style F fill:#f5f5f5
```
## Getting Started

To set up and run the Sales Vision 360 project, follow these steps.

### Prerequisites

Before you begin, ensure you have the following installed and configured:
- Python 3.8+: Download and install from python.org.
- Snowflake Account: Access to a Snowflake data warehouse with appropriate permissions.
- dbt Core: Install the Snowflake adapter: `pip install dbt-snowflake`.
- Environment Variables: A `.env` file in the root directory of the project with your Snowflake credentials:

  ```
  SNOWFLAKE_USER=your_username
  SNOWFLAKE_PASSWORD=your_password
  SNOWFLAKE_ACCOUNT=your_account_identifier
  ```

- Browser for Scraping: Google Chrome is used by `playwright`.
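Before running anything, it is worth confirming that all three Snowflake variables are actually present. A stdlib-only sanity check is sketched below; `parse_env` and `missing_keys` are hypothetical helpers, not part of the repo (the project's scripts would typically load `.env` via a library such as python-dotenv instead):

```python
REQUIRED = ["SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD", "SNOWFLAKE_ACCOUNT"]

def parse_env(text: str) -> dict:
    """Minimal .env parser: KEY=value lines; blank lines and # comments ignored."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env: dict) -> list:
    """Return required Snowflake variables that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

sample = "SNOWFLAKE_USER=alice\nSNOWFLAKE_ACCOUNT=xy12345\n# password not set yet\n"
```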
### Setup Steps

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/SalesVision360.git
   cd SalesVision360
   ```

2. Set up the Python environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt  # assuming a requirements.txt is provided
   ```

3. Configure the dbt profile: Ensure your `profiles.yml` (typically located at `~/.dbt/profiles.yml`) is configured to connect to your Snowflake instance using the `sales_vision_360` profile. An example `profiles.yml` might look like this:

   ```yaml
   sales_vision_360:
     target: dev
     outputs:
       dev:
         type: snowflake
         account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
         user: "{{ env_var('SNOWFLAKE_USER') }}"
         password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
         role: SALES_360              # or your specific role
         warehouse: SALES_VISION_360  # or your specific warehouse
         database: MYDATABASE         # or your specific database
         schema: ABSSALESVISION360    # or your specific schema
         threads: 4
         client_session_keep_alive: False
         query_tag: dbt-sales-vision-360
   ```

   You can verify the connection by running `dbt debug` from within the `sales_vision_360` directory.

4. Run the data pipeline: Follow the detailed instructions in the Detailed Setup Guide for executing the scraping, data loading, dbt transformations, and ML pipelines.
## Further Documentation

For more in-depth information on specific aspects of the project, please refer to the following documents:
- Architectural Overview: Detailed explanation of the system architecture and design principles.
- Data Ingestion & Snowflake Setup: Guide on setting up Snowflake and ingesting raw data.
- dbt Data Transformation Guide: Comprehensive guide to the dbt project, models, and transformations.
- ML Pipelines (Forecasting & Segmentation): Details on the machine learning models, feature engineering, and performance.
- Dashboards & Visualization: Information on interacting with the Streamlit and Power BI dashboards.
- Detailed Setup Guide: Step-by-step instructions for environment setup, dependency installation, and running the entire pipeline.
## Glossary

- dbt: Data Build Tool, an open-source tool for data transformation.
- EDA: Exploratory Data Analysis.
- MAPE: Mean Absolute Percentage Error, a measure of prediction accuracy.
- ML: Machine Learning.
- Power BI: A business intelligence tool for data visualization.
- RMSE: Root Mean Square Error, a measure of prediction accuracy.
- Snowflake: A cloud-based data warehousing platform.
- Streamlit: An open-source app framework for machine learning and data science.
- SKEY: Surrogate Key, a system-generated primary key in a data warehouse.
- Snowflake Connector: Python library to connect to Snowflake.
- Web Scraping: The process of extracting data from websites.
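As a concrete illustration of the MAPE and RMSE entries above, both metrics fit in a few lines of plain Python (a sketch, not the pipeline's actual implementation, which uses scikit-learn/numpy):

```python
import math

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error, in the units of the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual, predicted = [100, 200, 400], [110, 190, 400]
```

MAPE is scale-free (useful for comparing forecast quality across products), while RMSE stays in revenue units and penalizes large misses more heavily.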