SakshamTapadia/SalesVision_360

Sales Vision 360

Table of Contents

  1. Introduction
  2. High-Level Architecture
  3. Data Pipeline Components
  4. Project Structure
  5. Getting Started
  6. Further Documentation
  7. Glossary

1. Introduction

Welcome to Sales Vision 360, an end-to-end data analytics solution designed to provide comprehensive insights into sales performance. This project integrates various technologies to create a robust data pipeline, from raw data acquisition to advanced analytics and interactive visualization. The primary goal is to empower businesses with data-driven decision-making capabilities for sales forecasting and customer segmentation.

Sales Vision 360 automates the entire process, ensuring data quality, consistency, and accessibility, ultimately transforming raw sales data into actionable business intelligence.

2. High-Level Architecture

The Sales Vision 360 architecture is designed for scalability and maintainability, leveraging cloud-native services and modern data stack tools. The pipeline orchestrates data flow from external sources, through a centralized data warehouse, into transformed models, and finally into analytical and visualization layers.

graph TD
    A[External Data Sources] --> B{Data Ingestion};
    B -- Web Scraper --> C[Raw Data Storage];
    B -- Snowflake Connector --> C;
    C --> D[Snowflake Data Warehouse];
    D -- dbt Transformations --> E[Transformed Data Models];
    E -- ML Pipelines --> F[Forecasting & Segmentation Outputs];
    F --> G[Streamlit Dashboard];
    E --> H[Power BI Dashboard];

    subgraph Data Flow
        C;
        D;
        E;
        F;
    end

    subgraph Analytics & Visualization
        G;
        H;
    end

3. Data Pipeline Components

3.1 Web Scraping

The initial stage of the pipeline involves gathering product data from various online marketplaces.

  • Purpose: To collect product information, including titles, prices, ratings, and reviews, from e-commerce platforms like Amazon and Flipkart.
  • Technology: Python-based web scraper using Playwright for browser automation.
  • Output: Raw data is first saved to .txt files in JSONL format, then converted into structured .csv files: products_catalog.csv (the cleaned product catalog) and sales_transactions.csv (enriched with synthetic sales transaction data).
  • Relevant Files:
    • sales_pipeline_scrapper.py: Python script for scraping and synthetic data generation.
    • data/sales_raw.txt: Raw scraped data.
    • data/products_catalog.csv: Cleaned product catalog.
    • data/sales_transactions.csv: Generated synthetic sales transactions.
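The raw-to-structured conversion described above can be sketched as follows. This is a minimal illustration, not the code in sales_pipeline_scrapper.py; the JSONL field names (title, price, rating) are assumptions based on this README:

```python
# Minimal sketch of the raw-to-CSV step: each line of the raw .txt file is
# assumed to be one JSON object (JSONL). Field names are illustrative only.
import csv
import json
from pathlib import Path


def jsonl_to_csv(jsonl_path: str, csv_path: str, fields: list) -> int:
    """Convert a JSONL file to CSV, keeping only `fields`. Returns row count."""
    rows = []
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the raw dump
        record = json.loads(line)
        rows.append({f: record.get(f) for f in fields})
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call like `jsonl_to_csv("data/sales_raw.txt", "data/products_catalog.csv", ["title", "price", "rating"])` would then produce the structured catalog file.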

3.2 Snowflake Data Warehousing

Snowflake serves as the central data warehouse for all raw and transformed data.

  • Purpose: To store large volumes of structured and semi-structured data, providing a scalable and high-performance environment for analytics.
  • Technology: Snowflake Cloud Data Warehouse.
  • Process: The sales_transactions.csv generated by the scraper is uploaded to Snowflake into the MYDATABASE.ABSSALESVISION360.SALES_TRANSACTIONS table.
  • Relevant Files:
    • snowflake_connector.py: Python script to connect to Snowflake and upload CSV data.
    • csv_download.py: Python script to connect to Snowflake and download data (e.g., SALES_TRANS.csv).
    • snowflake_credits_checker.py: Utility script to check Snowflake connection details and available resources.
    • .env: Stores Snowflake connection credentials securely.
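The upload step can be sketched with the Snowflake Connector for Python. This is an assumption-laden illustration, not the contents of snowflake_connector.py: the database, schema, and table names come from this README, and the connection code is deferred into a function so nothing runs without credentials:

```python
# Sketch of uploading sales_transactions.csv into Snowflake. Table names are
# taken from this README; the real snowflake_connector.py may differ.
import os


def qualified_table(database: str, schema: str, table: str) -> str:
    """Build a fully qualified Snowflake table name."""
    return f"{database}.{schema}.{table}"


def upload_csv(csv_path: str) -> None:
    # Imports kept local so the pure helper above works without the connector
    # installed. Requires: pip install snowflake-connector-python pandas
    import pandas as pd
    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas

    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        database="MYDATABASE",
        schema="ABSSALESVISION360",
    )
    try:
        df = pd.read_csv(csv_path)
        write_pandas(conn, df, "SALES_TRANSACTIONS", auto_create_table=True)
    finally:
        conn.close()
```

With valid credentials in the environment, `upload_csv("data/sales_transactions.csv")` would populate `MYDATABASE.ABSSALESVISION360.SALES_TRANSACTIONS`.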

3.3 dbt Data Transformation

dbt (data build tool) is used to transform raw data in Snowflake into clean, consistent, and analysis-ready data models.

  • Purpose: To define, manage, and execute data transformations, ensuring data quality and creating a single source of truth for downstream analytics.
  • Technology: dbt (Data Build Tool) with Snowflake adapter.
  • Process: Raw SALES_TRANSACTIONS data is transformed into various dimension and fact tables, such as CUSTOMER_DIM, PRODUCT_DIM, ORDER_DIM, MARKETING_ENGAGEMENT_DIM, SHIPPING_FULFILLMENT_DIM, METADATA, and SALES_FACT.
  • Relevant Files:
    • sales_vision_360/dbt_project.yml: dbt project configuration.
    • sales_vision_360/models/sales_transactions_transform.sql: Core transformation model.
    • sales_vision_360/models/dimensions/: Directory containing dimension models.
    • sales_vision_360/models/fact/financial_fact.sql: Fact model for sales.
    • sales_vision_360/models/source/sources.yml: dbt source definition for raw sales data.
    • sales_vision_360/logs/dbt.log: dbt execution logs.
    • sales_vision_360/README.md: dbt project specific README.

3.4 Machine Learning Pipelines

Machine learning models are integrated into the pipeline to generate predictive insights.

  • Purpose: To perform sales forecasting and customer segmentation on the transformed data. Churn prediction was explored initially but dropped because the available data carried no predictive signal.
  • Technology: Python with scikit-learn, pandas, numpy, matplotlib, seaborn, plotly.
  • Process:
    • Sales Forecasting: Uses GradientBoostingRegressor to predict daily revenue based on historical trends and calendar features.
    • Customer Segmentation: Employs KMeans clustering on RFM (Recency, Frequency, Monetary) and other customer behavioral features to identify distinct customer segments (e.g., High-Value, Dormant, Frequent, Regular).
  • Output: Forecasting results (forecast_results.csv), customer segments (customer_segments.csv), and a performance summary (performance_summary.csv) are saved. Visualizations like forecast_plot.png and segmentation_plot.png are also generated.
  • Relevant Files:
    • modeling/ml_pipelines_all.py: Python script implementing ML pipelines.
    • modeling/churn_eda_diagnostic.py: EDA script (note: churn prediction is not part of the final ML pipeline).
    • modeling/outputs/: Directory for ML outputs (CSVs, PNGs, joblib models).
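The two pipelines above can be sketched on synthetic data. The feature names, hyperparameters, and data shapes here are illustrative assumptions, not the exact setup in modeling/ml_pipelines_all.py:

```python
# Sketch: GradientBoostingRegressor on calendar features for daily revenue
# forecasting, and KMeans on scaled RFM features for customer segmentation.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# --- Sales forecasting: calendar features -> daily revenue ---
dates = pd.date_range("2024-01-01", periods=180, freq="D")
daily = pd.DataFrame({
    "dayofweek": dates.dayofweek,
    "month": dates.month,
    "t": np.arange(len(dates)),  # linear trend term
})
# Synthetic revenue: trend + weekend bump + noise (stand-in for real history).
daily["revenue"] = (
    1000 + 5 * daily["t"] + 200 * (daily["dayofweek"] >= 5)
    + rng.normal(0, 50, len(dates))
)
model = GradientBoostingRegressor(random_state=42)
model.fit(daily[["dayofweek", "month", "t"]], daily["revenue"])
forecast = model.predict(daily[["dayofweek", "month", "t"]].tail(7))

# --- Customer segmentation: KMeans on scaled RFM features ---
rfm = pd.DataFrame({
    "recency_days": rng.integers(1, 365, 500),
    "frequency": rng.integers(1, 50, 500),
    "monetary": rng.uniform(10, 5000, 500),
})
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(rfm)
)
rfm["segment"] = labels  # cluster ids, later mapped to names like "High-Value"
```

In the real pipeline, the forecast and labeled customers would be written out as forecast_results.csv and customer_segments.csv.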

3.5 Visualization and Dashboards

The final stage presents the insights through interactive dashboards.

  • Purpose: To provide business users with an intuitive interface to explore sales forecasts, customer segments, and key performance indicators.
  • Technology: Streamlit for interactive web dashboards and Power BI for detailed business intelligence reports.
  • Relevant Files:
    • modeling/dashboard_app.py: Streamlit application for visualizing ML outputs.
    • Sales vision 360 dashboard.pbix: Power BI desktop file for detailed sales analytics.
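A Streamlit view over the ML outputs can be sketched as below. The CSV column names (date, predicted_revenue, segment) are assumptions; modeling/dashboard_app.py may use different ones. Loading is kept separate from rendering so it can be tested without a running app:

```python
# Sketch of a Streamlit dashboard over the ML pipeline outputs.
import pandas as pd


def load_outputs(forecast_csv: str, segments_csv: str):
    """Pure loading step, kept separate from rendering for testability."""
    return pd.read_csv(forecast_csv), pd.read_csv(segments_csv)


def render(forecast: pd.DataFrame, segments: pd.DataFrame) -> None:
    import streamlit as st  # lazy import: only needed when serving the app

    st.title("Sales Vision 360")
    st.subheader("Daily Revenue Forecast")
    st.line_chart(forecast, x="date", y="predicted_revenue")
    st.subheader("Customer Segments")
    st.bar_chart(segments["segment"].value_counts())
```

A real app would call `load_outputs("modeling/outputs/forecast_results.csv", "modeling/outputs/customer_segments.csv")` followed by `render(...)` at module level and be launched with `streamlit run`.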

4. Project Structure

The project repository is organized as follows:

graph TD  
    A[Sales Vision 360 Project] --> B[Data Layer]  
    A --> C[Transformation Layer]  
    A --> D[ML & Analytics Layer]  
    A --> E[Visualization Layer]  
    A --> F[Infrastructure]
      
    B --> B1[data/]  
    B1 --> B2[products_catalog.csv]  
    B1 --> B3[sales_raw.txt]  
    B1 --> B4[sales_transactions.csv]
      
    C --> C1[sales_vision_360/ - dbt]  
    C1 --> C2[dimensions/]  
    C1 --> C3[fact/]  
    C1 --> C4[sales_transactions_transform.sql]  
    C2 --> C5[customer_dim.sql]  
    C2 --> C6[product_dim.sql]  
    C2 --> C7[order_dim.sql]  
    C3 --> C8[financial_fact.sql]
      
    D --> D1[modeling/]  
    D1 --> D2[ml_pipelines_all.py]  
    D1 --> D3[churn_eda_diagnostic.py]  
    D1 --> D4[dashboard_app.py]  
    D1 --> D5[outputs/]  
    D5 --> D6[Models & Results]
      
    E --> E1[Sales vision 360 dashboard.pbix]  
    E --> E2[Streamlit Dashboard]
      
    F --> F1[csv_download.py]  
    F --> F2[snowflake_connector.py]  
    F --> F3[snowflake_credits_checker.py]
      
    style A fill:#e1f5ff  
    style B fill:#fff4e1  
    style C fill:#e8f5e9  
    style D fill:#f3e5f5  
    style E fill:#fce4ec  
    style F fill:#f5f5f5  

5. Getting Started

To set up and run the Sales Vision 360 project, follow these steps:

5.1 Prerequisites

Before you begin, ensure you have the following installed and configured:

  • Python 3.8+: Download and install from python.org.
  • Snowflake Account: Access to a Snowflake data warehouse with appropriate permissions.
  • dbt Core: Install dbt-snowflake adapter: pip install dbt-snowflake.
  • Environment Variables (.env): A .env file in the root directory of the project with your Snowflake credentials:
    SNOWFLAKE_USER=your_username
    SNOWFLAKE_PASSWORD=your_password
    SNOWFLAKE_ACCOUNT=your_account_identifier
    
  • Browser for Scraping: Playwright drives Google Chrome for the scraping step.
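To illustrate how the scripts pick up these credentials, here is a minimal stdlib sketch of what a .env loader (for example, python-dotenv's load_dotenv, which the project scripts likely use) does. It is shown for illustration only:

```python
# Minimal sketch of .env loading: parse KEY=VALUE lines, skip comments and
# blanks, export to os.environ. python-dotenv does this and more.
import os
from pathlib import Path


def load_env(path: str = ".env", override: bool = False) -> dict:
    """Load KEY=VALUE pairs from `path` into os.environ; return what was read."""
    loaded = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"').strip("'")
        if override or key not in os.environ:
            os.environ[key] = value
        loaded[key] = value
    return loaded
```

After `load_env()`, the Snowflake scripts can read `os.environ["SNOWFLAKE_USER"]` and friends.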

5.2 Installation and Setup

  1. Clone the Repository:

    git clone https://github.com/your-username/SalesVision360.git
    cd SalesVision360
  2. Set up Python Environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: `venv\Scripts\activate`
    pip install -r requirements.txt  # if requirements.txt is missing, create one from the scripts' imports
  3. Configure dbt Profile: Ensure your dbt profiles.yml (typically located at ~/.dbt/profiles.yml) is configured to connect to your Snowflake instance using the sales_vision_360 profile. An example profiles.yml might look like this:

    sales_vision_360:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
          user: "{{ env_var('SNOWFLAKE_USER') }}"
          password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
          role: SALES_360 # Or your specific role
          warehouse: SALES_VISION_360 # Or your specific warehouse
          database: MYDATABASE # Or your specific database
          schema: ABSSALESVISION360 # Or your specific schema
          threads: 4
          client_session_keep_alive: False
          query_tag: dbt-sales-vision-360

    You can verify your dbt connection by running dbt debug from within the sales_vision_360 directory.

  4. Run the Data Pipeline: Follow the detailed instructions in the Detailed Setup Guide for executing the scraping, data loading, dbt transformations, and ML pipelines.

6. Further Documentation

For more in-depth information on specific aspects of the project, refer to the supplementary documents in the repository, such as the dbt project README (sales_vision_360/README.md).

7. Glossary

  • dbt: Data Build Tool, an open-source tool for data transformation.
  • EDA: Exploratory Data Analysis.
  • MAPE: Mean Absolute Percentage Error, a measure of prediction accuracy.
  • ML: Machine Learning.
  • Power BI: A business intelligence tool for data visualization.
  • RMSE: Root Mean Square Error, a measure of prediction accuracy.
  • SKEY: Surrogate Key, a system-generated primary key in a data warehouse.
  • Snowflake: A cloud-based data warehousing platform.
  • Snowflake Connector: Python library to connect to Snowflake.
  • Streamlit: An open-source app framework for machine learning and data science.
  • Web Scraping: The process of extracting data from websites.
