This end-to-end project aims to forecast retail store sales based on historical data, helping businesses optimize inventory management, reduce waste, and improve revenue planning. It includes data preprocessing, feature engineering, model training, and deployment-ready code.
- Data Preparation: Addressed missing values, normalized time series, and created aggregated features for store and category levels.
- Feature Engineering: Generated lag features, rolling averages, seasonal indicators, and holiday-based features to capture temporal patterns.
- Model Optimization: Applied Optuna for hyperparameter tuning of XGBoost, reducing RMSLE by 15%.
- Validation and Testing: Used TimeSeriesSplit for proper evaluation of sequential data and implemented Pytest for functional testing of core modules.
- Deployment-Ready: Integrated CI/CD pipelines and containerized the project using Docker.
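
The feature engineering and validation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names `date`, `store_nbr`, and `sales`, the helper `add_temporal_features`, and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-store lag, rolling-mean, and day-of-week features (hypothetical helper)."""
    df = df.sort_values(["store_nbr", "date"]).copy()
    grouped = df.groupby("store_nbr")["sales"]
    # Lag features: sales 1 and 7 days earlier within the same store.
    df["sales_lag_1"] = grouped.shift(1)
    df["sales_lag_7"] = grouped.shift(7)
    # Rolling mean of the previous 7 days; shift(1) keeps the current
    # day's sales out of its own feature, which avoids target leakage.
    df["sales_roll_mean_7"] = grouped.transform(
        lambda s: s.shift(1).rolling(window=7).mean()
    )
    # Simple seasonal indicator (Monday=0 ... Sunday=6).
    df["day_of_week"] = df["date"].dt.dayofweek
    return df


# Toy data (assumed layout): two stores, 30 consecutive days each.
dates = pd.date_range("2017-01-01", periods=30, freq="D")
demo = pd.concat(
    [
        pd.DataFrame(
            {
                "date": dates,
                "store_nbr": store,
                "sales": np.arange(30, dtype=float) + 100 * (store - 1),
            }
        )
        for store in (1, 2)
    ],
    ignore_index=True,
)
features = add_temporal_features(demo)

# Split on unique dates so every fold trains on the past and tests on the future.
unique_dates = pd.DatetimeIndex(features["date"].unique()).sort_values()
tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, test_idx in tscv.split(np.arange(len(unique_dates))):
    train_end = unique_dates[train_idx[-1]]
    test_start = unique_dates[test_idx[0]]
    folds.append((train_end, test_start))
```

Splitting on dates rather than on rows is one way to keep all stores of a given day in the same fold; the repository may handle this differently.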
- Achieved an RMSLE of 0.75094, outperforming baseline methods such as moving averages and linear regression by 15%.
- Accurate forecasts provide actionable insights for inventory and demand planning.
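
For reference, RMSLE (root mean squared logarithmic error) is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. A minimal sketch follows, assuming NumPy arrays of non-negative sales; the function name `rmsle` and the choice to clip negative predictions to zero are illustrative, not taken from the repository.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Assumption: negative predictions are clipped to zero so the log is defined.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


perfect = rmsle([3.0, 5.0], [3.0, 5.0])
one_log_unit_off = rmsle([np.expm1(1.0)], [0.0])
```

Because the error is computed on a log scale, the metric penalizes relative rather than absolute deviations, which suits sales series whose volumes differ widely across stores and categories.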
- Inventory Optimization: Accurate sales forecasts reduce overstock and stockouts, minimizing storage costs and lost revenue.
- Revenue Planning: Helps align inventory and workforce with expected sales patterns.
- Strategic Insights: Enables better decision-making for promotions, pricing, and holiday planning.
- Programming Language: Python
- Libraries: pandas, numpy, XGBoost, Optuna, Scikit-learn, Matplotlib, Seaborn
- Tools: Jupyter Notebook, Pytest, Docker, CI/CD
- Clone the repository:
git clone https://github.com/NasdormML/Store_Sales_Forecasting.git
cd Store_Sales_Forecasting
- Install the required dependencies:
pip install -r requirements.txt
- Run the main script:
python main.py
- Build the Docker image:
docker build -t store-sales .
- Run the Docker container:
docker run -p 8080:8080 store-sales
- Source: The dataset is publicly available on Kaggle. It includes historical sales data, product categories, holidays, and other relevant features.
- Preprocessing:
  - Cleaned missing values using forward filling and interpolation methods.
  - Created aggregated features at category and store levels.
  - Removed outliers and addressed data leakage risks.
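
The missing-value handling described above could look roughly like this. It is a sketch under assumptions: the columns `price` and `promo_flag` and the toy values are hypothetical, and which columns receive which method in the actual project is not stated here.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical columns, daily index).
raw = pd.DataFrame(
    {
        "date": pd.date_range("2017-01-01", periods=6, freq="D"),
        "price": [10.0, np.nan, np.nan, 16.0, np.nan, 20.0],
        "promo_flag": [1.0, np.nan, 0.0, np.nan, np.nan, 1.0],
    }
).set_index("date")

cleaned = raw.copy()
# Continuous signal: linear interpolation between the known neighbours.
cleaned["price"] = raw["price"].interpolate(method="linear")
# Step-like signal: carry the last observed value forward.
cleaned["promo_flag"] = raw["promo_flag"].ffill()
```

Interpolation tends to suit smoothly varying quantities, while forward filling suits values that hold until they change. Note that forward filling leaves leading gaps untouched, so those need separate handling.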
time_series_project/
├── .github/
│   └── workflows/
│       └── ci.yml                    # CI/CD configuration file
├── data/
│   ├── processed/                    # Preprocessed data ready for modeling
│   └── raw/                          # Raw input data
├── models/                           # Saved trained models
├── notebooks/                        # Jupyter notebooks for exploratory analysis
│   ├── EDA.ipynb                     # Exploratory Data Analysis
│   └── store_sales_kaggle.ipynb      # Additional exploratory analysis
├── src/                              # Source code of the project
│   ├── data_preparation.py           # Code for data preprocessing
│   ├── model_prediction.py           # Code for generating predictions
│   └── model_training.py             # Code for training the model
├── tests/                            # Unit and integration tests
├── dockerfile                        # Dockerfile for containerizing the project
├── main.py                           # Entry point for running the project
├── README.md                         # Project description and documentation
└── requirements.txt                  # Python dependencies
If you have any questions or suggestions, feel free to reach out:
- Email: nasdorm.ml@inbox.ru
This project is licensed under the MIT License. See the LICENSE file for details.