This project explores how synthetic Power BI-style usage data can be turned into practical intelligence for analytics teams. The current version includes a notebook-first forecasting baseline, behavioural analytics, report and user segmentation, diagnostics, and a lightweight batch GenAI insight layer built from shareable synthetic data.
The repository is structured so the workflow can be reviewed through notebooks or regenerated through Python scripts, while leaving a clean foundation for future modelling, evaluation, and demo work.
The notebooks demonstrate an end-to-end workflow for report usage forecasting, behavioural analytics, and GenAI-assisted insight generation:
- Generates synthetic report usage data with weekly patterns, trend, noise, and zero-activity days.
- Builds a clean semantic model from raw telemetry-style tables.
- Validates the semantic model before feature engineering.
- Builds daily report-level time series from user-level usage records.
- Builds feature marts for adoption, engagement, performance, and final forecast features.
- Applies data sufficiency checks before modelling.
- Trains per-report Auto-ARIMA models.
- Compares model performance against naive and seasonal-naive baselines.
- Builds report and user analytics outputs, including segmentation and diagnostics.
- Publishes forecast, metrics, segment, diagnostic, validation, and insight outputs for downstream review.
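As a rough illustration of the first step in the list above, a daily series with a weekly pattern, trend, noise, and zero-activity days can be simulated in a few lines. This is a minimal standard-library sketch, not the logic in generate_synthetic_data.py; the function name, parameters, and every numeric value are assumptions.

```python
import math
import random
from datetime import date, timedelta


def make_synthetic_usage(n_days=120, seed=42, zero_day_rate=0.05):
    """Return (date, views) rows with a weekly pattern, trend, noise, and zero days.

    Every parameter name and default value is an illustrative assumption.
    """
    rng = random.Random(seed)
    start = date(2024, 1, 1)  # arbitrary start date (a Monday)
    rows = []
    for i in range(n_days):
        day = start + timedelta(days=i)
        # Weekly pattern: weekdays (Mon=0..Fri=4) are busier than weekends.
        weekday_effect = 1.0 if day.weekday() < 5 else 0.3
        # Slow linear trend in the expected number of views.
        expected = (50 + 0.2 * i) * weekday_effect
        # Gaussian noise scaled to the expected level, floored at zero.
        views = max(0, round(rng.gauss(expected, math.sqrt(expected))))
        # Occasionally blank out a day to mimic zero-activity days.
        if rng.random() < zero_day_rate:
            views = 0
        rows.append((day, views))
    return rows


usage = make_synthetic_usage()
```

Fixing the seed keeps the generated data reproducible, which matters when downstream outputs are compared between runs.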
The project is designed as a portfolio-friendly version of a realistic analytics problem, without exposing private Power BI or organisational usage data.
Analytics teams often know which Power BI reports exist, but not which ones are becoming more important, which ones are losing engagement, or where future demand may require support. A usage intelligence workflow can help answer questions such as:
- Which reports are likely to see higher demand over the next month?
- Which reports have stable enough usage patterns to forecast responsibly?
- Which reports should be monitored because their usage is volatile, declining, or difficult to predict?
- How can GenAI summaries help stakeholders understand changes in report behaviour?
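The question of whether a report has stable enough usage to forecast responsibly can be approximated with a simple sufficiency check. The sketch below is one possible rule, assuming a minimum history length and a cap on the share of zero-activity days; the function name and both thresholds are hypothetical, not project settings.

```python
def is_forecastable(daily_views, min_days=56, max_zero_share=0.5):
    """Return (ok, reason) for one report's daily view counts.

    min_days and max_zero_share are assumed thresholds, not project settings.
    """
    n = len(daily_views)
    if n < min_days:
        return False, f"only {n} days of history (need {min_days})"
    zero_share = sum(1 for v in daily_views if v == 0) / n
    if zero_share > max_zero_share:
        return False, f"{zero_share:.0%} zero-activity days"
    return True, "sufficient history"


print(is_forecastable([12, 0, 9] * 30))  # 90 days, one third of them zero
print(is_forecastable([5] * 20))         # history too short
```

Returning a reason alongside the flag makes it easier to report why a report was excluded from modelling.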
The project currently includes the forecasting feature layer, behavioural analytics outputs, performance telemetry features, and a lightweight batch GenAI insight layer. Richer modelling beyond the baseline remains a planned extension.
The current workflow is intentionally lightweight:
- Synthetic usage data is generated into raw CSV tables.
- Semantic model build creates cleaned dimensions and facts under data/processed/.
- Validation checks the semantic model before downstream use.
- Feature engineering in notebooks/04_feature_engineering.ipynb builds reusable marts under data/processed/.
- Forecasting baseline in notebooks/05_forecasting_baseline.ipynb consumes data/processed/mart_forecast_features.csv, trains Auto-ARIMA models, and compares them with simple baselines.
- Report and user analytics create segmentation, diagnostics, and engagement outputs.
- Output tables are written to outputs/ for forecasts, metrics, segments, diagnostics, validation, and insights.
- GenAI insights read the output CSVs and publish structured report summaries under outputs/insights/.
- Future layers may add richer modelling, reviewer-friendly demo views, and stronger evaluation checks.
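One step in this flow is rolling user-level usage records up to a daily report-level series in which zero-activity days appear explicitly. The pandas sketch below shows one way this might be done; the column names (report_id, user_id, view_date) and the sample rows are assumptions, not the project's actual schema.

```python
import pandas as pd

# Hypothetical user-level usage records; column names are assumptions.
events = pd.DataFrame(
    {
        "report_id": ["R1", "R1", "R1", "R2", "R2"],
        "user_id": ["u1", "u2", "u1", "u3", "u3"],
        "view_date": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-03", "2024-01-02", "2024-01-02"]
        ),
    }
)

# Count views and distinct users per report per day.
daily = (
    events.groupby(["report_id", "view_date"])
    .agg(views=("user_id", "size"), unique_users=("user_id", "nunique"))
    .reset_index()
)

# Reindex every report onto the full calendar so zero-activity days are explicit.
calendar = pd.date_range(events["view_date"].min(), events["view_date"].max(), freq="D")
full_index = pd.MultiIndex.from_product(
    [daily["report_id"].unique(), calendar], names=["report_id", "view_date"]
)
daily = (
    daily.set_index(["report_id", "view_date"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
```

Filling missing days with zeros, rather than leaving gaps, keeps the series regularly spaced, which most time-series models expect.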
See docs/architecture.md for a small architecture note and future direction.
This is not just a time-series notebook. The aim is to show how forecasting can become part of a broader usage intelligence product:
- Forecasting: predict future report usage and compare against defensible baselines.
- Behavioural analytics: implemented feature marts for repeat use, concentration, inactivity gaps, and page-depth proxies.
- Performance telemetry: implemented feature marts for load-time levels, tails, and rolling performance signals.
- GenAI insight layer: lightweight batch-generated report summaries that explain forecast changes, risks, and stakeholder actions in plain language.
- Operational thinking: the current outputs already account for schema-safe tables, forecast history, and backfilling of realised errors.
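To make the baseline comparison concrete, here is a minimal sketch of naive and seasonal-naive forecasts scored with mean absolute error. It is an illustration under assumed names and toy data, not the code in src/models/baselines.py or src/models/evaluate.py.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length=7):
    """Repeat the values observed one season (here, one week) earlier."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy series with a strict weekly pattern: five busy weekdays, two quiet days.
week = [100, 110, 105, 108, 95, 20, 15]
series = week * 5
train, test = series[:-7], series[-7:]

naive_mae = mean_absolute_error(test, naive_forecast(train, 7))
seasonal_mae = mean_absolute_error(test, seasonal_naive_forecast(train, 7))
print(f"naive MAE={naive_mae:.1f}, seasonal-naive MAE={seasonal_mae:.1f}")
```

On this perfectly periodic toy series the seasonal-naive forecast has zero error, which is why a model such as Auto-ARIMA should be expected to beat both baselines on real data before it is trusted.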
The GenAI layer is intentionally lightweight in Version 0.1. It reads existing CSV outputs and writes structured report-level insights without adding a chatbot, vector database, or app layer.
report-usage-forecasting/
├── data/
│ ├── raw/ # Synthetic raw telemetry-style CSV tables
│ └── processed/ # Clean semantic model CSV tables
├── docs/ # Architecture and data model notes
├── notebooks/
│ ├── 01_generate_raw_tables.ipynb
│ ├── 02_build_semantic_model_csv.ipynb
│ ├── 03_validate_semantic_model_hybrid_gx_csv.ipynb
│ ├── 04_feature_engineering.ipynb
│ ├── 05_forecasting_baseline.ipynb
│ ├── 06_report_analytics.ipynb
│ ├── 07_user_analytics.ipynb
│ └── 08_genai_insights.ipynb
├── outputs/
│ ├── validation/ # Validation results and reconciliation outputs
│ ├── forecasts/ # Latest forecasts and forecast history
│ ├── metrics/ # Latest metrics, model comparisons, and error history
│ ├── segments/ # Report and user segmentation outputs
│ ├── diagnostics/ # Diagnostic rule outputs
│ ├── insights/ # Batch-generated GenAI insight outputs
│ └── anomalies/ # Optional anomaly outputs placeholder
├── src/
│ ├── data/
│ │ ├── generate_synthetic_data.py
│ │ ├── build_semantic_model.py
│ │ └── validate_model.py
│ ├── features/
│ │ ├── report_features.py
│ │ ├── engagement_features.py
│ │ ├── performance_features.py
│ │ └── build_forecast_features.py
│ ├── models/
│ │ ├── baselines.py
│ │ └── evaluate.py
│ ├── analytics/
│ │ ├── report_features.py
│ │ ├── report_segmentation.py
│ │ ├── report_diagnostics.py
│ │ ├── user_features.py
│ │ └── user_segmentation.py
│ ├── genai/
│ │ ├── prompts.py
│ │ └── insight_generator.py
│ └── pipelines/
│   ├── run_forecasting_pipeline.py
│   ├── run_report_analytics_pipeline.py
│   └── run_user_analytics_pipeline.py
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
From the project root:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook notebooks/01_generate_raw_tables.ipynb
Then run the notebooks in order. Generated CSV outputs are written to data/raw/, data/processed/, and the project-level outputs/ folder.
The data pipeline can be run in two ways:
- Notebooks for exploration, transparency, and storytelling.
- Python scripts for a repeatable CSV-based pipeline.
Pipeline flow:
data/raw/ -> data/processed/ -> outputs/validation/
data/processed/mart_forecast_features.csv -> outputs/forecasts/ + outputs/metrics/
Use this path when you want to inspect the logic, understand the modelling choices, or walk through the workflow step by step.
Run the notebooks in this order:
- notebooks/01_generate_raw_tables.ipynb
  - Generates synthetic raw telemetry-style tables.
  - Writes CSV files to data/raw/.
- notebooks/02_build_semantic_model_csv.ipynb
  - Builds clean dimension and fact tables.
  - Writes CSV files to data/processed/.
- notebooks/03_validate_semantic_model_hybrid_gx_csv.ipynb
  - Runs data quality checks using Great Expectations and pandas.
  - Writes validation outputs to outputs/validation/.
- notebooks/04_feature_engineering.ipynb
  - Builds report adoption, behavioural, performance, and final forecast feature marts.
  - Writes feature tables to data/processed/.
- notebooks/05_forecasting_baseline.ipynb
  - Reads data/processed/mart_forecast_features.csv.
  - Trains the forecasting baseline and writes model outputs to outputs/.
- notebooks/06_report_analytics.ipynb
  - Builds report-level analytics, segmentation, and diagnostics.
  - Writes outputs to outputs/segments/, outputs/diagnostics/, and outputs/metrics/.
- notebooks/07_user_analytics.ipynb
  - Builds user-level engagement features and segmentation outputs.
  - Writes outputs to outputs/segments/ and outputs/metrics/.
- notebooks/08_genai_insights.ipynb
  - Reads forecast, model performance, segment, and diagnostic CSV outputs.
  - Writes AI insight outputs to outputs/insights/.
Use this path when you want to regenerate the pipeline outputs consistently from the command line.
From the project root, run:
python src/data/generate_synthetic_data.py
python src/data/build_semantic_model.py
python src/data/validate_model.py
python -m src.pipelines.run_forecasting_pipeline
python -m src.pipelines.run_report_analytics_pipeline
python -m src.pipelines.run_user_analytics_pipeline
python -m src.genai.insight_generator
The scripts perform the same core workflow as the notebooks:
- generate_synthetic_data.py creates raw synthetic tables in data/raw/.
- build_semantic_model.py builds cleaned dimensions and fact tables in data/processed/.
- validate_model.py runs validation checks and writes results to outputs/validation/.
- run_forecasting_pipeline.py consumes data/processed/mart_forecast_features.csv when available, falls back to compatible processed report-level tables, and writes forecast outputs to outputs/forecasts/ plus metrics outputs to outputs/metrics/.
- run_report_analytics_pipeline.py writes report segments and diagnostics to outputs/segments/ and outputs/diagnostics/.
- run_user_analytics_pipeline.py writes user engagement features and user segments to outputs/metrics/ and outputs/segments/.
- insight_generator.py reads the latest report forecast, metric, segment, and diagnostic CSVs and writes structured insights to outputs/insights/.
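If typing the commands above one by one becomes tedious, they could be chained from a small driver script. The sketch below is a hypothetical helper (no such file exists in the repository layout shown); it assumes it is run from the project root and defaults to a dry run that only prints each command.

```python
import subprocess
import sys

# Arguments copied from the commands listed above, in their intended order.
PIPELINE_STEPS = [
    ["src/data/generate_synthetic_data.py"],
    ["src/data/build_semantic_model.py"],
    ["src/data/validate_model.py"],
    ["-m", "src.pipelines.run_forecasting_pipeline"],
    ["-m", "src.pipelines.run_report_analytics_pipeline"],
    ["-m", "src.pipelines.run_user_analytics_pipeline"],
    ["-m", "src.genai.insight_generator"],
]


def build_commands(python=sys.executable):
    """Prefix each step with the interpreter that should run it."""
    return [[python, *args] for args in PIPELINE_STEPS]


def run_pipeline(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on the first failing step, so later steps
            # never run against stale or missing inputs.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Using sys.executable keeps every step inside the active virtual environment rather than whichever python happens to be first on the PATH.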
- outputs/forecasts/ stores forecast outputs.
- outputs/metrics/ stores model performance and comparison outputs.
- outputs/segments/ stores report and user segmentation outputs.
- outputs/diagnostics/ stores diagnostic rule outputs.
- outputs/insights/ stores GenAI-generated insight outputs.
- outputs/validation/ stores validation and reconciliation outputs.
Version 0.1 adds a batch-generated report insight layer under src/genai/.
Expected inputs:
- outputs/forecasts/report_forecasts.csv
- outputs/metrics/model_performance.csv
- outputs/segments/report_segments.csv
- outputs/diagnostics/report_diagnostics.csv
For compatibility with the current forecasting pipeline, the generator also recognises report_view_forecasts_latest.csv, report_view_metrics_latest.csv, and report_model_comparison_latest.csv.
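Accepting alternative file names like this usually comes down to trying candidates in order and taking the first that exists. The sketch below illustrates the idea only. Which fallback pairs with which preferred file, and the folders the fallback files live in, are assumptions made for the example, and resolve_input is a hypothetical helper rather than a function from src/genai/.

```python
import tempfile
from pathlib import Path

# Preferred name first, then fallbacks. The pairing and the folders used
# for the fallback files are assumptions made for this sketch.
CANDIDATES = {
    "forecasts": [
        "forecasts/report_forecasts.csv",
        "forecasts/report_view_forecasts_latest.csv",
    ],
    "metrics": [
        "metrics/model_performance.csv",
        "metrics/report_view_metrics_latest.csv",
        "metrics/report_model_comparison_latest.csv",
    ],
}


def resolve_input(outputs_dir, kind):
    """Return the first existing candidate path for `kind`, or None."""
    for relative in CANDIDATES[kind]:
        path = Path(outputs_dir) / relative
        if path.is_file():
            return path
    return None


# Demo against a throwaway directory that holds only the fallback forecast file.
with tempfile.TemporaryDirectory() as tmp:
    fallback = Path(tmp) / "forecasts" / "report_view_forecasts_latest.csv"
    fallback.parent.mkdir(parents=True)
    fallback.write_text("report_id,forecast\n")
    found_forecasts = resolve_input(tmp, "forecasts")
    found_metrics = resolve_input(tmp, "metrics")
    found_name = found_forecasts.name
```

Returning None instead of raising lets the caller decide whether a missing input should stop the run or simply be skipped.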
Run from the project root:
python -m src.genai.insight_generator
Outputs:
- outputs/insights/report_ai_insights.json
- outputs/insights/report_ai_insights.md
To use an OpenAI model, set OPENAI_API_KEY in your environment before running the script. Do not store API keys in the repository. If OPENAI_API_KEY is missing, the script generates deterministic rule-based placeholder insights so the notebook and command-line workflow still run.
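The fallback behaviour described above can be pictured as a simple branch on the environment variable. The following is a minimal sketch, not the logic in insight_generator.py: the function names, the 10% change thresholds, and the wording of the summaries are all assumptions, and the model call itself is deliberately left out.

```python
import os


def rule_based_insight(report_id, last_actual, next_forecast):
    """Deterministic placeholder summary; the 10% thresholds are assumptions."""
    if last_actual == 0:
        return f"{report_id}: no recent usage, so forecast change is undefined."
    change = (next_forecast - last_actual) / last_actual
    if change > 0.10:
        direction = "expected to rise"
    elif change < -0.10:
        direction = "expected to fall"
    else:
        direction = "expected to stay broadly flat"
    return f"{report_id}: usage is {direction} ({change:+.0%} vs. recent level)."


def generate_insight(report_id, last_actual, next_forecast, env=None):
    """Use the model path only when an API key is present in the environment."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        # A real implementation would call the model here; omitted in this sketch.
        return f"{report_id}: [LLM-generated summary would go here]"
    return rule_based_insight(report_id, last_actual, next_forecast)


print(generate_insight("R1", 200, 260, env={}))
```

Passing the environment in as an argument keeps the branch easy to test without touching real credentials.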
- Separates raw telemetry-style data from cleaned semantic model outputs.
- Mirrors a real-world analytics engineering workflow.
- Supports both experimentation and reproducibility.
- Makes the project easier to extend with forecasting features, behavioural analytics, and GenAI insight evaluation.
Implemented now:
- Synthetic Power BI-style usage dataset.
- Semantic model build.
- Hybrid validation using Great Expectations and pandas checks.
- Feature marts for report usage, engagement, performance, and forecasting.
- Forecasting baseline with naive and seasonal-naive comparisons.
- Report analytics.
- User analytics.
- Diagnostics.
- Segmentation.
- Batch GenAI insight layer.
Planned next:
- Add a Streamlit app for a reviewer-friendly demo.
- Add screenshots or sample output images to the README.
- Improve forecast evaluation with rolling-origin backtesting.
- Add a stronger model governance table.
- Add optional open-source forecasting model comparison.
- Add GenAI output evaluation or prompt quality checks.
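Rolling-origin backtesting, listed above as a planned improvement, evaluates a model over several successive forecast origins instead of a single train/test split. A minimal sketch of the split logic, assuming an expanding training window and hypothetical parameter names, might look like this:

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) with an expanding training window."""
    step = horizon if step is None else step
    origin = initial_train
    while origin + horizon <= n_obs:
        # Train on everything before the origin, test on the next `horizon` points.
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


for train_idx, test_idx in rolling_origin_splits(n_obs=42, initial_train=28, horizon=7):
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Averaging the error across folds gives a more stable picture of forecast accuracy than one hold-out period, at the cost of refitting the model once per fold.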