Inference and monitoring #2
Merged
Improved the data preprocessing orchestration script to always skip raw snapshot registration if metadata already exists, regardless of the `--skip-if-existing` flag. This ensures that existing metadata is never overwritten, which could otherwise lead to inconsistencies. Updated the fake data generator to include PII columns, so that downstream pipelines run without manual fixes. Changed the defaults in the same script to train a new model and to generate 5000 rows of data. This keeps downstream execution working, since feature sets require at least 5000 rows. Changed the minimum row requirement in the interim hotel bookings config to 5000 as well; creating a new version of the interim configs for this purpose alone would be overkill at this point. Generated some fake data and snapshot bindings. Updated a test that was broken by the changes to the data preprocessing orchestrator. Prepared some code to adapt later when creating infer.py. Data, feature sets, and experiments are all still held locally for now, but will be added to the repo eventually. Now is not the time to do so, as the code is still evolving rapidly.
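For reviewers, the skip rule and the row threshold described above can be sketched roughly as follows. This is only an illustration of the stated behavior, not the actual orchestrator code: the function names, the `MIN_ROWS` constant, and the idea that metadata lives in a single file are all assumptions.

```python
from pathlib import Path

# Threshold quoted in the commit message; the constant name is hypothetical.
MIN_ROWS = 5000


def should_register_snapshot(metadata_path: Path, skip_if_existing: bool) -> bool:
    """Return True only when no metadata exists yet for the raw snapshot.

    `skip_if_existing` is kept in the signature to mirror the CLI flag, but it
    is deliberately ignored: existing metadata is never overwritten.
    """
    return not metadata_path.exists()


def has_enough_rows(n_rows: int, min_rows: int = MIN_ROWS) -> bool:
    """Check a dataset against the minimum row count required downstream."""
    return n_rows >= min_rows
```

Keeping the flag in the signature while ignoring it is one way to stay compatible with existing callers; the real script may handle this differently.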
Wrote an initial draft of infer.py, which will be the main entry point for inference runs. It is not modularized yet, but it is a starting point for the overall structure of inference runs. This required some upstream changes as well. Tests have been updated too. Marked infer.py as in progress.
Improved infer.py to include per-class probability columns, and to output some useful metadata.
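As a rough picture of what per-class probability columns look like in the output, here is a minimal stdlib-only sketch. It is not taken from infer.py: the function name, the `proba_` prefix, the `prediction` column, and the row-as-dict representation are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence


def add_probability_columns(
    rows: Sequence[Dict[str, Any]],
    class_labels: Sequence[str],
    probabilities: Sequence[Sequence[float]],
    prefix: str = "proba_",  # hypothetical column prefix
) -> List[Dict[str, Any]]:
    """Return copies of `rows` with one probability column per class.

    Also adds a `prediction` column holding the most probable class.
    """
    if len(rows) != len(probabilities):
        raise ValueError("rows and probabilities must have the same length")

    enriched_rows = []
    for row, probs in zip(rows, probabilities):
        if len(probs) != len(class_labels):
            raise ValueError("each probability vector needs one entry per class")
        enriched = dict(row)  # copy, so the input rows stay untouched
        for label, prob in zip(class_labels, probs):
            enriched[f"{prefix}{label}"] = float(prob)
        best_index = max(range(len(probs)), key=lambda i: probs[i])
        enriched["prediction"] = class_labels[best_index]
        enriched_rows.append(enriched)
    return enriched_rows
```

With two classes, each output row would gain two probability columns plus the predicted label, which makes downstream thresholding and monitoring easier than storing the label alone.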
Wrote the monitor.py pipeline, which also required a few changes in the upstream code.
Modularized both inference and monitoring logic; added some docstrings; removed unused imports.
Added some print statements so that the frontend can show the user where the artifacts are being saved. Updated requirements.txt, since Docker needed it. Updated docker-compose.yml to mount the directories in use as volumes; some of them were missing earlier.
Improved UI by adding background colors to the result text areas for pipelines and scripts, indicating success (green) or error (red) based on the presence of an "error" key in the result. Removed some parts of the comments from docker-related files.
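The coloring rule described above comes down to a single key check. The sketch below shows that rule in Python purely for illustration; the frontend may be written in a different language, and the function and constant names are hypothetical.

```python
from typing import Any, Mapping

# Color names are illustrative; the real UI may use different values.
SUCCESS_COLOR = "green"
ERROR_COLOR = "red"


def result_background(result: Mapping[str, Any]) -> str:
    """Pick a background color from the presence of an "error" key."""
    return ERROR_COLOR if "error" in result else SUCCESS_COLOR
```

Note that under this rule a result with `"error": None` would still be shown as red, since only the presence of the key is checked, not its value.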
Updated all of the relevant documentation, as well as the main README.md. Fixed a bug in .gitignore where env configs were ignored, while some of the log files were not. Fixed a bug in the fake data generator that caused failures due to poor datetime handling. This also required a change in ml/components/feature_engineering/arrival_date.py, where the feature was not handled correctly and assumed ideal scenarios. Added some tests for the snapshot binding generator. Updated pyproject.toml to exclude the fake data generator from coverage, since it requires very specific packages and is not a core part of the repo anyway. Added some new GIFs and UML diagrams and updated the old ones, as part of the documentation update. Included some artifacts and logs in the commit, so that the user can immediately see the expected structure and content of the outputs of some of the relevant pipelines and scripts, without having to run them first. Only included a handful of them, to avoid cluttering the repo with too many files.
Description
The main contributions are the two new pipelines: infer.py and monitor.py.
The two pipelines enable model inference and monitoring, which
completes the model lifecycle.
While writing these, some related and unrelated bugs were discovered as well.
This branch includes many bug fixes, updated documentation, added tests,
and some data and artifacts, so that the user can view the expected outputs of
the relevant pipelines immediately, without running anything. The main README.md
has been updated as well, and includes some new GIFs.
Type of change
Checklist