Near Real-Time Data Ingestion & MLOps Pipeline

This project demos a full end-to-end near real-time:

ingestion of data
data ETL
model training & validation
model deployment
model monitoring

It captures in near real-time blockchain transactions data and by default computes, per minute, the amount of average transaction fees. These data are store in SageMaker Feature Store. An MLOps pipeline, uses the Amazon SageMaker DeepAR forecasting algorithm to train a forecating model predicting the average transaction fees with a prediction window of 30 data points (in our case 30 minutes). Although it might be irrelevant, data are aggregated and predicted using a 1 minute window in order to quickly gather enough data, get results and quickls see the MLOps pipeline automation in action. Once enough data have been captured and a first model trained, you will be able to see the model being deployed being a SageMaker API endpoint and resources being provisioned to monitor the model. If the model performance alarm threshold is breached, you will see alarms in the dashboard and the model training pipeline being automatically triggered to retrain a new model based on the lastest ingested data, thus, fully atuomating the training, deployment and monitoring lifecycle of the model.

Warning

Since the creation of this demo, AWS has deprecated the CodeCommit service which is used by the MLOps project. It is not possible anymore to create a CodeCommit repositories. The MLOps project will not work as is. An issue has been opened to update this demo: Issue #63

The Data

For this project, we decided to ingest blockchain transactions from the blockchain.com API (see documentation here). We focus on 3 simple metrics:

The total number of transactions
The total amount of transaction fees
The average amount of transaction fees

These metrics are computed per minute. Although it might not be the best window period to analyze blockchain transactions, it allows us to quickly gather a lot of data points in a short period of time, avoiding to run the demo for too long which has an impact on the AWS billing.

Architecture

The Full Architecture

Near Real Time Data Ingestion Architecture

At a high level, the data are ingested as follow:

A container running on AWS Fargate polls the data and writes them to EventBridge.
EventBridge sends the data to a Lambda Function which writes filtered data into a Kinesis Data Stream.
A Managed Service for Apache Flink reads the data from the Kinesis Data Stream and aggregates the data using a 1 minute tumbling window and puts the results into a delivery Kinesis Data Stream.
A Lambda Funtion reads the aggregated data from the delivery Kinesis Data Stream and writes them into SageMaker Feature Store.

See this documentation for more details.

MLOps Architecture

The MLOps project contains 3 CodeCommit repositories with their own CodePipeline pipelines to train, deploy and monitor the model.

The Model Build pipeline creates a SageMaker Pipeline orchestrating all the steps to train a model and, if it passes the validation threshold, registers the model, which then has to be manually approved.
If the registered model is approved, the Model Deploy pipeline is automatically triggered and deploys the model behind 2 SageMaker API Endpoints, one for the staging environment and one for the production environment.
If the new model baseline performance is better than the existing one, we update the SSM Parameter storing the model monitoring threshold with the new model performance. When the new model monitoring is deployed, the alarm threshold will be updated from the SSM Parameter value. Note that in a real world scenario you might not want to update the model monitoring threshold like this, as it might be dictated by business criterias. We do this in this demo to test automated retraining and improvement of the model.
Once the endpoints are IN_SERVICE, it triggers automatically the Model Monitor pipeline, which deploys resources to monitor the pipeline
Every hour, a SageMaker Pipeline is executed to test the model against the latest data. The performance of the model is tracked by the SageMaker Monitoring service and a custom CloudWatch Alarm. We use a custom CloudWatch Alarm because, as of now
- AWS does not support monitoring for custom metrics (like the weighted quantile loss we use for the DeepAR model) and,
- AWS does not provide any built-in mechanism to capture the alarms raised by the SageMaker Monitoring service when a model performance is breached, in order to perform automatic retraining of the model.
If our custom metric is breached, the CloudWatch Alarm will trigger a Lambda Function, which will trigger the Model Build pipeline, retraining a new model, looping back automatically to the step one of this MLOps pipeline.

What are the Prerequisites?

The list below is for Windows environment

The AWS CLI (documentation)
Docker Desktop (Documentation)
NPM and Node.js (Documentation)
The AWS CDK: npm install -g aws-cdk
Configure your programmatic access to AWS for the CLI (see instructions here).

Documentation

Most of the architecture is fully automated through the CDK. You do not have much to configure and can directly play with the application and watch it ingest and aggregate the data in near real time. Once you have ingested some data you can start to train a forecasting model by running the corresponding pipeline and see the entire MLOps process automation.

Cost

This demo deploys many services (e.g. Fargate, DynamoDB, 2xKinesis Data Streams, Kinesis Firehose, Managed Service for Apache Flink, SageMaker endpoints...) and must be run for several days to collect enough data to be able to start training a model and see the model being retrained. This demo do generate costs which could be expensive, depending on your budget. The full demo costs about $480 per month in the Ireland region.

Name		Name	Last commit message	Last commit date
Latest commit History 559 Commits
bin		bin
doc		doc
lib		lib
resources		resources
test		test
.gitignore		.gitignore
.npmignore		.npmignore
LICENSE		LICENSE
README.md		README.md
cdk.json		cdk.json
jest.config.js		jest.config.js
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Near Real-Time Data Ingestion & MLOps Pipeline

The Data

Architecture

The Full Architecture

Near Real Time Data Ingestion Architecture

MLOps Architecture

What are the Prerequisites?

Documentation

Cost

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Near Real-Time Data Ingestion & MLOps Pipeline

The Data

Architecture

The Full Architecture

Near Real Time Data Ingestion Architecture

MLOps Architecture

What are the Prerequisites?

Documentation

Cost

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages