Group 2 Web-scale Data Management Project

Important

Note to reader: if you are short on time and need to skip parts of this README, we consider the following sections must-reads:

Architecture Overview
A Note on Our Consistency Model
Choreographing Saga
Core Components


📖 Introduction

This project serves as a scalable and distributed data management system.

It is built with a microservices architecture to efficiently handle orders, payments, and inventory management.

A Note on Our Consistency Model

Our system is designed to ensure eventual consistency when performing transactions. However, we tried hard to go beyond the classical eventual consistency model and keep our TTC (Time-To-Consistency) very low.

Our Saga protocol is designed so that a transaction, which consists of a stock decrement, a payment decrement, and marking the checkout, is committed to the database only once the checkout has been validated internally. As a result, our system spends considerably less time in an inconsistent state than a typical eventually consistent application.

Because of this, we prefer to refer to this consistency model as "Quick Consistency".

Architecture Overview

The project implements a microservices-based system for transaction processing, built around three core services: Order, Payment, and Stock. These services communicate synchronously via gRPC for request handling and asynchronously through Redis streams for transaction coordination. The system employs Redis as its primary database, with Redis Sentinel providing high availability through automatic failover capabilities.

The core of the system is a choreography-based Saga pattern that manages distributed transactions. When a checkout is initiated, the Order service coordinates with Payment and Stock services to validate and process the transaction. Each service records its transaction state in Redis streams, which are continuously monitored by dedicated stream processors. This approach allows the system to maintain data integrity even during partial service failures through compensating transactions.

The architecture implements what we call "Quick Consistency" - an optimized form of eventual consistency that minimizes the time services spend in inconsistent states. Transactions are committed to the database only after internal validation, significantly reducing the Time-To-Consistency (TTC). This design choice balances the performance benefits of asynchronous processing with the reliability needs of a transaction-based system.

Performance is enhanced through several optimizations, including asynchronous processing with asyncio, parallel execution of operations, and Redis Lua scripts for atomic actions. The system successfully handles 5,000+ concurrent transactions with response times averaging 300-400ms under normal load, and shows near-linear scaling when adding service replicas. The architecture prioritizes high availability and partition tolerance while ensuring data synchronization across all services.

⚠️ Instructions

‼️ 1. Consistency Test :

We enhanced the consistency test provided in the benchmark. The initial version did not account for systems that took time to process all checkouts. Moreover, it also assumed all the transactions it attempted were successful, but that is not always the case and the failed transactions need to be accounted for when checking consistency.

We have implemented an enhanced test suite that addresses these problems by allowing the user to re-run the check for consistency multiple times until all checkouts are processed. Moreover, it also shows useful metrics pertaining to the state of the system.

To run the consistency test, run the following commands in the terminal:

docker compose up --build -d
python tests/consistency/test.py

Or, if using PyCharm, you can click on the shortcut below after running command 1: test

‼️ 2. Performance Tests :

For your convenience, our docker compose file also spins up a locust instance with 3 workers. You can access it at http://localhost:8089. When you start the tests, it also initializes users, stock and order automatically.

You might have to click the start button twice.

😵 3. What can be killed :

Try anything. :)

🏗️ 4. Scaling the system for better performance:

Order Service: For order-service, change the number of replicas by modifying the replicas value in the deployment section.
Stock Service: For stock-service, stock-rpc, and stock-stream, change the number of replicas by modifying the replicas value in the deployment section.
Payment Service: For payment-service, payment-rpc, and payment-stream, change the number of replicas by modifying the replicas value in the deployment section.
Locust: The Locust worker container can be scaled to multiple replicas by modifying the replicas value in the deployment section.

👥 Contributors

🏗️ Core Components

🔴 Redis (Database)

Primary database for all services. Stores inventory, payment, and orders information.

Redis Streams

Redis Streams are used internally by the Payment and Stock microservices to asynchronously store and consume transactions that need to be processed. They give us the benefits of message queues without the additional overhead of running one.

Redis Lua Scripts

We use Lua scripts to perform atomic actions in the Redis database without the overhead of creating transactions. For example, we have Lua scripts that atomically check whether a value is greater than another and, if so, decrement it.
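A minimal sketch of such a check-and-decrement script follows. It assumes a client that exposes `eval(script, numkeys, *keys_and_args)`, which is the call shape redis-py uses. The script body, the key name, and the `try_decrement` helper are illustrative and are not taken from the project's code. `InMemoryClient` is a hypothetical stand-in that mimics the script so the example runs without a Redis server.

```python
# Hypothetical Lua script: decrement KEYS[1] by ARGV[1] only if enough is left.
# Redis runs a script as a single atomic unit, so no other command can
# interleave between the check and the decrement.
CHECK_AND_DECREMENT = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local amount = tonumber(ARGV[1])
if current < amount then
    return -1
end
return redis.call('DECRBY', KEYS[1], amount)
"""


def try_decrement(client, key, amount):
    """Return the new value, or None when the current value is too low.

    `client` is anything exposing eval(script, numkeys, *keys_and_args).
    """
    result = client.eval(CHECK_AND_DECREMENT, 1, key, amount)
    return None if result == -1 else int(result)


class InMemoryClient:
    """Stand-in for a Redis client; it imitates the script in plain Python."""

    def __init__(self, data):
        self.data = dict(data)

    def eval(self, script, numkeys, key, amount):
        current = int(self.data.get(key, 0))
        if current < int(amount):
            return -1
        self.data[key] = current - int(amount)
        return self.data[key]


client = InMemoryClient({"stock:item-1": 5})
print(try_decrement(client, "stock:item-1", 3))  # 2
print(try_decrement(client, "stock:item-1", 3))  # None
```

Returning a sentinel value from the script lets the caller tell "not enough left" apart from a successful decrement in one round trip.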

Redis Sentinel

Sentinel is a monitoring system provided by Redis that detects database crashes and performs failover. More than one Sentinel can be deployed in a system. Each Sentinel is given the primary databases it needs to watch, and when it notices that one has failed it flags the failure to the other Sentinels. Once a quorum of Sentinels agrees that the primary is down, a failover is started and one of its replicas is promoted to master. Sentinel detects the replicas automatically by querying the primary it monitors, so they do not have to be listed in its configuration.

🔗 gRPC (Inter-Service Communication)

High-performance service communication protocol. Enables efficient cross-service communication.

🛒 Order Service

Coordinates transactions and manages the user shopping cart. Uses gRPC to call Payment and Stock services.

Note that the Order service only acts as a relay that sends decrement events to Stock and Payment. If one service's decrement action fails, Order does not attempt a rollback; it simply reports the failure. The rollback is eventually handled internally by the Stock and Payment services.

📦 Stock Service

Manages inventory tracking and stock adjustments. Capable of rollback operations for failed transactions.

💳 Payment Service

Handles user credit management and payment processing. Capable of rollback operations for failed transactions.

🛠️ Implementation Details

⚡ Asynchronous Communication

Quart (an async counterpart to Flask) handles concurrent requests
asyncio provides non-blocking I/O operations
Together, these improve performance under high load

🔄 Communication Protocol

Services communicate using gRPC, which offers several advantages over REST:

Binary protocol (more efficient than JSON)
Strong typing with Protocol Buffers
Bidirectional streaming capabilities
Lower latency

🔀 Transaction Protocol

We implemented a Choreography-based Saga pattern to manage distributed transactions across the Order, Payment, and Stock microservices.

💃 Choreography-based Saga Implementation:

🛒 Order Service
- Sends a request to Stock and Payment to deduct the amount for that order.
- Waits for the responses from Stock and Payment.
- If both are successful, returns 200 because the checkout was successful.
- If either of them fails, returns 400 and does not try again.

📦 Stock Service
- Verifies product availability. If sufficient stock is available, it deducts the items.
- Sends a response to the Order service with the result of the deduction (success/failure).
- Pushes the transaction ID for this operation to the Redis stream to be rolled back or committed.
- Continuously listens to messages from the Payment service for this transaction and rolls back if the payment deduction failed.
- Continuously processes the transaction ID stream and polls Payment for its status. If Payment is down, pushes the ID back to the stream.

💳 Payment Service
- Verifies that the user has sufficient funds. If sufficient funds are available, it deducts the amount from the user's credit.
- Sends a response to the Order service with the result of the deduction (success/failure).
- Pushes the transaction ID for this operation to the Redis stream to be rolled back or committed.
- Continuously listens to messages from the Stock service for this transaction and rolls back if the stock deduction failed.
- Continuously processes the transaction ID stream and polls Stock for its status. If Stock is down, pushes the ID back to the stream.

sequenceDiagram
    participant O as Order Service
    participant P as Payment Service
    participant S as Stock Service
    participant PS as Payment Stream
    participant SS as Stock Stream

    O->>P: ProcessPayment(PaymentRequest, tid)
    O->>S: RemoveStock(StockAdjustment, tid)
    P-->>O: PaymentResponse(success/failure)
    S-->>O: StockResponse(success/failure)
    P->>PS: Push transaction message
    S->>SS: Push transaction message
    PS->>P: VibeCheckTransactionStatus (poll)
    SS->>S: VibeCheckTransactionStatus (poll)
    P-->>PS: TransactionStatus (commit/rollback)
    S-->>SS: TransactionStatus (commit/rollback)
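
The Order side of the flow above can be sketched with asyncio as follows. This is an illustration under assumptions, not the project's code: `checkout`, `fake_pay`, and `fake_remove_stock` are hypothetical names, and the two fake coroutines stand in for the gRPC calls to Payment and Stock.

```python
import asyncio


async def checkout(order, pay, remove_stock, tid):
    """Ask Payment and Stock to deduct concurrently; no retry, no rollback here."""
    results = await asyncio.gather(
        pay(order["user_id"], order["total"], tid),
        remove_stock(order["items"], tid),
        return_exceptions=True,  # an unreachable service counts as a failure
    )
    # Only two explicit successes make the checkout succeed.
    return 200 if all(r is True for r in results) else 400


# Hypothetical stand-ins for the gRPC calls to Payment and Stock.
async def fake_pay(user_id, amount, tid):
    await asyncio.sleep(0.01)  # simulated network latency
    return amount <= 100       # pretend the user has 100 credits


async def fake_remove_stock(items, tid):
    await asyncio.sleep(0.01)
    return all(qty <= 5 for qty in items.values())  # pretend 5 units per item


order = {"user_id": "u1", "total": 40, "items": {"item-1": 2}}
print(asyncio.run(checkout(order, fake_pay, fake_remove_stock, "tid-1")))  # 200
```

Because both requests are awaited together, the checkout latency is roughly that of the slower call rather than the sum of the two. Any deduction that succeeded while the other failed is left for the owning service's stream processor to roll back.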

🔄 Consistency Model

Quick Consistency

Our system implements a quick consistency model across all microservices, prioritizing availability and partition tolerance while ensuring data synchronization over time.

How It Works

Asynchronous Reconciliation
- Payment and Stock services record transaction states in Redis streams.
- Dedicated stream processors continuously poll and process pending transactions.
- When services recover from failures, they process backlogged transactions.
Failure Handling
- If a service goes down during a transaction, other services queue the transaction ID
- Upon recovery, the service processes the queue to maintain data integrity
- The system handles both immediate and delayed compensation actions
Benefits
- Higher availability during partial system failures
- Better performance through parallel processing
Tradeoffs
- Temporary inconsistency windows (typically a few seconds)
- More complex system design and testing
- Requires careful transaction ID management
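
The reconciliation loop described above can be sketched in memory as follows. A `deque` stands in for the Redis stream, and `process_pending`, `peer_status`, `commit`, and `rollback` are hypothetical names chosen for this illustration. The real stream processors would read from Redis and poll the peer service over the network.

```python
from collections import deque


def process_pending(pending, peer_status, commit, rollback):
    """Run one pass over the pending transaction IDs.

    peer_status(tid) returns True/False for the peer's deduction result and
    raises ConnectionError when the peer is unreachable.
    """
    for _ in range(len(pending)):
        tid = pending.popleft()
        try:
            peer_ok = peer_status(tid)
        except ConnectionError:
            pending.append(tid)  # peer is down: push back and retry later
            continue
        if peer_ok:
            commit(tid)
        else:
            rollback(tid)  # compensating action: restore the deducted amount


# Demo: t1 succeeded on the peer, t2 failed, t3 cannot be checked yet.
statuses = {"t1": True, "t2": False}


def peer_status(tid):
    if tid not in statuses:
        raise ConnectionError("peer unreachable")
    return statuses[tid]


pending = deque(["t1", "t2", "t3"])
committed, rolled_back = [], []
process_pending(pending, peer_status, committed.append, rolled_back.append)
print(committed, rolled_back, list(pending))  # ['t1'] ['t2'] ['t3']
```

Bounding each pass by the queue length at the start means a transaction that was pushed back is not retried until the next pass, which avoids spinning while the peer is down.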

Verification

Our consistency testing accommodates the quick consistency model by:

Allowing for multiple test runs to verify system convergence
Checking both successful and failed transaction states
Validating that compensating transactions are correctly applied

For accurate testing, run the consistency check multiple times with delays between executions to allow the system to achieve a consistent state.
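
As a rough sketch of this re-run approach, the loop below repeats a check until the system has converged. The invariant used here (units removed from stock times the item price equals the credit spent) and the names `is_consistent` and `wait_for_consistency` are assumptions made for illustration. They are not necessarily what tests/consistency/test.py checks.

```python
import time


def is_consistent(initial_stock, initial_credit, price, stock_now, credit_now):
    """Units removed from stock must match the credit spent on them.

    Failed checkouts change neither side once they are rolled back, so
    they do not need to be counted separately.
    """
    units_sold = initial_stock - stock_now
    credit_spent = initial_credit - credit_now
    return units_sold * price == credit_spent


def wait_for_consistency(read_state, check, attempts=5, delay=1.0):
    """Re-run the check; return the attempt number that passed, else None."""
    for attempt in range(1, attempts + 1):
        if check(*read_state()):
            return attempt
        time.sleep(delay)  # give pending rollbacks time to be applied
    return None


# Demo: the first reading still contains a stock deduction awaiting rollback.
readings = iter([(96, 970), (97, 970)])  # (stock, total credit) per poll
attempt = wait_for_consistency(
    read_state=lambda: next(readings),
    check=lambda stock, credit: is_consistent(100, 1000, 10, stock, credit),
    delay=0.0,
)
print(attempt)  # 2
```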

🚀 Performance Characteristics

⏱️ Benchmarks

High Throughput: Successfully processes 5,000+ concurrent transactions
Latency: Typical transaction completion in 300-400ms under normal load
Scaling: Near-linear performance scaling with additional service replicas

🔍 Performance Optimizations

Asynchronous Processing: Non-blocking I/O with asyncio reduces wait times
Concurrent Operations: Parallel execution of payment and stock operations

⚖️ Performance Tradeoffs

Consistency vs Speed: Quick consistency model prioritizes performance over immediate consistency

Coordinated snapshots:

Although we are able to recover from a database container failure, some inconsistencies might be noticed across the wider system. While the state of an individual service is consistent, it might not be consistent with the current state of the rest of the system, because snapshots are not necessarily taken at exactly the same time. To fix that, we implemented logic for synchronized snapshots. In our solution, a dedicated "snapshot coordinator" service runs a scheduled task. The task tells the microservices to set local flags so that they stop processing their Redis streams or order checkouts until everyone is ready to take the snapshot. The coordinator continuously pings the microservices to check whether they have finished preparing. Once they all have, it signals them to take the snapshot. Afterwards, it resets their flags, allowing them to resume execution.

The problem with this approach is that it relies on locks (to make sure that all the stream processors are halted and ready for the snapshot). After a failure, a lock might remain acquired and never be released. Due to time constraints, we could not fully test this behavior or implement locks that are completely safe under failure. Therefore, the implementation is kept in the syncsnapshots branch. More specifically, cron is the snapshot coordinator. On the service side, the implementation consists of additional RPC methods / REST endpoints, as well as a pre_xread that lets the stream processors check whether the flag is set and carry out the preparation procedure.
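
The coordinator loop can be sketched as follows. This is not the code from the syncsnapshots branch: the `Service` class and its methods are hypothetical stand-ins for the RPC methods / REST endpoints, and the timeout plus `try`/`finally` is only one possible way to reduce the risk of flags staying set. It does not help if the coordinator itself crashes.

```python
import asyncio


class Service:
    """Hypothetical stand-in for a microservice's snapshot endpoints."""

    def __init__(self, name):
        self.name = name
        self.paused = False
        self.snapshots = 0

    async def pause(self):
        self.paused = True  # stop reading streams / accepting checkouts

    async def is_ready(self):
        return self.paused  # a real service would also drain in-flight work

    async def snapshot(self):
        self.snapshots += 1

    async def resume(self):
        self.paused = False


async def coordinated_snapshot(services, poll_interval=0.01, timeout=1.0):
    """Pause everyone, wait until all are ready, snapshot, then resume."""
    loop = asyncio.get_running_loop()
    try:
        await asyncio.gather(*(s.pause() for s in services))
        deadline = loop.time() + timeout
        while not all(await asyncio.gather(*(s.is_ready() for s in services))):
            if loop.time() > deadline:
                return False  # give up rather than block the system forever
            await asyncio.sleep(poll_interval)
        await asyncio.gather(*(s.snapshot() for s in services))
        return True
    finally:
        # Always clear the flags, even if a step above raised or timed out.
        await asyncio.gather(
            *(s.resume() for s in services), return_exceptions=True
        )


services = [Service("order"), Service("payment"), Service("stock")]
print(asyncio.run(coordinated_snapshot(services)))  # True
```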
