Python Data Source for AWS DynamoDB

A Python Data Source for Apache Spark that enables batch reads and batch/streaming writes to AWS DynamoDB; streaming reads via DynamoDB Streams are planned (see Phase Status).

Features

  • Batch Writes: Write DataFrames to DynamoDB tables with batch_writer
  • Streaming Writes: Write streaming DataFrames with micro-batch processing
  • Batch Reads: Parallel scan with segment-based partitioning
  • Schema Derivation: Auto-infer schema from table items, or provide explicit schema for projection
  • Delete Flag Support: Conditional row deletion during writes
  • Primary Key Validation: Early validation ensures DataFrame contains all key columns
  • Float/Decimal Handling: Automatic float-to-Decimal conversion for writes (see the sketch after this list)
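
boto3's DynamoDB resource API rejects Python floats, so numeric values have to reach it as decimal.Decimal. A minimal sketch of that kind of conversion (the helper name is illustrative, not this package's actual API):

from decimal import Decimal

def to_dynamodb_value(value):
    """Recursively replace floats with Decimal so boto3 will accept the item."""
    if isinstance(value, float):
        # Round-trip through str to avoid binary-float noise like 0.30000000000000004
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamodb_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_dynamodb_value(v) for v in value]
    return value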

Installation

poetry install

Quick Start

Batch Write

from pyspark.sql import SparkSession
from dynamodb_data_source import DynamoDbDataSource

spark = SparkSession.builder.appName("dynamodb").getOrCreate()
spark.dataSource.register(DynamoDbDataSource)

df = spark.createDataFrame([
    ("id-001", "Alice", 30),
    ("id-002", "Bob", 25)
], ["id", "name", "age"])

df.write.format("dynamodb") \
    .mode("append") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .save()

Streaming Write

df.writeStream.format("dynamodb") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .start()
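
Each micro-batch is written through the same batch-write path as a batch write, and the checkpoint location lets Spark resume from the last committed batch after a restart.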

Write with Delete Flag

df.write.format("dynamodb") \
    .mode("append") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .option("delete_flag_column", "is_deleted") \
    .option("delete_flag_value", "true") \
    .save()
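
Rows whose delete_flag_column value equals delete_flag_value are deleted from the table instead of written. Conceptually, the writer chooses between delete_item and put_item per row; a boto3 sketch of that decision (not the package's internal code, and the sample rows are illustrative):

import boto3

rows = [
    {"id": "id-001", "name": "Alice", "is_deleted": "false"},
    {"id": "id-002", "name": "Bob", "is_deleted": "true"},
]

table = boto3.resource("dynamodb", region_name="us-east-1").Table("users")

# batch_writer buffers requests and flushes them as BatchWriteItem calls
with table.batch_writer() as writer:
    for row in rows:
        if row["is_deleted"] == "true":  # matches delete_flag_value
            # Deletes take only the key attributes (from the table's key schema)
            writer.delete_item(Key={"id": row["id"]})
        else:
            writer.put_item(Item=row)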

Batch Read

from pyspark.sql import SparkSession
from dynamodb_data_source import DynamoDbDataSource

spark = SparkSession.builder.appName("dynamodb").getOrCreate()
spark.dataSource.register(DynamoDbDataSource)

# Read entire table
df = spark.read.format("dynamodb") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .load()

df.show()

Read with Parallel Scan

# Use 4 parallel scan segments for faster reads
df = spark.read.format("dynamodb") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .option("total_segments", "4") \
    .load()
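
The total_segments option maps onto DynamoDB's native parallel-scan parameters, with each Spark partition presumably scanning one segment. A boto3 sketch of what a single partition's work looks like, including pagination via LastEvaluatedKey:

import boto3

def scan_segment(table_name, segment, total_segments, region):
    """Scan one segment of a parallel scan, following pagination to the end."""
    table = boto3.resource("dynamodb", region_name=region).Table(table_name)
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        response = table.scan(**kwargs)
        yield from response["Items"]
        if "LastEvaluatedKey" not in response:
            break  # no more pages in this segment
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

# e.g. the partition responsible for segment 0 of 4
for item in scan_segment("users", 0, 4, "us-east-1"):
    print(item)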

Read with Schema (Column Projection)

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df = spark.read.format("dynamodb") \
    .schema(schema) \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .load()
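
Supplying a schema skips the sampling pass used for inference and projects the result down to the listed columns; attributes not named in the schema are not returned.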

Configuration Options

Connection Options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| table_name | Yes | - | DynamoDB table name |
| aws_region | Yes | - | AWS region (e.g., us-east-1) |
| aws_access_key_id | No | - | AWS access key ID (falls back to the default credential chain if not set) |
| aws_secret_access_key | No | - | AWS secret access key |
| aws_session_token | No | - | AWS session token for temporary credentials |
| endpoint_url | No | - | Custom endpoint URL (e.g., http://localhost:8000 for DynamoDB Local) |
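
For local development, endpoint_url can point at a DynamoDB Local container; DynamoDB Local ignores credential values, so dummy ones are fine:

df = spark.read.format("dynamodb") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .option("endpoint_url", "http://localhost:8000") \
    .option("aws_access_key_id", "dummy") \
    .option("aws_secret_access_key", "dummy") \
    .load()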

Write Options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| delete_flag_column | No | - | Column indicating deletion (must be used with delete_flag_value) |
| delete_flag_value | No | - | Value that triggers deletion (must be used with delete_flag_column) |

Read Options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| total_segments | No | 1 | Number of parallel scan segments |
| consistent_read | No | false | Use strongly consistent reads |
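
For example, to read with strong consistency (DynamoDB charges strongly consistent reads twice the read capacity of eventually consistent ones):

df = spark.read.format("dynamodb") \
    .option("table_name", "users") \
    .option("aws_region", "us-east-1") \
    .option("consistent_read", "true") \
    .load()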

Development

Setup

poetry install

Run Tests

# Unit tests only
poetry run pytest -v -m "not integration"

# Integration tests (requires DynamoDB Local)
cd tests && docker-compose up -d
poetry run pytest -v -m integration
cd tests && docker-compose down

Code Quality

poetry run ruff check src/
poetry run ruff format src/
poetry run mypy src/

Phase Status

Phase 1: Write Operations ✅

  • ✅ Batch writes with batch_writer
  • ✅ Streaming writes
  • ✅ AWS credential management
  • ✅ Primary key validation from table metadata
  • ✅ Float-to-Decimal conversion
  • ✅ Delete flag support

Phase 2: Batch Read Operations ✅

  • ✅ Segment-based parallel scanning
  • ✅ Schema derivation from sampled items
  • ✅ Explicit schema support (column projection)
  • ✅ Consistent read option
  • ✅ Pagination handling

Phase 3: Streaming Reads (Future)

  • DynamoDB Streams integration
  • Change event processing
  • Offset management

License

MIT
