
False AlreadyExists 409 Errors During Parallel Batch Inserts (3.60.0+) #1475

@rbarrette

Description


Summary

The google-cloud-spanner Python client library versions 3.60.0 and 3.61.0 exhibit a critical bug: during parallel batch insert operations, unique primary keys are incorrectly reported as already existing, causing spurious AlreadyExists: 409 errors despite rigorous pre-insert validation confirming all keys are unique.

Environment

  • google-cloud-spanner versions tested:
    • 3.59.0 ✅ (working)
    • 3.60.0 ❌ (bug introduced)
    • 3.61.0 ❌ (bug persists)
  • Python version: 3.10.14
  • Operating System: Docker container (Debian-based, Python 3.10.14)
  • Workload characteristics:
    • 8 parallel workers
    • ~33,378 total rows across all workers
    • ~4,172 rows per worker
    • Batch insert operations using database.batch().insert()
    • UUID column is PRIMARY KEY

Bug Description

Observed Behavior

During parallel batch insert operations, the Spanner client intermittently raises spurious AlreadyExists: 409 errors claiming that primary keys already exist in the table:

google.api_core.exceptions.AlreadyExists: 409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] in table TABLE_NAME already exists

Critical Evidence

Our validation PASSES before Spanner insert:

✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs

This proves:

  1. All 33,378 UUIDs are unique when sent to Spanner
  2. No duplicates exist in the data we're inserting
  3. The AlreadyExists errors are spurious - the duplication must originate in the Spanner client library, not in our data

Regression Test Results

We performed controlled regression testing by building identical Docker images with only the google-cloud-spanner version changed:

Version   AlreadyExists Errors   Total Errors   Result
3.59.0    0                      2              PASSED
3.60.0    4                      13             FAILED
3.61.0    2                      11             FAILED

Conclusion: Bug was introduced between versions 3.59.0 and 3.60.0.

Reproduction Steps

1. Environment Setup

from google.cloud import spanner
from concurrent.futures import ThreadPoolExecutor
import numpy as np  # needed by np.array_split below
import pandas as pd
import uuid

client = spanner.Client(project='your-project')
instance = client.instance('your-instance')
database = instance.database('your-database')

2. UUID Generation with Validation

def generate_dataframe_with_uuids(num_rows):
    """Generate DataFrame with unique UUIDs."""
    df = pd.DataFrame({
        'UUID': [str(uuid.uuid4()) for _ in range(num_rows)],
        'DATA': [f'row_{i}' for i in range(num_rows)],
        # ... other columns
    })

    # Validate uniqueness (THIS PASSES)
    uuids = df['UUID'].tolist()
    unique_uuids = set(uuids)
    assert len(uuids) == len(unique_uuids), "Duplicates detected in generation!"

    return df

3. Parallel Batch Insert

def insert_chunk(chunk_df, table_name):
    """Insert a chunk using batch insert."""
    with database.batch() as batch:
        batch.insert(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )

def parallel_insert(df, table_name, num_workers=8):
    """Perform parallel batch inserts."""
    chunks = np.array_split(df, num_workers)

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(insert_chunk, chunk, table_name)
            for chunk in chunks
        ]
        for future in futures:
            future.result()  # Will raise AlreadyExists with 3.60.0+

4. Trigger the Bug

# Using google-cloud-spanner==3.60.0 or 3.61.0
python parallel_insert_script.py
# Result: Random AlreadyExists: 409 errors

# Using google-cloud-spanner==3.59.0
python parallel_insert_script.py
# Result: Success, no errors

Expected Behavior

All unique UUIDs should insert successfully without AlreadyExists errors. Our validation confirms that:

  1. All UUIDs are generated uniquely
  2. No duplicates exist in the dataset
  3. Each UUID should be inserted exactly once

Actual Behavior (3.60.0+)

  1. UUID generation produces unique values (validation passes)
  2. Parallel batch insert operations randomly fail with false AlreadyExists: 409 errors
  3. Error claims UUID already exists in table, but validation proves it doesn't
  4. Hypothesis: Spanner client is either:
    • Incorrectly sending duplicate insert mutations for the same row
    • Mishandling transaction/batch boundaries in parallel operations
    • Retrying failed operations without proper deduplication
    • Double-processing mutations due to issues with mutation buffering in parallel contexts
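As a debugging aid (not part of the original report), the offending key can be extracted from the error text with a small stdlib helper, making it easy to cross-check against the validated UUID set. The regex below assumes the exact error format shown in the logs:

```python
import re

# Hypothetical helper: extract the row key and table name from a Spanner
# AlreadyExists message of the form
# "409 Row [<key>] in table <name> already exists".
_ALREADY_EXISTS_RE = re.compile(r"Row \[([^\]]+)\] in table (\S+) already exists")

def parse_already_exists(message):
    """Return (row_key, table_name), or None if the message doesn't match."""
    m = _ALREADY_EXISTS_RE.search(message)
    return (m.group(1), m.group(2)) if m else None

msg = ("409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] "
       "in table TABLE_NAME already exists")
key, table = parse_already_exists(msg)
# key  -> '5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8'
# table -> 'TABLE_NAME'
```

Checking whether the extracted key is a member of the pre-validated UUID set distinguishes "client re-sent one of our rows" from "someone else wrote this key".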

Code Excerpts

Our UUID Validation (Always Passes)

From app/src/app_common/gcp/spanner.py:

def df_batch_insert(self, name, df):
    """Insert DataFrame with UUID validation."""
    logger.debug(f"df_batch_insert: {name=}")

    # Validate no duplicate UUIDs in this batch
    if 'UUID' in df.columns:
        uuids = df['UUID'].tolist()
        unique_uuids = set(uuids)
        if len(uuids) != len(unique_uuids):
            duplicates = [u for u in unique_uuids if uuids.count(u) > 1]
            error_msg = f'DUPLICATE UUIDs DETECTED IN BATCH! Duplicates: {duplicates[:5]}'
            get_run_logger().error(error_msg)
            raise ValueError(error_msg)

        # Log batch identity for debugging
        get_run_logger().debug(
            f"Batch for {name}: {len(df)} rows, "
            f"first UUID: {uuids[0]}, last UUID: {uuids[-1]}"
        )

    with self.database.batch() as batch:
        batch.insert(
            name,
            columns=df.columns.tolist(),
            values=df.values.tolist()
        )

Cross-Worker Validation (Also Passes)

# Validate chunks have no overlapping UUIDs across workers
if 'UUID' in df.columns:
    all_uuids = []
    for worker_idx, worker_chunks in enumerate(chunks):
        for chunk_idx, chunk in enumerate(worker_chunks):
            chunk_uuids = chunk['UUID'].tolist()
            all_uuids.extend(chunk_uuids)

    unique_count = len(set(all_uuids))
    total_count = len(all_uuids)

    if unique_count != total_count:
        duplicates = [u for u in set(all_uuids) if all_uuids.count(u) > 1]
        error_msg = f'DUPLICATE UUIDs DETECTED ACROSS CHUNKS! Duplicates: {duplicates[:5]}'
        get_run_logger().error(error_msg)
        raise ValueError(error_msg)

    get_run_logger().info(
        f"✓ Chunk validation passed: {len(chunks)} workers, "
        f"{sum(len(w) for w in chunks)} total chunks, "
        f"{unique_count} unique UUIDs"
    )

Log Evidence

With google-cloud-spanner==3.59.0 (Working)

✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
Batch for TABLE_NAME: 4172 rows, first UUID: 5b45e22f-..., last UUID: 8a3f...
Batch for TABLE_NAME: 4173 rows, first UUID: 9c2d..., last UUID: 7b1e...
...
[All 8 workers complete successfully]
AlreadyExists Errors: 0
Result: ✅ PASSED

With google-cloud-spanner==3.60.0 (Buggy)

✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
Batch for TABLE_NAME: 4172 rows, first UUID: 5b45e22f-..., last UUID: 8a3f...
...
ERROR - 409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] in table TABLE_NAME already exists
ERROR - google.api_core.exceptions.AlreadyExists: 409 Row [5b45e22f-...] in table already exists
AlreadyExists Errors: 4
Result: ❌ FAILED

Key Observation: The UUID 5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8 appears in the error despite:

  1. Being validated as unique before insert
  2. Only being generated once in our code
  3. Being part of a single worker's batch
  4. Never being sent to Spanner more than once by our code

Additional Context

Our Testing Framework

We've developed a comprehensive regression testing framework to validate Spanner versions:

  • Automated Docker builds with specific Spanner versions
  • Controlled test execution with identical datasets
  • Detailed logging and error analysis

Workaround

# In requirements.txt
google-cloud-spanner==3.59.0  # Pin to last known good version

Request to Google Team

  1. Investigate changes between 3.59.0 and 3.60.0 related to:

    • Batch insert implementation
    • Mutation handling in concurrent contexts
    • Transaction boundary management
    • Retry/idempotency logic
  2. Provide guidance on:

    • Recommended patterns for parallel batch inserts
    • Best practices for concurrent Spanner operations
    • Whether this is a known issue with a fix in progress
  3. Timeline for fix in upcoming releases

Version Information

>>> import google.cloud.spanner
>>> google.cloud.spanner.__version__
'3.60.0'  # or '3.61.0' - both exhibit the bug

>>> import sys
>>> sys.version
'3.10.14 (main, ...) [GCC 12.2.0]'

Note: We've validated this extensively through automated regression testing and are confident this is a client library bug producing false AlreadyExists errors. Our pre-insert validation conclusively proves all primary keys are unique when sent to Spanner, yet the client reports them as duplicates. This is not a user code issue.

Metadata

Labels: api: spanner (Issues related to the googleapis/python-spanner API.)