feat: add pandas DataFrame support to bulk_insert method #183
Conversation
- Add DataFrame type detection in bulk_insert method
- Implement _dataframe_to_models helper method for conversion
- Handle pandas as optional dependency with proper error messages
- Validate DataFrame columns against table schema
- Support NaN value handling and required field validation
- Preserve existing auto-embedding functionality
- Maintain backward compatibility with List[Union[T, dict]]

Resolves issue #112

Co-authored-by: Mini256 <Mini256@users.noreply.github.com>
@pantheon-bot review this pr
Review Complete — Findings: 5 issues
```python
def bulk_insert(self, data: List[Union[T, dict]]) -> List[T]:
    if not isinstance(data, list):
def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:
```
[P1] Ruff lint fails (F821: undefined name `pd`)
Why: Blocks merge because CI runs `make lint` / `ruff check`, which raises F821 on the forward-reference annotation `"pd.DataFrame"` when there is no `import pandas as pd` in scope. This is a CI/dev-workflow break, not a production outage.
Evidence: `def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:` at pytidb/table.py:325
Suggested fix: Add `if TYPE_CHECKING: import pandas as pd` so `pd` is available for annotations without adding a runtime dependency.
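A minimal sketch of the `TYPE_CHECKING` pattern the fix describes (the class body and the duck-typed conversion step are hypothetical simplifications, not pytidb's actual implementation):

```python
from typing import TYPE_CHECKING, List, TypeVar, Union

if TYPE_CHECKING:
    # Only evaluated by static analyzers (ruff, mypy); pandas stays an
    # optional runtime dependency and F821 no longer fires.
    import pandas as pd

T = TypeVar("T")


class Table:
    def bulk_insert(
        self, data: "Union[List[Union[T, dict]], pd.DataFrame]"
    ) -> List[T]:
        # Duck-typed detection: no pandas import needed for list input.
        if hasattr(data, "to_dict") and hasattr(data, "iterrows"):
            data = data.to_dict("records")  # hypothetical conversion step
        return list(data)
```

Because the annotation is a string, it is never evaluated at runtime, so the code works even when pandas is not installed.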
```python
for col_name, value in row.items():
    if col_name in table_fields:
        # Handle NaN values
        if pd.isna(value):
```
[P1] DataFrame bulk_insert crashes on list/ndarray cell values
Why: Breaks the new feature for common DataFrame contents (e.g., vector embeddings). `pd.isna(value)` returns an element-wise boolean array for list/ndarray values, so `if pd.isna(value):` raises `ValueError` at runtime. This is realistically reachable for vector workloads, since the codebase models vector values as `List[float]`/`np.ndarray`.
Evidence: `if pd.isna(value):` at pytidb/table.py:441 in `_dataframe_to_models()`
Suggested fix: Use a scalar-safe NaN check that skips array-like values, e.g., `if not isinstance(value, (list, np.ndarray)) and pd.isna(value):`
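The suggested guard can be factored into a small helper; a sketch (the `is_missing` name is an assumption, not an existing pytidb helper):

```python
import numpy as np
import pandas as pd


def is_missing(value) -> bool:
    # pd.isna on a list/ndarray returns an element-wise boolean array,
    # which raises ValueError when used in an `if` statement. Treat
    # array-like cells (e.g. vector embeddings) as present and only
    # NaN-check scalar values.
    if isinstance(value, (list, np.ndarray)):
        return False
    return bool(pd.isna(value))
```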
```python
required_fields = set()
if hasattr(self._table_model, '__table__'):
    for col in self._table_model.__table__.columns:
        if not col.nullable and not col.autoincrement and col.default is None and col.server_default is None:
```
[P2] Ruff format check fails
Why: CI runs `ruff format --check`, and pytidb/table.py would be reformatted (whitespace/quote normalization/line wrapping). This blocks merge until formatting is applied, but there is an immediate, low-risk workaround (run the formatter) with no runtime/production impact.
Evidence: `if not col.nullable and not col.autoincrement and col.default is None and col.server_default is None:` at pytidb/table.py:426
Suggested fix: Run `ruff format pytidb/table.py` to auto-fix the formatting issues.
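For reference, the flagged over-long condition after wrapping, shown on a stand-in column object (a sketch of the expected style; ruff's actual output may differ slightly):

```python
from types import SimpleNamespace

# Stand-in for a SQLAlchemy Column; attribute names match the flagged code.
col = SimpleNamespace(
    name="id", nullable=False, autoincrement=False, default=None, server_default=None
)
required_fields = set()

# The same condition, wrapped onto multiple lines as a formatter would emit it:
if (
    not col.nullable
    and not col.autoincrement
    and col.default is None
    and col.server_default is None
):
    required_fields.add(col.name)
```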
```python
# Convert DataFrame rows to TableModel instances
models = []
for row_idx, row in df.iterrows():
```
[P2] Inefficient DataFrame conversion for bulk_insert
Why: Could be a significant performance bottleneck for large DataFrames. The code iterates row by row via `iterrows()`, a known slow path that also constructs a `Series` per row. This adds avoidable overhead before the DB insert even starts.
Evidence: `for row_idx, row in df.iterrows():` at pytidb/table.py:435
Suggested fix: Consider using `df.to_dict('records')` for faster conversion, or `df.itertuples()` if row-by-row processing is needed.
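A small sketch of the two conversion paths, showing they produce the same records (illustrative data, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Slow path: iterrows() constructs a pandas Series for every row.
slow = [row.to_dict() for _, row in df.iterrows()]

# Fast path: to_dict("records") converts the whole frame in one pass.
fast = df.to_dict("records")

assert slow == fast  # same records, far less per-row overhead
```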
```python
if not isinstance(data, list):
def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:
    # Handle pandas DataFrame input
    if hasattr(data, 'to_dict') and hasattr(data, 'iterrows'):
```
[P2] No tests cover the new DataFrame input path
Why: Increases the risk of regressions shipping undetected. The DataFrame branch and the `_dataframe_to_models()` conversion logic are non-trivial but completely unexercised by the test suite; existing tests only call `bulk_insert` with Python lists.
Evidence: `if hasattr(data, 'to_dict') and hasattr(data, 'iterrows'):` at pytidb/table.py:327
Suggested fix: Add test cases that pass pandas DataFrames to `bulk_insert()`, covering both the happy path and edge cases (NaN values, vector columns, validation errors).
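A sketch of what such tests could look like. `FakeTable` is a hypothetical in-memory stand-in so the example is self-contained; real tests would exercise pytidb's actual `Table` through the project's fixtures:

```python
import math

import pandas as pd


class FakeTable:
    """Hypothetical stand-in for pytidb's Table (illustration only)."""

    def __init__(self, fields):
        self.fields = set(fields)

    def bulk_insert(self, df: pd.DataFrame):
        records = df.to_dict("records")
        for rec in records:
            for key, value in rec.items():
                if key not in self.fields:
                    raise ValueError(f"Unexpected column: {key}")
                if isinstance(value, float) and math.isnan(value):
                    rec[key] = None  # normalize NaN to None
        return records


def test_dataframe_happy_path():
    table = FakeTable({"id", "name"})
    rows = table.bulk_insert(pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
    assert len(rows) == 2


def test_dataframe_nan_becomes_none():
    table = FakeTable({"id", "name"})
    rows = table.bulk_insert(pd.DataFrame({"id": [1], "name": [float("nan")]}))
    assert rows[0]["name"] is None
```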
```python
# Create dict with only valid fields
model_data = {}
for col_name, value in row.items():
    if col_name in table_fields:
```
[P2] DataFrame bulk_insert silently drops unexpected/typo columns
Why: Extra DataFrame columns that aren't model fields are silently ignored, with no validation or warning. This can hide typos or schema mismatches, causing user-provided data to be lost without feedback. Missing required columns are caught, but unexpected columns are dropped silently. This is a data-integrity concern, though limited to the optional pandas ingestion path.
Evidence: `if col_name in table_fields:` at pytidb/table.py:439 only processes known fields, with no error path for extra columns
Suggested fix: Add strict-mode validation to detect extra columns, e.g. `extra_cols = set(df.columns) - table_fields; if extra_cols: raise ValueError(f'Unexpected columns: {extra_cols}')`, or add an explicit `strict=False` parameter for opt-out.
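The suggested validation as a standalone helper; a sketch (the `validate_columns` name and `strict` parameter are assumptions, not existing pytidb API):

```python
import pandas as pd


def validate_columns(df: pd.DataFrame, table_fields: set, strict: bool = True):
    # Surface typos/schema mismatches early instead of silently dropping data.
    extra_cols = set(df.columns) - table_fields
    if extra_cols and strict:
        raise ValueError(f"Unexpected columns: {sorted(extra_cols)}")
    return extra_cols
```

With `strict=False`, callers can opt out and receive the extra column names instead of an exception.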