
feat: add pandas DataFrame support to bulk_insert method #183

Draft: Mini256 wants to merge 1 commit into main from claude/issue-112-20250826-1520

Conversation

@Mini256 (Member) commented on Aug 26, 2025

  • Add DataFrame type detection in bulk_insert method
  • Implement _dataframe_to_models helper method for conversion
  • Handle pandas as optional dependency with proper error messages
  • Validate DataFrame columns against table schema
  • Support NaN value handling and required field validation
  • Preserve existing auto-embedding functionality
  • Maintain backward compatibility with List[Union[T, dict]]

Resolves issue #112


Co-authored-by: Mini256 <Mini256@users.noreply.github.com>
@pingcap-cla-assistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@Mini256 (Member, Author) commented on Nov 11, 2025

@pantheon-bot

4 similar comments by @Mini256 on Nov 12, 2025, each reading: @pantheon-bot

@Mini256 (Member, Author) commented on Mar 17, 2026

@pantheon-bot review this pr

@pantheon-ai (bot) commented on Mar 17, 2026

Review Complete

Findings: 5 issues
Posted: 5
Duplicates/Skipped: 0


Comment thread pytidb/table.py

def bulk_insert(self, data: List[Union[T, dict]]) -> List[T]:
if not isinstance(data, list):
def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:

[P1] Ruff lint fails (F821 undefined pd)

Why: Blocks merge because CI runs make lint/ruff check and errors on forward-ref annotations using pd without any import pandas as pd. This is a CI/dev-workflow break (not a production outage).

Evidence: def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]: at pytidb/table.py:325

Suggested fix: Add if TYPE_CHECKING: import pandas as pd to make pd available for annotations without adding a runtime dependency.
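The `TYPE_CHECKING` pattern the reviewer suggests can be sketched as follows (a minimal standalone sketch, not the pytidb source):

```python
from typing import TYPE_CHECKING, List, TypeVar, Union

if TYPE_CHECKING:
    # Seen only by static analyzers (mypy, ruff); pandas is never
    # imported at runtime, so it remains an optional dependency.
    import pandas as pd

T = TypeVar("T")


def bulk_insert(data: "Union[List[Union[T, dict]], pd.DataFrame]") -> List[dict]:
    # The string (forward-reference) annotation keeps the signature
    # importable even when pandas is not installed, while satisfying
    # ruff's F821 check because pd is defined under TYPE_CHECKING.
    return []
```

Because the annotation is a string, the module imports cleanly with or without pandas present.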

Comment thread pytidb/table.py
for col_name, value in row.items():
if col_name in table_fields:
# Handle NaN values
if pd.isna(value):

[P1] DataFrame bulk_insert crashes on list/ndarray cell values

Why: Breaks the new feature for common DataFrame contents (e.g., vector embeddings). pd.isna(value) returns a boolean array for list/ndarray values, and if pd.isna(value): raises ValueError at runtime. This is realistically reachable for vector workloads since the codebase models vector values as List[float]/np.ndarray.

Evidence: if pd.isna(value): at pytidb/table.py:441 in _dataframe_to_models()

Suggested fix: Use scalar-safe NaN checking, e.g., if isinstance(value, (list, np.ndarray)) or not pd.isna(value):
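A scalar-safe missing-value check along the lines of the suggested fix might look like this (a sketch assuming numpy and pandas; `is_missing_scalar` is a hypothetical helper name, not a pytidb function):

```python
import numpy as np
import pandas as pd


def is_missing_scalar(value) -> bool:
    """Return True only for scalar missing values (None/NaN/NaT).

    Lists and ndarrays (e.g. vector embeddings) are never treated as
    missing: pd.isna() returns an elementwise boolean array for them,
    so a bare `if pd.isna(value):` would raise ValueError.
    """
    if isinstance(value, (list, np.ndarray)):
        return False
    return bool(pd.isna(value))
```

With this helper, vector columns pass through untouched while genuine NaN cells can still be mapped to None before model construction.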

Comment thread pytidb/table.py
required_fields = set()
if hasattr(self._table_model, '__table__'):
for col in self._table_model.__table__.columns:
if not col.nullable and not col.autoincrement and col.default is None and col.server_default is None:

[P2] Ruff format check fails

Why: CI runs ruff format --check and pytidb/table.py would be reformatted (whitespace/quote normalization/line wrapping). This blocks merge until formatting is applied, but has immediate low-risk workaround (run formatter) with no runtime/production impact.

Evidence: if not col.nullable and not col.autoincrement and col.default is None and col.server_default is None: at pytidb/table.py:426

Suggested fix: Run ruff format pytidb/table.py to auto-fix formatting issues.

Comment thread pytidb/table.py

# Convert DataFrame rows to TableModel instances
models = []
for row_idx, row in df.iterrows():

[P2] Inefficient DataFrame conversion for bulk_insert

Why: Could be a significant performance bottleneck for large DataFrames. The code iterates row-by-row via iterrows(), which is a known slow path. This adds avoidable overhead before the DB insert even starts.

Evidence: for row_idx, row in df.iterrows(): at pytidb/table.py:435

Suggested fix: Consider using df.to_dict('records') for faster conversion, or df.itertuples() if row-by-row processing is needed.
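The `to_dict('records')` alternative can be sketched as follows (hypothetical helper, not the pytidb implementation):

```python
import pandas as pd


def dataframe_to_records(df: pd.DataFrame) -> list:
    # One bulk conversion instead of materializing a pandas Series per
    # row, which is what iterrows() does and why it is a slow path.
    return df.to_dict("records")


df = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})
records = dataframe_to_records(df)
```

The resulting list of dicts can then feed the existing `List[Union[T, dict]]` code path with no per-row Series overhead.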

Comment thread pytidb/table.py
if not isinstance(data, list):
def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:
# Handle pandas DataFrame input
if hasattr(data, 'to_dict') and hasattr(data, 'iterrows'):

[P2] No tests cover the new DataFrame input path

Why: Increases risk of regressions/bugs shipping undetected. The DataFrame branch and _dataframe_to_models() conversion logic are non-trivial but completely unexercised by the test suite. Tests only call bulk_insert with Python lists.

Evidence: if hasattr(data, 'to_dict') and hasattr(data, 'iterrows'): at pytidb/table.py:327

Suggested fix: Add test cases that pass pandas DataFrames to bulk_insert(), covering both happy path and edge cases (NaN values, vector columns, validation errors).
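Such tests could be sketched roughly as follows. This is a self-contained example: `rows_from_dataframe` stands in for the real `_dataframe_to_models()` conversion, and all names here are hypothetical.

```python
import math

import pandas as pd


def rows_from_dataframe(df: pd.DataFrame, table_fields: set) -> list:
    """Stand-in for _dataframe_to_models: keep known columns, map
    scalar NaN cells to None."""
    records = []
    for rec in df.to_dict("records"):
        cleaned = {}
        for col, value in rec.items():
            if col not in table_fields:
                continue
            if isinstance(value, float) and math.isnan(value):
                value = None
            cleaned[col] = value
        records.append(cleaned)
    return records


def test_dataframe_happy_path():
    df = pd.DataFrame({"id": [1], "text": ["hello"], "typo_col": ["x"]})
    assert rows_from_dataframe(df, {"id", "text"}) == [{"id": 1, "text": "hello"}]


def test_nan_becomes_none():
    df = pd.DataFrame({"id": [1], "text": [float("nan")]})
    assert rows_from_dataframe(df, {"id", "text"}) == [{"id": 1, "text": None}]
```

Real tests would instead construct a table via the pytidb API and call `bulk_insert()` with a DataFrame, covering vector columns and validation errors as the reviewer suggests.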

Comment thread pytidb/table.py
# Create dict with only valid fields
model_data = {}
for col_name, value in row.items():
if col_name in table_fields:

[P2] DataFrame bulk_insert silently drops unexpected/typo columns

Why: Extra DataFrame columns that aren't model fields are silently ignored with no validation or warning. This can hide typos or schema mismatches, causing user-provided data to be lost without feedback. While missing required columns are caught, unexpected columns are dropped silently. This is a data integrity concern, though limited to the optional pandas ingestion path.

Evidence: if col_name in table_fields: at pytidb/table.py:439 only processes known fields, with no error path for extra columns

Suggested fix: Add strict mode validation to detect extra columns: extra_cols = set(df.columns) - table_fields; if extra_cols: raise ValueError(f'Unexpected columns: {extra_cols}') or add an explicit strict=False parameter for opt-out.
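The strict-mode column check could be sketched like this (`strict` is a hypothetical parameter name, not part of the pytidb API):

```python
import pandas as pd


def validate_columns(df: pd.DataFrame, table_fields: set, strict: bool = True) -> None:
    """Raise on DataFrame columns that match no model field.

    strict=False skips the check, restoring the silent-drop behavior
    for callers who explicitly opt out.
    """
    if not strict:
        return
    extra_cols = set(df.columns) - table_fields
    if extra_cols:
        raise ValueError(f"Unexpected columns not in table schema: {sorted(extra_cols)}")
```

Raising by default surfaces typos such as `nmae` instead of `name` at insert time, rather than losing that column's data without feedback.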
