feat: add pandas DataFrame support to bulk_insert method #183
Conversation
- Add DataFrame type detection in bulk_insert method
- Implement _dataframe_to_models helper method for conversion
- Handle pandas as optional dependency with proper error messages
- Validate DataFrame columns against table schema
- Support NaN value handling and required field validation
- Preserve existing auto-embedding functionality
- Maintain backward compatibility with List[Union[T, dict]]

Resolves issue #112

Co-authored-by: Mini256 <Mini256@users.noreply.github.com>
@pantheon-bot review this pr
Review Complete — Findings: 5 issues
```python
def bulk_insert(self, data: List[Union[T, dict]]) -> List[T]:
    if not isinstance(data, list):
def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:
```
[P1] Ruff lint fails (F821: undefined name `pd`)
Why: Blocks merge because CI runs `make lint` / `ruff check`, which raises F821 on the forward-reference annotation `"pd.DataFrame"` when there is no `import pandas as pd` in scope. This is a CI/dev-workflow break, not a production outage.
Evidence: `def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:` at pytidb/table.py:325
Suggested fix: Add `if TYPE_CHECKING: import pandas as pd` so `pd` is available for annotations without adding a runtime dependency.
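A minimal sketch of the `TYPE_CHECKING` pattern the fix describes (the class body and the duck-typed conversion step are hypothetical simplifications, not pytidb's actual implementation):

```python
from typing import TYPE_CHECKING, List, TypeVar, Union

if TYPE_CHECKING:
    # Only evaluated by static analyzers (ruff, mypy); pandas stays an
    # optional runtime dependency and F821 no longer fires.
    import pandas as pd

T = TypeVar("T")


class Table:
    def bulk_insert(
        self, data: "Union[List[Union[T, dict]], pd.DataFrame]"
    ) -> List[T]:
        # Duck-typed detection: no pandas import needed for list input.
        if hasattr(data, "to_dict") and hasattr(data, "iterrows"):
            data = data.to_dict("records")  # hypothetical conversion step
        return list(data)
```

Because the annotation is a string, it is never evaluated at runtime, so the code works even when pandas is not installed.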
```python
for col_name, value in row.items():
    if col_name in table_fields:
        # Handle NaN values
        if pd.isna(value):
```
[P1] DataFrame bulk_insert crashes on list/ndarray cell values
Why: Breaks the new feature for common DataFrame contents (e.g., vector embeddings). `pd.isna(value)` returns an element-wise boolean array for list/ndarray values, so `if pd.isna(value):` raises `ValueError` at runtime. This is realistically reachable for vector workloads, since the codebase models vector values as `List[float]`/`np.ndarray`.
Evidence: `if pd.isna(value):` at pytidb/table.py:441 in `_dataframe_to_models()`
Suggested fix: Use a scalar-safe NaN check that skips array-like values, e.g., `if not isinstance(value, (list, np.ndarray)) and pd.isna(value):`
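The suggested guard can be factored into a small helper; a sketch (the `is_missing` name is an assumption, not an existing pytidb helper):

```python
import numpy as np
import pandas as pd


def is_missing(value) -> bool:
    # pd.isna on a list/ndarray returns an element-wise boolean array,
    # which raises ValueError when used in an `if` statement. Treat
    # array-like cells (e.g. vector embeddings) as present and only
    # NaN-check scalar values.
    if isinstance(value, (list, np.ndarray)):
        return False
    return bool(pd.isna(value))
```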
```python
required_fields = set()
if hasattr(self._table_model, '__table__'):
    for col in self._table_model.__table__.columns:
        if not col.nullable and not col.autoincrement and col.default is None and col.server_default is None:
```
[P2] Ruff format check fails
Why: CI runs `ruff format --check`, and pytidb/table.py would be reformatted (whitespace/quote normalization/line wrapping). This blocks merge until formatting is applied, but there is an immediate, low-risk workaround (run the formatter) with no runtime/production impact.
Evidence: `if not col.nullable and not col.autoincrement and col.default is None and col.server_default is None:` at pytidb/table.py:426
Suggested fix: Run `ruff format pytidb/table.py` to auto-fix the formatting issues.
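For reference, the flagged over-long condition after wrapping, shown on a stand-in column object (a sketch of the expected style; ruff's actual output may differ slightly):

```python
from types import SimpleNamespace

# Stand-in for a SQLAlchemy Column; attribute names match the flagged code.
col = SimpleNamespace(
    name="id", nullable=False, autoincrement=False, default=None, server_default=None
)
required_fields = set()

# The same condition, wrapped onto multiple lines as a formatter would emit it:
if (
    not col.nullable
    and not col.autoincrement
    and col.default is None
    and col.server_default is None
):
    required_fields.add(col.name)
```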
```python
# Convert DataFrame rows to TableModel instances
models = []
for row_idx, row in df.iterrows():
```
[P2] Inefficient DataFrame conversion for bulk_insert
Why: Could be a significant performance bottleneck for large DataFrames. The code iterates row by row via `iterrows()`, a known slow path that also constructs a `Series` per row. This adds avoidable overhead before the DB insert even starts.
Evidence: `for row_idx, row in df.iterrows():` at pytidb/table.py:435
Suggested fix: Consider using `df.to_dict('records')` for faster conversion, or `df.itertuples()` if row-by-row processing is needed.
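A small sketch of the two conversion paths, showing they produce the same records (illustrative data, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Slow path: iterrows() constructs a pandas Series for every row.
slow = [row.to_dict() for _, row in df.iterrows()]

# Fast path: to_dict("records") converts the whole frame in one pass.
fast = df.to_dict("records")

assert slow == fast  # same records, far less per-row overhead
```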
```python
if not isinstance(data, list):
def bulk_insert(self, data: Union[List[Union[T, dict]], "pd.DataFrame"]) -> List[T]:
    # Handle pandas DataFrame input
    if hasattr(data, 'to_dict') and hasattr(data, 'iterrows'):
```
[P2] No tests cover the new DataFrame input path
Why: Increases the risk of regressions shipping undetected. The DataFrame branch and the `_dataframe_to_models()` conversion logic are non-trivial but completely unexercised by the test suite; existing tests only call `bulk_insert` with Python lists.
Evidence: `if hasattr(data, 'to_dict') and hasattr(data, 'iterrows'):` at pytidb/table.py:327
Suggested fix: Add test cases that pass pandas DataFrames to `bulk_insert()`, covering both the happy path and edge cases (NaN values, vector columns, validation errors).
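A sketch of what such tests could look like. `FakeTable` is a hypothetical in-memory stand-in so the example is self-contained; real tests would exercise pytidb's actual `Table` through the project's fixtures:

```python
import math

import pandas as pd


class FakeTable:
    """Hypothetical stand-in for pytidb's Table (illustration only)."""

    def __init__(self, fields):
        self.fields = set(fields)

    def bulk_insert(self, df: pd.DataFrame):
        records = df.to_dict("records")
        for rec in records:
            for key, value in rec.items():
                if key not in self.fields:
                    raise ValueError(f"Unexpected column: {key}")
                if isinstance(value, float) and math.isnan(value):
                    rec[key] = None  # normalize NaN to None
        return records


def test_dataframe_happy_path():
    table = FakeTable({"id", "name"})
    rows = table.bulk_insert(pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
    assert len(rows) == 2


def test_dataframe_nan_becomes_none():
    table = FakeTable({"id", "name"})
    rows = table.bulk_insert(pd.DataFrame({"id": [1], "name": [float("nan")]}))
    assert rows[0]["name"] is None
```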
```python
# Create dict with only valid fields
model_data = {}
for col_name, value in row.items():
    if col_name in table_fields:
```
[P2] DataFrame bulk_insert silently drops unexpected/typo columns
Why: Extra DataFrame columns that aren't model fields are silently ignored, with no validation or warning. This can hide typos or schema mismatches, causing user-provided data to be lost without feedback. Missing required columns are caught, but unexpected columns are dropped silently. This is a data-integrity concern, though limited to the optional pandas ingestion path.
Evidence: `if col_name in table_fields:` at pytidb/table.py:439 only processes known fields, with no error path for extra columns
Suggested fix: Add strict-mode validation to detect extra columns, e.g. `extra_cols = set(df.columns) - table_fields; if extra_cols: raise ValueError(f'Unexpected columns: {extra_cols}')`, or add an explicit `strict=False` parameter for opt-out.
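The suggested validation as a standalone helper; a sketch (the `validate_columns` name and `strict` parameter are assumptions, not existing pytidb API):

```python
import pandas as pd


def validate_columns(df: pd.DataFrame, table_fields: set, strict: bool = True):
    # Surface typos/schema mismatches early instead of silently dropping data.
    extra_cols = set(df.columns) - table_fields
    if extra_cols and strict:
        raise ValueError(f"Unexpected columns: {sorted(extra_cols)}")
    return extra_cols
```

With `strict=False`, callers can opt out and receive the extra column names instead of an exception.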