
Commit 6765063

codewizdave and claude committed
fix: resolve all ruff linting errors
Fixed 304 ruff linting errors:
- Auto-fixed 276 errors with ruff --fix
- Manually fixed 27 remaining errors
- Formatted codebase with ruff format

Manual fixes:
- Added missing 'Any' import in fp/_maybe.py
- Added TYPE_CHECKING import for Maybe in fp/result.py
- Removed unused variables in commands (group, rename, search)
- Removed unused variable in operations/comparing.py
- Replaced lambda assignments with def functions in property tests
- Replaced unused variables with _ in test files

88 files reformatted, 20 files left unchanged. All ruff checks now pass successfully.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent a364d02 commit 6765063

121 files changed

Lines changed: 9424 additions & 4118 deletions


.claude/settings.local.json

Lines changed: 6 additions & 1 deletion

```diff
@@ -8,7 +8,12 @@
       "Bash(git commit:*)",
       "Bash(python:*)",
       "Bash(uv build:*)",
-      "Bash(wc:*)"
+      "Bash(wc:*)",
+      "Bash(pip:*)",
+      "Bash(gh issue create:*)",
+      "Bash(uv run mypy:*)",
+      "Bash(uv run ruff check:*)",
+      "Bash(uv run ruff format:*)"
     ]
   }
 }
```
Lines changed: 72 additions & 0 deletions

# Title: Implement streaming/chunking for file loading to handle large files (50k-500k rows)

## Problem Description

The current file loading implementation in `excel_toolkit/core/file_handlers.py` loads entire files into memory at once, without any streaming or chunking capabilities. This creates significant memory issues when processing large Excel/CSV files with 50k-500k rows.

### Current Behavior

In `file_handlers.py`:

- Line 110: `pd.read_excel()` loads the entire Excel file into memory
- Line 149: `read_all_sheets()` loads ALL sheets simultaneously
- Line 313: `pd.read_csv()` loads the entire CSV file into memory

### Memory Impact

For a 500k-row file with 20 columns:

- **Memory usage**: 500MB-2GB depending on data types
- **Load time**: 10-30 seconds
- The file is completely loaded before any operation can begin
- No way to process data incrementally

### Real-World Impact

When processing a 500MB Excel file:

- File size on disk: 500MB
- Memory usage after loading: 2-4GB (pandas overhead)
- For multi-sheet files, memory usage multiplies by the number of sheets
- Systems with 8GB RAM can crash or experience severe swapping
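
The overhead is easy to verify on a real file (a quick sketch; the path and printed figures are illustrative, not measurements from this codebase):

```python
import os

import pandas as pd

path = "big_export.xlsx"  # hypothetical large input
disk_mb = os.path.getsize(path) / (1024 * 1024)

df = pd.read_excel(path)
mem_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)

print(f"disk: {disk_mb:.0f}MB, memory: {mem_mb:.0f}MB ({mem_mb / disk_mb:.1f}x)")
```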

## Affected Files

- `excel_toolkit/core/file_handlers.py` (lines 110, 149, 313)
- `excel_toolkit/core/const.py` (file size limits)

## Proposed Solution

Implement chunked reading for large files. Note that `pd.read_excel()` does not accept a `chunksize` argument, so chunked reads only work for CSV input; the Excel side needs a streaming reader (see the openpyxl sketch below):

```python
import pandas as pd

# Read the CSV in 50k-row chunks instead of all at once
for chunk in pd.read_csv(file_path, chunksize=50_000):
    process_chunk(chunk)  # per-chunk handler
    # Write results incrementally
```
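
For Excel files, openpyxl's read-only mode can stream rows without building the whole workbook in memory (a sketch; `process_rows` is a hypothetical per-batch handler):

```python
from openpyxl import load_workbook

# read_only=True streams rows from disk instead of loading the workbook
wb = load_workbook("large.xlsx", read_only=True)
ws = wb.active

batch = []
for row in ws.iter_rows(values_only=True):
    batch.append(row)
    if len(batch) == 50_000:
        process_rows(batch)  # hypothetical per-batch handler
        batch.clear()
if batch:
    process_rows(batch)
wb.close()
```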

Benefits:

- Process files without loading everything into memory
- Handle files larger than available RAM
- Enable incremental processing and writing
- Better user feedback during long operations

## Alternative Approaches

1. Use the `dtype` parameter to optimize memory during loading
2. Implement lazy loading with Polars instead of pandas (see the sketch below)
3. Use Dask for out-of-core computation
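
For option 2, a minimal sketch of what Polars' lazy API buys (the file and column names are hypothetical; nothing is read until `collect()`):

```python
import polars as pl

# scan_csv builds a lazy query plan; filters and projections are pushed
# down into the reader, so the full file is never materialized
result = (
    pl.scan_csv("transactions.csv")   # hypothetical input
    .filter(pl.col("amount") > 0)     # hypothetical column
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect()
)
```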

## Additional Context

This is especially critical for:

- **Merge operations**: loading multiple large files simultaneously
- **Multi-sheet Excel files**: all sheets loaded at once
- **Servers/constrained environments**: limited memory available

The current `MAX_FILE_SIZE_MB = 500` limit is misleading because a 500MB file on disk can easily consume 2-4GB in memory.

## Related Issues

- File size limits too permissive (#002)
- Memory monitoring needed (#006)
- Merge operations memory issues (#003)
Lines changed: 86 additions & 0 deletions

# Title: File size limits are too permissive and can cause system crashes

## Problem Description

The current file size limits in `excel_toolkit/core/const.py` allow files that, when loaded into memory, can overwhelm typical systems and cause crashes or severe performance degradation.

### Current Limits

In `const.py` (lines 26-27):

```python
MAX_FILE_SIZE_MB = 500
WARNING_FILE_SIZE_MB = 100
```

### The Problem

These limits refer to **file size on disk**, not memory usage. When pandas loads an Excel/CSV file, memory usage is typically **2-4x the file size** due to:

- Pandas DataFrame overhead
- Python object overhead
- String data expansion (demonstrated below)
- Index creation
- Type conversions
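
The string-expansion point in particular is easy to demonstrate (an illustrative sketch; the column contents are made up and exact sizes will vary):

```python
import pandas as pd

# A low-cardinality text column: object dtype stores one Python str per
# row, while category stores small integer codes plus three labels
status = pd.Series(["approved", "rejected", "pending"] * 100_000)

object_bytes = status.memory_usage(deep=True)
category_bytes = status.astype("category").memory_usage(deep=True)

print(f"object: {object_bytes / 2**20:.1f}MB, category: {category_bytes / 2**20:.2f}MB")
```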

### Real-World Examples

| File on Disk | Memory Usage | Safe? |
|--------------|--------------|-------|
| 50MB | 100-200MB | ✅ Yes |
| 100MB | 200-400MB | ⚠️ Warning threshold |
| 200MB | 400-800MB | ❌ No warning, but risky |
| 500MB | 1-2GB | ❌ At the MAX limit; can crash 8GB systems |
| 500MB (multi-sheet) | 2-4GB | 💀 Can exhaust an 8GB system |

### Impact Scenarios

**Scenario 1**: User has 8GB RAM, 4GB available

- Opens a 500MB Excel file with 3 sheets
- Memory usage: 500MB × 3 sheets × 3 (overhead) = **4.5GB**
- Result: system crash or severe swapping

**Scenario 2**: Merge operation with 3 files

- Each file: 300MB on disk
- Total memory: 300MB × 3 files × 3 (overhead) = **2.7GB**
- Plus merge operation overhead: **3-4GB total**
- May exceed available memory

## Affected Files

- `excel_toolkit/core/const.py` (lines 26-27)
- `excel_toolkit/core/file_handlers.py` (size checks at lines 99-100, 308-309)

## Proposed Solution

Update the file size limits to be more conservative and aligned with actual memory usage:

```python
# More conservative limits based on actual memory impact
MAX_FILE_SIZE_MB = 100      # ~300-400MB in memory
WARNING_FILE_SIZE_MB = 25   # ~75-100MB in memory
```
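
A size gate built on these constants might look like the following (a sketch; the function name and its placement in `file_handlers.py` are assumptions, and the ~3x estimate is the rule of thumb above):

```python
import os

def check_file_size(path: str) -> None:
    """Reject files over the hard limit; warn above the soft limit."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(f"File is {size_mb:.0f}MB; the limit is {MAX_FILE_SIZE_MB}MB")
    if size_mb > WARNING_FILE_SIZE_MB:
        print(f"Warning: a {size_mb:.0f}MB file may use ~{size_mb * 3:.0f}MB of memory")
```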

### Justification

- **100MB file on disk** ≈ 300-400MB in memory (safe for most systems)
- **25MB file on disk** ≈ 75-100MB in memory (a reasonable warning threshold)
- Systems with 4GB RAM can still function
- Multi-sheet files won't immediately crash systems

### Alternative Approach

Implement **memory-based limits** instead of file-size limits:

```python
import psutil

def check_available_memory(required_mb: int) -> None:
    """Check if enough memory is available."""
    available = psutil.virtual_memory().available / (1024 * 1024)
    if available < required_mb * 3:  # 3x safety factor
        raise MemoryError(
            f"Not enough memory. Need: {required_mb * 3}MB, "
            f"Available: {available:.0f}MB"
        )
```
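
A loader could then call the check before reading anything (a usage sketch; the path is hypothetical, and passing the on-disk size as `required_mb` leans on the 3x factor inside the check):

```python
import os

import pandas as pd

file_path = "report.xlsx"  # hypothetical input
size_mb = int(os.path.getsize(file_path) / (1024 * 1024))

check_available_memory(size_mb)  # raises MemoryError without 3x headroom
df = pd.read_excel(file_path)
```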

## Related Issues

- File loading memory issues (#001)
- Memory monitoring needed (#006)
Lines changed: 123 additions & 0 deletions

# Title: Merge operations load all files into memory simultaneously, causing crashes

## Problem Description

The merge command in `excel_toolkit/commands/merge.py` loads all input files into memory **before** performing the merge operation. This is extremely dangerous when dealing with multiple large files (50k-500k rows each).

### Current Behavior

When merging multiple files:

1. **All files are loaded into memory simultaneously**
2. Then they are concatenated
3. Then the result is written

### Memory Impact Formula

```
Total Memory = (File1_size × 3) + (File2_size × 3) + ... + (FileN_size × 3) + Merge_overhead
```

The multiplier of 3 accounts for pandas overhead.
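
Applied before a merge starts, the formula is a one-liner (a sketch; the ×3 multiplier is the rule of thumb above, not a measured constant):

```python
from pathlib import Path

PANDAS_OVERHEAD = 3  # rule-of-thumb multiplier from the formula above

def estimate_merge_memory_mb(file_paths: list[Path]) -> float:
    """Rough in-memory footprint if every input is loaded at once."""
    total_disk_mb = sum(p.stat().st_size for p in file_paths) / (1024 * 1024)
    return total_disk_mb * PANDAS_OVERHEAD
```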

### Real-World Scenarios

**Scenario 1: Merging 3 medium files**

- 3 files × 200MB each on disk
- Memory usage: 200MB × 3 × 3 = **1.8GB minimum**
- With merge overhead: **2-2.5GB**
- Usable, but risky on 8GB systems

**Scenario 2: Merging 5 large files**

- 5 files × 300MB each on disk
- Memory usage: 300MB × 5 × 3 = **4.5GB minimum**
- With merge overhead: **5-6GB**
- Likely to crash or cause severe swapping

**Scenario 3: Merging many small files**

- 20 files × 50MB each on disk
- Total on disk: 1GB
- Memory usage: 50MB × 20 × 3 = **3GB minimum**
- Result: 5 million rows, hard to manage

## Affected Files

- `excel_toolkit/commands/merge.py`
- Potentially affects append operations too

## Proposed Solution

Implement a **streaming merge** that processes files incrementally:

```python
from pathlib import Path

import pandas as pd

def merge_files_streaming(file_paths: list[Path], output_path: Path):
    """Merge files one at a time, writing incrementally."""
    # Read the first file
    result = pd.read_excel(file_paths[0])

    # Process the remaining files one at a time
    for file_path in file_paths[1:]:
        chunk = pd.read_excel(file_path)

        # Concatenate with the previous result
        result = pd.concat([result, chunk], ignore_index=True)

        # Write intermediate results to disk
        result.to_excel(output_path, index=False)

    return result
```

### Benefits

- Only the running result and one input file are in memory at a time
- Writes incrementally, so progress isn't lost on a crash
- Can merge many files without ever holding all inputs at once
- Better memory predictability

### Alternative: Chunked Merge

```python
def merge_files_chunked(file_paths: list[Path], output_path: Path, chunksize: int = 50000):
    """Merge CSV inputs in chunks, writing as we go."""
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        start_row = 0
        for file_path in file_paths:
            # Only pd.read_csv supports chunksize; pd.read_excel does not,
            # so Excel inputs would need openpyxl's read-only mode instead
            for chunk in pd.read_csv(file_path, chunksize=chunksize):
                # Append below the rows already written; emit the header once
                chunk.to_excel(writer, index=False, header=start_row == 0, startrow=start_row)
                start_row += len(chunk) + (1 if start_row == 0 else 0)
```

## Additional Safeguards

1. **Memory check before merge**:

```python
import psutil

total_estimated_mb = sum(p.stat().st_size for p in file_paths) / (1024 * 1024) * 3
available_mb = psutil.virtual_memory().available / (1024 * 1024)
if total_estimated_mb > available_mb:
    raise MemoryError("Cannot merge: not enough memory")
```

2. **Limit the number of files**:

```python
MAX_MERGE_FILES = 10
if len(file_paths) > MAX_MERGE_FILES:
    raise ValueError(f"Cannot merge more than {MAX_MERGE_FILES} files at once")
```

3. **Row limit warning**:

```python
MAX_RESULT_ROWS = 1_000_000
if total_rows > MAX_RESULT_ROWS:
    print(f"Warning: Result will have {total_rows} rows")
```
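
Taken together, the cheap safeguards can run as one pre-flight check (a sketch; `estimate_merge_memory_mb` is the helper sketched under the formula above, and `MAX_MERGE_FILES` is the proposed constant, not existing code):

```python
from pathlib import Path

import psutil

def validate_merge_inputs(file_paths: list[Path]) -> None:
    """Run the safeguards that need no file parsing before any load."""
    if len(file_paths) > MAX_MERGE_FILES:
        raise ValueError(f"Cannot merge more than {MAX_MERGE_FILES} files at once")
    needed_mb = estimate_merge_memory_mb(file_paths)
    available_mb = psutil.virtual_memory().available / (1024 * 1024)
    if needed_mb > available_mb:
        raise MemoryError(
            f"Cannot merge: need ~{needed_mb:.0f}MB, {available_mb:.0f}MB available"
        )
```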

## Related Issues

- File loading memory issues (#001)
- File size limits too permissive (#002)
- Memory monitoring needed (#006)
