
Commit 6765063

codewizdave and claude committed
fix: resolve all ruff linting errors
Fixed 304 ruff linting errors:
- Auto-fixed 276 errors with ruff --fix
- Manually fixed 27 remaining errors
- Formatted codebase with ruff format

Manual fixes:
- Added missing 'Any' import in fp/_maybe.py
- Added TYPE_CHECKING import for Maybe in fp/result.py
- Removed unused variables in commands (group, rename, search)
- Removed unused variable in operations/comparing.py
- Replaced lambda assignments with def functions in property tests
- Replaced unused variables with _ in test files

88 files reformatted, 20 files left unchanged. All ruff checks now pass successfully.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent a364d02 commit 6765063

121 files changed

Lines changed: 9424 additions & 4118 deletions


.claude/settings.local.json

Lines changed: 6 additions & 1 deletion

```diff
@@ -8,7 +8,12 @@
       "Bash(git commit:*)",
       "Bash(python:*)",
       "Bash(uv build:*)",
-      "Bash(wc:*)"
+      "Bash(wc:*)",
+      "Bash(pip:*)",
+      "Bash(gh issue create:*)",
+      "Bash(uv run mypy:*)",
+      "Bash(uv run ruff check:*)",
+      "Bash(uv run ruff format:*)"
     ]
   }
 }
```
Lines changed: 72 additions & 0 deletions

# Title: Implement streaming/chunking for file loading to handle large files (50k-500k rows)

## Problem Description

The current file loading implementation in `excel_toolkit/core/file_handlers.py` loads entire files into memory at once, without any streaming or chunking capabilities. This creates significant memory issues when processing large Excel/CSV files with 50k-500k rows.

### Current Behavior

In `file_handlers.py`:

- Line 110: `pd.read_excel()` loads the entire Excel file into memory
- Line 149: `read_all_sheets()` loads ALL sheets simultaneously
- Line 313: `pd.read_csv()` loads the entire CSV file into memory

### Memory Impact

For a 500k-row file with 20 columns:

- **Memory usage**: 500MB-2GB depending on data types
- **Load time**: 10-30 seconds
- The file is completely loaded before any operation can begin
- No way to process data incrementally

### Real-World Impact

When processing a 500MB Excel file:

- File size on disk: 500MB
- Memory usage after loading: 2-4GB (pandas overhead)
- For multi-sheet files, memory usage multiplies by the number of sheets
- Systems with 8GB RAM can crash or experience severe swapping
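
The overhead is easy to verify on a real file (a quick sketch; the path and printed figures are illustrative, not measurements from this codebase):

```python
import os

import pandas as pd

path = "big_export.xlsx"  # hypothetical large input
disk_mb = os.path.getsize(path) / (1024 * 1024)

df = pd.read_excel(path)
mem_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)

print(f"disk: {disk_mb:.0f}MB, memory: {mem_mb:.0f}MB ({mem_mb / disk_mb:.1f}x)")
```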

## Affected Files

- `excel_toolkit/core/file_handlers.py` (lines 110, 149, 313)
- `excel_toolkit/core/const.py` (file size limits)

## Proposed Solution

Implement chunked reading for large files. Note that `pd.read_excel()` does not accept a `chunksize` argument, so chunked reads only work for CSV input; the Excel side needs a streaming reader (see the openpyxl sketch below):

```python
import pandas as pd

# Read the CSV in 50k-row chunks instead of all at once
for chunk in pd.read_csv(file_path, chunksize=50_000):
    process_chunk(chunk)  # per-chunk handler
    # Write results incrementally
```
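
For Excel files, openpyxl's read-only mode can stream rows without building the whole workbook in memory (a sketch; `process_rows` is a hypothetical per-batch handler):

```python
from openpyxl import load_workbook

# read_only=True streams rows from disk instead of loading the workbook
wb = load_workbook("large.xlsx", read_only=True)
ws = wb.active

batch = []
for row in ws.iter_rows(values_only=True):
    batch.append(row)
    if len(batch) == 50_000:
        process_rows(batch)  # hypothetical per-batch handler
        batch.clear()
if batch:
    process_rows(batch)
wb.close()
```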

Benefits:

- Process files without loading everything into memory
- Handle files larger than available RAM
- Enable incremental processing and writing
- Better user feedback during long operations

## Alternative Approaches

1. Use the `dtype` parameter to optimize memory during loading
2. Implement lazy loading with Polars instead of pandas (see the sketch below)
3. Use Dask for out-of-core computation
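
For option 2, a minimal sketch of what Polars' lazy API buys (the file and column names are hypothetical; nothing is read until `collect()`):

```python
import polars as pl

# scan_csv builds a lazy query plan; filters and projections are pushed
# down into the reader, so the full file is never materialized
result = (
    pl.scan_csv("transactions.csv")   # hypothetical input
    .filter(pl.col("amount") > 0)     # hypothetical column
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect()
)
```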

## Additional Context

This is especially critical for:

- **Merge operations**: loading multiple large files simultaneously
- **Multi-sheet Excel files**: all sheets loaded at once
- **Servers/constrained environments**: limited memory available

The current `MAX_FILE_SIZE_MB = 500` limit is misleading because a 500MB file on disk can easily consume 2-4GB in memory.

## Related Issues

- File size limits too permissive (#002)
- Memory monitoring needed (#006)
- Merge operations memory issues (#003)
Lines changed: 86 additions & 0 deletions

# Title: File size limits are too permissive and can cause system crashes

## Problem Description

The current file size limits in `excel_toolkit/core/const.py` allow files that, when loaded into memory, can overwhelm typical systems and cause crashes or severe performance degradation.

### Current Limits

In `const.py` (lines 26-27):

```python
MAX_FILE_SIZE_MB = 500
WARNING_FILE_SIZE_MB = 100
```

### The Problem

These limits refer to **file size on disk**, not memory usage. When pandas loads an Excel/CSV file, memory usage is typically **2-4x the file size** due to:

- Pandas DataFrame overhead
- Python object overhead
- String data expansion (demonstrated below)
- Index creation
- Type conversions
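
The string-expansion point in particular is easy to demonstrate (an illustrative sketch; the column contents are made up and exact sizes will vary):

```python
import pandas as pd

# A low-cardinality text column: object dtype stores one Python str per
# row, while category stores small integer codes plus three labels
status = pd.Series(["approved", "rejected", "pending"] * 100_000)

object_bytes = status.memory_usage(deep=True)
category_bytes = status.astype("category").memory_usage(deep=True)

print(f"object: {object_bytes / 2**20:.1f}MB, category: {category_bytes / 2**20:.2f}MB")
```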

### Real-World Examples

| File on Disk | Memory Usage | Safe? |
|--------------|--------------|-------|
| 50MB | 100-200MB | ✅ Yes |
| 100MB | 200-400MB | ⚠️ Warning threshold |
| 200MB | 400-800MB | ❌ No warning, but risky |
| 500MB | 1-2GB | ❌ At the MAX limit; can crash 8GB systems |
| 500MB (multi-sheet) | 2-4GB | 💀 Can exhaust an 8GB system |

### Impact Scenarios

**Scenario 1**: User has 8GB RAM, 4GB available

- Opens a 500MB Excel file with 3 sheets
- Memory usage: 500MB × 3 sheets × 3 (overhead) = **4.5GB**
- Result: system crash or severe swapping

**Scenario 2**: Merge operation with 3 files

- Each file: 300MB on disk
- Total memory: 300MB × 3 files × 3 (overhead) = **2.7GB**
- Plus merge operation overhead: **3-4GB total**
- May exceed available memory

## Affected Files

- `excel_toolkit/core/const.py` (lines 26-27)
- `excel_toolkit/core/file_handlers.py` (size checks at lines 99-100, 308-309)

## Proposed Solution

Update the file size limits to be more conservative and aligned with actual memory usage:

```python
# More conservative limits based on actual memory impact
MAX_FILE_SIZE_MB = 100      # ~300-400MB in memory
WARNING_FILE_SIZE_MB = 25   # ~75-100MB in memory
```
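
A size gate built on these constants might look like the following (a sketch; the function name and its placement in `file_handlers.py` are assumptions, and the ~3x estimate is the rule of thumb above):

```python
import os

def check_file_size(path: str) -> None:
    """Reject files over the hard limit; warn above the soft limit."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(f"File is {size_mb:.0f}MB; the limit is {MAX_FILE_SIZE_MB}MB")
    if size_mb > WARNING_FILE_SIZE_MB:
        print(f"Warning: a {size_mb:.0f}MB file may use ~{size_mb * 3:.0f}MB of memory")
```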

### Justification

- **100MB file on disk** ≈ 300-400MB in memory (safe for most systems)
- **25MB file on disk** ≈ 75-100MB in memory (a reasonable warning threshold)
- Systems with 4GB RAM can still function
- Multi-sheet files won't immediately crash systems

### Alternative Approach

Implement **memory-based limits** instead of file-size limits:

```python
import psutil

def check_available_memory(required_mb: int) -> None:
    """Check if enough memory is available."""
    available = psutil.virtual_memory().available / (1024 * 1024)
    if available < required_mb * 3:  # 3x safety factor
        raise MemoryError(
            f"Not enough memory. Need: {required_mb * 3}MB, "
            f"Available: {available:.0f}MB"
        )
```
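
A loader could then call the check before reading anything (a usage sketch; the path is hypothetical, and passing the on-disk size as `required_mb` leans on the 3x factor inside the check):

```python
import os

import pandas as pd

file_path = "report.xlsx"  # hypothetical input
size_mb = int(os.path.getsize(file_path) / (1024 * 1024))

check_available_memory(size_mb)  # raises MemoryError without 3x headroom
df = pd.read_excel(file_path)
```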

## Related Issues

- File loading memory issues (#001)
- Memory monitoring needed (#006)
Lines changed: 123 additions & 0 deletions

# Title: Merge operations load all files into memory simultaneously, causing crashes

## Problem Description

The merge command in `excel_toolkit/commands/merge.py` loads all input files into memory **before** performing the merge operation. This is extremely dangerous when dealing with multiple large files (50k-500k rows each).

### Current Behavior

When merging multiple files:

1. **All files are loaded into memory simultaneously**
2. Then they are concatenated
3. Then the result is written

### Memory Impact Formula

```
Total Memory = (File1_size × 3) + (File2_size × 3) + ... + (FileN_size × 3) + Merge_overhead
```

The multiplier of 3 accounts for pandas overhead.
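
Applied before a merge starts, the formula is a one-liner (a sketch; the ×3 multiplier is the rule of thumb above, not a measured constant):

```python
from pathlib import Path

PANDAS_OVERHEAD = 3  # rule-of-thumb multiplier from the formula above

def estimate_merge_memory_mb(file_paths: list[Path]) -> float:
    """Rough in-memory footprint if every input is loaded at once."""
    total_disk_mb = sum(p.stat().st_size for p in file_paths) / (1024 * 1024)
    return total_disk_mb * PANDAS_OVERHEAD
```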

### Real-World Scenarios

**Scenario 1: Merging 3 medium files**

- 3 files × 200MB each on disk
- Memory usage: 200MB × 3 × 3 = **1.8GB minimum**
- With merge overhead: **2-2.5GB**
- Usable, but risky on 8GB systems

**Scenario 2: Merging 5 large files**

- 5 files × 300MB each on disk
- Memory usage: 300MB × 5 × 3 = **4.5GB minimum**
- With merge overhead: **5-6GB**
- Likely to crash or cause severe swapping

**Scenario 3: Merging many small files**

- 20 files × 50MB each on disk
- Total on disk: 1GB
- Memory usage: 50MB × 20 × 3 = **3GB minimum**
- Result: 5 million rows, hard to manage

## Affected Files

- `excel_toolkit/commands/merge.py`
- Potentially affects append operations too

## Proposed Solution

Implement a **streaming merge** that processes files incrementally:

```python
from pathlib import Path

import pandas as pd

def merge_files_streaming(file_paths: list[Path], output_path: Path):
    """Merge files one at a time, writing incrementally."""
    # Read the first file
    result = pd.read_excel(file_paths[0])

    # Process the remaining files one at a time
    for file_path in file_paths[1:]:
        chunk = pd.read_excel(file_path)

        # Concatenate with the previous result
        result = pd.concat([result, chunk], ignore_index=True)

        # Write intermediate results to disk
        result.to_excel(output_path, index=False)

    return result
```

### Benefits

- Only the running result and one input file are in memory at a time
- Writes incrementally, so progress isn't lost on a crash
- Can merge many files without ever holding all inputs at once
- Better memory predictability

### Alternative: Chunked Merge

```python
def merge_files_chunked(file_paths: list[Path], output_path: Path, chunksize: int = 50000):
    """Merge CSV inputs in chunks, writing as we go."""
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        start_row = 0
        for file_path in file_paths:
            # Only pd.read_csv supports chunksize; pd.read_excel does not,
            # so Excel inputs would need openpyxl's read-only mode instead
            for chunk in pd.read_csv(file_path, chunksize=chunksize):
                # Append below the rows already written; emit the header once
                chunk.to_excel(writer, index=False, header=start_row == 0, startrow=start_row)
                start_row += len(chunk) + (1 if start_row == 0 else 0)
```

## Additional Safeguards

1. **Memory check before merge**:

```python
import psutil

total_estimated_mb = sum(p.stat().st_size for p in file_paths) / (1024 * 1024) * 3
available_mb = psutil.virtual_memory().available / (1024 * 1024)
if total_estimated_mb > available_mb:
    raise MemoryError("Cannot merge: not enough memory")
```

2. **Limit the number of files**:

```python
MAX_MERGE_FILES = 10
if len(file_paths) > MAX_MERGE_FILES:
    raise ValueError(f"Cannot merge more than {MAX_MERGE_FILES} files at once")
```

3. **Row limit warning**:

```python
MAX_RESULT_ROWS = 1_000_000
if total_rows > MAX_RESULT_ROWS:
    print(f"Warning: Result will have {total_rows} rows")
```
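
Taken together, the cheap safeguards can run as one pre-flight check (a sketch; `estimate_merge_memory_mb` is the helper sketched under the formula above, and `MAX_MERGE_FILES` is the proposed constant, not existing code):

```python
from pathlib import Path

import psutil

def validate_merge_inputs(file_paths: list[Path]) -> None:
    """Run the safeguards that need no file parsing before any load."""
    if len(file_paths) > MAX_MERGE_FILES:
        raise ValueError(f"Cannot merge more than {MAX_MERGE_FILES} files at once")
    needed_mb = estimate_merge_memory_mb(file_paths)
    available_mb = psutil.virtual_memory().available / (1024 * 1024)
    if needed_mb > available_mb:
        raise MemoryError(
            f"Cannot merge: need ~{needed_mb:.0f}MB, {available_mb:.0f}MB available"
        )
```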

## Related Issues

- File loading memory issues (#001)
- File size limits too permissive (#002)
- Memory monitoring needed (#006)
