File Extractor Pro

File Extractor Pro is a GUI application to extract and process files based on specified criteria. It allows users to include or exclude files based on extensions, include hidden files, and generate detailed extraction reports.

Features

Select a folder to extract files from with a dropdown history of recent paths.
Choose between inclusion or exclusion mode for file extensions.
Include or exclude hidden files and folders.
Specify custom file extensions to include or exclude.
Exclude specific files and folders by name.
Accurate progress tracking and status updates.
Generate detailed extraction reports in JSON format.
Asynchronous file processing for improved performance.
Error handling and logging for robustness.
Streams large files without a hard-coded size cap while emitting warnings when configurable soft limits are exceeded.

Installation

Clone the repository:

```
git clone https://github.com/cortega26/File-Extractor-Pro.git
cd File-Extractor-Pro
```

Create a virtual environment and activate it:

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Run the application:
```
python file_extractor.py
```
Use the GUI to select a folder, set the criteria, and start the extraction process.
View progress and logs in the application window.
Generate and view extraction reports.

Command-line usage

Run headless extractions via the CLI module:

```
python -m services.cli /path/to/folder
```

When --extensions is omitted in inclusion mode, the CLI automatically processes a curated set of common file types (.txt, .md, .py, and more), preventing silent zero-output runs.

--mode defaults to inclusion, so a run that passes only the folder uses that curated extension set.
The --mode flag is case-insensitive; values such as INCLUSION or Exclusion are normalised to their lowercase equivalents before execution.
Extensions may be supplied with or without a leading dot. --extensions txt md is normalised to (.txt, .md) automatically.
Use --extensions "*" to process all file types in inclusion mode while still respecting exclude patterns.
Provide comma-separated lists for long extension sets: --extensions "txt,md,pdf".
In exclusion mode, omit --extensions to process all files or provide the extensions that should be skipped.
Append --include-hidden to traverse hidden files and folders.
Provide --max-file-size-mb to emit warnings for files exceeding the specified soft limit while still streaming their contents. When omitted, the processor derives a soft cap from available system memory to avoid MemoryError regressions.
Adjust --poll-interval (seconds) and --log-level (DEBUG–CRITICAL) to tune queue responsiveness and console verbosity.
Use --report to generate a JSON snapshot of the latest extraction summary on disk.
Each run logs throughput and reliability metrics (files processed, elapsed time, files per second, queue saturation, dropped status messages, service queue drops, large file warnings, and skipped files) at the end of the execution to aid monitoring.
When directory enumeration falls back to an indeterminate mode (for very large trees), the CLI records a final progress update with the completed file count and logs a completion timestamp so automation can detect when the run finished.
Programmatic consumers that instantiate CLIOptions directly inherit the same default extension set when running in inclusion mode, preventing empty extraction results.
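
The normalisation rules above can be illustrated with a minimal sketch. It is not the project's implementation: `normalise_mode`, `normalise_extensions`, and `DEFAULT_EXTENSIONS` are hypothetical names, and the default tuple is a stand-in because the full curated set is not listed here. The sketch also treats `*` as a wildcard in either mode, which is an assumption.

```python
from typing import Iterable, Tuple

# Hypothetical stand-in for the curated default set; the real list may differ.
DEFAULT_EXTENSIONS: Tuple[str, ...] = (".txt", ".md", ".py")


def normalise_mode(raw: str) -> str:
    """Lowercase the mode and reject anything but the two documented values."""
    mode = raw.strip().lower()
    if mode not in ("inclusion", "exclusion"):
        raise ValueError(f"unsupported mode: {raw!r}")
    return mode


def normalise_extensions(raw: Iterable[str], mode: str) -> Tuple[str, ...]:
    """Split comma-separated values, add leading dots, and apply defaults."""
    tokens = [part.strip() for item in raw for part in item.split(",")]
    tokens = [token for token in tokens if token]
    if "*" in tokens:
        return ("*",)  # wildcard: match every file type (assumed for both modes)
    if not tokens:
        # Inclusion mode falls back to the default set; exclusion mode skips nothing.
        return DEFAULT_EXTENSIONS if mode == "inclusion" else ()
    result = []
    for token in tokens:
        ext = token if token.startswith(".") else f".{token}"
        if ext not in result:  # drop duplicates while keeping order
            result.append(ext)
    return tuple(result)
```

With these definitions, `normalise_extensions(["txt", "md"], "inclusion")` and `normalise_extensions(["txt,md"], "inclusion")` both produce `(".txt", ".md")`.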

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `folder` | positional path | required | Folder to extract files from. |
| `--mode` | choice | `inclusion` | Choose between inclusion and exclusion filtering. |
| `--extensions` | list | varies | Extensions to include/exclude; defaults to common types in inclusion mode. |
| `--include-hidden` | boolean | `false` | Traverse hidden files and folders. |
| `--exclude-files` | list | `[]` | File patterns to skip during extraction. |
| `--exclude-folders` | list | `[]` | Folder patterns to skip. |
| `--output` | path | `extraction.txt` | Destination text file for extracted content. |
| `--report` | path | none | Optional JSON report output location. |
| `--max-file-size-mb` | int | auto | Soft warning threshold for large files; defaults to available memory. |
| `--poll-interval` | float | `0.1` | Queue polling interval while monitoring extraction progress. |
| `--log-level` | choice | `INFO` | Console logging verbosity. |
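
To make the soft-limit behaviour of `--max-file-size-mb` concrete, here is a rough sketch of streaming a file in chunks while only warning about its size. The function `stream_file`, the logger name, and the 64 KiB chunk size are assumptions for illustration; the actual processor may work differently.

```python
import logging
from pathlib import Path
from typing import Iterator

logger = logging.getLogger("extractor_sketch")  # hypothetical logger name


def stream_file(path: Path, soft_limit_mb: float,
                chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield the file in chunks; warn, but keep going, above the soft limit."""
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > soft_limit_mb:
        logger.warning("%s is %.1f MB, above the %.1f MB soft limit",
                       path, size_mb, soft_limit_mb)
    # Undecodable bytes are replaced so one bad file does not abort the run.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Because the function is a generator, memory use stays bounded by the chunk size regardless of the file size, which is what allows a soft limit rather than a hard cap.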

Configuration

The application persists user preferences to config.ini. Values are validated at startup to prevent invalid modes or resource limits from causing runtime errors.

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| `output_file` | string | `output.txt` | No | Default filename used when generating extraction output. |
| `mode` | string | `inclusion` | No | Extraction mode. Must be either `inclusion` or `exclusion`. |
| `include_hidden` | boolean | `false` | No | Controls whether hidden files and folders are processed. Accepts `true/false`, `yes/no`, or `1/0`. |
| `exclude_files` | list | see defaults | No | Comma-separated list of file patterns to exclude from extraction. |
| `exclude_folders` | list | see defaults | No | Comma-separated list of folder patterns to exclude from extraction. |
| `theme` | string | `light` | No | UI theme preference. Must be `light` or `dark`. |
| `batch_size` | integer | `100` | No | Number of files processed before UI progress updates. Must be greater than zero. |
| `max_memory_mb` | integer | `512` | No | Soft memory cap used for processing safeguards. Must be greater than zero. |
| `recent_folders` | list | `[]` | No | JSON array tracking the most recently selected folders for quick access in the Browse menu. |
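
As an illustration of the validation rules in the table, the sketch below parses a sample config.ini with the standard `configparser` module. The section name `settings`, the function `load_settings`, and the sample values are assumptions; the application's own config_manager.py may be organised differently.

```python
import configparser
import json

# Hypothetical file contents; the real section name is not documented here.
SAMPLE_CONFIG = """
[settings]
output_file = output.txt
mode = inclusion
include_hidden = no
theme = dark
batch_size = 100
max_memory_mb = 512
recent_folders = ["/tmp/project"]
"""


def load_settings(text: str) -> dict:
    """Parse the INI text and enforce the constraints listed in the table."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["settings"]

    mode = section.get("mode", fallback="inclusion").strip()
    if mode not in ("inclusion", "exclusion"):
        raise ValueError(f"mode must be inclusion or exclusion, got {mode!r}")

    theme = section.get("theme", fallback="light").strip()
    if theme not in ("light", "dark"):
        raise ValueError(f"theme must be light or dark, got {theme!r}")

    batch_size = section.getint("batch_size", fallback=100)
    max_memory_mb = section.getint("max_memory_mb", fallback=512)
    if batch_size <= 0 or max_memory_mb <= 0:
        raise ValueError("batch_size and max_memory_mb must be greater than zero")

    recent = json.loads(section.get("recent_folders", fallback="[]"))
    if not isinstance(recent, list):
        raise ValueError("recent_folders must be a JSON array")

    return {
        "output_file": section.get("output_file", fallback="output.txt"),
        "mode": mode,
        # getboolean accepts true/false, yes/no, 1/0 (and on/off).
        "include_hidden": section.getboolean("include_hidden", fallback=False),
        "theme": theme,
        "batch_size": batch_size,
        "max_memory_mb": max_memory_mb,
        "recent_folders": recent,
    }
```

Calling `load_settings(SAMPLE_CONFIG)` returns a plain dictionary, while an invalid value such as `mode = both` raises `ValueError` before any extraction starts.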

Testing

Install development dependencies to enable coverage and security tooling:

```
pip install -r requirements-dev.txt
```

Run the full suite with coverage thresholds enforced:

```
pytest
python tools/coverage_gate.py  # Enforces ≥90% per-file coverage for tracked modules
```

The configuration enables branch coverage and fails the run when overall coverage drops below 80%. Aim for ≥90% coverage on changed modules to satisfy internal quality targets. The coverage_gate helper reads coverage.xml and reports any files that dip below the threshold so regressions are caught before CI.
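
A per-file check of this kind can be sketched as follows. This is not the code in tools/coverage_gate.py: the function name `files_below_threshold` is hypothetical, the sample XML uses made-up rates, and the sketch assumes the Cobertura-style layout that coverage.py writes to coverage.xml, where each file appears as a `class` element with `filename` and `line-rate` attributes.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Made-up sample data in the Cobertura-style layout of coverage.xml.
SAMPLE_XML = """<coverage line-rate="0.93">
  <packages><package name="."><classes>
    <class name="processor.py" filename="processor.py" line-rate="0.95"/>
    <class name="ui.py" filename="ui.py" line-rate="0.72"/>
  </classes></package></packages>
</coverage>"""


def files_below_threshold(xml_text: str, threshold: float = 0.90) -> Dict[str, float]:
    """Return {filename: line_rate} for every file under the threshold."""
    root = ET.fromstring(xml_text)
    failures: Dict[str, float] = {}
    for node in root.iter("class"):
        rate = float(node.get("line-rate", "0"))
        if rate < threshold:
            failures[node.get("filename", "<unknown>")] = rate
    return failures
```

For the sample above, only `ui.py` is reported, since its line rate of 0.72 is below the 0.90 threshold.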

Security Scans

Run the required security tooling before submitting changes:

```
bandit -ll -r .
pip-audit
gitleaks detect --redact
python tools/security_checks.py  # Runs all scanners sequentially
```

Fix: testing_ci_security_scanners

bandit and pip-audit install via requirements-dev.txt. Install the latest gitleaks binary from the official releases and ensure it is on your PATH so tools/security_checks.py can locate it. Capture and address any medium- or high-severity findings before merging.
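
For readers curious how a sequential runner with a PATH check might look, here is a small sketch. It is an assumption about the general approach, not the contents of tools/security_checks.py; `run_scanners` and `SCANNERS` are hypothetical names, and the commands are the ones listed above.

```python
import shutil
import subprocess
from typing import Dict, List, Optional, Sequence

# Commands copied from the list above.
SCANNERS: List[List[str]] = [
    ["bandit", "-ll", "-r", "."],
    ["pip-audit"],
    ["gitleaks", "detect", "--redact"],
]


def run_scanners(commands: Sequence[Sequence[str]]) -> Dict[str, Optional[int]]:
    """Run each command in order; None marks a tool that is not on PATH."""
    results: Dict[str, Optional[int]] = {}
    for command in commands:
        tool = command[0]
        if shutil.which(tool) is None:
            results[tool] = None  # binary missing, e.g. gitleaks not installed
            continue
        # check=False so one failing scanner does not stop the remaining ones.
        results[tool] = subprocess.run(list(command), check=False).returncode
    return results
```

Calling `run_scanners(SCANNERS)` returns a mapping from tool name to exit code, so a wrapper can treat any non-zero or missing entry as a failed check.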

Requirements

Python 3.9+
Standard library only (no additional runtime dependencies required)

License

This project is licensed under the MIT License.
