Created: 2025-11-14 Status: Phase 1 Complete - Ready for Testing
-
Hash Signature Generation
- MD5 and SHA256 hashing
- ClamAV .hdb format (MD5-based)
- ClamAV .hsb format (SHA256-based)
-
String Extraction
- Shannon entropy calculation
- Configurable entropy threshold (default: 4.0)
- Minimum/maximum length filtering
- Sorted by entropy (highest first)
-
Body Signatures
- .ndb format generation
- Hex pattern conversion
- Multiple signature generation from extracted strings
-
PE File Support
- Section hash extraction
- .mdb format signatures
- pefile library integration
-
CLI Interface
hash- Generate hash signaturesextract- Extract and analyze stringsgenerate- Generate all signature typesfrom-capa- Placeholder for capa integrationvalidate- Placeholder for validation
- Unit test suite with pytest (7 tests, all passing)
- Test coverage for core functionality
- Example scripts demonstrating usage
- Full project documentation
- Modern Python packaging (pyproject.toml)
- Development dependencies (pytest, mypy, pylint, black)
- Makefile for common tasks
- Documentation structure
- Git repository ready
# Package installed successfully
pip install -e .
# Successfully installed clamav-siggen-0.1.0
# All tests passing
pytest tests/ -v
# 7 passed in 0.04s
# CLI working
clamav-siggen --help
# All commands available$ clamav-siggen hash /tmp/test_sample.bin --name "Test.Malware"
# MD5 Hash
7a3cf2a1badf2011845df945bee64d2f
# SHA256 Hash
a1304402131e0c8d428e2bfb96e4188e90bdbff714a7232b9b7c961652117c2d
# ClamAV .hdb format (MD5)
7a3cf2a1badf2011845df945bee64d2f:55:Test.Malware
# ClamAV .hsb format (SHA256)
a1304402131e0c8d428e2bfb96e4188e90bdbff714a7232b9b7c961652117c2d:55:Test.Malware$ clamav-siggen extract /tmp/test_sample.bin --min-entropy 2.0 --format text
String Entropy Length
----------------------------------------------------------------------
This is a test malware sample for signature gen... 3.84 54
Total: 1 strings extractedfrom clamav_siggen import ClamAVSigGen
siggen = ClamAVSigGen(file_path="malware.exe")
hash_sigs = siggen.generate_hash()
# Works correctly- Clean file corpus scanning
- False positive detection
- Signature quality scoring
- Needed for: Production use
- capa JSON result parser
- Capability-based signature generation
- Integration with capa-server API
- Needed for: capa-server integration
- YARA rule parser (plyara)
- YARA to ClamAV conversion
- Logical signature building
- Needed for: Advanced features
-
Test with real malware samples
- Place samples in
data/samples/ - Run:
clamav-siggen generate sample.exe --name "Malware.Family" - Verify signatures are generated correctly
- Place samples in
-
Test with PE files
- Use Windows executables
- Verify PE section hashes are extracted
- Check .mdb signature format
-
Validate entropy filtering
- Experiment with different
--min-entropyvalues - Typical range: 3.0 (low quality) to 5.0 (high quality)
- Default 4.0 is a good starting point
- Experiment with different
-
Implement validation framework
- Download clean file corpus (e.g., Windows System32 files)
- Place in
data/clean/ - Implement signature testing against corpus
- Report false positive rate
-
Add capa integration
- Parse capa JSON results
- Extract capability indicators
- Generate targeted signatures from capabilities
- Create quality scoring based on capability uniqueness
-
Future integration with capa-server
- Import clamav_siggen as library
- Add
/api/analyses/{id}/generate-signatureendpoint - Pass capa results to generator
- Return signatures via API
-
String extraction currently only handles ASCII printable characters
- Unicode strings are not extracted
- Future: Add Unicode support
-
Entropy calculation is byte-based Shannon entropy
- May not catch all interesting patterns
- Consider adding n-gram analysis
-
No signature optimization
- Generates many signatures without prioritization
- Future: Add quality scoring and ranking
-
No logical signatures
- Can't combine multiple patterns with AND/OR logic
- Future: Implement .ldb format support