A local writing style analyzer that uses Large Language Models (LLMs) to analyze and profile writing styles in German and English text. This tool runs completely locally without external API calls.
- Local LLM Integration: Uses HuggingFace Transformers or llama.cpp with GGUF models
- Multilingual Support: Optimized for German and English text analysis
- Comprehensive Analysis:
- Sentence and paragraph structure
- Lexical diversity metrics
- Language-specific features (German formality, compound words, etc.)
- Common phrases and vocabulary patterns
- Tone and formality detection
- Dual-Format Output: Generates both JSON (for analysis) and Markdown (for AI agents)
- Profile Generation: Creates detailed profiles for different writing contexts
- No External Services: Runs offline using local models once the model has been downloaded
writing-style-analyzer/
├── analyze.py # Main profile generation tool ⭐
├── german_academic_analyzer.py # Universal German text analysis library ⭐⭐
├── pyproject.toml # UV project configuration
├── config.yaml # Configuration file
├── texts/ # Input directory for text samples
├── profiles/ # Output directory for generated profiles
├── user-profiles/ # V2 validated profiles and documentation
│ ├── profiles/ # Validated academic profiles (default, excellence)
│ ├── test-prompts/ # Test validation framework
│ ├── validate_test*.py # Test validation scripts (⚠️ user-specific)
│ └── *.md # Comprehensive usage guides
├── SCRIPTS_README.md # Guide to all analysis scripts ⭐
└── README.md # This file
Key Files for Other Users:
- analyze.py - Create your own writing profile ✅
- german_academic_analyzer.py - Universal German text analyzer ✅
- user-profiles/validate_test*.py - ⚠️ SKIP these (hardcoded to original author)
See SCRIPTS_README.md for detailed explanation of each script!
- Python 3.10 or higher
- UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Navigate to project directory
cd writing-style-analyzer
# Create virtual environment and install dependencies
uv venv
uv sync
# Or if using pip:
uv pip install -e .

The analyzer uses HuggingFace models by default. On first run, the model is downloaded automatically (~3-7GB depending on model choice).
Recommended Models for German/English:
- Qwen/Qwen2.5-3B-Instruct (Default, excellent multilingual support)
- meta-llama/Llama-3.2-3B-Instruct (Good multilingual performance)
- mistralai/Mistral-7B-Instruct-v0.2 (Larger, better quality, needs more resources)
Configure your preferred model in config.yaml:
model:
  type: "transformers"
  name: "Qwen/Qwen2.5-3B-Instruct"
  device: "auto"  # auto-detects GPU/CPU

Edit config.yaml to customize:
- Model settings: Model type, name, device, parameters
- Analysis settings: Chunk size, languages, detail level
- File processing: Extensions, encoding, ignore patterns
- Output settings: JSON formatting, example inclusion
See the config.yaml file for detailed comments on all options.
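This README does not describe how `device: "auto"` is resolved. The following is a minimal sketch of one plausible resolution order, assuming PyTorch is the inference backend. `resolve_device` is a hypothetical helper for illustration, not a function from this project.

```python
def resolve_device(requested: str = "auto") -> str:
    """Map a config `device` value to a concrete device name.

    Hypothetical helper: the analyzer's actual logic may differ.
    """
    if requested != "auto":
        # Explicit values such as "cpu", "cuda" or "mps" pass through unchanged
        return requested
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend (Apple Silicon) only exists in newer PyTorch builds
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(resolve_device("auto"))
```

Setting an explicit value in config.yaml skips the detection step, which helps when the auto-detected device runs out of memory.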
# Analyze blog posts
uv run analyze.py --input texts/blog --output profiles/blog-profile.json --profile-type blog
# Analyze social media content
uv run analyze.py --input texts/social --output profiles/social-profile.json --profile-type social
# Use custom config
uv run analyze.py --input texts/blog --output profiles/custom.json --config my-config.yaml

Options:
--input, -i Input directory containing text files (required)
--output, -o Output path for profile JSON (required)
--profile-type, -t Profile type name (default: general)
--config, -c Path to config file (default: config.yaml)
--help, -h Show help message
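The options above map naturally onto Python's standard `argparse` module. This is a sketch of an equivalent parser, not the parser in analyze.py itself; the description text and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="analyze.py",
        description="Generate a writing style profile from text samples.",
    )
    parser.add_argument("--input", "-i", required=True,
                        help="Input directory containing text files")
    parser.add_argument("--output", "-o", required=True,
                        help="Output path for profile JSON")
    parser.add_argument("--profile-type", "-t", default="general",
                        help="Profile type name")
    parser.add_argument("--config", "-c", default="config.yaml",
                        help="Path to config file")
    # --help / -h is added automatically by argparse
    return parser


args = build_parser().parse_args(
    ["-i", "texts/blog", "-o", "profiles/blog-profile.json", "-t", "blog"]
)
print(args.input, args.output, args.profile_type, args.config)
```

Note that argparse exposes `--profile-type` as the attribute `profile_type`.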
- Collect your text samples:
  mkdir -p texts/blog
  # Copy your writing samples (.txt, .md, .pdf, .docx, .odt)
- Run analysis:
  uv run analyze.py --input texts/blog --output profiles/my-blog.json --profile-type tech-blog
- Review the profile:
  cat profiles/my-blog.json
The analyzer generates two files for each profile:
- JSON file (profile-name.json): Complete analysis data, metrics, and metadata
- Markdown file (profile-name.md): AI-friendly instructions for writing guidance
The JSON profile contains the following structure:
{
"profile_name": "tech-blog",
"created_at": "2025-10-26T12:34:56.789",
"analyzed_files": 15,
"primary_language": "de",
"languages_detected": ["de", "en"],
"metrics": {
"avg_sentence_length": 18.5,
"avg_paragraph_length": 3.2,
"lexical_diversity": 0.73,
"total_words": 5420,
"total_sentences": 293
},
"style_characteristics": {
"tone": "friendly-informative, conversational",
"formality": "casual-professional",
"typical_elements": [
"Uses 'du' form (German informal you)",
"Starts with questions or scenarios",
"Short paragraphs (2-4 sentences)"
],
"structural_patterns": [
"Question-led openings",
"Code examples embedded",
"Summary conclusions"
]
},
"vocabulary": {
"common_phrases": [
"im grunde",
"tatsächlich",
"aber",
"eigentlich"
],
"characteristics": "Mix of German and English technical terms"
},
"german_features": {
"formality": "informal (du-form)",
"has_compound_words": true,
"compound_word_examples": ["softwareentwicklung", "datenbankverbindung"],
"uses_umlauts": true
},
"avoid": [
"Marketing language",
"Passive voice",
"Overly formal structures"
]
}

The Markdown file provides AI-friendly instructions:
# Profile Name Writing Style Profile
## Quick Instructions
Write in this style using these characteristics:
### Voice & Structure
- **Passive voice:** 45%
- **Sentence length:** ~20 words average
- **Lexical diversity:** 0.35
### Transition Words
**Contrastive:**
- Use: jedoch, allerdings, dennoch
- **Target:** ~25 uses per document
### Style Signature
- **Tone:** Professional and technical
- **Formality:** Formal
### What to Avoid
- Colloquial language
- Personal opinions without evidence

Once you've generated a profile, you can use it to guide AI assistants when writing new text.
Best for: Regular use, convenience
- Create a project in your AI platform (Claude Desktop, ChatGPT, etc.)
- Upload the generated .md profile file as project knowledge
- Reference it in your prompts
Example:
Write a 500-word paragraph about [TOPIC] using my writing style from the profile.
Best for: One-off use, testing different profiles
- Open the generated .md profile file
- Copy the entire content
- Paste it into your AI conversation
- Follow with your writing request
Example:
[Paste full profile content]
Based on this writing style profile, write about [TOPIC]...
Best for: Fine-tuning specific aspects
Extract key metrics from your profile and reference them:
Write a paragraph with:
- Average sentence length: ~[X] words
- Passive voice ratio: ~[Y]%
- Use transitions from categories: [list]
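Filling in such a prompt by hand gets tedious with several profiles. As a rough sketch, the values can be read from the JSON profile instead. The key names below follow the example profile shown earlier; `build_prompt` and the sample topic are hypothetical.

```python
import json


def build_prompt(profile: dict, topic: str) -> str:
    """Turn selected profile fields into a short writing prompt."""
    metrics = profile.get("metrics", {})
    style = profile.get("style_characteristics", {})
    lines = [f"Write a paragraph about {topic} with:"]
    if "avg_sentence_length" in metrics:
        lines.append(
            f"- Average sentence length: ~{metrics['avg_sentence_length']} words"
        )
    if "tone" in style:
        lines.append(f"- Tone: {style['tone']}")
    if "formality" in style:
        lines.append(f"- Formality: {style['formality']}")
    for item in profile.get("avoid", []):
        lines.append(f"- Avoid: {item}")
    return "\n".join(lines)


# Inline excerpt of the example profile. In practice, read the generated file:
#   profile = json.loads(Path("profiles/my-blog.json").read_text(encoding="utf-8"))
profile = json.loads("""
{
  "metrics": {"avg_sentence_length": 18.5, "lexical_diversity": 0.73},
  "style_characteristics": {
    "tone": "friendly-informative, conversational",
    "formality": "casual-professional"
  },
  "avoid": ["Marketing language", "Passive voice"]
}
""")
print(build_prompt(profile, "database connection pooling"))
```

Fields missing from a profile are simply skipped, so the same helper works for profiles generated with different detail levels.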
MCP Memory (if available): Store profiles in memory for later retrieval
File Attachment (if available): Attach the .md file directly to conversations
See user-profiles/ directory for example usage guides and validation results.
The analyzer includes specialized support for German language features:
- du-form (informal): du, dich, dir, dein
- Sie-form (formal): Sie, Ihnen, Ihr
Detects long German compound words (e.g., "Softwareentwicklungsumgebung")
Full UTF-8 support for ä, ö, ü, ß
Adapts to typically longer German sentences compared to English
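To illustrate the kind of checks involved, here is a simplified sketch of pronoun-based formality detection and length-based compound spotting. It is not the analyzer's implementation: the word lists come from the bullets above, while the 15-character threshold and both function names are assumptions.

```python
import re

DU_WORDS = {"du", "dich", "dir", "dein"}
SIE_WORDS = {"Sie", "Ihnen", "Ihr"}  # capitalized forms only
WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def detect_formality(text: str) -> str:
    """Classify text as du-form or Sie-form by counting pronouns."""
    du_count = 0
    sie_count = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for position, word in enumerate(WORD_RE.findall(sentence)):
            if word.lower() in DU_WORDS:
                du_count += 1
            # A sentence-initial "Sie"/"Ihr" is ambiguous (it may be
            # "sie" = she/they), so only later positions are counted.
            elif position > 0 and word in SIE_WORDS:
                sie_count += 1
    if du_count == 0 and sie_count == 0:
        return "unknown"
    if du_count > sie_count:
        return "informal (du-form)"
    if sie_count > du_count:
        return "formal (Sie-form)"
    return "mixed"


def long_words(text: str, min_length: int = 15) -> list[str]:
    """Return unusually long words as a crude proxy for compounds."""
    return sorted({w for w in WORD_RE.findall(text) if len(w) >= min_length})


sample = "Kannst du mir helfen? Ich zeige dir die Softwareentwicklungsumgebung."
print(detect_formality(sample))
print(long_words(sample))
```

Word length alone cannot tell a true compound from a long simple word, so a real detector would need a dictionary or a morphological splitter.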
Minimum:
- CPU: Modern x86_64 processor
- RAM: 8GB (for 3B parameter models)
- Storage: 10GB free space
Recommended:
- GPU: NVIDIA GPU with 6GB+ VRAM (CUDA support)
- RAM: 16GB
- Storage: 20GB free space
- Use GPU acceleration when available:
  model:
    device: "cuda"  # or "mps" for Apple Silicon
- Use smaller models for faster analysis:
  - 3B models: Fast, good quality
  - 7B models: Slower, better quality
- Adjust chunk size in config for memory constraints:
  analysis:
    chunk_size: 4000  # Reduce if running out of memory
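To see why a smaller chunk_size lowers memory use, it helps to picture how a long text is cut into pieces before each piece is sent to the model. The sketch below assumes chunk_size is measured in characters and that chunks prefer paragraph boundaries. The README states neither, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 4000) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Paragraphs are kept together where possible; a paragraph longer
    than chunk_size is cut into fixed-size pieces.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        pieces = [
            paragraph[i:i + chunk_size]
            for i in range(0, len(paragraph), chunk_size)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate  # piece still fits into the open chunk
            else:
                chunks.append(current)  # close the chunk and start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks


text = "\n\n".join(["a" * 1500] * 5)
print([len(chunk) for chunk in chunk_text(text, chunk_size=4000)])
```

Smaller chunks mean shorter prompts and less memory per model call, at the cost of more calls per document.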
Symptoms: Process killed or CUDA out of memory
Solutions:
- Use CPU instead of GPU: set device: "cpu" in config
- Use smaller model (3B instead of 7B)
- Reduce chunk_size in config
- Close other applications
Symptoms: Connection timeouts or download failures
Solutions:
- Check internet connection
- Use HuggingFace mirror if available
- Manually download model and configure path
- Try alternative model
Symptoms: Wrong language detected
Solutions:
- Ensure text files have sufficient content (>100 words)
- Check UTF-8 encoding is correct
- Mixed-language texts may show "en" as primary if English dominates
Symptoms: Analysis takes very long
Solutions:
- Enable GPU acceleration in config
- Use smaller/faster model
- Reduce number of input files
- Increase chunk_size for batch processing
For potentially better performance with quantized models:
- Install llama-cpp-python:
  uv pip install llama-cpp-python
- Download a GGUF model (e.g., from HuggingFace)
- Configure:
  model:
    type: "llama-cpp"
    path: "/path/to/model.gguf"
#!/bin/bash
for dir in texts/*/; do
profile_name=$(basename "$dir")
uv run analyze.py --input "$dir" --output "profiles/${profile_name}.json" --profile-type "$profile_name"
done

Create multiple config files for different use cases:
# Quick analysis (lower quality, faster)
uv run analyze.py --input texts/blog --output profiles/quick.json --config config-fast.yaml
# Detailed analysis (higher quality, slower)
uv run analyze.py --input texts/blog --output profiles/detailed.json --config config-detailed.yaml

The project includes a comprehensive automated test suite with 49 tests covering profile validation, analysis functions, and regression testing.
# Run all tests
uv run pytest tests/
# Run with coverage
uv run pytest tests/ --cov=. --cov-report=term-missing
# Run specific test categories
uv run pytest tests/ -m profile # Profile validation
uv run pytest tests/ -m analysis # Analysis functions
uv run pytest tests/ -m regression # Regression tests

Privacy-First Design: All tests use synthetic data only. Your personal profiles and texts remain private (gitignored).
See tests/README.md for complete test suite documentation.
Core dependencies:
- transformers: HuggingFace model support
- torch: PyTorch for model inference
- pyyaml: Configuration file parsing
- langdetect: Language detection
- tqdm: Progress bars
- pypdf: PDF text extraction
- python-docx: Microsoft Word (.docx) text extraction
- odfpy: LibreOffice Writer (.odt) text extraction
Optional:
- llama-cpp-python: GGUF model support
Development:
- pytest: Testing framework
- pytest-cov: Coverage reporting
- black: Code formatting
- ruff: Linting
- TextProcessor: Text analysis and metric calculation
- LLMAnalyzer: LLM integration and style analysis
- WritingStyleAnalyzer: Main orchestrator
- Configuration: YAML-based configuration management
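As an illustration of the metric calculation attributed to TextProcessor, the sketch below derives the fields shown under "metrics" in the example profile. It is not the project's code. It assumes lexical diversity is a plain type-token ratio, that paragraph length is counted in sentences, and that sentences end at ., ! or ?. `basic_metrics` is a hypothetical name.

```python
import re

WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def basic_metrics(text: str) -> dict:
    """Compute simple length and diversity metrics for a text."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [w.lower() for w in WORD_RE.findall(text)]
    total_words = len(words)
    total_sentences = len(sentences)
    return {
        "total_words": total_words,
        "total_sentences": total_sentences,
        # words per sentence
        "avg_sentence_length": total_words / total_sentences if total_sentences else 0.0,
        # sentences per paragraph (assumed unit)
        "avg_paragraph_length": total_sentences / len(paragraphs) if paragraphs else 0.0,
        # unique words divided by all words (type-token ratio)
        "lexical_diversity": len(set(words)) / total_words if total_words else 0.0,
    }


sample = "Das ist ein Test. Das ist gut.\n\nEin Test ist gut."
print(basic_metrics(sample))
```

A raw type-token ratio falls as texts get longer, so values are only comparable between samples of similar size.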
| Model | Size | Quality | Speed | Notes |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 3B | ⭐⭐⭐⭐ | ⚡⚡⚡ | Best balance, default |
| Llama-3.2-3B | 3B | ⭐⭐⭐ | ⚡⚡⚡ | Good alternative |
| Mistral-7B-Instruct | 7B | ⭐⭐⭐⭐⭐ | ⚡⚡ | Best quality, slower |
Qwen2.5 series has excellent German support and is recommended for German-heavy content.
This project is provided as-is for personal and educational use.
Contributions welcome! Areas for improvement:
- Additional language support
- Profile comparison tools
- Statistical validation metrics
- Web interface
For issues and questions:
- Check this README and config.yaml comments
- Review logs in analyzer.log
- Check HuggingFace model documentation
- Verify Python and dependency versions
This repository includes example profile documentation in the user-profiles/ directory (gitignored for privacy). This shows how to organize your personal writing style profiles and documentation.
Project-level (this directory):
- Tool documentation (README, QUICKSTART, CLAUDE.md)
- Example texts for testing
- Analyzer source code
User-level (user-profiles/ - gitignored):
- Your analyzed writing style profiles
- Profile usage guides
- Comparison and test documentation
The user-profiles/ directory is where you'll store your generated profiles and documentation. This directory is gitignored to protect your privacy.
To create your first profile:
- Collect text samples (10-20 files, 5000+ words total):
  mkdir -p texts/my-writing
  # Copy your .txt, .md, .pdf, .docx files here
- Run the analyzer:
  uv run analyze.py --input texts/my-writing --output profiles/my-style.json --profile-type my-style
- Review the output:
  - profiles/my-style.json - Complete analysis data
  - profiles/my-style.md - AI-friendly profile for guidance
- Use with AI assistants:
  - Upload the .md file to your AI platform
  - Reference it when asking for text generation
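Before running the analyzer, it can be useful to check whether a sample folder meets the suggested size (10-20 files, 5000+ words). The following standalone sketch uses only the standard library and counts plain-text files only, since PDF and DOCX files need the extraction libraries. The recursive search and the `summarize_samples` name are assumptions.

```python
from pathlib import Path


def summarize_samples(input_dir: str, extensions=(".txt", ".md")) -> dict:
    """Count plain-text sample files and their total words."""
    files = sorted(
        p for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    total_words = 0
    for path in files:
        # errors="replace" keeps the count going if a file is not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace")
        total_words += len(text.split())
    return {
        "files": len(files),
        "total_words": total_words,
        "enough_files": len(files) >= 10,
        "enough_words": total_words >= 5000,
    }
```

For example, `summarize_samples("texts/my-writing")` returns the file and word counts along with two flags showing whether the suggested minimums are met.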
Profile Organization:
We recommend creating a user-profiles/ directory structure:
user-profiles/
├── profiles/ # Your generated profiles
│ ├── academic.json
│ ├── academic.md
│ ├── blog.json
│ └── blog.md
└── README.md # Your personal usage notes
Example profiles are available in the repository's issue tracker for reference, but your profiles will be unique to your writing style.
- First stable release: Production-ready writing style analyzer
- MIT License: Open source and freely usable
- GitHub Actions CI/CD: Automated testing, linting, formatting, and releases
- Uses reusable workflows for consistent builds
- Automated version extraction and release creation
- Comprehensive test suite (49 tests, 95% coverage, <1s runtime)
- Repository publishing: Public GitHub repository with comprehensive documentation
- 10 relevant topics/tags for discoverability
- Automated wheel and source distribution builds
- Professional release notes and changelog
- Code quality improvements:
- Fixed all linting issues (modern Python type hints)
- Consistent code formatting with black
- Clean codebase ready for contributions
- Privacy-first design: All personal data gitignored by default
- Documentation: Complete setup guides, QUICKSTART, and developer documentation
- Hybrid Pattern Discovery System: Major upgrade to profile generation
- Combines authoritative patterns from Duden/academic style guides with LLM-discovered patterns
- Generates profiles with 3-4x more linguistic patterns than basic analysis
- New transition categories: conditional, clarifying, concessive
- Improved passive voice accuracy and argumentation detection
- Dual-Format Output: Profiles now generated in both JSON and Markdown
- JSON for analysis and metrics
- Markdown for AI assistant integration
- Comprehensive Documentation: Profile usage guides and validation framework
- Profile creation guide
- AI integration best practices
- Validation test framework
- Documentation restructuring: Separated project and user-specific documentation
- Created user-profiles/ directory for personal profiles (gitignored)
- Moved profile-specific documentation to user-profiles/
- Updated .gitignore to protect user privacy
- Profile management improvements: Simplified profile organization
- Clearer naming conventions for generated profiles
- Profile archiving and versioning support
- Validation framework for testing profile quality
- Empirical validation: Testing framework confirms profile quality and distinctiveness
- Added LaTeX (.tex) file support with pylatexenc
- Replaced pypdf with pdfplumber for better PDF text extraction
- Added comprehensive linguistic analysis:
- Voice analysis (passive vs active)
- Transition word analysis (5 categories)
- Sentence complexity metrics
- Rhetorical device detection
- Improved content filtering (code/formula/reference detection)
- Enhanced phrase extraction with stopword filtering
- Robust JSON parsing with retry logic
- Created pre-analyzed academic profiles with detailed documentation
- Improved German language support
- Added profile merging capabilities
- Enhanced error handling
- Initial release
- German and English support
- HuggingFace Transformers integration
- Basic profile generation
- JSON output format