A modular, educational Text Processing Engine written in pure C99. Designed to teach core software engineering concepts: modularity, state machines, and API design.
The engine is built as a pipeline:

```
Raw Text -> [Scanner] -> Chars -> [Tokenizer] -> Tokens -> [App]
```
- Scanner: reads raw text sources safely.
- Tokenizer: groups characters into words, numbers, or punctuation (a sketch of the token types follows this list).
- Stats: computes metrics (word count, etc.).
- Normalizer: standardizes text (e.g., lowercasing).
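The public token types are not reproduced in this README. As a rough sketch, they might look like the following; only `Token`, `TOKEN_WORD`, and `TOKEN_END` are confirmed by the usage example further down, and the other names are assumptions based on the module description above:

```c
/* Sketch of the public token types. Token, TOKEN_WORD, and TOKEN_END
 * match the usage example below; TOKEN_NUMBER and TOKEN_PUNCT are
 * assumed from the Tokenizer description above. */
typedef enum {
    TOKEN_WORD,    /* run of letters */
    TOKEN_NUMBER,  /* run of digits */
    TOKEN_PUNCT,   /* punctuation character */
    TOKEN_END      /* end of input */
} TokenType;

typedef struct {
    TokenType   type;
    const char *start;   /* points into the source text */
    int         length;  /* tokens are not NUL-terminated */
} Token;
```

Returning a (start, length) pair instead of a copied string keeps the tokenizer allocation-free, which is why the example below prints tokens with `%.*s`.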
We use a Makefile for automation:
```sh
# Build the library and demo
make

# Run the tests
make test

# Clean up
make clean
```
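The Makefile itself is not reproduced here; a minimal sketch supporting the three targets above could look like this (source lists, flags, and binary names are all assumptions):

```make
# Minimal Makefile sketch; paths, flags, and output names are assumptions.
# (Recipe lines must be indented with tabs.)
CC     = cc
CFLAGS = -std=c99 -Wall -Wextra -Iinclude
SRC    = $(wildcard src/*.c)
OBJ    = $(SRC:.c=.o)

# Default target: build the demo program.
all: demo

demo: $(OBJ) examples/demo.o
	$(CC) $(CFLAGS) -o $@ $^

# Build and run the unit tests.
test: $(OBJ) tests/test_main.o
	$(CC) $(CFLAGS) -o run_tests $^
	./run_tests

clean:
	rm -f demo run_tests $(OBJ) examples/demo.o tests/test_main.o

.PHONY: all test clean
```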
Minimal usage example:

```c
#include <stdio.h>
#include "tokenizer.h"

// ... inside main ...
Scanner scanner;
scanner_init(&scanner, "Hello World 123");

Tokenizer tokenizer;
tokenizer_init(&tokenizer, &scanner);

Token token;
while ((token = tokenizer_next(&tokenizer)).type != TOKEN_END) {
    if (token.type == TOKEN_WORD) {
        printf("Word found: %.*s\n", token.length, token.start);
    }
}
```
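The tokenizer is also where the state-machine concept from the introduction shows up. The engine's real implementation lives in `src/`; the standalone sketch below (which works on a plain string rather than the engine's `Scanner`, with the types redefined so it compiles on its own) illustrates the idea: each branch is a state that keeps consuming characters until the current token ends.

```c
/* Standalone state-machine tokenizer sketch. Types are redefined here
 * so the snippet compiles on its own; the engine's real tokenizer_next
 * presumably pulls characters through its Scanner instead. */
#include <ctype.h>
#include <stdio.h>

typedef enum { TOKEN_WORD, TOKEN_NUMBER, TOKEN_PUNCT, TOKEN_END } TokenType;

typedef struct {
    TokenType   type;
    const char *start;
    int         length;
} Token;

static Token next_token(const char **p) {
    while (isspace((unsigned char)**p))       /* skip whitespace between tokens */
        (*p)++;

    Token t = { TOKEN_END, *p, 0 };
    if (**p == '\0')                          /* state: end of input */
        return t;

    if (isalpha((unsigned char)**p)) {        /* state: inside a word */
        t.type = TOKEN_WORD;
        while (isalpha((unsigned char)**p)) { (*p)++; t.length++; }
    } else if (isdigit((unsigned char)**p)) { /* state: inside a number */
        t.type = TOKEN_NUMBER;
        while (isdigit((unsigned char)**p)) { (*p)++; t.length++; }
    } else {                                  /* state: punctuation */
        t.type = TOKEN_PUNCT;
        (*p)++;
        t.length = 1;
    }
    return t;
}

int main(void) {
    const char *input = "Hello World 123";
    Token t;
    while ((t = next_token(&input)).type != TOKEN_END)
        printf("type=%d text=\"%.*s\"\n", (int)t.type, t.length, t.start);
    return 0;
}
```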
Repository layout:

- include/: Public API (header files).
- src/: Implementation (source code).
- examples/: Demo programs.
- tests/: Unit tests.
- docs/: Educational step-by-step guides, covering:
- Architecture
- Module Definitions
- Header Design
- Implementation
- Build System
- Example Program
- Testing
- Git Best Practices
Possible future extensions:

- UTF-8 Support
- Python Bindings (ctypes)
- Streaming file input (not just strings)