This project implements a text redaction pipeline that automates the process of identifying and censoring sensitive information from text documents. It utilizes a command-line interface to enable users to specify the types of sensitive data they wish to redact, making it flexible for various use cases, such as processing police reports, court transcripts, and hospital records.
Redacted words are replaced with the full block character █ (U+2588). For entity-based redaction, spaCy's Doc entity spans determine which characters are replaced, so whitespace is not redacted. For concept redaction, the NLTK-based step redacts the spaces between words as well. Whitespace redaction is required to achieve complete abstraction, so that not even the context of the redacted words leaks out.
Concepts are recognized first by a simple, case-insensitive spaCy word match, followed by a synonym search using NLTK's WordNet. Any synonym of the concept found in the file is redacted so that similar words are hidden as well.
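The concept step described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function names `wordnet_synonyms` and `redact_terms` are hypothetical, and a plain regular expression stands in for the spaCy word match. The WordNet lookup needs nltk and its wordnet data, so it is imported lazily and the example call passes the terms in directly.

```python
import re
from typing import Iterable, Set

BLOCK = "\u2588"  # full block character used for redaction


def wordnet_synonyms(concept: str) -> Set[str]:
    """Collect WordNet lemma names for a concept (needs nltk plus its wordnet data)."""
    from nltk.corpus import wordnet  # lazy import so the rest runs without nltk

    names = {concept.lower()}
    for synset in wordnet.synsets(concept):
        for lemma in synset.lemmas():
            # WordNet joins multi-word lemmas with underscores.
            names.add(lemma.name().replace("_", " ").lower())
    return names


def redact_terms(text: str, terms: Iterable[str]) -> str:
    """Replace every case-insensitive whole-word match of any term with blocks."""
    # Longest terms first so multi-word phrases win over their sub-words.
    ordered = sorted({t for t in terms if t}, key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Spaces inside a matched phrase are blocked out too.
    return pattern.sub(lambda m: BLOCK * len(m.group(0)), text)


# Terms are given explicitly here; in practice they could come from wordnet_synonyms("secret").
print(redact_terms("The Secret plan is top secret.", {"secret", "top secret"}))
```

Sorting the terms by length keeps a phrase such as "top secret" from being split into a visible "top" and a redacted "secret".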
This project uses Python 3 to implement functions that:
- Walk through the current directory and scan each file matching the input extension.
- Perform redaction based on the input arguments using spaCy, regular expressions, and NLTK WordNet synonym matching.
- Store the redacted files under the name {existing_nameextension}censored.
- Save and print the program's statistics to the destination given in the input arguments.
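The first and third steps above might look roughly like this. It is a minimal sketch under stated assumptions: `process_files` is a hypothetical name, the redaction step is passed in as a callable, and the `.censored` suffix is only a guess at the output naming scheme, so it is left as a parameter.

```python
import glob
import os
from typing import Callable, List


def process_files(pattern: str, output_dir: str,
                  redact: Callable[[str], str],
                  suffix: str = ".censored") -> List[str]:
    """Redact every file matching `pattern` and write the results into `output_dir`."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        # Input files are assumed to be UTF-8 encoded.
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        # Output name: original file name plus the (assumed) suffix.
        target = os.path.join(output_dir, os.path.basename(path) + suffix)
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(redact(text))
        written.append(target)
    return written
```

A call such as `process_files("*.txt", "./redacted", my_redactor)` would then mirror the command-line invocation, with `my_redactor` standing in for the real redaction function.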
Libraries Used:
- spacy: For natural language processing and named entity recognition
- nltk: For advanced concept redaction and synonym detection
- argparse: For command-line argument parsing
- datetime: For timestamp generation in reports
- re: For pattern matching
- typing: For type hints
- dataclasses: For data structure organization
Run the following to install the required dependencies:
- Set up the Python virtual environment:
pipenv install
- Install the spacy and nltk libraries:
pip install nltk
pip install spacy
- Download spaCy's large model for refined redaction:
python -m spacy download en_core_web_lg
- Import NLTK and download the wordnet and omw data:
import nltk
nltk.download("omw-1.4")
nltk.download("wordnet")
To run the redactor, use the following command:
pipenv run python redactor.py --input {file_extension} --names --dates --phones --address --concept {concept word: str} --output '{file_location}' --stats {stream_type}
For example:
python redactor.py --input "*.txt" --names --dates --phones --address --concept "secret" --output ./redacted --stats stderr
Sample statistics output:
--- REDACTION STATS ---
Created on: 2024-10-31 12:00:00
--- Summary ---
NAMES:
Total unique items: 10
Items:
- John Doe
- Jane Smith
...
--- File Details ---
File: example.txt
Total characters: 1000
Total redactions: 15
Redacted items:
- [NAME] John Doe (indices: 10-18)
- [DATE] 2024-10-31 (indices: 20-30)
Command-line arguments:
- --input: Input file pattern (e.g., "*.txt")
- --names: Enable name redaction
- --dates: Enable date redaction
- --phones: Enable phone number redaction
- --address: Enable address redaction
- --concept: Concepts to redact (can be specified multiple times)
- --output: Output directory for redacted files
- --stats: Destination for redaction statistics (e.g., stderr)
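The flags listed above could be declared with argparse along these lines. This is a sketch rather than the project's actual parser: `build_parser` is a hypothetical name, and which flags are required and what `--stats` defaults to are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Redact sensitive data from text files.")
    parser.add_argument("--input", required=True, help='Input file pattern, e.g. "*.txt"')
    parser.add_argument("--names", action="store_true", help="Enable name redaction")
    parser.add_argument("--dates", action="store_true", help="Enable date redaction")
    parser.add_argument("--phones", action="store_true", help="Enable phone number redaction")
    parser.add_argument("--address", action="store_true", help="Enable address redaction")
    # action="append" lets --concept be given several times.
    parser.add_argument("--concept", action="append", default=[], help="Concept to redact")
    parser.add_argument("--output", required=True, help="Output directory for redacted files")
    parser.add_argument("--stats", default="stderr", help="Destination for redaction statistics")
    return parser


args = build_parser().parse_args(
    ["--input", "*.txt", "--names", "--concept", "secret", "--concept", "prison",
     "--output", "./redacted", "--stats", "stderr"]
)
print(args.names, args.dates, args.concept)
```

Using `store_true` for the category flags means a category is redacted only when its flag is present, which matches the "Enable ..." wording in the list.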
RedactionItem: Represents a single redaction item with text, start and end indices, and category.
- Attributes:
  - text (str): Redacted text content.
  - start_idx (int): Start index of redaction.
  - end_idx (int): End index of redaction.
  - category (str): Category of redaction (e.g., "NAME").
FileStats: Stores file statistics.
- Attributes:
  - filename (str): File name.
  - size (int): File size in bytes.
  - num_redactions (int): Number of redactions made.
Redacts specified items from the given text.
- Parameters:
  - text (str): Original text.
  - redaction_items (List[RedactionItem]): Items to redact.
- Returns:
  - str: Redacted text.
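A function with this signature could be sketched as below. The name `redact_text` is hypothetical, and two things are assumptions: the end index is treated as exclusive (which is consistent with the sample entry "John Doe (indices: 10-18)"), and whitespace inside a span is left visible, as described for entity-based redaction.

```python
from dataclasses import dataclass
from typing import List

BLOCK = "\u2588"  # full block character used for redaction


@dataclass
class RedactionItem:
    text: str
    start_idx: int
    end_idx: int  # assumed exclusive, as in Python slicing
    category: str


def redact_text(text: str, redaction_items: List[RedactionItem]) -> str:
    """Replace the characters covered by each item with blocks, keeping whitespace."""
    chars = list(text)
    for item in redaction_items:
        # Clamp the span so a bad index cannot raise IndexError.
        for i in range(max(item.start_idx, 0), min(item.end_idx, len(chars))):
            if not chars[i].isspace():
                chars[i] = BLOCK
    return "".join(chars)


items = [RedactionItem("John Doe", 5, 13, "NAME")]
print(redact_text("Call John Doe today.", items))
```

Because each character is replaced one-for-one, the output keeps the same length as the input, so indices recorded for other items stay valid even when spans overlap.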
Calculates statistics for the given file.
- Parameters:
  - filepath (str): Path to the file.
- Returns:
  - FileStats: Statistics of the file.
Logs redaction statistics.
- Parameters:
  - file_stats (FileStats): File statistics to log.
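Logging of this kind might be sketched as follows. The names `format_stats` and `log_stats` are hypothetical, the report layout only loosely follows the sample output, and the handling of the destination is an assumption: "stderr" and "stdout" are written to the matching stream, and anything else is treated as a file path.

```python
import sys
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileStats:
    filename: str
    size: int            # file size in bytes
    num_redactions: int  # number of redactions made


def format_stats(file_stats: FileStats, created: datetime) -> str:
    """Render one file's statistics as a small text report."""
    lines = [
        "--- REDACTION STATS ---",
        f"Created on: {created:%Y-%m-%d %H:%M:%S}",
        "--- File Details ---",
        f"File: {file_stats.filename}",
        f"Size in bytes: {file_stats.size}",
        f"Total redactions: {file_stats.num_redactions}",
    ]
    return "\n".join(lines) + "\n"


def log_stats(file_stats: FileStats, destination: str = "stderr") -> None:
    """Write the report to stderr, stdout, or a file (assumed semantics)."""
    report = format_stats(file_stats, datetime.now())
    if destination == "stderr":
        sys.stderr.write(report)
    elif destination == "stdout":
        sys.stdout.write(report)
    else:
        # Any other value is treated as a file path and appended to.
        with open(destination, "a", encoding="utf-8") as handle:
            handle.write(report)


log_stats(FileStats("example.txt", 1000, 15), "stdout")
```

Keeping the formatting separate from the writing makes the report text easy to check in a test without capturing a stream.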
Tests redaction of personal names from a text sample.
- Input: Text containing a personal name.
- Expected Outcome: Names are redacted, matching the expected redaction format.
Tests redaction of dates from a text sample.
- Input: Text containing a date.
- Expected Outcome: Date is redacted, matching the expected redaction format. Additionally, the original date appears in stats.dates.
Tests redaction of address-related information from a text sample.
- Input: Text containing an address.
- Expected Outcome: Address is redacted, with redaction recorded in stats.addresses.
Tests redaction of specified concept keywords from a text sample.
- Input: Text containing specified concepts.
- Expected Outcome: Concepts are redacted, with redactions recorded in stats.concepts.
Tests redaction of phone numbers from a text sample.
- Input: Text containing phone numbers.
- Expected Outcome: Phone numbers are redacted, with redaction recorded in stats.phones.
To run the tests, use the following command:
pipenv run python -m pytest
Known limitations and assumptions:
- Requires spaCy's large English model (en_core_web_lg) for best results; the large model performed better than the smaller ones.
- Pattern-based redaction may produce false positives, since email names, addresses, and phone numbers can be written in unusual formats.
- Some 10-digit numbers, such as ID numbers, are misidentified as phone numbers and redacted.
- Custom concept redaction depends on WordNet coverage.
- All input files are assumed to be UTF-8 encoded.
- Memory usage scales with input file size.
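The phone-number false positive noted above is easy to reproduce with a permissive pattern. The regular expression below is an assumed stand-in, not the project's own pattern; it only illustrates why a bare 10-digit ID is indistinguishable from a phone number at the pattern level.

```python
import re

# Permissive US-style phone pattern (an assumption; the project's pattern may differ).
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)

samples = [
    "Call 405-555-0123 now",  # a real phone number
    "Student ID 1234567890",  # a 10-digit ID that also matches
    "Order 12345678901234",   # too many digits, so no match
]
for sample in samples:
    print(sample, "->", bool(PHONE_RE.search(sample)))
```

The lookarounds `(?<!\d)` and `(?!\d)` stop the pattern from matching inside longer digit runs, but they cannot tell an ID from a phone number when both have exactly ten digits. Telling them apart would need context, such as nearby words like "ID" or "call".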