Text Redaction Engine

Project Description

This project implements a text redaction pipeline that automates the process of identifying and censoring sensitive information from text documents. It utilizes a command-line interface to enable users to specify the types of sensitive data they wish to redact, making it flexible for various use cases, such as processing police reports, court transcripts, and hospital records.

We are using the full block character █ (U+2588) to redact the words based on our parameters. And we are using SPACY's doc object entities to span the word length, based on that we are not redacting the whitespaces, while for concept redaction, nltk redacts the spaces between words as well. The white space redaction is required to achieve complete abstraction, not letting even the context of redacted words be leaked out.

Concept is recognized in the files as a simple SPACY word match, case in-sensitive matching, followed by NLTK's advanced synonym search, which redacts any synonym search present in the file to hide the similar words as well.

This project uses Python3 to implement functions that:

Run throught the current directory, scans each file with the input extension
Perform redaction based on the input arguments using spaCy, regular expression, and NLTK wordnet synonym matching.
Followed by storing the redacted files with name--> {existing_nameextension}censored.
Saves and prints out the statistics of the program to the given format in the input argument.

Libraries Used:

spacy: For natural language processing and named entity recognition
nltk: For advanced concept redaction and synonym detection
argparse: For command-line argument parsing
datetime: For timestamp generation in reports
re: For pattern matching
typing: For type hints
dataclasses: For data structure organization

How to Install

Run the following to install the required dependencies:

Download python virtual environment

pipenv install

Install spacy and nltk libraries:

pip install nltk
pip install spacy

Download spacy's large model for refined redaction:

python -m spacy download en_core_web_lg

Import NLTK and download the wordnet and omw data

import nltk
nltk.download("omw-1.4")
nltk.download("wordnet")

How to run

pipenv run python redactor.py --input {file_extension} --names --dates --phones --address --concept {concept word: str} --output '{file_location}' --stats {stream_type}

Example:

python redactor.py --input "*.txt" --names --dates --phones --address --concept "secret" --output ./redacted --stats stderr

Stats Output format

--- REDACTION STATS ---
Created on: 2024-10-31 12:00:00

--- Summary ---
NAMES:
Total unique items: 10
Items:
 - John Doe
 - Jane Smith
...

--- File Details ---
File: example.txt
Total characters: 1000
Total redactions: 15
Redacted items:
 - [NAME] John Doe (indices: 10-18)
 - [DATE] 2024-10-31 (indices: 20-30)

Command Line Arguments

--input: Input file pattern (e.g., "*.txt")
--names: Enable name redaction
--dates: Enable date redaction
--phones: Enable phone number redaction
--address: Enable address redaction
--concept: Concepts to redact (can be specified multiple times)
--output: Output directory for redacted files
--stats: Output file for redaction statistics

Code Components

Classes

`RedactionItem`

Represents a single redaction item with text, start and end indices, and category.

Attributes:
- text (str): Redacted text content.
- start_idx (int): Start index of redaction.
- end_idx (int): End index of redaction.
- category (str): Category of redaction (e.g., "NAME").

`FileStats`

Stores file statistics.

Attributes:
- filename (str): File name.
- size (int): File size in bytes.
- num_redactions (int): Number of redactions made.

Functions

`redact_text(text: str, redaction_items: List[RedactionItem]) -> str`

Redacts specified items from the given text.

Parameters:
- text (str): Original text.
- redaction_items (List[RedactionItem]): Items to redact.
Returns:
- str: Redacted text.

`calculate_file_stats(filepath: str) -> FileStats`

Calculates statistics for the given file.

Parameters:
- filepath (str): Path to the file.
Returns:
- FileStats: Statistics of the file.

`log_redactions(file_stats: FileStats)`

Logs redaction statistics.

Parameters:
- file_stats (FileStats): File statistics to log.

Tests Overview

`test_names.py`

`test_redact_names()`

Tests redaction of personal names from a text sample.

Input: Text containing a personal name.
Expected Outcome: Names are redacted, matching the expected redaction format.

`test_dates.py`

`test_redact_dates()`

Tests redaction of dates from a text sample.

Input: Text containing a date.
Expected Outcome: Date is redacted, matching the expected redaction format. Additionally, the original date appears in stats.dates.

`test_address.py`

`test_redact_addresses()`

Tests redaction of address-related information from a text sample.

Input: Text containing an address.
Expected Outcome: Address is redacted, with redaction recorded in stats.addresses.

`test_concepts.py`

`test_redact_concepts()`

Tests redaction of specified concept keywords from a text sample.

Input: Text containing specified concepts.
Expected Outcome: Concepts are redacted, with redactions recorded in stats.concepts.

`test_phones.py`

`test_redact_phones()`

Tests redaction of phone numbers from a text sample.

Input: Text containing phone numbers.
Expected Outcome: Phone numbers are redacted, with redaction recorded in stats.phones.

Running the Tests

To run the tests, use the following command:

pipenv run python -m pytest

Bugs and Assumptions

Requires SpaCy's large English model for optimal performance, getting better performance on large spacy model than smaller ones.
Pattern-based redaction may have false positives, since email name, address, phones could be written in odd format.
Some 10 digit numbers such as ID etc. are being redacted by the program, mis judging it as a phone number.
Custom concept redaction depends on WordNet coverage
All input files are assumed to be UTF-8 encoded
Memory usage scales with input file size

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
tests		tests
COLLABORATORS		COLLABORATORS
LICENSE		LICENSE
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
redactor.py		redactor.py
setup.cfg		setup.cfg
setup.py		setup.py
stderr		stderr
test.txt		test.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Text Redaction Engine

Project Description

How to Install

How to run

Example:

Stats Output format

Command Line Arguments

Code Components

Classes

`RedactionItem`

`FileStats`

Functions

`redact_text(text: str, redaction_items: List[RedactionItem]) -> str`

`calculate_file_stats(filepath: str) -> FileStats`

`log_redactions(file_stats: FileStats)`

Tests Overview

`test_names.py`

`test_redact_names()`

`test_dates.py`

`test_redact_dates()`

`test_address.py`

`test_redact_addresses()`

`test_concepts.py`

`test_redact_concepts()`

`test_phones.py`

`test_redact_phones()`

Running the Tests

Bugs and Assumptions

About

Uh oh!

Releases

Packages

Languages

License

divyam2207/text-redaction-engine

Folders and files

Latest commit

History

Repository files navigation

Text Redaction Engine

Project Description

How to Install

How to run

Example:

Stats Output format

Command Line Arguments

Code Components

Classes

RedactionItem

FileStats

Functions

redact_text(text: str, redaction_items: List[RedactionItem]) -> str

calculate_file_stats(filepath: str) -> FileStats

log_redactions(file_stats: FileStats)

Tests Overview

test_names.py

test_redact_names()

test_dates.py

test_redact_dates()

test_address.py

test_redact_addresses()

test_concepts.py

test_redact_concepts()

test_phones.py

test_redact_phones()

Running the Tests

Bugs and Assumptions

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`RedactionItem`

`FileStats`

`redact_text(text: str, redaction_items: List[RedactionItem]) -> str`

`calculate_file_stats(filepath: str) -> FileStats`

`log_redactions(file_stats: FileStats)`

`test_names.py`

`test_redact_names()`

`test_dates.py`

`test_redact_dates()`

`test_address.py`

`test_redact_addresses()`

`test_concepts.py`

`test_redact_concepts()`

`test_phones.py`

`test_redact_phones()`

Packages