A command-line tool for generating parallel corpora from JSON files containing sentences with lemmas and tags.
This tool processes JSON files containing linguistic corpora data and uses a language model to generate parallel sentences. The original sentences and their generated counterparts are saved as parallel corpora.
- Process individual JSON files or all JSON files in a directory
- Utilize GPU acceleration with vLLM for faster processing (falls back to transformers if vLLM is not available)
- Fall back to CPU if GPU is not available
- Generate parallel sentences using the Qwen-0.5b model
- Save results as JSON files with original and generated sentence pairs
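The features above amount to a small read, generate, write loop. The following is a minimal sketch of that loop, not the tool's actual implementation: `build_parallel_corpus` and the `generate` callable are hypothetical names, and reusing the input file name for the output file is an assumption.

```python
import json
from pathlib import Path
from typing import Callable, List


def build_parallel_corpus(data_dir: str, output_dir: str,
                          generate: Callable[[str], str]) -> List[Path]:
    """Write one file of original/generated pairs per input JSON file."""
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(data_dir).glob("*.json")):
        records = json.loads(src.read_text(encoding="utf-8"))
        # `generate` stands in for the language model call (hypothetical).
        pairs = [
            {"original": r["sentence"], "generated": generate(r["sentence"])}
            for r in records
        ]
        dst = out_root / src.name  # assumption: output keeps the input file name
        dst.write_text(json.dumps(pairs, ensure_ascii=False, indent=2),
                       encoding="utf-8")
        written.append(dst)
    return written
```

Passing the generator as a callable keeps the loop independent of whether vLLM or transformers produces the text.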
Clone this repository:
git clone <repository-url>
cd parallel_corpora
Run the installation script:
python install.py
This script will:
- Install all required dependencies
- Try to install vLLM for faster processing
- Fall back to transformers if vLLM installation fails
- Provide information about your system's compatibility
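The install steps above could be sketched roughly as follows. This is an illustration and not the contents of `install.py`: the function names are hypothetical, and it assumes `requirements.txt` already covers transformers, so a failed vLLM install needs no further action.

```python
import subprocess
import sys
from typing import Callable, List


def pip_install(args: List[str]) -> bool:
    """Run pip with the current interpreter; return True on success."""
    return subprocess.call([sys.executable, "-m", "pip", "install", *args]) == 0


def install(run: Callable[[List[str]], bool] = pip_install) -> str:
    """Install dependencies and return the name of the backend to use."""
    if not run(["-r", "requirements.txt"]):
        raise RuntimeError("could not install the base requirements")
    # vLLM is optional: a failed install is not fatal.
    if run(["vllm"]):
        return "vllm"
    # Assumption: transformers was installed from requirements.txt.
    return "transformers"
```

Calling `install()` with no arguments runs pip for real; passing a stub as `run` allows the fallback logic to be checked without touching the environment.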
Alternatively, you can manually install the dependencies:
pip install -r requirements.txt
Note: If you encounter issues with vLLM installation, the tool will automatically fall back to using the transformers library.
Process all JSON files in the data directory:
python -m src.cli
Process a specific JSON file:
python -m src.cli --file example.json
Specify custom input and output directories:
python -m src.cli --data-dir custom_data --output-dir custom_output
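For orientation, the three options shown above could be declared with `argparse` along these lines. This is a sketch rather than the code in `src/cli.py`: the default directory names `data` and `output` are assumptions, as are the help strings.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options used in the examples above."""
    parser = argparse.ArgumentParser(
        prog="python -m src.cli",
        description="Generate parallel corpora from JSON files.",
    )
    parser.add_argument("--file", default=None,
                        help="process only this JSON file")
    # Default directory names are assumptions made for this sketch.
    parser.add_argument("--data-dir", default="data",
                        help="directory containing input JSON files")
    parser.add_argument("--output-dir", default="output",
                        help="directory for the generated corpora")
    return parser
```

With this layout, `build_parser().parse_args(["--file", "example.json"])` yields a namespace whose `file` attribute is `"example.json"` and whose `data_dir` keeps its default.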
The tool expects JSON files containing corpus data: sentences together with their lemmas and tags. The parser is currently a placeholder and will be replaced with a full implementation later.
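Until the real parser exists, a record check along the following lines may be useful. It is a sketch under assumptions: `validate_record` is a hypothetical helper, the field names `sentence`, `lemmas`, and `tags` follow the example structure in this section, and the rule that `lemmas` and `tags` must have equal length is inferred from that example rather than stated by the tool.

```python
from typing import Any, Dict


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if a record is missing a field or is malformed."""
    for key in ("sentence", "lemmas", "tags"):
        if key not in record:
            raise ValueError(f"missing field: {key}")
    if not isinstance(record["sentence"], str):
        raise ValueError("sentence must be a string")
    for key in ("lemmas", "tags"):
        value = record[key]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
    # Assumption: one tag per lemma, as in the example record.
    if len(record["lemmas"]) != len(record["tags"]):
        raise ValueError("lemmas and tags must have the same length")
```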
Example expected JSON structure:
[
{
"sentence": "This is a sample sentence.",
"lemmas": ["this", "be", "a", "sample", "sentence"],
"tags": ["DET", "VERB", "DET", "ADJ", "NOUN"]
},
...
]
The tool generates output files in the following format:
[
{
"original": "This is a sample sentence.",
"generated": "This sentence is a sample."
},
...
]
If you encounter issues with vLLM installation:
- The tool will automatically fall back to using transformers
- You can try installing vLLM manually:
pip install ninja packaging "setuptools>=49.4.0"
pip install git+https://github.com/vllm-project/vllm.git
- vLLM might not be fully supported on Windows. The transformers fallback should work in all cases.
Requirements:
- Python 3.8 or higher
- PyTorch 2.0.0 or higher
- Either vLLM 0.2.0+ or transformers 4.30.0+ (the tool will use vLLM if available, otherwise fall back to transformers)
- CUDA-compatible GPU (optional, for faster processing)
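The backend and device fallbacks described in these requirements can be expressed as two small decisions. The sketch below is an assumption about how such a check might look, not the tool's own code; all three function names are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_backend(has_vllm: bool, has_transformers: bool) -> str:
    """Prefer vLLM, otherwise fall back to transformers."""
    if has_vllm:
        return "vllm"
    if has_transformers:
        return "transformers"
    raise RuntimeError("neither vLLM nor transformers is installed")


def choose_device(cuda_available: bool) -> str:
    """Use the GPU when one is available, otherwise the CPU."""
    return "cuda" if cuda_available else "cpu"
```

In practice the inputs would come from `module_available("vllm")`, `module_available("transformers")`, and `torch.cuda.is_available()`. Keeping the decisions as pure functions makes the fallback order easy to test on machines without a GPU.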