Parallel Corpora Generator

A command-line tool for generating parallel corpora from JSON files containing sentences with lemmas and tags.

Overview

This tool processes JSON files containing linguistic corpora data and uses a language model to generate parallel sentences. The original sentences and their generated counterparts are saved as parallel corpora.

Features

- Process individual JSON files or all JSON files in a directory
- Use GPU acceleration with vLLM for faster processing (falls back to transformers if vLLM is not available)
- Fall back to CPU if no GPU is available
- Generate parallel sentences using the Qwen-0.5b model
- Save results as JSON files with original and generated sentence pairs
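
The fallback order described above (vLLM first, then transformers; GPU first, then CPU) could be sketched roughly as follows. This is an illustration only, not the tool's actual code: the function names `module_available`, `choose_backend`, and `choose_device` are hypothetical, and the sketch assumes availability is decided purely by whether each package can be imported.

```python
import importlib.util
from typing import Callable


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def choose_backend(is_available: Callable[[str], bool] = module_available) -> str:
    """Prefer vLLM and fall back to transformers (hypothetical helper)."""
    if is_available("vllm"):
        return "vllm"
    if is_available("transformers"):
        return "transformers"
    raise RuntimeError("Neither vllm nor transformers is installed")


def choose_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not module_available("torch"):
        return "cpu"
    import torch  # imported lazily so the check works without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"
```

Passing the availability check in as a parameter keeps the selection logic testable on machines where neither library is installed.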

Installation

Clone this repository:

```
git clone <repository-url>
cd parallel_corpora
```

Run the installation script:
```
python install.py
```
This script will:
- Install all required dependencies
- Try to install vLLM for faster processing
- Fall back to transformers if vLLM installation fails
- Provide information about your system's compatibility

Alternatively, you can install the dependencies manually:
```
pip install -r requirements.txt
```
Note: If you encounter issues with vLLM installation, the tool will automatically fall back to using the transformers library.

Usage

Basic Usage

Process all JSON files in the data directory:

```
python -m src.cli
```

Specify a Single File

Process a specific JSON file:

```
python -m src.cli --file example.json
```

Custom Directories

Specify custom input and output directories:

```
python -m src.cli --data-dir custom_data --output-dir custom_output
```
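
For readers curious how the three flags shown in this section might fit together, here is a minimal `argparse` sketch. It is not taken from `src/cli.py`: the defaults `data` and `output`, the helper names `build_parser` and `resolve_inputs`, and the choice to treat `--file` as a plain path are all assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the Usage section."""
    parser = argparse.ArgumentParser(prog="python -m src.cli")
    parser.add_argument("--file", type=Path, default=None,
                        help="process only this JSON file")
    # Default directory names below are assumptions, not confirmed values.
    parser.add_argument("--data-dir", type=Path, default=Path("data"),
                        help="directory scanned for *.json input files")
    parser.add_argument("--output-dir", type=Path, default=Path("output"),
                        help="directory where result files are written")
    return parser


def resolve_inputs(args: argparse.Namespace) -> List[Path]:
    """Return the single requested file, or every JSON file in the data dir."""
    if args.file is not None:
        return [args.file]
    return sorted(args.data_dir.glob("*.json"))
```

With this layout, omitting `--file` processes the whole data directory, which matches the behavior described under Basic Usage.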

Input Format

The tool expects JSON files containing corpora data with sentences, lemmas, and tags. The parser is currently a placeholder and will be replaced with a full implementation later.

Example expected JSON structure:

```
[
  {
    "sentence": "This is a sample sentence.",
    "lemmas": ["this", "be", "a", "sample", "sentence"],
    "tags": ["DET", "VERB", "DET", "ADJ", "NOUN"]
  },
  ...
]
```
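
Since the real parser is not yet implemented, the following is only a sketch of how a file in this shape could be loaded and sanity-checked. The names `validate_entries` and `load_corpus` are hypothetical, and the rule that `lemmas` and `tags` must have equal length is an assumption inferred from the example above, not a documented requirement.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

REQUIRED_KEYS = {"sentence", "lemmas", "tags"}


def validate_entries(entries: Any) -> List[Dict[str, Any]]:
    """Check that `entries` is a list of objects with the expected keys."""
    if not isinstance(entries, list):
        raise ValueError("top-level JSON value must be a list")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
        # Assumption: one tag per lemma, as in the README example.
        if len(entry["lemmas"]) != len(entry["tags"]):
            raise ValueError(f"entry {index} has mismatched lemmas and tags")
    return entries


def load_corpus(path: Path) -> List[Dict[str, Any]]:
    """Read a corpus file as UTF-8 JSON and validate its structure."""
    with open(path, encoding="utf-8") as handle:
        return validate_entries(json.load(handle))
```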

Output Format

The tool generates output files in the following format:

```
[
  {
    "original": "This is a sample sentence.",
    "generated": "This sentence is a sample."
  },
  ...
]
```
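
As a rough illustration of how this output could be produced, the sketch below pairs original sentences with generated ones and writes them as JSON. The helper names `build_pairs` and `save_pairs` are hypothetical, and the choices of UTF-8 encoding and two-space indentation are assumptions rather than documented behavior.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def build_pairs(originals: Iterable[str], generated: Iterable[str]) -> List[Dict[str, str]]:
    """Zip original and generated sentences into output records."""
    originals, generated = list(originals), list(generated)
    if len(originals) != len(generated):
        raise ValueError("each original sentence needs exactly one generated sentence")
    return [{"original": o, "generated": g} for o, g in zip(originals, generated)]


def save_pairs(pairs: List[Dict[str, str]], output_path: Path) -> None:
    """Write the pairs as a JSON list, creating the output directory if needed."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII sentences readable in the file.
        json.dump(pairs, handle, ensure_ascii=False, indent=2)
```

Checking the lengths before zipping avoids silently dropping sentences when generation fails for part of a batch.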

Troubleshooting

vLLM Installation Issues

If vLLM fails to install, the tool automatically falls back to using transformers.

You can also try installing vLLM manually:

```
pip install ninja packaging "setuptools>=49.4.0"
pip install git+https://github.com/vllm-project/vllm.git
```

On Windows, vLLM might not be fully supported. The transformers fallback should work in all cases.

Requirements

- Python 3.8 or higher
- PyTorch 2.0.0 or higher
- Either vLLM 0.2.0+ or transformers 4.30.0+ (the tool will use vLLM if available, otherwise fall back to transformers)
- CUDA-compatible GPU (optional, for faster processing)
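
To check an environment against these minimums by hand, something like the following sketch may help. It is not part of the repository: `parse_version`, `installed_version`, and `meets_minimum` are hypothetical helpers, and the version comparison is deliberately simplified (it looks only at the leading numeric components and ignores pre-release ordering).

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def parse_version(text: str) -> tuple:
    """Turn a string such as '2.1.0+cu118' into (2, 1, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so '2.0' compares equal to '2.0.0'
    return tuple(parts)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at `minimum` or newer."""
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    print("Python >= 3.8:", sys.version_info >= (3, 8))
    print("torch >= 2.0.0:", meets_minimum("torch", "2.0.0"))
    print("vllm >= 0.2.0:", meets_minimum("vllm", "0.2.0"))
    print("transformers >= 4.30.0:", meets_minimum("transformers", "4.30.0"))
```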
