XvKuoMing/lemmas2corpora

Parallel Corpora Generator

A command-line tool for generating parallel corpora from JSON files containing sentences with lemmas and tags.

Overview

This tool processes JSON files containing linguistic corpora data and uses a language model to generate parallel sentences. The original sentences and their generated counterparts are saved as parallel corpora.

Features

  • Process individual JSON files or all JSON files in a directory
  • Utilize GPU acceleration with vLLM for faster processing (falls back to transformers if vLLM is not available)
  • Fall back to CPU if GPU is not available
  • Generate parallel sentences using the Qwen-0.5b model
  • Save results as JSON files with original and generated sentence pairs
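
The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names `pick_backend` and `pick_device` are hypothetical, and the sketch only checks whether the packages are installed rather than loading a model.

```python
import importlib.util


def pick_backend() -> str:
    """Return "vllm" when the vllm package is installed, else "transformers"."""
    # find_spec only looks the package up; it does not import it.
    if importlib.util.find_spec("vllm") is not None:
        return "vllm"
    return "transformers"


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # Assumption for this sketch: without PyTorch we simply report CPU.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


backend = pick_backend()
device = pick_device()
print(f"backend={backend} device={device}")
```

Checking availability up front keeps the choice of backend in one place, so the rest of the pipeline does not need its own try/except around imports.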

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd parallel_corpora
    
  2. Run the installation script:

    python install.py
    

    This script will:

    • Install all required dependencies
    • Try to install vLLM for faster processing
    • Fall back to transformers if vLLM installation fails
    • Provide information about your system's compatibility

    Alternatively, you can manually install the dependencies:

    pip install -r requirements.txt
    

    Note: If you encounter issues with vLLM installation, the tool will automatically fall back to using the transformers library.

Usage

Basic Usage

Process all JSON files in the data directory:

python -m src.cli

Specify a Single File

Process a specific JSON file:

python -m src.cli --file example.json

Custom Directories

Specify custom input and output directories:

python -m src.cli --data-dir custom_data --output-dir custom_output
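
For orientation, the three invocations above could map onto an argument parser along these lines. This is a sketch under assumptions: only the flag names `--file`, `--data-dir`, and `--output-dir` come from this README; the default directory names `data` and `output` and the helper names `build_parser` and `select_inputs` are hypothetical.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="python -m src.cli")
    parser.add_argument("--file", type=Path, default=None,
                        help="process a single JSON file")
    # Assumed defaults; the real tool may use different directory names.
    parser.add_argument("--data-dir", type=Path, default=Path("data"),
                        help="directory containing input JSON files")
    parser.add_argument("--output-dir", type=Path, default=Path("output"),
                        help="directory for the generated corpora")
    return parser


def select_inputs(args: argparse.Namespace) -> List[Path]:
    """Return the single requested file, or every *.json file in the data directory."""
    if args.file is not None:
        return [args.file]
    return sorted(args.data_dir.glob("*.json"))


args = build_parser().parse_args(["--file", "example.json"])
print(select_inputs(args))
```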

Input Format

The tool expects JSON files containing corpora data with sentences, lemmas, and tags. The parser is currently a placeholder and will be replaced with a full implementation later.

Example expected JSON structure:

[
  {
    "sentence": "This is a sample sentence.",
    "lemmas": ["this", "be", "a", "sample", "sentence"],
    "tags": ["DET", "VERB", "DET", "ADJ", "NOUN"]
  },
  ...
]
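
Because the parser is still a placeholder, it may help to check input files against the structure above before running the tool. The following is a minimal sketch, not part of the repository: `validate_records` is a hypothetical helper, and the rule that `lemmas` and `tags` must have the same length is an assumption inferred from the example.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("sentence", "lemmas", "tags")


def validate_records(records: Any) -> List[Dict[str, Any]]:
    """Check that the data is a list of records with the expected keys."""
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not an object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
        # Assumption: one tag per lemma, as in the example above.
        if len(record["lemmas"]) != len(record["tags"]):
            raise ValueError(f"record {index}: lemmas and tags differ in length")
    return records


sample = json.loads("""
[
  {
    "sentence": "This is a sample sentence.",
    "lemmas": ["this", "be", "a", "sample", "sentence"],
    "tags": ["DET", "VERB", "DET", "ADJ", "NOUN"]
  }
]
""")
print(len(validate_records(sample)), "record(s) look valid")
```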

Output Format

The tool generates output files in the following format:

[
  {
    "original": "This is a sample sentence.",
    "generated": "This sentence is a sample."
  },
  ...
]
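
Writing this format takes only a few lines. The sketch below is illustrative only: `build_pairs` and `save_pairs` are hypothetical names, and the choice of UTF-8 with two-space indentation is an assumption rather than something the tool documents.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_pairs(originals: List[str], generated: List[str]) -> List[Dict[str, str]]:
    """Pair each original sentence with its generated counterpart."""
    if len(originals) != len(generated):
        raise ValueError("originals and generated must have the same length")
    return [{"original": o, "generated": g} for o, g in zip(originals, generated)]


def save_pairs(pairs: List[Dict[str, str]], path: Path) -> None:
    """Write the pairs to a JSON file, creating the output directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII sentences readable in the file.
        json.dump(pairs, handle, ensure_ascii=False, indent=2)


pairs = build_pairs(
    ["This is a sample sentence."],
    ["This sentence is a sample."],
)
print(json.dumps(pairs, indent=2))
```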

Troubleshooting

vLLM Installation Issues

If you encounter issues with vLLM installation:

  1. The tool will automatically fall back to using transformers
  2. You can try installing vLLM manually:
    pip install ninja packaging "setuptools>=49.4.0"
    pip install git+https://github.com/vllm-project/vllm.git
    
  3. For Windows users, vLLM might not be fully supported. The transformers fallback should work in all cases.

Requirements

  • Python 3.8 or higher
  • PyTorch 2.0.0 or higher
  • Either vLLM 0.2.0+ or transformers 4.30.0+ (the tool will use vLLM if available, otherwise fall back to transformers)
  • CUDA-compatible GPU (optional, for faster processing)

About

Transform a corpus that contains only tags and lemmas into parallel corpora.
