WeiLiuAH/AIR

AIR: Complex Instruction Generation via Automatic Iterative Refinement

🌟 Overview

AIR is a framework for generating complex instructions with constraints, with the goal of improving the ability of large language models (LLMs) to follow such instructions. It uses a two-stage process:

  1. Initial Instruction Generation: Generate base instructions from documents
  2. Iterative Refinement: Enhance instructions through LLM-as-judge guidance

The framework produces more challenging and realistic instructions, leading to improved model performance on complex tasks.
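
The two stages above can be sketched as a small loop. This is only an illustrative sketch, not the repository's implementation: the names `refine_instruction`, `generate_initial`, `judge`, and `revise` are hypothetical, and the toy functions stand in for real LLM calls. The default of 5 iterations mirrors the "max 5 iterations" noted for the judge data script.

```python
from typing import Callable, List, Optional


def refine_instruction(
    document: str,
    generate_initial: Callable[[str], str],
    judge: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str], str],
    max_iterations: int = 5,
) -> List[str]:
    """Return every version of the instruction, from initial to final.

    All three callables are hypothetical stand-ins for LLM calls.
    """
    # Stage 1: derive a base instruction from the document.
    instruction = generate_initial(document)
    history = [instruction]

    # Stage 2: let a judge propose refinements until it has none left
    # or the iteration budget is used up.
    for _ in range(max_iterations):
        feedback = judge(document, instruction)
        if feedback is None:
            break
        instruction = revise(instruction, feedback)
        history.append(instruction)
    return history


# Toy stand-ins so the sketch runs without any model.
CONSTRAINTS = ["Use at most 50 words.", "Write in a formal tone."]


def toy_generate(document: str) -> str:
    return f"Summarize: {document}"


def toy_judge(document: str, instruction: str) -> Optional[str]:
    # Suggest the first constraint the instruction does not yet contain.
    for constraint in CONSTRAINTS:
        if constraint not in instruction:
            return constraint
    return None


def toy_revise(instruction: str, feedback: str) -> str:
    return f"{instruction} {feedback}"


history = refine_instruction("a news article", toy_generate, toy_judge, toy_revise)
for step, text in enumerate(history):
    print(step, text)
```

Keeping the full history rather than only the final instruction makes it possible to inspect how each round of judge feedback changed the instruction.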

🚀 Key Features

  • Automatic Iterative Refinement: Novel approach to generate complex instructions
  • Constraint-aware Generation: Instructions that better reflect real-world scenarios
  • Large-scale Dataset: AIR-10K dataset with 10,000 complex instructions
  • Enhanced Performance: Significant improvements over existing instruction-following methods

⚙️ Installation

pip install -r requirements.txt

📊 Dataset Preparation

Download Dolma Dataset

# Download dataset chunks
huggingface-cli download --repo-type dataset --local-dir-use-symlinks False emozilla/dolma-v1_7-cc_en_head --local-dir ./data/dolma --include "*000_00000.parquet*"
huggingface-cli download --repo-type dataset --local-dir-use-symlinks False emozilla/dolma-v1_7-cc_en_head --local-dir ./data/dolma --include "*001_00000.parquet*"
huggingface-cli download --repo-type dataset --local-dir-use-symlinks False emozilla/dolma-v1_7-cc_en_head --local-dir ./data/dolma --include "*002_00000.parquet*"

Initial Processing

# Convert data format
python ./init_process/data_acquire.py \
    --input_path ./data/dolma \
    --output_path ./data/dolma.jsonl

# Generate embeddings
python ./init_process/embeds_gene.py \
    --input_path ./data/dolma.jsonl \
    --output_path ./data/doc_embeds.jsonl

# Select diverse documents
python ./init_process/select_diverse_based_doc_embeds.py \
    --input_path ./data/dolma.jsonl \
    --embedding_path ./data/doc_embeds.jsonl \
    --output_path ./data/dolma_60k.jsonl

# Generate initial instructions
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./init_process/instruct_generate.py \
    -i ./data/dolma_60k.jsonl \
    -o ./data/dolma_init_process.jsonl \
    -m /path/llama3_70b_instruct

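# Filter and score instructions
# Note: --output_path below is the same file as --input_path;
# copy the input first if you want to keep the unfiltered data.
python ./init_process/instruct_score_filter.py \
    --input_path ./data/dolma_init_process.jsonl \
    --output_path ./data/dolma_init_process.jsonl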
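
To give a feel for what "select diverse documents" from embeddings can mean, here is a minimal sketch using greedy farthest-point selection with cosine distance. This is an assumption for illustration only: the README does not say which algorithm `select_diverse_based_doc_embeds.py` uses, and `cosine_distance` and `select_diverse` are hypothetical names.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse(embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedy farthest-point selection (illustrative, not the repo's method).

    Starts from item 0, then repeatedly adds the item whose distance to
    its nearest already-selected item is largest.
    """
    n = len(embeddings)
    if n == 0 or k <= 0:
        return []
    selected = [0]
    chosen = {0}
    # min_dist[i]: distance from item i to its nearest selected item.
    min_dist = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[nxt]))
    return selected


# Toy 2-D embeddings: items 0 and 1 are near-duplicates.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = select_diverse(toy_embeddings, 3)
print(picked)
```

With the toy data, the near-duplicate of item 0 is the one left out, which is the behavior a diversity filter is meant to have. The loop costs O(n·k) distance evaluations, so at the scale of the real corpus a vectorized or approximate variant would likely be needed.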

Generate and Process Judge Data

# Generate judge data (max 5 iterations)
bash ./judge_data_gene/run_main.sh

# Process for SFT training
bash ./judge_data_process/data_process.sh
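
As a rough idea of what preparing data for SFT can involve, the sketch below converts records into the Alpaca-style `instruction` / `input` / `output` layout that LlamaFactory accepts. It is not a description of `data_process.sh`: the input field names `instruction` and `response` and the helper `to_alpaca_records` are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def to_alpaca_records(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map rows with hypothetical 'instruction'/'response' fields to Alpaca-style records."""
    records = []
    for row in rows:
        instruction = row.get("instruction", "").strip()
        response = row.get("response", "").strip()
        # Skip incomplete pairs rather than emitting empty training examples.
        if not instruction or not response:
            continue
        records.append({"instruction": instruction, "input": "", "output": response})
    return records


# Toy rows; the second one is dropped because it has no response.
rows = [
    {"instruction": "Summarize the article in at most 50 words.", "response": "A short summary."},
    {"instruction": "Explain the result.", "response": ""},
]
records = to_alpaca_records(rows)
print(json.dumps(records, indent=2, ensure_ascii=False))
```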

Models Used

Guidance Models Used

🔄 Training

We support training using LlamaFactory with the following models:


