- Introduction
- Features
- Installation
- Usage
- Dataset Preparation
- Training the Model
- Generating Stories
- Contributing
- License
Provide a brief overview of your project:
- What does it do?
- What problem does it solve?
- Who is it for?
Example:
This project is a story generation model using GPT-2. It generates coherent and engaging stories based on given prompts. The model is fine-tuned on a custom dataset of story texts, allowing it to produce creative and contextually relevant narratives.
List the main features of your project:
- Text generation using GPT-2
- Custom dataset preparation
- Easy-to-use training and generation scripts
Example:
- Generates stories based on user prompts.
- Trained on a custom dataset for better relevance.
- Easy-to-use interface for training and generating text.
Provide step-by-step instructions on how to install and set up the project.
Example:
1. Clone the repository:
```bash
git clone https://github.com/harinath23/build-text-llm-model
cd build-text-llm-model
```
2. Create and activate a virtual environment:
```bash
python -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```
4. Set up the project (if any additional setup is needed, e.g., downloading pre-trained models):
```bash
# Example command to download pre-trained models
python main.py
```
Provide instructions on how to use the project. Include examples of commands and expected outputs.
Example:
1. **Tokenizing the Data:**
```python
from yourmodule import tokenizer_func
book_directory = "path/to/your/cleaned_books"
tokenizer_func(book_directory)
```
2. **Training the Model:**
```python
from transformers import Trainer, TrainingArguments
# Define training arguments
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=2,
save_steps=10_000,
save_total_limit=2,
)
# Initialize Trainer and train
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=dataset,
)
trainer.train()
```
3. **Generating Stories:**
```python
prompt = "Once upon a time, in a faraway land,"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=200, num_return_sequences=1)
generated_story = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_story)
```
Detail how to prepare the dataset for training.
Example:
1. Collect text data from sources such as Project Gutenberg, online stories, etc.
2. Clean and preprocess the data to remove unwanted text (e.g., headers, footers).
3. Tokenize the text using the provided `read_and_tokenize_book` function.
4. Split the tokenized text into chunks using the `chunk_tokens` function.
Provide detailed instructions on how to train the model.
Example:
1. Ensure your data is tokenized and chunked properly.
2. Use the `TextDataset` class to create a dataset.
3. Define training arguments using `TrainingArguments`.
4. Initialize the `Trainer` and start training using the `trainer.train()` method.
Explain how to use the trained model to generate stories.
Example:
1. Provide a prompt to the model.
2. Encode the prompt using the tokenizer.
3. Use the `model.generate` method to generate the story.
4. Decode the generated tokens back to text using the tokenizer.
Guidelines for contributing to the project.
Example:
1. Fork the repository.
2. Create a new branch (e.g., `feature/your-feature`).
3. Commit your changes.
4. Push to the branch.
5. Open a pull request.