This document details the various data formats supported by BumbleCore to help you prepare training data.
💡 Format Support: All training stages support both JSON and JSONL file formats, with automatic recognition and processing.
- Pretraining Data Format
- SFT (Supervised Fine-Tuning) Data Format
- DPO (Direct Preference Optimization) Data Format
- Data Preparation Tips
{"text": "This is the first pretraining text content..."}
{"text": "This is the second pretraining text content..."}
{"text": "This is the third pretraining text content..."}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | ✅ | Pretraining text content, can be articles, code, conversations, or any text |
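The following is a minimal sketch of how a JSON or JSONL pretraining file could be read and checked before training. It is not BumbleCore's own loader: `load_records` and `check_pretraining` are hypothetical helper names, and detecting the layout from the first non-blank character is an assumption made for illustration.

```python
import json
from pathlib import Path


def load_records(path):
    """Load records from a JSON (top-level array) or JSONL (one object per line) file.

    Hypothetical helper: the layout is guessed from the first non-blank
    character, which is an assumption, not BumbleCore's documented logic.
    """
    text = Path(path).read_text(encoding="utf-8")
    if text.lstrip().startswith("["):
        return json.loads(text)  # JSON: a single top-level array
    # JSONL: parse every non-blank line as its own object
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_pretraining(records):
    """Return the indices of records that lack a non-empty string `text` field."""
    bad = []
    for i, rec in enumerate(records):
        text = rec.get("text") if isinstance(rec, dict) else None
        if not isinstance(text, str) or not text.strip():
            bad.append(i)
    return bad
```

Running `check_pretraining(load_records(path))` on a dataset returns an empty list when every record carries usable text.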
The SFT stage supports two mainstream data formats: Alpaca and ShareGPT.
The Alpaca format is a concise instruction-input-output triplet format.
[
  {
    "instruction": "Explain what machine learning is",
    "input": "",
    "output": "Machine learning is a branch of artificial intelligence..."
  },
  {
    "instruction": "Translate the following English to Chinese",
    "input": "Hello, how are you?",
    "output": "你好,你好吗?"
  }
]

| Field | Type | Required | Description |
|---|---|---|---|
| instruction | string | ✅ | User's instruction or question |
| input | string | ❌ | Supplementary input content, can be omitted or filled with "" if empty |
| output | string | ✅ | Model's response content |
| system | string | ❌ | Custom system prompt, defaults to "You are a helpful assistant." |
| tools | string/list | ❌ | Tool definitions, JSON string or list format |
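To make the field semantics above concrete, here is a hedged sketch that turns one Alpaca record into a list of chat messages. The helper names `parse_tools` and `alpaca_to_messages`, the `role`/`content` output layout, and joining `instruction` and `input` with a newline are all assumptions for illustration; only the default system prompt and the "JSON string or list" rule for `tools` come from the table.

```python
import json

DEFAULT_SYSTEM = "You are a helpful assistant."  # default stated in the field table


def parse_tools(tools):
    """Accept tool definitions as a JSON string or a list and return a list."""
    if tools is None or tools == "":
        return []
    if isinstance(tools, str):
        tools = json.loads(tools)  # JSON-string form
    if not isinstance(tools, list):
        raise ValueError("tools must be a list or a JSON-encoded list")
    return tools


def alpaca_to_messages(record):
    """Hypothetical conversion of one Alpaca record to chat messages."""
    instruction = record["instruction"]
    extra = record.get("input") or ""  # input may be omitted or ""
    # Assumption: instruction and input are joined with a newline
    user = f"{instruction}\n{extra}" if extra else instruction
    return {
        "messages": [
            {"role": "system", "content": record.get("system") or DEFAULT_SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["output"]},
        ],
        "tools": parse_tools(record.get("tools")),
    }
```

Normalizing `tools` in one place means the rest of a pipeline can treat it as a list regardless of which form the dataset used.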
{
  "system": "You are a professional math tutor",
  "instruction": "Solve this equation",
  "input": "x + 2 = 5",
  "output": "x = 3",
  "tools": "[{\"name\": \"calculator\", \"description\": \"Math calculator\"}]"
}

ShareGPT format is a conversational format supporting multi-turn dialogues.
[
  {
    "conversations": [
      {"from": "human", "value": "Hello"},
      {"from": "gpt", "value": "Hello! How can I help you?"}
    ]
  },
  {
    "conversations": [
      {"from": "human", "value": "Explain quantum computing"},
      {"from": "gpt", "value": "Quantum computing is a type of computing that uses quantum mechanics principles..."}
    ]
  }
]
| Field | Type | Required | Description |
|---|---|---|---|
| conversations | list | ✅ | Conversation list containing multi-turn dialogues |
| conversations[].from | string | ✅ | Role identifier: "system" / "human" / "gpt" |
| conversations[].value | string | ✅ | Conversation content |
| tools | string/list | ❌ | Tool definitions |
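As a sketch of how the role identifiers in the table might be consumed, the snippet below maps a ShareGPT record onto `role`/`content` messages and rejects unknown roles. The function name `sharegpt_to_messages` and the target role names (`user`, `assistant`) follow a common chat convention and are assumptions, not something this document specifies.

```python
# Assumed mapping from ShareGPT role identifiers to common chat role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record):
    """Hypothetical conversion of one ShareGPT record to chat messages."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        raise ValueError("conversations must be a non-empty list")
    messages = []
    for i, turn in enumerate(turns):
        role = ROLE_MAP.get(turn.get("from"))
        if role is None:
            raise ValueError(f"turn {i}: unknown role {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str):
            raise ValueError(f"turn {i}: value must be a string")
        messages.append({"role": role, "content": turn["value"]})
    return messages
```

Failing fast on an unexpected `from` value surfaces typos such as "user" or "assistant" in the raw data before training starts.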
[
  {
    "conversations": [
      {"from": "system", "value": "You are a helpful AI assistant"},
      {"from": "human", "value": "What is deep learning?"},
      {"from": "gpt", "value": "Deep learning is a subfield of machine learning based on multi-layer neural networks..."},
      {"from": "human", "value": "What are its applications?"},
      {"from": "gpt", "value": "Deep learning has wide applications in image recognition, natural language processing, speech recognition, and more."}
    ]
  }
]

[
  {
    "conversations": [
      {"from": "system", "value": "You are a professional translation assistant"},
      {"from": "human", "value": "Translate 'Hello' to French"},
      {"from": "gpt", "value": "Bonjour"}
    ],
    "tools": "[{\"name\": \"translator\", \"description\": \"Multi-language translation tool\"}]"
  }
]

DPO data contains chosen (preferred response) and rejected (non-preferred response) replies, supporting both Alpaca and ShareGPT formats.
[
  {
    "instruction": "Write a poem about spring",
    "input": "",
    "chosen": "Spring breeze caresses the blooming garden fair, Butterflies dance gracefully through the air. Green willows sway to greet the warming sun, Birds sing and swallows soar, their joy has just begun.",
    "rejected": "Spring has come, flowers bloom, it's beautiful."
  }
]

| Field | Type | Required | Description |
|---|---|---|---|
| instruction | string | ✅ | User's instruction or question |
| input | string | ❌ | Supplementary input content |
| chosen | string | ✅ | Better response (preferred response) |
| rejected | string | ✅ | Worse response (non-preferred response) |
| system | string | ❌ | Custom system prompt |
| tools | string/list | ❌ | Tool definitions |
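The required fields in the table can be checked with a few lines of code. The sketch below is one possible check, with `check_dpo_alpaca` as a hypothetical name; treating identical `chosen` and `rejected` strings as a problem is an assumption drawn from the idea that a preference pair should actually differ.

```python
def check_dpo_alpaca(record):
    """Return a list of problems found in one Alpaca-style DPO record."""
    problems = []
    for field in ("instruction", "chosen", "rejected"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    # Assumption: a pair with identical replies carries no preference signal
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems
```

An empty list means the record passed; collecting messages instead of raising lets a whole dataset be scanned in one pass.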
[
  {
    "system": "You are a poetic poet",
    "instruction": "Write a poem about the ocean",
    "input": "",
    "chosen": "Waves gently hum their ancient tune, Tides tell tales of a thousand moons. Blue waters dance with stars aglow, Deep and vast, where secrets flow.",
    "rejected": "The ocean is big and also blue.",
    "tools": "[{\"name\": \"rhyme_suggester\", \"description\": \"Rhyme suggestion tool\"}]"
  }
]

[
  {
    "conversations": [
      {"from": "human", "value": "Hello"},
      {"from": "gpt", "value": "Hello!"},
      {"from": "human", "value": "How are you today?"}
    ],
    "chosen": {"from": "gpt", "value": "I'm doing great, thanks for asking! I've had a very productive day."},
    "rejected": {"from": "gpt", "value": "Okay I guess."}
  }
]

| Field | Type | Required | Description |
|---|---|---|---|
| conversations | list | ✅ | Conversation history containing previous multi-turn dialogues |
| chosen | object | ✅ | Preferred final response |
| chosen.from | string | ✅ | Usually "gpt" |
| chosen.value | string | ✅ | Preferred response content |
| rejected | object | ✅ | Non-preferred final response |
| rejected.from | string | ✅ | Usually "gpt" |
| rejected.value | string | ✅ | Non-preferred response content |
| tools | string/list | ❌ | Tool definitions |
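For the ShareGPT variant, the shared history and the two candidate replies live in different fields. A rough sketch of pulling them apart is shown below; `split_dpo_sharegpt` is a hypothetical name, and requiring the history to end on a "human" turn is an assumption based on the examples in this document rather than a stated rule.

```python
def split_dpo_sharegpt(record):
    """Split one ShareGPT-style DPO record into history and the two replies."""
    history = record.get("conversations")
    if not isinstance(history, list) or not history:
        raise ValueError("conversations must be a non-empty list")
    # Assumption: the reply being compared answers a final human turn
    if history[-1].get("from") != "human":
        raise ValueError("conversation history should end with a human turn")
    replies = {}
    for key in ("chosen", "rejected"):
        reply = record.get(key)
        if not isinstance(reply, dict) or not isinstance(reply.get("value"), str):
            raise ValueError(f"{key} must be an object with a string value")
        replies[key] = reply["value"]
    return {
        "history": [(turn["from"], turn["value"]) for turn in history],
        "chosen": replies["chosen"],
        "rejected": replies["rejected"],
    }
```

Both replies are then scored against the same `history`, which is the shape a preference-optimization step generally needs.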
[
  {
    "conversations": [
      {"from": "system", "value": "You are a health advisor"},
      {"from": "human", "value": "I've been feeling tired recently"}
    ],
    "chosen": {
      "from": "gpt",
      "value": "Fatigue can have multiple causes. I recommend: 1. Ensure 7-8 hours of sleep per night; 2. Exercise moderately, walk for 10-15 minutes daily; 3. Maintain a balanced diet; 4. Reduce caffeine intake. If fatigue persists, please consult a doctor."
    },
    "rejected": {
      "from": "gpt",
      "value": "You're probably just lazy, exercise more."
    },
    "tools": "[{\"name\": \"wellness_tips\", \"description\": \"Evidence-based health advice\"}]"
  }
]

- Quality over Quantity: High-quality, well-formatted data is more valuable than large amounts of noisy data.
- Consistent Formatting: Ensure all data entries follow the same format structure.
- Validation: Validate your JSON/JSONL files before training to catch formatting errors.
- Balance: For DPO, ensure chosen and rejected responses are meaningfully different.
- Diversity: Include diverse examples covering different use cases and edge cases.
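
The validation tip above can be sketched for JSONL input as a line-by-line check that reports where parsing fails. `find_jsonl_errors` is a hypothetical helper built on Python's standard `json` module, not a BumbleCore utility, and requiring each line to be a JSON object is an assumption based on the examples in this document.

```python
import json


def find_jsonl_errors(lines):
    """Return (line_number, message) for each line that is not a JSON object."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than reported
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            errors.append((number, "expected a JSON object"))
    return errors
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `find_jsonl_errors(open("train.jsonl", encoding="utf-8"))`, where `train.jsonl` is a placeholder file name.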