BumbleCore Data Format Guide

This document details the various data formats supported by BumbleCore to help you prepare training data.

💡 Format Support: All training stages support both JSON and JSONL file formats, with automatic recognition and processing.
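A minimal sketch of how a loader could accept both layouts, assuming a JSON file holds a top-level array (or a single object) and a JSONL file holds one object per line. `load_samples` is a hypothetical helper for illustration, not BumbleCore's actual loader.

```python
import json
from pathlib import Path


def load_samples(path):
    """Return a list of sample dicts from a JSON or JSONL file.

    Hypothetical helper: the real loader may detect the layout differently.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    try:
        data = json.loads(text)  # succeeds for a regular JSON document
    except json.JSONDecodeError:
        # Fall back to JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A JSON file may hold a list of samples or a single sample object.
    return data if isinstance(data, list) else [data]
```

Trying a whole-document parse first keeps pretty-printed JSON arrays working; only input that fails as a single document is read line by line.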

📋 Table of Contents

Pretraining Data Format
SFT (Supervised Fine-Tuning) Data Format
DPO (Direct Preference Optimization) Data Format
Data Preparation Tips

🔍 Format auto-detection: The data formatter detects the input format by inspecting the first sample's top-level keys, in priority order: messages → conversations → instruction. Make sure each file uses a single consistent format.
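The priority order above can be sketched as follows. The function name and the returned labels are assumptions made for illustration; only the key order (messages, then conversations, then instruction) comes from this guide.

```python
# Top-level keys checked in priority order, paired with an illustrative label.
FORMAT_PRIORITY = [
    ("messages", "messages"),
    ("conversations", "sharegpt"),
    ("instruction", "alpaca"),
]


def detect_format(samples):
    """Guess the dataset format from the first sample's top-level keys."""
    if not samples:
        raise ValueError("cannot detect the format of an empty dataset")
    first = samples[0]
    for key, label in FORMAT_PRIORITY:
        if key in first:
            return label
    raise ValueError(f"unrecognized sample keys: {sorted(first)}")
```

Because only the first sample is inspected, a file that mixes formats would be treated entirely as the format of its first sample, which is why each file should stay consistent.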

1️⃣ Pretraining Data Format

Format:

{"text": "This is the first pretraining text content..."}
{"text": "This is the second pretraining text content..."}
{"text": "This is the third pretraining text content..."}

Field Description

Field	Type	Required	Description
`text`	string	✅	Pretraining text content: articles, code, conversations, or any other text

2️⃣ SFT (Supervised Fine-Tuning) Data Format

The SFT stage supports three formats: Alpaca, ShareGPT, and OpenAI-style Messages.

Format 1: Alpaca Format

Alpaca format is a concise instruction-input-output triplet format.

Basic Format

[
  {
    "instruction": "Explain what machine learning is",
    "input": "",
    "output": "Machine learning is a branch of artificial intelligence..."
  },
  {
    "instruction": "Translate the following English to Chinese",
    "input": "Hello, how are you?",
    "output": "你好，你好吗？"
  }
]

Field Description

Field	Type	Required	Description
`instruction`	string	✅	User's instruction or question
`input`	string	❌	Supplementary input content; may be omitted or set to `""` when empty
`output`	string	✅	Model's response content
`system`	string	❌	Custom system prompt, defaults to "You are Bumblebee, a helpful AI assistant."

Alpaca format does not support tool calling. Use ShareGPT or OpenAI-style Messages format for tool-calling data.

Complete Format Example (with system)

[
  {
    "system": "You are a professional math tutor",
    "instruction": "Solve this equation",
    "input": "x + 2 = 5",
    "output": "x = 3"
  }
]
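To make the field semantics concrete, here is a hedged sketch of how an Alpaca sample might be expanded into a chat-style message list. Joining `instruction` and `input` with a newline is an assumption, as is the helper name; the default system prompt is the one quoted in the field table.

```python
DEFAULT_SYSTEM = "You are Bumblebee, a helpful AI assistant."


def alpaca_to_messages(sample):
    """Expand one Alpaca sample into system / user / assistant messages."""
    instruction = sample["instruction"]
    extra = sample.get("input") or ""  # missing and "" are treated the same
    # Assumption: instruction and input are joined with a single newline.
    user_text = f"{instruction}\n{extra}" if extra else instruction
    return [
        {"role": "system", "content": sample.get("system") or DEFAULT_SYSTEM},
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": sample["output"]},
    ]
```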

Format 2: ShareGPT Format

ShareGPT format is a conversational format supporting multi-turn dialogues.

Basic Format

[
  {
    "conversations": [
      {"from": "human", "value": "Hello"},
      {"from": "gpt", "value": "Hello! How can I help you?"}
    ]
  },
  {
    "conversations": [
      {"from": "human", "value": "Explain quantum computing"},
      {"from": "gpt", "value": "Quantum computing is a type of computing that uses quantum mechanics principles..."}
    ]
  }
]

Field Description

Field	Type	Required	Description
`conversations`	list	✅	Conversation list containing multi-turn dialogues
`conversations[].from`	string	✅	Role identifier: `"system"` / `"human"` / `"gpt"`
`conversations[].value`	string	✅	Conversation content
`tools`	string/list	❌	Tool definitions, usually used with `function_call` / `observation` for tool calling. Providing `tools` alone does not mean a tool call happened.

Multi-turn Conversation Example

[
  {
    "conversations": [
      {"from": "system", "value": "You are a helpful AI assistant"},
      {"from": "human", "value": "What is deep learning?"},
      {"from": "gpt", "value": "Deep learning is a subfield of machine learning based on multi-layer neural networks..."},
      {"from": "human", "value": "What are its applications?"},
      {"from": "gpt", "value": "Deep learning has wide applications in image recognition, natural language processing, speech recognition, and more."}
    ]
  }
]

Tool Calling Support (ShareGPT)

ShareGPT format supports tool calling via two additional `from` values:

`from`	Description
`function_call`	The assistant invokes a tool. `value` is a JSON string with `name` and `arguments`.
`observation`	Tool's return value. Must immediately follow a `function_call` or another `observation`.

These two roles are converted internally to the OpenAI / Qwen tool-calling structure (assistant.tool_calls and role: "tool") before being passed to the chat template.

[
  {
    "conversations": [
      {"from": "human", "value": "What's the weather in Beijing?"},
      {"from": "function_call", "value": "{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\"}}"},
      {"from": "observation", "value": "{\"city\": \"Beijing\", \"temperature\": 18, \"condition\": \"sunny\"}"},
      {"from": "gpt", "value": "It's sunny in Beijing today, around 18°C."}
    ],
    "tools": "[{\"name\": \"get_weather\", \"description\": \"Look up weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"]}}]"
  }
]
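The role translation described above might look roughly like the sketch below. It is an illustration under stated assumptions, not BumbleCore's internal converter: the function name is hypothetical, and each `function_call` turn is mapped to its own assistant message with `content` set to `None`.

```python
import json

# ShareGPT "from" values mapped to OpenAI-style roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(conversations):
    """Translate ShareGPT turns into OpenAI-style messages."""
    messages = []
    for turn in conversations:
        source, value = turn["from"], turn["value"]
        if source == "function_call":
            call = json.loads(value)  # value is a JSON string with name/arguments
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "type": "function",
                    "function": {"name": call["name"], "arguments": call["arguments"]},
                }],
            })
        elif source == "observation":
            # The tool result stays a string, exactly as stored in the sample.
            messages.append({"role": "tool", "content": value})
        else:
            messages.append({"role": ROLE_MAP[source], "content": value})
    return messages
```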

⚠️ Samples with malformed tool-calling structure (e.g. an observation not preceded by a function_call) are skipped with a warning at load time. See datasets/glaive_toolcall_zh_demo.json for a real example.
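If you want to catch such samples before training, a pre-check along these lines can help. It is a sketch that only enforces the ordering rule stated above (an `observation` must directly follow a `function_call` or another `observation`); the helper names and the logger name are hypothetical, and the real loader may apply further checks.

```python
import logging

logger = logging.getLogger("data_check")  # hypothetical logger name


def has_valid_tool_structure(conversations):
    """Return False if any observation lacks a preceding call or observation."""
    previous = None
    for turn in conversations:
        if turn["from"] == "observation" and previous not in ("function_call", "observation"):
            return False
        previous = turn["from"]
    return True


def filter_valid(samples):
    """Keep well-formed samples and log a warning for each skipped one."""
    kept = []
    for index, sample in enumerate(samples):
        if has_valid_tool_structure(sample["conversations"]):
            kept.append(sample)
        else:
            logger.warning("Skipping sample %d: malformed tool-calling structure", index)
    return kept
```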

Format 3: Messages Format (OpenAI-style)

If your data is already in OpenAI / Qwen messages form (e.g. exported from a chat API), you can use it directly. This is the recommended format for tool-calling data because it requires no role translation.

Basic Format

[
  {
    "messages": [
      {"role": "user", "content": "Explain large language models in one sentence."},
      {"role": "assistant", "content": "An LLM is a deep neural network trained on massive text corpora to understand and generate natural language."}
    ]
  }
]

Field Description

Field	Type	Required	Description
`messages`	list	✅	OpenAI-style message list
`messages[].role`	string	✅	One of `"system"` / `"user"` / `"assistant"` / `"tool"`
`messages[].content`	string | null	depends	Required for most roles. For an `assistant` message, `content` may be `null` only if `tool_calls` is set. `content` and `tool_calls` may also coexist - see "mixed assistant message" below.
`messages[].tool_calls`	list	❌	Used when an assistant message invokes tools
`tools`	string/list	❌	Tool definitions, usually used with `tool_calls` in `messages`

If no system message exists at the start, the default system prompt ("You are Bumblebee, a helpful AI assistant.") is automatically prepended.
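A minimal sketch of this behavior, assuming the check looks only at the first message; `ensure_system` is a hypothetical name.

```python
DEFAULT_SYSTEM = "You are Bumblebee, a helpful AI assistant."


def ensure_system(messages, default=DEFAULT_SYSTEM):
    """Return a copy of messages that starts with a system message."""
    if messages and messages[0]["role"] == "system":
        return list(messages)  # already has a system prompt; leave it as-is
    return [{"role": "system", "content": default}] + list(messages)
```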

Mixed assistant message (text + tool_calls)

An assistant message may carry both a textual content (e.g. a brief reasoning or status update for the user) and tool_calls in the same message. The Qwen chat template renders the text first, followed by <tool_call>...</tool_call> blocks within the same <|im_start|>assistant ... <|im_end|> turn:

{
  "role": "assistant",
  "content": "Sure, let me check the weather in Beijing for you.",
  "tool_calls": [
    {
      "type": "function",
      "function": {"name": "get_weather", "arguments": {"city": "Beijing"}}
    }
  ]
}

This is the recommended pattern when you want the model to learn to verbalize its intent before invoking a tool.
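To visualize the ordering, the sketch below approximates how such a turn is laid out. It is not the Qwen chat template itself: the real rendering is done by the tokenizer's template, and the exact whitespace here is an assumption.

```python
import json


def render_assistant_turn(message):
    """Approximate the layout of a mixed assistant turn (text, then tool calls)."""
    parts = []
    if message.get("content"):
        parts.append(message["content"])  # the text comes first
    for call in message.get("tool_calls", []):
        payload = json.dumps(call["function"], ensure_ascii=False)
        parts.append(f"<tool_call>\n{payload}\n</tool_call>")
    # Everything stays inside a single assistant turn.
    return "<|im_start|>assistant\n" + "\n".join(parts) + "<|im_end|>"
```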

Tool Calling Example

[
  {
    "messages": [
      {"role": "user", "content": "What's the weather in Beijing?"},
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "type": "function",
            "function": {"name": "get_weather", "arguments": {"city": "Beijing", "unit": "celsius"}}
          }
        ]
      },
      {"role": "tool", "content": "{\"city\": \"Beijing\", \"temperature\": 18, \"condition\": \"sunny\"}"},
      {"role": "assistant", "content": "It's sunny in Beijing today, about 18°C."}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Look up weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"},
              "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }
]

💡 See datasets/messages_zh_demo.json for runnable end-to-end examples.

3️⃣ DPO (Direct Preference Optimization) Data Format

Each DPO sample pairs a chosen (preferred) response with a rejected (non-preferred) response to the same prompt. Three formats are supported: Alpaca, ShareGPT, and OpenAI-style Messages.

Format 1: Alpaca Format

Basic Format

[
  {
    "instruction": "Write a poem about spring",
    "input": "",
    "chosen": "Spring breeze caresses the blooming garden fair, Butterflies dance gracefully through the air. Green willows sway to greet the warming sun, Birds sing and swallows soar, their joy has just begun.",
    "rejected": "Spring has come, flowers bloom, it's beautiful."
  }
]

Field Description

Field	Type	Required	Description
`instruction`	string	✅	User's instruction or question
`input`	string	❌	Supplementary input content
`chosen`	string	✅	Better response (preferred response)
`rejected`	string	✅	Worse response (non-preferred response)
`system`	string	❌	Custom system prompt

Alpaca format does not support tool calling. Use ShareGPT or OpenAI-style Messages format for tool-calling data.

Complete Format Example

[
  {
    "system": "You are a lyrical poet",
    "instruction": "Write a poem about the ocean",
    "input": "",
    "chosen": "Waves gently hum their ancient tune, Tides tell tales of a thousand moons. Blue waters dance with stars aglow, Deep and vast, where secrets flow.",
    "rejected": "The ocean is big and also blue."
  }
]

Format 2: ShareGPT Format

ShareGPT DPO uses conversations for the prompt history before the candidate response, and chosen / rejected for two competing final assistant responses under the same context.

Basic Format

[
  {
    "conversations": [
      {"from": "human", "value": "Hello"},
      {"from": "gpt", "value": "Hello!"},
      {"from": "human", "value": "How are you today?"}
    ],
    "chosen": {"from": "gpt", "value": "I'm doing great, thanks for asking! I've had a very productive day."},
    "rejected": {"from": "gpt", "value": "Okay I guess."}
  }
]
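Conceptually, each sample expands into two full conversations that share the same history and differ only in the final turn. A small sketch, with a hypothetical helper name:

```python
def build_preference_pair(sample):
    """Return (chosen_conversation, rejected_conversation) for one DPO sample."""
    history = list(sample["conversations"])  # copy so the sample is not mutated
    chosen = history + [sample["chosen"]]
    rejected = history + [sample["rejected"]]
    return chosen, rejected
```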

Field Description

Field	Type	Required	Description
`conversations`	list	✅	Conversation history containing previous multi-turn dialogues
`conversations[].from`	string	✅	`"system"` / `"human"` / `"gpt"`; tool-calling history also supports `"function_call"` / `"observation"`
`conversations[].value`	string	✅	Message content. For `function_call`, this is a JSON string with `name` and `arguments`.
`chosen`	object	✅	Preferred final response
`chosen.from`	string	✅	Usually `"gpt"`
`chosen.value`	string	✅	Preferred response content
`rejected`	object	✅	Non-preferred final response
`rejected.from`	string	✅	Usually `"gpt"`
`rejected.value`	string	✅	Non-preferred response content
`tools`	string/list	❌	Tool definitions. This is a tool-calling training sample only when `conversations` contains `function_call` / `observation`.

Tool Calling Example

[
  {
    "conversations": [
      {"from": "human", "value": "What's the weather in Beijing today?"},
      {"from": "function_call", "value": "{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\"}}"},
      {"from": "observation", "value": "{\"city\": \"Beijing\", \"temperature\": 18, \"condition\": \"sunny\"}"}
    ],
    "chosen": {
      "from": "gpt",
      "value": "It's sunny in Beijing today, around 18°C, so it should be comfortable to go outside."
    },
    "rejected": {
      "from": "gpt",
      "value": "I don't know the weather in Beijing."
    },
    "tools": "[{\"name\": \"get_weather\", \"description\": \"Look up weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"]}}]"
  }
]

Format 3: Messages Format (OpenAI-style)

You can also feed DPO data in the OpenAI-style messages form. The messages array carries the full prompt history before the candidate response, and chosen / rejected carry the two competing responses. Candidate responses support three forms:

string: a single plain-text assistant response.
object: a single assistant message, optionally with both content and tool_calls.
list: a full candidate message trajectory, which may include assistant.tool_calls, role: "tool", and a final assistant response.
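One way to handle the three forms uniformly is to normalize every candidate into a list of messages. The helper below is a sketch of that idea, not the project's actual code; `normalize_candidate` is a hypothetical name.

```python
def normalize_candidate(candidate):
    """Coerce a chosen/rejected value into a list of message dicts."""
    if isinstance(candidate, str):
        # Plain text becomes a single assistant message.
        return [{"role": "assistant", "content": candidate}]
    if isinstance(candidate, dict):
        # A single assistant message, possibly carrying tool_calls.
        return [candidate]
    if isinstance(candidate, list):
        # Already a full trajectory; return a shallow copy.
        return list(candidate)
    raise TypeError(f"unsupported candidate type: {type(candidate).__name__}")
```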

Basic Format (string `chosen` / `rejected`)

[
  {
    "messages": [
      {"role": "user", "content": "Translate to English: 人工智能正在改变我们的生活方式。"}
    ],
    "chosen": "Artificial intelligence is changing the way we live.",
    "rejected": "AI change life."
  }
]

chosen / rejected may also use the object form {"role": "assistant", "content": "..."}. If a candidate response should invoke a tool, include tool_calls in that object. This form is useful for preferring "call a tool" over a direct hallucinated answer, but it does not include the tool result and final answer.

Tool Calling Example

[
  {
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Look up weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"}
            },
            "required": ["city"]
          }
        }
      }
    ],
    "messages": [
      {"role": "user", "content": "What's the weather in Beijing today?"}
    ],
    "chosen": {
      "role": "assistant",
      "content": "Let me check the current weather in Beijing.",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": {"city": "Beijing"}
          }
        }
      ]
    },
    "rejected": {
      "role": "assistant",
      "content": "The weather in Beijing should be nice today."
    }
  }
]

Full Tool Trajectory Example (`chosen` / `rejected` as message lists)

Use the list form when you want DPO to compare the full tool-use trajectory. During training, assistant turns in the candidate trajectory are supervised (including <tool_call> and the final response), while role: "tool" messages are masked out.

[
  {
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Look up weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"}
            },
            "required": ["city"]
          }
        }
      }
    ],
    "messages": [
      {"role": "user", "content": "What's the weather in Beijing today?"}
    ],
    "chosen": [
      {
        "role": "assistant",
        "content": "Let me check the current weather in Beijing.",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {"city": "Beijing"}
            }
          }
        ]
      },
      {
        "role": "tool",
        "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"
      },
      {
        "role": "assistant",
        "content": "It's sunny in Beijing today, about 18°C, so it should be comfortable to go outside."
      }
    ],
    "rejected": [
      {
        "role": "assistant",
        "content": "The weather in Beijing should be nice today."
      }
    ]
  }
]

Field Description

Field	Type	Required	Description
`messages`	list	✅	Conversation history before the candidate response. May include historical `assistant.tool_calls` and `role: "tool"`.
`chosen`	string | object | list	✅	Preferred candidate response. The list form represents a full candidate message trajectory.
`rejected`	string | object | list	✅	Non-preferred candidate response, same forms as `chosen`
`tools`	string/list	❌	Tool definitions, usually used with `tool_calls` in `messages`, `chosen`, or `rejected`

💡 In DPO list trajectories, assistant turns are included in labels and tool turns are masked. This lets DPO compare the full "tool call + final answer" trajectory without teaching the model to generate tool return values.
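The masking rule can be illustrated with a toy example. Everything here is a simplification: `toy_tokenize` stands in for a real tokenizer, `tool_calls` payloads and chat-template tokens are ignored, and `-100` is used because it is the label value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy by default


def toy_tokenize(text):
    """Stand-in tokenizer: one fake id (the word length) per whitespace word."""
    return [len(word) for word in text.split()]


def build_labels(prompt_messages, candidate_messages):
    """Supervise assistant turns in the candidate; mask everything else."""
    input_ids, labels = [], []
    for message in prompt_messages:
        ids = toy_tokenize(message.get("content") or "")
        input_ids += ids
        labels += [IGNORE_INDEX] * len(ids)  # the prompt is never supervised
    for message in candidate_messages:
        ids = toy_tokenize(message.get("content") or "")
        input_ids += ids
        if message["role"] == "assistant":
            labels += ids  # assistant tokens contribute to the loss
        else:
            labels += [IGNORE_INDEX] * len(ids)  # tool results are masked
    return input_ids, labels
```

Running this once per candidate (chosen and rejected) against the same prompt yields the two label sequences that a DPO loss would compare.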

💡 See datasets/dpo_messages_zh_demo.json for runnable end-to-end examples.

📝 Data Preparation Tips

Quality over Quantity: High-quality, well-formatted data is more valuable than large amounts of noisy data.
Consistent Formatting: All samples within a single file should use the same format - do not mix formats.
Validation: Validate your JSON/JSONL files before training to catch formatting errors.
Balance: For DPO, ensure chosen and rejected responses are meaningfully different.
Diversity: Include diverse examples covering different use cases and edge cases.
Tool-calling data: Prefer the Messages format for tool-calling samples - it maps 1:1 to the chat template and is the least error-prone.
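Two of these tips (consistent formatting, and distinct chosen / rejected responses) are easy to check mechanically. The sketch below assumes the samples are already loaded into a list of dicts; `find_problems` is a hypothetical helper, and it covers only these two checks.

```python
FORMAT_KEYS = ("messages", "conversations", "instruction")


def sample_format(sample):
    """Return the first recognized top-level format key, or None."""
    return next((key for key in FORMAT_KEYS if key in sample), None)


def find_problems(samples):
    """Return (index, description) pairs for samples that break the tips."""
    problems = []
    expected = sample_format(samples[0]) if samples else None
    for index, sample in enumerate(samples):
        fmt = sample_format(sample)
        if fmt is None:
            problems.append((index, "unrecognized format"))
        elif fmt != expected:
            problems.append((index, f"mixed formats: {fmt} vs {expected}"))
        # For DPO data, identical candidates carry no preference signal.
        if "chosen" in sample and sample["chosen"] == sample.get("rejected"):
            problems.append((index, "chosen and rejected are identical"))
    return problems
```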
