llm-datasets

Here are 28 public repositories matching this topic...

neo4j-labs / text2cypher

Collection of text2cypher datasets, evaluations, and fine-tuning instructions

neo4j graph cypher cypher-query-language llm llms llm-training llm-datasets text2cypher

Updated Jun 13, 2024
Jupyter Notebook

dsdanielpark / open-llm-datasets

Repository for organizing datasets and papers used in Open LLM.

natural-language-processing datasets large-language-models llm llm-training llm-datasets

Updated Jul 6, 2023

ServiceNow / SyGra

SyGra - Graph-oriented Synthetic data generation Pipeline

python open-source ai multimodality synthetic-data synthetic-dataset-generation dpo image-datasets low-code-no-code llm-datasets llm-framework sft-data llm-training-data

Updated Apr 21, 2026
Python

discus-labs / discus

A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ

python openai gpt synthetic-data fine-tuning synthetic-dataset-generation ner-data huggingface-transformers gpt-4 large-language-models llms llm-training llm-datasets fine-tuning-llm

Updated Nov 20, 2023
Python

asimsinan / LLM-Research

A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks

arxiv-papers large-language-models llm llms llm-datasets llm-tools buyuk-dil-modelleri llm-research llm-theses llm-benchmarking llm-frameworks

Updated Oct 8, 2024
Python

lightning-rod-labs / lightningrod-python-sdk

Python SDK for dataset generation on the LightningRod platform ⚡

dataset-generation synthetic-data fine-tuning llm-datasets llm-tooling

Updated Apr 22, 2026
Python

amao0o0 / awesome-AI-Math-Datasets

A collection of recent open-source math datasets for training and evaluating Math LLMs

math mathematics llm ai4math llm-datasets math-llm

Updated Mar 29, 2026

ronniross / asi-core-protocol

A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.

machine-learning agi dataset artificial-general-intelligence machine-learning-library datasets machine-learning-projects llm llms rlhf llm-datasets llm-framework llms-benchmarking llm-benchmarking artificial-general-super-intelligence agi-development

Updated Feb 26, 2026

altunenes / rustysozluk

Efficiently fetch entries from eksisozluk.com and run sentiment analysis (Turkish only) on them using Rust

rust scraper sentiment-analysis turkish eksisozluk rust-lang webscraping eksi-sozluk reqwest duyguanalizi rust-scraping llm-training llm-datasets

Updated Feb 8, 2024
Rust

ishida-lab / capbencher

CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming

contamination-detection llm llm-datasets llm-benchmark leaderboard-hacking

Updated Feb 24, 2026
Python

arian-askari / SOLID

Synthetically generating intent-aware information-seeking dialogues. Useful for tasks such as training and evaluating user intent predictors, with the option of training and evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.

solid dataset-generation conversational-ai intent-classification llm-training llm-inference llm-datasets llm-dialogs llm-conversations zephyr-7b-beta intent-aware-conversation-generation solid-rl

Updated Aug 18, 2024
Python

tiddly-gittly / TiddlyWiki-LLM-dataset

WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)

dataset tiddlywiki wikitext llm llm-training llm-datasets

Updated Nov 20, 2024
TypeScript

neuralwork / audio2chat

Convert multi-speaker audio files to structured chat data for LLMs

chat transcription whisper speaker-diarization llm llm-datasets

Updated Jan 29, 2025
Python

DefinetlyNotAI / LLM_Data

The source code of a number of very well-known repos, in Python, gathered in this repo as plain localdocs for training code AI

c data cpp cuda jupyter-notebook python3 code-examples llm llm-datasets data-dum programming-data programming-data-sets llm-code

Updated Dec 12, 2024
Python

dmeldrum6 / LLM-Dataset-Builder

LLM-Powered Dataset Creation Tool

synthetic-data synthetic-dataset-generation synthetic-data-generation llm llm-training llm-datasets

Updated Aug 15, 2025
HTML

JochiRaider / sievio

Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.

python data-deduplication dataset-creation data-pipelines repository-mining jsonl github-repos rag text-preprocessing quality-filtering code-mining llm llm-training llm-datasets

Updated Dec 27, 2025
Python

AmanPriyanshu / Stratified-LLM-Subsets-100K-1M-Scale

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.

Updated Oct 4, 2025
HTML

redblock-ai / parrot-python

PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

benchmarking-framework llm-inference llm-datasets llm-qa-document llm-benchmarking

Updated Oct 14, 2024
Python

mohammadreza-mohammadi94 / Persian-Poem-Dataset

A collection of Persian poems structured for NLP and LLM tasks. Each poem is stored as a separate file, organized by poet, and formatted for easy use in training, fine-tuning, or text analysis workflows.

dataset persian-dataset llm-datasets

Updated Jul 3, 2025

aloobun / ccpem-modified

A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.

dataset llm-datasets

Updated Sep 29, 2023
