[21] LLM Dataset and generator #129

Merged: Andriamanampisoa merged 4 commits into staging from 21-llm-datasets on Apr 12, 2026

Conversation

@Andriamanampisoa
Collaborator

What type of PR is this? (check all applicable)

  • ✨ Feature
  • 🛑 Bug
  • ⚠️ Anomaly
  • 📝 Doc
  • 🎨 Style
  • 🧑‍💻 Refactor
  • 🛠️ Setup
  • 🏗️ Build
  • 🔥 Perfs
  • ✅ Test
  • 🔁 CI
  • ⏩ Revert

Description

This pull request introduces a dataset generation system for simulating IT recruiter-candidate conversations, aimed at fine-tuning the Lucie-7B-Instruct-v1.1 model. It adds a configurable, multi-role, multi-level, and quality-based synthetic data generator, along with supporting configuration files for roles, levels, qualities, and domain-specific Q&A.

A sample dataset has also been provided.
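
To illustrate the idea, here is a minimal sketch of how one conversation could be sampled from a role, a level, and a quality tier. It is not the code from `dataset-generator.py`: the in-memory lists stand in for the JSON config files, and every name (`ROLES`, `LEVELS`, `QUALITIES`, `QUESTIONS`, `generate_conversation`) and the record layout are assumptions made for the example.

```python
import random

# Hypothetical stand-ins for the JSON config files; the real schemas are not shown in this PR.
ROLES = ["Backend Developer", "DevOps Engineer"]
LEVELS = ["junior", "senior"]
QUALITIES = ["weak", "strong"]
QUESTIONS = ["Can you describe a recent project?", "How do you approach debugging?"]


def generate_conversation(rng: random.Random, turns: int = 2) -> dict:
    """Sample one (role, level, quality) combination and build a multi-turn exchange."""
    role = rng.choice(ROLES)
    level = rng.choice(LEVELS)
    quality = rng.choice(QUALITIES)
    messages = []
    # Pick distinct questions so the recruiter does not repeat itself.
    for question in rng.sample(QUESTIONS, k=min(turns, len(QUESTIONS))):
        messages.append({"speaker": "recruiter", "text": question})
        # Placeholder answer: a real generator would select text matching the quality tier.
        messages.append(
            {"speaker": "candidate", "text": f"[{quality} answer from a {level} {role}]"}
        )
    return {"role": role, "level": level, "quality": quality, "messages": messages}


# A seeded generator keeps the output reproducible between runs.
conversation = generate_conversation(random.Random(0))
```

Passing an explicit `random.Random` instance instead of using the module-level functions makes it easy to regenerate the same dataset from a seed.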

Linked GitHub Ticket

Closes EpitechPromo2027/G-EIP-600-NAN-6-1-eip-tugdual.de-reviers#21

Workspace

  • 🖥️ Web
  • 🛠️ Server
  • 🔁 CI
  • 🤖 Ai
  • 📱 App

@vercel

vercel Bot commented Apr 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| talk-up-ai-dev | Ready | Preview, Comment | Apr 12, 2026 7:26am |

@railway-app

railway-app Bot commented Apr 11, 2026

🚅 Deployed to the TalkUp.AI-pr-129 environment in talk-up-ai

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| Backend | ✅ Success (View Logs) | | Apr 12, 2026 at 7:28 am |

@railway-app railway-app Bot temporarily deployed to talk-up-ai / TalkUp.AI-pr-129 April 11, 2026 07:53 Destroyed
Contributor

Copilot AI left a comment

Pull request overview

Adds a synthetic multi-turn conversation dataset generator (IT recruiter ↔ candidate) intended for fine-tuning Lucie-7B-Instruct-v1.1, backed by JSON configuration for roles, levels, qualities, and role-specific Q&A.

Changes:

  • Introduce a Python script that generates multi-turn recruiter/candidate conversations and writes them to a dataset file.
  • Add JSON configuration files for role selection, experience levels, answer “quality”, and role-specific question/answer banks.
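
As a rough illustration of the "writes them to a dataset file" step, the sketch below serializes conversations as one JSON object per line. The JSON Lines layout is an assumption, since the PR does not state the output format, and `write_dataset`, `read_dataset`, and the record fields are hypothetical names chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def write_dataset(conversations: list, path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    with path.open("w", encoding="utf-8") as handle:
        for conversation in conversations:
            # ensure_ascii=False keeps accented characters readable in the file.
            handle.write(json.dumps(conversation, ensure_ascii=False) + "\n")
    return len(conversations)


def read_dataset(path: Path) -> list:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical record; the real generator's fields may differ.
sample = [
    {
        "role": "Backend Developer",
        "messages": [
            {"speaker": "recruiter", "text": "Tell me about yourself."},
            {"speaker": "candidate", "text": "I have worked on API services for three years."},
        ],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "dataset.jsonl"
    written = write_dataset(sample, out_path)
    loaded = read_dataset(out_path)
```

One record per line lets the file be streamed or appended to without loading the whole dataset into memory.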

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| ai/llm-core/datasets/generator/dataset-generator.py | Implements the synthetic conversation generator and dataset writer. |
| ai/llm-core/datasets/generator/config/roles.json | Defines the set of roles to sample when generating conversations. |
| ai/llm-core/datasets/generator/config/levels.json | Defines experience levels used in prompts/responses. |
| ai/llm-core/datasets/generator/config/qualities.json | Defines response quality tiers that control answer content. |
| ai/llm-core/datasets/generator/config/role_data.json | Provides role-specific Q&A templates (with a default fallback). |
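
The "default fallback" mentioned for `role_data.json` could work along these lines. This is only a sketch under assumptions: the JSON shape, the `"default"` key name, and the helper `qa_for_role` are invented for illustration and may not match the actual file or script.

```python
import json

# Hypothetical shape for role_data.json; the actual file in the PR may differ.
ROLE_DATA_JSON = """
{
  "Backend Developer": [
    {"question": "How do you design a REST API?",
     "answer": "I start from the resources and their relationships."}
  ],
  "default": [
    {"question": "What motivates you in your work?",
     "answer": "Solving concrete problems for users."}
  ]
}
"""


def qa_for_role(role_data: dict, role: str) -> list:
    """Return the Q&A bank for a role, or the 'default' bank if the role has none."""
    return role_data.get(role, role_data["default"])


role_data = json.loads(ROLE_DATA_JSON)
backend_bank = qa_for_role(role_data, "Backend Developer")
# "Data Scientist" has no entry above, so the lookup falls back to the default bank.
fallback_bank = qa_for_role(role_data, "Data Scientist")
```

With a fallback like this, a role can be added to `roles.json` before its dedicated questions exist, at the cost of more generic conversations for that role.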

Comment thread ai/llm-core/datasets/generator/dataset-generator.py Outdated
Comment thread ai/llm-core/datasets/generator/dataset-generator.py Outdated
Comment thread ai/llm-core/datasets/generator/dataset-generator.py Outdated
Comment thread ai/llm-core/datasets/generator/config/roles.json
@railway-app railway-app Bot temporarily deployed to talk-up-ai / TalkUp.AI-pr-129 April 12, 2026 07:26 Destroyed
Owner

@BhuvanArn BhuvanArn left a comment

👍

@Andriamanampisoa Andriamanampisoa merged commit ab020e1 into staging Apr 12, 2026
8 checks passed
@Andriamanampisoa Andriamanampisoa deleted the 21-llm-datasets branch April 12, 2026 08:22