A comprehensive pipeline for generating AI character files and training datasets by analyzing public figures' online presence on Twitter. Its stages transform raw Twitter data into structured character profiles suitable for AI training.
⚠️ IMPORTANT: Create a new Twitter account for this tool. DO NOT use your main account as it may trigger Twitter's automation detection and result in account restrictions.
This pipeline allows you to:
- Scrape tweets from any public Twitter profile
- Process and organize the raw data
- Generate detailed character profiles
- Create structured AI training datasets
- Generate fine-tuned character models
Scrapes tweets from a specified Twitter account and saves them in a structured format:
- Raw tweets are stored in pipeline/{username}/{date}/raw/tweets.json
- URLs are extracted to pipeline/{username}/{date}/raw/urls.txt
- Media files are saved to pipeline/{username}/{date}/raw/media/

bun run twitter -- username

Processes the raw Twitter data to create a comprehensive character profile:
- Analyzes tweet patterns and content
- Extracts behavioral characteristics
- Identifies communication style
- Stores results in characters/{username}.json

bun run character -- username YYYY-MM-DD

Creates a refined AI-ready profile by:
- Selecting representative tweet examples
- Extracting key topics using NLP
- Analyzing communication characteristics
- Identifying language patterns
- Generating a structured profile in aigent/{username}.json

bun run aigent -- username

The generated profile includes:
- name: Character's name
- tweetExamples: 10 representative tweets
- characteristics: AI-generated behavioral analysis
- topics: Key topics extracted using NLP
- language: Primary language used
- twitterUsername: Original Twitter handle
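The field list above can be written down as a type. The following is a minimal sketch, not the project's actual schema: the field names come from the list, but the value types (for example, `characteristics` as a single string) and the names `AigentProfile` and `checkProfile` are assumptions made for illustration.

```typescript
// Hypothetical shape of aigent/{username}.json. Field names follow the README;
// the value types are assumptions.
interface AigentProfile {
  name: string;
  tweetExamples: string[];
  characteristics: string;
  topics: string[];
  language: string;
  twitterUsername: string;
}

// Returns a list of problems; an empty list means the value matches the sketch.
function checkProfile(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["profile is not an object"];
  }
  const profile = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["name", "characteristics", "language", "twitterUsername"]) {
    if (typeof profile[key] !== "string") {
      problems.push(`${key} should be a string`);
    }
  }
  for (const key of ["tweetExamples", "topics"]) {
    const field = profile[key];
    if (!Array.isArray(field) || !field.every((item) => typeof item === "string")) {
      problems.push(`${key} should be an array of strings`);
    }
  }
  return problems;
}

// Illustrative values only, not real pipeline output.
const exampleProfile: AigentProfile = {
  name: "Example Person",
  tweetExamples: ["first example tweet", "second example tweet"],
  characteristics: "placeholder analysis text",
  topics: ["example topic"],
  language: "en",
  twitterUsername: "example_user",
};
const exampleProblems = checkProfile(exampleProfile); // empty when the shape matches
```

A check like this could run right after loading a generated file, so a malformed profile is caught before it is used for training.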
Uses the processed data to create custom AI models:
- Prepares training datasets
- Fine-tunes language models
- Creates character-specific models
bun run finetune # Regular fine-tuning
bun run finetune:test # Fine-tuning with test set
- Install dependencies:
  bun i
- Copy .env.example to .env and configure:
  # (Required) Twitter Authentication
  TWITTER_USERNAME= # your twitter username
  TWITTER_PASSWORD= # your twitter password
  # (Optional) Scraping Configuration
  MAX_TWEETS= # max tweets to scrape (default: 1000)
  MAX_RETRIES= # max retries for scraping (default: 3)
  RETRY_DELAY= # delay between retries (default: 5000)
  MIN_DELAY= # minimum delay between requests (default: 1000)
  MAX_DELAY= # maximum delay between requests (default: 3000)
- Scrape tweets from a user:
  bun run twitter -- tomkowalczyk
- Generate character profile:
  bun run character -- tomkowalczyk 2025-01-28
- Create AI-ready profile:
  bun run aigent -- tomkowalczyk
- Fine-tune model:
  bun run finetune
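To illustrate how the optional scraping variables and their documented defaults might be read, here is a minimal sketch. The names `ScrapeConfig`, `loadScrapeConfig`, and `pickDelay` are hypothetical, and choosing a random delay between MIN_DELAY and MAX_DELAY is an assumption about how the two bounds are used, not a description of the tool's actual code.

```typescript
interface ScrapeConfig {
  maxTweets: number;
  maxRetries: number;
  retryDelay: number;
  minDelay: number;
  maxDelay: number;
}

// Parse an optional variable, falling back to the documented default when it
// is unset, empty, or not a non-negative integer.
function readInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : fallback;
}

// Defaults are the ones listed in the .env comments above.
function loadScrapeConfig(env: Record<string, string | undefined>): ScrapeConfig {
  const config: ScrapeConfig = {
    maxTweets: readInt(env["MAX_TWEETS"], 1000),
    maxRetries: readInt(env["MAX_RETRIES"], 3),
    retryDelay: readInt(env["RETRY_DELAY"], 5000),
    minDelay: readInt(env["MIN_DELAY"], 1000),
    maxDelay: readInt(env["MAX_DELAY"], 3000),
  };
  if (config.minDelay > config.maxDelay) {
    throw new Error("MIN_DELAY must not exceed MAX_DELAY");
  }
  return config;
}

// Assumption: each request waits a random whole delay in [minDelay, maxDelay].
// `random` is injectable so the choice can be made deterministic in tests.
function pickDelay(config: ScrapeConfig, random: () => number = Math.random): number {
  const span = config.maxDelay - config.minDelay + 1;
  return Math.floor(config.minDelay + random() * span);
}

// Typical call site under Bun or Node: loadScrapeConfig(process.env)
```

Falling back to the default on malformed input keeps a typo in .env from silently disabling the delay between requests.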
project/
├── pipeline/              # Raw scraped data
│   └── {username}/
│       └── {date}/
│           └── raw/
│               ├── tweets.json
│               ├── urls.txt
│               └── media/
├── characters/            # Processed character profiles
│   └── {username}.json
└── aigent/                # AI-ready profiles
    └── {username}.json
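The layout above can be expressed as a small path helper. This is a sketch under stated assumptions: `runPaths` and `RunPaths` are hypothetical names, and the `{date}` segment is assumed to use the same YYYY-MM-DD form that the character command takes as its argument.

```typescript
interface RunPaths {
  tweets: string;
  urls: string;
  mediaDir: string;
  characterFile: string;
  aigentFile: string;
}

// Build the relative paths shown in the project tree for one user and one
// scrape date. Assumes the date directory is named YYYY-MM-DD.
function runPaths(username: string, date: string): RunPaths {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error(`expected a YYYY-MM-DD date, got "${date}"`);
  }
  const raw = `pipeline/${username}/${date}/raw`;
  return {
    tweets: `${raw}/tweets.json`,
    urls: `${raw}/urls.txt`,
    mediaDir: `${raw}/media/`,
    characterFile: `characters/${username}.json`,
    aigentFile: `aigent/${username}.json`,
  };
}
```

For example, `runPaths("tomkowalczyk", "2025-01-28").tweets` evaluates to `pipeline/tomkowalczyk/2025-01-28/raw/tweets.json`.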
- Uses compromise for natural language processing
- Implements intelligent topic extraction
- Performs sentiment and style analysis
- Generates weighted topic scoring
- Handles multi-word phrases and domain-specific terms
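To make "weighted topic scoring" and "multi-word phrases" concrete, here is a deliberately simplified sketch. It does not use compromise and is not the project's algorithm: it is a plain frequency heuristic that counts single words and adjacent word pairs, giving pairs a higher weight. The stopword list, the weight of 2, and the function names are all assumptions.

```typescript
// Small illustrative stopword list; a real one would be much longer.
const STOPWORDS = new Set(["the", "and", "for", "with", "that", "this", "are", "was"]);

function isStop(word: string): boolean {
  return word.length < 3 || STOPWORDS.has(word);
}

// Lowercase, drop URLs, and split on anything that is not a letter or digit.
function splitWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/https?:\/\/\S+/g, " ")
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0);
}

// Score each non-stopword by 1 per occurrence, and each pair of adjacent
// non-stopwords by `pairWeight` per occurrence.
function scoreTopics(tweets: string[], pairWeight = 2): Map<string, number> {
  const scores = new Map<string, number>();
  const bump = (key: string, by: number): void => {
    scores.set(key, (scores.get(key) ?? 0) + by);
  };
  for (const tweet of tweets) {
    const tokens = splitWords(tweet);
    for (let i = 0; i < tokens.length; i++) {
      const word = tokens[i];
      if (word === undefined || isStop(word)) continue;
      bump(word, 1);
      const next = tokens[i + 1];
      if (next !== undefined && !isStop(next)) bump(`${word} ${next}`, pairWeight);
    }
  }
  return scores;
}

// Highest score first; ties are broken alphabetically for stable output.
function topTopics(tweets: string[], limit = 5): string[] {
  return Array.from(scoreTopics(tweets).entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([topic]) => topic);
}
```

Weighting pairs above single words lets a phrase such as "machine learning" outrank its parts when they mostly appear together, which is the effect the multi-word handling above aims for.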
- Only works with public Twitter profiles
- Rate limited by Twitter's API restrictions
- Analysis is focused on English-language content
- Requires manual Twitter authentication