v1.0.5 | Methodology for LLM-powered project development -- task-model fit analysis, pipeline architecture, file system state machines, cost estimation, and architectural reduction. 1 skill | 2 references | 13 trigger evals, 3 output evals
Starting an LLM-powered project without methodology leads to a predictable sequence of failures. Teams jump straight into building automation around a task the model cannot actually do well, burning weeks before discovering that the fundamental approach is flawed. A developer spends three days building a batch processing pipeline, only to find that the model hallucinates on 30% of inputs -- something a five-minute manual test would have revealed.
Even when the task is well-suited to LLM processing, teams build monolithic pipelines where the expensive LLM call, the fragile output parsing, and the deterministic rendering are all tangled together in a single script. When the parsing breaks (and it will -- LLMs do not follow format instructions perfectly), the only option is to re-run the entire pipeline, including all the LLM calls at $0.03 per item. At 1,000 items, that is $30 per debugging cycle.
Architecture decisions compound the problem. Teams add 17 specialized tools when the model only needs two. They build complex multi-agent orchestrations when a simple batch pipeline would suffice. They skip cost estimation and discover a $500 surprise bill after their first production run. Without a structured methodology, every LLM project is an expensive experiment that teams repeat from scratch each time.
This plugin provides a complete methodology for LLM-powered project development, from initial task-model fit assessment through production pipeline architecture. It starts with the most important step most teams skip: manually testing one example with the target model before writing any code. This five-minute validation prevents weeks of wasted development.
The core architecture is a five-stage pipeline -- acquire, prepare, process, parse, render -- where each stage is discrete, idempotent, and cacheable. The expensive non-deterministic LLM call (stage 3) is isolated from the deterministic stages, so you can iterate on parsing and rendering without re-running LLM calls. The file system serves as the state machine: each processing unit gets a directory, and stage completion is marked by file existence.
The skill also covers task-model fit recognition (which problems benefit from LLMs and which do not), cost estimation formulas, architectural reduction (when removing tools improves performance), and the decision framework for single-agent vs multi-agent approaches. Real case studies ground the methodology: Karpathy's HN Time Capsule ($58 for 930 items) and Vercel's d0 agent (80% to 100% success rate by reducing from 17 tools to 2).
The more concretely you describe your task and constraints, the more accurate the task-model fit assessment, pipeline design, and cost estimate will be.
What information to include in your prompt:
- Task description: What transformation do you want to perform? (e.g., "extract structured financial data from earnings call transcripts")
- Input type and volume: What are you processing? How many items? What is the average size in words or tokens?
- Output format: What should the result look like? (JSON with specific fields, a summary paragraph, classification labels, etc.)
- Error tolerance: Is 95% accuracy acceptable or does every item need to be correct? What happens when the model is wrong?
- Latency constraints: Does this need to run in real time (per-user request) or can it run in batch overnight?
- Cost context: Do you have a budget constraint? Is this a one-time run or recurring (daily/weekly/monthly)?
- Existing attempts: Have you tried anything already? What happened? Did you test one example manually?
- Model preference: Any specific model constraints (cost, capability, API access)?
What makes results better:
- Sharing the result of a manual test ("I tried one example in Claude and it produced this output: [paste output]") immediately tells the skill whether the task is viable
- Specifying output fields explicitly ("I need: company name, revenue figure, guidance change, and risk factors as a JSON object") enables structured output prompt design
- Describing the failure mode you are trying to avoid ("the model sometimes invents revenue numbers") shapes the validation and retry strategy
- Giving the actual item size ("transcripts average 15,000 words") enables accurate cost estimation instead of guesses
What makes results worse:
- Starting with "build me an AI app" before testing one example manually
- Omitting volume and frequency (cost estimates are meaningless without these)
- Skipping error tolerance ("just make it work") prevents appropriate validation design
Template prompt:
I want to use an LLM to [task description]. Inputs: [what you are processing], approximately [size per item], [total volume] items. Outputs: [exact format or fields needed]. Error tolerance: [acceptable failure rate and consequences of errors]. This would run [frequency]. I [have / have not] tested one example manually. [If tested: the result was [describe].] Model preference: [Claude Sonnet / Haiku / no preference]. Budget constraint: [if any].
| Without this plugin | With this plugin |
|---|---|
| Build automation first, discover the model cannot do the task weeks later | Manual prototype in 5 minutes validates task-model fit before any code |
| Monolithic pipeline re-runs all LLM calls ($30+) when parsing breaks | Staged pipeline isolates LLM calls; iterate on parsing for free |
| No cost estimate until the bill arrives | Formula-based estimation before development: items x tokens x price + 20-30% buffer |
| 17 specialized tools, 80% success rate | Architectural reduction to 2 primitives, 100% success rate (Vercel d0 pattern) |
| Debug by reading logs and guessing | File system state machine: every intermediate result is a human-readable file |
| Default to multi-agent because it sounds sophisticated | Decision framework: single-agent for independent items, multi-agent only when context isolation is necessary |
Add the SkillStack marketplace and install:
/plugin marketplace add viktorbezdek/skillstack
/plugin install agent-project-development@skillstack
After installing, test with:
I want to build a batch pipeline that analyzes 500 customer reviews with LLMs -- help me plan the architecture
- Install the plugin using the commands above
- Ask:
I have an idea for an LLM project -- help me figure out if it's worth building - The skill walks you through task-model fit analysis: characteristics that match vs. don't match LLM strengths
- Once validated, ask:
Design the pipeline architecture for this project - The skill produces a staged pipeline with file system state management, cost estimate, and development plan
Task Idea
|
v
+-------------------+
| Task-Model Fit | <-- Manual prototype validation
| Assessment |
+-------------------+
|
v
+-------------------+ +-----------------------+
| Pipeline Design | --> | acquire -> prepare -> |
| (5-stage) | | process -> parse -> |
| | | render |
+-------------------+ +-----------------------+
| |
v v
+-------------------+ +-------------------+
| Cost Estimation | | File System State |
| items x tokens | | data/{id}/ |
| x price | | raw.json |
+-------------------+ | prompt.md |
| | response.md |
v | parsed.json |
+-------------------+ +-------------------+
| Architecture |
| Decision |
| single vs multi |
+-------------------+
| Component | Type | Description |
|---|---|---|
agent-project-development |
Skill | Complete methodology for LLM project planning, architecture, and development |
case-studies.md |
Reference | Detailed analysis of production LLM projects: Karpathy HN Capsule, Vercel d0, Manus patterns |
pipeline-patterns.md |
Reference | Detailed pipeline architecture patterns for batch processing, data analysis, and content generation |
What it does: Activates when you are starting an LLM-powered project, evaluating task-model fit, designing pipeline architecture, estimating costs, or deciding between single-agent and multi-agent approaches. Provides a structured methodology that starts with manual validation and progresses through staged pipeline design.
Input -> Output: A project idea or requirement description -> Task-model fit assessment, pipeline architecture, cost estimate, development plan, and architectural decision rationale.
When to use:
- Starting a new project that might benefit from LLM processing
- Designing a batch processing pipeline for structured outputs
- Estimating costs and feasibility before committing development time
- Deciding between single-agent and multi-agent approaches
- Refactoring an existing LLM pipeline that is too complex or expensive
When NOT to use:
- Evaluating agent quality or building evaluation rubrics -> use
agent-evaluation - Designing multi-agent coordination, handoffs, or routing -> use
multi-agent-patterns - Building the actual tools for your agent -> use
tool-design
Try these prompts:
I want to use an LLM to classify 10,000 support tickets by category and urgency. Tickets average 200 words. Categories: billing, technical, account, feature request. I need 95%+ accuracy because wrong classifications trigger incorrect routing. I tested one example and Claude got the category right but missed the urgency. Is this a good fit?
Design a batch pipeline that takes company 10-K filings and produces structured financial summaries. Filings are 50-150 pages. I need: revenue, EBITDA, YoY change, top 3 risk factors, and forward guidance -- as JSON. This runs quarterly on ~500 filings. I'm fine with Claude Sonnet.
My LLM pipeline costs $200 per run and takes 4 hours -- help me optimize the architecture. We process 1,000 product descriptions and generate marketing copy. It currently re-runs all LLM calls when the output formatter breaks. I want to cut cost and make it resumable.
Should I use a single agent or multi-agent setup for my research synthesis project? The agent reads 10-20 papers per query, extracts key findings, and writes a 500-word synthesis. Papers average 8,000 words each. Concerned about context limits.
Key references:
| Reference | Topic |
|---|---|
case-studies.md |
Karpathy's HN Time Capsule (930 items, $58), Vercel d0 (17 tools to 2), Manus refactoring patterns |
pipeline-patterns.md |
Acquire-prepare-process-parse-render pipeline, file system state machines, structured output design, parallelization |
| Bad (vague, won't activate well) | Good (specific, activates reliably) |
|---|---|
| "Build me an AI app" | "I want to analyze 500 product reviews and extract sentiment, key themes, and improvement suggestions -- help me plan the pipeline" |
| "Should I use agents?" | "I have a data enrichment task: take company names and produce structured profiles. Is this suited for LLM processing or should I use APIs?" |
| "Make it faster" | "My batch pipeline processes 1000 items but re-runs all LLM calls when parsing fails. How do I restructure it?" |
| "Use agent-project-development" | "Estimate the cost of running Claude Sonnet on 5,000 research papers with 4K token prompts and 2K token outputs" |
For task-model fit assessment:
I want to use an LLM to [task description]. The inputs are [input type and size]. The outputs need to be [output format]. Error tolerance is [high/low]. Help me assess whether this is a good fit for LLM processing.
For pipeline design:
Design a batch processing pipeline for [task]. I have [N] items, each is [item description]. I need [output format]. Show me the stage breakdown, file system structure, and parallelization approach.
For cost estimation:
Estimate the cost of processing [N] [items] with [model name]. Each item has approximately [N] input tokens and I expect [N] output tokens. Include a buffer for retries.
- Skipping manual validation: "Build the full pipeline for analyzing legal contracts" -- without first testing one contract manually, you risk building automation around a task the model cannot do. Always start with "test one example first."
- Requesting multi-agent by default: "I need a multi-agent system with a coordinator, researcher, and writer" -- multi-agent adds complexity. The skill will help you determine if your task actually needs context isolation across agents or if a simple pipeline suffices.
- Ignoring costs: "Process all 100K items in one run" -- token costs compound. The skill estimates costs before execution and recommends staging approaches (start with 100 items, validate quality, then scale).
Starting situation: You work at a fintech company with 5,000 quarterly earnings call transcripts. You need to extract structured data -- revenue figures, guidance changes, risk factors, and sentiment -- from each transcript and produce a searchable database. The transcripts average 15,000 words each.
Step 1: Task-model fit assessment. You ask: "Is extracting structured financial data from earnings call transcripts a good fit for LLM processing?" The skill walks through the fit criteria: synthesis across sources (yes -- each transcript is a single source, but the extraction requires understanding financial context), subjective judgment with rubrics (partially -- revenue figures are objective, but "risk factor significance" is subjective), error tolerance (moderate -- a few missed data points are acceptable, but fabricated revenue numbers are not). The skill identifies a key risk: hallucinated numbers. Recommendation: proceed with validation, but add a post-processing verification step for numerical claims.
Step 2: Manual prototype. You test one transcript manually. You paste it into Claude with a structured prompt requesting JSON output. The result is promising: revenue figures are correct, guidance changes are captured, risk factors are reasonable. Output tokens average 800 per transcript. Estimated input tokens: ~5,000 per transcript (after truncating non-essential sections like operator instructions).
Step 3: Cost estimation. You ask: "Estimate the cost for 5,000 transcripts with Claude Sonnet." The skill calculates: 5,000 items x (5,000 input + 800 output tokens) x pricing = approximately $195 base cost. With 25% retry buffer: ~$245 total. Wall-clock time with 15 parallel workers: approximately 2.5 hours. This is feasible for a quarterly run.
Step 4: Pipeline architecture. You ask: "Design the pipeline." The skill produces a five-stage architecture:
- Acquire: Download transcripts from SEC EDGAR API into
data/{ticker}_{date}/raw.txt - Prepare: Truncate to relevant sections, format as prompt with structured output template, write to
data/{ticker}_{date}/prompt.md - Process: Run LLM calls with 15 parallel workers, write responses to
data/{ticker}_{date}/response.md - Parse: Extract JSON from markdown response, validate numerical fields against regex patterns, write to
data/{ticker}_{date}/parsed.json - Render: Aggregate all parsed.json files into SQLite database and generate summary report
Each stage checks for output file existence before running, making the pipeline idempotent and resumable.
Step 5: Architectural decision. You consider whether to split extraction into specialized sub-agents (one for revenue, one for risk factors). The skill recommends against it: the transcript is a single document that fits in one context window, the extraction task does not exceed context limits, and a single prompt with multiple extraction targets is simpler and cheaper than coordinating multiple agents. Multi-agent would make sense if transcripts were 100K+ tokens or if different extraction tasks required different tools.
Step 6: Iteration. After the first run of 100 transcripts, parsing fails on 12% of items due to the model occasionally using different section headers. You fix the parser to handle variations (the skill warned about this) and re-run only the parse and render stages -- no LLM calls needed. Total cost of debugging: $0.
Gotchas discovered: The prepare stage needed to strip boilerplate sections (operator introductions, safe harbor disclaimers) to keep input tokens under 5,000. The skill recommended building a section-detection heuristic in the prepare stage rather than asking the LLM to ignore irrelevant content, saving ~30% on token costs.
Context: Your PM wants to use AI to auto-generate personalized onboarding emails for new users based on their signup data. You are not sure if this is actually a good use of LLMs versus template-based generation.
You say: "We want to generate personalized onboarding emails from user signup data -- name, industry, use case. Should we use an LLM or stick with templates?"
The skill provides:
- Task-model fit analysis against six suitability criteria
- Identification that this is a natural language output task with synthesis (good fit) but has deterministic personalization needs (templates might suffice)
- Decision framework: if emails need to vary significantly by industry context, use LLM; if it is slot-filling with fixed structure, use templates
- Manual prototype instructions to test one email with the target model
You end up with: A clear recommendation with rationale, and one manually-generated example email to compare against your current template output.
Context: Your content team has 2,000 blog posts and needs to categorize each by topic, extract key takeaways, and score readability. This needs to run monthly on new content.
You say: "Design a batch pipeline to analyze 2,000 blog posts for topic, takeaways, and readability score. It runs monthly."
The skill provides:
- Five-stage pipeline with file system state management
- Cost estimate ($40-60 per monthly run with Claude Haiku)
- Structured output prompt template with explicit format requirements
- Parallelization strategy (20 workers) and error handling
- Incremental processing: only analyze new/modified posts each month
You end up with: A production-ready pipeline architecture document with cost projections, directory structure, and implementation milestones.
Context: Your agent has 12 specialized tools for data analysis, but success rate is only 65%. Adding more tools and examples has not helped.
You say: "My data analysis agent has 12 tools and only succeeds 65% of the time. More tools don't seem to help. What's wrong?"
The skill provides:
- Architectural reduction analysis following the Vercel d0 pattern
- Identification that specialized tools may be constraining rather than enabling
- Recommendation to test with 2-3 primitive tools (bash, SQL, file read) instead
- Before/after comparison framework to measure improvement
You end up with: A reduction plan that removes 9 tools and replaces them with direct file access, with a test protocol to validate that simpler architecture actually improves success rate.
When does this skill recommend single-agent vs multi-agent?
Single-agent is the default recommendation when items are independent (no cross-item dependencies), context fits in one window, and the task does not require specialized sub-tools that would benefit from isolated context. Multi-agent is recommended when context isolation is needed (tasks exceed one context window), when specialized sub-agents measurably improve quality on specific subtasks, or when parallel exploration of different aspects is required. The primary reason for multi-agent is context isolation, not role anthropomorphization.
When does architectural reduction apply?
Reduction is recommended when success rate plateaus or declines as tools are added, when the data layer is well-documented and consistently structured (the model can navigate it directly), and when teams spend more time maintaining scaffolding than improving outcomes. The skill points to the Vercel d0 case study as evidence.
| Failure | Symptom | Recovery |
|---|---|---|
| Skipped manual validation | Pipeline built on a task the model hallucinates on consistently | Stop. Test one example manually. If it fails, the task is not suited for LLM processing. |
| Monolithic pipeline | Debugging requires re-running all LLM calls; costs escalate | Refactor into five discrete stages with file system checkpoints between each |
| Cost overrun | Bill 5x higher than expected | Re-estimate with actual token counts from first 100 items; add retry buffer; consider cheaper model tier for simple items |
| Over-engineered architecture | Many tools, low success rate, high maintenance | Apply architectural reduction: test with 2-3 primitive tools; measure before/after |
| Parsing fragility | LLM output format varies; parser breaks on 10%+ of items | Build flexible regex parsers, add format enforcement in prompt, log failures for review |
- Engineers starting their first LLM-powered project who need a structured methodology instead of trial-and-error
- Data teams building batch processing pipelines who need cost-effective, resumable, debuggable architectures
- Technical leads evaluating project feasibility who need task-model fit analysis and cost estimates before committing resources
- Teams with over-engineered agent systems who need to simplify architecture to improve success rates
- Evaluating agent quality after building -- once the pipeline is running, use
agent-evaluationfor systematic quality measurement - Designing multi-agent coordination -- if you already know you need multiple agents, use
multi-agent-patternsfor handoff protocols, routing, and orchestration - Building specific agent tools -- for tool schema design and error handling patterns, use
tool-design
- agent-evaluation -- Evaluate the outputs of the pipelines this skill helps you build
- multi-agent-patterns -- When your project outgrows single-agent architecture
- tool-design -- Design the tools your agent pipeline uses
- context-optimization -- Manage context efficiently in long-running agent tasks
- prompt-engineering -- Optimize the prompts within your pipeline stages
SkillStack plugin by Viktor Bezdek -- licensed under MIT.