<pclass="subtitle">Design, build, evaluate, and present a production-grade LLM application that integrates every major skill from this book</p>
157
+
<pclass="chapter-subtitle">Design, build, evaluate, and present a production-grade LLM application that integrates every major skill from this book</p>
<divclass="code-caption"><strong>Code Fragment 1:</strong>Cost tracking utility that estimates API spend from token usage. Mapping model names to per-token prices lets you monitor expenses programmatically.</div>
527
+
<divclass="code-caption"><strong>Code Fragment 1:</strong>TF-IDF plus logistic regression classifier for customer support tickets. The pipeline trains on bigram features (<code>ngram_range=(1, 2)</code>) and benchmarks inference at 0.12 ms per query, roughly 3,000x faster than an LLM API call, at negligible cost.</div>
528
528
529
529
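<p>The fragment itself is not reproduced in this diff. The block below is a minimal sketch of the pattern the revised caption describes, assuming scikit-learn; the training tickets and category labels are illustrative placeholders, not the book's data.</p>
<pre><code># Sketch: TF-IDF + logistic regression ticket classifier (illustrative data).
import time
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "I was charged twice for my subscription",
    "The app crashes when I open settings",
    "How do I change my email address?",
    "Please refund my last payment",
]
train_labels = ["billing", "bug", "account", "billing"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # bigram features, per the caption
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)

# Benchmark single-query inference latency.
query = ["I want my money back for the double charge"]
n_iter = 1000
start = time.perf_counter()
for _ in range(n_iter):
    clf.predict(query)
per_query_ms = (time.perf_counter() - start) * 1000 / n_iter
print(f"predicted={clf.predict(query)[0]}, {per_query_ms:.3f} ms/query")</code></pre>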
<p>For structured extraction tasks, regular expressions offer even faster, deterministic results. The following snippet demonstrates regex-based entity extraction for common patterns such as emails, phone numbers, and monetary amounts.</p>
<divclass="code-caption"><strong>Code Fragment 2:</strong>RLHF training loop using PPO to optimize the language model against a reward signal. The KL divergence penalty prevents drift from the reference model.</div>
573
+
<divclass="code-caption"><strong>Code Fragment 2:</strong>Regex-based entity extractor for deterministic structured patterns (emails, phone numbers, monetary amounts, dates). The benchmark loop of 10,000 iterations demonstrates sub-microsecond per-extraction latency with zero false positives and zero API cost.</div>
part-3-working-with-llms/module-11-hybrid-ml-llm/section-11.2.html (4 additions & 4 deletions)
@@ -329,7 +329,7 @@ <h3>1.2 Generating Embeddings with OpenAI</h3>
Cost: ~$0.00002 per text (text-embedding-3-small)
Total for 5 texts: ~$0.0001
</div>
-<div class="code-caption"><strong>Code Fragment 1:</strong> Embedding generation for converting text into dense vector representations. These vectors capture semantic meaning, enabling similarity search and clustering.</div>
+<div class="code-caption"><strong>Code Fragment 1:</strong> Batch embedding via OpenAI's <code>text-embedding-3-small</code> model. The <code>get_embeddings()</code> function sends multiple texts in a single API call and returns a NumPy array of shape (n_texts, 1536), with per-text cost at approximately $0.00002.</div>
<p>Code Fragment 2 loads the model locally via the Sentence-Transformers library.</p>
@@ -379,7 +379,7 @@ <h3>1.2 Generating Embeddings with OpenAI</h3>
Similarity between 'charged twice' and 'app crashes': 0.089
</div>
-<div class="code-caption"><strong>Code Fragment 2:</strong> Cost tracking utility that estimates API spend from token usage. Mapping model names to per-token prices lets you monitor expenses programmatically.</div>
+<div class="code-caption"><strong>Code Fragment 2:</strong> Local embedding with <code>SentenceTransformer('all-MiniLM-L6-v2')</code>, an 80 MB model producing 384-dimensional vectors. The <code>normalize_embeddings=True</code> flag enables direct dot-product similarity. At 5.7 ms per text on CPU with zero API cost, this is orders of magnitude cheaper than cloud embedding APIs.</div>
<h2>4. Combining Embeddings with Structured Features</h2>
@@ -444,7 +444,7 @@ <h2>4. Combining Embeddings with Structured Features</h2>
<divclass="code-caption"><strong>Code Fragment 3:</strong>Embedding generation for converting text into dense vector representations. These vectors capture semantic meaning, enabling similarity search and clustering.</div>
447
+
<divclass="code-caption"><strong>Code Fragment 3:</strong>Feature ablation study comparing structured-only, embeddings-only, and combined feature sets using XGBoost with 5-fold cross-validation. The combined configuration (<code>StandardScaler</code> on structured features concatenated with 384-dim embeddings) outperforms either source alone, demonstrating complementary signal.</div>
@@ -506,7 +506,7 @@ <h3>4.1 Semantic Caching as a Hybrid Pattern</h3>
            self.responses.pop(0)

        return response</code></pre>
-<div class="code-caption"><strong>Code Fragment 4:</strong> Embedding generation for converting text into dense vector representations. These vectors capture semantic meaning, enabling similarity search and clustering.</div>
+<div class="code-caption"><strong>Code Fragment 4:</strong> Semantic cache implementation using cosine similarity for cache lookup. The <code>SemanticCache.get_or_generate()</code> method embeds incoming queries, compares against stored vectors at a configurable <code>threshold</code> (default 0.95), and returns cached responses on hits, bypassing the LLM entirely.</div>
<p>The cost savings from semantic caching can be dramatic. In customer support applications where 30% to 50% of queries are paraphrases of common questions, semantic caching reduces LLM API costs proportionally while cutting median response latency from 1 to 2 seconds (LLM generation) to under 50 milliseconds (vector lookup). The embedding cost for the cache lookup is negligible: a single embedding API call costs roughly 1,000x less than a full LLM generation. For even lower latency, a local embedding model like all-MiniLM-L6-v2 can handle the cache lookup in under 5 milliseconds on CPU.</p>
<divclass="code-caption"><strong>Code Fragment 1:</strong>Anthropic Messages API call showing the distinct parameter layout. The system prompt is a top-level parameter rather than a message role, and max_tokens is required. Content blocks provide structured access to generated text.</div>
367
+
<divclass="code-caption"><strong>Code Fragment 1:</strong>Hybrid triage router in <code>TriageRouter</code> that uses a TF-IDF classifier as the first pass. When <code>confidence</code> exceeds the threshold (0.85), the classifier handles the query at $0.00001 per call. Ambiguous or mixed-intent queries (e.g., "change email and also get a refund") fall through to the LLM at $0.003.</div>
<p>Code Fragment 2 implements request routing.</p>
@@ -688,7 +688,7 @@ <h2>5. Lab: Building a Customer Support Pipeline</h2>
Cost: $0.00500
</div>
-<div class="code-caption"><strong>Code Fragment 4:</strong> Cost tracking utility that estimates API spend from token usage. Mapping model names to per-token prices lets you monitor expenses programmatically.</div>
+<div class="code-caption"><strong>Code Fragment 4:</strong> End-to-end customer support pipeline in <code>CustomerSupportPipeline</code>. Each ticket flows through classification, regex extraction, and conditional LLM escalation. The output includes <code>routing_tier</code>, <code>extracted_info</code>, suggested <code>action</code>, and <code>total_cost</code>, showing how layered processing keeps most tickets under $0.0001.</div>
part-3-working-with-llms/module-11-hybrid-ml-llm/section-11.4.html (1 addition & 1 deletion)
@@ -460,7 +460,7 @@ <h3>2.1 Mapping Your Frontier</h3>
Pareto-optimal configs: 6 / 10
</div>
-<div class="code-caption"><strong>Code Fragment 2:</strong> Few-shot prompting pattern providing labeled examples before the actual query. The examples establish the expected input-output format. Ordering and diversity of examples significantly affect output quality.</div>
+<div class="code-caption"><strong>Code Fragment 2:</strong> Pareto frontier analysis across 10 model configurations using <code>find_pareto_frontier()</code>. Each <code>ModelConfig</code> records accuracy, cost, and latency. The output marks dominated configurations (e.g., "Bad prompt" costs more than DistilBERT but achieves lower accuracy) and highlights the hybrid router as a Pareto-optimal point at one-fifth the cost of GPT-4o.</div>