Merged
29 changes: 28 additions & 1 deletion .github/workflows/main.yml
@@ -41,6 +41,9 @@ jobs:
docker stop bubblog-ai || true
docker rm bubblog-ai || true

docker stop bubblog-ai-worker || true
docker rm bubblog-ai-worker || true

docker pull $IMAGE_NAME:latest

docker run -d --name bubblog-ai -p 8000:3000 \
@@ -51,4 +54,28 @@ jobs:
-e ALGORITHM="${{ secrets.ALGORITHM }}" \
-e EMBED_MODEL="${{ secrets.EMBED_MODEL }}" \
-e CHAT_MODEL="${{ secrets.CHAT_MODEL }}" \
-e REDIS_URL="${{ secrets.REDIS_URL }}" \
-e REDIS_HOST="${{ secrets.REDIS_HOST }}" \
-e REDIS_PORT="${{ secrets.REDIS_PORT }}" \
-e EMBEDDING_QUEUE_KEY="${{ secrets.EMBEDDING_QUEUE_KEY }}" \
-e EMBEDDING_FAILED_QUEUE_KEY="${{ secrets.EMBEDDING_FAILED_QUEUE_KEY }}" \
-e EMBEDDING_WORKER_MAX_RETRIES="${{ secrets.EMBEDDING_WORKER_MAX_RETRIES }}" \
-e EMBEDDING_WORKER_BACKOFF_MS="${{ secrets.EMBEDDING_WORKER_BACKOFF_MS }}" \
$IMAGE_NAME:latest

docker run -d --name bubblog-ai-worker \
-e OPENAI_API_KEY="${{ secrets.OPENAI_API_KEY }}" \
-e DATABASE_URL="${{ secrets.DATABASE_URL }}" \
-e SECRET_KEY="${{ secrets.SECRET_KEY }}" \
-e TOKEN_AUDIENCE="${{ secrets.TOKEN_AUDIENCE }}" \
-e ALGORITHM="${{ secrets.ALGORITHM }}" \
-e EMBED_MODEL="${{ secrets.EMBED_MODEL }}" \
-e CHAT_MODEL="${{ secrets.CHAT_MODEL }}" \
-e REDIS_URL="${{ secrets.REDIS_URL }}" \
-e REDIS_HOST="${{ secrets.REDIS_HOST }}" \
-e REDIS_PORT="${{ secrets.REDIS_PORT }}" \
-e EMBEDDING_QUEUE_KEY="${{ secrets.EMBEDDING_QUEUE_KEY }}" \
-e EMBEDDING_FAILED_QUEUE_KEY="${{ secrets.EMBEDDING_FAILED_QUEUE_KEY }}" \
-e EMBEDDING_WORKER_MAX_RETRIES="${{ secrets.EMBEDDING_WORKER_MAX_RETRIES }}" \
-e EMBEDDING_WORKER_BACKOFF_MS="${{ secrets.EMBEDDING_WORKER_BACKOFF_MS }}" \
$IMAGE_NAME:latest node dist/worker/queue-consumer.js
203 changes: 67 additions & 136 deletions TASK.md

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,19 @@
version: "3.9"

services:
  api:
    build:
      context: .
    command: ["node", "dist/server.js"]
    env_file:
      - .env
    ports:
      - "3000:3000"
    restart: unless-stopped
  worker:
    build:
      context: .
    command: ["node", "dist/worker/queue-consumer.js"]
    env_file:
      - .env
    restart: unless-stopped
136 changes: 136 additions & 0 deletions docs/history-tasks/HybridSearchUpgradePlan.md
@@ -0,0 +1,136 @@
## Hybrid Search Upgrade Plan (Working Doc)

### 1. Current Implementation Snapshot
- `runHybridSearch(question, userId, plan)` (src/services/hybrid-search.service.ts)
- Embeds `[question, ...plan.rewrites]` and runs `findSimilarChunksV2` per embedding.
- Executes `textSearchChunksV2` once using the original question + keywords.
- Merges chunk candidates by `postId:postChunk`, keeps max vector/text score per chunk, min–max normalizes each modality, then fuses via `alpha` blend.
- Returns the top `plan.top_k` chunks (capped at 10); `plan.limit` is ignored here.
- The semantic-only fallback (`runSemanticSearch`) uses the same repository call without text blending.
- Planner (`generateSearchPlan`) currently emits rewrites/keywords but keyword quality/quantity varies; schema clamps counts post-hoc.
- Category filters from API are not wired into hybrid search; text rewrites are not reused in lexical search; chunk key uses raw text.
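The merge → normalize → fuse flow above can be sketched as follows. This is a minimal illustration, not the actual service code: `Candidate`, `minMaxNormalize`, and `fuse` are made-up names, and the collapse case noted in section 2 is called out inline.

```typescript
// Illustrative sketch of the chunk fusion described above; not the real types.
interface Candidate {
  key: string;          // `${postId}:${postChunk}` merge key
  vectorScore: number;  // max vector similarity across embeddings
  textScore: number;    // max lexical score
}

// Min–max normalize one modality across the candidate union.
// Collapse case from section 2: when max === min, every score maps to 0 here.
function minMaxNormalize(scores: number[]): number[] {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  return scores.map((s) => (range === 0 ? 0 : (s - min) / range));
}

// Blend the two modalities; alpha weights the vector (semantic) side.
function fuse(
  candidates: Candidate[],
  alpha: number
): { key: string; score: number }[] {
  const vec = minMaxNormalize(candidates.map((c) => c.vectorScore));
  const txt = minMaxNormalize(candidates.map((c) => c.textScore));
  return candidates
    .map((c, i) => ({ key: c.key, score: alpha * vec[i] + (1 - alpha) * txt[i] }))
    .sort((a, b) => b.score - a.score);
}
```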

### 2. Pain Points & Gaps
1. **Filtering gaps** – category/time filters partially ignored, final `limit` unused, vector threshold normalization can collapse to zero when max=min.
2. **Keyword quality** – LLM often emits multi-word phrases or duplicates; count not consistently within intended range.
3. **Rewrite redundancy** – All rewrites treated equally; no semantic-distance-aware weighting → aggressive rewrites may be undervalued or noisy ones over-weighted.
4. **Fused scoring sensitivity** – Min–max normalization across union is brittle when modalities have outliers; no similarity-based bonus for high-confidence hits.
5. **Post-level UX** – Current pipeline optimized for RAG chunk retrieval; no reusable API that returns deduplicated post-level hits with pagination.
6. **Observability** – Limited metrics around rewrite effectiveness, keyword usage, or threshold activations.

### 3. Goals & Guiding Principles
- Preserve strong recall via multi-embedding + lexical hybrid while adding stability and transparency.
- Make rewrite/keyword generation purposeful: enforce concise tokens, staged semantic drift, and maintain question coverage.
- Provide a standalone hybrid search endpoint for user-facing search with post-level results.
- Instrument similarity thresholds and modality contributions to support tuning.

### 4. Retrieval Quality Enhancements (Track A)

4.1 **Similarity Threshold Boosting**
- Reuse the existing retrieval bias labels (`lexical`, `balanced`, `semantic`) to derive both `alpha` and default modality thresholds (`sem_boost_threshold`, `lex_boost_threshold`) so planner output stays compact. Defaults (retain current behavior for now):
  - `lexical`: `alpha = 0.30`, `sem_boost_threshold = 0.65`, `lex_boost_threshold = 0.80`
  - `balanced`: `alpha = 0.50`, `sem_boost_threshold = 0.70`, `lex_boost_threshold = 0.75`
  - `semantic`: `alpha = 0.75`, `sem_boost_threshold = 0.80`, `lex_boost_threshold = 0.65`
- Encode the mapping as a single constants table (e.g., `RETRIEVAL_BIAS_PRESETS`) so both planner normalization and hybrid scoring reference identical values.
- Permit optional overrides in `plan.hybrid`, but clamp to sensible bounds (e.g., 0.4–0.85) for consistency.
- When a normalized vector/text score crosses its threshold, apply a bounded boost (e.g., multiply by 1.1–1.3 or add 0.1), log activations, and cap boosts to maintain ranking stability.
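The constants table and bounded boost above could be sketched like this. The preset values come from the defaults listed in this section; the camelCase field names and the 1.2 factor are illustrative choices within the stated 1.1–1.3 bound.

```typescript
// Illustrative constants table for the bias presets listed above.
const RETRIEVAL_BIAS_PRESETS = {
  lexical:  { alpha: 0.30, semBoostThreshold: 0.65, lexBoostThreshold: 0.80 },
  balanced: { alpha: 0.50, semBoostThreshold: 0.70, lexBoostThreshold: 0.75 },
  semantic: { alpha: 0.75, semBoostThreshold: 0.80, lexBoostThreshold: 0.65 },
} as const;

type RetrievalBias = keyof typeof RETRIEVAL_BIAS_PRESETS;

// Bounded boost: multiply by a capped factor once the normalized score
// crosses its threshold, and never let the boosted score exceed 1.
function applyBoost(normalizedScore: number, threshold: number, factor = 1.2): number {
  if (normalizedScore < threshold) return normalizedScore;
  return Math.min(1, normalizedScore * factor);
}
```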

4.2 **Rewrite Strategy & Weighting**
- Update planner prompt to generate staged rewrites:
  - `rewrite_1`: conservative paraphrase.
  - `rewrite_2`: adds a synonymous term / clarifies the entity.
  - `rewrite_3+`: higher semantic drift or alternative framing.
- After plan normalization (`search-plan.service.ts`):
  - Compute embedding-based cosine similarity between the original question and each rewrite.
  - Drop rewrites below a floor (e.g., <0.35) or route them to lexical-only usage.
  - Derive per-rewrite weights (e.g., `weight = clamp(similarity, 0.6, 1.2)`) and supply them to `runHybridSearch`.
  - Similarity calculations use fresh embedding API calls (no caching) for both the question and rewrites within the request.
- In hybrid service, apply weights when aggregating vector scores (weighted max/avg instead of pure max) so high-quality rewrites contribute proportionally.
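The drop-and-weight step above, sketched as pure helpers. `SIMILARITY_FLOOR` and `rewriteWeight` are hypothetical names; the floor (0.35) and clamp bounds (0.6–1.2) come from the examples in the bullets, and `null` stands in for "drop or route to lexical-only".

```typescript
// Plain cosine similarity over embedding vectors (assumed equal length).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const SIMILARITY_FLOOR = 0.35; // rewrites below this are dropped / lexical-only

// weight = clamp(similarity, 0.6, 1.2), per the example above;
// null signals the rewrite should not contribute to vector scoring.
function rewriteWeight(similarity: number): number | null {
  if (similarity < SIMILARITY_FLOOR) return null;
  return Math.min(1.2, Math.max(0.6, similarity));
}
```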

4.3 **Keyword Constraints & Quality**
- Modify `planSchema` / prompt: keywords must be single Korean/English tokens (no spaces), trimmed, 1–5 items.
- In normalization, enforce `.slice(0,5)`, drop tokens <2 chars or containing whitespace/punctuation (except hyphen/underscore if needed).
- Extend text search to run over `[question, ...filtered rewrites]` for lexical recall or compute textual similarity per rewrite (optional v2 step).
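The keyword cleanup rules above can be sketched as one normalization pass. The regexes are one possible reading of the punctuation rule (letters, digits, hyphen, underscore allowed); the function name is illustrative.

```typescript
// Sketch of the keyword normalization above: trim, drop short/multi-word/
// punctuated tokens, dedupe, and cap at 5 items.
function normalizeKeywords(raw: string[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const k of raw) {
    const token = k.trim();
    if (token.length < 2) continue;            // drop tokens < 2 chars
    if (/\s/.test(token)) continue;            // single tokens only, no spaces
    if (/[^\p{L}\p{N}_-]/u.test(token)) continue; // allow hyphen/underscore only
    if (seen.has(token)) continue;             // dedupe
    seen.add(token);
    out.push(token);
    if (out.length === 5) break;               // enforce .slice(0, 5)
  }
  return out;
}
```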

4.4 **Repository/Data Adjustments**
- Update `findSimilarChunksV2` / `textSearchChunksV2` to return `chunk_index`, `post_created_at`, and optionally `post_tags` for downstream boosts.
- Tag aggregation via `post_tag` ⇔ `tag` should be added only if such tables exist; otherwise return `[]` and skip joins.
- Switch dedup key to `${postId}:${chunk_index}` to avoid string-heavy keys.
- Filters wiring: Do NOT add `filters.category_ids` to the plan. Keep the plan schema limited to `filters.time`.
  - Use `categoryId` from the controller as a server-side pre-filter only.
  - Derive `from/to` from the normalized plan time window (label → absolute) and apply in repositories.
- Respect `plan.limit` at the final slicing stage.
- Keep retrieval as exact KNN on `pgvector` (ORDER BY `<=>`); `top_k` stays per-source fetch size while final slicing respects `plan.limit`.
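The dedup-key switch above, as a minimal sketch; the `Hit` shape and function name are illustrative, keeping only the max score per `${postId}:${chunk_index}` key as the current merge does per chunk.

```typescript
// Dedupe chunk candidates on the numeric chunk_index key instead of raw text.
interface Hit { postId: string; chunkIndex: number; score: number }

function dedupeByChunk(hits: Hit[]): Map<string, Hit> {
  const best = new Map<string, Hit>();
  for (const h of hits) {
    const key = `${h.postId}:${h.chunkIndex}`;
    const prev = best.get(key);
    if (!prev || h.score > prev.score) best.set(key, h); // keep max score per chunk
  }
  return best;
}
```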

<!-- moved to Backlog: see section 11 -->

### 5. Search API & Post-Level Experience (Track B)
- **Service decomposition** – Extract shared primitive `buildHybridCandidates({ question, rewrites, keywords, plan, userId, categoryId })` returning chunk-level scores + metadata + diagnostic stats.
- **Post aggregation** – Create aggregator to deduplicate by post (max score, optional average of top 2, representative snippet) and apply deterministic `limit/offset` pagination (page size default 10, max 10).
- **Public API** – Add unauthenticated REST endpoint (JSON, no SSE) such as `GET /search/hybrid` accepting question, filters, paging params; reuse the planner or a lightweight variant as UX dictates.
- **QA reuse** – `answerStreamV2` continues calling chunk-level layer; search endpoint uses same embeddings/threshold logic to avoid drift.
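The post-level aggregation and paging described above can be sketched as follows, assuming max-chunk-score fusion with the snippet taken from the best chunk and a deterministic tie-break. `aggregateToPosts` and the shapes are hypothetical, not the planned API.

```typescript
// Sketch: dedupe chunk hits by post, keep the best chunk's score and snippet,
// sort deterministically, then apply limit/offset (page size capped at 10).
interface ChunkHit { postId: string; score: number; snippet: string }
interface PostHit { postId: string; score: number; snippet: string }

function aggregateToPosts(hits: ChunkHit[], limit = 10, offset = 0): PostHit[] {
  const byPost = new Map<string, PostHit>();
  for (const h of hits) {
    const prev = byPost.get(h.postId);
    if (!prev || h.score > prev.score) {
      byPost.set(h.postId, { postId: h.postId, score: h.score, snippet: h.snippet });
    }
  }
  return Array.from(byPost.values())
    .sort((a, b) => b.score - a.score || a.postId.localeCompare(b.postId))
    .slice(offset, offset + Math.min(limit, 10));
}
```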

### 6. Prompts & Planner Improvements
- Update `buildSearchPlanPrompt` instructions:
  - Require keywords to be single words; explicitly request "1–5 single keywords" (Korean prompt wording: "1~5 단일 키워드").
  - Outline staged rewrite roles to nudge LLM output.
  - Remind that temporal expressions stay in `filters.time`.
- Keep the client-facing `planSchema` minimal (no explicit `alpha`/threshold/weight fields). Server derives weights/thresholds internally from the retrieval bias label and does not surface them to the frontend.
- Update schema docs only for keyword bounds (1–5) and any internal validation notes; no additional fields are exposed over the API.
- In normalization, log keyword count, rewrite count, threshold values to support telemetry.

### 7. Observability & Telemetry
- Structured logs/SSE events:
  - For each query: number of rewrites retained, similarity weights, threshold boosts triggered, and counts per modality.
- Emit metrics for search endpoint (total posts returned, pagination info, latency).
- Standardize a log payload (e.g., `type: 'retrieval.boost', bias, alpha, sem_thr, lex_thr, modality, original_score, boosted_score`) to simplify analysis and tuning.
- Add debug flags to inspect per-rewrite vector/text hit lists for evaluation.

### 8. Performance Considerations
- Generate embeddings for `[question, rewrites]` with fresh API calls per request (no caching); accept the additional cost for correctness.
- Cap total vector queries by `plan.hybrid.max_rewrites`; consider batching embeddings via OpenAI API if supported.
- Monitor effect of threshold boosts on latency; adjust SQL to prefetch needed metadata in single round-trip.

### 9. Execution Roadmap (Detailed)

**Phase 0 – Foundations & Bugfixes**
- Task 0.1: Thread request filters (`categoryId`, `limit`) through `qa.v2.service.ts`. Do NOT add `filters.category_ids` to the plan; the server applies `categoryId` as a pre-filter. Derive `from/to` solely from the normalized plan `filters.time` (label → absolute) and use in repositories.
- Task 0.2: Honor `plan.limit` when returning hybrid results, switch dedupe key to `${postId}:${chunk_index}`, and propagate `chunk_index` through types.
- Task 0.3: Expand `findSimilarChunksV2`/`textSearchChunksV2` to select `chunk_index`, `post_created_at`, and optionally aggregated `post_tags` (only if tag tables exist); update SQL joins and DTOs with safe fallbacks.
- Task 0.4: Update hybrid/semantic services to surface new metadata in SSE payloads, keeping backward compatibility for existing clients.

**Phase 1 – Planner & Prompt Hardening**
- Task 1.1: Tighten `planSchema` validation (keywords 1–5 single tokens, rewrites ≤ max_rewrites) and normalize via shared helpers with telemetry hooks.
- Task 1.2: Revise `buildSearchPlanPrompt` instructions to enforce staged rewrites, single-token keywords, and explicit temporal guidance; add regression fixtures for prompt drift.
- Task 1.3: Implement normalization pass that cleans keywords, generates embeddings for rewrites, filters low-similarity variants, and records per-rewrite cosine similarity.
- Task 1.4: Persist summary logs (`rewrites_len`, similarity weights, keyword counts) via structured logger for observability.

**Phase 2 – Retrieval Scoring Upgrades**
- Task 2.1: Introduce `RETRIEVAL_BIAS_PRESETS` mapping (`alpha`, `sem_boost_threshold`, `lex_boost_threshold`) and clamp overrides in normalization.
- Task 2.2: Apply threshold-based boosts in `runHybridSearch`, logging activations and capping final scores for stability.
- Task 2.3: Weight vector scores by rewrite similarity (e.g., weighted max/avg) and expose diagnostics per rewrite.
- Task 2.4: Extend lexical search to iterate across `[question, rewrites]`, merging results while respecting keyword filters and avoiding redundant queries.
- Task 2.5: Enforce post-level diversity (max N chunks/post) before final ranking and respect `plan.limit` after fusion.

**Phase 3 – Search API Delivery**
- Task 3.1: Extract `buildHybridCandidates` service returning chunk-level hits plus diagnostics; retrofit QA flow to consume it.
- Task 3.2: Build post aggregation layer (score fusion, snippet selection, pagination respecting `limit/offset`) with deterministic ordering.
- Task 3.3: Add `GET /search/hybrid` route, request validation, and integration tests covering filters, pagination, and telemetry events.
- Task 3.4: Document API usage and ensure rate-limiting/auth hooks match product requirements.

**Phase 4 – Tuning & Observability**
- Task 4.1: Emit structured SSE/log events for threshold boosts, rewrite weighting, keyword pruning, and modality contributions.
- Task 4.2: Backfill dashboards or log queries (e.g., BigQuery/Redash) to monitor latency, hit counts, and boost frequency.
- Task 4.3: Create evaluation playbook with canonical queries, offline regression scripts, and guidance for tuning boost factors.
- Task 4.4: Investigate alternative fusion strategies (RRF/z-score) gated behind feature flags for safe experimentation.

### 10. Open Questions
- Do we need separate planner settings for public search vs QA (e.g., higher keyword count)?
- Should rewrite weights persist back into plan schema for transparency to the client?
- What default boost factors strike best balance between recall and precision? Requires offline eval.

---
Use this document as the anchor before implementation; update sections as design decisions finalize or metrics inform threshold choices.

### 11. Backlog
- Normalization stability (min–max collapse): evaluate mitigations without immediate implementation. Candidates include constant fallback (e.g., 0.5), epsilon guards, rank-based fusion (RRF), z-score fusion, unimodal fallback, and telemetry for activation frequency.
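Of the fusion candidates listed, rank-based fusion is easy to sketch: reciprocal rank fusion (RRF) scores by rank position only, so it sidesteps the min–max collapse entirely. Illustrative only, not a committed design; `k = 60` is the commonly used default constant.

```typescript
// RRF sketch: each ranked list contributes 1 / (k + rank) per item,
// where rank is 1-based; scores are summed across lists.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return scores;
}
```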
71 changes: 71 additions & 0 deletions docs/reports/REPORT-embedding-worker.md
@@ -0,0 +1,71 @@
# Report: Introducing a Redis-Queue-Based Embedding Worker and Deployment Setup

## 1. Overview
- Goal: process embedding generation asynchronously through a Spring Boot → Redis → Node.js pipeline, running a worker separate from the Express API.
- Status: the worker entry point, environment variable schema, docker-compose file, and GitHub Actions deployment flow are all in place.
- Scope: keep the existing API server code intact while adding Redis queue consumption logic, and run the API and worker as separate containers from a single Docker image.

## 2. Worker Structure
- File: `src/worker/queue-consumer.ts`
- Redis connection: `REDIS_URL` (takes precedence) or `REDIS_HOST`/`REDIS_PORT`.
- Job format: `{ postId, title?, content?, attempt? }`.
- Processing sequence:
  1. Block on `EMBEDDING_QUEUE_KEY` via `BRPOP`.
  2. Process the title (`storeTitleEmbedding`) and then the body (`chunkText` → `createEmbeddings` → `storeContentEmbeddings`) in order.
  3. On error, retry: increment `attempt`, back off based on `EMBEDDING_WORKER_MAX_RETRIES` and `EMBEDDING_WORKER_BACKOFF_MS`, and move the job to `EMBEDDING_FAILED_QUEUE_KEY` once the limit is exceeded.
- Other: graceful shutdown on SIGINT/SIGTERM; key events recorded via DebugLogger.
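The retry decision in step 3 can be kept pure for testability. A minimal sketch under the assumption of a fixed backoff delay; `nextOnError` and `NextStep` are hypothetical names, not the actual worker code.

```typescript
// Sketch of the retry/failure decision on the error path described above.
interface EmbeddingJob {
  postId: string;
  title?: string;
  content?: string;
  attempt?: number;
}

type NextStep =
  | { action: "retry"; job: EmbeddingJob; delayMs: number } // re-enqueue after backoff
  | { action: "fail"; job: EmbeddingJob };                  // move to EMBEDDING_FAILED_QUEUE_KEY

function nextOnError(job: EmbeddingJob, maxRetries: number, backoffMs: number): NextStep {
  const attempt = (job.attempt ?? 0) + 1;
  if (attempt > maxRetries) return { action: "fail", job: { ...job, attempt } };
  return { action: "retry", job: { ...job, attempt }, delayMs: backoffMs };
}
```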

## 3. Environment Variables (New)
| Key | Purpose | Default |
| --- | --- | --- |
| `REDIS_URL` | External Redis connection URL (takes precedence) | none |
| `REDIS_HOST` / `REDIS_PORT` | Host/port when no URL is set | `127.0.0.1` / `6379` |
| `EMBEDDING_QUEUE_KEY` | Job queue name | `embedding:queue` |
| `EMBEDDING_FAILED_QUEUE_KEY` | Failed-job queue name | `embedding:failed` |
| `EMBEDDING_WORKER_MAX_RETRIES` | Maximum retry count | `3` |
| `EMBEDDING_WORKER_BACKOFF_MS` | Delay between retries (ms) | `5000` |

## 4. Docker Image & Execution
- Dockerfile default CMD: `node dist/server.js` (Express API).
- The same image is reused for the worker by overriding the command: `docker run ... node dist/worker/queue-consumer.js`.
- No pm2 needed: each container runs a single process, and Docker's `restart` policy handles recovery.

## 5. docker-compose (Development)
```yaml
services:
  api:
    build: .
    command: ["node", "dist/server.js"]
    env_file: [.env]
    ports: ["3000:3000"]
    restart: unless-stopped

  worker:
    build: .
    command: ["node", "dist/worker/queue-consumer.js"]
    env_file: [.env]
    restart: unless-stopped
```
- An external Redis is the default assumption. If needed, add a Redis service for the development environment only and point `.env` at that container.

## 6. GitHub Actions Deployment (main.yml)
- Image: build and push `${{ secrets.DOCKER_USERNAME }}/bubblog-ai:latest`.
- EC2 deployment steps:
  1. Stop and remove the existing `bubblog-ai` and `bubblog-ai-worker` containers.
  2. Pull the latest image.
  3. Run the API container (default CMD).
  4. Run the worker container (`node dist/worker/queue-consumer.js` command).
- Both containers receive the Redis and retry-related Secrets to avoid configuration gaps.
- Secrets (examples): `REDIS_URL`, `REDIS_HOST`, `REDIS_PORT`, `EMBEDDING_QUEUE_KEY`, `EMBEDDING_FAILED_QUEUE_KEY`, `EMBEDDING_WORKER_MAX_RETRIES`, `EMBEDDING_WORKER_BACKOFF_MS`, etc.

## 7. Operational Notes
- The Spring Boot producer enqueues jobs via LPUSH (already implemented).
- Redis runs on an external/managed server; containers in this project act only as consumers.
- The failed queue (`embedding:failed`) needs monitoring and a reprocessing strategy (e.g., an RPOP → LPUSH retry scheduler).
- The API container does not strictly need the Redis variables, but they are injected into both containers so the command can be overridden in an emergency.

## 8. Follow-Up Checklist
- [ ] Verify Redis/DB connectivity and successful embedding storage in staging.
- [ ] Set up failed-queue monitoring and alerts.
- [ ] Define a worker scale-out strategy (confirm there are no processing conflicts when the container count grows).
- [ ] Review Redis access control and whether TLS is in place.