diff --git a/CLAUDE.md b/CLAUDE.md index e276d36..8ed9263 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -17,6 +17,14 @@ LIF Core (Learner Information Framework) is a modular monorepo for aggregating l - [`docs/specs/data-model-rules.md`](docs/specs/data-model-rules.md) — PascalCase entities vs camelCase scalars and the files that must follow the convention. - [`docs/operations/guides/adding-a-new-microservice.md`](docs/operations/guides/adding-a-new-microservice.md) — runbook for standing up a new HTTP service (Polylith brick layout, pyproject hygiene, Dockerfile2, AuthMiddleware, compose wiring). +Detail extracted from this file lives in: + +- [`docs/operations/guides/testing.md`](docs/operations/guides/testing.md) — unit/integration test principles, sample data, test users, service-layer order. +- [`docs/design/cross-cutting/schema-loading.md`](docs/design/cross-cutting/schema-loading.md) — schema loading pattern, `SchemaStateManager`, PascalCase/camelCase convention, Strawberry GraphQL details. +- [`docs/design/components/semantic-search.md`](docs/design/components/semantic-search.md) — semantic search MCP server architecture, endpoints, tools. +- [`docs/operations/guides/graphql-api-keys.md`](docs/operations/guides/graphql-api-keys.md) — GraphQL org1 API key auth and key management. +- [`docs/operations/guides/deployment.md`](docs/operations/guides/deployment.md) — deployment scripts, env config, MDR migrations, ECS debugging, Docker build gotchas. + Each base and component also has a brick-level `README.md` describing purpose, public surface, and consumers. ## Commands @@ -113,76 +121,11 @@ Types encouraged: `feat:`, `fix:`, `docs:`, `refactor:` ## Testing -### Unit Test Principles - -Write tests that earn their keep. Every test should justify its existence by verifying something non-obvious. 
- -**What to test:** -- **Non-trivial transformations** — regex, recursion, type dispatch, multi-step logic where inputs interact in non-obvious ways -- **Boundary conditions** — empty inputs, None values, edge cases where behavior changes (e.g., leading digits in identifiers) -- **Regression tests for bugs** — every bug fix should include a test that would have caught it. The test should fail without the fix. -- **Integration-style unit tests** — testing a function end-to-end with real inputs is more valuable than mocking every internal call (e.g., test `generate_graphql_schema()` with a real schema, not with mocked sub-functions) - -**What not to test:** -- **Trivial wrappers** — if a function is a one-liner delegating to another tested function (e.g., `".".join([f(x) for x in path.split(".")])`) -- **Framework behavior** — don't test that Pydantic validates types or that `re.sub` works; test *your* logic -- **Obvious guard clauses** — `if not s: return s` doesn't need its own test case unless the empty-input behavior is part of a documented contract -- **Coverage for coverage's sake** — a placeholder test like `assert module is not None` has no value; either write a real test or leave the file empty +- Unit tests in `test/` mirror source structure; pytest with `asyncio_mode = auto`. Write tests that earn their keep (non-trivial logic, boundaries, bug regressions) — skip trivial wrappers and framework behavior. +- **Avoid `importlib.reload()` in tests** — it breaks `isinstance()`/`pytest.raises()` matching. Use `mock.patch.object(module, "VAR_NAME", value)` instead. +- Integration tests in `integration_tests/` verify data consistency across the full stack, dynamically loading sample data from `projects/mongodb/sample_data/{org-key}/`. 
-### Unit Test Mechanics -- Tests are in `test/` mirroring source structure -- Uses pytest with `asyncio_mode = auto` -- Run specific module tests: `uv run pytest test/components/lif//` -- **Avoid `importlib.reload()` in tests** — reloading a module creates new class objects, breaking `isinstance()` checks and `pytest.raises()` matching. Use `mock.patch.object(module, "VAR_NAME", value)` to override module-level variables instead. - -### Integration Tests - -Integration tests are in `integration_tests/` and verify data consistency across the full service stack. - -```bash -uv run pytest integration_tests/ # Run all integration tests -uv run pytest integration_tests/ --org org1 # Run for specific org -uv run pytest integration_tests/ --skip-unavailable # Skip tests for unavailable services -``` - -**Key design principles:** -- Tests **dynamically load sample data** from JSON files at runtime (no hardcoded constants) -- The `SampleDataLoader` class reads from `projects/mongodb/sample_data/{org-key}/` -- Tests compare API responses against dynamically loaded expected values -- If sample data changes, tests automatically adapt - -**Sample data organization:** -``` -projects/mongodb/sample_data/ -├── advisor-demo-org1/ # Matt, Renee, Sarah, Tracy (4 users) -├── advisor-demo-org2/ # Alan, Jenna, Sarah, Tracy (4 users) -├── advisor-demo-org3/ # Alan, Jenna, Matt, Renee (4 users) -└── dev-single-org/ # All 6 users combined -``` - -**Test users (6 total unique):** -| User | Native Org | Notes | -|------|-----------|-------| -| Matt | org1 | Core user | -| Renee | org1 | Core user | -| Sarah | org1 | Core user | -| Tracy | org1 | Core user | -| Alan | org2 | Async-ingested into org1 via orchestration | -| Jenna | org2 | Async-ingested into org1 via orchestration | - -**Testing async-ingested users:** -- Core users (org1 native) must always be present - tests fail if missing -- Async users (from org2/org3) warn/skip if not yet ingested -- To verify actual ingestion, tests 
query GraphQL directly (not just sample files) -- GraphQL queries require specific identifiers - empty filter `{}` returns empty results - -**Service layer testing order:** -1. `test_01_mongodb.py` - Direct MongoDB verification -2. `test_02_query_cache.py` - Query cache layer -3. `test_03_query_planner.py` - Query planner routing -4. `test_04_graphql.py` - GraphQL API layer -5. `test_05_cross_org.py` - Cross-organization data isolation -6. `test_06_semantic_search.py` - Semantic search MCP server +Full reference (test principles, sample data orgs, the 6 test users, service-layer order): [`docs/operations/guides/testing.md`](docs/operations/guides/testing.md). ## Pre-commit Hooks @@ -196,248 +139,28 @@ All enforced automatically on commit: ## LIF Schema & Data Model -### Schema Hierarchy -1. **`schemas/lif-schema.json`** - Source of truth for LIF data model rules and policies -2. **MDR (Metadata Registry)** - Captures schema dynamically, allows extension by deployers -3. **Seed data** - Must validate against the schema from MDR -4. **Components** - Must honor the schema, load from MDR with short cache if needed -5. **GraphQL queries** - Should align with schema as best as practical - -### Schema Loading Pattern (IMPORTANT) - -Services load OpenAPI schema from MDR at startup. Key design decisions: +- **Source of truth**: `schemas/lif-schema.json` → captured dynamically by MDR → seed data, components, and GraphQL queries all honor the MDR schema. +- **Loaded from MDR at startup** via `LIFSchemaConfig.from_environment()` and the `SchemaStateManager` component. **No silent fallback to bundled file** — if MDR is configured but unavailable, the service fails loudly. `USE_OPENAPI_DATA_MODEL_FROM_FILE=true` forces the bundled file (dev only). +- **Capitalization convention**: entity/object/array properties are **PascalCase** (`Name`, `Identifier`, `EmploymentPreferences`); scalar attributes are **camelCase** (`firstName`, `identifierType`). 
Applies to seed data, `.graphql` queries, `information_sources_config*.yml` fragment paths, and test fixtures. -**No silent fallback to file:** -- If MDR is configured but unavailable, the service **fails with a clear error** (does not silently fall back to bundled file) -- This prevents using stale/outdated schema data in production -- Use `USE_OPENAPI_DATA_MODEL_FROM_FILE=true` to explicitly use bundled file (development only) - -**Configuration via `LIFSchemaConfig`:** -- All schema-related config should use `LIFSchemaConfig.from_environment()` (not direct `os.getenv()`) -- Provides centralized validation and consistent defaults -- Key env vars: `OPENAPI_DATA_MODEL_ID`, `LIF_MDR_API_URL`, `USE_OPENAPI_DATA_MODEL_FROM_FILE` - -**SchemaStateManager component** (`components/lif/schema_state_manager/`): -- Shared component for services that need schema data (semantic search, GraphQL) -- Handles sync and async initialization -- Thread-safe state access via lock -- Tracks schema source ("mdr" or "file") -- Supports schema refresh without restart - -```python -from lif.schema_state_manager import SchemaStateManager -from lif.lif_schema_config import LIFSchemaConfig - -config = LIFSchemaConfig.from_environment() -manager = SchemaStateManager(config) -manager.initialize_sync() # or await manager.initialize() - -state = manager.state # Access schema leaves, filter models, embeddings -``` - -### Capitalization Convention (IMPORTANT) - -The LIF schema uses a specific naming convention based on data type: - -| Type | Case | Examples | -|------|------|----------| -| **Entity/Object/Array properties** | PascalCase | `Name`, `Contact`, `Identifier`, `EmploymentLearningExperience`, `CredentialAward`, `Proficiency` | -| **Scalar attributes** | camelCase | `firstName`, `lastName`, `identifier`, `identifierType`, `informationSourceId`, `startDate` | - -**Example structure:** -```json -{ - "person": [{ - "Name": [{ // PascalCase - array of objects - "firstName": "John", // camelCase 
- scalar attribute - "lastName": "Doe", - "informationSourceId": "Org1" - }], - "Identifier": [{ // PascalCase - array of objects - "identifier": "12345", // camelCase - scalar attribute - "identifierType": "SCHOOL_ASSIGNED_NUMBER" - }], - "EmploymentPreferences": [{ // PascalCase - array of objects - "organizationTypes": ["Public"] // camelCase - scalar attribute - }] - }] -} -``` - -### Files That Must Follow This Convention -- **Seed data**: `projects/mongodb/sample_data/**/*.json` -- **GraphQL queries**: `components/lif/data_source_adapters/**/*.graphql` -- **Config files**: `deployments/**/information_sources_config*.yml` (fragment paths like `person.Name`) -- **Test fixtures**: Any test data in `test/` - -### Key Implementation Details - -1. **Strawberry GraphQL types** (`type_factory.py`): - - Uses `strawberry.field(name=field_name)` to preserve original schema case - - `resolve_actual_type()` preserves `List` wrappers for proper type resolution - - `dict_to_dataclass()` handles nested type conversion - - **Resolver annotations must use `Info` type** — dynamic resolvers in `build_root_query_type` and `build_root_mutation_type` must annotate `info` as `strawberry.types.Info`, not `object` or `Any`. Strawberry identifies the `info` parameter by type (name-based fallback was removed in 0.297.0). - - **MDR schemas have no `$ref`** — the MDR `generate_openapi_schema()` function deep-copies and inlines all referenced schemas. The `$ref` branch in `create_type()` exists but is not exercised by production schemas. - -2. **Fragment paths** use format `person.EntityName` (e.g., `person.EmploymentPreferences`) - -3. **Translator service** returns data with PascalCase root (`Person` not `person`) - - The `adjust_lif_fragments_for_initial_orchestrator_simplification()` function uses case-insensitive key lookup to handle this - -4. 
**Filter inputs** in GraphQL also use PascalCase for entity names: - ```graphql - person(filter: { Identifier: { identifier: "12345", identifierType: "..." } }) - ``` +Full reference (schema hierarchy, `SchemaStateManager` usage, the convention's affected files, and Strawberry GraphQL implementation details): [`docs/design/cross-cutting/schema-loading.md`](docs/design/cross-cutting/schema-loading.md). Normative rules: [`docs/specs/data-model-rules.md`](docs/specs/data-model-rules.md). ## Semantic Search MCP Server -The semantic search service (`bases/lif/semantic_search_mcp_server/`) provides MCP tools for AI-powered learner data queries. - -**Architecture:** -- Uses FastMCP for Model Context Protocol -- Loads schema from MDR at startup (sync initialization required for tool registration) -- Connected to org1's GraphQL API for data queries -- Embeddings computed via Sentence-Transformers - -**HTTP Endpoints:** -- `GET /health` - Readiness check -- `GET /schema/status` - Schema metadata (source, leaf count, roots, filter models) -- `POST /schema/refresh` - Reload schema from MDR (state only, not tool definitions) - -**MCP Tools:** -- `lif_query` - Semantic search over LIF data fields -- `lif_mutation` - Update LIF data fields (if mutation model available) - -**Docker port:** 8003 (exposed for integration testing) - -**GraphQL authentication:** Uses `graphql_client` component for all GraphQL HTTP calls, which automatically sends `X-API-Key` from the `LIF_GRAPHQL_API_KEY` env var when set. +`bases/lif/semantic_search_mcp_server/` exposes MCP tools (`lif_query`, `lif_mutation`) for AI-powered learner data queries — FastMCP + Sentence-Transformers embeddings, loading schema from MDR at startup and querying org1's GraphQL API (Docker port 8003). See [`docs/design/components/semantic-search.md`](docs/design/components/semantic-search.md). ## GraphQL API Key Authentication -GraphQL org1 supports API key authentication. Keys are managed in AWS SSM Parameter Store. 
- -**How it works:** -- Server-side: `/{env}/graphql-org1/ApiKeys` stores comma-separated `key:client-name` pairs (e.g., `abc123:semantic-search,def456:workshop-01`) -- Client-side: Each client has its own SSM param with the bare key (e.g., `/{env}/semantic-search/GraphqlApiKey`) -- The `graphql_client` component reads `LIF_GRAPHQL_API_KEY` env var and sends it as `X-API-Key` header -- GraphQL server validates incoming keys against its `GRAPHQL_AUTH__API_KEYS` env var -- When `GRAPHQL_AUTH__API_KEYS` is empty/unset, authentication is disabled (local dev default) - -**Key env vars:** -| Variable | Service | Purpose | -|----------|---------|---------| -| `GRAPHQL_AUTH__API_KEYS` | GraphQL org1 | Server-side: comma-separated `key:label` pairs to accept | -| `LIF_GRAPHQL_API_KEY` | Semantic search | Client-side: bare API key to send with requests | - -**Managing keys:** -```bash -# Preview what will happen -AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo - -# Create/update service key (semantic-search) -AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo --apply - -# Generate workshop participant keys -AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo --workshop 10 --apply - -# Remove all workshop keys (preserves service keys) -AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo --workshop 0 --apply -``` - -After key changes, redeploy affected services: -```bash -./aws-deploy.sh -s demo --only-stack demo-lif-semantic-search -./aws-deploy.sh -s demo --only-stack demo-lif-graphql-org1 -``` +GraphQL org1 accepts API keys managed in AWS SSM. The server validates incoming `X-API-Key` headers against `GRAPHQL_AUTH__API_KEYS` (disabled when empty); clients send `LIF_GRAPHQL_API_KEY` via the `graphql_client` component. Key management (`scripts/setup-graphql-api-keys.sh`, service vs. workshop modes) and redeploy steps: [`docs/operations/guides/graphql-api-keys.md`](docs/operations/guides/graphql-api-keys.md). 
## Deployment & Operations -### Environment Configuration -- `{env}.aws` files (repo root) — define `AWS_REGION`, `SAM_CONFIG_ENV`, `STACKS` map, and `STACK_ORDER` for each environment -- `cloudformation/{env}-*.params` — CloudFormation parameter files per stack, including `ImageUrl` for ECS services -- Environments: `dev`, `demo` (demo is manually promoted from dev) - -### Deployment Scripts - -| Script | Purpose | -|--------|---------| -| `aws-deploy.sh` | Deploy CloudFormation stacks (`-s demo`, `--only-stack`, `--update-ecs`, `--update-sam`) | -| `scripts/release-demo.sh` | Update demo param files with latest ECR image tags from dev | -| `scripts/release-demo-frontend.sh` | Build and deploy MDR frontend to S3/CloudFront from a git ref | -| `scripts/verify-demo-images.sh` | Compare param file image tags against running ECS tasks | -| `scripts/setup-mdr-api-keys.sh` | Generate and store MDR service API keys in SSM Parameter Store | -| `scripts/setup-graphql-api-keys.sh` | Generate and store GraphQL org1 API keys in SSM (service + workshop modes) | -| `scripts/reset-mdr-database.sh` | Reset MDR database (flyway clean + migrate) when V1.1 SQL is replaced | -| `sam/deploy-sam.sh` | Build Flyway Docker image, push to ECR, run SAM deploy for database stacks | - -### Environment Differences -- **Dev** uses `:latest` ECR image tags in param files; **demo** uses pinned version tags (e.g., `:1.2.3`) -- `scripts/release-demo.sh` copies the current dev image tags to demo param files for promotion -- Dev has a single-org setup (`dev-single-org`); demo has multi-org (`advisor-demo-org1/2/3`) - -### MDR Schema Migrations (V1.2+) - -- **Deployed envs** run migrations via Flyway (tracked in `flyway_schema_history`). -- **Local docker-compose** loads `backup.sql` (a pg_dump snapshot of V1.1 content) and then runs every `V1.*.sql` file in the Flyway directory through `psql` — it does *not* use real Flyway and does *not* track history. See `projects/lif_mdr_database/restore.sh`. 
-- This only re-runs on first init (empty data dir) or after `docker compose down -v`. Persistent-volume `up`/`down` cycles are safe. -- **Authoring convention:** every V1.2+ migration must be idempotent so local re-init is safe. Use `CREATE OR REPLACE FUNCTION`, `CREATE TABLE IF NOT EXISTS`, `DROP TRIGGER IF EXISTS … CREATE TRIGGER`, etc. rather than raw `CREATE`. This is a local-dev concession; deployed envs would tolerate non-idempotent migrations via Flyway's history tracking, but we keep one style across both. -- Wiring real Flyway into local docker-compose is a known gap; deferred pending the broader MDR database tooling evaluation (`docs/proposals/mdr-mongodb-evaluation.md`). - -### Key Operational Notes -- **Demo update guide**: See `docs/operations/guides/demo-environment-update.md` for the full end-to-end process -- **SAM databases**: See `sam/README.md` for database deployment architecture and Flyway migration details -- **Apple Silicon**: Docker images for Lambda must use `--platform linux/amd64` (already handled in scripts) -- **SSM parameters**: ECS tasks fail to start if referenced SSM parameters are missing, even optional ones like `ApiKeys` -- **Deploy sequentially**: Running multiple `aws-deploy.sh` commands in parallel causes SSO login conflicts -- **MDR frontend**: Deployed to S3 + CloudFront (not ECS); use `scripts/release-demo-frontend.sh` for demo -- **Bash `grep -v` with `pipefail`**: In scripts using `set -o pipefail`, `grep -v` returns exit code 1 when all lines are filtered out (no matches). Wrap in `(grep -v ... || true)` to prevent script failure. - -### Docker Build Dependency Resolution (IMPORTANT) - -Project Dockerfiles (`projects/*/Dockerfile2`) build wheels independently from the monorepo lock file. The runtime stage installs the wheel with `uv pip install --system`, which resolves dependencies from PyPI based on the wheel's metadata constraints — **not** from `uv.lock`. 
This means Docker images can get different dependency versions than local development. - -**PEP 440 `~=` gotcha**: `~=0.275` means `>= 0.275, < 1.0`, NOT `>= 0.275, < 0.276`. To constrain to a minor version range, use `~=0.275.0` (which means `>= 0.275.0, < 0.276.0`). This distinction caused a production crash when `strawberry-graphql~=0.275` resolved to `0.297.0` in Docker. - -### Debugging ECS Services - -**CloudWatch log group**: All dev ECS services share a single log group named `dev`. Log streams are prefixed by service name (e.g., `graphql-org1/graphql-org1/`). - -```bash -# Tail recent logs for a service -AWS_PROFILE=lif AWS_REGION=us-east-1 aws logs filter-log-events \ - --log-group-name dev --log-stream-name-prefix graphql-org1 \ - --start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") \ - --limit 100 --query 'events[].message' --output text - -# Filter for errors only -AWS_PROFILE=lif AWS_REGION=us-east-1 aws logs filter-log-events \ - --log-group-name dev --log-stream-name-prefix graphql-org1 \ - --filter-pattern "ERROR" --limit 20 --query 'events[].message' --output text - -# Check service status and recent events -AWS_PROFILE=lif AWS_REGION=us-east-1 aws ecs describe-services \ - --cluster dev --services graphql-org1-FARGATE \ - --query 'services[0].{status:status,running:runningCount,events:events[:3]}' -``` - -**ECS Exec** is enabled on dev services but requires the Session Manager Plugin (`session-manager-plugin`) installed locally. - -### Querying the MDR API - -The MDR API provides the OpenAPI schema that GraphQL services load at startup. Useful for debugging schema-related issues. 
- -```bash -# Get the MDR API key for a service (stored in SSM) -AWS_PROFILE=lif AWS_REGION=us-east-1 aws ssm get-parameter \ - --name /dev/graphql-org1/MdrApiKey --with-decryption \ - --query 'Parameter.Value' --output text - -# Fetch the OpenAPI schema (data model ID 17 for dev) -curl -s -H "X-API-Key: " \ - "https://mdr-api.dev.lif.unicon.net/datamodels/open_api_schema/17?include_attr_md=true&include_entity_md=false" -``` +- Environments: `dev` (uses `:latest` image tags, single-org) and `demo` (pinned version tags, multi-org), manually promoted from dev. Config in `{env}.aws` files and `cloudformation/{env}-*.params`. +- Deploy with `aws-deploy.sh`; promote images with `scripts/release-demo.sh`. **Deploy sequentially** — parallel runs cause SSO login conflicts. +- **MDR migrations (V1.2+) must be idempotent** (`CREATE OR REPLACE`, `IF NOT EXISTS`) — local docker-compose replays every `V1.*.sql` through `psql` without Flyway history tracking. +- **PEP 440 `~=` gotcha**: `~=0.275` means `< 1.0`, not `< 0.276`; use `~=0.275.0` to pin a minor range. (Caused a prod crash via Docker wheel resolution.) -**MDR auth uses `X-API-Key` header** (not Bearer token). The endpoint path is `/datamodels/open_api_schema/{data_model_id}`. +Full reference (script table, env differences, Docker build resolution, ECS/CloudWatch debugging, MDR API querying): [`docs/operations/guides/deployment.md`](docs/operations/guides/deployment.md). Demo promotion walkthrough: [`docs/operations/guides/demo-environment-update.md`](docs/operations/guides/demo-environment-update.md). ## Key Technologies diff --git a/docs/INDEX.md b/docs/INDEX.md index cdc4076..8e7d6b6 100644 --- a/docs/INDEX.md +++ b/docs/INDEX.md @@ -33,6 +33,7 @@ - [`lif-query-cache.md`](design/components/lif-query-cache.md) — LIF Query Cache service: caching layer for resolved queries. - [`lif-query-planner.md`](design/components/lif-query-planner.md) — LIF Query Planner service: query routing and optimization. 
- [`mdr.md`](design/components/mdr.md) — MDR (Metadata Repository) service: schema registry and contract authority. +- [`semantic-search.md`](design/components/semantic-search.md) — Semantic Search MCP server: FastMCP tools (`lif_query`/`lif_mutation`), HTTP endpoints, embeddings, startup schema load. - [`translator.md`](design/components/translator.md) — Translator service: source-data-to-LIF transformation engine. ### `docs/design/adr/` — Architectural Decision Records @@ -58,9 +59,10 @@ ### `docs/design/cross-cutting/` — Topics spanning services +- [`schema-loading.md`](design/cross-cutting/schema-loading.md) — Schema loading pattern (MDR-at-startup, no silent file fallback), `SchemaStateManager`, PascalCase/camelCase convention, Strawberry GraphQL implementation details. - [`self-serve-tenant-auth.md`](design/cross-cutting/self-serve-tenant-auth.md) — Self-serve tenant onboarding narrative: Cognito sign-up → post-confirmation Lambda → schema-per-tenant provisioning → workspace selection cookie → invite tokens (#882/#883/#884). -*Other planned topics: `auth.md` (all-service auth model), `schema-loading.md`, `polylith-conventions.md`.* +*Other planned topics: `auth.md` (all-service auth model), `polylith-conventions.md`.* --- @@ -74,7 +76,10 @@ - [`adding-a-new-microservice.md`](operations/guides/adding-a-new-microservice.md) — Runbook for standing up a new HTTP microservice: Polylith brick layout, pyproject hygiene, Dockerfile2, AuthMiddleware wiring, docker-compose entry. - [`creating-a-data-source-adapter.md`](operations/guides/creating-a-data-source-adapter.md) — Reference for the data source adapter contract: what adapters are, what they receive, what they return. - [`demo-environment-update.md`](operations/guides/demo-environment-update.md) — End-to-end runbook for promoting dev images to demo. +- [`deployment.md`](operations/guides/deployment.md) — Deployment scripts, env config (dev vs. 
demo), MDR schema migrations, Docker build dependency resolution, ECS/CloudWatch debugging, querying the MDR API. +- [`graphql-api-keys.md`](operations/guides/graphql-api-keys.md) — GraphQL org1 API key authentication: SSM key storage, `X-API-Key` flow, `setup-graphql-api-keys.sh` service/workshop modes. - [`load-testing.md`](operations/guides/load-testing.md) — Load testing notes for LIF services. +- [`testing.md`](operations/guides/testing.md) — Unit/integration test principles, sample data orgs, the 6 test users, service-layer testing order. ### `docs/operations/proposals/` — Proposed work diff --git a/docs/design/components/semantic-search.md b/docs/design/components/semantic-search.md new file mode 100644 index 0000000..ff8dedc --- /dev/null +++ b/docs/design/components/semantic-search.md @@ -0,0 +1,22 @@ +# Semantic Search MCP Server + +The semantic search service (`bases/lif/semantic_search_mcp_server/`) provides MCP tools for AI-powered learner data queries. + +**Architecture:** +- Uses FastMCP for Model Context Protocol +- Loads schema from MDR at startup (sync initialization required for tool registration) — see [`schema-loading.md`](../cross-cutting/schema-loading.md) +- Connected to org1's GraphQL API for data queries +- Embeddings computed via Sentence-Transformers + +**HTTP Endpoints:** +- `GET /health` - Readiness check +- `GET /schema/status` - Schema metadata (source, leaf count, roots, filter models) +- `POST /schema/refresh` - Reload schema from MDR (state only, not tool definitions) + +**MCP Tools:** +- `lif_query` - Semantic search over LIF data fields +- `lif_mutation` - Update LIF data fields (if mutation model available) + +**Docker port:** 8003 (exposed for integration testing) + +**GraphQL authentication:** Uses `graphql_client` component for all GraphQL HTTP calls, which automatically sends `X-API-Key` from the `LIF_GRAPHQL_API_KEY` env var when set. See [`graphql-api-keys.md`](../../operations/guides/graphql-api-keys.md). 
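
The client-side rule above (send `X-API-Key` only when `LIF_GRAPHQL_API_KEY` is set) can be sketched roughly as follows. This is a minimal illustration, not the actual `graphql_client` implementation: the helper names `build_graphql_headers` and `build_graphql_request` are hypothetical, and any URL passed in is the caller's assumption.

```python
import json
import os
import urllib.request
from typing import Mapping, Optional


def build_graphql_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return request headers, adding X-API-Key only when the env var is set.

    Hypothetical helper: it illustrates the documented rule and makes no
    claim about how the real graphql_client component is written.
    """
    env = os.environ if env is None else env
    headers = {"Content-Type": "application/json"}
    api_key = env.get("LIF_GRAPHQL_API_KEY", "").strip()
    if api_key:
        # Key present: authenticate the request.
        headers["X-API-Key"] = api_key
    return headers


def build_graphql_request(
    url: str, query: str, env: Optional[Mapping[str, str]] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers=build_graphql_headers(env), method="POST"
    )
```

Passing the environment as a mapping keeps the helper easy to exercise in tests without touching `os.environ`, which fits the repo's guidance to patch values rather than reload modules.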
diff --git a/docs/design/cross-cutting/schema-loading.md b/docs/design/cross-cutting/schema-loading.md new file mode 100644 index 0000000..c824241 --- /dev/null +++ b/docs/design/cross-cutting/schema-loading.md @@ -0,0 +1,96 @@ +# Schema Loading & Data Model + +How services load the LIF schema, the PascalCase/camelCase naming convention, and the GraphQL implementation details that depend on it. The normative rules live in [`docs/specs/data-model-rules.md`](../../specs/data-model-rules.md); this doc is the agent-oriented implementation reference. + +## Schema Hierarchy +1. **`schemas/lif-schema.json`** - Source of truth for LIF data model rules and policies +2. **MDR (Metadata Registry)** - Captures schema dynamically, allows extension by deployers +3. **Seed data** - Must validate against the schema from MDR +4. **Components** - Must honor the schema, load from MDR with short cache if needed +5. **GraphQL queries** - Should align with schema as best as practical + +## Schema Loading Pattern (IMPORTANT) + +Services load OpenAPI schema from MDR at startup. 
Key design decisions: + +**No silent fallback to file:** +- If MDR is configured but unavailable, the service **fails with a clear error** (does not silently fall back to bundled file) +- This prevents using stale/outdated schema data in production +- Use `USE_OPENAPI_DATA_MODEL_FROM_FILE=true` to explicitly use bundled file (development only) + +**Configuration via `LIFSchemaConfig`:** +- All schema-related config should use `LIFSchemaConfig.from_environment()` (not direct `os.getenv()`) +- Provides centralized validation and consistent defaults +- Key env vars: `OPENAPI_DATA_MODEL_ID`, `LIF_MDR_API_URL`, `USE_OPENAPI_DATA_MODEL_FROM_FILE` + +**SchemaStateManager component** (`components/lif/schema_state_manager/`): +- Shared component for services that need schema data (semantic search, GraphQL) +- Handles sync and async initialization +- Thread-safe state access via lock +- Tracks schema source ("mdr" or "file") +- Supports schema refresh without restart + +```python +from lif.schema_state_manager import SchemaStateManager +from lif.lif_schema_config import LIFSchemaConfig + +config = LIFSchemaConfig.from_environment() +manager = SchemaStateManager(config) +manager.initialize_sync() # or await manager.initialize() + +state = manager.state # Access schema leaves, filter models, embeddings +``` + +## Capitalization Convention (IMPORTANT) + +The LIF schema uses a specific naming convention based on data type: + +| Type | Case | Examples | +|------|------|----------| +| **Entity/Object/Array properties** | PascalCase | `Name`, `Contact`, `Identifier`, `EmploymentLearningExperience`, `CredentialAward`, `Proficiency` | +| **Scalar attributes** | camelCase | `firstName`, `lastName`, `identifier`, `identifierType`, `informationSourceId`, `startDate` | + +**Example structure:** +```json +{ + "person": [{ + "Name": [{ // PascalCase - array of objects + "firstName": "John", // camelCase - scalar attribute + "lastName": "Doe", + "informationSourceId": "Org1" + }], + 
"Identifier": [{ // PascalCase - array of objects + "identifier": "12345", // camelCase - scalar attribute + "identifierType": "SCHOOL_ASSIGNED_NUMBER" + }], + "EmploymentPreferences": [{ // PascalCase - array of objects + "organizationTypes": ["Public"] // camelCase - scalar attribute + }] + }] +} +``` + +### Files That Must Follow This Convention +- **Seed data**: `projects/mongodb/sample_data/**/*.json` +- **GraphQL queries**: `components/lif/data_source_adapters/**/*.graphql` +- **Config files**: `deployments/**/information_sources_config*.yml` (fragment paths like `person.Name`) +- **Test fixtures**: Any test data in `test/` + +## Key Implementation Details + +1. **Strawberry GraphQL types** (`type_factory.py`): + - Uses `strawberry.field(name=field_name)` to preserve original schema case + - `resolve_actual_type()` preserves `List` wrappers for proper type resolution + - `dict_to_dataclass()` handles nested type conversion + - **Resolver annotations must use `Info` type** — dynamic resolvers in `build_root_query_type` and `build_root_mutation_type` must annotate `info` as `strawberry.types.Info`, not `object` or `Any`. Strawberry identifies the `info` parameter by type (name-based fallback was removed in 0.297.0). + - **MDR schemas have no `$ref`** — the MDR `generate_openapi_schema()` function deep-copies and inlines all referenced schemas. The `$ref` branch in `create_type()` exists but is not exercised by production schemas. + +2. **Fragment paths** use format `person.EntityName` (e.g., `person.EmploymentPreferences`) + +3. **Translator service** returns data with PascalCase root (`Person` not `person`) + - The `adjust_lif_fragments_for_initial_orchestrator_simplification()` function uses case-insensitive key lookup to handle this + +4. **Filter inputs** in GraphQL also use PascalCase for entity names: + ```graphql + person(filter: { Identifier: { identifier: "12345", identifierType: "..." 
} })
+```
diff --git a/docs/operations/guides/deployment.md b/docs/operations/guides/deployment.md
new file mode 100644
index 0000000..0f5fbec
--- /dev/null
+++ b/docs/operations/guides/deployment.md
@@ -0,0 +1,90 @@
+# Deployment & Operations
+
+Deployment scripts, environment config, MDR migrations, and debugging runbooks. For the full demo-promotion walkthrough see [`demo-environment-update.md`](demo-environment-update.md); for SAM/database details see [`sam/README.md`](../../../sam/README.md).
+
+## Environment Configuration
+- `{env}.aws` files (repo root) — define `AWS_REGION`, `SAM_CONFIG_ENV`, `STACKS` map, and `STACK_ORDER` for each environment
+- `cloudformation/{env}-*.params` — CloudFormation parameter files per stack, including `ImageUrl` for ECS services
+- Environments: `dev`, `demo` (demo is manually promoted from dev)
+
+## Deployment Scripts
+
+| Script | Purpose |
+|--------|---------|
+| `aws-deploy.sh` | Deploy CloudFormation stacks (`-s demo`, `--only-stack`, `--update-ecs`, `--update-sam`) |
+| `scripts/release-demo.sh` | Update demo param files with latest ECR image tags from dev |
+| `scripts/release-demo-frontend.sh` | Build and deploy MDR frontend to S3/CloudFront from a git ref |
+| `scripts/verify-demo-images.sh` | Compare param file image tags against running ECS tasks |
+| `scripts/setup-mdr-api-keys.sh` | Generate and store MDR service API keys in SSM Parameter Store |
+| `scripts/setup-graphql-api-keys.sh` | Generate and store GraphQL org1 API keys in SSM (service + workshop modes) |
+| `scripts/reset-mdr-database.sh` | Reset MDR database (flyway clean + migrate) when V1.1 SQL is replaced |
+| `sam/deploy-sam.sh` | Build Flyway Docker image, push to ECR, run SAM deploy for database stacks |
+
+## Environment Differences
+- **Dev** uses `:latest` ECR image tags in param files; **demo** uses pinned version tags (e.g., `:1.2.3`)
+- `scripts/release-demo.sh` copies the current dev image tags to demo param files for promotion
+- Dev has a single-org setup (`dev-single-org`); demo has multi-org (`advisor-demo-org1/2/3`)
+
+## MDR Schema Migrations (V1.2+)
+
+- **Deployed envs** run migrations via Flyway (tracked in `flyway_schema_history`).
+- **Local docker-compose** loads `backup.sql` (a pg_dump snapshot of V1.1 content) and then runs every `V1.*.sql` file in the Flyway directory through `psql` — it does *not* use real Flyway and does *not* track history. See `projects/lif_mdr_database/restore.sh`.
+- This only re-runs on first init (empty data dir) or after `docker compose down -v`. Persistent-volume `up`/`down` cycles are safe.
+- **Authoring convention:** every V1.2+ migration must be idempotent so local re-init is safe. Use `CREATE OR REPLACE FUNCTION`, `CREATE TABLE IF NOT EXISTS`, `DROP TRIGGER IF EXISTS … CREATE TRIGGER`, etc. rather than raw `CREATE`. This is a local-dev concession; deployed envs would tolerate non-idempotent migrations via Flyway's history tracking, but we keep one style across both.
+- Wiring real Flyway into local docker-compose is a known gap; deferred pending the broader MDR database tooling evaluation (`docs/proposals/mdr-mongodb-evaluation.md`).
+
+## Key Operational Notes
+- **Demo update guide**: See [`demo-environment-update.md`](demo-environment-update.md) for the full end-to-end process
+- **SAM databases**: See `sam/README.md` for database deployment architecture and Flyway migration details
+- **Apple Silicon**: Docker images for Lambda must use `--platform linux/amd64` (already handled in scripts)
+- **SSM parameters**: ECS tasks fail to start if referenced SSM parameters are missing, even optional ones like `ApiKeys`
+- **Deploy sequentially**: Running multiple `aws-deploy.sh` commands in parallel causes SSO login conflicts
+- **MDR frontend**: Deployed to S3 + CloudFront (not ECS); use `scripts/release-demo-frontend.sh` for demo
+- **Bash `grep -v` with `pipefail`**: In scripts using `set -o pipefail`, `grep -v` returns exit code 1 when every input line is filtered out (no lines are selected for output). Wrap in `(grep -v ... || true)` to prevent script failure.
+
+## Docker Build Dependency Resolution (IMPORTANT)
+
+Project Dockerfiles (`projects/*/Dockerfile2`) build wheels independently from the monorepo lock file. The runtime stage installs the wheel with `uv pip install --system`, which resolves dependencies from PyPI based on the wheel's metadata constraints — **not** from `uv.lock`. This means Docker images can get different dependency versions than local development.
+
+**PEP 440 `~=` gotcha**: `~=0.275` means `>= 0.275, < 1.0`, NOT `>= 0.275, < 0.276`. To constrain to a minor version range, use `~=0.275.0` (which means `>= 0.275.0, < 0.276.0`). This distinction caused a production crash when `strawberry-graphql~=0.275` resolved to `0.297.0` in Docker.
+
+## Debugging ECS Services
+
+**CloudWatch log group**: All dev ECS services share a single log group named `dev`. Log streams are prefixed by service name (e.g., `graphql-org1/graphql-org1/`).
+
+```bash
+# Tail recent logs for a service
+AWS_PROFILE=lif AWS_REGION=us-east-1 aws logs filter-log-events \
+  --log-group-name dev --log-stream-name-prefix graphql-org1 \
+  --start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") \
+  --limit 100 --query 'events[].message' --output text
+
+# Filter for errors only
+AWS_PROFILE=lif AWS_REGION=us-east-1 aws logs filter-log-events \
+  --log-group-name dev --log-stream-name-prefix graphql-org1 \
+  --filter-pattern "ERROR" --limit 20 --query 'events[].message' --output text
+
+# Check service status and recent events
+AWS_PROFILE=lif AWS_REGION=us-east-1 aws ecs describe-services \
+  --cluster dev --services graphql-org1-FARGATE \
+  --query 'services[0].{status:status,running:runningCount,events:events[:3]}'
+```
+
+**ECS Exec** is enabled on dev services but requires the Session Manager Plugin (`session-manager-plugin`) installed locally.
+
+## Querying the MDR API
+
+The MDR API provides the OpenAPI schema that GraphQL services load at startup. Useful for debugging schema-related issues.
+
+```bash
+# Get the MDR API key for a service (stored in SSM)
+AWS_PROFILE=lif AWS_REGION=us-east-1 aws ssm get-parameter \
+  --name /dev/graphql-org1/MdrApiKey --with-decryption \
+  --query 'Parameter.Value' --output text
+
+# Fetch the OpenAPI schema (data model ID 17 for dev)
+curl -s -H "X-API-Key: " \
+  "https://mdr-api.dev.lif.unicon.net/datamodels/open_api_schema/17?include_attr_md=true&include_entity_md=false"
+```
+
+**MDR auth uses `X-API-Key` header** (not Bearer token). The endpoint path is `/datamodels/open_api_schema/{data_model_id}`.
diff --git a/docs/operations/guides/graphql-api-keys.md b/docs/operations/guides/graphql-api-keys.md
new file mode 100644
index 0000000..85ce4b3
--- /dev/null
+++ b/docs/operations/guides/graphql-api-keys.md
@@ -0,0 +1,37 @@
+# GraphQL API Key Authentication
+
+GraphQL org1 supports API key authentication. Keys are managed in AWS SSM Parameter Store.
+
+**How it works:**
+- Server-side: `/{env}/graphql-org1/ApiKeys` stores comma-separated `key:client-name` pairs (e.g., `abc123:semantic-search,def456:workshop-01`)
+- Client-side: Each client has its own SSM param with the bare key (e.g., `/{env}/semantic-search/GraphqlApiKey`)
+- The `graphql_client` component reads `LIF_GRAPHQL_API_KEY` env var and sends it as `X-API-Key` header
+- GraphQL server validates incoming keys against its `GRAPHQL_AUTH__API_KEYS` env var
+- When `GRAPHQL_AUTH__API_KEYS` is empty/unset, authentication is disabled (local dev default)
+
+**Key env vars:**
+| Variable | Service | Purpose |
+|----------|---------|---------|
+| `GRAPHQL_AUTH__API_KEYS` | GraphQL org1 | Server-side: comma-separated `key:label` pairs to accept |
+| `LIF_GRAPHQL_API_KEY` | Semantic search | Client-side: bare API key to send with requests |
+
+**Managing keys:**
+```bash
+# Preview what will happen
+AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo
+
+# Create/update service key (semantic-search)
+AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo --apply
+
+# Generate workshop participant keys
+AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo --workshop 10 --apply
+
+# Remove all workshop keys (preserves service keys)
+AWS_PROFILE=lif ./scripts/setup-graphql-api-keys.sh demo --workshop 0 --apply
+```
+
+After key changes, redeploy affected services:
+```bash
+./aws-deploy.sh -s demo --only-stack demo-lif-semantic-search
+./aws-deploy.sh -s demo --only-stack demo-lif-graphql-org1
+```
diff --git a/docs/operations/guides/testing.md b/docs/operations/guides/testing.md
new file mode 100644
index 0000000..df6eef3
--- /dev/null
+++ b/docs/operations/guides/testing.md
@@ -0,0 +1,74 @@
+# Testing
+
+How tests are organized and what's worth testing in LIF Core. Quick commands live in [`CLAUDE.md → Commands`](../../../CLAUDE.md#commands); this is the full reference.
+
+## Unit Test Principles
+
+Write tests that earn their keep. Every test should justify its existence by verifying something non-obvious.
+
+**What to test:**
+- **Non-trivial transformations** — regex, recursion, type dispatch, multi-step logic where inputs interact in non-obvious ways
+- **Boundary conditions** — empty inputs, None values, edge cases where behavior changes (e.g., leading digits in identifiers)
+- **Regression tests for bugs** — every bug fix should include a test that would have caught it. The test should fail without the fix.
+- **Integration-style unit tests** — testing a function end-to-end with real inputs is more valuable than mocking every internal call (e.g., test `generate_graphql_schema()` with a real schema, not with mocked sub-functions)
+
+**What not to test:**
+- **Trivial wrappers** — if a function is a one-liner delegating to another tested function (e.g., `".".join([f(x) for x in path.split(".")])`)
+- **Framework behavior** — don't test that Pydantic validates types or that `re.sub` works; test *your* logic
+- **Obvious guard clauses** — `if not s: return s` doesn't need its own test case unless the empty-input behavior is part of a documented contract
+- **Coverage for coverage's sake** — a placeholder test like `assert module is not None` has no value; either write a real test or leave the file empty
+
+## Unit Test Mechanics
+- Tests are in `test/` mirroring source structure
+- Uses pytest with `asyncio_mode = auto`
+- Run specific module tests: `uv run pytest test/components/lif//`
+- **Avoid `importlib.reload()` in tests** — reloading a module creates new class objects, breaking `isinstance()` checks and `pytest.raises()` matching. Use `mock.patch.object(module, "VAR_NAME", value)` to override module-level variables instead.
+
+## Integration Tests
+
+Integration tests are in `integration_tests/` and verify data consistency across the full service stack.
+
+```bash
+uv run pytest integration_tests/                      # Run all integration tests
+uv run pytest integration_tests/ --org org1           # Run for specific org
+uv run pytest integration_tests/ --skip-unavailable   # Skip tests for unavailable services
+```
+
+**Key design principles:**
+- Tests **dynamically load sample data** from JSON files at runtime (no hardcoded constants)
+- The `SampleDataLoader` class reads from `projects/mongodb/sample_data/{org-key}/`
+- Tests compare API responses against dynamically loaded expected values
+- If sample data changes, tests automatically adapt
+
+**Sample data organization:**
+```
+projects/mongodb/sample_data/
+├── advisor-demo-org1/    # Matt, Renee, Sarah, Tracy (4 users)
+├── advisor-demo-org2/    # Alan, Jenna, Sarah, Tracy (4 users)
+├── advisor-demo-org3/    # Alan, Jenna, Matt, Renee (4 users)
+└── dev-single-org/       # All 6 users combined
+```
+
+**Test users (6 total unique):**
+| User | Native Org | Notes |
+|------|-----------|-------|
+| Matt | org1 | Core user |
+| Renee | org1 | Core user |
+| Sarah | org1 | Core user |
+| Tracy | org1 | Core user |
+| Alan | org2 | Async-ingested into org1 via orchestration |
+| Jenna | org2 | Async-ingested into org1 via orchestration |
+
+**Testing async-ingested users:**
+- Core users (org1 native) must always be present - tests fail if missing
+- Async users (from org2/org3) warn/skip if not yet ingested
+- To verify actual ingestion, tests query GraphQL directly (not just sample files)
+- GraphQL queries require specific identifiers - empty filter `{}` returns empty results
+
+**Service layer testing order:**
+1. `test_01_mongodb.py` - Direct MongoDB verification
+2. `test_02_query_cache.py` - Query cache layer
+3. `test_03_query_planner.py` - Query planner routing
+4. `test_04_graphql.py` - GraphQL API layer
+5. `test_05_cross_org.py` - Cross-organization data isolation
+6. `test_06_semantic_search.py` - Semantic search MCP server
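
The `importlib.reload()` caveat under Unit Test Mechanics can be illustrated with a small self-contained sketch. The module name `lif_reload_demo`, its `TIMEOUT` variable, `ConfigError`, and `check()` are hypothetical stand-ins written to a temporary directory for this example; they are not part of LIF Core.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from unittest import mock

# Hypothetical module source, used only for this illustration.
SOURCE = '''
TIMEOUT = 30

class ConfigError(Exception):
    pass

def check(value):
    if value > TIMEOUT:
        raise ConfigError(value)
    return value
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "lif_reload_demo.py").write_text(SOURCE, encoding="utf-8")
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    demo = importlib.import_module("lif_reload_demo")

    # A test typically holds a reference to the exception class it imported.
    old_error = demo.ConfigError

    # reload() re-executes the module body, so ConfigError is rebound to a
    # brand-new class object that is unrelated to the one saved above.
    importlib.reload(demo)
    reload_breaks_identity = demo.ConfigError is not old_error

    # The exception raised now is an instance of the new class, so a handler
    # (or pytest.raises) written against the old class no longer matches.
    try:
        demo.check(99)
    except old_error:
        caught_by_old_class = True
    except Exception:
        caught_by_old_class = False

    # patch.object swaps only the module-level variable and restores it on
    # exit; class objects are left untouched.
    with mock.patch.object(demo, "TIMEOUT", 100):
        patched_result = demo.check(99)
    restored_timeout = demo.TIMEOUT

    sys.path.remove(tmp_dir)
```

After the reload, `caught_by_old_class` is `False` even though a `ConfigError` was raised, which is the same mismatch that makes `pytest.raises()` fail. The `mock.patch.object` path changes behavior for the duration of the `with` block only and leaves class identity alone.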