A repository for interacting with the lex database for the Lex AI project. This project provides a wrapper around a PostgreSQL database to enable querying encyclopedia articles via API requests, supporting both vector (semantic) search and full-text/keyword search.
- PostgreSQL database with pgvector extension for vector search
- Full-text search capabilities using PostgreSQL native FTS with Danish language support
- FastAPI-based REST API with automatic OpenAPI documentation
- Hybrid querying via metadata filtering and text search
- Vector index management and semantic search
- Connection pooling for efficient database access
- Python 3.12+
- uv (Astral's package manager) for dependency management
- PostgreSQL 15+ with pgvector extension installed
```bash
git clone https://github.com/centre-for-humanities-computing/lex-db
cd lex-db
make install
```

Ensure PostgreSQL is installed and the pgvector extension is available:
```bash
# Install PostgreSQL and pgvector (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
sudo apt-get install postgresql-15-pgvector

# Or using Homebrew (macOS)
brew install postgresql pgvector
```

Create a `.env` file with your PostgreSQL connection details:
```env
# Database Configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=lex_db
DB_USER=lex_user
DB_PASSWORD=your_password_here
DB_SSLMODE=prefer

# Connection Pooling
DB_POOL_MIN_SIZE=2
DB_POOL_MAX_SIZE=10

# Application Settings
APP_NAME=Lex DB API
DEBUG=false

# OpenAI API (for embeddings, optional)
OPENAI_API_KEY=your_api_key_here
```

Run the database setup script to create tables and indexes:
```bash
# Create database and user (run as postgres user)
sudo -u postgres psql <<EOF
CREATE DATABASE lex_db;
CREATE USER lex_user WITH PASSWORD 'your_password_here';
GRANT ALL PRIVILEGES ON DATABASE lex_db TO lex_user;
\c lex_db
GRANT ALL ON SCHEMA public TO lex_user;
CREATE EXTENSION vector;
EOF
```

If you're migrating from an existing SQLite database, follow these steps:
Use pgloader to migrate the articles table:
```bash
# Run the migration script
cd db
./run_migration.sh
```

This will:
- Migrate all articles from SQLite to PostgreSQL
- Add full-text search (FTS) column and GIN index
- Verify the migration was successful
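The FTS column added by the migration is what powers keyword search. As a rough sketch, a Danish full-text query against it might look like the following (the column name `fts` and the ranking choice are assumptions, not the project's actual schema):

```python
def fts_query_sql() -> str:
    """Build a parameterized Danish full-text search query.

    A sketch only: assumes a tsvector column named `fts` on `articles`
    and ranks matches with ts_rank.
    """
    return (
        "SELECT id, title, ts_rank(fts, q) AS score "
        "FROM articles, websearch_to_tsquery('danish', %s) AS q "
        "WHERE fts @@ q "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
```

`websearch_to_tsquery` accepts user-style input (quoted phrases, `OR`, `-exclusions`), which makes it a safer default than `to_tsquery` for raw API queries.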
If you have existing vector indexes in SQLite that you want to preserve:
```bash
# Export a vector index to JSONL format
uv run python -m scripts.export_sqlite_vector_index \
  --sqlite-db db/lex.db \
  --index-name e5_small \
  --output exports/e5_small.jsonl
```

```bash
# Create an empty vector index structure
uv run python -m scripts.create_vector_index \
  --index-name e5_small \
  --embedding-model intfloat/multilingual-e5-small \
  --source-table articles \
  --source-column xhtml_md \
  --chunking-strategy semantic_chunks \
  --chunk-size 250 \
  --chunk-overlap 30
```

If you exported embeddings from SQLite:
```bash
# Import embeddings into PostgreSQL
uv run python -m scripts.import_vector_embeddings \
  --jsonl-file exports/e5_small.jsonl \
  --index-name e5_small \
  --batch-size 1000
```

Or generate new embeddings:
```bash
# Generate and populate embeddings from scratch
uv run python -m scripts.update_vector_indexes \
  --index-name e5_small \
  --batch-size 64
```

To create a new vector index:

```bash
uv run python -m scripts.create_vector_index \
  --index-name my_index \
  --embedding-model text-embedding-3-small \
  --source-table articles \
  --source-column xhtml_md \
  --chunking-strategy semantic_chunks
```

Available embedding models:
- `intfloat/multilingual-e5-small` (384 dimensions, local)
- `intfloat/multilingual-e5-large` (1024 dimensions, local)
- `text-embedding-ada-002` (1536 dimensions, OpenAI)
- `text-embedding-3-small` (1536 dimensions, OpenAI)
- `text-embedding-3-large` (3072 dimensions, OpenAI)
Available chunking strategies:
- `tokens` - Token-based chunking using tiktoken
- `characters` - Simple character-based chunking
- `sections` - Markdown section-based chunking
- `semantic_chunks` - Danish sentence-aware semantic chunking (recommended)
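To build intuition for the `--chunk-size`/`--chunk-overlap` parameters, here is a minimal sketch of overlapping fixed-size chunking in the spirit of the `characters` strategy (the actual scripts may count tokens or sentences instead of characters):

```python
def chunk_characters(text: str, chunk_size: int = 250, chunk_overlap: int = 30) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters. Illustrative only; not the project's code.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The overlap ensures that a sentence straddling a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.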
```bash
uv run python -m scripts.update_vector_indexes \
  --index-name my_index \
  --batch-size 64
```

This will:
- Detect new, modified, and deleted articles
- Remove outdated embeddings
- Generate and add new embeddings
- Update metadata timestamps
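The detection step above amounts to a three-way diff between what is in the database and what the source currently contains. A sketch of that logic, comparing content hashes (hypothetical shape; the real script may compare timestamps or checksums from its own metadata):

```python
def diff_articles(db: dict[int, str], source: dict[int, str]) -> tuple[list[int], list[int], list[int]]:
    """Classify article IDs as new, modified, or deleted by comparing
    per-article content hashes. Illustrative only.
    """
    new = [i for i in source if i not in db]
    modified = [i for i in source if i in db and db[i] != source[i]]
    deleted = [i for i in db if i not in source]
    return new, modified, deleted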
The `sync_articles.py` script synchronizes articles from lex.dk sitemaps to the database:

```bash
# Sync all encyclopedias
uv run python -m scripts.sync_articles

# Dry run to see what would change
uv run python -m scripts.sync_articles --dry-run

# Sync specific encyclopedias only
uv run python -m scripts.sync_articles --encyclopedia-ids 14,15,18
```

Important behaviors:
- Deletions are skipped if not all sitemaps are fetched successfully (prevents accidental data loss)
- Vector indexes are automatically updated after article changes
- Use `--dry-run` to preview changes without modifying the database
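The deletion-skipping behavior can be sketched as a simple guard: if any sitemap fetch failed, treat the source listing as incomplete and delete nothing (illustrative; function names are hypothetical):

```python
def deletions_to_apply(db_ids: set[int], sitemap_ids: set[int], all_sitemaps_ok: bool) -> set[int]:
    """Return article IDs safe to delete. If any sitemap fetch failed,
    articles missing from `sitemap_ids` may simply be unfetched rather
    than truly removed, so skip deletions entirely. Illustrative only.
    """
    if not all_sitemaps_ok:
        return set()  # prevent accidental data loss
    return db_ids - sitemap_ids
```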
Start the FastAPI server:

```bash
make run
```

The server will be available at http://0.0.0.0:8000.
- `GET /api/tables` - Returns a list of all tables in the database.
  - Example response: `{ "tables": ["articles", "vector_index", "metadata"] }`
- `GET /api/articles` - Retrieve articles filtered by ID or full-text search query.
  - Supports optional query parameters:
    - `query`: Text-based search in article content.
    - `ids`: Filter by article IDs (supports comma-separated string, JSON array, or repeated `ids` parameter).
    - `limit`: Maximum number of results (1-100, default: 50).
  Examples:
  - By IDs only (comma-separated): `GET /api/articles?ids=1,2,5`
  - By IDs (repeated parameters): `GET /api/articles?ids=1&ids=2&ids=5`
  - By IDs and text search: `GET /api/articles?query=Rundetårn&ids=1,2&limit=10`
  - Full-text search only: `GET /api/articles?query=Denmark`
Response:
Returns structured search results including matched articles with metadata and scores.
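Accepting all three `ids` formats could be normalized along these lines (a sketch of the documented behavior; the API's actual parsing code may differ):

```python
import json

def parse_ids(values: list[str]) -> list[int]:
    """Normalize the `ids` query parameter: accepts repeated parameters
    (["1", "2"]), comma-separated strings (["1,2,5"]), or a JSON array
    (["[1, 2, 5]"]). Illustrative only.
    """
    ids: list[int] = []
    for value in values:
        value = value.strip()
        if value.startswith("["):  # JSON array form
            ids.extend(int(v) for v in json.loads(value))
        else:  # comma-separated or single value
            ids.extend(int(v) for v in value.split(",") if v)
    return ids
```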
- `POST /api/vector-search/indexes/{index_name}/query` - Perform semantic search on a specific vector index.
  - Path parameter:
    - `index_name`: Name of the vector index to search.
  - Request body (JSON):

    ```json
    { "query_text": "What is the capital of Denmark?", "top_k": 5 }
    ```

    - `query_text`: The search query (required).
    - `top_k`: Number of top results to return (optional, default: 5).
  Example request:

  ```
  POST /api/vector-search/indexes/article_embeddings/query
  Content-Type: application/json

  { "query_text": "Scandinavian history", "top_k": 3 }
  ```
Response:
Returns a list of semantically similar documents with metadata and similarity scores.
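For intuition about the similarity scores: with pgvector, nearest-neighbor search typically uses the cosine-distance operator `<=>`, where distance = 1 - cosine similarity. A minimal sketch of the underlying metric (computed here in Python for illustration; the server does this inside PostgreSQL):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: 1.0 for identical
    direction, 0.0 for orthogonal vectors. Illustrative only.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```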
- `GET /api/vector-search/indexes` - Retrieve metadata for all available vector indexes.
  - Example response:

    ```json
    [
      {
        "index_name": "article_embeddings",
        "embedding_model": "text-embedding-3-small",
        "dimension": 1536,
        "created_at": "2025-04-05T12:00:00Z"
      }
    ]
    ```
- `GET /api/vector-search/indexes/{index_name}` - Retrieve metadata for a specific vector index.
  - Path parameter:
    - `index_name`: The name of the vector index.
  - Returns details such as model used, dimension, and creation timestamp.
Once the server is running, access the auto-generated API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
The OpenAPI 3.1 specification is available at `/openapi/openapi.yaml`.
You can generate clients in various languages using OpenAPI Generator:

```bash
openapi-generator-cli generate -i openapi/openapi.yaml -g <language> -o ./client
```

Replace `<language>` with your target (e.g., `python`, `typescript-fetch`, `java`).
This project uses a Makefile to streamline development tasks:
| Command | Description |
|---|---|
| `make install` | Install dependencies |
| `make run` | Start the API server |
| `make lint` | Format code and fix lint issues (using Ruff) |
| `make lint-check` | Check formatting and linting without applying fixes |
| `make static-type-check` | Run static type checking with Mypy |
| `make test` | Run tests using Pytest |
| `make pr` | Run all pre-PR checks (linting, type checking, testing) |
| `make help` | Show all available commands |
N/A