lex-db

A repository for interacting with the lex database for the Lex AI project. This project provides a wrapper around a PostgreSQL database to enable querying encyclopedia articles via API requests, supporting both vector (semantic) search and full-text/keyword search.

Features

PostgreSQL database with pgvector extension for vector search
Full-text search capabilities using PostgreSQL native FTS with Danish language support
FastAPI-based REST API with automatic OpenAPI documentation
Hybrid querying via metadata filtering and text search
Vector index management and semantic search
Connection pooling for efficient database access

Requirements

Python 3.12+
Astral UV for package management
PostgreSQL 15+ with pgvector extension installed

Installation

1. Clone the repository

git clone https://github.com/centre-for-humanities-computing/lex-db 
cd lex-db

2. Install dependencies

make install

3. Set up PostgreSQL database

Ensure PostgreSQL is installed and the pgvector extension is available:

# Install PostgreSQL and pgvector (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
sudo apt-get install postgresql-15-pgvector

# Or using Homebrew (macOS)
brew install postgresql pgvector

4. Configure environment variables

Create a .env file with your PostgreSQL connection details:

# Database Configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=lex_db
DB_USER=lex_user
DB_PASSWORD=your_password_here
DB_SSLMODE=prefer

# Connection Pooling
DB_POOL_MIN_SIZE=2
DB_POOL_MAX_SIZE=10

# Application Settings
APP_NAME=Lex DB API
DEBUG=false

# OpenAI API (for embeddings, optional)
OPENAI_API_KEY=your_api_key_here

5. Initialize the database

Run the database setup script to create tables and indexes:

# Create database and user (run as postgres user)
sudo -u postgres psql <<EOF
CREATE DATABASE lex_db;
CREATE USER lex_user WITH PASSWORD 'your_password_here';
GRANT ALL PRIVILEGES ON DATABASE lex_db TO lex_user;
\c lex_db
GRANT ALL ON SCHEMA public TO lex_user;
CREATE EXTENSION vector;
EOF

Usage

Database Migration from SQLite

If you're migrating from an existing SQLite database, follow these steps:

1. Migrate Articles Table

Use pgloader to migrate the articles table:

# Run the migration script
cd db
./run_migration.sh

This will:

Migrate all articles from SQLite to PostgreSQL
Add full-text search (FTS) column and GIN index
Verify the migration was successful

2. Export Vector Indexes from SQLite (Optional)

If you have existing vector indexes in SQLite that you want to preserve:

# Export a vector index to JSONL format
uv run python -m scripts.export_sqlite_vector_index \
    --sqlite-db db/lex.db \
    --index-name e5_small \
    --output exports/e5_small.jsonl

3. Create Vector Index in PostgreSQL

# Create an empty vector index structure
uv run python -m scripts.create_vector_index \
    --index-name e5_small \
    --embedding-model intfloat/multilingual-e5-small \
    --source-table articles \
    --source-column xhtml_md \
    --chunking-strategy semantic_chunks \
    --chunk-size 250 \
    --chunk-overlap 30

4. Import Embeddings (Optional)

If you exported embeddings from SQLite:

# Import embeddings into PostgreSQL
uv run python -m scripts.import_vector_embeddings \
    --jsonl-file exports/e5_small.jsonl \
    --index-name e5_small \
    --batch-size 1000

Or generate new embeddings:

# Generate and populate embeddings from scratch
uv run python -m scripts.update_vector_indexes \
    --index-name e5_small \
    --batch-size 64

Managing Vector Indexes

Create a New Vector Index

uv run python -m scripts.create_vector_index \
    --index-name my_index \
    --embedding-model text-embedding-3-small \
    --source-table articles \
    --source-column xhtml_md \
    --chunking-strategy semantic_chunks

Available embedding models:

intfloat/multilingual-e5-small (384 dimensions, local)
intfloat/multilingual-e5-large (1024 dimensions, local)
text-embedding-ada-002 (1536 dimensions, OpenAI)
text-embedding-3-small (1536 dimensions, OpenAI)
text-embedding-3-large (3072 dimensions, OpenAI)

Available chunking strategies:

tokens - Token-based chunking using tiktoken
characters - Simple character-based chunking
sections - Markdown section-based chunking
semantic_chunks - Danish sentence-aware semantic chunking (recommended)

Update Vector Index with New/Modified Articles

uv run python -m scripts.update_vector_indexes \
    --index-name my_index \
    --batch-size 64

This will:

Detect new, modified, and deleted articles
Remove outdated embeddings
Generate and add new embeddings
Update metadata timestamps

Synchronizing Articles from lex.dk

The sync_articles.py script synchronizes articles from lex.dk sitemaps to the database:

# Sync all encyclopedias
uv run python -m scripts.sync_articles

# Dry run to see what would change
uv run python -m scripts.sync_articles --dry-run

# Sync specific encyclopedias only
uv run python -m scripts.sync_articles --encyclopedia-ids 14,15,18

Important behaviors:

Deletions are skipped if not all sitemaps are fetched successfully (prevents accidental data loss)
Vector indexes are automatically updated after article changes
Use --dry-run to preview changes without modifying the database

Running the API Server

Start the FastAPI server:

make run

The server will be available at http://0.0.0.0:8000.

API Endpoints

List Database Tables

GET /api/tables
- Returns a list of all tables in the database.
- Example response:
```
{
  "tables": ["articles", "vector_index", "metadata"]
}
```

Filter Articles by Metadata

GET /api/articles
- Retrieve articles filtered by ID or full-text search query.
- Supports optional query parameters:
  - query: Text-based search in article content.
  - ids: Filter by article IDs (supports comma-separated string, JSON array, or repeated ids parameter).
  - limit: Maximum number of results (1–100, default: 50).
Examples:
- By IDs only (comma-separated):
```
GET /api/articles?ids=1,2,5
```
- By IDs (repeated parameters):
```
GET /api/articles?ids=1&ids=2&ids=5
```
- By IDs and text search:
```
GET /api/articles?query=Rundetårn&ids=1,2&limit=10
```
- Full-text search only:
```
GET /api/articles?query=Denmark
```
Response:
Returns structured search results including matched articles with metadata and scores.

Vector Search

POST /api/vector-search/indexes/{index_name}/query
- Perform semantic search on a specific vector index.
- Path Parameter:
  - index_name: Name of the vector index to search.
- Request Body (JSON):
```
{
  "query_text": "What is the capital of Denmark?",
  "top_k": 5
}
```
  - query_text: The search query (required).
  - top_k: Number of top results to return (optional, default: 5).
Example Request:
```
POST /api/vector-search/indexes/article_embeddings/query
Content-Type: application/json

{
  "query_text": "Scandinavian history",
  "top_k": 3
}
```
Response:
Returns a list of semantically similar documents with metadata and similarity scores.

List All Vector Indexes

GET /api/vector-search/indexes

Retrieve metadata for all available vector indexes.

Example response:

[
  {
    "index_name": "article_embeddings",
    "embedding_model": "text-embedding-3-small",
    "dimension": 1536,
    "created_at": "2025-04-05T12:00:00Z"
  }
]

Get Metadata for a Specific Vector Index

GET /api/vector-search/indexes/{index_name}
- Retrieve metadata for a specific vector index.
- Path Parameter:
  - index_name: The name of the vector index.
- Returns details such as model used, dimension, and creation timestamp.

API Documentation

Once the server is running, access auto-generated API documentation at:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

The OpenAPI 3.1 specification is available at:

/openapi/openapi.yaml

You can generate clients in various languages using OpenAPI Generator:

openapi-generator-cli generate -i openapi/openapi.yaml -g <language> -o ./client

Replace <language> with your target (e.g., python, typescript-fetch, java).

Development

This project uses a Makefile to streamline development tasks:

Command	Description
`make install`	Install dependencies
`make run`	Start the API server
`make lint`	Format code and fix lint issues (using Ruff)
`make lint-check`	Check formatting and linting without applying fixes
`make static-type-check`	Run static type checking with Mypy
`make test`	Run tests using Pytest
`make pr`	Run all pre-PR checks (linting, type checking, testing)
`make help`	Show all available commands

License

N/A

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
.vscode		.vscode
db		db
openapi		openapi
scripts		scripts
src		src
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
Makefile		Makefile
README.md		README.md
conftest.py		conftest.py
generate_openapi.py		generate_openapi.py
main.py		main.py
pyproject.toml		pyproject.toml
setup.py		setup.py
uv.lock		uv.lock

centre-for-humanities-computing/lex-db

Folders and files

Latest commit

History

Repository files navigation

lex-db

Features

Requirements

Installation

1. Clone the repository

2. Install dependencies

3. Set up PostgreSQL database

4. Configure environment variables

5. Initialize the database

Usage

Database Migration from SQLite

1. Migrate Articles Table

2. Export Vector Indexes from SQLite (Optional)

3. Create Vector Index in PostgreSQL

4. Import Embeddings (Optional)

Managing Vector Indexes

Create a New Vector Index

Update Vector Index with New/Modified Articles

Synchronizing Articles from lex.dk

Running the API Server

API Endpoints

List Database Tables

Filter Articles by Metadata

Vector Search

List All Vector Indexes

Get Metadata for a Specific Vector Index

API Documentation

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages