sqlite-ann is a SQLite extension providing Approximate Nearest Neighbor (ANN) search capabilities for vector embeddings.
- Language: C (C11+) with TypeScript/Node.js bindings
- Target: SQLite extension with high performance and portability
- Core Focus: Vector similarity search with ANN algorithms
- Package Format: Hybrid CJS/ESM for maximum compatibility
When setting up build infrastructure, tooling, and CI/CD pipelines, use these sibling projects as references:
- Use for: Cross-platform GHA (GitHub Actions) build pipeline
- Why: Has a well-designed, clean cross-platform build setup that we want to match
- What to reference:
  - GitHub Actions workflows, cross-platform compilation, build matrix setup
  - AddressSanitizer (asan) configuration
  - Bear (Build EAR - compilation database generator) setup
  - clang-tidy integration and configuration
- Use for: Additional tooling examples
- What to reference:
  - AddressSanitizer (asan) setup
  - Bear configuration
  - clang-tidy configuration
- Avoid: sqlite-vec's npm pipeline implementation
- Why: It was bolted on after a fork and doesn't represent good design
- Note: Other aspects of sqlite-vec may still be useful for SQLite extension patterns
When in doubt about build setup, packaging, CI/CD configuration, or static analysis tooling, check fs-metadata and node-sqlite first.
CRITICAL: Preserve original copyright notices
This project is derived from libSQL's DiskANN implementation (MIT licensed). All derived code MUST:
- Retain libSQL's copyright: Copyright 2024 the libSQL authors
- Add our copyright: Copyright 2026 PhotoStructure Inc.
- Keep the MIT license text intact
- NEVER claim sole copyright on derived code
For new files (not derived from libSQL):
/*
** Copyright 2026 PhotoStructure Inc.
** MIT License (see LICENSE file)
*/

For derived/modified files:
/*
** Derived from libSQL DiskANN implementation
** Copyright 2024 the libSQL authors
** Copyright 2026 PhotoStructure Inc.
** MIT License (see LICENSE file)
*/

Before making ANY changes, you MUST read:
- @DESIGN-PRINCIPLES.md - C coding standards and best practices
- @TDD.md - Testing conventions and methodology
- Relevant source files for the area you're working on
- Use C17 standard (C11 minimum)
- Follow all conventions in @DESIGN-PRINCIPLES.md
- Compile with -Wall -Wextra -Werror -pedantic
// Functions: snake_case with prefix
int ann_index_create(ann_index_t **out, size_t dim, ann_metric_t metric);
int ann_search(ann_index_t *idx, const float *query, int k, ann_result_t *results);
// Types: snake_case with _t suffix
typedef struct ann_index ann_index_t;
typedef struct ann_result ann_result_t;
// Constants: SCREAMING_SNAKE_CASE with prefix
#define ANN_MAX_DIMENSIONS 2048
#define ANN_DEFAULT_K 10
// Static (internal) functions: snake_case without prefix
static int compute_distance(const float *a, const float *b, size_t dim);

src/
ann_core.c # Core ANN algorithm implementation
ann_core.h # Public API
ann_sqlite.c # SQLite extension interface
ann_index.c # Index management
utils/
vector_ops.c # Vector operations (distance, normalization)
vector_ops.h
tests/
test_ann_core.c # Core algorithm tests
test_vector_ops.c # Vector operation tests
test_integration.c # SQLite integration tests
- Follow ownership model from @DESIGN-PRINCIPLES.md
- All public APIs validate inputs
- Use goto cleanup pattern for resource cleanup
- Null pointers after free: free(ptr); ptr = NULL;
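
The goto cleanup and null-after-free rules above can be sketched as follows. This is a minimal illustration, not code from this project: `pair_buffers_t`, `pair_buffers_init`, and `pair_buffers_destroy` are hypothetical names, and the error codes simply mirror the ones defined in this document.

```c
#include <stdlib.h>

#define ANN_OK 0
#define ANN_ERR_NULL -1
#define ANN_ERR_NOMEM -2
#define ANN_ERR_INVALID -3

/* Hypothetical type for illustration: two buffers that are allocated together. */
typedef struct {
    float *vectors;
    int *ids;
    size_t count;
} pair_buffers_t;

/* Allocates both buffers or neither. On failure, *out is left untouched. */
int pair_buffers_init(pair_buffers_t *out, size_t count) {
    int rc = ANN_OK;
    float *vectors = NULL;
    int *ids = NULL;

    if (!out) return ANN_ERR_NULL;
    if (count == 0) return ANN_ERR_INVALID;

    vectors = calloc(count, sizeof *vectors);
    if (!vectors) { rc = ANN_ERR_NOMEM; goto cleanup; }

    ids = calloc(count, sizeof *ids);
    if (!ids) { rc = ANN_ERR_NOMEM; goto cleanup; }

    /* Success: transfer ownership to the caller, then clear the locals
    ** so the cleanup block below releases nothing. */
    out->vectors = vectors;
    out->ids = ids;
    out->count = count;
    vectors = NULL;
    ids = NULL;

cleanup:
    free(vectors); /* free(NULL) is a no-op */
    free(ids);
    return rc;
}

/* Releases both buffers and nulls the pointers, so a second call is harmless. */
void pair_buffers_destroy(pair_buffers_t *p) {
    if (!p) return;
    free(p->vectors); p->vectors = NULL;
    free(p->ids);     p->ids = NULL;
    p->count = 0;
}
```

A single exit label keeps every failure path on the same release code, which is why partially constructed state never leaks when the second allocation fails.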
// Return codes
#define ANN_OK 0
#define ANN_ERR_NULL -1
#define ANN_ERR_NOMEM -2
#define ANN_ERR_INVALID -3
#define ANN_ERR_IO -4
#define ANN_ERR_SQLITE -5
// Always check allocations
float *vectors = malloc(count * dim * sizeof(float));
if (!vectors) return ANN_ERR_NOMEM;
// Never mask errors with defaults
// Bad: int get_dimension(ann_index_t *idx) { return idx ? idx->dim : 0; }
// Good: int get_dimension(ann_index_t *idx, size_t *dim) {
// if (!idx || !dim) return ANN_ERR_NULL;
// *dim = idx->dim; return ANN_OK;
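// }

#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1  // Required once at file scope before SQLITE_EXTENSION_INIT2

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_ann_init(
    sqlite3 *db,
    char **pzErrMsg,
    const sqlite3_api_routines *pApi
) {
    SQLITE_EXTENSION_INIT2(pApi);
    (void)db; (void)pzErrMsg;  // Silence -Wextra until registration code uses them
    // Register virtual tables, functions, etc.
    return SQLITE_OK;
}

- Use SQLite's virtual table interface for ANN indexes
- Implement xCreate, xConnect, xBestIndex, xFilter, etc.
- Document which SQLite versions are supported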
// Register ANN search function
sqlite3_create_function_v2(
db, "ann_search", 3, SQLITE_UTF8, NULL,
ann_search_func, NULL, NULL, NULL
);

- Test all vector operations (distance, normalization)
- Test ANN algorithms with known datasets
- Test error paths (NULL inputs, allocation failures)
- Test SQLite extension loading
- Test CREATE VIRTUAL TABLE
- Test search queries with various parameters
- Test concurrent access if supported
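
One way to exercise the allocation-failure path listed above is to inject the allocator. The sketch below assumes a hypothetical `alloc_fn_t` hook and a hypothetical `vector_dup` function under test; neither is part of this project's API, and the error codes mirror those defined in this document.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ANN_OK 0
#define ANN_ERR_NULL -1
#define ANN_ERR_NOMEM -2
#define ANN_ERR_INVALID -3

/* Hypothetical allocator hook; production callers would pass malloc. */
typedef void *(*alloc_fn_t)(size_t size);

/* Test double that simulates an out-of-memory condition. */
static void *failing_alloc(size_t size) {
    (void)size;
    return NULL;
}

/* Hypothetical function under test: duplicates a vector via the given allocator.
** On any failure, *out is left untouched. */
static int vector_dup(alloc_fn_t alloc, const float *src, size_t dim, float **out) {
    float *copy;
    if (!alloc || !src || !out) return ANN_ERR_NULL;
    if (dim == 0 || dim > SIZE_MAX / sizeof *copy) return ANN_ERR_INVALID;

    copy = alloc(dim * sizeof *copy);
    if (!copy) return ANN_ERR_NOMEM;

    memcpy(copy, src, dim * sizeof *copy);
    *out = copy;
    return ANN_OK;
}
```

A test would call `vector_dup(failing_alloc, ...)` and expect `ANN_ERR_NOMEM` with the output pointer unchanged, then repeat with `malloc` and expect `ANN_OK`.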
# Compile with all warnings
gcc -std=c17 -Wall -Wextra -Werror -pedantic -Wconversion \
-Wshadow -Wstrict-prototypes -shared -fPIC \
-o ann.so src/*.c -lsqlite3
# Run tests
./test_suite
# Memory check
valgrind --leak-check=full ./test_suite
# Address sanitizer
gcc -fsanitize=address -g src/*.c tests/*.c -o test_suite -lsqlite3
./test_suite

- Use SIMD when available (SSE, AVX, NEON)
- Minimize allocations in hot paths
- Consider cache-friendly data layouts
- Balance memory usage vs search speed
- Support incremental index updates when possible
- Document time/space complexity
- Include benchmark suite for common operations
- Test with realistic dataset sizes (10K, 100K, 1M vectors)
- Measure queries per second and recall@k
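
For the recall@k measurement above, a benchmark could compare the IDs returned by the ANN search against brute-force ground truth. The following is a sketch under stated assumptions: `recall_at_k` is a hypothetical helper, IDs are taken to be 64-bit rowids, and IDs within each array are assumed unique.

```c
#include <stddef.h>
#include <stdint.h>

#define ANN_OK 0
#define ANN_ERR_NULL -1
#define ANN_ERR_INVALID -3

/* Writes recall@k to *recall: the fraction of the k exact neighbor IDs
** that also appear among the k approximate IDs.
** Assumes IDs are unique within each array. O(k^2), fine for small k. */
int recall_at_k(const int64_t *exact_ids, const int64_t *approx_ids,
                size_t k, double *recall) {
    size_t hits = 0;
    if (!exact_ids || !approx_ids || !recall) return ANN_ERR_NULL;
    if (k == 0) return ANN_ERR_INVALID;

    for (size_t i = 0; i < k; i++) {
        for (size_t j = 0; j < k; j++) {
            if (exact_ids[i] == approx_ids[j]) {
                hits++;
                break;
            }
        }
    }
    *recall = (double)hits / (double)k;
    return ANN_OK;
}
```

Averaging this value over many queries gives the recall figure to report alongside queries per second.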
CRITICAL: Document all performance experiments in experiments/ directory.
Performance tuning is expensive (hours of benchmark time).
- Before running benchmarks, copy experiments/template.md and document your hypothesis, expected results, and setup.
- After completion, document actual results, analysis, and WHY results differed from expectations.
- Update the experiments/README.md index.
This prevents future engineers from repeating failed experiments. See experiments/README.md for full guidelines.
- Comment WHY, not WHAT
- Document ownership and lifecycle of resources
- Explain non-obvious algorithms or optimizations
- Keep comments up-to-date (no "lava flow")
/**
* Create a new ANN index.
*
* @param out Pointer to receive the new index (must not be NULL)
* @param dim Vector dimensionality (must be > 0 and <= ANN_MAX_DIMENSIONS)
* @param metric Distance metric (e.g., ANN_METRIC_EUCLIDEAN)
* @return ANN_OK on success, error code on failure
*
* The caller takes ownership of the returned index and must call
* ann_index_destroy() when done.
*
* Example:
* ann_index_t *idx;
* int rc = ann_index_create(&idx, 128, ANN_METRIC_EUCLIDEAN);
* if (rc != ANN_OK) { return rc; }
* // ... use index ...
* ann_index_destroy(idx);
*/
int ann_index_create(ann_index_t **out, size_t dim, ann_metric_t metric);

- Follow Conventional Commits format
- Scope is the most-changed file (without extension)
- Keep commits focused and atomic
- DO NOT include Co-Authored-By tags
Example:
feat(ann_core): implement HNSW algorithm for ANN search
Adds hierarchical navigable small world graph implementation
with configurable M and efConstruction parameters.
Benchmarks show 10x speedup vs brute force on 100K vectors.
- Feature branches: feature/hnsw-index
- Bug fixes: fix/normalize-vectors
- Experiments: exp/simd-distance
- Always check function signatures in SQLite headers
- Verify vector math formulas (Euclidean, cosine, dot product)
- Test edge cases (zero vectors, high dimensions, large datasets)
- Use static analyzers and sanitizers to catch issues early
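
One inexpensive way to verify distance formulas is to cross-check them against an algebraic identity: the squared Euclidean distance equals a·a + b·b - 2(a·b). The sketch below uses hypothetical helper names (`dot_product`, `l2_squared`, `check_l2_identity`) and accumulates in double to limit rounding error. It leaves out cosine distance, which needs a square root plus an explicit decision on how to reject zero vectors.

```c
#include <stddef.h>

#define ANN_OK 0
#define ANN_ERR_NULL -1
#define ANN_ERR_INVALID -3

/* Dot product: sum of a[i] * b[i]. */
static int dot_product(const float *a, const float *b, size_t dim, double *out) {
    double sum = 0.0;
    if (!a || !b || !out) return ANN_ERR_NULL;
    if (dim == 0) return ANN_ERR_INVALID;
    for (size_t i = 0; i < dim; i++) sum += (double)a[i] * (double)b[i];
    *out = sum;
    return ANN_OK;
}

/* Squared Euclidean distance: sum of (a[i] - b[i])^2. */
static int l2_squared(const float *a, const float *b, size_t dim, double *out) {
    double sum = 0.0;
    if (!a || !b || !out) return ANN_ERR_NULL;
    if (dim == 0) return ANN_ERR_INVALID;
    for (size_t i = 0; i < dim; i++) {
        double d = (double)a[i] - (double)b[i];
        sum += d * d;
    }
    *out = sum;
    return ANN_OK;
}

/* Sets *holds to 1 when |l2(a,b) - (a.a + b.b - 2 a.b)| <= tol, else 0. */
static int check_l2_identity(const float *a, const float *b, size_t dim,
                             double tol, int *holds) {
    double aa, bb, ab, l2, diff;
    int rc;
    if (!holds) return ANN_ERR_NULL;
    if ((rc = dot_product(a, a, dim, &aa)) != ANN_OK) return rc;
    if ((rc = dot_product(b, b, dim, &bb)) != ANN_OK) return rc;
    if ((rc = dot_product(a, b, dim, &ab)) != ANN_OK) return rc;
    if ((rc = l2_squared(a, b, dim, &l2)) != ANN_OK) return rc;
    diff = l2 - (aa + bb - 2.0 * ab);
    if (diff < 0.0) diff = -diff;
    *holds = (diff <= tol);
    return ANN_OK;
}
```

Running the identity check over random vectors, zero vectors, and maximum-dimension vectors covers several of the edge cases listed above with one property.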
Use a simple Makefile or CMake:
# Example Makefile snippet
CFLAGS = -std=c17 -Wall -Wextra -Werror -pedantic -fPIC
LDFLAGS = -shared
LDLIBS = -lsqlite3
.PHONY: test clean
ann.so: src/*.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $^ $(LDLIBS)
test: tests/*.c src/*.c
	$(CC) $(CFLAGS) -o test_suite $^ $(LDLIBS)
	./test_suite
clean:
	rm -f ann.so test_suite *.o

CRITICAL: This library must support both CommonJS and ESM consumers.
The package.json must be configured for dual-format distribution:
{
"type": "module",
"main": "./dist/index.cjs",
"module": "./dist/index.mjs",
"types": "./dist/index.d.ts",
"exports": {
".": {
"require": {
"types": "./dist/index.d.cts",
"default": "./dist/index.cjs"
},
"import": {
"types": "./dist/index.d.ts",
"default": "./dist/index.mjs"
}
},
"./package.json": "./package.json"
}
}

Why this matters:
- Node.js projects may use require() (CJS) or import (ESM)
- TypeScript consumers need correct .d.ts vs .d.cts types
- Bundlers (webpack, vite, rollup) need to understand the export map
- Without proper configuration, imports will fail at runtime
Reference: See ../fs-metadata/package.json for the canonical example.
TypeScript wrapper functions MUST validate all SQL identifiers before interpolation.
// Bad - SQL injection vulnerability
function createIndex(db: Database, tableName: string) {
db.exec(`CREATE VIRTUAL TABLE ${tableName} USING diskann(...)`);
}
// Good - validate identifier first
function createIndex(db: Database, tableName: string) {
if (!isValidIdentifier(tableName)) {
throw new Error(`Invalid table name: ${tableName}`);
}
db.exec(`CREATE VIRTUAL TABLE ${tableName} USING diskann(...)`);
}
function isValidIdentifier(name: string): boolean {
// Match C layer validate_identifier logic
return /^[a-zA-Z_][a-zA-Z0-9_]*$/.test(name) && name.length <= 64;
}

All wrapper functions (createDiskAnnIndex, searchNearest, insertVector, deleteVector) that interpolate table/column names must validate identifiers to prevent injection attacks like "; DROP TABLE users; --".
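
The TypeScript check is meant to match the C layer's validate_identifier logic. A C-side check equivalent to that regular expression might look like the sketch below. It is an assumption about the rule, not the project's actual function: the name `is_valid_identifier` and the `MAX_IDENTIFIER_LEN` constant are hypothetical, and the 64-character limit is copied from the TypeScript example.

```c
#include <stddef.h>

#define MAX_IDENTIFIER_LEN 64 /* mirrors the limit in the TypeScript example */

/* Returns 1 if name matches ^[a-zA-Z_][a-zA-Z0-9_]*$ and is at most
** MAX_IDENTIFIER_LEN bytes long, otherwise 0.
** Uses explicit ASCII ranges rather than isalpha(), which is locale-dependent. */
static int is_valid_identifier(const char *name) {
    if (!name || name[0] == '\0') return 0;

    for (size_t i = 0; name[i] != '\0'; i++) {
        char c = name[i];
        int is_alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        int is_digit = (c >= '0' && c <= '9');

        if (i >= MAX_IDENTIFIER_LEN) return 0;              /* too long */
        if (!is_alpha && !(i > 0 && is_digit)) return 0;    /* bad character */
    }
    return 1;
}
```

Keeping the two layers on the same allowlist means a name rejected in TypeScript is never accepted by the extension, and vice versa.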
Include in npm package:
- build/ - Prebuilt native binaries for supported platforms
- src/ - TypeScript source for sourcemaps and debugging
- dist/ - Compiled CJS/ESM JavaScript and type definitions
- README.md, LICENSE
Exclude from npm package:
- Development docs (CLAUDE.md, DESIGN-PRINCIPLES.md, TDD.md)
- Tests (except maybe a smoke test)
- Build artifacts (compile_commands.json, *.o files)
The files array in package.json controls what gets published. Be conservative - users don't need build infrastructure.
- Validate all SQL inputs to prevent injection
- Bounds-check all vector operations
- Prevent integer overflows in dimension calculations
- Use secure random for any randomized algorithms
- Document any known limitations or attack vectors
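
The integer-overflow rule above applies directly to size computations such as count * dim * sizeof(float). A minimal sketch of an overflow-checked version follows; `vector_bytes` and `alloc_vectors` are hypothetical helper names, while `ANN_MAX_DIMENSIONS` and the error codes mirror the definitions in this document.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ANN_OK 0
#define ANN_ERR_NULL -1
#define ANN_ERR_NOMEM -2
#define ANN_ERR_INVALID -3
#define ANN_MAX_DIMENSIONS 2048

/* Computes count * dim * sizeof(float) into *bytes, rejecting any input
** whose product would wrap around SIZE_MAX. */
static int vector_bytes(size_t count, size_t dim, size_t *bytes) {
    if (!bytes) return ANN_ERR_NULL;
    if (count == 0 || dim == 0 || dim > ANN_MAX_DIMENSIONS) return ANN_ERR_INVALID;
    if (count > SIZE_MAX / dim) return ANN_ERR_INVALID;
    if (count * dim > SIZE_MAX / sizeof(float)) return ANN_ERR_INVALID;
    *bytes = count * dim * sizeof(float);
    return ANN_OK;
}

/* Allocates storage for count vectors of dim floats.
** The caller owns *out and must free() it. */
static int alloc_vectors(size_t count, size_t dim, float **out) {
    size_t bytes = 0;
    float *buf;
    int rc;

    if (!out) return ANN_ERR_NULL;
    rc = vector_bytes(count, dim, &bytes);
    if (rc != ANN_OK) return rc;

    buf = malloc(bytes);
    if (!buf) return ANN_ERR_NOMEM;
    *out = buf;
    return ANN_OK;
}
```

Dividing SIZE_MAX by one operand before multiplying keeps each check itself free of overflow, which a post-multiplication comparison cannot guarantee.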