speciFic

speciFic is a passion project developed for a university NLP class. Its goal is to improve recommendation quality for Archive of Our Own (AO3) fanfiction retrieval by comparing two retrieval paradigms in a controlled, self-designed experimental setup:

a Knowledge Graph (KG) approach
a Semantic Embeddings (SE) approach

The project includes end-to-end data handling, indexing, recommendation, and automatic evaluation with relevance metrics (including F1 score) saved to results/.

What is included

1) Two recommendation approaches

Knowledge Graph approach (KG.py):
- builds an RDF graph from AO3 metadata stored in PostgreSQL
- links fics, authors, fandoms, pairings, tags, warnings, etc.
- supports tag inference from synopsis + tag implication edges + similarity edges
- runs SPARQL-like recommendation queries and evaluates them
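
The tag implication edges can be read as a directed graph in which a specific tag implies broader ones. The following is a minimal sketch of that inference step only, assuming a plain dictionary of implications. The tag names and the expand_tags helper are hypothetical and are not taken from KG.py, which works on an RDF graph.

```python
from collections import deque

# Hypothetical implication edges: each tag maps to the broader tags it implies.
IMPLIES = {
    "Slow Burn": {"Romance"},
    "Enemies to Lovers": {"Romance"},
    "Romance": {"Relationship(s)"},
}


def expand_tags(explicit_tags, implies=IMPLIES):
    """Return the explicit tags plus every tag reachable via implication edges."""
    seen = set(explicit_tags)
    queue = deque(explicit_tags)
    while queue:
        tag = queue.popleft()
        for broader in implies.get(tag, ()):
            if broader not in seen:
                seen.add(broader)
                queue.append(broader)
    return seen


print(sorted(expand_tags({"Slow Burn"})))
# ['Relationship(s)', 'Romance', 'Slow Burn']
```

A fic tagged only "Slow Burn" would then also match a query for "Romance", which is the effect the implication edges are meant to have.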
Semantic Embeddings approach (embeddings.py, faiss_agent.py, query_handler.py):
- interprets/cleans natural-language queries
- embeds queries, synopses, and tags using sentence-transformers
- combines tag-based and embedding-based signals in a hybrid ranking
- evaluates recommendations against relevance sets in query JSON files
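
One common way to combine a tag-based and an embedding-based signal is a weighted sum of the two scores. The sketch below illustrates that idea under stated assumptions: the hybrid_rank function, the tag_weight parameter, and the example scores are all hypothetical, and the actual weighting used in embeddings.py may differ.

```python
def hybrid_rank(tag_scores, embedding_scores, tag_weight=0.5, top_k=10):
    """Rank work IDs by a weighted sum of tag and embedding scores.

    Both inputs map work ID -> score in [0, 1]. A work missing from one
    mapping contributes 0 for that signal.
    """
    work_ids = set(tag_scores) | set(embedding_scores)
    combined = {
        work_id: tag_weight * tag_scores.get(work_id, 0.0)
        + (1.0 - tag_weight) * embedding_scores.get(work_id, 0.0)
        for work_id in work_ids
    }
    # Sort by descending score, breaking ties by ID for stable output.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical scores for three works.
tag_scores = {"fic_a": 1.0, "fic_b": 0.5}
embedding_scores = {"fic_a": 0.5, "fic_b": 0.75, "fic_c": 1.0}
print(hybrid_rank(tag_scores, embedding_scores, tag_weight=0.5))
# [('fic_a', 0.75), ('fic_b', 0.625), ('fic_c', 0.5)]
```

Setting tag_weight to 0.0 or 1.0 reduces the ranking to a pure embedding or pure tag ranking, which can be useful when checking how much each signal contributes.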

2) Self-made query test suites

Under queries/ there are six evaluation scenarios:

explicit tags (queries/explicit_tags)
explicit tags in synopses (queries/explicit_synopses_tags)
implicit tags in synopses (queries/implicit_synopses_tags)
excluding tags (queries/exclude_tags)
tag preference strength / degree of preference (queries/tag_preferences)
similar fic retrieval (queries/similar_fics)

Each JSON query includes expected relevant works so precision, recall, and F1 can be computed automatically.
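
For reference, the three metrics follow from the overlap between the recommended works and the expected relevant works. This is a minimal sketch using the standard set-based definitions; the function name and the example IDs are illustrative and not taken from the evaluators in this repository.

```python
def precision_recall_f1(recommended, relevant):
    """Compute set-based precision, recall, and F1 for one query."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical work IDs: 2 of the 4 recommendations are among the 3 relevant works.
p, r, f1 = precision_recall_f1([1, 2, 3, 4], [2, 4, 6])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```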

3) FAISS usage (brief)

faiss_agent.py uses FAISS to store vector embeddings (for fanfic synopses and tags) and run fast nearest-neighbor search. In practical terms: instead of comparing a query embedding with all vectors one by one in Python, FAISS indexes them efficiently and returns the most similar items quickly.
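
For contrast, the one-by-one comparison that FAISS avoids looks roughly like the pure-Python sketch below. It is an illustration only, not code from faiss_agent.py, and the toy vectors are made up. An exact inner-product FAISS index over normalized vectors should return the same neighbours, with the scan done in optimized native code instead of a Python loop.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def brute_force_search(query, vectors, k=2):
    """Return the k (index, cosine similarity) pairs closest to the query."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
for idx, score in brute_force_search([1.0, 0.1], vectors, k=2):
    print(idx, round(score, 3))
```

The loop touches every stored vector on every query, so its cost grows linearly with the number of indexed synopses and tags.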

4) Database handling

PostgreSQL schema in create_tables.sql
reset helper in clear_db.sql
metadata/tag ingestion utilities in db_agent.py and populate_database.py

The DB stores fanfic metadata, canonical tags, and fic-tag links (explicit or inferred).

5) Automatic experiment outputs

KG evaluator writes per-query JSON files to:
- results/KG/<query_set>/query_*_results.json
Embeddings evaluator writes per-query JSON summaries (including F1) under:
- results/embeddings/... (mirrors queries/...)
Aggregate analysis script:
- aggregate_results.py → results/aggregated_results.csv

Setup

1) Python dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md

2) Environment variables

Create a .env file with:

DB_USERNAME=...
DB_PASSWORD=...
DB_HOST=...
DB_PORT=...
DB_NAME=...
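
As a sketch of how these five variables fit together, the snippet below assembles a PostgreSQL connection URL from them, assuming they have already been loaded into the process environment. The build_postgres_url helper and the example values are hypothetical; db_agent.py may read the variables differently.

```python
import os
from urllib.parse import quote

REQUIRED = ("DB_USERNAME", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME")


def build_postgres_url(env=os.environ):
    """Build a postgresql:// URL from the DB_* variables, failing early if one is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    # Percent-encode credentials so characters such as '@' or ':' do not break the URL.
    user = quote(env["DB_USERNAME"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"


# Made-up example values, passed as a dict instead of the real environment.
example_env = {
    "DB_USERNAME": "ao3",
    "DB_PASSWORD": "p@ss",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "specific",
}
print(build_postgres_url(example_env))
# postgresql://ao3:p%40ss@localhost:5432/specific
```

The psql commands in the next step expand the same variable names in the shell, so the values also need to be exported there before running them.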

3) Create/reset DB schema

psql "postgresql://$DB_USERNAME:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" -f create_tables.sql

Optional reset:

psql "postgresql://$DB_USERNAME:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" -f clear_db.sql

4) Populate tags + fanfiction metadata

Load AO3 tag metadata from CSV:

python -c "from db_agent import DB_Agent; DB_Agent().populate_tag_table('tags.csv', verbose=True)"

Populate fics from URL lists in temp.json:

python populate_database.py

Build FAISS indexes

Run once after DB population (or rebuild after deleting old index files):

python - <<'PY'
from faiss_agent import FaissAgent

fa = FaissAgent()
fa.build_tag_index_once()
fa.build_canonical_reverse_map()
fa.build_fanfic_index_once()
PY

Run experiments

Knowledge Graph evaluation

python test_KG.py

This runs the six query sets and writes per-query metrics (including F1) to results/KG/<query_set>/.

Embeddings evaluation

python test_embeddings.py

test_embeddings.py is currently configured to walk a single query folder (queries/tag_preferences). To evaluate all six sets with embeddings, run the evaluator over each folder or change base_query_dir in that file.
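
Running the evaluator over every folder could look like the sketch below. It only shows the directory traversal: iter_query_files and evaluate_all are hypothetical helpers, and evaluate_query stands in for whatever per-query function test_embeddings.py actually calls, which is not shown here.

```python
from pathlib import Path

# The six query sets listed under queries/.
QUERY_SETS = [
    "explicit_tags",
    "explicit_synopses_tags",
    "implicit_synopses_tags",
    "exclude_tags",
    "tag_preferences",
    "similar_fics",
]


def iter_query_files(base_dir="queries", query_sets=QUERY_SETS):
    """Yield (query_set, path) for every JSON query file in the given sets."""
    base = Path(base_dir)
    for query_set in query_sets:
        folder = base / query_set
        if not folder.is_dir():
            continue  # Skip sets that are not present locally.
        # rglob also picks up JSON files in nested subfolders.
        for path in sorted(folder.rglob("*.json")):
            yield query_set, path


def evaluate_all(evaluate_query, base_dir="queries"):
    """Apply a caller-supplied evaluator (a stand-in here) to each query file."""
    return [
        (query_set, path.name, evaluate_query(path))
        for query_set, path in iter_query_files(base_dir)
    ]


# Lists the query files that would be evaluated; prints nothing if queries/ is absent.
for query_set, path in iter_query_files():
    print(query_set, path)
```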

Aggregate and inspect results

Create aggregated CSV (mean/std F1 by query set and approach):

python aggregate_results.py
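
The aggregation itself amounts to grouping per-query F1 scores by approach and query set and taking the mean and standard deviation of each group. The sketch below shows that computation on made-up scores. The function names, the column names, and the choice of sample standard deviation are assumptions and may not match aggregate_results.py.

```python
import csv
import io
import statistics
from collections import defaultdict

FIELDS = ["approach", "query_set", "n", "mean_f1", "std_f1"]


def aggregate_f1(rows):
    """Group (approach, query_set, f1) rows and return mean/std per group."""
    groups = defaultdict(list)
    for approach, query_set, f1 in rows:
        groups[(approach, query_set)].append(f1)
    summary = []
    for (approach, query_set), scores in sorted(groups.items()):
        # Sample standard deviation needs at least two scores.
        std = statistics.stdev(scores) if len(scores) > 1 else 0.0
        summary.append({
            "approach": approach,
            "query_set": query_set,
            "n": len(scores),
            "mean_f1": statistics.mean(scores),
            "std_f1": std,
        })
    return summary


def to_csv(summary):
    """Serialize the summary rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(summary)
    return buffer.getvalue()


# Made-up per-query F1 scores, not real results.
rows = [
    ("KG", "explicit_tags", 0.5),
    ("KG", "explicit_tags", 1.0),
    ("embeddings", "explicit_tags", 0.75),
]
print(to_csv(aggregate_f1(rows)))
```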

Optional plotting/statistical scripts:

python plot_queryset_results.py
python test_normality.py

Repository map (high-level)

KG.py — knowledge graph build/query/evaluation
embeddings.py — embedding recommender + evaluation
faiss_agent.py — FAISS indexing/loading/search support
query_handler.py — NLP query interpretation (ships/characters/title matching/cleanup)
db_agent.py — DB CRUD and tag canonicalization utilities
populate_database.py — AO3 metadata ingestion pipeline
queries/ — benchmark query suites
results/ — stored evaluation outputs and analyses

Notes

This repository is designed as an experimental framework rather than a production AO3 recommender.
The main contribution is the side-by-side comparison of graph-based and embedding-based retrieval under a consistent query benchmark and metric protocol.
