speciFic is a passion project developed in the context of a university NLP class.
The core goal is to improve recommendation quality for Archive of Our Own (AO3) fanfiction retrieval by comparing two different retrieval paradigms in a controlled, self-designed experimental setup:
- a Knowledge Graph (KG) approach
- a Semantic Embeddings (SE) approach
The project includes end-to-end data handling, indexing, recommendation, and automatic evaluation with relevance metrics (including F1 score) saved to results/.
- Knowledge Graph approach (KG.py):
  - builds an RDF graph from AO3 metadata stored in PostgreSQL
  - links fics, authors, fandoms, pairings, tags, warnings, etc.
  - supports tag inference from synopsis + tag implication edges + similarity edges
  - runs SPARQL-like recommendation queries and evaluates them
- Semantic Embeddings approach (embeddings.py, faiss_agent.py, query_handler.py):
  - interprets/cleans natural-language queries
  - embeds queries, synopses, and tags using sentence-transformers
  - combines tag-based and embedding-based signals in a hybrid ranking
  - evaluates recommendations against relevance sets in query JSON files
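The hybrid ranking mentioned for the embeddings approach can be pictured with a small sketch. It is not the code in embeddings.py: the function names, the tag-overlap score, and the weight `alpha` are all assumptions chosen for illustration, and the vectors are toy 2-dimensional stand-ins for real sentence embeddings.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tag_overlap(query_tags, fic_tags):
    """Fraction of the requested tags that the fic carries."""
    wanted = set(query_tags)
    if not wanted:
        return 0.0
    return len(wanted & set(fic_tags)) / len(wanted)


def hybrid_rank(query_vec, query_tags, fics, alpha=0.5):
    """Rank fics by alpha * tag score + (1 - alpha) * embedding score.

    `fics` maps a fic id to a (tags, synopsis_vector) pair. `alpha` is a
    hypothetical weight, not a value taken from the project.
    """
    scored = []
    for fic_id, (tags, vec) in fics.items():
        tag_score = tag_overlap(query_tags, tags)
        emb_score = cosine(query_vec, vec)
        scored.append((fic_id, alpha * tag_score + (1 - alpha) * emb_score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: made-up tags and vectors, purely illustrative.
fics = {
    "fic_a": (["fluff", "slow burn"], [1.0, 0.0]),
    "fic_b": (["angst"], [1.0, 0.0]),
    "fic_c": (["fluff"], [0.0, 1.0]),
}
ranking = hybrid_rank([1.0, 0.0], ["fluff", "slow burn"], fics)
print([fic_id for fic_id, _ in ranking])
```

With equal weights, a fic that matches both the tags and the synopsis embedding ranks above one that matches only one of the two signals.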
Under queries/ there are six evaluation scenarios:
- explicit tags (queries/explicit_tags)
- explicit tags in synopses (queries/explicit_synopses_tags)
- implicit tags in synopses (queries/implicit_synopses_tags)
- excluding tags (queries/exclude_tags)
- tag preference strength / degree of preference (queries/tag_preferences)
- similar fic retrieval (queries/similar_fics)
Each JSON query includes expected relevant works so precision, recall, and F1 can be computed automatically.
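As a rough illustration of how these metrics follow from a list of expected relevant works, here is a minimal sketch using the standard set-based definitions. The function name and the work ids are hypothetical, and the project's evaluators may handle edge cases differently.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall, and F1 for one query."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0
    # when there are no hits to avoid dividing by zero.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical work ids: four recommendations, three expected works.
p, r, f1 = precision_recall_f1([101, 102, 103, 104], [102, 104, 105])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of the four recommendations are relevant and two of the three expected works are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.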
faiss_agent.py uses FAISS to store vector embeddings (for fanfic synopses and tags) and run fast nearest-neighbor search.
In practical terms: instead of comparing a query embedding with all vectors one by one in Python, FAISS indexes them efficiently and returns the most similar items quickly.
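For contrast, the one-by-one comparison described above looks roughly like the following pure-Python sketch. It is the baseline that an index avoids, not code from faiss_agent.py; the function names and the 3-dimensional toy vectors are assumptions for illustration.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def brute_force_search(query, vectors, k=2):
    """Score every stored vector against the query and keep the top k."""
    q = normalize(query)
    scored = []
    for item_id, vec in vectors.items():
        v = normalize(vec)
        scored.append((item_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for synopsis embeddings.
vectors = {
    "fic_1": [0.9, 0.1, 0.0],
    "fic_2": [0.0, 1.0, 0.0],
    "fic_3": [0.7, 0.7, 0.0],
}
top = brute_force_search([1.0, 0.0, 0.0], vectors, k=2)
print([item_id for item_id, _ in top])
```

The cost of this loop grows linearly with the number of stored vectors per query, which is the work a vector index is meant to speed up.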
- PostgreSQL schema in create_tables.sql
- reset helper in clear_db.sql
- metadata/tag ingestion utilities in db_agent.py and populate_database.py
The DB stores fanfic metadata, canonical tags, and fic-tag links (explicit or inferred).
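To make the fic/tag/link layout concrete, the sketch below builds a toy version of it. Everything here is an assumption: the real schema lives in create_tables.sql and targets PostgreSQL, whereas this example uses an in-memory SQLite database with hypothetical table and column names so that it runs without a server.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fic (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE fic_tag (
    fic_id INTEGER NOT NULL REFERENCES fic(id),
    tag_id INTEGER NOT NULL REFERENCES tag(id),
    inferred INTEGER NOT NULL DEFAULT 0,  -- 0 = explicit, 1 = inferred
    PRIMARY KEY (fic_id, tag_id)
);
""")

conn.execute("INSERT INTO fic VALUES (1, 'Example Fic')")
conn.executemany("INSERT INTO tag VALUES (?, ?)", [(1, "Fluff"), (2, "Slow Burn")])
# One explicit link and one inferred link for the same fic.
conn.executemany("INSERT INTO fic_tag VALUES (?, ?, ?)", [(1, 1, 0), (1, 2, 1)])

rows = conn.execute(
    """
    SELECT tag.name, fic_tag.inferred
    FROM fic_tag JOIN tag ON tag.id = fic_tag.tag_id
    WHERE fic_tag.fic_id = ?
    ORDER BY tag.name
    """,
    (1,),
).fetchall()
print(rows)
```

Keeping the explicit/inferred flag on the link row, rather than on the tag, lets the same canonical tag be explicit on one fic and inferred on another.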
- KG evaluator writes per-query JSON files to: results/KG/<query_set>/query_*_results.json
- Embeddings evaluator writes per-query JSON summaries (including F1) under: results/embeddings/... (mirrors queries/...)
- Aggregate analysis script: aggregate_results.py → results/aggregated_results.csv
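The aggregation step can be sketched as grouping per-query F1 scores by approach and query set. This is not the code in aggregate_results.py: the record keys (`approach`, `query_set`, `f1`) are assumed, the scores are made-up placeholders rather than project results, and the choice of sample standard deviation is a guess.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_f1(records):
    """Group F1 scores by (approach, query_set) and return (mean, std)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["approach"], rec["query_set"])].append(rec["f1"])
    summary = {}
    for key, scores in groups.items():
        # Sample standard deviation needs at least two scores.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[key] = (mean(scores), spread)
    return summary


# Placeholder records; the values are illustrative only.
records = [
    {"approach": "KG", "query_set": "explicit_tags", "f1": 0.5},
    {"approach": "KG", "query_set": "explicit_tags", "f1": 1.0},
    {"approach": "embeddings", "query_set": "explicit_tags", "f1": 0.75},
]
summary = aggregate_f1(records)
for (approach, query_set), (avg, std) in summary.items():
    print(f"{approach:<12} {query_set:<16} mean={avg:.3f} std={std:.3f}")
```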

Install dependencies and the spaCy models:

    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
    python -m spacy download en_core_web_md

Create a .env file with:

    DB_USERNAME=...
    DB_PASSWORD=...
    DB_HOST=...
    DB_PORT=...
    DB_NAME=...

Create the database tables:

    psql "postgresql://$DB_USERNAME:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" -f create_tables.sql

Optional reset:

    psql "postgresql://$DB_USERNAME:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" -f clear_db.sql

Load AO3 tag metadata from CSV:

    python -c "from db_agent import DB_Agent; DB_Agent().populate_tag_table('tags.csv', verbose=True)"

Populate fics from URL lists in temp.json:

    python populate_database.py

Build the FAISS indexes once after DB population (or rebuild after deleting old index files):

    python - <<'PY'
    from faiss_agent import FaissAgent
    fa = FaissAgent()
    fa.build_tag_index_once()
    fa.build_canonical_reverse_map()
    fa.build_fanfic_index_once()
    PY

    python test_KG.py

This runs the six query sets and writes metrics (including F1) into results/KG/....

    python test_embeddings.py

By default, test_embeddings.py is currently configured to walk one query folder (queries/tag_preferences).
To evaluate all six sets with embeddings, run the evaluator over each folder or adapt base_query_dir in that file.
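One way to cover all six sets is to enumerate the query folders and hand each one to the evaluator in turn. The sketch below only collects the query files: the helper name `collect_query_files` is hypothetical, and the evaluator's entry point is not shown in this README, so the actual call is left as a comment.

```python
from pathlib import Path

# Folder names as listed under queries/ in this README.
QUERY_SETS = [
    "explicit_tags",
    "explicit_synopses_tags",
    "implicit_synopses_tags",
    "exclude_tags",
    "tag_preferences",
    "similar_fics",
]


def collect_query_files(root="queries"):
    """Map each query set name to its sorted list of JSON query files."""
    root = Path(root)
    return {name: sorted((root / name).glob("*.json")) for name in QUERY_SETS}


for name, files in collect_query_files().items():
    print(f"{name}: {len(files)} query file(s)")
    # Run the embeddings evaluator on `files` here, or point
    # base_query_dir at queries/<name> and rerun test_embeddings.py.
```

A missing folder simply yields an empty list, so the loop also serves as a quick check that all six query sets are present.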
Create the aggregated CSV (mean/std F1 by query set and approach):

    python aggregate_results.py

Optional plotting/statistical scripts:

    python plot_queryset_results.py
    python test_normality.py

Key files and directories:
- KG.py: knowledge graph build/query/evaluation
- embeddings.py: embedding recommender + evaluation
- faiss_agent.py: FAISS indexing/loading/search support
- query_handler.py: NLP query interpretation (ships/characters/title matching/cleanup)
- db_agent.py: DB CRUD and tag canonicalization utilities
- populate_database.py: AO3 metadata ingestion pipeline
- queries/: benchmark query suites
- results/: stored evaluation outputs and analyses
- This repository is designed as an experimental framework rather than a production AO3 recommender.
- The main contribution is the side-by-side comparison of graph-based and embedding-based retrieval under a consistent query benchmark and metric protocol.