minasilva2003/speciFic


speciFic

speciFic is a passion project developed for a university NLP class. Its core goal is to improve recommendation quality for Archive of Our Own (AO3) fanfiction retrieval by comparing two retrieval paradigms in a controlled, self-designed experimental setup:

  1. a Knowledge Graph (KG) approach
  2. a Semantic Embeddings (SE) approach

The project includes end-to-end data handling, indexing, recommendation, and automatic evaluation with relevance metrics (including F1 score) saved to results/.


What is included

1) Two recommendation approaches

  • Knowledge Graph approach (KG.py):

    • builds an RDF graph from AO3 metadata stored in PostgreSQL
    • links fics, authors, fandoms, pairings, tags, warnings, etc.
    • supports tag inference from synopsis + tag implication edges + similarity edges
    • runs SPARQL-like recommendation queries and evaluates them
  • Semantic Embeddings approach (embeddings.py, faiss_agent.py, query_handler.py):

    • interprets/cleans natural-language queries
    • embeds queries, synopses, and tags using sentence-transformers
    • combines tag-based and embedding-based signals in a hybrid ranking
    • evaluates recommendations against relevance sets in query JSON files
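The hybrid ranking mentioned for the Semantic Embeddings approach can be pictured as a weighted blend of a tag score and an embedding score per fic. The following is only a minimal sketch under stated assumptions, not the code in embeddings.py: the names `tag_overlap` and `hybrid_rank`, the weight `alpha`, and the use of Jaccard overlap as the tag signal are all hypothetical.

```python
def tag_overlap(query_tags, fic_tags):
    """Jaccard overlap between two tag sets (0.0 when both are empty)."""
    q, f = set(query_tags), set(fic_tags)
    union = q | f
    return len(q & f) / len(union) if union else 0.0


def hybrid_rank(query_tags, embedding_scores, fic_tags, alpha=0.5):
    """Rank fic ids by alpha * tag score + (1 - alpha) * embedding score.

    embedding_scores: dict of fic id -> similarity in [0, 1] (assumed precomputed)
    fic_tags:         dict of fic id -> iterable of tags
    alpha:            hypothetical weight of the tag signal
    """
    combined = {}
    for fic_id, emb_score in embedding_scores.items():
        tag_score = tag_overlap(query_tags, fic_tags.get(fic_id, ()))
        combined[fic_id] = alpha * tag_score + (1 - alpha) * emb_score
    # Highest combined score first
    return sorted(combined, key=combined.get, reverse=True)


ranking = hybrid_rank(
    query_tags={"fluff", "slow burn"},
    embedding_scores={"fic_a": 0.9, "fic_b": 0.6, "fic_c": 0.2},
    fic_tags={
        "fic_a": {"angst"},
        "fic_b": {"fluff", "slow burn"},
        "fic_c": {"fluff"},
    },
)
print(ranking)
```

In this toy input, fic_b overtakes fic_a despite a lower embedding similarity because it matches both requested tags, which is the kind of trade-off a hybrid score is meant to capture.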

2) Self-made query test suites

Under queries/ there are six evaluation scenarios:

  1. explicit tags (queries/explicit_tags)
  2. explicit tags in synopses (queries/explicit_synopses_tags)
  3. implicit tags in synopses (queries/implicit_synopses_tags)
  4. excluding tags (queries/exclude_tags)
  5. tag preference strength / degree of preference (queries/tag_preferences)
  6. similar fic retrieval (queries/similar_fics)

Each JSON query includes expected relevant works so precision, recall, and F1 can be computed automatically.
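As an illustration of how these metrics follow from a relevance set, here is a small sketch using set-based precision, recall, and F1. The function name `precision_recall_f1` and the example work ids are hypothetical; the repository's evaluators may compute the metrics differently (for example, at a fixed cutoff).

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall and F1 for one query."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical work ids: 4 recommendations, 3 expected relevant works, 2 in common
p, r, f1 = precision_recall_f1([101, 102, 103, 104], [102, 104, 105])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```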

3) FAISS usage (brief)

faiss_agent.py uses FAISS to store vector embeddings (for fanfic synopses and tags) and run fast nearest-neighbor search. In practical terms: instead of comparing a query embedding with all vectors one by one in Python, FAISS indexes them efficiently and returns the most similar items quickly.
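To make the contrast concrete, the sketch below shows the one-by-one comparison that FAISS replaces: a pure-Python cosine-similarity scan over every stored vector. The names `cosine` and `brute_force_top_k` and the toy 2-dimensional vectors are hypothetical and only illustrate the baseline, not the logic in faiss_agent.py.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_top_k(query, vectors, k=2):
    """Score every stored vector against the query and keep the k best ids.

    This is O(number of vectors) Python-level work per query, which is the
    cost an index structure is meant to avoid.
    """
    scored = ((cosine(query, vec), item_id) for item_id, vec in vectors.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Toy embeddings; real sentence-transformers vectors have hundreds of dimensions
vectors = {"fic_a": [1.0, 0.0], "fic_b": [0.7, 0.7], "fic_c": [0.0, 1.0]}
top = brute_force_top_k([1.0, 0.1], vectors, k=2)
print(top)
```

With FAISS, an exact search of this kind is commonly expressed as an `IndexFlatIP` over L2-normalized float32 vectors, filled with `add` and queried with `search`. Which index type faiss_agent.py actually uses is not stated in this README.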

4) Database handling

  • PostgreSQL schema in create_tables.sql
  • reset helper in clear_db.sql
  • metadata/tag ingestion utilities in db_agent.py and populate_database.py

The DB stores fanfic metadata, canonical tags, and fic-tag links (explicit or inferred).

5) Automatic experiment outputs

  • KG evaluator writes per-query JSON files to:
    • results/KG/<query_set>/query_*_results.json
  • Embeddings evaluator writes per-query JSON summaries (including F1) under:
    • results/embeddings/... (mirrors queries/...)
  • Aggregate analysis script:
    • aggregate_results.py writes results/aggregated_results.csv

Setup

1) Python dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md

2) Environment variables

Create a .env file with:

DB_USERNAME=...
DB_PASSWORD=...
DB_HOST=...
DB_PORT=...
DB_NAME=...
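The psql commands in the next step expand these names as shell variables, so they must also be exported in the shell that runs psql; psql does not read a .env file by itself. As a rough sketch of how the five values combine into a connection URL, assuming they are already present in the process environment (the helper name `build_postgres_url` is hypothetical and not part of db_agent.py):

```python
import os
from urllib.parse import quote

REQUIRED = ("DB_USERNAME", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME")


def build_postgres_url(env=os.environ):
    """Assemble a postgresql:// URL from the five variables listed above."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    # Percent-encode credentials so characters like '@' or '/' stay unambiguous
    user = quote(env["DB_USERNAME"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )


# Made-up values for illustration only
example_env = {
    "DB_USERNAME": "ao3",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "specific",
}
url = build_postgres_url(example_env)
print(url)
```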

3) Create/reset DB schema

psql "postgresql://$DB_USERNAME:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" -f create_tables.sql

Optional reset:

psql "postgresql://$DB_USERNAME:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" -f clear_db.sql

4) Populate tags + fanfiction metadata

Load AO3 tag metadata from CSV:

python -c "from db_agent import DB_Agent; DB_Agent().populate_tag_table('tags.csv', verbose=True)"

Populate fics from URL lists in temp.json:

python populate_database.py

Build FAISS indexes

Run once after DB population (or rebuild after deleting old index files):

python - <<'PY'
from faiss_agent import FaissAgent

fa = FaissAgent()
fa.build_tag_index_once()
fa.build_canonical_reverse_map()
fa.build_fanfic_index_once()
PY

Run experiments

Knowledge Graph evaluation

python test_KG.py

This runs the six query sets and writes metrics (including F1) into results/KG/....

Embeddings evaluation

python test_embeddings.py

test_embeddings.py is currently configured to walk a single query folder (queries/tag_preferences). To evaluate all six sets with embeddings, run the evaluator on each folder in turn or change base_query_dir in that file.
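One way to cover all six query sets is a small driver loop over the folder names listed under queries/. This is a sketch under assumptions: `evaluate_all` is a hypothetical helper, and `evaluate_folder` stands for whatever per-folder entry point the embeddings evaluator exposes, which this README does not specify. A file-counting stand-in is used here so the sketch runs on its own.

```python
from pathlib import Path

# Folder names taken from the query test suites listed in this README
QUERY_SETS = [
    "explicit_tags",
    "explicit_synopses_tags",
    "implicit_synopses_tags",
    "exclude_tags",
    "tag_preferences",
    "similar_fics",
]


def evaluate_all(evaluate_folder, base_dir="queries"):
    """Call evaluate_folder(path) once per query set and collect the results."""
    results = {}
    for name in QUERY_SETS:
        results[name] = evaluate_folder(Path(base_dir) / name)
    return results


def count_query_files(folder):
    """Stand-in evaluator: counts the JSON query files in a folder."""
    return len(list(folder.glob("*.json"))) if folder.is_dir() else 0


summary = evaluate_all(count_query_files)
for name, n_queries in summary.items():
    print(f"{name}: {n_queries} query files")
```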


Aggregate and inspect results

Create aggregated CSV (mean/std F1 by query set and approach):

python aggregate_results.py
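The aggregation step amounts to grouping per-query F1 scores by approach and query set, then taking the mean and standard deviation of each group. The sketch below illustrates that idea on in-memory records; the function `aggregate_f1`, the record field names, the scores, and the choice of sample standard deviation are assumptions rather than a description of aggregate_results.py.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_f1(records):
    """Group per-query F1 scores by (approach, query_set) and summarize them."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["approach"], rec["query_set"])].append(rec["f1"])
    rows = []
    for (approach, query_set), scores in sorted(groups.items()):
        rows.append({
            "approach": approach,
            "query_set": query_set,
            "mean_f1": mean(scores),
            # Sample standard deviation; undefined for a single score, so use 0.0
            "std_f1": stdev(scores) if len(scores) > 1 else 0.0,
            "n": len(scores),
        })
    return rows


# Made-up scores for illustration only
records = [
    {"approach": "KG", "query_set": "explicit_tags", "f1": 0.8},
    {"approach": "KG", "query_set": "explicit_tags", "f1": 0.6},
    {"approach": "embeddings", "query_set": "explicit_tags", "f1": 0.5},
    {"approach": "embeddings", "query_set": "explicit_tags", "f1": 0.7},
    {"approach": "embeddings", "query_set": "explicit_tags", "f1": 0.9},
]
rows = aggregate_f1(records)
for row in rows:
    print(row)
```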

Optional plotting/statistical scripts:

python plot_queryset_results.py
python test_normality.py

Repository map (high-level)

  • KG.py — knowledge graph build/query/evaluation
  • embeddings.py — embedding recommender + evaluation
  • faiss_agent.py — FAISS indexing/loading/search support
  • query_handler.py — NLP query interpretation (ships/characters/title matching/cleanup)
  • db_agent.py — DB CRUD and tag canonicalization utilities
  • populate_database.py — AO3 metadata ingestion pipeline
  • queries/ — benchmark query suites
  • results/ — stored evaluation outputs and analyses

Notes

  • This repository is designed as an experimental framework rather than a production AO3 recommender.
  • The main contribution is the side-by-side comparison of graph-based and embedding-based retrieval under a consistent query benchmark and metric protocol.

About

speciFic is an NLP fanfiction recommender for AO3 that compares Knowledge Graph and semantic-embedding (FAISS) retrieval approaches in an automated evaluation framework.
