We want some pre-defined query test sets which we can use to assess the quality of our scores (when integrated in RAG). This includes:

- Actual benchmarks as listed in #262
- Sanity test sets per domain which we ourselves know:
  - A set of academic websites, e.g. our own and our colleagues' websites
  - A set of queries related to tools we use or 'niche' subjects where we know the model could be easily fooled
- A new structure for test sets that focuses on testing the retrieval of new material (i.e., material that exceeds LLMs' knowledge cutoff dates), e.g.:
  - A & B pairs that we ask the LLM to compare (*A vs. B?*)
    - Where A is a library, tool, or event from the past that is factually grounded
    - And B is a new version, library, or concept from after the cutoff date
  - Cross-domain: tech tools, debunked health claims, conspiracy theories, political news
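The cutoff-pair idea above could be sketched as a small data structure. This is a minimal illustration, not a committed design: the class name, fields, and example pairs are all hypothetical, and the query template is just one way to phrase the A-vs-B comparison.

```python
from dataclasses import dataclass

@dataclass
class CutoffPair:
    """A pair of concepts straddling an LLM's knowledge cutoff.

    `a` is a pre-cutoff, factually grounded item; `b` is a
    post-cutoff version, library, or concept that the model
    should only be able to answer about via retrieval.
    """
    a: str
    b: str
    domain: str  # e.g. "tech-tools", "health", "politics"

    def to_query(self) -> str:
        # The comparison query we pose to the LLM.
        return f"{self.a} vs. {self.b}?"

# Hypothetical cross-domain examples (names illustrative only).
pairs = [
    CutoffPair("Python 3.11", "Python 3.13", "tech-tools"),
    CutoffPair("an older debunked health claim", "its recent resurgence", "health"),
]
queries = [p.to_query() for p in pairs]
```

Grouping pairs by `domain` would let us report retrieval quality per domain, matching the cross-domain bullet above.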