We want some pre-defined query test sets which we can use to assess the quality of our scores (when integrated in RAG). This includes:

- Actual benchmarks as listed in #262
- Sanity test sets per domain which we ourselves know:
  - A set of academic websites, e.g. our own and our colleagues' websites
  - A set of queries related to tools we use or 'niche' subjects where we know the model could be easily fooled
- A new structure for test sets that focuses on testing the retrieval of new material (i.e., material that exceeds LLMs' knowledge cutoff dates), e.g.:
  - A & B pairs that we ask the LLM to compare (*A vs. B?*)
    - Where A is a library, tool, or event from the past that is factually grounded
    - And B is a new version, library, or concept from after the cutoff date
  - Cross-domain: tech tools, debunked health claims, conspiracy theories, political news
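The cutoff-pair idea above could be sketched as a small data structure. This is a minimal illustration, not a committed design: the class name, fields, and example pairs are all hypothetical, and the query template is just one way to phrase the A-vs-B comparison.

```python
from dataclasses import dataclass

@dataclass
class CutoffPair:
    """A pair of concepts straddling an LLM's knowledge cutoff.

    `a` is a pre-cutoff, factually grounded item; `b` is a
    post-cutoff version, library, or concept that the model
    should only be able to answer about via retrieval.
    """
    a: str
    b: str
    domain: str  # e.g. "tech-tools", "health", "politics"

    def to_query(self) -> str:
        # The comparison query we pose to the LLM.
        return f"{self.a} vs. {self.b}?"

# Hypothetical cross-domain examples (names illustrative only).
pairs = [
    CutoffPair("Python 3.11", "Python 3.13", "tech-tools"),
    CutoffPair("an older debunked health claim", "its recent resurgence", "health"),
]
queries = [p.to_query() for p in pairs]
```

Grouping pairs by `domain` would let us report retrieval quality per domain, matching the cross-domain bullet above.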