
TTBench: LLM Benchmark for Test-Time-Compute

This is a repository for TTBench, the Test-Time Compute Benchmark.

This benchmark features chain-of-thought (CoT) responses queried from multiple LLMs on a variety of mathematical and reasoning datasets. The few-shot query process and answer extraction are standardised across every dataset, saving researchers both time and API costs.

Installation

Please install this benchmark from source:

pip install .

It requires an api_responses.zip file (download from Google Drive) containing the response database. For the following example, assume this file is in your code directory.

Example

from ttbench import load, DatasetType, LLMType

dataset, [llm1, llm2] = load(DatasetType.SVAMP, [LLMType.LLaMA3B32, LLMType.Qwen72B25], api_path="api_responses.zip")

for question_id, dataentry in dataset:
    print("Question: ", dataentry.question)
    print("True answer: ", dataentry.answer)
    llm1_response = llm1(question_id, N=20)
    print("Cost: $", llm1_response.cost)
    print("1st CoT answer: ", llm1_response.cots[0].answer)
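With N sampled CoTs per question, a common test-time-compute baseline is self-consistency: aggregate the N extracted answers by majority vote. A minimal, self-contained sketch (the `majority_vote` helper is hypothetical and not part of TTBench; in practice its input would be something like `[cot.answer for cot in llm1_response.cots]`):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Aggregate N sampled CoT answers by majority vote (self-consistency).

    Hypothetical helper: returns the most frequent answer string.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]
```

Because the benchmark caches responses, sweeping N to trade accuracy against cost requires no new API calls.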

Refer to the examples folder for more examples of benchmark evaluation.

Cost modelling

We also provide a procedure to model the dollar cost of each query. This enables fair comparison between test-time-compute methods.

from ttbench import load, DatasetType, LLMType

dataset, [llm] = load(DatasetType.CommonsenseQA, [LLMType.Mixtral8x7B], api_path="api_responses.zip")

question_id = 42
response = llm(question_id, N=2)

print(f"Request processing cost: ${response.request.cost:0.9f}")
print(f"First CoT response cost: ${response.cots[0].metadata.cost:0.9f}")
print(f"Total LLM query cost: ${response.cost:0.9f}")
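The exact pricing data TTBench uses is not documented in this README; conceptually, such a cost model multiplies token counts by the provider's per-token prices. A sketch under that assumption (the function name and pricing parameters are illustrative, not the benchmark's API):

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Dollar cost of one LLM query under simple per-token pricing.

    Hypothetical sketch: prices are given in dollars per 1M tokens,
    with separate input (prompt) and output (completion) rates.
    """
    return (prompt_tokens * price_in_per_1m
            + completion_tokens * price_out_per_1m) / 1_000_000
```

Summing this quantity over the request and every sampled CoT would yield a total like `response.cost` above.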