Nutcracker: Evaluating on HuggingFace Inference Endpoints
Setup
Install Nutcracker
git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
Download Nutcracker DB
git clone https://github.com/evaluation-tools/nutcracker-db
Clone it alongside the nutcracker repository, in the same parent directory; the data-loading step below refers to the database by the relative path nutcracker-db/db.
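After both clones, your working directory should look roughly like this (a sketch; run the Python snippets below from this parent directory so the relative path resolves):
nutcracker/       # the library, installed with pip install -e
nutcracker-db/    # the benchmark data, loaded later via nutcracker-db/db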
Choosing a Model on HuggingFace
We proceed with LLaMA 2 7B Chat.

Let’s create an Inference Endpoint. For brevity, we deployed a Public endpoint, which can be queried without an authorization token.

Let’s give the endpoint a few minutes to initialize.

Check that the model is running on your endpoint.

Let’s use the query script that HuggingFace provides on the endpoint page (the endpoint shown here will already be taken down by the time this is posted).

Under the Settings tab, also increase Max Input Length, Max Number of Tokens, and Max Batch Prefill Tokens so that the longest MMLU prompts fit.

Defining the Model
Now let’s define our model: a small class that wraps the endpoint behind a respond() method.
import requests


class LLaMA:
    def __init__(self):
        # the endpoint shown here will be deleted; substitute your own endpoint URL
        self.API_URL = "https://wb0zua4c7mbxnp57.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        # POST the payload to the endpoint and return the parsed JSON response
        headers = {
            "Accept": "application/json",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        # wrap the user prompt in the LLaMA 2 chat format and query the endpoint
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        print(output)  # inspect the raw endpoint response
        return output[0]['generated_text']  # return just the generated text, as in the Full Code section below
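Before wiring this into Nutcracker, it's worth a quick sanity check that the endpoint answers. A minimal sketch (the endpoint above will be down, so substitute your own URL; the question is just an example):
llama = LLaMA()
print(llama.respond("What is the capital of France?"))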
Getting Data
Now let’s load MMLU
from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
print(mmlu.get_max_token_length_user_prompt())  # 3630: the longest user prompt in tokens, so make sure the endpoint's Max Input Length covers it
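To peek at what was loaded, the Pile can be indexed and measured like a list (the same access pattern we use below when printing responses):
print(len(mmlu))            # number of MMLU instances
print(mmlu[0].user_prompt)  # the first question exactly as it will be sent to the model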
Running Experiments and Evaluations
Let’s connect our model and data. We will sample 1,000 instances, run the experiment (which updates the mmlu object in place with LLaMA’s responses), and then save the updated object to disk.
from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)  # sample 1000 instances in place
experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')  # save the updated object so we don't have to re-query the endpoint if anything goes wrong later

You can load the saved file and see how the model responded to each prompt.
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
Let’s evaluate these responses. Oftentimes, LLMs don’t answer with immediately recognizable letters like A, B, C, or D, so Nutcracker supports an intent-matching feature that parses the model’s response and maps it to one of the discrete labels. Let’s disable that for now, run the evaluation, and save the results.
from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
You will see an accuracy report printed to the console and saved to accuracy_report.txt.
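If you would like to try the intent-matching parser instead, the same call with the flag flipped should work. This is only a sketch based on the parameter above; depending on your Nutcracker version, intent matching may need extra configuration (for example an LLM API key), so check the project README:
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=False)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report_intent_matching.txt'))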

Full Code
import requests


class LLaMA:
    def __init__(self):
        self.API_URL = "https://wb0zua4c7mbxnp57.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']


###
from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
print(mmlu.get_max_token_length_user_prompt())

###
from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)
experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')

###
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)

###
from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))