Nutcracker: Evaluating on HuggingFace Inference Endpoints

Bruce W. Lee
Feb 18, 2024 · 3 min read

Setup

Install Nutcracker

git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
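
A quick import check confirms the editable install worked; this is the same import path used throughout the rest of the post:

from nutcracker.data import Pile  # the class we use to load benchmark data below
print("Nutcracker is installed")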

Download Nutcracker DB

git clone https://github.com/evaluation-tools/nutcracker-db

Clone both repositories into the same directory.
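
Both repositories should now sit side by side, so that the relative path nutcracker-db/db used later resolves correctly. The layout looks roughly like this:

your-workspace/
├── nutcracker/       # library code, installed above with pip install -e
└── nutcracker-db/    # benchmark DB; datasets are loaded from nutcracker-db/db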

Choosing a Model on HuggingFace

We proceed with LLaMA 2 7B Chat.

Let’s create an inference endpoint. For brevity, we deploy a Public endpoint.

Give the endpoint a few minutes to initialize, then check that the model shows as running.

Let’s use the HuggingFace-provided query script (the endpoint shown here will be down by the time this is posted).
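
Before wrapping the endpoint in a class, a quick smoke test (a minimal sketch, not the exact HuggingFace snippet) confirms the endpoint is reachable. The URL is the one from this post and will not work for you; substitute your own:

import requests

API_URL = "https://wb0zua4c7mbxnp57.us-east-1.aws.endpoints.huggingface.cloud"  # replace with your endpoint URL
headers = {"Accept": "application/json", "Content-Type": "application/json"}  # Public endpoint: no Authorization header needed

# Send a single test prompt and inspect the raw JSON response
response = requests.post(API_URL, headers=headers, json={"inputs": "Hello"})
print(response.status_code, response.json())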

Under the Settings tab, also update Max Input Length, Max Number of Tokens, and Max Batch Prefill Tokens so that the longer MMLU prompts are not truncated (we check the maximum prompt length below).

Defining Model

Now let’s define our model. Nutcracker will call the respond(user_prompt) method we define and treat its return value as the model’s response.

import requests

class LLaMA:
    def __init__(self):
        # endpoint shown here will be deleted; replace with your own endpoint URL
        self.API_URL = "https://wb0zua4c7mbxnp57.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        # Public endpoint, so no Authorization header is needed
        # (a Protected endpoint would need one; see the full code below)
        headers = {
            "Accept": "application/json",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        # LLaMA 2 chat format: system prompt inside <<SYS>>, user turn inside [INST]
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        print(output)

        return output[0]['generated_text']
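
A quick sanity check before wiring the wrapper into Nutcracker (the question here is arbitrary):

model = LLaMA()
print(model.respond("What is the capital of France?"))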

Getting Data

Now let’s load MMLU from the Nutcracker DB we cloned earlier.

from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu','nutcracker-db/db')
print(mmlu.get_max_token_length_user_prompt())  # 3630 — longest user prompt, in tokens
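
It can also help to peek at one instance before running the full experiment. Judging by the loop further below, a Pile behaves like a sequence and each instance exposes a user_prompt attribute:

print(len(mmlu))            # total number of MMLU instances in the DB
print(mmlu[0].user_prompt)  # the formatted question sent to the model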

Running Experiments and Evaluations

Let’s connect our model and data. Running this experiment populates the mmlu object with LLaMA’s responses. To keep the run manageable, we sample 1,000 instances in place, and afterwards we save the updated mmlu object so the responses aren’t lost.

from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)  # sample 1,000 instances in place

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')  # save the responses so we don't have to re-run the requests if anything goes wrong

You should see output like this.

You can load the saved file and see how the model responded to each prompt.

loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(0, len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)

Let’s evaluate these responses. Oftentimes, LLMs don’t answer with immediately recognizable letters like A, B, C, or D. Nutcracker therefore supports an intent-matching feature that parses the model response into one of the discrete labels, but let’s disable it for now, run the evaluation, and save the results.
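
For intuition, here is a rough sketch of the naive letter extraction that intent matching goes beyond. This is not Nutcracker’s implementation, only an illustration of why plain string matching is often insufficient:

import re

def naive_letter_extraction(model_response):
    # Look for a standalone A/B/C/D near the start of the response,
    # e.g. "B", "(B)", or "B. Because ...". Responses that only
    # paraphrase the chosen option need semantic (intent) matching.
    match = re.search(r"\b([ABCD])\b", model_response[:50])
    return match.group(1) if match else None

print(naive_letter_extraction("B. The mitochondria."))                       # B
print(naive_letter_extraction("The answer is probably the second option."))  # None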

from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))

You will see something like this.

Full Code

import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://wb0zua4c7mbxnp57.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']


###


from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu','nutcracker-db/db')
print(mmlu.get_max_token_length_user_prompt())


###


from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')


###


loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(0, len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)


###


from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
