Nutcracker: Instance-Task-Pile

Feb 17, 2024

Despite the long history of model evaluation, my understanding is that the field has not reached a clear consensus on what a “benchmark” is. (Is MMLU a “benchmark”? Is the Hugging Face Open LLM Leaderboard a “benchmark”?)

To help organize and support the mixing and merging of existing benchmarks, Nutcracker proposes the instance-task-pile model.

Set up

Install Nutcracker

git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker

Download Nutcracker DB

git clone https://github.com/evaluation-tools/nutcracker-db

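Clone both repositories into the same directory, since the examples below refer to the database by the relative path nutcracker-db/db.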

Organizing Data

Instead of using the word benchmark, Nutcracker divides the data structure into Instance, Task, and Pile.

An Instance is a single test item, represented as an object whose attributes are directly accessible to the user. For example, MCQInstance covers multiple-choice question (MCQ) tasks, one of the most popular ways to construct a benchmark.

We wanted to make Nutcracker as modular as possible. Suppose we want to evaluate MMLU geography in a 2-shot setup. The question string in the original MMLU is not what we actually pass to the LLM for evaluation.

Rather, we combine the question (= centerpiece) with other properties, such as the answer candidates (= options), few-shot data (= example_data_list), and a prompt template (= config['user_prompt_template']), to create the final prompt (= user_prompt). An MCQInstance keeps these components as separate attributes.

What is actually being passed into the LLM (= user_prompt) is often not obvious.
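The composition described above can be sketched in plain Python. This is a minimal illustration, not Nutcracker's actual implementation: the SketchMCQ class, the format_item and build_user_prompt functions, the TEMPLATE string, and the toy questions are all hypothetical. Only the attribute names (centerpiece, options, correct_options, example_data_list, user_prompt) come from the library as described in this article.

```python
from dataclasses import dataclass, field
from string import ascii_uppercase


@dataclass
class SketchMCQ:
    """Hypothetical stand-in for an MCQ instance (not Nutcracker's class)."""
    centerpiece: str
    options: list[str]
    correct_options: list[str] = field(default_factory=list)


# Hypothetical template; in Nutcracker the template is read from
# config['user_prompt_template'], whose exact format is not shown here.
TEMPLATE = "Question: {centerpiece} {options} Answer: {answer}"


def format_item(item: SketchMCQ, reveal_answer: bool) -> str:
    # Label the options as "A. ...", "B. ...", and so on.
    labeled = " ".join(
        f"{letter}. {text}" for letter, text in zip(ascii_uppercase, item.options)
    )
    answer = ", ".join(item.correct_options) if reveal_answer else ""
    return TEMPLATE.format(
        centerpiece=item.centerpiece, options=labeled, answer=answer
    ).rstrip()


def build_user_prompt(target: SketchMCQ, example_data_list: list[SketchMCQ]) -> str:
    # Few-shot examples show their answers; the target question does not.
    shots = [format_item(ex, reveal_answer=True) for ex in example_data_list]
    return "\n".join(shots + [format_item(target, reveal_answer=False)])


# Toy data, made up for this sketch.
examples = [
    SketchMCQ("2 + 2 = ?", ["3", "4"], ["B"]),
    SketchMCQ("Capital of France?", ["Paris", "Rome"], ["A"]),
]
target = SketchMCQ("Largest planet?", ["Mars", "Jupiter"], ["B"])

user_prompt = build_user_prompt(target, examples)
print(user_prompt)
```

Keeping the raw question, the options, the few-shot examples, and the template separate means the same question can be re-rendered under a different shot count or template without touching the underlying data.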

A Task is the smallest unit into which a set of Instances can be grouped. A Pile is a collection of Instances within Tasks, not a collection of Tasks.
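To make the distinction concrete, here is a small sketch of the idea using stand-in classes. SketchTask and SketchPile are hypothetical and only mimic the behavior described in this article. How Nutcracker's Pile orders instances internally, and whether it supports len(), are assumptions of this sketch.

```python
class SketchTask:
    """Hypothetical stand-in for a Task: a named, ordered group of instances."""

    def __init__(self, name, instances):
        self.name = name
        self.instances = list(instances)

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, index):
        return self.instances[index]


class SketchPile:
    """Hypothetical stand-in for a Pile: instances pooled from several tasks."""

    def __init__(self, tasks):
        # Flatten: keep the instances themselves, not the task containers.
        self.instances = [inst for task in tasks for inst in task.instances]

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, index):
        return self.instances[index]


# Strings stand in for instance objects in this sketch.
task_a = SketchTask("task-a", ["a0", "a1", "a2"])
task_b = SketchTask("task-b", ["b0", "b1"])
pile = SketchPile([task_a, task_b])

print(len(pile))  # 5
print(pile[3])    # b0
```

Indexing the pile returns an instance, never a task, which is the property that lets instances from different sources be mixed and evaluated as one pool.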

Let’s play with some code for clarity.

Import

from nutcracker.data import Task, Pile
import logging
logging.basicConfig(level=logging.INFO)

Let’s play with the AI2 Reasoning Challenge and MMLU College Computer Science.

my_task_A = Task.load_from_db(
    task_name = 'arc-challenge',
    db_directory = 'nutcracker-db/db'
    )

my_task_B = Task.load_from_db(
    task_name = 'mmlu-college-computer-science',
    db_directory = 'nutcracker-db/db'
    )

This is Nutcracker’s Task object

# prints -> <nutcracker.data.task.Task object at 0x104654100>
print(my_task_A)

How long is this task?

# prints -> 1172
print(len(my_task_A))

A Task object is a collection of Instance objects

# prints -> <nutcracker.data.instance.MCQInstance object at 0x10597e890>
print(my_task_A[0])

An Instance object is full of useful information

# prints -> dict_keys(['config', 'example_data_list', 'centerpiece', 'options', 'correct_options', 'user_prompt', 'model_response', 'answers', 'response_correct'])
print(vars(my_task_A[0]).keys())

Let’s view AI2 Reasoning Challenge’s first question and answer

# prints -> Question: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?.
# Options: ['Planetary density will decrease.', 'Planetary years will become longer.', 'Planetary days will become shorter.', 'Planetary gravity will become stronger.']. 
# Answer: ['C']
print(f"Question: {my_task_A[0].centerpiece}. \nOptions: {my_task_A[0].options}. \nAnswer: {my_task_A[0].correct_options}")

But with few-shot evaluation, this is what actually goes into a model.

# prints -> User Prompt: Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation? A. Put the objects in groups. B. Change the height of the ramp. C. Choose different objects to roll. D. Record the details of the investigation. Answer: D 
#Question: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time? A. fog B. rain C. drought D. tornado Answer: C 
#<...skipped>
#Question: Amanda and Jake learned about kinetic and potential forms of energy within a simple electrical circuit. The circuit they are studying has a battery, wires, and a light bulb. Which is a form of potential energy in the circuit? A. chemical energy in the battery B. light energy from the light bulb C. heat energy lost from the electric wires D. electrical energy moving through the light bulb Answer: A 
#Question: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation? A. Planetary density will decrease. B. Planetary years will become longer. C. Planetary days will become shorter. D. Planetary gravity will become stronger. Answer: 
print(f"User Prompt: {my_task_A[0].user_prompt}.")

Two (or more) Tasks can be merged to create a Pile.

my_pile = Pile([my_task_A, my_task_B])

A Pile is a collection of Instances within Tasks, not a collection of Tasks.

# prints -> <nutcracker.data.instance.MCQInstance object at 0x1045319c0>
print(my_pile[0])

Written by Bruce W. Lee