Benchmark Your Own Self-Improvement Method

GRASP is one point in a space of methods that improve an agent from its own experience. The harness lets you drop in your own method and run it on the same tasks GRASP uses, so comparisons are apples-to-apples.

The contract

A method is a subclass of grasp.Method. The harness constructs it with a resolved config, a run directory to write into, and the Task to learn on, then calls run() once.

from grasp import Method
from grasp.agent import build_agent

class MyMethod(Method):
    # self.config: dict   self.run_dir: Path   self.task: Task
    def run(self) -> None:
        dev = self.task.samples("dev")
        val = self.task.samples("val")
        agent = build_agent(self.config["agent"])

        for epoch in range(self.config["cycle"]["epochs"]):
            for sample in dev:
                rollout = self.task.rollout(sample, agent)
                correct = self.task.evaluate(sample, rollout)
                # ... update your memory / skills / prompt from failures ...

            # monitor on val (do not learn from it)
            score = sum(self.task.evaluate(s, self.task.rollout(s, agent))
                        for s in val) / len(val)
            # ... write artifacts into self.run_dir ...

If your method injects learned context, wrap the agent the way GRASP does in grasp/cycle.py (see SkillAwareAgent).
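
As a rough illustration of that wrapping pattern, the sketch below prepends learned notes to every prompt before delegating to the underlying agent. It does not reproduce SkillAwareAgent: the class name ContextInjectingAgent is hypothetical, and it assumes the wrapped agent is a plain callable that takes a prompt string, which may not match the real grasp agent interface.

```python
from typing import Callable, List


class ContextInjectingAgent:
    """Wrap an agent so learned notes are prepended to every prompt.

    Assumption: the wrapped agent is a callable taking a prompt string.
    The real grasp agent interface may differ.
    """

    def __init__(self, agent: Callable[[str], str], notes: List[str]):
        self.agent = agent
        # Learned context; the method can update this list between epochs.
        self.notes = notes

    def __call__(self, prompt: str) -> str:
        # With nothing learned yet, behave exactly like the wrapped agent.
        if not self.notes:
            return self.agent(prompt)
        header = "\n".join(f"- {note}" for note in self.notes)
        return self.agent(f"Lessons from earlier attempts:\n{header}\n\n{prompt}")
```

Keeping the notes in a mutable list on the wrapper means the method can append lessons during an epoch without rebuilding the agent.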

Conventional outputs

Not enforced, but writing these makes your runs comparable to GRASP’s with the same tooling:

val_scores.json — a list of {epoch, score, ...} (the learning curve)
per-epoch logs of what the method did
the learned artifact (skill/memory library) under run_dir/
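
One way to maintain the learning curve file is sketched below. The helper append_val_score is hypothetical (it is not part of grasp); it only assumes that val_scores.json holds a JSON list of {epoch, score, ...} records directly under the run directory, as described above.

```python
import json
from pathlib import Path


def append_val_score(run_dir: Path, epoch: int, score: float, **extra) -> list:
    """Append one {epoch, score, ...} record to run_dir/val_scores.json.

    Hypothetical helper, not part of grasp; the file layout follows the
    convention described in this section.
    """
    path = Path(run_dir) / "val_scores.json"
    # Load the existing curve if there is one, otherwise start a new list.
    records = json.loads(path.read_text()) if path.exists() else []
    records.append({"epoch": epoch, "score": score, **extra})
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2))
    return records
```

Inside run(), this could be called once per epoch as append_val_score(self.run_dir, epoch, score), so the file stays valid JSON even if the run is interrupted between epochs.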

Running it

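from grasp import run_method

# MyMethod is the Method subclass defined above;
# MyTask stands in for your own Task implementation.
run_method(MyMethod, MyTask(), "path/to/config.yaml", agent="local")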

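run_method loads the config, resolves the backend in order of precedence (an agent passed explicitly on the CLI or as the agent argument, then the GRASP_BACKEND environment variable, then agent_preset in the config), creates the run directory, and calls MyMethod(config, run_dir, task).run().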
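
The precedence rule can be pictured with the sketch below. resolve_backend is a hypothetical re-implementation for illustration only; run_method's actual code may differ, and treating empty strings as "not set" is an assumption.

```python
import os
from typing import Optional


def resolve_backend(agent: Optional[str], config: dict) -> str:
    """Pick a backend name by precedence: argument, env var, config.

    Hypothetical illustration; not the code run_method actually uses.
    """
    candidates = (
        agent,                            # explicit agent argument
        os.environ.get("GRASP_BACKEND"),  # environment override
        config.get("agent_preset"),       # fallback from the config file
    )
    # Return the first candidate that is set (empty strings count as unset).
    for name in candidates:
        if name:
            return name
    raise ValueError(
        "no backend: pass agent=, set GRASP_BACKEND, or add agent_preset"
    )
```

Under this reading, agent="local" wins even when GRASP_BACKEND is exported, which is convenient for one-off overrides of a shared environment.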

Backend setup

The agent argument (or agent_preset in the config) is a name resolved against a YAML file in <config dir>/agents/. The quickstart ships a local preset at examples/quickstart/configs/agents/local.yaml that works with any OpenAI-compatible endpoint — configure it via env vars:

export OPENAI_BASE_URL="http://localhost:8000/v1"   # your endpoint
export OPENAI_API_KEY="sk-..."                        # or "EMPTY" for local
export GRASP_MODEL="your-model-name"

Copy that file into your own config directory and adjust as needed. A Gemini preset using Vertex AI is at examples/quickstart/configs/agents/gemini.yaml.
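
Before launching a long run, it can help to confirm that the variables from the export lines above are set. The check below is a small sketch: missing_backend_vars is a hypothetical helper, and it assumes the local preset needs all three variables, which the preset file itself should confirm.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "GRASP_MODEL")


def missing_backend_vars(environ=None) -> list:
    """Return the names of required variables that are unset or empty.

    Hypothetical helper; pass a dict to check something other than
    the current process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_backend_vars() at the top of a launch script and exiting when the list is non-empty gives a clear error before any rollouts start.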

Worked references: the five baselines

The paper implements five self-improvement baselines alongside GRASP, one per benchmark directory. They predate the Method base class but follow the same __init__(config, run_dir, …) + run() shape and are the best concrete templates to read and diff against.

Entry point	Paper name	Idea
`grasp`	GRASP (ours)	Regression-gated skill library
`memory_cycle`	Sequential memory	Append lessons after each sample
`batch_memory_cycle`	Batch memory	Summarize a batch into memory
`expel_cycle`	ExpeL	Insight extraction from successes and failures
`evo_memory_cycle`	Evo-MedAgent	Evolutionary memory updates
`skillx_cycle`	SkillX	Skill extraction baseline

Each *_cycle.py entry point lives in benchmarks/MedAgentBench/src/ alongside its supporting package. To benchmark a new method, implement Method.run() and run it on the same Task and config.