Benchmark Your Own Self-Improvement Method

GRASP is one point in a space of methods that improve an agent from its own experience. The harness lets you drop in your own method and run it on the same tasks GRASP uses, so comparisons are apples-to-apples.


The contract

A method is a subclass of grasp.Method. The harness constructs it with a resolved config, a run directory to write into, and the Task to learn on, then calls run() once.

from grasp import Method
from grasp.agent import build_agent

class MyMethod(Method):
    # self.config: dict   self.run_dir: Path   self.task: Task
    def run(self) -> None:
        dev = self.task.samples("dev")
        val = self.task.samples("val")
        agent = build_agent(self.config["agent"])

        for epoch in range(self.config["cycle"]["epochs"]):
            for sample in dev:
                rollout = self.task.rollout(sample, agent)
                correct = self.task.evaluate(sample, rollout)
                # ... update your memory / skills / prompt from failures ...

            # monitor on val (do not learn from it)
            score = sum(self.task.evaluate(s, self.task.rollout(s, agent))
                        for s in val) / len(val)
            # ... write artifacts into self.run_dir ...

If your method injects learned context, wrap the agent the way GRASP does in grasp/cycle.py (see SkillAwareAgent).
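
As a rough illustration of the wrapping pattern, the sketch below prepends learned notes to each prompt before delegating to the underlying agent. It is not the SkillAwareAgent implementation: the class name ContextInjectingAgent, the add_note method, and the assumption that an agent is a callable from a prompt string to a response string are all hypothetical. Check grasp/cycle.py for the real agent interface.

```python
from typing import Callable, Sequence


class ContextInjectingAgent:
    """Wrap an agent so learned notes are prepended to every prompt.

    Hypothetical: assumes the agent is a callable mapping a prompt string
    to a response string, which may not match the real grasp agent API.
    """

    def __init__(self, base_agent: Callable[[str], str], notes: Sequence[str] = ()):
        self.base_agent = base_agent
        self.notes = list(notes)

    def add_note(self, note: str) -> None:
        # Record a lesson, e.g. after a failed dev sample.
        self.notes.append(note)

    def __call__(self, prompt: str) -> str:
        # With nothing learned yet, behave exactly like the base agent.
        if not self.notes:
            return self.base_agent(prompt)
        context = "\n".join(f"- {n}" for n in self.notes)
        return self.base_agent(f"Learned notes:\n{context}\n\n{prompt}")
```

Keeping the learned state in the wrapper rather than in the base agent means the same base agent can be reused unchanged across methods.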

Conventional outputs

Not enforced, but writing these makes your runs comparable to GRASP’s with the same tooling:

  • val_scores.json — a list of {epoch, score, ...} (the learning curve)
  • per-epoch logs of what the method did
  • the learned artifact (skill/memory library) under run_dir/
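
A minimal sketch of writing the learning curve, assuming val_scores.json is a plain JSON list of objects with at least epoch and score keys, as described above. The helper name append_val_score is hypothetical and not part of grasp.

```python
import json
from pathlib import Path


def append_val_score(run_dir, epoch: int, score: float, **extra) -> list:
    """Append one {epoch, score, ...} record to run_dir/val_scores.json.

    Hypothetical helper, not part of grasp. Returns the full list of
    records after the append.
    """
    path = Path(run_dir) / "val_scores.json"
    # Start a new curve if the file does not exist yet.
    records = json.loads(path.read_text()) if path.exists() else []
    records.append({"epoch": epoch, "score": score, **extra})
    path.write_text(json.dumps(records, indent=2))
    return records
```

Inside run(), this could be called once per epoch as append_val_score(self.run_dir, epoch, score), right after the validation score is computed.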

Running it

from grasp import run_method

run_method(MyMethod, MyTask(), "path/to/config.yaml", agent="local")

run_method loads the config, resolves the backend (in precedence order: CLI agent > GRASP_BACKEND env > config agent_preset), creates the run directory, and calls MyMethod(config, run_dir, task).run().
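
The backend precedence can be pictured with the sketch below. It is an illustration of the stated ordering only, not the code run_method uses: the function name resolve_backend is hypothetical, and treating an empty string as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_backend(
    cli_agent: Optional[str],
    config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[object]:
    """Pick a backend name: CLI agent > GRASP_BACKEND env > config agent_preset.

    Hypothetical sketch; empty values are treated as unset (an assumption).
    """
    env = os.environ if env is None else env
    if cli_agent:
        return cli_agent
    if env.get("GRASP_BACKEND"):
        return env["GRASP_BACKEND"]
    return config.get("agent_preset")
```

Passing env explicitly makes the ordering easy to check without touching the real environment.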

Backend setup

The agent argument (or agent_preset in the config) is a name resolved against a YAML file in <config dir>/agents/. The quickstart ships a local preset at examples/quickstart/configs/agents/local.yaml that works with any OpenAI-compatible endpoint. Configure it via environment variables:

export OPENAI_BASE_URL="http://localhost:8000/v1"   # your endpoint
export OPENAI_API_KEY="sk-..."                        # or "EMPTY" for local
export GRASP_MODEL="your-model-name"

Copy that file into your own config directory and adjust as needed. A Gemini preset using Vertex AI is at examples/quickstart/configs/agents/gemini.yaml.
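
To see where a copied preset needs to live, the lookup described above can be sketched as follows. This assumes that <config dir> is the directory containing the config file passed to run_method and that presets are named <name>.yaml, as with local.yaml and gemini.yaml. The function name preset_path and the config.yaml file name in the usage note are hypothetical.

```python
from pathlib import Path


def preset_path(config_file, name: str) -> Path:
    """Return the expected location of an agent preset.

    Hypothetical sketch: assumes presets sit in an agents/ directory next
    to the config file and use the .yaml extension.
    """
    return Path(config_file).parent / "agents" / f"{name}.yaml"
```

Under these assumptions, preset_path("examples/quickstart/configs/config.yaml", "local") points at examples/quickstart/configs/agents/local.yaml.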


Worked references: the five baselines

The paper implements five self-improvement baselines alongside GRASP, one per benchmark directory. They predate the Method base class but follow the same __init__(config, run_dir, …) + run() shape and are the best concrete templates to read and diff against.

Entry point          Paper name          Idea
grasp                GRASP (ours)        Regression-gated skill library
memory_cycle         Sequential memory   Append lessons after each sample
batch_memory_cycle   Batch memory        Summarize a batch into memory
expel_cycle          ExpeL               Insight extraction from successes and failures
evo_memory_cycle     Evo-MedAgent        Evolutionary memory updates
skillx_cycle         SkillX              Skill extraction baseline

Each *_cycle.py entry point lives in benchmarks/MedAgentBench/src/ alongside its supporting package. To benchmark a new method, implement Method.run() and run it on the same Task and config.