Benchmark Your Own Self-Improvement Method
GRASP is one point in a space of methods that improve an agent from its own experience. The harness lets you drop in your own method and run it on the same tasks GRASP uses, so comparisons are apples-to-apples.
The contract
A method is a subclass of grasp.Method. The harness constructs it with a resolved config, a run directory to write into, and the Task to learn on, then calls run() once.
```python
from grasp import Method
from grasp.agent import build_agent


class MyMethod(Method):
    # self.config: dict    self.run_dir: Path    self.task: Task

    def run(self) -> None:
        dev = self.task.samples("dev")
        val = self.task.samples("val")
        agent = build_agent(self.config["agent"])

        for epoch in range(self.config["cycle"]["epochs"]):
            for sample in dev:
                rollout = self.task.rollout(sample, agent)
                correct = self.task.evaluate(sample, rollout)
                # ... update your memory / skills / prompt from failures ...

            # monitor on val (do not learn from it)
            score = sum(self.task.evaluate(s, self.task.rollout(s, agent))
                        for s in val) / len(val)
            # ... write artifacts into self.run_dir ...
```
If your method injects learned context, wrap the agent the way GRASP does in grasp/cycle.py (see SkillAwareAgent).
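To make the wrapping idea concrete, here is a minimal sketch of one way a wrapper could inject learned context. It is not SkillAwareAgent: the class name ContextInjectingAgent is hypothetical, and the sketch assumes the wrapped agent is a plain callable from prompt string to response string, which may not match the real agent interface in grasp/cycle.py.

```python
from typing import Callable, List


class ContextInjectingAgent:
    """Hypothetical wrapper: prepends learned notes to every prompt.

    Assumes the wrapped agent is a callable str -> str; the real agent
    interface and SkillAwareAgent in grasp/cycle.py may look different.
    """

    def __init__(self, base_agent: Callable[[str], str], notes: List[str]):
        self.base_agent = base_agent
        self.notes = notes  # shared list, so later updates are picked up

    def __call__(self, prompt: str) -> str:
        if not self.notes:
            # Nothing learned yet: behave exactly like the base agent.
            return self.base_agent(prompt)
        header = "\n".join(f"- {note}" for note in self.notes)
        return self.base_agent(f"Learned notes:\n{header}\n\n{prompt}")
```

Because the wrapper holds a reference to the notes list rather than a copy, a method can append to that list between samples and the next call sees the update without rebuilding the agent.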
Conventional outputs
Not enforced, but writing these makes your runs comparable to GRASP’s with the same tooling:
- val_scores.json: a list of {epoch, score, ...} (the learning curve)
- per-epoch logs of what the method did
- the learned artifact (skill/memory library) under run_dir/
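As a minimal sketch of the first convention, the helper below appends one record per epoch to val_scores.json inside the run directory. The name record_val_score and the rewrite-the-whole-file strategy are assumptions made for illustration; the harness neither provides nor requires this helper.

```python
import json
from pathlib import Path


def record_val_score(run_dir: Path, epoch: int, score: float, **extra) -> list:
    """Append one {epoch, score, ...} record to run_dir/val_scores.json.

    Hypothetical helper: the harness does not provide or require it.
    """
    path = Path(run_dir) / "val_scores.json"
    # Load the existing curve if a previous epoch already wrote one.
    curve = json.loads(path.read_text()) if path.exists() else []
    curve.append({"epoch": epoch, "score": score, **extra})
    # Rewrite the whole file so it holds a complete JSON list after each epoch.
    path.write_text(json.dumps(curve, indent=2))
    return curve
```

Inside run(), this could be called once per epoch as record_val_score(self.run_dir, epoch, score), with any extra fields passed as keyword arguments.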
Running it
```python
from grasp import run_method

run_method(MyMethod, MyTask(), "path/to/config.yaml", agent="local")
```
run_method loads the config, resolves the backend (CLI agent > GRASP_BACKEND env > config agent_preset), creates the run directory, and calls MyMethod(config, run_dir, task).run().
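The precedence order described above can be sketched as a small function. This is an illustration only, not the code inside run_method: the name resolve_backend, the treatment of empty strings as unset, and the error raised when nothing is configured are all assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_backend(cli_agent: Optional[str], config: Mapping[str, object],
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the agent preset name: agent argument > GRASP_BACKEND > config."""
    env = os.environ if env is None else env
    # Candidates listed from highest to lowest priority.
    candidates = (cli_agent, env.get("GRASP_BACKEND"), config.get("agent_preset"))
    for name in candidates:
        if name:  # skip None and empty strings (an assumption of this sketch)
            return str(name)
    raise ValueError("no agent preset given via argument, GRASP_BACKEND, or config")
```

Under these assumptions, resolve_backend(None, {"agent_preset": "gemini"}, {}) falls through to the config and returns "gemini", while passing agent="local" wins regardless of the other two sources.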
Backend setup
The agent argument (or agent_preset in the config) is a name resolved against a YAML file in <config dir>/agents/. The quickstart ships a local preset at examples/quickstart/configs/agents/local.yaml that works with any OpenAI-compatible endpoint — configure it via env vars:
```bash
export OPENAI_BASE_URL="http://localhost:8000/v1"  # your endpoint
export OPENAI_API_KEY="sk-..."                     # or "EMPTY" for local
export GRASP_MODEL="your-model-name"
```
Copy that file into your own config directory and adjust as needed. A Gemini preset using Vertex AI is at examples/quickstart/configs/agents/gemini.yaml.
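Before launching a long run, it can help to confirm that the environment is set up. The check below is a hypothetical preflight helper, assuming the local preset reads exactly the three variables shown above; the preset file itself is the authority on what it actually needs.

```python
import os
from typing import Mapping, Optional

# Variables named in the local preset instructions above.
REQUIRED_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "GRASP_MODEL")


def missing_backend_vars(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_backend_vars() with no arguments inspects the current process environment; an empty list means all three variables are present and non-empty.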
Worked references: the five baselines
The paper implements five self-improvement baselines alongside GRASP, one per benchmark directory. They predate the Method base class but follow the same __init__(config, run_dir, …) + run() shape and are the best concrete templates to read and diff against.
| Entry point | Paper name | Idea |
|---|---|---|
| grasp | GRASP (ours) | Regression-gated skill library |
| memory_cycle | Sequential memory | Append lessons after each sample |
| batch_memory_cycle | Batch memory | Summarize a batch into memory |
| expel_cycle | ExpeL | Insight extraction from successes and failures |
| evo_memory_cycle | Evo-MedAgent | Evolutionary memory updates |
| skillx_cycle | SkillX | Skill extraction baseline |
Each *_cycle.py entry point lives in benchmarks/MedAgentBench/src/ alongside its supporting package. To benchmark a new method, implement Method.run() and run it on the same Task and config.