Add Your Own Task or Benchmark
To run GRASP (or any Method) on a new environment, implement grasp.Task. It has three required methods and a few optional hooks that improve proposal quality.
The contract
from grasp import Task, Rollout
class MyTask(Task):
name = "my-task"
def samples(self, split): # "dev" | "val" | "test"
# return a list of dicts; the only required key is a stable "id".
...
def rollout(self, sample, agent): # run ONE episode
# drive agent.inference(history, tools=None) against your environment,
# then return a Rollout capturing the transcript and outcome.
...
return Rollout(history=history, agent_actions=actions,
answer=final_answer, status="completed", raw=...)
def evaluate(self, sample, rollout): # -> bool
...
Rollout fields
The learning loop reads these fields — populate what applies:
-
history— the chat transcript ({"role", "content"}; use role"agent"for the agent’s turns) -
agent_actions— the agent’s actions/tool calls as readable strings (used to describe failures to the skill writer) -
answer— the final answer text, if the task has one -
status—"completed", or a short label such as"agent invalid action"/"task limit reached"/"error"that the gate treats specially -
raw— the native result, for yourevaluate
Optional hooks
failure_tags(sample, rollout) -> list[str]Mechanism tags for failures. Grouping failures by tag before proposing skills sharpens proposals considerably — this is the single highest-value hook to implement.
protocol_hook(first_user_content) -> str | NoneInject an environment-specific tool reminder alongside the skill library in the agent's context.
updater_task_family / updater_guidance / updater_failure_examplesDomain labels and guidance threaded into the skill-writer prompt, helping it produce environment-appropriate skills.
The smallest complete example is examples/quickstart/task.py (rollout loop and graders) backed by an in-process mock environment. Copy it as a starting point.
Config a task needs
A run config supplies the backend and loop hyperparameters; the task provides the data. Minimum:
agent_preset: local # resolved from <config dir>/agents/local.yaml
skills:
base_dir: path/to/skills/base # the read-only skeleton template lives here
cycle:
epochs: 3
update_every: 100
grpo_k: 3
grpo_eval_n: 8
agent_preset: local resolves to <config dir>/agents/local.yaml. The quickstart ships a ready-made preset at examples/quickstart/configs/agents/local.yaml that works with any OpenAI-compatible endpoint — set three env vars to point it at your model:
export OPENAI_BASE_URL="http://localhost:8000/v1" # your endpoint
export OPENAI_API_KEY="sk-..." # or "EMPTY" for local
export GRASP_MODEL="your-model-name"
Copy that file into your config directory and adjust as needed. A Gemini preset using Vertex AI is at examples/quickstart/configs/agents/gemini.yaml.
Wiring an AgentBench environment
benchmarks/AgentBench ships ten task families; four are wired for the paper (os, dbbench, webshop, alfworld). The other six — avalon, card_game, kg, ltp, mind2web, task_assembly — have task servers and configs but no GRASP wiring yet. Two ways to reach them:
Option 1: In-process Task (recommended)
Write a Task whose rollout drives the environment directly and whose evaluate calls that environment’s grader, exactly what the quickstart does for FHIR. You provide samples from the task’s data split.
Option 2: Wrap a running AgentBench worker
AgentBench runs tasks behind a client/server worker (TaskClient.run_sample(index, agent)). A thin Task can start or connect to that worker and translate its TaskClientOutput into a Rollout (map history, the final answer, and status). This reuses the benchmark’s existing environment and grader unchanged.
Either way, splits come from Task.samples(), so you control train/val/test without touching the core. Define failure_tags for the environment’s characteristic failures and set updater_task_family to get sharper proposals.