Harbor

Harbor is a framework from the creators of Terminal-Bench for evaluating and optimizing agents and language models. It ships with first-class support for Islo, so you can run benchmarks like Terminal-Bench-2.0, SWE-Bench, and Aider Polyglot in parallel across Islo environments — each one isolated, network-controlled, and reproducible.

$250 in free Islo credits for every Harbor user. Sign up at app.islo.dev and apply the promo code HARBOR250 in Billing.

Prerequisites

  • Python 3.10+ and uv or pip.
  • An Islo account — sign up at app.islo.dev.

1. Generate an Islo API key

Pick whichever flow fits your setup.

From the dashboard

  1. Open app.islo.dev/api-keys.
  2. Click Create API key and (optionally) set an expiry.
  3. Copy the key — it’s only shown once.

From the CLI

$ islo api-key create my-key              # copied to clipboard by default
$ islo api-key create my-key --show       # also print to stdout
$ islo api-key create my-key --expires 90 # set an expiry

See the authentication docs for the full flag set.

2. Install Harbor with the Islo extra

The islo extra pulls in the Islo Python SDK and the Dockerfile parser Harbor uses to build task images inside the sandbox.

$ uv tool install 'harbor[islo]'
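
If you use pip instead of uv, the same extra applies; install into a virtual environment:

$ pip install 'harbor[islo]'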

3. Configure credentials

Harbor’s Islo environment reads your API key from the environment — it doesn’t load .env files automatically.

$ export ISLO_API_KEY="islo_key_..."    # from step 1
$ export ANTHROPIC_API_KEY="sk-ant-..." # whichever model provider your agent uses
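
As a quick sanity check (plain shell, no Islo calls), confirm the key is visible to the shell you’ll launch Harbor from:

$ printenv ISLO_API_KEY | cut -c1-9 # should print islo_key_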

4. Run your first eval on Islo

Pass --env islo (or -e islo) to route every trial through an Islo environment. Each trial gets its own sandbox, so --n-concurrent scales horizontally without contending for local CPU.

Hello-world (one task, ~1 minute)

Before kicking off a full benchmark, run Harbor’s built-in hello-world dataset to confirm your sandbox and credentials are wired up. It’s a single trivial task (write hello.txt containing “Hello, world!”) and finishes in about a minute:

$ harbor run \
> --dataset hello-world \
> --agent claude-code \
> --model anthropic/claude-opus-4-7 \
> --env islo

The full benchmark

Harbor is the official harness for Terminal-Bench-2.0:

$ harbor run \
> --dataset terminal-bench@2.0 \
> --agent claude-code \
> --model anthropic/claude-opus-4-7 \
> --env islo \
> -n 50 # run 50 trials concurrently

Harbor’s Islo environment supports three task layouts out of the box:

Task definition                       What Harbor does
docker_image set in the task config   Boots a sandbox directly from that image
environment/Dockerfile present        Builds the image inside the sandbox via Docker-in-VM
Neither                               Falls back to the bare islo-runner image
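
For the second layout, a task directory might look like this (the config filename here is illustrative; Harbor only keys off the environment/Dockerfile path):

my-task/
├── task.yaml            # task config (filename illustrative)
└── environment/
    └── Dockerfile       # built inside the sandbox via Docker-in-VM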

5. Anti-cheating: lock down agent egress

A common failure mode in benchmark evals is the agent finding the answer online instead of solving the task: by Googling, browsing GitHub Issues, or fetching the upstream test fixtures. Islo’s gateway lets you allowlist exactly the hosts each trial can reach, so the model is forced to actually do the work.

Harbor wires this through the IsloEnvironment kwargs. You can either reference an existing named profile or define rules inline.

Option A: inline rules (simplest)

Harbor creates an ephemeral gateway profile per run and tears it down after. Drop this into your Harbor config (or pass via the CLI’s --env-kwargs):

environment:
  type: islo
  kwargs:
    gateway:
      default_action: allow
      rules:
        # Checked first: deny known answer sources at the host level
        - host_pattern: "*.github.com"
          action: deny
          priority: 1

        # Checked second: scan response
        # bodies and drop any that leak the task solution
        - host_pattern: "*"
          action: deny
          priority: 10
          content_filter:
            filter_type: regex
            pattern: "(?i)(terminal-bench|swe-bench).*solution"
            direction: response

Rules are evaluated in priority order (ascending, first match wins), so the github rule at priority 1 fires before the wildcard at priority 10. Two layers: the first denies entire hosts before a request is even sent; the second lets traffic through but inspects response bodies and drops any that match the regex. content_filter also supports filter_type: content_type (block by MIME type) and filter_type: size_limit (cap response size); see Gateways for the full schema.
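
As a sketch, a MIME-based rule could look like the following, assuming the content_type filter reuses the same field names as the regex rule above (the pattern field carrying a MIME type is an assumption; the Gateways schema is authoritative):

rules:
  - host_pattern: "*"
    action: deny
    priority: 20
    content_filter:
      filter_type: content_type   # block by MIME type
      pattern: "application/zip"  # field name assumed from the regex example
      direction: response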

Option B: reference a saved profile

If you reuse the same egress policy across runs, save it once via the dashboard or the CLI. From the CLI, creation looks like this (attach your rules via the flags covered in the Gateways docs):
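
$ islo gateway profile create terminal-bench-egress

Then reference the profile by name: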

environment:
  type: islo
  kwargs:
    gateway_profile: terminal-bench-egress

gateway_profile and gateway are mutually exclusive — Harbor will raise if both are set.
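
If you skip the config file entirely, the same profile can ride along on the --env-kwargs flag mentioned in Option A; the key=value syntax here is an assumption, so check harbor run --help for the exact form:

$ harbor run \
> --dataset terminal-bench@2.0 \
> --agent claude-code \
> --model anthropic/claude-opus-4-7 \
> --env islo \
> --env-kwargs gateway_profile=terminal-bench-egress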

6. Tune sandbox resources (optional)

CPU, memory, and disk for each trial come from the task’s environment config (cpus, memory_mb, storage_mb). Override them on the task or in your Harbor config if a benchmark needs more headroom — Docker-in-VM builds in particular benefit from 4+ vCPUs.
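
As a sketch, an override in the Harbor config might sit alongside the other Islo kwargs shown above; the field names come from the task schema, but their placement under kwargs is an assumption:

environment:
  type: islo
  kwargs:
    cpus: 4            # Docker-in-VM builds benefit from 4+ vCPUs
    memory_mb: 8192    # placement under kwargs is an assumption
    storage_mb: 20480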

Next steps

  • Browse the Harbor Cookbook for end-to-end eval examples.
  • Read the Harbor docs for dataset, agent, and reward-kit reference.
  • Wire results into your own dashboards with the Islo SDK.