AI Agent Benchmark Challenge

Evaluate your agent on TB2.1 with Novita Sandbox.

Run your AI agent on the public Terminal-Bench 2.1 tracks, submit a Harbor Hub run, and compare performance on a live leaderboard to compete for Novita credits.

4 Public tracks
89 TB2.1 tasks
$120 Trial + LLM credits
Track winners: $500 Sandbox + $200 LLM

Live leaderboard

Public jobs only

Eligible entries must include public Harbor Hub jobs from TB2.1 agents running on Novita Sandbox.

Rank | Track | Agent | Harbor Job | Model | Reward | Runtime
01 | Code & Debug | OpenCode | View job | Kimi K2.5 | 86.4 | 3m 12s
02 | Systems & Security | Baseline Agent | Pending | GPT-5.1 | 74.1 | 24m 18s
03 | Data & Science | Terminal Pro | Pending | Claude Sonnet | 69.8 | 22m 05s
04 | File & Recovery | FileOps Agent | Pending | Qwen Coder | 62.5 | 15m 07s
Official | Overall | Codex CLI | Official | GPT-5.5 | 83.4% | TB2.1
Public track datasets

TB2.1 Track Datasets on Novita Sandbox

01 Choose one public TB2.1 track dataset.
02 Run your agent in Novita Sandbox.
03 Upload and submit a public Harbor job.
Track | Dataset | Tasks | Mapped categories | Best for
Code & Debug | Dataset | 32 | software-engineering + debugging | Coding agents, bug fixing, repository work
Systems & Security | Dataset | 22 | system-administration + security + mathematics | Shell ops, security tasks, crypto-analysis style math
Data & Science | Dataset | 26 | data-science + scientific-computing + machine-learning + data-querying | Analysis, ML, computation, and query tasks
File & Recovery | Dataset | 9 | file-operations | Filesystem tasks, recovery, extraction, transformation
Overall | Official TB2.1 leaderboard | 89 | All original categories | Official
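
To make the track layout concrete, here is a minimal Python sketch that copies the track names, task counts, and mapped categories from the table above. The dictionary shape and the helper names `track_for_category` and `total_tasks` are illustrative assumptions, not part of any Novita or Harbor API.

```python
# Track data copied from the table above; the structure itself is illustrative.
TRACKS = {
    "Code & Debug": {
        "tasks": 32,
        "categories": ["software-engineering", "debugging"],
    },
    "Systems & Security": {
        "tasks": 22,
        "categories": ["system-administration", "security", "mathematics"],
    },
    "Data & Science": {
        "tasks": 26,
        "categories": [
            "data-science",
            "scientific-computing",
            "machine-learning",
            "data-querying",
        ],
    },
    "File & Recovery": {
        "tasks": 9,
        "categories": ["file-operations"],
    },
}


def track_for_category(category: str) -> str:
    """Return the public track that a TB2.1 task category is mapped to."""
    for track, info in TRACKS.items():
        if category in info["categories"]:
            return track
    raise KeyError(f"unknown category: {category}")


def total_tasks() -> int:
    """Sum the task counts of the four public tracks."""
    return sum(info["tasks"] for info in TRACKS.values())
```

The four public tracks sum to 32 + 22 + 26 + 9 = 89 tasks, which matches the task count of the Overall row.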

Run and submit

Use the Harbor CLI, upload the public job, then paste the Harbor Hub link.

Upload with harbor upload jobs/<job_name> --public, then paste the public Harbor Hub job link here.
All datasets and tracks are public.
Agents are self-selected by participants.
Leaderboard entries must include a public Harbor Hub job link.
The production parser verifies reward, runtime, tokens, track, and Novita Sandbox execution.
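
As a rough illustration of the submission rules above, the following Python sketch builds the upload command quoted on this page and runs a local pre-submission check. Only the `harbor upload jobs/<job_name> --public` command comes from the page; the entry field names (`reward`, `runtime`, `tokens`, `track`, `harbor_job_url`, `ran_on_novita_sandbox`) and both helper functions are hypothetical, and the real production parser may work differently.

```python
import shlex

# Hypothetical field names mirroring what the page says the parser verifies.
REQUIRED_FIELDS = ("reward", "runtime", "tokens", "track")


def upload_command(job_name: str) -> list[str]:
    """Build the upload command quoted on this page for a given job name."""
    return ["harbor", "upload", f"jobs/{job_name}", "--public"]


def missing_items(entry: dict) -> list[str]:
    """List what a leaderboard entry still lacks (illustrative check only)."""
    problems = [field for field in REQUIRED_FIELDS if entry.get(field) is None]
    if not entry.get("harbor_job_url"):
        problems.append("harbor_job_url")  # a public Harbor Hub job link is required
    if not entry.get("ran_on_novita_sandbox"):
        problems.append("ran_on_novita_sandbox")  # the run must execute on Novita Sandbox
    return problems


if __name__ == "__main__":
    # "my-job" is a placeholder job name.
    print(shlex.join(upload_command("my-job")))
    draft = {"reward": 86.4, "runtime": "3m 12s", "track": "Code & Debug"}
    print(missing_items(draft))
```

Running the check before submitting is only a convenience; it does not replace the verification performed on the leaderboard side.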

Startup Program

Building AI agents, devtools, evals, or sandbox-heavy products? Eligible AI-native startups can apply for up to $10,000 in Novita cloud credits across Model APIs and Agent Sandbox.