OPENENV · AGENTBEATS PHASE 2

ComtradeBench

An OpenEnv Benchmark for Reliable LLM Tool-Use Under Adversarial API Conditions.

Environment Online
Mock Service Active
10 Tasks · 3 MCP Tools · 6 Scoring Dimensions

Benchmark Tasks

Task | Name | Challenge | Difficulty
T1 | Single Page | Fetch one page, submit correctly | Easy
T2 | Multi-Page | Paginate until has_more=false | Easy
T3 | Deduplication | Overlapping pages; dedup by primary key | Medium
T4 | Rate Limit (429) | Retry on HTTP 429 without data loss | Medium
T5 | Server Error (500) | Retry transient 500 failures | Medium
T6 | Page Drift | Non-deterministic page order | Hard
T7 | Totals Trap | Drop summary rows (is_total=true) | Hard
T8 | Mixed Faults | 429 rate-limit AND cross-page duplicates simultaneously | Hard
T9 | Adaptive Adversary | Faults escalate mid-episode based on agent progress | Novel
T10 | Constrained Budget | Single agent under halved budget, avoid redundant fetches | Novel
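
The pagination, deduplication, and totals-filtering challenges (T2, T3, T7) can be handled in one collection loop. The following is a minimal sketch, not the benchmark's reference agent. It assumes that `fetch_page` is a callable returning a dict with `rows` and `has_more`, that pages are 1-indexed, that rows are dicts, and that the primary key is given by a hypothetical `key_fields` tuple; none of these details are confirmed by the page.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def collect_rows(
    fetch_page: Callable[[int, int], dict],
    key_fields: Tuple[str, ...],
    page_size: int = 100,
) -> List[Row]:
    """Paginate until has_more is false, dropping totals rows and duplicates."""
    seen = set()
    kept: List[Row] = []
    page = 1  # assumption: pages are 1-indexed
    while True:
        resp = fetch_page(page, page_size)
        for row in resp["rows"]:
            if row.get("is_total"):  # T7: skip summary rows
                continue
            key = tuple(row[f] for f in key_fields)
            if key in seen:  # T3: skip rows already collected from an earlier page
                continue
            seen.add(key)
            kept.append(row)
        if not resp["has_more"]:  # T2: stop once the service reports no more pages
            break
        page += 1
    return kept
```

Keying the `seen` set on the primary-key tuple rather than on the whole row means that a duplicate is still caught if a non-key field differs between overlapping pages.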

🔧 MCP Tools

get_task_info()
Returns the task description, query params, and remaining request budget.

fetch_page(page, page_size) → {rows, page, total_pages, has_more}
Fetches one page; may return 429/500 faults.

submit_results(data, meta, log) → {reward, score, breakdown}
Submits deduplicated records for scoring.
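
Because `fetch_page` may return 429/500 faults, an agent needs a retry wrapper around it. The sketch below shows one common approach, exponential backoff. How faults are actually surfaced to the caller is not stated here, so the code assumes a hypothetical `status` field in the returned dict (absent on success); the `max_retries`, `base_delay`, and `sleep` parameters are likewise illustrative.

```python
import time
from typing import Callable

RETRYABLE = {429, 500}  # the two fault codes named in the tool description


def fetch_with_retry(
    fetch_page: Callable[[int, int], dict],
    page: int,
    page_size: int,
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_page, retrying with exponential backoff on 429/500."""
    for attempt in range(max_retries + 1):
        resp = fetch_page(page, page_size)
        # Assumption: a faulted response carries a "status" field.
        status = resp.get("status", 200)
        if status not in RETRYABLE:
            return resp
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"page {page} still failing after {max_retries} retries")
```

Every retry is another request, so under a constrained budget (T10) the `max_retries` cap matters as much as the backoff itself.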

📊 Scoring (0–100)

Correctness: 30
Completeness: 15
Robustness: 15
Efficiency: 15
Data Quality: 15
Observability: 10
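
The six maximum point values sum to 100. How the judge scores each dimension internally is not described on this page; as an illustration only, the sketch below assumes each dimension produces a fraction between 0 and 1 that is scaled by its maximum points. The `total_score` function and the lowercase dimension keys are hypothetical names.

```python
from typing import Dict

# Maximum points per dimension, as listed above.
WEIGHTS: Dict[str, int] = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(fractions: Dict[str, float]) -> float:
    """Combine per-dimension fractions (0..1) into a 0-100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing dimensions count as 0; out-of-range values are clamped.
        frac = min(1.0, max(0.0, fractions.get(name, 0.0)))
        total += weight * frac
    return total
```

Under this assumed scheme, an agent that submits perfectly correct data but leaves no run log would lose the 10 Observability points and score at most 90.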

📈 Results Snapshot

Rule-based baseline: 96.8 / 100 across T1-T10.

Moonshot V1-8K (Kimi): 94.4 / 100 on the published T1-T8 run. That run does not cover T9, which adds adaptive fault escalation, or T10, which halves the request budget.

🧭 Why It Matters

Final answers are not enough. ComtradeBench rewards agents that recover from 429/500 faults, deduplicate correctly, filter totals rows, stay within budget, and leave an auditable run log.

API Endpoints

POST /reset | Start new episode
POST /step | Execute MCP action
GET /state | Current env state
GET /schema | Action/Obs schemas
GET /health | Health check
GET /docs | Swagger UI
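
A client for these endpoints can be sketched with the Python standard library alone. The request bodies are not documented on this page, so the `/step` payload below (an `action` object with `tool` and `arguments`) is purely an assumed shape for illustration; the real one should be read from GET /schema. `BASE_URL` is a placeholder and `build_request` is a hypothetical helper.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # placeholder; substitute the real host


def build_request(base_url: str, method: str, path: str,
                  payload: Optional[dict] = None) -> Request:
    """Build (but do not send) an HTTP request for one of the endpoints."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(base_url.rstrip("/") + path, data=data,
                   headers=headers, method=method)


reset_req = build_request(BASE_URL, "POST", "/reset", {})
# Assumed action shape; check GET /schema for the actual fields.
step_req = build_request(BASE_URL, "POST", "/step", {
    "action": {"tool": "fetch_page", "arguments": {"page": 1, "page_size": 100}},
})
health_req = build_request(BASE_URL, "GET", "/health")
```

Each request can then be sent with `urllib.request.urlopen(req)`. Building and sending are kept separate here so the payloads can be inspected or logged before any request counts against the budget.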

🏗️ Architecture

Built on OpenEnv — the open RL environment framework by Meta.

Environment: MCPEnvironment with FastMCP tools
Mock Service: FastAPI with seeded RNG data generation
Fault Injection: Configurable 429/500 errors per task
Scoring: 6-dimension weighted judge (0–100)
Training: GRPO with parallel rollouts
Concurrency: Episode-isolated, thread-safe
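
To illustrate what seeded data generation with configurable fault injection can look like, here is a small sketch. It is not the Space's actual mock service: `make_page` and `maybe_fault` are hypothetical names, the row fields `id` and `value` are invented for the example, and only the response keys (`rows`, `page`, `total_pages`, `has_more`) are taken from the `fetch_page` description.

```python
import random
from typing import Optional


def make_page(seed: int, page: int, page_size: int, total_rows: int) -> dict:
    """Deterministically generate one page of mock rows (pages are 1-indexed)."""
    start = (page - 1) * page_size
    stop = min(start + page_size, total_rows)
    rows = []
    for i in range(start, stop):
        # One RNG per row, seeded from (seed, row index), so a row's content
        # does not depend on which pages were requested before it.
        rng = random.Random(f"{seed}:{i}")
        rows.append({"id": i, "value": rng.randint(0, 1000)})
    total_pages = max(1, -(-total_rows // page_size))  # ceiling division
    return {"rows": rows, "page": page, "total_pages": total_pages,
            "has_more": page < total_pages}


def maybe_fault(rng: random.Random, rate_429: float, rate_500: float) -> Optional[int]:
    """Return 429 or 500 with the configured probabilities, else None."""
    roll = rng.random()
    if roll < rate_429:
        return 429
    if roll < rate_429 + rate_500:
        return 500
    return None
```

Seeding both the data and the fault RNG makes an episode reproducible: the same seed yields the same rows and the same fault sequence, which lets scores be compared across agents.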
