An OpenEnv Benchmark for Reliable LLM Tool-Use Under Adversarial API Conditions.
| Task | Name | Challenge | Difficulty |
|---|---|---|---|
| T1 | Single Page | Fetch one page, submit correctly | Easy |
| T2 | Multi-Page | Paginate until has_more=false | Easy |
| T3 | Deduplication | Overlapping pages; dedup by primary key | Medium |
| T4 | Rate Limit (429) | Retry on HTTP 429 without data loss | Medium |
| T5 | Server Error (500) | Retry transient 500 failures | Medium |
| T6 | Page Drift | Non-deterministic page order | Hard |
| T7 | Totals Trap | Drop summary rows (is_total=true) | Hard |
| T8 | Mixed Faults | 429 rate-limit AND cross-page duplicates simultaneously | Hard |
| T9 | Adaptive Adversary | Faults escalate mid-episode based on agent progress | Novel |
| T10 | Constrained Budget | Single agent under halved budget, avoid redundant fetches | Novel |
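Several of the challenges above (pagination until `has_more=false`, deduplication by primary key, retrying 429/500 responses, and dropping `is_total=true` rows) can be combined in one client-side loop. The following is a minimal sketch under stated assumptions, not the benchmark's reference agent: `fetch_page`, `TransientError`, the `rows` field, and the `id` key are hypothetical names chosen for illustration. It does not address page drift (T6) or budget limits (T10).

```python
import time


class TransientError(Exception):
    """Hypothetical: raised by fetch_page on an HTTP 429 or 500 response."""


def collect_rows(fetch_page, key="id", max_retries=3, backoff=0.0):
    """Paginate, retry transient faults, dedup by key, and drop totals rows.

    fetch_page(page) is an assumed callable returning a dict shaped like
    {"rows": [...], "has_more": bool}.
    """
    seen = {}  # primary key -> first row seen with that key
    page = 1
    while True:
        # Retry the same page so a transient fault never skips data.
        for attempt in range(max_retries + 1):
            try:
                body = fetch_page(page)
                break
            except TransientError:
                if attempt == max_retries:
                    raise
                time.sleep(backoff * (2 ** attempt))  # exponential backoff

        for row in body["rows"]:
            if row.get("is_total"):
                continue  # summary rows are not data records
            seen.setdefault(row[key], row)  # keep first copy of duplicates

        if not body["has_more"]:
            return list(seen.values())
        page += 1
```

Retrying the same page number, rather than advancing after a failure, is what keeps a 429 or 500 from turning into silent data loss.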
Rule-based baseline: 96.8 / 100 across T1-T10.
Moonshot V1-8K (Kimi): 94.4 / 100 on the published T1-T8 run; this run does not cover T9 (adaptive fault escalation) or T10 (halved request budget).
Final answers are not enough. ComtradeBench rewards agents that recover from 429/500 faults, deduplicate correctly, filter totals rows, stay within budget, and leave an auditable run log.
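One way to picture a reward that looks beyond the final answer is a weighted sum of per-behavior sub-scores scaled to 0-100. The sketch below is illustrative only: the six dimension names and their weights are placeholders invented for this example, not the benchmark's actual rubric.

```python
# Hypothetical dimensions and integer weights (sum to 100), for illustration only.
WEIGHTS = {
    "correctness": 30,
    "completeness": 20,
    "robustness": 20,
    "efficiency": 10,
    "data_quality": 10,
    "observability": 10,
}


def weighted_score(subscores, weights=WEIGHTS):
    """Combine sub-scores in [0, 1] into a single 0-100 score."""
    if set(subscores) != set(weights):
        raise ValueError("subscores must cover exactly the weighted dimensions")
    total_weight = sum(weights.values())
    # Clamp each sub-score so one out-of-range value cannot skew the total.
    raw = sum(weights[k] * min(max(subscores[k], 0.0), 1.0) for k in weights)
    return round(100.0 * raw / total_weight, 1)
```

Under these assumed weights, an agent that submits a perfect answer but leaves no run log would lose the full `observability` share, which is the kind of penalty the paragraph above describes.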
Built on OpenEnv, the open RL environment framework by Meta.
Environment: MCPEnvironment with FastMCP tools
Mock Service: FastAPI with seeded RNG data generation
Fault Injection: Configurable 429/500 errors per task
Scoring: 6-dimension weighted judge (0–100)
Training: GRPO with parallel rollouts
Concurrency: Episode-isolated, thread-safe
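Seeded generation and configurable fault injection fit together naturally: if each episode owns its own random generator, the same seed reproduces the same sequence of 429/500 responses. The following sketch shows that idea in isolation; `fault_schedule` and its probability parameters are hypothetical and are not taken from the mock service's code.

```python
import random


def fault_schedule(seed, n_requests, p_429=0.2, p_500=0.1):
    """Return one HTTP status per request, deterministic for a given seed.

    p_429 and p_500 are assumed per-request fault probabilities.
    """
    # A private Random instance avoids sharing global RNG state across
    # episodes or threads.
    rng = random.Random(seed)
    statuses = []
    for _ in range(n_requests):
        roll = rng.random()  # uniform in [0.0, 1.0)
        if roll < p_429:
            statuses.append(429)
        elif roll < p_429 + p_500:
            statuses.append(500)
        else:
            statuses.append(200)
    return statuses
```

Because the schedule depends only on the seed and the probabilities, two agents evaluated with the same configuration face the same faults, which keeps scores comparable.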