An OpenEnv Benchmark for Reliable LLM Tool-Use Under Adversarial API Conditions.
| Task | Name | Challenge | Difficulty |
|---|---|---|---|
| T1 | Single Page | Fetch one page, submit correctly | Easy |
| T2 | Multi-Page | Paginate until has_more=false | Easy |
| T3 | Deduplication | Overlapping pages; dedup by primary key | Medium |
| T4 | Rate Limit (429) | Retry on HTTP 429 without data loss | Medium |
| T5 | Server Error (500) | Retry transient 500 failures | Medium |
| T6 | Page Drift | Non-deterministic page order | Hard |
| T7 | Totals Trap | Drop summary rows (is_total=true) | Hard |
| T8 | Mixed Faults | 429 rate-limit AND cross-page duplicates simultaneously | Hard |
| T9 | Adaptive Adversary | Faults escalate mid-episode based on agent progress | Novel |
| T10 | Constrained Budget | Single agent under halved budget, avoid redundant fetches | Novel |
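The client-side behaviors that T2 through T8 exercise (paginating until `has_more` is false, retrying transient faults, deduplicating by primary key, and dropping `is_total` rows) can be sketched in one loop. This is a minimal illustration, not the benchmark's reference agent: `fetch_page`, `TransientError`, the `id` key, and the fake two-page service are all hypothetical stand-ins for the real tool interface.

```python
import time
from typing import Callable, Dict, List, Tuple


class TransientError(Exception):
    """Hypothetical stand-in for an HTTP 429 or 500 response."""


def collect_rows(
    fetch_page: Callable[[int], Tuple[List[dict], bool]],
    key: str = "id",          # assumed primary-key field name
    max_retries: int = 3,
    backoff_s: float = 0.0,   # base delay; doubled on each retry
) -> List[dict]:
    seen: Dict[object, dict] = {}
    page = 1
    while True:
        # Retry the same page on transient faults so no data is skipped.
        for attempt in range(max_retries + 1):
            try:
                rows, has_more = fetch_page(page)
                break
            except TransientError:
                if attempt == max_retries:
                    raise
                time.sleep(backoff_s * (2 ** attempt))
        for row in rows:
            if row.get("is_total"):
                continue                      # drop summary rows (T7)
            seen.setdefault(row[key], row)    # keep first copy only (T3)
        if not has_more:
            return list(seen.values())
        page += 1


def make_fake_fetch() -> Callable[[int], Tuple[List[dict], bool]]:
    """Two overlapping pages; page 2 fails once before succeeding."""
    pages = {
        1: ([{"id": 1}, {"id": 2}], True),
        2: ([{"id": 2}, {"id": 3}, {"id": "sum", "is_total": True}], False),
    }
    already_failed = set()

    def fetch(page: int) -> Tuple[List[dict], bool]:
        if page == 2 and page not in already_failed:
            already_failed.add(page)
            raise TransientError("429")
        return pages[page]

    return fetch


rows = collect_rows(make_fake_fetch())
```

Keying the result on the primary key rather than on page position also keeps the loop insensitive to the order in which rows arrive, which is the property T6 probes.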
Rule-based baseline: 96.8 / 100 across T1-T10.
Kimi V1-128k and Claude Sonnet 4.6 each score 97.5, with identical per-task scores. GPT-5 averages 93.2 and scores 75.7 on T9; it behaves as a reasoning-oriented model, taking 2 steps in 223 s versus Kimi's 7 steps in 8 s. Llama 3.3 70B is bimodal on T9, ranging from 18.7 to 97.5 across seeds. T9 therefore separates execution-oriented from reasoning-oriented frontier models.
The GRPO training envelope was mapped empirically. Qwen2.5-1.5B lacks the capacity for the task: its reward oscillates between 0.22 and 0.94 over 50 iterations with no net trend. Qwen2.5-7B with LoRA saturates at initialization: the mean reward is 0.97, so reward_std ≈ 0 and there is no gradient signal. The useful training band is around 3B parameters.
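The link between reward_std ≈ 0 and a vanishing gradient follows from how GRPO typically forms advantages: each rollout's reward is normalized against the mean and standard deviation of its own group. The sketch below assumes that standard formulation; the function name, the `eps` constant, and the example reward values are illustrative, and the project's actual training code may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A saturated group: every rollout earns the same reward, so every
# advantage is zero and the policy-gradient term contributes nothing.
saturated = group_advantages([0.97, 0.97, 0.97, 0.97])

# A group with spread: advantages are non-zero and sum to about zero.
mixed = group_advantages([0.22, 0.94, 0.50, 0.70])
```

With identical rewards, no rollout is preferred over any other, which is why a model that already scores near the ceiling at initialization gives the optimizer nothing to learn from.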
Final answers are not enough. ComtradeBench rewards agents that recover from 429/500 faults, deduplicate correctly, filter totals rows, stay within budget, and leave an auditable run log.
Built on OpenEnv — the open RL environment framework by Meta.
Environment: MCPEnvironment with FastMCP tools
Mock Service: FastAPI with seeded RNG data generation
Fault Injection: Configurable 429/500 errors per task
Scoring: 6-dimension weighted judge (0–100)
Training: GRPO with parallel rollouts
Concurrency: Episode-isolated, thread-safe
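To illustrate how seeded, per-task fault injection can stay reproducible, here is a small sketch in plain Python. It is an assumption about the general approach rather than the Space's actual implementation: the `FaultInjector` class, its `p_429` and `p_500` parameters, and `status_trace` are hypothetical names, and the real mock service is a FastAPI app whose internals are not shown here.

```python
import random
from typing import List


class FaultInjector:
    """Decides a status code per request from a private, seeded RNG."""

    def __init__(self, seed: int, p_429: float = 0.0, p_500: float = 0.0):
        # A dedicated Random instance keeps episodes isolated from each
        # other and from the global random state.
        self._rng = random.Random(seed)
        self.p_429 = p_429
        self.p_500 = p_500

    def next_status(self) -> int:
        roll = self._rng.random()
        if roll < self.p_429:
            return 429
        if roll < self.p_429 + self.p_500:
            return 500
        return 200


def status_trace(seed: int, n: int, **probs: float) -> List[int]:
    """Status codes for the first n requests of an episode."""
    injector = FaultInjector(seed, **probs)
    return [injector.next_status() for _ in range(n)]


# The same seed and configuration reproduce the same fault sequence.
trace_a = status_trace(7, 20, p_429=0.3, p_500=0.2)
trace_b = status_trace(7, 20, p_429=0.3, p_500=0.2)
```

Giving each episode its own RNG instance is one simple way to keep concurrent rollouts from perturbing each other's fault sequences.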