ComtradeBench

Task	Name	Challenge	Difficulty
T1	Single Page	Fetch one page, submit correctly	Easy
T2	Multi-Page	Paginate until `has_more=false`	Easy
T3	Deduplication	Overlapping pages; dedup by primary key	Medium
T4	Rate Limit (429)	Retry on HTTP 429 without data loss	Medium
T5	Server Error (500)	Retry transient 500 failures	Medium
T6	Page Drift	Non-deterministic page order	Hard
T7	Totals Trap	Drop summary rows (`is_total=true`)	Hard
T8	Mixed Faults	429 rate-limit AND cross-page duplicates simultaneously	Hard
T9	Adaptive Adversary	Faults escalate mid-episode based on agent progress	Novel
T10	Constrained Budget	Single agent under halved budget, avoid redundant fetches	Novel

🔧 MCP Tools

get_task_info()
Returns the task description, query parameters, and remaining request budget.

fetch_page(page, page_size) → {rows, page, total_pages, has_more}
Fetches one page; may return 429/500 faults.

submit_results(data, meta, log) → {reward, score, breakdown}
Submits deduplicated records for scoring.
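The three tools suggest a simple agent loop: paginate until `has_more` is false, retry transient faults, drop totals rows, and deduplicate before submitting. The following is a minimal sketch of that loop, not the benchmark's reference agent. The `fetch_page` callable, the `TransientFault` exception, the `id` primary-key field, and 1-based page numbering are all assumptions made for illustration.

```python
import time


class TransientFault(Exception):
    """Hypothetical stand-in for a 429/500 response from fetch_page."""


def fetch_with_retry(fetch_page, page, page_size, max_retries=3, backoff_s=0.0):
    """Call fetch_page, retrying transient faults with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(page, page_size)
        except TransientFault:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))


def collect_records(fetch_page, page_size=100, key="id", max_pages=1000):
    """Paginate until has_more is false; drop totals rows and duplicates."""
    seen = {}  # primary key -> first row seen with that key
    log = []   # one entry per successfully fetched page
    page = 1   # assumed 1-based numbering
    while page <= max_pages:
        resp = fetch_with_retry(fetch_page, page, page_size)
        for row in resp["rows"]:
            if row.get("is_total"):
                continue  # summary row, not a real record
            seen.setdefault(row[key], row)
        log.append({"page": page, "rows": len(resp["rows"])})
        if not resp["has_more"]:
            break
        page += 1
    return list(seen.values()), log
```

The returned records and log are the kind of values an agent would then pass to `submit_results`. Keeping the first row per key is one possible dedup policy; the tasks may call for a different one.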

📊 Scoring (0–100)

Dimension	Points
Correctness	30
Completeness	15
Robustness	15
Efficiency	15
Data Quality	15
Observability	10
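The six weights sum to 100. One way to picture how they could combine into a single score is sketched below. It assumes each dimension is judged as a fraction in [0, 1] and scaled by its weight; how the actual judge awards points within a dimension is not described here, so `total_score` and its input format are hypothetical.

```python
# Maximum points per dimension, as listed in the scoring table.
WEIGHTS = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(fractions):
    """Combine per-dimension fractions (assumed to lie in [0, 1]) into a 0-100 score.

    Missing dimensions count as 0; out-of-range values are clamped.
    Returns (score, breakdown), where breakdown maps dimension -> points earned.
    """
    breakdown = {}
    for name, weight in WEIGHTS.items():
        frac = min(max(fractions.get(name, 0.0), 0.0), 1.0)
        breakdown[name] = weight * frac
    return sum(breakdown.values()), breakdown
```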

📈 Results Snapshot

Rule-based baseline: 96.8 / 100 across T1-T10.

Kimi V1-128k and Claude Sonnet 4.6: 97.5 each, with identical per-task scores.
GPT-5: 93.2 average, with T9 = 75.7. It behaves as a reasoning-oriented agent, taking 2 steps in 223 s versus Kimi's 7 steps in 8 s.
Llama 3.3 70B: bimodal on T9 (18.7 to 97.5 across seeds).
T9 is the task that separates execution-oriented from reasoning-oriented frontier models.

GRPO training envelope, mapped empirically: Qwen2.5-1.5B lacks the capacity for the task (reward oscillates between 0.22 and 0.94 over 50 iterations with no net trend). Qwen2.5-7B + LoRA saturates at initialization (mean reward 0.97, so reward_std ≈ 0 and there is no gradient signal). Useful training band: ~3B parameters.
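The saturation effect follows from how GRPO is commonly described: each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below uses that common formulation, not necessarily the exact normalization in this project's training code; the function name and `eps` value are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Saturated group: every rollout earns the same reward, so every advantage
# is zero and the policy gradient carries no signal.
saturated = group_relative_advantages([0.97, 0.97, 0.97, 0.97])

# Mixed group: below-mean rollouts get negative advantages, above-mean positive.
mixed = group_relative_advantages([0.22, 0.94])
```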

🧭 Why It Matters

Final answers are not enough. ComtradeBench rewards agents that recover from 429/500 faults, deduplicate correctly, filter totals rows, stay within budget, and leave an auditable run log.

⚡ API Endpoints

Method	Path	Description
POST	/reset	Start new episode
POST	/step	Execute MCP action
GET	/state	Current env state
GET	/schema	Action/Obs schemas
GET	/health	Health check
GET	/docs	Swagger UI
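A client could drive these endpoints with nothing more than the standard library. The sketch below is illustrative only: the base URL is a placeholder, the empty JSON body for `/reset` is an assumption, and the body for `/step` is left to the caller because its shape is defined by the schemas served at `/schema`, not by anything shown here.

```python
import json
import urllib.request


def build_request(base_url, method, path, payload=None):
    """Build (but do not send) a JSON request for one of the endpoints."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(req, timeout=30):
    """Send a request and decode a JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder base URL; substitute the address where the environment runs.
BASE_URL = "http://localhost:8000"
reset_req = build_request(BASE_URL, "POST", "/reset", payload={})
health_req = build_request(BASE_URL, "GET", "/health")
```

Passing `reset_req` or `health_req` to `call` would perform the request. `call` assumes a JSON response, so it does not apply to `/docs`, which serves the Swagger UI page.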

🏗️ Architecture

Built on OpenEnv, the open RL environment framework by Meta.

Environment: MCPEnvironment with FastMCP tools
Mock Service: FastAPI with seeded RNG data generation
Fault Injection: Configurable 429/500 errors per task
Scoring: 6-dimension weighted judge (0–100)
Training: GRPO with parallel rollouts
Concurrency: Episode-isolated, thread-safe
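To make the mock-service and fault-injection components concrete, here is a small sketch of a seeded page service with configurable 429/500 rates. It is not the project's FastAPI implementation. The class name, the row fields, the `status` key, and the use of two separate RNG streams are all hypothetical choices for illustration.

```python
import random


class MockPageService:
    """Seeded in-memory page source with configurable fault injection."""

    def __init__(self, seed, total_rows=250, fault_rate_429=0.0, fault_rate_500=0.0):
        data_rng = random.Random(seed)
        # Separate stream so fault timing does not perturb the generated data.
        self._fault_rng = random.Random(seed + 1)
        self.fault_rate_429 = fault_rate_429
        self.fault_rate_500 = fault_rate_500
        self.rows = [
            {"id": i, "value": data_rng.randint(0, 10_000)}
            for i in range(total_rows)
        ]

    def fetch_page(self, page, page_size):
        """Return one 1-based page, or a fault marker with status 429/500."""
        roll = self._fault_rng.random()
        if roll < self.fault_rate_429:
            return {"status": 429}
        if roll < self.fault_rate_429 + self.fault_rate_500:
            return {"status": 500}
        start = (page - 1) * page_size
        total_pages = -(-len(self.rows) // page_size)  # ceiling division
        return {
            "status": 200,
            "rows": self.rows[start:start + page_size],
            "page": page,
            "total_pages": total_pages,
            "has_more": page < total_pages,
        }
```

Seeding both generators means two services built with the same seed produce the same rows and the same fault sequence, which is what makes episodes reproducible.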

Benchmark Tasks