| MISC (Embodied and Scientific Simulation, and Travel Planning) |
| ALFWorld (Shridhar et al., 2021) | Embodied instruction-following tasks in a simulated household, combining textual descriptions with high-level symbolic actions. | 3,553 | 21,031 |
| ScienceWorld (Wang et al., 2022) | An interactive science lab simulator rendered in natural language, where agents perform multi-step experiments. | 1,000 | 14,506 |
| TravelPlanner (Xie et al., 2024a) | Long-horizon travel planning tasks that require generating and refining multi-day itineraries using various tools. | 45 | 1,395 |
| Multi-Turn Tool-Use |
| BFCLv3 (Patil et al., 2025) | Multi-turn tool-use tasks from the Berkeley Function Call Leaderboard v3, where agents interact with a Python-based API environment. | 125 | 1,264 |
| Tau-Bench (Yao et al., 2025) | Realistic customer-service scenarios requiring agents to interact with LM-simulated users and perform multi-turn tool use. | 452 | 5,239 |
| SearchQA (Jin et al., 2025) | Multi-hop question answering where agents issue search queries and reason over retrieved snippets to answer complex questions. | 2,082 | 7,691 |
| Web Navigation |
| WebShop (Yao et al., 2022) | Shopping tasks in a simulated e-commerce site, where agents must navigate, filter, and select the correct product. | 1,571 | 15,464 |
| WebArena-Lite (Zhou et al., 2024) | Web navigation tasks across domains like e-commerce, forums, and content management. | 554 | 7,044 |