Core Results: The experimental results demonstrate that PokeeResearch consistently outperforms all other 7B-scale baseline models across all ten benchmarks.
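Figure: Bar chart comparing 7B-scale deep research models on the HLE, GAIA, and BrowseComp benchmarks. PokeeResearch performs best on each benchmark, with its highest score on GAIA.
Figure: Chart comparing the 7B-scale deep research models from the paper on seven QA benchmarks. Each model is shown in a different color; PokeeResearch performs best overall, with the highest scores.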
As shown in Figure 1 and Figure 2, PokeeResearch (the dark orange bar) achieves the highest score on every benchmark. The improvement is largest on the more difficult benchmarks such as GAIA, where it scores 37.6, which is 13.57 points ahead of the next-best baseline (DeepResearcher at 24.03).
The following tables, transcribed from the paper's Table 1, provide the detailed numerical results.
Performance on HLE, GAIA, and BrowseComp:

| Method | HLE | GAIA | BrowseComp |
| --- | --- | --- | --- |
| R1searcher | 5.4 | 8.3 | 1.0 |
| SearchR1 | 13.0 | 18.7 | 0.4 |
| ZeroSearch | 8.6 | 9.9 | 1.4 |
| ASearcher | 13.8 | 22.1 | 3.2 |
| DeepResearcher | 6.0 | 24.03 | 1.8 |
| PokeeResearch | 15.0 | 37.6 | 6.0 |

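
The gap between PokeeResearch and the strongest baseline on each of these three benchmarks can be read off the table, but a short script makes the arithmetic explicit. The following is a minimal sketch using the scores transcribed above; the names `BENCHMARKS`, `SCORES`, and `gap_to_best_baseline` are illustrative and do not come from the paper.

```python
# Scores transcribed from the table above: method -> (HLE, GAIA, BrowseComp).
BENCHMARKS = ("HLE", "GAIA", "BrowseComp")
SCORES = {
    "R1searcher": (5.4, 8.3, 1.0),
    "SearchR1": (13.0, 18.7, 0.4),
    "ZeroSearch": (8.6, 9.9, 1.4),
    "ASearcher": (13.8, 22.1, 3.2),
    "DeepResearcher": (6.0, 24.03, 1.8),
    "PokeeResearch": (15.0, 37.6, 6.0),
}


def gap_to_best_baseline(scores, target="PokeeResearch"):
    """Return {benchmark: (best_baseline_name, target_score - baseline_score)}."""
    gaps = {}
    for i, bench in enumerate(BENCHMARKS):
        # Collect every method's score on this benchmark except the target's.
        baselines = {m: s[i] for m, s in scores.items() if m != target}
        best = max(baselines, key=baselines.get)
        # Round to two decimals to hide floating-point noise.
        gaps[bench] = (best, round(scores[target][i] - baselines[best], 2))
    return gaps


if __name__ == "__main__":
    for bench, (best, gap) in gap_to_best_baseline(SCORES).items():
        print(f"{bench}: +{gap} points over {best}")
```

On this table the largest absolute gap is on GAIA, consistent with the figure discussion, while the HLE lead over ASearcher is much smaller.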
Performance on QA Benchmarks:

| Method | BAMB | 2WIKI | TQ | NQ | POPQA | MUSIQUE | HOTPOTQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R1searcher | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| PokeeResearch | 78.2 | 73.4 | 89.8 | 76.0 | 63.2 | 36.6 | 71.4 |
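
The claim that PokeeResearch leads on every QA benchmark can be checked directly against the transcribed numbers, and doing so shows how wide each lead is. The sketch below assumes the table values above are transcribed correctly; `QA_BENCHMARKS`, `QA_SCORES`, and `lead_margins` are hypothetical names introduced only for this illustration.

```python
# Scores transcribed from the QA table above, in column order.
QA_BENCHMARKS = ("BAMB", "2WIKI", "TQ", "NQ", "POPQA", "MUSIQUE", "HOTPOTQA")
QA_SCORES = {
    "R1searcher": (63.2, 61.4, 77.2, 59.6, 51.8, 35.8, 62.4),
    "SearchR1": (67.8, 62.8, 81.0, 67.6, 59.6, 33.2, 63.2),
    "ZeroSearch": (51.4, 33.6, 61.6, 48.2, 38.0, 19.0, 32.4),
    "ASearcher": (68.8, 69.2, 85.2, 71.2, 58.2, 35.8, 71.0),
    "DeepResearcher": (71.0, 58.8, 82.2, 60.2, 55.2, 26.8, 56.6),
    "PokeeResearch": (78.2, 73.4, 89.8, 76.0, 63.2, 36.6, 71.4),
}


def lead_margins(scores, target="PokeeResearch"):
    """Return {benchmark: target score minus the best baseline score}."""
    margins = {}
    for i, bench in enumerate(QA_BENCHMARKS):
        best_baseline = max(s[i] for m, s in scores.items() if m != target)
        # One decimal matches the table's precision and hides float noise.
        margins[bench] = round(scores[target][i] - best_baseline, 1)
    return margins


margins = lead_margins(QA_SCORES)
leads_everywhere = all(m > 0 for m in margins.values())
narrowest = min(margins, key=margins.get)
widest = max(margins, key=margins.get)

if __name__ == "__main__":
    print(f"Leads on all QA benchmarks: {leads_everywhere}")
    print(f"Narrowest lead: {narrowest} (+{margins[narrowest]})")
    print(f"Widest lead: {widest} (+{margins[widest]})")
```

On these numbers the lead holds on all seven QA benchmarks, but it ranges from 7.2 points on BAMB down to under one point on MUSIQUE and HOTPOTQA, so the size of the advantage varies considerably by dataset.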