Full Stack Benchmarking for Knowledge Work

Abstract

In less than a year, AI agents have evolved from a research curiosity into the foundation of some of the largest software platform updates in decades. These systems promise to automate substantial portions of knowledge work, and their progress has been rapid, with early 2025 reports by METR suggesting that the length of tasks agents can complete doubles roughly every seven months. In this talk, we take a closer empirical look at this claim by examining what it truly takes to benchmark agentic performance on long-running, open-ended knowledge work tasks. We review recent contributions from ServiceNow Research and others across domains such as browser use, multimodal understanding, data analytics, and deep research. We also discuss benchmarks that evaluate agentic safety and security, arguing that these dimensions cannot be meaningfully separated from primary task performance. Our analysis leads to a more nuanced picture of the field, highlighting both genuine advances and persistent challenges that frontier agents have yet to overcome.

Date
Jul 29, 2025
Location
IVADO Workshop on Assessing and Improving the Capabilities and Safety of Agents, Montreal, Canada
Alexandre Drouin
Head of Frontier AI Research

My research interests include machine learning, causal inference, and computational biology.