Full Stack Benchmarking for Knowledge Work

Abstract

In less than a year, AI agents have evolved from a research curiosity into the foundation of some of the largest software platform updates in decades. These systems promise to automate substantial portions of knowledge work, and their progress has been rapid, with early 2025 reports by METR suggesting that the length of tasks agents can complete doubles roughly every seven months. In this talk, we take a closer empirical look at this claim by examining what it truly takes to benchmark agentic performance on long-running, open-ended knowledge work tasks. We review recent contributions from ServiceNow Research and others across domains such as browser use, multimodal understanding, data analytics, and deep research. We also discuss benchmarks that evaluate agentic safety and security, arguing that these dimensions cannot be meaningfully separated from primary task performance. Our analysis leads to a more nuanced picture of the field, highlighting both genuine advances and persistent challenges that frontier agents have yet to overcome.

Date
Jul 29, 2025
Location
IVADO Workshop on Assessing and Improving the Capabilities and Safety of Agents, Montreal, Canada
Alexandre Drouin
Head of Frontier AI Research

My research interests include machine learning, causal inference, and computational biology.