In less than a year, AI agents have evolved from a research curiosity into the foundation of some of the largest software platform updates in decades. These systems promise to automate substantial portions of knowledge work, and their progress has been rapid, with early 2025 reports by METR suggesting that the length of tasks agents can reliably complete doubles roughly every seven months. In this talk, we take a closer empirical look at this claim by examining what it truly takes to benchmark agentic performance on long-running, open-ended knowledge work tasks. We review recent contributions from ServiceNow Research and others across domains such as browser use, multimodal understanding, data analytics, and deep research. We also discuss benchmarks that evaluate agentic safety and security, arguing that these dimensions cannot be meaningfully separated from primary task performance. Our analysis leads to a more nuanced picture of the field, highlighting both genuine advances and persistent challenges that frontier agents have yet to overcome.