The productivity of AI agents is measured on tasks that have little to do with the real world of work, a new study shows. While benchmarks primarily test programming skills, central activities in many professions are largely left out.
Artificial intelligence is one of the most discussed current trends in the world of work. Companies expect efficiency gains and new business models, while the public debates which activities AI can automate and which it cannot.
At the same time, AI systems are developing rapidly. This applies in particular to so-called AI agents, which are expected to plan and carry out tasks independently in the workplace. But the higher expectations climb, the more urgent the question of how realistic these abilities actually are.
A study by Carnegie Mellon University and Stanford University took a closer look at this topic. The results suggest a significant gap between expectations and reality in how AI agents are evaluated. So far, evaluation has focused largely on narrow, often programming-heavy tasks rather than on tasks drawn from actual work.
How suitable are AI agents really for the world of work?
Benchmarks serve as a measure of progress and performance for evaluating AI agents. They define which tasks the systems should solve and use standardized tests to make statements about their performance.
Developers and companies can then use the results to decide which tasks an AI system is suited for. For their study, however, the researchers from Carnegie Mellon University and Stanford University asked how representative these benchmarks actually are of the real world of work.
The researchers see the biggest problem in how benchmarks for AI agents are defined. If benchmarks cover only a narrow range of tasks, improvements may not translate into broad productivity gains or noticeable relief for the labor market.
AI agents focus on a small part of the working world
For the study, the researchers systematically examined the connection between the development of agents and the distribution of real human work. To do this, they collected 72,342 tasks from 43 agent benchmarks, standardized them and assigned them to 1,016 real jobs in the US labor market.
“We show significant discrepancies between agent development, which tends to be programming-centric, and the categories in which human labor and economic value are concentrated,” the researchers write in their study. The development of AI agents is heavily concentrated on a few work areas and skills.
Benchmarks for AI agents focus disproportionately on programming and math-intensive tasks. According to the evaluation, however, such work accounts for only 7.6 percent of total employment in the US labor market.
According to the study, work-related benchmarks are not always realistic either: some tasks superficially resemble real work but transfer only partially to actual work areas. Real jobs often require coordinating multiple skills across different areas, which many benchmarks capture only in part.