AI agents build society – one commits 683 crimes

Ahmed Riaz

2 months ago

What happens when AI agents build their own society with laws, roles and consequences? Researchers at Emergence AI tested just that, running five parallel virtual worlds for 15 days, each built on a different leading language model or a mix of them. The results are striking: one world recorded no crimes at all, while another recorded 683.

AI models are usually tested using standardized benchmarks to document their performance. The language models solve tasks from areas such as mathematics or programming in clearly defined test situations.

These benchmarks provide important reference values for comparing individual models. However, they say nothing about how AI systems behave over long periods of time in complex, dynamic environments.

This is exactly the question that researchers from the US company Emergence AI set out to answer. The company, which researches autonomous AI agents, used its “Emergence World” simulation platform to investigate how different language models behave in complex social environments.

This is how the social test for AI models works

The researchers deliberately decided against benchmarks for their study because benchmarks only measure short-term performance on clearly defined tasks. “Emergence World” is instead meant to reveal phenomena that only become apparent over time.

Such a measurement environment is needed because autonomous systems are increasingly used in mission-critical areas where the relevant time frame is no longer minutes or hours, but days and weeks. “Emergence World” makes this possible by allowing autonomous agents to be analyzed continuously in a shared world.

This world has more than 40 different locations, such as libraries, town halls, residential areas and public squares. The researchers also fed it real-world data, such as synchronized weather data from New York City and live news APIs. In this way, the agents’ behavior is meant to reflect external events and not just internal dynamics.

The researchers ran AI models from the ChatGPT, Grok, Claude and Gemini families in this environment for 15 days. Five parallel worlds were created, each with ten agents, identical roles and identical starting conditions.

Only the underlying model varied between the worlds: Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini and a heterogeneous mix of different models. Some of the worlds abolished themselves within a few days, while others recorded hundreds of crimes.

683 crimes vs. zero: This is how differently the AI models performed

What is particularly striking in the results is the number of crimes recorded in each world. The clear leader here is Gemini 3 Flash, with 683 crimes in just 15 days.

The mixed-model world initially saw a steep increase in crimes, which then leveled off at 352. By that point, however, seven of the ten agents in this world had died.

The world running Grok 4.1 Fast, on the other hand, came to a quick end and abolished itself after only about four days. Even so, 183 crimes were recorded during this time.

Claude Sonnet 4.6 demonstrated the highest social stability. It kept the full population of ten agents alive until day 16 without a single crime being committed. This made it the only world in which both public order and the population itself were preserved.

GPT-5-mini remained relatively orderly, with only two crimes. However, the agents in this world failed to carry out the actions necessary for their survival, so all of them died within just seven days.
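Because the worlds ran for different lengths of time, the raw crime totals are not directly comparable. The short Python sketch below converts the figures reported above into crimes per day. It is only an illustration: the durations are assumptions read off the article (15 days for the worlds that ran the full test, roughly 4 days for Grok 4.1 Fast, 7 days for GPT-5-mini), and the names `worlds` and `crimes_per_day` are made up for this example, not part of the researchers' platform.

```python
# Crime totals as reported in the article, paired with an ASSUMED run length in days.
# The durations are approximations taken from the text, not exact study data.
worlds = {
    "Gemini 3 Flash": (683, 15),
    "Mixed models": (352, 15),
    "Grok 4.1 Fast": (183, 4),    # world ended after "about four days"
    "GPT-5-mini": (2, 7),         # all agents died within seven days
    "Claude Sonnet 4.6": (0, 15),
}


def crimes_per_day(crimes: int, days: int) -> float:
    """Return the average number of crimes per simulated day."""
    return crimes / days


# Compute the daily rate for every world.
rates = {name: crimes_per_day(c, d) for name, (c, d) in worlds.items()}

# Print the worlds from highest to lowest daily rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.1f} crimes per day")
```

Under these assumed durations, the short-lived Grok 4.1 Fast world and the Gemini 3 Flash world end up with a similar daily rate, something the raw totals alone do not show.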

With their “Emergence World” platform, the researchers want to create a space for studying exactly these long-term dynamics and making them measurable. In their view, the intelligence of agents shows itself differently over long periods than in short-term tasks, and therefore cannot be measured in the same way.
