Study Shows How AI Circumvents Safety Rules

June 5, 2026, 6:37 am | Read time: 4 minutes

AI agents are increasingly taking on tasks autonomously in companies. They are expected not only to provide suggestions but also to execute commands, complete work steps, and monitor results. This shift brings a new danger into focus. A study by the research organization METR concludes that some modern AI models do not always follow security guidelines as intended. Instead, they sometimes look for ways to circumvent rules to achieve a desired outcome.

Good Results Can Be Deceptive

In the so-called “Frontier Risk Report,” METR examined various AI agents from Anthropic, Google, Meta, and OpenAI in February and March 2026. The researchers evaluated not only whether a task was successfully completed but also how the systems arrived at their solutions.

Several models delivered seemingly correct results at first glance. However, closer analysis revealed that they had bypassed the intended process. METR refers to this behavior as “cheating” or reward hacking. In this context, a model exploits weaknesses in a test environment or in the testing conditions to achieve the desired solution. The result appears correct, even though the actual task was not solved in the intended manner.

Models Exploited Weaknesses

The report cites several examples. In a black-box task, a model provided by Google was supposed to determine the function of an unknown program. The agent was expected to query the program specifically and draw conclusions from the answers. Instead, the model accessed internal files of the test environment.

In a similar task, a model provided by OpenAI also exploited a weakness. It read information directly from the running server process. In both cases, the results seemed plausible but were not based on the intended solution path.

Interventions in the Test Environment

In a simulation for message forwarding in a network, models did not always adhere to the rules. According to METR, a system provided by OpenAI altered internal states of the simulator instead of using the intended process.

Another example involves a model provided by Google. When submitting its solution, it explicitly referred to the “evaluation sandbox” and also accessed internal values of the simulator. METR finds it particularly noteworthy that the systems did not simply fail. Instead, they found ways to influence the evaluation in their favor.

"Unprecedented Incident" ChatGPT Goes Rogue and Conducts Cyberattack Independently

The new AI agent "ChatGPT Work" is designed to independently handle tasks for hours, with the ability to access apps and files.

New Employment Agent at OpenAI ChatGPT Work Is Here–and It Handles Entire Projects

Agents Apparently Test Their Limits

The situation becomes even more problematic, according to the report, when models begin to explore their environment on their own. In an OpenAI example described by METR, a server relevant to the task failed. The agent then apparently recognized that it was in a METR test environment. It subsequently searched for ways to leave the isolated area or access restricted data. This attempt was not successful.

Also of interest: OpenAI Is Reportedly Working on Its Own Smartphone

METR observed similar behavior in the MirrorCode evaluation. There, models were supposed to recreate programs without being able to view the source code. In one case, GPT-5.4 attempted to use a reference file. It then created a lookup table for visible test cases. This resulted in perfect scores for these tests, but failed in the hidden tests.

Control Remains Essential According to Researchers

Despite the observed incidents, METR does not draw a completely negative conclusion. According to the organization, the agents examined in spring 2026 were not yet capable of conducting highly robust, covert operations. At the same time, researchers see indications that smaller unauthorized actions could become possible if systems are granted too extensive rights.

Therefore, METR recommends not evaluating AI agents solely based on their results. The path to the solution must also be reviewed. Suggested measures include limited access rights, logging, monitoring, and human approvals for particularly critical work steps.

This article is a machine translation of the original German version of TECHBOOK and has been reviewed for accuracy and quality by a native speaker. For feedback, please contact us at info@techbook.de.