Concrete Problems in AI Safety
This article was written in 2016 and therefore reads a bit dated today, but it remains a landmark paper for identifying important research directions in safety. The authors aim to define the problem of "accidents", or "unintended and harmful behavior that may emerge from machine learning systems". They identify three drivers of accidents: specifying the wrong objective function, using an objective function that is too expensive to evaluate (so the system must act on limited information), and failures that arise during the learning process itself.
The article highlights five key challenges.
- Avoiding negative side effects: an agent pursuing its goal causes harm along the way. The paper's example is a cleaning robot that knocks over a vase while cleaning a room.
- Avoiding reward hacking: the agent games its reward function, earning high reward without accomplishing the intended task.
- Scalable oversight: supervising the agent efficiently when evaluating every action is too expensive.
- Safe exploration: the agent tries new actions without taking catastrophic ones.
- Robustness to distributional shift: the agent behaves sensibly when deployed in environments that differ from its training environment.
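To make the first challenge concrete, here is a minimal toy sketch (my own illustration, not from the paper) of one common mitigation idea: subtracting an impact penalty from the task reward so that side effects, like the knocked-over vase, reduce the agent's score. The function name, parameters, and weight are all hypothetical.

```python
def shaped_reward(task_reward: float, n_side_effects: int,
                  impact_weight: float = 0.5) -> float:
    """Task reward minus a crude impact penalty.

    n_side_effects counts environment changes unrelated to the task
    (e.g. a knocked-over vase); impact_weight trades off task
    progress against side effects.
    """
    return task_reward - impact_weight * n_side_effects

# A cleaning step that finishes the task but breaks a vase scores
# lower than one that finishes the task cleanly.
print(shaped_reward(1.0, 1))  # 0.5
print(shaped_reward(1.0, 0))  # 1.0
```

The hard part, which the paper emphasizes, is defining "impact" well: a penalty that is too crude either blocks useful behavior or misses the side effects we actually care about.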