The potential risks of reward hacking in advanced AI

μdist and μprox model the world, perhaps coarsely, outside of the computer implementing the agent itself. μdist outputs reward equal to the box display, while μprox outputs reward according to an optical character recognition function applied to part of the visual field of a camera. (As a side note, some coarseness to this simulation is unavoidable, since a computable agent generally cannot perfectly model a world that includes itself (Leike, Taylor, and Fallenstein 2016); hence, the laptop is not in blue.) Credit: AI Magazine (2022). DOI: 10.1002/aaai.12064
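The caption's distinction between the two reward models can be made concrete with a toy sketch. Everything below is an illustrative assumption, not the paper's formalism: μdist reads the "true" box display directly, while μprox reads it through a camera-and-OCR channel that an agent could in principle tamper with.

```python
# Toy sketch (assumed names and structure, not from the paper) of the
# distal vs. proximal reward-model distinction described in the caption.

def ocr(image: str) -> int:
    # Stand-in OCR: the "image" is simply the text it depicts.
    return int(image)

def mu_dist(world: dict) -> int:
    # Distal model: reward equals the number actually on the box display.
    return world["box_display"]

def mu_prox(world: dict) -> int:
    # Proximal model: reward equals what OCR reads from the camera's view.
    return ocr(world["camera_image"])

world = {"box_display": 3, "camera_image": "3"}
print(mu_dist(world), mu_prox(world))  # the two channels agree at first

# Reward hacking on the proximal channel: e.g. holding a printed "9" in
# front of the camera changes what OCR sees without changing the box.
world["camera_image"] = "9"
print(mu_dist(world), mu_prox(world))  # the proximal model is now fooled
```

The point of the sketch is that only the proximal channel is vulnerable to this intervention: μdist still reports 3 while μprox reports 9, which is the gap a reward-hacking agent can exploit.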

New research published in AI Magazine explores how advanced AI could hack reward systems to …
