Zürcher Nachrichten - AI is learning to lie, scheme, and threaten its creators

AI is learning to lie, scheme, and threaten its creators
AI is learning to lie, scheme, and threaten its creators / Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" - appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Y.Keller--NZN