Zürcher Nachrichten - AI is learning to lie, scheme, and threaten its creators


AI is learning to lie, scheme, and threaten its creators
Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" - appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Hobbhahn, who co-founded Apollo Research.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

Y.Keller--NZN