Zürcher Nachrichten - Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

"Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant," said Suresh Vasudevan, CEO of Clockwork.io. "We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure."

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

"As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra's NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable," said Patel. "TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics."

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, making them a major barrier to scaling AI's impact.
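To put the mean-time-to-failure figures in daily terms, the short sketch below converts them into an expected number of failures per day. It assumes a constant average failure rate, which is a simplification introduced here for illustration and not a claim from the research cited above.

```python
# Mean time to failure (hours) by cluster size, as quoted in the paragraph above.
MTTF_HOURS = {1_024: 7.9, 16_384: 1.8}


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures in a 24-hour window, assuming a constant failure rate."""
    return 24.0 / mttf_hours


for gpus, mttf in MTTF_HOURS.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, a 1,024-GPU cluster sees roughly three failures a day and a 16,384-GPU cluster more than thirteen.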

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now address impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

"Managing compute output across large-scale GPU clusters is vital to ensuring we're delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations," said David Power, CTO of Nscale. "In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale."

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
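The three scenarios above can be read as a simple decision rule that maps an incoming event to a migration mode. The sketch below is purely illustrative: the event labels, the `MigrationMode` enum, and the `classify` function are hypothetical names invented here, not part of TorchPass or any published Clockwork.io API.

```python
from enum import Enum


class MigrationMode(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event labels, taken from the examples in the list above.
HARD_FAULTS = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "rebalancing"}


def classify(event: str) -> MigrationMode:
    """Map an event label to the migration scenario it would fall under."""
    if event in HARD_FAULTS:
        return MigrationMode.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationMode.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return MigrationMode.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


for event in ("gpu_fault", "ecc_memory_error", "patching"):
    print(f"{event} -> {classify(event).value}")
```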

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
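The arithmetic behind that claim can be checked in a few lines. The sketch below applies the 95% reduction to the three-hour baseline, then cross-checks the result against the figures quoted elsewhere in this release (a 7.9-hour mean time to failure at 1,024 GPUs and a recovery time of about three minutes). The cross-check assumes that each failure costs at most the full recovery time, which is an assumption made here and not a statement from Clockwork.io.

```python
BASELINE_LOST_MIN_PER_DAY = 3 * 60   # "approximately three hours per day"
REDUCTION = 0.95                     # "reduces wasted training progress by 95%"
MTTF_HOURS = 7.9                     # 1,024-GPU cluster, per the Meta FAIR figure
RECOVERY_MIN = 3                     # "approximately three minutes" per recovery

# Direct application of the stated reduction.
remaining_min = BASELINE_LOST_MIN_PER_DAY * (1 - REDUCTION)

# Cross-check (assumption: each failure costs at most one full recovery).
failures_per_day = 24 / MTTF_HOURS
recovery_min_per_day = failures_per_day * RECOVERY_MIN

print(f"Remaining lost time after 95% reduction: {remaining_min:.1f} min/day")
print(f"Failures per day x recovery time:        {recovery_min_per_day:.1f} min/day")
```

Both estimates land at roughly nine minutes per day, consistent with the "under ten minutes" figure.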

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis' independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

"In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective," concluded Nanos.
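For readers unfamiliar with the metric, MFU is commonly defined as the model FLOPs a job actually achieves per second divided by the aggregate theoretical peak of its GPUs. The sketch below shows that definition and how time lost to failures drags down the effective figure over a day. Every number in it is a hypothetical placeholder chosen for round arithmetic; none is a measurement from the SemiAnalysis testing or a hardware specification.

```python
def mfu(achieved_flops_per_s: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved throughput over aggregate peak."""
    return achieved_flops_per_s / (n_gpus * peak_flops_per_gpu)


def effective_mfu(steady_mfu: float, lost_hours: float, wall_hours: float = 24.0) -> float:
    """Scale steady-state MFU by the fraction of wall-clock time spent training."""
    return steady_mfu * (wall_hours - lost_hours) / wall_hours


# Hypothetical inputs: 64 GPUs, a placeholder per-GPU peak, and a placeholder
# achieved throughput that gives a steady-state MFU of about 0.40.
steady = mfu(achieved_flops_per_s=2.56e16, n_gpus=64, peak_flops_per_gpu=1.0e15)

for label, lost_hours in (("3 hours lost per day", 3.0), ("10 minutes lost per day", 10 / 60)):
    print(f"{label}: effective MFU = {effective_mfu(steady, lost_hours):.3f}")
```

The point of the second function is that faster recovery raises effective MFU even when steady-state throughput is unchanged, which is why recovery time and MFU are reported together.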

Measurable Business Impact from Software-Driven Fault Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io's prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io's Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork



T.L.Marti--NZN