Zürcher Nachrichten - Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training


New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

"Companies are investing billions in next-gen chips, yet the costs of running distributed AI jobs remain grossly inflated because the ecosystem has accepted failure as a constant," said Suresh Vasudevan, CEO of Clockwork.io. "We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure."

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

"As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra's NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable," said Patel. "TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics."

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, making them a major barrier to scaling AI's impact.
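To put the quoted mean-time-to-failure figures in perspective, the short Python sketch below converts them into an expected number of interruptions per day. It assumes failures arrive at a constant average rate, which is a simplification not stated in the release, and the function name is purely illustrative.

```python
HOURS_PER_DAY = 24.0


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average number of failures per day for a given mean time to failure.

    Assumes a constant average failure rate (an illustrative simplification).
    """
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF figures quoted in the release, keyed by cluster size in GPUs.
quoted_mttf_hours = {1024: 7.9, 16384: 1.8}

for gpus, mttf in quoted_mttf_hours.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, the quoted figures correspond to roughly 3 interruptions per day at 1,024 GPUs and roughly 13 per day at 16,384 GPUs.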

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem by handling costly AI workload failures proactively, resolving them before the job stops or needs to restart. Vital for enterprises running large AI workloads and AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now service impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and overall AI cloud economics, improving their ability to protect margin and deliver new models sooner.

"Managing compute output across large-scale GPU clusters is vital to ensuring we're delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations," said David Power, CTO of Nscale. "In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale."

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
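The arithmetic behind these figures can be checked with a minimal Python sketch. The 3 hours per day and the 95% reduction come from the text above; the conversion to GPU-hours additionally assumes that every GPU in the cluster sits idle during lost time, which is an illustrative assumption and not a claim from the release.

```python
def remaining_lost_minutes(lost_hours_per_day: float, reduction: float) -> float:
    """Lost time per day (in minutes) that remains after a fractional reduction."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return lost_hours_per_day * 60.0 * (1.0 - reduction)


def recovered_gpu_hours_per_day(
    lost_hours_per_day: float, reduction: float, num_gpus: int
) -> float:
    """GPU-hours recovered per day, assuming all GPUs idle during lost time."""
    return lost_hours_per_day * reduction * num_gpus


baseline_lost_hours = 3.0  # "approximately three hours per day" (from the text)
reduction = 0.95           # "reduces wasted training progress by 95%"
num_gpus = 1024            # cluster size quoted in the text

remaining = remaining_lost_minutes(baseline_lost_hours, reduction)
recovered = recovered_gpu_hours_per_day(baseline_lost_hours, reduction, num_gpus)

print(f"Remaining lost time: {remaining:.0f} minutes per day")
print(f"Recovered compute:   {recovered:.0f} GPU-hours per day (idle-cluster assumption)")
```

A 95% reduction of three hours leaves about nine minutes, which is consistent with the "under ten minutes" figure.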

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis' independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

"In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective," concluded Nanos.
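For readers unfamiliar with the MFU metric cited in the test results, the sketch below shows the commonly used definition: achieved model FLOPs per second divided by the aggregate theoretical peak of the GPUs. The numbers in the example are hypothetical placeholders chosen for easy arithmetic. They are not measurements from the SemiAnalysis test and not an official H200 specification.

```python
def model_flops_utilization(
    achieved_flops_per_s: float, num_gpus: int, peak_flops_per_gpu: float
) -> float:
    """Fraction of the cluster's theoretical peak FLOPs spent on useful model math."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)


# Hypothetical inputs for illustration only.
num_gpus = 64                 # cluster size used in the quoted test
peak_flops_per_gpu = 1.0e15   # placeholder peak, NOT a vendor figure
achieved_flops_per_s = 2.56e16  # placeholder achieved throughput

mfu = model_flops_utilization(achieved_flops_per_s, num_gpus, peak_flops_per_gpu)
print(f"MFU: {mfu:.0%}")
```

Time spent rolling back to a checkpoint or waiting for a restart contributes no model FLOPs, so it lowers the numerator averaged over the whole run; this is why recovery behavior shows up in MFU as well as in job completion time.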

Measurable Business Impact from Software-Driven Fault Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io's prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io's Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork



T.L.Marti--NZN