Zürcher Nachrichten - Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training


New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the costliest failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to let workloads continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

"Companies are investing billions in next-gen chips, yet the costs of running distributed AI jobs remain grossly inflated because the ecosystem has accepted failure as a constant," said Suresh Vasudevan, CEO of Clockwork.io. "We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure."

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

"As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra's NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable," said Patel. "TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics."

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure, and fragility increases sharply with cluster size. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making reliability a major barrier to scaling AI.
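As a rough sanity check (not a Clockwork.io calculation), the cited figures roughly track the simple rule that a cluster's failure rate grows with GPU count: if every GPU failed independently at a constant rate, cluster MTTF would shrink as 1/N. A minimal sketch:

```python
# Back-of-envelope check of the MTTF scaling cited above, assuming
# independent, identically distributed GPU failures (MTTF ~ 1/N).
# This is an illustrative model, not Meta FAIR's methodology.

def cluster_mttf_hours(n_gpus: int, mttf_at_1024: float = 7.9) -> float:
    """Extrapolate cluster MTTF from the published 1,024-GPU figure."""
    per_gpu_failure_rate = 1.0 / (mttf_at_1024 * 1024)  # failures per GPU-hour
    return 1.0 / (n_gpus * per_gpu_failure_rate)

print(round(cluster_mttf_hours(1024), 1))   # 7.9
print(round(cluster_mttf_hours(16384), 2))  # 0.49
```

The naive 1/N model predicts about 0.49 hours at 16,384 GPUs, more pessimistic than the measured 1.8 hours, suggesting that in practice not every component fault at that scale independently kills a job.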

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
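The rollback cost described above can be estimated with simple expected-value arithmetic: each failure discards, on average, half a checkpoint interval of progress plus a fixed restart overhead. In the sketch below, the hourly checkpoint interval and 30-minute restart overhead are illustrative assumptions, not published figures:

```python
# Hedged sketch of the checkpoint-restart waste arithmetic; interval and
# overhead values are assumptions chosen for illustration.

HOURS_PER_DAY = 24.0

def daily_waste_hours(mttf_hours: float,
                      ckpt_interval_hours: float,
                      restart_overhead_hours: float) -> float:
    """Expected wall-clock hours lost per day to failure-driven restarts:
    (failures per day) x (half a checkpoint interval + restart overhead)."""
    failures_per_day = HOURS_PER_DAY / mttf_hours
    waste_per_failure = ckpt_interval_hours / 2 + restart_overhead_hours
    return failures_per_day * waste_per_failure

# 1,024-GPU cluster with the 7.9 h MTTF cited above, hourly checkpoints,
# and an assumed 30 minutes of reprovisioning/restart overhead per failure:
print(round(daily_waste_hours(7.9, 1.0, 0.5), 1))  # 3.0
```

Under these assumed inputs the estimate lands near three lost hours per day, consistent with the figure the release quotes later for a 1,024-GPU cluster.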

TorchPass tackles this problem proactively, resolving costly workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds, which can now service impacted GPUs while the training run proceeds as planned, gain better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

"Managing compute output across large-scale GPU clusters is vital to ensuring we're delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations," said David Power, CTO of Nscale. "In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale."

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
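The three scenarios above could, in principle, be distinguished by a simple policy over node health signals. The sketch below is hypothetical: the signal names and thresholds are invented for illustration and do not reflect TorchPass internals.

```python
# Hypothetical classifier for the three migration scenarios described above.
# All fields and thresholds are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeHealth:
    alive: bool                  # heartbeat still arriving
    gpu_temp_c: float            # hottest GPU temperature on the node
    ecc_errors_per_hour: float   # rate of correctable memory errors
    maintenance_window: bool     # operator has scheduled work on this node

def migration_kind(h: NodeHealth) -> Optional[str]:
    if not h.alive:
        return "unplanned"    # hard failure: reconstruct state from replicas
    if h.maintenance_window:
        return "planned"      # drain the node without interrupting training
    if h.gpu_temp_c > 90.0 or h.ecc_errors_per_hour > 10.0:
        return "pre-emptive"  # early-warning signals: move before a hard failure
    return None               # healthy: leave the rank where it is

print(migration_kind(NodeHealth(True, 95.0, 0.0, False)))  # pre-emptive
```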

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis' independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

"In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective," concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
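As a hedged illustration of how a savings figure of this kind can be derived: every recovered wall-clock hour is worth one GPU-hour per device in the cluster. The recovered hours per day and the $/GPU-hour rate below are assumptions for illustration, not Clockwork.io's published inputs.

```python
# Illustrative savings arithmetic; both parameters in the example call
# are assumptions, not figures published by Clockwork.io.

def annual_savings_usd(n_gpus: int,
                       recovered_hours_per_day: float,
                       usd_per_gpu_hour: float) -> float:
    """Dollar value of compute recovered when failed time stays productive."""
    gpu_hours_per_year = n_gpus * recovered_hours_per_day * 365
    return gpu_hours_per_year * usd_per_gpu_hour

# 2,048 GPUs recovering ~4 wall-clock hours/day at an assumed $2/GPU-hour:
print(f"${annual_savings_usd(2048, 4.0, 2.0):,.0f}")  # $5,980,160
```

Under those assumed inputs the result lands in the vicinity of the $6 million annual figure the release cites.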

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io's prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io's Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. Modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, so the infrastructure behaves as a single system in which reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in-person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork



View the original press release on ACCESS Newswire

T.L.Marti--NZN