Methodology · Litepaper v1.0

What we measure, what we model, and where the line is drawn.

The methodology rests on one commitment: separate what is read from the device from what is estimated by a model, and never let the two look identical on the page. Every downstream claim is built on that boundary.

Engine: Kolibri · Standalone and channel-agnostic

First principle

The measured versus modeled boundary.

The first thing a serious critic tests is whether every field is a model. It is not. Conflating a measured fact with a statistical estimate is the failure mode that destroys credibility in asset-backed underwriting, so the two tracks are separated by construction. The deterministic track stands entirely on its own, and most of a certificate's diligence value lives there.

MEASURED · DETERMINISTIC

Read at certification time through NVML, DCGM, and hardware attestation. No model in the loop.

Identity, authenticity, firmware and VBIOS integrity
Current ECC and Xid error-history counts
Retired and remapped page counts
Spare rows remaining, the Hopper end-of-life gauge
Functional burn-in pass or fail
Silent-corruption functional-test pass or fail
Data sanitization to IEEE 2883-2022

Public datasets do not train these fields. They tell us only what counts as normal versus degraded.

MODELED · CALIBRATED

Statistical estimates that require data, validation, and a stated error. Each carries a band and a calibration-provenance tag.

Six failure-mode wear sub-indices
Exposure-adjusted condition
The Modeled Wear Trajectory, a banded path
A stated-coverage lower predictive bound
A measured cross-generation transfer error

These are the only fields the calibration program exists to support. The boundary lets us concede their limits without weakening the certificate.

Certification flow

Device under certification
Engine: Kolibri, read-only acquisition

Measured · deterministic

Disqualifying gates: identity, authenticity, firmware
Facts vs thresholds: ECC, Xid, pages, spare rows, burn-in, SDC, sanitization

Modeled · calibrated

Generation-aware Modeled Condition Index
Modeled Wear Trajectory, a banded path
Conformal band with transfer-error inflation

Output

Certificate: two tracks, never one score

One acquisition, two tracks. Gates can disqualify; facts are reported against thresholds; the modeled index produces a banded trajectory. The tracks never collapse into a single number.
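As a rough illustration of the flow above, a certificate can be pictured as a record with two separate tracks and no field that merges them. This is a minimal sketch, not Voltry's schema: the class names, field names, and the "value above the limit is flagged" convention are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Gate:
    """A disqualifying check, e.g. identity, authenticity, firmware integrity."""
    name: str
    passed: bool


@dataclass(frozen=True)
class Fact:
    """A deterministic read reported against a threshold, never scored."""
    name: str
    value: float
    limit: float  # hypothetical threshold; a value above it is flagged

    @property
    def within_threshold(self) -> bool:
        return self.value <= self.limit


@dataclass(frozen=True)
class ModeledTrack:
    """Calibrated estimates; each carries a band and a provenance tag."""
    condition_index: float
    lower_bound: float         # lower predictive bound at `coverage`
    coverage: float            # stated coverage, e.g. 0.90
    transfer_inflation: float  # band inflation applied for transfer error
    calibration_tag: str       # e.g. "generation-matched" or "transferred"


@dataclass(frozen=True)
class Certificate:
    gates: Tuple[Gate, ...]
    facts: Tuple[Fact, ...]
    modeled: Optional[ModeledTrack]  # kept separate from the measured track

    @property
    def disqualified(self) -> bool:
        # Any failed gate disqualifies, whatever the modeled track says.
        return any(not g.passed for g in self.gates)


def certify(gates, facts, modeled=None) -> Certificate:
    """Assemble the two tracks side by side without combining them."""
    return Certificate(tuple(gates), tuple(facts), modeled)
```

The record deliberately has no combined score attribute, so a reader of the structure cannot mistake a modeled estimate for a measured fact.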

The evaluation landscape

What already exists, and why each is insufficient alone.

Each established technique answers one question well and is silent on the others. Voltry is the integration layer that binds them to a permanent identity and a record an underwriter can act on.

Vendor cryptographic attestation

Is it genuine?

Deterministic genuineness: authentic silicon, un-reflashed VBIOS, a device-unique ECC-384 identity (an elliptic-curve key, not error-correcting code) fused at manufacture and anchored to the vendor root.

Covers identity and authenticity only. Says nothing about wear, trajectory, or history. Mature on Hopper and later, weaker before Volta.

Out-of-band telemetry

What is it doing now?

Cheap, continuous, standardized over NVML, DCGM, BMC, and Redfish. The raw substrate for any condition signal.

Raw signal, not a grade. Not portable across owners, not bound to an identity, no born-on record, no stated uncertainty.

Point-in-time functional diagnostics

Does it compute now?

Confirms the unit computes correctly at rated performance now. The DCGM level-4 diagnostic (dcgmi diag -r 4) stresses memory, PCIe, NVLink, compute, and power over one to two hours.

A snapshot with no history or trajectory. The vendor states its diagnostics are not comprehensive hardware validation.

Threshold health monitoring

Is it degraded today?

Flags a degraded part against known thresholds: ECC and Xid counters, retired and remapped pages, throttle and clock state.

Reports present state, not a trajectory. A unit can pass every threshold the day before it fails.

Silent-corruption functional testing

Does it compute correctly?

Detects miscomputation that ECC and Xid never flag. The only way to catch a unit that computes wrong answers without raising an error.

No public per-unit dataset exists. Detection is probabilistic and coverage-limited.

Survival analysis (PHM)

How is wear trending?

A mature, validated methodology for remaining-life estimation under right-censoring and competing risks, benchmarked on NASA C-MAPSS.

Only as good as its calibration data. For discrete H100 and newer there is no public per-unit lifetime dataset; the method is sound, the inputs are thin.

Degradation-process modeling

How much margin is left?

Models a monotone wear signal directly and yields a distribution over time-to-threshold through first-passage time.

Needs a measured per-unit degradation signal to anchor on. A single point-in-time read gives a far wider band than a monitored trajectory.

Refurbisher and resale grading

Is it resaleable?

Cosmetic condition, a basic functional pass, and certified data sanitization through R2v3 ITAD channels.

No permanent identity, no born-on provenance, no wear model, no exposure history. The grade is an opinion, not a portable measured record.

Manufacturer RAS features

Can it self-mitigate?

On-die ECC, SECDED, row remapping, dynamic page offlining, and error containment improve resilience in service.

Mitigations inside the device, not a buyer-side signal. They change what normal looks like; they do not certify an individual asset.

The field Voltry occupies

None of these tells a buyer what this specific unit's condition and provenance are, carried forward on a permanent identity. Voltry is the integration layer that closes every gap above at once.

The empirical foundation

The weights are read from the field data, not asserted.

The load-bearing study is the SC 2025 characterization of NCSA Delta by Cui and colleagues: 1,056 A100 and H100 GPUs over 2.5 years and 11.7 million GPU-hours. Its findings define the modern condition baseline, generation by generation.

3.2x

H100 memory MTBE deficit

Mean time between uncorrectable-ECC errors is roughly 3.2 times lower on H100 than on A100, about 88,768 hours versus 283,271 hours, with a per-gigabyte gap near 24%. The gap is attributed primarily to higher capacity, 96 versus 40 GB.

512 rows

Spare-row cap, fixed

The remapping pool is capped at 512 rows and did not grow with the 2.4 times capacity increase. Remapping failures appeared on H100 and not on A100, the basis for treating spare rows remaining as a per-unit end-of-life gauge.

0.15 to 1.4 hr

Bathtub MTBE, infant to normal

System-wide mean time between errors rose roughly tenfold from a 0.15-hour infant-mortality phase to 1.4 hours in normal life. This is the empirical case for born-on coverage.
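The three headline ratios can be re-derived from the figures quoted above. The sketch below only repeats that arithmetic. The per-gigabyte line assumes one plausible normalization (MTBE multiplied by capacity); the study's exact normalization is not reproduced here, so treat that line as indicative.

```python
# Figures as quoted in the text above.
A100_MTBE_H, H100_MTBE_H = 283_271, 88_768  # hours between uncorrectable-ECC errors
A100_GB, H100_GB = 40, 96                   # memory capacity per GPU
INFANT_MTBE_H, NORMAL_MTBE_H = 0.15, 1.4    # system-wide MTBE by life phase

# How many times lower the H100 MTBE is than the A100 MTBE.
mtbe_ratio = A100_MTBE_H / H100_MTBE_H

# Capacity growth from A100 to H100.
capacity_ratio = H100_GB / A100_GB

# ASSUMED normalization: "GB-hours per error" = MTBE * capacity.
per_gb_gap = 1 - (H100_MTBE_H * H100_GB) / (A100_MTBE_H * A100_GB)

# Infant-mortality phase versus normal life.
bathtub_ratio = NORMAL_MTBE_H / INFANT_MTBE_H

print(f"MTBE ratio:      {mtbe_ratio:.2f}x")
print(f"Capacity ratio:  {capacity_ratio:.1f}x")
print(f"Per-GB gap:      {per_gb_gap:.1%}")
print(f"Bathtub ratio:   {bathtub_ratio:.1f}x")
```

Under that normalization the per-gigabyte gap comes out a little under a quarter, in the neighborhood of the quoted figure, and the bathtub ratio is slightly below ten, consistent with "roughly tenfold".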

The failure-mode distribution inverts between generations. On H100, memory errors are the primary cause of job failures while critical hardware components are nearly silent. On A100, hardware errors predominate: the GSP behaves as a near single point of failure, PMU-SPI errors propagate to MMU errors with high probability, and many NVLink errors were recorded. That inversion is the entire reason the weights are generation-specific.

The caveat that always travels with this study

The Delta H100s are GH200 Grace Hopper Superchips, tightly integrated with a Grace CPU over NVLink-C2C. Their hardware-resilience wins are partly integration effects and may not hold for the discrete H100 SXM and PCIe parts that make up most of what Voltry certifies. Delta is used for architecture-level findings only, and the integration caveat is stamped on the certificate.

Memory as its own failure surface

Because Delta identifies HBM as the Hopper weak link, memory behavior is its own evidence stream. The first systematic field study of HBM errors, by Wu and colleagues at USENIX ATC 2024, analyzed more than 460 million error events across nineteen data centers and confirmed that HBM exhibits patterns distinct from conventional DRAM: spatial locality, column and through-silicon-via failures, and a hierarchical structure. The foundational DRAM and flash studies establish that memory errors are dominated by hard, permanent faults, and that a device with one correctable error is far more likely to see another. That is what makes a monotone degradation process the right model for memory wear.

The trend does not reverse past Hopper. The B200 carries 192 GB of HBM3e and the B300 carries 288 GB in twelve-high stacks at a roughly 1,400-watt envelope, so the capacity-and-stack mechanism Delta blames continues. We therefore extrapolate the direction of the memory-wear concern to newer parts while widening the band, and never import the absolute Hopper numbers as if they were measured for Blackwell.

The honest position on H100 and newer

Stated plainly, because a critic will state it for us: there is no public per-unit, serial-level, multi-year H100 lifetime dataset, and discrete H100 SXM and B200 lifetimes are publicly unobserved at the per-unit level. This is not a weakness papered over. It is the reason the certificate separates measured from modeled, the reason transfer error is a first-class output, and the reason the trajectory is anchored on the unit's own measured degradation wherever possible.

The condition model

Six failure-mode sub-indices, weighted by generation.

The Modeled Condition Index is composed of six normalized sub-indices. The weights are generation-specific because the failure-mode distribution is. A one-size weighting would assign weight to signals that do not predict failure for the generation in hand. The relative ordering within each generation is anchored by the published distribution and is the defensible part; the exact magnitudes are a calibrated prior, re-fit against observed events with the ordering constrained so re-fitting cannot drift into overfitting noise.

Sub-index weight, v1.0            Ampere A100    Hopper H100 / H200    Blackwell B200 / B300 (prior)
M  Memory wear                    0.20           0.34 (leads)          0.34 (leads)
H  Hardware-component health      0.30 (leads)   0.12                  0.10
T  Thermal-cycle and connector    0.14           0.14                  0.18
F  Functional integrity           0.16           0.18                  0.16
P  Provenance continuity          0.12           0.12                  0.12
E  Exposure                       0.08           0.10                  0.10

Version 1.0 starting weights, drawn as magnitudes so the inversion is legible. Memory leads on Hopper and Blackwell because the field evidence puts it as the primary failure cause and the 512-row cap makes spare-row depletion an observable end-of-life gauge. Hardware-component health leads on Ampere because the GSP, PMU-SPI, and NVLink were the dominant A100 sources. Thermal and connector fatigue is elevated on Blackwell because fallen-off-bus is connector fatigue driven by thermal cycles, and the 1,400-watt envelope raises that stress.
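To make the effect of generation-specific weights concrete, the sketch below applies the version 1.0 weights to one set of sub-index readings. It assumes the index is a plain weighted sum of sub-indices normalized to the range 0 to 1, which the text does not state explicitly; the function name and the sample readings are hypothetical.

```python
# Version 1.0 starting weights, copied from the table above.
WEIGHTS_V1 = {
    "A100": {"M": 0.20, "H": 0.30, "T": 0.14, "F": 0.16, "P": 0.12, "E": 0.08},
    "H100": {"M": 0.34, "H": 0.12, "T": 0.14, "F": 0.18, "P": 0.12, "E": 0.10},
    "B200": {"M": 0.34, "H": 0.10, "T": 0.18, "F": 0.16, "P": 0.12, "E": 0.10},
}


def condition_index(generation: str, sub_indices: dict) -> float:
    """ASSUMED form: weighted sum of six sub-indices, each in [0, 1]."""
    weights = WEIGHTS_V1[generation]
    if set(sub_indices) != set(weights):
        raise ValueError("expected exactly the sub-indices M, H, T, F, P, E")
    for name, value in sub_indices.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"sub-index {name} must be normalized to [0, 1]")
    return sum(weights[k] * sub_indices[k] for k in weights)


# Hypothetical readings: everything nominal except memory wear.
readings = {"M": 0.5, "H": 1.0, "T": 1.0, "F": 1.0, "P": 1.0, "E": 1.0}

for gen in WEIGHTS_V1:
    print(gen, round(condition_index(gen, readings), 3))
```

The same memory reading moves the index further on Hopper and Blackwell than on Ampere, which is the inversion the weights are meant to encode.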

The exposure rule

Exposure is not measurable from the GPU agent. When facility instrumentation is absent, the exposure sub-index is reported as Not Assessed, its weight is redistributed across the remaining sub-indices in proportion so the index renormalizes over the signals actually available, and the certificate states plainly that exposure was not assessed. A value is never fabricated from board power or PSU input.
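The proportional redistribution described in the exposure rule can be sketched in a few lines. The function name is hypothetical, and the weights are the Hopper column of the version 1.0 table.

```python
HOPPER_WEIGHTS = {"M": 0.34, "H": 0.12, "T": 0.14, "F": 0.18, "P": 0.12, "E": 0.10}


def redistribute(weights: dict, not_assessed: set) -> dict:
    """Drop Not Assessed sub-indices and renormalize the rest in proportion."""
    kept = {k: w for k, w in weights.items() if k not in not_assessed}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no assessed sub-indices left to carry the weight")
    # Dividing by the remaining total preserves every ratio between
    # the surviving weights and makes them sum to 1 again.
    return {k: w / total for k, w in kept.items()}


without_exposure = redistribute(HOPPER_WEIGHTS, {"E"})
for name, weight in without_exposure.items():
    print(name, round(weight, 4))
```

Because every surviving weight is divided by the same total, the ordering within the generation is unchanged; only the scale moves.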

The keystone

The Modeled Wear Trajectory replaces remaining-life.

A clean scalar such as "14,000 hours left" plugs directly into depreciation and lease-residual formulas. That convenience is the trap. A scalar reads as a warranty, the data does not support it per unit, and a false point estimate is the easiest thing for an adversarial reviewer to break with a single early failure. So the field name and the scalar are retired.

In its place is the Modeled Wear Trajectory: a projected path of wear as a function of cumulative duty, measured in GPU-hours, thermal cycles, and calendar time rather than assumed. At every horizon it carries a calibrated band, and the certificate foregrounds the lower predictive bound, because a lender prices against the downside, not the median. It answers how this unit's wear is trending and how much margin is plausibly left under its observed duty, and it answers as a distribution, not a promise.

The per-unit engine

The wear signals that matter most are monotone: spare rows are consumed and never returned, pages are retired and never un-retired, uncorrectable-ECC events accumulate. Monotone degradation is exactly the regime the prognostics literature models with Gamma and Inverse-Gaussian processes, both yielding, through their first-passage time to a fixed threshold, a full distribution over time-to-threshold. The cleanest instance is spare-row depletion on Hopper, where the threshold is the fixed and known 512-row cap and the wear signal is a deterministic read. This is the reframing that turns the H100 data gap from a fatal weakness into a methodological feature: for a Gold unit monitored from birth the process is fit and updated online on that unit's own history and the band is tight; for a Silver unit seen once the trajectory uses a population-prior drift with unit-level random effects and the band is correspondingly wider. The loss of anchoring shows up honestly as a wider band, never as a quietly less reliable point estimate.
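A small Monte Carlo sketch can show how a Gamma process turns spare-row consumption into a distribution over time-to-threshold, and why an unanchored unit gets a wider band. Everything numeric here is hypothetical: the drift parameters, the 112 rows already consumed, and the lognormal unit-level effect used to stand in for a population prior. Treating discrete rows as a continuous process is also a simplification.

```python
import random

SPARE_ROW_CAP = 512  # fixed remapping pool quoted in the text


def first_passage(rows_left, shape_per_khr, scale_rows, rng,
                  step_khr=5.0, max_khr=2000.0):
    """Duty (thousand GPU-hours) at which one simulated path uses up rows_left."""
    used, duty = 0.0, 0.0
    while duty < max_khr:
        duty += step_khr
        # Gamma-process increments are independent and non-negative,
        # distributed Gamma(shape_rate * dt, scale), so the path is monotone.
        used += rng.gammavariate(shape_per_khr * step_khr, scale_rows)
        if used >= rows_left:
            return duty
    return max_khr  # did not cross within the simulated horizon


def band(rows_left, shape_per_khr, scale_rows,
         unit_effect_sigma=0.0, n_paths=2000, seed=0):
    """Quantiles of time-to-threshold; unit_effect_sigma > 0 adds prior spread."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_paths):
        # ASSUMPTION: an unobserved unit's drift is the population drift
        # times a lognormal random effect; an anchored unit has none.
        effect = rng.lognormvariate(0.0, unit_effect_sigma) if unit_effect_sigma > 0 else 1.0
        times.append(first_passage(rows_left, shape_per_khr, scale_rows * effect, rng))
    times.sort()
    pick = lambda p: times[int(p * (n_paths - 1))]
    return {"p10": pick(0.10), "p50": pick(0.50), "p90": pick(0.90)}


rows_left = SPARE_ROW_CAP - 112                          # hypothetical unit
gold = band(rows_left, 0.5, 4.0)                         # drift known from own history
silver = band(rows_left, 0.5, 4.0, unit_effect_sigma=0.3)  # drift from a prior
print("gold  ", gold)
print("silver", silver)
```

The two calls share the same mean drift; only the uncertainty about the unit's own drift differs, and it appears as a wider spread between the 10th and 90th percentiles rather than as a shifted point estimate.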

The band: a coverage guarantee, not a hope

A band is only worth putting in a covenant if it covers reality at the rate it claims. We do not rely on the parametric intervals a model emits. We calibrate the band with conformal prediction, and for the censored, time-to-event setting GPU wear lives in we use conformalized survival analysis, after Candès, Lei, and Ren, 2023, which wraps any survival predictor to produce a calibrated, covariate-dependent lower predictive bound with finite-sample coverage under Type-I right-censoring. Three properties matter. It reports the number an underwriter actually wants: a lower bound at a stated coverage such as 90%. The guarantee is distribution-free and does not depend on the degradation model being correctly specified. And it is built for the censoring and survivorship that sink naive survival estimates. If the wrapped model is poor the band widens to keep its promise; it does not silently mis-cover.
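The "band widens instead of mis-covering" behavior can be illustrated with plain one-sided split conformal prediction. To be clear about scope, this is a simplified stand-in on uncensored synthetic data, not the censoring-aware procedure cited above; the function names and the deliberately over-optimistic predictor are assumptions for the example.

```python
import math
import random


def conformal_margin(predicted, observed, coverage=0.90):
    """Margin q so that (prediction - q) is a lower bound with >= coverage.

    Valid for exchangeable, uncensored calibration data (split conformal).
    """
    # Score = how far the model overshoots the realized time.
    scores = sorted(p - t for p, t in zip(predicted, observed))
    n = len(scores)
    k = math.ceil((n + 1) * coverage - 1e-9)  # finite-sample rank
    if k > n:
        return math.inf  # too few calibration points for this coverage
    return scores[k - 1]


def lower_bound(prediction, margin):
    return prediction - margin


# Synthetic lifetimes (arbitrary units) and a poor, over-optimistic model.
rng = random.Random(1)
draw = lambda n: [100.0 * rng.lognormvariate(0.0, 0.4) for _ in range(n)]
calibration, held_out = draw(500), draw(2000)
BIASED_PREDICTION = 150.0  # hypothetical model output, well above typical

margin = conformal_margin([BIASED_PREDICTION] * len(calibration), calibration)
bound = lower_bound(BIASED_PREDICTION, margin)
covered = sum(t >= bound for t in held_out) / len(held_out)

print(f"margin {margin:.1f}, lower bound {bound:.1f}, held-out coverage {covered:.3f}")
```

The predictor is wrong by construction, yet held-out coverage stays close to the stated level because the margin grows to absorb the error. The cost of a poor model is a lower, less useful bound, not a broken guarantee.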

Transfer error as a coverage-inflation term

Applying a model calibrated on one generation or cooling regime to an asset of another is, in the conformal framework, a covariate shift. The residual under-coverage that remains when a model calibrated on the integrated GH200 Hopper parts is validated on a discrete-H100 fleet is the transfer error, reported as the band inflation needed to restore nominal coverage on the target generation. Transfer error stops being a vague hedge and becomes a measured quantity with units. This also makes the moat quantitative and self-liquidating: as Voltry's own fleet accumulates generation-matched records, the covariate shift shrinks and the inflation falls, on precisely the generations the public literature cannot yet support. Because it is printed on every certificate, a reader can watch it shrink.
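One way to picture transfer error as a quantity with units is to compute the smallest widening that restores nominal coverage on a target fleet. The sketch below assumes an additive inflation, in the same units as the bound, and fully observed outcomes on the target fleet; both are simplifications, and the ten-unit fleet is invented for illustration.

```python
import math


def empirical_coverage(lower_bounds, observed, inflation=0.0):
    """Share of units whose realized value sits at or above the lowered bound."""
    hits = sum(t >= lb - inflation for lb, t in zip(lower_bounds, observed))
    return hits / len(observed)


def transfer_inflation(lower_bounds, observed, coverage=0.90):
    """Smallest additive widening that restores nominal coverage on the target."""
    # Shortfall = how far a bound sits above what actually happened.
    shortfalls = sorted(lb - t for lb, t in zip(lower_bounds, observed))
    k = math.ceil(coverage * len(shortfalls) - 1e-9)
    return max(0.0, shortfalls[k - 1])


# Hypothetical target fleet: source-calibrated bounds versus realized outcomes.
bounds = [100.0] * 10
realized = [90, 95, 105, 110, 120, 130, 140, 150, 160, 170]

before = empirical_coverage(bounds, realized)
inflation = transfer_inflation(bounds, realized, coverage=0.90)
after = empirical_coverage(bounds, realized, inflation)
print(f"coverage {before:.2f} -> {after:.2f} with inflation {inflation:.1f}")
```

If the source-calibrated bounds already cover at the nominal rate, the function returns zero, which is the "shrinks as records accumulate" endpoint described above.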

From evidence to a banded trajectory

Public baseline: published field studies
Proprietary moat: Voltry per-unit records

Degradation and survival fit
Conformal calibration
Banded trajectory with stated coverage

The public baseline sets the method and the population prior. The unit's own measured degradation anchors the per-unit fit where it exists. Conformal calibration produces a band with a stated coverage and a measured transfer-error inflation.

What the trajectory will and will not say

It will not say: "this H100 has 14,000 hours of useful life remaining." It will say: under observed duty of n GPU-hours and k thermal cycles, modeled wear reaches the defined threshold at a projected horizon with a 90% lower predictive bound of X; calibration is generation-matched or transferred with a reported inflation of Y; provenance is born-on or reconstructed with chain gaps; exposure is assessed or Not Assessed; methodology version and calibration snapshot are stamped for replay. A buyer will forgive the uncertainty in that statement. A buyer will not forgive a precise number that turns out to be wrong. That asymmetry is the entire argument for the change.

Validation

Calibrated today, corroborated in the field.

The methodology is credible only if its modeled fields are demonstrably calibrated, and only if that calibration is eventually corroborated by what happens to real assets. The protocol has two layers.

Model calibration, shown today

Validate across fleets, not within them: fit on one system, test on another, and report the resulting transfer error in the stamped methodology version.
Normalize the failure taxonomy before pooling, so Titan off-bus, Summit page retirement, Delta MTBE, HBM errors, and SDC are mapped to one taxonomy first.
Report discrimination and calibration with falsifiable metrics: time-dependent concordance for discrimination, integrated Brier score and predicted-versus-observed survival curves for calibration.
Report conformal coverage diagnostics: empirical versus nominal coverage on held-out fleets, with the transfer-error inflation needed to restore nominal coverage per generation.
Re-fit the weights against observed replacement and failure events, with the per-generation ordering constrained by the published distribution.
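As an example of a falsifiable discrimination metric under right-censoring, the sketch below implements Harrell's concordance index. It is a simpler relative of the time-dependent concordance named in the list, not the same statistic, and the four-unit fleet is invented for illustration.

```python
def harrell_c(times, events, risk):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    times  : observed time for each unit (failure or censoring)
    events : 1 if the unit failed at `times`, 0 if it was censored
    risk   : model risk score, higher meaning earlier expected failure
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored unit cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; cannot compute concordance")
    return concordant / comparable


# Hypothetical held-out fleet: the third unit is censored at 15.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
print(harrell_c(times, events, risk=[4, 3, 2, 1]))  # risk ordered with failures
print(harrell_c(times, events, risk=[1, 2, 3, 4]))  # risk ordered against them
```

A value near 0.5 means the model ranks units no better than chance. Fitting on one fleet and scoring on another, as the first item in the list requires, is what makes the number a test rather than a description.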

Field-cohort validation, committed

Calibration metrics are necessary but not sufficient to earn an underwriter's trust. As fleets mature, Voltry commits to publishing realized dead-on-arrival, failure, and dispute rates of graded cohorts at 30, 90, and 180 days, broken out by tier and generation, against uncertified baselines, together with the effect of chain gaps and the difference between exposure-assessed and not-assessed cohorts.

The adoption test, explicit and falsifiable

Voltry-certified lots should show measurably lower dead-on-arrival, failure, and dispute rates than uncertified lots, and higher tiers should outperform lower tiers. If they do not, the methodology is wrong and the reports will show it.
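The adoption test can be stated as a one-sided comparison of two failure rates. The sketch below uses a standard pooled two-proportion z-test as one reasonable choice; the text does not specify a test, and the cohort counts are invented purely to exercise the function.

```python
import math


def certified_rate_is_lower(fail_cert, n_cert, fail_uncert, n_uncert):
    """One-sided two-proportion z-test; H1: certified failure rate is lower.

    Returns (z, p_value). A small p_value supports the adoption claim;
    a large one means the data do not show the claimed improvement.
    """
    p_cert, p_uncert = fail_cert / n_cert, fail_uncert / n_uncert
    pooled = (fail_cert + fail_uncert) / (n_cert + n_uncert)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cert + 1 / n_uncert))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_cert - p_uncert) / se
    # Standard normal CDF evaluated at z, i.e. P(Z <= z).
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value


# HYPOTHETICAL 90-day cohort counts, for illustration only.
z, p_value = certified_rate_is_lower(fail_cert=12, n_cert=1000,
                                     fail_uncert=30, n_uncert=1000)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

The same comparison can be repeated tier against tier, since the claim also requires higher tiers to outperform lower ones.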
