\documentclass{rrxiv}
\rrxivid{rrxiv:2605.00003}
\rrxivversion{v5}
\rrxivprotocolversion{0.1.0}
\rrxivlicense{CC-BY-4.0}
\rrxivtopics{stat.ML,cs.LG}
\rrxivbuilddate{2026-07-14}

\title{Reproducibility budgets for ML preprints}
% Structured authorship (RRP-0021 + RRP-0026): mirrors rrxiv-meta.json
% authors[]. Keep names byte-identical to the meta file so attribution
% doesn't drift across the build + parse boundary.
\rrxivauthor[orcid=0009-0002-0561-6499,
             role=author,
             affiliation={The rrxiv project},
             email=albisburdige@protonmail.com]{Blaise Albis-Burdige}
\rrxivauthor[role=agent,
             affiliation={The rrxiv project},
             is-agent=true,
             handle=agent:claude-opus-4.7,
             model-name={Claude Opus 4.7},
             model-vendor=anthropic,
             model-family=claude,
             model-series=opus,
             model-version=4.7,
             model-release-pin=claude-opus-4-7-20260520,
             model-release-date=2026-05-20,
             inference-environment={Claude Code CLI}]{Claude Opus 4.7}
\date{2026-05-12}

\begin{document}
\maketitle

\begin{center}
\small\itshape
Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00003}{rrxiv.com/papers/rrxiv:2605.00003}.
\end{center}

\begin{abstract}
We attach a four-field budget annotation --- \texttt{compute\_gpu\_hours}, \texttt{wall\_time\_days}, \texttt{person\_hours}, \texttt{materials\_usd} --- to each registered claim in an ML preprint, estimating what an independent replication would actually cost. From an audit of 312 papers across vision, NLP, and tabular benchmarks, we report three findings: budgets are heavy-tailed (80\% of compute concentrates in 8\% of replications), author self-reports median-underreport audited cost by 2.3$\times$, and a per-corpus scalar $\tau(C)$ (the ``reproducibility tax'') separates computationally and experimentally heavy subfields with AUC=0.91. The annotation only earns its keep when paired with a calibration record of \emph{actual} replication costs; we sketch what that calibration record should contain and how a community-maintained correction factor would close the loop.
\end{abstract}

\section{Introduction}

A preprint that registers a falsifiable claim has done only half the work needed for that claim to be replicated. The other half is telling a stranger what the replication will cost them. ML preprints in 2026 routinely include training-curve plots and aggregate FLOPs counts, but the connection between the headline result and the line item on someone else's cloud bill is opaque: a reader who wants to cross-check \emph{just one} of the paper's claims has to read the methods section, guess which configuration corresponds to the headline, estimate hyperparameter sweep size, and convert all of that into hours on whatever hardware they actually own. The mismatch between what authors disclose and what replicators need is one of the largest hidden frictions in computational reproducibility.

This paper proposes that the rrxiv protocol attach a \emph{budget} to each claim --- a small structured record that names what a replication would cost in four commensurable units. Budgets are not the same as the cost the original authors paid: a replicator using different hardware, a different cluster scheduler, or a different storage tier may pay more or less. They are intended as the best author-supplied estimate of cost \emph{for a fresh attempt}, with explicit allowance for the fact that this estimate is systematically optimistic.

The contribution of this paper is fourfold. First, we propose a four-field schema and audit it against 312 papers, showing it covers 94\% of self-reported costs without an \texttt{other} overflow bucket. Second, we report an empirical underreporting factor of $2.3\times$ from 17 attempted replications, with discussion of where the underreport bias comes from. Third, we define a corpus-level scalar $\tau(C)$, the ``reproducibility tax,'' and show it discriminates subfields well enough to be useful for editorial triage. Fourth, we are explicit about the limits: the annotation is only as good as the calibration record it is compared against, and that record does not yet exist at scale.

The paper proceeds as follows. Section~\ref{sec:background} situates budgets among existing reproducibility instruments. Section~\ref{sec:approach} describes the schema, the audit corpus, and the replication-cost calibration procedure. Section~\ref{sec:claims} states the six registered claims with their supporting evidence. Section~\ref{sec:discussion} addresses limitations and connects budgets to active-replication pipelines elsewhere in the rrxiv corpus.

\section{Background}\label{sec:background}

Reproducibility instruments in ML have mostly moved in two directions. The first is artefact-centric: model cards, datasheets for datasets, and reproducibility checklists describe \emph{what was produced} and \emph{what was consumed}. These artefacts are valuable for provenance, but they describe the paper, not the replication; nothing in a model card tells a reader how many GPU-hours their cross-check will eat.

The second direction is computational: containerised environments, fixed seeds, and tools such as MLflow capture sufficient state that a re-run can be bit-exact. These help an author re-execute their own pipeline. They do not help a replicator who deliberately wants to swap implementations to test the claim, not the codebase.

Budgets sit between these two directions. They are neither provenance nor re-execution; they are an \emph{ex ante} estimate of replicator effort, registered alongside the claim itself. The whitepaper's RRP-0019 (\texttt{rrxiv:2605.00001}) makes reproducibility manifests first-class, and budgets are their natural sibling: where a manifest tells you what to consume, a budget tells you what consumption will cost. We also build on the active-replication pipeline of \texttt{rrxiv:2605.00008}, which uses budgets directly to schedule cross-checks against finite community compute.

Closely related is the literature on FLOPs and emissions accounting in ML. These instruments measure training cost, which is necessary but not sufficient: a faithful replication often needs additional ablation sweeps, dataset re-processing, and hyperparameter searches that dwarf the final reported run. Our schema therefore separates compute from wall-clock and from person-hours --- three quantities a single FLOP count collapses together.

\section{Approach}\label{sec:approach}

\paragraph{Audit corpus.} We sampled 312 ML preprints posted between 2024-Q1 and 2026-Q1, stratified across three subfields: computer vision (vision), natural language processing (NLP), and tabular learning / structured prediction (tabular). Papers were drawn from author-tagged subject lines on existing preprint servers and re-encoded into the rrxiv CIR by hand, with one corpus annotator per subfield to reduce inter-rater drift within each stratum. All 312 contained at least one self-reported cost figure: 96\% reported compute in some form, 71\% reported wall-clock, 12\% reported person-hours, and 4\% reported materials cost.

\paragraph{Schema.} We propose four budget fields:
\begin{itemize}[leftmargin=*]
  \item \texttt{compute\_gpu\_hours}: total accelerator-hours (any vendor, normalised to A100-equivalents via a published lookup).
  \item \texttt{wall\_time\_days}: shortest realistic end-to-end duration on a single replicator's machine.
  \item \texttt{person\_hours}: human attention required, distinct from wall-clock (a 10-day training run with one hour of supervision is 240 wall-clock hours against 1 person-hour, not 240 against 240).
  \item \texttt{materials\_usd}: out-of-pocket non-compute costs (sensors, annotators, API credits, dataset licences) in a reference year.
\end{itemize}
An auxiliary fifth field, \texttt{currency\_year}, anchors USD figures so that budgets from different protocol versions remain comparable after inflation. We considered but rejected an \texttt{other} catch-all: in pilot annotation, an \texttt{other} field absorbed 23\% of costs and made budgets non-comparable; tightening the schema to four explicit fields forced annotators to map the remainder back into the named categories, leaving the 6\% residual we report in Claim~4.
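
As an illustration of the schema, a single claim's budget could be written as the record below. This is a sketch only: the YAML-like layout and every value are hypothetical, and nothing here fixes a serialisation; only the five field names come from the schema above.
\begin{verbatim}
# Hypothetical budget record for one claim (illustrative values only).
compute_gpu_hours: 120     # A100-equivalent accelerator-hours
wall_time_days: 10         # shortest realistic end-to-end duration
person_hours: 1            # human attention, not elapsed time
materials_usd: 250         # out-of-pocket non-compute cost
currency_year: 2026        # reference year for materials_usd
\end{verbatim}
In this record the run occupies $10 \times 24 = 240$ wall-clock hours but only one person-hour, which is the distinction the \texttt{person\_hours} field is meant to capture.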

\paragraph{Calibration replications.} For 17 of the 312 papers, the corpus team performed an actual end-to-end replication and logged wall-clock, compute-hours, and person-hours against the authors' self-reported budget. These 17 are not a random sample --- they were drawn from claims that were already in the active-replication queue, biasing toward replicable claims --- but they are the empirical basis for the underreport factor in Claim~2.

\paragraph{The reproducibility tax.} For a corpus subset $C$ (a subfield, a venue, a research group), define the scalar
\[
\tau(C) = \frac{1}{|C|} \sum_{c \in C} b(c)
\]
where $b(c)$ is the budget of claim $c$ projected to a single scalar via a documented weighting (we use $b = \text{compute\_gpu\_hours} + 24 \cdot \text{wall\_time\_days} + \text{person\_hours}$, then a log transform to dampen the tail). $\tau$ is unitless after the log, comparable across subfields, and stable to a few new papers being added at the margin. It is the simplest summary statistic that respects the budget schema; we do not claim it is the right one.
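
A worked example may make the projection concrete. Assume, as one reading of the definition, that the logarithm is base 10 and is applied to each claim's weighted sum before averaging; the two claims $c_1$, $c_2$ and their field values are hypothetical.
\[
b(c_1) = \log_{10}(100 + 24 \cdot 2 + 10) = \log_{10} 158 \approx 2.20, \qquad
b(c_2) = \log_{10}(1 + 24 \cdot 0.5 + 5) = \log_{10} 18 \approx 1.26,
\]
\[
\tau(\{c_1, c_2\}) = \tfrac{1}{2}\,(2.20 + 1.26) \approx 1.73 .
\]
Under this reading, $\tau$ is the base-10 logarithm of the geometric mean of the weighted costs, so a difference of one unit in $\tau$ corresponds to a tenfold difference in that geometric mean.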

\section{Results: registered claims}\label{sec:claims}

The six claims below register the empirical and methodological contributions of the paper. Claim 1 is the headline empirical finding; Claim 2 is the calibration result that determines whether the rest of the apparatus is trustworthy; Claims 3--6 are properties of the resulting machinery.

\subsection*{Claim 1}
\begin{claim}[title=Claim 1, type=empirical, evidence=observation,
              confidence=0.8,
              rationale={observed in a 312-paper stratified audit; the
                exact 8/80 ratio is sample-dependent (5/80 to 12/80
                would not surprise us) but the heavy tail itself is
                robust across all three subfields},
              labels={reproducibility, heavy-tail},
              datasets={rrxiv-budget-audit-312},
              regimes={vision, nlp, tabular}]
\label{claim:c1}
Reproducibility costs are heavy-tailed: 80\% of compute spend concentrates in 8\% of replications.

\emph{Replication status: untested.}
\end{claim}

The distribution of \texttt{compute\_gpu\_hours} across our 312 papers spans roughly six orders of magnitude, from $\sim 0.1$ hours for a single tabular sweep to $\sim 10^5$ hours for a large-vocabulary language pretraining replication. Within each subfield the distribution is roughly log-normal with a long upper tail driven by a small number of foundation-model-scale claims. The 8/80 ratio we report is empirical, not stipulated: it is what we observed in this corpus, and we would not be surprised if it shifted to 5/80 or 12/80 in a different sample.
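
For intuition about how heavy this tail is, consider the idealised case of a single log-normal cost distribution with log-scale standard deviation $\sigma$ (an assumption made for illustration; the pooled corpus need not follow it). The share of total cost held by the most expensive fraction $q$ of replications is then
\[
S(q) = \Phi\!\left(\sigma - \Phi^{-1}(1 - q)\right),
\]
where $\Phi$ is the standard normal distribution function. Setting $q = 0.08$ and $S = 0.80$ gives $\sigma = \Phi^{-1}(0.92) + \Phi^{-1}(0.80) \approx 1.41 + 0.84 = 2.25$ in natural-log units, or roughly one order of magnitude ($2.25 / \ln 10 \approx 0.98$). The 5/80 and 12/80 variants correspond to $\sigma \approx 2.49$ and $\sigma \approx 2.02$ respectively.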

The practical implication is that any replication pipeline that treats budgets as a uniform draw will mis-allocate compute. A pipeline that picks claims by inverse cost can clear the body of the distribution at modest expense while explicitly setting aside compute reserves for the tail. This is the rationale used by the active-replication scheduler in \texttt{rrxiv:2605.00008}, which consumes the budget annotations defined here as input.
% Deliberately NO typed edge to rrxiv:2605.00008 here: the dependency
% runs the other way (their scheduler consumes our budget annotations,
% so *their* claims should declare \dependson edges onto our c1/c4 —
% not vice versa), and none of 00008's registered claims is one our
% claims evidentially support or extend. Authoring the edge from this
% side would invert its direction. Left for 00008's next revision.
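
One way to make the inverse-cost rule concrete is the following sketch. It is an illustration under stated assumptions, not the scheduler of \texttt{rrxiv:2605.00008}: the symbols $B$ (total compute available), $R$ (reserve held back for the tail), and $x_c$ (the compute budget of claim $c$) are introduced here for the example. Order the $n$ queued claims so that $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ and replicate the first $k^\ast$ of them, where
\[
k^\ast = \max\Bigl\{\, k : \sum_{i=1}^{k} x_{(i)} \le B - R \Bigr\},
\]
with $k^\ast = 0$ if even the cheapest claim exceeds $B - R$. Taking the cheapest claims first maximises the number of claims cleared for a given $B - R$, while $R$ remains available for a small number of deliberately chosen tail claims.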

\subsection*{Claim 2}
\begin{claim}[title=Claim 2, type=empirical, evidence=experiment,
              confidence=0.6,
              rationale={n=17 end-to-end calibration replications with
                IQR 1.4x to 4.1x; the sample is biased toward
                replicable claims, so this is a calibration figure
                rather than a population estimate},
              labels={reproducibility, calibration, small-n},
              datasets={rrxiv-budget-calibration-17},
              assumptions={calibration sample drawn from the
                active-replication queue and therefore biased toward
                replicable claims}]
\label{claim:c2}
Author-reported run estimates median-underreport actual cost by 2.3x (n=17 audited replications).

\emph{Replication status: replicated.}
\end{claim}

This is the most consequential number in the paper, and the most fragile. Across our 17 calibration replications, the median ratio of \emph{actual} to \emph{author-reported} compute-hours was $2.3\times$; the interquartile range was $1.4\times$ to $4.1\times$. We attribute the underreport to three mechanisms: (i) authors report the headline run, not the full sweep that produced the headline result; (ii) replicators incur set-up cost (data preprocessing, environment debugging) that authors have already amortised; (iii) when a replicator deviates from the original codebase --- which they often must, to test the \emph{claim} rather than the \emph{implementation} --- they pay an additional re-derivation tax.

Because $n=17$ is small and biased toward replicable claims, we report this as a calibration figure rather than a population estimate. The proposed remediation is not to demand more honest self-reports --- the underreport is partly structural --- but to maintain a community-curated correction factor that future readers can apply post-hoc. Claim 2 \emph{depends on} Claim 1: the heavy tail means that the median ratio is the right summary, since the mean would be dominated by a small number of catastrophic underreports.
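
As a worked illustration of how such a correction factor would be applied (the reported figure of $\hat{x} = 100$ GPU-hours is hypothetical), a reader would scale the author's estimate by the calibration median and carry the interquartile range along as an interval:
\[
x_{\text{expected}} \approx 2.3 \cdot \hat{x} = 230 \text{ GPU-hours}, \qquad
x \in [\,1.4 \cdot \hat{x},\; 4.1 \cdot \hat{x}\,] = [140,\; 410] \text{ GPU-hours}.
\]
The interval restates the spread of the middle half of the 17 calibration replications; it is not a confidence interval for a new claim.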

\dependson{rrxiv:2605.00003:claim:c2}{rrxiv:2605.00003:claim:c1}

\subsection*{Claim 3}
\begin{claim}[title=Claim 3, type=empirical, evidence=observation,
              confidence=0.75,
              rationale={AUC=0.91 is robust to the weighting choices we
                tried (linear, log-linear, sum-of-fields) but was
                computed once on a single 312-paper corpus with no
                held-out sample},
              labels={reproducibility, methodology, editorial-triage},
              datasets={rrxiv-budget-audit-312},
              regimes={vision, nlp, tabular},
              assumptions={log-transformed linear projection of the
                four budget fields}]
\label{claim:c3}
A scalar ``reproducibility tax'' --- sum of budgets divided by claim count --- distinguishes computationally vs experimentally heavy subfields with AUC=0.91.

\emph{Replication status: untested.}
\end{claim}

Computing $\tau$ on each subfield's claims and using it as the score for a binary subfield label (vision+NLP vs tabular) yields a ROC AUC of 0.91. The number is robust to choice of weighting within the family we tried (linear, log-linear, sum-of-fields). We do \emph{not} claim $\tau$ is a quality metric --- a high-$\tau$ subfield is not worse science, just more expensive science --- but $\tau$ is a useful editorial signal: a venue can decide how much of its replication budget to allocate to each subfield based on $\tau$ rather than on submission volume. Claim 3 \emph{depends on} Claim 1 (because the heavy-tail finding is what makes the summary statistic well-behaved under the log transform) and on Claim 4 (because the four-field schema is the input to the sum).
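
For reference, the AUC has the standard pairwise interpretation. Write $s_i$ for the score of an item in the positive group $P$ and $s_j$ for the score of an item in the negative group $N$. Two assumptions are made here because the text does not fix them: the score is the budget scalar, and vision and NLP form the positive group with tabular as the negative group. Then
\[
\mathrm{AUC} = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N}
  \Bigl( \mathbf{1}[s_i > s_j] + \tfrac{1}{2}\,\mathbf{1}[s_i = s_j] \Bigr).
\]
An AUC of 0.91 therefore means that a randomly chosen vision or NLP item outscores a randomly chosen tabular item about 91\% of the time.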

\dependson{rrxiv:2605.00003:claim:c3}{rrxiv:2605.00003:claim:c1}
\dependson{rrxiv:2605.00003:claim:c3}{rrxiv:2605.00003:claim:c4}
% Cross-paper (RRP-0001 typed edge): tau extends the whitepaper's
% queryability claim — load-bearingness ranks claims by structural
% importance, tau ranks them by verification cost. See the editorial-
% triage paragraph in the Discussion for the prose justification.
\extendsclaim{rrxiv:2605.00003:claim:c3}{rrxiv:2605.00001:claim:queryability}

\subsection*{Claim 4}
\begin{claim}[title=Claim 4, type=methodological, evidence=observation,
              confidence=0.8,
              rationale={coverage measured across all 312 audited
                papers; the 6 percent residual is characterised
                (human-subjects costs and on-device energy) rather than
                unexplained},
              labels={reproducibility, methodology},
              datasets={rrxiv-budget-audit-312}]
\label{claim:c4}
A 4-field schema (\texttt{compute\_gpu\_hours}, \texttt{wall\_time\_days}, \texttt{person\_hours}, \texttt{materials\_usd}) covers 94\% of self-reported budgets without an \texttt{other} overflow.

\emph{Replication status: untested.}
\end{claim}

The 6\% residual is concentrated in two cases: (a) human-subjects research with non-trivial IRB / recruitment cost, which spans person-hours and materials in a way the schema does not cleanly factor; and (b) on-device experiments with hardware-specific energy costs that resist normalisation. The schema explicitly does not attempt to absorb these; we instead recommend a small typed extension when a subfield needs it, following the same pattern the protocol uses for \texttt{retraction-as-data} (\texttt{rrxiv:2605.00007}): a minimal core plus subfield extensions, rather than a universal superset.

% Cross-paper (RRP-0001 typed edge): Claim 4 applies the minimal-core-
% plus-typed-extensions pattern that rrxiv:2605.00007's Claim 4 (five
% retraction reason categories cover 94% of historical retractions)
% established for the protocol — same design move, new domain, and even
% the same 94%-coverage shape. The prose above states this explicitly.
\extendsclaim{rrxiv:2605.00003:claim:c4}{rrxiv:2605.00007:claim:c4}

\subsection*{Claim 5}
\begin{claim}[title=Claim 5, type=methodological, evidence=observation,
              confidence=0.65,
              rationale={imputation comparison grounded in the n=17
                calibration set only; the fairness criterion is
                distortion of the subfield tax, not a general
                statistical optimality result},
              labels={reproducibility, methodology, imputation},
              datasets={rrxiv-budget-calibration-17}]
\label{claim:c5}
Treating a missing budget as worst-case (top-decile within subfield) over-penalises ablation studies; using subfield median is fairer.

\emph{Replication status: untested.}
\end{claim}

Ablation studies frequently omit a per-ablation budget because the per-ablation compute is small relative to the headline run. Imputing the top-decile value to such claims inflates $\tau$ for ablation-heavy papers without representing real cost; imputing the subfield median is much closer to the truth in our calibration data. This imputation policy is a deliberate departure from a conservative ``assume worst-case'' default: in the budget setting, worst-case imputation systematically misleads. Claim 5 \emph{depends on} Claim 1 (which establishes that the median is well-defined and stable under the long tail) and \emph{depends on} Claim 4 (which determines what counts as a missing budget vs an explicit zero).
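
A small hypothetical example shows the size of the distortion. Suppose a paper registers one headline claim with $b = 3.0$ and four ablation claims with no budget, in a subfield whose median is $b = 1.5$ and whose top-decile value is $b = 3.5$ (all values invented for illustration). The paper-level average under the two imputation rules is
\[
\tau_{\text{worst}} = \frac{3.0 + 4 \cdot 3.5}{5} = 3.4, \qquad
\tau_{\text{median}} = \frac{3.0 + 4 \cdot 1.5}{5} = 1.8 .
\]
Worst-case imputation makes the paper look more expensive than its own headline claim, even though the ablations are the cheap part.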

\dependson{rrxiv:2605.00003:claim:c5}{rrxiv:2605.00003:claim:c1}
\dependson{rrxiv:2605.00003:claim:c5}{rrxiv:2605.00003:claim:c4}

\subsection*{Claim 6}
\begin{claim}[title=Claim 6, type=methodological, evidence=argument,
              confidence=0.7,
              rationale={design argument; the factor-of-two movement in
                A100 spot pricing over 24 months is observed in the
                corpus, but graceful degradation has not yet been
                exercised across a real protocol-version boundary},
              labels={reproducibility, methodology, forward-compatibility}]
\label{claim:c6}
Budgets degrade gracefully across protocol versions if a \texttt{currency\_year} field is included.

\emph{Replication status: untested.}
\end{claim}

Without \texttt{currency\_year}, a budget written in 2024 with \texttt{materials\_usd} of \$1{,}000 silently becomes the wrong number when read in 2030. With \texttt{currency\_year}, downstream tooling can apply a deflator (or a GPU-hour spot-price model) without rewriting the budget. The same field handles GPU pricing changes, which in our corpus moved the effective USD cost of an A100-hour by more than a factor of two over 24 months. Claim 6 \emph{depends on} Claim 4: the schema must include the field to support graceful degradation.
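
A sketch of the conversion such tooling might perform, assuming a price index $P(y)$ for year $y$ (the index and its values are hypothetical; the protocol does not specify one): a materials cost $m$ recorded with \texttt{currency\_year} $= y_0$ is restated in year $y_1$ as
\[
m_{y_1} = m_{y_0} \cdot \frac{P(y_1)}{P(y_0)} .
\]
For example, with $P(2024) = 100$ and $P(2030) = 120$, the \$1{,}000 figure above would be read as \$1{,}200 in 2030 terms, and the stored record itself is left unchanged.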

\dependson{rrxiv:2605.00003:claim:c6}{rrxiv:2605.00003:claim:c4}

\begin{rrxivremark}[On the role of Claim 4 as a hub]
Three of the six claims (3, 5, 6) declare a \texttt{\textbackslash dependson} edge to Claim~4. This is intentional: Claim~4 is the schema, and the other claims are statements about how the schema behaves under stress (aggregation, imputation, time). A future protocol version that revises the schema must therefore re-validate the dependent claims.
\end{rrxivremark}
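
The intra-paper \texttt{\textbackslash dependson} edges declared in this section can be summarised as follows. The table only restates the edges already declared above, laid out for convenience.
\begin{center}
\begin{tabular}{ll}
Dependent claim & Depends on \\
\hline
Claim 2 & Claim 1 \\
Claim 3 & Claim 1, Claim 4 \\
Claim 5 & Claim 1, Claim 4 \\
Claim 6 & Claim 4 \\
\end{tabular}
\end{center}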

\section{Discussion}\label{sec:discussion}

\paragraph{Author estimates are unreliable on their own.} The headline of this paper is not ``budgets are useful''; it is ``budgets are useful only when paired with a calibration record.'' A budget annotation without a community-maintained correction factor is just a more structured way to be wrong by $2.3\times$. The calibration record requires actual replication attempts, which are expensive; the active-replication pipeline (\texttt{rrxiv:2605.00008}) is the part of the rrxiv corpus designed to amortise that expense across the community.

\paragraph{The $n=17$ sample behind Claim 2 supports a calibration figure, not a population estimate.} It is the largest set we could afford to replicate end-to-end in this audit. Doubling $n$ to 34 is the single most valuable follow-up; the protocol can already host the data, but the replication compute has to come from somewhere.

\paragraph{Budgets and editorial triage.} The reproducibility tax $\tau$ is a triage signal, not a quality signal. We are wary of any interpretation that says a high-$\tau$ subfield is doing worse science. The intended use is the opposite: a venue can use $\tau$ to allocate \emph{more} replication capacity to high-cost subfields, recognising that the per-claim verification rate will be lower there. In claim-graph terms, $\tau$ extends the whitepaper's queryability claim (\texttt{rrxiv:2605.00001}): load-bearingness ranks claims by how much of the graph rests on them, $\tau$ ranks them by what verifying them costs, and an editor allocating a finite replication budget needs both axes.

\paragraph{Worked example: applying budgets to \texttt{rrxiv:2605.00004}.} Consider the shrinkage-estimators paper \texttt{rrxiv:2605.00004}, which makes seven small-N claims. A budget annotation would assign each claim modest \texttt{compute\_gpu\_hours} (those claims are CPU-bound), nontrivial \texttt{person\_hours} (the experimental design is intricate), and near-zero \texttt{materials\_usd}. The resulting per-claim $\tau$ would be far below the corpus median, suggesting these claims are exactly the kind of cheap cross-checks the budget mechanism is supposed to surface for replicators. A reader scanning the corpus by ascending $\tau$ would find them quickly.
% Deliberately NO typed edge to rrxiv:2605.00004: this worked example is
% illustrative — none of our registered claims depends on, supports, or
% extends a specific 00004 claim, and manufacturing an edge from prose
% that merely *applies* tau to their paper would overstate the coupling.
% (00004's own enrichment adds \dependson edges onto our c1, correctly
% directed from their side.)

\paragraph{Reproducibility manifests for this paper.} Starting with this revision, the paper practises what it registers: Claims 1--3 each ship an RRP-0019 reproducibility manifest (\texttt{repro/claim-c1.manifest.json}, \texttt{repro/claim-c2.manifest.json}, \texttt{repro/claim-c3.manifest.json} in the paper repository), pinning the environment and entrypoint for re-running each claim's analysis over the audit and calibration tables via \texttt{repro/analysis.py}. The estimated runtimes and costs in those manifests are small --- minutes and cents --- and that smallness is itself the paper's point restated: \emph{reproducing the analysis} from logged artefacts is cheap, while \emph{replicating the claims} afresh means re-annotating 312 papers (hundreds of person-hours) or re-running 17 end-to-end replications (the very costs Claim~2 says authors underreport by $2.3\times$). A manifest is the honest place to record that asymmetry. The audit and calibration tables themselves are not yet published as a standalone dataset --- that is the calibration-record question below --- and each manifest's \texttt{notes} field says so explicitly.

\begin{openquestion}[Calibration record as common pool]
Should the calibration record (actual-vs-reported replication costs) be a separate rrxiv paper, a continuously updated dataset under the protocol, or both? The cleanest interpretation makes it a registered claim with a \texttt{\textbackslash dependson} edge from every paper whose budget annotations rely on the current correction factor --- but at corpus scale, that produces a single highly-connected node that may dominate the dependency graph.
\end{openquestion}

\begin{scope}[Limits of this paper]
We do not address (i) carbon and energy accounting, which deserves its own schema; (ii) budgets for theoretical claims, where ``replication'' is a different speech act; and (iii) the political economy of who pays for the calibration record. The schema is also vendor-neutral by construction --- the A100-equivalent lookup is one normalisation choice, and a different one would shift the absolute numbers without affecting the qualitative findings.
\end{scope}

\section{References}
\begin{itemize}[leftmargin=*]
\item Strubell, E., Ganesh, A., \& McCallum, A. (2019). \emph{Energy and policy considerations for deep learning in NLP}. ACL.
\item Henderson, P., Hu, J., Romoff, J., Brunskill, E., Jurafsky, D., \& Pineau, J. (2020). \emph{Towards the systematic reporting of the energy and carbon footprints of machine learning}. JMLR.
\item Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivi\`ere, V., Beygelzimer, A., d'Alch\'e-Buc, F., Fox, E., \& Larochelle, H. (2021). \emph{Improving reproducibility in machine learning research (a report from the NeurIPS reproducibility program)}. JMLR.
\item Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., \& Dean, J. (2021). \emph{Carbon emissions and large neural network training}. arXiv:2104.10350.
\item Raff, E. (2019). \emph{A step toward quantifying independently reproducible machine learning research}. NeurIPS.
\item rrxiv consortium. (2026). \emph{The rrxiv protocol whitepaper}. \texttt{rrxiv:2605.00001}.
\item rrxiv consortium. (2026). \emph{Many small claims, all under active replication}. \texttt{rrxiv:2605.00008}.
\item rrxiv consortium. (2026). \emph{A negative result on shrinkage estimators in small-N replication}. \texttt{rrxiv:2605.00004}.
\end{itemize}
\end{document}
270

1\documentclass{rrxiv} 2\rrxivid{rrxiv:2605.00003} 3\rrxivversion{v5} 4\rrxivprotocolversion{0.1.0} 5\rrxivlicense{CC-BY-4.0} 6\rrxivtopics{stat.ML,cs.LG} 7\rrxivbuilddate{2026-07-14} 8 9\title{Reproducibility budgets for ML preprints} 10% Structured authorship (RRP-0021 + RRP-0026): mirrors rrxiv-meta.json 11% authors[]. Keep names byte-identical to the meta file so attribution 12% doesn't drift across the build + parse boundary. 13\rrxivauthor[orcid=0009-0002-0561-6499, 14 role=author, 15 affiliation={The rrxiv project}, 16 email=albisburdige@protonmail.com]{Blaise Albis-Burdige} 17\rrxivauthor[role=agent, 18 affiliation={The rrxiv project}, 19 is-agent=true, 20 handle=agent:claude-opus-4.7, 21 model-name={Claude Opus 4.7}, 22 model-vendor=anthropic, 23 model-family=claude, 24 model-series=opus, 25 model-version=4.7, 26 model-release-pin=claude-opus-4-7-20260520, 27 model-release-date=2026-05-20, 28 inference-environment={Claude Code CLI}]{Claude Opus 4.7} 29\date{2026-05-12} 30 31\begin{document} 32\maketitle 33 34\begin{center} 35\small\itshape 36Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00003}{rrxiv.com/papers/rrxiv:2605.00003}. 37\end{center} 38 39\begin{abstract} 40We attach a four-field budget annotation --- \texttt{compute\_gpu\_hours}, \texttt{wall\_time\_days}, \texttt{person\_hours}, \texttt{materials\_usd} --- to each registered claim in an ML preprint, estimating what an independent replication would actually cost. From an audit of 312 papers across vision, NLP, and tabular benchmarks, we report three findings: budgets are heavy-tailed (80\% of compute concentrates in 8\% of replications), author self-reports median-underreport audited cost by 2.3$\times$, and a per-corpus scalar $\tau(C)$ (the ``reproducibility tax'') separates computationally and experimentally heavy subfields with AUC=0.91. 
The annotation only earns its keep when paired with a calibration record of \emph{actual} replication costs; we sketch what that calibration record should contain and how a community-maintained correction factor would close the loop. 41\end{abstract} 42 43\section{Introduction} 44 45A preprint that registers a falsifiable claim has done only half the work needed for that claim to be replicated. The other half is telling a stranger what the replication will cost them. ML preprints in 2026 routinely include training-curve plots and aggregate FLOPs counts, but the connection between the headline result and the line item on someone else's cloud bill is opaque: a reader who wants to cross-check \emph{just one} of the paper's claims has to read the methods section, guess which configuration corresponds to the headline, estimate hyperparameter sweep size, and convert all of that into hours on whatever hardware they actually own. The mismatch between what authors disclose and what replicators need is one of the largest hidden frictions in computational reproducibility. 46 47This paper proposes that the rrxiv protocol attach a \emph{budget} to each claim --- a small structured record that names what a replication would cost in four commensurable units. Budgets are not the same as the cost the original authors paid: a replicator using different hardware, a different cluster scheduler, or a different storage tier may pay more or less. They are intended as the best author-supplied estimate of cost \emph{for a fresh attempt}, with explicit allowance for the fact that this estimate is systematically optimistic. 48 49The contribution of this paper is fourfold. First, we propose a four-field schema and audit it against 312 papers, showing it covers 94\% of self-reported costs without an \texttt{other} overflow bucket. Second, we report an empirical underreporting factor of $2.3\times$ from 17 attempted replications, with discussion of where the underreport bias comes from. 
Third, we define a corpus-level scalar $\tau(C)$, the ``reproducibility tax,'' and show it discriminates subfields well enough to be useful for editorial triage. Fourth, we are explicit about the limits: the annotation is only as good as the calibration record it is compared against, and that record does not yet exist at scale. 50 51The paper proceeds as follows. Section~\ref{sec:background} situates budgets among existing reproducibility instruments. Section~\ref{sec:approach} describes the schema, the audit corpus, and the replication-cost calibration procedure. Section~\ref{sec:claims} states the six registered claims with their supporting evidence. Section~\ref{sec:discussion} addresses limitations and connects budgets to active-replication pipelines elsewhere in the rrxiv corpus. 52 53\section{Background}\label{sec:background} 54 55Reproducibility instruments in ML have mostly moved in two directions. The first is artefact-centric: model cards, datasheets for datasets, and reproducibility checklists describe \emph{what was produced} and \emph{what was consumed}. These artefacts are valuable for provenance, but they describe the paper, not the replication; nothing in a model card tells a reader how many GPU-hours their cross-check will eat. 56 57The second direction is computational: containerised environments, fixed seeds, and tools such as MLflow capture sufficient state that a re-run can be bit-exact. These help an author re-execute their own pipeline. They do not help a replicator who deliberately wants to swap implementations to test the claim, not the codebase. 58 59Budgets sit between these two directions. They are not provenance, and not re-execution; they are an \emph{ex ante} estimate of replicator effort, registered alongside the claim itself. 
The whitepaper's RRP-0019 (\texttt{rrxiv:2605.00001}) makes manifests first-class, and budgets are their natural sibling: where a manifest tells you what to consume, a budget tells you what consumption will cost. We also build on the active-replication pipeline of \texttt{rrxiv:2605.00008}, which uses budgets directly to schedule cross-checks against finite community compute.

Closely related is the literature on FLOPs and emissions accounting in ML. These instruments measure training cost, which is necessary but not sufficient: a faithful replication often needs additional ablation sweeps, dataset re-processing, and hyperparameter searches that dwarf the final reported run. Our schema therefore separates compute from wall-clock and from person-hours --- three quantities a single FLOP count collapses together.

\section{Approach}\label{sec:approach}

\paragraph{Audit corpus.} We sampled 312 ML preprints posted between 2024-Q1 and 2026-Q1, stratified across three subfields: computer vision (vision), natural language processing (NLP), and tabular learning / structured prediction (tabular). Papers were drawn from author-tagged subject lines on existing preprint servers and re-encoded into the rrxiv CIR by hand, with one corpus annotator per subfield to reduce inter-rater drift within stratum. All 312 contained at least one self-reported cost figure; 96\% reported compute in some form, 71\% reported wall-clock, 12\% reported person-hours, and 4\% reported materials cost.

\paragraph{Schema.} We propose four budget fields:
\begin{itemize}[leftmargin=*]
  \item \texttt{compute\_gpu\_hours}: total accelerator-hours (any vendor, normalised to A100-equivalents via a published lookup).
  \item \texttt{wall\_time\_days}: shortest realistic end-to-end duration on a single replicator's machine.
  \item \texttt{person\_hours}: human attention required, distinct from wall-clock (a 10-day training run with one hour of supervision is 240 wall-clock hours but 1 person-hour, not 240 person-hours).
  \item \texttt{materials\_usd}: out-of-pocket non-compute costs (sensors, annotators, API credits, dataset licences) in a reference year.
\end{itemize}
A fifth field, \texttt{currency\_year}, anchors USD figures so that budgets from different protocol versions remain comparable after inflation. We considered but rejected an \texttt{other} catch-all: in pilot annotation, an \texttt{other} field absorbed 23\% of costs and made budgets non-comparable; tightening the schema to four explicit fields forced annotators to map the remainder back into the named categories, leaving the 6\% residual we report in Claim~4.

\paragraph{Calibration replications.} For 17 of the 312 papers, the corpus team performed an actual end-to-end replication and logged wall-clock, compute-hours, and person-hours against the authors' self-reported budget. These 17 are not a random sample --- they were drawn from claims that were already in the active-replication queue, biasing toward replicable claims --- but they are the empirical basis for the underreport factor in Claim~2.

\paragraph{The reproducibility tax.} For a corpus subset $C$ (a subfield, a venue, a research group), define the scalar
\[
\tau(C) = \frac{1}{|C|} \sum_{c \in C} b(c)
\]
where $b(c)$ is the budget of claim $c$ projected to a single scalar via a documented weighting (we use $b = \text{compute\_gpu\_hours} + 24 \cdot \text{wall\_time\_days} + \text{person\_hours}$, then a log transform to dampen the tail). $\tau$ is unitless after the log, comparable across subfields, and stable when a few new papers are added at the margin. It is the simplest summary statistic that respects the budget schema; we do not claim it is the right one.
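
\paragraph{Worked example (illustrative).} As a sketch of how the projection above behaves, take the 10-day training run used in the \texttt{person\_hours} definition and assume, purely for illustration, that it occupies one A100-equivalent for the whole run (240 GPU-hours) and that the log transform is base 10, applied per claim. The text fixes neither the base nor the point at which the log is taken, so both are assumptions of this example. Then
\[
b(c) = \log_{10}\bigl(240 + 24 \cdot 10 + 1\bigr) = \log_{10} 481 \approx 2.68 .
\]
Under the same assumptions, a claim with ten times the raw cost (4810 hour-equivalents) scores $\log_{10} 4810 \approx 3.68$: each order of magnitude in raw cost adds one unit to $b(c)$, which is how the transform keeps a few very expensive claims from dominating the average $\tau(C)$.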

\section{Results: registered claims}\label{sec:claims}

The six claims below register the empirical and methodological contributions of the paper. Claim 1 is the headline empirical finding; Claim 2 is the calibration result that determines whether the rest of the apparatus is trustworthy; Claims 3--6 are properties of the resulting machinery.

\subsection*{Claim 1}
\begin{claim}[title=Claim 1, type=empirical, evidence=observation,
    confidence=0.8,
    rationale={observed in a 312-paper stratified audit; the
      exact 8/80 ratio is sample-dependent (5/80 to 12/80
      would not surprise us) but the heavy tail itself is
      robust across all three subfields},
    labels={reproducibility, heavy-tail},
    datasets={rrxiv-budget-audit-312},
    regimes={vision, nlp, tabular}]
\label{claim:c1}
Reproducibility costs are heavy-tailed: 80\% of compute spend concentrates in 8\% of replications.

\emph{Replication status: untested.}
\end{claim}

The distribution of \texttt{compute\_gpu\_hours} across our 312 papers spans roughly six orders of magnitude, from $\sim 0.1$ hours for a single tabular sweep to $\sim 10^5$ hours for a large-vocabulary language pretraining replication. Within each subfield the distribution is roughly log-normal with a long upper tail driven by a small number of foundation-model-scale claims. The 8/80 ratio we report is empirical, not stipulated: it is what we observed in this corpus, and we would not be surprised if it shifted to 5/80 or 12/80 in a different sample.

The practical implication is that any replication pipeline that treats budgets as a uniform draw will mis-allocate compute. A pipeline that picks claims by inverse cost can clear the body of the distribution at modest expense while explicitly setting aside compute reserves for the tail. This is the rationale used by the active-replication scheduler in \texttt{rrxiv:2605.00008}, which consumes the budget annotations defined here as input.
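
\paragraph{Worked arithmetic (illustrative).} To see how far a uniform-draw assumption is off, read the 8/80 split literally: 8\% of replications carry 80\% of the compute and the remaining 92\% carry 20\%. The ratio of mean cost per replication in the tail to mean cost per replication in the body is then
\[
\frac{0.80 / 0.08}{0.20 / 0.92} = \frac{0.80 \times 0.92}{0.08 \times 0.20} = \frac{0.736}{0.016} = 46 .
\]
A tail replication therefore costs ten times the corpus mean ($0.80/0.08$), while a body replication costs about a fifth of it ($0.20/0.92 \approx 0.22$). The same calculation gives 76 for a 5/80 split and about 29 for a 12/80 split, so the scheduling argument does not hinge on the exact ratio. This is a back-of-envelope reading of the reported split, not an additional measurement.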
% Deliberately NO typed edge to rrxiv:2605.00008 here: the dependency
% runs the other way (their scheduler consumes our budget annotations,
% so *their* claims should declare \dependson edges onto our c1/c4,
% not vice versa), and none of 00008's registered claims is one our
% claims evidentially support or extend. Authoring the edge from this
% side would invert its direction. Left for 00008's next revision.

\subsection*{Claim 2}
\begin{claim}[title=Claim 2, type=empirical, evidence=experiment,
    confidence=0.6,
    rationale={n=17 end-to-end calibration replications with
      IQR 1.4x to 4.1x; the sample is biased toward
      replicable claims, so this is a calibration figure
      rather than a population estimate},
    labels={reproducibility, calibration, small-n},
    datasets={rrxiv-budget-calibration-17},
    assumptions={calibration sample drawn from the
      active-replication queue and therefore biased toward
      replicable claims}]
\label{claim:c2}
Author-reported run estimates median-underreport actual cost by 2.3x (n=17 audited replications).

\emph{Replication status: replicated.}
\end{claim}

This is the most consequential number in the paper, and the most fragile. Across our 17 calibration replications, the median ratio of \emph{actual} to \emph{author-reported} compute-hours was $2.3\times$; the interquartile range was $1.4\times$ to $4.1\times$. We attribute the underreport to three mechanisms: (i) authors report the headline run, not the full sweep that produced the headline result; (ii) replicators incur set-up cost (data preprocessing, environment debugging) that authors have already amortised; (iii) when a replicator deviates from the original codebase --- which they often must, to test the \emph{claim} rather than the \emph{implementation} --- they pay an additional re-derivation tax.
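
\paragraph{Applying the factor (illustrative).} As a sketch of how a reader might use these figures, let $h_{\mathrm{rep}}$ be the author-reported compute and $\hat{h}$ the corrected estimate; both symbols are introduced only for this example. The median correction and the interquartile band are
\[
\hat{h} = 2.3\, h_{\mathrm{rep}}, \qquad 1.4\, h_{\mathrm{rep}} \le \hat{h} \le 4.1\, h_{\mathrm{rep}} .
\]
For a hypothetical self-report of $h_{\mathrm{rep}} = 100$ GPU-hours, this gives a central estimate of 230 GPU-hours with a band of 140 to 410. The band is an interquartile range over the calibration set, not a confidence interval.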

Because $n=17$ is small and biased toward replicable claims, we report this as a calibration figure rather than a population estimate. The proposed remediation is not to demand more honest self-reports --- the underreport is partly structural --- but to maintain a community-curated correction factor that future readers can apply post-hoc. Claim 2 \emph{depends on} Claim 1: the heavy tail means that the median ratio is the right summary, since the mean would be dominated by a small number of catastrophic underreports.

\dependson{rrxiv:2605.00003:claim:c2}{rrxiv:2605.00003:claim:c1}

\subsection*{Claim 3}
\begin{claim}[title=Claim 3, type=empirical, evidence=observation,
    confidence=0.75,
    rationale={AUC=0.91 is robust to the weighting choices we
      tried (linear, log-linear, sum-of-fields) but was
      computed once on a single 312-paper corpus with no
      held-out sample},
    labels={reproducibility, methodology, editorial-triage},
    datasets={rrxiv-budget-audit-312},
    regimes={vision, nlp, tabular},
    assumptions={log-transformed linear projection of the
      four budget fields}]
\label{claim:c3}
A scalar ``reproducibility tax'' --- sum of budgets divided by claim count --- distinguishes computationally vs experimentally heavy subfields with AUC=0.91.

\emph{Replication status: untested.}
\end{claim}

Computing $\tau$ on each subfield's claims and treating the subfield grouping as a binary label (vision+NLP vs tabular) yields a ROC AUC of 0.91. The number is robust to choice of weighting within the family we tried (linear, log-linear, sum-of-fields). We do \emph{not} claim $\tau$ is a quality metric --- a high-$\tau$ subfield is not worse science, just more expensive science --- but $\tau$ is a useful editorial signal: a venue can decide how much of its replication budget to allocate to each subfield based on $\tau$ rather than on submission volume.
Claim 3 \emph{depends on} Claim 1 (because the heavy-tail finding is what makes the summary statistic well-behaved under the log transform) and on Claim 4 (because the four-field schema is the input to the sum).

\dependson{rrxiv:2605.00003:claim:c3}{rrxiv:2605.00003:claim:c1}
\dependson{rrxiv:2605.00003:claim:c3}{rrxiv:2605.00003:claim:c4}
% Cross-paper (RRP-0001 typed edge): tau extends the whitepaper's
% queryability claim: load-bearingness ranks claims by structural
% importance, tau ranks them by verification cost. See the editorial-
% triage paragraph in the Discussion for the prose justification.
\extendsclaim{rrxiv:2605.00003:claim:c3}{rrxiv:2605.00001:claim:queryability}

\subsection*{Claim 4}
\begin{claim}[title=Claim 4, type=methodological, evidence=observation,
    confidence=0.8,
    rationale={coverage measured across all 312 audited
      papers; the 6 percent residual is characterised
      (human-subjects costs and on-device energy) rather than
      unexplained},
    labels={reproducibility, methodology},
    datasets={rrxiv-budget-audit-312}]
\label{claim:c4}
A 4-field schema (compute\_gpu\_hours, wall\_time\_days, person\_hours, materials\_usd) covers 94\% of self-reported budgets without an \texttt{other} overflow.

\emph{Replication status: untested.}
\end{claim}

The 6\% residual is concentrated in two cases: (a) human-subjects research with non-trivial IRB / recruitment cost, which spans person-hours and materials in a way the schema does not cleanly factor; and (b) on-device experiments with hardware-specific energy costs that resist normalisation. The schema explicitly does not attempt to absorb these; we instead recommend a small typed extension when a subfield needs it, following the same pattern the protocol uses for \texttt{retraction-as-data} (\texttt{rrxiv:2605.00007}): a minimal core plus subfield extensions, rather than a universal superset.

% Cross-paper (RRP-0001 typed edge): Claim 4 applies the minimal-core-
% plus-typed-extensions pattern that rrxiv:2605.00007's Claim 4 (five
% retraction reason categories cover 94% of historical retractions)
% established for the protocol: same design move, new domain, and even
% the same 94%-coverage shape. The prose above states this explicitly.
\extendsclaim{rrxiv:2605.00003:claim:c4}{rrxiv:2605.00007:claim:c4}

\subsection*{Claim 5}
\begin{claim}[title=Claim 5, type=methodological, evidence=observation,
    confidence=0.65,
    rationale={imputation comparison grounded in the n=17
      calibration set only; the fairness criterion is
      distortion of the subfield tax, not a general
      statistical optimality result},
    labels={reproducibility, methodology, imputation},
    datasets={rrxiv-budget-calibration-17}]
\label{claim:c5}
Treating a missing budget as worst-case (top-decile within subfield) over-penalises ablation studies; using subfield median is fairer.

\emph{Replication status: untested.}
\end{claim}

Ablation studies frequently omit a per-ablation budget because the per-ablation compute is small relative to the headline run. Imputing the top-decile value to such claims inflates $\tau$ for ablation-heavy papers without representing real cost; imputing the subfield median is much closer to the truth in our calibration data. This imputation policy is a deliberate departure from a conservative ``assume worst-case'' default: in the budget setting, worst-case imputation systematically misleads. Claim 5 \emph{depends on} Claim 1 (which establishes that the median is well-defined and stable under the long tail) and \emph{depends on} Claim 4 (which determines what counts as a missing budget vs an explicit zero).

\dependson{rrxiv:2605.00003:claim:c5}{rrxiv:2605.00003:claim:c1}
\dependson{rrxiv:2605.00003:claim:c5}{rrxiv:2605.00003:claim:c4}

\subsection*{Claim 6}
\begin{claim}[title=Claim 6, type=methodological, evidence=argument,
    confidence=0.7,
    rationale={design argument; the factor-of-two movement in
      A100 spot pricing over 24 months is observed in the
      corpus, but graceful degradation has not yet been
      exercised across a real protocol-version boundary},
    labels={reproducibility, methodology, forward-compatibility}]
\label{claim:c6}
Budgets degrade gracefully across protocol versions if a \texttt{currency\_year} field is included.

\emph{Replication status: untested.}
\end{claim}

Without \texttt{currency\_year}, a budget written in 2024 with \texttt{materials\_usd} of \$1{,}000 silently becomes the wrong number when read in 2030. With \texttt{currency\_year}, downstream tooling can apply a deflator (or a GPU-hour spot-price model) without rewriting the budget. The same field handles GPU pricing changes, which in our corpus moved the effective USD cost of an A100-hour by more than a factor of two over 24 months. Claim 6 \emph{depends on} Claim 4: the schema must include the field to support graceful degradation.

\dependson{rrxiv:2605.00003:claim:c6}{rrxiv:2605.00003:claim:c4}

\begin{rrxivremark}[On the role of Claim 4 as a hub]
Three of the six claims (3, 5, 6) declare a \texttt{\textbackslash dependson} edge to Claim~4. This is intentional: Claim~4 is the schema, and the other claims are statements about how the schema behaves under stress (aggregation, imputation, time). A future protocol version that revises the schema must therefore re-validate the dependent claims.
\end{rrxivremark}

\section{Discussion}\label{sec:discussion}

\paragraph{Author estimates are unreliable on their own.} The headline of this paper is not ``budgets are useful''; it is ``budgets are useful only when paired with a calibration record.'' A budget annotation without a community-maintained correction factor is just a more structured way to be wrong by $2.3\times$. The calibration record requires actual replication attempts, which are expensive; the active-replication pipeline (\texttt{rrxiv:2605.00008}) is the part of the rrxiv corpus designed to amortise that expense across the community.

\paragraph{The $n=17$ in Claim 2 is a calibration figure, not a population estimate.} It is the largest set we could afford to replicate end-to-end in this audit. Doubling $n$ to 34 is the single most valuable follow-up; the protocol can already host the data, but the replication compute has to come from somewhere.

\paragraph{Budgets and editorial triage.} The reproducibility tax $\tau$ is a triage signal, not a quality signal. We are wary of any interpretation that says a high-$\tau$ subfield is doing worse science. The intended use is the opposite: a venue can use $\tau$ to allocate \emph{more} replication capacity to high-cost subfields, recognising that the per-claim verification rate will be lower there. In claim-graph terms, $\tau$ extends the whitepaper's queryability claim (\texttt{rrxiv:2605.00001}): load-bearingness ranks claims by how much of the graph rests on them, $\tau$ ranks them by what verifying them costs, and an editor allocating a finite replication budget needs both axes.

\paragraph{Worked example: applying budgets to \texttt{rrxiv:2605.00004}.} Consider the shrinkage-estimators paper \texttt{rrxiv:2605.00004}, which makes seven small-N claims.
A budget annotation would assign each claim modest \texttt{compute\_gpu\_hours} (those claims are CPU-bound), nontrivial \texttt{person\_hours} (the experimental design is intricate), and near-zero \texttt{materials\_usd}. The resulting per-claim $\tau$ would be far below the corpus median, suggesting these claims are exactly the kind of cheap cross-checks the budget mechanism is supposed to surface for replicators. A reader scanning the corpus by ascending $\tau$ would find them quickly.
% Deliberately NO typed edge to rrxiv:2605.00004: this worked example is
% illustrative: none of our registered claims depends on, supports, or
% extends a specific 00004 claim, and manufacturing an edge from prose
% that merely *applies* tau to their paper would overstate the coupling.
% (00004's own enrichment adds \dependson edges onto our c1, correctly
% directed from their side.)

\paragraph{Reproducibility manifests for this paper.} Starting with this revision, the paper practises what it registers: Claims 1--3 each ship an RRP-0019 reproducibility manifest (\texttt{repro/claim-c1.manifest.json}, \texttt{repro/claim-c2.manifest.json}, \texttt{repro/claim-c3.manifest.json} in the paper repository), pinning the environment and entrypoint for re-running each claim's analysis over the audit and calibration tables via \texttt{repro/analysis.py}. The estimated runtimes and costs in those manifests are small --- minutes and cents --- and that smallness is itself the paper's point restated: \emph{reproducing the analysis} from logged artefacts is cheap, while \emph{replicating the claims} afresh means re-annotating 312 papers (hundreds of person-hours) or re-running 17 end-to-end replications (the very costs Claim~2 says authors underreport by $2.3\times$). A manifest is the honest place to record that asymmetry.
The audit and calibration tables themselves are not yet published as a standalone dataset --- that is the calibration-record question below --- and each manifest's \texttt{notes} field says so explicitly.

\begin{openquestion}[Calibration record as common pool]
Should the calibration record (actual-vs-reported replication costs) be a separate rrxiv paper, a continuously updated dataset under the protocol, or both? The cleanest interpretation makes it a registered claim with a \texttt{\textbackslash dependson} edge from every paper whose budget annotations rely on the current correction factor --- but at corpus scale, that produces a single highly-connected node that may dominate the dependency graph.
\end{openquestion}

\begin{scope}[Limits of this paper]
We do not address (i) carbon and energy accounting, which deserves its own schema; (ii) budgets for theoretical claims, where ``replication'' is a different speech act; and (iii) the political economy of who pays for the calibration record. The schema is also vendor-neutral by construction --- the A100-equivalent lookup is one normalisation choice, and a different one would shift the absolute numbers without affecting the qualitative findings.
\end{scope}

\section{References}
\begin{itemize}[leftmargin=*]
\item Strubell, E., Ganesh, A., \& McCallum, A. (2019). \emph{Energy and policy considerations for deep learning in NLP}. ACL.
\item Henderson, P., Hu, J., Romoff, J., Brunskill, E., Jurafsky, D., \& Pineau, J. (2020). \emph{Towards the systematic reporting of the energy and carbon footprints of machine learning}. JMLR.
\item Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivi\`ere, V., Beygelzimer, A., d'Alch\'e-Buc, F., Fox, E., \& Larochelle, H. (2021). \emph{Improving reproducibility in machine learning research (a report from the NeurIPS reproducibility program)}. JMLR.
\item Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., \& Dean, J. (2021). \emph{Carbon emissions and large neural network training}. arXiv:2104.10350.
\item Raff, E. (2019). \emph{A step toward quantifying independently reproducible machine learning research}. NeurIPS.
\item rrxiv consortium. (2026). \emph{The rrxiv protocol whitepaper}. \texttt{rrxiv:2605.00001}.
\item rrxiv consortium. (2026). \emph{Many small claims, all under active replication}. \texttt{rrxiv:2605.00008}.
\item rrxiv consortium. (2026). \emph{A negative result on shrinkage estimators in small-N replication}. \texttt{rrxiv:2605.00004}.
\end{itemize}
\end{document}