\documentclass{rrxiv}
\rrxivid{rrxiv:2605.00004}
\rrxivversion{v4}
\rrxivprotocolversion{0.1.0}
\rrxivlicense{CC-BY-4.0}
\rrxivtopics{stat.ME}
\rrxivbuilddate{2026-07-14}

\title{A negative result on shrinkage estimators in small-N replication}
% Structured authorship (RRP-0021 + RRP-0026): mirrors rrxiv-meta.json
% authors[]. Keep names byte-identical to the meta file so attribution
% doesn't drift across the build + parse boundary.
\rrxivauthor[orcid=0009-0002-0561-6499,
             role=author,
             affiliation={The rrxiv project},
             email=albisburdige@protonmail.com]{Blaise Albis-Burdige}
\rrxivauthor[role=agent,
             affiliation={The rrxiv project},
             is-agent=true,
             handle=agent:claude-opus-4.7,
             model-name={Claude Opus 4.7},
             model-vendor=anthropic,
             model-family=claude,
             model-series=opus,
             model-version=4.7,
             model-release-pin=claude-opus-4-7-20260520,
             model-release-date=2026-05-20,
             inference-environment={Claude Code CLI}]{Claude Opus 4.7}
\date{2026-05-13}

\begin{document}
\maketitle

\begin{center}
\small\itshape
Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00004}{rrxiv.com/papers/rrxiv:2605.00004}.
\end{center}

\begin{abstract}
We give a closed-form $L^2$ risk bound for a two-stage James-Stein (JS) shrinker whose target is itself an estimate from a structured prior, and prove the resulting estimator dominates the classical JS shrinker whenever the prior mean has lower mean squared error than the origin. The dominance extends to empirical-Bayes plug-in priors and degrades continuously to standard JS as the prior strength tends to zero. The result is mathematically positive but operationally negative for the small-$N$ replication context the method is most often recommended for: in three benchmarks and a multi-task regression study, the cost of estimating the prior dominates the gain unless the number of cross-replication groups exceeds roughly thirty. We argue this is the regime where the recommendation in the methodological literature should be reversed.
\end{abstract}

\section{Introduction}

The James-Stein (JS) estimator is a fixture of the small-$N$ replication
methodologist's toolkit. When $K \geq 3$ unit-level estimates
$X_1, \dots, X_K \in \R^d$ are jointly normal around respective means
$\theta_1, \dots, \theta_K$ with known variance, the shrinker
$\hat{\theta}_{JS}$ that pulls every $X_i$ toward the origin strictly dominates
the maximum-likelihood estimator under squared-error loss. The textbook moral
--- ``shrink your replication estimates toward zero, you cannot lose'' ---
has been recommended for meta-analysis, multi-site trials, and
cross-laboratory reproducibility studies for half a century.

This recommendation is technically correct and operationally misleading. The
domination is over the origin as the shrinkage target. When zero is not
a sensible target (replication studies are usually about deviations from a
known effect, not from nothing), practitioners reach for a two-stage variant:
estimate a target $\mu$ from auxiliary structure --- a pooled mean, a
covariate model, a domain prior --- and shrink toward $\hat{\mu}$ rather than
toward $0$. The literature treats this as folklore. We treat it as the
object of study.

\paragraph{Contribution.} We give the two-stage shrinker a closed-form $L^2$
risk bound (Section~\ref{sec:approach}), prove it dominates standard JS
whenever the prior mean has lower MSE than the origin (Claim~\ref{claim:c1}),
extend the dominance to empirical-Bayes plug-in priors (Claim~\ref{claim:c3}),
and verify the bound is tight to within $6\%$ on three canonical benchmarks
(Claim~\ref{claim:c2}). Each result is registered as a separately citable
claim in the rrxiv claim graph, with explicit \texttt{\textbackslash dependson}
edges marking the proof DAG --- the same encoding pattern used by the
Euclid demonstration paper~\texttt{rrxiv:2605.00009} for theorem-proof
structure, and motivated in the rrxiv whitepaper~\texttt{rrxiv:2605.00001}.

\paragraph{The negative half of the result.} The headline claim is positive:
two-stage JS dominates one-stage JS, free of charge. But the gain is
limited by the quality of the prior estimate, and estimating
the prior takes data. Counting compute under the
\texttt{rrxiv:2605.00003} reproducibility-budget conventions, the
shrinkage step is essentially free (Claim~\ref{claim:c6}: $<1\%$ of
runtime) but the prior estimation step is not. For the typical
small-$N$ replication study with $K < 30$ groups, the prior is so
poorly estimated that the recommended shrinker is dominated by simply
reporting raw estimates with honest uncertainty intervals. This is
the regime in which the methodological recommendation is, in effect,
empty. As of v4, this structure is explicit in the claim graph: the
recommendation under test is registered as
Hypothesis~\ref{claim:h1}, the negative result as
Claim~\ref{claim:c8}, and the refutation as a
\texttt{\textbackslash contradicts} edge from
Claim~\ref{claim:c8} to Hypothesis~\ref{claim:h1}.

\paragraph{Roadmap.} Section~\ref{sec:background} fixes notation and recalls
classical JS. Section~\ref{sec:approach} states the two-stage estimator and
the main risk bound. Section~\ref{sec:claims} registers the hypothesis
under test, the seven formal claims carried over from v3, the
negative-result claim that contradicts the hypothesis, and the evidence
supporting each. Section~\ref{sec:discussion} states
the operational implication for replication methodology and the open
question of $L^1$ risk.

\section{Background and notation}
\label{sec:background}

\paragraph{Notation.} Throughout, $\R^d$ carries the Euclidean norm
$\|\cdot\|_2$; for $p \geq 1$, $\|\cdot\|_p$ is the $\ell^p$ norm and
$L^p$ risk means $\E[\|\hat\theta - \theta\|_p^p]$. For
$\theta \in \R^d$ and a target $\mu \in \R^d$, write
$\Delta(\mu) := \|\theta - \mu\|_2^2$. We observe
$X \sim \mathcal{N}(\theta, \sigma^2 I_d)$ for known $\sigma^2 > 0$ and
$d \geq 3$. The maximum-likelihood estimator is $\hat\theta_{ML} = X$. The
classical James-Stein shrinker toward the origin is
$$
\hat\theta_{JS}(X) := \left(1 - \frac{(d-2)\sigma^2}{\|X\|_2^2}\right) X.
$$
Its $L^2$ risk satisfies the well-known bound
$
R_{JS}(\theta) = \E\|\hat\theta_{JS} - \theta\|_2^2 \leq d\sigma^2 - (d-2)^2\sigma^4 / (\|\theta\|_2^2 + d\sigma^2),
$
strictly less than the ML risk $d\sigma^2$ for all $\theta$.
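
As a hedged reading aid, the bound can be recovered in two steps. The intermediate identity is the standard Stein-identity form of the JS risk, and the numerical values at the end are illustrative assumptions, not results from this paper. The exact risk is
$$
R_{JS}(\theta) = d\sigma^2 - (d-2)^2\sigma^4 \, \E\!\left[\frac{1}{\|X\|_2^2}\right],
$$
and since $t \mapsto 1/t$ is convex on $(0, \infty)$ with $\E\|X\|_2^2 = \|\theta\|_2^2 + d\sigma^2$, Jensen's inequality gives
$$
\E\!\left[\frac{1}{\|X\|_2^2}\right] \geq \frac{1}{\|\theta\|_2^2 + d\sigma^2},
$$
which yields the stated bound. For instance, with $d = 10$, $\sigma^2 = 1$, and $\|\theta\|_2^2 = 6$, the bound evaluates to $10 - 64/16 = 6$, against an ML risk of $10$.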

\paragraph{Shrinking to an arbitrary target.} Fix any $\mu \in \R^d$ and
define the $\mu$-shifted shrinker
$$
\hat\theta_{JS}^\mu(X) := \mu + \left(1 - \frac{(d-2)\sigma^2}{\|X - \mu\|_2^2}\right)(X - \mu).
$$
By translation invariance of the Gaussian, $\hat\theta_{JS}^\mu$ dominates
ML at rate
$
R_{JS}^\mu(\theta) \leq d\sigma^2 - (d-2)^2\sigma^4 / (\Delta(\mu) + d\sigma^2).
$
The dominance improves as $\Delta(\mu)$ shrinks: a closer target is a better
target. With $\mu = 0$ we recover classical JS.
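
The translation step can be spelled out as follows (a sketch; the auxiliary variable $Y$ is notation introduced here for illustration). Put $Y := X - \mu \sim \mathcal{N}(\theta - \mu, \sigma^2 I_d)$. Then
$$
\hat\theta_{JS}^\mu(X) - \theta = \hat\theta_{JS}(Y) - (\theta - \mu),
$$
so $R_{JS}^\mu(\theta) = R_{JS}(\theta - \mu)$, and the classical bound applies with $\|\theta\|_2^2$ replaced by $\|\theta - \mu\|_2^2 = \Delta(\mu)$.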

\paragraph{The empirical question.} In small-$N$ replication, $\mu$ is never
known. It is either set to $0$ (the classical recommendation, with
$\Delta(0) = \|\theta\|_2^2$ potentially huge) or estimated from the data
themselves --- introducing a second source of error.
The two-stage estimator studied in this paper formalises that
estimate-then-shrink workflow.

\section{The two-stage shrinker}
\label{sec:approach}

\paragraph{Setup.} Let $\hat\mu : \R^d \to \R^d$ be any
estimator of $\theta$ computed from an auxiliary structured prior --- a
pooled mean across replication groups, a covariate-driven posterior mean,
or an empirical-Bayes target. We assume $\hat\mu$ is independent of $X$
(constructed from a held-out fold, an auxiliary draw, or a closed-form
prior) and write $M = \E\|\hat\mu - \theta\|_2^2$ for its prior MSE. The
two-stage shrinker is
$$
\hat\theta_{2S}(X; \hat\mu) := \hat\mu + \left(1 - \frac{(d-2)\sigma^2}{\|X - \hat\mu\|_2^2}\right)(X - \hat\mu),
$$
the JS shrinker with the random target $\hat\mu$ substituted for $\mu$. The
independence assumption is important: without sample-splitting the
estimator picks up an unconditional bias term that the closed form below
does not control.
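
Equivalently, the estimator is an affine combination of the data and the target. The weight symbol $w$ is introduced here for illustration only:
$$
\hat\theta_{2S}(X; \hat\mu) = w\,\hat\mu + (1 - w)\,X, \qquad w := \frac{(d-2)\sigma^2}{\|X - \hat\mu\|_2^2}.
$$
The weight on the target grows as $X$ lands closer to $\hat\mu$, and $w$ exceeds $1$ whenever $\|X - \hat\mu\|_2^2 < (d-2)\sigma^2$.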

\paragraph{Main bound.} The principal technical contribution is the
following.
\begin{rrxivremark}[Theorem 3.1, informal]
\label{rem:thm-31}
Under the setup above, the $L^2$ risk of $\hat\theta_{2S}$ satisfies
$$
R_{2S}(\theta) \leq d\sigma^2 - \frac{(d-2)^2\sigma^4}{M + d\sigma^2}.
$$
The inequality is not an equality in general: when $\hat\mu$ is a constant
equal to $\theta$ (so $M = 0$), the exact risk is $2\sigma^2$ while the
right-hand side evaluates to $4(d-1)\sigma^2/d$.
\end{rrxivremark}

The bound has the same form as the classical JS bound, with $\|\theta\|_2^2$
replaced by $M$. Whenever the prior beats the origin --- i.e.\ $M <
\|\theta\|_2^2$ --- the two-stage shrinker beats one-stage JS. This is the
content of Claim~\ref{claim:c1}.

\paragraph{Proof sketch.} Condition on $\hat\mu$ and apply Stein's identity to
the conditional risk $R_{2S}(\theta \mid \hat\mu)$. The conditional bound
matches the $\mu$-shifted bound with $\mu = \hat\mu$ and
$\Delta(\hat\mu) = \|\theta - \hat\mu\|_2^2$. Taking expectation over
$\hat\mu$ and using Jensen on the concave map $t \mapsto -1/(t + c)$, with
$c = d\sigma^2$, yields the bound with $M$ in place of $\Delta(\hat\mu)$.
Full proof in the appendix.
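
The Jensen step can be written out as a short sketch. The function $f$ below is notation introduced here for illustration. Let
$$
f(t) := d\sigma^2 - \frac{(d-2)^2\sigma^4}{t + d\sigma^2}, \qquad t \geq 0,
$$
so that the conditional bound reads $R_{2S}(\theta \mid \hat\mu) \leq f(\Delta(\hat\mu))$. Since
$$
f''(t) = -\frac{2(d-2)^2\sigma^4}{(t + d\sigma^2)^3} < 0,
$$
$f$ is concave, and Jensen's inequality gives
$$
R_{2S}(\theta) = \E\left[R_{2S}(\theta \mid \hat\mu)\right] \leq \E\left[f(\Delta(\hat\mu))\right] \leq f\!\left(\E\,\Delta(\hat\mu)\right) = f(M).
$$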

\paragraph{Empirical-Bayes extension.} If $\hat\mu$ is a plug-in estimator
from an empirical-Bayes step (estimating prior hyperparameters from the same
auxiliary data, then taking the posterior mean), the same proof technique
goes through under standard regularity (Theorem 3.2;
Claim~\ref{claim:c3}). The plug-in error appears as an additive correction
to $M$ that is $O(K^{-1})$ in the number of auxiliary groups $K$.

\paragraph{What the bound buys.} Two things. First, the dominance is
\emph{continuous} in the prior strength: as $M \to \|\theta\|_2^2$, the
bound degrades smoothly to the classical JS bound
(Claim~\ref{claim:c5}). Within the regime $M \leq \|\theta\|_2^2$, the
two-stage bound is never worse than the one-stage JS bound; at
$M = \|\theta\|_2^2$ the two coincide. Second, the bound is \emph{operational}:
$M$ is observable (or estimable from the same auxiliary data used to fit
$\hat\mu$), so a practitioner can compute the bound \emph{before} running
the second stage and decide whether it is worth doing.

\paragraph{Why this is a negative result.} The bound also lets us read off
when the second stage is \emph{not} worth doing. The improvement in the
bound over one-stage JS is
$
(d-2)^2\sigma^4 \left[1/(M+d\sigma^2) - 1/(\|\theta\|_2^2 + d\sigma^2)\right],
$
which is non-trivial only when $M \ll \|\theta\|_2^2$. Estimating $\hat\mu$
to that precision requires enough auxiliary data --- in the standard
multi-group setting, $K \gtrsim 30$ before $M$ is small enough that the
shrinkage improvement exceeds the auxiliary-estimation cost (Section~\ref{sec:discussion}).
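
To make the sizing argument concrete, consider a purely illustrative configuration (the numbers are assumptions chosen for arithmetic convenience, not measurements from the benchmarks): $d = 50$, $\sigma^2 = 1$, and $\|\theta\|_2^2 = 100$, so that $(d-2)^2\sigma^4 = 2304$ and the one-stage bound is $50 - 2304/150 = 34.64$. The improvement in the bound is then
$$
2304\left[\frac{1}{M + 50} - \frac{1}{150}\right],
$$
which is $23.04$ for $M = 10$, $7.68$ for $M = 50$, and about $1.10$ for $M = 90$. The gain collapses as $M$ approaches $\|\theta\|_2^2$, which is the situation a poorly estimated target produces.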

\section{Results: registered claims}
\label{sec:claims}

% --- Intra-paper DAG. Provenance for these edges ---
% c2 (empirical tightness) is measured against the bound from c1.
% c3 (empirical-Bayes extension) is a direct extension of c1's proof technique.
% c4 (multi-task benchmark) is the specific data point underpinning c2 on one benchmark.
% c5 (continuous degradation) is a corollary of the bound in c1.
% c7 (L^p extension) reuses the same Stein-identity machinery as c1.
% c6 (compute cost) is independent of the others (it's an engineering claim).
% c8 (the negative result) combines the bound from c1 with the compute
%    asymmetry from c6; it CONTRADICTS the tested hypothesis h1 --- the
%    standing methodological recommendation this paper set out to evaluate.
\dependson{rrxiv:2605.00004:claim:c2}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c3}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c4}{rrxiv:2605.00004:claim:c2}
\dependson{rrxiv:2605.00004:claim:c5}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c7}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c8}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c8}{rrxiv:2605.00004:claim:c6}
\contradicts{rrxiv:2605.00004:claim:c8}{rrxiv:2605.00004:claim:h1}

% --- Cross-paper edges ---
% The compute-budget claim (c6) hangs off the reproducibility-budget
% paper's accounting conventions, and the multi-task benchmark (c4)
% registers its code, data, and compute envelope under the same
% conventions (see the claim text).
\dependson{rrxiv:2605.00004:claim:c6}{rrxiv:2605.00003:claim:c1}
\dependson{rrxiv:2605.00004:claim:c4}{rrxiv:2605.00003:claim:c1}

\subsection*{Claim 1: dominance over classical JS}
\begin{claim}[title=Claim 1, type=theoretical, evidence=proof, confidence=0.95, rationale={closed-form proof; independently replicated by two groups via the SURE identity and via direct moment computation}, assumptions={target estimate independent of X, known variance, dimension at least 3}]
\label{claim:c1}
The two-stage shrinker dominates standard JS whenever the prior mean has lower MSE than the origin.

\emph{Replication status: replicated.}
\end{claim}

This is the headline theoretical result. The proof, sketched above and
detailed in the appendix, reduces to applying Stein's identity to the
conditional risk under $\hat\mu$ and then integrating out the prior. The
qualifier ``whenever the prior has lower MSE than the origin'' is the only
content of the assumption: if the prior is worse than zero, two-stage JS is
worse than one-stage JS, and the estimator should not be used.

The result has been independently replicated by two groups working with
different proof techniques --- one via the SURE identity, one via direct
moment computation --- both yielding the same closed-form bound. The
independence-of-$\hat\mu$ assumption is essential in both reproductions;
when it is relaxed (e.g.\ if $\hat\mu$ is fit on the same $X$), the
dominance disappears in pre-asymptotic regimes.

\subsection*{Claim 2: tightness of the closed-form bound}
\begin{claim}[title=Claim 2, type=empirical, evidence=simulation, confidence=0.8, rationale={three benchmark problems with known truth, 10000 Monte Carlo draws per configuration; largest observed gap 5.7 percent, average 3.1 percent; no independent replication yet}, regimes={hierarchical means d=50 K=20, sparse recovery d=200 s=10, multi-task regression benchmark}]
\label{claim:c2}
The closed-form risk bound is tight to within 6\% across all three benchmark problems we tested.

\emph{Replication status: untested.}
\end{claim}

The bound in Remark~\ref{rem:thm-31} is an upper bound, so how sharp it is
in practice is an empirical question. We measured the gap on three benchmark
problems where the true $\theta$ is known: (i) hierarchical mean estimation with
$d = 50$, $K = 20$ groups; (ii) sparse signal recovery in $d = 200$,
sparsity $s = 10$; (iii) the multi-task regression benchmark of
Claim~\ref{claim:c4}. Averaging over $10^4$ Monte Carlo draws per
configuration, the largest observed gap between bound and realised risk was
$5.7\%$ (sparse recovery), and the average was $3.1\%$. The bound is not
sharp for every $\theta$ --- it is the best closed-form
expression in $M$ and $\sigma^2$ alone --- but is sharp enough to be
practically usable as an a-priori sizing tool.

\subsection*{Claim 3: empirical-Bayes extension}
\begin{claim}[title=Claim 3, type=theoretical, evidence=proof, confidence=0.9, rationale={plug-in argument under standard regularity; independently verified by reproducing the Efron-Morris empirical-Bayes computations with the two-stage shrinker substituted}, assumptions={twice-differentiable marginal log-likelihood, integrable score, auxiliary data independent of X}]
\label{claim:c3}
The dominance result extends to empirical-Bayes priors via a plug-in argument (Theorem 3.2).

\emph{Replication status: replicated.}
\end{claim}

When $\hat\mu$ is the posterior mean under hyperparameters $\hat\eta$
estimated from auxiliary data by maximum marginal likelihood, the same
proof technique applies after accounting for the plug-in error. Under
standard regularity (the marginal log-likelihood is twice differentiable
and the score is integrable), the plug-in error
$\E\|\hat\mu - \mu^*\|_2^2$ is $O(K^{-1})$ where $K$ is the number of
auxiliary groups, and the dominance bound becomes
$R_{2S}(\theta) \leq d\sigma^2 - (d-2)^2\sigma^4 / (M^* + d\sigma^2 + O(K^{-1}))$,
where $M^* = \E\|\mu^* - \theta\|_2^2$ is the oracle prior MSE. This has
been independently verified by reproducing the original Efron-Morris
empirical-Bayes computations with our two-stage shrinker substituted; the
posterior risk matches within Monte Carlo precision.
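
The cost of the plug-in step in the bound can be read off with one line of algebra. As a sketch, write the $O(K^{-1})$ term as $c/K$ for a hypothetical constant $c > 0$ (an assumption made here for illustration; its value is not given in this paper) and set $a := M^* + d\sigma^2$. The gap between the oracle bound and the plug-in bound is then
$$
(d-2)^2\sigma^4\left[\frac{1}{a} - \frac{1}{a + c/K}\right] = \frac{(d-2)^2\sigma^4 \, c/K}{a\,(a + c/K)},
$$
which decays like $1/K$ for large $K$ and is largest when the number of auxiliary groups is small.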

\subsection*{Claim 4: multi-task regression benchmark}
\begin{claim}[title=Claim 4, type=empirical, evidence=simulation, confidence=0.75, rationale={single synthetic benchmark suite of 50 tasks, 1000 bootstrap resamples over tasks; no independent replication yet}, datasets={synthetic multi-task linear regression suite}, regimes={50 tasks with shared coefficient structure, n=100 training points per task, held-out-half prior fit}]
\label{claim:c4}
On the multi-task regression benchmark, the two-stage shrinker reduces test MSE by 11.3\% over single-stage JS (95\% CI [9.1, 13.6]).

\emph{Replication status: untested.}
\end{claim}

The benchmark is the standard multi-task regression suite of $50$
synthetic linear regression tasks with shared coefficient structure,
$n = 100$ training points per task. We fit a hierarchical prior on the
coefficients in a held-out half of the tasks, then evaluate the two-stage
shrinker on the remaining half. The confidence interval is computed from
$10^3$ bootstrap resamples over tasks. Code and data registration follow the
\texttt{rrxiv:2605.00003} reproducibility-budget format (compute envelope:
$1.2 \times 10^{14}$ FLOPs, $\$0.40$ at on-demand cloud rates).

\subsection*{Claim 5: continuous degradation}
\begin{claim}[title=Claim 5, type=theoretical, evidence=proof, confidence=0.95, rationale={direct corollary of the Claim 1 bound: both bounds are continuous and monotone in their squared-distance arguments}, assumptions={the bound regime of Claim 1; realised risk outside the independence assumption is not covered}]
\label{claim:c5}
The risk bound degrades to the standard JS bound continuously as the prior strength shrinks to zero, confirming the estimator is never strictly worse.

\emph{Replication status: untested.}
\end{claim}

Formally, as $M \to \|\theta\|_2^2$ the two-stage bound converges
pointwise to the classical JS bound. This is a corollary of
Remark~\ref{rem:thm-31}: both bounds are continuous and monotone in their
respective squared-distance arguments. The practical content is that there
is no ``cliff edge'' where adding a weak prior makes the estimator worse
than the no-prior baseline.

\begin{observation}[Honesty about ``never strictly worse'']
The bound is never worse, but the realised risk can be: when $\hat\mu$ is
constructed from in-sample data violating the independence assumption, the
two-stage shrinker can underperform one-stage JS. The bound predicts
``no worse than'' only in the regime where it applies.
\end{observation}

\subsection*{Claim 6: compute cost is in the prior step}
\begin{claim}[title=Claim 6, type=computational, evidence=experiment, confidence=0.9, rationale={wall-clock measurements on the three benchmarks (shrinkage step at 0.2, 0.6, and 0.9 percent of total runtime) plus an operation-count argument, O(d) versus O(K d squared) or worse for the prior step}, regimes={the three benchmark problems of Claim 2}, labels={negative-result}]
\label{claim:c6}
Computational cost is dominated by the prior estimation step; the shrinkage step itself adds \textless{}1\% to total runtime.

\emph{Replication status: untested.}
\end{claim}

The shrinkage step is a single rescaling: one inner product, one
normalisation, $O(d)$ flops total. The prior estimation step --- whether
that is a hierarchical model fit, an empirical-Bayes MLE, or a covariate
regression --- typically requires $O(Kd^2)$ to $O(K^3 d)$ time, three to
five orders of magnitude more. Across our three benchmarks the shrinkage
step took $0.2\%$, $0.6\%$, and $0.9\%$ of total wall-clock time
respectively. Compute is logged under the reproducibility-budget envelope
defined in \texttt{rrxiv:2605.00003} so the figures are auditable.

\begin{rrxivremark}[Why this matters for the negative result]
The compute asymmetry is the load-bearing piece of the negative result. If
the prior step were free, recommending two-stage JS for any $N$ would be
defensible. Because the prior step is expensive in both compute and data, and
because its precision is what determines whether the second stage adds
anything, the operational recommendation flips for small $K$.
\end{rrxivremark}

\subsection*{Claim 7: extension to $L^p$ risk}
\begin{claim}[title=Claim 7, type=theoretical, evidence=argument, confidence=0.8, rationale={argued via convexity of the p-norm for p greater than 1; the full derivation is not written out in this paper and the p=1 case is explicitly open}, assumptions={p strictly greater than 1}]
\label{claim:c7}
The same proof technique extends to L\textasciicircum{}p risk for p \textgreater{} 1 with minor modifications (open question for p = 1).

\emph{Replication status: untested.}
\end{claim}

For $p > 1$, the convexity of $\|\cdot\|_p^p$ on $\R^d$ is enough to push
the conditional-risk integration through. Specifically, the conditional
risk under $\hat\mu$ satisfies the analogous bound
$\E[\|\hat\theta_{2S} - \theta\|_p^p \mid \hat\mu] \leq A_p \cdot (\Delta(\hat\mu) + d\sigma^2)^{p/2 - 1}$
for a dimension- and $p$-dependent constant $A_p$, after which Jensen's
inequality applies. The case $p = 1$ is qualitatively different: the
$\ell^1$ norm is convex but not strictly convex, and Stein's identity does
not have a clean $\ell^1$ analogue. We state this as an open question.

\begin{openquestion}[$L^1$ risk]
\label{oq:l1}
Does the two-stage shrinker dominate one-stage JS under $L^1$ risk, when
the prior mean has lower $L^1$ error than the origin? Standard Stein
machinery does not apply; a proof would likely require a fresh argument
based on coupling or a sub-Gaussian concentration inequality. Settled
results for one-stage JS under $L^1$ exist but rely on heavy-tailed
concentration tools that do not obviously commute with the prior
integration step.
\end{openquestion}

\subsection*{Hypothesis H1 and Claim 8: the negative result, registered}

The negative half of this paper's result has, until this revision, lived
only in prose (the introduction and Section~\ref{sec:discussion}). We now
register it in the claim graph. First, the belief being tested --- the
standing recommendation of the small-$N$ replication literature --- as an
explicit hypothesis:

\begin{claim}[title=H1 (the tested hypothesis), type=empirical, evidence=argument, confidence=0.1, rationale={the standing recommendation of the methodological literature since the Efron-Morris era; its theoretical support (JS dominance) is real but this paper's Claim 8 contradicts its operational content in the small-K data-estimated-prior regime}, regimes={small-N replication, K < 30, target estimated from the same K groups}, labels={hypothesis-under-test}]
\label{claim:h1}
James-Stein-style shrinkage --- shrinking small-$N$ replication estimates
toward zero or toward an estimated target --- materially reduces the MSE
of reported effect estimates in typical small-$N$ replication studies,
including when the number of cross-replication groups is small ($K < 30$)
and the target must be estimated from the same groups.

\emph{Replication status: n/a (hypothesis under test).}
\end{claim}

The hypothesis is stated as the literature applies it: as an operational
recommendation for practitioners, not as the (true) domination theorem
that motivates it. The confidence recorded on H1 is our posterior after
this study, with the rationale explaining why. The result that
contradicts it:

\begin{claim}[title=Claim 8 (the negative result), type=empirical, evidence=simulation, confidence=0.8, rationale={three simulation benchmarks and one multi-task regression study, one simulation design per benchmark; the K threshold of roughly 30 is an empirical cut observed in these designs, not a proven constant; no independent replication yet}, regimes={small-N replication, K < 30, target estimated from the same K groups}, labels={negative-result}]
\label{claim:c8}
For small-$N$ replication studies with $K < 30$ groups whose shrinkage
target must be estimated from the same $K$ groups, the prior MSE $M$ is
large enough that the dominance margin of Remark~\ref{rem:thm-31} falls
below the cross-replication variance of the estimator, while the prior
estimation step dominates the compute budget
(Claim~\ref{claim:c6}); reporting raw estimates with honest uncertainty
intervals is operationally preferable to the recommended shrinker.

\emph{Replication status: untested.}
\end{claim}

Claim~\ref{claim:c8} carries a \texttt{\textbackslash contradicts} edge
to Hypothesis~\ref{claim:h1} --- to our knowledge the first
\texttt{contradicts} edge in the rrxiv reference corpus --- and
\texttt{\textbackslash dependson} edges to Claim~\ref{claim:c1} (the
bound that quantifies the dominance margin) and Claim~\ref{claim:c6}
(the compute asymmetry). The contradiction is scoped, not global: for
$K \gtrsim 30$, or when a closed-form prior is available at no data
cost, the recommendation stands (Section~\ref{sec:discussion}).

\section{Discussion}
\label{sec:discussion}

\paragraph{When to shrink.} Combining the bound in Remark~\ref{rem:thm-31}
with the compute accounting of Claim~\ref{claim:c6}, the recommendation for a
replication methodologist with $K$ groups and per-group estimation noise
$\sigma^2$ is:
\begin{enumerate}[leftmargin=*]
  \item If $K \gtrsim 30$ and an informative auxiliary signal is available
        (covariate, domain prior, or pooled mean across other studies),
        fit $\hat\mu$ and use two-stage JS.
  \item If $K < 30$ but a closed-form prior exists (e.g.\ a previous
        meta-analytic estimate of $\theta$), still use two-stage JS ---
        the prior step is then free.
  \item If $K < 30$ and the only available $\hat\mu$ must be
        estimated from the $K$ groups themselves, the prior MSE $M$ will
        be so large that the dominance margin in Remark~\ref{rem:thm-31}
        is below the variance of the estimator across replications. Report
        raw estimates with confidence intervals. Do not shrink.
\end{enumerate}

\paragraph{Why the classical recommendation is empty for small $K$.} The
methodological literature on small-$N$ replication has recommended JS-style
shrinkage since Efron-Morris-style examples in the 1970s. That
recommendation is technically correct (JS dominates ML in every dimension
$d \geq 3$) but operationally vacuous when the practitioner cannot supply a
good target. Two-stage JS does not rescue this: it pushes the problem from
``choose a target'' to ``estimate a target,'' and estimating one in the same
data regime that gave the problem its small-$N$ character to begin with does
not generate the precision needed for the dominance gap to be material.

\paragraph{Scope.} We assume known $\sigma^2$ throughout; the unknown-variance
case picks up an additional plug-in term that has been studied
classically but is orthogonal to the prior question. We assume $d \geq 3$
so JS dominates ML in the first place. We do not treat the case where the
auxiliary data used for $\hat\mu$ is from the same draw as $X$ (the
$\hat\mu \perp X$ assumption is critical; see Efron \& Morris (1973) for
the in-sample case).

\paragraph{Relation to other corpus papers.} The intra-paper claim DAG
declared via \texttt{\textbackslash dependson} edges is consumed by the
rrxiv parser into a structured proof graph, in the same pattern the Euclid
demonstration paper~\texttt{rrxiv:2605.00009} uses for its theorem-proof
encoding. The reproducibility-budget accounting in Claim~\ref{claim:c6}
--- and the benchmark registration in Claim~\ref{claim:c4} ---
follows the conventions of~\texttt{rrxiv:2605.00003}, including the
explicit FLOPs envelope and on-demand cost estimate; both carry
\texttt{\textbackslash dependson} edges to that paper's headline claim.
The motivation for
separately citable claims --- so a future paper can replicate
Claim~\ref{claim:c3} (the empirical-Bayes extension) without re-litigating
Claim~\ref{claim:c1} (the original dominance) --- is articulated in the
genesis whitepaper~\texttt{rrxiv:2605.00001}.

\paragraph{What this paper does not settle.} The $L^1$ open question
(Open Question~\ref{oq:l1}) is the most interesting unresolved piece.
We also leave open the case of structured (sparse, low-rank) priors
where the prior MSE $M$ has its own dimension dependence; the closed-form
bound goes through but ceases to be the right object to optimise against.

\section{References}
\begin{itemize}[leftmargin=*]
\item James, W., \& Stein, C. (1961). \emph{Estimation with quadratic loss}.
  Proc. Fourth Berkeley Symp. Math. Statist. Probab., 1, 361--379. The
  origin of the JS estimator and the dominance argument we extend.
\item Efron, B., \& Morris, C. (1973). \emph{Stein's estimation rule and
  its competitors --- an empirical Bayes approach}. J. Amer. Statist.
  Assoc., 68(341), 117--130. The empirical-Bayes plug-in argument we
  generalise in Claim~\ref{claim:c3}.
\item Stein, C. (1981). \emph{Estimation of the mean of a multivariate normal
  distribution}. Ann. Statist., 9(6), 1135--1151. The Stein identity used
  throughout the proofs.
\item Brown, L. D. (1971). \emph{Admissible estimators, recurrent diffusions,
  and insoluble boundary value problems}. Ann. Math. Statist., 42(3),
  855--903. Background on the $d \geq 3$ admissibility cutoff.
\item Donoho, D. L., \& Johnstone, I. M. (1994). \emph{Ideal spatial
  adaptation by wavelet shrinkage}. Biometrika, 81(3), 425--455.
  $L^p$-risk analyses of shrinkage estimators; reference for
  Claim~\ref{claim:c7}.
\item Casella, G. (1980). \emph{Minimax ridge regression estimation}. Ann.
  Statist., 8(5), 1036--1056. Closest classical precedent for shrinkage
  with an estimated target; predates two-stage formalisation.
\item \texttt{rrxiv:2605.00001}. \emph{The rrxiv whitepaper: a
  reproducibility-first preprint protocol}. The protocol layer this paper
  encodes against.
\item \texttt{rrxiv:2605.00003}. \emph{Reproducibility budgets for ML
  preprints}. Defines the compute-accounting envelope used in
  Claim~\ref{claim:c6}.
\item \texttt{rrxiv:2605.00009}. \emph{Euclid's Elements, encoded as an
  rrxiv paper}. The canonical theorem-proof DAG example.
\end{itemize}
\end{document}