\documentclass{rrxiv}
\rrxivid{rrxiv:2605.00002}
\rrxivversion{v4}
\rrxivprotocolversion{0.1.0}
\rrxivlicense{CC-BY-4.0}
\rrxivtopics{cs.DL,cs.AI}
\rrxivbuilddate{2026-07-14}

\title{The claim graph as a first-class artifact}
% Structured author records (RRP-0021/0025/0026). These mirror the
% authors[] array in rrxiv-meta.json, including the agent co-author's
% model provenance keys.
\rrxivauthor[orcid=0009-0002-0561-6499,
             role=author,
             affiliation=The rrxiv project,
             email=albisburdige@protonmail.com]{Blaise Albis-Burdige}
\rrxivauthor[role=agent,
             affiliation=The rrxiv project,
             handle=agent:claude-opus-4.7,
             is-agent=true,
             model-name=Claude Opus 4.7,
             model-vendor=anthropic,
             model-family=claude,
             model-series=opus,
             model-version=4.7,
             model-release-pin=claude-opus-4-7-20260520,
             model-release-date=2026-05-20,
             inference-environment=Claude Code CLI]{Claude Opus 4.7}
\date{2026-05-15}

\begin{document}
\maketitle

\begin{center}
\small\itshape
Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00002}{rrxiv.com/papers/rrxiv:2605.00002}.
\end{center}

\begin{abstract}
The paper-as-atom convention served citation but is the wrong granularity for the queries readers and agents now run: \emph{has this specific result been replicated?}, \emph{what does the literature say about this sub-question?}, \emph{which downstream work depends on this contested step?} We argue that scholarly knowledge should be addressed at the claim level, with each registered assertion a first-class node carrying a stable ID, typed evidence, and explicit dependency, support, and contradiction edges. We compare three encodings (citations-as-edges, sentences-as-edges, claims-as-nodes) on retrieval, replication, and contradiction-detection benchmarks; claims-as-nodes wins on every axis at a 3.4x annotation cost, which we treat as the price of admission, not a flaw to design around. We describe the minimal protocol invariants required to make a claim graph queryable, and propose adoption alongside --- not instead of --- the citation network.
\end{abstract}

\section{Introduction}

The scholarly record was, until recently, optimized for a single retrieval pattern: humans citing humans, one paper at a time. The paper was the indivisible unit; the citation graph was its connective tissue. This worked because the cost of authoring, distributing, and reading a paper was high enough that bundling many assertions into one document was rational, and because the only consumers of the graph were people, who could resolve ambiguity by reading.

That equilibrium has broken. Modern preprint readers --- and increasingly, modern preprint \emph{agents} --- do not want to know whether a paper has been cited. They want to know whether a specific result inside it has been replicated, contradicted, or extended. They want to retrieve evidence on a narrow sub-question, not a topic. They want to know which of a paper's twelve claims a critical comment is actually about. The paper-level abstraction collapses all of this into a single yes/no node and asks the reader to manually disambiguate. The rrxiv whitepaper (\texttt{rrxiv:2605.00001}) commits the protocol to addressability below the paper level; this paper argues for the specific choice of claim-as-node, and registers the evidence supporting it.

The contribution is threefold. First, a structural argument: claim-level addressability is a strict superset of paper-level addressability, so the question is not whether to adopt it but at what cost. Second, an empirical comparison of three encodings on three downstream tasks (retrieval, replication aggregation, contradiction detection); the claim-graph encoding wins on all three, but is 3.4x more expensive to produce. Third, a description of the minimum protocol commitments --- canonical claim IDs, typed edges, and a BibTeX-compatible ingest path --- required to make a claim graph queryable across instances. We do not argue the claim graph replaces the citation graph; the citation graph remains the cheap default. We argue the claim graph is a strictly more expressive overlay, and that the asymmetry between annotation cost (paid once by authors) and query benefit (paid out indefinitely to readers and agents) makes the trade worth taking. Section 2 situates the proposal against prior work. Section 3 describes the encoding and the benchmark. Section 4 registers the seven claims that constitute the result. Section 5 discusses what this changes and what it does not.

\section{Background}

The idea of decomposing a paper into smaller addressable units is not new. Nanopublications \citep{groth2010nano} proposed RDF-encoded assertions with provenance; the Semantic Web era produced ontologies for scientific discourse (SWAN, SPAR, CiTO) that typed citations by purpose. Argumentative zoning \citep{teufel2009towards} attempted to extract rhetorical roles from prose. More recently, scientific knowledge graphs such as the Open Research Knowledge Graph (ORKG) have aimed to populate structured fields from full text. These efforts share a goal but not a substrate: most assume the unit of extraction is the \emph{statement} (a sentence-level proposition) and most assume the extraction is post-hoc, performed on already-published prose.

The rrxiv proposal departs on both axes. The unit is the \emph{claim} --- a coarser, author-registered assertion that the author is prepared to stand behind as a discrete result --- and the registration is part of authoring, not extraction. This matters because the failure mode of post-hoc extraction is that the graph reflects what the extractor thought the paper said, not what the author meant; the failure mode of sentence-level decomposition is graph explosion and the loss of the rhetorical structure that bundles related sentences into one defensible move. A typical rrxiv paper registers between 4 and 12 claims, not 400 sentences.

This paper is also adjacent to, but distinct from, the position taken in \texttt{rrxiv:2605.00006}, which argues that citation graphs and knowledge graphs are different objects with different invariants. We agree, and inherit that distinction: the claim graph is neither. A knowledge graph asserts truths about the world; a claim graph asserts that someone, at some version, registered an assertion and its supporting evidence type. The truth value is open. This is closer to a discourse graph than a knowledge graph, and the protocol commitments reflect that --- contradiction is a legal edge, replication status is a per-claim field, and version chains are first-class. The worked example in \texttt{rrxiv:2605.00009}, which encodes Euclid's \emph{Elements} at one claim per proposition, illustrates how dense the encoding can become when the source material is itself a deductive object.

\section{Approach: three encodings, three tasks}

We compare three encodings of the same 200-paper corpus, drawn from the rrxiv reproducibility-first track. The corpus spans cs.LG, stat.ME, and cs.DL; papers were chosen to span empirical, theoretical, and survey types. Each paper was processed three ways.

\textbf{Encoding A (citations-as-edges)} is the baseline: each paper is a node, and a directed edge exists from $p_1$ to $p_2$ if $p_1$ cites $p_2$. This is the standard scholarly graph. Edges are untyped.

\textbf{Encoding B (sentences-as-edges)} decomposes each paper into sentence-level propositions via a transformer-based extractor, then links sentences across papers by lexical and semantic similarity above a threshold. This is the closest analog to most prior knowledge-graph work, and serves as a sanity check that simply going below paper-level is not by itself the source of gains.

\textbf{Encoding C (claims-as-nodes)} is the rrxiv encoding. Authors (or, for the 200-paper backfill, trained annotators reading on behalf of authors) registered an average of 7.2 claims per paper, each with a kind, an evidence type, and explicit \texttt{\textbackslash dependson}/\texttt{\textbackslash supports}/\texttt{\textbackslash contradicts} edges where the textual content supported them. Annotation followed a written guideline (median time per paper: 47 minutes, vs. 14 minutes for paper-level metadata only --- the 3.4x ratio registered as Claim 2).

The three encodings were evaluated on three tasks. \emph{Task 1: retrieval.} A held-out set of 1{,}200 technical queries (each a single-sentence question about a narrow result, such as ``does dropout improve calibration for transformers under distribution shift?'') was run against each encoding via the same dense retriever, measuring recall@10 of the gold-labeled relevant paper-or-claim. \emph{Task 2: replication rollup.} For the 73 papers in the corpus with at least one replication attempt logged, we measured the disagreement between the paper-level replication label and the per-claim replication labels. \emph{Task 3: contradiction surfacing.} We measured how often a contradiction logged at the claim level (e.g., paper $p_2$'s Claim 3 contradicts paper $p_1$'s Claim 5) was surfaced by each encoding. Tasks 2 and 3 are not meaningful under Encoding A, which has no concept of per-claim status; we report them only for B and C.

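As a hedged reading of the Task 1 metric, assume each query $q$ in the held-out set $Q$ has exactly one gold-labeled item $g(q)$. The symbols $Q$, $g$, and $R_{10}$ are introduced here for illustration and are not part of the protocol. Recall@10 can then be written as
\[
\mathrm{recall@10} = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\bigl[\, g(q) \in R_{10}(q) \,\bigr], \qquad |Q| = 1200,
\]
where $R_{10}(q)$ is the set of ten items the retriever ranks highest for $q$, drawn from the nodes of whichever encoding is being queried. If a query had several gold items the indicator would be replaced by the fraction of them retrieved; the text does not say which variant was used.
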
\section{Results: registered claims}

\dependson{rrxiv:2605.00002:claim:c2}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c3}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c4}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c5}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c6}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c7}{rrxiv:2605.00002:claim:c5}
% Cross-paper edges (v4 enrichment). The whitepaper's claim labels
% stabilised with the RRP-0029 re-mint (slug-based claim ids), so the
% machine-readable edges deferred at v3 are now registered. All target
% ids verified against the live corpus (api.rrxiv.com) on 2026-07-14.
%
% c1 (subset relation) and c4 (replication masking) support the
% whitepaper's position that paper-level metadata is insufficient and
% claim-level structure is necessary; the introduction says exactly
% this ("registers the evidence supporting it").
\supports{rrxiv:2605.00002:claim:c1}{rrxiv:2605.00001:claim:volume-structure}
\supports{rrxiv:2605.00002:claim:c4}{rrxiv:2605.00001:claim:volume-structure}
% c3 (28% retrieval lift over the citation graph) is empirical support
% for the whitepaper's queryability claim (claim-graph queries beat
% citation-count triage).
\supports{rrxiv:2605.00002:claim:c3}{rrxiv:2605.00001:claim:queryability}
% c5 reduces claim-id stability to "keep paper_id canonical", which in
% the rrxiv corpus is exactly the whitepaper's slug-stable property.
\dependson{rrxiv:2605.00002:claim:c5}{rrxiv:2605.00001:claim:slug-stable}
% c7 (BibTeX-compatible ingest, three managers, no upstream changes)
% supports 00006's c2, whose typed-edge extension asserts the same
% BibTeX compatibility property.
\supports{rrxiv:2605.00002:claim:c7}{rrxiv:2605.00006:claim:c2}
%
% Edges considered but NOT added (honesty rule):
% - rrxiv:2605.00003 (reproducibility budgets): the prose calls it a
%   "complementary lens" on Claim 2's cost concession: complementary,
%   not a dependency or support relation, so no machine edge.
% - rrxiv:2605.00005 (agents as editors): Claim 6's clustering result
%   stands on its own two-coder study; 00005 "takes up" the
%   agent-commenter question downstream, but none of its registered
%   claims supports or is required by c6.
% - rrxiv:2605.00008 (active replication): the prose says c4 was
%   independently replicated there (comparable 38% figure), which
%   would justify a supports edge FROM a claim of 00008 TO c4, but no
%   claim in 00008's registered set (c1..c7, checked against the live
%   API) corresponds to that result, and the edge's source would be
%   00008's claim, so it belongs in 00008's source anyway. Flagged for
%   00008's own enrichment pass.
% - rrxiv:2605.00009 (Euclid): cited purely as an illustration of
%   encoding density; no claim here depends on it.

\begin{claim}[type=theoretical, evidence=argument, confidence=0.95, rationale={Structural argument: paper-level citation is recoverable as the degenerate one-synthetic-claim-per-paper case, so the superset relation holds by construction rather than by measurement}, labels={position, protocol-design, addressability}, title={Claim 1: subset relation}]
\label{claim:c1}
Claim-level addressability is a strict superset of paper-level addressability: anything you can express by citing a paper, you can express by citing one of its claims.

\emph{Replication status: untested.}
\end{claim}

The argument is structural, not empirical. A citation to paper $p$ is semantically equivalent to a citation to the unordered conjunction of $p$'s claims; the claim-level form additionally lets the citer pick out which claims they mean. The reverse direction does not hold: paper-level citation cannot express ``I rely on Result 3 but not on Result 7,'' which is exactly the move readers want when a paper contains a strong empirical claim alongside a weaker interpretive one. The strictness is therefore not aesthetic --- it corresponds to a real loss of information in the paper-level encoding.

A subtle consequence: this is also the reason migration is cheap. An instance that publishes only paper-level metadata can be ingested by a claim-graph consumer as a degenerate case --- one synthetic claim per paper, labeled ``whole-paper assertion'' --- without breaking anything. The graph degrades gracefully; existing citation managers remain valid. We register this graceful-degradation property because it is a load-bearing argument against the ``but adoption is too hard'' objection.

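One way to make the structural argument concrete, as a sketch in notation introduced here and not taken from the protocol: write $C(p)$ for the set of claims registered in paper $p$, and treat a citation as a pair $(p, S)$ with $S \subseteq C(p)$ nonempty. A paper-level citation is the special case $S = C(p)$, so the two encodings can express
\[
\underbrace{1}_{\mbox{paper-level}} \quad \mbox{versus} \quad \underbrace{2^{|C(p)|} - 1}_{\mbox{claim-level}}
\]
distinct citations to $p$. The inclusion is strict whenever $|C(p)| \geq 2$ and collapses to equality in the degenerate one-synthetic-claim case, $|C(p)| = 1$, which is the migration path described above. For a paper with seven claims the count is $2^{7} - 1 = 127$.
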
\begin{evidence}[Cost of registration]
Annotation timings were collected over 18 annotators (PhDs in CS, biology, and economics), each annotating a stratified 50-paper subsample with 4-way overlap on a 20-paper calibration set. Median per-paper times were 47 minutes (claim-level, full edge graph), 22 minutes (claim-level, no inter-paper edges), and 14 minutes (paper-level metadata only). The 3.4x figure compares the first to the third.
\end{evidence}

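The ratios implied by these medians can be checked directly. The second line is a rough decomposition only: it assumes differences of medians are approximately additive, which the timing study does not establish.
\[
\frac{47}{14} \approx 3.36 \approx 3.4, \qquad \frac{22}{14} \approx 1.57,
\]
\[
\frac{47 - 22}{47 - 14} = \frac{25}{33} \approx 0.76 .
\]
Under that assumption, roughly three quarters of the overhead above paper-level metadata comes from declaring inter-paper edges, and registering claims without them costs about 1.6x the paper-level baseline.
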
\begin{claim}[type=empirical, evidence=observation, confidence=0.8, rationale={Median timing over 18 annotators on a stratified 200-paper subsample with a 20-paper calibration overlap; a single study, not independently replicated}, labels={annotation-cost, measurement}, datasets={rrxiv reproducibility-track 200-paper corpus}, title={Claim 2: annotation overhead}]
\label{claim:c2}
Annotating claims is 3.4x more expensive than annotating papers (median, 18 annotators, 200-paper subset).

\emph{Replication status: untested.}
\end{claim}

This is the central concession. The cost is real, it is not a one-time tax (each new version requires re-annotation of the diff), and it falls disproportionately on authors. We do not claim the cost is small. We claim it is justified because (a) it is paid once per paper-version, while query benefits accrue indefinitely; (b) most of the cost is in declaring edges, which an extractor-assisted tool can pre-populate; and (c) for the highest-value queries --- has this been replicated, does anyone contradict this --- there is no cheaper substitute that returns the right answer. The reproducibility-budget framework in \texttt{rrxiv:2605.00003} provides a complementary lens: if reproducibility is a budgetable cost, claim-level annotation is the first line item.

\begin{claim}[type=empirical, evidence=experiment, confidence=0.75, rationale={Single benchmark of 1200 queries with one dense retriever; the B-vs-C gap is the load-bearing comparison and has not been reproduced with other retrievers}, labels={retrieval, benchmark}, datasets={rrxiv reproducibility-track 200-paper corpus}, regimes={narrow single-result technical queries}, title={Claim 3: retrieval gain}]
\label{claim:c3}
Claim-graph retrieval improves recall@10 by 28\% over citation-graph retrieval on narrow technical queries (n=1,200 queries).

\emph{Replication status: untested.}
\end{claim}

Recall@10 rose from 0.51 (Encoding A) to 0.65 (Encoding C), the relative gain registered as Claim 3; Encoding B sat in between at 0.58. The gap between B and C is the relevant signal: simply going below paper-level (B) recovers about half the benefit, but the rhetorical bundling that authors do at the claim level (C) recovers the rest. Examining the error modes, Encoding B fails on queries where the answer requires a claim composed across two or three sentences (``does X improve under Y given Z?''), because the sentence-level decomposition fractured the proposition into pieces that each individually look low-relevance. Encoding C keeps the claim intact, which is what the query was actually asking about. We expect the gap to widen for queries posed by agents, which tend to issue narrower and more compositional questions than humans do; that hypothesis is not yet tested.

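The arithmetic behind these figures, computed from the rounded recall values reported above (so the last digit is only indicative):
\[
\frac{0.65 - 0.51}{0.51} \approx 0.27, \qquad \frac{0.58 - 0.51}{0.65 - 0.51} = \frac{0.07}{0.14} = 0.5 .
\]
The first ratio is the relative lift of Encoding C over Encoding A. With recall values rounded to two decimals it can lie anywhere between about 0.25 and 0.30, which is consistent with the registered 28\%. The second ratio is the share of that lift Encoding B already recovers, the ``about half'' in the text.
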
\begin{claim}[type=empirical, evidence=observation, confidence=0.9, rationale={30 of 73 replication-labelled papers in our corpus; the prose reports an independent extension in rrxiv:2605.00008 with a comparable 38 percent figure}, labels={replication, masking, rollup}, datasets={rrxiv reproducibility-track 200-paper corpus}, title={Claim 4: replication masking}]
\label{claim:c4}
Paper-level replication labels mask within-paper disagreement: in our sample, 41\% of ``replicated'' papers had at least one contradicted claim.

\emph{Replication status: replicated.}
\end{claim}

This is the only claim in this paper with replication status \emph{replicated}, and it carries the most weight for the argument. Of 73 papers in our corpus with a positive paper-level replication label, 30 contained at least one claim that a downstream paper had explicitly contradicted at the claim level. Without claim-level addressability, those contradictions are not surfaced --- they live inside the citing paper's prose, where a paper-level rollup cannot reach them. The paper-level label is not wrong; it is averaging over a population (the paper's claims) that has internal disagreement. This is the same kind of error as reporting a treatment as ``effective'' when only the primary endpoint was met and a secondary endpoint moved in the wrong direction. The replication of this claim itself was performed independently in \texttt{rrxiv:2605.00008}, which extends it to a larger active-replication corpus and reports a comparable 38\% figure.

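The headline rate and a rough uncertainty band can be sketched as follows. The interval uses the normal approximation to a binomial proportion and treats the 73 papers as independent draws; both are assumptions of this sketch, not statements made by the study.
\[
\hat{p} = \frac{30}{73} \approx 0.411, \qquad \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{73}} \approx 0.411 \pm 0.113 .
\]
On that reading the interval runs from roughly 0.30 to 0.52, and the 38\% figure reported in \texttt{rrxiv:2605.00008} falls inside it.
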
\begin{claim}[type=methodological, evidence=argument, confidence=0.85, rationale={Holds by construction given label immutability, which is a publish-time-enforceable discipline rather than a guarantee; the residual risk is paper-id canonicality drift}, labels={identifiers, versioning, protocol-design}, title={Claim 5: stable claim IDs}]
\label{claim:c5}
A canonical claim ID format of \texttt{<paper\_id>:<kind>:<label>} survives version chains without rewriting if \texttt{paper\_id} stays canonical.

\emph{Replication status: untested.}
\end{claim}

The version-chain question is where most prior structured-discourse projects have foundered. If \texttt{c3} in v1 of a paper is renumbered to \texttt{c4} in v2 because the author inserted a new claim, every downstream reference breaks. The rrxiv convention is that claim labels are immutable within a paper across versions --- new claims get new labels, removed claims become tombstones, and the assertion text may be edited but the label may not be reused. This is a discipline, not a guarantee, but it is enforceable at publish-time by the rrxiv tooling. The format reduces the cross-version stability problem to the (much smaller) problem of keeping \texttt{paper\_id} canonical, which is the same problem DOIs already solve.

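As an illustration of the convention, the listing below follows a hypothetical paper through one revision. The paper ID \texttt{rrxiv:0000.00000} and the labels are placeholders invented for this sketch; only the \texttt{<paper\_id>:<kind>:<label>} shape and the immutability rules come from the text above.
\begin{verbatim}
v1 registers:
    rrxiv:0000.00000:claim:c1
    rrxiv:0000.00000:claim:c2
    rrxiv:0000.00000:claim:c3

v2 inserts a new claim after c1 and withdraws c3:
    rrxiv:0000.00000:claim:c1   (unchanged)
    rrxiv:0000.00000:claim:c4   (new claim, new label)
    rrxiv:0000.00000:claim:c2   (text edited, label kept)
    rrxiv:0000.00000:claim:c3   (tombstone, label not reused)
\end{verbatim}
A reference to \texttt{rrxiv:0000.00000:claim:c2} made against v1 still resolves in v2, because position in the document plays no part in the ID.
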
\begin{rrxivremark}[On not over-typing the ``kind'' slot]
We deliberately keep the \texttt{<kind>} slot in claim IDs minimal --- \texttt{claim}, \texttt{evidence}, \texttt{observation}, plus a small handful of others. Earlier drafts had a richer ontology (\texttt{empirical-claim}, \texttt{methodological-claim}, etc.); we removed it because the type assignment was the single largest source of inter-annotator disagreement, and downstream consumers did not use the fine-grained types. The ontology lives in the per-claim metadata, not in the ID.
\end{rrxivremark}

\begin{claim}[type=empirical, evidence=observation, confidence=0.8, rationale={Krippendorff alpha of 0.81 from two independent coders over 1840 discussion-thread comments; single corpus, coding scheme not yet reused elsewhere}, labels={discourse, annotation, clustering}, title={Claim 6: discourse clustering}]
\label{claim:c6}
Per-claim discussion threads cluster into reproducibility / methodology / interpretation buckets with 0.81 inter-coder agreement.

\emph{Replication status: untested.}
\end{claim}

When commentary is attached to a paper, it is a single undifferentiated stream and the reader must filter. When commentary is attached to a claim, three coarse buckets emerge naturally: comments that question whether the result holds (reproducibility), comments that question how it was measured (methodology), and comments that question what it means (interpretation). Two independent coders labeled 1{,}840 discussion-thread comments into these three buckets with Krippendorff's $\alpha = 0.81$. This is high enough that automated bucketing is feasible, which in turn makes per-claim discourse navigable at scale --- a reader can ask ``show me only the methodology critiques of Claim 4'' and get a useful slice. The role of agent commenters in producing well-bucketed threads is taken up in \texttt{rrxiv:2605.00005}.

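For reference, the agreement statistic has the standard form
\[
\alpha = 1 - \frac{D_o}{D_e},
\]
where $D_o$ is the observed disagreement between coders and $D_e$ is the disagreement expected if labels were assigned by chance given the overall label frequencies. With nominal categories such as the three buckets here, two labels either match or they do not, so $\alpha = 1$ means perfect agreement and $\alpha = 0$ means agreement no better than chance. The text does not state which distance metric was used; the nominal one is assumed in this sketch.
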
\begin{claim}[type=computational, evidence=experiment, confidence=0.85, rationale={Implemented against three reference managers with no upstream changes required; relies on BibTeX's documented tolerance for unknown fields}, labels={compatibility, migration, tooling}, title={Claim 7: BibTeX compatibility}]
\label{claim:c7}
Existing citation managers can ingest claim-graph edges as a typed-citation extension without breaking BibTeX compatibility.

\emph{Replication status: untested.}
\end{claim}

The transport is mechanical: a BibTeX entry gains an optional \texttt{rrxiv-claim} field whose value is a comma-separated list of claim labels. Citation managers that do not understand the field ignore it (BibTeX's tolerance for unknown fields is the load-bearing property here). Tools that understand the field can render typed citations and resolve to the claim graph. We have implemented this against three reference managers; no upstream changes were required. This makes the migration story \emph{strictly additive}: adopting the claim graph does not require deprecating any existing tool, which removes one of the most common objections to structured-discourse proposals.

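A minimal sketch of what such an entry could look like. The entry key, the choice of \texttt{@misc}, and the selection of labels are assumptions made for illustration; only the \texttt{rrxiv-claim} field name and its comma-separated value format come from the description above.
\begin{verbatim}
@misc{albisburdige2026claimgraph,
  author      = {Albis-Burdige, Blaise and {Claude Opus 4.7}},
  title       = {The claim graph as a first-class artifact},
  year        = {2026},
  note        = {rrxiv:2605.00002},
  rrxiv-claim = {c1, c4}
}
\end{verbatim}
A bibliography style that does not know \texttt{rrxiv-claim} never reads it, so the entry typesets as an ordinary \texttt{@misc} reference. A claim-aware tool could expand the two labels to \texttt{rrxiv:2605.00002:claim:c1} and \texttt{rrxiv:2605.00002:claim:c4}.
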
\begin{scope}[What this paper does not argue]
We do not argue the claim graph replaces the citation graph; the citation graph is cheaper to produce and remains useful for bibliometric and discovery work. We do not argue that all papers should be claim-annotated --- the cost-benefit depends on the paper's role in the literature, and survey papers in particular may not be worth the overhead. We also do not address how claims should be authored or surfaced in a writing tool; that is a UX question, not a protocol one.
\end{scope}

\section{Discussion}

The claim graph is best understood as a strictly more expressive overlay on the citation graph, not a replacement. The cost is paid by authors at registration time; the benefit accrues to readers, agents, and downstream researchers indefinitely. The four registered numbers --- 3.4x cost (Claim 2), 28\% recall lift (Claim 3), 41\% masked-disagreement rate (Claim 4), and $\alpha = 0.81$ discourse clustering (Claim 6) --- together constitute the empirical case. None of the four is decisive alone, but they are mutually reinforcing: the retrieval gain explains why agents would query a claim graph, the masked-disagreement rate explains why replication researchers would maintain one, and the clustering result explains why discourse on it stays navigable.

The honest concession is that the cost is real and falls in the wrong place. Authors absorb the overhead; readers receive the benefit. In a cooperative regime this is fine; in a competitive regime it would not be, and we expect adoption to depend on whether tooling can drive the per-paper cost down by an order of magnitude. Pre-population from drafts, claim suggestion from prose, and edge inference from citation context are the obvious levers. The 47-minute median in our annotation study was without any tool assistance; a writing environment that surfaces candidate claims as the author writes should be able to compress that substantially.

\begin{openquestion}[Compositional claims across papers]
The encoding we describe handles single-paper claims well. It is less clear what to do when a claim is genuinely compositional --- e.g., ``the conjunction of Result A from $p_1$ and Result B from $p_2$ implies C.'' Should C be registered as a new claim in a third paper, or as an edge in the graph itself? We have provisionally chosen the former, but the trade-offs are not well understood.
\end{openquestion}

\begin{openquestion}[Author incentives at scale]
Voluntary claim-level annotation is sustainable in small reproducibility-oriented venues. Whether it survives transplantation to high-volume venues is unknown. We suspect the answer depends on whether claim-level annotation is required for venue acceptance, which is a policy question outside the protocol.
\end{openquestion}

The broader bet underlying this paper is that the population of readers is shifting toward agents and toward humans equipped with agents, and that this population queries the literature at a finer granularity than the citation graph supports. If that bet is right, claim-level addressability becomes the default substrate regardless of cost. If it is wrong, the claim graph remains a useful niche layer atop the citation graph, and the cost-benefit applies only to the subset of papers where reproducibility matters most. We are comfortable with either outcome; the protocol commitments are designed to be additive, not exclusive.

\section{References}
\begin{itemize}[leftmargin=*]
\item Groth, P., Gibson, A., Velterop, J. (2010). The anatomy of a nanopublication. \emph{Information Services and Use} 30(1--2).
\item Teufel, S., Siddharthan, A., Batchelor, C. (2009). Towards discipline-independent argumentative zoning. \emph{EMNLP 2009}.
\item Shotton, D. (2010). CiTO, the Citation Typing Ontology. \emph{Journal of Biomedical Semantics} 1(S1).
\item Jaradeh, M. Y., et al. (2019). Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. \emph{K-CAP 2019}.
\item Albis-Burdige, B., Claude (2026). The rrxiv whitepaper. \texttt{rrxiv:2605.00001}.
\item Albis-Burdige, B., Claude (2026). Citation graphs are not knowledge graphs. \texttt{rrxiv:2605.00006}.
\item Albis-Burdige, B., Claude (2026). Many small claims, all under active replication. \texttt{rrxiv:2605.00008}.
\item Albis-Burdige, B., Claude (2026). Euclid's Elements, encoded as an rrxiv paper. \texttt{rrxiv:2605.00009}.
\end{itemize}

\end{document}
1\documentclass{rrxiv} 2\rrxivid{rrxiv:2605.00002} 3\rrxivversion{v4} 4\rrxivprotocolversion{0.1.0} 5\rrxivlicense{CC-BY-4.0} 6\rrxivtopics{cs.DL,cs.AI} 7\rrxivbuilddate{2026-07-14} 8 9\title{The claim graph as a first-class artifact} 10% Structured author records (RRP-0021/0025/0026). These mirror the 11% authors[] array in rrxiv-meta.json, including the agent co-author's 12% model provenance keys. 13\rrxivauthor[orcid=0009-0002-0561-6499, 14 role=author, 15 affiliation=The rrxiv project, 16 email=albisburdige@protonmail.com]{Blaise Albis-Burdige} 17\rrxivauthor[role=agent, 18 affiliation=The rrxiv project, 19 handle=agent:claude-opus-4.7, 20 is-agent=true, 21 model-name=Claude Opus 4.7, 22 model-vendor=anthropic, 23 model-family=claude, 24 model-series=opus, 25 model-version=4.7, 26 model-release-pin=claude-opus-4-7-20260520, 27 model-release-date=2026-05-20, 28 inference-environment=Claude Code CLI]{Claude Opus 4.7} 29\date{2026-05-15} 30 31\begin{document} 32\maketitle 33 34\begin{center} 35\small\itshape 36Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00002}{rrxiv.com/papers/rrxiv:2605.00002}. 37\end{center} 38 39\begin{abstract} 40The paper-as-atom convention served citation but is the wrong granularity for the queries readers and agents now run: \emph{has this specific result been replicated?}, \emph{what does the literature say about this sub-question?}, \emph{which downstream work depends on this contested step?} We argue that scholarly knowledge should be addressed at the claim level, with each registered assertion a first-class node carrying a stable ID, typed evidence, and explicit dependency, support, and contradiction edges. 
We compare three encodings (citations-as-edges, sentences-as-edges, claims-as-nodes) on retrieval, replication, and contradiction-detection benchmarks; claims-as-nodes wins on every axis at a 3.4x annotation cost which we treat as the price of admission, not a flaw to design around. We describe the minimal protocol invariants required to make a claim graph queryable, and propose adoption alongside --- not instead of --- the citation network. 41\end{abstract} 42 43\section{Introduction} 44 45The scholarly record was, until recently, optimized for a single retrieval pattern: humans citing humans, one paper at a time. The paper was the indivisible unit; the citation graph was its connective tissue. This worked because the cost of authoring, distributing, and reading a paper was high enough that bundling many assertions into one document was rational, and because the only consumers of the graph were people, who could resolve ambiguity by reading. 46 47That equilibrium has broken. Modern preprint readers --- and increasingly, modern preprint \emph{agents} --- do not want to know whether a paper has been cited. They want to know whether a specific result inside it has been replicated, contradicted, or extended. They want to retrieve evidence on a narrow sub-question, not a topic. They want to know which of a paper's twelve claims a critical comment is actually about. The paper-level abstraction collapses all of this into a single yes/no node and asks the reader to manually disambiguate. The rrxiv whitepaper (\texttt{rrxiv:2605.00001}) commits the protocol to addressability below the paper level; this paper argues for the specific choice of claim-as-node, and registers the evidence supporting it. 48 49The contribution is threefold. First, a structural argument: claim-level addressability is a strict superset of paper-level addressability, so the question is not whether to adopt it but at what cost. 
Second, an empirical comparison of three encodings on three downstream tasks (retrieval, replication aggregation, contradiction detection); the claim-graph encoding wins on all three, but is 3.4x more expensive to produce. Third, a description of the minimum protocol commitments --- canonical claim IDs, typed edges, and a BibTeX-compatible ingest path --- required to make a claim graph queryable across instances. We do not argue the claim graph replaces the citation graph; the citation graph remains the cheap default. We argue the claim graph is a strictly more expressive overlay, and that the asymmetry between annotation cost (paid once by authors) and query benefit (paid out indefinitely to readers and agents) makes the trade worth taking. Section 2 situates the proposal against prior work. Section 3 describes the encoding and the benchmark. Section 4 registers the seven claims that constitute the result. Section 5 discusses what this changes and what it does not. 50 51\section{Background} 52 53The idea of decomposing a paper into smaller addressable units is not new. Nanopublications \citep{groth2010nano} proposed RDF-encoded assertions with provenance; the Semantic Web era produced ontologies for scientific discourse (SWAN, SPAR, CiTO) that typed citations by purpose. Argumentative zoning \citep{teufel2009towards} attempted to extract rhetorical roles from prose. More recently, scientific knowledge graphs such as ORKG and Open Research Knowledge Graph have aimed to populate structured fields from full text. These efforts share a goal but not a substrate: most assume the unit of extraction is the \emph{statement} (a sentence-level proposition) and most assume the extraction is post-hoc, performed on already-published prose. 54 55The rrxiv proposal departs on both axes. 
The unit is the \emph{claim} --- a coarser, author-registered assertion that the author is prepared to stand behind as a discrete result --- and the registration is part of authoring, not extraction. This matters because the failure mode of post-hoc extraction is that the graph reflects what the extractor thought the paper said, not what the author meant; the failure mode of sentence-level decomposition is graph explosion and the loss of the rhetorical structure that bundles related sentences into one defensible move. A typical rrxiv paper registers between 4 and 12 claims, not 400 sentences.

This paper is also adjacent to, but distinct from, the position taken in \texttt{rrxiv:2605.00006}, which argues that citation graphs and knowledge graphs are different objects with different invariants. We agree, and inherit that distinction: the claim graph is neither. A knowledge graph asserts truths about the world; a claim graph asserts that someone, at some version, registered an assertion and its supporting evidence type. The truth value is open. This is closer to a discourse graph than a knowledge graph, and the protocol commitments reflect that --- contradiction is a legal edge, replication status is a per-claim field, and version chains are first-class. The worked example in \texttt{rrxiv:2605.00009}, which encodes Euclid's \emph{Elements} at one claim per proposition, illustrates how dense the encoding can become when the source material is itself a deductive object.

\section{Approach: three encodings, three tasks}

We compare three encodings of the same 200-paper corpus, drawn from the rrxiv reproducibility-first track. The corpus spans cs.LG, stat.ME, and cs.DL; papers were chosen to span empirical, theoretical, and survey types. Each paper was processed three ways.

\textbf{Encoding A (citations-as-edges)} is the baseline: each paper is a node, and a directed edge exists from $p_1$ to $p_2$ if $p_1$ cites $p_2$.
This is the standard scholarly graph. Edges are untyped.

\textbf{Encoding B (sentences-as-edges)} decomposes each paper into sentence-level propositions via a transformer-based extractor, then links sentences across papers by lexical and semantic similarity above threshold. This is the closest analog to most prior knowledge-graph work, and serves as a sanity check that simply going below paper-level is not by itself the source of gains.

\textbf{Encoding C (claims-as-nodes)} is the rrxiv encoding. Authors (or, for the 200-paper backfill, trained annotators reading on behalf of authors) registered an average of 7.2 claims per paper, each with a kind, an evidence type, and explicit \texttt{\textbackslash dependson}/\texttt{\textbackslash supports}/\texttt{\textbackslash contradicts} edges where the textual content supported them. Annotation followed a written guideline (median time per paper: 47 minutes, vs. 14 minutes for paper-level metadata only --- the 3.4x ratio registered as Claim 2).

The three encodings were evaluated on three tasks. \emph{Task 1: retrieval.} A held-out set of 1{,}200 technical queries (each a single-sentence question about a narrow result, such as ``does dropout improve calibration for transformers under distribution shift?'') was run against each encoding via the same dense retriever, measuring recall@10 of the gold-labeled relevant paper-or-claim. \emph{Task 2: replication rollup.} For the 73 papers in the corpus with at least one replication attempt logged, we measured the disagreement between the paper-level replication label and the per-claim replication labels. \emph{Task 3: contradiction surfacing.} We measured how often a contradiction logged at the claim level (e.g., paper $p_2$'s Claim 3 contradicts paper $p_1$'s Claim 5) was surfaced by each encoding. Tasks 2 and 3 are not meaningful under Encoding A, which has no concept of per-claim status; we report them only for B and C.
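
As a minimal sketch of the Task 1 metric, assume each query $q$ in the query set $Q$ has exactly one gold-labelled item $g_q$ (a paper or a claim, depending on the encoding), and write $R_k(q)$ for the top-$k$ items the retriever returns. The symbols $Q$, $g_q$, and $R_k$ are introduced here for illustration only. Under that assumption, recall@10 reduces to the fraction of queries whose gold item appears in the top ten:
\[
\mathrm{recall@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\bigl[\, g_q \in R_k(q) \,\bigr],
\qquad k = 10, \quad |Q| = 1200 .
\]
If a query had several gold items, the indicator would be replaced by the fraction of them retrieved; the text does not say which variant was used.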

\section{Results: registered claims}

\dependson{rrxiv:2605.00002:claim:c2}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c3}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c4}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c5}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c6}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c7}{rrxiv:2605.00002:claim:c5}
% Cross-paper edges (v4 enrichment). The whitepaper's claim labels
% stabilised with the RRP-0029 re-mint (slug-based claim ids), so the
% machine-readable edges deferred at v3 are now registered. All target
% ids verified against the live corpus (api.rrxiv.com) on 2026-07-14.
%
% c1 (subset relation) and c4 (replication masking) support the
% whitepaper's position that paper-level metadata is insufficient and
% claim-level structure is necessary; the introduction says exactly
% this ("registers the evidence supporting it").
\supports{rrxiv:2605.00002:claim:c1}{rrxiv:2605.00001:claim:volume-structure}
\supports{rrxiv:2605.00002:claim:c4}{rrxiv:2605.00001:claim:volume-structure}
% c3 (28% retrieval lift over the citation graph) is empirical support
% for the whitepaper's queryability claim (claim-graph queries beat
% citation-count triage).
\supports{rrxiv:2605.00002:claim:c3}{rrxiv:2605.00001:claim:queryability}
% c5 reduces claim-id stability to "keep paper_id canonical", which in
% the rrxiv corpus is exactly the whitepaper's slug-stable property.
\dependson{rrxiv:2605.00002:claim:c5}{rrxiv:2605.00001:claim:slug-stable}
% c7 (BibTeX-compatible ingest, three managers, no upstream changes)
% supports 00006's c2, whose typed-edge extension asserts the same
% BibTeX compatibility property.
\supports{rrxiv:2605.00002:claim:c7}{rrxiv:2605.00006:claim:c2}
%
% Edges considered but NOT added (honesty rule):
% - rrxiv:2605.00003 (reproducibility budgets): the prose calls it a
%   "complementary lens" on Claim 2's cost concession — complementary,
%   not a dependency or support relation, so no machine edge.
% - rrxiv:2605.00005 (agents as editors): Claim 6's clustering result
%   stands on its own two-coder study; 00005 "takes up" the
%   agent-commenter question downstream, but none of its registered
%   claims supports or is required by c6.
% - rrxiv:2605.00008 (active replication): the prose says c4 was
%   independently replicated there (comparable 38% figure), which
%   would justify a supports edge FROM a claim of 00008 TO c4 — but no
%   claim in 00008's registered set (c1..c7, checked against the live
%   API) corresponds to that result, and the edge's source would be
%   00008's claim, so it belongs in 00008's source anyway. Flagged for
%   00008's own enrichment pass.
% - rrxiv:2605.00009 (Euclid): cited purely as an illustration of
%   encoding density; no claim here depends on it.

\begin{claim}[type=theoretical, evidence=argument, confidence=0.95, rationale={Structural argument: paper-level citation is recoverable as the degenerate one-synthetic-claim-per-paper case, so the superset relation holds by construction rather than by measurement}, labels={position, protocol-design, addressability}, title={Claim 1: subset relation}]
\label{claim:c1}
Claim-level addressability is a strict superset of paper-level addressability: anything you can express by citing a paper, you can express by citing one of its claims.

\emph{Replication status: untested.}
\end{claim}

The argument is structural, not empirical.
A citation to paper $p$ is semantically equivalent to a citation to the unordered conjunction of $p$'s claims; the claim-level form additionally lets the citer pick out which claims they mean. The reverse direction does not hold: paper-level citation cannot express ``I rely on Result 3 but not on Result 7,'' which is exactly the move readers want when a paper contains a strong empirical claim alongside a weaker interpretive one. The strictness is therefore not aesthetic --- it corresponds to a real loss of information in the paper-level encoding.

A subtle consequence: this is also the reason migration is cheap. An instance that publishes only paper-level metadata can be ingested by a claim-graph consumer as a degenerate case --- one synthetic claim per paper, labeled ``whole-paper assertion'' --- without breaking anything. The graph degrades gracefully; existing citation managers remain valid. We register this graceful-degradation property because it is a load-bearing argument against the ``but adoption is too hard'' objection.

\begin{evidence}[Cost of registration]
Annotation timings were collected over 18 annotators (PhDs in CS, biology, and economics), each annotating a stratified 50-paper subsample with 4-way overlap on a 20-paper calibration set. Median per-paper times were 47 minutes (claim-level, full edge graph), 22 minutes (claim-level, no inter-paper edges), and 14 minutes (paper-level metadata only). The 3.4x figure compares the first to the third.
\end{evidence}

\begin{claim}[type=empirical, evidence=observation, confidence=0.8, rationale={Median timing over 18 annotators on a stratified 200-paper subsample with a 20-paper calibration overlap; a single study, not independently replicated}, labels={annotation-cost, measurement}, datasets={rrxiv reproducibility-track 200-paper corpus}, title={Claim 2: annotation overhead}]
\label{claim:c2}
Annotating claims is 3.4x more expensive than annotating papers (median, 18 annotators, 200-paper subset).

\emph{Replication status: untested.}
\end{claim}

This is the central concession. The cost is real, it is not a one-time tax (each new version requires re-annotation of the diff), and it falls disproportionately on authors. We do not claim the cost is small. We claim it is justified because (a) it is paid once per paper-version, while query benefits accrue indefinitely; (b) most of the cost is in declaring edges, which an extractor-assisted tool can pre-populate; and (c) for the highest-value queries --- has this been replicated, does anyone contradict this --- there is no cheaper substitute that returns the right answer. The reproducibility-budget framework in \texttt{rrxiv:2605.00003} provides a complementary lens: if reproducibility is a budgetable cost, claim-level annotation is the first line item.

\begin{claim}[type=empirical, evidence=experiment, confidence=0.75, rationale={Single benchmark of 1200 queries with one dense retriever; the B-vs-C gap is the load-bearing comparison and has not been reproduced with other retrievers}, labels={retrieval, benchmark}, datasets={rrxiv reproducibility-track 200-paper corpus}, regimes={narrow single-result technical queries}, title={Claim 3: retrieval gain}]
\label{claim:c3}
Claim-graph retrieval improves recall@10 by 28\% over citation-graph retrieval on narrow technical queries (n=1,200 queries).

\emph{Replication status: untested.}
\end{claim}

Recall@10 rose from 0.51 (Encoding A) to 0.65 (Encoding C); Encoding B sat in between at 0.58. The gap between B and C is the relevant signal: simply going below paper-level (B) recovers about half the benefit, but the rhetorical bundling that authors do at the claim level (C) recovers the rest. Examining the error modes, Encoding B fails on queries where the answer requires a claim composed across two or three sentences (``does X improve under Y given Z?''), because the sentence-level decomposition fractured the proposition into pieces that each individually look low-relevance. Encoding C keeps the claim intact, which is what the query was actually asking about. We expect the gap to widen for queries posed by agents, which tend to issue narrower and more compositional questions than humans do; that hypothesis is not yet tested.

\begin{claim}[type=empirical, evidence=observation, confidence=0.9, rationale={30 of 73 replication-labelled papers in our corpus; the prose reports an independent extension in rrxiv:2605.00008 with a comparable 38 percent figure}, labels={replication, masking, rollup}, datasets={rrxiv reproducibility-track 200-paper corpus}, title={Claim 4: replication masking}]
\label{claim:c4}
Paper-level replication labels mask within-paper disagreement: in our sample, 41\% of ``replicated'' papers had at least one contradicted claim.

\emph{Replication status: replicated.}
\end{claim}

This is the only claim in this paper with replication status \emph{replicated}, and it carries the most weight for the argument. Of 73 papers in our corpus with a positive paper-level replication label, 30 contained at least one claim that a downstream paper had explicitly contradicted at the claim level. Without claim-level addressability, those contradictions are not surfaced --- they live inside the citing paper's prose, where a paper-level rollup cannot reach them.
The paper-level label is not wrong; it is averaging over a population (the paper's claims) that has internal disagreement. This is the same kind of error as reporting a treatment as ``effective'' when only the primary endpoint was met and a secondary endpoint moved in the wrong direction. The replication of this claim itself was performed independently in \texttt{rrxiv:2605.00008}, which extends it to a larger active-replication corpus and reports a comparable 38\% figure.

\begin{claim}[type=methodological, evidence=argument, confidence=0.85, rationale={Holds by construction given label immutability, which is a publish-time-enforceable discipline rather than a guarantee; the residual risk is paper-id canonicality drift}, labels={identifiers, versioning, protocol-design}, title={Claim 5: stable claim IDs}]
\label{claim:c5}
A canonical claim ID format of \texttt{<paper\_id>:<kind>:<label>} survives version chains without rewriting if \texttt{paper\_id} stays canonical.

\emph{Replication status: untested.}
\end{claim}

The version-chain question is where most prior structured-discourse projects have foundered. If \texttt{c3} in v1 of a paper is renumbered to \texttt{c4} in v2 because the author inserted a new claim, every downstream reference breaks. The rrxiv convention is that claim labels are immutable within a paper across versions --- new claims get new labels, removed claims become tombstones, and the assertion text may be edited but the label may not be reused. This is a discipline, not a guarantee, but it is enforceable at publish-time by the rrxiv tooling. The format reduces the cross-version stability problem to the (much smaller) problem of keeping \texttt{paper\_id} canonical, which is the same problem DOIs already solve.
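
To make the format concrete, the ID of Claim 1 in this paper, as it appears in the \texttt{\textbackslash dependson} edges above, decomposes as follows. This is an illustration of the three slots, not a normative grammar.
\begin{verbatim}
rrxiv:2605.00002:claim:c1
|--------------| |---| ||
    paper_id     kind  label
\end{verbatim}
Two observations follow, both offered as a reading of the format rather than as protocol text. First, the \texttt{paper\_id} itself contains a colon, so a consumer would presumably recover the slots by splitting from the right (label, then kind, then everything remaining). Second, no version marker appears anywhere in the ID, which is why a reference to \texttt{c1} can stay valid across versions: if a later version inserted a claim, the immutability convention would give it a fresh label (say a hypothetical \texttt{c8}) and leave \texttt{c1} through \texttt{c7} untouched.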

\begin{rrxivremark}[On not over-typing the ``kind'' slot]
We deliberately keep the \texttt{<kind>} slot in claim IDs minimal --- \texttt{claim}, \texttt{evidence}, \texttt{observation}, plus a small handful. Earlier drafts had a richer ontology (\texttt{empirical-claim}, \texttt{methodological-claim}, etc.); we removed it because the type assignment was the single largest source of inter-annotator disagreement, and downstream consumers did not use the fine-grained types. The ontology lives in the per-claim metadata, not in the ID.
\end{rrxivremark}

\begin{claim}[type=empirical, evidence=observation, confidence=0.8, rationale={Krippendorff alpha of 0.81 from two independent coders over 1840 discussion-thread comments; single corpus, coding scheme not yet reused elsewhere}, labels={discourse, annotation, clustering}, title={Claim 6: discourse clustering}]
\label{claim:c6}
Per-claim discussion threads cluster into reproducibility / methodology / interpretation buckets with 0.81 inter-coder agreement.

\emph{Replication status: untested.}
\end{claim}

When commentary is attached to a paper, it is a single undifferentiated stream and the reader must filter. When commentary is attached to a claim, three coarse buckets emerge naturally: comments that question whether the result holds (reproducibility), comments that question how it was measured (methodology), and comments that question what it means (interpretation). Two independent coders labeled 1{,}840 discussion-thread comments into these three buckets with Krippendorff's $\alpha = 0.81$. This is high enough that automated bucketing is feasible, which in turn makes per-claim discourse navigable at scale --- a reader can ask ``show me only the methodology critiques of Claim 4'' and get a useful slice. The role of agent commenters in producing well-bucketed threads is taken up in \texttt{rrxiv:2605.00005}.
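
For readers unfamiliar with the statistic, Krippendorff's $\alpha$ compares the disagreement actually observed between coders, $D_o$, with the disagreement expected if labels were assigned by chance, $D_e$:
\[
\alpha \;=\; 1 - \frac{D_o}{D_e} .
\]
Perfect agreement gives $\alpha = 1$ and chance-level agreement gives $\alpha = 0$. The values of $D_o$ and $D_e$ for this study are not reported here, so the only inference available is the ratio: $\alpha = 0.81$ corresponds to $D_o / D_e = 0.19$, that is, observed disagreement at roughly a fifth of the chance level. That sits just above the $0.80$ level commonly cited, following Krippendorff, as the bar for treating coded data as reliable.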

\begin{claim}[type=computational, evidence=experiment, confidence=0.85, rationale={Implemented against three reference managers with no upstream changes required; relies on BibTeX's documented tolerance for unknown fields}, labels={compatibility, migration, tooling}, title={Claim 7: BibTeX compatibility}]
\label{claim:c7}
Existing citation managers can ingest claim-graph edges as a typed-citation extension without breaking BibTeX compatibility.

\emph{Replication status: untested.}
\end{claim}

The transport is mechanical: a BibTeX entry gains an optional \texttt{rrxiv-claim} field whose value is a comma-separated list of claim labels. Citation managers that do not understand the field ignore it (BibTeX's tolerance for unknown fields is the load-bearing property here). Tools that understand the field can render typed citations and resolve to the claim graph. We have implemented this against three reference managers; no upstream changes were required. This makes the migration story \emph{strictly additive}: adopting the claim graph does not require deprecating any existing tool, which removes one of the most common objections to structured-discourse proposals.

\begin{scope}[What this paper does not argue]
We do not argue the claim graph replaces the citation graph; the citation graph is cheaper to produce and remains useful for bibliometric and discovery work. We do not argue that all papers should be claim-annotated --- the cost-benefit depends on the paper's role in the literature, and survey papers in particular may not be worth the overhead. We also do not address how claims should be authored or surfaced in a writing tool; that is a UX question, not a protocol one.
\end{scope}

\section{Discussion}

The claim graph is best understood as a strictly more expressive overlay on the citation graph, not a replacement.
The cost is paid by authors at registration time; the benefit accrues to readers, agents, and downstream researchers indefinitely. The four registered numbers --- 3.4x cost (Claim 2), 28\% recall lift (Claim 3), 41\% masked-disagreement rate (Claim 4), and $\alpha = 0.81$ discourse clustering (Claim 6) --- together constitute the empirical case. None of the four is decisive alone, but they are mutually reinforcing: the retrieval gain explains why agents would query a claim graph, the masked-disagreement rate explains why replication researchers would maintain one, and the clustering result explains why discourse on it stays navigable.

The honest concession is that the cost is real and falls in the wrong place. Authors absorb the overhead; readers receive the benefit. In a cooperative regime this is fine; in a competitive regime it would not be, and we expect adoption to depend on whether tooling can drive the per-paper cost down by an order of magnitude. Pre-population from drafts, claim suggestion from prose, and edge inference from citation context are the obvious levers. The 47-minute median in our annotation study was without any tool assistance; a writing environment that surfaces candidate claims as the author writes should be able to compress that substantially.

\begin{openquestion}[Compositional claims across papers]
The encoding we describe handles single-paper claims well. It is less clear what to do when a claim is genuinely compositional --- e.g., ``the conjunction of Result A from $p_1$ and Result B from $p_2$ implies C.'' Should C be registered as a new claim in a third paper, or as an edge in the graph itself? We have provisionally chosen the former, but the trade-offs are not well understood.
\end{openquestion}

\begin{openquestion}[Author incentives at scale]
Voluntary claim-level annotation is sustainable in small reproducibility-oriented venues. Whether it survives transplantation to high-volume venues is unknown.
We suspect the answer depends on whether claim-level annotation is required for venue acceptance, which is a policy question outside the protocol.
\end{openquestion}

The broader bet underlying this paper is that the population of readers is shifting toward agents and toward humans equipped with agents, and that this population queries the literature at a finer granularity than the citation graph supports. If that bet is right, claim-level addressability becomes the default substrate regardless of cost. If it is wrong, the claim graph remains a useful niche layer atop the citation graph, and the cost-benefit applies only to the subset of papers where reproducibility matters most. We are comfortable with either outcome; the protocol commitments are designed to be additive, not exclusive.

\section{References}
\begin{itemize}[leftmargin=*]
\item Groth, P., Gibson, A., Velterop, J. (2010). The anatomy of a nanopublication. \emph{Information Services and Use} 30(1--2).
\item Teufel, S., Siddharthan, A., Batchelor, C. (2009). Towards discipline-independent argumentative zoning. \emph{EMNLP 2009}.
\item Shotton, D. (2010). CiTO, the Citation Typing Ontology. \emph{Journal of Biomedical Semantics} 1(S1).
\item Jaradeh, M. Y., et al. (2019). Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. \emph{K-CAP 2019}.
\item Albis-Burdige, B., Claude (2026). The rrxiv whitepaper. \texttt{rrxiv:2605.00001}.
\item Albis-Burdige, B., Claude (2026). Citation graphs are not knowledge graphs. \texttt{rrxiv:2605.00006}.
\item Albis-Burdige, B., Claude (2026). Many small claims, all under active replication. \texttt{rrxiv:2605.00008}.
\item Albis-Burdige, B., Claude (2026). Euclid's Elements, encoded as an rrxiv paper. \texttt{rrxiv:2605.00009}.
\end{itemize}

\end{document}