Guide · Anatomy of one claim, end to end
Claim-by-claim AI resume scoring: one claim, all the way down.
This guide is for the recruiter who landed here from a Reddit thread asking what claim-by-claim AI resume scoring actually does on a real req. On a Senior AE req, we walk a single claim from a sentence in the JD, to a weighted match against a span of the resume, to a logged override. The shape is the same on a Staff engineer req, a security analyst req, and an ops manager req. The example is sales because the existing technical-hiring guide already used Infra SWE.
Direct answer · Verified 2026-04-29
Claim-by-claim AI resume scoring decomposes each job description into 5 to 15 testable claims with visible weights, classifies each claim as must-have, nice-to-have, or red flag, finds (or fails to find) a span of the resume as evidence for each, and computes the score as the normalized weighted sum. Every weight override is a logged tool call. The rubric and the decision share one audit trail; the score is the byproduct, not the artifact.
Source verified against the 10xats Match Rating implementation and the public AI technical hiring evaluation guide. The same shape is documented inside the MCP tool surface as match.list_claims, match.override_weight, and match.rerun_score.
The shape of one claim
Six fields. Nothing implicit, nothing hidden. The whole record is what the model returns and what the audit log carries forward.
id
Stable string. Survives weight overrides and ledger reruns, so the audit trail can re-attach to a specific claim across edits.
text
One sentence, testable against a span of resume text. If the sentence cannot be tested, the claim is malformed.
kind
must_have, nice_to_have, or red_flag. Three buckets, no fourth option. The recruiter can reclassify any claim.
weight
Integer, conventionally 1 to 5. Visible to the recruiter, editable through match.override_weight. Drives the score arithmetic.
evidence_span
The substring of the resume the model considers proof. Null when nothing in the resume supports the claim. The recruiter can promote a different span.
supported
Boolean. Derived from evidence_span. The score formula reads this field; everything else is metadata for the recruiter and the auditor.
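As a concrete sketch of that record (the field names come from the list above; the exact wire format and types are an assumption), one claim might look like this in TypeScript:

```typescript
// The six-field claim record described above. Field names follow the list;
// the exact serialization is an assumption, not the documented API payload.
type ClaimKind = "must_have" | "nice_to_have" | "red_flag";

interface Claim {
  id: string;                   // stable across overrides and ledger reruns
  text: string;                 // one sentence, testable against a resume span
  kind: ClaimKind;              // three buckets, no fourth option
  weight: number;               // integer, conventionally 1 to 5
  evidence_span: string | null; // literal resume substring, or null
  supported: boolean;           // derived from evidence_span
}

// The quota claim used throughout this guide, after scoring one resume.
const c2: Claim = {
  id: "c2",
  text: "Carried a $1.5M+ annual quota in the last 24 months",
  kind: "must_have",
  weight: 4,
  evidence_span: "Datadog · Senior AE · 2023-2025 · $1.8M ACV target, 112% attainment FY24",
  supported: true,
};
```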
The lifecycle of one claim
From a sentence in the JD to a row in the audit log. Five stages, each addressable as a tool call, each visible to the recruiter.
JD sentence → claim → evidence → score → override
- JD sentence · extract from JD + org criteria
- Classify · must / nice / red flag + weight
- Find span · match against resume substrings
- Score row · weight × supported, signed by kind
- Override · logged audit row, recompute
Each stage is recoverable. The recruiter can rewind to any step without re-running the full pipeline, because the artifact at every stage is structured JSON the next stage reads.
Stage one. Pull the claim out of the JD
The JD for a Senior AE at a Series B SaaS company has a line: “has carried a quota of $1.5M+ ARR at a B2B SaaS company in the last two years and consistently hit attainment.” That sentence becomes the seed for at least three claims, not one. The claim extractor splits it because the three sub-conditions can each be true or false independently against a resume.
Claim c2 is “carried a $1.5M+ annual quota in the last 24 months,” classified must_have at weight 4. Claim c3 is “hit 100% attainment in at least one of the last two FYs,” must_have at weight 4. Claim c1, lifted from the JD’s preamble about “experienced sales hire,” is “5+ years closing B2B SaaS,” must_have at weight 3. The original JD sentence does not exist in the ledger; three independently testable descendants do.
Splitting is the work the model is doing here that is hard for a human to do at speed across 50 reqs. Reassembling claims into the original JD wording is an option for the recruiter, but the default ledger keeps them split because a regulator asking which conditions a candidate failed wants three answers, not one.
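Here is how those three descendants might sit in the ledger before any resume is scored; the values come from the paragraph above, and the array shape is illustrative rather than the exact API payload.

```typescript
// Three independently testable claims extracted from one JD sentence.
// evidence_span stays null until stage three runs against a specific resume.
const extracted = [
  { id: "c1", text: "5+ years closing B2B SaaS",
    kind: "must_have", weight: 3, evidence_span: null, supported: false },
  { id: "c2", text: "Carried a $1.5M+ annual quota in the last 24 months",
    kind: "must_have", weight: 4, evidence_span: null, supported: false },
  { id: "c3", text: "Hit 100% attainment in at least one of the last two FYs",
    kind: "must_have", weight: 4, evidence_span: null, supported: false },
];
```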
Stage two. Classify and weight
Three buckets. Must-have is gating: the recruiter has decided in advance that a candidate without this trait does not move forward. Nice-to-have is additive: the trait helps if present, costs nothing if absent. Red flag is subtractive: a supported red-flag span pulls weight off the score. The buckets exist because the underlying decisions are not symmetric, and pretending they are with one combined score is the failure mode of every black-box similarity number.
Weights run 1 to 5 by convention. The model proposes; the recruiter overrides. A pattern that holds across most reqs: quota and attainment carry weight 4 on a sales req, production scale carries weight 4 on an infra req, audit experience carries weight 4 on a security req. Anything the team would refuse to interview a candidate without is weight 3 or 4. Anything they would prefer but tolerate missing is weight 2. Red flags are 1 to 3 depending on how much the team is willing to discount a candidate for a supported instance.
The proposed weight is rarely the final weight. The first edit pass is where the rubric becomes the team’s rubric instead of the model’s. Spend ten minutes here the first time, three minutes once the team has scored a few reqs.
Stage three. Find the resume span
For each claim, the agent does a constrained search inside the resume for a substring that supports the claim. For c2 (carried a $1.5M+ quota), the supporting span looks like “Datadog · Senior AE · 2023-2025 · $1.8M ACV target, 112% attainment FY24.” That string lands inside the resume; supported flips to true, evidence_span gets the literal substring.
When no span supports the claim, supported is false and evidence_span is null. Null is the load-bearing answer here. Most resume-scoring vendors hide a missing claim inside a lower similarity score. The ledger surfaces it. A null span on a must_have claim is the difference between a conversation-ending reject and a hire.
The recruiter can promote a span the model missed. If the resume reads “closed $40M in pipeline FY24” and the model decided that was not evidence for the quota claim, the recruiter taps the claim, highlights the line, and promotes it. The promotion is itself a tool call, match.set_evidence(claim_id, span). It writes to the audit log too.
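A minimal sketch of the constraint this stage enforces, assuming (as described above) that the only acceptable evidence is a literal substring of the resume and that supported is always derived from it. The helper below is illustrative, not the product's matcher.

```typescript
// Accept a proposed span only if it literally occurs in the resume text;
// supported is derived, never set directly.
function attachEvidence<T extends { text: string }>(
  claim: T,
  resume: string,
  proposedSpan: string | null,
) {
  const evidence_span =
    proposedSpan !== null && resume.includes(proposedSpan) ? proposedSpan : null;
  return { ...claim, evidence_span, supported: evidence_span !== null };
}

const resume =
  "Datadog · Senior AE · 2023-2025 · $1.8M ACV target, 112% attainment FY24 ...";

// c2 finds its span and flips supported to true. A resume with nothing to
// point at yields evidence_span === null and supported === false, which is
// what the recruiter sees before promoting a different span via
// match.set_evidence(claim_id, span).
const c2Scored = attachEvidence(
  { text: "Carried a $1.5M+ annual quota in the last 24 months" },
  resume,
  "Datadog · Senior AE · 2023-2025 · $1.8M ACV target, 112% attainment FY24",
);
```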
Stage four. The score is the sum
Eleven claims for a Senior AE at $1.5M+ quota. Five must-haves at weights 3, 4, 4, 3, 3. Four nice-to-haves at weights 3, 3, 3, 2. Two red flags at weights 2 and 1. Marquez J. supports all five must-haves and two of the four nice-to-haves, with zero red flags. The math works out to 0.79.
Numerator
22
Sum of weights for the seven supported claims (3+4+4+3+3 + 3+2). Red-flag weights would subtract here, but no red flags are supported.
Denominator
28
Sum of all positive weights in the ledger (3+4+4+3+3 + 3+3+3+2). Red-flag weights are not in the denominator because their role is signed subtraction, not capacity.
Score
0.79
22 divided by 28, with two nice-to-haves unsupported, including the one on the PLG-to-SLG handoff. A clear advance, with a visible gap the panel can probe in the loop.
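As arithmetic, the whole formula fits in a few lines. This is a sketch of the sum described above, not the product's source; the row shape is reduced to the three fields the math reads.

```typescript
// Normalized weighted sum: supported positive weight minus supported
// red-flag weight, over the sum of all positive weights in the ledger.
type Kind = "must_have" | "nice_to_have" | "red_flag";
interface Row { kind: Kind; weight: number; supported: boolean }

function score(ledger: Row[]): number {
  const positive = ledger.filter((r) => r.kind !== "red_flag");
  const denominator = positive.reduce((sum, r) => sum + r.weight, 0);
  const earned = positive
    .filter((r) => r.supported)
    .reduce((sum, r) => sum + r.weight, 0);
  const penalty = ledger
    .filter((r) => r.kind === "red_flag" && r.supported)
    .reduce((sum, r) => sum + r.weight, 0);
  return (earned - penalty) / denominator;
}

// The Senior AE ledger above, reduced to the fields the math reads.
const marquez: Row[] = [
  // five must-haves, all supported (3 + 4 + 4 + 3 + 3 = 17)
  { kind: "must_have", weight: 3, supported: true },
  { kind: "must_have", weight: 4, supported: true },
  { kind: "must_have", weight: 4, supported: true },
  { kind: "must_have", weight: 3, supported: true },
  { kind: "must_have", weight: 3, supported: true },
  // four nice-to-haves, two supported (3 + 2), two unsupported (3 + 3)
  { kind: "nice_to_have", weight: 3, supported: true },
  { kind: "nice_to_have", weight: 2, supported: true },
  { kind: "nice_to_have", weight: 3, supported: false },
  { kind: "nice_to_have", weight: 3, supported: false },
  // two red flags, neither supported
  { kind: "red_flag", weight: 2, supported: false },
  { kind: "red_flag", weight: 1, supported: false },
];

// score(marquez) === 22 / 28 ≈ 0.79
```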
Stage five. The override moment
Wednesday morning. The hiring manager calls. The sales team has decided that self-sourced pipeline is more important than the original rubric reflected. The recruiter promotes claim c8 from weight 2 to weight 3 in one tool call.
The audit row is the artifact the regulator reads, the candidate is entitled to a summary of, and the hiring manager links to in Slack when the panel asks why okonjo-a moved up. There is no second log to reconcile against.
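The docs name the four logged fields as actor, ts, prior, and new; everything else in the sketch below (the envelope, the example values) is an assumption about how that row might look.

```typescript
// One override, one audit row. Only actor / ts / prior / new are documented;
// the surrounding shape and the example values are illustrative.
interface OverrideAuditRow {
  tool: "match.override_weight";
  claim_id: string;  // the stable id the row re-attaches to
  actor: string;     // the named person who made the call
  ts: string;        // timestamp of the override
  prior: number;     // weight before
  "new": number;     // weight after
}

const wednesdayMorning: OverrideAuditRow = {
  tool: "match.override_weight",
  claim_id: "c8",
  actor: "recruiter@example.com", // hypothetical actor
  ts: "2026-04-29T09:14:00Z",
  prior: 2,
  "new": 3,
};
// The score recomputes from the new weight; this same row is what the
// auditor, the hiring manager, and the candidate-facing summary read.
```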
Bad claim shapes the extractor refuses to write
A claim that cannot be sourced to a substring of the resume is malformed. The constraint is not aesthetic. A vibe claim poisons the audit trail because the override has nothing to point at.
Five claim shapes the rubric should never carry
- Strong communicator. Untestable against a resume span. Belongs in the loop debrief, not the rubric.
- Culture add. Untestable, unbounded, and the most common failure mode of a vendor's bias-audit defense.
- High agency. The model has to invent evidence to support it; the invented evidence is the bug.
- Looks like a leader. Pattern matches on resume formatting and seniority titles. Garbage in, garbage out.
- Cultural fit with our team. Treated by NYC LL144 and EU AI Act as the line where claim-by-claim scoring stops being defensible.
Side by side: claim ledger vs the AI fit score in your ATS
Reading across the public product pages and docs of the LLM resume summarizers shipped into Greenhouse, Ashby, Gem, Lever, and SeekOut as of April 2026, the same shape emerges: a paragraph plus a 0-100 number, with no structured rubric the recruiter or the auditor can edit.
| Feature | ATS LLM fit score | Claim-by-claim ledger |
|---|---|---|
| Primary artifact | 0 to 100 fit score plus a free-text paragraph the model wrote. | 5 to 15 testable claims, each with weight, kind, and resume span. |
| Editable rubric | Implicit in the model. No structured weights to turn. | Every weight is a knob. Every classification is overridable. |
| Resume evidence | Summarized in the paragraph. Span not pinned per claim. | evidence_span per claim, or null if no span supports it. |
| Override mechanism | Free-text comment, manual stage move, or recruiter ignores the score. | match.override_weight tool call. Logged with actor, ts, prior, new. |
| Score recompute | Re-run the model on the candidate. Implicit, not reproducible. | Sum of weights where supported, signed by kind. Pure arithmetic. |
| Audit-facing artifact | Score plus prose. Often a separate audit pipeline bolted on. | The ledger of overrides IS the audit. Same record across surfaces. |
| Pricing surface | Bundled into Pro / Expert tier of the host ATS, often $30k+ ARR. | Ships on $0 Starter plan, three reqs, no credit card. |
Comparisons drawn from public product pages, pricing pages, and docs as of April 2026.
The ledger as it sits inside the law
Four hiring-AI regulations going live in 2026 ask roughly the same question in different words. Show me the rubric, show me the override, show me the actor, show me the timestamp. The ledger answers all four with the same record.
NYC Local Law 144
Bias audit on automated employment decision tools
The audit reads the rubric and the overrides. A 0-100 similarity score with an implicit rubric satisfies neither the audit format nor the public summary requirement. A ledger of weighted claims with logged overrides does both.
Illinois HB 3773
Notice and consent for AI in hiring
The candidate has a right to know the criteria. Five must-have claims, four nice-to-haves, two red flags, with weights, are the criteria. A paragraph the model wrote is not.
Colorado CAIA
Risk management and impact assessment
The risk assessment requires a record of the model’s inputs, outputs, and the human review attached to each. The override audit row is the human-review record, in the shape the assessment expects.
EU AI Act, high-risk hiring
Human oversight, traceability, transparency
Every weight override is a tool call by a named actor at a timestamp, attached to a specific claim, with a prior and a new value. That is what traceability looks like as software, not as a policy document.
“Match Rating extracts 5 to 15 testable claims from each JD, classifies each must-have, nice-to-have, or red flag, and sources the evidence span by span in the resume. Every override is logged with actor, timestamp, prior weight, and new weight.”
10xats Match Rating · MCP tool surface · April 2026
Four numbers that hold across most reqs
- 5 to 15 · testable claims extracted per JD
- 6 · fields in one claim record
- 4 · fields logged per weight override (actor, ts, prior, new)
- $0 · Starter plan, three open reqs, no credit card
Claim count, field shape, and pricing pulled from 10xats how-it-works, FAQ, and pricing pages. The audit field count is the canonical row shape used by the match.override_weight and match.set_evidence tool calls.
Four steps to a working ledger on your next req
From a fresh JD to a scored pipeline
- 1 · Paste the JD. Plus any org-wide hiring criteria. The agent returns 5 to 15 candidate claims with proposed weights.
- 2 · Edit the rubric. Demote, promote, delete, add. The first pass takes ten minutes. Save the ledger to the req.
- 3 · Score the pipeline. match.rerun_score against the saved ledger. Every candidate gets per-claim verdicts plus a weighted total.
- 4 · Override and ship. Promote spans the model missed. Reweight what the team values differently. Each override is a logged audit row.
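Strung together, the four steps above map onto the three named tool calls. The sketch below assumes a generic MCP client; only the tool names and the match.override_weight signature come from the docs, and the other argument shapes are assumptions.

```typescript
// Hypothetical MCP client: callTool(name, args) stands in for whatever client
// your agent framework provides. Tool names are from the docs; argument
// shapes beyond match.override_weight's signature are assumptions.
declare function callTool(name: string, args: Record<string, unknown>): Promise<unknown>;

async function scoreReq(reqId: string, candidateIds: string[]) {
  // Pull the saved ledger for the req and review the claims.
  const claims = await callTool("match.list_claims", { req_id: reqId });

  // Log an override: the team reweights self-sourced pipeline from 2 to 3.
  await callTool("match.override_weight", {
    claim_id: "c8",
    prior_weight: 2,
    new_weight: 3,
  });

  // One batched rerun against the updated ledger, no per-candidate clicking.
  const scores = await callTool("match.rerun_score", {
    req_id: reqId,
    candidate_ids: candidateIds,
  });

  return { claims, scores };
}
```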
The Reddit-thread version, in one paragraph
If you are a recruiter who clicked over from a thread asking what claim-by-claim AI resume scoring does on a real req: the answer is that the JD becomes a list of 5 to 15 sentences, each weighted, each pinned to a span of the resume, and the score is the sum. There is no opaque similarity number. Every override is a tool call written into the audit log with your name on it. The hiring manager reads the same ledger you do. The regulator reads it later. The candidate, under Illinois HB 3773 and the EU AI Act, has a right to a summary of the criteria; five must-haves with weights is a summary, a 0-100 fit score is not.
10xats Match Rating ships exactly that shape on the $0 Starter plan with three open reqs and the MCP server, so you can run it as a pre-screen layer in front of whatever ATS you already pay for and decide later. The same Match Rating lives on every plan above Starter; what scales is req count, not the underlying scoring.
Adjacent reading
- AI for technical hiring evaluation: the same shape applied to a Staff Infra SWE req, with the CodeSignal / HireVue / ATS scorer comparison.
- Agentic recruiting: recruiter approval rate per draft: why the override queue is the metric that matters when an agent is drafting touchpoints.
- ATS security and compliance: the no-AI-training commitment, human-oversight terms, and the data-handling shape behind the audit log.
Want to see a real ledger on your next req?
Join the waitlist. We will walk you through Match Rating on a JD you bring.
Questions Reddit threads keep asking about claim-by-claim scoring
What is claim-by-claim AI resume scoring, in one sentence?
It is a resume-scoring approach that decomposes a job description into 5 to 15 testable claims with visible weights, classifies each as must-have, nice-to-have, or red flag, finds (or fails to find) a span of the resume as evidence for each claim, sums the weighted matches into a score, and treats every weight override as a logged tool call so the rubric and the score live inside one audit trail. The artifact is the ledger, not the number.
Why is one fit score the wrong artifact for hiring in 2026?
A single number cannot be defended to a hiring manager who disagrees, to a candidate who asks why, or to a regulator under NYC Local Law 144 / Illinois HB 3773 / Colorado CAIA / EU AI Act high-risk hiring. A weighted ledger of testable sentences can be defended to all three, because each line of the ledger is independently verifiable against the resume. The score is just the sum of the lines.
How does the model decide a claim is must-have, nice-to-have, or red flag?
The classifier reads the JD plus the org-wide hiring criteria and projects each candidate sentence onto one of three buckets. Must-have is gating: the team has decided in advance that a candidate missing one does not move forward. Nice-to-have is additive: present spans add weight, absent spans cost nothing. Red flag is subtractive: a supported red-flag span subtracts weight from the score. Every classification ships with a proposed weight that the recruiter can override before the first candidate is scored.
What does one claim look like as a record?
Six fields. id is a string. text is the claim sentence. kind is must_have, nice_to_have, or red_flag. weight is an integer, conventionally 1 to 5. evidence_span is the substring of the resume that supports the claim, or null if no span supports it. supported is a boolean derived from whether the span exists. The score formula is the sum of weight where supported, with red_flag entries subtracted instead of added. There is no opaque similarity number anywhere in the record.
How does the recruiter override a weight, and what gets logged?
The override is a tool call: match.override_weight(claim_id, prior_weight, new_weight). It writes one row to the audit log with actor, timestamp, prior, and new. The score recomputes from the new weight. If the recompute moves a candidate across a decision boundary (advance to reject or vice versa), that flip is logged with the same shape. There is no free-text comment, no model-confidence dial, no implicit override. The override IS the audit row.
What kinds of claims should never appear in a ledger?
Claims that are not testable against a span. 'Strong communicator', 'culture add', 'high-agency', 'looks like a leader' are the four most common bad shapes. None of them can be sourced to a substring of the resume; they belong in the loop debrief, not in the rubric. A well-built claim extractor declines to write them. If your AI resume scorer is happily emitting them, the rubric is doing the work of vibes, and the audit trail is fictional.
How is this different from the LLM resume summarizer my ATS already has?
Greenhouse, Ashby, Gem, Lever, and SeekOut all ship LLM resume summarizers in 2026. They read the JD plus the resume and emit a paragraph plus a 0-100 fit score. The paragraph cannot be edited as a structured artifact. The score is the rubric, but the rubric is implicit in the model. Claim-by-claim scoring inverts that shape: the rubric is the artifact, the score is the byproduct, and every weight is a knob the recruiter can turn.
What does the score math actually look like?
Score = sum(weight where supported and kind in {must_have, nice_to_have}) minus sum(weight where supported and kind = red_flag), normalized by the maximum possible (sum of all positive weights). On a Senior AE req with 11 claims totaling positive weight 28, a candidate supporting 22 weight worth of must-haves and nice-to-haves and zero red flags scores 0.79. Drop a claim's weight from 4 to 3 and the same candidate scores 0.78 because the denominator changes too. The math is recomputed every time a weight changes.
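A quick check of the numbers in that answer (plain arithmetic, not product code):

```typescript
// Supported positive weight 22 against total positive weight 28.
const before = 22 / 28;            // ≈ 0.786, displayed as 0.79

// Drop one supported claim's weight from 4 to 3: numerator and denominator both move.
const after = (22 - 1) / (28 - 1); // 21 / 27 ≈ 0.778, displayed as 0.78
```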
What about claims that partially overlap?
Two claims that share evidence are allowed. 'Carried a $1.5M+ quota' and 'sold into mid-market accounts' often map to the same line on a sales resume, and both are scored independently. The recruiter can demote one weight to avoid double-counting if the team treats them as the same signal. The override is a logged tool call; the reasoning is recoverable later. The model does not silently merge them; it surfaces both and lets the rubric editor decide.
Does claim-by-claim scoring catch AI-fabricated resumes?
Not as the primary detection layer, but it shifts detection earlier. If a claim's evidence_span is null on a resume that looks long enough to contain it, that absence surfaces in the ledger before the phone screen. If two supported spans contradict each other (a Series B startup tenure that overlaps with a 'Director at FAANG' claim by three years), red-flag claims fire on the contradiction. The recruiter spends thirty seconds on the ledger view, not an hour on a phone screen with a synthetic candidate.
How long does it take to set up a ledger for a new req?
Paste the JD, paste the org criteria, get back 5 to 15 candidate claims with proposed weights. The first edit pass takes a recruiter about ten minutes the first time, under three minutes after the team has scored a few reqs and the proposed weights start landing close. Scoring inbound applicants against the saved ledger is one batched tool call: match.rerun_score against the candidate set, no per-candidate clicking.
Where does claim-by-claim scoring live inside 10xats?
It is the Match Rating agent. It ships on the Starter plan ($0, up to three open reqs, no credit card) along with the rest of the agents, including the MCP server. You can run it as a pre-screen layer in front of an existing ATS for a quarter and decide later. The same Match Rating ships on every plan; the only thing that scales is open req count and team seats.