Guide · The honest shape of AI in technical hiring
AI for technical hiring evaluation: stop scoring, start reading the claim ledger.
The AI tools sold for technical hiring evaluation in 2026 return a number. A HireVue percentile. A CodeSignal grade. A Gem or Ashby “fit score” bolted onto an ATS. None of them give you the rubric the number came from. This guide describes the other shape: a Match Rating agent that decomposes every technical JD into 5 to 15 testable claims, pins each to a span of the resume, and turns every weight override into a logged tool call. The rubric and the decision live in the same ledger.
Three shapes of AI in technical hiring, and what each one gives you
The category called “AI for technical hiring evaluation” is three different products in a trench coat. The first is automated coding tests, sold by CodeSignal, HackerRank, and CoderPad. The candidate writes code in a sandbox under timed conditions, the AI grades correctness and complexity, and the recruiter gets a number plus a recording. The second is AI video screening, sold by HireVue and a handful of newer entrants. The candidate records answers, the model scores tone and content, and the recruiter gets a percentile. The third is the LLM resume summarizer built into every modern ATS in the last eighteen months: Greenhouse, Ashby, Gem, SeekOut, Lever. The model reads the JD plus the resume, writes a paragraph, returns a 0-100 fit score.
Three different products, one shared shape. They all return a number, and the rubric the number came from is either absent (HireVue, CodeSignal) or summarized into a paragraph the recruiter cannot edit (Greenhouse, Ashby). When you push back on the score, the answer is some variant of “the model weighed it this way”. The rubric is not an artifact. The model is.
That shape was tolerable when AI in hiring was a stretch goal. It is not tolerable in 2026. NYC DCWP wants the rubric. The EU AI Act high-risk hiring rules want the rubric. The hiring manager who wants to know why a Staff candidate was rejected wants the rubric. The candidate who has the right under Illinois HB 3773 to be told the criteria wants the rubric. A paragraph is not a rubric.
Black-box score vs. claim ledger
A 0-100 fit score, a percentile, or a letter grade. Optionally, a paragraph the model wrote that summarizes its reasoning in non-machine-readable prose. The recruiter cannot edit the rubric, the candidate cannot see it, the auditor cannot read it. The artifact is the score.
- One number, no rubric you can edit.
- Reasoning is prose, not a structured list of claims.
- Override has no shape. 'I disagree' is a comment, not a tool call.
- Audit log is a chat transcript, not a decision ledger.
The wiring: from a JD to a claim ledger to an override
Match Rating sits between the job description and the recruiter. The output is the ledger, not the score. The score recomputes from the ledger every time a weight changes.
JD + org criteria → Match Rating → claim ledger → recruiter override
The anchor: what one claim ledger actually looks like
A real Infra SWE req, scored against one resume. Notice the shape: every claim is a sentence, every weight is a number, every evidence span is either present or null. The score is the bottom line, not the artifact.
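A minimal sketch of that record, using the Infra SWE claims and weights this guide quotes later; the field names and the resume spans are illustrative, not the exact Chosen schema.

```json
{
  "req_id": "req_staff_infra_01",
  "candidate_id": "cand_0042",
  "claims": [
    {
      "claim_id": "c1",
      "text": "On-call rotation in the last 3 years",
      "type": "must_have",
      "weight": 3,
      "evidence_span": "Carried primary pager for the storage tier, 2022-2025",
      "verdict": "supported"
    },
    {
      "claim_id": "c2",
      "text": "Distributed storage or queueing systems shipped to production",
      "type": "must_have",
      "weight": 4,
      "evidence_span": null,
      "verdict": "unsupported"
    },
    {
      "claim_id": "c3",
      "text": "Open-source contribution to a runtime or DB",
      "type": "nice_to_have",
      "weight": 1,
      "evidence_span": "Merged compaction patches into RocksDB, 2023",
      "verdict": "supported"
    },
    {
      "claim_id": "c4",
      "text": "Startup-only resume with no production scale",
      "type": "red_flag",
      "weight": 2,
      "evidence_span": null,
      "verdict": "not_triggered"
    }
  ],
  "score": 0.62
}
```

The weighted total at the bottom recomputes from the claims above every time a weight changes; it is never stored as the artifact.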
A record of that shape is what the model returns to Claude or ChatGPT through the Chosen MCP server. The recruiter UI renders the same data as a list of cards with a knob next to each weight. The two surfaces agree because they are reading the same record.
Eight kinds of claims a Match Rating ledger holds
The claim shape is content-agnostic. The weight is what makes the rubric defensible.
A real Match Rating session, top to bottom
Tuesday morning. The recruiter has a Staff Infra SWE req, 47 inbound applicants, and 90 minutes before standup. Here is what the session looks like in the recruiter’s shell against the Chosen MCP server.
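A compressed sketch of that session as MCP tool calls. match.override_weight and match.rerun_score are the tool names this guide documents; the extraction call, the batch form, and the argument names are assumptions for illustration.

```json
[
  { "step": "1. Extract the rubric from the JD (tool name assumed)",
    "tool": "match.extract_claims",
    "arguments": { "req_id": "req_staff_infra_01", "jd": "<pasted JD>", "org_criteria": "<org-wide hiring criteria>" } },
  { "step": "2. Score the 47 inbound against the saved ledger (batch form assumed)",
    "tool": "match.rerun_score",
    "arguments": { "req_id": "req_staff_infra_01", "candidate_id": "*" } },
  { "step": "3. The team values on-call experience more than the JD implied",
    "tool": "match.override_weight",
    "arguments": { "claim_id": "c1", "prior_weight": 3, "new_weight": 4 } },
  { "step": "4. Recompute one candidate against the edited rubric",
    "tool": "match.rerun_score",
    "arguments": { "candidate_id": "cand_0042", "req_id": "req_staff_infra_01" } }
]
```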
The override path, end to end
When the recruiter disagrees with a weight, the override is not a comment, not a chat suggestion, not a free-text note. It is a tool call. Here is the full path it travels.
recruiter → MCP → Match Rating → audit log → score recompute
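A sketch of the two records that path produces, using the audit fields named in the comparison table below (actor, timestamp, prior weight, new weight, claim id); the key names and values are illustrative.

```json
{
  "call": {
    "tool": "match.override_weight",
    "arguments": { "claim_id": "c2", "prior_weight": 4, "new_weight": 2 }
  },
  "audit_log_entry": {
    "actor": "recruiter:dana@example.com",
    "ts": "2026-04-14T09:32:11Z",
    "claim_id": "c2",
    "prior_weight": 4,
    "new_weight": 2,
    "req_id": "req_staff_infra_01"
  }
}
```

Once the entry lands, the score recomputes from the edited ledger; if the recompute flips a candidate from rejected to advance, that flip is logged with the same shape.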
Match Rating vs the four AI eval categories you have heard of
A like-for-like read of the public-facing product pages and docs as of April 2026. Each category does something useful; none of them returns the rubric as the artifact.
| Feature | Coding test (CodeSignal) / Video AI (HireVue) / ATS LLM scorer (Ashby, Gem) | Chosen Match Rating |
|---|---|---|
| Primary artifact | Numeric score, percentile, or 0-100 fit grade. Reasoning summarized as prose. | Claim ledger of 5-15 testable sentences, each with weight and resume span. |
| Where it sits in the funnel | Coding test = mid-funnel. Video AI = mid-funnel. ATS scorer = pre-screen. | Pre-screen and pre-panel. Reads the resume, not a code submission. |
| Override mechanism | Free-text note, manual stage move, or recruiter ignores the score. | match.override_weight tool call. Logged with actor / ts / prior / new. |
| Evidence shape | Code recording, video recording, or vendor-summarized resume paragraph. | Resume span pinned to each claim. Span is null if the claim is unsupported. |
| Regulator-facing artifact | Score plus prose. Rubric is implicit in the model. Often custom-built audit on top. | The ledger of overrides IS the artifact. NYC LL144 / EU AI Act ready. |
| Pricing surface | Per-test or per-seat, often with an enterprise floor. Most start at $10k+ ARR. | Ships on $0 Starter plan, 3 reqs. Growth $99 founding / $399 after. |
| Fits a small TA team running 12-25 reqs | Coding tests need a coordinator to schedule. Video AI is a separate workflow. ATS scorers are bundled at higher tiers. | Same plan, same MCP toolset, no per-seat math. Ledger view shipped on Starter. |
Comparisons drawn from public product pages, pricing pages, and docs as of April 2026.
“Match Rating extracts 5 to 15 testable claims from each JD, classifies each as must-have, nice-to-have, or red flag, and sources the evidence span-by-span in the resume. Every override is logged with actor, timestamp, prior weight, and new weight.”
Chosen HQ how-it-works (Match Rating section, April 2026)
Why this shape, and why now
Three forces in 2026 are pushing every honest AI hiring product toward the claim-ledger shape. None of them are optional.
Regulation
4 jurisdictions, one ledger
NYC LL144, IL HB 3773, CO CAIA, and the EU AI Act high-risk hiring rules all demand a defensible reason for every automated decision. A claim override logged through an MCP tool call is that reason. A 0-100 score with a paragraph is not.
Resume fraud
200+ inbound per role
AI-fabricated resumes are a normal input now. A claim that cannot be sourced to a specific resume span is the earliest fraud signal. Surfacing it as a red-flag claim before the phone screen saves the recruiter an hour per false positive.
Hiring manager pushback
One reason per claim
When a Staff candidate gets rejected, the hiring manager asks why. “The AI gave them a 0.62” is not an answer. “Three must-have claims were unsupported, the ledger is here” is. The claim ledger ends the conversation in 30 seconds instead of dragging it across three Slack threads.
What Match Rating reads, what it writes
Both halves of the loop are tools the recruiter and the recruiter’s Claude or ChatGPT instance can call. The ledger is the medium of exchange.
What it reads
The claim, not the score
5 to 15 testable claims per req, classified must-have, nice-to-have, or red flag, each with weight and resume evidence span. Claude can quote the span back to the recruiter in a sentence. ChatGPT can ask for the missing-evidence claims. There is no opaque similarity number to defend.
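What that missing-evidence read can look like over MCP, as a sketch; match.list_claims and the unsupported_only filter are placeholder names, only the override and rerun tools are documented in this guide.

```json
{
  "tool": "match.list_claims",
  "arguments": {
    "req_id": "req_staff_infra_01",
    "candidate_id": "cand_0042",
    "unsupported_only": true
  },
  "result": [
    {
      "claim_id": "c2",
      "text": "Distributed storage or queueing systems shipped to production",
      "type": "must_have",
      "weight": 4,
      "evidence_span": null
    }
  ]
}
```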
What it writes
A logged override, not a side effect
Every weight flip, claim reclassification, evidence promotion, or score rerun is a tool call with actor, timestamp, prior, and new value. The ledger feeds the bias-audit artifact on Enterprise; the same ledger drives the why-this-score notice required under NYC Local Law 144 and the EU AI Act.
The Match Rating week, in numbers
5 to 15 claims per req. 4 jurisdictions, one ledger of overrides. $0 Starter plan, 3 open reqs. 200+ inbound per technical role.
Claim count, jurisdictions, and Starter pricing pulled from Chosen’s how-it-works, FAQ, and pricing pages. The inbound-per-role figure is industry baseline for technical roles on a public job board, not a Chosen-specific stat.
Five steps from JD to scored pipeline
From a fresh req to a fully-scored inbound batch with overrides applied. The whole loop is a 30-minute exercise the first time, under 10 minutes after that.
Drop the JD into Match Rating
Paste the job description plus any org-wide hiring criteria. The agent returns 5 to 15 candidate claims, classified must-have / nice-to-have / red flag, with proposed weights.
Edit the rubric before you score anything
Promote, demote, delete, or add claims. Adjust weights. The rubric is yours; the model gave you a first draft. Save the rubric to the req.
Score the inbound applicants in one batch
match.rerun_score against the saved ledger. Every candidate gets a per-claim verdict plus a weighted total. The middle band is what you read first.
Override what the model got wrong
Open the candidates the model marked unsupported. Promote evidence spans the model missed. match.override_weight when the team values a claim differently than the JD implied. Each override is logged.
Hand the ledger to the panel and the audit
The hiring manager reads the same ledger you did. The bias-audit artifact (Enterprise) consumes the ledger directly. The why-this-score notice for the candidate writes itself.
The 10-minute due-diligence script for any AI eval vendor
Take this list into the next demo with any AI hiring evaluation vendor. The answers quickly separate the rubric-as-artifact products from the score-as-artifact products.
Ten questions for the sales engineer
- Show me the rubric the score came from. If the answer is a paragraph, you are buying a black box.
- Edit one weight live in the demo. If there is no weight to edit, the rubric does not exist as an artifact.
- Show me the audit log for that override. It should have actor, timestamp, prior weight, new weight, claim id.
- Show me one candidate where a claim is marked unsupported. The resume span (or the absence of one) should be visible.
- What does this look like exported as JSON? Can the bias-audit consume it directly, or do I need a services project?
- Tell me, on the spot, what the eval costs per req. If it is gated behind Enterprise and a six-week procurement, the buying cycle is the product.
- How does this work for an ML or security role, not a vanilla SWE? The claim shape should be content-agnostic.
- Show me what happens when an AI-fabricated resume hits the system. Red-flag claims should surface before the phone screen.
- Get the no-training-on-customer-data commitment in the DPA. A privacy policy paragraph is not enough.
- Show me the same workflow from inside Claude or ChatGPT through MCP. If the eval only works inside their UI, you are buying a silo.
The one-paragraph version, for the recruiter who clicked over from a Reddit thread
If you are a technical recruiter, a hiring manager, or an engineer wondering what AI in your hiring pipeline actually looks like in 2026: the version worth defending is not the one that returns a number. It is the one that returns a rubric. 5 to 15 testable claims pulled from the JD, each pinned to a span of the resume, each with a weight you can see and edit, and an override that is itself a logged tool call. The score is the sum, not the artifact. The ledger is the artifact, and the ledger is what survives a regulator, a hiring manager pushback, and an AI-fabricated resume.
Chosen HQ ships exactly that shape, on the $0 Starter plan, with the same Match Rating on every plan above it.
Bring a real JD on the call. We score it live.
30 minutes with the team. Drop a job description into Match Rating. We extract the claims, score five candidates from your inbound, and override one weight together while you watch the audit log update.
Questions Reddit threads keep asking
What does 'AI for technical hiring evaluation' actually mean in 2026?
It is a fragmented category. Three shapes dominate. The first is automated coding tests scored by an AI, sold by CodeSignal, HackerRank, and CoderPad: the candidate writes code in a sandbox, the AI grades correctness, complexity, and time. The second is AI video screening, sold by HireVue and Hireflix: the candidate records answers, the AI scores tone, vocabulary, sometimes content. The third is the new wave of LLM resume summarizers built into Greenhouse, Ashby, Gem, and SeekOut: the model writes a paragraph plus a numeric fit score against the JD. All three return a number. None of them return the rubric. Chosen's Match Rating returns the rubric: 5 to 15 testable claims pulled from the JD, each with a visible weight, each pinned to a resume span. The score is the sum, not the artifact.
Why is a single fit score the wrong shape for technical evaluation?
Three reasons. First, a single number is not testable. You cannot defend it to a hiring manager who disagrees, you cannot defend it to a candidate who asks why, and you cannot defend it to a NYC DCWP auditor under Local Law 144. Second, technical claims are unusually well-suited to claim-by-claim evaluation: 'shipped a distributed storage layer in the last 3 years', 'has paged on-call rotation experience', 'has production PyTorch experience' are sentences a model can verify span by span against the resume. Compressing them into one number throws that work away. Third, AI-fabricated resumes are a normal input now (200+ inbound per role is no longer rare). A claim that cannot be sourced to a specific resume span is the earliest fraud signal you have, and a single fit score hides it.
How does Chosen's Match Rating turn a JD into testable claims?
The agent reads the JD plus your org's hiring criteria and produces a claim ledger: 5 to 15 sentences, each classified as must-have, nice-to-have, or red flag, each with a weight you can see and edit. Examples for an Infra SWE req: 'on-call rotation in last 3 years' (must-have, weight 3), 'distributed storage or queueing systems shipped to production' (must-have, weight 4), 'open-source contribution to a runtime or DB' (nice-to-have, weight 1), 'startup-only resume with no production scale' (red flag, weight 2). For each candidate, the agent finds the resume span that supports each claim, or marks it absent. The recruiter sees the claim, the weight, and the evidence on one screen.
What happens if the recruiter disagrees with a weight?
They override it, and the override is itself an MCP tool call: match.override_weight(claim_id, prior_weight, new_weight). That call writes to the audit log with actor, timestamp, prior, and new. The score recomputes. If the override changes a candidate from rejected to advance, that flip is logged with the same shape. There is no 'AI changed its mind, I trust it' moment to defend later. The override IS the defense. NYC Local Law 144 wants it. Illinois HB 3773 wants it. Colorado CAIA wants it. The EU AI Act high-risk hiring rules want it. A black-box similarity score does not give it to you.
How does this compare to CodeSignal, HackerRank, or HireVue?
Different layers of the funnel, different artifacts. CodeSignal and HackerRank evaluate code that the candidate writes in a sandbox under timed conditions; the artifact is a numeric score plus a code recording. HireVue evaluates a recorded video answer; the artifact is a percentile plus a transcript. Chosen's Match Rating evaluates the resume against the JD before the phone screen, and the artifact is the claim ledger itself. The three are not mutually exclusive. The point is what survives a regulator or a hiring manager pushback. A claim ledger with overrides survives. A 0-100 fit score with no traceable rubric does not.
Can the model be wrong about a claim, and what happens then?
Yes, regularly, especially on edge cases. The agent might extract 'experience with a distributed storage layer' as a claim, then mark it absent because the resume says 'object storage' instead. The recruiter clicks the claim, sees the resume span the model considered (or did not consider), promotes 'object storage' as evidence, and reruns the score. The rerun is itself a tool call, match.rerun_score(candidate_id, req_id), and it lands in the audit log. The model is not the final word. The model is the first draft, the recruiter is the editor, and the ledger is the receipt.
Does this work for non-coding roles in a technical org (DevOps, ML, security)?
Yes. The claim shape is content-agnostic. For a security engineer req, claims look like 'led a SOC2 Type II audit', 'authored a CSP rollout for a public web app', 'has experience with eBPF-based runtime monitoring'. For an ML engineer, 'shipped a production transformer fine-tune at scale', 'has familiarity with vLLM or TGI inference'. The constraint is that each claim has to be testable against a resume span. Soft claims like 'is a strong communicator' are explicitly the wrong shape; they belong in the panel, not the rubric. The Match Rating agent declines to write them.
Can you use Match Rating without the rest of Chosen?
It ships on the Starter plan ($0, up to 3 open reqs, no credit card) along with the rest of the agents and the MCP server. You can run it as a pre-screen layer in front of an existing ATS for the duration of a quarter, then make a buying decision based on actual data. The rest of the agents (Sourcing, Scheduling, Analytics) come on the same plan, so most teams end up using the whole graph after a week.
What about deepfake video interviews and AI candidate impersonation?
Match Rating is a pre-interview layer; it will not catch a synthetic candidate inside a video call by itself. What it does is shift detection earlier. If the resume's claims do not have evidence spans, or if the spans contradict each other (a 'staff engineer at FAANG since 2019' next to a graduation date in 2024), the red-flag claims surface in the ledger before the recruiter spends an hour on a phone screen. Ashby shipped Fraudulent Candidate Detection in April 2026 for the same problem class; Chosen approaches it from the rubric side rather than the runtime side.
What does the Reddit r/recruiting / r/ExperiencedDevs reader actually get from this?
If you are a technical recruiter or a hiring manager: a way to evaluate 200 inbound applicants per role without rubber-stamping a vendor's number, and a way to defend a reject decision to a candidate, a hiring manager, or a regulator with the same artifact. If you are an engineer wondering how AI is being used on your resume: this is what it looks like when AI evaluation is honest. The model is not a judge; it is a junior recruiter producing a structured first draft that a human signs off on, claim by claim. That is the version of AI in hiring that is worth defending.