How to Test a Claude Skill: Evals, Triggering, and Description Optimization
Most Claude skills ship without a single real test. Here's the eval loop I use: 20 queries, a train/test split, and a trigger-rate target to prove a skill fires when it should and stays quiet when it shouldn't.
A skill that never fires is not a skill. It’s a file.
Writing a Claude skill is the easy part. Knowing whether it actually works (whether it fires when you need it and stays quiet when it shouldn’t) requires evals.
If you haven’t shipped a skill yet, start with the full guide to building Claude Code skills. To get the SKILL.md structure right before testing, read the SKILL.md structure and frontmatter guide. For real-world examples of tested skills, see the Claude skill examples.
Skills tend to undertrigger, and you won’t notice
The most common failure is undertrigger: the skill exists, the description looks fine, and Claude never loads it. You write a query. Nothing happens. You assume it worked. It didn’t.
Claude only reaches for a skill when it decides the task can’t be handled trivially from its own weights. A vague or passive description (“this skill helps with X”) reads as optional. Claude skips it. The description is the only signal Claude checks before deciding whether to load a skill. If it doesn’t compel action, nothing loads, and you never find out: the model answers confidently from training data and keeps going.
Build the eval set before you touch the description
Build roughly 20 eval queries before adjusting anything. Half should be cases where the skill should trigger: real queries from your intended use cases. The other half should be queries where it should not trigger: adjacent topics, overlapping intent, things a reasonable user might ask that the skill was never designed to handle.
Split them 60/40: twelve for training (the set you iterate against) and eight held out for testing. That split matters. If you optimize purely against your training queries, you will overfit. The description gets suspiciously good at those exact twelve examples and fails on anything slightly different. The test set is the only honest measure.
Run each query three times before recording a score. Claude’s trigger decision is probabilistic. A single run can give you a false positive or a false negative. Three runs per query gives you a reliable rate: 0/3, 1/3, 2/3, 3/3. The target is above 80% trigger rate on should-trigger queries and below 20% on should-not queries. Anything outside that band is worth closing.
The eval loop
Five rounds is the budget. More than that and you’re usually chasing noise in your training set rather than improving the skill. Each round: propose one description change, re-run the training queries three times each, record the trigger rate. After five rounds, pick the description with the best score on the test set, not the training set. That’s the one you ship.
Description eval loop
Input
~20 queries: mix of should-trigger and should-not-trigger, split 60/40 train/test
- 01Run the baseline
Fire every training query 3× against the current description. Record the trigger rate for should-trigger and should-not-trigger groups separately.
- 02Diagnose the gap
Undertrigger? Add explicit trigger phrases and make the description slightly pushy. Over-trigger? Tighten scope and add when-not-to-use context.
- 03Propose one change
One edit per round. Changing multiple things at once makes it impossible to know what moved the needle.
- 04Re-run the training set
Re-run all training queries 3× each. Record the new trigger rate. Track every version. You may want to go back.
- 05Repeat up to 5 rounds, then pick by test score
After 5 rounds (or when the training rate plateaus), stop. Evaluate every candidate description on the held-out test set. The one with the best test score ships.
Output
A description that triggers reliably on real queries: verified, not assumed
The one-change-per-round rule is the discipline that makes iteration useful. If you change the description, add a trigger phrase, and rename the skill in the same pass, you can’t attribute a shift in trigger rate to any single decision. Treat the description as the variable and hold everything else fixed.
Three failure modes, each with a different fix
After running this loop across a dozen skills in production, the same three failure patterns keep coming up. They live in different parts of the skill and need completely different fixes. Conflating them wastes rounds.
Trigger and output failure modes
| Failure mode | Symptom | Fix |
|---|---|---|
| Undertrigger | Skill exists but rarely fires on relevant queries | Add explicit trigger verbs; rewrite description to be slightly pushy |
| Over-trigger | Fires on adjacent queries the skill wasn't designed for | Tighten scope; add when-NOT-to-use examples in the description |
| Output drift | Fires correctly but ignores your references; answers from weights | Add counterexamples and anti-patterns to reference files; strengthen the loading map |
Undertrigger and over-trigger are both description problems. Run the eval loop above. Output drift is a reference problem. The skill fired. The model just chose not to use what you built. That’s fixed in the reference files and the loading map, not in the description.
Output quality: the test everyone skips
Trigger rate only tells you whether the skill fired. It says nothing about whether the answer was any good. That’s a separate test, and most people never run it.
The test is simple: give Claude the same real input twice. Once with the skill active, once without. Compare the two responses on five criteria.
Output quality checklist
- Required:It loaded the right reference file.The answer should cite or apply the specific framework or decision criteria from your references, not generic advice Claude could produce without them.
- Required:It applied the framework, not just the voice.Tone and vocabulary matching the skill is necessary but not sufficient. The framework (the Value Equation, the RAISE model, your decision checklist) should structure the answer, not just flavor it.
- Required:It dropped the generic hedging.Without the skill, Claude hedges. With it, you should see concrete judgments. If the answer still reads like 'it depends' with no decision attached, the skill is not anchoring the output.
- Required:It's consistent across two separate sessions.Run the same input in a fresh session. The skill should anchor the same method both times. Wild variation between sessions means the references are too sparse or the loading map is ambiguous.
- Required:The answer would fail without the skill.This is the real bar. If Claude produces an answer this good from weights alone, the skill isn't adding value. The difference between with-skill and without-skill should be obvious.
This test takes ten minutes and most people skip it because it requires running the same query twice and comparing carefully. It’s the only way to know whether your distilled references are actually reaching the answer, or sitting unread in a directory while Claude improvises from training data.
Iterating the description: three concrete moves
After running the trigger-rate eval, most descriptions need one structural change: they’re too passive.
Compare these two descriptions for the same skill:
Before: “This skill helps with offer evaluation using the Hormozi frameworks.”
After: “Use this skill whenever someone asks you to evaluate, critique, improve, or price an offer, value stack, or guarantee. Before responding, load references/01-value-equation.md and references/11-decision-checklist.md.”
The first description tells Claude what the skill is. The second tells Claude when to use it and what to do first. That difference routinely moves the trigger rate from 40% to over 90% on a real eval set.
Three moves that fix undertrigger.
Name the trigger verbs explicitly. List them: “evaluate, critique, review, improve, audit, diagnose.” If the user’s verb appears in the description, Claude can match it directly. Don’t make it guess that “take a look at my offer” means evaluate.
Add a when-to-use sentence. “Use this skill whenever…” is an instruction, not a label. The conditional makes it explicit.
Name the first reference to load. Making reference loading mandatory in the description turns it from an option into a gate. The model can’t answer generically if the description already told it which file to open first.
For over-triggered skills, the inverse applies. Add a when-NOT-to-use sentence: “Do not use this skill for general writing, code review, or technical debugging.” That boundary will hold.
When the description is done
A description is ready when the test-set trigger rate clears 80/20, the output quality checklist passes on two real inputs, and two separate sessions produce consistent answers on the same query. Version it. Check the description into the skill’s directory with a note on what changed and why. Skills drift: a new Claude version ships, usage patterns shift, and the trigger rate drops. You want that history when you need to debug it six months from now.
For what a production-ready SKILL.md looks like internally, the SKILL.md structure and frontmatter guide covers the fields that affect triggering and loading. For concrete examples of skills that have been through this process, see the Claude skill examples.
Evals close the gap between shipped and trusted
Building a skill is an afternoon. Knowing it works is different. It doesn’t show up in the file count, but it shows up in whether the skill is actually used.
Twenty eval queries. A 60/40 train/test split. Three runs per query. Up to five rounds of description edits. Pick the winner by test-set score. Then run the output quality checklist to confirm your references are reaching the answer, not just sitting in a directory.
That’s the full protocol. Not glamorous, but it’s the difference between a skill you deployed and a skill you trust.
If you haven’t written your first skill yet, the guide to building Claude Code skills is the right starting point.
How do I know if my skill is actually triggering?
Run the same query three times and check whether the response applies your skill's framework. If the answer is generic (no specific framework, no evidence of a loaded reference, no meaningful difference from what Claude would produce without the skill), it probably didn't fire.
The cleaner test is the with/without comparison: run the query once with the skill present, once with the SKILL.md temporarily removed. If the answers are identical, triggering is not your problem. Output quality is.
Why run each eval query three times instead of once?
Claude's trigger decision is probabilistic. A single run can give you a false positive (skill fires on a query it normally wouldn't) or a false negative (skipped on a query it should catch). Three runs per query gives you a rate: 0/3, 1/3, 2/3, 3/3. Stable enough to make a decision on.
Why pick the best description by the test set, not the training set?
Because you can overfit a description to your training queries. After five rounds of iteration, a description can get very good at the exact twelve examples you tested and fail on anything slightly different.
The test set (the eight queries you held back and never iterated against) is the only honest measure of generalization. If you look at it before you've stopped iterating, it stops being a test.
What's the difference between undertrigger and output drift?
Undertrigger means the skill doesn't fire. Output drift means it fires but the answer still ignores your references and draws from the model's training weights instead.
Both produce mediocre answers, but the fixes are completely different. Undertrigger is a description problem: iterate the eval loop. Output drift is a reference and loading-map problem. The description is already doing its job.
Do I really need 20 eval queries, or can I do less?
For a narrow, well-defined skill (one that handles a single document type or a specific decision), 15 is usually enough. Fewer than that and the signal is too noisy to iterate on confidently.
The 60/40 split matters more than the absolute number. You need enough training examples to iterate on and enough held-out examples to evaluate fairly. Don't collapse them into one set.
Does a better description always fix a skill that isn't working?
No. The description controls triggering. If the skill fires but the output is still generic, the problem is in the references or the loading map, not the description.
Always run the output quality checklist before spending eval rounds on description edits. If the with-skill answer is already meaningfully different from the without-skill answer, your trigger mechanism is fine and you're solving the wrong problem.