How to Create Claude Code Skills: The Complete Guide to Agent Skills
Claude skills are reusable behavior modules that replace system-prompt debt. Learn the full anatomy: SKILL.md, progressive disclosure, references, and the description field that triggers everything.
A skill is not a better prompt. It is a different category of thing: a module with its own name, references, and behavior that persists across every session without touching your system prompt.
I run thirteen Claude Code skills. They cover offer design (Alex Hormozi’s method, distilled into 13 references and around 80KB of decisions), SEO strategy (Neil Patel’s), funnel architecture (Russell Brunson’s), explanation clarity (Feynman’s), script review (Arthur Miller’s), financial contrarianism (Michael Burry’s), and a lifeos system suite I built for my own second brain: weekly reviews, decision logs, project planning. None of them live in my system prompt. They load when needed and vanish when they are not.
This article covers the complete system. The linked articles go deeper on each piece; I will point to them rather than duplicate the depth.
What are Claude skills?
A Claude skill is a reusable behavior module that Claude Code loads on demand. It lives in its own directory, carries a SKILL.md router file and optional reference files, and activates from its description, not from a standing system prompt. Skills solve long-prompt debt: the accumulation of instructions that bloat your system prompt and slow every session regardless of whether those instructions are relevant.
The "What are Claude skills" primer covers the conceptual foundation in detail. The short version: a skill is a directory with a name, a description, and a body of instructions that loads only when the description matches a task. The official documentation lives at code.claude.com/docs/en/skills.
Long-prompt debt taxes every session
Before skills, persistent behavior lived in the system prompt. One instruction for tone. One for output format. One for code conventions. One for each persona you wanted. After several months you had 4,000 tokens of instructions that loaded every session, including sessions that never touched half of them.
Long-prompt debt compounds in two ways. First, every new instruction costs tokens forever, regardless of task relevance. Second, the model attends to everything at once, so irrelevant instructions compete with the relevant ones and the model can latch onto the wrong things. Skills break that coupling. A behavior enters context only when the task calls for it, and exits cleanly when it does not. You stop paying for capabilities you are not using.
Skills vs prompts, memory, MCP, and subagents
A skill is a persistent, reusable module that loads on demand and survives across sessions. A prompt is a session instruction that resets when the conversation ends. Memory is stored context with no executable structure behind it. MCP is a server-side tool provider, not a behavior module. Each solves a different problem. Conflating them means reaching for the wrong tool.
The confusing comparisons are skills vs MCP and skills vs memory. Both have dedicated articles: Claude Skills vs MCP and Subagents covers the tool-vs-behavior distinction; Skill vs Prompt vs Memory has a decision tree for the rest. Use those when the choice is not obvious.
Prompts and memory
- Session-scoped: resets when the conversation ends
- Context only, no executable structure or directory
- Loads everything every session, relevant or not
- No internal routing to specific reference files
- Grows into long-prompt debt as you add instructions
Claude skills
- Persistent: identical behavior every session
- Modular: SKILL.md, references/, scripts/, assets/
- Loads on demand, not by default in every session
- Loading map routes the model to the right reference
- Adds capabilities without touching the system prompt
MCP is closer to a tool than a behavior module. It provides server-side capabilities: web search, database access, external APIs. A skill provides a way of thinking and executing that persists across sessions. You can run both at once. They compete for nothing.
Subagents are a different split again: an orchestrator delegates a subtask to a subagent, which runs independently. A skill rides inside a single agent and does not spawn a separate execution. If your use case involves parallel workstreams and independent agents, that is a subagent pattern. If it involves the same agent running a specialized behavior on recurring tasks, that is a skill.
Anatomy of a skill: SKILL.md, frontmatter, references/, scripts/, assets/
A skill directory has five parts. The metadata (name and description) identifies and triggers the skill. The SKILL.md body houses the routing map and behavior instructions. The references/ subdirectory holds distilled knowledge files that load on demand. The scripts/ subdirectory holds executable tools. The assets/ subdirectory holds output the skill produces. The SKILL.md structure and frontmatter guide covers each field in full; this section gives you the map.
Skill directory anatomy
- alex-hormozi/ (persona skill)
  - SKILL.md: router, under 500 lines; the loading map lives here
  - references/: loaded on demand, not by default
    - 00-canon.md: core offer principles
    - 01-value-equation.md: pricing and perceived value
    - 02-grand-slam-offer.md: offer construction
    - 04-pricing-and-guarantees.md: price, risk reversal
    - 08-retention-and-ltv.md: retention and lifetime value
    - 11-decision-checklist.md: final offer verdict
  - scripts/: executable (bash, python, etc.)
    - score-offer.py: numerical scoring
  - assets/: output the skill produces
    - offer-analysis.md: last run output
Keep SKILL.md under 500 lines. When it grows past that, the body has taken on work that belongs in references/. A bloated SKILL.md is long-prompt debt in a better package. Same problem. The overflow belongs in a reference file with a name that reflects the decision it supports.
The references/ directory is where the skill’s knowledge lives. These files do not load automatically. The model reads them only when SKILL.md explicitly instructs it to. That distinction is the whole design.
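The layout above can be scaffolded with a few lines of Python. This is a minimal sketch, not an official tool: the function name `scaffold_skill`, the skill name `offer-review`, and the starter template are my own placeholders, and the frontmatter carries only the name and description fields discussed in this article.

```python
import tempfile
from pathlib import Path

# Starter SKILL.md: frontmatter plus an empty loading map to fill in.
# The template text is an illustrative assumption, not an official format.
SKILL_MD_TEMPLATE = """\
---
name: {name}
description: {description}
---

## Loading map

| File | When to load |
| --- | --- |
"""


def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create <root>/<name>/ with SKILL.md, references/, scripts/, assets/."""
    skill_dir = root / name
    for sub in ("references", "scripts", "assets"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.exists():  # never overwrite an existing router
        skill_md.write_text(
            SKILL_MD_TEMPLATE.format(name=name, description=description),
            encoding="utf-8",
        )
    return skill_dir


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_skill(
        Path(tmp),
        name="offer-review",  # hypothetical skill name
        description="Use this skill whenever evaluating an offer or pricing model.",
    )
    print(sorted(p.name for p in created.iterdir()))
```

The scaffold leaves references/ empty on purpose. The files that go there come from distillation, not from a template.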
Progressive disclosure: how a skill actually loads
Claude Code loads skills in three levels. Level one is always in context: the name and description of every installed skill. Level two loads when the skill triggers: the full SKILL.md body. Level three loads only when SKILL.md says to: individual files from references/. Nothing auto-loads at level three. The model decides, from the instructions you wrote. Get those instructions wrong and the model skips the references entirely.
Anthropic calls this architecture progressive disclosure. Context is expensive and most of it is irrelevant on any given task. You pay the cost of level one (a few hundred characters per skill) in every session. You pay the cost of level two only when the skill fires. You pay the cost of level three only when a specific reference is needed for the task at hand.
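To make the three cost levels concrete, here is a rough arithmetic sketch. Only the 4,000-token prompt comes from the long-prompt example earlier in this article. The session count, the listing cost, the SKILL.md and reference sizes, and the trigger rate are invented for illustration, not measurements.

```python
def prompt_debt_tokens(prompt_tokens: int, sessions: int) -> int:
    """A standing system prompt is paid in full in every session."""
    return prompt_tokens * sessions


def skill_tokens(listing_tokens: int, body_tokens: int, reference_tokens: int,
                 sessions: int, sessions_triggered: int) -> int:
    """Level one is paid every session; levels two and three (the body plus
    one reference) are paid only in sessions where the skill triggers."""
    always = listing_tokens * sessions
    on_demand = (body_tokens + reference_tokens) * sessions_triggered
    return always + on_demand


# Hypothetical numbers: only the 4,000-token prompt comes from the article.
SESSIONS = 100
debt = prompt_debt_tokens(prompt_tokens=4_000, sessions=SESSIONS)
skill = skill_tokens(listing_tokens=100, body_tokens=2_000,
                     reference_tokens=1_500, sessions=SESSIONS,
                     sessions_triggered=20)

print(f"system prompt: {debt:,} tokens")   # 400,000
print(f"skill:         {skill:,} tokens")  # 80,000
```

With these made-up inputs the skill costs a fifth of the standing prompt. The ratio depends entirely on how often the skill fires, which is the point of loading on demand.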
Progressive disclosure: three load levels
- L1. Always in context: name and description
Every installed skill's name and description sits in context in every session. This is what Claude scans to decide whether a skill is relevant. Keep descriptions tight and assertive: they are scanned, not read. The combined name + description + when_to_use is capped at roughly 1,536 characters in the skill listing.
- L2. Loads on trigger: the SKILL.md body
The full SKILL.md body loads when the description match fires. This layer contains the skill's behavior instructions, output modes, and (most importantly) the loading map that routes the model to the right reference files. Under 500 lines. If it's longer, extract the excess.
## Loading map
| File | When to load |
| --- | --- |
| references/01-value-equation.md | Pricing or perceived value |
| references/04-pricing.md | Price, guarantee, risk reversal |
| references/11-decision-checklist | Final verdict on an offer |
- L3. Loads on demand: reference files
Individual files from references/ load only when SKILL.md instructs the model to read them. Nothing here is automatic. The loading map enforces the right choice. Write it as a constraint, not a suggestion.
The catch at level three is that voluntary is not mandatory. If SKILL.md says “you may consult references/pricing.md,” the model treats it as a suggestion it can skip. If SKILL.md says “for any pricing question, read references/04-pricing-and-guarantees.md before you write a single word of your answer,” it behaves like a constraint. The phrasing difference matters. Write loading instructions as requirements.
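One way to catch permissive loading instructions before they ship is a small lint pass over SKILL.md. This is a sketch under my own assumptions: the phrase list is a hypothetical starting point, not an exhaustive or official one, and the check only looks at lines that mention references/.

```python
import re

# Hypothetical list of permissive phrasings; extend it for your own skills.
PERMISSIVE = re.compile(
    r"\b(you may|you can|feel free to|if needed|optionally)\b",
    re.IGNORECASE,
)


def permissive_loading_lines(skill_md_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) for lines that mention references/
    while using permissive wording."""
    hits = []
    for number, line in enumerate(skill_md_text.splitlines(), start=1):
        if "references/" in line and PERMISSIVE.search(line):
            hits.append((number, line.strip()))
    return hits


# The two phrasings quoted in the paragraph above.
draft = """\
For pricing questions you may consult references/pricing.md.
For any pricing question, read references/04-pricing-and-guarantees.md before you write a single word of your answer.
"""

for number, line in permissive_loading_lines(draft):
    print(f"line {number}: {line}")
```

Only the first line is flagged. A clean pass does not prove the model will obey the instruction. It only removes the wording that invites skipping.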
The description field is the whole ballgame
The description is the primary triggering mechanism for a skill. Together with the name and when_to_use, it is all Claude reads before deciding whether to activate the skill at all. A skill with a weak description does not fire. It sits installed in the library, invisible, while Claude answers from its weights instead of your references.
Anthropic’s documentation names this failure: undertriggering. Skills activate less often than they should because descriptions are too passive. “Helps with offer analysis” is passive. “Use this skill whenever evaluating an offer, pricing model, guarantee, value stack, acquisition funnel, or retention rate. Load it before writing any response about any of those topics” is assertive. Descriptions should be slightly pushy. Assertive about when to reach for the skill, accurate about what it does.
A second failure is using the description for instructions. Instructions belong in SKILL.md. The description’s only job is to pull the skill into context when the task matches. Once it fires, SKILL.md takes over. Mixing the two compresses both functions and does neither well.
Write the description as if you are telling Claude exactly when to reach for this skill. That is precisely what you are doing. Name the specific task types, input shapes, and decision domains the skill owns. If you built an offer-evaluation skill, list the words that should fire it: offer, pricing, guarantee, value stack, conversion, acquisition funnel, retention. Do not make the model infer the scope.
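A quick heuristic check can flag passive descriptions before you install a skill. This is a sketch, not a validator Claude Code provides: the function name, the "when to use" phrases, and the trigger-word list are assumptions of mine, and the length limit is the approximate listing cap quoted earlier in this article.

```python
MAX_LISTING_CHARS = 1536  # approximate cap quoted earlier in this article

# Hypothetical signals that a description states when to use the skill.
WHEN_PHRASES = ("use this skill", "use when", "whenever")


def description_problems(description: str, trigger_words: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    lowered = description.lower()
    if len(description) > MAX_LISTING_CHARS:
        problems.append(f"longer than {MAX_LISTING_CHARS} characters")
    if not any(phrase in lowered for phrase in WHEN_PHRASES):
        problems.append("never says when to use the skill")
    missing = [word for word in trigger_words if word.lower() not in lowered]
    if missing:
        problems.append("missing trigger words: " + ", ".join(missing))
    return problems


triggers = ["offer", "pricing", "guarantee", "retention"]
passive = "Helps with offer analysis"
assertive = (
    "Use this skill whenever evaluating an offer, pricing model, guarantee, "
    "value stack, acquisition funnel, or retention rate."
)

print(description_problems(passive, triggers))
print(description_problems(assertive, triggers))
```

The passive description fails on two counts and the assertive one passes. A passing description can still undertrigger in practice, so treat this as a first filter, not a substitute for testing in a real session.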
Writing references: split by decision, not by source
References are what separate a skill from a costume. They are the distilled knowledge the model reads instead of improvising from its weights. The principle that governs their structure: split by decision, not by source. A file named book-1-chapter-4.md is an archive. A file named pricing-and-guarantees.md is a decision tool.
The question the model receives never arrives labeled by source. It arrives as a problem: a price that does not convert, a guarantee that sounds weak, an offer the prospect is not pulling for. The reference that should answer it is the one organized around that decision, not the one organized around where the information originally came from.
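A filename check makes the archive-versus-decision-tool distinction easy to audit. This is only a heuristic sketch: the pattern below is my guess at what source-organized names tend to look like, and it will miss names it was not written for.

```python
import re

# Hypothetical pattern for source-organized names such as book-1 or chapter-4.
SOURCE_LIKE = re.compile(
    r"\b(book|chapter|talk|episode|part|vol)[-_ ]?\d+", re.IGNORECASE
)


def source_named(filenames: list[str]) -> list[str]:
    """Return the filenames that look organized by source rather than decision."""
    return [name for name in filenames if SOURCE_LIKE.search(name)]


files = [
    "book-1-chapter-4.md",
    "pricing-and-guarantees.md",
    "08-retention-and-ltv.md",
]
print(source_named(files))
```

Only book-1-chapter-4.md is flagged. A numeric prefix such as 08- passes, because it orders the files without naming a source.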
My working skill library
- 13+ skills installed: personas and system tools
- 13 references in one skill: Hormozi, one file per decision domain
- 80KB distilled knowledge: books, talks, and playbooks in one persona skill
- 0 system prompt lines: no skill lives in the system prompt
In my Hormozi skill, the 13 reference files map to 13 decision domains, among them canon principles, the Value Equation, Grand Slam Offer construction, pricing and guarantees, lead generation, closing and sales, retention and LTV, the acquisition content model, and a final decision checklist. When a pricing question arrives, the loading map points at 04-pricing-and-guarantees.md. When a retention question arrives, it points at 08-retention-and-ltv.md. The model does not have to figure out which part of the corpus is relevant. The map does that. The same structure applies to every skill in the library: neil-patel, russell-brunson, gary-vaynerchuk, feynman, arthur-miller, michael-burry, and the lifeos suite.
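Because the map does the routing, a reference file that the map never mentions will never load, and a mapped file that does not exist fails silently. The sketch below audits both directions. It assumes the loading map is a markdown table whose first column holds a path starting with references/; the function names and the demo files are illustrative, not part of any official tooling.

```python
import re
import tempfile
from pathlib import Path

# Matches table rows like: | references/04-pricing.md | Price questions |
ROW = re.compile(r"^\|\s*(references/[^\s|]+)\s*\|\s*(.+?)\s*\|\s*$")


def parse_loading_map(skill_md_text: str) -> dict[str, str]:
    """Return {reference path: when to load} from loading-map table rows."""
    mapping = {}
    for line in skill_md_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def audit_loading_map(skill_dir: Path) -> dict[str, list[str]]:
    """Compare the loading map in SKILL.md with the files in references/."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    mapped = set(parse_loading_map(text))
    on_disk = {f"references/{p.name}" for p in (skill_dir / "references").glob("*.md")}
    return {
        "mapped_but_missing": sorted(mapped - on_disk),
        "on_disk_but_unmapped": sorted(on_disk - mapped),
    }


# Demo skill in a throwaway directory (contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "alex-hormozi"
    (skill / "references").mkdir(parents=True)
    for name in ("04-pricing-and-guarantees.md", "11-decision-checklist.md"):
        (skill / "references" / name).write_text("placeholder", encoding="utf-8")
    (skill / "SKILL.md").write_text(
        "## Loading map\n"
        "| File | When to load |\n"
        "| --- | --- |\n"
        "| references/04-pricing-and-guarantees.md | Price, guarantee, risk reversal |\n"
        "| references/08-retention-and-ltv.md | Retention and lifetime value |\n",
        encoding="utf-8",
    )
    report = audit_loading_map(skill)
    print(report)
```

In the demo, 08-retention-and-ltv.md is mapped but missing, and 11-decision-checklist.md sits on disk with no route to it.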
Distillation is not compression. It is reconstruction. Dumping three books into a references directory gives you a bigger costume, not a method. Distillation means pulling out, in your own words, the principles that load-bear, the frameworks that repeat, the decision criteria that separate pass from fail, the examples that make abstract guidance concrete, and the calibration phrases that expose when the skill has drifted from the method. You read widely; you keep only what makes a decision.
Persona skills: naming a skill after an expert
A persona skill names the skill after a recognizable expert and backs that name with distilled references. The name activates a denser region of the model’s knowledge. “Alex Hormozi” points at books, talks, frameworks, and phrases in the training data where “marketing expert” points at an average. But the name alone is theater. The references are what carry the method.
Zheng et al. (arXiv:2311.10054, Findings of EMNLP 2024) tested 162 personas across four model families and found that a bare persona label in the system prompt does not improve factual accuracy. The paper’s own conclusion reversed between drafts: from “consistently improves” in November 2023 to “does not improve” in October 2024, after the authors widened the test. A name with no method under it is noise. A name with 13 decision-split references and a loading map is a different machine.
The full treatment (why naming pays off inside a skill when it fails as a standalone label, how to distill a corpus, how to avoid the silent failure where the skill runs entirely from the model’s weights instead of your references) is in Persona Skills for Claude Code: Why Naming an Expert Beats Role Prompting. The short version: use a persona when there is a real public corpus. Use a functional name (security-reviewer, api-contract-auditor, migration-planner) when the skill is internally technical and the work depends on your own checklists, not a public figure’s method.
Real examples: my skill library
I run one SKILL.md router per domain, each an independent module with its own contract, references, and expected output. The personas cover offer design (Hormozi), funnel architecture and the value ladder (Brunson), organic content and distribution strategy (Gary Vaynerchuk), SEO and keyword strategy (Neil Patel), explanation clarity and first-principles teaching (Feynman), dramatic structure and script review (Arthur Miller), and macro contrarian analysis (Michael Burry). The lifeos skills cover task capture, weekly review, decision logging, and project planning; they use my own distilled method, not a public figure’s.
None of them try to do everything. Each has a one-sentence job. When a job does not fit in a sentence, the scope is too broad to build a reliable skill around. The Claude Skill Examples: Real Implementations from a Working Library article shows concrete file layouts, reference structures, and real SKILL.md excerpts, including the loading map format, the output mode constraints, and how the decision checklist forces the model off generic hedging and into a real verdict.
The skills that get used most have assertive descriptions, tight loading maps, and decision-grounded references. The ones that underperform are where I was too gentle with the description or let the references grow by source instead of by decision. Fixing them follows the same pattern every time.
How to test and iterate a skill
Testing a skill means verifying that it loaded its references and used them, not just that the output sounded plausible. The surface failure looks fine: right voice, right energy, confident answer. But if the model ran entirely from its weights instead of your files, your references contributed nothing. You distilled 80KB of method that the skill ignored.
The systematic approach to catching and diagnosing that failure is in How to Test Claude Code Skills: An Evaluation Framework. The core pattern is comparison and citation: give the same real input before and after the skill is installed, check whether the specific framework from references appeared in the answer, and verify that the model can name which reference told it what to do.
Minimum skill quality checks
- Required: Use a real input, not a toy example. A skill that only works on contrived prompts is not production-ready.
- Required: Compare output before and after the skill is installed. If the output is the same, the skill is not adding anything.
- Required: Check whether the specific framework from references/ appeared. Value Equation, RAISE, Grand Slam Offer: the framework is the tell.
- Required: Run the same input twice across different sessions. Consistent output means the skill is loaded, not improvised.
- Required: Test the loading map directly: ask a pricing question, verify it loaded the pricing reference. The loading map is the most likely failure point.
- Required: Check for generic hedging. A skill that loaded its references does not hedge like a general assistant. Generic hedging is a sign the references were skipped.
- Optional: Ask the model to explain its reasoning and cite its source. If it cannot name the reference, it did not use it.
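Two of the checks above, the before/after comparison and the framework check, can be sketched as a small script. It assumes you paste in the two outputs by hand; nothing here calls Claude. The function names and the sample outputs are invented for illustration.

```python
def framework_hits(output: str, framework_terms: list[str]) -> list[str]:
    """Return the framework terms that appear in the output (case-insensitive)."""
    lowered = output.lower()
    return [term for term in framework_terms if term.lower() in lowered]


def compare_runs(before: str, after: str, framework_terms: list[str]) -> dict[str, object]:
    """Compare the same input answered without and with the skill installed."""
    before_hits = framework_hits(before, framework_terms)
    after_hits = framework_hits(after, framework_terms)
    return {
        "identical": before.strip() == after.strip(),
        "gained_terms": [term for term in after_hits if term not in before_hits],
    }


terms = ["Value Equation", "Grand Slam Offer"]

# Invented sample outputs; paste real session output here.
before = "Your price seems a bit high. Consider offering a discount."
after = "Per the Value Equation, the guarantee is doing too little work."

result = compare_runs(before, after, terms)
print(result)
```

An empty gained_terms list is the warning sign: the answer may sound right while the references contributed nothing. A substring match is a crude signal, so read the output as well.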
Iteration follows the same diagnostic path. When a skill answers wrong, trace the failure layer by layer: did the description fire? Did SKILL.md load? Did the loading map point at the right reference? Did that reference contain what the task needed? Each layer has a distinct failure mode, and they all look identical from the outside: a wrong or generic answer. The layer structure is what lets you trace the cause.
When to build a skill, and when a prompt is enough
Build a skill when the behavior recurs across sessions and needs consistent, reference-grounded output. The clearest signal: if you are copying and pasting the same instruction block into multiple sessions, that block is a skill waiting to be built. The second signal: if you want specific distilled knowledge (not just tone, but actual frameworks and decision criteria) to appear reliably in the output, that knowledge belongs in references, and references belong in a skill.
Stay with a prompt when the task is one-off, the instructions change each run, or the behavior needs no references, only a temporary context adjustment. Prompts are fast to write and free to discard. Skills take investment: distillation, routing, testing. Build the skill only when the investment pays across many sessions.
The harder decision is between a skill and an MCP tool, or between a skill and memory. Those have nuance that does not reduce to a single rule. Skill vs Prompt vs Memory: When to Use Each has a full decision tree and the specific cases where each wins.
The practical guide: reach for a skill when you catch yourself recreating the same behavior from scratch, when you need a behavior that loads automatically without manual setup, or when you have done distillation work that deserves to persist rather than evaporate at session end.
FAQ
What are Claude skills?
Claude skills are reusable behavior modules that Claude Code loads on demand. Each skill lives in its own directory with a SKILL.md router, optional reference files, and a description that triggers it. They solve the long-prompt debt problem by keeping capabilities out of the system prompt until they are needed.
The official documentation is at code.claude.com/docs/en/skills.
How do I create a Claude Code skill?
Create a directory for the skill. Add a SKILL.md file with a name, description, when_to_use, and behavior instructions. Add a references/ subdirectory with one file per decision domain. Write a loading map in SKILL.md that tells Claude which reference to read for each type of question.
The description is the trigger. Write it assertively so the skill fires on the right inputs, not timidly so it gets skipped.
What goes in SKILL.md?
SKILL.md contains the skill's name, description, when_to_use field, main behavior instructions, output mode constraints, and the loading map that routes the model to specific reference files. Keep it under 500 lines. Anything longer belongs in a reference file.
The loading map is the most important part. It should name specific files and the specific question types that require them. Not suggest them, require them.
What is progressive disclosure in the context of Claude skills?
Progressive disclosure is the three-level loading model Anthropic uses for skills. Level one (name and description) is always in context. Level two (SKILL.md body) loads when the description match fires. Level three (individual reference files) loads only when SKILL.md explicitly instructs the model to read them.
Nothing at level three is automatic. The loading map is how you make the model reach for the right file instead of improvising from its weights.
Why does the description field matter so much?
Because, along with the name and when_to_use, it is the only part of a skill that Claude reads before deciding whether to activate it. A passive or vague description causes the skill to undertrigger. It misses the cases it was built for, silently.
Descriptions should assertively name the exact task types, inputs, and decision domains that should fire the skill. Write them as if you are telling Claude exactly when to reach for this tool, because you are.
How is a Claude skill different from an MCP tool?
An MCP tool provides server-side capabilities: web access, databases, external APIs. A Claude skill provides a behavior module: a way of thinking and executing, with distilled references, that persists across sessions. You can use both simultaneously. They are not competing for the same slot.
The detailed comparison, including when subagents enter the picture, is in the skills vs MCP and subagents article.
Do persona skills actually work? I thought role prompting was debunked.
Bare role prompting does not work reliably. Zheng et al. (arXiv:2311.10054, Findings of EMNLP 2024) tested 162 personas across four model families and found no factual accuracy gain from a persona label alone. The paper's conclusion flipped between v1 and v3 of the same study.
A persona skill is a different architecture. The name activates a denser cluster in the model's knowledge, but the distilled references carry the method. Without references, you have a costume. With 13 decision-split references and a loading map, you have a skill that consistently uses the expert's actual frameworks instead of improvising the voice.
How many references should a skill have?
As many as there are distinct decisions to support, and no more. My Hormozi skill has 13 because there are 13 identifiable decision domains in offer design. A tighter skill might start with 3 to 5 well-distilled files.
The rule: one file per decision, not one file per source. A file named 'book-2.md' is an archive. A file named 'pricing-and-guarantees.md' is a decision tool.
When should I use a skill instead of a prompt?
Use a skill when the behavior recurs across sessions, you have distilled knowledge that belongs in references, and you want consistent output without manually recreating context each time.
Stay with a prompt when the task is one-off, the instructions change every run, or the behavior needs no references. The clearest signal for building a skill: you are copying the same instruction block into multiple sessions.