SkillCheck is an open source CLI and cloud service that measures whether an agent skill file actually improves a model's task performance. It runs a controlled A/B experiment instead of relying on intuition: the same tasks are solved with and without your skill, graded blind, and scored with a bootstrap confidence interval.

SkillCheck: A/B Test Your Agent Skills

The problem

Skills ship on vibes

A skill file changes your agent's behavior on every single request. Yet almost nobody measures whether that change is an improvement.

Untested by default

Skills get written, committed, and shared without a single controlled comparison. "It feels better" is the entire QA process.

Placebo tax

Every skill costs prompt tokens on every call. A skill that does nothing still bills you for the privilege, forever.

Silent rot

Models change underneath you. A skill that helped six months ago can quietly become useless, or start hurting, after an upgrade.

How it works

A drug trial for your skill

One command runs the whole experiment. No setup, no harness, no notebooks.

Normalize

Your skill file is parsed and its declared domain extracted from front matter or the first heading.

Generate tasks

A generator model writes fresh evaluation tasks from the domain only. It never sees the skill body, so tasks cannot leak its instructions.

Run both arms

Every task runs with the skill injected as a system prompt, and again without it. Same model, same temperature, same everything else.

Grade blind

A separate grader scores each output against the task's pass criterion at temperature 0. It never knows which arm produced the output.

Score the difference

A 1,000 iteration paired bootstrap turns the pass rates into an effect size, a 95% confidence interval, and one verdict: HELPS, PLACEBO, or HARMS.

What you get

Evidence, not anecdotes

SkillCheck produces numbers you can put in a PR description. Every run is reproducible and every score carries its uncertainty.

Forced-injection A/B

The same tasks run twice, with and without your skill. The delta in pass rate is the skill's measured effect in percentage points.

Blind grading

Outputs are shuffled before grading, so the grader cannot favor either arm. No self-evaluation bias, no cherry-picking.

Bootstrap confidence

1,000 paired resamples build a 95% interval around the effect. The verdict only says HELPS when the interval clears zero.

Rot detection

Re-run saved results against new model releases. If a verdict flips from HELPS to PLACEBO, you know the skill rotted before your users do.

Reproducible by design

Every result records the skill hash, task suite, model versions, and transcript hashes. Anyone can re-run and verify the number.

Token aware

SkillCheck counts the prompt tokens your skill adds and reports value per 1k tokens. A small win that triples your context is not a win.

SKILL.md AGENTS.md CLAUDE.md any *.md file whole folders

The result

One card. One answer.

No dashboards to interpret. The CLI prints a single result card that says whether the skill earned its place in your prompt.

Skillapi-documentation

Run size5 tasks × 3 trials

VerdictHELPS

With skill80.0% of tasks passed

Without skill55.0% of tasks passed

Skill effect+25.0 pp change in pass rate

Confidence+8.0 pp to +42.0 pp (95%)

Token cost+480 tokens per call

Satisfaction75.0/100 GOOD

VerdictHELPS, PLACEBO, or HARMS, decided by whether the 95% interval clears zero. No interval, no claim.
Skill effectThe change in pass rate, in percentage points. This is the number to quote in your PR.
Token costWhat the skill adds to every prompt. Weigh it against the effect before shipping.
SatisfactionA 0 to 100 quality score where 50 means no effect. Quick to read, backed by the bootstrap.

Pricing

Start free. Upgrade when it pays off.

Ten runs is enough to test a real skill at two effort levels. Go unlimited when SkillCheck earns a place in your workflow.

Free

$0

10 SkillCheck runs included
Full CLI: check, eval, verify
Blind grading and bootstrap CI
No credit card required

Pro

$19 one-time

Everything in Free
Unlimited SkillCheck runs
Corpus and rot reporting
Priority model capacity

Upgrade lives in your dashboard once you are signed in.

FAQ

Questions, answered

What is SkillCheck?

SkillCheck is an open-source CLI and cloud service that measures whether an agent skill file actually improves task performance. It runs a controlled A/B experiment: tasks are solved with and without your skill, graded blind, and scored with a bootstrap confidence interval.

How does SkillCheck test a skill?

SkillCheck reads your skill domain and generates fresh evaluation tasks. Each task runs with and without your skill injected. A separate grader model scores every output blind, producing an effect size, confidence interval, and a clear verdict: HELPS, PLACEBO, or HARMS.

Which skill files can I check?

You can check any Markdown skill file including SKILL.md, AGENTS.md, or CLAUDE.md. Point SkillCheck directly at a single file or a project folder to pick your target skill file interactively.

Do I need my own model API key?

No. Sign in with Google or GitHub to receive a free SkillCheck Cloud API key with 10 included runs. Alternatively, bring your own API key for OpenAI, Anthropic, Gemini, Groq, Mistral, OpenRouter, or NVIDIA NIM to run fully direct.

What does a PLACEBO verdict mean?

A PLACEBO verdict means the confidence interval for the skill effect overlaps zero. There is no statistically measurable performance improvement, meaning you are consuming extra tokens without proving any tangible benefit to model outputs.

Is SkillCheck open source?

Yes. The SkillCheck CLI, evaluation framework, benchmark methodology, and dashboard web app are 100% open-source under the MIT license. You can inspect the codebase and contribute on GitHub at github.com/sx4im/skillcheck.

Stop shipping placebo skills

Sign in with Google or GitHub, grab your key, and get your first verdict in about two minutes.

npm install -g @sx4im/skillcheck skillcheck check ./SKILL.md