Thousands of SKILL.md files are published. Almost none are tested. SkillCheck runs a blind A/B experiment on any agent skill and hands you a verdict with a confidence interval, in about two minutes.
Run npx @sx4im/skillcheck to try it without installing anything.
A skill file changes your agent's behavior on every single request. Yet almost nobody measures whether that change is an improvement.
Skills get written, committed, and shared without a single controlled comparison. "It feels better" is the entire QA process.
Every skill costs prompt tokens on every call. A skill that does nothing still bills you for the privilege, forever.
Models change underneath you. A skill that helped six months ago can quietly become useless, or start hurting, after an upgrade.
One command runs the whole experiment. No setup, no harness, no notebooks.
Your skill file is parsed and its declared domain extracted from front matter or the first heading.
A generator model writes fresh evaluation tasks from the domain only. It never sees the skill body, so tasks cannot leak its instructions.
Every task runs with the skill injected as a system prompt, and again without it. Same model, same temperature, same everything else.
A separate grader scores each output against the task's pass criterion at temperature 0. It never knows which arm produced the output.
A 1,000-iteration paired bootstrap turns the per-task pass results into an effect size, a 95% confidence interval, and one verdict: HELPS, PLACEBO, or HARMS.
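The bootstrap step can be sketched roughly as follows. This is a minimal illustration, not SkillCheck's actual code: the function name `paired_bootstrap`, the percentile method for the interval, and the rule that HARMS means the whole interval sits below zero are all assumptions. The only rule stated in the copy is that HELPS requires the interval to clear zero.

```python
import random


def paired_bootstrap(with_skill, without_skill, iterations=1000, seed=0):
    """Estimate the skill's effect from paired 0/1 pass results.

    with_skill[i] and without_skill[i] are the results for the same task.
    Returns (effect in percentage points, (low, high) 95% interval, verdict).
    """
    if len(with_skill) != len(without_skill) or not with_skill:
        raise ValueError("need equal-length, non-empty result lists")

    rng = random.Random(seed)
    n = len(with_skill)
    # Per-task difference keeps the pairing: +1 skill won, -1 skill lost, 0 tie.
    diffs = [a - b for a, b in zip(with_skill, without_skill)]
    effect = 100.0 * sum(diffs) / n

    # Resample tasks with replacement and recompute the mean difference.
    resampled = []
    for _ in range(iterations):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(100.0 * sum(sample) / n)
    resampled.sort()

    # Percentile interval (assumed method): 2.5th and 97.5th percentiles.
    low = resampled[int(0.025 * (iterations - 1))]
    high = resampled[int(0.975 * (iterations - 1))]

    if low > 0:
        verdict = "HELPS"
    elif high < 0:  # assumed mirror rule for HARMS
        verdict = "HARMS"
    else:
        verdict = "PLACEBO"
    return effect, (low, high), verdict
```

Resampling the per-task differences rather than each arm separately is what makes the bootstrap paired: a hard task drags both arms down together and does not inflate the variance of the effect.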
SkillCheck produces numbers you can put in a PR description. Every run is reproducible and every score carries its uncertainty.
The same tasks run twice, with and without your skill. The delta in pass rate is the skill's measured effect in percentage points.
Outputs are shuffled before grading, so the grader cannot favor either arm. No self-evaluation bias, no cherry-picking.
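One way to picture the blinding, as a rough sketch under stated assumptions: `blind_grade` and the `grade` callable are hypothetical names, and the callable stands in for the grader model. The point is that the arm label never reaches the grader; it is only rejoined to the score afterwards.

```python
import random


def blind_grade(pairs, grade, seed=0):
    """Grade both arms of every task without revealing the arm to the grader.

    pairs: list of (task_id, output_with_skill, output_without_skill).
    grade: hypothetical callable (task_id, output) -> bool.
    Returns a dict mapping (task_id, arm) to the pass/fail result.
    """
    rng = random.Random(seed)

    # Flatten both arms into one pool so order carries no information.
    items = []
    for task_id, with_output, without_output in pairs:
        items.append((task_id, "with", with_output))
        items.append((task_id, "without", without_output))
    rng.shuffle(items)

    results = {}
    for task_id, arm, output in items:
        # The grader receives only the task and the output, never the arm.
        results[(task_id, arm)] = grade(task_id, output)
    return results
```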
1,000 paired resamples build a 95% interval around the effect. The verdict only says HELPS when the interval clears zero.
Re-run saved results against new model releases. If a verdict flips from HELPS to PLACEBO, you know the skill rotted before your users do.
Every result records the skill hash, task suite, model versions, and transcript hashes. Anyone can re-run and verify the number.
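A result record of this kind might look like the sketch below. The field names, the `build_record` helper, and the choice of SHA-256 are assumptions made for illustration; the copy only says which items are recorded, not how they are hashed or stored.

```python
import hashlib


def sha256_hex(text):
    """Hex digest of a UTF-8 string (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_record(skill_text, task_suite, models, transcripts):
    """Assemble a hypothetical provenance record for one run."""
    return {
        "skill_hash": sha256_hex(skill_text),
        "task_suite": task_suite,
        "models": dict(models),
        # Hashing transcripts lets anyone check a re-run against the
        # original without the record having to carry full outputs.
        "transcript_hashes": [sha256_hex(t) for t in transcripts],
    }
```

If the skill file changes by a single character, `skill_hash` changes too, so a verdict can never be silently attributed to a different version of the skill.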
SkillCheck counts the prompt tokens your skill adds and reports value per 1k tokens. A small win that triples your context is not a win.
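As a worked illustration, value per 1k tokens can be read as the measured effect divided by the added prompt size in thousands of tokens. That definition and the helper name `value_per_1k_tokens` are assumptions; the exact formula SkillCheck uses is not stated here.

```python
def value_per_1k_tokens(effect_pp, added_tokens):
    """Effect in percentage points per 1,000 prompt tokens the skill adds."""
    if added_tokens <= 0:
        raise ValueError("added_tokens must be positive")
    return effect_pp / (added_tokens / 1000)


# A 6-point gain from a 1,500-token skill is worth 4 points per 1k tokens.
print(value_per_1k_tokens(6.0, 1500))  # 4.0
```

Under this reading, two skills with the same effect are not equally good: the shorter one delivers the same gain at a lower recurring cost.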
No dashboards to interpret. The CLI prints a single result card that says whether the skill earned its place in your prompt.
Ten runs is enough to test a real skill at two effort levels. Go unlimited when SkillCheck earns a place in your workflow.
Upgrade lives in your dashboard once you are signed in.
NVIDIA_API_KEY to run fully direct instead.
Sign in with Google or GitHub, grab your key, and get your first verdict in about two minutes.