The Problem

Everyone has tried using an LLM for security auditing by now. The results are usually the same: a wall of Medium-severity findings that sound authoritative but don’t survive contact with a real security engineer. The model finds every eval(), every os.system(), every place user input appears — and calls them all vulnerabilities.

Most of them aren’t. And the ones that are get buried in the noise.

We spent the last several weeks building a security audit skill for Zaguán Blade that tries to solve this problem properly. Not by adding more checks — but by teaching the model when not to flag something.

This post is about the design reasoning, not just the artifact. The prompt itself is published in full. What’s interesting is why each piece exists.

How Most LLM Security Audits Fail

Most LLM-based security audits fail in predictable ways. The failure modes are remarkably consistent across models and frameworks:

  1. Severity inflation. Every dangerous API is Critical. Every user-controlled input is “attacker-controlled.” The model doesn’t distinguish between a web-facing SQL injection and a local CLI flag that passes user input to exec() — they’re both RCE, right?

  2. Context blindness. The model assumes everything is a web application. A desktop app executing commands from its own config file gets flagged the same as a server executing commands from an HTTP request body. The trust model is completely different, but the model doesn’t know that.

  3. Hallucinated paths. The model constructs plausible-sounding attack chains that don’t actually exist in the code. “An attacker could send a crafted payload to the /api/process endpoint…” — but that endpoint doesn’t exist.

  4. Generic advice. “Sanitize all inputs.” “Use parameterized queries.” These are true but useless. A real audit tells you which input, which query, and what exactly to change.

  5. Missing the real bugs. While the model is busy flagging every eval() in your test suite, the actual vulnerability — a subtle authorization gap in a multi-tenant API, or a deserialization path through a parser the model didn’t investigate — goes unnoticed.

Our goal was to build something that produces audits a security engineer would actually want to read and act on — and that meant fixing the approach, not just tuning the prompt.

The uncomfortable truth: most LLM-based audit tools aren’t doing security analysis. They’re doing pattern matching with authority.

The Key Insight: From “Find Scary Things” to “Decide What Matters”

The single most important design decision was this: a dangerous sink is not a vulnerability.

Most tools stop at the sink. Real analysis starts at the boundary.

This sounds obvious, but it’s the root of most false positives. The model sees os.system(user_input) and flags it. But the question isn’t whether the sink is dangerous — it’s whether the attacker crosses a meaningful trust boundary to reach it, and whether they gain capability they didn’t already have.

A desktop app executing commands from its own config, editable only by the same user it runs as? Not a vulnerability — the user already has that authority. A web server executing commands from an HTTP request body? Completely different trust model, and it is a vulnerability.
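The contrast can be made concrete. A minimal Python sketch — the function names (`run_hook`, `handle_request`) are illustrative, not from any real codebase — with the same sink in both places; only the trust model differs:

```python
import subprocess

def run_hook(config_command: str) -> subprocess.CompletedProcess:
    # Desktop app: the command comes from a config file owned by the
    # invoking user. No trust boundary is crossed -- the user could run
    # this command directly anyway. Same sink, no vulnerability.
    return subprocess.run(config_command, shell=True)

def handle_request(body_command: str) -> subprocess.CompletedProcess:
    # Web server: the command arrives over HTTP from an untrusted
    # client. A trust boundary IS crossed -- the attacker gains shell
    # on a machine they had no authority over. Same sink, real
    # command injection.
    return subprocess.run(body_command, shell=True)
```

A sink-matching scanner flags both functions identically; boundary analysis is what separates them.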

The Exploit Value Test

Early versions of the skill used “Exploit Gain” — does the attacker gain new capability? This was good but incomplete. We kept missing a class of issues that are quiet but dangerous:

  • A same-user persistence mechanism (autostart entry, shell hook, CI pipeline poisoning)
  • A config file that gets loaded from a remote sync rather than a static local path
  • A build step that executes code in a different context than the runtime

These don’t give the attacker new privileges in the traditional sense. But they give them persistence, stealth, scope expansion, or context shifting. That’s valuable to an attacker even without privilege escalation.

So we evolved “Exploit Gain” into “Exploit Value”:

Exploit Value = Capability Gain + Persistence + Stealth + Scope

This is now the gating test before any severity assignment. If Exploit Value ≤ 0 (no new capability, no persistence, no stealthy scope expansion), the finding is capped at Low or Informational. Usually it belongs in a different category entirely.
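As a rough sketch of the gate — the field names and integer scoring here are our illustration, not the skill’s literal mechanism, which reasons in prose:

```python
from dataclasses import dataclass

@dataclass
class ExploitValue:
    capability_gain: int  # new privileges or access the attacker lacked
    persistence: int      # survives restarts or sessions
    stealth: int          # silently hijacks trusted workflows
    scope: int            # shifts context (e.g. build time vs. run time)

    def total(self) -> int:
        return (self.capability_gain + self.persistence
                + self.stealth + self.scope)

def severity_cap(ev: ExploitValue) -> str:
    # Gating test: zero exploit value caps the finding at
    # Low/Informational, no matter how scary the sink looks.
    return "Low/Informational" if ev.total() <= 0 else "uncapped"

# Same-user config execution: nothing gained.
assert severity_cap(ExploitValue(0, 0, 0, 0)) == "Low/Informational"
# CI pipeline poisoning: no privilege gain, but persistence and scope.
assert severity_cap(ExploitValue(0, 1, 0, 1)) == "uncapped"
```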

The Classification Taxonomy

Most audit frameworks have findings and… that’s it. That binary — flag it or ignore it — is itself a design failure. It forces the model to either inflate borderline issues into “findings” or drop them entirely. Neither is correct.

The skill now distinguishes five categories:

  • Confirmed finding — you can trace the vulnerable path end to end
  • Likely risk — dangerous pattern, needs one missing fact confirmed (but you must name a specific file/function, not an abstract category)
  • Abuse primitive — not a vulnerability, but a dangerous building block that could be chained in future attacks
  • Hardening opportunity — not currently exploitable, but weakens security posture
  • Design property (by design) — behavior intentionally exposed to a trusted actor; not a vulnerability, but deserves documentation

The Abuse Primitive category is the one that surprises people. It captures things like “executes arbitrary shell from config” or “evaluates templates dynamically.” These aren’t vulnerabilities on their own — the config is trusted, the templates are trusted. But they’re perfect building blocks for an attacker who finds a way to influence that config or those templates through a different path.

This matters because LLMs don’t just find bugs — they combine them. That makes low-severity noise more dangerous, not less. A single Medium finding is a nuisance. Five Medium findings that chain into a sandbox escape are a catastrophe. If you only track standalone vulnerabilities, you miss the chains — and the chains are where the real damage is.
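The taxonomy can be compressed into a decision sketch. This is our illustration, not the skill’s literal logic — the published prompt expresses these rules in prose — but it shows how the categories relate:

```python
def classify(crosses_boundary: bool, exploit_value: int,
             traced_end_to_end: bool, chainable_primitive: bool,
             trust_holds: bool) -> str:
    if crosses_boundary and exploit_value > 0:
        # Vulnerability territory: the only remaining question is proof.
        return "Confirmed finding" if traced_end_to_end else "Likely risk"
    if chainable_primitive:
        return "Abuse primitive"        # dangerous building block
    if trust_holds:
        return "Design property"        # trusted actor, documented trust
    return "Hardening opportunity"      # weakens posture, not exploitable
```

Note that the chainable branch fires even when exploit value is zero today — that is exactly the point of tracking primitives separately.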

The Same-User Exception

Here’s the subtlety that took us the longest to get right.

Our false-positive downgrade heuristics say: “same-user, same-authority effects should not be Medium+.” This correctly kills the most common class of false positives — desktop apps, local tools, trusted config execution.

But it over-corrects. Same-user is NOT safe if the attacker gains:

  • Persistence across restarts or sessions
  • Integrity impact on future trusted execution (e.g., poisoning a CI pipeline, autostart, or shell hook)
  • Context shifting (triggering execution in a different context, like build time vs. run time)
  • Silent hijacking of trusted workflows

This exception rule is critical. Without it, the skill would systematically miss persistence mechanisms, supply chain attacks, and developer tooling compromise — exactly the class of issues that are most valuable to modern attackers.
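In sketch form — ours, with an assumed severity ordering — the downgrade plus its exception:

```python
SEVERITIES = ["Informational", "Low", "Medium", "High", "Critical"]

def same_user_severity(severity: str, *, persistence: bool = False,
                       future_integrity: bool = False,
                       context_shift: bool = False,
                       silent_hijack: bool = False) -> str:
    # Exception first: these outcomes are valuable to an attacker even
    # without privilege escalation, so no downgrade applies.
    if persistence or future_integrity or context_shift or silent_hijack:
        return severity
    # Base heuristic: same-user, same-authority effects are capped
    # below Medium.
    if SEVERITIES.index(severity) >= SEVERITIES.index("Medium"):
        return "Low"
    return severity

# Desktop app executing its own config: capped.
assert same_user_severity("High") == "Low"
# Shell-hook persistence: same user, but the exception holds severity.
assert same_user_severity("High", persistence=True) == "High"
```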

Design Properties Need Guardrails

The “Design property (by design)” category is useful, but it’s also dangerous. In the real world, “that’s by design” is how actual vulnerabilities get dismissed.

So we require every Design Property to explicitly answer:

  • Who is trusted?
  • Why are they trusted?
  • Can trust be violated in practice? (e.g., config loaded from a remote sync vs. a static local file)

If you can’t answer these, it doesn’t belong in this category. This prevents lazy classification and forces the model to reason about whether the trust assumption actually holds.
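One way to enforce that requirement mechanically — a hypothetical sketch, not part of the published skill — is to make the record impossible to construct without the answers:

```python
from dataclasses import dataclass

@dataclass
class DesignProperty:
    behavior: str          # e.g. "autostart file executes shell commands"
    who_is_trusted: str    # e.g. "the invoking user"
    why_trusted: str       # e.g. "owns the config file and the process"
    how_trust_breaks: str  # e.g. "config synced from a remote host"

    def __post_init__(self) -> None:
        # Refuse lazy classification: every trust question must be answered.
        for name in ("who_is_trusted", "why_trusted", "how_trust_breaks"):
            if not getattr(self, name).strip():
                raise ValueError(f"Design Property left {name!r} unanswered")
```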

“What Would Prove Me Wrong?”

This is the safety valve. The skill requires the model to explicitly state, for every finding and in its scratchpad reasoning: “What would prove me wrong?”

This matters because the skill is now strong enough to be convincingly wrong. It reasons well, sounds authoritative, and filters aggressively. When it makes a mistake, it will be a confident mistake. Forcing it to articulate how its own hypothesis could be falsified is the best defense against that.

The Scratchpad as Execution Environment

The <security_scratchpad> isn’t a post-hoc summary — it’s the model’s live investigation workspace. The model must use it to:

  • Plan which files and routes to investigate
  • Trace data flows from ingress → trust boundary → sink → impact
  • State attacker capability before and after
  • Play Devil’s Advocate against its own hypotheses
  • Evaluate exploit chaining
  • State what would prove it wrong
  • Conclude with exact classification

The scratchpad template mirrors these instructions explicitly — Devil’s Advocate and Exploit Chaining are required headers, not just internal reasoning steps. This means they actually appear in the output, not just silently influence it.
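For illustration, a hypothetical rendering of such a template — the headers track the list above, but the published skill defines the exact wording:

```
<security_scratchpad>
  ## Plan: files and routes to investigate
  ## Data flow: ingress -> trust boundary -> sink -> impact
  ## Attacker capability: before vs. after
  ## Devil's Advocate: strongest case that this is NOT a vulnerability
  ## Exploit Chaining: what could this combine with?
  ## What would prove me wrong?
  ## Conclusion: exact classification and severity
</security_scratchpad>
```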

The 2025-2026 Threat Model

The skill is explicitly grounded in the current threat landscape, not a 2021-era checklist. Key shifts it reflects:

  • Broken access control remains the top application risk
  • Supply chain failures are now a core appsec category (SHA pinning, OIDC token scope, artifact integrity, mutable action references)
  • Mishandling of exceptional conditions is a first-class security category (fail-open paths, partial transaction recovery, missing rollback)
  • AI/Agent surfaces are real attack surfaces (prompt injection, tool-output-to-tool-input leakage, excessive agency)
  • AI-driven exploit chaining — LLMs and modern attackers combine multiple low-severity issues to achieve critical impact. This is no longer theoretical.
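To make one of those shifts concrete, consider the mutable-action-reference problem in CI. The workflow snippet below is illustrative (the action name and placeholder SHA are invented); it shows why SHA pinning is now a baseline check:

```yaml
steps:
  # Mutable reference: the v2 tag can be silently repointed at
  # malicious code after you review it.
  - uses: some-org/build-action@v2
  # Pinned reference: an immutable commit SHA (placeholder shown),
  # with the tag kept as a comment for humans.
  - uses: some-org/build-action@<full-commit-sha>  # v2
```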

This isn’t just a better audit methodology. It’s becoming a necessary one. When attackers can chain five low-severity issues into a critical compromise, treating each finding in isolation isn’t cautious — it’s negligent.

The threat model has a version and a cutoff date (April 2026), so you know when it starts getting stale.

What We Learned From Testing

We tested the skill against the Openbox source code — a Linux/BSD window manager. This was the perfect stress test because it’s exactly the kind of codebase that produces false positives with naive audit prompts:

  • Config files that intentionally execute commands
  • Desktop entries that intentionally launch programs
  • IPC mechanisms that intentionally pass messages between same-user processes
  • Session management that intentionally restores state

A naive audit flags all of these as Critical. Our early versions flagged most of them as Medium+. The final version correctly classified the majority as Design Properties, with specific reasoning about why trust holds (or doesn’t) in each case.

For example: Openbox’s autostart.sh mechanism executes arbitrary shell commands from a config file. A naive audit calls this RCE. Our audit classifies it as a Design Property — the config is owned by the same user who runs the window manager, the user already has shell access, and no trust boundary is crossed. But if that config were loaded from a remote sync or a shared NFS mount, the classification would flip to Confirmed Finding, because now an untrusted party can influence trusted execution.

The real vulnerabilities — the subtle authorization gaps, the parser edge cases, the incomplete fixes — actually became more visible once the noise was gone.

The Full Skill

The complete skill definition is available on GitHub. We’re publishing it in full because:

  1. It’s defensive tooling. Knowing the audit methodology doesn’t help attackers bypass it — it tells them what we’ll catch, which pushes them toward the gaps we want to find anyway.
  2. The moat isn’t the prompt. The competitive advantage is the integration into the tool loop, the continuous refinement from real audits, and the execution environment. Anyone can copy the markdown; nobody can copy the flywheel.
  3. Feedback accelerates quality. We’ve already gotten massive improvements from having multiple frontier models review it. Opening it to security practitioners will produce another order of refinement.
  4. It raises the bar. Most “AI security audit” tools right now are glorified checklists. Publishing something rigorous forces the field to level up.

The One-Sentence Rule

If you take nothing else from this post, take this:

Never call something a vulnerability unless attacker-controlled data crosses a trust boundary and produces positive exploit value.

That’s the entire skill compressed into one sentence. Everything else is enforcement machinery.


The Security Audit Skill is part of Zaguán Blade, an AI-powered coding environment. The skill definition is available on GitHub under the Apache 2.0 license. Feedback, issues, and contributions are welcome.