Why AI Security Reviews Still Fail Without Runtime Validation
Table Of Contents
- Introduction
- The AI Security Experiment
- What The AI Actually Found
- What The AI Missed
- Why AI Security Reviews Fail
- The Bigger Signal: The AI Security Gap
- Why Traditional AppSec Cannot Keep Up
- How Bright STAR Solves This Problem
- Taking The Next Step In AI Security
- Final Thoughts
Introduction
AI Coding Tools Are Rapidly Changing How Software Is Built.
Developers Can Now Generate Entire Applications In Minutes Using:
- Claude Code
- GitHub Copilot
- Google Gemini
- Cursor
- ChatGPT
- Amazon Q
- Other AI Coding Assistants
The Rise Of AI Coding Tools And Assistants Is Fundamentally Reshaping Modern Development.
But While AI Dramatically Accelerates Development Speed, It Also Raises A Critical Security Question:
Can AI Reliably Secure The Code It Generates?
That Question Matters More Than Ever Because AI Is No Longer Just Writing Small Functions Or Boilerplate Code.
Modern AI Systems Are Now Generating:
- Entire Applications
- APIs
- Authentication Logic
- Business Workflows
- Infrastructure Configurations
- MCP Integrations
- Runtime Security Mechanisms
And If AI-Generated Code Contains Vulnerabilities, Traditional Security Review Processes May Not Be Able To Keep Up.
To Evaluate This Problem, We Ran A Real-World Experiment Using Claude Code Opus 4.6.
The AI Security Experiment
We Built A ~300-Line Application Entirely Using Claude Code Opus 4.6.
Then, We Planted two critical vulnerabilities inside the application to evaluate whether the same AI Model could reliably detect them during a security review.
The Process Was Simple:

We Asked The Model To Run Five Independent Security Reviews Against The Same Codebase.
The Hypothesis Was Straightforward:
If AI Can Write The Code, Surely It Should Be Able To Find Security Flaws In It.
Right?
The Results Were Concerning.
What The AI Actually Found
Across the five security scans, The Results Showed Major Inconsistencies.
Key Findings:
| Observation | Result |
| --- | --- |
| Vulnerabilities Consistent Across All 5 Scans | Only 32% |
| Findings That Were False Positives | 60% |
| Scans That Missed Planted Critical Vulnerabilities | 60% |
| Scans That Flagged Dead Code As Critical | 100% |
| Findings Actually Validated Across Runs | ~30% |
The AI Identified A Mix Of Issues Such As:
- Input Validation Problems
- Authentication Weaknesses
- Unsafe Database Operations
- Potential Injection Paths
- Logic Flaws
However, Detection Was Highly Inconsistent.
Some Vulnerabilities Appeared In Only 1 Out Of 5 Scans.
Others Were Missed Completely.
Even More Concerning:
- Some Vulnerabilities Were Incorrectly Explained
- Others Were Flagged As Secure
- Several Critical Issues Were Never Discovered
This Means Running The Same AI Security Scan Multiple Times Produced Substantially Different Results.
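One way to quantify this kind of run-to-run drift is to count how many distinct findings show up in every scan. The sketch below is a minimal illustration under stated assumptions: the finding IDs are hypothetical placeholders rather than the experiment's data, `scan_consistency` is an illustrative helper name, and the source does not say how its own consistency figure was calculated.

```python
from collections import Counter

def scan_consistency(scans):
    """Return (share of distinct findings present in every scan, per-finding hit counts)."""
    # Count in how many scans each finding appears (each scan counted once per finding).
    counts = Counter(finding for scan in scans for finding in set(scan))
    if not counts:
        return 0.0, counts
    in_all = [f for f, n in counts.items() if n == len(scans)]
    return len(in_all) / len(counts), counts

# Hypothetical finding IDs for illustration only; not the experiment's results.
scans = [
    {"sqli-login", "weak-session", "xss-search"},
    {"sqli-login", "dead-code-eval"},
    {"sqli-login", "weak-session"},
    {"sqli-login", "open-redirect"},
    {"sqli-login", "weak-session", "dead-code-eval"},
]

ratio, counts = scan_consistency(scans)
print(f"{ratio:.0%} of distinct findings appeared in all {len(scans)} scans")
for finding, hits in sorted(counts.items()):
    print(f"{finding}: {hits}/{len(scans)} scans")
```

Findings that appear in only one or two runs are the ones most worth confirming by other means before anyone acts on them.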
Actual AI Scan Breakdown
The Scan Results Included:
- Real Vulnerabilities
- Dead-Code Findings
- False Positives
- Context-Dependent Findings
- Overstated Severity Ratings
The Analysis Showed That Many Findings Were Not Actually Exploitable During Runtime Validation.
This Highlights One Of The Biggest Problems With LLM-Based Security Review:
AI Reasoning Is Probabilistic – Not Deterministic.
And Security Cannot Depend on Probability Alone.
The Chart Included In The Research Clearly Demonstrated How Findings Changed Across Multiple Runs Of The Same Scan.
What The AI Missed
Several Vulnerabilities Were:
- Misclassified
- Incorrectly Explained
- Or Completely Missed
Examples Included:
- Improper Authentication Handling
- Weak Authorization Logic
- Unsafe Input Processing Paths
- Potential Injection Vectors
In Some Cases, The AI Even Explained Why Vulnerable Code Was Safe.
This Is A Dangerous Failure Mode Because Developers May Trust The AI Explanation And Deploy Vulnerable Code Into Production Environments.
The Most Concerning Result: Missed XSS Vulnerabilities
Perhaps the most important finding was that the AI Completely Missed Two XSS Vulnerabilities That Were Intentionally Planted Inside The Application.
The Vulnerabilities Included:
- A text/html Default Fallback XSS
- An application/xml Namespace XSS
The Attack Chain Required:
- Multi-Step Indirection
- Content Negotiation Logic
- Runtime Rendering Behavior
This Is Exactly The Type Of Runtime Complexity That LLM-Based Security Review Struggles To Understand.
The Vulnerabilities Were Only Fully Visible During Runtime Execution Analysis – Not Through Static AI Reasoning Alone.
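To make the first of these bug classes concrete, here is a hypothetical reconstruction of a content-negotiation handler with a text/html default fallback. It is not the code from the experiment: `negotiate`, `render`, and the `escape_fallback` flag are assumed names chosen for illustration. The point is that each branch looks harmless in isolation, and the unescaped output is only reached when the `Accept` header matches nothing the handler supports.

```python
import html
import json

def negotiate(accept_header, supported=("application/json", "text/plain")):
    """Pick the first supported type named in Accept; otherwise fall back to text/html."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in supported:
            return media_type
    return "text/html"  # default fallback: the branch that is easy to overlook

def render(name, accept_header, escape_fallback=False):
    """Return (content_type, body) for a greeting built from user input."""
    content_type = negotiate(accept_header)
    if content_type == "application/json":
        return content_type, json.dumps({"greeting": name})
    if content_type == "text/plain":
        return content_type, f"Hello {name}"
    # Fallback path: user input lands in HTML. Without escaping, this is reflected XSS.
    safe = html.escape(name) if escape_fallback else name
    return content_type, f"<p>Hello {safe}</p>"

payload = "<script>alert(1)</script>"
print(render(payload, "text/plain"))  # served as plain text, not interpreted as markup
print(render(payload, "image/png"))   # unsupported type -> text/html, payload unescaped
print(render(payload, "image/png", escape_fallback=True))  # same path with escaping
```

A reviewer reading only the JSON and plain-text branches could reasonably conclude the handler is safe, which is one plausible way a multi-step chain like this gets missed.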
Why AI Security Reviews Fail
Large Language Models Are Excellent At:
- Pattern Recognition
- Explanation
- Code Generation
But They Still Struggle With Real Security Validation.
The Research Identified Several Key Limitations.
1. LLMs Don’t Execute The Code
AI Models Analyze Code:
- Statically
- Heuristically
- Probabilistically
They Do Not:
- Run The Application
- Trigger The Vulnerability
- Observe Real Runtime Behavior
Without Runtime Execution, Vulnerabilities Often Become Theoretical Guesses Instead Of Proven Exploitable Risks.
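The difference between a guess and a proven risk can be sketched as a small runtime probe: execute the endpoint, send a harmless marker, and observe whether it comes back unescaped in an HTML response. The example below is a simplified illustration, not any vendor's implementation. The WSGI app, the `call` helper, and the marker string are all hypothetical, and the app is invoked in-process rather than over the network to keep the sketch self-contained.

```python
from urllib.parse import parse_qs, quote

def app(environ, start_response):
    """Hypothetical WSGI endpoint that reflects ?q= into HTML without escaping."""
    q = parse_qs(environ.get("QUERY_STRING", "")).get("q", [""])[0]
    body = f"<p>Results for {q}</p>".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]

def call(wsgi_app, query_string):
    """Execute the WSGI app in-process and capture status, headers, and body."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/", "QUERY_STRING": query_string}
    body = b"".join(wsgi_app(environ, start_response)).decode("utf-8")
    return captured["status"], captured["headers"], body

def is_reflected_unescaped(wsgi_app, marker="<i id=probe-7f3a>"):
    """Send a harmless marker and report whether it returns raw in an HTML response."""
    _, headers, body = call(wsgi_app, "q=" + quote(marker))
    return "text/html" in headers.get("Content-Type", "") and marker in body

print("reflected unescaped:", is_reflected_unescaped(app))
```

The probe reports only what the running code actually did, so the same input produces the same answer every time.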
2. AI Security Results Are Probabilistic
Each Security Scan Was Influenced By:
- Prompt Phrasing
- Model Randomness
- Context Window Limitations
This Is Why Multiple Scans Against The Same Code Produced Different Results.
Security Tools Must Be:
- Deterministic
- Repeatable
- Consistent
LLMs Are Not.
3. AI Lacks Exploit Validation
Most AI Security Reviews Identify Potential Vulnerabilities But Rarely Confirm:
- Whether The Vulnerability Is Actually Exploitable
- Whether The Fix Actually Works
This Creates Two Major Problems:
- False Positives
- False Confidence
And Both Become Extremely Dangerous In Production AI Applications.
The Bigger Signal: The AI Security Gap
This Research Exposed A Much Larger Industry Problem.
AI Is Already Generating A Growing Percentage Of Modern Code.
Industry Estimates Suggest:
- 30–40% Of Code Is Already AI-Generated
- Some Teams Report 70%+ AI-Assisted Development
But The Security Ecosystem Has Not Caught Up Yet.
Most Existing Security Approaches Still Depend On:
- Static Analysis
- Heuristic Rules
- LLM-Based Code Review
None Of These Approaches Reliably Prove Exploitability.
And None Were Designed For:
- MCP Architectures
- Agentic AI Systems
- AI APIs
- Runtime AI Workflows
- Autonomous Tool Execution
Why Traditional AppSec Cannot Keep Up
AI-Generated Code Introduces A Completely New Challenge:
Machine-Generated Vulnerabilities At Machine Speed
Developers Can Now Generate:
- Entire Applications
- APIs
- Authentication Logic
- Authorization Flows
- Infrastructure Logic
- Complex Security Mechanisms
…In Minutes.
But If Those Systems Contain Vulnerabilities, Traditional AppSec Processes Cannot Keep Up At The Same Speed.
This Is Why Modern Security Requires:
- Runtime Validation
- Deterministic Testing
- Continuous Exploit Verification
Instead Of Static Security Assumptions Alone.
How Bright STAR Solves This Problem
Bright Security’s STAR (Security Testing & Autonomous Remediation) Platform Was Designed Specifically For This New AI Security Landscape.
Unlike LLM-Based Code Review Tools, STAR Focuses On Validated Security Testing.
1. STAR Proves Exploitability
Instead Of Guessing About Vulnerabilities, STAR:
- Executes The Application
- Finds Real Attack Paths
- Proves Vulnerabilities Are Exploitable
This Eliminates Much Of The Guesswork Associated With Pure LLM Analysis.
2. STAR Eliminates False Positives
Traditional Security Tools Often Produce:
- Long Lists Of Potential Issues
- Dead-Code Findings
- Theoretical Vulnerabilities
STAR Uses AI-Optimized, Deterministic, Runtime DAST Validation To Focus Only On:
- Exploitable Vulnerabilities
- Production-Relevant Findings
- Real Security Risks
This Dramatically Improves Developer Productivity And Reduces Alert Fatigue.
3. STAR Validates The Fix
After Remediation Is Generated, STAR Re-Tests The Application To Confirm The Vulnerability Is Actually Resolved.
This Creates A Closed Security Loop:

Most AI Code Review Tools Cannot Reliably Perform This Workflow Today.
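The general shape of a detect, fix, and re-test loop can be sketched in a few lines. This is a toy illustration of the idea only, not a description of how STAR is implemented: `render_vulnerable`, `render_fixed`, `exploit_succeeds`, and `closed_loop` are hypothetical names, and the "fix" is simply HTML escaping applied to a reflected value.

```python
import html

def render_vulnerable(name):
    """Hypothetical handler that reflects input into HTML without escaping."""
    return f"<p>Hello {name}</p>"

def render_fixed(name):
    """Candidate remediation: escape the input before it reaches the markup."""
    return f"<p>Hello {html.escape(name)}</p>"

def exploit_succeeds(render, marker="<i id=probe-7f3a>"):
    """Run the handler with a harmless marker and check whether it survives unescaped."""
    return marker in render(marker)

def closed_loop(render, candidate_fix):
    """Detect, apply a candidate fix, then re-test with the same probe."""
    if not exploit_succeeds(render):
        return "not exploitable"
    if exploit_succeeds(candidate_fix):
        return "fix rejected: still exploitable"
    return "fix verified"

print(closed_loop(render_vulnerable, render_fixed))       # fix verified
print(closed_loop(render_vulnerable, render_vulnerable))  # fix rejected: still exploitable
print(closed_loop(render_fixed, render_fixed))            # not exploitable
```

The key design choice is that the fix is judged by re-running the same probe that proved the issue, not by re-reading the patched code.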
Taking The Next Step In AI Security
AI Is An Incredible Force Multiplier For Software Development.
But It Should Not Become Its Own Security Gatekeeper.
Securing AI-Generated Applications Requires Tools That Understand:
- Runtime Behavior
- Modern Attack Surfaces
- AI Execution Chains
- Exploitability
- Dynamic Application Flows
This Is Especially Critical As Organizations Continue Using:
- The Best AI Coding Assistants
- AI-Generated APIs
- Agentic Workflows
- MCP Servers
- Autonomous Development Systems
Bright Security Helps Teams Move Fast Without Sacrificing Runtime Security Validation.
Whether Teams Are Building With:
- Claude
- GPT
- Gemini
- Cursor
- Custom LLMs
Bright Provides The Runtime Testing And Validation Layer Needed To Deploy AI-Generated Applications Safely.
Final Thoughts
Our Research Demonstrated A Critical Reality:
AI Can Generate Code Faster Than Traditional Security Can Validate It.
Claude Opus 4.6 Successfully Identified Some Vulnerabilities.
But The Results Also Showed:
- Inconsistent Detection
- High False Positive Rates
- Missed Critical Vulnerabilities
- Lack Of Runtime Validation
This Creates A Dangerous Security Gap In Modern AI Development Workflows.
As AI Adoption Accelerates Across SaaS And Engineering Teams, Organizations Need More Than AI-Generated Security Suggestions.
They Need:
- Deterministic Runtime Validation
- Continuous Exploit Verification
- Real Attack Simulation
- Runtime Security Testing
Because In Security: