Author: Gadi Bashvitz

Published Date: May 28, 2026

Estimated Read Time: 7 minutes

AI Security Review Fails In Practice: Claude Opus 4.6 Missed Critical Vulnerabilities & Generated Dangerous False Positives

Why AI Security Reviews Still Fail Without Runtime Validation

Table Of Contents

  1. Introduction
  2. The AI Security Experiment
  3. What The AI Actually Found
  4. What The AI Missed
  5. Why AI Security Reviews Fail
  6. The Bigger Signal: The AI Security Gap
  7. Why Traditional AppSec Cannot Keep Up
  8. How Bright STAR Solves This Problem
  9. Taking The Next Step In AI Security
  10. Final Thoughts

Introduction

AI Coding Tools Are Rapidly Changing How Software Is Built.

Developers Can Now Generate Entire Applications In Minutes Using:

  1. Claude Code
  2. GitHub Copilot
  3. Google Gemini
  4. Cursor
  5. ChatGPT
  6. Amazon Q
  7. Other AI Coding Assistants

The Rise Of AI Coding Tools, AI Coding Assistants, And Generative AI For Coding Is Fundamentally Reshaping Modern Development.

But While AI Dramatically Accelerates Development Speed, It Also Raises A Critical Security Question:

Can AI Reliably Secure The Code It Generates?

That Question Matters More Than Ever Because AI Is No Longer Just Writing Small Functions Or Boilerplate Code.

Modern AI Systems Are Now Generating:

  1. Entire Applications
  2. APIs
  3. Authentication Logic
  4. Business Workflows
  5. Infrastructure Configurations
  6. MCP Integrations
  7. Runtime Security Mechanisms

And If AI-Generated Code Contains Vulnerabilities, Traditional Security Review Processes May Not Be Able To Keep Up.

To Evaluate This Problem, We Ran A Real-World Experiment Using Claude Code Opus 4.6.

The AI Security Experiment

We Built A ~300-Line Application Entirely Using Claude Code Opus 4.6.

Then We Planted Two Critical Vulnerabilities Inside The Application To Evaluate Whether The Same AI Model Could Reliably Detect Them During A Security Review.

The Process Was Simple:

We Asked The Model To Run Five Independent Security Reviews Against The Same Codebase.

The Hypothesis Was Straightforward:

If AI Can Write The Code, Surely It Should Be Able To Find Security Flaws In It.

Right?

The Results Were Concerning.

What The AI Actually Found

Across The Five Security Scans, The Results Showed Major Inconsistencies.

Key Findings:

Observation | Result
Vulnerabilities Consistent Across All 5 Scans | Only 32%
Findings That Were False Positives | 60%
Scans That Missed Planted Critical Vulnerabilities | 60%
Scans That Flagged Dead Code As Critical | 100%
Findings Actually Validated Across Runs | ~30%

The AI Identified A Mix Of Issues Such As:

  1. Input Validation Problems
  2. Authentication Weaknesses
  3. Unsafe Database Operations
  4. Potential Injection Paths
  5. Logic Flaws

However, Detection Was Highly Inconsistent.

Some Vulnerabilities Appeared In Only 1 Out Of 5 Scans.

Others Were Missed Completely.

Even More Concerning:

  1. Some Vulnerabilities Were Incorrectly Explained
  2. Others Were Flagged As Secure
  3. Several Critical Issues Were Never Discovered

This Means Running The Same AI Security Scan Multiple Times Produced Substantially Different Results.
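One way to put a number on this kind of inconsistency is to treat each scan as a set of findings and compare the sets. The sketch below is a minimal illustration only: the finding labels and the five example scans are hypothetical placeholders, not the data from the experiment, and `consistency_report` is a name made up for this example.

```python
from collections import Counter

# Hypothetical findings from five scans of the same code (illustrative only).
scans = [
    {"sql-injection", "weak-auth", "dead-code-eval"},
    {"sql-injection", "dead-code-eval", "missing-validation"},
    {"sql-injection", "weak-auth", "dead-code-eval", "xss-fallback"},
    {"sql-injection", "dead-code-eval"},
    {"sql-injection", "dead-code-eval", "logic-flaw"},
]


def consistency_report(scans):
    """Summarize how stable findings are across repeated scans."""
    counts = Counter(finding for scan in scans for finding in scan)
    total = len(scans)
    in_every_scan = sorted(f for f, n in counts.items() if n == total)
    in_one_scan = sorted(f for f, n in counts.items() if n == 1)
    return {
        "distinct_findings": len(counts),
        "in_every_scan": in_every_scan,
        "in_one_scan": in_one_scan,
        # Share of distinct findings that every scan agreed on.
        "consistency": len(in_every_scan) / len(counts) if counts else 0.0,
    }


report = consistency_report(scans)
print(report)
```

A low `consistency` value means most findings depend on which run happened to be used, which is the behavior described above.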

Actual AI Scan Breakdown

The Scan Results Included:

  1. Real Vulnerabilities
  2. Dead-Code Findings
  3. False Positives
  4. Context-Dependent Findings
  5. Overstated Severity Ratings

The Analysis Showed That Many Findings Were Not Actually Exploitable During Runtime Validation.
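To make the dead-code category concrete, here is a small sketch of a function that looks dangerous in isolation but is never called. The module in `SOURCE` and the helper `uncalled_functions` are hypothetical and were written for this illustration; the check itself is a crude static heuristic that ignores methods, dynamic calls, and imports from other modules, so it is a hint rather than proof of unreachability.

```python
import ast

# Hypothetical module used only for illustration; it is parsed, never executed.
SOURCE = '''
import subprocess

def run_report(name):
    return f"report for {name}"

def legacy_export(cmd):
    # Looks dangerous, but nothing in this module ever calls it.
    return subprocess.run(cmd, shell=True)

def main():
    print(run_report("weekly"))

main()
'''


def uncalled_functions(source):
    """Return names of functions that are defined but never called by name."""
    tree = ast.parse(source)
    defined = {
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    }
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return sorted(defined - called)


print(uncalled_functions(SOURCE))
```

A reviewer that rates `legacy_export` as critical is reporting a pattern, not an attack path. Exercising the running application is what separates the two.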

This Highlights One Of The Biggest Problems With LLM-Based Security Review:

AI Reasoning Is Probabilistic – Not Deterministic.

And Security Cannot Depend On Probability Alone.

The Chart Included In The Research Clearly Demonstrated How Findings Changed Across Multiple Runs Of The Same Scan.

What The AI Missed

Several Vulnerabilities Were:

  1. Misclassified
  2. Incorrectly Explained
  3. Or Completely Missed

Examples Included:

  1. Improper Authentication Handling
  2. Weak Authorization Logic
  3. Unsafe Input Processing Paths
  4. Potential Injection Vectors

In Some Cases, The AI Even Explained Why Vulnerable Code Was Safe.

This Is A Dangerous Failure Mode Because Developers May Trust The AI Explanation And Deploy Vulnerable Code Into Production Environments.

The Most Concerning Result: Missed XSS Vulnerabilities

Perhaps The Most Important Finding Was That The AI Completely Missed Two XSS Vulnerabilities That Were Intentionally Planted Inside The Application.

The Vulnerabilities Included:

  1. A text/html Default Fallback XSS
  2. An application/xml Namespace XSS

The Attack Chain Required:

  1. Multi-Step Indirection
  2. Content Negotiation Logic
  3. Runtime Rendering Behavior

This Is Exactly The Type Of Runtime Complexity That Traditional LLM-Based Security Review Struggles To Understand.

The Vulnerabilities Were Only Fully Visible During Runtime Execution Analysis – Not Through Static AI Reasoning Alone.
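The general shape of a default-fallback XSS can be sketched in a few lines. This is not the code from the test application. It is a simplified, hypothetical handler (`pick_content_type`, `render_vulnerable`, and `render_safer` are invented names) that assumes the server chooses a response type from the `Accept` header and falls back to `text/html` when nothing matches.

```python
import html
import json

SUPPORTED = ("application/json", "text/plain")


def pick_content_type(accept_header):
    """Return the first supported type in the Accept header.

    Falls back to text/html when nothing matches, which is the risky default.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
    return "text/html"


def build_body(content_type, name):
    """Serialize the greeting for the negotiated content type."""
    if content_type == "application/json":
        return json.dumps({"greeting": f"Hello {name}"})
    return f"Hello {name}"


def render_vulnerable(name, accept_header):
    content_type = pick_content_type(accept_header)
    # User input is passed through as-is, even on the text/html fallback path.
    return content_type, build_body(content_type, name)


def render_safer(name, accept_header):
    content_type = pick_content_type(accept_header)
    if content_type == "text/html":
        name = html.escape(name)  # Encode for the HTML context.
    return content_type, build_body(content_type, name)


payload = "<script>alert(1)</script>"
print(render_vulnerable(payload, "text/plain"))
print(render_vulnerable(payload, "image/png"))  # Unmatched type: HTML fallback.
print(render_safer(payload, "image/png"))
```

Each function looks harmless when read alone. The problem only appears when an unexpected `Accept` value reaches the fallback branch. An `application/xml` response deserves the same care, since browsers can run script elements declared in the XHTML namespace.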

Why AI Security Reviews Fail

Large Language Models Are Excellent At:

  1. Pattern Recognition
  2. Explanation
  3. Code Generation

But They Still Struggle With:

Real Security Validation

The Research Identified Several Key Limitations.

1. LLMs Don’t Execute The Code

AI Models Analyze Code:

  1. Statically
  2. Heuristically
  3. Probabilistically

They Do Not:

  1. Run The Application
  2. Trigger The Vulnerability
  3. Observe Real Runtime Behavior

Without Runtime Execution, Reported Vulnerabilities Often Remain Theoretical Guesses Instead Of Proven Exploitable Risks.
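The difference between reading code and running it can be shown with a very small dynamic check. The sketch below starts a hypothetical local target (`EchoHandler`), sends it a marker payload, and reports whether the payload comes back unescaped in an HTML response. It uses only the Python standard library; the handler, the `q` parameter, and `probe_reflected_xss` are assumptions made for this example, and a real scanner does far more than this single request.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical target: reflects the 'q' parameter into an HTML page."""

    def do_GET(self):
        query = urllib.parse.urlparse(self.path).query
        value = urllib.parse.parse_qs(query).get("q", [""])[0]
        body = f"<p>Results for {value}</p>".encode("utf-8")  # No escaping.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # Keep the example output quiet.


def probe_reflected_xss(base_url):
    """Send a marker payload and report whether it returns unescaped as HTML."""
    payload = "<script>probe_7f3a()</script>"
    url = base_url + "/?q=" + urllib.parse.quote(payload, safe="")
    with urllib.request.urlopen(url, timeout=5) as response:
        content_type = response.headers.get("Content-Type", "")
        body = response.read().decode("utf-8", errors="replace")
    return content_type.startswith("text/html") and payload in body


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # Port 0 picks a free port.
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    confirmed = probe_reflected_xss(f"http://127.0.0.1:{server.server_port}")
finally:
    server.shutdown()
    server.server_close()

print("reflected XSS confirmed:", confirmed)
```

The result is an observation about what the application actually returned, so repeating the probe against the same build gives the same answer.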

2. AI Security Results Are Probabilistic

Each Security Scan Was Influenced By:

  1. Prompt Phrasing
  2. Model Randomness
  3. Context Window Limitations

This Is Why Multiple Scans Against The Same Code Produced Different Results.

Security Tools Must Be:

  1. Deterministic
  2. Repeatable
  3. Consistent

LLMs Are Not.

3. AI Lacks Exploit Validation

Most AI Security Reviews Identify Potential Vulnerabilities But Rarely Confirm:

  1. Whether The Vulnerability Is Actually Exploitable
  2. Whether The Fix Actually Works

This Creates Two Major Problems:

  1. False Positives
  2. False Confidence

And Both Become Extremely Dangerous In Production AI Applications.

The Bigger Signal: The AI Security Gap

This Research Exposed A Much Larger Industry Problem.

AI Is Already Generating A Growing Percentage Of Modern Code.

Industry Estimates Suggest:

  1. 30–40% Of Code Is Already AI-Generated
  2. Some Teams Report 70%+ AI-Assisted Development

But The Security Ecosystem Has Not Caught Up Yet.

Most Existing Security Approaches Still Depend On:

  1. Static Analysis
  2. Heuristic Rules
  3. LLM-Based Code Review

None Of These Approaches Reliably Prove Exploitability.

And None Were Designed For:

  1. MCP Architectures
  2. Agentic AI Systems
  3. AI APIs
  4. Runtime AI Workflows
  5. Autonomous Tool Execution

Why Traditional AppSec Cannot Keep Up

AI-Generated Code Introduces A Completely New Challenge:

Machine-Generated Vulnerabilities At Machine Speed

Developers Can Now Generate:

  1. Entire Applications
  2. APIs
  3. Authentication Logic
  4. Authorization Flows
  5. Infrastructure Logic
  6. Complex Security Mechanisms

…In Minutes.

But If Those Systems Contain Vulnerabilities, Traditional AppSec Processes Cannot Keep Up At The Same Speed.

This Is Why Modern Security Requires:

  1. Runtime Validation
  2. Deterministic Testing
  3. Continuous Exploit Verification

Instead Of Static Security Assumptions Alone.

How Bright STAR Solves This Problem

Bright Security’s STAR (Security Testing & Autonomous Remediation) Platform Was Designed Specifically For This New AI Security Landscape.

Unlike LLM-Based Code Review Tools, STAR Focuses On:

Validated Security Testing

1. STAR Proves Exploitability

Instead Of Guessing About Vulnerabilities, STAR:

  1. Executes The Application
  2. Finds Real Attack Paths
  3. Proves Vulnerabilities Are Exploitable

This Eliminates Much Of The Guesswork Associated With Pure LLM Analysis.

2. STAR Eliminates False Positives

Traditional Security Tools Often Produce:

  1. Long Lists Of Potential Issues
  2. Dead-Code Findings
  3. Theoretical Vulnerabilities

STAR Uses AI-Optimized, Deterministic Runtime DAST Validation To Focus Only On:

  1. Exploitable Vulnerabilities
  2. Production-Relevant Findings
  3. Real Security Risks

This Dramatically Improves Developer Productivity And Reduces Alert Fatigue.

3. STAR Validates The Fix

After Remediation Is Generated, STAR Re-Tests The Application To Confirm That The Vulnerability Is Actually Resolved.

This Creates A Closed Security Loop.

Most AI Code Review Tools Cannot Reliably Perform This Workflow Today.
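The idea of re-testing a fix can be illustrated with a toy example. The sketch below is not Bright STAR's implementation; it is a minimal, hypothetical loop (`render_original`, `render_patched`, `is_exploitable`, and `closed_loop` are invented for this example) that accepts a candidate fix only when the payload that worked before no longer works.

```python
import html

PAYLOAD = "<img src=x onerror=alert(1)>"


def render_original(comment):
    """Hypothetical vulnerable renderer: inserts user input into HTML as-is."""
    return f"<div class='comment'>{comment}</div>"


def render_patched(comment):
    """Hypothetical remediation: HTML-encode the input before inserting it."""
    return f"<div class='comment'>{html.escape(comment)}</div>"


def is_exploitable(render):
    """Exercise the renderer with a payload and check if it survives intact."""
    return PAYLOAD in render(PAYLOAD)


def closed_loop(original, patched):
    """Detect, apply the candidate fix, then re-test before accepting it."""
    if not is_exploitable(original):
        return "not exploitable: nothing to fix"
    if is_exploitable(patched):
        return "fix rejected: payload still works"
    return "fix verified: payload no longer works"


print(closed_loop(render_original, render_patched))
print(closed_loop(render_original, render_original))  # A "fix" that changes nothing.
```

The second call shows why the re-test matters: a remediation that leaves the behavior unchanged is rejected instead of being assumed to work.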

Taking The Next Step In AI Security

AI Is An Incredible Force Multiplier For Software Development.

But It Should Not Become Its Own Security Gatekeeper.

Securing AI-Generated Applications Requires Tools That Understand:

  1. Runtime Behavior
  2. Modern Attack Surfaces
  3. AI Execution Chains
  4. Exploitability
  5. Dynamic Application Flows

This Is Especially Critical As Organizations Continue Using:

  1. The Best AI Coding Assistants
  2. AI-Generated APIs
  3. Agentic Workflows
  4. MCP Servers
  5. Autonomous Development Systems

Bright Security Helps Teams Move Fast Without Sacrificing Runtime Security Validation.

Whether Teams Are Building With:

  1. Claude
  2. GPT
  3. Gemini
  4. Cursor
  5. Custom LLMs

Bright Provides The Runtime Testing And Validation Layer Needed To Deploy AI-Generated Applications Safely.

Final Thoughts

Our Research Demonstrated A Critical Reality:

AI Can Generate Code Faster Than Traditional Security Can Validate It.

Claude Opus 4.6 Successfully Identified Some Vulnerabilities.

But The Results Also Showed:

  1. Inconsistent Detection
  2. High False Positive Rates
  3. Missed Critical Vulnerabilities
  4. Lack Of Runtime Validation

This Creates A Dangerous Security Gap In Modern AI Development Workflows.

As AI Adoption Accelerates Across SaaS And Engineering Teams, Organizations Need More Than AI-Generated Security Suggestions.

They Need:

  1. Deterministic Runtime Validation
  2. Continuous Exploit Verification
  3. Real Attack Simulation
  4. Runtime Security Testing

Because In Security:
