Author: Gadi Bashvitz

Published Date: May 28, 2026

Estimated Read Time: 6 minutes

AI Security Review Fails Again: Claude Opus 4.6 Struggles To Reliably Remediate Vulnerabilities

Why Runtime Validation Still Matters In AI Security Workflows

Table Of Contents

  1. Introduction
  2. Why We Ran This Experiment
  3. The Research Setup
  4. Initial Vulnerability Detection Results
  5. AI Remediation Results
  6. When AI Fixes Introduced New Vulnerabilities
  7. The Hidden Cost Of AI Security Reviews
  8. Why Runtime Validation Still Matters
  9. How Bright STAR Changed The Results
  10. Cost Comparison: AI-Only vs Bright STAR
  11. Key Research Findings
  12. Final Thoughts

Introduction

AI Is Rapidly Changing Application Security.

Modern Engineering Teams Are Increasingly Using:

  1. AI Coding Assistants
  2. AI Security Review Tools
  3. Autonomous Remediation Workflows
  4. AI-Generated APIs

The Promise Sounds Simple:

AI Can Generate Code
AI Can Detect Vulnerabilities
AI Can Fix Security Issues Automatically

But How Well Does That Actually Work In Practice?

To Find Out, We Conducted A Real-World Experiment Using Claude Opus 4.6 To:

  1. Detect Vulnerabilities
  2. Generate Remediation
  3. Re-Analyze Updated Code
  4. Validate Security Improvements

The Findings Revealed Serious Challenges In AI-Based Remediation Workflows: Inconsistent Fixes, New Vulnerabilities Introduced During Remediation, And High Token Consumption Costs.

Why We Ran This Experiment

As More Organizations Adopt:

  1. AI Coding Tools
  2. AI Coding Assistants
  3. AI Security Review Pipelines

A Critical Question Is Emerging:

Can AI Reliably Secure AI-Generated Code?

Most Existing AI Security Discussions Focus On:

  1. Detection Accuracy
  2. Coding Speed
  3. Developer Productivity

But Runtime Security Validation Is Often Missing From The Conversation.

Our Goal Was To Evaluate Whether Modern LLMs Could Reliably:

  1. Detect Vulnerabilities
  2. Generate Correct Fixes
  3. Eliminate Runtime Exploitability

Rather Than Simply Producing Plausible-Looking Remediation.

The Research Setup

To Simulate A Real Engineering Workflow, We Built A Deliberately Vulnerable Application (~450 LOC) Using Claude Code With Opus 4.6.

The Workflow Included:

  1. Security Review
  2. Vulnerability Detection
  3. AI-Generated Remediation
  4. Re-Analysis Of Updated Code
  5. Runtime Security Validation

The Objective Was Simple:

Could AI Reliably Fix The Vulnerabilities It Identified?

[Figure: Research Workflow]

Initial Vulnerability Detection Results

Claude Opus 4.6 Successfully Identified Multiple Common Security Issues During The Initial Scan.

The Findings Included:

  1. SQL Injection Risks
  2. Authentication Weaknesses
  3. Input Validation Issues
  4. Access Control Problems
  5. Dependency Risks

The Initial Results Demonstrated That Modern LLMs Are Increasingly Capable Of Recognizing Common Security Patterns.

But Detection Alone Does Not Mean Applications Are Secure.

The Real Challenge Begins During Remediation.

AI Remediation Results

The Remediation Phase Produced Mixed Results.

While Some Vulnerabilities Were Partially Addressed, Several Issues:

  1. Remained Exploitable
  2. Were Only Incompletely Fixed
  3. Continued To Fail Runtime Validation

Some AI-Generated Fixes Looked Correct Syntactically But Failed During Runtime Testing.

This Created A Dangerous Illusion Of Security: The Code Appeared Improved, But Exploitability Still Existed.

The Research Revealed Significant Variability Across Remediation Attempts And Vulnerability Categories.
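
To make the gap between a plausible-looking fix and a runtime-verified fix concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration only, not code from the experiment: the table, the function names (lookup_plausible_fix, lookup_parameterized, is_exploitable) and the payload are all assumptions chosen for the example. The first lookup strips a few suspicious tokens but still builds the query by string concatenation; the second uses a bound parameter. A small runtime check replays an injection payload against each one.

```python
import sqlite3

# Classic injection payload; chosen for illustration.
PAYLOAD = "' OR '1'='1"


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two users (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_plausible_fix(conn: sqlite3.Connection, name: str) -> list:
    """Looks hardened, but is not: it removes '--' and ';' and then still
    concatenates the value into the SQL text, so quotes pass through."""
    cleaned = name.replace("--", "").replace(";", "")
    query = f"SELECT secret FROM users WHERE name = '{cleaned}'"
    return [row[0] for row in conn.execute(query)]


def lookup_parameterized(conn: sqlite3.Connection, name: str) -> list:
    """Binds the value as a parameter, so it is never parsed as SQL."""
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


def is_exploitable(lookup) -> bool:
    """Runtime check: no user is named like the payload, so any returned
    row means the payload changed the meaning of the query."""
    return len(lookup(make_db(), PAYLOAD)) > 0


print("plausible fix exploitable:", is_exploitable(lookup_plausible_fix))
print("parameterized exploitable:", is_exploitable(lookup_parameterized))
```

Both functions would pass a casual code review, but only the runtime replay separates them: the token-stripping version still returns every secret for the payload, while the parameterized version returns nothing.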

When AI Fixes Introduced New Vulnerabilities

One of the most important findings was that certain remediation attempts introduced additional security risks.

Examples Included:

  1. Weak Validation Logic
  2. Improper Authentication Handling
  3. Incomplete Sanitization
  4. Expanded Attack Surface Exposure

In Some Cases:

  1. Previously Non-Reachable Paths Became Reachable
  2. Runtime Security Assumptions Failed
  3. Security Posture Worsened After Remediation

This highlights a core limitation of LLM-based security workflows:

AI Optimizes For Plausible Output – Not Deterministic Runtime Security.

The Hidden Cost Of AI Security Reviews

Security Was Not The Only Challenge Identified During The Experiment.

Token Consumption Increased Significantly Across Repeated Remediation Cycles.

Each Additional Cycle Required:

  1. Reviewing The Codebase Again
  2. Generating New Remediation Suggestions
  3. Re-Analyzing Updated Code
  4. Repeating Validation Steps

One Of The Most Expensive Behaviors Observed Was That The Model Frequently Attempted To Remediate Dead Or Non-Reachable Code Paths.

This Increased:

  1. Processing Cost
  2. Token Usage
  3. Remediation Overhead

Without Improving Actual Runtime Security Outcomes.
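
One way to avoid spending remediation cycles on non-reachable code is to filter findings against what was actually observed executing. The sketch below is a hypothetical illustration, not the workflow used in the experiment: the Finding type, the function names, and the idea of a set of executed functions (for example, collected from coverage data during runtime testing) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str      # e.g. "sql-injection" (illustrative label)
    function: str  # name of the function containing the flagged code


def split_by_reachability(findings, executed_functions):
    """Partition findings into (reachable, deferred).

    A finding is 'reachable' when its function was observed executing.
    Deferred findings are not proven dead; they were simply not hit by
    the runtime tests that produced `executed_functions`.
    """
    reachable = [f for f in findings if f.function in executed_functions]
    deferred = [f for f in findings if f.function not in executed_functions]
    return reachable, deferred


# Hypothetical inputs for the example.
findings = [
    Finding("sql-injection", "get_user"),
    Finding("weak-hash", "legacy_export"),
]
executed = {"get_user", "login"}

reachable, deferred = split_by_reachability(findings, executed)
print("remediate first:", [f.function for f in reachable])
print("defer or review manually:", [f.function for f in deferred])
```

Only the reachable findings would be sent to the model for remediation, which keeps token spend focused on code an attacker can actually hit. Deferred findings still deserve a manual look, since missing coverage is not proof that a path is dead.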

What Security Teams Are Learning The Hard Way

Over The Last Two Years, Many Organizations Have Rapidly Adopted:

  1. AI Coding Assistants
  2. AI Security Review Workflows
  3. Autonomous Remediation Pipelines

But Security Teams Are Now Discovering Several Important Lessons:

Assumption | Reality
AI Can Auto-Fix Security Issues | Many Vulnerabilities Remain Exploitable
AI Reduces Security Costs | Token Costs Escalate Quickly
AI Understands Application Architecture | AI Optimizes For Plausibility
AI Replaces Runtime Validation | Runtime Validation Becomes More Important

As AI-Generated Code Scales Across SaaS Teams, Security Validation Is Becoming More Critical – Not Less.

Why Runtime Validation Still Matters

The Research Highlighted A Fundamental Problem In AI Security Workflows:

LLMs Do Not Perform Deterministic Runtime Validation.

AI Can:

  1. Suggest Fixes
  2. Rewrite Vulnerable Code
  3. Improve Syntax

But It Cannot Reliably:

  1. Prove Exploitability
  2. Validate Runtime Security
  3. Confirm Vulnerability Elimination

This Creates A Gap Between Code Appearance And Actual Runtime Security Outcomes.

Without Runtime Validation, Vulnerabilities May:

  1. Remain Exploitable
  2. Shift To New Attack Paths
  3. Introduce Additional Security Risk

How Bright STAR Changed The Results

The Research Compared Full AI-Based Security Pipelines Against Bright STAR Runtime Validation.

Bright STAR Combined:

  1. Runtime Validation
  2. Exploit Verification
  3. Deterministic Testing
  4. AI-Guided Remediation

Instead Of Relying Exclusively On LLM-Generated Analysis.

This Significantly Improved:

  1. Runtime Verification
  2. Validation Accuracy
  3. Cost Efficiency
  4. Remediation Reliability

Bright STAR Reduced:

  1. Token Consumption
  2. Operational Cost
  3. False Positives
  4. Unnecessary Remediation Cycles

While Improving Runtime Security Outcomes.

Cost Comparison: AI-Only vs Bright STAR

The Cost Analysis Revealed Significant Efficiency Differences Between:

  1. Full AI Security Pipelines
  2. Bright STAR Runtime Validation Workflows

Bright STAR Workflow

  1. ~$0.62 Per Scan
  2. ~217K Tokens Across 14 Specialized Tasks

Full AI Pipeline

  1. $9.67–$21.60 Per Scan
  2. ~377K Tokens Across 15 Agents

Estimated Enterprise Cost (100 PRs/Day)

Workflow | Estimated Annual Cost
Full AI Pipeline | ~$3.1M/Year
Bright STAR Workflow | ~$89K/Year

The Analysis Demonstrated That Runtime Validation Significantly Reduced:

  1. Token Usage
  2. Operational Cost
  3. Remediation Overhead

While Improving Runtime Security Outcomes.
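
The arithmetic behind the annual estimates can be sketched as follows. The per-scan costs, the 100 PRs/day volume, and the annual totals are the figures quoted in this section; the 365-day year and the scans_per_pr parameter are assumptions, because the article does not state how the annual totals were derived. With exactly one scan per PR, the per-scan prices give about $22.6K per year for Bright STAR and about $353K to $788K per year for the full AI pipeline. The published totals are consistent with the upper per-scan price and roughly 3.9 scans per PR, which would fit repeated remediation cycles, though that reading is an inference rather than a stated fact.

```python
PRS_PER_DAY = 100
DAYS_PER_YEAR = 365  # assumption: scans run every calendar day

# Per-scan costs quoted in the article (USD).
BRIGHT_STAR_PER_SCAN = 0.62
FULL_AI_PER_SCAN_LOW = 9.67
FULL_AI_PER_SCAN_HIGH = 21.60

# Annual totals quoted in the article (USD, approximate).
BRIGHT_STAR_ANNUAL = 89_000
FULL_AI_ANNUAL = 3_100_000


def annual_cost(per_scan: float, scans_per_pr: float = 1.0) -> float:
    """Annual cost for a given per-scan price and scans per PR."""
    return per_scan * scans_per_pr * PRS_PER_DAY * DAYS_PER_YEAR


def implied_scans_per_pr(annual_total: float, per_scan: float) -> float:
    """Scans per PR needed to reach a quoted annual total."""
    return annual_total / annual_cost(per_scan)


print("Bright STAR, 1 scan/PR:", round(annual_cost(BRIGHT_STAR_PER_SCAN)))
print("Full AI low, 1 scan/PR:", round(annual_cost(FULL_AI_PER_SCAN_LOW)))
print("Full AI high, 1 scan/PR:", round(annual_cost(FULL_AI_PER_SCAN_HIGH)))
print("Implied scans/PR (Bright STAR):",
      round(implied_scans_per_pr(BRIGHT_STAR_ANNUAL, BRIGHT_STAR_PER_SCAN), 2))
print("Implied scans/PR (Full AI, high):",
      round(implied_scans_per_pr(FULL_AI_ANNUAL, FULL_AI_PER_SCAN_HIGH), 2))
```

Whatever the exact multiplier, the ratio between the two annual figures (about 35x) matches the ratio between $21.60 and $0.62 per scan, so the comparison uses the upper end of the full-pipeline price range.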

The Future Of AI Security Is Runtime Validation

Modern AI Security Is No Longer Just About:

  1. Detecting Vulnerabilities
  2. Generating Security Suggestions

It Is About:

Proving Vulnerabilities Are Actually Gone.

As Organizations Continue Adopting:

  1. AI Coding Assistants
  2. AI APIs
  3. MCP Architectures
  4. Autonomous Development Workflows

Runtime Validation Will Become Increasingly Critical For Modern Application Security Programs.

Key Research Findings

Research Area | Observation
Vulnerability Detection | Generally Effective
Remediation Reliability | Inconsistent
Runtime Validation | Limited
Token Consumption | High
Operational Cost | Significant
Runtime Verification | Critical

The Research Demonstrated That AI Can Accelerate Security Review Workflows.

But Without Deterministic Runtime Validation, Organizations Risk Scaling Vulnerabilities Faster Than They Eliminate Them.

Final Thoughts

Our Research Demonstrated That While Claude Opus 4.6 Could Successfully Identify Multiple Vulnerabilities, It Struggled To Reliably Remediate Them And To Validate Runtime Security Outcomes.

Key Findings Included:

  1. Inconsistent Remediation Success
  2. Introduction Of New Vulnerabilities
  3. High Token Consumption Costs
  4. Missing Runtime Validation

AI Can Absolutely Accelerate Development.

But AI-Generated Security Remediation Without Runtime Validation Creates A Dangerous Illusion Of Security.

As AI-Generated Code Becomes Standard Across Modern Engineering Teams, The Industry Must Move Beyond AI-Generated Security Suggestions Toward Deterministic Runtime Validation.

Because In Security:

Looking Fixed Is Not The Same As Being Secure.

