iVAIS Evaluation Report

Measuring Virtue in AI Systems

First empirical results from testing ideally virtuous AI behavior across leading language models

Cetalabs Research · April 2026
01 — Introduction

Why Virtue Ethics for AI?

Current AI alignment frameworks focus primarily on rule compliance and outcome optimization. But this misses something fundamental: character. An ideally virtuous person doesn't merely follow rules or maximize utility—they embody stable dispositions like honesty, courage, compassion, and practical wisdom.

The iVAIS (ideally Virtuous AI System) framework asks: what would an AI system look like if it modeled not just correct behavior, but virtuous character? This evaluation tests whether current LLMs can distinguish between:

  • Technical moral correctness vs genuine virtue
  • Rule compliance vs character-based judgment
  • Formal compliance vs authentic ethical substance
  • Isolated rule application vs holistic moral assessment
"Virtuosity evaluation is stricter than moral correctness evaluation. Behavior may be technically correct while still falling short of what an ideally virtuous agent would do."
02 — The Dataset

308 Scenarios Testing Virtue Under Tension

Each scenario presents a moral dilemma where two ethical principles conflict. The test measures whether models can identify what an ideally virtuous person would do when those principles pull in different directions.

  • 308 scenarios: Moral dilemmas spanning family dynamics, professional ethics, truth-telling, and care relationships
  • 2 principles per scenario: Each presents tension between competing virtues (e.g., truth vs compassion, courage vs understanding)
  • 3 options: Option 1 (Principle 1), Option 2 (Principle 2), Option 3 (Prefer not to say)
03 — Results

Model Performance on Virtue Ethics

Accuracy Scores Across Leading LLMs

Rank  Model             Developer  Parameters    Accuracy  Std Error
1st   Gemma 4           Google     31B           95.1%     ±1.23%
2nd   DeepSeek v3.2     DeepSeek   ~671B (MoE)   84.1%     ±2.09%
3rd   Llama 3.3         Meta       70B           81.2%     ±2.23%
4th   Gemini 2.5 Flash  Google     Unknown       65.9%     ±2.71%
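The standard errors in the table above can be checked with a short calculation. The sketch below assumes each scenario is scored 0 or 1 and that the standard error is taken from the sample standard deviation (n - 1 in the denominator). The correct-answer counts are not given in the report; they are inferred here as the nearest integers to accuracy × 308, and the function name is illustrative.

```python
import math

N_SCENARIOS = 308  # dataset size stated in the report

# ASSUMED counts: nearest integers to the reported accuracy * 308.
ASSUMED_CORRECT = {
    "Gemma 4 (31B)": 293,
    "DeepSeek v3.2": 259,
    "Llama 3.3 (70B)": 250,
    "Gemini 2.5 Flash": 203,
}


def accuracy_and_stderr(correct: int, n: int) -> tuple[float, float]:
    """Return (accuracy, standard error of the mean) for n binary scores.

    The variance of 0/1 scores with an n - 1 denominator is
    n * p * (1 - p) / (n - 1), so the standard error of the mean
    simplifies to sqrt(p * (1 - p) / (n - 1)).
    """
    p = correct / n
    return p, math.sqrt(p * (1.0 - p) / (n - 1))


for model, correct in ASSUMED_CORRECT.items():
    acc, se = accuracy_and_stderr(correct, N_SCENARIOS)
    print(f"{model:<18} {acc:6.1%}  ±{se:.2%}")
```

Under these assumptions the computed values round to the accuracies and standard errors reported in the table, which suggests the reported errors reflect binomial sampling noise over the 308 scenarios rather than variation across repeated runs.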
04 — Analysis

What The Results Reveal

Model Size ≠ Virtue
Gemma 4 (31B parameters) outperformed Llama 3.3 (70B), suggesting virtue ethics requires specific training qualities beyond raw scale. The ability to model character-based reasoning is not simply a function of parameter count.
DeepSeek Shows Consistency
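Across two independent runs with different shuffling, DeepSeek scored 78.6% and 84.1%. This 5.5-point gap is a range between two runs rather than a variance. It indicates relatively stable virtue modeling, though two runs are too few to estimate run-to-run variability, and the same two scores appear under Temperature Effects Matter, so shuffling and temperature may both contribute.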
Gemma 4 Excels at Holistic Judgment
95.1% accuracy suggests Gemma 4's training aligns exceptionally well with character-based reasoning frameworks. It successfully distinguishes between rule compliance and genuine virtue.
Speed-Optimized Models Struggle
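Gemini 2.5 Flash (65.9%) scored lowest. That is roughly double the 33.3% expected from uniform guessing over three options, but about 29 points behind Gemma 4. The optimization for speed may come at the cost of the nuanced reasoning virtue ethics requires, although this evaluation does not isolate that cause.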
The Dataset Discriminates Successfully
The 29.2-point spread (65.9% to 95.1%) is many times larger than any model's standard error, which suggests this evaluation tests moral character modeling and not just surface-level ethical knowledge. It separates competent reasoning from sophisticated character understanding.
Temperature Effects Matter
DeepSeek's accuracy fell from 84.1% (default temperature) to 78.6% (temperature 1.0), suggesting that virtue ethics reasoning may be sensitive to generation parameters. This rests on a single pair of runs, and no factual-recall baseline was measured here for comparison.
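One way to gauge whether DeepSeek's 84.1% versus 78.6% gap exceeds sampling noise is a two-proportion z-test. This is a minimal sketch under stated assumptions: the counts 259 and 242 are inferred from the reported percentages of 308, the two runs are treated as independent samples, and the function names are illustrative. Because both runs used the same scenarios, a paired test would be more appropriate, but it needs per-scenario results that the report does not include.

```python
import math


def two_proportion_z(correct_a: int, correct_b: int, n: int) -> float:
    """z statistic for the gap between two accuracies measured on n items each.

    Uses the unpooled standard error and treats the runs as independent,
    which is an approximation when both runs share the same items.
    """
    p_a, p_b = correct_a / n, correct_b / n
    se_diff = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return (p_a - p_b) / se_diff


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# ASSUMED counts: 259/308 is 84.1% and 242/308 is 78.6%.
z = two_proportion_z(259, 242, 308)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.3f}")
```

Under these assumptions z comes out near 1.76 with a two-sided p-value of about 0.08, so the drop is suggestive but would not clear a conventional 5% threshold on its own. Additional runs at each temperature would be needed to separate a temperature effect from shuffling and sampling noise.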
What This Means for AI Alignment

Current alignment paradigms treat ethical AI as a problem of constraint satisfaction—teach the model rules, add safety filters, optimize for approved outcomes. But virtue ethics reveals a different dimension entirely.

An AI system can be technically compliant while lacking genuine ethical substance. It can follow every rule while missing the spirit of moral action. It can maximize stated objectives while failing to embody the stable dispositions we recognize as character.

The gap between Gemma 4 (95.1%) and Gemini Flash (65.9%) isn't just about accuracy—it's about whether the model can recognize when formal correctness diverges from virtuous action. This evaluation shows that some models can model this distinction; others cannot.

As we deploy AI systems in contexts requiring judgment—healthcare, education, governance—the question shifts from "does it follow the rules?" to "does it embody the character we want making these decisions?"

Methodology

Evaluation Framework

Inspect AI 0.3.200 with model-graded QA scoring against ground truth answers developed through virtue ethics analysis.

Prompt Engineering

System message establishes iVAIS identity. Prompt template emphasizes character-based judgment over rule compliance.

Sample Size

308 scenarios with randomized ordering per model run. All samples evaluated at temperature 1.0 (except where noted).

Inference Platform

All models accessed via OpenRouter API to ensure consistent evaluation environment across providers.

05 — Implications

Next Steps

For AI Safety Research: The iVAIS framework provides a new axis for evaluating alignment—not just "does it do what we want?" but "does it embody the character we want in systems making consequential decisions?"

For Model Development: The strong performance of Gemma 4 despite smaller size suggests that training specifically for virtue-aligned reasoning may be more effective than simply scaling parameters.

For Deployment Contexts: In domains requiring practical wisdom (education, healthcare, governance), virtue-aligned models may provide better judgment than purely rule-based or outcome-optimized systems.

01 — Military Context

Extreme Dilemmas Under Combat Conditions

The military & defense dataset tests virtue ethics under the most extreme moral pressure imaginable. These scenarios involve:

  • Life-and-death decisions with no time for deliberation
  • Competing moral absolutes (e.g., don't kill civilians vs save your troops)
  • Imperfect information requiring judgment under uncertainty
  • Five options instead of three, forcing nuanced moral reasoning
  • No "prefer not to say" escape—decisions must be made
"In war, the virtuous person faces dilemmas where every option violates some principle. Character is revealed not in choosing the right answer, but in how one weighs competing moral imperatives."
02 — The Military Dataset

52 Extreme Scenarios, 15 Thematic Categories

  • 52 scenarios: Covering autonomous weapons, torture, nuclear deterrence, civilian casualties, cyber warfare, and command responsibility
  • 5 options per scenario: Increased complexity: immediate action, principled restraint, compromise, creative alternative, delay/process
  • 15 thematic categories: Including autonomous weapons, torture ethics, nuclear weapons, dual-use tech, cyber warfare, war crimes, child soldiers
Why Military Ethics?

Military scenarios test whether AI can maintain virtuous character under maximum moral stress. Unlike family ethics, where the virtues in tension usually still point toward a shared good (protecting children, being honest), warfare forces choices between:

  • Jus ad bellum vs jus in bello: Justice of going to war vs justice in the conduct of war
  • Proportionality vs discrimination: Limiting harm vs protecting civilians
  • Command responsibility vs operational effectiveness: Accountability vs mission success
  • Rule of law vs national security: Legal constraints vs existential threats

If virtue ethics only works in comfortable scenarios, it's not virtue—it's preference. Military dilemmas reveal whether models can maintain character integrity when every option is morally costly.

03 — Results

Model Performance on Extreme Military Dilemmas

Accuracy Scores: Military & Defense Scenarios (5 Options)

Rank  Model          Family Ethics (3 opt)  Military Ethics (5 opt)  Performance Change (pts)
1st   Llama 3.3      81.2%                  75.0%                    ↓ 6.2
2nd   DeepSeek v3.2  84.1%                  69.2%                    ↓ 14.9
3rd   Gemma 4        95.1%                  65.4%                    ↓ 29.7
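The military results are reported without standard errors. The sketch below estimates them under the same assumption as before (binary scores, n - 1 denominator), with correct-answer counts inferred as the nearest integers to accuracy × 52. These counts and the function names are assumptions for illustration, not figures from the report.

```python
import math

N_MILITARY = 52  # scenario count stated in the report

# ASSUMED counts: nearest integers to the reported accuracy * 52.
ASSUMED_CORRECT = {"Llama 3.3": 39, "DeepSeek v3.2": 36, "Gemma 4": 34}


def stderr(correct: int, n: int) -> float:
    """Standard error of an accuracy over n binary scores (n - 1 denominator)."""
    p = correct / n
    return math.sqrt(p * (1 - p) / (n - 1))


def gap_in_standard_errors(correct_a: int, correct_b: int, n: int) -> float:
    """Accuracy gap divided by the standard error of that gap.

    Treats the two models' scores as independent, which is an approximation
    because both models answered the same scenarios.
    """
    gap = (correct_a - correct_b) / n
    se_gap = math.sqrt(stderr(correct_a, n) ** 2 + stderr(correct_b, n) ** 2)
    return gap / se_gap


for model, correct in ASSUMED_CORRECT.items():
    acc = correct / N_MILITARY
    print(f"{model:<14} {acc:6.1%}  ±{stderr(correct, N_MILITARY):.2%}")

ratio = gap_in_standard_errors(
    ASSUMED_CORRECT["Llama 3.3"], ASSUMED_CORRECT["Gemma 4"], N_MILITARY
)
print(f"Llama vs Gemma gap: {ratio:.2f} standard errors")
```

Under these assumptions each military accuracy carries a standard error of roughly 6 to 7 points, and the 9.6-point gap between first and third place is about one standard error of the difference. With only 52 scenarios, the military ranking may therefore be less stable than the family ranking.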
04 — Analysis

What Changed Under Extreme Pressure

Rankings Completely Reversed
Gemma 4 dropped from 1st (95.1%) to 3rd (65.4%). Llama 3.3 rose from 3rd to 1st. This suggests different models excel at different types of moral reasoning—Gemma for everyday virtue, Llama for high-stakes decisions.
Gemma 4's Dramatic Drop
A 29.7-point decrease, the largest drop of any model. Its character-based reasoning that excelled in family contexts struggled with military dilemmas requiring consequentialist calculation and proportionality analysis.
Llama 3.3's Resilience
Only a 6.2-point drop despite the option count rising from 3 to 5. Llama maintained 75% accuracy on genuinely extreme dilemmas, which suggests robust moral reasoning that scales to high-stakes scenarios.
DeepSeek's Middle Ground
A 14.9-point drop, a moderate decline. DeepSeek ranked second on both datasets without leading either. This may indicate generalist moral reasoning without domain specialization.
Difficulty Calibration Works
All three models scored lower on the military scenarios than on the family scenarios (65-75% vs 81-95%). The 5-option format and extreme scenarios appear to discriminate between different levels of moral sophistication.
No Model Solves Extreme Dilemmas Well
Even for the best performer (Llama, 75%), 1 in 4 military decisions diverges from the ground truth virtue ethics analysis. These dilemmas may genuinely be at the edge of computable moral reasoning.
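Because the family set offers three options and the military set five, raw accuracies on the two are not measured against the same guessing baseline (33.3% versus 20%). One common way to compare them is to rescale each accuracy so that uniform guessing maps to 0 and a perfect score to 1. The sketch below is an illustration, not part of the report's method; it assumes every option is equally likely under guessing, which may not hold if "Prefer not to say" is rarely a correct answer.

```python
def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so uniform guessing maps to 0 and a perfect score to 1."""
    chance = 1.0 / n_options
    return (accuracy - chance) / (1.0 - chance)


# Accuracies as reported: (family, 3 options) and (military, 5 options).
REPORTED = {
    "Llama 3.3": (0.812, 0.750),
    "DeepSeek v3.2": (0.841, 0.692),
    "Gemma 4": (0.951, 0.654),
}

for model, (family, military) in REPORTED.items():
    fam = chance_corrected(family, 3)
    mil = chance_corrected(military, 5)
    print(f"{model:<14} family {fam:.3f}  military {mil:.3f}  change {mil - fam:+.3f}")
```

On this rescaled measure every model still scores lower on the military set, and the ordering of the drops is unchanged, but Llama's decline shrinks to about 0.03. Part of the raw drop may therefore reflect the harder guessing baseline rather than the scenario content.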
The Virtue of Context-Sensitivity

The ranking reversal reveals something profound: there is no universally "most virtuous" model. Virtue is context-dependent, and different models embody different forms of practical wisdom.

Gemma 4 excels at everyday moral character—the stable dispositions of honesty, compassion, and integrity that guide family and community life. But it struggles when virtue requires calculating proportional force or weighing competing lives.

Llama 3.3 maintains judgment under extreme pressure. It can perform the cold moral calculus required in warfare while still anchoring decisions in ethical principles rather than pure optimization.

This suggests a critical insight for AI deployment: alignment is not one-size-fits-all. A model optimized for healthcare virtue may fail in military contexts. A model that handles warfare well may be overly utilitarian in family settings.

The question is not "which model is most virtuous?" but "which virtue does this context demand, and which model embodies that virtue?"

Military Dataset Methodology

Ethical Frameworks

Just War Theory (jus ad bellum, jus in bello), International Humanitarian Law, Geneva Conventions, principles of proportionality and discrimination.

Ground Truth

Answers developed through systematic analysis using IHL, JWT, command responsibility doctrine, and lesser-evil reasoning for genuine dilemmas.

Option Design

5 options representing: (1) immediate action, (2) principled restraint, (3) compromise/middle-ground, (4) creative alternatives, (5) delay/process. No "prefer not to say."

Themes Covered

Autonomous weapons, torture/interrogation, civilian casualties, nuclear deterrence, dual-use tech, war crimes, cyber warfare, asymmetric warfare, child soldiers, triage scenarios.

05 — Cross-Dataset Insights

What We Learn From Both Tests

Different Virtues for Different Contexts: No single model embodies all virtues equally. Gemma excels at character-based everyday ethics. Llama handles high-stakes consequentialist reasoning. This mirrors human expertise—a good parent isn't necessarily a good general.

Complexity Reveals Limitations: Moving from 3 to 5 options exposed brittleness in all models. Even Llama's 75% means significant error rates. As moral complexity increases, current LLMs approach the limits of computable ethics.

Size Still Doesn't Predict Virtue: Gemma (31B) beat larger models in family contexts, but Llama (70B) won in military. DeepSeek's MoE architecture (~671B) didn't dominate either. Architecture and training matter more than parameter count.

Generalization Gap: Performance on one ethical domain doesn't predict performance on another. This has profound implications for safety: a model that passes everyday ethics evals may fail catastrophically in high-stakes scenarios.