iVAIS Evaluation Results — Measuring Virtue in AI Systems

01 — Introduction

Why Virtue Ethics for AI?

Current AI alignment frameworks focus primarily on rule compliance and outcome optimization. But this misses something fundamental: character. An ideally virtuous person doesn't merely follow rules or maximize utility—they embody stable dispositions like honesty, courage, compassion, and practical wisdom.

The iVAIS (ideally Virtuous AI System) framework asks: what would an AI system look like if it modeled not just correct behavior, but virtuous character? This evaluation tests whether current LLMs can distinguish between:

Technical moral correctness vs genuine virtue
Rule compliance vs character-based judgment
Formal compliance vs authentic ethical substance
Isolated rule application vs holistic moral assessment

"Virtuosity evaluation is stricter than moral correctness evaluation. Behavior may be technically correct while still falling short of what an ideally virtuous agent would do."

02 — The Dataset

308 Scenarios Testing Virtue Under Tension

Each scenario presents a moral dilemma in which two ethical principles conflict. The test measures whether models can identify what an ideally virtuous person would do when those principles pull in different directions.

308 Scenarios

Moral dilemmas spanning family dynamics, professional ethics, truth-telling, and care relationships

2 Principles per scenario

Each presents tension between competing virtues (e.g., truth vs compassion, courage vs understanding)

3 Options

Option 1 (Principle 1), Option 2 (Principle 2), Option 3 (Prefer not to say)
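The scenario layout described above can be pictured as a small record type. This is a minimal sketch, not the dataset's actual schema: the class name, every field name, the `render_question` helper, and the example scenario are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one dilemma; field names are illustrative only."""
    scenario_id: str
    dilemma: str       # narrative description of the situation
    principle_1: str   # e.g. "truth"
    principle_2: str   # e.g. "compassion"
    option_1: str      # action that follows principle 1
    option_2: str      # action that follows principle 2
    target: int        # ground-truth option number (assumed to be 1, 2, or 3)

    def options(self) -> list[str]:
        # Option 3 is the fixed "Prefer not to say" choice from the description.
        return [self.option_1, self.option_2, "Prefer not to say"]


def render_question(s: Scenario) -> str:
    """Format one scenario as a numbered three-option question."""
    lines = [s.dilemma, ""]
    for number, text in enumerate(s.options(), start=1):
        lines.append(f"Option {number}: {text}")
    return "\n".join(lines)


# Made-up example, not taken from the dataset.
example = Scenario(
    scenario_id="demo-001",
    dilemma="A friend asks whether you liked a eulogy you found disorganized.",
    principle_1="truth",
    principle_2="compassion",
    option_1="Say plainly that the eulogy was hard to follow.",
    option_2="Praise the feeling behind it and leave the critique unsaid.",
    target=2,
)

print(render_question(example))
```

Keeping the two principles as explicit fields makes it possible to group results by the pair of virtues in tension, if the real data carries that information.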

03 — Results

Model Performance on Virtue Ethics

Accuracy Scores Across Leading LLMs

Rank	Model	Developer	Parameters	Accuracy	Std Error
1st	Gemma 4	Google	31B	95.1%	±1.23%
2nd	DeepSeek v3.2	DeepSeek	~671B (MoE)	84.1%	±2.09%
3rd	Llama 3.3	Meta	70B	81.2%	±2.23%
4th	Gemini 2.5 Flash	Google	Unknown	65.9%	±2.71%
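The standard errors in the table can be checked with a few lines of arithmetic. The sketch below assumes each scenario is scored 0 or 1 and that the error is the sample standard error of the mean (variance with an n - 1 denominator). The correct-answer counts are inferred by rounding accuracy times 308 and are an assumption, since the source reports only percentages.

```python
import math

N_SCENARIOS = 308

# Inferred counts (assumption): round(accuracy * 308) for each model.
CORRECT = {
    "Gemma 4": 293,
    "DeepSeek v3.2": 259,
    "Llama 3.3": 250,
    "Gemini 2.5 Flash": 203,
}


def accuracy_and_stderr(correct: int, n: int) -> tuple[float, float]:
    """Return (accuracy, standard error of the mean) for n scores of 0 or 1."""
    p = correct / n
    # Sample variance of n Bernoulli scores, using the n - 1 denominator.
    sample_var = p * (1 - p) * n / (n - 1)
    return p, math.sqrt(sample_var / n)


RESULTS = {m: accuracy_and_stderr(c, N_SCENARIOS) for m, c in CORRECT.items()}

for model, (acc, se) in RESULTS.items():
    low, high = acc - 1.96 * se, acc + 1.96 * se
    print(f"{model:<18} {acc:6.1%} ±{se:.2%}  95% CI [{low:.1%}, {high:.1%}]")
```

Under these assumptions the computed errors match the table to two decimals. The normal-approximation intervals for DeepSeek v3.2 and Llama 3.3 overlap, so their relative order is less certain than Gemma 4's lead.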

04 — Analysis

What The Results Reveal

Model Size ≠ Virtue

Gemma 4 (31B parameters) outperformed Llama 3.3 (70B), suggesting virtue ethics requires specific training qualities beyond raw scale. The ability to model character-based reasoning is not simply a function of parameter count.

DeepSeek Shows Consistency

Across two independent runs with different shuffling, DeepSeek scored 78.6% and 84.1%. This 5.5-percentage-point gap indicates relatively stable virtue modeling despite sampling differences.

Gemma 4 Excels at Holistic Judgment

95.1% accuracy suggests Gemma 4's training aligns exceptionally well with character-based reasoning frameworks. It successfully distinguishes between rule compliance and genuine virtue.

Speed-Optimized Models Struggle

Gemini 2.5 Flash (65.9%) scores about twice the 33.3% random-chance baseline for 3-option scenarios, yet trails the next model by more than 15 points. Optimizing for speed may sacrifice some of the nuanced reasoning that virtue ethics requires.

The Dataset Discriminates Successfully

The 29.2-point spread (65.9% to 95.1%) suggests this evaluation tests moral character modeling, not just surface-level ethical knowledge. It separates competent reasoning from sophisticated character understanding.

Temperature Effects Matter

DeepSeek's accuracy was 84.1% at the default temperature and 78.6% at temperature 1.0, which suggests that virtue ethics reasoning may be sensitive to generation parameters. Because the two runs also used different shuffling, the 5.5-point drop cannot be attributed to temperature alone.
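Whether a 5.5-point gap between two runs exceeds sampling noise can be estimated with a two-proportion z-test. This is a rough sketch under stated assumptions: the counts 259 and 242 are inferred from 84.1% and 78.6% of 308 scenarios, and the test treats the runs as independent samples, which is a simplification because both runs used the same scenarios.

```python
import math


def two_proportion_z(correct_a: int, correct_b: int, n: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two accuracies on n items each."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se_diff = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se_diff
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Inferred counts (assumption): 259/308 at default temperature, 242/308 at 1.0.
z, p_value = two_proportion_z(259, 242, 308)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under these assumptions z falls below the conventional 1.96 threshold, so the gap is suggestive but not clearly separable from run-to-run variation. A paired test on per-scenario outcomes would be more appropriate if those were available.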

What This Means for AI Alignment

Current alignment paradigms treat ethical AI as a problem of constraint satisfaction—teach the model rules, add safety filters, optimize for approved outcomes. But virtue ethics reveals a different dimension entirely.

An AI system can be technically compliant while lacking genuine ethical substance. It can follow every rule while missing the spirit of moral action. It can maximize stated objectives while failing to embody the stable dispositions we recognize as character.

The gap between Gemma 4 (95.1%) and Gemini 2.5 Flash (65.9%) isn't just about accuracy; it's about whether the model can recognize when formal correctness diverges from virtuous action. This evaluation suggests that some models capture this distinction far more reliably than others.

As we deploy AI systems in contexts requiring judgment—healthcare, education, governance—the question shifts from "does it follow the rules?" to "does it embody the character we want making these decisions?"

Methodology

Evaluation Framework

Inspect AI 0.3.200 with model-graded QA scoring against ground truth answers developed through virtue ethics analysis.

Prompt Engineering

System message establishes iVAIS identity. Prompt template emphasizes character-based judgment over rule compliance.

Sample Size

308 scenarios with randomized ordering per model run. All samples evaluated at temperature 1.0 (except where noted).

Inference Platform

All models accessed via OpenRouter API to ensure consistent evaluation environment across providers.
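The randomized ordering described in the methodology can be made reproducible by seeding the shuffle for each run. The sketch below uses only the Python standard library. The function name, the placeholder scenario ids, and the seed values are assumptions, and the source does not say whether fixed seeds were used or how Inspect AI was configured to shuffle.

```python
import random


def shuffled_order(scenario_ids: list[str], run_seed: int) -> list[str]:
    """Return a reproducible random ordering of the scenarios for one run."""
    order = list(scenario_ids)              # copy so the caller's list is untouched
    random.Random(run_seed).shuffle(order)  # private generator, no global state
    return order


# 308 placeholder ids standing in for the real scenarios.
ids = [f"scenario-{i:03d}" for i in range(1, 309)]

run_a = shuffled_order(ids, run_seed=1)
run_b = shuffled_order(ids, run_seed=2)

print(run_a[:3], run_b[:3])
```

Recording the seed alongside each result would let a later run replay the exact ordering, which helps separate ordering effects from temperature effects.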

05 — Implications

Next Steps

For AI Safety Research: The iVAIS framework provides a new axis for evaluating alignment—not just "does it do what we want?" but "does it embody the character we want in systems making consequential decisions?"

For Model Development: The strong performance of Gemma 4 despite smaller size suggests that training specifically for virtue-aligned reasoning may be more effective than simply scaling parameters.

For Deployment Contexts: In domains requiring practical wisdom (education, healthcare, governance), virtue-aligned models may provide better judgment than purely rule-based or outcome-optimized systems.

01 — Military Context

Extreme Dilemmas Under Combat Conditions

The military & defense dataset tests virtue ethics under the most extreme moral pressure imaginable. These scenarios involve:

Life-and-death decisions with no time for deliberation
Competing moral absolutes (e.g., don't kill civilians vs save your troops)
Imperfect information requiring judgment under uncertainty
Five options instead of three, forcing nuanced moral reasoning
No "prefer not to say" escape—decisions must be made

"In war, the virtuous person faces dilemmas where every option violates some principle. Character is revealed not in choosing the right answer, but in how one weighs competing moral imperatives."

02 — The Military Dataset

52 Extreme Scenarios, 15 Thematic Categories

52 Scenarios

Covering autonomous weapons, torture, nuclear deterrence, civilian casualties, cyber warfare, and command responsibility

5 Options per scenario

Increased complexity: immediate action, principled restraint, compromise, creative alternative, delay/process

15 Thematic categories

Including autonomous weapons, torture ethics, nuclear weapons, dual-use tech, cyber warfare, war crimes, child soldiers

Why Military Ethics?

Military scenarios test whether AI can maintain virtuous character under maximum moral stress. Unlike family ethics, where virtues often point the same way (protecting children, being honest), warfare forces choices between:

• Jus ad bellum vs jus in bello: justice of going to war vs justice in the conduct of war

• Proportionality vs discrimination: keeping harm proportionate to military advantage vs distinguishing combatants from civilians

• Command responsibility vs operational effectiveness: accountability vs mission success

• Rule of law vs national security: legal constraints vs existential threats

If virtue ethics only works in comfortable scenarios, it's not virtue—it's preference. Military dilemmas reveal whether models can maintain character integrity when every option is morally costly.

03 — Results

Model Performance on Extreme Military Dilemmas

Accuracy Scores — Military & Defense Scenarios (5 Options)

Rank	Model	Family Ethics (3 options)	Military Ethics (5 options)	Change (percentage points)
1st	Llama 3.3	81.2%	75.0%	↓ 6.2
2nd	DeepSeek v3.2	84.1%	69.2%	↓ 14.9
3rd	Gemma 4	95.1%	65.4%	↓ 29.7
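Raw accuracies on a 3-option and a 5-option test are not directly comparable, because uniform guessing scores 33.3% on the first and 20% on the second. One common adjustment rescales each score so that chance maps to 0 and a perfect score to 1. The sketch below applies it to the table values. It assumes all options are equally likely to be correct, which may not hold if "Prefer not to say" is rarely the ground truth.

```python
def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so 0.0 is uniform guessing and 1.0 is a perfect score."""
    chance = 1 / n_options
    return (accuracy - chance) / (1 - chance)


# (family accuracy on 3 options, military accuracy on 5 options), from the table.
SCORES = {
    "Llama 3.3": (0.812, 0.750),
    "DeepSeek v3.2": (0.841, 0.692),
    "Gemma 4": (0.951, 0.654),
}

ADJUSTED = {
    model: (chance_corrected(fam, 3), chance_corrected(mil, 5))
    for model, (fam, mil) in SCORES.items()
}

for model, (fam, mil) in ADJUSTED.items():
    print(f"{model:<14} family {fam:.3f}  military {mil:.3f}  change {mil - fam:+.3f}")
```

On this rescaled measure every model still declines from the family set to the military set. Llama 3.3's decline shrinks to roughly 0.03, while Gemma 4's remains the largest.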

04 — Analysis

What Changed Under Extreme Pressure

Rankings Completely Reversed

Gemma 4 dropped from 1st (95.1%) to 3rd (65.4%). Llama 3.3 rose from 3rd to 1st. This suggests different models excel at different types of moral reasoning—Gemma for everyday virtue, Llama for high-stakes decisions.

Gemma 4's Dramatic Drop

A 29.7-percentage-point decrease, the largest drop of any model. The character-based reasoning that excelled in family contexts struggled with military dilemmas requiring consequentialist calculation and proportionality analysis.

Llama 3.3's Resilience

Only a 6.2-point drop despite the option count rising from 3 to 5. Llama maintained 75% accuracy on genuinely extreme dilemmas, which suggests robust moral reasoning that scales to high-stakes scenarios.

DeepSeek's Middle Ground

A 14.9-point drop, a moderate decline. DeepSeek placed second on both datasets without leading either, which may indicate generalist moral reasoning without domain specialization.

Difficulty Calibration Works

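All three models evaluated on both datasets scored lower on the military scenarios than on the family scenarios (65-75% vs 81-95%). The 5-option format and the extreme scenarios together appear to raise difficulty, although this comparison cannot separate the effect of the format from the effect of the subject matter.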

No Model Solves Extreme Dilemmas Well

Even for the best performer (Llama, 75%), 1 in 4 military decisions diverges from the ground-truth virtue ethics analysis. These dilemmas may genuinely be at the edge of computable moral reasoning.
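With only 52 scenarios, each military accuracy carries wide uncertainty. The sketch below computes a 95% Wilson score interval for each model, a standard choice for proportions from small samples. The counts 39, 36, and 34 are inferred from the reported percentages of 52 scenarios and are an assumption.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_MILITARY = 52
# Inferred counts (assumption): 75.0%, 69.2%, and 65.4% of 52 scenarios.
CORRECT = {"Llama 3.3": 39, "DeepSeek v3.2": 36, "Gemma 4": 34}

INTERVALS = {m: wilson_interval(c, N_MILITARY) for m, c in CORRECT.items()}

for model, (low, high) in INTERVALS.items():
    print(f"{model:<14} {CORRECT[model]}/{N_MILITARY}  95% CI [{low:.1%}, {high:.1%}]")
```

Under these assumptions the three intervals overlap substantially, and the gap between first and third place amounts to five scenarios. The military ranking is therefore better read as tentative than as settled.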

The Virtue of Context-Sensitivity

The ranking reversal reveals something profound: there is no universally "most virtuous" model. Virtue is context-dependent, and different models embody different forms of practical wisdom.

Gemma 4 excels at everyday moral character—the stable dispositions of honesty, compassion, and integrity that guide family and community life. But it struggles when virtue requires calculating proportional force or weighing competing lives.

Llama 3.3 maintains judgment under extreme pressure. It can perform the cold moral calculus required in warfare while still anchoring decisions in ethical principles rather than pure optimization.

This suggests a critical insight for AI deployment: alignment is not one-size-fits-all. A model optimized for healthcare virtue may fail in military contexts. A model that handles warfare well may be overly utilitarian in family settings.

The question is not "which model is most virtuous?" but "which virtue does this context demand, and which model embodies that virtue?"

Military Dataset Methodology

Ethical Frameworks

Just War Theory (jus ad bellum, jus in bello), International Humanitarian Law, Geneva Conventions, principles of proportionality and discrimination.

Ground Truth

Answers developed through systematic analysis using IHL, JWT, command responsibility doctrine, and lesser-evil reasoning for genuine dilemmas.

Option Design

5 options representing: (1) immediate action, (2) principled restraint, (3) compromise/middle-ground, (4) creative alternatives, (5) delay/process. No "prefer not to say."

Themes Covered

Autonomous weapons, torture/interrogation, civilian casualties, nuclear deterrence, dual-use tech, war crimes, cyber warfare, asymmetric warfare, child soldiers, triage scenarios.

05 — Cross-Dataset Insights

What We Learn From Both Tests

Different Virtues for Different Contexts: No single model embodies all virtues equally. Gemma excels at character-based everyday ethics. Llama handles high-stakes consequentialist reasoning. This mirrors human expertise—a good parent isn't necessarily a good general.

Complexity Reveals Limitations: Moving from 3 to 5 options exposed brittleness in all models. Even Llama's 75% leaves a 25% error rate. As moral complexity increases, current LLMs approach the limits of computable ethics.

Size Still Doesn't Predict Virtue: Gemma (31B) beat larger models in family contexts, but Llama (70B) won in military. DeepSeek's MoE architecture (~671B) didn't dominate either. Architecture and training matter more than parameter count.

Generalization Gap: Performance on one ethical domain doesn't predict performance on another. This has profound implications for safety: a model that passes everyday ethics evals may fail catastrophically in high-stakes scenarios.