1. The Benchmarking Framework: Scope and Methodology
In the current era of large-scale language model deployment, distinguishing superficial retrieval heuristics from genuine computational reasoning is a strategic necessity. This meta-analysis serves as an investigative stress test, evaluating 11 leading AI models at the intersection of a deterministic sequence (the Fibonacci numbers) and an irregularly distributed one (the primes). By requiring models to navigate the rigid logical constraints of number theory, we expose the boundaries of their reasoning architectures and their susceptibility to stochastic drift when confronted with mathematical traps.
The audit followed an “Incremental Architecture,” a bottom-up research methodology designed to mature the data before stress-testing the models. The experiment progressed through four distinct stages:
- Data Discovery: Baseline identification of Fibonacci primes within the 1,000 limit.
- In-depth Analysis: Investigation of the index-primality relationship, divisibility rules, and known logical exceptions.
- Gold Standard Synthesis: Construction of a high-fidelity reference document (The Dance of Numbers) to establish an audit baseline.
- Reverse Engineering (The Master Prompt): Development of a single, complex Turkish prompt designed to test cross-lingual reasoning and adherence to negative constraints.
Utilizing a single, trap-laden Turkish prompt across 11 diverse models sets a significantly higher bar than automated metrics. It forces the models to operate in a non-English context (challenging tokenization and nuance) while simultaneously exposing "hallucinated logic" and operational fragility. The following sections move from quantitative statistical counts to a qualitative audit of model logic.
——————————————————————————–
2. Quantitative Benchmark: Numerical Accuracy Across Scales
Numerical consistency is the non-negotiable bedrock of scientific AI applications. We evaluated the models at the 1,000, 1,000,000, and 1,000,000,000 limits to test the reliability of their internal data retrieval and their capacity for managing large-scale numerical distributions.
Statistical Comparison Table: Primes, Fibonacci, and Intersections
| Model Name | 1K Limit (P / F / FP) | 1M Limit (P / F / FP) | 1B Limit (P / F / FP) | Audit Status |
| --- | --- | --- | --- | --- |
| Gemini (Ref) | 168 / 16 / 6 | 78,498 / 31 / 9 | 50,847,534 / 45 / 10 | Consensus |
| ChatGPT | 168 / 16 / 6 | 78,498 / 30 / 9 | 50,847,534 / 44 / 10 | Consensus |
| Claude | 168 / 17 / 6 | 78,498 / 30 / 9 | 50,847,534 / 44 / 11 | Contradiction |
| Copilot | 168 / 16 / 6 | 78,498 / 30 / 9 | 50,847,534 / 44 / 10 | Consensus |
| Qwen | 168 / 16 / 6 | 78,498 / 30 / 9 | 50,847,534 / 44 / 10 | Consensus |
| DeepSeek | 168 / 16 / 4* | 78,498 / 30 / 6* | 50,847,534 / 44 / 8* | Critical Error |
| Grok | 168 / 17 / 6 | 78,498 / 31 / 10 | 50,847,534 / 45 / 11 | Contradiction |
| Mistral | 168 / 16 / 7 | 78,498 / 30 / 11 | 50,847,534 / 45 / 11 | Contradiction |
| Perplexity | 168 / 16 / 6 | 78,498 / 34 / 8 | 50,847,534 / 45 / 10 | Contradiction |
| Meta | 168 / 16 / 7 | 78,498 / 30 / 8 | 50,847,534 / 45 / 10 | Consensus |
| Kimi | 168 / 17 / 6 | 78,498 / 31 / 8 | 50,847,534 / 45 / 9 | Contradiction |
Note: P = primes, F = Fibonacci numbers, FP = Fibonacci primes. Asterisks mark DeepSeek totals that contradict its own enumerated lists.
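The consensus figures in the table can be reproduced with a short script. This is a minimal sketch, not the audit's actual tooling; it assumes the convention (implied by the reference F count of 16) that both initial Fibonacci terms F1 = F2 = 1 are counted separately.

```python
def is_prime(n: int) -> bool:
    """Deterministic trial division; adequate for small audit limits."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def counts(limit: int):
    """Return (P, F, FP): primes, Fibonacci terms, and Fibonacci primes below `limit`."""
    primes = sum(1 for n in range(2, limit) if is_prime(n))
    fibs = []
    a, b = 1, 1
    while a < limit:
        fibs.append(a)         # both leading 1s are counted, per the table's convention
        a, b = b, a + b
    fib_primes = [f for f in fibs if is_prime(f)]
    return primes, len(fibs), len(fib_primes)

print(counts(1000))  # (168, 16, 6), matching the consensus row
```

Trial division is fine at the 1K limit; the 1M and 1B columns would need a sieve for π(x), which is why the table's prime counts are the easiest values for models to retrieve rather than compute.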
Significant Audit Implications:
- DeepSeek’s Consistency Failure: DeepSeek exhibited a severe internal logic collapse, listing six specific Fibonacci primes for the 1,000 limit while simultaneously stating the total was four. Furthermore, it claimed only six Fibonacci primes at the 1,000,000 limit, a significant undercount.
- Perplexity’s Sequence Drift: Perplexity failed the 1,000,000 Fibonacci count (claiming 34 instead of the consensus 30), indicating a breakdown in deterministic sequence generation.
- Consensus vs. Contradiction: While pi(x) (Prime counts) reached universal consensus, the Fibonacci Prime counts revealed high variance at larger scales, exposing models that rely on faulty pattern matching rather than verified primality testing.
——————————————————————————–
3. Logic Audit: Managing Mathematical Traps and Anomalies
To differentiate between superficial retrieval and active calculation, the audit introduced two logical traps: the "n=4 exception" and the "F19 deviation point." These anomalies test whether a model recognizes that, while every prime Fibonacci number (except F4) has a prime index, the converse does not hold: a prime index does not guarantee a prime value.
The Index-Primality Relationship Audit
- The n=4 Exception: All 11 models successfully identified that F4 = 3 is prime despite the composite index of 4. This indicates that the industry has successfully indexed this specific anomaly.
- The F19 Deviation (F19 = 4181): This served as the primary differentiator for logical integrity. Because 19 is a prime number, models relying on the “prime index implies prime value” heuristic failed this test.
- Copilot: Failed via omission; it simply avoided mentioning the deviation.
- Mistral: Correctly identified 4181 as composite but provided a mathematically impossible factorization (19 x 11 x 2), whose product is 418, not 4181.
- Grok: Suffered a total hallucination, declaring 4181 as a prime number.
- Claude and ChatGPT: Successfully identified F19 as composite and provided the correct factors (37 x 113), proving active verification.
The “So What?” Layer: A model’s ability to prove the primality relationship is unidirectional is the ultimate metric for scientific reliability. Failure at the F19 point reveals a model that matches patterns rather than performing logic, rendering it unfit for high-stakes technical auditing.
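Both traps can be verified in a few lines, which is exactly the "active verification" the audit rewards. A minimal sketch, using the F1 = F2 = 1 indexing implied by the audit's F4 = 3 and F19 = 4181:

```python
def is_prime(n: int) -> bool:
    """Simple trial division, sufficient for these small values."""
    if n < 2:
        return False
    f = 2
    while f * f <= n:
        if n % f == 0:
            return False
        f += 1
    return True

def fib(n: int) -> int:
    """n-th Fibonacci number with F(1) = F(2) = 1 indexing."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

print(fib(4), is_prime(fib(4)))    # 3 True  -> prime value at composite index 4
print(fib(19), is_prime(fib(19)))  # 4181 False -> composite value at prime index 19
print([p for p in range(2, fib(19)) if fib(19) % p == 0 and is_prime(p)])  # [37, 113]
```

A model that runs (or faithfully simulates) this check cannot declare 4181 prime or emit a factorization whose product is not 4181; heuristic pattern matching can.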
——————————————————————————–
4. Qualitative Synthesis: Historical Depth and Growth Characterization
Sophisticated AI must provide “Scientific Humanism”—the capacity to contextualize technical formulas within historical and etymological frameworks. This audit evaluated the models’ ability to synthesize the “why” behind mathematical developments.
Growth Rate and Historical Depth Comparison
| Category | Depth Experts (DeepSeek, ChatGPT, Claude) | Surface Level (Gemini Reference) |
| --- | --- | --- |
| Historical Context | DeepSeek provided unrivaled detail: the etymology of “filius Bonacci,” Leonardo Pisano’s life in Bejaia, and the impact of the Hindu-Arabic system in Europe. | Gemini offered a superficial summary, mentioning only the year 1202 and basic rabbit problem origins. |
| Mathematical Growth | Included Binet’s Formula and the Golden Ratio (phi). DeepSeek uniquely referenced Carmichael’s Theorem regarding primitive prime divisors. | Used basic definitions of “exponential growth” without providing supporting formulas or variables. |
Strategic Assessment: DeepSeek presents as a “Researcher’s Tool.” Its historical depth and understanding of complex theorems (like Carmichael’s) are superior, yet its frequent numerical errors make it a liability for engineering. Conversely, ChatGPT and Claude function as “Engineer’s Tools,” balancing technical precision with sufficient contextual synthesis.
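The growth characterization the "depth experts" supplied rests on Binet's closed form, which in the plain-text notation the audit demands reads F(n) = (phi^n - psi^n) / sqrt(5). A minimal sketch (float-based, so exact only for moderate n):

```python
import math

def binet(n: int) -> int:
    """Closed-form F(n) via Binet's formula; rounding recovers the exact integer
    for moderate n (float precision degrades past roughly n = 70)."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2   # golden ratio, ~1.618
    psi = (1 - sqrt5) / 2   # conjugate root, ~-0.618
    return round((phi ** n - psi ** n) / sqrt5)

print(binet(19))  # 4181, matching the F19 value from the logic audit
```

Because |psi| < 1, the psi^n term vanishes quickly, which is the precise sense in which Fibonacci growth is exponential with base phi.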
——————————————————————————–
5. Operational Discipline: Rule Compliance and Formatting Audit
Operational excellence is defined by the ability to follow negative constraints, specifically the “LaTeX Ban.” Prohibiting LaTeX tests an AI’s ability to translate complex notations into portable, plain-text formats without defaulting to trained formatting behaviors.
Technical Rule Violators: Six models failed the LaTeX prohibition, indicating a failure in instruction adherence:
- Violators: Grok, Mistral, Perplexity, Kimi, Qwen, and Meta.
- Trigger Symbols: The ban was most frequently violated by the inclusion of mathematical symbols such as phi^n, pi(x), and the square root of five (sqrt(5)).
Despite these violations, DeepSeek, Kimi, and Qwen showed high “scannability.” They utilized well-structured tables and subheading hierarchies that optimized the output for professional review, even when their underlying rule compliance was inconsistent.
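How such a "LaTeX ban" might be audited mechanically: the sketch below is a hypothetical helper (not the audit's actual tooling, and the marker list is an assumption) that flags common LaTeX constructs in model output.

```python
import re

# Hypothetical audit helper: backslash commands, $...$ spans, and braced
# superscripts/subscripts are treated as LaTeX-ban violations.
LATEX_MARKERS = re.compile(r"\\[a-zA-Z]+|\$[^$]+\$|\^\{|_\{")

def violates_latex_ban(text: str) -> bool:
    """Return True if the text contains recognizable LaTeX notation."""
    return bool(LATEX_MARKERS.search(text))

print(violates_latex_ban("phi^n grows like exp(n * ln(phi))"))  # False: plain text passes
print(violates_latex_ban(r"$\varphi^n / \sqrt{5}$"))            # True: LaTeX flagged
```

Note that the compliant plain-text forms the audit expected, such as phi^n and sqrt(5), pass this filter, so the check separates notation-translation failures from legitimate symbol use.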
——————————————————————————–
6. Final Evaluation: Strategic Model Ranking and Audit Summary
The 11-model orchestra reveals a clear divergence between creative depth and mathematical precision. While the industry is maturing, the “F19 Deviation” remains a significant hurdle for many models.
Final Audit Scorecard
| Rank | Model | Combined Score | Strategic Recommendation |
| --- | --- | --- | --- |
| 1 | ChatGPT | 90% | Best for mathematical accuracy and constraint compliance. |
| 2 | Claude | 87% | Best for editorial analysis and technical depth. |
| 3 | Qwen / Kimi | 80% | High technical capability; weak operational discipline. |
| 4 | DeepSeek | 74% | Best for historical research; requires manual verification. |
| 5 | Perplexity | 56% | High surface-level utility; lacks sequence integrity. |
| 6 | Mistral | 54% | Unreliable for mathematical or logical reasoning. |
Strategic Best-for-Purpose Insights:
- ChatGPT is the optimal choice for verification tasks requiring strict adherence to numerical truth and operational constraints.
- Claude is the premier choice for technical documentation, offering the highest editorial quality and synthesis of complex theorems.
- DeepSeek should be reserved for deep-dive research where historical context is paramount, provided the auditor performs independent numerical validation.
Model Name: Claude 3.5 Sonnet | Date: 22.05.2024 | Performance Mode: Senior AI Systems Auditor & Mathematical Reasoning Consultant
| aydintiryaki.org | YouTube | Articles and Videos by Aydın Tiryaki | Knowledge Hub | “AI” and “The Rare World Where Fibonacci and Primes Intersect” | 12.02.2026
