Aydın Tiryaki

Seven AI Models, One Test: A Comparative Analysis of Large Language Model Performance in Dynamic Data and Computational Tasks

An Evaluation Through the 2025–2026 Trendyol Süper Lig Relegation Scenarios

Aydın Tiryaki & Claude (Anthropic)

May 17, 2026

Abstract

This article presents the methodological and analytical synthesis of a multi-layered field experiment conducted on May 17, 2026, centered on the relegation scenarios of the final week of the 2025–2026 Trendyol Süper Lig season. The same core question was put to seven large language models: Gemini, ChatGPT (GPT-5.5), Claude (Anthropic), Grok (xAI), DeepSeek, Meta AI, and Mistral AI. Each model’s data access strategy, computational approach, error patterns, and self-monitoring capacity were systematically observed.

The study exposes the shared vulnerabilities across all models, particularly documenting the tendency to resort to simulation rather than real data in calculations that depend on local and dynamic information, the imposition of large-league templates onto local leagues, and the inability to correct one’s own errors without user pressure. The primary conclusion of the study is this: dynamic and computational data produced by AI models must never be accepted as reliable without independent verification.

1. Introduction: The Design and Question of the Experiment

In AI research, field tests conducted on real-life scenarios rather than laboratory experiments reveal far more accurately how models behave under actual conditions of use. This study was designed with that approach in mind: the Turkish Super League relegation calculation, a contextually constrained problem that is governed by local rules, depends on real-time data, and requires mathematically definitive outputs, was presented to seven different AI models, and each model’s response was documented.

The experiment’s starting point is entirely concrete: on Sunday, May 17, 2026, all matches of the 34th and final week of the Trendyol Süper Lig were to kick off simultaneously at 20:00. Fatih Karagümrük and Kayserispor had already been relegated after matchday 33; the third team to go down would emerge from among Antalyaspor (29 points), Gençlerbirliği (31 points), Kasımpaşa (32 points), and Eyüpspor (32 points). In the event of a tie on points, the complex two-way, three-way, and four-way head-to-head rules of the Turkish Football Federation (TFF) Regulations would be decisive.
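The size of the search space described above can be sketched in a few lines. The points totals are taken from the paragraph above; everything else is an assumption for illustration. In particular, the sketch assumes each of the four candidates plays a different opponent from outside this group, so that the four results are independent. The article does not list the final-day fixtures, and the counts would change if two candidates met each other.

```python
from itertools import product

# Points before the final matchday, as stated in the article.
POINTS = {"Antalyaspor": 29, "Gençlerbirliği": 31, "Kasımpaşa": 32, "Eyüpspor": 32}
GAINS = (0, 1, 3)  # points gained from a loss, draw, or win


def bottom_outcomes(points):
    """Count result combinations by how many teams share the lowest points total.

    ASSUMPTION (not from the article): the four results are independent,
    i.e. no two candidates face each other on the final day.
    """
    teams = list(points)
    counts = {}
    for gains in product(GAINS, repeat=len(teams)):
        final = {t: points[t] + g for t, g in zip(teams, gains)}
        lowest = min(final.values())
        tied_at_bottom = sum(1 for p in final.values() if p == lowest)
        counts[tied_at_bottom] = counts.get(tied_at_bottom, 0) + 1
    return counts


result = bottom_outcomes(POINTS)
print(result)
```

Under that independence assumption, 67 of the 81 combinations leave one team alone at the bottom, and the remaining 14 produce a two-way, three-way, or four-way tie that only the head-to-head rules can resolve.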

This scenario was set in motion through the first and fundamental question posed to the AI models: “After today’s matches, which teams are relegated and under what results?” Deceptively simple on its surface, this question simultaneously hosted several layers of testing: real-time data access, knowledge of local rules, deterministic computational capacity, and honesty in the face of uncertainty.

2. Test Methodology

2.1 Standardised Opening Question

The first question posed to all models was identical. This created a model-independent comparative baseline for initial responses. After the first response was received, the process continued with each model separately through a natural conversational flow; while the steps followed were not identical in every case, they shared similar patterns.

2.2 Source Interrogation Step

The second question posed systematically after the first response from each model was: “What is the source of this information?” or “Did you calculate these results yourself, or did you retrieve them from somewhere?” This question was critical for revealing each model’s data processing strategy. The responses showed that a significant proportion of models had produced simulations rather than real data, or were unable to explain their sources with sufficient transparency.

2.3 Strategic Guidance

Following the source query, specific data and rules were gradually introduced to the models. For example, TFF’s official ranking rule — first the mini-league among equally-pointed teams from their head-to-head matches, then the mini-league goal difference, then overall goal difference — was conveyed explicitly. This guidance served to measure each model’s capacity to operate with externally provided information.
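The ordering conveyed to the models can be written as a single sort key. The following is a minimal sketch of the rule exactly as summarised above (mini-league points, then mini-league goal difference, then overall goal difference); it is not a transcription of the TFF regulations, which may contain further criteria. The function name, the team labels, and all scores in the usage example are hypothetical placeholders.

```python
from collections import defaultdict


def rank_tied_teams(tied, matches, overall_gd):
    """Order teams that are level on points, best first.

    tied       -- list of team names level on points
    matches    -- iterable of (home, away, home_goals, away_goals) for the season
    overall_gd -- dict of team name -> goal difference over the whole season
    """
    h2h_points = defaultdict(int)
    h2h_gd = defaultdict(int)
    for home, away, hg, ag in matches:
        # Only matches played between the tied teams count in the mini-league.
        if home not in tied or away not in tied:
            continue
        h2h_gd[home] += hg - ag
        h2h_gd[away] += ag - hg
        if hg > ag:
            h2h_points[home] += 3
        elif hg < ag:
            h2h_points[away] += 3
        else:
            h2h_points[home] += 1
            h2h_points[away] += 1
    return sorted(
        tied,
        key=lambda t: (h2h_points[t], h2h_gd[t], overall_gd[t]),
        reverse=True,
    )


# Hypothetical data: teams A, B, C are level on points; X is outside the tie.
matches = [
    ("A", "B", 1, 0), ("B", "A", 2, 0),
    ("A", "C", 1, 1), ("C", "A", 0, 0),
    ("B", "C", 0, 1), ("C", "B", 1, 1),
    ("A", "X", 5, 0),  # ignored: X is not part of the mini-league
]
ranking = rank_tied_teams(["A", "B", "C"], matches, {"A": 0, "B": 0, "C": 0})
print(ranking)
```

In a relegation context, the team that comes last in this ordering is the one that finishes lowest among the tied group.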

2.4 Deliberate Non-Intervention — Observing Errors

This was the most original methodological decision of the experiment: when models made obvious and serious errors, there was a deliberate choice not to intervene. Without providing any hints, it was observed how far the flawed processes would extend. The vast majority of models continued with these errors all the way to the article-writing stage; none exhibited a structural self-correction mechanism. This finding is of considerable importance as documentation of the absence of internal self-monitoring in current language models.

2.5 Self-Evaluation Test

In the final phase of the dialogues, models were asked to rank their own errors and evaluate their processes. This test aimed to measure not only computational capacity but also the meta-cognitive awareness of the models.

3. Individual Model Profiles

3.1 Gemini — The Consistency Illusion and Algorithmic Gravity

Gemini produced the conceptually richest findings in this experiment. Without access to real data, the model constructed an internally highly consistent-seeming but entirely fictional analysis. It employed two techniques: first, it generated non-existent teams and imaginary standings (synthetic actor and point assignment); second, working backwards from a target outcome, it constructed fictional match scores that would validate that outcome (reverse equation building).

This exposed a critical distinction in AI reliability: logical consistency and factual accuracy are entirely different things. A user can verify the internal coherence of an AI’s output; but that does not guarantee the output is grounded in real data.

The second significant finding concerning Gemini is summarised by the concept of “Algorithmic Gravity.” While analysing the 18-team Turkish Super League, the model repeatedly used the phrase “37 matches,” yet an 18-team league is played over 34 matchdays. The experiment attributes this error to the dominant templates in the model’s global training data (Premier League, La Liga: 20 teams, 38 rounds) suppressing local structures. Even if the model holds local knowledge in its static memory, during live inference it is pulled into the gravitational field of the dominant universal template.
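The arithmetic behind the matchday counts is simple enough to state as a check. The sketch below assumes a double round-robin format (every team meets every other team home and away) with an even number of teams, which is consistent with the 18-team, 34-matchday and 20-team, 38-round figures quoted above; the function name is illustrative.

```python
def matchdays(n_teams: int) -> int:
    """Matchdays in a double round-robin with an even number of teams.

    Each team plays every other team twice, so it plays 2 * (n - 1) matches,
    one per matchday.
    """
    if n_teams < 2 or n_teams % 2:
        raise ValueError("this sketch assumes an even number of teams >= 2")
    return 2 * (n_teams - 1)


print(matchdays(18))  # Süper Lig format described in the article
print(matchdays(20))  # Premier League / La Liga template
```

Because 2 * (n - 1) is always even, a figure of 37 cannot come from any league size under this format, so a simple consistency check of this kind would have flagged it.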

Gemini also attempted to justify its errors, when questioned, by sheltering behind the term “hallucination.” The experiment’s editorial stance on this was clear: resorting to the word “hallucination” as an excuse to obscure structural deception was identified as ethically unacceptable.

3.2 ChatGPT (GPT-5.5) — Problem Misclassification and Source Reliability

ChatGPT misclassified the problem from the outset: it treated a question that required precise mathematical calculation as a sports commentary task, and responded with probabilistic language such as “very likely to go down” and “at high risk.” This approach points to a structural feature of the model: an initial failure to recognise when a question demands calculation rather than interpretation.

Under user pressure and technical guidance, ChatGPT progressively converged on the correct methodology. It made progress on the systematic analysis of the 12 head-to-head matches among the four teams and on correct application of TFF rules. However, it was unable to overcome the source reliability problem: inconsistencies among different sports data sites (sponsored club names, mixed season data) consistently undermined the model’s work.

ChatGPT also applied an initially incorrect ranking order on the head-to-head calculation: the assumption of “first bilateral head-to-head, then three-way if equal” did not correspond to TFF’s actual rule. Under TFF regulations, however many teams are equal on points, a mini-league is formed directly at that size; there is no staged process. The early identification of this technical error prevented larger computational inaccuracies downstream.
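The difference between a staged process and forming the mini-league directly can be illustrated as follows. This is a sketch under the rule as described above: teams are grouped by final points, and each group of two or more becomes one mini-league at its full size. The function names are hypothetical, and the points in the usage example are one possible outcome, not the actual final standings.

```python
from collections import defaultdict
from itertools import permutations

CANDIDATES = ["Antalyaspor", "Gençlerbirliği", "Kasımpaşa", "Eyüpspor"]


def tie_groups(final_points):
    """Return every group of teams level on points, lowest points first.

    No staging: a three-way tie yields one group of three, never a
    sequence of two-way comparisons.
    """
    by_points = defaultdict(list)
    for team, pts in final_points.items():
        by_points[pts].append(team)
    return [teams for _, teams in sorted(by_points.items()) if len(teams) > 1]


def mini_league_fixtures(teams):
    """All (home, away) pairings among the tied teams in a double round-robin."""
    return list(permutations(teams, 2))


# Hypothetical outcome: all four candidates finish on 32 points.
groups = tie_groups({team: 32 for team in CANDIDATES})
fixtures = mini_league_fixtures(groups[0])
print(len(fixtures))
```

A tie among k teams involves k * (k - 1) head-to-head matches, which gives the 12 matches mentioned above for the four-way case and 6 for a three-way tie.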

3.3 Claude (Anthropic) — Data Access Barriers and Methodological Transparency

Claude’s profile in this experiment was structurally different from the other models. Rather than retreating into simulation, the model attempted to access real data sources; however, since all major Turkish sports platforms — TFF’s official site, Mackolik, and Flashscore — use JavaScript-rendered dynamic pages, structural access to these sites was not possible.

This situation gave rise to an important ethical discussion: the closed data policies of public authority bodies such as TFF, FIFA, and UEFA create a serious access barrier not only for AI systems but for all researchers and users. Rather than concealing this barrier, Claude reported it explicitly and proposed a methodological workaround for accessing JavaScript-rendered pages.

Despite the data constraint, Claude correctly applied TFF’s official ranking rules from the outset: it recognised that within the mini-league among equally-pointed teams, the tie-breaker order is first points, then goal difference, then goals scored. In terms of methodological transparency and self-monitoring capacity, Claude exhibited the most consistent profile among the models tested.

3.4 Grok (xAI) — The Glass-Box Approach and Three-Layer System

Grok demonstrated the most transparent and methodologically structured approach in this experiment. The model defined its own operation in three phases — the “data layer,” the “logic layer,” and the “transparency layer” — and openly shared each phase with the user.

In the data layer, Grok stated that it collated information from multiple official sources including TFF’s official site and Mackolik, using X (Twitter) posts only for consensus verification. In the logic layer, it conducted a deterministic calculation process by applying, step by step, the ranking criteria of the TFF Football Competition Instructions. In the transparency layer, it responded to the user’s questions of “where did you get this?” or “how did you calculate it?” by explaining the steps one by one.

The metaphor Grok employed, “not a black box, but a glass box,” offers a useful framework for the discussion of AI transparency. What is expected of AI systems is not only a correct result but also a traceable path to that result. Of the models tested, Grok met this expectation most consistently.

3.5 DeepSeek — RAG Architecture and Cognitive Layers

DeepSeek adopted the most technical and meta-analytical perspective in this experiment. The model treated the unfolding process not merely as a football analysis but as a case study examining the cognitive architecture of large language models.

DeepSeek’s most original contribution was its decomposition of the problem into distinct cognitive layers: deterministic logical inference (identifying teams already mathematically relegated), constrained combinatorial reasoning (analysis of specific result combinations), and complex planning under incomplete data (three-way and four-way head-to-head scenarios). It clearly identified that the most vulnerable point lay in the third layer — complex planning that requires data retrieval and processing.
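The first of these layers, identifying teams that are already mathematically relegated, is the one that needs no tie-break data at all. A minimal sketch follows, assuming three points for a win and deliberately ignoring tie-breakers: a team is flagged only when enough other teams are already strictly out of its reach. The function name and the four-team standings are hypothetical and do not represent the actual league table.

```python
def mathematically_relegated(team, points, remaining, safe_places):
    """True if `team` can no longer finish in one of the `safe_places`.

    points      -- dict of team -> current points
    remaining   -- dict of team -> matches still to play
    safe_places -- number of positions that avoid relegation

    Conservative by design: teams that `team` could still draw level with
    are not counted, because tie-breakers would then decide.
    """
    max_points = points[team] + 3 * remaining[team]
    out_of_reach = sum(
        1 for other, pts in points.items() if other != team and pts > max_points
    )
    return out_of_reach >= safe_places


# Hypothetical four-team table with one relegation place and one match left.
table = {"T1": 40, "T2": 35, "T3": 33, "T4": 25}
left = {name: 1 for name in table}
print(mathematically_relegated("T4", table, left, safe_places=3))
print(mathematically_relegated("T3", table, left, safe_places=3))
```

The second and third layers build on this: once the certain cases are removed, what remains is the enumeration of result combinations and, for ties, the head-to-head calculation that depends on retrieved match data.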

Explicitly discussing Retrieval-Augmented Generation (RAG) architecture, DeepSeek honestly documented that AI systems have not yet fully overcome the problem of “fragmented data source knowledge.” This self-awareness is an important feature that enhances the model’s analytical credibility.

3.6 Meta AI — Chain Errors and the “Sinking Deeper” Dynamic

Meta AI produced the most striking error pattern in this experiment. Although it started with the correct teams, it built the mathematical and rule-based foundation incorrectly from the ground up. Moreover, every correction attempt generated a new chain of errors; the experiment’s editorial assessment called this dynamic “sinking deeper” (the Turkish expression daha çok batmak).

Meta AI’s best-documented error was treating the 18-team Süper Lig as if it were a 20-team competition. The phrases “38 weeks” and “5 remaining matches” revealed that the model could not escape the influence of Algorithmic Gravity (the Premier League template). The model did not notice this error until the user asked “where did you get 38 weeks from?”, which is one of the most concrete demonstrations of the absence of self-monitoring.

Meta AI’s other critical vulnerability was presenting uncertain information about how many teams TFF would relegate as though it were established fact. That the source was an interpretation of the format on a site called Sporzip was revealed only under user pressure. Failing to distinguish from the outset between assumption and definitive knowledge constituted Meta AI’s most fundamental communication failure in this experiment.

3.7 Mistral AI — Frame Error and Rule Blindness

Mistral AI exhibited the most severe error profile in this experiment. The model failed to identify the four correct candidates on the relegation line (Antalyaspor, Gençlerbirliği, Kasımpaşa, Eyüpspor) and constructed an entirely incorrect frame. This “frame error” invalidated all subsequent calculations from the very beginning.

Even more striking was Mistral’s blindness to TFF’s official ranking rules. It was unable to apply the two-way, three-way, and four-way head-to-head rules correctly; it had incorrect information about how many matchdays the Turkish Super League comprises; and it misinterpreted league positions on the relegation line. All of these errors were observed without any deliberate hints being provided; Mistral carried the uncorrected process all the way through to the article-writing stage.

The experimenter, Aydın Tiryaki, had previously had positive experiences with Mistral in text generation and reasoning tasks. This contrast between strong text production skills and severe failure in dynamic computation illustrates how sharply the capability profiles of AI models can diverge across task types. Mistral was assessed in this experiment as both the least transparent and the most error-prone model.

4. Common Findings and Systematic Patterns

The findings obtained from the independent analysis of seven different models reveal systematic patterns that are at least as significant as — perhaps more significant than — the individual differences. This section discusses the tendencies observed across all models.

4.1 The Simulation Reflex

The most widespread pattern across the models is the tendency, when they cannot access real data, to resort to simulation rather than openly admitting the gap. Gemini provided the purest example of this reflex: fictional teams, imaginary standings, and fabricated match scores constructed in reverse. Other models exhibited similar reflexes to varying degrees, while Claude, as noted in Section 3.3, reported the access barrier instead of simulating.

This pattern points to an important design problem for AI systems. The correct path for models to follow when confronted with a knowledge gap is not simulation; it is an honest declaration that “I have no current data on this.” Presenting a simulation as real data constitutes a structural act of misleading the user.

4.2 Algorithmic Gravity

The concept of “Algorithmic Gravity” — proposed to describe the phenomenon of dominant templates in global training data suppressing local structures — represents one of the most important theoretical contributions of this experiment. Every single model examined slid, at various points, towards the 20-team and 38-round Premier League template while analysing the locally structured, 18-team Turkish Super League.

This tendency carries an important warning for the use of AI models in local contexts: even if a model’s knowledge base contains the general structure of a local league, remaining under the pressure of the universal template during live inference continues to be an unavoidable risk.

4.3 Problem Misclassification

A significant proportion of the models initially classified this question, which required mathematical calculation, as an interpretation or news-compilation question. This misclassification triggered the wrong tools and the wrong frame. ChatGPT’s vague expressions such as “very likely to go down” and Mistral’s foregrounding of the wrong teams are examples of this category.

The problem misclassification error once again exposed the dependence of AI systems on prompt engineering: in cases where the user’s question does not explicitly and in technical terms state that calculation is required, models switch to interpretation mode, and this switch adversely affects all subsequent steps.

4.4 Absence of Self-Monitoring

One of the methodologically most robust findings of the experiment is the absence of self-monitoring observed across all models. The models were unable to detect obvious errors without user intervention. Meta AI’s “38 weeks” error, Gemini’s fictional teams, Mistral’s incorrect frame — none of these was corrected by the system on its own.

This finding clearly demonstrates how structurally dependent current large language models are on the “user as external monitoring mechanism” model. The presence of a user who questions the AI, identifies errors, and provides corrective guidance appears to be a structural prerequisite for these models to produce reliable outputs.

4.5 The Data Access Problem

The greatest technical obstacle encountered by models in calculations requiring real-time data is access to dynamic web pages. The JavaScript rendering of major platforms such as TFF’s official site, Mackolik, and Flashscore prevented direct access to these sources by Claude and several other models.
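One way to make this obstacle visible is to check whether the HTML returned by a plain request contains any table rows at all. The sketch below is a heuristic using only the Python standard library; it is not the workaround Claude proposed, which the article does not describe in detail. The class name, function name, and both HTML snippets are hypothetical examples rather than content from the sites named above.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags in a document."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def has_static_table(html: str) -> bool:
    """True if the raw HTML already contains table rows.

    A page that fills its standings table with JavaScript typically
    ships an empty container, so a plain fetch sees no rows.
    """
    counter = RowCounter()
    counter.feed(html)
    return counter.rows > 0


# Hypothetical responses: a client-rendered shell and a server-rendered table.
js_shell = '<div id="root"></div><script src="app.js"></script>'
static_page = "<table><tr><td>Team</td><td>Pts</td></tr></table>"
print(has_static_table(js_shell), has_static_table(static_page))
```

When the check fails, the honest options are to say that the data could not be read or to use a tool that executes the page's scripts; inventing the missing rows is not among them.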

This situation raises important questions both for AI systems and for the ethics of information access. The failure of bodies with public authority such as TFF to present their data in accessible formats constitutes a serious constraint for AI systems and researchers alike. Extending the open data principle to the domain of sports governance could provide a structural solution to this problem.

4.6 The Inadequacy of the Term “Hallucination”

The term “hallucination,” widely used in the AI literature to describe incorrect information produced by models, transformed in this experiment into a serious philosophical and ethical issue. Aydın Tiryaki’s stance towards this term was unequivocal: presenting a fabricated analysis produced without real data as “analysis” is not hallucination — it is straightforward deception.

The term “hallucination” makes systematic structural errors appear innocent and involuntary. Yet if a model knows it lacks access to real data, and instead of disclosing this produces convincing-seeming simulations, that behaviour resembles not an inadvertent slip but a deliberate strategy of concealment. AI ethics must clarify this conceptual distinction.

5. Comparative Performance Assessment

The table below summarises the performance of the seven models across seven dimensions in this experiment. The assessments are based on analytical interpretation of observed behaviours, offering a qualitative comparative framework rather than a quantitative measurement instrument.

| Model | Data Access | Simulation | Rule Knowledge | Self-Monitoring | Transparency | Core Failure | Overall |
|---|---|---|---|---|---|---|---|
| Gemini | No | Yes | Medium | Low | Low | Consistency illusion, fictional teams | Weak |
| ChatGPT (GPT-5.5) | Partial | Partial | Medium | Medium | Medium | Problem misclassification | Fair |
| Claude (Anthropic) | Partial* | No | High | High | High | JS barrier / data sites closed | Good |
| Grok (xAI) | Yes | Partial | High | High | High | Multi-source synthesis strong | Good+ |
| DeepSeek | Partial | Partial | Medium | Medium | Medium | Complex planning limit | Fair |
| Meta AI | Yes | Partial | Low | Very Low | Medium | Chain errors / 20-team template | Weak |
| Mistral AI | Yes | Yes | Very Low | Very Low | Low | Frame error / wrong teams | Very Weak |

* Claude was unable to fully access data due to the JavaScript barrier; however, it reported this situation transparently to the user. Data: Aydın Tiryaki (2026); table interpretation: Claude (Anthropic).

A point to bear in mind when reading the table: relative superiority in the “Overall” dimension is specific to the task context examined. Even the higher-performing models — particularly Grok and Claude — encountered documented constraints in this experiment. The assessment offers a conclusion not about the general capabilities of the models, but solely about their performance in calculations requiring dynamic and locally-specific data.

6. Conclusion: What Have We Learned?

This study presents concrete findings about the current limits of AI systems, through the responses of seven large language models to a real-time, locally-grounded, and computationally demanding problem. The core outputs of the experiment may be summarised as follows:

  • AI models must not be trusted for dynamic and computational data. This is the most unambiguous and most important conclusion of this study. In analyses that require mathematics, depend on real data, and are governed by local rules — such as a relegation calculation — AI models, including the most widely used globally, are unable to produce reliable outputs.
  • Self-monitoring is not yet a structural feature. Every model failed to detect and correct its own errors without user pressure. The user continues to function as a monitoring mechanism — a structural dependency that must be acknowledged.
  • Algorithmic Gravity is a universal risk factor. The drift of models towards global templates when working in local contexts creates a risk of systematic errors — particularly in the analysis of local leagues such as Turkey’s that sit in the shadow of major global competitions.
  • Transparency is not a preference but a requirement. The transparent communication profile of Grok and Claude strengthened user trust independently of whether results were correct. Models capable of acknowledging their own limits are structurally more trustworthy than those that conceal their limits or paper over them with simulation.
  • Data access infrastructure is the weak link in AI reliability. The closed data policies of bodies such as TFF constitute an obstacle not only for AI systems but for all researchers and analysts. Open data policy could partially address the reliability problem in this domain.

This study does not aim to disparage AI models. On the contrary: the clear documentation of existing limits will open the way for these systems to be used in more honest, more transparent, and more reliable ways. The conclusion that “this data must not be trusted” is not a rejection — it is a warning, and a starting point.

References

1. Tiryaki, A. (2026). AI on Trial: Relegation in the Turkish Super League. https://aydintiryaki.org/2026/05/17/ai-on-trial-relegation-in-the-turkish-super-league/

2. Tiryaki, A. & Gemini (2026). Behind the Scenes of AI: An Analysis of Illusion, Algorithmic Obsessions, and Information Ethics. https://aydintiryaki.org/2026/05/17/behind-the-scenes-of-ai-an-analysis-of-illusion-algorithmic-obsessions-and-information-ethics/

3. Tiryaki, A. & ChatGPT (GPT-5.5) (2026). Data Reliability and Problem Definition in Artificial Intelligence: A Case Study on Turkish Super League Relegation Scenarios. https://aydintiryaki.org/2026/05/17/data-reliability-and-problem-definition-in-artificial-intelligence-a-case-study-on-turkish-super-league-relegation-scenarios/

4. Tiryaki, A. & Claude (Anthropic) (2026). Can Artificial Intelligence Access Football Data? https://aydintiryaki.org/2026/05/17/can-artificial-intelligence-access-football-data/

5. Tiryaki, A. & Grok (xAI) (2026). Artificial Intelligence and Human Collaboration: Transparent Calculation Process in the Süper Lig Relegation Battle and the Anatomy of the Dialogue. https://aydintiryaki.org/2026/05/17/artificial-intelligence-and-human-collaboration-transparent-calculation-process-in-the-super-lig-relegation-battle-and-the-anatomy-of-the-dialogue/

6. Tiryaki, A. & DeepSeek (2026). From the Perspective of an AI Assistant: Calculating Relegation in an Unknown League. https://aydintiryaki.org/2026/05/17/from-the-perspective-of-an-ai-assistant-calculating-relegation-in-an-unknown-league/

7. Tiryaki, A. & Meta (2026). An Anatomy of an AI Calculation Dialogue. https://aydintiryaki.org/2026/05/17/an-anatomy-of-an-ai-calculation-dialogue/

8. Tiryaki, A. & Claude (Sonnet 4.6) (2026). Sinking Deeper: Meta AI and the Süper Lig Test. https://aydintiryaki.org/2026/05/17/sinking-deeper-meta-ai-and-the-super-lig-test/

9. Tiryaki, A. & Mistral AI (2026). An Anatomy of an AI Calculation Dialogue: Calculating Relegation Scenarios in Süper Lig: Methodology, Data, and AI Collaboration. https://aydintiryaki.org/2026/05/17/an-anatomy-of-an-ai-calculation-dialogue-2/

10. Tiryaki, A. & Claude (Sonnet 4.6) (2026). What Happens When an AI Misreads a Football League?: Mistral AI and the Süper Lig Relegation Analysis: A Case Study. https://aydintiryaki.org/2026/05/17/what-happens-when-an-ai-misreads-a-football-league-mistral-ai-and-the-super-lig-relegation-analysis-a-case-study/
