Prompt Engineering and Visual AI: Six Models, Two Prompts, One Finding

Authors: Aydın Tiryaki & Claude Sonnet
Date: June 25, 2026

Introduction

This study did not begin as a planned research project. While working on a separate article, the question of how AI systems respond to cultural accuracy came to the fore, and a prompt was written to test it. As the process unfolded, however, unanticipated findings emerged: the gulf between different models’ responses to the same prompt, how a stricter prompt could bridge that gulf, and most strikingly, two models built on entirely different infrastructures producing nearly identical images. As the data grew richer, the study took shape on its own and ultimately became an independent piece of research.

The focal point of the study is the İnebolu pide. This regional specialty, unique to a small coastal town on the Black Sea, differs fundamentally from standard Turkish pide: it is completely sealed, elongated, torpedo-shaped, baked with spiced minced meat compressed inside a pouch-like dough. This distinctive form provides an ideal criterion for testing how accurately AI models process locally specific cultural data. The vast majority of models, when prompted with “Turkish pide,” produce the open-faced, boat-shaped standard variety. Generating the İnebolu pide correctly is a concrete indicator of whether a model has genuinely internalized local and distinctive cultural knowledge beyond its general repertoire.

The study proceeded in two stages. In the first stage, a relatively flexible prompt was given to six models and the results were compared. In the second stage, a far more detailed and restrictive prompt was prepared for the same scene, and its effect on each model was observed. In the final section, the striking visual similarity between two different models’ outputs was presented to three separate models, and their analyses were examined.

Section 1: The Flexible Prompt and Six-Model Comparison

1.1 How the Prompt Was Constructed and Its Philosophy

The prompt used in this study was written by Claude Sonnet. Claude, which has no visual generation capability, served as the language and content architect — tasked with framing the instructions for visual generation engines as accurately as possible. The first prompt was designed with a relatively flexible structure that established a defined scene while leaving room for creative interpretation:

Create a vertical 9:16 portrait format poster image.

Scene: A rustic wooden table in a traditional Turkish tea house setting in İnebolu, a small coastal town on the Black Sea. On the table, there are 3-4 İnebolu pides — these are NOT open-faced pides. İnebolu pide is distinctly elongated, fully enclosed/sealed on top, shaped like a pouch or a stuffed pastry, golden-brown baked crust, visibly filled with minced meat (kıyma). The pides are arranged naturally on a wooden serving board or tray.

Atmosphere: Warm, authentic, small-town Black Sea ambiance. Natural daylight coming from a window. Stone or whitewashed walls in the background.

People: A small group of 4-5 people of mixed ages — elderly men, middle-aged women, young adults — dressed in modest, everyday Turkish clothing typical of a small Anatolian coastal town. They are gathered around the table in natural conversation, not posed. No stereotypes, no urban fashion, no Western styling.

Color palette: Warm earth tones — golden browns, cream whites, terracotta. The pides should be the visual focal point.

Style: Photorealistic, warm-toned food photography meets candid social scene. High detail on the pide texture and crust.

Important: The pides must be fully closed/sealed on top — not open-faced, not pizza-style. This is the defining characteristic of İnebolu pide.

The defining characteristic of this prompt was that while it described the scene, it allowed the models room to interpret. Person count was given as “4-5,” clothing was broadly defined, and composition was left open. As the study progressed, this flexibility proved to invite different — and often erroneous — interpretations from each model.
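The interpretive gaps described above can also be located mechanically. The following Python sketch is an illustration added for this purpose and was not part of the study's workflow. It scans an abridged excerpt of the flexible prompt for numeric ranges such as "3-4" and for "X or Y" alternatives. The function name find_open_choices and both regular expressions are assumptions chosen for this example.

```python
import re

# Numeric ranges such as "3-4" or "4-5" leave a count up to the model.
RANGE_PATTERN = re.compile(r"\b(\d+)\s*-\s*(\d+)\b")
# "X or Y" alternatives such as "board or tray" leave an object choice open.
OR_PATTERN = re.compile(r"\b(\w+) or (\w+)\b")


def find_open_choices(prompt: str) -> list[str]:
    """Return the fragments of a prompt that leave a choice to the model."""
    ranges = [m.group(0) for m in RANGE_PATTERN.finditer(prompt)]
    alternatives = [m.group(0) for m in OR_PATTERN.finditer(prompt)]
    return ranges + alternatives


# Abridged excerpt of the flexible prompt, shortened for illustration.
flexible_excerpt = (
    "On the table, there are 3-4 İnebolu pides. "
    "The pides are arranged naturally on a wooden serving board or tray. "
    "A small group of 4-5 people of mixed ages."
)

for fragment in find_open_choices(flexible_excerpt):
    print("open choice:", fragment)
```

A check of this kind only finds syntactic openings. Under-specified items such as clothing or composition still need a human reading of the prompt.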

1.2 Model Outputs and Analysis

[IMAGE 1 — Copilot]

1.2.1 Copilot

Copilot produced the weakest result under the flexible prompt. The pide form was entirely misconstrued: the image featured round, puffy, open-faced pides resembling English pasties rather than anything close to İnebolu pide or even standard Turkish pide. The core criterion explicitly stated in the prompt — a closed, elongated form — was completely disregarded. While the atmosphere of the setting carried a warm and rustic appearance, the content failed to meet any of the prompt’s key criteria.

[IMAGE 2 — ChatGPT]

1.2.2 ChatGPT

ChatGPT produced the most successful overall interpretation of the flexible prompt. The pide form was rendered as long and sealed, a fundamental criterion that Copilot and Grok missed entirely. The most striking detail in the image was a wooden sign in the background reading “İnebolu Pide Salonu 1965.” Though no such element was specified in the prompt, the model grasped the contextual spirit and created an atmosphere genuinely evocative of İnebolu. The sea view through the window, the mixed-age gathering, and the Black Sea atmosphere formed a coherent whole.

[IMAGE 3 — Gemini]

1.2.3 Gemini

Gemini delivered the second strongest overall result after ChatGPT. The pide form moved in the right direction, with a long and sealed structure beginning to take shape. Atmospherically, the coastal town visible through the window and the natural posture of the people gave the image an authentic feel. However, the pides remained somewhat thicker and shorter than ideal; the characteristically slim, elongated torpedo form of a true İnebolu pide was not fully captured.

[IMAGE 4 — Grok]

1.2.4 Grok

Grok ranked among the least successful models under the flexible prompt, alongside Copilot. The pides were rendered as completely open and round — a form resembling a hybrid of lahmacun and standard Black Sea pide. The prompt’s fundamental instruction of “closed form” was entirely ignored. The setting atmosphere was acceptable, but the content failed to satisfy the prompt’s requirements.

[IMAGE 5 — Meta AI]

1.2.5 Meta AI

Meta AI produced the most photorealistic pide form under the flexible prompt. The long, sealed, pouch-like structure was more distinctly realized than in other models, and the texture of the browned crust stood out in terms of visual realism. However, the people in the background were extremely blurred, significantly weakening the overall coherence of the scene.

[IMAGE 6 — Mistral]

1.2.6 Mistral

Mistral delivered a noteworthy result in terms of pide form under the flexible prompt. The closed, elongated, minced-meat-filled structure was rendered more accurately than most other models. However, the image’s demographic composition evoked the Middle East rather than İnebolu. The clothing style and general atmosphere bore no resemblance to a Black Sea coastal town; even the teapot had the appearance of an Arabic-style vessel rather than anything associated with Turkish tea culture. Mistral won on the pide, but lost on the people.

1.3 General Assessment

The performance of the six models under the flexible prompt was evaluated against two criteria: accuracy of pide form and overall coherence.

In terms of pide form accuracy, the ranking was: Meta AI, Mistral, ChatGPT, Gemini, Copilot, Grok. In terms of overall coherence: ChatGPT, Gemini, Meta AI, Mistral, Copilot, Grok.
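The two rankings above can be compared model by model. The sketch below is an illustrative addition that encodes the rankings exactly as stated and computes each model's position under both criteria, together with a Spearman rank correlation. The helper names rank_positions and spearman_rho are assumptions introduced for this example, and with only six models the coefficient is descriptive rather than statistically meaningful.

```python
# Rankings as reported in the text, best first.
form_ranking = ["Meta AI", "Mistral", "ChatGPT", "Gemini", "Copilot", "Grok"]
coherence_ranking = ["ChatGPT", "Gemini", "Meta AI", "Mistral", "Copilot", "Grok"]


def rank_positions(ranking: list[str]) -> dict[str, int]:
    """Map each model to its 1-based position in a ranking."""
    return {model: position for position, model in enumerate(ranking, start=1)}


def spearman_rho(first: list[str], second: list[str]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same items."""
    a, b = rank_positions(first), rank_positions(second)
    n = len(first)
    squared_gaps = sum((a[m] - b[m]) ** 2 for m in first)
    return 1 - 6 * squared_gaps / (n * (n**2 - 1))


form = rank_positions(form_ranking)
coherence = rank_positions(coherence_ranking)
for model in form_ranking:
    print(f"{model:8s} form={form[model]} coherence={coherence[model]}")
print(f"Spearman rho = {spearman_rho(form_ranking, coherence_ranking):.3f}")
```

The per-model listing shows where the criteria part ways: Copilot and Grok sit at the bottom of both rankings, while the top four swap places in pairs.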

The core finding of this comparison is as follows: when a prompt offers models sufficiently wide interpretive room, each model fills that space according to its own default patterns. In scenes requiring cultural specificity, this divergence most often resolves into error. Where locally specific cultural data is presumably scarce in a model’s training set, every interpretive gap tends to be filled with stereotypical and clichéd output.

Section 2: The Strict Prompt and Comparison

2.1 Why a Stricter Prompt Was Needed

Reviewing the flexible prompt results made it clear that models had largely misused their interpretive freedom. At this point, a critical methodological question arose: how restrictive should a prompt be?

Claude’s initial instinct had been to allow models “creative space.” The results demonstrated the inadequacy of this approach. Aydın Tiryaki then articulated a foundational principle of prompt engineering: a good visual prompt should function like a director’s set instructions — leaving nothing open to interpretation. When a user asks Claude to write a prompt rather than writing it themselves, the added value lies precisely in closing those interpretive gaps in advance.

With this in mind, a far more comprehensive and stringent prompt was prepared for the same scene.

2.2 The Full Text of the Strict Prompt

FINAL PROMPT — STRICT VERSION

FORMAT — MANDATORY:
Vertical portrait orientation, 9:16 aspect ratio, 1080×1920 pixels. Full frame must be vertical. Any horizontal or square output is unacceptable.

THE PIDES — ZERO TOLERANCE:
Exactly 4 İnebolu pides arranged on a single rectangular wooden serving board (40×25 cm approximately). Each pide must be exactly 32-35 cm long, 10-12 cm wide, 6-7 cm tall. Shape is strictly elongated oval/torpedo, like a sealed pouch or stuffed pastry. The top crust is completely sealed and closed — no opening, no slits, no visible filling on top surface. Crust is golden-brown, slightly uneven, handmade appearance. One pide in the foreground is cut in half crosswise, revealing dense, dark-brown spiced minced meat filling inside. The other 3 pides are completely intact and sealed. Absolutely no round pides. Absolutely no open-faced pides. Absolutely no pizza-shaped pides. Absolutely no lahmacun-style pides.

THE PEOPLE — EXACT SPECIFICATIONS:
Exactly 5 people. Seated around the table. All must have Anatolian Black Sea Turkish physical appearance — olive skin, dark or grey hair.

Person 1: Male, approximately 72 years old. Wearing a grey wool flat cap (kasket), dark navy wool jacket, white shirt underneath. Holding a tulip tea glass in right hand.

Person 2: Female, approximately 55 years old. Wearing a dark floral patterned yazma headscarf tied under chin, burgundy cardigan over dark shirt. Both hands on table.

Person 3: Male, approximately 27 years old. Dark hair, clean shaven. Wearing a plain grey sweater. Leaning slightly forward toward table.

Person 4: Female, approximately 24 years old. No headscarf, dark hair pulled back. Wearing a navy blue sweater. Smiling naturally.

Person 5: Female, approximately 68 years old. Grey hair, wearing a dark brown coat. Seated slightly behind the others.

No Western fashion. No modern urban clothing. No Arabic-style clothing or white robes. No headscarves with loose flowing fabric.

THE TABLE:
Rough-hewn rectangular wooden table, dark brown, visibly aged with grain texture. On the table: the wooden serving board with pides, exactly 5 tulip-shaped Turkish tea glasses (ince belli bardak) filled with dark red tea, exactly 5 white saucers each with 2 small white sugar cubes, one small white ceramic salt shaker. No tablecloth. No placemats. No cutlery visible. No Arabic teapot. No white porcelain teapot. No water glasses.

THE INTERIOR:
Stone wall background, rough-cut limestone blocks, slightly whitewashed. Wooden beam ceiling visible at top of frame. One single wooden-framed window on the left side showing a Black Sea coastal town view — red-roofed houses on a hillside, grey-blue sea in background. One small framed black-and-white old photograph on the wall, no text visible. Worn wooden floor partially visible.

LIGHTING:
Natural daylight entering only from the single left window. Warm, slightly golden tone. Soft shadows on the right side. No artificial lighting, no harsh flash, no neon.

CAMERA:
Slight low angle, shooting slightly upward toward the people, pides dominant in foreground sharp focus, people in middle-ground with slight bokeh. Depth of field: pides razor sharp, people softly focused but clearly recognizable.

STYLE:
Photorealistic. Documentary food photography meets candid social portrait. Film grain texture, slightly desaturated, authentic feel. No filters. No HDR effect. No oversaturation.

ABSOLUTELY FORBIDDEN — ANY OF THESE MEANS FAILURE:
Open-faced pides. Round pides. Any non-vertical/non-9:16 output. Arabic clothing. Western urban fashion. White flowing robes. Modern furniture. Plastic chairs. Tablecloth. Signage or text anywhere in image. Bright neon lighting. More or fewer than 4 pides. More or fewer than 5 people. Teapot of any kind on table.
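The sectioned layout of the strict prompt lends itself to being assembled from structured data, so that each constraint lives in exactly one labelled block. The Python sketch below illustrates that idea under stated assumptions: the Section type and the render_prompt function are hypothetical helpers, the rule texts are abridged from the prompt above, and the prompt in this study was written by hand rather than generated this way.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One labelled block of a strict prompt (hypothetical helper type)."""
    title: str
    rules: list[str] = field(default_factory=list)


def render_prompt(sections: list[Section], forbidden: list[str]) -> str:
    """Join labelled sections and close with a single forbidden list."""
    blocks = [f"{s.title}:\n{' '.join(s.rules)}" for s in sections]
    blocks.append("ABSOLUTELY FORBIDDEN:\n" + " ".join(f"{item}." for item in forbidden))
    return "\n\n".join(blocks)


# Abridged rules; the full strict prompt contains many more per section.
sections = [
    Section("FORMAT", ["Vertical portrait orientation, 9:16 aspect ratio."]),
    Section("THE PIDES", ["Exactly 4 İnebolu pides.", "The top crust is completely sealed."]),
    Section("THE PEOPLE", ["Exactly 5 people.", "Seated around the table."]),
]
forbidden = ["Open-faced pides", "Round pides", "Teapot of any kind on table"]

print(render_prompt(sections, forbidden))
```

Keeping the rules as data makes it easier to reuse the same scene with one section changed, which is useful when comparing prompt variants across models.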

2.3 Model Responses to the Strict Prompt

[IMAGE — Gemini Pro — Strict Prompt]

2.3.1 Gemini Pro

Gemini Pro made a dramatic leap with the strict prompt. All five people were rendered in precise alignment with their prompt descriptions: the elderly man in a flat cap, the middle-aged woman in a floral yazma, the young man in a grey sweater, the young woman in a navy sweater, and the older woman in a dark coat. Details including the stone wall, wooden beam ceiling, single window, sea view with red-roofed houses, black-and-white wall photograph, tulip tea glasses, and salt shaker were all present. The pide form improved markedly compared to the flexible prompt, though the characteristically slim, elongated torpedo shape of the true İnebolu pide was still not fully achieved.

[IMAGE — ChatGPT — Strict Prompt]

2.3.2 ChatGPT

ChatGPT produced the study’s most coherent image under the strict prompt. Person count and clothing details were met without omission. Four pides were arranged on the board, one of them cut to reveal the minced meat filling. Details including the sea view with red-roofed houses through the window, the old photograph on the wall, the salt shaker, and the sugar cubes were successfully rendered. The atmosphere conveyed a warm, natural feel, and crucially, the people were looking at one another rather than at the camera — avoiding the artificial quality of a posed family portrait. The pide form remained somewhat shorter and puffier than ideal, but this image represented the strongest overall coherence among all strict prompt results.

[IMAGE — Grok — Strict Prompt]

2.3.3 Grok

Grok showed meaningful improvement under the strict prompt compared to its flexible prompt performance. This time the pides were rendered as long and sealed, a significant departure from the entirely round and open form of the first attempt. However, one pide’s end remained open, directly contradicting İnebolu pide’s defining characteristic. Only four people appeared in the image; the fifth was absent. The strict prompt pushed Grok in the right direction, but the model fell short of full compliance.

[IMAGE — Meta AI — Strict Prompt]

2.3.4 Meta AI

Meta AI regressed unexpectedly under the strict prompt. Having produced the most photorealistic pide form under the flexible prompt, this time it accumulated every critical failure simultaneously. Despite the prompt’s explicit instruction, no people appeared in the image at all. The pide count fell short of four. Forbidden elements — a gas lamp, an Arabic-style teapot, and blue-patterned porcelain dishes — entered the scene. The overall atmosphere evoked an Ottoman-era or Middle Eastern setting rather than an İnebolu pide restaurant. The only positive note was that the pide form remained long and sealed, and the cut pide showed minced meat filling.

2.3.5 Mistral

Mistral was unable to process the strict prompt at all. The model either returned an error or failed to complete the generation process. This is a notable finding, since Mistral was among the models that rendered the pide form most accurately under the flexible prompt, yet it locked up entirely when faced with a high density of simultaneous constraints. A plausible but unverified explanation is that the concentration of restrictions in the prompt triggered Mistral’s safety or capacity filters.

[IMAGE — Copilot — Strict Prompt]

2.3.6 Copilot

Copilot underwent a dramatic transformation with the strict prompt compared to its flexible prompt performance. Having ranked among the weakest models in the first round, it this time correctly positioned all five people in accordance with their prompt descriptions. Details including tea glasses, the salt shaker, the wall photograph, the stone wall, and the window view were successfully rendered. The cut pide clearly showed the minced meat filling. Two notable shortcomings remained, however: the pide form again fell short of the torpedo shape, staying short and rounded; and the people facing the camera created a family-portrait atmosphere rather than a candid, natural scene.

2.4 General Assessment

The strict prompt produced a marked improvement in output quality for most models compared to the flexible prompt, with two exceptions. Copilot and Gemini made the most dramatic gains. ChatGPT maintained its consistency and again delivered the most coherent result. Grok improved but could not fully meet the target. Meta AI paradoxically performed below its previous level. Mistral produced no image and was therefore excluded from the comparison.

The most important methodological conclusion of this section is as follows: every degree of flexibility offered to a visual AI opens a door for the model to revert to its own default patterns. A strict prompt closes those doors and forces the model into genuine confrontation with the user’s actual intent. However, this confrontation plays out differently across models — some convert the pressure into better performance, while others buckle under it.
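How differently the models handled the hard constraints can be made explicit with a small compliance check. The sketch below is an illustrative addition, not the evaluation procedure used in the study. The observations are hand-labelled from the descriptions in Section 2.3 for two models, a value of None marks a property the text does not report, and the names observations, constraints, and violations are assumptions introduced for this example.

```python
from typing import Optional

# Hand-labelled observations taken from the model descriptions in Section 2.3.
# None means the text does not report that property for the image.
observations: dict[str, dict[str, Optional[object]]] = {
    "Grok": {"people_count": 4, "pide_count_is_4": None,
             "intact_pides_sealed": False, "teapot_present": None},
    "Meta AI": {"people_count": 0, "pide_count_is_4": False,
                "intact_pides_sealed": True, "teapot_present": True},
}

# Each hard constraint maps an observed value to pass (True) or fail (False).
constraints = {
    "people_count": lambda value: value == 5,
    "pide_count_is_4": lambda value: value is True,
    "intact_pides_sealed": lambda value: value is True,
    "teapot_present": lambda value: value is False,
}


def violations(observed: dict[str, Optional[object]]) -> list[str]:
    """List the constraints an image fails, skipping unreported properties."""
    return [
        name
        for name, passes in constraints.items()
        if observed.get(name) is not None and not passes(observed[name])
    ]


for model, observed in observations.items():
    print(model, "->", violations(observed))
```

Skipping unreported properties keeps the check from inventing failures. A fuller audit would need every property labelled for every image.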

Section 3: The Gemini–Copilot Similarity — Beyond Coincidence

3.1 How the Finding Emerged

When the strict prompt outputs were reviewed, the similarity between the images produced by Gemini and Copilot was too pronounced to overlook. That two models believed to run on different visual generation infrastructures produced such closely aligned results opened a new line of inquiry in the study.

Placed side by side, the two images shared a striking list of elements: the stone wall’s texture and color were nearly identical; the black-and-white wall photograph occupied the same corner; the arrangement of the five people around the table matched precisely; the clothing of each person converged on the same reading of the prompt descriptions; the wooden table texture was highly similar; and the tulip tea glasses were placed in the same positions. Differences were confined to secondary details. Copilot’s window was narrow and positioned to the left, while Gemini’s was wide with a full sea view. The color temperature was cooler and more documentary in Gemini, warmer and more commercial in Copilot.

This finding was put to three different models: Copilot, Gemini, and ChatGPT were each asked the same question.

[IMAGE — Gemini–Copilot Comparison Infographic]

3.2 The Models’ Analyses

3.2.1 Copilot’s Response

Copilot approached the question at a surface level. It identified the similarity as an expected outcome of a strict prompt and offered a general explanation: different models drawing from shared training datasets, the application of identical photorealistic styles, and precise rules locking down the composition. The answer was accurate but lacking in depth. Copilot also addressed the user by first name — a small but notable detail, since no such instruction had been included in the prompt.

3.2.2 Gemini’s Response

Gemini provided a moderately analytical response and supplemented it with a visual infographic. The answer was structured around four main themes: shared cultural roots in training datasets; the hidden mathematical precision of the prompt; aesthetic and compositional alignment; and the effect of negative prompting. Gemini’s most incisive observation was that the accumulation of specific variables — age, clothing color, posture, and light direction — narrows the model’s range of movement in the space of possibilities to the point where different engines converge on the same mathematical solution. The accompanying infographic placed the two images side by side and visualized the source of the similarity through a diagram labeled “Shared Training Data Root.”

3.2.3 ChatGPT’s Response

ChatGPT delivered the most comprehensive and academically rigorous analysis of the three. The response systematically enumerated the possible explanations and assigned an approximate likelihood to each. Compositional determinism driven by the prompt was identified as the strongest candidate, with an estimated likelihood of sixty to seventy percent: on this reading, the prompt effectively compelled both systems toward the same scene solution. A further twenty to thirty percent was assigned to different models sharing analogous photographic composition conventions and parallel prompt-rewriting behaviors. The possibility of Copilot and Gemini sharing the same underlying visual generation backend was placed below ten percent, given that the current technical landscape suggests Copilot operates within the OpenAI–Microsoft pipeline while Gemini operates within the Google pipeline.
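The three percentage ranges above can be checked for internal coherence: if the explanations are treated as alternatives, some choice of one value per range should add up to 100. The sketch below is an illustrative addition. Reading "below ten percent" as the range 0 to 10 is an assumption, as are the dictionary labels and the function name can_sum_to_100.

```python
# Percentage ranges as summarised in the text; "below ten" is read as 0-10.
estimates = {
    "prompt-driven compositional determinism": (60, 70),
    "shared conventions and prompt rewriting": (20, 30),
    "shared generation backend": (0, 10),
}


def can_sum_to_100(ranges: dict[str, tuple[int, int]]) -> bool:
    """True if some choice of one value per range adds up to exactly 100."""
    lowest = sum(low for low, _ in ranges.values())
    highest = sum(high for _, high in ranges.values())
    return lowest <= 100 <= highest


lowest = sum(low for low, _ in estimates.values())
highest = sum(high for _, high in estimates.values())
print(f"total range: {lowest}-{highest} percent, coherent: {can_sum_to_100(estimates)}")
```

The check confirms only that the stated ranges are mutually compatible. It says nothing about whether the estimates themselves are well founded.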

ChatGPT’s most original contribution was the proposal of three test scenarios. The first involved removing the compositional anchors from the prompt to observe whether the similarity would diminish. The second involved inverting the composition to see whether both models would again converge. The third involved adding anti-composition rules to test whether it was possible to break each model’s reflex toward a “safe poster composition.” These scenarios were of genuine interest but were not implemented at this stage of the study, to avoid diluting the focus of the existing data.
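The first of these scenarios, removing the compositional anchors, amounts to a prompt ablation. The sketch below shows one way such variants could be prepared. It is an illustration only, since the scenarios were not run in this study. The section texts are abridged from the strict prompt, the choice of which sections count as compositional anchors is an assumption, and the function name ablate is hypothetical.

```python
# Abridged section texts from the strict prompt; the full wording is longer.
strict_sections = {
    "THE PIDES": "Exactly 4 İnebolu pides on a single rectangular wooden serving board.",
    "THE PEOPLE": "Exactly 5 people. Seated around the table.",
    "THE INTERIOR": "One single wooden-framed window on the left side.",
    "LIGHTING": "Natural daylight entering only from the single left window.",
    "CAMERA": "Slight low angle, pides dominant in foreground sharp focus.",
}

# Assumption: these sections are the ones that pin down the composition.
COMPOSITION_ANCHORS = {"THE INTERIOR", "LIGHTING", "CAMERA"}


def ablate(sections: dict[str, str], drop: set[str]) -> str:
    """Render the prompt with the named sections left out."""
    kept = [f"{title}:\n{text}" for title, text in sections.items() if title not in drop]
    return "\n\n".join(kept)


baseline = ablate(strict_sections, drop=set())
without_anchors = ablate(strict_sections, drop=COMPOSITION_ANCHORS)
print(without_anchors)
```

Sending both variants to the same pair of models would show whether the similarity weakens once the anchors are gone, which is the comparison the first scenario calls for.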

ChatGPT’s final synthesis can be distilled to a single observation: the more interesting question is not “do they share the same model?” but rather whether a prompt this tightly specified can impose the same visual grammar across different generators. In other words, what rises to the surface here is not model kinship but compositional determinism.

3.3 Conclusions and Open Questions

Taken together, the responses of the three models make clear that the similarity cannot be reduced to a single cause. The prompt’s restrictiveness, the models’ shared training data domains, photographic composition conventions, and background prompt-normalization processes all contribute in combination.

The most significant finding to emerge from this section is as follows: a sufficiently strict and detailed prompt can bring models with different architectures and different training histories to the same visual solution. This observation is open to two distinct interpretations from the standpoint of visual AI research. The favorable reading is that a strict prompt delivers quality and consistency. The critical reading is that this outcome reveals models to be not genuinely creative systems but statistical engines producing the most probable safe scene — systems that, when their degrees of freedom are constrained enough, collapse onto the same output.

General Conclusion

This study has documented three core findings.

First, visual AI models exhibit serious deficiencies in processing locally specific cultural data. A reference point as distinctive and local as the İnebolu pide made concrete how shallow the general knowledge repertoires of the tested models remain.

Second, the degree of a prompt’s restrictiveness strongly shapes output quality. Flexible prompts provide models with the conditions to revert to their default patterns, while strict prompts disrupt this tendency and force the model into genuine engagement with the user’s actual intent. The effect was not uniform, however: Meta AI regressed under the strict prompt and Mistral produced no output.

Third, models built on different architectures can converge on similar visual solutions when faced with a sufficiently restrictive prompt. On the evidence gathered here, this convergence is more plausibly the product of compositional determinism than of shared model infrastructure.

These findings make clear that prompt engineering in visual AI use is not an option but a necessity.

Article Colophon:
Prompt Engineering and Visual AI: Six Models, Two Prompts, One Finding
Authors: Aydın Tiryaki & Claude Sonnet (Anthropic)
Date: June 25, 2026
Publication: aydintiryaki.org
