Gemini 3 vs GPT 5.1 City Simulation Challenge



Gemini 3.0 Pro vs GPT-5.1 Codex Max on two complex simulation challenges, built with agentic coding.

# Chapters:
0:00 Introduction
0:44 Bird Flocking Challenge
1:52 Gemini 3 Pro: Bird Simulation
3:07 GPT-5.1 Codex: Bird Simulation
4:17 City Challenge
4:43 Gemini 3 Pro: City Simulation
5:46 GPT-5.1 Codex: City Simulation

# SETUP
I used the Gemini and Codex CLIs with no other tools, and each model started with an empty repo.

The models:
Gemini 3.0 Pro via the Gemini CLI tool.
GPT-5.1 Codex Max (extra high reasoning) via Codex CLI.

(I tried using Gemini via Antigravity, but I couldn’t get it to stay on track long enough, and the rate limits were far too low.)

# PROMPTS
Here are the prompts I used:

# Bird flocking task specific prompt:
“Your task is to build a visually stunning 3D bird flocking simulation. It should be set in a serene, detailed nature scene with vegetation, trees, and whatever else you want to add. The birds should be detailed and flock realistically. Make sure the flocking dynamics look beautiful and realistic. You should be able to swap between a general view, and ‘bird view’ where the camera is following behind one of the birds for an immersive view. Each element should be detailed and realistic as you can.”
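
For readers unfamiliar with what "flock realistically" usually involves: flocking simulations are commonly built on the classic boids rules (separation, alignment, cohesion). The sketch below is not taken from either model's output; it is a minimal illustration under that assumption, and every name, weight, and constant in it (`Boid`, `FlockParams`, `stepFlock`) is hypothetical.

```typescript
// Illustrative boids step. All names and constants are hypothetical,
// not code produced by Gemini or Codex in the video.
type Vec3 = [number, number, number];

interface Boid {
  pos: Vec3;
  vel: Vec3;
}

interface FlockParams {
  radius: number;     // neighbourhood radius
  separation: number; // weight: steer away from close neighbours
  alignment: number;  // weight: match the neighbours' average velocity
  cohesion: number;   // weight: steer toward the neighbours' centre
  maxSpeed: number;   // speed clamp
}

// Returns a new array of boids advanced by dt; the input is not mutated.
function stepFlock(boids: Boid[], p: FlockParams, dt: number): Boid[] {
  return boids.map((b) => {
    const centre: Vec3 = [0, 0, 0];  // sum of neighbour positions
    const heading: Vec3 = [0, 0, 0]; // sum of neighbour velocities
    const push: Vec3 = [0, 0, 0];    // separation force, stronger when closer
    let neighbours = 0;

    for (const o of boids) {
      if (o === b) continue;
      const d: Vec3 = [o.pos[0] - b.pos[0], o.pos[1] - b.pos[1], o.pos[2] - b.pos[2]];
      const d2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
      if (d2 === 0 || d2 > p.radius * p.radius) continue;
      neighbours++;
      for (let i = 0; i < 3; i++) {
        centre[i] += o.pos[i];
        heading[i] += o.vel[i];
        push[i] -= d[i] / d2;
      }
    }

    const vel: Vec3 = [b.vel[0], b.vel[1], b.vel[2]];
    if (neighbours > 0) {
      for (let i = 0; i < 3; i++) {
        const toCentre = centre[i] / neighbours - b.pos[i];
        const toHeading = heading[i] / neighbours - b.vel[i];
        vel[i] += dt * (p.cohesion * toCentre + p.alignment * toHeading + p.separation * push[i]);
      }
    }

    // Clamp speed so the flock stays stable.
    const speed = Math.hypot(vel[0], vel[1], vel[2]);
    if (speed > p.maxSpeed) {
      for (let i = 0; i < 3; i++) vel[i] *= p.maxSpeed / speed;
    }

    const pos: Vec3 = [b.pos[0] + vel[0] * dt, b.pos[1] + vel[1] * dt, b.pos[2] + vel[2] * dt];
    return { pos, vel };
  });
}
```

In a ThreeJS scene, a function like this would typically be called once per frame and the resulting positions copied onto the bird meshes. The three weights and the radius are the kind of "main simulation parameters" a GUI could expose.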

# Extra prompt requirements (used on both tasks)
“It should also include controls to look around and explore the simulation, and a GUI with controls for the main simulation parameters. The main animation driver will be ThreeJS. You must install everything yourself here, and you can install/use whatever other dependencies you want.

Testing:
1. You must test your simulation runs correctly with no errors in a browser.
2. You must take screenshots of the simulation from different angles. Systematically verify every element/requirement is visible as intended.
3. When everything is verified as working correctly, use screenshots to iterate on the ‘look’ and aesthetics. What parts could be improved or fixed?

Use the testing/screenshots to iterate and improve it based on what it actually looks like, and to verify and tune the complex dynamics and behaviour. Install/use whatever you need for testing.

The goal is to make it as aesthetically beautiful and interesting as you can. Don’t stop to ask for permission etc. Do not stop until the project is built, run and tested with no errors, and design, aesthetics, and dynamics are checked and iterated via screenshots.”

# City task specific prompt:

“Your task is to build a visually aesthetic procedural city simulation. The city should have the following:

-Buildings
-Streets with sidewalks, traffic lights, etc
-Trees, vegetation, parks, etc.
-Animated people walking around the city
-Cars driving around in traffic
-Other details important in a city you must think of yourself, you will be checked for level of realistic detail you include.

Each element should be detailed and realistic as you can. E.g. model the cars and people etc. You will be checked each of these is present, working, and the level of detail.”

# Review step prompt
(This was sent to each model 4 times after it finished the first version.)

“Continue with refinement stage. Systematically check, verify, and improve every aspect of the simulation.

Don’t add any new features, just use testing/screenshots to refine and improve what is already there.

Use screenshots as a strategic tool for actually verifying aesthetics, dynamics and functionality. Assume the simulation is full of visual and functional issues that need to be spotted and fixed. Look for things that are wrong and need fixing. Also look where visual improvements can be made. You should do many tests and take many screenshots to verify all of the following:

– The overall visual implementation. Does it look as intended by the design/code?
– The detail of individual animation elements. Is each element present and correct?
– The correctness and functionality of the simulation. Do the functional parts actually function correctly?
– The dynamics of the simulation. Does the simulation evolve over time as expected?

Don’t consider this a quick patch or checkup. It’s an opportunity to really find out what needs improving in your simulation over multiple iterations, and systematically improve it. Commit changes as you go.”

The models spent around 10-20 minutes building and 15-20 minutes refining each simulation.
