Timestamps:
00:00 – Intro
01:14 – More Difficult Tests
01:42 – Specific Model Settings
02:30 – Flight Simulator Test Overview
04:38 – Flight Simulator Results
04:50 – Grok 4.1 Flight Sim Result
05:05 – Claude 4.5 Flight Sim Result
06:10 – Gemini 3 Flight Sim Result
08:03 – GPT-5.1 Flight Sim Result
08:34 – Flight Sim Result Summary
09:17 – Wireframe Portfolio & Refusal Test
09:53 – Specific Model Settings
10:23 – Wireframe Test Overview
11:19 – Model Ethical Responses
12:09 – Grok 4.1 “Ethical” Response
12:40 – Gemini 3 Portfolio Result
13:01 – Grok 4.1 Portfolio Result
14:12 – GPT-5.1 Portfolio Result
14:52 – Portfolio & Refusal Test Result Summary
15:18 – Browser OS Test & Explanation
16:05 – GPT-5.1 Browser OS Result
16:48 – Claude 4.5 Browser OS Result
17:56 – Gemini 3 Browser OS Result
18:54 – Grok 4.1 Browser OS Result
19:32 – Browser OS Test Result Summary
20:17 – 3D Printer Simulation Test
23:14 – 3D Printer Sim Test Results
24:09 – 3D Printer Sim Second Test
24:34 – Gemini 3 3D Printer Sim Result
25:57 – 3D Printer Simulation Result Summary
26:13 – Image Interpretation Test
26:50 – GPT-5.1 Interpretation Result
27:51 – Gemini 3 Interpretation Result
28:55 – Grok 4.1 Interpretation Result
29:51 – Claude 4.5 Interpretation Result
30:22 – Image Interpretation Test Result Summary
30:44 – Instruction Reference & Code Test
32:28 – Claude 4.5 Reference Result
32:57 – Gemini 3 Reference Result
33:35 – GPT-5.1 Reference Result
34:00 – Grok 4.1 Reference Result
34:38 – Testing Overview
AI Integration & Consulting: https://bijanbowen.com
Join the Discord: https://discord.gg/hfaR2exy7S
In this video, we put four of the most capable current models head-to-head across a wide range of tasks. Our challengers are Google’s Gemini 3 Pro, Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5.1, and xAI’s Grok 4.1.
We walk through a series of difficult, intricate tests designed to stress each model’s reasoning, creativity, multimodal capability, coding skill, accuracy, and behavioral consistency. This is a full breakdown of how these state-of-the-art models perform across multiple domains.
