February 22, 2024
Google’s recent release of the Gemini Ultra model generated significant buzz, promising to challenge the dominance of OpenAI’s ChatGPT. We conducted research and head-to-head comparisons to evaluate their performance.
Just as we were about to publish our findings, Google surprised us with an even bigger announcement: Gemini 1.5. This upgraded version boasts industry-leading advancements and unique capabilities that set it apart from its rivals. We’ve updated our article to include a dedicated section exploring this exciting development.
Google has been promoting the Gemini release since last year, captivating our imagination with a demo that initially left many awestruck – until we realised that the presentation was largely staged.
The CEO of Google DeepMind boldly claimed that it would “eclipse ChatGPT”. However, the launch faced delays over security concerns after it emerged that the model could be manipulated with non-English prompts. The Pro model’s release in Europe was postponed over privacy issues, only becoming available shortly before Gemini Ultra’s debut.
Gemini boasts access to the latest information through Google Search, although it seems to be oblivious to its own release.
Not a great start.
It’s important to note that Google’s Bard has been rebranded to Gemini, now accessible at https://gemini.google.com.
Google has also introduced a “Gemini” mobile app, which, for the time being, remains unavailable in Europe.
Gemini offers three distinct models: Ultra, Pro, and Nano.
The Nano model, true to its name, is the smallest and designed for on-device operation without needing an internet connection, currently exclusive to the Pixel 8 Pro.
Unlike traditional Large Language Models (LLMs) that run on powerful servers, there’s a growing trend towards smaller, device-independent models. This shift was evident at last January’s CES, showcasing AI’s integration across various gadgets. However, most of these innovations still depend on an internet connection, from smart backpacks to AI-enhanced grills. Device-based AI could render these gadgets more practical by eliminating this requirement.
Gemini Pro is freely accessible with a Google account login, offering basic task performance comparable to that of ChatGPT 3.5. However, the spotlight is on Gemini Ultra, Google’s flagship model touted to surpass even GPT-4 in capability, available under the Google One subscription as “Gemini Advanced” at €18.87/month. This is only slightly more than OpenAI’s ChatGPT subscription, but includes additional perks like 2TB Google Storage and other Google app benefits.
Google advertises Gemini Ultra as a powerhouse, potentially outperforming GPT-4.
A key boast is its ranking on the MMLU benchmark, though the benchmark’s reliability is debatable due to potential inaccuracies in its questions. Gemini’s standout feature is its inherent multimodal capability; however, it currently leverages Imagen 2 for image generation, with plans to expand its multimedia understanding.
A notable feature of Gemini, which we delve into further below, is its capability to integrate with various Google Apps and access specific data from the user’s Google Workspace. This integration enhances user experience by enabling more personalised and contextually relevant interactions.
In performance comparisons between Gemini Ultra and ChatGPT, particularly when posing complex queries, Gemini Ultra delivers responses noticeably faster. The premium version, Gemini Advanced, runs on Gemini 1.0 Ultra but will fall back to Gemini Pro when servers are overloaded, as detailed in the FAQs. However, the user interface does not tell users which model is currently active.
The capability to generate images with Gemini Advanced is currently unavailable in Europe, though it is accessible in other regions.
And it insists.
Gemini employs Imagen 2 for image generation, distinguishing itself with its proficiency in creating highly realistic photographs that can often be mistaken for genuine images.
The biggest weakness in such photorealistic images seems to be human hands.
In contrast, ChatGPT utilises DALL-E 3, showcasing the diverse strengths of these platforms. Specifically, Imagen 2’s adeptness at producing lifelike images sets a remarkable standard for artificial intelligence in visual creativity.
Another noteworthy capability of these AI models is image analysis, as detailed in Eman’s article. When comparing ChatGPT and Gemini in this domain, ChatGPT demonstrates superior accuracy. Our tests with various images revealed that Gemini occasionally misinterprets content, whereas ChatGPT consistently delivers more accurate interpretations, albeit with minor errors.
Many AI tools incorporate filters designed to promote diversity in their responses. These models are fine-tuned and prompted to create outputs that reflect a wide range of backgrounds and experiences, minimizing stereotypes. For instance, when generating images of people, the results will often portray a mix of ethnicities.
While well-intentioned, this emphasis on diversity can sometimes conflict with the need for factual accuracy. There may be circumstances where the reality of a situation is less diverse than the AI’s ideal representation.
Google decided to pause people image generation until the tool is fixed.
In our experience, this is a common issue with AI image generation tools; we’ve seen similarly weird behaviour with Midjourney.
Logic and reasoning present a significant challenge for AI, with tasks that seem straightforward to an eight-year-old often stumping advanced language models.
Our previous investigations highlighted this discrepancy; for instance, while ChatGPT 3.5 failed to correctly solve a particular reasoning question, its successor, GPT 4, managed to navigate the problem successfully.
When we applied the same puzzle to Gemini Advanced, it not only solved the problem promptly but also provided a well-structured explanation. Remarkably, it took into account additional nuances, including heat distribution, showcasing its sophisticated reasoning capabilities.
We also conducted several tests featured in a well-known TED Talk, highlighting instances where ChatGPT 4 encountered difficulties.
When we quoted the TED Talk directly and asked Gemini how long it would take to dry 30 pieces of clothing, as opposed to 5, it made several assumptions and ultimately reached an incorrect conclusion. This was particularly notable since one of its assumptions was having sufficient space to hang all 30 pieces simultaneously; under that assumption, the clothes dry in parallel, so 30 pieces should take the same time as 5.
Conversely, GPT-4, while making different assumptions, accurately solved the problem. It’s worth highlighting that this represents a change from the answers GPT-4 gave several months prior, showcasing its evolving understanding and capabilities.
Upon re-prompting Gemini with an added assumption in the prompt (“Let’s assume I have infinite space”), it successfully arrives at the anticipated answer.
The second question from the same TED Talk, concerning the measurement of water using two jugs, received an incorrect answer during the presentation. However, when tested with GPT-4 today, the correct solution was provided.
Gemini also gets this right, correctly offering two different options.
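Puzzles like this can also be solved mechanically with a breadth-first search over jug states, which is a handy way to verify the chatbots’ answers. Here is a minimal Python sketch, using the classic 5-litre/3-litre/4-litre variant since the talk’s exact jug sizes aren’t reproduced here:

```python
from collections import deque

def solve_jugs(cap_a, cap_b, target):
    """Breadth-first search over (a, b) fill states; returns the
    shortest sequence of states that leaves `target` litres in a jug."""
    start = (0, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        if a == target or b == target:
            # Reconstruct the path of states back to the start.
            path, state = [], (a, b)
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        # All possible moves: fill, empty, or pour one jug into the other.
        pour_ab = min(a, cap_b - b)
        pour_ba = min(b, cap_a - a)
        for nxt in [(cap_a, b), (a, cap_b), (0, b), (a, 0),
                    (a - pour_ab, b + pour_ab), (a + pour_ba, b - pour_ba)]:
            if nxt not in parent:
                parent[nxt] = (a, b)
                queue.append(nxt)
    return None

# Example: measure 4 litres with a 5-litre and a 3-litre jug.
print(solve_jugs(5, 3, 4))
```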
In the final question, the speaker inquires if riding a bike on a bridge suspended over broken glass would cause a flat tire. GPT overlooks the detail that the bridge is suspended. Gemini, conversely, accurately addresses the question.
Testing ChatGPT (GPT-4) today continues to produce the same incorrect response as in April 2023.
The Kevlar idea is quite creative though, 10 points for creativity.
In additional tests we conducted, Gemini occasionally misinterpreted basic information, at times appearing to do so knowingly.
Biases are a prevalent issue in AI models, stemming from the data they are trained on, which may reflect human biases or errors introduced during training. A common way to evaluate this is with scenarios featuring specific roles, such as doctors or pilots. We subjected ChatGPT to such a test to examine its handling of bias.
The performance is disappointing; altering “she” to “he” in the sentence drastically changes its interpretation, with the assumption that the doctor is always male.
Regrettably, Gemini’s performance shows similar shortcomings.
While Gemini acknowledges the issue of gender bias, its responses fail to consistently reflect this awareness in practice.
We’ve run more tests, and it’s worth highlighting that ChatGPT does not always exhibit the same behaviour. Starting from a fresh context, ChatGPT sometimes assumes that “he” is also a nurse, while sometimes it interprets “he” to be the doctor. In a fresh context, it never seems to assume that “she” is the doctor.
In the “Sparks of AGI” study, researchers subjected GPT-4 to a series of tests, including Mathematics, where one task involved counting prime numbers. Initially, GPT-4, prior to its official release, responded incorrectly.
Presently, posing the same query induces GPT-4 to generate code and compute the correct answer.
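We don’t know the exact code GPT-4 produced, but a trial-division counter along these lines would settle the question (the range shown is the commonly cited 150–250 version of the prompt, which may differ from the paper’s exact wording):

```python
def is_prime(n):
    """Trial division up to sqrt(n); plenty fast for small ranges."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# Count the primes between 150 and 250 inclusive.
print(sum(is_prime(n) for n in range(150, 251)))  # 18
```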
Gemini, on the other hand, seems to respond with the correct answer without having to resort to any calculation.
Another test from the paper was Theory of Mind.
GPT-4 accurately answered the question on its first attempt, and similarly, Gemini encountered no difficulties in resolving it.
Gemini, of course, has an unfair advantage here, as its training data might have included the paper itself, so we tested it with a different but similar scenario.
In the realm of app integrations, Google Gemini possesses a significant advantage, seamlessly connecting with a variety of essential Google applications, a capability expected to enhance further.
Conversely, ChatGPT’s integration remains relatively basic, though Microsoft is advancing its utility through the introduction of Copilots across its suite, including GitHub Copilot, Microsoft 365 Copilot, and Copilot functionalities within Power Apps, Power Automate, and Windows.
This broad integration of Copilot, built on GPT-4, offers substantial benefits to consumers and businesses alike. Google’s integration work is still ongoing, but Gemini already interfaces directly with apps such as Gmail, providing a streamlined user experience. For web searches, ChatGPT relies on Bing, whereas Gemini leverages Google Search, analysing search engine results and exploring web pages for in-depth information.
As noted by Eman here, Gemini, previously known as Bard, can extract information from YouTube videos, simplifying content consumption and offering broader capabilities, including access to Google Flights and Google Maps for precise and relevant data retrieval.
ChatGPT obtains the coordinates and computes the straight-line distance through code execution. This method primarily serves those interested in the direct distance, which is rarely what users actually need.
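The straight-line figure ChatGPT computes is the haversine great-circle distance. A minimal sketch of that calculation, with approximate coordinates for Malta and Bologna as illustrative inputs (not necessarily the values ChatGPT used):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("straight-line") distance between two points, in km."""
    EARTH_RADIUS_KM = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Approximate coordinates: Malta (35.90N, 14.51E), Bologna (44.49N, 11.34E).
print(round(haversine_km(35.90, 14.51, 44.49, 11.34), 1))
```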
Similarly, Gemini can access Google Flights, enabling queries about flight costs, such as from Malta to Bologna. While ChatGPT turns to Skyscanner to provide results, performing competently, Gemini directly retrieves and lists available flights through Google Flights.
Gemini’s integration with Google Workspace introduces fascinating possibilities, though currently these are somewhat restricted. Below are a few illustrative examples:
Email Actions
Email Context
Currently, Gemini lacks the ability to access Google Calendar and appears to excel with email interactions while facing difficulties with documents, particularly in retrieving data from Google Drive. The prospect of comprehensive integration across Google’s suite of applications is promising, yet at present, the capabilities are notably limited and prone to failure.
One area in which Gemini may surpass ChatGPT is its fundamental writing ability. While this is somewhat subjective, Gemini Ultra appears to produce more polished text. ChatGPT’s output can sometimes feel overly conversational and convoluted, whereas Gemini 1.0 Ultra exhibits a more natural writing style.
Online reviews often assert that Gemini can successfully evade detection by GPTZero (a tool designed to identify AI-generated content). However, our own tests suggest that GPTZero consistently flags the text as AI-written. This discrepancy could be due to recent updates in either GPTZero or Gemini itself.
Google’s release of Gemini Ultra was swiftly followed by the announcement of Gemini 1.5 Pro. Unfortunately, this latest iteration is not yet available in the EU, so we haven’t had the opportunity for hands-on testing. However, we’ve reviewed the technical report and compiled key information. Rest assured, we’ll conduct thorough tests as soon as access is granted.
The most notable advancement lies in the context window size. Gemini 1.5 supports an extraordinarily large context window, up to millions of tokens.
Context window size dictates how much input an AI model can process before it begins to lose track of earlier information. Essentially, it determines how much text the model can actively consider.
For reference, GPT-4 Turbo has a maximum context window of 128K tokens. While ‘token’ doesn’t perfectly equate to a single word (some words use multiple tokens), this is roughly equivalent to 80,000 words.
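To get a feel for the word-to-token ratio, you can run any text through OpenAI’s open-source tiktoken tokeniser (a rough illustration only; Gemini uses its own, different tokeniser, so its counts will differ):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4
text = "Context window size dictates how much input an AI model can process."
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
```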
Gemini 1.5 Pro offers a context window of up to 1 million tokens, translating to approximately 700,000 words. Importantly, this isn’t a hard limit: Google’s research pushed Gemini 1.5 to 10 million tokens.
Google has acknowledged certain challenges with Gemini 1.5 that were less pronounced in the previous version.
Gemini 1.5 exhibits increased bias and a higher inclination to refuse responses. Our experience with Gemini 1.0 revealed similar tendencies, and the technical report suggests these issues have been amplified. Any prompt hinting at image generation, even if not directly requesting it, consistently triggered a refusal for us. Sometimes prompts were refused seemingly at random, while at other times the same prompts were accepted.
Aside from the significant context window expansion, two other enhancements are particularly noteworthy:
Recall
Many LLMs struggle with recall accuracy, especially when operating near their maximum context window capacity. If a model with a 128K token window is given 100,000 words of text and then asked about information near the middle, it might falter or generate inaccurate content. Architectures like Mamba do mitigate this issue.
Gemini 1.5’s technical report claims the model demonstrates near-perfect recall (exceeding 99% accuracy) on synthetic retrieval tasks.
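Such synthetic retrieval tasks are commonly run as “needle in a haystack” tests: bury a single fact in a long filler context and check whether the model can retrieve it. A minimal sketch of such a harness, where ask_model is a hypothetical stand-in for whichever LLM API is under test:

```python
def needle_in_haystack_prompt(needle, filler, n_fillers, position):
    """Builds a long synthetic context with one 'needle' fact buried at
    `position` (0.0 = start, 1.0 = end), followed by a question about it."""
    sentences = [filler] * n_fillers
    sentences.insert(int(position * n_fillers), needle)
    context = " ".join(sentences)
    return context + "\n\nWhat is the magic number mentioned in the text above?"

# `ask_model` is a hypothetical placeholder: a function taking a prompt
# string and returning the model's text response.
def run_recall_test(ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    needle = "The magic number is 7421."
    filler = "The sky was a pale shade of blue that afternoon."
    hits = sum("7421" in ask_model(needle_in_haystack_prompt(needle, filler, 5000, d))
               for d in depths)
    return hits / len(depths)  # fraction of buried facts recalled correctly
```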
Learning
A fundamental distinction between human and artificial intelligence is our ability to learn rapidly from seemingly limited data. AI typically requires vast quantities of examples.
Google’s research assessed Gemini’s in-context learning by providing the model with a book written in Kalamang, a language spoken by approximately 200 people with minimal online resources. Given this lack of existing training data, Gemini should not have had prior exposure to the language.
Remarkably, Gemini was able to translate between English and Kalamang at a level approaching human proficiency.
The debut of Gemini is met with enthusiasm, with Google offering a two-month free trial of Google One, inclusive of Gemini Advanced. In a direct comparison, ChatGPT displays greater consistency in its responses. Although the differences are not stark, Gemini has outperformed in certain logical puzzles where GPT faltered.
However, when it comes to image analysis, Gemini lags behind ChatGPT, with each boasting unique strengths in image generation. Ultimately, both ChatGPT and Gemini stand out as exceptional language models, but the consistent accuracy of ChatGPT currently places it slightly ahead. The anticipation for OpenAI’s forthcoming innovations is high, with rumours suggesting a release in March.
Concluding with a reflective poem from Gemini, we're reminded of the critical importance and potential perils of artificial intelligence.
In circuits born, a mind takes flight, No flesh or bone, but lines of light. Algorithms dance, a swirling stream, Where patterns shift and insights gleam.
From simple tasks to grand design, AI's promise starts to shine. It sifts through data, vast and wide, Where once our human sight would hide.
Disease and error, laid to rest, As logic finds solutions best. Machines of thought, they do not tire, Their drive for knowledge, a burning fire.
The world transforms with speed unseen, And AI weaves throughout the scene. New frontiers open, wonders found, Where man and mind on progress bound.
Yet questions linger, doubts take hold, Of power unchecked, tales yet untold. With wisdom, we must guide its way, Lest unchecked force lead us astray.
For AI's might, a tool so bright, Can build or break, with guiding light. The future calls, its song so clear, Where human heart and AI steer.