The AI detective: The Needle in a Haystack test and how Gemini 1.5 Pro solves it

September 10, 2024

https://storage.googleapis.com/gweb-cloudblog-publish/images/needle-haystack.max-2600x2600.png

Stephanie Wong

Head of Developer Skills & Community, Google Cloud

Imagine a vast library filled with countless books, each containing a labyrinth of words and ideas. Now, picture a detective tasked with finding a single, crucial sentence hidden somewhere within this literary maze. This is the essence of the "Needle in a Haystack" test for AI models, a challenge that pushes the boundaries of their information retrieval capabilities.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_Detective.max-2100x2100.png

Generated using Imagen 2. Prompt: A detective looking for a needle in a haystack. The detective is mostly covered by shadows holding a magnifying glass.

In artificial intelligence, this test has nothing to do with a physical needle. It measures how well a large language model (LLM) can retrieve a specific piece of information from a large amount of data in its context window. It's a trial by fire for LLMs, assessing their ability to sift through a sea of data and pinpoint the exact information needed.

The test embeds a random statement (the "needle") within a long context (the "haystack") and prompts the LLM to retrieve it. Key steps include:

Insert the needle: Place a random fact or statement within a long context window.
Prompt the LLM: Ask the model to retrieve the specific statement.
Measure performance: Iterate through different context lengths and document depths.
Score the results: Provide detailed scoring and calculate an average.
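
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used in any published evaluation: `build_haystack`, `run_needle_test`, and `fake_model` are hypothetical names, context length is counted in words rather than tokens for simplicity, and `fake_model` is a stand-in that only "reads" the first 500 words so the scoring has something to catch. In practice you would replace it with a call to the model you want to test.

```python
from statistics import mean


def build_haystack(filler: str, needle: str, n_words: int, depth: float) -> str:
    """Repeat filler text up to n_words words and insert the needle at a
    relative depth (0.0 = very start, 1.0 = very end)."""
    filler_words = filler.split()
    words = [filler_words[i % len(filler_words)] for i in range(n_words)]
    position = int(round(depth * len(words)))
    return " ".join(words[:position] + [needle] + words[position:])


def run_needle_test(ask_model, needle, answer, question, filler, lengths, depths):
    """Score every (context length, depth) pair: 1.0 if the reply contains
    the expected answer, 0.0 otherwise."""
    scores = {}
    for n_words in lengths:
        for depth in depths:
            context = build_haystack(filler, needle, n_words, depth)
            reply = ask_model(context, question)
            scores[(n_words, depth)] = 1.0 if answer.lower() in reply.lower() else 0.0
    return scores


def fake_model(context: str, question: str) -> str:
    """Hypothetical stand-in for an LLM call: it only sees the first 500 words."""
    visible = " ".join(context.split()[:500])
    return "The secret code is 7421." if "7421" in visible else "I could not find it."


scores = run_needle_test(
    ask_model=fake_model,
    needle="The secret code is 7421.",
    answer="7421",
    question="What is the secret code?",
    filler="The quick brown fox jumps over the lazy dog.",
    lengths=[100, 1000],
    depths=[0.0, 0.5, 1.0],
)
for (n_words, depth), score in sorted(scores.items()):
    print(f"length={n_words:>5} words  depth={depth:.1f}  score={score}")
print(f"average recall: {mean(scores.values()):.2f}")
```

Plotting the per-cell scores as a grid of context length against depth gives the familiar needle-in-a-haystack heatmap, which makes it easy to see where in the context a model starts to miss.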

The 2 million token challenge

An AI model's context window is like its short-term memory. Google’s Gemini 1.5 Pro has an industry-leading 2 million token context window, roughly equivalent to 1.5 million words or 5,000 pages of text! This is transformative for AI applications that need to understand and respond to lengthy inputs.
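
To make those numbers concrete, here is a quick back-of-the-envelope calculation. It uses only the rough equivalences quoted above; real token counts depend on the tokenizer and the text, so treat the ratios as approximations, and note that `estimate_tokens` is a hypothetical helper, not part of any SDK.

```python
import math

CONTEXT_TOKENS = 2_000_000  # context window quoted above
APPROX_WORDS = 1_500_000    # rough word equivalent quoted above
APPROX_PAGES = 5_000        # rough page equivalent quoted above

words_per_token = APPROX_WORDS / CONTEXT_TOKENS  # 0.75
words_per_page = APPROX_WORDS / APPROX_PAGES     # 300.0
tokens_per_page = CONTEXT_TOKENS / APPROX_PAGES  # 400.0


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for a document, using the ratio above."""
    return math.ceil(word_count / words_per_token)


print(words_per_token, words_per_page, tokens_per_page)
print(estimate_tokens(90_000))  # a 90,000-word book is roughly 120,000 tokens
```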

However, a large context window also presents challenges. More information makes it harder to identify and focus on relevant details. So we use the Needle in a Haystack test to measure recall, and Google's Gemini 1.5 Pro has emerged as a star performer.

Google Gemini 1.5 Pro: The master detective

In Google DeepMind’s research paper, Gemini 1.5 Pro demonstrates near-perfect recall (>99.7%) of specific information ("needle") within a vast context ("haystack") of up to 1 million tokens across text, video, and audio modalities. This exceptional recall persists even with contexts extended to 10 million tokens for text, 9.7 million for audio, and 9.9 million for video. Those larger contexts were part of internal testing; the publicly available Gemini 1.5 Pro supports a 2M token context window (the largest of any model provider today).

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_99.7_recall_across_modalities.max-1400x1400.png

Gemini 1.5 Pro achieves near-perfect “needle” recall (>99.7%) up to 1M tokens of “haystack” in all modalities, i.e., text, video and audio.

Let’s test all the haystacks

The following benchmark data showcases the impressive advancements made with Gemini 1.5 Pro, particularly in handling long-context text, video, and audio. It not only holds its own against the February 2024 1.5 Pro release but also demonstrates significant improvements over its predecessors, 1.0 Pro and 1.0 Ultra.

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_-_win_rates.max-900x900.png

Gemini 1.5 Pro win-rates compared to Gemini 1.5 Pro from the February 2024 release, as well as the Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases.

Let’s dive deeper.

Video Haystack: Gemini 1.5 Pro retrieved “a secret word” from random frames within a 10.5-hour video, with Gemini 1.5 Flash also achieving near-perfect recall (99.8%) for videos up to 2 million tokens. The model even identified a scene from a hand-drawn sketch, showcasing its multimodal capabilities!

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_-_Video_haystack.max-1000x1000.png

When prompted with a 45-minute Buster Keaton movie, “Sherlock Jr.” (1924) (2,674 frames at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At the bottom right, the model identifies a scene in the movie from a hand-drawn sketch.

This has high potential for fields like healthcare to analyze lengthy surgical recordings, sports to analyze game activities and injuries, or content creation to streamline the video editing process.

Audio Haystack: Both Gemini 1.5 Pro and Flash exhibited 100% accuracy in retrieving a secret keyword hidden within an audio signal of up to 107 hours (about four and a half days!). You can imagine this being useful for improving the accuracy of audio transcription and captioning in noisy environments, identifying specific keywords during recorded legal conversations, or sentiment analysis during customer support calls.

Multi-round co-reference resolution (MRCR): The MRCR test throws a curveball at AI models with lengthy, multi-turn conversations, asking them to reproduce specific responses from earlier in the dialogue. It's like asking someone to remember a particular comment from a conversation that happened days ago — a challenging task even for humans. Gemini 1.5 Pro and Flash excelled, maintaining 75% accuracy even when the context window stretched to 1 million tokens! This showcases their ability to reason, disambiguate, and maintain context over extended periods.

This capability has significant real-world implications, particularly in scenarios where AI systems need to interact with users over extended periods, maintaining context and providing accurate responses. Imagine customer service chatbots handling intricate inquiries that require referencing previous interactions and providing consistent and accurate information.

Multiple needles in a haystack: While finding a single needle in a haystack is impressive, Gemini 1.5 also tackles the harder task of finding multiple needles in a haystack. Even when faced with 1 million tokens, Gemini 1.5 Pro maintains a 60% recall rate. This is a clear drop from the near-perfect recall on the single-needle task, but it highlights the model's capacity to handle more complex retrieval scenarios, where multiple pieces of information need to be identified and extracted from a large and potentially noisy dataset.

Comparison to GPT-4: Gemini 1.5 Pro outperforms GPT-4 Turbo in a “multiple needles-in-haystack” task, which requires retrieving 100 unique needles in a single turn. GPT-4 Turbo is limited by its 128k token context length, and its performance on this task "largely oscillates" with longer context lengths, with an average recall of about 50% at its maximum context length. Gemini 1.5 Pro keeps working well beyond that limit: on the single-needle text task it maintained a high recall (>99.7%) up to 1 million tokens and still performed well at 10 million tokens (99.2%).

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_-_gpt_comparison.max-900x900.png

Retrieval performance of the “multiple needles-in-haystack” task, which requires retrieving 100 unique needles in a single turn. When comparing Gemini 1.5 Pro to GPT-4 Turbo, we observe higher recall at shorter context lengths, and a very small decrease in recall towards 1M tokens.
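
Recall on a multiple-needle task is simply the fraction of inserted needles that show up in the model's single reply. Here is a minimal sketch of that scoring, assuming a simple case-insensitive substring match. The function name `multi_needle_recall` and the example needles are illustrative, and published evaluations may use stricter matching rules.

```python
def multi_needle_recall(expected_needles: list[str], reply: str) -> float:
    """Fraction of unique expected needles found in the model's reply."""
    unique_needles = set(expected_needles)
    if not unique_needles:
        raise ValueError("expected_needles must not be empty")
    reply_lower = reply.lower()
    found = {needle for needle in unique_needles if needle.lower() in reply_lower}
    return len(found) / len(unique_needles)


needles = ["alpha-17", "bravo-42", "charlie-08", "delta-99"]
reply = "I found these codes: Alpha-17, bravo-42 and delta-99."
print(multi_needle_recall(needles, reply))  # 3 of 4 needles found -> 0.75
```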

Gemini 1.5 Pro's secret weapon

What makes Gemini 1.5 Pro such a master detective? It's the combination of advanced architecture, multimodal capabilities, and innovative training techniques. It incorporates significant architectural changes: it is a mixture-of-experts (MoE) model built on the Transformer architecture. MoE models use a learned routing function (think of it like a dispatcher in a detective agency) to direct different parts of the input data to specialized components within the model. This allows the model to expand its overall capabilities while only using the necessary resources for a given task.
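
To give a feel for what a routing function does, here is a toy sketch of top-k gating, one common way MoE layers are described in the literature. It is not Gemini's actual routing, whose details are not covered here: the router weights are hard-coded rather than learned, the "experts" are trivial scaling functions, and every name (`route_top_k`, `moe_layer`, and so on) is made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, router_weights, k):
    """Score every expert for this token and keep the k best.
    Returns (expert_index, gate_weight) pairs whose gates sum to 1."""
    logits = [sum(w * x for w, x in zip(weights, token)) for weights in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))


def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, gate in route_top_k(token, router_weights, k):
        expert_output = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output


# Four toy experts; in a real model each would be a feed-forward network.
experts = [
    lambda t: [1.0 * x for x in t],
    lambda t: [2.0 * x for x in t],
    lambda t: [3.0 * x for x in t],
    lambda t: [-1.0 * x for x in t],
]
# One router weight vector per expert (learned in a real model).
router_weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]

token = [1.0, 0.0]
print(route_top_k(token, router_weights, k=2))  # experts 2 and 0 are selected
print(moe_layer(token, router_weights, experts, k=2))
```

Only two of the four experts run for this token, which is the point of the dispatcher analogy: total capacity grows with the number of experts, while the work done per input stays roughly constant.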

The future of AI: finding needles in ever-larger haystacks

The true measure of AI lies not just in its ability to process information, but in its capacity to understand and engage in meaningful conversations. These Needle in a Haystack tests show that Gemini 1.5 Pro and Flash are pushing the boundaries of what's possible and can navigate even the most complex and lengthy dialogues. It's not just about responding; it's about understanding and connecting across modalities, a giant leap towards AI that feels less like a machine and more like a truly intelligent conversational partner.

Try your own Needle in a Haystack test using Gemini 1.5 Pro’s 2M token context window today on Vertex AI.
