
The Prompt: Multimodal AI is proof that a picture is worth a thousand words

April 29, 2025
Logan Kilpatrick

Senior Product Manager, Google DeepMind

When it comes to AI, audio and vision are creating a new UX paradigm. Let's talk about what multimodal AI is and the opportunity for businesses and people.

At Google DeepMind, we’re seeing the full spectrum of innovation – everything from reinforcement learning (RL), applied to competitive effect with AlphaGo, to our work on AlphaFold, which won a Nobel Prize in Chemistry.

We recently launched our most intelligent model yet: Gemini 2.5, which builds on the best of Gemini with native multimodality. Multimodality – inputs and outputs across audio, vision, and text – helps AI perceive and understand the world in a more holistic and human way.

This is an important shift from earlier AI systems. Early language models handled text well, but if you look at them as a proxy for intelligence, the obvious gap was that they couldn’t understand the world the way humans can – through visual or auditory understanding.

Now, multimodal is creating an entirely new UX paradigm – for example, you can already use audio with solutions like NotebookLM. Ultimately, this fusion of inputs and outputs can help you automate complex workflows, generate novel content, and provide natural and robust user experiences. Let's talk about what multimodal AI is, how it's creating this new paradigm, and the opportunity for businesses and people.


What is multimodal AI and why is context important?

Multimodal AI, in the simplest terms, is the fusion of all the input and output modalities that humans are familiar with. The model can take in text, audio, video, and images, and output the same. But the key is context. Context is important because without it, the model can’t do what you’re asking it to do. It’s the primary driver of the quality of the responses you get from the model.

Imagine a simple text prompt for a language model. Every time you start a new session or a new interaction, it’s a clean slate. These are what I call “AI 1.0” systems: they require users to do all of the heavy lifting of giving the model context and putting it into the context window.

Multimodal AI is exciting because of the potential “AI 2.0” applications that use the context we absorb as humans – including audio and vision.
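To make that concrete, here is a minimal sketch of what an “AI 2.0”-style request can look like, assuming the google-genai Python SDK: rather than typing out a long written description, you hand the model the image itself as context alongside a short text prompt. The API key, file name, and model string below are placeholders.

```python
# Minimal sketch of a multimodal request, assuming the google-genai Python SDK.
# The API key, file name, and model string are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# The image itself becomes the context -- no lengthy written description needed.
with open("my_desk.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder Gemini 2.5 model name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Describe what you see and list the items on the desk.",
    ],
)
print(response.text)
```

The picture does the heavy lifting that a long typed description would otherwise have to do – that’s the context shift from AI 1.0 to AI 2.0.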

Audio meets vision: A new UX paradigm

Vision is the most popular input modality today – and if models do well at understanding images, they inherently do well at understanding video, too.

There’s the age-old quote that a picture is worth a thousand words. It matters doubly in the world of multimodal AI. If I looked at my computer right now and tried to describe everything I see, it would take 45 minutes. Or, I could just take a picture. Use cases for vision range from something as simple as object tracking to image detection – for example, a factory watching an assembly line to make sure there are no impurities in the product it’s creating, or analyzing dozens of pictures of your farm to understand crop yields. There’s huge breadth and opportunity in blending these modalities together.
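As a hedged sketch of that farm example – again assuming the google-genai Python SDK, with hypothetical file and model names – you can batch several images into a single request and ask one question across all of them:

```python
# Sketch: analyzing several field photos in one request (hypothetical file names).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Load each photo as an image part so the model can reason across all of them.
image_parts = []
for path in ["field_north.jpg", "field_south.jpg", "field_east.jpg"]:
    with open(path, "rb") as f:
        image_parts.append(
            types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
        )

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder model name
    contents=image_parts
    + ["Compare crop density across these fields and flag any areas of concern."],
)
print(response.text)
```

The same pattern would apply to assembly-line frames: swap in the photos and change the question.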

I can share a recent example. At Google Cloud Next, I showed the audience how you can use Gemini’s multimodal capabilities to renovate a 1970s-era kitchen. I prompted AI Studio to analyze my colleague Paige’s kitchen, giving it text descriptions, floor plans, and images. Gemini suggested cabinets, a design, a color palette, and materials, relying on its native image generation capabilities to bring the ideas to life. Then, to estimate how much this would actually cost, we used Grounding with Google Search to pull in real-world costs for materials, and even local building codes.
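For readers who want to try something similar, here is a minimal sketch of enabling Grounding with Google Search through the google-genai Python SDK; the prompt and model name are illustrative, not the exact ones used in the demo.

```python
# Sketch: a generation request grounded with Google Search (illustrative prompt).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder model name
    contents="Estimate the cost of replacing cabinets and countertops "
             "in a 1970s-era kitchen, using current retail prices.",
    config=types.GenerateContentConfig(
        # Let the model pull in real-world figures via Google Search grounding.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```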

From understanding videos, to native image generation, to grounding responses in real-world information with Google Search – these are things Gemini shines at.

Alongside vision, audio is another new UX paradigm for how people interact with AI. It’s not just typing a prompt into a chatbot, but speaking to models the way we spend much of our time speaking to other humans.

Look at NotebookLM. It’s powered by general-purpose Gemini models under the hood, and that’s what makes the notebook experience possible. Those models also have long context, which means the audio model can do a lot more than what’s showcased in the notebook itself. Take a look at how people are already using NotebookLM – from uploading research papers to creating podcasts with Audio Overviews.
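To give a feel for audio as an input modality, here is a hedged sketch of passing a recording to a Gemini model through the google-genai Python SDK – separate from NotebookLM itself, and with a hypothetical file and model name:

```python
# Sketch: audio understanding with an inline audio file (hypothetical file name).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("team_meeting.mp3", "rb") as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder model name
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
        "Summarize the key decisions and action items from this recording.",
    ],
)
print(response.text)
```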

Multimodal AI is giving businesses a chance to solve more ambitious problems

Audio, vision, and text will give people and businesses the ability to solve the problems they want to solve, with a lower barrier to entry.

This is the opportunity to differentiate. Putting a mic icon in your chatbot isn’t the full potential of this technology – instead, build a deep product experience around these modalities. Now, you can solve more ambitious problems by sending a single API call to a model, instead of building the model yourself and figuring out how to deploy it. It works out of the box, today.

We're headed in the right direction

The future of multimodal is twofold: models taking action in the real world, and stronger infrastructure. Take robots as an example. These models are becoming more and more capable of seeing, understanding, and taking action. There's a lot of work to do to make those models reliable, but that’s the direction we’re headed in.

We’ll also need strong, optimized infrastructure – everything from testing, observability, monitoring, and version control to A/B testing. I'm optimistic about this new AI ecosystem because if you look at every layer of the stack, there's opportunity to innovate. To learn more about how multimodal AI will take form in 2025, download the AI Trends report.
