
Faster food: How Gemini helps restaurants thrive through multimodal visual analysis

December 3, 2024

Sagar Kewalramani

Solutions Architect, Google

Alejandro Ballesta Rosen

Solutions Architect, Google


Businesses across all industries are turning to AI for a clear view of their operations in real time. Whether it's a busy factory floor, a crowded retail space, or a bustling restaurant kitchen, the ability to monitor the work environment helps businesses be more proactive and, ultimately, more efficient.

Gemini 1.5 Pro’s multimodal and long context window capabilities can improve operational efficiency by automating tasks ranging from inventory management to safety assessments. One powerful use case that has emerged for developers is AI-powered kitchen analysis for busy restaurants. It can benefit everyone: it helps a restaurant’s bottom line, trains employees more efficiently, and improves the safety assessments that create a safer work environment.

In this post, we'll show you how this works, and ways you can apply it to your business.

Understanding multimodal AI & long context window:

Before we step into the kitchen, let's break down what "multimodal" and “long context window” mean in the world of AI:

Multimodal AI can process and understand multiple types of data. Think of it as an AI system that can see, hear, read, and understand all at once. In our context, it can take the following forms:

Text: Recipes, orders, and inventory lists
Images: Food presentation and kitchen layouts
Audio: Kitchen commands and customer feedback
Video: Real-time cooking processes and staff movements

Added together, these data representations can reach gigabytes in size, which is where Gemini’s long context window comes into play. A long context window can take in millions of tokens (the units of data a model processes) at once. This makes it possible to input all the data mentioned above, from text to video, and generate cohesive outputs without losing any context.
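To get a feel for the scale involved, here is a minimal Python sketch that estimates how many tokens a video clip might occupy and whether it would fit in a context window. The per-second token rates and the window size are assumptions used for illustration, not figures from this post; check the current Gemini documentation for the model you use before relying on them.

```python
# Rough token budget for a video input.
# Every constant below is an ASSUMPTION for illustration; verify it
# against the current documentation for your model.
VIDEO_TOKENS_PER_SECOND = 258      # assumed: one sampled frame per second
AUDIO_TOKENS_PER_SECOND = 32       # assumed: audio track cost per second
CONTEXT_WINDOW_TOKENS = 2_000_000  # assumed context window size


def estimate_video_tokens(duration_seconds: int, has_audio: bool = True) -> int:
    """Estimate the tokens consumed by a clip of the given length."""
    per_second = VIDEO_TOKENS_PER_SECOND
    if has_audio:
        per_second += AUDIO_TOKENS_PER_SECOND
    return duration_seconds * per_second


def fits_in_context(token_count: int, prompt_tokens: int = 0,
                    window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Return True if the media plus the prompt fit in the window."""
    return token_count + prompt_tokens <= window


clip_tokens = estimate_video_tokens(5 * 60)  # a five-minute clip
print(clip_tokens, fits_in_context(clip_tokens, prompt_tokens=500))
```

Under these assumed rates, a short clip uses only a small fraction of the window, which leaves room for recipes, orders, and inventory lists in the same request.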

With a projected market size of over $13 billion by 2032 and a staggering CAGR of around 30% from 2024 to 2032, multimodal plus long context window capabilities are the secret ingredients for success.

Let’s look at a real-world example

When it comes to running a restaurant, AI can step in as your inventory manager and safety inspector all rolled into one. In the following test, we fed Gemini a five-minute video of a chef preparing meals during peak operating hours.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_e79MC4t.gif

Using a simple prompt, we asked Gemini to analyze the video and return several values that would help us assess the efficiency of meal preparation. First, we asked for the start and end timestamps of each part of the process:

Preparation
Cooking
Plating
Serving

Prompt:

Watch the following video of food being prepared in a kitchen. For each food item being prepared, I want you to analyze the timestamps and provide the start and end times for each of these general cooking stages:

Preparation: This includes any actions done before the food is cooked. Examples: Gathering ingredients, chopping vegetables, mixing sauces, preheating.
Cooking: This involves applying heat to the food using any method. Examples: Frying, baking, grilling, microwaving. It also includes any actions done while the food cooks on the heat source, like flipping or stirring.
Plating: This involves any actions taken after the food is cooked. Examples: Transferring food to a serving dish, adding garnishes, drizzling sauces
Serving: when the cook hands the food to the customer

Output the data in chronological order as a JSON array with the following format: {"steps": [{"step": "Preparation", "start": "xx:xx", "end": "xx:xx"}, {"step": "Cooking", "start": "xx:xx", "end": "xx:xx"}]}
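Once a response in this format comes back, measuring each stage takes little code. The following is a minimal Python sketch, assuming the model returns exactly the JSON shape requested in the prompt with "mm:ss" timestamps. The helper names and the sample response are hypothetical and made up for illustration; they are not output from the test described here.

```python
import json


def to_seconds(timestamp: str) -> int:
    """Convert an "mm:ss" timestamp into a number of seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def stage_durations(response_text: str) -> dict[str, int]:
    """Sum the seconds spent in each stage across all returned steps."""
    data = json.loads(response_text)
    totals: dict[str, int] = {}
    for step in data["steps"]:
        elapsed = to_seconds(step["end"]) - to_seconds(step["start"])
        if elapsed < 0:
            raise ValueError(f"end precedes start in step: {step}")
        totals[step["step"]] = totals.get(step["step"], 0) + elapsed
    return totals


# Hypothetical response text, invented for this example.
sample_response = """
{"steps": [
  {"step": "Preparation", "start": "00:05", "end": "01:10"},
  {"step": "Cooking", "start": "01:10", "end": "03:40"},
  {"step": "Preparation", "start": "03:40", "end": "04:00"}
]}
"""
print(stage_durations(sample_response))
```

In practice a model may wrap the JSON in extra text or code fences, so production code would need to validate the response before parsing it.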

Next, to find bottlenecks and optimize workflows, we asked Gemini to identify the following key moments:

Positive moments
Potential safety issues
Inventory counts
Suggestions for improvement

Together, we put these values in a graph that broke down the efficiency of each task and identified opportunities for improvement. We also asked Gemini to translate the results into several languages for a diverse kitchen staff.
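As a rough illustration of this kind of breakdown, the Python sketch below turns per-stage durations into a text bar chart showing each stage's share of the total time. It is not the graph used in the test; the function name and the sample durations are hypothetical.

```python
def render_stage_chart(durations: dict[str, int], width: int = 40) -> list[str]:
    """Return one text bar per stage, longest stage first."""
    total = sum(durations.values())
    if total == 0:
        return []
    lines = []
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    for stage, seconds in ranked:
        share = seconds / total
        bar = "#" * round(share * width)  # bar length is proportional to share
        lines.append(f"{stage:<12} {bar} {share:.0%} ({seconds}s)")
    return lines


# Hypothetical durations in seconds, invented for this example.
sample_durations = {"Preparation": 85, "Cooking": 150, "Plating": 45, "Serving": 20}
for line in render_stage_chart(sample_durations):
    print(line)
```

Sorting by duration puts the most time-consuming stage at the top, which is usually the first place to look for a bottleneck.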

The final result: Here’s how Gemini analyzed the kitchen

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2ndPrompt_optimized.gif

1. Real-time meal preparation and object tracking:

Gemini's object detection capabilities identified ingredients and monitored cooking processes in real time. By extracting the start and end timestamps for each meal preparation, you can precisely measure meal prep times.

2. Inventory management:

Say goodbye to the "Oops, we're out of that" moment. By accurately tracking ingredient usage, Gemini helped prevent stock-outs and enabled proactive inventory replenishment.
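One way to act on tracked ingredient usage is to compare what remains against a reorder level. The Python sketch below assumes the usage counts have already been extracted from the video analysis. The function, the item names, the quantities, and the thresholds are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def items_to_reorder(stock: dict[str, int],
                     usage_events: list[tuple[str, int]],
                     reorder_levels: dict[str, int]) -> dict[str, int]:
    """Return items whose remaining quantity is at or below the reorder level."""
    used = Counter()
    for item, quantity in usage_events:
        used[item] += quantity
    alerts = {}
    for item, starting_quantity in stock.items():
        remaining = starting_quantity - used[item]
        if remaining <= reorder_levels.get(item, 0):
            alerts[item] = remaining
    return alerts


# Hypothetical kitchen data, invented for this example.
stock = {"burger buns": 40, "tomatoes": 25, "cheese slices": 30}
usage_events = [("burger buns", 12), ("tomatoes", 20),
                ("burger buns", 10), ("cheese slices", 5)]
reorder_levels = {"burger buns": 20, "tomatoes": 10, "cheese slices": 10}

print(items_to_reorder(stock, usage_events, reorder_levels))
```

Usage counts taken from video are estimates, so a real system would likely reconcile them against point-of-sale or delivery records before placing an order.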

3. Safety assessments:

From detecting a slippery floor to noticing an unattended flame, Gemini picked up on those details that are easy to miss. It's not about replacing human vigilance—it's about enhancing it, creating a safer environment for both staff and diners.

4. Multilingual capabilities:

In a global culinary landscape, language barriers can be troublesome. Gemini broke down these barriers, ensuring that whether your chef speaks Mandarin or your server speaks Spanish, everyone's on the same page.

Gemini’s analysis of a five-minute video could help restaurants optimize operations, reduce costs, and enhance the customer experience. With mundane tasks automated and optimized, staff can focus on what matters: creating culinary masterpieces and delivering exceptional service. It also helps businesses grow by cutting costs, because optimized inventory and resource management translate directly to the bottom line.

And proactive hazard detection means fewer accidents and a safer work environment. It's not just about avoiding lawsuits; it's about creating a culture of care.

The future is served

Gemini’s models are pioneers in the market, unlocking use cases that are made possible with Google’s research and advancements. But Gemini's impact extends far beyond the restaurant industry – its long context window allows businesses to analyze vast amounts of data, unlocking insights that were previously too costly to attain.

To do this yourself:

Explore the Gemini Multimodal API documentation to learn about video and image analysis
Start building using a free Google Cloud trial to test Gemini's multimodal features
Master multimodal prompting using the comprehensive guide provided
