AI & Machine Learning

Forrester names Google Cloud a leader in AI Infrastructure

December 14, 2021

Andrew Moore

Vice President & General Manager: Cloud AI & Industry Solutions

Brad Calder

VP & GM, Google Cloud Platform

Forrester Research has named Google Cloud a Leader in The Forrester Wave™: AI Infrastructure, Q4 2021, a report authored by Mike Gualtieri and Tracy Woo. In the report, Forrester evaluated dimensions of AI architecture, training, inference and management against a set of pre-defined criteria. Forrester’s analysis and recognition give customers the confidence they need to make important platform choices that will have lasting business impact.

Google received the highest possible score in 16 Forrester Wave evaluation criteria: architecture design, architecture components, training software, training data, training throughput, training latency, inferencing throughput, inferencing latency, management operations, management external, deployment efficiency, execution roadmap, innovation roadmap, partner ecosystem, commercial model, and number of customers.

We believe that Google's vision to be a unified data and AI solution provider for the end-to-end data science experience is recognized by Forrester, through high scores in the areas of architecture and innovation. We are focused on building the most robust yet cohesive experience to enable our customers to leverage the best of Google every step of the way. Here are four key areas where Google excels, among the many highlighted in this report.

AI Infrastructure: Leverage the building blocks of innovation

When an organization chooses to run its business on Google Cloud, it benefits from innovative infrastructure available globally. Google offers users a rich set of building blocks, such as Deep Learning VMs and containers, the latest GPUs/TPUs and a marketplace of curated ISV offerings, to help them architect their own custom software stack on VMs and/or Google Kubernetes Engine (GKE).

Google provides a range of GPU & TPU accelerators for various use cases, including high performance training, low cost inferencing and large-scale accelerated data processing. Google is the only public cloud provider to offer up to 16 NVIDIA A100 GPUs in a single VM, making it possible to train very large AI models on a single node. Users can start with one NVIDIA A100 GPU and scale to 16 GPUs without configuring multiple VMs for single-node ML training. Google also provides TPU pods for large-scale AI research with PyTorch, TensorFlow, and JAX. The new fourth generation TPU pods deliver exaflop-scale peak performance with leading results in recent MLPerf benchmarks which included a 480 billion parameter language model.

Google Kubernetes Engine provides the most advanced Kubernetes services with unique capabilities like Autopilot, highly automated cluster version upgrades, and cluster backup/restore. GKE is a good choice for a scalable multi-node bespoke platform for training, inference and Kubeflow pipelines, given its support for 15,000 nodes per cluster, auto-provisioning, auto-scaling and various machine types (e.g. CPU, GPU, TPU and on-demand, spot). ML workloads also benefit from GKE’s support for dynamic scheduling, orchestrated maintenance, high availability, job API, customizability, fault tolerance and ML frameworks. When a company's footprint grows to a fleet of GKE clusters, its data teams can leverage Anthos Config Management to enforce consistent configurations and security policy compliance.

Comprehensive MLOps: Build models faster and more easily without skimping on governance

Google’s fully managed Vertex AI platform provides services for ML lifecycle management, from data ingestion and preparation all the way up to model deployment, monitoring, and management. Vertex AI requires nearly 80% fewer lines of code to train a model versus competitive platforms1, enabling data scientists and ML engineers across all levels of expertise to implement Machine Learning Operations (MLOps) so they can efficiently build and manage ML projects throughout the entire development lifecycle.

Vertex AI Workbench provides data scientists with a single environment for the entire data-to-ML workflow, enabling them to build and train models 5x faster than with traditional notebooks. This is enabled by integrations across data services (like Dataproc, BigQuery, Dataplex, and Looker), which significantly reduce context switching. Users are also able to access NVIDIA GPUs, modify hardware on the fly, and set up idle shutdown to optimize infrastructure costs.

Organizations can then build and deploy models from any framework (including TensorFlow, PyTorch, scikit-learn, or XGBoost) with Vertex AI, with built-in tooling to track a model’s performance. Vertex Training also provides various approaches for developing large models, including Reduction Server, which optimizes the bandwidth and latency of multi-node distributed training on NVIDIA GPUs for synchronous data parallel algorithms. Vertex AI Prediction is serverless and automatically provisions and deprovisions nodes behind the scenes to provide low-latency online predictions. It can also split traffic between multiple models behind an endpoint. Models trained in Vertex AI can also be exported for deployment in private or other public clouds.
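To make the idea of splitting traffic between models behind one endpoint more concrete, here is a minimal sketch in plain Python. It is an illustration only and does not use or represent the Vertex AI API: the function names (validate_split, route_request, simulate) and the model IDs are hypothetical, and the assumption that shares are expressed as percentages summing to 100 is ours.

```python
import random
from collections import Counter


def validate_split(traffic_split):
    """Check that the shares are non-negative and sum to 100 percent."""
    if any(share < 0 for share in traffic_split.values()):
        raise ValueError("traffic shares must be non-negative")
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic shares must sum to 100")


def route_request(traffic_split, rng):
    """Pick one model ID with probability proportional to its share."""
    validate_split(traffic_split)
    model_ids = list(traffic_split)
    weights = [traffic_split[model_id] for model_id in model_ids]
    return rng.choices(model_ids, weights=weights, k=1)[0]


def simulate(traffic_split, n_requests, seed=0):
    """Route n_requests and count how many each model received."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return Counter(route_request(traffic_split, rng) for _ in range(n_requests))


if __name__ == "__main__":
    # Hypothetical canary rollout: most traffic stays on the current model.
    counts = simulate({"model-current": 90, "model-candidate": 10}, 10_000)
    for model_id, count in counts.items():
        print(model_id, count)
```

A split like this is typically used for canary rollouts or A/B comparisons: a small share goes to the candidate model, and the share is raised as confidence in it grows.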

Google’s strengths in its current offering are in architecture, training, data throughput, and latency. Its sweet spot is in its product offering, Vertex AI, which has core AI compute capabilities and MLOps services for end-to-end AI lifecycle management.

The Forrester Wave™: AI Infrastructure, Q4 2021

In addition to building models, it is important to deploy tools for governance, security, and auditability. These tools are crucial for compliance in regulated industries, and they help teams to protect data, understand why given models fail, and determine how models can be improved.

For orchestration and auditability, Vertex Pipelines and Vertex ML Metadata track the inputs and outputs of an ML pipeline and the lineage of artifacts. Once models are in production, Vertex AI Model Monitoring detects feature skew and drift and alerts data scientists. These capabilities speed up debugging and create the visibility required for regulatory compliance and good data hygiene in general.

For explainability, Vertex Explainable AI helps teams understand their model's outputs for classification and regression tasks. It reports how much each feature in the data contributed to the predicted result. Data teams can then use this information to verify that the model is behaving as expected, recognize bias in the model, and get ideas for ways to improve the model and training data.
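As a rough illustration of what "how much each feature contributed to the predicted result" can mean, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal contribution over every ordering in which features are switched from a baseline to the actual instance. This is a generic textbook technique, not a description of how Vertex Explainable AI computes attributions internally. The function name shapley_attributions, the toy model, and the all-zero baseline are assumptions made for the example, and the exhaustive enumeration is only practical for a handful of features.

```python
from itertools import permutations


def shapley_attributions(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    predict  -- function taking a dict of feature values, returning a number
    instance -- dict of feature values being explained
    baseline -- dict of reference values with the same keys
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its real value
            updated = predict(current)
            totals[name] += updated - previous  # marginal contribution
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(features):
    """Hypothetical model with an interaction term between x and y."""
    return 2 * features["x"] + 3 * features["y"] + features["x"] * features["y"]


if __name__ == "__main__":
    instance = {"x": 1.0, "y": 2.0}
    baseline = {"x": 0.0, "y": 0.0}
    print(shapley_attributions(toy_model, instance, baseline))
```

The attributions always sum to the difference between the prediction for the instance and the prediction for the baseline, which is what makes them useful for checking whether a model leans on the features a team expects.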

These services together aim to simplify MLOps for data scientists and ML engineers, so that businesses can accelerate time to value for ML initiatives.

Security: Protect data while keeping ML pipelines flowing

The Google stack builds security through progressive layers that deliver defense in depth. To accomplish data protection, authentication, authorization and non-repudiation, we have measures such as boot-level signature and chain-of-trust validation.

Ubiquitous data encryption delivers unified control over data at rest, in use, and in transit, with keys that are held by customers themselves.

We offer options to run in fully encrypted confidential environments utilizing managed Hadoop or Spark with Confidential Dataproc or Confidential VMs.

Partner Ecosystem: Work with world-class AI specialists

Google works with certified partners globally to help our customers design, implement and manage complex AI systems. We have a growing list of partners with Machine Learning specializations on Google who have demonstrated customer success across industries, including deep partnerships with the largest Global System Integrators. The Google Cloud Marketplace also provides a list of technology partners who allow enterprises to deploy machine learning applications on Google’s AI infrastructure.

Our dedication to being your partner of choice for ML needs

Leading organizations like OTOY, Allen Institute for AI and DeepMind (an Alphabet subsidiary) choose Google for ML, and enterprises like Twitter, Wayfair and The Home Depot shared more about their partnership with Google in their recent sessions at Google Next 2021.

Establishing well-tuned and appropriately managed ML systems has historically been challenging, even for highly skilled data scientists with sophisticated systems. With the key pillars of Google’s investments above, organizations can build, deploy, and scale ML models faster, with pre-trained and custom tooling, within a unified AI platform.

We look forward to continuing to innovate and to helping customers on their digital transformation journey. To download the full report, click here. Get started on Vertex AI, learn what's upcoming with infrastructure for AI and ML at Google here, and talk with our sales team.