
Google Cloud’s innovation-first infrastructure

July 13, 2022

Sachin Gupta

Vice President & GM, Infrastructure, Google Cloud

Organizations are driving the complete transformation of their businesses by inventing new ways to accomplish their objectives using the cloud: making core processes more efficient, improving how they reach and serve their customers, and drawing insights from data that fuel innovation.

Cloud infrastructure belongs at the center of every organization’s transformation strategy. We see a vast landscape of opportunity to innovate in our cloud’s core capabilities in ways that will have a long-lasting impact on the speed and simplicity of building solutions on Google Cloud. From data management and machine learning to security and sustainability, we continue to invest deeply in infrastructure innovation that generates value from the foundation upward. We focus on three defining attributes of our infrastructure that help our customers accelerate through innovation:

Optimized: Customers want solutions that meet their specific needs. They want to build and run apps where they need them, tailored for popular workloads, industry solutions, and specific outcomes, whether that is high performance, cost savings, or a balance of both. Their workloads should just run better on Google Cloud.
Transformative: Transformation is more than “lifting and shifting” infrastructure to the cloud for cost saving and convenience. Transformative infrastructure integrates the best of Google’s AI and ML capabilities to drive faster innovation, while meeting the most stringent security, sovereignty, and compliance needs.
Easy: As cloud platforms become more versatile, they can become very complex to adopt and operate. Reducing your operational burden is possible with an easy-to-use cloud platform. Our customers often tell us that Google Cloud makes complex tasks seem simple, and this is a product of intentional engineering.

Google’s 20+ years of technology leadership are built on a culture of innovation and a focus on our customers. Here are some examples of the new innovations we are bringing in these areas.

Solutions that are optimized for what matters most to you

Let’s start with optimizing for price-performance. Last year, we launched Tau VMs, optimized for cost-effective performance of scale-out workloads. Tau T2D leapfrogged every leading public cloud provider in both performance and total cost of ownership, delivering up to 42% better price-performance than comparable VMs from any other leading cloud.

Today, we are delighted to announce that we are offering more choice to customers, with the addition of Arm-based machines to the Tau VM family. Powered by Ampere® Altra® Arm-based processors, T2A VMs deliver exceptional single-threaded performance at a compelling price, making them ideal for scale-out, cloud-native workloads. Developers now have the option of choosing the optimal architecture to test, develop and run their workloads.

Cost optimization is a major goal for many of our customers. Spot VMs enable you to take advantage of our idle machine cycles at deep discounts, with a guaranteed 60% and up to 91% off on-demand pricing. They are a perfect choice for batch jobs and fault-tolerant workloads in high performance computing, big data, and analytics. Customers told us that they would like to see less variability and more predictability in the pricing of Spot VMs. We have heard you loud and clear. Our Spot VMs offer the least variability (prices change once per month) and more predictability in pricing compared to other leading clouds.
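To make the discount range concrete, here is a minimal Python sketch of the arithmetic. It assumes only the two figures quoted above (at least 60% and up to 91% off on-demand pricing). The function name `spot_cost_range`, the $1.00 hourly rate, and the 100-hour job are hypothetical values chosen for illustration, not published prices.

```python
def spot_cost_range(on_demand_hourly, hours, min_discount=0.60, max_discount=0.91):
    """Return (best_case, worst_case) cost of a job run on Spot VMs.

    min_discount and max_discount are the 60% and 91% figures quoted in the
    text; on_demand_hourly is whatever on-demand rate applies to your VM
    (a hypothetical input here, not a real price).
    """
    if on_demand_hourly < 0 or hours < 0:
        raise ValueError("rate and hours must be non-negative")
    on_demand_total = on_demand_hourly * hours
    # Worst case: only the guaranteed minimum discount applies.
    worst_case = on_demand_total * (1 - min_discount)
    # Best case: the maximum quoted discount applies.
    best_case = on_demand_total * (1 - max_discount)
    return best_case, worst_case


if __name__ == "__main__":
    # Hypothetical job: 100 VM-hours at an assumed $1.00/hour on-demand rate.
    best, worst = spot_cost_range(on_demand_hourly=1.00, hours=100)
    print(f"On-demand: $100.00, Spot: between ${best:.2f} and ${worst:.2f}")
```

Under these assumed inputs, a job that would cost $100 on demand would cost between roughly $9 and $40 on Spot VMs. Because Spot VMs can be preempted, an estimate like this only fits fault-tolerant workloads that can resume after interruption.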

Optimizing for global scale is critical to meet the high demands of today's consumers, especially when it comes to video streaming. Launched in May 2022, Media CDN is optimized to deliver an immersive video streaming experience at global scale. Available in more than 1,300 cities, Media CDN leverages the same infrastructure that YouTube uses to deliver content to over 2 billion users around the world. Customers including U-NEXT and Stan have quickly rolled out Media CDN to deliver a modern, high-quality experience to their viewers.

Another emerging opportunity is the rise of distributed systems and distributed workers, and the ability to build and run apps wherever needed. With Google Distributed Cloud, we now extend Google Cloud infrastructure and services to different physical locations (or distributed environments), including on-premises or co-location data centers and a variety of edge environments. Anthos powers all Google Distributed Cloud offerings to deliver a common control plane for building, deploying, and running your modern containerized applications at scale, wherever you choose.

For greater choice, we have designed Google Distributed Cloud as a portfolio of hardware, software, and services with multiple offerings to address the specific requirements of your workloads and use cases. You can choose from our Edge, Virtual, and Hosted offerings to meet the needs of your business.

Driving transformation through AI/ML and security

The pace of innovation in the field of machine learning continues to accelerate, and Google has been a longtime pioneer. From Search and YouTube to Play and Maps, ML has helped bring out the best that our products have to offer. We've made it a point to make the best of Google available to our customers, and JAX and Cloud TPU v4 are two great examples.

JAX is a cutting-edge, open source ML framework developed by Google researchers. It's designed to give ML practitioners more flexibility and to let them scale their models more easily, up to the largest sizes.

We recently made Cloud TPU v4 pods available to all our customers through our new ML hub. This cluster of Cloud TPU v4 pods offers 9 exaflops of peak aggregate performance and runs on 90% carbon-free energy, making it one of the fastest, most efficient, and most sustainable ML infrastructure hubs in the world. Cloud TPU v4 has enabled researchers to train a variety of sophisticated models, including natural language processing models and recommender models, to name a few. Customers are already seeing the benefits, including Cohere, which saw a 70% improvement in training times, and LG Research, which used Cloud TPU v4 to train its large multimodal model with 300 billion parameters.

On the security front, increasing cybersecurity threats have every company rethinking its security posture. Our investments in a planet-scale network that's secure, performant, and reliable are matched by our lead in defining industry-wide frameworks and standards to help customers better secure their software supply chain. Last year, Google introduced SLSA (Supply-chain Levels for Software Artifacts), an end-to-end framework for ensuring the integrity of artifacts throughout the software supply chain. It is an open source equivalent of many of the processes we have been implementing internally at Google.

We challenge ourselves to enable security without complex configuration or performance degradation. One example is our Confidential VMs, where data is stored in a trusted execution environment; outside of it, the data and the operations performed on it cannot be viewed, even with a debugger. Another is Cloud Intrusion Detection System (Cloud IDS), which provides network threat detection built on ML-powered threat analysis that processes over 15 trillion transactions per day to identify new threats, with 4.3 million unique security updates made each day. With the highest possible rating of AAA from CyberRatings.org, Cloud IDS has proven efficacy in blocking virtually all evasions.

Developer-first ease of use

Our priority is making your transformation journey simpler, with easy-to-use tools that accelerate your innovation. Today, we are introducing Batch in preview, a fully managed job scheduler that helps customers run thousands of batch jobs with just a single command. It's easy to set up and supports throughput-oriented workloads, including those requiring MPI libraries. Jobs run on auto-scalable resources, giving you more time to work on the greatest areas of value. This improves the developer experience for executing HPC, AI/ML, and data processing workloads such as genomics sequencing, media rendering, financial risk modeling, and electronic design automation.

Continuing our innovation for greater ease, we recently announced the availability of the new HPC Toolkit. This is an open source tool from Google Cloud that enables you to easily create repeatable, turnkey HPC clusters based on proven best practices, in minutes. It comes with several blueprints and broad support for third-party components such as the Slurm scheduler and the Intel DAOS and DDN Lustre storage systems.

System performance and awareness of what your infrastructure is doing are closely tied to security, but gaining that awareness needs to be easy. We recently introduced Network Analyzer to help customers transform reactive workflows into proactive processes and reduce network and service downtime by automatically monitoring VPC network configurations. Network Analyzer is part of our Network Intelligence Center, which provides a single console for Google Cloud network observability, monitoring, and troubleshooting.

This is just a sample of what we are doing in Google Cloud to provide infrastructure that gives customers the freedom to securely innovate and scale from on-premises, to edge, to cloud on an easy, transformative, and optimized platform. To learn more about how customers such as Broadcom and Snap are using Google Cloud’s flexible infrastructure to solve their biggest challenges, be sure to watch our Infrastructure Spotlight event, aired today.
