This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for AI and machine learning (ML) in Google Cloud. You learn about some of the key AI and ML services, and how they can help during the various stages of the AI and ML lifecycle. These best practices help you to meet your AI and ML needs and create your system design. This document assumes that you're familiar with basic AI and ML concepts.
To simplify the development process and minimize overhead when you build ML models on Google Cloud, consider the highest level of abstraction that makes sense for your use case. Level of abstraction is defined as the amount of complexity by which a system is viewed or programmed. The higher the level of abstraction, the less detail is available to the viewer.
To select Google AI and ML services based on your business needs, use the following table:
Persona | Google services |
---|---|
Business users | Standard solutions such as Contact Center AI Insights, Document AI, Discovery AI, and Cloud Healthcare API. |
Developers with minimal ML experience | Pretrained APIs address common perceptual tasks such as vision, video, and natural language. These APIs are backed by pretrained models and provide default detectors. They are ready to use without any ML expertise or model development effort. Pretrained APIs include: Vision API, Video API, Natural Language API, Speech-to-Text API, Text-to-Speech API, and Cloud Translation API. |
Developers building generative AI applications | Vertex AI Agent Builder lets developers use its out-of-the-box capabilities to build and deploy chatbots in minutes and search engines in hours. Developers who want to combine multiple capabilities into enterprise workflows can use the Gen App Builder API for direct integration. |
Developers and data scientists | AutoML enables custom model development with your own image, video, text, or tabular data. AutoML accelerates model development with automatic search through the Google model zoo for the most performant model architecture, so you don't need to build the model. AutoML handles common tasks for you, such as choosing a model architecture, hyperparameter tuning, and provisioning machines for training and serving. |
Data scientists and ML engineers | Vertex AI custom model tooling lets you train and serve custom models and operationalize the ML workflow. You can also run your ML workload on self-managed compute such as Compute Engine VMs. |
Data scientists and ML engineers | Generative AI support on Vertex AI (also known as genai) provides access to Google's large generative AI models so you can test, tune, and deploy the models in your AI-powered applications. |
Data engineers, data scientists, and data analysts familiar with SQL interfaces | BigQuery ML lets you develop SQL-based models on top of data that's stored in BigQuery. |
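To make the BigQuery ML row in the preceding table more concrete, the following is a minimal sketch of what a SQL-based model definition can look like. It only builds the statement text and doesn't contact BigQuery. The helper function and the dataset, model, table, and column names are hypothetical and chosen for illustration.

```python
def build_create_model_sql(model: str, source_table: str, label_column: str,
                           model_type: str = "logistic_reg") -> str:
    """Build a BigQuery ML CREATE MODEL statement that trains on a whole table."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label_column}'])\n"
        f"AS SELECT * FROM `{source_table}`"
    )


# Hypothetical dataset, model, table, and column names.
sql = build_create_model_sql(
    model="my_dataset.churn_model",
    source_table="my_dataset.customers",
    label_column="churned",
)
print(sql)
```

You could run the resulting statement in the BigQuery console or submit it through a client library. Because the sketch interpolates names directly into the SQL text, it assumes that those names come from trusted configuration and not from user input.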
Key services
The following table provides a high-level overview of AI and ML services:
Google service | Description |
---|---|
Cloud Storage and BigQuery | Provide flexible storage options for machine learning data and artifacts. |
BigQuery ML | Lets you build machine learning models directly from data housed inside BigQuery. |
Pub/Sub, Dataflow, Cloud Data Fusion, and Dataproc | Support batch and real-time data ingestion and processing. For more information, see Data Analytics. |
Vertex AI | Offers data scientists and machine learning engineers a single platform to create, train, test, monitor, tune, and deploy ML models for everything from generative AI to MLOps. Tooling includes the following: |
Vertex AI Agent Builder | Lets you build chatbots and search engines for websites and for use across enterprise data. |
Generative AI on Vertex AI | Gives you access to Google's large generative AI models so you can test, tune, and deploy them for use in your AI-powered applications. Generative AI on Vertex AI is also known as genai. |
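Pretrained APIs | Address common perceptual tasks such as vision, video, and natural language. They are ready to use without any ML expertise or model development effort. |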
AutoML | Provides custom model tooling to build, deploy, and scale ML models. Developers can upload their own data and use the applicable AutoML service to build a custom model. |
AI infrastructure | Lets you use AI accelerators to process large-scale ML workloads. These accelerators let you train and get inference from deep learning models and from machine learning models in a cost-effective way. GPUs can help with cost-effective inference and scale-up or scale-out training for deep learning models. Tensor Processing Units (TPUs) are custom-built ASICs to train and execute deep neural networks. |
Dialogflow | Delivers virtual agents that provide a conversational experience. |
Contact Center AI | Delivers an automated, insights-rich contact-center experience with Agent Assist functionality for human agents. |
Document AI | Provides document understanding at scale for documents in general, and for specific document types like lending-related and procurement-related documents. |
Lending DocAI | Automates mortgage document processing. Reduces processing time and streamlines data capture while supporting regulatory and compliance requirements. |
Procurement DocAI | Automates procurement data capture at scale by turning unstructured documents (like invoices and receipts) into structured data to increase operational efficiency, improve customer experience, and inform decision-making. |
Recommendations | Delivers personalized product recommendations. |
Healthcare Natural Language AI | Lets you review and analyze medical documents. |
Media Translation API | Enables real-time speech translation from audio data. |
Data processing
Apply the following data processing best practices to your own environment.
Ensure that your data meets ML requirements
The data that you use for ML should meet certain basic requirements, regardless of data type. These requirements include the data's ability to predict the target, consistency in granularity between the data used for training and the data used for prediction, and accurately labeled data for training. Your data should also be sufficient in volume. For more information, see Data processing.
Store tabular data in BigQuery
If you use tabular data, consider storing all data in BigQuery and using the BigQuery Storage API to read data from it. To simplify interaction with the API, use one of the following additional tooling options, depending on where you want to read the data:
- If you use Dataflow, use the BigQuery I/O Connector.
- If you use TensorFlow or Keras, use the tf.data.Dataset reader for BigQuery.
If you use unstructured data such as images or videos, consider storing it in Cloud Storage instead.
The input data type also determines the available model development tooling. Pretrained APIs, AutoML, and BigQuery ML can provide more cost-effective and time-efficient development environments for certain image, video, text, and structured data use cases.
Ensure you have enough data to develop an ML model
To develop a useful ML model, you need to have enough data. To predict a category, the recommended number of examples for each category is 10 times the number of features. The more categories you want to predict, the more data you need. Imbalanced datasets require even more data. If you don't have enough labeled data available, consider semi-supervised learning.
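The rule of thumb above can be worked through as simple arithmetic. The following is a minimal sketch, not a definitive sizing method: the factor of 10 comes from the guideline in the text, while the function names, the feature count, and the label counts are hypothetical.

```python
from __future__ import annotations


def min_examples_per_category(num_features: int, factor: int = 10) -> int:
    """Rule-of-thumb minimum examples per category: factor x number of features."""
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    return factor * num_features


def underrepresented_categories(counts: dict[str, int], num_features: int) -> dict[str, int]:
    """Return how many more examples each category needs to reach the guideline."""
    target = min_examples_per_category(num_features)
    return {name: target - n for name, n in counts.items() if n < target}


# Hypothetical dataset: 20 features, three categories with uneven label counts.
label_counts = {"approved": 1200, "rejected": 240, "needs_review": 35}
print(min_examples_per_category(20))                  # 200
print(underrepresented_categories(label_counts, 20))  # {'needs_review': 165}
```

In this hypothetical case, only the needs_review category falls short of the guideline, which is the kind of imbalance that calls for more labeled data or a semi-supervised approach.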
Dataset size also has training and serving implications. If you have a small dataset, you can train the model directly within a Notebooks instance. If you have larger datasets that require distributed training, use the Vertex AI custom training service. If you want Google to train the model on your data, use AutoML.
Prepare data for consumption
Well-prepared data can accelerate model development. When you configure your data pipeline, make sure that it can process both batch and stream data so that you get consistent results from both types of data.
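One way to get consistent results from batch and stream data is to keep the transformation logic in a single function that both paths call. The following is a minimal sketch in plain Python under that assumption. The record fields, the threshold, and the function names are hypothetical, and a production pipeline would typically run this logic inside a processing framework instead of plain lists and generators.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]
Features = Dict[str, float]


def to_features(record: Record) -> Features:
    """Single source of truth for the transformation (hypothetical fields)."""
    amount = float(record["amount"])
    return {
        "amount": amount,
        "is_large": 1.0 if amount >= 100.0 else 0.0,
    }


def process_batch(records: List[Record]) -> List[Features]:
    """Batch path: transform a bounded collection all at once."""
    return [to_features(r) for r in records]


def process_stream(records: Iterable[Record]) -> Iterator[Features]:
    """Streaming path: transform records one at a time as they arrive."""
    for r in records:
        yield to_features(r)


events = [{"amount": "42.50"}, {"amount": "250"}]
batch_out = process_batch(events)
stream_out = list(process_stream(iter(events)))
print(batch_out == stream_out)  # True
```

Because both paths delegate to the same function, a change to the feature logic applies to training data and to live prediction data at the same time.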
Model development and training
Apply the following model development and training best practices to your own environment.
Choose managed or custom-trained model development
When you build your model, consider the highest level of abstraction possible. Use AutoML when possible so that the development and training tasks are handled for you. For custom-trained models, choose managed options for scalability and flexibility, instead of self-managed options. To learn more about model development options, see Use recommended tools and products.
Consider the Vertex AI training service instead of self-managed training on Compute Engine VMs or Deep Learning VM containers. For a JupyterLab environment, consider Vertex AI Workbench, which provides both managed and user-managed JupyterLab environments. For more information, see Machine learning development and Operationalized training.
Use pre-built or custom containers for custom-trained models
For custom-trained models on Vertex AI, you can use pre-built or custom containers depending on your machine learning framework and framework version. Pre-built containers are available for Python training applications that are created for specific TensorFlow, scikit-learn, PyTorch, and XGBoost versions.
Otherwise, you can choose to build a custom container for your training job. For example, use a custom container if you want to train your model using a Python ML framework that isn't available in a pre-built container, or if you want to train using a programming language other than Python. In your custom container, pre-install your training application and all its dependencies onto an image that runs your training job.
Consider distributed training requirements
Consider your distributed training requirements. Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines. These frameworks automatically coordinate division of work based on environment variables that are set on each machine. Other frameworks might require additional customization.
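As an illustration of coordination through environment variables, TensorFlow reads a JSON cluster description from the TF_CONFIG variable on each machine. The following is a minimal sketch that only parses such a value to find a machine's role. The helper name and host names are hypothetical, the single-machine fallback is an assumption of this sketch, and the training code itself is omitted.

```python
import json
import os
from typing import Tuple


def describe_task(tf_config_json: str) -> Tuple[str, int, int]:
    """Return (task_type, task_index, total_machines) from a TF_CONFIG-style JSON string."""
    config = json.loads(tf_config_json) if tf_config_json else {}
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    total = sum(len(hosts) for hosts in cluster.values())
    # With no cluster description, treat the process as a single-machine job.
    return task.get("type", "chief"), int(task.get("index", 0)), max(total, 1)


# Hypothetical value for the second worker in a three-machine cluster.
example = json.dumps({
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

print(describe_task(example))  # ('worker', 1, 3)
# On a real training machine, the value comes from the environment.
print(describe_task(os.environ.get("TF_CONFIG", "")))
```

The same training code can run unchanged on every machine because only the environment value differs between them.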
What's next
For more information about AI and machine learning, see the following:
- Best practices for implementing machine learning on Google Cloud.
- Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning.
Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.