Model development and data labeling with Google Cloud and Labelbox

Last reviewed 2024-03-21 UTC

Authors: David Mok, Robert Wood - Labelbox | Akash Gupta - Google

This document provides a reference architecture for building a standardized pipeline with Google Cloud and Labelbox. This architecture can help you to develop your ML models more quickly, particularly models for computer vision, natural language processing (NLP), and generative AI use cases. This document is intended for machine learning (ML) engineers and data scientists who want to incorporate automation and a human-in-the-loop (HITL) approach to data labeling and data curation.

Architecture

The following diagram shows the architecture for the standardized pipeline that you create with Google Cloud and Labelbox:

[Diagram: standardized pipeline with Labelbox and Google Cloud. The components are described in the following text.]

The architecture has the following arrangement:

  • Unstructured data is stored in Cloud Storage. That data, along with any associated metadata, is stored in a BigQuery table.
  • You connect Labelbox to Google Cloud using Identity and Access Management (IAM) delegated access for Cloud Storage and the Labelbox BigQuery Connector. You can then use Labelbox Catalog as a visual, no-code tool for exploring, organizing, and structuring training datasets.
  • Using data that Labelbox has structured and prepared, your organization can train models in Vertex AI. You can also use Labelbox Model to diagnose model performance by taking a data-focused approach.
  • After model errors and data quality issues have been identified, your organization can use Labelbox Catalog to find and target similar data that is unlabeled or incorrectly labeled, and then build and refine the next iteration of its model's dataset.
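The first step of this arrangement, pairing each Cloud Storage object with its metadata as table rows, can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library; the `DataRow` class, the helper names, the bucket name, and the metadata fields are all hypothetical and aren't part of Labelbox or Google Cloud. It produces newline-delimited JSON, which is one of the formats that BigQuery can load.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DataRow:
    """One unstructured asset plus the metadata stored alongside it (hypothetical type)."""
    uri: str                                   # gs:// location of the object
    metadata: dict = field(default_factory=dict)


def build_rows(bucket: str, objects: dict) -> list:
    """Pair every object name with its metadata under a gs:// URI, sorted by name."""
    return [
        DataRow(uri=f"gs://{bucket}/{name}", metadata=dict(meta))
        for name, meta in sorted(objects.items())
    ]


def to_json_lines(rows: list) -> str:
    """Serialize rows as newline-delimited JSON, one record per line."""
    return "\n".join(
        json.dumps({"uri": r.uri, **r.metadata}, sort_keys=True) for r in rows
    )


rows = build_rows(
    "example-bucket",  # hypothetical bucket name
    {
        "images/cat_001.jpg": {"category": "pets", "label_status": "unlabeled"},
        "images/mug_017.jpg": {"category": "kitchen", "label_status": "labeled"},
    },
)
manifest = to_json_lines(rows)
print(manifest)
```

Keeping the object URI and its metadata in the same record is what makes the later steps possible: a search tool can filter on the metadata columns and still resolve each hit back to the underlying asset.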

The preceding diagram includes the following Labelbox components:

  • Catalog: An unstructured data search platform that provides large-scale data discovery and organization. Catalog is integrated with Cloud Storage and BigQuery. This integration lets analysts more effectively search and organize BigQuery tables when using unstructured data.
  • Annotate: This component provides model-assisted labeling combined with enterprise review and workforce management workflows to create the ground truth for model training. Labelbox can use the models that are available through Vertex AI for assisted labeling.
  • Model: A model diagnostics platform that lets you search and filter each version of your dataset based on metrics or metadata to identify problem cases. This component also lets you observe and evaluate Vertex AI models and create an active learning loop.
  • Boost: This component provides managed access to various types of specialized labeling solutions, including specialized engagements and human labeling services.
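To make the idea of model-assisted labeling with human review more concrete, the following sketch splits model predictions into pre-labels and a human review queue. It is an illustration under stated assumptions, not how Labelbox Annotate is implemented: the `Prediction` type, the `route_predictions` function, and the 0.8 threshold are all hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    asset_id: str
    label: str
    confidence: float  # model score in [0, 1]


def route_predictions(predictions, review_threshold=0.8):
    """Split model predictions into pre-labels and items for human review.

    Assumption: predictions at or above the threshold are accepted as
    pre-labels, and everything else is queued for a human labeler.
    """
    pre_labels, needs_review = [], []
    for p in predictions:
        target = pre_labels if p.confidence >= review_threshold else needs_review
        target.append(p)
    # Least confident items first so reviewers see the hardest cases early.
    needs_review.sort(key=lambda p: p.confidence)
    return pre_labels, needs_review


predictions = [
    Prediction("a", "cat", 0.95),
    Prediction("b", "dog", 0.40),
    Prediction("c", "cat", 0.70),
]
pre_labels, needs_review = route_predictions(predictions)
print([p.asset_id for p in pre_labels], [p.asset_id for p in needs_review])
```

In practice the threshold is a trade-off: a higher value sends more items to human review and raises labeling cost, while a lower value accepts more model output unchecked.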

Products used

This reference architecture uses the following Google Cloud and third-party products:

  • Labelbox: An end-to-end AI platform that you can use to create and manage high-quality training data.
  • Cloud Storage: An enterprise-ready service that provides low-cost, limitless object storage for diverse data types. Data is accessible from both inside and outside of Google Cloud and is replicated across multiple geographic locations.
  • BigQuery: A fully managed, highly scalable data warehouse with built-in ML capabilities.
  • Vertex AI: A service for managing the AI and ML development lifecycle.

Use cases

This section describes some example use cases for which a standardized pipeline with Google Cloud and Labelbox is an appropriate choice.

Online retail companies, for example Etsy, use Labelbox to build personalization and recommendation models for their ecommerce product listings. Each listing is represented by its hero image and associated categorical and customer interaction metadata. Business users can search through listings using structured (metadata search) and unstructured (vector search) methods.
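As a rough illustration of how structured and unstructured search can be combined, the following sketch first filters listings by metadata and then ranks the remaining listings by cosine similarity to a query embedding. The listing records, the three-dimensional embeddings, and the `hybrid_search` function are hypothetical; a production search service would use learned embeddings and an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(listings, query_embedding, filters, top_k=2):
    """Filter by metadata, then rank the survivors by embedding similarity."""
    candidates = [
        item for item in listings
        if all(item["metadata"].get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(item["embedding"], query_embedding),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]


# Hypothetical listings with toy embeddings.
listings = [
    {"id": "mug-1", "metadata": {"category": "kitchen"}, "embedding": [0.9, 0.1, 0.0]},
    {"id": "mug-2", "metadata": {"category": "kitchen"}, "embedding": [0.7, 0.7, 0.0]},
    {"id": "ring-1", "metadata": {"category": "jewelry"}, "embedding": [1.0, 0.0, 0.0]},
]
results = hybrid_search(listings, [1.0, 0.0, 0.0], {"category": "kitchen"})
print(results)
```

Applying the metadata filter before the similarity ranking keeps the exact constraints (such as category) strict, and uses the vector score only to order the items that already satisfy them.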

Content media and entertainment companies are also building personalization and recommendation models, using high volumes of video and image content. These companies can effectively develop content personalization and recommendation algorithms by quickly organizing and structuring their content to train models. By using the Labelbox search interface, content specialists can build ML datasets without spending weeks or months sifting through content.

Customers and organizations can achieve the following by using Labelbox with BigQuery and Vertex AI:

  • Effectively structure their data through Labelbox Model Assisted Labeling and Enterprise Workflows to prepare datasets for training models on Vertex AI.
  • Improve the performance of a trained Vertex AI model through data-centric error analysis in Labelbox. This analysis helps teams to pinpoint problematic edge cases, focus on issues, and label data more effectively to build the next version of their training dataset.
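Data-centric error analysis can be pictured as grouping a model's mistakes by a metadata field to see where errors concentrate. The following sketch is a simplified, assumption-laden illustration: the `error_slices` function, the example records, and the `lighting` field are hypothetical and don't represent the Labelbox Model API.

```python
from collections import Counter


def error_slices(examples, slice_key):
    """Return (slice value, error rate) pairs, worst slice first."""
    totals, errors = Counter(), Counter()
    for ex in examples:
        group = ex[slice_key]
        totals[group] += 1
        if ex["predicted"] != ex["ground_truth"]:
            errors[group] += 1
    rates = {group: errors[group] / totals[group] for group in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical evaluation records for an image classifier.
examples = [
    {"predicted": "cat", "ground_truth": "cat", "lighting": "day"},
    {"predicted": "dog", "ground_truth": "cat", "lighting": "night"},
    {"predicted": "cat", "ground_truth": "cat", "lighting": "night"},
    {"predicted": "dog", "ground_truth": "dog", "lighting": "day"},
]
worst_first = error_slices(examples, "lighting")
print(worst_first)
```

A slice with a high error rate, such as night-time images in this toy data, is a natural target for the next round of labeling.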

Design considerations

This section provides guidance to help you use this reference architecture to develop an architecture that meets your specific requirements for security, reliability, operational efficiency, cost, and performance. For more best practices, see the Labelbox documentation.

Security, privacy, and compliance

This section describes factors that you should consider when you use this reference architecture to design and build a standardized pipeline with Google Cloud and Labelbox.

Labelbox integrates with Cloud Storage through delegated access and ephemeral signed URLs. Data is encrypted in transit and can be sealed off behind a firewall using asset proxy servers and single sign-on (SSO). BigQuery and Vertex AI integration is secured through client and service accounts that are authenticated with Identity and Access Management (IAM).
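The general idea behind an ephemeral signed URL is that the URL carries its own expiry time and a signature that a server can check without storing any session state. The following sketch illustrates that idea with a generic HMAC scheme. It is explicitly not the Cloud Storage signing algorithm and not how Labelbox generates its URLs; the function names, the query parameter names, and the secret are hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse


def sign_url(base_url, secret, ttl_seconds, now=None):
    """Append an expiry timestamp and an HMAC-SHA256 signature to a URL."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    payload = f"{base_url}?expires={expires}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{base_url}?{urlencode({'expires': expires, 'signature': signature})}"


def verify_url(url, secret, now=None):
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    # Recompute the signature over the same payload that sign_url used.
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    payload = f"{base_url}?expires={expires}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    current = now if now is not None else time.time()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(signature.encode(), expected.encode()) and current < expires


# Hypothetical asset URL and secret, with a fixed clock for reproducibility.
url = sign_url("https://example.com/a.jpg", b"demo-secret", ttl_seconds=60, now=1000)
print(verify_url(url, b"demo-secret", now=1030))
print(verify_url(url, b"demo-secret", now=1061))
```

Because the expiry is part of the signed payload, a client can't extend the lifetime of a URL by editing the query string: any change invalidates the signature.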

Cost optimization

The pipeline that this architecture describes can help you to optimize costs in the following ways:

  • Helping you to reduce overall labeling costs and annual budgets dedicated to data labeling by providing transparency and workflows that optimize for labeling performance.
  • Saving time by improving collaboration between data scientists, AI product owners, and teams responsible for data labeling operations, which in turn reduces the manual back-and-forth of managing data labeling at each model iteration.
  • Helping to reduce model training costs by delivering better data quality and tools for quality assurance. Less human-labeled data is required for building performant models because this architecture lets you use tools for curating high-impact data. It also provides automation, which helps teams effectively iterate on the quality of labeled data.
  • Letting teams focus on building AI-powered products and delivering models, instead of building data labeling infrastructure.

Operational efficiency

Labelbox lets organizations and teams search through, curate, and label unstructured data without the need for input from data scientists and ML engineers. Organizations that build and manage comparable tools themselves, along with the pipelines to connect those tools together, typically spend years and millions of dollars doing so.

Deployment

Labelbox operates as a managed, hosted service on Google Cloud. Its unstructured data and models use customer-managed infrastructure on Google Cloud. You can purchase Labelbox from the Google Cloud Marketplace.

This reference architecture uses GitHub repositories that let you do the following:

What's next

For more information about Labelbox, see the following:

Contributors

Authors:

Other contributors: