This page explains why and how to use the MLTransform feature to prepare your data for training machine learning (ML) models. By combining multiple data processing transforms in one class, MLTransform streamlines the process of applying Apache Beam ML data processing operations to your workflow.
Figure: MLTransform in the preprocessing step of the workflow.
Benefits
The MLTransform class provides the following benefits:
- Transform your data without writing complex code or managing underlying libraries.
- Generate embeddings that you can use to push data into vector databases or to run inference.
- Efficiently chain multiple types of processing operations with one interface, as the sketch after this list shows.
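As a rough illustration of that last point, the following minimal sketch chains a scaling transform and a vocabulary transform in a single MLTransform step. The column names, sample rows, and local artifact directory are assumptions made for the sketch, not requirements:

import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary, ScaleTo01

# Illustrative rows: 'x' holds numbers to scale, 'y' holds tokens to index.
data = [
    {'x': [1.0, 5.0, 10.0], 'y': ['beam', 'dataflow']},
    {'x': [2.0, 3.0, 4.0], 'y': ['dataflow', 'ml']},
]

# Assumed local artifact directory; in practice this is often a Cloud Storage path.
artifact_location = tempfile.mkdtemp()

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(data)
        # One MLTransform applies both operations by chaining with_transform calls.
        | MLTransform(write_artifact_location=artifact_location)
            .with_transform(ScaleTo01(columns=['x']))
            .with_transform(ComputeAndApplyVocabulary(columns=['y']))
        | beam.Map(print))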
Support and limitations
The MLTransform class has the following limitations:
- Available for pipelines that use the Apache Beam Python SDK version 2.53.0 or later.
- Pipelines must use default windows.
Text embedding transforms:
- Support Python 3.8, 3.9, 3.10, 3.11, and 3.12.
- Support both batch and streaming pipelines.
- Support the Vertex AI text-embeddings API and the Hugging Face Sentence Transformers module.
Data processing transforms that use TFT:
- Support Python 3.9, 3.10, and 3.11.
- Support batch pipelines.
Use cases
The example notebooks demonstrate how to use MLTransform for specific use cases.
- I want to generate text embeddings for my LLM by using Vertex AI: Use the Apache Beam MLTransform class with the Vertex AI text-embeddings API to generate text embeddings. Text embeddings are a way to represent text as numerical vectors, which is necessary for many natural language processing (NLP) tasks.
- I want to generate text embeddings for my LLM by using Hugging Face: Use the Apache Beam MLTransform class with Hugging Face Hub models to generate text embeddings. The Hugging Face SentenceTransformers framework uses Python to generate sentence, text, and image embeddings. A minimal sketch follows this list.
- I want to compute a vocabulary from a dataset: Compute a unique vocabulary from a dataset and then map each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks.
- I want to scale my data to train my ML model: Scale your data so that you can use it to train your ML model. The Apache Beam MLTransform class includes multiple data scaling transforms.
For a full list of available transforms, see Transforms in the Apache Beam documentation.
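As one example from the list above, the following is a minimal sketch of generating text embeddings with the Hugging Face SentenceTransformers module. The model name, the 'text' column, and the local artifact directory are assumptions for the sketch, not requirements:

import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.huggingface import SentenceTransformerEmbeddings

# Illustrative input rows; 'text' is the column to embed.
data = [
    {'text': 'MLTransform streamlines ML preprocessing.'},
    {'text': 'Embeddings represent text as numerical vectors.'},
]

# Assumed local artifact directory; a Cloud Storage path also works.
artifact_location = tempfile.mkdtemp()

embedding_transform = SentenceTransformerEmbeddings(
    model_name='all-MiniLM-L6-v2',  # assumed Hugging Face model, swap in your own
    columns=['text'])

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=artifact_location).with_transform(
            embedding_transform)
        | beam.Map(print))

The Vertex AI path follows the same pattern, swapping in the VertexAITextEmbeddings transform from the apache_beam.ml.transforms.embeddings.vertex_ai module with a Vertex AI text-embedding model name.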
Use MLTransform
To use the MLTransform class to preprocess data, include the following code in your pipeline:
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import TRANSFORM_NAME

data = [
    {
        DATA
    },
]

artifact_location = 'gs://BUCKET_NAME'

TRANSFORM_FUNCTION_NAME = TRANSFORM_NAME(columns=['x'])

with beam.Pipeline() as p:
    transformed_data = (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=artifact_location).with_transform(
            TRANSFORM_FUNCTION_NAME)
        | beam.Map(print))
Replace the following values:
- TRANSFORM_NAME: the name of the transform to use.
- BUCKET_NAME: the name of your Cloud Storage bucket.
- DATA: the input data to transform.
- TRANSFORM_FUNCTION_NAME: the name that you assign to your transform function in your code.
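For example, a filled-in version of this template, assuming the ComputeAndApplyVocabulary transform, illustrative sample rows, and a hypothetical bucket path, might look like the following:

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary

# Illustrative input rows; 'x' holds the tokens to index.
data = [
    {'x': ['I', 'love', 'Beam']},
    {'x': ['Beam', 'is', 'awesome']},
]

# Hypothetical Cloud Storage path; replace with your own bucket.
artifact_location = 'gs://my-bucket/ml-transform-artifacts'

vocab_transform = ComputeAndApplyVocabulary(columns=['x'])

with beam.Pipeline() as p:
    transformed_data = (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=artifact_location).with_transform(
            vocab_transform)
        | beam.Map(print))

Each token in the x column is replaced with its integer index, and the computed vocabulary is written to the artifact location so that the same mapping can be reapplied later, for example by passing read_artifact_location to MLTransform in an inference pipeline.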
What's next
- For more details about MLTransform, see Preprocess data in the Apache Beam documentation.
- For more examples, see MLTransform for data processing in the Apache Beam transform catalog.
- Run an interactive notebook in Colab.