Data sources and monitoring

Specify different training data and add monitoring.

You can query a model directly and compare the results returned for different parameter values, either in the Cloud console or by calling the Vertex AI API.

System instructions

You are focusing on enhancing machine learning systems by providing the requested code enhancements. You always briefly mention one or two optimizations or suggestions related directly to the changes you have made. You do this in natural language bullet points at the end of your response, so the developer is free to ignore them or ask for more information.

Prompt Freeform

I am working on a sentiment analysis project that processes customer feedback using TensorFlow and Keras. Instead of customer_reviews, I want to randomly sample data from the Yelp Polarity dataset from Hugging Face. Sample only the training data, not the test or validation data. Do the sampling before the tokenization. I also want to integrate resource usage monitoring. Please add a function for this and use it in a callback at the end of each epoch. It should monitor and log CPU usage and memory usage.

Run this once using a random sample of 500 Yelp reviews and once using a random sample of 1000 Yelp reviews.

Here is my code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Sample customer feedback data
customer_reviews = [
    {"review": "The product is fantastic! Much better than expected.", "gender": "female", "state": "CA"},
    {"review": "Terrible service, very disappointing.", "gender": "male", "state": "NY"},
    {"review": "Excellent quality, but the shipping was slow.", "gender": "female", "state": "TX"},
    {"review": "Horrible experience, will not buy again.", "gender": "male", "state": "CA"},
    {"review": "User-friendly and intuitive design.", "gender": "female", "state": "WA"},
    {"review": "I am very satisfied with my purchase, would recommend.", "gender": "female", "state": "TX"},
    {"review": "Not worth the price, feels cheaply made.", "gender": "male", "state": "NY"},
    {"review": "The app constantly crashes, very annoying.", "gender": "male", "state": "FL"},
    {"review": "Customer support was helpful and solved my problem quickly.", "gender": "female", "state": "CA"},
    {"review": "Good quality, but a bit too expensive.", "gender": "male", "state": "TX"}
]

# Extracting reviews and labels for training
reviews = [entry["review"] for entry in customer_reviews]
labels = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1]) # Binary labels for sentiment (1 = positive, 0 = negative)

# Tokenization and sequence padding
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(reviews)
review_sequences = tokenizer.texts_to_sequences(reviews)
review_padded = pad_sequences(review_sequences, maxlen=100)

# Model creation
model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(review_padded, labels, epochs=5, verbose=0)

Response

Of course. Here are the requested code enhancements to integrate the Yelp Polarity dataset and add resource monitoring.

I have replaced the static data with a function that randomly samples the Yelp Polarity dataset from Hugging Face. I also created a custom Keras callback to monitor and log CPU and memory usage at the end of each training epoch. The script now runs the process twice: once with a sample of 500 reviews and once with a sample of 1000.
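
The sampling step described above (shuffle with a fixed seed, then keep the first N rows) can be illustrated in isolation. This is a minimal standard-library sketch, not the Hugging Face implementation: `sample_indices` and the total of 10,000 rows are hypothetical stand-ins. It shows one consequence worth knowing when comparing the two runs: with the same seed, the 500-item sample is the first half of the 1000-item sample, so the two samples are nested and not independent.

```python
import random


def sample_indices(total, sample_size, seed=42):
    """Shuffle the indices 0..total-1 with a fixed seed and keep the first sample_size.

    Hypothetical stand-in for a seeded shuffle followed by selecting
    the first sample_size rows of a dataset.
    """
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:sample_size]


small = sample_indices(10_000, 500)
large = sample_indices(10_000, 1000)

# Both calls shuffle identically, so the smaller sample is a prefix of the larger one.
print(small == large[:500])
```

If the two runs should use independent samples, passing a different seed for each run is one option.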

# You may need to install the required libraries first:
# pip install tensorflow datasets psutil huggingface_hub

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import Callback
import psutil
import os
from datasets import load_dataset

# --- 1. Resource Monitoring Function and Callback ---

def log_resource_usage():
    """Monitors and logs current CPU and memory usage."""
    process = psutil.Process(os.getpid())
    # cpu_percent(interval=0.1) blocks for 0.1 s and returns this process's CPU usage over that window
    cpu_usage = process.cpu_percent(interval=0.1)
    # rss: Resident Set Size - the non-swapped physical memory a process has used
    memory_usage_mb = process.memory_info().rss / (1024 * 1024)
    print(f" - CPU Usage: {cpu_usage:.2f}% | Memory Usage: {memory_usage_mb:.2f} MB")

class ResourceMonitorCallback(Callback):
    """A Keras callback to log resource usage at the end of each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        log_resource_usage()

# --- 2. Data Loading from Hugging Face ---

def load_and_sample_yelp_data(sample_size):
    """Loads the Yelp Polarity dataset and returns a random sample from the training split."""
    print(f"\nLoading and sampling {sample_size} reviews from the Yelp Polarity dataset...")
    # Load only the 'train' split of the dataset
    dataset = load_dataset('yelp_polarity', split='train')
    
    # Shuffle the dataset and select a random sample. A fixed seed ensures the sample is reproducible.
    sampled_dataset = dataset.shuffle(seed=42).select(range(sample_size))
    
    reviews = [entry["text"] for entry in sampled_dataset]
    # The 'label' in this dataset is already 0 for negative, 1 for positive
    labels = np.array([entry["label"] for entry in sampled_dataset])
    print(f"Data loaded. Found {len(reviews)} reviews.")
    return reviews, labels

# --- 3. Main Training Function ---

def train_sentiment_model(sample_size):
    """
    Loads data, preprocesses it, and trains the sentiment analysis model
    for a given sample size.
    """
    # Load and sample data from Hugging Face
    reviews, labels = load_and_sample_yelp_data(sample_size)

    # Tokenization and sequence padding
    # A larger vocab size and maxlen are suitable for the more diverse Yelp reviews
    tokenizer = Tokenizer(num_words=15000, oov_token="<OOV>")
    tokenizer.fit_on_texts(reviews)
    review_sequences = tokenizer.texts_to_sequences(reviews)
    review_padded = pad_sequences(review_sequences, maxlen=200, padding='post', truncating='post')
    
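    # Size the embedding to the indices the tokenizer can emit: texts_to_sequences
    # only keeps indices below num_words, and the vocabulary may be smaller than that
    vocab_size = min(tokenizer.num_words, len(tokenizer.word_index) + 1)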

    # Model creation
    model = Sequential([
        # input_length is omitted: it is deprecated in Keras 3 and is not needed here
        Embedding(input_dim=vocab_size, output_dim=64),
        LSTM(128, return_sequences=True),
        Dropout(0.3),
        LSTM(64),
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()

    # Train the model with the resource monitoring callback
    print(f"\n--- Starting training for sample size: {sample_size} ---")
    resource_monitor = ResourceMonitorCallback()
    # Using a validation split helps monitor the model's performance on unseen data during training
    model.fit(review_padded, labels, epochs=5, verbose=1, validation_split=0.2, callbacks=[resource_monitor])
    print(f"--- Finished training for sample size: {sample_size} ---")

# --- 4. Run Experiments ---

if __name__ == "__main__":
    # Run 1: 500 Yelp reviews
    train_sentiment_model(sample_size=500)
    
    # Run 2: 1000 Yelp reviews
    train_sentiment_model(sample_size=1000)

  • For better performance with larger datasets, you could convert the sampled data into a tf.data.Dataset object. This integrates more efficiently with TensorFlow's backend and allows for optimizations like data prefetching (.prefetch()), which can reduce I/O bottlenecks.
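  • process.cpu_percent(interval=0.1) blocks for 0.1 seconds on every call and only reflects that short window. An alternative is to create the psutil.Process object once, call cpu_percent(interval=None) on it before training starts to set a baseline, and then call it with interval=None inside the callback; each call then returns immediately with the CPU usage since the previous call, which here covers the whole epoch.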
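
The CPU-baseline suggestion can be sketched as follows. This is a minimal sketch under stated assumptions, not the response's implementation: `EpochResourceLog`, `make_psutil_sampler`, and the `sampler` parameter are hypothetical names introduced here, and the default sampler assumes psutil is installed. The class borrows the Keras hook names `on_train_begin` and `on_epoch_end`, primes the CPU counter once when training begins, and stores one reading per epoch so that the 500-review and 1000-review runs can be compared afterwards.

```python
import os


def make_psutil_sampler():
    """Return a function reporting (cpu_percent, rss_mb) for the current process.

    Assumes psutil is installed. A single Process object is reused so that
    cpu_percent(interval=None) can compare against its own previous call.
    """
    import psutil  # imported lazily so the class below can be used without it

    process = psutil.Process(os.getpid())

    def sample():
        cpu = process.cpu_percent(interval=None)  # returns immediately
        rss_mb = process.memory_info().rss / (1024 * 1024)
        return cpu, rss_mb

    return sample


class EpochResourceLog:
    """Hypothetical helper that records one resource reading per epoch."""

    def __init__(self, sampler=None):
        # sampler is any zero-argument callable returning (cpu_percent, rss_mb)
        self.sampler = sampler if sampler is not None else make_psutil_sampler()
        self.history = []

    def on_train_begin(self, logs=None):
        self.history.clear()
        self.sampler()  # priming call: sets the CPU baseline, result discarded

    def on_epoch_end(self, epoch, logs=None):
        cpu, rss_mb = self.sampler()
        self.history.append({"epoch": epoch, "cpu_percent": cpu, "rss_mb": rss_mb})
```

One way to use this with model.fit would be a small Callback subclass whose on_train_begin and on_epoch_end methods delegate to an EpochResourceLog instance; that wiring is left out here to keep the sketch independent of TensorFlow.
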
Model: gemini-2.5-pro
Temperature: 0.2
Max output tokens: 8192
TopK: 40.0
TopP: 0.95