Data sources and monitoring

Specify different training data and add monitoring.

You can query a model directly and test the results it returns when using different parameter values, either in the Cloud console or by calling the Vertex AI API directly.
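For example, the request documented on this page can be sent with the Vertex AI SDK for Python. The snippet below is a minimal sketch rather than the exact call used to produce the response shown here: it assumes the google-cloud-aiplatform package is installed, and the project ID and region are placeholders you would replace with your own; the model name and parameter values mirror the settings listed at the end of this page.

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Initialize the SDK; the project ID and region below are placeholders
vertexai.init(project="your-project-id", location="us-central1")

# The system instructions and prompt text are shown in full in the sections below
model = GenerativeModel(
    "gemini-1.5-flash-002",
    system_instruction="You are focusing on enhancing machine learning systems ...",
)

response = model.generate_content(
    "I am working on a sentiment analysis project ...",  # the freeform prompt below
    generation_config=GenerationConfig(
        temperature=0.2,
        max_output_tokens=8192,
        top_k=40,
        top_p=0.95,
    ),
)
print(response.text)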

System instructions

You are focusing on enhancing machine learning systems by providing the requested code enhancements. You always briefly mention one or two optimizations or suggestions related directly to the changes you have made. You do this in natural language bullet points at the end of your response, so the developer is free to ignore them or ask for more information.

Prompt Freeform

I am working on a sentiment analysis project that processes customer feedback using TensorFlow and Keras. Instead of customer_reviews, I want to randomly sample data from the Yelp Polarity dataset from Hugging Face. Sample only the training data, not the test or validation data. Do the sampling before the tokenization. I also want to integrate resource usage monitoring. Please add a function for this and use it in a callback at the end of each epoch. It should monitor and log CPU usage and memory usage.

Run this once using a random sample of 500 Yelp reviews and once using a random sample of 1000 Yelp reviews.

Here is my code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Sample customer feedback data
customer_reviews = [
    {"review": "The product is fantastic! Much better than expected.", "gender": "female", "state": "CA"},
    {"review": "Terrible service, very disappointing.", "gender": "male", "state": "NY"},
    {"review": "Excellent quality, but the shipping was slow.", "gender": "female", "state": "TX"},
    {"review": "Horrible experience, will not buy again.", "gender": "male", "state": "CA"},
    {"review": "User-friendly and intuitive design.", "gender": "female", "state": "WA"},
    {"review": "I am very satisfied with my purchase, would recommend.", "gender": "female", "state": "TX"},
    {"review": "Not worth the price, feels cheaply made.", "gender": "male", "state": "NY"},
    {"review": "The app constantly crashes, very annoying.", "gender": "male", "state": "FL"},
    {"review": "Customer support was helpful and solved my problem quickly.", "gender": "female", "state": "CA"},
    {"review": "Good quality, but a bit too expensive.", "gender": "male", "state": "TX"}
]

# Extracting reviews and labels for training
reviews = [entry["review"] for entry in customer_reviews]
labels = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1]) # Binary labels for sentiment (1 = positive, 0 = negative)

# Tokenization and sequence padding
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(reviews)
review_sequences = tokenizer.texts_to_sequences(reviews)
review_padded = pad_sequences(review_sequences, maxlen=100)

# Model creation
model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(review_padded, labels, epochs=5, verbose=0)

Response

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from datasets import load_dataset
import psutil
import os

# Function to monitor and log resource usage at the end of each epoch
def monitor_resources(epoch, logs):
    # LambdaCallback passes both the epoch index and the logs dict to on_epoch_end
    cpu_percent = psutil.cpu_percent(interval=1)
    mem_percent = psutil.virtual_memory().percent
    print(f"Epoch {epoch + 1}: CPU usage {cpu_percent}%, memory usage {mem_percent}%")
    if logs is not None:
        logs.update({'cpu_usage': cpu_percent, 'memory_usage': mem_percent})


# Load the Yelp Polarity dataset
dataset = load_dataset('yelp_polarity')

# Sample data for training (random sample from the training split only)
def sample_data(num_samples):
    # Shuffle the training split and select the first num_samples examples
    sampled_data = dataset['train'].shuffle(seed=42).select(range(num_samples))
    reviews = [example['text'] for example in sampled_data]
    labels = np.array([example['label'] for example in sampled_data])
    return reviews, labels

# Run with 500 samples
reviews_500, labels_500 = sample_data(500)

# Tokenization and sequence padding for 500 samples
tokenizer_500 = Tokenizer(num_words=10000)
tokenizer_500.fit_on_texts(reviews_500)
review_sequences_500 = tokenizer_500.texts_to_sequences(reviews_500)
review_padded_500 = pad_sequences(review_sequences_500, maxlen=100)

# Model creation for 500 samples
model_500 = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model for 500 samples
model_500.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Resource monitoring callback
resource_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=monitor_resources)

# Train the model for 500 samples
model_500.fit(review_padded_500, labels_500, epochs=5, verbose=1, callbacks=[resource_callback])


# Run with 1000 samples
reviews_1000, labels_1000 = sample_data(1000)

# Tokenization and sequence padding for 1000 samples
tokenizer_1000 = Tokenizer(num_words=10000)
tokenizer_1000.fit_on_texts(reviews_1000)
review_sequences_1000 = tokenizer_1000.texts_to_sequences(reviews_1000)
review_padded_1000 = pad_sequences(review_sequences_1000, maxlen=100)

# Model creation for 1000 samples
model_1000 = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model for 1000 samples
model_1000.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model for 1000 samples
model_1000.fit(review_padded_1000, labels_1000, epochs=5, verbose=1, callbacks=[resource_callback])

  • Data Sampling before Tokenization: The code now samples the data from the Hugging Face dataset before performing tokenization, as requested. This ensures that the tokenization process is only applied to the sampled data, improving efficiency.
  • Resource Monitoring: The monitor_resources function and its integration as a callback provide per-epoch monitoring of CPU and memory usage during training. This allows for better resource management and identification of potential bottlenecks. Consider adding logging to a file for more permanent record-keeping; a possible sketch of this follows below.
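Following up on the last bullet, the snippet below is one possible way to persist the per-epoch measurements to a file. It is a minimal sketch and not part of the model response above: the ResourceLogger class name and the resource_usage.csv file path are illustrative, and it assumes psutil and TensorFlow are installed as in the code above.

import csv
import psutil
import tensorflow as tf

class ResourceLogger(tf.keras.callbacks.Callback):
    """Appends per-epoch CPU and memory usage to a CSV file."""

    def __init__(self, log_path="resource_usage.csv"):
        super().__init__()
        self.log_path = log_path

    def on_train_begin(self, logs=None):
        # Write the header once at the start of training
        with open(self.log_path, "w", newline="") as f:
            csv.writer(f).writerow(["epoch", "cpu_percent", "memory_percent"])

    def on_epoch_end(self, epoch, logs=None):
        cpu_percent = psutil.cpu_percent(interval=1)
        mem_percent = psutil.virtual_memory().percent
        with open(self.log_path, "a", newline="") as f:
            csv.writer(f).writerow([epoch + 1, cpu_percent, mem_percent])

# Example usage (hypothetical file name):
# model_500.fit(review_padded_500, labels_500, epochs=5,
#               callbacks=[ResourceLogger("resource_usage_500.csv")])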
Model: gemini-1.5-flash-002
Temperature: 0.2
Max output tokens: 8192
TopK: 40.0
TopP: 0.95