Data is an essential part of a machine-learning (ML) system. You need to prepare your data before you can use it to train your ML model and to get predictions from the trained model. This page provides an overview of the steps required to prepare your data.
Data for model training
You need a large number of existing data instances to train a model. This data:
- Is representative of the data in your problem space.
- Includes all of the features your model needs to make predictions, as well as the target value you want to infer for new instances.
- Is serialized in a format that TensorFlow can accept, generally CSV or TFRecord.
- Is stored in a location that your Google Cloud project can access, typically in a Cloud Storage location or a BigQuery table.
- Is split into three datasets: one for training the model, one for evaluating the trained model's accuracy and generalizability, and one for testing the trained model.
Stages of data preparation
Data preparation usually involves iterative steps. In this documentation, the steps for getting data ready for supervised machine learning are identified as:
- Gather the data.
- Clean the data.
- Split the data.
- Engineer the data features.
- Preprocess the data.
Gather the data
Finding data, especially data with the labels you need, can be a challenge. Your sources might vary significantly from one machine learning project to the next. If you find yourself merging data from different sources, or collecting data entered at multiple points, you need to be extra careful in the next step.
Clean the data
Cleaning data is the process of checking for integrity and consistency. At this stage you shouldn't be looking at the data overall for patterns. Instead, you clean data by column (attribute), looking for such anomalies as:
- Instances with missing features.
- Multiple methods of representing a feature. For example, some instances might list a length measurement in inches, and others might list it in centimeters. It is crucial that all instances of a given feature use the same scale and follow the same format.
- Features with values far outside the typical range (outliers), which may be data-entry anomalies or other invalid data.
- Significant changes in the data across time, geographic location, or other recognizable characteristics.
- Incorrect labels or poorly defined labeling criteria.
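The column-wise checks above can be sketched in plain Python. The field names, the inches-to-centimeters rule, and the plausibility range here are illustrative assumptions, not part of any particular dataset:

```python
# A minimal sketch of per-column cleaning checks on a list of instances.
# Field names, units, and thresholds are illustrative assumptions.

def clean_instances(instances):
    """Split instances into (cleaned, rejected) after basic checks."""
    cleaned, rejected = [], []
    for row in instances:
        # 1. Missing features: reject instances lacking required values.
        if row.get("length") is None or row.get("label") is None:
            rejected.append(row)
            continue
        # 2. Inconsistent representation: normalize inches to centimeters
        #    so every instance uses the same scale.
        if row.get("unit") == "in":
            row = {**row, "length": row["length"] * 2.54, "unit": "cm"}
        # 3. Outliers: drop values far outside a plausible range,
        #    which may indicate data-entry anomalies.
        if not (0 < row["length"] < 500):
            rejected.append(row)
            continue
        cleaned.append(row)
    return cleaned, rejected

data = [
    {"length": 72, "unit": "in", "label": "tall"},
    {"length": 180, "unit": "cm", "label": "tall"},
    {"length": None, "unit": "cm", "label": "short"},  # missing feature
    {"length": 99999, "unit": "cm", "label": "tall"},  # outlier
]
good, bad = clean_instances(data)
```

In practice you would log and investigate the rejected instances rather than silently dropping them, since systematic rejections can reveal a broken data source.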
Split the data
You need at least three subsets of data in a supervised learning scenario: training data, evaluation data, and test data.
Training data is the data that you use to train your model. You need to analyze and get to know the data before you develop the model.
Evaluation data is what you use to check your model’s performance during the regular training cycle. It is your primary tool in ensuring that your model is generalizable to data beyond the training set.
Test data is used to test a model that’s close to completion, usually after multiple training iterations. You should not analyze or scrutinize your test data, instead keeping it fresh until needed to test your model. That way you can be assured that you aren't making assumptions based on familiarity with the data that can pollute your training results.
Here are some important things to remember when you split your data:
- It is better to randomly sample the subsets from one big dataset than to use some pre-divided data, such as instances from two distinct date ranges or data-collection systems. The latter approach has an increased risk of non-uniformity that can lead to overfitting.
- Ideally you should assign instances to a dataset and keep those associations throughout the process.
- Experts disagree about the right proportions for the different datasets. However, regardless of the specific ratios, you should have more training data than evaluation data, and more evaluation data than test data.
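One way to follow both recommendations (random sampling and stable assignment) is to hash a unique instance ID into a split bucket. This sketch assumes each instance has a stable unique ID; the 80/10/10 ratio is an illustrative choice, not a prescription:

```python
# A sketch of a stable three-way split. Hashing a unique instance ID
# gives an effectively random assignment that stays the same across
# runs, so each instance keeps its dataset association.
import hashlib

def assign_split(instance_id, train=0.8, evaluation=0.1):
    """Deterministically map an instance ID to 'train', 'eval', or 'test'."""
    digest = hashlib.md5(str(instance_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0  # pseudo-uniform in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + evaluation:
        return "eval"
    return "test"

splits = {"train": [], "eval": [], "test": []}
for instance_id in range(10000):
    splits[assign_split(instance_id)].append(instance_id)
```

Because the assignment depends only on the ID, adding new instances later never moves an existing instance between subsets.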
Engineer the data features
Before you develop your model, you should get acquainted with your training data. Look for patterns in your data, and think about what values could influence your target attribute. This process of deciding which data is important for your model is called feature engineering.
Feature engineering is not just about deciding which attributes to use from your raw data. The harder and often more important work is extracting generalizable, indicative features from specific data. That means combining the data you have with your knowledge about the problem space to get the data you really need. It can be a complex process, and doing it right depends on understanding the subject matter and the goals of your problem. Here are a couple of examples:
Example data: residential address
Data about people often includes a residential address, which is a complex string, often hard to make consistent, and not particularly useful for many applications on its own. You should usually extract a more meaningful feature from it. Here are some examples of things you could extract from an address:
- Longitude and latitude
- Closest elementary school
- Legislative district
- Relative position to a landmark
Example data: timestamp
Another common item of data is a timestamp, which is usually a large numerical value indicating the amount of time elapsed since a common reference point. Here are some examples of things you might extract from a precise timestamp:
- Hour of the day
- Elapsed time since another event
- Time of day (morning, afternoon, evening, night)
- Whether some facility was open or closed at that time
- Frequency of an event (in combination with other instances)
- Position of the sun (in combination with latitude and longitude)
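The first two list items can be sketched with Python's standard library. The time-of-day boundaries below are illustrative assumptions; choose boundaries that make sense for your problem space:

```python
# A sketch of deriving features from a raw Unix timestamp.
# The time-of-day category boundaries are illustrative assumptions.
from datetime import datetime, timezone

def timestamp_features(unix_seconds):
    """Extract hour-of-day and a four-way time-of-day category."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    hour = dt.hour
    if 5 <= hour < 12:
        part = "morning"
    elif 12 <= hour < 17:
        part = "afternoon"
    elif 17 <= hour < 21:
        part = "evening"
    else:
        part = "night"
    return {"hour_of_day": hour, "time_of_day": part}

features = timestamp_features(28800)  # 08:00 UTC
```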
Here are some important things to note about the examples above:
- You can combine multiple attributes to make one generalizable feature. For example, address and timestamp can get you the position of the sun.
- You can use feature engineering to simplify data. For example, timestamp to time of day takes an attribute with seemingly countless values and reduces it to four categories.
- You can get useful features, and reduce the number of instances in your dataset, by engineering across instances. For example, use multiple instances to calculate the frequency of something.
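The last point, engineering across instances, can be sketched as collapsing raw event rows into one row per entity with a frequency feature. The field names here are illustrative assumptions:

```python
# A sketch of engineering a feature across instances: many raw event
# rows become one row per user with an event-count feature.
# The "user" and "event" field names are illustrative assumptions.
from collections import Counter

events = [
    {"user": "a", "event": "click"},
    {"user": "a", "event": "click"},
    {"user": "b", "event": "click"},
    {"user": "a", "event": "click"},
]

clicks_per_user = Counter(row["user"] for row in events)
features = [{"user": u, "click_count": n} for u, n in clicks_per_user.items()]
```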
When you've finished, you'll have a list of features to include when training your model.
One of the most difficult parts of the process is deciding when you have the right set of features. It's sometimes difficult to know which features are likely to affect your prediction accuracy. Machine learning experts often stress that it's a field that requires flexibility and experimentation. You'll never get it perfect the first try, so make your best guess and use the results to inform your next iteration.
Preprocess the data
So far this page has described generally applicable steps to take when getting your data ready to train your model. It hasn't mattered up to this point how your data is represented and formatted. Preprocessing is the next step: getting your prepared data into a format that works with the tools and techniques you use to train a model.
Data formats and AI Platform
AI Platform doesn't impose requirements on your data format; you can use whatever input format is convenient for your training application. That said, your input data must be in a format that TensorFlow can read, and in a location that your AI Platform project can access. The simplest solution is often a CSV file in a Cloud Storage bucket that your Google Cloud project can access. Some types of data, such as sparse vectors and binary data, are better represented using TensorFlow's tf.train.Example format serialized in a TFRecord file.
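As a sketch, assuming TensorFlow is installed, an instance with one numeric feature and one label can be serialized as a tf.train.Example and written to a TFRecord file. The feature names and output path are illustrative assumptions:

```python
# A sketch of serializing one instance as a tf.train.Example and
# writing it to a TFRecord file. Feature names and the file path
# are illustrative assumptions.
import tensorflow as tf

def make_example(length_cm, label):
    """Build a tf.train.Example with one float feature and one bytes label."""
    feature = {
        "length_cm": tf.train.Feature(
            float_list=tf.train.FloatList(value=[length_cm])),
        "label": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[label.encode("utf-8")])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

example = make_example(182.88, "tall")
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(example.SerializeToString())
```

Note that FloatList stores 32-bit floats, so values are not preserved at full double precision.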
There are many transformations that may be useful to perform on your raw feature data. Some of the more common ones are:
- Normalizing numerical values to be represented at a consistent scale (commonly a range between -1 and 1 or between 0 and 1).
- Representing non-numeric data numerically, such as changing categorical features to index values or one-hot vectors.
- Changing raw text strings to a more compact representation, like a bag of words.
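The first two transformations can be sketched in plain Python; a production pipeline would typically use a preprocessing library instead, and the vocabulary and value ranges below are illustrative assumptions:

```python
# A sketch of two common preprocessing transforms:
# min-max scaling to the range [0, 1], and one-hot encoding of a
# categorical feature against a fixed vocabulary.

def min_max_scale(values):
    """Rescale numeric values to [0, 1] based on the observed range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over a vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

lengths = [150.0, 170.0, 190.0]
scaled = min_max_scale(lengths)
colors = ["red", "green", "blue"]
encoded = one_hot("green", colors)
```

One caveat: the scaling parameters (minimum and maximum) must be computed from the training data only, then reused unchanged for evaluation, test, and serving data, or your datasets will be on inconsistent scales.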