This page provides a simple overview of the steps required to prepare your data.
Stages of data preparation
Data preparation usually involves iterative steps. In this documentation the steps for getting data ready for supervised machine learning are identified as:
- Gather data.
- Clean the data.
- Split the data.
- Engineer features.
- Preprocess the features.
Gather data
Finding data, especially data with the labels you need, can be a challenge. Your sources might vary significantly from one machine learning project to the next. If you are merging data from different sources, or receiving data entered in multiple places, you’ll need to be extra careful in the next step.
Clean the data
Cleaning data is the process of checking for integrity and consistency. At this stage you shouldn't be looking at the data overall for patterns. Instead, you clean data by column (attribute), looking for such anomalies as:
- Instances with missing features.
- Multiple methods of representing a feature. For example, some instances might list a length measurement in inches, and others might list it in centimeters. It is crucial that all instances of a given feature use the same scale and follow the same format.
- Features with values far out of the typical range (outliers), which may be data-entry anomalies or other invalid data.
- Significant changes in the data over distances in time, geographic location, or other recognizable characteristics.
- Incorrect labels or poorly defined labeling criteria.
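As a minimal sketch of column-wise cleaning, the following Python example checks a single attribute for three of the anomalies listed above: missing values, mixed units, and outliers. The records, the field names (`length`, `unit`), and the outlier threshold are hypothetical and chosen only for illustration. The median-based outlier rule is one common heuristic, not the only reasonable choice.

```python
import statistics

# Hypothetical raw records: lengths arrive in mixed units, and one is missing.
records = [
    {"id": 1, "length": 10.0, "unit": "in"},
    {"id": 2, "length": 25.4, "unit": "cm"},
    {"id": 3, "length": None, "unit": "cm"},
    {"id": 4, "length": 27.0, "unit": "cm"},
    {"id": 5, "length": 2600.0, "unit": "cm"},  # probably a data-entry error
    {"id": 6, "length": 9.5, "unit": "in"},
]

CM_PER_INCH = 2.54


def to_cm(record):
    """Return the length in centimeters, or None if the value is missing."""
    if record["length"] is None:
        return None
    if record["unit"] == "in":
        return record["length"] * CM_PER_INCH
    return record["length"]


def clean_length_column(rows, max_deviations=5.0):
    """Put one column on a single scale and report missing values and outliers."""
    lengths = {row["id"]: to_cm(row) for row in rows}
    missing = [rid for rid, value in lengths.items() if value is None]
    present = {rid: value for rid, value in lengths.items() if value is not None}

    # The median absolute deviation is less distorted by the outliers
    # themselves than the mean and standard deviation would be.
    median = statistics.median(present.values())
    mad = statistics.median(abs(value - median) for value in present.values())
    outliers = [
        rid
        for rid, value in present.items()
        if mad > 0 and abs(value - median) / mad > max_deviations
    ]
    return lengths, missing, outliers


lengths, missing, outliers = clean_length_column(records)
print("missing:", missing)
print("outliers:", outliers)
```

Flagged instances still need a decision (correct, drop, or keep), and that decision depends on what you know about how the data was collected.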
Split the data
You need at least three subsets of data in a supervised learning scenario: training data, evaluation data, and test data.
- Training data is the data that you’ll get acquainted with. You use it to train your model, yes, but you also analyze it as you develop the model in the first place.
- Evaluation data is what you use to check your model’s performance during the regular training cycle. It is your primary tool in ensuring that your model is generalizable to data beyond the training set.
- Test data is used to test a model that’s close to completion, usually after multiple training iterations. You should never analyze or scrutinize your test data, instead keeping it fresh until needed to test your model. That way you can be assured that you aren't making assumptions based on familiarity with the data that can pollute your training results.
Here are some important things to remember when you split your data:
- It is better to randomly sample the subsets from one big dataset than to use some pre-divided data, such as instances from two distinct date ranges or data-collection systems. The latter approach has an increased risk of non-uniformity that can lead to overfitting.
- Ideally, assign each instance to one subset and keep that assignment throughout the process, so that no instance moves between the training, evaluation, and test data from one iteration to the next.
- Experts disagree about the right proportions for the different datasets. However, regardless of the specific ratios you should have more training data than evaluation data, and more evaluation data than test data.
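One way to get a pseudo-random split whose assignments stay fixed across runs is to hash a stable instance identifier. The sketch below assumes that every instance has such an identifier. The function name `assign_subset` and the 70/20/10 proportions are illustrative choices, not recommendations from this page.

```python
import hashlib
from collections import Counter


def assign_subset(instance_id, train=0.7, evaluation=0.2):
    """Deterministically map an instance ID to 'train', 'eval', or 'test'."""
    digest = hashlib.sha256(str(instance_id).encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + evaluation:
        return "eval"
    return "test"  # the remaining share, 10% with the defaults


# The same ID always lands in the same subset, even if the dataset grows.
counts = Counter(assign_subset(i) for i in range(10000))
print(counts)
```

Because the assignment depends only on the identifier, adding new instances later does not reshuffle the ones already assigned.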
Engineer features
Before you develop your model, you should get acquainted with your training data. Look for patterns in your data, and think about what values could influence your target attribute. This process of deciding which data is important for your model is called feature engineering.
Feature engineering is not just about deciding which attributes you have in your raw data that you want in your model. The harder and often more important work is extracting generalizable, indicative features from specific data. That means combining the data you have with your knowledge about the problem space to get the data you really need. It can be a complex process, and doing it right depends on understanding the subject matter and the goals of your problem. Here are a couple of examples:
Example data: residential address
Data about people often includes a residential address, which is a complex string, often hard to make consistent, and not particularly useful for many applications on its own. You should usually extract a more meaningful feature from it. Here are some examples of things you could extract from an address:
- Longitude and latitude
- Closest elementary school
- Legislative district
- Relative position to a landmark
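As an illustration of the last item, the following sketch computes the great-circle distance between an address and a landmark using the haversine formula. It assumes the address has already been geocoded to latitude and longitude by some other service. The coordinates shown are hypothetical, and the Earth is treated as a sphere with a mean radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical geocoded address and landmark.
home = (47.61, -122.33)
landmark = (47.62, -122.35)
distance_to_landmark_km = haversine_km(*home, *landmark)
print(round(distance_to_landmark_km, 2))
```

A single number such as this distance is usually easier for a model to use than the original address string.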
Example data: timestamp
Another common item of data is a timestamp, which is usually a large numerical value indicating the amount of time elapsed since a common reference point. Here are some examples of things you might extract from a precise timestamp:
- Hour of the day
- Elapsed time since another event
- Time of day (morning, afternoon, evening, night)
- Whether some facility was open or closed at that time
- Frequency of an event (in combination with other instances)
- Position of the sun (in combination with latitude and longitude)
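The first three items can be sketched with Python's standard library. This example assumes the timestamp is a count of seconds since the Unix epoch and interprets it in UTC. In practice you would probably convert to the local time zone of the event first. The hour boundaries for the four time-of-day categories are arbitrary illustrative choices.

```python
from datetime import datetime, timezone


def time_of_day(hour):
    """Reduce an hour (0-23) to one of four categories (assumed boundaries)."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def timestamp_features(timestamp, reference_timestamp):
    """Derive simple features from a Unix timestamp, interpreted in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {
        "hour": moment.hour,
        "time_of_day": time_of_day(moment.hour),
        "seconds_since_reference": timestamp - reference_timestamp,
    }


# Hypothetical event and an earlier reference event.
event = datetime(2021, 3, 4, 14, 30, tzinfo=timezone.utc).timestamp()
previous = datetime(2021, 3, 4, 12, 0, tzinfo=timezone.utc).timestamp()
features = timestamp_features(event, previous)
print(features)
```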
Here are some important things to note about the examples above:
- You can combine multiple attributes to make one generalizable feature. For example, address and timestamp can get you the position of the sun.
- You can use feature engineering to simplify data. For example, timestamp to time of day takes an attribute with seemingly countless values and reduces it to four categories.
- You can get useful features, and reduce the number of instances in your dataset, by engineering across instances. For example, use multiple instances to calculate the frequency of something.
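Engineering across instances can be sketched as an aggregation. The example below collapses a hypothetical event log, with one instance per event, into one instance per user that carries a frequency feature. The field names and the two-day observation window are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical event log: one instance per event.
events = [
    {"user_id": "a", "timestamp": 1000},
    {"user_id": "b", "timestamp": 1500},
    {"user_id": "a", "timestamp": 2000},
    {"user_id": "a", "timestamp": 90000},
]


def event_frequency(rows, window_days):
    """Collapse per-event rows into one row of frequency features per user."""
    counts = Counter(row["user_id"] for row in rows)
    return {
        user: {"event_count": n, "events_per_day": n / window_days}
        for user, n in counts.items()
    }


# Four event instances become two user instances.
user_features = event_frequency(events, window_days=2)
print(user_features)
```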
When you're done you'll have a list of features to include when training your model.
One of the most difficult parts of the process is deciding when you have the right set of features. It's sometimes difficult to know which features are likely to affect your prediction accuracy. Machine learning experts often stress that it's a field that requires flexibility and experimentation. You'll never get it perfect on the first try, so make your best guess and use the results to inform your next iteration.
Preprocess the features
So far this page has described generally applicable steps to take when getting your data ready to train your model. It hasn't mattered up to this point how your data is represented and formatted. Preprocessing is the next step: getting your prepared data into a format that works with the tools and techniques you use to train a model.
Data formats and Cloud ML Engine
Cloud ML Engine doesn't get involved in your data format; you can use whatever input format is convenient for your training application. That said, you'll need to have your input data in a format that TensorFlow can read. You also need to have your data in a location that your Cloud ML Engine project can access. The simplest solution is often to use a CSV file in a Google Cloud Storage bucket that your Google Cloud Platform project has access to. Some types of data, such as sparse vectors and binary data, can be better represented using TensorFlow's tf.train.Example format serialized in a TFRecord file.
There are many transformations that might be useful to perform on your raw feature data. Some of the more common ones are:
- Normalizing numerical values to be represented at a consistent scale (commonly a range between -1 and 1 or between 0 and 1).
- Representing non-numeric data numerically, such as changing categorical features to index values or one-hot vectors.
- Changing raw text strings to a more compact representation, like a bag of words.
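The three transformations above can be sketched in plain Python. These are minimal illustrations with hypothetical helper names, not the functions a TensorFlow training application would necessarily use. The bag-of-words version splits on whitespace only, which is a simplifying assumption.

```python
from collections import Counter


def min_max_scale(values):
    """Scale numbers linearly into the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(value - low) / (high - low) for value in values]


def one_hot(value, vocabulary):
    """Represent a categorical value as a one-hot vector over a fixed vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


def bag_of_words(text):
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())


scaled = min_max_scale([10, 20, 30])
encoded = one_hot("green", ["red", "green", "blue"])
words = bag_of_words("The cat saw the dog")
print(scaled, encoded, dict(words))
```

When scaling, it is generally safer to compute the minimum and maximum from the training data only and reuse those values for the evaluation and test data, so that the held-out subsets do not influence preprocessing.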
Summary of data preparation and preprocessing
Cloud ML Engine doesn't impose specific requirements on your input data, leaving you to use whatever format works for your training application. Follow TensorFlow's data reading procedures.
- Use the raw data you have to get the data you need.
- Split your dataset into training, evaluation, and test subsets.
- Store your data in a location that your Cloud ML Engine project can access—a Cloud Storage bucket is often the easiest approach.
- Transform features to suit the operations you perform on them.