Introduction
Imagine you work with an architectural preservation board that's attempting to identify neighborhoods that have a consistent architectural style in your city. You have hundreds of thousands of snapshots of homes to sift through, but it's tedious and error-prone to try to categorize all these images by hand. An intern labeled a few hundred of them a few months ago, and nobody has touched them since. It'd be so useful if you could just teach your computer to do this for you!
Why is Machine Learning (ML) the right tool for this problem?
Classical programming requires the programmer to specify step-by-step
instructions for the computer to follow. While this approach works for solving
a wide variety of problems, it isn't up to the task of categorizing homes the
way you'd like to. There's so much variation in composition, colors, angles,
and stylistic details; you can't imagine coming up with a step-by-step set of
rules that could tell a machine how to decide whether a photograph of a
single-family residence is Craftsman or Modern. It's hard to imagine where
you'd even begin. Fortunately, machine learning systems are well-positioned
to solve this problem.
Is the Vision API or AutoML the right tool for me?
The Vision API classifies images into thousands of predefined categories, detects individual objects and faces within images, and finds and reads printed words contained within images. If you want to detect individual objects, faces, and text in your dataset, or your need for image classification is quite generic, try out the Vision API and see if it works for you. But if your use case requires that you use your own labels to classify your images instead, it's worth experimenting with a custom classifier to see if it fits your needs.
- Try the Vision API
- Get started with AutoML
What does machine learning in AutoML involve?
Machine learning involves using data to train algorithms to achieve a desired
outcome. The specifics of the algorithm and training methods change based on
the use case. There are many different subcategories of machine learning, all
of which solve different problems and work within different constraints.
AutoML Vision enables you to perform supervised learning, which
involves training a computer to recognize patterns from labeled data. Using
supervised learning, we can train a model to recognize the patterns and content
that we care about in images.
Data Preparation
To train a custom model with AutoML Vision, you will need to supply labeled examples of the kinds of images (inputs) you would like to classify, and the categories or labels (the answers) you want the ML system to predict.
Assess your use case

While putting together the dataset, always start with the use case. You can begin with the following questions:
- What is the outcome you’re trying to achieve?
- What kinds of categories would you need to recognize to achieve this outcome?
- Is it possible for humans to recognize those categories? Although AutoML Vision can handle many more categories than a human could remember and assign at any one time, if a human cannot recognize a specific category, then AutoML Vision will have a hard time as well.
- What kinds of examples would best reflect the type and range of data your system will classify?
A core principle underpinning Google’s ML products is human centered machine learning, an approach that foregrounds responsible AI practices including fairness. The goal of fairness in ML is to understand and prevent unjust or prejudicial treatment of people related to race, income, sexual orientation, religion, gender, and other characteristics historically associated with discrimination and marginalization, when and where they manifest in algorithmic systems or algorithmically aided decision-making. You can read more in our guide and find “fair-aware” notes ✽ in the guidelines below. As you move through the guidelines for putting together your dataset, we encourage you to consider fairness in machine learning where relevant to your use case.
Source your data
Once you’ve established what data you will need, you need to find a way to source it. Start by taking stock of all the data your organization collects; you may find that you’re already gathering the data you would need to train a model. If you don’t have the data you need, you can obtain it manually or outsource it to a third-party provider.
Include enough labeled examples in each category
The bare minimum required by AutoML Vision training is 100 image
examples per category/label. The likelihood of successfully recognizing a
label goes up with the number of high quality examples for each; in general,
the more labeled data you can bring to the training process, the better your
model will be. Target at least 1000 examples per label.
Distribute examples equally across categories
It’s important to capture roughly similar amounts of training examples for each category. Even if you have an abundance of data for one label, it is best to have an equal distribution for each label. To see why, imagine that 80% of the images you use to build your model are pictures of single-family homes in a modern style. With such an unbalanced distribution of labels, your model is very likely to learn that it's safe to always tell you a photo is of a modern single-family house, rather than going out on a limb to try to predict a much less common label. It's like writing a multiple-choice test where almost all the correct answers are "C" - soon your savvy test-taker will figure out it can answer "C" every time without even looking at the question.
We understand it may not always be possible to source an approximately equal number of examples for each label. High quality, unbiased examples for some categories may be harder to source. In those circumstances, you can follow this rule of thumb - the label with the lowest number of examples should have at least 10% as many examples as the label with the highest number. So if the largest label has 10,000 examples, the smallest label should have at least 1,000 examples.
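The rule of thumb above is easy to check programmatically. The sketch below flags any label that falls under 10% of the largest label's count; the label names and counts are hypothetical.

```python
# Check the rule of thumb: the smallest label should have at least
# 10% as many examples as the largest label.
# The label names and counts below are hypothetical.
label_counts = {
    "craftsman": 10_000,
    "modern": 4_500,
    "tudor": 800,  # below 10% of the largest label
}

largest = max(label_counts.values())
minimum_allowed = largest * 0.10

for label, count in label_counts.items():
    if count < minimum_allowed:
        print(f"'{label}' has {count} examples; aim for at least {int(minimum_allowed)}.")
```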
Capture the variation in your problem space
For similar reasons, try to ensure that your data captures the variety and diversity of your problem space. The broader a selection the model training process gets to see, the more readily it will generalize to new examples. For instance, if you're trying to classify photos of consumer electronics into categories, the wider a variety of consumer electronics the model is exposed to in training, the more likely it will be able to distinguish between a novel model of tablet, phone, or laptop, even if it's never seen that specific model before.
Match data to the intended output for your model
Find images that are visually similar to what you’re planning to make predictions on. If you are trying to classify house images that were all taken in snowy winter weather, you probably won't get great performance from a model trained only on house images taken in sunny weather even if you've tagged them with the classes you're interested in, as the lighting and scenery may be different enough to affect performance. Ideally, your training examples are real-world data drawn from the same dataset you're planning to use the model to classify.
Consider how AutoML Vision uses your dataset in creating a custom model
Your dataset contains training, validation, and test sets. If you do not specify the splits (see Preparing your data), AutoML Vision automatically uses 80% of your images for training, 10% for validation, and 10% for testing.
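A minimal sketch of that 80/10/10 split, mirroring what AutoML Vision does automatically when you don't supply splits yourself. The image file names are hypothetical.

```python
import random

# Shuffle, then slice into 80% train / 10% validation / 10% test.
# `image_paths` is a hypothetical list of image file names.
image_paths = [f"house_{i}.jpg" for i in range(1000)]

random.seed(42)  # make the split reproducible
random.shuffle(image_paths)

n = len(image_paths)
train = image_paths[: int(n * 0.8)]
validation = image_paths[int(n * 0.8) : int(n * 0.9)]
test = image_paths[int(n * 0.9) :]

print(len(train), len(validation), len(test))  # 800 100 100
```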
Training Set
The vast majority of your data should be in the training set. This is the data
your model "sees" during training: it's used to learn the parameters of the
model, namely the weights of the connections between nodes of the neural network.
Validation Set
The validation set, sometimes also called the "dev" set, is also used during
the training process. After the model learning framework incorporates training
data during each iteration of the training process, it uses the model's performance on the validation set to tune the model's hyperparameters, which are variables
that specify the model's structure. If you tried to use the training set to
tune the hyperparameters, it's quite likely the model would end up overly
focused on your training data, and have a hard time generalizing to examples
that don't exactly match it. Using a somewhat novel dataset to fine-tune model
structure means your model will generalize better.
Test Set
The test set is not involved in the training process at all. Once training is complete, we use the test set as an entirely new challenge for your model. Its performance on the test set is intended to give you a good idea of how your model will perform on real-world data.
Manual Splitting
You can also split your dataset yourself. Manually splitting your data is a
good choice when you want to exercise more control over the process or if
there are specific examples that you're sure you want included in a certain
part of your model training lifecycle.
Prepare your data for import

Once you’ve decided if a manual or automatic split of your data is right for you, there are three ways to add data in AutoML Vision:
- You can import data with your images sorted and stored in folders that correspond to your labels.
- You can import data from Google Cloud Storage in CSV format with the labels inline. To learn more, visit our documentation.
- If your data hasn’t been labeled yet, you can also upload unlabeled image examples and use the AutoML Vision UI to apply labels to each one.
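The CSV option above, with labels inline, might look like the following sketch. The optional first column assigns each row to a split; the bucket and file names here are hypothetical, and the documentation has the authoritative format.

```csv
TRAIN,gs://my-bucket/houses/craftsman_001.jpg,craftsman
TRAIN,gs://my-bucket/houses/modern_014.jpg,modern
VALIDATION,gs://my-bucket/houses/tudor_203.jpg,tudor
TEST,gs://my-bucket/houses/craftsman_077.jpg,craftsman
```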
Evaluate
Once your model is trained, you will receive a summary of your model performance. Click “evaluate” or “see full evaluation” to view a detailed analysis.
What should I keep in mind before evaluating my model?
Debugging a model is more about debugging the data than the model itself. If
at any point your model starts acting in an unexpected manner as you’re
evaluating its performance before and after pushing to production, you should
return and check your data to see where it might be improved.
What kinds of analysis can I perform in AutoML Vision?
In the AutoML Vision evaluate section, you can assess your custom model’s performance using the model’s output on test examples and common machine learning metrics. In this section, we will cover what each of these concepts means.
- The model output
- The score threshold
- True positives, true negatives, false positives, and false negatives
- Precision and recall
- Precision/recall curves
- Average precision
How do I interpret the model’s output?
AutoML Vision pulls examples from your test data to present entirely new challenges for your model. For each example, the model outputs a series of numbers that communicate how strongly it associates each label with that example. If the number is high, the model has high confidence that the label should be applied to that image.
What is the Score Threshold?
We can convert these probabilities into binary ‘on’/’off’ values by setting a score threshold. The score threshold refers to the level of confidence the model must have to assign a category to a test item. The score threshold slider in the UI is a visual tool to test the impact of different thresholds for all categories and individual categories in your dataset. If your score threshold is low, your model will classify more images, but runs the risk of misclassifying a few images in the process. If your score threshold is high, your model will classify fewer images, but it will have a lower risk of misclassifying images. You can tweak the per-category thresholds in the UI to experiment. However, when using your model in production, you will have to enforce the thresholds you found optimal on your side.
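Applying a score threshold amounts to keeping only the labels whose confidence clears the cutoff. A minimal sketch, with hypothetical scores for a single image:

```python
# Keep only the labels whose confidence meets the score threshold.
# The scores below are hypothetical confidence values for one image.
scores = {"craftsman": 0.92, "modern": 0.40, "tudor": 0.07}
threshold = 0.5

predicted_labels = [label for label, score in scores.items() if score >= threshold]
print(predicted_labels)  # ['craftsman']
```

Lowering `threshold` to 0.3 would also admit "modern", illustrating the tradeoff described above: more images classified, at a higher risk of misclassification.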
What are True Positives, True Negatives, False Positives, False Negatives?
After applying the score threshold, the predictions made by your model will fall into one of the following four categories:
- True positive: the model applied the label, and the label was correct.
- True negative: the model did not apply the label, and the label indeed did not apply.
- False positive: the model applied a label that the image should not have received.
- False negative: the model failed to apply a label that the image should have received.
We can use these categories to calculate precision and recall, metrics that help us gauge the effectiveness of our model.
What are precision and recall?
Precision and recall help us understand how well our model is capturing information, and how much it’s leaving out. Precision tells us, from all the test examples that were assigned a label, how many actually were supposed to be categorized with that label. Recall tells us, from all the test examples that should have had the label assigned, how many were actually assigned the label.
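Both metrics follow directly from the four categories above. A minimal sketch for a single label, using hypothetical prediction/ground-truth pairs:

```python
# Precision and recall for one label from prediction/ground-truth pairs.
# The data is hypothetical: True means "has the label".
predictions = [True, True, True, False, False, True]
actuals     = [True, False, True, True, False, True]

tp = sum(p and a for p, a in zip(predictions, actuals))      # true positives
fp = sum(p and not a for p, a in zip(predictions, actuals))  # false positives
fn = sum(not p and a for p, a in zip(predictions, actuals))  # false negatives

precision = tp / (tp + fp)  # of everything labeled, how much was right
recall = tp / (tp + fn)     # of everything that should be labeled, how much was found
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```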
Should I optimize for precision or recall?
Depending on your use case, you may want to optimize for either precision or recall. Let’s examine how you might approach this decision with the following two use cases.
Use Case: Privacy in images
Let’s say you want to create a system that automatically detects sensitive information and blurs it out.
False positives in this case would be things that don’t need to be blurred that get blurred, which can be annoying but not detrimental.
False negatives in this case would be things that need to be blurred that fail to get blurred, like a credit card, which can lead to identity theft.
In this case, you would want to optimize for recall. This metric measures, of all the items that should have been flagged, how many were actually caught. A high-recall model is likely to label marginally relevant examples, which is useful for cases where your category has scarce training data.
Use case: Stock photo search
Let’s say you want to create a system that finds the best stock photo for a given keyword.
A false positive in this case would be returning an irrelevant image. Since your product prides itself on returning only the best-match images, this would be a major failure.
A false negative in this case would be failing to return a relevant image for a keyword search. Since many search terms have thousands of photos that are a strong potential match, this is fine.
In this case, you would want to optimize for precision. This metric measures, of all the predictions made, how many were correct. A high-precision model is likely to label only the most relevant examples, which is useful for cases where your class is common in the training data.
How do I use the Confusion Matrix?
We can compare the model’s performance on each label using a confusion matrix. In an ideal model, all the values on the diagonal will be high, and all the other values will be low. This shows that the desired categories are being identified correctly. If any other values are high, it gives us a clue into how the model is misclassifying test images.
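A confusion matrix is just a count of (actual label, predicted label) pairs. The sketch below builds one for hypothetical labels; a perfect model would put every count on the diagonal.

```python
from collections import Counter

# Build a confusion matrix by counting (actual, predicted) label pairs.
# The labels and predictions below are hypothetical.
labels = ["craftsman", "modern", "tudor"]
actual    = ["craftsman", "craftsman", "modern", "modern", "tudor", "tudor"]
predicted = ["craftsman", "modern",    "modern", "modern", "tudor", "craftsman"]

counts = Counter(zip(actual, predicted))
# Row = actual label, column = predicted label.
matrix = [[counts[(a, p)] for p in labels] for a in labels]

for label_row, row in zip(labels, matrix):
    print(f"{label_row:>9}: {row}")
# Diagonal entries are correct predictions; off-diagonal entries show
# which pairs of labels the model confuses.
```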
How do I interpret the precision-recall curves?
The score threshold tool allows you to explore how your chosen score threshold affects your precision and recall. As you drag the slider on the score threshold bar, you can see where that threshold places you on the precision-recall tradeoff curve, as well as how it affects your precision and recall individually (for multiclass models, precision and recall on these graphs are calculated using only the top-scored label in the set of labels we return). This can help you find a good balance between false positives and false negatives.
Once you've chosen a threshold that seems to be acceptable for your model on the whole, you can click individual labels and see where that threshold falls on their per-label precision-recall curve. In some cases, it might mean you get a lot of incorrect predictions for a few labels, which might help you decide to choose a per-class threshold that's customized to those labels. For example, let's say you look at your houses dataset and notice that a threshold at 0.5 has reasonable precision and recall for every image type except "Tudor", perhaps because it's a very general category. For that category, you see tons of false positives. In that case, you might decide to use a threshold of 0.8 just for "Tudor" when you call the classifier for predictions.
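When you call the classifier for predictions, enforcing a per-label threshold like the "Tudor" example above is a small amount of client-side logic. A sketch, with hypothetical label names and scores:

```python
# Apply a stricter cutoff for "tudor" while other labels keep the default.
# All label names and score values below are hypothetical.
default_threshold = 0.5
per_label_thresholds = {"tudor": 0.8}

scores = {"craftsman": 0.55, "modern": 0.30, "tudor": 0.65}

predicted = [
    label
    for label, score in scores.items()
    if score >= per_label_thresholds.get(label, default_threshold)
]
print(predicted)  # ['craftsman'] -- 'tudor' at 0.65 fails its stricter 0.8 cutoff
```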
What is average precision?
A useful metric for model accuracy is the area under the precision-recall curve. It measures how well your model performs across all score thresholds. In AutoML Vision, this metric is called Average Precision. The closer to 1.0 this score is, the better your model is performing on the test set; a model guessing at random for each label would get an average precision around 0.5.
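One common way to compute average precision is to walk down the predictions ranked by score, treating each rank as a threshold, and sum precision weighted by the change in recall. A rough sketch with hypothetical scores and labels:

```python
# Average precision as the area under the precision-recall curve:
# sweep thresholds down the ranked list and sum precision * delta-recall.
# Scores are sorted descending; actuals mark which examples truly
# have the label. All values are hypothetical.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
actuals = [1, 1, 0, 1, 0, 0]  # 1 = example truly has the label

total_positives = sum(actuals)
tp = 0
prev_recall = 0.0
average_precision = 0.0

# Each position in the ranked list acts as a threshold just below its score.
for rank, truth in enumerate(actuals, start=1):
    tp += truth
    precision = tp / rank
    recall = tp / total_positives
    average_precision += precision * (recall - prev_recall)
    prev_recall = recall

print(f"{average_precision:.3f}")  # 0.917
```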
Testing your model
AutoML Vision uses 10% of your data automatically (or, if you chose
your data split yourself, whatever percentage you opted to use) to test the
model, and the "Evaluate" page tells you how the model did on that test data.
But just in case you want to confidence check your model, there are a few ways
to do it. The easiest is to upload a few images on the "Predict" page, and look
at the labels the model chooses for your examples. Hopefully, this matches your
expectations. Try a few examples of each type of image you expect to receive.
If you'd like to use your model in your own automated tests instead, the "Predict" page also tells you how to make calls to the model programmatically.