Inclusive ML guide - AutoML

At Google, we’ve been thinking hard about the principles that motivate and shape our work in artificial intelligence (AI). We are committed to a human-centered approach that foregrounds responsible AI practices and products that work well for all people and contexts. These values of responsible and inclusive AI are at the core of the AutoML suite of machine learning products and manifest in the following ways.

AutoML expands the kinds of organizations and individuals who can make AI work for them by offering an easy-to-use, codeless user experience that requires no prior machine learning experience.

Using algorithmic techniques such as transfer learning and Learning to Learn, AutoML lowers the barrier to entry by enabling organizations to build custom models with smaller datasets than are typically required.

AutoML gives you the ability to easily produce ML systems that are meaningful and contextually relevant for you. For instance, if you see that our generic model doesn’t capture slang or language in your domain, you can create a custom model that includes the linguistic features you care about. If you find that generic clothing classification models don't work for the clothing worn by your community, you can train a model that does a better job.

As part of our mission to bring the benefits of machine learning to everyone, we care deeply about mitigating pre-existing biases around societal categories that structure and impact all our lives. At Google, this area of research is called Machine Learning Fairness. On this page, we share our current thinking on this topic and our recommendations for how to use AutoML with fairness in ML in mind.

What is fairness in machine learning?

Fairness in machine learning is an exciting and vibrant area of research and discussion among academics, practitioners, and the broader public. The goal is to understand and prevent unjust or prejudicial treatment of people related to race, income, sexual orientation, religion, gender, and other characteristics historically associated with discrimination and marginalization, when and where it manifests in algorithmic systems or algorithmically aided decision-making.

These fairness challenges emerge in a variety of ways: through societal bias embedded in training datasets, through decisions made during the development of an ML system, or through complex feedback loops that arise when an ML system is deployed in the real world.

When pursuing fairness in machine learning, we see a diversity of valid perspectives and goals. For instance, we may train ML classifiers to predict equally well across all social groups. Or, informed by research on the impact of historical inequities, we might aim to design ML systems that try to correct or mitigate adverse outcomes going forward. These and many other approaches to fairness in machine learning are important and often interrelated.

For further information, see Google’s Responsible AI Practices and Recommended Fairness Practices, Google video on machine learning and human bias, and Moritz Hardt and Solon Barocas’ “Fairness in ML Tutorial.”

Fairness in ML & AutoML

In AutoML, we have an opportunity to promote inclusion and fairness in different ways. As noted earlier, if the machine learning models you can access today don’t fully address the needs of your community or users, due to historical absences or misrepresentation in data, you can create custom models that do a better job. In any custom model you create using AutoML, you can also pursue the goals of fairness in machine learning by including data that helps the model predict equally well across all categories relevant to your use case. These fairness-related actions may help mitigate the risk of the following types of negative consequences associated with some ML systems.

Representational harm

This type of harm occurs when an ML system amplifies or reflects negative stereotypes about particular groups. For example, ML models generating image search results or automated text suggestions are often trained on previous user behavior (e.g. common search terms or comments) that can lead to offensive results. In addition to offending an individual user in the moment, representational harm also has diffuse and long-term societal effects on large groups of people.

Opportunity denial

ML systems are increasingly used to make predictions and decisions that have real-life consequences and lasting impacts on individuals’ access to opportunities, resources, and overall quality of life.

Disproportionate product failure

In some cases, unfairness is a matter of basic usability and access. For example, facial recognition software used to unlock a gaming console had disproportionately high failure rates for individuals with darker skin tones, effectively preventing darker-skinned people from using that feature.

In the following section, we share some steps you can take as you build your custom models in AutoML and use them in your ML systems. We will focus on mitigating bias in training datasets, evaluating your custom models for disparities in performance, and things to consider as you use your custom model.

What are some first steps in assessing your use case for fairness in machine learning?

Consider your product’s context and use.

In some cases, fairness is a matter of basic usability and access as described above.

In other cases, fairness intersects with laws and regulations that restrict the use of data that directly identifies, or is highly correlated with, certain sensitive characteristics, even if that data would be statistically relevant. In some contexts, people with those characteristics are also legally protected against discrimination (e.g., as members of “protected classes”).

In yet other cases, unfairness is not immediately obvious, but requires asking nuanced social, political and ethical questions. For example, when using AI to generate automated text or translations, what types of bias or stereotypes may be ethically problematic (e.g. associating gender with job types, or religion with political views)?

Therefore, it is important to think about how your ML system might be used in practice or how it may allow bias to creep in over time. Review discrimination-related regulations in both your region and the locations your application will serve, as well as existing research or product information in your domain to learn about common fairness issues.

Consider the following key questions.

If you answer “yes” to any of the following questions, you may want to conduct a more thorough analysis of your use case for potential bias-related issues.

Does your use case or product specifically use any of the following data: biometrics, race, skin color, religion, sexual orientation, socioeconomic status, income, country, location, health, language, or dialect?

Does your use case or product use data that is likely to be highly correlated with any of the personal characteristics listed above (for example, zip code or other geospatial data is often correlated with socioeconomic status and/or income; image/video data can reveal information about race, gender, and age)?

Could your use case or product negatively impact individuals’ economic or other important life opportunities?

Now that you’ve learned about some important aspects of fairness in machine learning, let’s look at approaches you can take as you move through the different steps in the AutoML workflow.

Data Guidelines

Let’s start with the first step in AutoML: putting together your training data. While no training data will be perfectly “unbiased”, you can greatly improve your chances of building a better, more inclusive product if you carefully consider potential sources of bias in your data and take steps to address them.

What kind of biases can exist in data?

Biased data distribution

This occurs when your training data is not truly representative of the population that your product seeks to serve. Think carefully about how your data was collected. For example, if you have a dataset of user-submitted photos and you filter it for image clarity, this could skew your data by overrepresenting users who have expensive cameras. In general, consider how your data is distributed with respect to the groups of users your product will serve. Do you have enough data for each relevant group? There are often subtle, systemic reasons why your dataset might not capture the full diversity of your use case in the real world.

To mitigate this, you could try to acquire data from multiple sources, or filter data carefully to ensure you only take the most useful examples from overrepresented groups.
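
To make this concrete, here is a minimal sketch of a distribution check, assuming your dataset can be exported as a CSV with hypothetical `group` and `label` columns; the file name and column names are illustrative, not AutoML requirements, so adapt them to your own schema.

```python
# Minimal sketch: check how training examples are distributed across the
# user groups your product serves. The file name and the "group" and
# "label" columns are illustrative assumptions.
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical export of your dataset

# Absolute and relative counts per group.
group_counts = df["group"].value_counts()
print(group_counts)
print((group_counts / len(df)).round(3))

# Cross-tabulate groups against labels to spot groups that have
# little or no data for some categories.
print(pd.crosstab(df["group"], df["label"]))
```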

Biased data representation

It's possible that you have an appropriate amount of data for every demographic group you can think of, but that some groups are represented less positively than others. Consider a dataset of microblog posts about actors. You may have done a great job collecting a 50-50 split of posts about male and female performers, but when you dig into the content, posts about female performers tend to be more negative than those about male performers. This could lead your model to learn some form of gender bias. For some applications, however, different representations between groups may not be a problem. In medical classification, for instance, it's important to capture subtle demographic differences to make more accurate diagnoses. But for other applications, biased negative associations may have financial or educational repercussions, limit economic opportunity, and cause emotional and mental anguish.

Consider hand-reviewing your data for these negative associations if it's feasible, or applying rule-based filters to remove negative representations if you think it's right for your application.
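
As a rough illustration of a rule-based filter, the sketch below flags examples whose text contains terms from a hand-built review list so a person can decide whether to keep, relabel, or drop them; the terms, file name, and column names are purely illustrative assumptions.

```python
# Minimal sketch: flag examples containing terms from a hand-built review
# list for human inspection before training. Terms and columns are
# illustrative; build your own list from a review of your data.
import pandas as pd

REVIEW_TERMS = {"terrible", "incompetent", "talentless"}  # illustrative only

def needs_review(text: str) -> bool:
    """Return True if any review-list term appears in the text."""
    tokens = set(str(text).lower().split())
    return bool(tokens & REVIEW_TERMS)

df = pd.read_csv("microblog_posts.csv")  # hypothetical dataset
df["needs_review"] = df["text"].apply(needs_review)
print(df[df["needs_review"]][["text"]].head(20))
```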

Biased labels

An essential step in creating training data for AutoML is labeling your data with relevant categories. Minimizing bias in these labels is just as important as ensuring your data is representative. Understand who your labelers are. Where are they located? What languages do they speak natively? What ages and genders are they? Homogeneous rater pools can yield labels that are incorrect or skewed in ways that might not be immediately obvious.

Ideally, make sure your labelers are experts in your domain or give instructions to train them on relevant aspects, and have a secondary review process in place to spot-check label quality. Aim to optimize for objectivity over subjectivity in decision-making. Training labelers on “unconscious bias” has also been shown to help improve the quality of labels with respect to diversity goals. Finally, allowing labelers to self-report issues and ask clarifying questions about instructions can also help minimize bias in the labeling process.

Tip: If you’re using the human labeling service in AutoML, consider the following guidelines as you write your instructions.

Create labeling instructions and training materials with detailed context about your use case and a description of your end-users. Labeling instructions should be specific and provide illustrative examples that help labelers keep the diversity of your user base in mind.

Review any comments you receive from raters to identify areas of confusion, and pay close attention to any sensitive categories as you spot check, approve, and reject the data labels you receive back.

Once your dataset is ready, consider specifying the test/train split

In the ML beginner's guide, we discussed how your dataset is divided in the machine learning process. As noted, in AutoML you can either have Google automatically split your dataset or manually specify the test/train split. If your use case warrants it, consider the second option.

While splitting your data manually, apply the guidance we've covered so far to create diverse and inclusive test sets. If you use all your best, most inclusive data for training, your test set won't reflect that diversity, and you may get an overly rosy picture of model performance on underrepresented subgroups. If you have scarce data about a particular subgroup, make sure that data is spread representatively between your training and test sets by performing the train/test split yourself.
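
One way to do this, sketched below, is to stratify the split on a combined label-and-subgroup key with scikit-learn and then record each row's split before importing the data. The `label`, `group`, and `ml_use` column names are assumptions for illustration; check the import documentation for your AutoML product for the exact format it expects for manual splits.

```python
# Minimal sketch: split the data yourself so each (label, subgroup)
# combination is represented proportionally in both training and test sets,
# instead of leaving scarce subgroups to chance. Columns are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")  # hypothetical dataset

# Stratify on label and subgroup together. Note: every stratum needs at
# least two rows for train_test_split to stratify on it.
strata = df["label"].astype(str) + "|" + df["group"].astype(str)
train_df, test_df = train_test_split(df, test_size=0.2, stratify=strata, random_state=42)

# Record the split so you can declare it when importing into AutoML.
out = pd.concat([train_df.assign(ml_use="TRAIN"), test_df.assign(ml_use="TEST")])
out.to_csv("data_with_manual_split.csv", index=False)
```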

Review your training data

  • Do all your categories have the recommended number of data items?
  • Do your categories and images/text represent the diversity of your user base?
  • Is the distribution approximately equal across classes?
  • Does your training data (images, text, sentence pairs) match the type of data you want your model to make predictions on?
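
If you'd like to automate part of this review, the sketch below turns the first and third questions into quick checks; the minimum-count threshold and column names are illustrative assumptions, so substitute the recommended minimums for your AutoML product and your own schema.

```python
# Minimal sketch: quick automated checks for the review questions above.
# Threshold and column names are illustrative assumptions.
import pandas as pd

MIN_EXAMPLES_PER_LABEL = 100  # substitute your product's recommended minimum

df = pd.read_csv("training_data.csv")  # hypothetical dataset
label_counts = df["label"].value_counts()

# Do all categories have the recommended number of data items?
too_small = label_counts[label_counts < MIN_EXAMPLES_PER_LABEL]
if not too_small.empty:
    print("Labels below the recommended minimum:")
    print(too_small)

# Is the distribution approximately equal across classes?
print(f"Largest/smallest class ratio: {label_counts.max() / label_counts.min():.1f}")
```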

Evaluate: Assess your model's performance

Evaluating your model for fairness requires thinking deeply about your particular use case and the impact your model could have on your end users when it gets things wrong. This means understanding the impact of different types of errors on different user groups: for example, do model errors affect all users equally, or are they more harmful for certain user groups?

Once you've thought this through, you'll be better able to decide which performance metric to optimize for (precision vs. recall), evaluate the trade-offs between them, and examine examples of errors to check for bias.
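
One practical way to examine this is to run your trained model over a test set you have annotated with the groups you care about and compare precision and recall per group, as in the hedged sketch below. AutoML's built-in metrics are aggregate, so the `group`, `true_label`, and `predicted_label` columns here are assumptions about an export you would assemble yourself.

```python
# Minimal sketch: compare precision and recall across user groups on a
# test set annotated with a hypothetical "group" column, using predictions
# you have exported yourself.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

preds = pd.read_csv("test_predictions.csv")  # hypothetical export

for group, rows in preds.groupby("group"):
    # "positive" is a placeholder for whichever label matters in your use case.
    p = precision_score(rows["true_label"], rows["predicted_label"], pos_label="positive")
    r = recall_score(rows["true_label"], rows["predicted_label"], pos_label="positive")
    print(f"{group}: precision={p:.2f}, recall={r:.2f}, n={len(rows)}")
```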

Use case: Passport photo evaluation

Let's say you want to create a tool to help people edit and print passport photos. Each country has its own rules about photo dimensions, framing, acceptable background colors, acceptable facial expressions, and other things that may or may not be in the picture. You want to alert people before they send in a passport application that their photo may not be acceptable.

False positive:

A false positive in this case would be when the system marks a photo as unacceptable when in fact the country's passport authority would have accepted it. This is a minor inconvenience: the user retakes the photo, and the retake is even more likely to be usable.

False negative:

A false negative in this case would be a failure to detect an unusable picture. The customer goes to the expense of printing a photo and submitting an application, only to have it rejected. Worst case, they miss a planned trip because they couldn't get a passport in time.

Fairness considerations: In this case, it would be important to check whether the model produces false negatives more frequently for certain groups of people, for example based on race or gender. In AutoML, this can be done by examining individual false negatives to check for problematic patterns.

Optimize for: In this case, you would likely want to optimize for Recall. This aims to reduce the number of false negatives, which in this scenario are the more problematic errors.
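
For the passport example, the sketch below compares false-negative rates across groups on a self-annotated test set; the file name, the column names, and the "unacceptable" label value are illustrative assumptions.

```python
# Minimal sketch for the passport example: measure how often unusable
# photos slip through (false negatives) for each group. File, columns,
# and the "unacceptable" label value are illustrative assumptions.
import pandas as pd

preds = pd.read_csv("passport_test_predictions.csv")  # hypothetical export

for group, rows in preds.groupby("group"):
    actually_bad = rows["true_label"] == "unacceptable"
    missed = actually_bad & (rows["predicted_label"] != "unacceptable")
    fnr = missed.sum() / max(actually_bad.sum(), 1)
    print(f"{group}: false-negative rate={fnr:.2f} over {actually_bad.sum()} unusable photos")
```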

Use case: Kids' content filter

You're building a reading app for kids and want to create a digital library of age-appropriate books to include in the app. You want to design a text classifier that selects children's books from a database of adult and children's books based on the title and description of each book.

False positive:

A false positive in this case would be an adult book that is incorrectly classified as a children's book and therefore gets added to the kids' reading app. This is problematic as the app could expose children to age-inappropriate content. Parents would be very upset and likely delete the app.

False negative:

A false negative in this case would be a children's book that gets flagged incorrectly as an adult book and is therefore excluded from the in-app library. Depending on the book, this could be a minor inconvenience (e.g. excluding an obscure sequel of an unpopular series) or much more problematic, for example if the children's book includes content considered controversial by some people but that is generally accepted to have clear educational or social value.

Fairness considerations: While at first glance this may seem like a simple case, it highlights some of the complexities of evaluating use cases for fairness. On the one hand, there is a clear need to avoid false positives (minimize the likelihood that children are exposed to age-inappropriate content). On the other hand, false negatives can also be harmful. For example, if the text classifier tends to flag children's books with LGBTQ themes (for instance, stories about children with two parents of the same gender) as inappropriate, this is problematic. Similarly, if books about certain cultures or locations are excluded more commonly than others, this is equally concerning.

Optimize for: In this case, you would likely want to optimize for Precision. Of all the children's books in the world, your app will only surface a small fraction of them, so you can afford to be picky about which to show users. However, you'd also want to consider UX solutions for how to surface books that might require a parent's input. For example, you could add a feature that recommends parents read a book with children, so they can talk through issues the book raises.

Predict: Smoke test your model

Once you've evaluated your model's performance on fairness using the machine learning metrics in AutoML, you can try out your custom model with new images or text in the Predict tab. While doing so, consider the following fairness recommendations:

Think carefully about your problem domain and its potential for unfairness and bias. You know your area best: is your image classifier likely to be affected by the races or genders of people in images? Is your text classifier likely to be sensitive to terms that refer to demographic groups? Does the language pair for which you're building a translator have cultural differences that may be highlighted, or a mismatched set of pronouns that might end up exposing an underlying societal bias? Come up with cases that would adversely impact your users if they were found in production, and test those on the Predict page or in your own unit tests.

Remember that the absence of a clear prediction (false negatives), and not only offensive or unfair predictions, could negatively impact your users as well. If you find that the results are not aligned with the experience you'd like to create for all of your end-users, you can further debias your dataset by adding more data to relevant classes, or you can use your model in a manner that corrects for any issues you've found.

Use: Your model in production

Implement simple fixes. If your model isn't perfect, retraining with new data isn't the only answer. Sometimes a simple pre- or post-processing step to remove certain words or types of images can be an effective solution.

Adjust the score thresholds of your model to find an acceptably 'fair' balance between precision and recall, given your understanding of how different error types impact your users.
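
A hedged sketch of that threshold tuning: sweep candidate thresholds over prediction scores you have exported and look at precision and recall per group, so the threshold you pick is acceptable for every group rather than only on average. The `score`, `is_positive`, and `group` columns are assumptions about your own export.

```python
# Minimal sketch: sweep candidate score thresholds and report precision and
# recall per group, so the chosen threshold is acceptably "fair" for every
# group and not just on average. Column names are illustrative assumptions.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

preds = pd.read_csv("scored_predictions.csv")  # hypothetical export with "score", boolean "is_positive", "group"

for threshold in (0.3, 0.5, 0.7, 0.9):
    preds["predicted"] = preds["score"] >= threshold
    for group, rows in preds.groupby("group"):
        p = precision_score(rows["is_positive"], rows["predicted"])
        r = recall_score(rows["is_positive"], rows["predicted"])
        print(f"threshold={threshold} {group}: precision={p:.2f}, recall={r:.2f}")
```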

Once your model is built and serving predictions, your data distribution may change subtly over time, and your model may no longer reflect the relevant contexts of your application. Be sure to monitor model performance over time to ensure it's doing as well as you expect, and collect feedback from your users to identify potential issues that may require new data and retraining.

Sometimes corner cases come up that you just didn't think about. Come up with an incident response plan in case your model misbehaves in a manner that adversely impacts your users and your business.

Feedback

This is a living document and we are learning as we go. We'd love your feedback on the guidance we've given here. Send an email to inclusive-ml-feedback@google.com to tell us about your experience in creating your custom models, what worked, and what didn't. We look forward to your feedback!