The Diversity Annual Report is now a BigQuery public dataset
Stephanie Wong
Head of Technical Storytelling
James Heighington
Global Head of Insights & Impact
Google's Diversity Data Now on BigQuery
Since 2014, tech companies have relied on metrics to identify trouble spots, establish baselines, and measure meaningful progress in diversity, including by publishing DEI data directly in diversity annual reports. However, as our understanding of DEI has evolved, each company’s report has diverged, creating a fragmented landscape of industry-wide data. This fragmentation is a problem because no single company’s diversity dataset can solve tech’s DEI challenges. Instead, we need to build industry-wide, systemic solutions to create sustainable change, and those start with establishing a common language for DEI data and with standardizing and sharing that data across tech companies.
The challenge is that most companies publish diversity data in formats that are difficult to analyze, such as bar graphs and pie charts. As a result, researchers cannot easily pull the underlying detailed data or aggregate and analyze it for their own needs (e.g., hiring trends for tech vs. non-technical roles, representation of Asian women in leadership, and more). As we shared in our most recent Diversity Annual Report, external research shows that sustainable change will only come from solutions that encompass the entire tech industry, and data transparency is a critical step in this work.
Google’s Diversity Annual Report public dataset & BigQuery
In May, we released our 2022 Diversity Annual Report, which includes demographic data on workforce representation, hiring, and attrition of employees at Google, including leadership. You can see our hiring data by race/ethnicity, gender, and intersectional hiring over time, by region, and more. In an effort to make this data more transparent and accessible for analysis, we have added it as a public dataset in BigQuery, Google Cloud’s powerful data warehousing tool. Our data is now even easier for researchers, community organizations, and industry groups to leverage and compare against external benchmarks to help contextualize our progress.
As one of the first in the industry, we are proud to have published our diversity data on BigQuery for the second year in a row. Our dataset, among others, is public and is stored and paid for by Google, so those who are interested can use BigQuery’s advanced analytic capabilities through the Google Cloud Public Dataset Program for free*.
Diversity Annual Report public dataset on BigQuery
Contextualizing diversity data is necessary to draw meaningful conclusions
While Google’s diversity dataset can help users compare their own datasets to Google’s current and historical trends, DEI data is only useful when analyzed in the context of other relevant datasets. For example, concluding that Black+ hiring has increased from 8.8% to 9.4% has little meaning unless there is a point of reference, such as US general census data, labor force participation rates, or graduation rates. That is why Google also includes other public datasets in BigQuery, such as related industry DEI data and talent and graduation pools. With these, users can run a sample query that compares Google’s hiring and representation to related industries (software publishers, data processing services, etc.). Users can then better understand and contextualize areas of progress and opportunity, and they can more objectively identify where organizations can take a proactive role in addressing DEI not only in their companies, but also in the communities they call home.
Sample query showing Google’s hiring and representation compared to related industries
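To make the "point of reference" idea concrete, here is a minimal Python sketch of putting a change in a hiring share next to an external benchmark. It is not the sample query pictured above. The function name `contextualize` and the benchmark value passed to it are hypothetical placeholders; only the 8.8% and 9.4% figures come from the example in this post.

```python
def contextualize(previous_pct, current_pct, benchmark_pct):
    """Compare a change in a hiring share against an external reference point.

    All inputs are percentages (e.g. 9.4 for 9.4%). The benchmark could be a
    census share, a labor force participation rate, or a graduation rate.
    """
    if benchmark_pct <= 0:
        raise ValueError("benchmark_pct must be positive")
    return {
        # Year-over-year change, in percentage points.
        "change_pp": round(current_pct - previous_pct, 2),
        # Distance from the benchmark, in percentage points (negative = below).
        "gap_to_benchmark_pp": round(current_pct - benchmark_pct, 2),
        # Current share relative to the benchmark (1.0 = parity).
        "ratio_to_benchmark": round(current_pct / benchmark_pct, 3),
    }


if __name__ == "__main__":
    # 8.8 and 9.4 are from the example in the text.
    # 12.0 is a HYPOTHETICAL placeholder benchmark, not a real statistic.
    print(contextualize(previous_pct=8.8, current_pct=9.4, benchmark_pct=12.0))
```

The same increase can read very differently depending on which benchmark is supplied, which is why the reference dataset matters as much as the company's own numbers.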
Intersectional data and disaggregated baseline metrics
In the workplace, intersectional data is key to understanding the layers of exclusion and inequity that may exist for certain groups. This includes people whose overlapping social identities, such as race, gender, and LGBTQ+ identity, create multiple levels of inequality or discrimination. It’s critical for DEI data to be disaggregated in a meaningful way to diagnose the true health of a system and to better understand how intersectionality contributes to the greater DEI landscape.
Google’s Diversity Annual Report public dataset includes Google’s intersectional hiring and representation data, also broken down by tech, non-tech, and leadership roles. BigQuery’s friendly interface makes it easy to select the relevant parameters and join this data with other public or private datasets to help meaningfully contextualize Google’s data against the broader industry. Anyone from data scientists to DEI stewards can launch public datasets from the Google Cloud Marketplace and start querying them right away. Findings can be visualized through tools like Looker, Data Studio, or Tableau.
This is just the beginning
As DEI work continues to evolve into an industry-wide approach, we must encourage a standard of practice for collecting and reporting data across the board. At scale, data has the power to enable the tech industry to make real improvements collectively, in addition to inside our individual companies. We hope our dataset and sample queries give researchers and individuals a launchpad to become DEI practitioners of the tech industry, and ensure we have the right data to solve the right problems.
Footnote:
*The only cost anyone would incur is for queries run on the data beyond BigQuery's free tier of 1TB of processing per month. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
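As a rough illustration of the footnote's arithmetic, the Python sketch below estimates how many bytes of a query would fall outside the monthly free allowance. The constant and function names are hypothetical, and treating 1TB as 10**12 bytes is an assumption; check the current BigQuery pricing page for the exact definition. The byte estimate for a query could come from a dry run in the BigQuery Python client (`QueryJobConfig(dry_run=True)` and the job's `total_bytes_processed`), which is not shown here.

```python
# Assumption: the 1TB monthly allowance is counted as 10**12 bytes.
FREE_TIER_BYTES_PER_MONTH = 10**12


def billable_bytes(bytes_used_this_month, query_bytes,
                   free_tier=FREE_TIER_BYTES_PER_MONTH):
    """Return how many bytes of a query would exceed the monthly free tier."""
    if bytes_used_this_month < 0 or query_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    # Free allowance still available this month (never below zero).
    remaining_free = max(free_tier - bytes_used_this_month, 0)
    # Only the part of the query beyond the remaining allowance is billable.
    return max(query_bytes - remaining_free, 0)


if __name__ == "__main__":
    # A 5 GB query with nothing used yet stays inside the free tier.
    print(billable_bytes(bytes_used_this_month=0, query_bytes=5 * 10**9))
```

A result of zero means the query fits entirely within the remaining free allowance for the month.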