Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

The democratization of big data is a big win for democracy

Monday, June 20, 2016

Posted by Felipe Hoffa, Developer Advocate

“Big data.” “Machine learning.” “Data visualization.” For people outside of industries that use them on a regular basis (and even people within them), technologies like these often seem abstract—as if they only apply to large-scale financial modeling or genomics research. It’s difficult to grasp how they impact the real world. We figured one of the best ways to demonstrate how powerful these technologies can be was to use them to tackle a more down-to-earth topic that directly impacts our daily lives: elections. (See developer guide to working with election data and query code samples below.)

The sheer volume and complexity of campaign finance data from state, local, and national elections have made campaign finance one of the most difficult realms of political analysis to decode. As a result, understanding the role campaign financing plays in the democratic process isn’t easy. That’s where big data can now make a real difference. 2012 marked the first time big data was a decisive factor in a presidential election, and the 2016 campaign season is poised to generate more data than any other election cycle in history.

DIY: The ultimate democratization of data

With a dataset provided by the Center for Responsive Politics, we put together a hands-on demo that you can watch below. In it, we highlight how pairing Google BigQuery with re:dash, an open source data visualization tool, lets you query and convert terabytes of data into rich, interactive visualizations.

Once we saw how readily we could query and visualize the CRP’s campaign contributions and expenses dataset, we wanted to see how far we could push it—and if we could tie this kind of data to what people are saying about the candidates in real time.
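
To give a feel for what sits behind those charts, the query driving a visualization can be only a few lines of BigQuery legacy SQL. Here’s a minimal sketch against the same [fh-bigquery:opensecrets.indivs16v2] contributions table used in the example queries at the end of this post; the monthly roll-up is our illustration, not the exact query from the demo. Paste it into re:dash as the source of a chart and you have an interactive timeline:

-- Total individual contributions and distinct contributors per month.
SELECT LEFT(date, 7) month,
  SUM(amount) total_amount,
  COUNT(DISTINCT contribid) contributors
FROM [fh-bigquery:opensecrets.indivs16v2]
GROUP BY 1
ORDER BY 1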

The next step: visualizing terabytes of historical and live data

Old media, real-time social media feeds, public datasets. In our presentation at Google I/O, “Election 2016: The big data showdown” (see inline, below), we piled on a host of disparate data sources: reddit comments, Twitter, Federal Election Commission disclosures, Wikipedia, and worldwide news (through GDELT). Then we used Dataflow, Datalab, Cloud Machine Learning, and BigQuery to extract meaning from it all.
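
As one concrete illustration of the “worldwide news” piece, a query along the following lines counts daily coverage of each candidate in GDELT’s public events table on BigQuery. This is a sketch rather than a query from the talk: the [gdelt-bq:gdeltv2.events] table name and the SQLDATE, Actor1Name, and NumArticles columns are our assumptions about how the public GDELT dataset is laid out.

-- Daily news-event counts where a candidate appears as the primary actor (legacy SQL).
SELECT SQLDATE day,
  Actor1Name candidate,
  COUNT(*) events,
  SUM(NumArticles) articles
FROM [gdelt-bq:gdeltv2.events]
WHERE Actor1Name IN ('HILLARY CLINTON', 'BERNIE SANDERS', 'DONALD TRUMP')
  AND SQLDATE >= 20160101
GROUP BY 1, 2
ORDER BY 1, 2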

For example, can we learn anything about how different groups think about candidates today based on their reddit comments four years ago? Sanders supporters were contributing to /r/space and /r/occupywallstreet. Trump supporters were contributing to /r/guns, /r/MensRights, and /r/ronpaul. Clinton supporters were commenting about topics including /r/lgbt, /r/PoliticalDiscussion, /r/baseball, and /r/soccer. How useful is that information and what can we extrapolate from it?
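
The idea behind that analysis is simple to sketch: find accounts commenting in a candidate’s subreddit during the campaign, then look at where those same accounts were commenting years earlier. Below is a rough legacy SQL version, assuming the public reddit comments dataset is split into per-period tables such as [fh-bigquery:reddit_comments.2016_04] and [fh-bigquery:reddit_comments.2012]; those table names, the subreddit, and the thresholds are our assumptions, not the exact queries from the talk.

-- Where were this cycle's /r/SandersForPresident commenters posting back in 2012?
SELECT subreddit, COUNT(*) comments
FROM [fh-bigquery:reddit_comments.2012]
WHERE author IN (
  -- accounts that commented in the candidate's subreddit in April 2016
  SELECT author
  FROM [fh-bigquery:reddit_comments.2016_04]
  WHERE subreddit = 'SandersForPresident')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20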

As highlighted in a recent Washington Post article, “Here’s the data that told us Bernie Sanders would lose,” that cited our presentation, this level of analysis offers new ways to integrate historical and live data to gauge voter sentiment over time—and gain insight in the moment. Watch what else we learned:

Projects like this demonstrate that collecting, organizing, searching, and gaining insights from extremely large amounts of information has never been easier. The ongoing democratization of big data—and the ability to pair it with data visualization to see what terabytes of data mean at a glance—will fundamentally alter the way we interact with data.

Get started visualizing big data

Try it yourself for free. With publicly accessible datasets in BigQuery, like those from the Center for Responsive Politics, big data coupled with data visualization means greater transparency for all, and greater transparency means a stronger democracy.

Example queries:

From the relatively simple:

Contributions by candidate, with some cleanup:

SELECT * FROM (
  SELECT CONCAT(LEFT(date, 7), '-01') month, recipient,
    SUM(amount) amount,
    COUNT(DISTINCT contribid) c,                    -- distinct contributors per candidate and month
    SUM(c) OVER(PARTITION BY recipient) c_tot,      -- those counts summed per candidate
    SUM(c) OVER(PARTITION BY month) m_tot           -- and summed per month
  FROM [fh-bigquery:opensecrets.indivs16v2] a
  JOIN (
    -- presidential candidates only, one row per candidate id
    SELECT cid, FIRST(party) party
    FROM [fh-bigquery:opensecrets.cands16]
    WHERE distidrunfor = 'PRES'
    GROUP BY 1) b
  ON a.recipid=b.cid
  WHERE LEFT(realcode, 2)!='Z9'   -- drop contributions whose CRP category code starts with 'Z9'
  GROUP BY 1, 2
  HAVING c>10                     -- ignore candidate-months with few contributors
)
WHERE c_tot>8000   -- keep only candidates with a large contributor base
AND m_tot>5000     -- and months with significant overall activity
ORDER BY 1, 3 DESC

To the more complex:

Which candidates were the most efficient in transforming donations into actual delegates during the primaries (by state)?

SELECT *, ROUND(a.amount/b.delegates, 0) ratio_delegates, ROUND(a.amount/b.votes, 0) ratio_votes
FROM (
  SELECT recipient, state,
    -- swap the parts of recipient around the comma (e.g. 'Lastname, Firstname' -> 'Firstname Lastname')
    FIRST(CONCAT(REGEXP_EXTRACT(recipient, ', (.*)'), ' ', REGEXP_EXTRACT(recipient, '(.*),'))) candidate,
    SUM(amount) amount,
    SUM(IF(b.party='R', -1, 1)*amount) amount_n,    -- signed amount: negative for Republicans
    COUNT(DISTINCT contribid) c,                    -- distinct contributors per candidate and state
    SUM(c) OVER(PARTITION BY recipient) c_tot       -- those counts summed per candidate
  FROM [fh-bigquery:opensecrets.indivs16v2] a
  JOIN (
    -- presidential candidates only, one row per candidate id
    SELECT cid, FIRST(party) party
    FROM [fh-bigquery:opensecrets.cands16]
    WHERE distidrunfor = 'PRES'
    GROUP BY 1) b
  ON a.recipid=b.cid
  WHERE LEFT(realcode, 2)!='Z9'   -- drop contributions whose CRP category code starts with 'Z9'
  GROUP BY 1, 2
  HAVING c>10
) a
JOIN [fh-bigquery:opensecrets.primaries_results] b  -- delegates and votes won, per candidate and state
ON a.candidate=b.candidate
AND a.state=b.state_abbrev

Stay curious! And share your findings on reddit.com/r/bigquery.
