The democratization of big data is a big win for democracy
Posted by Felipe Hoffa, Developer Advocate
“Big data.” “Machine learning.” “Data visualization.” For people outside of industries that use them on a regular basis (and even people within them), technologies like these often seem abstract—as if they only apply to large-scale financial modeling or genomics research. It’s difficult to grasp how they impact the real world. We figured one of the best ways to demonstrate how powerful these technologies can be was to use them to tackle a more down-to-earth topic that directly impacts our daily lives: elections. (See developer guide to working with election data and query code samples below.)
The sheer volume and complexity of campaign finance data from state, local, and national elections has made campaign finance one of the most difficult realms of political analysis to decode. As a result, understanding the role campaign financing plays in the democratic process isn’t easy. That’s where big data can now make a real difference. 2012 marked the first time big data was a decisive factor in a presidential election, and the 2016 campaign season is poised to generate more data than any other election cycle in history.
DIY: The ultimate democratization of data
With a dataset provided by the Center for Responsive Politics, we put together a hands-on demo that you can watch below. In it, we highlight how pairing Google BigQuery with re:dash, an open source data visualization tool, lets you query and convert terabytes of data into rich, interactive visualizations.
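To give a sense of what querying that dataset looks like, here is a minimal sketch in BigQuery legacy SQL against the same CRP individual-contributions table used in the code samples at the end of this post (the column names are taken from those samples; treat this as a starting point, not the exact query from the demo):

```sql
-- Total itemized contributions and distinct donors per month,
-- from the CRP individual-contributions table used later in this post.
SELECT
  LEFT(date, 7) month,              -- 'YYYY-MM'
  SUM(amount) total_amount,
  COUNT(DISTINCT contribid) donors
FROM [fh-bigquery:opensecrets.indivs16v2]
GROUP BY 1
ORDER BY 1
```

Point re:dash at the query results and you get the interactive charts shown in the demo.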
Once we saw how readily we could query and visualize the CRP’s campaign contributions and expenses dataset, we wanted to see how far we could push it—and if we could tie this kind of data to what people are saying about the candidates in real time.
The next step: visualizing terabytes of historical, and live, data
Old media, real-time social media feeds, public datasets. In our presentation at Google I/O, “Election 2016: The big data showdown” (see inline, below), we piled on a host of disparate data sources: reddit comments, Twitter, Federal Election Commission disclosures, Wikipedia, and worldwide news (through GDELT). Then we used Dataflow, Datalab, Cloud Machine Learning, and BigQuery to extract meaning from it all.
For example, can we learn anything about how different groups think about candidates today based on their reddit comments four years ago? Sanders supporters were contributing to /r/space and /r/occupywallstreet. Trump supporters were contributing to /r/guns, /r/MensRights, and /r/ronpaul. Clinton supporters were commenting about topics including /r/lgbt, /r/PoliticalDiscussion, /r/baseball, and /r/soccer. How useful is that information and what can we extrapolate from it?
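A query along these lines can sketch that kind of analysis. Note the table names ([fh-bigquery:reddit_comments.*]) and the “supporter” heuristic below (anyone who commented in a candidate’s subreddit) are assumptions for illustration, not the exact method used in the talk:

```sql
-- Hedged sketch: where else did commenters from a candidate's
-- subreddit post four years earlier? The supporter heuristic and
-- table names are illustrative assumptions.
SELECT
  old.subreddit,
  COUNT(DISTINCT old.author) commenters
FROM [fh-bigquery:reddit_comments.2012] old
JOIN (
  SELECT author
  FROM [fh-bigquery:reddit_comments.2016_01]
  WHERE subreddit = 'SandersForPresident'
  GROUP BY 1
) supporters
ON supporters.author = old.author
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20
```

Swapping in a different candidate’s subreddit lets you compare the historical footprints of each group of commenters.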
As highlighted in a recent Washington Post article, “Here’s the data that told us Bernie Sanders would lose,” that cited our presentation, this level of analysis offers new ways to integrate historical and live data to gauge voter sentiment over time—and gain insight in the moment. Watch what else we learned:
Projects like this demonstrate that collecting, organizing, searching, and gaining insights from extremely large amounts of information has never been easier. The ongoing democratization of big data—and the ability to pair it with data visualization to see what terabytes of data mean at a glance—will fundamentally alter the way we interact with data.
Get started visualizing big data
Try it yourself for free. With publicly accessible datasets in BigQuery, like those from the Center for Responsive Politics, big data coupled with data visualization means greater transparency for all, and greater transparency means a stronger democracy.
From the relatively simple:
Contributions by candidates, with a cleanup:
SELECT * FROM (
  SELECT
    LEFT(date, 7)+'-01' month,
    recipient,
    SUM(amount) amount,
    COUNT(DISTINCT contribid) c,
    SUM(c) OVER(PARTITION BY recipient) c_tot,
    SUM(c) OVER(PARTITION BY month) m_tot,
  FROM [fh-bigquery:opensecrets.indivs16v2] a
  JOIN (
    SELECT cid, FIRST(party) party
    FROM [fh-bigquery:opensecrets.cands16]
    WHERE distidrunfor = 'PRES'
    GROUP BY 1
  ) b
  ON a.recipid=b.cid
  WHERE LEFT(realcode, 2)!='Z9'
  GROUP BY 1, 2
  HAVING c>10
)
WHERE c_tot>8000 AND m_tot>5000
ORDER BY 1, 3 DESC
To the more complex:
Which candidates were the most efficient in transforming donations into actual delegates during the primaries (by state)?
SELECT *,
  ROUND(a.amount/b.delegates,0) ratio_delegates,
  ROUND(a.amount/b.votes,0) ratio_votes
FROM (
  SELECT
    recipient,
    state,
    FIRST(REGEXP_EXTRACT(recipient, ', (.*)') + ' ' + REGEXP_EXTRACT(recipient, '(.*),')) candidate,
    SUM(amount) amount,
    SUM(IF(b.party='R',-1,1)*amount) amount_n,
    COUNT(DISTINCT contribid) c,
    SUM(c) OVER(PARTITION BY recipient) c_tot,
  FROM [fh-bigquery:opensecrets.indivs16v2] a
  JOIN (
    SELECT cid, FIRST(party) party
    FROM [fh-bigquery:opensecrets.cands16]
    WHERE distidrunfor = 'PRES'
    GROUP BY 1
  ) b
  ON a.recipid=b.cid
  WHERE LEFT(realcode, 2)!='Z9'
  GROUP BY 1,2
  HAVING c>10
) a
JOIN [fh-bigquery:opensecrets.primaries_results] b
ON a.candidate=b.candidate AND a.state=b.state_abbrev

Stay curious! And share your findings on reddit.com/r/bigquery.