Bugsnag: Evolving from bug reporting to application stability management

About Bugsnag

Founded in 2012 and based in San Francisco, Bugsnag provides software teams with an automated crash detection platform for their web, mobile, and server apps. The company has more than 4,000 customers that include Lyft, Square, OpenTable, GitHub, and Slack.

Industries: Technology
Location: United States

Bugsnag is using Google Cloud Platform to help it evolve from a bug reporting and resolution platform into the single source of truth for application stability and quality within an organization.

Google Cloud Results

  • Speeds overall performance by 50% to 60%
  • Supports 1B+ crash reports a day
  • Helps business growth with new features

Handles 100x traffic spikes with ease

The Bugsnag platform collects data about bugs from mobile apps, web browsers, and backend servers and microservices, to help software teams quickly discover and eliminate bugs. As Bugsnag has grown, so has the volume of errors captured and processed daily. Each crash report can be up to 1MB in size, and Bugsnag has received as many as 1 billion crash reports in one day, resulting in an overall traffic spike of up to 100x.
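To put those figures in perspective, the short calculation below works out the average ingest rate and the theoretical upper bound on daily data volume for a 1-billion-report day. It is only back-of-the-envelope arithmetic: it assumes 1MB means 1,000,000 bytes and that every report hits the maximum size, which real traffic would not. The constant names are illustrative.

```python
# Figures taken from the text above; the byte size of "1MB" is an assumption.
REPORTS_PER_DAY = 1_000_000_000   # peak day: 1 billion crash reports
MAX_REPORT_BYTES = 1_000_000      # assumed decimal megabyte per report
SECONDS_PER_DAY = 24 * 60 * 60

# Average reports per second if the peak day were spread evenly.
avg_reports_per_second = REPORTS_PER_DAY / SECONDS_PER_DAY

# Worst case: every report is at the maximum size.
max_bytes_per_day = REPORTS_PER_DAY * MAX_REPORT_BYTES
max_petabytes_per_day = max_bytes_per_day / 10**15

print(f"{avg_reports_per_second:,.0f} reports/second on average")
print(f"up to {max_petabytes_per_day:g} PB/day if every report were 1MB")
```

Because spikes are not spread evenly across a day, the instantaneous rate during a spike would be higher than this average.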

In 2013, Bugsnag launched its service with a major provider’s cloud-based IT infrastructure as its foundation. Even though Bugsnag used its IT provider’s Application Programming Interface (API) to provision additional virtual server instances to accommodate growth and spikes, Bugsnag couldn’t add more instances as quickly as needed. Overprovisioning helped but wasn’t a viable, cost-effective, long-term solution.

When Bugsnag was unable to accommodate growth or spikes in the volume of error data, some error reporting events could be lost. Part of the Bugsnag platform value is to learn from all customer error events, to provide context that helps software teams know what to prioritize.

For example, based on previous error reports, the Bugsnag platform can tell a customer if 1,000 errors came from a single user or from 1,000 users experiencing the same error. When performance issues caused error reporting events to be lost, the Bugsnag platform couldn’t learn from those events and pass along that additional value to customers.
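The distinction between "1,000 errors from one user" and "1,000 users with the same error" comes down to counting distinct users per error. The sketch below illustrates that idea only; the event shape (`error` and `user_id` fields) and the function name are hypothetical and are not Bugsnag's actual data model or API.

```python
from collections import defaultdict

def users_affected(events):
    """Return {error: (event_count, distinct_user_count)} for a list of events."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for event in events:
        counts[event["error"]] += 1
        users[event["error"]].add(event["user_id"])
    return {err: (counts[err], len(users[err])) for err in counts}

# Hypothetical data: 1,000 events from one user vs. 1,000 events from 1,000 users.
single_user = [{"error": "NullPointerException", "user_id": "u1"} for _ in range(1000)]
many_users = [{"error": "TimeoutError", "user_id": f"u{i}"} for i in range(1000)]

summary = users_affected(single_user + many_users)
for error, (event_count, user_count) in summary.items():
    print(f"{error}: {event_count} events, {user_count} users")
```

Both errors have the same event count, but the second one touches far more users. A lost event can skew either number, which is why dropped reports reduce the value of this kind of context.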

“Performance had become such a critical issue that we decided to move our revenue-generating platform from one provider to another,” says Chuck Dubuque, Chief Marketing Officer at Bugsnag. “For a startup, that’s scary.”

By December 2016, the Bugsnag IT team had decided to migrate to Google Cloud Platform. “Performance is very important to us,” says Simon Maynard, Co-founder and Chief Technology Officer of Bugsnag. “We need to make sure our customers can do highly complex queries in real time. We need to be able to suddenly absorb a huge volume of traffic. And after our tests, it became clear that Google Cloud Platform had the power to handle what we needed, when we needed it.”

In early 2017, Bugsnag migrated to Google Cloud Platform. Google Compute Engine hosts its databases, and containerized applications are managed in Google Kubernetes Engine. Crash report files, a total of about 50TB, are housed in Google Cloud Storage. Bugsnag is also leveraging gRPC, an open-source, universal Remote Procedure Call (RPC) framework that Google developed. Google Kubernetes Engine and gRPC work together to enable Bugsnag to horizontally scale all its services and help ensure they can communicate with each other.

Flexibility and better performance

The switchover from its previous IT provider to running live on Google Cloud Platform took only about 15 minutes to execute. Planning for the switchover, of course, took a few months.

For example, as part of the transition process, engineers benchmarked the performance of Google Compute Engine preemptible virtual machines (VMs) running in Google data centers. The goal: determine which virtual machine instance types were best for running the various Bugsnag databases, in order to create the most efficient backup of its databases. The next steps were to transfer the data to Google Cloud Platform from the previous IT provider’s data warehouse and run queries on the backup.

“Google Cloud Platform made it easy to choose virtual machine type and cluster size,” says Simon. “Along with that flexibility, we’re seeing significantly improved performance. We went from multiple seconds per query to hundreds of milliseconds per query, and that’s for the most demanding queries. Overall, performance is 50% to 60% faster. And the Google Cloud Compute Engine pricing model, where you’re charged less the more you use a virtual machine instance, is the way the cloud should be.”

Over the moon with performance lift

Bugsnag customers were alerted in advance that the platform would be down for about 15 minutes during the switchover weekend. Based on the results Bugsnag engineers had already seen from Google Cloud Platform tests, customers were also told to expect faster performance once the transition was complete.

“Before the switch to Google Cloud Platform, some customers had complained about our performance being slow,” says Simon. “Afterwards, the vast majority of customers told us they were really impressed by the performance increase.”

Visibility into the Google roadmap

Bugsnag developers appreciate the level of support they get from Google, says Simon. “Even though we’re spending the same amount of money with Google as we did with our previous infrastructure provider, Google gives us much better support. We have a specialized account team that travels to us and meets with us onsite regularly. And Google gives us visibility into what’s on the roadmap, which helps us make plans.”

Google Cloud Platform runs on the same data center hardware on which Google runs its internal infrastructure, which gives Bugsnag developers confidence in the platform’s security. “Google is incentivized to keep its data centers highly secure,” says Simon. “If they’re using it for their own operations, you know they’re going to stay on top of it at all times.” In addition, Bugsnag has many customers that must be compliant with the Health Insurance Portability and Accountability Act (HIPAA), for which Google Cloud Platform provides a solid foundation, he adds.

The confidence to grow

Bugsnag wanted a cloud infrastructure that would easily enable the company to evolve its services beyond bug reporting and resolution.

In January 2018, Bugsnag introduced the Releases dashboard, which provides an at-a-glance view into the stability and quality of each software release. Instead of releasing a software update and waiting to see if there’s a flood of errors or customer support calls, Bugsnag customers can easily see the ratio of successful app sessions against unsuccessful or buggy sessions.

The resulting crash rate metric mimics a traffic light: a largely bug-free software release gets a green light; a borderline buggy release is yellow; and a red light means there’s work to do. The Releases dashboard helps software developers understand what their next priority is: to continue working on the next release or focus on fixing bugs in the latest release. The intuitive dashboard also makes it easy to share a release’s success with C-level executives, for strategic planning purposes.
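The traffic-light idea can be sketched as a ratio plus two thresholds. This is an illustration under stated assumptions, not Bugsnag's implementation: the function names and the 1% and 5% cutoffs are invented here, and the source does not say what thresholds the Releases dashboard uses.

```python
def crash_rate(total_sessions, crashed_sessions):
    """Fraction of sessions in a release that ended in a crash."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    return crashed_sessions / total_sessions

def stability_light(rate, yellow_at=0.01, red_at=0.05):
    """Map a crash rate to a light. Thresholds are hypothetical defaults."""
    if rate >= red_at:
        return "red"
    if rate >= yellow_at:
        return "yellow"
    return "green"

# Hypothetical releases: (total sessions, crashed sessions)
releases = {"v1.0": (10_000, 20), "v1.1": (10_000, 300), "v1.2": (10_000, 800)}
for name, (total, crashed) in releases.items():
    rate = crash_rate(total, crashed)
    print(f"{name}: crash rate {rate:.1%} -> {stability_light(rate)}")
```

Note that the calculation needs the count of all sessions, not just the crashed ones, which is why this feature requires the platform to ingest data on successful sessions as well as errors.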

The Releases dashboard requires the Bugsnag platform to handle 10 to 100 times more data than before—the error data, plus the data on successful app sessions, which typically outnumbers the error reports significantly. “We weren’t sure our previous IT infrastructure could handle 10x to 100x more data, plus the spikes that happen,” says Chuck. “The horizontal scaling that Google Kubernetes Engine makes possible and the modern networking stack Google has at its data centers makes this important new feature possible.”

From bugs to application stability management

Bugsnag plans to grow its current platform offerings even further by giving an organization’s executives and employees views into the same data that are tailored to each group’s goals. The Releases dashboard, for example, allows product teams and release managers to focus on improving and tracking the stability of each application release and to set goals for stability using the crash rate metric.

A clear, actionable metric such as crash rate also enables data-driven decisions about whether to promote a release from staging to production, or whether the next development cycle should concentrate on adding new features or on improving stability by fixing bugs.

When developing the Releases dashboard capability, Bugsnag knew that extending its backend microservices architecture to support releases would mean adding several new services and modifying existing ones. The company also wanted to future-proof its infrastructure for new use cases it intended to release down the road.

To handle all of these architectural changes, Bugsnag needed a consistent way of designing, implementing, and integrating its services. And the company wanted a platform-agnostic approach; Bugsnag is a polyglot company and its services are written in Java, Ruby, Go, and Node.js. Bugsnag chose gRPC as the default communication framework for its backend microservices for its speed, broad compatibility, development tooling, multi-platform support, maturity, and adoption.

The Bugsnag platform running on Google Cloud Platform with Google Kubernetes Engine and gRPC will enable even more use cases for Bugsnag customers. For example, a Bugsnag customer’s sales team could use Bugsnag to monitor the success of a new software release in real time.

If the release has a red light on the Releases dashboard, the sales team would know to be prepared for questions about the release from customers. The customer support team could link a live support call from a customer to the errors captured in Bugsnag from that user. This could help reduce the mean time to identify and resolve the error the user is experiencing.

“We want to be a developers’ tool and a tool that gives an entire organization real-time, actionable insight into application stability and quality in production,” says Chuck. “Because Google Cloud Platform can help us handle more raw, rich data, and scale easily, we can evolve our platform and our business in new ways.”
