On the importance of software testing

As the famous programmer Jean-Paul Sartre once put it, hell is other people’s code. This is what echoes through your head when you’re jolted awake at 2AM by PagerDuty, blaring about a Sev0 production outage. You trawl through the changelog to find the offending commit: a missing null check that results in an exception. You start rolling back the bad deploy, but as you sit there, illuminated by the glow of your laptop screen, you curse to yourself: how did a simple error like this make it all the way to production?

We’ve all been on the receiving end of escalations like this, and perhaps just as often, we’ve been the author of the offending change that caused the outage. During my time working on Hadoop, I’ve both written and fixed bugs like:

  • A new file format deserializer that would produce an empty result when reading a file written by the old serializer.
  • A rate limiter that throttled over 1000x more aggressively than intended (see the sketch after this list).
  • A function for calculating how much data to flush to disk that, in almost every situation, flushed too little.
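
To make that rate limiter bug concrete, here’s a minimal, hypothetical sketch of how a 1000x error can arise. This is not the actual Hadoop code; the names are invented, and the bug shown is the classic time-unit mix-up that produces exactly this failure mode:

    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: a rate limiter that throttles 1000x too hard
    // because of a time-unit mix-up at an API boundary.
    class ThrottledWriter {
        private final long bytesPerSecond;

        ThrottledWriter(long bytesPerSecond) {
            this.bytesPerSecond = bytesPerSecond;
        }

        void throttle(long bytesWritten) throws InterruptedException {
            // Pause long enough to keep the average rate at the limit.
            long pauseMicros = bytesWritten * 1_000_000L / bytesPerSecond;
            // BUG: Thread.sleep() takes milliseconds, not microseconds,
            // so this sleeps 1000x longer than intended.
            Thread.sleep(pauseMicros);
            // FIX: name the unit at the boundary instead:
            // TimeUnit.MICROSECONDS.sleep(pauseMicros);
        }
    }

A single test that pushes a known amount of data through the limiter and asserts on the elapsed time would catch this immediately.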

These are obvious bugs that barely outrank a null pointer exception in sophistication, and they should have been caught by even the most basic testing. Fortunately, most of them were caught during our test cycle; otherwise, they could easily have become Sev0 issues.

The case for testing is clear, but I’ve seen bug authors who never learn this lesson and (implicitly) refuse to write tests. Yes, there are times when skipping or deferring testing is acceptable. Yes, there are many nuanced arguments about the downsides of writing too many unit tests, the problems with mocking, and the uselessness of code coverage as a metric. But what really gets my goat is when a bug author’s simple apathy or lack of interest in testing perpetuates late-night pages, busted SLAs, and burned-out on-call engineers.

In this post, I present two case studies that illustrate our responsibility as software developers to deliver high-quality, production-ready artifacts to the consumers of our systems. In both cases, a catastrophic failure of a critical software system can be directly attributed to a lack of testing and poor quality assurance processes.

Therac-25

The Therac-25 was a medical radiation device used to treat cancer patients. It operated in two different treatment modes:

  • An electron mode which used an electron beam (beta radiation) to treat surface-level cancers.
  • An X-ray mode which turned that same electron beam into X-rays by increasing the current and pointing it at an X-ray target. This could be used to treat deeper tumors.

The Therac-25 was the latest in a series of radiotherapy machines. Previous models had hardware interlocks to prevent dangerous configurations, namely operating the beam in high-current X-ray mode without the X-ray target in place.

However, the Therac-25 was the first model to be entirely computer-controlled. The manufacturer decided to depend entirely upon the software control system to ensure that this situation could not occur, and removed the hardware interlocks.

This was a fatal mistake. Due to a race condition, it was possible for the operator to accidentally configure the machine in X-ray mode without the X-ray target in place, delivering 100x the intended amount of radiation. Patients suffered horrible burns and radiation sickness, with three ultimately dying as a result of their injuries.
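
The Therac-25’s control software was written in PDP-11 assembly, but the class of bug is timeless. Below is a minimal, hypothetical Java sketch of the same check-then-act race: the software interlock check and the action it guards are not atomic, so a concurrent mode change can slip in between them.

    // Hypothetical sketch of a check-then-act race; not the actual
    // Therac-25 code, which was written in PDP-11 assembly.
    class BeamController {
        private volatile boolean highCurrent = false;
        private volatile boolean targetInPlace = false;

        // Operator thread: switching from X-ray mode back to electron mode.
        void selectElectronMode() {
            targetInPlace = false; // step 1: rotate the X-ray target out
            highCurrent = false;   // step 2: drop the beam current
        }

        // Control thread: fires the beam after a software "interlock" check.
        void fireBeam() {
            if (highCurrent && !targetInPlace) {
                return; // refuse to fire
            }
            // If selectElectronMode() runs step 1 right here, after the
            // check but before the beam fires, the machine emits at high
            // current with no target in place.
            emitRadiation();
        }

        private void emitRadiation() { /* ... */ }
    }

Making both methods synchronized would close this particular window, but the deeper lesson is defense in depth: on the earlier models, the hardware interlocks made the same software bug harmless.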

AECL, the manufacturer of the Therac-25, initially did not believe the complaints and delayed investigating the issue. Even after AECL admitted the problem was real, the bug had to be independently reproduced by a hospital physicist before a software patch could be developed. That patch should have been the end of it, but the Therac-25 turned out to have yet another bug that manifested as the same fatal failure. Another patient was killed before the machine was ultimately recalled.

The root cause was pinned directly on poor software engineering practices. AECL had no formal software specification, test plan, or risk analysis for the Therac-25. Most of the coding was done by a single developer, who simply carried forward code from the earlier Therac model, which had hardware interlocks. There was no independent or end-to-end testing at all; most testing happened internally on a hardware simulator.

There are many resources for reading more about the Therac-25. Nancy Leveson’s original report is great, as is her retrospective written 30 years later.

Mars Climate Orbiter

The spaceflight business is a risky one. These projects are huge engineering efforts that involve hundreds of millions or billions of dollars invested over a timespan of multiple years, with many agencies, contractors, and subcontractors involved. Even after all that, there’s also a surprisingly high chance that the rocket carrying your payload blows up on the launchpad.

NASA awarded the $125 million Mars Climate Orbiter contract to Lockheed Martin. Four years later, after 286 days in space, the Orbiter reached Mars and began a series of maneuvers for orbital insertion. However, the spacecraft entered Mars’ atmosphere much lower than expected and was destroyed.

The primary cause of failure was eventually traced to a software component that emitted its results in imperial units (pound-force seconds) while the caller expected SI units (newton-seconds), a difference of a factor of 4.45. Although it’s tempting to attribute the failure to this seemingly simple bug, NASA ultimately placed the blame on multiple concurrent failures within its own testing and systems engineering processes.
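
This failure mode is easy to reproduce in miniature. A bare floating-point number carries no unit, so nothing stops a pound-force-second value from flowing into code that assumes newton-seconds. One common defense is to carry the unit in the type system. Here is a minimal sketch with invented names (this is not the actual ground software):

    // Hypothetical sketch: encoding the unit in the type forces callers
    // to declare what they have, turning a silent 4.45x error into a
    // compile-time error.
    final class Impulse {
        private static final double NEWTON_SECONDS_PER_LBF_SECOND = 4.448222;
        private final double newtonSeconds; // canonical internal unit: SI

        private Impulse(double newtonSeconds) {
            this.newtonSeconds = newtonSeconds;
        }

        static Impulse ofNewtonSeconds(double ns) {
            return new Impulse(ns);
        }

        static Impulse ofPoundForceSeconds(double lbfs) {
            return new Impulse(lbfs * NEWTON_SECONDS_PER_LBF_SECOND);
        }

        double inNewtonSeconds() {
            return newtonSeconds;
        }
    }

With a bare double, the unit lives only in a comment or in the author’s head; with a wrapper type, handing pound-force seconds to code that expects newton-seconds simply doesn’t compile. And as the quotes below note, end-to-end testing against an independent model would also have caught the 4.45x discrepancy.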

A choice quote from the IEEE Spectrum article on this topic, which is highly recommended:

Thomas Gavin, deputy director for space and earth science at NASA’s Jet Propulsion Laboratory, added: “A single error should not bring down a $125 million mission.”

Because of the rush to get the small forces model operational, the testing program had been abbreviated, Stephenson admitted. “Had we done end-to-end testing,” he stated at the press conference, “we believe this error would have been caught.” But the rushed and inadequate preparations left no time to do it right.

Other complaints about JPL go more directly to its existing style. One of Spectrum’s chief sources for this story blamed that style on “JPL’s process of ‘cowboy’ programming, and their insistence on using 30-year-old trajectory code that can neither be run, seen, or verified by anyone or anything external to JPL.” He went on: “Sure, someone at Lockheed made a small error. If JPL did real software configuration and control, the error never would have gotten by the door.” Other sources commented that this problem was particularly severe within the JPL navigation team, rather than being a JPL-wide complaint.

So, should I test my software?

The lesson here is not that we need to apply the same software development processes as NASA or medical equipment manufacturers. Waterfall-style development fell out of favor for good reason, and it’s probably not that big a deal if your REST microservice goes down occasionally.

What is notable is that both of these failures were directly attributed to a lack of testing. Testing is essential when working on a large software project: without good tests and QA processes in place, it’s nigh impossible to reason about the correctness of the system as a whole. Forgoing testing yields fragile products where even the simplest bug can cause catastrophic failure.
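
Even the missing null check from the opening anecdote falls to a single unit test that exercises the unhappy path. A hypothetical example (the class and method names are invented for illustration):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    import org.junit.Test;
    import static org.junit.Assert.assertFalse;

    // A hypothetical class under test, with the one-line fix in place.
    class UserLookup {
        private final Map<String, String> names = new HashMap<>();

        Optional<String> findDisplayName(String userId) {
            // The fix: ofNullable() instead of of(), which throws a
            // NullPointerException when the user is unknown.
            return Optional.ofNullable(names.get(userId));
        }
    }

    public class UserLookupTest {
        // Before the null check was added, this test would fail with a
        // NullPointerException -- in CI rather than in production at 2AM.
        @Test
        public void lookupOfUnknownUserReturnsEmpty() {
            UserLookup lookup = new UserLookup();
            assertFalse(lookup.findDisplayName("no-such-user").isPresent());
        }
    }

This is about the cheapest insurance a codebase can buy.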

In a future post, I’ll dive more into the mechanics of software testing: the different types of tests, and how and when to apply them.

 

Special thanks to my wonderful editors Tiffany Chen, John Sherwood, and Michael Tao, who gave feedback on earlier drafts of this post.

Posted by andrew in Software, 0 comments

Bicycle touring post-mortem

San Francisco to San Diego was my first multi-day tour, and overall I’m very happy with how it went. I’ve done plenty of overnight bike tours to Half Moon Bay or Samuel P. Taylor, and I carried basically the same kit on the multi-day tour.

Here’s a breakdown of what went well and what I might do differently on my next tour. I’m really eager to do the northern section of this route (Seattle or Portland to San Francisco), perhaps next year.

Posted by andrew in Travel, 1 comment

Riding the 101: Bicycle Touring Mega-Update

I spent about two weeks of my sabbatical riding from San Francisco to San Diego (Jul 31 – Aug 15). I got in the habit of posting end-of-day recaps to Facebook and Strava as I went, which really helped me reflect on what happened. I’m reposting all of them here as a mega-post.

Total distance: 622.9 miles

Total climbing: 24,436 feet

Total riding time: 55.85 hours

Sunsets on the beach: whenever possible

Posted by andrew in Travel, 0 comments

Blog refresh: WordPress

I’ve come full-circle. My very first websites, circa 2005, were built with a CMS (Joomla or WordPress). I started messing with custom themes and plugins (which is how I really learned to code), then drank deeply of the semantic web koolaid and started hand-coding everything in XHTML, CSS, and PHP. I migrated to a static site generator seeking simplicity and reduced hosting costs, and now umbrant.com is again powered by a full-fledged CMS: WordPress.

Posted by andrew, 1 comment

The Next Generation of Apache Hadoop

Apache Hadoop turned ten this year. To celebrate, Karthik and I gave a talk at USENIX ATC ’16 about open problems to solve in Hadoop’s second decade. This was an opportunity to revisit our academic roots and get a new crop of graduate students interested in the real distributed systems problems we’re trying to solve in industry.

This is a huge topic and we only had a 25 minute talk slot, so we were pitching problems rather than solutions. However, we did have some ideas in our back pocket, and the hallway track and birds-of-a-feather we hosted afterwards led to a lot of good discussion.

Karthik and I split up the content thematically, which worked really well. I covered scalability, meaning sharded filesystems and federated resource management. Karthik addressed scheduling (unifying batch jobs and long-running services) and utilization (overprovisioning, preemption, isolation).

I’m hoping to give this talk again in longer form, since I’m proud of the content.

Slides: pptx

USENIX site with PDF slides and audio

Posted by andrew in Talks, 0 comments

Distributed testing

I gave a presentation titled Happier Developers and Happier Software through Distributed Testing at Apache Big Data 2016. It covers how our distributed unit testing framework decreased the runtime of Apache Hadoop’s unit test suite by 60x, from 8.5 hours to about 8 minutes, and the substantial productivity improvements that are possible when developers can easily run and interact with the test suite.

The infrastructure is general enough to accommodate any software project. We wrote frontends for both C++/gtest and Java/Maven.

This effort started as a Cloudera hackathon project that Todd Lipcon and I worked on two years ago, and I’m very glad we got it across the line. It’s also open-source, and we’d love to see it rolled out to more projects.

Slides: pptx

Source-code: cloudera/dist_test

Posted by andrew in Talks, 0 comments

Windows Azure Storage

What makes this paper special is that it is one of the few published papers about a production cloud blobstore. The 800-pound gorilla in this space is Amazon S3, but I find Windows Azure Storage (WAS) the more interesting system, since it provides strong consistency, additional features like append, and serves as the backend not just for WAS Blobs, but also WAS Tables (structured data access) and WAS Queues (message delivery). It also occupies a different design point than hash-partitioned blobstores like Swift and Rados.

This paper, “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency” by Calder et al., was published at SOSP ’11.

Posted by andrew in Reviews, 0 comments

Transparent encryption in HDFS

I went on a little European roadshow last month, presenting my recent work on transparent encryption in HDFS at Hadoop Summit Brussels and Strata Hadoop World London. I’ll also be giving the same talk this fall at Strata Hadoop World NYC, which will possibly be the biggest audience I’ve ever spoken in front of.

Slides: pptx

Video: Hadoop Summit Brussels (youtube)

If you have access to O’Reilly, there should be a higher-quality video available there.

Posted by andrew in Talks, 0 comments

Mesos, Omega, Borg: A Survey

Google recently unveiled one of their crown jewels of systems infrastructure: Borg, their cluster scheduler. This prompted me to re-read the Mesos and Omega papers, which deal with the same topic. I thought it’d be interesting to compare and contrast these systems. Mesos gets credit for the groundbreaking idea of two-level scheduling, Omega improved upon it with an analogy from databases, and Borg can be seen as the culmination of all these ideas.

Posted by andrew in Reviews, 0 comments

Bucket list: Catch a fish and eat it

I checked off one of my bucket list items yesterday: catching a fish, cleaning it, and eating it.

This was the last day of a family vacation in Port St. Lucie, Florida. My original plans to go deep-sea fishing fell through, so I went to the surprisingly well-stocked local Walmart to pick up some freshwater gear. I was lucky enough to nab a healthy-looking 15″ largemouth bass with a silver Mepps spinner from the lake behind our timeshare.

Posted by andrew in Personal, 0 comments