The Next Generation of Apache Hadoop

Apache Hadoop turned ten this year. To celebrate, Karthik and I gave a talk at USENIX ATC ’16 about open problems to solve in Hadoop’s second decade. This was an opportunity to revisit our academic roots and get a new crop of graduate students interested in the real distributed systems problems we’re trying to solve in industry.

This is a huge topic and we only had a 25 minute talk slot, so we were pitching problems rather than solutions. However, we did have some ideas in our back pocket, and the hallway track and birds-of-a-feather we hosted afterwards led to a lot of good discussion.

Karthik and I split up the content thematically, which worked really well. I covered scalability, meaning sharded filesystems and federated resource management. Karthik addressed scheduling (unifying batch jobs and long-running services) and utilization (overprovisioning, preemption, isolation).

I’m hoping to give this talk again in longer form, since I’m proud of the content.

Slides: pptx

USENIX site with PDF slides and audio

Posted by andrew in Talks, 0 comments

Distributed testing

I gave a presentation titled Happier Developers and Happier Software through Distributed Testing at Apache Big Data 2016, which detailed how our distributed unit testing framework has decreased the runtime of Apache Hadoop’s unit test suite by 60x from 8.5 hours to about 8 minutes, and the substantial productivity improvements that are possible when developers can easily run and interact with the test suite.

The infrastructure is general enough to accommodate any software project. We wrote frontends for both C++/gtest and Java/Maven.

This effort started as a Cloudera hackathon project that Todd Lipcon and I worked on two years ago, and I’m very glad we got it across the line. Furthermore, it’s also open-source, and we’d love to see it rolled out to more projects.

Slides: pptx

Source-code: cloudera/dist_test

Posted by andrew in Talks, 0 comments

Windows Azure Storage

What makes this paper special is that it is one of the only published papers about a production cloud blobstore. The 800-pound gorilla in this space is Amazon S3, but I find Windows Azure Storage (WAS) the more interesting system since it provides strong consistency, additional features like append, and serves as the backend for not just WAS Blobs, but also WAS Tables (structured data access) and WAS Queues (message delivery). It also occupies a different design point than hash-partitioned blobstores like Swift and Rados.

This paper, “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency” by Calder et al., was published at SOSP ’11.

Continue reading →

Posted by andrew in Reviews, 0 comments

Transparent encryption in HDFS

I went on a little European roadshow last month, presenting my recent work on transparent encryption in HDFS at Hadoop Summit Brussels and Strata Hadoop World London. I’ll also be giving the same talk this fall at Strata Hadoop World NYC, which will possibly be the biggest audience I’ve ever spoken in front of.

Slides: pptx

Video: Hadoop Summit Brussels (youtube)

If you have access to O’Reilly, there should be a higher quality video available there.

Posted by andrew in Talks, 0 comments

Mesos, Omega, Borg: A Survey

Google recently unveiled one of their crown jewels of system infrastructure: Borg, their cluster scheduler. This prompted me to re-read the Mesos and Omega papers, which deal with the same topic. I thought it’d be interested to do a compare and contrast of these systems. Mesos gets credit for the groundbreaking idea of two-level scheduling, Omega improved upon this with an analogy from databases, and Borg can sort of be seen as the culmination of all these ideas.

Continue reading →

Posted by andrew in Reviews, 0 comments

Bucket list: Catch a fish and eat it

I checked off one of my bucket list items yesterday: catching a fish, cleaning it, and eating it.

This was the last day of a family vacation in Port St. Lucie in Florida. My original plans to go deep sea fishing fell through, so I went to the surprisingly well-stocked local Walmart to pick up some freshwater gear. I was lucky enough to nab a healthy-looking 15″ largemouth bass with a silver Mepps spinner from the lake behind our timeshare.

Continue reading →

Posted by andrew in Personal, 0 comments

Paper review: Facebook f4

It’s been a while since I did one of these! I did a previous review of Facebook Haystack, which was designed as an online blob storage system. f4 is a sister system that works in conjunction with Haystack, and is intended for storage of warm rather than hot blobs. As is usual for Facebook, they came up with a system that is both eminently practical and tailored for their exact use case.

This paper, “f4: Facebook’s Warm BLOB Storage System” by Muralidhar et al., was published at OSDI ’14.

Continue reading →

Posted by andrew in Reviews, 0 comments

In-memory Caching in HDFS: Lower latency, same great taste

My coworker Colin McCabe and I recently gave a talk at Hadoop Summit Amsterdam titled “In-memory Caching in HDFS: Lower latency, same great taste.” I’m very pleased with how this feature turned out, since it was approximately a year-long effort going from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time.

Slides: pptx

Video: Youtube

Posted by andrew in Talks, 0 comments

Two engineering principles

I received two interesting pieces of advice at the AMP Lab retreat this past week, which concisely state some of my favorite software engineering principles:

  1. Don’t be a zealot. Understand in technical detail why a given language, framework, or design should be preferred, not because of technological fascination or fanboy-ism. The canonical examples here are programming language flamewars, e.g. Java vs. C++.
  2. Ruthlessly optimize for your requirements. This means first, carefully defining said requirements, but then being completely unafraid to buck conventional wisdom if it’s not a good match. This often means intentionally pruning out features, even common ones implemented by other systems.
Posted by andrew, 0 comments

Apache Hadoop committer

A quick post celebrating that I recently was made a committer on the Apache Hadoop project. I owe a big thanks to everyone who’s reviewed my patches and helped me along the way (especially my colleagues ATM, Todd, and Colin here at Cloudera).

My very first patch was HDFS-1952 in May 2011, via a Hadoop hackathon hosted at Cloudera. It was the most promising newbie HDFS JIRA on the list, and I still remember all the basic issues I had checking out the repo, setting up Eclipse, using Ant, and generating the diff. Two years later, these things have gotten easier 🙂

Here’s to many more contributions in the future!

Posted by andrew in Personal, 0 comments