The Next Generation of Apache Hadoop

Apache Hadoop turned ten this year. To celebrate, Karthik and I gave a talk at USENIX ATC ’16 about open problems to solve in Hadoop’s second decade. This was an opportunity to revisit our academic roots and get a new crop of graduate students interested in the real distributed systems problems we’re trying to solve in industry.

This is a huge topic and we only had a 25-minute talk slot, so we were pitching problems rather than solutions. However, we did have some ideas in our back pocket, and the hallway track and the birds-of-a-feather session we hosted afterwards led to a lot of good discussion.

Karthik and I split up the content thematically, which worked really well. I covered scalability, meaning sharded filesystems and federated resource management. Karthik addressed scheduling (unifying batch jobs and long-running services) and utilization (overprovisioning, preemption, isolation).

I’m hoping to give this talk again in longer form, since I’m proud of the content.

Slides: pptx

USENIX site with PDF slides and audio

Posted by andrew in Talks, 0 comments

Distributed testing

I gave a presentation titled “Happier Developers and Happier Software through Distributed Testing” at Apache Big Data 2016. It detailed how our distributed unit testing framework cut the runtime of Apache Hadoop’s unit test suite by about 60x, from 8.5 hours to about 8 minutes, and the substantial productivity improvements that are possible when developers can easily run and interact with the test suite.

The infrastructure is general enough to accommodate any software project. We wrote frontends for both C++/gtest and Java/Maven.
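To give a feel for why spreading a test suite across many machines pays off, here is a minimal sketch in Python. It is not how dist_test itself assigns work; it assumes a made-up list of per-test durations and uses a generic greedy "longest test first" heuristic, and the function names `schedule_lpt` and `speedup` are invented for this illustration.

```python
import heapq


def schedule_lpt(durations, num_workers):
    """Assign tests to workers, longest test first (a generic heuristic).

    Returns the total load on each worker, in the same units as durations.
    """
    # Min-heap of (current_load, worker_index): the least-loaded worker pops first.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    loads = [0.0] * num_workers
    for d in sorted(durations, reverse=True):
        load, w = heapq.heappop(heap)
        loads[w] = load + d
        heapq.heappush(heap, (loads[w], w))
    return loads


def speedup(durations, num_workers):
    """Serial runtime divided by the runtime of the busiest worker."""
    serial = sum(durations)
    parallel = max(schedule_lpt(durations, num_workers))
    return serial / parallel


if __name__ == "__main__":
    # Hypothetical test durations in minutes, not measured Hadoop numbers.
    example = [5, 4, 3, 3, 2, 1]
    print(schedule_lpt(example, 2))
    print(speedup(example, 2))
```

The wall-clock time of a distributed run is bounded below by the slowest single test, so the speedup depends on how evenly the work can be split as well as on the number of workers. For the numbers in the talk, 8.5 hours is 510 minutes, and 510 / 8 is roughly 64, which matches the "about 60x" figure.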

This effort started as a Cloudera hackathon project that Todd Lipcon and I worked on two years ago, and I’m very glad we got it across the line. It’s also open source, and we’d love to see it rolled out to more projects.

Slides: pptx

Source code: cloudera/dist_test

Transparent encryption in HDFS

I went on a little European roadshow last month, presenting my recent work on transparent encryption in HDFS at Hadoop Summit Brussels and Strata Hadoop World London. I’ll also be giving the same talk this fall at Strata Hadoop World NYC, which may be the biggest audience I’ve ever spoken in front of.

Slides: pptx

Video: Hadoop Summit Brussels (YouTube)

If you have access to O’Reilly, there should be a higher quality video available there.

In-memory Caching in HDFS: Lower latency, same great taste

My coworker Colin McCabe and I recently gave a talk at Hadoop Summit Amsterdam titled “In-memory Caching in HDFS: Lower latency, same great taste.” I’m very pleased with how this feature turned out, since it was approximately a year-long effort going from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time.

Slides: pptx

Video: YouTube

Hadoop 101 slides

I gave a guest lecture on the Hadoop stack last week at Tapan Parikh’s INFO 206: Distributed Computing Applications and Infrastructure course at Berkeley. I took a more academic approach than most, talking about the original motivating problem of Google search before moving into a deep dive into HDFS and MapReduce and an overview of the rest of the Hadoop ecosystem.

A couple of students came up afterwards to say they enjoyed the talk, so I think it was well received.

Slides: pptx and pdf

Cake presented at SoCC

I just presented some of our Berkeley work at SoCC, on “Cake: Enabling High-Level SLOs on Shared Storage.” It’s a coordinated, multi-resource scheduler for storage workloads, which enables consolidation of front-end and backend workloads while meeting high-level performance requirements of the front-end workload. Consolidation lowers economic costs by reducing overprovisioning and underutilization. It also significantly reduces the latency of analytics, since the traditional unconsolidated approach has to copy data to a separate cluster before processing it.

A PDF of the paper and the slides from my presentation are available on my research page.

