In-memory Caching in HDFS: Lower latency, same great taste

My coworker Colin McCabe and I recently gave a talk at Hadoop Summit Amsterdam titled “In-memory Caching in HDFS: Lower latency, same great taste.” I’m very pleased with how this feature turned out, since it took roughly a year to go from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time.

Slides: pptx

Video: YouTube

Posted by andrew in Talks, 0 comments

Two engineering principles

I received two interesting pieces of advice at the AMP Lab retreat this past week, which concisely state some of my favorite software engineering principles:

  1. Don’t be a zealot. Prefer a given language, framework, or design because you understand in technical detail why it is the better choice, not because of technological fascination or fanboy-ism. The canonical examples here are programming language flamewars, e.g. Java vs. C++.
  2. Ruthlessly optimize for your requirements. This means first carefully defining those requirements, and then being completely unafraid to buck conventional wisdom when it’s not a good match. This often means intentionally pruning out features, even common ones implemented by other systems.

Posted by andrew, 0 comments

Apache Hadoop committer

A quick post celebrating that I was recently made a committer on the Apache Hadoop project. I owe a big thanks to everyone who’s reviewed my patches and helped me along the way (especially my colleagues ATM, Todd, and Colin here at Cloudera).

My very first patch was HDFS-1952 in May 2011, via a Hadoop hackathon hosted at Cloudera. It was the most promising newbie HDFS JIRA on the list, and I still remember all the basic issues I had checking out the repo, setting up Eclipse, using Ant, and generating the diff. Two years later, these things have gotten easier 🙂

Here’s to many more contributions in the future!

Posted by andrew in Personal, 0 comments

Grad school four months out

Here’s my account of leaving the PhD program at Berkeley to work at Cloudera. My experience might not be representative or generalize beyond my own situation, but I’m writing this because a number of people have asked me about the differences between grad school and industry. Choosing to leave Berkeley was a very personal decision, but fortunately I’m happy with how it’s turned out.

This also serves as my “Year in review: 2012” post, since this was the major change in my life last year.

Posted by andrew in Personal, 0 comments

Hadoop 101 slides

I gave a guest lecture on the Hadoop stack last week in Tapan Parikh’s INFO 206: Distributed Computing Applications and Infrastructure course at Berkeley. I took a more academic approach than most, talking about the original motivating problem of Google search before moving into a deep dive into HDFS and MapReduce and an overview of the rest of the Hadoop ecosystem.

A couple of students came up afterwards to say they enjoyed the talk, so I think it was well-received.

Slides: pptx and pdf

Posted by andrew in Talks, 0 comments

Highly-available audio in HDFS

Here on the HDFS team at Cloudera, we believe in eating our own dogfood. Since we value our (substantial) MP3 collections quite dearly, it’s only natural to store them in a high-performance, highly-available, enterprise-quality distributed filesystem like HDFS. Today, I’m announcing the next generation in aural HDFS enjoyment: listening to music directly from the NameNode web UI.

Posted by andrew, 0 comments

Bucket list: Cycling a century

I’ve been taking a little time off while transitioning from grad life at Berkeley to working full-time at Cloudera. I decided to use some of this vacation time to check off a bucket list item: bicycling an imperial century (100 miles). Here’s my experience, and advice for anyone who wants to do the same.

Posted by andrew in Personal, 0 comments

Cake presented at SoCC

I just presented some of our work at Berkeley at SoCC, on “Cake: Enabling High-Level SLOs on Shared Storage.” It’s a coordinated, multi-resource scheduler for storage workloads, which enables consolidation of front-end and back-end workloads while meeting the high-level performance requirements of the front-end workload. Consolidation lowers economic costs (by reducing overprovisioning and underutilization), and also significantly reduces the latency of traditional unconsolidated copy-then-process analytics cycles.

A PDF of the paper and the slides from my presentation are available on my research page.

Posted by andrew in Talks, 0 comments

MinuteSort with Flat Datacenter Storage

Microsoft Research recently crushed the world record for MinuteSort, sorting 1.4TB in a minute. This replaces the former record held by Yahoo’s 1406-node Hadoop cluster in the Daytona MinuteSort category, and means that Hadoop no longer holds any world sorting record titles.

I found MSR’s approach of “MinuteSort with Flat Datacenter Storage” (FDS) to be intriguing. Most of the prior sort winners (e.g. Hadoop, TritonSort) try to colocate computation and data, since you normally pay a throughput (and thus latency) cost to go over the network. FDS instead separates compute from storage, heavily provisioning a full-bisection-bandwidth network to match the I/O rate of the hard disks on storage nodes.

I’m going to give a rundown of the paper, and then pull out salient points for Hadoop at the end.

Posted by andrew in Reviews, 0 comments