Hadoop 101 slides

I gave a guest lecture on the Hadoop stack last week in Tapan Parikh’s INFO 206: Distributed Computing Applications and Infrastructure course at Berkeley. I took a more academic approach than most, starting with the original motivating problem of Google search before moving into a deep dive into HDFS and MapReduce and an overview of the rest of the Hadoop ecosystem.

A couple of students came up afterwards to say they enjoyed the talk, so I think it was well received.

Slides: pptx and pdf

Posted by andrew in Talks, 0 comments

Highly-available audio in HDFS

Here on the HDFS team at Cloudera, we believe in eating our own dogfood. Since we value our (substantial) MP3 collections quite dearly, it’s only natural to store them in a high-performance, highly-available, enterprise-quality distributed filesystem like HDFS. Today, I’m announcing the next generation in aural HDFS enjoyment: listening to music directly from the Namenode web UI.

Continue reading →

Posted by andrew, 0 comments

Bucket list: Cycling a century

I’ve been taking a little time off while transitioning from grad life at Berkeley to working full-time at Cloudera. I decided to use some of this vacation time to check off a bucket list item: bicycling an imperial century (100 miles). Here’s my experience, and advice for anyone who wants to do the same.

Continue reading →

Posted by andrew in Personal, 0 comments

Cake presented at SoCC

I just presented some of our work at Berkeley at SoCC, on “Cake: Enabling High-Level SLOs on Shared Storage.” It’s a coordinated, multi-resource scheduler for storage workloads, which enables consolidation of front-end and back-end workloads while meeting the high-level performance requirements of the front-end workload. Consolidation lowers economic costs (by reducing overprovisioning and underutilization) and also significantly reduces the latency of traditional unconsolidated copy-then-process analytics cycles.

A PDF of the paper and the slides from my presentation are available on my research page.

Posted by andrew in Talks, 0 comments

MinuteSort with Flat Datacenter Storage

Microsoft Research recently crushed the world record for MinuteSort, sorting 1.4 TB in a minute. This replaces the former record held by Yahoo’s 1406-node Hadoop cluster in the Daytona MinuteSort category, and means that Hadoop no longer holds any world sorting record titles.

I found MSR’s approach of “MinuteSort with Flat Datacenter Storage” (FDS) to be intriguing. Most of the prior sort winners (e.g. Hadoop, TritonSort) try to colocate computation and data, since you normally pay a throughput (and thus latency) cost to go over the network. FDS separates out compute from storage, heavily provisioning a full bisection bandwidth network to match the I/O rate of the hard disks on storage nodes.
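For a sense of scale, the headline number alone implies a large aggregate data rate. The snippet below is just back-of-the-envelope arithmetic on the figures quoted above; it assumes decimal units (1 TB = 10^12 bytes) and ignores that the input has to be both read and written back out, so the real traffic through the cluster is higher.

```python
# Back-of-the-envelope: aggregate rate implied by sorting 1.4 TB in one minute.
# Assumption: decimal units (1 TB = 1e12 bytes, 1 GB = 1e9 bytes).
DATA_BYTES = 1.4e12
SECONDS = 60


def aggregate_rate_gb_per_s(data_bytes, seconds):
    """Return the average data rate in GB/s needed to move data_bytes in seconds."""
    return data_bytes / seconds / 1e9


rate = aggregate_rate_gb_per_s(DATA_BYTES, SECONDS)
print(f"{rate:.1f} GB/s")  # prints "23.3 GB/s"
```

That rate has to be sustained across the whole cluster for the full minute, which is why the network provisioning matters so much once compute and storage are separated.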

I’m going to give a rundown of the paper, and then pull out salient points for Hadoop at the end.

Continue reading →

Posted by andrew in Reviews, 0 comments

JVM Performance Tuning (notes)

A presentation by Attila Szegedi titled “Everything I Ever Learned about JVM Performance Tuning @twitter” has been floating around for a few months. I’ve restructured much of the content into a set of notes. This covers the basics of memory allocation and garbage collection in Java, the different garbage collectors available in HotSpot and how they can be tuned, and finally some anecdotes from Attila’s experiences at Twitter.

I’m still fuzzy on some things, so it’s not ground truth. If more experienced people weigh in, I’ll fix things up. The very informative hour-long presentation is still highly recommended.

Continue reading →

Posted by andrew in Reviews, 0 comments

Year in review: 2011 (personal)

I like to take some time every once in a while to think about what I’ve done that I’m proud of, what I’ve learned, and what I want to do. With the beginning of a new year comes the perfect opportunity to reflect on my life in the year past.

I’ve split it into two separate blog posts, one professional (meaning research and grad school life) and one personal (meaning hobbies, self-improvement, life goals). This post covers the latter: my personal life in 2011.

This post is also split up into different sections. The first is again accomplishments, then I cover time management, my three main hobbies these days, and then miscellaneous other personal stuff.

Continue reading →

Posted by andrew in Personal, 0 comments

Year in review: 2011 (professional)

I like to take some time every once in a while to think about what I’ve done that I’m proud of, what I’ve learned, and what I want to do. With the beginning of a new year comes the perfect opportunity to reflect on my life in the year past.

I’ve split it into two separate blog posts, one professional (meaning research and grad school life) and one personal (meaning hobbies, self-improvement, life goals). This post covers the former: my life as a grad student in 2011.

First, things I’ve done in the past year that I’m proud of. Next, meta-comments related to research. Finally, a section on teaching, since I TA’d CS162 this past semester (my first teaching experience).

Continue reading →

Posted by andrew in Personal, 0 comments

Paper review: Facebook Haystack

This is a review of Facebook’s Haystack storage system, used to store the staggering number of photos that are uploaded to Facebook every day. Facebook Photos started out with an NFS appliance, but was forced to move to a custom solution for reasons of cost, scale, and performance. Haystack is an engineering solution that applies well-known techniques from GFS and log-structured filesystems to their distributed, append-only, key-value blob situation. Their metadata management and CDN integration are somewhat novel.

The paper, “Finding a needle in Haystack: Facebook’s photo storage” by Beaver et al., was published at OSDI ’10.
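To make the “append-only, key-value blob” idea concrete, here is a minimal single-machine sketch: blobs are appended to one large log, and a small in-memory index maps each key to the blob’s offset and size. This is not Haystack’s actual on-disk format or API; the class name, header layout, and method names are all hypothetical and chosen only for illustration.

```python
import io
import struct


class AppendOnlyBlobStore:
    """Toy append-only blob store (hypothetical, not Haystack's real format).

    Every blob is appended to a single file-like log. An in-memory index
    maps key -> (offset, size), so a read is one seek plus one read.
    """

    # Per-record header: 8-byte unsigned key, 4-byte unsigned blob size.
    _HEADER = struct.Struct(">QI")

    def __init__(self, backing=None):
        # Any seekable binary file-like object works; default to memory.
        self._log = backing if backing is not None else io.BytesIO()
        self._index = {}

    def put(self, key, blob):
        """Append a blob; an existing key is shadowed by the newer record."""
        self._log.seek(0, io.SEEK_END)
        record_start = self._log.tell()
        self._log.write(self._HEADER.pack(key, len(blob)))
        self._log.write(blob)
        self._index[key] = (record_start + self._HEADER.size, len(blob))

    def get(self, key):
        """Return the most recently written blob for key (KeyError if absent)."""
        offset, size = self._index[key]
        self._log.seek(offset)
        return self._log.read(size)

    def delete(self, key):
        """Forget the key; its bytes stay in the log until a compaction pass."""
        self._index.pop(key, None)


store = AppendOnlyBlobStore()
store.put(1, b"cat.jpg bytes")
store.put(2, b"dog.jpg bytes")
store.put(1, b"new cat.jpg bytes")  # overwrite by appending a newer record
print(store.get(1))
```

The point of the design is that the per-blob metadata is small enough to keep in memory, so serving a photo does not need extra disk reads just to locate it. Overwrites and deletes leave dead bytes in the log, which a real system would reclaim with some form of compaction.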

Continue reading →

Posted by andrew in Reviews, 0 comments