Work

This is a listing of my professional activities, including conference presentations, projects, and publications.

Presentations

What’s new in Hadoop 3.0. Presented at Strata Data Conference San Jose. March 2018. [Slides PDF]

Apache Hadoop 3.0 features and development update. Presented at Strata Data Conference Beijing. July 2017.

Demystifying erasure coding in HDFS. Presented at Strata Data Conference Beijing. July 2017.

A Tale of Two Developers: Finding Harmony Between Commercial Software Development and the Apache Way. Presented at ApacheCon North America. May 2017. [Video YouTube]

The Next Generation of Apache Hadoop: Open Problems in Distributed Storage and Resource Management. Presented at USENIX ATC. June 2016. [Slides PPTX] [Slides PDF] [Audio mp3]

Happier Developers and Happier Software Through Distributed Testing. Presented at Apache Big Data. May 2016. [Slides PPTX]

Transparent encryption in HDFS: The missing piece in big data security. Presented at Strata Hadoop World New York 2015, Strata Hadoop World London 2015, and Hadoop Summit Brussels 2015. [Slides PPTX] [Video YouTube]

In-memory Caching in HDFS: Lower Latency, Same Great Taste. Presented at Hadoop Summit Amsterdam. 2014. [Slides PPTX] [Slideshare] [Video YouTube]

News

The Apache Software Foundation: The Apache Software Foundation Announces Apache® Hadoop® v3.0.0 General Availability. December 2017.

Datanami: Committers Talk Hadoop 3 at Apache Big Data. May 2017.

Service

USENIX 2018, Program committee.

Research Projects

As I’m no longer employed as an academic, I am not currently working on any research projects. The work here dates from my time at UC Berkeley in the Algorithms, Machines and People (AMP) Lab, where I was advised by Ion Stoica.

Cake (formerly Frosting) is a multi-resource scheduler for shared storage systems that supports enforcement of high-level service-level objectives. This allows latency-sensitive and batch workloads to be consolidated onto the same storage system, reducing provisioning costs, improving utilization, and shortening traditional copy-then-process analytics cycles. Cake was presented at SoCC 2012.

PACMan is an in-memory caching infrastructure for Hadoop and HDFS, tuned for large datacenter workloads. We developed novel cache management policies that perform significantly better than theoretically “optimal” cache-eviction algorithms (such as farthest-in-the-future) by exploiting characteristics of workload traces from large internet companies. PACMan was published at NSDI 2012.

CrowdDB is a relational database that has been extended with crowdsourcing operators, allowing it to incorporate human computation and knowledge as part of query execution. By crowdsourcing small tasks to human workers, CrowdDB can solve AI-hard problems that are difficult to tackle with traditional databases. I contributed to a demo of CrowdDB shown at VLDB 2011 in Seattle, which won the inaugural Best Demo award.

Publications

Try looking on DBLP as well.

Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz: Cake: Enabling High-level SLOs on Shared Storage Systems. SoCC 2012. [PDF] [Talk Slides]

Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz: Sweet Storage SLOs with Frosting. HotCloud 2012. [PDF] [Talk Slides]

Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica: PACMan: Coordinated Memory Caching for Parallel Jobs. NSDI 2012. [PDF]

Amber Feng, Michael J. Franklin, Donald Kossmann, Tim Kraska, Samuel Madden, Sukriti Ramesh, Andrew Wang, Reynold Xin: CrowdDB: Query Processing with the VLDB Crowd. PVLDB 4(12): 1387-1390 (2011). Best Demo Award!

Raghavendra Rajkumar, Andrew Wang, Jason Hiser, Anh Nguyen-Tuong, Jack W. Davidson, John C. Knight: Component-Oriented Monitoring of Binaries for Security. HICSS 2011: 1-10

Anh Nguyen-Tuong, Andrew Wang, Jason Hiser, John C. Knight, Jack W. Davidson: On the effectiveness of the metamorphic shield. ECSA Companion Volume 2010: 170-174