JVM Performance Tuning (notes)

A presentation by Attila Szegedi titled “Everything I Ever Learned about JVM Performance Tuning @twitter” has been floating around for a few months. I’ve restructured much of the content into a set of notes. This covers the basics of memory allocation and garbage collection in Java, the different garbage collectors available in HotSpot and how they can be tuned, and finally some anecdotes from Attila’s experiences at Twitter.

I’m still fuzzy on some things, so it’s not ground truth. If more experienced people weigh in, I’ll fix things up. The very informative hour-long presentation is still highly recommended.

Posted by andrew in Reviews, 0 comments

Year in review: 2011 (personal)

I like to take some time every once in a while to think about what I’ve done that I’m proud of, what I’ve learned, and what I want to do. With the beginning of a new year comes the perfect opportunity to reflect on my life in the year past.

I’ve split it into two separate blog posts, one professional (meaning research and grad school life) and one personal (meaning hobbies, self-improvement, life goals). This post covers the latter: my personal life in 2011.

This post is also split into sections. The first is again accomplishments, then I cover time management, my three main hobbies these days, and finally miscellaneous other personal stuff.

Posted by andrew in Personal, 0 comments

Year in review: 2011 (professional)

I like to take some time every once in a while to think about what I’ve done that I’m proud of, what I’ve learned, and what I want to do. With the beginning of a new year comes the perfect opportunity to reflect on my life in the year past.

I’ve split it into two separate blog posts, one professional (meaning research and grad school life) and one personal (meaning hobbies, self-improvement, life goals). This post covers the former: my life as a grad student in 2011.

First, things I’ve done in the past year that I’m proud of. Next, meta-comments related to research. Finally, a section on teaching, since I TA’d CS162 this past semester (my first teaching experience).

Posted by andrew in Personal, 0 comments

Paper review: Facebook Haystack

This is a review of Facebook’s Haystack storage system, used to store the staggering number of photos that are uploaded to Facebook every day. Facebook Photos started out with an NFS appliance, but was forced to move to a custom solution for reasons of cost, scale, and performance. Haystack is an engineering solution that applies well-known techniques from GFS and log-structured filesystems to Facebook’s distributed, append-only, key-value blob workload. Its metadata management is somewhat novel, as is its CDN integration.

The paper, “Finding a needle in Haystack: Facebook’s photo storage” by Beaver et al., was published at OSDI ’10.

Posted by andrew in Reviews, 0 comments

External sorting of large datasets

This is a common interview question: how do you sort data that is bigger than memory? “Big data” in the range of terabytes or petabytes can now almost be considered the norm (think of Google saving every search, click, and ad impression ever), so this problem shows up in practice as well. It is also a canonical problem in the database world, where it is referred to as an “external sort”.

Your mind should immediately turn to divide and conquer algorithms, namely merge sort. Write out intermediate merged output to disk, and read it back in lazily for the next round. I decided this would be a fun implementation and optimization exercise to do in C. There will probably be a follow-up post, since there are lots of optimizations I haven’t yet implemented.
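As a rough illustration of the idea (sketched here in Python rather than C for brevity, and not the implementation from the full post), the following sorts a line-oriented text file in two passes: it sorts fixed-size chunks in memory and spills each one to disk as a sorted run, then lazily merges all the runs. The names `external_sort`, `_write_run`, and `max_lines` are made up for this example. It also assumes the number of runs is small enough to open every run file at once; a fuller version would merge in several rounds.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(sorted_lines, tmpdir):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted_lines)
    return path


def external_sort(in_path, out_path, max_lines=100_000):
    """Sort the lines of in_path into out_path, sorting at most max_lines at a time."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Pass 1: read a chunk, sort it in memory, spill it to disk as a run.
        runs = []
        with open(in_path) as src:
            while True:
                chunk = list(islice(src, max_lines))
                if not chunk:
                    break
                # Terminate a final unterminated line so merged lines stay separate.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                runs.append(_write_run(chunk, tmpdir))

        # Pass 2: k-way merge. heapq.merge pulls lines from each run lazily,
        # so only about one line per run is held in memory at a time.
        files = [open(path) for path in runs]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
```

Lines are compared as strings here, so numeric data would need a `key` function (both `list.sort` and `heapq.merge` accept one) or fixed-width encoding.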

Posted by andrew, 0 comments

Static website hosting on Amazon S3

Werner Vogels, Amazon CTO, posted on his blog about a month ago on “New AWS feature: Run your website from Amazon S3”. S3 now offers the ability to host static HTML pages directly from an S3 bucket, which is a great alternative for small blogs and sites (provided, of course, that you don’t actually need any dynamic content). This has the potential to greatly reduce your hosting costs. A small Dreamhost/Slicehost/Linode plan costs around $20 a month, and I used to run this site out of an extreme budget VPS (Virpus) which was only $5 a month, but I expect to be paying only a few cents per month for S3 (current pricing is just 15¢ per GB-month). Of course, you also gain best-in-class durability, fault tolerance, and scalability from hosting out of S3, meaning that your little site should easily survive a slashdotting.

The difficulty here is that most of the popular blogging engines require a backing database and generate their content dynamically on the server. That doesn’t fly with S3; since it is, after all, just a Simple Storage Service, content has to be static and pregenerated. I chose to use Hyde, a Python content generator that turns templates into HTML. Hyde page templates are written in Django’s templating language, which supports variables, control flow, and hierarchical inheritance. Hyde parses these templates, fills in the dynamic content, and generates static HTML pages suitable for uploading to S3. Ruby folks can check out Jekyll as an alternative.
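To make the “pregenerate everything” idea concrete, here is a toy sketch of what a static generator does. This is not Hyde’s API: it uses Python’s standard-library `string.Template` as a stand-in for Django templates (so no control flow or inheritance), and the `LAYOUT` template, the `render_site` function, and the page fields `slug`, `title`, and `body` are all invented for the example.

```python
import os
from string import Template

# Hypothetical layout; Hyde would use a Django template with inheritance here.
LAYOUT = Template(
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1>\n$body\n</body></html>\n"
)


def render_site(pages, out_dir):
    """Render each page dict to a static .html file under out_dir; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for page in pages:
        # Fill the template once, at build time, instead of on every request.
        # Values are inserted as-is; no HTML escaping is done in this sketch.
        html = LAYOUT.substitute(title=page["title"], body=page["body"])
        path = os.path.join(out_dir, page["slug"] + ".html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        written.append(path)
    return written
```

The output directory contains nothing but plain files, which is exactly what an S3 bucket can serve.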

Posted by andrew, 0 comments