Year: 2011

External sorting of large datasets

This is a common interview question: how do you sort data that is bigger than memory? “Big data” in the range of tera or petabytes can now almost be considered the norm (think of Google saving every search, click, and ad impression ever), so this manifests in reality as well. This is also a canonical problem in the database world, where it is referred to as an “external sort”.

Your mind should immediately turn to divide and conquer algorithms, namely merge sort. Write out intermediate merged output to disk, and read it back in lazily for the next round. I decided this would be a fun implementation and optimization exercise to do in C. There will probably be a follow-up post, since there are lots of optimizations I haven’t yet implemented.

Continue reading →

Posted by andrew, 0 comments

Static website hosting on Amazon S3

Werner Vogels, Amazon CTO, posted on his blog about a month ago on “New AWS feature: Run your website from Amazon S3“. S3 now offers the ability to host static HTML pages directly from an S3 bucket, which is a great alternative for small blogs and sites (provided, of course, that you don’t actually need any dynamic content). This has the potential to greatly reduce your hosting costs. A small Dreamhost/Slicehost/Linode costs around $20 a month, and I used to run this site out of an extreme budget VPS (Virpus) which was only $5 a month, but I expect to be paying only a few cents per month for S3 (current pricing is just 15¢ per GB-month). Of course, you also gain best-of-class durability, fault-tolerance, and scalability from hosting out of S3, meaning that your little site should easily survive a slashdotting.

The difficulty here is that most of the popular blogging engines require a backing database, and do their content generation dynamically server side. That doesn’t fly with S3; since it is, after all, just a Simple Storage Service, content has to be static and pregenerated. I chose to use Hyde, a Python content generator that turns templates (based on the Django templating engine) into HTML. Hyde page templates are dynamic, written in Django’s templating language which supports variables, control flow, and hierarchal inheritance. Hyde will parse these templates, fill in the dynamic content, and finally generate static HTML pages suitable for uploading to S3. Ruby folks can check out Jekyll as an alternative.

Continue reading →

Posted by andrew, 0 comments