02 April 2009

AWS just launched a new service called Amazon Elastic MapReduce that provides the same kind of developer-friendly API used for Amazon EC2 or S3 for running Hadoop jobs in the Cloud. You submit a job request and a number of instances to the API (pointing to input data and code on S3), and AWS spins up a private Hadoop cluster on EC2, submits your job, and reports back on status through the API. You can cancel or modify jobs using the API, and can even add additional steps to a running job.
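
To give a concrete feel for the API, here is a minimal sketch of what submitting a streaming job flow could look like from Python, assuming boto's EMR module (boto.emr); the client library choice, bucket names, and script paths are illustrative placeholders rather than the only way to talk to the service.

    # Minimal sketch of submitting a streaming job flow, assuming boto's EMR
    # support (boto.emr). Bucket names and script paths are hypothetical.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()  # picks up AWS credentials from the environment

    step = StreamingStep(
        name='similarity pass',
        mapper='s3n://my-bucket/code/mapper.py',
        reducer='s3n://my-bucket/code/reducer.py',
        input='s3n://my-bucket/input/',
        output='s3n://my-bucket/output/')

    jobflow_id = conn.run_jobflow(
        name='elastic-mapreduce-demo',
        log_uri='s3n://my-bucket/logs/',
        num_instances=4,
        steps=[step])

    # check on the job the same way the console does
    print(conn.describe_jobflow(jobflow_id).state)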

I was part of the private beta and wrote a short code sample that shows how to run Python streaming jobs using the service: Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming. As part of the code example, I also pulled together a cleaned-up version of the Audioscrobbler dataset for use in music recommendations (it is about 1/4 the size of the Netflix Prize data). The code sample implements a Python streaming version of the pairwise similarity algorithm described in this paper by Tamer Elsayed, Jimmy Lin, and Douglas Oard and applies it to Netflix Prize ratings and Audioscrobbler playlist data.
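
To give a flavor of the pairwise pass without reproducing the actual code sample, here is a rough streaming sketch. It assumes a hypothetical input format of one postings line per key (for example user, a tab, then item:weight,item:weight,...); the real code sample differs in the details.

    #!/usr/bin/env python
    # Rough streaming sketch of a pairwise similarity pass (not the actual
    # code sample). Input is assumed to be one postings line per key, e.g.
    #   user \t item:weight,item:weight,...
    import sys

    def mapper():
        # emit a partial dot-product contribution for every pair of items
        # that share a key (e.g. items rated or played by the same user)
        for line in sys.stdin:
            key, postings = line.rstrip('\n').split('\t', 1)
            items = [p.split(':') for p in postings.split(',')]
            for i in range(len(items)):
                a, wa = items[i]
                for b, wb in items[i + 1:]:
                    pair = ','.join(sorted((a, b)))
                    sys.stdout.write('%s\t%f\n' % (pair, float(wa) * float(wb)))

    def reducer():
        # sum the partial contributions for each item pair; streaming hands us
        # the mapper output sorted by key, so pairs arrive grouped together
        current, total = None, 0.0
        for line in sys.stdin:
            pair, partial = line.rstrip('\n').split('\t')
            if pair != current:
                if current is not None:
                    sys.stdout.write('%s\t%f\n' % (current, total))
                current, total = pair, 0.0
            total += float(partial)
        if current is not None:
            sys.stdout.write('%s\t%f\n' % (current, total))

    if __name__ == '__main__':
        if sys.argv[1:] == ['map']:
            mapper()
        else:
            reducer()

In a streaming step you would point the mapper at something like "pairwise.py map" and the reducer at "pairwise.py".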

The base EC2 images underlying the service run Hadoop 0.18.3 on Debian and include NumPy, SciPy, R, BeautifulSoup, and other preinstalled packages useful for Hadoop Streaming jobs. You can use the distributed cache to install other packages like nltk at runtime.
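
As a rough illustration of the distributed cache approach, a streaming mapper can add a cached archive to its sys.path before importing the package; the archive location and symlink name below are hypothetical, and the nltk call is just one that happens to work without extra data files.

    #!/usr/bin/env python
    # Sketch of a streaming mapper importing a package shipped through the
    # distributed cache. It assumes the job was started with something like
    #   -cacheArchive s3n://my-bucket/packages/nltk.tar.gz#nltkdir
    # so Hadoop unpacks the archive into a symlink named 'nltkdir' in the
    # task's working directory; the paths are hypothetical placeholders.
    import sys
    sys.path.insert(0, 'nltkdir')   # make the cached copy of nltk importable

    import nltk

    stemmer = nltk.PorterStemmer()  # pure Python, needs no downloaded corpora

    for line in sys.stdin:
        for word in line.strip().split():
            # emit key <tab> value pairs in the usual streaming format
            sys.stdout.write('%s\t1\n' % stemmer.stem(word.lower()))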

My initial impression is that this will evolve into a powerful tool for people who want to run ad hoc MapReduce jobs, prototype MapReduce code on EC2, or interface with on-demand clusters from within their apps. Hopefully we'll see a MapReduce code/task sharing facility at some point, similar to the EC2 public AMI system.

Note that in the current release of Elastic MapReduce, input data is copied down from S3 at the start of the job, and your cluster shuts itself down upon completion by default (you can override this through the API). Mounting data directly from EBS volumes isn't supported yet, but I wouldn't be surprised to see that soon given the potential for integration with Amazon Public Datasets. Running Dumbo jobs isn't supported yet since it requires a patch that isn't in Hadoop 0.18.3, but it should be possible when AWS moves to Hadoop 0.21 (which will also bring in a number of other important Hadoop features missing in 0.18.3).

For maintaining a permanent cluster in-house, or even a semi-permanent cluster on EC2 with a large amount of data, I would recommend the Cloudera distribution for Hadoop (starting an EC2 Hadoop cluster from the command line is a one-liner). While developing the code example, I often bounced between running jobs on my Cloudera EC2 cluster and on Elastic MapReduce. If you are getting started with Hadoop, the Cloudera training videos are a great place to get up to speed.

So what can you do with Elastic MapReduce? Here are a few initial ideas:

  • Offload background processing from your Rails or Django app to Hadoop by sending the Elastic MapReduce API job requests that point to data stored on S3: converting PDFs, classifying spam, deduplicating records, batch geocoding, etc.
  • Process large amounts of retail sales and inventory transaction data for sales forecasting and optimization
  • Use the AddJobFlowSteps method in the API to run iterative machine learning algorithms with MapReduce on a remote Hadoop cluster, then shut it down when your results converge (see the sketch after this list)
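
Here is the sketch mentioned in the last bullet, again assuming boto's EMR module; the convergence test, bucket paths, and step scripts are hypothetical placeholders.

    # Sketch of the iterative pattern: keep the cluster alive between steps,
    # add a new step per iteration with AddJobFlowSteps, and terminate once
    # the model converges. Assumes boto's EMR module; paths are hypothetical.
    import time
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    def converged(iteration):
        # placeholder: inspect the latest output on S3 and decide whether
        # another pass is needed
        return iteration >= 10

    conn = EmrConnection()
    jobflow_id = conn.run_jobflow(
        name='iterative-job',
        log_uri='s3n://my-bucket/logs/',
        num_instances=4,
        keep_alive=True,   # override the default shut-down-when-done behavior
        steps=[])

    iteration = 0
    while not converged(iteration):
        step = StreamingStep(
            name='iteration-%d' % iteration,
            mapper='s3n://my-bucket/code/iterate_mapper.py',
            reducer='s3n://my-bucket/code/iterate_reducer.py',
            input='s3n://my-bucket/model/iter-%d/' % iteration,
            output='s3n://my-bucket/model/iter-%d/' % (iteration + 1))
        conn.add_jobflow_steps(jobflow_id, [step])
        # crude polling: wait until the cluster is idle again before deciding
        # whether to run another iteration
        time.sleep(30)
        while conn.describe_jobflow(jobflow_id).state != 'WAITING':
            time.sleep(30)
        iteration += 1

    conn.terminate_jobflow(jobflow_id)  # shut the cluster down once converged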

I'll post more on this later today, including a detailed explanation of using the Netflix Prize data in the code example and some next steps for using Elastic MapReduce.