Amazon Elastic MapReduce: A Web Service API for Hadoop

AWS just launched a new service called Amazon Elastic MapReduce that provides the same kind of developer-friendly API used for Amazon EC2 or S3 for running Hadoop jobs in the cloud. You submit a job request and the number of instances to the API (pointing to input data and code on S3), and AWS spins up a private Hadoop cluster on EC2, submits your job, and reports back on status through the API. You can cancel or modify jobs using the API, and can even add steps to a running job.
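To make the shape of such a request concrete, here is a minimal sketch that only builds the request as a Python dictionary and does not call AWS. The field names are modeled on the RunJobFlow request as I understand it and should be checked against the API reference; the function name, bucket names, script names, and streaming jar path are all hypothetical placeholders.

```python
def build_streaming_job_flow(name, instance_count, mapper, reducer,
                             input_uri, output_uri, streaming_jar):
    """Return a dict describing a one-step Hadoop streaming job flow.

    Nothing is sent to AWS here; the dict only illustrates what a job
    request carries: an instance count, plus a step pointing at code
    and data on S3.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    step = {
        "Name": "streaming-step",
        # Shut the cluster down if the step fails.
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": streaming_jar,  # assumed path, supplied by the caller
            "Args": ["-mapper", mapper, "-reducer", reducer,
                     "-input", input_uri, "-output", output_uri],
        },
    }
    return {
        "Name": name,
        "Instances": {
            "InstanceCount": instance_count,
            # False means the cluster terminates once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [step],
    }


# Hypothetical bucket and script names, for illustration only.
request = build_streaming_job_flow(
    name="similarity-job",
    instance_count=4,
    mapper="s3://example-bucket/code/mapper.py",
    reducer="s3://example-bucket/code/reducer.py",
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
    streaming_jar="/path/to/hadoop-streaming.jar",
)
```

Keeping the request as plain data makes it easy to inspect or log before handing it to whichever client library you use to talk to the service.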

I was part of the private beta and wrote a short code sample that shows how to run Python streaming jobs using the service: Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming. As part of the code example, I also pulled together a cleaned-up version of the Audioscrobbler dataset for use in music recommendations (it is about 1/4 the size of the Netflix Prize data). The code sample implements a Python streaming version of the Pairwise Similarity algorithm found in this paper by Tamer Elsayed, Jimmy Lin, and Douglas Oard and applies it to Netflix Prize ratings and Audioscrobbler playlist data.
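For readers who want a feel for the pairwise approach before opening the code sample, here is a simplified in-memory sketch of the general idea: a map phase emits a partial product for every pair of items that a user rated, and a reduce phase sums those products per pair. This is not the code from the sample; the function names are hypothetical, the result is an unnormalized dot product, and a real streaming job would read and write tab-separated lines on stdin and stdout instead of Python objects.

```python
from collections import defaultdict
from itertools import combinations


def map_user_to_pairs(ratings):
    """Map phase: for one user's (item, rating) list, emit every item pair.

    Items are sorted so each pair always appears in the same order,
    which lets the reducer group identical pairs together. Assumes a
    user rates each item at most once.
    """
    for (item_a, r_a), (item_b, r_b) in combinations(sorted(ratings), 2):
        yield (item_a, item_b), r_a * r_b


def reduce_pairs(pair_products):
    """Reduce phase: sum the partial products emitted for each item pair."""
    totals = defaultdict(float)
    for pair, product in pair_products:
        totals[pair] += product
    return dict(totals)


def pairwise_similarity(user_ratings):
    """Run the map and reduce phases in memory over {user: [(item, rating)]}."""
    emitted = []
    for ratings in user_ratings.values():
        emitted.extend(map_user_to_pairs(ratings))
    return reduce_pairs(emitted)


# Toy data, invented for illustration.
scores = pairwise_similarity({
    "u1": [("A", 5), ("B", 3)],
    "u2": [("B", 4), ("A", 1), ("C", 2)],
})
```

Only pairs that co-occur for at least one user are ever emitted, which is what keeps this approach tractable on sparse ratings data.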

The base EC2 images underlying the service are running Hadoop 0.18.3 on Debian and include NumPy, SciPy, R, BeautifulSoup, and other preinstalled packages useful for Hadoop Streaming jobs. You can use the distributed cache to install other packages like nltk at runtime.

My initial impression is that this will evolve into a powerful tool for people who want to run ad hoc MapReduce jobs, prototype MapReduce code on EC2, or interface with on-demand clusters from within their apps. Hopefully we’ll see a MapReduce code/task sharing facility at some point, similar to the EC2 public AMI system.

Note that in the current release of Elastic MapReduce, input data is copied down from S3 at the start of the job and your cluster shuts itself down upon completion by default (you can override this with the API). Mounting data directly from EBS volumes isn’t supported yet, but I wouldn’t be surprised to see that soon given the potential for integrating with Amazon Public Datasets. Running Dumbo jobs isn’t supported yet since it requires a Hadoop patch for 0.18.3, but it should be possible when AWS moves to Hadoop 0.21 (which will also bring in a number of other important Hadoop features that are missing in 0.18.3).

For maintaining a permanent cluster in-house or even a semi-permanent cluster on EC2 with a large amount of data, I would recommend using the Cloudera distribution for Hadoop (it is a one-liner to start an EC2 Hadoop cluster from the command line). I would often bounce between running jobs on my Cloudera EC2 cluster and Elastic MapReduce during development of the code example. If you are getting started with Hadoop, the Cloudera training videos are a great place to get up to speed.

So what can you do with Elastic MapReduce? Here are a few initial ideas:

  • Offload background processing from your Rails or Django app to Hadoop by sending job requests to the Elastic MapReduce API that point to data stored on S3: convert PDFs, classify spam, deduplicate records, batch geocode addresses, etc.
  • Process large amounts of retail sales and inventory transaction data for sales forecasting and optimization
  • Use the AddJobFlowSteps method in the API to run iterative machine learning algorithms using MapReduce on a remote Hadoop cluster and shut it down when your results converge to an answer
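The iterative idea in the last bullet can be sketched as a simple driver loop. This is only an outline under stated assumptions: `submit_step` stands in for whatever code calls AddJobFlowSteps and waits for the step to finish, and `read_change` stands in for code that reads a convergence measure from the step's output. Both are hypothetical callables supplied by the caller, as are the function name and default thresholds.

```python
def run_until_converged(submit_step, read_change,
                        tolerance=1e-3, max_iterations=20):
    """Add one step per iteration until the change drops below tolerance.

    submit_step(iteration): hypothetical callable that adds a step to the
        running job flow and blocks until that step completes.
    read_change(iteration): hypothetical callable that returns how much
        the result moved in that iteration (for example, a parameter delta).

    Returns (iterations_run, converged). The caller is expected to shut
    the cluster down afterward.
    """
    for iteration in range(1, max_iterations + 1):
        submit_step(iteration)
        change = read_change(iteration)
        if change < tolerance:
            return iteration, True
    return max_iterations, False


# Stand-in callables so the loop can be exercised without a cluster.
submitted = []
fake_changes = {1: 1.0, 2: 0.1, 3: 0.0001}
result = run_until_converged(submitted.append, fake_changes.get)
```

Capping the number of iterations matters here, since a job that never converges would otherwise keep a paid cluster running indefinitely.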

I’ll post more on this later today - including a detailed explanation of using Netflix Prize data in the code example and some next steps for using Elastic MapReduce.

  • Peter T,


    Good question. I think this is something Amazon needs to sort out in general. For Elastic MapReduce (EMR) it is easier to work with chunked S3 files at the moment, so I just run a script to copy the raw data up to S3 in chunks using s3cmd. Outside of EMR I usually use single EBS volumes for datasets < 1 TB, since they can be directly mounted on EC2 and are easier to work with. I haven't tried the EBS + EMR hack yet, but will let you know how that goes if I try it.


    Contact me with the exact data sizes you are dealing with and maybe I can suggest an approach...


    -Pete

  • Peter,


    Great stuff - thanks very much for posting this.


    We have 100 GB to multi-TB input files required for various non-Java analysis applications. Would you recommend we use the same approach so beautifully presented above and in the Python tutorial, or do you recommend we take a different approach to making the input data available?


    Peter

  • Here's an idea for a streaming hack I mentioned on the forums. There is no built-in support for pulling data from EBS volumes instead of S3, and boto isn't installed, but you could use the distributed cache to install boto, mount an EBS public dataset from within the first mapper, and load it onto HDFS. Something like this:


    1) send a zipped boto src directory, renamed with a .mod extension, over the distributed cache


    2) add an initial dummy JobFlow step where the mapper opens boto.mod, reads in your credentials, imports boto, and mounts the EBS snapshot using your AWS credentials





    import zipimport
    importer = zipimport.zipimporter('boto.mod')
    boto = importer.load_module('boto')





    attach and mount the public dataset volume (shown here with the EC2 command-line tools rather than the equivalent boto calls):


    ec2-attach-volume vol-e84aae81 -i i-a09104c9 -d /dev/sdf
    mkdir /mnt/genbank
    mount /dev/sdf /mnt/genbank


    3) after the volume is mounted, shell out within the mapper and copy the data from the EBS volume to HDFS:


    hadoop fs -put /mnt/genbank /home/hadoop/genbank


    4) proceed with the next JobFlow step, using hdfs:///home/hadoop/genbank as input

  • Thanks Steve, I've been following the Mahout mailing list for a while and look forward to seeing more machine learning algorithms implemented in mapreduce. It is a young project, but I'm interested to see how it develops.


    This post was extremely short since I wanted to get the announcement out quickly. I'll have more to post about machine learning with Hadoop soon. Until then, check out http://delicious.com/pskomoroch/hadoop+machinel... and http://delicious.com/pskomoroch/mapreduce+machi...

  • Since this blog is related to machine learning as well, I think some of your readers would be interested to know about the Mahout project [1], an Apache project that aims to develop various machine learning algorithms in the MapReduce paradigm using Hadoop.


    [1] mahout: http://lucene.apache.org/mahout/
