Slides & Thoughts from Hadoop World NYC

High level languages for MapReduce

Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC. I gave a talk on rapid prototyping of data-intensive web applications using Hadoop, Hive, Python, and Ruby on Rails. The talk also touched on using R with Hadoop for statistical computing at scale. The sessions were taped, so I’ll update this post with a link to the video when it becomes available [UPDATE: video added below].

Building Data Intensive Apps with Hadoop and EC2 from Cloudera on Vimeo.

The slides give a high-level overview of how I built the open source trend tracking site trendingtopics.org over a few weeks last June using Amazon EC2 and Cloudera tools. The code for the site is on GitHub, and the raw data that powers it is available on Amazon Public Data Sets. I’ve also posted a series of tutorials related to trendingtopics on the Cloudera blog over the past few months:

Here are a few resources mentioned in the talk:

Conference Highlights

It felt like around half of the attendees of Hadoop World were developers or data hackers I know of via Twitter or the Hadoop mailing lists. This resulted in some decent Twitter coverage via the hadoopworld hashtag. The other half of the attendees represented enterprise IT, media companies, government, and financial firms who are either early adopters of Hadoop or interested in using it.

Some interesting announcements were made in the morning. Amazon added new features for the Elastic MapReduce service, including support for Hive, Cloudera’s Hadoop Distribution, and integration with Karmasphere Studio.

Cloudera’s big news was the launch of Cloudera Desktop, a new web-based unified user interface for users and operators of Hadoop clusters. Note that you can also run Cloudera Desktop on Amazon EC2. Cloudera also announced support for their distribution on SoftLayer and Rackspace, and outlined new features in the latest Hadoop distribution (CDH2), which includes support for HBase and Hadoop 0.20.1.

Vertica announced a partnership with Cloudera, which is an interesting development considering the RDBMS vs. MapReduce debates that took place last year.

I think I actually spent more time talking data with fellow hackers like Joshua Reich and Hilary Mason than I did in the talks, but I still managed to catch some good ones by the eHarmony team, Stuart Sierra, Deepak Singh, and several Yahoo people. As a big Python user, I was excited to hear that Jake Hofman from Yahoo! Research, NY plans an open source release of a Python-based Social Network Library for Hadoop, which he used to generate the Twitter analysis in his talk.

A big theme in my talk and others I attended was the use of high-level languages on top of Hadoop to accelerate development. Most of the teams I talked to actively use multiple abstractions on top of Hadoop, including Pig, Hive, Clojure, or other languages like Python through Hadoop Streaming.
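For readers who haven't used Hadoop Streaming, here is a minimal sketch of what a Python job can look like. It is a generic word count, not code from my talk or from trendingtopics. The function names and the `map`/`reduce` command-line argument are my own hypothetical conventions. The only thing assumed about Streaming itself is that it passes records as lines on stdin/stdout, tab-separated, and sorts by key between the two phases.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key.

    Assumes the input is already sorted by key, which is what the
    shuffle/sort step between map and reduce provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as `wordcount.py map` or `wordcount.py reduce`.
# Nothing is read from stdin unless a step is named explicitly.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same two functions can be tested locally by piping a file through the map step, `sort`, and the reduce step, which is a large part of why this style is handy for rapid prototyping.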

For further details check out these notes from other attendees:

  • Looking forward to the video of your talk. Love the blog and have been enjoying looking through the data sets you have listed. How come trending topics stops in August? Is it too expensive to keep it live?
  • pskomoroch
    Chris: Sorry for the delay getting back to you, I like your blog as well. I froze the updates in August while I was out of the country and haven't turned them back on yet. I moved out to the west coast when I got back and have been busy with some new projects, but I'm planning on turning the updates back on after Christmas, possibly with some additional tracking of government trends.

    Cheers,

    -Pete
  • bearrito
    Finding the article extremely informative.

    The only modification I have had to make so far has been the following:

    Under the Hive portion

    Change: LOAD DATA INPATH 'wikidump/' OVERWRITE INTO TABLE redirect_table;

    To : LOAD DATA INPATH 'hdfs://<localhost>:8020//user/root/wikidump' OVERWRITE INTO TABLE redirect_table;

    The difference here is that the port number needs to be included; otherwise semantic analysis fails.