Slides & Thoughts from Hadoop World NYC

Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC. I gave a talk on rapid prototyping of data-intensive web applications using Hadoop, Hive, Python, and Ruby on Rails. The talk also touched on using R with Hadoop for statistical computing at scale. The sessions were taped, so I’ll update this post with a link to the video when it becomes available [UPDATE: video added below].
Building Data Intensive Apps with Hadoop and EC2 from Cloudera on Vimeo.
The slides give a high-level overview of how I built the open source trend-tracking site trendingtopics.org over a few weeks last June using Amazon EC2 and Cloudera tools. The code for the site is on GitHub, and the raw data that powers it is available on Amazon Public Data Sets. I’ve also posted a series of tutorials related to trendingtopics on the Cloudera blog over the past few months:
- Tracking Trends with Hadoop and Hive on EC2
- Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
- Grouping Related Trends with Hadoop and Hive
Here are a few resources mentioned in the talk:
- Trendingtopics code on GitHub
- Wikipedia Page Traffic Statistics Dataset
- EMR Forum discussion about using R with Hadoop (scroll down for R code that runs on Twitter data)
- David Rosenberg’s R Streaming package on CRAN
- How FlightCaster Squeezes Predictions from Flight Data
Conference Highlights
It felt like around half of the attendees of Hadoop World were developers or data hackers I know of via Twitter or the Hadoop mailing lists. This resulted in some decent Twitter coverage via the hadoopworld hashtag. The other half of the attendees represented enterprise IT, media companies, government, and financial firms that are either early adopters of Hadoop or interested in using it.
Some interesting announcements were made in the morning. Amazon added new features for the Elastic MapReduce service, including support for Hive, Cloudera’s Hadoop Distribution, and integration with Karmasphere Studio.
Cloudera’s big news was the launch of Cloudera Desktop, a new web-based unified user interface for users and operators of Hadoop clusters. Note that you can also run Cloudera Desktop on Amazon EC2. Cloudera also announced support for their distribution on SoftLayer and Rackspace, and outlined new features in their latest Hadoop distribution (CDH2), which includes support for HBase and Hadoop 0.20.1.
Vertica announced a partnership with Cloudera, which is an interesting development considering the RDBMS vs. MapReduce debates that took place last year.
I think I actually spent more time talking data with fellow hackers like Joshua Reich and Hilary Mason than I did in the talks, but I still managed to catch some good ones by the eHarmony team, Stuart Sierra, Deepak Singh, and several Yahoo people. As a big Python user, I was excited to hear that Jake Hofman from Yahoo! Research, NY plans an open source release of a Python-based Social Network Library for Hadoop, which he used to generate the Twitter analysis in his talk. A big theme in my talk and others I attended was the use of high-level languages on top of Hadoop to accelerate development. Most of the teams I talked to actively use multiple abstractions on top of Hadoop, including Pig, Hive, Clojure, or other languages like Python through Hadoop Streaming.
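To give a flavor of what Python through Hadoop Streaming looks like, here is a minimal sketch of a mapper and reducer that total page views per article. It is not the trendingtopics code: the function names are made up for illustration, and the four-field record layout (project, page title, view count, bytes) is an assumption about the page traffic data. Streaming just pipes lines through stdin and stdout, with the reducer input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'title<TAB>views' for each English-language record."""
    for line in lines:
        parts = line.split()
        # Assumed record layout: project, page title, view count, bytes.
        if len(parts) != 4 or parts[0] != "en" or not parts[2].isdigit():
            continue  # skip other projects and malformed lines
        yield f"{parts[1]}\t{parts[2]}"


def reduce_lines(lines):
    """Reducer: sum views per title. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for title, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{title}\t{sum(int(views) for _, views in group)}"


# Hypothetical command-line use: pass "map" or "reduce" as the only argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Saved as a script (say, a hypothetical trend_counts.py), this can be checked locally by piping a sample file through the map step, then sort, then the reduce step, which mimics what the framework does between the two phases.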
For further details check out these notes from other attendees:
- Amund Tveit has comprehensive notes of the morning and afternoon Hadoop World sessions
- The HubSpot team has two posts: Hadoop World 2009 and Hadoop World Impressions
- Hilary Mason wrote up some observations on her blog
- Deepak Singh, who presented on Hadoop in Bioinformatics, gives his perspective on the conference
