Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC. I gave a talk on rapid prototyping of data intensive web applications using Hadoop, Hive, Python, and Ruby on Rails. The talk also had a few bits about using R with Hadoop for statistical computing at scale. The sessions were taped, and you can check out the video of my talk here: "Building Data Intensive Apps with Hadoop and EC2".
The slides give a high level overview of how I built the open source trend tracking site trendingtopics.org over a few weeks last June using Amazon EC2 and Cloudera tools. The code for the site is on Github and the raw data it is powered by is available on Amazon Public Data Sets. I've also posted a series of tutorials related to trendingtopics on the Cloudera blog over the past few months:
- Tracking Trends with Hadoop and Hive on EC2
- Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
- Grouping Related Trends with Hadoop and Hive
Here are a few resources mentioned in the talk:
- Trendingtopics code on github
- Wikipedia Page Traffic Statistics Dataset
- EMR Forum discussion about using R with Hadoop (scroll down for R code that runs on Twitter data)
- David Rosenberg's R Streaming package on CRAN
- How FlightCaster Squeezes Predictions from Flight Data
It felt like around half of the attendees of Hadoop World were developers or data hackers I know of via Twitter or the Hadoop mailing lists. This resulted in some decent Twitter coverage via the hadoopworld hash tag. The other half of attendees represented enterprise IT, media companies, government, and financial firms who are either early adopters or interested in using Hadoop.
Some interesting announcements were made in the morning. Amazon added new features for the Elastic MapReduce service, including support for Hive, Cloudera's Hadoop Distribution, and integration with Karmasphere Studio.
Cloudera's big news was the launch of Cloudera Desktop, a new web-based unified user interface for users and operators of Hadoop clusters. Note that you can also run the Cloudera Desktop on Amazon EC2. Cloudera announced support for their distribution on Softlayer and Rackspace. They also outlined new features in the latest Hadoop distribution (CDH2), which includes support for HBase and Hadoop 0.20.1.
I think I actually spent more time talking data with fellow hackers like Joshua Reich and Hillary Mason than I did in the talks, but still managed to catch some good ones by the EHarmony team, Stuart Sierra, Deepak Singh, and several Yahoo people. As a big Python user, it was exciting to hear that Jake Hofman from Yahoo! Research, NY plans on an open source release of a Python based Social Network Library for Hadoop, which he used to generate the Twitter analysis in his talk. A big theme in my talk and others I attended was the use of high level languages on top of Hadoop to accelerate development. Most of the teams I talked to actively use multiple abstractions on top of Hadoop, including Pig, Hive, Clojure, or other languages like Python through Hadoop Streaming.
For further details check out these notes from other attendees:
- Amund Tveit has comprehensive notes of the morning and afternoon Hadoop World sessions
- The HubSpot team has two posts: Hadoop World 2009 and Hadoop World Impressions
- Hillary Mason wrote up some observations on her blog
- Deepak Singh, who presented on Hadoop in Bioinformatics, gives his perspective on the conference
August 24, 2009
During the last several years, an increasing number of systems within government and industry have been collecting massive amounts of raw data which often sits untapped in large data warehouses. FlightCaster strikes me as a great example of the next generation of web applications that will leverage that data: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about.
From the site:
"FlightCaster predicts flight delays. We use an advanced algorithm that scours data on every domestic flight for the past 10-years and matches it to real-time conditions. We help you evaluate alternative options and help connect you to the right person to make the change."
FlightCaster uses data from:
- Bureau of Transportation Statistics
- FAA Air Traffic Control System Command Center
- National Weather Service
I've been following the data crunching exploits of Bradford Cross on Twitter, and the launch of FlightCaster seemed like a great opportunity for an "in the trenches" interview on building a machine learning application with Rails & Hadoop. During the interview on FlightCaster, Brad describes some of the challenges of working with flight data, statistical approaches for flight prediction, false negatives in FlightCaster, Clojure, Hadoop & Amazon EC2, YCombinator, and lots more.
Of the 9? people at Flightcaster, how did the roles break down? How did you get started on the problem?
People at YC make fun of us a lot on account of our monolithic team.
We have a core team of 5 founders which breaks down as follows; a CEO, an air travel domain expert, two engineers working on web service, apps, and production issues, and me for research.
I have a secret agent collaborator working with me who has been very helpful with research and scalable compute infrastructure.
We also have a few other engineers doing a mix of stuff. This is a speed thing. We wanted to launch an iPhone app, blackberry app, and website all at the same time. We built it all very fast.
We got started on the problem thanks to Evan Konwiser and his peculiar and endearing obsession with the commercial air travel industry.
Did you have someone on the team with domain expertise who was familiar with the flight or weather data sources?
Absolutely, that is Evan Konwiser. He has been instrumental, and not just for familiarity with the data.
The airline industry is a very idiosyncratic industry. It would be hard to learn a lot of the subtle domain logic via induction alone. Our machine learning approach uses a mix of analytical and inductive learning.
What was the biggest challenge in working with the public flight and weather data? Is there a dataset that isn't out there right now that might make predicting flight delays easier for you?
The public data set that we use is the "on-time database" published by the FAA. The data set is tricky to get all in one place since the FAA does not provide any decent API to it. The biggest issue is that we make real time predictions, so we needed a historical set of captured real time data, which we had to create ourselves.
Having a more amalgamated real time dataset going back historically for a decade would be a big help. Having more modernized ways of accessing the data would be helpful.
Until then, if anyone wants to buy it, we will sell it to them for a very high price ... and to sweeten, they must throw in a few obscure, expensive machine learning books that I have on my Amazon wish list.
I've been a big fan of using the combination of Rails, Hadoop, and Amazon EC2 along with a high level language (in your case Clojure). Any tips for people out there thinking of using a similar technology stack? How cost effective is running Hadoop on EC2 for you?
Building layer upon layer of abstraction is a big key. On the jvm, you have to do this, it is the path around the verbosity of Java and the vast abyss of poorly done APIs. You just keep searching until you finally find the folks who have built a sane, high level API on top of the thing you want to use - then you wrap it in a high level language like Clojure. The technical term for this is "wrap the crap."
In our case, we use Cascading as our step up in abstraction on top of Hadoop.
S3 -> EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure. I'm not sure if those layers are exactly the right order, but you get the point. The key is go keep layering until you encapsulate the plumbing and get to the level of abstraction that lets you focus on solving your problem.
Running Hadoop on EC2 has been very cost effective. The biggest issues have come into play with the disconnect between Hadoop and S3. S3 expects open connections to keep reading, and if they don't, S3 terminates them. S3 is very much the Arnold of the distributed file system world. So if your Hadoop jobs are compute intensive, and they are buffering in data in a lazy loading fashion, they tend to lose the connection to S3 during long processing phases. We've worked around this with some hackery, and we are working with Chris Wensel (of Cascading fame) on a more industrial strength solution to the problem.
There are many existing machine learning and statistical computing packages for R, Python, Java, C++, why did you choose to go with Clojure? What were the pros/cons of that approach? Do you use any of those other tools for prototyping or visualization?
Over the years, I have found that, in practice, the statistical and machine learning code is not the big thing to worry about. That code is fun to write, and often you want to tweak and extend the libraries in ways that they have not been designed for. So we use libraries and frameworks for the basics where we can, but we are OK to implement statistical and machine learning algorithms ourselves. I'm quite experienced with this anyway; all the way down to efficient custom data structures built on arrays.
The bigger problem to worry about is in the title of your site; the data wrangling. Especially pre-processing (filtering, transforming, etc) and the general fault-tolerant distributed compute infrastructure. I worked on this sort of thing during my time at Google, and it is far more complex than it seems. It is easy to grasp the concepts, and get an initial implementation, but the edge cases and last mile issues with respect to the fault tolerance are where you take the hit.
Hadoop is a wonderful solution for distributed computation, and since our code is all purely functional, it is very natural for us.
Clojure is ideal for both the data wrangling, as well as the statistical and machine learning code. Also, Clojure plays nice with everything on the jvm so we wrap and use lots of libs from the java world. Put this together with the distributed compute infrastructure that you get with Hadoop, and it starts to make a lot more sense to build these systems in Clojure on the jvm and use Hadoop than it does to use R, python, etc.
That said, if you just need to do quick and dirty prototypes, or don't have the need or the option to invest in infrastructure for distributed computing, R and SciPy are probably still the place to turn.
At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs. We prototype and run in production on the same language and platform, and although it is not as fast as Google's C++ infrastructure, we do have the benefit that Clojure is very high level. For large runs, we parallelize with Hadoop, and we just run smaller tasks locally as if our Hadoop infrastructure were not even there. We are not coupled to it, and yet we don't need to port code or do anything special to run it in distributed mode.
Can you talk more about how you are using Hadoop for Flightcaster? Are most of the jobs I/O intensive, preprocessing a large volume of historical input data, or are they more often cross-validation or simulation runs? Do you tweak your live models then re-train against all the historical data with Hadoop?
Our jobs are CPU intensive - we do a lot of computation per unit of data, even in our data transformation jobs.
We train and test all our models offline, on captured real time data.
Eventually we would like to move to a more incremental online learning approach, but for now we re-train and re-test against newly captured data and re-deploy new the newly trained model.
Any tips for handling bad data/records in large MapReduce jobs?
We've been talking to Chris Wensel and cascading has some really cool stuff for this in the form of "filters" that we are not leveraging yet. We do a lot of filtering and scrubbing of our own during the preprocessing phase, but we will be looking at what cascading can do for us here very soon.
The biggest lesson we have learned about bad data is that we should spend more time up front with visualizing the data and making assertions (or filtering via predicates) to verify that our assumptions about the data hold.
I have had these issues in the past with financial data, but not as bad. Can a flight land before it takes off? Are there 68 hours in a day? These are examples of data that is not malformed or missing, but that violates fundamental properties that you expect to hold true, so they can be trickier to catch.
The big problem with not spending more time on this up front is that these issues are expensive to catch downstream because you will be looking through non-intuitive results of complex analysis jobs and it takes a long time to track down the root of such issues. It is similar to the argument for having good unit tests; it is cheaper to find issues at that level than at the system test level. It is also more sanity-preserving.
You probably can't say much about your secret sauce or algorithms, but can you discuss some general issues you encounter handling conflicting information sources or incomplete data? Do ensemble approaches or online learning play a role?
As I mentioned in a previous response, we want to get into more of an incremental online learning approach, and I think we will be able to soon. Of course this means certain approaches are in and others are out, but that may be OK in our case.
As of right now, we are using a combination of analytical and inductive learning. We compose individual classifiers that are themselves composed of rich domain logic. This is not an ensemble approach, though we may head in that direction very soon.
What we are doing now is more of a network-of-classifiers approach. It is a bit of a strange beast right now, so we would be remiss to call it a Bayesian network. Maybe we could call it a Cardono network in honor of Jerolomo Cardano. It is a bit of an eccentric network that is ahead of its time but waiting for some formalism to come along and straighten it out.
To a large extent, our current approach is an artifact of how quickly we have built our initial model. We have these rich individual domain specific classifiers that are targeted at different features from different data sources, and the approach we are using to learn the composition of these individual classifiers is an area of active research.
What is your general style for attacking prediction problems? Do you start with a single data source and get the entire pipeline running with a simple model, or you dive in building a more complex model with multiple data sources? How did Flightcaster compare to financial prediction?
I like to let simpler cases drive out a lot of the demand for infrastructure early on. As you suggest, I tend to start by getting something working end-to-end with a simple model and a single data source. Then you can start to evolve faster, join in other data sources, and try more deeply theoretical ideas; all on top of a quality infrastructure.
That said, some infrastructure is not required until you get to more complex models, so it is always an evolutionary process where model drives infrastructure.
I made a lot of mistakes early in my career in building trading models where I let me theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.
Flightcaster has been very different from my work in investing in a number of ways. FlightCaster makes predictions by turning the probability distribution estimation problem into a k-way classification problem.
This is a simpler problem than designing holistic investment strategies.
In investing, you have to answer many questions. Selecting the point at which buying and selling occurs is akin to predictive problems in other domains. However, this only answers the question of when to buy or sell.
You also have to decide what to buy or sell - which is called portfolio selection. You also have to decide how much to buy or sell - which is called portfolio allocation. There are also the topics that I like to call risk and exposure accommodation.
The latter subjects of portfolio selection, portfolio allocation, and risk and exposure accommodation are arguably more important than the former subject of timing or predicting entry and exit points. Parameter sensitivity analysis tends to show that return and risk-return metrics are less sensitive to variability in the entry-exit approach as compared with portfolio selection, portfolio allocation, and risk accommodation approaches.
These aspects of financial modeling make it significantly more difficult, and they also seem to largely account for the repeated demise of many so-called "quantitative trading" operations.
Can you say anything about decision points and false alarm rates in travel delay prediction? It seems like there is a big penalty (risk) if somebody misses a flight based on a bad prediction, but a minor annoyance if the flight is delayed and you don't predict it correctly. Will you guys publish some ROC curves at some point?
There are always two key metrics in learning - in IR they are called precision and recall. We seem to be at somewhere around 85% precision and 60% recall. So discovering more delays is where we need to focus, and our false positive delay predictions are not the big issue for us right now.
We compute confusion matrices, and use them to derive precision, recall, and our false positive and false negative rates.
When you turn numerical probability distribution estimation into k-way classification, one side effect is that not all false positives are equal. If we way you will be delayed by over an hour, but you are delayed by 45 minutes, that not bad at all. But if we say you will be delayed by over an hour and you are on time, that is much worse. So there is a distance aspect to false positives.
We don't see a lot of bad false positives now, that is, we don't appear to tell people they will experience a long delay when in fact they will be on time. The bigger issue seems to be our recall - there are a lot of delays that we are just not able to detect yet. Alternatively stated, we have a problem with false negatives in the on time class.
We also have a strong temporal element to all this. How do these numbers change as we get closer or further away from your departure time? We're working on this analysis right now, and we do hope to publish a lot of useful metrics soon.
How was it working with YCombinator?
YCombinator is amazing. The people are all great; both YC founders and the people they invest in. The exposure we have gotten from being part of the program is itself worth the equity stake they take in the company. YC has put together a really fascinating business model. I would like to be an investor in YC.
Trevor Blackwell is a YC founder, friend, and maker of fine quality luxury robots. Actually, his big new thing is telepresence robots. Check out the bots at anybots.com.
Trevor is the one who lured me into working with the Flightcaster team. I hadn't met Paul Graham and his Wife Jessica before YC - they totally rock!
It has been an amazing experience and I would encourage anyone to apply; especially if you want to build something that solves a real problem.
June 11, 2009
I've published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power trendingtopics.org (I'll talk about Trending Topics more in a upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 Million Wikipedia articles (~ 1 TB uncompressed) along with the associated Wikipedia content, linkgraph, & metadata. The english Wikipedia subset contains ~2.5 Million articles.
If you want to work entirely from the command line, you will need to complete the steps in the Getting Started Guide. When you are set up to use EC2, launch a small EC2 Ubuntu instance from your local machine:
$ ec2-run-instances ami-5394733a -k gsg-keypair -z us-east-1a
Once it is running and you have the instance id, create and attach an EBS Volume using the public snapshot snap-753dfc1c (make sure the volume is created in the same availability zone as the ec2 instance)
$ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a $ ec2-attach-volume vol-ec06ea85 -i i-df396cb6 -d /dev/sdf
Next, ssh into the instance and mount the volume
$ ssh firstname.lastname@example.org root@domU-12-xx-xx-xx-75-81:/mnt# mkdir /mnt/wikidata root@domU-12-xx-xx-xx-75-81:/mnt# mount /dev/sdf /mnt/wikidata
See the README files in each subdirectory for more details on these datasets...
The good stuff is sitting in 5000 files in /mnt/wikidata/wikistats/pagecounts/
/mnt/wikidata/wikistats/pagecounts# ls -l | wc -l 5068 /mnt/wikidata/wikistats/pagecounts# ls -lh |head total 260G -rw-r--r-- 1 root root 49M 2009-02-26 13:34 pagecounts-20081001-000000.gz -rw-r--r-- 1 root root 46M 2009-02-26 13:34 pagecounts-20081001-010000.gz -rw-r--r-- 1 root root 47M 2009-02-26 13:34 pagecounts-20081001-020000.gz -rw-r--r-- 1 root root 44M 2009-02-26 13:34 pagecounts-20081001-030000.gz -rw-r--r-- 1 root root 45M 2009-02-26 13:34 pagecounts-20081001-040000.gz -rw-r--r-- 1 root root 47M 2009-02-26 13:35 pagecounts-20081001-050001.gz -rw-r--r-- 1 root root 45M 2009-02-26 13:35 pagecounts-20081001-060000.gz -rw-r--r-- 1 root root 50M 2009-02-26 13:35 pagecounts-20081001-070000.gz -rw-r--r-- 1 root root 51M 2009-02-26 13:35 pagecounts-20081001-080000.gz
This directory contains hourly Wikipedia article traffic logs covering the 7 month period from October 01 2008 to April 30 2009, this data is regularly logged from the wikipedia squid proxy by Domas Mituzas.
Each log file is named with the date and time of collection: pagecounts-20090430-230000.gz
Each line has 4 fields:
projectcode, pagename, pageviews, bytes
en Barack_Obama 997 123091092 en Barack_Obama%27s_first_100_days 8 850127 en Barack_Obama,_Jr 1 144103 en Barack_Obama,_Sr. 37 938821 en Barack_Obama_%22HOPE%22_poster 4 81005 en Barack_Obama_%22Hope%22_poster 5 102081
These files contain all links between proper english language Wikipedia pages, that is pages in "namespace 0". This includes disambiguation pages and redirect pages.
In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is ready for processing by Hadoop:
from1: to11 to12 to13 ... from2: to21 to22 to23 ... ...
where from1 is an integer labelling a page that has links from it, and to11 to12 to13 ... are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.
Contains the raw Wikipedia dumps from March along with some processed versions of the data. One of the files I created provides a direct lookup table for Wikipedia article redirects in page_lookup_redirects.txt, which can be applied to tasks like name standardization and search:
Here is a sample query run when the file is loaded into MySQL:
mysql> select redirect_title, true_title from page_lookups where page_id = 534366; +------------------------------------------------+--------------+ | redirect_title | true_title | +------------------------------------------------+--------------+ | Barack_Obama | Barack Obama | | Barak_Obama | Barack Obama | | 44th_President_of_the_United_States | Barack Obama | | Barach_Obama | Barack Obama | | Senator_Barack_Obama | Barack Obama | ..... ..... | Rocco_Bama | Barack Obama | | Barack_Obama's | Barack Obama | | B._Obama | Barack Obama | +------------------------------------------------+--------------+ 110 rows in set (11.15 sec)
The raw wikipedia dump file latest-pages-articles.xml was also post-processed using xml2sql to produce a set of tab delimited text files for use with Hadoop and other tools :
692M page.txt 115M redirect.txt 987M revision.txt 17G text.txt
the corresponding namespace0 files were created by limiting page.txt and redirect.txt as follows:
# grep '^[0-9]* 0 ' page.txt > page_namespace0.txt # grep '^[0-9]* 0 ' redirect.txt > redirect_namespace0.txt
April 15, 2009
Here is a quick visualization I did in honor of April 15th to investigate what people looking for on tax day...
This "query tree" shows the most frequent searches starting with the term "irs". Each branch in the tree represents a query where the words are sized according to frequency of occurrence. I like how you can see at a glance what the most popular tax forms are by following the "irs tax form ..." branch. Apparently form 8868, Application for Extension of Time To File, is in high demand.
It was created by uploading search queries from AOL users leading to clicks on irs.gov during Spring 2006 to Concentrate, which generated the query tree. This image is a snapshot of an interactive flash visualization in Concentrate, where the focus term was "irs". Looking at query patterns like this can help you get an idea of what people are looking for and how to better organize your site so they can find it quickly.
The raw data is from the released AOL Search data sample, and consists of the subset of unique queries leading to clicks on irs.gov from March to May 2006. The IRS queries used to make the visualization can be downloaded here: irs.gov.queries.csv (191K)
Here are the top 10 queries in the file:
|internal revenue service||1154|
|irs tax forms||196|
|wheres my refund||139|
|federal tax forms||125|
- 02 Apr 2009 » Amazon Elastic MapReduce: A Web Service API for Hadoop
- 12 Feb 2009 » Updated List of Datasets & Video Lectures
- 09 Feb 2009 » Search map: interactive visualization of search query clusters
- 29 Jan 2009 » Conversation with Eric Siegel on Predictive Analytics World
- 21 Nov 2008 » Amazon Web Services Public Datasets
- 09 Apr 2008 » Hidden Video Courses in Math, Science, and Engineering
- 16 Mar 2008 » PyCon 2008 ElasticWulf Slides
- 29 Feb 2008 » Python Montage Code for Displaying Arrays
- 15 Feb 2008 » The Colbert Bump in Amazon Data
- 17 Jan 2008 » Some Datasets Available on the Web
- 06 May 2007 » Google Paper on Parallel EM Algorithm using MapReduce
- 18 Apr 2007 » Amazon EC2 Considered Harmful
- 09 Apr 2007 » MPI Cluster with Python and Amazon EC2 (part 2 of 3)
- 17 Mar 2007 » On-Demand MPI Cluster with Python and EC2 (part 1 of 3)
- 26 Feb 2007 » Netflix Prize Leaderboard Landscape