Slides & Thoughts from Hadoop World NYC (2009-10-05)
Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC. I gave a talk on rapid prototyping of data-intensive web applications using Hadoop, Hive, Python, and Ruby on Rails. The talk also included a few bits about using R with Hadoop for statistical computing at scale. The sessions were taped, so I'll update this post with a link to the video...
How FlightCaster Squeezes Predictions from Flight Data (2009-08-24)
During the last several years, an increasing number of systems within government and industry have been collecting massive amounts of raw data, which often sits untapped in large data warehouses. FlightCaster strikes me as a great example of the next generation of web applications that will leverage that data: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about. From the site: "FlightCaster predicts flight...
Wikipedia Page Traffic Statistics Dataset (2009-06-11)
I've published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power trendingtopics.org (I'll talk about Trending Topics more in an upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 million Wikipedia articles (~1 TB uncompressed) along with the associated Wikipedia content, link graph, and metadata. The English Wikipedia subset contains ~2.5 million articles. It only takes a couple of minutes...
Post Archive
- 05 Oct 2009 » Slides & Thoughts from Hadoop World NYC
- 24 Aug 2009 » How FlightCaster Squeezes Predictions from Flight Data
- 11 Jun 2009 » Wikipedia Page Traffic Statistics Dataset
- 15 Apr 2009 » Quick Visualization of irs.gov Search Queries
- 02 Apr 2009 » Amazon Elastic MapReduce: A Web Service API for Hadoop
- 12 Feb 2009 » Updated List of Datasets & Video Lectures
- 09 Feb 2009 » Search map: interactive visualization of search query clusters
- 29 Jan 2009 » Conversation with Eric Siegel on Predictive Analytics World
- 21 Nov 2008 » Amazon Web Services Public Datasets
- 09 Apr 2008 » Hidden Video Courses in Math, Science, and Engineering
- 16 Mar 2008 » PyCon 2008 ElasticWulf Slides
- 29 Feb 2008 » Python Montage Code for Displaying Arrays
- 15 Feb 2008 » The Colbert Bump in Amazon Data
- 17 Jan 2008 » Some Datasets Available on the Web
- 06 May 2007 » Google Paper on Parallel EM Algorithm using MapReduce
- 18 Apr 2007 » Amazon EC2 Considered Harmful
- 09 Apr 2007 » MPI Cluster with Python and Amazon EC2 (part 2 of 3)
- 17 Mar 2007 » On-Demand MPI Cluster with Python and EC2 (part 1 of 3)
- 26 Feb 2007 » Netflix Prize Leaderboard Landscape
Talks, Articles, Etc
- 30 Mar 2010 » O'Reilly Where 2.0: Spatial Analytics Workshop
- 02 Apr 2009 » Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
Recent Projects
- TrendingTopics.org Rails application that uses Hadoop on Amazon EC2 to process Wikipedia log files and find trending topics
- EC2Cluster Rails management console and REST web service for submitting jobs and launching MPI clusters on Amazon EC2
