PyCon 2008 ElasticWulf Slides


Here are the ElasticWulf slides from my talk. The video will eventually be posted to the PyCon site.

The cluster management scripts I used to run the EC2 Beowulf cluster are hosted on Google Code:

ElasticWulf Project

I should make the initial check-in by Monday; until then, you can try out the 32-bit images from the old EC2 tutorial.

PyCon highlights for me included Guido popping his head in towards the end of my talk (unless I was so tired from last-minute preparations that I was hallucinating?) and meeting the great essayist Zed Shaw.

The “Birds of a Feather” (BOF) sessions were my favorite part of the conference so far. Tonight, I caught the tail end of an interesting Natural Language Processing discussion. Chris McAvoy talked me into holding a Netflix Prize BOF session, where we exchanged insights about using Python for collaborative filtering. Later that night, my coworker Chris Gemignani organized a Data Visualization session where he did some cool things with Python and Nodebox. We also hung out with Peter Fein from the job search engine JuJu, who pulled together an engaging Distributed Computing BOF. JuJu just released a neat RESTful Python search engine project called GrassyKnoll, which you should check out.

I’ll post more on PyCon when I get back to DC, along with a tutorial on using IPython1 with ElasticWulf.

Oh yeah, the back of the conference shirt featured the xkcd Python comic:

[Image: xkcd Python comic]

Some Datasets Available on the Web

The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.

The event which prompted this long overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets and was started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.

Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but the theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.

This list is static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset
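
For anyone who wants to do the same with their own bookmarks, here is a minimal sketch of that kind of export-filtering script. It is not the script I actually ran. It assumes the export is a Netscape-style bookmarks HTML file in which each link is an `<A>` element with `HREF` and comma-separated `TAGS` attributes; the class name, function name, and sample URLs are made up for illustration.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect url, title, and tags from each <A> element of a bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # bookmark being read, or None when outside an <A>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            raw_tags = attrs.get("tags") or ""
            tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
            self._current = {"url": attrs.get("href"), "title": "", "tags": tags}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.bookmarks.append(self._current)
            self._current = None


def links_with_tag(export_html, wanted_tag):
    """Return (title, url) pairs for bookmarks carrying wanted_tag."""
    parser = BookmarkParser()
    parser.feed(export_html)
    parser.close()
    return [(b["title"], b["url"])
            for b in parser.bookmarks if wanted_tag in b["tags"]]


# Hypothetical sample in the assumed export format.
SAMPLE_EXPORT = """<DL><p>
<DT><A HREF="http://example.com/a" TAGS="dataset,ml">Example A</A>
<DT><A HREF="http://example.com/b" TAGS="python">Example B</A>
</DL><p>"""

for title, url in links_with_tag(SAMPLE_EXPORT, "dataset"):
    print(f"{title}: {url}")
```

If your export uses a different layout (XML or JSON, say), only the parser class needs to change; the tag filter stays the same.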


Google Paper on Parallel EM Algorithm using MapReduce

I hadn’t seen much discussion of this on the web, so I thought I would post the link to this May 2007 paper from Google:

Google News Personalization: Scalable Online Collaborative Filtering

The abstract:

Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptible for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

They use the MovieLens dataset as one of the case studies, so there are some possible applications to the Netflix Prize. The part I found interesting was the first detailed description of using the MapReduce model to run large-scale Expectation Maximization (EM) computations in parallel. An implementation of this on Hadoop and Amazon EC2 would let you tackle some large-scale machine learning problems.
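
To make the map/reduce shape of EM concrete, here is a toy, single-process sketch of one EM iteration for a PLSI-style model over (user, story) clicks. It is my own illustration, not the paper's implementation: there is no Hadoop, no sharding of the model across machines, and every function name and the click data are assumptions. The map step computes the posterior over latent topics for each click (the E-step) and emits it under keys; the reduce step sums the values per key; the M-step renormalizes those sums into new probabilities.

```python
import math
import random
from collections import defaultdict


def random_init(clicks, n_topics, seed=0):
    """Random positive starting values for p(z|u) and p(s|z)."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in clicks})
    stories = sorted({s for _, s in clicks})

    def normalized(keys):
        raw = {k: rng.random() + 0.1 for k in keys}
        norm = sum(raw.values())
        return {k: v / norm for k, v in raw.items()}

    p_z_given_u = {u: list(normalized(range(n_topics)).values()) for u in users}
    p_s_given_z = [normalized(stories) for _ in range(n_topics)]
    return p_z_given_u, p_s_given_z


def map_click(click, p_z_given_u, p_s_given_z, n_topics):
    """Map (E-step): emit the posterior q(z | u, s) for one click under two keys."""
    u, s = click
    weights = [p_z_given_u[u][z] * p_s_given_z[z][s] for z in range(n_topics)]
    total = sum(weights)
    for z, w in enumerate(weights):
        q = w / total
        yield ("user_topic", u, z), q
        yield ("topic_story", z, s), q


def reduce_counts(pairs):
    """Reduce: sum the emitted values for each key."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return sums


def em_iteration(clicks, p_z_given_u, p_s_given_z, n_topics):
    """One EM pass: map over clicks, reduce by key, then renormalize (M-step)."""
    emitted = (pair for click in clicks
               for pair in map_click(click, p_z_given_u, p_s_given_z, n_topics))
    sums = reduce_counts(emitted)

    new_p_z_given_u = {}
    for u in p_z_given_u:
        row = [sums[("user_topic", u, z)] for z in range(n_topics)]
        norm = sum(row)
        new_p_z_given_u[u] = [x / norm for x in row]

    new_p_s_given_z = []
    for z in range(n_topics):
        row = {s: sums[("topic_story", z, s)] for s in p_s_given_z[z]}
        norm = sum(row.values())
        new_p_s_given_z.append({s: x / norm for s, x in row.items()})
    return new_p_z_given_u, new_p_s_given_z


def log_likelihood(clicks, p_z_given_u, p_s_given_z, n_topics):
    """Sum over clicks of log p(s | u) under the current parameters."""
    return sum(
        math.log(sum(p_z_given_u[u][z] * p_s_given_z[z][s] for z in range(n_topics)))
        for u, s in clicks
    )


# Made-up click log: (user, story) pairs.
CLICKS = [("u1", "a"), ("u1", "b"), ("u2", "b"),
          ("u2", "c"), ("u3", "a"), ("u3", "c")]

p_zu, p_sz = random_init(CLICKS, n_topics=2)
for step in range(5):
    p_zu, p_sz = em_iteration(CLICKS, p_zu, p_sz, n_topics=2)
    print(step, log_likelihood(CLICKS, p_zu, p_sz, 2))
```

Because each click is processed independently in the map step and the reduce step is a plain sum per key, the same structure should carry over to a real MapReduce job. The hard part at scale is getting the current model parameters to the mappers, which this sketch sidesteps by keeping everything in memory.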