Python Montage Code for Displaying Arrays

This post shows how to replicate the Matlab montage function in Python. The Data Wrangling blog seems to be getting search traffic from people learning Python and looking for machine learning code, so I’m adding a few basic code snippets that you might find useful. Later posts will include Python examples that use the montage function to visualize pattern recognition and collaborative filtering algorithms.

In the past, I used Matlab for prototyping, but over the last few years I have switched to a combination of numpy, scipy, matplotlib, and ipython. When combined with the appropriate libraries, Python can have better numerical performance than Matlab or Octave, nearly identical functionality, and the additional flexibility of Python when you need to munge some text or expose your algorithm as a web service.

Anyway, let’s get to the problem at hand: replicating the montage function. For this example, I dug up some data from a Sebastian Seung course on neural networks I took in 2005. The .mat files we used are now on Open Courseware. I think these are cropped versions of images from the MNIST database of handwritten digits (more image datasets here).

The raw dataset is stored in an array, where each row vector is a flattened version of a digitized grayscale image. If you select one vector, reshape it into a square array, and display it as an intensity plot, you get something like this:

sample digit array

In grayscale:

sample MNIST digit vector
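Here is a minimal sketch of that reshape-and-display step. The data is a synthetic stand-in (random 16x16 images), not the course files, and the helper names `row_to_image` and `show_image` are made up for illustration. It also assumes each row is stored in row-major order; data exported from Matlab may need a transpose after reshaping.

```python
import numpy as np


def row_to_image(row):
    """Reshape a flattened square image (1-D vector) into a 2-D array."""
    row = np.asarray(row)
    side = int(round(np.sqrt(row.size)))
    if side * side != row.size:
        raise ValueError("row length %d is not a perfect square" % row.size)
    return row.reshape(side, side)


def show_image(image, cmap="gray"):
    """Display a 2-D array as an intensity plot (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt
    plt.imshow(image, cmap=cmap, interpolation="nearest")
    plt.axis("off")
    plt.show()


# Synthetic stand-in for the digit data: 10 flattened 16x16 images, one per row.
rng = np.random.default_rng(0)
data = rng.random((10, 256))

# Pick one row vector and turn it back into a square image.
digit = row_to_image(data[0])
```

Calling `show_image(digit)` draws the grayscale version; passing a different `cmap` (for example `"jet"`) gives a colored intensity plot.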

To display a montage of all the images (sometimes called a contact sheet), we will build a composite array where each submatrix is one of these reshaped rows. We also want to lay out the submatrices so that the result is roughly square, and all the empty elements are filled in with a default value. The end result looks like this:

Montage of MNIST handwritten digit vectors
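The layout described above can be sketched roughly as follows. This is not the code from the full post, just one way to do it under a few assumptions: every row holds a square image, tiles are placed left to right and top to bottom, and the function name `montage` and its `fill` parameter are chosen here for illustration.

```python
import numpy as np


def montage(rows, fill=0.0):
    """Tile flattened square images into one roughly square 2-D array."""
    rows = np.asarray(rows, dtype=float)
    count, pixels = rows.shape
    if count == 0:
        raise ValueError("need at least one image")
    side = int(round(np.sqrt(pixels)))
    if side * side != pixels:
        raise ValueError("each row must hold a square image")

    # Choose a grid that is as close to square as possible.
    cols = int(np.ceil(np.sqrt(count)))
    nrows = int(np.ceil(count / cols))

    # Start with the default value so unused tiles stay filled in.
    canvas = np.full((nrows * side, cols * side), fill, dtype=float)

    for index in range(count):
        r, c = divmod(index, cols)
        tile = rows[index].reshape(side, side)
        canvas[r * side:(r + 1) * side, c * side:(c + 1) * side] = tile
    return canvas
```

The result is an ordinary 2-D array, so it can be passed to matplotlib's `imshow` with a grayscale colormap to get a contact sheet like the one pictured.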

(more…)

The Colbert Bump in Amazon Data

Colbert Pic

Last month, I took a position as Director of Advanced Analytics at Juice. I’m primarily a machine learning guy, so I will be focused on developing custom algorithms for Juice clients as well as building analytics products for a wider audience. My first idea for a post on the Juice blog was to investigate how an appearance on the Colbert Report correlated with Amazon sales rank (we figured political polling data would be too sparse). We investigated a small sample of authors and found some evidence for the Colbert Bump.
(more…)

Some Datasets Available on the Web

The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.

The event which prompted this long overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets and was started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.

Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.
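A script along those lines might look something like the sketch below. It is not the original script. It assumes the export is a Netscape-style bookmark HTML file in which each link is an `<A>` element carrying a comma-separated `TAGS` attribute; the class name `BookmarkParser`, the function `links_with_tag`, and the sample entries are all hypothetical.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs for bookmarks carrying a given tag."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.links = []
        self._href = None   # URL of the matching link being read, if any
        self._title = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        tags = [t.strip() for t in (attrs.get("tags") or "").split(",")]
        if self.wanted_tag in tags and attrs.get("href"):
            self._href = attrs["href"]
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._title).strip(), self._href))
            self._href = None


def links_with_tag(export_html, wanted_tag="dataset"):
    """Return bookmarks in the export text that carry wanted_tag."""
    parser = BookmarkParser(wanted_tag)
    parser.feed(export_html)
    parser.close()
    return parser.links


# Made-up sample in the assumed export format.
sample = """<DL><p>
<DT><A HREF="http://example.com/data" ADD_DATE="1200000000" TAGS="dataset,ml">Example data</A>
<DT><A HREF="http://example.com/blog" ADD_DATE="1200000001" TAGS="python">Example blog</A>
</DL><p>"""

links = links_with_tag(sample)
```

For a real export, read the file into a string and pass it to `links_with_tag`, then print each title and URL pair to build the list.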

This list is static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset

(more…)