17 January 2008

The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I'm giving an EC2 talk at PyCon in March, so I'm really on the hook to wrap up that series of posts now.

The event that prompted this long-overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets, started by reddit cofounder Aaron Swartz. The site apparently grew out of his work on The Open Library.

Over the past year, I've been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and have listed them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won't get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell's post on public data.
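
For anyone who wants to do something similar, here is a minimal sketch of that kind of script. It is not the original code. It assumes the export is the usual Netscape-style bookmark HTML, where each link is an <A> element with HREF and a comma-separated TAGS attribute; the class name, function name, and sample data are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample in the assumed export layout (Netscape bookmark HTML).
SAMPLE = """<DL><p>
<DT><A HREF="http://example.org/data" TAGS="dataset,machinelearning">Example data</A>
<DT><A HREF="http://example.org/blog" TAGS="python">A blog</A>
</DL>"""


class BookmarkParser(HTMLParser):
    """Collect url, title, and tags from every <A> element in the export."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            raw_tags = attrs.get("tags") or ""
            tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
            self._current = {"url": attrs.get("href"), "title": "", "tags": tags}

    def handle_data(self, data):
        # Text inside the <A> element is the bookmark title.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def links_with_tag(export_html, wanted="dataset"):
    """Return the bookmarks whose tag list contains `wanted`."""
    parser = BookmarkParser()
    parser.feed(export_html)
    parser.close()
    return [link for link in parser.links if wanted in link["tags"]]


if __name__ == "__main__":
    for link in links_with_tag(SAMPLE):
        print(link["title"], link["url"], ",".join(link["tags"]))
```

To run it against a real export, read the file into a string and pass it to links_with_tag in place of SAMPLE. If your export uses a different attribute name or tag separator, the parsing in handle_starttag is the part to adjust.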

This list is semi-static, but I'll keep adding links at http://del.icio.us/pskomoroch/dataset

Update (02/10/09): I have around 400 dataset bookmarks now (more than double the count when this post first appeared), so I've updated the list below. Here is a JSON file containing the URLs and tags: delicious_dataset_links.json

Around 85 of these datasets can be redistributed publicly: http://delicious.com/pskomoroch/redistributable+dataset. The rest are mostly free for academic use, but the license conditions vary.

Here are the 230 new datasets bookmarked since Jan 17, 2008:

Datasets listed in the original post on Jan 17, 2008: