Updated List of Datasets & Video Lectures

New Datasets

It’s spring cleaning time at Data Wrangling. I’ve bookmarked 230 new datasets since publishing my first dataset linkdump in January 2008, so at the request of @mrflip, I’ve appended them to the original post along with a JSON dump of the tagged links. Flip and the other Infochimps will be pulling anything they might have missed into the infochimps.org dataset repository.

You can check out the new list of datasets at the same URL:
“Some Datasets Available on the Web”

Around 85 of these datasets can be redistributed publicly: http://delicious.com/pskomoroch/redistributable+dataset. The rest are mostly free for academic use, but the license conditions vary. Some appear to adhere to the Open Definition terms (http://opendefinition.org/).
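For readers who want to pull the redistributable subset out of the JSON dump themselves, here is a minimal sketch. The record layout (a list of objects with "url" and "tags" fields) and the example.com URLs are assumptions made for illustration, not the actual format of the dump, so adjust the field names to whatever the file really contains.

```python
import json

def filter_by_tags(bookmarks, required_tags):
    """Return the bookmarks that carry every tag in required_tags."""
    required = set(required_tags)
    return [b for b in bookmarks if required <= set(b.get("tags", []))]

# Hypothetical sample standing in for the real dump; the schema is assumed.
sample_dump = json.dumps([
    {"url": "http://example.com/a", "tags": ["dataset", "redistributable"]},
    {"url": "http://example.com/b", "tags": ["dataset", "academic"]},
    {"url": "http://example.com/c", "tags": ["video", "lecture"]},
])

bookmarks = json.loads(sample_dump)
redistributable = filter_by_tags(bookmarks, ["redistributable", "dataset"])
redistributable_urls = [b["url"] for b in redistributable]
print(redistributable_urls)
```

This mirrors the tag intersection in the delicious URL above, where "redistributable+dataset" selects links carrying both tags.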

New Video Courses

In addition to the datasets, I’ve bookmarked 20 new video courses since the original video lecture post was published in April 2008. These are mostly graduate and advanced undergraduate courses in Physics, Mathematics, and Computer Science. Among these are full video courses in Parallel Programming, Loop Quantum Gravity, Machine Learning, Financial Markets, and other fun subjects.

The new videos have been added to the post:
“Hidden Video Courses in Math, Science, and Engineering”

Videos of Talks & Seminars

As an added bonus, here is a completely unorganized list of interesting programming, machine learning, and visualization talks which caught my eye in 2008:

(more…)

Search map: interactive visualization of search query clusters

Last month, our team at Juice launched a Django web analytics app called Concentrate that ingests search queries from sources like Google Analytics or Hitwise, then enhances this raw data by discovering common query patterns, generating segmented reports, and offering visual interfaces for data exploration. A couple of weeks ago, Jeff Barr wrote on the AWS blog about the technology stack we used to build the app itself. I’ll provide some more detail on that topic later this week. This post will give a basic description of Concentrate’s pattern discovery algorithm and show it in action.

The following mashup provides a visual interface for exploring search patterns used by readers of the Data Wrangling blog by combining output from concentrateme.com with the Google AJAX search API. Each bubble in the visualization below represents a search query typed into Google during the last 2 months that led to clicks on this site (~2000 unique queries, ~3400 searches). The size of each bubble represents the number of visitors referred by that particular query, and the bubbles are colored by query cluster, where clusters are based on phrase pattern structure (’python [x]’, ’[x] video’, etc.). The search results below the chart are highlighted in yellow if they lead to datawrangling.com pages, which allows you to see at a glance where the site ranks for each query.
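To make the idea of clustering by phrase pattern concrete, here is a rough sketch in Python. It is not Concentrate’s actual algorithm: it assumes a much simpler rule in which each query proposes two candidate patterns (its first word followed by [x], and [x] followed by its last word), patterns shared by at least two queries count as clusters, and each query joins its most-visited qualifying pattern. The function names, the min_queries threshold, and the query counts are all hypothetical.

```python
from collections import defaultdict

def candidate_patterns(query):
    """Propose 'first-word [x]' and '[x] last-word' patterns for a query."""
    words = query.lower().split()
    if len(words) < 2:
        return []
    return [f"{words[0]} [x]", f"[x] {words[-1]}"]

def cluster_queries(query_visits, min_queries=2):
    """Group queries by their most-visited pattern shared with other queries."""
    support = defaultdict(int)  # distinct queries matching each pattern
    weight = defaultdict(int)   # total visits across those queries
    for query, visits in query_visits.items():
        for pattern in candidate_patterns(query):
            support[pattern] += 1
            weight[pattern] += visits

    clusters = defaultdict(list)
    for query, visits in query_visits.items():
        frequent = [p for p in candidate_patterns(query)
                    if support[p] >= min_queries]
        label = max(frequent, key=lambda p: weight[p]) if frequent else "other"
        # The visit count is what would drive the bubble size in the chart.
        clusters[label].append((query, visits))
    return dict(clusters)

# Made-up visit counts, for illustration only.
sample_visits = {
    "python datasets": 40,
    "python video": 25,
    "machine learning video": 30,
    "physics video": 12,
    "data wrangling": 15,
}

clusters = cluster_queries(sample_visits)
for label, members in clusters.items():
    print(label, members)
```

In this toy data, "python video" matches both ’python [x]’ and ’[x] video’; the sketch assigns it to whichever pattern accounts for more total visits, which is one of several reasonable tie-breaking choices.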

Search map of queries leading to clicks on datawrangling.com

Click to open the query browser in a new window, then mouse over a query bubble and click to update the search results.

Interactive Search Query Map

(more…)