Hidden Video Courses in Math, Science, and Engineering

Over the last few years, a large number of open courseware directories and video lecture aggregators have popped up on the web. These sites often include introductory courses and research seminars, but it can be difficult to find full courses covering advanced topics. For budgetary and copyright reasons, most upper-level and low-enrollment courses are not recorded, or are only offered online for a fee. Many schools provide access-restricted videos of advanced courses to current students, but do not make them available to the wider community. To help remedy this, I have pulled together a big list of advanced courses in math, physics, finance, and computer science whose video lectures are publicly available but seem to have slipped through the cracks (scroll down to skip to the links).

Book Burnout at MIT

What motivated me to pull this together? Like many people who are working full time while taking grad courses, blogging, or burning the midnight oil on a startup, I looked up after a couple of years to find I had gained a bunch of weight and was no longer in the best shape of my life. I had too much to do, and couldn’t tear myself away from coding for a couple of hours at the gym every day. In addition to my gym problem, I had just moved to DC and missed the huge number of courses available in the Boston area. It is difficult to find advanced math and physics courses that fit into a full-time work schedule. Being a geek, my first instinct was to look for a technical solution to non-technical problems.

The approach I came up with was to load an Archos video player with video lectures from the web (an iPhone would probably work just as well). After three months of watching machine learning lectures while on the elliptical machine, I had lost 30 lbs and learned a few things at the same time. The motivation problems for self-study using open courseware videos are a lot like those with working out: you really intend to do something to improve yourself, but you never seem to find the time. Somehow putting the two together and forcing myself to get things done appealed to the part of my brain that seeks extreme efficiency.

forcing yourself to learn something
Most video players now come with Wi-Fi built in, so if you have wireless access at your gym you should be ready to go. If you need to download the videos and the author’s license allows it, you can use mplayer or other Linux utilities to rip the stream and encode it appropriately. Check out my del.icio.us video streaming links for details.
There was a lot of buzz last week about the pace of technology causing bloggers to sacrifice health for work, but this might be a way for technology to actually help improve the situation. You can force yourself to watch some video lectures and get back in shape at the same time…

Enough motivation, on with the links:

Links to Advanced Courses with Complete Video Lectures:


See http://del.icio.us/pskomoroch/video+lectures for updated links to complete courses. This list is mostly composed of courses I hadn’t seen in other directories, but it includes links to some of the better Berkeley, Stanford, and MIT videos as well.

Physics

PyCon 2008 ElasticWulf Slides

Here are the ElasticWulf slides from my talk. The video will eventually be posted to the PyCon site.

The cluster management scripts I used to run the EC2 Beowulf cluster are hosted on Google Code:

ElasticWulf Project

I should make the initial check-in by Monday. Until then, you can try out the 32-bit images from the old EC2 tutorial.

PyCon highlights for me included Guido popping his head in towards the end of my talk (unless I was so tired from last-minute preparations that I was hallucinating?) and meeting the great essayist Zed Shaw. The “Birds of a Feather” (BOF) sessions were my favorite part of the conference so far. Tonight, I caught the tail end of an interesting Natural Language Processing discussion. Chris McAvoy talked me into holding a Netflix Prize BOF session where we exchanged insights about using Python for collaborative filtering. Later that night, my coworker Chris Gemignani organized a Data Visualization session where he did some cool things with Python and Nodebox. We also hung out with Peter Fein from the job search engine JuJu, who pulled together an engaging Distributed Computing BOF. JuJu just released a neat RESTful Python search engine project called GrassyKnoll, which you should check out. I’ll post more on PyCon when I get back to DC, along with a tutorial on using IPython1 with ElasticWulf.

Oh yeah, the back of the conference shirt featured the xkcd Python comic:

python

Python Montage Code for Displaying Arrays

This post will show how to replicate the Matlab montage function using Python. The Data Wrangling blog seems to be getting search traffic from people learning Python and looking for machine learning code, so I’m adding a few basic code snippets that you might find useful. Later posts will include Python examples that use the montage function to visualize pattern recognition and collaborative filtering algorithms.

In the past, I used Matlab for prototyping, but over the last few years I have switched to a combination of NumPy, SciPy, matplotlib, and IPython. When combined with the appropriate libraries, Python can outperform Matlab or Octave on numerical work, offers nearly identical functionality, and adds the flexibility of a general-purpose language when you need to munge some text or expose your algorithm as a web service.

Anyway, let’s get to the problem at hand… replicating the montage function. For this example, I dug up some data from a Sebastian Seung course on neural networks I took in 2005. The MAT-files we used are now on OpenCourseWare. I think these are cropped versions of images from the MNIST database of handwritten digits (more image datasets here).

The raw dataset is stored in an array, where each row vector is a flattened version of a digitized grayscale image. If you select one vector, reshape it into a square array, and display it as an intensity plot, you get something like this:

sample digit array

In grayscale:

sample MNIST digit vector

To display a montage of all the images (sometimes called a contact sheet), we will build a composite array where each submatrix is one of these reshaped rows. We also want to lay out the submatrices so that the result is roughly square, and all the empty elements are filled in with a default value. The end result looks like this:

Montage of MNIST handwritten digit vectors
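
The layout described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the full post: the function name `montage`, the `fill` parameter, and the synthetic `demo` array are all made up for the example, and every row is assumed to flatten a square image.

```python
import math
import numpy as np

def montage(rows, fill=0.0):
    """Tile flattened square images (one per row) into a roughly square sheet."""
    rows = np.asarray(rows, dtype=float)
    count, length = rows.shape
    side = math.isqrt(length)               # edge length of each small image
    if side * side != length:
        raise ValueError("each row must be a flattened square image")
    cols = math.ceil(math.sqrt(count))      # tiles per montage row
    tile_rows = math.ceil(count / cols)     # montage rows needed to hold them all
    # Start from the default value so unused tiles stay filled in.
    sheet = np.full((tile_rows * side, cols * side), fill)
    for k, row in enumerate(rows):
        r, c = divmod(k, cols)
        sheet[r * side:(r + 1) * side, c * side:(c + 1) * side] = row.reshape(side, side)
    return sheet

# Synthetic stand-in for the digit data: five flattened 4x4 "images".
demo = np.arange(5 * 16, dtype=float).reshape(5, 16)
sheet = montage(demo, fill=-1.0)
print(sheet.shape)
```

To look at the result, something like `matplotlib.pyplot.imshow(sheet, cmap="gray")` should give a grayscale contact sheet similar to the one pictured.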

The Colbert Bump in Amazon Data

Colbert Pic

Last month, I took a position as Director of Advanced Analytics at Juice. I’m primarily a machine learning guy, so I will be focused on developing custom algorithms for Juice clients as well as building analytics products for a wider audience. My first idea for a post on the Juice blog was to investigate how an appearance on the Colbert Report correlated with Amazon sales rank (we figured political polling data would be too sparse). We investigated a small sample of authors and found some evidence for the Colbert Bump.

Some Datasets Available on the Web

The Data Wrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.

The event which prompted this long overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets and was started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.

Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and listed them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.
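
For anyone who wants to do the same with their own bookmarks, here is a rough sketch of that kind of script. It is not the script I ran; it assumes the export is a Netscape-style bookmarks HTML file in which each link carries a comma-separated `TAGS` attribute, and the class name `TaggedLinkParser` and the `example.org` entries are placeholders invented for the example.

```python
from html.parser import HTMLParser

class TaggedLinkParser(HTMLParser):
    """Collect (url, title) pairs for bookmarks that carry a given tag."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.links = []
        self._href = None      # set while inside a matching <a> element
        self._title = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        attrs = dict(attrs)
        tags = [t.strip() for t in (attrs.get("tags") or "").split(",")]
        if self.wanted_tag in tags:
            self._href = attrs.get("href")
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._title).strip()))
            self._href = None

# Made-up export fragment; the real file format is an assumption.
sample_export = """
<DL><p>
<DT><A HREF="http://example.org/digits" TAGS="dataset,machinelearning">Digit images</A>
<DT><A HREF="http://example.org/recipes" TAGS="cooking">Recipes</A>
<DT><A HREF="http://example.org/census" TAGS="government,dataset">Census tables</A>
</DL><p>
"""

parser = TaggedLinkParser("dataset")
parser.feed(sample_export)
for url, title in parser.links:
    print(f"{title}: {url}")
```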

This list is static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset

Google Paper on Parallel EM Algorithm using MapReduce

I hadn’t seen much discussion of this on the web, so I thought I would post the link to this May 2007 paper from Google:

Google News Personalization: Scalable Online Collaborative Filtering

The abstract:

Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptible for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

They use the MovieLens dataset as one of the case studies, so there are some possible applications to the Netflix Prize. The part I found interesting was the first detailed description of using the MapReduce model to run large-scale Expectation Maximization (EM) computations in parallel. An implementation of this on Hadoop and Amazon EC2 would let you tackle some large-scale machine learning problems.
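
To make the idea concrete, here is a toy sketch of how EM splits into map and reduce phases. It does not reproduce the paper's PLSI model or its sharding scheme; it swaps in a two-component 1-D Gaussian mixture because the sufficient statistics are easy to read. The function names (`e_step_map`, `combine`, `m_step`) and the synthetic data are my own assumptions, and the "cluster" is just a list of array shards in one process.

```python
from functools import reduce
import numpy as np

def e_step_map(shard, weights, means, variances):
    """Map: compute responsibilities on one shard, emit per-component sums."""
    x = shard[:, None]
    dens = weights * np.exp(-0.5 * (x - means) ** 2 / variances) \
        / np.sqrt(2 * np.pi * variances)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # Sufficient statistics: sum(resp), sum(resp * x), sum(resp * x^2).
    return np.stack([resp.sum(axis=0),
                     (resp * x).sum(axis=0),
                     (resp * x ** 2).sum(axis=0)])

def combine(a, b):
    """Reduce: sufficient statistics from different shards simply add."""
    return a + b

def m_step(stats):
    """Update the parameters from the globally summed statistics."""
    n_k, s1, s2 = stats
    means = s1 / n_k
    return n_k / n_k.sum(), means, s2 / n_k - means ** 2

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])
shards = np.array_split(data, 4)   # stand-in for data spread across workers

init = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
params = init
for _ in range(50):
    partials = [e_step_map(s, *params) for s in shards]   # map phase
    params = m_step(reduce(combine, partials))            # reduce, then M-step
weights, means, variances = params
print(weights, means, variances)
```

The point of the sketch is that the E-step only needs the current parameters and a local slice of the data, so each shard can be processed independently; only the small statistics arrays have to be shipped to the reducer.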

Amazon EC2 Considered Harmful

“The TruckNumber is the size of the smallest set of people in a project such that, if all of them got hit by a truck, the project would be in trouble.” - Portland Pattern Repository

bigbus

I’m taking an “Introduction to Beowulf Design” course this week from the Georgetown University Advanced Research Computing (ARC) division. The class definitely hasn’t been boring. By a strange coincidence, it turns out that the guy sitting next to me is Mike Cariaso, an MPIBlast developer who I have been corresponding with this month in some nodalpoint posts. The course gave us an opportunity to hash out some details around running MPI on EC2. He had just booted up a 10 node Amazon EC2 cluster with MPIBlast when a bus crashed into our building…

MPI Cluster with Python and Amazon EC2 (part 2 of 3)

Today I posted a public AMI which can be used to run a small Beowulf cluster on Amazon EC2 and do some parallel computations with C, Fortran, or Python. If you prefer another language (Java, Ruby, etc.), just install the appropriate MPI library and rebundle the EC2 image. The following set of Python scripts automates the launch and configuration of an MPI cluster on EC2 (currently limited to 20 nodes while EC2 is in beta):

Update (3-19-08): Code for running a cluster with large or xlarge 64-bit EC2 instances is now hosted on Google Code. The new images include NFS, Ganglia, IPython1, and other useful Python packages.

http://code.google.com/p/elasticwulf/

Update (7-24-07): I’ve made some important bug fixes to the scripts to address issues mentioned in the comments. See the README file for details.

The file contains some quick scripts I threw together using the AWS Python example code. This is the approach I’m using to bootstrap an MPI cluster until one of the major Linux cluster distros is ported to run on EC2. Details on what is included in the public AMI were covered in Part 1 of the tutorial; Part 3 will cover cluster operation on EC2 in more detail and show how to use Python to carry out some neat parallel computations.

The cluster launch process is pretty simple once you have an Amazon EC2 account and keys: just download the Python scripts and you can be running a compute cluster in a few minutes. In a later post I will look at cluster bandwidth and performance in detail. If you have only an occasional need for running large jobs, $2/hour for a 20-node MPI cluster on EC2 is not a bad deal considering the ~$20K price for building your own comparable system.
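
As a back-of-the-envelope check on those numbers, the arithmetic looks like this. Only the $2/hour, 20-node, and ~$20K figures come from the text above; the comparison deliberately ignores power, cooling, administration, storage, and bandwidth charges, so treat the break-even point as a rough guide rather than a real cost model.

```python
CLUSTER_RATE_PER_HOUR = 2.00     # 20-node EC2 cluster, figure quoted above
NODES = 20
OWNED_CLUSTER_COST = 20_000.00   # rough price of a comparable system, quoted above

# Hourly rate for a single node.
rate_per_node_hour = CLUSTER_RATE_PER_HOUR / NODES

# Hours of full-cluster use before renting costs as much as buying.
break_even_hours = OWNED_CLUSTER_COST / CLUSTER_RATE_PER_HOUR
break_even_days = break_even_hours / 24

print(f"per node-hour: ${rate_per_node_hour:.2f}")
print(f"break-even: {break_even_hours:.0f} cluster-hours "
      f"(about {break_even_days:.0f} days of continuous use)")
```

In other words, you would have to keep the rented cluster busy around the clock for more than a year before the hardware purchase price alone caught up.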
