Hidden Video Courses in Math, Science, and Engineering

Over the last few years, a large number of open courseware directories and video lecture aggregators have popped up on the web. These sites often include introductory courses and research seminars, but it can be difficult to find full courses covering advanced topics. For budgetary and copyright reasons, most upper-level and low-enrollment courses are not recorded, or are only offered online for a fee. Many schools provide access-restricted videos of advanced courses to current students, but do not make them available to the wider community. To help remedy this, I have pulled together a big list of advanced courses in math, physics, finance, and computer science whose video lectures are publicly available but seem to have slipped through the cracks, and included them in this post (scroll down to skip to the links).

[Image: Book Burnout at MIT]

What motivated me to pull this together? Like many people who are working full time while taking grad courses, blogging, or burning the midnight oil on a startup, I looked up after a couple of years to find I had gained a bunch of weight and was no longer in the best shape of my life. I had too much to do, and couldn’t tear myself away from coding every day for a couple of hours at the gym. In addition to my gym problem, I had just moved to DC and missed the huge number of courses available in the Boston area. It is difficult to find advanced math and physics courses that fit into a full-time work schedule. Being a geek, my first instinct was to look for a technical solution to non-technical problems.

The approach I came up with was to load an Archos video player with video lectures from the web (an iPhone would probably work just as well). After three months of watching machine learning lectures while on the elliptical machine, I had lost 30 lbs and learned a few things at the same time. The motivation problems for self-study using open courseware videos are a lot like those with working out: you really intend to do something to improve yourself, but you never seem to find the time. Somehow putting the two together and forcing myself to get things done appealed to the part of my brain that seeks extreme efficiency.

[Image: Forcing yourself to learn something]

Most video players now come with Wi-Fi built in, so if you have wireless access at your gym you should be ready to go. If you need to download the videos, then depending on the author's copyright terms you can use mplayer or other Linux utilities to rip the stream and encode it appropriately. Check out my del.icio.us video streaming links for details.
There was a lot of buzz last week about the pace of technology causing bloggers to sacrifice health for work, but this might be a way for technology to actually help improve the situation. You can force yourself to watch some video lectures and get back in shape at the same time…

Enough motivation, on with the links:

Links to Advanced Courses with Complete Video Lectures:


See http://del.icio.us/pskomoroch/video+lectures to find updated links for complete courses. This list is mostly composed of courses I hadn’t seen in other directories, but includes links to some of the better Berkeley, Stanford, and MIT videos as well.

Physics


Some Datasets Available on the Web

The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.

The event that prompted this long-overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets, started by Reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.

Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but the theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.
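For anyone who wants to do the same with their own bookmarks, here is a minimal sketch of that kind of script. It is not the original script. It assumes the export is a Netscape-style bookmark HTML file in which each link is an <A> element with HREF and TAGS attributes (tags separated by commas); the sample export, the example.com URLs, and the names BookmarkParser and links_with_tag are made up for illustration.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect url, title, and tags from a Netscape-style bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # bookmark being read, or None outside <A>...</A>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            raw_tags = attrs.get("tags") or ""
            tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
            self._current = {"url": attrs.get("href") or "", "title": "", "tags": tags}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.bookmarks.append(self._current)
            self._current = None


def links_with_tag(export_html, wanted_tag):
    """Return (title, url) pairs for every bookmark carrying wanted_tag."""
    parser = BookmarkParser()
    parser.feed(export_html)
    parser.close()
    return [(b["title"], b["url"]) for b in parser.bookmarks if wanted_tag in b["tags"]]


# Hypothetical sample export (assumed format, placeholder URLs).
SAMPLE_EXPORT = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
<DT><A HREF="http://example.com/a" TAGS="dataset,finance">Example finance data</A>
<DT><A HREF="http://example.com/b" TAGS="python">Not a dataset</A>
<DT><A HREF="http://example.com/c" TAGS="machinelearning,dataset">Example ML data</A>
</DL><p>"""

for title, url in links_with_tag(SAMPLE_EXPORT, "dataset"):
    print(f"{title}: {url}")
```

To run it on a real export, read the file into a string and pass that to links_with_tag in place of SAMPLE_EXPORT.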

This list is static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset


Google Paper on Parallel EM Algorithm using MapReduce

I hadn’t seen much discussion of this on the web, so I thought I would post the link to this May 2007 paper from Google:

Google News Personalization: Scalable Online Collaborative Filtering

The abstract:

Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

They use the MovieLens dataset as one of the case studies, so there are some possible applications to the Netflix Prize. The part I found interesting was the first detailed description of using the MapReduce model to run large-scale Expectation Maximization (EM) computations in parallel. An implementation of this on Hadoop and Amazon EC2 would let you tackle some large-scale machine learning problems.
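To make the map/reduce shape of EM concrete, here is a toy sketch. It is not the PLSI algorithm from the paper, and it runs in a single process rather than on Hadoop. It fits a 1-D mixture of Gaussians, which I picked only because it is small enough to read in one sitting. The "map" step runs the E-step on one shard of data and emits sufficient statistics, the "reduce" step adds those statistics together, and the M-step turns the totals into new parameters. All function names, the sample shards, and the starting parameters are made up for illustration.

```python
import math
from functools import reduce


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def map_shard(shard, params):
    """E-step on one shard: per-component [sum r, sum r*x, sum r*x^2]."""
    stats = [[0.0, 0.0, 0.0] for _ in params]
    for x in shard:
        weighted = [w * gaussian_pdf(x, m, v) for (w, m, v) in params]
        total = sum(weighted)
        for k, p in enumerate(weighted):
            r = p / total  # responsibility of component k for point x
            stats[k][0] += r
            stats[k][1] += r * x
            stats[k][2] += r * x * x
    return stats


def reduce_stats(a, b):
    """Combine two shards' statistics by element-wise addition."""
    return [[sa + sb for sa, sb in zip(ka, kb)] for ka, kb in zip(a, b)]


def m_step(stats, min_var=1e-6):
    """Turn summed statistics into new (weight, mean, variance) triples."""
    n = sum(s[0] for s in stats)
    params = []
    for r, rx, rxx in stats:
        mean = rx / r
        var = max(rxx / r - mean * mean, min_var)  # floor avoids zero variance
        params.append((r / n, mean, var))
    return params


def em(shards, params, iterations=50):
    for _ in range(iterations):
        # On a cluster, each map_shard call would run on a different machine.
        partials = map(lambda shard: map_shard(shard, params), shards)
        params = m_step(reduce(reduce_stats, partials))
    return params


# Toy data split across three "machines": points near 1.0 and near 5.0.
SHARDS = [[0.9, 1.1, 1.0, 5.0], [4.9, 5.1, 1.05, 5.2], [0.95, 4.8]]
INITIAL = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]  # (weight, mean, variance)

fitted = em(SHARDS, INITIAL)
for weight, mean, var in fitted:
    print(f"weight={weight:.3f} mean={mean:.3f} var={var:.4f}")
```

The reason this parallelizes is that the statistics are plain sums, so only the small per-component totals have to cross the network, never the raw data. This toy version has no guard against a component ending up with no responsibility or against likelihoods underflowing to zero; real code would need both.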