Wikipedia Page Traffic Statistics Dataset

I’ve published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power trendingtopics.org (I’ll talk about Trending Topics more in an upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 million Wikipedia articles (~1 TB uncompressed) along with the associated Wikipedia content, linkgraph, and metadata. The English Wikipedia subset contains ~2.5 million articles.

It only takes a couple of minutes to sign up for an Amazon EC2 account and set up access to the data as an EBS volume from the Amazon Management Console.

If you want to work entirely from the command line, you will need to complete the steps in the Getting Started Guide. When you are set up to use EC2, launch a small EC2 Ubuntu instance from your local machine:

    $ ec2-run-instances ami-5394733a -k gsg-keypair -z us-east-1a

Once it is running and you have the instance id, create and attach an EBS volume using the public snapshot snap-753dfc1c (make sure the volume is created in the same availability zone as the EC2 instance):

    $ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a
    $ ec2-attach-volume vol-ec06ea85 -i i-df396cb6 -d /dev/sdf

Next, ssh into the instance and mount the volume:

    $ ssh root@ec2-12-xx-xx-xx.z-1.compute-1.amazonaws.com
    root@domU-12-xx-xx-xx-75-81:/mnt# mkdir /mnt/wikidata
    root@domU-12-xx-xx-xx-75-81:/mnt# mount /dev/sdf /mnt/wikidata

See the README files in each subdirectory for more details on these datasets…

Wikistats

The good stuff is sitting in about 5,000 files in /mnt/wikidata/wikistats/pagecounts/:

    /mnt/wikidata/wikistats/pagecounts# ls -l | wc -l
    5068
    /mnt/wikidata/wikistats/pagecounts# ls -lh |head
    total 260G
    -rw-r--r-- 1 root root  49M 2009-02-26 13:34 pagecounts-20081001-000000.gz
    -rw-r--r-- 1 root root  46M 2009-02-26 13:34 pagecounts-20081001-010000.gz
    -rw-r--r-- 1 root root  47M 2009-02-26 13:34 pagecounts-20081001-020000.gz
    -rw-r--r-- 1 root root  44M 2009-02-26 13:34 pagecounts-20081001-030000.gz
    -rw-r--r-- 1 root root  45M 2009-02-26 13:34 pagecounts-20081001-040000.gz
    -rw-r--r-- 1 root root  47M 2009-02-26 13:35 pagecounts-20081001-050001.gz
    -rw-r--r-- 1 root root  45M 2009-02-26 13:35 pagecounts-20081001-060000.gz
    -rw-r--r-- 1 root root  50M 2009-02-26 13:35 pagecounts-20081001-070000.gz
    -rw-r--r-- 1 root root  51M 2009-02-26 13:35 pagecounts-20081001-080000.gz

This directory contains hourly Wikipedia article traffic logs covering the 7-month period from October 01 2008 to April 30 2009. The data is regularly logged from the Wikipedia squid proxy by Domas Mituzas.

Each log file is named with the date and time of collection: pagecounts-20090430-230000.gz
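If you want to line the hourly files up on a time axis, the timestamp can be pulled straight out of the file name. Here is a minimal Python sketch of that idea; the function name is my own for illustration and is not part of the dataset or of any tool shipped with it. Note that it parses seconds as well, since a few files in the listing above end in values like 050001.

```python
from datetime import datetime


def parse_pagecounts_timestamp(filename):
    """Return the collection time encoded in a pagecounts file name."""
    # Drop any directory prefix and the .gz suffix.
    base = filename.rsplit("/", 1)[-1]
    if base.endswith(".gz"):
        base = base[:-3]
    # Expected layout: pagecounts-YYYYMMDD-HHMMSS
    prefix, date_part, time_part = base.split("-")
    if prefix != "pagecounts":
        raise ValueError("not a pagecounts file: %r" % filename)
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


print(parse_pagecounts_timestamp("pagecounts-20090430-230000.gz"))
# prints 2009-04-30 23:00:00
```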

Each line has 4 space-separated fields (projectcode, pagename, pageviews, bytes):

    en Barack_Obama 997 123091092
    en Barack_Obama%27s_first_100_days 8 850127
    en Barack_Obama,_Jr 1 144103
    en Barack_Obama,_Sr. 37 938821
    en Barack_Obama_%22HOPE%22_poster 4 81005
    en Barack_Obama_%22Hope%22_poster 5 102081
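To make the record layout concrete, here is a rough Python sketch that parses these lines and totals pageviews per article for one project. It is an illustration only, not the code behind Trending Topics; the function names are my own, and it assumes the four space-separated fields shown above, with page names percent-encoded as in the sample.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def parse_line(line):
    """Split one pagecounts line into (project, title, pageviews, bytes)."""
    project, title, views, size = line.rstrip("\n").split(" ")
    return project, unquote(title), int(views), int(size)


def count_views(lines, project="en"):
    """Sum pageviews per decoded title for a single project code."""
    counts = Counter()
    for line in lines:
        try:
            proj, title, views, _ = parse_line(line)
        except ValueError:
            continue  # skip lines that do not have four well-formed fields
        if proj == project:
            counts[title] += views
    return counts


def count_views_in_file(path, project="en"):
    """Same as count_views, reading one gzipped hourly log file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_views(handle, project)


sample = [
    "en Barack_Obama 997 123091092",
    "en Barack_Obama%27s_first_100_days 8 850127",
    "en Barack_Obama,_Sr. 37 938821",
]
print(count_views(sample).most_common(2))
# prints [('Barack_Obama', 997), ('Barack_Obama,_Sr.', 37)]
```

Decoding the title before counting means that differently encoded spellings of the same page name end up under one key.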

Wikilinks (1.1G)

Contains a wikipedia linkgraph dataset provided by Henry Haselgrove.

These files contain all links between proper English-language Wikipedia pages, that is, pages in “namespace 0”. This includes disambiguation pages and redirect pages.

In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is ready for processing by Hadoop:

    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...
    ...

where from1 is an integer labelling a page that has links from it, and to11 to12 to13 … are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.
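As a small illustration of that format, the Python sketch below parses one line of links-simple-sorted.txt and resolves the integer labels to titles. The helper names are mine, the three titles are placeholders rather than real dataset content, and the lookup assumes the "n-th line" is counted from 1.

```python
def parse_links_line(line):
    """Parse 'from: to1 to2 ...' into (from_id, [to_ids])."""
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]


def load_titles(lines):
    """Return a list where index n holds the title on the n-th line (1-based)."""
    titles = [None]  # pad index 0 so that titles[n] is the n-th line
    titles.extend(line.rstrip("\n") for line in lines)
    return titles


# Placeholder titles standing in for the contents of titles-sorted.txt.
titles = load_titles(["Alpha\n", "Beta\n", "Gamma\n"])
src, dsts = parse_links_line("1: 2 3")
print(titles[src], "->", [titles[d] for d in dsts])
# prints Alpha -> ['Beta', 'Gamma']
```

For the real file you would pass an open file handle for titles-sorted.txt to load_titles; keeping every title in one list trades memory for constant-time lookups.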

Wikidump (29G)

Contains the raw Wikipedia dumps from March along with some processed versions of the data. One of the useful files I created, page_lookup_redirects.txt, provides a direct lookup table for Wikipedia article redirects, which can be useful for name standardization and search.

Here is a sample query run when the file is loaded into MySQL:

   mysql> select redirect_title, true_title from page_lookups
               where page_id = 534366;
   +------------------------------------------------+--------------+
   | redirect_title                                 | true_title   |
   +------------------------------------------------+--------------+
   | Barack_Obama                                   | Barack Obama |
   | Barak_Obama                                    | Barack Obama |
   | 44th_President_of_the_United_States            | Barack Obama |
   | Barach_Obama                                   | Barack Obama |
   | Senator_Barack_Obama                           | Barack Obama | 
                          .....                           .....         

   | Rocco_Bama                                     | Barack Obama |
   | Barack_Obama's                                 | Barack Obama | 
   | B._Obama                                       | Barack Obama |
   +------------------------------------------------+--------------+
   110 rows in set (11.15 sec)    

The raw Wikipedia dump file latest-pages-articles.xml was also post-processed using xml2sql to produce a set of tab-delimited text files for use with Hadoop and other tools:

    692M page.txt
    115M redirect.txt
    987M revision.txt
    17G  text.txt

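The corresponding namespace0 files were created by limiting page.txt and redirect.txt to rows whose second tab-delimited field is 0, as follows (the bash $'...' quoting expands \t to a literal tab):

    # grep $'^[0-9]*\t0\t' page.txt > page_namespace0.txt
    # grep $'^[0-9]*\t0\t' redirect.txt > redirect_namespace0.txt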
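If you would rather do the same filtering inside a Python job, the sketch below is roughly equivalent to the grep commands above. It assumes what the grep pattern implies: a numeric id in the first tab-delimited column and the namespace in the second. The function name and the sample rows are made up for illustration.

```python
def namespace0_rows(lines):
    """Yield tab-delimited rows whose second column (the namespace) is 0."""
    for line in lines:
        fields = line.split("\t")
        # Require an id, a namespace of exactly "0", and at least one more field.
        if len(fields) > 2 and fields[0].isdigit() and fields[1] == "0":
            yield line


# Hypothetical rows in the same id<TAB>namespace<TAB>title layout.
rows = ["12\t0\tAnarchism\n", "13\t1\tTalk_page\n", "14\t0\tAutism\n"]
for row in namespace0_rows(rows):
    print(row, end="")
```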

Quick Visualization of irs.gov Search Queries

Here is a quick visualization I did in honor of April 15th to investigate what people were looking for on tax day…

This “query tree” shows the most frequent searches starting with the term “irs”. Each branch in the tree represents a query where the words are sized according to frequency of occurrence. I like how you can see at a glance what the most popular tax forms are by following the “irs tax form …” branch. Apparently form 8868, Application for Extension of Time To File, is in high demand.

It was created by uploading search queries from AOL users leading to clicks on irs.gov during Spring 2006 to Concentrate, which generated the query tree. This image is a snapshot of an interactive Flash visualization in Concentrate, where the focus term was “irs”. Looking at query patterns like this can help you get an idea of what people are looking for and how to better organize your site so they can find it quickly.

The interactive Flash visualization was developed by Chris Gemignani using Flare, with some input from Zach Gemignani and myself and inspiration from the Many Eyes WordTree.

The raw data is from the released AOL Search data sample, and consists of the subset of unique queries leading to clicks on irs.gov from March to May 2006. The IRS queries used to make the visualization can be downloaded here: irs.gov.queries.csv (191K)

Here are the top 12 queries in the file:

    Query                     Searches
    irs                           4787
    irs.gov                       2282
    www.irs.gov                   1975
    internal revenue service      1154
    irs forms                      608
    tax forms                      361
    irs tax forms                  196
    internal revenue               158
    taxes                          142
    wheres my refund               139
    federal tax forms              125
    irs refunds                    106
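To give a feel for the data structure behind a query tree, here is a toy Python sketch that nests queries starting with a focus term into a word-level prefix tree, with each node weighted by the number of searches passing through it. This is my own illustration and not how Concentrate or the Flare visualization is implemented; the sample counts are taken from the table above.

```python
def new_node():
    return {"count": 0, "children": {}}


def build_query_tree(query_counts, focus):
    """Nest queries whose first word is `focus` into a word-level prefix tree."""
    root = new_node()
    for query, count in query_counts:
        words = query.split()
        if not words or words[0] != focus:
            continue  # only queries starting with the focus term are kept
        node = root
        node["count"] += count
        for word in words[1:]:
            node = node["children"].setdefault(word, new_node())
            node["count"] += count
    return root


def print_tree(node, label, depth=0):
    """Print the tree with heavier branches first."""
    print("  " * depth + "%s (%d)" % (label, node["count"]))
    ordered = sorted(node["children"].items(), key=lambda kv: -kv[1]["count"])
    for word, child in ordered:
        print_tree(child, word, depth + 1)


sample = [
    ("irs", 4787),
    ("irs.gov", 2282),
    ("irs forms", 608),
    ("tax forms", 361),
    ("irs tax forms", 196),
    ("irs refunds", 106),
]
tree = build_query_tree(sample, "irs")
print_tree(tree, "irs")
```

In a visualization, the count on each node would drive the font size of the word on that branch.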

Amazon Elastic MapReduce: A Web Service API for Hadoop

AWS just launched a new service called Amazon Elastic MapReduce that provides the same kind of developer-friendly API used for Amazon EC2 or S3 for running Hadoop jobs in the cloud. You submit a job request and the number of instances to the API (pointing to input data and code on S3), and AWS spins up a private Hadoop cluster on EC2, submits your job, and reports back on status through the API. You can cancel or modify jobs using the API, and can even add additional steps to a running job.

I was part of the private beta and wrote a short code sample that shows how to run Python streaming jobs using the service: Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming. As part of the code example, I also pulled together a cleaned-up version of the Audioscrobbler dataset for use in music recommendations (it is about 1/4 the size of the Netflix Prize data). The code sample basically implements a Python streaming version of the Pairwise Similarity algorithm found in this paper by Tamer Elsayed, Jimmy Lin, and Douglas Oard and applies it to Netflix Prize ratings and Audioscrobbler playlist data.
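For readers who have not seen the pairwise approach, the sketch below shows the general shape of the idea in plain Python, run in memory rather than on Hadoop. It is a simplified stand-in and not the code sample mentioned above: every weight is set to 1, so the score for a pair of items is just the number of users they share. The user and item names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_pairs(user_items):
    """Map step: for one user's items, emit ((item_a, item_b), 1) per pair."""
    for a, b in combinations(sorted(set(user_items)), 2):
        yield (a, b), 1


def reduce_pairs(emitted):
    """Reduce step: sum the values emitted for each item pair."""
    totals = defaultdict(int)
    for pair, value in emitted:
        totals[pair] += value
    return dict(totals)


def pairwise_cooccurrence(users):
    """Run the map and reduce steps over a {user: [items]} mapping."""
    emitted = (kv for items in users.values() for kv in map_pairs(items))
    return reduce_pairs(emitted)


# Hypothetical listening data: user id -> artists played.
users = {"u1": ["A", "B", "C"], "u2": ["A", "B"], "u3": ["B", "C"]}
result = pairwise_cooccurrence(users)
print(result)
# prints {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 2}
```

In a streaming job the two functions would live in separate mapper and reducer scripts reading from stdin, with Hadoop doing the grouping by key between them.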

The base EC2 images underlying the service are running Hadoop 0.18.3 on Debian and include NumPy, SciPy, R, BeautifulSoup, and other preinstalled packages useful for Hadoop Streaming jobs. You can use the distributed cache to install other packages like nltk at runtime.

My initial impression is that this will evolve into a powerful tool for people who want to run ad hoc MapReduce jobs, prototype MapReduce code on EC2, or interface with on-demand clusters from within their apps. Hopefully we’ll see a MapReduce code/task sharing facility at some point similar to the EC2 public AMI system.

Note that in the current release of Elastic MapReduce, input data is copied down from S3 at the start of the job and your cluster shuts itself down upon completion by default (you can override this with the API). Mounting data directly from EBS volumes isn’t supported yet, but I wouldn’t be surprised to see that soon given the potential for integrating with Amazon Public Datasets. Running Dumbo jobs isn’t supported yet since it requires a Hadoop patch for 0.18.3, but it should be possible when AWS moves to Hadoop 0.21 (which will also bring in a number of other important Hadoop features that are missing in 0.18.3).

For maintaining a permanent cluster in-house or even a semi-permanent cluster on EC2 with a large amount of data, I would recommend using the Cloudera distribution for Hadoop (it is a one-liner to start an EC2 Hadoop cluster from the command line). I would often bounce between running jobs on my Cloudera EC2 cluster and Elastic MapReduce during development of the code example. If you are getting started with Hadoop, the Cloudera training videos are a great place to get up to speed.

So what can you do with Elastic MapReduce? Here are a few initial ideas:

  • Offload background processing from your Rails or Django app to Hadoop by sending the Elastic MapReduce API job requests pointing to data stored on S3: convert PDFs, classify spam, deduplicate records, batch geocoding, etc.
  • Process large amounts of retail sales and inventory transaction data for sales forecasting and optimization
  • Use the AddJobFlowSteps method in the API to run iterative machine learning algorithms using MapReduce on a remote Hadoop cluster and shut it down when your results converge to an answer

I’ll post more on this later today - including a detailed explanation of using Netflix Prize data in the code example and some next steps for using Elastic MapReduce.

Updated List of Datasets & Video Lectures

New Datasets

It’s spring cleaning time at Data Wrangling. I’ve bookmarked 230 new datasets since publishing my first dataset linkdump in January 2008, so at the request of @mrflip, I’ve appended them to the original post along with a JSON dump of the tagged links. Flip and the other Infochimps will be pulling anything they might have missed into the infochimps.org dataset repository.

You can check out the new list of datasets at the same url:
“Some Datasets Available on the Web”

Around 85 of these datasets can be redistributed publicly: http://delicious.com/pskomoroch/redistributable+dataset. The rest are mostly free for academic use, but the license conditions vary. Some appear to adhere to the terms on http://opendefinition.org/

New Video Courses

In addition to the datasets, my bookmarks included 20 new video courses since the original video lecture post was published in April 2008. These are mostly graduate and advanced undergraduate courses in Physics, Mathematics, and Computer Science. Among these are full video courses in Parallel Programming, Loop Quantum Gravity, Machine Learning, Financial Markets, and other fun subjects.

The new videos have been added to the post:
“Hidden Video Courses in Math, Science, and Engineering”

Videos of Talks & Seminars

As an added bonus, here is a completely unorganized list of interesting programming, machine learning, and visualization talks which caught my eye in 2008:

(more…)

Search map: interactive visualization of search query clusters

Last month, our team at Juice launched a Django web analytics app called Concentrate that ingests search queries from sources like Google Analytics or Hitwise, then enhances this raw data by discovering common query patterns, generating segmented reports, and offering visual interfaces for data exploration. Jeff Barr wrote about the technology stack we used to build the app itself a couple of weeks ago at the AWS blog. I’ll provide some more detail on that topic later this week. This post will give a basic description of Concentrate’s pattern discovery algorithm and show it in action.

The following mashup provides a visual interface for exploring search patterns used by readers of the Data Wrangling blog by combining output from concentrateme.com with the Google AJAX search API. Each bubble in the visualization below represents a search query typed into Google during the last 2 months that led to clicks on this site (~2000 unique queries, ~3400 searches). The size of each bubble represents the number of visitors referred by that particular query, and the bubbles are colored by query cluster, based on phrase pattern structure (‘python [x]’, ‘[x] video’, etc.). The search results below the chart are highlighted in yellow if they lead to datawrangling.com pages, which allows you to see at a glance where the site ranks for each query.

Search map of queries leading to clicks on datawrangling.com

Click to open the query browser in a new window, then mouse over a query bubble and click to update the search results.

Interactive Search Query Map
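To illustrate what grouping by phrase pattern can look like, here is a deliberately naive Python sketch that assigns each query to a ‘word [x]’ or ‘[x] word’ cluster depending on whether its first or last word is shared by more queries. This is a toy of my own and should not be read as Concentrate’s pattern discovery algorithm; the sample queries are invented.

```python
from collections import Counter


def cluster_by_pattern(queries):
    """Assign each query to its more common 'first [x]' or '[x] last' pattern."""
    multiword = [q.split() for q in queries if len(q.split()) > 1]
    first = Counter(words[0] for words in multiword)
    last = Counter(words[-1] for words in multiword)
    clusters = {}
    for query in queries:
        words = query.split()
        if len(words) < 2:
            clusters[query] = query  # single-word queries stand alone
            continue
        head, tail = words[0], words[-1]
        if first[head] >= last[tail]:
            clusters[query] = "%s [x]" % head
        else:
            clusters[query] = "[x] %s" % tail
    return clusters


# Invented queries, loosely in the style of the patterns mentioned above.
queries = [
    "python hadoop",
    "python datasets",
    "machine learning video",
    "physics video",
    "python video",
]
clusters = cluster_by_pattern(queries)
for query in queries:
    print("%-24s %s" % (query, clusters[query]))
```

The cluster label is what you would map to a bubble color, with the referral count for the query driving the bubble size.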

(more…)

Conversation with Eric Siegel on Predictive Analytics World


The Predictive Analytics World Conference is taking place Feb 18-19, 2009 in San Francisco, CA and seems to have an interesting lineup of speakers (including one of the winners of this year’s Netflix Progress Prize). I’m going to be in the Bay Area during the week of Feb 15th, so I’m planning on checking out some of the talks. Data Wrangling readers can register using this code: datawranglingpaw09 and get 15% off the conference registration fee. Drop me a line if you are attending and want to meet up.

It also might be worth stopping by if you are an R user, as Mike E. Driscoll at Data Evolution mentioned:

The Bay Area R UseRs group is doing a free, co-located event on Wed evening of the conference — so if you’re interested in mingling with some PAW folks as well as some R users — you can sign up at: http://ia.meetup.com/67/calendar/9573566/

The organizers of the conference are coordinating a nice media blitz across several machine learning blogs; check out the post by Brendan O’Connor and John Langford’s interview at Machine Learning (Theory). I thought I would join in the fun by interviewing Eric about a few topics related to the conference, mostly focusing on customer modeling and machine learning in the business world.

Read on for the transcript of our email interview: (more…)
