Some Datasets Available on the Web

The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.

The event that prompted this long-overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets, started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.

Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.
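The original script isn't shown here, so the following is only a minimal sketch of how such a link extraction might look. It assumes the del.icio.us export is a Netscape-style bookmark HTML file in which each bookmark is an `<A>` element with `HREF` and comma-separated `TAGS` attributes. The names `BookmarkParser` and `links_with_tag` are hypothetical.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect url, title and tags from every <A> element in a bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # bookmark being read while inside an <A> element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            raw_tags = attrs.get("tags") or ""  # assumption: comma-separated tags
            tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
            self._current = {"url": attrs.get("href"), "title": "", "tags": tags}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.bookmarks.append(self._current)
            self._current = None


def links_with_tag(export_html, wanted="dataset"):
    """Return the bookmarks in export_html that carry the tag `wanted`."""
    parser = BookmarkParser()
    parser.feed(export_html)
    parser.close()
    return [b for b in parser.bookmarks if wanted in b["tags"]]
```

Reading the export file into a string and calling `links_with_tag(text)` gives a list of dictionaries that can then be printed as a link list or dumped to JSON.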

This list is semi-static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset

Update (02/10/09): I have around 400 dataset bookmarks now (more than double the count when this post first appeared), so I’ve updated the list below. Here is a JSON file containing the URLs and tags: delicious_dataset_links.json

Around 85 of these datasets can be redistributed publicly: http://delicious.com/pskomoroch/redistributable+dataset. The rest are mostly free for academic use, but the license conditions vary.
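For anyone who wants to work with the JSON file rather than the bookmark pages, a filter by tag might look like the sketch below. The file's layout isn't described in this post, so this assumes a top-level list of objects, each with a `"url"` string and a `"tags"` list of strings. The helper names `load_links` and `urls_tagged` are hypothetical.

```python
import json


def load_links(path):
    """Load the bookmark list from a JSON file (assumed: a list of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def urls_tagged(links, *wanted):
    """Return the URLs of entries that carry every tag in `wanted`."""
    wanted = set(wanted)
    # Assumption: each entry looks like {"url": "...", "tags": ["dataset", ...]}
    return [entry["url"] for entry in links if wanted <= set(entry["tags"])]
```

Under those assumptions, `urls_tagged(load_links("delicious_dataset_links.json"), "redistributable", "dataset")` would list the publicly redistributable subset.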

Here are the 230 new datasets bookmarked since Jan 17, 2008:

Datasets listed in the original post on Jan 17, 2008:

  • Vinoth
    hi
    can anyone tell me how i can obtain data sets for movie ratings to import into WEKA
  • manoj
    I need a Boolean dataset for association mining for my MTech project. Please provide me the address for the same. It will be a great help.
  • endah
    I need dataset image fingerprint for free,,,where I can find it?
    thank u
  • Mandy
    I need RFID Supply Chain data for my research. Where can I find it??
  • rajeswari
    this is rajeswari.,
    i am doing a project in association rule mining in datamining for time related data. so i need a dataset with time related data ie,, temporal data.so please send me time related data.it's really helpful for my data
  • halim
    Dear all,

    where can i find the dataset for manufacturing?? such as 'defect' or 'not defect' prediction..please ..help me..i am in urgent condition...thx b4
  • uma
    where can i get dataset for mining unexpected temporal association rules(eg:application in adverse drug reaction)
  • uma
    the site u mentioned contains the required data for my project.i thank u for this great help
  • pskomoroch
    The FDA has some data like that:

    http://www.fda.gov/Drugs/GuidanceComplianceRegu...

    Some analogous time series data might be worth looking at as well:

    http://delicious.com/pskomoroch/timeseries+dataset
  • Ken

    A personal favorite: ITRDB: ftp://ftp.ncdc.noaa.gov/pub/data/paleo/treering/

  • Tim,


    I like the site, and the capability to download in various formats. A REST or SOAP API would be nice, or at least an index page for each format with direct paths to the individual downloads.


    -Pete

  • Tim

    what do you think of http://data.un.org ?

  • Rufus,


    I had bookmarked the project here in July: http://project.knowledgeforge.net/ckan/wiki/package


    Looks like you have made a lot of progress since then, I've just subscribed to the Open Knowledge blog: http://blog.okfn.org/


    Your message is right on target: "Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge."


    Installing/discovering data should be as easy as installing linux software using repository mirrors...


    port install library-of-congress


    -Pete

  • I don't know whether you've seen CKAN (Comprehensive Knowledge Archive Network). This is a project started by the Open Knowledge Foundation (of which I'm a part) and was launched about a year ago and seeks to perform exactly the type of registry task you've started upon here (though limited to open material only). As the blurb on the front-page says:


    CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare's works, a global population density database, the voting records of MPs, or 30 years of US patents.


    Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.

  • What about including the wiki http://www.numberzoom.com/, which is a user-contributed phone number database? It's mostly reverse Caller ID for looking up which telemarketers or collection agencies have called, but there is no reason why other numbers wouldn't be on the site.

  • Brent, sorry I missed that. This data will be useful for some identity matching projects I'm testing. I just found the ProgrammableWeb description of your API as well: http://www.programmableweb.com/api/cogmap

  • Brent,


    Just added Cogmap to my dataset bookmarks... any chance of releasing a raw dataset or a REST API to fetch raw orgchart data?


    -Pete

  • The omission of cogmap makes me sad! Cogmap provides organization chart data for thousands of companies and exposes it all through a variety of web services.

  • Looks like Google is going to start providing access to loads of open sourced data sets (http://blog.wired.com/wiredscience/2008/01/goog...)

  • skj,


    The WebBase Project link includes some chat data. It would be pretty easy to crawl for that data, provided terms of use for the chat sites are followed. Here is a recent list of hosts Stanford WebBase crawled, which includes chat sites (this link might not be permanent):

    http://dbpubs.stanford.edu:8090/~testbed/doc2/WebBase/crawl_lists/crawled_hosts.0403

  • civilian,


    The LDC site was up yesterday. It may have been hammered by reddit/del.icio.us users? I think some of the datasets they have are extremely large (for example the Google N-grams), so there is a handling fee for non-commercial researchers. As for commercial-use fees, many data providers restrict use entirely. Open access to more data would be great ... except where privacy issues are involved. Sometimes there are also competitive reasons for restrictive licenses.


    See more on the issues here:


    http://en.wikipedia.org/wiki/Open_data
    http://en.wikipedia.org/wiki/Data_privacy


    related discussion:


    http://news.ycombinator.com/item?id=100197

  • skj

    Are there any datasets of chat logs? Chat conversations (from IRC or otherwise)?

  • civilian

    The LDC link does not work. As a taxpayer, I am forced to wonder why their data is not open source instead of proprietary and subscription-based.
