Some Datasets Available on the Web

The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.

The event that prompted this long-overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets, started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.

Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.
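
For anyone who wants to do the same with their own bookmarks, here is a minimal sketch of that kind of script. It is not the original script. It assumes the del.icio.us export is a Netscape-style bookmark HTML file in which every link is an A element with an HREF attribute and a comma-separated TAGS attribute. The names BookmarkParser and extract_links are made up for illustration.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs for bookmarks carrying a given tag.

    Assumption: each bookmark is an <A HREF="..." TAGS="a,b,c"> element,
    as in a Netscape-style bookmark export.
    """

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.links = []
        self._href = None      # URL of the matching link being read, if any
        self._title = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        attrs = dict(attrs)
        tags = [t.strip() for t in (attrs.get("tags") or "").split(",")]
        if self.wanted_tag in tags and attrs.get("href"):
            self._href = attrs["href"]
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._title).strip(), self._href))
            self._href = None


def extract_links(export_html, wanted_tag="dataset"):
    """Return a list of (title, url) for bookmarks tagged with wanted_tag."""
    parser = BookmarkParser(wanted_tag)
    parser.feed(export_html)
    parser.close()
    return parser.links
```

To use it, read the export file into a string (for example a hypothetical delicious.html), pass it to extract_links, and print each title and URL. Only the standard library is needed, so nothing has to be installed.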

This list is static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset

42 Responses to “Some Datasets Available on the Web”

  1. January 17th, 2008 | 11:05 pm

    This is one of the web’s most interesting stories on Fri 18th Jan 2008…

    These are the web’s most talked about URLs on Fri 18th Jan 2008. The current winner is …..

  2. January 18th, 2008 | 5:39 am

    datawrangling.com | CommentURL.com…

    datawrangling.com

    A great list of publicly available data sets.

    Over the past year, I’…

  3. January 18th, 2008 | 6:46 am

    […] Some Datasets Available on the Web » Data Wrangling Blog […]

  4. January 18th, 2008 | 4:46 pm

    […] Datasets available on the web Is there an opportunity in utilizing this data for a new startup? […]

  5. January 18th, 2008 | 5:46 pm

    […] Today, a friend of mine alerted me to another lode of mine-able data offered by Peter Skomoroch of Data Wrangling. There are a bunch of great sets here; hope you find something to your liking as well. […]

  6. January 18th, 2008 | 7:30 pm

    […] David Pennock notes the impressive set of datasets at datawrangling. […]

  7. January 18th, 2008 | 7:32 pm

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: data datasets datamining databases free database web) […]

  8. January 18th, 2008 | 11:17 pm

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: web api visualization db) […]

  9. January 19th, 2008 | 4:17 am

    […] Some Datasets Available on the Web » Data Wrangling Blog useful web available datasets (tags: data database datamining datasets useful tools programming machinelearning) […]

  10. civilian
    January 19th, 2008 | 4:52 am

    The LDC link does not work. As a taxpayer, I am forced to wonder why their data is not open source instead of proprietary and subscription-based.

  11. skj
    January 19th, 2008 | 5:44 am

    Are there any datasets of chat logs? Chat conversations (from IRC or otherwise)?

  12. January 19th, 2008 | 8:35 am

    civilian,

    The LDC site was up yesterday. It may have been hammered by reddit/del.icio.us users. I think some of the datasets they have are extremely large (for example the Google N-grams), so there is a handling fee for non-commercial researchers. As for commercial-use fees, many data providers restrict commercial use entirely. Open access to more data would be great … except where privacy issues are involved. Sometimes there are also competitive reasons for restrictive licenses.

    See more on the issues here:

    http://en.wikipedia.org/wiki/Open_data
    http://en.wikipedia.org/wiki/Data_privacy

    related discussion:

    http://news.ycombinator.com/item?id=100197

  13. January 19th, 2008 | 8:50 am

    skj,

    The WebBase Project link includes some chat data. It would be pretty easy to crawl for that data, provided terms of use for the chat sites are followed. Here is a recent list of hosts Stanford WebBase crawled, which includes chat sites (this link might not be permanent):
    http://dbpubs.stanford.edu:8090/~testbed/doc2/WebBase/crawl_lists/crawled_hosts.0403

  14. January 19th, 2008 | 11:50 am

    Looks like Google is going to start providing access to loads of open sourced data sets (http://blog.wired.com/wiredscience/2008/01/google-to-provi.html).

  15. January 21st, 2008 | 6:17 am

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: academic research data resources) […]

  16. January 22nd, 2008 | 8:47 am

    […] Data Pr0n […]

  17. January 22nd, 2008 | 8:19 pm

    […] Some Datasets Available on the Web (tags: web data) […]

  18. January 29th, 2008 | 10:07 am

    […] Datasets available on the Web Recently I stumbled upon a great post at Datawrangling.com mentioning datasets available on the Web. In the same post I read about the collaborative effort to gather datasets at theinfo.org. […]

  19. February 1st, 2008 | 9:17 am

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: machine_learning datamining mashups research collaborative_filtering information_retrieval) […]

  20. February 21st, 2008 | 11:42 pm

    […] Peter Skomoroch: The list of data sets over at Data Wrangling is similar in spirit to the one here. […]

  21. March 8th, 2008 | 2:33 am

    […] Some Datasets Available on the Web » Data Wrangling Blog - […]

  22. April 9th, 2008 | 12:50 pm

    The omission of cogmap makes me sad! Cogmap provides organization chart data for thousands of companies and exposes it all through a variety of web services.

  23. April 9th, 2008 | 1:24 pm

    Brent,

    Just added Cogmap to my dataset bookmarks… any chance of releasing a raw dataset or a REST API to fetch raw orgchart data?

    -Pete

  24. April 9th, 2008 | 2:29 pm
  24. April 9th, 2008 | 2:29 pm

  25. April 9th, 2008 | 2:37 pm

    Brent, sorry I missed that. This data will be useful for some identity matching projects I’m testing. I just found the ProgrammableWeb description of your API as well: http://www.programmableweb.com/api/cogmap

  26. April 9th, 2008 | 6:32 pm

    What about including the wiki http://www.numberzoom.com/, which is a user-contributed phone number database? It’s mostly reverse Caller ID for looking up which telemarketers or collection agencies have called, but there is no reason why other numbers wouldn’t be on the site.

  27. April 9th, 2008 | 11:19 pm

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: datasets data datamining database Free databases web tools resources research information lists) […]

  28. April 10th, 2008 | 3:32 am

    […] Some Datasets Available on the Web » Data Wrangling Blog This list is static, but you can follow his links at http://del.icio.us/pskomoroch/dataset (tags: datasets data free datamining databases) […]

  29. April 10th, 2008 | 4:29 am

    I don’t know whether you’ve seen CKAN (Comprehensive Knowledge Archive Network). It is a project started by the Open Knowledge Foundation (of which I’m a part). It launched about a year ago and seeks to perform exactly the type of registry task you’ve started on here (though limited to open material only). As the blurb on the front page says:

    CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare’s works, a global population density database, the voting records of MPs, or 30 years of US patents.

    Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.

  30. April 10th, 2008 | 8:01 am

    Rufus,

    I had bookmarked the project here in July: http://project.knowledgeforge.net/ckan/wiki/package

    Looks like you have made a lot of progress since then. I’ve just subscribed to the Open Knowledge blog: http://blog.okfn.org/

    Your message is right on target: “Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.”

    Installing/discovering data should be as easy as installing Linux software using repository mirrors…

    port install library-of-congress

    -Pete

  31. April 17th, 2008 | 11:54 am

    […] Freebase Data Dumps, Open CellID databases, Some Datasets Available on the Web […]

  32. April 17th, 2008 | 12:26 pm

    […] Some Datasets Available on the Web […]

  33. June 20th, 2008 | 7:54 am

    […] The Data Wrangling blog - a blog post with a huge list of other databases available on the web. […]

  34. July 10th, 2008 | 3:45 am

    […] Blog articles which provide dataset directories - see blog comments as well http://conflate.net/inductio/2008/02/a-meta-index-of-data-sets/ - excellent article listing available data sets in the area of machine learning and inference http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php - Article containing a list of available dataset websites http://www.datawrangling.com/some-datasets-available-on-the-web.html http://www.daniel-lemire.com/blog/data-for-data-mining/ - has blog, tag cloud, wiki dataset categories http://www.kirix.com/blog/category/data-tagssearch/ http://mobblog.cs.ucl.ac.uk/datasets/ […]

  35. September 11th, 2008 | 6:47 pm

    […] The internet nowadays is a data miner’s paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes. […]

  36. October 29th, 2008 | 11:13 am

    […] One small link to a collection of 166 dataset sources, which will surely settle the question of where to find them, at least until the end of this year. […]

  37. Tim
    November 10th, 2008 | 8:33 pm

    what do you think of http://data.un.org ?

  38. November 10th, 2008 | 8:49 pm

    Tim,

    I like the site, and the capability to download in various formats. A REST or SOAP API would be nice, or at least an index page for each format with direct paths to the individual downloads.

    -Pete

  39. November 11th, 2008 | 8:30 pm

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: useful datasets data statistics) […]

  40. November 12th, 2008 | 2:41 am

    […] Static list on his blog […]

  41. November 12th, 2008 | 3:31 am

    […] Some Datasets Available on the Web » Data Wrangling Blog That you can just dig into (tags: web2.0 sets semanticweb statistics) […]

  42. November 17th, 2008 | 11:10 pm

    […] Some Datasets Available on the Web » Data Wrangling Blog (tags: web statistics) […]
