Some Datasets Available on the Web
The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I’m giving an EC2 talk at PyCon in March, so I’m really on the hook to wrap up that series of posts now.
The event which prompted this long overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets and was started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.
Over the past year, I’ve been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won’t get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell’s post on public data.
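A script along those lines might look like the sketch below. It is not the original script. It assumes the del.icio.us export is a Netscape-style bookmark file in which each link is an `<A>` element carrying a comma-separated `TAGS` attribute. The names `BookmarkParser` and `links_with_tag`, and the `SAMPLE` data, are made up for illustration.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url, tags) from <A HREF=... TAGS=...> elements.

    Assumption: the export is a Netscape-style bookmark file where each
    bookmark's tags sit in a comma-separated TAGS attribute.
    """

    def __init__(self):
        super().__init__()
        self.links = []       # finished (title, url, tags) tuples
        self._current = None  # (title_parts, url, tags) while inside an <A>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            raw_tags = attrs.get("tags") or ""
            tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
            self._current = ([], attrs.get("href"), tags)

    def handle_data(self, data):
        # Link text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._current[0].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            parts, url, tags = self._current
            self.links.append(("".join(parts).strip(), url, tags))
            self._current = None


def links_with_tag(export_html, wanted="dataset"):
    """Return (title, url) pairs for bookmarks carrying the wanted tag."""
    parser = BookmarkParser()
    parser.feed(export_html)
    parser.close()
    return [(title, url) for title, url, tags in parser.links if wanted in tags]


# Hypothetical sample standing in for a real export file.
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
<DT><A HREF="http://example.com/a" TAGS="dataset,ml">Example dataset A</A>
<DT><A HREF="http://example.com/b" TAGS="python">Unrelated link</A>
<DT><A HREF="http://example.com/c" TAGS="finance,dataset">Example dataset C</A>
</DL><p>"""

if __name__ == "__main__":
    # Print the matching links as a plain dash list, like the one in this post.
    for title, url in links_with_tag(SAMPLE):
        print(f"- {title}: {url}")
```

To run it against a real export, read the file into a string and pass it to `links_with_tag` in place of `SAMPLE`.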
This list is static, but I’ll keep adding links at http://del.icio.us/pskomoroch/dataset
- XML.com: GovTrack.us, Public Data, and the Semantic Web
- CiteULike: Available datasets
- Archive-It.org
- Challenge: Synopsis - Causality Workbench
- Natural Language Processing
- LDC - Linguistic Data Consortium - Obtaining Data Resources
- 1990 Census Name Files
- Given Name Frequency Project: Analysis of Given Name Popularity
- Email Datasets
- ZoomInfo - Welcome to the ZoomInfo Developer API
- Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Entity Resolution / Named Entity Disambiguation
- Developers Area - eBay Market Data Documentation - eBay Market Data Documentation
- New SwetoDblp RDF dataset released with 11M triples
- LSDIS : SwetoDblp
- StrikeIron Super Data Pack Web Service 1.0 - StrikeIron Marketplace
- Vaccines: IIS/Tech/Deduplication Test Cases
- Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
- INFO 747 - Social and Economic Data
- Overstock.com Affiliate Program
- Amazon Web Services Developer Connection : Can Alexa WS provide detailed …
- Market Data — eBay Developers Program
- Health Data Tools and Statistics
- It’s a Pitch-by-Pitch Scouting Report, Minus the Scout - New York Times
- opentick :: market data
- Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending
- Welcome to USAspending.gov
- Campaign Finance Reports and Data
- Machine Learning and Data Mining - Datasets
- GIS for Schools
- Cardiac MRI dataset - York University
- Google Trends API coming soon | Tech news blog - CNET News.com
- MIT Media Lab: Reality Mining
- RL Competition 2008 - Home
- Vehicle Routing Data Sets
- EIA - Petroleum Data, Reports, Analysis, Surveys
- DMOZ100k06 - Michael G. Noll
- Grading
- Carnegie Mellon University - CMU Graphics Lab - motion capture library
- Financial Forecast Center’s Historical Economic and Market Data
- Bureau of Labor Statistics Data
- Browse Business Cycle Indicators Data
- The Numbers Guy : Aspiring to Be the Wikipedia of Numbers
- Social characteristics of the Marvel Universe
- SourceForge.net: Word Lists Collection
- ERS/USDA Data - International Macroeconomic Data Set
- State Agency Databases - GODORT
- The 2000 U.S. Census: 1 Billion RDF Triples
- See Who’s Editing Wikipedia - Diebold, the CIA, a Campaign
- Dataset Generator - Perfect data for an imperfect world.
- National Bureau of Economic Research: Data
- Entree Chicago Recommendation Data
- community resource guide: i’ve been here before - show me the links
- Social Science Data on the Net
- NBI ASCII Files - Bridge - FHWA
- List of films: A - Wikipedia, the free encyclopedia
- The arXiv on your harddrive
- Insanely Useful Websites | Sunlight Foundation
- Technophilia: Where to find public records online - Lifehacker
- Junk email project
- Enron Email Dataset
- ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt
- GOS - Geospatial One Stop
- CIA Factbook Grep in Python
- Miller Center of Public Affairs - Richard Nixon - Oval Office Recordings
- Deborah Jeane Palfrey Legal Defense Fund
- UC San Diego Data Mining Competition - 2007 - Datasets
- package - MoinMaster
- Retail Industry Financial Ratios & Benchmarks
- stores | POI Factory
- GpsPasSion Forums - ** INDEX OF POI COLLECTIONS **
- GPS POI US : Home > Retail Stores
- Collective Dynamics Group
- Jester Data download page
- TricTrac: Video Dataset
- Premium Business Information Databases - AlacraWiki
- Index of /edgar
- Mail Index
- metafy / AnthraciteIdioms
- Advance Monthly Sales for Retail and Food Services - Time Series Data/Seasonal Factors - 1992 to Present
- TDT
- Volume of retail sales: Social Trends 33
- generatedata.com
- U.S. Company Filings and Annual Reports
- FTP Information - EDGAR Database
- Data Mining For Investing
- Melissa DATA - Lookups
- FactSet: Data Maven - Kiplinger.com
- IBES (Demo)
- Thomson Financial I/B/E/S Data
- Historical Quotes - Yahoo! Finance
- Network data
- Bureau of Labor Statistics Home Page
- NAR: Research: EHS Data
- RFA - The Industry - Industry Statistics
- Chain Store Guide - Retail Locations
- Press Releases - Directions Magazine
- Energy Information Administration - EIA - Official Energy Statistics from the U.S. Government
- Databases you can use for benchmarking
- UPC Database: Downloads
- Web Crawling / Crawl Datasets at Tobias Escher at the OII
- TechTC - Technion Repository of Text Categorization Datasets
- TMC data archive download site
- http://www.volvis.org/
- Computational Vision: Archive
- DC Pedestrian Classification Benchmark
- opentick :: home
- Web as Corpus
- .:[ packet storm ]:. - http://packetstormsecurity.org/
- Enron Dataset
- Splog Blog Dataset
- Home Page for 20 Newsgroups Data Set
- White Glove Tracking
- NOAA Paleoclimatology Program - Coral and Sclerosponge Data
- NAICS — North American Industry Classification System
- Saving Democracy With Web 2.0 -
- Congresspedia - Congresspedia
- Population Estimates Data Sets
- CRAN Task View: Machine Learning & Statistical Learning
- Data for Data Mining
- PAIDA - Pure Python scientific analysis package
- SUBDUE - Graph Based Knowledge Discovery
- AOL search data mirrors
- Python Cheese Shop : shakespeare 0.4
- AG’s corpus of news articles
- Sampling Techniques for Massive Data - Google Video
- metachronistic » Mirror the Wikipedia
- LETOR: Benchmark Datasets for Learning to Rank
- CN710: Comparative Analysis of Learning Systems (Spring 2006) - Class Project
- UrbanSim Home
- System One - Wikipedia³
- System One - Labs
- Face Recognition Homepage - Databases
- CBCL SOFTWARE Face data set
- Text Analytics Solutions from ClearForest
- 23C3 - Mining Search Queries - Google Video
- Digital History Hacks: Keywords and Clues
- Digital History Hacks: Searching for History
- The Tom Kyte Blog: An interesting data set…
- KDD 2005 - KDD Cup 2005: Aug 21-24, Chicago, IL. USA
- Statistical NLP / corpus-based computational linguistics resources
- Ph.d.-student Rasmus Elsborg Madsen
- Intelligent Web Search and Mining: Tools & Resources
- PageRank Datasets and Code
- Official Google Research Blog: All Our N-gram are Belong to You
- Hyper-threaded Java - Java World
- Statistical Modeling, Causal Inference, and Social Science
- Structural Analysis of Discrete Data and Econometric Applications, by Charles F. Manski and Daniel L. McFadden, MIT Press, 1981.
- Kris Brower » Archives » Google Onpage Search Results Analysis
- CSE 250B Fall 2006
- Matrix Market
- Face Detection
- CSE 250B Project 4, Fall 2006
- G3DATA
- cwm - a general purpose data processor for the semantic web
- WebBase Project
- sam roweis : data
- Index of /data/sequence/mnist
- MNIST handwritten digit database
- Book-Crossing Dataset
- allmovie
- Submissions Guidelines for the Collectorz.com Online Movie Database
- cinema.com
- LUMIERE
- Data dumps - Meta
- "phone ***" "address *" "e-mail" intitle:"curriculum vitae" - Google Search

[…] Datasets available on the web Is there an opportunity in utilizing this data for a new startup? […]
[…] Today, a friend of mine alerted me to another lode of mine-able data offered by Peter Skomoroch of Data Wrangling. There are a bunch of great sets here; hope you find something to your liking as well. […]
[…] David Pennock notes the impressive set of datasets at datawrangling. […]
The LDC link does not work. As a taxpayer, I am forced to wonder why their data is not open source instead of proprietary and subscription-based.
Are there any datasets of chat logs? Chat conversations (from IRC or otherwise)?
civilian,
The LDC site was up yesterday. It may have been hammered by reddit/del.icio.us users. I think some of the datasets they have are extremely large (for example the Google N-grams), so there is a handling fee for non-commercial researchers. As for commercial use fees, many data providers restrict use entirely. Open access to more data would be great … except where privacy issues are involved. Sometimes there are also competitive reasons for restrictive licenses.
See more on the issues here:
http://en.wikipedia.org/wiki/Open_data
http://en.wikipedia.org/wiki/Data_privacy
related discussion:
http://news.ycombinator.com/item?id=100197
skj,
The WebBase Project link includes some chat data. It would be pretty easy to crawl for that data, provided terms of use for the chat sites are followed. Here is a recent list of hosts Stanford WebBase crawled, which includes chat sites (this link might not be permanent):
http://dbpubs.stanford.edu:8090/~testbed/doc2/WebBase/crawl_lists/crawled_hosts.0403
Looks like Google is going to start providing access to loads of open sourced data sets (http://blog.wired.com/wiredscience/2008/01/google-to-provi.html).
[…] Data Pr0n […]
[…] Datasets available on the Web Recently I stumbled upon a great post at Datawrangling.com mentioning datasets available on the Web. In the same post I read about the collaborative effort to gather datasets at theinfo.org. […]
[…] Peter Skomoroch: The list of data sets over at Data Wrangling is similar in spirit to the one here. […]
The omission of cogmap makes me sad! Cogmap provides organization chart data for thousands of companies and exposes it all through a variety of web services.
Brent,
Just added Cogmap to my dataset bookmarks… any chance of releasing a raw dataset or REST API to fetch raw orgchart data?
-Pete
It’s in there! http://www.cogmap.com/blog/2008/03/04/cogmap-apis/
– brent
Brent, sorry I missed that. This data will be useful for some identity matching projects I’m testing. I just found the ProgrammableWeb description of your API as well: http://www.programmableweb.com/api/cogmap
What about including the wiki http://www.numberzoom.com/, which is a user-contributed phone number database? It’s mostly reverse caller ID for looking up which telemarketers or collection agencies have called, but there is no reason why other numbers wouldn’t be on the site.
I don’t know whether you’ve seen CKAN (Comprehensive Knowledge Archive Network). It is a project started by the Open Knowledge Foundation (of which I’m a part). It launched about a year ago and aims to perform exactly the type of registry task you’ve started on here (though limited to open material only). As the blurb on the front page says:
CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare’s works, a global population density database, the voting records of MPs, or 30 years of US patents.
Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.
Rufus,
I had bookmarked the project here in July: http://project.knowledgeforge.net/ckan/wiki/package
Looks like you have made a lot of progress since then. I’ve just subscribed to the Open Knowledge blog: http://blog.okfn.org/
Your message is right on target: “Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.”
Installing/discovering data should be as easy as installing Linux software using repository mirrors…
port install library-of-congress
-Pete
[…] Freebase Data Dumps, Open CellID databases, Some Datasets Available on the Web […]
[…] The Data Wrangling blog - a blog post with a huge list of other databases available on the web. […]
[…] Blog articles which provide dataset directories - see blog comments as well http://conflate.net/inductio/2008/02/a-meta-index-of-data-sets/ - excellent article listing available data sets in the area of machine learning and inference http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php - Article containing a list of available dataset websites http://www.datawrangling.com/some-datasets-available-on-the-web.html http://www.daniel-lemire.com/blog/data-for-data-mining/ - has blog, tag cloud, wiki dataset categories http://www.kirix.com/blog/category/data-tagssearch/ http://mobblog.cs.ucl.ac.uk/datasets/ […]
[…] The internet nowadays is a data miner’s paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes. […]
[…] One small link to a collection of 166 dataset sources, which will definitely settle the question of where to find them for you, at least until the end of this year. […]
what do you think of http://data.un.org ?
Tim,
I like the site, and the capability to download in various formats. A REST or SOAP API would be nice, or at least an index page for each format with direct paths to the individual downloads.
-Pete
[…] Static list on his blog […]
[…] Some Datasets Available on the Web » Data Wrangling Blog Something you can just dig into (tags: web2.0 sets semanticweb statistics) […]