The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster). I'm giving an EC2 talk at PyCon in March, so I'm really on the hook to wrap up that series of posts now.
The event that prompted this long-overdue blog post was another pet project: collecting public datasets. I keep an eye on topics of interest using del.icio.us tag subscriptions, and yesterday my feed was flooded with links to theinfo.org. Theinfo is a new community site/wiki for people working with large datasets, started by reddit cofounder Aaron Swartz. The site apparently developed from his work on The Open Library.
Over the past year, I've been tagging interesting data I find on the web in del.icio.us. I wrote a quick Python script to pull the relevant links from my del.icio.us export and list them at the bottom of this post. Most of these datasets are related to machine learning, but there are a lot of government, finance, and search datasets as well. I probably won't get around to organizing and posting them to the wiki myself, but theinfo community should be able to figure out what to do with them. The concept reminds me a lot of Jon Udell's post on public data.
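For anyone who wants to do the same with their own bookmarks, here is a minimal sketch of that kind of script. It is not the script used for this post. It assumes the export is the usual Netscape-style bookmark HTML, where each link is an `<A>` element with `HREF` and a comma-separated `TAGS` attribute; the names `BookmarkParser`, `dataset_links`, `format_listing`, and `SAMPLE_EXPORT` are made up for illustration.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect title, url, and tags for every <A> element in a bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # bookmark being read, or None outside an <A>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)  # attribute names arrive lowercased
            raw_tags = attrs.get("tags") or ""
            self._current = {
                "url": attrs.get("href") or "",
                # Assumption: tags are comma-separated in the TAGS attribute.
                "tags": [t.strip() for t in raw_tags.split(",") if t.strip()],
                "title": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.bookmarks.append(self._current)
            self._current = None


def dataset_links(export_html, wanted_tag="dataset"):
    """Return only the bookmarks that carry wanted_tag."""
    parser = BookmarkParser()
    parser.feed(export_html)
    parser.close()
    return [b for b in parser.bookmarks if wanted_tag in b["tags"]]


def format_listing(bookmarks, hidden_tag="dataset"):
    """Render bookmarks as a title line followed by a 'tags:' line."""
    lines = []
    for b in bookmarks:
        lines.append("- {} ({})".format(b["title"], b["url"]))
        lines.append("  tags: " + ", ".join(t for t in b["tags"] if t != hidden_tag))
    return "\n".join(lines)


# Tiny made-up export fragment (example.com URLs are placeholders).
SAMPLE_EXPORT = """<DL><p>
<DT><A HREF="http://example.com/a" ADD_DATE="1200000000" TAGS="dataset,finance">Example A</A>
<DT><A HREF="http://example.com/b" ADD_DATE="1200000001" TAGS="python,blog">Example B</A>
</DL><p>"""
```

Calling `format_listing(dataset_links(SAMPLE_EXPORT))` yields one entry for "Example A" with the `dataset` tag itself left out of the tag line, since every entry in the list shares it.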
This list is semi-static, but I'll keep adding links at http://del.icio.us/pskomoroch/dataset
Update (02/10/09): I have around 400 dataset bookmarks now (more than double the count when this post first appeared), so I've updated the list below. Here is a JSON file containing the URLs and tags: delicious_dataset_links.json
Around 85 of these datasets can be redistributed publicly: http://delicious.com/pskomoroch/redistributable+dataset. The rest are mostly free for academic use, but the license conditions vary.
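The redistributable subset is just a tag intersection, which is easy to reproduce offline. The sketch below is only illustrative: the layout of delicious_dataset_links.json is not described here, so it assumes a list of objects with `url` and `tags` keys, and `with_all_tags` and `SAMPLE_JSON` are hypothetical names.

```python
import json


def with_all_tags(records, required):
    """Keep the records whose tag list contains every tag in required."""
    required = set(required)
    return [r for r in records if required <= set(r["tags"])]


# Made-up records in the assumed layout (example.com URLs are placeholders).
SAMPLE_JSON = """[
  {"url": "http://example.com/a", "tags": ["dataset", "redistributable"]},
  {"url": "http://example.com/b", "tags": ["dataset", "finance"]}
]"""

records = json.loads(SAMPLE_JSON)
redistributable = with_all_tags(records, ["redistributable", "dataset"])
```

With the real file, `json.load` on the opened file would replace `json.loads(SAMPLE_JSON)`, provided the layout matches the assumption above.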
Here are the 230 new datasets bookmarked since Jan 17, 2008:
- Announcing the Article Search API - Open Blog - NYTimes.com
  tags: article, api, nytimes, text, corpus, newspaper
- Twitter API Wiki / REST API Documentation: Social Graph Methods
  tags: graph, network, api, social, twitter
- Information Extraction: The RISE Repository of Information Sources
  tags: information, textmining, extraction, reviews, jobs
- build.kiva: Blog - Introducing the Kiva API
  tags: finance, api, social, kiva, microlending, lending
- Using the Wikipedia link dataset -- Henry Haselgrove
  tags: graph, network, link, wikipedia, pagerank
- Lookery Developer Network - Lookery Developer Resources
  tags: web, analytics, api, traffic, advertising, demographics, lookery
- Visualizing the Growth of Target, 1962-2008 | FlowingData
  tags: visualization, retail, finance, gis, map, location, store, via:magnetbox, target
- The Economy According To Mint
  tags: finance, commercial, consumer, mint, spending
- Repositories
  tags: links, textmining, books, rdf, ocr, documents
- Subsidyscope.com
  tags: government, banking, csv, tarp, bailout
- Best Buy Remix - Welcome to the Best Buy Remix Developer Network
  tags: retail, data, api, product, bestbuy
- twibs : find the businesses on twitter
  tags: directory, businesses, twitter, companies
- True Marble Imagery - Free Download
  tags: gis, geo, map, mapping, images, satellite
- Massive Scrape of Twitter’s Friend Graph « blog.infochimps.org - Organizing Huge Information Sources
  tags: textmining, twitter, network, socialnetwork, pagerank, graph, queryminer
- Twitter Scrape (rough draft) - get.theinfo | Google Groups
  tags: twitter, socialnetwork, graph
- API Documentation — BackType
  tags: api, blog, comments, textmining, stream, trends, backtype, queryminer
- generatedata.com
  tags: random, generator, database, sql
- Full Examples — PyMVPA Home
  tags: fmri, neuroscience, python, neuralnetwork
- wiki.dbpedia.org : Downloads 32
  tags: wikipedia, named_entity, rdf, ontology
- CinC Challenge 2000 data sets
  tags: timeseries, machinelearning, ecg, health, medical, sleep, apnea
- Free book usage data from the University of Huddersfield » "Self-plagiarism is style"
  tags: books, library, borrowing, recommender, isbn, recommendation, collaborative, filtering, opendata
- UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources
  tags: health, links, resources, publichealth, berkeley
- ICWSM 2009 - International AAAI Conference on Weblogs and Social Media
  tags: blog, crawl, corpus, network, web, link
- BART - For Developers
  tags: urban, transportation, feeds, public, sanfrancisco, bart, api
- Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications
  tags: spare, matrix
- Others Online - Behavioral Targeting, Analytics and Advertising Service for Publishers, Ad Networks, Widgets, WiFi Networks
  tags: analytics, audience, segmentation, toolbar, commercial, sem, search, advertising
- HumanScan : BioID : Downloads : BioID Face Database
  tags: face, detection, image
- Face Detection
  tags: facerecognition, opencv, face, links
- Building a (fast) Wikipedia offline reader
  tags: django, wikipedia, compressed, textmining, howto
- Change.gov: The Obama-Biden Transition Team | Join the Discussion: Healthcare
  tags: textmining, opinion, comment, topic, government, queryminer
- UN General Assembly Voting Data
  tags: un, voting, statistics, government
- NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University
  tags: image, 3d
- Reddit’s Secret API
  tags: reddit, api, json
- Idealware: Mapping Blues: Where is the Data?
  tags: resources, links
- Opinion Extraction, Opinion Mining, Sentiment Analysis, Summarization of Customer Reviews
  tags: sentiment, mining, classification, machinelearning, reviews, recommender, textmining, links
- Amazon Web Services Public Datasets » Data Wrangling Blog
  tags: amazon, ebs, ec2, s3, publicdata, hadoop
- Amazon Web Services (AWS) Hosted Public Data Sets
  tags: amazon, ebs, publicdata
- Executive PayWatch Database
  tags: ceo, compensation, pay, economics, business, labor
- http://www.yr-bcn.es/semanticWikipedia
  tags: wikipedia, named_entity, tagged, textming
- Research Datasets :: CID Data :: Center for International Development at Harvard University (CID)
  tags: economics, international, development
- NACDA: Search Holdings
  tags: aging, statistics, studies
- LIFE photo archive hosted by Google
  tags: images, photo, pictures, search
- phishingcorpus [JoseWiki]
  tags: phising, corpus, text, email, textmining, nlp, mail, security
- Wikipedia Datasets for the Hadoop Hack | Cloudera
  tags: wikipedia, hadoop, textmining, links
- WSCD09: Workshop on Web Search Click Data 2009
  tags: workshop, search, web, microsoft, log
- Main Task QA Data
  tags: question, answering, trec, nlp, machinelearning
- ADL Gazetteer Development
  tags: named_entity, location, placenames, geo, nlp
- The New York Times Annotated Corpus « YooName - named entity recognition
  tags: named_entity, nytimes, corpus, people, organizations, locations
- downloading - flossmole - Google Code - How to get FLOSSmole data for your own use
  tags: opensource, project, activity, mysql, dump
- Google Flu Trends | How does this work?
  tags: google, health, trends, search, prediction, epidemiology, biodefence, queries, queryminer
- Multi-Domain Sentiment Dataset
  tags: sentiment, review, product, amazon
- Chris Pound's Name Generation Page
  tags: bizzare, scifi, phrase, name, word, generators, random, perl
- TradingSolutions - Data Sources
  tags: trading, finance, s, api, list
- Announcing the New York Times Campaign Finance API - Open - Code - New York Times Blog
  tags: nyt, api, campaign, donations, fec
- Beautiful Data - WikiContent
  tags: book, data, wiki, via:jhammerb
- public domain sounds | free sound library
  tags: sound, publicdomain, audio
- Netflix API - Welcome to the Netflix Developer Network
  tags: netflix, api, movie, mashup, netflixprize, ratings
- Data Catalog
  tags: dc, government, feeds, transparency, opendata
- Open beats Closed: Best Buy’s new APIs - O'Reilly Radar
  tags: retail, bestbuy, api
- Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! « The Edge of the American West
  tags: voter, registration, politics, 2008
- Tickermine
  tags: custom, research, retail, finance, market, service, analyst
- Linked Movie Data Base
  tags: rdf, movies, movie, api
- Big Huge Thesaurus API: Access 145,000 Words and Phrases
  tags: webservice, api, thesaurus, textmining, nlp, rest
- import/parse/fec.py at master from aaronsw's watchdog — GitHub
  tags: fec, python, parser, government, campaign
- The Watchdog Project: volunteer
  tags: government, transparency, parsing, election, python
- Dataset of the day: Where are the Obamacans? | Off the Map - Official Blog of FortiusOne
  tags: obama, goverment, mashup, gis, geo, map, campaign, donations
- Activity Recognition: Datasets, Bibliography and others
  tags: activity, recognition, intent
- Normalized Campaign Contribution Data
  tags: cmu, politics, campaign, donations, fec, via:jhammerb, government
- YouTube Dataset
  tags: youtube, research, crawl, socialnetwork, network, graph, web
- CRAWDAD
  tags: wireless, RF, radio, signal, dartmouth, network
- API Documentation - Twitter Development Talk | Google Groups
  tags: twitter, text, api
- Web FAQ collection | ILPS
  tags: faq, question_answering, questions, web, crawl, corpus, xml, textmining
- Yahoo! Music API - YDN
  tags: api, yahoo, music, artists
- Search Query Performance report - Google AdWords Help Center
  tags: adwords, ppc, search, metrics, webanalytics, sem, query, queryminer
- Wordze Keyword Research Tool
  tags: queryminer, keyword, tool, research, commercial, search, adwords
- Frontal Face Databases
  tags: facerecognition, face, image, recognition
- Searchable Catalogs of Data
  tags: links, catalogs, social
- Download Database - baseball1.com
  tags: baseball, database, publicdata, statistics, sports
- radiohead - Google Code
  tags: lidar, visualization, radiohead, google, video
- 80 Million Tiny Images
  tags: images, words, english, search, visualization, imagemap
- Time Series Center | Harvard University
  tags: timeseries, anomaly, detection, astronomical, physics
- OpenVisuals - Open Source Visualization Framework
  tags: visualization, community, design, processing
- BGN: Domestic Names - State and Topical Gazetteer Download Files
  tags: gis, usgs
- NGA: Country Files
  tags: country, cities, geo
- Datasets
  tags: benchmark, clustering, regression, machinelearning, list, statistics, mathematics
- Isomap Datasets
  tags: nonlinear, dimensionality, reduction, faces, digits, images, manifold
- Yahoo! Search Blog: BOSS -- The Next Step in our Open Search Ecosystem
  tags: api, open, search, yahoo, BOSS, queryminer
- Download the Database - IP Address Lookup - Community Geotarget IP Project
  tags: geocoding, geoip, internet, ip, ipaddress, mysql
- Airline Data Project
  tags: airline, statistics, finance, revenue, location, travel
- reddit.com: Ask Reddit: Where to download a DB dump of Reddit?
  tags: reddit, socialnetwork, news, web
- Show Us a Better Way: What public data is already available?
  tags: statistics, census, uk, school, news, publicdata
- Collaborative filtering dataset - dating agency
  tags: collaborative, filtering, dating, rating, profiles, czech
- About Us - Predictify
  tags: predictionmarket, tool, finance, buzz, advertising, marketing, startup, mmds, david_kellogg
- VGChartz.com | Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP
  tags: sales, ranking, videogames, retail
- Store Level Information
  tags: retail, finance, sales, store
- Code for querying and downloading Flickr images
  tags: image, python, code, flickr, matlab, recognition
- Image Parsing Datasets
  tags: image, recognition
- TAGora » Data
  tags: tag, tagging, s
- TAGora » Data
  tags: netflixprize, imdb, sparql
- OHPI - Traffic Volume Trends
  tags: government, traffic, statistics, trends, transportation
- PigTutorial - Pig Wiki
  tags: search, log, query, web, excite, queries, hadoop, pig, tutorial, mapreduce, parallel, queryminer
- Quality of Life Grand Challlenge Dataset: Kitchen Capture
  tags: machinelearning, motion, capture, sensor
- Summize Twitter Search API
  tags: api, buzz, opinion, trends, text, twitter, summize, search
- 2008 IEEE InfoVis Contest Dataset
  tags: visualization, contest, scalability, motion, tracking, pedestrian, sensor
- IMDb Pro : Scary Movie 4: Box office
  tags: movie, revenue, sales, box_office, imdb, commercial, movie_study
- Spider-Man 2 (2004) - Daily Box Office Results
  tags: movie, revenue, box_office
- Live Search : xRank™ Celebrity — check out who’s hot and who’s not!
  tags: search, query, volume, trends, celebrity, prediction, buzz, named_entity
- IMDbPro.com Free Trial Signup
  tags: movie, revenue, timeseries, imdb, commercial, subsription
- Free time-series and micro-data to download
  tags: economics, links
- PyGTrends: Python API for Google Trends Data
  tags: google, trends, search, web, analytics, api, code, python, hack, keyword, query, forecasting, indicator, finance
- Official Google Blog: A new flavor of Google Trends
  tags: google, trends, search, query, api, csv, keyword, timeseries
- Open Research - the Data: Lastfm-ArtistTags2007 - Duke Listens!
  tags: last.fm, music, tagging, artists, tags, collaborative, filtering
- i2b2: Informatics for Integrating Biology & the Bedside
  tags: medical, obesity
- Tiger Data Set Lecture
  tags: tiger, gis, lectures
- Google To Launch Large Scale Geo-Services
  tags: geo, google, gps, location, geolocation, cell, wifi, api, gis
- Last.fm’s Playground
  tags: celebrity, misspelling, spelling, names
- ImportGenius.com : U.S. Customs Database and Competitive Intelligence Tools
  tags: commercial, shipping, imports, exports, finance, datamining
- Directory Listing of Betfair price files
  tags: betting, prediction, betfair, price, csv, predictionmarket
- Reuters Spotlight - Article and Media API
  tags: news, text, articles, api, content, media, xml, images, publicdata
- DataSets - Scikits - Trac
  tags: scipy, python, machinelearning, statistics, resource
- [Wikitech-l] page counters
  tags: wikipedia, pageviews, trends, textmining, seo, topic
- Wikipedia article traffic statistics
  tags: via:chl, wikipedia, web, analytics, seo, topic, textmining, traffic
- Yahoo! Internet Location Platform - YDN
  tags: yahoo, geo, geocoding, location, landmarks, gis
- How to find images on the internet « Random knowledge
  tags: images, links, lists, archive
- Yahoo offers geographic data to Web sites | Tech news blog - CNET News.com
  tags: gis, webservice, yahoo, api, location, landmark
- Instructions for Obtaining Search Engine Transaction Logs
  tags: query, search, log, excite, altavista, alltheweb, transaction
- TechTC - Technion Repository of Text Categorization Datasets
  tags: datamining, textmining, categorization, classification, odp, directory, text
- The TechTC-100 Test Collection for Text Categorization
  tags: textmining, classification, category, odp, directory
- FEC Election Contributions: Download Detailed Files by Election Cycle
  tags: individual, donations, government, election, publicdata, fec
- Juiced Google Analytics Python API: Juice Analytics
  tags: search, statistics, keywords, analytics, api, python, web, seo, google, google_analytics, juice
- Country Name and ISO 3166 Code MySQL Import File
  tags: mysql, states, countries, isocode
- Semantic Search the US Library of Congress
  tags: via:inkdroid, libraries, mashup, rdf, semantic, search, semanticweb, books, api, webservice
- geocoded Hotels « GeoNames Blog
  tags: hotels, geonames
- GeoNames webservice and data download
  tags: locations, cities, countries, gis
- Index of /download/worldcities
  tags: cities, gis
- ualberta dependency based thesaurus and word count data
  tags: corpus, text, similarity, terms
- CommonCrawl - About
  tags: web, crawler, bot
- Data sets and corpus / corpora for biological literature and text mining , information extraction and information retrival and document classification
  tags: bioinformatics, text, corpora, domainspecific, genomics, corpus
- Office of Defects Investigation (ODI), Flat File Downloads
  tags: defect, recall, automobile, fightclub, nhtsa, saefty
- p2psim - kingdata : DNS server latency network distance matrices
  tags: distance, matrix, network, p2p, dns, latency, nmf, queryminer
- Sep Kamvar / Personalization /
  tags: pagerank, web, matrix, matlab
- beta.opentick.com
  tags: opentick, trading, beta, feeds, finance
- WikiXMLDB: Querying Wikipedia with XQuery
  tags: wikipedia, xml, ec2
- kiwitobes.com » Blog Archive » Walmart Growth Video
  tags: walmart, visualization, video, freebase, store, retail, locations, opening
- Open Cell Id dataset - phone geolocation from GSM cellids
  tags: gis, mobile, geolocation
- The Cornell Web Lab - The Cornell Web Lab
  tags: cornell, web, archive, hadoop, crawl
- im2gps: estimating geographic information from a single image
  tags: imagerecognition, via:csantos, gis, cmu, gps, imageprocessing, paper, hack, freaking_awesome
- Datasets: MUSCLE WP2 Evaluation, Integration and Standards
  tags: image, video, audio, currency, sports, imagerecognition
- Open Economics - Store - Index
  tags: economics, list
- welcome @ omdb
  tags: free, movie, database, netflixprize
- Cogblog » Blog Archive » Cogmap APIs
  tags: api, cogmap, person, name, organization, record_linkage
- Wal-Mart : Freebase - The World's Database
  tags: retail, locations, stores
- Cogmap: The Org Chart Wiki
  tags: record_linkage, identity, name, organization, orgchart, marketing
- German English Parallel Corpus "de-news", Daily News 1996-2000
  tags: german, translation, corpus, english, text, via:maxme
- Welcome to the CRCNS data sharing activity website — CRCNS
  tags: neuroscience, patch, clamp, recordings, neuron, timeseries, patchclamp, data, neural, cortex, visual
- Infochimps.org: Free Redistributable Rich Data Sets
  tags: aggregator, links
- Frequent Itemset Mining Dataset Repository
  tags: retail, clickstream, traffic, web, links, sales
- Dolores Labs Blog » Blog Archive » Our color names data set is online
  tags: colormap, color, mechanicalturk
- TeradataUniversityNetwork.com -> Registration
  tags: teradata, retail, transactional, database
- Pascal Learning Challenge Large Datasets
  tags: large, competition, challenge, svm, machinelearning, scalability
- ECIS 2007 - The 15th European Conference on Information Systems
  tags: retail, dillards, sams_club
- Alexa Web Search
  tags: alexa, aws, web, search, api
- developerWorks Interviews: Massive data mining and the resurgent mainframe
  tags: price, retail, transaction, sams_club, dillards
- University of Arkansas - Daily Headlines
  tags: retail, dillards, uark
- Crime data bonanza!!!
  tags: timeseries, crime, statistics, publicdata
- State and Federal Case Law
  tags: creativecommons, court, legal, law, via:inkdroid
- Wikipedia:Lists of common misspellings/For machines - Wikipedia, the free encyclopedia
  tags: spelling, mispelling, wikipedia
- Copyright Free and Public Domain Media
  tags: images, audio, publicdata, maps, video, free
- Access to Web Research Collections VLC2/WT10g/WT2g
  tags: blog, web, text
- Databases you can use for benchmarking
  tags: image, vision, recognition
- Lyricsfly Lyrics API, database access to search for music artist and song title, protocol REST with XML document
  tags: song, lyrics, database, api
- 2007 IEEE AVSS Detection and Tracking Algorithm Datasets
  tags: tracking, video, detection, image, recognition, vehicle, pedestrian
- Eigenvector Research, Inc. : Data Sets Available to Download
  tags: NIR, spectra, chemistry, semiconductor, pharmaceutical, matlab
- OTCBVS
  tags: image, recognition, detection, pedestrian, thermal, tracking, facerecognition, illumination
- 99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information
  tags: links, directory, record_linkage, extraction, wikipeida, named_entity, recognition, textmining, semanticweb, paper
- UNdata
  tags: UN, publicdata, government, statistics
- AudioScrobbler Data
  tags: audioscrobbler, recommendation, collaborative, filtering, music
- The Linking Open Data dataset cloud
  tags: directory, rdf, semantic, data, soup, graph
- Free Economic Data | Economic, Financial, and Demographic Data
  tags: finance, economics, portal, links
- ::MLSP 2008::: MLSP competition
  tags: machinelearning, trading, competition, backtest, matlab, code, finance, via:DeliciousRob
- Computer Vision Test Images
  tags: computer, vision, image, ray, trace, fingerprint, stereo, detection, via:chl
- The Dataverse Network Project | The Dataverse Network Project
  tags: statistics, repository, harvard
- DVN - Home
  tags: harvard, repository, social, science, research, portal, links
- Ohio voter registration data
  tags: voter, voting, politics, government, name, address, registration
- Voter List Data Files - Election Department, Clark County, Nevada
  tags: voting, voter, registration, name, address, data, election, politics, government, nevada
- Temperature data (HadCRUT3 and CRUTEM3)
  tags: climate, temperature, netcdf
- MNIST handwritten digit database, Yann LeCun and Corinna Cortes
  tags: handwriting, mnist, image, recognition
- LFW : Labelled Faces in the Wild
  tags: facerecognition, face, recognition, umass, image
- Making random contacts - (37signals)
  tags: generator, names
- Test (Sample) Data Generators
  tags: generator, tools, list, via:jd
- Compete - Compete Developer Resources
  tags: compete, api, web, statistics, traffic, analytics, mashup
- Machine Learning (Theory) » The Peekaboom Dataset
  tags: peekaboom, vision, image, large, human, computation, machinelearning, recognition
- Ocean Processes and Modeling: Ocean Data
  tags: links, oceanography, satellite
- BlogoCenter data sets
  tags: blog, ucla
- Tagged datasets for named entity recognition tasks
  tags: nlp, corpus, tagged, named_entity, recognition, list
- del.icio.us stats - deli.ckoma
  tags: del.icio.us
- The Financial Data Finder A - G
  tags: finance, links
- Freebase Wikipedia Extraction (WEX)
  tags: wikipedia, xml, structured, corpus
- The arXiv.org API
  tags: arxiv, api, open, paper, academic
- England Football Results Betting Odds | Premiership Results & Betting Odds
  tags: gambling, soccer, football, excel, statistics
- HughesData - Main - Hughes Lab
  tags: rna, bioinformatics, microarray, expression, gene, machinelearning
- Stanford MicroArray Database
  tags: bioinformatics, microarray, expression, gene, machinelearning, stanford
- ArrayExpress Home
  tags: bioinformatics, microarray, expression, gene, machinelearning
- Gene Expression Omnibus (GEO) Main page
  tags: bioinformatics, microarray, expression, gene, machinelearning
- Index of /courts.gov
  tags: corpus, text, legal, law, court, ruling, opensource, publicdata
- Welcome to Openvest
  tags: python, finance, edgar, pylons, matplotlib, sec, webservice, via:jolby
- Statistical Science Web: Data Sets
  tags: links, statistics
- Data Mining: Text Mining, Visualization and Social Media: TailRank, Spinn3r, TechMeme and TechCrunch: New Attention
  tags: crawler, blog, corpus
- Aleix Face Database
  tags: facerecognition, machinelearning, face, image
- Data Repository Evaluation
  tags: umd, links, statistics, government, sports, via:rickladd
- PMC FTP Service
  tags: biology, medicine, articles, text, journal, authors
- "uspop2002" data set
  tags: music, similarity, machinelearning
- Internet Archive: Details: Amazon ASIN listing and similarity graph
  tags: ASIN, amazon, recommendation, collaborative, filtering, via:keyvowel
- European Climate Assessment Daily Weather Data
  tags: weather, europe, ascii, netcdf
- Poverty Data Sets General Information
  tags: poverty, statistics
- StatLib---Datasets Archive
  tags: machinelearning, datamining, cmu, link, collection
- National Household Travel Survey (NHTS) Data
  tags: driving, transportation, publicdata
- RealClearPolitics - Election 2008 - Democratic Presidential Nomination
  tags: polls, politics
- Nielsen BookScan USA
  tags: books, sales, commercial
- Pew Internet & American Life Project
  tags: internet, demographics, online, web
- Home - Numbrary
  tags: finance, data
- About - Numbrary
  tags: searchengine, search, tagging, aggregator, numeric, extraction, tables, collaboration, web2.0, interface, billpoint
- Main Page - OpenTextMining
  tags: textmining, open, nature, standards, search
- Metafilter Infodump
  tags: metafilter, comments, network, via:chl
- WEBSPAM-UK2007 | Datasets | Web Spam Detection
  tags: web, search, spam, crawler, yahoo
- Google to Host Terabytes of Open-Source Science Data | Wired Science from Wired.com
  tags: google, article, openaccess
- Zillow - Labs - Neighborhood Boundaries
  tags: neighborhoods, geo, gis, maps
- Trust network datasets - TrustLet
  tags: socialnetwork, trustnetwork, trust
- Crime in the United States 2006
  tags: crime, fbi
- TaskForces/CommunityProjects/LinkingOpenData/DataSets - ESW Wiki
  tags: opendata, semantic, rdf, collaboration
Datasets listed in the original post on Jan 17, 2008:
- Some Datasets Available on the Web » Data Wrangling Blog
tags: publicdata, links - XML.com: GovTrack.us, Public Data, and the Semantic Web
tags: semanticweb, rdf, congress, politics, government - CiteULike: Available datasets
tags: networks, research, graph, tags, paper, record_linkage - Archive-It.org
tags: archive, internet, web, index, - Challenge: Synopsis - Causality Workbench
tags: competition, machinelearning, forecasting, contest - Natural Language Processing
tags: microsoft, text, paraphrase, corpus - LDC - Linguistic Data Consortium - Obtaining Data Resorces
tags: nlp, text, corpus, ngram, google, commercial, license - 1990 Census Name Files
tags: census, names, identity, frequency, record_linkage - Given Name Frequency Project: Analysis of Given Name Popularity
tags: name, record_linkage, text, identity, code - Email Datasets
tags: enron, names, identity, text, record_linkage - ZoomInfo - Welcome to the ZoomInfo Developer API
tags: api, identity, people, webservice, record_linkage - Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Entity Resolution / Named Entity Disambiguation
tags: record_linkage, corpus, nlp, names - Developers Area - eBay Market Data Documentation - eBay Market Data Documentation
tags: ebay, api, retail, price, code - New SwetoDblp RDF dataset released with 11M triples
tags: name, authorship, rdf, record_linkage - LSDIS : SwetoDblp
tags: bibliography, rdf, ontology, duplicate, name, record_linkage - StrikeIron Super Data Pack Web Service 1.0 - StrikeIron Marketplace
tags: webservice, publicdata, datacleaning - Vaccines: IIS/Tech/Deduplication Test Cases
tags: duplicate - Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
tags: duplicate, detection, record_linkage, datacleaning, text - INFO 747 - Social and Economic Data
tags: datacleaning, record_linkage, video, lectures, course, cornell, economics, finance, publicdata - Overstock.com Affiliate Program
tags: retail, overstock, sales, api, product, price, forecasting - Amazon Web Services Developer Connection : Can Alexa WS provide detailed ...
tags: finance, alexa, amazon, tech - Market Data — eBay Developers Program
tags: ebay, retail, pricing, sales, api, product - Health Data Tools and Statistics
tags: health, information, public, publicdata - It’s a Pitch-by-Pitch Scouting Report, Minus the Scout - New York Times
tags: baseball, gameday - opentick :: market data
tags: opentick, nasdaq, finance, stock - Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending
tags: corruption, government, politics, finance, - Welcome to USAspending.gov
tags: government, money, politics, - Campaign Finance Reports and Data
tags: campaign, politics, elections - Machine Learning and Data Mining - Datasets
tags: face, image - GIS for Schools
tags: epidemiology, gis, health - Cardiac MRI dataset - York University
tags: mri, cardiac - Google Trends API coming soon | Tech news blog - CNET News.com
tags: google, trends, api, - MIT Media Lab: Reality Mining
tags: social, activity, location, cell, gis - RL Competition 2008 - Home
tags: machinelearning, reinforcement, agent, competition, - Vehicle Routing Data Sets
tags: optimization, vehicle, routing - EIA - Petroleum Data, Reports, Analysis, Surveys
tags: oil, energy, statistics, economics, petroleum - DMOZ100k06 - Michael G. Noll
tags: search, pagerank, text, tags, content - Grading
tags: machinelearning, CMU, course, projects, graphicalmodel, code, paper - Carnegie Mellon University - CMU Graphics Lab - motion capture library
tags: gait, pedestrian, walk, motion - Financial Forecast Center's Historical Economic and Market Data
tags: exchangerate, dollar, economics, - Bureau of Labor Statistics Data
tags: economics, lumber, building, materials, homedepot - Browse Business Cycle Indicators Data
tags: economics, indicators, time, series - The Numbers Guy : Aspiring to Be the Wikipedia of Numbers
tags: finance, numberpedia, mechanicalturk, textmining, statistics - Social characteristics of the Marvel Universe
tags: socialnetwork, graphs, comicbooks - SourceForge.net: Word Lists Collection
tags: dictionary, words - ERS/USDA Data - International Macroeconomic Data Set
tags: usda, economics, population, cpi, gdp, income - State Agency Databases - GODORT
tags: government, directory, links, wiki, states - The 2000 U.S. Census: 1 Billion RDF Triples
tags: gis, census, rdf, semantic, sparql - See Who's Editing Wikipedia - Diebold, the CIA, a Campaign
tags: wikipedia, authorship, - Dataset Generator - Perfect data for an imperfect world.
tags: tools, generator - National Bureasu of Economic Research: Data
tags: economics, links - Entree Chicago Recommendation Data
tags: recommender, collaborative, restaurant - community resource guide: i've been here before - show me the links
tags: demographics, maps, gis, statistics, links - Social Science Data on the Net
tags: economics, social, government, health, labor, links - NBI ASCII Files - Bridge - FHWA
tags: government, bridges, safety - List of films: A - Wikipedia, the free encyclopedia
tags: netflix, netflixprize, movie, index, wikipedia, - The arXiv on your harddrive
tags: paper, corpus, arXiv - Insanely Useful Websites | Sunlight Foundation
tags: links, transparency, government, politics, congress, reference - Technophilia: Where to find public records online - Lifehacker
tags: public, records, links - Junk email project
tags: corpus, email, spam, textmining - Enron Email Dataset
tags: enron, corpus, email, text, social, network - ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt
tags: finance, cpi, inflation, data - GOS - Geospatial One Stop
tags: health, gis, epidemiology, links - CIA Factbook Grep in Python
tags: cia, population, python, code, grep - Miller Center of Public Affairs - Richard Nixon - Oval Office Recordings
tags: nixon, speech, tapes, audio, mp3, wav, flac - Deborah Jeane Palfrey Legal Defense Fund
tags: phone, politics - UC San Diego Data Mining Competition - 2007 - Datasets
tags: housing, refinance, mortgage, - package - MoinMaster
tags: - Retail Industry Financial Ratios & Benchmarks
tags: retail, finance, sales, sqft, - Retail Industry Financial Ratios & Benchmarks
tags: retail, finance, sales, sqft - stores | POI Factory
tags: retail, location, poi - GpsPasSion Forums - ** INDEX OF POI COLLECTIONS **
tags: retail, poi, location, gis, gps - GPS POI US : Home > Retail Stores
tags: retail, location, gis - Collective Dynamics Group
tags: smallworld, networking, socialnetwork, graph - Jester Data download page
tags: collaborative, filtering, jokes - TricTrac: Video Dataset
tags: video, - Premium Business Information Databases - AlacraWiki
tags: links, finance, commercial - Index of /edgar
tags: finance, xml, edgar, sec, code, perl - Mail Index
tags: EDGAR, sec, mail, text - metafy / AnthraciteIdioms
tags: finance, SEC, scrape, parse, commercial - Advance Monthly Sales for Retail and Food Services - Time Series Data/Seasonal Factors - 1992 to Present
tags: retail, sales, census - TDT
tags: categorization, textmining, detection, tools - Volume of retail sales: Social Trends 33
tags: retail, sales, uk - generatedata.com
tags: tools, generator, random - U.S. Company Filings and Annual Reports
tags: finance, links, sec - FTP Information - EDGAR Database
tags: edgar, finance, sec, filing, ftp, instructions - Data Mining For Investing
tags: investing, finance, datamining, announcement, sec, filing, links - Melissa DATA - Lookups
tags: consumer, data, database, api - FactSet: Data Maven - Kiplinger.com
tags: factset, finance - IBES (Demo)
tags: finance, ibes, analyst, forecast, wharton - Thomson Financial I/B/E/S Data
tags: finance - Historical Quotes - Yahoo! Finance
tags: yahoo, finance, stock, price - Network data
tags: network, links - Bureau of Labor Statistics Home Page
tags: statistics, labor, government, consumer - NAR: Research: EHS Data
tags: housing, sales, finance - RFA - The Industry - Industry Statistics
tags: ethanol - Chain Store Guide - Retail Locations
tags: retail, finance, store, locations, gis - Press Releases - Directions Magazine
tags: retail, gis, store, locations - Energy Information Administration - EIA - Official Energy Statistics from the U.S. Government
tags: finance, government, energy, historical, forecasts, fuel, oil - Databases you can use for benchmarking
tags: links - UPC Database: Downloads
tags: product, upc, database - Web Crawling / Crawl Datasets at Tobias Escher at the OII
tags: crawler, benchmark, search, web, links - TechTC - Technion Repository of Text Categorization Datasets
tags: corpus, text - TMC data archive download site
tags: traffic, data - http://www.volvis.org/
tags: volumerendering - Computational Vision: Archive
tags: vision, caltech, imagerecognition - DC Pedestrian Classification Benchmark
tags: pedestrian, image, classification, detection - opentick :: home
tags: finance, economics, feed, free, stock, trading, opentick, opensource - Web as Corpus
tags: textmining, corpus, concordance, wordlist, n-gram - .:[ packet storm ]:. - http://packetstormsecurity.org/
tags: dictionary, hack, security, wordlist, password - Enron Dataset
tags: data, mysql, email, energy, text, socialnetwork - Splog Blog Dataset
tags: blog, corpus, spam - Home Page for 20 Newsgroups Data Set
tags: corpus, text, newsgroup - White Glove Tracking
tags: crowdsourcing, image, processing, algorithm, collaborative, distributed, web2.0, code, opensource - NOAA Paleoclimatology Program - Coral and Sclerosponge Data
tags: paleoclimatology, climate, oceanography, coral, sponge, biology - NAICS -- North American Industry Classification System
tags: finance, economics, naics, industry, classifications - Saving Democracy With Web 2.0
tags: democracy, web2.0, mashup, government, funding, article - Congresspedia - Congresspedia
tags: collaborative, wiki, government, congress, politics, elections, web2.0, directory - Population Estimates Data Sets
tags: census, data, population, statistics - CRAN Task View: Machine Learning & Statistical Learning
tags: statisticallearning, machinelearning, code, R, libraries, cran - Data for Data Mining
tags: links, datamining, timeseries, text, extraction, socialnetwork - PAIDA - Pure Python scientific analysis package
tags: python, visualization, library - SUBDUE - Graph Based Knowledge Discovery
tags: machinelearning, network, graph - AOL search data mirrors
tags: aol, search - Python Cheese Shop : shakespeare 0.4
tags: python, text - AG's corpus of news articles
tags: corpus, nlp, machinelearning, textmining - Sampling Techniques for Massive Data - Google Video
tags: video, machinelearning, statistics, matrix, sampling, large, sparse, algorithm, experiment_design, towatch - metachronistic » Mirror the Wikipedia
tags: wikipedia, laptop, install, dump - LETOR: Benchmark Datasets for Learning to Rank
tags: ranking, search - CN710: Comparative Analysis of Learning Systems (Spring 2006) - Class Project
tags: machinelearning, algorithm, ogi, bu, greyhound, finance - UrbanSim Home
tags: python, urban, software, simulation, opensource, GIS, census - System One - Wikipedia³
tags: wikipedia, rdf - System One - Labs
tags: wikipedia, rdf, tools - Face Recognition Homepage - Databases
tags: face, algorithm, facerecognition, data, image - CBCL SOFTWARE Face data set
tags: face, seung, algorithm, recognition, image - Text Analytics Solutions from ClearForest
tags: extraction, finance, semantic, semanticweb, text - 23C3 - Mining Search Queries - Google Video
tags: aol, search, video, talk, algorithm, informationretrieval, datamining, machinelearning - Digital History Hacks: Keywords and Clues
tags: aol, search, query, analysis - Digital History Hacks: Searching for History
tags: aol, search, query, analysis - The Tom Kyte Blog: An interesting data set...
tags: aol, search, oracle, database, code - KDD 2005 - KDD Cup 2005: Aug 21-24, Chicago, IL. USA
tags: query, categorization, algorithm, google - Statistical NLP / corpus-based computational linguistics resources
tags: corpus, machinelearning, text - Ph.d.-student Rasmus Elsborg Madsen
tags: text, machinelearning, context, matlab - Intelligent Web Search and Mining: Tools & Resources
tags: machinelearning, code, links - PageRank Datasets and Code
tags: pagerank, code, algorithm - Official Google Research Blog: All Our N-gram are Belong to You
tags: linguistics, google, ngram, nlp, record_linkage - Hyper-threaded Java - Java World
tags: clustering, algorithm, java, parallel - Statistical Modeling, Causal Inference, and Social Science
tags: blog, econometrics, finance, machinelearning, math, statistics - Structural Analysis of Discrete Data and Econometric Applications, by Charles F. Manski and Daniel L. McFadden, MIT Press, 1981.
tags: books, econometrics, economics, finance, ebook - Kris Brower » Archives » Google Onpage Search Results Analysis
tags: google, ranking, aol, search, analytics - CSE 250B Fall 2006
tags: netflixprize, machinelearning, course - Matrix Market
tags: matrixmarket, matrix - Analysis of incomplete datasets: Estimation of mean values and covariance matrices and imputation of missing values
tags: imputation, matlab, missing, EM, machinelearning - Face Detection
tags: face, image - CSE 250B Project 4, Fall 2006
tags: subset, netflixprize, dimensionality, reduction - G3DATA
tags: extract, from, graphs, hack, google, trends - cwm - a general purpose data processor for the semantic web
tags: python, processor, semantic, web, rdf - WebBase Project
tags: link, analysis, structure, web, crawler, stanford - sam roweis : data
tags: machine, learning, matlab, python, hackers, image - Index of /data/sequence/mnist
tags: mnist, xml, format - MNIST handwritten digit database
tags: mnist - Book-Crossing Dataset
tags: data, set, collaborative, filtering, datamining, books, movie - allmovie
tags: movie, netflixprize, source - Submissions Guidelines for the Collectorz.com Online Movie Database
tags: movie, source - cinema.com
tags: plot, synopsis, movie, netflixprize, prize - LUMIERE
tags: netflixprize, prize, european, movie, revenue - Data dumps - Meta
tags: mediawiki, wikipedia, import, mysql, sql - "phone ***" " address *" "e-mail" intitle:"curriculum vitae" - Google Search
tags: resume, google