Amazon Web Services Public Datasets

Amazon announced its Hosted Public Data Sets service today, and I expect it to be a game changer. Finding and using datasets on the web just got a lot easier. Just as developers can share Amazon Machine Images on EC2, you can now freely share large datasets in the cloud using Amazon EBS snapshots.
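To make the snapshot idea concrete, here is a minimal sketch of how a shared dataset might be pulled into a running instance: create an EBS volume from the public snapshot, wait for it to become available, then attach it. It assumes an EC2 client object that behaves like the one in the boto3 library (which is my choice for illustration, not something from the announcement). The function name, the default device path, and any IDs you pass in are hypothetical.

```python
from typing import Any


def attach_public_dataset(ec2: Any, snapshot_id: str, instance_id: str,
                          availability_zone: str, device: str = "/dev/sdf") -> str:
    """Create an EBS volume from a shared snapshot and attach it to an instance.

    `ec2` is assumed to behave like a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the ID of the newly created volume.
    """
    # The volume has to be created in the same availability zone as the instance.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=availability_zone)
    volume_id = volume["VolumeId"]

    # Block until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume; it still has to be mounted from inside the instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

With boto3 installed and credentials configured, you would pass in `boto3.client("ec2")` along with the snapshot ID listed for the dataset you want. Once attached, the volume shows up as a block device on the instance and can be mounted like any other disk.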

A few months ago, Jeff Barr stopped by Juice to talk with our team about how we are using Amazon EC2 and SQS to scale our data mining efforts. One of the issues I brought up was the potential cost and hassle of shuffling large datasets on and off AWS. Jeff discussed his concept of using Amazon as a kind of data and application ecosystem, where companies, researchers, and data providers interact on AWS and take advantage of the transfer efficiencies of staying within the Amazon infrastructure, using data and APIs locally.

This seems to be a part of that vision, and I’m looking forward to unleashing Hadoop on whatever data flows into the system.

From the AWS site:

AWS public datasets

2 Responses to “Amazon Web Services Public Datasets”

  1. November 21st, 2008 | 9:34 pm

    Peter,
    Your posts on machine learning/data mining/data sets in general are life savers really.

    Nice to see you posting after such a long break.

    • Shubhendu
  2. December 8th, 2008 | 9:01 am

    […] Amazon Web Services Public Datasets » Data Wrangling Blog (tags: datamining storage aws ec2 s3 cloud_computing genetics) […]
