The Internet as a Single Database

The Internet as a Single Database: Technologies Used & Lessons Learned
Houston Code Camp, August 2011
Shion Deysarkar, CEO, Datafiniti

Upload: shion-deysarkar

Post on 25-May-2015

DESCRIPTION

A sneak peek at how we are building Datafiniti

TRANSCRIPT

Page 1: The Internet as a Single Database

The Internet as a Single Database
Technologies Used & Lessons Learned

Houston Code Camp, August 2011
Shion Deysarkar
CEO, Datafiniti

Page 2: The Internet as a Single Database

What does that mean?

Places, people, news, URLs, products, etc., etc.
All web data in one, unified format

Accessible as if you were querying a database

Page 3: The Internet as a Single Database

Why build such a thing?

Web crawling is kludgy and unintuitive
Our users needed a better way of getting web data

Developers deserve something better than current APIs

Page 4: The Internet as a Single Database

Why build such a thing?

Because it would be awesome!

Page 5: The Internet as a Single Database

Not an easy task…

The Challenges

Page 6: The Internet as a Single Database

The Challenges

There’s a lot of data on the web

100 million registered domains
Maybe only 100,000 have interesting stuff? (Which ones?)
Some sites have millions or billions of data points

Page 7: The Internet as a Single Database

The Challenges

It’s all structured differently!

Do we have to write a web crawler for each website?
Writing 100,000 web crawlers seems... not fun

Page 8: The Internet as a Single Database

The Challenges

Data can conflict

How do we know which data is correct?

Website | Name | Categories | Address | Zip Code | Neighborhood | Phone
Yelp | Max's Wine Dive | Wine Bars; American (New); Music Venues | 4720 Washington Ave | 77007 | Washington Corridor; Rice Military; The Heights | (713) 880-8737
Citysearch | Max's Wine Dive | Restaurants; Wine Bars | 4720 Washington Ave #B | 77007 | Washington Ave.; Memorial Park; Central | (713) 880-8737
Urbanspoon | Max's Wine Dive | American; International | 4720 Washington Ave | 77007 | Rice Military | (713) 880-8737
Google | Max's Wine Dive | Wine Bar; American Restaurant | 4720 Washington Ave | 77007-5436 | (none) | (713) 880-8737
Zagat | Max's Wine Dive | Eclectic; Int'l | 4720 Washington Ave. | 77007 | Heights | 713-880-8737

Page 9: The Internet as a Single Database

So let’s start at the beginning:

Data Collection

Page 10: The Internet as a Single Database

Data Collection
Building a scalable web crawler

Cloud or local data center? Neither.
Grid computing (think SETI@home)
1000s of home PCs that exchange time & bandwidth for $
Crawl very fast for relatively little $

Page 11: The Internet as a Single Database

Data Collection
Building a scalable web crawler

Coding 1000s of extraction apps

Abstract away everything but pattern matching and link generation

Build a framework that handles all the kludgy work:
- Link following & de-duplication
- Result formatting & storage
- Throttle rates & crawling behavior
- Any other crawling activity not specific to a website’s structure

Load lightweight, website-specific apps into above framework
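The split described on this slide can be sketched roughly as follows. This is a minimal illustration, not Datafiniti's actual framework: the names `SiteApp` and `CrawlFramework`, the regex-based extraction, and the in-memory fake site are all assumptions made so the example runs offline. The framework owns link following, de-duplication, throttling, and result storage; the site-specific app supplies only two patterns.

```python
import re
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Pattern
from urllib.parse import urljoin


@dataclass
class SiteApp:
    """Website-specific part (hypothetical): only pattern matching and link generation."""
    name: str
    record_pattern: Pattern[str]  # named groups become record fields
    link_pattern: Pattern[str]    # group 1 is an href to follow


class CrawlFramework:
    """Generic part (hypothetical): link following, de-duplication, throttling, storage."""

    def __init__(self, fetch: Callable[[str], str], max_pages: int = 100,
                 delay_seconds: float = 0.0):
        self.fetch = fetch                  # injected so the sketch needs no network
        self.max_pages = max_pages          # crude crawl budget
        self.delay_seconds = delay_seconds  # crude throttle between requests
        self.results: List[Dict[str, str]] = []

    def run(self, app: SiteApp, start_url: str) -> List[Dict[str, str]]:
        seen = {start_url}                  # de-duplication of discovered links
        queue = deque([start_url])
        pages = 0
        while queue and pages < self.max_pages:
            url = queue.popleft()
            html = self.fetch(url)
            pages += 1
            # Pattern matching: every match becomes one stored record.
            for match in app.record_pattern.finditer(html):
                self.results.append({"source": app.name, "url": url, **match.groupdict()})
            # Link generation: resolve each href and queue it once.
            for href in app.link_pattern.findall(html):
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            if self.delay_seconds > 0:
                time.sleep(self.delay_seconds)
        return self.results


# A three-page fake site standing in for real HTTP fetches.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<h1 class="name">Max\'s Wine Dive</h1> <a href="/b">B</a>',
    "http://example.test/b": '<h1 class="name">Another Place</h1> <a href="/">home</a>',
}

app = SiteApp(
    name="example",
    record_pattern=re.compile(r'<h1 class="name">(?P<name>[^<]+)</h1>'),
    link_pattern=re.compile(r'href="([^"]+)"'),
)
framework = CrawlFramework(fetch=lambda url: FAKE_SITE.get(url, ""))
records = framework.run(app, "http://example.test/")
```

With this shape, adding a new website means writing only another `SiteApp` with two patterns, which is presumably what makes deploying many crawls per month practical.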

Page 12: The Internet as a Single Database

Data Collection
Building a scalable web crawler

Coding 1000s of extraction apps
Abstract away everything but pattern matching and link generation

Page 13: The Internet as a Single Database

Data Collection
Building a scalable web crawler

Coding 1000s of extraction apps
Abstract away everything but pattern matching and link generation

Page 14: The Internet as a Single Database

Data Collection
Building a scalable web crawler

Current peak performance: 4.32 billion URLs per month
Deploying 20 new website crawls every month
Easy to scale crawling performance (just add grid nodes)
Easy to scale deployment (just add contractors)
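To put the peak figure in more familiar units, the arithmetic below converts 4.32 billion URLs per month into per-day and per-second rates. The 30-day month and the node count of 1,000 are assumptions for illustration only; the slides say "1000s of home PCs" without giving an exact number.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_rate(urls_per_month: float, days_per_month: int = 30, node_count: int = 1) -> dict:
    """Convert a monthly URL count into daily, per-second, and per-node rates."""
    per_day = urls_per_month / days_per_month
    per_second = per_day / SECONDS_PER_DAY
    return {
        "per_day": per_day,
        "per_second": per_second,
        "per_node_per_second": per_second / node_count,
    }


# 4.32 billion URLs per month is the figure from the slide;
# node_count=1000 is a hypothetical grid size.
peak = crawl_rate(4_320_000_000, node_count=1000)
print(f"{peak['per_day']:,.0f} URLs/day")
print(f"{peak['per_second']:,.0f} URLs/second")
print(f"{peak['per_node_per_second']:.2f} URLs/second per node")
```

Under these assumptions the peak works out to 144 million URLs per day, or roughly 1,667 per second across the grid, which is a modest load per node if the grid really has a thousand or more machines.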

Page 15: The Internet as a Single Database

Now for step 2! (step 1 took us 3 years >_<)

Data Storage

Page 16: The Internet as a Single Database

Data Storage
Building a scalable data store

What we’re dealing with:
TBs (eventually PBs) of data
Billions of rows, thousands of columns (maybe more)
Don’t want to deal with sharding
Don’t actually care about ACID
Do care about high throughput and fault tolerance

Page 17: The Internet as a Single Database

Data Storage
Building a scalable data store

NoSQL (Cassandra) >> MySQL (for us)
Can increase throughput and storage linearly by adding nodes
Virtually unlimited and variable # of columns
Much faster read/write

Some challenges
- Doesn’t yet support all the select features you’re used to
- Not a mature technology yet, expect frequent updates
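The "variable # of columns" point is easier to see with a toy model. The class below is a plain in-memory sketch, not Cassandra's API or data model in any detail: `WideRowStore` and its methods are hypothetical names, and the row keys are made up. It only illustrates that each row can carry its own set of columns, with no shared schema to migrate when a new kind of web data shows up.

```python
from collections import defaultdict
from typing import Dict, Set


class WideRowStore:
    """Toy sparse store (hypothetical): each row key maps to its own set of columns."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = defaultdict(dict)

    def put(self, row_key: str, columns: Dict[str, str]) -> None:
        # New columns are simply added; no table-wide schema change is needed.
        self._rows[row_key].update(columns)

    def get(self, row_key: str) -> Dict[str, str]:
        # Return a copy so callers cannot mutate stored rows by accident.
        return dict(self._rows.get(row_key, {}))

    def column_names(self) -> Set[str]:
        # Union of every column used by any row.
        names: Set[str] = set()
        for columns in self._rows.values():
            names.update(columns)
        return names


store = WideRowStore()
store.put("place:maxs-wine-dive", {"name": "Max's Wine Dive", "zip": "77007"})
store.put("place:maxs-wine-dive", {"phone": "(713) 880-8737"})  # later crawl adds a column
store.put("url:datafiniti.net", {"scheme": "http"})             # different kind of record
```

A place row and a URL row share no columns here, yet both live in the same store, which is the property that matters when the data covers places, people, news, URLs, and products at once.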

Page 18: The Internet as a Single Database

Data Storage
Building a scalable data store

Choosing Cassandra over other NoSQL databases

More active community, seems to be gaining traction most quickly

Impressive production-scale examples

Backed by corporations (DataStax) and some really smart people

Integrated with other relevant technologies
- Solr for text search
- Hadoop for batch-style processing

Though it’s true it has been scrapped in some high-profile cases

Page 19: The Internet as a Single Database

Data Storage
Building a unified database of everything

Normalizing separate data points that represent the same thing
Co-occurrence: most popular choice wins

(The slide repeats the Max's Wine Dive comparison table from Page 8.)
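A minimal sketch of the co-occurrence rule, using the address, zip code, and phone values from the Max's Wine Dive table. The normalization rules (digits-only phone numbers, five-digit zip codes, trailing period stripped from addresses) are assumptions added so that trivially different spellings count as the same vote; the slides do not say how Datafiniti normalizes values.

```python
import re
from collections import Counter
from typing import Dict

# Values copied from the comparison table in the slides.
OBSERVATIONS: Dict[str, Dict[str, str]] = {
    "Yelp": {"address": "4720 Washington Ave", "zip": "77007", "phone": "(713) 880-8737"},
    "Citysearch": {"address": "4720 Washington Ave #B", "zip": "77007", "phone": "(713) 880-8737"},
    "Urbanspoon": {"address": "4720 Washington Ave", "zip": "77007", "phone": "(713) 880-8737"},
    "Google": {"address": "4720 Washington Ave", "zip": "77007-5436", "phone": "(713) 880-8737"},
    "Zagat": {"address": "4720 Washington Ave.", "zip": "77007", "phone": "713-880-8737"},
}


def normalize(field: str, value: str) -> str:
    """Assumed normalization rules, so formatting differences do not split the vote."""
    if field == "phone":
        return re.sub(r"\D", "", value)   # keep digits only
    if field == "zip":
        return value[:5]                  # drop the ZIP+4 suffix
    if field == "address":
        return value.strip().rstrip(".")  # "Ave." and "Ave" count as the same
    return value.strip()


def most_popular(observations: Dict[str, Dict[str, str]], field: str) -> str:
    """Co-occurrence: the normalized value reported by the most sources wins."""
    counts = Counter(
        normalize(field, record[field])
        for record in observations.values()
        if field in record
    )
    value, _count = counts.most_common(1)[0]
    return value


unified = {field: most_popular(OBSERVATIONS, field) for field in ("address", "zip", "phone")}
```

Four of the five sources agree on the address once the trailing period is ignored, so the suite number reported only by Citysearch loses the vote.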

Page 20: The Internet as a Single Database

Data Storage
Building a unified database of everything

Normalizing separate data points that represent the same thing
Trusted sources: put more weight on sources that tend to be right

(The slide repeats the Max's Wine Dive comparison table from Page 8.)
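The trusted-sources idea can be sketched as a weighted vote: each source's claim counts in proportion to a trust weight instead of counting as one. The weights below are invented purely to show the mechanism and say nothing about which of these sites is actually more reliable; how Datafiniti derives its weights is not described in the slides.

```python
from collections import defaultdict
from typing import Dict

DEFAULT_TRUST = 1.0  # assumed weight for sources with no track record


def weighted_choice(claims: Dict[str, str], trust: Dict[str, float]) -> str:
    """Pick the value whose supporting sources carry the most total trust."""
    scores: Dict[str, float] = defaultdict(float)
    for source, value in claims.items():
        scores[value] += trust.get(source, DEFAULT_TRUST)
    return max(scores, key=lambda value: scores[value])


# Address claims from the table, already normalized (trailing period removed).
address_claims = {
    "Yelp": "4720 Washington Ave",
    "Citysearch": "4720 Washington Ave #B",
    "Urbanspoon": "4720 Washington Ave",
    "Google": "4720 Washington Ave",
    "Zagat": "4720 Washington Ave",
}

# Hypothetical weights: equal trust reproduces the plain popularity vote ...
equal_trust = {source: 1.0 for source in address_claims}
# ... while a heavily trusted source can outvote a majority.
skewed_trust = {**equal_trust, "Citysearch": 5.0}

by_popularity = weighted_choice(address_claims, equal_trust)
by_trust = weighted_choice(address_claims, skewed_trust)
```

With equal weights the result matches the co-occurrence rule; with the skewed weights the single trusted source (score 5.0) beats the four others (score 4.0), which is the behavior the slide's "put more weight on sources that tend to be right" implies.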

Page 21: The Internet as a Single Database

Data Storage
Building a unified database of everything

Identifying interesting data on a random web page

Page 22: The Internet as a Single Database

Yay, step 3! (step 2 took us 3 months :D)

Data Retrieval

Page 23: The Internet as a Single Database

Data Retrieval
Building an easy way to get lots of data fast

Making the right choices for our API

Single channel for all data retrieval
- RESTful API so anyone can develop with it
- All external and internal functionality uses the same API (easier to manage)

As user-friendly and intuitive as possible
- SQL-style querying on a NoSQL database
- JSON default output, but will also support CSV and XML
- SSL authentication with token

Briefly considered using a 3rd-party service like Mashery
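As a rough picture of what such an API call might look like from a client, the sketch below builds a request URL carrying a SQL-style query, an output format, and a token, and converts a JSON result into CSV. Everything specific here is hypothetical: the `api.example.com` endpoint, the `q`, `format`, and `token` parameter names, passing the token in the query string, and the sample query and payload. The slides describe the design goals only, not the actual interface.

```python
import csv
import io
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.example.com/v1/query"  # hypothetical endpoint, HTTPS for SSL
SUPPORTED_FORMATS = {"json", "csv", "xml"}     # formats named on the slide


def build_request_url(sql: str, token: str, fmt: str = "json") -> str:
    """Build a GET URL for a SQL-style query (parameter names are assumptions)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return BASE_URL + "?" + urlencode({"q": sql, "format": fmt, "token": token})


def json_to_csv(payload: str) -> str:
    """Convert a JSON array of flat records into CSV text."""
    rows = json.loads(payload)
    # Records may have different fields, so use the union of all keys as columns.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty cells
    return buffer.getvalue()


query = "SELECT name, zip FROM places WHERE zip = '77007'"  # hypothetical query
url = build_request_url(query, token="YOUR_TOKEN")

sample_payload = '[{"name": "Max\'s Wine Dive", "zip": "77007"}]'  # made-up response
csv_text = json_to_csv(sample_payload)
```

Keeping one endpoint and one query language for every record type is what lets external users and internal tools share the same channel, as the slide suggests.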

Page 24: The Internet as a Single Database

Put it all together… (step 3 took 3 weeks!!!)

Sneak Peek

Page 25: The Internet as a Single Database

Sign up for the beta at http://www.datafiniti.net
Follow us @Datafiniti

Launching Soon