The Internet as a Single Database

Posted on 25-May-2015


DESCRIPTION

A sneak peek at how we are building Datafiniti

TRANSCRIPT

The Internet as a Single Database: Technologies Used & Lessons Learned

Houston Code Camp, August 2011
Shion Deysarkar, CEO, Datafiniti

What does that mean?

Places, people, news, URLs, products, etc.
All web data in one, unified format

Accessible as if you were querying a database

Why build such a thing?

Web crawling is kludgy and unintuitive
Our users needed a better way of getting web data

Developers deserve something better than current APIs

Why build such a thing?

Because it would be awesome!

Not an easy task…

The Challenges

There’s a lot of data on the web

100 million registered domains
Maybe only 100,000 have interesting stuff? (Which ones?)
Some sites have millions or billions of data points

It’s all structured differently!

Do we have to write web crawlers for each website?
Writing 100,000 web crawlers seems... not fun

The Challenges

Data can conflict

How do we know which data is correct?

| Website | Name | Categories | Address | Zip Code | Neighborhood | Phone |
|---|---|---|---|---|---|---|
| Yelp | Max's Wine Dive | Wine Bars, American (New), Music Venues | 4720 Washington Ave | 77007 | Washington Corridor, Rice Military, The Heights | (713) 880-8737 |
| Citysearch | Max's Wine Dive | Restaurants, Wine Bars | 4720 Washington Ave #B | 77007 | Washington Ave., Memorial Park, Central | (713) 880-8737 |
| Urbanspoon | Max's Wine Dive | American, International | 4720 Washington Ave | 77007 | Rice Military | (713) 880-8737 |
| Google | Max's Wine Dive | Wine Bar, American Restaurant | 4720 Washington Ave | 77007-5436 | (none) | (713) 880-8737 |
| Zagat | Max's Wine Dive | Eclectic, Int'l | 4720 Washington Ave. | 77007 | Heights | 713-880-8737 |

The Challenges

So let’s start at the beginning:

Data Collection

Data Collection: Building a scalable web crawler

Cloud or local data center? Neither.
Grid computing (think SETI@home)
1000s of home PCs that exchange time & bandwidth for $
Crawl very fast for relatively little $

Data Collection: Building a scalable web crawler

Coding 1000s of extraction apps

Abstract away everything but pattern matching and link generation

Build a framework that handles all the kludgy work:
- Link following & de-duplication
- Result formatting & storage
- Throttle rates & crawling behavior
- Any other crawling activity not specific to a website's structure

Load lightweight, website-specific apps into the above framework
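The framework/plugin split described above can be sketched as follows. All names here (`CrawlFramework`, `extract`, `generate_links`) are illustrative, not Datafiniti's actual code: the framework owns queueing, URL de-duplication, throttling, and result storage, while a site-specific app supplies only the pattern matching and link generation.

```python
import time
from collections import deque

class CrawlFramework:
    """Generic crawler core. Site-specific apps plug in only the parts
    tied to a website's structure: extract() (pattern matching) and
    generate_links() (link generation). Illustrative sketch."""

    def __init__(self, site_app, fetch, throttle_secs=1.0):
        self.app = site_app        # site-specific plugin
        self.fetch = fetch         # fetch(url) -> html
        self.throttle = throttle_secs
        self.seen = set()          # URL de-duplication
        self.queue = deque()
        self.results = []          # result storage

    def crawl(self, seed_url, max_pages=100):
        self.queue.append(seed_url)
        while self.queue and len(self.results) < max_pages:
            url = self.queue.popleft()
            if url in self.seen:           # de-duplication
                continue
            self.seen.add(url)
            html = self.fetch(url)
            self.results.append(self.app.extract(url, html))
            for link in self.app.generate_links(url, html):
                if link not in self.seen:  # link following
                    self.queue.append(link)
            time.sleep(self.throttle)      # polite crawl rate
        return self.results
```

A per-site app then needs only a handful of lines, which is what makes deploying thousands of them tractable.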

Data Collection: Building a scalable web crawler

Current peak performance: 4.32 billion URLs per month
Deploying 20 new website crawls every month
Easy to scale crawling performance (just add grid nodes)
Easy to scale deployment (just add contractors)
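For a sense of scale, 4.32 billion URLs per month works out to roughly 1,700 URLs per second, sustained around the clock:

```python
urls_per_month = 4_320_000_000
seconds_per_month = 30 * 24 * 3600   # 2,592,000 seconds in a 30-day month
rate = urls_per_month / seconds_per_month
print(round(rate))                   # 1667 URLs per second
```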

Now for step 2! (step 1 took us 3 years >_<)

Data Storage

Data Storage: Building a scalable data store

What we're dealing with:
- TBs (eventually PBs) of data
- Billions of rows, thousands of columns (maybe more)
- Don't want to deal with sharding
- Don't actually care about ACID
- Do care about high throughput and fault tolerance

Data Storage: Building a scalable data store

NoSQL (Cassandra) >> MySQL (for us)
- Can increase throughput and storage linearly by adding nodes
- Virtually unlimited and variable # of columns
- Much faster read/write

Some challenges:
- Doesn't yet support all the SELECT features you're used to
- Not a mature technology yet; expect frequent updates
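One way to see why adding nodes scales a Cassandra-style store without manual sharding is consistent hashing: each node owns an arc of a token ring, and a new node takes over only its own arc, so only that arc's keys move. This toy ring (MD5 for brevity rather than Cassandra's actual partitioner, one token per node) is an illustration of the idea, not Cassandra's implementation:

```python
import bisect
import hashlib

def token(key):
    # Map a key onto the ring by hashing it to a large integer.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy token ring: each key belongs to the first node whose token
    is >= the key's token (wrapping around). Adding a node reassigns
    only the keys on that node's new arc."""

    def __init__(self, nodes):
        self.tokens = sorted((token(n), n) for n in nodes)

    def node_for(self, key):
        t = token(key)
        idx = bisect.bisect(self.tokens, (t,)) % len(self.tokens)
        return self.tokens[idx][1]
```

With a plain `hash(key) % num_nodes` scheme, changing `num_nodes` would remap almost every key; on the ring, growth cost stays proportional to the data one new node should own.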

Data Storage: Building a scalable data store

Choosing Cassandra over other NoSQL databases:
- More active community; seems to be gaining traction most quickly
- Impressive production-scale examples
- Backed by corporations (DataStax) and some really smart people
- Integrated with other relevant technologies: Solr for text search, Hadoop for batch-style processing
- Though, admittedly, it has seen some high-profile scrappings

Data Storage: Building a unified database of everything

Normalizing separate data points that represent the same thing
Co-occurrence: most popular choice wins

(Same Max's Wine Dive comparison table as above.)
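Co-occurrence voting over the address column can be sketched in a few lines. The function name and the light normalization (so "Ave" and "Ave." count as agreeing) are illustrative, not Datafiniti's actual pipeline:

```python
from collections import Counter

def co_occurrence_pick(values):
    """'Most popular choice wins': pick the value reported by the most
    sources, after light normalization. Illustrative sketch."""
    def norm(v):
        return v.lower().rstrip(".").strip()
    winner, _ = Counter(norm(v) for v in values).most_common(1)[0]
    # Return the first original spelling matching the winning normal form.
    return next(v for v in values if norm(v) == winner)

addresses = [
    "4720 Washington Ave",     # Yelp
    "4720 Washington Ave #B",  # Citysearch
    "4720 Washington Ave",     # Urbanspoon
    "4720 Washington Ave",     # Google
    "4720 Washington Ave.",    # Zagat
]
print(co_occurrence_pick(addresses))  # "4720 Washington Ave" (4 of 5 sources agree)
```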

Data Storage: Building a unified database of everything

Normalizing separate data points that represent the same thing
Trusted sources: put more weight on sources that tend to be right

(Same Max's Wine Dive comparison table as above.)
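Trusted-source weighting can be sketched as a weighted vote. The trust scores below are made-up illustrative numbers, not Datafiniti's actual weights:

```python
from collections import defaultdict

# Hypothetical per-source trust weights, e.g. learned from how often
# each source agrees with verified ground truth.
TRUST = {"Yelp": 0.9, "Citysearch": 0.6, "Urbanspoon": 0.7,
         "Google": 0.95, "Zagat": 0.8}

def weighted_pick(reports):
    """reports: list of (source, value) pairs. The value whose
    supporting sources carry the most total trust wins."""
    score = defaultdict(float)
    for source, value in reports:
        score[value] += TRUST.get(source, 0.5)  # default for unknown sources
    return max(score, key=score.get)

zip_reports = [("Yelp", "77007"), ("Citysearch", "77007"),
               ("Urbanspoon", "77007"), ("Google", "77007-5436"),
               ("Zagat", "77007")]
print(weighted_pick(zip_reports))  # "77007" (3.0 total trust vs. 0.95)
```

Unlike plain co-occurrence, this lets a single highly trusted source outvote several unreliable ones when they disagree.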

Data Storage: Building a unified database of everything

Identifying interesting data on a random web page

Yay, step 3! (step 2 took us 3 months :D)

Data Retrieval

Data Retrieval: Building an easy way to get lots of data fast

Making the right choices for our API:

Single channel for all data retrieval
- RESTful API so anyone can develop with it
- All external and internal functionality uses the same API (easier to manage)

As user-friendly and intuitive as possible
- SQL-style querying on a NoSQL database
- JSON default output, but also supports CSV and XML
- SSL authentication with token

Briefly considered using a 3rd-party service like Mashery
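A client call combining those choices might look like the following sketch. The endpoint path, parameter names, and token are invented for illustration; only the datafiniti.net domain comes from the deck:

```python
import urllib.parse

def build_query_url(base, query, fmt="json", token="MY_TOKEN"):
    """Build a GET URL for a hypothetical Datafiniti-style API:
    SQL-style query string, token auth, JSON by default."""
    params = {"q": query, "format": fmt, "token": token}
    return base + "?" + urllib.parse.urlencode(params)

url = build_query_url(
    "https://api.datafiniti.net/v1/places",  # hypothetical endpoint
    "SELECT name, phone WHERE city = 'Houston' AND category = 'Wine Bars'",
)
```

A client would issue a GET to this HTTPS URL and parse the JSON response (or pass `fmt="csv"` / `fmt="xml"` for the other output formats).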

Put it all together… (step 3 took 3 weeks!!!)

Sneak Peek

Sign up for the beta at http://www.datafiniti.net
Follow us @Datafiniti

Launching Soon
