MongoDB and Web Scraping with the Gyes Platform. MongoDB Atlanta 2013
DESCRIPTION
Gyes is an aggregation platform for the Web. It lets you develop, schedule and troubleshoot data extraction programs (crawlers) that translate HTML content into structured data you can use later on. Selecting the data model for the platform raised several challenges, because the scraped data lacks structure and we needed to provide meaningful and efficient access to it. Moving to MongoDB was our third rewrite of the Gyes back-end, and it has by far exceeded expectations. In this talk, I would like to discuss some of the challenges we faced and how MongoDB addressed them. Details about implementation challenges are also shared.
TRANSCRIPT
MongoDB and Web Scraping with the Gyes Platform
Jesus [email protected]
@infinithread, @elyisu
MongoDB Atlanta 2013
What is Gyes?
• Think of the web as a huge source of unstructured data
• Most of that data has no web service or API layer to consume it
• Significant value in thematic aggregation
• Finance (Mint.com, Manilla.com)
• Travel (Kayak.com)
• Shopping (Nextag.com)
What is Gyes? (cont)
• Aggregation platform for the web
• SaaS or hosted
• Domain-specific Scrapers
• JavaScript + jQuery = JSON
• Oriented toward programmatic access to the data
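As a rough illustration of the "JavaScript + jQuery = JSON" idea, the sketch below turns a small HTML fragment into a JSON-ready object. It is not Gyes crawler code: the function name `extractRates`, the sample markup and the field names are all hypothetical, and a regular expression stands in for the jQuery selectors a real crawler would use, so that the snippet runs without a DOM.

```javascript
// Hypothetical sample page fragment; a real crawler would receive the fetched page.
const sampleHtml = `
<table id="rates">
  <tr><td class="term">1 year</td><td class="rate">0.025</td></tr>
  <tr><td class="term">5 years</td><td class="rate">0.050</td></tr>
</table>`;

// Extract one object per table row. The regex stands in for a jQuery
// selection such as $('#rates tr') and is only suitable for this toy markup.
function extractRates(html) {
  const rowPattern = /<td class="term">([^<]*)<\/td><td class="rate">([^<]*)<\/td>/g;
  const rows = [];
  let match;
  while ((match = rowPattern.exec(html)) !== null) {
    rows.push({ term: match[1].trim(), rate: Number(match[2]) });
  }
  return rows;
}

// The structured result a crawler would hand over for storage (shape is assumed).
const scraped = { source: 'sample-bank', rates: extractRates(sampleHtml) };
console.log(JSON.stringify(scraped, null, 2));
```

The point of the sketch is only the direction of travel: markup goes in, a plain object comes out, and that object can be stored and served without further translation.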
Goals
• Decouple Data Extraction from Data Consumption
• Provide a Flexible Data Model
• Provide a Semi-structured Model to Access Scraped Data
Challenges: Data Storage
• Relational databases?
• Lack of support for JSON
• Tabular structure vs data schema flexibility
• Key/value stores
• Very flexible, but
• Cannot query the data in more than one dimension
The MongoDB solution
• No tricks, store data as-is
• Flexible (structure of scraped data can change, MongoDB doesn't care)
• Powerful query mechanisms
• Scalable
• Again, store data as-is, consume as-is
Using MongoDB in Gyes
• One database per user
• Data segregation
• Avoid name conflicts
• Two collections per crawler
• Permanent results (available to the API)
• Temporary results (developing and tuning crawler)
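The layout above can be sketched as a small naming helper. The C# snippet later in the talk does show collections named after the lower-cased crawler name; everything else here (the function name `storageNames`, the `_tmp` suffix for temporary results, and using the user name as the database name) is an assumption made for illustration.

```javascript
// Map a user and a crawler to the database and collections that would hold its results.
function storageNames(userName, crawlerName) {
  const base = crawlerName.toLowerCase();
  return {
    database: userName.toLowerCase(), // one database per user (naming is assumed)
    permanent: base,                  // results served by the API
    temporary: `${base}_tmp`,         // hypothetical suffix for development runs
  };
}

console.log(storageNames('ubirates', 'SampleBank'));
```

Keeping each user in a separate database means two users can both have a crawler with the same name without their collections colliding.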
Gyes API
• Make programmatic data consumption easy
• RESTful
• API data functions leverage MongoDB query capabilities (latest, find)
Case Study: Ubirates
• www.ubirates.com. Financial aggregation website (Japan)
• 10 banks (and counting)
• Gyes as aggregator platform and BaaS (data served via API upon page load)
Case Study: Ubirates (cont)
find API call (POST)
URL: http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy&take=1
Body: {
  q: { Status: 'success' },
  p: { CrawlerName: 1, Data: 1, _id: 0 }
}
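A client could assemble this call along the following lines. This is a sketch based only on the URL and body shown on the slide: the helper name `buildFindRequest`, the reading of the path as `/find/{user}/{crawler}`, and the `Content-Type` header are assumptions, and the API key is a placeholder.

```javascript
// Build the pieces of a find call; pass them to fetch() or any HTTP client.
function buildFindRequest({ user, crawler, apiKey, take, query, projection }) {
  const params = new URLSearchParams({ apiKey, take: String(take) });
  const path = `${encodeURIComponent(user)}/${encodeURIComponent(crawler)}`;
  return {
    url: `http://api.gyeslab.com/v1/find/${path}?${params}`,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }, // assumed; not stated on the slide
    body: JSON.stringify({ q: query, p: projection }),
  };
}

const request = buildFindRequest({
  user: 'ubirates',
  crawler: 'all',
  apiKey: 'xxyy', // placeholder key from the slide
  take: 1,
  query: { Status: 'success' },
  projection: { CrawlerName: 1, Data: 1, _id: 0 },
});
console.log(request.url);
console.log(request.body);
```

Note that `JSON.stringify` emits quoted keys, whereas the slide shows the body in JavaScript object-literal notation; whether the API accepts the unquoted form is not stated in the talk.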
Case Study: Ubirates (cont)
var data = crawlers
    // One collection per crawler, named after the crawler in lower case
    .Select(crawler => database.GetCollection(crawler.ToLower()))
    .Select(collection =>
        collection.Find(q)
            // Highest RequestId first
            .SetSortOrder(SortBy.Descending("RequestId"))
            .SetFields(p)
            .Skip(skip)
            .Take(take)
            .ToJson(jsonWritterSettings)
    );
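To make the query pipeline above easier to follow without a running MongoDB instance, the sketch below emulates the same steps (filter, sort by `RequestId` descending, project, skip, take) over plain in-memory arrays. It is an illustration only: the helper names and sample documents are invented, the filter supports just top-level equality, and the projection handles only inclusion of fields. A real deployment would let MongoDB do this work.

```javascript
// Keep documents whose fields equal every key/value in the query (top-level equality only).
const matches = (doc, q) => Object.keys(q).every((k) => doc[k] === q[k]);

// Keep only the fields flagged with 1 in the projection (inclusion only).
const project = (doc, p) =>
  Object.fromEntries(
    Object.keys(p)
      .filter((k) => p[k] === 1 && k in doc)
      .map((k) => [k, doc[k]])
  );

// Emulate find + sort + projection + skip + take over one collection.
function find(collection, q, p, skip, take) {
  return collection
    .filter((doc) => matches(doc, q))
    .sort((a, b) => b.RequestId - a.RequestId) // highest RequestId first
    .map((doc) => project(doc, p))
    .slice(skip, skip + take);
}

// Invented sample data: one array per crawler collection, with differing Data shapes.
const collections = {
  banka: [
    { _id: 1, RequestId: 1, CrawlerName: 'BankA', Status: 'success', Data: { rate: 0.02 } },
    { _id: 2, RequestId: 2, CrawlerName: 'BankA', Status: 'failed' },
    { _id: 3, RequestId: 3, CrawlerName: 'BankA', Status: 'success', Data: { rate: 0.03 } },
  ],
  bankb: [
    { _id: 4, RequestId: 7, CrawlerName: 'BankB', Status: 'success', Data: { fees: [1, 2] } },
  ],
};

const q = { Status: 'success' };
const p = { CrawlerName: 1, Data: 1, _id: 0 };
const data = Object.keys(collections).map((name) => find(collections[name], q, p, 0, 1));
console.log(JSON.stringify(data));
```

With `take` set to 1, each crawler contributes only its successful result with the highest `RequestId`, which matches the shape of the `take=1` call shown earlier.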
What's Next
• Scale Data Repository + API
• Sharding
• Get data closer to users
• Query optimizations
• Indexing
• Caching
The End
Thanks!
@infinithread
www.infinithread.com
www.gyeslab.com