my weekend startup: seocrawler.co

Upload: hrvoje-hudoletnjak

Post on 27-Jan-2015

DESCRIPTION

Why and how Seocrawler.co was built: a talk for the Webcamp Zagreb 2013 conference. It presents the technical side of the project, with development advice for building a crawler/spider.

TRANSCRIPT

Page 1: My weekend startup: seocrawler.co

My weekend startup project

SEOCRAWLER.CO

Page 2: My weekend startup: seocrawler.co

TEAM

Goran Čandrlić
Conversion, Google AdWords & Internet Marketing Specialist
Webiny Cofounder

Hrvoje Hudoletnjak
Software developer
Microsoft ASP.NET/IIS MVP

Page 3: My weekend startup: seocrawler.co

WHY?

Target market
- Webmasters, site owners
- Marketers

Usage scenarios
- Get broken pages, redirects, noindex, nofollow, ...
- On-site SEO quality
- Crawl competitor pages and find out what they are doing

Business model
- Free
- Pay as you go
- Share and get credits

Page 4: My weekend startup: seocrawler.co

THE PLAN

Let’s build a crawler
- MVP version: download CSV file of all pages
- Public launch: browsing crawled pages online, payments

Let’s spread the word
- Use social channels to attract more users

Let’s see what we’re missing, what can be done better
- Find out what people would like to pay for
- Iterate, find new niche markets, ask and listen to people

Page 5: My weekend startup: seocrawler.co

GETTING HANDS DIRTY

ENGINE DEV
- Basic engine: 2 days
- Production ready (horizontal scalability, disaster recovery, ...): 60+ days
- Find edge cases (broken HTML), keep crawler running for days/weeks without crashing
- Analysis (tags and content)
- Store reports for user filtering and browsing

WEB APP
- Landing page + admin UI (Themeforest)
- Communication with crawlers
- Browse reports, filters
- Payment gateway integration (PayPal)
- Ticketing support system

Page 6: My weekend startup: seocrawler.co

CURRENT STATUS

- 2.5M pages crawled
- 150 GB transferred
- 800 registered users

Most important things:
- we (think we) know what we should do next
- polished some edge cases, made the service more stable
- got the word spread
- got a speaking slot at WebCampZg!!

Page 7: My weekend startup: seocrawler.co
Page 8: My weekend startup: seocrawler.co
Page 9: My weekend startup: seocrawler.co
Page 10: My weekend startup: seocrawler.co

[Architecture diagram. Components: USER, FRONT END WEB APP, RABBIT MQ, CRAWLERS, DB, CLOUD STORAGE, CSV RESULT. The user connection is labelled HTML, CSS, AJAX / WEBSOCKETS.]

Page 11: My weekend startup: seocrawler.co

FRONT END/ ADMIN UI

- Landing page + admin theme from Themeforest
- ASP.NET MVC 4
- Entity Framework 5 (POCO, EF migrations)
- DotNetOpenAuth for social login
- EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messages
- SignalR (full duplex: WebSockets or Ajax polling)
- KnockoutJS, jQuery, Toastr
- StructureMap IOC/DI, AutoMapper (DB entities <-> DTOs)

Page 12: My weekend startup: seocrawler.co

CRAWLER

[Crawler diagram. Components: CONTROLLER, several CRAWLER WORKER instances, COMMAND/QUERY BUS (CQS), RABBIT MQ, ADO.NET / EF, LOG.]

Page 13: My weekend startup: seocrawler.co

CRAWLER SERVICE

- Multi-threaded crawler (vs evented crawler)
- Entity Framework 5: LINQ + raw SQL queries with EF + ADO.NET bulk insert
- EasyNetQ, RabbitMQ, CQS pattern
- StructureMap, HtmlAgilityPack, NLog
- Protobuf

Page 14: My weekend startup: seocrawler.co

CRAWLER WORKER PROCESS

- Start or Resume
- Resume: load state (SQL, serialized)

- Get next page from queue (RabbitMQ, durable store)
- Download HTML (200 ms – 5 s delay), HEAD request for external links
- Check statuses, canonical, redirects
- Run page analysers, extract data for report, prepare for bulk insert
- Find links

- Check duplicated, blacklisted
- Check robots.txt
- Check if visited – cache & DB
- Normalize & store to queue (RabbitMQ)

- Save state every N pages (serialize with Protobuf, store byte[] to DB)
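The link-handling steps above (normalize, check whether visited, store to queue) can be sketched roughly as follows. This is a minimal illustration, not the talk's code: `UrlTools` and `Frontier` are hypothetical names, the in-memory `Queue` and `HashSet` stand in for RabbitMQ and the cache/DB lookup, and the normalization rule (resolve relative links, drop the fragment) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative helpers; names and rules are assumptions, not the talk's code.
public static class UrlTools
{
    // Resolve href against the page URL and drop the fragment.
    // Returns null for links that are not http/https or cannot be parsed.
    public static string Normalize(Uri pageUrl, string href)
    {
        Uri absolute;
        if (!Uri.TryCreate(pageUrl, href, out absolute)) return null;
        if (absolute.Scheme != Uri.UriSchemeHttp &&
            absolute.Scheme != Uri.UriSchemeHttps) return null;
        // GetLeftPart(Query) keeps scheme, host, path and query, without "#fragment".
        return absolute.GetLeftPart(UriPartial.Query);
    }

    // Fixed-length key for the "already visited" lookup.
    public static string Hash(string url)
    {
        using (SHA1 sha = SHA1.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}

// In-memory stand-in for the durable queue plus the visited check.
public class Frontier
{
    private readonly HashSet<string> _visitedHashes = new HashSet<string>();
    private readonly Queue<string> _queue = new Queue<string>();

    public int Count { get { return _queue.Count; } }

    // Returns true only if the link is crawlable and has not been seen before.
    public bool TryEnqueue(Uri pageUrl, string href)
    {
        string url = UrlTools.Normalize(pageUrl, href);
        if (url == null) return false;
        if (!_visitedHashes.Add(UrlTools.Hash(url))) return false;
        _queue.Enqueue(url);
        return true;
    }

    public string Dequeue() { return _queue.Dequeue(); }
}
```

Hashing the normalized URL keeps the lookup key a fixed size, which matters once the visited set lives in a database index rather than in memory.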

Page 15: My weekend startup: seocrawler.co

RABBITMQ + EASYNETQ

SERVICE (subscriber):

    rabbitBus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
    {
        _commandBus.Execute(new MakeReportCommand(message.ProjectId));
    });

ADMIN UI (publisher):

    rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));

Page 16: My weekend startup: seocrawler.co

COMMAND BUS (MEDIATOR)

- Encapsulate command / query into classes
- IOC / DI for finding and matching handler with command/query types
- Easy unit testing
- AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)

    bool alreadyVisited = _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));

    _bus.Execute(new SavePageCommand(pageData, webPage));

    public class SavePageReportHandler : IHandle<SavePageCommand>
    {
        // implementation
    }
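To make the mediator idea concrete, here is a minimal sketch of an in-process command/query bus. It is an assumption-laden illustration rather than the project's implementation: `SimpleBus`, `IHandleQuery`, `MarkVisited`, `IsVisited` and `VisitedStore` are hypothetical names, handlers are registered by hand in a dictionary instead of being resolved through an IoC container, and `Request` takes two type arguments, unlike the `Request<bool>(...)` call shown above.

```csharp
using System;
using System.Collections.Generic;

public interface IHandle<TCommand> { void Handle(TCommand command); }
public interface IHandleQuery<TQuery, TResult> { TResult Handle(TQuery query); }

// Hypothetical bus: maps a message type to the single handler registered for it.
public class SimpleBus
{
    private readonly Dictionary<Type, object> _handlers = new Dictionary<Type, object>();

    // Interception hook called before every command or query (logging, auth, ...).
    public Action<object> BeforeExecute = delegate { };

    public void RegisterCommand<TCommand>(IHandle<TCommand> handler)
    {
        _handlers[typeof(TCommand)] = handler;
    }

    public void RegisterQuery<TQuery, TResult>(IHandleQuery<TQuery, TResult> handler)
    {
        _handlers[typeof(TQuery)] = handler;
    }

    // Commands change state and return nothing.
    public void Execute<TCommand>(TCommand command)
    {
        BeforeExecute(command);
        ((IHandle<TCommand>)_handlers[typeof(TCommand)]).Handle(command);
    }

    // Queries return a value and should not change state.
    public TResult Request<TQuery, TResult>(TQuery query)
    {
        BeforeExecute(query);
        return ((IHandleQuery<TQuery, TResult>)_handlers[typeof(TQuery)]).Handle(query);
    }
}

// Example messages and an in-memory handler, for illustration only.
public class MarkVisited { public string UrlHash; }
public class IsVisited { public string UrlHash; }

public class VisitedStore : IHandle<MarkVisited>, IHandleQuery<IsVisited, bool>
{
    private readonly HashSet<string> _hashes = new HashSet<string>();
    public void Handle(MarkVisited command) { _hashes.Add(command.UrlHash); }
    public bool Handle(IsVisited query) { return _hashes.Contains(query.UrlHash); }
}
```

Because callers only know the message classes, a unit test can register a fake handler for the same message type, and cross-cutting concerns sit in the single `BeforeExecute` hook instead of being repeated in every handler.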

Page 17: My weekend startup: seocrawler.co

ISSUES

- Everything will crash: net connection, DB, thread, VM, ...
- Resuming / saving states
- Memory issues/leaks with some frameworks
- Don’t optimize before profiling (memory, DB)
- Log everything
- DB indexes: how to store for fast filtering, paging
- DB as queueing system (don’t)
- CQS: command / query separation
- Broken HTML, crazy links
- Cloud services: connections fail
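One common way to cope with connections that fail intermittently is to retry the operation with a growing delay. The sketch below is a generic illustration of that technique, not something the talk shows: `Retry.Execute` is a hypothetical helper, and the doubling back-off and the catch-all exception filter are assumptions that a real crawler would tune.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs the action, retrying after a failure with a delay that doubles each time.
    // The last failure is rethrown once maxAttempts is reached.
    public static T Execute<T>(Func<T> action, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");
        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception)
            {
                // Assumption: every exception is treated as transient.
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

A call such as `Retry.Execute(() => LoadPage(url), 5, TimeSpan.FromSeconds(1))` (with `LoadPage` standing for any hypothetical download or DB call) would wait 1, 2, 4 and 8 seconds between its five attempts.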

Page 18: My weekend startup: seocrawler.co

LEARNED

ORM
- Go low level (raw SQL, bulk insert, SP) if needed
- Profile: memory, SQL queries
- Watch for 1st level cache (ORM unit of work or session)
- NoSQL?

Caching
- in process – in memory
- Plan moving to separate service (Redis, ...)

- SOA
- Pipeline design
- Pub/Sub, CQS pattern (Mediator)
- Unit testing
- Cloud resilience
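As an illustration of in-process caching in front of the database, here is a minimal sketch of a "check if visited" lookup that consults memory first and falls back to the DB. `VisitedCache` and the `existsInDb` delegate are hypothetical; the delegate stands in for whatever SQL query the real service runs, and the cache here is unbounded, which a long-running crawler would need to cap.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-memory cache in front of a slower "visited" lookup.
public class VisitedCache
{
    private readonly ConcurrentDictionary<string, bool> _known =
        new ConcurrentDictionary<string, bool>();
    private readonly Func<string, bool> _existsInDb;

    public VisitedCache(Func<string, bool> existsInDb)
    {
        _existsInDb = existsInDb;
    }

    public bool IsVisited(string urlHash)
    {
        // Cache hit: no database round trip.
        if (_known.ContainsKey(urlHash)) return true;
        // Only positive answers are cached; "not visited" can change at any time.
        if (!_existsInDb(urlHash)) return false;
        _known.TryAdd(urlHash, true);
        return true;
    }

    // Call after a page has been stored, so later checks stay in memory.
    public void MarkVisited(string urlHash)
    {
        _known.TryAdd(urlHash, true);
    }
}
```

Keeping the lookup behind one small class is also what makes a later move to a separate cache service a local change.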

Page 19: My weekend startup: seocrawler.co

HOSTING

Hosting:
- All on one server for now
- Started with EC2
- Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
- Now on Hetzner (dedicated, i7, 32 GB RAM, 2x SSD, Win 2012 = 60 €/month)

Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4

Load & stress testing (crawl 500k URLs)

Goal: 100 parallel crawlers on VM 2CPU 4GB RAM (OS, DB)

Will scale when needed

Page 20: My weekend startup: seocrawler.co

FUTURE PLANS

- Fancy reports
- Brand new web user interface
- Integration with third-party services (MajesticSEO, ...)
- Special page analysis
- NoSQL (RavenDb or Redis) for caching
- Warehouse DB for browsing crawled pages
- Lucene for full text search (RavenDb)
- Refactor crawler, pipeline design, async evented design

Page 21: My weekend startup: seocrawler.co

THANK YOU! QUESTIONS?

Hrvoje Hudoletnjak
m: [email protected]
t: twitter.com/hhrvoje

Goran Čandrlić
m: [email protected]
t: twitter.com/chande