web scraping: applications and tools

31
ePSIplatform Topic Report No. 2015 / 10 , December 2015 1 WEB SCRAPING, APPLICATIONS AND TOOLS European Public Sector Information Platform Topic Report No. 2015 / 10 Web Scraping: Applications and T ools Author: Osmar Castrillo-Fernández Published: December 2015

Upload: epsi-platform

Post on 29-Jan-2016

467 views

Category:

Documents


1 download

DESCRIPTION

The latest ePSI Platform Topic Report, written by Osmar Castrillo-Fernández, is dedicated to ‘Web Scraping: Applications and Tools’. The report provides an introduction to web scraping techniques, and explores some of the most popular and novel techniques for data extraction and reuse in complex processes.

TRANSCRIPT

Page 1: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 1

WEBSCRAPING,APPLICATIONSANDTOOLS

EuropeanPublicSectorInformationPlatform

TopicReportNo.2015/10

WebScraping:ApplicationsandTools

Author:OsmarCastrillo-Fernández

Published:December2015

Page 2: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 2

WEBSCRAPING,APPLICATIONSANDTOOLS

TableofContents

TableofContents......................................................................................................................2

Keywords...................................................................................................................................3

Abstract/ExecutiveSummary...................................................................................................3

1 Introduction............................................................................................................................4

2 Whatiswebscraping?............................................................................................................6

ThenecessitytoscrapewebsitesandPDFdocuments........................................................6

TheAPImana........................................................................................................................7

3 Webscrapingtools.................................................................................................................8

PartialtoolsforPDFextraction.............................................................................................8

PartialtoolstoextractfromHTML........................................................................................9

Completetools....................................................................................................................12

Confrontingthethreetools................................................................................................25

Othertools..........................................................................................................................26

4 Decisionmap........................................................................................................................27

5 Scrapingandlegislation........................................................................................................28

6 Conclusionsandrecommendations......................................................................................29

AbouttheAuthor....................................................................................................................30

Copyrightinformation.............................................................................................................31

Page 3: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 3

WEBSCRAPING,APPLICATIONSANDTOOLS

Keywords

web scraping, data extracting, web content extracting, datamining, data harvester, crawler,

opengovernmentdata

Abstract/ExecutiveSummary

Internetisthevastestinformationanddatasourceeverbuiltbymankind.However,itisahuge

collectionofheterogeneousandpoorlystructureddata,difficulttocollectinamanualwayand

complicatedtouseinautomatedprocesses.

Overthelastyears,techniquesandtoolshavesurged,allowingdatacollectionandconversion

tostructureddatatobemanagedbyB2CandB2Bsystems.Thisarticleoffersanintroduction

to web scraping techniques and some of the most popular and novel techniques for data

extractionandreuseincomplexprocesses.

Thepossibilitiestotakebenefitofsuchdataaremany,includingareas likeOpenGovernment

Data, Big Data, Business Intelligence, aggregators and comparators, development of new

applicationsandmashups,amongothers.

Page 4: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 4

WEBSCRAPING,APPLICATIONSANDTOOLS

1 IntroductionEvery single day, several petabytes of information are published via the Internet in various

formats,suchasHTML,PDF,CSVorXML.Curiously,HTMLisnotnecessarilythemostcommon

format to publish content. For instance, 70% of the content indexed by Google is extracted

from PDF documents. This is an additional obstacle for the different roles involved in data

extraction from multiple sources. Journalists, researchers, data analysts, sales agents or

software developers are some examples of professionals typically using the copy-and-paste

techniquetogetinformationinaspecificformatandexportittoaspreadsheet,anexecutive

reportortosomedataexchangeformatsuchasXML,JSONoranyoftheseveralvocabularies

availablebasedonthem.

Focusing on data acquisition from HTML (i.e, a web document), this article explains the

mechanismsandtoolsthatmayhelpustominimizethetediousdutiesofdataextractionfrom

theInternet.

With regards to the commitment of transparency and data openness that public

administrations have assumed over the last decade, scraping and crawling are techniques

that may be useful. Whereas current information systems in use by public administrations

Page 5: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 5

WEBSCRAPING,APPLICATIONSANDTOOLS

considerthesetechniques,arelevantamountofwebsites,contentmanagementsystemsand

ECMs(EnterpriseContentManagement)donot.Exchangingthesesystemsfornewonesimply

considerable economic efforts. Scraping tools provide an alternative to exchanging which

minimizes such efforts. Open Government Data and Transparency Policies should take

advantageofthisopportunity.

extraction transformation reuse

Page 6: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 6

WEBSCRAPING,APPLICATIONSANDTOOLS

2 Whatiswebscraping?One of the many definitions of this concept, and the favourite one for the author of this

document,is:

Awebscrapingtoolisatechnologysolutiontoextractdatafromwebsites,inaquick,efficientand

automatedmanner,offeringdatainamorestructuredandeasiertouseformat,eitherforB2Borfor

B2Cprocesses.

Scrapingprocessesmaybewrittenindifferentprogramminglanguages.Themostpopularare

Java,Python,RubyorNode.Asitisobvious,expertprogrammersarerequiredtodevelopand

evolve them, and even to use them.Nonetheless, some software companies have designed

differenttoolsthatenableotherpeopletousescrapingtechniquesbymeansofattractiveand

powerfuluserinterfaces.

WebscrapingtoolsarealsoreferredasWebDataExtractors,DataHarvesters,CrawlingTools

orWebContentMiningTools.

ThenecessitytoscrapewebsitesandPDFdocuments

Asalreadystated,approximately70%oftheinformationgeneratedintheInternetisavailable

in PDF documents, anunstructuredandhard tohandle format.However, awebpagehas a

structuredformat(HTMLcode),althoughinanon-reusableway.

PDFscrapingisnottheobjectoftheanalysisofthisarticle,althoughitistruethatsometools

exist to extract information, mainly related to data tables. This enormous amount of

informationpublishedbutcaptiveofthiskindofformatisusuallycalled“thetyrannyofPDF”.

SometoolsthatarepresentedinlatersectionsofthisdocumentcanreadPDFdocumentsand

returninformationinastructuredformat,althoughinabasicandrudimentaryway.

Following with the main scope of this document (HTML documents), its structured nature

multipliesthepossibilitiesopenbyscrapingtechniques.Webscrapingtechniquesandscraping

tools rely in the structure and properties of theHTML language. This facilitates the taskof

scrapingfortoolsandrobots,stoppinghumansfromtheboringrepetitiveanderrorproneduty

ofmanualdataretrieval.Finally,thesetoolsofferdatainfriendlyformatsfor laterprocessing

andintegration:JSON,XML,CSV,XLSoRSS.

Page 7: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 7

WEBSCRAPING,APPLICATIONSANDTOOLS

TheAPImana

Data collection in a handcrafted way is truly inefficient: search, copy and paste data in a

spreadsheet to later process them. This is a tedious, annoying and tiresome process.

Therefore, itmakesmuchmore sense to automate this process.Scraping allows this kindof

automation, as themajority of the available tools provide an API to facilitate access to the

contentgenerated.AnAPI(ApplicationProgrammingInterface)isamechanismtoconnecttwo

applications,allowingthemtosharedata.ScrapingtoolsfacilitateaURL,anInternetaddressas

thoseyoumaynoticeintheaddressbarofawebbrowser,givingaccesstothedatascraped.In

somecases,APIsarenotonly limited toaURL.Theycanalsobeprogrammed inanyway to

modify the final result of the scraping process. This feature is absolutely useful for B2B

integrationprocesses,enablingsubsequentapplicationsandservicesbasedonsuchdata.

Therefore, it isrecommendedthatagoodscrapingtoolprovidesasimpleandprogrammable

API.TheconceptofAPI isveryrelevant inthis topic.Occasionally, this term isalsoknownas

EndPoint.

Page 8: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 8

WEBSCRAPING,APPLICATIONSANDTOOLS

3 WebscrapingtoolsOnce that the basic concepts related to scraping have been commented, this documents

focusesontoolscurrentlyavailable inthemarket.Basic informationabouttheir functionality

andhowtheyworkisprovided,avoidingdeeptechnicaldetailsasmuchaspossible.

Firstly,wesuggestaninitialtoolsegmentation:

1. Partialtools.Thesearetypicallypluginstothird-partysoftware.Theyusuallyfocuson

aspecificscrapingtechnique(forinstance,HTMLtables).TheydonotprovideanAPI

forB2Bintegration.

2. Complete tools.This segment includes the latest tools in themarketandsomeolder

tools, created as a general scraping service. They offer features such as powerful

graphicaluser interface,visual scrapingutility,SaaSand/ordesktop licensingmodels,

querycachingandrecording,APIsorreportingandauditdashboards.

PartialtoolsforPDFextraction

This segment includes these toolsoriented to readPDFdocumentsandextractallorpartof

theinformationcontained.

WithinthistypemayincludethoseorientedexclusivelytoopenPDFsandextractallorpartof

itsinformation.Thefigurebelowprovidesabriefdescriptionofsomeofthesetools.

Page 9: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 9

WEBSCRAPING,APPLICATIONSANDTOOLS

PDFtoExcelOnLine

www.pdftoexcelonline.comZamzar

www.zamzar.comCometDocs

www.cometdocs.comPDFTables

pdftables.comTabula

tabula.technology

Works online. Allows

several input and output

formats, including Word,

Excel, PowerPoint, PDF).

Theresultingfileissentvia

email.Freetouse.

Works online.

Large amount of

input and output

formats. The

resultingfileissent

via email. Free to

use.

Works online.

Converts PDF to

Word, Excel and

PowerPoint. Cloud-

based document

storage. Application

versions for Desktop,

Smartphone and

Tablet. Free account

with limited features.

API for document

translation.

Works online. Only

converts existing

tables in a PDF

document to Excel.

API available. Free

account (very

limited) plus paid

accounts with

various licensing

terms.

Desktop tool.

Focused on

extraction of

tables in PDF

documents.

Source code

available.

Oriented to

software

developers.

On-line X X X X

Off-line X

SaaSFreeService X X X X

SaaSPaidService X X

Document

formats

word,excel,powerpoint,

pdf many

ConvertsPDFto

Word,Exceland

powerpoint

Convertsexisting

tablesintoPDF

documentstoExcel

format

Onlyextracts

datafromtables

intoPDF

documentsAPI X X

Orientedto

developers X

Sourcecode

available X

PartialtoolstoextractfromHTML

Regarding techniques and tools for web content (HTML documents) partial scraping, some

examplesoftoolsarecommentedbelow.

GoogleSpreadsheetsandtheIMPORTHTMLformula

This is a simple solution, but sufficiently effective to extract data from an HTML table to a

GoogleSpreadsheetsdocument.Theactualformatoftheformulais:

=IMPORTHTML(“URL”;”table”;N)

Testing the IMPORTHTML formula is fairly simple, as long as value N is known. This value

represents theorderof the table in the listof tablesavailable in theHTMLcodeavailable in

Page 10: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 10

WEBSCRAPING,APPLICATIONSANDTOOLS

addressURL.Anexampleofusebasedintheformatis:

=IMPORTHTML("http://www.euskalmet.euskadi.net/s07-

5853x/es/meteorologia/app/predmun_o.apl?muni=";"table";1)

TableCapture,aGoogleChromeextension

TableCapture isanextensiontothepopularGoogleChromewebbrowser.Itenablesusersto

copythecontentoftablesincludedinawebpageinaneasymanner.Theextensionisavailable

fromtheaforementionedbrowser,typingthefollowingURLinitsaddressbar:

https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop

Once installed, if Chrome detects tables in the web page currently rendered, a red icon is

showntotherightoftheaddressbar.Clickingonitshowsalistingofthetableddetectedand

twocontrols.Thefirstonecopiescontenttotheclipboard,andthesecondoneopenaGoogle

Docsdocumentto laterpastethecontentofthetable.Anexample isshowninthefollowing

figure.

ExampleofuseofTableCapture

Page 11: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 11

WEBSCRAPING,APPLICATIONSANDTOOLS

TabletoClipboard,Firefoxadd-on

Asinthepreviouscase,theFirefoxwebbrowseralsosupportsadd-ons(analogoustoChrome’s

extensions) to extract data fromHTML tables. An example is Table2Clipboard,which can be

downloaded and installed from https://addons.mozilla.org/es/firefox/addon/dafizilla-

table2clipboard/?src=userprofile

Inthiscase,acontextmenushowinguponaright-clickonatableallowscopyingituporjust

theclicked rowor column.This isamechanismquiteusefuland less intrusive thatoffersan

interestingfunctionalityinmanycases.

ExampleofuseofTabletoClipboard

Page 12: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 12

WEBSCRAPING,APPLICATIONSANDTOOLS

Completetools

Overthelastyears,asetofcompanies,mayofthemstart-ups,haverealizedthatthemarket

wasdemandingtoolstoextract,store,processandrenderdataviaAPIs.Thesoftwareindustry

moves fast and in many directions and a good web scraper can help in application

development,PublicAdministration transparency,BigDataprocesses,dataanalysis,online

marketing, digital reputation, information reuse or content comparers and aggregators,

amongothertypicalscenarios.

But,whatshouldatoolofthiskindincludeatleasttobeconsideredasaseriousalternative?In

theopinionoftheauthor:

• Apowerfulandfriendlygraphicaluserinterface.

• Aneasy-to-useAPItolinkandintegratedata.

• Avisualaccesstowebsitestoperformdatapicking.

• Datacachingandstorage.

• Alogicalorganizationandmanagementofthequeriesusedtoextractdata.

Page 13: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 13

WEBSCRAPING,APPLICATIONSANDTOOLS

Import.io

CompanylocatedinLondon,specializedindataminingandInternetdatatransformationtousersto

exploitinformation.

Websiteàimport.ioandenterprise.import.io

MottoàInstantlyturnwebpagesintodata

Indubitably, this isoneof the reference tools in themarket. Itmaybeused in fourdifferent

ways.Thefirstone(namedMagic,thatcanbeclassifiedasbasic)istheaccesstoimport.ioin

ordertotypetheaddressofthewebsiteonwhichwewanttoperformscraping.Theresultis

shown in an attractive visual tabular format. The main con is that the process is not

configurableatall.

The second way of use is named Extractor. It is the most common usage of import.io

technology:download,installandexecuteinyourowncomputer.Thistoolisacustomizedweb

browseravailable forWindows,OSXandLinux.Thisway requires someprevious skillsusing

softwaretoolsandsometimetolearnhowtousethetool.However,“picking”isofferedina

quite reasonable manner –although open to improvement, at the same time. “Picking” is

performedbyclickingonthepartsofthescrapedwebsitethatwewanttoextract,inasimple

andvisualway.Thisisafeaturethatanywebscrapingtoolsmustincludethesedays.

Oncethatquerieshavebeencreated,outputformatsareonlytwo:JSONandCSV.Queriesmay

also be combined in order to page results (“Bulk Extract”) or aggregate them (“URLs from

anotherAPI”). It isalsorelevanttonotethatuserswillhaveaRESTfulAPIEndPointtoaccess

the data source –which is a mandatory feature in any relevant complete scraping tool

nowadays.

The import.io application requires a simple user registration process.With a username and

password, the application can be used for free, with a set of basic features that may be

sufficientforsmalldeveloperteamswithoutcomplexscrapingrequirements.Forcompaniesor

professionalsdemandingmore flexibilityandbackendresources,contactwith import.iosales

teamisrequired.

Thethirdandfourthwaysofuseprovidemorevaluetothetool.TheyaretheCrawlerandthe

Connector,respectively.TheCrawlertriestofollowallthelinksinthedocumentindicatedvia

Page 14: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 14

WEBSCRAPING,APPLICATIONSANDTOOLS

its URL and it allows information extraction based on the picking process carried out in the

initialdocument.Inthetestscarriedouttowritethisarticle,wehavenotmanagedtofinishall

this process, as it seems to keep working all the time without producing any results. The

Connector permits to record a browsing script to reach the web document from which to

extractinformation.Thisapproachisveryinteresting,forinstance,ifthedatatobescrapedare

theresultofasearch.

Insummary, import.io isatoolwith interestingfunctionalityfree touse,withahigh levelof

maturity,anattractiveandmoderngraphicaluser interface,supportingcloudstorageofthe

queries,whichdemandslocalinstallationtotakefulladvantageofitsfeatures.

Strengths Weaknesses

VisualInterface Desktopinstallation

BlogandDocumentation Limitedamountofoutputdocumentformats

Allows pagination, aggregation, crawling and script

recording

Learningcurve

StrenghtsvsWeaknessesforimport.io

Viewofthebasicuseofimport.io(availableathttps://import.io)

Page 15: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 15

WEBSCRAPING,APPLICATIONSANDTOOLS

import.iodesktopapplication

Screentoselectdataextractionmodeinimport.io

Page 16: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 16

WEBSCRAPING,APPLICATIONSANDTOOLS

Kimonolabs

KimonoLabsisacompanylocatedinMountainView,California.TheyofferaSaaStoolaftertheuseof

aChromeextension.Itallowsextractingdatafromwebdocumentsinanintuitiveandeasyway.They

provideresultsinJSON,CSVandRSSformats.

Websiteàwww.kimonolobas.com

MottoàTurnwebsitesintostructuredAPIsfromyourbrowserinseconds

Kimono Labs is another key player in the field ofweb scraping. It uses a strategy similar to

import.io, using a web browser. In their case, they offer a Chrome extension, rather than

embeddingawebbrowserinanativedesktopapplication.Therefore,thefirststeptousethis

tool is to installGoogleChromeandthenthisextension.Registration isoptional. It isaquick

andsimpleprocessthatisrecommended.

After installation and registration, we can use Chrome to reach a web document with

interesting information. In thismoment,we click the icon installedby theKimonoextension

andthepickingprocessstarts.Thisprocessprovideshelptotheuseratfirstexecutionwitha

visualandattractive format. Itsgraphicaluser interface is reallypolishedandthetoolresults

veryfriendly.

InitialhelpviewofferedbyKimono

Page 17: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 17

WEBSCRAPING,APPLICATIONSANDTOOLS

TostartworkingwithKimono,theusermustcreatea“ball”intheupperendofthescreen.By

default,aball isalreadycreated.Thevariousballscreatedareshownwithadifferentcolour.

Afterwards,sectionsofthedocumentmaybeselectedtoextractdataan,subsequently,they

are highlightedwith the colour of the associated ball. At the same time, the ball shows the

numberofelements thatmatch theselection.Ballsmaybeused to selectdifferent zonesof

the same document, although their purpose is to refer zones with certain semantic

consistency.

For instance, in thewebsiteofanewspaper,wemightcreateaballnamed“Title”and then

selectaheadline inawebpage.Then, thetoolhighlightsoneor twoadditionalheadlinesas

interesting.Afterselectinganewheadline, the tool startshighlightinganother20 interesting

elements in the web page. We may notice that there are some headlines which are not

highlightedbythetool.Weselectoneofthemandnowover60headlinesarehighlighted.This

process may be repeated until the tool highlights all the headlines after selecting a small

amount of them. This process is known as “selector overload” and is available in several

scrapingtools.

Oncethatall theheadlineshavebeenselected,wecantrydoingthesamewiththeopening

paragraph of each news item: create a ball, click on the text area of an opening paragraph,

thenanotherone,andgoonwiththeprocessuntilhavingallthedesiredinformationreadyfor

extraction.

Although the idea is really good, our tests have found that the process is somehow

bothersome.Sometimes,boxhighlightinginwebpagesdoesnotworkwellwith,forinstance,

problemsintextswhicharelinksatthesametime.

Oncethatthepickingprocess is finishedbyclickingonthe“Done”button,wecannameour

newAPIandparameterize its temporalprogramming.Withthis, thesystemmayexecutethe

APIevery15minutes,hour,day,etc. andstore the results in the cloud. Whenever theuser

calls the API by accessing the associated URL, Kimono does not access the target site but

returnsthemostrecentdatastored in itscache.Thiscachingmechanism ishighlyusefulbut

notexclusiveofKimono.

Page 18: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 18

WEBSCRAPING,APPLICATIONSANDTOOLS

Exampleofhighlightinginthepickingprocessofkimono

The query management console, available atwww.kimonolabs.com, provides access to the

newly createdAPI and various controls andpanels to readdataand configurehow they are

obtained.Thisincludesaninterestingfeature,whichareemailalertstobereceivedwhendata

changeinthetargetsite.

There is anadditional interestingoptionnamed“MobileApp” that integrates the contentof

thecreatedAPIinaviewresemblingamobileapplication,allowingsomestylingconfiguration.

However, the view generated by this option is a web document accessible by the URL

announced, aimed to be rendered in a mobile browser. Unluckily, the name of the option

misleads users and does not generate a mobile application to be published in any mobile

applicationstore.Still,itmaybeausefuloptionforrapidprototyping.

The console menu also offers the “Combine APIs” option. Initially, it may look like an

aggregator, assembling the data obtained from several heterogeneous APIs in a single API.

Nevertheless,helpinformationinthisoptionindicatesthattheaggregatedAPIsmusthavethe

sameexactnameofdata collections. The conclusion is that thisoption is useful topaginate

information,butnottoaggregate.

Page 19: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 19

WEBSCRAPING,APPLICATIONSANDTOOLS

Managementconsoleofkimono

In summary, kimono is a free tool,with a high level ofmaturity, a very good graphical user

interface,providingcloudstorageforqueries,requiringChromebrowserandtheirextension–

bothoftheminstalledlocally.

Strengths Weaknesses

Visualinterface Chromebrowserdependency

Documentation Doesnowallowaggregation

Picker Weakmobileappoption

StrenghtsvsWeaknessesforkimono

Page 20: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 20

WEBSCRAPING,APPLICATIONSANDTOOLS

myTrama

myTrama isawebcrawlingtooldevelopedbyVitesiaMobileSolutions,acompany located inGijón,

Spain.myTramaallowsanyusertoextractdatalocatedindifferentInternetsitesandobtainthemin

anorderedandstructuredwayinvariousformats(JSON,XML,CSVandPDF).

Websitesàwww.mytrama.comandwww.vitesia.com

MottoàDataisthenewoil

myTrama is a newweb crawling tool positioned as a clear competitor to those previously

commented.ItisapurelySaaSservice,thusavoidingtheneedforuserstoinstallanysoftware

nortodependonaspecificwebbrowser.myTramaworksonChrome,Firefox,InternetExplorer

andSafari.Itisavailableathttps://www.mytrama.com.

A general analysis of this tool suggests thatmyTrama takes the best ideas of import.io and

kimono. It presents information in a graphicaluser interface, perhapsnot sogoodbutmore

compactandwiththelookandfeelofaprojectmanagementtool.Someofthefeatureswhich

seemmoreinterestinginthistoolarecommentedbelow:

• Main view is organized in a way similar to an email client, with 3 vertical zones: 1)

folders,2)queries,and3)querydetail.Itisefficientandfriendly.

• Besides JSON, XML and CSV, the classical structured formats for B2B integrations, it

adds PDF for quick reporting and sends results in an easily viewable and printable

format.

• ItincludesaquerylanguagenamedTrama-WQL(quitesimilartoSQL),whichissimple

to use while powerful. It is useful when visual picking is not sufficient, providing a

programmaticmannertodefinethepickingprocess.Documentationofthislanguageis

availableinthetoolasamenuoption.

• The “Audit” menu option gives access to a compact control panel with information

abouttherequestscurrentlybeingmadetoeachoftheAPIs(EndPoints).

• Thepickeriscompletelyintegrated.Itisnotnecessaryanytypeofadditionalsoftware.

Itissimilartotheapproachusedinkimono,althoughituses“boxes”insteadof“balls”.

Asubtledifferentiationisthatamagicwandreplacesthedefaultmousepointerwhen

pickingisavailable.Inaddition,thepickingprocessmaybestoppedbyright-clickingon

theareabeingpicked.

• myTrama permits grouping boxes within boxes, although only one level of grouping

Page 21: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 21

WEBSCRAPING,APPLICATIONSANDTOOLS

andonlyagroupwithqueryareallowed.Thisisaveryusefulfeatureinordertohave

results properly grouped.Hopefully, thedevelopment teamwill improve this feature

soontoprovideuserswithmoreflexibility.

• Queryconfigurationallowsupdatefrequencywithagranularityofminutes,from0to

9999999.Zeromeansrealtime(thisis,accessingthetargetsiteuponeachrequestto

the EndPoint). For any other value, information is obtained from the cache –as in

kimono.

• APIs may be programmed using parameters sent via GET and POST requests.

Unfortunately,thedevteamhasnotpublishedsufficientdocumentationrelatedtothis

feature.Forexample, it ispossibletousetheURLofanAPIandoverwritetheFROM

parameter(theURLreferencingthetargetdocument)inrealtime.Itisalsopossibleto

passparametersviaGETandPOSTinthesameAPI.Additionally,thereisaservicethat

allowstheexecutionofaTrama-WQLsentencewithoutanyquerycreatedinthetool.

As these are not very well documented features, the best choice is to contact the

peopleatVitesia.

• Paging queries and aggregation of heterogeneous queries are supported in a fairly

simpleandcomfortableway.

• Forthosepreferringthebrowserextensionwayofscraping,aChromeextensionisalso

available.Thismechanismallowsuserstobrowsesitesandstartthescrapingprocess

byclickingonthebuttoninstalledbytheextension.Thispluginisnotyetpublishedbut

canberequestedtoVitesia.

• PDF isnotonlya formatavailableasoutput,butalsoas input.Therefore,aURLmay

reference only HTML documents but also PDF. For instance, users will be able to

extractinformationfromPDFsandgenerateJSONdocumentsthatfeedadatabasefor

later information analysis. The business hypothesis to support this is based on the

evidenceinitiallycommentedattheintroductionofthisarticlethatstatedthat70%of

thecontentpublishedintheInternetiscontainedinPDFdocuments.Vitesiaconsider

thatthismaybeadifferentiatingfeaturebetweenmyTramaandtheircompetitors.

• APIs preserve session. This allows chaining calls to queries in myTrama and fulfil

businessprocesses,suchassearches,step-basedforms(wizards)oraccessinformation

availablebehindaloginmechanism.

• Itisavailableintwolanguages:EnglishandSpanish.

• Access to this platform is basedon invitation.Users remain active for 30days. Later

contactwiththedevteamisrequiredinordertomovetoastableuser.

Page 22: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 22

WEBSCRAPING,APPLICATIONSANDTOOLS

Amongallthetoolsanalysed,myTramaseemstobethemostcompleteandcompact,although

its user interface is one step behind kimono and import.io. For users with software

development skills,myTramaseems tobe thebest choice–although requiringdirect contact

withVitesia.

InitialscreenofmyTrama

QuerymanagementconsoleinmyTrama

Page 23: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 23

WEBSCRAPING,APPLICATIONSANDTOOLS

PickingprocessinmyTrama

DashboardscreeninmyTrama

In summary,myTrama is a tool solely offered as a SaaS service, very complete to carry our

scraping processes,with cloud storage and thatmay be operatedwith anyweb browser. Its

major weakness is the lack of documentation of many differentiating issues relevant to

developersinterestedintakingadvantageofscrapingprocesses.

Strengths Weaknesses

TheTrama-WQLlanguage Limiteddocumentation

Dashboards Freelicenseonlyfor30days

Picker Moreorientedtodevelopers

SessionpreservationbetweenAPIrequests

StrengthsvsweaknessesformyTrama

Page 24: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 24

WEBSCRAPING,APPLICATIONSANDTOOLS

Page 25: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 25

WEBSCRAPING,APPLICATIONSANDTOOLS

Confrontingthethreetools

distrib

ution

SaaSmodel

X

(requires

installationof

chromeextension)

X

Desktopinstaller

X

(Windows,OSX

andLinux)

Chromeextension

X

(required)

X

(optional)

Freelicense X X X

Cross-browsercompatibility Ownbrowser OnlyChrome X

operation

Valuationofuserinterface Good Verygood Good

APIscreationsimplicity X X X

Visualpicking X X X

Cachingandstorage X X X

Queryorganization X X X

Ownquerylanguage TramaWQL

Statisticsandauditdashboards X

PDFextraction X

Outputformats JSON,CSV JSON,CSV,RSSJSON,CSV,XML,

PDF

APIscreationsimplicity X X X

SessionpreservationbetweenAPIinvocations X

features

Automaticcrawling ?

Maturitylevel High High Medium/High

Complexofuse Medium/High Medium Medium

Cloudstorage X X X

Querypagination X X X

Queryaggregation X X

Realtimedata ? ? X

othe

rs Orientationtosoftwaredevelopment Medium/Low Medium/Low Medium

GoogleDocsintegration X X

Levelofdocumentation High High Low

Page 26: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 26

WEBSCRAPING,APPLICATIONSANDTOOLS

Othertools

Thisarticleanalysesthreetoolsinthecategoryofcompletetoolsbutmanyotherexist.Forthe

sakeofbrevity, and for those readers interested in this kindof tools, someother interesting

tools,utilitiesorframeworkspermittingwebscrapingarelistedbelow.

Mozenda QuBole ScraperWiki Scrapy ApacheNutch

Scrapinghub ParseHub UbotStudio5 Scraper (Chrome

Plugin)

OutwitHub

Fminer.com 80legs ContentGrabber CloudScrape Webhose.io

UIPath Winautomation VisualWebRipper AddTolt AgentCommunity

AllinOneStats Automation

Anywhere

Clarabridge

Enterprise

DarcyRipper DataIntegration

DataCrops Dataddo Diffbot EasyWebExtract Espion

Feedity Ficstar Web

Grabber

ForNova Big Data

Platform

HeliumScraper KapowKatalyst

PDFCollector PDF Plain Text

Extractor

RedCritter Scrape.it SolidConverter

Spinn3r SyncFirstStandard TextfromPDF Trapeze UnitMiner

Web Content

Extractor

Web Data

Extraction

WebDataMiner Web Robots

Scraping

WebHarvy

If youarenot convinced touseanyof the three recommended tools,MozendaorParseHub

maybeinterestingalternatives.

Page 27: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 27

WEBSCRAPING,APPLICATIONSANDTOOLS

4 DecisionmapThe following diagram may be helpful in the process of deciding which tool meets which

scrapingrequirements.Obviously,thediagramcouldbemorecomplex,asmorequestionsmay

beaskedinadecisionprocess.However,theriseofcomplexityinthefiguresuggestskeepingit

simple,butsufficientlyillustrative.

Page 28: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 28

WEBSCRAPING,APPLICATIONSANDTOOLS

5 ScrapingandlegislationThequestiontobemadeissimple:iswebscrapinglegal?

Theanswer isnotassimple.Firstly,weneedtoknowthespecific legislationofeachcountry

concerned.TheUnitedStatesofAmericaaremorepermissivethanEurope.InSpain,wherethe

authorlivesandworks,lawsarefairlymorerestrictive.However,therearevariousstatements

(includingonefromtheSupremeCourtofSpain),wherewebscrapingisconsideredlegalunder

specificconditions.

Forexample,usersmustbecarefulwithlegalconditionsincertainwebsites.Iftheircontentis

proprietaryandorientedtocommercialpurposes,scrapingmightbeanillegalactivity.Author

rightsareanother legal characteristicof special interestbefore scraping the informationofa

website.Forinstance,newsfromdigitalnewspaperscannotbeextractedtobepublishedina

blogorapplicationwithoutstatingthesource.

Asexpected,allthisgeneratesdoubts,soitisagoodpracticetoconsultlawyersspecializedin

digitalcontent, intellectualproperty,unfaircompetitionand licensing. Inanycase, ifdataare

clearlyopen,youwillnotneedtoworry.

Page 29: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 29

WEBSCRAPING,APPLICATIONSANDTOOLS

6 ConclusionsandrecommendationsWebscrapingisafamiliartermthathasgainedimportancebecauseoftheneedto“free”data

storedinPDFdocumentsorwebpages.Manyprofessionalsandresearchersneedthedatain

order to process it, analyse it and extract meaningful results. On the other hand, people

dealingwithB2Buse casesneed toaccessdata frommultiple sources to integrate it innew

applicationsthatprovideaddedvalueandinnovation.

Themarketdemandsholisticwebscrapingsolutions, thatencompasscloudstorageandease

to build interoperable APIs. In this article we have analysed and compared 3 web scraping

solutions: import.io, Kimono and myTrama. These solutions differ in their implementation

details but share more common ground than believed: a visual picker, JSON results, data

caching,backgroundrobotstogatherdata,programmaticAPIs,etc.

Anexperiencedteam,usingthesesolutions,candevelopservicessuchascontentaggregators,

ranking tools,mobile apps, data monitoring systems, reputation management applications,

businessintelligencesolutions,BigData,etc.

Page 30: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 30

WEBSCRAPING,APPLICATIONSANDTOOLS

AbouttheAuthor

OsmarCastrilloFernándezhasmorethan15yearsofexperienceintheITindustry.Heholdsa

Bachelor’sdegreeinComputerSciencefromtheUniversityofOviedo.Hismainexpertisefields

arewebdevelopment,service-orientedarchitecturesandpublicadministration.

InApril2004hestartedworkingforCTICFoundation intheR&Dteamwhatwas inchargeof

developinganewJ2EEframeworkforthePrincipalityofAsturias(FWPA).TheFWPAwasakey

elementinthesuccessofthemodelofE-GovernmentinthePrincipalityofAsturiasandhelped

to simplify and homogenise the development of new government-related applications and

services.

HewasateacherinthefirstandsecondeditionsoftheJ2EE-FWPAcourseforITprofessionals

inAsturias.HealsotaughtseveralothercoursesrelatedtoUML2.0andn-tierarchitectures.

InOctober2012hefoundedandbecomeCTOofVitesiaMobileSolutions,aCompanydevoted

to extracting and analysing published data on the Internet. At the start of 2013 he started

leadingtheTRAMAproject,atechnologytofacilitatewebscraping,thatlaterbecamethetool

myTrama.

Page 31: Web Scraping: Applications and Tools

ePSIplatform Topic Report No. 2015 / 10 , December 2015 31

WEBSCRAPING,APPLICATIONSANDTOOLS

Copyrightinformation

©2015European PSI Platform–This document and allmaterial therein has been compiled

with great care. However, the author, editor and/or publisher and/or any party within the

European PSI Platform or its predecessor projects the ePSIplus Network project or ePSINet

consortiumcannotbeheldliableinanywayfortheconsequencesofusingthecontentofthis

documentand/oranymaterial referenced therein.This reporthasbeenpublishedunder the

auspicesoftheEuropeanPublicSectorinformationPlatform.

The reportmay be reproducedproviding acknowledgement ismade to the European Public

SectorInformation(PSI)Platform.