Extracting tabular data from the Web

Upload: lucas-cox

Post on 12-Jan-2016


Limitations of the current BP screen scraper.

• Parsing is done line by line.
• Pattern matching is not very accurate and is unpredictable.
• The code for fetching and parsing HTML pages has to be rewritten for each website (e.g. MSAMB for Maharashtra, Krishi Marata Vahini for Karnataka, etc.).
• Misplaced tags are not handled.

Characteristics of a Solution to this problem

• Flexible.
• Unicode compliant.
• Smarter pattern matching: explore the structure of the HTML page rather than a single line at a time.

Possible Solutions

Solution 1

Step 1: Fetch data from the desired site.
Step 2: Tidy the HTML page.
Step 3: Construct the HTML DOM (Document Object Model) tree.
Step 4: Extract node information using the Document object.

Solution 2

• Similar to Solution 1.
• Use XPath to locate data (Step 4).
• The relative position of nodes in the DOM tree is stored as XPath.
• These XPaths are stored in the properties file instead of the entire table structure.
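
The idea of keeping XPaths in a properties file can be sketched roughly as follows. This is only an illustration under stated assumptions: the key name `variety.xpath`, the class `XPathLookup`, and the sample table rows are all hypothetical, and the page is parsed with the JDK's own XML parser on the assumption that it has already been tidied into well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLookup {

    /** Parses already-tidied (well-formed) markup into a DOM tree. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    /** Evaluates the XPath stored under key and returns the text of each matching node. */
    static List<String> extract(Document doc, Properties props, String key) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    props.getProperty(key), doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed for key " + key, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical properties content; in practice it would be loaded from a file.
        Properties props = new Properties();
        props.load(new StringReader(
                "variety.xpath=/html/body/table/tr[position()>1]/td[2]\n"));

        // Made-up sample page: one header row followed by two data rows.
        String page = "<html><body><table>"
                + "<tr><td>APMC</td><td>Variety</td></tr>"
                + "<tr><td>Pune</td><td>Onion</td></tr>"
                + "<tr><td>Nashik</td><td>Tomato</td></tr>"
                + "</table></body></html>";

        System.out.println(extract(parse(page), props, "variety.xpath"));
    }
}
```

With the sample page above, `main` prints `[Onion, Tomato]`. Supporting another website would then mostly mean adding new XPath entries to the properties file rather than writing new parsing code.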

Solution 3

Tested a software product, screen-scraper (www.screen-scraper.com):
• Proxy server that allows the contents of HTTP and HTTPS requests to be viewed.
• Engine that can be configured to extract information from Web sites using special patterns and regular expressions.
• Embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database.
• It can be used with PHP, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
• Costs $90!
• No Unicode support.

Other Possible Solutions

XMLize the HTML content.
• XML is more structured and well-formed.
• Data interchange between incompatible systems.
• Can use XSL and XSLT to convert from one form to another.
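
As a rough illustration of converting XMLized HTML into another form with XSLT, the sketch below turns table rows into comma-separated text using the JDK's built-in transformer. The class name `XsltTableToCsv`, the stylesheet, and the choice of comma-separated output are assumptions made for this example, not part of the original slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltTableToCsv {

    // Hypothetical stylesheet: one output line per <tr>, cells joined by commas.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/\">"
          + "<xsl:for-each select=\"//tr\">"
          + "<xsl:for-each select=\"td\">"
          + "<xsl:value-of select=\"normalize-space(.)\"/>"
          + "<xsl:if test=\"position() != last()\">,</xsl:if>"
          + "</xsl:for-each>"
          + "<xsl:text>&#10;</xsl:text>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to well-formed (already tidied) markup. */
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        String page = "<html><body><table>"
                + "<tr><td>Pune</td><td>Onion</td></tr>"
                + "<tr><td>Nashik</td><td>Tomato</td></tr>"
                + "</table></body></html>";
        System.out.print(transform(page));
    }
}
```

A different target format would only need a different stylesheet; the Java side stays the same.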

Implementation

HTML scraper

The HTML scraper has 3 main steps:

1. Downloading the web page using crawlers like 'wget'.
2. Parsing and constructing the DOM tree.
3. Querying the DOM tree to retrieve the desired information and inserting it into the database.

Implementation

Download the web page using

    wget --post-data="data" www.agmarknet.nic.in

The page can be stored locally.

Construct the DOM tree using the JTidy API.

    Tidy tidy = new Tidy();

Parse the DOM tree (htmlfile is an input stream over the stored page):

    Document doc = tidy.parseDOM(htmlfile, null);

Query the DOM tree:
• Depth First Search (DFS) through the DOM tree, or
• Using the XPath APIs.

Store the HTML page structure in a file and use DFS, or store XPaths and use them for querying.

Insert into the database using JDBC.
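
A depth-first walk over the DOM tree might look roughly like the sketch below. It is a minimal illustration, not the parser described in these slides: the class `DomDfs` and its method names are hypothetical, the sample values are made up, and the JDK's XML parser stands in for JTidy on the assumption that the page is already well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomDfs {

    /** Depth-first walk: every tr element found becomes one row of cell texts. */
    static void collectRows(Node node, List<List<String>> rows) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "tr".equalsIgnoreCase(node.getNodeName())) {
            List<String> cells = new ArrayList<>();
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                String name = c.getNodeName();
                if ("td".equalsIgnoreCase(name) || "th".equalsIgnoreCase(name)) {
                    cells.add(c.getTextContent().trim());
                }
            }
            rows.add(cells);
            return; // rows of tables nested inside a cell are not visited separately
        }
        // Not a row: descend into each child before moving on to the next sibling.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectRows(c, rows);
        }
    }

    /** Parses well-formed markup and returns all table rows in document order. */
    static List<List<String>> rowsOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<List<String>> rows = new ArrayList<>();
            collectRows(doc.getDocumentElement(), rows);
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page with a header row and one data row.
        String page = "<html><head/><body><table>"
                + "<tr><th>APMC</th><th>Arrivals</th><th>Variety</th></tr>"
                + "<tr><td>Pune</td><td>120</td><td>Onion</td></tr>"
                + "</table></body></html>";
        for (List<String> row : rowsOf(page)) {
            System.out.println(row);
        }
    }
}
```

Each collected row could then be bound to the parameters of a JDBC prepared statement for insertion into the database.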

DOM tree of the parsed HTML page

[Figure: tree diagram with the nodes html, head, table, four tr rows, and the column cells APMC, Arrivals, Variety, Low Rate, Mid Rate, High Rate.]

Statistics

The new parser takes less than 15 seconds per page in total, while the old one takes more than 30 seconds.

Daily data fetching time = (200 * 15) seconds = 3000 seconds, i.e. 50 minutes.

Parsers (using DFS) for NIC and MSAMB (both English and Marathi) are ready.