Extracting tabular data from the Web

Limitations of the current BP screen scraper.

• Parsing is done line by line.
• Pattern matching is not very accurate and is unpredictable.
• Code for fetching and parsing HTML pages needs to be rewritten for each website (e.g. MSAMB (Maharashtra), Krishi Marata Vahini (Karnataka), etc.).
• Doesn't take care of misplaced tags.

Characteristics of a Solution to this problem

• Flexible.
• Unicode compliant.
• Smarter pattern matching: explore the structure of the HTML page rather than a single line at a time.

Possible Solutions

Solution 1

Step 1: Fetch data from the desired site.
Step 2: Tidy the HTML page.
Step 3: Construct the HTML DOM (Document Object Model) tree.
Step 4: Extract node information using the Document object.
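
Steps 3 and 4 can be sketched with the DOM parser that ships with the JDK. This is only an illustration, assuming the page has already been tidied into well-formed XHTML without a DOCTYPE or namespace (Step 2). The class name TableExtractor and the method extractRows are hypothetical and are not part of the original scraper.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reads every <tr> of a tidied page into a list of cell texts.
class TableExtractor {

    static List<List<String>> extractRows(String xhtml) {
        try {
            // Step 3: construct the DOM tree (the input must be well-formed).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Step 4: extract node information through the Document object.
            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                NodeList cells = ((Element) trs.item(i)).getElementsByTagName("td");
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    row.add(cells.item(j).getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers do not have to handle checked exceptions in this sketch.
            throw new IllegalStateException("Could not parse the tidied page", e);
        }
    }
}
```

Because the rows are found by element name rather than by matching text line by line, a change in the line layout of the page does not affect the result.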

Solution 2

• Similar to Solution 1.
• Use XPath to locate data (Step 4).
• Relative position of nodes in the DOM tree is stored as XPath.
• These XPaths are stored in the properties file instead of the entire table structure.
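
A minimal sketch of this idea, using java.util.Properties and the javax.xml.xpath API from the JDK. The class XPathLookup, the method extract, and the key nic.lowRate are hypothetical names chosen for illustration. The sketch also assumes the tidied page is well-formed and carries no XML namespace, otherwise the plain element names in the XPath would not match.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: looks up an XPath by key and evaluates it against a page.
class XPathLookup {

    static String extract(String xhtml, String propertiesText, String key) {
        try {
            // Load the stored XPaths (in practice, read from the properties file on disk).
            Properties props = new Properties();
            props.load(new StringReader(propertiesText));
            String expression = props.getProperty(key);
            if (expression == null) {
                throw new IllegalArgumentException("No XPath stored under key: " + key);
            }
            // Evaluate the XPath and return the text of the first matching node.
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate stored XPath", e);
        }
    }
}
```

A properties entry for this sketch could look like `nic.lowRate=/html/body/table/tr[2]/td[4]`. When a site changes its layout, only that entry needs to be edited, not the Java code.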

Solution 3

• Tested a software product, screen-scraper (www.screen-scraper.com).
• Proxy server that allows the contents of HTTP and HTTPS requests to be viewed.
• Engine that can be configured to extract information from Web sites using special patterns and regular expressions.
• Embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database.
• It can be used with PHP, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
• Costs $90!
• No Unicode support.

Other Possible Solutions

• XMLize the HTML content.
  • XML is more structured and well-formed.
  • Data interchange between incompatible systems.
  • Can use XSL and XSLT to convert from one form to another.
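
The XSLT route can be sketched with the javax.xml.transform API from the JDK. The stylesheet below is a hypothetical example that turns each table row into a record element and each cell into a field element. The class name TableToXml and the output element names are assumptions, not a format described in these slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TableToXml {

    // Hypothetical stylesheet: every <tr> becomes <record>, every <td> becomes <field>.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'>"
      + "<records><xsl:apply-templates select='//tr'/></records>"
      + "</xsl:template>"
      + "<xsl:template match='tr'>"
      + "<record><xsl:apply-templates select='td'/></record>"
      + "</xsl:template>"
      + "<xsl:template match='td'>"
      + "<field><xsl:value-of select='normalize-space(.)'/></field>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed (already tidied) page.
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

With this approach a new site would need a new stylesheet rather than new parsing code.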

Implementation

HTML scraper

The HTML scraper has 3 main steps:

1. Downloading the web page using crawlers like 'wget'.
2. Parsing and constructing the DOM tree.
3. Querying the DOM tree to retrieve the desired information and inserting it into the database.

Implementation

• Download the web page using
  wget --post-data="data" www.agmarknet.nic.in
• Can store the page locally.
• Construct the DOM tree using the JTidy API.
  Tidy tidy = new Tidy();
• Parse the page into a DOM tree
  Document doc = tidy.parseDOM(htmlfile, null);
• Query the DOM tree:
  Depth First Search through the DOM tree
  Or
  Using the XPath APIs.
• Store the HTML page structure in a file and use DFS.
  Or
  Store XPaths and use them for querying.
• Insert into the database using JDBC.
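
The depth-first query can be sketched as a recursive walk over org.w3c.dom nodes. This is a simplified illustration: it collects every non-empty text node in document order and does not compare the tree against a stored page structure as the DFS parser above does. The class name DomDfs and its methods are hypothetical, and the JDK parser stands in for JTidy here so that the example runs without extra libraries, which assumes the input is already well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomDfs {

    // Visits nodes depth first, recording the text of each non-empty text node.
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        // Recurse into each child before moving on to the next sibling.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Parses a well-formed page and returns its text nodes in document order.
    static List<String> textInDocumentOrder(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collectText(doc.getDocumentElement(), out);
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse page", e);
        }
    }
}
```

Since tidy.parseDOM also returns an org.w3c.dom.Document, a walk like collectText should in principle apply to its result as well, though that has not been checked against JTidy here.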

DOM tree of the parsed HTML page

[Figure: DOM tree of the parsed HTML page. Nodes shown: html, head, table, four tr rows. Column labels: APMC, Arrivals, Variety, Low Rate, Mid Rate, High Rate.]

Total time taken by the new parser is less than 15 seconds per page, whereas the old one takes more than 30 seconds.

Daily data fetching time = (200*15) seconds = 3000 seconds, i.e. at most 50 minutes.

Statistics

Parsers (using DFS) for NIC and MSAMB (both English and Marathi) are ready.
