Extracting tabular data from the Web
Limitations of the current BP screen scraper.
• Parsing is done line by line.
• Pattern matching is not very accurate and is unpredictable.
• Need to rewrite code for fetching and parsing HTML pages from different websites (e.g. MSAMB in Maharashtra, Krishi Marata Vahini in Karnataka, etc.).
• Doesn’t take care of misplaced tags.
Characteristics of a Solution to this problem
• Flexible.
• Unicode compliant.
• Smarter pattern matching: explore the structure of the HTML page rather than a single line at a time.
Possible Solutions
Solution 1
Step 1: Fetch data from the desired site.
Step 2: Tidy the HTML page.
Step 3: Construct the HTML DOM (Document Object Model) tree.
Step 4: Extract node information using the Document object.
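
A minimal sketch of Steps 3 and 4, assuming the page has already been fetched and tidied into well-formed markup. To stay runnable without extra libraries it uses the JDK's built-in DOM parser; for real-world HTML, a Document produced by a tidying parser would take its place. The class name, method name, and sample values are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomTableReader {
    // Builds a DOM tree from tidied (well-formed) markup and returns
    // each table row as a list of trimmed cell texts.
    static List<List<String>> readRows(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                // Look up the cells of this row only, not of the whole document.
                NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
                List<String> cells = new ArrayList<>();
                for (int j = 0; j < tds.getLength(); j++) {
                    cells.add(tds.item(j).getTextContent().trim());
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the page", e);
        }
    }
}
```

Because the rows come from the tree structure rather than from line-by-line text matching, the result does not depend on how the markup is split across lines.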
Solution 2
• Similar to Solution 1.
• Use XPath to locate data (Step 4).
• The relative position of nodes in the DOM tree is stored as XPath.
• These XPaths are stored in the properties file instead of the entire table structure.
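
One way this could look, as a sketch rather than the actual implementation: the XPath for each field lives in a properties file, and the extraction code only looks up a key and evaluates it. The property keys (`sample.apmc`, `sample.arrivals`), the class name, and the sample page are assumptions made for illustration; the JDK's built-in DOM parser stands in for a tidied page.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathLookup {
    // Evaluates the XPath stored under `key` in the properties text
    // against the tidied page and returns the matched text.
    static String extract(String xhtml, String propertiesText, String key) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(propertiesText));

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(props.getProperty(key), doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath for key " + key, e);
        }
    }
}
```

A properties entry such as `sample.arrivals=/html/body/table/tr[2]/td[2]` would then be enough to adapt to a site whose table layout changes, without touching the Java code.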
Solution 3
• Tested a software product: screen-scraper (www.screen-scraper.com).
• Proxy server that allows the contents of HTTP and HTTPS requests to be viewed.
• Engine that can be configured to extract information from Web sites using special patterns and regular expressions.
• Embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database.
• It can be used with PHP, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
• Costs $90!
• No Unicode support.
Other Possible Solutions
XMLize the HTML content.
• XML is more structured and well-formed.
• Data interchange between incompatible systems.
• Can use XSL and XSLT to convert from one form to another.
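
As a rough illustration of the XSLT idea, the sketch below converts a tidied table into a small XML record format using the JDK's `javax.xml.transform` API. The output element names (`rates`, `rate`), the stylesheet, and the class name are invented for this example and are not part of the original design.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TableToXml {
    // Hypothetical stylesheet: skips the header row and emits one
    // <rate> element per remaining table row.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<rates><xsl:for-each select='//tr[position() &gt; 1]'>"
        + "<rate apmc='{td[1]}' arrivals='{td[2]}'/>"
        + "</xsl:for-each></rates>"
        + "</xsl:template></xsl:stylesheet>";

    // Applies the stylesheet to well-formed (tidied) markup and
    // returns the transformed XML as a string.
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Supporting another site would then mean supplying a different stylesheet rather than changing the parsing code.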
Implementation
HTML scraper
The HTML scraper has 3 main steps:
1. Downloading the web page using crawlers like ‘wget’.
2. Parsing and constructing the DOM tree.
3. Querying the DOM tree to retrieve the desired information and inserting it into the database.
Implementation
• Download the web page using:
    wget --post-data="data" www.agmarknet.nic.in
• Can store the page locally.
• Construct the DOM tree using the JTidy API:
    Tidy tidy = new Tidy();
• Parse the DOM tree:
    Document doc = tidy.parseDOM(htmlfile, null);
• Query the DOM tree: depth-first search (DFS) through the DOM tree, or using the XPath APIs.
• Store the HTML page structure in a file and use DFS, or store XPaths and use them for querying.
• Insert into the database using JDBC.
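
The depth-first search option could be sketched as follows. This is an illustration only: it assumes the page is already well-formed and uses the JDK's DOM parser in place of the JTidy-built Document so that it runs standalone. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DfsCellCollector {
    // Recursive depth-first walk: handle the node itself, then visit
    // each child from left to right.
    static void dfs(Node node, List<String> cells) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "td".equalsIgnoreCase(node.getNodeName())) {
            cells.add(node.getTextContent().trim());
            return; // cell text captured; nested tables inside a cell are not descended into
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            dfs(child, cells);
        }
    }

    // Parses the page and returns all cell texts in document order.
    static List<String> collect(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> cells = new ArrayList<>();
            dfs(doc.getDocumentElement(), cells);
            return cells;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the page", e);
        }
    }
}
```

The collected values are in document order, so they can be grouped by row and passed on to the JDBC insert step.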
DOM tree of the parsed HTML page
[Figure: DOM tree with the nodes html, head, table, and four tr rows; the table columns are APMC, Arrivals, Variety, Low Rate, Mid Rate, High Rate.]
Statistics
• Total time taken by the new parser is less than 15 seconds per page, whereas the old one takes more than 30 seconds.
• Daily data fetching time = (200 * 15) seconds = 3000 seconds (50 minutes).
• Parsers (using DFS) for NIC and MSAMB (both English and Marathi) are ready.