Business Specific Online Information Extraction from German Websites
Yeong Su Lee and Michaela Geierhos
Ludwig-Maximilians-Universität, Centrum für Informations- und Sprachverarbeitung (CIS)
Oettingenstr. 67, 80538 München
03.02.2009
03.02.2009 DIR2009 Yeong Su Lee & Michaela Geierhos, CIS, LMU
Time Management
● talk: 20 to 25 min.
  – presentation of our article on “Business Specific Online Information Extraction from German Websites”
● questions and discussion
Overview
● Introduction
  – goal of the article
    ● starting situation
    ● resources
    ● goal
    ● application area
  – definition of terms
● Implementation
● Evaluation
● Summary
● Appendix
Goal of the article
● starting situation
  – business websites make up a significant part of our Internet society
  – existing business directories are built on manual data processing and extraction
● goal
  – automatic creation of business directories
    ● complete, coherent and up-to-date
● general system requirements
  – modular, efficient, portable, robust, scalable, compact, comprehensive
● what we have
  – training URLs, several lexicons and knowledge bases
Overview
● Introduction
  – goal of the article
  – definition of terms
    ● business specific information and domain name
    ● automatically extracted business data
● Implementation
● Evaluation
● Appendix
Business Specific Information
● business websites are composed of
  – homepage
    ● title, meta data, anchor texts, body text, etc.
  – many other web pages (structured) / info pages
    ● profile page, contact page, imprint page, etc.
● business specific information contains the relational facts concerning the domain name
● domain name hierarchy
  – focus on business SLDs (second-level domains)
  – where are the business SLDs from?
Example for Business Specific Information
[Figure: screenshot of a business website with labeled regions: Wanted Information, Marketing Information, Web Designer Info, Navigation, Shopping Cart, Ads]
Automatic Extraction of Business Data
domain name: www2.sql-gmbh.de
info page: http://www.sql-gmbh.de/sqlgmbh2007/menue-right/impressum.html

attribute            value
category             company
company name         SQL Gesellschaft für Datenverarbeitung mbH
street               Franklinstraße 25a
zip code             01069
city                 Dresden
phone no.            (0351) 876190
fax no.              (0351) 8761999
mobile no.
email
VAT ID               DE140300780
tax no.
CEO                  Dipl.-Ing.oec. Jürgen Bittner
owner
chairman
contact person
responsible person
representative
management board
supervisory board
local court          Dresden
financial office
register no.         HRB 5256
Overview
● Introduction
● Implementation
  – systematic considerations
  – architecture
  – components
● Evaluation
● Summary
● Appendix
Systematic Considerations
● targeted crawling guided by decisive key terms (anchor texts)
  – sublanguages
● exploitation of HTML tags
  – tree structure (DOM)
● heuristic approach
  – information is located in a certain area
  – information is densely packed
  – attribute-value process (sublanguages)
● extensibility
● updatability
● portability to other languages
System Architecture: ACIET
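[Figure: architecture diagram of ACIET. Components shown: WWW, Crawler, URL-DB, Home Page Analyzer (“Learn”: Anchor Texts), Info Page Analyzer (“Learn”: Internal & External Indicators), Index, Query, User, Result Processing. Steps shown for the Info Page Analyzer (labels A, B, C, ...): Preprocessing, Constructing DOM, Minimal Region, Applying Attribute-Value Process, Postprocessing.]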
Overview
● Introduction
● Implementation
  – systematic considerations
  – architecture
  – components
    ● crawler
    ● (classifier)
    ● info page analyzer
    ● postprocessing
● Evaluation
● Summary
● Appendix
Crawler
● targeted crawler
  – effective
    ● reduces bandwidth and storage requirements
  – statistical evaluation of anchor texts
● sequence of information pages
  – imprint page
  – contact page
  – profile page
  – home page
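The page sequence above can be sketched as a small anchor-text ranking step. This is only an illustration under assumptions, not the ACIET crawler itself: the keyword lists (taken from the German anchor texts Impressum, Kontakt, Über uns), the class `AnchorCollector` and the function `pick_info_pages` are hypothetical names introduced here.

```python
from html.parser import HTMLParser

# Priority order from the slide: imprint, contact, profile; the home page is the fallback.
# The keyword lists are illustrative assumptions.
ANCHOR_PRIORITY = [
    ("imprint", ("impressum",)),
    ("contact", ("kontakt",)),
    ("profile", ("über uns",)),
]


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from a home page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def pick_info_pages(home_html, home_url):
    """Return candidate info pages in crawl order, ending with the home page."""
    collector = AnchorCollector()
    collector.feed(home_html)
    collector.close()
    ordered = []
    for page_type, keywords in ANCHOR_PRIORITY:
        for href, text in collector.anchors:
            if any(keyword in text.lower() for keyword in keywords):
                ordered.append((page_type, href))
                break  # one page per type is enough for this sketch
    ordered.append(("home", home_url))
    return ordered
```

Only the links whose anchor text matches a key term are followed, which is where the savings in bandwidth and storage would come from.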
Info Page Analyzer
● exploitation of HTML tags
  – tree structure (DOM)
  – weight of HTML tags
● minimal data region
● attribute-value process
  – a list of class attributes
  – lexicon and knowledge based class value expressions
    ● internal indicators for business names and streets
    ● regular expressions for numerals like zip code, phone, fax and mobile number, tax number, VAT ID
    ● grammar for city and person names
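As a rough illustration of the regular expressions for numerals, the sketch below matches a five-digit German zip code, a German VAT ID (DE followed by nine digits) and a phone number with an area code. The patterns and the function `extract_numerals` are assumptions made for this example; the expressions actually used by the system are not shown in the slides.

```python
import re

# Illustrative patterns only; the phone pattern in particular is a simplification.
ZIP_CODE = re.compile(r"\b\d{5}\b")
VAT_ID = re.compile(r"\bDE\s?\d{9}\b")
PHONE = re.compile(r"\(?\b0\d{2,5}\)?[ /-]?\d[\d ]{3,}\d")


def extract_numerals(text):
    """Return the first zip code, phone number and VAT ID found in text."""
    found = {}
    for name, pattern in (("zip_code", ZIP_CODE), ("phone", PHONE), ("vat_id", VAT_ID)):
        match = pattern.search(text)
        if match:
            found[name] = match.group(0)
    return found


sample = "Franklinstraße 25a, 01069 Dresden, Tel. (0351) 876190, USt-IdNr. DE140300780"
print(extract_numerals(sample))
```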
Decision of Minimal Data Region
● depth-first traversal
● positive indicators
  – responsible, provider, operator, ...
● negative indicators
  – design, realisation, implementation, ...
  – deletion of nodes preceded by negative phrases and their descendants
● factors for decision of minimal data region
  – attribute-value pairs
    ● zip code, phone number, VAT ID
  – minimal length of minimal region
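A minimal sketch of this decision, under simplifying assumptions: the tree is a hand-built `Node` structure rather than a real DOM, a node is pruned when its own text contains a negative indicator (instead of when it is preceded by one), positive indicators are not modeled, and the required patterns and `MIN_LENGTH` are made-up values. All names are hypothetical.

```python
import re
from dataclasses import dataclass, field

NEGATIVE = ("design", "realisation", "implementation")
# Assumed minimum evidence for a data region: a zip code and a phone number.
REQUIRED = (re.compile(r"\b\d{5}\b"), re.compile(r"\b0\d{2,5}[ /-]\d"))
MIN_LENGTH = 20  # assumed minimal length of a region, in characters


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def prune(node):
    """Copy the tree, dropping negative nodes together with their descendants."""
    kept = [prune(child) for child in node.children
            if not any(word in child.text.lower() for word in NEGATIVE)]
    return Node(node.tag, node.text, kept)


def subtree_text(node):
    parts = [node.text] + [subtree_text(child) for child in node.children]
    return " ".join(part for part in parts if part)


def qualifies(node):
    text = subtree_text(node)
    return len(text) >= MIN_LENGTH and all(p.search(text) for p in REQUIRED)


def minimal_region(node):
    """Depth-first search for the deepest node that still qualifies."""
    if not qualifies(node):
        return None
    for child in node.children:
        region = minimal_region(child)
        if region is not None:
            return region
    return node
```

Calling `minimal_region(prune(root))` returns the smallest subtree that still holds all required attribute-value evidence, or `None` if the page has no such region.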
Attribute-Value Process
● For each requested class, a list of class attributes was compiled.
● If a class attribute occurs, the corresponding value expression is searched for.
● Within tables
  – number of <TD> elements
  – number of delimiters
  – If a character sequence between delimiters does not correspond to any attribute or value, it is simply discarded.
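The process can be sketched on delimiter-separated segments as follows. This is a simplified illustration, not the system's implementation: the attribute lexicon, the value patterns and the function `attribute_value_process` are assumptions, and each attribute simply takes the first unused segment that matches its class pattern.

```python
import re

# Hypothetical excerpt of a class-attribute lexicon (German attribute -> class).
ATTRIBUTE_CLASSES = {
    "Inhaber:": "owner",
    "Tel.": "phone",
    "Fax:": "fax",
    "Internet:": "website",
    "Umsatzsteuer-Nr.:": "tax_no",
}

# Illustrative value expressions per class.
PHONE_LIKE = re.compile(r"0\d{2,5}/\d[\d ]+")
VALUE_PATTERNS = {
    "owner": re.compile(r"[A-ZÄÖÜ][a-zäöüß]+(?: [A-ZÄÖÜ][a-zäöüß]+)+"),
    "phone": PHONE_LIKE,
    "fax": PHONE_LIKE,
    "website": re.compile(r"www\.[\w.-]+\.[a-z]{2,}"),
    "tax_no": re.compile(r"\d{2,3}-\d{3}-\d{5}"),
}


def attribute_value_process(segments):
    """Pair each attribute with the first unused segment matching its class."""
    attributes = [s for s in segments if s in ATTRIBUTE_CLASSES]
    candidates = [s for s in segments if s and s not in ATTRIBUTE_CLASSES]
    used = set()
    result = {}
    for attribute in attributes:
        cls = ATTRIBUTE_CLASSES[attribute]
        for index, candidate in enumerate(candidates):
            if index not in used and VALUE_PATTERNS[cls].fullmatch(candidate):
                result[cls] = candidate
                used.add(index)
                break
    # Segments that are neither attributes nor matched values are discarded.
    return result
```

Because phone and fax share one pattern here, their values are told apart only by the order in which the attributes occur.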
Example of the Attribute-Value Process
SLD: frank-reinhard
<TR>
  <TD> Frank Reinhard Zerspanungstechnik<BR>Theodorstraße 12<BR>28219 Bremen<BR><BR>Inhaber:<BR><BR>Tel.<BR>Fax:<BR>E-Mail:<BR>Internet:<BR>Umsatzsteuer-Nr.:
  <TD> <P></P><P>Frank Reinhard<BR><BR>0421/396 59 00<BR>0421/396 59 01<BR>[email protected]<BR>www.frank-reinhard.de<BR>73-369-01329</P><P></P>
Source Code Revision of Sample SLD
<table>
 <tr>
  <td><p><font>Frank Reinhard Zerspanungstechnik<br> Theodorstraße 12<br> 28219 Bremen</font></td>
  <td><p></td>
  <td></td>
 </tr>
 <tr>
  <td>Inhaber:<br> <br> Tel.<br> Fax:<br> E-Mail: <br> Internet:<br> Umsatzsteuer-Nr.:</font></td>
  <td>Frank Reinhard</font><p><font> 0421/396 59 00<br> 0421/396 59 01<br> <font><a href="mailto:[email protected]" onfocus="this.blur()"> [email protected]</a><br> <a href="http://www.frank-reinhard.de">www.frank-reinhard.de</a><br> 73-369-01329</font></font></td>
  <td></td>
 </tr>
</table>
Resulting tree:
<TABLE>
  <TR>
    <TD> <TD> <TD>
  <TR>
    <TD> <TD> <TD>
lynx: Text Based Web Browser
Frank Reinhard Zerspanungstechnik Theodorstraße 12 28219 Bremen
Inhaber: Tel. Fax: E-Mail: Internet: Umsatzsteuer-Nr.: Frank Reinhard
0421/396 59 00 0421/396 59 01 [email protected] www.frank-reinhard.de 73-369-01329
According to Attribute-Value Process
Postprocessing
● uniform data management
  – street, city, local court, person name, phone and fax number, email, tax number, and VAT ID
● coherence of classes
  – phone area code and city
  – legal form and register number
  – tax number and VAT ID
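Two of these steps can be illustrated with a short sketch: bringing phone numbers into the uniform "(area code) number" form and checking the area code against the extracted city. The tiny `AREA_CODES` table and both function names are assumptions for this example, and the normalization only works when the area code is set off by parentheses, a slash, a hyphen or a space.

```python
import re

# Hypothetical excerpt of an area-code knowledge base.
AREA_CODES = {"0351": "Dresden", "0421": "Bremen"}


def normalize_phone(raw):
    """Return the number as '(area code) digits', or None if it does not parse."""
    match = re.fullmatch(r"\(?(0\d{2,5})\)?\s*[/-]?\s*(\d[\d ]*)", raw.strip())
    if match is None:
        return None
    return f"({match.group(1)}) {match.group(2).replace(' ', '')}"


def phone_matches_city(phone, city):
    """Coherence check: does the phone area code belong to the extracted city?"""
    normalized = normalize_phone(phone)
    if normalized is None:
        return False
    area_code = normalized[1:normalized.index(")")]
    return AREA_CODES.get(area_code) == city


print(normalize_phone("0421/396 59 00"))
print(phone_matches_city("(0351) 876190", "Dresden"))
```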
Overview
● Introduction
● Implementation
● Evaluation
  – evaluation table
  – lack of precision
  – lack of recall
● Summary
● Appendix
Evaluation Table
Type of Information   Total   Extracted   Correct   Precision   Recall
business name           150         150       129       96.3%    86.0%
street                  150         149       147       98.6%    98.0%
zip code                150         150       150      100.0%   100.0%
city                    150         150       150      100.0%   100.0%
phone no.               137         135       134       99.2%    97.8%
fax no.                 125         124       124      100.0%    99.2%
mobile no.               13          13        13      100.0%   100.0%
email                   126         124       124      100.0%    98.4%
VAT ID                   73          72        72      100.0%    98.6%
tax no.                  25          22        22      100.0%    88.0%
CEO                      39          28        28      100.0%    71.7%
business owner           24          21        21      100.0%    87.5%
responsible person       33          24        24      100.0%    72.7%
authorized person        12          11        11      100.0%    91.6%
local court              44          38        38      100.0%    86.3%
register no.             45          38        38      100.0%    84.4%
On average                                              99.1%    91.3%
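For reference, the precision and recall figures can be recomputed from the counts, assuming the usual definitions (precision = correct / extracted, recall = correct / total) and assuming the three count columns are total, extracted and correct in that order. The percentages in the table appear to be truncated to one decimal rather than rounded, so this sketch truncates as well; the function names are introduced here.

```python
def truncated_percent(numerator, denominator):
    """Percentage truncated to one decimal, using integer arithmetic."""
    return (numerator * 1000 // denominator) / 10


def precision_recall(total, extracted, correct):
    """Return (precision, recall) in percent for one type of information."""
    return truncated_percent(correct, extracted), truncated_percent(correct, total)


# Row "street": 150 in total, 149 extracted, 147 correct.
print(precision_recall(total=150, extracted=149, correct=147))
```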
Lack of Precision
● only 3 of the 16 information types fall short of 100% precision, due to
  – mismatches of company names when several businesses occur (SLD: bergener-rathaus-reisebuero)
  – mistakes in street names when internal indicators are missing on the page (SLD: gestuet-schlossberg)
    ● “Zachow 5” has no identifiable suffix of a regular street name, and our system located the street name “Ridlerstraße 31 B”
  – non-resolution of ellipsis in phone numbers
    ● 02851/8000+6200 is transformed to (02851) 80006200, but the deletion of “+” is not correct
Lack of Recall
● 13 of the 16 information types fall short of 100% recall
● their incomplete recognition or non-recognition is due to
  – Flash animations, JavaScript and images that hide the piece of information searched for
  – missing external indicators on information pages
  – textual representations of phone numbers, e.g. 0700 TEATRON
  – informal specification of tax number, register number, etc.
Overview
● Introduction
● Implementation
● Evaluation
● Summary
● Appendix
Summary
● The system ACIET
  – automates the search for business specific information and the maintenance of the extracted information
  – is modular, scalable and extensible
  – is applicable to other languages
  – can integrate other modules, such as a text analyzer for sector classification
● Application areas
  – business directory service, job offering service, etc.
DEMO
Address Finder Demo
http://www.cis.uni-muenchen.de/~yeong/ADDR_Finder/addr_finder_de_v12.html
Thank you!
Overview
● Introduction
● Implementation
● Evaluation
● Summary
● Appendix
Statistical Evaluation of Anchor and URL Texts
[Figure: two charts; legend: Impressum (imprint), Kontakt (contact), Über uns (about us), Andere (others), Startseite (start page)]
anchor text statistics
  Imprint: 1,674
  Contact: 293
  Start page: 106
  Others: 105
  About us: 26
  Total: 2,214
URL text statistics
  Imprint: 1,131
  Contact: 348
  Others: 237
  Start page: 106
  About us: 14
Weight of HTML Tags
● HTML tags are classified into block, list, table, image and character tags
● Example
  – <tr><td><i><b>foo</b></i></td> <td><b><i>bar</i></b></td></tr>
(a) total tree
<tr>
  <td>
    <i>
      <b>
        foo
  <td>
    <b>
      <i>
        bar
(b) weighted tree
<tr>
  <td> foo
  <td> bar
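One way to obtain such a weighted tree is to skip character tags while building the tree, so that their text attaches directly to the enclosing table, block or list node. The sketch below uses Python's standard `html.parser`; the set `CHARACTER_TAGS`, the dictionary-based node layout and the names `WeightedTreeBuilder` and `weighted_tree` are assumptions for illustration, not the weighting scheme of the system.

```python
from html.parser import HTMLParser

# Assumed members of the "character tag" class.
CHARACTER_TAGS = {"b", "i", "u", "em", "strong", "font", "span"}


class WeightedTreeBuilder(HTMLParser):
    """Builds a tree in which character tags are dropped and their text is kept."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in CHARACTER_TAGS:
            return  # character tags carry no structural weight
        node = {"tag": tag, "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in CHARACTER_TAGS:
            return
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["text"] += data.strip()


def weighted_tree(html):
    builder = WeightedTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root["children"]
```

For the example row, both cells end up as plain `td` nodes holding their text, regardless of whether the source nests `<i><b>` or `<b><i>`.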
Overview of External Indicators
Class               External Indicators
business name       99
phone no.           25
fax no.             7
mobile no.          13
email               16
CEO                 23
business owner      16
contact person      10
chairman            23
management board    4
VAT ID              97
tax no.             25
register no.        22
local court         28
tax office          4
Table Types
[Figure: four table layouts, each drawn as a table / tr / td tree]
Type 1: Attr1 Value1 Attr2 Value2
Type 2: Attr1 Attr2 Value1 Value2
Type 3:
  Attr1<Delimiter> Value1
  Attr2<Delimiter> Value2
  Attr3<Delimiter> Value3
  Attr4<Delimiter> Value4
Type 4:
  Attr1<Delimiter> Attr2
  Value1<Delimiter> Value2
  Attr3<Delimiter> Attr4
  Value3<Delimiter> Value4