BNSI updates
WP2 Face to Face meeting
Gdańsk, October 5-6, 2017
2 "P. Volov" Str., 1038 Sofia, Bulgaria, tel. +359 2 9857 729
Initial conditions
• The task
  • Finding websites of enterprises with 10 and more employees
• Software environment
  • Windows OS
  • MySQL, MSSQL, …
  • PHP, VB, Java, …
• Legal issues
  • No legal constraints at national level for web scraping
Initial data from Business Register
CSV file with 26836 businesses
• 20649 e-mails
• 2006 urls
• Addresses
• Phone numbers
• NACE codes
• …
MySQL Database
Table ikturl Structure (1)
Table ikturl Structure (2)
Scripts
Common features of the search scripts
• Run in browser
• <meta http-equiv="Refresh" content="30">
• Timestamps
Script geturl.php
• Checks if the initial 2006 urls are real websites
• Constructs domain names from the 20649 e-mails
• Checks if the constructed domains are real websites
• Saves the results in the database
• Result - 7038 possible urls
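The domain-construction step can be sketched as follows. This is a minimal Python illustration, not the actual geturl.php: the function names, the `http://www.` prefix, the list of public mail providers and the rule that such addresses are skipped are all assumptions made for the example.

```python
import urllib.request

# Hypothetical list of public mail providers whose domains cannot be an
# enterprise website (an assumption, not taken from geturl.php).
PUBLIC_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "abv.bg", "mail.bg"}


def candidate_url(email):
    """Build a candidate website URL from the domain part of an e-mail address.

    Returns None when the address is malformed or uses a public provider.
    """
    _, sep, domain = email.strip().lower().rpartition("@")
    if not sep or "." not in domain or domain in PUBLIC_MAIL_DOMAINS:
        return None
    return "http://www." + domain


def is_live(url, timeout=10.0):
    """Return True if the URL answers with a 2xx or 3xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # Covers DNS failures, refused connections, HTTP errors and bad URLs.
        return False
```

For example, `candidate_url("office@firma.bg")` yields `http://www.firma.bg`, which would then be passed to `is_live` before being saved as a possible URL.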
Script jabse_search.php
• Uses the automated search interface of http://www.jabse.com (jabse_interface.php)
• Gets up to 10 search results for each name in Bulgarian, and the same for the name transliterated into Latin script
• Excludes complex urls from the search results
• Saves the results in the database
• Result - 15638 results in Bulgarian and 16201 results in Latin
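The slide does not define what counts as a "complex" url. One plausible reading, sketched below in Python, is that only root-level addresses are kept and anything with a path, query string or fragment is dropped. The function names and this definition are assumptions for illustration, not the rule used in jabse_search.php.

```python
from urllib.parse import urlparse


def is_simple_url(url):
    """Assumed rule: a 'simple' url has a host and no path, query or fragment."""
    parts = urlparse(url)
    return (
        bool(parts.hostname)
        and parts.path in ("", "/")
        and not parts.query
        and not parts.fragment
    )


def keep_simple(results, limit=10):
    """Keep at most `limit` simple urls, preserving the search-result order."""
    return [url for url in results if is_simple_url(url)][:limit]
```

Under this rule a result such as `http://catalog.bg/firms?id=5` would be discarded, while `http://firma.bg/` would be kept as a candidate enterprise website.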
Script google_search.php
• Uses the Google search interface
• Gets up to 10 search results for each name in Bulgarian
• Saves the results in the database
• Result - 26829 sets of up to 10 search results
Google search interface
• 200 free searches per day
• EUR 5 per 1000 searches, up to 10000 searches per day
• EUR 300 of free search credit on credit card registration
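To put these limits in context, the short Python calculation below applies them to the 26829 searches reported for google_search.php. It is a rough sketch: it assumes pro-rata billing per search and ignores the free daily quota on paid days, neither of which is stated on the slide.

```python
import math

SEARCHES = 26829          # searches run by google_search.php
FREE_PER_DAY = 200        # free quota from the slide
PAID_MAX_PER_DAY = 10000  # daily ceiling for paid searches
EUR_PER_1000 = 5.0        # price from the slide
CREDIT_EUR = 300.0        # credit granted on credit card registration

# Working days needed if only the free quota is used.
days_free = math.ceil(SEARCHES / FREE_PER_DAY)

# Working days needed at the paid daily ceiling.
days_paid = math.ceil(SEARCHES / PAID_MAX_PER_DAY)

# Cost of paying for every search (assumed pro-rata billing).
cost_paid = SEARCHES / 1000 * EUR_PER_1000

print(days_free, days_paid, round(cost_paid, 2), cost_paid <= CREDIT_EUR)
```

Under these assumptions the free quota alone would take 135 days, while the paid tier finishes in 3 days at a cost that stays within the EUR 300 credit.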
Script list.php
Database browsing interface that displays each enterprise with its characteristics and url search results, and lets the user choose the correct url for that enterprise
list.php.htm
Manual work done
• 26836 records were checked in 45 working days
• 600 records per work day
• 9809 urls were found
• 36.6 % of enterprises have websites
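The figures above can be cross-checked with a few lines of Python. All inputs come from the slide; only the variable names are invented for the example.

```python
RECORDS = 26836      # records checked manually
WORKING_DAYS = 45    # working days spent
URLS_FOUND = 9809    # enterprises for which a url was confirmed

# Average manual throughput, which the slide rounds to 600 records per day.
records_per_day = RECORDS / WORKING_DAYS

# Share of enterprises with a confirmed website, in percent.
website_share = URLS_FOUND / RECORDS * 100

print(round(records_per_day), round(website_share, 1))
```

The exact throughput is about 596 records per working day, and the website share rounds to the 36.6 % quoted on the slide.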
The scripts
• Built for the Bulgarian context
• Not designed to work with a different database table structure
• Hard-coded labels in Bulgarian
Use Case 1: URLs Inventory
• BNSI software
SBR enterprises with 10 and more employees
• 26836 enterprises
• 20649 e-mails, 2006 urls
• Addresses, Phones, NACE codes…
Search in JABSE, Google – BNSI software
Search in Bing – ISTAT software URLSearcher
9809 urls were found
36.6 % of enterprises with 10 and more employees have websites
Use Case 1: URLs Inventory
• ISTAT software
URLSearcher, RootJuice, URLScorer and URLMatchTableGenerator programs from ISTAT, with Apache Solr as the storage platform
• Total number of enterprises: 26836
• URLMatchTableGenerator predicts the right URLs for 67 % of the enterprises
• A better list of yellow pages and internet catalogues is needed
• The Apache Solr version should match the one suggested by ISTAT
URLScorer
URLScorer
URLMatchTableGenerator
Use Case 2: E-commerce
PHP script with 3 logics (4 positive and 1 negative lists of keywords)
• e-commerce: 1139
• e-commerce v2: 1048
• e-commerce v3: 662
Manually checked: 856 e-commerce
10% sample: 27 e-commerce
856 + 27 × 10 = 1126 e-commerce
11.5% of enterprises (10+ employees) with websites do e-commerce
4.2% of enterprises with 10 and more employees do e-commerce
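The estimate can be reproduced as follows. The sketch assumes, as the arithmetic on the slide suggests, that the 27 e-commerce sites found in the 10% sample are scaled up by a factor of 10 and added to the 856 manually confirmed sites. The variable names are invented for the example, and the totals of 9809 websites and 26836 enterprises come from Use Case 1.

```python
CONFIRMED = 856       # e-commerce sites confirmed by manual checking
SAMPLE_HITS = 27      # e-commerce sites found in the 10% sample
SCALE_FACTOR = 10     # inverse of the 10% sampling fraction
WEBSITES = 9809       # enterprises with a website (Use Case 1)
ENTERPRISES = 26836   # all enterprises with 10 and more employees

# Confirmed sites plus the sample result scaled up to the full group.
estimate = CONFIRMED + SAMPLE_HITS * SCALE_FACTOR

share_of_websites = estimate / WEBSITES * 100
share_of_enterprises = estimate / ENTERPRISES * 100

print(estimate, round(share_of_websites, 1), round(share_of_enterprises, 1))
```

The result matches the slide: 1126 estimated e-commerce sites, or 11.5% of enterprises with websites and 4.2% of all enterprises in scope.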
Use Case 3: Social Media Presence
PHP script (crawling only the first page of each enterprise website)
• facebook: 2356
• twitter: 922
• linkedin: 560
• google: 871
• youtube: 527
• pinterest: 139
• instagram: 127
24.9% of enterprises (10+ employees) with websites have at least one social media profile
9.1% of enterprises with 10 and more employees use at least one of the listed social media
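A first-page check of this kind might look like the Python sketch below, which scans the links on a home page for known social media hosts. It is not the PHP script from the project: the class and function names are invented, and the platform-to-domain mapping (in particular treating "google" as plus.google.com) is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed mapping from the platforms on the slide to their host names.
PLATFORMS = {
    "facebook": "facebook.com",
    "twitter": "twitter.com",
    "linkedin": "linkedin.com",
    "google": "plus.google.com",
    "youtube": "youtube.com",
    "pinterest": "pinterest.com",
    "instagram": "instagram.com",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def social_profiles(html):
    """Return the set of platforms linked from the given home page HTML."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        try:
            host = (urlparse(href).hostname or "").lower()
        except ValueError:
            continue  # skip malformed links
        for platform, domain in PLATFORMS.items():
            if host == domain or host.endswith("." + domain):
                found.add(platform)
    return found
```

An enterprise would then count as present on social media when `social_profiles` returns a non-empty set for its home page.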
Verification of the results with ICT survey data – 2016 (E-commerce)
Verification of the results with ICT survey data – 2016 (Social media presence)
Plans for SGA-II
• Enterprise URLs Inventory – improvements from SGA-I
• E-commerce on enterprises’ websites – improvements from SGA-I and testing of the ISTAT software
• Social Media Presence on enterprises’ webpages – improvements from SGA-I
• Job advertisements on enterprises’ websites – new
THANK YOU FOR YOUR ATTENTION!