
Post on 24-Mar-2022


Eurostat

Big Data: Effective Processing and Analysis of Very Large and Unstructured Data for Official Statistics

Dealing with Schemaless Data: Examples and Applications

Monica Scannapieco

Istat ([email protected])


Example 1: Census LOD Project

• Datalift Platform:

• To design and implement the LOD production process

• Steps:

• Dataset Upload

• Ontologies Upload

• Mapping to RDF

• LOD Publishing

• Example of LOD-Based services: Querying and Visualization
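
As a rough illustration of the "Mapping to RDF" step, the sketch below turns one tabular record into N-Triples lines. It is not the Datalift mapping module: the namespaces, the column names and the sample record are all made up for this example.

```python
# Hypothetical namespaces, chosen only for illustration.
BASE = "http://example.org/census/"
PROP = "http://example.org/census/property/"


def row_to_ntriples(row, id_column):
    """Turn one tabular record into N-Triples lines with literal objects."""
    subject = f"<{BASE}section/{row[id_column]}>"
    lines = []
    for column, value in row.items():
        if column == id_column:
            continue  # the identifier is already encoded in the subject URI
        # Escape backslashes and double quotes inside the literal.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{PROP}{column}> "{literal}" .')
    return lines


# Made-up sample record, not taken from the census data.
row = {"section_id": "0580001", "municipality": "Roma", "population": 312}
triples = row_to_ntriples(row, "section_id")
for line in triples:
    print(line)
```

Each non-identifier column becomes one predicate, so a record with n attribute columns yields n triples sharing the same subject. A real mapping would also reuse terms from the uploaded ontologies instead of minting one property per column.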


Census LOD Project: Recap

• Data Model: RDF Graph

• Query Language: SPARQL
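
To make the pairing of an RDF graph with SPARQL concrete, the sketch below stores a graph as a set of (subject, predicate, object) triples and evaluates a conjunction of triple patterns, which is the core of a SPARQL SELECT query. It is a toy evaluator written for this note, not a SPARQL engine, and every identifier in the graph is invented.

```python
# A tiny in-memory graph; all identifiers are made up for illustration.
graph = {
    ("addr:1", "ex:street", "Via Roma"),
    ("addr:1", "ex:inSection", "sec:10"),
    ("addr:2", "ex:street", "Via Milano"),
    ("addr:2", "ex:inSection", "sec:11"),
    ("sec:10", "ex:population", "312"),
}


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match exactly.
            return None
    return result


def query(patterns, graph):
    """Evaluate a conjunction of triple patterns against the graph."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Roughly equivalent to:
#   SELECT ?street ?pop WHERE {
#     ?a ex:street ?street . ?a ex:inSection ?s . ?s ex:population ?pop }
results = query(
    [
        ("?a", "ex:street", "?street"),
        ("?a", "ex:inSection", "?s"),
        ("?s", "ex:population", "?pop"),
    ],
    graph,
)
print(results)
```

Only the first address produces a result, because the section of the second address has no population triple. This shows how a query follows links across the graph without a fixed schema.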


Screenshot Live Demo - 1

Addresses Triples


Screenshot Live Demo - 2

Linked Census Section


Screenshot Live Demo - 3

Linked Census Section


Screenshot Live Demo - 4

Linked Census Section


Example 2: Scraping and Processing Web Documents

• Apache Platform:

• Nutch: Scraper

• Lucene: Document access

• SOLR: Document management

• Steps:

• Configure and Launch Nutch scraper

• Configure SOLR

• Access the Lucene API for processing


Example 2: Configure and Launch Nutch

• Set parameters like:

• Seed: URLs from which crawling starts

• Width and depth of navigation

• Regular expression the URLs should conform to

• HTML tags to keep

• Data object types to include (e.g. images, etc.)
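
The interplay of seed, depth, width and URL regular expression can be sketched as a bounded breadth-first traversal. The code below is not Nutch and does not use its configuration files. It walks a made-up in-memory link graph instead of fetching pages, and the parameter names (`max_depth`, `max_links_per_page`, `url_pattern`) are assumptions chosen for this illustration.

```python
import re
from collections import deque


def crawl(seeds, links, max_depth, max_links_per_page, url_pattern):
    """Breadth-first traversal of a link graph under crawl-style limits.

    `links` maps a URL to the URLs found on that page. It stands in for
    the fetching and parsing that a real crawler performs over HTTP.
    """
    allowed = re.compile(url_pattern)
    visited = []
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow outlinks
        # Keep only outlinks that conform to the regular expression.
        outlinks = [u for u in links.get(url, []) if allowed.match(u)]
        # Width limit: follow at most this many links per page.
        for target in outlinks[:max_links_per_page]:
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Made-up link graph used in place of the live web.
toy_web = {
    "http://example.org/": [
        "http://example.org/a",
        "http://other.net/x",
        "http://example.org/b",
        "http://example.org/c",
    ],
    "http://example.org/a": ["http://example.org/a/deep"],
    "http://example.org/a/deep": ["http://example.org/a/deeper"],
}

pages = crawl(
    ["http://example.org/"],
    toy_web,
    max_depth=2,
    max_links_per_page=2,
    url_pattern=r"^http://example\.org/",
)
print(pages)
```

In this run the regular expression drops the link to the other domain, the width limit drops the third on-site link, and the depth limit stops the traversal before the deepest page.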


Example 2: SOLR Features and Config

• Defines the field types and fields of documents

• HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)

• Natural Language Processing settings:

• E.g. whitespace removal, stemming, etc.

• Index settings:

• Tokens to index
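
The effect of such text-analysis settings on what ends up in the index can be sketched in a few lines. The code below is only an illustration in plain Python, not Solr's analysis chain. In particular, `naive_stem` is a deliberately crude suffix stripper invented for this example and is much weaker than a real stemmer.

```python
import re


def naive_stem(token):
    """Strip a few common English suffixes (illustrative, not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, discard whitespace and punctuation, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens]


index_tokens = analyze("  Processing   of the scraped documents  ")
print(index_tokens)
```

The returned tokens are what an index built with these settings would store, so a query for "document" would also find text that only contains "documents".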


Example 2: SOLR Querying

Searches: wildcard, fuzzy, proximity, range, etc.
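
The search types listed above map onto the standard Solr query syntax and can be sent through the HTTP interface. The sketch below only builds the request URLs and does not contact a server. The host, the core name `webdocs` and the field names `content` and `fetch_year` are assumptions made for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr endpoint: host, port and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/webdocs/select"

# One example query per search type; field names are hypothetical.
QUERIES = {
    "wildcard": "content:statist*",          # any term starting with "statist"
    "fuzzy": "content:census~1",             # terms within edit distance 1
    "proximity": 'content:"big data"~5',     # both words within 5 positions
    "range": "fetch_year:[2012 TO 2014]",    # inclusive range on a field
}


def select_url(query, rows=10, wt="json"):
    """Build a select URL with the query, result count and response format."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "rows": rows, "wt": wt})


urls = {name: select_url(query) for name, query in QUERIES.items()}
for name, url in urls.items():
    print(name, url)
```

The `wt` parameter selects one of the configurable response formats, so the same query can be returned as JSON, XML or CSV by changing a single value.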

Example 2: Lucene API

• Programmatic access is possible for specific purposes

• Example:

• Computation of a Term-Document Matrix
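
A term-document matrix has one row per term and one column per document, with each cell holding the number of times the term occurs in the document. The sketch below computes it in plain Python from raw strings. It does not use the Lucene API, which would read term frequencies from the index, and the three sample documents are invented.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) with matrix[i][j] = count of term i in doc j."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix


# Made-up sample documents.
docs = [
    "big data for statistics",
    "web data scraping",
    "official statistics",
]

vocabulary, matrix = term_document_matrix(docs)
for term, row in zip(vocabulary, matrix):
    print(f"{term:12s} {row}")
```

Rows of this matrix can be compared to find related terms, and columns can be compared to find similar documents, which is why the matrix is a common starting point for text analysis on scraped content.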