
CATER

- Coordinating Air transport Time Efficiency Research

FP7-AAT-2013-RTD-1

Deliverable no.: D5.2

Deliverable Title: Design search engine implementation and DB

Organisation name of lead Contractor for this Deliverable: CIAOTECH

Author(s): Valeria Marino, Paolo Salvatore

Participant(s): -

Work package contributing to the deliverable: 5

Task contributing to the deliverable: T5.2

Version: 1

Total Number of Pages: 13

“This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 605497.”

D 5.2 Design search engine implementation and DB


Table of Versions

Version: 1
Date: 03/06/2014
Authors: Valeria Marino
Version Description: For delivery
Reviewers: Paolo Salvatore
Date of Approval: 06/06/2014

Authored by: Valeria Marino, Paolo Salvatore


Table of Contents

1. Summary
2. Introduction
3. CASK Database
3.1. Contents
3.2. Search Engine
3.3. The Specific Domain Semantic approach
3.4. The Document Management System
3.5. The “Qualified contents” approach to build a community driven set of qualified contents
4. Labelling/Tagging of the CASK Contents
5. Conclusions


1. Summary

This report describes the work performed in Work Package 5, Task 5.2 of the European project “Coordination of Air-transport Time Efficiency Research” (CATER). During this task, the CASK database and the architecture of the search engine, with its main functionalities, were designed.

2. Introduction

This deliverable describes the database and search engine underlying CASK, the ICT tool that will support the CATER project activities.

The first part describes the contents offered by the CASK database and gives general details of the database.

The second part explains how the CASK environment has been designed to let users search for relevant contents by filtering them through pre-qualified attributes (tags) and by making use of semantic technologies.

To design and develop these functionalities, CTECH reused a search engine (with the related visualisation of search results) built in a previous project and customised it for CATER. The search engine and the contents are accessed through REST services provided by the external provider.

3. CASK Database

3.1. Contents

CATER will manage the following sets of contents (data sources):

Contents by the community: these include contents (mainly documents) provided by the members of the community, who share them as data sources from which relevant information can be extracted. To allow easy integration of such information into the CASK system, a fully fledged Document Management System has been included. Contents that can be uploaded and searched in CASK can be in any language, as the Document Management System allows full-text search on contents in any language.

Contents searched in databases or other sources not provided by the community members: these include a set of data sources where information useful for research on air transport time efficiency can be found:

- Database of European Patents: CASK will access, through REST services, a search service (provided by a subcontractor) that searches the full set of XML-based data from the European Patent Office. Members of the CATER community will be able to query this database through CASK in English, as all patents include a machine translation in English.

- Scientific Papers: CASK will access, through a REST service, the Open Access scientific papers collected through the available API (Application Programming Interface) of the CORE repository, thanks to an agreement with the Open University of London. CORE integrates Open Access papers from several sources and in all sectors.

- Database of grants for Research and Development: this database is provided by CTECH. It is a proprietary database of all European funding programmes for research and development and of national funds (six countries are covered, namely the Netherlands, Germany, the UK, France, Belgium and Italy). All grants are provided in English. For national grants, a searchable abstract in English is also present.

- Database of funded EU projects for research and development: this database is provided by CTECH and has been built on the official database of the European Commission (available on CORDIS). It will be integrated in the coming months with some national databases of funded projects. European projects are all described in English.

- Projects seeking partners: these are research projects seeking partners or collaborations. The database is provided by CTECH and includes projects developed by clients and partners of CTECH. It might also include projects shared on social networks or other public networks. All projects seeking partners are provided in English.

- The Web – CASK crawled and indexed URLs: CASK has its own crawler that, starting from URLs suggested by the community members, crawls (with a depth of 1 to 3 links) and indexes relevant web pages. Users will be able to suggest new URLs and launch searches on the indexed web pages. Because the crawl starts from pages already known to the users, the set of crawled pages is expected to be quite relevant to the users' specific domain of interest. The set of crawled and indexed pages will grow over time, as the process is incremental.

Figure 1 – CASK contents for searches
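The depth-limited crawl described for the Web source can be illustrated with a minimal breadth-first sketch. It is not the CASK crawler: the in-memory link graph `LINKS` stands in for real page fetching, and the names `crawl` and `LINKS` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched web pages.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["seed"],
    "c": ["d"],
    "d": [],
}

def crawl(seed, max_depth, links=LINKS):
    """Return the pages reachable from seed within max_depth link hops, in visit order."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:  # index each page only once
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order

print(crawl("seed", 1))
print(crawl("seed", 3))
```

Raising `max_depth` from 1 to 3 widens the set of indexed pages, which is the trade-off between coverage and relevance implied by the 1 to 3 link range.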


3.2. Search Engine

The CASK system offers several advanced search features over the contents listed above, which help users retrieve meaningful results from their queries. The most important ones are:

- highlighting of the search terms and their synonyms in the search results;

- expansion of natural language queries with synonyms and related words;

- suggestions of possible completions for queries (while the user is typing them);

- automatic classification and tagging of contents;

- suggestions of similar and related documents in the query results or in document listings;

- re-ranking of the results obtained with a full-text search, to take into account semantic similarities with the original and expanded query;

- faceting of the search results into groups or clusters (also useful for further filtering of the results).
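Two of the features listed above, query expansion and highlighting, can be sketched in a few lines. This is an illustrative sketch only: the synonym table is hypothetical (its entries are borrowed from Table 1), and the functions `expand_query` and `highlight` are not CASK or Solr APIs.

```python
# Hypothetical synonym table; entries borrowed from Table 1 for illustration.
SYNONYMS = {
    "aircraft": ["aeroplane", "airliner"],
    "luggage": ["baggage"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

def highlight(text, terms):
    """Wrap every word of text that matches one of the terms in asterisks."""
    marked = []
    for word in text.split():
        core = word.strip(".,;:!?").lower()  # ignore surrounding punctuation
        marked.append(f"*{word}*" if core in terms else word)
    return " ".join(marked)

terms = expand_query("Aircraft luggage")
print(terms)
print(highlight("The airliner carries baggage", terms))
```

Highlighting against the expanded term list, rather than the original query, is what lets synonyms be marked in the results as well.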

To process a huge amount of documents and information, a very robust architecture is used, which provides service reliability and redundancy of the stored data so as to ensure high availability of the services. For this purpose, the features made available by Apache Solr 4, the latest version of the product, have been fully exploited, in particular the distribution of indexes across multiple servers and the automatic replication of data, known as Solr Cloud (see footnote 1).

On each instance of Apache Solr, a repository (called a Solr Core) can be defined for each source of information; it essentially identifies an index of that data. With Solr Cloud, introduced in the latest version of the product, a single index can now be divided across separate instances, which reduces the size of very large indexes and improves scalability.
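The idea of dividing one index across several instances can be sketched as hash-based partitioning of document IDs. This is a simplified illustration, not Solr Cloud's actual routing: CRC32 is used here only as a convenient stable hash, and `shard_for` and `partition` are hypothetical names.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Map a document ID to a shard number using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def partition(doc_ids, num_shards):
    """Group document IDs by the shard that would hold them."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

shards = partition([f"doc-{i}" for i in range(100)], 4)
for number, ids in shards.items():
    print(number, len(ids))
```

Because the hash is deterministic, the same document always lands on the same shard, so each instance holds a smaller index and a lookup by ID needs to contact only one of them.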

Thanks to these choices, the system uses some of the best technologies available in the Open Source world for managing Big Data, and it should scale well, by enhancing the hardware, if a large number of users start to use the platform in the future.

The overall search process is quite demanding in terms of computing. For this reason, the subcontractor Innovation Engineering has made available to the project a set of virtual machines on its own servers, hosted in a primary data center in Rome (Italy).

3.3. The Specific Domain Semantic approach

One specific feature of the overall search process is the use of a semantic chain to retrieve relevant results. The semantics are applied through Latent Semantic Analysis (LSA), a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the two vectors formed by any two rows. Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words.

1 More information is available at http://wiki.apache.org/solr/SolrCloud

Figure 2 - LSA process
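The LSA steps described above (word-by-paragraph count matrix, SVD, cosine between word rows) can be reproduced on a toy corpus. This is a minimal sketch using NumPy, not the CASK implementation: the four paragraphs, the choice of `k = 2` retained dimensions and the function name `similarity` are assumptions made for illustration.

```python
import numpy as np

# Toy corpus: each string plays the role of one paragraph (an assumption).
paragraphs = [
    "aircraft flight pilot",
    "aeroplane flight pilot",
    "luggage baggage trolley",
    "baggage luggage cargo",
]

# Rows are unique words, columns are paragraphs, cells are word counts.
vocab = sorted({word for p in paragraphs for word in p.split()})
counts = np.array(
    [[p.split().count(word) for p in paragraphs] for word in vocab],
    dtype=float,
)

# SVD, then keep only the k largest singular values to reduce the columns.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]  # one k-dimensional vector per word

def similarity(a, b):
    """Cosine of the angle between the reduced vectors of two words."""
    va = word_vectors[vocab.index(a)]
    vb = word_vectors[vocab.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(similarity("aircraft", "aeroplane"))
print(similarity("aircraft", "luggage"))
```

In this toy corpus `aircraft` and `aeroplane` never share a paragraph, so their raw count rows are orthogonal; after the reduction their vectors point the same way because both co-occur with `flight` and `pilot`, while `aircraft` and `luggage` remain dissimilar.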

A matrix specific to the Aerospace sector was developed by CTECH in the framework of another research project. This matrix has been fine-tuned for the CATER project using texts extracted from URLs and documents that the project partners had already indicated as relevant for the project purposes and for the Aerospace sector itself.

The following table reports an example of similarities calculated by CASK inside the matrix for some words related to the Aerospace sector.


Table 1 – Example of words considered similar by CASK during searches

Query | Similarities
Passenger | aircraft train traveler motorist crew_member commuter vehicle pilot
Aircraft | aeroplane helicopter fly flight airliner vehicle wing biplane airship
Airplane | aircraft glider boat fixed-wing airliner airship boats uav vehicle monoplane
Airports | international_airport downtown suburb road city railway_station park building_in harbour
Air traffic management | route stream flow transport security delivery network_traffic communication monitoring scheduling
Destination | address destination_address access_point sender endpoint route
Flight | landing fly aircraft aeroplane takeoff cruise spacecraft land craft helicopter
Intermodal transportation | public_transport freight logistics transit shipping public_transportation procurement road_transport parking warehousing
Luggage | baggage pallet carry-on cargo trolley cart wheelchair truck valuables crate
Pilot | crew helicopter aircraft astronaut test_pilot flight trainer flight_crew passenger crew_member
Planes | axis wing direction angle horizontal edge vertical aeroplane perpendicular trajectory
Propulsion | propulsion_system nasa thruster rocket_engine hypersonic jet_engine space_vehicle flight_control unmanned avionics
Runway | roadway taxiway landing cockpit ramp deck train dock horizon corridor
Travel | transit trip journey ride sailing trade taxi shopping long_distance Transportation
Security screenings | privacy protection coverage trainings physical_security training_sessions assurance reporting data_integrity security_measure
Security radar | surveillance sonar defence airborne tracking intelligence protection operational interception ground-based


3.4. The Document Management System

A Document Management System (DMS) has also been implemented in the CASK environment to manage documents internal to the community. The integrated DMS is an instance of a DMS already developed by CTECH in a previous project and customized for the needs of CASK. It should be noted that this DMS is not simply an integration of a DMS already available in the Open Source community, capable of storing, saving and retrieving binary documents. Since Apache Solr (see footnote 2) was chosen to implement the search system, a DMS based on Apache Solr has been used to better integrate the two, instead of a complex, full-featured DMS such as Alfresco or OpenKM. Solr is a popular open source platform for enterprise search, and it can easily be combined with other tools to extend its capabilities in the semantic analysis of the text contained in the documents.

Solr extracts and analyzes the documents’ text and indexes their metadata (such as title, author and short description), but it cannot store the binary form of the documents for subsequent retrieval. Solr has therefore been integrated with Apache Hadoop (see footnote 3) to provide a simple but powerful binary content repository. This Apache project provides an implementation of a distributed file system called HDFS (Hadoop Distributed File System). HDFS is designed to store large files and can easily deal with big documents: it can handle files of many gigabytes and distributes the load across multiple machines. It achieves reliability by replicating the data across multiple hosts; its nodes talk to each other to rebalance data, to move copies around and to keep the replication of data high.

A simple but rich metadata schema for documents and folders has been defined in Solr and is used for indexing and for making the metadata searchable. When uploading a file, the user can define and attach tags or categories to the document, which are later used in searches to categorize and filter the results. Each company registered in the system can define the language in which the majority of its documents are written. The Solr schema reflects this setting and uses different analysis and indexing tools if the language is not English.

Indexing a document, i.e. analysing its textual content according to standard semantic techniques, is computationally heavy and complex. For this reason, the system is designed to carry out this operation in the background: as soon as the document is uploaded to the server, an asynchronous thread is started to analyze the text, which is extracted with Apache Tika (integrated in Solr). By combining the capabilities of Solr, Tika and Hadoop, a complete Document Management System has been implemented, which users can adopt to store their documents safely in the cloud, make them searchable with an advanced search system and share them with co-workers or colleagues in an easy, safe and immediate way.
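The background indexing described above, where an upload returns at once and the text analysis runs in its own thread, can be sketched as follows. Everything here is a stand-in: the dictionary `index` replaces the Solr index, `extract_text` replaces Tika, and the function names are hypothetical.

```python
import threading

index = {}                      # hypothetical in-memory stand-in for the Solr index
index_lock = threading.Lock()   # protects the index against concurrent writes

def extract_text(raw_bytes):
    """Placeholder for text extraction (done with Tika in the real system)."""
    return raw_bytes.decode("utf-8", errors="ignore")

def analyse_and_index(doc_id, raw_bytes, tags):
    """The heavy step: extract the text, tokenise it and store the result."""
    tokens = extract_text(raw_bytes).lower().split()
    with index_lock:
        index[doc_id] = {"tokens": tokens, "tags": list(tags)}

def upload(doc_id, raw_bytes, tags=()):
    """Start indexing in a background thread and return without waiting for it."""
    worker = threading.Thread(target=analyse_and_index, args=(doc_id, raw_bytes, tags))
    worker.start()
    return worker

worker = upload("doc-1", b"Airport security screening times", tags=["Air"])
worker.join()  # demonstration only: wait so the result can be inspected
print(index["doc-1"])
```

In a real deployment the caller would not join the thread; the document simply becomes searchable once the analysis has finished.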

Searches on documents can be performed by keywords, optionally with additional filters (an advanced search based on file name, file content or tags assigned to the document is also possible), and the results retrieved are ordered by rank. Rank is the score assigned by the search engine to each result according to its relevance to the search terms used. The search results highlight the search terms and show a list of text snippets in which the search terms are contained. The results can be filtered using facets (grouping of the results in dynamic categories), based on the tags assigned to the document and on the document creation date.

2 http://lucene.apache.org/solr/
3 http://hadoop.apache.org/
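A keyword search with highlighting, tag facets and tag filters maps naturally onto the standard Solr request parameters `q`, `hl`, `hl.fl`, `facet`, `facet.field` and `fq`. The sketch below only builds the request URL; the host, the core name `documents` and the field names `content` and `tags` are assumptions, not the actual CASK schema.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, core, keywords, tags=(), rows=10):
    """Build a Solr select URL with highlighting, a tag facet and optional tag filters."""
    params = [
        ("q", keywords),           # full-text query; Solr orders results by score by default
        ("rows", rows),
        ("wt", "json"),
        ("hl", "true"),            # return highlighted snippets
        ("hl.fl", "content"),      # hypothetical field holding the extracted text
        ("facet", "true"),
        ("facet.field", "tags"),   # hypothetical field holding the user-assigned tags
    ]
    for tag in tags:
        params.append(("fq", f'tags:"{tag}"'))  # one filter query per selected tag
    return f"{base_url}/{core}/select?{urlencode(params)}"

url = build_solr_query("http://localhost:8983/solr", "documents", "boarding time", tags=["Air"])
print(url)
```

Using `fq` for the tag filters keeps them separate from the keyword query, so they narrow the result set without affecting the relevance score.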

3.5. The “Qualified contents” approach to build a community driven set of qualified contents

The CASK system lets users save, in a specific area named “Qualified contents”, the contents retrieved during their searches over the CASK database (including URLs, Patents, Papers, Ideas, Projects and Grants) that they consider relevant to the query launched. This activity not only allows the community to steadily build a solid knowledge base of contents that are really relevant to it, but also facilitates the users’ research activities. The contents saved in the “Qualified contents” section are automatically shared with the CATER community and can afterwards be searched by users in the specific area “Search Qualified Contents”.

Figure 3 - Search Qualified Contents Page

4. Labelling/Tagging of the CASK Contents

One of the functionalities reported in the first deliverable, D5.1 - Overall functional and technical specifications, expressed the need of the CATER users to launch searches using a pre-defined set of filters based on the ACARE enablers, on the R&I framework and on the Door to Door (D2D) model defined during the project.

The approach taken in CASK is to allow users to assign a set of tags to each content they consider relevant for the project purposes, meaning that any content a user defines as “qualified” (see section 3.5) is also tagged. Therefore, the Qualified Contents are not only a subset of contents considered relevant by the community, but are also tagged to ease the clustering and search of contents.

Technically, CASK maintains a database that stores the ID of each content considered relevant (for example, a specific paper); the ID is added to the “qualified” list and the tags are associated with it. CASK thus maintains all the metadata related to such contents and the link to them, while the content itself remains in the original database. This approach is necessary when dealing with huge databases such as those of papers and patents (more than 20 million records), and allows users to perform all the actions easily and with high performance.

The tags that users can associate with the contents have been grouped into five main categories (ACARE keywords, R&I taxonomy, Relevance to Time Efficiency, Relevant Type of Transport, Door2Door Model) by the project partners in the framework of other work packages. The following table reports the tags identified for each category:

Category | Options
ACARE keywords | Understanding Customer Expectation & Role Understanding Market & Society Assessment of Mobility System Concepts Mobility System Design Mobility System Performance Travel Management Assessment of Mobility Choices IT for Intermodal Mobility Choices Disruption and Recovery Management From Airports to Air Transport Interface Nodes Information Platform for Operations Management of Air Traffic System Intelligence, Automation, Human System-wide Safety Management Systems System-wide Security Management Systems Intelligence Innovative methods Safety radar for air transport system Security Radar for air transport system Operational mission management Resilience (Methodologies/products/services) Defining standards, certification, Innovation Passenger/Payload centred fast-time Simulator
R&I taxonomy | Physics (PHY); Structures (STR); Propulsion (PRO); Vehicle Systems and Equipment (VSE); Mechanics (MCH); Integrated Design & Validation (methods & tools) (IDV); Traffic Management (TMG); Transport Nodes (TNO); Human Factors (HFA); Innovative Concepts & Scenarios (ICS)
Relevance to Time Efficiency | Specific for 4h D2D goal; Directly improving TE; Transparently improving TE
Relevant Type of Transport | Air; Rail; Water; Road; MM; All
D2D Model | From door to origin airport; Access transportation to airport; Use/ride transportation to airport; Access drop-off/parking zone at airport; Access airport building; At origin airport; Preliminary checks and luggage drop-off; Security screenings; Entry to boarding gate; Boarding aircraft; FLIGHT (1,2,…); At origin ground; Take-off; In-flight; Landing; At destination ground; At transit airport; Leaving initial aircraft; Transit checks; Transit security screenings; Transit to new aircraft; Entry to new boarding gate; Boarding new aircraft; At destination airport; Leaving aircraft; Checks (customs, immigration,…); Luggage re-claim; From destination airport to door; Exit airport building; Access drop-off/parking zone at airport; Use/ride transportation to destination; Access destination point

Table 2 – TAGS identified to qualify contents
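The storage approach described in this section, where CASK keeps only the content ID, its source and its tags while the content itself stays in the original database, can be sketched with an in-memory SQLite database. The table and column names and the functions `qualify` and `find_by_tag` are hypothetical; the real CASK schema is not described in this deliverable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE qualified (
    content_id TEXT NOT NULL,   -- ID of the item in its original database
    source     TEXT NOT NULL,   -- e.g. 'papers' or 'patents'
    PRIMARY KEY (content_id, source)
);
CREATE TABLE content_tag (
    content_id TEXT NOT NULL,
    source     TEXT NOT NULL,
    category   TEXT NOT NULL,   -- one of the five tag categories
    tag        TEXT NOT NULL
);
""")

def qualify(content_id, source, tags):
    """Add a content ID to the qualified list and attach (category, tag) pairs."""
    with conn:  # commit both statements together
        conn.execute("INSERT OR IGNORE INTO qualified VALUES (?, ?)", (content_id, source))
        conn.executemany(
            "INSERT INTO content_tag VALUES (?, ?, ?, ?)",
            [(content_id, source, category, tag) for category, tag in tags],
        )

def find_by_tag(category, tag):
    """Return (content_id, source) pairs carrying the given tag."""
    rows = conn.execute(
        "SELECT content_id, source FROM content_tag "
        "WHERE category = ? AND tag = ? ORDER BY content_id",
        (category, tag),
    )
    return rows.fetchall()

qualify("paper-42", "papers", [("Relevant Type of Transport", "Air"),
                               ("R&I taxonomy", "Propulsion (PRO)")])
print(find_by_tag("Relevant Type of Transport", "Air"))
```

Only identifiers and tags are stored, so the table stays small even when the underlying paper and patent databases hold tens of millions of records.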

To qualify a content, users should select at least one option (tag) per category. Users can query the “qualified contents” by keywords, by other fields or by tags. They can also “browse” the system by tags, omitting keywords and simply filtering the contents by the selected set of tags.
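The two rules above, one tag per category when qualifying and keyword-free browsing by tags, can be expressed as two small functions. This is a sketch under assumptions: the data layout (a dictionary of tags per category) and the names `missing_categories` and `browse` are hypothetical, and the sample items are invented for illustration.

```python
CATEGORIES = (
    "ACARE keywords",
    "R&I taxonomy",
    "Relevance to Time Efficiency",
    "Relevant Type of Transport",
    "D2D Model",
)

def missing_categories(selected):
    """Return the categories for which no tag has been selected yet."""
    return [category for category in CATEGORIES if not selected.get(category)]

def browse(contents, filters):
    """Return the IDs of contents carrying every requested tag in each filtered category."""
    matches = []
    for item in contents:
        tags = item["tags"]
        if all(set(wanted) <= set(tags.get(category, []))
               for category, wanted in filters.items()):
            matches.append(item["id"])
    return matches

# Invented sample data for illustration only.
contents = [
    {"id": "paper-1", "tags": {"Relevant Type of Transport": ["Air"],
                               "D2D Model": ["Security screenings"]}},
    {"id": "patent-2", "tags": {"Relevant Type of Transport": ["Rail"],
                                "D2D Model": ["Access transportation to airport"]}},
]
print(missing_categories({"Relevant Type of Transport": ["Air"]}))
print(browse(contents, {"Relevant Type of Transport": ["Air"]}))
```

A user interface could call `missing_categories` before saving a qualified content and prompt for the categories it returns.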

5. Conclusions

The database and the search engine of CASK have been designed on the basis of an analysis of the requested features. A Document Management System has been included as an integral part of the CASK database and search system, addressing the specific request for a search system able to include preferences and documents provided by the CATER community. The overall search process designed for CASK allows a smart use of the pre-defined set of filters based on the ACARE enablers, on the R&I framework and on the Door to Door (D2D) model defined during the project.
