www.kconnect.eu

D3.1 Requirements for Vertical Search Solutions

Deliverable number: D3.1
Dissemination level: Public
Delivery date: 2015.07.31
Status: Final version
Author(s): Célia Boyer, Ljiljana Dolamic, Jon Brassey, Angus Roberts, Endre Jofoldi, Zoltan Varju, Zoltan Farago, João Palotti, Veronika Stefanov

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644753 (KConnect)

Executive Summary

This document summarizes the requirements for vertical search solutions, exemplified by KConnect partners Health On the Net (HONsearch), Trip (Trip database) and Precognox (NOTA).

The requirements are grouped into whether they can be built from existing components (from the Khresmoi project) or are new, and whether they apply to only one partner or are shared by several.

The existing functions are machine translation, semantic annotation, and semantic search. New/improved shared functions are semantic query annotation, search log analysis, readability classification and advanced machine translation. They will be included in the common core functions. Trustability, relatedness, adverse effect lookup, and rapid review analysis are individual requirements, which could be potential common functions in the future.

Many of the requirements (semantic annotation of queries, semantic search API, etc.) proposed here satisfy the needs of the consortium partners for semantic services that can be integrated into industrial solutions.

The document describes ten use cases in detail. These use cases correspond to real-life scenarios, and all of them are linked to the requirements. Ultimately, the new applications will serve visitors with useful search suggestions, wider language coverage and reliable results that are adjusted to the user's expertise level.

General technical requirements as well as performance requirements were identified. The core general requirement is that all components should work as services. Performance requirements focus on response time and availability, based on current and predicted usage numbers of the systems. The annex contains the wireframes designed for the individual systems (HON and NOTA).

Table of Contents

1 Introduction
2 Requirements
2.1 Existing shared requirements
2.1.1 Machine translation
2.1.2 Semantic annotation
2.1.3 Semantic search
2.2 New/Improved shared requirements
2.2.1 Machine translation (CUNI)
2.2.1.1 Multi-lingual search
2.2.1.2 Spanish language extension
2.2.1.3 Search safety net
2.2.2 Semantic search (ONTO)
2.2.3 Semantic annotation of queries
2.2.4 Search log analysis (TUW)
2.2.5 Readability classification (HON)
2.3 New individual requirements
2.3.1 Adverse events (Trip)
2.3.2 Trustability (HON)
2.3.3 Rapid reviews analysis (Trip)
2.3.4 Relatedness (Trip)
3 Use cases
3.1 Use case 1 (HON)
3.2 Use case 2 (HON)
3.3 Use case 3 (HON)
3.4 Use case 4 (NOTA)
3.5 Use case 5 (NOTA)
3.6 Use case 6 (Trip)
3.7 Use case 7 (Trip)
3.8 Use case 8 (Trip)
3.9 Use case 9 (Trip)
3.10 Use case 10 (Trip)
4 Technical requirements
4.1 General requirements
4.2 Performance requirements
5 References
Annex A
A.1 Milena
A.2 Nina
A.3 Roman
A.4 NOTA

List of Abbreviations

API Application Programming Interface

CAD Computer Aided Diagnosis

CSR Clinical Study Reports

CT Computed Tomography

CUNI Charles University Prague

GUI Graphical User Interface

HON Health on the Net

HONcode HON Code of Conduct for medical and health Web sites

HONsearch HON search engine

IP Intellectual Property

JSON JavaScript Object Notation

K4E Khresmoi for Everyone

MEDLINE Medical Literature Analysis and Retrieval System Online

MeSH Medical Subject Headings

MIMIR Multiparadigm Indexing and Retrieval

MRI Magnetic Resonance Imaging

MT Machine Translation

NNH Numbers Needed to Harm

NNT Numbers Needed to Treat

NOTA Napvizit Orvosi Tudástár Alkalmazás (Daily Visit Medical Knowledge Application)

ONTO Ontotext

PDF Portable Document Format

PICO Population, Intervention, Comparison, Outcome

PubMed (search engine on MEDLINE)

SNOMED Systematized Nomenclature of Medicine

TUW Vienna University of Technology

USFD University of Sheffield

WP3 KConnect Workpackage 3 (Early Adopters: Medical Vertical Search Solutions)

1 Introduction

Between 2010 and 2014, Khresmoi [3] developed a multilingual multimodal search and access system for biomedical information and documents (Figure 1). This was achieved by:

• Effective automated information extraction from biomedical documents, including improvements using manual annotation and active learning, and automated estimation of the level of trust and target user expertise

• Automated analysis and indexing for medical images in 2D (X-Rays) and 3D (MRI, CT)

• Linking information extracted from unstructured or semi-structured biomedical texts and images to structured information in knowledge bases

• Support of cross-language search, including multilingual queries, and returning machine-translated pertinent excerpts

• Adaptive user interfaces to assist in formulating queries and interacting with search results

Figure 1. Khresmoi information flow

Based on the experience of past years, the team proposed modifications and new features. The purpose of the KConnect project is to commercialize the deliverables of the earlier project.

The following project partners, which work in the vertical search market, have already adopted the technologies in their websites.

Trip [2] is a clinical search engine designed to allow users to quickly and easily find and use high-quality research evidence to support their practice and/or care. Trip has been online since 1997 and in that time has developed into the Internet’s premier source of evidence-based content. Their motto is ‘Find evidence fast’ and this is something they aim to deliver for every single search. As well as research evidence they also allow clinicians to search across other content types including images, videos, patient information leaflets, educational courses and news.

The main purpose of NOTA is to provide a unified medical search engine with social capabilities. Medical practice is by its very nature a community-based activity: colleagues share their experiences regarding interesting cases and new developments in the field. NOTA provides a platform that makes it easy to discover relevant content and to discuss new findings and cases with the professional community.

The Health On the Net Foundation (HON [1]) promotes and guides the deployment of useful and reliable online health information, and its appropriate and efficient use. Created in 1995, HON is a non-profit, non-governmental organization, accredited to the Economic and Social Council of the United Nations. For 15 years, HON has focused on the essential question of the provision of health information to citizens, information that respects ethical standards. To cope with the unprecedented volume of healthcare information available on the Net, the HONcode of conduct offers a multi-stakeholder consensus on standards to protect citizens from misleading health information.

In section 2 the document refers to the previous Khresmoi [3] project deliverables but focuses on new shared and individual development requirements. Section 2.1 describes the existing base services, section 2.2 contains the new/improved shared requirements, and section 2.3 presents the new individual requirements. These development requirements will be the baseline for modifying the current services and for developing new ones.

After the updated central services are up and running as test versions, the use cases described in section 3 could be implemented/modified in the client applications.

Section 4 collects the base requirements for the system design/development phase, and for the live usage as well.

2 Requirements

This deliverable is based on the results of the KConnect team meeting held in Gothenburg on 2015.06.16-2015.06.17. The aim of the meeting was to collect the most important development requirements for optimizing the process and improving the performance of online health information retrieval. The collected requirements are structured into common and individual requirements. Since vertical search solutions will be developed first within the project, the requirements have been distilled from discussions of the needs of early adopters. Figure 2 shows those requirements.

Figure 2 Shared and individual requirements

Stakeholders agreed to build solutions using tools developed by the Khresmoi [3] project and collected shared and individual requirements. This part of the document (a) briefly summarizes the existing solutions (as the detailed descriptions can be easily found in the Khresmoi [3] documentation), (b) describes the shared requirements, and (c) gives an overview of the individual development needs. More information on the individual requirements can be found in the third part of this document, in the use cases section.

2.1 Existing shared requirements

The existing shared requirements are derived from the Khresmoi [3] project. This document does not describe the existing tools in detail. For the details, please see the original project deliverables, as referenced below.

2.1.1 Machine translation

Machine translation (MT) eases the flow of information by automatically translating a given text into a specified language. In practice, the goal of MT systems is not to replace human translators but to make textual information accessible on demand. Since general-purpose machine translation is a hard problem, KConnect provides a domain-specific solution for the medical field. Where knowledge is fragmented because it is stored in local languages, MT services unlock it and make it accessible to everyone. Machine translation plays a key role in achieving a European Digital Single Market. KConnect provides an easy-to-use digital service for quality machine translation between English and the following languages: French, Spanish, German, Swedish, Polish and Hungarian.

KConnect Machine Translation services have been developed by Charles University Prague (CUNI). The system is based on the robust and mature Moses framework and provides the following services:

• Query translation. Automatic translation of queries makes it possible to search content in several languages.

• Translation of longer snippets. To help users evaluate search results and get instant access to the information, the KConnect MT services provide a snippet translation feature that returns automatic translations of longer texts.

For more detailed information see [9] for Khresmoi [3] results.

This function will be used in the following use cases:
- Use case 4 (Section 3.4)
- Use case 5 (Section 3.5)

2.1.2 Semantic annotation

Semantic annotation is the process of enhancing digital texts with metadata, e.g. by extracting the names of drugs, diseases, etc. An NLP pipeline for English has already been developed by USFD. Further languages will be added to the existing solutions. This will be presented in WP1 deliverables.

For more detailed information see [10] for Khresmoi [3] results.

This function will be used in the following use cases:
- Use case 4 (Section 3.4)
- Use case 5 (Section 3.5)

2.1.3 Semantic search

Semantic search makes the process of searching more natural by extending the query terms with connected terms (e.g. synonyms) and/or grouping the results into semantic categories. This requires the semantic annotation of the indexed documents (see 2.1.2). The semantic relationships between terms can be derived from a so-called knowledge base. These solutions were developed during the Khresmoi project.

For more detailed information see [10] for Khresmoi [3] results.

This function will be used in the following use cases:
- Use case 4 (Section 3.4)
- Use case 5 (Section 3.5)
- Use case 6 (Section 3.6)
- Use case 7 (Section 3.7)
- Use case 8 (Section 3.8)
- Use case 9 (Section 3.9)
- Use case 10 (Section 3.10)

2.2 New/Improved shared requirements

This section describes shared requirements that will either be built on existing solutions or are newly required by the consortium partners. The requirements reflect the needs of industrial applications. Machine translation is a core element of KConnect services: it will not only provide multilingual solutions but can also be used to improve the quality of search. Semantic search is still in its infancy, and its use in industrial settings lacks standardised solutions. The requirements (semantic annotation of queries, semantic search API, and PICO) proposed here satisfy the needs of the consortium partners for semantic services that can be integrated into industrial solutions. Search log analysis is vitally important for industrial applications, whereas the readability features open new possibilities in designing health information sites for the public.

2.2.1 Machine translation (CUNI)

2.2.1.1 Multi-lingual search

At the simplest level, a user would type search terms in their own language; these would be auto-converted to English, a search performed in the Trip Database, and results returned in English. However, there are multiple ways of enhancing this further:

• Auto-convert the search results (titles) to the native language.

• Create and translate snippets into the native language. Trip currently doesn’t use snippets so this would be an additional consideration. It might be a snippet or it might be an auto-generated document summary.

• Translation of the whole document.

This function will be used in the following use cases:
- Use case 6 (Section 3.6)

2.2.1.2 Spanish language extension

Trip has a worldwide user base, and the largest non-English user group comes from Spanish-speaking countries (Spain and South America). Developing Spanish support would therefore bring the greatest benefit to Trip.

A worthwhile challenge would be for a site to pull through related documents in the user's native language. If Trip knows that a user is Spanish-speaking, it could pull through related articles written in Spanish. Similarly, for a non-English site, the ability to bring through related English-language documents would be useful.

This function will be used in the following use cases:
- Use case 6 (Section 3.6)

2.2.1.3 Search safety net

The main use case for machine translation, however, is a novel tool: the search safety net. When searching for research evidence it can be vital to locate all the evidence regardless of language. The search safety net will use linguistic relatedness and clickstream data to highlight potentially missed articles. The ability to bring in non-English documents would be incredibly powerful.

This function will be used in the following use cases:
- Use case 7 (Section 3.7)

2.2.2 Semantic search (ONTO)

The semantic search service will provide an API for user-friendly search interfaces. Since semantic search is not part of the daily routine of ordinary users, the API analyses queries and suggests search terms grouped into semantic categories (see Figure 3). The API does not implement any user interface; its sole purpose is to provide suggestions for queries.

Input: query string (at least 3 characters)

Output:

1. up to three terms best matching the input query (e.g. metformin, methopterin... for the query “met”) and a list of possible semantic queries for each term, in JSON format.

2. a list of up to five phrase queries containing, or labelled by, the above terms (e.g. Metformin (Glucophage) for Polycystic Ovary Syndrome)

The format of the final query depends on the selection made. The front end will choose the final query and submit it to MIMIR [5]. The MIMIR user guide can be found here: https://gate.ac.uk/mimir/doc/mimir-guide.pdf

The client application defines the GUI design and the type of output (see above): only disambiguation, real semantic query suggestion, or both.

The current output format of the ONTO knowledge base is JSON. It will be received by K4E.
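The input and output constraints described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `suggest`, the JSON keys (`terms`, `semantic_queries`, `phrases`), the prefix-matching rule and the in-memory term list are all hypothetical stand-ins, since the document does not specify the actual response schema or the knowledge base lookup.

```python
import json

MIN_QUERY_LENGTH = 3  # from the text: query string of at least 3 characters
MAX_TERMS = 3         # from the text: up to three best-matching terms
MAX_PHRASES = 5       # from the text: up to five phrase queries

# Hypothetical in-memory stand-ins for the knowledge base.
TERMS = ["metformin", "methopterin", "methadone", "metoprolol"]
PHRASES = ["Metformin (Glucophage) for Polycystic Ovary Syndrome"]


def suggest(query: str) -> str:
    """Return a JSON string with term and phrase suggestions for a query."""
    q = query.strip().lower()
    if len(q) < MIN_QUERY_LENGTH:
        raise ValueError("query must contain at least 3 characters")

    # Assumption: "best matching" is approximated by simple prefix matching.
    matches = [t for t in TERMS if t.startswith(q)][:MAX_TERMS]
    terms = [
        # The semantic query template is purely illustrative.
        {"term": t, "semantic_queries": [f"documents mentioning {t}"]}
        for t in matches
    ]
    # Phrase queries that contain one of the matched terms.
    phrases = [
        p for p in PHRASES if any(t in p.lower() for t in matches)
    ][:MAX_PHRASES]
    return json.dumps({"terms": terms, "phrases": phrases})


if __name__ == "__main__":
    print(suggest("met"))
```

In this sketch the client application would still decide which of the two lists to display, in line with the division of responsibilities described above.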

Figure 3 illustrates how the output described above appears in the query suggestions of a search engine. Further details on how the output is used for this purpose can be found in the use case section of this document (Sections 3.1, 3.2 and 3.3).

Figure 3 Semantics in query suggestions – an example

This function will be used in the following use cases:
- All use cases (Sections 3.1 to 3.10)

2.2.3 Semantic annotation of queries

The Trip Database has adopted, as an alternative interface, a PICO search system. PICO is a way of helping users structure a clinical question in a way that also helps support searching for evidence. The Centre for Evidence-Based Medicine, Oxford describes it:

“One of the fundamental skills required for practising EBM is the asking of well-built clinical questions. To benefit patients and clinicians, such questions need to be both directly relevant to patients’ problems and phrased in ways that direct your search to relevant and precise answers.” [8]

In practice a well-built clinical question contains up to four separate components:

(1) P = Population of patients the question refers to
(2) I = Intervention that is being explored
(3) C = Comparison intervention
(4) O = Outcome that is of interest

For example, a clinical question might be ‘In diabetics what is better metformin or insulin in preventing retinopathy?’ The PICO elements are as follows:

P = Population = Diabetics
I = Intervention = metformin
C = Comparison = insulin
O = Outcome = retinopathy

Currently on Trip a user manually adds the PICO elements (they do not need to use all 4) and Trip then does a contingency search to retrieve the top 5-20 articles linked to the search.

Via semantic mark-up the system could auto-extract the PICO elements from a question typed out in full by the user. In the interface, the user would type the full-text question, the system would suggest the PICO elements (which the user can alter if the suggestion is incorrect), and the user would then run the search.

To maximise the impact, the system could use the full-text question, the PICO elements and the articles viewed to create a new record to go into the Trip index for re-use.

Input: user query (free text)

Output: four terms (one per PICO element), any of which may be empty

If the output contains values, it should be fed back to the MIMIR API or another search engine.
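A minimal sketch of how a client might hold this output is given below. The class name `PicoAnnotation`, its methods and the `AND`-joined query string are assumptions made for illustration; in particular, the query string is not MIMIR query syntax, which is defined in the MIMIR user guide.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PicoAnnotation:
    """Four PICO terms extracted from a free-text question; any may be empty."""
    population: Optional[str] = None
    intervention: Optional[str] = None
    comparison: Optional[str] = None
    outcome: Optional[str] = None

    def is_empty(self) -> bool:
        """True when no PICO element was extracted."""
        return not any(asdict(self).values())

    def to_query(self) -> str:
        """Join the non-empty elements into a simple conjunctive query.

        Hypothetical syntax for illustration only, not MIMIR syntax.
        """
        return " AND ".join(f'"{v}"' for v in asdict(self).values() if v)


# The worked example from the text:
# 'In diabetics what is better metformin or insulin in preventing retinopathy?'
example = PicoAnnotation(
    population="diabetics",
    intervention="metformin",
    comparison="insulin",
    outcome="retinopathy",
)

if __name__ == "__main__":
    if not example.is_empty():
        print(example.to_query())
```

Keeping every field optional mirrors the statement that users do not need to supply all four elements.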

This function will be used in the following use cases:
- Use case 8 (Section 3.8)

2.2.4 Search log analysis (TUW)

The automatic analysis of query logs has been used internally to understand how the users search and to identify requirements, such as the need for query translation, when a number of queries were detected to be non-English.

We exploited the queries to build a user expertise classifier [4], which detects the level of expertise of users based on their behaviour, e.g. session length, query length, etc., and on the content of queries issued, e.g. use of complex or specific vocabulary.

Our next steps are to explore the use of clickstream data (the user's recorded activity on the page) to support two areas: (1) relatedness (which articles are related to others based on the clickstream data), and (2) results re-ranking (based on the actual number of clicks, but also to pull through documents that do not match on text; see http://www.pontneo.com/Trip/search.php). The clickstream could be used across a number of features of Trip and, as we own the clickstream data, it adds considerably to IP.

No API is currently available for this tool, therefore it needs to be developed.

This function will be used in the following use cases:
- Use case 4 (Section 3.4)
- Use case 5 (Section 3.5)
- Use case 7 (Section 3.7)

2.2.5 Readability classification (HON)

The automated detection of the readability level is integrated into the K4E search engine at everyone.khresmoi.eu/hon-search (Figure 4). Readability detection is currently implemented for English and French. Details of the readability implementation can be found in [6].

Figure 4 Readability implementation within the K4E search engine

No API is currently available for this tool. The API that will be in the scope of this project will have the following characteristics:

1. Input:

a) text, language

b) URL, language

For URL input, the page is first fetched, the meaningful content is extracted, and the detection is then performed. For text input, the detection is performed directly on the provided text.

2. Output:

• List of readability scores detected for each level (easy, average, or difficult). This output will be in JSON format.
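The request and response shapes described above can be sketched as follows. The JSON key names (`language`, `text`, `url`, and the three level names used as keys) and the helper names are assumptions, because the API does not exist yet and the document fixes only the inputs, the three levels and the JSON format.

```python
import json
from typing import Optional

# The three readability levels named in the text.
LEVELS = ("easy", "average", "difficult")


def build_request(language: str,
                  text: Optional[str] = None,
                  url: Optional[str] = None) -> dict:
    """Build a request carrying the language and exactly one of text or URL."""
    if (text is None) == (url is None):
        raise ValueError("provide exactly one of text or url")
    payload = {"language": language}
    if text is not None:
        payload["text"] = text
    else:
        payload["url"] = url
    return payload


def most_likely_level(response_body: str) -> str:
    """Check that every level has a score and return the highest-scoring one."""
    scores = json.loads(response_body)
    missing = [level for level in LEVELS if level not in scores]
    if missing:
        raise ValueError(f"missing levels: {missing}")
    return max(LEVELS, key=lambda level: scores[level])


if __name__ == "__main__":
    print(build_request("en", text="Take one tablet twice a day."))
    # Hypothetical response body with one score per level.
    print(most_likely_level('{"easy": 0.2, "average": 0.7, "difficult": 0.1}'))
```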

This function will be used in the following use cases:
- Use case 1 (Section 3.1)
- Use case 2 (Section 3.2)
- Use case 3 (Section 3.3)

2.3 New individual requirements

This section describes individual requirements. Although the proposed solutions are referred to as individual requirements, they are related to semantic search and can serve as a basis for shared solutions in the future.

2.3.1 Adverse events (Trip)

Clinical Study Reports (CSRs) are typically long (hundreds if not thousands of pages) and poorly structured documents. They are a long summary of the results of a clinical trial and often form the basis of the articles published in peer-reviewed journals. These journal articles are typically 8-12 pages long, so significant summarisation is undertaken. The concern is that, due to the summarisation, important detail is overlooked or ignored, meaning that unnecessary harm might result from under-reported adverse events.

This proposed piece of work will involve a system to upload a CSR (probably as a PDF) and for the system to ‘read’ the contents and accurately identify adverse events mentioned in the CSR. As well as ‘marking up’ the adverse event data within the PDF it would allow users to generate a report or analytics. This would allow researchers to easily understand the number and volume of adverse events. So, as well as pulling out the mentions of adverse events it also extracts the frequency of adverse events so these could be quantified. Typically, adverse events are reported as a frequency e.g. this adverse event occurs at a rate of around 1 in 10.
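As an illustration of the frequency extraction step only, the sketch below pulls "1 in 10" style rates out of plain text with a regular expression. This is a naive heuristic chosen for the example, not the method proposed by the project: it assumes the text has already been extracted from the PDF, it does not link a rate to the adverse event it describes, and it can produce false positives on unrelated phrases with the same shape.

```python
import re
from fractions import Fraction
from typing import List

# Matches "<n> in <m>", where <m> may use thousands separators (e.g. 1,000).
FREQUENCY = re.compile(r"\b(\d+)\s+in\s+(\d{1,3}(?:,\d{3})+|\d+)\b")


def extract_frequencies(text: str) -> List[Fraction]:
    """Return every 'n in m' rate found in the text as an exact fraction."""
    rates = []
    for numerator, denominator in FREQUENCY.findall(text):
        d = int(denominator.replace(",", ""))
        if d > 0:  # skip malformed rates such as "1 in 0"
            rates.append(Fraction(int(numerator), d))
    return rates


if __name__ == "__main__":
    sample = ("Nausea occurs at a rate of around 1 in 10; "
              "rash was seen in 3 in 1,000 patients.")
    for rate in extract_frequencies(sample):
        print(rate, float(rate))
```

Exact fractions are used so that rates can later be compared or aggregated without rounding error.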

Additional enhancements would allow users to upload any text and have the system automatically extract not only adverse events but also other data that might be amenable to machine reading, for instance MeSH terms or SNOMED terms.

This function will be used in the following use cases:
- Use case 9 (Section 3.9)

2.3.2 Trustability (HON)

The automated detection of the HONcode compliance is integrated into the everyone.khresmoi.eu/hon-search (K4E) pipeline (Figure 5).

This system needs further development in detection of certain HONcode criteria such as “date” as well as expanding the language coverage. It is currently implemented in English, French, German, Spanish, Italian and Dutch. However, due to small training collections for languages other than English and French, additional work is required for those languages as well. HON will be using the MT services to translate the existing collections and use those translations as a training base for the languages not covered yet, and for enriching the existing collections. The details about the system for automated detection of the HONcode conformity and its implementation into K4E search pipeline are described in [6, 7].

Figure 5. K4E - trustability implementation

Currently no API exists for the detection of HONcode compliance. The API that will be in the scope of this project will have the following characteristics:

1. Input:

a) text, language

b) URL, language

If the input is a URL, the page is first fetched, the meaningful content is extracted, and the detection is then performed. If the input is text, the detection is performed directly on the provided text.

2. Output:

The output depends on the information the user requests:

- Detected HONcode principles for the submitted text or URL. The output is the list of detected principles with corresponding scores.

- Certification status of a URL. The API returns the trustability level detected on the whole body of pages coming from the same source as the input page. The output is the list of principles detected.

- Pages from certified websites covering the same topic. The output is the list of pages from various websites on the same topic as the input URL.

All results will be in JSON format.

This function will be used in the following use cases:
- Use case 1 (Section 3.1)
- Use case 2 (Section 3.2)
- Use case 3 (Section 3.3)

2.3.3 Rapid reviews analysis (Trip)

The current Trip Rapid Review system uses sentiment analysis to decide whether a placebo-controlled trial is in favour of the intervention or not. A user searches for and selects trials that are pertinent to the clinical question, and the system auto-reads them to see whether they favour the intervention in question. It does this by sentiment analysis of the individual trials and then averages the results of the individual trials to give an overall estimate of effect. The initial system was crude and only had two outcomes (positive or negative) for an intervention versus a placebo.

The system could be improved in two main ways:

• Scoring on three outcomes: positive (+1), negative (-1) and neutral (0).

• Allowing head-to-head trials, where two drugs/interventions are tested against each other.

The new system would start in a similar way to the old system, with a user adding their search terms and then selecting candidate trials for inclusion in the analysis. Once selected, the system would analyse the trials, assigning 1, 0 or -1 to each, and these scores would then be combined to produce an overall score. An additional challenge is that the system should try to ascertain how big each trial was (number of participants), which allows the score (1, 0 or -1) to be modified. For example, in the current system a small trial (fewer than 100 participants) has its score reduced by 75%, so a small positive trial would score +0.25 (not +1).
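The scoring arithmetic can be sketched as follows. The three-valued sentiment, the 100-participant threshold and the 75% reduction come from the text; combining the per-trial scores by a plain mean is an assumption based on the statement that the current system averages the individual results, and the names `Trial`, `trial_score` and `overall_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

SMALL_TRIAL_THRESHOLD = 100  # participants; "fewer than 100" counts as small
SMALL_TRIAL_WEIGHT = 0.25    # score reduced by 75% for small trials


@dataclass
class Trial:
    sentiment: int      # +1 positive, 0 neutral, -1 negative
    participants: int


def trial_score(trial: Trial) -> float:
    """Weight a trial's sentiment by its size."""
    if trial.sentiment not in (-1, 0, 1):
        raise ValueError("sentiment must be -1, 0 or +1")
    small = trial.participants < SMALL_TRIAL_THRESHOLD
    return trial.sentiment * (SMALL_TRIAL_WEIGHT if small else 1.0)


def overall_score(trials: List[Trial]) -> float:
    """Combine per-trial scores; assumption: a plain arithmetic mean."""
    if not trials:
        raise ValueError("at least one trial is required")
    return sum(trial_score(t) for t in trials) / len(trials)


if __name__ == "__main__":
    selected = [Trial(+1, 50), Trial(+1, 500), Trial(-1, 200), Trial(0, 300)]
    print(overall_score(selected))  # (0.25 + 1 - 1 + 0) / 4 = 0.0625
```

Other combination rules, such as weighting continuously by the number of participants, would fit the same structure by changing only `trial_score`.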

An additional component of the system is to automatically identify new trials as they are published in PubMed (Trip already pulls in new clinical trials, but only to populate the search index) and to see whether any pre-existing reviews are asking or answering the same questions. In other words, if a particular new trial is closely related to an existing review, this will be flagged up, and if it is an existing Trip Rapid Review the trial could be automatically incorporated into the system.

The above rests on a concept that is very important for Trip: relatedness/similarity.

This function will be used in the following use cases:

- Use case 10 (section 3.10)

2.3.4 Relatedness (Trip)

An important aspect of Trip’s plans for growth/improvement revolves around the concept of relatedness. In other words, how similar are two ‘items’? There are many ‘related articles’ features which use latent semantic indexing (LSI) to find articles that are similar, the logic being that if you like one article then very similar ones should appeal.
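To make the notion of a similarity score between two ‘items’ concrete, the sketch below computes cosine similarity over plain word counts. This is a deliberately simpler stand-in, not LSI itself (LSI would first project the term vectors into a lower-dimensional space), and the whitespace tokenisation is an assumption.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lower-cased, whitespace-separated tokens with counts."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term vectors (0.0 if either is empty)."""
    va, vb = term_vector(text_a), term_vector(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("heart failure treatment", "treatment of heart failure"))
print(cosine_similarity("heart failure treatment", "paediatric eczema"))
```

The same function can compare an article with a user profile, a job or a trial, provided each is represented as text or as a term vector.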

The use cases for Trip are multiple:

• As above, how similar are articles within Trip. This might be at the time of searching Trip but also, if a user looks at an article, can a system look for articles that are similar that are added to Trip in the future and then alert the user.

• An important area of potential business growth is to profile Trip users to better understand their interests and to use this to highlight new jobs, books, clinical trials that are recruiting etc. Trip can highlight these to the user and potentially secure referral fees. This requires Trip to see how similar a user is to the trial, book, job etc.

• A similar use case to the above is the potential for users of Trip to ask questions and for Trip to send the questions to people the system predicts can answer them. For instance, a user might have a specific question on paediatric eczema: can Trip find users that are ‘related’ to the question and see if they can answer it?

• In the future Trip might want to create a formal social network and then Trip would want to recommend similar users to each other, as LinkedIn does.

• Trip is working on using the clickstream data and machine learning to boost the results in Trip (more below). The current approach is to look at a top-level profession such as dentist or cardiologist to infer results. This is showing great promise, but the more granular the approach is the better. For example, take a cardiologist with a special interest in heart failure. Trip can learn most from his/her own activity but the data is likely to be sparse. However, Trip can boost this by looking at the clickstream data of similar users (i.e. those with an interest in heart failure) and then, finally, the system can look at all cardiologists.
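The back-off described in the last bullet (own activity, then similar users, then the whole profession) can be sketched as choosing the most specific level that has enough click data. The threshold `min_clicks`, the function name and the data layout are hypothetical.

```python
def pick_click_data(levels, min_clicks=50):
    """Return the most specific (name, clicks) level that has enough click data.

    levels: list of (name, {article_id: click_count}) ordered from most to
    least specific. min_clicks is a hypothetical sparsity threshold.
    """
    for name, clicks in levels:
        if sum(clicks.values()) >= min_clicks:
            return name, clicks
    # Nothing is dense enough: fall back to the broadest level available.
    return levels[-1]


levels = [
    ("own activity", {"a1": 3}),
    ("similar users (heart failure)", {"a1": 40, "a2": 25}),
    ("all cardiologists", {"a1": 900, "a2": 1200, "a3": 400}),
]
name, clicks = pick_click_data(levels)
print(name)  # similar users (heart failure)
```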

Trip’s work on the clickstream data will shortly be boosted by machine learning. The initial idea was that articles that are clicked on are more likely to be of interest to users than articles that are not clicked on, so popular items are boosted. Trip has improved upon this by using sections of the clickstream data, so that, for a given search, results can be boosted based on profession/speciality etc. This works by looking specifically at the clickstream data of that profession/interest and boosting those articles. For a search such as antibiotics with dentistry as the selected interest, Trip boosts the results based on the previous clicks of dentists. The results are very encouraging.

Using machine learning to predict an article's use for a given topic will add additional benefits. For this, Trip will input articles that have been clicked on by, say, dentists and articles not clicked on by dentists, to try to learn which articles are likely to be of interest. At a very crude level this could well be the occurrence of the word ‘teeth’.
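A minimal sketch of boosting by profession-specific clicks is given below. The boosting formula (base score multiplied by a logarithmic function of the click count) and all names are assumptions chosen for illustration; the formula Trip actually uses is not described here.

```python
import math


def boost_results(results, clicks_by_profession, profession, weight=1.0):
    """Re-rank (article_id, base_score) pairs using the clicks of one profession.

    The log-based boost is a hypothetical choice that dampens very popular items.
    """
    clicks = clicks_by_profession.get(profession, {})

    def boosted(item):
        article_id, base_score = item
        return base_score * (1.0 + weight * math.log1p(clicks.get(article_id, 0)))

    return sorted(results, key=boosted, reverse=True)


results = [("general-antibiotics-review", 1.0), ("antibiotics-in-dentistry", 0.9)]
clicks_by_profession = {"dentist": {"antibiotics-in-dentistry": 10}}

# For a dentist the dentistry article moves to the top of the list.
print(boost_results(results, clicks_by_profession, "dentist"))
# With no click data for the profession, the original ranking is kept.
print(boost_results(results, clicks_by_profession, "cardiologist"))
```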

The shortcoming of this approach is that it is very high-level, hence the desire to boost the granularity by looking at similar clinicians. This might not be possible for the machine learning aspect but it should work for the clickstream data.

This function will be used in the following use cases:

- Use case 6 (Section 3.6)

- Use case 7 (Section 3.7)

- Use case 10 (Section 3.10)

3 Use cases

This section describes the use cases identified by KConnect WP3 team members. Each use case is structured with an example, characteristics and a scenario. The graphical presentation of the use cases in this section can be found in Annex A at the end of this document.

3.1 Use case 1 (HON) Milena, first-time user, has got a question concerning a disease treated by a specific medication

Milena is a saleswoman who speaks only English. Recently she has noticed that her sister is not doing well, although her sister is trying to convince her that everything is OK. She has also noticed a medication called “metformin” in her sister's house, which she had never heard of before and which raised her suspicion about her sister’s condition. She would like to know what condition or disease is treated by that medication.

Characteristic:

1. Milena speaks only English.

2. She has limited medical knowledge

Scenario:

1. Milena goes to the HONsearch website, where English is set as the default language, and starts typing “met”.

2. The system suggests different medications and diseases with possible predefined semantic queries, as well as a certain number of queries with these diseases/medications in context.

3. She chooses the “Diseases treated by metformin” from the proposed list.

4. The system detects this as a semantic query and reacts accordingly.

5. Three diseases are detected amongst the results. Information on these diseases is displayed, including the characteristics of the disease, clinical information, symptoms etc.

6. Reading the descriptions of the diseases, she recognizes a number of symptoms in the description of “diabetes type 2”.

7. She clicks on it and the list of results is filtered, keeping only those related to this condition.

8. Accessing the first few results in the list, she realizes that the returned results are scientific articles, which she finds difficult to read. She notices the “Results too complex” label and clicks on the proposed link.

9. A new, basic search query is launched for “diabetes type 2”. It returns a list of results in which she finds satisfactory information.

3.2 Use case 2 (HON) Nina, mother of a small child, is trying to determine the condition that corresponds to certain symptoms.

Nina has recently moved to Switzerland and has a basic knowledge of French. Her child has developed blisters on the feet and hands which resemble those of chickenpox. However, her child has already had chickenpox.

Characteristics

1. Nina speaks English and has some knowledge of French. She has to speak French to the child's doctor since he does not speak any English.

2. She searches for health information on a regular basis, especially concerning children.

Scenario

1. Arriving at the home page, Nina notices the term “blisters” in the “Sign and Symptom” cloud and clicks on it.

2. The term is added to the query field. The system reacts by proposing a query suggestion list (semantic and word in context).

3. She chooses “My child has blisters on hands and feet”.

4. The system performs a basic search query.

5. Among the returned results two main diseases are detected: “Hand-foot-mouth” and “chickenpox”. Information on these diseases is displayed.

6. Reading the information on the diseases, she recognizes her child's symptoms in Hand-Foot-Mouth. She clicks on this disease and the results are filtered accordingly.

7. Living in the French-speaking part of Switzerland and having to discuss her child's condition in this language, she wants to have at least basic information in French. She ticks French in the language list, which adds results in this language.

8. She finds the French name of the disease and other information in the list of results. Hovering over the “in English” link beside the results in French, she verifies her conclusion in the English translation of the snippet.

9. She would like to know about possible treatments, but only in English. She deselects French in the language list and clicks on treatment in the disease menu on the right.

3.3 Use case 3 (HON) Roman has a doubt about the best treatment for his condition

Roman is a native French speaker with a very good level of English. He is well accustomed to searching for health-related information on the Internet. He feels at ease reading health-related scientific literature in both French and English. Roman suffers from morbid obesity. He is considering undergoing a surgical intervention. After discussing with his doctor, he is still in doubt about what the best intervention would be. He wants to have as many details as possible on the two interventions his doctor suggested, but also on other possible solutions.

Characteristics

1. Roman is a native French speaker, fluent in English.

2. He is capable of understanding even highly complex health-related scientific articles.

Scenario

1. When he arrives on the HONsearch home page, the interface language is set to French by default.

2. In the “Maladies” (Diseases) cloud he spots “obesité”, which he chooses.

3. This term is added to the query and a list of suggestions appears.

4. Being particularly interested in morbid obesity, he continues typing his query, adding “mor”, and a new list of suggestions concerning “obesité morbide” appears.

5. Among the suggestions he notices the “Obesité morbide/bypass ou sleeve?” (Morbid obesity/ bypass or sleeve?). These are the two interventions proposed by his doctor. He chooses this proposition and the basic query is launched by the system.

6. A description of morbid obesity, detected as the disease, is displayed together with the list of results. Additional information, such as the readability level of each result, is displayed as well.

7. Roman notices a warning sign beside one of the returned results. Hovering over this sign displays the message that the automated system was unable to detect all HONcode principles for this website.

8. Roman reads some of the results. Still unsure, he decides to add results in English as well.

9. The results are added, with the translation of the snippet available from the “en Français” link.

10. After going through information about the two aforementioned interventions, he would like to see other treatment options. He notices the menu on the right and clicks on treatment, which launches a new query, “Morbid obesity treatment”, in both French and English.

11. Looking for more scientific information on different treatments, Roman clicks on the link “Rechercher dans les articles médicaux” below the query field. In addition he chooses to have this information only in English.

12. This launches the semantic query “Traitement indiqué pour l'obesité morbide” (Treatment has indications for morbid obesity).

13. The returned list of results contains scientific articles concerning different treatments for his condition, which help him in the final decision.

3.4 Use case 4 (NOTA) Bela, a first-time user in an academic environment, has a precise question

Bela is a medical student who has just learned about NOTA and would like to give it a try. He has a precise question in mind: „immun thrombocytopeniás beteg kezelési lehetőségei” (possible treatments for immune thrombocytopenia).

Characteristics

1. Bela accesses NOTA through its web site, so he sees the whole front page (news, questions, etc.)

2. University setting: IP identification (unlimited access to all contents, no need to login)

3. First-time user: he is NOT familiar with the site, has no clue what he can expect, and has no user account, so personalization is not an option

4. As a medical student, Bela is good at English and German

Prototypical scenario

1. Bela opens the browser and navigates to NOTA

2. Bela enters into the NOTA Knowledgebase (a big button at the top of the site)

3. His eyes go through the menu items, Answers, Ask a question, Search, News, Register or Sign in, Help

4. Bela chooses Answers, and he learns about the possibility of asking new questions

5. Bela switches to the Search option, just to give it a try, and enters “possible treatments for immune thrombocytopenia” in Hungarian

6. The system initiates the following steps

a. first it tries to retrieve documents from Akademiai Kiado’s collection (containing documents in Hungarian and in English). The following link shows a possible result: http://www.akademiai.com/content/y575461641uq6r13

b. it retrieves medical protocols relevant to the query

c. it identifies the national code for the disease

d. it tries to find relevant case-studies in the publisher’s databank

e. it tries to identify drugs relevant to the query

f. it retrieves other relevant content in PubMed (in English)

g. If relevant answers exist, the system retrieves them from the publisher’s database

7. Search results are displayed in facets.

8. A visible button appears on the site, showing that Bela can ask a question relevant to his query and/or the results

9. By choosing the “most recent publications” option, Bela gets a short list of recent (no older than 18 months) publications on the topic.

a. He gets the title and the abstract of every result in Hungarian.

b. Since Bela speaks English, he prefers the option “show results and snippets in their original language”

10. Bela finds an article on a new drug, but he is skeptical about its usefulness so he asks a question: does anybody have experience with it?

11. Bela happily concludes his discovery; he found Hungarian and English content relevant to his field of specialization. He reached a greater community of fellow doctors, asked for advice and an email alert will be sent to him when someone answers his question.

3.5 Use case 5 (NOTA) Eszter, experienced medical professional user with a preference for mobile devices

Eszter is a practicing medical professional with a zeal for new technologies. She feels comfortable with new technologies, and her platform of choice is a tablet since she is extremely mobile. Eszter likes internet services and uses social platforms in her everyday workflow. She has been using NOTA for months. Although Eszter reads English texts almost every day, she learned the language as an adult and goes for a Hungarian version of the material whenever possible.

Characteristic

1. Eszter is a competent user, visits NOTA several times a day and is actively using its functionalities (News, Ask a question, Answers, Search, etc.)

2. Eszter is using NOTA from her home or on the move (e.g. on the subway or on the train). She has been registered as an academic user.

3. Eszter is using a tablet.

Prototypical scenario

1. Eszter gets an email alert because she set up alerts for news in the field of gerontology and a new paper has just been published.

2. Eszter clicks on the link in the mail and the browser opens the tablet optimized version of NOTA

3. Eszter logs in (using an OpenID, such as her Google account)

4. Her personalized page appears on the site. A box highlights alerts, new questions and answers in topics relevant to Eszter’s interests, etc.

5. A navigation bar is shown on the top of the site (News, Ask a question, Answers, Search, Profile, Help)

6. Eszter chooses an interesting question, “Is it possible to diagnose Alzheimer-dementia in its early stage?”

a. Although there is no answer to this question, relevant keywords have been extracted from the question

b. Users can tag questions or they can accept/reject keywords extracted by the system

c. The user who asked the question misspelled Alzheimer’s: he wrote “Altzheimer”, a common mistake made by Hungarians. Although the system recognized this as a possible mistake, the user was in a hurry and saved his question containing the wrong spelling.

7. Eszter chooses the option, “Find content relevant to this question”. This option initiates a query built-up from the extracted keywords and user defined tags and returns:

a. The system recognizes the misspelling of Alzheimer’s and automatically changes it to the correct form (similar to the ‘Did you mean?’ function offered by Google)

b. matches from Akademiai Kiado’s journals (English and Hungarian)

c. protocols stored in the publisher’s database

d. case-studies from the publisher’s database

e. drugs related to the treatment of the disease

f. PubMed results

8. Results are displayed in facets. Results are in Hungarian; even the abstracts of English articles have been translated into Hungarian.

9. Eszter is dissatisfied with the results so she starts refining them

a. She notices the knowledgebase on the right of the page. It offers semantically related terms. Eszter chooses “diagnosis” and goes for the “computer aided diagnosis (CAD)” option

b. non-relevant matches disappear from the results and new ones come up

c. Eszter browses through the results and learns that there are many new ways of computer-aided diagnosis. She goes back to the semantic search facility and finds “speech analysis” among the sub-genres of CAD

d. Eszter finds interesting articles from a Hungarian research group. Although the vast majority of the papers are in English, she sees their abstracts in Hungarian, which makes it easier to go through the results.

e. Eszter learns that the research group is collaborating with a French institute and this institute is actively publishing on speech-based diagnosis too.

10. Eszter downloads relevant papers to her device. She starts reading interesting case-studies from the research group.

11. Before logging out, Eszter saves her search results, and gives a short answer to the question. The answer is not an expert opinion, but points to relevant materials found on NOTA.
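The automatic correction in step 7a (“Altzheimer” to the correct form) can be approximated with fuzzy string matching against a vocabulary of known terms. The sketch uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the cutoff value and the function name are assumptions, and this is not necessarily the method NOTA will use.

```python
import difflib

# Hypothetical vocabulary; a real system would draw on a medical terminology.
KNOWN_TERMS = ["alzheimer", "dementia", "diagnosis", "gerontology"]


def correct_term(term, vocabulary=KNOWN_TERMS, cutoff=0.8):
    """Return the closest known term, or the input unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term


print(correct_term("Altzheimer"))  # alzheimer
print(correct_term("thrombocytopenia"))  # no close match, returned unchanged
```

Keeping the original term when no candidate passes the cutoff avoids replacing rare but valid words with something unrelated.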

3.6 Use case 6 (Trip) Pablo is a Spanish doctor who is confident in speaking English but has less confidence in written English. He is keen to use evidence to answer his clinical questions

Characteristics

1. Pablo is a frequent user of Trip but is sometimes concerned that he might be missing important articles due to English not being his first language.

2. He has registered with the site and as part of the process he has indicated that he is from Spain.

Prototypical scenario

1. Pablo logs in to the site and is given the option to search in English or Spanish.

2. Pablo picks Spanish and types the search terms in the search box.

3. The results are returned in Spanish and clicking on the links will take him to the article.

4. All the articles will have a ‘related articles’ option which will allow Pablo to select similar articles that he may have missed and this can include articles written in Spanish, giving him a broader overview of the research literature.

3.7 Use case 7 (Trip) Geoff is an academic librarian and Susan is an academic researcher. They are both experienced searchers and have worked on a significant number of systematic and rapid reviews. However, they are under enormous time pressures and are always concerned that they may miss important articles. These concerns are both for professional reasons and also because it may affect the outcomes of the reviews.

Characteristics

1. Geoff and Susan frequently use Trip and feel comfortable with the functionality

Prototypical scenario

1. Geoff and Susan independently log in to Trip having formulated a search strategy.

2. As they search they select articles of interest.

3. At the end of the search session they export the records to be incorporated into their reference management software.

4. They are then presented with the option of using the Trip ‘Search safety net feature’.

5. Having selected this option they are presented with a list of articles related to their selected search terms. They are also told if the relatedness comes from linguistic similarity or via clickstream analysis.

6. They can quickly skim over the list of selected articles, selecting any that are relevant and these too can be exported to their reference management software.

3.8 Use case 8 (Trip) Dewi is a busy general practitioner and frequently generates clinical questions in the course of his day. He is keen to provide good quality care and wants his practice to be evidence based. However, he lacks confidence in using the literature and frequently just asks his colleagues.

Characteristics

1. Dewi is an infrequent user of Trip but is aware that it contains a large selection of evidence-based content

Prototypical scenario

1. Dewi goes to Trip and logs in.

2. Selecting the PICO search interface Dewi is presented with a search box to enter his full-text (natural language) question. After doing so he presses enter.

3. Trip analyses the question and extracts the PICO elements and automatically searches Trip.

4. Alongside the search results Dewi is shown the PICO elements. If any of the elements were wrong Dewi could alter them to improve the search, but in this case the elements were fine and Dewi is happy that he has good search results and can base his clinical decision on high quality research evidence.

3.9 Use case 9 (Trip) Tom and Carl are clinical academics who regularly undertake systematic reviews. They have become increasingly concerned about relying on journal articles as they are aware this misses lots of information contained in the trials and are considering using Clinical Study Reports. These are long and frequently poorly structured documents.

Characteristics

1. Tom and Carl are confident systematic reviewers but are finding the move to using Clinical Study Reports daunting due to the significant increase in content.

2. While they are especially keen on locating adverse event reporting they are concerned they may miss important examples due to the sheer volume of text.

Prototypical scenario

1. Tom uploads a Clinical Study Report as a PDF onto the Trip website.

2. After a few seconds Tom is informed that the article has been annotated. Tom selects the ‘Adverse event’ report.

3. He is then told that there are 105 separate mentions of potential adverse events. He calls Carl to analyse the results with him.

4. The adverse events are shown by frequency of occurrence in the Clinical Study Reports and by selecting an individual adverse event Tom and Carl are shown where these mentions are made. They can then read the text and decide if the mention is a true adverse event or possibly a symptom. For instance ‘cough’ can be a disease symptom but it could also be an adverse event linked with, say, a pharmaceutical drug.

5. Once they have finished their review they press finish and a report is generated. They realise that the published research articles have failed to report on multiple adverse events including one potentially serious one.
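The ‘Adverse event’ view in step 4 (events listed by frequency, with the location of each mention) can be sketched as below. Plain case-insensitive string matching against a hypothetical term list is used here as a stand-in for the semantic annotation the real system would rely on; all names are illustrative.

```python
import re


def find_mentions(text, terms):
    """Return [(term, [character offsets])], most frequently mentioned term first."""
    mentions = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        offsets = [match.start() for match in pattern.finditer(text)]
        if offsets:
            mentions[term] = offsets
    return sorted(mentions.items(), key=lambda item: len(item[1]), reverse=True)


report = "Cough was reported in 12 patients. Nausea and cough persisted; cough resolved."
candidate_terms = ["cough", "nausea", "rash"]  # hypothetical adverse event terms

for term, offsets in find_mentions(report, candidate_terms):
    print(term, len(offsets), offsets)
```

The offsets let a reviewer jump to each mention and decide, as in the scenario, whether it is a true adverse event or a disease symptom.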

3.10 Use case 10 (Trip) Magda is a dentist and enthusiastic user of the medical literature, frequently undertaking searches.

Characteristics

1. As well as being a busy dentist Magda is a keen proponent of evidence based healthcare

2. She uses a number of databases including Trip. The decision as to which database(s) to use is principally governed by the amount of time she has at her disposal.

Prototypical scenario

1. Magda has just had a conversation with a colleague which highlights a fairly new intervention to prevent caries. She is keen to understand what the research literature says about the intervention.

2. Due to a relatively short amount of time she elects to use the Trip Rapid Review system.

3. She is presented with the search boxes where she enters caries and the intervention’s name.

4. After pressing search she is presented with a number of controlled trials that have explored the effectiveness of the intervention.

5. She selects the ones she feels are pertinent and presses the analyse button.

6. The system analyses all the selected trials and reports back the relative performance of each trial and an aggregate score for the intervention. In addition the size of each trial is indicated.

7. The score highlights that the evidence, while positive, is only marginally positive and is based on a number of small (and therefore unreliable) trials.

8. Magda is confident that the intervention lacks a robust evidence base and therefore decides, for now, not to introduce it into her clinical practice.

9. Three months later Magda gets a notification that a new trial has been published which is highly related to the other articles she used in the review. She agrees and the analysis is updated. The new trial was large and showed significant benefit. Magda now feels confident enough to use the intervention in her clinical practice.

4 Technical requirements

This section treats the technical requirements for Kconnect systems. In order to build easy to adopt systems, the latest industry standards should be met. From the user experience view the 24/7 availability and the response times are critical.

4.1 General requirements

All developed system components should work as services. The communication between the services must be standardized, to make it easy to add services or replace obsolete ones.

Since all partner systems will be using industry standard open source tools and integrate various third-party APIs, reliability and performance are key requirements.

RESTful communication and the JSON web service format are preferred.

4.2 Performance requirements

NOTA, Trip and HON aim to serve their clients even during peak hours, so the integrated services should be able to handle a large number of simultaneous requests. Response time is critical: it should be in the range of milliseconds whenever possible.

Numbers for Trip usage:

number of current visitors is 100 000/month

number of future visitors is 200 000/month

Numbers for NOTA usage:

number of current visitors is 0

number of future visitors is 150 000/month

Numbers for HON usage:

number of current visitors is 600/month

number of future visitors is 250 000/month

Altogether the consortium members expect 600 000 visitors/month when all new features are developed and integrated into the applications.
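The total can be checked, and turned into a rough average load, with a few lines of arithmetic. The 30-day month is an assumption, and a visit is not the same as a request (one visit usually triggers several), so the per-second figure is only a lower bound on request rate.

```python
# Expected future visitors per month, as listed above.
FUTURE_VISITORS_PER_MONTH = {"Trip": 200_000, "NOTA": 150_000, "HON": 250_000}


def total_visitors(per_partner):
    """Sum the expected monthly visitors over all partners."""
    return sum(per_partner.values())


def average_visits_per_second(visitors_per_month, days_per_month=30):
    """Average visit rate, assuming a 30-day month and evenly spread traffic."""
    return visitors_per_month / (days_per_month * 24 * 3600)


total = total_visitors(FUTURE_VISITORS_PER_MONTH)
print(total)                                        # 600000
print(round(average_visits_per_second(total), 3))   # 0.231
```

Because traffic is not evenly spread, peak load is better estimated from observed peak-to-average ratios than from this average alone.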

Two distinct tools that gather usage information are attached to the K4E search engine developed by HON, namely Google Analytics and the HON log analysis tool. The information presented by these tools enables us to estimate possible traffic requirements as well as peak usage.

HON is currently in possession of 12011 query logs. The results returned by the tool show that the average number of queries per day is currently 18, with a peak of 295 queries on 13/01/2015 (Figure 6). Most of these queries were in English (59%), as shown in Figure 7.

The output of Google Analytics confirms the current estimate of 600 users/month. According to it, the number of users in May 2015 was 729 (Figure 8).
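As a worked example of how these figures could feed a traffic estimate, the sketch below computes the observed peak-to-average ratio and projects daily query volumes for the expected 250 000 visitors/month. The projection assumes that query volume scales linearly with the number of users, which is a simplifying assumption and not a measured result.

```python
AVERAGE_QUERIES_PER_DAY = 18      # from the HON log analysis
PEAK_QUERIES_PER_DAY = 295        # peak observed on 13/01/2015
CURRENT_USERS_PER_MONTH = 600     # current estimate
FUTURE_USERS_PER_MONTH = 250_000  # expected future visitors for HON


def scale(value, current_users, future_users):
    """Scale a daily query count linearly with the number of users (assumption)."""
    return value * future_users / current_users


peak_ratio = PEAK_QUERIES_PER_DAY / AVERAGE_QUERIES_PER_DAY
future_average = scale(AVERAGE_QUERIES_PER_DAY, CURRENT_USERS_PER_MONTH, FUTURE_USERS_PER_MONTH)
future_peak = scale(PEAK_QUERIES_PER_DAY, CURRENT_USERS_PER_MONTH, FUTURE_USERS_PER_MONTH)

print(f"peak/average ratio: {peak_ratio:.1f}")                   # 16.4
print(f"projected average queries/day: {future_average:.0f}")    # 7500
print(f"projected peak queries/day: {future_peak:.0f}")          # 122917
```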

Figure 6 HON Log Analysis - query/day

Figure 7 HON Log Analysis - query by language

Figure 8 K4E - google analytics

5 References

[1] Health On the Net Foundation (HON) https://www.healthonnet.org/

[2] Trip https://www.Tripdatabase.com/

[3] KHRESMOI http://khresmoi.eu/

[4] Joao Palotti, Allan Hanbury, and Henning Muller. Exploiting Health Related Features to Infer User Expertise in the Medical Domain. Proceedings of WSCD Workshop on Web Search and Data Mining. John Wiley & Sons, Inc., New York, NY, USA, 2014

[5] MIMIR https://gate.ac.uk/mimir/doc/mimir-guide.pdf

[6] Allan Hanbury, Celia Boyer, Ljiljana Dolamic, João Palotti. Report on automatic document categorization, trustability and readability. Khresmoi public deliverable D1.6: http://khresmoi.eu/assets/Deliverables/WP1/KhresmoiD16.pdf

[7] Wei Li, Angus Roberts, Johann Petrak, Ljiljana Dolamic, Gareth J.F. Jones, Liadh Kelly, Lorraine Goeuriot. Report on results of the WP1 second evaluation phase. Khresmoi public deliverable D1.8: http://khresmoi.eu/assets/Deliverables/WP1/KhresmoiD1.8.pdf

[8] Centre for Evidence-Based Medicine. Asking Focused Questions, http://www.cebm.net/asking-focused-questions/

[9] Pecina et al., Adaptation of machine translation for multilingual information retrieval in the medical domain, Artificial Intelligence in Medicine, 6 (3), 2014.

[10] Roberts et al., D1.7: Prototype and report on semantic indexing and annotation for information retrieval, 2014, http://www.khresmoi.eu/assets/Deliverables/WP1/KhresmoiD17.pdf

Annex A

A.1 Milena

[Figure: screens 1 to 6 of the Milena scenario (use case 1)]

A.2 Nina

[Figure: screens 1 to 7 of the Nina scenario (use case 2)]

A.3 Roman

[Figure: screens 1 to 9 of the Roman scenario (use case 3)]

A.4 NOTA

[Figure: NOTA screens 1 to 6]