
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2017

Alternative Information Gathering on Mobile Devices

EDIN JAKUPOVIC

KTH, SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Abstract

Searching for and gathering information about specific topics is a time-consuming but vital practice. As mobile devices continue to grow in use and have surpassed desktop devices, the mobile market is becoming a more important area to consider. Due to the portability of mobile devices, certain tasks are more difficult to perform than on a desktop device. Searching for information online is generally slower on mobile devices than on desktop devices, even though the majority of searches are performed on mobile devices.

The largest challenges with searching for information online using mobile devices are the smaller screen sizes and the time spent jumping between sources and search results in a browser. These challenges could be addressed by an application that focuses on the relevance of search results, summarizes their content, and presents them on a single screen.

The aim of this study was to find an alternative data gathering method with a faster and simpler searching experience. This data gathering method was able to quickly find and gather data requested by a user through a search term. The data was then analyzed and presented to the user in a summarized form, to eliminate the need to visit the source of the content.

A survey was performed by having a small target group of users answer a questionnaire. The results showed that the method was quick, that results were often relevant, and that the summaries reduced the need to visit the source page. However, while the method has potential for future development, it is hindered by ethical issues related to the use of web scrapers.

Keywords: Data collection, Mobile devices, Web scraping, Summarization methods, User-centered design


Abstract (translated from Swedish)

Searching for and collecting information about specific topics is a time-consuming but necessary practice. With continuous growth that has overtaken the share held by desktop devices, the mobile market is becoming an important area to consider. Given the mobility of portable devices, certain tasks become harder to perform than on desktop devices. Searching for information on the Internet is generally slower on mobile devices than on desktop devices.

The greatest challenges with searching for information on the Internet using mobile devices are the smaller screen sizes and the time spent moving between sources and search results in a web browser. These challenges can be addressed by an application that focuses on relevant search results, summarizes their content, and presents them in a single view.

The purpose of this study is to find an alternative data collection method that creates a faster and simpler search experience. This data collection method will be able to quickly find and collect data requested by a user through a search term. The data is then analyzed and presented to the user in a summarized form, to eliminate the need to visit the source of the content.

A study was carried out in which a small target group of users answered a questionnaire. The results showed that the method was fast, that the results were often relevant, and that the summaries reduced the need to visit the source page. However, while the method had potential for future development, it is hindered by the ethical problems associated with the use of web scrapers.

Keywords: Data collection, Mobile devices, Web scraping, Text summarization methods, User-centered design


Acknowledgements

We would like to thank our advisers Fadil Galjic and Leif Lindback at the Royal Institute of Technology. The feedback and help we received during this project proved invaluable for this thesis.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Problem Statement
  1.4 Purpose
  1.5 Delimitations
  1.6 Thesis Outline

2 Theoretical Background
  2.1 Web Search engines
    2.1.1 Web Indexing
    2.1.2 Web Scraping
  2.2 Asynchronous Programming
    2.2.1 Concurrent Programming
    2.2.2 Multithreaded Android Programming
    2.2.3 AsyncTask
  2.3 Managing Data
    2.3.1 SQL
    2.3.2 PHP
  2.4 User Interface Design
    2.4.1 Colour Theory
    2.4.2 User Interactivity
  2.5 Text Processing
    2.5.1 Natural Language Processing
    2.5.2 Automatic Summarization
    2.5.3 Generic Summarization
  2.6 Web Browsers
    2.6.1 HTML
    2.6.2 CSS
  2.7 Related Work
    2.7.1 Web Scraping
    2.7.2 Summarization
    2.7.3 Similar Applications

3 Methods
  3.1 Research Strategy
    3.1.1 Research Methods
    3.1.2 Research Process
  3.2 Data Collection
    3.2.1 Literature Study
    3.2.2 Interview
  3.3 Design and Implementation of Prototype
    3.3.1 Design of Prototype
    3.3.2 Implementation of Prototype
    3.3.3 Development Environment
  3.4 Evaluation Methods
    3.4.1 Formative Evaluation
    3.4.2 Heuristic Evaluation
    3.4.3 Summative Evaluation
  3.5 Evaluating Performance
    3.5.1 Performance Metrics
    3.5.2 Methods of Evaluation

4 Collecting and Presenting Information: Challenges and Possibilities
  4.1 Issues with Using Search Engines
    4.1.1 Performance
    4.1.2 Data Usage
  4.2 Getting Information
    4.2.1 Types of Information Requirements
    4.2.2 Presentation of Data
  4.3 Improving Information Gathering
    4.3.1 Reducing Data Usage
    4.3.2 Time to Resolution
    4.3.3 Showing Relevant Data
  4.4 Challenges and Possibilities: Summary
    4.4.1 Challenges
    4.4.2 Possibilities

5 Information Gathering Application: Design and Implementation
  5.1 Design of the Application
    5.1.1 Application Functionality
    5.1.2 Webscraping for Information
    5.1.3 Application Structure
    5.1.4 Application Flowchart
  5.2 Implementation
    5.2.1 Web scraping for Data
    5.2.2 Storing and Updating of Data
    5.2.3 Creating a Summary

6 Information Gathering Application: Evaluation
  6.1 Presentation of The Results
    6.1.1 Performance Metrics
  6.2 App Usage
    6.2.1 App Usage
    6.2.2 Relevance Results
  6.3 Survey Results
    6.3.1 How Relevant were the Summaries?
    6.3.2 Did the Swipe Functionality Positively Impact the Experience

7 Discussion
  7.1 Methodology and Consequences of the Study
    7.1.1 Methods
    7.1.2 Consequences of the Study
  7.2 Problem Statement Revisited
    7.2.1 Design Decisions
  7.3 Ethical Aspects
    7.3.1 Lost Clicks and Ad Revenue
    7.3.2 Information and Copyright issues
    7.3.3 Anti web scraping Industry
  7.4 Sustainability
    7.4.1 Effect on Environment
    7.4.2 Economical Sustainability

8 Conclusions
  8.1 Summary
  8.2 Future research


Chapter 1

Introduction

With the exponential growth of online data[1] accessed through mobile devices, it is becoming more difficult to search for and find desired information about a topic. Time is often wasted sifting through data that is either irrelevant or duplicates information already collected. Search engines today rely on each individual user to sift through the returned links in order to get to the desired information. The time and number of page traversals it takes to find the desired data could be reduced by having the search application do the work of finding and presenting the information. The objective of such a system would be to reduce the bandwidth and the time spent searching for relevant information.

Improving on the current methods for collecting data requires the requested information to be presented faster, while maintaining relevance to the desired topic. Improving how data is collected in a way that benefits the user over traditional methods raises the question of presentation: how should the data be presented to the user in a way that both saves time and helps them find the desired information? This thesis presents the task of developing an information gathering method for Android devices that finds and presents relevant data to the user, and explores how to apply certain methods in Android application development. The rest of this chapter introduces the specific problems that define and motivate the focus and purpose of this thesis.

1.1 Background

Finding and gathering data online is mostly done through search engines, such as Google and Bing. The companies that offer these search services use programmable bots, known as web crawlers, that traverse the World Wide Web and create indexes for each site they gain access to.

The information gathered by the web crawler is then used to present the searcher with links to the websites that are most relevant. To present the most relevant sites first, the search engine analyses each page against many criteria to determine its relevance. Web crawlers can also be used to fetch specific data from a web page, and are then referred to as web scrapers.


1.2 Problem

There is a need to reduce the work required of a user to gather data on a specific subject on mobile devices. Different options must be explored for gathering data without the use of traditional search engines. One option that could be viable is to use web scrapers to scrape the web for data and analyze what content best characterizes the desired data. The data then has to be presented to the user in a coherent manner.

All this is needed to produce an alternative to the desktop-friendly search engines that are more difficult to use on mobile devices. This thesis explores ways in which web crawlers and web scrapers can be used, and discusses how to implement them in a smart environment in the form of an Android application. Since the data collected by the web scraper has to be processed by the application, the thesis also discusses methods of storing and processing the data.

Another problem that arises is how to present the collected data to a user in a clear way. As we try to improve upon search engines, there must be a well-thought-out design plan when developing for the Android platform. The data presented to the user must not only be simple to read and understand, but must also summarize the content without leaving important details out. This means there is a problem with both the technical and the aesthetic part of presenting data.

The task of improving upon existing methods for gathering data is a difficult one for many reasons. A successful implementation of a smart information gathering tool would need to reduce search time and bandwidth. While looking up a short description or a wanted link is easy to do in a smart device's browser using existing search engines, gathering data from several sources becomes more difficult the more data one needs on the subject.

1.3 Problem Statement

This thesis aims to investigate the following questions:

• In which way can a web scraper be used to collect relevant data on a subject? How can the collected data be stored and analyzed?

• In which way can an Android application use a web scraper for data gathering? How can the collected data be presented to promote easy access to the desired information?

1.4 Purpose

This thesis aims to find a search solution for mobile devices that reduces the bandwidth and time used for finding relevant information for the user. The experiences from this study could also lay a foundation for other people who wish to develop Android applications that make use of multithreading, summarization methods and databases.


Android developers who want to gather data from the web using their software can use this thesis to determine whether web scrapers are a viable option for accomplishing that. There are also different problems that arise when developing for the Android OS concerning data gathering and presentation. These problems include how to find, store and analyze data using web scrapers. A further issue is the user-friendliness of an application.

1.5 Delimitations

An application of this type can range from hundreds to millions of lines of code, with varying complexity. Since the data to be returned can come in hundreds of different file types and extensions, the decision was made to gather only raw text. Local and server-side caching was also excluded from this project due to uncertainties regarding legal aspects. Furthermore, the UI design of the application was kept simple, with the main focus on functionality.

1.6 Thesis Outline

The thesis is structured as follows.

• In chapter 2 the thesis presents the necessary background information along with its related sources. The chapter provides the technical background needed to fully understand the document.

• In chapter 3 our research strategies and methods are presented and briefly discussed. The chapter gives an overview of which research strategies were chosen and why.

• Chapter 4 covers the challenges and possibilities that arise when performing a web search using a mobile device.

• Chapter 5 provides an overview of the application's implementation, design and functionality.

• Chapter 6 presents the results gathered from user feedback received through questionnaires, and the statistical results generated from the data contained in the database.

• Chapter 7 discusses the design decisions made when implementing the application and what motivated these decisions. The problem statement is revisited and reflected upon. Furthermore, ethical aspects of the thesis are discussed.

• Chapter 8 ends the thesis with conclusions, future uses and possible future research within the thesis' topic.


Chapter 2

Theoretical Background

Understanding the content of this thesis requires a basic understanding of a wide range of technologies and design principles. The following chapter covers the fundamental concepts and topics required to fully understand the work and research in this document.

2.1 Web Search engines

The search engines used today rely on different technologies to find and present the links that are most relevant to a search term.

2.1.1 Web Indexing

A web search is performed by entering a search term relevant to the desired topic into a search engine. The result of the search is returned as the websites that the search engine deems most relevant. Search engines, such as Google and Bing, rely on various methods for discovering the websites that make up the Internet in order to present the user with relevant search results. The data that search engines use for displaying relevant results is obtained through a method called web indexing.[2]

Web indexing refers to various methods of indexing either a set of web pages or the whole Internet. Indexing is achieved using web crawlers that recursively visit each link on a web page. When a search engine finds a website, it takes a snapshot of the content of the website and saves it in a database. Once the search engine has a website's contents, it can quickly match the website with a user's search query.
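The crawl-then-match idea can be illustrated with a minimal Python sketch. It is not how any particular search engine works: the tiny in-memory "web" (`PAGES`), the example URLs and the function names are all hypothetical, and real indexes also rank results rather than only matching words.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("android threads and tasks", ["b.example", "c.example"]),
    "b.example": ("web scraping on android", ["a.example"]),
    "c.example": ("colour theory for design", []),
}

def crawl(start):
    """Visit every page reachable from start by following links (iterative traversal)."""
    seen, stack = set(), [start]
    while stack:
        url = stack.pop()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        stack.extend(PAGES[url][1])
    return seen

def build_index(urls):
    """Map each word to the set of URLs whose stored text contains it."""
    index = defaultdict(set)
    for url in urls:
        for word in re.findall(r"[a-z]+", PAGES[url][0].lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(crawl("a.example"))
print(sorted(search(index, "android")))
```

Because the index is built once, each query only needs a few set lookups instead of re-reading every stored page.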

2.1.2 Web Scraping

Web scraping is a data gathering technique where data is harvested from the web. After loading a website, the web scraper software can extract the available website data and repackage it into a desired format[3]. The data can then be stored locally and be used without having to access the web. As many websites do not offer the option to save specific data from their website, web scraping can be used to automate the manual technique of copying desired data by hand.


Because the content and structure of websites vary, each website requires a different solution for fetching content. The desired data is found by identifying the elements or attributes where the data resides. Web scrapers can be combined with web crawlers to gather information across many links.
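As a rough illustration of locating data by element and attribute, the following Python sketch uses the standard library's `html.parser` to collect the text inside elements that carry a given `class` attribute. The sample markup, the class names and the `scrape` helper are assumptions made for the example; a scraper for a real site would need selectors that match that site's structure.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID = {"br", "img", "hr", "input", "meta", "link"}

class ClassTextScraper(HTMLParser):
    """Collect the text inside elements that carry a given class attribute."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

HTML = """
<html><body>
  <div class="nav">Home | About</div>
  <div class="article"><h1>Title</h1><p>First <b>bold</b> paragraph.</p></div>
  <div class="footer">Contact</div>
</body></html>
"""

def scrape(html, wanted_class):
    scraper = ClassTextScraper(wanted_class)
    scraper.feed(html)
    scraper.close()
    return " ".join(scraper.chunks)

print(scrape(HTML, "article"))
```

The navigation and footer text is ignored because it sits outside the element that was identified as holding the desired data.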

2.2 Asynchronous Programming

Unlike synchronous programs, where code is executed sequentially from top to bottom, asynchronous programs use a non-blocking model in which processes in blocked states do not hinder the rest of the program[4].

2.2.1 Concurrent Programming

Concurrent programming deals with programs where instructions can be run either in parallel or without blocking. Concurrency depends on having several threads of execution.

CPU Cores

The CPUs found in mobile and desktop devices usually have several cores. A core can only execute a single instruction at a time, while maintaining a pointer to the next instruction and a small amount of fast memory known as registers. By splitting the workload between the CPU's cores, parallelism can be achieved, which can provide a speedup to a degree[5].

Threads

A thread of execution is a sequence of instructions which a CPU core can perform[6]. Threads are spawned by processes, which are programs running on a device. Threads spawned by a process contain instructions that can be executed independently of other code and do not need to be run sequentially.

A single core can have several threads in progress at the same time, but they are not run in parallel. By switching between different threads on a single CPU core, the CPU can provide the illusion of simultaneous execution. This prevents one blocked operation from holding up the others and making the program unresponsive.

2.2.2 Multithreaded Android Programming

By making use of several threads, code that has to wait for something no longer blocks the rest of the program. The task scheduler can simply assign the CPU cores new instructions. Built-in Java APIs such as Executor, ThreadPoolExecutor and FutureTask make multithreaded programs easier to write and keep track of[7].
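The effect of handing waiting work to a pool of threads can be sketched in Python, whose `concurrent.futures.ThreadPoolExecutor` plays a role comparable to the Java class of the same name. This is only an illustration, not Android code: `fetch` is a hypothetical stand-in for a blocking operation, and the delay and worker count are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(name, delay=0.2):
    """Stand-in for a blocking operation such as a network request."""
    time.sleep(delay)   # this thread waits; other threads keep running
    return f"{name}: done"

def run_all(names, workers=4):
    """Run the blocking tasks on a pool of threads and keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, names))

start = time.perf_counter()
results = run_all(["page1", "page2", "page3", "page4"])
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run one after another, the four tasks would wait for about 0.8 seconds in total; with four worker threads the waits overlap, so the measured time should be close to that of a single task.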

2.2.3 AsyncTask

AsyncTask is a class in the android.os package which provides a simple way of performing background operations and handling the result on the UI thread[8]. The UI thread is the main thread of an Android application, which updates the graphical interface. Blocking the UI thread prevents the application from re-rendering the screen, and thus gives the impression that the application is frozen. By handling tasks on a separate thread using the AsyncTask class, the UI thread is not blocked. Unlike thread handlers such as Executors, AsyncTasks are intended for shorter, less CPU-intensive operations.
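The pattern behind this, doing the slow work on a background thread and delivering the result back to the main thread, can be sketched without the Android SDK. The Python below is an analogy only and does not use or reproduce the AsyncTask API; `run_in_background`, `ui_queue` and `shown` are hypothetical names, and the single `get` call stands in for a real UI event loop.

```python
import queue
import threading

def run_in_background(task, on_result, ui_queue):
    """Run task() on a worker thread and hand the callback to the main thread."""
    def worker():
        result = task()                            # slow work, off the main thread
        ui_queue.put(lambda: on_result(result))    # deliver; do not call directly
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

ui_queue = queue.Queue()
shown = []   # stands in for the screen

thread = run_in_background(lambda: sum(range(1000)), shown.append, ui_queue)

# Simplified "main loop": the main thread only runs callbacks posted to it.
callback = ui_queue.get(timeout=5)
callback()
thread.join()
print(shown)
```

The worker never touches `shown` itself; only the main thread updates it, which mirrors the rule that only the UI thread may update the interface.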

2.3 Managing Data

Modern software applications depend on both local and server-side data to provide users with content. Managing data requires knowledge of both database design and how to access and update data.

2.3.1 SQL

SQL stands for Structured Query Language and is a programming language created by IBM in the 1970s to make it easier for developers to manage databases[9]. As a query language, SQL lets users write queries that hold the information the database management system (DBMS) needs to accomplish a specific task on the database. While many query languages have been created, SQL became the most popular and is the most used query language today. When a user wants to manage their database, for example by adding an entry to a table, a query has to be created and handled by the DBMS.
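A few typical queries can be tried with Python's built-in `sqlite3` module, which embeds the SQLite DBMS. The table `searches` and its columns are hypothetical and chosen only for illustration; they are not the schema used by the application described in this thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute(
    "CREATE TABLE searches (id INTEGER PRIMARY KEY, term TEXT, hits INTEGER)"
)

# INSERT: add entries to the table; '?' placeholders keep data out of the SQL text.
conn.executemany(
    "INSERT INTO searches (term, hits) VALUES (?, ?)",
    [("android", 3), ("web scraping", 5)],
)

# UPDATE: change an existing entry.
conn.execute("UPDATE searches SET hits = hits + 1 WHERE term = ?", ("android",))

# SELECT: ask the DBMS for the rows matching a condition.
rows = conn.execute(
    "SELECT term, hits FROM searches WHERE hits >= ? ORDER BY term", (4,)
).fetchall()
print(rows)
conn.close()
```

In each case the program only states what it wants done; how the rows are stored and found is left to the DBMS.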

2.3.2 PHP

PHP is a scripting language that is primarily used on web servers. It is often used in web development to provide interaction between a client and data stored on a server. PHP code can also be embedded directly into HTML documents to perform various functions, such as generating dynamic content. PHP code that is embedded in HTML is executed on the server and the generated HTML is sent to the client[10].


2.4 User Interface Design

Designing a mobile application provides certain challenges not found on a desktop device. The smaller screen and touch controls require extra care to ensure the application is easy to use.

2.4.1 Colour Theory

Designing an application that is visually appealing requires a fundamental understanding of colour theory. Colour theory looks at different colour combinations and how they are perceived. One way to reason about colour combinations is by looking at a colour wheel[11]. The colour wheel puts complementary colours on opposite sides and analogous colours close by, as seen in figure 2.1.

Figure 2.1: Colour wheel. Image from Wikimedia Commons[12].

Analogous colours: Colours that lie next to each other on the colour wheel are often found in nature and are considered harmonious and pleasing to the eye[13]. Because analogous colours do not create a high contrast, they are commonly used for deciding the overall colour theme of a design.

Complementary colours: Colours that lie on opposite sides of the colour wheel create a high contrast, and are commonly used when something needs to stand out. Unlike analogous colours, complementary colours can be quite jarring and should thus not be used for the overall colour palette of a design.
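The geometric idea of "opposite" and "next to" on a wheel can be expressed as hue rotations. The Python sketch below assumes the HSV hue circle as the colour wheel, which is a simplification: traditional painter's wheels place colours differently, so the complements computed here need not match those of figure 2.1. The function names and the 30 degree step for analogous colours are likewise assumptions.

```python
import colorsys

def rotate_hue(rgb, degrees):
    """Rotate an (r, g, b) colour with 0-255 channels around the HSV hue circle."""
    r, g, b = (channel / 255 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + degrees / 360) % 1.0
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))

def complementary(rgb):
    """The colour on the opposite side of the wheel (180 degrees away)."""
    return rotate_hue(rgb, 180)

def analogous(rgb, step=30):
    """The two neighbouring colours, one step to either side."""
    return rotate_hue(rgb, -step), rotate_hue(rgb, step)

red = (255, 0, 0)
print(complementary(red))
print(analogous(red))
```

A design could take its overall theme from the analogous pair and reserve the complementary colour for an element that must stand out, such as a primary button.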


2.4.2 User Interactivity

Due to their size and mobility, mobile devices are interacted with very differently from desktop devices, and applications need to be designed accordingly. Unlike with desktop devices, there are many ways of holding and interacting with a touchscreen device. Accounting for all kinds of devices requires the design to be responsive and simple. Accomplishing this requires following certain principles[14].

• Consistency: Each page of the application should be consistent with regard to design elements, such as font and colour. Furthermore, the design of the pages should be shaped by usability testing.

• Readability: Text should always have a high contrast to make it easy to read. The font size should be large enough and scale with the size of the screen. Important labels such as button texts should be extra large and visible to convey functionality.

• Simplicity: The simplicity of a design requires balance between functionality and ease of use. Using the features of an application should be easy enough to do without reading instructions, while still accomplishing the task.

• Visibility: Everything the user needs to navigate and use the application should be available without distractions. Navigation should always be presented in a way that is clear and natural. The user should never have to guess or spend large amounts of time navigating between pages.

• Feedback: The design should never have the user guessing what is happening. The state and condition of an application should always be visible, so the user does not think the application has frozen when it is loading.

When designing a UI, the placement of objects needs extra consideration. Because the interface is interacted with through touch controls, parts of the screen will be covered by fingers.

2.5 Text Processing

Text processing involves different ways of manipulating text to either extract or change parts of it.

2.5.1 Natural Language Processing

Natural Language Processing (NLP) is a field covered by computer science, computational linguistics and artificial intelligence that studies the relationship between human language and computers. NLP looks at how computers can analyze human language and derive conclusions about the content of text. There are many uses for NLP algorithms, which depend on the desired output.


2.5.2 Automatic Summarization

The process of summarizing texts using software is an applied method of NLP known as “automatic summarization”. The goal of summarization algorithms is often to generate a summary from a text that is not known in advance[15]. There are many varied methods of achieving this, but they all rely on identifying a list of keywords that define the topic of the text.

2.5.3 Generic Summarization

Generic summarization is a type of automatic summarization that focuses on producing a summary from a collection of data[16]. The goal of generic summarization is to condense a number of sentences down to a smaller number, while keeping the most relevant data. One way to achieve this is to give each sentence a relevancy score[17]. Determining the relevance of a sentence requires checking the sentence for different factors, including the occurrence of keywords. Keywords are words that occur frequently in the original text and are not stopwords. Stopwords are common English words such as “the” or “and” that do not describe the content. A sentence is assigned a relevancy score based on the following factors, which are also depicted in Figure 2.2.

• How many words in the sentence were also found in the search term. The search term is split up and a set of search-specific keywords is identified. If a sentence contains one or more keywords from the search term, it’s more likely to be relevant.

• How long the sentence was compared to an “ideal” sentence. The ideal length of a sentence tends to be around 15-20 words according to most writing guides. A sentence is given a weighted length score based on how close it comes to the “ideal” sentence length.

• Where the sentence was found in the text. Many articles and reports follow a text structure where general topics are introduced in the beginning and concluded at the end, while more specific subjects are discussed in the middle. This style of writing is known as the Hourglass Model[18]. A summary consists of more general information and does not go into the details. Sentences found in the beginning and at the end of the original text are thus given a higher weighted position score.

• The sentence’s keyword density. A sentence is given a weighted keyword density score based on how many keywords it contains. If a sentence contains many keywords, it’s more likely to contain descriptive information.

• How common the keywords found in the sentence were. For each keyword that occurs in a sentence, a weighted frequency score is given based on how frequent the keyword is in the original text. Keywords that are found more often in the original text give a higher score.

Finally, the weighted scores are combined, and the 5 most relevant sentences are selected to compose a summary. The sentences appear in the summary in the same order as they occurred in the text.
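The scoring factors described above can be sketched in Java as follows. This is a minimal illustration and not the implementation used in the thesis: the class name SentenceScorer, the stopword list, the ideal length of 18 words and all weights are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: the weights, the stopword list and the ideal
// length of 18 words are assumptions, not values taken from the thesis.
class SentenceScorer {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "and", "a", "an", "of", "to", "in", "is", "are", "it", "that", "on", "for"));
    static final int IDEAL_LENGTH = 18;

    // Lowercases the text and splits it into words on non-letter characters.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    // Counts how often each non-stopword occurs in the whole text.
    static Map<String, Integer> keywordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(text)) {
            if (!STOPWORDS.contains(w)) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Combines the five factors into one score for the sentence at `index`.
    static double score(String sentence, int index, int total,
                        Set<String> searchWords, Map<String, Integer> counts) {
        List<String> ws = words(sentence);
        if (ws.isEmpty()) return 0.0;
        int searchHits = 0, keywords = 0, frequency = 0, maxCount = 1;
        for (int c : counts.values()) maxCount = Math.max(maxCount, c);
        for (String w : ws) {
            if (searchWords.contains(w)) searchHits++;
            Integer c = counts.get(w);
            if (c != null) { keywords++; frequency += c; }
        }
        double searchScore = Math.min(1.0, (double) searchHits / Math.max(1, searchWords.size()));
        double lengthScore = Math.max(0.0,
                1.0 - Math.abs(ws.size() - IDEAL_LENGTH) / (double) IDEAL_LENGTH);
        // Position: 1.0 at the first and last sentence, 0.0 in the middle.
        double relative = total > 1 ? index / (double) (total - 1) : 0.0;
        double positionScore = Math.abs(relative - 0.5) * 2.0;
        double densityScore = keywords / (double) ws.size();
        double frequencyScore = keywords == 0 ? 0.0 : frequency / (double) (keywords * maxCount);
        return 0.3 * searchScore + 0.1 * lengthScore + 0.2 * positionScore
                + 0.2 * densityScore + 0.2 * frequencyScore;
    }

    // Returns the `limit` best sentences, kept in their original order.
    static List<String> summarize(String text, String searchTerm, int limit) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        Set<String> searchWords = new HashSet<>(words(searchTerm));
        searchWords.removeAll(STOPWORDS);
        Map<String, Integer> counts = keywordCounts(text);
        double[] scores = new double[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            scores[i] = score(sentences[i], i, sentences.length, searchWords, counts);
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        List<Integer> chosen = new ArrayList<>(order.subList(0, Math.min(limit, order.size())));
        Collections.sort(chosen); // restore the original text order
        List<String> summary = new ArrayList<>();
        for (int i : chosen) summary.add(sentences[i]);
        return summary;
    }
}
```

Calling SentenceScorer.summarize(text, searchTerm, 5) would correspond to the five-sentence summary described above. Each factor is normalized to the range 0 to 1 before weighting, so that no single factor dominates purely because of its scale.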

Figure 2.2: Summarization Flowchart.

2.6 Web Browsers

2.6.1 HTML

HTML, which stands for Hypertext Markup Language, is the standard language used for creating documents that represent web pages[19]. An HTML document consists of nested elements, with different tags that describe the content they contain. The elements that make up an HTML document are interpreted by the browser to decide how the content should be displayed, but the tags are not displayed themselves. An HTML document starts with a document type declaration, which states which version of HTML the document is written in. The content between the <html> and </html> tags describes the web page as a whole, but the visual content is placed only between the <body> and </body> tags. Programs can be run in the browser to offer dynamic content by writing JavaScript between <script> tags, or by including separate files through the “src” attribute.

2.6.2 CSS

Cascading Style Sheets (CSS) is a style language used to describe the presentation of structured documents, such as HTML or XML. While HTML documents can be styled using style attributes on each element, a style sheet makes it possible to separate the content of a document from its presentation. The style of an HTML element is declared in a rule that begins with a selector, the part of the style sheet that identifies which elements the rule applies to, for example by tag name.

The properties declared for the selector, such as colour and font, are then applied to each element matching the tag. By using id and class selectors, it is possible to target specific elements that have matching id or class attributes in the target document. In addition to specifying the colour and font of elements, CSS is also used to design the layout of a web page or document[20].

2.7 Related Work

This section covers some of the related work found during the thesis work. It presents the issues and possibilities found in similar applications.

2.7.1 Web Scraping

Web scraping deals with the extraction of information that a user is likely to find useful or interesting. A web scraper is a tool that automates information gathering online and returns the result to the user. Since any text or media found online can generally be scraped, web scraping can be applied for many different purposes. It is most often used by companies to monitor competitors’ prices or to collect information from people’s profiles. Most larger e-commerce businesses use web scrapers to track the prices of products sold by their competitors. Research companies pull a lot of data from different websites and use web scrapers because the process is automated. Web scraping is generally done either as a “one time” scrape, where data is fetched in a large batch and used, or continuously over a long time to keep track of changes.

2.7.2 Summarization

Automatic summarization techniques are used to create summaries with little or no human intervention. Summarizers are useful for getting an overview of a complete text in a shorter time. Automatic summarization is not that commonly used in business, because the technology is not mature enough and the results can vary in quality. Summarization tools can be used to summarize any media, including text and video, but are mostly used for text. Software that performs summarization is often written as an exercise in machine learning courses, but business applications are still rare.

2.7.3 Similar Applications

• Sensebot: Semantic Engines LLC is a company that has developed services related to finding information online through web scrapers. Its product “SenseBot”[21] is described as a search engine that produces summaries from search terms. According to the website, it uses text mining and multi-document summarization to produce a coherent summary. While the web scraping part, which the website refers to as “text mining”, is probably accurate, the summarization is very poorly implemented and, contrary to what is stated, clearly does not use several documents for each summary. The website claims to present users with summaries, but when tested it would only provide a single sentence, which was often found to be irrelevant. This test made it clear that a summary has to contain more than one sentence to properly describe the content.

• SMMRY (http://smmry.com/): SMMRY is a website[22] that offers a tool for summarizing text. The text to summarize can be provided through a file, through a URL or by typing it in manually, and the length of the summary can be specified as a number of sentences. Five sentences seemed to be the best compromise between length and content quality. The website does not offer any information regarding the owner, but it sells a service provided through its API. Both the SMMRY website and the developed application summarize content, but the application allows users to create summaries without providing a source.

Chapter 3

Methods

This chapter covers the research strategies and methodologies that were used in this study. Furthermore, the research process is outlined, and the data collection and result gathering methods used to answer the problem statement are presented.

3.1 Research Strategy

A literature study and an interview were conducted to gather the knowledge needed to answer the problem statement. This section outlines how the research methods were approached and why they were chosen.

3.1.1 Research Methods

The research methodologies were chosen and conducted in order to gain the knowledge required to assess and find a solution to the problem statement.

• Literature Study: A literature study is a process of gathering information about a subject from various sources, such as articles, books and research papers. The gathered information can then be processed and summarized to help gain an understanding of the researched topic. A literature study is performed with either a quantitative or a qualitative method[23]. The qualitative method looks at thoughts and opinions. This uncovers new problems and possibilities and allows one to delve deeper into the problem. The quantitative method looks at measured or deduced data. The initial literature study is often conducted with a qualitative research approach, in order to find new thoughts and trends about the researched topic. The qualitative study is then usually followed by a quantitative study, where measurable data is evaluated and interpreted to formulate facts. To assure the validity of the material, a literature study requires one to critically evaluate the information and sources, to determine the legitimacy of the content. Without a critical analysis, the gathered information can’t be used in summaries or integrated into one’s work. A literature study was chosen to get a better understanding of the fields in which the thesis was conducted.

• Interview: Due to the involvement of many different technologies and challenges, an interview was deemed useful in order to learn from the experiences of someone who has worked with similar systems. The desired information was collected through a general interview with some predetermined questions, conducted with a software engineer who had work experience in many of the technologies used. The purpose of the interview was to learn the best practices for storing and analyzing data collected from an Android application. Additionally, the interview gave insight into common problems that can occur during the development process, and how to avoid them.

3.1.2 Research Process

The work of this thesis was divided into several phases, in which different types of studies and development practices were conducted, as depicted in Figure 3.1. During the first phase of the thesis, research was performed to create a strong foundation of background knowledge in the relevant fields. The information gathered was analyzed, a hypothesis was made and conclusions were drawn. When the background knowledge was deemed strong enough, the development phase was started, where the data from the research was used to plan and implement a prototype. The development phase was conducted in an iterative manner, in which the prototype was heuristically and formatively evaluated and redesigned to improve upon the current version. The last phase of the thesis was the evaluation phase, where different evaluation methods were used to test and analyze the result of the prototype.

Figure 3.1: The Process Outline.

3.2 Data Collection

This section describes the methods used to gather and summarize data, and how they were applied.

3.2.1 Literature Study

Before the literature study could be conducted, an assessment was made to create an overview of the prerequisites needed to fully understand and approach the problem. By breaking down the problem statement into smaller subproblems, the concepts and technologies required to conduct the study were identified. With an overview of the project in mind, academic search engines such as Google Scholar and the KTH library search tools were used to search for relevant information. Further results were acquired by studying the official documentation for the programming languages Java and PHP and for the IDE Android Studio. The process of the literature study is depicted in Figure 3.2.

Figure 3.2: Literature Study Process.

• Analyzing the Knowledge Requirement: Because this thesis required background knowledge in many different fields, it was important to identify the key concepts needed to understand the problem statement. Certain topics, such as Android development and web scraping, were deemed more important for the project and were thus prioritized. With the delimitations in mind, other topics required less in-depth study in order to attain adequate knowledge.

• Acquiring Relevant Source Material: Finding source material, whether it’s a published article, a book or a blog, requires searching through different sources of varying depth and complexity. Without varying the search sources, it would have been difficult to get enough knowledge about the topics without spending a large amount of time on large case studies.

As more knowledge was obtained and applied, new topics arose that needed to be studied. Because some of the technologies used, such as web scraping, were rather new, most sources had to be fetched from online articles, documentation and studies.

• Evaluating the Content and Sources: A crucial aspect of the literature study was determining the validity and relevance of a source. Because the concepts and technologies used in this thesis are new and ever evolving, it was important to ensure that the sources were up to date in addition to being trustworthy. When possible, official documentation was used for the literature study.

• Reading, Studying and Understanding the Chosen Sources: By gathering and studying material of different levels of complexity and depth from multiple sources, a better understanding of the problem statement was attained. This led to the discovery of new subproblems and solutions that needed further study.

• Applying Knowledge to Update Design and Delimitations: With the discovery of new subproblems and challenges, the design of the prototype had to be updated to match the new delimitations and possible solutions. Each new solution or problem that was found required further study.

• Evaluating Result Gathering Methods: Finally, the literature study was used to decide which methods would be used to obtain the data required to answer and evaluate the problem statements.

3.2.2 Interview

After performing the literature study and summarizing its results, it was clear that not all questions had been answered, and that some answers could not be reliably found online or in books. The aim of the interview was to get a better understanding of the development process, and of how to develop applications specifically targeting mobile devices. The interview was held as part informal and part general interview, which means that while there were some general questions, the interview was kept rather open for discussion. This was done in order to get answers to some questions and to come up with new questions that had not previously been thought of. The interview was held with Eric Von Knorring, who is a software engineer. He was chosen due to his vast experience with many of the technologies surrounding the application.

3.3 Design and Implementation of Prototype

The prototype was intended to provide a tool that could be used to investigate the problem statement and measure a result. When the research phase of the thesis was completed, the resulting data was gathered, evaluated and used in the design of the prototype. This section describes the method used to design and implement the prototype.

3.3.1 Design of Prototype

Due to the complexity of the application, the design process was split into several parts that each needed to be analyzed and designed.

• Web Scraping Design: The gathering of data through web scraping was independently designed to fit the specific requirements of the application.

• Database Design: Designing the database refers to the implementation of a separate system used to store and update values through function calls from the mobile application. The design and choice of technology were based on previous knowledge of database paradigms and on the requirement to handle concurrent writes to the same database rows.

• UI Design: From the literature study, guidelines for creating a simple and aesthetic UI were established and documented. Implementing a proper design required several iterations, in order to reduce any distractions from the core purpose of the application.

• Information Design: The aspect of the application that required the most research was the presentation of the results. Developing solutions that give better access to the desired information led to researching Natural Language Processing.

3.3.2 Implementation of Prototype

The application was not designed and implemented sequentially, but rather iteratively and in parts. This was done to keep the project scope flexible in terms of features and quality. Because there was no end user to evaluate each iteration of the project, self-evaluations were carried out after each work phase was complete, where the design was analyzed and adjusted. The iterative workflow of implementing smaller parts of the whole application reduced the risk of not completing the program on time. Furthermore, the MVC design pattern was used so that different parts of the application could be written independently.

The implementation of the prototype was done in two parts. The first was the front-end of the prototype, which consisted of the different views of the Android application. The second was the back-end, which includes the database, the model classes and the PHP code used to query the results. During the implementation phase, the front-end and back-end were developed in parallel. This was needed to be able to test and evaluate parts of the prototype before continuing with the development. From the beginning, the front-end of the prototype was very dependent on the back-end, for example on the database, to test that certain features worked.

The parallel implementation of the front-end and back-end produced the resulting prototype. The prototype was then used to generate the results needed for the evaluation of the thesis.

3.3.3 Development Environment

This project used the IDE Android Studio to develop the application, because it is the official development environment for Android and is supported by Google. The benefit of using Android Studio was the inclusion of tools required to develop Android applications, such as emulators and built-in tools for version control and dependency management. The database and back-end were developed using the program XAMPP to set up a local server and design the database schemas. The back-end PHP code was written using the text editor Atom. The mobile application was developed to run on Android version 4.1 (Jelly Bean) and newer.

3.4 Evaluation Methods

Determining whether the application achieved its purpose of investigating and possibly solving the problem statement requires evaluation. Different methods of evaluation were needed throughout the development to ensure that the work was moving towards its target.

3.4.1 Formative Evaluation

The iterative process of designing, redesigning and implementing the project prototype made use of formative evaluation, where the prototype is modified throughout the implementation stage. While there were no stakeholders with whom to discuss the current iteration, self-reflection and iterative cycles made it possible to evaluate the project at different versions. The application was evaluated after each milestone was achieved. The progress and issues that occurred during the implementation were discussed, and the resulting discussion was used to make adjustments where applicable.

3.4.2 Heuristic Evaluation

Heuristic evaluation is a method of identifying and solving problems related to user interfaces. By applying different heuristics when creating an interactive and responsive user interface, the design of an application can achieve its purpose of directing the user without being distracting. The heuristics used were taken from the Nielsen set[24] of heuristics:

• Visibility of System Status: Give the user feedback on what is going on.

• Consistency and Standards: Ensure that system behaviour is consistent throughout the application.

• Aesthetic and Minimalist Design: Only present relevant information and options.

The heuristics apply not only to the aesthetics and design of the UI, but also to the functionality of the underlying system.

3.4.3 Summative Evaluation

While formative evaluation methods were applied during the development process, a summative evaluation was conducted after the application was implemented. The evaluation was performed to measure various metrics of the application. The summative evaluation strived to answer how viable the application is with performance and functionality in mind[25].

• Were the performance targets met in terms of speed, RAM and data usage?

• Could the application serve as an alternative method for information gathering?

• Were the summaries relevant?

3.5 Evaluating Performance

In order to evaluate the application’s performance, relevant metrics needed to be collected, analyzed and presented. The literature study provided the basis for which metrics needed to be tested in order to assess the possibility of solving the problem statements.

3.5.1 Performance Metrics

To assess the viability of the application prototype, certain key properties that were deemed necessary to fulfill were identified. The metrics were chosen as follows.

• Speed: How long it takes from entering a search term until information is visible.

• Data usage: How much network data is used from the moment a search is started until the information is fully loaded.

• RAM usage: How much RAM is used at the peak of RAM usage.

• Relevancy: How many of the searches return relevant results, and what share of the results is relevant to the search term.

The results need to be relevant to the search term, or else the application does not fulfill its purpose. The application must display the result at least as quickly as a regular search engine. The application also can’t use more network data than a regular search. Lastly, the application can’t leak memory or use significantly more memory than a regular search engine.

3.5.2 Methods of Evaluation

In order to obtain results for the performance metrics, methods that measure data accurately and reliably had to be used. Furthermore, a range of devices with different performance needed to be tested.

Speed

Measuring the speed of the application requires checking how long it takes from hitting the search button until the result has fully loaded. Furthermore, each major function was measured to identify possible bottlenecks.

The time measurements were obtained by calling a built-in function that reads the system timer, before and after the code being measured. The elapsed time was then obtained by taking the difference between the two time stamps.

long startTime = System.nanoTime();

// code being measured

long elapsedTime = System.nanoTime()-startTime;
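The measurement above can be wrapped in a small reusable helper, so that each major function is timed in the same way. The sketch below is only an illustration; the class name Stopwatch and the method timeMillis are hypothetical and do not come from the thesis.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the names Stopwatch and timeMillis are illustrative.
class Stopwatch {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long startTime = System.nanoTime();
        task.run();
        long elapsedNanos = System.nanoTime() - startTime;
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }
}
```

A call such as Stopwatch.timeMillis(() -> doSearch()) would then return the duration of a hypothetical doSearch() method. System.nanoTime() is used rather than System.currentTimeMillis() because it is intended for measuring elapsed time and is not affected by changes to the system clock.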

Data Usage

Network data usage was measured using the built-in tool Android Device Monitor. By selecting a process, the total amount of network data could be measured over a time period.

RAM Usage

The amount of RAM used could be measured by taking snapshots during runtime. This was done by using the built-in monitor and measuring during peak RAM usage.
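As a complement to snapshots taken in a monitoring tool, heap usage can also be sampled from inside a Java process. The sketch below is an illustration under stated assumptions and not the method used in this study: the class HeapProbe is hypothetical, and it only sees the heap managed by the Java runtime, not the total RAM of the process.

```java
// Illustrative only: this reads the heap of the running Java process and is
// not the monitor tool used in the thesis.
class HeapProbe {
    private long peakBytes = 0;

    // Records the heap currently in use and keeps the largest value seen.
    long sample() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        peakBytes = Math.max(peakBytes, used);
        return used;
    }

    // Returns the largest heap usage observed by sample() so far.
    long peakBytes() {
        return peakBytes;
    }
}
```

Calling sample() at regular intervals during a search gives an approximate peak, which can only be as precise as the sampling interval allows.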

Relevancy

The relevancy was measured through user feedback from a target group that tested different search inputs and gave feedback on the results. The feedback was collected through a questionnaire where the users were prompted to answer questions regarding the application. The feedback was received in the form of numerical scores and text for each question. How much of the content was relevant was measured by statistically evaluating the data in the database.

Chapter 4

Collecting and Presenting Information: Challenges and Possibilities

This chapter presents and analyzes the issues with searching on mobile devices. The result of the literature study was used to design a plan for implementing a prototype and testing its viability.

4.1 Issues with Using Search Engines

This section covers the technical issues that were found and how they affect the gathering of information when using mobile devices.

4.1.1 Performance

Since the first web page was created in 1990, websites have become a lot larger with the addition of features such as images, videos, fonts, CSS and JavaScript, to name a few. What started as a simple way of sharing information in the form of text has evolved into fully fledged web applications with complex features and, as a consequence, often large JavaScript files.

This increase in size and required performance has been a noticeable issue even for desktop users, as websites have been trending towards implementing the feature sets of web apps[26]. The issues are magnified on mobile devices, considering that factors such as bandwidth and power draw are less of an issue on connected desktop devices. Not only are mobile devices limited by their batteries and data plans; their network connection, processor and RAM are also often significantly slower, which affects performance.

Searching on a mobile device is slow not so much because of the search engines used, but rather because of the time spent loading the found pages and navigating the results. The heavy use of JavaScript in modern websites introduces features that are often unwanted when trying to find information quickly, such as animations and ads that load in dynamically. Finding information on a mobile device often takes a significant amount of time, especially if the desired information is hard to find and requires the user to check several links.

4.1.2 Data Usage

The average web page more than doubled in size from 2012 to 2016, when it was over 2.3 MB[26]. The trend is moving towards larger websites, mostly due to the increased use of images, video and other media to make pages more visually appealing. The use of JavaScript has also increased, with most websites using one or more large frameworks in addition to any other code. This increase in website size has had a greater impact on mobile device users, as faster mobile network connections, such as 4G, are not always available[27].

Trying to find information online while using a mobile device on a data plan can be costly in terms of data used. Search engines provide only a sentence or two below each result link, which makes it difficult to assess the content quality of a website before loading it. This issue is further magnified in nations where the network infrastructure is weaker and mobile devices are slower.

4.2 Getting Information

This section covers the types of information needs that exist when using mobile devices, and the issues with how the information is presented.

4.2.1 Types of Information Requirements

The types of searches and decisions made when using mobile devices often differ from those on desktop devices[28]. Searches on mobile devices are often made to help make a quick decision while on the move. Topics that require deeper research are often left for desktop devices, where searching is faster and often simpler.

4.2.2 Presentation of Data

Search engines present the results of a web search by providing links to where the desired information can be found. Users who perform searches on mobile devices are usually presented with a list of links, each with a descriptive sentence or two taken from the web page. Unlike in a desktop environment, this list is too big for the smaller screens of mobile devices.

Due to the smaller screens of mobile devices, fewer links can be seen at once, which further degrades the user experience. The links are presented in order, with the most relevant links placed at the top according to whatever algorithm the search engine uses for ranking. This still requires the user to either trust the search engine’s top link, or to manually visit sites until the information they are searching for is found.

The desired information is always at least two clicks away, and even when it’s found, it often comes with additional undesired information. Mobile searches are mostly made to get quick and convenient answers. Finding the desired information in a long article or web page is time consuming, which is detrimental to the goal of most mobile searches.

4.3 Improving Information Gathering

This section covers the solutions that were found in the literature study and how they could be applied to solve the issues.

4.3.1 Reducing Data Usage

As of April 2017[29], images and scripts account for over 88% of the average website size and are growing larger every year. HTML documents, which contain the content, account on average for less than 2% of the size of a website. Unless additional HTML content is loaded after the static page is fetched, the desired information of a website accounts for less than 50 kB on average.

By only fetching the content that is required to extract the desired information, less time is spent waiting for images, fonts and CSS to load, and data usage is reduced. This extraction of data can be done manually, or be automated using web scrapers. Tools such as “BeautifulSoup”[30], “phantomjs”[31] and “jsoup”[32], which are made for extracting information from websites, are available for many languages, which reduces the need to implement a custom solution.
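Once only the HTML document has been fetched, the readable text can be separated from the markup. The following Java sketch illustrates the idea with plain regular expressions on an HTML string. It is a deliberately naive example: the class TextOnly and its method are hypothetical, HTML entities are not decoded, and real pages are better handled by a parser library such as the ones mentioned above.

```java
// Naive sketch for illustration; real pages are better handled by a parser
// library such as jsoup. The class and method names are hypothetical.
class TextOnly {
    // Drops script and style blocks, then all remaining tags, and
    // collapses runs of whitespace into single spaces.
    static String visibleText(String html) {
        String withoutCode = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        String withoutTags = withoutCode.replaceAll("(?s)<[^>]*>", " ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

For a small page containing a style block, a script block, a heading “Title” and a paragraph “Hello world”, the method returns only the heading and paragraph text. The text that remains is typically a small fraction of the full page weight, which is what makes this approach attractive on a mobile data plan.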

4.3.2 Time to Resolution

The number of clicks and the time it takes to find the desired information vary from search term to search term, but have a large impact whenever a user waits for a page to load. According to data collected in 2016[33], over half of mobile users will leave a site if the page has not loaded within 3 seconds, which impacts the time it takes to answer the user’s information requirement. Furthermore, 77% of web pages take more than 10 seconds to load on a 3G network[33]. The time to resolution is one of the big issues with mobile searching and can often be attributed to large ads, slow sequential requests and overly stylised websites.

4.3.3 Showing Relevant Data

Finding relevant data is the main goal of search engines. It can be difficult for search engines to know exactly what the user wants to find when they are doing a specific search. That is why it is useful to get user feedback on which results are considered relevant. Applying user feedback further expands upon the search engine algorithms[34] that are used to rank websites on their relevancy to a search term. When each website has been given an individual relevance rank, the websites are presented on the result page with the most relevant result at the top.


The search links only have a sentence or two of content, which does not indicate enough about the information stored behind the link. When presenting the search results on a result page, introducing an abstract helps the user make a decision on the website’s relevance and could be enough to satisfy the information need.

4.4 Challenges and Possibilities: Summary

This section summarizes the challenges and possibilities found during the literature study.

4.4.1 Challenges

Of the many challenges discovered, special focus was given to the two most difficult. The performance aspect had to be prioritized in order for the application to be considered an alternative to regular search methods. In particular, the time to resolution was a key performance metric to have in mind. Displaying relevant data was the second difficult challenge, due to the subjectivity of different summary methods.

In the case of performance, it was discovered that mobile network speed had the largest impact on web scraping performance. While it does not take very long to scrape a single web page, the time it would take to scrape several pages one after another would add up to an unacceptable amount of time. As the application needs to present several results from many websites, this challenge had to be solved.

In the case of showing relevant data, finding and extracting the correct content from web pages proved to be an issue. Because there were no preset rules for a specific page, the web scraper had to be configured to work on all kinds of pages. Due to the smaller screen sizes available on mobile devices, choosing which sentences to present is important to make good use of screen space. Choosing the best sentences is not easy, as the most relevant sentences can be located anywhere on the page, and page layouts differ widely.

Other smaller challenges, such as how data should be stored and how to design a good UI, were also discovered, but did not require as much in-depth research.


4.4.2 Possibilities

Along with discovering challenges, a number of possibilities were also found.

As web scraping can be slow when scraping several pages one after another, solutions for this problem were researched. One of the possible solutions that was found was the use of threads. By taking advantage of a mobile device’s different CPU cores (if it has more than one), threads could be used to scrape several web pages simultaneously. This could reduce the time to resolution if implemented effectively.

Another challenge was to decide which sentences in a web page’s content to present to the user. A solution that was found for this challenge was the use of automatically created summaries. To be able to create these summaries, research had to be conducted in the field of summarization methods. By using generic summarization to give sentences a weighted score based on a scoring algorithm, the more relevant sentences could be separated from the less relevant ones.

Possible solutions to the smaller problems include creating a database to store all necessary data, and creating a UI based on the research made on colour theory and user interactivity.


Chapter 5

Information Gathering Application: Design and Implementation

This chapter provides an overview of the created application and all of its functionality. Furthermore, flowcharts and diagrams are presented. The implementation of each component of the application is described.

5.1 Design of the Application

This section covers the structure and functionality of the Android application, which is used to gather the results needed to answer the problem statement.

5.1.1 Application Functionality

The Android application is an information gathering tool developed for this thesis that finds and returns information instead of links as a result. The application was specifically developed with mobile devices in mind, where different issues appear compared to connected devices. The app focuses on reducing data usage and presenting summaries of links that give an insight into the full content.

The main functionality of the Android application is to provide a user with a search bar, where they can enter a search term. After a user has entered a search, they are presented with a couple of search results, which are ranked based on how relevant they are. The search results consist of a text summary and a link to the web page from which the specific summary was generated. The user can then either expand the summary to read more, visit the link or affect the relevance ranking by swiping left or right on the result. From this page the user can either perform a new search or get a larger summary, which consists of all the results that were swiped right.


5.1.2 Webscraping for Information

The implementation of this search application was achieved by making use of the already indexed web, provided by search engines such as Google. In order to gather the relevant information from the found links, the links were web scraped to fetch the HTML document and retrieve text from the desired elements.

5.1.3 Application Structure

An overview of the package structure is depicted in figure 5.1. The “ViewActivity” package contains the classes which represent the different pages in the application. The classes implement the functionality of both a view and a controller. This is because the code that describes the user interface and the code that handles interaction lie in the same class. The model package contains a package called “DatabaseHandler”, which is responsible for connecting to a database and for fetching and updating data. The model also contains the “WebScraping” package, which contains the classes that scrape links for data and generate the summaries that are presented in the view. When a search is performed, the “ThreadSearch” class performs multithreaded web scraping and summaries are generated in the “ThreadScrapeResult” class. Lastly, the DTO package contains classes which are used to gather and transfer data between classes.

Figure 5.1: Package Overview of the Application.


5.1.4 Application Flowchart

Start Screen

When the application is started, it fills a list with stop words that will be used by the summarization algorithm. Entering a search term switches from the “MainActivity” intent to the “ResultPage” intent, as depicted in figure 5.2.

Figure 5.2: Starting Screen Overview.

Before the page loads, it fetches the search term and creates an AsyncTask to perform tasks on a background thread. Relevant links which contain information are web scraped from a search engine and used to update the database. A thread is spawned for each link, and data is web scraped and summarized before being collected. The gathered result is sorted and displayed to the user. A new search can be made by pressing the “New Search” button, and a summary of the relevant results is presented with the “Continue” button. The flowchart is depicted in figure 5.3.

Figure 5.3: Result Screen Overview.


The relevant summaries are fetched, and the database is called on a background thread to update which links are relevant. A new search is made by pressing the “New Search” button. The final result is depicted in figure 5.4 and the overall flowchart is depicted in figure 5.5.

Figure 5.4: Summary Result Page.

Figure 5.5: Overall Application flowchart.


In App Views

Screenshots from the application are depicted in figures 5.6, 5.7 and 5.8. Furthermore, the colour palette is depicted in figure 5.9.

First of all, depicted in figure 5.6, is the start screen of the application. This is the initial view presented to the user after starting the application. This view includes a search bar where the user can enter their desired search term and a button to initiate a search.

Depicted in figure 5.7 is the result view that the user reaches after initiating a search. A list of search results, which consist of summaries, is presented. Beneath each summary, there is a link to the source page from which the summary was generated. The user can swipe left or right on search results to rate their relevance. A left swipe declares that the result is considered irrelevant and a right swipe that it is considered relevant. The complementary colours red and green are used to represent the two choices. The view includes a button that returns the user to the start screen in order to make a new search. After one or more results have been swiped as relevant, there is a button for continuing on to view the collected summaries that were deemed relevant.

In figure 5.8, the summary view of the application is depicted. This is the view that is reached when continuing from the result screen. This view presents the chosen summaries from the search results on a scrollable page. There is also a new search button to initiate a new search.

Lastly, the chosen colours are presented in figure 5.9. The application has a theme that consists of shades of green that are analogous on the colour wheel. The text is black on white, which has high contrast and is easy to read. To clarify which swipe direction indicates a relevant or irrelevant result without providing instructions, the complementary colours red and green are used.


Figure 5.6: Application Start Screen.

Figure 5.7: Application Result Screen.

Figure 5.8: Application Final Page.

Figure 5.9: Colour Palette.


5.2 Implementation

This section covers the implementation of the mobile application.

5.2.1 Web scraping for Data

To gather data that was requested by a user, a data gathering method had to be chosen. The use of a web scraper as the data gathering method was an obvious choice, but some consideration was required when it came to implementing it. Writing a web scraper from scratch was considered, but was deemed outside the scope of this thesis. Instead, the web scraping library “jsoup” was chosen because it is written in Java and thus simple to incorporate.

In order to display relevant data, Google was used as a source for finding relevant links. This was achieved by fetching the links from a Google search query as depicted in listing 5.1. By inspecting the HTML source code, the links were found to be children of <h3> elements with the class name r.

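// Fetch the Google result page for the search term.
Document doc = Jsoup.connect("https://www.google.se/search?q=" + searchTerm).get();
// Result links are <a> children of <h3 class="r"> elements.
Elements searchLinks = doc.select("h3.r > a");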

Listing 5.1: Code for fetching relevant links

Links that contain relevant data could now be identified and scraped for their content. Scraping the content of a website required the application to wait for a TCP connection to be established before a GET request could be made. The time it takes to establish a connection and start downloading content from the web server is substantial, and thus fetching data from several sources sequentially was not an option. To circumvent this issue, a thread was created for each instance where HTTP requests were made. The handling of threads and data was achieved by using an ExecutorService from the java.util.concurrent package.

An ExecutorService provides methods for setting the number of threads to run and for invoking functions on those threads. Example code showing how to initiate the ExecutorService is depicted in listing 5.2, where numOfThreads is equal to the number of links to web scrape.

ExecutorService executor = Executors.newFixedThreadPool(numOfThreads);

Listing 5.2: Setting the number of threads to use.

By providing a list of Callable tasks to the ExecutorService, the executor can invoke all tasks and gather the results once the threads are done. The results of the callable tasks are saved in a list and can then be further processed. This simplifies the issue of syncing threads and allows all threads to finish before proceeding. The results from the executor are saved in a list as depicted in listing 5.3.

List<Future<ThreadScrapeResult>> futures = executor.invokeAll(callableTasks);

Listing 5.3: Gathering the results from multiple threads.
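The following is a minimal, self-contained sketch of how listings 5.2 and 5.3 fit together. It is not the code of the thesis application: the class name `ScrapeSketch`, the method `scrapeAll` and the stand-in task `scrapeAndSummarize`, which only returns a placeholder string, are hypothetical names chosen for illustration. In the real application each task would perform the jsoup request and the summarization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ScrapeSketch {

    // Hypothetical stand-in for scraping and summarizing one link.
    // A real task would fetch the page and build the summary here.
    static String scrapeAndSummarize(String link) {
        return "summary of " + link;
    }

    // Runs one task per link on its own thread and collects the results
    // in the same order as the input links.
    static List<String> scrapeAll(List<String> links)
            throws InterruptedException, ExecutionException {
        if (links.isEmpty()) {
            return new ArrayList<>(); // newFixedThreadPool(0) would throw
        }
        ExecutorService executor = Executors.newFixedThreadPool(links.size());
        try {
            List<Callable<String>> callableTasks = new ArrayList<>();
            for (final String link : links) {
                callableTasks.add(() -> scrapeAndSummarize(link));
            }
            // invokeAll blocks until every task has completed.
            List<Future<String>> futures = executor.invokeAll(callableTasks);

            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                // get() rethrows a failure inside a task as ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            executor.shutdown(); // release the worker threads
        }
    }
}
```

Because `invokeAll` only returns once all tasks are done, no explicit synchronization between the worker threads is needed in this sketch.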


5.2.2 Storing and Updating of Data

In order to keep track of which results were deemed relevant, a database was used to store the relevancy related to each search result. By keeping track of which summaries were swiped left or right, a summary’s relevancy score was updated, and the summaries could be presented in a different order with the highest relevance first. Performing a complete search requires two connections to the database.

Using the database required a web server hosting PHP files. The PHP files handled requests from the application and were used to perform queries. Setting up the connection was done by including a PHP file with the configuration for connecting to the database, as depicted in listing 5.4. The queries were performed through PHP instead of directly calling the database for security reasons: mobile applications can be decompiled with readily available software, which would expose a database configuration file shipped with the application.

$host = "localhost";
$dbname = "kandidat";
$username = "root";
$password = '******';
// Open the MySQL connection through PDO.
$connection = new PDO("mysql:host=$host;dbname=$dbname", $username, $password);

Listing 5.4: Database configuration for PHP.

When Entering a Search

When a new search was made, a few items needed to be saved. The search term was needed to identify separate searches. The domains and full URLs from the initial web scrape were required to identify search terms that give several results from the same domain. The links from the resulting web scrape were used to prepare a query as depicted in listing 5.5. The queries were issued by sending a POST request to the web server with the information to upload.

// Add one URL parameter and one domain parameter per scraped link.
for (int i = 0; i < links.size(); i++) {
    builder.appendQueryParameter("searchUrl" + i, links.get(i));
    builder.appendQueryParameter("domainUrl" + i, domains.get(i));
}

Listing 5.5: Query used to update the database with URLs and domains.

The web server would then perform an SQL query and update the database with the new information. In order to prevent security issues such as SQL injections, user input was handled using built-in functions such as parameter binding, as depicted in listing 5.6.

// Bind the posted URL to the :search placeholder of the prepared statement.
$stmt->bindParam(':search', $_POST['searchUrl' . $x]);
$stmt->execute();

Listing 5.6: Binding input parameters to prevent SQL injections.

After the information was uploaded and updated, the relevance of each link was fetched and returned from the web server.


After Choosing which Links are Relevant

Following a successful search, users were able to swipe either left or right to rank the relevancy of a summary. To update this information, the database was queried with the information required to identify which summaries were ranked and what score they got. This was used to update how relevant a summary was for each search term, which would change the order of future results based on relevancy.
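The thesis does not state the formula behind the relevancy score, so the following is only a sketch under assumptions: the score is taken to be the share of right swipes among all swipes for a given search term and URL, and a summary nobody has swiped yet gets a neutral 0.5. The class `RelevanceSketch`, its methods and the in-memory map (standing in for the database) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RelevanceSketch {
    // Swipe counts per (search term, URL): index 0 = right swipes, index 1 = all swipes.
    private final Map<String, int[]> swipes = new HashMap<>();

    // Hypothetical key format combining search term and URL.
    private static String key(String searchTerm, String url) {
        return searchTerm + "\n" + url;
    }

    // Records one swipe: right means relevant, left means irrelevant.
    void recordSwipe(String searchTerm, String url, boolean swipedRight) {
        int[] counts = swipes.computeIfAbsent(key(searchTerm, url), k -> new int[2]);
        if (swipedRight) {
            counts[0]++;
        }
        counts[1]++;
    }

    // Assumed score: share of right swipes, or 0.5 when there are no swipes yet.
    double score(String searchTerm, String url) {
        int[] counts = swipes.get(key(searchTerm, url));
        if (counts == null || counts[1] == 0) {
            return 0.5;
        }
        return (double) counts[0] / counts[1];
    }

    // Returns the URLs ordered with the highest relevancy first.
    List<String> rank(String searchTerm, List<String> urls) {
        List<String> ranked = new ArrayList<>(urls);
        ranked.sort(Comparator.comparingDouble(
                (String url) -> score(searchTerm, url)).reversed());
        return ranked;
    }
}
```

Since the sort is stable, results with equal scores keep the order in which the search engine returned them.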

Database Design

The database was designed to contain all the information that was necessary to identify and score each summary. The URL, domain and search term were used to identify a unique summary, without having to store the actual text. Furthermore, some additional columns such as “numOfHits” and “noOfSearches” were used to collect user data as depicted in figure 5.10.

Figure 5.10: Database Design.

5.2.3 Creating a Summary

Displaying relevant information required the result to contain text related to the search term. Furthermore, the results had to be short enough to be useful while on the move, while still containing the necessary information. To achieve this, an algorithm was created that extracts the most relevant sentences from a text. Identifying which sentences were the most relevant was achieved by assigning scores and identifying keywords. These keywords were identified by taking the most used words from the text which were not among the common words (stop words). After the text had been fully analysed and scored, the highest rated sentences were returned, in the order found in the text. Each sentence was given a weighted score based on:

• How many words in the sentence were also found in the search term.

• How long the sentence was compared to an “ideal” sentence.

• Where the sentence was found in the text.

• The sentence’s keyword density.

• A score based on how common the keywords found in the sentence were.
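The five factors listed above can be combined in many ways, and the thesis does not give the exact weights or formulas. The sketch below is therefore only one possible reading: the stop word list, the ideal length of 20 words, the number of keywords, the position formula and all weights are illustrative assumptions, and `SummarySketch` with its methods is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SummarySketch {
    // All constants are illustrative assumptions, not values from the thesis.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "to", "and", "in", "it"));
    static final int IDEAL_LENGTH = 20; // assumed "ideal" sentence length in words
    static final int NUM_KEYWORDS = 5;  // assumed number of keywords to keep

    // Splits text into lower-case alphanumeric words.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    // Keywords: the most frequent words that are not stop words, with their counts.
    static Map<String, Integer> keywords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(text)) {
            if (!STOP_WORDS.contains(w)) counts.merge(w, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        Map<String, Integer> top = new HashMap<>();
        for (Map.Entry<String, Integer> e
                : entries.subList(0, Math.min(NUM_KEYWORDS, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // Weighted score of one sentence, built from the five factors.
    static double score(String sentence, int index, int total,
                        String searchTerm, Map<String, Integer> keywords) {
        List<String> ws = words(sentence);
        if (ws.isEmpty()) return 0.0;
        Set<String> query = new HashSet<>(words(searchTerm));
        query.removeAll(STOP_WORDS);
        int maxFrequency = 1;
        for (int f : keywords.values()) maxFrequency = Math.max(maxFrequency, f);
        int queryHits = 0, keywordHits = 0, keywordFrequency = 0;
        for (String w : ws) {
            if (query.contains(w)) queryHits++;
            Integer f = keywords.get(w);
            if (f != null) { keywordHits++; keywordFrequency += f; }
        }
        double queryScore = query.isEmpty()
                ? 0.0 : Math.min(1.0, (double) queryHits / query.size());
        double lengthScore =
                1.0 - Math.min(1.0, Math.abs(IDEAL_LENGTH - ws.size()) / (double) IDEAL_LENGTH);
        double positionScore = 1.0 - (double) index / total; // earlier is better (assumption)
        double density = (double) keywordHits / ws.size();
        double commonness = keywordHits == 0
                ? 0.0 : (double) keywordFrequency / (keywordHits * maxFrequency);
        return 0.3 * queryScore + 0.1 * lengthScore + 0.1 * positionScore
                + 0.25 * density + 0.25 * commonness;
    }

    // Returns the n highest-scoring sentences in their original order.
    static List<String> summarize(String text, String searchTerm, int n) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        Map<String, Integer> kw = keywords(text);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) order.add(i);
        order.sort((x, y) -> Double.compare(
                score(sentences[y], y, sentences.length, searchTerm, kw),
                score(sentences[x], x, sentences.length, searchTerm, kw)));
        List<Integer> chosen = new ArrayList<>(order.subList(0, Math.min(n, order.size())));
        Collections.sort(chosen); // restore the order found in the text
        List<String> summary = new ArrayList<>();
        for (int i : chosen) summary.add(sentences[i]);
        return summary;
    }
}
```

Calling `SummarySketch.summarize(pageText, "What is Brexit?", 3)` would return up to three sentences of `pageText`. The regular expression used for sentence splitting is deliberately naive and would, for example, split after abbreviations.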


Chapter 6

Information Gathering Application: Evaluation

This chapter covers the application’s performance as well as user feedback.

6.1 Presentation of The Results

The following section presents the performance metrics gathered from using the application. The tests were performed using an emulator on three different mobile devices, with different levels of performance. The search terms were chosen to test results that consist of large and word-dense pages, as well as smaller and more compact texts.

• “What is Brexit?”: This search term was the third most searched on Google during 2016 and the pages were on average 1.78 MB large across the tested devices.

• “Theory of Relativity”: This search term was chosen to test the performance on more mobile friendly web pages with an average page size of 0.63 MB across the tested devices.

6.1.1 Performance Metrics

Measuring the performance of the application was done by testing for three different metrics.

Performance: Speed

The speed of the application was measured in nanoseconds for each function, from entering a search term until the result is displayed. The time that passes from pressing the search button until retrieving the result consists of four major functions that gather and operate on data. The performance metrics display these individual functions and how much of the total time they account for. The results are displayed in milliseconds.


• GoogleScrape: The function that performs a web scrape on Google to fetch relevant links for a search term.

• QueryTime: The time it takes to update the database with the new results and receive an answer.

• ThreadSearch: The function that splits the workload up into threads and gathers the summarized data from each web scraped link.

• Avg Summarize: The average time it took for a device to create a summarization.

• Max Summarize: The longest time it took for a device to create a summarization.
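The measurement described above, timing each function in nanoseconds and reporting milliseconds and shares of the total, could be sketched as follows. `TimingSketch` and its methods are hypothetical helper names and not part of the thesis application; the step names passed to it are free-form labels.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

class TimingSketch {
    // Elapsed nanoseconds per named step, in the order the steps were first run.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    // Runs one step and records how long it took, using the monotonic clock.
    void time(String name, Runnable step) {
        long start = System.nanoTime();
        step.run();
        elapsedNanos.merge(name, System.nanoTime() - start, Long::sum);
    }

    // Elapsed time of one step, converted to whole milliseconds for display.
    long millis(String name) {
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos.getOrDefault(name, 0L));
    }

    // Share of the total measured time spent in one step, in percent.
    double percentOfTotal(String name) {
        long total = 0;
        for (long nanos : elapsedNanos.values()) {
            total += nanos;
        }
        return total == 0 ? 0.0 : 100.0 * elapsedNanos.getOrDefault(name, 0L) / total;
    }
}
```

`System.nanoTime()` is used rather than wall-clock time because it is intended for measuring elapsed intervals and is not affected by changes to the system clock.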

In figure 6.1 the results from the search term “What is Brexit?” are displayed. The difference between the results of the smartphones was small. On average 89.356% of the time was spent web scraping, while creating the summary accounted for less than 5%.

Figure 6.1: Time metrics for search example 1.

In figure 6.2 the results from the search term “Theory of Relativity” are displayed. Once again there were no significant differences in performance. On average 92.63% of the time was spent web scraping, with less than 5% used for summarizing the results.

Figure 6.2: Time metrics for search example 2.


Performance: Single vs Multiple Threads

Two different implementations of the application were tested, to see how long it takes from performing a search to receiving the result. One version made use of multiple threads to web scrape and create the summaries, while the other used a single thread. The average time was computed by taking the average from 30 searches with caching disabled. The single threaded version took on average 8485 ms while the multithreaded version took on average 2381 ms. The single threaded application was on average 3.36 times slower as depicted in figure 6.3. The depicted time is the sum of the functions shown in figure 6.1.
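The kind of comparison described above can be reproduced in miniature without network access. The sketch below replaces each HTTP request with a fixed wait, so the numbers it produces say nothing about the thesis measurements. `ThreadComparisonSketch`, `simulatedFetch` and the latency value are hypothetical. Because the simulated work is pure waiting, the gain in this sketch does not depend on the number of CPU cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadComparisonSketch {
    // Stand-in for one HTTP request: the thread only waits, as it would for a server.
    static int simulatedFetch(int latencyMillis) throws InterruptedException {
        Thread.sleep(latencyMillis);
        return latencyMillis;
    }

    // Fetches all "pages" one after another; returns elapsed milliseconds.
    static long runSequential(int pages, int latencyMillis) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < pages; i++) {
            simulatedFetch(latencyMillis);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Fetches all "pages" at once, one thread per page; returns elapsed milliseconds.
    static long runParallel(int pages, int latencyMillis) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(pages);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < pages; i++) {
                tasks.add(() -> simulatedFetch(latencyMillis));
            }
            long start = System.nanoTime();
            for (Future<Integer> future : executor.invokeAll(tasks)) {
                future.get(); // surface any failure inside a task
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            executor.shutdown();
        }
    }
}
```

Dividing the result of `runSequential(10, 200)` by that of `runParallel(10, 200)` gives a speed-up factor for the simulated case, analogous to the ratio reported for the real application.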

Figure 6.3: Performance difference between single and multithreaded application.


Performance: RAM

The amount of RAM used by the application was measured by performing a search and measuring the peak RAM usage, and is depicted in figures 6.4 and 6.5. The RAM usage was low throughout the testing and no memory leaks were discovered.

Figure 6.4: Allocated used Ram.

Figure 6.5: Unallocated available Memory.


Performance: Data

Measuring the amount of network data used was achieved by using the IDE’s built-in network monitor tool. The average data used for a search was measured both through the application and through a normal web browser. Browser data was measured by searching for a search term, clicking a link and letting the page fully load. The data was collected by measuring the average data usage from the same links scraped by the application. The application data was measured by performing a search and getting the summaries. On average the application would reduce the data amount by 60-80%, depending on the search term and mobile device, as depicted in figures 6.6 and 6.7.

Figure 6.6: Network Data saved for Search Example 1.

Figure 6.7: Network Data saved for Search Example 2.

6.2 App Usage

This section covers the data gathered from tests performed by people taking a survey. The survey had 18 participants who filled out a questionnaire and an additional number of people who tried the application. The testing group consisted mostly of students between the ages of 20 and 30 with decent to good knowledge of mobile applications.


6.2.1 App Usage

From the data collected, the following results were obtained.

• 750 unique summaries were created.

• 183 summaries were ranked.

• 91 summaries were deemed irrelevant.

• 92 summaries were deemed relevant.

• 80 unique search terms were searched for.

• 440 unique domains were found.

6.2.2 Relevance Results

From the testing pool of people partaking in the survey, relevance ratings were associated with URLs for specific search terms. There were 80 unique search terms entered and the average search term was entered 2.8250 times. Out of the 750 unique summaries, 24.4% were swiped. This resulted in the average summary having a relevance score of 0.499213.

6.3 Survey Results

A survey was conducted in order to get user feedback and search data. The results were too biased to be of any scientific use, because the survey group consisted of friends and acquaintances, who would most likely give higher scores. The survey nevertheless showed which parts of the application could be improved and which parts were good. The questions had both a scoring system from 0-5 and a text field for answers.

6.3.1 How Relevant were the Summaries?

The feedback from the survey indicated that the summaries were good for the most part, with 55.6% of the votes giving the summaries a 5/5 and 33.3% giving them a 4/5. The mean score was 4.28 and the standard deviation of the answers was 0.75593.

6.3.2 Did the Swipe Functionality Positively Impact the Experience?

The swipe functionality had a larger spread of scores, with 33.3% giving a score of 2 and 44.4% giving a score of 3. The standard deviation was 1 and the mean score was 3. The result indicates that the swipe functionality could improve the results over time but has flaws.


Chapter 7

Discussion

This chapter discusses the different methods, solutions and problems found during this project. The problem statement is revisited and discussed. Furthermore, the repercussions of the application are discussed from a sustainable and ethical viewpoint.

7.1 Methodology and Consequences of the Study

This section covers the methodologies applied to evaluate and answer the problem statement. It also covers the consequences of the study.

7.1.1 Methods

The following methods were used during the project.

Literature study

Investigating possible solutions for the problem statement required background knowledge in a wide set of topics. Because there was a lack of experience with some of the technologies required to develop the Android application, a literature study was performed to gain the required knowledge. Furthermore, presenting the data in a way that would promote faster access to the desired information was difficult because no obvious solutions existed. Performing the literature study was more a necessity than a choice.

Interview

The reason for doing the interview was partly to gain insight into what best practices exist, but also to settle some uncertainties that arose during the early stages of designing the application. The interview did not cover many questions regarding Android applications, but it proved invaluable for managing the project in terms of time and scope.


Design and implementation

Some of the minor issues were UI related. While most modern web pages are designed to work well on mobile devices, they are often harder to navigate than their desktop counterparts. The more major issues that appeared were due to performance and network data limits. Because most modern websites make use of images and large JavaScript frameworks, a lot of data and performance is used to load web pages. We chose to apply the MVC architectural pattern and evaluate our work formatively. The reason for choosing the design pattern was mostly previous experience of developing with MVC. Because Android development was a new concept that had to be learnt for this thesis, it was beneficial to apply already known concepts. By using the MVC pattern and evaluating the work formatively, it became easier to modify and rewrite a layer of the application without other layers being affected. This was essential for the development cycle, where the design often had to be changed. One of the issues with using MVC was that it was difficult to define what the control layer was. The view of an Android application acts more as a view-controller hybrid, which reduces cohesion and reduces the amount of code that could be reused. Other design patterns were considered, such as MVVM and more Android-specific patterns.

Evaluation

• Formative Evaluation: Formative evaluation was used during the design and implementation process. This proved very useful for developing an application that was constantly changing, as decisions to change the application’s functionality were made based on incoming new information.

• Heuristic Evaluation: Presenting the data to the user required us to consider many aspects of what makes a design presentable without being distracting. We considered many different implementations, and applied different heuristics to promote faster and better understanding of the application’s results. We chose to apply heuristic evaluation because we found that poorly designed visuals could affect how well the information was received.

• Summative Evaluation: In order to decide if the application could be used to solve the problem statement, the results had to be evaluated both from the experience of individual users and from collected data. We chose to apply a summative evaluation in order to decide if the application was able to achieve the performance metrics we desired, and if the results were relevant. The evaluation was performed by measuring data and asking users for feedback. The evaluation gave us some clear answers on which aspects worked and what could be improved.

7.1.2 Consequences of the Study

From the results collected and the feedback received from the target group, we found that presenting information in this way could certainly be an attractive alternative for some information needs.


7.2 Problem Statement Revisited

The following questions were the problem statements of this thesis:

• In which way can a web scraper be used to collect relevant data on a subject? How can the collected data be stored and analyzed?

• In which way can an Android application use a web scraper for data gathering? How can the collected data be presented to promote easy access to the desired information?

From the problem statements, the following questions could be extracted.

• In which way can a web scraper be used to collect relevant data on a subject? The solution we came up with was to make use of the already indexed web provided by the search engine Google. By relying on an established search engine for indexing and finding relevant links, we could use those links as the source for extracting information. This was done in order to reduce the scope of the project, and because building a better way of finding relevant links was deemed unachievable.

• How can the collected data be stored and analyzed? A decision was made to store and update data on a web server. This was done in order to get easier access to metrics, such as which search terms were popular and what was deemed relevant. By having users update the same information, a collective effort was made to push more relevant summaries to the top. According to the results gathered through the questionnaire, most users found that this could improve the search results over time. Preventing vote manipulation was considered, but would have expanded the scope of the project too much. Collecting various data about searches also made it simple to analyze how well different aspects of the application worked.

• In which way can an Android application use a web scraper for data gathering? Data was gathered efficiently by using threads, which reduced the time spent waiting for HTTP requests. Since Android applications are built using Java, several web scraping libraries were available. We included an existing web scraping library rather than implementing our own, because of the better performance. The initial implementation, which used only a single thread, proved to be much slower than a search made on a regular search engine. Since most Android devices have several cores, multithreading became a good solution for improving the search performance, as shown by the measured results.


• How can the collected data be presented to promote easy access to the desired information? By sorting all the gathered links by their relevancy score, the most relevant result was presented at the top of the list of results. With the most relevant results at the top and irrelevant results pushed further down, the time needed to resolve an information need could be shortened. An issue with finding information online is that the text often has filler sentences or is too long. To make the information faster and easier to understand, we decided to present the user with summaries. These summaries were long enough to solve the information need, but short enough to be understood quickly. In case the summary was not detailed enough, each summary had a link to the source page. According to the user feedback, the summaries generated by the application were overall rather good. Most of the negative feedback was due to the swipe functionality being considered too sensitive. Overall, the feedback collected indicated that the application could definitely see some usage, but would require more work in order to smooth out some minor issues.
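
The last two answers can be illustrated together with a minimal sketch: result pages are scraped concurrently on a thread pool, and the results are then sorted by relevancy score with the highest score first. This is not the code of the application. The class `ScrapeAndRank`, the `Result` fields and the `scrape` function are hypothetical names introduced for illustration, and the network request is replaced by a stand-in that sleeps, so that the example runs without a scraping library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative sketch (not the thesis code): scrape pages in parallel, then rank them. */
final class ScrapeAndRank {

    /** One scraped result; the field names are hypothetical. */
    static final class Result {
        final String url;
        final double relevancy;

        Result(String url, double relevancy) {
            this.url = url;
            this.relevancy = relevancy;
        }
    }

    /** Runs scrape once per URL on a fixed-size pool and keeps the input order. */
    static List<Result> scrapeAll(List<String> urls, Function<String, Result> scrape, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Result>> pending = new ArrayList<>();
            for (String url : urls) {
                // Every request gets its own task, so the network waits overlap.
                Callable<Result> task = () -> scrape.apply(url);
                pending.add(pool.submit(task));
            }
            List<Result> results = new ArrayList<>();
            for (Future<Result> future : pending) {
                results.add(future.get()); // blocks until this page is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scraping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scrape task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Returns a copy of the results sorted by relevancy, highest score first. */
    static List<Result> rank(List<Result> results) {
        List<Result> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble((Result r) -> r.relevancy).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Stand-in for an HTTP request plus scoring: sleeps, then scores by URL length.
        Function<String, Result> fakeScrape = url -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return new Result(url, url.length());
        };
        List<String> urls = Arrays.asList(
                "https://example.com/a", "https://example.com/bbb", "https://example.com/cc");
        long start = System.nanoTime();
        List<Result> ranked = rank(scrapeAll(urls, fakeScrape, urls.size()));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        for (Result r : ranked) {
            System.out.println(r.relevancy + "  " + r.url);
        }
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

Because each simulated request only waits, running the requests on separate threads lets the waits overlap instead of adding up, which is the effect described above for HTTP requests. In a real application the stand-in would be replaced by a call to a scraping library and by whatever relevancy measure is used.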

7.2.1 Design Decisions

Designing and implementing the application came with many tough decisions. While developing the application, compromises were made between performance and size, and some issues arose for other reasons.

Caching Summaries

A decision was made not to cache the summaries. Caching the summaries would have reduced the time it takes to search for the same term again from a couple of seconds to less than a second. Still, caching was left out because of possible copyright issues with storing website data on our server. Even though the content that would have been saved consists of summaries, we decided not to take any risks.
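
To make the trade-off concrete, the following is a minimal sketch of what such a cache could have looked like. It was not part of the application, and it keeps entries in memory rather than on a web server. The class `SummaryCache`, its capacity and the `slowSearch` function are hypothetical and only illustrate why a repeated search term can be answered without scraping again.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical in-memory LRU cache mapping a search term to its summaries. */
final class SummaryCache {
    private final Map<String, List<String>> entries;
    private int misses = 0;

    SummaryCache(final int capacity) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > capacity; // evict once the capacity is exceeded
            }
        };
    }

    /** Returns cached summaries, or runs the slow search once and stores its result. */
    synchronized List<String> get(String term, Function<String, List<String>> slowSearch) {
        List<String> cached = entries.get(term);
        if (cached == null) {
            misses++;
            cached = slowSearch.apply(term);
            entries.put(term, cached);
        }
        return cached;
    }

    /** Number of lookups that had to run the slow search. */
    synchronized int misses() {
        return misses;
    }

    public static void main(String[] args) {
        SummaryCache cache = new SummaryCache(100);
        // Stand-in for scraping and summarizing the result pages.
        Function<String, List<String>> slowSearch =
                term -> Arrays.asList("summary for " + term);
        cache.get("android threads", slowSearch); // miss: runs the search
        cache.get("android threads", slowSearch); // hit: served from memory
        System.out.println("searches actually run: " + cache.misses());
    }
}
```

The sketch also shows where the copyright concern comes from: the summaries derived from third-party pages stay stored after the first request.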

Choice of Platform

There were many reasons for developing this application on the mobile platform. The problems in the problem statement are amplified on mobile devices compared to desktop devices, due to the shortcomings of mobile technology, such as limited battery and network data. Mobile devices account for more than 50% of searches, and that share is increasing. It therefore made sense to us to target a platform where the benefits would be seen most clearly.


7.3 Ethical Aspects

During the literature study, certain issues regarding unethical behaviour of web scraping software were discovered. This section covers the discussion about these ethical topics.

7.3.1 Lost Clicks and Ad Revenue

As hosting websites can be very expensive, the use of ads on websites is widespread, and many websites rely on ad revenue as their main source of income. When a website wants to gather statistics on how many users browse its pages, it usually counts the number of clicks made by visitors. When a web scraper gathers the data directly from a website, the website does not get any user information or clicks. The web scraper also bypasses any ads presented on the website, as there is no browser to run scripts or show the ads. Websites therefore gain less, if any, revenue from being scraped compared to being visited by a user.

While the website gets no revenue from the web scraper, it still has to provide the scraper with files, which puts a load on the website's server. This creates the ethical issue of not giving anything back to the creator of the content. If all search engines tried to present the data of a search result better than the actual website the data is hosted on, there would be less incentive for creators to publish their data on ad-driven free websites. A consequence of this could be a decrease in overall free content on the internet and a growth in content behind paywalls. To try to help content creators, we make sure to always link to the source page when presenting a search result.

7.3.2 Information and Copyright Issues

There have been many legal cases in which companies have sued or been sued over the use of web scrapers. Web scrapers are legal in most countries but come with rules that have to be followed in others. Web scraping is still in a legal grey area, because the technology is hard to interpret when it comes to jurisdiction. The information gathered by web scrapers can be subject to copyright law depending on how it is used. When the data is presented in a transformative manner, such as in our application, there are no grounds for copyright infringement, but the practice is still considered unethical.

7.3.3 Anti Web Scraping Industry

Due to the legal grey area in which web scraping lies, a whole industry has been created just to combat web scraping. There are many different ways a website can try to protect itself from web scrapers, but they require the ability to differentiate between a program and a human. This has proven to be a difficult challenge and is one of the reasons large companies still turn to legal action.


7.4 Sustainability

This section covers the possible effects the thesis could have on sustainability if the application is used or the ideas applied.

7.4.1 Effect on Environment

The positive effect our application would have on the environment if people started using it might seem negligible at first, but it could potentially reduce power and data consumption by a large amount. The samples used in our results reduced the data usage by more than 50% compared to visiting a website. In an ideal scenario, this would reduce both the battery usage of mobile devices and the workload of servers. To have any real effect, a large part of the population would have to switch from satisfying their information needs through interactive and heavy websites to the rather lightweight and minimal application. Overall, the application does not have any impact on the environment in its current form, but the ideas could be applied by larger companies to make an impact on energy usage.

7.4.2 Economic Sustainability

The economic sustainability of websites could be harmed by applications such as ours that use resources without giving anything back. The application itself was not designed to be monetized, so it could not scale to serve many users without costing a lot to run.


Chapter 8

Conclusions

This chapter concludes what was achieved throughout the thesis and what future implications the results could hold.

8.1 Summary

This study set out to investigate how problems related to information gathering on mobile devices could be solved using technologies such as web scraping. Based on the results of our literature study, issues that occur when searching for information on mobile devices were identified. These issues include long loading times, small and difficult-to-use interfaces, large website sizes, and the difficulty of presenting data to users in a mobile-friendly way.

With these issues in mind, an application was designed to improve the information gathering process on mobile devices. The application was implemented using web scrapers to gather the information and a rethought way of presenting it. The results gathered from the application were presented and analyzed. The user feedback indicated that the application did achieve its goal of quickly presenting relevant information, but that it was inconsistent and would require some work before it could replace regular search engines.

The different results of the thesis were discussed, and the authors came to the conclusion that the technical aspects of the application could be improved to a point where it could potentially be used by a specific target group that has a need for raw text gathering. While the technical side had potential, the application was hindered by the unethical aspects of web scraping.

8.2 Future Research

While this thesis gives an introduction to the implications of using web scraping on mobile devices and web scraping in general, there is much more to investigate. Firstly, the ethical aspects of web scraping are only touched upon in this thesis. This is a huge discussion topic in the IT industry. Further research is needed if there is ever going to be an agreement on what is allowed and what is not.


If legal grounds cannot be decided because of too many differing opinions, an official list of guidelines on how to use web scrapers should be created.

Secondly, the presentation of the data could be improved. With a team of designers working together with the developers, the presentation of the search results and their summaries has the potential to be very good and easy to read. If implemented well, this idea could in the future compete with traditional methods of data gathering.

Lastly, the storage of the data has to be expanded and improved if the application is to be made available to and used by a large number of people. With a large server architecture that can handle many concurrent users, the issues of storing and managing data would be minimal, as the client-to-server interaction is small.

If all these issues were solved, there could be potential for the app to work as a research tool. It would work by collecting and condensing data on more scientific topics and presenting it to the user as, for example, a summarized report.


Bibliography

[1] Growth of data online [updated June 19, 2017: cited June 19, 2017]http://www.cisco.com/c/en/us/solutions/collateral/service-

provider/visual-networking-index-vni/vni-hyperconnectivity-

wp.html

[2] Search Engine IndexingAvAmy N. Langville,Carl D. Meyer (2011) Google’s PageRank and Beyondpp.15-20.

[3] A.Herrouz, C.Khentout, M.Djoudi ”Overview of Web Content Mining Tools”(2013),

[4] Blocking Processes [cited June 19, 2017]http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-intro.pdf

[5] Theoretical Speedup from Parallelizing Computations [updated August 28, 2000:cited June 19, 2017]http://www.phy.duke.edu/~rgb/brahma/brahma_old/als/als/node3.html

[6] CPU Threads [cited August 24, 2013 : cited June 19, 2017]https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/4_

Threads.html

[7] MultiThreading in Android [updated May 17, 2017 : cited June 19, 2017]https://developer.android.com/reference/java/util/concurrent/

ThreadPoolExecutor.html

[8] AsyncTasks in Android [updated May 17, 2017 : cited June 19, 2017]https://developer.android.com/reference/android/os/AsyncTask.html

[9] The SQL language [updated April 21,2017 : cited June 19, 2017]https://docs.microsoft.com/en-us/sql/odbc/reference/structured-

query-language-sql

[10] PHP Scripting Language [updated June 19, 2017; cited June 19, 2017]http://php.net/manual/en/intro-whatis.php

[11] Color Theory Explanation [cited June 19, 2017]https://colorysemiotica.files.wordpress.com/2015/04/harris1770.pdf

[12] Color Wheel Image [updated May 21,2017 : cited June 19, 2017]https://commons.wikimedia.org/wiki/File:RGV_color_wheel_1908.png

63

Page 64: Alternative Information Gathering on Mobile Deviceskth.diva-portal.org › smash › get › diva2:1119386 › FULLTEXT01.pdfThe information gathered by the web crawler is then used

[13] Analogous Colors, Joen Wolfrom (1992) The Magical Effects Of Colorpp.31-32

[14] Principles of Interaction Design [updated June 19, 2017: cited June 19, 2017]http://asktog.com/atc/principles-of-interaction-design/

[15] Inderjeet MANI ”Summarization Evaluation: An Overview” (2001)

[16] Generic Summarization [cited June 19, 2017]http://www4.ncsu.edu/~slrace/genericsummarizationtalk.pdf

[17] Yihong Gong, Xin Liu ”Generic Text Summarization Using Relevance Measureand Latent Semantic Analysis” (2001)

[18] Hourglass Writing Structure [cited June 19, 2017]http://writingcenter.uconn.edu/wp-content/uploads/sites/593/2014/

06/The_Hourglass_Approach.pdf

[19] HTML [updated May 30 2017 : cited June 19, 2017]https://www.w3.org/standards/webdesign/htmlcss

[20] CSS standard [updated April 5 2017 : cited June 19, 2017]https://www.w3.org/standards/webdesign/htmlcss#whatcss

[21] SenseBot Website [accessed June 19, 2017]http://sensebot.com/

[22] SMMRY Website [accessed June 19, 2017]http://smmry.com/

[23] Qualitative and Quantitative Methods [updated April 4,2017 : cited June 19,2017]http://www.lib.vt.edu/research/methodology/quantitative-

qualitative.html

[24] Nielsen Heuristics [updated June 19, 2017: cited June 19, 2017]https://www.nngroup.com/articles/ten-usability-heuristics/

[25] Summative Evaluation [updated April 4,2017 : cited June 19, 2017]https://cyfar.org/different-types-evaluation#Summative

[26] Web Page Size [updated May 31,2017 : cited June 19, 2017]https://www.keycdn.com/support/the-growth-of-web-page-size/

[27] Global 4G Coverage [updated April 4,2017 : cited June 19, 2017]https://opensignal.com/reports/2016/11/state-of-lte

[28] Mobile Shopping Behaviour [updated April 4,2017 : cited June 19, 2017]https://www.thinkwithgoogle.com/articles/mobile-shoppers-

consumer-decision-journey.html

[29] Website size [updated April 4,2017 : cited June 19, 2017]http://httparchive.org/interesting.php?a=All&l=Apr%2015%202017

64

Page 65: Alternative Information Gathering on Mobile Deviceskth.diva-portal.org › smash › get › diva2:1119386 › FULLTEXT01.pdfThe information gathered by the web crawler is then used

[30] BeautifulSoup Python Library [accessed June 19, 2017]https://www.crummy.com/software/BeautifulSoup/

[31] PhantomJS JavaScript Library [accessed June 19, 2017]http://phantomjs.org/

[32] jsoup Java Library [accessed June 19, 2017]https://jsoup.org/

[33] Mobile Data [updated April 4,2017 : cited June 19, 2017]https://www.thinkwithgoogle.com/nordics/research-study/the-need-

for-mobile-speed-how-mobile-latency-impacts-publisher-revenue/

[34] Search Engine Algorithms [updated February 5,2007 : cited June 19, 2017]http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm

65

Page 66: Alternative Information Gathering on Mobile Deviceskth.diva-portal.org › smash › get › diva2:1119386 › FULLTEXT01.pdfThe information gathered by the web crawler is then used

TRITA TRITA-ICT-EX-2017:61

www.kth.se