
Exploring Spatio-Temporal Patterns of Volunteered Geographic Information

A Case Study on Flickr Data of Sweden

Yufan Miao

August 20, 2013

Bachelor Thesis, 15hp
Geomatics

Supervisor: Prof. Bin Jiang
Examiner: Dr. Julia Ahlen


Abstract

This thesis aims to seek interesting patterns in massive amounts of Flickr data on Sweden with newly proposed clustering strategies. The aim can be divided into three objectives. The first is to acquire a large amount of timestamped geolocation data from the Flickr servers. The second is to develop effective and efficient methods to process the data. More specifically, the methods to be developed are twofold: a preprocessing method to solve the "Big Data" issue encountered in the study, and a new clustering method to extract spatio-temporal patterns from the data. The third is to analyze the extracted patterns with scaling analysis techniques in order to interpret the human social activities underlying the Flickr data within the urban environment of Sweden.

During the study, the three objectives were achieved sequentially. The data employed for this study consisted of vector points downloaded through the Flickr Application Programming Interface (API). After data acquisition, the raw data was preprocessed. The whole dataset was first separated by year based on the temporal information. The data of each year was then accumulated with that of its former year(s) so that the evolving process could be explored. After that, the large datasets were split into small pieces, and each piece was clipped, georeferenced, and rectified respectively. The pieces were then merged together for clustering. With respect to clustering, the strategy was developed based on the Delaunay Triangulation (DT) and the head/tail break rule. The generated clusters were then analyzed with scaling analysis techniques, and spatio-temporal patterns were interpreted from the analysis results. It has been found that the spatial pattern of human social activities in the urban environment of Sweden generally follows a power-law distribution and that the cities defined by human social activities evolve as time goes by.

To conclude, the contributions of this research are threefold and fulfill the objectives of this study, respectively. Firstly, a large amount of Flickr data is acquired and collated as a contribution to other academic research related to Flickr. Secondly, a clustering strategy based on the DT and the head/tail break rule is proposed for spatio-temporal pattern seeking. Thirdly, the evolution of the cities in Sweden in terms of human activities is detected from the perspective of scaling. Future work is expected in two major aspects, namely data and data processing. For the data aspect, the downloaded Flickr data is expected to be employed by other studies, especially those closely related to human social activities within the urban environment. For the processing aspect, new algorithms are expected to either accelerate the processing or better fit machines with supercomputing capacities.

Keywords: Big Data, VGI, Flickr, Delaunay Triangulation, Power Law, Scaling Analysis, Spatio-Temporal Pattern


Acknowledgements

First and foremost, I would like to acknowledge my supervisor Bin Jiang. It has been my great honor to have him as my supervisor. From this two-year stimulating working experience with him, I was not only inspired and motivated by his contagious enthusiasm for research but also absorbed plenty of knowledge and, more importantly, the way of self-learning and critical thinking. I am very thankful for the example he has set for me as a scientist, guiding me to the right way of doing research. I am also very thankful for his suggestions and concern about my future career.

Secondly, I would like to thank the University of Gavle for offering me this wonderful learning experience in Sweden. I would like to express my special gratitude to Peter Fawcett, Anders Brandt, Pia Ollert-Hallqvist, Ross Nelson, Eva Sahlin, Julia Ahlen, Nancy Joy Lim, and Xiansong Huang for their concern and help during my study. I learnt and benefited a lot from their courses and suggestions, which made me who I am today.

Thirdly, I would like to express my appreciation to all my friends who kept me company through my highs and lows. I would like to specially thank Tao Jia, who gave me many valuable suggestions about my thesis, Mian Wang, who provided me with hardware support, and Tao Peng, who gave me suggestions about coding. I would also like to thank Yangzhuoran Liu, Zihao Fan, Qiao Zhou, Zhongtao Wang, Di Lin and Di Zhao, who willingly encouraged and helped me while I was working on my thesis.

Last but not least, I am highly indebted to my parents, who have remained my greatest supporters at all times. It was they who stimulated my interest in science and supported me in all my pursuits. It was they who provided me with timely and unselfish help whenever I needed it. Without them, I could hardly have made it through the whole journey. Thank you.

Yufan Miao
Gavle, Sweden
August 20, 2013


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Research Questions and Objectives
  1.3 Thesis Structure

2 Literature Review
  2.1 The Incoming Big Data Era
  2.2 Volunteered Geographic Information (VGI) and Flickr Data
  2.3 Natural Cities and Delaunay Triangulation
  2.4 Scaling Analysis in Urban System

3 Material and Methods
  3.1 Description of Flickr Timestamped Geolocation Data
  3.2 The Data Preprocessing Strategy
  3.3 The Clustering Strategies Based on the Delaunay Triangulation
  3.4 Power Law Distribution Identification Techniques

4 Results and Discussion
  4.1 A Comparative Analysis of Data and Methods
  4.2 Scaling Analysis and Interpretation of the Flickr Patterns
  4.3 Limitations and Problems of the Data and the Processing Strategies

5 Conclusions
  5.1 Contributions of This Thesis
  5.2 Future Work of This Thesis

References

A Power Law Test Results for Each Year Data


List of Figures

Figure 1.1 – Relationship between three focuses of the thesis
Figure 1.2 – The thesis structure
Figure 2.1 – The first definition of big data
Figure 2.2 – The second definition of big data
Figure 2.3 – Three characteristics of VGI
Figure 2.4 – Two different methods to define natural cities
Figure 2.5 – Comparison of triangles with circumcircle criterion property and without this property
Figure 2.6 – The rank-size distribution
Figure 3.1 – The flow chart to download Flickr data
Figure 3.2 – The downloaded Flickr Data (part)
Figure 3.3 – The preprocessing flow
Figure 3.4 – The clustering strategy based on DT
Figure 3.5 – The relationship between α and xmin
Figure 4.1 – Comparison of natural cities extracted from Flickr Data and OSM data
Figure 4.2 – Comparison of natural cities generated by DT and CCA
Figure 4.3 – The impacts of the resolution problem for the DT and CCA methods
Figure 4.4 – An overview of the scaling analysis results
Figure A.1 – The power-law test results


List of Tables

Table 3.1 – A summary of the downloaded data
Table 3.2 – The cleaned data
Table 4.1 – The power-law test results


1. Introduction

We are now living in an age of information explosion in which data-intensive computing and data-driven discoveries are flourishing. With quintillions of bytes of data being produced every day, the research paradigm is gradually shifting from hypothesis-driven to data-driven (Howe, 2009). Although the amount of data is no longer a big issue, the lack of efficient strategies to deal with the invading "Big Data" has become the major barrier to the advancement of human knowledge. Therefore, effective and efficient data processing and analysis strategies have become the key basis of scientific discoveries and commercial competition, and new methods to deal with this "Big Data" issue are urgently needed.

Like many other fields, Geographic Information Science (GIScience) is both benefiting from the increasing availability of geospatial data and suffering from problems in handling it. Considering the fact that data is the most important component of a Geographic Information System (GIS), the benefits and drawbacks brought by "Big Data" are amplified for GIScience in comparison with many other fields. On the one hand, problems with the data itself still exist. Even though the amount of data is sufficient, geospatial data acquired through traditional channels, say, from commercial companies or national surveying agencies, is expensive to purchase and restricted in use. This may hinder researchers with insufficient funds from conducting potentially notable research. On the other hand, problems with the processing and analysis methods are thorny. Since most GIS products are mainly designed for desktop processing, their computing capacities are limited and cannot deal with large datasets. By contrast, cloud computing and supercomputers can well solve the problem, but the related technologies are still immature and can hardly be reached by average people at low cost. Moreover, even when patterns are finally generated, wrong interpretation with wrong methods may kill possible novel discoveries. Therefore, it is also of vital importance to choose the right analysis method for the interpretation of data patterns.

For the cost and restriction problems of data, the emergence of Volunteered Geographic Information (VGI) has brought some dawn of hope to the academic circle as a solution (Crandall et al., 2009). Copious volunteered geospatial data is now available at low price and is widely spread through the internet. Moreover, one special type of VGI data, collected from social media networks, is also accessible through technologies such as the Application Programming Interface (API). This special kind of data is a vivid record of the contributors' social behaviors within a geographic space. Therefore, it is of huge research value to unveil the patterns behind it.

Among all social media networks, Flickr is a leading photo sharing website operated by Yahoo!. It is a rich photo reservoir with massive valuable metadata tagged on each photo. Up to 2011, Flickr already had around 80 million registered users and 4.5 million daily photo uploads (Yahoo!Inc., 2011). Moreover, in 2013, it announced a new policy providing one free terabyte of space for each user to store photos, which is attracting more users with more photos. Flickr has two features that are crucial for this study. One is geotagging, which enables users to add location information as metadata to their photos. The other is the timestamp, which contains temporal information. Both can be queried and downloaded through the Flickr API and make Flickr an important VGI source for GIScience researchers. Therefore, Flickr data is downloaded and employed for the study of this thesis.

For the data processing strategy problems, new efforts and research strategies to utilize VGI data have recently been emerging in the field of Geographic Information Science (Goodchild and Hill, 2008). Accordingly, many interesting discoveries have been made in recent years with GIS as an effectual tool to unveil unexpected patterns from massive VGI data (Frankel and Reid, 2008). Among all the methods, one novel concept called the natural city, brought up by Jiang and Jia (2011), is a revolutionary way to define cities and their boundaries. It defines a city by clustering key city elements such as street junctions or street blocks in a bottom-up fashion, in contrast to the traditional administrative definition of a city in a top-down fashion. This method is believed to be more objective because the definition relies heavily on the data rather than on decision makers. For social media data, the bottom-up fashion is its nature as well as an advantage, giving it the ability to reflect real individual behavior when it is adopted to study human social behaviors within an urban environment. Therefore, the bottom-up clustering strategy behind the concept of the natural city is borrowed for the pattern seeking of the Flickr photos in this study. However, the existing natural city clustering strategies are not suitable for the processing of social media data. Therefore, a new one based on the Delaunay Triangulation (DT) will be proposed in this thesis.

For the interpretation method problems, scaling analysis will be borrowed from statistical physics for the study of social media patterns within the urban environment. This analysis method is usually applied to complex systems whose elements interact with their environment and which exhibit, as a unity, properties that the individual elements do not have. In this sense, humans can be regarded as the elements and the city as a complex system. Therefore, scaling analysis is very suitable for interpreting the patterns into knowledge and is employed in this thesis.

Consequently, inspired by the novel research conducted by pioneers and tempted by the ample data stored on the Flickr servers, this research is motivated to explore the spatio-temporal patterns of massive geospatial data acquired through the Flickr API. The data on Sweden will then be processed with clustering strategies developed on the basis of the Delaunay Triangulation. Finally, the patterns will be analyzed and interpreted with scaling analysis. With respect to the above, the research problem and questions are declared as follows.

1.1 Problem Statement

This study aims to unveil the spatio-temporal patterns hidden in massive Flickr data with new methods, stressing the importance of volunteered geospatial data in urban study and triggering more concern for it among researchers. From the goal stated above, it is obvious that there are three focuses in this research, namely data, methods and patterns. The relationship between the three is de facto the process of geospatial knowledge discovery. In this process, the Flickr data is the input, the methods are the black box that processes the data, and the patterns are the results to be analyzed and interpreted, as can be seen in figure 1.1. Before the research aim is further divided into research questions and objectives, one important question should be answered in advance: why is this study important? The answer to this question is threefold, concerning data, methods, and patterns, respectively.

Figure 1.1: Relationship between three focuses of the thesis

For the data aspect, there are three reasons. Firstly, Flickr data has unique characteristics compared with other data sources, which makes it very suitable for the study of spatio-temporal patterns in the urban environment. Unlike traditional data sources such as census data, Flickr data truly reflects the individual behavior of citizens in the urban environment. Moreover, unlike some VGI data sources such as OpenStreetMap (OSM), Flickr has rich temporal data and can reflect the dynamic behavior of citizens. Secondly, Flickr is a very rich data source which contains a large amount of geospatial data. Considering its 80 million registered users and 4.5 million daily uploads, even if only 10 percent of the users tag their photos with location information, there are still roughly 164 million photos available per year (4.5 million × 365 days × 10% ≈ 164 million). Thirdly, it is accessible by everyone and free to download through the API. There is no strict restriction on registering a Flickr account and requesting an API key to use the Flickr API, and it is usually free for users to download data through the API as long as it is for non-commercial purposes. These three reasons answer why Flickr data is adopted in this study.

For the method aspect, choosing an appropriate method requires taking into consideration both the data to be processed and the patterns to be generated. Since the data is acquired in a bottom-up fashion and has the advantage of reflecting real individual behavior, the patterns should be extracted with a bottom-up method and should be as objective as possible. To fulfill these two requirements, the approach of defining natural cities is the ideal one and is adopted for this study. However, since the existing clustering strategies for defining natural cities have their own drawbacks when applied to social media data, a new clustering strategy should be proposed following the philosophy of the existing methods. Details about the existing methods and the new method will be introduced in chapter 2 and chapter 3, respectively. For now, it is worth knowing that the new method is to work as a supplement to the existing methods and is designed to better fit the Flickr data.

For the pattern aspect, the extracted patterns have three aspects of significance. The first is that the patterns can be used to test whether it is meaningful to explore VGI data, especially social media data, in urban study. The second is that the patterns can be used to check whether the proposed method is an improvement on the existing methods for retrieving natural cities. The third is that some regularities of human social behaviors within the urban environment can be interpreted from the patterns to enrich human knowledge.

By answering from the perspectives of the three focuses of this thesis, the question of why this study is important has now been answered. Based on the three focuses of the thesis aim, three research questions are asked and three objectives are set up accordingly. The next section will introduce the research questions and objectives in detail.

1.2 Research Questions and Objectives

For this study, there are three major research questions to ask:
1) How can a massive amount of Flickr timestamped geolocation data be acquired, collated, and processed, and how difficult is it?
2) How should the new clustering strategies be proposed for the exploration of the spatio-temporal patterns of Flickr data?
3) How should the patterns be interpreted so that useful knowledge can be distilled from them?

According to each of these questions, three objectives are set:
1) The Flickr timestamped geolocation data should be downloaded through the Flickr API and preprocessed so that it can be further explored with available desktop GIS software;
2) A new clustering strategy based on DT and the head/tail break should be proposed as a complement to the existing clustering strategies for defining natural cities;
3) The extracted patterns should be analyzed and interpreted with both visualization and statistical techniques.


1.3 Thesis Structure

This thesis is composed of five chapters. The first chapter presents a general introduction to this study, clarifies the aim, research questions and objectives of the study, and outlines the structure of the thesis. The second chapter is a literature review in which the major concepts appearing in this thesis are introduced in detail. In the third chapter, the data and methods employed in the study are described. In the fourth chapter, the extracted spatio-temporal patterns are presented and interpreted; meanwhile, limitations and problems concerning the data and the methods are described and analyzed. The last chapter summarizes the keynotes of this thesis: the whole thesis is first summarized, then the contributions of this thesis are listed and some future work is suggested. Details about the following chapters are introduced below. An overview can be seen in figure 1.2.

Firstly, terms and concepts are explained in chapter 2, which can be divided into four sections. The first section aims to provide a research context to the readers with some basic concepts about "Big Data", introducing definitions and some strategies to deal with it. The second section is mainly about VGI and the Flickr timestamped geolocation data, to help readers better understand the data used in the thesis. The third section is about natural cities and the Delaunay Triangulation, to provide some theoretical background for the methods to be proposed. The fourth section provides some warm-up concepts about scaling analysis in the urban environment.

Secondly, the data and methods employed in this study are elucidated in chapter 3. The whole chapter can be divided into four sections. In the first section, the acquisition method of the Flickr timestamped geolocation data is described. In the second section, the data preprocessing process is explained as an avenue to deal with the "Big Data" problem. In the third section, a new clustering strategy based on the Delaunay Triangulation is proposed. In the last section, one kind of statistical analysis called scaling analysis and some fitting test techniques are introduced.

Thirdly, the interpretation of the extracted patterns is presented in chapter 4, which is divided into three sections. In the first section, the generated patterns are presented and compared with patterns generated with other VGI data and other natural city methods. In the second section, the scaling analysis results are displayed and discussed to interpret the generated patterns. In the third section, limitations and problems of the study are discussed.

Finally, the last chapter is divided into two sections. In the first section, the whole thesis is summarized. In the second section, the contributions of this thesis are outlined, the research questions and objectives are confirmed to have been answered and fulfilled, and some future work is suggested.


Figure 1.2: The thesis structure


2. Literature Review

The concept of "Big Data" is nowadays getting big with the improvement of data acquisition techniques and the widespread internet. Bathed in this "Big Data" atmosphere, Volunteered Geographic Information (VGI) has attracted great attention from the academic circle ever since its emergence (Goodchild, 2007). Through its development in recent years, VGI has evolved into an important research area with special interests in enabling the motivation of the general public to contribute open source data, managing the burgeoning volumes of data, and mining the rich patterns underlying these data. Among all VGI data sources, Flickr contains very abundant temporal and geolocation information tagged on its billions of photos, which is closely related to human social activities within the geographic space. Therefore, it is of great research value to explore the spatio-temporal patterns hidden in the metadata of Flickr photos. To achieve this, the concept of natural cities is adopted. However, the existing methods for defining natural cities have their own drawbacks when applied to social media data, so a new method is proposed based on the Delaunay Triangulation (DT). To interpret the generated patterns, scaling analysis is adopted as the tool, considering that a city can be regarded as a complex system.

This chapter aims to provide some general knowledge, through a literature review, about the major concepts that will appear in this thesis, namely "Big Data", VGI, Flickr, natural cities, DT and scaling analysis. The whole chapter is divided into four sections accordingly. The first section, about "Big Data", aims to provide some research context to the study. The second section, about VGI and Flickr, is targeted at providing knowledge about the research object of this thesis. The third section provides some theoretical background to the concept of natural cities and the related clustering approaches. Moreover, DT will be briefly introduced to justify why it is employed to develop the new clustering strategy. The fourth section aims to provide some general knowledge about scaling analysis in urban systems so that its importance to this study can be explained.

2.1 The Incoming Big Data Era

We are nowadays living in the "Big Data" era, flooded by the overwhelming data acquired from an extensive range of sources such as photograph archives, sensor networks, social media traffic, biological records and spatial databases. Every day, about 2.5 quintillion bytes of data are created, and of all the data in existence, about 90% was created during the last two years (Lafreniere, 2011). This unprecedented phenomenon, together with its huge impact on almost every field, not only quickly propagates this "big" concept all over the world but also leads to a global trend of its further exploitation.

The origin of the term "Big Data" is uncertain: some people reckon that the term was coined by Roger Magoulas in 2005 (Collaborative Consulting, 2012), whereas others argue that it originated from the term "eScience" brought up by John Taylor in 2000 (Hey et al., 2009). However, just as was asserted by Jon Kleinberg, a famous computer scientist from Cornell University (Lohr, 2012), "the term itself is vague, but it is getting at something that is real"; perhaps a more well-accepted explanation is that the term emerged in practice gradually and came into coinage naturally.

In a similar vein, the definitions of "Big Data" are also multifarious. One popular definition is the "3V's", meaning "Big Volume", which measures the amount of data, "Big Velocity", which measures how fast data is being produced and how fast it has to be processed to meet specific needs, and "Big Variety", which measures the degree of the data's heterogeneity (Gartner Inc., 2011). In this school, IBM adds "Big Veracity" to achieve a four-dimensional definition (Lafreniere, 2011). By veracity, the meaning is twofold. In one aspect, it means that "Big Data" can be a better way to generate trustworthy information for decision making than the overpowered tiny samples commonly employed in statistics. In the other aspect, however, it means that it is still a huge issue to process the growing volumes and varieties with reliable strategies. An overview of this definition can be seen in figure 2.1.

Figure 2.1: The first definition of big data

Another popular definition is proposed by MIKE2.0 (2012), an open source standard for information management, as "Size", "Degree of Complexity", and "Use of Longitudinal Information". By "Size", it means the number of independent data sources rather than the amount of data itself. By "Degree of Complexity", the meaning is twofold: one aspect concerns the data sources, which are inconsistent and unpredictable, while the other concerns the data itself, which is interrelated and whose individual items are hard to delete. By "Use of Longitudinal Information", it means that the analysis of temporal changes of information of the same type and on the same subject should be included. In addition, two confusing facts should be clarified according to this definition: 1) big data can be of relatively small size as long as it is complex enough; 2) a large dataset may not be big if the data is too simple. This definition is quite similar to the former one, where "Size" is equivalent to "Volume", "Degree of Complexity" to "Variety" and "Use of Longitudinal Information" to "Velocity". However, in this definition, "big" refers more to complexity than to volume, which is the major difference from the first definition. An overview of this definition can be seen in figure 2.2.

When actually dealing with "Big Data" in practice, its analysis pipeline mainly includes five distinct phases (Agrawal et al., 2012): acquisition and recording; extraction, cleaning and annotation; integration, aggregation and representation; analysis and modeling; and interpretation. Within this analysis process, some remarkable available techniques can be mentioned here to give an overview, although a detailed introduction is far beyond the scope of this thesis and is thus excluded from its content. These techniques are as follows (Manyika et al., 2011): 1) A/B testing, 2) association rule learning, 3) classification, 4) cluster analysis, 5) crowdsourcing, 6) data fusion and data integration, 7) data mining, 8) ensemble learning, 9) genetic algorithms, 10) machine learning, 11) natural language processing (NLP), 12) neural networks, 13) network analysis, 14) optimization, 15) pattern recognition, 16) predictive modeling, 17) regression, 18) sentiment analysis, 19) signal processing, 20) spatial analysis, 21) statistics, 22) supervised learning, 23) simulation, 24) time series analysis, 25) unsupervised learning, and 26) visualization.

Figure 2.2: The second definition of big data

It should be noticed that techniques such as classification, cluster analysis, data fusion and data integration, spatial analysis, statistics, supervised learning, unsupervised learning and visualization are common measures employed in a Geographical Information System (GIS). Obviously, the study of GIScience complies with the major trend of the "Big Data" era. Considering the fact that data is the most important component of a GIS, the incoming big data era has undoubtedly brought new opportunities and challenges to the development of Geographic Information Science (GIScience) and sets up a research context for this study. A massive amount of VGI data will be explored in this thesis, and a brief introduction to VGI and its subset, the Flickr geolocation data, follows in the next section.

2.2 Volunteered Geographic Information (VGI) and Flickr Data

With the improvement of GPS accuracy for civil use and the ubiquitous GPS modules bundled with mobile handsets, people can nowadays share and retrieve geographic information through the internet at low cost and high speed. As a result, VGI emerges as a newborn open-source rival challenging the authority of the traditional commercial and governmental giants. The major difference between VGI and traditional geospatial data is the data acquisition method: VGI is of a grassroots style, which means that the data comes into being in a bottom-up way from a large number of spontaneous contributions of amateur volunteers, whereas traditional data is of a centralized style, which means that the data is collected in a top-down way from a limited number of premeditated contributions of professional surveyors. The emergence of VGI not only leads to new efforts in GIScience research (Goodchild, 2008), but also challenges the "knowledge politics" of spatial data infrastructures (Elwood, 2010; Hardy et al., 2012).

Although the term VGI was brought up by Goodchild (2007) several years ago, there is still no clear definition of it due to its inborn complexity. As such, different researchers define VGI from different perspectives: Goodchild (2007) regards it, from a user motivation perspective, as "a special case of the more general Web phenomenon of user-generated content"; Kuhn (2007) describes it as the "scaling up of closed loops" from a control theory perspective and as "informationable action" from an information theory perspective; and Sui (2008) argues from a web 2.0 perspective that the "wikification of GIS is perhaps one of the most exciting and indeed revolutionary developments since the invention of GIS technology in the early 1960s". However, the topic still appears relatively vague. Therefore, to achieve a thorough understanding of the phenomenon, the enabling technologies, the major characteristics of VGI, and some related concepts should also be summarized and introduced in some detail.

Speaking of the enabling technologies, Goodchild (2007) listed six of them that have had the most significant impact on the evolution of VGI. They are web 2.0, geotags, georeferencing, GPS, graphics, and broadband communication. Some of them, such as web 2.0 and geotags, are newborn technologies of recent years, whereas the others represent great breakthroughs of recent decades. In some sense, VGI can be viewed as the result of the growing range of interactions enabled by web-based technologies. Therefore, among these technologies, web 2.0 has been widely regarded as the cornerstone of VGI, meaning that it is the emergence of web 2.0 that makes the VGI concept possible.

Similarly, the geotag is something relatively new that has become popular in recent years. It is a standardized code recording geographic information that can be inserted into descriptive information on websites. Many popular web 2.0 websites, such as Wikipedia, Facebook, Flickr and Twitter, adopt this new technology. Meanwhile, in the GIScience world, georeferencing is nothing new to geographers but takes another form to fulfill the needs of normal users: coordinate-and-projection-based reference systems are substituted by name-based ones, considering that normal users rarely know their locations through coordinates but through names. GPS is another improvement of an existing technology, following the removal of the SA policy by the US government as well as the skyrocketing number of GPS-embedded mobile devices. In a similar vein, high-quality graphics are another great improvement for the realization of VGI, providing the prerequisites for a good visual user experience.

Last but not least, without the establishment of the broadband communication framework, VGI would be just blue-sky nonsense. It is the high-capacity connection that enables timely user-website interaction and thus the VGI reservoir. With the knowledge of the enabling technologies of VGI, some major characteristics should be explained further in order to achieve deeper insights into the concept.

The first characteristic is that VGI belongs to the subject of neogeography, which is defined by Turner (2007) as a new geography where "people use and create their own maps on their own terms and by combining elements of an existing toolset and share location information with friends and visitors, help shape context, and convey understanding through knowledge of place". Unlike traditional geographers, neogeographers no longer use dominant commercial GIS software such as ArcGIS or MapInfo to solve disputes on land area, but use an Application Programming Interface (API) to display or share geospatial data in their own style.

The second characteristic is that VGI aims to convey collective intelligence (Smith, 1994), akin to the spirit of open source (Raymond, 1998). This collective intelligence comes into existence through the collaboration or competition of individuals within a large group of people. This nature usually leads to the availability of large amounts of geospatial data.

The third characteristic is the web 2.0 technology, which is an interoperable internet environment where interaction between users and websites is realized. In this way, users can upload, download and even edit geographic information according to their own needs, which not only adds more flexibility but also enhances public participation. An overview of these three characteristics can be seen in figure 2.3. In addition to these characteristics, some related concepts influenced or brought about by the emergence of VGI should be discussed here.

The first concept is spatial data infrastructure patchworks, which was introduced in a report by the U.S. National Research Council (NRC, 1993). The meaning of this concept is that national mapping agencies should attempt to provide certain guidance and standards for groups and individuals to create maps with combined purposes and various scales according to their needs, rather than providing uniform coverage of the entire country. As a result, VGI not only fits this model well but also accelerates the whole process.

The second concept is humans as sensors. In comparison with traditional artificial sensors, the network of human sensors has three obvious advantages: 1) five natural senses, 2) intelligence, and 3) over six billion individuals. Correspondingly, VGI can be an effective use of the human sensors and their features.


Figure 2.3: Three characteristics of VGI

The third concept is citizen science, where groups of people or networks of citizens can act as observers in a certain field of science. In some sense, the need for expertise in some sophisticated fields may limit the extension of VGI. However, in other fields, such as collecting data on well-defined geographic features, VGI retains its power. The fourth concept is that the participant population could be all human beings, since VGI aims to be open to all and to collect the wisdom of people from different fields. This leads to a new bottom-up approach for the creation and dissemination of geographic information.

The last concept is the early warning of certain natural or man-made disasters through volunteered reports, which can exploit the specialty of local people and apply it to local problems. Then what kind of VGI data will be employed for this study? In this case, Flickr data will be explored as one important kind of social media data with both geographic and temporal information, which is perfect for the study of spatio-temporal patterns within an urban environment. Before any details about the Flickr data, some general information about Flickr should be introduced in advance.

Flickr is a popular photo-sharing and video-sharing website which contains a huge number of publicly accessible images taken all over the world. It was first launched by Ludicorp in 2004 and later acquired by Yahoo! in 2005. After its launch, Flickr soon caught the eyes of the general public and even revolutionized the whole photography industry by changing the history of photo sharing. To date, it has evolved into one of the best online photo management and sharing services in the world (Yahoo!Inc., 2012).

One of the most remarkable features of Flickr is its adoption of tags and sets as methods to organize images, providing convenience for users to search for photos based on topics. Because of these features, Flickr has been generally regarded as a typical example of a folksonomy, a term coined by Wal (Vander Wal, 2007), which means that it is a user-generated system of classifying and organizing online contents (images and videos in this case) into different categories by the use of metadata (tags and sets in this case).

More interestingly, one important heart-stirring feature for geographers is geotagging, which allows users to add location-based information to their photos. Therefore, considering its roughly 80 million registered users and 4.5 million daily uploaded photos (Yahoo!Inc., 2011), the huge amount of geospatial data, as well as the temporal, textual and other data, could be of great research value. More importantly, the numbers provided above are only valid up to the end of the year 2011, and with the rapid development of web technology, these numbers are still burgeoning, especially after Flickr recently announced a new policy providing each user with one terabyte of free space to store photos. Therefore, Flickr can be regarded as a dynamic database with abundant geospatial data and temporal information annotated to the existing and continuously growing number of photos.

Flickr provides an Application Programming Interface (API) for developers to develop Flickr-related applications with great flexibility. As they describe it (Yahoo!Inc., 2012), the Flickr API is how people can access valuable metadata such as tags, geolocation, and data in the exchangeable image file format (Exif). This means that the API can be used to query and download the metadata of each photo freely from the Flickr servers. Considering the flexibility and the zero cost mentioned above, the API is adopted as the avenue to acquire massive Flickr timestamped geolocation data from its servers. Details about the acquisition method will be described in the next chapter.

2.3 Natural Cities and Delaunay Triangulation

The term natural cities was coined to solve the boundary definition issue during the empirical verification of the existence of Zipf's law in the urban environment (Jiang and Jia, 2011). As its name indicates, with this approach cities are defined in a natural way. It is called a natural way because the approach complies with the origin of a city: cities are defined based on the contributions of the general public in a bottom-up way rather than on administrative boundaries imposed by the government in a top-down way. The approach of natural cities is adopted as one important foundation of the study in this thesis for its obvious advantages over other methods. The question, then, is how this approach is better than the others.

The advantage of this natural definition over the conventional ones lies mainly in the different data they use, especially the different data acquisition methods and data formats. Traditionally, census data is acquired through commercial companies or mapping agencies within the administrative boundaries of the study area. Therefore, the boundaries of these cities are the administrative boundaries, whose definitions are relatively subjective or even arbitrary. Usually, cities defined in this way do not cover the whole area people live in, because some places are excluded from the census. As a result, research conducted on such cities may be biased. Although many methods (Holmes and Lee, 2009; Rozenfeld et al., 2009) have been proposed to improve it, they still rely heavily on census data and can hardly be ultimate solutions to the biases. By contrast, the approach of natural cities is more objective because the data employed is contributed by real people and reflects the real facts of human activities in specific locations. Therefore, in this sense, the approach of natural cities is superior to the traditional methods based on census data.

Perhaps a better solution than those based on census data for detecting city boundaries is to employ remote sensing technologies. For example, Sutton (2003) proposed a method to detect city boundaries through nightlight images. However, the problem with this kind of method is the raster data format, which suffers from the Modifiable Areal Unit Problem (MAUP) (Openshaw, 1984). In comparison, the vector format of the natural city boundaries is less influenced by the MAUP. Therefore, in this sense, the approach of natural cities is superior to those based on data in raster format.

However, this natural approach itself is still evolving and is not perfect. When it was first proposed, street junctions derived from OSM were adopted as the constituents of the natural cities, because streets are believed to be where major human activities occur (Jiang and Jia, 2011). To distill cities from the data, the street junctions were agglomerated with the City Clustering Algorithm (CCA) (Rozenfeld et al., 2009), and each cluster was regarded as a city. However, at that time, the city boundary was delineated through interpolation in raster format, which suffered from the MAUP.

To improve on that, a new approach was proposed later with street blocks derived from OSM as a replacement for street junctions (Jiang and Liu, 2012). In this improved method, small blocks whose neighborhoods are also composed of small blocks are clustered as cities because of the spatial autocorrelation effect, which can be explained by the first law of geography (Tobler, 1970): "all things are related to other things, but close things are more related than distant things". Then, with the head/tail break division rule (Jiang, 2012), which says that "if the probability distribution of a dataset is heavy-tailed, then the mean value can divide the dataset into a high percentage part and a low percentage part", the clustered blocks are divided into large groups whose sizes are above the mean and small groups whose sizes are below the mean. In this sense, the large groups are actually the natural cities. This time, the city boundaries are entirely in vector format and are less influenced by the MAUP. However, this method suffers from another problem: small blocks at the edge of the cities that are supposed to be categorized as urban are categorized as rural. Fortunately, this problem does not influence the results very much.
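As a minimal sketch (the function and variable names here are illustrative, not taken from the thesis), the head/tail division quoted above can be written in a few lines of Python:

def head_tail_division(sizes):
    """One head/tail break: split a heavy-tailed list at its mean.

    Clusters whose sizes fall in the head (above the mean) would be
    kept as natural cities; the tail would be treated as rural.
    """
    mean = sum(sizes) / len(sizes)
    head = [s for s in sizes if s > mean]
    tail = [s for s in sizes if s <= mean]
    return head, tail

# A heavy-tailed toy example: the mean (16.1) separates the two large
# clusters from the many small ones.
sizes = [1, 1, 2, 2, 3, 4, 5, 8, 60, 75]
head, tail = head_tail_division(sizes)
print(head)  # [60, 75]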

As mentioned above, there are two existing methods to define natural cities, namely the point-based method and the block-based one, as can be seen in figure 2.4. Since the data to be processed here is point-based, the block-based method cannot be applied in this study. Meanwhile, the results generated by the existing point-based method have to be interpolated into raster format, which suffers more severely from the MAUP issue than the vector format, so it should be improved with new methods. Therefore, a new method to define natural cities based on the Delaunay Triangulation is proposed in this thesis.

Figure 2.4: Two different methods to define natural cities (Jia and Jiang, 2010).

The reason why DT is chosen as the foundation of the new method lies mainly in its two unique characteristics, namely the empty circumcircle property of its triangles and the nearest-neighbour connections between its vertices. The first property is depicted in figure 2.5. As can be seen from figure 2.5a, the circumcircle of each triangle has one and only one triangle in its interior. By contrast, in figure 2.5b, there are two triangles in the circumcircle of triangle T1. This property ensures that the angles in a triangle are all relatively large rather than one large and two small. In this way, triangles with extreme angles are avoided in a DT. In the real world, extreme cases are rare and normal cases are frequent; therefore, very long, thin triangles are less likely to be present, and the results generated by DT better reflect the real case.

Figure 2.5: Comparison of triangles with the empty circumcircle criterion property (a) and without this property (b)
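For reference, the empty circumcircle criterion can be stated as the standard in-circle determinant predicate from computational geometry (a general fact added here for clarity, not a formula given in the thesis): for a triangle with counterclockwise vertices a, b, c, a point d lies inside its circumcircle if and only if

\[
\begin{vmatrix}
a_x - d_x & a_y - d_y & (a_x - d_x)^2 + (a_y - d_y)^2 \\
b_x - d_x & b_y - d_y & (b_x - d_x)^2 + (b_y - d_y)^2 \\
c_x - d_x & c_y - d_y & (c_x - d_x)^2 + (c_y - d_y)^2
\end{vmatrix} > 0,
\]

so a triangulation is Delaunay exactly when this determinant is non-positive for every triangle and every other input point d.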

The second feature of DT is that points are connected with each other in a nearest-neighbour manner. This means that each point is connected to the points in its nearest neighbourhood. As a result, extremely long edges can be avoided. Moreover, producing triangles in the nearest-neighbour manner saves computing power and is reasonable for the study of geographic space because it follows the first law of geography. Because of these two properties of DT, a new clustering strategy is proposed based on it, following the philosophy of the block-based method; a sketch of the edge structure it builds on is given below, and details about the method will be introduced in the next chapter.
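To make the edge structure concrete, the following hedged Python sketch builds a DT with SciPy and keeps only the edges shorter than the mean edge length, in the spirit of the head/tail division; since the thesis's own clustering strategy is only specified in chapter 3, the cut-off rule here is an illustrative assumption:

import numpy as np
from scipy.spatial import Delaunay

# Random points standing in for photo locations (illustrative only).
points = np.random.rand(1000, 2)
tri = Delaunay(points)

# Collect the unique undirected edges of the triangulation.
edges = set()
for simplex in tri.simplices:            # each simplex is a triangle (i, j, k)
    for a, b in ((0, 1), (1, 2), (0, 2)):
        i, j = sorted((simplex[a], simplex[b]))
        edges.add((i, j))

# Edge lengths; short edges (below the mean) connect points into clusters.
lengths = np.array([np.linalg.norm(points[i] - points[j]) for i, j in edges])
short_edges = [e for e, d in zip(edges, lengths) if d < lengths.mean()]
print(len(edges), "edges,", len(short_edges), "kept below the mean length")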

2.4 Scaling Analysis in Urban System

Scaling analysis is one of the most important fundamental tools borrowed from physics to solve problems in complex systems, which consist of interconnected parts that interact with their environment and exhibit, as a unity, emergent properties not observed in the individual parts (Miller and Page, 2007). Meanwhile, just as Berry (1964) pointed out, an urban system includes cities as systems within systems of cities.

To interpret this statement, cities can be regarded as systems composed of interacting and interdependent parts which can be studied at various levels in terms of structure, function and dynamic change. Moreover, these systems can be further partitioned into various subsystems. From a more general perspective, a set of cities also composes a system, providing an environment for each city, namely all the other cities within the system. In this sense, the socio-economy is the environment of these sets of cities. To explain more concretely, we are now living in a post-industrial society in which cities exhibit some post-industrial characteristics. As Berry and Garrison (1958) asserted, post-industrial cities have experienced a process of deconcentration and can be characterized as urban fragments in conjunction with the segregation of population, the expansion of urban infrastructure and some byproducts such as urban sprawl.

From the above statements, it is obvious that an urban system has similar characteristics to a complex system. Both are systems with a hierarchical structure whose parts are all interconnected, interact with their environments and exhibit behaviors that their individual parts do not have. Indeed, an urban system is one kind of complex system. Therefore, it is natural to apply certain tools from the study of complex systems to the urban system. Among these tools, scaling analysis can be a powerful and effective one.

Speaking of scaling analysis in urban study, it is de facto the identification of whether there are patterns in the study area that follow the power law distribution. This is because scaling is a property of the power law distribution. The mechanism underlying the scaling property is that no matter how the measured size is enlarged or contracted to certain scales, the shape of the distribution remains unchanged. When the concept is applied to urban systems, one classic model that should be mentioned is the Rank-Size Distribution (RSD), which is characterized by Zipf's law (Zipf, 1949). The RSD is a variant of the power law distribution and can be identified with the same techniques. The general idea behind Zipf's law is that in systems of cities, the size of the r-ranked city should be about 1/r of that of the top-ranked city, and the distribution of the cities against their size exhibits a long tail (Figure 2.6). This actually explains well the term scaling, which means that the shape of the distribution remains unchanged no matter whether, and to what extent, the measured size is rescaled. More importantly, it reveals an important attribute of a city system: within a certain length of period, the city sizes hold a relatively stable heterogeneous hierarchy (Berry, 1967).

Figure 2.6: The rank-size distribution
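In symbols (a standard statement of scale invariance, added here for clarity rather than taken from the thesis), a power-law density keeps its shape under rescaling of the measured size, since for any factor k

\[
p(x) = C\,x^{-\alpha}
\quad\Longrightarrow\quad
p(kx) = C\,(kx)^{-\alpha} = k^{-\alpha}\,p(x) \;\propto\; p(x),
\]

while Zipf's law for city sizes reads \(S_r \approx S_1 / r\), i.e. a power law with exponent one in the rank r.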

This regularity provides some new perspectives for unveiling the mechanisms underlying cities and has also drawn great attention from the academic circle. For example, Ioannides and Overman (2001) proposed a method to test the validity of Zipf's law for cities and to calculate local Zipf's law exponents of the distribution of US city sizes; Cordoba (2007) derived restrictions on preferences, technologies and the randomly determined properties of the external driving forces that urban models have to fulfill in order to explain this regularity; Rozenfeld (2008) used the city clustering algorithm to examine Gibrat's law in a log-normal distribution, which is more suitable for the distribution of all cities than Zipf's law, which holds only for big cities; Jiang and Jia (2011) provided a new geospatial perspective on the verification of the validity of Zipf's law for all cities in the US; Jiang and Liu (2012) also developed a new perspective of city and field blocks for the scaling of geographic space; and Benguigui and Blumenfeld-Lieberthal (2006) proposed a new approach to analyzing city size distributions based on an empirical analysis of 41 cases.

However, even though a lot of effort has been made in this area, the identification of such distributions has always been a tricky job because of the heterogeneity of the problems and the limited availability of techniques. The development of the identification techniques can be traced back to the 19th century, when the Italian economist Pareto took logarithms (formula 3.4.0.1) of both the x-axis and the y-axis and measured the straight line formed in a histogram of wealth data (Arnold, 1983). This technique has been popularly accepted for quite a long time. However, it has its limitations, and the result is not reliable enough (Clauset et al., 2009). To solve this problem, Clauset and others (Clauset et al., 2009) proposed a method adopting Maximum Likelihood Estimation (MLE) (Shanbhag and Rao, 2001), Goodness of Fit (GoF) (Wasserman, 2003) and the Kolmogorov-Smirnov (KS) test (Press et al., 1992) as measurements for calculating the model parameters and model significance. Details about this identification technique will be introduced in the next chapter.
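As a minimal sketch of the estimator at the core of the Clauset et al. (2009) method (continuous case, with x_min assumed given; the full method also scans candidate x_min values and runs a goodness-of-fit test, which is omitted here):

import numpy as np

def fit_power_law(x, xmin):
    """MLE for the continuous power-law exponent (Clauset et al., 2009):
    alpha_hat = 1 + n / sum(ln(x_i / xmin)) over the n values x_i >= xmin,
    together with the Kolmogorov-Smirnov distance between the empirical
    and the fitted cumulative distributions.
    """
    tail = np.sort(x[x >= xmin])
    n = tail.size
    alpha = 1.0 + n / np.sum(np.log(tail / xmin))
    fitted = 1.0 - (tail / xmin) ** (1.0 - alpha)   # fitted CDF above xmin
    empirical = np.arange(1, n + 1) / n             # empirical CDF
    ks = np.max(np.abs(empirical - fitted))
    return alpha, ks

# Synthetic check: inverse-transform sampling of a power law with alpha = 2
# and xmin = 1, so the estimate should come out close to 2.
rng = np.random.default_rng(0)
u = 1.0 - rng.uniform(size=50_000)        # u in (0, 1]
samples = u ** (-1.0 / (2.0 - 1.0))       # x = xmin * u^(-1/(alpha-1))
print(fit_power_law(samples, xmin=1.0))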


3. Material and Methods

This chapter is mainly about the data and the data processing strategies employed in this study, which are the foundations of the empirical part of this research. The topics of this chapter are threefold and organized into sections accordingly. Firstly, the acquisition method of the Flickr data and the characteristics of the raw data are introduced in detail in the first section. Secondly, the software and methods employed to preprocess the raw data are described and depicted with figures. Lastly, the clustering strategy based on the Delaunay Triangulation is proposed and explained.

3.1 Description of Flickr Timestamped Geolocation Data

For the acquisition of the Flickr data, the Flickr API is adopted in this study as a flexible and cheap tool. In order to use the API, developers first have to apply for an API key with their Yahoo IDs. With this key, developers get authorization from Flickr so that requests from the developed application can be recognized and answered by the Flickr servers. The whole geolocation retrieving process is depicted in figure 3.1, and its pseudocode is given in Algorithm 1.

Algorithm 1: Flickr Photos Geolocation Data Retrieving Algorithm
Input: key, boundary (minLat, minLon, maxLat, maxLon), start_date_taken, end_date_taken
Output: photo_location (photo locations and time stamps)

procedure RETRIEVE_LOCATION(key, boundary, start_date_taken, end_date_taken)
    Flickr_App ← New Flickr(key)
    search_option.BoundaryBox ← boundary
    search_option.HasGeo ← True
    search_option.MinTakenDate ← start_date_taken
    search_option.MaxTakenDate ← end_date_taken
    Append Flickr_App.PhotoSearch(search_option) to photo_collection
    for all photo ∈ photo_collection do
        Append (photo.latitude, photo.longitude, photo.timeTaken) to photo_location
    return photo_location

Figure 3.1: The flow chart to download Flickr data

As can be seen above, four parameters have to be specified as inputs, namely, the API key, the boundary of the study area, the start date, and the end date. Meanwhile, the output is a list of photo locations and time stamps. Photo locations are the geographic locations, in latitude and longitude, where photos were taken, while time stamps are the times, in Gregorian calendar format, when photos were taken. Among the four input parameters, the API key is first used to initialize a Flickr object so as to activate the user's Flickr program. Then the other three parameters are passed in to customize the search preferences. In this case, three options have to be customized. For the boundary option, the boundary is defined as a rectangular box covering the area of the whole of Sweden. For the other two, the start date and the end date when photos were taken are defined to ensure that only photo information within the specified time period is downloaded.
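To make this workflow concrete, the following is a minimal Python sketch of the same retrieval logic against Flickr's public REST endpoint (method flickr.photos.search). It is an illustration rather than the program used in this study: the API key, the output file name, and the rounded bounding-box coordinates for Sweden are assumptions.

```python
# A minimal sketch of Algorithm 1 against Flickr's public REST endpoint
# (method flickr.photos.search). The API key, file name, and rounded
# bounding-box coordinates for Sweden are illustrative placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # obtained through a Yahoo ID, as described above
ENDPOINT = "https://api.flickr.com/services/rest/"

def retrieve_locations(bbox, start_date, end_date):
    """Yield (photo_id, latitude, longitude, date_taken) for geotagged photos."""
    page, pages = 1, 1
    while page <= pages:
        params = {
            "method": "flickr.photos.search",
            "api_key": API_KEY,
            "bbox": ",".join(str(v) for v in bbox),  # minLon,minLat,maxLon,maxLat
            "has_geo": 1,
            "min_taken_date": start_date,
            "max_taken_date": end_date,
            "extras": "geo,date_taken",
            "per_page": 250,
            "page": page,
            "format": "json",
            "nojsoncallback": 1,
        }
        reply = requests.get(ENDPOINT, params=params).json()["photos"]
        pages = reply["pages"]
        for photo in reply["photo"]:  # mirrors lines 8-9 of Algorithm 1
            yield photo["id"], photo["latitude"], photo["longitude"], photo["datetaken"]
        page += 1

# Rectangular box roughly covering Sweden (illustrative coordinates).
with open("flickr_sweden.txt", "w") as out:
    for pid, lat, lon, taken in retrieve_locations(
            (10.0, 55.0, 24.5, 69.5), "2006-01-01", "2011-12-31"):
        out.write(f"{pid}\t{lat}\t{lon}\t{taken}\n")
```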

Figure 3.2: The downloaded Flickr Data (part)

After all this customization, the application starts to search for photos that satisfy the requirements and appends them to the photo collection. Then the data of each photo in the collection is written into a text file. Finally, the data is separated into several files based on the year extracted from the temporal information. The data format defined in the program can be seen in figure 3.2, where the ID, latitude, longitude, and taken time of each photo are separated with tabs. In this case, the timestamped geolocation data of Flickr photos in the whole of Sweden was downloaded for the years 2006 to 2011. After that, the data of each year was accumulated with that of its former year(s) to study the accumulation effect. A summary of the data can be seen in table 3.1.

Table 3.1: A summary of the downloaded data

            2006      2007       2008       2009       2010       2011
Size        11MB      79MB       144MB      169MB      303MB      517MB
Num         163,418   1,161,717  2,048,781  2,376,134  4,276,968  7,329,708
AcuSize     11MB      90MB       234MB      403MB      706MB      1,323MB
AcuNum      163,418   1,325,135  3,373,916  4,424,915  6,325,749  13,655,457

Size: text file size; Num: number of points in each file; AcuSize: the accumulated size with former year(s); AcuNum: the accumulated point number with former year(s).

From the second row of the table, it can be seen that the total size of the text files is over 1 GB, which does not seem to be a large dataset. However, when converted to shapefiles, the dataset can reach 4 GB, which is double the processing capacity of ArcGIS Desktop 10.1 (ESRI, Inc., 2012). Therefore, attempts to process the data directly with ArcGIS Desktop failed, and new strategies had to be proposed to deal with this issue. In this thesis, one special characteristic of the Flickr data is utilized to handle the problem. Since the downloaded points are highly duplicated, the duplicates were first removed from the dataset and the number of duplicated points was then added back to the reduced dataset as a new column called "Num". In this way, the decreased data amount is within the processing capacity of ArcGIS Desktop and the problem is solved. Details about this method will be discussed in the next section; the sketch below illustrates the core idea.
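As an illustration of this deduplication idea, the following Python sketch collapses identical coordinate pairs into single records and stores the duplicate count in the new "Num" column. The tab-separated input format follows figure 3.2; the file names are placeholders.

```python
# Sketch of the deduplication step: identical coordinates collapse into one
# record, and the number of duplicates is kept in a "Num" column.
from collections import Counter

counts = Counter()
with open("flickr_sweden.txt") as src:  # ID, lat, lon, time (tab-separated)
    for line in src:
        _, lat, lon, _ = line.rstrip("\n").split("\t")
        counts[(lat, lon)] += 1

with open("flickr_sweden_dedup.txt", "w") as dst:
    dst.write("Lat\tLon\tNum\n")
    for (lat, lon), num in counts.items():
        dst.write(f"{lat}\t{lon}\t{num}\n")
```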

Downloading Flickr data is a very time-consuming process. However, this is not determined solely by the local network speed, but mainly by the rate-limiting policy imposed by Flickr, which does not allow overly frequent requests to its servers (Yahoo! Inc., 2013). Moreover, since the downloading program is written to skip over any unknown exceptions in order to ensure a continuous downloading process, some data may not be successfully downloaded due to unpredictable changes in the network environment. Therefore, to ensure a sufficient amount of data, the downloading program was run on a cloud server provided by Microsoft Azure, which is believed to offer a relatively stable and fast network environment. With the data at hand, the processing strategies will be discussed in detail in the next sections.

3.2 The Data Preprocessing Strategy

In this research, there are three major problems with the employed data that hinder the study. The first problem is the data amount, which is beyond the processing capacity of the employed GIS software, namely ArcGIS Desktop 10.1. Since the hardware resources at hand are limited, preprocessing of the data is necessary so that it can be further processed and analyzed. The second problem concerns the data content. Because the downloaded data covers a larger area than the study area, redundant data has to be removed before the study. Therefore, some sieving techniques have to be applied in advance. The third problem concerns the data format. As the downloaded data is in txt format, which is not a geographic data format such as shapefile, the data has to be converted before it can be used. This is another aspect in which preprocessing is crucial. A flow chart of the preprocessing techniques can be seen in figure 3.3. The following paragraphs describe the preprocessing process in detail.

Figure 3.3: The preprocessing flow

Inspired by the “theory of partitioning” (Hillier, 1996), the strategy to deal with the “Big Data” in this study is to partition the data into accessible pieces. However, it should be noted that this idea can only be applied to the data preprocessing when data integrity is not necessary for the analysis. In this case, the data is split into several files, each of which contains 50,000 points, and organized according to the year it belongs to (see the sketch below). After that, each piece of data is iteratively converted into shapefiles through a program developed in this study with ArcObjects. In this way, the data amount problem is partly solved.
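A minimal sketch of this partitioning step is given below. The 50,000-point chunk size matches the description above, while the input file and the naming scheme of the output pieces are assumptions for illustration.

```python
# Sketch of the partitioning step: a large yearly text file is split into
# chunks of 50,000 points so each piece stays within the software's capacity.
CHUNK = 50_000

def split_file(path, year):
    with open(path) as src:
        part, buffer = 0, []
        for line in src:
            buffer.append(line)
            if len(buffer) == CHUNK:
                _flush(buffer, year, part)
                part, buffer = part + 1, []
        if buffer:
            _flush(buffer, year, part)

def _flush(lines, year, part):
    # Write one 50,000-point piece; the naming scheme is a placeholder.
    with open(f"flickr_{year}_part{part:03d}.txt", "w") as dst:
        dst.writelines(lines)

split_file("flickr_2011.txt", 2011)  # file name is a placeholder
```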

The next step is clipping, which aims to remove redundant data outside the study area and solve the content problem. As we know, the downloaded data is bounded by a rectangular box which covers not only the study area of Sweden but also parts of Norway, Finland, and Denmark. Therefore, before the data is processed, it has to be clipped by the administrative boundary of Sweden. To this end, the converted shapefiles are input into another program using the clipping function provided by the ArcObjects library and clipped respectively. After clipping, the shapefiles are merged by year, and the content and format problems are solved.
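Outside the ArcObjects environment, the same clipping step could be reproduced with open-source tools. The sketch below assumes geopandas is available; the point and boundary shapefile names are placeholders.

```python
# Sketch of the clipping step using geopandas instead of ArcObjects:
# points outside the Swedish administrative boundary are discarded.
import geopandas as gpd

points = gpd.read_file("flickr_2011_points.shp")   # placeholder file name
sweden = gpd.read_file("sweden_boundary.shp")      # placeholder file name
clipped = gpd.clip(points, sweden)                 # keep only points inside the boundary
clipped.to_file("flickr_2011_clipped.shp")
```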

However, this dataset is still around 3.9 GB, which is too large for any analysis with ArcGIS Desktop 10.1. Therefore, a second strategy has to be introduced to fundamentally solve the data amount problem: removing the duplicates by taking advantage of the heavy duplication of the Flickr data. Before this step is performed, the shapefile data is converted back to txt format for the convenience of the program developed to remove duplicates. After that, the data of each specific year is merged with the data before that year. This prepares for the study of the spatio-temporal accumulation effect in the urban environment. Then the deduplication program is run and the cleaned files are converted to shapefiles afterwards. At this point, the data amount drops dramatically, as can be seen in table 3.2. The data is now prepared for the clustering that will be introduced in the next section.

Table 3.2: The cleaned data

                   2006      2007       2008       2009       2010       2011
AcuSize (before)   11MB      90MB       234MB      403MB      706MB      1,323MB
AcuSize (after)    18KB      56KB       91KB       132KB      179KB      235KB
AcuNum (before)    163,418   1,325,135  3,373,916  4,424,915  6,325,749  13,655,457
AcuNum (after)     648       1,492      2,402      3,473      4,682      6,144

AcuSize (before): text file size of the accumulated raw data; AcuSize (after): text file size of the cleaned data; AcuNum (before): number of accumulated points in the study area; AcuNum (after): number of cleaned points in the study area.

3.3 The Clustering Strategies Based on the Delaunay Triangulation

The clustering strategy of this study is mainly inspired by the method proposed by Jiang and Liu (2012) to delineate natural city boundaries. As mentioned before, the existing natural city approach has two variants. One is point-based, with the City Clustering Algorithm (Rozenfeld et al., 2009) as its clustering strategy, whereas the other is block-based, with a relatively new clustering strategy developed based on the first law of geography (Tobler, 1970). The natural city approach solves two major problems of the traditional methods. For those with census data as the study object, the natural city approach defines city boundaries in a more objective and scientific way. For those with raster results, the block-based natural city approach is less affected by the MAUP issue. However, since the point-based natural city approach generates raster results, it is not perfect and suffers severely from the MAUP. Moreover, it is hard to add the time dimension to the block-based method. Therefore, a new point-based clustering strategy is proposed.

The clustering strategy proposed in this research is a spatial point clustering method based on the Delaunay Triangulation, as can be seen in figure 3.4. The clustering philosophy is the same as that of the block-based natural city approach, namely the first law of geography: all points are related to each other, but close points are more related than distant points. The method is DT-based because distances between points are measured by the triangle edges, and points with short distances to each other are clustered into one group as cities. Details about the method are introduced as follows.

The point shapefiles generated from the preprocessing step are imported into ArcGIS Desktop 10.1 and TIN models are produced accordingly. Then the lengths of the TIN edges are calculated and the head/tail break rule is adopted to divide the TIN edges into two groups by the mean value of the lengths.


Figure 3.4: The clustering strategy based on DT

The group with shorter TIN edges contains the urban areas whereas the one with longer edges contains the rural areas. Therefore, the second group with long edges should be removed by exporting the first group to a new shapefile. After that, the TIN edges in the new shapefile are converted to polygons, which are then dissolved to form cities. In this way, urban patterns are generated; a minimal open-source sketch of this workflow is given below. To distill knowledge from these patterns, scaling analysis can be applied; the techniques to perform it are described in the next section.
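The workflow can also be approximated outside ArcGIS. The following sketch, assuming scipy and shapely are available, triangulates the points, applies the head/tail break by the mean edge length, keeps only triangles whose edges all fall in the short (head) group, and dissolves them into natural-city polygons. Keeping whole triangles is a slight simplification of removing individual long edges, but it should yield very similar clusters.

```python
# A minimal open-source approximation of the DT clustering strategy:
# triangulate, head/tail-break the edge lengths by their mean, keep the
# short-edge triangles, and dissolve them into "natural city" polygons.
import numpy as np
from scipy.spatial import Delaunay
from shapely.geometry import Polygon
from shapely.ops import unary_union

def natural_cities(points):
    """points: (n, 2) array of projected x/y coordinates."""
    tri = Delaunay(points)
    simplex_pts = points[tri.simplices]                          # (m, 3, 2)
    # Three edge lengths per triangle: v0-v2, v1-v0, v2-v1.
    edges = np.linalg.norm(
        simplex_pts - np.roll(simplex_pts, 1, axis=1), axis=2)  # (m, 3)
    mean_len = edges.mean()                  # head/tail break by the mean
    keep = (edges < mean_len).all(axis=1)    # triangles with only short edges
    polys = [Polygon(p) for p in simplex_pts[keep]]
    dissolved = unary_union(polys)           # merge touching triangles
    if dissolved.geom_type == "MultiPolygon":
        return list(dissolved.geoms)
    return [dissolved]

# Usage sketch with random points; real input would be the cleaned point data.
pts = np.random.rand(1000, 2) * 1000
cities = natural_cities(pts)
print(len(cities), "clusters")
```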

3.4 Power Law Distribution Identification Techniques

In this study, a maximum-likelihood method proposed by Clauset et al. (2009) is chosen as the identification technique for the power law distribution; it is now generally recognized as a reliable method for the identification problem. The power law distribution can be characterized by formula 3.4.0.1. In this method, there are three important values to be found, namely α, xmin, and p. Among these three values, α is the scaling parameter with which the shape of the power-law distribution is determined; xmin is the lower bound of the data that fits the power-law distribution and is usually unknown in practice; and the p-value tests the plausibility of the fitted model. The value of α can be estimated with the MLE method expressed in formula 3.4.0.2.

log(P(x)) = −α·log(x) + log(M),   for x ≥ xmin,   where M = (α − 1)·xmin^(α−1)        (3.4.0.1)

α = 1 + n·[ Σ_{i=1..n} log(xi / xmin) ]^(−1)        (3.4.0.2)

However, according to Clauset's method, the value of α is extremely sensitive to the value of xmin. As can be seen in figure 3.5, the only stable range of the xmin value is around the true value. Therefore, xmin should be determined in advance and as accurately as possible. One reliable method is to estimate the xmin value by minimizing the distance between the power-law model and the data.

The method first defines the set of all possible values for xmin, namely all values of x. Then it iteratively builds up power-law models by calculating the value of α for each value of xmin in the set based on formula 3.4.0.2. When building up each power-law model for a given value of xmin, the difference between the empirical data and the model is calculated; that is, based on all values of x in the empirical data set (the x values are the same in the empirical data and the model), the differences in y are calculated. After this, the maximum distance is extracted from the obtained set of distance values. This distance is then compared with the maximum distance obtained from the next iteration of the power-law model building process based on the next xmin value.

The whole process is illustrated in Algorithm 2. In this algorithm, xmin_set is the set all possible xmin values belong to; x'min is any candidate value for xmin; S(x) is the distribution of the empirical data; D is the largest distance between the distribution of the empirical data and the power-law model (the KS statistic as in formula 3.4.0.3); and Dmin is the smallest value of D. In this way, the value of xmin is estimated, the value of α can be determined, and thus the fitted power-law model is found.

Figure 3.5: The relationship between α and xmin (Clauset et al., 2009). This is an experiment of 5000 datasets with 2500 samples from a well-known power law model with the true values of α and xmin as 2.5 and 100, respectively.

D = max_{x ≥ xmin} |S(x) − P(x)|        (3.4.0.3)

Algorithm 2 xmin Value Estimation Algorithm
Input: data, xmin_set
Output: xmin

1: procedure XMIN-ESTIMATION(data, xmin_set)
2:     for all x'min ∈ xmin_set do
3:         for all xi ∈ data do
4:             α ← 1 + n·[ Σ_{i=1..n} log(xi / x'min) ]^(−1)
5:         for all x ≥ x'min do
6:             P(x) ← ((α − 1) / x'min)·(x / x'min)^(−α)
7:             S(x) ← CDF(x)                          ▹ CDF means cumulative distribution function
8:         D ← max_{x ≥ x'min} |S(x) − P(x)|          ▹ The KS test
9:         if Dmin > D then
10:            Dmin ← D
11:            xmin ← x'min
12:    return xmin
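For reference, Algorithm 2 can be transcribed into Python as follows. This sketch assumes the data is a one-dimensional NumPy array and compares the empirical and fitted complementary cumulative distributions, following Clauset et al. (2009); the established powerlaw Python package implements this fitting procedure for production use.

```python
# Sketch of Algorithm 2 (Clauset et al., 2009): for each candidate xmin,
# estimate alpha by MLE (formula 3.4.0.2) and keep the xmin whose fitted
# model minimizes the KS distance (formula 3.4.0.3) to the data.
import numpy as np

def fit_power_law(data):
    data = np.sort(np.asarray(data, dtype=float))
    best = (np.inf, None, None)                   # (D_min, xmin, alpha)
    for xmin in np.unique(data)[:-1]:             # candidate xmin values
        tail = data[data >= xmin]
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))  # MLE, formula 3.4.0.2
        # Empirical and model CCDFs compared over the tail (KS statistic).
        S = 1.0 - np.arange(n) / n                # empirical CCDF at each tail point
        P = (tail / xmin) ** (1.0 - alpha)        # power-law CCDF
        D = np.max(np.abs(S - P))
        if D < best[0]:
            best = (D, xmin, alpha)
    return best                                   # (D_min, xmin, alpha)

# Usage sketch on synthetic power-law data.
D, xmin, alpha = fit_power_law(np.random.pareto(1.5, 5000) + 1)
print(f"xmin={xmin:.2f}, alpha={alpha:.2f}, D={D:.3f}")
```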

The last parameter, the p-value, is for the goodness-of-fit (GoF) test, which is employed to check whether the power-law hypothesis is plausible. To calculate this value, a large number of synthetic data sets are first generated with the same values of α and xmin as the power-law distribution that fits the observed data. Then, for each synthetic data set, the power-law distribution that fits it best is found with the same techniques, yielding the values of α and xmin of the new distribution. After this, a KS test is performed between the empirical distribution of each synthetic data set and its newly fitted distribution, giving a set of D values as in formula 3.4.0.3. The number of these D values that are larger than the D obtained before the GoF test is counted, and the p-value is then calculated as the fraction of this count over the total number of D values generated from the synthetic data sets.

However, a question arises naturally: how are these synthetic data sets generated? According to Clauset's explanation, they are generated with a semiparametric approach which is twofold: for each synthetic data set, 1) synthetic data values that are less than xmin (found before the GoF test) are randomly drawn from the empirical distribution of the data; on the contrary, 2) synthetic data values that are larger than xmin are randomly drawn from the fitted power-law model. Then how many synthetic data sets are required at minimum? The answer given by Clauset is that this number should be at least (1/4)·ε^(−2), where ε is the desired accuracy of the p-value. Up to this point, the identification techniques of the power law distribution have been fully explained. Their application to analyze and interpret the spatio-temporal patterns of the Flickr data will be discussed in the next chapter.
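Continuing the sketch above, the p-value computation just described could look as follows; fit_power_law() is the function from the previous sketch, and the replicate count follows the (1/4)·ε^(−2) rule.

```python
# Sketch of the GoF p-value: generate semiparametric synthetic data sets,
# refit each one, and report the fraction of KS distances exceeding the
# empirical one. Requires fit_power_law() from the previous sketch.
import numpy as np

def gof_p_value(data, epsilon=0.05, rng=np.random.default_rng(0)):
    data = np.asarray(data, dtype=float)
    D_emp, xmin, alpha = fit_power_law(data)
    body = data[data < xmin]                   # empirical part below xmin
    n, n_tail = len(data), np.sum(data >= xmin)
    reps = int(np.ceil(0.25 * epsilon ** -2))  # at least (1/4) * eps^-2 replicates
    exceed = 0
    for _ in range(reps):
        synth = np.empty(n)
        for i in range(n):
            if rng.random() < n_tail / n:      # tail: draw from the fitted model
                synth[i] = xmin * (1 - rng.random()) ** (-1 / (alpha - 1))
            else:                              # body: resample the empirical data
                synth[i] = rng.choice(body)
        D_syn, _, _ = fit_power_law(synth)
        if D_syn > D_emp:
            exceed += 1
    # The fraction of synthetic D values exceeding the empirical one is the p-value.
    return exceed / reps
```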

Up to this point, the data and the processing methods have been fully introduced and explained. The data employed in this study is Flickr timestamped geolocation data downloaded through the Flickr API. The downloaded data has three issues, namely the data amount issue, the data content issue, and the data format issue. To solve these problems, the data is preprocessed before use. After splitting, clipping, deduplication, and merging, the preprocessed data is clustered through a new strategy based on DT. Finally, the generated natural cities are further analyzed and interpreted through scaling analysis techniques. In the next chapter, the results of the research will be presented and analyzed, issues with the data and the method will be discussed, and future work will be suggested.


4. Results and Discussion

This chapter presents the results, interprets them, and discusses the limitations of the study and the problems encountered along the way. Generally speaking, the chapter can be divided into three major parts according to the contents. The first part compares and analyzes the results generated from the Flickr data and from OpenStreetMap (OSM) data. Moreover, the proposed clustering strategy based on the Delaunay Triangulation is compared with the existing point-based clustering method for extracting natural cities. In the second part, scaling analysis is conducted and the analysis results are interpreted. In the last section, some limitations of the data are discussed based on the visualization of the generated natural cities.

4.1 A Comparative Analysis of Data and Methods

In this section, a comparative analysis is performed on the extracted patterns, namely natural cities, from the perspectives of data and methods. For the data aspect, the result is compared with the one extracted from the street junctions of OSM data, as can be seen in figure 4.1. Similarly, from the method aspect, the result is compared with the one generated by the original point-based clustering strategy, namely the City Clustering Algorithm (CCA), as can be seen in figure 4.2.

(a) Flickr natural cities (b) OSM natural cities

Figure 4.1: Comparison of natural cities extracted from Flickr data and OSM data

From figure 4.1, it is obvious that OSM data depicts the city boundary better than Flickr data. This result is quite reasonable because of the different natures of the two data sources. For the OSM data, the points are extracted from street junctions. As we know, streets constitute the skeleton of a city, so even if a city decays, the skeleton will remain. In this sense, OSM data can be more accurate than Flickr data for detecting the existing boundary. However, since streets usually tend to increase rather than decrease unless a very big revolution in the city infrastructure takes place, OSM data is more suitable for detecting the sprawl of a city than its contraction, which is one shortcoming of this kind of data. Moreover, due to the lack of temporal information in OSM data, it is not suitable for studies related to dynamic changes. In contrast, Flickr data is retrieved from photos, which are one representative of human social activities. It may not be suitable for depicting the structure of a city, but it is better used to detect the hot spots within an urban environment during a period of time. Moreover, since there is rich and accurate temporal information in the Flickr data, it can be used to study the evolution of a city in terms of human social activities. To conclude, natural cities defined by OSM data are more structure-oriented whereas those defined by Flickr data are more human-oriented. Therefore, these two types of data should be carefully chosen for the study of different topics.

(a) Natural cities generated by DT (b) Natural cities generated by CCA

Figure 4.2: Comparison of natural cities generated by DT and CCA

From figure 4.2, it looks as if the natural cities generated by the CCA method are better because there are more red areas in figure 4.2b than in figure 4.2a. However, this is a wrong impression for two reasons. The first reason is that the clusters generated by CCA all represent urban areas, which is incorrect. In other words, it means that as long as there is data existing at a certain place, this place is a city. For example, among the clusters generated by CCA, there are a lot of clusters with only one point. This is why there are more red areas in figure 4.2b. The second reason is that resolution has a huge impact on the results generated by CCA. As can be seen in figure 4.3, if the pixel size changes, the results of CCA can vary accordingly to a great extent. Moreover, the choice of the pixel size is up to the user, which may be subjective. In contrast, since the data has been filtered, it is better guaranteed that the natural cities extracted with the DT method are urban space. In this sense, the hierarchy of the urban environment in Sweden is clearer. Moreover, since the result is in vector format, it is not much influenced by the resolution problem and the result is more objective.


(a) No resolution problem for vector result. (b) The resolution problem of raster result.

Figure 4.3: The impact of the resolution problem on the DT and CCA methods

4.2 Scaling Analysis and Interpretation of the Flickr Patterns

Due to the duplication issue of the Flickr data, the areas of the generated natural cities cannot be used to unveil the real spatio-temporal pattern of Sweden. Therefore, the point numbers within the generated natural cities are employed instead, and each dataset is the accumulation of the data of one specific year and that of its former year(s). After data collation, power-law fitting tests are performed on each dataset, with the results summarized in table 4.1. An overview of the results can be seen in figure 4.4. In addition, the distribution of each year's data can be seen as a log-log plot in figure A.1.

As shown in table 4.1, the p-value test passes for all years except 2006. Considering that Flickr first launched the geotagging feature in 2006 (Arrington, 2006), it is quite reasonable that there was only a limited number of photos with geolocation information, and therefore less information was downloaded. However, the data of 2006 is still important as the starting point of the data accumulation. In contrast, from the year 2007, with the skyrocketing of the data amount, the city numbers keep increasing and the power law distribution fits each year's data very well. Therefore, there is enough confidence to believe that the spatial pattern of the urban areas in each year follows the power-law distribution with a scaling parameter around 1.5. Moreover, this pattern remains stable along the time axis. Since the data analyzed is the accumulation of that of the former years, it means that, as time goes by, human social activities are increasing but the pattern remains the same. This is evidence that cities in Sweden are evolving in terms of social activities and that the pattern is scale-free.

Table 4.1: The power-law test results

          2006   2007   2008   2009    2010   2011
CityNum   17     33     65     106     141    195
α         1.39   1.46   1.49   1.52    1.53   1.48
xmin      43     120    120    1,823   171    195
p-value   0      0.28   0.44   0.67    0.54   0.02

Note: CityNum is the number of natural cities; α is the value of the scaling parameter; xmin is the minimum value in the dataset that fits the power law distribution; p-value is the test result, which indicates that the dataset fits the power law distribution if the p-value is larger than 0.


Figure 4.4: An overview of the scaling analysis results

The limitations of the scaling analysis lie mainly in the data aspect. Unlike typical social media data such as Twitter data, which covers a wide range of human activities, Flickr timestamped geolocation data only concerns activities related to photos. Therefore, the available information is limited to a certain extent. Moreover, Sweden has a relatively small population, which also limits the available data amount. Furthermore, unlike the lasting popularity of Twitter in the US, Flickr is gradually being replaced by the supernova Instagram. Therefore, it might be more objective if data from other sources were employed for the same study as a control. However, this is beyond the scope of this thesis and is suggested to be explored in the future.

4.3 Limitations and Problems of the Data and the Processing Strategies

With the rapid growth of VGI and the increasing interest in VGI-related research, some limitations and problems of VGI itself are emerging, which may act as barriers to its further development and have side effects on related research. Likewise, Flickr, as one typical VGI reservoir, also suffers from these problems, and thus the research conducted in this thesis is also more or less influenced. What kinds of issues are they? Roughly, these issues can be divided into four categories, namely knowledge discovery, data uncertainty, human privacy, and underlying driving forces.

The first issue concerns the usage of VGI. Nowadays, with the near real-time feature of VGI, data acquisition is no longer a big issue in the knowledge discovery process. However, methods and techniques to exploit these massive datasets are limited, so that large amounts of data are discarded before their full value is utilized. As a result, the focus should be shifted from data acquisition to data processing and interpretation. However, "Big Data" is still far beyond the processing capacity of average equipment such as personal desktops. Efficient algorithms and ingenious strategies alone cannot fundamentally solve the problems of "Big Data". In a similar vein, this study is also affected by this issue, since the processing strategies depend on the duplication of the data and are very specific to the Flickr data employed in this study. Therefore, a publicly accessible, well-established cyberGIS infrastructure is necessary to solve the problem at its root (Wang, 2010).

The second issue is the reliability of VGI. Two major problems relate to this issue: one is the quality of the data itself, since contributors usually upload data without strict quality control; the other is the lack of effective methods to improve the quality of the existing data due to its complexity, ill structure, and heterogeneity. This is one key reason for the heavy duplication of the Flickr data. As a result, the spatial distribution of the data is uneven and cannot be used to detect the real boundary of a city. The Flickr data works more like census data, which can be employed for scaling analysis rather than boundary detection.

The third issue concerns the users of VGI, whose privacy should be protected. Unintentional or ill-intentioned exposure of users' privacy, especially their locations, may lead to many social problems such as threats to users' personal safety. This problem tremendously influences the availability of VGI data. To protect users' privacy, many social networks with extremely rich timestamped geolocation data refuse to provide publicly accessible avenues. One compelling example is Facebook. This is a double-edged sword with both positive and negative effects. From the user perspective, this is a good way to protect their privacy. However, from the perspective of researchers, this may be extremely frustrating. A good way to solve this problem may be to confirm users' willingness to share their location information and to open the existing data to the public.

The last issue concerns the sustainability of VGI. The driving forces underlying it should be examined thoroughly so that the potential for the further development of VGI can be foreseen. As Goodchild (2007) points out, there are mainly two reasons for contribution: 1) the spirit of altruism, and 2) the inner desire of volunteers for self-improvement and self-satisfaction. This issue is not much of a problem in this research. However, it is believed that as more people become aware of the importance of VGI data, this research may become a positive thrust to the contribution of VGI data by the general public.

With respect to the clustering strategy employed in this research, it should be noted that, since the data scale is at the national level, the related statistical tests should be categorized as "global tests" (Jacquez, 2008), which are only valid for detecting whether spatial structure exists. They fail to answer questions about where the clusters are and how the spatial dependencies vary. Therefore, in this thesis, only the existence of the spatio-temporal pattern can be detected and tested.


5. Conclusions

This chapter is composed of two sections. In the first section, the whole thesis is summarized and the contributions of this work are outlined accordingly. In the second section, some future work is suggested based on the existing work. Details about these two sections are given in the following text.

5.1 Contributions of This Thesis

This thesis generally follows the research pattern of geographic knowledge discovery with massive geographic data to fit the context of the "Big Data" era. During the whole process, one type of VGI, namely Flickr timestamped geolocation data, was employed as the major study object. A new clustering strategy based on the Delaunay Triangulation (DT) was proposed to improve the natural city approach. Meanwhile, scaling analysis was performed to interpret the spatio-temporal patterns underlying the Flickr data. It was found that cities in Sweden are evolving in terms of human social activities.

The contribution of this thesis is threefold: 1) a large amount of VGI data was downloaded and collated; 2) a new clustering strategy based on the Delaunay Triangulation was proposed; and 3) cities in Sweden were discovered to be evolving in terms of human social activities. Details about each contribution are described as follows.

Firstly, around 4 GB of timestamped geolocation data was generated. Acquiring data through the API proved to be a feasible avenue. Moreover, the heavy duplication of Flickr data was discovered, which serves as a reminder for future use of the data. Secondly, a new clustering strategy based on DT was proposed as a complement to the natural city approaches. Thirdly, the spatio-temporal pattern behind the Flickr data was explored. It was found that the pattern generally follows the power law distribution with a scaling parameter around 1.5 and remains stable as time goes by. The increasing city number by year and the scaling characteristic of the pattern indicate that cities are evolving in terms of human social activities.

5.2 Future Work of This Thesis

As mentioned before, some limitations and problems were encountered during this study. They are mainly twofold: one concerns the data while the other concerns the data processing strategy. As a result, the unsolved problems become the outlook for further studies. Details about the expected future work are put forward as follows.

In terms of data, the Flickr data may have other research potential as an important source of social media data, awaiting discovery. With more and more people becoming familiar with the geotagging feature of Flickr, more VGI data will be made available to the general public. However, the download speed limitation set up by Yahoo hampers efficient and timely research. Therefore, this problem is expected to be addressed by the authority of Flickr later on; for example, they may remove the speed limitation for users with academic purposes. Moreover, the study in this thesis should also be further explored with VGI data from other sources, such as Twitter, to support the conclusions of this thesis.

In terms of the processing strategies, the clustering strategy proposed in this thesis may be revised to better fit the context of "Big Data". For the preprocessing, suggestions can be given to improve the efficiency, which, in this context, is mainly about speed. To speed up, the existing method may be improved with faster algorithms or with algorithms better suited to computers with super computing capacity. For the clustering strategies, the parameters used by the software should be further investigated. For the whole strategy, other socio-economic factors may be explored further, which could possibly provide a better explanation of the patterns in urban space.


References

Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., Gehrke, J., Haas, L., Halevy, A., Han, J., Jagadish, H. V., Labrinidis, A., Madden, S., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., Ross, K., Shahabi, C., Suciu, D., Vaithyanathan, S., and Widom, J. (2012). Challenges and Opportunities with Big Data. Available at: http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf.

Arnold, B. C. (1983). Pareto Distribution. Fairland, MD, USA. International Co-operative Publishing House.

Arrington, M. (2006). Flickr geo tagging now live. Available at: http://techcrunch.com/2006/08/28/flickr-to-launch-geo-tagging-today/.

Benguigui, L. and Blumenfeld-Lieberthal, E. (2006). Beyond the power law - a new approach to analyze city size distributions. Computers, Environment and Urban Systems, 31(2007):648–666.

Berry, B. J. L. (1964). Cities as systems within systems of cities. Regional Science, 13(1):146–163.

Berry, B. J. L. (1967). Geography of Market Centers and Retail Distribution. New Jersey, USA.Englewood Cliffs.

Berry, B. J. L. and Garrison, W. L. (1958). The functional bases of the central place hierarchy. Economic Geography, 34(2):145–154.

Clauset, A., Shalizi, C. R., and Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51:661–703.

Collaborative Consulting (2012). Making sense of big data: A collaborative point of view. Collective White Paper Series. Available at: http://www.collaborative.com/.

Cordoba, J. C. (2007). On the distribution of city sizes. Journal of Urban Economics, 63(2008):177–197.

Crandall, D., Backstrom, L., Huttenlocher, D., and Kleinberg, J. (2009). Mapping the world's photos. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 761–770, New York, NY, USA. ACM.

Elwood, S. (2010). Geographic information science: emerging research on the societal implications ofthe geospatial web. Progress in Human Geography, 34(3):349–357.

ESRI, Inc. (2012). Geoprocessing considerations for shapefile output. Available at: http://resources.arcgis.com/en/help/main/10.1/index.html#/for_shapefile_output/01m100000004000000/.

Frankel, F. and Reid, R. (2008). Distilling meaning from data. Nature, 455(7209):30.

Gartner Inc. (2011). Pattern-based strategy: Getting value from big data. Gartner Group press release.Available at: http://www.gartner.com/it/page.jsp?id=1731916.


Goodchild, M. F. (2007). Citizens as sensors: the world of volunteered geography. GeoJournal, 69(4):211–221.

Goodchild, M. F. (2008). Geographic information science: the grand challenges. In Wilson, J. P. and Fotheringham, A. S., editors, The Handbook of Geographic Information Science, pages 596–608. Malden, MA, USA. Blackwell.

Goodchild, M. F. and Hill, L. L. (2008). Introduction to digital gazetteer research. International Journal of Geographical Information Science, 22(10):1039–1044.

Hardy, D., Frew, J., and Goodchild, M. F. (2012). Volunteered geographic information production as a spatial process. International Journal of Geographical Information Science, 26(7):1191–1212.

Hey, T., Tansley, S., and Tolle, K. (2009). The Fourth Paradigm: Data Intensive Scientific Discovery.Redmond, Washington, USA. Microsoft Research.

Hillier, B. (1996). Space is the Machine. Cambridge, UK. Cambridge University Press.

Holmes, T. and Lee, S. (2009). Cities as six-by-six-mile squares: Zipf's law? In Glaeser, E. L., editor, The Economics of Agglomerations. Chicago, IL, USA. University of Chicago Press.

Howe, J. (2009). Crowdsourcing: Why the power of the crowd is driving the future of business. Chicago,IL, USA. University of Chicago Press.

Ioannides, Y. M. and Overman, H. G. (2001). Zipf’s law for cities: an empirical examination. RegionalScience and Urban Economics, 33(2003):127–137.

Jacquez, G. M. (2008). Spatial clustering analysis. In The Handbook of Geographic Information Science,pages 395–416. Oxford, UK. Blackwell Publishing.

Jia, T. and Jiang, B. (2010). Measuring urban sprawl based on massive street nodes and the novel conceptof natural cities. preprint. Available at: http://arxiv.org/pdf/1010.0541v2.pdf.

Jiang, B. (2012). Head/tail breaks: A new classification scheme for data with a heavy-tailed distribution.The Professional Geographer, ahead-of-print.

Jiang, B. and Jia, T. (2011). Zipf's law for all the natural cities in the United States: a geospatial perspective. International Journal of Geographical Information Science, 25(8):1269–1281.

Jiang, B. and Liu, X. (2012). Scaling of geographic space from the perspective of city and field blocks and using volunteered geographic information. International Journal of Geographical Information Science, 26(2):215–229.

Kuhn, W. (2007). Volunteered geographic information and giscience. Position Paper. Available at:http://www.ncgia.ucsb.edu/projects/vgi/docs/position/Kuhn_paper.pdf.

Lafreniere, T. (2011). Bring big data to the enterprise. IBM Corp. Available at: http://www-01.ibm.com/software/data/bigdata/.

Lohr, S. (2012). How big data became so big. The New York Times. Available at: http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?_r=0.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, B. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. Available at: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation.


MIKE2.0 (2012). Big Data Definition. Available at: http://mike2.openmethodology.org/wiki/Big_Data_Definition.

Miller, J. H. and Page, S. E. (2007). Complex Adaptive Systems: An Introduction to Computational Models of Social Life. Princeton, New Jersey, USA. Princeton University Press.

NRC (1993). Toward a coordinated spatial data infrastructure for the nation. Washington, USA.National Academies Press.

Openshaw, S. (1984). The Modifiable Areal Unit Problem. Norwich, UK. Geo Books.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C:The Art of Scientific Computing. Cambridge, UK. Cambridge University Press.

Raymond, S. E. (1998). Good bye, ”free software”; hello, ”open source”. Available at: http://www.catb.org/esr/open-source.html.

Rozenfeld, H. D., Rybski, D., Gabaix, X., and Makse, H. A. (2009). The area and population of cities: New insights from a different perspective on cities. Working Paper 15409, NBER. Available at: http://ssrn.com/abstract=1486545.

Rozenfeld, H. D., Rybski, D., Andrade, J. S., Batty, M., Stanley, H. E., and Makse, H. A. (2008). Laws of population growth. In Proceedings of the National Academy of Sciences, volume 105, pages 18702–18707.

Shanbhag, D. N. and Rao, C. R. (2001). Stochastic Processes: Theory and Methods. Amsterdam, Netherlands. Elsevier Science & Technology.

Smith, B. J. (1994). Collective Intelligence in Computer-Based Collaboration. Hillsdale, NJ, USA. Lawrence Erlbaum Associates.

Sui, D. Z. (2008). The wikification of GIS and its consequences: or Angelina Jolie's new tattoo and the future of GIS. Computers, Environment and Urban Systems, 32(1):1–5.

Sutton, P. (2003). A scale-adjusted measure of urban sprawl using nighttime satellite imagery. RemoteSensing of Environment, 86.

Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2):234–240.

Turner, A. (2007). Introduction to Neogeography. O’Reilly Media Short Cuts[Ebook].

Vander Wal, T. (2007). Folksonomy Coinage and Definition. Available at: http://vanderwal.net/folksonomy.html.

Wang, S. (2010). A cyberGIS framework for the synthesis of cyberinfrastructure, GIS, and spatialanalysis. Annals of the Association of American Geographers, 100(3):535–557.

Wasserman, L. (2003). All of Statistics: A Concise Course in Statistical Inference. New York, NY, USA. Springer.

Yahoo! Inc. (2013). The Flickr developer guide: API. Available at: http://www.flickr.com/services/developer/api/.

Yahoo! Inc. (2011). Flickr. Available at: http://advertising.yahoo.com/article/flickr.html.

Yahoo! Inc. (2012). The Flickr Developer Guide. Available at: http://www.flickr.com/services/developer/.


Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Oxford, UK. Addison-Wesley Press.


A. Power Law Test Results for Each Year's Data

This appendix provides an overview of the power law test results for each year. In the figures below, the color of each year's data corresponds to the one used in figure 4.4. Circles represent the data whereas dashed lines represent the power law distribution that best fits the data. Details can be seen as follows.

(a) 2006 (b) 2007

(c) 2008 (d) 2009

(e) 2010 (f) 2011

Figure A.1: Power-law test results. Logarithms have been taken on both the x-axis and the y-axis of the distributions.
