supplementary materials for - humnet lab -...

Post on 26-May-2018

219 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

Supplementary Materials for

Title: Revealing patterns in human spending habits Authors: Riccardo Di Clemente1, Miguel Luengo-Oroz2, Bapu Vaitla3, Marta C.

González1,4*

Affiliations: 1Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Massachusetts Avenue 77, MA 02139 Cambridge (USA) 2 United Nations Global Pulse, 46th St & 1st Ave, New York, NY 10017 USA 3 Department of Environmental Health, Harvard University, 677 Huntington Avenue Boston, MA 02115 USA 4 Center for Advanced Urbanism, Massachusetts Institute of Technology, Massachusetts Avenue 77, MA 02139 Cambridge (USA)

*Correspondence to: martag@mit.edu This PDF file includes:

Figs. S1 to S10 References

2

Supplementary Figure

Fig. S 1 (A) Histogram of the user transaction number in the CCRS, the 150,000 users selected for the analysis are those with more than 10 transactions and less than 300. (B) Relation between the district population (Census Data) and the number of user in our datasets, using the same color map as Fig.1 in the paper. (C) Distribution of overall users’ monthly expenditure in USD.

A B

C

3

Fig. S 2 Complementary Cumulative Distribution and Probability Density Plot (inset) of MMCs’ transaction codes. The probability distribution of a transaction code 𝑥" presents Zipf’s distribution 𝑝(𝑋") ∝ 𝑥"().+,, with a KS test of 0.04 and with cutoff identified as the right-most point in the distribution before the fitted power law (33,34).

4

Fig. S 3 (A) The Z-score of each word depend on occurrence. The z-score of each word is calculated by comparing the code sequence with 100 randomized code sequences for each user, preserving the number of transactions per type. (B) Probability Density Function plot of the occurrence of the significant words {𝑾𝒊} and its complementary cumulative distribution; the probability distribution words manifest a power law behavior 𝒑(𝑾𝒊) ∝ 𝒙𝒊(𝟏.𝟕𝟑; with 𝒙𝒊 frequency of the {𝑾𝒊} and KS test of 0.02. (C) Z-score of a words occurrence for a sample user. We highlight in orange the users’ associated set of words with a z-score greater than 2, that characterizes the user shopping routines.

5

Fig. S 4 Clustering results depending on the users’ selection. For each threshold 𝑥 we select all the users with more than 𝑥 significant links. Using the Louvain Algorithm (26) we perform the cluster of the users’ similarity matrix of the selected users at each threshold. For each threshold, we show the proportion of users that belong to each cluster, the core transaction for that cluster (as defined in the main text, Fig. 3) and the conditional probability for a user to belong in a given cluster depending on its number of significant links 𝑃(𝐶𝑙𝑢𝑠𝑡𝑒𝑟|#𝑙𝑖𝑛𝑘𝑠). By applying a lower threshold is it possible to increase the number of users analyzed. In particular, selecting users with more than 3 significant links improves the identification of clusters 4 and 6, which were misidentified when using higher thresholds. At the same time lowering the threshold increases the number of user that we are not able to categorize effectively (user percentage of cluster 5).

4,5

12

3

64,5

12

3

6

,,

Toll

?

15%25%16%23%20%

Toll

?

18%23%15%17%27%

#USERS = 2.4K #USERS = 4.1K

4

6

5

12

34

65

12

3

Toll

?

15%17%13% 8%21%26%

Toll

?

13%17%11%10%39%10%

#USERS = 7.2K #USERS = 13.0K

6

Fig. S 5 (A) Cluster analysis of the users’ similarity matrix for threshold 3 using the Walking Trap (27) and Leading Eigenvector (28) algorithms; both algorithms detect six different clusters as Louvain (25) (Fig. 3 and Fig. S5). (B) Network modularity analysis depending on the three cluster algorithms proposed. We see the Louvain algorithm always performs better in terms of modularity (33). (C) Analysis of similarity between the thee methods of data clustering performed, using Normalized Mutual Information (NMI) (34,35) and Rand (36) index. Values near one suggest a higher similarity between the cluster identified by the Louvain algorithm and the other two algorithms.

A

C

Toll

?

12%17%20%16%22%13%

Toll

?

12%13%11% 7%47%10%

4

65

12

3

4

6

5

12

3

B

7

Fig. S 6 (A) Distributions of the number of users’ transactions per clusters (on the left), and confidence interval (on the right). (B) Distribution of the users’ transaction diversity per clusters. We measure the transaction diversity 𝐷(𝑖) of a user 𝑖 by using the Shannon entropy of the user’s transactions and dividing by the number of transactions 𝑁 hence: 𝐷(𝑖) = 𝑝 𝑡" log 𝑝 𝑡"IJ∈LJ /𝑁; with 𝑇" the set of user transaction. The users identified as “commuters” (see main text) are those with low transactions diversity and a higher frequency of transactions. Conversely the users in the cluster 5 manifest a higher transaction diversity with a low number of transactions. These two factors combined means that the identification of the users’ routines in this cluster is more challenging.

A

B

8

Fig. S 7 (A) Sequitur compression ratio. Ratio between the original sequence transactions length and the length of Sequitur (25) sequence output. The compression ratio of the clustered user is 1.50. (B) Shanon entropy of the transactions Sequence. We define the Shannon entropy for a user𝑖 asS(i) = p(tT) log p(tT)UV∈WV ; where p(tT) is the probability of observing the transaction tT in the transaction user sequence TT. Our proposed algorithm achieves its best results analyzing the users characterized by higher values of Shannon entropy in transactions type.

9

Fig. S 8 Distributions of socio-demographic characteristics of individuals in each cluster.

10

Fig. S 9 Confidence intervals of socio-demographic characteristics of individuals in each cluster detected by our framework, and the solid in red representing the median values of the all clustered users.

11

Fig. S 10 Median expendidure by transaction code (in USD for the 10 weeks considered). The overal clusters’ expenditure are in agreement with the core transaction identified by our framework.

Toll

12

References

(from 33 only in the supplementary) 1. D. Lazer et al., Science 323, 721–723 (2009).

2. J. Giles, Nature 488, 448–450 (2012).

3. Vespignani, A. Nat. Phys. 8, 32–39 (2011).

4. N. Eagle, M. Macy, R. Claxton, Science 328, 1029–1031 (2010).

5. A. Pentland, Sci. Am., 309 78-83 (2013).

6. J. Mervis, Science, 336, 22 (2012).

7. J. L. Toole, et al., J. R. Soc. Interface 12, 10.1098/rsif.2014.1128 (2015).

8. M. C. Gonzalez, C. A. Hidalgo, A.-L. Barabasi, Nature 453, 779 (2008).

9. S. Jiang, et al., P. Natl. Acad. Sci. USA, 113, 10.1073/pnas.1524261113 (2016).

10. C. Song, et al., Science, 327, 1018-1021 (2010).

11. J. Blumenstock, G. Cadamuro, R. On, Science 350, 1073 (2015).

12. M. Lenormand, Sci. Rep. 5, 10.1038/srep10075 (2015).

13. T. Louail, et al., Sci. Rep. 5, 10.1038/srep05276 (2015).

14. S., Çolak, et al., Nat. Com. 7, 10.1038/ncomms10793 (2016).

15. M. R. Solomon, Consumer behavior: Buying, having, and being (prentice Hall

Engelwood Cliffs, NJ, 2014).

16. D. Pennacchioli, et al. EPJ Data Sci 3, 1—27 (2014).

17. Y. Yoshimura, et al., arXiv:1610.00187 (2016).

18. C. Krumme, et al., Sci. Rep. 3, 10.1038/srep01645 (2013).

19. X. Dong, et al., Workshop on Information in Networks, (NY, USA, 2015).

20. V. Singh et al., PLoS ONE, 10, 10.1371/journal.pone.0136628, (2016).

21. R. Milo, et al., Science, 298, 824—827, (2002).

22. V. C. Solutions, Visa USA Inc. (2004).

23. S. Sobolevsky, et al., PloS one, 11, 10.1371/journal.pone.0146291 (2016).

24. C. G. Nevill-Manning, I. H. Witten, J. Artif. Intell. Res. 7, 66–82 (1997).

25. V. Blondel, et al., J Stat. Mech-Theory, 10, P10008 10.1088/1742-

5468/2008/10/P10008 (2008)

26. C. L. Staudt, H. Meyerhenke, IEEE T. Parall. Distr., 27, 171--184 (2016).

13

27. M. EJ. Newman, Phys. Rev. E, 74, 10.1103/PhysRevE.74.036104, (2006).

28. P. Pons, M. Latapy, International Symposium on Computer and Information

Sciences, 284—293, (2005).

29. L. Pappalardo et al., IEEE International Conference on Big Data, 871—878,

(2015).

30. C. Hidalgo, N. Blumm, A.-L. Barabási, Plos Comput. Biol., 5,

10.1371/journal.pcbi.1000353 (2009).

31. L. Schuerman, S. Kobrin, JSTOR, 67—100, (1986).

32. A. Cavallo, Scraped data and sticky prices, NBER (2015).

33. Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in

empirical data. SIAM Review 51.

34. J. Alstott, E. Bullmore, P. Dietmar, PloS one 9 10.1371/journal.pone.0085777

(2014).

35. M. EJ. Newman, P. Natl. Acad. Sci. USA, 23, 8577--8582, (2006).

36. L. Danon, A. Diaz-Guilera, J. Duch, A. Arenas, J Stat. Mech-Theory E., 09,

P09008, (2005).

37. L.N.F. Ana, A. K. Jain, Computer Vision and Pattern Recognition, 2003.

Proceedings. 2003 IEEE Computer Society Conference on, 2, II—128, (2003).

38. W. M. Rand, J. Am. Stat. Assoc., 66, 846--850, (1971).

top related