Social Network Analysis using Cloud Computing Services

Upload: dongwoo-lee

Post on 11-May-2015

TRANSCRIPT

1. PlatformDay 2009. SNS Analysis using Cloud Computing Services: DHT-based Key-Value Storage and MapReduce-based Analysis. DongWoo [email protected], Oiko Laboratory. SocialFlow, OikoLab. 2 CloudKR.

2. Agenda: Introduction; Social Network Service; Motivation: Visualization, Social Network Analysis; SocialFlow; Scale-Out Technologies: Cloud Computing; SNS Analysis Architecture based on Cloud; Overall Process; Crawling; DHT Storage (CouchDB); MapReduce Pair-Wise Similarity; Cloud Computing Service; Amazon Web Service: EC2 / S3 / Elastic MapReduce; Tips; References.
3. Introduction: Social Network; Cloud Computing; Mobile Device.
4. Social Network Service: Social Applications = Social Networks. A social network is a collection of people bound together through a specific set of social relations. A collection of people is a social network if and only if it is possible for something to spread virally through that collection.
5. Social Network Services: Twitter, Facebook.
6. Social Applications.
7. Social Networks: http://www.vincos.it/world-map-of-social-networks/
8. Social Network Analysis: Social Graph Analysis; Visualization; Person-to-Person Relationship; Temporal Mind Mining (Content Clustering); Post-Mortem Log Processing.
9. Social Network Analysis: Visualization.
10. Social Network Analysis: Visualization.
11. Social Network Analysis: Visualization.
12. Social Network Analysis: Visualization.
13. SocialFlow: Thoughts, feelings, interests, relationships and information of SNS. Real-time massive social data streams. Difficult to follow the social streams. Need a way to get a summary or clustered information based on common interests.
14. SocialFlow: Getting common flows of people through content similarities. Reflecting short-term interests of people. Extracting hot issues. Revealing relationships among in/out resources. Implementing scale-out technologies. Evolving toward a recommendation system based on collective intelligence.
15. Scale-Out Technologies: Cloud Computing.
16. Why Cloud Computing? SPOF (Single Point of Failure); cluster administration (who does this?);
Initial infrastructure investment (risk management); focus on the main thing (intelligence); enable highly scalable services. Source: New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009, http://tinyurl.com/nacgu7
17. Cloud Computing: e.g. Storage Failure.
18. SNS Analysis Architecture based on Cloud.
19. Experimental Project: Python / Django / Boto; ML / Data Mining; DHT / CouchDB; Cloud / AWS S3, EC2, Hadoop MapReduce.
20. Workflow. Diagram labels: SNS, Crawler, MapReduce, Post-Processing, CDN, User, In-house Cluster, Cloud Service, Local DataCenter.
21. Technologies: Before. Diagram labels: Crawlers; Consistent DHT (hash_ring); MapReduce (CouchJS); Key-Value Storage (CouchDB); Machine Learning (home-made).
22. Technologies: After. Diagram labels: Crawlers; Storage (S3); Consistent DHT (hash_ring); MapReduce (EC2, Hadoop); Key-Value Storage (CouchDB); Machine Learning (home-made).
23. Crawling: Fetching recent postings of SNS. Storing fetched postings to CouchDB storage through the DHT layer (which selects a server). Pushing raw data into the cloud to process them with MapReduce. Diagram labels: Crawler, DB, [ term, doc ], Index, Indexer, File, Mapper, DHT, Replication.
24. Consistent DHT (Distributed Hash Table): Uniform key distribution and load balancing with a good hash function. Minimizing the effects of a storage crash or temporary downtime. High availability with a replication scheme. Diagram: ring positions 0, 1, 2, ..., N-1 with Node k-1, Node k, Node k+1, ..., Node N-1; Replicas; Replicate(k, k-1, k+1). Notice: A real node has non-linear portions of the total keyspace.
25. Consistent DHT (Distributed Hash Table). Diagram labels: Admin; Anonymous Traffic; User Traffic; View, Admin View, User View; Generated Contents; SNS Crawler; SNS Analysis; AWS S3 (html, image); DHT Front End; Memory Cache; DHT ring (Node k-1, Node k, Node k+1, ..., Node N-1).
26.
Consistent DHT: Replication (replica = 2). Diagram: nodes A, B, C, D and their replicas.
27. CouchDB (Key-Value Storage): Erlang-based key-value storage; storage engine (MVCC, B-tree); RESTful API; server-side JavaScript engine (MapReduce); view engine; Futon web UI.
28. CouchDB: Server-side JavaScript. Purpose: local computations on local data sets. Features: Mozilla's SpiderMonkey; MapReduce framework with JavaScript; forks an external process (couchjs). Performance enhancements expected: Google's V8 (Chrome's JavaScript engine / JIT). http://tinyurl.com/m76sx3
29. CouchDB: MapReduce. doc = (d1, d2, fq); dx: { di }
30. Map & Reduce: Pair-Wise Similarity. Pipeline stages: DB --> Indexer (Mapper) --> Index File [ term, doc ] --> Grouper (Reducer) --> DocGroup File [ term, { docs } ] --> Combinator (Mapper) --> Doc Candidate File [ doc1, doc2 ] --> DocPair Counter (Reducer) --> Result File [ freq, doc1, doc2 ]. Indexer and Grouper for processing Korean. No NLP and no structural analysis. Produces a pairwise similarity between two postings.
31. Map & Reduce: Optimization. Concerns: consider key group size distribution; data load balancing; barrier point. Sample data: two months of postings from my friends; reachable graph: 4,060 people; total postings: 206,115.
32. Pair-Wise Similarity and its TreeMap. Postings: 110,008; users: 2,691; score >= 6.
33. Pair-Wise Similarity and its Cluster: One issue and different opinions among people.
34. Pair-Wise Similarity and its Cluster: Common interest / hot issue.
35. Pair-Wise Similarity and its Cluster: One person and the similar contents pattern (specialty).
36. Pair-Wise Similarity and its Cluster: Similar structure of sentences (trendy, parody).
37. Deployment: EC2; S3/CloudFront; Flickr; www.
38. Cloud Computing Service.
39.
Before the Cloud Age. Smart shell guru's daily work: parallel sort.
$ wc -l data
$ split -l 1000k data
(copy the chunks to the worker machines with scp or NFS)
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted
(collect the sorted chunks with scp or NFS)
$ sort -rm data*.sorted > data.sorted
Complexity: need to prepare/maintain physical machines and resources; need to monitor job progress (wait and see job status); need to cope with machine failure (slave nodes / storage / networks); need to schedule multiple jobs.
40. Amazon Web Service: Overview. Diagram labels: EC2 (Elastic Compute Cloud) instances; SQS (Simple Queue Service) messages; Auto Scaling; CloudWatch monitoring; Elastic Load Balancing; EBS (Elastic Block Store), 1 GB to 1 TB, mounted on EC2; AMI (Machine Image); S3 (Simple Storage Service): buckets, objects, permissions, header; clients over API / HTTP; Import/Export (eSATA/USB, offline); SimpleDB (key-value); CloudFront (edges); admin access through Mgmt Console, EC2 CLI, SSH; Access Key ID, Secret Access Key, Key Pair; Elastic MapReduce (instant EC2 Hadoop cluster), HTTP clients.
41. Amazon Web Service: Amazon Management Console.
42. AWS: AMI (Amazon Machine Image).
43. AWS: Paid AMI / The Cloud Market. AMI (Amazon Machine Image); Paid AMI.
44.
AWS: How to make an AMI (1)
Loopback file
# dd if=/dev/zero of=new_image.fs bs=1M count=1024
Make ext3 file system
# mke2fs -F -j new_image.fs
# mkdir /mnt/ec2-fs
# mount -o loop new_image.fs /mnt/ec2-fs
# mkdir /mnt/ec2-fs/dev
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
# mkdir /mnt/ec2-fs/etc
Create /mnt/ec2-fs/etc/fstab (add /dev/sda1 --> /, /dev/pts, /dev/shm, /proc, /sys)
Create yum-xen.conf
# mkdir /mnt/ec2-fs/proc
# mount -t proc none /mnt/ec2-fs/proc
# yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
Edit /mnt/ec2-fs/etc/sysconfig/network
Edit /mnt/ec2-fs/etc/fstab (add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
# chroot /mnt/ec2-fs /bin/sh
Edit services
45. AWS: How to make an AMI (2)
Building an AMI
# yum install ruby
# rpm -i ec2-ami-tools-noarch.rpm (download from public S3 bucket)
# ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
Local machine root file system
# ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
Upload to S3
# ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id
Register AMI
# ec2-register my-bucket/image.manifest
IMAGE ami-xxxx
Testing
# ec2-describe-images ami-xxxx
Deregister AMI
# ec2-deregister ami-xxxx
Running AMI
# ec2-run-instances ami-xxxx -n 1
http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
46. AWS: EC2 Running Instance: AWS Management Console.
47. AWS: EC2 Running Instance.
48. Amazon Web Service: Access Methods. Access Key ID / Secret Access Key / Key Pairs. Amazon Management Console; EC2 API (WSDL) / EC2 CLI (Command Line Interface); SSH; Firefox extensions: S3 Firefox Organizer, Elasticfox. S3 DNS: s3 CNAME s3.amazonaws.com. e.g.) Bucket name: s3.xyz.com; http://s3.xyz.com ---> S3's s3.xyz.com. Tools: s3cmd (python); s3cmd.rb / s3sync.rb (ruby); S3Hub (Mac).
49.
Amazon Web Service: Elasticfox. Firefox's extension: Elasticfox.
50. Amazon Web Service: Elasticfox. Key Pairs; private key; SSH.
51. Amazon Web Service: Elasticfox. Security Groups; open network ports.
52. AWS: Elastic MapReduce. EC2 + Hadoop. Tools: Management Console; elastic-mapreduce CLI. Preparation: code --> S3; data --> S3; log folder; output folder. Job flow: streaming; custom JAR; sample applications.
53. AWS: Elastic MapReduce.
54. AWS: Elastic MapReduce: Web UI.
55. AWS: Elastic MapReduce: CLI for Workflow. jobflow #id: input/* --> Step 1 --> output1/part-000** --> Step 2 --> output2/part-000** --> Step 3 --> output3/part-000**
56. AWS: Elastic MapReduce. Failed tasks will be rescheduled on other Hadoop slaves. If a task is finished, the same instance will be killed by a tracker.
57. AWS: Elastic MapReduce.
58. AWS: SocialFlow Automation. Diagram labels: Home; IDC; Amazon; Wild World; Local; Global; Results; Admin; DHT; S3; Users; Read/Write; Read Only; Renderer; boto (python); Launching EC2 pool.
59. AWS: EC2, EMR Price Model. Columns: service, type, price per instance hour, cost for 1 week (7 days) in USD, cost for 1 week (7 days) in KRW.
EC2, On-Demand: $0.10 (S), $16.8, KRW 20,865; $0.40 (L), $67.2, KRW 83,462; $0.80 (E), $134.4, KRW 166,924.
EC2, Reserved (1yr $325, 3yr $500): $0.03 (S), $5.04, KRW 6,259; $0.12 (L), $20.16, KRW 25,038; $0.24 (E), $40.32, KRW 50,077.
Elastic MapReduce, On-Demand: $0.10 + $0.015 (S), $19.32, KRW 23,995; $0.40 + $0.06 (L), $77.28, KRW 95,981; $0.80 + $0.12 (E), $154.56, KRW 191,963.
(S) = Small, (L) = Large, (E) = Extra Large. 1 USD = 1242 KRW.
60. AWS: Performance. http://tinyurl.com/qj6ao7
61. AWS: Performance.
62. AWS: Performance. http://tinyurl.com/p9jsyz
63. AWS: Performance. http://tinyurl.com/cqqxgl
64.
10 Cent Tips. AWS EC2: Minimize set-up time with prepared shell scripts. Use Boto for automating deployments. Use S3 (free of charge between S3 and EC2 in the same region); $0.030 per GB through June 30, 2009 ($0.1 per GB normal price). AWS Elastic MapReduce: Enable the SSH port (22) and the Hadoop-related ports (9100, 9101). Access to the master node: ssh -i keypair hadoop@public_dns_name. Double check (PATH, etc.). Debug, debug, debug. Use EC2 for Hadoop (e.g. Cloudera's Hadoop AMI) (no extra cost for Hadoop!).
65. 10 Cent Tips. AWS S3: Set HTTP headers for images and static resources: Cache-Control: max-age=31536000. Block search bots with a robots.txt at the root of a bucket: "User-agent: *", "Disallow: /". Use BitTorrent for large files: http://s3.xyz.com/xfile.zip?torrent. Compress rendered HTML with gzip: Content-Encoding: gzip.
$ s3cmd put index.html s3://s3.xyz.com/www --mime-type "text/html" --add-header "Content-Encoding: gzip" --acl-public
66. Amazon Web Service: Limitations.
67. References.
10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
Dan Pritchett (eBay), BASE: An Acid Alternative, ACM Queue, May/June 2008, p. 48-55
Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS 08
Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
Matei Zaharia et al., Improving MapReduce Performance in Heterogeneous Environments, OSDI 08
Following Twitter: http://twitter.com/AmazonEC2, http://twitter.com/AmazonS3
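
Slides 24 to 26 describe a consistent DHT that spreads keys evenly over storage nodes and keeps replicas so that a crashed node has limited effect. The following is a minimal sketch of that idea, not the deck's hash_ring code. The class name `HashRing`, the MD5 hash, the virtual-node count and the `copies` parameter are all assumptions made for illustration, and replicas are placed on the next distinct nodes clockwise rather than on the k-1 / k+1 neighbours shown on slide 24.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64, copies=3):
        self.nodes = sorted(set(nodes))
        # Total number of nodes holding each key (primary plus replicas).
        self.copies = min(copies, len(self.nodes))
        ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in self.nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in ring]
        self._owners = [node for _, node in ring]

    def nodes_for(self, key: str):
        """Return the primary node followed by the replica nodes for key."""
        start = bisect.bisect_right(self._points, ring_position(key))
        chosen = []
        # Walk clockwise, collecting distinct physical nodes.
        for step in range(len(self._owners)):
            node = self._owners[(start + step) % len(self._owners)]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.copies:
                break
        return chosen


ring = HashRing(["A", "B", "C", "D"])
smaller = HashRing(["A", "B", "C"])  # the same ring after node D crashes
keys = [f"posting:{n}" for n in range(1000)]
# Count keys that changed primary even though D was not their primary.
moved = sum(
    1
    for key in keys
    if ring.nodes_for(key)[0] != "D"
    and ring.nodes_for(key)[0] != smaller.nodes_for(key)[0]
)
print("keys moved although D was not their primary:", moved)
```

Only keys whose primary was the lost node need a new home, which is the "minimizing the effects of a storage crash" property from slide 24.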
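
Slide 30 chains two map/reduce rounds: index postings by term, group postings per term, emit every pair of postings that share a term, then count how often each pair occurs. The sketch below runs those four stages in memory on a toy data set. The function names are hypothetical, and whitespace tokenization is an assumption that stands in for the deck's Korean-aware indexer, which the slides do not show.

```python
from collections import defaultdict
from itertools import combinations


def index_map(doc_id, text):
    """Indexer (mapper): emit [term, doc] once per distinct term."""
    for term in set(text.lower().split()):
        yield term, doc_id


def group_reduce(pairs):
    """Grouper (reducer): collect [term, doc] into [term, {docs}]."""
    groups = defaultdict(set)
    for term, doc_id in pairs:
        groups[term].add(doc_id)
    return groups


def candidate_map(groups):
    """Combinator (mapper): emit [doc1, doc2] for postings sharing a term."""
    for docs in groups.values():
        for doc1, doc2 in combinations(sorted(docs), 2):
            yield doc1, doc2


def count_reduce(candidates):
    """DocPair counter (reducer): produce [freq, doc1, doc2], highest first."""
    counts = defaultdict(int)
    for pair in candidates:
        counts[pair] += 1
    return sorted(
        ((freq, d1, d2) for (d1, d2), freq in counts.items()), reverse=True
    )


def pairwise_similarity(postings):
    pairs = (
        kv for doc_id, text in postings.items() for kv in index_map(doc_id, text)
    )
    return count_reduce(candidate_map(group_reduce(pairs)))


postings = {
    "p1": "cloud computing on ec2",
    "p2": "hadoop on ec2 cloud",
    "p3": "coffee break",
}
result = pairwise_similarity(postings)
print(result)  # [(3, 'p1', 'p2')]
```

Postings p1 and p2 share three terms, so their score is 3. A threshold such as the "score >= 6" on slide 32 would be applied to this output.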
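
The weekly figures in the price table on slide 59 follow from the hourly price times 168 hours, converted at the slide's rate of 1 USD = 1242 KRW. A small check is sketched below. The function name `weekly_cost` is hypothetical, and dropping the fractional won is an assumption inferred from the slide's numbers.

```python
from decimal import Decimal

HOURS_PER_WEEK = 7 * 24  # 168 hours
KRW_PER_USD = 1242       # exchange rate stated on slide 59


def weekly_cost(hourly_usd: str, emr_fee_usd: str = "0"):
    """Return (USD per week, KRW per week) for one instance running nonstop."""
    hourly = Decimal(hourly_usd) + Decimal(emr_fee_usd)
    usd = hourly * HOURS_PER_WEEK
    krw = int(usd * KRW_PER_USD)  # fractional won dropped (assumption)
    return usd, krw


ec2_small = weekly_cost("0.10")
emr_small = weekly_cost("0.10", "0.015")
print(ec2_small)  # (Decimal('16.80'), 20865)
print(emr_small)  # (Decimal('19.320'), 23995)
```

The Elastic MapReduce rows add the per-hour EMR fee on top of the EC2 price, which is why a small instance costs $19.32 rather than $16.8 per week.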
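
The shell workflow on slide 39 is a split, sort and merge: cut the data into chunks, sort each chunk in descending order on a worker, then merge the sorted chunks. The sketch below mirrors those three steps in one process. The function names are hypothetical, and the chunks are sorted one after another here, whereas the slide hands them to separate machines.

```python
import heapq


def split_lines(lines, chunk_size):
    """Like `split -l`: cut the input into chunks of chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def sort_chunk(chunk):
    """Like `sort -r` on one worker: sort a single chunk in descending order."""
    return sorted(chunk, reverse=True)


def merge_sorted(chunks):
    """Like `sort -rm`: merge chunks that are already sorted descending."""
    return list(heapq.merge(*chunks, reverse=True))


data = ["pear", "apple", "fig", "kiwi", "plum", "lime", "date"]
chunks = split_lines(data, chunk_size=3)
merged = merge_sorted([sort_chunk(chunk) for chunk in chunks])
print(merged)
```

The merge step reads each sorted chunk only once, so the expensive sorting can be spread over as many workers as there are chunks.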