Download - Big data computing overview
Agenda
§ What am I doing?
§ Big Data Computing History
– Supercomputer
– Parallel Computing
– Linux Cluster
– Big Data Computing
§ Google File System (GFS)
§ Hadoop Map and Reduce
§ Spark Stream Processing
§ References
1. Personal Cloud Repository Access
2. Personal Health Record Retrieval
3. Case based Reasoning (Similar Case Search)
4. Comparison among Similar Patients (for Health Planning, Prediction, Advice)
Healing Platform
[Figure: Healing Platform architecture. Mobile services (1..N) call the platform through Open/RESTful APIs with a 3-second response target. Medical data providers (1..N; 50M users × 17 records / 365 days ≈ 2M records/day), life-record providers (1..N; 50M users × 5 per day ≈ 300M records/day), and personal healing-record repositories (1..N; ~50M requests/day) request, transmit, and store data. A service analysis engine queries a targeted data / healing knowledge base (NoSQL DB) populated through standard conversion and conversion/filtering steps; stream computing (update management) maintains a high-speed computation DB and builds a data warehouse. A data relay loads public clinical-case big data and personal healing-record case big data into the raw big data store (HDFS), from which a knowledge-base construction engine (clustering, CBR, …) supports similar-case search, trend analysis, and planning.]
Architecture of HyperCube
John P. Hayes, “Architecture of a Hypercube Supercomputer,” International Conference on Parallel Processing, 1986. http://web.eecs.umich.edu/~tnm/trev_test/papersPDF/1986.08.Architecture%20Of%20A%20Hypercube%20Supercomputer_Conf_Parallel_Processing.pdf
Linux Cluster Specifications
§ 16 PCs
§ Each PC’s specification
– Pentium III
– 16MB RAM
– 20GB HDD
§ Myrinet (300Mbps)
Limitations of achieving this goal
§ Visible Human Project
– Data size: 40GB (~100GB)
§ Linux file system (ext2)
– 16GB per-file limit
– IDE bandwidth: 33MB/s (66MB/s)
– Ethernet bandwidth: 100Mbps (below 30Mbps in practice)
– RAM: not enough
§ Myrinet network interface
– Too difficult to use
– Kernel hooking required!!!
§ Programming model
– PVM or MPI
– Too slow & difficult!!!
Google File System (GFS, 2003)
Sanjay Ghemawat, “The Google File System,” SOSP 2003. http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
How about this example?
§ Count phone call logs?
– Each user’s total phone-call time
– KT’s case: 40TB / month
– No exceptions allowed
§ Oracle Database
– HW cost: ?
– SW cost: over 400,000,000 Korean Won
– Time cost: about 1 day
Solution?
§ Simple is best
– Log merge

for (int i = 0; i < max_log; i++)
    user[log[i].id].usage_time += log[i].usage_time;

But, too much time required!!!
Conclusion
§ Big Data Computing?
– Of course, it is needed!! But for us?
§ We did a lot.
– We need to broaden our perspective?
§ What’s next?
– Trends repeat themselves!!!
– Your major might come around again?
References
§ John P. Hayes, “Architecture of a Hypercube Supercomputer,” International Conference on Parallel Processing, 1986.
§ MPI code example, http://mpitutorial.com/tutorials/mpi-hello-world/
§ PVM code example, http://www.netlib.org/pvm3/book/node17.html
§ Sanjay Ghemawat, “The Google File System,” SOSP 2003.
§ Hadoop Code Example, http://azure.microsoft.com/en-us/documentation/articles/hdinsight-sample-wordcount/
§ Madhukara Phatak, Introduction to Apache Spark, http://blog.madhukaraphatak.com/introduction-to-spark/