Introduction to Big Data and Data Science
TRANSCRIPT
![Page 1: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/1.jpg)
Big data and data science
Concepts, technologies, examples
Startit
![Page 3: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/3.jpg)
Contents
• Introduction
• Distributed file system
• MapReduce
• Big data frameworks
• Data science
• External data sources
• References
![Page 4: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/4.jpg)
Big data
• Francis X. Diebold, Paul F. and Warren S. Miller Professor of Economics, School of Arts and Sciences, University of Pennsylvania:
– "...the necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times."
![Page 5: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/5.jpg)
Big data
• Three challenges:
– Volume: the amount of data
– Velocity: how fast data must be processed relative to how fast it is generated
– Variety: differences in the sources, format, quality, and structure of the data to be processed
![Page 6: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/6.jpg)
Big Data
![Page 8: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/8.jpg)
Motivation
• Some of the requirements to satisfy:
– Storing large files (several GB each)
– Fault tolerance
– Reads and writes from many clients
Use a supercomputer or a farm of cheap computers?
![Page 9: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/9.jpg)
Distributed file system
• A file system spread across a farm of cheap computers that together form a cluster
• Provides simple scalability, fault tolerance, and concurrent access for a large number of clients
• Executes the requested operation (write or read) quickly
![Page 10: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/10.jpg)
Distributed file system
• Consists of:
– A master node (master), which holds metadata about the other nodes: how files are split into parts (chunks) and where each chunk and its replicas are stored
– Subordinate nodes (chunkservers), which hold the file chunks and their versions
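The split between metadata on the master and chunk contents on the chunkservers can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class names, the 4-byte `CHUNK_SIZE`, the `REPLICAS` count, and the round-robin placement are hypothetical, and for brevity the master also moves the data, which a real system would leave to clients and chunkservers.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; tiny on purpose (hypothetical value for the demo)
REPLICAS = 2    # copies kept of each chunk (hypothetical value)


@dataclass
class ChunkServer:
    """Subordinate node: stores chunk contents keyed by chunk id."""
    name: str
    chunks: dict = field(default_factory=dict)


@dataclass
class Master:
    """Master node: keeps the metadata that maps files to chunks and servers."""
    servers: list
    file_chunks: dict = field(default_factory=dict)      # file name -> [chunk ids]
    chunk_locations: dict = field(default_factory=dict)  # chunk id -> [server names]

    def write(self, filename: str, data: bytes) -> None:
        # Simplification: the master pushes the bytes itself.
        ids = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk_id = f"{filename}#{offset // CHUNK_SIZE}"
            # Place replicas round-robin across the chunkservers.
            start = len(self.chunk_locations)
            targets = [self.servers[(start + r) % len(self.servers)]
                       for r in range(REPLICAS)]
            for server in targets:
                server.chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]
            self.chunk_locations[chunk_id] = [s.name for s in targets]
            ids.append(chunk_id)
        self.file_chunks[filename] = ids

    def read(self, filename: str, down: frozenset = frozenset()) -> bytes:
        by_name = {s.name: s for s in self.servers}
        parts = []
        for chunk_id in self.file_chunks[filename]:
            # Use the first replica whose server is still reachable.
            alive = [n for n in self.chunk_locations[chunk_id] if n not in down]
            parts.append(by_name[alive[0]].chunks[chunk_id])
        return b"".join(parts)


servers = [ChunkServer("cs0"), ChunkServer("cs1"), ChunkServer("cs2")]
master = Master(servers)
master.write("log.txt", b"hello world")
```

Calling `master.read("log.txt", down=frozenset({"cs1"}))` still returns the whole file, because every chunk has a replica on another server. That is the fault tolerance the slide refers to, in miniature.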
![Page 11: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/11.jpg)
Distributed file system
• Architecture of a distributed file system
![Page 12: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/12.jpg)
Distributed file system
• Writing to a distributed file system
![Page 14: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/14.jpg)
MapReduce
• A programming model whose goal is to process large amounts of data
– through a parallel, distributed algorithm that runs on a cluster
– relying on a distributed file system
• The MR programming model processes data in two steps
– the Map step and the Reduce step
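The two steps can be illustrated with the classic word-count example, run here in a single process rather than on a cluster. The function names are hypothetical, and the `shuffle` helper stands in for the grouping by key that a MapReduce framework performs between the Map and Reduce steps.

```python
from collections import defaultdict


def map_step(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between the steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key: str, values: list):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_step(doc))
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["big data", "data science", "big data frameworks"])
```

Because each call to `map_step` depends on one record only and each call to `reduce_step` on one key only, both steps can be spread across many machines, which is what makes the model suitable for a cluster.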
![Page 15: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/15.jpg)
MapReduce
![Page 17: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/17.jpg)
Big data frameworks
![Page 18: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/18.jpg)
Hadoop
• A framework responsible for storing and processing data on clusters of cheap hardware
• Based on the MapReduce programming model
• Various DSLs, such as Apache Pig and Hive, make it easier to write MapReduce programs on Hadoop
![Page 19: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/19.jpg)
Apache Spark
• In the MapReduce paradigm, data is held in memory only while a Map or Reduce step is being computed
• Apache Spark, by contrast, lets clients cache data or intermediate results
– This makes iterative algorithms easy and fast to run
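Why caching helps iterative algorithms can be shown with a toy model in plain Python. This is not the Spark API: the `Dataset` class, `load_points`, and `iterate_mean` are hypothetical names, and the example only counts how often the input has to be rebuilt when an algorithm makes several passes over it.

```python
class Dataset:
    """Lazily computed dataset; a toy stand-in, not the Spark API."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the data
        self._cached = None
        self._use_cache = False
        self.computations = 0     # how many times the data was rebuilt

    def cache(self):
        """Ask for the data to be kept in memory after the first use."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


def load_points():
    # Hypothetical expensive step, e.g. reading and parsing input files.
    return [1.0, 2.0, 3.0, 10.0]


def iterate_mean(dataset, steps=5):
    """Iterative algorithm: move an estimate toward the mean on every pass."""
    estimate = 0.0
    for _ in range(steps):
        points = dataset.collect()          # every pass needs the full data
        target = sum(points) / len(points)
        estimate += 0.5 * (target - estimate)
    return estimate


uncached = Dataset(load_points)
cached = Dataset(load_points).cache()
iterate_mean(uncached)
iterate_mean(cached)
```

After the two runs, `uncached.computations` is 5 while `cached.computations` is 1: both produce the same estimate, but the cached dataset is built only once.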
![Page 20: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/20.jpg)
Apache Storm
• A distributed system that processes data streams in real time
• Used for real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL
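The idea of continuous computation over a stream can be sketched with Python generators. This is a single-process illustration, not Storm code: the stage names and the sample messages are hypothetical, and a fixed list replays what would in practice be an unbounded stream.

```python
from collections import Counter
from typing import Iterable, Iterator


def event_source(events: Iterable[str]) -> Iterator[str]:
    """Stands in for an unbounded stream; here it replays a fixed list."""
    for event in events:
        yield event


def extract_hashtags(stream: Iterable[str]) -> Iterator[str]:
    """Processing stage: emit each hashtag as soon as its message arrives."""
    for message in stream:
        for token in message.split():
            if token.startswith("#"):
                yield token.lower()


def running_counts(stream: Iterable[str]) -> Iterator[dict]:
    """Continuous computation: emit an updated result after every item."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
        yield dict(counts)


messages = ["Learning #BigData", "#bigdata and #spark", "no tags here"]
snapshots = list(running_counts(extract_hashtags(event_source(messages))))
```

Unlike a batch job, the pipeline never waits for the input to end: each incoming message immediately yields a fresh snapshot of the counts.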
![Page 21: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/21.jpg)
Cloudera Distributed Hadoop (CDH)
![Page 23: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/23.jpg)
Data science
• An interdisciplinary field
– concerned with scientific methods, processes, and systems for extracting knowledge from data in various forms, both structured and unstructured
• Requires expertise in several areas
– Programming
– Mathematics
– Business processes
![Page 24: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/24.jpg)
Data science
• Hal Varian, Google's Chief Economist, NYT:
– "The next sexy job"
– "The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that's going to be a hugely important skill."
• Mike Driscoll, CEO of Metamarkets:
– "Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics."
– "Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible."
![Page 25: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/25.jpg)
Data science
![Page 26: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/26.jpg)
Data science
• Structuring data (data jujitsu)
– Collecting, scraping, parsing, cleaning, integrating, restructuring, persisting, filtering, deleting, combining, merging, validating, loading, and shaping data
• Analyzing data
– Data mining, traditional statistics
• Visualizing data
– Using charts
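The three stages can be walked through on a tiny example using only the Python standard library. The input values and function names are hypothetical, and a text bar chart stands in for a real plotting library.

```python
import statistics

# Hypothetical raw input: response times in ms, as scraped strings.
raw = ["120", " 95 ", "n/a", "110", "", "300", "105"]


def structure(values):
    """Structuring: trim whitespace, drop unparseable entries, convert to int."""
    cleaned = []
    for value in values:
        text = value.strip()
        if text.isdigit():
            cleaned.append(int(text))
    return cleaned


def analyze(numbers):
    """Analysis: basic descriptive statistics."""
    return {
        "count": len(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
    }


def visualize(numbers, unit=20):
    """Visualization: one text bar per value, one '#' per `unit`."""
    return [f"{n:>4} {'#' * (n // unit)}" for n in numbers]


data = structure(raw)
summary = analyze(data)
chart = visualize(data)
```

Printing `chart` line by line makes the outlier (300) stand out at a glance, and the gap between the mean and the median in `summary` points to the same value.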
![Page 27: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/27.jpg)
Data science in practice
• Applications:
– Public opinion research
– Market competitiveness analysis
– Business performance analysis
– ...
• Answering any question that can be answered from publicly available data
![Page 28: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/28.jpg)
Data science in practice
![Page 29: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/29.jpg)
Data science in practice
![Page 31: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/31.jpg)
External data sources
• Twitter API
– Provides a continuous stream of a portion of Twitter's data
• Facebook Graph API
– Provides access to the part of the Facebook graph that belongs to the client and their friends
• Web crawlers
– Scrapy, Apache Nutch
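The core step of a web crawler, extracting the links on a page so they can be visited next, can be sketched with the Python standard library alone. This is not Scrapy or Nutch code: the `LinkExtractor` class, the sample HTML, and the example URLs are hypothetical, and no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and address, used only for the demo.
page = '<p><a href="/docs">Docs</a> <a href="https://example.org/x">X</a></p>'
links = extract_links(page, "https://example.com/start")
```

A full crawler would repeat this step for every extracted link while keeping track of pages already visited; frameworks such as Scrapy and Apache Nutch handle that scheduling, along with fetching and politeness rules.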
![Page 33: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/33.jpg)
References
– Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System
– Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
– Roger D. Peng. R Programming for Data Science
– https://bigdatacoursespring2015.appspot.com/preview
– http://cloudera.com/
– http://www.cloudera.com/downloads/quickstart_vms/5-7.html
– https://hadoop.apache.org/
– https://spark.apache.org/
– https://storm.apache.org/
– https://dev.twitter.com/overview/api
– https://developers.facebook.com/docs/graph-api
– http://scrapy.org/doc/
– http://nutch.apache.org/
![Page 34: Uvod u Big Data i nauku o podacima](https://reader035.vdocuments.net/reader035/viewer/2022062412/589a9b7b1a28abfc1a8b478d/html5/thumbnails/34.jpg)
Questions and comments?