THE NEW POWER OF DATA: Collection, Integration, and Analytics
Wenny Rahayu
Professor in Computer Science
Head, School of Engineering and Mathematical Sciences
La Trobe University, Melbourne Australia
Where is La Trobe?
35,000 students
3,200 staff
Moving from Databases to Data Container
“Every day, 2.5 quintillion bytes of data are created and 90% of the data in the world today was created within the past two years.”
IBM Corporation
…
10^15 = quadrillion (petabytes)
10^18 = quintillion (exabytes)
10^21 = sextillion (zettabytes)
“Worldwide information is more than doubling every two years, with 1.8 zettabytes or 1.8 trillion gigabytes projected to be created and replicated this year alone.”
ZDNet news
VOLUME
http://archive.tiecon.org/content/big-data-landscape-why-should-you-care
Means for Data Collection
We are not quite sure what the exact definition of a Data Scientist is, but if you deal with something generally related to converting data into useful insight then you will hopefully benefit from joining the group.
Whether you’re in business, academia, or government, and whether you’re an analyst, data miner, programmer, student, electrical engineer, computer scientist, physicist, etc, and you work with data to generate insights, build predictive models, build optimisation models, build reports/dashboards/visualisations, automate analyses, etc, using python, R, SQL, C/C++, Java, Tableau, Excel, Hadoop, etc, and you care about doing it right, efficiently, repetitively, optimally, visually, etc, then join us!
Source: http://www.meetup.com/Data-Science-Melbourne/
Multi-Disciplinary
New ways of developing drugs – Novartis New Drug Research
Novartis Institute for Biomedical Research (NIBR) in Cambridge, Mass.
• A new breed of “data scientist” is working to re-invent the traditional drug research team. Instead of biologists, chemists and clinicians working in silos, pharmaceutical companies such as Novartis are assembling collaborative, cross-disciplinary teams.
• These teams include data scientists, drawing on their expertise in computer science and statistics to sift through information and attempt to extract answers to pressing questions. They collaborate with biologists and clinicians to develop a clear hypothesis and then put it to the test.
• https://www.novartis.com/stories/discovery/surfing-wave-big-data-analytics
Data Inspires New Scientific Innovation
Smart Sensor Solution and Real Time Data Analytics
[Diagram: smart sensor solution]
Pasture: sensors on lambs and ewes record the behaviour, activity and relationships of animals (proximity approach).
Smart sensor: activity sensors (accelerometers, gyroscope, magnetometer, temperature); low-power processing and storage; battery powered with power management unit; RF communication.
Options for sensor data download: base station, handheld reader, computer.
Database: sensor data will be saved securely in a database system for post-analysis.
User interface: administration/configuration, data visualization, reporting system, alert system. Analysed sensor data reports will be accessible through a web-based user interface.
Multidisciplinary work between IT, Engineering, the Centre of Technology Infusion, and the Agricultural Department.
The project will produce low-cost, long-life sensors for use with farm animals to monitor motion, proximity and true location.
Sensor data and real-time data analytics will provide actionable information to farmers on parentage, health, oestrus, grazing, etc.
Big Data - the bottleneck issue
Homogeneous, standard enterprise data:
• Gathering & preparing data (70~80%)
• Analyzing data (20~30%)
Heterogeneous, Big Data:
• Gathering & preparing data (95%)
• Analyzing data (5%)
* Reference from Prof. Timos Sellis – Data Ecosystem - From Very Large Databases to Big Data Infrastructure, La Trobe, November 2015
Also known as data fusion, data blending, data mapping, data acquisition, etc…
Informal description by Roderick et al., http://www.odbms.org/2015/11/what-is-data-blending/:
“… the answer is not always written on the same book as the question. Thus, we must learn to decipher it from multiple books. Some of them are in a foreign language, some are hundreds of times thicker than others, and most of them are by different authors who have never agreed on a literary style. And there is no catalogue.”
Data Integration
Data Integration
The need to deal with large data size and different complexity of data formats/structures
Integration can be achieved through:
• Standardization of data representations
• Global semantic representation: ontology or schema mediator
• “Loose coupling” integration: data virtualization, data container
Standardization of data representations
XML as the common ‘language’ of representation
• XBRL (eXtensible Business Reporting Language)
• BSML (Bioinformatics Sequence Markup Language)
• HL7 (Health Level Seven)
• FIX (Financial Information eXchange)
• AIXM (Aeronautical Information eXchange Model)
• GML (Geography Markup Language)
• MathML (Mathematical Markup Language)
• gbXML (Green Building eXtensible Markup Language)
• And so on…
Example of Standardization
Snapshot of a standard XML representation in Aviation - AIXM
<AirportHeliport ..
  <timeSlice>
    <AirportHeliportTimeSlice gml:id="AHa1">
      <gml:validTime>
        <gml:TimePeriod gml:id="AHb1">
          <gml:beginPosition>7/8/2004 0:0:0</gml:beginPosition>
          <gml:endPosition>12/31/8888 0:0:0</gml:endPosition>
        </gml:TimePeriod>
      </gml:validTime>
      <interpretation>BASELINE</interpretation>
      <designator>NFFN</designator>
      <name>NADI</name>
      <type>AD</type>
      <magneticVariation>12.24</magneticVariation>
      <ARP>
        <ElevatedPoint gml:id="AHc1">
          <gml:coordinates decimal="." cs="," ts=" ">177.443333333333,-17.7563888888889</gml:coordinates>
          <elevation uom="FT">59</elevation>
        </ElevatedPoint>
      </ARP>
      ………
    </AirportHeliportTimeSlice>
  </timeSlice>
</AirportHeliport>
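A standardized representation pays off when a consumer can read the fields without knowing the producing system. The following is a minimal sketch, assuming a simplified, well-formed stand-in for the elided snapshot above. The `gml` namespace URI, the `read_airport` helper and the longitude-then-latitude reading of the coordinate pair are assumptions made for illustration, not part of the slide or of the AIXM schema as shown.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "gml" prefix; the slide does not show the declaration.
GML_NS = "http://www.opengis.net/gml"

# Simplified, well-formed stand-in for the elided AIXM snapshot.
SAMPLE = f"""<AirportHeliport xmlns:gml="{GML_NS}">
  <timeSlice>
    <AirportHeliportTimeSlice gml:id="AHa1">
      <interpretation>BASELINE</interpretation>
      <designator>NFFN</designator>
      <name>NADI</name>
      <type>AD</type>
      <ARP>
        <ElevatedPoint gml:id="AHc1">
          <gml:coordinates decimal="." cs="," ts=" ">177.443333333333,-17.7563888888889</gml:coordinates>
          <elevation uom="FT">59</elevation>
        </ElevatedPoint>
      </ARP>
    </AirportHeliportTimeSlice>
  </timeSlice>
</AirportHeliport>"""


def read_airport(xml_text):
    """Extract a few reference fields from one AirportHeliport time slice."""
    root = ET.fromstring(xml_text)
    time_slice = root.find("timeSlice/AirportHeliportTimeSlice")
    point = time_slice.find("ARP/ElevatedPoint")
    coords = point.find("gml:coordinates", {"gml": GML_NS})
    # The cs attribute names the separator between the two values. Reading them as
    # (longitude, latitude) is inferred from the sample values, not stated on the slide.
    lon, lat = (float(v) for v in coords.text.strip().split(coords.get("cs", ",")))
    elevation = point.find("elevation")
    return {
        "designator": time_slice.findtext("designator"),
        "name": time_slice.findtext("name"),
        "lon": lon,
        "lat": lat,
        "elevation": float(elevation.text),
        "elevation_uom": elevation.get("uom"),
    }


airport = read_airport(SAMPLE)
print(airport["designator"], airport["name"], airport["lat"], airport["lon"])
```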
Integration of standard XML representation in Aviation : AIXM, WXXM, FIXM, etc.
[Diagram labels]
LAYER 1: ADMS (Oracle) and AIXM 5.0 (Oracle) databases; automated mapping specification; ADMS mapping and migration to the new AIXM 5 database
Transformation to produce flat XML documents for EFB, charting, publication, … visualisation tool
LAYER 3: external service providers: AIXM document; WXXM weather data; FIXM, NOTAM XML data; ??? future XML data
The International Standard Body OGC - XML Standard in Aviation Domain
AIXM = Aeronautical Information Exchange Model
WXXM = Weather Information Exchange Model
FIXM = Flight Information Exchange Model
Source: OGC – www.opengeospatial.org
The layering design approach enables the integration of AIXM data with other Aeronautical XML based data
Aeronautical Reference Data NOTAM Airport Spatial Data Dynamic-Temporal Messaging
WEATHER Data
A0312/08 NOTAMN
Q) LKAA/QFAXX/IV/BO/A/000/999/5006N01415E005
A) LKPR B) 0803230000 C) PERM
E) NEW POSTAL ADDRESS OF LKPR AD: KE KRALOVSKEMU LETISTI 6/1019 160 08 PRAHA 6 RUYZNE.

yyyy  mm  tmax (degC)  tmin (degC)  af (days)  rain (mm)  sun (hours)
2008   1      5.0         -1.4         21         ---        29.7
2008   2      7.3          1.9          8         ---        71.9
2008   3      6.2          0.3         13         ---       101.4
2008   4      8.6          2.1          5         ---       128.6
2008   5     15.8          7.7          0         ---       180.4

NBA5683GG YSCBNOCX YUZZNCLX012322 YBBBZEZX
C0120/10 NOTAMR C0119/10
Q) YBBB/QXXXX/IV/NBO/A/000/999/1653S14545E
A) YBCS B) 1003012322 C) 1003050930 EST
D) DAILY 0800/0930 1800/2100
E) INCREASED FLYING FOX ACTIVITY
Data Integration
The layering design approach enables the integration of AIXM data with other Aeronautical XML based data
Data Integration
Data type | Spatial | Temporal
NOTAM message | location coordinates, area coordinates | valid start and end dates, duration
Aviation reference data | location coordinates, area coordinates, shape | valid start and end dates, permanent or temporary
Weather data | location coordinates, area coordinates, temperature, pressure | valid start and end dates, duration

Table 1. Aviation data to be integrated, with temporal and spatial integration points
X1234/09 NOTAM
Q) YMML/QMRXX/IV/NBO/A/00/999/3767S14484E002/
A) YMML B) 07068:0:0 C) 070610:0:0 EST
E) RWY 16/34 CONDITIONAL DUE TO RESURFACING

  <AIRPORT_HELIPORT num="2">
    <AH_UUID>16468</AH_UUID>
    <NAME>MELBOURNE</NAME>
    <DESIGNATOR>YMML</DESIGNATOR>
    <RUNWAY_FULL_CODE>16/34</RUNWAY_FULL_CODE>
    <RUN_DIR_VID>11781</RUN_DIR_VID>
    <AH_USG_LIM_CODE>CONDITIONAL</AH_USG_LIM_CODE>
    <AH_WARN_DESCR>Resurfacing</AH_WARN_DESCR>
  </AIRPORT_HELIPORT>
</ALL_AIRPORT_HELIPORT>
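The NOTAM and the reference record above share integration points: the location designator, the runway code, and the validity period. The sketch below shows one way such a link could be made. The helper names (`parse_notam_items`, `link_notam_to_airport`) and the trimmed reference XML are assumptions for illustration; the date strings are kept as raw text because their format on the slide is not spelled out.

```python
import re
import xml.etree.ElementTree as ET

NOTAM = """X1234/09 NOTAM
Q) YMML/QMRXX/IV/NBO/A/00/999/3767S14484E002/
A) YMML B) 07068:0:0 C) 070610:0:0 EST
E) RWY 16/34 CONDITIONAL DUE TO RESURFACING"""

# Trimmed copy of the reference record, wrapped so that it is well-formed.
REFERENCE_XML = """<ALL_AIRPORT_HELIPORT>
  <AIRPORT_HELIPORT num="2">
    <AH_UUID>16468</AH_UUID>
    <NAME>MELBOURNE</NAME>
    <DESIGNATOR>YMML</DESIGNATOR>
    <RUNWAY_FULL_CODE>16/34</RUNWAY_FULL_CODE>
  </AIRPORT_HELIPORT>
</ALL_AIRPORT_HELIPORT>"""


def parse_notam_items(text):
    """Split a NOTAM into its lettered items, e.g. {'A': 'YMML', 'B': ...}."""
    # An item label is a letter A-G or Q followed by ')' at line start or after whitespace.
    parts = re.split(r"(?:^|\s)([A-GQ])\)\s*", text, flags=re.M)
    # parts[0] is the header line; the rest alternates label, value, label, value.
    return dict(zip(parts[1::2], (value.strip() for value in parts[2::2])))


def link_notam_to_airport(notam_text, reference_xml):
    """Attach a NOTAM to the reference record with the same location designator."""
    items = parse_notam_items(notam_text)
    root = ET.fromstring(reference_xml)
    for airport in root.iter("AIRPORT_HELIPORT"):
        if airport.findtext("DESIGNATOR") == items["A"]:
            runway = airport.findtext("RUNWAY_FULL_CODE")
            return {
                "airport": airport.findtext("NAME"),
                "designator": items["A"],
                "runway_mentioned": runway in items["E"],  # spatial/feature link
                "spatial": items["Q"].split("/")[7],      # raw coordinate and radius token
                "valid_from": items["B"],                 # temporal integration points
                "valid_to": items["C"],
                "text": items["E"],
            }
    return None


linked = link_notam_to_airport(NOTAM, REFERENCE_XML)
print(linked)
```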
• Global semantic representation: ontology or schema mediator
The Ontology
Ontology Definition
• O = (C, H, R, P, I, A), where
• C = a set of entities in the ontology (classes and instances)
• H = a set of taxonomic relationships between concepts
• R = a set of non-taxonomic relationships between ontology concepts
• P = a set of properties of ontology concept entities, each connecting a class property to a datatype
• I = a set of ontology instance declarations (the relationships of an instance with its class, its properties and values, and other instances)
• A = a set of axioms and rules that allow consistency checking of an ontology and the inference of new knowledge through some inferencing mechanism
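The six-part definition can be sketched as a small data structure. This is an illustration only, assuming simple tuples for each component; the field names, the `ancestors` and `classes_of` helpers and the single axiom are hypothetical and do not come from the cited work.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)    # C: classes and instances
    taxonomy: set = field(default_factory=set)    # H: (child, parent) pairs
    relations: set = field(default_factory=set)   # R: (subject, relation, object)
    properties: set = field(default_factory=set)  # P: (class, property, datatype)
    instances: set = field(default_factory=set)   # I: (instance, class) declarations
    axioms: list = field(default_factory=list)    # A: checks returning True when satisfied

    def ancestors(self, concept):
        """All classes reachable from concept by following H upwards."""
        found, frontier = set(), [concept]
        while frontier:
            current = frontier.pop()
            for child, parent in self.taxonomy:
                if child == current and parent not in found:
                    found.add(parent)
                    frontier.append(parent)
        return found

    def classes_of(self, instance):
        """Declared classes of an instance plus those inferred through H."""
        direct = {cls for inst, cls in self.instances if inst == instance}
        inferred = set(direct)
        for cls in direct:
            inferred |= self.ancestors(cls)
        return inferred

    def violated_axioms(self):
        """Names of axioms that do not hold (consistency checking)."""
        return [axiom.__name__ for axiom in self.axioms if not axiom(self)]


def instances_have_known_class(onto):
    return all(cls in onto.concepts for _, cls in onto.instances)


onto = Ontology(
    concepts={"Facility", "Airport", "Runway", "NFFN"},
    taxonomy={("Airport", "Facility")},
    relations={("Airport", "hasRunway", "Runway")},
    properties={("Airport", "designator", "string")},
    instances={("NFFN", "Airport")},
    axioms=[instances_have_known_class],
)
print(onto.classes_of("NFFN"), onto.violated_axioms())
```

Here the taxonomy lets the sketch infer that NFFN is also a Facility, which is the kind of new knowledge the A component is meant to support.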
The Ontology
Ontology Development
• via Domain Expert
    • From scratch
    • Mostly manual and time consuming
    • Valid and rich knowledge within ontology
• via Data Transformation
    • Existing data required
    • Based on specific data format transformation (e.g. RDB and XML)
    • Automatic
    • Knowledge richness limited to database content
What we need…
• Global common knowledge
• Local ontology development may not be shared globally
• The value of local knowledge for global development
• Rich and valid knowledge
• Automatic development process
• Does not rely on the availability of domain experts
• Domain experts are not always present
• Immediate development
• Maintainable knowledge
A Data-Driven Dynamic Common Ontology
The Concept
• (i) create common ontologies automatically from community knowledge representations, and
• (ii) maintain their content by capturing dynamic knowledge changes and updates specific to the community, and by capturing recent world updates (e.g. through social media and news).
• Content updates are done through propagation and enrichment.
A Data-Driven Dynamic Common Ontology
The Creation
• CO = (C, H, R, P, I, A, S, CV), where
• S = a set of similarity values between ontology knowledge components (classes, instances, non-taxonomic relationships, and properties) and their respective similar external ontology knowledge components.
• CV = a set of confidence values Cv attached to ontology instance knowledge; each is the ratio of the number of knowledge sources that mention a piece of knowledge to the total number of knowledge sources.
• Why Confidence Value (CV)?
• Knowledge stability assurance. Newly extracted knowledge is not always the best knowledge, and a piece of knowledge from one community may not necessarily become part of the global community knowledge representation.
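The confidence value is a plain ratio, so it can be computed directly from the sources. The sketch below assumes each knowledge source is a set of triples; the community names, the triples and the 0.5 acceptance threshold are invented for illustration and are not taken from the cited papers.

```python
from collections import Counter


def confidence_values(sources):
    """Cv = (sources mentioning an item) / (total number of sources)."""
    total = len(sources)
    if total == 0:
        return {}
    # set(items) makes sure a source counts at most once per item.
    mentions = Counter(item for items in sources.values() for item in set(items))
    return {item: count / total for item, count in mentions.items()}


# Hypothetical community ontologies, reduced to sets of triples.
sources = {
    "community_a": {("NFFN", "instanceOf", "Airport"), ("Airport", "hasRunway", "Runway")},
    "community_b": {("NFFN", "instanceOf", "Airport"), ("Airport", "hasRunway", "Runway")},
    "community_c": {("NFFN", "instanceOf", "Airport")},
    "community_d": {("NFFN", "instanceOf", "Airport"), ("Airport", "hasTerminal", "Terminal")},
}

cv = confidence_values(sources)

THRESHOLD = 0.5  # assumed cut-off for treating knowledge as stable
stable = {item for item, value in cv.items() if value >= THRESHOLD}
print(cv)
print(stable)
```

Knowledge mentioned by only one community gets a low value and stays out of the stable set until more sources agree, which matches the stability argument above.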
A Data-Driven Dynamic Common Ontology
The Propagation
• Why?
    • Frequent change in the community
    • Validity assurance from the knowledge source
• How?
    • Using a delta script
    • A delta script is very useful when the original file is located elsewhere or in a distributed environment, since sending the whole updated file consumes more resources and increases the chance of information loss
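The idea of a delta script can be sketched as follows: only the additions and removals travel over the network, and the receiver applies them to its own copy. The JSON layout with `add` and `remove` lists is an assumption made for this example, not the delta format used in the cited work.

```python
import json


def make_delta(old, new):
    """Describe how to turn old into new without sending new in full."""
    return {"add": sorted(new - old), "remove": sorted(old - new)}


def apply_delta(current, delta):
    """Apply a delta (possibly decoded from JSON, where tuples arrive as lists)."""
    removed = {tuple(item) for item in delta["remove"]}
    added = {tuple(item) for item in delta["add"]}
    return (set(current) - removed) | added


# Hypothetical local ontology content before and after a community update.
old_version = {
    ("Airport", "subClassOf", "Facility"),
    ("Airport", "hasRunway", "Runway"),
    ("NFFN", "instanceOf", "Airport"),
    ("YMML", "instanceOf", "Airport"),
    ("Runway", "hasSurface", "Asphalt"),
}
new_version = (old_version - {("Runway", "hasSurface", "Asphalt")}) | {
    ("Runway", "hasSurface", "Concrete")
}

delta = make_delta(old_version, new_version)
wire_message = json.dumps(delta)  # this is what would be sent to the remote copy

remote_copy = set(old_version)
remote_copy = apply_delta(remote_copy, json.loads(wire_message))
print(wire_message)
```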
A Data-Driven Dynamic Common Ontology
The Enrichment
• Why?
    • Global knowledge update
    • Validity assurance from the global understanding
• How?
    • Take RECENT related documents (e.g. recent news articles)
    • Ontology- and linguistic-pattern-based extraction
    • Self-enrichment: find related recent documents by exploiting keywords from the common ontology
    • Domain independent
    • Confidence value is considered
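A rough sketch of the enrichment loop is given below: keywords from the ontology select related documents, a linguistic pattern proposes new instance knowledge, and a confidence value filters it. The single "X is a/an Y" pattern, the sample sentences and the 0.5 threshold are all assumptions for illustration; the cited work may use different patterns and scoring.

```python
import re

# One illustrative linguistic pattern: "<Name> is a/an <class>".
IS_A = re.compile(r"\b([A-Z][a-z]+) is an? ([a-z]+)\b")


def enrich(concepts, documents, threshold=0.5):
    """Propose (instance, class) pairs found in documents related to the ontology."""
    keywords = {concept.lower() for concept in concepts}
    # Self-enrichment step: keep only documents that mention an ontology keyword.
    related = [doc for doc in documents if any(k in doc.lower() for k in keywords)]
    if not related:
        return {}
    mentions = {}
    for doc in related:
        # Count each candidate at most once per document.
        for name, cls in set(IS_A.findall(doc)):
            if cls in keywords:
                mentions[(name, cls)] = mentions.get((name, cls), 0) + 1
    scored = {pair: count / len(related) for pair, count in mentions.items()}
    return {pair: value for pair, value in scored.items() if value >= threshold}


# Hypothetical recent documents.
documents = [
    "Tullamarine is an airport serving Melbourne.",
    "Reports say Tullamarine is an airport with four terminals.",
    "The runway was resurfaced last week.",
    "Local football results were announced.",
]

new_knowledge = enrich({"Airport", "Runway"}, documents)
print(new_knowledge)
```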
References
Fudholi, D.H., Rahayu, W., Pardede, E. (2015). A data-driven dynamic ontology. J. Information Science 41(3): 383-398.
Fudholi, D.H., Rahayu, W., Pardede, E. (2014). CODE (Common Ontology DEvelopment): A Knowledge Integration Approach from Multiple Ontologies. IEEE AINA 2014: 751-758, Victoria, Canada.
Fudholi, D.H., Rahayu, W., Pardede, E., Hendrik. (2013). A Data-Driven Approach toward Building Dynamic Ontology. ICT-EurAsia 2013: 223-232, Indonesia.
• Global semantic representation: ontology or schema mediator
o Data can arrive from various heterogeneous data sources.
o Data from different sources have different structures.
o In most cases the underlying data is quite similar. But as the structures are different, conflicts arise.
Consolidating Data Sources
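A schema mediator resolves such structural conflicts by mapping each source to one global view. The sketch below assumes two made-up XML sources that describe the same airport with different element names and nesting; the `MEDIATOR` table and both helper functions are hypothetical and much simpler than the double-layered approach in the references.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources: same underlying data, different structures.
SOURCE_A = "<airport><code>YMML</code><title>Melbourne</title></airport>"
SOURCE_B = '<aerodrome ident="ymml"><info><name>MELBOURNE</name></info></aerodrome>'

# Mediator: for each source, how to obtain every field of the global schema.
MEDIATOR = {
    "source_a": {
        "designator": lambda root: root.findtext("code"),
        "name": lambda root: root.findtext("title"),
    },
    "source_b": {
        "designator": lambda root: root.get("ident"),
        "name": lambda root: root.findtext("info/name"),
    },
}


def to_global(source_id, xml_text):
    """Translate one source document into the global schema."""
    root = ET.fromstring(xml_text)
    # Upper-casing is a naive normalisation so that equal values compare equal.
    return {
        field: extract(root).strip().upper()
        for field, extract in MEDIATOR[source_id].items()
    }


def consolidate(records):
    """Merge global records by designator and report remaining value conflicts."""
    merged, conflicts = {}, []
    for record in records:
        existing = merged.setdefault(record["designator"], record)
        if existing != record:
            conflicts.append((existing, record))
    return merged, conflicts


records = [to_global("source_a", SOURCE_A), to_global("source_b", SOURCE_B)]
merged, conflicts = consolidate(records)
print(merged, conflicts)
```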
Consolidating Data Sources
References
Nguyen, H.Q., Taniar, D., Rahayu, J.W., Nguyen, K. (2011). Double-layered schema integration of heterogeneous XML sources. Journal of Systems and Software 84(1): 63-76.
Nguyen, H., Rahayu, J.W., Taniar, D., Nguyen, K. (2008). Mediation-based XML query answerability. Proceedings of the OTM 2008 Confederated International Conferences: On the Move to Meaningful Internet Systems (OTM 2008), 9-14 November 2008, Springer, Berlin, Germany, pp. 1550-1558.
• “Loose coupling” integration: data virtualization, data container
Data Container *
• The era of large heterogeneous data collection – moving from Databases to Data container
• Data container – contains a collection of resources, each of which has a unique reference/identifier
• The resources in a Data container can be: databases, database relations, database tuples, files, records in files, data streams, social media documents, parts of texts, maps, trajectories, etc.
* Reference from Prof. Timos Sellis – Data Ecosystem - From Very Large Databases to Big Data Infrastructure, La Trobe, November 2015
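A data container, as described above, can be sketched as a registry of uniquely referenced resources of mixed kinds. The `Resource` and `DataContainer` classes, the reference strings and the sample data are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator


@dataclass(frozen=True)
class Resource:
    ref: str                   # unique reference/identifier
    kind: str                  # e.g. "tuple", "file-record", "document", "stream"
    load: Callable[[], Any]    # lazy access to the underlying data


class DataContainer:
    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def add(self, resource: Resource) -> None:
        if resource.ref in self._resources:
            raise ValueError(f"duplicate reference: {resource.ref}")
        self._resources[resource.ref] = resource

    def get(self, ref: str) -> Any:
        """Resolve a reference to its data, wherever that data lives."""
        return self._resources[ref].load()

    def refs_of_kind(self, kind: str) -> Iterator[str]:
        return (r.ref for r in self._resources.values() if r.kind == kind)


container = DataContainer()
container.add(Resource("db://airports/YMML", "tuple",
                       lambda: {"designator": "YMML", "name": "MELBOURNE"}))
container.add(Resource("file://notams.txt#1", "file-record",
                       lambda: "RWY 16/34 CONDITIONAL DUE TO RESURFACING"))
container.add(Resource("social://post/42", "document",
                       lambda: "Long queue at the airport this morning"))

print(container.get("db://airports/YMML"))
print(list(container.refs_of_kind("document")))
```

The container stores only references and loaders, so a database tuple, a file record and a social media document sit side by side without being forced into one schema.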
Data Container *
[Diagram: a user query is posed to the data container, which returns a result]
o The data may arrive from various data sources from different locations.
o Data from multiple data sources are integrated and aggregated on the fly.
o The user experiences this as a real data warehouse: where the data comes from is hidden, but the data is available.
o Further benefits:
    o Real-time availability of information for decision support.
    o Data is less stale.
    o Able to access data instantly.
Data Warehouse Virtualization
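Integration and aggregation on the fly can be sketched as a query that pulls from every source at the moment it runs, rather than reading a pre-loaded copy. The two sources, their contents and the `total_by_region` query are invented for this example.

```python
import csv
import io
from collections import defaultdict

# Two hypothetical sources in different places and formats.
ORDERS_TABLE = [{"region": "VIC", "amount": 120.0}, {"region": "NSW", "amount": 80.0}]
WEB_SALES_CSV = "region,amount\nVIC,30.5\nQLD,10\n"


def fetch_orders():
    return list(ORDERS_TABLE)


def fetch_web_sales():
    return list(csv.DictReader(io.StringIO(WEB_SALES_CSV)))


def total_by_region(sources):
    """Integrate and aggregate at query time; nothing is materialised beforehand."""
    totals = defaultdict(float)
    for fetch in sources:
        for row in fetch():
            totals[row["region"]] += float(row["amount"])
    return dict(totals)


sources = [fetch_orders, fetch_web_sales]
first = total_by_region(sources)

# A change in a source is visible in the very next query: the data is less stale.
ORDERS_TABLE.append({"region": "QLD", "amount": 5.0})
second = total_by_region(sources)
print(first, second)
```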
Current Data Warehousing Trends in Industry
According to the latest Gartner study (2015), data warehouses can be classified into four main categories:
1. Traditional Data Warehouse
o Consolidates and stores historical data that arrive from various data sources
2. Operational Data Warehouse
o Data is structured and continuously loaded to support operational queries
3. Logical Data Warehouse
o Structured data and other content data types
o Utilizes Data Virtualization
4. Data Lake
o Uses flat architecture to store data in its original format
o Supports ‘Schema on Read’ capabilities
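"Schema on Read" means the raw records are stored untouched and a schema is applied only when a query reads them. A minimal sketch, assuming a lake of JSON lines with inconsistent field names; the records and the `read_with_schema` helper are made up for illustration.

```python
import json

# Raw records kept in their original form, including one that is not valid JSON.
RAW_LAKE = [
    '{"id": 1, "temp": "21.5", "site": "A"}',
    '{"id": 2, "temperature_c": 19.0}',
    "not json at all",
]

# Schema chosen by the reader: target field -> (accepted raw names, type cast).
SCHEMA = {
    "id": (("id",), int),
    "temp_c": (("temp", "temperature_c"), float),
}


def read_with_schema(raw_records, schema):
    """Apply the schema while reading; unreadable records are skipped."""
    for line in raw_records:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        row = {}
        for target, (names, cast) in schema.items():
            value = next((record[n] for n in names if n in record), None)
            row[target] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_LAKE, SCHEMA))
print(rows)
```

A different reader could apply a different schema to the same raw records, which is what distinguishes this from loading into a fixed warehouse schema.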
Traditional Data Warehouse
Dynamic Data Warehouse
E. Chang, W. Rahayu, M. Diallo, M. Machizaud: Dynamic Data Mart for Business Intelligence. IFIP AI 2015: 50-63.
o 3M: Data Mining, Data Marshalling, and Data Meshing
o 3R: Recommendation, Reconciliation, Representation
Dynamic Data Warehouse
New Trends in Data Warehousing
Microsoft | IBM | Oracle | Cisco | SAP
• Real-time Data Warehousing
• Support new data types
• Support for cloud data
• Data Lake
• Real-time Data Warehousing
• Logical Big Data Warehousing
• Support for complex, structured and unstructured data
• Support for Big Data
• Logical Data Warehousing
• Data Lake
Finally…
Integration can be achieved through:
• Standardization – suitable for domain-specific data sources, since it relies on the availability of a standard
• Global semantic representation – suitable for data sources with inherent common knowledge
• “Loose coupling” integration – suitable for large heterogeneous data sources or data containers with a dynamic nature (frequent changes)
Thank you
CRICOS Provider 00115M