Berendt: Gegevensbanken, 2nd semester 2011/2012, http://www.cs.kuleuven.be/~berendt/teaching/
Gegevensbanken (Databases) Outlook –
The Semantic Web,
XML, RDF,
Linked (Open) Data,
NoSQL
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/
Where are we?
Lesson #   Who   What
1          ED    Intro, ER
2          ED    EER, (E)ER to relational schema
2          ED    Relational model
3          KV    Relational algebra & relational calculus
4,5        KV    SQL
6          KV    Connecting programs to databases
7          KV    Functional dependencies & normalisation
8          KV    PHP
10         BB    Database security
11         BB    Storage and file organisation
12         BB    External hashing
13         BB    Index structures
14         BB    Query processing
15-17      BB    Transaction processing and concurrency control
18         BB    Data mining and Information Retrieval
9          BB    XML (and more about the Web as a database), NoSQL
New topics / outlook
How do data become powerful? Analysis & combination
A motivation
Q:
About the internet in general: should it be seen as one big unordered chaos of websites,
or rather as many separate databases (for example, one with all web pages from Belgium, or one with all web pages of an internet provider such as Telenet)
that together make up the internet (and thus allow one large, general database to distribute its tasks)?
For example: SIG.MA
The original vision
The entertainment system was belting out the Beatles' "We Can Work It Out" when the phone rang. When Pete answered, his phone turned the sound down by sending a message to all the other local devices that had a volume control. His sister, Lucy, was on the line from the doctor's office: "Mom needs to see a specialist and then has to have a series of physical therapy sessions. Biweekly or something. I'm going to have my agent set up the appointments." Pete immediately agreed to share the chauffeuring.
At the doctor's office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules. (The emphasized keywords indicate terms whose semantics, or meaning, were defined for the agent through the Semantic Web.)
Tim Berners-Lee, James Hendler and Ora Lassila (2001). The Semantic Web. A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American. http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21
The Semantic Web layer cake (T. Berners-Lee talk at XML 2000)
RDF: W3C Rec. 2004
OWL: W3C Rec. 2004
OWL 2: W3C Rec. 2009
URI = Uniform Resource Identifier, e.g.:
• URL (Uniform Resource Locator): where to find it (~ the address of a person)
• URN (Uniform Resource Name): identity (~ the name of a person, the ISBN of a book)
You have data … How should you structure it?
Here's some data about an aircraft:
medium-altitude, long-endurance unmanned aerial vehicle
14.8 meters
512 kilograms
70 knots
400 nautical miles
The XML approach is to "wrap" each data item in start/end tags
RQ-1.xml:
<Aircraft>
  <wingspan>14.8 meters</wingspan>
  <weight>512 kilograms</weight>
  <cruise-speed>70 knots</cruise-speed>
  <range>400 nautical miles</range>
  <description>medium-altitude, long-endurance unmanned aerial vehicle</description>
</Aircraft>
and define this data schema in a DTD or XML Schema
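As a minimal sketch of what "wrapping" buys you, the record can be read back with a standard XML parser. The example below uses Python's standard library; the function name `read_aircraft` and the variable names are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

# The aircraft record from the slide, as it might appear in RQ-1.xml.
RQ1_XML = """
<Aircraft>
  <wingspan>14.8 meters</wingspan>
  <weight>512 kilograms</weight>
  <cruise-speed>70 knots</cruise-speed>
  <range>400 nautical miles</range>
  <description>medium-altitude, long-endurance unmanned aerial vehicle</description>
</Aircraft>
"""


def read_aircraft(xml_text):
    """Return a dict mapping each child element's tag to its text content."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


aircraft = read_aircraft(RQ1_XML)
print(aircraft["wingspan"])
```

Note that the parser only recovers the structure: the value of `wingspan` is still an uninterpreted string, which is the point the following slides build on.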
XML Terminology
<wingspan>14.8 meters</wingspan>
Start tag: <wingspan>
Data: 14.8 meters
End tag: </wingspan>
Element: start tag, data and end tag together
Why use XML?
It is a universally accepted standard way of structuring data (syntax).
It is a W3C recommendation (W3C = World Wide Web Consortium)
The marketplace supports it with a lot of free/inexpensive tools.
The alternative to using XML is to define your own proprietary data syntax, and then build your own proprietary tools to support the proprietary syntax (Not a very appealing idea).
But: What is this XML snippet talking about, i.e., what are the semantics?
<Predator> …</Predator>
What is a Predator?
Predator - which one?
Predator: a medium-altitude, long-endurance unmanned aerial vehicle system.
Predator: one that victimizes, plunders, or destroys, especially for one's own gain.
Predator: an organism that lives by preying on other organisms.
Predator: a company which specializes in camouflage attire.
Predator: a video game.
Predator: software for machine networking.
Predator: a chain of paintball stores.
A little more flexibility through namespaces
<?xml version="1.0" encoding="UTF-8"?>
<myThings
    xmlns:h="http://www.mySchemas.org/TR/aircraft/"
    xmlns:f="http://www.yourSchemas.com/animals">
  <h:Predator>
    <h:name>OL231-b</h:name>
    <h:wingspan>14.8 metres</h:wingspan>
  </h:Predator>
  <f:Predator>
    <f:name>Panthera</f:name>
    <f:eats>antelopes</f:eats>
  </f:Predator>
</myThings>
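A small sketch of how a program can tell the two kinds of Predator apart once namespaces are declared. It uses Python's standard library; the document is written with quoted attribute values and matching end tags, which a parser requires, and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<myThings
    xmlns:h="http://www.mySchemas.org/TR/aircraft/"
    xmlns:f="http://www.yourSchemas.com/animals">
  <h:Predator>
    <h:name>OL231-b</h:name>
    <h:wingspan>14.8 metres</h:wingspan>
  </h:Predator>
  <f:Predator>
    <f:name>Panthera</f:name>
    <f:eats>antelopes</f:eats>
  </f:Predator>
</myThings>
"""

# Prefix-to-URI map used in the queries; the prefixes are local shorthands,
# only the namespace URIs identify the vocabulary.
NS = {
    "h": "http://www.mySchemas.org/TR/aircraft/",
    "f": "http://www.yourSchemas.com/animals",
}

root = ET.fromstring(DOC.encode("utf-8"))

aircraft_names = [p.findtext("h:name", namespaces=NS)
                  for p in root.findall("h:Predator", NS)]
animal_names = [p.findtext("f:name", namespaces=NS)
                for p in root.findall("f:Predator", NS)]

print(aircraft_names, animal_names)
```

Internally the parser stores each tag together with its namespace URI, so the two `Predator` elements are different names to the program even though the local part is identical.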
Querying XML
Several query languages exist, e.g. XPath, XQuery
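To give a flavour of path-based querying, here is a sketch using the limited XPath subset supported by Python's standard library (full XPath and XQuery need dedicated engines). The `Fleet` document, the `id` and `unit` attributes and the second aircraft "X-2" are hypothetical data invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fleet document; only the RQ-1 values come from the slides.
FLEET = """
<Fleet>
  <Aircraft id="RQ-1">
    <wingspan unit="m">14.8</wingspan>
    <cruise-speed unit="knots">70</cruise-speed>
  </Aircraft>
  <Aircraft id="X-2">
    <wingspan unit="m">20.0</wingspan>
    <cruise-speed unit="knots">120</cruise-speed>
  </Aircraft>
</Fleet>
"""

root = ET.fromstring(FLEET.strip())

# Path expression: every wingspan element anywhere below the root.
wingspans = [float(e.text) for e in root.findall(".//wingspan")]

# Predicate on an attribute: the Aircraft element whose id is "RQ-1".
rq1 = root.find("./Aircraft[@id='RQ-1']")
rq1_speed = rq1.findtext("cruise-speed")

print(wingspans, rq1_speed)
```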
Problems of XML
1. What does nesting mean?
2. What do syntactical variations mean?
3. What do linguistic variations mean?
4. How can we extend our knowledge?
1. What does nesting mean?
Schema 1 allows for expressions like:
<Person>
<name>Peter Parker</name> ...
</Person>
name being an XML-element of Person means: the person HAS-A ...
Schema 2 allows for expressions like:
<Person>
<type>Comic-book hero</type> ...
</Person>
type being an XML-element of Person means: the person IS-A ...
Problems: a) we don't know what nesting means; b) even if we do know, we can't express this in a machine-readable way (at most we can build it into an application that uses these XML statements, but that would bury meaning in procedures!)
2. What do syntactical variations mean?
Schema 1 allows for expressions like:
<Person>
<name>Peter Parker</name>
<birthday>1932-04-12</birthday> ...
</Person>
Schema 2 allows for expressions like:
<Person name="Peter Parker">
<type>Comic-book hero</type> ...
</Person>
Problems: a) what does it mean for some information to be an XML element vs. an XML attribute? b) even if we do know that they are the same, we can't express this in a machine-readable way, for example to combine the information from the two sources (same remark about applications as in 1.)
3. What do linguistic variations mean?
Schema 1 allows for expressions like:
<Person>
<name>Peter Parker</name> ...
</Person>
Schema 2 allows for expressions like:
<Person>
<naam>Peter Parker</naam> ...
</Person>
Problems: a) we do not know whether elements from different data sources that differ in, e.g., natural language are the same or not; b) even if we do know that they are the same, we can't express this in a machine-readable way, for example to combine the information from the two sources (same remark about applications as in 1.)
4. How can we extend our knowledge?
Schema 1 allows for expressions like:
<WebResource>
<type>Picture</type>
<hasURL>http://www.example.org/Pictures/myPic.png</hasURL>
<isAbout>Peter Parker</isAbout> ...
</WebResource>
Schema 2 allows for expressions like:
<WebResource>
<hasURL>http://www.example.org/Pictures/myPic.png</hasURL>
<hasLicence>CreativeCommons</hasLicence> ...
</WebResource>
Problems: a) we cannot refine our schema information with that provided by another source; b) even if we can be sure that the two sources are linkable in principle (here: via the URL), we can't express this in a machine-readable way, for example to combine the information from the two sources (same remark about applications as in 1.)
Summary: XML is not well-suited for conceptual modelling and therefore not suited for truly semantic markup
XML makes no commitment on:
(1) Domain-specific ontological vocabulary
(2) Ontological modeling primitives
Requires pre-arranged agreement on (1) & (2)
Only feasible for closed collaboration:
agents in a small & stable community
pages on a small & stable intranet
Not suited for sharing Web-resources
Solution approach of the "higher levels" of the Semantic Web
1. Break down information into atomic statements: subject-predicate-object
2. Define (in a formal-semantics way) what each component of each statement means
a. Give it a URI (uniform resource identifier) to enable uniform meaning specification
b. Define languages to say more about (specify) the meaning (by relating it to other units of meaning – cf. a dictionary in which each word is explained by other words)
3. The languages mentioned in 2.b. each add more expressivity:
1. RDF: subject-predicate-object statements (in RDF terminology: a resource has a property with a certain value).
2. RDFS: simple ontology building blocks: class, subclass-of relation, use RDF's type to denote that (e.g.) an individual is an instance of a class (= make it possible to define a schema and its instances), ...
3. OWL: more advanced ontology building blocks: a class (= concept) is disjoint with another one, is the same as another one; a property is functional, symmetric, the inverse of another one; ...
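Steps 1 and 2a can be sketched in a few lines: once information is broken into subject-predicate-object statements whose subjects are URIs, combining two sources is a plain set union. The data reuses the picture example from the "How can we extend our knowledge?" slide; the `ex:` prefix, the property spellings and the helper `describe` are assumptions made for this illustration.

```python
# Each statement is a (subject, predicate, object) triple.
PIC = "http://www.example.org/Pictures/myPic.png"

# Source 1 knows what the picture is about.
source_1 = {
    (PIC, "rdf:type", "ex:Picture"),
    (PIC, "ex:isAbout", "Peter Parker"),
}

# Source 2 knows its licence.
source_2 = {
    (PIC, "ex:hasLicence", "CreativeCommons"),
}


def describe(graph, subject):
    """Collect all predicate/object pairs stated about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# Because both sources use the same URI for the picture,
# merging is just the union of the two sets of statements.
merged = source_1 | source_2

for predicate, obj in describe(merged, PIC):
    print(predicate, obj)
```

No schema had to be agreed on in advance for the merge itself; what the predicates mean is the job of the languages in step 2b.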
Semantic Web vs. Database
Advantages of using RDF/RDFS/OWL to define an Ontology:
Extensible: much easier to add new properties. Contrast with a database - adding a new column may break a lot of applications
Portable: much easier to move an OWL document than to move a database.
Advantages of using a Database to define an Ontology:
Mature: the database technology has been around a long time and is very mature.
RDF model
RDF "statements" consist of
resources (= nodes)
which have properties
which have values (= nodes, strings)
Example: "http://www.w3.org/TR/REC-rdf-syntax/ has the author Ora Lassila"
resource (= subject): http://www.w3.org/TR/REC-rdf-syntax/
property (= predicate): author
value (= object): "Ora Lassila"
RDF Model Example
Resource: http://www.w3.org/TR/REC-rdf-syntax/
dc:Creator: "Ora Lassila"
dc:Date: "1999-02-22"
dc:Publisher: "W3C"
Complex values
So far, values of properties have been strings
A graph node (corresponding to a resource) also can be the value of a property
arbitrarily complex tree and graph structures are possible
syntactically, values can be embedded (i.e. lexically in-line) or referenced (linked)
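Example (figure): the resource http://www.w3.org/TR/REC-rdf-syntax/ has the property dc:Creator, whose value is itself a node; that node has the property p:Name with value "Ora Lassila" and the property p:EMail.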
RDF in XML
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:p="http://example.org/persons/1.0/">
  <rdf:Description rdf:about="http://www.w3.org/TR/REC-rdf-syntax">
    <dc:creator rdf:nodeID="abc"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="abc">
    <p:Name>Ora Lassila</p:Name>
    <p:Email>[email protected]</p:Email>
    <p:HasHomepage rdf:resource="http://www.nokia.com"/>
    <p:WorksIn rdf:resource="#xyz"/>
  </rdf:Description>
</rdf:RDF>
RDF Schema
• Defines small vocabulary for RDF:
  • Class, subClassOf, type
  • Property, subPropertyOf
  • domain, range
• Vocabulary can be used to define other vocabularies for your application domain
Example (figure): Student and Researcher are each subClassOf Person; Jeen has type Student and Frank has type Researcher; the property hasSuperVisor has domain Student and range Researcher; Jeen hasSuperVisor Frank.
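The point of the RDFS vocabulary is that a program can derive statements that were never written down. The sketch below applies three RDFS-style rules (domain, range, subClassOf) to the Person/Student/Researcher example in pure Python. It assumes that Jeen is the Student supervised by Frank the Researcher, which is one reading of the figure, and the function name `rdfs_closure` is hypothetical; a real system would use an RDF library or reasoner.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

# Schema plus a single instance-level statement (assumed reading of the figure).
triples = {
    ("Student", SUBCLASS, "Person"),
    ("Researcher", SUBCLASS, "Person"),
    ("hasSuperVisor", DOMAIN, "Student"),
    ("hasSuperVisor", RANGE, "Researcher"),
    ("Jeen", "hasSuperVisor", "Frank"),
}


def rdfs_closure(graph):
    """Apply three RDFS-style rules until no new triples appear."""
    graph = set(graph)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                # domain: (p domain C) and (s p o)  =>  (s type C)
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))
                # range: (p range C) and (s p o)  =>  (o type C)
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))
                # subclass: (s type C) and (C subClassOf D)  =>  (s type D)
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))
        if new <= graph:
            return graph
        graph |= new


inferred = rdfs_closure(triples)
for triple in sorted(inferred - triples):
    print(triple)
```

From the one statement "Jeen hasSuperVisor Frank" the rules derive that Jeen is a Student, Frank is a Researcher, and both are Persons.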
RDF Schema syntax in XML
<rdf:Description rdf:ID="MotorVehicle">
  <rdf:type rdf:resource="http://www.w3.org/...#Class"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/...#Resource"/>
</rdf:Description>
<rdf:Description rdf:ID="Truck">
  <rdf:type rdf:resource="http://www.w3.org/...#Class"/>
  <rdfs:subClassOf rdf:resource="#MotorVehicle"/>
</rdf:Description>
<rdf:Description rdf:ID="registeredTo">
  <rdf:type rdf:resource="http://www.w3.org/...#Property"/>
  <rdfs:domain rdf:resource="#MotorVehicle"/>
  <rdfs:range rdf:resource="#Person"/>
</rdf:Description>
<rdf:Description rdf:ID="ownedBy">
  <rdf:type rdf:resource="http://www.w3.org/...#Property"/>
  <rdfs:subPropertyOf rdf:resource="#registeredTo"/>
</rdf:Description>
What is this? Can we do something with it?
Combined by SIG.MA
And how does this work?
Linked Open Data:
"A way of making the Semantic Web happen" (it is hoped)
Key concept: leverage the existence of structured data and combine it with the languages and infrastructures of the Web and the Semantic Web
End 2011:
32 billion triples
Data items are identified with HTTP URIs
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
From http://www.ai.sri.com/~nysmith/slides/aic-seminars/090724-bizer.ppt
Resolving URIs over the Web
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
dbpedia:Berlin dp:population 3.405.259
dbpedia:Berlin skos:subject dp:Cities_in_Germany
From http://www.ai.sri.com/~nysmith/slides/aic-seminars/090724-bizer.ppt
Dereferencing URIs over the Web
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
dbpedia:Berlin dp:population 3.405.259
dbpedia:Berlin skos:subject dp:Cities_in_Germany
dbpedia:Hamburg skos:subject dp:Cities_in_Germany
dbpedia:Muenchen skos:subject dp:Cities_in_Germany
From http://www.ai.sri.com/~nysmith/slides/aic-seminars/090724-bizer.ppt
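The three slides above show a graph that grows each time a URI is looked up. The sketch below imitates this "follow your nose" process without any network access: a dictionary named `WEB` stands in for the Web and maps each URI to the triples a server might return for it. The dictionary, the function `crawl` and the hop limit are assumptions made for illustration; real clients would issue HTTP requests and parse RDF.

```python
from collections import deque

# Hypothetical stand-in for the Web: URI -> triples returned on dereferencing.
WEB = {
    "pd:cygri": [
        ("pd:cygri", "rdf:type", "foaf:Person"),
        ("pd:cygri", "foaf:name", "Richard Cyganiak"),
        ("pd:cygri", "foaf:based_near", "dbpedia:Berlin"),
    ],
    "dbpedia:Berlin": [
        ("dbpedia:Berlin", "dp:population", "3405259"),
        ("dbpedia:Berlin", "skos:subject", "dp:Cities_in_Germany"),
    ],
    "dp:Cities_in_Germany": [
        ("dbpedia:Hamburg", "skos:subject", "dp:Cities_in_Germany"),
        ("dbpedia:Muenchen", "skos:subject", "dp:Cities_in_Germany"),
    ],
}


def crawl(start, max_hops):
    """Breadth-first traversal: dereference URIs up to max_hops links away."""
    seen = {start}
    triples = set()
    queue = deque([(start, 0)])
    while queue:
        uri, hops = queue.popleft()
        # Literals and unknown URIs simply return no triples.
        for s, p, o in WEB.get(uri, []):
            triples.add((s, p, o))
            if hops < max_hops and o not in seen:
                seen.add(o)
                queue.append((o, hops + 1))
    return triples


for hops in (0, 1, 2):
    print(hops, len(crawl("pd:cygri", hops)))
```

With zero hops only the statements about pd:cygri are known; each further hop adds the statements found behind the newly discovered URIs, mirroring the three figures.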
What is LOD?
"A way of making the Semantic Web happen" (it is hoped)
Key concept: leverage the existence of structured data and combine it with the languages and infrastructures of the Web and the Semantic Web
Tim Berners-Lee: four principles of Linked Data (http://www.w3.org/DesignIssues/LinkedData)
Use URIs to identify things.
Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
SPARQL: The standard query language for LOD
"What are all the country capitals in Africa?"
PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
?x abc:cityname ?capital ;
abc:isCapitalOf ?y .
?y abc:countryname ?country ;
abc:isInContinent abc:Africa .
}
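To see what a SPARQL engine does with the WHERE clause, the sketch below evaluates the same four triple patterns against a tiny in-memory data set by binding variables pattern by pattern. It is a simplified illustration, not a SPARQL implementation: the sample triples (Nairobi, Brussels and the `ex:` identifiers) and the helper names `match` and `select` are assumptions; only the property names come from the query above.

```python
ABC = "http://example.com/exampleOntology#"

# Hypothetical sample data using the properties named in the query.
TRIPLES = [
    ("ex:c1", ABC + "cityname", "Nairobi"),
    ("ex:c1", ABC + "isCapitalOf", "ex:kenya"),
    ("ex:kenya", ABC + "countryname", "Kenya"),
    ("ex:kenya", ABC + "isInContinent", ABC + "Africa"),
    ("ex:c2", ABC + "cityname", "Brussels"),
    ("ex:c2", ABC + "isCapitalOf", "ex:belgium"),
    ("ex:belgium", ABC + "countryname", "Belgium"),
    ("ex:belgium", ABC + "isInContinent", ABC + "Europe"),
]

# The WHERE clause, written out as four triple patterns.
PATTERNS = [
    ("?x", ABC + "cityname", "?capital"),
    ("?x", ABC + "isCapitalOf", "?y"),
    ("?y", ABC + "countryname", "?country"),
    ("?y", ABC + "isInContinent", ABC + "Africa"),
]


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def select(patterns, triples):
    """Join the patterns one after another, keeping consistent bindings."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        solutions = extended
    return solutions


rows = [(s["?capital"], s["?country"]) for s in select(PATTERNS, TRIPLES)]
print(rows)
```

The shared variables `?x` and `?y` play the role that join conditions play in SQL.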
Connecting to a database … ah … triple store
The Linked Open Data Cloud
History of the World, Part 1
Relational Databases – mainstay of business
Web-based applications caused spikes
Especially true for public-facing e-Commerce sites
Developers begin to front the RDBMS with memcache or integrate other caching mechanisms within the application (e.g. Ehcache)
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
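"Fronting" the database with a cache can be sketched as the cache-aside pattern: reads try the cache first and only go to the database on a miss, and writes invalidate the cached entry. The classes `SlowDatabase` and `CachedStore` are hypothetical stand-ins written for this illustration; an in-process dictionary plays the role that memcache or Ehcache would play.

```python
class SlowDatabase:
    """Hypothetical stand-in for an RDBMS; counts how often it is queried."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class CachedStore:
    """Cache-aside: look in the cache first, fall back to the database."""

    def __init__(self, db):
        self.db = db
        self.cache = {}  # plays the role of memcache / Ehcache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        self.cache[key] = value
        return value

    def put(self, key, value):
        self.db.rows[key] = value
        # Invalidate so that the next read fetches the new value.
        self.cache.pop(key, None)


db = SlowDatabase({"member:1": "Kirsten"})
store = CachedStore(db)
first = store.get("member:1")   # miss: goes to the database
second = store.get("member:1")  # hit: served from the cache
print(first, second, db.reads)
```

The cache absorbs read spikes, but it also introduces the question of when cached data may be stale, which leads directly to the consistency discussion later in this lecture.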
SELECT * FROM members WHERE name LIKE '%kirsten%'
????
Get write lock
Update friends table
Release write lock
????
Reminder: task for the next lesson
Are all ACID properties equally important for the following types of applications?
What can you do if speed is very important for your application?
Online banking
An online shop (e.g. books/media)
A social network site
Scaling Up
Issues with scaling up when the dataset is just too big
RDBMS were not designed to be distributed
Began to look at multi-node database solutions
Known as ‘scaling out’ or ‘horizontal scaling’
Different approaches include:
Master-slave
Sharding
All approaches come with their own respective problems
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
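Sharding, one of the scale-out approaches named above, can be sketched as routing each key to one of several nodes by hashing it. The names `shard_for` and `ShardedStore` are hypothetical, and the simple modulo scheme is only one possible design chosen for brevity.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard number with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


class ShardedStore:
    """Key-value store whose data is spread over several dictionaries (nodes)."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for i in range(10):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:3"))
print([len(shard) for shard in store.shards])
```

One of the problems the slide alludes to is visible here: with modulo hashing, changing the number of shards sends most keys to a different node, and a query that is not keyed (such as a name search) has to ask every shard.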
What is NoSQL?
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the concept of joins
All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)
NoSQL is best used in large distributed databases!
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
Why NoSQL?
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, need to have other data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now than even a year ago
Think about proposing a Ruby/Rails or Groovy/Grails solution now versus a couple of years ago
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
Dynamo and BigTable
Three major papers were the seeds of the NoSQL movement:
BigTable (Google)
Dynamo (Amazon)
  Gossip protocol (discovery and error detection)
  Distributed key-value data store
  Eventual consistency
CAP Theorem (discussed in a moment)
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
CAP Theorem
Three properties of a system: consistency, availability and partition tolerance
You can have at most two of these three properties for any shared-data system
To scale out, you have to partition. That leaves either consistency or availability to choose from
In almost all cases, you would choose availability over consistency
Note that this is a slightly different notion of consistency than the one we are used to from transaction systems (ACID)!
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
Availability
Traditionally, thought of as the server/process available five 9’s (99.999 %).
However, for a system with a large number of nodes, at almost any point in time there's a good chance that a node is down or that there is a network disruption among the nodes.
Want a system that is resilient in the face of network disruption
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
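The claim about large systems follows from simple arithmetic. As a rough sketch, assume every node is available with the same probability and that nodes fail independently (a simplifying assumption, and the function name is hypothetical); then the chance that at least one of n nodes is down is 1 minus the availability raised to the power n.

```python
def prob_some_node_down(node_availability, n_nodes):
    """Probability that at least one of n independent nodes is down."""
    return 1.0 - node_availability ** n_nodes


# One "five nines" server versus clusters of less reliable commodity nodes.
single = prob_some_node_down(0.99999, 1)
cluster_five_nines = prob_some_node_down(0.99999, 1000)
cluster_three_nines = prob_some_node_down(0.999, 1000)

print(f"{single:.5f} {cluster_five_nines:.5f} {cluster_three_nines:.3f}")
```

Under these assumptions, a thousand nodes that are each 99.9 % available leave roughly a 63 % chance that something is down at any given moment, which is why such a system has to tolerate failures rather than try to prevent them.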
Consistency Model
A consistency model determines rules for visibility and apparent order of updates.
For example:
Row X is replicated on nodes M and N
Client A writes row X to node N
Some period of time t elapses.
Client B reads row X from node M
Does client B see the write from client A?
Consistency is a continuum with tradeoffs
For NoSQL, the answer would be: maybe
CAP Theorem states: Strict Consistency can't be achieved at the same time as availability and partition-tolerance.
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
Eventual Consistency
When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
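The row X scenario from the consistency-model slide can be replayed in a toy simulation: a write lands on node N, a read from node M returns the old state until the update has been propagated, and afterwards all nodes agree. The classes and the explicit `sync` step are assumptions made for illustration; real systems propagate updates in the background (for example via gossip) and need more careful conflict handling.

```python
class Replica:
    """One copy of row X, with a version number for conflict resolution."""

    def __init__(self):
        self.value = None
        self.version = 0


class ReplicatedRow:
    """Row X on several nodes; writes reach other nodes only when sync() runs."""

    def __init__(self, names):
        self.nodes = {name: Replica() for name in names}
        self.pending = []  # (version, value) updates not yet propagated
        self.clock = 0

    def write(self, node, value):
        self.clock += 1
        replica = self.nodes[node]
        replica.value, replica.version = value, self.clock
        self.pending.append((self.clock, value))

    def read(self, node):
        return self.nodes[node].value

    def sync(self):
        """Propagate pending updates; the highest version wins on every node."""
        for version, value in sorted(self.pending):
            for replica in self.nodes.values():
                if version > replica.version:
                    replica.value, replica.version = value, version
        self.pending.clear()


row = ReplicatedRow(["M", "N"])
row.write("N", "new")   # client A writes row X to node N
stale = row.read("M")   # client B reads from node M before propagation
row.sync()              # the period of time t has elapsed
fresh = row.read("M")   # now node M has caught up
print(stale, fresh)
```

Whether client B sees the write therefore depends on timing, which is the "maybe" from the slide; eventual consistency only promises that the second read will succeed once updates stop and propagation completes.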
What kinds of NoSQL
NoSQL solutions fall into two major areas:
Key/Value or 'the big hash table':
  Amazon S3 (Dynamo)
  Voldemort
  Scalaris
Schema-less, which comes in multiple flavors (column-based, document-based or graph-based):
  Cassandra (column-based)
  CouchDB (document-based)
  Neo4J (graph-based)
  HBase (column-based)
From: Perry Hoekstra. NoSQL. www.intertech.com/resource/usergroup/NoSQL.ppt
So, can you now answer:
p. 26, table 2.4: relational databases come out badly in the comparison,
so why are they used so much?
Data mining/information retrieval and Linked Data?
Crowdsourcing:
Unstructured / semi-structured information → Structured data
DM and IR:
Unstructured / semi-structured information → Structured data
… and vice versa: LOD as a data source for DM!
NoSQL and Linked Data ?
"RDF database systems are the only standardized NoSQL solutions available at the moment, being built on a simple, uniform data model and a powerful, declarative query language."
http://blog.datagraph.org/2010/04/rdf-nosql-diff
More ideas:
http://webofdata.wordpress.com/2011/05/02/nosql-linked-data-processing/
NoSQL and Data Mining / Information Retrieval ?
Indeed! Since scalability is a huge issue!
More in Advanced Databases and Text-Based Information Retrieval, where you'll work with such systems (and, if you want, use LOD …)