B4M36DS2, BE4M36DS2: Database Systems 2
http://www.ksi.mff.cuni.cz/~svoboda/courses/181-B4M36DS2/

Lecture 2: Data Formats
Martin Svoboda
8. 10. 2018

Charles University, Faculty of Mathematics and Physics
Czech Technical University in Prague, Faculty of Electrical Engineering



Lecture Outline

Data formats
• XML – Extensible Markup Language
• JSON – JavaScript Object Notation
• BSON – Binary JSON
• RDF – Resource Description Framework
• CSV – Comma-Separated Values
• Protocol Buffers

B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 8. 10. 2018 2


XML – Extensible Markup Language


Introduction

XML = Extensible Markup Language
• Representation and interchange of semi-structured data
    + a family of related technologies, languages, specifications, …
• Derived from SGML, developed by W3C, started in 1996
• Design goals
    Simplicity, generality and usability across the Internet
• File extension: *.xml, content type: text/xml
• Versions: 1.0 and 1.1
• W3C recommendation
    http://www.w3.org/TR/xml11/


Example

<?xml version="1.1" encoding="UTF-8"?>
<movie year="2007">
  <title>Medvídek</title>
  <actors>
    <actor>
      <firstname>Jiří</firstname>
      <lastname>Macháček</lastname>
    </actor>
    <actor>
      <firstname>Ivan</firstname>
      <lastname>Trojan</lastname>
    </actor>
  </actors>
  <director>
    <firstname>Jan</firstname>
    <lastname>Hřebejk</lastname>
  </director>
</movie>
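
As a rough illustration of how such a document can be consumed, the sketch below reads the movie with Python's standard xml.etree.ElementTree module. The helper names (full_name, summarize_movie) are made up for this example, and the XML declaration is left out on purpose so that the snippet does not depend on how a particular parser treats version="1.1".

```python
import xml.etree.ElementTree as ET

# Copy of the slide's movie document without the XML declaration
# (an assumption made to keep the snippet parser-independent).
MOVIE_XML = """<movie year="2007">
  <title>Medvídek</title>
  <actors>
    <actor><firstname>Jiří</firstname><lastname>Macháček</lastname></actor>
    <actor><firstname>Ivan</firstname><lastname>Trojan</lastname></actor>
  </actors>
  <director><firstname>Jan</firstname><lastname>Hřebejk</lastname></director>
</movie>"""


def full_name(person):
    """Join the firstname and lastname child elements of a person element."""
    return f"{person.findtext('firstname')} {person.findtext('lastname')}"


def summarize_movie(xml_text):
    """Collect attribute, text and nested-element data into a plain dict."""
    root = ET.fromstring(xml_text)            # root element: <movie>
    return {
        "title": root.findtext("title"),      # text content of <title>
        "year": int(root.get("year")),        # attribute values are strings
        "actors": [full_name(a) for a in root.find("actors")],
        "director": full_name(root.find("director")),
    }


summary = summarize_movie(MOVIE_XML)
print(summary)
```

Note that the year arrives as a string because XML attributes carry no type information; the conversion to int is a choice made by the reading application.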


Document Structure

Document
• Prolog: XML version + some other stuff
• Exactly one root element
    Contains other nested elements and/or other content

Syntax: <?xml version="version" ... ?> ... element

Example

<?xml version="1.1" encoding="UTF-8"?>
<movie>
  ...
</movie>


Constructs

Element
• Marked using opening and closing tags
    … or just an abbreviated tag in case of empty elements
• Each element can be associated with a set of attributes

Syntax:
    <name attribute ...> element content </name>
    <name attribute ... />

Examples
    <title>...</title>
    <actors/>


Constructs

Types of element content
• Empty content
• Text content
• Element content
    Sequence of nested elements
• Mixed content
    Elements arbitrarily interleaved with text values

Syntax: content is text, a sequence of elements, or elements interleaved with text


Constructs

Attribute
• Name-value pair

Syntax: name="value"

Escaping sequences (predefined entities)
• Used within values of attributes or text content of elements
• E.g.:
    &lt; for <
    &gt; for >
    &quot; for "
    …
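
The effect of the predefined entities can be checked with a short round trip. This is a minimal sketch using the standard xml.sax.saxutils helpers; the sample string and the element name note are arbitrary choices for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = '1 < 2 & "x" > y'

# escape() always replaces &, < and >; the extra mapping adds &quot;.
escaped = escape(raw, {'"': "&quot;"})
print(escaped)

# A parser resolves the entities back to the original characters.
parsed = ET.fromstring(f"<note>{escaped}</note>").text

# unescape() is the inverse helper and takes the reversed extra mapping.
restored = unescape(escaped, {"&quot;": '"'})
```

In practice a serializer usually applies this escaping automatically; calling escape by hand is mostly needed when XML text is assembled through string formatting, as in the f-string above.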


XML Conclusion

XML constructs
• Basic: element, attribute, text
• Additional: comment, processing instruction, …

Schema languages
• DTD, XSD (XML Schema), RELAX NG, Schematron

Query languages
• XPath, XQuery, XSLT

XML formats = particular languages
• XSD, XSLT, XHTML, DocBook, ePUB, SVG, RSS, SOAP, …


JSON – JavaScript Object Notation


Introduction

JSON = JavaScript Object Notation
• Open standard for data interchange
• Design goals
    Simplicity: text-based, easy to read and write
    Universality: object and array data structures
      – Supported by the majority of modern programming languages
      – Based on conventions of the C-family of languages
        (C, C++, C#, Java, JavaScript, Perl, Python, …)
• Derived from JavaScript (but language independent)
• Started in 2002
• File extension: *.json
• Content type: application/json
• http://www.json.org/


Example

{
  "title" : "Medvídek",
  "year" : 2007,
  "actors" : [
    {
      "firstname" : "Jiří",
      "lastname" : "Macháček"
    },
    {
      "firstname" : "Ivan",
      "lastname" : "Trojan"
    }
  ],
  "director" : {
    "firstname" : "Jan",
    "lastname" : "Hřebejk"
  }
}
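
To show how directly this structure maps onto common language types, here is a small sketch that loads the document with Python's standard json module. The variable names are only illustrative.

```python
import json

MOVIE_JSON = """
{
  "title": "Medvídek",
  "year": 2007,
  "actors": [
    {"firstname": "Jiří", "lastname": "Macháček"},
    {"firstname": "Ivan", "lastname": "Trojan"}
  ],
  "director": {"firstname": "Jan", "lastname": "Hřebejk"}
}
"""

movie = json.loads(MOVIE_JSON)   # objects become dicts, arrays become lists

actor_names = [f"{a['firstname']} {a['lastname']}" for a in movie["actors"]]
director_lastname = movie["director"]["lastname"]

# Serialize back; ensure_ascii=False keeps the Czech characters readable.
compact = json.dumps(movie, ensure_ascii=False, separators=(",", ":"))
print(compact)
```

Unlike the XML variant of the same data, the year is already a number after parsing, because JSON distinguishes strings from numbers in its syntax.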


Data Structure

Object
• Unordered collection of name-value pairs (properties)
    Corresponds to structures such as objects, records, structs,
    dictionaries, hash tables, keyed lists, associative arrays, …
• Values can be of different types, names should be unique

Syntax: { string : value , ... }

Examples
• { "name" : "Ivan Trojan", "year" : 1964 }
• { }


Data Structure

Array
• Ordered collection of values
    Corresponds to structures such as arrays, vectors, lists, sequences, …
• Values can be of different types, duplicate values are allowed

Syntax: [ value , ... ]

Examples
• [ 2, 7, 7, 5 ]
• [ "Ivan Trojan", 1964, -5.6 ]
• [ ]


Data Structure

Value
• Unicode string
    Enclosed with double quotes
    Backslash escaping sequences
    Example: "a \n b \" c \\ d"
• Number
    Decimal integers or floats
    Examples: 1, -0.5, 1.5e3
• Nested object
• Nested array
• Boolean value: true, false
• Missing information: null

Syntax: value = string | number | object | array | true | false | null
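
The value kinds listed above can be observed by parsing one array that contains each of them. The following sketch relies on the type mapping of Python's json module; other languages map the same values to their own types.

```python
import json

# One array holding every kind of JSON value from the slide.
text = r'["a \n b \" c \\ d", 1, -0.5, 1.5e3, {"k": []}, true, false, null]'

values = json.loads(text)
type_names = [type(v).__name__ for v in values]

for value, name in zip(values, type_names):
    print(f"{name:8} {value!r}")
```

The backslash sequences are resolved during parsing, so the first value contains a real line break, one double quote and one backslash.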


JSON Conclusion

JSON constructs
• Collections: object, array
• Scalar values: string, number, boolean, null

Schema languages
• JSON Schema

Query languages
• JSONiq, JMESPath, JAQL, …


BSON – Binary JSON


Introduction

BSON = Binary JSON
• Binary-encoded serialization of JSON documents
    Extends the set of basic data types of values offered by JSON
    (such as a string, …) with a few new specific ones
• Design characteristics: lightweight, traversable, efficient
• Used by MongoDB
    Document NoSQL database for JSON documents
    Data storage and network transfer format
• File extension: *.bson
• http://bsonspec.org/


Example

JSON

{
  "title" : "Medvídek",
  "year" : 2007
}

BSON

24 00 00 00
02 74 69 74 6C 65 00 0A 00 00 00 4D 65 64 76 C3 AD 64 65 6B 00
10 79 65 61 72 00 D7 07 00 00
00
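
The byte sequence above can be reproduced with a few lines of code. The following is a minimal hand-written sketch, not a real BSON library: it covers only strings (selector 02) and 32-bit integers (selector 10), and the function names are invented for this illustration.

```python
import struct


def encode_cstring(text):
    """Property name: UTF-8 bytes followed by a terminating 00 byte."""
    return text.encode("utf-8") + b"\x00"


def encode_element(name, value):
    """Encode one property as type selector + name + value."""
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # 02 = string: int32 byte length (including the 00), then the bytes
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int) and not isinstance(value, bool):
        # 10 = 32-bit integer, stored little-endian
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


def encode_document(properties):
    """Encode a flat dict as size + elements + terminating 00."""
    body = b"".join(encode_element(k, v) for k, v in properties.items())
    size = 4 + len(body) + 1          # the size field counts itself and the 00
    return struct.pack("<i", size) + body + b"\x00"


encoded = encode_document({"title": "Medvídek", "year": 2007})
print(encoded.hex(" ").upper())
```

Two details are easy to miss when reading the hex dump: all integers are little-endian (2007 appears as D7 07 00 00), and the string length 0A counts UTF-8 bytes plus the trailing 00, so the two-byte í makes it 10 rather than 9.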


Document Structure

Document = serialization of one JSON object or array
• JSON object is serialized directly
• JSON array is first transformed to a JSON object
    Property names derived from numbers of positions
    E.g.: [ "Trojan", "Svěrák" ] → { "0" : "Trojan", "1" : "Svěrák" }
• Structure
    Document size (total number of bytes)
    Sequence of elements (encoded JSON properties)
    Terminating hexadecimal 00 byte

Syntax: int32 element ... 00


Document Structure

Element = serialization of one JSON property

Syntax (one alternative per type selector):
    02 name int32 byte ... 00
    01 name double
    10 name int32
    12 name int64
    03 name document
    04 name document
    08 name ( 00 | 01 )
    0A name
    09 name int64
    11 name int64
    ...


Document Structure

Element = serialization of one JSON property
• Structure
    Type selector
      – 02 (string)
      – 01 (double), 10 (32-bit integer), 12 (64-bit integer)
      – 03 (object), 04 (array)
      – 08 (boolean)
      – 0A (null)
      – 09 (datetime), 11 (timestamp)
      – …
    Property name
      – Unicode string terminated by 00
        Syntax: byte ... 00
    Property value
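
The element layout can also be read in the opposite direction. Below is a hedged sketch of a decoder that walks a document element by element and dispatches on the type selector. It handles the selectors 01, 02, 03, 04, 08, 0A, 10 and 12 only (datetime 09 and timestamp 11 are left out), and the function names are assumptions of this example rather than any library's API.

```python
import struct


def read_cstring(buf, pos):
    """Read bytes up to the terminating 00; return (text, next position)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def decode_document(buf, pos=0, as_array=False):
    """Decode one document starting at pos; return (value, next position)."""
    (size,) = struct.unpack_from("<i", buf, pos)
    end = pos + size
    pos += 4
    result = {}
    while buf[pos] != 0x00:                   # 00 terminates the element list
        selector = buf[pos]
        name, pos = read_cstring(buf, pos + 1)
        if selector == 0x02:                  # string: int32 length, bytes, 00
            (length,) = struct.unpack_from("<i", buf, pos)
            value = buf[pos + 4:pos + 4 + length - 1].decode("utf-8")
            pos += 4 + length
        elif selector == 0x01:                # double
            (value,) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif selector == 0x10:                # 32-bit integer
            (value,) = struct.unpack_from("<i", buf, pos)
            pos += 4
        elif selector == 0x12:                # 64-bit integer
            (value,) = struct.unpack_from("<q", buf, pos)
            pos += 8
        elif selector in (0x03, 0x04):        # nested object or array
            value, pos = decode_document(buf, pos, as_array=(selector == 0x04))
        elif selector == 0x08:                # boolean: 00 or 01
            value = buf[pos] == 0x01
            pos += 1
        elif selector == 0x0A:                # null: no value bytes follow
            value = None
        else:
            raise ValueError(f"selector {selector:#04x} not covered by this sketch")
        result[name] = value
    if pos + 1 != end:
        raise ValueError("document size does not match its content")
    # Arrays are stored as objects keyed "0", "1", ...; keep only the values.
    return (list(result.values()) if as_array else result), end


EXAMPLE = bytes.fromhex(
    "24 00 00 00 "
    "02 74 69 74 6C 65 00 0A 00 00 00 4D 65 64 76 C3 AD 64 65 6B 00 "
    "10 79 65 61 72 00 D7 07 00 00 "
    "00"
)
document, _ = decode_document(EXAMPLE)
print(document)
```

Because every document and every string carries its length up front, a reader can skip over values it is not interested in without parsing them, which is what the slide's "traversable" design characteristic refers to.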


RDF – Resource Description Framework


Introduction

RDF = Resource Description Framework
• Language for representing information about resources in the World Wide Web
    + a family of related technologies, languages, specifications, …
    Used in the context of the Semantic Web, Linked Data, …
• Developed by W3C
• Started in 1997
• Versions: 1.0 and 1.1
• W3C recommendations
    https://www.w3.org/TR/rdf11-concepts/
      – Concepts and Abstract Syntax
    https://www.w3.org/TR/rdf11-mt/
      – Semantics


Statements

Resource
• Any real-world entity
    Referents = resources identified by IRI
      – E.g. physical things, documents, abstract concepts, …
    Values = resources for literals
      – E.g. numbers, strings, …

Statement about resources = one RDF triple
• Three components: subject, predicate, and object

Examples

<http://db.cz/movies/medvidek> <http://db.cz/terms#actor> <http://db.cz/actors/trojan> .

<http://db.cz/movies/medvidek> <http://db.cz/terms#year> "2007" .


Statements

Triple components
• Subject
    Describes a resource the given statement is about
    IRI or blank node identifier
• Predicate
    Describes the property or characteristic of the subject
    IRI
• Object
    Describes the value of that property
    IRI or blank node identifier or literal

Although triples are inspired by natural languages, they have nothing to do with processing of natural languages


Example

<http://db.cz/movies/medvidek> <http://db.cz/terms#actor> <http://db.cz/actors/machacek> .
<http://db.cz/movies/medvidek> <http://db.cz/terms#actor> <http://db.cz/actors/trojan> .
<http://db.cz/movies/medvidek> <http://db.cz/terms#year> "2007" .
<http://db.cz/movies/medvidek> <http://db.cz/terms#director> _:n18 .
_:n18 <http://db.cz/terms#firstname> "Jan" .
_:n18 <http://db.cz/terms#lastname> "Hřebejk" .
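
One way to get a feel for triples is to treat them as plain tuples and look statements up by pattern. The sketch below is a hypothetical in-memory representation, not an RDF library: each component is kept exactly as written in the example, and None acts as a wildcard in the match helper.

```python
MOVIE = "<http://db.cz/movies/medvidek>"
TERMS = "http://db.cz/terms#"

# One (subject, predicate, object) tuple per statement.
TRIPLES = [
    (MOVIE, f"<{TERMS}actor>", "<http://db.cz/actors/machacek>"),
    (MOVIE, f"<{TERMS}actor>", "<http://db.cz/actors/trojan>"),
    (MOVIE, f"<{TERMS}year>", '"2007"'),
    (MOVIE, f"<{TERMS}director>", "_:n18"),
    ("_:n18", f"<{TERMS}firstname>", '"Jan"'),
    ("_:n18", f"<{TERMS}lastname>", '"Hřebejk"'),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every component that is not None."""
    pattern = (subject, predicate, obj)
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


actors = [o for _, _, o in match(TRIPLES, subject=MOVIE, predicate=f"<{TERMS}actor>")]

# Follow the blank node: movie -> director -> lastname.
director = match(TRIPLES, subject=MOVIE, predicate=f"<{TERMS}director>")[0][2]
director_lastname = match(TRIPLES, subject=director, predicate=f"<{TERMS}lastname>")[0][2]
print(actors, director_lastname)
```

The director has no IRI of its own, so the blank node label _:n18 is the only link between the movie and the two name statements.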


Identifiers and Literals

IRI = Internationalized Resource Identifier
• Absolute (not relative) IRIs with optional fragment identifiers
• RFC 3987
• Unicode characters
• Examples
    http://db.cz/movies/medvidek
    http://db.cz/terms#actor
    mailto:[email protected]
    urn:issn:0167-6423
• URLs are often used in practice → information about the given resources is then intended to be published / retrieved via standard HTTP


Identifiers and Literals

Literals
• Plain values
    E.g.: "Medvídek", "2007"
• Typed values
    E.g.: "Medvídek"^^xs:string, "2007"^^xs:integer
    XML Schema simple data types are adopted and used
• Strings with language tags
    E.g.: "Medvídek"@cs
• Types and language tags cannot be mutually combined


Identifiers and Literals

Blank node identifiers
• Blank nodes (anonymous resources)
    Allow expressing statements about resources without explicitly naming (identifying) them
• Blank node identifiers only have local scope of validity
    E.g. within a given file, query expression, …
• Particular syntax depends on a serialization format
    E.g.: _:node18


Data Model

Directed labeled multigraph
• Vertices
    One vertex for each IRI or literal value
• Edges
    One edge for each individual triple
    Edges are directed: subject --predicate--> object
    Property names (predicate IRIs) are used as edge labels
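
The graph view can be derived mechanically from a list of triples. The following sketch builds a simple adjacency structure; the abbreviated labels (m:, i:, a:) are shorthand chosen for this example, and the blank node is treated as a vertex like any other term, which is an assumption of the sketch.

```python
from collections import defaultdict

TRIPLES = [
    ("m:medvidek", "i:actor", "a:machacek"),
    ("m:medvidek", "i:actor", "a:trojan"),
    ("m:medvidek", "i:year", '"2007"'),
    ("m:medvidek", "i:director", "_:n18"),
    ("_:n18", "i:firstname", '"Jan"'),
    ("_:n18", "i:lastname", '"Hřebejk"'),
]


def build_graph(triples):
    """Return (vertices, outgoing edges) for a list of triples."""
    vertices = set()
    out_edges = defaultdict(list)      # subject -> [(predicate label, object)]
    for subject, predicate, obj in triples:
        vertices.add(subject)
        vertices.add(obj)
        # A list keeps parallel edges, so the structure is a multigraph.
        out_edges[subject].append((predicate, obj))
    return vertices, dict(out_edges)


vertices, out_edges = build_graph(TRIPLES)
edge_count = sum(len(edges) for edges in out_edges.values())
print(len(vertices), "vertices,", edge_count, "edges")
```

Storing edges in a list rather than a set of vertex pairs matters: two resources may be connected by several predicates at once, and each of those triples has to stay a separate edge.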


Example

[Figure: the example triples drawn as a directed labeled graph]


Serialization

Available approaches
• N-Triples notation
    https://www.w3.org/TR/n-triples/
• Turtle notation (Terse RDF Triple Language)
    https://www.w3.org/TR/turtle/
• RDF/XML notation
    XML syntax for RDF
    https://www.w3.org/TR/rdf-syntax-grammar/
• JSON-LD notation
    JSON-based serialization for Linked Data
    https://www.w3.org/TR/json-ld/
• …


N-Triples Notation

RDF N-Triples notation = A line-based syntax for an RDF graph
• Simple, line-based, plain text format
• File extension: *.nt
• https://www.w3.org/TR/n-triples/

Example
• Already presented…

Document
• Statements are terminated by dots, delimited by EOL

Syntax: triple ( 0D 0A triple ) ...


N-Triples Notation

Statement
• Individual triple components are delimited by spaces

Syntax: subject predicate object .

Triple components: subject, predicate, object

    subject   = IRI reference | blank node id
    predicate = IRI reference
    object    = IRI reference | blank node id | literal


N-Triples Notation

IRI reference
• IRIs are enclosed in angle brackets

Syntax: <IRI>

Blank node identifier

Syntax: _:label

Literal
• Literals are enclosed in double quotes

Syntax: "value", optionally followed by ^^ IRI reference or by @ language tag
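
These syntax rules are simple enough to be written down as a handful of formatting helpers. The sketch below produces N-Triples lines from Python values; the helper names are invented for the example, only a few common escape sequences are handled, and the datatype IRI is the XML Schema integer type.

```python
def iri(value):
    """IRI reference: the IRI enclosed in angle brackets."""
    return f"<{value}>"


def blank(label):
    """Blank node identifier."""
    return f"_:{label}"


def literal(value, datatype=None, language=None):
    """Quoted literal with an optional datatype IRI or language tag."""
    if datatype and language:
        raise ValueError("a literal takes a datatype or a language tag, not both")
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    text = f'"{escaped}"'
    if datatype:
        return f"{text}^^{iri(datatype)}"
    if language:
        return f"{text}@{language}"
    return text


def statement(subject, predicate, obj):
    """One triple: components separated by spaces, terminated by a dot."""
    return f"{subject} {predicate} {obj} ."


movie = iri("http://db.cz/movies/medvidek")
lines = [
    statement(movie, iri("http://db.cz/terms#year"), literal("2007")),
    statement(movie, iri("http://db.cz/terms#director"), blank("n18")),
    statement(blank("n18"), iri("http://db.cz/terms#lastname"), literal("Hřebejk")),
]
print("\n".join(lines))
```

Since every statement is self-contained on its own line, such output can be concatenated, split or filtered with ordinary line-oriented tools.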


Turtle Notation

Turtle = Terse RDF Triple Language
• Compact text format, various abbreviations for common usage patterns
• File extension: *.ttl
• Content type: text/turtle
• https://www.w3.org/TR/turtle/

Example

@prefix i: <http://db.cz/terms#> .
@prefix m: <http://db.cz/movies/> .
@prefix a: <http://db.cz/actors/> .
m:medvidek
  i:actor a:machacek , a:trojan ;
  i:year "2007" ;
  i:director [ i:firstname "Jan" ; i:lastname "Hřebejk" ] .


Turtle Notation

Document
• Contains a sequence of triples and/or declarations
• Prefix declarations
    Prefixed names can then be used instead of full IRI references
• Groups of triples
    Individual groups are terminated by dots

Syntax: ( @prefix prefix : IRI reference . | triples . ) ...


Turtle Notation

Triples
• Triples sharing the same subject and predicate, or at least the same subject, can be grouped together
    object list for a shared subject and predicate
    predicate-object list for a shared subject
• Brackets can be used to define blank nodes

Syntax:
    subject predicate object , object ... ; predicate object , ... .
    [ predicate object , object ... ; predicate object , ... ] predicate object , ... ; ... .


Turtle Notation

Triple components: subject, predicate, object

    subject   = IRI reference | prefixed name | blank node id
    predicate = IRI reference | prefixed name
    object    = IRI reference | prefixed name | blank node id | literal | [ predicate-object list ]

IRI reference / prefixed name

Syntax: <IRI> or prefix:local name
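
The abbreviations are purely syntactic, which a small expansion routine makes visible. This is a sketch under simplifying assumptions: terms are already tokenized, the prefix table is given as a dict, and the expand helper is a made-up name rather than part of any Turtle parser.

```python
PREFIXES = {
    "i": "http://db.cz/terms#",
    "m": "http://db.cz/movies/",
    "a": "http://db.cz/actors/",
}


def expand(term, prefixes):
    """Turn prefix:local into <full IRI>; leave other terms unchanged."""
    if term.startswith(("<", "_:", '"')):
        return term                      # IRI reference, blank node or literal
    prefix, separator, local = term.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"unknown prefixed name: {term}")
    return f"<{prefixes[prefix]}{local}>"


# m:medvidek i:actor a:machacek , a:trojan ; i:year "2007" .
subject = "m:medvidek"
predicate_objects = [
    ("i:actor", ["a:machacek", "a:trojan"]),   # object list
    ("i:year", ['"2007"']),
]

triples = [
    (expand(subject, PREFIXES), expand(p, PREFIXES), expand(o, PREFIXES))
    for p, objects in predicate_objects
    for o in objects
]
for triple in triples:
    print(" ".join(triple), ".")
```

The two grouping constructs unfold into three full statements here, identical to what the line-based notation would spell out one by one.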


Turtle Notation

Literal
• Traditional literals + new abbreviated forms of numeric and boolean literals

Syntax:
    "value", optionally followed by ^^ ( IRI reference | prefixed name ) or by @ language tag
    true | false
    integer | decimal | double


Example

Example revisited

@prefix i: <http://db.cz/terms#> .
@prefix m: <http://db.cz/movies/> .
@prefix a: <http://db.cz/actors/> .
m:medvidek
  i:actor a:machacek , a:trojan ;
  i:year "2007" ;
  i:director [ i:firstname "Jan" ; i:lastname "Hřebejk" ] .


RDF Conclusion

RDF statements
• Subject, predicate, and object components

Schema languages
• RDFS (RDF Schema)
• OWL (Web Ontology Language)

Query languages
• SPARQL (SPARQL Protocol and RDF Query Language)


CSV – Comma-Separated Values


Introduction

CSV = Comma-Separated Values
• Unfortunately not fully standardized
    Different field separators (commas, semicolons)
    Different escaping sequences
    No encoding information
• RFC 4180, RFC 7111
• File extension: *.csv
• Content type: text/csv


Example

firstname,lastname,year
Ivan,Trojan,1964
Jiří,Macháček,1966
Jitka,Schneiderová,1973
Zdeněk,Svěrák,1936
Anna,Geislerová,1976
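
A short sketch of reading this file with Python's standard csv module follows. The header line supplies the field names; everything else, including the decision to treat year as an integer, is up to the reading program.

```python
import csv
import io

CSV_TEXT = """\
firstname,lastname,year
Ivan,Trojan,1964
Jiří,Macháček,1966
Jitka,Schneiderová,1973
Zdeněk,Svěrák,1936
Anna,Geislerová,1976
"""

# DictReader uses the first record as the header.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# All values arrive as strings; convert explicitly where numbers are needed.
oldest = min(rows, key=lambda row: int(row["year"]))
print(oldest["firstname"], oldest["lastname"], oldest["year"])
```

The file itself carries neither types nor an encoding declaration, so both have to be agreed on outside the data.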


Document Structure

Document
• Optional header + list of records

Syntax: [ name , name ... 0D 0A ] record ( 0D 0A record ) ...

Record
• Comma separated list of fields

Syntax: value , value ...
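
Writing records raises the question of what happens when a value itself contains the separator. The sketch below uses Python's csv.writer with its default dialect, which ends records with 0D 0A and wraps problematic fields in double quotes; the to_csv helper and the sample values are made up for the illustration.

```python
import csv
import io


def to_csv(header, records, delimiter=","):
    """Serialize a header and a list of records into CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter)  # records end with \r\n
    writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue()


header = ["name", "quote"]
records = [["Trojan, Ivan", 'said "hello"']]

with_commas = to_csv(header, records)
with_semicolons = to_csv(header, records, delimiter=";")
print(repr(with_commas))
print(repr(with_semicolons))
```

With a semicolon as the separator the first field no longer needs quoting, which illustrates why a reader has to know the dialect a file was written in.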


Protocol Buffers


Introduction

Protocol Buffers
• Extensible mechanism for serializing structured data
    Used in communication protocols, data storage, …
• Design goals
    Language-neutral, platform-neutral
    Small, fast, simple
• Developed (and widely used) by Google
• Started internally in 2001, publicly released in 2008
• Versions: proto2, proto3
• File extension: *.proto
• https://developers.google.com/protocol-buffers/
• Real-world usage: Riak KV, HBase


Introduction

Intended usage
• Schema creation → automatic source code generation → sending messages between applications

Components
• Interface description language
• Source code generator (protoc compiler)
    Supported languages
      – Official: C++, C#, Java, Python, Ruby, …
      – 3rd party: Perl, PHP, Scala, …
• Binary serialization format
    Compact, not self-describing


Example

syntax = "proto3";
message Actor {
  string firstname = 1;
  string lastname = 2;
}
message Movie {
  string title = 1;
  int32 year = 16;
  repeated Actor actors = 17;
  enum Genre {
    UNKNOWN = 0;
    COMEDY = 1;
    FAMILY = 2;
  }
  repeated Genre genres = 2048;
}


Schema Structure

Schema
• One schema may contain multiple message descriptions
    Other constructs are allowed as well, e.g. enumerations

Syntax: ... ( message | enum ) ...

Message
• Represents a small logical record of information
    Defines a set of uniquely numbered fields
    Nested messages or enumerations are allowed too

Syntax: message name { ( field | message | enum ) ... }


Schema Structure

Field
• Describes one data value

Syntax: [ repeated ] type name = tag ;

• Rule – allowed number of value occurrences
    Default = 0 or 1 value
    repeated = 0 or more values (i.e. an arbitrary number)
      – The order of individual values is preserved
• …


Schema Structure

Field
• Type
    Atomic: int32, int64, double, string, bool, bytes, …
      – Mappings to data types of particular programming languages as well as default values are introduced
    Composed: messages, enumerations, …
• Name – name of a given field
• Tag – internal integer identifier
    Used to identify individual fields of a message in a binary format
    Frequently used fields should be assigned lower tags
      – Since fewer bytes will then be needed
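
The advice about low tags can be made concrete by computing how many bytes a field key occupies on the wire. The sketch below follows the documented Protocol Buffers encoding, where each field starts with a varint holding (tag << 3) | wire type; wire type 0 stands for varint values and 2 for length-delimited values. The function names are chosen for this example and do not come from the protobuf library.

```python
def encode_varint(value):
    """Base-128 varint: 7 payload bits per byte, high bit = more bytes follow."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def field_key(tag, wire_type):
    """Key that precedes every encoded field value."""
    return encode_varint((tag << 3) | wire_type)


# Tags taken from the Movie message of the example schema.
for name, tag, wire_type in [
    ("title", 1, 2),       # string: length-delimited
    ("year", 16, 0),       # int32: varint
    ("actors", 17, 2),     # nested message: length-delimited
    ("genres", 2048, 2),   # repeated enum, packed: length-delimited
]:
    key = field_key(tag, wire_type)
    print(f"{name:7} tag {tag:4} -> key {key.hex(' ').upper()} ({len(key)} byte(s))")
```

Tags 1 to 15 fit into a single key byte, tags 16 to 2047 need two, and 2048 is the first tag that needs three, which is presumably why the example schema uses exactly the values 16, 17 and 2048.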


Schema Structure

Enumeration
• Description of a predefined list of values
• The first item is considered to be the default value and its value must be equal to 0

Syntax: enum name { item = constant ; ... }

A few other constructs are available too (e.g. maps)


Lecture Conclusion

Data formats
• Tree: XML, JSON
• Graph: RDF
• Relational: CSV

Binary serializations
• BSON, Protocol Buffers
