Document Oriented Databases and Text Processing DBSEM 2013/14: “Beauty is our Business” TU Berlin 18.12.2013 Moritz Platt


The Early Days of the Relational Model

• Developed in the 1960s and 1970s
• Mainly responsible: Edgar F. Codd
• System R, IBM's research prototype based on his work, was the first implementation of SQL and paved the way for commercial products such as DB2
• The query language SQL and the relational model have remained largely unchanged for more than 40 years

Why was this model so successful?


The Success of the Relational Model

Reliability
• Atomicity
• Consistency
• Isolation
• Durability

Normalisation
• Removing redundancies
• Data integrity
• Saving of disk space

Declarative query language
• Easy to learn
• Additional abstraction layer
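To make the atomicity point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the amounts are hypothetical and chosen only for illustration. A transfer consists of two updates; if the second one violates a constraint, the first one is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: balances must never become negative.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Credit target and debit source; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
```

After the failed transfer, the credit to bob has been undone together with the rejected debit, so the balances are the same as before the call.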


The Relational Model Meets “Big Data”

• Volume: limited volume → large volume
• Velocity: static data → streaming data
• Variety: rigidly structured data → semi-structured data
• Veracity: controlled data quality → uncertain data quality

[Timeline figure, 1970 to 2010. Systems shown: System R/SQL, DB2, Apache Jackrabbit, CouchDB, Hadoop, Neo4j, MongoDB]


Paradigms of Storing Data

Relational
• Rigidly structured (explicit schemas, explicitly defined relationships)

Document-based
• XML based: structured (explicit or implicit schemas)
• JSON based: weakly structured (no schemas)


Querying Relational and Document Based Databases

Let's compare two queries retrieving the names of the CEOs of public companies in the tech industry with a share price above $100.

Imperative MongoDB query

    db.companies.find(
      { industry : 'technology', stockPrice : { $gt : 100 } },
      { _id : 0, ceoName : 1 }
    )

Declarative SQL query

    SELECT CEO_NAME
    FROM COMPANIES
    WHERE INDUSTRY = 'technology' AND STOCK_PRICE > 100


Querying Relational and Document Based Databases

MongoDB result set

    { 'ceoName' : 'Rometty' }
    { 'ceoName' : 'Chênevert' }
    { 'ceoName' : 'Thulin' }
    { 'ceoName' : 'McNerney' }

Imperative languages express how to retrieve the results
• More structural knowledge necessary

SQL result set

    Rometty
    Chênevert
    Thulin
    McNerney

Declarative languages define the expected result
• Allows for more abstraction (views, triggers, etc.)

• Similar query structure
• Underlying structured data is indispensable
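The contrast between stating how to retrieve results and stating which results are expected can be sketched in Python. This is only an illustration under assumptions: the company records, names and prices below are invented, the hand-written loop stands in for the "how" side and is not the MongoDB API, and the SQL runs against an in-memory SQLite database rather than a production system.

```python
import sqlite3

# Hypothetical sample records; every value here is made up for illustration.
COMPANIES = [
    {"name": "Acme Systems", "industry": "technology", "stockPrice": 180.0, "ceoName": "Ada"},
    {"name": "Bolt Devices", "industry": "technology", "stockPrice": 95.0, "ceoName": "Grace"},
    {"name": "Cedar Foods", "industry": "food", "stockPrice": 240.0, "ceoName": "Linus"},
    {"name": "Delta Cloud", "industry": "technology", "stockPrice": 130.5, "ceoName": "Edsger"},
]


def ceo_names_imperative(docs):
    """Spell out the retrieval: visit each record, test it, collect one field."""
    names = []
    for doc in docs:
        if doc.get("industry") == "technology" and doc.get("stockPrice", 0) > 100:
            names.append(doc["ceoName"])
    return names


def ceo_names_declarative(docs):
    """Describe the expected result in SQL and let the engine decide how to get it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE COMPANIES (NAME TEXT, INDUSTRY TEXT, STOCK_PRICE REAL, CEO_NAME TEXT)"
    )
    conn.executemany(
        "INSERT INTO COMPANIES VALUES (:name, :industry, :stockPrice, :ceoName)", docs
    )
    rows = conn.execute(
        "SELECT CEO_NAME FROM COMPANIES "
        "WHERE INDUSTRY = 'technology' AND STOCK_PRICE > 100 "
        "ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Both functions return the same names for the sample records. The loop has to know how the records are laid out, whereas the SQL version depends on the table having a fixed, known structure.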


However, a lot of highly relevant data is textual and not formally structured.

• Traditional media
• Government data (example shown: Directive 2005/36/EC of the European Parliament and of the Council of 7 September 2005 on the recognition of professional qualifications)
• Social media
• … and much, much more …


Structuring Text Data

• A declarative query logic requires structured data
• Declarative Information Extraction creates a structured representation of text data

Supervised
• Declarative Information Extraction
• Classification

Weakly supervised
• Active Learning
• Distant Supervision

Unsupervised
• Relation Extraction

[Kilias:2013:IID:2513190.2513196]


Early Information Extraction Systems

• Early Information Extraction systems
  • 1970s: FRUMP (later ATRANS)
  • 1980s: JASPER (Journalist’s Assistant for Preparing Earnings Reports)
• Systems for limited domains

[Andersen:1992:AEF:974499.974531]

• 25 seconds processing time
• 21% success rate

Example input (earnings report):

    GREEN TREE ACCEPTANCE, INC <GNT.N>
    ST. PAUL, Minn, Oct 17
    Shr 70 cts vs 70 cts
    Net 10.4 mln vs 10.3 mln
    Avg shrs 11.6 mln vs 11.5 mln
    Nine Months
    Shr 1.70 dlrs vs 1.21 dlrs
    Net 26.7 mln vs 20.8 mln
    Avg shrs 11.6 mln vs 11.5 mln
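Why such systems were restricted to limited domains becomes clearer with a small sketch. The Python below fills a fixed template from the earnings report shown above using hand-written patterns. It is not how JASPER worked internally; the field names, the period label "unlabelled" and the patterns are assumptions made for this illustration, and they only fit reports in exactly this layout.

```python
import re

REPORT = """GREEN TREE ACCEPTANCE, INC <GNT.N>
ST. PAUL, Minn, Oct 17
Shr 70 cts vs 70 cts
Net 10.4 mln vs 10.3 mln
Avg shrs 11.6 mln vs 11.5 mln
Nine Months
Shr 1.70 dlrs vs 1.21 dlrs
Net 26.7 mln vs 20.8 mln
Avg shrs 11.6 mln vs 11.5 mln"""

# Header line: company name followed by a ticker symbol in angle brackets.
HEADER = re.compile(r"^(.*?)\s*<([A-Z.]+)>$")
# Figure line: "<field> <current> <unit> vs <previous> <unit>".
FIGURE = re.compile(
    r"^(Shr|Net|Avg shrs)\s+([\d.]+)\s+(cts|dlrs|mln)\s+vs\s+([\d.]+)\s+(cts|dlrs|mln)$"
)


def extract_earnings(report):
    """Return (company, ticker, rows) extracted from one earnings report."""
    lines = [line.strip() for line in report.splitlines()]
    header = HEADER.match(lines[0])
    company, ticker = (header.group(1), header.group(2)) if header else (None, None)

    period = "unlabelled"  # assumption: the first block carries no period label
    rows = []
    for line in lines[1:]:
        if line == "Nine Months":
            period = "nine months"
            continue
        figure = FIGURE.match(line)
        if figure:
            rows.append({
                "period": period,
                "field": figure.group(1),
                "current": float(figure.group(2)),
                "previous": float(figure.group(4)),
                "unit": figure.group(3),
            })
    return company, ticker, rows


company, ticker, rows = extract_earnings(REPORT)
```

Any deviation from the expected wording, such as a different abbreviation or an additional line type, is silently skipped, which is the typical weakness of hand-built extractors for a single domain.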


Later Information Extraction Systems

• Later: more sophisticated systems (commercial and academic)
  • GATE (University of Sheffield)
  • The SQoUT Project (NYU/Columbia)
  • UIMA (Apache)
  • SystemT-IE (IBM)

“The SystemT project is an amalgam of two major research themes centered around analytics and search over unstructured content.” [IBM2013]

• Document-oriented
• Simple declarative logic, comparable to SQL
• Optimized for scalability

• SystemT-IE is a good example of an IE framework


Structured Text in SystemT-IE

• SystemT-IE operates over a simple relational data model with the data types span, tuple and relation [Krishnamurthy:2009:SSD:1519103.1519105]
• The nature of spans, tuples and relations depends on the extractor function

Dictionary-based extractor (e.g. country names)
Ireland has many landmarks dating back a long time. One of the oldest is Dublin Castle, which was founded on the orders of King John of England in 1204.

Regular-expression-based extractor (e.g. numeric matches /\d{4}/)
Ireland has many landmarks dating back a long time. One of the oldest is Dublin Castle, which was founded on the orders of King John of England in 1204.
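The two extractor types can be approximated in a few lines of Python. This is a rough sketch and not SystemT code: the Span type, the function names and the two-entry country dictionary are assumptions for illustration. Each extractor returns spans, that is, character offsets into the document together with the covered text.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # offset of the first character
    end: int    # offset one past the last character
    text: str   # covered text, kept for readability


TEXT = (
    "Ireland has many landmarks dating back a long time. One of the oldest is "
    "Dublin Castle, which was founded on the orders of King John of England in 1204."
)


def regex_spans(pattern, text):
    """Regular-expression-based extractor: one span per match."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


def dictionary_spans(entries, text):
    """Dictionary-based extractor: one span per whole-word dictionary hit."""
    if not entries:
        return []
    # Longer entries first, so that multi-word entries win over their prefixes.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    return regex_spans(rf"\b(?:{alternation})\b", text)


COUNTRIES = {"Ireland", "England"}  # hypothetical stand-in for a country dictionary

countries = dictionary_spans(COUNTRIES, TEXT)
years = regex_spans(r"\d{4}", TEXT)
```

With this dictionary, the first extractor yields spans for "Ireland" and "England", and the regular expression yields a single span for "1204". A collection of such spans, one per row, plays the role of a relation.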


Querying Text: AQL

[Diagram comparing keywords. AQL: EXTRACT, CONSOLIDATE, DETAG. SQL: SELECT, AS, WHERE, FROM, JOIN]

• “Annotation Query Language (AQL) is the primary language (...) for building extractors that extract structured information from unstructured or semi-structured text.” [IBM2013a]
• Similar to SQL
• Additional operators
  • CONSOLIDATE: resolves overlapping spans
  • EXTRACT: extracts useful features from text
  • DETAG: removes markup from text
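As a rough illustration of what removing markup means, the sketch below strips HTML tags with Python's standard html.parser module. It is not the AQL DETAG statement: the class and function names are hypothetical, the offsets of the original document are not preserved, and content of script or style elements is not filtered out.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def detag(markup):
    """Return the text content of an HTML fragment with whitespace normalised."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.parts).split())


plain = detag("<p>Call <b>Jack Byrne</b> on 715-4279 &amp; ask.</p>")
```

The resulting plain text can then be passed to dictionary or regular expression extractors without tags interfering with the matches.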


AQL: “extract”

Task: Extract related names and telephone numbers from a text document D.

D: Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 Simon Jones 999-6459 Aoife Ryan 116-2496

    create view Person as
    extract dictionary 'firstNames.dict'
        on D.text as name
    from Document D;

    create view Phone as
    extract regex /\d{3}-\d{4}/
        on D.text as number
    from Document D;

[Li:2011:SDI:2002440.2002459]


AQL: “extract”

D: Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 Simon Jones 999-6459 Aoife Ryan 116-2496

View Phone:
    715-4279
    489-4905
    513-1069
    999-6459
    116-2496

View Person:
    Jack
    James
    Aaron
    Emma
    Simon
    Aoife


AQL: “extract”

• Combine the views

    create view PersonPhoneAll as
    select CombineSpans(P.name, Ph.number) as match
    from Person P, Phone Ph
    where FollowsTok(P.name, Ph.number, 0, 3);

[Li:2011:SDI:2002440.2002459]

View PersonPhoneAll:
    Jack, 715-4279
    James, 489-4905
    Aaron, 489-4905
    Emma, 513-1069
    Simon, 999-6459
    Aoife, 116-2496

Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 Simon Jones 999-6459 Aoife Ryan 116-2496
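The join between the two views can be imitated in Python to see why these six pairs result. This is a sketch under stated assumptions, not SystemT's implementation: tokens are taken to be whitespace-separated words (SystemT's tokenizer may differ), FIRST_NAMES stands in for the contents of 'firstNames.dict', and follows_tok and combine_spans are simplified re-creations of the AQL functions of the same purpose.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


D = (
    "Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 "
    "Simon Jones 999-6459 Aoife Ryan 116-2496"
)

# Assumed dictionary content; the original file 'firstNames.dict' is not available.
FIRST_NAMES = {"Jack", "James", "Aaron", "Emma", "Simon", "Aoife"}


def text(span):
    """Return the document text covered by a span."""
    return D[span.begin:span.end]


# The two input views: one span per first name, one span per phone number.
person = [Span(m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+", D)
          if m.group() in FIRST_NAMES]
phone = [Span(m.start(), m.end()) for m in re.finditer(r"\d{3}-\d{4}", D)]


def follows_tok(first, second, min_tok, max_tok):
    """True if second starts after first with min_tok..max_tok tokens in between."""
    if second.begin < first.end:
        return False
    tokens_between = len(D[first.end:second.begin].split())
    return min_tok <= tokens_between <= max_tok


def combine_spans(first, second):
    """Smallest span covering both input spans."""
    return Span(min(first.begin, second.begin), max(first.end, second.end))


# Equivalent of the join: every (name, number) pair that satisfies the predicate.
person_phone_all = [
    (p, ph, combine_spans(p, ph))
    for p in person
    for ph in phone
    if follows_tok(p, ph, 0, 3)
]
pairs = [(text(p), text(ph)) for p, ph, _ in person_phone_all]
matches = [text(match) for _, _, match in person_phone_all]
```

Both "James" and "Aaron" pair with 489-4905 because each is followed by that number within three tokens. The combined span for "Aaron" lies inside the one for "James", which is the overlap a later consolidation step has to resolve.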


AQL: “consolidate”

• Removes overlapping spans from a column
• According to a specified policy [Chiticariu:2010:SAA:1858681.1858695]
  • Containment
  • Overlap

    create view PersonPhone as
    select R.match as match
    from PersonPhoneAll R
    consolidate on R.match;

View PersonPhone:
    Jack, 715-4279
    James, 489-4905
    Emma, 513-1069
    Simon, 999-6459
    Aoife, 116-2496

Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 Simon Jones 999-6459 Aoife Ryan 116-2496
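A containment policy can be sketched as follows, assuming that consolidation is applied to the combined name-and-number spans. The function name and the way the input spans are located are choices made for this illustration and do not reflect SystemT's implementation: a span is dropped whenever another span fully contains it.

```python
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


D = (
    "Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 "
    "Simon Jones 999-6459 Aoife Ryan 116-2496"
)

# Combined spans before consolidation, located by their text for brevity.
MATCH_TEXTS = [
    "Jack Byrne 715-4279",
    "James Aaron Roberts 489-4905",
    "Aaron Roberts 489-4905",
    "Emma Taylor 513-1069",
    "Simon Jones 999-6459",
    "Aoife Ryan 116-2496",
]
all_matches = [Span(D.index(t), D.index(t) + len(t)) for t in MATCH_TEXTS]


def consolidate_containment(spans):
    """Keep only spans that are not contained in another, different span."""
    unique = sorted(set(spans))  # document order, exact duplicates removed
    return [
        span for span in unique
        if not any(
            other != span and other.begin <= span.begin and span.end <= other.end
            for other in unique
        )
    ]


consolidated = [D[s.begin:s.end] for s in consolidate_containment(all_matches)]
```

The span for "Aaron Roberts 489-4905" is removed because it lies entirely within "James Aaron Roberts 489-4905", leaving one entry per person. The quadratic comparison is acceptable for a handful of spans; a real system would use a sweep over sorted spans.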


Conclusion

• Information Extraction technology is on the rise
• Declarative IE systems are becoming more performant and robust
• Widespread use of IE technology may one day be taken for granted in the same way relational technology is today


References

[Andersen:1992:AEF:974499.974531] Andersen, P.M., Hayes, P.J., Huettner, A.K., Schmandt, L.M., Nirenburg, I.B. & Weinstein, S.P. Automatic Extraction of Facts from Press Releases to Generate News Stories. Proceedings of the Third Conference on Applied Natural Language Processing, Association for Computational Linguistics, 1992, pp. 170-177

[Chiticariu:2010:SAA:1858681.1858695] Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F.R. & Vaithyanathan, S. SystemT: An Algebraic Approach to Declarative Information Extraction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 128-137

[IBM2013] IBM. IBM big data platform. http://www-01.ibm.com/software/data/bigdata/ (2013, accessed 2013-12-03)

[IBM2013a] IBM. IBM big data platform. http://pic.dhe.ibm.com/infocenter/bigins/v1r4/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.text.doc%2Fdoc%2Fbiginsights_aqlref_con_aql-overview.html (2013, accessed 2013-12-03)


[Kilias:2013:IID:2513190.2513196] Kilias, T., Löser, A. & Andritsos, P. INDREX: In-database Distributional Relation Extraction. Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, ACM, 2013, pp. 93-100

[Krishnamurthy:2009:SSD:1519103.1519105] Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F., Vaithyanathan, S. & Zhu, H. SystemT: A System for Declarative Information Extraction. SIGMOD Rec., ACM, 2009, Vol. 37(4), pp. 7-13

[Li:2011:SDI:2002440.2002459] Li, Y., Reiss, F.R. & Chiticariu, L. SystemT: A Declarative Information Extraction System. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, Association for Computational Linguistics, 2011, pp. 109-114


Picture Credit

Icons
• Page 3: Arrow by Jamison Wieser from The Noun Project
• Page 4: Flag by Ashley van Dyck from The Noun Project
• Page 5: Document by Jamison Wieser from The Noun Project
• Other icons used are in the public domain.

Photography
• Page 1: Archives by Marino González is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic License. Based on a work at http://www.flickr.com/photos/merlin1487/5518280677/. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/2.0/legalcode.
