Document Oriented Databases and Text Processing
DBSEM 2013/14: “Beauty is our Business”
TU Berlin, 18.12.2013
Moritz Platt
The Early Days of the Relational Model
• Developed in the 1960s/70s
• Mainly responsible: Edgar F. Codd
• System R, IBM's research prototype based on his work, was the first implementation of SQL
• The query language SQL and the relational model have remained largely unchanged for more than 40 years
Why was this model so successful?
The Success of the Relational Model
Reliability
• Atomicity
• Consistency
• Isolation
• Durability
Normalisation
• Removing redundancies
• Data integrity
• Saving of disk space
Declarative query language
• Easy to learn
• Additional abstraction layer
The Relational Model Meets “Big Data”
Relational era:
• Limited volume
• Static data
• Rigidly structured data
• Controlled data quality
“Big Data” era:
• Volume: large volume
• Velocity: streaming data
• Variety: semi-structured data
• Veracity: uncertain data quality
Systems shown on the timeline (1970 to 2010): System R/SQL, DB2, Apache Jackrabbit, CouchDB, Hadoop, Neo4j, MongoDB
Paradigms of Storing Data
• Relational: rigidly structured (explicit schemas, explicitly defined relationships)
• XML based, JSON based: structured (explicit or implicit schemas)
• Document-based: weakly structured (no schemas)
Querying Relational and Document Based Databases
Let's compare two queries that retrieve the names of the CEOs of public companies in the tech industry with a stock price above $100.
Imperative MongoDB query
db.companies.find(
  { industry : 'technology', stockPrice : { $gt : 100 } },
  { ceoName : 1, _id : 0 }
)
Declarative SQL query
SELECT CEO_NAME
FROM COMPANIES
WHERE INDUSTRY = 'technology'
  AND STOCK_PRICE > 100
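The contrast between the two styles can be sketched in Python. This is only an illustration: the rows, the placeholder CEO names and the function names are made up, and the in-memory SQLite table stands in for a real COMPANIES table. The first function states the expected result in SQL, the second spells out how to walk the data.

```python
import sqlite3

# Hypothetical sample rows (ceo_name, industry, stock_price); not real data.
ROWS = [
    ("Alice", "technology", 180.0),
    ("Bob", "technology", 95.0),
    ("Carol", "retail", 240.0),
    ("Dave", "technology", 101.5),
]


def declarative(rows):
    """State the expected result in SQL; the engine decides how to get it."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE COMPANIES (CEO_NAME TEXT, INDUSTRY TEXT, STOCK_PRICE REAL)"
    )
    con.executemany("INSERT INTO COMPANIES VALUES (?, ?, ?)", rows)
    cursor = con.execute(
        "SELECT CEO_NAME FROM COMPANIES "
        "WHERE INDUSTRY = 'technology' AND STOCK_PRICE > 100 "
        "ORDER BY CEO_NAME"
    )
    result = [name for (name,) in cursor]
    con.close()
    return result


def imperative(rows):
    """Spell out how to iterate over the data and which rows to keep."""
    result = []
    for ceo_name, industry, stock_price in rows:
        if industry == "technology" and stock_price > 100:
            result.append(ceo_name)
    return sorted(result)


if __name__ == "__main__":
    print(declarative(ROWS))
    print(imperative(ROWS))
```

Both functions return the same names; only the declarative one leaves the access path to the database engine.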
Querying Relational and Document Based Databases
MongoDB result set
{ 'ceoName' : 'Rometty' }
{ 'ceoName' : 'Chênevert' }
{ 'ceoName' : 'Thulin' }
{ 'ceoName' : 'McNerney' }
Imperative languages express how to retrieve the results
• More structural knowledge necessary
SQL result set
Rometty
Chênevert
Thulin
McNerney
Declarative languages define the expected result
• Allows for more abstraction (views, triggers, etc.)
• Similar query structure
• Underlying structured data is indispensable
However, a lot of highly relevant data is textual and not formally structured.
Examples of such textual sources:
• Traditional media
• Government data, for example legal texts such as Directive 2005/36/EC of the European Parliament and of the Council of 7 September 2005 on the recognition of professional qualifications (excerpt shown on the slide)
• Social media
… and much, much more …
Structuring Text Data
• A declarative query logic requires structured data
• Declarative Information Extraction creates a structured representation of text data
Supervised
• Declarative Information Extraction
• Classification
Weakly supervised
• Active Learning
• Distant Supervision
Unsupervised
• Relation Extraction
[Kilias:2013:IID:2513190.2513196]
Early Information Extraction Systems
• Early Information Extraction systems
  • 1970s: FRUMP (later ATRANS)
  • 1980s: JASPER (Journalist’s Assistant for Preparing Earnings Reports)
• Systems for limited domains
[Andersen:1992:AEF:974499.974531]
25 seconds processing time
21% success rate
GREEN TREE ACCEPTANCE, INC <GNT.N>
ST. PAUL, Minn, Oct 17
Shr 70 cts vs 70 cts
Net 10.4 mln vs 10.3 mln
Avg shrs 11.6 mln vs 11.5 mln
Nine Months
Shr 1.70 dlrs vs 1.21 dlrs
Net 26.7 mln vs 20.8 mln
Avg shrs 11.6 mln vs 11.5 mln
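To illustrate why such limited-domain text lends itself to extraction, here is a minimal Python sketch that turns the earnings report above into a structured record. It is not how JASPER worked: the regular expressions, the field names and the `first_block` label for the unnamed first group of figures are assumptions made for this example.

```python
import re

REPORT = """GREEN TREE ACCEPTANCE, INC <GNT.N>
ST. PAUL, Minn, Oct 17
Shr 70 cts vs 70 cts
Net 10.4 mln vs 10.3 mln
Avg shrs 11.6 mln vs 11.5 mln
Nine Months
Shr 1.70 dlrs vs 1.21 dlrs
Net 26.7 mln vs 20.8 mln
Avg shrs 11.6 mln vs 11.5 mln"""

# "<label> <current> <unit> vs <previous> <unit>", e.g. "Net 10.4 mln vs 10.3 mln"
FIGURE_LINE = re.compile(
    r"^(?P<label>Shr|Net|Avg shrs)\s+(?P<cur>[\d.]+)\s+(?P<unit>\w+)"
    r"\s+vs\s+(?P<prev>[\d.]+)\s+\w+$"
)
# "<company name> <ticker>", e.g. "GREEN TREE ACCEPTANCE, INC <GNT.N>"
HEADER_LINE = re.compile(r"^(?P<company>.+?)\s*<(?P<ticker>[^>]+)>$")


def extract(report):
    """Return company, ticker and per-period figures as a nested dict."""
    lines = [line.strip() for line in report.splitlines()]
    header = HEADER_LINE.match(lines[0])
    record = {
        "company": header.group("company"),
        "ticker": header.group("ticker"),
        "periods": {},
    }
    period = "first_block"  # assumed label; the report does not name this block
    for line in lines[1:]:
        figure = FIGURE_LINE.match(line)
        if figure:
            record["periods"].setdefault(period, {})[figure.group("label")] = {
                "current": float(figure.group("cur")),
                "previous": float(figure.group("prev")),
                "unit": figure.group("unit"),
            }
        elif line == "Nine Months":
            period = "Nine Months"
    return record


if __name__ == "__main__":
    print(extract(REPORT))
```

The patterns only cover this one report layout, which mirrors the slide's point that early systems were tied to narrow domains.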
Later Information Extraction Systems
• Later: more sophisticated systems (commercial and academic)
  • GATE (University of Sheffield)
  • The SQoUT Project (NYU/Columbia)
  • UIMA (Apache)
  • SystemT-IE (IBM)
“The SystemT project is an amalgam of two major research themes centered around analytics and search over unstructured content.” [IBM2013]
• Document-oriented
• Simple declarative logic, comparable to SQL
• Optimized for scalability
• SystemT-IE is a good example of an IE framework
Structured Text in SystemT-IE
• SystemT-IE operates over a simple relational data model with the data types span, tuple and relation [Krishnamurthy:2009:SSD:1519103.1519105]
• The nature of spans, tuples and relations depends on the extractor function
Example text:
Ireland has many landmarks dating back a long time. One of the oldest is Dublin Castle, which was founded on the orders of King John of England in 1204.
• Dictionary based extractor (e.g. country names)
• Regular expression based extractor (e.g. numeric matches /\d{4}/)
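The idea of a span, and of extractor functions producing spans, can be sketched in Python. This is not SystemT's implementation: the `Span` class, the function names and the two-entry country dictionary are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    begin: int  # offset of the first character
    end: int    # offset one past the last character
    text: str   # the covered text


TEXT = (
    "Ireland has many landmarks dating back a long time. One of the oldest is "
    "Dublin Castle, which was founded on the orders of King John of England in 1204."
)

COUNTRIES = ["Ireland", "England"]  # assumed dictionary content


def extract_dictionary(entries, text):
    """Return one span per whole-word occurrence of any dictionary entry."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, entries)) + r")\b")
    return [Span(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def extract_regex(regex, text):
    """Return one span per match of the regular expression."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(regex, text)]


if __name__ == "__main__":
    print(extract_dictionary(COUNTRIES, TEXT))
    print(extract_regex(r"\d{4}", TEXT))
```

Each extractor yields a relation with a single span column; what a span covers depends entirely on the extractor that produced it.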
Querying Text: AQL
AQL: EXTRACT, CONSOLIDATE, DETAG
SQL: SELECT, AS, WHERE, FROM, JOIN
• “Annotation Query Language (AQL) is the primary language (...) for building extractors that extract structured information from unstructured or semi-structured text.” [IBM2013a]
• Similar to SQL
• Additional operators
  • CONSOLIDATE: resolves overlapping spans
  • EXTRACT: extracts useful features from text
  • DETAG: removes markup from text
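What removing markup means in practice can be sketched with Python's standard `html.parser` module. This only mimics the effect described above and is not AQL's DETAG implementation; the `detag` function name is an assumption, and the sketch does not keep any mapping back to offsets in the original markup.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def detag(markup):
    """Return the text content of an HTML fragment without its tags."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


if __name__ == "__main__":
    print(detag("<p>Jack Byrne <b>715-4279</b></p>"))
```

Extractors such as the dictionary and regular expression matchers can then run on the plain text instead of the raw markup.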
AQL: “extract”
Task: Extract related names and telephone numbers from a text document D.
D:
Jack Byrne 715-4279
James Aaron Roberts 489-4905
Emma Taylor 513-1069
Simon Jones 999-6459
Aoife Ryan 116-2496
create view Person as
  extract dictionary 'firstNames.dict' on D.text as name
  from Document D;
create view Phone as
  extract regex /\d{3}-\d{4}/ on D.text as number
  from Document D;
[Li:2011:SDI:2002440.2002459]
AQL: “extract”
D:
Jack Byrne 715-4279
James Aaron Roberts 489-4905
Emma Taylor 513-1069
Simon Jones 999-6459
Aoife Ryan 116-2496
View Phone:
715-4279
489-4905
513-1069
999-6459
116-2496
View Person:
Jack
James
Aaron
Emma
Simon
Aoife
AQL: “extract”
• Combine the views
create view PersonPhoneAll as
  select CombineSpans(P.name, Ph.number) as match
  from Person P, Phone Ph
  where FollowsTok(P.name, Ph.number, 0, 3);
[Li:2011:SDI:2002440.2002459]
View PersonPhoneAll:
Jack, 715-4279
James, 489-4905
Aaron, 489-4905
Emma, 513-1069
Simon, 999-6459
Aoife, 116-2496
D:
Jack Byrne 715-4279
James Aaron Roberts 489-4905
Emma Taylor 513-1069
Simon Jones 999-6459
Aoife Ryan 116-2496
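The join condition can be emulated in a few lines of Python. This is a sketch under simplifying assumptions: tokens are split on whitespace (SystemT's own tokenizer may split differently), the set of first names stands in for the unseen content of firstNames.dict, and the function returns (name, number) pairs as listed above rather than combined spans.

```python
import re

D = (
    "Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 "
    "Simon Jones 999-6459 Aoife Ryan 116-2496"
)

# Assumed content of firstNames.dict.
FIRST_NAMES = {"Jack", "James", "Aaron", "Emma", "Simon", "Aoife"}
PHONE = re.compile(r"\d{3}-\d{4}")


def person_phone_all(text, min_tok=0, max_tok=3):
    """Pair each name with every number that follows it within the token window."""
    tokens = text.split()  # simplification: whitespace tokens
    persons = [i for i, tok in enumerate(tokens) if tok in FIRST_NAMES]
    phones = [i for i, tok in enumerate(tokens) if PHONE.fullmatch(tok)]
    pairs = []
    for p in persons:
        for ph in phones:
            gap = ph - p - 1  # number of tokens strictly between name and number
            if min_tok <= gap <= max_tok:
                pairs.append((tokens[p], tokens[ph]))
    return pairs


if __name__ == "__main__":
    for name, number in person_phone_all(D):
        print(f"{name}, {number}")
```

Both James and Aaron are paired with 489-4905 because each lies within three tokens of that number, which is why a consolidation step is needed afterwards.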
AQL: “consolidate”
• Removes overlapping spans from a column
• According to a specified policy [Chiticariu:2010:SAA:1858681.1858695]
  • Containment
  • Overlap
create view PersonPhone as
  select R.match as match
  from PersonPhoneAll R
  consolidate on R.match;
View PersonPhone:
Jack, 715-4279
James, 489-4905
Emma, 513-1069
Simon, 999-6459
Aoife, 116-2496
D:
Jack Byrne 715-4279
James Aaron Roberts 489-4905
Emma Taylor 513-1069
Simon Jones 999-6459
Aoife Ryan 116-2496
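A containment policy can be sketched in Python as follows. The sketch assumes that each combined match covers the text from the start of the name to the end of the number; the helper names `span_of` and `consolidate_containment` are made up for this example and are not part of AQL.

```python
D = (
    "Jack Byrne 715-4279 James Aaron Roberts 489-4905 Emma Taylor 513-1069 "
    "Simon Jones 999-6459 Aoife Ryan 116-2496"
)

# (name, number) pairs as produced by the join over Person and Phone.
PAIRS = [
    ("Jack", "715-4279"),
    ("James", "489-4905"),
    ("Aaron", "489-4905"),
    ("Emma", "513-1069"),
    ("Simon", "999-6459"),
    ("Aoife", "116-2496"),
]


def span_of(text, name, number):
    """Return (begin, end) offsets covering the name through the number."""
    begin = text.index(name)
    end = text.index(number, begin) + len(number)
    return (begin, end)


def consolidate_containment(spans):
    """Drop every span that lies entirely inside another, distinct span."""
    unique = sorted(set(spans))
    return [
        s
        for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    ]


if __name__ == "__main__":
    spans = [span_of(D, name, number) for name, number in PAIRS]
    for begin, end in consolidate_containment(spans):
        print(D[begin:end])
```

The span starting at Aaron is dropped because it lies inside the span starting at James, leaving one match per phone number.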
Conclusion
• Information Extraction technology is on the rise
• Declarative IE systems are becoming more performant and robust
• Widespread use of IE technology may one day be taken for granted the same way relational technology is today
References
[Andersen:1992:AEF:974499.974531] Andersen, P.M., Hayes, P.J., Huettner, A.K., Schmandt, L.M., Nirenburg, I.B. & Weinstein, S.P.: Automatic Extraction of Facts from Press Releases to Generate News Stories. Proceedings of the Third Conference on Applied Natural Language Processing, Association for Computational Linguistics, 1992, pp. 170-177
[Chiticariu:2010:SAA:1858681.1858695] Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F.R. & Vaithyanathan, S.: SystemT: An Algebraic Approach to Declarative Information Extraction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 128-137
[IBM2013] IBM: IBM big data platform. URL: http://www-01.ibm.com/software/data/bigdata/ (2013, accessed 2013-12-03)
[IBM2013a] IBM: IBM big data platform. URL: http://pic.dhe.ibm.com/infocenter/bigins/v1r4/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.text.doc%2Fdoc%2Fbiginsights_aqlref_con_aql-overview.html (2013, accessed 2013-12-03)
[Kilias:2013:IID:2513190.2513196] Kilias, T., Löser, A. & Andritsos, P.: INDREX: In-database Distributional Relation Extraction. Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, ACM, 2013, pp. 93-100
[Krishnamurthy:2009:SSD:1519103.1519105] Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F., Vaithyanathan, S. & Zhu, H.: SystemT: A System for Declarative Information Extraction. SIGMOD Rec., ACM, 2009, Vol. 37(4), pp. 7-13
[Li:2011:SDI:2002440.2002459] Li, Y., Reiss, F.R. & Chiticariu, L.: SystemT: A Declarative Information Extraction System. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, Association for Computational Linguistics, 2011, pp. 109-114
Picture Credit
Icons
Page 3: Arrow by Jamison Wieser from The Noun Project
Page 4: Flag by Ashley van Dyck from The Noun Project
Page 5: Document by Jamison Wieser from The Noun Project
Other icons used are in the public domain.
Photography
Page 1: Archives by Marino González is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic License. Based on a work at http://www.flickr.com/photos/merlin1487/5518280677/. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/2.0/legalcode.