S317428: Building Really Scalable XML Applications with Oracle XML DB and Oracle Text
Michele Pompilius, Data Technology Manager, ProQuest
Nipun Agarwal, Director, XML Development, Oracle
Background Information
• ProQuest Company is a privately held global information services company
• 1500 employees
• $500 million revenue
• ProQuest partners with leading newspaper and academic journal providers in disciplines such as medicine, technology, social sciences, and humanities
• ProQuest aggregates materials and distributes digitized content to academic institutions, public libraries and schools
• ProQuest has a portfolio of 1,500 products and relationships with over 9,000 content providers
• 10 major product lines
Background Information
• Who am I?
• Data Technology Manager
• Have Worked with Oracle Products Since 1987
• Rejoined ProQuest in June, 2007
• Manage the Database Team
• 3 DBAs; 3 Architects; 4 Developers
• Part of the Global Product Development Organization
• Support Other Areas of the Business
• JDeveloper for Custom Internal Applications
• New to Oracle XML DB
• Initiated Proof of Concept in August, 2009
• Started with 11gR2 Beta
Oracle XML DB Product/Project Overview
• Project Morningstar
• Enterprise-wide effort to consolidate technology across multiple business units, each with its own “silo” of content
• Two-year plan to establish new platform, integrating business units in phases
• Approximately 100 staff members involved
• Ultimate goal is a single, integrated vault comprising all ProQuest content, which can be searched from a single entry point
• Technical Strategies/Challenges
• Huge volume of documents
• Very complex, internally developed XML Schema
Oracle XML DB Product/Project Specifics
• Application Architecture
• Front-end (Customer Facing Application)
• Web-based interface for user login
• Never interacts directly with the content store
• Documents are searched and served to users by the FAST search engine
• Content Store (Internal Editorial Application)
• Complete store of documents in Oracle XML DB
• Content Store User Interface interacts directly with the content store
• XML Search being investigated and prototyped now
• Document Manufacturing
• Ingest rate of 10 million documents/day
Oracle XML DB Product/Project Specifics
• Data Characteristics
• Typical document size is 10-12 KB
• XML Schema has on the order of 700 nodes
• Flexible model
• Supports many content types: newspapers, journals, dissertations, etc.
• Data Volume
• Proof of concept: scaled to 82 million documents
• Production: Now just shy of 800 million documents
• Next phase (2011) will ramp the document count up to over 2.5 billion
• Database is currently 7 TB in size
• XML table and LOB segment: 5 TB
Oracle XML DB Product/Project Specifics
• Environment
• 4-node cluster, HP DL360 G6, 2 quad-core CPUs, 144 GB RAM
• Running 11.2.0.1 (11gR2) on RHEL 5.3
• Supports all online users, internal/editorial operations, and manufacturing activity
• XML Table is Range Partitioned
• Launch Schedule
• August 2010 – Customer Preview Successfully Launched!
• December 2010 – General Release
• Next Steps
• Continue to increase document manufacturing ingest rates
• XML Index and Text Index Prototyping
Two Phases
• Phase 1
• Live in 2010
• Focus on
• Ingestion speed
• Scalability
• Disk Storage
• Phase 2
• Work in Progress – plan to go live in 2011
• Focus on Query performance
• Build XML and Text Indexes
ProQuest Morningstar with XML DB
Oracle Confidential
[Architecture diagram. Labels: Content Store Loader (Insert, Update); Content Store User Interface (Query); Content Store holding Binary XML (SecureFiles), XMLIndex (Phase 2), and Text Index (Phase 2); Index maintenance]
Data Model
• Binary XMLType column in a relational table
• Partitioned by range on primary key
• Non-schema based to avoid schema evolution later
• Locally partitioned XMLIndex and Text Index
• Running on a RAC system
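The range-partitioning idea on this slide can be modelled in a few lines. The sketch below is only an illustration of how a key maps to a range partition under "values less than" semantics; the boundary values and partition names are hypothetical and are not taken from the ProQuest system.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds for each range partition on the key.
PARTITION_BOUNDS = [100_000_000, 200_000_000, 300_000_000]
# One more name than bounds: the last partition catches every larger key.
PARTITION_NAMES = ["p1", "p2", "p3", "p_max"]


def partition_for(key: int) -> str:
    """Return the partition whose range contains `key`.

    A key equal to a bound belongs to the next partition, which mirrors
    "values less than" range semantics.
    """
    return PARTITION_NAMES[bisect_right(PARTITION_BOUNDS, key)]


print(partition_for(42))           # falls in the first range
print(partition_for(100_000_000))  # boundary value goes to the next range
```

With locally partitioned indexes, each partition would carry its own index segment, so a lookup that resolves to one partition only needs to touch that partition's index.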
Ingestion Performance
• Range Partitioned Binary XML Table
• Asynchronous index for both XMLIndex and Text Index
• POC numbers
• Target: 300 docs/sec
• Achieved: 475 docs/sec (SQL*Loader)
• CPU utilization < 60%
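To relate the per-second figures above to the daily ingest rate and document counts quoted earlier in the deck, the arithmetic can be sketched as follows. It assumes a rate sustained around the clock, which is a simplification; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_day(rate_per_sec: float) -> float:
    """Documents per day at a sustained per-second rate."""
    return rate_per_sec * SECONDS_PER_DAY


def per_sec(rate_per_day: float) -> float:
    """Sustained per-second rate needed to reach a daily total."""
    return rate_per_day / SECONDS_PER_DAY


def days_to_load(total_docs: float, rate_per_sec: float) -> float:
    """Days needed to load `total_docs` at a sustained per-second rate."""
    return total_docs / per_day(rate_per_sec)


print(per_day(300))                              # POC target, in docs/day
print(per_day(475))                              # POC result, in docs/day
print(round(per_sec(10_000_000), 1))             # rate implied by 10 million docs/day
print(round(days_to_load(800_000_000, 475), 1))  # days to load 800 million docs
```

Under that assumption, 475 docs/sec corresponds to about 41 million documents per day, and 10 million documents per day corresponds to roughly 116 docs/sec.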
Scalability
• 800 million rows
• About 50 partitions
• Concurrent load
• Parallel Query
• Ingestion rate constant
• >5 TB of XML data
Storage
• 25% compression for Binary XML
• Disk storage less than competitors
• Less I/O
• More rows in memory
• Indexes use around 3x the size of the raw XML data
TPoX Benchmark – XML DB: Comparison of Storage Space with another DB
[Bar chart: XML Storage and Indexing. Y axis: Disk Storage (MB), 0 to 750. Categories: XML Storage, Indexing, Overall Disk Usage. Series: Oracle 11gR2 Binary Storage with XTIDX; Another DB with SB with XIDX]
• Oracle uses 2.4x less storage
(based on Gmean query time, 6000 customer docs)
Phase 2 – Query Performance
Query Performance
• Mixed queries containing XMLEXISTS and CONTAINS
• CONTAINS may use INPATH, HASPATH
• One predicate uses index, the other evaluated as a post filter
• Cost of predicates determines index usage
• Queries use parallel processing to utilize available CPU
• CONTAINS clause optimized to push down most processing, including count, to the text index
• Result Set Interface with parallel table function
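The "one predicate uses the index, the other is a post filter" strategy can be illustrated with a toy model. This is not how the Oracle optimizer is implemented: the in-memory dictionaries stand in for an XMLIndex and a Text index, the documents are made up, and the exact hit count is used as a stand-in for the optimizer's cost estimate.

```python
# Toy document store (hypothetical data): doc id -> indexed attributes.
DOCS = {
    1: {"grouping_ids": {"23468"}, "words": {"new", "york"}},
    2: {"grouping_ids": {"23468"}, "words": {"old", "town"}},
    3: {"grouping_ids": {"99999"}, "words": {"new", "deal"}},
}


def build_index(field):
    """Map each value of `field` to the set of doc ids containing it."""
    index = {}
    for doc_id, doc in DOCS.items():
        for value in doc[field]:
            index.setdefault(value, set()).add(doc_id)
    return index


PATH_INDEX = build_index("grouping_ids")  # stands in for an XMLIndex
TEXT_INDEX = build_index("words")         # stands in for a Text index


def mixed_query(grouping_id, word):
    """Drive from the cheaper index, then apply the other predicate as a filter."""
    path_hits = PATH_INDEX.get(grouping_id, set())
    text_hits = TEXT_INDEX.get(word, set())
    if len(path_hits) <= len(text_hits):
        # Path predicate is at least as selective: text check becomes the post filter.
        return sorted(d for d in path_hits if word in DOCS[d]["words"])
    # Text predicate is more selective: path check becomes the post filter.
    return sorted(d for d in text_hits if grouping_id in DOCS[d]["grouping_ids"])


print(mixed_query("23468", "new"))
```

The point of the model is only the shape of the plan: one index produces candidate rows, and the remaining predicate is checked row by row against those candidates.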
ProQuest Sample Queries
select p.doc
from PROQUEST_DATA p
where xmlexists('/RECORD/ObjectInfo/Copyright/CopyrightData' passing p.doc)
  and xmlexists('/RECORD/ObjectInfo/RecInfo/ObjectRevisions/ObjectRevision[UpdatedDate="20090614150554"]' passing p.doc)
order by goid
/
select /*+ FIRST_ROWS(50) no_index(p pd_text_index) rparse*/ p.doc
from PROQUEST_DATA p
where xmlexists('/RECORD/ParentInfo/Parent[GroupingID="23468"]' passing p.doc)
  and contains (p.doc, 'new') > 0
order by goid
/
XML Index
• Primary use case in conjunction with Binary XML
• Accelerates path, predicate and structural attribute searches
• Path-based index: 11gR1
• Structured index: 11gR2
XMLIndex (Path Based)
• Accelerates path and predicate searches
• Organizes paths and values in single path table
• Supports searching and fragment extraction
• Path sub-setting for indexing specific paths
• Asynchronous mode for deferred maintenance
• Ideal when the XPaths to be queried are not known in advance
• Also called Unstructured XMLIndex
XMLIndex (Path Based) Layout
RID | PATHID | ORDER KEY | LOCATOR | VALUE
10 | /Document | 1 | Locator to get binary content |
10 | /Document/Title | 1.1 | Locator to get binary content | Indexing XML Techniques
10 | /Document/Affiliation | 1.2 | Locator to get binary content | Oracle
10 | /Document/pubDate | 1.3 | Locator to get binary content | 2007-04-10
20 | /Document | 1 | Locator to get binary content |
20 | /Document/Title | 1.1 | Locator to get binary content | Object relational storage
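The path table layout above can be reproduced with a small sketch that shreds a document into one row per element. It is a simplified model, not Oracle's implementation: the LOCATOR column is omitted, paths are kept as strings rather than path ids, and the `shred` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def shred(rid, xml_text):
    """Yield (rid, path, order_key, value) for every element in the document."""
    root = ET.fromstring(xml_text)

    def walk(elem, path, order_key):
        # Elements holding only whitespace (pure containers) get no value.
        value = (elem.text or "").strip() or None
        yield (rid, path, order_key, value)
        for position, child in enumerate(elem, start=1):
            yield from walk(child, f"{path}/{child.tag}", f"{order_key}.{position}")

    yield from walk(root, f"/{root.tag}", "1")


doc = """<Document>
  <Title>Indexing XML Techniques</Title>
  <Affiliation>Oracle</Affiliation>
  <pubDate>2007-04-10</pubDate>
</Document>"""

rows = list(shred(10, doc))
for row in rows:
    print(row)
```

Because every path and value lands in the same table, a query on any path can be answered without knowing the XPaths in advance, which matches the use case described for the path-based index.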
XMLIndex (Structured)
• Project out commonly searched structured attributes
• Pivot each item as a column in the table
• All XPath matching is avoided at run time
• Secondary indexes can be created on the Structured Index
• Relational indexes on projected scalar attributes
• Text Index on projected text attributes
• Domain-specific index on domain attributes, e.g. image
• Physical rewrite using XQuery/XPath expression matching
XMLIndex (Structured) Layout
XML data:
<Document>
<title>Indexing XML Techniques</title>
<affiliation>Oracle</affiliation>
<pubdate>2007-04-10</pubdate>
….
</Document>
<Document>
<title>Object relational storage</title>
<affiliation>Oracle</affiliation>
<pubdate>2003-03-15</pubdate>
…
</Document>
Structured XMLIndex:
Row ID | Title | Affil | Pubdate
10 | Indexing XML Techniques | Oracle | 2007-04-10
20 | Object relational storage | Oracle | 2003-03-15
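The pivot from documents to relational columns can be sketched as below. This is a simplified model of the idea, not the XMLIndex implementation: the `PROJECTION` mapping and `project` helper are hypothetical names, and the elided parts of the sample documents are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical projection: column name -> element path relative to the root.
PROJECTION = {"title": "title", "affil": "affiliation", "pubdate": "pubdate"}

DOCS = {
    10: "<Document><title>Indexing XML Techniques</title>"
        "<affiliation>Oracle</affiliation><pubdate>2007-04-10</pubdate></Document>",
    20: "<Document><title>Object relational storage</title>"
        "<affiliation>Oracle</affiliation><pubdate>2003-03-15</pubdate></Document>",
}


def project(row_id, xml_text):
    """Pivot the projected elements of one document into a flat row."""
    root = ET.fromstring(xml_text)
    row = {"row_id": row_id}
    for column, path in PROJECTION.items():
        row[column] = root.findtext(path)  # None when the element is absent
    return row


table = [project(row_id, xml_text) for row_id, xml_text in DOCS.items()]

# A predicate on a projected column is now a plain comparison on flat rows,
# with no XPath matching against the documents at query time.
before_2005 = [row["row_id"] for row in table if row["pubdate"] < "2005-01-01"]
print(before_2005)
```

Once the values sit in ordinary columns, secondary indexes on those columns behave like any other relational index.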
Oracle Text and XML
• Oracle Text is the full text search engine in Oracle Database
• Free with all versions of the database
• The power of a standalone search engine plus full integration with the Oracle stack
• Can perform fast free-text search within XML text
<title>Crouching Tiger, Hidden Dragon</title>
… contains( movieInfo, 'tiger within title') …
• Result Set Interface (new in 11.2.0.2) allows you to
• Specify Query request and hitlist requirements in XML
• Fetch Hitlist as XML
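The meaning of the `tiger within title` example above can be shown with a toy check: the word must occur inside the named section, not just anywhere in the document. This sketch only models the semantics; Oracle Text tokenizes with its own lexer and answers the query from the text index. The `plot` element and the `contains_within` helper are hypothetical additions for illustration.

```python
import re
import xml.etree.ElementTree as ET


def contains_within(xml_text, word, section):
    """Return True if `word` appears as a token inside any <section> element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(section):
        # Gather all text under the element and split it into lowercase word tokens.
        tokens = re.findall(r"\w+", "".join(elem.itertext()).lower())
        if word.lower() in tokens:
            return True
    return False


movie = (
    "<movieInfo>"
    "<title>Crouching Tiger, Hidden Dragon</title>"
    "<plot>A stolen sword.</plot>"
    "</movieInfo>"
)

print(contains_within(movie, "tiger", "title"))
print(contains_within(movie, "sword", "title"))
```

The second call shows the section restriction at work: "sword" is in the document but not in the title.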
Indexes for Query Performance
• XMLIndex
• Path Subsetted
• Asynchronous maintenance
• Structured XML Index
• Text Index
• AUTO LEXER
• Path Section Group
• Interval Sync
• Asynchronous maintenance
Querying XML Content in XML DB
[Architecture diagram. Labels: XQuery, SQL/XML; XMLType Abstraction; XQuery Rewrite, XVM, Pushdown; Functional Evaluation; Procedural XQuery, DB XQuery; SQL Execution; Relational Access Methods; XMLIndex; DOM Tree Model; Streaming XPath Evaluation; Object-Relational (Relational Storage); Binary XML (Secure Files)]
Binary XML - Comparison with another DB
[Bar charts, four panels: Storage needed for TPoX data; Mean TPoX Query Response (functional eval); Storage needed for XMark data; Mean XMark Query Response (functional eval). Values as extracted: 601 MB, 1821 msec, 67 MB, 451 msec; 189 MB, 508 msec, 10 MB, 161 msec]
Oracle … 1/3rd the size, 3x faster
TPoX Benchmark Comparison of Oracle XML DB with another DB (with Indexes)
[Chart: Log Elapsed Time (ms), axis from 0.1 to 100, for queries Q1 to Q7. Series: Oracle 11gR2 Binary Storage with XTIDX; Another DB with XIDX]
(based on Gmean query time, 6000 customer docs)
Conclusion
• ProQuest is live on 11gR2 with XML DB
• Focus on Ingestion speed and scalability
• Binary XML Storage
• Range Partitioning
• 1TB of data
• Prototype underway for Content Store
• Focus on Query performance
• XML Index
• Text Index