

Lucene Domain Index

This project was originally sponsored by Lending Club, an online social lending network where people can borrow and lend money among themselves based upon their affinities and/or social connections. The project is under Apache V2 License:

Copyright 2004 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

1. Introduction
  1.1 What is Lucene
  1.2 What is Lucene Domain Index
  1.3 Why do I use Lucene Domain Index?
2. Install
  2.1 Requirements
  2.2 Install binary distributions
    2.2.1 11g Binary Distribution
    2.2.2 10g Binary Distribution
  2.3 Install Instructions to compile from sources
    2.3.1 Generating Maven's artifacts
  2.4 Optimizations
    2.4.1 Using NCOMP on 10g
    2.4.2 Using JIT on 11g
  2.5 Testing Lucene Domain Index
    Required grants for regular Oracle users
3. Examples
  IMPORTANT: Before start using Lucene Domain Index grant this to any Oracle user rather than LUCENE:
  3.1 Create a Lucene Domain Index
    3.1.1 Single column index
    IMPORTANT: Lucene Domain Index name can not be larger than 21 characters
    3.1.2 Multiple columns
    3.1.3 Multiple tables
    3.1.4 Padding and formatting
    3.1.5 Functional columns
    3.1.6 Create OnLine index
    3.1.7 Populate Index
    3.1.8 Parallel Index Operations
  3.2 Alter
  3.2 Rebuild
    3.2.1 Manual
    3.2.2 On Line
  3.3 Drop
  3.4 Querying
    3.4.1 Simple columns
    3.4.2 Multiple columns
    3.4.3 Pagination
    3.4.4 Sort
    3.4.5 Count Hits Function
    3.4.6 First Rows Optimizer Hint
    3.4.6 Highlighting
    3.4.7 Highlighting using pipeline table functions
    3.4.8 More like this functionality
    3.4.9 Facets
    3.4.10 Terms pipeline table functions
    3.4.11 Did You Mean functionality
  3.5 Synchronize
  3.6 Optimize
  3.6 XMLDB Export
  3.7 Exporting/Importing functional index with exp/imp Oracle tools
4. Locking and Performance
  4.1 Lock used by Lucene Domain Index
  4.2 Performance tips
    4.2.1 Index Writer parameters
    4.2.2 Auto Tune Memory functionality
    4.2.3 Keep Index on RAM
    4.2.4 Compare your execution plan
5 Know caveats
Appendixes
A. Parameter reference and syntax
  A.1 Lucene Index Writer parameters
    A.1.1 MergeFactor
    A.1.2 MaxBufferedDocs
    A.1.3 MaxMergeDocs
    A.1.4 MaxBufferedDeleteTerms
    A.1.5 UseCompoundFile
  A.2 Analyzer parameters
    A.2.1 Analyzer
    A.2.2 Stemmer
    A.2.3 PerFieldAnalyzer
  A.3 User Data Store parameters
    A.3.1 ExtraCols
    A.3.2 ExtraTabs
    A.3.3 WhereCondition
    A.3.4 UserDataStore
    A.3.2 FormatCols
  A.4 General parameters
    A.4.1 SyncMode
    A.4.2 AutoTuneMemory
    A.4.3 LobStorageParameters
    A.4.4 LogLevel
    A.4.5 CachedRowIdSize
  A.5 Query parameters
    A.5.1 DefaultColumn
    A.5.2 DefaultOperator
    A.5.3 NormalizeScore
    A.5.4 PreserveDocIdOrder
  A.6 Highlight parameters
    A.6.1 Formatter
    A.6.2 MaxNumFragmentsRequired
    A.6.3 FragmentSize
    A.6.4 FragmentSeparator
B Lucene Domain Index Storage
C JUnit test suites explained
  C.1 DBTestCase base class
  C.2 TestDBIndex
  C.3 TestDBIndexAddDoc
  C.4 TestDBIndexDelDoc
  C.5 TestDBIndexParallel
  C.6 TestDBIndexSearchDoc
  C.7 TestQueryHits
D Functions, operators and utilities
  D.1 lcontains ancillary operator
  D.2 lscore ancillary operator
  D.3 lhighlight ancillary operator
  D.4 phighlight pipeline table function
  D.5 rhighlight pipeline table function
  D.6 MoreLike.this function
  D.7 lfacets aggregate function
  D.8 index_terms pipeline table function
  D.9 high_freq_terms pipeline table function
  D.10 DidYouMean package
E Project Change Log
  2.9.2.1.0 Production Release based on Lucene 2.9 (2.9.2) core base
  2.9.1.1.0 Production Release based on Lucene 2.9 (2.9.1) core base
  2.9.0.1.0 Production release based on Lucene 2.9.0 core base, 29/Sep/09
  2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09)
  2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09)
  2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09)
  2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)
  2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08)
  2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08)
  2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07)
  2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07)
  2.2.0.1.1 (second release, 27/Sep/07 05:39 AM)
  2.2.0.1.0 (first release synchronized with lucene 2.2.0, 14/Sep/07 06:44 AM)
  2.0.0.1.3 (third release, 09/Jan/07 11:40 AM)
  2.0.0.1.2 (second release, 20/Dec/06 02:03 PM)
  2.0.0.1.1 (first release, 28/Nov/06 01:04 PM)
  2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM)

1. Introduction

1.1 What is Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Apache Lucene is an open source project available for free download.

If Lucene is a pure Java framework, why not use it inside the Oracle Database JVM environment?

1.2 What is Lucene Domain Index

Lucene Domain Index is a full integration of the Lucene project running inside the Oracle database using the Oracle JVM.

Oracle provides a full-featured JVM inside your Oracle Database, compliant with JDK 1.4 in the 10g release and JDK 1.5 in 11g.

OJVMDirectory replaces Lucene's file system storage with BLOB-based storage; the name comes from the class it overrides (Directory.java). Here is a short list of points to take into account when choosing this storage:

• Using a traditional file system to store the inverted index is not a good option for some users: you do not get commit or rollback behavior, backup, etc.

• Storing the inverted index in BLOBs while running Lucene outside the Oracle database performs badly, because there are a lot of network round trips and data marshaling.

• Indexing relational data stores, such as tables with VARCHAR2, CLOB or XMLType columns, with Lucene running outside the database has the same problem as the previous point.

• The JVM included inside the Oracle database can scale up to 10,000+ concurrent sessions without memory leaks or deadlocks, and all the operations on tables happen in the same memory space.

In addition, Oracle provides a Data Cartridge API (ODCI), also called the Extensible Indexing mechanism because you can write your own Domain Index and integrate it with the Oracle engine and optimizer.

There are some important points about integrating Lucene by using ODCI:

• Changes on rows are automatically notified to Lucene; these changes are now enqueued using Oracle AQ. The user can control whether these changes are applied OnLine or Deferred.

• The Oracle optimizer can choose a proper execution plan if there is a Domain Index created.

• You can mix the lcontains(), lhighlight() and lscore() operators in your queries.

1.3 Why do I use Lucene Domain Index?

Oracle includes a full-featured enterprise engine named Oracle Text, made in C and fully integrated with the Oracle engine, but:

on Oracle Text you can not:

• control which functionality will be included in the next release
• easily customize it for your needs
• index Index Organized Tables (IOT)
• index joined tables
• index unlimited extra columns
• easily highlight text
• index NCLOB and NVARCHAR data types

on 10g you can not:

• index multiple columns in the same index
• sort and filter by using indexed columns at index level

on 11g you can not:

• filter by / sort by columns of type timestamp with TZ, commonly used in XDB because it is the official data type for the xsd:date type

using Lucene inside Oracle:

• indexes are usually small, because Lucene Domain Index does not store any column, except the rowid, inside the Lucene inverted index structure. Using a rowid, Oracle can look up any column value faster than retrieving it from the Lucene inverted index.
• padding is supported for Text columns
• formatting (rounding/padding) is supported for Number and Date/Time columns
• you can create an index on-line even in Standard Edition databases (for Oracle Text this feature is available in EE)
• by extending the DefaultUserDataStore class, an application can implement any data type mapping, especially for BLOB columns, which in common cases have a non-standard encoding
• an experimental native REST WS can be used to query the index
• the Lucene inverted index is transactional: if a SQL operation is rolled back, the index stays consistent too, avoiding phantom reads or negative hits (rows which should be returned as hits but were not included in the Lucene index)
• it is a ready-to-use, up-to-date solution for any programming language, for example Ruby, .Net, Python or PHP
• it offers an elegant solution for highlighting text using pipeline table functions
• it is a high-level abstraction layer for the Lucene IR library; developers only deal with SQL
• Lucene storage is transparently compressed and encrypted if you enable Oracle Transparent Data Encryption and Secure File compression

2. Install

2.1 Requirements

• JDeveloper 11g, only if you want to edit Java code.
• Ant 1.7.0
• Sun JDK 1.5.0_05/1.4.2 (the $ORACLE_HOME/jdk directory works fine as Java Home for compiling on 10g and 11g)
• Linux/Windows Database Oracle 10g 10.2/11g production

2.2 Install binary distributions

Binary distributions are available at SourceForge.net and provide a very straightforward installation.

2.2.1 11g Binary Distribution

Edit your ~/build.properties file with your Database values (Windows users can find the build.properties file in the C:\Documents and Settings\username folder):

db.str=test
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4

db.str is your SQLNet connect string for your target database; check it first with tnsping.

This is an example of the environment settings before installing on an 11g database:

MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/11.1.0.6.0/db_1
ORACLE_SID=test
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG
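Before running the Ant targets, it can help to confirm that the variables from the example settings above are actually set in the current shell. The following is only a sketch: the function name missing_env is hypothetical (it is not part of the distribution), and the list of variables to check is an assumption taken from the example settings.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Lucene Domain Index): report which of
# the named environment variables are unset or empty.
# Prints "ok", or "missing: NAME ..." and returns 1 if any are missing.
missing_env() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable called $name; empty when unset.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Example call with variables taken from the example settings.
missing_env ORACLE_HOME ORACLE_SID JAVA_HOME || echo "set them before running ant"
```

The check only looks at whether a value is present; it does not verify that ORACLE_HOME points to a valid installation.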

Upload, install and test your code in the database:

# ant install-ojvm
# ant test-domain-index

For Oracle 11g you can perform a post-installation step:

# ant jit-lucene-classes

This target forces the translation of all Lucene, Snowball and OJVMDirectory classes to assembler, instead of waiting for the database to compile them as it detects the most used classes or methods.

2.2.2 10g Binary Distribution

First edit your ~/build.properties with something like this:

db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4

The db.str property is a SQLNet connect string for the target database. The ORACLE_HOME environment variable is required and must point to a properly configured Oracle 10g database layout. Finally, execute ant without arguments.

Here is an example of the environment settings on a 10g database:

MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/10.2.0/db_1
ORACLE_SID=orcl
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG

If you are re-installing the Oracle Lucene OJVM integration, first drop any Lucene Domain Index that is not installed in Lucene's schema.

The default target first drops the Lucene schema if it exists. Additionally (recommended for production systems), you can run "ant ncomp-ojvm", which translates all Lucene classes to C using JAccelerator, for example:

# ant ncomp-ojvm
# ant test-domain-index

2.3 Install Instructions to compile from sources

- Unpack or check out the Lucene sources.
- Check out the OJVM sources. For now only anonymous CVS access is provided; you can download them from the SourceForge servers with:

cd /tmp
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm

- Copy to $LUCENE_ROOT/contrib

# cd $LUCENE_ROOT/contrib
# cp -rp /tmp/ojvm .

- Edit $LUCENE_ROOT/common-build.xml, adding a target that creates a jar file with the test sources.

<target name="jar-test" depends="compile-test">
  <jar destfile="${build.dir}/${final.name}-test.jar" basedir="${build.dir}/classes/test" excludes="**/*.java"/>
</target>

- Also edit the "test" target in the same file, adding db.usr, db.pwd and db.str as system properties so that they are available to the Lucene Domain Index JUnit suites.

<target name="test" depends="compile-test" description="Runs unit tests">
  <fail unless="junit.present">
    ##################################################################
    JUnit not found.
    Please make sure junit.jar is in ANT_HOME/lib, or made available
    to Ant using other mechanisms like -lib or CLASSPATH.
    ##################################################################
  </fail>
  ............
  <!-- contrib/ojvm uses these system properties to connect to the target database -->
  <sysproperty key="db.str" value="${db.str}"/>
  <sysproperty key="db.usr" value="${db.usr}"/>
  <sysproperty key="db.pwd" value="${db.pwd}"/>
  ............
  <delete file="${build.dir}/test/junitfailed.flag" />
</target>

- (OPTIONAL) Update Lucene's BufferedIndexInput.BUFFER_SIZE according to your db_block_size init.ora parameter.
Before compiling and uploading the Lucene core library, you can change the org.apache.lucene.store.BufferedIndexInput.BUFFER_SIZE constant to the value of your db_block_size init parameter. This change improves read performance by using the same block size as the physical block size your database uses.
- Compile the OJVM Directory sources and tests. These targets automatically copy all the libraries required by Lucene Domain Index from your $ORACLE_HOME and the Internet. Starting with OJVM 2.4.0.1.x, the build.xml file automatically compiles all the Lucene contrib modules it depends on.

# cd $LUCENE_ROOT/contrib/ojvm
# ant jar-core
# ant jar-test

- Edit your ~/build.properties file with your Database values:

db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4

db.str is your SQLNet connect string for your target database; check it first with the tnsping utility. Also note that for an 11g database the user and password are case sensitive, so leave LUCENE in uppercase.
- Upload your code to the database

# ant install-ojvm
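Since the install target depends on the connection values in build.properties, a quick pre-flight check can catch a missing entry before Ant tries to connect. The sketch below is hypothetical and not part of the project: the function name missing_keys is an assumption, and it only checks the five connection keys shown in the example file.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with the project): print every
# connection key from the example build.properties that the given file does
# not define with a non-empty value.
missing_keys() {
    file=$1
    for key in db.str db.usr db.pwd dba.usr dba.pwd; do
        # Exact match on the text before "=", and a non-empty value after it.
        if ! awk -F= -v k="$key" \
            '$1 == k && $2 != "" { found = 1 } END { exit !found }' \
            "$file" 2>/dev/null; then
            echo "$key"
        fi
    done
}

# Check the file given as first argument, or ~/build.properties by default.
props="${1:-$HOME/build.properties}"
if [ -z "$(missing_keys "$props")" ]; then
    echo "$props defines every connection property"
else
    echo "$props is missing: $(missing_keys "$props" | tr '\n' ' ')"
fi
```

A file that does not exist is reported as missing all five keys. The check says nothing about whether the values are correct; tnsping remains the way to verify db.str.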

2.3.1 Generating Maven's artifacts

You can generate the Lucene and OJVM Directory Maven artifacts by following the previous steps and then executing:

# ant generate-maven-artifacts


2.4 Optimizations

2.4.1 Using NCOMP on 10g

It is strongly recommended that, before going into production on 10g databases, you install Oracle Lucene Domain Index NCOMPed. NCOMP automatically translates the Lucene and OJVMDirectory Java code to assembler and finally installs it as a dynamic link library (.so/.dll) in your Oracle home. To do this, simply execute this Ant target instead of the install-ojvm target:

# ant ncomp-ojvm

2.4.2 Using JIT on 11g

First verify that your database parameter java_jit_enabled is TRUE. Oracle 11g includes a JIT technology which automatically translates the most used Java methods to assembler. If you want to pre-compile all the Lucene Java code to assembler, rather than waiting for the Oracle database to detect commonly used code, you can execute these targets:

ant jit-lucene-classes
ant jit-oracle-classes

2.5 Testing Lucene Domain Index

Required grants for regular Oracle users

IMPORTANT: Before you start using Lucene Domain Index, grant this permission to any Oracle user other than LUCENE:

-- connected as sysdba
begin
  dbms_java.grant_permission('SCOTT','SYS:java.util.logging.LoggingPermission','control', '' );
  commit;
end;
/

Lucene Domain Index has two kinds of test suites to check that everything is OK after installation.

The first test suite, which can be launched using Ant, is pure SQL and uses SQL*Plus to run. To launch it, simply execute:

[mochoa@mochoa ojvm]$ ant test-domain-index
Buildfile: build.xml
[echo] Building ojvm...
test-domain-index:
[exec]
[exec] SQL*Plus: Release 11.1.0.6.0 - Production on Wed Dec 5 17:43:24 2007
[exec]
[exec] Copyright (c) 1982, 2007, Oracle. All rights reserved.
[exec]
[exec]
[exec] Connected to:
[exec] Oracle Database 11g Release 11.1.0.6.0 - Production
[exec]
[exec]
[exec] Table dropped.
[exec]
[exec]
[exec] Table created.
[exec]
[exec] SQL> Disconnected from Oracle Database 11g Release 11.1.0.6.0 - Production
[echo] See output at ../../build/testLuceneDomainIndex.txt

Except for the test that uses the test_source_small table, which writes its log to the .trc files, the tests write their log information to the ../../build/testLuceneDomainIndex.txt file.

The second test suite is a set of JUnit tests that simulate middle-tier environments; it also uses a connection pool. To start these suites, run:

[mochoa@mochoa ojvm]$ ant -Ddb.usr=scott -Ddb.pwd=tiger -Ddb.str=test "-Djunit.includes=**/AllTests.java" ojvm-test
Buildfile: build.xml
[echo] Building ojvm...
ojvm-test:
[echoproperties] #Ant properties
[echoproperties] #Wed Dec 05 17:56:30 ART 2007
.........
common.test:
[junit] Testsuite: org.apache.lucene.index.TestDBIndex
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.883 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 40 total bytes inserted: 421 avg text length: 10
[junit] Index synced: IT1 elapsed time: 249 ms.
[junit] Avg Sync time: 6
[junit] Index optimized: IT1 elapsed time: 46 ms.
[junit] Avg Optimize time: 1
[junit] Row deleted 41, from: 10 to: 50 elapsed time: 2005 ms. Avg time: 48 ms.
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] ------------- ---------------- ---------------
.............
[junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.158 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] iteration from: 13775 to: 13785
[junit] Step time: 1291 ms.
[junit] iteration from: 13785 to: 13795
[junit] Step time: 157 ms.
[junit] iteration from: 13795 to: 13805
[junit] Step time: 144 ms.
[junit] iteration from: 13805 to: 13815
[junit] Step time: 147 ms.
[junit] iteration from: 13815 to: 13825
[junit] Step time: 145 ms.
[junit] iteration from: 13825 to: 13835
[junit] Step time: 147 ms.
[junit] iteration from: 13835 to: 13845
[junit] Step time: 145 ms.
[junit] iteration from: 13845 to: 13855
[junit] Step time: 150 ms.
[junit] iteration from: 13855 to: 13865
[junit] Step time: 278 ms.
[junit] iteration from: 13865 to: 13875
[junit] Step time: 146 ms.
[junit] Elapsed time: 3159
[junit] Hits: 18387
[junit] Elapsed time: 653
[junit] ------------- ---------------- ---------------
[junit] Testsuite: org.apache.lucene.indexer.TestTableIndexer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.685 sec
[junit]
[delete] Deleting: /u01/src/lucene-2.2.0/build/contrib/ojvm/test/junitfailed.flag

BUILD SUCCESSFUL
Total time: 8 minutes 22 seconds

org.apache.lucene.indexer.TestQueryHits uses a table that is too big to create and destroy in the setup() and tearDown() methods. Before running this test, create the table with:

create table test_source_big as (select * from all_source);

and its index on 10g with:

create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');

Or in 11g with:

create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;FormatCols:line(0000);ExtraCols:line "line";Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;LobStorageParameters:PCTVERSION 0 DISABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');


If you want to execute only one or a few specific tests, override Ant's junit.includes property, for example:

ant -Ddb.usr=scott -Ddb.pwd=tiger -Ddb.str=test "-Djunit.includes=**/TestDBIndex.java" ojvm-test

Note that this argument is enclosed in double quotes to prevent the Unix shell from expanding the wildcard.

3. Examples

IMPORTANT: Before you start using Lucene Domain Index, grant this permission to any Oracle user other than LUCENE:

-- connected as sysdba
begin
  dbms_java.grant_permission('SCOTT','SYS:java.util.logging.LoggingPermission','control', '' );
  commit;
end;
/

3.1 Create a Lucene Domain Index

3.1.1 Single column index

Table example:

create table t1 (f1 number,f2 varchar2(200),f3 varchar2(200),f4 number unique);

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer');

This creates a domain index on column f2 of table t1, using SimpleAnalyzer as the Lucene Analyzer. After this DDL command is executed, two new tables, one AQ queue and one index exist in the user's schema, named IT1$T, IT1$QT, IT1$Q and IT1$DI respectively.

Another example, using the Snowball Stemmer instead of a Lucene Analyzer:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English');

This creates a domain index on column f2 of table t1 using the English Stemmer.


IMPORTANT: A Lucene Domain Index name cannot be longer than 21 characters.

This limit is due to a limitation on Oracle DBMS AQ table names. Every Lucene Domain Index has an associated AQ queue named idx_name$Q and its queue table idx_name$QT.
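The naming rule can be checked before issuing the DDL. The following is a minimal Python sketch, not part of Lucene Domain Index: the function name is hypothetical, and the $T, $Q and $QT suffixes and the 21-character limit are the only facts taken from this manual. It assumes the index name is an unquoted identifier, which Oracle stores in upper case.

```python
MAX_INDEX_NAME_LEN = 21  # limit stated in this manual


def derived_object_names(index_name: str) -> dict:
    """Return the object names derived from a Lucene Domain Index name.

    Hypothetical helper for illustration only; it mirrors the
    idx_name$T / idx_name$Q / idx_name$QT naming described in the text.
    """
    name = index_name.upper()  # unquoted Oracle identifiers are upper-cased
    if len(name) > MAX_INDEX_NAME_LEN:
        raise ValueError(
            f"index name {name!r} has {len(name)} characters, "
            f"the maximum is {MAX_INDEX_NAME_LEN}"
        )
    return {
        "storage_table": name + "$T",
        "queue": name + "$Q",
        "queue_table": name + "$QT",
    }
```

For the index it1 this yields IT1$T, IT1$Q and IT1$QT; a name longer than 21 characters is rejected before any DDL is attempted.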

3.1.2 Multiple columns

For the previous one-table example you can also index extra columns by passing the information as a parameter to the index, because Oracle 10g does not support Domain Indexes on compound columns. Here is an example:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;ExtraCols:F1 "f1"');

Creating an index with the ExtraCols parameter causes Lucene to index both columns: the master column f2 is indexed as F2 and F1 is indexed as "f1". As you can see below in the query examples, the lcontains() operator accepts Lucene's Query Parser syntax, which can select a specific field using, for example, f1:text. Using the ExtraCols parameter implies that the create index operation performs a full scan on table t1 with a statement like: SELECT ROWID,F2,F1 "f1" FROM T1.

Because the ODCI API does not detect changes on columns other than the master column, you need to create a trigger that fires an update on the master column when a change on a column in the ExtraCols list is detected. Here is an example:

CREATE OR REPLACE TRIGGER L$IT1
BEFORE UPDATE OF f1 ON t1
FOR EACH ROW
BEGIN
  :new.f2 := :new.f2;
END;
/

Any change on f1 also forces a change on f2, so ODCI notifies Lucene that a specific rowid was updated. Lucene Domain Index, based on its parameter definition, then updates the inverted index to reflect changes in both columns.

3.1.3 Multiple tables

Lucene Domain Index supports indexing multiple columns from multiple tables that can be joined in a natural form. This means defining a list of tables with the ExtraTabs parameter and a where condition with the WhereCondition parameter. Here is an example:

create table t2 (
  f4 number primary key,
  f5 VARCHAR2(200));

create table t1 (
  f1 number,
  f2 VARCHAR2(4000),
  f3 number,
  CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
  REFERENCES t2(f4) ON DELETE cascade);

You can index both tables using t1 as the master index definition with:

create index it1 on t1(f3) indextype is lucene.LuceneIndex
parameters('ExtraCols:L$MT.f2 "f2",t2.f5 "f5";ExtraTabs:t2;WhereCondition:L$MT.f3=t2.f4');

Note that tables t1 and t2 are joined directly by a foreign key, so t2 can be considered a satellite table of t1. With this set of parameters, when the ODCI API detects a change on the it1 master column (f3), a select like this is executed:

select L$MT.f3,L$MT.f2 "f2",t2.f5 "f5" from t1 L$MT,t2 where L$MT.rowid=? and L$MT.f3=t2.f4;

In this query the column list after the master column comes from the ExtraCols parameter, the extra table comes from ExtraTabs and the join condition comes from WhereCondition; the remaining parts (the master column, the L$MT alias and the rowid predicate) are injected by the Lucene Domain Index implementation. The table alias L$MT is automatically added by Lucene Domain Index to the master table. This alias is important for creating complex joins with Object Tables that use the existsNode or extractValue operators; that functionality was added starting with the 2.9.0.1.0 release.

With the above scenario, a trigger that keeps the Lucene Index synced with changes in any column defined in the ExtraCols parameter is a bit more complex. It requires a combination of two triggers:

CREATE OR REPLACE TRIGGER L$IT1
BEFORE UPDATE OF f2 ON t1
FOR EACH ROW
BEGIN
  :new.f3 := :new.f3;
END;
/
CREATE OR REPLACE TRIGGER LT$IT1
AFTER UPDATE OF f5 ON t2
FOR EACH ROW
DECLARE
  ridlist sys.ODCIRidList;
BEGIN
  SELECT ROWID
  BULK COLLECT INTO ridlist
  FROM T1 WHERE F3=:NEW.f4;
  LuceneDomainIndex.enqueueChange(USER||'.IT1',ridlist,'update');
END;
/

The first trigger is similar to the previous example. The second trigger, on the satellite table, looks up all rowids in the master table that reference the satellite row, then uses the LuceneDomainIndex.enqueueChange procedure to notify Lucene Domain Index of the changes. sys.ODCIRidList is a special ODCI structure that holds a group of rowids.
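The per-row select shown earlier in this section can be pictured as plain string assembly from the index parameters. The Python sketch below is an illustration only and makes no claim about how the cartridge builds the statement internally; the function name and argument names are hypothetical, while the L$MT alias and the rowid predicate are taken from the query quoted in this manual.

```python
MASTER_ALIAS = "L$MT"  # alias the manual says is added to the master table


def build_row_select(master_table, master_column,
                     extra_cols="", extra_tabs="", where_condition=""):
    """Assemble the select executed for one changed rowid (illustrative)."""
    # Master column first, then the ExtraCols list exactly as given.
    select_list = f"{MASTER_ALIAS}.{master_column}"
    if extra_cols:
        select_list += "," + extra_cols
    # Master table with its alias, then the ExtraTabs list.
    from_list = f"{master_table} {MASTER_ALIAS}"
    if extra_tabs:
        from_list += "," + extra_tabs
    # Rowid predicate, then the WhereCondition join.
    where = f"{MASTER_ALIAS}.rowid=?"
    if where_condition:
        where += " and " + where_condition
    return f"select {select_list} from {from_list} where {where}"


query = build_row_select(
    "t1", "f3",
    extra_cols='L$MT.f2 "f2",t2.f5 "f5"',
    extra_tabs="t2",
    where_condition="L$MT.f3=t2.f4",
)
```

With the parameters of index it1 above, the assembled string matches the select quoted in this section (without the trailing semicolon).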


3.1.4 Padding and formatting

Lucene Domain Index can be customized with a parameter named UserDataStore, which defines the class responsible for creating Lucene documents. A Lucene document is a list of Fields, one for each indexed column, plus an extra field named rowid that is stored compressed and untokenized. By default UserDataStore is set to org.apache.lucene.indexer.DefaultUserDataStore.

The Default User Data Store supports left padding for NUMBER or FLOAT columns, and left character padding for VARCHAR2 or CHAR columns. To define padding, use the FormatCols parameter in the create or alter index DDL command. Here is an example:

create table t1 (f1 number primary key, f2 varchar2(200), f3 number(4,2)) ORGANIZATION INDEX;
insert into t1 values (1, 'ravi', 3.46);
insert into t1 values (3, 'murthy', 15.87);
commit;

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');

In the above example all values of column F2 are automatically padded to 15 characters with z, and column F3 is formatted using 00.00. The rows are then indexed as these Lucene documents:

Document<stored/compressed,indexed<rowid:*BAEAPBQCwQL+> indexed,tokenized<F2:zzzzzzzzzzzravi> indexed<F3:03.46>>
Document<stored/compressed,indexed<rowid:*BAEAPBQCwQT+> indexed,tokenized<F2:zzzzzzzzzmurthy> indexed<F3:15.87>>
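To make the padding rules concrete, here is a small Python sketch that reproduces the F2 and F3 values shown in the documents above. It is an approximation written for this manual, not the DefaultUserDataStore code: the function names are hypothetical, and it assumes that a character mask pads on the left with the mask character and that a numeric mask such as 00.00 fixes the total width and the number of decimals.

```python
def pad_text(value: str, mask: str) -> str:
    """Left-pad value with the mask character up to the mask length."""
    return value.rjust(len(mask), mask[0])


def pad_number(value: float, mask: str) -> str:
    """Zero-pad a number using a mask such as 00.00 or 0000."""
    _, _, fraction = mask.partition(".")
    decimals = len(fraction)   # digits after the decimal point
    width = len(mask)          # total width, including the point
    return f"{value:0{width}.{decimals}f}"


f2_mask = "z" * 15  # same as F2(zzzzzzzzzzzzzzz)
padded_names = [pad_text(v, f2_mask) for v in ("ravi", "murthy")]
padded_numbers = [pad_number(v, "00.00") for v in (3.46, 15.87)]
```

Padding matters because Lucene compares terms as strings: with a fixed width, 03.46 sorts before 15.87, which would not hold for the unpadded 3.46.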

For columns based on Oracle XMLType, the FormatCols parameter can be used to define an XPath expression that selects the subset of XML nodes to be indexed.

create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));
commit;

create index it1 on t1(f1) indextype is lucene.LuceneIndex
parameters('ExtraCols:F2;FormatCols:F1(000),F2(/emp/name)');

The above rows will be indexed as:

Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA> indexed,tokenized<F1:001> indexed,tokenized<F2:ravi >>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB> indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >>

For columns of type VARCHAR/CHAR/CLOB, the special strings NOT_ANALYZED, NOT_ANALYZED_STORED, ANALYZED_WITH_OFFSETS, ANALYZED_WITH_POSITIONS and ANALYZED_WITH_POSITIONS_OFFSETS can be used as the format. These constants tell the User Data Store class whether a field is stored and whether it is indexed untokenized; untokenized fields can then be used as sort fields. Here is an example:

create table emails (
  emailFrom VARCHAR2(256),
  emailTo VARCHAR2(256),
  subject VARCHAR2(4000),
  emailDate DATE,
  bodyText CLOB)
/

create index emailbodyText on emails(bodyText) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo"');

-- required to Sort by subject
alter index emailbodyText
parameters('LogLevel:INFO;FormatCols:subject(NOT_ANALYZED),emailFrom(NOT_ANALYZED),emailTo(NOT_ANALYZED)');

The translation rules to Lucene Fields are:

• NOT_ANALYZED - Field(name, value, Field.Store.NO, Field.Index.NOT_ANALYZED)
• NOT_ANALYZED_STORED - Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED)
• ANALYZED - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO)
• ANALYZED_WITH_VECTORS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES)
• ANALYZED_WITH_OFFSETS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_OFFSETS)
• ANALYZED_WITH_POSITIONS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS)
• ANALYZED_WITH_POSITIONS_OFFSETS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS)

If no format value is associated with a column of type VARCHAR/CHAR/CLOB, it is converted to a Lucene Field as Field(name, val, Field.Store.NO, Field.Index.ANALYZED).
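The translation rules above amount to a lookup table with a default. The Python sketch below restates them as data so the default case is easy to see; the names FIELD_RULES and field_rule are hypothetical, the tuples only echo the Field.Store, Field.Index and Field.TermVector constants listed above, and treating any unrecognized format as the default is an assumption.

```python
# (store, index, term_vector); None means no term vector argument is passed.
FIELD_RULES = {
    "NOT_ANALYZED": ("NO", "NOT_ANALYZED", None),
    "NOT_ANALYZED_STORED": ("YES", "NOT_ANALYZED", None),
    "ANALYZED": ("YES", "ANALYZED", "NO"),
    "ANALYZED_WITH_VECTORS": ("YES", "ANALYZED", "YES"),
    "ANALYZED_WITH_OFFSETS": ("YES", "ANALYZED", "WITH_OFFSETS"),
    "ANALYZED_WITH_POSITIONS": ("YES", "ANALYZED", "WITH_POSITIONS"),
    "ANALYZED_WITH_POSITIONS_OFFSETS": ("YES", "ANALYZED", "WITH_POSITIONS_OFFSETS"),
}

# Rule for a VARCHAR/CHAR/CLOB column with no format keyword.
DEFAULT_RULE = ("NO", "ANALYZED", None)


def field_rule(format_keyword=None):
    """Return the (store, index, term_vector) rule for a format keyword."""
    return FIELD_RULES.get(format_keyword, DEFAULT_RULE)
```

Note that only the two NOT_ANALYZED keywords produce untokenized fields, which is why the emails example uses NOT_ANALYZED for the columns it later sorts on.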

3.1.5 Functional columns

The ExtraCols parameter can also define functional columns for the Lucene Index, which means that any SQL function valid in a select list is allowed. For example, using the above table definition:

create index it1 on t1(f1) indextype is lucene.LuceneIndex
parameters('ExtraCols:F2,extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),F2(/emp/name),id(00)');

This creates the following set of indexed Lucene documents:

Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA> indexed,tokenized<F1:001> indexed,tokenized<F2:ravi > indexed,tokenized<id:01>>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB> indexed,tokenized<F1:003> indexed,tokenized<F2:murthy > indexed,tokenized<id:03>>

Note that a virtual column was defined and indexed as "id"; this column is then available to the lcontains operator.

3.1.6 Create OnLine index

If you specify SyncMode:OnLine in the create index DDL operation, Lucene Domain Index enqueues all rowids of the master table for indexing in batches of BatchCount rows (default is 115). As soon as the command returns the index is ready, and a PL/SQL AQ Callback populates the Lucene Index structure in the background. For example:

create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('SyncMode:OnLine;LogLevel:WARNING;Stemmer:Spanish;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
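The batching described above is ordinary fixed-size chunking. As a rough illustration in Python (the function name is hypothetical and this is not the cartridge code; only the default of 115 comes from this manual), a list of rowids would be split like this before each chunk is enqueued as one message:

```python
DEFAULT_BATCH_COUNT = 115  # default BatchCount stated in this manual


def rowid_batches(rowids, batch_count=DEFAULT_BATCH_COUNT):
    """Yield consecutive lists of at most batch_count rowids."""
    if batch_count < 1:
        raise ValueError("batch_count must be positive")
    for start in range(0, len(rowids), batch_count):
        yield list(rowids[start:start + batch_count])


# 300 placeholder rowids split with the default batch size.
sample_rowids = [f"rowid-{n}" for n in range(300)]
batch_sizes = [len(batch) for batch in rowid_batches(sample_rowids)]
```

A larger BatchCount means fewer AQ messages but more work per callback invocation, which is the trade-off the parallel indexing section comes back to.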

3.1.7 Populate Index

Using PopulateIndex:false in the create index DDL statement creates an empty Lucene Index structure and leaves the Domain Index ready. You can then call the alter index rebuild DDL statement to populate it. Here is an example:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('PopulateIndex:false;LogLevel:ALL;IncludeMasterColumn:false;ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),id(00)');

-- At this point the it1 index is ready but not populated
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;
-- Populate Index
alter index it1 rebuild
parameters('Analyzer:org.apache.lucene.analysis.WhitespaceAnalyzer');
-- query, returns one row
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;

3.1.8 Parallel Index Operations

Starting with Lucene Domain Index 2.9.1.1.0, you can enable parallel operations with the ParallelDegree parameter, which can be 0 or 2 to 9. It is implemented using multiple Data Storages to process insert operations in parallel, which is useful on multi-core chips or in RAC environments. For now only inserts are parallelized, and the index must be configured in OnLine mode. The following is an example of index creation with parallel inserts enabled:

create index source_big_lidx on test_source_big(text)
indextype is lucene.luceneindex
parameters('BatchCount:1000;ParallelDegree:4;SyncMode:OnLine;LogLevel:INFO;AutoTuneMemory:true;PerFieldAnalyzer:line(org.apache.lucene.analysis.KeywordAnalyzer),TEXT(org.apache.lucene.analysis.SimpleAnalyzer);FormatCols:line(0000);ExtraCols:line "line"');

After this index DDL statement is executed, five new tables will be visible in the user's schema: SOURCE_BIG_LIDX$T (master index storage) and SOURCE_BIG_LIDX$[0..3]$T (storage for the slave processes). A sequence SOURCE_BIG_LIDX$S is also created, generating numbers from 0 to 3.

The parallel implementation enqueues batches of 1000 rows (BatchCount parameter) on the master queue related to the index. The AQ callback enabled for this queue dequeues each batch of rows and enqueues it in the slave queues. As a result Oracle AQ runs multiple AQ server processes; you can see multiple ora_j00x_sid processes running.

With Oracle 11g we saw that the AQ implementation does not start new slave processes if one callback is using a lot of CPU. In my experience, setting the BatchCount parameter to values up to 250 leaves enough pending work on the AQ queues to guarantee that multiple slave processes are executed, resulting in truly parallel insert operations.
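The storage layout and the role of the 0 to 3 sequence can be sketched as follows. This Python fragment is a guess at the shape of the mechanism, written only to illustrate the text: the function names are hypothetical, and the round-robin use of the sequence to pick a slave storage is an assumption, since the manual only states that the sequence generates numbers from 0 to 3.

```python
from itertools import cycle


def check_parallel_degree(degree: int) -> int:
    """ParallelDegree can be 0 or 2 to 9, as stated in this manual."""
    if degree != 0 and not 2 <= degree <= 9:
        raise ValueError("ParallelDegree must be 0 or between 2 and 9")
    return degree


def storage_tables(index_name: str, degree: int) -> list:
    """Master storage table plus one table per slave process."""
    name = index_name.upper()
    check_parallel_degree(degree)
    return [f"{name}$T"] + [f"{name}${n}$T" for n in range(degree)]


def assign_batches(num_batches: int, degree: int) -> list:
    """Assumed round-robin: batch i goes to the slave numbered i mod degree."""
    if check_parallel_degree(degree) == 0:
        raise ValueError("no slave storages when ParallelDegree is 0")
    slots = cycle(range(degree))  # stand-in for the idx_name$S sequence
    return [next(slots) for _ in range(num_batches)]
```

For ParallelDegree:4 the first function lists the five tables named in the text, and six batches would land on slaves 0, 1, 2, 3, 0, 1 under the round-robin assumption.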

3.2 Alter

The SQL DDL alter index command can be used with Lucene Domain Index to change any parameter after index creation. Lucene Domain Index parameters are a simple list of name:value pairs stored in the Lucene OJVMDirectory storage. To remove a parameter from the storage, prepend ~ to the parameter name.

Here are some examples of alter index:

alter index it1
parameters('MaxBufferedDocs:500;AutoTuneMemory:false');

Changes the Lucene Index Writer parameter MaxBufferedDocs to 500 and disables the Auto Tune Memory functionality.

alter index it1
parameters('MaxBufferedDocs:500;AutoTuneMemory:false;SyncMode:OnLine');

Similar to the previous example, but also sets SyncMode to OnLine.

alter index it1 parameters('~SyncMode:OnLine');

Removes the SyncMode parameter set in the above example. You can get similar functionality by setting SyncMode:Deferred, which is the default value for SyncMode.
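The effect of a parameters('...') string on the stored name:value list can be modelled in a few lines. The Python sketch below is an illustration, not the cartridge's parser: the function name is hypothetical, and it assumes that pairs are separated by ; and that the name ends at the first : in each pair, which holds for the examples in this manual.

```python
def apply_parameters(stored: dict, param_string: str) -> dict:
    """Return a copy of stored with the alter index parameters applied."""
    result = dict(stored)
    for pair in param_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        name, _, value = pair.partition(":")
        if name.startswith("~"):
            # ~Name removes the parameter; the value after ':' is ignored.
            result.pop(name[1:], None)
        else:
            result[name] = value
    return result


params = apply_parameters({}, "MaxBufferedDocs:500;AutoTuneMemory:false;SyncMode:OnLine")
after_removal = apply_parameters(params, "~SyncMode:OnLine")
```

After the second call SyncMode is no longer stored, so the index falls back to the default (Deferred), which is the behavior described for the ~SyncMode example.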

3.2 Rebuild

SQL DDL alter index allows you to rebuild an index from scratch. This is useful when a Lucene Domain Index is damaged or corrupted, or when you need to change a parameter that must be applied to rows already indexed, for example the Lucene Analyzer parameter.

3.2.1 Manual

A manual index rebuild is the typical way. Here is an example:

alter index it1 rebuild
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;MaxBufferedDocs:500;AutoTuneMemory:false');

The above example shows how to change the Lucene Index Analyzer. If you change your index Analyzer it is necessary to rebuild the complete index, because you should not query an index with an analyzer different from the one used at index time.

3.2.2 On Line

Alter index rebuild does not return until the complete operation is finished. Rebuild OnLine is a functionality for Oracle indexes available in Enterprise Edition databases, but with a little trick you can rebuild a Lucene Domain Index on line too.

If you are working with SyncMode:Deferred you need to change to SyncMode:OnLine; then you can rebuild the index by using:

alter index it1 rebuild
parameters('SyncMode:OnLine;MergeFactor:100;BatchCount:1000');
commit; -- notify change to AQ Callback

The rebuild command enqueues batches of 1000 rowids of the master table of index it1 for addition to the Lucene Index structure. The Lucene Domain Index AQ Callback then processes these messages in a background database process and automatically commits the changes when it finishes.

3.3 Drop

Dropping a Lucene Index does not differ from dropping any other index. Just call:

drop index it1;

This operation drops the Lucene Domain Index table (IT1$T for the above example) and the AQ queue IT1$Q with its storage table IT1$QT. If the index is configured with SyncMode:OnLine, the AQ Callback is disabled first.

If something goes wrong during the drop index command, you can add "force" at the end of the command to make sure that the system views will not keep any reference to the index.

3.4 Querying

Lucene Domain Index defines a new SQL operator named lcontains() with its ancillary operators lscore() and lhighlight(). Its functionality is similar to the Oracle Text contains and score operators. The next examples show the operator's functionality and parameters.

3.4.1 Simple columns

For the table and index defined in sections 3.1.4/3.1.5, a simple usage of lcontains and lscore is:

SQL> select lscore(1),f2 from t1 where lcontains(f1, 'F2:ravi',1) > 0;

LSCORE(1)
----------
F2
--------------------------------------------------------------------------------
1
<emp id="1">
<name>ravi</name>
</emp>

SQL>

The first parameter of the lcontains operator is the column that has the Lucene Domain Index attached. This is the master column of the index and is the default field for Query Parser syntax.

The second parameter is a Lucene Query Parser syntax string. The above example defines the Lucene Domain Index on column f1, so F2 is not the default field for the query; to query for a string inside column F2 it is necessary to qualify it explicitly with "F2:".

If you want to use lscore, it is necessary to specify a correlation id as the third argument of lcontains, in this example "1". This correlation id is then matched with lscore(1) to associate the ancillary operator with the proper lcontains.

If you are querying the master column of the index you can simply omit the column qualifier, for example:

SQL> select lscore(1) from t1 where lcontains(f1,'001',1)>0;

LSCORE(1)
----------
1

SQL>

The lcontains() operator must always be compared with >0.

3.4.2 Multiple columns

Query Parser syntax supports many logical operators and term modifiers; you can combine any of them with each indexed column. Here is a practical example using the table and index from sections 3.1.4/3.1.5:

SQL> select f1,lscore(1) sc,extractValue(f2,'/emp/@id') id from t1
  2 where lcontains(f1, '003 OR (F2:ravi AND id:01)',1)>0;

F1 SC ID
---------- ---------------- -----------------------------
1 .577350259 1
3 .288675129 3

Note that the first row matches on column F2:ravi and functional column id:01, and the second row matches on F1 equal to 003 (remember that the F1 qualifier is not necessary because F1 is the master column of the index defined in 3.1.5).

3.4.3 Pagination

The lcontains operator has an extension to Query Parser syntax that includes in-line pagination information for the Lucene Domain Index hits result.

You can select a specific window (pagination) of your query by injecting a Query Parser-like range inside the lcontains() operator. For example:

select /*+ DOMAIN_INDEX_SORT */ ... where lcontains(col,'rownum:[20 TO 40] AND word1',1)>0 order by lscore(1) DESC;

The Lucene Domain Index implementation automatically extracts the pagination information rownum:[n TO m] AND from the beginning of the query string and returns to the Oracle optimizer a subset of 20 rowids. This extension provides a large performance improvement by eliminating the need for Oracle's Top-N syntax, which has to collect all rowids and then filter them to calculate the window.

Because inline pagination is a home-made extension to Query Parser syntax, it has two limitations or known caveats:

• rownum:[n TO m] AND must be at the beginning of the Query Parser string and written exactly as shown; the implementation simply uses the string positions of the rownum and AND reserved keywords to extract the start and stop index of the window.

• Pagination is concatenated using the AND boolean operator, but it is not strictly an AND operator because it is a simple substring splitting operation. This means that the grouping priority of AND and OR is not applied to the first part: rownum:[n TO m] AND xx OR bb would normally be evaluated as (rownum:[n TO m] AND xx) OR bb, but instead Lucene searches for xx OR bb and ODCI extracts the n TO m window.

Note: DOMAIN_INDEX_SORT optimizer hint is required for Sorting, see next section.
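The prefix splitting described in this section can be sketched in Python as follows. This is an illustration of the documented behavior, not the cartridge code: the function name is hypothetical, and a regular expression is used here for brevity, whereas the manual says the implementation works with string positions of the rownum and AND keywords.

```python
import re

# The prefix must be at the very beginning and written exactly like this.
_WINDOW_PREFIX = re.compile(r"^rownum:\[(\d+) TO (\d+)\] AND ")


def split_pagination(query: str):
    """Return ((n, m), remaining_query), or (None, query) if no prefix."""
    match = _WINDOW_PREFIX.match(query)
    if match is None:
        return None, query
    window = (int(match.group(1)), int(match.group(2)))
    # Everything after the prefix goes to Lucene unchanged.
    return window, query[match.end():]


window, lucene_query = split_pagination("rownum:[20 TO 40] AND word1")
caveat = split_pagination("rownum:[1 TO 10] AND xx OR bb")
```

The second call shows the caveat from the list above: the remaining query handed to Lucene is xx OR bb, so the OR is not grouped with the pagination prefix.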

3.4.4 Sort

Lucene can sort the result of a particular query; Lucene Domain Index provides sorting through an extra argument of the lcontains() operator. Here are examples of sorting using the emails table created in section 3.1.4:

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC',1)>0;

SUBJECT
--------------------------------------------------------------------------------
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
lucene injection

Elapsed: 00:00:00.04
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  2 where lcontains(bodytext,'security','subject:DESC',1)>0;

SUBJECT
--------------------------------------------------------------------------------
lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection

Elapsed: 00:00:00.17

The sort parameter syntax is a comma-separated string of field[:ORDER[:TYPE]] values. Fields included in the sort spec should be NOT_ANALYZED or NOT_ANALYZED_STORED; see the FormatCols argument above. ORDER can be ASC or DESC, and the default value is ASC. TYPE can be string, float or int; starting with Lucene 2.9.0 the default value is string.

Using the above table, a slightly more complicated sort spec is:

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2 where lcontains(bodytext,'security','subject:ASC:string,emailFrom:DESC:string',1)>0;

SUBJECT EMAILFROM
----------------------------------------- --------------------------------------
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
lucene injection [email protected]

Elapsed: 00:00:00.06
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2 where lcontains(bodytext,'security','subject:ASC:string,emailFrom:ASC:string',1)>0;

SUBJECT EMAILFROM
----------------------------------------- --------------------------------------
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
lucene injection [email protected]

Elapsed: 00:00:00.05
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2 where lcontains(bodytext,'security',1)>0;

SUBJECT EMAILFROM
----------------------------------------- --------------------------------------
lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]

Elapsed: 00:00:00.09

The last query does not include a sort string, so it is sorted by score. An abbreviated syntax for the sort string is ASC or DESC, which means sort by score ascending or descending; this short format is equivalent to using order by syntax with the lscore operator, for example:

SQL> SELECT lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security',1)>0;

LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------
.241440386 lucene injection
.22763218 Re: lucene injection
.199178159 Re: lucene injection
.140840232 Re: lucene injection
.140840232 Re: lucene injection

Elapsed: 00:00:00.10
SQL> SELECT lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security',1)>0 order by lscore(1) asc;

LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection

Elapsed: 00:00:00.11
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security','subject:DESC',1)>0;

LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------
.241440386 lucene injection
.22763218 Re: lucene injection
.199178159 Re: lucene injection
.140840232 Re: lucene injection
.140840232 Re: lucene injection

Elapsed: 00:00:00.07
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC',1)>0;

LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection

Elapsed: 00:00:00.07

The first example uses the default sort, by descending score. The second example uses order by syntax to override the default sort and change it to ascending score. The other ones are equivalent but use the lcontains sort argument string.

Note that if you are using the lcontains sort string, you have to add the DOMAIN_INDEX_SORT optimizer hint; this hint tells the Oracle optimizer that the order of the rows will be dictated by Lucene Domain Index.

Using lscore(anc_id) together with lcontains(column,query,sort_str,anc_id) makes no sense and adds overhead for the score computation that can be avoided. If you are querying Lucene Domain Index and want the result ordered by columns other than relevance, there is no reason to compute the score: avoid the lscore() function in the select list and the query will be faster. For example:

SQL> SELECT /*+ DOMAIN_INDEX_SORT BAD */ lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC',1)>0;

LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection

Elapsed: 00:00:00.07

SQL> SELECT /*+ DOMAIN_INDEX_SORT GOOD */ subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC')>0;

SUBJECT
--------------------------------------------------------------------------------
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
lucene injection

Elapsed: 00:00:00.02
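The sort string syntax used throughout this section, field[:ORDER[:TYPE]] with the ASC and DESC shortcuts for sorting by score, can be summarized with a small parser. The Python sketch below is an illustration of the documented syntax only: the function name is hypothetical, and representing a score sort as a key whose field is None is a choice made for this example.

```python
ORDERS = ("ASC", "DESC")
TYPES = ("string", "float", "int")


def parse_sort_spec(spec: str) -> list:
    """Parse a sort string into a list of (field, order, type) keys."""
    if spec in ORDERS:
        # Abbreviated form: sort by score; field None stands for the score.
        return [(None, spec, None)]
    keys = []
    for part in spec.split(","):
        pieces = part.strip().split(":")
        field = pieces[0]
        order = pieces[1] if len(pieces) > 1 else "ASC"    # default order
        kind = pieces[2] if len(pieces) > 2 else "string"  # default type
        if not field or len(pieces) > 3 or order not in ORDERS or kind not in TYPES:
            raise ValueError(f"invalid sort key: {part!r}")
        keys.append((field, order, kind))
    return keys


two_keys = parse_sort_spec("subject:ASC:string,emailFrom:DESC:string")
short_key = parse_sort_spec("subject:DESC")
```

Every field named in such a spec still has to be indexed as NOT_ANALYZED or NOT_ANALYZED_STORED, as noted earlier in this section; the parser does not check that.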

3.4.5 Count Hits Function

The count hits function is a Lucene Domain Index optimization that replaces SQL count(*) functionality. It is extremely fast because no rowid information is passed from the Lucene Data Cartridge to the Oracle engine to count matching rows. Here is an example:

SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits from dual;

HITS
----------
5

Elapsed: 00:00:00.02

The first argument of the count hits function is a string with the name of the Lucene Domain Index (IDX_NAME); the second argument is a Query Parser syntax string, the same as the second argument of the lcontains function. Optionally you can use a three-argument version of the countHits function to use an index in another schema: the first argument is the schema, the second argument is the index name and the last one is the Query Parser syntax string.

After a count hits function call you can run a select with the lcontains function. If the count hits query matches the lcontains query, lcontains will have cached information for returning the matching rowids. The following are some examples of count hits and its correlated query using cached results:

SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits from dual;

HITS
----------
5

Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
  2 where lcontains(bodytext,'security',1)>0;

EMAILFROM
--------------------------------------------------------------------------------
codeshepherd@[email protected]@[email protected]@danielnaber.de

Elapsed: 00:00:00.08
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') from dual;

LUCENEDOMAININDEX.COUNTHITS('EMAILBODYTEXT','SECURITY')
-------------------------------------------------------
5

Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
  2 where lcontains(bodytext,'security','emailFrom:ASC',1)>0;

EMAILFROM
--------------------------------------------------------------------------------
codeshepherd@[email protected]@[email protected]@danielnaber.de

Elapsed: 00:00:00.04

In both queries lcontains found a cached hits structure evaluated by the count hits function. Lucene Domain Index stores cached hits information and locates it using a key composed as sort_string(QueryParser.toString()), so both arguments of count hits and lcontains should match to re-use a cached hits structure. For the last query example the string emailFrom:(security) is used as the key.

3.4.6 First Rows Optimizer Hint

Starting with release 2.4.0.1.0, the deprecated Lucene Hits class has been replaced by the TopDocs class. If you use the FIRST_ROWS optimizer hint in conjunction with lcontains in-line pagination, Lucene Domain Index will call TopDocs to get only the first M hits. For example:

SQL> select /*+ FIRST_ROWS DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title')
     from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

FIRST_ROWS and rownum:[1 TO 10] tell Lucene Domain Index to perform a Lucene query for the first 10 hits only. A following query with rownum:[10 TO 20] will find most of the Lucene structures cached in memory, such as the Searcher and the ROWID<->Lucene DocID association, but it will re-query the Lucene index to get the first 20 hits (1..20). This cache-miss behavior could be seen as a bad solution, but it is extremely useful if 90% of queries only visit the first page of hits, which is the typical behavior of Internet search.

On the other hand, if you omit the FIRST_ROWS optimizer hint, Oracle switches by default to ALL_ROWS mode. In that mode, if you use pagination (rownum:[n TO m]) with m greater than 2000, Lucene Domain Index will fetch the first m hits, but if m is lower than 2000 it will fetch 2000 hits by default. The magic number 2000 comes from the Oracle ODCI API, which calls the ODCIFetch routine in batches of 2000 rowids.

If neither FIRST_ROWS nor in-line pagination is included in the query, Lucene Domain Index works in batches of 2000 hits, causing several cache misses in full-scan mode. For example, the query:

SQL> select count(*) from pages where lcontains(object_value,'musica tango rock')>0;

causes Lucene Domain Index to fetch the first 2000 hits; then, knowing that the hits length is 2736, it re-fetches (cache miss) all 2736 hits. Obviously, you can use the LuceneDomainIndex.countHits() function to count hits faster than the previous query.
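The fetch-size rules above can be summarized in a few lines of Python. This is only a sketch of the behavior as described in the text, not Lucene Domain Index code: the function names are hypothetical, and full_scan_fetches models just the single re-fetch shown in the 2736-hit example.

```python
from typing import List, Optional

ODCI_BATCH = 2000  # rowids per ODCIFetch batch, as stated in the text


def hits_requested(first_rows: bool, m: Optional[int]) -> int:
    """Number of hits requested from Lucene on the first fetch.

    first_rows -- True when the FIRST_ROWS hint is present
    m          -- upper bound of rownum:[n TO m], or None without pagination
    """
    if m is None:
        return ODCI_BATCH        # no pagination: work in batches of 2000
    if first_rows:
        return m                 # FIRST_ROWS: only the first m hits
    return max(m, ODCI_BATCH)    # ALL_ROWS: never less than one ODCI batch


def full_scan_fetches(total_hits: int) -> List[int]:
    """Fetch sizes for a query with neither hint nor pagination (assumed model)."""
    fetches = [ODCI_BATCH]
    if total_hits > ODCI_BATCH:
        fetches.append(total_hits)  # cache miss: re-fetch all hits
    return fetches


print(hits_requested(True, 10))     # FIRST_ROWS with rownum:[1 TO 10]
print(hits_requested(False, 10))    # ALL_ROWS with rownum:[1 TO 10]
print(full_scan_fetches(2736))      # the count(*) example
```

Under this model the count(*) query pays for two searches, which is why countHits() is the cheaper way to count.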

3.4.6 Highlighting

The lhighlight ancillary operator works like lscore but returns a VARCHAR2 text in which the words matched during the evaluation of the lcontains function are highlighted. By default the tag used to mark matching words is <B>, the fragment separator is "..." and the maximum number of fragments is 4. Starting with release 2.4.1.1.0 these values can be customized through the alter index ... parameters() DDL command. Highlighting example:

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
  2  FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

TXT
        SC SUBJECT
--------------------------------------------------------------------------------
On Dec 21, 2006, at 4:56 AM, Deepan wrote:
> I am bothered about <B>security</B> problems with lucene. Is it vulnerable to
> any kind of injection like <B>mysql</B> injection? many times the query from
> user is passed to lucene for search without validating.
Rest easy. There are no known <B>security</B> issues with Lucene, and it
has even undergone a recent static code analysis by Fortify (see the
lucene-dev e-mail list
 .27477634 Re: lucene injection

Highlighting only works with columns of type VARCHAR2, CLOB and XMLType. You can perform highlighting even if your master column is not indexed/stored, for example with an index created with:

create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:INFO;Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";FormatCols:revisionDate(day);IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

SQL> select /*+ DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title')
     from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

<page xmlns="http://www.mediawiki.org/xml/export-0.3/"><title><B>Música</B> de Argentina... [[Latinoamérica|latinoamericanos]] con más desarrollo en su [[<B>música</B>]].
Se encuentra una gran... argentinos, un instrumento tradicional andino]]
Aún se mantiene la <B>música</B> de los [[Indígenas_en_Argentina... de grandes corrientes de [[inmigración|inmigrantes]] europeos, la <B>música</B> argentina se enriqueció
Música de Argentina

musical emparentado con la [[habanera]] y el [[<B>tango</B> (<B>música</B>)|<B>tango</B>]].
==Diferencias con el <B>tango</B>==
Aunque tanto la milonga como el <B>tango</B> están en [[compás]] de 2/4, las 8 [[semicorchea]]s de la milonga están distribuidas en 3 + 3 + 2 en cambio el <B>tango</B> posee un ritmo más «cuadrado». Las letras...]] criticó en algún momento el <B>tango</B> y prefirió la milonga, que no trasmite la melancolía
Milonga (género musical)

The index creation DDL has IncludeMasterColumn:false, which means the whole XMLType representation of the Spanish Wikipedia page dump is not indexed; only the virtual columns title, comment, text and revisionDate are processed by Lucene. However, the TextHighlight Java function attached to the lhighlight operator receives the XMLType from the RDBMS engine, so it calls the Lucene Highlighter class with the whole XMLType object (note that page titles are in bold only to separate rows in the output).

Parameters supported by the highlighting functions are:

• Formatter: a valid class name which implements the Lucene Formatter interface and has a constructor with no arguments; default value org.apache.lucene.search.highlight.SimpleHTMLFormatter.

• MaxNumFragmentsRequired: number of text fragments returned by the highlight function; default value 4.

• FragmentSize: size of each fragment returned; default value 100.

• FragmentSeparator: string used as fragment separator; default is "...". Note that you cannot use ";" or ":" as fragment separator because they are used as parameter and value delimiters in the alter index ... parameters(..) DDL statement.

No customization is allowed by passing constructor arguments to the Formatter class, but you can easily create your own Formatter which calls SimpleHTMLFormatter with arguments. Your Formatter will look like:

create or replace and compile java source named "org.apache.lucene.search.highlight.MyHTMLFormatter" as
package org.apache.lucene.search.highlight;

public class MyHTMLFormatter extends SimpleHTMLFormatter {
    public MyHTMLFormatter() {
        super("<span class=\"myhighlightclass\">","</span>");
    }
}
/
show errors

alter index emailbodyText parameters('Formatter:org.apache.lucene.search.highlight.MyHTMLFormatter;MaxNumFragmentsRequired:3;FragmentSeparator:...;FragmentSize:50');
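To see why ";" and ":" cannot appear in FragmentSeparator, it helps to look at how a Name:value;Name:value string is typically split. The Python sketch below is an assumption about the general technique, not the actual Lucene Domain Index parser; parse_parameters is a hypothetical helper.

```python
def parse_parameters(params: str) -> dict:
    """Split 'Name:value;Name:value' into a dict (assumed parsing scheme).

    ';' separates parameters and the first ':' of each item separates
    the parameter name from its value.
    """
    result = {}
    for item in params.split(";"):
        if not item:
            continue  # skip empty items produced by consecutive ';'
        name, _, value = item.partition(":")
        result[name.strip()] = value
    return result


params = ("Formatter:org.apache.lucene.search.highlight.MyHTMLFormatter;"
          "MaxNumFragmentsRequired:3;FragmentSeparator:...;FragmentSize:50")
parsed = parse_parameters(params)
print(parsed["FragmentSeparator"], parsed["FragmentSize"])

# A ';' used as the separator value is consumed as a parameter delimiter,
# so the intended value never reaches the highlighter.
broken = parse_parameters("FragmentSeparator:;;FragmentSize:50")
print(repr(broken["FragmentSeparator"]))
```

With a scheme like this the delimiter characters can never be part of a value, which matches the restriction stated for FragmentSeparator.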

3.4.7 Highlighting using pipeline table functions

phighlight and rhighlight provide more general usage patterns for the Lucene highlighting functionality. phighlight receives an SQL query as a string and performs highlighting on a set of user-defined columns of the query result. rhighlight receives a SYS_REFCURSOR argument and performs highlighting on a set of user-defined query columns. Unlike phighlight, rhighlight requires that the user define a return type for the query, usually a TABLE OF collection, because with a SYS_REFCURSOR argument there is no way to know the return type of the query at compilation time.

Both phighlight and rhighlight support the highlighting parameters defined in create index or alter index DDL statements; see section 3.4.6 for more information.

Here are two examples of highlighting using pipeline table functions; table emails is the example table/index of section 3.1.4:

SELECT * FROM
TABLE(phighlight(
    'EMAILBODYTEXT',
    'lucene OR mysql',
    'SUBJECT,BODYTEXT',
    'select /*+ DOMAIN_INDEX_SORT FIRST_ROWS */ lscore(1) sc,e.*
     from eMails e where lcontains(bodytext,''security OR mysql'',''subject:ASC'',1)>0'
));

SELECT * FROM
TABLE(rhighlight(
    'EMAILBODYTEXT',
    'lucene OR mysql',
    'SUBJECT,BODYTEXT',
    'EMAILRSET',
    CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROWS */ lscore(1) sc,e.*
           from eMails e where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0)
));

The first three arguments of both pipeline functions are the same: the Lucene Domain Index used, the Lucene Query Syntax argument (which should match the lcontains argument) and the columns of the query to be highlighted.

The last argument of phighlight is a VARCHAR2 with the SQL query to be executed by the DBMS_SQL package; note the doubled single quotes used as the escape sequence to encode SQL single-quote characters.

rhighlight requires two more arguments. The first is the type returned by the cursor, in this example EMAILRSET, defined as:

CREATE TYPE EMAILR AS OBJECT
(
    sc NUMBER,
    emailFrom VARCHAR2(256),
    emailTo VARCHAR2(256),
    subject VARCHAR2(4000),
    emailDate DATE,
    bodyText CLOB
);

CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;

Note that EMAILR is a record which holds all the columns of table EMAILS plus the score returned by the lscore() function, and EMAILRSET is a simple TABLE OF EMAILR collection type, which is the required type for the CURSOR value. The last argument is of CURSOR type, which means any SQL query.

3.4.8 More like this functionality

The Lucene More Like This functionality is wrapped by the function:

MoreLike.this(index_name IN VARCHAR2,
              x IN ROWID,
              f IN NUMBER DEFAULT 1,
              t IN NUMBER DEFAULT 10,
              minTermFreq IN NUMBER DEFAULT 2,
              minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist

where index_name could be an owner,index_name pair as in other Lucene Domain Index procedures. A typical use case is:

select rowid,lscore(1),text from test_source_big where lcontains(text,'"procedure java"~10',1)>0 order by lscore(1) desc;
AAAOaPAAEAAAAnnABV 1.00000003 procedure (C, Java or PL/SQL), optionally qualified
AAAOaPAAEAAAA0aAAV .84852819 STATIC PROCEDURE refreshParameterCache as LANGUAGE JAVA NAME
.........

declare
    ridlist sys.odciridlist;
begin
    ridlist := MoreLike.this(index_name=>'SOURCE_BIG_LIDX',x=>'AAAOaPAAEAAAAnnABV',minTermFreq=>1);
    FOR i IN (select rowid,text from test_source_big where rowid in (select * from table(ridlist_table(ridlist)))) LOOP
        dbms_output.put_line('rowid: '||i.rowid||' text: '||i.text);
    END LOOP;
end;
/
rowid: AAAOaPAAEAAAAhLAAc text: -- after issuing insert, update, delete or anonymous PL/SQL calls
rowid: AAAOaPAAEAAAAjrAAo text: -- QUALIFIED_SQL_NAME
rowid: AAAOaPAAEAAAAk5AAe text: -- ORA-06502: PL/SQL: numeric or value error: character string buffer
.....
rowid: AAAOaPAAEAAAAtXAAb text: -- The name of the Java class, PL/SQL package or object type implementing

Note that the anonymous PL/SQL block takes the first ROWID returned by the first query as pivot, then expands the result set with other rows which also include terms from "procedure (C, Java or PL/SQL), optionally qualified". "C" is not taken into account because it was eliminated as a stop word. Refer to Appendix D.6 for a full explanation of each parameter.
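The role of minTermFreq and minDocFreq can be illustrated with a small Python sketch of "more like this" term selection: terms of the pivot row are kept only if they occur often enough in the row and in the index, then ranked by a tf*idf weight. This is a simplified model under stated assumptions, not the Lucene implementation; interesting_terms is a hypothetical helper and the document frequencies below are made up for illustration.

```python
import math
from collections import Counter


def interesting_terms(pivot_tokens, doc_freq, num_docs,
                      min_term_freq=2, min_doc_freq=5,
                      stop_words=frozenset(), max_terms=25):
    """Pick the terms of the pivot row that would drive a similarity query."""
    term_freq = Counter(t for t in pivot_tokens if t not in stop_words)
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq or df < min_doc_freq:
            continue  # too rare in the pivot row or in the whole index
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


tokens = "procedure c java or pl sql optionally qualified".split()
# Hypothetical document frequencies, for illustration only.
doc_freq = {"procedure": 5116, "java": 300, "pl": 900, "sql": 950,
            "optionally": 40, "qualified": 12}
stops = {"c", "or"}

terms = interesting_terms(tokens, doc_freq, num_docs=100000,
                          min_term_freq=1, stop_words=stops)
print(terms)
print(interesting_terms(tokens, doc_freq, num_docs=100000, stop_words=stops))
```

In this model every term of the short pivot row occurs only once, so the default minTermFreq of 2 would select nothing; that is consistent with the example above passing minTermFreq=>1.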

3.4.9 Facets

Starting with Lucene Domain Index 2.4.1.1.0, Lucene Facets functionality is available through an SQL aggregate function, lfacets():

lfacets(index_name_and_categories IN VARCHAR2) RETURN LUCENE.agg_tbl

where index_name_and_categories is an encoded string with the Lucene index name and the categories. An aggregate function only accepts one scalar value as argument, so the index and the categories must be encoded in a comma-separated list, for example:

SQL> select lfacets('SOURCE_BIG_LIDX,TEXT:procedure,TEXT:java') from dual;

This example uses the index created in section 2.5, Testing Lucene Domain Index. The index name can use the SCHEMA.IDX_NAME syntax; there can be one or two categories, expressed in Lucene Query Syntax. In the above example TEXT is the index column, procedure is the main category and java is the sub-category.

Creating a table with categories and linking the rows to their parent is one option for generating facets automatically, for example:

create table source_categories (
    cat_code number(4),
    cat_name varchar2(256),
    cat_parent number(4),
    CONSTRAINT PK_SOURCE_CATEGORIES PRIMARY KEY (cat_code),
    CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
        REFERENCES source_categories (cat_code));
insert into source_categories values (1,'TEXT:procedure',null);
insert into source_categories values (2,'TEXT:function',null);
...
insert into source_categories values (6,'TEXT:java',1);
insert into source_categories values (7,'TEXT:(pl sql)',1);
insert into source_categories values (8,'TEXT:wrapped',1);
...
insert into source_categories values (21,'line:[1 TO 1000]',1);
insert into source_categories values (22,'line:[1001 TO 2000]',1);
insert into source_categories values (23,'line:[2001 TO 3000]',1);

Now we can query the above table, calling lfacets with the category and sub-category:

SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||
         case level when 1 then cat_name
         ELSE PRIOR cat_name||','|| cat_name
         END)), cat_code,level
     FROM source_categories
     start with cat_parent is null
     CONNECT BY PRIOR cat_code = cat_parent
     group by cat_code,level;

LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASELEVELWHEN1THENCAT_NAMEELSEPRIORCAT_NAME||','||CAT_NAMEEND))   CAT_CODE  LEVEL
TEXT:procedure(5116)                                1      1
TEXT:function(5574)                                 2      1
TEXT:trigger(96)                                    3      1
TEXT:package(860)                                   4      1
TEXT:(object type)(5140)                            5      1
TEXT:procedure,TEXT:java(9)                         6      2
.....
TEXT:procedure,line:[1 TO 1000](3)                 21      2
TEXT:procedure,line:[1001 TO 2000](615)            22      2
...

SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||
         case level when 1 then cat_name
         ELSE PRIOR cat_name||','|| cat_name
         END)), cat_parent
     FROM source_categories
     start with cat_parent is null
     CONNECT BY PRIOR cat_code = cat_parent
     group by cat_parent;

LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASELEVELWHEN1THENCAT_NAMEELSEPRIORCAT_NAME||','||CAT_NAMEEND))   CAT_PARENT
------------------------------------------------------------------------------------------------- ----------
TEXT:procedure,TEXT:java(11),TEXT:procedure,TEXT:(pl sql)(70),TEXT:procedure,line:[1 TO 1000](3),TEXT:procedure,TEXT:wrapped(21),TEXT:procedure,line:[1001 TO 2000](675),TEXT:procedure,line:[3001 TO 4000](105),TEXT:procedure,line:[4001 TO 5000](10),TEXT:procedure,line:[2001 TO 3000](199)
1

TEXT:function,TEXT:java(22),TEXT:function,TEXT:wrapped(85),TEXT:function,line:[1 TO 1000](0),TEXT:function,TEXT:(pl sql)(87),TEXT:function,line:[1001 TO 2000](835),TEXT:function,line:[3001 TO 4000](21),TEXT:function,line:[4001 TO 5000](0),TEXT:function,line:[2001 TO 3000](338)
2

TEXT:trigger,TEXT:java(1),TEXT:trigger,line:[1 TO 1000](0),TEXT:trigger,TEXT:wrapped(0),TEXT:trigger,TEXT:(pl sql)(1),TEXT:trigger,line:[1001 TO 2000](33),TEXT:trigger,line:[3001 TO 4000](0),TEXT:trigger,line:[4001 TO 5000](0),TEXT:trigger,line:[2001 TO 3000](0)
3

TEXT:package,TEXT:java(7),TEXT:package,line:[1 TO 1000](0),TEXT:package,TEXT:(pl sql)(25),TEXT:package,TEXT:wrapped(137),TEXT:package,line:[1001 TO 2000](54),TEXT:package,line:[3001 TO 4000](5),TEXT:package,line:[4001 TO 5000](0),TEXT:package,line:[2001 TO 3000](5)
4

TEXT:(object type),TEXT:java(56),TEXT:(object type),TEXT:(pl sql)(106),TEXT:(object type),line:[1 TO 1000](1),TEXT:(object type),TEXT:wrapped(76),TEXT:(object type),line:[1001 TO 2000](441),TEXT:(object type),line:[4001 TO 5000](0),TEXT:(object type),line:[3001 TO 4000](28),TEXT:(object type),line:[2001 TO 3000](119)
5

TEXT:procedure(5574),TEXT:(object type)(5584),TEXT:package(868),TEXT:trigger(114),TEXT:function(6167)


6 rows selected.

Note that we are using the ljoin() function, which converts the agg_tbl type to a comma-separated string of categories, each followed by its cardinality. In the first query, the first row has no sub-category because its parent column is null, so 5116 is the number of rows which include the text procedure. The last row shown includes a category and a sub-category: TEXT:procedure,line:[1001 TO 2000] implies the bitwise AND intersection between the set of rows which include procedure and the set of rows which match line:[1001 TO 2000]. The group by cat_code causes the Oracle ODCI API to first calculate the bit set for procedure and then iterate over all its sub-categories (java, pl sql, wrapped), doing the bitwise AND intersections. This is fast, and once a facet is computed it is stored as a Filter in the Lucene Domain Index memory structures.

When the number of rows or the number of categories is big, we can use a materialized view as a cache of the facets computation. For example:

CREATE MATERIALIZED VIEW source_facets
AS
select ljoin(lfacets('SOURCE_BIG_LIDX,'||
         case level when 1 then cat_name
         ELSE PRIOR cat_name||','|| cat_name
         END
       )), cat_code,level
FROM source_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_code,level;

Now the source_facets materialized view can be queried like any other table and access to it will be very fast. The materialized view can then be refreshed by the application at a specific point in time.
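The bitwise AND intersection behind a category/sub-category count, and the caching of each category's bit set, can be sketched in Python as follows. This is an illustrative model only: FacetCounter, parse_facet_argument and the query_bits callable are hypothetical names, and Python integers stand in for Lucene bit sets.

```python
def parse_facet_argument(arg):
    """Split 'INDEX,category[,sub_category]' into index name and categories."""
    index_name, *categories = arg.split(",")
    return index_name, categories


class FacetCounter:
    """Counts rows matching all given categories, caching one bit set per category."""

    def __init__(self, query_bits):
        # query_bits: hypothetical callable mapping a Lucene query string to an
        # int whose set bits are the matching document numbers.
        self._query_bits = query_bits
        self._cache = {}

    def bits(self, category):
        if category not in self._cache:       # compute once, then reuse
            self._cache[category] = self._query_bits(category)
        return self._cache[category]

    def count(self, categories):
        result = self.bits(categories[0])
        for category in categories[1:]:
            result &= self.bits(category)     # bitwise AND intersection
        return bin(result).count("1")         # cardinality of the facet


# Toy index: four documents, bit i set when document i matches the query.
toy_index = {"TEXT:procedure": 0b1011, "TEXT:java": 0b0110}
calls = []

def run_query(query):
    calls.append(query)
    return toy_index[query]

index_name, categories = parse_facet_argument(
    "SOURCE_BIG_LIDX,TEXT:procedure,TEXT:java")
counter = FacetCounter(run_query)
print(counter.count(["TEXT:procedure"]))   # main category alone
print(counter.count(categories))           # category AND sub-category
print(len(calls))                          # each category was queried once
```

Because the parent's bit set is cached, iterating over many sub-categories costs one Lucene query per new category plus a cheap AND per facet.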

3.4.10 Terms pipeline table functions

Starting with Lucene Domain Index 2.9.1.1.0, two pipeline table functions have been included to iterate over the terms of the Lucene index structure. The first, high_freq_terms():

FUNCTION high_freq_terms(index_name VARCHAR2,
                         term_name VARCHAR2,
                         num_terms NUMBER) RETURN term_info_set

is available for getting the Top-N (num_terms) most used terms in the whole index or in a particular field. term_info_set is defined as:

TYPE term_info AS OBJECT (
    term VARCHAR2(4000),
    docFreq NUMBER(10)
);
TYPE term_info_set AS TABLE OF term_info;

You can query your index by using:

select * from table(high_freq_terms('SOURCE_BIG_LIDX','TEXT',10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX','line',100));

The second, index_terms():

FUNCTION index_terms(index_name VARCHAR2,
                     term_name VARCHAR2) RETURN term_info_set

can be queried by using:

select * from table(index_terms('SOURCE_BIG_LIDX','TEXT')) order by docFreq desc;
select * from table(index_terms('SOURCE_BIG_LIDX','TEXT'));
select * from table(index_terms('SOURCE_BIG_LIDX',null)) where rownum<10;
select * from (select * from table(index_terms('SOURCE_BIG_LIDX','line')) order by docFreq desc) where rownum<=10;

For both functions, if the term_name argument is NULL, the function iterates over all index terms. The natural order of high_freq_terms() is descending by docFreq, while index_terms() is ordered by term_name:term_value ascending. Note that if you pass a non-NULL term name, index_terms() starts at the first value of that term but does not stop when all the values of that term have been returned; this behavior is similar to the Lucene Java method reader.terms(new Term(term)). Here is an example if you want to iterate only over a specific term name:

BEGIN
    FOR term_rec IN (SELECT * FROM table(index_terms('SOURCE_BIG_LIDX','line'))) LOOP
        /* Fetch from cursor variable. */
        EXIT WHEN substr(term_rec.term,1,length('line'))<>'line'; -- exit when last row is fetched
        -- process data record
        dbms_output.put_line('Name = ' || term_rec.term || ' ' || term_rec.docFreq);
    END LOOP;
END;
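The reason the loop above needs an explicit EXIT can be shown with a small Python model of a term dictionary sorted by term_name:term_value. This is a sketch of the seek-and-continue behavior described in the text, not LDI code; the term list is made up for illustration and the function names merely mirror the SQL ones.

```python
from bisect import bisect_left


def index_terms(sorted_terms, term_name=None):
    """Yield (term, doc_freq) pairs starting at the first term of term_name.

    Like a seek on a sorted term dictionary, iteration does NOT stop when
    the requested field is exhausted.
    """
    start = 0
    if term_name is not None:
        start = bisect_left(sorted_terms, (term_name + ":",))
    yield from sorted_terms[start:]


def terms_of_field(sorted_terms, term_name):
    """Same iteration, but stop at the first term of another field."""
    prefix = term_name + ":"
    for term, doc_freq in index_terms(sorted_terms, term_name):
        if not term.startswith(prefix):
            break
        yield term, doc_freq


# Hypothetical term dictionary, already sorted by "field:value".
terms = [("TEXT:in", 24952), ("TEXT:return", 6241),
         ("line:0001", 3), ("line:0002", 5), ("title:foo", 1)]

print(list(index_terms(terms, "line")))     # runs on into the next field
print(list(terms_of_field(terms, "line")))  # only the 'line' terms
```

The second function plays the role of the EXIT WHEN test in the PL/SQL block.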

You can use index_terms() to get the Top-N terms ordered by docFreq, for example:

SQL> select * from (select * from table(index_terms('SOURCE_BIG_LIDX',null)) order by docFreq desc) where rownum<=10;
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241
Elapsed: 00:00:06.09

SQL> select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241
Elapsed: 00:00:00.02

The two queries are semantically equivalent, but high_freq_terms() is more efficient because it uses a TermInfoQueue structure for sorting, caches its computation once it is executed and does not create a lot of term_info objects which are then sorted by the RDBMS engine.
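The difference between the two approaches is the classic one between a bounded priority queue and a full sort. The Python sketch below illustrates that general technique with heapq; it is not the TermInfoQueue implementation, and apart from the three frequencies quoted above the sample terms are made up.

```python
import heapq


def high_freq_terms(terms, num_terms):
    """Top-N by doc_freq with a bounded heap: O(len(terms) * log N)."""
    return heapq.nlargest(num_terms, terms, key=lambda t: t[1])


def top_by_full_sort(terms, num_terms):
    """Top-N by sorting every term first: O(len(terms) * log len(terms))."""
    return sorted(terms, key=lambda t: t[1], reverse=True)[:num_terms]


sample = [("TEXT:return", 6241), ("TEXT:in", 24952), ("line:0001", 3),
          ("TEXT:varchar", 16996), ("TEXT:wrapped", 137)]

print(high_freq_terms(sample, 3))
print(top_by_full_sort(sample, 3))
```

Both functions return the same rows; the heap version only ever keeps N candidates in memory, which matters when the index holds many distinct terms.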

3.4.11 Did You Mean functionality

Starting with Lucene Domain Index 2.9.2.1.0, the Lucene Did You Mean functionality was added as an extended LDI property, using the Lucene SpellChecker library to create the dictionary index from the main index. Finally, the dictionary index is merged into the main index.

PROCEDURE indexDictionary(index_name IN VARCHAR2,
                          spellColumns IN VARCHAR2 DEFAULT null,
                          distancealg IN VARCHAR2 DEFAULT 'Levenstein')

is available to create the dictionary index to be merged with the main index.

You can create the dictionary by using:

SQL> call didyoumean.indexdictionary('SOURCE_BIG_LIDX');
Call completed.
Elapsed: 00:01:11.61

SQL> exec didyoumean.indexdictionary('EMAILBODYTEXT','BODYTEXT,subject,emailFrom,emailTo','NGram');
PL/SQL procedure successfully completed.
Elapsed: 00:00:01.62

Only index_name is mandatory. If the spellColumns parameter is NULL, the master column of the main index is used. By default the Levenshtein distance algorithm (a.k.a. edit distance, parameter value 'Levenstein') is applied; the other options are Jaro (Jaro-Winkler metric) and NGram distance.

Note: The dictionary structure creates the "word", "gramN", "startN" and "endN" Lucene fields, so be careful if you have these fields in the main index. For a 3-4 gram, the structure of this index is:

Index Structure    Example
word               kings
gram3              kin, ing, ngs
gram4              king, ings
start3             kin
start4             king
end3               ngs
end4               ings
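The field layout in the table can be reproduced with a few lines of Python. This is a sketch of how such n-gram fields are derived from a word, assuming the 3-4 gram range of the example; dictionary_fields is a hypothetical helper, not part of the didyoumean package.

```python
def dictionary_fields(word, min_n=3, max_n=4):
    """Build the word/gramN/startN/endN fields for one dictionary entry."""
    fields = {"word": word}
    for n in range(min_n, max_n + 1):
        if len(word) < n:
            continue  # word too short to have any n-gram of this size
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        fields["gram%d" % n] = grams
        fields["start%d" % n] = grams[0]   # n-gram at the start of the word
        fields["end%d" % n] = grams[-1]    # n-gram at the end of the word
    return fields


fields = dictionary_fields("kings")
for name in ("word", "gram3", "gram4", "start3", "start4", "end3", "end4"):
    print(name, fields[name])
```

Since these field names are fixed, a main index that already has a field called word, gram3 and so on would collide with the dictionary entries, which is the reason for the note above.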

The second routine, suggest():

FUNCTION suggest(
    index_name IN VARCHAR2,
    cmpval IN VARCHAR2,
    highlight IN VARCHAR2 DEFAULT null,
    distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2

is available to query the dictionary index. You can query the dictionary by using:

SQL> select didyoumean.suggest('SOURCE_BIG_LIDX','sorce') suggestion from dual;

SUGGESTION
--------------------------------------------------------------------------------
source
Elapsed: 00:00:00.31

SQL> select didyoumean.suggest('SOURCE_BIG_LIDX','sorce','b') suggestion from dual;

SUGGESTION
--------------------------------------------------------------------------------
<b>source</b>
Elapsed: 00:00:00.09

SQL> select didyoumean.suggest('SOURCE_BIG_LIDX','sorce','b','Jaro') suggestion from dual;

SUGGESTION
--------------------------------------------------------------------------------
<b>source</b>
Elapsed: 00:00:00.07

SQL> select didyoumean.suggest('EMAILBODYTEXT','lucene searhc','i') suggestion from dual;

SUGGESTION
--------------------------------------------------------------------------------
lucene <i>search</i>
Elapsed: 00:00:00.06

SQL> select didyoumean.suggest('EMAILBODYTEXT','lucine injetion','b','Levenstein') suggestion from dual;

SUGGESTION
--------------------------------------------------------------------------------
<b>lucene</b> <b>injection</b>
Elapsed: 00:00:00.06

The index_name parameter and the word to respell (cmpval) are mandatory. Optionally, you can define the highlight tag to be used (e.g. b for bold, i for italic, etc.) and the distance algorithm to apply.
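To make the default distance algorithm concrete, here is a minimal Python sketch of edit-distance based suggestion. It is a brute-force model for illustration, assuming a small in-memory word list; the real SpellChecker narrows candidates through the n-gram dictionary index first, and the suggest function below is only a stand-in for didyoumean.suggest.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(text, dictionary, highlight=None):
    """Replace each unknown word by its closest dictionary word."""
    out = []
    for word in text.split():
        if word in dictionary:
            out.append(word)          # known word: leave untouched
            continue
        best = min(sorted(dictionary), key=lambda cand: levenshtein(word, cand))
        if highlight:
            best = "<%s>%s</%s>" % (highlight, best, highlight)
        out.append(best)
    return " ".join(out)


# Hypothetical dictionary contents, for illustration only.
words = {"source", "search", "lucene", "injection"}
print(suggest("sorce", words))
print(suggest("lucene searhc", words, "i"))
print(suggest("lucine injetion", words, "b"))
```

Only the misspelled words are wrapped in the highlight tag, which mirrors the lucene <i>search</i> output shown above.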

3.5 Synchronize

When working with SyncMode:Deferred you have to synchronize your index manually, which means updating the Lucene Domain Index structure by applying pending changes such as inserts and updates. Delete operations are always applied, because the ODCI API does not accept rowids of deleted rows. Here is an example:

begin
    LuceneDomainIndex.sync('IT1');
    commit; -- release locks
end;

The LuceneDomainIndex.sync procedure requires an argument of type VARCHAR2 with the index object name. Index object names are usually in upper case and have the syntax SCHEMA_OWNER.IDX_NAME.

The synchronize operation can raise an exception if some rows being indexed are locked for update; in that case you have to release the locked rows first and then re-sync the index.

An exclusive lock on the Lucene index storage is obtained during index synchronization, so you have to commit or roll back immediately after this operation to release the exclusive lock.

Since Lucene Domain Index 2.4.0.1.0 you can use LuceneDomainIndex.sync('IT1') or LuceneDomainIndex.sync(USER,'IT1'); both procedures are equivalent.

Note: Due to a limitation of the SYS.ODCIRidList() array you cannot enqueue more than 32767 additions or deletions; an update is counted as one deletion plus one addition by the Lucene implementation code. This limitation will be removed in future releases of Lucene Domain Index.

3.6 Optimize

Optionally, you can optimize the Lucene index storage. To do that, execute:

begin
    LuceneDomainIndex.optimize('IT1');
    commit; -- release locks
end;

Like the sync operation, this procedure gets an exclusive lock on the Lucene index storage table and performs an optimization of the Lucene index, for example merging multiple segments into a new one. You can still perform select (read-only) operations using Lucene Domain Index during optimization; Oracle's concurrency system (multiversion read consistency) provides this functionality. Once you commit, any other concurrent session will automatically see the index changes.

3.6 XMLDB Export

You can perform an XMLDB export operation for your Lucene Domain Index. This operation provides an easy way to make the Lucene Domain Index information available as Lucene files. Once the operation is done you can get all the files from your database using a WebDAV explorer or FTP. For example:

begin
    LuceneDomainIndex.xdbExport('IT1');
    commit; -- makes change visible to Ftp or WebDAV
end;

Your index will be visible at, for example, the /public/lucene/SCOTT.IT1 directory. Once you copy these files to the file system you can open them with any Lucene-compatible application like Luke.

[Figures: screenshots of Luke using the exported Lucene Domain Index information]

3.7 Exporting/Importing functional index with exp/imp Oracle tools

You can perform an Oracle exp operation for your Lucene Domain Index. By default the Oracle exp tool exports the functional indexes of every table being exported during the backup process. As mentioned earlier, Lucene Domain Index creates a table named IDX_NAME$T which holds the Lucene file storage as BLOBs; a DBMS AQ queue is also created at index creation time, and this queue is associated with a table IDX_NAME$QT. Both tables are flagged as SECONDARY, which means that you cannot export these tables alone, but they are automatically included when the Lucene Domain Index is included in the export.

During the import operation Oracle re-creates the index using a create index ... parameters('your lucene parameters') DDL statement. All Lucene Domain Index parameters are included except the parameter PopulateIndex, which is always stored as false in Oracle's system views. This parameter is altered intentionally by Lucene Domain Index because, if it were set to true, during the import operation Lucene Domain Index would try to re-create the Lucene index structure instead of using the information restored into the IDX_NAME$T table.

As an alternative to XMLDB export or the Oracle exp tool, you can also export your Lucene Domain Index storage using a create table as ... DDL statement. For example:

SQL> create table SOURCE_BIG_LIDX$T$BK as (select * from SOURCE_BIG_LIDX$T);

... you can now export SOURCE_BIG_LIDX$T$BK using the exp tool because it is a regular table ...

-bash-3.2$ exp

Export: Release 10.2.0.3.0 - Production on Fri Mar 27 02:46:18 2009

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Username: scott/tiger

Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production
Enter array fetch buffer size: 4096 >

Export file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp

(2)U(sers), or (3)T(ables): (2)U > 3

Export table data (yes/no): yes > yes

Compress extents (yes/no): yes >

Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)

About to export specified tables via Conventional Path ...
Table(T) or Partition(T:P) to be exported: (RETURN to quit) > SOURCE_BIG_LIDX$T$BK

. . exporting table SOURCE_BIG_LIDX$T$BK 19 rows exported
Table(T) or Partition(T:P) to be exported: (RETURN to quit) >

Export terminated successfully without warnings.

.... Now you can drop your index and re-create it without populating it ....

SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

  COUNT(*)
----------
      6167

SQL> drop index SOURCE_BIG_LIDX;

Index dropped.

SQL> create index source_big_lidx on test_source_big(text)
  2  indextype is lucene.LuceneIndex
  3  parameters('PopulateIndex:false;AutoTuneMemory:true;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;FormatCols:line(0000);ExtraCols:line"line"');

Index created.

SQL> drop table SOURCE_BIG_LIDX$T$BK;

Table dropped.

.... Restore your .dmp now and check again that your index returns a correct result ....

-bash-3.2$ imp scott/tiger

Import: Release 10.2.0.3.0 - Production on Fri Mar 27 02:49:40 2009

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production

Import file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp

Enter insert buffer size (minimum is 8192) 30720>

Export file created by EXPORT:V10.02.01 via conventional path
import done in US7ASCII character set and AL16UTF16 NCHAR character set
import server uses AL32UTF8 character set (possible charset conversion)
List contents of import file only (yes/no): no >

Ignore create error due to object existence (yes/no): no >

Import grants (yes/no): yes >

Import table data (yes/no): yes >

Import entire export file (yes/no): no > yes

. importing SCOTT's objects into SCOTT

. importing SCOTT's objects into SCOTT

. . importing table "SOURCE_BIG_LIDX$T$BK" 19 rows imported
Import terminated successfully without warnings.

.... First check that your index has no information, then populate it with the Lucene index information ....

SQL> conn scott/tiger
Connected.
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

  COUNT(*)
----------
         0

SQL> truncate table SOURCE_BIG_LIDX$T;

Table truncated.

SQL> insert into SOURCE_BIG_LIDX$T (select * from SOURCE_BIG_LIDX$T$BK);

19 rows created.
SQL> exit

.... and connect again to refresh the Lucene Domain Index in-memory structures ....

SQL> conn scott/tiger
Connected.
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

  COUNT(*)
----------
      6167

As you can see, the Lucene Domain Index structure can be exported alone, without exporting the master table. This is useful when you are upgrading Lucene Domain Index, which requires that all indexes be dropped first, and you do not want to re-create a very big index.

4. Locking and Performance

4.1 Lock used by Lucene Domain Index

Operation            Base Table     Index Table                    Queue Table
                     (row/table)    (SCHEMA.IDX$T)                 (SCHEMA.IDX$QT)
Insert               X/RX (1)       NONE                           NONE
Update               X/RX           NONE                           NONE
Delete               X/RX           X/RS(updateCount)|X/RX (3)     NONE
Manually Sync        X/RS (2)       X/RS(updateCount)|X/RX         DBMS_AQ.BLOCKED (4)
Automatically Sync   X/RS           X/RS(updateCount)|X/RX         DBMS_AQ.BLOCKED
Optimize             NONE           X/RS(updateCount)|X/RX         NONE

1. X = Row exclusive lock on the row being inserted, RX = Table row exclusive lock.
2. X = Row exclusive lock on the row being indexed, RS = Table row share lock. A select ... for update nowait is performed on all rows being added to the Lucene index.
3. X/RS is taken on the row where name='updateCount'; this is the writer lock semaphore of the Lucene index and it serializes write operations. X/RX is taken on many rows of this table because Lucene is creating and deleting many files.
4. To perform massive dequeue operations on the DBMS AQ queue, Sync scans this queue with the DBMS_AQ.BLOCKED option.

4.2 Performance tips

4.2.1 Index Writer parameters

The Lucene IndexWriter class uses several parameters to control the index structure. Lucene Domain Index passes several of them to the IndexWriter, such as MergeFactor and MaxBufferedDocs, among others.

As a best practice, if you want to index thousands of rows you can override the default Lucene parameters with others which speed up indexing. With create index or alter index rebuild you can set MergeFactor to 100 and MaxBufferedDocs to 4000.

These parameters increase indexing performance, but afterwards DML operations on the base table will batch small sets of rows, so after the DDL commands change MergeFactor to 2 and MaxBufferedDocs to 100. A good place to start learning how these parameters behave is the Wiki page Improving Indexing Speed.
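A rough way to see the effect of MaxBufferedDocs is to count how many times the in-memory buffer has to be flushed while a table is indexed. The Python arithmetic below is a simplification that assumes one Lucene document per row and a flush exactly every MaxBufferedDocs documents; flush_count is a hypothetical helper and the row count is made up.

```python
import math


def flush_count(rows, max_buffered_docs):
    """Buffer flushes needed to index `rows` rows (simplified model)."""
    return math.ceil(rows / max_buffered_docs)


rows = 100000                    # hypothetical table size
print(flush_count(rows, 4000))   # bulk settings used for create index / rebuild
print(flush_count(rows, 100))    # settings recommended afterwards for DML/sync
```

Fewer flushes mean fewer small segments to write and merge during the bulk load, while the smaller value suits the small batches produced by regular DML.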

4.2.2 Auto Tune Memory functionality

Lucene Domain Index has a parameter called AutoTuneMemory. A true value means that for IndexWriter operations it will try to use up to 90% of the Java Pool Size configured in the Oracle SGA to adjust how many documents are buffered (MaxBufferedDocs) before calling IndexWriter.flush().

With AutoTuneMemory:true, MaxBufferedDocs is not required; it is calculated using the free RAM in the SGA, but you have to set MergeFactor.

Because Java Pool Size is a global parameter, the rule is not valid if you want to create many indexes with parallel connections: two connections will each try to use 90% of the pool, so one of them will run out of memory.
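The parallel-connection problem is simple arithmetic, sketched below in Python. The 90% figure comes from the text; the helper names and the 100 MB pool size are assumptions for illustration.

```python
AUTO_TUNE_PERCENT = 90  # share of the Java pool targeted per session (from the text)


def memory_budget(java_pool_bytes):
    """Bytes one auto-tuned session will try to use."""
    return java_pool_bytes * AUTO_TUNE_PERCENT // 100


def oversubscribed(java_pool_bytes, sessions):
    """True when concurrent auto-tuned sessions ask for more than the pool holds."""
    return sessions * memory_budget(java_pool_bytes) > java_pool_bytes


java_pool = 100 * 1024 * 1024            # hypothetical 100 MB Java pool
print(memory_budget(java_pool))
print(oversubscribed(java_pool, 1))      # a single index build fits
print(oversubscribed(java_pool, 2))      # two parallel builds do not
```

When several indexes must be built in parallel, one option consistent with the text is to turn AutoTuneMemory off and set MaxBufferedDocs explicitly.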


4.2.3 Keep Index on RAM

OJVMDirectory replaces Lucene file system storage with table storage using BLOBs. For every Lucene Domain Index created there is a new table which stores every Lucene file as a row with a BLOB column (see section 6 for more detail). Using a strategy similar to Oracle Text, you can keep this table in RAM. Unlike Oracle Text, which uses multiple tables for storing the inverted index, Lucene Domain Index uses one table. Execute these DDL commands to keep the Lucene Index in RAM:

create index source_small_lidx on test_source_small(text)
indextype is lucene.LuceneIndex
parameters('FormatCols:line(0000);ExtraCols:line"line";Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:100');

alter index source_small_lidx parameters('MergeFactor:2');

alter table source_small_lidx$t storage (buffer_pool keep) modify lob (data) (storage(buffer_pool keep));

During index creation use AutoTuneMemory:true (the default value) and a high MergeFactor, because many rows will be indexed at this time. Then change MergeFactor to 2 so the index works better after each DML/sync operation. Finally, change the OJVMDirectory storage table and LOB to keep them in RAM.

Be sure that your SGA has enough RAM to hold it. To find out how big your index is, you can query the table:

SQL> select sum(file_size) from source_small_lidx$t where deleted='N';

SUM(FILE_SIZE)
--------------
        147444

Finally, as Tom Kyte says: tkprof, tkprof, .... ;)

You can see Lucene Domain Index I/O operations with "alter session set events '10046 trace name context forever, level 12';" and then find the operations on the Lucene Domain Index table SCHEMA.IDX_NAME$T. Using the TKPROF information you can alter the table and LOB storage parameters manually.

4.2.4 Compare your execution plan

To be sure that your Lucene Domain Index is properly used, compare your execution plans and try to avoid unnecessary filter or sort (order by) steps by using in-line sort or multiple-field Query Parser conditions.

Here are examples of sorting using the emails table created in section 3.1.4:

SQL> explain plan for
  2  SELECT subject FROM emails where lcontains(bodytext,'security',1)>0
  3  order by subject ASC;

Explained.

Elapsed: 00:00:00.58
SQL> set echo off


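PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------
Plan hash value: 1542204867

------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name          | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |               |    1 |  4016 |      3 (34) | 00:00:01 |
|   1 |  SORT ORDER BY                 |               |    1 |  4016 |      3 (34) | 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID  | EMAILS        |    1 |  4016 |      2  (0) | 00:00:01 |
|*  3 |    DOMAIN INDEX                | EMAILBODYTEXT |      |       |             |          |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security',1)>0)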

The execution plan above shows that you are using the Lucene Domain Index, but you can get a better optimizer plan by using lcontains sort:

SQL> explain plan for
  2  SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  3  where lcontains(bodytext,'security','subject:ASC',1)>0;

Explained.

Elapsed: 00:00:00.01
SQL> set echo off

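PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------
Plan hash value: 1450245214

------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name          | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |               |    1 |  4016 |      2  (0) | 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID   | EMAILS        |    1 |  4016 |      2  (0) | 00:00:01 |
|*  2 |   DOMAIN INDEX                 | EMAILBODYTEXT |      |       |             |          |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security','subject:ASC',1)>0)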

Here we have a better optimizer plan and lower cost.

5 Known caveats

1. Lucene Domain Index uses the Java Util Logging API, which means that a grant is required to create and operate any index:

dbms_java.grant_permission( 'USER_NAME','SYS:java.util.logging.LoggingPermission', 'control', '' )


2. SyncMode:OnLine should be reserved for indexes where the number of update/insert/delete operations is very small compared to select operations, because processing each message requires at least opening an IndexWriter/IndexReader on the associated Lucene Index in a background process. The exceptions are bulk collect operations and "insert into ... select ... from" statements, which are processed in batches of 150 rows. Tables with many insert/update operations per second should use the LuceneDomainIndex.sync(idx) procedure, called periodically by DBMS_JOB or by the application.

3. The syntax for inline pagination is only supported at the beginning of the query. This means that if you want to perform pagination, the lcontains() query must start with "rownum:[n TO m] AND"; note that this syntax is case sensitive. This extraction is performed by splitting the query by position and does not take grouping operators into account, so the query "rownum:[1 TO 10] AND word1 OR word2" will be passed to Lucene's Query Parser as "word1 OR word2", which is not semantically the same as the original if you consider operator precedence. The Query Parser class may be modified in the future to solve this semantic issue.
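The positional split described in this caveat can be illustrated with a small, self-contained Java sketch. It is not the actual Lucene Domain Index code: the class name InlinePagination and the regular expression are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the actual Lucene Domain Index code. */
final class InlinePagination {
    // Case-sensitive prefix of the form: rownum:[n TO m] AND
    private static final Pattern PREFIX =
            Pattern.compile("^rownum:\\[(\\d+) TO (\\d+)\\] AND ");

    final int from;           // -1 when no pagination prefix was found
    final int to;             // -1 when no pagination prefix was found
    final String luceneQuery; // text that would be handed to the Query Parser

    private InlinePagination(int from, int to, String luceneQuery) {
        this.from = from;
        this.to = to;
        this.luceneQuery = luceneQuery;
    }

    static InlinePagination parse(String query) {
        Matcher m = PREFIX.matcher(query);
        if (!m.find()) {
            // No prefix (or wrong case): the whole string is the query.
            return new InlinePagination(-1, -1, query);
        }
        // Split purely by position: grouping in the remainder is ignored.
        return new InlinePagination(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                query.substring(m.end()));
    }
}
```

Parsing "rownum:[1 TO 10] AND word1 OR word2" with this sketch yields the range 1 to 10 and the remaining query "word1 OR word2", which is the behavior the caveat warns about.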

4. Since October 25, column names are case sensitive in the ExtraCols and FormatCols parameters, following traditional SQL behavior. This means that for this index creation DDL:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');

You can use ExtraCols with f3 or F3, but FormatCols must use F3 because f3 is returned by the SQL select operation as F3 during the full table scan; the Lucene Index will also have a document with a Field F3 instead of f3. If you want to use f3 as is, you can rewrite the index creation DDL as:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),f3(00.00);ExtraCols:F3"f3"');

With this statement Lucene will create documents with two fields, F2 and f3. F2 is uppercase because it is the master column of the index and is passed as "F2" by the ODCI API; since it is the default Field of the query, you can omit its name in lcontains syntax. F3 is now lowercase and will be indexed as a Field "f3".

5. Since November, index parameters are pre-cached in memory for faster response. Due to the isolation behaviour of Oracle JVM sessions, if you call alter index or re-create an index in another session, you need to close all SQL sessions that have already pre-loaded the index parameter storage. Otherwise you can call the LuceneDomainIndex.refreshParameterCache stored procedure.

By calling LuceneDomainIndex.getParameter('owner.index_name','parameter_name') you can see the value of any parameter passed to the ODCI API by either create index or alter index.

6. If you re-install Lucene Domain Index without first deleting existing indexes, you can manually drop the resources associated with an old index. For example:

SQL> drop index source_big_lidx force;

Index dropped.

SQL> select table_name from tabs;

TABLE_NAME
------------------------------
DEPT
EMP
BONUS
SALGRADE
SOURCE_BIG_LIDX$QT
DR$SOURCE_BIG_IDX$I
DR$SOURCE_BIG_IDX$R
SOURCE_BIG_LIDX$T
TEST_SOURCE_BIG
DR$SOURCE_BIG_IDX$N
DR$SOURCE_BIG_IDX$K

11 rows selected.

SQL> drop table SOURCE_BIG_LIDX$T;

Table dropped.

SQL> conn / as sysdba
connected.
SQL> exec DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q')
BEGIN DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q'); END;

*
ERROR at line 1:
ORA-01403: no data found
ORA-06512: at "SYS.DBMS_AQADM_SYS", line 3359
ORA-06512: at "SYS.DBMS_AQADM", line 167
ORA-06512: at line 1

SQL> exec DBMS_AQADM.DROP_QUEUE_TABLE(queue_table =>'SCOTT.SOURCE_BIG_LIDX$QT', force=>true);

PL/SQL procedure successfully completed.

SQL> exit

Note that "drop index ... force" de-registers the Lucene Domain Index from Oracle's system views; then the Lucene Domain Index storage table is dropped manually; finally, connected as SYS, the Lucene Domain Index AQ queue table is dropped.

7. Oracle 11g has a known bug, "6445561 - ORA-00600 [26599] [62] DUE TO INCORRECT PERSISTENCE OF BY INVOKER PIN". Please apply patch number p6445561_111060_LINUX.zip, available at Metalink. This bug affects select count(*) queries with large result sets.

8. Up to Lucene Domain Index 2.9.0 there is a known problem with the WhereCondition parameter when it uses the SQL OR operator; see section A.3.3 for the workaround.

Appendixes

A. Parameter reference and syntax

Lucene Domain Index accepts several parameters which can be passed using create index or alter index DDL commands. These parameters are divided into four categories: Index Writer, Analyzer, User Data Store and General parameters.


A.1 Lucene Index Writer parameters

This section covers the Lucene Index Writer parameters. For more information about these parameters see the Lucene docs and Wiki.

A.1.1 MergeFactor

Determines how often segment indices are merged by addDocument(). If you are creating a new index over a table with thousands of rows, a value of 100 to 500 is a good choice.

A.1.2 MaxBufferedDocs

Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. This value can cause an out of memory exception if you provide a value larger than the user space available. A typical SGA configuration can accept values of 4000 or 5000, depending on how big the rows being indexed are. If you are not sure how many megabytes your rows can consume, you can use the AutoTuneMemory:true parameter, which is the default value; if you choose true, MaxBufferedDocs will be ignored and Lucene Domain Index will try to use 90% of the Oracle Java Pool Size value.

A.1.3 MaxMergeDocs

Determines the largest number of documents ever merged by addDocument().

A.1.4 MaxBufferedDeleteTerms

Determines the minimal number of delete terms required before the buffered in-memory delete terms are applied and flushed.

A.1.5 UseCompoundFile

Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use. By default Lucene Domain Index does not use the compound file format because it is not affected by the limit on open file descriptors.

A.2 Analyzer parameters

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.

The Analyzer, PerFieldAnalyzer and Stemmer parameters affect both indexing and query expressions, so if you want to change one of them on an existing index you must rebuild it. The priority of these three parameters is: first check for Stemmer; if it is not present, check for PerFieldAnalyzer; if it is not present, check for Analyzer; finally, if none of them is defined, SimpleAnalyzer is used.
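The priority order just described can be written down as a short Java sketch. This is only an illustration of the documented precedence over a plain parameter map; the class AnalyzerChoice and its choose method are hypothetical names, not part of Lucene Domain Index.

```java
import java.util.Map;

/** Hypothetical illustration of the documented parameter priority. */
final class AnalyzerChoice {
    static final String DEFAULT_ANALYZER =
            "org.apache.lucene.analysis.SimpleAnalyzer";

    /** Returns "ParameterName:value" for the parameter that wins. */
    static String choose(Map<String, String> params) {
        // 1. Stemmer has the highest priority.
        if (params.containsKey("Stemmer")) {
            return "Stemmer:" + params.get("Stemmer");
        }
        // 2. PerFieldAnalyzer is checked next.
        if (params.containsKey("PerFieldAnalyzer")) {
            return "PerFieldAnalyzer:" + params.get("PerFieldAnalyzer");
        }
        // 3. Then the plain Analyzer parameter.
        if (params.containsKey("Analyzer")) {
            return "Analyzer:" + params.get("Analyzer");
        }
        // 4. Nothing defined: fall back to SimpleAnalyzer.
        return "Analyzer:" + DEFAULT_ANALYZER;
    }
}
```

For example, a map containing both Stemmer and Analyzer entries resolves to the Stemmer entry, matching the rule that Stemmer overrides Analyzer.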


A.2.1 Analyzer

This parameter is the fully qualified name of a Java class which extends org.apache.lucene.analysis.Analyzer. For example:

• BrazilianAnalyzer
• ChineseAnalyzer
• CJKAnalyzer
• CzechAnalyzer
• DutchAnalyzer
• FrenchAnalyzer
• GermanAnalyzer
• GreekAnalyzer
• KeywordAnalyzer
• PatternAnalyzer
• RussianAnalyzer
• SimpleAnalyzer
• StandardAnalyzer
• StopAnalyzer
• ThaiAnalyzer
• WhitespaceAnalyzer

See the Lucene Java Docs for more details. The default analyzer is SimpleAnalyzer.

A.2.2 Stemmer

Stemmer is another kind of analyzer which handles words, stop words and other term-related objects according to a specific language. The Stemmer parameter uses the Snowball Analyzer; possible values for the Stemmer parameter using the Lucene 2.2.0 distribution are:

• Danish
• Dutch
• English
• Finnish
• French
• German
• German2
• Italian
• Kp
• Lovins
• Norwegian
• Porter
• Portuguese
• Russian
• Spanish
• Swedish

The Stemmer parameter overrides the Analyzer parameter.

A.2.3 PerFieldAnalyzer

PerFieldAnalyzer is a wrapper around other analyzers which provides an independent analyzer for each column being indexed; see the PerFieldAnalyzerWrapper class in the Lucene documentation. Each column can have its own analyzer, which must extend org.apache.lucene.analysis.Analyzer. If a column is not in the list, StandardAnalyzer is used by default. For example:

create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('IncludeMasterColumn:false;
ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";
FormatCols:F1(000),id(00)');

alter index it1 rebuild
parameters('PerFieldAnalyzer:F1(org.apache.lucene.analysis.KeywordAnalyzer),id(org.apache.lucene.analysis.KeywordAnalyzer)');

In the above example four columns are indexed by Lucene Domain Index: rowid (added by default) using KeywordAnalyzer; F1 and id (added by the ExtraCols parameter), also using KeywordAnalyzer; and finally name, which is not included in the PerFieldAnalyzer parameter and therefore uses StandardAnalyzer.

A.3 User Data Store parameters

Lucene Domain Index implements a User Data Store functionality, which provides several parameters to control which columns will be included in the Lucene Document inserted into the index.

The first three parameters are used to choose which columns will be added to the index in addition to the master column. An Oracle Domain Index is bound to a single column; this is a limitation of Oracle 10g. To work around it, by passing ExtraCols, ExtraTabs and WhereCondition you can easily build a set of new columns from the master table and others. Basically a select statement is built using these parameters. To clarify, Lucene Domain Index will perform queries like:

Full table scan (create index statement):
SELECT rowid,MasterTable.MasterColumn,ExtraCols FROM MasterTable,ExtraTabs where WhereCondition;

Find a particular rowid (insert, update operations):
SELECT MasterTable.MasterColumn,ExtraCols FROM MasterTable,ExtraTabs where MasterTable.rowid=:rowid AND WhereCondition;

In these queries rowid, MasterTable and MasterColumn are injected by Lucene Domain Index, while ExtraCols, ExtraTabs and WhereCondition are user defined.

A.3.1 ExtraCols

A comma separated list of columns of the master table being indexed or of the tables defined in the ExtraTabs parameter. Note that if you don't define column aliases, column names are capitalized by default on Oracle databases. For example: 'ExtraCols:F2 "f2",T2.F3"f3"'. Note that you can omit the master table name if there are no collisions.

A.3.2 ExtraTabs

A comma separated list of table names and aliases for these tables. For example: 'ExtraTabs:T2 aliasT2,T3 aliasT3'. Note that the ODCI API only detects changes to the index master column; to be notified of changes to the ExtraCols list you need to attach triggers. See the examples section above for more detail.


A.3.3 WhereCondition

An SQL where condition used to join the index's master table with the ExtraTabs tables. For example: 'WhereCondition:T1.f1=T2.f2(+) AND T1.F1=aliasT3.f3'. Be careful to produce a correct join condition that guarantees a single row result; multiple or zero row results based on the master table values are not allowed.

Note: Up to Lucene Domain Index 2.9.0, if you use a WhereCondition which contains an OR operator, enclose the condition in parentheses, because the relative precedence of the OR and AND operators makes some queries return more rows than they should. For example, instead of:

WhereCondition:T1.F1='AA' OR T1.F1='BB'

put:

WhereCondition:(T1.F1='AA' OR T1.F1='BB')

This workaround fixes some problems when working in OnLine mode. Starting with version 2.9.1 these extra parentheses are not required.
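The effect of the missing parentheses can be reproduced with a small Java sketch. It assumes, based on the rowid query template in section A.3, that the user condition is appended after "MasterTable.rowid=:rowid AND" without extra parentheses in versions up to 2.9.0. The class WherePrecedence and its methods are hypothetical; Java's && and || follow the same relative precedence as SQL's AND and OR.

```java
/** Hypothetical illustration of the AND/OR precedence problem. */
final class WherePrecedence {
    /** Builds the rowid lookup following the template shown in A.3. */
    static String rowQuery(String whereCondition) {
        return "SELECT MasterTable.MasterColumn FROM MasterTable"
                + " WHERE MasterTable.rowid=:rowid AND " + whereCondition;
    }

    /** rowid=:rowid AND F1='AA' OR F1='BB' (no parentheses). */
    static boolean unparenthesized(boolean rowidMatches, String f1) {
        // Evaluated as (rowidMatches && AA) || BB
        return rowidMatches && f1.equals("AA") || f1.equals("BB");
    }

    /** rowid=:rowid AND (F1='AA' OR F1='BB') */
    static boolean parenthesized(boolean rowidMatches, String f1) {
        return rowidMatches && (f1.equals("AA") || f1.equals("BB"));
    }
}
```

With the unparenthesized form, a row whose F1 is 'BB' is selected even when its rowid does not match the bind variable, which is how extra rows can appear; the parenthesized form rejects it.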

A.3.4 UserDataStore

This is the fully qualified name of a Java class which implements the org.apache.lucene.indexer.UserDataStore interface; you can create your own Data Store class by implementing this interface. By default Lucene Domain Index provides an implementation which covers most typical scenarios. This class is org.apache.lucene.indexer.DefaultUserDataStore and it uses the FormatCols parameter to create Lucene Fields.

A.3.5 FormatCols

A comma separated list of column(format) strings interpreted by the User Data Store class to control how a specific database column will be transformed into a Lucene Field. For example, you can choose padding, un-tokenized values and so on.

Formats supported by the Default Data Store class are:

• Number padding for numeric columns using java.text.DecimalFormat class syntax; the default is 0000000000.

• Date rounding for timestamp and date columns using org.apache.lucene.document.DateTools; the default is day.

• Character left padding for VARCHAR2 or CHAR columns using the org.apache.lucene.util.StringUtils class (leftPad method); the default is no left padding. Any character can be used for left padding.

• XPath expression for XMLType columns. This XPath string is passed to the XMLType.extract("format","") method; the result of the XPath extraction is a new XMLType object on which getStringVal() is executed. If you want to perform a more customized XMLType to Field extraction, extend the DefaultUserDataStore class or use virtual column indexing.

• For columns of type VARCHAR2 or CHAR you can use the special string NOT_ANALYZED or NOT_ANALYZED_STORED as the format, which tells the Default User Data Store class that this column will be indexed but un-tokenized. This is useful for columns which will be used for sorting.
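Why padding matters can be seen with a few lines of Java. The sketch below uses the standard java.text.DecimalFormat class with the default pattern mentioned above; the leftPad helper is a hypothetical stand-in written for this example, not the StringUtils method used by Lucene Domain Index. Since Lucene compares terms as strings, padded values sort in the same order as the numbers they represent.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

/** Hypothetical illustration of number and character padding. */
final class PaddingDemo {
    /** Pads a number using DecimalFormat pattern syntax, e.g. 0000000000. */
    static String padNumber(String pattern, double value) {
        // Locale.ROOT keeps ASCII digits and '.' as the decimal separator.
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
        return new DecimalFormat(pattern, symbols).format(value);
    }

    /** Left-pads a string with the given character up to the given width. */
    static String leftPad(String s, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(pad);
        }
        return sb.append(s).toString();
    }
}
```

For instance, padNumber("0000000000", 42) gives "0000000042", which sorts before "0000000100" as a string, whereas the unpadded "42" sorts after "100".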

A.4 General parameters

These parameters are specific to Lucene Domain Index.


A.4.1 SyncMode

SyncMode tells Lucene Domain Index which strategy is used to update the index.

SyncMode:Deferred (default) leaves it to the application to decide when the index is synced, either by calling the LuceneDomainIndex.sync procedure after a set of pending changes or by a DBMS_SCHEDULER process at a specific time. With SyncMode:Deferred, update and insert operations are queued using the DBMS_AQ package. Delete operations are never enqueued because they require an update of the Lucene Index so that the rowids of deleted rows are not returned.

SyncMode:OnLine is implemented using a DBMS_AQ PL/SQL callback, so immediately after a commit that involves inserted or updated rows, a parallel process dbms_j* is automatically started by the DBMS_AQ package to apply pending changes.

SyncMode:OnLine should be reserved for indexes whose update, insert or delete operations are much less frequent than selects. AQ callbacks cannot handle exceptions during sync time very well, for example when a row being indexed is locked by another session, so some changes can be lost in this scenario.

A.4.2 AutoTuneMemory

AutoTuneMemory:true (default) overrides the MaxBufferedDocs parameter; it defines MaxBufferedDocs dynamically based on how much memory is reported by the OracleRuntime.getJavaPoolSize() method.

After each document is added to the index it calls writer.ramSizeInBytes() and tests that the result is not over 90% of the free RAM.

This parameter works in most common cases, but you can get a Java out of memory error in multiuser environments because Java Pool Size is a parameter shared by all sessions. If you get an exception during index creation, set AutoTuneMemory:false and adjust MaxBufferedDocs to a value which does not raise an out of memory exception.
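The check described above can be sketched in plain Java. This is a simulation under stated assumptions, not the real implementation: document sizes and free memory are plain numbers supplied by the caller instead of values from writer.ramSizeInBytes() and OracleRuntime.getJavaPoolSize(), and the class AutoTuneSketch is a hypothetical name.

```java
/** Hypothetical simulation of the 90% memory check; not the real code. */
final class AutoTuneSketch {
    static final double THRESHOLD = 0.90;

    /** True when the buffered size exceeds 90% of the free RAM. */
    static boolean shouldFlush(long ramSizeInBytes, long freeRamBytes) {
        return ramSizeInBytes > (long) (freeRamBytes * THRESHOLD);
    }

    /** Adds documents one by one and counts how many flushes happen. */
    static int countFlushes(long[] docSizes, long freeRamBytes) {
        long buffered = 0;
        int flushes = 0;
        for (long size : docSizes) {
            buffered += size;                 // document added to the buffer
            if (shouldFlush(buffered, freeRamBytes)) {
                flushes++;                    // stands in for a writer flush
                buffered = 0;
            }
        }
        return flushes;
    }
}
```

The sketch also makes the multiuser caveat visible: two sessions that each apply the 90% rule to the same pool can together exceed it.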

A.4.3 LobStorageParameters

Lucene Domain Index uses a BLOB column named "data" for storing the Lucene inverted index files. You can control any LOB storage parameter with this parameter at index creation time. Its default value is 'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CACHE READS NOLOGGING'. For 11g databases you can get better optimized storage by using the newer Secure LOB parameters, for example: 'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING'

A.4.4 LogLevel

Lucene Domain Index uses the JDK Java Util Logging package. The LogLevel parameter is any of the strings accepted by the Level.parse() method, for example: LogLevel:ALL. By default the logging level is set to WARNING.

Lucene Domain Index uses:

• SEVERE for non-recoverable error conditions
• FINER for debugging purposes such as ODCI API arguments
• INFO for checking index operations such as the values being indexed
• WARNING for error messages which are reported as ERROR through the ODCI API
• CONFIG to see parameters changed by users

Logging information is sent by default to Oracle .trc files, but you can redirect this output using, for example, the dbms_java.set_output procedure.

If you are not sure which fields are added to the index and how, change LogLevel to INFO and check for lines starting with "INFO: Document<".

The exiting and throwing methods do not print messages even with the log level set to ALL. This is because the logging level used by these methods is controlled by the ConsoleHandler level.

To get these methods to work, copy the logging.properties file from your JAVA_HOME/jre/lib to the ORACLE_HOME/javavm/lib directory and edit the line which includes the level property:

# Limit the message that are printed on the console to INFO and above.
java.util.logging.ConsoleHandler.level = ALL
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

Then shutdown and startup your Oracle database.
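How a LogLevel string filters messages can be checked with the standard java.util.logging classes. The sketch below only demonstrates the JDK behavior of Level.parse() and Logger.isLoggable(); the class LogLevelDemo is a hypothetical helper, and it does not cover the separate ConsoleHandler level discussed above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper showing how a LogLevel string filters messages. */
final class LogLevelDemo {
    /** True if a logger configured with the given level name passes the message level. */
    static boolean wouldLog(String configuredLevel, Level messageLevel) {
        Logger logger = Logger.getAnonymousLogger();
        // Level.parse accepts names such as ALL, FINER, CONFIG, INFO, WARNING, SEVERE.
        logger.setLevel(Level.parse(configuredLevel));
        return logger.isLoggable(messageLevel);
    }
}
```

With the default WARNING level, INFO messages such as the "INFO: Document<" lines are filtered out, while they pass once the level is set to INFO or ALL.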

A.4.5 CachedRowIdSize

CachedRowIdSize is used by an LRU cache which maintains the association between a Lucene Doc ID and a particular Oracle ROWID. For very big tables, using an array to store this association can consume a lot of SGA RAM, so starting with Lucene Domain Index 2.9.0.1.0 only 10,000 ROWIDs are stored in this cache. Tables with a high frequency of updates can keep this LRU cache small, because every update causes the LRU cache to be completely flushed, but tables with a low frequency of updates/deletes can get a significant performance improvement by using a larger LRU cache size.
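A bounded LRU cache of this kind is commonly built in Java on top of LinkedHashMap. The sketch below shows that general technique only; it is not the cache class used by Lucene Domain Index, and the name RowIdLruCache and the use of String for ROWID values are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded LRU cache mapping Lucene doc IDs to ROWID strings. */
final class RowIdLruCache extends LinkedHashMap<Integer, String> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    RowIdLruCache(int capacity) {
        // accessOrder=true: iteration order goes from least to most recently used.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
        // Evict the least recently used entry once the capacity is exceeded.
        return size() > capacity;
    }
}
```

With a capacity of 2, inserting doc IDs 1 and 2, reading 1, and then inserting 3 evicts doc ID 2, since it was the least recently used entry.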

A.5 Query parameters

These parameters affect QueryParser and search functionality.

A.5.1 DefaultColumn

DefaultColumn defines which column is used as the default column in QueryParser syntax. If this parameter is not set, the master column of the index is used. This name is a Lucene Field name. Here is an example:

create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";FormatCols:revisionDate(day);IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

Note the correlation between DefaultColumn and ExtraCols. ExtraCols defines a Lucene Field named "text" with a value calculated by the SQL expression extract(object_value,''/page/revision/text/text()''), so you can use the Lucene Field text as the default Field in QueryParser syntax.

A.5.2 DefaultOperator

DefaultOperator defines which Boolean operator is used by default in QueryParser syntax. If this parameter is not set, the OR operator is used.

A.5.3 NormalizeScore

NormalizeScore is used during the Lucene Index scan to decide whether the maximum score needs to be tracked; the maximum score is then used to normalize the result of the lscore() operator so that it only returns values between 0 and 1. If you don't need a normalized score range you can avoid this computation and your query will be faster. Note that a non-normalized score does not imply that the documents are not in order of relevance.
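As a worked illustration of score normalization, the Java sketch below divides every raw score by the maximum score, which maps the results into the range 0 to 1 without changing their order. Dividing by the maximum is an assumption about how the normalization is done; the class ScoreNormalizer is a hypothetical name, not part of Lucene Domain Index.

```java
/** Hypothetical illustration of normalizing scores by the maximum score. */
final class ScoreNormalizer {
    static float[] normalize(float[] rawScores) {
        // First pass: track the maximum score (the extra work NormalizeScore implies).
        float max = 0f;
        for (float score : rawScores) {
            max = Math.max(max, score);
        }
        float[] normalized = new float[rawScores.length];
        if (max == 0f) {
            return normalized; // nothing to scale; all values stay 0
        }
        // Second pass: scale every score into the range 0..1.
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] / max;
        }
        return normalized;
    }
}
```

Raw scores of 2, 4 and 1 become 0.5, 1.0 and 0.25: the relative order is unchanged, which is why skipping normalization does not affect relevance ordering.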

A.5.4 PreserveDocIdOrder

PreserveDocIdOrder is an internal parameter which is used by Lucene in some kinds of operators. If you don't need results to preserve Lucene Doc ID order rather than relevance, you can set this value to false (default) and some operators will be faster.

A.6 Highlight parameters

These parameters affect the lhighlight, phighlight and rhighlight functionality.

A.6.1 Formatter

Formatter defines a valid class name which implements the Lucene Formatter interface and has a constructor with no arguments. The default value is org.apache.lucene.search.highlight.SimpleHTMLFormatter.

A.6.2 MaxNumFragmentsRequired

MaxNumFragmentsRequired defines the number of text fragments returned by the Highlight function; the default value is 4.

A.6.3 FragmentSize

FragmentSize defines the size, in characters, of each fragment returned; the default value is 100.

A.6.4 FragmentSeparator

FragmentSeparator defines a String used as the fragment separator; the default value is "...". Note that you cannot use ";" or ":" as the fragment separator because they are used as parameter and value delimiters in the alter index ... parameters(..) DDL statement.

B Lucene Domain Index Storage

The OJVMDirectory class creates a set of Oracle objects to represent the Lucene inverted index and the Domain Index functionality. First it creates a table named IDX_NAME$T (IDX_NAME is the Lucene Domain Index name used in the create index DDL statement) with this structure:

Name            Null?     Type
--------------- --------- --------------
NAME            NOT NULL  VARCHAR2(30)
LAST_MODIFIED             TIMESTAMP(6)
FILE_SIZE                 NUMBER(38)
DATA                      BLOB
DELETED                   CHAR(1)

There is also an index on the IDX_NAME$T.DELETED column to speed up purge operations. To enqueue operations on the index, a DBMS_AQ queue IDX_NAME$Q is defined, with its storage table IDX_NAME$QT.


The IDX_NAME$Q queue has a payload defined as a LUCENE_MSG_TYP object. This object type is defined as:

Name            Null?     Type
--------------- --------- ----------------
RIDLIST                   SYS.ODCIRIDLIST
OPERATION                 VARCHAR2(32)

SYS.ODCIRIDLIST is a special structure defined by the ODCI API to hold the list of rowids changed by a DML operation. OPERATION is one of the reserved keywords insert, delete, update, rebuild or optimize. The rebuild and optimize operations are used with SyncMode:OnLine to perform these tasks automatically using a background process.

C JUnit test suites explained

C.1 DBTestCase base class

This is the base class for most of the included test suites.

It provides a connection pool using OracleDataSource with a minimum of two ready-to-use connections, growing up to 5; after that it waits up to 20 seconds for a free connection. This connection pool is created in the class constructor.

The utility methods provided by this class are listed below. Each method uses its own SQL connection, so they are autonomous transactions:

• createTable(), creates a test table as follows (T1 is a constant value defined as TABLE):

create table T1 (
f1 number primary key,
f2 varchar2(200),
f3 varchar2(200),
f4 number)

• dropTable(), drops the table created above.

• createIndex(), adds a Lucene Domain Index to the previously created table as follows (LogLevel, Analyzer, MergeFactor, ExtraCols and FormatCols are customizable at class level; after index creation MergeFactor is reduced to 2):

create index IT1 on T1(f2) indextype is lucene.LuceneIndex
parameters('LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;ExtraCols:F1;FormatCols:F1(0000)')

• dropIndex(), drops the index created above.

• int insertRows(int startIndex, int endIndex), inserts a set of rows into the above table with the F1 column varying from startIndex to endIndex. The F2 column is an English text representation of F1, F4 is F1*10 and F3 is an English text representation of F1*10. Returns the number of rows inserted. If there are problems such as a primary key violation, it rolls back the transaction.

• int deleteRows(int startIndex, int endIndex), deletes the set of rows where F1 is between startIndex and endIndex. Returns the number of rows deleted. If there are problems, it rolls back the transaction. Note that deleting rows automatically updates the Lucene Index.

• int updateRows(int startIndex, int endIndex), updates the F2 column with its own value to fire the ODCI update method on each row between startIndex and endIndex. Returns the number of rows updated.

• findRows(int n), finds rows whose F2 matches a text representation of n using the lcontains operator. It only tests for a result having 0 or more rows.


• long syncIndex(), performs a sync operation on the Lucene Domain Index, applying pending changes (inserts, updates). If there are errors, usually caused by another transaction holding an exclusive lock on a row being indexed, it rolls back the operation. The next successful sync will apply the pending changes of failed operations. Returns a long value with the number of milliseconds spent during sync.

• long optimizeIndex(), performs an optimize operation on the Lucene Domain Index, merging segments into a new one. If there are errors, usually caused by another transaction holding an exclusive lock on the index, it rolls back the operation. Returns a long value with the number of milliseconds spent during optimize.

C.2 TestDBIndex

A simple test which creates a table and its index and performs insertions, sync, optimize and deletions, and finally drops the index and the table. Its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndex
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.836 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 40 total char inserted: 415 avg text length: 10
[junit] Index synced: IT1 elapsed time: 265 ms.
[junit] Avg Sync time: 6
[junit] Index optimized: IT1 elapsed time: 40 ms.
[junit] Avg Optimize time: 1
[junit] Row deleted 40, from: 10 to: 49 elapsed time: 1303 ms. Avg time: 32 ms.
[junit] Index droped: IT1
[junit] Table droped: T1

C.3 TestDBIndexAddDoc

Performs several insertions and syncs, starting with 10 rows, then 90 and so on, ending with 3,000 insertions, using the insertRows method of the DBTestCase base class. After each batch of insertions it calls the syncIndex method and calculates the average sync time for each row inserted. Its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexAddDoc
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 64.696 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Index synced: IT1 elapsed time: 126 ms.
[junit] Inserted rows: 10 total char inserted: 49 avg text length: 4
[junit] Index synced: IT1 elapsed time: 142 ms.
[junit] Avg Sync time: 14
[junit] Inserted rows: 90 total char inserted: 988 avg text length: 10
[junit] Index synced: IT1 elapsed time: 374 ms.
[junit] Avg Sync time: 4
[junit] Inserted rows: 400 total char inserted: 9201 avg text length: 23
[junit] Index synced: IT1 elapsed time: 1276 ms.
[junit] Avg Sync time: 3
[junit] Inserted rows: 500 total char inserted: 11726 avg text length: 23
[junit] Index synced: IT1 elapsed time: 1601 ms.
[junit] Avg Sync time: 3
[junit] Inserted rows: 1000 total char inserted: 35950 avg text length: 35
[junit] Index synced: IT1 elapsed time: 4675 ms.
[junit] Avg Sync time: 4
[junit] Inserted rows: 3000 total char inserted: 110851 avg text length: 36
[junit] Index synced: IT1 elapsed time: 25480 ms.
[junit] Avg Sync time: 8
[junit] Index droped: IT1
[junit] Table droped: T1
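The "Avg Sync time" figures in this output are consistent with the elapsed sync time in milliseconds divided by the number of rows in the batch, using integer division. This is an inference from the numbers shown, not something confirmed from the test source, and the class AvgSyncTime below is a hypothetical helper written only to reproduce the arithmetic.

```java
/** Hypothetical helper reproducing the averages seen in the test output. */
final class AvgSyncTime {
    /** Average milliseconds per row, truncated by integer division. */
    static long avgMillisPerRow(long elapsedMillis, int rows) {
        return elapsedMillis / rows;
    }
}
```

For example, 142 ms for 10 rows gives 14, 374 ms for 90 rows gives 4, and 25480 ms for 3000 rows gives 8, matching the values printed by the test.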

C.4 TestDBIndexDelDoc

In its setup method this test case creates a table and fills it with 500 rows. It then performs deletions in batches of 10, 90 and 400 rows, calculating the average time for each row deleted. Its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexDelDoc
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 20.543 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 500 total char inserted: 10238 avg text length: 20
[junit] Index synced: IT1 elapsed time: 1643 ms.
[junit] Row deleted 10, from: 1 to: 10 elapsed time: 356 ms. Avg time: 35 ms.
[junit] Row deleted 90, from: 11 to: 100 elapsed time: 2535 ms. Avg time: 28 ms.
[junit] Row deleted 400, from: 101 to: 500 elapsed time: 11526 ms. Avg time: 28 ms.
[junit] Index droped: IT1
[junit] Table droped: T1

C.5 TestDBIndexParallel

This is a more complex test case that checks concurrent access to a Lucene Domain Index. It creates several threads: some simulate batch insertions of 10 rows, others batch deletions of 10 rows, others batch updates of 10 rows, and finally several threads search for rows every 0.5 seconds.
By default it creates 3 threads for each kind of operation, and each thread performs:

• 20 inserts
• 5 deletes
• 5 updates
• 100 searches

Each thread takes its own connection from the connection pool and does its job. If the fastSync constant is true, the thread calls the syncIndex method after each successful insert and update to update the Lucene index; if fastSync is false, another thread is started that syncs the index every second. The test ends when all insert, delete and update threads have finished.
Here is part of its output:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexParallel
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 97.7 sec

[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] FastSync: true
[junit] Deleter 1 deleting at block 70
[junit] Updater 1 updating at block 70
[junit] Inserter 2 inserting at block 90
[junit] No Row deleted at: 70 to: 79 elapsed time: 131 ms.
[junit] No Row updated at: 70 to: 79 elapsed time: 12 ms.
[junit] Searcher 2 searching row 30
[junit] Searcher 1 searching row 77
[junit] Not Found rows with: thirty elapsed time: 211 ms.
[junit] Not Found rows with: seventy-seven elapsed time: 170 ms.
[junit] Inserted rows: 10 total char inserted: 115 avg text length: 11
[junit] Searcher 2 searching row 62
[junit] Searcher 0 searching row 63
[junit] Searcher 1 searching row 49
[junit] Not Found rows with: sixty-two elapsed time: 64 ms.
[junit] Index synced: IT1 elapsed time: 283 ms.
[junit] Not Found rows with: sixty-three elapsed time: 215 ms.
[junit] Searcher 2 searching row 74
[junit] Not Found rows with: seventy-four elapsed time: 39 ms.
[junit] Not Found rows with: forty-nine elapsed time: 137 ms.
[junit] Searcher 1 searching row 95
[junit] Searcher 2 searching row 46
[junit] Found rows with: ninety-five elapsed time: 103 ms.
....
[junit] Updater 2 updating at block 20
[junit] No Row updated at: 20 to: 29 elapsed time: 3 ms.
[junit] Inserted rows: 10 total char inserted: 80 avg text length: 8
[junit] Searcher 0 searching row 97
[junit] Found rows with: ninety-seven elapsed time: 60 ms.
[junit] Index synced: IT1 elapsed time: 147 ms.
.....
[junit] Searcher 2 searching row 39
[junit] Searcher 1 searching row 84
[junit] Not Found rows with: thirty-nine elapsed time: 33 ms.
[junit] Not Found rows with: eighty-four elapsed time: 38 ms.
[junit] Updater 0 updating at block 90
[junit] Row updated 10, from: 90 to: 99 elapsed time: 16 ms. Avg time: 1 ms.
[junit] Index synced: IT1 elapsed time: 162 ms.
......
[junit] Inserted rows: 10 total char inserted: 125 avg text length: 12
[junit] Searcher 0 searching row 57
[junit] Searcher 1 searching row 28
[junit] Deleter 1 deleting at block 80
[junit] Searcher 2 searching row 64
[junit] No Row deleted at: 80 to: 89 elapsed time: 58 ms.
[junit] Not Found rows with: twenty-eight elapsed time: 112 ms.
[junit] Not Found rows with: fifty-seven elapsed time: 155 ms.
[junit] Index synced: IT1 elapsed time: 242 ms.
[junit] Searcher 0 searching row 98
[junit] Found rows with: ninety-eight elapsed time: 72 ms.
[junit] Not Found rows with: sixty-four elapsed time: 175 ms.

[junit] Searcher 0 searching row 27
[junit] Not Found rows with: twenty-seven elapsed time: 75 ms.
[junit] Searcher 1 searching row 5
[junit] Deleter 2 deleting at block 50
[junit] Searcher 2 searching row 84
[junit] Not Found rows with: eighty-four elapsed time: 20 ms.
[junit] Updater 2 updating at block 10
[junit] No Row deleted at: 50 to: 59 elapsed time: 28 ms.
[junit] Row updated 10, from: 10 to: 19 elapsed time: 36 ms. Avg time: 3 ms.
[junit] Found rows with: five elapsed time: 216 ms.
.................
[junit] Inserter 1 inserting at block 50
[junit] Found rows at: 50 position, ignoring insertions
[junit] Index droped: IT1
[junit] Table droped: T1

C.6 TestDBIndexSearchDoc

This test checks some special features of the lcontains operator, such as in-line pagination, sort by and filter by expressions.
It first creates a table with 200 rows and then queries them. Its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexSearchDoc
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 14.001 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 746 ms.
[junit] testFilterAll()
[junit] Excecution time: 129 ms.
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 119 Score: 0.25453204 str: one hundred nineteen
[junit] 118 Score: 0.25453204 str: one hundred eighteen
[junit] 117 Score: 0.25453204 str: one hundred seventeen
[junit] 116 Score: 0.25453204 str: one hundred sixteen
[junit] 115 Score: 0.25453204 str: one hundred fifteen
[junit] 114 Score: 0.25453204 str: one hundred fourteen
[junit] 113 Score: 0.25453204 str: one hundred thirteen
[junit] 112 Score: 0.25453204 str: one hundred twelve
[junit] 111 Score: 0.25453204 str: one hundred eleven
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 721 ms.
[junit] testFilterBy()
[junit] Excecution time: 162 ms.
[junit] 103 Score: 1.0 str: one hundred three
[junit] 120 Score: 0.9606395 str: one hundred twenty

[junit] 101 Score: 0.28600293 str: one hundred one
[junit] 100 Score: 0.27352643 str: one hundred
....
[junit] 115 Score: 0.25453204 str: one hundred fifteen
[junit] 116 Score: 0.25453204 str: one hundred sixteen
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 751 ms.
[junit] testFilterByOrderBy()
[junit] Excecution time: 138 ms.
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 119 Score: 0.25453204 str: one hundred nineteen
....
[junit] 103 Score: 1.0 str: one hundred three
[junit] 102 Score: 0.25453204 str: one hundred two
[junit] 101 Score: 0.28600293 str: one hundred one
[junit] 100 Score: 0.27352643 str: one hundred
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 761 ms.
[junit] testPagination()
[junit] Excecution time: 193 ms.
[junit] 117 Score: 0.03489425 str: one hundred seventeen
[junit] 118 Score: 0.03489425 str: one hundred eighteen
....
[junit] 132 Score: 0.03489425 str: one hundred thirty-two
[junit] 134 Score: 0.03489425 str: one hundred thirty-four
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 743 ms.
[junit] testCountHits()
[junit] Excecution time: 53 ms.
[junit] Hits: 126
[junit] Index droped: IT1
[junit] Table droped: T1

C.7 TestQueryHits

This test is not autonomous because it requires an additional step. Before running it, create a table and its Lucene index with:

create table test_source_big as (select * from all_source);
create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');

For 11g databases you can create a better-optimized Lucene index using some new Secure LOB features:

create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('FormatCols:line(0000);ExtraCols:line "line";Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

On 10g, running it as SCOTT, the TEST_SOURCE_BIG table will have 220731 rows in a typical installation based on database templates.
Using the above table, two tests check performance with a query that returns 18387 hits: one calls the LuceneDomainIndex.countHits function, and the other iterates over the result in pages of ten rows, a typical scenario in web applications. The output looks like:

[junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.656 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] iteration from: 13775 to: 13785
[junit] Step time: 791 ms.
[junit] iteration from: 13785 to: 13795
[junit] Step time: 49 ms.
[junit] iteration from: 13795 to: 13805
[junit] Step time: 40 ms.
[junit] iteration from: 13805 to: 13815
[junit] Step time: 44 ms.
[junit] iteration from: 13815 to: 13825
[junit] Step time: 40 ms.
[junit] iteration from: 13825 to: 13835
[junit] Step time: 42 ms.
[junit] iteration from: 13835 to: 13845
[junit] Step time: 41 ms.
[junit] iteration from: 13845 to: 13855
[junit] Step time: 50 ms.
[junit] iteration from: 13855 to: 13865
[junit] Step time: 41 ms.
[junit] iteration from: 13865 to: 13875
[junit] Step time: 41 ms.
[junit] Elapsed time: 1877
[junit] Hits: 18387
[junit] Elapsed time: 564

Note that the first iteration takes longer because it includes parsing and caching time. Also, to simulate a real-world web application, an SQL connection is taken from the pool and returned to it on each iteration.
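Paging through a result set ten rows at a time, as this test does, comes down to computing an inclusive window of rownum values for each page and prefixing the Lucene query with it. The Python sketch below shows that arithmetic. The function names are hypothetical, and the prefix follows the rownum:[nn TO mm] AND form used in the phighlight and rhighlight examples of appendix D. It does not reproduce the exact windows printed by TestQueryHits.

```python
def page_window(page, page_size=10):
    """Return the inclusive, 1-based (first, last) rownum bounds of a page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    first = (page - 1) * page_size + 1
    return first, first + page_size - 1


def rows_in_window(first, last):
    """Both bounds are inclusive, so 20 TO 30 covers 11 rows."""
    return last - first + 1


def paginated_query(query, page, page_size=10):
    """Prefix a Lucene query with the in-line pagination clause."""
    first, last = page_window(page, page_size)
    return f"rownum:[{first} TO {last}] AND ({query})"


print(paginated_query("security OR mysql", 1))
# rownum:[1 TO 10] AND (security OR mysql)
```

The resulting string would be passed as the second argument of lcontains. Because both boundaries are inclusive, consecutive pages must not share an endpoint.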

D Functions, operators and utilities

D.1 lcontains ancillary operator

The lcontains operator is similar to the Oracle Text CONTAINS operator, but differs in its query argument and supports one more argument to define in-line sorting.
Syntax

LCONTAINS([schema.]column, text_query VARCHAR2 [,sort VARCHAR2] [,label NUMBER]) RETURN NUMBER;

[schema.]column
Specify the Lucene text column to be searched. This column must have a Lucene Domain Index associated with it.

text_query
Specify a query in Lucene Query Parser syntax. In addition to that syntax, Lucene Domain Index supports in-line pagination in lcontains: the query must start with rownum:[nn TO mm] AND, where nn and mm are the rownum values of the first and last rows of the result to return. As in Oracle syntax, rownum starts at 1, and both boundaries are inclusive, so 20 TO 30 returns 11 rows.
The following is an excerpt of the Lucene Query Parser syntax.

Terms
A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.
A Single Term is a single word such as "test" or "hello".
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
Multiple terms can be combined together with Boolean operators to form a more complex query (see below).
Note: The analyzer used to create the index will be used on the terms and phrases in the query string. So it is important to choose an analyzer that will not interfere with the terms used in the query string.

Fields
Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. The field names and the default field are implementation specific.
You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
As an example, let's assume a Lucene index contains two fields, title and text, and text is the default field. If you want to find the document entitled "The Right Way" which contains the text "don't go this way", you can enter:

title:"The Right Way" AND text:go

or

title:"Do it right" AND right

Since text is the default field, the field indicator is not required.
Note: The field is only valid for the term that it directly precedes, so the query

title:Do it right

will only find "Do" in the title field. It will find "it" and "right" in the default field (in this case the text field).

Term Modifiers
Lucene supports modifying query terms to provide a wide range of searching options.

Wildcard Searches
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:

te?t

Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:

test*

You can also use the wildcard searches in the middle of a term.

te*t

Note: You cannot use a * or ? symbol as the first character of a search.

Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example, to search for a term similar in spelling to "roam" use the fuzzy search:

roam~

This search will find terms like foam and roams.
Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:

roam~0.8

The default that is used if the parameter is not given is 0.5.

Proximity Searches
Lucene supports finding words that are within a specific distance of each other. To do a proximity search use the tilde, "~", symbol at the end of a Phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document use the search:

"jakarta apache"~10

Range Searches
Range Queries allow one to match documents whose field(s) values are between the lower and upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically.

mod_date:[20020101 TO 20030101]

This will find documents whose mod_date fields have values between 20020101 and 20030101, inclusive. Note that Range Queries are not reserved for date fields. You could also use range queries with non-date fields:

title:{Aida TO Carmen}

This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.
Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
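Because range bounds are compared lexicographically, numeric values behave as expected only when they are indexed with a fixed width. The Python sketch below illustrates the effect with plain string comparison. The helper name is hypothetical, and the four-digit padding is chosen only to echo the FormatCols:line(0000) parameter used elsewhere in this manual. How FormatCols pads values internally is not shown here.

```python
def in_range(value, lower, upper, inclusive=True):
    """Compare as strings, the way a lexicographic range query does."""
    if inclusive:
        return lower <= value <= upper   # square brackets: [lower TO upper]
    return lower < value < upper         # curly brackets: {lower TO upper}


# Unpadded numbers: "9" sorts after "10", so it falls outside [1 TO 10].
print(in_range("9", "1", "10"))            # False
# Zero-padded to four digits, the comparison matches numeric order.
print(in_range("0009", "0001", "0010"))    # True
# Exclusive bounds: the endpoints themselves do not match.
print(in_range("Aida", "Aida", "Carmen", inclusive=False))  # False
```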

Boosting a Term
Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.
Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for

jakarta apache

and you want the term "jakarta" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:

"jakarta apache"^4 "Apache Lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).

Boolean Operators
Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-" as Boolean operators (Note: Boolean operators must be ALL CAPS).
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exist in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.
To search for documents that contain either "jakarta apache" or just "jakarta" use the query:

"jakarta apache" jakarta

or

"jakarta apache" OR jakarta

AND
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND.
To search for documents that contain "jakarta apache" and "Apache Lucene" use the query:

"jakarta apache" AND "Apache Lucene"

+
The "+" or required operator requires that the term after the "+" symbol exists somewhere in the field of a single document.
To search for documents that must contain "jakarta" and may contain "lucene" use the query:

+jakarta lucene

NOT
The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:

"jakarta apache" NOT "Apache Lucene"

Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:

NOT "jakarta apache"

-
The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:

"jakarta apache" -"Apache Lucene"

Grouping
Lucene supports using parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.
To search for either "jakarta" or "apache" and "website" use the query:

(jakarta OR apache) AND website

This eliminates any confusion and makes sure that website must exist and that either term jakarta or apache may exist.
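The set equivalences stated above (OR as union, AND as intersection, NOT as difference) and the effect of grouping can be checked with a few lines of Python. The postings dictionary is a toy inverted index with made-up document ids, used only for illustration. It says nothing about how Lucene stores or scores matches.

```python
# Toy inverted index: term -> set of document ids (illustrative data only).
postings = {
    "jakarta": {1, 2, 3},
    "apache": {2, 3, 4},
    "website": {3, 4, 5},
}

either = postings["jakarta"] | postings["apache"]    # jakarta OR apache
both = postings["jakarta"] & postings["apache"]      # jakarta AND apache
without = postings["jakarta"] - postings["apache"]   # jakarta NOT apache

# (jakarta OR apache) AND website: the parentheses decide what is intersected.
grouped = (postings["jakarta"] | postings["apache"]) & postings["website"]

print(sorted(either), sorted(both), sorted(without), sorted(grouped))
# [1, 2, 3, 4] [2, 3] [1] [3, 4]
```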

Field Grouping
Lucene supports using parentheses to group multiple clauses to a single field.
To search for a title that contains both the word "return" and the phrase "pink panther" use the query:

title:(+return +"pink panther")

Escaping Special Characters
Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters, use \ before the character. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2
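When the search text comes from user input, it is convenient to apply this escaping programmatically before embedding the text in the text_query argument. The Python sketch below is a minimal illustration. The function name is hypothetical, and treating each & and | character individually (rather than only the && and || pairs) is an assumption made for simplicity.

```python
# Characters from the list above; & and | are escaped one by one.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_text(text):
    """Put a backslash in front of every Lucene query syntax character."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_query_text("(1+1):2"))   # \(1\+1\)\:2
```

Inside a SQL string literal, any single quote in the result still has to be doubled, or better, the whole query text passed as a bind variable.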

sort
The sort string has the syntax sortField1[[:(ASC|DESC)]:[type]], for example revisionDate:DESC:string. ASC or DESC is optional, as is type, which is either string, int or float. Multiple fields can be used for sorting; the sort strings must be separated by a comma, for example revisionDate:DESC:string,title:ASC.
If you do not include the sort argument in the lcontains operator, the Lucene natural order, which is score descending, is used. For any other field, ASC is the default sort order.
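An application that lets users pick sort columns can assemble the sort argument from structured input. The Python sketch below builds such a string. The function name is hypothetical, and requiring a direction whenever a type is given is a conservative reading of the syntax above, not a rule confirmed by this manual.

```python
DIRECTIONS = ("ASC", "DESC")
TYPES = ("string", "int", "float")


def sort_spec(fields):
    """Build a sort string from (name, direction, type) tuples.

    direction and type may be None; a type without a direction is
    rejected here (assumption, see the lead-in).
    """
    parts = []
    for name, direction, field_type in fields:
        if direction is not None and direction not in DIRECTIONS:
            raise ValueError(f"bad direction: {direction}")
        if field_type is not None and field_type not in TYPES:
            raise ValueError(f"bad type: {field_type}")
        if field_type is not None and direction is None:
            raise ValueError("a type requires an explicit direction")
        parts.append(":".join(p for p in (name, direction, field_type) if p))
    return ",".join(parts)


print(sort_spec([("revisionDate", "DESC", "string"), ("title", "ASC", None)]))
# revisionDate:DESC:string,title:ASC
```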

label
A number used in conjunction with the lscore operator to identify which lcontains operator each lscore refers to.

D.2 lscore ancillary operator

Use the LSCORE operator in a SELECT statement to return the score values produced by a LCONTAINS query. The LSCORE operator can be used in a SELECT, ORDER BY, or GROUP BY clause.
Syntax

LSCORE(label NUMBER)

label
Specify a number to identify the score produced by the query. Use this number to identify the LCONTAINS clause which returns this score.

Example

SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1), subject
FROM emails
WHERE lcontains(bodytext,'security',1)>0;

D.3 lhighlight ancillary operator

Use the LHIGHLIGHT operator in a SELECT statement to return a highlighted version of the master column of the index associated with the LCONTAINS query. For now, highlighting is supported only for the master column of the index, and the return value is a VARCHAR2 containing the highlighted text. The VARCHAR2 size limit is rarely a problem, because the highlighted text is usually a small part of the original column text, shown to the user as a preview of the original document.
Syntax

LHIGHLIGHT(label NUMBER):VARCHAR2

label
Specify a number to identify the score produced by the query. Use this number to identify the LCONTAINS clause which returns this score.

Example

SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt, lscore(1) sc, subject
FROM emails
WHERE lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

D.4 phighlight pipeline table function

The PHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2 or CLOB of the input query. Columns not included in the cols argument are not affected and are returned as is.
Syntax

PHIGHLIGHT(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN VARCHAR2) RETURN ANYDATASET

index_name
Specify a Lucene Index to use.

qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the Lucene Domain Index extension for pagination.

cols
A comma-separated list of columns to highlight. Note that column names are upper case unless you use column aliases.

stmt
The SQL text of the query to be executed by the DBMS_SQL package. Remember to double each single quote to represent a SQL single quote inside the string. Columns returned by this query must map to one of the Java types String, BigDecimal, Timestamp, CLOB, TIMESTAMP, TIMESTAMPTZ or TIMESTAMPLTZ. For example, a VARCHAR2(40) column maps to String inside the OJVM, so it can be highlighted or returned by this pipeline table function.

Example

SELECT * FROM TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select lscore(1) sc,e.* from eMails e where lcontains(bodytext,''rownum:[1 TO 10] AND (security OR mysql)'',''subject:ASC'',1)>0'));

D.5 rhighlight pipeline table function

The RHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2 or CLOB of the input query. Columns not included in the cols argument are not affected and are returned as is. This is a variant of PHIGHLIGHT which requires an additional argument (rType) that tells the function which type will be returned. This version is not exposed to SQL injection, and the RDBMS can start several invocations in parallel based on the information in the last argument.
Syntax

RHIGHLIGHT(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET

index_name
Specify a Lucene Index to use.

qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the Lucene Domain Index extension for pagination.

cols
A comma-separated list of columns to highlight. Note that column names are upper case unless you use column aliases.

rType
The collection type returned by the RHIGHLIGHT table function, usually declared as "colType TABLE OF aRowType".

rws
Any SQL query wrapped by the CURSOR function if you are using SQLPlus, for example, or a JDBC ResultSet passed with setObject(n,rs) if you are using a Java application. Columns returned by this query must map to one of the Java types String, BigDecimal, Timestamp, CLOB, TIMESTAMP, TIMESTAMPTZ or TIMESTAMPLTZ. For example, a VARCHAR2(40) column maps to String inside the OJVM, so it can be highlighted or returned by this pipeline table function.

Example

CREATE TYPE EMAILR AS OBJECT(
sc NUMBER,
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB);

CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;

SELECT * FROM TABLE(rhighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'EMAILRSET',
CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e
where lcontains(bodytext,'rownum:[1 TO 10] AND (security OR mysql)','subject:ASC',1)>0)
));

D.6 MoreLike.this function

The MoreLike.this function has two declarations: one takes only an index_name argument and uses the currently connected user as the owner; the other takes an owner, index_name pair so that an index in another database schema can be used.
Syntax

FUNCTION this(index_name IN VARCHAR2,x IN ROWID,f IN NUMBER DEFAULT 1,t IN NUMBER DEFAULT 10,minTermFreq IN NUMBER DEFAULT 2,minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist

index_name
Specify a Lucene Index to use.

x
ROWID used as pivot. It defines which row the text is extracted from; the terms of that text are used for the More Like This Lucene functionality. The DefaultColumn parameter of the index defines the column used to get the text; only columns of type VARCHAR2, CLOB or XMLType are supported.

f,t
From and to pagination information; the default values are 1 and 10.

minTermFreq,minDocFreq
minTermFreq is the frequency below which terms in the source document are ignored; minDocFreq makes the function ignore words that do not occur in at least this many documents. The default values are 2 and 5, respectively.

SYS.odciridlist
An array of ROWIDs which can be wrapped with the pipeline table function ridlist_table to select its values, for example (select * from table(ridlist_table(ridlist))).

FUNCTION this(owner IN VARCHAR2,index_name IN VARCHAR2,x IN ROWID,f IN NUMBER DEFAULT 1,t IN NUMBER DEFAULT 10,minTermFreq IN NUMBER DEFAULT 2,minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist

owner
Database schema owner of the index, for example SCOTT. The previous overloaded definition of this function queries the ALL_INDEXES system view to find the owner of the index name.
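The role of the two frequency thresholds can be pictured as a filter over the terms of the pivot row. The Python sketch below is a simplified illustration of that description only. The function name and the toy data are hypothetical, and Lucene's More Like This implementation also weights and ranks the surviving terms, which is not shown.

```python
from collections import Counter


def interesting_terms(source_tokens, doc_freq, min_term_freq=2, min_doc_freq=5):
    """Keep terms frequent enough in the pivot text and in the whole index."""
    term_freq = Counter(source_tokens)
    return sorted(
        term for term, tf in term_freq.items()
        if tf >= min_term_freq and doc_freq.get(term, 0) >= min_doc_freq
    )


# Illustrative data: tokens of the pivot row and per-term document counts.
tokens = ["index", "index", "lucene", "lucene", "lucene", "oracle"]
doc_freq = {"index": 12, "lucene": 3, "oracle": 40}

print(interesting_terms(tokens, doc_freq))   # ['index']
```

Here "lucene" is dropped because it occurs in fewer than 5 documents, and "oracle" because it occurs only once in the pivot text.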

D.7 lfacets aggregate function

The lfacets() aggregate function has one argument, which is a comma-separated list of the index name and the category and sub category to be queried.
Syntax

FUNCTION lfacets(input IN VARCHAR2) RETURN LUCENE.agg_tbl

input
A comma-separated list including the index name, a category and optionally a sub category. Category and sub category are in Lucene Query Parser syntax, including the indexed column name (Lucene Field), for example text:(Ciencias naturales y formales), line:[1 TO 10] and so on. When both category and sub category are present, the ODCI API starts the computation by calculating the bit set of the main category and then iterates over each sub category, performing a bitwise AND between the two bit sets.

LUCENE.AGG_TBL
A TABLE OF agg_attributes, where AGG_ATTRIBUTES is an object type with two fields, qryText VARCHAR2(4000) and hits NUMBER. When a category and a sub category are passed as argument, the return value is a table in which each row represents the cardinality of the intersection between the category and one sub category, that is, a table with as many rows as there are sub categories.
To help format the output in a traditional query there is a function ljoin(), which receives an agg_tbl type plus a separator as input and returns a string with all the rows. Its syntax is:

FUNCTION ljoin(i_tbl in agg_tbl,i_glue IN VARCHAR2 := ',') RETURN VARCHAR2

i_tbl
A LUCENE.agg_tbl table to scan.

i_glue
A VARCHAR2 string to use as separator, default value ",".
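The facet computation described for the input argument (one bit set for the main category, then a bitwise AND with each sub category and a count of the surviving bits) can be sketched in Python using integers as bit sets. The function names and document ids are hypothetical. Each returned pair mirrors the qryText and hits fields of agg_attributes, but this is not the ODCI implementation.

```python
def bitset(doc_ids):
    """Encode a collection of document ids as an integer bit set."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def facet_counts(category_docs, sub_categories):
    """Return (qryText, hits) for each sub category within the category."""
    main = bitset(category_docs)
    return [
        (qry_text, bin(main & bitset(docs)).count("1"))
        for qry_text, docs in sub_categories
    ]


# Illustrative data: documents matching the category and two sub categories.
category = {0, 1, 2, 3, 5}
subs = [("line:[1 TO 10]", {1, 2, 9}), ("line:[11 TO 20]", {5, 7})]

print(facet_counts(category, subs))
# [('line:[1 TO 10]', 2), ('line:[11 TO 20]', 1)]
```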

D.8 index_terms pipeline table function

The index_terms() pipeline table function returns a list of Lucene term values and their frequency. It has two arguments: the first is the Lucene Domain Index name and the second is a Lucene term name.
Syntax

FUNCTION index_terms(index_name VARCHAR2, term_name VARCHAR2) RETURN LUCENE.term_info_set

index_name
Lucene index name, with the syntax SCHEMA.IDX_NAME, or IDX_NAME if the current user is the owner.

term_name
Lucene term name. If this argument is NULL, the information of all Lucene Index terms is returned.

LUCENE.term_info_set
A TABLE OF term_info, where TERM_INFO is an object type with two fields, term VARCHAR2(4000) and docFreq NUMBER(10). This table can easily be iterated with a traditional SELECT FROM construction, for example:

SQL> select * from (select * from table(index_terms('SOURCE_BIG_LIDX',null)) order by docFreq desc) where rownum<=10;
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241

The natural order returned by index_terms() is ordered by term_name:term_value.

D.9 high_freq_terms pipeline table function

The high_freq_terms() pipeline table function returns the Top-N most frequent Lucene term values and their frequency. It has three arguments: the first is the Lucene Domain Index name, the second is a Lucene term name, and the last is how many Top-N terms should be returned.
Syntax

FUNCTION high_freq_terms(index_name VARCHAR2, term_name VARCHAR2,num_terms NUMBER) RETURN LUCENE.term_info_set

index_name
Lucene index name, with the syntax SCHEMA.IDX_NAME, or IDX_NAME if the current user is the owner.

term_name
Lucene term name. If this argument is NULL, the information of all Lucene Index terms is returned.

num_terms
How many Top-N high frequency terms should be returned.

LUCENE.term_info_set
A TABLE OF term_info, where TERM_INFO is an object type with two fields, term VARCHAR2(4000) and docFreq NUMBER(10). This table can easily be iterated with a traditional SELECT FROM construction, for example:

SQL> select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241

The natural order returned by high_freq_terms() is by docFreq descending. Note that this result is similar to the example of index_terms(), but this function is more efficient.
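One common reason a dedicated Top-N function can beat "select everything, sort, keep the first N" is that it only needs to keep N candidates in a bounded heap while scanning the terms. Whether high_freq_terms() works this way internally is an assumption. The Python sketch below only contrasts the two strategies on the three term frequencies visible in the output above.

```python
import heapq

# (term, docFreq) pairs taken from the example output, in term order.
term_doc_freq = [("TEXT:in", 24952), ("TEXT:return", 6241), ("TEXT:varchar", 16996)]


def top_terms_by_sort(terms, n):
    """Full sort, then slice: analogous to ORDER BY docFreq DESC + rownum."""
    return sorted(terms, key=lambda t: t[1], reverse=True)[:n]


def top_terms_by_heap(terms, n):
    """Single pass keeping only the n best candidates."""
    return heapq.nlargest(n, terms, key=lambda t: t[1])


print(top_terms_by_heap(term_doc_freq, 2))
# [('TEXT:in', 24952), ('TEXT:varchar', 16996)]
```

Both functions return the same rows; the difference is only in how much work is done to find them when the index holds many terms.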

D.10 DidYouMean package

The DidYouMean package provides dictionary creation and querying, bringing the Lucene Did You Mean functionality to Lucene Domain Index. This package has a procedure to create or update the dictionary and a function to query it.

Syntax

PROCEDURE indexDictionary(index_name VARCHAR2, spellColumns VARCHAR2 DEFAULT null, distancealg IN VARCHAR2 DEFAULT 'Levenstein')
PROCEDURE indexDictionary(owner IN VARCHAR2, index_name VARCHAR2, spellColumns VARCHAR2 DEFAULT null, distancealg IN VARCHAR2 DEFAULT 'Levenstein')

owner
Lucene index owner.

index_name
Lucene index name.

spellColumns
Lucene Domain Index columns to be included in the Did You Mean dictionary.

distanceAlg
Distance algorithm used when creating the dictionary. Possible values are Levenstein, NGram or Jaro; the default is Levenstein.

These procedures update the Lucene Domain Index structure, adding a new Field that stores the information required for the Did You Mean functionality.

FUNCTION suggestwords(
owner IN VARCHAR2,
index_name IN VARCHAR2,
cmpval IN VARCHAR2,
highlight IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2

owner
Lucene index owner.

index_name
Lucene index name.

cmpval
String with the values to be replaced by the Did You Mean algorithm.

highlight
Tag used for highlighting if it is not null. For example, i will be used to return the tag <i>text</i>.

distanceAlg
Distance algorithm used when creating the dictionary. Possible values are Levenstein, NGram or Jaro; the default is Levenstein.

This function queries the Lucene Domain Index to compute Did You Mean words for the input string.
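To give an idea of what a distance-based suggestion does, the Python sketch below picks the dictionary word with the smallest Levenshtein (edit) distance to the input and optionally wraps it in a highlight tag. It is a conceptual illustration only. The function names and the word list are hypothetical, and the package relies on the Lucene spell checking code rather than on anything resembling this brute-force scan.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def suggest(word, dictionary, highlight=None):
    """Closest dictionary word, optionally wrapped as <tag>word</tag>."""
    best = min(dictionary, key=lambda w: (levenshtein(word, w), w))
    return f"<{highlight}>{best}</{highlight}>" if highlight else best


print(levenshtein("roam", "foam"))                             # 1
print(suggest("lucen", ["index", "lucene", "oracle"], "i"))    # <i>lucene</i>
```

The same distance underlies the fuzzy search operator described in D.1, where roam~ matches foam and roams, each one edit away.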

E Project Change Log

2.9.2.1.0 Production Release based on Lucene 2.9 (2.9.2) core base

• Added elapsed time information when log level is INFO
• Removed deprecated usage of LUCENE_CURRENT constant
• Fixed facets inconsistency caused by ignoring the internal parameter ColName
• Initial implementation of DidYouMean functionality contributed by Pedro Pinheiro
• Temporary fix until Lucene defines clear semantics for Directory.fileLength (see Lucene issue 2316)

2.9.1.1.0 Production Release based on Lucene 2.9 (2.9.1) core base

• New Lucene Core base libraries
• Full Lucene Test Suites certified
• Fixed bug that enqueued more rowids than required when using OnLine mode with the ExtraTabs and WhereCondition parameters
• Fixed operator priority when WhereCondition has an OR operator
• DefaultUserDataStore now uses an array of cached fields to improve performance
• Spanish Analyzer uses the latest ASCIIFoldingFilter
• high_freq_terms(idx_name,term,max_num_term) pipeline table function was added to return high-frequency terms and the associated docFreq value
• index_terms(idx_name,term) pipeline table function was added to return a list of terms and their associated frequency
• DefaultUserDataStore now supports the ANALYZED, ANALYZED_WITH_VECTORS, ANALYZED_WITH_OFFSETS, ANALYZED_WITH_POSITIONS and ANALYZED_WITH_POSITIONS_OFFSETS Lucene Field option values

• OJVMLock was replaced by SingleInstanceLockFactory for per-instance locking; cross-session locking is implemented with select for update

• An automatic upgrade from 2.9.0 is possible without index deletion or rebuild; you have to execute:

ant upgrade-domain-index
ant ncomp-lucene-ojvm (10g only)
ant jit-lucene-classes (11g only)

2.9.0.1.0 Production release based on Lucene 2.9.0 core base, 29/Sep/09

• Tested with Oracle 11gR2, 11gR1 and 10.2 databases
• DefaultUserDataStore does SAX parsing to get text nodes and attributes from an XMLType value.
• A SimpleLRUCache is used to load rowids and their associated Lucene doc ids; this reduces memory consumption when querying very big tables. A new parameter, CachedRowIdSize (default 10000), has been added to control the size of the LRU cache.


• Lucene Domain Index core was updated to use TopFieldCollector and to avoid computation time when lscore() is not used.

• Two new parameters have been added for querying: NormalizeScore, which controls when to track the max score, and PreserveDocIdOrder. Both are a consequence of the new Lucene Collector API and boost query performance.

• A table alias L$MT is defined for the master table associated with the index; use it in complex queries to relate columns from the master table to columns from dependent tables
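The rowid cache mentioned in the list above bounds memory by evicting the least recently used entries once a fixed size is reached. The following Java sketch shows that idea with java.util.LinkedHashMap; it is not Lucene's SimpleLRUCache nor the project's code. The class name RowIdCacheSketch and the sample rowid strings are hypothetical, and the capacity argument plays the role that CachedRowIdSize plays for the index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Maps a rowid string to a Lucene doc id, keeping at most `capacity` entries.
final class RowIdCacheSketch extends LinkedHashMap<String, Integer> {
    private final int capacity;

    RowIdCacheSketch(int capacity) {
        // accessOrder = true: iteration order goes from least to most recently used
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
        // Called by put(); returning true drops the least recently used entry.
        return size() > capacity;
    }

    public static void main(String[] args) {
        RowIdCacheSketch cache = new RowIdCacheSketch(2);
        cache.put("AAAR3sAAEAAAACXAAA", 0);
        cache.put("AAAR3sAAEAAAACXAAB", 1);
        cache.get("AAAR3sAAEAAAACXAAA");    // touch the first rowid
        cache.put("AAAR3sAAEAAAACXAAC", 2); // evicts the second rowid
        System.out.println(cache.keySet());
    }
}
```

With a bounded cache, memory use no longer grows with the number of rows in the table, at the cost of reloading entries that were evicted.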

2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09)

• Do not store internal parameters in the system views, and force PopulateIndex:false
• After every sync, files marked as deleted are now purged to free BLOB storage
• Added lfacets aggregate function for computing facets
• CountHits function no longer requires the sort argument
• Filters are stored/retrieved using only the QueryParser.toString() key
• UN_TOKENIZED format string in the DefaultUserDataStore class was replaced by NOT_ANALYZED or NOT_ANALYZED_STORED according to the new Lucene definitions.
• Fixed bug when sync tries to process more than 32767 enqueued rowids.
• Added parameters for highlighting functions: Formatter, MaxNumFragmentsRequired, FragmentSeparator and FragmentSize.
• Added PerFieldAnalyzer parameter to use an independent Analyzer for each column.
• Added sample of a custom Formatter, org.apache.lucene.search.highlight.MyHTMLFormatter

2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09)

• Fixed compatibility problem between the 10g and 11g SQL Date representation in the pipeline table function.

2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09)

• Added Rhighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET pipeline table function
• Added Phighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN VARCHAR2) RETURN ANYDATASET pipeline table function
• Added lhighlight(NUMBER):VARCHAR2 ancillary operator
• Removed usage of deprecated Lucene API (Hits and IndexWriter for example)
• Usage of the FIRST_ROWS optimizer hint to decide how many rows to load the first time
• sync, optimize and rebuild interfaces now use index_name or [owner,index_name] arguments
• A better build system to build Lucene Domain Index from sources
• More tests
• Tested against 11.1.0.7 and 10.2.0.3
• See the online docs for usage of FIRST_ROWS and the lhighlight() operator

2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)

• Added parameter for CLOB encoding
• More Like This function
• NGram analyzer
• EnglishWikipediaAnalyzer
• DataStore interface includes an API for setting the current connection
• The analyzers, queries, snowball and WikiPedia contrib packages are now required


2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08)

• Compiled against Lucene 2.3.2 production release
• Used latest API for merging based on RAM usage
• Use Writer for deleting during Sync
• Confirmed the 4x improvement during indexing reported by the Lucene dev group
• Fixed workaround which changes the order of the rowids in ODCIRidList
• Added a Spanish WikiPedia Analyzer for testing
• Reports IOException instead of RuntimeException to signal EOF or File Not Found
• Decoupled Flush functionality from TableIndexer

2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08)

• Added rowid to Lucene doc id caching.
• Usage of LoadFirstFieldSelector during Document loading to load only the rowid field.
• Added a test suite which indexes a Wikipedia dump inside the OJVM.

2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07)

• DefaultUserDataStore requires usage of the XPath text() expression to get only the textual value
• Added logging of the SQL being executed by the table indexer
• Changed document logging to FINER level
• More pre-defined mappings in DefaultUserDataStore for the NUMBER, BINARY_FLOAT, BINARY_DOUBLE, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Oracle types.
• New parameter PopulateIndex:[true|false] to choose whether or not to populate the Lucene Index at creation time.
• New parameter IncludeMasterColumn:[true|false] to choose whether or not to index the master column; useful with Virtual Columns and XMLType.
• New parameter BatchCount:integer to choose how many rows are enqueued for indexing when using create ... index ... parameters('SyncMode:OnLine');
• Creating an index with SyncMode:OnLine causes Lucene Domain Index to enqueue batches of "BatchCount" rows for indexing by an AQ PL/SQL callback in the background. Lucene Domain Index is immediately ready for querying after create.

• Batch rowid indexing is done using a pipeline function.

2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07)

Binary download (see package ojvm):
http://sourceforge.net/project/showfiles.php?group_id=56183
CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm

• Sort by column passed using the lcontains(col,query_parser_str,sort_str,corr_id) syntax
• Logging support using the Java Util Logging package
• JUnit test suites emulating a middle tier environment
• Support for rebuild and optimize online for SyncMode:OnLine indexes
• XMLDB Export
• AutoTuneMemory parameter replacing the MaxBufferedDocs parameter
• Functional column support

2.2.0.1.1 (second release, 27/Sep/07 05:39 AM)

Binary download:
https://issues.apache.org/jira/secure/attachment/12366661/ojvm-09-27-07.tar.gz


CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm

• LuceneDomainIndex.countHits() function to replace the select count from .. where lcontains(..)>0 syntax.

• Support for inline pagination with the lcontains(col,'rownum:[n TO m] AND ...') function
• Rounding and padding support for date, timestamp, number, float, varchar2 and char columns
• ODCI API array DML support
• BLOB parameter support

2.2.0.1.0 (first release synchronized with lucene 2.2.0, 14/Sep/07 06:44AM)

CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm

• Synchronized with the latest Lucene 2.2.0 production release
• Replaced in-memory storage using a Vector-based implementation with direct BLOB IO, reducing memory usage for large indexes.
• Support for user data stores: you are no longer limited to indexing one column at a time (a limitation of the Data Cartridge API on 10g); you can now index multiple columns of the base table and columns of related tables joined together.

• User Data Stores can be customized by the user: by writing a simple Java class, users can control which columns are indexed, the padding used, or any other functionality prior to the document-adding step.

• There is a DefaultUserDataStore which gets all columns of the query and builds a Lucene Document with Fields representing each database column; these fields are automatically padded if they hold NUMBER data or rounded if they hold DATE data, for example.

• The lcontains() SQL operator supports the full Lucene QueryParser syntax to provide access to all indexed columns; see examples below.

• Support for the DOMAIN_INDEX_SORT and FIRST_ROWS hints: if you want rows ordered by the lscore() operator (ascending or descending), the optimizer hint will assume that Lucene Domain Index returns rowids in the proper order, avoiding an inline view to sort them.

• Automatic index synchronization using AQ callbacks.
• Lucene Domain Index creates an extra table named IndexName$T and an Oracle AQ named IndexName$Q with its storage table IndexName$QT in the user's schema, so you can alter the storage preferences if you want.

• The ojvm project is in SourceForge.net CVS, so anybody can get it and collaborate
• Tested against 10gR2 and 11gR1 databases.
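The automatic padding of NUMBER data mentioned in the list above matters because terms are compared as strings, so an unpadded "9" sorts after "10". The following Java sketch shows the effect with left zero-padding; the width of 10 and the class name PaddingSketch are assumptions for illustration and do not describe the exact format applied by DefaultUserDataStore.

```java
import java.util.Locale;

final class PaddingSketch {

    // Left-pads a non-negative number with zeros to a fixed width so that
    // string order and numeric order agree.
    public static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled by this sketch");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10", which breaks range queries.
        System.out.println("9".compareTo("10") > 0);
        // Padded: 0000000009 sorts before 0000000010.
        System.out.println(pad(9, 10).compareTo(pad(10, 10)) < 0);
    }
}
```

Both lines print true. A range query over a padded field then needs its bounds padded to the same width.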

2.0.0.1.3 (third release, 09/Jan/07 11:40 AM)

https://issues.apache.org/jira/secure/attachment/12348574/ojvm-01-09-07.tar.gz
• The Data Cartridge API is used without column data to reduce the data stored in the queue of changes and speed up the synchronize method.
• Query hits are cached, associated with the index search and the string returned by the QueryParser.toString() method.
• If no ancillary operator is used in the select, the score list is not stored.
• The "Stemmer" argument is recognized as a parameter giving the argument for the SnowBall analyzer, for example:


create index it1 on t1(f2) indextype is lucene.LuceneIndex parameters('Stemmer:English');

• Before installing the ojvm extension, it is necessary to execute "ant jar-core" in the snowball directory.

• IndexWriter.setUseCompoundFile(false) is called to use multi-file storage (faster than the compound file) because there is no file descriptor limitation inside the OJVM; BLOBs are used instead of files.

• Files are marked for deletion and purged when calling the Sync or Optimize methods.
• BLOBs are created and populated in one call using Oracle SQL RETURNING information.
• A testing script using the OE sample schema, with query comparisons against an Oracle Text ctxsys.context index.

2.0.0.1.2 (second release, 20/Dec/06 02:03 PM)

https://issues.apache.org/jira/secure/attachment/12347614/ojvm-12-20-06.tar.gz
This new release of the OJVMDirectory Lucene Store includes a fully functional Oracle Domain Index with a queue for massive update/insert operations and many performance improvements.

2.0.0.1.1 (first release, 28/Nov/06 01:04 PM)

https://issues.apache.org/jira/secure/attachment/12345967/ojvm-11-28-06.tar.gz
• The complete API for the Oracle Domain Index was implemented, but the solution for using the contains operator outside the where clause is not good.
• I will implement a singleton solution for the OJVMDirectory object when it is used in read-only mode, typically when a user performs select operations against tables which have columns indexed with Lucene. This implementation will greatly increase the final performance because the index reader will be ready for each select operation. Obviously I will check whether another user or thread performs a write operation on the index, in order to reload the read-only singleton.

• The queue for storing the changes on the index is not implemented yet; I'll add it shortly.

2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM)

https://issues.apache.org/jira/secure/attachment/12345516/ojvm.tar.gz