© 2010-2012, ISYS® Search Software Inc. www.isys-search.com
2012
Derek Murphy
ISYS Search Software
5/3/2012
ISYS Document Filters 10.2 for Unstructured Analytics in Sybase IQ 15.3/15.4
I S Y S S e a r c h S o f t w a r e
© 2012, ISYS® Search Software Inc. P a g e | 2
Introduction
ISYS Document Filters for Unstructured Analytics (UDA) is a set of components that may be added to a
Sybase IQ installation, provided the appropriate Unstructured Data Analytics (UDA) license has been
obtained from Sybase. ISYS Document Filters for UDA enables a user to create text indices on
unstructured data stored in a Sybase IQ database. The text indices then make it possible to search, via SQL,
for unstructured data that contains specific text terms.
ISYS Document Filters for UDA supports common document formats such
as Microsoft Word (DOC, DOCX), Excel (XLS, XLSX), PowerPoint (PPT, PPTX), Adobe PDF, WordPerfect,
Rich Text Format (RTF), Open Document Format (ODF), HTML and many others. For a full list of
supported document types, refer to "Supported document formats" in the Reference section.
Once installed, ISYS Document Filters for UDA may be invoked during database indexing operations:
during the initial load into Sybase IQ, on a pre-populated, non-indexed column, or in real time as a
table is being modified with insert/update/delete statements. Document text and metadata are
extracted with full Unicode support, allowing document content written in any language to be stored in
the index.
More information about Unstructured Data Analytics in Sybase IQ can be found at:
http://sybooks.sybase.com/nav/summary.do?prod=9787
ISYS Document Filters for UDA is based on the award-winning ISYS Document Filters suite. More
information can be found at http://www.isys-search.com/products/document-filters
System Requirements
Supported Operating Systems (* supported for testing only, not suitable for production):
o Windows Server 2008 (32bit & 64bit)
o Windows Server 2003 (32bit & 64bit)
o Windows 7 (32bit & 64bit)*
o Windows Vista (32bit & 64bit)*
o Windows XP SP2 (32bit & 64bit)*
o Red Hat Enterprise Linux Server 5+ (x86_64)
o SUSE Linux Enterprise Server 11+ (x86_64)
o Sun Solaris 9/10 (x86_64 / SPARC64)
o Oracle Solaris 11 (x86_64 / SPARC64)
o HP-UX 11.x (Itanium 64)
o AIX 6.x (POWER 64)
Minimum hard disk space: 100 MB
Required RAM: 512 MB, 1 GB recommended
Local administrative rights for installation
Installation
1. Install Sybase IQ 15.3/15.4 with the Unstructured Data Analytics (UDA) option
2. Download and unzip the ISYS Document Filters for UDA package to a temporary location
3. Navigate to the temporary location and run:
   a. Windows: “install_windows.bat”
   b. Linux/UNIX: “sh ./install_unix.sh”
4. The contents of the temporary folder will be copied to a subfolder of the location specified by
the "IQDIR15" environment variable (created during the Sybase IQ installation)
5. The package will be installed to:
   a. Windows 32 bit: "%IQDIR15%\bin32\isys_prefilter"
   b. Windows 64 bit: "%IQDIR15%\bin64\isys_prefilter"
   c. Linux/UNIX 64 bit: "$IQDIR15/lib64/isys_prefilter"
6. Post installation note:
   a. On Windows, you will need to add the fully-qualified "isys_prefilter" and "isys_prefilter\isys_doc_filters" folders to the Windows %PATH% environment variable before running Sybase IQ.
   b. On Linux & HP-UX, you will need to add the fully-qualified "isys_prefilter" and "isys_prefilter/isys_doc_filters" folders to the LD_LIBRARY_PATH environment variable before running Sybase IQ.
   c. On Solaris, you will need to add the fully-qualified "isys_prefilter" and "isys_prefilter/isys_doc_filters" folders to the LD_LIBRARY_PATH and/or LD_LIBRARY_PATH_64 environment variables before running Sybase IQ.
   d. On AIX, you will need to add the fully-qualified "isys_prefilter" and "isys_prefilter/isys_doc_filters" folders to the LD_LIBRARY_PATH and/or LIBPATH environment variables before running Sybase IQ.
7. The installation is now complete and you may delete the temporary folder.
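The folder locations and environment variables from steps 5 and 6 can be summarised in a small lookup. The following is a minimal Python sketch, not part of the ISYS package: the function names, the platform keys and the sample install root are assumptions made for illustration, while the subfolder and variable names are the ones listed above.

```python
import ntpath
import posixpath

# Install subfolder per platform family and bitness (installation step 5).
INSTALL_SUBDIR = {
    ("windows", 32): "bin32",
    ("windows", 64): "bin64",
    ("unix", 64): "lib64",
}

# Environment variables named in the post-installation notes (step 6).
SEARCH_PATH_VARS = {
    "windows": ["PATH"],
    "linux": ["LD_LIBRARY_PATH"],
    "hp-ux": ["LD_LIBRARY_PATH"],
    "solaris": ["LD_LIBRARY_PATH", "LD_LIBRARY_PATH_64"],
    "aix": ["LD_LIBRARY_PATH", "LIBPATH"],
}


def prefilter_folders(iqdir, platform, bits=64):
    """Return the two folders that must be on the library search path.

    iqdir is the value of the IQDIR15 environment variable; platform is one
    of the keys of SEARCH_PATH_VARS (hypothetical names chosen for this sketch).
    """
    family = "windows" if platform == "windows" else "unix"
    path = ntpath if family == "windows" else posixpath
    root = path.join(iqdir, INSTALL_SUBDIR[(family, bits)], "isys_prefilter")
    return [root, path.join(root, "isys_doc_filters")]


def search_path_vars(platform):
    """Return the environment variable names to extend on this platform."""
    return SEARCH_PATH_VARS[platform]
```

For example, prefilter_folders("/opt/sybase/IQ-15_4", "linux") returns the isys_prefilter folder and its isys_doc_filters subfolder under lib64, and search_path_vars("aix") returns LD_LIBRARY_PATH and LIBPATH. The install root shown is only a placeholder.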
Example Usage
To perform a database load of file-system documents, create two text files, load.sql and load.inp, with the
following contents:
Sample SQL file (load.sql)
--create the text configuration object using the ISYS prefilter library
create text configuration prefilter_cfg from default_char;
alter text configuration prefilter_cfg prefilter external name 'isys_prefilter_func@isys_prefilter';
--create the table
create table documents(id int, document long binary);
--create the text index using the text configuration object that uses the ISYS prefilter
create text index documents_index on documents(document) configuration prefilter_cfg immediate refresh;
--load the table using an input file that contains 3 documents (below)
load table documents(id, document binary file(','))
from '[FOLDER_NAME]/load.inp' quotes off escapes off delimited by ',';
commit;
--check to see how many terms were extracted
call sa_text_index_vocab('documents_index', 'documents', 'dba');
Sample INP file (load.inp)
1,[FOLDER_NAME]/document1.doc,
2,[FOLDER_NAME]/document2.doc,
3,[FOLDER_NAME]/document3.doc,
Where [FOLDER_NAME] is the fully-qualified path of the folder containing the files referenced above.
To run the document load, start DBISQL and connect to a database. Once connected, open the load.sql
file and press F5 to execute the SQL.
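For larger document sets, the load.inp file can be generated rather than typed by hand. The following is a minimal Python sketch that produces rows in the "id,path," layout of the sample above. The function name and the sequential numbering scheme are assumptions for illustration; the sketch also rejects names containing a comma, because the sample load statement uses a comma as its delimiter.

```python
from pathlib import PurePosixPath


def build_load_inp(folder, filenames, start_id=1):
    """Return load.inp text with one 'id,path,' row per document.

    folder is the fully-qualified folder path ([FOLDER_NAME] in the sample);
    filenames are the document names inside that folder.
    """
    rows = []
    for doc_id, name in enumerate(filenames, start=start_id):
        full_path = str(PurePosixPath(folder) / name)
        if "," in full_path:
            # A comma would be read as a column delimiter by the load statement.
            raise ValueError(f"path contains the delimiter ',': {full_path}")
        rows.append(f"{doc_id},{full_path},")
    return "\n".join(rows) + "\n"
```

Writing the returned text to load.inp and pointing the load statement at that file reproduces the sample layout. PurePosixPath joins with forward slashes, as in the sample; this is a choice made for the sketch.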
Configuration
ISYS Document Filters for UDA has several configuration options that control how documents are to be processed. All configuration settings are stored in the isys_prefilter/isys_prefilter.ini file:

License=[xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx]
The ISYS Document Filters license to use. May be a time-limited evaluation or non-expiring purchased license key.

MaxInMemDocumentSize=[Size in MB] (Integer: min=4, max=512, default=64)
Documents smaller than this size are retrieved from the database in-memory. Documents larger than this size are written temporarily to disk to conserve system memory. Please note that ISYS Document Filters 10.2 for UDA will copy some documents to the TMP/TEMP location during processing and delete them afterward during the normal course of operations.

DefaultInputCodePage=[CodePage] (Integer: default=1252)
Specifies the default character set to assume for input documents that do not specify a character set (e.g. text files, HTML files with no charset directive). Code page numbers are used by ISYS Document Filters for all platforms to specify character sets. A list of valid code page numbers can be found at: http://msdn.microsoft.com/en-us/goglobal/bb964654
Note: Does not apply to document formats that specify a character set (e.g. MS Word, Adobe PDF, etc).

OpenDocumentFlags=[0,1,2] (Integer: default=2)
Controls how much of each document to process:
0 = Document body text only
1 = Document metadata only
2 = Document metadata and body text

OpenDocumentOptions=[]
Reserved for future use.

OutOfProcessMode=[0,1] (Integer: default=0) (EXPERIMENTAL)
Controls the isolation mode of text extraction:
0 = Document text is extracted inside the database process (fastest)
1 = Document text is extracted outside of the database process (safest)

OutOfProcessTimeout=[Timeout in seconds] (Integer: min=30, max=900, default=120)
When out-of-process mode is enabled, controls the interval before ISYS Document Filters for UDA considers the worker process to have stopped responding.

Language=[xxx] (default=English)
All localized strings are stored in the isys_prefilter_strings.ini file. Language refers to the .ini file section to be used for error and logging strings.
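The documented ranges can be checked before the database server is started. The following is a minimal Python sketch, not a tool shipped with ISYS Document Filters: the function names are hypothetical, and the parser assumes plain key=value lines, skipping blank lines, lines starting with ";" or "#", and [section] headers, since the exact layout of isys_prefilter.ini is not described here. Only the bounds and allowed values stated above are enforced.

```python
# Integer settings with the documented (min, max) bounds.
INT_RANGES = {
    "MaxInMemDocumentSize": (4, 512),
    "OutOfProcessTimeout": (30, 900),
}

# Integer settings with a documented set of allowed values.
ALLOWED_VALUES = {
    "OpenDocumentFlags": {0, 1, 2},
    "OutOfProcessMode": {0, 1},
}


def parse_settings(text):
    """Parse key=value lines into a dict of strings (assumed file layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith((";", "#", "[")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_settings(settings):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key, value in settings.items():
        if key not in INT_RANGES and key not in ALLOWED_VALUES:
            continue  # License, Language and others are not checked here
        try:
            number = int(value)
        except ValueError:
            problems.append(f"{key}: '{value}' is not an integer")
            continue
        if key in INT_RANGES:
            low, high = INT_RANGES[key]
            if not low <= number <= high:
                problems.append(f"{key}: {number} is outside {low}-{high}")
        elif number not in ALLOWED_VALUES[key]:
            allowed = sorted(ALLOWED_VALUES[key])
            problems.append(f"{key}: {number} is not one of {allowed}")
    return problems
```

Reading the file contents and passing them through parse_settings and then check_settings yields one line per out-of-range value. Settings that are absent are not reported, because each has a documented default.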
Troubleshooting
Logging
ISYS Document Filters for UDA logs all its activity to "[DatabaseName].iqmsg" in the database folder. All
ISYS messages begin with the "PF_LOG_PREFIX_FORMAT" resource string defined in the
isys_prefilter_strings.ini file. By default, this prefix is set to "[ISYS] [%d]" where %d is the current
database server thread ID.
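Because every ISYS message carries the same prefix, the ISYS entries can be separated from the rest of the .iqmsg file with a simple pattern match. The following is a minimal Python sketch assuming the default "[ISYS] [%d]" prefix. The function name is hypothetical, and the pattern searches anywhere in the line because the exact layout of a .iqmsg line (for example any leading timestamp) is not described here.

```python
import re

# Default prefix "[ISYS] [%d]", where %d is the database server thread ID.
ISYS_PREFIX = re.compile(r"\[ISYS\] \[(\d+)\]\s*(.*)")


def isys_messages(lines):
    """Yield (thread_id, message) for each line that carries the ISYS prefix."""
    for line in lines:
        match = ISYS_PREFIX.search(line)
        if match:
            yield int(match.group(1)), match.group(2).rstrip()
```

Passing an open file object for "[DatabaseName].iqmsg" to isys_messages yields the thread ID and message text of each ISYS entry in order. If PF_LOG_PREFIX_FORMAT has been changed in isys_prefilter_strings.ini, the pattern needs to be adjusted to match.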
Initialization
If the ISYS Document Filters fail to load or initialize, please make sure the following are true:
o Windows: The isys_prefilter & isys_prefilter\isys_doc_filters folders are in the system PATH
o Linux/UNIX: The isys_prefilter & isys_prefilter/isys_doc_filters folders are specified in the LD_LIBRARY_PATH / LD_LIBRARY_PATH_64 / LIBPATH environment variables
o There is a valid ISYS Document Filters license in the isys_prefilter.ini file
Out of process mode
ISYS Document Filters for UDA contains an experimental high isolation mode of operation known as
"out of process mode". When enabled, document text is extracted in a separate process from the
database server. If there is a hang or crash when processing a document, the database process will
remain unaffected. This mode trades speed for safety. The default in-process mode (OutOfProcessMode=0)
is suitable for almost all applications. Out-of-process mode (OutOfProcessMode=1) is slower but safer, and is
intended for cases where corrupted documents are expected and the Sybase IQ server must remain highly
available during indexing operations.
ISYS console application
ISYS Document Filters for UDA also contains a console application (isys_prefilter_console[.exe|.sh]) that
may be used from the command line to process a single document and display its metadata and text
content to the console. This tool can be used to verify or validate the extracted text from a particular
document or diagnose a troublesome document. Run isys_prefilter_console[.exe|.sh] without any
parameters to see usage information.
Reference
Supported document formats
[1] File ID Only
[2] File ID Only on HP-UX, AIX
[3] ID, text and metadata only (No Hi-Def) on HP-UX, AIX, Solaris SPARC
+ Newly supported in ISYS Document Filters 10.0
X Formats enhanced to include HiDef HTML conversion (not applicable to UDA)
Archive
| Document Format | Version | Extension | 10.0 | HiDef |
| 7-Zip | | .7Z | | |
| ACE [1] | | .ACE | + | |
| Apple Disk Image | | .DMG | + | |
| ARJ | | .ARJ | + | |
| Bzip2 | | .BZ2, TBZ2 | + | |
| ISO Disk Image | | .ISO | + | |
| Java Archive | | .JAR | | |
| LZH [1] | | .LZH | + | |
| Microsoft Cabinet | | .CAB | | |
| Microsoft Office Binder | | .OBD | + | |
| RedHat Package Manager | | .RPM | + | |
| Roshal Archive | 1.5, 2.0, 2.9 | .RAR | | |
| Self-extracting .exe | | .EXE | | |
| StuffIt [1] | | .SIT | + | |
| StuffIt Self Extracting Archive [1] | | .SEA, .EXE | + | |
| StuffIt X [1] | | .SITX | + | |
| GNU Zip | 0.1, 1.0 | .GZ | | |
| UNIX cpio | | .CPIO | + | |
| UNIX Tar | | .TAR | | |
| Zip | PKZip, WinZip | .ZIP | | |
Database
| Document Format | Version | Extension | 10.0 | HiDef |
| dBase file | 3,4 | .DBF | | |
| dBASE III file | 3,4 | .DB, .DB3 | | |
| Microsoft Access file [1] | 01/01/10 | .MDB | | |
| Paradox Database File | | .DB | + | |
Email and Messaging
| Document Format | Version | Extension | 10.0 | HiDef |
| Encoded mail message | MHT | .MHT | | |
| Encoded mail message | Multipart Alternative | | | |
| Encoded mail message | Multipart Digest | | | |
| Encoded mail message | Multipart Mixed | | | |
| Encoded mail message | Multipart News Group | | | |
| Encoded mail message | Multipart Signed | | | |
| Encoded mail message | TNEF | | | |
| Eudora | Classic (1-7), OSE | .MBX | | |
| Microsoft Outlook [3] | 97-2007 | .MSG | | X |
| Microsoft Outlook Express [3] | | .EML | | X |
| Microsoft Outlook Forms Template | | .OFT | | |
| Microsoft Outlook | 97-2007 | .PST | | |
| Sendmail "mbox" | | .MBOX | | |
| Thunderbird | 1, 1.5, 2.x, 3.x | .MBOX | | |
Multimedia
| Document Format | Version | Extension | 10.0 | HiDef |
| 3GP [1] | | .3GP | + | |
| Adobe Flash | | .SWF | | |
| Adobe Flash Video [1] | | .FLV | + | |
| Audio Video Interleave (AVI) [2] | | .AVI | | |
| DVD Information File [1] | | .IFO, .BUP | + | |
| DVD Video Object [2] | | .VOB | + | |
| Microsoft Windows Movie Maker [1] | | .MSWMM | + | |
| Musical Instrument Digital Interface (MIDI) [1] | Standard | .MID, .MIDI, .SMF | | |
| MPEG Video [2] | | .MPG | + | |
| MPEG-1 Audio Layer 3 | ID3v1, ID3v2 | .MP3 | | |
| MPEG-4 Video [2] | | .MP4 | + | |
| MPEG-2 Audio Layer 3 | ID3v1, ID3v2 | .MP3 | | |
| OGG FLAC Audio [2] | | .FLAC | + | |
| OGG Vorbis Audio [2] | | .OGG | + | |
| QuickTime [1] | 1.x-X | .MOV | | |
| Real Media [2] | | .RM | + | |
| Waveform Audio File Format (WAVE) [2] | | .WAV, .AIFF | | |
| Windows Media Audio | WMT 4.0, WMA 2, 7, 8, 9 | .WMA | | |
| Windows Media Video | WMV 7, 9 | .WMV | | |
Other
| Document Format | Version | Extension | 10.0 | HiDef |
| Apple Executable [1] | | .BIN | + | |
| BIN HEX Encoded [1] | | .HBX, .HEX, .HQX | | |
| BitTorrent Metafile [1] | | .TORRENT | + | |
| Linux Executable and Linkable Format [1] | | .ELF | | |
| Log File | | .LOG | | |
| Microsoft Project | 98-2003 | .MPP | | |
| Microsoft Project | 2007 | .MPP, .MPX | | |
| Microsoft Windows DLL [1] | | .DLL | | |
| Microsoft Windows Executable [1] | | .EXE, .COM, .SYS | | |
| Microsoft Windows Installer [1] | | .MSI | + | |
| Microsoft Windows Shortcut [1] | | .LNK | + | |
| Open Access II (OAII) | 01/02/11 | | | |
| vCard | 2.1 | .VCF | | |
| Uniplex | | | | |
Presentation
| Document Format | Version | Extension | 10.0 | HiDef |
| IBM Lotus Symphony Presentation [3] | 1.x, 3.x | .SXI, .ODP | | X |
| LibreOffice Presentation [3] | Beta 3 | .ODS | | X |
| Microsoft PowerPoint for Windows [3] | 3.0-2007, 2010 | .PPT, .PPTX | | X |
| Microsoft PowerPoint for Mac [3] | 1-4, 98, 2001, v. X, 2004, 2008, 2011 | .PPT, .PPTX | | X |
| OpenOffice Impress [3] | 1.x, 2.x, 3.x | .ODP | | X |
| StarOffice Impress [3] | 8, 9 | .SXI, .SDI, .SDP | | X |
Raster Image
| Document Format | Version | Extension | 10.0 | HiDef |
| Encapsulated PostScript 1 | | .EPS | | |
| Graphics Interchange Format (GIF) [1] | 87a, 89a, Animated | .GFA, .GIF, .GIFF | | |
| Joint Photographic Experts Group (JPEG) | | .JPEG, .JPG, .JPE, .JIF | | |
| Microsoft Document Imaging | | .MDI | | |
| Microsoft Windows Bitmap [1] | | .BMP | | |
| PCX [1] | | .PCX | | |
| Portable Network Graphic (PNG) [1] | 1.0, 1.1, 1.2 | .PNG | | |
| Progressive JPEG | | .JPEG, .JPG | | |
| Tagged Image Format File (TIFF) | Revision 3.0-5.0 | .TIF, .TIFF | | |
Spreadsheet
| Document Format | Version | Extension | 10.0 | HiDef |
| Comma Separated Values | | .CSV | | |
| Framework Spreadsheet | III | .FW3 | | |
| IBM Lotus Symphony Spreadsheet [3] | 1.x, 3.x | .SXS, .SX, .ODS | | X |
| LibreOffice Spreadsheet [3] | Beta 3 | .ODS | | X |
| Lotus 1-2-3 | Through Millennium 9.6 | .WK, .WKS, .WK3, .WK4 | | |
| Microsoft Excel for Windows [3] | 2.0 - 2010 | .XLS, .XLSX | | X |
| Microsoft Excel for Windows [3] | 2007 – 2010 (Binary) | .XLSB | | X |
| Microsoft Excel for Mac [3] | 1, 1.5, 2.2, 3.0, 4.0, 5.0, 8.0-14.0 | .XLS, .XLSX | | X |
| Microsoft Works SS for DOS | 2 | .WPS | | |
| Microsoft Works SS for Windows | 3, 4, 6, 7 | .WPS | | |
| OpenOffice Calc [3] | 1.1-2.0 | .ODS | | X |
| StarOffice Calc [3] | 8, 9 | .SXC, .SXS, .ODS | | X |
Text and Markup
| Document Format | Version | Extension | 10.0 | HiDef |
| ASCII Text | 7-bit, 8-bit | .TXT | | |
| ANSI Text | 7-bit, 8-bit | .TXT | | |
| HTML (Text Only) | 2.x, 3.x, 4.x | .HTM, .HTML | | |
| HTML (Codes Revealed) | 2.x, 3.x, 4.x | .HTM, .HTML | | |
| HTML (Metadata Only) | 2.x, 3.x, 4.x | .HTM, .HTML | | |
| IBM DCA | | .RFT, .TXT, .DCA | | |
| Microsoft HTML Help | 1.0, 1.1a, 1.3, 1.32, 1.33MAML | .CHM | | |
| Microsoft OneNote | 2007, 2010 | .ONE | + | |
| Rich Text Format [3] | 1.0, 1.3, 1.5, 1.6, 1.7, 1.8, 1.9.1 | .RTF | | X |
| SGML Text | | .SGML | | |
| Source | | | | |
| Transcript | | | | |
| Unicode | UTF8 | | | |
| Unicode | UTF16 (big e & little e) | | | |
| Unicode | UCS2 (big e & little e) | | | |
| XML Document File | | .XML | | |
| XML Record View | | .XML | | |
| Windows Enhanced Meta File [1] | | .EMF | | |
| Windows Meta File [1] | | .WMF | | |
Vector Image
| Document Format | Version | Extension | 10.0 | HiDef |
| Adobe Illustrator | | .AI | + | |
| Adobe InDesign | 1.x-7.x | .INDD | | |
| Adobe Photoshop | 8.x, 9.x, 10.0 (CS 1-3) | .PSD | | |
| AutoCAD Drawing [2] | 12, 13, 14, 2000, 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2010 | .DWG | | X |
| AutoCAD Drawing Exchange Format [2] | | .DXF | + | |
| Corel Draw Image [1] | | .CDR | + | |
| Intergraph-Microstation CAD [2] | | .DGN | | X |
| MathCAD [1] | | .MCD, .XMCD | + | |
| Microsoft XPS | | .XPS, .OXPS | | |
| Microsoft Visio [3] | | .VSD | | X |
Word Processing and General Office
| Document Format | Version | Extension | 10.0 | HiDef |
| Adobe PDF | 1.0 – 1.7 (Extension 3, 5)(Acrobat 1 - 9) | .PDF | | X |
| Adobe PostScript [1] | | .PS | + | |
| Ami Pro for Windows | | .AMI, .SAM | | |
| Apple iWork | | .PAGES, .NUMBERS, .KEY | | |
| Framework WP | | .FW3 | | |
| Hangul HWP | | | + | |
| IBM DCA/FFT | | .RFT, .FFT | | |
| IBM DisplayWrite | 4 | .RFT, .DCA, .DW4, .DOC | | |
| IBM DisplayWrite | 5 | .RFT, .DCA, .DW5, .DOC | | |
| IBM Lotus Symphony Document [3] | 1.x, 3.x | .ODT | | X |
| JustSystems Ichitaro | | .JTD, .JBW, .JTT | + | |
| LibreOffice Document [3] | Beta 3 | .ODT | | X |
| Lotus Manuscript | 1.0, 2.x | .MANU, .MNU, .MAN | | |
| Lotus Notes [1] | | .NSF | | |
| Lotus WordPro [1] | | .LWP | | |
| Mass 11 | 8 | .M11 | | |
| Microsoft Publisher | | .PUB | + | |
| Microsoft Word for DOS | 4.0 - 6.0 | .DOC | | |
| QuarkXpress [1] | | .QXx, .QCx | | |
| Microsoft Word for Windows [3] | 1.0 - 2010 | .DOC, .DOCX | | X |
| Microsoft Word for Mac [3] | 1-5, 5.1, 6, 98, 2001, v. X, 2004, 2008, 2010 | .DOC, .DOCX | | X |
| MultiMate | Through 4.0 | .DOX | | |
| MultiMate Advantage | | .DOX | | |
| OpenOffice Writer [3] | 1.1 - 3.0 | .ODT | | X |
| Professional Write for DOS | 1, 2 | .PW, .PW1, .PW2 | | |
| Professional Write Plus for Windows | 1 | .PW | | |
| Q&A Write | 3, 4 (Classic), 5 | .QA, .QA3 | | |
| QuickBooks Backup [1] | | .QBB | + | |
| QuickBooks for Windows [1] | | .QBW | + | |
| StarOffice Writer [3] | 8, 9 | .SXW, .SDW | | X |
| TrueType Font [1] | | .TTF | + | |
| Wang IWP | | .IWP | | |
| Wang WP Plus | | .IWP | | |
| Windows Write | | .WRI | | |
| WinWord | 6 | .DOC | | |
| WordPerfect for DOS | 4.2 | .WPD | | |
| WordPerfect for Macintosh [3] | 1.0-1.0.7, 2.0, 2.1, 3.0, 3.1, 3.5, 3.5e | .WPD | | |
| WordPerfect for Windows [3] | 5.1-12.0, X3, X4 | .WPD | | X |
| Wordstar 2000 for DOS | 01/03/11 | .WS2, .DOC | | |
| Wordstar for DOS | 3.x-7 | .WS, .WSx | | |
| Wordstar for Windows | 1 | .WSD | | |
| XYwrite | I-III+, 4.0, Windows | .XY | | |
Logging resource strings
All resource strings are stored in the isys_prefilter_strings.ini file. Parameters are typed and must be used in the order specified in the Parameters column:
| Name | Description | Parameters |
| PF_LOG_PREFIX_FORMAT | All log messages begin with this string | Thread ID (%d) |
| PF_ISYS_LOADED_ERROR | Unable to load the ISYS Document Filters DLLs | DLL filename (%s), OS error code (%d) |
| PF_ISYS_INITIALIZED_ERROR | Unable to initialize the ISYS Document Filters | ISYS Error message (%s), ISYS error code (%d) |
| PF_ISYS_STREAM_WRITE_ERROR | Unable to write document stream from Sybase IQ | Stream filename (%s), Stream size (%d), WriteBytes (%d), BytesWritten (%d), OS error code (%d) |
| PF_ISYS_STREAM_CREATE_ERROR | Unable to create ISYS document stream from Sybase IQ | ISYS error message (%s), ISYS error code (%d) |
| PF_ISYS_STREAM_OPEN_ERROR | Unable to open ISYS document | ISYS error message (%s), ISYS error code (%d) |
| PF_ISYS_STREAM_EXTRACT_ERROR | Unable to extract ISYS sub-document | ISYS error message (%s), ISYS error code (%d) |
| PF_OPS_CANCELLED_ERROR | User has cancelled the operation | None |
| PF_GET_DATA_FROM_PROD | About to retrieve document data from Sybase IQ | None |
| PF_GET_DATA_FROM_PROD_RECVD | Amount (in bytes) of document data retrieved from Sybase IQ | Bytes (%d) |
| PF_GET_DATA_FROM_PROD_RECVD_TOTAL | Total amount (in bytes) of document data retrieved from Sybase IQ | Bytes (%d) |
| PF_NO_DATA_FROM_PROD | Empty document retrieved from Sybase IQ | None |
| PF_INCOMPLETE_DATA | Incomplete document retrieved from Sybase IQ | None |
| PF_START_DOCUMENT | Start of document processing | Document number (%d) |
| PF_DOCUMENT_OPENED | Document opened by ISYS Document Filters | Document handle (%d), Document format (%s) |
| PF_SUBDOCUMENT_OPENED | Sub-document opened by ISYS Document Filters | Document handle (%d), Document format (%s) |
| PF_DOCUMENT_TEXT_EXTRACTED | Amount (in bytes) of text extracted | Bytes (%d) |
| PF_DOCUMENT_TEXT_EXTRACTED_TOTAL | Amount (in bytes) of text extracted in total | Bytes (%d) |
| PF_END_DOCUMENT | End of document processing | None |
| PF_END_DOCUMENT_PROCESSING | Time taken to process document | Time (%f) |
| PF_ISYS_IGR_OK | Document operation successful | None |
| PF_ISYS_IGR_E_OPEN_ERROR | Document open error | None |
| PF_ISYS_IGR_E_WRONG_TYPE | Document is wrong type | None |
| PF_ISYS_IGR_E_IN_USE | Document is in use | None |
| PF_ISYS_IGR_E_NOT_READABLE | Document is not readable | None |
| PF_ISYS_IGR_E_PASSWORD | Document is password protected | None |
| PF_ISYS_IGR_E_NOT_FOUND | Document not found | None |
| PF_ISYS_IGR_E_WRITE_ERROR | Document write error | None |
| PF_ISYS_IGR_E_NOT_VALID_FOR_THIS_CLASS | Document operation not valid | None |
| PF_ISYS_IGR_E_ERROR | Document error | None |
| PF_ISYS_IGR_E_INVALID_HANDLE | Invalid document handle | None |
| PF_ISYS_IGR_E_INVALID_POINTER | Invalid pointer | None |
| PF_ISYS_IGR_E_INVALID_PARAMETER | Invalid parameter | None |
| PF_ISYS_IGR_NO_MORE | Document has no more text or sub-documents | None |
| PF_ISYS_INVALID_ISYSPATH | ISYS Document Filters path is invalid | None |
| PF_ISYS_INVALID_ISYSLICENSE | ISYS Document Filters license is invalid or expired | None |