

PRODUCT DOCUMENTATION

Pivotal GPText
Version 1.2

Getting Started with GPText
Rev: A02

© 2013 GoPivotal, Inc.


Copyright © 2013 GoPivotal, Inc. All rights reserved. GoPivotal, Inc. believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." GOPIVOTAL, INC. ("Pivotal") MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Pivotal software described in this publication requires an applicable software license.

All trademarks used herein are the property of Pivotal or their respective owners.

Revised December 2013 (1.2.2.0)


Table of Contents

Preface
    About this Guide
    Text Conventions
    Command Syntax Conventions

Chapter 1: Welcome to GPText
    Sample GPText User Scenario
    Pivotal GPText Workflows
        Data Load and Indexing Workflow
        Query Processing Workflow
    The GPText Demo VM
    Demo System Requirements

Chapter 2: Install and Configure the Demonstration Virtual Machine
    Configure network settings
    Use a terminal to run the demos

Chapter 3: Searching Social Media Text
    Set Up the Social Media Demo
    Create the Schema and Message Table
    Insert Data Into the message Table
    Create a Solr Index
    Use the text_sm Social Media Analyzer
    Index the Message Table Data
    Run the Queries
        query-1.sql – a simple text search query
        query-2.sql – a simple text search query with a filter
        query-3.sql – a faceted search
        highlight_search.sql – a search with hit highlighting

Chapter 4: k-means Clustering Using GPText
    Set Up the k-means Demo
    Create the Schema and Message Table
    Insert Data Into the Message Table
    Create a Solr Index
    Index the Message Table Data
    Create Term Vectors and a Dictionary of Terms
        Create a Terms Table
        Create a Dictionary Table
    Create a Sparse Vector Representation for Each Document
        Create Sparse Vectors
        Create tf-idf Score Arrays For the Terms in Each Document
    Run the k-means Algorithm
    Examine the Results
        Some Helpful Queries

Chapter 5: SVM Classification Using GPText
    Set Up the SVM demo
    Train the SVM Model
        Create the SVM Model Schema and Tables
        Populate the Training Data Table
        Index the Data In the Training Data Table
        Create (or Extract) Features In the Training Data Table
        Create the Input Table for SVM Training
        Train the SVM
    Test the SVM Model
        Populate the Testing Data Table
        Index the Testing Data Table
        Create the Input Table For SVM Testing
        Predict Classes For the Testing Data
    Examine the results
        Sample predictions Output
        Calculate the Accuracy of the Predictions

Chapter 6: LDA Topic Modeling Using GPText
    Set up the LDA demo
    Train the LDA model
        Create the LDA Model Schema and Tables
        Populate the Training Data Table
        Index the Data In the Training Data Table
        Create Training Terms
        Create the Term Dictionary
        Create the Input Data for LDA Training
        Train the LDA
        Show Top k Terms from the Topics
    Test the LDA Model
        Load the Test Data into the Testing Data Table
        Index the Testing Data Table
        Create a Terms Table for the Indexed Test Data
        Create the Input Table For LDA Testing
        Predict Classes For the Testing Data
        Query the Prediction Results


Preface

This guide introduces Pivotal GPText, providing information and instructions for installation, configuration, and use of the GPText platform for data scientists, application builders, system administrators, and database administrators.

• About this Guide
• Text Conventions
• Command Syntax Conventions

About this Guide

This guide introduces GPText, explains the GPText workflow, and describes how to install, configure, and use the GPText demo virtual machine to run simple search queries and to apply k-means clustering, social media analysis, and Support Vector Machines with GPText.

The documentation is intended for:

• System administrators and database managers who will install and maintain a Greenplum Database (GPDB). These individuals should know basic Linux administration and should have some experience administering GPDB clusters.

• Application builders (software developers) who will use GPText as a platform for building new software.

• Data scientists who are interested in solving business problems with advanced techniques.

Application builders and casual data scientists should have a good working knowledge of PostgreSQL (the SQL language of GPDB), should understand basic GPDB principles, and should have some familiarity with Lucene query syntax.

Moderate users should also have a basic understanding of natural language processing and some familiarity with configuring Solr analyzer chains.

Advanced users should have a thorough understanding of PostgreSQL and a solid background in Machine Learning algorithms, especially as they are implemented in MADlib.


Text Conventions

Table 0.1 Text Conventions

italics
    Usage: New terms where they are defined; database objects, such as schema, table, or column names.
    Examples:
        The master instance is the postgres process that accepts client connections.
        Catalog information for GPText resides in the pg_catalog schema.

monospace
    Usage: File names and path names; programs and executables; command names and syntax; parameter names.
    Examples:
        Edit the postgresql.conf file.
        Use gpstart to start Greenplum Database.

<monospace italics>
    Usage: Variable information within file paths and file names; variable information within command syntax.
    Examples:
        /home/gpadmin/<config_file>
        COPY tablename FROM '<filename>'

monospace bold
    Usage: Calls attention to a particular part of a command, parameter, or code snippet.
    Example:
        Change the host name, port, and database name in the JDBC connection URL:
        jdbc:postgresql://host:5432/mydb

UPPERCASE
    Usage: Environment variables; SQL commands; keyboard keys.
    Examples:
        Make sure that the Java /bin directory is in your $PATH.
        SELECT * FROM my_table;
        Press CTRL+C to escape.

Command Syntax Conventions

Table 0.2 Command Syntax Conventions

{ }
    Within command syntax, curly braces group related command options. Do not type the curly braces.
    Example: FROM { 'filename' | STDIN }

[ ]
    Within command syntax, square brackets denote optional arguments. Do not type the brackets.
    Example: TRUNCATE [ TABLE ] name

...
    Within command syntax, an ellipsis denotes repetition of a command, variable, or option. Do not type the ellipsis.
    Example: DROP TABLE name [, ...]

|
    Within command syntax, the pipe symbol denotes an “OR” relationship. Do not type the pipe symbol.
    Example: VACUUM [ FULL | FREEZE ]


1. Welcome to GPText

GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides project and business decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache Solr enterprise search and the MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.

GPText supports business decision making by offering:

• Multiple kinds of data: GPText uniquely supports both semi-structured and unstructured data searches, which exponentially increases the kinds of information you can find.

• Less schema dependence: GPText does not require static schemas to successfully locate information; schemas may change or even be quite simple and still return targeted results.

• Text analytics: GPText supports analysis of text data using statistical machine learning techniques via the MADlib Analytics Library, which is integrated with Greenplum Database and available for use by GPText.

Sample GPText User Scenario

Suppose that a forensic financial analyst is examining communications among corporate executives to determine whether or not they knew of financial malfeasance in their firm. The analyst loads the email records into a Greenplum database, creates a Solr index of the email records, and runs queries that look for text strings and their authors. After refining the queries, the analyst connects a dummy company, set up to funnel money offshore, with the four executives who corresponded about these transactions. With this data, the analyst can focus the investigation on just the individuals involved rather than the thousands of authors in the initial data sample.

Pivotal GPText Workflows

GPText works with Greenplum Database and Apache Solr to store and index big data for information retrieval purposes. High-level workflows include data loading and data querying.

Data Load and Indexing Workflow

The workflow for loading and indexing data for GPText is as follows:


1. Load data into your Greenplum system. See the Greenplum Database Database Administrator Guide for details, available on Support Zone.

2. Create an index targeted to your application’s requirements in the Solr instance attached to your data’s segment.

Query Processing Workflow

The following is the high-level GPText query processing workflow.

1. Create a SQL query targeted to your indexed information.

2. The Greenplum Master dispatches the query to Greenplum Segments.

3. The segments look up the appropriate indexes stored in their Solr repositories, if required by the query.


4. The segments gather the information according to index rules and send it to the master.

The GPText Demo VM

A GPText demo is available, packaged as a VMware virtual machine image (VM) that contains everything you need to start experimenting with GPText. You can accomplish any GPText-supported task on a small scale. This demo is not intended for a production environment.

Packaged demos include:

• Searching Social Media Text
• k-means Clustering Using GPText
• SVM Classification Using GPText

For an introduction to GPText, see the Pivotal GPText User’s Guide.

Demo System Requirements

You can run the GPText demo on a system running Windows or Mac OS X. A 64-bit CPU with 4 Gbytes of RAM is required, as is 8 Gbytes of available disk space.

The demo file is a VMware virtual machine image. To run it on Mac OS X, you must have installed VMware Fusion on your system. To run it on Windows, you must have installed either VMware Workstation or VMware Player.

Note: For Windows systems, you must enter the system BIOS and make sure that VT (Virtualization Technology) is enabled and Trusted Execution is disabled.


2. Install and Configure the Demonstration Virtual Machine

Warning: Never perform a forced shutdown (power off) when the demo virtual machine (VM) is running. If you do, Solr configuration files will be destroyed and you will have to download the VM and start all over from the beginning.

1. Download greenplum-text-1.2.x.x.vmwarevm.zip from https://feedbackcentral.emc.com. Here 1.2.x.x is the build number, for example 1.2.0.0.

2. Extract the virtual machine image from the zipped file.

   On Mac OS, right-click greenplum-text-1.2.x.x.vmwarevm.zip and choose Open With Archive Utility.

   On Windows, use a utility such as WinZip or 7-zip (do not open the zip file with Windows Explorer). Right-click greenplum-text-1.2.x.x.vmwarevm.zip and choose Open With WinZip (or 7-zip, or other utility).

3. Load the virtual machine using VMware.

   On Mac OS, right-click greenplum-text-1.2.x.x.vmwarevm and choose Open With VMware Fusion.

   On Windows, run VMware Workstation or VMware Player and choose Open a Virtual Machine. Find and open the file named greenplum-centos.vmx, which is in the greenplum-text-1.2.x.x.vmwarevm directory where you unpacked the zipped demo file. When VMware Player opens, choose Play Virtual Machine.

4. In the dialog box that appears when you first start VMware Fusion or Player, choose I copied it.

5. Log in to the virtual machine using the ID gpadmin and password changeme.

Configure network settings

Note: Network configuration may not work if you are connected to the host system by VPN.

1. Change to the /home/gpadmin/scripts directory.

2. Log in as root: execute su, password changeme.

3. Run the configuration script: sh networking_setup

Important: After you run the networking_setup script, enter exit to log out as root. You should then see a prompt like this:

[gpadmin@gptextsearch ~]$

You cannot run any of the gpadmin commands when you are logged in as root.

Your installation is now configured. Return to the /home/gpadmin directory.


Use a terminal to run the demos

When you run the demos in the virtual machine window, the virtual machine captures the cursor, and the host's mouse, copy/paste, and so on do not work. To get the cursor back, press Control-Command (Mac) or Ctrl-Alt (Windows).

You can also use a terminal to connect with ssh, for example, to copy screen text and paste it into an editor for printing, or to add data to the demos from your hard drive.

On Windows you will need terminal emulation software with ssh support, such as PuTTY, or a Linux-like environment for Windows, such as Cygwin.

To connect from a terminal:

1. In the virtual machine window, run ifconfig -a to get the IP address of your virtual machine. For example:

eth2      Link encap:Ethernet  HWaddr 00:0C:29:B4:D9:F6
          inet addr:172.16.253.128  Bcast:172.16.253.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feb4:d9f6/64 Scope:Link

....

2. Switch to your OS, leaving the virtual machine running.

3. Open a terminal.

4. Enter ssh gpadmin@<VM IP address>. The IP address is in the output of the ifconfig -a command, in this example, 172.16.253.128.

5. Enter the password changeme.

You can now run the demo in your terminal, while having use of your OS tools.

6. To end your terminal session, enter exit and then close the terminal.

Note: This ends only the terminal session, not the virtual machine session you are running.


3. Searching Social Media Text

You can use GPText to analyze text from social media feeds, for example, to determine sentiment about a public figure or to determine how consumers feel about a product offering. The GPText demonstration virtual machine includes a small database of social media feeds on the subject of attitudes towards iPhones and President Obama to show how to use GPText to analyze social media text.

Open the demo files and look at the queries to get the maximum benefit from running the demo.

The database of social media feeds is called demo.

Set Up the Social Media Demo

1. Install and configure the demonstration virtual machine. See Install and Configure the Demonstration Virtual Machine.

2. Start the Greenplum database by executing gpstart -a.

3. Start the text search engine by executing gptext-start.

4. Change to the /home/gpadmin/demo/search directory.

To run the social media demo:

1. Create the schema and message table needed for the demo:
   psql -f twitter.ddl

2. Load the sample data set:
   gpload -f load.yaml

3. Create an empty Solr index from the table:
   psql -f create_index.sql

4. Change the analysis chain to use the text_sm social media analyzer:
   gptext-config -d schema.xml -i demo.twitter.message

   Note: Skip this step if you want to index the data using the default schema.

5. Load and commit the index:
   psql -f load_index.sql

The following sections walk you through the social media demo.

Create the Schema and Message Table

Enter:

psql -f twitter.ddl


The twitter.ddl script uses SQL commands to create a schema called twitter and an empty table called twitter.message using that schema. twitter.ddl deletes pre-existing schemas or tables with those names before creating the new ones.

The table has the following definition:

CREATE TABLE twitter.message (
    id bigint,
    message_id bigint,
    spam boolean,
    created_at timestamp without time zone,
    source text,
    retweeted boolean,
    favorited boolean,
    truncated boolean,
    in_reply_to_screen_name text,
    in_reply_to_user_id bigint,
    author_id bigint,
    author_name text,
    author_screen_name text,
    author_lang text,
    author_url text,
    author_description text,
    author_listed_count integer,
    author_statuses_count integer,
    author_followers_count integer,
    author_friends_count integer,
    author_created_at timestamp without time zone,
    author_location text,
    author_verified boolean,
    message_url text,
    message_text text
) DISTRIBUTED BY (id);

Insert Data Into the message Table

Enter:

gpload -f load.yaml

This step loads the twitter.message table with the data in the twitter.csv.gz file.

The path to the data file is /home/gpadmin/demo/search/twitter.csv.gz.

The table now contains several tens of thousands of records. Each of the message_text fields is considered to be a unique document.
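If you want to verify the load, a quick check from psql (hypothetical, not one of the demo scripts) is enough to confirm the row count:

-- Count the loaded tweets; the exact number depends on twitter.csv.gz.
SELECT count(*) FROM twitter.message;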


Create a Solr Index

Enter:

psql -f create_index.sql

To create an empty Solr index, use the GPText function:

create_index(schema_name, table_name, id_col_name, def_search_col_name)

The parameters are twitter (the schema), message (the table), id (the record/document ID), and message_text (the default search column).

The resulting index is demo.twitter.message:

<database_name>.<schema_name>.<table_name>

Any existing demo.twitter.message index is erased before a new index is created.
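As a rough sketch, the statement inside create_index.sql presumably resembles the following, assuming the function is called through the gptext schema with the signature shown above; the demo script itself is authoritative:

-- Sketch: create an empty Solr index for the twitter.message table.
SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text');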

Use the text_sm Social Media Analyzer

Use the gptext-config management utility to specify the text_sm social media analyzer instead of text_intl, the default analyzer.

The text_intl default analyzer is not well suited to twitter data. The text_sm analyzer is designed to handle social media text. The demonstration VM includes a Solr schema.xml file that sets the analyzer to text_sm. The schema.xml file is located in the /home/gpadmin/demo/search directory.

Change to the /home/gpadmin/demo/search directory, then enter:

gptext-config -d schema.xml -i demo.twitter.message

This moves the schema.xml file from /home/gpadmin/demo/search to the demo.twitter.message index.

See the GPText Function Reference for more information about the gptext-config management utility.

Index the Message Table Data

Enter:

psql -f load_index.sql

The load_index.sql script uses the GPText index() function to populate the demo.twitter.message index using the text_sm analyzer. The syntax is:

index(table_name, index_name)

The table name must be preceded by the schema name, unless the schema is public. So, in this case the first parameter is twitter.message.

Note: If there are empty documents in the table, they are removed during indexing.
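A minimal sketch of what load_index.sql might contain, assuming gptext.index() takes a table expression plus the index name and that the index is committed afterward (step 5 above says "load and commit"); the actual script may differ:

-- Sketch: populate the index from the table, then commit it.
SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message),
                           'demo.twitter.message');
SELECT * FROM gptext.commit_index('demo.twitter.message');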


Run the Queries

You are now ready to run queries on the twitter data. Sample queries are provided to get you started. Run these queries using the psql command. For example:

psql -f query-1.sql

The sample queries are:

• query-1.sql – a simple text search query
• query-2.sql – a simple text search query with a filter
• query-3.sql – a faceted search
• highlight_search.sql – a search with hit highlighting

query-1.sql – a simple text search query

Searches for ‘iphone’ using the GPText function gptext.search(src_table, index_name, search_query, filter_queries, options).

Here is the query:

-- This query uses simple Lucene syntax to
-- execute a query in Solr and join the results
-- to the original table in the GP database
select t.id, q.score, t.message_text
from twitter.message t,
     gptext.search(
         TABLE(SELECT * FROM twitter.message),
         'demo.twitter.message',
         'iphone',
         null,
         'rows=100'
     ) q
where t.id = q.id
order by score desc;

query-2.sql – a simple text search query with a filter

query-2.sql is a simple text search query with a filter applied in Solr. It uses the same search function as query-1.sql, except that the search_query parameter is '(iphone AND (hate OR love))' and the filter_queries parameter is '{author_lang:en}'.

Here is the query:

-- This query uses more advanced Lucene syntax
-- and applies a filter in Solr before documents
-- are searched
select t.id, q.score, t.author_screen_name, t.message_text
from twitter.message t,
     gptext.search(
         TABLE(SELECT * FROM twitter.message),
         'demo.twitter.message',
         '(iphone AND (hate OR love))',  -- FTS query
         '{author_lang:en}',             -- Solr-level filter
         'rows=100'                      -- option limiting the number of rows
     ) q
where t.id = q.id
order by score desc;

query-3.sql – a faceted search

Searches for ‘obama’, faceted by the author_name field, using the GPText function gptext.faceted_field_search(index_name, query, filter_queries, facet_fields, facet_limit, minimum).

Here is the query:

-- This query uses Solr faceting to aggregate
-- documents matching the search term before
-- they are returned to the Greenplum database.
-- Will return up to 100 documents per author.
-- Will not return any documents for an author
-- if there are less than 5 for that author.
select *
from gptext.faceted_field_search(
    'demo.twitter.message',
    'obama',          -- FTS query
    null,             -- FTS filter queries
    '{author_name}',  -- aggregation (facet) field
    100,              -- limit
    5                 -- minimum
) q
order by value_count desc;

highlight_search.sql – a search with hit highlighting

Searches for ‘iphone’ and uses the GPText function gptext.highlight(text_field, field_name, offsets) to insert markup tags around the matched text in the results.


In the gptext.search() function, the options parameter is set to enable highlighting (hl=true) and to highlight the message_text field (hl.fl=message_text).

Here is the query:

/* Search query with highlight requested */
SELECT t.id,
       gptext.highlight(t.message_text, 'message_text', s.hs) AS message_text_hl
FROM twitter.message t,
     gptext.search(
         TABLE(SELECT 1 SCATTER BY 1),
         'demo.twitter.message',
         '{!gptextqp}iphone',
         null,
         'rows=10&hl=true&hl.fl=message_text'
     ) s
WHERE s.id = t.id
ORDER BY t.id;


4. k-means Clustering Using GPText

The k-means algorithm is a well-known algorithm for grouping data into clusters, enabling classification of a data set. The GPText demonstration virtual machine provides sample data and queries for running a k-means demonstration, including:

• A set of 999 documents
• A set of queries for running the k-means clustering algorithm on the documents

The documents are WebKB documents from http://web.ist.utl.pt/~acardoso/datasets. Webkb.org provides sample data for testing machine learning algorithms. For the k-means demo, the documents have the following types:

• "student"

• "faculty"

• "course"

• "project"

Four of the documents are empty and are discarded during indexing.

The name of the database is demo. It is the same database that is used for the social media search demo, but with a different schema for the k-means demo.

Open the demo files and look at the queries to get the maximum benefit from running the demo.

Set Up the k-means Demo

1. Install and configure the demo virtual machine. See Install and Configure the Demonstration Virtual Machine.

2. Start the Greenplum database by executing gpstart -a.

3. Start the text search engine by executing gptext-start.

4. Change to the /home/gpadmin/demo/analytics/Kmeans directory.

To run the k-means demo:

1. Create the schema and the message table:
   psql -f kmeans.ddl

2. Insert data into the message table:
   gpload -f load.yaml

3. Create a Solr index:
   psql -f create_index.sql

4. Index the data in the message table:
   psql -f index_data.sql

5. Create term vectors for each document and create a dictionary of terms:
   psql -f create_corpus.sql

6. Create a sparse vector representation for each document:
   psql -f create_sparse_vector.sql

7. Run the k-means algorithm:
   psql -f kmeans_run.sql

The following sections walk you through the k-means demo.

Create the Schema and Message Table

Enter:

psql -f kmeans.ddl

This step deletes pre-existing k-means demo schemas or tables, then uses SQL commands to create a schema called kmeans_demo and an empty table called kmeans_demo.messages in that schema. The table has two columns: id and message_body.

Insert Data Into the Message TableEnter:

gpload -f load.yaml

This step uses the Greenplum gpload command to load the kmeans_demo.messages table with the data in the kmeans.csv file.

Examine the contents of the .csv file to see the data used for the demo. The path to the file is /home/gpadmin/demo/analytics/Kmeans/kmeans.csv.

The table now contains 999 records. Each of the 999 message_body fields is considered to be a unique document.

Create a Solr Index

Enter:

psql -f create_index.sql

The create_index.sql script invokes the create_index() function to create an empty Solr index:

create_index(schema, table_name, id_col_name, def_search_col_name)

The parameters are kmeans_demo (the schema), messages (the table), id (the document ID), and message_body (the default search column).

The resulting index is called demo.kmeans_demo.messages (database_name. schema_name.table_name).


Next, the GPText enable_terms(index_name, field_name) function is called with the index name and the message_body column name as parameters. Solr notes the terms that are present and their positions in the message_body documents when an index is created.

Note: Terms and positions are inherent capabilities of Solr that are disabled by default. The enable_terms() function tells Solr to turn on those capabilities before indexing.

Any existing demo.kmeans_demo.messages index is deleted before creating a new one.
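A sketch of the enable_terms() call described above, assuming it is invoked through the gptext schema; the statement in the demo's create_index.sql may differ:

-- Sketch: record terms and positions for message_body during indexing.
SELECT * FROM gptext.enable_terms('demo.kmeans_demo.messages', 'message_body');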

Index the Message Table Data

Enter:

psql -f index_data.sql

The index_data.sql script uses the GPText index() function to populate the demo.kmeans_demo.messages index. The syntax is:

index(table_name, index_name)

The table name must be preceded by the schema name, unless the schema is public. So, in this case the first parameter is kmeans_demo.messages.

Note: If there are empty documents in the table, they are removed during indexing.

Create Term Vectors and a Dictionary of Terms

Enter:

psql -f create_corpus.sql

In this step, you assemble the basic data needed to run the k-means algorithm:

1. Create a Terms Table

2. Create a Dictionary Table

Create a Terms Table

A terms table called kmeans_demo.term_vectors is created by the GPText function create_terms_table(terms_table_name, index_name, field_name, search_query, filter_queries). The field_name parameter given is message_body (the field that was previously enabled for terms).
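Following that signature, the call in create_corpus.sql presumably resembles this sketch; the match-all search query '*:*' and the null filter are assumptions:

-- Sketch: build a terms table covering every indexed document.
SELECT * FROM gptext.create_terms_table('kmeans_demo.term_vectors',
                                        'demo.kmeans_demo.messages',
                                        'message_body',
                                        '*:*',  -- match all documents (assumed)
                                        null);  -- no filter queries (assumed)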


This function produces a table with every term in each document listed, along with an array that shows the term’s positions in the document. The number of rows in the table is equal to the sum of all the terms in each of the documents—there is one row for each combination of a document ID and a term. The terms are in alphabetical order. The table is similar to the following:

Document ID   Term        Positions
1             a           3, 11, 19, 23, 35
1             above       41
1             champion    26
1             criminals   38
...
2             are         6, 11
2             feel        13
2             Ho          4
...
999           she’s       18, 20
999           they        50
999           this        1, 15
999           way         33
999           who         23

Create a Dictionary Table

The terms table is used to create a table of all the terms in the database: every term that appears in any of the 995 documents. With a large database, this table can include tens of thousands of rows. Each term appears once, in alphabetical order, in the second column. The first column is a serial ID. The last column lists the number of instances of that term in all the documents.

The table is called kmeans_demo.dictionary. It looks like this:

ID   Term   Number of instances
1    a      470
2    all    26
3    an     216
4    any    82
...


Any pre-existing dictionary table is deleted before this table is created.
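In plain SQL, the dictionary could be derived from the terms table roughly as follows; this is a hypothetical equivalent (the column names term and positions are taken from the table layouts shown above, and the actual statement in create_corpus.sql may differ):

-- Sketch: one row per distinct term, with a serial ID and the total
-- number of occurrences summed from each document's positions array.
CREATE TABLE kmeans_demo.dictionary AS
SELECT row_number() OVER (ORDER BY term) AS id,
       term,
       sum(array_upper(positions, 1)) AS num_instances
FROM kmeans_demo.term_vectors
GROUP BY term
DISTRIBUTED BY (id);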

You now have:

• A terms table: document ID, term, positions
• A dictionary table: ID, term, number of appearances

Note: In a real-world application, these tables and the array can include tens of thousands of terms.

Create a Sparse Vector Representation for Each Document

Enter:

psql -f create_sparse_vector.sql

In this step, combinations of GPText and MADlib functions prepare the data for running the k-means algorithm:

1. Create Sparse Vectors of terms, one for each document.

2. Create tf-idf Score Arrays For the Terms in Each Document from the sparse vectors of terms. These tf-idf vectors are input to the k-means algorithm.

Any existing tables of vectors are deleted before creating new ones.

Create Sparse Vectors

A table called kmeans_demo.corpus is created to contain sparse vectors for each document. The GPText function gptext.gen_float8_vec() is used to create the required float vector representations for the documents, which are then typecast to MADlib sparse vectors (with term frequencies).

The inputs are the terms table, kmeans_demo.term_vectors (see “Create a Terms Table”), and the dictionary table, kmeans_demo.dictionary (see “Create a Dictionary Table”).

The GPText function gptext.gen_float8_vec() is called 995 times, once for each document. gptext.gen_float8_vec() takes three parameters:

• The first parameter is an array that shows which of the terms in the dictionary are found in the document. For example, if the first, third, 74th, and 287th dictionary terms (but no others) are found in the document, the array will contain {1,3,74,287}.



• The second parameter is an array that indicates the number of times each of those dictionary terms appears in the document. Continuing the same example, {12,3,1,1} would indicate that the first, third, 74th, and 287th dictionary terms appear 12 times, 3 times, once, and once, respectively.

• The third parameter is the total number of terms in the dictionary, as determined by the gptext.count_t() function, which takes the name of the dictionary table as an argument.

The output of gptext.gen_float8_vec() is a float vector with term frequencies. A typical float vector can look like:

{1,1,2,2,0,....}

This means the first and second dictionary terms are in the document once each, while the third and fourth dictionary terms are in the document two times each. The fifth dictionary term is not in the document. And so forth.

The float vector is typecast into a MADlib sparse vector. The float vector {1,1,2,2,0,....} becomes the MADlib sparse vector {2,2,1,....}:{1,2,0,....}. The first element in the first array in the sparse vector, {2,2,1,....}, coupled with the first element of the second array in the sparse vector, {1,2,0,....}, means that the first two dictionary terms are present once each. The second elements of the two arrays mean that the third and fourth dictionary terms are present two times each. The third elements of the two arrays mean that the fifth dictionary term is present zero times, ....
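The following hypothetical statements illustrate both steps with the example values used above; the 300-term dictionary size is an assumed number:

-- Sketch: term-frequency float vector for a document in which dictionary
-- terms 1, 3, 74, and 287 appear 12, 3, 1, and 1 times, respectively.
SELECT gptext.gen_float8_vec(ARRAY[1,3,74,287],  -- dictionary positions present
                             ARRAY[12,3,1,1],    -- counts for those terms
                             300);               -- dictionary size (assumed)

-- Casting a float8 array to a MADlib sparse vector run-length encodes it:
SELECT '{1,1,2,2,0}'::float8[]::madlib.svec;     -- yields {2,2,1}:{1,2,0}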

The result is a table of sparse vectors, called kmeans_demo.corpus, similar to the following:

Document ID   Sparse Vector Representation of Document
1             {x1,x2,x3,x4,...}:{x1',x2',x3',x4',...}
2             {y1,y2,y3,y4,...}:{y1',y2',y3',y4',...}
3             {z1,z2,z3,z4,...}:{z1',z2',z3',z4',...}
...

Create tf-idf Score Arrays For the Terms in Each Document

The tf-idf score is calculated for each term in each document. This score is a measure of how statistically important the term is in its document. See, for example, http://en.wikipedia.org/wiki/tf*idf.

A sparse vector is created of tf-idf scores for each document. A table of these vectors, called kmeans_demo.tfidf, is the main input for running the k-means algorithm (“Run the k-means Algorithm”).

MADlib functions are used to create the tf-idf arrays for each document, based on the sparse vectors of terms. The MADlib functions are:


• svec_mult – multiplies two sparse vectors together
• svec_log – calculates the log of a sparse vector
• svec_div – divides two sparse vectors
• svec_count_nonzero – counts the number of nonzero entries in a sparse vector

The result is a table called kmeans_demo.tfidf that contains a tf-idf sparse vector for each of the documents. Here is a typical tf-idf sparse vector:

{28,1,6,1,9,......}:{0,3.81170028380028,0,3.46875553267345,0,....}

Note: The sparse vector representations of the documents were necessary for creating the tf-idf sparse vectors, but only the tf-idf sparse vectors are used by the k-means algorithm.
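A hedged sketch of how those four functions combine to produce the tf-idf table, following the standard tf * log(N/df) formulation from the MADlib sparse-vector examples; the corpus column name vec and the exact statements in create_sparse_vector.sql are assumptions:

-- Sketch: tf-idf = tf * log(total_docs / docs_containing_term).
-- count(vec) is the number of documents; the svec_count_nonzero
-- aggregate yields the per-term document frequencies.
CREATE TABLE kmeans_demo.tfidf AS
SELECT id,
       madlib.svec_mult(vec, logidf) AS tf_idf
FROM (SELECT madlib.svec_log(
                 madlib.svec_div(count(vec)::madlib.svec,
                                 madlib.svec_count_nonzero(vec))) AS logidf
      FROM kmeans_demo.corpus) idf,
     kmeans_demo.corpus
DISTRIBUTED BY (id);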

Run the k-means Algorithm

Enter:

psql -f kmeans_run.sql

The k-means algorithm calculates the best composition of clusters of the documents based on the tf-idf scores of the terms in the document. It runs iteratively. You seed it with the number of clusters you expect. The data points of each document are assigned to a cluster and the mean distance to the cluster center (centroid) is calculated. On subsequent iterations, the documents are reassigned to different clusters if the reassignment makes the mean distance of their points to the centroid decrease.

With number_of_clusters = 4, for example, the algorithm will find four clusters of documents, where the documents in one cluster are similar to each other and dissimilar to those in the other clusters.

Here is the formalism:

SELECT * FROM madlib.kmeanspp(
    'kmeans_demo.tfidf',  -- src_relation
    'tf_idf',             -- data_column
    4,                    -- number of clusters
    'madlib.dist_angle',  -- function that returns the distance between two points
    'madlib.avg',         -- aggregate function for calculating centroids
    30,                   -- maximum number_of_iterations
    0.01                  -- convergence_threshold
);

The inputs are:

• The kmeans_demo.tfidf table.
• The table’s data column (the tf-idf sparse vectors).
• The expected number of clusters.
• The name of a function that calculates the distance between two points.
• The name of an aggregate function that calculates a centroid.
• The maximum number of iterations to perform.
• The convergence threshold: if the fraction of points reassigned to a different centroid during an iteration is less than this value (0.01 here), the iterations stop, even if the maximum number of iterations has not been reached.

The kmeans_run.sql script creates the kmeans_demo.kmeanspp table, which contains the one-row result from running the kmeanspp() function. The result table has four columns:

• The final centroid positions, a matrix with a column for each cluster.
• The objective function.
• The fraction of the points reassigned to clusters on the last iteration.
• The number of iterations.

Examine the Results

Enter:

psql -f gen_cluster_info.sql

The gen_cluster_info.sql script uses the k-means analysis saved in the kmeans_demo.kmeanspp table to generate tables that are helpful for interpreting the results.

The first query in this script assigns training documents to the closest centroid found in the k-means execution.

The remaining queries in gen_cluster_info.sql demonstrate a method for finding the top terms for the clusters, using tf-idf scores. The method treats all of the documents in a cluster as a single document, so the number of documents is the same as the number of clusters, four in this example. tf-idf calculations are performed on the four aggregated documents.

Here are the steps:

1. Aggregate all of the documents assigned to each cluster into a single document per cluster, creating the kmeans_demo.result_agg table.

2. Calculate tf-idf scores on the documents formed in the previous step. The results are saved in kmeans_demo.result_tfidf.

3. Unnest the tf-idf score array formed for the clusters in the previous step. This is used to pick the top k terms according to tf-idf in the final cluster information table. This query creates the kmeans_demo.result_tfidf table.

4. Generate the final cluster information table, kmeans_demo.clusters_info, using the results from the previous steps. The kmeans_demo.clusters_info table has the following columns:

• cluster_id – the zero-based cluster number.
• num_docs – the number of documents in the cluster.


• docs – an array of document IDs for the cluster.

• top_terms – an array of the top k (10, for this example) terms for the cluster.

Some Helpful Queries

The following queries are provided to show the meaning of the results found:

1. Run psql -f num_docs_in_clusters.sql to produce a table showing the number of documents in each cluster. The data is extracted from the kmeans_demo.kmeans_assignments table. Here are the results:

 cluster_id | count
------------+-------
          1 |   262
          2 |   391
          3 |   137
          4 |   205
(4 rows)

2. Run psql -f top_terms_in_clusters.sql to produce a table showing the most frequent words found in each cluster. The data is extracted from the kmeans_demo.clusters_info table created by the gen_cluster_info.sql script. Here are the results:

 cluster_id | top_terms
------------+-----------------------------------------------------
          1 | {comput,scienc,univ,page,depart,research,home,interest,student,inform}
          2 | {page,work,design,inform,project,univ,scienc,system,research,comput}
          3 | {algorithm,interest,program,inform,depart,page,univ,research,scienc,comput}
          4 | {assign,comput,class,page,inform,offic,hour,syllabu,program,homework}
(4 rows)

3. Run psql -f show_clusters.sql to show the cluster assignments for each document:

 cluster_ids | docs
-------------+-------------------------------------------------
           1 | {2,4,6,7,9,11,12,13,14,18,20,23,29,31,32,34,43,44,45,51,53,54,57,58,73,...}
           2 | {1,5,8,15,16,17,22,25,27,30,35,36,37,39,41,42,47,49,56,59,61,62,63,64,66,72,79,...}
           3 | {3,19,24,28,38,46,65,69,74,77,88,95,97,103,108,129,133,139,150,166,169,183,...}
           4 | {10,21,26,33,40,48,50,52,55,60,67,68,70,71,76,78,80,81,83,92,94,101,...}
(4 rows)


5. SVM Classification Using GPText

Support Vector Machines (SVMs) are supervised learning models that you train to analyze data and recognize patterns, then use to classify data based on the patterns. For example, you can use an SVM to determine whether email is legitimate or spam, or whether social media sentiment about a public figure is positive or negative.

The SVM demonstration shows how to use GPText with Support Vector Machines to perform binary classification: a data instance belongs to one of two possible categories.

The GPText demonstration virtual machine provides sample data and queries for training, testing, and running a demo SVM, including:

• A set of 1,370 documents from webkb-train-stemmed for training and testing
• A set of queries for running the SVM classification algorithm on the documents

The documents are WebKB documents from http://web.ist.utl.pt/~acardoso/datasets. Webkb.org provides sample data for testing machine learning algorithms. For the SVM demo, the training data set contains documents that have the following classifications:

• "student" (750 documents)
• "course" (620 documents)

The testing data contains 400 instances from webkb-test-stemmed, containing 200 from each class: student or course.

The name of the database is demo. It is the same database that is used for the social media search, k-means, and LDA demos, but with a different schema for the SVM demo.

Open the demo files and look at the queries to get the maximum benefit from running the demo.

The general steps for using the SVM demo are:

• Set Up the SVM demo
• Train the SVM Model
• Test the SVM Model
• Examine the results

Set Up the SVM demo

1. Install and configure the demo virtual machine. See Install and Configure the Demonstration Virtual Machine.

2. Start the Greenplum database by executing gpstart -a.

3. Start the text search engine by executing gptext-start.


4. Change to the /home/gpadmin/demo/analytics/SVM directory.

Train the SVM Model

The steps for training the SVM model are:

1. Create the schema and two tables (training data and testing data):
   psql -f svm.ddl

2. Populate the training data table:
   gpload -f load_train.yaml

3. Index the data in the training data table:
   psql -f index_train_data.sql

4. Create (or extract) features from the training data:
   psql -f create_dict_features.sql

5. Create the input table required for SVM training:
   psql -f create_train_input.sql

6. Train the SVM:
   psql -f svm_train.sql

The following sections walk you through training the SVM model.

Create the SVM Model Schema and Tables

Enter:

psql -f svm.ddl

svm.ddl creates the tables svm_demo.train_data and svm_demo.test_data to contain the training and test data. They have the same schema:

Field      Data type   Description
id         bigint      The document ID.
doc_data   text        The document content.
class_id   float8      The classification: +1 for student, -1 for course.
                       Note: This field is left empty in svm_demo.test_data.

Populate the Training Data Table

Enter:

gpload -f load_train.yaml

load_train.yaml loads the svm_demo.train_data table with the training data from the file svm_train.csv. There are 1370 records: 750 of type "student" (class_id = +1) and 620 of type "course" (class_id = -1).


Index the Data In the Training Data Table

Enter:

psql -f index_train_data.sql

The index_train_data.sql script indexes the doc_data field in the svm_demo.train_data table and enables terms for this field.

Note: If there are empty documents in the table, they are removed during indexing.

Create (or Extract) Features In the Training Data Table

Enter:

psql -f create_dict_features.sql

The create_dict_features.sql script creates a terms table, svm_demo.train_terms_table, for the doc_data field in the svm_demo.train_data table.

create_dict_features.sql then creates a dictionary table, svm_demo.dictionary, from svm_demo.train_terms_table that contains the extracted features (terms) to use for training and testing.

Create the Input Table for SVM Training

Enter:

psql -f create_train_input.sql

Create Float Vectors

The create_train_input.sql script invokes gptext.gen_float8_vec() to create float vector representations for each training document, then places the documents in the table svm_demo.train_corpus. The svm_demo.train_corpus table has two fields: the document id and the float vector representation (with term frequencies) for that document. Inputs to gptext.gen_float8_vec() are the svm_demo.train_terms_table and the svm_demo.dictionary.

This procedure is the same as creating sparse vectors in the k-means demo (see “Create Sparse Vectors”), except that the float vectors are not typecast to MADlib sparse vectors.

Create the Training Input Table

The create_train_input.sql script creates a table called svm_demo.train_input. svm_demo.train_input is the same as the svm_demo.train_corpus table, with the addition of a third field, label. The content of label is the class_id from the svm_demo.train_data table (+1 for "student" and -1 for "course").
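A hypothetical sketch of that step; the float-vector column name vec is an assumption:

-- Sketch: attach the known classification to each training vector.
CREATE TABLE svm_demo.train_input AS
SELECT c.id,
       c.vec AS ind,           -- float vector features
       d.class_id AS label     -- +1 = student, -1 = course
FROM svm_demo.train_corpus c
     JOIN svm_demo.train_data d ON (c.id = d.id)
DISTRIBUTED BY (id);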

Train the SVM

Enter:

psql -f svm_train.sql


The MADlib function madlib.lsvm_classification() is used to train the SVM. Its input is the svm_demo.train_input table. This demo uses a linear classifier. The learned model from the SVM training is stored in the table svm_demo.model_table, which has the following fields:

Field    Type      Description
id       text      The name of the model table (svm_demo.model_table).
weight   float[]   An array of weights for the terms. In this demo, there are 6559 elements in the array, corresponding to the 6559 terms in the dictionary.
wdiv     float     A scaling factor for the weights.
wbias    float     Offset bias of the linear model.
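The training call inside svm_train.sql presumably resembles the following sketch, assuming the basic form of madlib.lsvm_classification() that takes the input table name, the model table name, and a parallel flag; the argument order and the false flag are assumptions:

-- Sketch: train a linear SVM on the labeled training vectors,
-- using a single (non-parallel) model.
SELECT madlib.lsvm_classification('svm_demo.train_input',
                                  'svm_demo.model_table',
                                  false);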

Test the SVM Model

The steps for testing the SVM model are:

1. Populate the testing data table:
   gpload -f load_test.yaml

2. Index the testing data table:
   psql -f index_test_data.sql

3. Form vectors from the testing data using features selected during training:
   psql -f create_test_input.sql

4. Predict classes for the testing data:
   psql -f svm_predict_batch.sql

The following sections walk you through testing the SVM model.

Populate the Testing Data Table

Enter:

gpload -f load_test.yaml

In this step, the svm_demo.test_data table is loaded with the testing data in svm_test.csv. There are 400 records: 200 of type "student" (class_id = +1) and 200 of type "course" (class_id = -1).

Index the Testing Data Table

Enter:

psql -f index_test_data.sql

The doc_data field in the svm_demo.test_data table is indexed. Terms are also enabled for this field.


Note: If there are empty documents in the table, they will be removed when indexing.

Create the Input Table For SVM Testing

Enter:

psql -f create_test_input.sql

A terms table called svm_demo.test_terms_table for the test documents is created.

The GPText function gptext.gen_float8_vec() is used to create float vector representations for each of the 400 test documents. They are placed in a table called svm_demo.test_corpus, which has two fields: the document id and the float vector representation for that document.

Inputs to the gptext.gen_float8_vec() function are the svm_demo.test_terms_table, created in this step, and the svm_demo.dictionary, created in “Create (or Extract) Features In the Training Data Table”.

Notes: (1) The same dictionary that was developed for training is used for testing. (2) For testing, the input data will be svm_demo.test_corpus, without adding the label field (classification) as was done with the training input.

Predict Classes For the Testing Data

Enter:

psql -f svm_predict_batch.sql

The MADlib function madlib.lsvm_predict_batch() is used for predictions on the test data using the model learned during SVM training. Its inputs are:

Parameter               Description
svm_demo.test_corpus    The test input table created in “Create the Input Table For SVM Testing”.
ind                     The name of the field in test_corpus that contains the float vectors.
id                      The name of the document ID field.
svm_demo.model_table    The model table developed in training (see “Train the SVM”).
svm_demo.output_table   The name of the output table that will contain the predictions.
parallel                MADlib variable that determines whether to use multiple learning models. Set equal to null for a single model, as in this demo.
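Given those parameters, the call in svm_predict_batch.sql presumably looks like this sketch; the argument order is an assumption based on the table above:

-- Sketch: batch-predict classes for the test vectors.
SELECT madlib.lsvm_predict_batch('svm_demo.test_corpus',   -- test input table
                                 'ind',                    -- float-vector column
                                 'id',                     -- document ID column
                                 'svm_demo.model_table',   -- trained model
                                 'svm_demo.output_table',  -- predictions table
                                 NULL);                    -- single model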

The output table (svm_demo.output_table) has two fields: id, prediction.

The values in the prediction field are considered to belong to either the positive (student) or the negative (course) class, depending on whether the value is positive or negative.

Examine the results

First, convert svm_demo.output_table to a table named svm_demo.predictions. Enter:

psql -f create_predict_table.sql

To see the first 10 records of the predictions table, enter:

psql -f show_predictions.sql

Note: show_predictions.sql contains the query SELECT * FROM svm_demo.predictions ORDER BY id LIMIT 10; To show more than 10 rows, increase the LIMIT value.

Sample predictions Output

Following is sample output from the predictions table.

 id |    prediction     | predicted_label
----+-------------------+-----------------
  1 | 0.289759919468028 | student
  2 | -14.7617320266913 | course
  3 | 16.9256194389078  | student
  4 | 8.91576115177027  | student
  5 | -8.60030257504716 | course
  6 | -49.2657369558985 | course
  7 | 12.0844962983305  | student
  8 | 10.588149145788   | student
  9 | 4.77880137709461  | student
 10 | -9.12842509947386 | course
(10 rows)

Calculate the Accuracy of the Predictions

Enter:

psql -f accuracy.sql

The accuracy.sql script calculates the accuracy of the testing: the percentage of test documents that were classified correctly. For example:

     accuracy
------------------
 91.9395465994962
(1 row)

The result indicates that approximately 91.9 percent of the test documents were classified correctly.
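
The accuracy.sql script is not reproduced here; a query of the following shape produces such a figure, assuming the true labels are available in the test table's class_id column (+1 for student, -1 for course):

    SELECT 100.0 * sum(CASE WHEN sign(p.prediction) = t.class_id
                            THEN 1 ELSE 0 END) / count(*) AS accuracy
    FROM svm_demo.predictions p
    JOIN svm_demo.test_data t ON t.id = p.id;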

6. LDA Topic Modeling Using GPText

Latent Dirichlet allocation (LDA) is a generative probabilistic model for natural language text and is used for topic modeling of documents. LDA assumes that each document is generated from a probability distribution over a set of topics, and that each word in the document belongs to one of those topics. LDA generates a topic distribution for each document; for example, a document might consist of topic 1 with probability 0.7, topic 2 with probability 0.1, and topic 3 with probability 0.2. The number of topics must be specified in advance, although there are variations of LDA, such as hierarchical latent Dirichlet allocation (HLDA), that do not require the number of topics to be supplied.

The LDA demonstration shows how to use GPText with LDA.

The GPText demonstration virtual machine provides sample data and queries for training, testing, and running a demo LDA, including:

• A set of 500 documents from webkb-train-stemmed for training and testing
• A set of queries for running the LDA algorithm on the documents

The documents are from the WebKB collection at http://web.ist.utl.pt/~acardoso/datasets, a site that provides sample data sets for testing machine learning algorithms. For the LDA demo, the training data set contains 500 documents.

The testing data contains 200 instances from webkb-test-stemmed.

The name of the database is demo. It is the same database that is used for the social media search and k-means demos, but with a different schema for the LDA demo.

Open the demo files and look at the queries to get the maximum benefit from running the demo.

The general steps for using the LDA demo are:

• Set up the LDA demo
• Train the LDA model
• Test the LDA model

Set up the LDA demo

1. Install and configure the demo virtual machine. See Install and Configure the Demonstration Virtual Machine.

2. Start the Greenplum database by executing gpstart -a.

3. Start the text search engine by executing gptext-start.

4. Change to the /home/gpadmin/demo/analytics/lda directory.

Train the LDA model

The steps for training the LDA model are:

1. Create the schema and two tables (training data and testing data): psql -f lda.ddl

2. Insert training data into the training data table: gpload -f load_train.yaml

3. Index the data in the training data table: psql -f index_train_data.sql

4. Create the terms table for the indexed training data: psql -f create_train_terms.sql

5. Extract terms from the training data: psql -f create_dictionary.sql

6. Create the input table required for LDA training: psql -f create_train_input.sql

7. Train the LDA model: psql -f lda_train.sql

8. Show the top k terms (words) from the topics: psql -f top_words_per_topic.sql

The following sections walk you through training the LDA model.

Create the LDA Model Schema and Tables

Enter:

psql -f lda.ddl

lda.ddl creates the tables lda_demo.train_data and lda_demo.test_data to contain the training and test data. They have the same schema:

Field         Data type  Description
------------  ---------  ---------------------
id            bigint     The document id.
message_body  text       The document content.
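
A sketch of the kind of DDL lda.ddl contains (the actual script may differ, for example in its distribution clauses):

    CREATE SCHEMA lda_demo;

    CREATE TABLE lda_demo.train_data (
        id           bigint,   -- the document id
        message_body text      -- the document content
    ) DISTRIBUTED BY (id);

    CREATE TABLE lda_demo.test_data (LIKE lda_demo.train_data)
        DISTRIBUTED BY (id);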

Populate the Training Data Table

Enter:

gpload -f load_train.yaml

load_train.yaml loads the lda_demo.train_data table with the training data from the file lda_train.csv. There are 500 records.

Index the Data In the Training Data Table

Enter:

psql -f index_train_data.sql

The index_train_data.sql script indexes the message_body field in the lda_demo.train_data table and enables terms for this field.

Note: If there are empty documents in the table, they are removed during indexing.

Create Training Terms

Enter:

psql -f create_train_terms.sql

The create_train_terms.sql script creates the lda_demo.train_terms table by using the gptext.terms() function to retrieve the term vectors from the Solr index.
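
A sketch of how such a script can call gptext.terms(); the index name and argument list below are assumptions based on the GPText 1.2 API, not the verbatim contents of create_train_terms.sql:

    CREATE TABLE lda_demo.train_terms AS
    SELECT * FROM gptext.terms(
        TABLE(SELECT 1 SCATTER BY 1),   -- standard GPText scatter source
        'demo.lda_demo.train_data',     -- index name (assumed)
        'message_body',                 -- field with terms enabled
        '*:*',                          -- match all documents
        NULL                            -- no filter queries
    );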

Create the Term Dictionary

Enter:

psql -f create_dictionary.sql

The create_dictionary.sql script creates the lda_demo.dictionary table, which contains a record for each term. This dictionary is used for both training and testing.

The MADlib LDA library does not support the BIGINT datatype, so the script executes an ALTER TABLE statement to cast the wordid column to INT4.
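
That cast amounts to a statement along these lines:

    ALTER TABLE lda_demo.dictionary ALTER COLUMN wordid TYPE INT4;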

Create the Input Data for LDA Training

Enter:

psql -f create_train_input.sql

The create_train_input.sql script creates the lda_demo.train_input_data table. This table has three INTEGER columns: docid, wordid, and count. The query in the script constructs train_input_data with a join on the train_terms and dictionary tables.

The lda_demo.train_input_data table is one of the inputs to the MADlib lda_train() function. Because lda_train() does not support the BIGINT datatype, an ALTER TABLE statement is executed to convert the BIGINT docid column to INT4.
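
A sketch of the shape of that join, assuming train_terms pairs each document id with its terms and dictionary maps each term to a wordid (the real query in create_train_input.sql may aggregate differently):

    CREATE TABLE lda_demo.train_input_data AS
    SELECT t.id AS docid,
           d.wordid,
           count(*)::INT4 AS count
    FROM lda_demo.train_terms t
    JOIN lda_demo.dictionary d ON d.term = t.term
    GROUP BY t.id, d.wordid;

    -- lda_train() does not accept BIGINT, so the docid column is converted:
    ALTER TABLE lda_demo.train_input_data ALTER COLUMN docid TYPE INT4;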

Train the LDA

Enter:

psql -f lda_train.sql

The lda_train.sql script runs the MADlib lda_train() function on the lda_demo.train_input_data table. This stores the model in the table lda_demo.train_model_table and the topic counts and assignments for each input document in the table lda_demo.train_output_table.
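
The MADlib lda_train() function takes the input table, the two output tables, the vocabulary size, the number of topics, the iteration count, and the two Dirichlet priors. A sketch, with the numeric values assumed rather than copied from lda_train.sql:

    SELECT madlib.lda_train(
        'lda_demo.train_input_data',    -- input table: (docid, wordid, count)
        'lda_demo.train_model_table',   -- output: the learned model
        'lda_demo.train_output_table',  -- output: topic counts and assignments
        5000,   -- voc_size: number of terms in the dictionary (value assumed)
        5,      -- topic_num: number of topics (value assumed)
        20,     -- iter_num: training iterations (value assumed)
        5.0,    -- alpha: Dirichlet prior on per-document topic weights (assumed)
        0.01    -- beta: Dirichlet prior on per-topic word weights (assumed)
    );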

Show Top k Terms from the Topics

Enter:

psql -f top_words_per_topic.sql

The top_words_per_topic.sql script executes the MADlib lda_get_topic_desc() function to generate the k most probable terms for each topic, and then displays them.
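
In the MADlib LDA module, lda_get_topic_desc() takes the model table, the vocabulary (dictionary) table, an output table for the descriptions, and the number of terms to report per topic. A sketch, with the output table name and top-k value assumed:

    SELECT madlib.lda_get_topic_desc(
        'lda_demo.train_model_table',   -- the trained model
        'lda_demo.dictionary',          -- vocabulary mapping wordid to term
        'lda_demo.topic_desc',          -- output table name (assumed)
        10                              -- report the 10 most probable terms per topic
    );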

Test the LDA Model

The steps for testing the LDA model are:

1. Load the test data into the testing table: gpload -f load_test.yaml

2. Index the testing data table: psql -f index_test_data.sql

3. Create a terms table for the indexed testing data: psql -f create_test_terms.sql

4. Create an input table in the format required for LDA testing: psql -f create_test_input.sql

5. Apply the learned model to the testing data: psql -f lda_predict.sql

6. Query the prediction results: psql -f topic_count.sql

The following sections walk you through testing the LDA model.

Load the Test Data into the Testing Data Table

Enter:

gpload -f load_test.yaml

In this step, the lda_demo.test_data table is loaded with the test data in lda_test.csv. Two hundred records are loaded.

Index the Testing Data Table

Enter:

psql -f index_test_data.sql

The message_body field in the lda_demo.test_data table is indexed. Terms are also enabled for this field.

Note: If there are empty documents in the table, they will be removed when indexing.

Create a Terms Table for the Indexed Test Data

Enter:

psql -f create_test_terms.sql

The create_test_terms.sql script creates the lda_demo.test_terms table by using the gptext.terms() function to retrieve the term vectors from the Solr index.

Create the Input Table For LDA Testing

Enter:

psql -f create_test_input.sql

The create_test_input.sql script creates a table called lda_demo.test_input_data for the test documents. This table is in the format required by the MADlib lda_predict() function. The format is the same as the input format for the lda_train() function. The same dictionary that was developed for training is used for testing.

Since the lda_predict() function does not support the BIGINT datatype, an ALTER TABLE statement is executed to change the docid column to INT4.

Predict Classes For the Testing Data

Enter:

psql -f lda_predict.sql

The MADlib function madlib.lda_predict() is used for predictions on the test data using the model created during LDA training. The function saves the predictions in the table lda_demo.test_output_table. Each row in the table stores the topic distribution and the topic assignments for a document in the test data.
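
In the MADlib LDA module, lda_predict() takes the test input table, the model table, and the output table. A sketch with this demo's table names:

    SELECT madlib.lda_predict(
        'lda_demo.test_input_data',     -- test input: (docid, wordid, count)
        'lda_demo.train_model_table',   -- model learned during training
        'lda_demo.test_output_table'    -- output: topic distribution per document
    );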

Query the Prediction Results

Enter:

psql -f topic_count.sql

The topic_count.sql script shows the per-document topic counts, that is, the number of words in each document that belong to each topic.
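
The topic_count.sql script is not reproduced here; if the output table follows the MADlib LDA column conventions, a query of this shape displays the counts (column names assumed):

    SELECT docid, topic_count
    FROM lda_demo.test_output_table
    ORDER BY docid;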
