GPText: Greenplum Parallel Statistical Text Analysis Framework Kun Li, Christan Grant, Daisy Zhe Wang, Sunny Khatri, George Chitouras [email protected]fl.edu Sunday, June 23, 13

GPText: Greenplum Parallel Statistical Text Analysis Framework

Kun Li, Christan Grant, Daisy Zhe Wang, Sunny Khatri, George Chitouras

[email protected]

Sunday, June 23, 13

In-Database Analytics

• NLP support in RDBMSs is limited

• Implementing ML algorithms in RDBMSs is non-trivial

• Text search in RDBMSs is slow

GPText

A framework for large-scale statistical text analytics over a parallel DBMS.

• The DB provides parallelism and scale.

• Integrated text analytics algorithms with MADlib.

• Specialized architecture for text indexing and search using Solr.

Outline

• Introduction

• GreenplumDB

• GreenplumDB ∪ MADlib

• In-DB Conditional Random Field package

• GreenplumDB ∪ MADlib ∪ Solr

• Demo Screenshots

• Conclusion

Greenplum DB

• A shared-nothing parallel DBMS.

• Parallel PostgreSQL instances.

• Queries are distributed over segments with a parallel query optimizer.
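As a rough illustration of the shared-nothing model in the bullets above, the following Python sketch hash-distributes rows across a few in-memory "segments", lets each segment compute a partial count, and merges the partial results. The names `NUM_SEGMENTS`, `distribute`, `segment_count`, and `parallel_group_count`, as well as the hashing scheme, are assumptions made for illustration only; this is not Greenplum's implementation or its query optimizer, and the segments are processed one after another here rather than in parallel.

```python
from collections import Counter
import zlib

NUM_SEGMENTS = 4  # hypothetical number of segments


def distribute(rows, key):
    """Assign each row to a segment by hashing its distribution key."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        # crc32 is used only because it is deterministic across runs.
        seg = zlib.crc32(str(row[key]).encode("utf-8")) % NUM_SEGMENTS
        segments[seg].append(row)
    return segments


def segment_count(segment_rows, column):
    """Work done locally on one segment: a partial GROUP BY count."""
    return Counter(row[column] for row in segment_rows)


def parallel_group_count(rows, key, column):
    """Scatter rows, aggregate per segment, then merge the partials."""
    partials = [segment_count(seg, column) for seg in distribute(rows, key)]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


rows = [
    {"id": 1, "label": "NP"},
    {"id": 2, "label": "VP"},
    {"id": 3, "label": "NP"},
    {"id": 4, "label": "PP"},
    {"id": 5, "label": "NP"},
]
print(parallel_group_count(rows, key="id", column="label"))
```

Because every row lives on exactly one segment, the per-segment counts can be computed independently and simply added together at the end.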

Greenplum ∪ MADlib

• An open source library for in-database analytics

• A collaborative effort between Berkeley, Wisconsin, and UF

• Maintained by Greenplum

MADlib

Conditional Random Fields

• The linear-chain CRF is used to find the most likely sequence of token labels.
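Finding the most likely label sequence in a linear-chain model is typically done with Viterbi dynamic programming. The sketch below is a minimal in-memory version, assuming log-space scoring functions `emit(token, label)` and `trans(prev_label, label)`; the label set and all scores are made-up toy values, not a trained model, and the code is not the in-database implementation described in these slides.

```python
def viterbi(tokens, labels, emit, trans):
    """Return the highest-scoring label sequence for `tokens`."""
    if not tokens:
        return []
    # best[i][y]: best score of any labeling of tokens[:i + 1] ending in y
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []  # back[i][y]: best previous label when token i + 1 gets label y
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(tok, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy scores; unlisted pairs score 0.0.
LABELS = ["DT", "NN", "VB"]
EMIT = {("the", "DT"): 2.0, ("dog", "NN"): 2.0,
        ("barks", "VB"): 2.0, ("barks", "NN"): 1.0}
TRANS = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0, ("DT", "VB"): -1.0}


def emit(token, label):
    return EMIT.get((token, label), 0.0)


def trans(prev_label, label):
    return TRANS.get((prev_label, label), 0.0)


print(viterbi(["the", "dog", "barks"], LABELS, emit, trans))
```

The running time is linear in the number of tokens and quadratic in the number of labels, which is why the linear-chain restriction keeps inference tractable.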

CRF Architecture

CRF Inference

CRF Scalability

• Single host

• 64 MBs, 32 cores

• CoNLL 2000 dataset

Greenplum ∪ MADlib ∪ Solr

GPText

GPText queries

-- Create an index for a table
select * from gptext.create_index(<schema-name>, <table-name>, <id_col>, <def-search-column>);

-- Add the table's rows to the index
select * from gptext.index(table(select * from <schema-name>.<table>), <index-name>);

-- Commit the indexed rows
select * from gptext.commit_index(<index-name>);

-- Query the index with the pattern 'sigmod*' and store the resulting terms in a table
create table sigmod_terms as select * from gptext.terms(table(select 1 scatter by 1), <index-name>, <search-column>, 'sigmod*', 'rows=10');
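The queries above build an index and then query it with the wildcard pattern 'sigmod*'. The following Python sketch shows, under simplifying assumptions, how an inverted index can answer such a wildcard lookup without scanning every document. The functions `build_index` and `wildcard_search`, the whitespace tokenizer, and the sample documents are all hypothetical; the sketch does not reproduce Solr's data structures or the semantics of `gptext.terms`.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def wildcard_search(index, pattern, rows=10):
    """Return up to `rows` ids of documents with a term matching `pattern`."""
    doc_ids = set()
    for term, postings in index.items():
        # Only the term dictionary is scanned, never the document text.
        if fnmatchcase(term, pattern):
            doc_ids |= postings
    return sorted(doc_ids)[:rows]


docs = {
    1: "SIGMOD 2013 demo track",
    2: "text analytics at sigmod",
    3: "parallel databases and sigmod13 workshops",
    4: "statistical machine learning",
}
index = build_index(docs)
print(wildcard_search(index, "sigmod*"))
```

The cost of a lookup depends on the number of distinct terms rather than on the total amount of text, which is the basic reason a dedicated index speeds up search.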

Concept Application

Conclusion

• We discussed our CRF contribution to MADlib.

• GPText is a scalable framework for text analytics in the database.

• We showed a concept application supporting fast text search.

Thank you

http://dsr.cise.ufl.edu

baby gators @pinterest

Extra Slides

CRF MADlib Interface

• select crf_train/test_data()

• select crf_train/test_fgen()

• select lincrf/vcrf_label()

CRF Training Algorithm

• Features extracted with queries.

• The database takes care of the parallelism.

• Each inner loop updates the state until convergence.

• We can create a temporary table to store the results.

• Use a Python UDF or a WITH statement to control the iterations.

• Iterate the algorithm, performing the crf_lbfgs step each time.

• Increment the iteration counter and check whether training is complete.

• Once it has converged, finalize the resulting features.
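The control flow described in these training slides (keep a state, apply an update step, increment a counter, test for convergence, then finalize) can be sketched in plain Python. This is only an illustration under stated assumptions: a dictionary stands in for the temporary results table, a plain gradient-descent step on a toy quadratic objective stands in for the crf_lbfgs step, and the names `TARGET`, `loss`, `update_step`, `finalize`, and `train` are hypothetical.

```python
TARGET = [1.0, -2.0]  # hypothetical optimum of the toy objective


def loss(weights):
    """Toy objective standing in for the CRF negative log-likelihood."""
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))


def update_step(weights, lr=0.25):
    """One optimizer step (plain gradient descent here, not L-BFGS)."""
    grads = [2.0 * (w - t) for w, t in zip(weights, TARGET)]
    return [w - lr * g for w, g in zip(weights, grads)]


def finalize(state):
    """Turn the final state into the result returned to the caller."""
    return {
        "weights": state["weights"],
        "iterations": state["iteration"],
        "converged": state["converged"],
    }


def train(initial_weights, max_iter=100, tol=1e-8):
    # The state dict plays the role of the temporary results table.
    state = {
        "weights": list(initial_weights),
        "loss": loss(initial_weights),
        "iteration": 0,
        "converged": False,
    }
    while state["iteration"] < max_iter and not state["converged"]:
        new_weights = update_step(state["weights"])
        new_loss = loss(new_weights)
        # Converged when the objective no longer changes noticeably.
        state["converged"] = abs(state["loss"] - new_loss) < tol
        state["weights"], state["loss"] = new_weights, new_loss
        state["iteration"] += 1
    return finalize(state)


result = train([0.0, 0.0])
print(result["iterations"], result["converged"])
```

Keeping the loop driver separate from the update step mirrors the split in the slides, where the driver only controls iterations and the heavy computation happens inside the step.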
