At First Sight - SLAC Conferences, Workshops and Symposiums

At First Sight

Ying Zhang¹, Richard Koopmanschap¹, Martin L. Kersten¹,²
¹ MonetDB Solutions
² CWI Amsterdam

Two halves of a whole

Analytics
• Database: filtering, aggregation, statistical functions, …
• Machine Learning: identification, classification, prediction, …

Data management
• Database: large data set management, complex transaction scenarios, multi-user concurrency, …
• Machine Learning: large collection of features? iterative learning process? post-decision analysis? …

In-Database Machine Learning

• SQL engine
• SQL UDFs
• Embedded process*
• NumPy arrays
• Zero data conversion cost
• Zero data transfer cost

* M. Raasveldt and H. Mühleisen. Vectorized UDFs in Column-Stores. SSDBM ’16. ACM.
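The "zero conversion, zero transfer" idea can be illustrated with a minimal sketch. It is not the MonetDB/Python implementation: the standard-library `array` and `memoryview` types stand in for a database column and a NumPy array, and `vectorized_udf` is a hypothetical name. The point is only that a function can receive a whole column as a view over the existing buffer, without copying or converting individual values.

```python
from array import array

# Stand-in for a database column stored as one contiguous block of C doubles.
column = array("d", [1.0, 2.0, 3.0, 4.0])


def vectorized_udf(values):
    """Hypothetical UDF: receives the whole column at once, not one row at a time."""
    view = memoryview(values)  # zero-copy view over the column's buffer
    mean = sum(view) / len(view)
    return view, mean


view, mean = vectorized_udf(column)

# The view shares memory with the column, so a write to the column
# is visible through the view: no copy was made.
column[0] = 10.0
shares_memory = view[0] == 10.0
```

Because the function sees the column buffer directly, the cost of handing data to it does not grow with the number of rows.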

Dating stories

• Classification
  • M. Raasveldt, P. Holanda, H. Mühleisen and S. Manegold. Deep Integration of Machine Learning Into Column Stores. EDBT 2018.
  • Speed up: 2x Postgres, 40x MySQL!
• Image processing
  • P. Holanda, M. Raasveldt, D. Tomé and P. Boncz. MonetDB/Tensorflow: Performing In-Database Ensemble Learning. Submitted to AMW2018.
• Text analysis
  • T. Kilias, A. Löser, F. A. Gers, R. Koopmanschap, Y. Zhang and M. Kersten. IDEL: In-Database Entity Linking with Neural Embeddings. ArXiv e-prints arXiv:1803.04884, Mar. 2018.

Figure 1: Voter Classification Benchmark

We can see that the in-database processing solution using MonetDB/Python is significantly faster than the alternative database solutions. The time spent on initial wrangling of the data is an order of magnitude lower than transferring it over a socket connection using the other database solutions. We also note that loading the data from CSV files is comparable in speed to transferring the data over a socket connection.

Loading the data from binary files is much faster than loading from structured text or transferring the data over a socket connection. However, this introduces additional challenges in managing the data. Especially in the case of NumPy binary files, where each of the 96 columns is stored as a separate file on disk. We do still see that the in-database processing solution spends less time on initial wrangling of the data and runs the entire pipeline significantly faster.

5 CONCLUSION

In this work, we have shown how complex analysis pipelines can be efficiently integrated into column-store databases. Using these pipelines, it is possible to perform preprocessing, training, testing and prediction using complex machine learning models directly on data stored within a relational database. We have demonstrated the efficiency gained from using these in-database processing methods, and shown the additional benefits that come with storing data in a relational database system.

5.1 Future Work

In our pipeline, there is still some unnecessary overhead in the serialization of the models. Whenever a model is stored in the database, we are serializing it to a BLOB. Before it can be used again, it must be deserialized. For larger models, this can have a performance impact. The database system could be extended to directly store snapshots of the in-memory representation of the models to avoid this (de)serialization overhead.

Additionally, we have only experimented with datasets that fit in memory. Additional work could be done on working with out-of-memory datasets, distributed execution of the UDFs, or applying several models to the data in parallel.
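The model (de)serialization round trip mentioned under Future Work can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: SQLite (via the standard-library `sqlite3` module) stands in for the column store, `pickle` stands in for whatever serializer the pipeline uses, and the `models` table, the toy threshold "model" and the `predict` helper are all hypothetical.

```python
import pickle
import sqlite3

# Hypothetical toy "model": a plain dict of learned parameters.
model = {"kind": "threshold", "threshold": 0.5}


def predict(params, xs):
    """Predict 1 when the input exceeds the stored threshold, else 0."""
    return [1 if x > params["threshold"] else 0 for x in xs]


# SQLite is only a stand-in for the database that holds the models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, body BLOB)")

# Storing the model: serialize it to a BLOB first.
blob = pickle.dumps(model)
conn.execute("INSERT INTO models VALUES (?, ?)", ("toy", blob))

# Using the model again: fetch the BLOB and deserialize it.
(stored,) = conn.execute(
    "SELECT body FROM models WHERE name = ?", ("toy",)
).fetchone()
restored = pickle.loads(stored)

predictions = predict(restored, [0.2, 0.7, 0.9])
```

Both `pickle.dumps` and `pickle.loads` walk the whole object, which is the overhead the text says grows with model size.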

ACKNOWLEDGMENTS

This work was funded by the Netherlands Organisation for Scientific Research (NWO), projects “Process Mining for Multi-Objective Online Control” (Raasveldt), “Data Mining on High-Volume Simulation Output” (Holanda) and “Capturing the Laws of Data Nature” (Mühleisen). We also would like to thank Brian Hentschel, without whom this paper would never have been written.

(courtesy of the authors)

Text analysis

EntityMention
Span     | Mention  | Document
Doc1,0,2 | IBM      | IBM was founded in 1911. Its headquarter is in Armonk. The current CEO is Ginni Rometty.
Doc2,0,1 | HP       | HP, established in 1939, is lead by Dion Weisler.
Doc3,0,7 | HP Inc.  | HP Inc. with its main office in Palo Alto is the new name of Hewlett-Packard Company.
Doc4,0,8 | Big Blue | Big Blue, headquartered in Armonk, pushes its system Watson to new use cases.

Organization joined with Entity-Mention
Name      | Headquarter | Founded | Span     | Mention  | Alt. name               | Product
IBM       | Armonk      | 1911    | Doc1,0,2 | IBM      |                         |
HP        | Palo Alto   | 1939    | Doc2,0,1 | HP       |                         |
Microsoft | Redmond     | 1975    | NULL     | NULL     |                         |
          |             |         | Doc4,0,2 | Big Blue |                         | Watson
          |             |         | Doc3,0,1 | HP Inc.  | Hewlett-Packard Company |

Organization
Name      | Headquarter | Founded
IBM       | Armonk      | NULL
HP        | Palo Alto   | 1939
Microsoft | Redmond     | 1975

Link: Organization.Name to EntityMention.Mention

* T. Kilias, et al. IDEL: In-Database Entity Linking with Neural Embeddings. ArXiv e-prints arXiv:1803.04884, Mar. 2018.
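To see why linking Organization.Name to EntityMention.Mention needs more than string equality, the tables above can be loaded into a small sketch. SQLite (standard-library `sqlite3`) is used here only as a stand-in engine, and the exact-match join is a deliberately naive baseline, not the IDEL method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organization (Name TEXT, Headquarter TEXT, Founded INTEGER)")
conn.execute("CREATE TABLE EntityMention (Span TEXT, Mention TEXT)")

# Rows copied from the slide's Organization and EntityMention tables.
conn.executemany(
    "INSERT INTO Organization VALUES (?, ?, ?)",
    [("IBM", "Armonk", None), ("HP", "Palo Alto", 1939), ("Microsoft", "Redmond", 1975)],
)
conn.executemany(
    "INSERT INTO EntityMention VALUES (?, ?)",
    [("Doc1,0,2", "IBM"), ("Doc2,0,1", "HP"),
     ("Doc3,0,7", "HP Inc."), ("Doc4,0,8", "Big Blue")],
)

# Naive baseline: link a mention to an organization only on exact string equality.
exact_matches = conn.execute(
    """
    SELECT o.Name, m.Span, m.Mention
    FROM Organization AS o
    JOIN EntityMention AS m ON o.Name = m.Mention
    ORDER BY m.Span
    """
).fetchall()

# Mentions the exact join could not link to any organization.
matched = {mention for _, _, mention in exact_matches}
missed = sorted(
    row[0]
    for row in conn.execute("SELECT Mention FROM EntityMention")
    if row[0] not in matched
)
```

The exact join links only "IBM" and "HP"; the alternative names "HP Inc." and "Big Blue" are left unlinked, which is the gap a similarity-based approach is meant to close.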


Text analysis

Problems
• Expensive
  • 3 separate systems for texts, relational data, text analysis
• Low precision/recall
  • Homonyms, hyponyms, synonyms, typos

In-Database Entity Linking with neural embeddings
• Robust to language errors
• Adaptive to new data
• Single system


Text analysis

In-Database Entity Linking with neural embeddings

MonetDB SQL engine with an embedded Python process:
(1) Create embeddings
  • relational data → relational embeddings
  • text data → text embeddings
(2) Search for candidates
  (2.1) Compute similarities
  (2.2) Compute rankings
  (2.3) Select topN candidates

• Choose best exec. env.
  • SQL or Python, CPU or GPU
• Preserve parallel execution!
  • When/how to exchange data
  • Maximise bulk execution
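The candidate search in step (2) can be sketched in a few lines. This is an illustration only: the three-dimensional vectors below are made-up stand-ins for the neural embeddings of step (1), cosine similarity is an assumed choice of similarity measure, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """(2.1) Similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_candidates(mention_vec, entity_vecs, n):
    """(2.2) Rank entities by similarity, then (2.3) keep the best n."""
    scored = [(name, cosine(mention_vec, vec)) for name, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up embeddings standing in for the output of step (1).
entity_embeddings = {
    "IBM": [0.9, 0.1, 0.0],
    "HP": [0.1, 0.9, 0.0],
    "Microsoft": [0.0, 0.1, 0.9],
}
mention_embeddings = {
    "Big Blue": [0.8, 0.2, 0.1],
    "HP Inc.": [0.2, 0.8, 0.1],
}

candidates = {
    mention: top_n_candidates(vec, entity_embeddings, n=2)
    for mention, vec in mention_embeddings.items()
}
```

With these toy vectors, "Big Blue" ranks IBM first and "HP Inc." ranks HP first, even though neither mention matches an organization name exactly. In a real system the per-pair loop would be replaced by bulk array operations, in line with the slide's advice to maximise bulk execution.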


Happily ever after!

• Similar interests
  • strong focus on optimised processing on heterogeneous hardware
  • from multi-core in-memory processing to GPU/FPGA acceleration
• Excellent communication
  • speak the same language, Python
  • seamless information exchange using the same binary data format
• Complementary skills
  • machine learning + data management => AI-enabled big data analytical applications
