parti:qualityphraseminingink-ron.usc.edu/xiangren/sigmod17-structnet-part1.pdf · 2019. 6. 29. ·...
Building Structured Databases of Factual Knowledge from Massive Text Corpora
Part I: Quality Phrase Mining
Effort-Light StructMine: Methodology
• Data-driven text segmentation (SIGMOD’15, WWW’16)
• Entity names & context units
• Partially-labeled corpus
• Corpus-specific structure discovery (KDD’15, KDD’16, EMNLP’16, WWW’17)
• Structures from the remaining unlabeled data
• Knowledge bases
• Text corpus
Quality Phrase Mining
• Quality phrase mining seeks to extract a ranked list of phrases with decreasing quality from a large collection of documents
• Examples:
  • Scientific papers, expected results: data mining, machine learning, information retrieval, …, support vector machine, …, the paper, …
  • News articles, expected results: US President, Anderson Cooper, Barack Obama, …, Obama administration, …, a town, …
Why Phrase Mining?
• Without phrase mining
  • What is “united”?
  • Which Dao?
• With phrase mining
  • United Airlines!
  • David Dao!
• Applications in NLP, IR, and text mining
  • Document analysis
  • Indexing in search engines
  • Keyphrases for topic modeling
  • Summarization
What Kind of Phrases Are of “High Quality”?
• Popularity
  • “information retrieval” > “cross-language information retrieval”
• Concordance
  • “strong tea” > “powerful tea”
  • “active learning” > “learning classification”
• Informativeness
  • “this paper” (frequent but not discriminative, not informative)
• Completeness
  • “support vector machine” > “vector machine”
Three Families of Methods
• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly / distantly supervised
Supervised Phrase Mining
• Phrase mining originated in the NLP community
• How to use linguistic analyzers to extract phrases?
  • Parsing (e.g., Stanford NLP parsers)
  • Noun phrase (NP) chunking
• How to rank extracted phrases?
  • C-value [Frantzi et al.’00]
  • TextRank [Mihalcea et al.’04]
  • TF-IDF
Linguistic Analyzer – Parsing
• Full-text parsing: raw text sentence (string) → full parse tree (grammatical analysis)
• Example sentence: “The chef cooks the soup.”
• Minimal grammatical segments ⇔ phrases
• Phrases: “the chef”, “the soup”
Linguistic Analyzer – Chunking
• Noun phrase chunking is a light version of parsing
1. Apply tokenization and part-of-speech (POS) tagging to each sentence
2. Search for noun phrase chunks
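The two steps above can be sketched as follows. This is only a minimal illustration: it assumes tokenization and POS tagging were already done by some tagger and that the tags follow the Penn Treebank convention. The function name `np_chunks` and the chunk pattern (optional determiner, any adjectives, one or more nouns) are illustrative choices, not something the slides specify.

```python
import re

def np_chunks(tagged_tokens):
    """Return noun phrase chunks from a list of (word, POS tag) pairs."""
    def code(tag):
        # Collapse each tag into one character so a regex can scan the tag sequence.
        if tag == "DT":
            return "D"
        if tag.startswith("JJ"):
            return "J"
        if tag.startswith("NN"):
            return "N"
        return "O"

    codes = "".join(code(tag) for _, tag in tagged_tokens)
    chunks = []
    # Illustrative NP pattern: optional determiner, adjectives, then nouns.
    for match in re.finditer(r"D?J*N+", codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks

# Pre-tagged version of the example sentence from the parsing slide.
tagged = [("The", "DT"), ("chef", "NN"), ("cooks", "VBZ"),
          ("the", "DT"), ("soup", "NN"), (".", ".")]
chunks = np_chunks(tagged)
print(chunks)
```

A real chunker would use a richer grammar or a trained model; the regex here only shows why chunking is cheaper than building a full parse tree.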
Inefficiencies of Linguistic Analyzers
• Pre-trained analyzers are difficult to apply directly to new domains (e.g., Twitter, biomedical text, Yelp)
  • Unless sophisticated, manually curated, domain-specific training data are provided
• Computationally slow
  • Cannot be applied to web-scale data to support emerging applications
• No use of corpus-level information
  • NPs sometimes cannot meet the requirements of quality phrases
• We need “shallow” phrase mining techniques
Ranking
• C-value
  • Prefers “maximal” phrases
  • Popularity & completeness
• TextRank
  • Similar to PageRank
  • Popularity & informativeness
• TF-IDF
  • Term frequency
  • Inverse document frequency
  • Popularity & informativeness
Example text: “Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. …”
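To make the TF-IDF ranking on this slide concrete, here is a small sketch. It assumes candidate phrases have already been extracted per document, and it uses one common variant of the formula (raw count times log(N / document frequency)); the slides do not say which variant is intended, and the function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of documents, each a list of candidate phrases.
    Returns one {phrase: score} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each phrase once per document

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({
            phrase: term_freq[phrase] * math.log(n_docs / doc_freq[phrase])
            for phrase in term_freq
        })
    return scores

# Toy corpus: "this paper" appears in every document, so its IDF is zero.
docs = [
    ["data mining", "data mining", "this paper"],
    ["machine learning", "this paper"],
    ["information retrieval", "this paper"],
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

The popular but uninformative phrase “this paper” drops to the bottom, which is the popularity-plus-informativeness behavior the slide attributes to TF-IDF.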
Three Families of Methods
• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly / distantly supervised
Unsupervised Phrase Mining
• Statistics based on massive text corpora
  • Popularity
    • Raw frequency
    • Frequency distribution based on Zipfian ranks [Deane’05]
  • Concordance
    • Significance score [Church et al.’91] [El-Kishky et al.’14]
  • Completeness
    • Comparison to super/sub-sequences [Parameswaran et al.’10]
Raw Frequency
• Raw frequency alone does NOT reflect the quality of phrases
• Combine with topic modeling
  • Merge adjacent unigrams of the same topic [Blei & Lafferty’09]
  • Frequent pattern mining within the same topic [Danilevsky et al.’14]
• Limitations
  • Tokens in the same phrase may be assigned to different topics
  • E.g., knowledge discovery using least squares support vector machine classifiers …
Frequency Distribution
• Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequency
• Heuristic: Actual Rank / Expected Rank
• Example:
  • Given a phrase like “east end”
  • Actual rank: rank “east end” among all occurrences of “east” (e.g., “east end”, “east side”, “the east”, “towards the east”, etc.)
  • Expected rank: rank “__ end” among all contexts of “east” (e.g., “__ end”, “__ side”, “the __”, “towards the __”, etc.)
Significance score
• Significance score [Church et al.’91]
  • A.k.a. Z score
• ToPMine [El-Kishky et al.’14]
  • If a phrase can be decomposed into two parts
    • $P = P_1 \oplus P_2$
    • $\alpha(P_1, P_2) \approx \dfrac{f(P_1 \oplus P_2) - \mu_0(P_1, P_2)}{\sqrt{f(P_1 \oplus P_2)}}$
Significance score (cont’d)
• Merge adjacent unigrams greedily if their significance score is above the threshold.
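The greedy merging step can be sketched roughly as below. This is a simplified illustration, not the ToPMine implementation. It assumes the expected count under independence is μ0(P1, P2) = f(P1) · f(P2) / L, where L is the total number of tokens in the corpus (the slides do not define μ0, so this is an assumption). The helper names, the `max_len` cap, the toy corpus, and the threshold value are all hypothetical.

```python
import math
from collections import Counter

def ngram_counts(docs, max_len=4):
    """Count every n-gram of up to max_len tokens; also return the total token count."""
    counts, total = Counter(), 0
    for tokens in docs:
        total += len(tokens)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts, total

def significance(left, right, counts, total):
    """(observed - expected) / sqrt(observed) for the concatenation left + right."""
    observed = counts[left + right]
    if observed == 0:
        return float("-inf")
    expected = counts[left] * counts[right] / total  # assumed independence baseline
    return (observed - expected) / math.sqrt(observed)

def greedy_segment(tokens, counts, total, threshold):
    """Repeatedly merge the adjacent pair with the highest score above the threshold."""
    phrases = [(tok,) for tok in tokens]
    while len(phrases) > 1:
        scores = [significance(phrases[i], phrases[i + 1], counts, total)
                  for i in range(len(phrases) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] <= threshold:
            break
        phrases[best:best + 2] = [phrases[best] + phrases[best + 1]]
    return [" ".join(p) for p in phrases]

docs = [
    "we train a support vector machine".split(),
    "the support vector machine is fast".split(),
    "a fast machine runs here".split(),
    "we like the fast train".split(),
]
counts, total = ngram_counts(docs)
segmented = greedy_segment(docs[1], counts, total, threshold=1.0)
print(segmented)
```

On a corpus this small the scores are only indicative; the point is the control flow: score adjacent pairs, merge the best one, and stop once no pair clears the threshold.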
Comparison to super/sub-sequences
• Frequency ratio between an n-gram phrase and its two (n-1)-gram phrases
• Example
  • Pre-confidence of San Antonio: 2385 / 14585
  • Post-confidence of San Antonio: 2385 / 2855
• Expand / terminate based on thresholds
| Phrase | Raw frequency |
| --- | --- |
| San | 14585 |
| Antonio | 2855 |
| San Antonio | 2385 |
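The two ratios in the example follow directly from the frequency table. The short sketch below reproduces them; the function name `confidences` and the dictionary layout are illustrative only, and the thresholds used for the expand / terminate decision are not given in the slides, so none are assumed here.

```python
def confidences(freq, left, right):
    """Pre- and post-confidence of the bigram 'left right'.

    pre  = f(left right) / f(left)
    post = f(left right) / f(right)
    """
    joint = freq[f"{left} {right}"]
    return joint / freq[left], joint / freq[right]

# Raw frequencies from the table on this slide.
freq = {"San": 14585, "Antonio": 2855, "San Antonio": 2385}
pre, post = confidences(freq, "San", "Antonio")
print(f"pre-confidence  = {pre:.4f}")
print(f"post-confidence = {post:.4f}")
```

The post-confidence is much higher than the pre-confidence: most occurrences of “Antonio” follow “San”, while “San” starts many other phrases.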
Comparison to super/sub-sequences (cont’d)
• Assumption
  • For an n-gram quality phrase and its two (n-1)-gram sub-phrases, at least one of the sub-phrases is not a quality phrase.
• Anti-example
  • “relational database system” is a quality phrase.
  • Both “relational database” and “database system” can be quality phrases.
Limitations of Statistical Signals
• The thresholds should be carefully chosen.
• Only a subset of the quality phrase requirements is considered.
• Combining different signals in an unsupervised manner is difficult.
  • Introducing some supervision may help!
Three Families of Methods
• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly / distantly supervised
Weakly / Distantly Supervised Phrase Mining Methods
• SegPhrase [Liu et al.’15]
  • Weakly supervised
• AutoPhrase [Shang et al.’17]
  • Distantly supervised
SegPhrase
• Pipeline: input raw corpus → phrase mining → quality phrases → phrasal segmentation → segmented corpus
• Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
• Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
• Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
• Outperforms all of the above methods on domain-specific corpora (e.g., Yelp reviews)
Quality Estimation
• Weakly supervised
  • Labels: whether a phrase is a quality one or not
    • “support vector machine”: 1
    • “the experiment shows”: 0
  • For a ~1 GB corpus, only 300 labels
• Pros
  • Binary annotations are easy
• Cons
  • Selecting hundreds of varying-quality phrases from millions of candidates must be done carefully.
Phrasal Segmentation
• Phrasal segmentation can tell which phrase is more appropriate
  • Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe...
• Effects on quality re-estimation (real data)
  • np hard in the strong sense
  • np hard in the strong
  • database management system
• Figure annotation: not counted towards the rectified frequency
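The idea of a rectified frequency can be illustrated with a short sketch. It assumes a segmented corpus is already available, represented as a list of segments per document, and compares a raw substring count with a count over whole segments only. The function names and this representation are illustrative assumptions, and reading “vector machine” as the uncounted phrase in the slide’s example is an interpretation of the figure annotation.

```python
def raw_frequency(phrase, segmented_docs):
    """Count every occurrence of the phrase in the token stream, ignoring segment boundaries."""
    target = phrase.split()
    n = len(target)
    total = 0
    for segments in segmented_docs:
        tokens = [tok for seg in segments for tok in seg.split()]
        total += sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
    return total

def rectified_frequency(phrase, segmented_docs):
    """Count only occurrences where the phrase is exactly one segment."""
    return sum(seg == phrase for segments in segmented_docs for seg in segments)

# The example sentence from the slide, already segmented.
segmented_docs = [
    ["a", "standard", "feature vector", "machine learning", "setup",
     "is", "used", "to", "describe"],
]
for phrase in ["feature vector", "machine learning", "vector machine"]:
    print(phrase, raw_frequency(phrase, segmented_docs),
          rectified_frequency(phrase, segmented_docs))
```

“vector machine” occurs once in the raw text but straddles two segments, so its rectified frequency is zero, which lowers its quality estimate on re-estimation.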
Interesting Phrases Mined (From Titles & Abstracts of SIGMOD/SIGKDD Proceedings)
AutoPhrase
• No label selection and annotation effort
• Smoothly supports multiple languages
How to get rid of human effort?
• Basic idea:
  • Knowledge bases can give us a clean positive pool
  • The remaining frequent n-grams form a noisy negative pool. However, the ratio of false negatives is low.
  • Ensemble: average the predictions from base classifiers
• Independence helps to denoise
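A toy sketch of this idea is shown below. It is a heavy simplification and not the AutoPhrase implementation: each candidate has a single hypothetical quality feature, each base classifier is a threshold placed midway between the mean feature of a small random positive sample and a small random negative sample, and the final score averages the base predictions. The phrase list, feature values, sample size, and function names are all invented for illustration.

```python
import random
from statistics import mean

def build_pools(candidates, knowledge_base):
    """Positive pool: candidates found in the knowledge base. Negative pool: the rest (noisy)."""
    positives = [p for p in candidates if p in knowledge_base]
    negatives = [p for p in candidates if p not in knowledge_base]
    return positives, negatives

def ensemble_scores(feature, knowledge_base, n_classifiers=50, sample_size=3, seed=0):
    """Average the 0/1 predictions of many threshold classifiers trained on small samples."""
    rng = random.Random(seed)
    candidates = list(feature)
    positives, negatives = build_pools(candidates, knowledge_base)
    thresholds = []
    for _ in range(n_classifiers):
        pos = rng.sample(positives, sample_size)
        neg = rng.sample(negatives, sample_size)
        # Base classifier: midpoint between the two sampled class means.
        thresholds.append((mean(feature[p] for p in pos) + mean(feature[p] for p in neg)) / 2)
    return {p: mean(1.0 if feature[p] > t else 0.0 for t in thresholds) for p in candidates}

# Hypothetical single quality feature per candidate phrase.
feature = {
    "machine learning": 0.90, "data mining": 0.85,
    "information retrieval": 0.80, "support vector machine": 0.95,
    "topic model": 0.90,  # quality phrase missing from the knowledge base (false negative)
    "this paper": 0.10, "we propose": 0.05, "in this": 0.15,
    "the results": 0.20, "of the": 0.05,
}
knowledge_base = {"machine learning", "data mining",
                  "information retrieval", "support vector machine"}
scores = ensemble_scores(feature, knowledge_base)
print(scores["topic model"], scores["this paper"])
```

Because each base classifier sees only a few negatives, the single false negative rarely dominates any one of them, and averaging over many of them still ranks “topic model” as high quality.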
AutoPhrase’s Example Results
References
Deane, P., 2005, June. A nonparametric method for extraction of candidate phrasal terms. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 605-613). Association for Computational Linguistics.
Koo, T., Carreras Pérez, X. and Collins, M., 2008. Simple semi-supervised dependency parsing. In 46th Annual Meeting of the Association for Computational Linguistics (pp. 595-603).
Xun, E., Huang, C. and Zhou, M., 2000, October. A unified statistical model for the identification of English baseNP. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 109-116). Association for Computational Linguistics.
Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F., 2008, May. A comparative evaluation of term recognition algorithms. In LREC.
Park, Y., Byrd, R.J. and Boguraev, B.K., 2002, August. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th international conference on Computational linguistics-Volume 1 (pp. 1-7). Association for Computational Linguistics.
Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G., 1999, August. KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries (pp. 254-255). ACM.
Liu, Z., Chen, X., Zheng, Y. and Sun, M., 2011, June. Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (pp. 135-144). Association for Computational Linguistics.
Evans, D.A. and Zhai, C., 1996, June. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th annual meeting on Association for Computational Linguistics (pp. 17-24). Association for Computational Linguistics.
Frantzi, K., Ananiadou, S. and Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp.115-130.
Mihalcea, R. and Tarau, P., 2004, July. TextRank: Bringing order into texts. Association for Computational Linguistics.
Blei, D.M. and Lafferty, J.D., 2009. Topic models. Text mining: classification, clustering, and applications, 10(71), p.34.
Danilevsky, M., Wang, C., Desai, N., Ren, X., Guo, J. and Han, J., 2014, April. Automatic construction and ranking of topical keyphrases on collections of short documents. In Proceedings of the 2014 SIAM International Conference on Data Mining (pp. 398-406). Society for Industrial and Applied Mathematics.
Church, K., Gale, W., Hanks, P. and Hindle, D., 1991. Using statistics in lexical analysis. Lexical acquisition: exploiting on-line resources to build a lexicon, 115, p.164.
El-Kishky, A., Song, Y., Wang, C., Voss, C.R. and Han, J., 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3), pp.305-316.
Parameswaran, A., Garcia-Molina, H. and Rajaraman, A., 2010. Towards the web of concepts: Extracting concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.566-577.
Liu, J., Shang, J., Wang, C., Ren, X. and Han, J., 2015, May. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729-1744). ACM.
Shang, J., Liu, J., Jiang, M., Ren, X., Voss, C.R. and Han, J., 2017. Automated Phrase Mining from Massive Text Corpora. arXiv preprint arXiv:1702.04457.