Stylometric Analysis of Scientific Articles
Shane Bergsma, Matt Post, David Yarowsky
Department of Computer Science and Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, USA
![Page 1: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/1.jpg)
Stylometric Analysis of Scientific Articles
![Page 2: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/2.jpg)
OUTLINE
- Abstract
- Introduction
- Related Work
- ACL Dataset and Preprocessing
- Stylometric Tasks
- Models and Training Strategies
- Stylometric Features
- Experiments and Results
- Conclusion
![Page 4: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/4.jpg)
Abstract
We present an approach to automatically recover hidden attributes of scientific articles:
- whether the author is a native English speaker
- whether the author is male or female
- whether the paper was published in a conference or workshop proceedings
![Page 6: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/6.jpg)
Introduction
Stylometry aims to recover useful attributes of documents from the style of the writing.
We evaluate stylometric techniques in the novel domain of scientific writing.
![Page 7: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/7.jpg)
Introduction
Success in this challenging domain can bring us closer to correctly analyzing the huge volumes of online text that are currently unmarked for useful author attributes such as gender and native language.
![Page 8: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/8.jpg)
Introduction
New Stylometric Tasks: predict whether a paper is written
- by a native or non-native speaker
- by a male or female
- in the style of a conference or workshop paper

New Stylometric Features: we show the value of syntactic features for stylometry.
- Tree substitution grammar fragments
![Page 10: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/10.jpg)
Related Work
Bibliometrics is the empirical analysis of scholarly literature.
Citation analysis is a well-known bibliometric approach for ranking authors and papers.
But our system does not consider citations.
![Page 12: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/12.jpg)
ACL Dataset and Preprocessing
We use papers from the ACL Anthology Network and exploit its manually curated metadata:
- author names
- affiliations
- citation counts
Papers are parsed via the Berkeley parser, and part-of-speech tagged using CRFTagger.
![Page 13: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/13.jpg)
![Page 15: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/15.jpg)
Stylometric Tasks
Each task has both a Strict and a Lenient training set:
- Strict: uses only the data for which we are most confident in the labels.
- Lenient: forcibly assigns every paper in the training period to some class.

All test papers are annotated using a Strict rule.
![Page 16: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/16.jpg)
![Page 17: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/17.jpg)
NativeL: Native vs. Non-Native English
We introduce the task of predicting whether a scientific paper is written by
- a native English speaker (NES)
- a non-native speaker (NNS)

We annotate papers using two pieces of associated metadata:
- author first names
- countries of affiliation
![Page 18: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/18.jpg)
NativeL: Native vs. Non-Native English
If the first author of a paper has an English first name and an English-speaking-country affiliation, mark NES.
If none of the authors has an English first name or an English-speaking-country affiliation, mark NNS.
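The two Strict rules above could be sketched roughly as follows. The name and country sets are hypothetical placeholders (the slides do not give the actual lists), and papers that satisfy neither rule are left unlabeled here.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical lookup tables; the real lists are not given in the slides.
ENGLISH_FIRST_NAMES: Set[str] = {"john", "mary", "david"}
ENGLISH_SPEAKING_COUNTRIES: Set[str] = {"usa", "uk", "canada", "australia"}

Author = Tuple[str, str]  # (first name, country of affiliation)


def has_english_name(author: Author) -> bool:
    return author[0].lower() in ENGLISH_FIRST_NAMES


def has_english_country(author: Author) -> bool:
    return author[1].lower() in ENGLISH_SPEAKING_COUNTRIES


def label_nativel(authors: List[Author]) -> Optional[str]:
    """Return 'NES', 'NNS', or None when neither Strict rule fires."""
    if not authors:
        return None
    first = authors[0]
    # Rule 1: first author has an English name AND an English-speaking affiliation.
    if has_english_name(first) and has_english_country(first):
        return "NES"
    # Rule 2: no author has an English name or an English-speaking affiliation.
    if not any(has_english_name(a) or has_english_country(a) for a in authors):
        return "NNS"
    return None
```

A paper such as `[("Wei", "USA")]` matches neither rule, which is the kind of case a Strict training set would leave out.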
![Page 19: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/19.jpg)
Venue: Top-Tier vs. Workshop
This novel task aims to distinguish top-tier papers from those at workshops, based on style.
![Page 20: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/20.jpg)
Venue: Top-Tier vs. Workshop
We label all main-session ACL papers as top-tier, and all workshop papers as workshop.
For Lenient training, we assign all conferences to be top-tier except for their non-main-session papers, which we label as workshop.
![Page 21: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/21.jpg)
Gender: Male vs. Female
We use the data of Bergsma and Lin. Each line in the data lists how often a noun co-occurs with male, female, neutral and plural pronouns:
- “bill clinton” is 98% male (in 8344 instances)
- “elsie wayne” is 100% female (in 23 instances)
![Page 22: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/22.jpg)
Gender: Male vs. Female
The data also has aggregate counts over all nouns with the same first token, e.g., ‘elsie...’ is 94% female (in 255 instances).
If the name has an aggregate count >30 and gender probability >0.85, we label its gender.
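A minimal sketch of this thresholding rule is shown below. The count table is invented for illustration (the "elsie" row is chosen to give roughly 94% female over 255 instances), and computing the gender probability as a share of all four pronoun counts is an assumption, not something the slides state.

```python
from typing import Dict, Optional

# Hypothetical aggregate pronoun co-occurrence counts keyed by first token.
AGGREGATE_COUNTS: Dict[str, Dict[str, int]] = {
    "elsie": {"male": 10, "female": 240, "neutral": 3, "plural": 2},
    "pat": {"male": 50, "female": 50, "neutral": 0, "plural": 0},
    "rare": {"male": 20, "female": 0, "neutral": 0, "plural": 0},
}


def label_gender(first_name: str,
                 counts: Dict[str, Dict[str, int]] = AGGREGATE_COUNTS,
                 min_count: int = 30,
                 min_prob: float = 0.85) -> Optional[str]:
    """Return 'male', 'female', or None if the name is too rare or ambiguous."""
    row = counts.get(first_name.lower())
    if row is None:
        return None
    total = sum(row.values())
    if total <= min_count:          # requires aggregate count > 30
        return None
    for gender in ("male", "female"):
        if row[gender] / total > min_prob:   # requires probability > 0.85
            return gender
    return None
```

With this table, `label_gender("Elsie")` yields `"female"`, while the ambiguous and the low-count names are left unlabeled.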
![Page 24: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/24.jpg)
Models and Training Strategies
Model: We take a discriminative approach to stylometry, representing articles as feature vectors and classifying them using a linear, L2-regularized SVM, trained via LIBLINEAR.
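For reference, a linear L2-regularized SVM is commonly written as the objective below. This is the standard hinge-loss form, given here as an assumption: the slides do not say whether the hinge or squared-hinge variant was used, and the symbols ($\mathbf{x}_i$ for an article's feature vector, $y_i \in \{-1,+1\}$ for its label, $C$ for the regularization trade-off) are notation introduced for this sketch.

```latex
\min_{\mathbf{w}} \;\; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
  \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\bigr)
```

The first term is the L2 regularizer and the second penalizes training articles that fall on the wrong side of the margin.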
![Page 25: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/25.jpg)
Models and Training Strategies
Strategy: We test whether it is better to train with a smaller, more accurate Strict set or a larger but noisier Lenient set. We also explore a third strategy.
![Page 26: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/26.jpg)
Models and Training Strategies
We fix the Strict labels, but also include the remaining examples as unlabeled instances.
We then optimize a Transductive SVM, solving an optimization problem where we not only choose the feature weights, but also labels for unlabeled training points.
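One common way to write such a transductive objective is sketched below. It is the textbook formulation rather than necessarily the exact one used here: $(\mathbf{x}_i, y_i)$ are the $n$ Strict-labeled articles, $\mathbf{x}^{*}_j$ are the $m$ unlabeled ones, $y^{*}_j$ are the labels chosen for them, and $C$, $C^{*}$ are trade-off constants. All of these symbols are assumptions made for illustration.

```latex
\min_{\mathbf{w},\; y^{*}_{1},\dots,y^{*}_{m} \in \{-1,+1\}} \;\;
  \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
  \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\bigr)
  \;+\; C^{*} \sum_{j=1}^{m} \max\bigl(0,\; 1 - y^{*}_j\,\mathbf{w}^{\top}\mathbf{x}^{*}_j\bigr)
```

Because the $y^{*}_j$ are optimized together with $\mathbf{w}$, the decision boundary is pushed away from dense regions of unlabeled articles. Implementations typically also constrain the fraction of unlabeled points assigned to the positive class.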
![Page 28: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/28.jpg)
Stylometric Features
We use the following three feature classes; the particular features were chosen based on development experiments.
- Bow features
- Style features
- Syntax features
![Page 29: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/29.jpg)
Bow Features
Results in the text categorization literature have shown that simple bag-of-words representations usually perform better than “more sophisticated” ones.
One key aim of our research is to see whether this is true of scientific stylometry.
![Page 30: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/30.jpg)
Bow Features
Our Bow representation uses a feature for each unique lower-case word-type in an article.
The feature value is the log-count of how often the corresponding word occurs in the document.
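The Bow representation described above might look like the following sketch. Whitespace tokenization and the use of `log(1 + count)` (so that a word seen once still gets a non-zero value) are assumptions; the slides only say "log-count".

```python
import math
from collections import Counter
from typing import Dict


def bow_features(text: str) -> Dict[str, float]:
    """Map each lower-cased word type to the log of its (smoothed) count."""
    tokens = text.lower().split()      # assumed tokenization: split on whitespace
    counts = Counter(tokens)
    # Assumed smoothing: log(1 + count) rather than log(count).
    return {word: math.log(1 + c) for word, c in counts.items()}
```

For example, `bow_features("The parser parses the The text")` has four features, and the one for `"the"` is `log(4)`.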
![Page 31: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/31.jpg)
![Page 32: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/32.jpg)
Style Features
While text categorization relies on keywords, stylometry focuses on topic-independent measures.
We define a style-word to be one of:
- punctuation
- a stopword
- a Latin abbreviation
![Page 33: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/33.jpg)
Style Features
We create Style features for all unigrams and bigrams, replacing non-style-words separately with both PoS-tags and spelling signatures.
Each feature is an N-gram; its value is its log-count in the article.
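As a rough illustration of the Style features, the sketch below builds two views of a tagged sentence (one with PoS tags, one with spelling signatures in place of non-style-words) and counts unigrams and bigrams in each. The stopword and abbreviation sets, the `signature` scheme, and the `log(1 + count)` smoothing are all hypothetical choices, since the slides do not define them.

```python
import math
import string
from collections import Counter
from typing import Dict, List, Tuple

STOPWORDS = {"the", "of", "we", "this", "is", "a", "in"}   # hypothetical list
LATIN_ABBREVS = {"e.g.", "i.e.", "cf.", "etc."}            # hypothetical list


def is_style_word(token: str) -> bool:
    t = token.lower()
    return (t in STOPWORDS or t in LATIN_ABBREVS
            or all(ch in string.punctuation for ch in t))


def signature(token: str) -> str:
    """Hypothetical spelling signature: collapse runs of character classes."""
    out: List[str] = []
    for ch in token:
        cls = "X" if ch.isupper() else "x" if ch.islower() else "d" if ch.isdigit() else ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


def style_features(tagged: List[Tuple[str, str]]) -> Dict[Tuple[str, Tuple[str, ...]], float]:
    """tagged is a list of (token, PoS tag) pairs for one article."""
    views = {
        "pos": [tok.lower() if is_style_word(tok) else pos for tok, pos in tagged],
        "sig": [tok.lower() if is_style_word(tok) else signature(tok) for tok, _ in tagged],
    }
    counts: Counter = Counter()
    for name, seq in views.items():
        for n in (1, 2):                       # unigrams and bigrams
            for i in range(len(seq) - n + 1):
                counts[(name, tuple(seq[i:i + n]))] += 1
    return {feat: math.log(1 + c) for feat, c in counts.items()}
```

Content words such as "parse" never appear in the feature names, which is what keeps these features largely topic-independent.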
![Page 34: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/34.jpg)
Syntax Features
Unlike recent work using generative PCFGs, we use syntax directly as features in discriminative models, which can easily incorporate arbitrary and overlapping syntactic clues.
![Page 35: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/35.jpg)
Syntax Features
For example, we will see that one indicator of native text is the use of certain determiners as stand-alone noun phrases.
This contrasts with a proposed non-native phrase, “this/DT growing/VBG area/NN,” where this instead modifies a noun.
![Page 36: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/36.jpg)
![Page 37: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/37.jpg)
Syntax Features
We evaluate three feature types that aim to capture such knowledge.
We aggregate the feature counts over all the parse trees constituting a document.
The feature value is the log-count of how often each feature occurs.
![Page 38: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/38.jpg)
CFG Rules
We include a feature for every unique, single-level context-free-grammar (CFG) rule application in a paper.
The Figure 2 tree would have features: NP->PRP, NP->DT, DT->this.
Such features do capture that a determiner was used as an NP, but they do not jointly encode which determiner was used.
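A small sketch of CFG-rule extraction is given below, assuming parse trees are represented as nested tuples `(label, child, ...)` with words as plain strings. That representation, and the `log(1 + count)` smoothing, are assumptions made for this example. Counts are aggregated over all trees in a document, as described earlier.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Iterator


def cfg_rules(tree) -> Iterator[str]:
    """Yield one 'LHS->RHS' string for every internal node of the tree."""
    label, *children = tree
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    yield f"{label}->{rhs}"
    for child in children:
        if not isinstance(child, str):     # recurse into non-leaf children
            yield from cfg_rules(child)


def cfg_features(trees: Iterable) -> Dict[str, float]:
    """Aggregate rule counts over a document's trees, then take log(1 + count)."""
    counts = Counter(rule for t in trees for rule in cfg_rules(t))
    return {rule: math.log(1 + c) for rule, c in counts.items()}
```

For the tree `("S", ("NP", ("DT", "this")), ("VP", ("VBZ", "works")))`, the rules include `NP->DT` and `DT->this` as two separate features, so the pairing of determiner and function is not captured jointly.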
![Page 39: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/39.jpg)
TSG Fragments
A tree-substitution grammar (TSG) is a generalization of CFGs that allows rewriting to tree fragments rather than sequences of non-terminals.
![Page 40: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/40.jpg)
TSG Fragments
Figure 2 gives the example NP->(DT this).
This fragment captures both the identity of the determiner and its syntactic function as an NP, as desired.
We parse with the TSG grammar and extract the fragments as features.
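To make the idea of a fragment concrete, the sketch below counts where a given fragment occurs in a parse tree. It does not parse with a TSG: it only checks every node for a match, which is a simplification of extracting the fragments from a TSG derivation. Trees are assumed to be nested tuples `(label, child, ...)` with words as strings, and a fragment's frontier non-terminal is written as a one-element tuple such as `("NN",)`.

```python
def matches(fragment, tree) -> bool:
    """True if `fragment` matches the tree when rooted at its top node."""
    if isinstance(fragment, str) or isinstance(tree, str):
        return fragment == tree                  # words must match exactly
    if fragment[0] != tree[0]:
        return False
    frag_children = fragment[1:]
    if not frag_children:                        # frontier non-terminal: label match suffices
        return True
    tree_children = tree[1:]
    return (len(frag_children) == len(tree_children)
            and all(matches(f, t) for f, t in zip(frag_children, tree_children)))


def count_fragment(fragment, tree) -> int:
    """Count the nodes of `tree` at which `fragment` matches."""
    if isinstance(tree, str):
        return 0
    return int(matches(fragment, tree)) + sum(count_fragment(fragment, c) for c in tree[1:])
```

With this representation, the fragment `("NP", ("DT", "this"))` matches a stand-alone "this" but not "this" inside a longer noun phrase such as "this area", where it modifies a noun.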
![Page 41: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/41.jpg)
![Page 42: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/42.jpg)
C&J Reranking Features
We also extracted the reranking features of Charniak and Johnson.
These features were hand-crafted for reranking the output of a parser.
![Page 43: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/43.jpg)
C&J Reranking Features
While TSG fragments tile a parse tree into a few useful fragments, C&J features can produce thousands of features per sentence.
![Page 45: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/45.jpg)
Experiments and Results
We take the minority class as the positive class: NES for NativeL, top-tier for Venue and female for Gender.
We tune three hyperparameters for F1-score on development data:
- the SVM regularization parameter
- the threshold for classifying an instance as positive
- a hyperparameter for transductive training
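Tuning the classification threshold for F1 can be sketched as a simple sweep over candidate thresholds on development data. The function names are hypothetical, and treating each instance's score as a real-valued classifier output (such as an SVM decision value) is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 of the positive class; defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores: Sequence[float],
                   y_true: Sequence[bool]) -> Tuple[Optional[float], float]:
    """Try each observed score as a threshold; keep the one with the best F1."""
    best_t: Optional[float] = None
    best_f1 = -1.0
    for t in sorted(set(scores)):
        pred: List[bool] = [s >= t for s in scores]   # positive if score >= threshold
        f1 = f1_score(y_true, pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Since the positive class is the minority class, the F1-optimal threshold will often differ from the default decision boundary at zero.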
![Page 46: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/46.jpg)
![Page 47: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/47.jpg)
![Page 48: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/48.jpg)
![Page 49: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/49.jpg)
NativeL
Some features reflect differences in common native/non-native topics, e.g., ‘probabilities’ predicts native while ‘morphological’ predicts non-native.
![Page 50: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/50.jpg)
NativeL
Several features, like ‘obtained’, indicate L1 interference.
The word ‘obtained’ occurs 3.7 times per paper from Spanish-speaking areas (cognate obtener) versus once per native paper and 0.8 times per German-authored paper.
![Page 51: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/51.jpg)
NativeL
Natives also prefer certain abbreviations (e.g. ‘e.g.’) while non-natives prefer others (‘i.e.’, ‘c.f.’, ‘etc.’).
Exotic punctuation also suggests native text: the semi-colon, exclamation mark and question mark all predict NES.
![Page 52: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/52.jpg)
NativeL
Table 5 gives highly-weighted TSG features for predicting NativeL.
![Page 53: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/53.jpg)
![Page 54: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/54.jpg)
NativeL
Table 5 gives highly-weighted TSG features for predicting NativeL.
- these, this and each predict native when used as an NP
- that-as-an-NP predicts non-native
![Page 55: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/55.jpg)
NativeL
Furthermore, while not all native speakers use a comma before a conjunction in a list, it’s nevertheless a good flag for native writing.
![Page 56: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/56.jpg)
NativeL
We also looked for features involving determiners since correct determiner usage is a common difficulty for non-native speakers.
![Page 57: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/57.jpg)
NativeL
Non-natives also rely more on boilerplate.
For example, the exact phrase “The/This paper is organized as follows” occurs 3 times as often in non-native text as in native text.
![Page 58: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/58.jpg)
NativeL
We found very few highly-weighted features that pinpoint ‘ungrammatical’ non-native writing.
![Page 59: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/59.jpg)
Venue
Table 6 provides important Bow and Style features for the Venue task.
![Page 60: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/60.jpg)
![Page 61: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/61.jpg)
Venue
Good papers often have an explicit probability model, experimental baselines, error analysis, and statistical significance checking.
- The focus is on work that improves ‘performance’ by ‘#%’ on established tasks.
![Page 62: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/62.jpg)
Venue
Features of workshop papers highlight the exploration of ‘interesting’ new ideas/domains.
- The focus is on what is ‘possible’ or what one is ‘able to’ do.
![Page 63: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/63.jpg)
Gender
The CFG features for Gender are given in Table 7.
![Page 64: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/64.jpg)
![Page 65: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/65.jpg)
Gender
Several of the most highly-weighted female features include pronouns.
We observe a higher frequency of not just negation but adverbs in general (e.g. ‘VP->MD RB VP’).
![Page 66: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/66.jpg)
Gender
In terms of Bow features (not shown), the words ‘contrast’ and ‘comparison’ highly predict female.
The top three male Bow features are: simply, perform, parsing.
![Page 67: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/67.jpg)
![Page 68: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/68.jpg)
![Page 69: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/69.jpg)
![Page 71: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/71.jpg)
Conclusion
We have proposed, developed and successfully evaluated significant new tasks and methods in the stylometric analysis of scientific articles, including the novel resolution of publication venue based on paper style, and novel syntactic features based on tree substitution grammar fragments.
![Page 72: Stylometric Analysis of Scientific Articles](https://reader036.vdocuments.net/reader036/viewer/2022062306/5681369c550346895d9e3eca/html5/thumbnails/72.jpg)
Conclusion
We showed a strong correlation between our predictions and a paper’s number of citations.
We observed evidence for L1-interference in non-native writing, for differences in topic between males and females, and for distinctive language usage which can successfully identify papers published in top-tier conferences versus workshop proceedings.