SCIENTEXT: A Corpus of French & English Academic & Scientific Texts. Alice Henderson (for the Scientext team), LLS research group, Université de Savoie, Chambéry, France. BAAL, University of Newcastle, September 3-5, 2009


Page 1: SCIENTEXT: A Corpus of French & English Academic & Scientific Texts
Alice Henderson (for the Scientext team)
LLS research group, Université de Savoie
Chambéry, France
BAAL, University of Newcastle, September 3-5, 2009

Page 2: [Diagram: English vs French terminology]
English: academic
French: Ecrits universitaires (ss writing, textbooks); Ecrits de recherche (articles, theses) = scientifique

Page 3: Outline
General overview of the Scientext project
End product & applications
Goals of the linguistic study
Details of the corpus & tagging
Presentation of the beta version

Page 4: General Overview
Project financed by the French ANR, CORPUS ET OUTILS DE LA RECHERCHE EN SCIENCES HUMAINES ET SOCIALES (2007-2010).
Goals:
◦ Create a freely available corpus of scientific & academic writing in French & English
◦ Devise tools for studying linguistic markers of stance/positioning AND reasoning
Intended users: linguists, epistemologists, information retrieval specialists, scientists, language teachers.
Long-term applications:
◦ L1 & FL/L2 teaching
◦ Lexicography & writing aids
◦ Information retrieval in scientific & technical fields

Page 5: General Overview
Draws on several branches of linguistics:
◦ Corpus linguistics: creation & study of a large corpus of scientific & academic texts
◦ Natural Language Processing: processing & study of a corpus using a syntactic dependency parser (Bourigault’s Syntex)
◦ Traditional branches of linguistics: discourse analysis, lexicology, enunciation, syntax and semantics
Project coordinated by the LIDILEM research group (F. Grossmann, A. Tutin); 3 teams = multidisciplinary
◦ LIDILEM (Grenoble): F. Grossmann, A. Tutin, F. Boch, C. Cavalla, O. Kraif, M. Florez, I. Novakova, M.L. Nguyen, F. Rinck.
◦ LLS (Chambéry): J. Osborne, A. Henderson, R. Barr.
◦ LiCorN (Lorient): G. Williams, H. Maury, C. Ropers.

Page 6: General Overview

Page 7: End Product & Applications
1) Web site with several ways of selecting sub-parts of texts.
◦ Query search (complex & simple) and text view
◦ Search for traces of stance/positioning and reasoning using local pre-established grammars
◦ Downloading of XML corpus (for authors who gave permission, Creative Commons)
◦ Downloading of search results (zip format, CSV format for statistics)

Page 8: End Product & Applications

1) Website allowing selection of sub-parts of texts

2) Teaching applications for both L1 and L2 learners: research into university writing, second language production, etc.

3) Lexicographical applications including assistance with encoding strategies using reference corpora.

4) Targeted information retrieval in scientific and technical fields.

Page 9: The Linguistic Study
Focus on 2 essential features of the texts:
◦ Authors use stance to situate themselves in relation to previous and contemporary research whilst demonstrating what is specific to their work and the choices made.
◦ The intellectual process upon which findings and deductions are based can be revealed via the analysis of authorial reasoning.
Test two hypotheses:
◦ Stance is expressed by a phraseology that is shared (partly? largely?) across fields
◦ This phraseology is more characteristic of genres than of fields

Page 10: The Linguistic Study
Distinguish between 3 main parameters: Field, Text genre (and sub-genres), Text section
Scientific sub-genres
◦ Scientific articles
◦ Conference proceedings
◦ PhD theses, HDR (habilitation à diriger des recherches)
Academic sub-genres (Learner corpus)
◦ 2nd year English majors, Long Essays
◦ 3rd year English majors, Language Policy analyses

Page 11: Details: French scientific corpus
Field                     Articles & presentations   Theses   HDRs
Social sciences                                154       32      8
  Linguistics                                   66        8      4
  Education                                     63        8      2
  NLP                                            8        8      1
  Psychology                                    17        8      1
Natural sciences                                21       10      0
  Biology                                        6        7      0
  Medicine                                      15        3      0
Applied sciences                                 0        8      1
  Electronics                                    0        4      0
  Mechanical engineering                         0        4      1
TOTAL                                          175       50      9
234 texts (1997-2008), 5 million words
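The sub-totals in this table can be checked with a few lines of Python. This is only a sketch: the counts are transcribed from the slide, and the grouping of the fields under three broad areas is an assumption inferred from the layout (the sums are consistent with it).

```python
# Counts transcribed from the slide: (articles & presentations, theses, HDRs).
# ASSUMPTION: the nesting of fields under three broad areas is inferred
# from the slide layout, not stated explicitly.
corpus = {
    "Social sciences": {
        "Linguistics": (66, 8, 4),
        "Education": (63, 8, 2),
        "NLP": (8, 8, 1),
        "Psychology": (17, 8, 1),
    },
    "Natural sciences": {
        "Biology": (6, 7, 0),
        "Medicine": (15, 3, 0),
    },
    "Applied sciences": {
        "Electronics": (0, 4, 0),
        "Mechanical engineering": (0, 4, 1),
    },
}


def column_sums(rows):
    """Sum (articles, theses, hdrs) triples column by column."""
    return tuple(sum(column) for column in zip(*rows))


# Per-area sub-totals, then the grand total per genre.
area_totals = {area: column_sums(fields.values()) for area, fields in corpus.items()}
grand_total = column_sums(area_totals.values())

print(area_totals)
print(grand_total, sum(grand_total))
```

The per-genre totals and their sum can then be compared with the TOTAL row and the "234 texts" figure on the slide.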

Page 12: Details: English corpora
Academic (learner) corpus (Chambéry, 1997-2007)
◦ 1.1 million words, 300 texts, each 4000-5000 words long
Scientific corpus (Lorient)
◦ 33 million words “hoovered” from BMC Corpus of Biology and Medical Texts
◦ POS-tagged & lemmatised
◦ Theoretical analysis of meaning transfers for the analysis of diachronic & synchronic meaning changes in context through collocational resonance
◦ Creation of a bottom-up dictionary of verb patterns with corpus-driven thematic and conceptual groupings for NNS scientists

Page 13: Corpus Tagging (French sci. + Eng academic/learner)
XML format (Text Encoding Initiative)
Tagged elements
◦ Header: type of tagging, information about the text, availability of the text
◦ Text structure (semi-automatic tagging):
  identification of text sections: abstract, introduction, body of the text, conclusion, notes, references
  lay-out (when available): bold, italics, structure of lists
◦ Linguistic tagging (automatic): morpho-syntactic tagging & identification of syntactic dependencies (Bourigault’s Syntex, 2007 version)
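To give an idea of how such a tagged file might be exploited, here is a minimal Python sketch that reads the availability from a header and collects lemmas from selected text sections. The sample document is hypothetical: its element and attribute names are loosely TEI-like but are assumptions, not the actual Scientext schema, and real TEI files normally declare an XML namespace that this sketch omits.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL TEI-like sample: element and attribute names are illustrative
# assumptions and are NOT taken from the actual Scientext files.
SAMPLE = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample article</title></titleStmt>
      <publicationStmt><availability status="free"/></publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="abstract"><p><w lemma="we" pos="PRO">We</w> <w lemma="propose" pos="V">propose</w></p></div>
      <div type="introduction"><p><w lemma="this" pos="DET">This</w> <w lemma="hypothesis" pos="N">hypothesis</w> <w lemma="hold" pos="V">holds</w></p></div>
      <div type="conclusion"><p><w lemma="we" pos="PRO">We</w> <w lemma="conclude" pos="V">conclude</w></p></div>
    </body>
  </text>
</TEI>
""".strip()


def lemmas_by_section(xml_text, wanted_sections):
    """Map each wanted section type to the list of lemmas it contains."""
    root = ET.fromstring(xml_text)
    result = {}
    for div in root.iter("div"):
        section = div.get("type")
        if section in wanted_sections:
            result[section] = [w.get("lemma") for w in div.iter("w")]
    return result


# Header information: is the text freely available?
root = ET.fromstring(SAMPLE)
availability = root.find("./teiHeader/fileDesc/publicationStmt/availability").get("status")

# Restrict a search to abstracts and conclusions only.
selected = lemmas_by_section(SAMPLE, {"abstract", "conclusion"})
print(availability, selected)
```

Marking sections in the structure is what makes it possible to restrict a search to, say, introductions or conclusions.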

Page 14: Outline
General overview of the Scientext project
End product & applications
Goals of the linguistic study
Details of the corpus & tagging
Presentation of the beta version

Page 15: Presentation of the beta version

Web site available on-line: http://scientext.dynalias.net

Interface created by Achille Falaise, using the query language Concquest developed by Olivier Kraif (Université Grenoble 3)

Page 16: Step 1: Choosing the field, genre, & text section (French scientific corpus)

Page 17: Step 2: Searching in the texts
3 search modes
◦ Simple interface, with drop-down menus and predefined values
◦ Complex query language, so grammars can be created/written
◦ Local grammars, involving stance/positioning or reasoning
  Example: grammar of scientific affiliation

Page 18: Example of a simple query
Selection of predicate adjectives used with the noun policy.

Page 19: Examples of predefined searches
Verbs of feeling: hate, love, feel, like, …
Verbs of opinion: consider, think, find, …
Evaluative adjectives: true, great, important, best, new, right, …
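A predefined search of this kind can be pictured as a lookup of lemmas in fixed lists. The sketch below is an illustration only, not the Scientext implementation: the lists contain just the items shown on the slide (which are open-ended), the sample sentence is invented, and matching ignores part of speech, so an ambiguous form such as "like" would be over-counted.

```python
from collections import Counter

# Lemma lists copied from the slide; the slide's lists end in "…",
# so these are deliberately incomplete.
PREDEFINED = {
    "verbs of feeling": {"hate", "love", "feel", "like"},
    "verbs of opinion": {"consider", "think", "find"},
    "evaluative adjectives": {"true", "great", "important", "best", "new", "right"},
}


def count_categories(lemmas):
    """Count how many lemmas fall into each predefined category."""
    counts = Counter()
    for lemma in lemmas:
        for category, members in PREDEFINED.items():
            if lemma in members:
                counts[category] += 1
    return counts


# HYPOTHETICAL lemmatised sentence: "We think this policy is important and new".
sample = ["we", "think", "this", "policy", "be", "important", "and", "new"]
result = count_categories(sample)
print(dict(result))
```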

Page 20: Example of a complex query (advanced search)

1. Search for syntactic dependency + co-occurrence

<hypothèse,#1><>*<cat=V,#2> :: (SUJ,#2,#1);

Verbs which come after the lemma hypothèse, where hypothèse is the subject of the verb.

2. Search for a disjunction of lemmas + syntactic dependency

<lemma=/(hypothèse|notion|concept)/,#1> && <cat=V,#2> && <cat=A,#3> :: (SUJ,#2,#1) AND (ADJ,#1,#3) ;

The lemmas hypothèse, notion or concept functioning as subjects & accompanied by an adjective
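The logic of these two queries can be emulated on a toy sentence in Python. This is a sketch under stated assumptions, not the query engine itself: the token dictionaries, the invented sentence and the reading of each dependency as a (relation, head, dependent) triple are all illustrative, as is the reading of "&&" as co-occurrence in any order.

```python
# Toy sentence (invented): "Cette hypothèse forte explique les résultats".
# ASSUMPTION: tokens carry an id, a lemma and a category, and dependencies
# are (relation, head_id, dependent_id) triples, mirroring (SUJ,#2,#1).
tokens = [
    {"id": 1, "form": "Cette", "lemma": "ce", "cat": "D"},
    {"id": 2, "form": "hypothèse", "lemma": "hypothèse", "cat": "N"},
    {"id": 3, "form": "forte", "lemma": "fort", "cat": "A"},
    {"id": 4, "form": "explique", "lemma": "expliquer", "cat": "V"},
    {"id": 5, "form": "les", "lemma": "le", "cat": "D"},
    {"id": 6, "form": "résultats", "lemma": "résultat", "cat": "N"},
]
deps = {("SUJ", 4, 2), ("ADJ", 2, 3), ("OBJ", 4, 6)}


def verbs_with_subject(tokens, deps, lemma):
    """Query 1: verbs occurring after `lemma`, with `lemma` as their subject."""
    hits = []
    for noun in tokens:
        if noun["lemma"] != lemma:
            continue
        for verb in tokens:
            follows = verb["id"] > noun["id"]
            is_subject = ("SUJ", verb["id"], noun["id"]) in deps
            if verb["cat"] == "V" and follows and is_subject:
                hits.append((noun["form"], verb["form"]))
    return hits


def subjects_with_adjective(tokens, deps, lemmas):
    """Query 2: one of `lemmas` as subject of a verb and modified by an adjective."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for rel, verb_id, noun_id in deps:
        if rel != "SUJ":
            continue
        noun, verb = by_id[noun_id], by_id[verb_id]
        if noun["lemma"] not in lemmas or verb["cat"] != "V":
            continue
        for rel2, head_id, adj_id in deps:
            if rel2 == "ADJ" and head_id == noun_id and by_id[adj_id]["cat"] == "A":
                hits.append((noun["form"], by_id[adj_id]["form"], verb["form"]))
    return sorted(hits)


query1 = verbs_with_subject(tokens, deps, "hypothèse")
query2 = subjects_with_adjective(tokens, deps, {"hypothèse", "notion", "concept"})
print(query1, query2)
```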

Page 21: Example of a local grammar (to write an advanced search)
◦ Using variables
◦ Re-defining a relation
Ex: (ATTSUJ,#2,#1) = (ATTS,#3,#1) AND (SUJ,#3,#2)
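Read as a rule, the example says that token #1 is an attribute of subject #2 whenever some verb #3 has #1 as its attribute and #2 as its subject. A minimal Python sketch of that composition follows; the triple layout (relation, verb, dependent), the token ids and the sample sentence are assumptions made for illustration.

```python
def derive_attsuj(deps):
    """Derive (ATTSUJ, subject, attribute) from ATTS and SUJ sharing one verb.

    ASSUMPTION: each dependency is a (relation, verb_id, dependent_id) triple.
    """
    derived = set()
    for rel_a, verb_a, attribute in deps:
        if rel_a != "ATTS":
            continue
        for rel_s, verb_s, subject in deps:
            # The shared verb plays the role of variable #3 in the rule.
            if rel_s == "SUJ" and verb_s == verb_a:
                derived.add(("ATTSUJ", subject, attribute))
    return derived


# HYPOTHETICAL sentence "This policy is important", token ids 1 to 4:
# "is" (3) has subject "policy" (2) and attribute "important" (4).
deps = {("SUJ", 3, 2), ("ATTS", 3, 4)}
attsuj = derive_attsuj(deps)
print(attsuj)
```

A derived relation like this is what lets a query such as "predicate adjectives used with the noun policy" be written without mentioning the verb.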

Page 22: Step 3: Display
KWIC display, can be customised
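For readers unfamiliar with KWIC (keyword in context), the idea is to show each hit with a fixed window of words on either side. The following Python sketch is a generic illustration with an invented sentence; the function names and the `width` parameter are hypothetical and say nothing about how the Scientext interface is built.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so that the keywords line up."""
    left_width = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{left_width}}  [{kw}]  {right}" for left, kw, right in lines]


# Invented example sentence; `width` is the customisable window size.
text = "We think that we can show what we mean".split()
rows = kwic(text, "we", width=2)
for line in format_kwic(rows):
    print(line)
```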

Page 23: Displaying a wider context

Page 24: Displaying syntactic structure

Page 25: Displaying syntactic structure

Page 26: Statistical calculations & display of results

Page 27: Graphic display of results: “we”
2nd Year: 728 occs.
3rd Year: 1607 occs.

Page 28: Conclusion
Available on Scientext Web site
◦ French scientific corpus, 5 million words
◦ English academic/learner corpus, 1.1 million words

Page 29: Conclusion
Project still running (through early 2010)
◦ Corpus compilation & tagging: LONG … & tedious
◦ Interface still being developed
◦ Linguistic model still needs finalising
◦ More grammars need to be developed
◦ Teaching materials need developing & piloting
Issues: interface between lexis & rhetorical functions
Future research
◦ Linguistic study of markers: “positioned” citations, markers of scientific affiliation
◦ Teaching materials need piloting & evaluating

Page 30: Thank you!!
(and please try it out)

Page 31: Publications & resources linked to Scientext project

Boch F., Grossmann F. (2002). “Se référer au discours d’autrui : quelques éléments de comparaison entre experts et néophytes”. L’écrit dans l’enseignement supérieur. Enjeux, Brussels, pp. 41-51.

Boch F., Grossmann F., Rinck F. (2007). “‘Conformément à nos attentes ...’ ou l’étude des marqueurs de convergence/divergence dans l’article scientifique”. Revue Française de Linguistique Appliquée, Vol. XII-2, pp. 109-122.

Bourigault D. (2007). SYNTEX, analyseur syntaxique opérationnel. Mémoire d’habilitation à diriger des recherches, Université Toulouse Le Mirail.

Chavez I. (2008). La démarcation dans les écrits scientifiques - Les collocations transdisciplinaires comme aide à l’écrit universitaire auprès des étudiants étrangers. Mémoire de Master 2 Français Langue Etrangère Recherche, C. Cavalla (supervisor), Université Stendhal-Grenoble 3, Grenoble.

Garcia P.P. (2008). Etude des marques de la filiation dans les écrits scientifiques. Master 1 thesis, F. Grossmann and A. Tutin (supervisors), Université Stendhal-Grenoble 3.

Grossmann F., Tutin A. (2008). “Evidential Markers in French Scientific Writing: the Case of the French Verb voir”. Evidentiality Workshop, Bamberg, 27-29 February 2008.

Henderson A., Barr R. (2009). “Corpus-based L2 Writing Instruction: Raising Awareness of Authorial Stance”. Journal of Writing Research (forthcoming).

Rinck F. (2006). L’article de recherche en Sciences du Langage et en Lettres, Figure de l’auteur et approche disciplinaire du genre. Doctoral thesis, Sciences du Langage, F. Boch and F. Grossmann (supervisors), Université Stendhal-Grenoble 3, Grenoble.

Page 32: Publications & resources linked to Scientext project (continued)

Rinck F., Boch F., Grossmann F. (2007). “Quelques lieux de variation du positionnement énonciatif dans l’article de recherche”, in Lambert P., Millet A., Rispail M., Trimaille C. (eds). Variations au coeur et aux marges de la sociolinguistique. L’Harmattan, Espaces Discursifs, Paris.

Tutin A. (2008). “Evaluative adjectives in academic writings”. Interpersonality in written academic discourse: perspectives across languages and cultures, 11-13 December, Zaragoza, Spain.

Tutin A. (ed.) (2007a). “Lexique et écrits scientifiques”. Revue Française de Linguistique Appliquée, Vol. XII-2, December 2007.

Tutin A. (2007b). “Modélisation linguistique et annotation des collocations : application au lexique transdisciplinaire des écrits scientifiques”, in S. Koeva, D. Maurel, M. Silberztein (eds). Formaliser les langues avec l’ordinateur. Presses universitaires de Franche-Comté, Besançon.

Williams G., Millon C. (2009). “The General and the Specific: Collocational resonance of scientific language”. Proceedings Corpus Linguistics 2009, University of Liverpool (forthcoming).

Williams G., Millon C. (2009). “Les verbes et la science: la construction d’un dictionnaire organique”. Actes des Journées de la Linguistique de Corpus 2009. Texte et Corpus (forthcoming).

Williams G. (2008). “Le Corpus et le dictionnaire dans les langues de spécialité”, in Maniez et al. (eds). Corpus et dictionnaires de langues de spécialité. Presses Universitaires de Grenoble, pp. 135-151.

Williams G. (2008). “Verbs of Science and the Learner’s Dictionary”. Proceedings Thirteenth EURALEX International Congress, Barcelona, Spain, 15-19 July 2008, pp. 797-806.