Bottom-up dialectometry using the GeoLing package
Simon Pickl, Aaron Spettl, Simon Pröll, Stephan Elspaß, Werner König, Volker Schmidt
Methods in Dialectology XV, Groningen, Friday, August 15, 2014
A statistical software package for geolinguistic data
• developed in cooperation by statisticians (Ulm University) and dialectologists (Universities of Augsburg and Salzburg)
• funded by the Deutsche Forschungsgemeinschaft (DFG)
• multi-platform (written in Java)
• open source (GPLv3)
• tried and tested with data from the Sprachatlas von Bayerisch-Schwaben (SBS) and other geolinguistic corpora
• www.geoling.net
A tool for bottom-up dialectometry (cf. Pickl/Rumpf 2012)
With GeoLing, you can
• produce probabilistic area-class maps of linguistic variables using intensity estimation
• find groups of maps that are spatially similar using cluster analysis
• identify and plot recurring spatial patterns using factor analysis
Outline
• What can you do with GeoLing? → Simon Pickl
  • intensity estimation
  • factor analysis
• How do you use GeoLing? → Aaron Spettl
  • installing GeoLing
  • performing analyses
  • importing your data
Testbed: Sprachatlas von Bayerisch-Schwaben (SBS)
• compiled 1984‒2009 at the University of Augsburg under the direction of Werner König
• approximately 2,700 maps in 14 volumes
• 272 sites
• for each map 0–3 records per site
[Figure: photo by Stefan Puchner; map labels: Germany, Bavaria]
Intensity estimation Cf. Rumpf/Pickl/Elspaß/König/Schmidt 2009; Pickl/Rumpf 2011; 2012
• Method for estimating the probabilistic distribution underlying the records
• Motivation: Individual records are not necessarily representative
• Records are treated as statistical samples from an underlying distribution
• Intensity estimation uses the geographical or linguistic relatedness between sites to infer local probabilities
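The idea in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not GeoLing's implementation: the Gaussian kernel, the `bandwidth` parameter, the function name `estimate_intensities` and the toy data are all assumptions made for the example. Each site's local probabilities are computed from the records of all sites, weighted by how close those sites are.

```python
import math


def estimate_intensities(records, distances, bandwidth):
    """Kernel-smoothed variant probabilities per site (illustrative sketch).

    records:   {site: {variant: number of records}} for one map
    distances: {frozenset((site_a, site_b)): distance between the two sites}
    bandwidth: kernel width, in the same unit as the distances (assumed parameter)
    """
    sites = list(records)
    estimates = {}
    for target in sites:
        weighted = {}
        for source in sites:
            d = 0.0 if source == target else distances[frozenset((source, target))]
            weight = math.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian kernel (assumption)
            for variant, count in records[source].items():
                weighted[variant] = weighted.get(variant, 0.0) + weight * count
        total = sum(weighted.values())
        # Normalise the weighted counts so that they sum to 1 at each site.
        estimates[target] = {v: w / total for v, w in weighted.items()} if total else {}
    return estimates


# Toy data: site B has no record of its own (maps can have 0 records per site).
records = {"A": {"variant_1": 2}, "B": {}, "C": {"variant_2": 1}}
distances = {
    frozenset(("A", "B")): 5.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("A", "C")): 10.0,
}
estimates = estimate_intensities(records, distances, bandwidth=5.0)
```

In this toy example, site B still receives local probabilities because its neighbours' records are taken into account, which is the sense in which records are treated as samples from an underlying distribution.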
Intensity estimation Cf. Rumpf/Pickl/Elspaß/König/Schmidt 2009; Pickl/Rumpf 2011; 2012
[Figure: maps of words for ‘woodlouse’; panels: intensity estimation, continuous intensity estimation]
Linguistic distances in intensity estimation Cf. Pickl/Spettl/Pröll/Elspaß/König/Schmidt 2014
[Figure: maps of words for ‘woodlouse’; panels: intensity estimation based on geographical distances, intensity estimation based on linguistic (in this case: lexical) distances]
• Intensity estimation with linguistic distances:
  • less “smooth” isoglosses and areas
  • more detail
  • preservation of language islands (e.g. towns) and dialect borders
  • continuous plot not possible with linguistic distances
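To make the notion of a linguistic distance concrete, here is a minimal Python sketch. It assumes a deliberately simple measure, the share of maps on which two sites have no variant in common; this is not necessarily the measure defined in Pickl/Spettl/Pröll/Elspaß/König/Schmidt 2014, and the function name and toy data are hypothetical.

```python
def lexical_distance(site_a, site_b, maps):
    """Share of comparable maps on which two sites share no variant.

    maps: list of {site: set of variants recorded at that site}, one dict per map.
    Returns None if no map has records at both sites.
    """
    comparable = 0
    differing = 0
    for site_records in maps:
        a = site_records.get(site_a, set())
        b = site_records.get(site_b, set())
        if not a or not b:
            continue  # a map with no record at one of the sites cannot be compared
        comparable += 1
        if not (a & b):
            differing += 1
    return differing / comparable if comparable else None


# Toy corpus of three maps; "town" differs from the nearby "village" on two of them.
maps = [
    {"town": {"x1"}, "village": {"x2"}, "far_site": {"x1"}},
    {"town": {"y1"}, "village": {"y1", "y2"}, "far_site": {"y1"}},
    {"town": {"z1"}, "village": {"z2"}, "far_site": {"z1"}},
]
d_town_village = lexical_distance("town", "village", maps)
d_town_far = lexical_distance("town", "far_site", maps)
```

With a measure of this kind, a town that is geographically close to its surroundings but linguistically different ends up far away from its neighbours, so their records contribute little to its estimate. This is one way to see why language islands and dialect borders are preserved.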
Linguistic distances in intensity estimation Cf. Pickl/Spettl/Pröll/Elspaß/König/Schmidt 2014
• Further analysis:
  • statistical analysis of spatial characteristics (homogeneity, complexity); cf. Rumpf/Pickl/Elspaß/König/Schmidt 2010
  • cluster analysis to obtain groups of maps with similar spatial structure; cf. Rumpf/Pickl/Elspaß/König/Schmidt 2010; Meschenmoser/Pröll 2012
Intensity estimation Cf. Rumpf/Pickl/Elspaß/König/Schmidt 2009; Pickl/Rumpf 2011; 2012; Pickl/Spettl/Pröll/Elspaß/König/Schmidt 2014
• statistical tool for dimensionality reduction
  Applications in dialectometry: Clopper/Paolillo 2006; Nerbonne 2006; Leinonen 2010; Grieve/Speelman/Geeraerts 2011
• condenses large numbers of variants with similar distributions into so-called “factors”
• provides a “summary” of predominant spatial patterns in the data
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
• summarize 59.9 % of the data (equivalent of 16,961 variants)
• areas of similar variant distributions
• only the 10 locally dominant factors visible in this map
• in total: 15 factors (Kaiser criterion)
• non-dominant factors are hidden but ‘latently’ present
Combined Factor Map: Dominant factors in the SBS (all 2,160 maps, 28,315 variants)
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
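How a combined map can show fewer factors than were extracted can be illustrated with a short Python sketch. It assumes that every site has one score per factor and that the map colours each site by its highest-scoring factor; the slides do not state which quantity GeoLing actually plots, and the function names and toy scores are hypothetical.

```python
def dominant_factors(scores_by_site):
    """Return {site: index of the factor with the highest score at that site}."""
    return {
        site: max(range(len(scores)), key=lambda i: scores[i])
        for site, scores in scores_by_site.items()
    }


def visible_factors(scores_by_site):
    """Factors that are dominant at one site or more, i.e. visible on a combined map."""
    return sorted(set(dominant_factors(scores_by_site).values()))


# Toy example with three factors (indices 0, 1, 2) and four sites.
scores_by_site = {
    "site_1": [0.8, 0.1, 0.3],
    "site_2": [0.6, 0.5, 0.4],
    "site_3": [0.2, 0.7, 0.6],
    "site_4": [0.1, 0.9, 0.2],
}
dominant = dominant_factors(scores_by_site)
visible = visible_factors(scores_by_site)
```

In the toy data, the third factor has a sizeable score at `site_3` but is never the highest anywhere, so it does not appear on the combined map although it is still present in the data.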
• summarizes 14.58 % of the data (equivalent of 4,128 variant distributions)
• area of tendential co-occurrence of variants
• fuzzy distribution
Example: Factor 1
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
• summarizes 12.40 % of the data (equivalent of 3,511 variant distributions)
• area of tendential co-occurrence of variants
• fuzzy distribution
Example: Factor 2
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
• summarizes 0.71 % of the data (equivalent of 201 variant distributions)
• area of tendential co-occurrence of variants
• fuzzy, discontinuous distribution
Example: Factor 10
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
• summarizes 0.62 % of the data (equivalent of 176 variant distributions)
• area of tendential co-occurrence of variants
• fuzzy distribution
Example: Factor 11
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
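The “equivalent of … variants” figures on the factor slides follow from the percentages and the total of 28,315 variants given for the combined factor map. A short Python check, with variable names chosen for this example only:

```python
TOTAL_VARIANTS = 28_315  # total number of variants stated for the combined factor map

# Percentage of the data summarized, as stated on the slides.
shares_percent = {
    "10 dominant factors": 59.9,
    "Factor 1": 14.58,
    "Factor 2": 12.40,
    "Factor 10": 0.71,
    "Factor 11": 0.62,
}

# Equivalent number of variants (or variant distributions), rounded to whole numbers.
equivalents = {
    name: round(TOTAL_VARIANTS * percent / 100)
    for name, percent in shares_percent.items()
}

for name, count in equivalents.items():
    print(f"{name}: {count}")
```

The rounded results match the numbers quoted on the slides (16,961; 4,128; 3,511; 201; 176).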
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
[Figure: Factor 11; label: catchment area of market town Lauingen]
• nuanced and detailed account of overall spatial patterns in the data
• useful for
• a quick overview of major spatial structures
• a differentiated division into graded, fuzzy dialect areas
• an exploratory look into recurring spatial structures (even weak ones) that were previously unknown
Factor Analysis Cf. Pröll/Pickl/Spettl (to appear)
Installing GeoLing
Simple installation:
• download: www.geoling.net
• GeoLing is ready to use after unzipping a single file; no installation is required.
• Sprachatlas von Bayerisch-Schwaben (SBS) is included for demonstration purposes
Live demonstration of GeoLing – some screenshots are supplied at the end of this presentation!
Installing GeoLing
Requirements:
• Java 7 (or higher) must be installed
• a 64-bit Java is recommended on 64-bit operating systems
• processor and memory requirements depend on the database, e.g. number of locations
• SBS database: dual-core CPU and 2 GB RAM recommended
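Whether the Java requirement is met can be checked with `java -version`. The Python sketch below is not part of GeoLing; it assumes that `java` is on the PATH and that the version is printed in the usual `version "..."` form, and the function names are hypothetical.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first = int(match.group(1))
    # Releases up to Java 8 report themselves as 1.x, e.g. "1.7.0_80" for Java 7.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and return the major version, or None if Java is not found."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # `java -version` writes its report to standard error.
    return java_major_version(result.stderr)


def meets_requirement(major, minimum=7):
    """True if a Java version was found and it is at least the stated minimum."""
    return major is not None and major >= minimum
```

Calling `meets_requirement(installed_java_major())` then tells you whether Java 7 or higher is available; it does not check whether the installed Java is a 64-bit build.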
Performing analyses
Main window of GeoLing:
• maps are hierarchically organized for easy navigation
• individual maps can be investigated directly
• but: most analyses are performed on ‘groups’ of maps
• example: groups in SBS database
  • full corpus
  • lexical sub-corpus
  • phonetic sub-corpus
  • morphological sub-corpus
Performing analyses
With a group of maps, you can
• perform intensity estimations, plot maps to image files, calculate characteristics etc.
• perform factor analyses and cluster analyses
For factor and cluster analysis:
• results are visualized immediately
• results can be saved to CSV or XML files for further processing
Importing data
• “Create new database” / “Edit existing database”
• “Database management” dialog:
  • import your own data from simple text files (CSV), whose format is described in the user guide
  • import custom distances between locations
  • export/import database e.g. for backup and exchange of data
  • computation of linguistic distances
  • computation of bandwidths for intensity estimations
Bottom line
GeoLing provides
• several methods for the detection of spatial patterns in geolinguistic data
• easy installation and import of your own data
• an open-source license that allows modifications and custom extensions
You can start using it on your own data right away!
References
• Clopper, C. G. / Paolillo, J. C. (2006): “North American English Vowels: A Factor-analytic Perspective”. Literary and Linguistic Computing 21/4, 445–462.
• Leinonen, T. (2010): An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects. Groningen: Rijksuniversiteit Groningen.
• Meschenmoser, D. / Pröll, S. (2012): “Using fuzzy clustering to reveal recurring spatial patterns in corpora of dialect maps”. International Journal of Corpus Linguistics 17/2, 176–197.
• Nerbonne, J. (2006): “Identifying linguistic structure in aggregate comparison”. Literary and Linguistic Computing 21/4, 463–475.
• Pickl, S. / Rumpf, J. (2011): “Automatische Strukturanalyse von Sprachkarten. Ein neues statistisches Verfahren”. In: Glaser, E. / Schmidt, J. E. / Frey, N. (eds.): Dynamik des Dialekts – Wandel und Variation. Akten des 3. Kongresses der Internationalen Gesellschaft für Dialektologie des Deutschen (IGDD). Stuttgart: Steiner, 267–285.
• Pickl, S. / Rumpf, J. (2012): “Dialectometric Concepts of Space: Towards a Variant-Based Dialectometry”. In: Hansen, S. / Schwarz, C. / Stoeckle, P. / Streck, T. (eds.): Dialectological and folk dialectological concepts of space. Berlin: Walter de Gruyter, 199–214.
• Pickl, S. / Spettl, A. / Pröll, S. / Elspaß, S. / König, W. / Schmidt, V. (2014): “Linguistic distances in dialectometric intensity estimation”. Journal of Linguistic Geography 2, 25–40.
• Pröll, S. / Pickl, S. / Spettl, A. (to appear): “Latente Strukturen in geolinguistischen Korpora”. In: Elmentaler, M. / Hundt, M. / Schmidt, J. E. (eds.): Deutsche Dialekte. Konzepte, Probleme, Handlungsfelder. Akten des 4. Kongresses der Internationalen Gesellschaft für Dialektologie des Deutschen (IGDD) in Kiel. Stuttgart: Steiner.
• Rumpf, J. / Pickl, S. / Elspaß, S. / König, W. / Schmidt, V. (2009): “Structural analysis of dialect maps using methods from spatial statistics”. Zeitschrift für Dialektologie und Linguistik 76/3, 280–308.
• Rumpf, J. / Pickl, S. / Elspaß, S. / König, W. / Schmidt, V. (2010): “Quantification and statistical analysis of structural similarities in dialectological area-class maps”. Dialectologia et Geolinguistica 18, 73–98.
• www.geoling.net
• contents of extracted ZIP archive
• starting GeoLing
Appendix: Screenshots
www.geoling.net
double-click to start GeoLing
• main window after startup
• navigation by hierarchical categories to individual maps
• graded area-class maps produced by intensity estimation
Appendix: Screenshots
• ‘groups’ for operations/analyses on many maps
• export function to generate e.g. graded area-class maps for all maps of a group
• factor analysis
Appendix: Screenshots
• importing data
• example of file format required
Appendix: Screenshots
choose file name of database