Bottom-up quantitative methods in corpus linguistics

Event Date: 

Wednesday, May 25, 2011 - 3:30pm

Event Date Details: 

Refreshments served at 3:15 PM

Event Location: 

  • South Hall 5607F

Prof. Stefan Th. Gries (Linguistics, UCSB)

Title: Bottom-up quantitative methods in corpus linguistics

Abstract: The discipline of linguistics is currently undergoing an empirical evolution. After decades of 'research' based on judgments by the analysts themselves, the field now exhibits an increasing reliance on experimental and observational data. This methodological change is accompanied by a growing use of quantitative methods. While experimental research has used many different quantitative methods for quite a long time, observational research - especially corpus linguistics, the analysis of large collections of textual data - has only very recently begun to move away from the mere descriptive reporting of raw/observed frequencies and to incorporate more, and more diverse, quantitative methods. In this presentation, I will discuss a few applications of quantitative methods in corpus linguistics with an eye to exemplifying recent questions and developments:

- case study 1 is an example of a hierarchical clustering algorithm designed to detect structure in temporally or geographically ordered data, to which traditional clustering algorithms cannot be applied. The algorithm can be used, e.g., to facilitate model-building and the detection of outliers; the data to be discussed are recordings from several years of first language acquisition, historical letters spanning about 300 years, and frequency data describing approximately 40 dialects in Great Britain;

- case study 2 is concerned with the bottom-up exploration of textual databases to determine at what level of granularity/resolution frequency (or other statistical) differences should be studied. On the basis of box-and-whisker plots, different subdivisions of corpus data are explored with an eye to determining how corpora should be split up to reveal the largest amount of variability. Hierarchical cluster analysis and principal component analysis are then used in an attempt to explore dimensions of variation in the data.
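To give a rough sense of why ordered data call for a different kind of clustering (case study 1), here is a minimal sketch in Python. It is not the algorithm presented in the talk; it only illustrates the general idea, under the assumption that clusters may be merged only if they are neighbors in the ordering. The function name `neighbor_clustering`, the use of the population standard deviation as the merge criterion, and the frequency values are all hypothetical choices made for illustration.

```python
from statistics import pstdev


def neighbor_clustering(values):
    """Agglomerate an ordered sequence by repeatedly merging the two
    ADJACENT clusters whose pooled values vary least (population SD).

    Returns the merge history as (left_cluster, right_cluster, sd) tuples.
    """
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Score each pair of neighbors; non-adjacent pairs are never considered,
        # so the temporal (or geographic) order is preserved in every cluster.
        scores = [
            pstdev(clusters[i] + clusters[i + 1])
            for i in range(len(clusters) - 1)
        ]
        # Pick the most homogeneous neighboring pair (first one on ties).
        i = min(range(len(scores)), key=scores.__getitem__)
        merges.append((clusters[i], clusters[i + 1], scores[i]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return merges


# Hypothetical frequencies for six consecutive time periods.
freqs = [10, 11, 12, 30, 31, 50]
merges = neighbor_clustering(freqs)
for left, right, sd in merges:
    print(left, "+", right, f"(pooled SD = {sd:.2f})")
```

Because only neighbors are ever merged, every resulting cluster is a contiguous stretch of the sequence; a jump in the pooled standard deviation at a late merge hints at a boundary between stages, and a value that resists merging until the end is a candidate outlier.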

Feedback and advice will be very welcome ...