Dan Villarreal talk November 3 on auto-coding

Dr. Dan Villarreal (University of Pittsburgh) is visiting the Sociolinguistics Lab in early November. He’ll be giving a talk, open to the public, on Thursday November 3, 2022. Dan’s presentation is of special interest to us because it’s about automating analyses of large-scale datasets. As we build a corpus of Michigan speech in the MI Diaries project, we’ve been using automatic speech recognition (ASR) to speed up our transcription time, and working with MSU’s Institute for Cyber-Enabled Research (ICER) to move some of our data processing to their supercomputer.

Dr. Villarreal is also giving a talk to the SoConDi group at University of Michigan on Nov 4th, 2022, 3-4pm. If you are interested in joining that talk, please contact Yongqing Ye (yeyongqi@msu.edu) or Suzanne Wagner (wagnersu@msu.edu) for the Zoom link.

Sociolinguistic auto-coding: Applications and pitfalls

Dan Villarreal, University of Pittsburgh

Time: Thursday, Nov 3, 4:30-6:15pm

Location: Wells Hall B342 and on Zoom

Zoom link: https://msu.zoom.us/j/98418360065 (Meeting ID: 984 1836 0065; passcode: sociolab)

Researchers in sociophonetics and variationist sociolinguistics have increasingly turned to computational methods to automate time-consuming research tasks such as data extraction (e.g., Fromont & Hay 2012), phonetic alignment (e.g., McAuliffe et al. 2017), and accurate vowel measurement (e.g., Barreda 2021). In this talk, I discuss the advantages and challenges of using sociolinguistic auto-coding (SLAC), a method in which machine learning classifiers assign variants to variable data (Kendall et al. 2021; McLarty, Jones & Hall 2019; Villarreal et al. 2020; Villarreal under review). 

Villarreal et al. (2020) trained random forest classifiers of two sociolinguistic variables of New Zealand English, non-prevocalic /r/ (varying between Present vs. Absent) and intervocalic medial /t/ (Voiced vs. Voiceless), using over 4,000 previously hand-coded tokens (per variable). Cross-validation revealed accuracy rates of 84.5% for /r/ and 91.8% for /t/. In addition to binary predictions, these auto-coders calculate classifier probabilities: the likelihood that a given /r/ token was Present, or a /t/ token was Voiced. In a listening experiment in which 11 phonetically trained listeners coded 60 /r/ tokens, we found a significant positive linear relationship between classifier probability and human judgments; this indicates that classifier probability successfully captures listeners’ perception of phonetically gradient rhoticity. Finally, auto-coders can report which features were most important in classification, helping to shed light on acoustically complex variables like /r/. In short, SLAC can be used for at least three specific functions: binary coding, gradient ‘coding’, and feature selection. 
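As a rough illustration of the three functions named above (binary coding, gradient ‘coding’, and feature selection), here is a minimal sketch using scikit-learn. It is not the authors' pipeline: the data are synthetic, the three acoustic measures are made up, and the forest settings are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for hand-coded /r/ tokens (hypothetical data, not from the study).
n_tokens = 400
labels = rng.integers(0, 2, size=n_tokens)      # 1 = Present, 0 = Absent
features = rng.normal(size=(n_tokens, 3))       # three made-up acoustic measures
features[:, 0] += 2.0 * labels                  # only the first measure carries signal

forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Out-of-fold class probabilities from 5-fold cross-validation.
# Columns follow the sorted class labels, so column 1 is P(Present).
cv_probs = cross_val_predict(forest, features, labels, cv=5, method="predict_proba")
prob_present = cv_probs[:, 1]                   # gradient 'coding'

# Binary coding: threshold the probability at 0.5 and score against the hand codes.
binary_codes = (prob_present >= 0.5).astype(int)
cv_accuracy = float((binary_codes == labels).mean())

# Feature selection: fit on all tokens and inspect impurity-based importances.
forest.fit(features, labels)
importances = forest.feature_importances_

print(f"cross-validated accuracy: {cv_accuracy:.3f}")
print("feature importances:", np.round(importances, 3))
```

The probabilities here come from held-out folds, so they are not inflated by the forest having already seen each token during training.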

As with other machine learning (ML) methods, however, there are inherent concerns about SLAC’s fairness, that is, whether it generates equally valid predictions for different speaker groups (e.g., Koenecke et al. 2020). First, given that there are multiple, mutually incompatible definitions of ML fairness (Berk et al. 2018; Corbett-Davies et al. 2017; Kleinberg et al. 2017), fairness metrics must be decided upon within individual research domains; I argue for three fairness metrics relevant to the domain of sociolinguistic auto-coding. Second, I re-analyze Villarreal et al.’s (2020) /r/ auto-coder for fairness; I find poor performance on all three fairness metrics, with women’s tokens coded more accurately than men’s (88.8% vs. 81.4%). Third, to remedy these imbalances, I use the same data to test a variety of unfairness-mitigation strategies from the ML fairness literature; I find substantial improvement with respect to fairness, albeit at the expense of predictive performance.
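The kind of group comparison reported above (accuracy for women's vs. men's tokens) can be sketched in a few lines of Python. This is only one simple check, a per-group accuracy gap, and is not necessarily one of the three fairness metrics argued for in the talk; the tokens, group labels, and function name below are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(tokens):
    """Return {group: share of tokens where the auto-code matches the hand code}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, hand_code, auto_code in tokens:
        total[group] += 1
        if hand_code == auto_code:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# Hypothetical toy tokens: (speaker group, hand code, auto-code).
tokens = [
    ("F", "Present", "Present"),
    ("F", "Absent", "Absent"),
    ("F", "Absent", "Absent"),
    ("F", "Present", "Absent"),
    ("F", "Absent", "Absent"),
    ("M", "Present", "Present"),
    ("M", "Absent", "Present"),
    ("M", "Absent", "Absent"),
    ("M", "Present", "Absent"),
    ("M", "Absent", "Absent"),
]

rates = accuracy_by_group(tokens)
gap = max(rates.values()) - min(rates.values())  # 0 would mean equal accuracy across groups

for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.1%}")
print(f"accuracy gap: {gap:.1%}")
```

A gap near zero on this measure does not rule out unfairness on other measures, which is one reason the choice of metric has to be made within the research domain.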

Given these fairness issues, I reconsider SLAC under Markl’s (2022) premise that some speech and language technologies are too inherently flawed to use. I argue that while SLAC does not fit into this category, its potential users and consumers deserve a “warts and all” awareness of its drawbacks. To that end, I close with concrete recommendations for using SLAC in large-scale research projects. 

References 

Barreda, Santiago. 2021. Fast Track: fast (nearly) automatic formant-tracking using Praat. Linguistics Vanguard 7(1). https://doi.org/10.1515/lingvan-2020-0051. 

Fromont, Robert & Jennifer Hay. 2012. LaBB-CAT: An annotation store. Proceedings of Australasian Language Technology Association Workshop 113–117. 

Kendall, Tyler, Charlotte Vaughn, Charlie Farrington, Kaylynn Gunter, Jaidan McLean, Chloe Tacata & Shelby Arnson. 2021. Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING). Frontiers in Artificial Intelligence 4(43). https://doi.org/10.3389/frai.2021.648543. 

Markl, Nina. 2022. Language variation and algorithmic bias: Understanding algorithmic bias in British English automatic speech recognition. In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), 521–534. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3531146.3533117. 

McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner & Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In. 

McLarty, Jason, Taylor Jones & Christopher Hall. 2019. Corpus-based sociophonetic approaches to postvocalic r-lessness in African American Language. American Speech 94. https://doi.org/10.1215/00031283-7362239. 

Villarreal, Dan. under review. Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating bias. Linguistics Vanguard.

Villarreal, Dan, Lynn Clark, Jennifer Hay & Kevin Watson. 2020. From categories to gradience: Auto-coding sociophonetic variation with random forests. Laboratory Phonology 11(6). 1–31. https://doi.org/10.5334/labphon.216. 


MI Diaries app gets NEH grant to go open-source

We are delighted to announce that Dr. Betsy Sneller, Assistant Professor of Linguistics and co-Director of the Sociolinguistics Lab, was awarded a $99,908 grant from the National Endowment for the Humanities (NEH) Digital Humanities Advancement Grant (DHAG) program. The new project, “Building and Disseminating an App for Ethnographic Remote Audio Recording”, is an innovative extension of the MI Diaries project. The goal is to provide other researchers with a convenient and accessible method of collecting speech data. To do that, Dr. Sneller’s team will develop open-source code that anyone can use to create a self-recording mobile app for their own project.

The inspiration for the project came from the successful adaptation of the MI Diaries app for the study of Judaism through cultural arts led by Laura Yares, Assistant Professor of Religious Studies at MSU, who will serve on the advisory council for the DHAG grant. Co-Director of the Sociolinguistics Lab, Dr. Suzanne Evans Wagner, is also a faculty advisor to the project.


The interdisciplinary water cooler

Flyer for Yares and Sneller 2021 University Interdisciplinary Colloquium talk

Sociolinguistics Lab co-director Dr. Betsy Sneller will give a high-profile, university-wide talk on November 5th that is open to the public. Her co-presenter, Dr. Laura Yares, met Dr. Sneller at an informal College of Arts and Letters workshop in October 2020 about pivoting research to remote methods in response to the Covid-19 pandemic. Dr. Yares and her collaborators were looking for a way to capture participants’ reactions to a popular Netflix show, Shtisel. Upon learning about the MI Diaries project’s mobile app for self-recorded audio entries, Dr. Yares met with Dr. Sneller and co-investigator Dr. Suzanne Wagner to talk about adapting it for her project. Come and hear about this serendipitous cross-disciplinary conversation, and its broader implications, courtesy of the MSU Center for Interdisciplinarity.

Abstract

Can common research technologies serve diverse disciplinary needs? Even disciplines that seem on the surface to have little in common can benefit from casual conversations about the challenges and methods that they might share. In this talk, we show how a simple smartphone app developed for a project analyzing language during the pandemic (MI Diaries) was successfully adapted for a Religious Studies project examining learning about Judaism through the cultural arts (Shtisel Diary). By reflecting on these two case-studies we highlight how the tools that we use to conduct research can be just as interdisciplinary as research projects themselves. 

Details

Friday, November 5, 2021
12PM-1PM EDT via Zoom

Zoom link: https://msu.zoom.us/j/96411904159
Passcode: msuc4i


MSU represented at NWAV 49

For the first time, the New Ways of Analyzing Variation conference is being held online. Hosted by the University of Texas at Austin, NWAV 49 talks are available as pre-recorded videos to registered participants, and live Q&A sessions are happening this week, October 19 – 24, 2021.

MSU will, as always, be pretty well represented! Here’s the list of current and former MSU faculty and students who will be presenting this year:

  • Adam Barnhardt. I didn’t go to college with anyone that country: Age-stratified indexicality of Southern-shifted vowels.
  • Jack Rechsteiner and Betsy Sneller. Non-binary speakers’ use of (ING) across gender-related topics.
  • Denise Troutman. Throwing shade: Signifyin(g) and synchronic change among Ebonics speakers.
  • Mingzhe Zheng. One-ge person or One-wei person: Exploring the use of Mandarin classifier across time.
  • Dennis Preston. Women are hens: A taxonomic exercise in historical gender-based metaphor.
  • Rebecca Roeder. PALM and the low-back merger shift: Evidence from Victoria, BC.
  • Marisa Brook. Language shift in a microcosm: Finnish-English bilingualism, contact, and substrate effects in Sointula, British Columbia.


In the news: Researchers study how COVID pandemic is affecting language change

Check out this August 5, 2020 story from the College of Arts and Letters on the MI COVID Diaries project, run by MSU’s Sociolinguistics Lab. We’ve been collecting recorded speech from Michigan residents since the beginning of April to track changes to language during the COVID-19 pandemic. 

“Social distancing and distance learning are affecting how people behave in the world. Part of this project is to document how people’s lives are changing. But from the linguistics side, what we are interested in is how these social changes impact language use, both on a short-term scale and potentially on the long-term scale as well.”

Dr. Betsy Sneller