Statistics / Linguistics Seminar - Chadi Ben Youssef

Location: Mathematical Sciences Building 1147

Speaker: Chadi Ben Youssef, PhD Candidate, Dept of Linguistics, UC Santa Barbara

Title: "Towards interpretable Machine Learning for linguistic analysis: the case of Code-Switching in a low resource language"

Abstract: As language use depends on the situational context and evolves dynamically, speakers/signers are constantly making choices at many levels of linguistic structure, meaning, and code, often unconsciously. The overarching goal of my research is to understand what motivates and constrains these choices at different levels: (i) the cognitive level and how we process language, (ii) how that is reflected at the level of linguistic structure, and (iii) the social factors that shape language use and are shaped by it. To achieve this goal, I argue that such inquiry needs to be data-driven to reflect the experiential nature of language, multifactorial to account for the complexity of linguistic phenomena, and inclusive of a wide variety of speech communities to comprehensively account for the social and cultural dimensions of language.


In this talk, I present a case study on Code-switching (CS), one of the most studied phenomena of language contact and change. I show what challenges arise when dealing with naturally occurring, messy, and scarce language data, and that computational techniques and Machine Learning (ML) offer fertile avenues for linguistic research to address these challenges. The study uses a mixed-methods approach (combining qualitative manual annotation and quantitative techniques) to incorporate a number of structural, sociolinguistic, and psycholinguistic/cognitive factors. A Random Forest model is then fit to the resulting data set to overcome the methodological challenges associated with low-resource languages and imbalanced data. The resulting model shows that CS between Tunisian Arabic and French is affected by a constellation of competing factors: (i) Noun Phrases are a prime location for switched elements; (ii) when speakers code-switch, they are attuned to the cognitive load they impose on themselves and/or on listeners, while (iii) maintaining code integrity at the phrase and discourse levels by switching dependent parts of speech when the phrase’s head is switched.


Finally, I make the argument that in order to advance our knowledge of complex phenomena such as CS, it is crucial to understand what the goals of such inquiry are. If we are to explain why a given model produces a given prediction, then we have to privilege interpretable models (or at least ensure that we can interpret them post hoc). This is becoming increasingly important as language model architectures are deployed (rapidly and prematurely?) in many industries such as health, transportation, finance… I contend that as researchers we need to develop and adapt modeling techniques that we can understand and explain as comprehensively as necessary to ensure their safety and their social acceptance, and to reduce the bias that they can reflect.

Seminar Date/Time: Friday, March 1st, 10:30 am