Language is a cognitive function unique to humans, and among humans with unimpaired speech and hearing, linguistic activity is manifested primarily in speech. Linguistic information is communicated through the shared medium of the speech signal, and the listener faces the complex task of decoding the signal to uncover the elements of meaning at each linguistic level. I propose a research program in spoken language studies, to be carried out by establishing the McMaster Spoken Language Research Laboratory (MSLRS) at McMaster University. Below I describe what I have done in the area of spoken language, what aspects of spoken language I plan to investigate in the near future, and what training will take place to recruit and produce highly qualified personnel in spoken language processing and understanding.
Progress in research activities related to this proposal: My research has focused on acoustic and perceptual evidence for prosody in spoken language, and on the relationship between prosodic structure and higher levels of linguistic structure such as phonology, syntax, and semantics. The intonation and rhythm of speech play an important role in expressing meaning. These properties of an utterance reflect the prosodic structure of the language, such as prosodic prominence and prosodic phrasing. For example, ‘Steve or Sam and Bob will come’ means one thing if it is said with a prosodic phrase boundary after Steve, and quite another if the boundary appears after Sam. Prosodic structure can be used to convey syntactic as well as pragmatic information. This kind of information, some of which is conveyed through punctuation in written language, is expressed through the modulation of pitch, loudness, duration, and voice quality (such as glottalization) across the syllables of an utterance. In order to investigate prosody using speech produced in a more natural setting than a laboratory, I have been developing a data-driven prosody prediction system using the Boston University Radio Speech Corpus (Ostendorf, Price, & Shattuck-Hufnagel, 1995). To extract linguistic information from the speech data, I have employed various natural language processing techniques (e.g. syntactic parsing and semantic role labeling) and machine learning techniques (e.g. Memory-Based Learning (MBL) and Classification and Regression Trees (CART)). In addition, I have applied speech signal processing techniques and a forced-alignment component to the audio data in order to obtain phonetic information such as pitch, intensity, and duration at the phone level.
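To illustrate the kind of phone-level acoustic measurement involved, the sketch below computes frame-level RMS intensity and a simple autocorrelation pitch estimate on a synthetic tone. This is a minimal, pure-Python illustration only: the actual system relies on dedicated signal processing tools and a forced aligner, and all names and parameter values here are my own assumptions.

```python
import math

def frame_features(samples, sr, frame_len=400, hop=200):
    """Per-frame RMS intensity and autocorrelation pitch estimate
    for a mono signal (an illustrative sketch, not the real toolchain)."""
    feats = []
    for start in range(0, len(samples) - frame_len, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Autocorrelation pitch: the lag with the strongest correlation
        # within a plausible F0 search range (75-400 Hz).
        best_lag, best_corr = 0, 0.0
        for lag in range(int(sr / 400), int(sr / 75)):
            corr = sum(frame[i] * frame[i - lag] for i in range(lag, frame_len))
            if corr > best_corr:
                best_corr, best_lag = corr, lag
        f0 = sr / best_lag if best_lag else 0.0
        feats.append((rms, f0))
    return feats

# Synthetic 200 Hz tone sampled at 16 kHz (stand-in for real speech)
sr = 16000
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(4000)]
feats = frame_features(tone, sr)
```

On the synthetic tone, the estimated F0 comes out near 200 Hz and the RMS near 0.71; real speech would of course require voicing detection and more robust pitch tracking.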
I have demonstrated that prosodic features of an utterance can be reliably predicted from a set of features that encode the phonetic, phonological, syntactic, and semantic properties of the local context (Yoon, 2007). On several tasks, my results were better than those reported in the literature for similar tasks on similar datasets. For example, in the task of predicting the presence or absence of pitch accent, the highest performance that had been reported in the literature was around 84-86% (cf. Brenier, Cer & Jurafsky 2005), whereas the performance I obtained on the same task, using the same corpus but different features, was 87.7%.
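Memory-Based Learning, one of the techniques mentioned above, is in essence a k-nearest-neighbour method: all training instances are stored, and a new case is labeled by majority vote among its most similar stored neighbours. The toy sketch below illustrates the idea on invented pitch accent data; the feature set and values are illustrative only and are not drawn from the corpus.

```python
from collections import Counter

def mbl_classify(train, query, k=3):
    """Memory-based learning in its simplest form: keep every training
    instance and label a query by majority vote among its k nearest
    neighbours (squared Euclidean distance over the feature vector)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(feats, query)), label)
        for feats, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy instances: (normalized F0 peak, duration, content-word flag) -> label
train = [
    ((0.9, 0.8, 1.0), "accented"),
    ((0.8, 0.7, 1.0), "accented"),
    ((0.7, 0.9, 1.0), "accented"),
    ((0.2, 0.3, 0.0), "unaccented"),
    ((0.1, 0.4, 0.0), "unaccented"),
    ((0.3, 0.2, 0.0), "unaccented"),
]
print(mbl_classify(train, (0.85, 0.75, 1.0)))  # prints "accented"
```

In practice the instance base holds thousands of syllables, each described by the phonetic, phonological, syntactic, and semantic context features discussed above, and the distance metric weights features by their informativeness.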
Independently of the work on the prosody prediction system, I have collaborated on developing an algorithm that detects a type of voice quality (i.e. creaky voice or glottalization) in the speech signal of Switchboard, a band-limited telephone corpus of spontaneous American English speech (Yoon, Cole, & Hasegawa-Johnson, 2008), and on developing a voice-quality-dependent automatic speech recognition system (Yoon, Zhuang, Cole, & Hasegawa-Johnson, 2009). By conducting a classification experiment with a Support Vector Machine (SVM) classifier on standard speech recognition input features (e.g. Perceptual Linear Predictive Coefficients (PLPC) and their derivatives), we showed that allophones that differ in voice quality can be classified as distinct. Among the different ways to incorporate voice quality information into HMM (Hidden Markov Model)-based automatic speech recognition, we demonstrated that explicitly modeling voice quality variation in the acoustic phone models improves word recognition accuracy.
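One standard acoustic correlate of creaky voice is elevated jitter, i.e. irregularity in the duration of successive glottal periods. The sketch below computes local jitter and thresholds it. This is an illustrative simplification of what a detector might measure (our published detector instead trained an SVM on PLP-based features), and the period values and threshold are invented for the example.

```python
def jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, relative to the mean period. High jitter is a
    classic acoustic correlate of creaky voice / glottalization."""
    if len(periods) < 2:
        return 0.0
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def is_creaky(periods, threshold=0.05):
    # Threshold is an assumption for illustration, not a calibrated value.
    return jitter(periods) > threshold

modal  = [5.0, 5.05, 4.95, 5.02, 4.98]   # period durations in ms, nearly regular
creaky = [8.0, 12.5, 7.0, 14.0, 9.5]     # long, irregular periods
```

Here `jitter(modal)` is about 0.013 while `jitter(creaky)` exceeds 0.5, so the simple threshold separates the two invented samples.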
Objectives of the research program: The goal of this multidisciplinary research proposal, which combines linguistics and speech technology, is to develop an efficient algorithm for prosodic phrase prediction using voice source characteristics. The human voice has evolved as a vehicle for conveying many different types of information, and human listeners have developed the ability to detect very small and subtle voice quality changes and to interpret their function. In spite of recent improvements in the techniques for describing and modeling source variation, our abilities lag far behind what the human ear can do effortlessly. Most studies to date have been very limited, either in the quantity of data analyzed or in the kinds of source measures made. Advances in the field have been particularly hampered by the lack of suitable analysis tools. Manual, interactive techniques have generally been adopted in studying voice quality across languages. These methods permit a fine-grained analysis of the source, but because of their labour-intensive nature they are not suitable for the large-scale studies that would ideally be needed for progress in this field (Epstein, 2002; Hardcastle & Laver, 1997; Ishi, Ishiguro, & Hagita, 2008). It is therefore important to develop an efficient tool for (semi-)automatic prosodic phrasing annotation, a task of growing importance for the speech community. For example, the prediction of prosodic phrase boundaries is important both in natural language processing, to convey the correct meaning, and in text-to-speech synthesis, to increase the naturalness of the synthesized speech. Other investigators have reported that prosodic boundaries are occasionally glottalized.
But because glottalization is only occasional, and because it is difficult to characterize using standard speech signal processing methods, this acoustic feature of prosodic phrasing has apparently not been extensively studied or incorporated into automatic prosodic phrasing prediction algorithms. Furthermore, investigating prosody, or computationally modeling prosodic phrase boundaries, through these acoustic features is complicated by the fact that pitch, loudness, duration, and voice quality are also affected by paralinguistic properties of the utterance, such as the speaker’s emotional state, and even by non-linguistic factors such as the speaker’s gender and age. The goal of the proposed project, therefore, is to develop an efficient algorithm for prosodic phrase prediction by refining automatic detection algorithms for voice quality characteristics in English and other languages, and to conduct phonetic and statistical analyses of the lexical and contextual features of the utterance that may condition the presence of voice quality features such as glottalization in spoken language.
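As a concrete, hypothetical illustration of how a voice quality cue could be folded into boundary prediction, the sketch below combines classic boundary correlates (pause duration, pre-boundary lengthening, pitch reset) with a glottalization flag in a simple linear score. The weights, threshold, and cue encodings are invented for illustration; in the proposed work such parameters would be learned from prosodically annotated data rather than set by hand.

```python
def boundary_score(pause_ms, final_lengthening, glottalized, pitch_reset):
    """Hypothetical linear scorer for prosodic phrase boundary prediction.
    All weights are illustrative assumptions, not fitted values."""
    score = 0.0
    score += 0.5 * min(pause_ms / 200.0, 1.0)     # pause cue, capped at 200 ms
    score += 0.2 * final_lengthening               # 0..1 lengthening ratio
    score += 0.2 * pitch_reset                     # 0..1 normalized F0 jump
    score += 0.1 * (1.0 if glottalized else 0.0)   # occasional voice quality cue
    return score

def is_boundary(*cues, threshold=0.5):
    return boundary_score(*cues) > threshold

# A long pause with lengthening, glottalization, and a pitch reset
print(is_boundary(250, 0.8, True, 0.6))   # prints True
# Fluent, short-pause context with no supporting cues
print(is_boundary(20, 0.1, False, 0.1))   # prints False
```

The small weight on glottalization reflects the point above: it is an occasional cue, so it should raise confidence when present without being required for a boundary decision.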