Research Interest

The following research statement was part of a research proposal I wrote back in 2009, when I was at McMaster University. The core of my research interest remains the same, but the old statement is due for an update.

Language is a cognitive function unique to humans, and among humans with unimpaired speech and hearing, linguistic activity is manifested primarily in speech. Linguistic information is communicated through the shared medium of the speech signal, and the listener faces the complex task of decoding that signal to uncover the elements of meaning at each linguistic level. I propose a research program in spoken language studies, to be carried out by establishing the McMaster Spoken Language Research Laboratory (MSLRL) at McMaster University. Below I describe my previous work on spoken language, the aspects of spoken language I plan to investigate in the near future, and the training that will take place to recruit and produce highly qualified personnel for spoken language processing and understanding.

Research progress leading to the proposal: My research has focused on acoustic and perceptual evidence for prosody in spoken language, and on the relationship between prosodic structure and higher levels of linguistic structure such as phonology, syntax, and semantics. The intonation and rhythm of speech play an important role in expressing meaning. These properties of an utterance reflect the prosodic structure of the language, such as prosodic prominence and prosodic phrasing. For example, ‘Steve or Sam and Bob will come’ means one thing if it is said with a prosodic phrase boundary after Steve, and quite another if the boundary appears after Sam. Prosodic structure can be used to convey syntactic as well as pragmatic information. This kind of information, some of which is conveyed through punctuation in written language, is expressed through the modulation of pitch, loudness, duration, and voice quality (such as glottalization) across the syllables in an utterance. In order to investigate prosody using speech produced in a more natural setting than the laboratory, I have been developing a data-driven prosody prediction system using the Boston University Radio Speech Corpus (Ostendorf, Price, & Shattuck-Hufnagel, 1995). To extract linguistic information from the speech data, I have employed various natural language processing techniques (e.g., syntactic parsing and semantic role labeling) and machine learning techniques (e.g., Memory-Based Learning (MBL) and Classification and Regression Trees (CART)). In addition, I have applied speech signal processing techniques and a forced-alignment component to the audio data in order to obtain phonetic information such as pitch, intensity, and duration at the phone level.
I have demonstrated that prosodic features of an utterance can be reliably predicted from a set of features that encode the phonetic, phonological, syntactic, and semantic properties of the local context (Yoon, 2007). On several tasks, the results I obtained surpassed those reported in the literature for similar tasks on similar datasets. For example, in the task of predicting the presence or absence of pitch accent, the highest performance previously reported was around 84-86% (cf. Brenier, Cer & Jurafsky, 2005), whereas the performance I obtained on the same task, using the same corpus but a different feature set, was 87.7%.
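The shape of this prediction task can be illustrated with a small sketch. This is not the actual system from Yoon (2007): it is a toy CART-style decision tree (via scikit-learn) trained on invented local-context features, shown only to make the setup concrete.

```python
# Toy sketch of CART-style pitch-accent prediction from local-context
# features. Feature names and data are illustrative, not from the corpus.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row encodes one word:
# [duration (s), mean F0 (z-score), is_content_word (0/1), syllable count]
X = np.array([
    [0.45,  1.2, 1, 2],   # long, high-pitched content word
    [0.12, -0.8, 0, 1],   # short function word
    [0.38,  0.9, 1, 3],
    [0.10, -0.5, 0, 1],
    [0.50,  1.5, 1, 2],
    [0.15, -1.0, 0, 1],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = pitch-accented, 0 = unaccented

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[0.40, 1.0, 1, 2]]))  # → [1] (predicted accented)
```

In the real task the feature vector would come from forced alignment and syntactic/semantic annotation rather than being entered by hand, and the model would be evaluated on held-out data.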

Independent of the work on developing a computational prosody prediction system, I have collaborated on developing an algorithm that detects a type of voice quality (i.e., creaky voice, or glottalization) in the speech signal of Switchboard, a band-limited telephone corpus of spontaneous American English (Yoon, Cole, & Hasegawa-Johnson, 2008), and on developing a voice-quality-dependent automatic speech recognition system (Yoon, Zhuang, Cole, & Hasegawa-Johnson, 2009). By conducting a classification experiment with a Support Vector Machine (SVM) classifier on standard speech recognition input features (e.g., Perceptual Linear Predictive Coefficients (PLPC) and their derivatives), we showed that allophones that differ from each other in voice quality can be classified as distinct. Among the different ways of incorporating voice quality information into HMM (Hidden Markov Model)-based automatic speech recognition, we demonstrated that explicitly modeling voice quality variation in the acoustic phone models improves word recognition accuracy.
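As a rough illustration of that classification setup, the sketch below trains an SVM on synthetic 13-dimensional vectors standing in for PLP coefficients, with "creaky" frames shifted in feature space. The data, dimensionality, and separation are invented for illustration and are not drawn from the Switchboard experiments.

```python
# Sketch: SVM classification of creaky vs. modal frames from
# cepstral-style features; synthetic data stands in for real PLPC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 40 "modal" and 40 "creaky" frames in a 13-dim feature space,
# with creaky frames shifted along the first three dimensions.
modal = rng.normal(0.0, 1.0, size=(40, 13))
creaky = rng.normal(0.0, 1.0, size=(40, 13))
creaky[:, :3] += 2.5

X = np.vstack([modal, creaky])
y = np.array([0] * 40 + [1] * 40)  # 0 = modal, 1 = creaky

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))  # near-perfect on this separable toy data
```

A real experiment would of course report accuracy on held-out frames, with features extracted from annotated speech rather than sampled from a Gaussian.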

Objectives of the research program: The goal of this multidisciplinary research proposal, which combines linguistics and speech technology, is to develop an efficient algorithm for prosodic phrase prediction using voice source characteristics. The human voice has evolved as a vehicle for conveying many different types of information, and human listeners have developed the ability to detect very small and subtle changes in voice quality and to interpret their function. In spite of recent improvements in techniques for describing and modeling source variation, our abilities lag far behind what the human ear can do effortlessly. Most studies to date have been very limited, either in the quantity of data analyzed or in the kinds of source measures made. Advances in the field have been particularly hampered by the lack of suitable analysis tools. Manual interactive techniques have generally been adopted in studying voice quality across languages. These methods permit a fine-grained analysis of the source, but because of their labour-intensive nature they are not suitable for the large-scale studies that would ideally be needed for progress in this field (Epstein, 2002; Hardcastle & Laver, 1997; Ishi, Ishiguro, & Hagita, 2008). It is therefore an important task to develop an efficient tool for (semi-)automatic prosodic phrasing annotation, which is becoming more and more important for the speech community. For example, the prediction of prosodic phrase boundaries is an important task for natural language processing, to convey the correct meaning, and for text-to-speech synthesis, to increase the naturalness of the synthesized speech. Other investigators have reported that prosodic boundaries are occasionally glottalized.
But because glottalization is only occasional, and because it is difficult to characterize using standard speech signal processing methods, this acoustic feature of prosodic phrasing has apparently not been extensively studied or incorporated into automatic prosodic phrasing prediction algorithms. Furthermore, investigating prosody, or computationally modeling prosodic phrase boundaries, through these acoustic features is complicated by the fact that pitch, loudness, duration, and voice quality are also affected by paralinguistic properties of the utterance, such as the speaker’s emotional state, and even by non-linguistic factors such as the speaker’s gender and age. The goal of the proposed project, therefore, is to develop an efficient algorithm for prosodic phrase prediction by refining automatic detection algorithms for voice quality characteristics in English and other languages, and to conduct phonetic and statistical analyses of the lexical and contextual features of an utterance that may condition the presence of voice quality features such as glottalization in spoken language.
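One simple voice-source measure associated with glottalization is local jitter, the cycle-to-cycle irregularity of pitch periods. The sketch below computes it from invented period sequences; the example values and the 0.02 threshold are illustrative only, not a claim about any particular corpus.

```python
# Sketch of one voice-source measure: local jitter, the mean absolute
# difference between consecutive pitch periods divided by the mean
# period. Elevated jitter is one acoustic correlate of glottalization.
import numpy as np

def local_jitter(periods):
    """Local jitter (ratio) from a sequence of pitch periods in seconds."""
    periods = np.asarray(periods, dtype=float)
    diffs = np.abs(np.diff(periods))
    return diffs.mean() / periods.mean()

modal = [0.0100, 0.0101, 0.0099, 0.0100, 0.0101]   # steady voicing
creaky = [0.0100, 0.0140, 0.0095, 0.0150, 0.0090]  # irregular periods

print(local_jitter(modal) < 0.02 < local_jitter(creaky))  # → True
```

In practice the pitch periods themselves must first be estimated from the waveform, which is exactly where band-limited spontaneous speech makes such measures difficult to automate.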

Korean Forced Alignment System (LINK)

Transcription management platform for topic-adaptive English speaking assessment data

Web-based speech data collection (Under construction)

Adapting Wav2Vec2.0 to Korean Speech of various styles

ChatGPT via Streamlit (under construction but working)

Research Projects

  • Taejin Yoon. Building a Web-based Phonetic and Phonological Research Interface (PI). 2021.07-2031.06.
  • Seunghee Ha, Taejin Yoon, Jungmin So. Children's Speech Database and Speech Sound Assessment Using Deep Learning-Based Automatic Speech Recognition (Co-PI). 2021.07-2024.06.

Computer Specifications (2 Ubuntu Servers)

CPU: Intel i9; RAM: 128GB; GPU: RTX 3090 x 2; SSD: 1TB; HDD: 4TB x 4

CPU: AMD Ryzen Threadripper; RAM: 128GB; GPU: RTX 3090 x 4; SSD: 4TB; HDD: 20TB x 1

MacBook Pro 16-inch, M2 Max, 12-Core CPU, 38-Core GPU, 32GB Unified Memory, 1TB SSD Storage

Research Grants

  • Seoul 50 Plus Foundation (PI). Digital Marketing for Small Business Owners. 2023.05-2023.11. KRW 30,000,000
  • NIA (National Information Society Agency) 2-023. Construction of Topic-Adaptive English Speaking Assessment Data (PI). 2022.09-2022.12. KRW 250,000,000
  • NIA (National Information Society Agency) 13. Korean Speech Data from Western- and Asian-Language Speakers for Language Education (project manager, Korean Society of Speech Sciences) (2022.05.01-2022.11.30)
  • NIA (National Information Society Agency) 12. Multilingual Speech Data from Korean Speakers for Education (consultant, English pronunciation assessment) (2022.08.04-2022.11.30)
  • NRF Mid-Career Researcher Program (PI). Building a Web-based Phonetic and Phonological Research Interface (2021.07.01-2031.06.30). KRW 10,000,000/year [News]
  • NRF Joint Research Program (Co-PI). Seunghee Ha (Hallym Univ.), Taejin Yoon (Sungshin Women's Univ.), Jungmin So (Sogang Univ.). Construction of a Children's Speech Database and Development of a Speech Sound Assessment System Based on Deep-Learning Automatic Speech Recognition (2021.07.01-2024.06.30). KRW 100,000,000/year [News]
  • National Institute of Korean Language. 2020 Lexical-Semantic Corpus Research and Analysis Project (co-investigator)
  • National Institute of Korean Language. 2019 Lexical-Semantic Corpus Construction (co-investigator)
  • National Institute of Korean Language. 2019 Morphologically Annotated Corpus Construction (co-investigator)
  • NRF Mid-Career Researcher Program (PI). A Study of English Rhythm Metrics in L1 and L2 Using Large-Scale Speech Corpora (2018.07.01-2020.06.30)
  • NRF Joint Research Program (PI). Constructing an Interlanguage Model of Modal Auxiliaries in TOEFL11 Using Competition Models by L1 (2016.11.01-2018.10.31)
  • NRF New Researcher Program (PI). A Study of Positional Variation in Phonetic Features Using Large-Scale Speech Corpora (2014)
  • NRF New Researcher Program (PI). A Study of Vowels in Large-Scale Speech Data Using a Forced Alignment System (2013)
  • (Co-PI) A Corpus-based Study of Korean Dialects: Microvariation and Universals. SSHRC (Social Sciences and Humanities Research Council of Canada) Partnership Development Grant (2013-2015)
  • NSERC Discovery Grant (PI). Prosodic Phrasing Detection Using Acoustic Source Characteristics. NSERC (Natural Sciences and Engineering Research Council of Canada) (2009-2014)
  • Beckman Graduate Fellowship. Beckman Institute for Advanced Science and Engineering (2006).