Research Spending & Results

Award Detail

Awardee:OHIO STATE UNIVERSITY, THE
Doing Business As Name:Ohio State University
PD/PI:
  • DeLiang Wang
  • (614) 292-6827
  • dwang@cse.ohio-state.edu
Award Date:07/29/2021
Estimated Total Award Amount: $ 390,585
Funds Obligated to Date: $ 390,585
  • FY 2021=$390,585
Start Date:08/01/2021
End Date:07/31/2024
Transaction Type:Grant
Agency:NSF
Awarding Agency Code:4900
Funding Agency Code:4900
CFDA Number:47.041
Primary Program Source:040100 NSF RESEARCH & RELATED ACTIVITIES
Award Title or Description:Deep Learning Based Complex Spectral Mapping for Multi-Channel Speaker Separation and Speech Enhancement
Federal Award ID Number:2125074
DUNS ID:832127323
Parent DUNS ID:001964634
Program:EPCN-Energy-Power-Ctrl-Netwrks
Program Officer:
  • Donald Wunsch
  • (703) 292-7102
  • dwunsch@nsf.gov

Awardee Location

Street:Office of Sponsored Programs
City:Columbus
State:OH
ZIP:43210-1016
County:Columbus
Country:US
Awardee Cong. District:03

Primary Place of Performance

Organization Name:Ohio State University
Street:Office of Sponsored Programs
City:Columbus
State:OH
ZIP:43210-1016
County:Columbus
Country:US
Cong. District:03

Abstract at Time of Award

Despite tremendous advances in deep learning based speech separation and automatic speech recognition, a major challenge remains: how to separate concurrent speakers and recognize their speech in the presence of room reverberation and background noise. This project will develop a multi-channel complex spectral mapping approach to multi-talker speaker separation and speech enhancement in order to improve speech recognition performance in such conditions. The proposed approach trains deep neural networks to predict the real and imaginary parts of the spectrograms of individual talkers from the multi-channel input in the complex domain. After overlapped speakers are separated into simultaneous streams, sequential grouping will be performed for speaker diarization, the task of grouping the speech utterances of the same talker across intervals separated by the utterances of other speakers and by pauses. The proposed speaker diarization will integrate spatial and spectral speaker features, which will be extracted through multi-channel speaker localization and single-channel speaker embedding. Recurrent neural networks will be trained to perform classification for speaker diarization, which can handle an arbitrary number of speakers in a meeting. The proposed separation system will be evaluated using open, multi-channel speaker separation datasets that contain both room reverberation and background noise. The results from this project are expected to substantially elevate the performance of continuous speaker separation, as well as speaker diarization, in adverse acoustic environments, helping to close the performance gap between recognizing single-talker speech and recognizing multi-talker speech.
The overall goal of this project is to develop a deep learning system that can continuously separate individual speakers in a conversational or meeting setting and accurately recognize the utterances of these speakers. Building on recent advances in simultaneous grouping to separate and enhance overlapped speakers in a talker-independent fashion, the project focuses mainly on speaker diarization, which aims to group the speech utterances of the same speaker across time. To achieve speaker diarization, deep learning based sequential grouping will be performed, integrating spatial and spectral speaker characteristics. Through sequential organization, simultaneous streams will be grouped with earlier-separated speaker streams to form sequential streams, each of which corresponds to all the utterances of the same speaker up to the current time. Speaker localization and classification will be investigated to make sequential grouping capable of creating new sequential streams and handling an arbitrary number of speakers in a meeting scenario. With the added spatial dimension, the proposed diarization approach provides a solution to the question of who spoke when and where, significantly expanding the traditional scope of who spoke when. The proposed separation system will be evaluated using multi-channel speaker separation datasets that contain highly overlapped speech in recorded conversations, as well as room reverberation and background noise present in real environments. The main evaluation metric will be word error rate in automatic speech recognition. The performance of speaker diarization will be measured using diarization error rate.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
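The core technique named in the abstract, multi-channel complex spectral mapping, trains a network to map the stacked real and imaginary STFT components of all microphone channels to the real and imaginary STFT components of each talker. The sketch below is an illustrative reconstruction rather than the project's actual model: the channel and speaker counts, the toy BLSTM architecture, and all layer sizes are assumptions made only to show the input/output structure of the mapping.

# Minimal sketch (not the awardees' code) of multi-channel complex spectral
# mapping: stack real/imaginary STFT parts of all channels as input features,
# and predict real/imaginary STFT parts for each speaker, then invert.
import torch
import torch.nn as nn

N_CHANNELS = 4               # microphones (assumed)
N_SPEAKERS = 2               # concurrent talkers (assumed)
N_FFT, HOP = 512, 128
N_BINS = N_FFT // 2 + 1

class ComplexSpectralMapper(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = 2 * N_CHANNELS * N_BINS      # real + imaginary per channel
        out_dim = 2 * N_SPEAKERS * N_BINS     # real + imaginary per speaker
        self.blstm = nn.LSTM(in_dim, 512, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * 512, out_dim)

    def forward(self, mix):                   # mix: (batch, channels, samples)
        B, C, _ = mix.shape
        window = torch.hann_window(N_FFT, device=mix.device)
        # Per-channel STFT -> complex tensor of shape (B, C, bins, frames)
        spec = torch.stack([torch.stft(mix[:, c], N_FFT, HOP, window=window,
                                       return_complex=True)
                            for c in range(C)], dim=1)
        T = spec.shape[-1]
        # Stack real/imaginary parts of all channels: (B, frames, 2*C*bins)
        feats = torch.cat([spec.real, spec.imag], dim=1)
        feats = feats.permute(0, 3, 1, 2).reshape(B, T, -1)
        out, _ = self.blstm(feats)
        out = self.proj(out)                  # (B, frames, 2*S*bins)
        out = out.reshape(B, T, 2, N_SPEAKERS, N_BINS).permute(0, 3, 2, 4, 1)
        est = torch.complex(out[:, :, 0], out[:, :, 1])   # (B, S, bins, frames)
        # Inverse STFT per estimated speaker -> time-domain signals
        wavs = torch.stack([torch.istft(est[:, s], N_FFT, HOP, window=window)
                            for s in range(N_SPEAKERS)], dim=1)
        return wavs                            # (B, speakers, samples)

if __name__ == "__main__":
    model = ComplexSpectralMapper()
    mixture = torch.randn(1, N_CHANNELS, 16000)    # 1 s of 16 kHz audio
    separated = model(mixture)
    print(separated.shape)                         # (1, 2, ~16000)

The BLSTM here only stands in for whatever mapping network the project uses; the essential point is that estimation happens in the complex STFT domain, so both magnitude and phase of each separated talker are predicted jointly from the multi-channel input.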
