Speech discrimination in real-world group communication using audio-motion multimodal sensing

Takayuki Nozawa*, Mizuki Uchiyama, Keigo Honda, Tamio Nakano, Yoshihiro Miyake

*Corresponding author of this paper

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Speech discrimination, which determines whether a participant is speaking at a given moment, is essential for investigating human verbal communication. In dynamic real-world situations where multiple people participate in and form groups in the same space, simultaneous speakers make speech discrimination based solely on audio sensing difficult. In this study, we focused on physical activity during speech and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. Utterance and physical activity data of students in a university participatory class were therefore recorded using smartphones worn around their necks. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities across wide frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found a higher performance with the audio-motion classifier (average accuracy 92.2%) than with both the audio-only (80.4%) and motion-only (87.8%) classifiers. Finally, we tested inter-individual classification and obtained a higher performance with the audio-motion combined classifier (83.2%) than with the audio-only (67.7%) and motion-only (71.9%) classifiers. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communications.
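As a purely illustrative sketch (not the authors' implementation), the following Python snippet shows one way window-level audio-motion feature fusion could feed a per-participant speech/non-speech classifier, as described in the abstract. The feature set, window length, sampling rates, and the random-forest model are all assumptions made for illustration; the paper's actual features and classifier may differ.

```python
# Hypothetical sketch of audio-motion fusion for speech/non-speech
# classification. All feature and model choices are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def window_features(audio, accel, sr_audio=16000, sr_accel=50, win_s=1.0):
    """Compute simple per-window features from raw audio and 3-axis acceleration."""
    n_win = int(min(len(audio) / (sr_audio * win_s),
                    len(accel) / (sr_accel * win_s)))
    feats = []
    for i in range(n_win):
        a = audio[int(i * sr_audio * win_s):int((i + 1) * sr_audio * win_s)]
        m = accel[int(i * sr_accel * win_s):int((i + 1) * sr_accel * win_s)]
        mag = np.linalg.norm(m, axis=1)           # acceleration magnitude
        feats.append([
            np.sqrt(np.mean(a ** 2)),             # audio RMS energy
            np.mean(np.abs(np.diff(a))),          # rough spectral-change proxy
            np.mean(mag), np.std(mag),            # motion level and variability
        ])
    return np.array(feats)

# Synthetic stand-in data for one participant (real data would come from the
# neck-worn smartphone's microphone and accelerometer, with manual labels).
rng = np.random.default_rng(0)
audio = rng.normal(size=16000 * 600)       # 10 min of audio at 16 kHz
accel = rng.normal(size=(50 * 600, 3))     # 10 min of 3-axis accel at 50 Hz
labels = rng.integers(0, 2, size=600)      # per-window speaking / not speaking

X = window_features(audio, accel)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Dropping either the audio or the motion columns from the feature matrix gives the single-modality baselines against which a combined classifier like the one in the study would be compared.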

Original language: English
Article number: 2948
Journal: Sensors
Volume: 20
Issue number: 10
DOI
Publication status: Published - 2020/05/02

ASJC Scopus subject areas

  • Analytical Chemistry
  • Information Systems
  • Atomic and Molecular Physics, and Optics
  • Biochemistry
  • Instrumentation
  • Electrical and Electronic Engineering
