TY - JOUR
T1 - Skeleton-based deep pose feature learning for action quality assessment on figure skating videos
AU - Li, Huiying
AU - Lei, Qing
AU - Zhang, Hongbo
AU - Du, Jixiang
AU - Gao, Shangce
N1 - Publisher Copyright:
© 2022 Elsevier Inc.
PY - 2022/11
Y1 - 2022/11
N2 - Most existing Action Quality Assessment (AQA) methods for scoring sports videos have thoroughly studied how to evaluate a single action or several sequentially defined actions performed in short-term sports videos, such as diving and vault. They attempt to extract features directly from RGB videos through 3D ConvNets, which mixes the features with ambiguous scene information. To investigate the effectiveness of deep pose feature learning for automatically evaluating complex activities in long-duration sports videos, such as figure skating and artistic gymnastics, we propose a skeleton-based deep pose feature learning method to address this problem. For pose feature extraction, a spatial–temporal pose extraction module (STPE) is built to capture subtle changes in human body movement and obtain detailed representations of skeletal data in the space and time dimensions. For temporal information representation, an inter-action temporal relation extraction module (ATRE) is implemented with a recurrent neural network to model the dynamic temporal structure of skeletal subsequences. We evaluate the proposed method on the figure skating activities of the MIT-Skate and FIS-V datasets. The experimental results show that the proposed method is more effective than RGB video-based deep feature learning methods, including SENet and C3D. A significant improvement in Spearman Rank Correlation (SRC) is achieved on the MIT-Skate dataset. On the FIS-V dataset, better SRC and MSE between the predicted scores and the judges' scores are achieved for both the Total Element Score (TES) and the Program Component Score (PCS) when compared with the SENet and C3D feature methods.
AB - Most existing Action Quality Assessment (AQA) methods for scoring sports videos have thoroughly studied how to evaluate a single action or several sequentially defined actions performed in short-term sports videos, such as diving and vault. They attempt to extract features directly from RGB videos through 3D ConvNets, which mixes the features with ambiguous scene information. To investigate the effectiveness of deep pose feature learning for automatically evaluating complex activities in long-duration sports videos, such as figure skating and artistic gymnastics, we propose a skeleton-based deep pose feature learning method to address this problem. For pose feature extraction, a spatial–temporal pose extraction module (STPE) is built to capture subtle changes in human body movement and obtain detailed representations of skeletal data in the space and time dimensions. For temporal information representation, an inter-action temporal relation extraction module (ATRE) is implemented with a recurrent neural network to model the dynamic temporal structure of skeletal subsequences. We evaluate the proposed method on the figure skating activities of the MIT-Skate and FIS-V datasets. The experimental results show that the proposed method is more effective than RGB video-based deep feature learning methods, including SENet and C3D. A significant improvement in Spearman Rank Correlation (SRC) is achieved on the MIT-Skate dataset. On the FIS-V dataset, better SRC and MSE between the predicted scores and the judges' scores are achieved for both the Total Element Score (TES) and the Program Component Score (PCS) when compared with the SENet and C3D feature methods.
KW - Action quality assessment
KW - Action relation learning
KW - Figure skating sport videos
KW - Spatial–temporal pose feature extraction
UR - http://www.scopus.com/inward/record.url?scp=85144358823&partnerID=8YFLogxK
U2 - 10.1016/j.jvcir.2022.103625
DO - 10.1016/j.jvcir.2022.103625
M3 - Journal article
AN - SCOPUS:85144358823
SN - 1047-3203
VL - 89
JO - Journal of Visual Communication and Image Representation
JF - Journal of Visual Communication and Image Representation
M1 - 103625
ER -