Automatic Dance Video Segmentation for Understanding Choreography (MOCO 2024)


Presenter profile: Ochanomizu University, Faculty of Co-creation Engineering, Department of Cultural Information Engineering / Graduate School of Humanities and Sciences, Co-creation Engineering Program, Expression Engineering Laboratory (Tsuchida Lab)


Text of each slide
1.

MOCO '24
Automatic Dance Video Segmentation for Understanding Choreography
Koki Endo*†1, Shuhei Tsuchida*†2, Tsukasa Fukusato†3, Takeo Igarashi†1
†1 The University of Tokyo, †2 Ochanomizu University, †3 Waseda University (* = authors contributed equally)

2.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

3.

Background / Learning Dance from Videos
• Learning dance becomes easier if the choreography is divided into short movements.
(demo video: gLH_sFM_c01_d16_mLH0_ch01.mp4)

4.

Background / Learning Dance from Videos
• Typical dance videos lack segmentation
• Learners need to find the appropriate segmentation points themselves
• Difficult for beginners
• Tedious for experienced dancers
➢ We propose a method to automatically segment dance videos

5.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

6.

Related Work / Dance Motion Segmentation
Detecting Dance Motion Structure Using Motion Capture and Musical Information [Shiratori et al. 2004]
  Input: motion data of Japanese dance (Nihon-buyo) + musical information | Method: rule-based | Estimates: motion segmentation points
Dance Motion Segmentation Method based on Choreographic Primitives [Okada et al. 2015]
  Input: dance motion data; only beat positions used (prior knowledge) | Method: rule-based | Estimates: motion segmentation points
Proposed method
  Input: dance video | Method: neural network | Estimates: video segmentation points
Takaaki Shiratori, Atsushi Nakazawa and Katsushi Ikeuchi. Detecting Dance Motion Structure through Music Analysis. 2004.
Narumi Okada, Naoya Iwamoto, Tsukasa Fukusato and Shigeo Morishima. Dance Motion Segmentation Method based on Choreographic Primitives. 2015.

7.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

8.

Proposed Method
[Pipeline diagram] Dance video → bone vectors → fully connected NN → visual features $\boldsymbol{v}_t$; audio → Mel spectrogram → 2D CNN → auditory features $\boldsymbol{a}_t$

9.

Proposed Method / Visual Features
• Using AlphaPose [Fang et al. 2017] to detect keypoint positions
• Keypoints: 26 body points + 21 points per hand = 68 keypoints
• → Convert the keypoints to 67 bone vectors and normalize each to length 0.5
• → Pass the bone vectors through a fully connected neural network to obtain visual features $\boldsymbol{v}_t \in \mathbb{R}^{134}$
[Diagram: keypoint positions → bone vectors → fully connected NN → $\boldsymbol{v}_t$]
Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai and Cewu Lu. RMPE: Regional Multi-person Pose Estimation. 2017.
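As a rough illustration of this preprocessing, here is a minimal NumPy sketch; the `edges` skeleton below is a placeholder, since the slides do not spell out the AlphaPose bone connectivity:

```python
import numpy as np

def bone_vectors(keypoints, edges):
    """keypoints: (K, 2) 2D keypoint positions from the pose estimator.
    edges: (parent, child) index pairs defining the bones.
    Each bone is rescaled to length 0.5 (per the slide) and flattened."""
    bones = []
    for parent, child in edges:
        b = keypoints[child] - keypoints[parent]
        n = np.linalg.norm(b)
        if n > 0:
            b = b / n * 0.5  # normalize bone length to 0.5
        bones.append(b)
    return np.concatenate(bones)

# With the full 68-keypoint AlphaPose skeleton (67 bones) this gives v_t in R^134.
# The chain below is only a placeholder; the real topology comes from AlphaPose.
kp = np.random.rand(68, 2)
edges = [(i, i + 1) for i in range(67)]
v_t = bone_vectors(kp, edges)
assert v_t.shape == (134,)
```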

10.

Proposed Method / Auditory Features
• Convert the music in the video to a Mel spectrogram using a short-time Fourier transform (STFT)
• The Mel spectrogram is a 2D array representing the magnitude of each frequency component at each point in time
[Diagram: audio data → STFT → Mel spectrogram $S$]
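A sketch of this step using librosa; all parameter values here are assumptions rather than the paper's settings:

```python
import librosa

# Illustrative parameters only; the slides do not give the STFT settings.
# Loading audio from an .mp4 requires ffmpeg under the hood.
y, sr = librosa.load("gLH_sFM_c01_d16_mLH0_ch01.mp4", sr=22050)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=80)
S_db = librosa.power_to_db(S)  # (n_mels, n_frames) log-magnitude Mel spectrogram
```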

11.

Proposed Method / Auditory Features
• Compress the Mel spectrogram so that its number of samples matches the number of video frames
• For each video frame $t$, find the nearest Mel spectrogram index $i$
• Extract the 5-sample segment $S[:, i-2], \ldots, S[:, i+2]$ centered at $i$, and apply a 2D CNN to obtain auditory features $\boldsymbol{a}_t \in \mathbb{R}^{16}$
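A small sketch of the alignment and window extraction; the nearest-index formula and the edge padding are guesses at details the slides leave open:

```python
import numpy as np

def audio_window(S, t, fps=60.0, sr=22050, hop_length=512, width=2):
    """Return the 5-column Mel-spectrogram slice centred on video frame t.
    S: (n_mels, n_frames) Mel spectrogram. fps/sr/hop_length are assumed
    properties of the video and STFT, not values from the paper."""
    i = int(round(t / fps * sr / hop_length))        # nearest spectrogram index
    Sp = np.pad(S, ((0, 0), (width, width)), mode="edge")  # pad so edges stay valid
    return Sp[:, i : i + 2 * width + 1]              # shape (n_mels, 5) -> 2D CNN
```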

12.

Proposed Method
[Pipeline diagram] Dance video → bone vectors → fully connected NN → visual features $\boldsymbol{v}_t$; Mel spectrogram → 2D CNN → auditory features $\boldsymbol{a}_t$; both → Temporal Convolutional Network (TCN) → segmentation possibility $p_t$

13.

Proposed Method / TCN
• Temporal Convolutional Network (TCN) [Bai et al. 2018]
• 1D convolutions on time-series data
• The convolution dilation (the spacing between sampled inputs) increases with deeper layers
[Diagram: stacked dilated 1D convolutions from input (bottom) to output (top) along the time axis]
Shaojie Bai, J. Zico Kolter and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. 2018.
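A minimal PyTorch sketch of such a dilated convolution stack; channel sizes, depth, and kernel width are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    """A minimal stack of dilated 1D convolutions: the dilation doubles at
    every layer, so the receptive field grows exponentially with depth."""
    def __init__(self, in_ch=150, hidden=64, layers=4, kernel=3):
        super().__init__()
        blocks, ch = [], in_ch
        for l in range(layers):
            d = 2 ** l  # dilation: 1, 2, 4, 8, ...
            blocks += [nn.Conv1d(ch, hidden, kernel, dilation=d,
                                 padding=d * (kernel - 1) // 2),
                       nn.ReLU()]
            ch = hidden
        self.net = nn.Sequential(*blocks)

    def forward(self, x):    # x: (batch, channels, T)
        return self.net(x)   # (batch, hidden, T), sequence length preserved
```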

14.

Proposed Method / TCN
Input: the per-frame features stacked over time, $[\boldsymbol{v}_0; \boldsymbol{a}_0], \ldots, [\boldsymbol{v}_{T-1}; \boldsymbol{a}_{T-1}]$, forming a matrix in $\mathbb{R}^{150 \times T}$ (134 visual + 16 auditory dimensions per frame), which is fed to the TCN.

15.

Proposed Method / TCN
The $\mathbb{R}^{150 \times T}$ feature matrix is fed to the TCN; a fully connected layer then maps the TCN output at each frame to the segmentation possibilities $p_0, p_1, \ldots, p_{T-1}$.
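Continuing the sketch above, a hypothetical per-frame head that turns the TCN output into segmentation possibilities:

```python
import torch
import torch.nn as nn

class SegmentationNet(nn.Module):
    """TCN over the (150, T) feature sequence, followed by a per-frame
    fully connected layer producing segmentation possibilities p_t in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.tcn = TinyTCN(in_ch=150, hidden=64)     # TinyTCN from the sketch above
        self.head = nn.Conv1d(64, 1, kernel_size=1)  # 1x1 conv == shared FC per frame

    def forward(self, x):                            # x: (batch, 150, T)
        h = self.tcn(x)                              # (batch, 64, T)
        return torch.sigmoid(self.head(h)).squeeze(1)  # (batch, T) of p_t

# Shape check with random features for a T=300-frame video:
x = torch.randn(1, 150, 300)   # 134 visual + 16 auditory dims per frame
p = SegmentationNet()(x)       # p.shape == (1, 300)
```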

16.

Proposed Method
[Pipeline diagram] Dance video → visual features $\boldsymbol{v}_t$ and auditory features $\boldsymbol{a}_t$ → TCN → segmentation possibility $p_t$ → peak detection → segmentation points $t_0, t_1, \ldots$

17.

Proposed Method / Peak Detection
• A frame is detected as a segmentation point when:
1. its segmentation possibility exceeds a certain threshold, and
2. its segmentation possibility is a local maximum.
[Plot: segmentation possibility $p_t$ over time, with detected peaks marked]
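Both conditions map directly onto SciPy's peak finder; the threshold value below is an assumption:

```python
import numpy as np
from scipy.signal import find_peaks

def segmentation_points(p, threshold=0.5):
    """p: (T,) array of segmentation possibilities p_t.
    find_peaks returns local maxima (condition 2); the height argument
    enforces the threshold (condition 1). 0.5 is an assumed value."""
    peaks, _ = find_peaks(p, height=threshold)
    return peaks  # frame indices t0, t1, ...

# Usage with the network output from the earlier sketch:
# points = segmentation_points(p.detach().numpy()[0])
```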

18.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

19.

Dataset
• Manual segmentation of videos from the AIST Dance Video Database [Tsuchida et al. 2019]
• 1200 basic dance videos + 210 freestyle dance videos = 1410 videos, approximately 10.7 hours
(Audio off for explanation)
An example of basic dance: gLH_sBM_c01_d16_mLH0_ch01.mp4
An example of advanced dance: gLH_sFM_c01_d16_mLH0_ch01.mp4
Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki and Masataka Goto. AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing. 2019.

20.

Dataset / Annotation Tool
Annotation using the training-data creation tool (audio off)
(demo video: segToolDemo.mov)

21.

Dataset / Annotation Tool
• Annotation workers:
Worker | Years of dance experience | Videos annotated
First author | 11 | 1410
20 experienced dancers | 5-19 | 141 each
• For each video, three segmentation annotations were collected (the first author plus two of the experienced dancers).

22.

Dataset / Creating Ground Truth Labels
• The segmentation points a worker intended might be a few frames off from the positions actually annotated.
➢ Represent each worker's segmentation result as a sum of Gaussian distributions centered on the annotated positions $t_0, t_1, \ldots$ (a sketch follows after the next slide).
[Plot: Gaussians centered at $t_0$, $t_1$, $t_2$]

23.

Dataset / Creating Ground Truth Labels
• The segmentation curves of the three workers are averaged to create the ground truth labels.
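A compact sketch of both label-creation steps; the Gaussian width and the clipping of overlapping Gaussians are assumptions:

```python
import numpy as np

def soft_labels(annotations, T, sigma=5.0):
    """annotations: one list of segmentation frame indices per worker.
    Each worker's points become a sum of Gaussians over the T frames
    (sigma, in frames, is an assumption), and the workers' curves are
    averaged to form the ground truth label."""
    t = np.arange(T)
    curves = []
    for points in annotations:
        g = np.zeros(T)
        for t0 in points:
            g += np.exp(-((t - t0) ** 2) / (2 * sigma ** 2))
        curves.append(np.clip(g, 0.0, 1.0))  # cap overlapping Gaussians at 1 (assumption)
    return np.mean(curves, axis=0)

# Usage: three workers annotated roughly the same three segmentation points.
labels = soft_labels([[40, 160, 250], [42, 158, 252], [39, 161, 249]], T=300)
```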

24.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

25.

Evaluation Experiment
• Experiment overview:
1. Split the dataset into training, validation, and test sets in a 3:1:1 ratio
2. Train on the training set
3. Stop training if the validation loss does not improve for 10 epochs
4. Evaluate performance on the test set
• The predicted segmentation points achieved an F-score of 0.797 ± 0.013
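The early-stopping rule in step 3 can be sketched as a small framework-agnostic helper (not the authors' code):

```python
class EarlyStopping:
    """Implements step 3 above: stop when the validation loss has not
    improved for `patience` consecutive epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

# Usage inside a training loop:
# stopper = EarlyStopping(patience=10)
# if stopper.step(val_loss): break
```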

27.

Evaluation Results / Incorrect Predictions
• Ballet jazz, Chaines (JB_chaines.mov): wrongly predicted split positions shown against the correct split positions
• Ballet jazz, Paddbre (JB_paddbre.mov): wrongly predicted split positions
(played at 0.75x speed, audio off)

28.

Evaluation Results / Feature Comparison
• Comparison of the proposed method (V+A) against models using only one feature type (V for visual only, A for auditory only)
• V vs. V+A: $p = 5.08 \times 10^{-3} < 0.05$; A vs. V+A: $p = 7.40 \times 10^{-2} > 0.05$
• t-test at a significance level of 5%: V < V+A
• No significant difference between A and V+A
• Possible causes: dataset bias, or the dimensionality of the visual features being too high
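For illustration, such a comparison could be run with SciPy; the per-run F-scores below are made-up placeholders, and the exact t-test variant used in the paper is an assumption:

```python
import numpy as np
from scipy.stats import ttest_ind

f_va = np.array([0.79, 0.81, 0.80, 0.79, 0.80])  # V+A runs (hypothetical values)
f_v  = np.array([0.75, 0.77, 0.74, 0.76, 0.75])  # V-only runs (hypothetical values)
t_stat, p_value = ttest_ind(f_va, f_v)
print(p_value < 0.05)  # significant at the 5% level?
```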

29.

Evaluation Results / Feature Comparison
• Segmentation results of two videos (a) and (b) that have different choreography but the same music.
[Figure: predicted segmentation for videos (a) and (b)]

30.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

31.

Application / Dance Learning Support
(demo video: appDemo.mov; audio off for explanation)

32.

Application / User Test
• Users learned choreography using the application and provided feedback.
• Participants: 4 (2-4 years of dance experience).
• Usability and usefulness were rated on a 5-point Likert scale:
➢ All participants rated 4 (good) or 5 (very good).
[Photos: participants, the app, and the experiment setup (recreated by the author)]

33.

Application / User Test Feedback
• Positive feedback:
• Loop playback made repeated practice easy.
• It was convenient that manual segmentation was not required.
• The automatic segmentation positions matched my intuition.
• Improvement suggestions:
• Adjustable break times between loop playbacks.
• The ability to manually specify segmentation points.

34.

Automatic Dance Video Segmentation for Understanding Choreography: Overview
1. Background 2. Related Work 3. Proposed Method 4. Dataset 5. Evaluation Experiment 6. Application 7. Future Work

35.

Future Work
• Enhancing the dataset:
• Videos from various genres, especially jazz and ballet.
• Increasing the number of annotators.
• Adapting to non-static camera videos:
• Handling camera movements and camera switches.
• Improving the application:
• Detecting repetitions.
• Semi-automatic adjustment of segmentation points based on users' dance experience and preferences.

36.

Summary
• Proposed an automatic segmentation method for dance videos
• Uses a Temporal Convolutional Network (TCN)
• A general-purpose method that does not require genre-specific knowledge
• Created a dataset of 1410 dance videos from the AIST Dance Video Database by manually annotating the segmentation positions
• First author + 20 experienced dancers
• Evaluated the proposed method on the dataset; confirmed its effectiveness on many street dance videos
• Confirmed the validity of the visual and auditory features
• Proposed an application that supports dance learning using automatic segmentation
• Segmented sections can be played back in a loop for repeated practice
• Validated through a user test

37.

Acknowledgements
This work was supported by:
• JST CREST Grant Number JPMJCR17A1, Japan
• JSPS Grant-in-Aid 23K17022, Japan
We would also like to thank all the participants who took part in our experiments.

38.

Thank you
AIST Dance Video Database (AIST Dance DB) is a shared database containing original street dance videos (1,410 dances) with copyright-cleared dance music (60 songs).
Contact: Shuhei Tsuchida [email protected]