Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition
We present a multi-modal genre recognition framework that considers the modalities audio, text, and image by features extracted from audio signals, album cover images, and lyrics of music tracks. In contrast to pure learning of features by a neural network as done in the related work, handcrafted fe...
Main authors: | Ben Wilkes, Igor Vatolkin, Heinrich Müller |
---|---|
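Format: | article |
Language: | EN |
Published: | MDPI AG, 2021 |
Subjects: | music genre recognition; multi-modal classification; feature evaluation; audio signal features; album cover images; lyrics |
Online access: | https://doaj.org/article/260d78d3e8cc474fbad690f2379f312d |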
id |
oai:doaj.org-article:260d78d3e8cc474fbad690f2379f312d |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:260d78d3e8cc474fbad690f2379f312d; 2021-11-25T17:30:15Z; Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition; DOI 10.3390/e23111502; ISSN 1099-4300; https://doaj.org/article/260d78d3e8cc474fbad690f2379f312d; 2021-11-01T00:00:00Z; https://www.mdpi.com/1099-4300/23/11/1502; https://doaj.org/toc/1099-4300; Ben Wilkes; Igor Vatolkin; Heinrich Müller; MDPI AG; article; music genre recognition; multi-modal classification; feature evaluation; audio signal features; album cover images; lyrics; Science: Q; Astrophysics: QB460-466; Physics: QC1-999; EN; Entropy, Vol 23, Iss 11, p 1502 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
music genre recognition; multi-modal classification; feature evaluation; audio signal features; album cover images; lyrics; Science: Q; Astrophysics: QB460-466; Physics: QC1-999 |
spellingShingle |
music genre recognition; multi-modal classification; feature evaluation; audio signal features; album cover images; lyrics; Science: Q; Astrophysics: QB460-466; Physics: QC1-999; Ben Wilkes; Igor Vatolkin; Heinrich Müller; Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition |
description |
We present a multi-modal genre recognition framework that considers the modalities audio, text, and image by features extracted from audio signals, album cover images, and lyrics of music tracks. In contrast to pure learning of features by a neural network as done in the related work, handcrafted features designed for a respective modality are also integrated, allowing for higher interpretability of created models and further theoretical analysis of the impact of individual features on genre prediction. Genre recognition is performed by binary classification of a music track with respect to each genre based on combinations of elementary features. For feature combination a two-level technique is used, which combines aggregation into fixed-length feature vectors with confidence-based fusion of classification results. Extensive experiments have been conducted for three classifier models (Naïve Bayes, Support Vector Machine, and Random Forest) and numerous feature combinations. The results are presented visually, with data reduction for improved perceptibility achieved by multi-objective analysis and restriction to non-dominated data. Feature- and classifier-related hypotheses are formulated based on the data, and their statistical significance is formally analyzed. The statistical analysis shows that the combination of two modalities almost always leads to a significant increase of performance and the combination of three modalities in several cases. |
format |
article |
author |
Ben Wilkes Igor Vatolkin Heinrich Müller |
title |
Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition |
publisher |
MDPI AG |
publishDate |
2021 |
url |
https://doaj.org/article/260d78d3e8cc474fbad690f2379f312d |
_version_ |
1718412273127522304 |
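The description above mentions "restriction to non-dominated data" as the data-reduction step before visualization. The record does not say which objectives the authors compared, so the following is only a minimal sketch of non-dominated (Pareto) filtering under stated assumptions: each experiment is summarized by two objectives that are both to be minimized, for example a classification error and a relative feature-set cost. The names `dominates`, `non_dominated`, and `results`, and all numbers, are hypothetical and not taken from the article.

```python
from typing import List, Tuple

# One experiment result: (objective_1, objective_2), both minimized.
# Assumption: e.g. (classification error, relative feature-set cost).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b in every objective
    and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates.
    Input order is preserved; exact duplicates are all kept."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical results for five feature combinations (illustrative only).
results = [(0.20, 0.9), (0.25, 0.4), (0.30, 0.5), (0.22, 0.9), (0.35, 0.1)]
front = non_dominated(results)
print(front)
```

Here `(0.30, 0.5)` is dropped because `(0.25, 0.4)` is better in both objectives, and `(0.22, 0.9)` is dropped because `(0.20, 0.9)` has a lower first objective at equal cost. The quadratic pairwise comparison is adequate for the small result sets typical of such plots; larger sets would usually call for a sort-based approach.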