Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping
In violence recognition, accuracy is reduced by time-axis misalignment and semantic deviation between the visual and auditory streams of multimedia. This paper therefore proposes an auditory-visual information fusion method based on autoencoder mapping. First, a feature extraction model based on the CNN-LSTM framework is established, and multimedia segments are used as whole inputs to solve the time-axis misalignment of visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which resolves audiovisual semantic deviation and fuses visual and auditory information at the segment-feature level. Finally, the whole network is used to identify violence. Experimental results show that the method makes good use of the complementarity between modalities; compared with single-modality information, the multimodal method achieves better results.
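The fusion step summarized in the abstract can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's actual architecture: the dimensions, the simulated CNN-LSTM segment features, and the linear per-modality autoencoders are all made up. The sketch shows the general idea of mapping two modalities into a shared semantic subspace, aligning paired segments with a correspondence term, and concatenating the aligned embeddings for a downstream classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper)
d_vis, d_aud, d_shared = 128, 64, 32

# Segment-level features as a CNN-LSTM front end might produce (simulated here)
vis = rng.standard_normal((10, d_vis))   # 10 segments, visual features
aud = rng.standard_normal((10, d_aud))   # same 10 segments, auditory features

# One linear autoencoder per modality: encode into the shared semantic subspace
W_vis_enc = rng.standard_normal((d_vis, d_shared)) * 0.1
W_vis_dec = rng.standard_normal((d_shared, d_vis)) * 0.1
W_aud_enc = rng.standard_normal((d_aud, d_shared)) * 0.1
W_aud_dec = rng.standard_normal((d_shared, d_aud)) * 0.1

z_vis = vis @ W_vis_enc          # visual embeddings in the shared subspace
z_aud = aud @ W_aud_enc          # auditory embeddings in the shared subspace

# Reconstruction loss keeps each embedding faithful to its own modality
rec_loss = (np.mean((z_vis @ W_vis_dec - vis) ** 2)
            + np.mean((z_aud @ W_aud_dec - aud) ** 2))

# Semantic-correspondence loss pulls paired segments together in the subspace
corr_loss = np.mean(np.sum((z_vis - z_aud) ** 2, axis=1))

# Fused segment representation handed to the violence classifier
fused = np.concatenate([z_vis, z_aud], axis=1)
print(fused.shape)  # (10, 64)
```

In a real implementation, both losses would be minimized jointly by gradient descent over nonlinear encoders; the sketch only shows how the two objectives and the fused representation fit together.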
Saved in:

| Main Authors: | Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu |
|---|---|
| Format: | article |
| Language: | EN |
| Published: | MDPI AG, 2021 |
| Subjects: | violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM |
| Online Access: | https://doaj.org/article/4ba6c819a8f547abb7858b9008326632 |
id |
oai:doaj.org-article:4ba6c819a8f547abb7858b9008326632 |
record_format |
dspace |
spelling |
DOI: 10.3390/electronics10212654; ISSN: 2079-9292; Date: 2021-10-01; Full text: https://www.mdpi.com/2079-9292/10/21/2654; Journal TOC: https://doaj.org/toc/2079-9292; Citation: Electronics, Vol 10, Iss 2654, p 2654 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM; Electronics; TK7800-8360 |
format |
article |
author |
Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu |
author_sort |
Jiu Lou |
title |
Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping |
publisher |
MDPI AG |
publishDate |
2021 |
url |
https://doaj.org/article/4ba6c819a8f547abb7858b9008326632 |