AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions

Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. As we know, visual and audio are used...

Bibliographic Details
Main Authors: Ruihan Hu, Qinglong Mo, Yuanfei Xie, Yongqian Xu, Jiaqi Chen, Yalun Yang, Hongjian Zhou, Zhi-Ri Tang, Edmond Q. Wu
Format: article
Language: EN
Published: IEEE 2021
Subjects:
Online Access: https://doaj.org/article/3ffb5e5cfb494ba3b2324d68e2d72155
id oai:doaj.org-article:3ffb5e5cfb494ba3b2324d68e2d72155
record_format dspace
spelling oai:doaj.org-article:3ffb5e5cfb494ba3b2324d68e2d72155
2021-11-19T00:07:05Z
AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
2169-3536
10.1109/ACCESS.2021.3074797
https://doaj.org/article/3ffb5e5cfb494ba3b2324d68e2d72155
2021-01-01T00:00:00Z
https://ieeexplore.ieee.org/document/9416332/
https://doaj.org/toc/2169-3536
Ruihan Hu; Qinglong Mo; Yuanfei Xie; Yongqian Xu; Jiaqi Chen; Yalun Yang; Hongjian Zhou; Zhi-Ri Tang; Edmond Q. Wu
IEEE
article
Multi-scale architecture; audio-visual model; cascade fusion; crowd counting
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
EN
IEEE Access, Vol 9, Pp 80500-80510 (2021)
institution DOAJ
collection DOAJ
language EN
topic Multi-scale architecture
audio-visual model
cascade fusion
crowd counting
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
description Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. Visual and audio signals are both widely used media through which people perceive physical changes in the world, and cross-modal information offers an alternative approach to the crowd counting task. To this end, this paper proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources for crowd counting. AVMSN is built on a Feature extraction module and a Multi-modal fusion module. To handle objects of various sizes in the crowd scene, the Feature extraction module adopts Sample Convolutional Blocks as its multi-scale Vision-end branch to compute the weighted-visual feature. The audio, a temporal-domain signal, is transformed into a spectrogram, and the audio feature is learned by the audio-VGG network. Finally, the weighted-visual and audio features are fused by the Multi-modal fusion module, which adopts a cascade fusion architecture to compute the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
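The description reports results as mean absolute error over counts derived from an estimated density map. As a minimal sketch of how that metric is usually computed in density-map-based crowd counting (not the authors' code), the snippet below assumes the predicted count is the sum of all density-map values. The function names `estimated_count` and `mean_absolute_error` and the toy maps are hypothetical, introduced only for illustration.

```python
from typing import List, Sequence

# A density map is a 2-D grid of non-negative values (hypothetical toy type).
DensityMap = List[List[float]]


def estimated_count(density_map: DensityMap) -> float:
    """Sum every cell of the density map to obtain the estimated crowd count."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and ground-truth counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy data (assumed, not from the paper): two tiny 2x2 density maps.
maps = [
    [[0.5, 0.5], [1.0, 0.0]],      # cells sum to 2.0
    [[0.25, 0.25], [0.25, 0.25]],  # cells sum to 1.0
]
ground_truth = [3.0, 1.0]

predicted = [estimated_count(m) for m in maps]
mae = mean_absolute_error(predicted, ground_truth)
print(predicted, mae)
```

A lower value of `mae` means the predicted counts are closer to the annotated counts on average, which is the sense in which the description compares AVMSN with other models.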
format article
author Ruihan Hu
Qinglong Mo
Yuanfei Xie
Yongqian Xu
Jiaqi Chen
Yalun Yang
Hongjian Zhou
Zhi-Ri Tang
Edmond Q. Wu
title AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
publisher IEEE
publishDate 2021
url https://doaj.org/article/3ffb5e5cfb494ba3b2324d68e2d72155