AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
Crowd counting is an essential computer vision application in which a convolutional neural network models crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. As we know, visual and audio are used...
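Main authors: | Ruihan Hu, Qinglong Mo, Yuanfei Xie, Yongqian Xu, Jiaqi Chen, Yalun Yang, Hongjian Zhou, Zhi-Ri Tang, Edmond Q. Wu |
---|---|
Format: | article |
Language: | EN |
Published: | IEEE 2021 |
Subjects: | Multi-scale architecture; audio-visual model; cascade fusion; crowd counting; Electrical engineering. Electronics. Nuclear engineering; TK1-9971 |
Online access: | https://doaj.org/article/3ffb5e5cfb494ba3b2324d68e2d72155 |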
id |
oai:doaj.org-article:3ffb5e5cfb494ba3b2324d68e2d72155 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:3ffb5e5cfb494ba3b2324d68e2d72155; 2021-11-19T00:07:05Z; AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions; 2169-3536; 10.1109/ACCESS.2021.3074797; https://doaj.org/article/3ffb5e5cfb494ba3b2324d68e2d72155; 2021-01-01T00:00:00Z; https://ieeexplore.ieee.org/document/9416332/; https://doaj.org/toc/2169-3536; Ruihan Hu; Qinglong Mo; Yuanfei Xie; Yongqian Xu; Jiaqi Chen; Yalun Yang; Hongjian Zhou; Zhi-Ri Tang; Edmond Q. Wu; IEEE; article; Multi-scale architecture; audio-visual model; cascade fusion; crowd counting; Electrical engineering. Electronics. Nuclear engineering; TK1-9971; EN; IEEE Access, Vol 9, Pp 80500-80510 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Multi-scale architecture; audio-visual model; cascade fusion; crowd counting; Electrical engineering. Electronics. Nuclear engineering; TK1-9971 |
spellingShingle |
Multi-scale architecture; audio-visual model; cascade fusion; crowd counting; Electrical engineering. Electronics. Nuclear engineering; TK1-9971; Ruihan Hu; Qinglong Mo; Yuanfei Xie; Yongqian Xu; Jiaqi Chen; Yalun Yang; Hongjian Zhou; Zhi-Ri Tang; Edmond Q. Wu; AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions |
description |
Crowd counting is an essential computer vision application in which a convolutional neural network models crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. Visual and audio signals are both widely used media through which humans perceive physical changes in the world, and cross-modal information offers an alternative way to solve the crowd counting task. To this end, this paper proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources for crowd counting. AVMSN consists of a Feature extraction module and a Multi-modal fusion module. To handle objects of various sizes in the crowd scene, the Feature extraction module adopts Sample Convolutional Blocks as its multi-scale Vision-end branch, which computes the weighted-visual feature. In addition, the audio signal is transformed from the temporal domain into a spectrogram, and the audio feature is learned by the audio-VGG network. Finally, the Multi-modal fusion module fuses the weighted-visual and audio features using a cascade fusion architecture to compute the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions. |
format |
article |
author |
Ruihan Hu; Qinglong Mo; Yuanfei Xie; Yongqian Xu; Jiaqi Chen; Yalun Yang; Hongjian Zhou; Zhi-Ri Tang; Edmond Q. Wu |
author_facet |
Ruihan Hu; Qinglong Mo; Yuanfei Xie; Yongqian Xu; Jiaqi Chen; Yalun Yang; Hongjian Zhou; Zhi-Ri Tang; Edmond Q. Wu |
author_sort |
Ruihan Hu |
title |
AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions |
title_sort |
avmsn: an audio-visual two stream crowd counting framework under low-quality conditions |
publisher |
IEEE |
publishDate |
2021 |
url |
https://doaj.org/article/3ffb5e5cfb494ba3b2324d68e2d72155 |
work_keys_str_mv |
AT ruihanhu avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT qinglongmo avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT yuanfeixie avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT yongqianxu avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT jiaqichen avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT yalunyang avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT hongjianzhou avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT zhiritang avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions AT edmondqwu avmsnanaudiovisualtwostreamcrowdcountingframeworkunderlowqualityconditions |
_version_ |
1718420604890120192 |
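
The description in this record reports results as mean absolute error over estimated density maps. As a minimal sketch, assuming the common convention in density-map crowd counting that the predicted count is the sum of the density map (the record itself does not spell this out), the metric could be computed as follows. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def count_from_density(density_map: np.ndarray) -> float:
    """Assumed convention: the estimated head count is the sum of the density map."""
    return float(density_map.sum())


def mean_absolute_error(pred_maps, true_counts) -> float:
    """Average absolute difference between predicted and ground-truth counts."""
    preds = np.array([count_from_density(m) for m in pred_maps])
    truth = np.asarray(true_counts, dtype=float)
    return float(np.mean(np.abs(preds - truth)))


# Toy data (hypothetical): two predicted density maps and their true counts.
maps = [np.full((4, 4), 0.5), np.full((2, 2), 1.0)]  # sums: 8.0 and 4.0
counts = [10, 3]
mae = mean_absolute_error(maps, counts)  # (|8 - 10| + |4 - 3|) / 2
print(mae)
```

A lower value means the predicted counts are closer to the annotated counts on average, which is the sense in which the description compares AVMSN with other models.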