Academic Journal

Audio-visual multi-modality driven hybrid feature learning model for crowd analysis and classification

Bibliographic Details
Title: Audio-visual multi-modality driven hybrid feature learning model for crowd analysis and classification
Authors: H. Y. Swathi, G. Shivakumar
Source: Mathematical Biosciences and Engineering, Vol 20, Iss 7, Pp 12529-12561 (2023)
Publication Information: AIMS Press, 2023.
Publication Year: 2023
Collection: LCC:Biotechnology; LCC:Mathematics
Subject Terms: multi-modal crowd analysis, deep-spatio-temporal features, acoustic features, ensemble learning, audio-visual crowd classification, Biotechnology, TP248.13-248.65, Mathematics, QA1-939
Description: The rapid emergence of advanced software systems, low-cost hardware and decentralized cloud computing technologies has broadened the horizon for vision-based surveillance, monitoring and control. However, complex and inferior feature learning over visual artefacts or video streams, especially under extreme conditions, confines the majority of existing vision-based crowd analysis and classification systems. Retrieving event-sensitive or crowd-type-sensitive spatio-temporal features for different crowd types under extreme conditions is a highly complex task; consequently, it results in lower accuracy and hence low reliability, which limits existing methods for real-time crowd analysis. Despite numerous efforts in vision-based approaches, the lack of acoustic cues often creates ambiguity in crowd classification. On the other hand, the strategic amalgamation of audio-visual features can enable accurate and reliable crowd analysis and classification. Motivated by this, in this research a novel audio-visual multi-modality driven hybrid feature learning model is developed for crowd analysis and classification. In this work, a hybrid feature extraction model was applied to extract deep spatio-temporal features by using the Gray-Level Co-occurrence Matrix (GLCM) and the AlexNet transfer learning model. After extracting the GLCM features and AlexNet deep features, horizontal concatenation was performed to fuse the two feature sets. Similarly, for acoustic feature extraction, the audio samples (from the input video) were processed with static (fixed-size) sampling, pre-emphasis, block framing and Hann windowing, followed by extraction of acoustic features such as GTCC, GTCC-Delta, GTCC-Delta-Delta, MFCC, spectral entropy, spectral flux, spectral slope and the Harmonics-to-Noise Ratio (HNR). Finally, the extracted audio-visual features were fused to yield a composite multi-modal feature set, which was processed for classification using a random forest ensemble classifier. The multi-class classification yields a crowd-classification accuracy of 98.26%, precision of 98.89%, sensitivity of 94.82%, specificity of 95.57% and an F-measure of 98.84%. The robustness of the proposed multi-modality-based crowd analysis model confirms its suitability for real-world crowd detection and classification tasks.
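The abstract reduces to a few concrete stages: audio pre-emphasis and Hann-windowed framing, acoustic feature extraction, GLCM texture features, horizontal concatenation of the visual and acoustic vectors, and random forest classification. The Python sketch below illustrates those stages under stated assumptions; the helper names, window sizes and hyperparameters are illustrative guesses rather than the authors' implementation, and the AlexNet embedding and GTCC/HNR extractors are left as placeholders.

```python
# Minimal sketch of the fusion pipeline named in the abstract. All
# parameters (n_fft, hop_length, n_estimators, etc.) are assumptions.
import numpy as np
import librosa
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def preemphasize(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before framing
    return np.append(x[0], x[1:] - alpha * x[:-1])

def acoustic_features(audio, sr):
    # MFCCs on pre-emphasized, Hann-windowed frames; GTCC, spectral
    # entropy/flux/slope and HNR would be concatenated the same way.
    y = preemphasize(audio)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=1024, hop_length=512, window="hann")
    return mfcc.mean(axis=1)  # clip-level statistic over frames

def glcm_features(gray_frame):
    # Texture descriptors from a Gray-Level Co-occurrence Matrix;
    # gray_frame is a 2-D uint8 array (one video frame in grayscale).
    glcm = graycomatrix(gray_frame, distances=[1],
                        angles=[0, np.pi / 2], levels=256,
                        symmetric=True, normed=True)
    props = ("contrast", "correlation", "energy", "homogeneity")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

def fuse(deep_vec, glcm_vec, audio_vec):
    # Horizontal concatenation of deep, texture and acoustic features
    return np.hstack([deep_vec, glcm_vec, audio_vec])

# With X (n_clips x n_features) built from fused vectors and y the
# crowd-type labels, the ensemble step is a standard random forest:
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
```

The fused vector feeds a single random forest, so no per-modality classifier or late fusion is assumed; that matches the abstract's description of one composite multi-modal feature set processed by one ensemble classifier.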
Document Type: article
File Description: electronic resource
Language: English
ISSN: 1551-0018
Relation: https://doaj.org/toc/1551-0018
DOI: 10.3934/mbe.2023558
Open Access: https://doaj.org/article/2f2ac3156bf24037af26fa07d2a91487
Accession Number: edsdoj.2f2ac3156bf24037af26fa07d2a91487
Database: Directory of Open Access Journals