Computer Science Signal Processing

Music and Audio Processing

Description

This cluster of papers focuses on the classification and analysis of audio signals, including music genre classification, environmental sound recognition, melody extraction, and acoustic scene classification. It explores techniques such as deep learning, convolutional neural networks, and feature extraction for music information retrieval.

Keywords

Audio Signal Classification; Music Information Retrieval; Deep Learning; Convolutional Neural Networks; Feature Extraction; Environmental Sound Recognition; Music Genre Classification; Melody Extraction; Acoustic Scene Classification; Audio Event Detection

From the Publisher: The MPEG standards are an evolving set of standards for video and audio compression. MPEG-7 technology covers the most recent developments in multimedia search and retrieval, designed to standardise the description of multimedia content supporting a wide range of applications including DVD, CD and HDTV. Multimedia content description, search and retrieval is a rapidly expanding research area due to the increasing amount of audiovisual (AV) data available. The wealth of practical applications available and currently under development (for example, large-scale multimedia search engines and AV broadcast servers) has led to the development of processing tools to create the description of AV material or to support the identification or retrieval of AV documents. Written by experts in the field, this book has been designed as a unique tutorial in the new MPEG-7 standard covering content creation, content distribution and content consumption. At present there are no books documenting the available technologies in such a comprehensive way. The book presents a comprehensive overview of the principles and concepts involved in the complete range of audiovisual material indexing, metadata description, information retrieval and browsing; details the major processing tools used for indexing and retrieval of images and video sequences; includes individual chapters, written by experts who have contributed to the development of MPEG-7, providing clear explanations of the underlying tools and technologies contributing to the standard; offers demonstration software with step-by-step guidance to the multimedia system components and the eXperimentation Model (XM) MPEG reference software; and coincides with the release of the ISO standard in late 2001. A valuable reference resource for practising electronic and communications engineers designing and implementing MPEG-7 compliant systems, as well as for researchers and students working with multimedia database technology.
We examine in some detail Mel Frequency Cepstral Coefficients (MFCCs), the dominant features used for speech recognition, and investigate their applicability to modeling music. In particular, we examine two of the main assumptions of the process of forming MFCCs: the use of the Mel frequency scale to model the spectra; and the use of the Discrete Cosine Transform (DCT) to decorrelate the Mel-spectral vectors. We examine the first assumption in the context of speech/music discrimination. Our results show that the use of the Mel scale for modeling music is at least not harmful for this problem, although further experimentation is needed to verify that this is the optimal scale in the general case. We investigate the second assumption by examining the basis vectors of the theoretically optimal transform to decorrelate music and speech spectral vectors. Our results demonstrate that the use of the DCT to decorrelate vectors is appropriate for both speech and music spectra. MFCCs for Music Analysis: Of all the human-generated sounds which influence our lives, speech and music are arguably the most prolific. Speech has received much focused attention, and decades of research in this community have led to usable systems and convergence of the features used for speech analysis. In the music community, however, although the field of synthesis is very mature, a dominant paradigm has yet to emerge to solve other problems such as music classification or transcription. Consequently, many representations for music have been proposed (e.g. Martin 1998; Scheirer 1997; Blum 1999). In this paper, we examine some of the assumptions of Mel Frequency Cepstral Coefficients (MFCCs), the dominant features used for speech recognition, and examine whether these assumptions are valid for modeling music. MFCCs have been used by other authors to model music and audio sounds (e.g. Blum 1999). These works, however, use cepstral features merely because they have been so successful for speech recognition, without examining the assumptions made in great detail. MFCCs (see e.g. Rabiner 1993) are short-term spectral features. They are calculated as follows (the steps and assumptions made are explained in more detail in the full paper): 1. Divide the signal into frames. 2. For each frame, obtain the amplitude spectrum. 3. Take the logarithm. 4. Convert to the Mel (a perceptually based) spectrum. 5. Take the discrete cosine transform (DCT). We seek to determine whether this process is suitable for creating features to model music. We examine only steps 4 and 5 since, as explained in the full paper, the other steps are less controversial. Step 4 calculates the log amplitude spectrum on the so-called Mel scale. This transformation emphasizes lower frequencies, which are perceptually more meaningful for speech. It is possible, however, that the Mel scale may not be optimal for music, as there may be more information in, say, higher frequencies. Step 5 takes the DCT of the Mel spectra. For speech, this approximates principal components analysis (PCA), which decorrelates the components of the feature vectors. We investigate whether this transform is valid for music spectra. Mel vs. Linear Spectral Modeling: To investigate the effect of using the Mel scale, we examine the performance of a simple speech/music discriminator.
We use around 3 hours of labeled data from a broadcast news show, divided into 2 hours of training data and 40 minutes of testing data. We convert the data to 'Mel' and 'Linear' cepstral features and train mixture of Gaussian classifiers for each class. We then classify each segment in the test data using these models. This process is described in more detail in the full paper. We find that for this speech/music classification problem, the results are (statistically) significantly better if Mel-based cepstral features rather than linear-based cepstral features are used. However, whether this is simply because the Mel scale models speech better or because it also models music better is not clear. At worst, we can conclude that using the Mel cepstrum to model music in this speech/music discrimination problem is not harmful. Further tests are needed to verify that the Mel cepstrum is appropriate for modeling music in the general case. Using the DCT to Approximate Principal Components Analysis: We additionally investigate the effectiveness of using the DCT to decorrelate Mel spectral features. The mathematically correct way to decorrelate components is to use PCA (or equivalently the KL transform). This transform uses the eigenvectors of the covariance matrix of the data to be modeled as basis vectors. By investigating how closely these vectors approximate cosine functions, we can get a feel for how well the DCT approximates PCA. By inspecting the eigenvectors for the Mel log spectra for around 3 hours of speech and 4 hours of music, we see that the DCT is an appropriate transform for decorrelating music (and speech) log spectra. Future Work: Future work should focus on a more thorough examination of the parameters used to generate MFCC features, such as the sampling rate of the signal, the frequency scaling (Mel or otherwise), and the number of bins to use when smoothing. Also worthy of investigation are the windowing size and frame rate. Suggested Readings: Blum, T., Keislar, D., Wheaton, J., and Wold, E., 1999. Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information. U.S. Patent 5,918,223. Martin, K., 1998. Toward automatic sound source recognition: identifying musical instruments. Proceedings of the NATO Computational Hearing Advanced Study Institute. Rabiner, L. and Juang, B., 1993. Fundamentals of Speech Recognition. Prentice-Hall. Scheirer, E. and Slaney, M., 1997. Construction and evaluation of a robust multifeature speech/music discriminator. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
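The five-step MFCC computation outlined in this entry maps directly onto standard signal-processing building blocks. The sketch below is a minimal illustration using librosa and SciPy, not the authors' original code; frame sizes, filterbank settings, and the input file are placeholder assumptions, and, as in most implementations (including librosa's own pipeline), the Mel filterbank is applied before the logarithm rather than after.

```python
# Minimal MFCC sketch mirroring the five steps above with librosa/SciPy.
# All settings are illustrative; 'clip.wav' is a placeholder for any mono audio file.
import numpy as np
import scipy.fftpack
import librosa

def mfcc_by_steps(y, sr, n_fft=2048, hop_length=512, n_mels=40, n_mfcc=13):
    # Steps 1-2: frame the signal and take the amplitude spectrum of each frame.
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    # Step 4 (applied before step 3 here, as is conventional): map the spectrum
    # onto the perceptually motivated Mel scale.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_fb @ S
    # Step 3: take the logarithm.
    log_mel = np.log(mel_spec + 1e-10)
    # Step 5: DCT along the frequency axis to (approximately) decorrelate the
    # Mel-spectral vectors; keep the first n_mfcc coefficients.
    return scipy.fftpack.dct(log_mel, axis=0, type=2, norm='ortho')[:n_mfcc]

y, sr = librosa.load('clip.wav', sr=16000)   # placeholder input
feats = mfcc_by_steps(y, sr)                 # shape: (n_mfcc, n_frames)
```

For everyday use, librosa.feature.mfcc performs an equivalent computation in a single call.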
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Dataset include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
Digital processing of speech signals and voice recognition algorithms are very important for fast and accurate automatic voice recognition technology. The voice is a signal carrying a great deal of information, and directly analyzing and synthesizing the complex voice signal is difficult because of the amount of information it contains. Therefore, digital signal processes such as feature extraction and feature matching are introduced to represent the voice signal. Several methods, such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN), are evaluated with a view to identifying a straightforward and effective method for voice signals. The extraction and matching process is implemented right after the pre-processing or filtering of the signal is performed. Mel Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modelling the human auditory perception system, are utilized as the feature extraction technique. The nonlinear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, is used as the feature matching technique. Since the voice signal tends to have a varying temporal rate, this alignment is important for better performance. This paper presents the viability of MFCCs to extract features and DTW to compare the test patterns.
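As an illustration of the MFCC-plus-DTW pipeline described above, the following sketch extracts MFCC sequences from two recordings and scores their similarity with dynamic time warping. File names and parameters are placeholders, not the authors' configuration.

```python
# Minimal MFCC + DTW matching sketch; 'reference_word.wav' and 'test_word.wav'
# are placeholder file names.
import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

ref = mfcc_features('reference_word.wav')    # enrolled template
test = mfcc_features('test_word.wav')        # utterance to compare

# DTW aligns the two sequences despite different speaking rates; the accumulated
# cost of the optimal warping path serves as a dissimilarity score.
D, wp = librosa.sequence.dtw(X=ref, Y=test, metric='euclidean')
score = D[-1, -1] / len(wp)                  # path-length-normalized alignment cost
print(f'DTW dissimilarity: {score:.3f}')
```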
Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs (approximately 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). In particular, we focus on more sophisticated units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments reveal that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units, and we find the GRU to be comparable to the LSTM.
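For readers who want to reproduce this kind of comparison at toy scale, the sketch below trains the same next-frame prediction model once with an LSTM and once with a GRU in PyTorch. The random piano-roll-shaped data, layer sizes, and single optimization step are illustrative assumptions only.

```python
# Toy LSTM-vs-GRU comparison on a next-step prediction task.
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    def __init__(self, cell='lstm', n_in=88, n_hidden=128):
        super().__init__()
        rnn_cls = nn.LSTM if cell == 'lstm' else nn.GRU
        self.rnn = rnn_cls(n_in, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_in)

    def forward(self, x):               # x: (batch, time, n_in), e.g. piano-roll frames
        h, _ = self.rnn(x)
        return self.out(h)              # predict the next frame at every time step

x = torch.rand(8, 100, 88)              # random stand-in for polyphonic music data
target = torch.roll(x, shifts=-1, dims=1)

for cell in ('lstm', 'gru'):
    model = SeqModel(cell)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()                     # one illustrative gradient step per unit type
    opt.step()
    print(cell, float(loss))
```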
Two experiments were performed to evaluate the perceptual relationships between 16 music instrument tones. The stimuli were computer synthesized based upon an analysis of actual instrument tones, and they were perceptually equalized for loudness, pitch, and duration. Experiment 1 evaluated the tones with respect to perceptual similarities, and the results were treated with multidimensional scaling techniques and hierarchic clustering analysis. A three-dimensional scaling solution, well matching the clustering analysis, was found to be interpretable in terms of (1) the spectral energy distribution; (2) the presence of synchronicity in the transients of the higher harmonics, along with the closely related amount of spectral fluctuation within the tone through time; and (3) the presence of low-amplitude, high-frequency energy in the initial attack segment; an alternate interpretation of the latter two dimensions viewed the cylindrical distribution of clusters of stimulus points about the spectral energy distribution, grouping on the basis of musical instrument family (with two exceptions). Experiment 2 was a learning task of a set of labels for the 16 tones. Confusions were examined in light of the similarity structure for the tones from experiment 1, and one of the family-grouping exceptions was found to be reflected in the difficulty of learning the labels.
Everyday listening is the experience of hearing events in the world rather than sounds per se. In this article, I take an ecological approach to everyday listening to overcome constraints on its study implied by more traditional approaches. In particular, I am concerned with developing a new framework for describing sound in terms of audible source attributes. An examination of the continuum of structured energy from event to audition suggests that sound conveys information about events at locations in an environment. Qualitative descriptions of the physics of sound-producing events, complemented by protocol studies, suggest a tripartite division of sound-producing events into those involving vibrating solids, gases, or liquids. Within each of these categories, basic-level events are defined by the simple interactions that can cause these materials to sound, whereas more complex events can be described in terms of temporal patterning, compound, or hybrid sources. The results of these investigations are used to create a map of sound-producing events and their attributes useful in guiding further exploration.
The visual environment contains massive amounts of information involving the relations between objects in space and time, and recent studies of visual statistical learning (VSL) have suggested that this information can be automatically extracted by the visual system. The experiments reported in this article explore the automaticity of VSL in several ways, using both explicit familiarity and implicit response-time measures. The results demonstrate that (a) the input to VSL is gated by selective attention, (b) VSL is nevertheless an implicit process because it operates during a cover task and without awareness of the underlying statistical patterns, and (c) VSL constructs abstracted representations that are then invariant to changes in extraneous surface features. These results fuel the conclusion that VSL both is and is not automatic: It requires attention to select the relevant population of stimuli, but the resulting learning then occurs without intent or awareness.
Automatic urban sound classification is a growing area of research with applications in multimedia retrieval and urban informatics. In this paper we identify two main barriers to research in this area - the lack of a common taxonomy and the scarceness of large, real-world, annotated data. To address these issues we present a taxonomy of urban sounds and a new dataset, UrbanSound, containing 27 hours of audio with 18.5 hours of annotated sound event occurrences across 10 sound classes. The challenges presented by the new dataset are studied through a series of experiments using a baseline classification system.
Many audio and multimedia applications would benefit from the ability to classify and search for audio based on its characteristics. The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features. This lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features, or by selecting or entering reference sounds and asking the engine to retrieve similar or dissimilar sounds.
The frequencies that have been chosen to make up the scale of Western music are geometrically spaced. Thus the discrete Fourier transform (DFT), although extremely efficient in the fast Fourier transform implementation, yields components which do not map efficiently to musical frequencies. This is because the frequency components calculated with the DFT are separated by a constant frequency difference and with a constant resolution. A calculation similar to a discrete Fourier transform but with a constant ratio of center frequency to resolution has been made; this is a constant Q transform and is equivalent to a 1/24-oct filter bank. Thus there are two frequency components for each musical note, so that two adjacent notes in the musical scale played simultaneously can be resolved anywhere in the musical frequency range. This transform has been plotted against log(frequency) to obtain a constant pattern in the frequency domain for sounds with harmonic frequency components. This is compared to the conventional DFT, which yields a constant spacing between frequency components. In addition to advantages for resolution, representation with a constant pattern has the advantage that note identification ("note identification" rather than the term "pitch tracking," which is widely used in the signal processing community, is used here since the editor has correctly pointed out that "pitch" should be reserved for a perceptual context), instrument recognition, and signal separation can be done elegantly by a straightforward pattern recognition algorithm.
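A constant-Q analysis with two bins per semitone (24 per octave), matching the 1/24-octave resolution described above, can be computed with librosa as sketched below; the input file and frequency range are placeholder assumptions.

```python
# Constant-Q transform sketch with 24 bins per octave; 'music_clip.wav' is a placeholder.
import numpy as np
import librosa

y, sr = librosa.load('music_clip.wav', sr=22050)
C = librosa.cqt(y, sr=sr,
                fmin=librosa.note_to_hz('C1'),   # ~32.7 Hz starting pitch
                n_bins=24 * 7,                   # seven octaves of coverage
                bins_per_octave=24)              # two bins per semitone
log_mag = librosa.amplitude_to_db(np.abs(C), ref=np.max)
# Because bin centers are geometrically spaced, a harmonic sound yields a fixed
# spectral pattern that simply shifts along the log-frequency axis as the pitch changes.
print(log_mag.shape)   # (n_bins, n_frames)
```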
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.
We present a methodology for analyzing polyphonic musical passages comprised of notes that exhibit a harmonically fixed spectral profile (such as piano notes). Taking advantage of this unique note structure, we can model the audio content of the musical passage by a linear basis transform and use non-negative matrix decomposition methods to estimate the spectral profile and the temporal information of every note. This approach results in a very simple and compact system that is not knowledge-based, but rather learns notes by observation.
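A minimal version of this idea, factoring a magnitude spectrogram into per-note spectral profiles and time activations, can be sketched with scikit-learn's NMF as below. The component count and input file are illustrative assumptions; the original system's exact decomposition procedure may differ.

```python
# NMF decomposition of a magnitude spectrogram into note profiles and activations.
# 'piano_passage.wav' and n_components=5 are placeholder assumptions.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load('piano_passage.wav', sr=22050)
S = np.abs(librosa.stft(y, n_fft=4096, hop_length=1024))   # magnitude spectrogram

model = NMF(n_components=5, init='nndsvd', max_iter=500)
activations = model.fit_transform(S.T)    # (n_frames, n_notes): when each note sounds
profiles = model.components_              # (n_notes, n_freq): learned spectral shape per note
print(activations.shape, profiles.shape)
```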
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.
Professional actors' portrayals of 14 emotions varying in intensity and valence were presented to judges. The results on decoding replicate earlier findings on the ability of judges to infer vocally expressed emotions with much-better-than-chance accuracy, including consistently found differences in the recognizability of different emotions. A total of 224 portrayals were subjected to digital acoustic analysis to obtain profiles of vocal parameters for different emotions. The data suggest that vocal parameters not only index the degree of intensity typical for different emotions but also differentiate valence or quality aspects. The data are also used to test theoretical predictions on vocal patterning based on the component process model of emotion (K.R. Scherer, 1986). Although most hypotheses are supported, some need to be revised on the basis of the empirical evidence. Discriminant analysis and jackknifing show remarkably high hit rates and patterns of confusion that closely mirror those found for listener-judges.
We report on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. We have examined 13 features intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. We provide extensive data on system performance and the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier classifies with 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of sound.
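The sketch below illustrates the general shape of such a discriminator (frame-level descriptors summarized per segment and fed to a classifier) on synthetic stand-in signals; the features and classifier are generic choices, not the paper's 13-feature set or classification frameworks.

```python
# Rough stand-in for a multi-feature speech/music discriminator; toy signals
# replace real labeled audio so the sketch runs end to end.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def segment_features(y, sr):
    zcr = librosa.feature.zero_crossing_rate(y)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)
    flux = librosa.onset.onset_strength(y=y, sr=sr)[None, :]   # rough spectral-flux proxy
    # Mean and variance of each frame-level descriptor over the 2.4-second segment.
    return np.array([f(v) for v in (zcr, cent, rms, flux) for f in (np.mean, np.var)])

sr = 16000
rng = np.random.default_rng(0)
t = np.linspace(0, 2.4, int(2.4 * sr), endpoint=False)
speechish = [rng.standard_normal(t.size) * np.abs(np.sin(7 * np.pi * t)) for _ in range(10)]
musicish = [np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 330 * t)
            + 0.01 * rng.standard_normal(t.size) for _ in range(10)]

X = np.array([segment_features(y, sr) for y in speechish + musicish])
labels = np.array([0] * 10 + [1] * 10)          # 0 = speech-like, 1 = music-like
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```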
Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members. These characteristics typically are related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into a hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content and pitch content are proposed. The performance and relative importance of the proposed features are investigated by training statistical pattern recognition classifiers using real-world audio collections. Both whole-file and real-time frame-based classification schemes are described. Using the proposed feature sets, a classification accuracy of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.
Automatic music recommendation has become an increasingly relevant problem in recent years, since a lot of music is now sold and consumed digitally. Most recommender systems rely on collaborative filtering. However, this approach suffers from the cold start problem: it fails when no usage data is available, so it is not effective for recommending new and unpopular songs. In this paper, we propose to use a latent factor model for recommendation, and predict the latent factors from music audio when they cannot be obtained from usage data. We compare a traditional approach using a bag-of-words representation of the audio signals with deep convolutional neural networks, and evaluate the predictions quantitatively and qualitatively on the Million Song Dataset. We show that using predicted latent factors produces sensible recommendations, despite the fact that there is a large semantic gap between the characteristics of a song that affect user preference and the corresponding audio signal. We also show that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming the traditional approach.
Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
MPEG-7, formally known as the Multimedia Content Description Interface, includes standardized tools (descriptors, description schemes, and language) enabling structural, detailed descriptions of audio-visual information at different granularity levels (region, image, video segment, collection) and in different areas (content description, management, organization, navigation, and user interaction). It aims to support and facilitate a wide range of applications, such as media portals, content broadcasting, and ubiquitous multimedia. We present a high-level overview of the MPEG-7 standard. We first discuss the scope, basic terminology, and potential applications. Next, we discuss the constituent components. Then, we examine its relationship with other standards to highlight its capabilities.
Perceptual systems routinely separate "content" from "style," classifying familiar words spoken in an unfamiliar accent, identifying a font or handwriting style across letters, or recognizing a familiar face or object seen under unfamiliar viewing conditions. Yet a general and tractable computational model of this ability to untangle the underlying factors of perceptual observations remains elusive (Hofstadter, 1985). Existing factor models (Mardia, Kent, & Bibby, 1979; Hinton & Zemel, 1994; Ghahramani, 1995; Bell & Sejnowski, 1995; Hinton, Dayan, Frey, & Neal, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Hinton & Ghahramani, 1997) are either insufficiently rich to capture the complex interactions of perceptually meaningful factors such as phoneme and speaker accent or letter and font, or do not allow efficient learning algorithms. We present a general framework for learning to solve two-factor tasks using bilinear models, which provide sufficiently expressive representations of factor interactions but can nonetheless be fit to data using efficient algorithms based on the singular value decomposition and expectation-maximization. We report promising results on three different tasks in three different perceptual domains: spoken vowel classification with a benchmark multi-speaker database, extrapolation of fonts to unseen letters, and translation of faces to novel illuminants.
We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces stroke detectors when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps.
This document describes version 0.4.0 of librosa: a Python package for audio and music signal processing. At a high level, librosa provides implementations of a variety of common functions used throughout the field of music information retrieval. In this document, a brief overview of the library's functionality is provided, along with explanations of the design goals, software development practices, and notational conventions.
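A few representative librosa calls are shown below; the file name is a placeholder, and the functions used (loading, beat tracking, MFCC and chroma extraction) are only a small slice of the library.

```python
# Typical librosa usage for common MIR tasks; 'song.wav' is a placeholder file.
import librosa

y, sr = librosa.load('song.wav', sr=22050, mono=True)   # decode and resample
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)      # global tempo + beat frames
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)      # timbral features
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)         # pitch-class features
print(float(tempo), mfcc.shape, chroma.shape)
```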
This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on broadcast news and VAD for speaker identification.
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
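The sketch below shows the general recipe at toy scale: log-mel spectrogram patches treated as single-channel images and passed through a small convolutional classifier in PyTorch. The architecture and shapes are illustrative assumptions, far smaller than the VGG/Inception/ResNet-style models examined in the paper.

```python
# Toy CNN over log-mel spectrogram patches treated as single-channel images.
import torch
import torch.nn as nn

class TinyAudioCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),           # global pooling over time and frequency
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x).flatten(1))

logmel_patch = torch.randn(4, 1, 64, 96)       # stand-in for 64-mel, ~1-second patches
logits = TinyAudioCNN()(logmel_patch)
print(logits.shape)                            # (4, 10)
```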
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Discusses why this task is an interesting challenge, and why it requires a specialized dataset that is different from conventional datasets used for automatic speech recognition of full sentences. Suggests a methodology for reproducible and comparable accuracy metrics for this task. Describes how the data was collected and verified, what it contains, previous versions and properties. Concludes by reporting baseline results of models trained on this dataset.
In this paper we present CatBoost, a new open-sourced gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.
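A minimal example of CatBoost's native categorical-feature handling is sketched below; the toy data and settings are assumptions for illustration only.

```python
# Minimal CatBoost sketch: column 0 is categorical and is encoded internally,
# so no manual one-hot encoding is needed. Data and settings are toy assumptions.
from catboost import CatBoostClassifier

X_train = [['rock', 120, 3.5], ['jazz', 95, 4.1], ['rock', 128, 2.9], ['jazz', 88, 5.0]]
y_train = [1, 0, 1, 0]

model = CatBoostClassifier(iterations=50, depth=4, verbose=False)
model.fit(X_train, y_train, cat_features=[0])
print(model.predict([['rock', 125, 3.0]]))
```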
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs at https://github.com/qiuqiangkong/audioset_tagging_cnn.
The ability of deep convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep CNN architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.
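The kinds of deformations discussed above (time stretching, pitch shifting, added noise) can be prototyped with librosa as sketched below; the deformation ranges and file name are illustrative, not the paper's exact augmentation parameters.

```python
# Simple audio augmentations with librosa; ranges and 'siren.wav' are placeholders.
import numpy as np
import librosa

def augment(y, sr, rng=None):
    rng = rng or np.random.default_rng()
    return [
        y,
        librosa.effects.time_stretch(y, rate=float(rng.uniform(0.9, 1.1))),
        librosa.effects.pitch_shift(y, sr=sr, n_steps=int(rng.integers(-2, 3))),
        y + 0.005 * rng.standard_normal(y.shape),      # low-level additive noise
    ]

y, sr = librosa.load('siren.wav', sr=22050)            # placeholder clip
clips = augment(y, sr)                                  # original + three deformed copies
```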
Merve Arslan, Şerif Ali Sadık | Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering
Converting original sounds into fake sounds using various methods and using these sounds for fraud or misinformation purposes poses serious risks and threats. In this study, a classification system using machine learning methods is created and a performance analysis is performed in order to detect sounds created with copy-move forgery, one of the types of audio forgery. Sound files are treated as raw data, and Mel-spectrograms are then obtained to visually represent the spectral features of the sound over time. Logistic Regression, Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN) and XGBoost algorithms are used in the classification phase. As a result of the performance analysis of the created models, the highest success is achieved with the XGBoost algorithm. The performance of the XGBoost algorithm is further improved by performing hyperparameter optimization with the Random Search method. The results of the models are analyzed using various metrics. According to the study results, the XGBoost algorithm gives competitive results.
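A hedged sketch of this kind of pipeline is given below: mel-spectrogram summaries per clip, an XGBoost classifier, and random-search hyperparameter tuning. The feature summarization, parameter grid, and synthetic stand-in data are assumptions, not the study's implementation.

```python
# Mel-spectrogram summaries + XGBoost with random-search tuning (illustrative only).
import numpy as np
import librosa
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

def clip_features(path, sr=16000, n_mels=64):
    """Fixed-length summary of a clip's mel-spectrogram (mean/std per mel band)."""
    y, _ = librosa.load(path, sr=sr)
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    return np.concatenate([S.mean(axis=1), S.std(axis=1)])

# In practice X is built with clip_features() over original and forged files;
# random numbers stand in here so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = np.r_[np.zeros(30, dtype=int), np.ones(30, dtype=int)]   # 0 = original, 1 = forged

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=200),
    param_distributions={'max_depth': [3, 5, 7], 'learning_rate': [0.05, 0.1, 0.2]},
    n_iter=5, cv=3, scoring='f1')
search.fit(X, y)
print(search.best_params_)
```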
Music, as a complex form of human expression, is intrinsically dynamic, with nonlinear characteristics in its structure and behavior over time. Music expresses a variety of emotional qualities and can be represented as a nonlinear dynamical system that frequently exhibits persistence. Traditional analysis approaches are frequently insufficient for capturing these complex dynamics. This research utilizes Ordinary Differential Equations (ODEs) to investigate the dynamic properties of music, with a focus on the nonlinear behavior observed in musical signals. The method involves breaking down music into its core components, examining local characteristics, and generalizing the underlying features. Using ODEs, the system's Lyapunov exponents were investigated to measure stability and chaotic behavior, the energy spectrum to analyze oscillatory modes, and the correlation dimension to comprehend the fractality inherent in music signals. These strategies use numerical simulations to characterize music's nonlinear features. The findings show that ODE-based approaches can accurately simulate important dynamic characteristics of music, with Lyapunov exponents exposing chaotic behavior, the energy spectrum demonstrating oscillatory patterns, and the correlation dimension emphasizing the fractal form of music. Finally, the research verifies the use of differential equations (DEs) to explain the dynamic properties of music. The method provides important insights into the nonlinear dynamics of musical signals, paving the way for more detailed models and applications in music analysis.
This work presents AudioSet-Tools, a modular and composable Python framework designed to streamline the creation of task-specific datasets derived from Google's AudioSet. Despite its extensive coverage, AudioSet suffers from weak labeling, class imbalance, and a loosely structured taxonomy, which limit its practical applicability in machine listening workflows. AudioSet-Tools addresses these issues through configurable taxonomy-aware label filtering and class re-balancing strategies. The framework includes automated routines for data download and transformation, enabling reproducible and semantically consistent dataset generation for both downstream fine-tuning and pre-training of machine/deep learning models. While domain-agnostic, we showcase its versatility through AudioSet-EV, a curated subset focused on emergency vehicle siren recognition, a socially relevant and technically challenging use case that exemplifies the structural and semantic gaps in the AudioSet taxonomy. We further provide an extensive comparative benchmark of AudioSet-EV against state-of-the-art emergency vehicle corpora, with source code and datasets openly released on GitHub and Zenodo, to foster transparency and reproducibility in real-world audio signal processing research.
Instrument recognition is a crucial aspect of music information retrieval, and in recent years, machine learning-based methods have become the primary approach to addressing this challenge. However, existing models often struggle to accurately identify multiple instruments within music tracks that vary in length and quality. One key issue is that the instruments of interest may not appear in every clip of the audio sample, and when they do, they are often unevenly distributed across different sections of the track. Additionally, in polyphonic music, multiple instruments are often played simultaneously, leading to signal overlap. Using the same overlapping audio signals as partial classification features for different instruments will reduce the distinguishability of features between instruments, thereby affecting the performance of instrument recognition. These complexities present significant challenges for current instrument recognition models. Therefore, this paper proposes a multi-instance multi-scale graph attention neural network (MMGAT) with label semantic embeddings for instrument recognition. MMGAT designs an instance correlation graph to model the presence and quantitative timbre similarity of instruments at different positions from the perspective of multi-instance learning. Then, to enhance the distinguishability of signals after the overlap of different instruments and improve classification accuracy, MMGAT learns semantic information from the labels of different instruments as embeddings and incorporates them into the overlapping audio signal features, thereby enhancing the differentiability of audio features for various instruments. MMGAT then designs an instance-based multi-instance multi-scale graph attention neural network to recognize different instruments based on the instance correlation graphs and label semantic embeddings. The effectiveness of MMGAT is validated through experiments and compared to commonly used instrument recognition models. The experimental results demonstrate that MMGAT outperforms existing approaches in instrument recognition tasks.
Acoustic scene classification aims to recognize the scenes corresponding to sound signals in the environment, but audio differences from different cities and devices can affect the model's accuracy. In this paper, a time–frequency–wavelet fusion network is proposed to improve model performance by focusing on three dimensions: the time dimension of the spectrogram, the frequency dimension, and the high- and low-frequency information extracted by a wavelet transform through a time–frequency–wavelet module. Multidimensional information was fused through the gated temporal–spatial attention unit, and the visual state space module was introduced to enhance the contextual modeling capability of audio sequences. In addition, Kolmogorov–Arnold network layers were used in place of multilayer perceptrons in the classifier part. The experimental results show that the proposed method achieves a 56.16% average accuracy on the TAU Urban Acoustic Scenes 2022 mobile development dataset, which is an improvement of 6.53% compared to the official baseline system. This performance improvement demonstrates the effectiveness of the model in complex scenarios. In addition, the accuracy of the proposed method on the UrbanSound8K dataset reached 97.60%, which is significantly better than the existing methods, further verifying the generalization ability of the proposed model in the acoustic scene classification task.
Abstract For a software tool to be useful for musical corpus studies, it should expose symbolic musical data, however it is stored, as a set of musically meaningful software abstractions; and support the batch manipulation of more than one piece of music with the help of these abstractions, preferably on the order of hundreds or thousands. In this chapter, the author explores the landscape of toolkits for analysis of symbolic corpora, including Humdrum and music21. In addition to providing a historical background to these tools, the author explores a number of use cases and introductory approaches to how to get started with each. Finally, the author discusses issues of maintainability and best practice in relation to research software.
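As a flavour of the batch, abstraction-based workflow the chapter describes, here is a minimal music21 sketch over the bundled Bach corpus; the specific query and the key/note-count analysis are our own illustrative choices, not examples taken from the chapter.

```python
# Sketch: iterate over a small batch of corpus entries, parse each into a
# Score object, and apply musically meaningful abstractions (key analysis,
# note counts). Scaling the slice up gives the hundreds-of-pieces use case.
from music21 import corpus

results = []
for entry in list(corpus.search('bach', field='composer'))[:5]:   # small batch
    score = entry.parse()                     # symbolic data as a Score object
    key = score.analyze('key')                # estimated key of the piece
    n_notes = len(score.recurse().notes)      # all notes/chords, any nesting level
    results.append((entry.sourcePath, key, n_notes))

for path, key, n_notes in results:
    print(path, key, n_notes)
```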
Guijin Han, Junzhe Zhao, Yiming Zhou | Journal of King Saud University - Computer and Information Sciences
Yinghua Li, Xueqi Dang, Wendkûuni C. Ouédraogo +2 more | Proceedings of the ACM on Software Engineering
Audio classification systems, powered by deep neural networks (DNNs), are integral to various applications that impact daily lives, like voice-activated assistants. Ensuring the accuracy of these systems is crucial since inaccuracies can lead to significant security issues and user mistrust. However, testing audio classifiers presents a significant challenge: the high manual labeling cost for annotating audio test inputs. Test input prioritization has emerged as a promising approach to mitigate this labeling cost issue. It prioritizes potentially misclassified tests, allowing for the early labeling of such critical inputs and making debugging more efficient. However, when applying existing test prioritization methods to audio-type test inputs, there are some limitations: 1) Coverage-based methods are less effective and efficient than confidence-based methods. 2) Confidence-based methods rely only on prediction probability vectors, ignoring the unique characteristics of audio-type data. 3) Mutation-based methods lack designed mutation operations for audio data, making them unsuitable for audio-type test inputs. To overcome these challenges, we propose AudioTest, a novel test prioritization approach specifically designed for audio-type test inputs. The core premise is that tests closer to misclassified samples are more likely to be misclassified. Based on the special characteristics of audio-type data, AudioTest generates four types of features: time-domain features, frequency-domain features, perceptual features, and output features. For each test, AudioTest concatenates its four types of features into a feature vector and applies a carefully designed feature transformation strategy to bring misclassified tests closer in space. AudioTest leverages a trained model to predict the probability of misclassification of each test based on its transformed vectors and ranks all the tests accordingly. We evaluate the performance of AudioTest utilizing 96 subjects, encompassing natural and noisy datasets. We employed two classical metrics, Percentage of Fault Detection (PFD) and Average Percentage of Fault Detected (APFD), for our evaluation. The results demonstrate that AudioTest outperforms all the compared test prioritization approaches in terms of both PFD and APFD. The average improvement of AudioTest compared to the baseline test prioritization methods ranges from 12.63% to 54.58% on natural datasets and from 12.71% to 40.48% on noisy datasets.
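To illustrate the four feature families described above, here is a hedged sketch of per-test feature construction; the specific descriptors (RMS, zero-crossing rate, centroid, bandwidth, MFCCs) and the placeholder probability vector are our own stand-ins, not AudioTest's exact feature set or transformation strategy.

```python
# Sketch: concatenate time-domain, frequency-domain, perceptual, and output
# features into one vector per audio test input.
import numpy as np
import librosa

def audio_test_features(y: np.ndarray, sr: int, model_probs: np.ndarray) -> np.ndarray:
    time_feats = np.array([
        librosa.feature.rms(y=y).mean(),                      # time domain: energy
        librosa.feature.zero_crossing_rate(y).mean(),         # time domain: ZCR
    ])
    freq_feats = np.array([
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(), # frequency domain
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
    ])
    perceptual_feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    output_feats = model_probs                                # classifier probability vector
    return np.concatenate([time_feats, freq_feats, perceptual_feats, output_feats])

sr = 16000
y = 0.1 * np.random.randn(sr)            # stand-in for a 1-second test input
probs = np.full(10, 0.1)                 # stand-in softmax output of the model under test
print(audio_test_features(y, sr, probs).shape)
```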
Automated giant panda (Ailuropoda melanoleuca) behavior recognition (GPBR) systems are highly beneficial for efficiently monitoring giant pandas in wildlife conservation missions. While video-based behavior recognition attracts a lot of attention, few studies have focused on audio-based methods. In this paper, we propose the exploitation of the audio data recorded by collar-mounted devices on giant pandas for the purpose of GPBR. We construct a new benchmark audio dataset of giant pandas named abPanda-5 for GPBR, which consists of 18,930 samples from five giant panda individuals with five main behaviors. To fully explore the bioacoustic features, we propose an audio-based method for automatic GPBR using competitive fusion learning. The method improves behavior recognition accuracy and robustness, without additional computational overhead in the inference stage. Experiments performed on the abPanda-5 dataset demonstrate the feasibility and effectiveness of our proposed method.
Abstract Layout Analysis (LA) is a critical process for detecting and isolating different components within a scanned document, allowing for more straightforward and precise processing of each part independently. In Optical Music Recognition (OMR), LA is essential for identifying and extracting music staves, which enables effective music notation recognition and processing. While the literature includes several studies exploring methods for staff retrieval, there remains room for improvement in terms of robustness and accuracy. In this work, we introduce a methodology that integrates Monte Carlo Dropout (MCD) into a neural network model in order to improve reliability in staff retrieval from scanned sheet music. Our approach leverages multiple non-deterministic predictions using standard dropout layers during inference and aggregates them through pixel-level combination policies. We extend the MCD technique, originally designed for classification and regression tasks using averaged predictions, to the LA task and introduce new combination strategies: maximum and voting criteria. Experiments on three diverse music score corpora, including printed and handwritten documents, demonstrated the effectiveness of our approach. The averaging and voting (with 25% and 50% of votes) criteria reduced the relative error by 63.6% compared to the baseline and achieved a 32.1% improvement over state-of-the-art methods. Our methodology notably enhanced detection accuracy without requiring modifications to the neural architecture, especially at the edges of staves, where conventional models tend to show higher error rates.
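The core mechanics of Monte Carlo Dropout at inference, together with the three pixel-level combination policies named above (average, maximum, voting), can be sketched as follows; the tiny segmentation network and the 0.5 binarization threshold are placeholders, not the authors' architecture or settings.

```python
# Sketch: keep dropout layers active at inference, run T stochastic forward
# passes, and combine the per-pixel predictions by averaging, maximum, or voting.
import torch
import torch.nn as nn

class TinyStaffNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(0.5),                      # stays active for MC Dropout
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.body(x)

def mc_dropout_predict(model, x, n_samples=25, policy="average", vote_ratio=0.5):
    model.eval()
    for m in model.modules():                       # re-enable only the dropout layers
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])   # (T, B, 1, H, W)
    if policy == "average":
        return (preds.mean(dim=0) > 0.5).float()
    if policy == "maximum":
        return (preds.max(dim=0).values > 0.5).float()
    if policy == "voting":                          # pixel is staff if enough passes agree
        votes = (preds > 0.5).float().mean(dim=0)
        return (votes >= vote_ratio).float()
    raise ValueError(policy)

mask = mc_dropout_predict(TinyStaffNet(), torch.rand(1, 1, 64, 64), policy="voting")
print(mask.shape)
```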
Jiby Mariya Jose, J. José | International Journal of Scientific Research in Computer Science Engineering and Information Technology
This paper proposes EPyNet, a deep learning architecture designed for energy-reduced audio emotion recognition. In the domain of audio-based emotion recognition, where discerning emotional cues from audio input is crucial, the integration of artificial intelligence techniques has sparked a transformative shift in accuracy and performance. Deep learning, renowned for its ability to decipher intricate patterns, spearheads this evolution. However, the energy efficiency of deep learning models, particularly in resource-constrained environments, remains a pressing concern. Convolutional operations serve as the cornerstone of deep learning systems, but their extensive computational demands and energy-inefficient computations make them poorly suited for deployment in scenarios with limited resources. Addressing these challenges, researchers introduced one-dimensional convolutional neural network (1D CNN) array convolutions as an alternative to traditional two-dimensional CNNs with reduced resource requirements. However, while this array-based operation reduced resource requirements, its energy-consumption impact was not studied. To bridge this gap, we introduce EPyNet, a deep learning architecture crafted for energy efficiency with a particular emphasis on neuron reduction. Focusing on the task of audio emotion recognition, we evaluate EPyNet on five public audio corpora: RAVDESS, TESS, EMO-DB, CREMA-D, and SAVEE. We propose three versions of EPyNet, a lightweight neural network designed for efficient emotion recognition, each optimized for a different trade-off between accuracy and energy efficiency. Experimental results demonstrate that the 0.06M EPyNet reduced energy consumption by 76.5% while improving accuracy by 5% on RAVDESS, 25% on TESS, and 9.75% on SAVEE. The 0.2M and 0.9M models reduced energy consumption by 64.9% and 70.3%, respectively. Additionally, we compared our proposed 0.06M system with the MobileNet models on the CIFAR-10 dataset and achieved significant improvements. The proposed system reduces energy by 86.2% and memory by 95.7% compared to MobileNet, with a slightly lower accuracy of 0.8%. Compared to MobileNetV2, it improves accuracy by 99.2% and reduces memory by 93.8%. When compared to MobileNetV3, it achieves 57.2% energy reduction, 85.1% memory reduction, and a 24.9% accuracy improvement. We further test the scalability and robustness of the proposed solution on different data dimensions and frameworks.
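For readers unfamiliar with the 1D-CNN style of model this line of work builds on, here is a generic small 1D CNN over MFCC frames; the layer sizes and the eight-class output are assumptions for illustration and do not reproduce the EPyNet configuration.

```python
# Sketch: 1D convolutions run along the time axis of an MFCC sequence, which
# is the kind of array convolution cited as a lower-resource alternative to 2D CNNs.
import torch
import torch.nn as nn

class Small1DCNN(nn.Module):
    def __init__(self, n_mfcc=40, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),               # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                          # x: (batch, n_mfcc, n_frames)
        return self.classifier(self.features(x).squeeze(-1))

model = Small1DCNN()
logits = model(torch.randn(4, 40, 200))           # 4 clips, 40 MFCCs, 200 frames
print(logits.shape)                               # torch.Size([4, 8])
```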
We propose a novel framework for adaptive stochastic dynamics analysis of tourist behavior by integrating context-aware Markov models with finite mixture models (FMMs). Conventional Markov models often fail to capture abrupt changes induced by external shocks, such as event announcements or weather disruptions, leading to inaccurate predictions. The proposed method addresses this limitation by introducing virtual sensors that dynamically detect contextual anomalies and trigger regime switches in real-time. These sensors process streaming data to identify shocks, which are then used to reweight the probabilities of pre-learned behavioral regimes represented by FMMs. The system employs expectation maximization to train distinct Markov sub-models for each regime, enabling seamless transitions between them when contextual thresholds are exceeded. Furthermore, the framework leverages edge computing and probabilistic programming for efficient, low-latency implementation. The key contribution lies in the explicit modeling of contextual shocks and the dynamic adaptation of stochastic processes, which significantly improves robustness in volatile tourism scenarios. Experimental results demonstrate that the proposed approach outperforms traditional Markov models in accuracy and adaptability, particularly under rapidly changing conditions. Quantitative results show a 13.6% improvement in transition accuracy (0.742 vs. 0.653) compared to conventional context-aware Markov models, with an 89.2% true positive rate in shock detection and a median response latency of 47 min for regime switching. This work advances the state-of-the-art in tourist behavior analysis by providing a scalable, real-time solution for capturing complex, context-dependent dynamics. The integration of virtual sensors and FMMs offers a generalizable paradigm for stochastic modeling in other domains where external shocks play a critical role.
Princy Tyagi, N. K. Singh, Satyam Singh Rawat +2 more | International Journal For Multidisciplinary Research
Music genre categorization is an essential activity in music data retrieval and recommendation systems. This research focuses on classifying music genres using machine learning techniques, specifically the Support Vector Machine. The GTZAN dataset, comprising 10 distinct genres, is utilized for training and evaluation. We extracted audio features, including MFCCs, spectral contrast, and chroma vectors, from the GTZAN dataset and used them to train a Support Vector Machine (SVM) model. The classification model achieved an accuracy of 81.1% across the 10 genres in a multi-class setting. The research emphasizes the difficulties of genre convergence and the efficacy of machine learning in automating music categorization. Future developments might explore deep learning techniques such as Convolutional Neural Networks, better feature selection, and improved data to make music categorization more effective.
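A minimal sketch of the described feature-plus-SVM pipeline is given below; the GTZAN folder path is a hypothetical local copy, and the RBF kernel and C value are assumptions, since the abstract does not report the SVM hyperparameters.

```python
# Sketch: pool MFCC, spectral-contrast, and chroma features per clip, then
# train/evaluate an SVM. GTZAN itself must be obtained separately.
from pathlib import Path
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(path: Path) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)
    return np.concatenate([mfcc, contrast, chroma])

paths = sorted(Path("gtzan/genres_original").rglob("*.wav"))  # assumed layout: genre/clip.wav
if paths:
    X = np.vstack([clip_features(p) for p in paths])
    y = np.array([p.parent.name for p in paths])              # genre = folder name
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```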
Abstract State Space Models have achieved good performance on long sequence modeling tasks such as raw audio classification. Their definition in continuous time allows for discretization and operation of the network at different sampling rates. However, this property has not yet been utilized to decrease the computational demand on a per-layer basis. We propose a family of hardware-friendly S-Edge models with a layer-wise downsampling approach to adjust the temporal resolution between individual layers. Applying existing methods from linear control theory allows us to analyze state/memory dynamics and provides an understanding of how and where to downsample. Evaluated on the Google Speech Command dataset, our autoregressive/causal S-Edge models range from 8–141k parameters at 90–95% test accuracy, in comparison to a causal S5 model with 208k parameters at 95.8% test accuracy. Using our C++17 header-only implementation on an ARM Cortex-M4F, the largest model requires 103 sec. of inference time at 95.19% test accuracy, while the smallest model, at 88.01% test accuracy, requires 0.29 sec. Our solutions cover a design space that spans 17x in model size, 358x in inference latency, and 7.18 percentage points in accuracy.
Abstract Optimal arrangements of turbulent pipe systems strongly depend on branch patterns, and turbulence fields typically cause involved multimodality in the solution space. These features hinder gradient-based structural optimization frameworks from finding promising solutions for turbulent pipe systems. In this paper, we propose a multi-stage framework that integrates data-driven morphological exploration and evolutionary shape optimization to address the challenges posed by the complexity of turbulent pipe systems. Our framework begins with data-driven morphological exploration, aiming to find promising morphologies; this stage yields shapes from which a reasonable number of candidates are selected for the subsequent shape refinement stage. Herein, we employ data-driven topology design, a gradient-free and multiobjective optimization methodology incorporating a deep generative model and the concept of evolutionary algorithms, to generate promising arrangements. Subsequently, a deep clustering strategy extracts representative shapes. The final stage refines these shapes through shape optimization using a genetic algorithm. Applying the framework to a two-dimensional turbulent pipe system with a minimax objective shows its effectiveness in delivering high-performance solutions for the turbulent flow optimization problem with branching.
Anjan Barman | International Journal for Research in Applied Science and Engineering Technology
The increasing consumption of web music has called for the creation of scalable and personalized streaming services. This report documents the design and implementation of a Soundscape Web Application, which replicates key features of leading music streaming services while adding new features for enhancing user experience. The project has a responsive user interface and experience, a scalable backend system, support for real-time playback, and playlist management features. Implemented with modern web development tools and cloud computing technologies, the application is a viable means of launching a competitive, user-centric audio streaming service. Initial findings indicate that the application is highly usable and functional, with strong potential for future expansion, especially in personalization and social integration.
This systematic review synthesizes 82 peer-reviewed studies published between 2014 and 2024 on the use of audio features in educational research. We define audio features as descriptors extracted from audio recordings of educational interactions, including low-level acoustic signals (e.g., pitch and MFCCs), speaker-based metrics (e.g., talk-time and participant ratios), and linguistic indicators derived from transcriptions. Our analysis contributes to the field in three key ways: (1) it offers targeted mapping of how audio features are extracted, processed, and functionally applied within educational contexts, covering a wide range of use cases from behavior analysis to instructional feedback; (2) it diagnoses recurrent limitations that restrict pedagogical impact, including the scarcity of actionable feedback, low model interpretability, fragmented datasets, and limited attention to privacy; (3) it proposes actionable directions for future research, including the release of standardized, anonymized feature-level datasets, the co-design of feedback systems involving pedagogical experts, and the integration of fine-tuned generative AI to translate complex analytics into accessible, contextualized recommendations for teachers and learners. While current research demonstrates significant technical progress, its educational potential is yet to be translated into real-world educational impact. We argue that unlocking this potential requires shifting from isolated technical achievements to ethically grounded pedagogical implementations.
This study proposes integrating wireless sensor networks (WSNs) with audio anomaly detection methods to improve timbre training for oboe players. WSNs capture real-time measurements of parameters such as breath pressure, embouchure, and the timing of sound features, serving as a backbone for data-driven music education. Audio anomaly detection then permits assessment of oboe sound quality and real-time feedback on performance. The audio data is preprocessed (noise elimination, resampling, and silence removal) so that it is stable and free from extraneous influences. Sound anomalies are identified through a hybrid method that combines Variational Autoencoders (VAEs) with Convolutional Neural Networks (CNNs), enabling robust evaluation of oboe performance. The system is evaluated on several performance metrics: 98.18% accuracy, 97.12% precision, 97.34% recall, and a 97.55% F1-score, demonstrating a strong capability to identify deviations from the standard in the audio signal. The approach addresses key challenges in oboe training, such as breath control and embouchure accuracy, providing musicians and instructors with valuable tools for sound quality improvement and performance analysis.
The process of sensing and transmitting acoustic signals by pervasive acoustic wireless sensor networks (PAWSNs) poses considerable energy challenges. These problems may be mitigated by filtering only relevant acoustic events from the sensor network. By reducing the number of acoustic events, the frequency of communication may be decreased, thereby enhancing energy efficiency. Although traditional machine learning models are capable of predicting relevant acoustic events by being trained on suitable datasets, they are impractical for direct implementation on resource-limited acoustic sensor nodes. To address this issue, this research introduces TinyML-based acoustic event detection (AED) models which facilitate efficient real-time processing on microcontrollers with scarce hardware resources. The study develops several TinyML models using an environmental dataset and evaluates their accuracy. These models are then deployed in hardware to assess their performance in terms of AED. With this approach, only predicted events that exceed a certain threshold are transmitted to the base station via router nodes, which reduces the transmission burden and thus improves the energy efficiency of PAWSNs. Real-time experiments confirm that the proposed method significantly improves energy efficiency and boosts node lifetime.
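The event-gating idea can be summarized in a few lines; in the sketch below the classifier and the radio call are placeholders standing in for the on-device TinyML model and the WSN stack, and the 0.8 threshold is an assumed value.

```python
# Sketch: a node classifies each audio frame locally and only transmits
# reports for relevant events whose confidence exceeds a threshold.
import numpy as np

CONF_THRESHOLD = 0.8          # assumed relevance threshold

def classify_frame(frame: np.ndarray) -> tuple[str, float]:
    """Placeholder for the on-device TinyML acoustic event classifier."""
    energy = float(np.mean(frame ** 2))
    return ("event", min(1.0, energy * 10)) if energy > 0.05 else ("background", 0.2)

def transmit(label: str, confidence: float) -> None:
    """Placeholder for sending a compact event report via the router nodes."""
    print(f"TX -> base station: {label} ({confidence:.2f})")

def node_loop(frames):
    for frame in frames:
        label, conf = classify_frame(frame)
        if label != "background" and conf >= CONF_THRESHOLD:
            transmit(label, conf)   # radio is used only for relevant events

node_loop(np.random.randn(5, 1024) * 0.3)
```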
Yingrui Li | International Journal of Web-Based Learning and Teaching Technologies
This research focuses on the application of multimedia technology and the Dynamic Time Warping (DTW) algorithm in university vocal music teaching. This model provides students with multi-dimensional learning experiences and targeted feedback. Experimental results show that students in the experimental group using this new model have significantly higher singing proficiency scores and better final singing levels compared to the control group. The average score of the experimental group increased by 26.15%, with 35% and 45% of students reaching excellent and good levels respectively, while the control group had an average score increase of 14.29%, with 15% and 40% of students reaching these levels. The DTW algorithm effectively handles performance differences, and multimedia technology enriches teaching resources, enhancing students' learning enthusiasm. Future research will focus on improving feature-matching algorithms, optimizing teaching software, and enhancing hardware performance to promote the development of efficient vocal music teaching.
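As a concrete illustration of how DTW can compare a student's rendition with a reference despite timing differences, here is a minimal sketch; the MFCC features, the synthetic test signals, and the path-normalized cost used as a score are illustrative choices, not the paper's grading scheme.

```python
# Sketch: align MFCC sequences of reference and student performances with DTW
# and report the normalised alignment cost (lower = closer match).
import numpy as np
import librosa

def dtw_cost(ref_audio: np.ndarray, student_audio: np.ndarray, sr: int) -> float:
    ref = librosa.feature.mfcc(y=ref_audio, sr=sr, n_mfcc=13)
    stu = librosa.feature.mfcc(y=student_audio, sr=sr, n_mfcc=13)
    D, wp = librosa.sequence.dtw(X=ref, Y=stu, metric="euclidean")
    return float(D[-1, -1] / len(wp))       # cumulative cost, normalised by path length

sr = 22050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
reference = np.sin(2 * np.pi * 440 * t)     # stand-in for a reference recording
student = np.sin(2 * np.pi * 452 * t)       # slightly sharp student rendition
print(f"normalised DTW cost: {dtw_cost(reference, student, sr):.3f}")
```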
Ying Liu | International Journal of Web-Based Learning and Teaching Technologies
“Internet plus Music Education” is characterized by a full-scene learning ecology, data-driven model innovation, and upgraded immersive experiences. The exploration of the educational metaverse and breakthroughs in large educational models have become important development trends, and the industrial ecology is undergoing deep changes, bringing new opportunities for the development of music education; its digital transformation is therefore an inevitable trend. Taking Perth Music Group as a sample, this study uses multi-source data fusion to build an analysis database. By constructing an econometric model, it evaluates the sustainability of the “Internet plus Music Education” business model and identifies key growth drivers. This opens a new direction for music education research from the perspective of economic and technological integration, provides a method for quantitative research on the development of the music education industry, and enriches the theoretical system of music education research.
Lei Deng | International Journal of Web-Based Learning and Teaching Technologies
As science and technology advance rapidly, video semantic understanding (VSU) technology has made significant strides. This technology has garnered widespread recognition within the music industry and piqued the interest of film and television music creators. In the realm of film music creation, VSU technology serves as a powerful tool, revolutionizing traditional approaches and steering the evolution of film and television music creation. This study employs the Spatiotemporal Pattern-based Saliency Map Generation (SMGTSM) algorithm, which generates saliency maps for each frame in an average of 45.91 ms. This is notably faster than methods based on the optical flow field algorithm (81.49 ms) and the random sample consensus (RANSAC) algorithm. The application of VSU technology not only enhances traditional film and television music creation methods but also significantly boosts the efficiency and quality of the creative process.
L. Chen | International Journal of Web-Based Learning and Teaching Technologies
In the evolving landscape of vocal pedagogy, the integration of computer-assisted technologies represents a transformative shift from traditional master-apprentice models. This study investigates the efficacy of computer-assisted vocal training methods compared to conventional approaches, focusing on improvements in pitch accuracy, vocal range expansion, and emotional expression among novice vocalists. Utilizing a mixed-methods approach, including digital signal processing, machine learning, and virtual reality, the authors conducted a 12-week experiment involving 60 participants randomly divided into two groups. Results indicate that computer-assisted training offers nearly double the improvement in pitch accuracy and vocal range expansion over traditional methods, with more pronounced enhancements in emotional expression skills. These findings contribute significantly to developing standardized, personalized, and scientifically-grounded vocal training methodologies, demonstrating a more efficient pathway for enhancing vocal performance.
In recent years, the rapid rise of technologies such as the Internet of Things (IoT) and Artificial Intelligence (AI) has transformed numerous domains, particularly smart homes. As people experience greater material comfort, they increasingly seek deeper, more emotionally intelligent ways to interact with technology. Music, rich in emotional content, serves as a powerful medium for interpersonal communication and is increasingly regarded as a natural channel for intelligent human-computer interaction. However, traditional music emotion recognition techniques face challenges with low recognition accuracy and high computational costs. To address these limitations, we propose an efficient deep learning-based music emotion recognition system that integrates generative adversarial networks (GANs) within an IoT framework. The system employs a convolutional neural network (CNN) to extract both local and global features from musical signals using Mel-frequency techniques. These features enhance the GAN’s ability to detect complex emotional expressions in music. Experimental results demonstrate that the proposed model achieves significantly lower error rates and greater recognition accuracy compared to state-of-the-art methods. Specifically, it attains an accuracy of 94.06%, confirming its effective performance and suitability for real-time, emotion-aware music recommendation in IoT applications.
Natural sound textures, such as rain or crackling fire, are defined by time-averaged summary statistics that shape their perceptual identity. Variations in these statistics provide a controlled means to examine how the human brain processes complex auditory structure. In this study, we investigated how such statistics are represented along the ascending auditory pathway, within auditory cortex, and in high-level regions involved in natural pattern analysis. Using fMRI, we measured brain responses to synthetic sound textures in which higher-order statistical structure was systematically degraded while preserving the overall texture category. Participants listened to sounds with varying levels of naturalness, defined by their statistical fidelity to real-world textures, and performed a perceptual comparison task, judging whether two sequentially presented sounds matched in naturalness. We observed that increasing naturalness elicited stronger BOLD responses across bilateral auditory cortex for both reference and test sounds. Activity in medial temporal lobe regions, including the entorhinal cortex and hippocampus, was modulated by naturalness in a position-dependent manner. Entorhinal cortex activity was modulated only during the test sound, suggesting a role in perceptual-mnemonic comparison. Hippocampal connectivity with the auditory cortex increased when reference textures were more degraded or less natural, indicating top-down inference under uncertainty. Together, these findings highlight the interplay between bottom-up encoding and memory-based mechanisms in supporting judgments of auditory realism based on summary statistics.
Yanzhen Ren, Wuyang Liu, Chenyu Liu +1 more | EURASIP Journal on Audio, Speech, and Music Processing
Modern audio production workflows often require significant manual effort during the initial session preparation phase, including track labeling, format standardization, and gain staging. This paper presents a rule-based and Machine Learning-assisted automation system designed to minimize the time required for these tasks in Digital Audio Workstations (DAWs). The system automatically detects and labels audio tracks, identifies and eliminates redundant fake stereo channels, merges double-tracked instruments into stereo pairs, standardizes sample rate and bit rate across all tracks, and applies initial gain staging using target loudness values derived from a Genetic Algorithm (GA)-based system, which optimizes gain levels for individual track types based on engineer preferences and instrument characteristics. By replacing manual setup processes with automated decision-making methods informed by Machine Learning (ML) and rule-based heuristics, the system reduces session preparation time by up to 70% in typical multitrack audio projects. The proposed approach highlights how practical automation, combined with lightweight Neural Network (NN) models, can optimize workflow efficiency in real-world music production environments.
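The gain-staging step itself reduces to computing the gain that moves each track onto a per-instrument target level; the sketch below uses a simple RMS-based loudness proxy and an assumed target table, whereas the paper derives its targets with a GA, which is not reproduced here.

```python
# Sketch: compute and apply the dB gain that brings a track to its target level.
import numpy as np

TARGET_DBFS = {"kick": -12.0, "vocal": -16.0, "acoustic_guitar": -18.0}   # assumed targets

def rms_dbfs(x: np.ndarray) -> float:
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def gain_to_target(x: np.ndarray, track_type: str) -> float:
    """Gain in dB to apply so the track sits at its target level."""
    return TARGET_DBFS[track_type] - rms_dbfs(x)

sr = 48000
vocal = 0.05 * np.random.randn(sr * 3)            # stand-in for a recorded vocal track
g_db = gain_to_target(vocal, "vocal")
vocal_staged = vocal * (10 ** (g_db / 20.0))      # apply the linear gain
print(f"applied {g_db:+.1f} dB -> {rms_dbfs(vocal_staged):.1f} dBFS")
```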
The digital music industry has undergone a paradigm shift, favoring streaming services and personalized recommendation systems. This paper presents an Online Music Recommendation System, a web-based music library and recommendation platform that allows users to listen to, share, and recommend songs among friends. The system integrates social interactions to enhance music discovery while keeping the user interface intuitive and accessible. The application follows a client-server architecture, allows administrators to manage music content, and enables users to interact through friend lists and song recommendations. This experimental implementation shows a high level of user engagement and satisfaction, validating the feasibility and effectiveness of socially-enhanced music recommendation platforms.
With the increasing application of electrical network frequency (ENF) in forensic audio and video analysis, ENF signal detection has emerged as a critical technology. However, high-pass filtering operations commonly employed in modern communication scenarios, while effectively removing infrasound to enhance communication quality at reduced costs, result in a substantial loss of fundamental frequency information, thereby degrading the performance of existing detection methods. To tackle this issue, this paper introduces Multi-HCNet, an innovative deep learning model specifically tailored for ENF signal detection in high-pass filtered environments. Specifically, the model incorporates an array of high-order harmonic filters (AFB), which compensates for the loss of fundamental frequency by capturing high-order harmonic components. Additionally, a grouped multi-channel adaptive attention mechanism (GMCAA) is proposed to precisely distinguish between multiple frequency signals, demonstrating particular effectiveness in differentiating between 50 Hz and 60 Hz fundamental frequency signals. Furthermore, a sine activation function (SAF) is utilized to better align with the periodic nature of ENF signals, enhancing the model’s capacity to capture periodic oscillations. Experimental results indicate that after hyperparameter optimization, Multi-HCNet exhibits superior performance across various experimental conditions. Compared to existing approaches, this study not only significantly improves the detection accuracy of ENF signals in complex environments, achieving a peak accuracy of 98.84%, but also maintains an average detection accuracy exceeding 80% under high-pass filtering conditions. These findings demonstrate that even in scenarios where fundamental frequency information is lost, the model remains capable of effectively detecting ENF signals, offering a novel solution for ENF signal detection under extreme conditions of fundamental frequency absence. Moreover, this study successfully distinguishes between 50 Hz and 60 Hz fundamental frequency signals, providing robust support for the practical deployment and extension of ENF signal applications.
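A sine activation of the general kind the SAF suggests can be written as a small module; the learnable frequency scale below is our own illustrative choice and is not claimed to match Multi-HCNet's exact formulation.

```python
# Sketch: a sine activation layer intended to match the periodic structure of
# ENF harmonics better than a standard ReLU-style nonlinearity.
import torch
import torch.nn as nn

class SineActivation(nn.Module):
    def __init__(self, w0: float = 1.0):
        super().__init__()
        self.w0 = nn.Parameter(torch.tensor(w0))   # frequency scale of the activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.w0 * x)

layer = nn.Sequential(nn.Linear(64, 64), SineActivation())
print(layer(torch.randn(8, 64)).shape)
```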
Background Early detection of elevated acute stress is necessary if we aim to reduce consequences associated with prolonged or recurrent stress exposure. Stress monitoring may be supported by valid and reliable machine-learning algorithms. However, investigation of algorithms detecting stress severity on a continuous scale is missing due to high demands on data quality for such analyses. Use of multimodal data, meaning data coming from multiple sources, might contribute to machine-learning stress severity detection. We aimed to detect laboratory-induced stress using multimodal data and identify challenges researchers may encounter when conducting a similar study. Methods We conducted a preliminary exploration of the performance of a machine-learning algorithm trained on multimodal data, namely visual, acoustic, verbal, and physiological features, in its ability to detect stress severity following a partially automated online version of the Trier Social Stress Test. College students (n = 42; mean age = 20.79, 69% female) completed a self-reported stress visual analogue scale at five time-points: after the initial resting period (P1), during the three stress-inducing tasks (i.e., preparation for a presentation, a presentation task, and an arithmetic task; P2-4), and after a recovery period (P5). For the whole duration of the experiment, we recorded the participants’ voice and facial expressions with a video camera and measured cardiovascular and electrodermal physiology with an ambulatory monitoring system. Then, we evaluated the performance of the algorithm in detection of stress severity using three combinations of visual, acoustic, verbal, and physiological data collected at each of the periods of the experiment (P1-5). Results Participants reported minimal (P1, M = 21.79, SD = 17.45) to moderate stress severity (P2, M = 47.95, SD = 15.92), depending on the period at hand. We found a very weak association between the detected and observed scores (r² = .154; p = .021). In our post-hoc analysis, we classified participants into categories of stressed and non-stressed individuals. When applying all available features (i.e., visual, acoustic, verbal, and physiological), or a combination of visual, acoustic, and verbal features, performance ranged from acceptable to good, but only for the presentation task (accuracy up to .71, F1-score up to .73). Conclusions The complexity of input features needed for machine-learning detection of stress severity based on multimodal data requires large sample sizes with wide variability of stress reactions and inputs among participants. These are difficult to recruit in a laboratory setting, due to high time and effort demands on the side of both researcher and participant. Resources needed may be decreased using automatization of experimental procedures, which may, however, lead to additional technological challenges, potentially causing other recruitment setbacks. Further investigation is necessary, with an emphasis on quality ground truth, i.e., gold-standard (self-report) instruments, but also outside of laboratory experiments, mainly in general populations and mental health care patients.
Fish produce a wide variety of sounds that contribute to the soundscapes of aquatic environments. In reef systems, these sounds are important acoustic cues for various ecological processes. Artificial intelligence methods to detect, classify and identify fish sounds have become increasingly common. This study proposes the classification of unknown fish sounds recorded in a subtropical rocky reef using different feature sets, data augmentation and explainable artificial intelligence tools. We used different supervised algorithms (naive Bayes, random forest, decision trees and multilayer perceptron) to perform a multiclass classification of four classes of fish pulsed sounds. The proposed models showed excellent performance, achieving 98.1% correct classification with a multilayer perceptron using data augmentation. Explainable artificial intelligence allowed us to identify which features contributed to predicting each sound class. Recognizing and characterizing these sounds is key to better understanding diel behaviours and functional roles associated with critical reef ecological processes. This article is part of the theme issue ‘Acoustic monitoring for tropical ecology and conservation’.
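The supervised multiclass setup can be sketched with one of the listed classifiers; the random-forest choice, the synthetic feature matrix, and the four placeholder classes below are illustrative stand-ins for the real acoustic descriptors and recordings, so the cross-validation score is meaningless except as a demonstration of the workflow.

```python
# Sketch: cross-validated multiclass classification of pulsed-sound feature
# vectors with a random forest (one of the supervised models compared).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))                 # stand-in acoustic descriptors per sound
y = rng.integers(0, 4, size=400)               # four pulsed-sound classes (placeholders)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```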