Using audio in machine learning

The first and most vital step in using audio in machine learning is to understand how humans perceive raw audio, and to represent this information electronically in a way that resembles that perception. For this task, we extract features from the audio signal.

The basics of how the human ear perceives sound are explored below, without going into excessive detail on audio theory.



Recording a music piece is the first step in representing audio on computers and other devices.
Put simply, sound consists of vibrations, that is, changes in air pressure \cite{WesternElectricCo}.\newline

In digital recording, the air pressure at a particular point in space is measured at regular time intervals, and the sound is represented as a sequence of discrete numbers (a waveform). This differs from an analog signal, whose values are continuous \cite{Proakis1992DigitalSP}.
Each of these numbers is called a sample, and the number of samples per second is called the sample rate. Conversely, a playback device applies the opposite procedure to convert these numbers back into sound.\newline

The sample rate is an important parameter in sampling: it should be at least twice the highest frequency present in the input signal (the Nyquist criterion), otherwise the signal is aliased and the samples no longer represent the waveform correctly. Furthermore, the bit depth, the number of bits used to store each sample, is another factor on which the quality of the recording depends \cite{PeterElsea}.
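The effect of the sample rate and of the number of bits per sample can be illustrated with a small sketch. It is only an illustration under stated assumptions: the helper names \texttt{sample\_sine}, \texttt{quantize} and \texttt{max\_quantization\_error}, as well as the chosen tone frequencies, are hypothetical and do not come from the cited sources.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, num_samples):
    """Sample a unit-amplitude sine wave of freq_hz at sample_rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

def quantize(samples, bits_per_sample):
    """Round samples in [-1, 1] to signed integers with the given number of bits."""
    max_level = 2 ** (bits_per_sample - 1) - 1
    return [round(s * max_level) for s in samples]

def max_quantization_error(samples, bits_per_sample):
    """Largest absolute difference between the samples and their quantized values."""
    max_level = 2 ** (bits_per_sample - 1) - 1
    quantized = quantize(samples, bits_per_sample)
    return max(abs(s - q / max_level) for s, q in zip(samples, quantized))

SAMPLE_RATE = 8000  # Hz (assumed); tones above 4000 Hz cannot be represented

tone_1k = sample_sine(1000, SAMPLE_RATE, 16)  # below half the sample rate
tone_7k = sample_sine(7000, SAMPLE_RATE, 16)  # above half the sample rate

# The 7 kHz tone produces exactly the samples of a sign-inverted 1 kHz tone,
# so after sampling the two tones can no longer be told apart (aliasing).
aliased = all(abs(a + b) < 1e-9 for a, b in zip(tone_1k, tone_7k))

# More bits per sample give a smaller rounding error.
err_8 = max_quantization_error(tone_1k, 8)
err_16 = max_quantization_error(tone_1k, 16)
```

In this sketch the 7 kHz tone is indistinguishable from a 1 kHz tone once sampled at 8 kHz, and the rounding error shrinks when going from 8 to 16 bits per sample.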

For example, 8,000 Hz is the standard telephone rate, 44,100 Hz is the rate used for audio CDs (with 22,050 Hz, half of it, also in common use), and 48 kHz and higher sampling rates are used in high-quality video recordings and on professional video equipment.


Given the above, the digital signal is a long and complex waveform, and using it directly to build a classifier, or for any other Music Information Retrieval (MIR) application, is not very practical, as it requires powerful hardware. Even with a sampling rate lower than the common 44,100 Hz, for example half of it, 22,050 Hz, the number of values that have to be stored and processed during training is still huge: 22,050 values are stored for every second, so a one-minute song consists of 1,323,000 samples. As a result, the samples, that is, the waveform of the audio piece, are not used directly. The audio has to be processed further before it is ready to be used.\newline
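The arithmetic behind these numbers can be checked with a short sketch. The function names and the default of 16 bits per sample on a single channel are assumptions made only for this illustration.

```python
def num_samples(sample_rate_hz, duration_s):
    """Number of samples in a recording of the given duration."""
    return sample_rate_hz * duration_s

def raw_size_bytes(sample_rate_hz, duration_s, bits_per_sample=16, channels=1):
    """Uncompressed size in bytes (16-bit mono is an assumed default)."""
    total_bits = num_samples(sample_rate_hz, duration_s) * channels * bits_per_sample
    return total_bits // 8

one_minute_at_22k = num_samples(22_050, 60)      # the one-minute example from the text
three_minutes_at_44k = num_samples(44_100, 180)  # a three-minute song at the CD rate
one_minute_bytes = raw_size_bytes(22_050, 60)    # same minute stored as 16-bit mono
```

Under these assumptions, one minute at 22,050 Hz gives 1,323,000 samples, and a three-minute song at 44,100 Hz already gives 7,938,000 samples per channel.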

For this reason, we extract features from the music piece. Some features are derived from the digital audio signal itself, while others are not.

McKay and Fujinaga \cite{McKay2004AutomaticGC} proposed that features can be categorized into low-level, high-level and cultural features.
Low-level features are obtained from the audio signal itself and carry no direct musical meaning. High-level features, on the other hand, convey musical information, such as the instruments used. Finally, cultural features are based on social information \cite{Bogdanov2009FromLT}\cite{McKay2004AutomaticGC}. \newline

In this section, and in the thesis in general, we focus on the low-level features of the audio.

Many low-level audio features have been proposed and applied in different music genre classification systems.
Some of the most important ones, which capture the spectral or temporal information of the audio, are:

Tempo, Spectrogram, Log spectrogram, Mel-spectrogram, Log mel spectrogram, Mel-Frequency Cepstral Coefficients (MFCCs), Chroma, Spectral Centroid, Spectral Contrast, Spectral Flatness, Spectral Rolloff, Spectral Flux, Tonnetz, Zero Crossing Rate and Root-Mean-Square (RMS) energy.\newline
% temporal envelope, tempo histogram,
% Spectral Spread, Spectral Flux, Measure
% 159_Paper5.pdf

First, to give an understanding of feature extraction, it is important to provide some information about Fourier analysis, as it constitutes the first step in the extraction of many features.

Figure \ref{fig:Audio Waveform} \cite{Haugen} shows an example of a sinusoidal waveform display of a recorded audio piece. The x axis represents time, while the y axis represents amplitude.\newline

\begin{figure}[htbp!] \centering
\caption[Audio Waveform]{Audio Waveform}
\label{fig:Audio Waveform}
\end{figure}

In this time-domain representation, we do not have enough information about what the sound is actually like.\newline
Therefore, we have to move to the frequency domain, the spectrum, by decomposing the time series. This is where the Fourier Transform (FT) is used.
The FT in general, and the Short-Time Fourier Transform (STFT) in particular, is frequently used in audio signal processing \cite{Cohen:1995:TAT:200604}.\newline

The mathematical definition of the STFT \cite{Allen1977AUA} is:

\begin{eqnarray}
X_m(\omega) &=& \sum_{n=-\infty}^{\infty} x(n) w(n-mR) e^{-j\omega n}\nonumber \\[10pt]
&=& \hbox{\sc DTFT}_\omega(x\cdot\hbox{\sc Shift}_{mR}(w))
\end{eqnarray}

where

\begin{eqnarray*}
x(n) &=& \hbox{input signal at time $n$}\\
w(n) &=& \hbox{length $M$\ window function (\textit{e.g.}, Hamming)}\\
X_m(\omega) &=& \hbox{DTFT of windowed data centered about time $mR$}\\
R &=& \hbox{hop size, in samples, between successive DTFTs.}
\end{eqnarray*}

We break the signal into windows (the time spans over which the signal is considered for processing) and calculate the Discrete Fourier Transform (DFT) of every window. The window function used is usually the Hamming window.
The data acquired in a window is called a frame.
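The framing and windowing step can be sketched as follows. This is a minimal illustration, not the exact procedure of the cited works: the function names are hypothetical, the symmetric Hamming formula with coefficients 0.54 and 0.46 is assumed, and for simplicity each frame starts at sample $mR$ instead of being centered about it.

```python
import math

def hamming(length):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_frames(signal, frame_length, hop):
    """Cut the signal into frames of frame_length samples, one every hop samples,
    and multiply each frame by the Hamming window."""
    window = hamming(frame_length)
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_length)])
    return frames

# A constant signal makes every frame equal to the window itself.
signal = [1.0] * 10
frames = windowed_frames(signal, frame_length=4, hop=2)
```

With a frame length of 4 and a hop size of 2, consecutive frames overlap by half, and the window attenuates the samples at both ends of each frame.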

Over the time period measured, the signal is decomposed into its frequency components, which are sinusoidal functions, each with its own amplitude and phase \cite{Allen1977AUA}\cite{All}\cite{Chauhan2015VoiceR}.

% J. B. Allen, ``Application of the short-time Fourier transform to speech processing and spectral analysis,'' Proc. IEEE ICASSP-82, pp. 1012-1015, 1982.

Figure \ref{fig:Audio Signal in Time and Frequency Domain} shows an example of an audio signal in the time and frequency domains:

\begin{figure}[htbp!] \centering
\caption[Audio Signal in Time and Frequency Domain]{Audio Signal in Time and Frequency Domain}
\label{fig:Audio Signal in Time and Frequency Domain}
\end{figure}


In the frequency domain we have only frequency information. Ideally we would also have temporal information, as this is also how humans perceive sounds in the cochlea. Therefore, time/frequency representations (spectrograms) are used.\newline

Spectrograms are a time series of frequency compositions. The Short-Time Fourier Transform is applied over a short time span in which the signal is approximately stationary, and the resulting spectrum is rotated by 90 degrees. Placing these spectra side by side in time order creates the spectrogram.\newline

Figure \ref{fig:Audio in Time, Frequency and Time/Frequency Domains} \cite{DavidForsyth} presents an audio piece in its three representations: waveform (time domain), spectrum (frequency domain) and spectrogram (time/frequency domain). With this frequency-over-time representation, it is possible to compare sounds as images.
Details about the features that are extracted in this thesis are given below.

\begin{figure}[htbp!] \centering
\caption[Audio in Time, Frequency and Time/Frequency Domains]{Audio in Time, Frequency and Time/Frequency Domains}
\label{fig:Audio in Time, Frequency and Time/Frequency Domains}
\end{figure}


\subsubsection{Spectrogram}
The spectrogram of an audio wave is extracted by following the steps described above, which can be summarized as:

\begin{enumerate}
\item Divide the signal into frames.
\item Compute the amplitude spectrum of each frame using the Short-Time Fourier Transform.
\end{enumerate}

The spectrogram is essentially a series of short-term DFTs.
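The two steps can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, a symmetric Hamming window is assumed, and the DFT is evaluated directly from its definition, which is far slower than the FFT routines a real system would use.

```python
import cmath
import math

def hamming(length):
    """Symmetric Hamming window (assumed window choice)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def dft_magnitudes(frame):
    """Amplitude spectrum of one frame for the bins 0 .. len(frame) // 2."""
    size = len(frame)
    magnitudes = []
    for k in range(size // 2 + 1):
        coefficient = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size)
                          for n in range(size))
        magnitudes.append(abs(coefficient))
    return magnitudes

def spectrogram(signal, frame_length, hop):
    """Step 1: cut the signal into windowed frames.
    Step 2: compute the amplitude spectrum of every frame.
    Returns rows indexed as rows[frame_index][frequency_bin]."""
    window = hamming(frame_length)
    rows = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_length)]
        rows.append(dft_magnitudes(frame))
    return rows

# Example: a 1000 Hz tone sampled at 8000 Hz. With 64-sample frames the bins
# are 8000 / 64 = 125 Hz apart, so the tone falls exactly on bin 8.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone, frame_length=64, hop=32)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of \texttt{spec} is the spectrum of one frame, and for this stationary tone every frame has its largest value in the same frequency bin.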

\subsubsection{Log Spectrogram}
Humans do not perceive loudness linearly but approximately logarithmically, so it is usual to take the logarithm of the amplitude, creating the Log Spectrogram \cite{Rabiner:1993:FSR:153687}\cite{Logan2000MelFC}.
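A minimal sketch of this logarithmic scaling is given below. The decibel convention $20\log_{10}$ for amplitudes, the reference value of 1.0 and the small floor that avoids taking the logarithm of zero are assumptions of this illustration, and the function names are hypothetical.

```python
import math

def amplitude_to_db(amplitude, reference=1.0, floor=1e-10):
    """Convert an amplitude to decibels relative to reference.
    Amplitudes below floor are clipped so that log10(0) never occurs."""
    return 20.0 * math.log10(max(amplitude, floor) / reference)

def log_spectrogram(spec):
    """Apply the decibel conversion to every value of a spectrogram,
    given as a list of rows (one row of amplitudes per frame)."""
    return [[amplitude_to_db(a) for a in row] for row in spec]

# A tenfold change in amplitude corresponds to a step of 20 dB.
spec = [[1.0, 0.1], [10.0, 0.0]]
log_spec = log_spectrogram(spec)
```

In this example the amplitudes 0.1, 1.0 and 10.0 map to equally spaced values of -20, 0 and 20 dB, while the zero amplitude is clipped to the floor.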

\subsubsection{Mel Spectrogram and Log Mel Spectrogram}
The Mel Spectrogram and the Log Mel Spectrogram are used because the human ear does not perceive frequency and pitch linearly (similar to the case of loudness). The cochlea acts like a bank of filters, concentrating on and emphasizing only certain frequencies. The corresponding perceptual scale is called the mel scale.

With mel spectrograms the spectrum is smoothed and the most important frequencies are emphasized, which better approximates the way the human ear perceives sound. Humans resolve pitch changes better at low frequencies than at high ones. As a result, more mel frequency filters are placed in the low-frequency region and fewer in the high-frequency region.

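To map the spectrum onto the mel scale, we use triangular overlapping windows like those in Figure \ref{fig:Mel frequency filters}. The mathematical formula \cite{OShaughnessy2000SpeechC} for converting a frequency $f$ in Hz to its value $m$ on the mel scale is:

\begin{equation}
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)
\end{equation}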

Taking the logarithm of the power in each mel band is also common, and is required for extracting the MFCC feature \cite{Rabiner:1993:FSR:153687}\cite{Logan2000MelFC}.
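The mel mapping and the triangular filters can be sketched as follows. The sketch assumes the commonly used conversion $m = 2595\log_{10}(1 + f/700)$ and filters whose edges are equally spaced on the mel scale. All function names, the number of filters and the frequency range are hypothetical choices for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in Hz to mels, assuming m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edge_frequencies(num_filters, f_min_hz, f_max_hz):
    """num_filters + 2 frequencies in Hz, equally spaced on the mel scale.
    Filter i uses entries i, i + 1 and i + 2 as left edge, peak and right edge."""
    m_min, m_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_max - m_min) / (num_filters + 1)
    return [mel_to_hz(m_min + i * step) for i in range(num_filters + 2)]

def triangular_weight(f_hz, left, center, right):
    """Weight of a triangular filter that is 0 at the edges and 1 at the peak."""
    if f_hz <= left or f_hz >= right:
        return 0.0
    if f_hz <= center:
        return (f_hz - left) / (center - left)
    return (right - f_hz) / (right - center)

def mel_filterbank_energies(power_spectrum, sample_rate_hz, num_filters):
    """Weighted sum of the power spectrum (bins from 0 Hz to sample_rate / 2)
    under each triangular mel filter."""
    bin_hz = (sample_rate_hz / 2.0) / (len(power_spectrum) - 1)
    edges = mel_edge_frequencies(num_filters, 0.0, sample_rate_hz / 2.0)
    energies = []
    for i in range(num_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        energies.append(sum(p * triangular_weight(k * bin_hz, left, center, right)
                            for k, p in enumerate(power_spectrum)))
    return energies

edges = mel_edge_frequencies(10, 0.0, 4000.0)
spacings = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]
energies = mel_filterbank_energies([1.0] * 129, sample_rate_hz=8000, num_filters=10)
```

The spacing between neighbouring filter frequencies grows with frequency, which is why the filters are dense in the low-frequency region and sparse in the high-frequency region.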

\begin{figure}[htbp!] \centering
\caption[Mel frequency filters]{Mel frequency filters}
\label{fig:Mel frequency filters}
\end{figure}

\subsubsection{Mel Frequency Cepstral Coefficients}

Mel Frequency Cepstral Coefficients (MFCCs) are among the most widely used features in music and speech recognition, as they capture the characteristics of the human voice. In audio and speech recognition and classification they are used more often than plain mel spectrograms.

Once again, MFCCs are based on the way the human ear perceives sound. In particular, human pitch perception is approximately linear in frequency below about 1000 Hz and approximately logarithmic above it.

To extract this feature, after (a) dividing the audio signal into frames, (b) applying the Fourier Transform, and (c) applying the mel filters and taking the logarithm, (d) we smooth the log mel spectrogram with one more frequency transform, the Discrete Cosine Transform (DCT) \cite{doi:10.1080/03043799808928258}\cite{Muda2010VoiceRA}\cite{Chauhan2015VoiceR}.
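The final step (d) can be sketched as follows, starting from the mel band energies of one frame. The sketch assumes an unnormalized type-II DCT and the natural logarithm. Libraries often apply an additional scaling, so absolute values may differ. The function names and the number of coefficients kept are hypothetical.

```python
import math

def dct_ii(values):
    """Unnormalized type-II DCT: X[k] = sum_n x[n] * cos(pi * k * (n + 0.5) / N)."""
    size = len(values)
    return [sum(v * math.cos(math.pi * k * (n + 0.5) / size)
                for n, v in enumerate(values))
            for k in range(size)]

def mfcc_from_mel_energies(mel_energies, num_coefficients, floor=1e-10):
    """Take the log of the mel band energies of one frame, apply the DCT,
    and keep the first num_coefficients coefficients."""
    log_energies = [math.log(max(e, floor)) for e in mel_energies]
    return dct_ii(log_energies)[:num_coefficients]

# A perfectly flat mel spectrum has no spectral shape, so only the
# zeroth coefficient (the overall level) is non-zero.
flat_mfcc = mfcc_from_mel_energies([math.e] * 8, num_coefficients=4)
```

Keeping only the first few coefficients retains the smooth overall shape of the log mel spectrum and discards its fine detail, which is the smoothing referred to in step (d).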

\subsubsection{Deltas and Delta-Deltas}
In sound, the changes in the cepstral features over time are very important. Hence, we use deltas, which represent the change of the MFCCs over time.

The delta feature gives the velocity, while the delta-delta (double delta) gives the acceleration \cite{Muda2010VoiceRA}.
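One common way to compute deltas is a regression over a few neighbouring frames, sketched below. The specific formula, the default of two frames on each side and the repetition of the first and last frame at the boundaries are assumptions of this illustration, not necessarily the variant used in the cited work. The function name is hypothetical.

```python
def delta(features, width=2):
    """Delta of a feature sequence given as features[frame][coefficient].
    Uses d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), n = 1..width,
    repeating the first and last frame at the boundaries."""
    num_frames = len(features)
    denominator = 2 * sum(n * n for n in range(1, width + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for c in range(len(features[t])):
            numerator = 0.0
            for n in range(1, width + 1):
                ahead = features[min(t + n, num_frames - 1)][c]
                behind = features[max(t - n, 0)][c]
                numerator += n * (ahead - behind)
            row.append(numerator / denominator)
        deltas.append(row)
    return deltas

# Toy "MFCC" sequence: coefficient 0 grows linearly, coefficient 1 quadratically.
mfcc = [[float(t), float(t * t)] for t in range(10)]
velocity = delta(mfcc)          # deltas
acceleration = delta(velocity)  # delta-deltas
```

Away from the boundaries, the linear coefficient has a constant delta of 1 and a delta-delta of 0, while the quadratic coefficient has a delta of $2t$ and a constant delta-delta of 2.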


% Summary: Process of Feature
% Extraction
% - Speech is analyzed over short analysis window
% - For each short analysis window a spectrum is obtained
% using FFT
% - Spectrum is passed through Mel-Filters to obtain Mel-Spectrum
% - Cepstral analysis is performed on Mel-Spectrum to
% obtain Mel-Frequency Cepstral Coefficients
% - Thus speech is represented as a sequence of Cepstral
% vectors
% - It is these Cepstral vectors which are given to pattern
% classifiers for speech recognition purpose

\subsubsection{Chroma}
Humans perceive pitch as periodic: in the helical model of pitch, pitches that are one or more octaves apart lie directly above one another and are perceived as similar (they have the same harmonic role). The chroma range is usually the twelve pitch classes of the chromatic scale (12 bins).

In chroma representations (chromagrams), for each time step in the spectrogram, the amplitudes of the frequencies with the same chroma bin are summed \cite{Jiang}.
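The summation can be sketched for a single spectrogram frame as follows. The sketch assumes twelve-tone equal temperament with A4 tuned to 440 Hz, assigns each frequency bin to its nearest pitch class, and ignores bins below a hypothetical lower limit of 27.5 Hz. The function names are illustrative.

```python
import math

def pitch_class(freq_hz, a4_hz=440.0):
    """Chroma bin 0..11 (0 = C, 9 = A) of the equal-tempered pitch nearest to freq_hz."""
    midi_number = 69 + 12 * math.log2(freq_hz / a4_hz)
    return int(round(midi_number)) % 12

def chroma_vector(magnitudes, sample_rate_hz, f_min_hz=27.5):
    """Sum the magnitudes of one frame (bins from 0 Hz to sample_rate / 2)
    into 12 chroma bins, skipping bins below f_min_hz."""
    bin_hz = (sample_rate_hz / 2.0) / (len(magnitudes) - 1)
    chroma = [0.0] * 12
    for k, magnitude in enumerate(magnitudes):
        freq_hz = k * bin_hz
        if freq_hz < f_min_hz:
            continue
        chroma[pitch_class(freq_hz)] += magnitude
    return chroma

# Toy frame with 10 Hz bin spacing: energy at 220 Hz (A3) and 440 Hz (A4).
frame = [0.0] * 401
frame[22] = 1.0   # 220 Hz
frame[44] = 2.0   # 440 Hz
chroma = chroma_vector(frame, sample_rate_hz=8000)
```

Both tones are an octave apart, so their magnitudes end up in the same chroma bin (A) and all other bins stay empty.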

% coefficients that belong to the same chroma are summed.
% that are perceived as similar, are presented as approximate colors.
% A chroma representation of an audio can be derived by summing up all pitch coefficients that belong to the same chroma.

\subsubsection{Spectral Contrast}
Spectral Contrast presents ``the decibel difference between peaks and valleys in the spectrum'' \cite{Yang2002SpectralCE}.
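A simplified sketch of this idea is given below. It is not the exact algorithm of the cited work: here the peak and the valley of a sub-band are assumed to be the means of its largest and smallest magnitudes, the fraction of values averaged and the band edges are hypothetical, and the difference is expressed with the $20\log_{10}$ decibel convention.

```python
import math

def band_contrast_db(magnitudes, fraction=0.2, floor=1e-10):
    """Decibel difference between the peak and the valley of one sub-band.
    Peak and valley are the means of the largest and smallest
    fraction of the magnitudes (at least one value each)."""
    ordered = sorted(magnitudes)
    count = max(1, int(round(fraction * len(ordered))))
    valley = sum(ordered[:count]) / count
    peak = sum(ordered[-count:]) / count
    return 20.0 * math.log10(max(peak, floor) / max(valley, floor))

def spectral_contrast(magnitudes, band_edges):
    """Contrast of every sub-band; band_edges lists the bin indices
    that separate consecutive sub-bands."""
    return [band_contrast_db(magnitudes[low:high])
            for low, high in zip(band_edges[:-1], band_edges[1:])]

# First band: one strong peak over a weak background. Second band: flat.
frame = [1.0, 1.0, 10.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
contrast = spectral_contrast(frame, band_edges=[0, 5, 10])
```

A band with a clear peak yields a high contrast (20 dB in this example), whereas a flat, noise-like band yields a contrast close to 0 dB.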

\subsubsection{Tonnetz}
The Tonnetz is a geometric representation that visualizes the relations between different notes or pitches \cite{LeonhardEuler}.

% of equal-tempered pitch intervals grounded in music theory.

% ------------------------------ 2.2 ------------------------------

\section{Previous research}

Over the last two decades, several studies have been carried out in the area of automatic music genre classification, and many different approaches and machine learning algorithms have been tested.

The state of the art in automatic music genre classification is briefly presented below. Because each study uses different databases and a different number and type of genres, comparing them by accuracy is not completely objective.

The information given below is not a complete account of all the features and systems used in automatic genre classification, but rather a presentation of the milestones in the field.\newline

In general, the following traditional techniques have been explored in the field of automatic music genre classification:
\begin{itemize}
\item Gaussian models \cite{Tzanetakis2002MusicalGC}

\item Gaussian Mixture Models \cite{Tzanetakis2001AutomaticMG}

\item Support Vector Machines \cite{Xu2003MusicalGC}

\item Hidden Markov Models \cite{Jiang2}

\item k-Nearest Neighbour classifiers \cite{Tzanetakis2002MusicalGC}

\item Linear Discriminant Analysis \cite{Tzanetakis2001AutomaticMG}

\item Neural Networks \cite{Pons2016ExperimentingWM}\cite{pikrakis}
\end{itemize}
% \cite{Dieleman}


% Random Forests[] % Linear Prediction Coding
% Explicit Time Modelling with Neural Network
% Tree-based Vector Quantization

A number of research works in the field of music genre recognition are presented below:
\begin{itemize}
\item Tzanetakis et al. were the first to introduce automatic music genre classification, proposing classification based on timbral texture features (Spectral Centroid, Spectral Rolloff, Spectral Flux, MFCC), rhythmic content (Beat Histogram) and pitch content (Pitch Histogram), using a Gaussian Mixture Model classifier. They achieved an accuracy of 61\% on ten genres. They subsequently also used Support Vector Machines and Linear Discriminant Analysis \cite{Tzanetakis2002MusicalGC}\cite{Tzanetakis2001AutomaticMG}.

\item Li et al. proposed a new feature extraction technique, the Daubechies Wavelet Coefficient Histogram (DWCH), and concluded that for music genre classification the timbral texture provides better results than rhythmic or pitch content. They achieved an accuracy of 61\% on 10 genres \cite{Li2003ACS}.

\item Jiang et al. proposed that spectral contrast features give better results than MFCC \cite{Jiang2}.

\item Xu et al. used Support Vector Machines with MFCC, LPC-derived cepstrum, spectrum power, ZCR and beat spectrum features \cite{Xu2003MusicalGC}.

% Meng et. al. [28] % Lidy et. al. [22]

\item Bergstra et al. proposed classification using a collection of audio frames, instead of extracting a single feature per song. They achieved an accuracy of 82.5\% for 13.4-second segments \cite{Bergstra2006AggregateFA}.

\item McKay and Fujinaga used a hierarchical classification system and considered the combination of features \cite{McKay2008CombiningFE}\cite{McKay2010ImprovingAM}.

\item Schindler and Rauber proposed a combination of audio and visual features, with positive results \cite{Schindler2015AnAA}.

\item Pikrakis proposed the use of deep networks, using the rhythm signature of the audio pieces \cite{pikrakis}.

\item Recently, with the rapid development of deep learning, many authors have proposed feature learning (of spectral or temporal features) \cite{Humphrey2013FeatureLA}\cite{Sigtia2014ImprovedMF}\cite{Deshpande}\cite{Hamel2010LearningFF}.


% Il-Young Jeong and Kyogu Lee
% Music and Audio Research Group Graduate School of Convergence Science and Technology, Seoul National University, Korea

\item With the development of machine learning, many researchers have used convolutional neural networks for automatic music genre classification.

% [] % [5][8] % [13] % M. S. Keunwoo Choi, Gyorgy Fazekas. Automatic tagging
% using deep convolutional neural networks. 2016.

\item Other researchers have used deep recurrent neural networks \cite{Freitag2017auDeepUL}\cite{Irvin2016RecurrentNN}.
\end{itemize}

% [Chun Pui Tang, Ka Long Chui, Ying Kin Yu, Zhiliang Zeng, Kin Hong Wong]


Even though deep learning has achieved very good results in computer vision, and new algorithms that improve on these results appear continuously, research in automatic music genre recognition is not as popular.