1. INTRODUCTION
A journey of a thousand miles begins with a single step. Likewise, the journey of mastering a musical instrument begins, for an unskilled learner, with repeated practice of new music scores. Beginners have to struggle to play every note accurately and keep the tempo correctly, and at times they face difficulties when they must practice without a teacher. At this point, a music learning assistant system with score-tracking capability comes to the rescue. A music learning assistant is a system designed to help beginning learners practice music using real-time music score reading and feature tracking. An additional function required of the system is audio-to-score alignment, which involves synchronizing the audio performance produced by the learner with the musical symbols in the score [1,2,3,4].
The music score-following problem has been studied by a number of researchers. It was first introduced by Dannenberg, who used classical approximate string matching and heuristic techniques [2,4]. Since then, studies using stochastic approaches have begun to appear. Raphael was a pioneer in applying HMMs to music score following [2,3]. Later on, researchers began to apply HMMs to pitch class profiles, or chroma features, and reported some interesting results [5,6].
In their study of optical music recognition, Rebelo et al. [7] surveyed systems consisting of image preprocessing, music symbol recognition, musical notation reconstruction and final representation construction, as shown in Fig. 1. For music symbol recognition, various methods have been published to date, including neural networks (NN), k-nearest neighbours (kNN), hidden Markov models (HMM) and support vector machines (SVM). Among them, SVM has been the most popular method, exhibiting the best performance in the recognition of both handwritten and synthetic music scores [7].
Fig. 1. Typical architecture of an OMR system [7].
Motivated by those previous research efforts, this paper presents the design of a music learning assistant based on audio-visual analysis. Two popular pattern recognition methods are employed: SVM for visual analysis and HMM for audio decoding.
In the rest of the paper, Section 2 explains the design of the proposed music learning assistant for audio-visual analysis. Section 3 then covers feature extraction for the pattern recognition methods described in Section 4. Section 5 discusses the experimental results, followed by the conclusion in Section 6.
2. DESIGN OF MUSIC LEARNING ASSISTANT
The design of the proposed music learning assistant is illustrated in Fig. 2. There are two types of inputs, music score images and audio music signals. Accordingly, we have two tasks: music score recognition and music signal transcription.
Fig. 2. Organization of the music learning assistant.
For each task we employ a popular pattern recognition method: SVM for the music score recognition task and HMM for the music signal transcription task. The success of the system lies in the optimality of the model parameters. In order to estimate the optimal set of model parameters, the system undergoes a training phase for both the SVM classifier and the HMM.
3. FEATURE EXTRACTION
3.1 Histogram of Oriented Gradients Features
The Histogram of Oriented Gradients (HOG) was first introduced as a high-performance method of detecting pedestrians in video [8]. The feature has since been applied to digit and character recognition tasks as well [9,10]. In our research, we employ HOG as the descriptor for music symbol recognition.
HOG is defined by the number of occurrences of each gradient orientation within local parts of an image [9]. Given an input image divided into a set of small regions called cells, we compute a histogram of gradient or edge directions for each cell [8,9,10]. Fig. 3 shows HOG features with different cell sizes. Smaller cells capture more spatial information but increase the number of feature dimensions, and vice versa. In this research we choose the 2 × 2 cell size, since it provides more spatial information and yields a higher recognition rate. The HOG features are extracted using the method developed in [10].
Fig. 3. (a) A music symbol image, (b) its HOG features using 2×2 cells, (c) 8×8 cells and (d) the magnification of a particular cell.
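The per-cell histogram computation described above can be sketched as follows. This is a minimal illustration with a single global normalisation (the overlapping block normalisation of [8] is omitted), not the actual extraction code of [10]:

```python
import numpy as np

def hog_features(img, cell=2, bins=9):
    """Minimal HOG sketch: one unsigned-orientation histogram per cell.

    img  -- 2-D grayscale array whose sides are multiples of `cell`
    cell -- cell size in pixels (2 here, following the 2 x 2 choice above)
    bins -- number of orientation bins over [0, 180) degrees
    """
    gy, gx = np.gradient(img.astype(float))        # row/column gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    h, w = img.shape
    hists = []
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            m = mag[r:r + cell, c:c + cell].ravel()
            a = ang[r:r + cell, c:c + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0.0, 180.0), weights=m)
            hists.append(hist)
    v = np.concatenate(hists)
    return v / (np.linalg.norm(v) + 1e-12)         # global L2 normalisation
```

An 8 × 8 symbol patch with 2 × 2 cells thus yields 4 × 4 × 9 = 144 dimensions; larger cells reduce the dimensionality at the cost of spatial detail.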
3.2 Chroma Features
In Western music notation there are 12 pitches in an octave, denoted by the ordered set of symbols {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}. They form a cycle and repeat in the octaves below and above [11,12], as shown in Fig. 4. The distance between two adjacent notes is defined as a half-step, and the distance we perceive as a half-step is equal in all octaves. We capture this perceptual distance using the chroma features developed for music analysis [11,12].
Fig. 4. Twelve semitones arranged in a cycle in Western music notation.
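The cyclic repetition of the twelve pitch classes can be expressed directly on MIDI note numbers; a small illustrative helper, not part of the proposed system:

```python
# The twelve pitch classes of Western notation, in cyclic order.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(midi_note):
    """Fold a MIDI note number onto its pitch class; the octave disappears."""
    return PITCH_CLASSES[midi_note % 12]
```

For instance, MIDI notes 60 (C4) and 72 (C5) fold onto the same class C, which is exactly the equivalence the chroma features exploit.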
Chroma features are used to represent audio music signals where the entire spectrum for a short time frame is projected onto twelve bins corresponding to the twelve distinct semitones in the chromatic scale [12].
Given a music signal as shown in Fig. 5(a), we compute chroma features like those in Fig. 5(b) by summing the log-frequency magnitude spectrum across octaves [11] as follows:

C(b) = Σz=0…Z−1 |Xlf(b + zβ)|,  b = 0, 1, …, β − 1

where Xlf is the log-frequency spectrum, Z the number of octaves, b the integer pitch-class index and β the number of bins per octave.
Fig. 5. (a) Input audio music signal and (b) the corresponding chromagram.
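Under this formula, the chroma computation reduces to folding the log-frequency magnitude spectrum octave by octave; a minimal sketch, assuming Xlf has already been sampled at β bins per octave:

```python
import numpy as np

def chroma(X_lf, beta=12):
    """Fold a log-frequency magnitude spectrum into beta pitch-class bins.

    X_lf -- 1-D array covering Z octaves at beta bins per octave (length Z*beta)
    Returns C with C[b] = sum over octaves z of |X_lf[b + z*beta]|.
    """
    Z = len(X_lf) // beta                       # number of complete octaves
    return np.abs(np.asarray(X_lf[:Z * beta])).reshape(Z, beta).sum(axis=0)
```

Applied frame by frame to a short-time log-frequency spectrum, this yields the chromagram of Fig. 5(b).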
4. PATTERN RECOGNITION METHOD
4.1 Support Vector Machine (SVM)
The support vector machine (SVM) is basically a binary classification method that constructs a hyperplane in a high-dimensional space separating the samples of two classes [9,13].
SVM is designed using kernels, which are typically linear, polynomial, radial basis function (RBF) or sigmoid kernels [9]. Given feature vectors xa and xb, the kernel K is defined by the inner product of the mapped features ϕ(xa) = [ϕ1(xa), ϕ2(xa), …, ϕd(xa)]T and ϕ(xb):

K(xa, xb) = ϕ(xa)Tϕ(xb)

Once the kernel function is chosen, the classifier function f(x) can be written as

f(x) = Σi=1…l αi yi K(xi, x) + b

where α = [α1 … αl]T is the vector of l non-negative Lagrange multipliers to be determined, yi the target value of support vector xi, and b the bias [13].
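Once the support vectors, multipliers and bias are known, the classifier function can be evaluated directly; a sketch with an RBF kernel (the kernel choice and the gamma value here are illustrative, not the trained system's parameters):

```python
import numpy as np

def rbf_kernel(xa, xb, gamma=1.0):
    """RBF kernel K(xa, xb) = exp(-gamma * ||xa - xb||^2)."""
    return float(np.exp(-gamma * np.sum((np.asarray(xa) - np.asarray(xb)) ** 2)))

def svm_decision(x, support_vectors, alpha, y, b, gamma=1.0):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; sign(f) gives the class."""
    return sum(a * yi * rbf_kernel(sv, x, gamma)
               for a, yi, sv in zip(alpha, y, support_vectors)) + b
```

Multi-class symbol recognition is then obtained in the usual one-vs-one or one-vs-rest fashion over such binary classifiers.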
For music symbol recognition, we first specify the musical symbol classes, such as accidental, bar, brace, clef, digit, dot, note and rest, as illustrated in Table 1. Given the set of HOG features extracted from the segmented music symbols of the training set, we construct an SVM classifier using the toolbox provided in [10].
Table 1. Music symbol classes for SVMs
4.2 Hidden Markov Model (HMM)
4.2.1 Overview of HMM
The HMM is a generalization of the Markov chain. In a first-order Markov chain, the state at time t depends only on the single preceding state at time t−1 instead of on the whole history of the process [14,15]. In an HMM, each state qt of the Markov chain generates an observation ot at time t, as shown in Fig. 6.
Fig. 6. HMM.
The HMM is characterized by a number of parameters [14,15]: the number of states N; the set of discrete observation symbols V = {v1, v2, …, vM}; the state transition probabilities A = {aij}, aij = P(qt+1 = j | qt = i); the observation probabilities B = {bj(k)}, bj(k) = P(ot = vk | qt = j); and the initial state distribution π = {πi}. The model is often denoted simply by the triple λ = (A, B, π) of probabilistic parameters.
4.2.2 Continuous observation density HMM
In the previous section we defined V as the set of discrete observation symbols emitted from each state. However, since we are dealing with audio music captured as a continuous-valued signal, we employ continuous observation densities in the HMM.
A continuous-density HMM is often characterized by a parametric family of density functions, or a mixture of such density functions, in each state [14,15]. Assuming the use of Gaussian mixtures, the emission density of state j is defined as:

bj(o) = Σk=1…K wjk N(o; μjk, Σjk)

where K is the number of mixtures and wjk the mixing coefficient for the kth Gaussian in state j, subject to the following stochastic constraints:

wjk ≥ 0,  Σk=1…K wjk = 1

where N denotes the Gaussian density function with mean μjk ∈ Rd and covariance matrix Σjk ∈ Rd×d for the kth mixture.
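The mixture emission density above can be evaluated as follows; a minimal sketch assuming diagonal covariance matrices, which is not necessarily the covariance structure of the trained system:

```python
import numpy as np

def gmm_density(o, w, mu, var):
    """Emission density b_j(o) = sum_k w_k * N(o; mu_k, diag(var_k)).

    w   -- K mixing coefficients (non-negative, summing to one)
    mu  -- K mean vectors; var -- K diagonal-variance vectors
    """
    o = np.asarray(o, dtype=float)
    dens = 0.0
    for wk, mk, vk in zip(w, mu, var):
        mk = np.asarray(mk, dtype=float)
        vk = np.asarray(vk, dtype=float)
        norm = np.prod(2.0 * np.pi * vk) ** -0.5          # Gaussian normaliser
        dens += wk * norm * np.exp(-0.5 * np.sum((o - mk) ** 2 / vk))
    return dens
```

With K = 1 this reduces to the single Gaussian per state used in the experiments of Section 5.2.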
4.2.3 Model Topology
In music transcription tasks, the pitches to be modeled are {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, according to Western music notation. By using an HMM for the music transcription task, we try to describe the pitch trajectory given a sequence of chroma features of the music signal. A piece of music can be viewed as an ordered sequence of timed notes, but basically any note can be followed by any of the twelve notes. Based on this, we design a continuous HMM with 13 states: 12 states for the 12 pitches and one state for silence. In addition, we choose the fully connected ergodic topology shown in Fig. 7 for our transcription HMM. Since the input frame rate is generally much faster than the progression of the corresponding musical notes, the self-loop transition parameters aii, i = 1, 2, …, N dominate the distribution in each row of the transition matrix.
Fig. 7. HMM topology for music pitch transcription.
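Such an ergodic matrix with dominant self-loops can be written down directly; the self-loop mass of 0.9 below is an assumed illustrative value, not a trained parameter:

```python
import numpy as np

N = 13             # 12 pitch states plus one silence state
SELF_LOOP = 0.9    # assumed: frames arrive much faster than notes change

# Fully connected (ergodic) topology: the remaining probability mass is
# spread evenly over the other N-1 states in each row.
A = np.full((N, N), (1.0 - SELF_LOOP) / (N - 1))
np.fill_diagonal(A, SELF_LOOP)
```

Each row is a proper probability distribution; Baum-Welch re-estimation would then refine such an initial matrix from training data.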
4.2.4 Viterbi Decoding
As in many practical applications, the HMM is trained with the Baum-Welch algorithm, while decoding employs the Viterbi algorithm, which permits a detailed analysis of the model behaviour; it is also adopted here for pitch tracking. The Viterbi algorithm is the most popular technique for finding the optimal path in an HMM given an observation sequence, that is, the single best state sequence most likely to have produced the observations [14,15].
Using the trained HMM parameters λ, we aim to find the single most likely pitch sequence q = (q1, q2, …, qt) given the chroma features as an observation sequence O = (o1, o2, …, ot). In this case we define

δt(i) = max over q1, …, qt−1 of P(q1, …, qt−1, qt = i, o1, …, ot | λ)

as the probability along a single best partial path that ends in state i at time t with the partial output sequence [6,14,15]. By induction, we can calculate the probability at time t + 1 as

δt+1(j) = [maxi δt(i) aij] bj(ot+1)
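This recursion, together with backtracking pointers, can be sketched in the log domain (log probabilities avoid the numerical underflow of long products):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence for an HMM.

    log_pi -- (N,)  log initial state probabilities
    log_A  -- (N,N) log transition matrix, log_A[i, j] = log a_ij
    log_B  -- (T,N) log emission score of each state for each frame
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]                  # delta_1(i)
    psi = np.zeros((T, N), dtype=int)          # backtracking pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j] = delta_t(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]  # delta_{t+1}(j)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # trace the pointers backwards
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```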
In online decoding tasks based on the Viterbi algorithm, information about future notes is not available. This problem can be handled by approximate decoding over L local frames of chroma features stored in an observation buffer, as proposed in [6]. A correspondingly modified Viterbi algorithm decodes this short-term sequence, as shown in Fig. 8 [6]. To save computation, each buffer decoding calculation δt(i) reuses the calculations δt−1(i) of the previous buffer, except for the initial δ1(i). As for the decoding results of the observations within a buffer, we employ a voting scheme so that each buffer yields a single decoding result, which becomes the pitch estimate.
Fig. 8. Viterbi decoding workflow [6].
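The buffering-and-voting idea can be illustrated with a deliberately simplified stand-in: here each buffered frame is decoded greedily by per-frame argmax rather than by the modified Viterbi of [6], so only the voting mechanism is shown:

```python
from collections import Counter, deque

import numpy as np

def buffered_pitch_estimates(frame_scores, L=5):
    """Emit one voted pitch estimate per incoming frame.

    frame_scores -- iterable of per-frame state score vectors
    L            -- observation buffer length
    """
    buf = deque(maxlen=L)                      # sliding observation buffer
    estimates = []
    for scores in frame_scores:
        buf.append(int(np.argmax(scores)))     # greedy per-frame decode (stand-in)
        vote, _ = Counter(buf).most_common(1)[0]
        estimates.append(vote)                 # one voted result per buffer
    return estimates
```

The vote suppresses isolated single-frame decoding errors, which is the intended smoothing effect of the buffer.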
5. EXPERIMENTS
Since there are two subtasks in the music learning assistant, two sets of experimental results are described separately: first music score recognition using the SVM classifier, and then pitch transcription of the music using the HMM.
5.1 Music score recognition
In order to train the music symbol classifier, we prepared a training set as described in Table 2. The music symbols in the training data are obtained from several beginner piano music scores that have already been processed through staff-line removal, gap stitching and segmentation. We then extract HOG features and use them as training data for the SVM classifier.
Table 2. Music symbol training data
The trained SVM classifier is used to recognize music symbols in test music scores. The results of music symbol recognition are shown in Table 3. The overall average performance of the classifier is 96.02% correct recognition on beginner's piano music scores.
Table 3. Music score recognition results
Some misclassification cases are shown in Table 4. Most of the errors are attributed to missegmentation. The current system handles only single-head notes, so it fails on multiple-head notes. When this problem is resolved in future work, we can expect higher performance in music score recognition.
Table 4. Misclassification cases
Once all symbols are recognized, we can create a reconstruction matrix for the music score, and then finally convert it into a MIDI file using a method provided in [16]. Fig. 9 shows a sample result.
Fig. 9. (a) A music score and (b) the corresponding MIDI rendering.
5.2 Music transcription
For the music transcription task, we built a continuous HMM trained with the Baum-Welch algorithm, and then applied the Viterbi algorithm to decode the most likely sequence of pitches given the chroma features. Each state of the HMM models the local patterns of chroma features using a single Gaussian. When projected onto a feature space of reduced dimensionality, the thirteen Gaussians form an interesting pattern, as shown in Fig. 10.
Fig. 10. HMM training result.
Each dot represents a chroma feature vector, and each cluster of dots represents an HMM state corresponding to a particular pitch. Given a feature vector in context, the HMM returns the most likely pitch via the Gaussian densities represented by the ellipses. The Gaussian ellipse in the center represents silence, whereas all the other ellipses around it represent the corresponding pitches. The visualization shows that the HMM managed to model the pitches intuitively well.
The trained HMM is applied to label input audio music using the Viterbi algorithm. The parameters for online Viterbi decoding are a frame size of 2048 samples and an observation buffer of length L = 5. Fig. 11 shows a sample decoding result for a music scale performance: the 13-state HMM accurately segments and labels the pitches, including silence. Finally, given a beginner's piano performance of the familiar song "London Bridge is Falling Down", the HMM also turned in a very accurate result, shown in Fig. 12, which is deemed sufficient for music learning assistance.
Fig. 11. (a) A chromagram of a music scale performance and (b) the corresponding Viterbi decoding result.
Fig. 12. (a) Music performance chromagram of "London Bridge is Falling Down" and (b) the corresponding Viterbi decoding result.
There remains the problem of synchronization, defined as the task of aligning the audio music to the music score. In audio music we distinguish three events for each note, namely attack, sustain and rest [3]. By detecting these events in the audio, the system can estimate the starting and ending times of the notes. Based on these, the proposed system can estimate the tempo of the notes being played and compare it with the notated durations in the music score as the ground truth. If the results from the SVM music score classifier, the HMM decoder and the tempo estimation are properly synchronized, we can give useful feedback on the performance, namely whether the learner has played the pitches correctly and kept the tempo accurately with respect to the music score [17,18].
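As a rough illustration of attack detection, one could flag frames whose short-time energy jumps sharply relative to the previous frame; the threshold factor below is a hypothetical value, and the actual event models of [3] are more elaborate:

```python
import numpy as np

def attack_times(frame_energy, hop_seconds, jump=2.0):
    """Return the times (in seconds) of sudden energy rises.

    frame_energy -- short-time energy per analysis frame
    hop_seconds  -- hop between frames in seconds
    jump         -- hypothetical ratio marking a note attack
    """
    e = np.asarray(frame_energy, dtype=float)
    return [t * hop_seconds
            for t in range(1, len(e))
            if e[t] > jump * (e[t - 1] + 1e-9)]   # epsilon guards silence
```

The intervals between successive attacks then give a simple tempo estimate to compare against the notated durations.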
6. CONCLUSION
This research presents the design of a music learning assistant involving two popular pattern recognition methods: the SVM classifies the music symbols in a music score, while the HMM tracks the sequence of pitches in the audio. In the current implementation with limited training sets, the SVM achieves an average performance of about 96.02% correct symbol recognition, and the HMM decoding finds pitches almost perfectly, with a few errors only at note boundaries, off by just one or two frames. Finally, if the results of both models are integrated with proper synchronization based on attack times and beats, the proposed system will be able to give useful feedback on the learner's performance.
References
- P. Cano, A. Loscos, and J. Bonada, "Score-Performance Matching Using HMMs," Proceedings of the International Computer Music Conference, pp. 441-444, 1995.
- N. Orio, "An Automatic Accompanist Based on Hidden Markov Models," Proceedings of Advances in Artificial Intelligence, pp. 64-69, 2001.
- A. Cont, D. Schwarz, and N. Schnell, "Training IRCAM's Score Follower," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 253-256, 2005.
- B. Pardo and W. Birmingham, "Modeling Form for On-line Following of Musical Performances," Proceedings of the 20th National Conference on Artificial Intelligence, Vol. 2, pp. 1018-1023, 2005.
- A. Sheh and D. Ellis, "Chord Segmentation and Recognition Using EM-Trained Hidden Markov Models," Proceedings of the 4th International Symposium on Music Information Retrieval, pp. 185-191, 2003.
- T. Cho and J.P. Bello, "Real-Time Implementation of HMM-Based Chord Estimation in Musical Audio," Proceedings of the International Computer Music Conference, pp. 16-21, 2009.
- A. Rebelo, I. Fujinaga, F. Paszkiewicz, A.R.S. Marcal, C. Guedes, and J.S. Cardoso, "Optical Music Recognition: State-of-the-art and Open Issues," International Journal of Multimedia Information Retrieval, Vol. 1, No. 3, pp. 173-190, 2012. https://doi.org/10.1007/s13735-012-0004-6
- N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886-893, 2005.
- R. Ebrahimzadeh and M. Jampour, "Efficient Handwritten Digit Recognition Based on Histogram," International Journal of Computer Applications, Vol. 104, No. 9, pp. 10-14, 2014. https://doi.org/10.5120/18229-9167
- "Digit Classification Using HOG Features," MathWorks Documentation. http://www.mathworks.com/help/vision/examples/digit-classification-using-hog-features.html (accessed on June 2, 2016).
- J.P. Bello, "Chroma and Tonality," New York University Music Information Retrieval Course. http://www.nyu.edu/classes/bello/MIR.html (accessed on June 2, 2016).
- C. Tralie, "Musical Pitches and Chroma Features," Duke University Digital Signal Processing Course. http://www.ctralie.com/Teaching/ECE381_DataExpeditions_Lab1 (accessed on June 2, 2016).
- D. Boswell, "Introduction to Support Vector Machines," 2002. http://dustwell.com/PastWork/IntroToSVM.pdf (accessed on June 2, 2016).
- L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, pp. 257-286, 1989. https://doi.org/10.1109/5.18626
- M. Nilsson, First Order Hidden Markov Model: Theory and Implementation Issues, Blekinge Institute of Technology Research Report, pp. 9-54, 2005.
- K. Schutte, "MATLAB and MIDI." http://kenschutte.com/midi (accessed on June 2, 2016).
- A.W. Mulyadi and B.-K. Sin, "Chroma-Based Music Score-Following Using Hidden Markov Model," Proceedings of the Fall Conference of the Korea Multimedia Society, 2015.
- T. Hur, H. Cho, G. Nam, J. Lee, S. Lee, S. Park, et al., "A Study on the Implementation of the System of Content-based Retrieval of Music Data," Journal of Korea Multimedia Society, Vol. 12, No. 11, pp. 1581-1592, 2009.