LSTM: A first try

I was inspired by these two videos:

And since I had learnt a little about LSTMs before, I decided to train my own models on music. But first, a review of LSTMs. One recommended source is:

And this Wikipedia plot is quite useful once we have gone through the details.


Basically, the data flows through a combination of gates built from elementwise multiplications, summations, and sigmoid functions.
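For reference, the standard LSTM update equations (in the usual textbook notation, as a summary rather than a transcription of the plot's symbols) are:

$latex i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (input gate)

$latex f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (forget gate)

$latex o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (output gate)

$latex c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (cell state: forget part of the old memory, write part of the new candidate)

$latex h_t = o_t \circ \tanh(c_t)$ (hidden state / output)

So each gate is exactly a sigmoid followed by an elementwise multiplication, and the cell state update is the summation.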

I started with a simple example on text:

'''Example script to generate text from a text file.
Based on this example:
'''

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
from keras.callbacks import Callback
from keras.callbacks import ModelCheckpoint
import numpy as np
import random
import sys


def sample(preds, temperature=1.0):
    '''Helper function to sample an index from a probability array.'''
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


class LossHistory(Callback):
    '''Record the loss of each batch during training.'''
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))


# training data
path = "/Users/IrisYupingRen/Dropbox/csc530/module1/test.txt"

text = open(path).read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
chunk = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    chunk.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(chunk))

# use one-hot encoding for the inputs and targets
X = np.zeros((len(chunk), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(chunk), len(chars)), dtype=np.bool)
for i, sentence in enumerate(chunk):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# 80/20 train/test split
poi = int(len(X) * .8)
X_train = X[:poi]
y_train = y[:poi]

X_test = X[poi:]
y_test = y[poi:]

print('Build model...')

# build the model: a single LSTM layer plus a softmax over the characters
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
history = LossHistory()

for iteration in range(1, 60):
    print('-' * 50)
    print('Iteration', iteration)
    # actually fit the model
    model.fit(X_train, y_train, batch_size=128, nb_epoch=1,
              callbacks=[history])

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')

        for i in range(400):
            # one-hot encode the current seed sentence
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            # use the trained model to predict the next character
            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

        print(generated)

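The diversity values in the loop are temperatures for the sample function: below 1 the distribution is sharpened towards the most likely character, above 1 it is flattened. A minimal sketch of just the reweighting step, on a made-up three-character distribution:

```python
import numpy as np

def reweight(preds, temperature):
    # same transformation as in sample(), without the random draw
    preds = np.exp(np.log(np.asarray(preds, dtype='float64')) / temperature)
    return preds / preds.sum()

p = np.array([0.5, 0.3, 0.2])  # hypothetical next-char probabilities
low = reweight(p, 0.2)         # sharpened: mass piles onto the top char
high = reweight(p, 1.2)        # flattened: closer to uniform
```

With temperature 0.2 the top character takes over 90% of the mass; with 1.2 it drops below its original 0.5, which is why high diversity produces more "creative" (and more misspelled) text.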

When the program runs, it looks like this:


Making and enjoying music in Rochester, NY


The “Fringe” festival has been going on for a while and it’s been nice that there’s even more music in the city. Although the annoying part is the shortage of parking space, I love festivals: the international food, the street plays, the people rushing about and the happy spirits. This reminds me of the Fiddler’s Fair I went to a month ago. Music music!

Besides my ASE programme at the UofR, during a year and a month of study at the Eastman Community Music School (ECMS), the music has never stopped. From classical to jazz, from baroque to folk, every style has its own charms. This year I got into more theory, composition and history classes; still lots to be learnt.

Just next door to ECMS, there’s a JAVA’s cafe. I’d recommend it to anyone who comes to Rochester: a great place to chat and meet.


To finish it up, here’s a concert report for my history class, lots of fun:

Barbara’s Venice

From playing in the Baroque ensemble at ECMS, I started to know more about the Baroque style of playing: no/less vibrato, inégale, ornaments, etc. Although our instructors have been playing in lots of concerts, I had not yet been able to attend one, and I only wished I had time. But today there was a Baroque concert: on a Sunday, in the afternoon, with a proper autumn temperature and without any other commitments, I made it!

The concert was named Barbara’s Venice, one of the Pegasus early music series. It was nice to learn about this organisation, which has been organising early music concerts for 12 years. The composer under the spotlight was Barbara Strozzi, also called Barbara Valle, who was an Italian singer and composer. Her Baroque compositions were published in her lifetime under her own name, and she was the most prolific composer in mid-17th-century Venice. The concert was on the theme of showcasing her unique passionate voice along with music by her male contemporaries. There was sacred, secular, vocal and instrumental music by Strozzi, her teacher Cavalli, Castello, Marine, and others.

Her father, Giulio Strozzi, was a poet and librettist. He recognised Barbara as his adopted daughter; however, she was most likely the illegitimate daughter of Strozzi and Isabella Garzoni, his long-time servant and heir [1]. He encouraged Barbara’s musical talent and helped exhibit her vocal talent to a wider audience by creating an academy in which her performances could be validated and displayed. He also arranged for her to study with the composer Francesco Cavalli (the most influential composer in the rising genre of public opera in mid-17th-century Venice [3]), since she was also compositionally gifted. His texts appear in her early pieces many times; later texts were written by her father’s colleagues; Barbara may have written her own texts for many other pieces.

She led a quiet life: 6 August 1619 – 11 November 1677. She was the mother of four children, three of whom were fathered by Giovanni Paolo Vidman. Vidman was a patron of the arts and supported early opera. Vidman did not leave anything to her or her children in his will, so Strozzi supported herself by her investments and compositions. She died without leaving a will, and her son Giulio Pietro claimed her inheritance.

Her output consists of lots of secular vocal music, with the exception of one volume of sacred songs. She was also known for her poetic ability, and her lyrics were often well articulated. Of her printed works, more than 75% were written for soprano. Her compositions are of the seconda pratica tradition, which literally means “second practice” and is the counterpart to the prima pratica. The term seconda pratica was coined by Claudio Monteverdi and encourages more freedom from the rigorous limitations on dissonance and counterpoint characteristic of the prima pratica. Her music evokes the spirit of her teacher Cavalli, heir of Monteverdi, with a more lyrical style, depending more on sheer vocal sound.

Back to the concert: the location was the Downtown United Presbyterian Church in Rochester, a rather beautiful red-brick building. The pre-concert talk started at 3:15pm, and unfortunately we were about 5 minutes late. Some information from the pre-concert talk that didn’t show up on the internet: besides singing, she also played instruments, but she didn’t write any instrumental music. An interesting instrument was introduced: the lira da gamba. It is tuned in a circle of fifths and the bridge is very flat, so the chords in the continuo part can be played more easily. An audience member asked about the keys used in the music. The interpreter said the keys are not too far off in general, but there are dissonances, like F sharp major (also because the tuning is different), or an Eb – Gb change. And it was said that a great new edition, Richard Kolb’s for Cor Donato Editions, would be used. One question was left to be contemplated during the concert: whether we can hear that the music was written by a woman.

We were then given the programme, the lyrics of the songs, and the information on the concert series. One of the performers, Boel Gidholm, is one of the violin instructors in the Baroque ensemble. Other performers included Laura Hermes on soprano, Luthien Brackett on alto, Andrew Fuchs on tenor, Andrew Padgett on bass, Mary Riccardo on violin, David Morris on gamba/lirone, and Dan Swenberg and Deborah Fox on theorbos.

The programme included her famous and soulful lament Lagrime Mie; a sacred cantata for alto called In medio maris, dedicated to St. Peter; the duet for soprano and bass Morso e bacio dati in un tempo (Bite and kiss at the same time); and several vocal quartets including L’Usignuolo (The Nightingale). The instrumental pieces were by her contemporaries in Venice, music she most certainly would have known, including a Canzon by Cavalli (her teacher), Sonata 12 from book 2 of Dario Castello, a musician at St. Mark’s, and pieces by Cazzati and Legrenzi entitled La Strozza and La Strozzi, which are possibly in homage to Barbara herself [4].

I sat on the second-floor balcony on the right-hand side to get a better view of the violins and the singers. It was a shame that I couldn’t see the lira da gamba from this angle, and the lutes weren’t clear either. The church was almost full on the first floor and more than half full on the second: a very good turnout. People dressed more or less formally.

The first half of the concert consisted of Silentio nocivo (4 voices), La Strozza (instrumental, by Cazzati) and Il Ritorno (voice), Sonata 12, libro 2 (instrumental, by Castello), Lamento: Lagrime mie (voice), Sonata duodecima a due (not played), La crudele che non sente (voice), Balletto e corrente quarta (instrumental), and Con le belle non ci vuol fretta (voice). In the Strozzi songs, there was a high contrast between the happy and the sad emotions. Certain motives evoked a strong sense of emotion in me. There were also two kinds of sadness in Lamento and La crudele: the first was very chromatic, the second more diatonic but with dissonances.

The intermission was about 15 mins, after which we resumed with the second section. The pieces were La Strozzi (instrumental) and L’Usignuolo (voice), Canzona (instrumental) and In medio maris (voice), Sonata sopra Fuggi dolente (instrumental) and Tradimento (voice) and Morso e bacio dati in un tempo (voice), and Vecchio amante che rende la piazza (voice). Now, one sentence from the pre-concert talk was definitely hitting me: I’m getting lots of whining about love! It was quite an experience just imagining people singing so differently about the same topic across such a long stretch of history! In Morso e bacio dati in un tempo, there was acting between the soprano and the bass, which was very amusing. There was one encore piece at the end. I believe they sang the first piece again, and amazingly, it was a very different experience from the first time I listened to it at the beginning of the concert. For some reason, I could hear the bass more clearly and the whole group more harmoniously.

After the concert, there was a reception, and I talked to the violinists, the alto singer and an audience member who graduated from U of R. It was very interesting to learn from Boel and Luthien that the alto solo piece shares a similar section with a piece by Cavalli. It’s not clear who cited whom, but it was a musically interesting point that I’m afraid I missed the first time listening to it. I’m glad I learnt this at the reception! I also learnt that the performers actually came from different cities, rehearsed for 3 days for the show, and gave 3 concerts in Syracuse, Ithaca and Rochester. The professional speed is quite amazing to me!

To sum up, the music at the concert was great, and I learnt a lot more about Baroque music, especially Barbara Strozzi’s. I would very much like to go to one of Pegasus’s concerts again if time allows.







(Music) Pattern discovery paper review



Here are much-over-simplified one-sentence summaries of papers in pattern discovery, mainly music pattern discovery, but also NLP and data science. Mostly for my own reference… and I will maintain/update it!

As always, suggestions, comments, critiques: all welcome!

Two big categories: first on audio signals and then on symbolic data:


  1. Collins, Tom, et al. “Bridging the audio-symbolic gap: The discovery of repeated note content directly from polyphonic music audio.” Audio Engineering Society Conference: 53rd International Conference: Semantic Audio. Audio Engineering Society, 2014.
    • Use LSTM and GMM Viterbi quantisation algorithms to transcribe and then use SIARCT-CFP (Structure Induction Algorithm for r superdiagonals and Compactness Trawler, with Categorisation and FingerPrinting)

    • Use Variable Markov Oracle model to locate the repeated suffixes in a post-processed chroma feature time series

  3. Weiss, Ron J., and Juan Pablo Bello. “Unsupervised discovery of temporal structure in music.” IEEE Journal of Selected Topics in Signal Processing 5.6 (2011): 1240-1251.
    • Shift-invariant probabilistic latent component analysis and sparsity constraints (Much improved NMF) applied to chroma features patterns (harmonic features only), with applications in music segmentation, riff detection and tempo identification, etc. 

  4. Hardy, Corentin, et al. “Sequential pattern mining on multimedia data.”European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Database Workshop on Advanced Analytics and Learning on Temporal Data. 2015.
    • Symbolise audio into MFCC features and then do frequent closed pattern mining

  5.  Nieto, Oriol, and Morwaread M. Farbood. “Identifying polyphonic patterns from audio recordings using music segmentation techniques.” The 15th International Society for Music Information Retrieval Conference. 2014.
    • Exploit the repetitive features extracted by a music segmentation algorithm using a score-based greedy algorithm and a clustering algorithm.

    • A brand new paper on the segmentation part: Nieto, Oriol, and Juan Pablo Bello. “Systematic exploration of computational music structure research.” Proceedings of ISMIR, New York, NY, USA(2016).
  6. Kroher, Nadine, et al. “Discovery of repeated vocal patterns in polyphonic audio: A case study on flamenco music.” Signal Processing Conference (EUSIPCO), 2015 23rd European. IEEE, 2015.
    • Segment vocal parts using the spectral band ratio and RMS envelope, then use a sequence alignment algorithm on the chroma vectors, and then a clustering algorithm

  7. Gulati, Sankalp, et al. “Mining melodic patterns in large audio collections of indian art music.” Signal-Image Technology and Internet-Based Systems (SITIS), 2014 Tenth International Conference on. IEEE, 2014.
    • Use DTW and four variants of its cost function to calculate a huge number of distances between melodic patterns, with lower bounding and early abandoning

  8.  Park, Alex S., and James R. Glass. “Unsupervised pattern discovery in speech.” IEEE Transactions on Audio, Speech, and Language Processing  16.1 (2008): 186-197.
    • Use Dynamic Time Warping to identify patterns and then graph clustering, with nodes being time indices (from DTW) and connections being the similarity (from DTW) between the time indices.

  9. Jančovič, Peter, et al. “Unsupervised discovery of acoustic patterns in bird vocalisations employing DTW and clustering.” 21st European Signal Processing Conference (EUSIPCO 2013). IEEE, 2013.
    • Use sinusoidal detection to extract frequency tracks as the input features, and then DTW is used to segment the data, and finally an agglomerative hierarchical clustering to cluster recurring segments.

  10. Deecke, Volker B., and Vincent M. Janik. “Automated categorization of bioacoustic signals: avoiding perceptual pitfalls.” The Journal of the Acoustical Society of America 119.1 (2006): 645-653.
    • DTW and neural networks for an unsupervised categorisation of isolated vocalisations of dolphins and whales


  1. Conklin, Darrell. “Discovery of distinctive patterns in music.” Intelligent Data Analysis 14.5 (2010): 547-554.
    • Use an anti-corpus to find maximally general distinctive patterns which can be useful for music classification

  2. Conklin, Darrell, and Stéphanie Weisser. “Pattern and Antipattern Discovery in Ethiopian Bagana Songs.” Computational Music Analysis. Springer International Publishing, 2016. 425-443.
    • Use a new theorem for pruning statistically under-represented patterns in an efficient pattern discovery algorithm

  3. Conklin, Darrell. “Distinctive patterns in the first movement of Brahms’ string quartet in C minor.” Journal of Mathematics and Music 4.2 (2010): 85-92.
    • Use Brahms’ string quartet No1 as corpus, No2 and No3 as anti-corpus, found most of the structures independently proposed by a musicologist.

  4. Lartillot, Olivier. “Multi-dimensional motivic pattern extraction founded on adaptive redundancy filtering.” Journal of New Music Research 34.4 (2005): 375-393.
    • A specificity relation is defined amongst pattern descriptions, unifying suffix and inclusion relations, which enables filtering of redundant descriptions; patterns are discovered through an incremental adaptive identification in a multi-dimensional parametric space

  5.  Lartillo, Olivier, and Mondher Ayari. “Motivic pattern extraction in music, and application to the study of Tunisian modal music.” Arima Journal 6 (2007): 16-28.
    • Search for closed patterns and cyclic patterns in a multi-dimensional parametric space

  6. Meredith, David, Kjell Lemström, and Geraint A. Wiggins. “Algorithms for discovering repeated patterns in multidimensional representations of polyphonic music.” Journal of New Music Research 31.4 (2002): 321-345.
    • Two algorithms, SIA and SIATEC, to compute the occurrences of all the maximal repeated patterns in a multidimensional dataset, with time complexity analysis $latex O(kn^2\log_2 n)$ and $latex O(kn^3)$

  7. Meredith, David. “Compression-based geometric pattern discovery in music.”2014 4th International Workshop on Cognitive Information Processing (CIP). IEEE, 2014.
    • Two algorithms, COSIATEC and SIATECCompress, are described; both generate compressed encodings of point-set representations of music

  8. Collins, Tom, et al. “Using Geometric Symbolic Fingerprinting to Discover Distinctive Patterns in Polyphonic Music Corpora.” Computational Music Analysis. Springer International Publishing, 2016. 445-474.
    • The technique of symbolic fingerprinting was used to unify the viewpoints and geometric method to increase the algorithms’ flexibility.

  9. Conklin, Darrell, and Christina Anagnostopoulou. “Representation and discovery of multiple viewpoint patterns.” Proceedings of the International Computer Music Conference. San Francisco: International Computer Music Association, 2001.
    • Use the formalism of multiple viewpoints to view music as multiple streams of description derived from the basic surface representation, and then statistical methods and longest significant pattern definition were introduced



Some surveys:

  1. Mabroukeh, Nizar R., and Christie I. Ezeife. “A taxonomy of sequential pattern mining algorithms.” ACM Computing Surveys (CSUR) 43.1 (2010): 3.
  2. Masseglia, Florent, Maguelonne Teisseire, and Pascal Poncelet. “Sequential Pattern Mining.” Encyclopedia of Data Warehousing and Mining (2005): 1028-1032.
  3. Janssen, Berit, et al. “Finding repeated patterns in music: State of knowledge, challenges, perspectives.” International Symposium on Computer Music Modeling and Retrieval. Springer International Publishing, 2013.

Some theses:

  1. Collins, Tom. Improved methods for pattern discovery in music, with applications in automated stylistic composition. Diss. Open University, 2011.
  2. Park, Alex Seungryong. “Unsupervised pattern discovery in speech: Applications to word acquisition and speaker segmentation.” (2007).
  3. Bajestani, Hossein Soleimani. Pattern Discovery from Unstructured and Scarcely Labeled Text Corpora. Diss. The Pennsylvania State University, 2016.

Annual Summary

It just hit me that the new academic year started about a week ago. It’s time to write something about the past year.


It was definitely a busy year. I’ll just list some things I can think of, on a personal level, so that there will be some traces of it 😛

Things that I enjoyed last year:

  • Implemented and explored my own ideas
  • Developed some industrial ideas and a stronger sense in real world problems
  • Learnt new materials in Artificial Intelligence, Machine Learning, Brain and Cognitive Science and Maths
  • Made new friends and established new collaborations
  • Started to take violin and voice lessons again
  • Took courses at Eastman Community Music School
  • Picked up ukulele and guitar
  • Joined a hammer dulcimer group and toured upstate NY small towns with the ensemble
  • Visited friends and attended conferences in Canada, Europe, and US
  • Experienced big big snow
  • Started cooking
  • Saw lots of wild deer, eagles, squirrels and skunks
  • Had fun with my betta, guinea pig, dwarf hamster and sunflowers
  • etc…….

Things that I struggled with last year:

  • Producing top quality results in research
  • Passing my road test
  • Keeping a healthy lifestyle
  • Not missing other places I have been to
  • etc……..

The new academic year has been ok so far. Not as hectic as last year. But still, two new classes for credit + some sitting-in classes + music ensembles + TA + meetings + projects are resuming. While waiting for the future anxiously, I’m gonna do my best at present. Life goes on.


Audio Features Self-Similarity Matrix shows something interesting Part-2

As promised, I will play with the features used in this post. If you didn’t read that post: the process used to extract patterns out of music is to reduce the audio signal to audio features and then calculate the covariance and correlation coefficient matrices of those features, at which point segments show up. Now we are going to experiment on subsets of the features I used back then, and see which features are the most helpful.


I’m also using a more familiar song this time: Row, Row, Row Your Boat. The arrangement has a female voice singing once, a male voice once, the female again, female and male together once, and finally a canon. I hand-annotated the segment boundaries, so the figures will look like the ones in part one.

The whole song with all the features used looks like this (covariance on the left, coefficients on the right):

It looks a bit messy. The coefficient matrix does look better. We can see that the intro and ending can be differentiated quite easily. But the exact repetition of the first two passes of the melody at the beginning is not showing at all. However, the repetition between the first and third passes seems to be shown by the high off-diagonal values. So the algorithm is probably capturing lots of timbre information. The mixed and canon parts show an interesting trait too.

Ok. But now we’re gonna take all the features apart:

First, I’m going to use the Energy Group:

  • 1. Zero Crossing Rate: The rate of sign-changes of the signal during the duration of a particular frame.
  • 2. Energy: The sum of squares of the signal values, normalised by the respective frame length.
  • 3. Entropy of Energy: The entropy of sub-frames’ normalised energies. It can be interpreted as a measure of abrupt changes.
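To make these definitions concrete, here is my own quick sketch of the three on a synthetic frame (a pure 440 Hz sine; a toy reimplementation of the definitions, not the extraction package's exact code):

```python
import numpy as np

# Synthetic frame: a 440 Hz sine sampled at 22050 Hz (toy data)
sr = 22050
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)

# 1. Zero Crossing Rate: fraction of adjacent sample pairs that change sign
zcr = np.mean(frame[:-1] * frame[1:] < 0)

# 2. Energy: sum of squares, normalised by the frame length
energy = np.sum(frame ** 2) / len(frame)

# 3. Entropy of Energy: entropy of the normalised sub-frame energies
#    (8 sub-frames of 256 samples here)
sub_energies = (frame.reshape(8, -1) ** 2).sum(axis=1)
p = sub_energies / sub_energies.sum()
entropy = -np.sum(p * np.log2(p + 1e-12))
```

For the sine, the ZCR is low (about 0.04, two sign changes per cycle) and the energy is about 0.5; the energy entropy sits near its maximum of log2(8) = 3 bits, since a steady tone has no abrupt changes.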

For each feature, as I said in part one, we have the mean and the standard deviation of the energy at each frame. I also separated them and plotted the covariance and correlation coefficients:

Three figures on the top (two big ones and two small ones) are the covariance matrices, at the bottom are coefficient matrices.

The two blue ones use the standard deviations. If we zoom in, we can see some small fluctuations corresponding to the ground-truth segmentation boundaries. (We will see that this is actually the only case where the standard deviation did anything…)

The big yellow one and the small red one have the 6 * 1 vectors as input. The female and male voice parts look different there. The female repeat was definitely captured.

The smaller yellow one and the green one use the mean values. One thing that came as a surprise is the noisiness of the coefficient matrix… I do know eyeballing is sometimes very misleading, but not to this degree…

Ok, second, the Spectral Group:

  • 4. Spectral Centroid: The centre of gravity of the spectrum.
  • 5. Spectral Spread: The second central moment of the spectrum.
  • 6. Spectral Entropy: Entropy of the normalised spectral energies for a set of sub-frames.
  • 7. Spectral Flux: The squared difference between the normalised magnitudes of the spectra of the two successive frames.
  • 8. Spectral Rolloff: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.
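Two of these are easy to sketch directly from an FFT magnitude spectrum (again my own simplified versions on toy data, with a Hann window, not the package's exact implementation):

```python
import numpy as np

# Toy frame: a 440 Hz sine at 22050 Hz
sr = 22050
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)

# Magnitude spectrum of the Hann-windowed frame, and the bin frequencies
mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)

# 4. Spectral Centroid: magnitude-weighted mean frequency
centroid = np.sum(freqs * mag) / np.sum(mag)

# 8. Spectral Rolloff: frequency below which 90% of the magnitude lies
cum = np.cumsum(mag)
rolloff = freqs[np.searchsorted(cum, 0.9 * cum[-1])]
```

For the single spectral peak of the sine, both land near 440 Hz, as expected.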

We have a much more easily explainable result in this group: the two groups in which the figures are similar to each other use the mean (5 * 1) and mean + std (10 * 1) vectors as input.

The almost-pure-blue figures have the standard deviation vectors as input. They don’t seem to be very useful…

The spectral features as a whole are definitely contributing to the final boundaries. This is not that surprising, since the group has more dimensions and the features capture pitch change to some degree. The female and male voice parts look very different, and the female repeat was captured, too. The intro and ending are particularly obvious (bird chirping, water sounds and voices are very different in spectral features). We actually see a clearer pattern here than in the all-features matrices, I think.

Third, MFCCs:

  • 9-21. MFCCs(Mel Frequency Cepstral Coefficients): a cepstral representation where the frequency bands are not linear but distributed according to the mel-scale.

Lots of the description from above (spectral) applies here: the two groups of two similar figures are the mean and mean + std. Not much going on in the standard deviation at all. And the whole thing looks quite like the final figure: the off-diagonals are there, just some community structures are not as obvious as in the final result.

MFCCs should capture timbre information: we now see that the off-diagonals are contributed by the different qualities of the female and male voices.

And finally Chroma Vectors:

  • 22-33. Chroma Vector: A 12-element representation of the spectral energy where the bins represent the 12 equal-tempered pitch classes of western-type music (semitone spacing).
  • 34. Chroma Deviation: The standard deviation of the 12 chroma coefficients.
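The bin mapping can be sketched as follows (my own toy version: every FFT bin's frequency is folded onto a pitch class relative to A440 and its magnitude accumulated; real implementations weight by spectral energy in much the same way):

```python
import numpy as np

def chroma_vector(frame, sr):
    # Toy chroma: fold each FFT bin's frequency onto one of the 12
    # equal-tempered pitch classes (0 = C, 9 = A) and sum its magnitude
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], mag[1:]):  # skip the DC bin
        pitch_class = int(round(12 * np.log2(f / 440.0)) + 9) % 12
        chroma[pitch_class] += m
    return chroma / chroma.sum()

sr = 22050
t = np.arange(4096) / sr
chroma = chroma_vector(np.sin(2 * np.pi * 440 * t), sr)  # a pure A
```

For the pure A, almost all of the mass lands in pitch class 9, as it should.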

The last set is also similar to the previous two. The contrast is stronger.

Chroma is more about the harmonic information. Since there is a lot of silence at the end, the similarities stay mostly steady there. Some of the noise there is the wind chimes.

In other places, there is definitely some repetition going on. We can almost see a four-by-four box, which is good!

So in the end, I think the chroma and spectral features captured more of what we want to extract here: meaningful music patterns. It’s a verification that melody and harmony are good feature candidates.

There will be a third part going back to the silver swan song. Hopefully soon!



Audio Features Self-Similarity Matrix shows something interesting

Recently I have been interested in extracting patterns out of audio data. It would be cool if we could automatically extract repeated licks, motifs and ornaments, especially in improvised music.

Just to have something to start with, I have analysed one song using correlation coefficient and covariance self-similarity matrices, which show the boundaries of the different sections of the song.

The music I’m working with is this one (The Silver Swan, from the MIREX task):


I’m still using a free WordPress account so I can’t upload sound files. But the music is monophonic: first the soprano part is played, then the alto, then the tenor, and so on and so forth. And the spectrogram looks like this:


So probably the sound file was synthesised using some MIDI synthesiser.

To extract the features, I’m using a Python package called pyAudioAnalysis, which extracts the following features:

  • 1. Zero Crossing Rate: The rate of sign-changes of the signal during the duration of a particular frame.
  • 2. Energy: The sum of squares of the signal values, normalised by the respective frame length.
  • 3. Entropy of Energy: The entropy of sub-frames’ normalised energies. It can be interpreted as a measure of abrupt changes.
  • 4. Spectral Centroid: The centre of gravity of the spectrum.
  • 5. Spectral Spread: The second central moment of the spectrum.
  • 6. Spectral Entropy: Entropy of the normalised spectral energies for a set of sub-frames.
  • 7. Spectral Flux: The squared difference between the normalised magnitudes of the spectra of the two successive frames.
  • 8. Spectral Rolloff: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.
  • 9-21. MFCCs(Mel Frequency Cepstral Coefficients): a cepstral representation where the frequency bands are not linear but distributed according to the mel-scale.
  • 22-33. Chroma Vector: A 12-element representation of the spectral energy where the bins represent the 12 equal-tempered pitch classes of western-type music (semitone spacing).
  • 34. Chroma Deviation: The standard deviation of the 12 chroma coefficients.

In the end, we have 34 * 2 = 68-dimensional feature vectors. This 68 * 1 vector is now a new representation of the music at each frame. The first 34 entries are the mean values of the frame features listed above, and the second 34 are the standard deviations.

Okay, once the new vector representation of the audio is made, we can use lots of multivariate time series techniques to do something interesting. So, using the 68*1 vector, I first calculated the correlation coefficient self-similarity matrix:


I used a frame length of 0.046 s, so there are 3000+ frames given that the music is about 2 mins long. The blue lines here are my markings of the start of each new voice part. We can see some community structure here.
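The computation itself is short. A sketch with random stand-in features (the real matrix would come from the pyAudioAnalysis features above; the random one here is just to show the shapes):

```python
import numpy as np

# Stand-in feature matrix: 68 features x 300 frames (random, for illustration)
rng = np.random.RandomState(0)
features = rng.rand(68, 300)

# np.cov / np.corrcoef treat each ROW as one variable, so transposing makes
# each frame a variable, compared against every other frame across the 68
# feature dimensions. The result is an (n_frames x n_frames) matrix.
ssm_cov = np.cov(features.T)
ssm_corr = np.corrcoef(features.T)
```

The diagonal of the correlation version is all ones (each frame is perfectly correlated with itself), and high off-diagonal values mark pairs of frames with similar feature profiles, which is what the repetition blocks in the figures are.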

Let’s zoom into the soprano part (blue lines are rough manual segmentations; there are two very obvious repetitions if we look at the sheet music):

I tried two different colour schemes for the correlation coefficient self-similarity matrix, but they both looked a little confusing:


(sorry for the inconsistent sizes of the figures… screenshots are a time saver but they don’t look pretty…)

I also tried using covariance instead. Looks a little better? We can definitely see the repetition!


I’m thinking about tweaking the features and trying this on a few other songs next. To be continued…