SANE workshop

Sorry for not updating for a while. I have a few drafts backed up that aren’t ready to publish yet, but trust me, they will come out soon.

Now, the SANE (Speech and Audio in the Northeast) workshop. This one-day workshop was totally worth the 6-hours-each-way road trip! Honestly, I didn’t expect it to be such a high-level workshop. There were lots of people from Google, Apple, MERL, and academic institutions like MIT, CUNY, and NYU. All cutting-edge results, brilliant discussions.

The talks covered a range of topics: acoustics, machine learning, speech, sound events in general, etc. But there was only one talk about music, from the Google Magenta team (a very nice talk by Jesse Engel). It covered unreleased research on training LSTMs directly on music waveforms (e.g. deep-dream audio, music hallucinations). One interesting RNN training technique was used: multiscale truncated backpropagation (see photos below; sorry about the low quality, I had to zoom in). It’s an intern’s project, but the idea of using a hierarchy of nodes was very interesting. Other insights included the challenges (see photos below); the long-term structure one is my favorite. And of course WaveNet, the autoregressive CNN, was mentioned. I need to catch up on reading that paper to understand that part of the talk… I heard the talks are normally uploaded to YouTube, so maybe keep an eye on this: http://www.saneworkshop.org
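
Since WaveNet came up, here is a tiny, purely illustrative sketch (not the actual WaveNet implementation, and not code from the talk) of the core idea I took away from the paper: stacking causal convolutions with exponentially growing dilation, so the receptive field covers a long stretch of waveform with only a few layers. The function name `causal_dilated_conv` and all the numbers are my own toy choices.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: the output at time t depends only on
    x[t], x[t - dilation], ..., x[t - (k - 1) * dilation]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad to stay causal
    return np.array([
        sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# Stack layers with dilations 1, 2, 4, 8: with kernel size 2 the
# receptive field doubles at each layer, reaching 16 past samples here.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)                # pretend this is raw audio
h = x
for d in (1, 2, 4, 8):
    w = 0.1 * rng.standard_normal(2)       # kernel size 2, as in WaveNet
    h = np.tanh(causal_dilated_conv(h, w, d))
print(h.shape)                             # (64,), same length as the input
```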

Other topics:

  • Environmental noise detection. One important issue in this area of research seems to be a lack of data; various data augmentation methods seemed to help (a tiny example is sketched after this list).
  • Neuroscience. An interesting experiment read a ferret’s mind while the ferret listened to human speech. The signal recovered from the ferret’s EEG didn’t sound bad at all!
  • Machine learning architectures. Actually, this topic showed up in almost every talk. There was one by a MERL speaker, Shinji Watanabe, who used a beamforming acoustic model + joint CTC-attention network to simplify all the signal processing: microphone array, mask estimation, feature extraction and transformation, etc. But it was still pretty complicated for me.

Surely one can see how these methods from other areas could be used in music!
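
As a concrete (and, again, purely illustrative) example of the data-augmentation idea mentioned above, one common trick is to mix clean recordings with background noise at a range of signal-to-noise ratios. This is my own minimal sketch, not anything shown at the workshop; `mix_at_snr` and the fake signals are made up for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Additively mix `noise` into `clean` at a target SNR in dB.
    Both inputs are 1-D float arrays of the same length."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Turn one labeled clip into several training examples at different SNRs.
rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # fake 1 s clip
noise = rng.standard_normal(16000)                          # fake background
augmented = [mix_at_snr(clip, noise, snr) for snr in (20, 10, 0)]
```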

There was also a nice mix of posters and attendees in general. I went through almost all of the posters, and it was very enjoyable talking to the presenters and other attendees. The music posters were all from our lab, though. Other posters:

  • speech + image processing (a paper to be presented at NIPS 2016, Yusuf Aytar et al.)
  • adult vs. kid voice recognition (an internship project at Comcast; the model wasn’t complicated, and the implementation was also done during the internship; Denys Katerenchuk et al.)
  • prosody influence from other speakers (ongoing PhD work, Min Ma et al.), and echolocation (it was amazing to learn what blind people can do using echolocation)
  • etc.

Also, I was there thanks to NSF I-Corps funding (there will be a post about this once I finish the whole programme). One requirement of the funding was actually to “conduct interviews” (more likely to be chats at a one-day workshop, though) about the project we are doing and to take photos with the interviewees (I promised not to post them online though :P). It’s a very good activity, actually: lots of fun and memories from taking the photos.