100 Days Of ML Code — Day 057

Recap From Day 056

On Day 056, we looked at working with time and how dynamic time warping works. We learned that dynamic time warping doesn’t require us to do explicit segmentation, whereas using a classifier means we need to decide when a gesture begins and ends in order to pass the classifier a feature vector representing the gesture from beginning to end.

Today, we will start looking at dynamic time warping for music and speech analysis.

Working with time

Dynamic time warping for music and speech analysis.

We’ve seen how dynamic time warping can be used to recognize similar shapes of gestures, like shapes drawn with a mouse, or the rotation of an accelerometer over time. But dynamic time warping is also useful for recognizing shapes in other types of feature spaces.
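To make that concrete, here’s a minimal sketch of the DTW computation itself, in Python with NumPy. The two “shapes” at the bottom are made-up mouse traces of the same zig-zag drawn at different speeds, not data from the course.

```python
# A minimal sketch of dynamic time warping (DTW), assuming each
# example is a sequence of feature vectors (one vector per frame).
import numpy as np

def dtw_distance(a, b):
    """Return the DTW alignment cost between two sequences of
    feature vectors, shaped (n_frames_a, n_features) and
    (n_frames_b, n_features)."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest way to align a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            # a frame may match, or one sequence may advance alone
            cost[i, j] = d + min(cost[i - 1, j],      # a advances
                                 cost[i, j - 1],      # b advances
                                 cost[i - 1, j - 1])  # both advance
    return cost[n, m]

# Two mouse-drawn "shapes": the same zig-zag traced at different speeds.
slow = np.array([[0, 0], [1, 1], [1, 1], [2, 0], [2, 0], [3, 1]], float)
fast = np.array([[0, 0], [1, 1], [2, 0], [3, 1]], float)
print(dtw_distance(slow, fast))  # small: the shapes align well
```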

For example, we might say that two melodies have a similar shape. Let’s build a dynamic time warping program that’s trained to recognize different melodic sequences. First off, if we were going to build this program, we’d need to answer the question: what types of features should we use?

Peak frequency, Constant-Q bins, and chromagram bins would give us a pretty good start. RMS wouldn’t be so useful unless we had an instrument that varied wildly in volume from one note to the next, though it would be useful if we were interested in patterns in volume over time instead. And spectral centroid wouldn’t be so useful unless we had an instrument that varied wildly in timbre or tone color from one note to the next.
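As a sketch of what extracting these candidate features might look like, here’s some Python using the librosa library. The file name melody.wav is a hypothetical placeholder, and the frame parameters are librosa’s defaults, not anything specified in the course.

```python
# A sketch of extracting the candidate features with librosa,
# assuming an audio file at the hypothetical path "melody.wav".
import numpy as np
import librosa

y, sr = librosa.load("melody.wav")

# Chromagram bins: energy folded into the 12 pitch classes per frame.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # shape (12, n_frames)

# Constant-Q bins: log-frequency spectrum, one bin per musical interval.
cqt = np.abs(librosa.cqt(y=y, sr=sr))              # shape (n_bins, n_frames)

# RMS and spectral centroid, for comparison: volume and "brightness".
rms = librosa.feature.rms(y=y)                     # shape (1, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# DTW works on sequences of frames, so transpose to (n_frames, n_features).
melody_features = chroma.T
```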

That said, centroid might be useful if we were interested in patterns in instrumentation, or patterns in synthesized sounds, where filtering or other effects change the brightness of the sound considerably over time. For our task, detecting patterns in melodies played on the computer, there’s an even better representation. Because we are generating those sounds on the computer in response to key presses, we know exactly which note we’re playing and when. So we can send a MIDI note number, or a similarly simple representation, to our dynamic time warping algorithm, and the problem becomes much easier.
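Here’s a sketch of that simpler setup: the melodies are just lists of MIDI note numbers (made up for illustration), and the same DTW routine from the earlier sketch aligns them even when a note is held longer in one performance.

```python
# A sketch of matching melodies as MIDI note numbers instead of audio
# features; the note values here are made up for illustration.
import numpy as np

def dtw_distance(a, b):
    # same DTW as before, on 1-D sequences of note numbers
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

# Template: the opening of a melody, one MIDI note per key press.
template = [60, 62, 64, 60]            # C4 D4 E4 C4
# The same melody played back with an extra repeated note (held key).
performance = [60, 62, 62, 64, 60]

print(dtw_distance(template, performance))  # 0.0: the warp absorbs the repeat
```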

For another example, let’s use dynamic time warping to build a simple voice controller. Say we have a simple mock-up of a platformer video game, and we want our avatar (represented by a capsule, as shown below) to move left, move right, and jump.

If we want the avatar (capsule) in our mock-up above to respond to our voice as we speak the words “left”, “right”, and “jump”, which features (FFT peak frequency, Constant-Q bins, centroid, or MFCCs) should we use? The frequency content of our voice and the timbre of our voice both change as we speak different words. However, recall that MFCCs are a type of feature designed to work well for speech, so they’re probably the best thing to start with. Also, recall that we prefer our feature vectors to be shorter rather than longer, so let’s leave out the other features for now.
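As a sketch of how this could fit together, here’s a nearest-template voice controller in Python: one recorded MFCC sequence per command word, compared against a new utterance by DTW. The file names and the librosa-based feature extraction are my own assumptions, not the course’s actual setup.

```python
# A sketch of a nearest-template voice controller: MFCC sequences for
# "left", "right", and "jump" compared by DTW. File paths and feature
# settings are hypothetical.
import numpy as np
import librosa

def dtw_distance(a, b):
    # same DTW as in the earlier sketch, on sequences of MFCC frames
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def mfcc_frames(path):
    """Load a recording and return its MFCCs as (n_frames, n_coeffs)."""
    y, sr = librosa.load(path)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# One recorded template per command (hypothetical file names).
templates = {word: mfcc_frames(f"{word}.wav")
             for word in ("left", "right", "jump")}

def classify(path):
    """Label a new utterance with the command whose template is closest."""
    query = mfcc_frames(path)
    return min(templates, key=lambda w: dtw_distance(query, templates[w]))

print(classify("new_utterance.wav"))  # e.g. "jump"
```

Because each command word is only spoken for a moment, a handful of templates per word is often enough for this kind of nearest-template matching to feel responsive.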

That’s all for day 057. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

Reference

https://www.kadenze.com/courses/machine-learning-for-musicians-and-artists-v/sessions/working-with-time