In day 099 we looked at something a little bit different that actually goes back to what we covered when we looked at timbre: how we create the frequency representations of sound that we used there, the sonogram and the spectral view.

You can catch up using the link below.

**100 Days Of ML Code Day 099** (*Recap from day 098*, medium.com)

Today, we'll continue from where we left off in day 099.

Windowing is when we take a waveform and split it up into tiny little bits. Then we take each of those tiny little bits and do this thing called periodicization. There's really nothing to this: we just pretend that each little bit repeats infinitely, so that it's a periodic sample. Then, on each of those little windows, we apply a method called the Fast Fourier Transform, which you'll often see abbreviated as FFT. We apply this process in order to convert our time-domain set of amplitude values into information about frequency. So, I'm going to go through each of those steps in more detail.

The first step is windowing: we divide the audio into equal-size, overlapping frames. Let me show you what I mean. We pick a number of samples to include in each frame, so our frame size might be 1,024 samples, for instance. These are tiny frames: if our sampling rate were 44,100 hertz, 1,024 samples is about 1/43 of a second, roughly 23 milliseconds. So, tiny fractions of a second.

If we were taking the waveform seen above and splitting it up, the first red line that I've drawn under the waveform might be one frame, and then we're going to overlap the frames with each other. So the second red line, under the first one, might be another frame; the third line, the one above the second line, might be another; and so on all the way through our file.

But it's actually more complicated than what I've shown above. Because the frames overlap and we want smooth transitions from one to the next, each of them fades in and fades out.

So the first one I'm going to fade in and fade out with an amplitude envelope, as represented by the green annotation seen above. The next one we'll fade in and fade out too, and so on. So there's always one frame that's fading in and one that's fading out, with an overlap like the one seen above.

So that's what windowing is: we end up with these windows that fade in and fade out, each a tiny fraction of a second long. Then we take each of those windows and, this is the easy part, we pretend that it's a periodic function.
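Here is a minimal sketch of the windowing step in plain Python. The article only says each frame fades in and out with an amplitude envelope; the Hann shape, the 50% overlap (a hop of half the frame size), and the function names `hann` and `windowed_frames` are my assumptions for illustration.

```python
import math

def hann(frame_size):
    """Fade-in/fade-out envelope: 0 at the start, 1 in the middle (assumed shape)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
            for n in range(frame_size)]

def windowed_frames(samples, frame_size=1024, hop=512):
    """Split samples into overlapping frames and apply the envelope to each one."""
    envelope = hann(frame_size)
    frames = []
    # Each frame starts `hop` samples after the previous one, so frames overlap.
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        frames.append([s * w for s, w in zip(frame, envelope)])
    return frames

# One second of a 440 Hz sine wave at 44,100 Hz as example input.
sample_rate = 44100
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = windowed_frames(signal)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

With this particular envelope and a hop of half a frame, the fade-out of one frame and the fade-in of the next always add up to one, which is the "one fading in, one fading out" picture described above.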

So we take a tiny little window, like image A above, and we repeat it again and again and again, like in image B above. We just pretend that the repetition goes on forever. So now we've met the periodic requirement of the Fourier Theorem.

The final step is called the Fast Fourier Transform. The details of how this algorithm works are a little bit beyond the scope of this article. I encourage you to look up more details if you're interested, and I'll point you towards some references, but right now I just want to pretend that it's a black box and explain what goes in and what comes out.

What goes in are the amplitude samples over time in the frame. So if our frame size is 1,024, we'd have 1,024 amplitude values going in. And what comes out is an amplitude and a phase for each frequency bin.

In other words, I'm going to divide up my frequency space into a series of linearly spaced bins, and then I'm going to look at what's going on in each of those. How much energy is there in each bin? And what is the phase of the sine wave that's represented by each bin?
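To make the black box a little more concrete, here is a sketch that computes the same kind of output with a naive discrete Fourier transform. This is not the fast algorithm itself, just a slow, direct calculation of the values an FFT would return, and the function name `dft_bins` is hypothetical. It returns frame_size / 2 bins to match the count used in this article; a full real-input FFT also reports one extra bin at the Nyquist frequency.

```python
import cmath
import math

def dft_bins(frame):
    """Naive DFT: return an (amplitude, phase) pair for bins 0 .. N/2 - 1."""
    n = len(frame)
    bins = []
    for k in range(n // 2):
        # Correlate the frame with a complex sinusoid that fits k cycles in the frame.
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        bins.append((abs(total), cmath.phase(total)))
    return bins

frame_size = 64  # kept small because the naive version is slow
# A sine wave that completes exactly 4 cycles in the frame should land in bin 4.
frame = [math.sin(2 * math.pi * 4 * t / frame_size) for t in range(frame_size)]
bins = dft_bins(frame)
loudest = max(range(len(bins)), key=lambda k: bins[k][0])
print("bin with the most energy:", loudest)
```

What goes in is a list of amplitude samples, and what comes out is one amplitude and one phase per bin, which is exactly the in/out contract described above.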

There are some simple ways to calculate how the algorithm lays this out. My number of frequency bins is half of my frame size, and the width of each bin, the spacing from one to the next, is my Nyquist frequency (the highest frequency I can represent at my sampling rate) divided by my number of bins.

Let's work through an example just to make sure this is totally clear. Say my frame size is 1,024 samples and my sampling rate is 44,100 hertz. Then my Nyquist frequency is 44,100 divided by 2, so 22,050 hertz. My number of bins is the frame size, 1,024, divided by two, so that's 512. And my bin width is my Nyquist frequency, 22,050 hertz, divided by my number of bins, 512. This comes out to a little bit more than 43 hertz. So that means my frequency bins are going to be spaced at zero, 43, 86, 129, and so on, all the way up to 22,050 hertz.
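The same arithmetic as a short sketch, using the formulas from this article (the function name `bin_layout` is just an illustrative choice):

```python
def bin_layout(frame_size, sample_rate):
    """Return (nyquist, number_of_bins, bin_width) as defined in the article."""
    nyquist = sample_rate / 2
    num_bins = frame_size // 2
    bin_width = nyquist / num_bins
    return nyquist, num_bins, bin_width

nyquist, num_bins, bin_width = bin_layout(1024, 44100)
print(nyquist, num_bins, round(bin_width, 2))  # 22050.0 512 43.07

# Frequencies of the first few bins, rounded to whole hertz.
first_bins = [round(k * bin_width) for k in range(4)]
print(first_bins)  # [0, 43, 86, 129]
```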

So that's how this stuff is divided up. At that point I have information about what's going on in each of those frequency areas, and you can see how I could generate a sonogram from there: I could take each of these frames and generate one vertical strip of the frequency view in my sonogram, based on the data that's coming back.

I want to talk about some of the issues with the process described above because it is not a perfect process.

First of all, it's a lossy process: I lose data. If I do this Fast Fourier Transform and then go back to my waveform, I've lost something, because I've split things up into these linear frequency bins, so I only know what's happening at a fairly low resolution in frequency. I also only know things at a fairly low resolution in time, because I only know what's happening frame by frame, 1,024 samples at a time in the example we've been using. So there's actually a big trade-off when I pick my frame size.

It comes down to how much resolution I want in the time domain versus how much I want in the frequency domain. If I want to know exactly when things are happening in time, along my x-axis, I can pick a very small frame size. My frames are really tiny, so I get a lot of time resolution horizontally, but then my bin width gets huge, and so I know very little about what's happening vertically, in my frequency dimension.

If I want to know a lot vertically, in my frequency dimension, I can pick a really large frame size. But then a lot of time passes from one frame to the next, and so I lose a lot of resolution horizontally, in the time domain.
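A quick way to see the trade-off is to tabulate frame duration against bin width for a few frame sizes. This is a sketch using the article's formulas; the frame sizes 256 and 4,096 and the function name `resolution` are my own picks for illustration.

```python
def resolution(frame_size, sample_rate=44100):
    """Return (frame duration in milliseconds, bin width in hertz)."""
    frame_ms = 1000 * frame_size / sample_rate
    bin_width = (sample_rate / 2) / (frame_size // 2)
    return frame_ms, bin_width

for frame_size in (256, 1024, 4096):
    frame_ms, bin_width = resolution(frame_size)
    print(f"{frame_size:5d} samples: {frame_ms:6.1f} ms per frame, "
          f"{bin_width:6.1f} Hz per bin")
```

Smaller frames give shorter time slices but wider bins, and larger frames give the opposite, so one number improves only as the other gets worse.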

One more point I want to make is that the frequency space is divided linearly but, if you remember from psychoacoustics, we hear pitch not linearly but logarithmically. So a lot of linear frequency bins are kind of wasted, if you will, on things very high up in frequency space. Half of the bins cover what we would hear as just the final octave of our frequency space. So this isn't a great match either, but that's how this particular algorithm works.
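To check the "half of the bins" claim, here is a small sketch that counts how many of the 512 bins from the running example fall into each of the top two octaves. The helper name `bins_between` is hypothetical.

```python
sample_rate = 44100
frame_size = 1024
nyquist = sample_rate / 2
num_bins = frame_size // 2
bin_width = nyquist / num_bins

def bins_between(low_hz, high_hz):
    """Count bins whose frequency k * bin_width lies in [low_hz, high_hz)."""
    return sum(1 for k in range(num_bins) if low_hz <= k * bin_width < high_hz)

top_octave = bins_between(nyquist / 2, nyquist)       # 11,025 Hz up to 22,050 Hz
next_octave = bins_between(nyquist / 4, nyquist / 2)  # 5,512.5 Hz up to 11,025 Hz
print(top_octave, next_octave)  # 256 128
```

Each octave down gets half as many bins as the one above it, so the low octaves, where a lot of musical detail lives, are left with only a handful.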

Wow, you're still here. It's the 100th day. You deserve some accolades for hanging in here till the end. I hope you found the journey from day 001 to day 100 informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In the past two days, we've talked briefly about how to calculate the storage space of digital audio data based on decisions we've made about bit width, the number of channels, and sampling rate. We've talked about ways to reduce that storage space through lossless file formats and lossy file formats, and the implications of each.

You can catch up using the link below.

**100 Days Of ML Code Day 098** (*Recap from day 097*, medium.com)

Today, we're going to move on to something a little bit different that actually goes back to what we covered when we looked at timbre: how we create the frequency representations of sound that we used there, the sonogram and the spectral view.

I want to cover a somewhat complex topic, but I think it's really important for us to understand it: how we get from the waveform representation of digital audio, where we have time on our x-axis and amplitude on our y-axis, to the sonogram representation, where we can see much more information about the frequency and timbre content of the sound.

We're going to talk about how we get to the sonogram and the role the Fourier Theorem plays in that. We're going to talk about how we work around the limitations of the Fourier Theorem through a process of windowing, periodicization, and the Fast Fourier Transform, in order to take any sound that we might want to look at and represent it as a sum of a series of sine waves.

We'll talk about some implications of this algorithm, particularly two parameters, the frame size and the bin width, that we need to think about very carefully as we're configuring it, because they have serious implications for the results we get.

Now that we know how sound is represented digitally on a computer, it's pretty obvious how a waveform representation like the one seen in the image below comes about. We simply take the successive amplitude values and plot them over time on the x-axis, and then we have our waveform. We can connect the dots if we want to make it look a little nicer.

But how we get from the kind of representation above to the one seen below is not obvious, because when we represent sound digitally we're encoding a series of amplitude values over time. We're not including any information about frequency at all. So that's why we need to think about this a little bit more carefully and work out how we get to the representation seen below.

So we're going to revisit the Fourier Theorem, which we looked at in the timbre article. I want to look at it in a little bit more depth now.

Just to recap, we said the Fourier Theorem states that any periodic waveform can be represented as a sum of sine waves at frequencies that are integer multiples of a fundamental frequency. We looked at examples of this with a sawtooth wave, and with the trombone sound, to see how we could combine sine waves together.
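As a reminder of what that looks like in practice, here is a minimal sketch that builds a sawtooth wave out of sine waves at integer multiples of a fundamental. The scaling and sign convention (a ramp from -1 to 1 that crosses zero at time zero) and the name `sawtooth_approx` are my choices for illustration.

```python
import math

def sawtooth_approx(t, fundamental, num_harmonics):
    """Sum sine waves at 1x, 2x, 3x, ... the fundamental, each scaled by 1/k."""
    total = sum(
        (-1) ** (k + 1) * math.sin(2 * math.pi * k * fundamental * t) / k
        for k in range(1, num_harmonics + 1)
    )
    return (2 / math.pi) * total

fundamental = 100.0              # hertz
t = 0.25 / fundamental           # a quarter of the way through one cycle
for n in (1, 5, 50, 200):
    print(n, "harmonics:", round(sawtooth_approx(t, fundamental, n), 4))
```

At this point in the cycle the ideal ramp has the value 0.5, and the printed values settle towards it as more harmonics are added.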

We wouldn't hear them anymore as individual sine waves. We'd hear them coming together to create a single sound for us, because of the special relationship they have to each other as integer multiples of a base frequency, and because of the way that they are linked.

I also mentioned a really important limitation here: the periodic limitation. The theorem only works for periodic waveforms, like a perfect sine wave or a perfect square wave, and that isn't how sounds work in the real world. They're not perfectly periodic. They don't repeat a cycle infinitely, over and over again, without any variation.

So one problem is that we've captured the spectral aspect of timbre but not the envelope of timbre, the changing-in-time aspect of it. The other problem is that when we say "a sum of sine waves", there's an important caveat: a potentially infinite number of sine waves may be required to do the summation, and computers don't tend to like infinity very much. They're not continuous beings. They're discrete; they do things as sets of zeros and ones.

So if we need a potentially infinite number of sine waves to do the summation, that's also going to be really problematic for us. What we do instead is use the basic idea of the Fourier Theorem but tweak it a little bit. We kind of fake it out, if you will: we pretend that we're working with periodic waves, and we use a process that doesn't do things perfectly but doesn't need an infinite number of sine waves to make the summation happen either. There are three stages to the process, which I'm going to talk about in detail.

Windowing is when we take a waveform and split it up into tiny little bits. Then we take each of those tiny little bits and do this thing called periodicization. There's really nothing to this: we just pretend that each little bit repeats infinitely, so that it's a periodic sample. Then, on each of those little windows, we apply a method called the Fast Fourier Transform, which you'll often see abbreviated as FFT. We apply this process in order to convert our time-domain set of amplitude values into information about frequency.

So, I'm going to go through each of those steps in more detail tomorrow. That's all for day 099. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 097 we looked at how we actually take data and store it: how much space it takes up on our disk, and the different file formats that are available to us to manage that.

You can catch up using the link below.

**100 Days Of ML Code Day 097** (*Recap from day 096*, medium.com)

Today we will continue from where we left off in day 097.

There are other lossless compression formats that you'll encounter from time to time. ALAC is Apple's; it stands for Apple Lossless Audio Codec, but it's not very well supported by programs that aren't made by Apple or don't use Apple's audio APIs.

There are tools for importing and exporting these audio files, so this is something that's available if you need that two-to-one saving in size, because you're trying to email a file to someone or share it somewhere or whatever it might be, but you want to keep all those amplitude values perfectly intact. In that case, this can be a good technique.

What people usually wind up doing when they want to save space is use a lossy file format. A lossy file format will compress the file size in a way that means you can never get the original back, but it does so using a perceptual encoding strategy. In other words, it actually considers how we hear sound: psychoacoustics.

It's just like what we were talking about earlier in this module. It thinks about the things we're not going to miss so much in the sound: frequencies that we can't hear that well, or particularly ones that might get hidden or covered up by other audio content in the sound.

These formats try to use that to make intelligent decisions about what to leave out and what to keep in. I'm sure you've all heard of some popular file formats in the lossy category: MP3 is the most popular, AAC is fairly popular as well, and Ogg Vorbis is another one that's used quite a bit.

There are many others as well, but the ones listed above are three of the most popular, and they usually get you about a 90% saving over the original. So, instead of 10 MB per minute of CD-quality sound (44,100 Hz, 16-bit, stereo), you usually get about 1 MB per minute, depending on the exact savings.

So, that's a substantial saving, and it's particularly useful in a lot of scenarios given how we consume music today. If you're on your cell phone and you're trying to stream music tracks from a music provider, you can't stream a WAVE file on your crappy 3G connection or whatever connection you have available, or you might not want to use up your data plan on all that streaming.

So, you can use a lossy file format and hear something that sounds pretty good over your cell phone but saves you 90% of your data. It can be useful in a lot of situations like that.

I do want to issue a very important warning here: it's a lossy format for a reason, and you can never get the original back. So if you're making your own music, it would be a horrible idea to save it only in a lossy format like MP3, AAC, or Ogg Vorbis.

Let's say you later want to go back and edit it, make some changes, or re-encode it in another format. You'd be doing all that to a version that has lost some of the amplitude data of the original. Those amplitude values are not going to be the same as when you created and recorded them, and so it's never going to sound quite as good as the original version that you created in a lossless format.

And it gets worse if you then re-encode it in a lossy format again, which is a very common thing to see: take an MP3, decompress it into a WAVE file, do some editing on it, and save it as an MP3 again. Well, we've basically done two different MP3 compressions: the first when I saved it the first time, and the second after I've decoded it and edited it and I'm saving it again.

That's going to compound the effects of the losses. So it's always a good idea, when you're editing files, working with them, or saving your own music for archival purposes, to save it in a lossless format like a WAVE or an AIFF file, or even FLAC, the Free Lossless Audio Codec: something that's going to let you get back the full quality of the original if you ever want to edit it or re-encode it in the future.

That's all for day 098. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 095 and 096 we talked about the way that we hear sound in space: interaural delay time and the head-related transfer function. We also talked about binaural recording and processing, which are very effective if we are just working with headphones. And then we talked about the different speaker configurations available to us for diffusing sound in space through speakers.

You can catch up using the links below.

**100 Days Of ML Code Day 096** (*Recap from day 095*, medium.com)

**100 Days Of ML Code Day 095** (medium.com)

Today, we'll talk about the question of how we actually take data and store it: how much space it takes up on our disk, and the different file formats that are available to us to manage that.

So, we've decided on our sampling rate, our bit width, and how many channels we need to represent audio for a particular situation. Now, how do we figure out how much space it takes up? That's what we're going to cover today.

We're going to look at how to calculate storage space, and then we're going to look at the different file formats we can use: lossless file formats that preserve all of our amplitude values perfectly, and lossy file formats that can save us a lot of disk space but lose some data in the process.

So, before we look through the file formats, let's go through some very simple calculations. Let's assume that we have 1 minute of audio at 16 bits, 44,100 Hz, and stereo, so two channels. How much disk space would this actually take up to store?

So we've got 60 seconds, multiplied by 44,100 samples per second, multiplied by 16 bits per sample, multiplied by two channels (two samples per moment in time). This comes out to about 84.7 million bits.

Now, before you start freaking out, that's bits, and that's not usually how we talk about digital data. To convert those roughly 84.7 million bits to bytes, we divide by 8. To go to kilobytes, we divide by 1,024, and to go to megabytes, we divide by 1,024 again. That number ends up coming out to about 10 MB.
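The calculation above can be sketched as a small function (the name `audio_size_megabytes` is just for illustration):

```python
def audio_size_megabytes(seconds, sample_rate, bit_width, channels):
    """Uncompressed audio size, following the steps in the text."""
    bits = seconds * sample_rate * bit_width * channels
    bytes_ = bits / 8
    kilobytes = bytes_ / 1024
    return kilobytes / 1024

total_bits = 60 * 44100 * 16 * 2
size_mb = audio_size_megabytes(60, 44100, 16, 2)
print(total_bits)         # 84672000
print(round(size_mb, 2))  # 10.09
```

Changing any one of the four inputs scales the result proportionally, so a mono version of the same minute would need half the space.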

So, in order to store 1 minute of 16-bit, 44,100 Hz stereo sound, we need about 10 MB of disk space. How are we going to store this? Let's assume we've got plenty of space, so that's not an issue, and we just want to store it on disk.

The easiest thing we can do is just use a standard file format. Basically, it takes all the amplitude values, all the binary digits, and just writes them onto disk in a structured format. The two most popular formats for doing that these days are WAVE files and AIFF files.

There was a time long ago when WAVE was the Windows format and AIFF was the Apple format. Any music technology program we'd encounter these days supports both formats equally well. There are a lot of more obscure formats that aren't used nearly as much.

WAVE files and AIFF files are supported by just about every audio program out there. If we wanted to save some space, we could try to compress this data. And we could use a lossless compression format.

What a lossless compression format does is similar to what a ZIP archive does for other types of files. It goes through and tries to re-encode all our amplitude values in a way that represents the most commonly used ones a little more efficiently, at the expense of representing some of the less frequently used ones less efficiently.

Using a technique like this, we can usually save about 50%. So, instead of our 10 MB per minute of CD-quality sound, we'd have about 5 MB to represent that same minute. The most popular format here is FLAC, which stands for Free Lossless Audio Codec.

That's all for day 097. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 095 we talked about HRTF (head-related transfer function) and closed with what binaural sound essentially is.

You can catch up using the link below.

**100 Days Of ML Code Day 095** (medium.com)

Today, we'll look deeper into binaural sound.

Binaural sound is essentially spatialized sound designed explicitly to be heard over headphones. There are two different ways that we can create binaural sound. One is to use a special microphone called a binaural microphone. Below is a picture of one.

The microphone seen above essentially looks like a pretend human head with pretend ears sticking out of the sides, and there is a little microphone embedded inside each pretend ear. We can put the microphone in place when we're recording, and it's going to mimic the interaural delay time, because the two microphones are placed apart from each other roughly according to the placement of our ears on our heads. Also, the material that it's made out of is going to mimic the head-related transfer function of sound passing through our heads and our ears.

Say we play back the sound over headphones, sending what was recorded in the left ear to our left ear and what was recorded in the right ear to our right ear. We'd get a very good sense of where the sounds were in space at the time we were recording: what was in front of us, to our left and right, behind us, and so on.

The other thing we can do, of course, is try to simulate those effects digitally: artificially applying the phase difference of the interaural delay time, plus some filtering that changes the frequency response to mimic the head-related transfer function, in order to simulate sound coming from a particular location in space.
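As a very rough sketch of that idea, the code below takes a mono signal and makes it seem to come from the listener's left: the right channel is delayed by a few samples (standing in for the interaural delay time) and run through a simple smoothing filter that dulls its high frequencies (a crude stand-in for the head-related transfer function). The delay of 26 samples, the smoothing value, and the function name are all assumptions for illustration; real binaural processing uses measured filters for each ear and direction.

```python
def simulate_left_source(mono, delay_samples=26, smoothing=0.5):
    """Return (left, right) channels for a source on the listener's left."""
    left = list(mono)
    # The far ear hears the sound slightly later: pad the start with silence.
    delayed = ([0.0] * delay_samples + list(mono))[:len(mono)]
    # One-pole low-pass filter: each output leans on the previous output,
    # which reduces high-frequency energy (hypothetical stand-in for HRTF).
    right = []
    previous = 0.0
    for sample in delayed:
        previous = smoothing * previous + (1 - smoothing) * sample
        right.append(previous)
    return left, right

# A single click, so the delay and the smoothing are easy to see.
click = [1.0] + [0.0] * 99
left, right = simulate_left_source(click)
print(left[:3], right[24:29])
```

In the output the click appears immediately in the left channel and shows up later, smeared out and quieter, in the right channel.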

Binaural sound is great if you're listening over headphones, but it doesn't work so well over speakers. With, say, stereo speakers, we don't have the luxury of having only the left channel go to our left ear and only the right channel go to our right ear.

We don't get that isolation. Both channels are going to reach both of our ears, so we can't really simulate the interaural delay time or the head-related transfer function so well over speakers.

So we can get some limited sense of spatialization with stereo left and right, but if we want to get more serious, we need to get more speakers involved. You have probably all heard of 5.1 surround sound, for instance. That's the basic idea.

In the front, we have a left, a centre, and a right, and in the back, we have a left and a right. That's how we get our five channels of sound. This is what's used in movie theatres and, most of the time, in home theatre systems as well, to try to give a basic sense of sound moving through space.

If we have those five channels, we can do that. We can obviously have more channels, too. It's fairly common, especially in the academic world of computer music, to have eight or ten or even more channels of surround, with more and more speakers around the room to simulate space a little more precisely. When we do this, we usually have special software that helps us figure out how much of each sound we want to send to each speaker.

That's all for day 096. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 094 we addressed the question of how many channels we need to record sound in different scenarios to represent the location of sound in space. We learned that interaural delay time (IDT) is the difference in time between when a sound gets to your right ear and when it gets to your left ear.

You can catch up using the link below.

**100 Days Of ML Code Day 094** (*Recap from day 093*, medium.com)

Today, we are going to continue from where we left off in day 094.

HRTF (head-related transfer function) simply means this: if we look at the yellow line going to the right ear in the image above, we'll see that for the sound to get to your right ear, it's actually travelling through your head. So sound, depending on where it's coming from, might have to get through your head to get to your ear.

It might have to get through parts of your ear, such as your outer ear, or through all different parts of your head in order to reach your ear and actually be heard, and travelling through your head is different from travelling through air. In particular, some of the higher-frequency components of the sound are going to lose more energy, and so it's going to sound different by the time it gets to your ear because it had to travel through your head or your outer ear.

Again, we automatically pick up on these cues, and this helps us figure out where a sound is coming from. So interaural delay time and the head-related transfer function are really powerful cues for getting a sense of where a sound is coming from in space. We can take advantage of these, if we're listening on headphones, to do something called binaural sound.

Binaural sound is essentially spatialized sound designed explicitly to be heard over headphones. We'll look at binaural sound in detail in day 096.

That's all for day 095. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 093 we talked about bit width and what it means in terms of binary digits. We also talked about its implications for recording sound: we want to take full advantage of our bit width without going beyond the binary digits that are available to us.

You can catch up using the link below.

**100 Days Of ML Code Day 093** (*In the last two days, we talked about sampling rate, and how to determine an appropriate sampling rate to represent*, medium.com)

Today, we'll address the question of how many channels we need to record sound in different scenarios to represent the location of a sound in space.

We're going to address the issue of channels and spatialization: essentially, how many amplitude values we need to record at each sample in time in order to send sound out to the two ears of a pair of headphones, or to multiple speakers, and simulate the location of sounds in space.

As we talk about channels and spatialization, I first want to talk about two important phenomena related to this: Interaural Delay Time and the Head-Related Transfer Function. Then we'll talk about how these can combine to simulate the spatialization of sound through headphones, through a process called binaural sound. And then we'll talk about what we might need to do differently if we're sending sound out over multiple speakers with sound diffusion.

First I want to talk about Interaural Delay Time and the Head-Related Transfer Function.

So let's pretend that the image above shows your head, and that we have a sound coming from the speaker over there. As that sound travels to your ears, imagine that there's a sound wave going to your right ear and a sound wave going to your left ear.

Now, what's important here is that the lengths of those two yellow lines that I drew in the image are a little different from each other. It's going to take longer for the sound to get to your right ear than to your left ear, because it has to travel just a little bit further. So there's going to be a difference in phase between those two sound waves as they reach your two ears.

When we're listening to sounds in the real world, we automatically process that difference in phase and use it as a cue to understand where the sound is coming from. That's called interaural delay time (IDT): the difference in time between when a sound gets to your right ear and when it gets to your left ear.
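To get a feel for the sizes involved, here is a small sketch that turns a difference in path length into a time delay and a phase difference. The speed of sound (about 343 metres per second in air at room temperature) is standard; the extra path length of 0.2 metres and the 500 Hz tone are hypothetical values chosen only for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, in air at about 20 degrees Celsius

def interaural_delay(extra_distance_m):
    """Extra travel time, in seconds, for the ear that is further away."""
    return extra_distance_m / SPEED_OF_SOUND

def phase_difference(frequency_hz, delay_s):
    """Phase offset in radians between the two ears for a pure tone."""
    return (2 * math.pi * frequency_hz * delay_s) % (2 * math.pi)

delay = interaural_delay(0.2)              # assumed 0.2 m longer path
print(round(delay * 1000, 3), "milliseconds")
print(round(delay * 44100, 1), "samples at 44,100 Hz")
print(round(phase_difference(500, delay), 3), "radians at 500 Hz")
```

The delay is well under a millisecond, which is why it takes more than one sample per moment in time, one per ear, to capture this cue in a recording.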

That's all for day 094. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In the last two days, we talked about sampling rate and how to determine an appropriate sampling rate to represent audio digitally. That had to do with the x-axis, the time axis of our waveform representation.

You can catch up using the links below.

**100 Days Of ML Code Day 092** (medium.com)

**100 Days Of ML Code Day 091** (*Recap From Day 090*, medium.com)

Starting from today, we're going to turn to the y-axis, the amplitude axis, and talk about how many binary digits we need, that is, what our bit width needs to be, to represent each amplitude sample that we record. We'll talk more formally about what bit width is, and we'll review what binary numbers are, in case you're not familiar with them already. Then we'll also talk about some implications of bit width, in terms of how we record sound and how artists have used it in some interesting ways.

Formally the bit width is the number of binary digits that we use to represent the amplitude of each sample.

So for each of the dots in the image shown above, how many binary digits are we using on the computer? How many zeros or ones are we using to represent what that amplitude value is? It's important that we think about this in terms of binary numbers.

If we had a bit width of one, for example, that would mean we'd use one binary digit, and a binary digit can be either zero or one. It's either on or it's off. So our resolution is effectively two: we have two options for how we're going to represent the amplitude, and that would obviously be an incredibly restrictive environment to work in.

If we use two bits, each of the two binary digits, as seen in the table above, could be either zero or one. So there are two possibilities for the first digit and two possibilities for the second digit, and 2 times 2 is 4. Still pretty limited. When we go up to 8 bits, which is actually used in some fairly low-resolution recordings, we have 8 binary digits, so 2 to the 8th power possibilities: 256 possible amplitude values. In other words, taking the negative one to positive one amplitude space, the y-axis of our waveform, we've limited it to 256 different possibilities evenly spread across that space.

With 16 bits, which is what we use on CDs, we have 2 to the 16th possibilities, about 65,000. With 24 bits, which is what I like to use whenever possible, we have 2 to the 24th possibilities, about 16 million. So those extra eight bits, from 16 bits to 24 bits, get you a lot of extra resolution on your y-axis, from about 65,000 up to about 16,000,000. Some people record at 32 bits as well, with some high-end audio software and hardware.
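The numbers above all come from the same rule, two raised to the power of the bit width, which a few lines of code can confirm (the function name `amplitude_levels` is just for illustration):

```python
def amplitude_levels(bit_width):
    """Number of distinct amplitude values a given bit width can represent."""
    return 2 ** bit_width

for bits in (1, 2, 8, 16, 24):
    print(f"{bits:2d} bits: {amplitude_levels(bits):,} possible values")
```

Every extra bit doubles the number of possible values, which is why going from 16 to 24 bits multiplies the resolution by 256.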

So, obviously, we want to record with as much resolution as we can, within the limits of whatever media we're working with. Eventually, if we're using a CD, we're going to be limited to 16 bits when we finally encode that file for the CD. And we can't use infinite amounts of disk space, an issue we'll get to later. But I also want to talk about the implications of this for recording, because it's not enough to just record something at a good bit width, whether 16 bits, 24 bits, 32 bits or whatever.

It's very important that when you are recording, you try to use the full dynamic range that is available to you. If you are recording at a high bit width but only using a tiny bit of the negative one to positive one amplitude range, because your input level is turned really low or because of whatever else might be going on in your process, you're wasting all these bits. They're just never getting used for anything, so you're effectively recording at a much lower resolution.
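As a rough back-of-the-envelope sketch of that idea: if the signal only ever reaches a fraction `peak` of the negative one to positive one range, only about that fraction of the available levels gets used. The helper `effective_bits` below is my own hypothetical name, and the formula is an approximation, not a formal measurement of recording quality.

```python
import math


def effective_bits(bit_width: int, peak: float) -> float:
    """Approximate bits actually exercised when the signal peaks at +/-peak
    inside a -1..+1 range. Assumes levels are evenly spread over the range."""
    if not 0.0 < peak <= 1.0:
        raise ValueError("peak must be in (0, 1]")
    # Levels used is roughly 2**bit_width * peak, so take log2 of that.
    return bit_width + math.log2(peak)


# A 16-bit recording that only reaches 1/256 of full scale
# behaves roughly like an 8-bit recording.
print(effective_bits(16, 1 / 256))
print(effective_bits(24, 1.0))
```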

But on the other hand, if you record too loud, thinking you want to use every single one of those bits no matter what, that's also a problem. If you go over the negative one to positive one range, you've run out of binary digits to represent those amplitude values, so they all just get clipped, or cut off, at positive one or negative one. You end up with something called digital distortion, which is also not a good thing. Basically, the peaks and the troughs of your waveforms get lopped off, and that tends not to sound very good either.
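Here is a small sketch of what that lopping off looks like in numbers. I generate a 440 Hertz sine wave that is 1.5 times too loud and clip it to the negative one to positive one range. The gain of 1.5 and the helper names `sine` and `clip` are assumptions chosen for illustration.

```python
import math


def sine(freq_hz: float, sample_rate: int, n_samples: int, gain: float = 1.0) -> list[float]:
    """Sample a sine wave; gain > 1.0 pushes it beyond the -1..+1 range."""
    return [gain * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def clip(sample: float, low: float = -1.0, high: float = 1.0) -> float:
    """Cut a sample off at the edges of the representable range."""
    return max(low, min(high, sample))


too_loud = sine(440.0, 44_100, 200, gain=1.5)
clipped = [clip(s) for s in too_loud]

# Count how many samples lost their true value to clipping.
flattened = sum(1 for s in too_loud if abs(s) > 1.0)
print(f"{flattened} of {len(too_loud)} samples were flattened")
```

Every flattened sample sits at exactly positive one or negative one, so the rounded tops of the sine wave become flat plateaus.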

That's all for day 093. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 091, we looked at the Nyquist Theorem.

You can catch up using the link below.

**100 Days Of ML Code Day 091** *Recap From Day 090* medium.com

Today, I'm going to talk briefly about what happens if our sampling rate is too low: we get something called foldover.

If our sampling rate is too low, it's not just that the frequencies above the Nyquist frequency, the highest frequency we can represent, disappear from our sound. They actually turn into other frequencies in the sampled sound. Let me show you what I mean.

In the image above, we've got a sine wave on the top, and we can count the cycles in it from peak to peak: there are four plus a little bit more. That image is at a sampling rate of 44,100 Hertz.

If we take that same sine wave and reduce the sampling rate to something crazy low, like 284 Hertz, we end up with what you see at the bottom of the image above. We're still getting a periodic sound here.

It's not a sine wave anymore, because we've lost the resolution of that curve, and it's not the same number of cycles anymore either, but we are getting cycles. We're getting one full cycle plus a little bit more in this particular span of time.

We're going to hear that as a periodic sound. It's going to have a frequency, but it's not going to be the original frequency we expected from the 440-Hertz sine wave in the top image.
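We can sketch which frequency we end up with. The usual way to compute foldover is to wrap the frequency around the sampling rate and then reflect it back below the Nyquist frequency. The function name `alias_frequency` is my own, and this assumes a single pure sine wave with no filtering before sampling.

```python
def alias_frequency(freq_hz: float, sample_rate: float) -> float:
    """Frequency that a sine wave at freq_hz appears to have
    after being sampled at sample_rate (foldover)."""
    folded = freq_hz % sample_rate
    # Anything above the Nyquist frequency reflects back down below it.
    return min(folded, sample_rate - folded)


print(alias_frequency(440, 44_100))  # high enough sampling rate: unchanged
print(alias_frequency(440, 284))     # too low: folds over to a lower frequency
```

For the 440 Hertz sine wave sampled at 284 Hertz, this gives 128 Hertz, a bit less than a third of the original frequency, which is roughly in line with going from four-and-a-bit cycles to one-and-a-bit cycles in the image.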

So, just to quickly review: in the last two days, we talked about the Nyquist Theorem as a way to figure out an appropriate sampling rate, namely that our sampling rate needs to be at least double the highest frequency we want to represent, and we talked about how we arrived at 44,100 Hertz as a fairly standard sampling rate.

We also talked about foldover and other effects that can happen when we're recording at a sampling rate that's too low. In the next couple of days, we're going to get into the question of bit width and how we decide what resolution we need to represent the amplitude of each sample.

That's all for day 092. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 090, we looked at Sampling Rate.

You can catch up using the link below.

**100 Days Of ML Code Day 090** *Recap From Day 089* medium.com

I promised you yesterday that I'd explain a little more formally what a sampling rate is and that we'd look at the Nyquist Theorem, which gives us some guidance on picking a sampling rate.

We learned a bit more formally what a sampling rate is but didn't get the chance to look at the Nyquist Theorem.

Today, without further ado, let's get to it.

The Nyquist Theorem says that the sampling rate must be at least twice the highest frequency that you wish to represent. This makes a lot of intuitive sense when you think about it. To see why, let's think about our sine wave again.

If I have a sine wave at, say, 440 Hertz, then I have 440 peaks and 440 troughs happening every second. So the minimum I need to capture digitally, in terms of those dots, those amplitude readings, is this: for each cycle of my sine wave, I need at least one sample somewhere on my peak, above the zero crossing, and at least one sample somewhere down by my trough, below the zero crossing.

So I need 440 peaks and 440 troughs, or 440 above-zero samples and 440 below-zero samples, to be able to capture these 440 cycles of my sine wave in a second. I would simply multiply 440 by 2 and end up with 880 Hertz as the sampling rate I need. In reality, though, we're not looking at every individual sine wave or frequency component that we want to represent. We want to come up with some general sampling rate that's going to work really well for a lot of things. So what should that sampling rate be? We can deduce this logically.

We talked about the range of human hearing going from roughly 20 Hertz up to 20,000 Hertz. If we take 20,000 Hertz and multiply it by 2, we end up with 40,000 Hertz. So we know that the sampling rate must be at least 40,000 Hertz, and the number we usually end up seeing is 44,100 Hertz. The reason for this has to do with the history of the early days of digital recording and some decisions that Sony and other manufacturers made in the late 1970s, which aren't really worth getting into here. But that number has largely stuck: 44,100 Hertz is the sampling rate we use on compact discs in particular.
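The two calculations above can be sketched in a couple of lines. The function names are hypothetical, picked just for this example.

```python
def minimum_sampling_rate(highest_freq_hz: float) -> float:
    """Nyquist Theorem: sample at least twice the highest frequency you want to represent."""
    return 2.0 * highest_freq_hz


def nyquist_frequency(sample_rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2.0


print(minimum_sampling_rate(440))      # the 440 Hertz sine wave
print(minimum_sampling_rate(20_000))   # top of the range of human hearing
print(nyquist_frequency(44_100))       # what the CD sampling rate can represent
```

Running it the other way around shows that 44,100 Hertz leaves a little headroom above the 20,000 Hertz limit of human hearing.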

You'll sometimes see other sampling rates, like 48,000 Hertz, for instance. You'll sometimes see higher rates, like 96,000 Hertz or even 192,000 Hertz, in very high fidelity recordings. The reason for that, of course, is that at the minimum rate we capture at least one sample somewhere on the peak and one somewhere on the trough of a sine wave, but that's not going to be enough to really capture the entire shape of that sine wave, that entire curve.

If you want to get a really nice representation of it, you're going to want as many samples as possible all along the way. So the higher the sampling rate, the better the resolution we'll get and the better we'll be able to represent those curves.

That's all for day 091. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 089, we looked at Copying Analog And Digital.

You can catch up using the link below.

**100 Days Of ML Code Day 089** *Recap From Day 088* medium.com

If you lose your digital data, if it gets corrupted, usually you're just finished. You have no semblance of the original left to work with at all. This is another point I wanted to make about the relationship between analog and digital.

So, to review: in the past two days, we talked about the basic differences between analog and digital sound as continuous versus discrete representations of audio. We talked about the discrete samples of audio that we get in a digital representation of sound and three issues that come with them: sampling rate, the horizontal resolution, or how many samples we take per second; bit width, the vertical resolution, or how much resolution we use to represent each amplitude value; and how many channels of sound we need. We also talked about some implications of analog versus digital sound for copying audio and for the degradation and preservation of audio.

So, starting today and over the next couple of days, we're going to delve into the three ideas of sampling rate, bit width and channels in much more depth and look at some of the details of how we decide what we need in each of those domains.

As I mentioned in the last paragraph, today we're going to delve into the question of sampling rate in much more detail and ask a simple question: how do we determine what the appropriate sampling rate is? We'll explain a little more formally what a sampling rate is, and we'll look at the Nyquist Theorem, which gives us some guidance on picking a sampling rate.

Later on, we'll also talk about foldover, which is something that can happen, and that we usually don't want to happen, if you pick a sampling rate that is too low for a particular project.

Sampling rate, very simply put, is the number of samples per second of digital audio.

If you recall, we have this very zoomed-in sine wave, as seen in the image above. Each of those dots is a sample. Sampling rate simply asks how many of those dots we are capturing every second. And because it is a samples-per-second metric, we use Hertz to represent it, the same unit we used to represent frequency.

For instance, if 8,000 Hertz is our sampling rate, it simply means that we're capturing 8,000 of these samples every second. That's how we talk about sampling rate. Now, how do we decide what our sampling rate should be? It's actually fairly simple. We use something called the Nyquist Theorem, which is also sometimes known as the sampling theorem.
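To see the dots-per-second idea in code, here is a minimal sketch that samples a sine wave and counts the samples. The 440 Hertz tone and the function name `sample_sine` are assumptions for the example. The 8,000 Hertz sampling rate is the one from the text.

```python
import math


def sample_sine(freq_hz: float, sample_rate: int, duration_s: float) -> list[float]:
    """Capture amplitude values of a sine wave at evenly spaced moments in time."""
    n_samples = int(round(sample_rate * duration_s))
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


one_second = sample_sine(440.0, 8_000, 1.0)
print(len(one_second))   # number of dots captured in one second
print(one_second[:4])    # the first few amplitude values
```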

That's all for day 090. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 088, we looked at part two of analog versus digital sound.

You can catch up using the link below.

**100 Days Of ML Code Day 088** *Recap From Day 087* medium.com

Today, we will continue from where we left off in day 088.

I said I wanted to take a quick digression. It's not really a digression, because it's elemental to the differences between analog and digital media, and it has to do with how we copy media.

When you make a copy of a record, say, if you're dubbing a record to a cassette tape like I did as a kid, you don't end up with a perfect copy, because you're copying analog data. You're copying this continuous function. It's not going to be perfect. The process is essentially re-recording the data as it's being played back.

In the digital domain, you're just copying a bunch of zeros and ones, and if you're worried you might have made a mistake, you can go back and check again and again, in all kinds of different ways, to make sure that you haven't.
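One common way of checking a digital copy is to compare checksums of the original and the copy. The sketch below uses SHA-256 from Python's standard library. That choice, the made-up bytes standing in for audio data, and the name `fingerprint` are my own assumptions and not something specific to any audio tool.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the data: equal digests mean the bytes match."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the zeros and ones of a digital recording.
original = bytes(range(256)) * 4
copy = bytes(original)

# Simulate a copying mistake by flipping a single bit.
corrupted = bytearray(original)
corrupted[10] ^= 0x01

print(fingerprint(original) == fingerprint(copy))              # perfect copy
print(fingerprint(original) == fingerprint(bytes(corrupted)))  # mistake detected
```

Even a single flipped bit changes the digest, so we can verify a copy as many times as we like without comparing it by ear.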

Digital copies are perfect replicas of the originals. There isn't really any sense of a master anymore, because every copy can be perfect, and this obviously had lots of implications for music sharing and piracy, along with legal ramifications.

Once you can just rip a CD or share a file online, and the copy is perfect, as opposed to the generational loss of making copies of copies of copies of analog media, it can become much more of an issue.

What I really wanted to talk about here is the implications of analog versus digital in a more artistic sense. To demonstrate this, I want to talk about a work by filmmaker Bill Morrison and composer Michael Gordon called Light Is Calling.

It was written in 2004. What Bill Morrison did was take footage from some early silent films that were in archives, where the film reels were starting to decay.

If you put these into a projector, they might just disintegrate, or they might play once or twice before they start falling apart. But the images no longer look as they did in the 20s or the 30s or whenever they were originally made.

They're really transformed and dirty, and there's all kinds of noise. Sometimes it's impossible to tell what the original was, and sometimes you can make out little bits of it. He edited a bunch of material together from a silent film called The Bells for this piece, and then Michael Gordon, using digital sound, composed a soundtrack to go along with it.

What I think it really shows is how analog and digital media decay in different ways. Here's this ancient, crumbling analog film reel, and it still contains some of the original information. Digital data doesn't degrade nearly as gracefully.

That's all for day 089. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 087, we looked at analog versus digital sound.

You can catch up using the link below.

**100 Days Of ML Code Day 087** *Recap From Day 086* medium.com

Today, we will continue from where we left off in day 087.

The key difference between analog and digital is continuous versus discrete. Just to drive that point home, I want to go to Audacity. Below is a sine wave, and all I've done is zoom in basically as far as Audacity will go.

That's maybe a little bit too far, but you can see the individual dots on it. Each of those represents an individual amplitude value that we've stored digitally, so as we zoom in further and further, we get down to the lowest level of the digital representation.

We can see the individual samples that were captured, and of course Audacity has done a nice job of connecting the dots between them, where we don't really know what's happening. We can only interpolate. So that's the key difference here: continuous versus discrete.

In the digital domain, if we're thinking about what is actually stored at each of those dots, there are a few different considerations, and we'll be exploring these in much more detail in the coming days.

The first one is sampling rate: how many of those dots are we recording every second? This has to do with the horizontal resolution, the time resolution, that we're capturing. How many amplitude values are we recording every second? How fast do those dots come one after another?

The second one is bit width. This has to do with resolution in the amplitude dimension. How many bits of digital data are we reserving to store every single one of our individual amplitude values? So again, bit width is our resolution on the y-axis of a waveform, and sampling rate is our resolution on the x-axis.

The third thing is how many amplitude values we are recording at each moment in the image below. Right now we see just a single dot for each moment in our waveform view, but we might need to record two channels, or three channels, or ten channels in order to represent the locations of sound in space.

We may need to capture multiple amplitude values at each moment. Stereo sound, for instance, would have two channels.
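These three considerations multiply together to tell us how much data we are storing every second. As a quick sketch, here are the numbers for CD-style audio (44,100 Hertz, 16 bits, two channels for stereo). The function name `bits_per_second` is hypothetical.

```python
def bits_per_second(sample_rate: int, bit_width: int, channels: int) -> int:
    """Raw data rate: samples per second x bits per sample x channels."""
    return sample_rate * bit_width * channels


cd_bits = bits_per_second(44_100, 16, 2)
cd_bytes = cd_bits / 8  # 8 bits per byte

print(f"{cd_bits:,} bits per second")
print(f"{cd_bytes:,.0f} bytes per second")
print(f"{cd_bytes * 60:,.0f} bytes per minute")
```

Raising any one of the three, a higher sampling rate, a larger bit width, or more channels, raises the storage cost in direct proportion.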

So those are the fundamental things we'll be exploring over the coming days. But before we do that, I said I wanted to take a quick digression. It's not really a digression, because it's elemental to the differences between analog and digital media, and it has to do with how we copy these media.

That's all for day 088. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 086, we looked at Envelope And Spectrogram.

You can catch up using the link below.

**100 Days Of ML Code Day 086** *Recap From Day 085* medium.com

In the past two days, we looked at timbre. We talked about timbre as consisting of two components, spectra and envelope. We talked briefly about the Fourier Theorem for periodic sounds and how it describes them as a series of sine waves at particular frequencies in integer multiple relationships to each other.

We talked about two new visual representations of sound: the spectrum, which shows what the frequency content is at a particular moment, and the sonogram, which shows how that frequency content changes over time.

In the next couple of days, we are going to shift gears a little bit and focus on how we represent sound digitally on a computer and all the issues that come up with that.

So starting today and over the next several days, we're going to focus on digital sound: how we represent audio waveforms digitally on a computer or a compact disc or any other kind of digital media. We'll talk about a lot of the issues that come up with that in particular.

Today, we're going to look broadly at the differences between analog and digital sound and the different challenges that come up in each medium. Then we're also going to take a little bit of a digression to talk about some of the implications of copying audio in the analog versus digital domains, and of preserving and archiving it.

I want to start by talking about analog versus digital sound. The key idea here is that analog is a continuous medium and digital is a discrete medium. So let's take a flagship example of each kind of recording media.

Let's think about vinyl records as the quintessential analog audio medium and compact discs as the quintessential digital medium.

On a record, we have a needle that's reading from these grooves. The grooves go up and down, and that creates the different amplitude values that are reproduced by a record player. So this is a continuous process. It's a continuous function.

We can think about it as y equals f of x, or something like that, where at any arbitrary moment in continuous time there is an amplitude value to work with. So that's the analog realm.

When we move to the digital realm, think of the compact disc. We have a laser that is reading a bunch of zeroes and ones off of our compact disc. Zeroes and ones are discrete, and they're representing discrete moments in time: discrete amplitude values at particular points. It's no longer a continuous function. It's a discrete function.

We decide at what moments in time we want to record these amplitude values, and then we capture them at those moments and play them back at those moments. We know nothing about what happens in between the discrete points in time where we're taking these amplitude samples.

And so this is the key difference between analog and digital: continuous versus discrete.

That's all for day 087. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 085, we looked at the Fourier Theorem.

You can catch up using the link below.

**100 Days Of ML Code Day 085** *Recap From Day 084* medium.com

Today, we will continue from where we left off in day 085.

Another way we can look at timbre information visually is something called a sonogram, or a spectrogram. The image below is a little bit different, because what we have on the x-axis now is time. Our y-axis is frequency, and colour maps to decibels.

So the reddest areas in the colour scheme are the ones that are highest in decibels. We can think of any given point as a particular moment in time at a particular place in frequency space, and the colour is an indication of the decibels at that particular moment in time and that particular frequency.

The reason sonograms and spectrograms are important is that we obviously have sounds in the real world that aren't sine waves or sawtooth waves or square waves and that change a lot over the course of the sound. This is a key component of timbre as well.

It's not enough just to say how frequencies are distributed and where the energy is across the frequency space. You also have to be able to say how it's changing over time.

It would not be enough just to list a bunch of frequencies and their amplitudes and phases in order to describe an instrument's sound, because we have to describe how it's changing at the beginning part, the attack portion of the sound. We have to describe its envelope, how it's changing over time.

Below is a live sonogram view of a sawtooth wave.

You can see that it's just straight lines: the frequency components, according to the Fourier Theorem, never change. See the image below for how a more complex sound looks.

Now that is obviously changing because the pitches are changing, but, equally important, you can see from the image above that within each of the notes the components are not just static lines. They are growing and shrinking and moving around, and they look almost like real drawings or squiggles rather than simple straight lines that are perfect.

That is the difference between the sounds we work with in real life and test sounds like sawtooth waves, and it shows what we need in order to describe their timbres. It's not enough to just say what the vertical, the frequency content, is. You need to describe the horizontal as well, how it's changing over time.

That's all for day 086. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 084, we looked at how we describe two sounds that have the exact same pitch and the exact same loudness but sound really different from each other.

You can catch up using the link below.

**100 Days Of ML Code Day 084** *Recap From Day 083* medium.com

Today, we will continue from where we left off in day 084.

As seen in the image to the left below, we have a sawtooth wave, and you can see a number of peaks highlighted in red. There is a peak at 440 Hertz, another at 880, and more at 1320, 1760, and so on.

Those all come at integer multiples of the fundamental frequency, 440 Hertz. They're at one times, two times, three times, four times, five times, six times, and on and on. Notice that each one is at lower decibels than the one before it. So most of our energy is at 440 Hertz, and it goes down and down from there.

The recording above might be a little bit loud, so watch the volume on your headphones or your speakers. It is a sawtooth wave at the same frequency as the sine wave, but it obviously has a very different timbre. Part of the way we can explain this is through the different spectra. Both have their peak at 440 Hertz (as highlighted in blue in the image below), but in the sawtooth there's all this extra stuff happening at other frequencies.

As we've seen before, we're not hearing all of those as separate sine waves or separate frequency components. They all blend together to create a single sound in our minds, and the way we can understand this is through a very important theorem in music technology called the Fourier Theorem.

What the Fourier Theorem says is that any periodic waveform can be represented as the sum of sine waves at frequencies that are integer multiples of a fundamental frequency. Our fundamental frequency in this case was 440 Hertz, and the integer multiples are what we saw before: 1 times 440, 2 times 440, 3 times 440, and so on.

At each of those integer multiples, we have a sine wave with some particular frequency, amplitude, and phase, and if we add those together, we can represent any periodic waveform, like a sawtooth wave or a triangle wave.
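To get a feel for this, here is a minimal sketch that builds a sawtooth wave by adding sine waves at integer multiples of 440 Hertz. It uses the standard Fourier series for a sawtooth, where the k-th harmonic has amplitude proportional to 1/k and alternating sign. The function name and the choice of 200 harmonics are my own assumptions.

```python
import math


def sawtooth_partial_sum(t: float, fundamental_hz: float, n_harmonics: int) -> float:
    """Approximate a sawtooth (ramping from -1 to +1 each cycle) at time t
    by summing sine waves at integer multiples of the fundamental."""
    total = 0.0
    for k in range(1, n_harmonics + 1):
        sign = (-1) ** (k + 1)  # alternating sign sets the phase of each harmonic
        total += sign * math.sin(2 * math.pi * k * fundamental_hz * t) / k
    return (2 / math.pi) * total


# A quarter of the way through a cycle, the ideal ramp has reached 0.5.
quarter_cycle = 0.25 / 440
for n in (1, 5, 50, 200):
    print(n, round(sawtooth_partial_sum(quarter_cycle, 440, n), 4))
```

With only one harmonic we just have a sine wave. As we add more, the sum gets closer and closer to the straight ramp of the sawtooth, and the shrinking 1/k amplitudes match the way each peak in the spectrum is lower than the one before it.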

Now, it is important to emphasize the word periodic here. This is a very important caveat, because a real-world waveform, like a recording of someone talking, is not periodic. The cycles don't repeat infinitely the way those of a sine wave or a sawtooth wave would. So the Fourier Theorem only works for periodic waveforms.

That's all for day 085. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 083, we looked at the Fletcher-Munson Loudness Curves. You can catch up using the link below.

**100 Days Of ML Code Day 083** *Recap From Day 082* medium.com

What we're going to talk about today and over the next couple of days is this: if we have two sounds that have the exact same pitch and the exact same loudness but sound really different from each other, how do we describe that? So we'll be talking about timbre today.

In the last couple of days, we looked at loudness and pitch, and I asked how we can distinguish two sounds that have the same loudness and the same pitch and yet sound very different. That's where the notion of timbre comes in.

Just to start, let's say I play you the sound of a trombone and then a double bass. Same pitch, more or less the same loudness, but they sound very different. How do we describe the differences between them? That's what we are going to talk about.

In the coming days, we will look at what timbre is. We will describe it in terms of two components, spectra and envelope, and we will also look at some ways of viewing sound, besides the waveform view we've been using, that show the timbre of sounds a little more clearly.

We'll also talk about the Fourier Theorem, which helps explain how timbre works. There are no really decent definitions of timbre out there, but let's look at my favourite one, which comes from the American Standards Association. It says that timbre is that attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar.

The language in the definition above is a little fancy, but basically what it's saying is this: if two sounds have the same loudness and the same pitch but still sound different to you, well, that's timbre. So basically it's just a grab bag of everything else about a sound that can't be described by its loudness and pitch.

The definition above is really just saying what timbre is not rather than what it is. Colloquially, we tend to define timbre as the colour or the tone of a sound, and that's fine, because it gives us a general sense of what we're talking about, but it doesn't get into specifics. It's really just a metaphor. These words explain, in a vague way, this other stuff that we don't really have a good way to describe.

The way we're going to talk about timbre going forward is in terms of two key components, spectra and envelope. I'm going to talk about this in a little more detail, and in order to do that we need to look at visual representations of sound. Remember that up to now we've been using the waveform representation of sound, where our x-axis is time and our y-axis is amplitude.

*sound spectrum*

What we have in the image above is a sound spectrum, a visual representation of sound that shows where there's energy at different frequencies. Our x-axis is frequency and our y-axis is decibels. The one on the left is a sine wave at 440 Hertz. It's a little hard to read the units on the diagram, but 440 Hertz is on the x-axis, and you can see the peak right there (the area highlighted by a red circle), showing that the sine wave's peak energy is at the 440 Hertz point.
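We can reproduce that single peak with a few lines of code. The sketch below computes a spectrum with a plain, slow discrete Fourier transform rather than whatever the audio software uses, and it reports linear magnitudes instead of decibels. The sampling rate of 4,000 Hertz and the 400-sample length are assumptions I picked so that 440 Hertz lands exactly on one frequency bin.

```python
import cmath
import math


def dft_magnitudes(samples: list[float]) -> list[float]:
    """Magnitude of each frequency bin up to the Nyquist frequency (naive DFT)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        acc = sum(samples[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        magnitudes.append(abs(acc) / n)
    return magnitudes


sample_rate = 4_000
n_samples = 400  # each bin is sample_rate / n_samples = 10 Hertz wide
tone = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(n_samples)]

mags = dft_magnitudes(tone)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / n_samples
print(f"peak energy at {peak_hz} Hertz")
```

For a pure sine wave, all of the energy sits in that one bin and every other bin is essentially zero, which is the lone spike we see in the image.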

That's all for day 084. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 082, we looked at the Harmonic Series. You can catch up using the link below.

**100 Days Of ML Code Day 082** *Recap From Day 081* medium.com

Today, we will continue from where we left off in day 082.

Let's look at one more thing before we leave our discussion of psychophysics for now. I want you to play the chirp sound below and focus on how loud it sounds over the course of the chirp from 20 Hertz to 20,000 Hertz.

Does it sound like it's ever getting louder or softer, or does the loudness feel the same the whole time? The loudness is obviously changing as the chirp goes from 20 Hertz up to 20,000 Hertz. Yet the amplitude of that sine wave in the file is actually not changing at all. It uses the full negative one to positive one range throughout, but our perception of it changes based on the frequency of the sine wave.

This is explained by something called the Fletcher-Munson Loudness Curves.

What the Fletcher-Munson Loudness Curves show is this: on our y-axis we have decibels, and on our x-axis we have frequency. If we follow one of the contours above, changing the decibel level as the frequency goes up, we actually perceive that curve as being the exact same loudness throughout.

So in order to get something that sounds equally loud from 20 Hertz all the way up to 20,000 Hertz, we actually have to change its amplitude in order to trick our ears into hearing it as the same. That's because our ears are more sensitive, over a broader range of dynamics, around 3,000 to 5,000 Hertz than, let's say, at the very low end of the spectrum or even at the very high end.

So this is another example of how frequency and loudness come together in our brains as we're hearing sounds to create effects that are very different from what we might see if we're just looking at a waveform.

So to review what we've covered in the past three days: we talked about psychoacoustics as describing how we perceive sound, not just how it exists acoustically in the world or how we might represent it as a waveform. In particular, we talked about loudness versus amplitude and pitch versus frequency. We looked at the Fletcher-Munson Loudness Curves as a really good example of this.

That's all for day 083. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 081, we looked at the third part of loudness and pitch. You can catch up using the link below.

**100 Days Of ML Code Day 081** *Recap From Day 080* medium.com

Today, we will continue from where we left off in day 081.

Toward the end of day 081, we saw an image that contains two sine waves, one at 440 Hertz and the second at 880 Hertz. I concluded by asking what happens if we actually listen to them.

Play the two audio clips above to hear what they sound like. When you play them together, the 440 Hertz sine wave and the 880 Hertz sine wave, how many different pitches do you hear? If you play just the 440 Hertz one, you hear that pitch very clearly, and if you play just the 880 Hertz one, you hear that pitch very clearly. But when you play them together, you'll hear something very different.

We hear some melding of the two, because they have this special relationship to one another, and this is even more evident if we go to real-world sounds.

If you play the video above, the sound you'll hear is a trombone sound, but we're not actually hearing the original trombone here. We're hearing a bunch of sine waves. This is actually a harmonic series. A **harmonic series** is a sequence of sounds, pure tones represented by sinusoidal waves, in which the frequency of each sound is an integer multiple of the fundamental, the lowest frequency.

The point is that it's not just the difference between the linear and logarithmic relationship of frequency and pitch. It's also the difference between making out individual frequencies and hearing them meld into some bigger composite result.

That's all for day 082. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 080, we looked at loudness and pitch. You can catch up using the link below. **100 Days Of ML Code Day 080** *Recap From Day 079* medium.com

Today, we will continue from where we left off in day 080.

When we think about frequency, we think of it as going up linearly, and there's a key musical construct built on this. It's called the harmonic series. If we have a base frequency at, say, 100 Hertz, we can think of integer multiples of that: 2 times 100 is 200, 3 times is 300, 4 times is 400, and so on through 500, 600, and beyond. The harmonic series is very important in music, and we can think of the 100 Hertz as representing our base frequency.
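To make the integer multiples concrete, here is a minimal Python sketch. The function name `harmonic_series` is my own choice for illustration, not something from a library.

```python
def harmonic_series(fundamental_hz, count):
    """Return the first `count` harmonics: integer multiples of the fundamental."""
    return [fundamental_hz * n for n in range(1, count + 1)]


# The 100 Hertz example from the text.
harmonics = harmonic_series(100.0, 6)
print(harmonics)  # [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
```

Notice that the gap between neighbouring harmonics is always the same 100 Hertz, which is what "going up linearly" means here.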

If the base frequency were the low C in the image below, then doubling that frequency would put us on the C an octave above. At three times the original frequency we would be on the G above that, and at four times we would be on the C above that. So we're not always getting Cs. We're getting different notes.

Continuing from there, we would get an E, then a G, then a kind of B flat, and so on. But there's another way to think about pitch, which is in terms of octaves, and this is no longer a linear scale of 1 times, 2 times, 3 times, 4 times. This is a scale that doubles every time: 100 Hertz, 200 Hertz, 400 Hertz, 800 Hertz, 1600 Hertz, and so on.

If we keep doubling rather than multiplying by successive integers, we end up with successive octaves that are all Cs, as seen below. We start on a C, we double it, and we get the C in the next octave up. We double that and get the C the next octave up. We double that and get the C the next octave up again.

So again, the way we hear pitch is not on the linear frequency scale but on a logarithmic octave scale. We hear these Cs as sharing something in common with each other, and going from one C to the next traverses the space of one octave, even though the difference between 100 and 200 Hertz and the difference between 200 and 400 Hertz are not the same in Hertz: 100 versus 200.
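Here is a small sketch of the same idea in Python. The helper names `octave_series` and `octaves_between` are hypothetical; the only real library call is `math.log2`. It shows that two jumps that are different sizes in Hertz are the same size in octaves.

```python
import math


def octave_series(base_hz, count):
    """Return `count` frequencies, each double the previous one."""
    return [base_hz * 2 ** k for k in range(count)]


def octaves_between(low_hz, high_hz):
    """Distance between two frequencies measured in octaves (a log2 ratio)."""
    return math.log2(high_hz / low_hz)


print(octave_series(100.0, 5))        # [100.0, 200.0, 400.0, 800.0, 1600.0]
print(200.0 - 100.0, 400.0 - 200.0)   # 100.0 200.0 (different in Hertz)
print(octaves_between(100.0, 200.0))  # 1.0
print(octaves_between(200.0, 400.0))  # 1.0 (same in octaves)
```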

So again there's this difference between how we represent things in frequency and how we hear them in terms of octaves, this logarithmic relationship we call pitch. I want to go a little bit further, because we hear something else that's a little more complicated when we're listening to pitch instead of frequencies.

The image below contains two sine waves: the one on top is at 440 Hertz, and the one on the bottom is at 880 Hertz.

So there is a two-to-one relationship between them; they're an octave apart. Now what happens if we actually listen to them?

That's all for day 081. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 079, we looked at a very brief introduction to the field of psychoacoustics and talked about loudness. You can catch up using the link below. **100 Days Of ML Code Day 079** *Recap From Day 078* medium.com

Today, we will continue from where we left off in day 079.

As we've learned previously, both amplitude and loudness are relative measures; they're not absolute measures of how loud something is in the real world. And there's a very important reason for this. If you think about amplitude, it goes from negative one to positive one, but negative one to positive one what? The answer is really nothing, at least from the computer's point of view, because when a computer plays back a sound we have no idea how the things in the chain after the computer are going to affect it.

How loud is the amplifier it's hooked up to? How loud are the speakers? How far away are we standing from the speakers, and so how much energy are the sound waves losing as they travel from the speakers to us? We don't know any of that, so we can't attach absolute units to the amplitude values we see on the computer. Amplitude is a relative measure: we know that plus one is more than plus 0.5, which is more than 0.25, and so on. The same thing is true of loudness.

Look at a mixer, for instance, as seen in the image below. The example on the right is a physical mixer like someone might use at a concert or in a recording studio. The example on the left is from Ableton Live, a digital audio workstation program, and it's controlling the loudness on the master channel. You can see the units in the virtual example. The zero I've highlighted in red is zero decibels. That just means the mixer is not making whatever sound is coming in any louder or any softer.

Below that, we have minus 12, minus 24, and so on. Again, this isn't speaking to a particular level that we could measure in the real world. It's just saying that the sound is coming in at a certain level, and then I'm either leaving it alone (zero dB) or decreasing it by a certain amount. So when we look at mixers, either virtually or in the real world, that's how we tend to think about loudness.
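As a rough sketch of what a fader setting does to the numbers, here is some Python. It assumes the common convention for amplitude ratios, where a change of d decibels multiplies the amplitude by 10^(d/20); that formula is not stated in this post, and the names `db_to_gain` and `apply_fader` are hypothetical.

```python
def db_to_gain(level_db):
    """Convert a relative level in decibels to an amplitude multiplier.

    Assumes the usual amplitude convention: gain = 10 ** (dB / 20).
    """
    return 10.0 ** (level_db / 20.0)


def apply_fader(samples, level_db):
    """Scale every sample by the gain that corresponds to the fader setting."""
    gain = db_to_gain(level_db)
    return [s * gain for s in samples]


signal = [0.0, 0.5, 1.0, -1.0]
unchanged = apply_fader(signal, 0.0)   # zero dB: the signal is left alone
quieter = apply_fader(signal, -12.0)   # minus 12 dB: roughly a quarter of the amplitude

print(unchanged)
print(quieter)
```

Nothing in this code knows how loud the result will be in the room. The fader only says how much the incoming level is changed, which is exactly why these are relative units.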

We're using decibels here as a measure of loudness rather than amplitude because, thanks to that logarithmic scale, moving the sliders on the mixer in the image above then has more perceptual, psychoacoustic relevance to us. Let's move on and talk about pitch for a moment.

As you heard from the audio in the video above, the pitch was not increasing linearly. It seemed like the pitch was rising very quickly at the beginning and then more and more slowly as it went on. In other words, it followed that same kind of logarithmic curve, levelling out as it gets higher and higher. That's because, unsurprisingly, there are different ways that we can think about pitch and pitch relationships. When we think about frequency, we think of it as going up linearly, and there's a key musical construct built on this. It's called the harmonic series.

That's all for day 080. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 077, we looked at another property of sound waves. You can catch up using the link below. **100 Days Of ML Code Day 077** *Recap From Day 076* medium.com

Let's move on now and look at one more important property of sound, which is phase. Phase really relates to the offset between two waveforms and how they relate to each other. **Phase** is the position of a point in time (an instant) on a waveform cycle. A complete cycle is defined as the interval required for the waveform to return to its arbitrary initial value.

The first illustration below shows what happens when we've got two channels of a signal in phase. When both channels are in phase, we hear the sound at the same amplitude level at the same time in both ears. The two sine waves in the illustration below are perfectly in phase. I say that they're in phase because every peak lines up with every peak and every trough lines up with every trough, all the way through.

*Left and right channels in phase. Source*

On the other hand, if one side of the stereo signal is reversed, as shown in the illustration below, the signals will cancel each other out, and we get a process called phase cancellation. These two waves are perfectly out of phase. What I mean by that is that all the peaks in the first one line up with the troughs in the second one, and all the troughs in the first one line up with the peaks of the second one.

*Left and right channels out of phase. Source*

In fact, if we were using a pure sine wave, combining both signals out of phase would result in silence, since the sounds would literally cancel each other out. In the real world, we normally don't listen to pure sine waves. Since most of the music we hear and the instruments we record are a complex combination of multiple waves and harmonics, the results of phase cancellation will be equally complex.
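The pure sine wave case is easy to check numerically. The sketch below is only an illustration: the 44,100 Hz sample rate, the 440 Hz tone, and the helper name `sine` are my assumptions, and "reversed" is modelled as flipping the sign of every sample.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed for this sketch)


def sine(freq_hz, num_samples, sample_rate=SAMPLE_RATE):
    """Sample a sine wave with amplitude 1."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]


left = sine(440.0, 1000)
right_in_phase = sine(440.0, 1000)
right_reversed = [-s for s in left]  # every peak becomes a trough

# Mixing the two channels means adding them sample by sample.
in_phase_mix = [a + b for a, b in zip(left, right_in_phase)]
reversed_mix = [a + b for a, b in zip(left, right_reversed)]

peak_in_phase = max(abs(s) for s in in_phase_mix)
peak_reversed = max(abs(s) for s in reversed_mix)

print(peak_in_phase)  # close to 2: the peaks reinforce each other
print(peak_reversed)  # 0.0: complete cancellation, i.e. silence
```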

There's something in between the two that can happen, called beating, which occurs when we have two sounds with slightly different frequencies. When two sound waves of different frequency reach your ear, the alternating constructive and destructive interference causes the sound to be alternately soft and loud, a phenomenon called beating or producing beats.

So let's say we have one sine wave at 440 Hz and another at 441 Hz, so they are one Hz apart. They are very close in frequency. But because one is slightly faster than the other, over time they're going to go through patterns of being completely in phase with each other, in which case they add up to make something higher in amplitude, and being fully out of phase with each other, in which case they cancel each other out. At one moment they're completely out of phase, at another they're completely in phase, then completely out of phase again, and so on. These patterns generate their own unique sound. We're going to hear a pulse once a second, which is called beating.
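For two sine waves of equal amplitude, the once-a-second pulse follows from the identity sin(a) + sin(b) = 2 sin((a + b) / 2) cos((a - b) / 2): the sum is a tone at the average frequency whose loudness is shaped by a slow cosine. Here is a minimal sketch of that; the function names are mine, and equal amplitudes are an assumption.

```python
import math


def two_tone(f1_hz, f2_hz, t):
    """Sum of two unit-amplitude sine waves at time t (in seconds)."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)


def envelope(f1_hz, f2_hz, t):
    """Slowly varying amplitude of the sum, from the sum-to-product identity."""
    return abs(2.0 * math.cos(math.pi * (f1_hz - f2_hz) * t))


def beat_frequency(f1_hz, f2_hz):
    """Number of loud pulses per second."""
    return abs(f1_hz - f2_hz)


print(beat_frequency(440.0, 441.0))  # 1.0 pulse per second
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Loud at t = 0 and t = 1 (in phase), silent at t = 0.5 (out of phase).
    print(t, round(envelope(440.0, 441.0, t), 3))
```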

And this is actually a phenomenon that musicians often use to help them tune when they're playing real instruments. If they hear that beating, they know that they're a little bit out of tune with each other. The rate of the beating tells them how far apart they are: if it happens once a second, it beats at one hertz and they know they're one hertz apart. If it happens five times a second, they know they're five hertz apart, and so on. The more the beating slows down, the closer they are, and when it disappears entirely they know that they're completely in tune.

That's all for day 078. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 076, we looked at the first property of sound waves. You can catch up using the link below. **100 Days Of ML Code Day 076** *Recap From Day 075* medium.com

Today, we will continue from where we left off in day 076.

Frequency is the speed of the vibration, and this determines the pitch of the sound. It is only useful or meaningful for musical sounds, where there is a strongly regular waveform.

Frequency is another really important property of waves. If we look at the sine wave below, there are a couple of important terms to think of. The very top point of the sine wave is called its peak, and the very bottom point after the peak, as seen below, is called its trough. The stretch from one peak to the next peak is called a cycle. The **wavelength** is the distance from crest to crest, trough to trough, or from a point on one **wave** cycle to the corresponding point on the next adjacent **wave** cycle.

*Adapted From Here*

Frequency is simply the number of cycles per second. It is measured in a unit called Hertz, usually abbreviated as Hz. The video below, for instance, shows a sine wave at 440 Hz. The note A above middle C has a frequency of 440 Hz, and it is often used as a reference frequency for tuning musical instruments.
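We can check the "cycles per second" definition by sampling one second of a 440 Hz sine wave and counting how many times it crosses zero on the way up, which happens once per cycle. This is just a sketch: the 44,100 Hz sample rate and the name `count_cycles` are my assumptions.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed for this sketch)


def count_cycles(freq_hz, seconds=1.0, sample_rate=SAMPLE_RATE):
    """Count upward zero crossings of a sampled sine wave, one per cycle."""
    num_samples = int(seconds * sample_rate)
    cycles = 0
    previous = None
    for n in range(num_samples):
        # The wave starts at a trough (phase shift of -pi/2). For the
        # frequencies used here this keeps samples off the zero crossings.
        value = math.sin(2 * math.pi * freq_hz * n / sample_rate - math.pi / 2)
        if previous is not None and previous < 0.0 <= value:
            cycles += 1
        previous = value
    return cycles


cycles_per_second = count_cycles(440.0)
period_seconds = 1.0 / 440.0  # duration of a single cycle

print(cycles_per_second)  # 440
print(period_seconds)     # a little over two thousandths of a second
```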

So, what frequencies can we as human beings actually hear? The range people usually give is from about 20 Hz up to about 20,000 Hz. It's important to understand that this is all approximate, and I want to delve into that in a little more depth to explain exactly what I mean.

First of all, at the low end of the spectrum, when we get below 20 Hz, we might very well still hear the sounds. But that's where it stops sounding like a pitch to us and starts sounding more like rhythm. As an example, if I play you something at 10 Hz, e.g. a sawtooth wave, it no longer sounds like a pitch; it sounds like rhythm.

You can hear something going duh duh duh duh duh ten times a second. So that's what happens at the low end of the spectrum. At the high end, around 20,000 Hz, different people may be able to hear up to different levels. As we get older in particular, we have a harder and harder time hearing those really high frequencies.

One final note here is that some animals have different ranges of hearing than we humans do. Dogs, for instance, can hear as high as 40,000 or sometimes even 60,000 Hz. That's why a dog whistle works: it's just making a tone that's higher than 20,000 Hz but falls well within the range of dog hearing.

That's all for day 077. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 075, we started looking at sound waves. You can catch up using the link below. **100 Days Of ML Code Day 075** *Recap From Day 074* medium.com

Today, we will continue from where we left off in day 075.

It turns out that sine waves are a very important function for us to think about. Below is a sine wave.

*Adapted From Here*

Let's look at some of its properties now. The first one is amplitude.

The amplitude of a periodic variable is a measure of its change over a single period (such as time or spatial period).

*Adapted From Here*

Amplitude is the maximum displacement of points on a wave, which we can think of as the degree or intensity of change. As seen above, it is plotted on the y-axis, and it's typically going to go from negative one to positive one. The horizontal line across the middle represents the zero point. We can think of the amplitude as representing a lot of different things.

We can think of it as a speaker cone pushing and pulling back and forth to create sound waves. We can think of it as air molecules pushing and pulling back and forth as the sound wave propagates through space. We can think of it as our eardrum pushing and pulling back and forth in response to the sound waves, or as a violin string vibrating back and forth.

And so on. But literally, it's a measure of how much something is going back and forth: how much is it vibrating, or pushing and pulling, and what is the magnitude of that change?

We can think about individual values at particular moments in time: what is my amplitude value? We can also think about it more generally, in terms of the notion of an envelope. The envelope of something is a kind of gradual change. Envelope is a very important idea for amplitude, and in fact for other parameters in music that we can control and gradually change over time.

That's all for day 076. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 074, we looked at the third part of designing custom algorithms for music. You can catch up using the link below. **100 Days Of ML Code Day 074** *Recap From Day 073* medium.com

We've learned a lot in the last 74 days. One thing I realized is that exploring some of the fundamentals of music technology (the way sound works, the way we represent it digitally, the way we convert it between different representations) would be nice to have.

So today we'll look at the very basics of sound waves: the way we look at them visually as waveforms, the way they act as functions, and why sine waves are a particularly important function to think about. We'll also meet three important properties of sound waves: amplitude, frequency, and phase.

So let's get started.

A sound wave is the pattern of disturbance caused by the movement of energy traveling through a medium (such as air, water, or any other liquid or solid matter) as it propagates away from the source of the sound. https://whatis.techtarget.com/definition/sound-wave

The image below is a waveform. You've probably seen something like it before. It's a basic way of representing sound visually. On the x-axis we have time, and on the y-axis we have amplitude. I'll explain more about amplitude later. One important thing to note is that a waveform is actually a function: for each x value, that is, each time value, there is one and only one amplitude value that it corresponds to.

*Waveform*

You can see the waveform shown above and you can see how the amplitudes are going up and down over time. And you can see the silence at the beginning. So like I said, waveforms like the one shown above are functions and so we could think about them as actually being mathematical functions.

So what would make a good function for a waveform? We could think of something like y equals x squared, which might make an interesting function. But there's a little problem with that: as time increases along the time axis, the amplitude would go up, up, and off to infinity. That's probably not something that we want.

Instead, we want something where the amplitude always stays within a particular range, between a particular high and low. A better choice might be a periodic function like y equals sine of x, because then it'll go up and down over and over again and will always stay within a particular range.
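A quick comparison in Python shows the difference between the two candidate functions. The time values from 0 to 20 are arbitrary and chosen only for illustration.

```python
import math

# Arbitrary time values: 0.0, 0.5, 1.0, ..., 20.0
times = [0.5 * k for k in range(41)]

parabola = [t ** 2 for t in times]    # y = x squared
sine = [math.sin(t) for t in times]   # y = sine of x

print(max(parabola))                   # 400.0, and still growing with time
print(min(sine) >= -1.0, max(sine) <= 1.0)  # True True: always within range
```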

That's all for day 075. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

In day 073, we looked at the second part of designing custom algorithms for music. You can catch up using the link below. **100 Days Of ML Code Day 073** *Recap From Day 072* medium.com

Today, we'll continue from where we left off in day 073.

We saw in day 073 that particle filtering has two critical features. The first is that particle filtering can track nonlinear dynamics. To understand what the nonlinear dynamics of an object are, imagine a car moving in a street, observed through a video camera. Imagine that this car is accelerating and stopping more or less abruptly at different moments without apparent logic. Tracking this car can be difficult because its dynamics are complex. If the car were going straight at a constant speed, it would be much easier. A tracking method such as particle filtering is flexible enough to adapt to complex dynamics such as the car's and still be able to track it.

Another interesting feature of particle filtering is that it allows for non-Gaussian tracking. To understand what non-Gaussianity is, we first have to understand the Gaussian hypothesis. Imagine that we have to track a flying bird. This bird is going straight at a constant speed towards a wall in which there is only one circular hole, twice the size of the bird. Using probabilities to track the position of the bird when it passes the wall is fairly easy. The probability is at its maximum at the center of the hole, quickly decreases for positions nearer to the border of the hole, and is zero everywhere in the wall. In other words, the bird is not silly, and it's more likely to pass where the danger is minimal, which is at the center of the hole, far from the borders.

In this bird example, the tracking is Gaussian because the probability distribution over the positions of the bird has a bell shape with its maximum at the center of the hole, the mean of the bell, and it decreases as the position moves away from the mean.

In a non-Gaussian scenario, the probability distribution over the position of the bird is not a bell shape anymore. To illustrate such a situation, imagine the same bird flying straight towards the same wall, in which there are now two holes instead of only one. In that case, the probability distribution may have the shape of two bell curves and not one.

Such a probability distribution is not Gaussian anymore. Here, it is actually the sum of two Gaussian distributions, but it is more complex than a single Gaussian. As a result, it's slightly harder to track the bird passing the wall, because we don't know which of the two holes the bird will choose to pass through, and we have to keep both cases as possibilities.
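To see what the two-hole distribution looks like in numbers, here is a small sketch in one dimension. Everything specific in it is an assumption for illustration: the hole positions at -2 and 2, the spread of 0.5, and the equal weights of one half, which keep the sum a valid probability density.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a single bell curve at position x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def two_hole_pdf(x, hole_a=-2.0, hole_b=2.0, std=0.5):
    """Equal-weight sum of two bell curves, one centred on each hole."""
    return 0.5 * gaussian_pdf(x, hole_a, std) + 0.5 * gaussian_pdf(x, hole_b, std)


# High at the centre of each hole, almost zero on the wall between them.
for position in (-2.0, 0.0, 2.0):
    print(position, round(two_hole_pdf(position), 4))

# Rough numerical check that the density still sums to about 1.
step = 0.01
total = sum(two_hole_pdf(-6.0 + step * k) for k in range(1201)) * step
print(round(total, 3))
```

The two separate peaks are the point: a tracker that assumed a single bell curve would have to put its best guess on the wall between the holes, which is the one place the bird will not be.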

GVF draws upon the features of particle filtering in order to track expressive variations in gesture execution. This technique is quite suitable for this problem because expressive gesture variations, such as the ones performed by musicians, can have very complex, highly nonlinear dynamics, and can also be ambiguous or follow probability distributions more complex than a Gaussian. In both cases, the need for models that offer new control for music drove changes to, and adaptations of, conventional techniques. In the case of gesture follower, the need was to allow for real-time classification. In the case of gesture variation follower, the need was to follow complex characteristics of the gesture.

That's all for day 074. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.


In day 072, we looked at the first part of designing custom algorithms for music. You can catch up using the link below. **100 Days Of ML Code Day 072** *Recap From Day 071* medium.com

Today, we'll continue from where we left off in day 072.

The two models we have seen before come from more conventional methods; as such, they can be seen as hacks of these more classical models. For instance, gesture follower is a Hidden Markov Model (HMM). As we have seen previously, an HMM is a method that is particularly good at modelling a temporal sequence of events, such as words in a sentence or states in a gesture. The example given was drawing a circle gesture. We start at the bottom of the circle, then move toward the left, to the top, to the right, and back to the bottom. This gesture can be modeled by an HMM with four hidden states: bottom, left, top, right. Then we can spot when the hand is passing from the bottom to the left, to the top, and to the right position.

In an HMM we can define some transitions as more likely than others; for instance, moving from the bottom to the left is more likely than moving from the bottom directly to the top. In gesture follower, the HMM is configured so that the states, rather than the bottom, left, top, and right we had previously, are the gesture samples. So now the HMM is able to say that we are at the first sample of the circle, moving to the second, to the third, and so on. If the gesture is performed faster, the HMM can spot that we started at the first sample of the circle and then moved to the third, then the fifth, and so on. By changing the granularity of the states of the general-purpose HMM, gesture follower transforms the model into a real-time classifier with the ability to estimate progress within the gesture template.
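Here is a minimal sketch of the "some transitions are more likely than others" idea for the four-state circle. The probabilities are made up for illustration and are not taken from gesture follower. The code covers only the transition part of an HMM; a full HMM would also weigh each state by how well it explains the incoming observation.

```python
STATES = ["bottom", "left", "top", "right"]

# Hypothetical transition probabilities (each row sums to 1).
# Staying put or moving to the next state is likely; skipping a state
# (a fast gesture) is possible but less likely; going backwards is ruled out.
TRANSITIONS = {
    "bottom": {"bottom": 0.5, "left": 0.4, "top": 0.1, "right": 0.0},
    "left":   {"bottom": 0.0, "left": 0.5, "top": 0.4, "right": 0.1},
    "top":    {"bottom": 0.1, "left": 0.0, "top": 0.5, "right": 0.4},
    "right":  {"bottom": 0.4, "left": 0.1, "top": 0.0, "right": 0.5},
}


def predict(belief):
    """Push a belief over states one time step forward through the transitions."""
    return {
        target: sum(belief[source] * TRANSITIONS[source][target] for source in STATES)
        for target in STATES
    }


# We know the gesture starts at the bottom of the circle.
belief = {"bottom": 1.0, "left": 0.0, "top": 0.0, "right": 0.0}
after_one_step = predict(belief)
after_two_steps = predict(after_one_step)

print(after_one_step)   # left (0.4) is more likely than top (0.1)
print(after_two_steps)
```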

So now let's inspect how gesture variation follower (GVF) has also been designed based on a general-purpose algorithm and adapted, or hacked, in order to fulfill a musical objective. GVF is based on a method used for tracking called particle filtering. Tracking is the task of estimating, most often, the position of an object in a scene: for instance, a car in video captured by a CCTV camera, or each finger captured by a depth camera. Particle filtering is a widely used method for object or human tracking because it is a fairly generic method that does not rely on many hypotheses. More precisely, particle filtering has two critical features. We will look at those features in day 074.

That's all for day 073. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.


In day 071, we finished the part on what the models we have seen have in common. You can catch up using the link below. **100 Days Of ML Code Day 071** *Recap From Day 070* medium.com

What's the motivation for designing custom algorithms for music performance? By now we know that machine learning is a very powerful tool for building expressive, subjective musical systems. Machine learning techniques can afford new ways of building musical instruments and musical interactions that are not easily achievable by hand. And machine learning allows the user to provide the system with examples of the behaviours expected of the musical system, for instance, an example of a gesture to be recognized.

General-purpose machine learning, as presented in numerous books and courses, offers a wide range of techniques such as the ones we saw from day 002. However, sometimes taking a state-of-the-art machine learning algorithm may not be enough to reach the desired behaviour for the musical system that we want. After all, conventional machine learning techniques are not necessarily facing the same types of problems as the ones encountered in creative applications such as music.

Typically in music, the subjective character of a performance, its interpretation, is very important. In conventional machine learning, however, subjectivity is avoided because we want the algorithm to generalize to other situations and other users with the same accuracy. For instance, if we build a speech recognition system, we may not want our system to be user-specific. We want the accuracy of the speech recognition to be the same for anybody speaking. So general usability is one of the most important criteria for conventional machine learning.

But in music, we may prefer subjective exploration of systems: trying to play with the boundaries of the methods and with variations of the inputs. In other words, music performance may require custom algorithms that are inspired by or based on general-purpose machine learning techniques but that also embed specific structure that makes them suitable for the type of problems encountered in the design of interactive musical systems.

That's all for day 072. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.


In day 070, we looked at what these models have in common. You can catch up using the link below. **100 Days Of ML Code Day 070** *Recap From Day 069* medium.com

Today, we will continue from where we left off in day 070.

The probability of what we are looking for, regardless of any particular observation, is what we call the prior. It's our prior knowledge of what we are looking for, and it is often defined by hand. Then the probability of what we observe, given that we know what we are looking for, is the likelihood: we already have a guess about what we are looking for, and we test whether the observation fits.

So the methods we've seen are both Bayesian, and their Bayesian nature is intrinsically linked to the temporal structure. In these models, the belief we have in our estimate of the recognized gesture and all its potential variations is linked to the estimate we had at the previous time step, which plays the role of the prior knowledge, while taking into account the new incoming feature, the new information on the gesture execution, which is the likelihood. So the Bayes rule is a way to start with an initial guess about which gesture the user will perform, and with which variations, and then constantly update our belief based on each new observation.
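The update loop described above can be sketched in a few lines of Python. This is a toy illustration of the Bayes rule only, not the actual GF or GVF code: the gesture names, the prior, and the likelihood values are all made up.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over gesture classes: prior times likelihood, normalised."""
    unnormalised = {g: prior[g] * likelihood[g] for g in prior}
    total = sum(unnormalised.values())
    return {g: p / total for g, p in unnormalised.items()}


# Initial guess, defined by hand: both gestures are equally likely.
belief = {"circle": 0.5, "square": 0.5}

# Hypothetical likelihoods of two successive observations under each gesture.
observations = [
    {"circle": 0.8, "square": 0.2},
    {"circle": 0.6, "square": 0.4},
]

for likelihood in observations:
    # The previous belief plays the role of the prior for the next observation.
    belief = bayes_update(belief, likelihood)
    print(belief)
```

After the first observation the belief in "circle" rises to 0.8, and after the second it rises again to about 0.86. Each new observation refines the estimate instead of starting from scratch.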

As we can see, temporal modelling, the fact that what happens at a given time also depends on what happened before, works well with the Bayesian hypothesis. Our belief about what happens now is based on our previous belief about what happened before, while also considering what is happening now. This is important for real-time applications, as the belief in our estimate is constantly updated with each new observation. It's also pretty robust, because beliefs are probability distributions over the possible values, which allows the system to take into account various sources of noise, such as the noise coming from the motion capture system.

That's all for day 071. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.


In day 069, we looked at sonic interaction with GVF. You can catch up using the link below. **100 Days Of ML Code Day 069** *Recap From Day 068* medium.com

Today, we'll look at what these models have in common.

We have seen how we can do real-time classification with gesture follower (GF) and how we can learn more about how the gesture is performed with gesture variation follower (GVF). Both of these models have similar names because they come from the same family of models. Let's see what they have in common that also differs from the other methods that we saw earlier on. Knowing what they have in common will enable us to choose what will be the most appropriate in practice.

To start with, both models are temporal models; that is to say, they take into account the temporal trajectory of the gesture execution. What happens now depends on what happened before. Or, from the method's perspective, what has been estimated before has an influence on what is being estimated now. After all, a gesture is a continuous physical phenomenon. There is no discontinuity in the execution of the gesture. If we start a circle with our hand at the bottom, going towards the left, our hand cannot disappear at some point on the path of that circle and reappear slightly later at another position on the path. There are no holes in the gesture trajectory, so its temporal structure is important.

Dynamic time warping (DTW) also considers the temporal structure of a gesture, and it's thanks to this feature that DTW is able to segment a continuous gesture stream into previously recorded gestures. However, for both GF and GVF, the temporal structure is what builds the interaction. With GVF, we can perform the gesture and play with its temporal structure by freezing in the middle, then going backwards, freezing again, then going forward but faster than the original, for instance. Playing with the temporal structure is a key feature of these models. Other methods such as nearest neighbour, DTW, or Naive Bayes cannot afford such control.

Another common point between the two methods is the fact that they are probabilistic and, more precisely, Bayesian. We've already seen an example of a Bayesian method before: Naive Bayes, a classification method in which the classification decision is formulated in terms of probabilities. The Naive Bayes method relies on the Bayes rule. The Bayes rule gives a way to update our belief about what we are looking for (for instance, a gesture class) given what we can observe. This probability is often very hard to calculate directly. The Bayes rule tells us that we can calculate this belief in terms of a few simple probabilities that may be easier for us to compute: specifically, the probability of what we are looking for regardless of any particular observation, and the probability of what we observed given that we know what we are looking for.

That's all for day 070. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 068, we looked at how Gesture Variation Follower works. You can catch up using the link below.

**100 Days Of ML Code Day 068** *Recap From Day 067* medium.com

Today, we'll look at sonic interaction with GVF.

Behind the scenes, GVF uses a set of parameters in order to perform the inference that we saw on day 067. Beyond their role in the algorithm's machinery, these parameters have a direct influence on the method's behaviour, and so on the resulting interaction. Low adaptation values, for example, can be useful if we know that the performed gesture will be very close to the template, and only slight variations could occur.

For instance, an expert musician's gesture is known to be very consistent. If we would like to recognize a musician's gestures and track subtle variations, such a configuration of GVF is the one to consider. On the other hand, high adaptation values will allow huge variations in gestures to be tracked, but will lead to lower precision in the estimation.

So there is a trade-off between the precision and the speed of adaptation, which highly depends on the use case considered. And it's up to the designer or the artist to configure the algorithm for the desired behaviour. To finish with GVF, let's see how we can use such a tool to control sound.

This GVF based musical instrument will be based on drawn gesture shapes to make it simpler. Each gesture shape will be associated with a sound, and each continuous variation will control some synthesis parameters.

That's all for day 069. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 067, we looked at how Gesture Variation Follower works. You can catch up using the link below.

**100 Days Of ML Code Day 067** *Recap From Day 066* medium.com

Today, we'll continue from where we left off in day 067.

What we saw yesterday was one iteration on one particular combination of estimated values, used to compute the weight for that estimation. This process is iterated over each one of the hundreds of combinations of potential values, in order to give a probability distribution over the elements to estimate, such as the speed or the size of the recognized gesture.

The whole process will then be repeated for each new observation. However, before considering a new observation, what we want is not hundreds of combinations together with their weights, that is to say, their probabilities, but only one value that gives us the recognized gesture and the estimated alignment, speed and size. To do that, we have several possibilities. Let's see two of those.

The first possibility is to take the combination of values that has the highest weight, which means the highest probability. Although conceptually acceptable, in practice this approach is very noisy and gives jumpy estimations.

The second approach is to take the weighted mean. The weighted mean is computed by summing up the value of each element multiplied by its weight. This approach is much more robust and is actually the one implemented in GVF. When a new observation arrives, the hundreds of combinations of estimated values are updated from the previous estimation, and then the process is repeated.
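The two possibilities can be compared side by side in a few lines of Python. This is only a sketch for one continuous element (the relative speed); the function names and the numbers are hypothetical and are not GVF's actual code.

```python
def best_combination(values, weights):
    """First possibility: keep the value whose weight is the highest."""
    best_index = max(range(len(weights)), key=lambda i: weights[i])
    return values[best_index]


def weighted_mean(values, weights):
    """Second possibility: sum of value * weight, with the weights normalized."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Four hypothetical speed estimates and their weights.
speeds = [0.8, 1.0, 1.2, 1.4]
weights = [0.1, 0.4, 0.4, 0.1]

jumpy_estimate = best_combination(speeds, weights)
smooth_estimate = weighted_mean(speeds, weights)
```

With two combinations almost tied for the highest weight, the first possibility can flip from one to the other between observations, while the weighted mean settles in between them.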

That's all for day 068. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 066, we looked at how Gesture Variation Follower works. You can catch up using the link below.

**100 Days Of ML Code Day 066** *Recap From Day 065* medium.com

Today, we'll continue from where we left off in day 066.

To answer the question from yesterday: how do we infer? A straightforward way would be to take the incoming observation, the current X, Y value from the mouse, and to compute numerically the relative size, speed and so on through a clever formula. This is, however, not feasible in practice because the observations, that is to say, the data captured by the mouse or another motion capture system, are noisy, which would make such a calculation highly inaccurate. Instead, GVF proceeds by sampling.

Sampling means that instead of considering one potential value per element to estimate, that is to say, one relative size value, one relative speed value, one recognized gesture and so on, GVF will consider hundreds of combinations of potential values simultaneously.

Of course, they are not all good. Actually, only one will be the closest to the true values that we want to estimate. So the challenge is to select the best one among all of them. To do so, the algorithm uses the incoming observation to weight each combination of potential values according to its likelihood. A weight close to one means that the combination is good. A weight close to zero means that the combination of values shouldn't be taken into account.
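As a rough sketch of what "hundreds of combinations of potential values" could look like in code, the Python below draws random combinations and gives them all the same starting weight. The value ranges, the dictionary layout and the function name are assumptions made for this example, not GVF's actual settings.

```python
import random


def sample_combinations(n_templates, count, seed=0):
    """Draw `count` random combinations of potential values.

    The ranges below are illustrative assumptions, not GVF's real defaults.
    """
    rng = random.Random(seed)
    combinations = []
    for _ in range(count):
        combinations.append({
            "gesture": rng.randrange(n_templates),  # index of a template
            "progression": rng.uniform(0.0, 1.0),   # position within the template
            "size": rng.uniform(0.5, 2.0),          # relative size
            "speed": rng.uniform(0.5, 2.0),         # relative speed
            "weight": 1.0 / count,                  # all equally likely at first
        })
    return combinations


combinations = sample_combinations(n_templates=3, count=500)
```

Each incoming observation would then be used to update the `weight` of every combination.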

To clarify, let's make explicit one specific iteration of the algorithm for a particular combination of estimated values. Let's suppose that this particular combination is gesture index equals two, so a square for instance, progression equals 0.2, size 0.8, speed 1.2. First, the algorithm gets the gesture template given by index two, which is the square. Remember that the gesture template is a sequence of features.

The algorithm then picks, in the sequence of features, the point at the specific time given by the progression value, 0.2, as seen in the image below. So exactly 20% of the template.

So far this gives us an X, Y value corresponding to the value at 20% of gesture two. Then we scale these values by the estimated size, so we have a new coordinate, which is the scaled value. At the end of the process, this gives us a transformed version of the X, Y value picked in the template.

The next step is to compute the likelihood of the incoming observation given the transformed value computed above. This likelihood value will give the weight for the current estimation. To do so, we first compute the Euclidean distance between this value and the incoming observation. The resulting value gives an idea of the goodness of fit of our estimation.

If the distance is close to zero, we have a good estimate. Otherwise, we don't. In the end, this distance is mapped into a probability range between zero and one, where one means good, so that it can be treated as a likelihood.
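The iteration described above can be sketched in Python as follows. The text does not say exactly how the distance is mapped into the zero-to-one range, so the Gaussian-shaped mapping and the `tolerance` parameter are assumptions of this example, as are the function name and the tiny square template. Speed is left out because it is not used in this step.

```python
import math


def combination_weight(template, progression, size, observation, tolerance=0.1):
    """Weight of one combination of estimated values for one observation.

    template    : list of (x, y) points recorded during training
    progression : value in [0, 1], position within the template
    size        : relative size used to scale the template point
    observation : incoming (x, y) point
    tolerance   : hypothetical spread of the distance-to-weight mapping
    """
    # Pick the template point at the given progression (nearest sample).
    index = round(progression * (len(template) - 1))
    x, y = template[index]

    # Scale the picked point by the estimated size.
    scaled = (x * size, y * size)

    # Euclidean distance between the transformed point and the observation.
    distance = math.dist(scaled, observation)

    # Map the distance into (0, 1]: 1 for a perfect match, near 0 when far away.
    return math.exp(-(distance ** 2) / (2 * tolerance ** 2))


# A coarse square template, standing in for gesture index two.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]

close = combination_weight(square, 0.2, 0.8, observation=(0.8, 0.0))
far = combination_weight(square, 0.2, 0.8, observation=(0.0, 0.8))
```

An observation that lands exactly on the transformed template point gets a weight of one, and the weight drops quickly towards zero as the observation moves away from it.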

That's all for day 067. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 065, we looked at Gesture Variation Follower. You can catch up using the link below.

**100 Days Of ML Code Day 065** *Recap From Day 064* medium.com

Today, we'll look at how Gesture Variation Follower works.

We've learned the basic features of GVF. It is able to classify a gesture very early, to align it in real time, and to estimate variations such as speed and size in real time as well. So now, how does GVF work?

To work, GVF needs at least one recorded gesture template. Like in DTW, a gesture template is a sequence of features, such as a sequence of X, Y values captured by the mouse. The recording phase is called training. In GVF, the training phase consists of recording as many gesture templates as we want. The goal of GVF is then to estimate a set of elements in real time. These elements are part of the model. They are the recognized gesture, given by the index of its template, and the progression within the recognized gesture, given by a continuous value between zero and one, where zero is the starting point of the template and one is the ending point. The other elements are the relative size and the relative speed, both continuous values.

When we say estimating these elements in real time, we mean that the estimated value of each one of them is updated at each new observation, like in Gesture Follower. An observation is a gesture sample. In the case of the mouse, an observation is a point X, Y at a certain time T. The process of estimating these elements is called inference. Performing inference in real time is also called incremental inference, as it is incrementally updated at each new observation.

So now the question is, how do we infer? Join me here tomorrow as we learn how to infer. That's all for day 066. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 064, we looked at how gesture follower works. You can catch up using the link below.

**100 Days Of ML Code Day 064** *Recap From Day 063* medium.com

Today, we'll look at another method that has been designed for expressive variations in gesture execution. This means that it is better able to understand the intentions behind expressive changes in the gesture, and then to use this information in real-time performance, for instance to control sound synthesis parameters.

The method is Gesture Variation Follower (GVF). As we said, the idea is to follow, or we could say extract, the variations in gesture execution instead of considering them as noise. More precisely, if we slow down while we're executing the gesture, GVF is able to estimate the decrease in speed dynamically.

Similarly, if we perform a gesture bigger than the one recorded, GVF is able to estimate the increase in relative size dynamically. These variations are dynamic because we could clearly start a gesture faster than the recorded template and then finish slower.

This is a very important feature because it means that these extracted variations are not relative to the gesture as a whole, but to the gesture as it continuously changes. Consequently, they can be used as continuous controls while the gesture is being executed.

Below is a video demonstration of GVF

That's all for day 065. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 063, we looked at Working With Time, Gesture Follower (GF). We saw that Gesture Follower is able to align the incoming sequence onto the template and compute their similarity on the fly, which means at each new incoming feature value, that is to say, while we're performing the gesture.

**100 Days Of ML Code Day 063** *Recap From Day 062* medium.com

Today, we'll look at how gesture follower works.

Indeed, if GF is like DTW but in real time, why wouldn't dynamic time warping work the same way? The answer is that GF is probabilistic while DTW is not. GF considers that the input gesture can vary from the recorded template in amplitude and time. These variations, to simplify, are considered as noise in the data and are modeled by a mean, often zero, which means that the expected noise value is zero, and a variance, which is the tolerance to variations. Hence, variations are modeled as Gaussian distributions. More precisely, GF has a notion of tolerance in terms of amplitude between the recorded gesture and the ones performed in real time. This means, for instance, that if we recorded a circle, and then we try to perform more or less the same circle, if we perform it slightly smaller, GF will be tolerant and say, "OK, this is still the same circle." This is properly handled in the model by Gaussian noise on the observation.

Second, GF has a model of time, and a notion of tolerance in time. Taking the same example, if we now try to perform the same circle again, but slightly faster, GF will understand it as the same circle, and it will be able to see that we passed through the same values as in the recorded circle, for instance bottom position, then left, then top, and finally right, but faster than in the recorded gesture. The tolerance in time helps GF to be more flexible in time and to handle lost data.

But the approach has several limitations. Since variations in time and amplitude are considered as noise in the performed gestures, GF is not able to consider them as deliberate expressive variations. In other words, GF does not have a model of expressive variations, and this brings limitations.

That's all for day 064. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 062, we looked at the summary of dynamic time warping.

**100 Days Of ML Code Day 062** *Recap From Day 061* medium.com

I said that in the coming days, we would see methods that allow for capturing how we are performing a gesture, while we are performing it.

Let's get into it.

The first method that we will see, starting today, is a real-time version of DTW. What does that mean? DTW aligns an input sequence onto a template in order to compute a similarity measure between those two sequences. The first method, Gesture Follower, is able to align the incoming sequence onto the template and compute their similarity on the fly, which means at each new incoming feature value, that is to say, while we're performing the gesture.

As soon as we start performing a gesture, the method is able to recognize which gesture it is by giving us its index or label, and it is able to align it to the recognized gesture by giving us a continuous value corresponding to the progression of the executed gesture within the template. The method is called Gesture Follower because it operates as if it's following the gesture while it is being performed.

This method was developed at IRCAM in Paris by Frederic Bevilacqua and colleagues. The Gesture Follower is a system for real-time following and recognition of time profiles. In the example in the video below, the Gesture Follower learns three gestures, i.e. drawings made using the mouse, while simultaneously recording voice data.

In the video above, during the performance, the Gesture Follower recognizes which gesture is being performed, and plays the corresponding sound, time stretched or compressed depending on the pacing of the gesture.

Let's inspect how Gesture Follower works. Like in DTW, a gesture is represented as a sequence of feature vectors. Each feature in the sequence is a point of the gesture trajectory, which means a snapshot of the gesture at a certain time. GF stores each gesture template as a sequence of features. As a result, this operation gives the alignment of the incoming gesture onto the template, as seen below.

So Gesture Follower gives the alignment for every template, but what we want in the end is the one that is the likeliest. To do so, GF takes the distances between the incoming point and each template, given by the previous alignment, and builds a probability distribution over the templates. This probability distribution allows us to know which template is most likely to look like the input gesture. The gesture reaching the highest probability will then be the outcome of the classification.
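One simple way to turn a distance per template into a probability distribution is sketched below in Python. The exponential mapping, the `sharpness` parameter and the example distances are assumptions made for this illustration; the text does not specify the formula Gesture Follower uses internally.

```python
import math


def template_probabilities(distances, sharpness=1.0):
    """Turn one distance per template into a probability distribution.

    A smaller distance gives a higher probability. The exponential mapping
    and the `sharpness` parameter are assumptions made for this sketch.
    """
    scores = [math.exp(-sharpness * d) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical distances between the incoming point and three templates.
distances = [0.2, 1.5, 3.0]
probabilities = template_probabilities(distances)

# The classification outcome is the template with the highest probability.
recognized = max(range(len(probabilities)), key=lambda i: probabilities[i])
```

Because the probabilities sum to one, they also tell us how confident the classification is, not just which template wins.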

That's all for day 063. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*References*

Day 061, we looked at using Wekinator to control a drum machine with a webcam, based on an example video from Wekinator.

Today, we will start looking at something new that builds on what we've seen before.

In the past couple of days, we saw dynamic time warping, a method that can be used to compute the similarity between two sequences of data over time. These two sequences can typically be two gestures captured through a mouse, a Wiimote, a game track, a video camera and so on.

Since DTW (dynamic time warping) is used to compute similarity between temporal sequences, it can be used for gesture classification. Remember that to perform gesture classification with DTW, we first record several gesture templates, so each template is a gesture that can be recognized by the system.

For instance, our first template is a circle drawn with the mouse, so the template is represented as mouse x,y values recorded while drawing the gesture. The first x,y value recorded is the bottom position of the circle and the last x,y value recorded is also the bottom position, after we've drawn the entire circle. After that, another template can be recorded, let's say a square, and then a triangle, to serve as our three gestures. At the end of the training, our vocabulary contains three gesture templates: circle, square and triangle. In performance, the idea is to draw a certain shape that will be recognized by DTW.

Let's say that we are drawing a triangle, for instance. DTW matches the sequence of features given by the input gesture, the triangle, to the sequences of features given by each template by computing a similarity measure between the input sequence and the template. The classification outcome is the gesture for which the similarity measure is maximum, in other words, the one for which the distance between the two sequences is minimum. In the example that we saw previously, the triangle template is the third one. So DTW will return the index 3.
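The classification step can be sketched in Python with a textbook DTW distance. The three short templates below are hypothetical stand-ins for the circle, square and triangle (real templates would be full mouse trajectories), and the function names are made up for this example.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of (x, y) points."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # only a advances
                                    cost[i][j - 1],      # only b advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(gesture, templates):
    """Return the 1-based index of the template closest to the gesture."""
    distances = [dtw_distance(gesture, t) for t in templates]
    return distances.index(min(distances)) + 1


# Toy templates standing in for the three recorded gestures.
templates = [
    [(0, 0), (1, 0), (2, 0)],  # template 1
    [(0, 0), (0, 1), (0, 2)],  # template 2
    [(0, 0), (1, 1), (2, 2)],  # template 3
]

# The same shape as template 3, drawn more slowly (more samples).
drawn = [(0, 0), (0.5, 0.5), (1, 1), (1.5, 1.5), (2, 2)]
index = classify(drawn, templates)
```

The input has more points than any template, but the warping lets it line up with the third template anyway, so the returned index is 3.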

DTW is a powerful technique for gesture recognition and temporal sequence matching in general because it takes into account not only the current value, or position of the captured gesture, but also the past values. As a result, a hand gesture pose, for instance, depends on the whole path the hand took to reach that particular position.

If we imagine using DTW in a digital musical instrument, we could assign one sound to each gesture in the vocabulary. For instance, a guitar riff associated with the circle, a bassline with the square, and a drum sequence with the triangle. Then, while performing a continuous gesture, each time a gesture is recognized by DTW, the associated sound is played. For instance, if we start by performing a circle and then a square, the guitar riff followed by the bassline will be heard.

The relationship between the gesture performed and the sound played is then based on triggering. We may find such a method limiting for our way of performing music, and we may want a system that allows for more control of the sound than just triggering. Maybe we would like to modulate characteristics of the synthesized sound, for instance its pitch, its frequency spectrum, its amplitude and so on. And surely we would like to modulate the sound while we are performing the gesture.

In other words, we may want to be able to use the expressive variations we make when we are executing our gesture, such as slowing down at some point and then going faster, or exaggerating the amplitude of our gesture, and so on. So we may want to be able to use the expressive variations of our gestures in order to continuously control other parameters of the sound synthesis. In that case, what we would need is not only a method that tells us which gesture we are doing, such as DTW, nearest neighbor or Naive Bayes, but one that also tells us how we are doing our gestures.

In the coming days, we will see methods that allow for capturing how we are performing a gesture, while we are performing it. And in turn we will see that such methods can provide additional expressive control over the sounds, or other digital media, for real-time performance.

That's all for day 062. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

Day 059, we looked at working with time; Hidden Markov Models (HMMs). We learned that HMMs can also give us information about which state we're likely to be in at the current time. For instance, if we assume that we're drawing a gesture at a point in time, what state of the sequence are we most likely to be in? That is, how far through the gesture are we? So, those are HMMs in a nutshell.

Today, we'll look at tradeoffs of classifiers other than kNN.

We have seen previously how classifiers other than kNN can sometimes learn to generalize more effectively by creating explicit models that are tuned from the training data. They are likely to produce decision boundaries that are smoother and to create classifiers that perform more accurately on unseen data.

On the other hand, we've also seen that when the assumptions of the model are not appropriate to the learning problem, or when the parameters of the learning algorithm are not set in an appropriate way, these models can fail.

If dynamic time warping is like a nearest-neighbour classifier, using only the distance measurement without any underlying model, then Hidden Markov Models are more like these other classifiers. HMMs can perform very well when the model of the data is appropriate, and they can perform very poorly otherwise.

One of the biggest challenges in creating a good HMM is to set all of the parameters of the probability distributions. Like in Naive Bayes, it's the training process that sets these probabilities from the data, hopefully in a reasonable way. But with so many parameters that need to be set, Hidden Markov Models can require a very large amount of data to be trained well.

In general, HMMs won't do well at all if we try to train them on just one example per class, whereas dynamic time warping, as we saw, can do great, often with just one example. Even if we have several dozen examples or even several hundred examples per class, we might still find that HMMs don't build good models from the data.

So, if we're working on a modelling problem where we have access to a very large data set, we might want to give HMMs a try. Otherwise, we may find that dynamic time warping works better for us. Alternatively, we may want to explore some of the domain-specific algorithms for temporal modelling that are able to combine some of the benefits of both HMMs and dynamic time warping by making additional assumptions about the learning problem, or the data.

That's all for day 060. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

Day 058, we looked at working with time; Hidden Markov Models (HMMs). We saw that HMMs employ several principles that are similar to Naive Bayes. Using our training data, we're going to fit the parameters of several probability distributions that altogether describe our training data. Except, in this case, these distributions will also describe how our data is likely to change over time.

Today, we'll continue from where we left off in day 058.

Once an HMM is trained, we can ask the following question: given a sequence of data that we've just observed, how likely is this sequence to be generated from that specific HMM? If we have two trained HMMs, say one for a circle gesture and one for a triangle gesture, this allows us to use HMMs for classification.

If our current sequence of feature values is more likely under the type of movement captured by the circle HMM, then we can classify it as a circle. Otherwise, we can classify it as a triangle. Of course, if our current motion sequence is not very likely under either the circle or the triangle HMM, we might, alternatively, deduce that neither gesture is currently occurring, so we could use HMMs to do gesture spotting, just like dynamic time warping.
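The question "how likely is this sequence under that HMM?" is usually answered with the forward algorithm. Below is a minimal Python sketch for HMMs with discrete observation symbols. The two tiny models, their probabilities, the symbol meanings and the threshold are all made up for illustration; real gesture HMMs would work on continuous features and typically use log probabilities to avoid underflow.

```python
def sequence_likelihood(observations, start, transition, emission):
    """Forward algorithm: P(observations | HMM) for a discrete-output HMM.

    start[i]         = P(first state is i)
    transition[i][j] = P(next state is j | current state is i)
    emission[i][o]   = P(observing symbol o | state i)
    """
    n_states = len(start)
    # alpha[i] = P(observations so far, current state is i)
    alpha = [start[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states)) * emission[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


def classify(observations, models, threshold=1e-3):
    """Most likely model name, or None if no model explains the data well."""
    scores = {name: sequence_likelihood(observations, *params)
              for name, params in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Hypothetical two-state models; symbol 0 = "curved motion", 1 = "straight motion".
shared_transition = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "circle": ([1.0, 0.0], shared_transition, [[0.9, 0.1], [0.8, 0.2]]),
    "triangle": ([1.0, 0.0], shared_transition, [[0.1, 0.9], [0.2, 0.8]]),
}

label = classify([0, 0, 0, 1], models)
```

A mostly curved sequence is far more likely under the circle model, so it is classified as a circle. Raising `threshold` turns the same function into a simple gesture spotter that returns `None` when neither model fits well.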

HMMs can also give us information about which state we're likely to be in at the current time. For instance, if we assume that we're drawing a gesture at a point in time, what state of the sequence are we most likely to be in? That is, how far through the gesture are we? So, those are HMMs in a nutshell.

That's all for day 059. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

Day 057, we looked at working with time; dynamic time warping for music and speech analysis. We learned that dynamic time warping doesn't require us to do explicit segmentation, whereas using a classifier means we need to make a choice about when a gesture begins and ends in order to pass the classifier a feature vector representing the gesture from beginning to end.

Today, we will start looking at Hidden Markov Models (HMMs).

Hidden Markov Models, or HMMs, are one of the most common approaches for modelling data over time in many domains. We've looked at it in depth before.

In dynamic time warping, we directly compared the current sequence to each of the sequences in our training set. This looks a lot like k-nearest-neighbour classifiers, where we make decisions on new data only by considering the training examples directly.

Contrast this to the other classifiers we discussed, which use the training examples to explicitly build some model of the data. For instance, the decision stump classifier employs a model which is simply a line drawn through the feature space, or a hyperplane if we're working in higher dimensions. It uses the training data to find the best position for this line.

Or consider the Naive Bayes classifier, which uses the training data to estimate the parameters of a few simple probability distributions.

HMMs employ several principles that are similar to Naive Bayes. Using our training data, we're going to fit the parameters of several probability distributions that altogether describe our training data. Except, in this case, these distributions will also describe how our data is likely to change over time.

Specifically, a Hidden Markov Model models a gesture as a sequence of states. These states aren't necessarily literal properties of the world, but they're related to properties of the world that we can measure.

For example, let's say we're drawing a circle in the air. In English, we might say that we start at the bottom of the circle, then move to the left, to the top, to the right, and back to the bottom. We could model this gesture using an HMM with four hidden states: bottom, left, top, right.

We can also say that when we draw a circle we move in sequence from bottom, to left, to top, to right, in that order. To be more precise, it might also be appropriate to say that we could stay in one of these states for a while before moving on to the next one.

Let's look at how this relates to the more general formulation of a Hidden Markov Model. In general, our Hidden Markov Model has some number of hidden states. In each state, we have a probability distribution over the feature values we're most likely to see if the gesture is currently in that state.

This distribution is likely to be quite different from one state to the next. For instance, it'll be different at the bottom of the circle than at the top. Each state also has a probability associated with moving to any other state or staying in the same state.

So, training a Hidden Markov Model involves using the training data to set these probabilities: the probability distribution over the feature measurements that we might observe in each hidden state, as shown by the highlighted areas in the image below,

and the distribution describing the ways that one hidden state can move into another hidden state, as shown by the highlighted areas in the image below.
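To picture the two kinds of probabilities for the circle example, here is a small Python sketch of what a trained model's parameters might look like. Every number is hypothetical: in practice training would set them from the data. The sketch also assumes, for simplicity, that each state's feature distribution is a Gaussian around a mean (x, y) position with the same spread for every state.

```python
import math

states = ["bottom", "left", "top", "right"]

# transition[i][j] = P(next state is states[j] | current state is states[i]).
# Each state either stays where it is or moves on to the next one in the circle.
transition = [
    [0.8, 0.2, 0.0, 0.0],  # bottom -> bottom or left
    [0.0, 0.8, 0.2, 0.0],  # left   -> left or top
    [0.0, 0.0, 0.8, 0.2],  # top    -> top or right
    [0.2, 0.0, 0.0, 0.8],  # right  -> right or back to bottom
]

# Mean (x, y) feature value we expect to observe in each hidden state.
emission_mean = {
    "bottom": (0.0, -1.0),
    "left": (-1.0, 0.0),
    "top": (0.0, 1.0),
    "right": (1.0, 0.0),
}


def most_likely_state(point):
    """State whose emission mean is closest to the observed (x, y) point.

    With one shared spread for every state, the closest mean is also the
    state under which the point has the highest Gaussian density.
    """
    return min(states, key=lambda s: math.dist(emission_mean[s], point))


state_now = most_likely_state((0.1, 0.9))
```

The 0.8 on the diagonal is what lets the model stay in a state for a while before moving on, and the zeros forbid jumps such as going from bottom straight to top.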

Once an HMM is trained, we can ask the following question: given a sequence of data that we've just observed, how likely is this sequence to be generated from that specific HMM? If we have two trained HMMs, say one for a circle gesture and one for a triangle gesture, this allows us to use HMMs for classification.

That's all for day 058. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

Day 056, we looked at working with time; how dynamic time warping works. We learned that dynamic time warping doesn't require us to do explicit segmentation, whereas using a classifier means we need to make a choice about when a gesture begins and ends in order to pass the classifier a feature vector representing the gesture from beginning to end.

Today, we will start looking at dynamic time warping for music and speech analysis.

We've seen how dynamic time warping can be used to recognize similar shapes of gestures, like shapes drawn with a mouse, or rotation of an accelerometer over time. But dynamic time warping is also useful for recognizing shapes in other types of feature spaces.

For example, we might say that two melodies have a similar shape. Let's make a dynamic time warping program that's trained to recognize different melodic sequences. First off, if we were going to build the program, we need to answer the question: what types of features would we use?

Peak frequency, Constant-Q bins, and chromagram bins could give us a pretty good start. RMS wouldn't be so useful unless we had an instrument that varied wildly in volume from one note to the next, though it would be useful if we were interested in patterns in volume over time instead. And spectral centroid wouldn't be so useful unless we had an instrument that varied wildly in timbre or tone color from one note to the next.

Centroid might, however, be useful if we were interested in patterns in instrumentation, or patterns in synthesized sounds, where filtering or other effects were changing the brightness of the sound considerably over time. For the task of detecting patterns in melodies we played on the computer, there's an even better representation. Because we are generating those sounds on the computer in response to key presses, we know exactly which note we're playing when. So we can send a MIDI note number, or a similar simple representation, over to our dynamic time warping and the problem becomes much easier.

Another example is using dynamic time warping to build a simple voice controller. Let's say we have a simple mock-up of a platform or video game, and we want our avatar (represented by a capsule), as shown below, to move left, move right and jump.

If we want the avatar (capsule) in our mock-up above to respond to our voice speaking the words left, right, and jump, which features (FFT peak frequency, Constant-Q bins, centroid and MFCCs) should we use? The frequency content and the timbre of our voice both change as we speak different words. However, we need to recall that MFCCs are a type of feature designed to work well on speech, so they're probably the best thing we can start with. Also, we need to recall that we prefer our feature vectors to be shorter rather than longer, so let's leave out the other features for now.

That's all for day 057. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

In day 055, we looked at working with time: how dynamic time warping works. We learned that dynamic time warping can be thought of as solving an optimization problem. Here, our objective function is the distance between our two sequences once we've warped one to match the other as well as we can, and our task is to find the warping that minimizes this distance.

Today, we will continue from where we left off in day 055.

We've seen two basic approaches to modeling the way data changes over time. The first approach was to concatenate many time points into a single feature vector and then pass this feature vector to a conventional classification or regression algorithm. The second approach was dynamic time warping.

Dynamic time warping doesn't require us to do explicit segmentation, whereas using a classifier means we need to make a choice about when a gesture begins and ends in order to pass the classifier a feature vector representing the gesture from beginning to end.

Dynamic time warping is robust to changes in gesture speed. Our gesture can be faster or slower, or can even vary in speed over the course of a single gesture, but if the overall shape in feature space is the same, dynamic time warping will consider the gestures to be similar. And because dynamic time warping can give us a continuous estimate of how close our current gesture is, at this moment, to all of our recorded example gestures, we can use it to do gesture spotting.

If we're within a certain distance threshold of our closest gesture, we can assume that that gesture has just occurred. Otherwise, if we're far from all gesture examples, we can assume that no gesture has just occurred. Unfortunately, dynamic time warping is generally more computationally intensive than the classification and regression algorithms we've seen previously, especially when we run it continuously in real time. So if we need things to run very quickly, that may be a good reason to choose another algorithm instead.
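The thresholding idea can be sketched in a few lines of Python. This is only an illustration: the function name `spot_gesture`, the gesture labels, and the threshold value are all made up here, and the distances are assumed to have already been computed by dynamic time warping against each recorded example.

```python
from typing import Dict, Optional


def spot_gesture(distances: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the label of the closest recorded gesture, or None if nothing is close enough.

    `distances` maps each recorded gesture's label to the current distance
    between that recording and the most recent stretch of incoming features.
    """
    if not distances:
        return None
    # Pick the recorded gesture with the smallest distance to the live input.
    label = min(distances, key=distances.get)
    # Only report it if it falls within the distance threshold.
    return label if distances[label] <= threshold else None


# Hypothetical distances at two different moments in time.
print(spot_gesture({"wave": 0.8, "circle": 4.1}, threshold=1.5))  # wave
print(spot_gesture({"wave": 6.3, "circle": 5.2}, threshold=1.5))  # None
```

In practice the threshold would need tuning: too low and real gestures are missed, too high and ordinary movement is reported as a gesture.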

That's all for day 056. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

*RiskyLab*

Udacity's VR Nanodegree program has an option for Virtual Reality team projects where you get to create VR content while working with other students from all over the world.

This is a story about the teamwork project we did whilst undertaking the Udacity VR Nanodegree. The theme is Risk. My team, Team Kernel, consisted of 4 people from different backgrounds and time zones, scattered around Chicago, Mexico, Brazil and Lagos. Together, we worked on a team project to deliver an amazing VR game.

**RiskyLab** is a game where the user is tasked with cleaning a messy laboratory! We all know it is risky to leave the lab a mess.

The theme of the project is Risk. Our first thought was a concern about a creative way to implement the theme in our project. How do we design/develop with Risk in mind? What are we going to do with this theme? Lots of ideas came up.

We had so many ideas to begin with. We evaluated each of these ideas by asking questions like:

Is it related to the theme Risk?

How big is the scope to build within a given time frame?

With the ideas coming in, do we have the skills to bring the one that makes the final selection to life?

Finally, we ended up with RiskyLab because cleaning a messy lab sounded interesting.

GitHub

We used GitHub for collaborating on code.

Slack

Slack was used for our communication.

Skype

We had our weekly meeting over Skype.

Google Drive

Google Drive was the one-stop place for dumping quick sketches, 3D assets, sound/music and general documents.

Selection Of Idea

We had many different ideas to begin with. We evaluated each of these ideas by asking some questions before deciding on what to build.

Sketching

Assigning Roles

Each team member was assigned a role in the project. The good thing is that we had a very balanced team. Also, team members wore multiple hats and were available to switch roles as the need arose.

Working On Assigned Role

Each team member worked on their assigned task and pushed to the repo, while we all tested as development went on.

Putting it all together

In the end, we did something. Everyone on the team worked in one way or another to bring the project to what it is now.

We hope to add more features and possibly push the app to the store someday.

This is a great step in the right direction for all four of us on the journey of VR development and working in a team.

The game was a lot of fun to make, especially because of such a dedicated and hardworking team. It's amazing that we could make something this intricate when none of us have ever met and we don't live in the same country!

We learned a great deal during this teamwork project. Most importantly, we learned to collaborate, appreciate each other and have fun.

In day 054, we looked at working with time: how dynamic time warping works. We learned that dynamic time warping doesn't only allow us to compute a meaningful distance when one sequence is shifted earlier or later in time. It also allows us to compute a meaningful distance when the two sequences have different lengths in time.

Today, we will continue from where we left off in day 054.

Dynamic time warping can be thought of as solving an optimization problem. Here, our objective function is the distance between our two sequences once we've warped one to match the other as well as we can, and our task is to find the warping that minimizes this distance.

In order to solve this optimization problem, we'll use a technique called dynamic programming. This is why the algorithm is called dynamic time warping. This basic pattern of solving an optimization problem using dynamic programming appears in all sorts of application areas, from creating spell checkers to aligning DNA sequences.
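To make the dynamic programming idea concrete, here is a minimal sketch in Python. It assumes each sequence holds a single feature per time step and uses the absolute difference as the distance between two matched points; the function name `dtw_distance` and the example values are invented for illustration. The table entry `cost[i][j]` holds the smallest total distance for warping the first `i` points of one sequence onto the first `j` points of the other, and each entry is built from three already-solved smaller problems.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = smallest total distance for warping a[:i] onto b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # distance between the matched points
            # Extend the cheapest of the three ways of reaching this pair.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same shape performed at two different speeds warps to a distance of zero.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
# A flat sequence cannot be warped onto a bump, so the distance stays large.
print(dtw_distance([1] * 8, [0, 0, 0, 0, 3, 3, 0, 0]))  # 10.0
```

The two nested loops are what make the cost grow with the product of the two sequence lengths.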

Let's look at a few remarks on the computational efficiency of this algorithm and on some related practical considerations. Recall that when we talked about neural networks and support vector machines, we had to solve our optimization problem at training time.

Here, however, we're solving the optimization problem at run time, whenever we're trying to compute our degree of match to a new sequence. Granted, this is usually a much simpler optimization problem than we encounter in training neural networks or SVMs, but it can still get computationally expensive.

In particular, this problem is going to get harder and harder to solve as our sequences get longer. If we're not careful, we can find that dynamic time warping takes so long to compute its distance calculations that it can't keep up with new features coming in in real time. In practice, it's common to adapt the dynamic time warping algorithm slightly to make computation faster.

Many implementations of dynamic time warping impose additional constraints on the warping process, preventing points in one sequence from being matched to points in the other sequence that are very far away in time.
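One common form of such a constraint is a fixed-width band around the diagonal, often called a Sakoe-Chiba band. The sketch below is an assumption about how that could look rather than any particular library's implementation: the name `dtw_distance_banded` and the `window` parameter are invented here, sequences hold one feature per time step, and absolute difference is used as the point distance. Only pairs whose positions differ by at most `window` steps are ever considered, so far fewer table entries are filled in.

```python
def dtw_distance_banded(a, b, window):
    """DTW distance where point i may only be matched to points within `window` steps of i."""
    n, m = len(a), len(b)
    inf = float("inf")
    # The band must be at least as wide as the length difference,
    # otherwise the final points could never be matched to each other.
    window = max(window, abs(n - m))
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit the columns that lie inside the band around the diagonal.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


early = [0, 0, 3, 3, 0, 0, 0, 0]
late = [0, 0, 0, 0, 3, 3, 0, 0]  # the same bump, two steps later

print(dtw_distance_banded(early, late, window=0))  # 12.0 (no warping allowed)
print(dtw_distance_banded(early, late, window=2))  # 0.0 (band is wide enough)
```

The trade-off is visible in the example: a band that is too narrow is faster but can no longer absorb the shift it was meant to handle.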

That's all for day 055. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

In day 053, we looked at working with time: how dynamic time warping works. We started by answering the question: what if we take sequence B and shift it over just a bit to make a new sequence, D?

Today, we will continue from where we left off in day 053.

Dynamic time warping doesn't only allow us to compute a meaningful distance when one sequence is shifted earlier or later in time. It also allows us to compute a meaningful distance when the two sequences have different lengths in time. For example, the two sequences shown in the image below can also be warped to match quite closely.

If we use dynamic time warping for gesture analysis, we often want this. After all, we could say that if this is the hand height in the air over time, we're drawing almost exactly the same shape in both cases. We're just moving more quickly in one example than the other.

Dynamic time warping also allows us to compute a meaningful distance when the speed of movement changes within a sequence. Say the hand motion started quickly and then finished slowly, as shown below.

Again, dynamic time warping can warp one of these gestures onto the other, where we can see that they are indeed almost exactly the same shape. So, how does dynamic time warping compute the best warping from one sequence to another? The basic approach is not too complicated; see the sketch below.

Say we have two sequences, A and B, as shown below. Recall that warping sequence A to match sequence B can be understood as looking at each point in sequence A and finding a matching point in B. A good warp is one where each pair of points we've chosen to match are in fact close to each other, for instance, using Euclidean distance.

Dynamic time warping requires that our first point in sequence A be matched to at least the first point in sequence B. It could also be matched to more points in sequence B if B starts out slower than A, but let's ignore that for now. Dynamic time warping also requires that our final point in sequence A be matched to at least our final point in B, as shown below.

Within these two constraints, dynamic time warping works to find the warping that minimizes the overall distance between the two sequences. Dynamic time warping considers the best warping to be the one that gives us the smallest distance, so it will already have computed the distance for us by the time it's found the warping path.
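One way to write this down is shown here. The notation is an assumption made for illustration, not something from the original lesson: sequence A has points a_1 to a_n, sequence B has points b_1 to b_m, a warping is a list of K matched index pairs, and d is the distance between two matched points (for instance, Euclidean distance).

```latex
\mathrm{DTW}(A, B) \;=\; \min_{(i_1, j_1), \dots, (i_K, j_K)} \; \sum_{k=1}^{K} d\!\left(a_{i_k}, b_{j_k}\right)
\quad \text{subject to} \quad (i_1, j_1) = (1, 1), \qquad (i_K, j_K) = (n, m)
```

The two conditions on the right are the start and end constraints described above. Formulations of this kind usually also require that the indices never move backward in time and never skip a point, which is what makes the result a warping rather than an arbitrary matching.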

So, dynamic time warping can be thought of as solving an optimization problem. That's what we'll look at tomorrow. That's all for day 054. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

In day 052, we looked at working with time: how dynamic time warping works. We learned that with dynamic time warping we can compute the distance between two sequences of data, with each data point consisting of several features, e.g. X and Y values from a mouse, or pitch, roll, and yaw values from a Wiimote, or even MFCCs from audio.

Today, we will continue from where we left off in day 052.

Let's start by answering the question: what if we take sequence B and shift it over just a bit to make a new sequence, D? This is equivalent to starting the hand wave a little bit earlier than in B.

What happens? Suddenly we see that our distance between the first point of A and D is much bigger, as it is for all the matching points, and if we add up all these distances, we see that the Euclidean distance between A and D is now bigger than the distance between A and C.

If we use this distance to build a classifier using nearest neighbor and if A and C were our training examples, then we would find that D would be classified as belonging to the same class as C, not A, and this is one of the main problems with using Euclidean distance to compare sequences.
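A small numeric example makes the problem visible. The sequences below are invented for illustration (one hand-height value per time step), and `pointwise_distance` is a hypothetical helper that adds up the per-time-step distances without any warping.

```python
def pointwise_distance(a, b):
    """Sum of per-time-step distances between two equal-length 1-D sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical hand-height sequences (values invented for illustration).
A = [0, 0, 0, 0, 3, 3, 0, 0]  # training example: a hand wave
C = [1, 1, 1, 1, 1, 1, 1, 1]  # training example: a very different, flat movement
D = [0, 0, 3, 3, 0, 0, 0, 0]  # the same wave as A, started two steps earlier

training = {"A": A, "C": C}
distances = {label: pointwise_distance(seq, D) for label, seq in training.items()}
nearest = min(distances, key=distances.get)

print(distances)  # {'A': 12, 'C': 10}
print(nearest)    # C
```

Even though D has exactly the same shape as A, the two-step shift is enough for the nearest neighbor rule to pick C.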

Two things that we would judge to be incredibly similar in shape will not be judged to be similar using Euclidean distance if they're not aligned precisely in time. Dynamic time warping is a distance measure that accounts for the fact that sequences might shift a bit forward or backward in time. It works by first computing the best way to warp one sequence onto another.

We can think of warping as taking each point in one sequence, as shown in A in the image below, and optionally moving it forward or backward in time until the overall shapes of the two sequences are as similar as possible, as shown in B in the image below.

Equivalently, each point in the sequence we've warped is matched to at least one corresponding time point in the other sequence, as shown below.

Once we've aligned the sequences in the best way possible, as shown below, we can then use Euclidean distance between pairs of matching points and sum these up, just like we saw before, to get an overall distance.

Dynamic time warping will warp the sequence D to match the sequence A, as below, before we compute the distance between them. The distance is now quite low, and if we compare the distance between D and A using dynamic time warping, we see that it is smaller than the distance between C and A and quite close to the distance between B and A.

That's all for day 053. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

In day 051, we looked at working with time: how dynamic time warping works. We learned that dynamic time warping was designed specifically to compute the distance between two sequences of data. Just like we've seen, each data point can consist of several features: perhaps X and Y values from a mouse, or pitch, roll, and yaw values from a Wiimote, or even MFCCs from audio.

Today, we will continue from where we left off in day 051.

Euclidean distance is what is used by our nearest neighbor algorithm to compute the distance between examples in feature space if we have some number of features N.

The image below shows how we could use Euclidean distance to compute the similarity, or rather, dissimilarity, between two sequences.

If our two sequences are the same length, we could compute the Euclidean distance from the first point in one sequence to the first point in another sequence, and we could do the same for the second point in each sequence and the third and so on.

The image below contains two sequences, A and B. On the X axis, we have time, and on the Y axis, we have a single feature value. We can think of this as a measurement of hand height over time as the hand moves in front of a Kinect, for instance.

In both of these sequences, the hand is moved up, then down, then back up. So how similar are they using Euclidean distance? Let's start by computing the distance from point one of line A to point one of line B, then the distance from point two of line A to point two of line B, all the way to the final point in each line, as shown below.

Each of these calculations measures how close the two sequences are at a single moment in time. In order to get a single distance measurement indicating the overall distance between the two sequences, we can just add these up, as shown below.

After all, we can reasonably assume that if two sequences are very close at all moments in time, like sequences A and B in the image above, we would say that they are very similar overall, so the distance between them should be low.

On the other hand, if we repeat this process to compare A to another sequence, C, we see that A and C have a larger overall distance as shown below. It seems sensible.

The hand wave in C would look quite different from the hand wave in A and B. However, what if we take sequence B and shift it over just a bit to make a new sequence, D? That question will be answered tomorrow.

That's all for day 052. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

In day 050, we looked at working with time and dynamic time warping. We learned that with dynamic time warping we can detect when a specific gesture occurs even when we're not using explicit segmentation with a button or something equivalent. Dynamic time warping will spot the fact that we've just completed the gesture. In fact, this is sometimes called gesture spotting.

Today, we will continue from where we left off in day 050.

Now that we've seen what dynamic time warping can do from day 050, let's take a peek at the algorithm itself. Dynamic time warping was designed specifically to compute the distance between two sequences of data.

Just like we've seen, each data point can consist of several features: perhaps X and Y values from a mouse, or pitch, roll, and yaw values from a Wiimote, or even MFCCs from audio.

So if we wanted to design a good distance metric that is high when two sequences are dissimilar and low when two sequences are similar, how could we go about that?

We've already seen one distance function in detail, Euclidean distance, so let's start there. As we discussed some days ago, Euclidean distance is often a natural distance measure to use between two objects in the physical world.

If we have two points on a plane, Euclidean distance is simply the length of the line we draw between them, and in this two-dimensional space, we can derive a formula for the length of that line using the Pythagorean theorem.

We can do the same thing to measure the distance between two objects in the three dimensional world, and we can even generalize this definition to cover an arbitrary number of dimensions.
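For reference, the general form is usually written as follows, where p and q are two points with N features each (the symbol names are chosen here for illustration):

```latex
d(p, q) \;=\; \sqrt{\sum_{i=1}^{N} \left(p_i - q_i\right)^2}
```

With N = 2 this reduces to the Pythagorean theorem on a plane, and with N = 3 it gives the straight-line distance between two objects in three-dimensional space.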

As we've seen, the equation above is what is used by our nearest neighbor algorithm to compute the distance between examples in feature space if we have some number of features N.

That's all for day 051. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*

In day 049, we started looking at working with time and dynamic time warping. We saw that dynamic time warping is a method that can be used to compute the similarity between two sequences of data over time. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions and rules.

Today, we'll look at what we can do when we have a good measure of similarity between one sequence and another.

What can we do when we have a good measure of similarity between one sequence and another? First of all, we can look at the sequence of features between some time in the past and right now, and compare it to a sequence that we recorded previously.

With dynamic time warping we can detect when a specific gesture occurs even when we're not using explicit segmentation with a button or something equivalent. Dynamic time warping will spot the fact that we've just completed the gesture. In fact, this is sometimes called gesture spotting.

Dynamic time warping can be used to do classification using nearest-neighbor. We can use the measure of similarity between two gestures gotten from dynamic time warping as a distance metric within a nearest neighbor classifier. But instead of comparing two points in feature space using Euclidean distance, we can compare two sequences of features using their distance according to dynamic time warping.
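As a rough sketch of that idea, the Python below plugs a dynamic time warping distance into a nearest neighbor rule. Everything named here is an assumption for illustration: `dtw_distance` is a compact single-feature version that uses absolute difference between matched points, `classify_nearest` is a hypothetical helper, and the gesture labels and values are made up.

```python
def dtw_distance(a, b):
    """Compact DTW distance for 1-D sequences, keeping only one table row at a time."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row for "no points of a used yet"
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def classify_nearest(query, examples, distance=dtw_distance):
    """Return the label of the recorded example closest to `query`.

    `examples` is a list of (label, sequence) pairs.
    """
    best_label, _ = min(examples, key=lambda item: distance(item[1], query))
    return best_label


# Hypothetical recorded examples, one feature value per time step.
examples = [
    ("wave", [0, 0, 0, 0, 3, 3, 0, 0]),
    ("hold", [1, 1, 1, 1, 1, 1, 1, 1]),
]

# A faster wave with a different length is still matched to the wave example.
print(classify_nearest([0, 3, 3, 0], examples))  # wave
print(classify_nearest([1, 1, 1], examples))     # hold
```

Passing the distance function as a parameter keeps the classifier itself unchanged if a different sequence distance is swapped in later.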

Up until now, we've been saying that dynamic time warping computes the similarity between two sequences, but strictly speaking it computes a distance. We could say that two sequences are maximally similar when their distance is zero, and the greater the distance, the less similar they are.

Certainly, this notion of distance works well inside of a nearest neighbor classifier.

You deserve some accolades for being here till day 050. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, be legendary.

*Reference*