Jehoshaphat I. Abu


100 Days Of ML Code — Day 037

Aug 15, 2018


Recap From Day 036

In day 036, we looked at working with audio input: common audio features. We saw that in music, since notes exactly one octave apart are perceived as particularly similar, knowing the distribution of chroma even without the absolute frequency (i.e. the original octave) can give useful musical information about the audio, and may even reveal perceived musical similarity that is not apparent in the original spectra.

Today, we’ll continue from where we left off in day 036.

Working With Audio Input: Common Audio Features Continued

Mel-frequency cepstral coefficients

The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system’s response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example, in audio compression.
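To make "equally spaced on the mel scale" a bit more concrete, here is a minimal sketch. It assumes one commonly used Hz-to-mel formula, m = 2595 * log10(1 + f / 700); other variants exist, and the function names are my own, not part of any library.

```python
import math


def hz_to_mel(f_hz):
    # Assumed formula: one common convention for the mel scale.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Six points equally spaced in mel between 0 Hz and 4000 Hz.
top = hz_to_mel(4000.0)
mel_points = [top * i / 5 for i in range(6)]
hz_points = [mel_to_hz(m) for m in mel_points]

# The gaps between neighbouring points grow as frequency rises,
# so low frequencies get finer resolution than high ones.
gaps = [b - a for a, b in zip(hz_points, hz_points[1:])]
print([round(f) for f in hz_points])
print([round(g) for g in gaps])
```

The points are evenly spread in mel, but once converted back to Hz the spacing widens toward the top of the range. That is the frequency warping described above.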


MFCCs capture information that is useful for anything to do with speech, instrumentation, or other measurements of sound quality beyond just a sound’s pitch and loudness. MFCCs are widely used in commercial speech recognition, speaker identification, musical genre classification, and many other applications.

Like the FFT and Constant Q, computing MFCCs gives us a vector of values. The typical length of an MFCC vector is around 12 or 13 numbers, though, so it is much smaller than either an FFT or a Constant Q vector. These numbers don’t have a simple, intuitive explanation: each coefficient does not tell you about a different frequency, but together they give you values that tend to be consistently similar for similar sounds. Applying first- and second-order differences to MFCCs is often really useful in speech and music analysis.

MFCCs are commonly derived as follows:

  1. Take the Fourier transform of (a windowed excerpt of) a signal.

  2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.

  3. Take the logs of the powers at each of the mel frequencies.

  4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.

  5. The MFCCs are the amplitudes of the resulting spectrum.
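The five steps above can be sketched for a single frame as follows. This is only an illustration under several assumptions of mine: a Hamming window, the 2595 * log10(1 + f / 700) mel formula, 20 triangular filters, 13 coefficients, an unnormalised DCT-II, and a synthetic 440 Hz sine as input. It uses only the Python standard library and a slow direct DFT, so it is meant for reading rather than for real workloads, where a library such as librosa would be the usual choice.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    # Step 1: Hamming-window the frame, then take a direct DFT (O(n^2)).
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Step 2: overlapping triangles whose edges are equally spaced in mel.
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    # Fractional FFT-bin position of every triangle edge.
    bins = [mel_to_hz(m) * n_fft / sample_rate for m in mel_points]
    filters = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= centre:
                row.append((k - left) / (centre - left))
            elif centre < k < right:
                row.append((right - k) / (right - centre))
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def dct2(values, n_coeffs):
    # Step 4: unnormalised DCT-II, keeping the first n_coeffs outputs.
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    power = power_spectrum(frame)
    filters = mel_filterbank(n_filters, len(frame), sample_rate)
    mel_energies = [sum(w * p for w, p in zip(row, power)) for row in filters]
    # Step 3: log of each mel-band energy (small floor avoids log(0)).
    log_energies = [math.log(e + 1e-10) for e in mel_energies]
    # Step 5: the DCT outputs are the MFCCs.
    return dct2(log_energies, n_coeffs)


sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))  # 13
```

A full feature extractor would repeat this over many overlapping frames of the clip, giving one 13-value vector per frame.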

There can be variations on this process, for example: differences in the shape or spacing of the windows used to map the scale, or addition of dynamics features such as “delta” and “delta-delta” (first- and second-order frame-to-frame difference) coefficients.
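As a small illustration of the "delta" and "delta-delta" idea, the sketch below takes plain differences between consecutive frames. The frame values are made up for the example, and the helper name is mine. Note that many toolkits estimate deltas with a regression over a window of several frames rather than this simple two-frame difference.

```python
def delta(frames):
    # Difference between each frame and the one before it,
    # computed coefficient by coefficient.
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


# Four made-up frames with two coefficients each (real MFCC frames
# would typically have 12 or 13 values).
mfcc_frames = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0], [3.5, 0.0]]

d1 = delta(mfcc_frames)  # "delta": first-order differences
d2 = delta(d1)           # "delta-delta": second-order differences

print(d1)  # [[0.5, -1.0], [1.5, 0.0], [0.5, -1.0]]
print(d2)  # [[1.0, 1.0], [-1.0, -1.0]]
```

Each differencing pass shortens the sequence by one frame, so in practice the edges are usually padded to keep the static and dynamic features aligned.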

It’s good to know that you’re still here. We’ve come to the end of day 037. I hope you found this informative. Thank you for taking time out of your schedule and allowing me to be your guide on this journey. And until next time, remain legendary.



