[DSP] Section 7: Audio Features

[REF | AudioSignalProcessingForML]

19. MFCCs (Mel-Spectrogram) Explained Easily

Mel-Frequency Cepstral Coefficients
Cepstral > Cepstrum > Spectrum ("ceps" is "spec" reversed)

!!NOTE!!

Computing the cepstrum

$C( x(t) ) = F^{-1} [ \log( F[ x(t) ] ) ]$

$x(t)$: Time-domain signal
$F[ x(t) ]$: Spectrum
$\log( F[ x(t) ] )$: Log spectrum
$F^{-1} [ \log( F[ x(t) ] ) ]$: Cepstrum > Spectrum of a spectrum
Signal > (DFT) > Power spectrum > (log) > Log power spectrum > (IDFT) > Cepstrum (quefrency domain)
Can be used for pitch detection (rahmonics)
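
The chain above can be sketched in NumPy. This is only an illustration: the test signal (a 200 Hz pulse train shaped by a decaying 700 Hz resonance), the 16 kHz sample rate and the 50-400 Hz search range are assumptions made for the example, not values from the course.

```python
import numpy as np

def power_cepstrum(x):
    """Signal > DFT > log power spectrum > IDFT, as in the chain above."""
    power_spectrum = np.abs(np.fft.fft(x)) ** 2
    log_power_spectrum = np.log(power_spectrum + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_power_spectrum).real

# Hypothetical test signal: a 200 Hz pulse train shaped by a decaying resonance
sr = 16000
n_samples = 2048
pulses = np.zeros(n_samples)
pulses[::sr // 200] = 1.0                      # one pulse every 80 samples
n = np.arange(200)
resonance = np.exp(-n / 20.0) * np.cos(2 * np.pi * 700 * n / sr)
x = np.convolve(pulses, resonance)[:n_samples] * np.hanning(n_samples)

cepstrum = power_cepstrum(x)

# Look for the strongest peak in the quefrency range of pitches between 50 and 400 Hz
q_min, q_max = sr // 400, sr // 50
peak_quefrency = q_min + np.argmax(cepstrum[q_min:q_max])
estimated_f0 = sr / peak_quefrency
print(f"peak at {peak_quefrency} samples -> about {estimated_f0:.1f} Hz")
```

The peak quefrency is the pitch period in samples, so dividing the sample rate by it gives the pitch estimate.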

!!NOTE!!

Glottal pulse > Vocal tract > Speech signal
Log-spectrum > Spectral envelope > Spectral detail
Formants = carry the identity of a sound: the upper part (peaks) of the spectral envelope
- In speech science and phonetics, a formant is a broad spectral maximum produced by an acoustic resonance of the human vocal tract
Speech > Vocal tract freq. response > Glottal pulse
Speech = Convolution of the vocal tract impulse response with the glottal pulse (time domain)

Formalising speech

$X(t)=E(t) \cdot H(t)$
$\log(X (t) ) = \log( E(t) \cdot H(t) )$
$\log(X (t) ) = \log( E(t) ) + \log( H(t) )$
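Log speech spectrum = Log glottal pulse spectrum + Log vocal tract freq. response
Here $X$, $E$ and $H$ are spectra (speech, glottal pulse, vocal tract freq. response): the convolution in time becomes a product in frequency, and the log turns that product into a sum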

Goal: Separating components

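Apply a low-pass filter to the log spectrum (keep only the low-quefrency part of the cepstrum) to remove the fast-varying component ($E(t)$ is removed)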
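
A minimal sketch of this separation, assuming a synthetic frame (200 Hz pulse train through a decaying resonance) and a hypothetical cut-off of 30 cepstral samples. Keeping the low quefrencies gives a smooth curve that roughly plays the role of $\log(H)$; the remainder roughly plays the role of $\log(E)$.

```python
import numpy as np

def lifter_envelope(frame, n_keep=30):
    """Split a log spectrum into envelope and detail by low-pass liftering."""
    log_spectrum = np.log(np.abs(np.fft.fft(frame)) + 1e-10)
    cepstrum = np.fft.ifft(log_spectrum).real
    lifter = np.zeros(len(cepstrum))
    lifter[:n_keep] = 1.0            # keep the low quefrencies
    lifter[-(n_keep - 1):] = 1.0     # and their mirror image (the cepstrum is symmetric)
    envelope = np.fft.fft(cepstrum * lifter).real
    detail = log_spectrum - envelope
    return log_spectrum, envelope, detail

# Hypothetical frame: pulse every 80 samples, shaped by a decaying 700 Hz resonance
sr = 16000
pulses = np.zeros(2048)
pulses[::80] = 1.0
n = np.arange(200)
resonance = np.exp(-n / 20.0) * np.cos(2 * np.pi * 700 * n / sr)
frame = np.convolve(pulses, resonance)[:2048] * np.hanning(2048)

log_spectrum, envelope, detail = lifter_envelope(frame)

# The envelope should wiggle far less from bin to bin than the raw log spectrum
print(np.abs(np.diff(log_spectrum)).sum(), np.abs(np.diff(envelope)).sum())
```

The cut-off of 30 is below the pitch period of 80 samples, so the pitch peak and its rahmonics end up in the detail part.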

Computing Mel-Frequency Cepstral Coefficients

Waveform > DFT > Log-Amplitude Spectrum > Mel-Scaling > Discrete Cosine Transform > MFCCs
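
The pipeline can be sketched with NumPy only. Assumptions in this sketch: the mel filterbank is applied to the power spectrum before taking the log (the ordering librosa uses), the mel scale follows the common 2595 * log10(1 + f / 700) formula, the DCT-II is unnormalised, and frames are not padded. All function names are made up for the example, and the values will not match `librosa.feature.mfcc`.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    filterbank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, centre, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        filterbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return filterbank

def dct_ii(x, n_out):
    """Unnormalised DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ x

def mfcc_from_scratch(signal, sr, n_fft=2048, hop=512, n_mels=40, n_mfcc=13):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)], axis=1)
    power = np.abs(np.fft.rfft(frames, axis=0)) ** 2         # DFT > power spectrum
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power    # mel scaling
    log_mel = 10.0 * np.log10(mel_power + 1e-10)             # log
    return dct_ii(log_mel, n_mfcc)                           # DCT > MFCCs

sr = 22050
rng = np.random.default_rng(0)
signal = rng.standard_normal(sr)   # one second of noise as a stand-in for audio
mfccs = mfcc_from_scratch(signal, sr)
print(mfccs.shape)                 # (n_mfcc, n_frames)
```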

Why Discrete Cosine Transform?

Simplified version of FT
Get real-valued coefficient
Decorrelate energy in different mel bands
Reduce # dimension to represent spectrum

How many coefficients?

Traditionally: first 12-13 coefficients
First coefficients keep most information (e.g., formants, spectral envelope)
Use $\Delta$ and $\Delta \Delta$ MFCCs
Total 39 coefficients per frame
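
The 39 comes from 13 MFCCs + 13 $\Delta$ + 13 $\Delta \Delta$. A rough sketch with plain finite differences on placeholder data (random numbers standing in for real MFCCs); `librosa.feature.delta` uses a smoothed derivative estimate, so its values differ from `np.gradient`.

```python
import numpy as np

rng = np.random.default_rng(0)
mfccs = rng.standard_normal((13, 100))   # placeholder: 13 coefficients x 100 frames

delta = np.gradient(mfccs, axis=1)       # first derivative over time (frames)
delta2 = np.gradient(delta, axis=1)      # second derivative over time

features = np.concatenate((mfccs, delta, delta2), axis=0)
print(features.shape)
```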

MFCCs advantages

Describe the “large” structures of the spectrum
Ignore fine spectral structures
Work well in speech and music processing

MFCCs disadvantages

Not robust to noise
Extensive knowledge engineering
Not efficient for synthesis

MFCCs applications

Speech processing
- Speech recognition
- Speaker recognition
- …
Music processing
- Music genre classification
- Mood classification
- Automatic tagging
- …

20. Extracting MFCCs with Python

  
import os
import librosa
import librosa.display
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np

  
base_dir = r"./raw/20_audio/"
audio_file = os.path.join(base_dir, "debussy.wav")

ipd.Audio(audio_file)

  
signal, sr = librosa.load(audio_file)

print(signal.shape)

(661500,)

Extract MFCCs

  
mfccs = librosa.feature.mfcc(y=signal, n_mfcc=13, sr=sr)

print(mfccs.shape)

(13, 1292)
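
Where the 1292 comes from, as a quick check. This assumes the librosa defaults (sr=22050 from `librosa.load`, hop_length=512, centred frames), since the calls above do not override them.

```python
n_samples = 661500    # signal.shape[0] printed above
sr = 22050            # librosa.load default sample rate
hop_length = 512      # librosa default hop length

duration_seconds = n_samples / sr
n_frames = 1 + n_samples // hop_length   # frame count when frames are centred (padded)
print(duration_seconds, n_frames)
```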

Visualise MFCCs

  
plt.figure(figsize=(12, 5))
librosa.display.specshow(mfccs,
                         x_axis="time",
                         sr=sr)
plt.colorbar(format="%+2.0f")
plt.title("MFCCs")
plt.show()

Calculate delta and delta2 MFCCs

  
delta_mfccs = librosa.feature.delta(mfccs)
delta2_mfccs = librosa.feature.delta(mfccs, order=2)

print(f"delta: {delta_mfccs.shape}\ndelta2: {delta2_mfccs.shape}")

delta: (13, 1292)
delta2: (13, 1292)

  
plt.figure(figsize=(12, 5))
librosa.display.specshow(delta_mfccs,
                         x_axis="time",
                         sr=sr)
plt.colorbar(format="%+2.0f")
plt.title("MFCCs Delta")
plt.show()

  
plt.figure(figsize=(12, 5))
librosa.display.specshow(delta2_mfccs,
                         x_axis="time",
                         sr=sr)
plt.colorbar(format="%+2.0f")
plt.title("MFCCs Delta2")
plt.show()

  
comprehensive_mfccs = np.concatenate((mfccs, delta_mfccs, delta2_mfccs))

print(comprehensive_mfccs.shape)

(39, 1292)

21. Frequency-Domain Audio Features

Freq.-domain features

Band energy ratio (BER)
Spectral centroid (SC)
Bandwidth (BW)
…

Extracting freq.-domain features

Waveform > (STFT) > Spectrogram > Feature Computation

Math conventions

$m_t(n)$ -> Magnitude of signal at freq. bin $n$ and frame $t$
$N$ -> # freq. bins

Band Energy Ratio (BER)

Comparison of energy in the lower/higher freq. bands
Measure of how dominant low frequencies are

$BER_t = { {\sum_{n=1}^{F-1} m_t(n)^2} \over {\sum_{n=F}^N m_t(n)^2} }$

$m_t(n)^2$: Power at frame $t$ and freq. bin $n$
$F$: Split freq.
$\sum_{n=1}^{F-1} m_t(n)^2$: Power in the lower freq. bands
$\sum_{n=F}^N m_t(n)^2$: Power in the higher freq. bands
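
A small worked example of the formula on made-up magnitudes. Note the code uses 0-based bin indices, so "bins below `split_bin`" corresponds to $n < F$ in the formula.

```python
import numpy as np

# Hypothetical magnitudes m_t(n): 4 frequency bins (rows) x 2 frames (columns)
magnitudes = np.array([[3.0, 1.0],
                       [4.0, 1.0],
                       [1.0, 2.0],
                       [2.0, 2.0]])
split_bin = 2   # F: bins below this index count as "low"

power = magnitudes ** 2
low_power = power[:split_bin].sum(axis=0)    # frame 0: 9 + 16, frame 1: 1 + 1
high_power = power[split_bin:].sum(axis=0)   # frame 0: 1 + 4,  frame 1: 4 + 4
ber = low_power / high_power
print(ber)   # frame 0 is dominated by low bins, frame 1 by high bins
```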

BER applications

Music/speech discrimination
Music Classification

Spectral centroid (SC)

Centre of gravity of magnitude spectrum
Freq. band where most of the energy is concentrated
Measure of “brightness” of sound
Weighted mean of the freq.

$SC_t = { {\sum_{n=1}^N m_t(n) \cdot n} \over {\sum_{n=1}^N m_t(n)} }$

$n$: freq. bin
$m_t(n)$: Weight for n
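
A sketch of the formula on made-up numbers. One assumption: bin centre frequencies in Hz are used in place of the bin index $n$, so the result comes out in Hz.

```python
import numpy as np

def spectral_centroid(magnitudes, freqs):
    """Magnitude-weighted mean frequency per frame.

    magnitudes: (n_bins, n_frames), freqs: (n_bins,)
    """
    return (freqs[:, None] * magnitudes).sum(axis=0) / magnitudes.sum(axis=0)

freqs = np.array([100.0, 200.0, 300.0, 400.0])   # hypothetical bin centres in Hz
magnitudes = np.array([[1.0, 0.0],
                       [1.0, 0.0],
                       [1.0, 1.0],
                       [1.0, 3.0]])

sc = spectral_centroid(magnitudes, freqs)
print(sc)   # flat frame sits in the middle, the second frame is "brighter"
```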

SC applications

Audio Classification
Music Classification

Bandwidth

Derived from spectral centroid
Spectral range around the centroid
Variance from the spectral centroid
Describe perceived timbre
Weighted mean of the distances of freq. bands from SC
Energy spread across frequency bands $\propto BW_t$

$BW_t = { {\sum_{n=1}^N \left| n-SC_t \right| \cdot m_t(n) } \over {\sum_{n=1}^N m_t(n) } }$

$m_t(n)$: Weight for n
$\left| n-SC_t \right|$: Distance of freq. band from spectral centroid
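
A sketch of this formula on made-up numbers, again using bin centre frequencies in Hz instead of the bin index. `librosa.feature.spectral_bandwidth` defaults to a second-order version (`p=2`), so its values will differ from this first-order weighted mean distance.

```python
import numpy as np

def spectral_bandwidth(magnitudes, freqs):
    """Magnitude-weighted mean distance from the spectral centroid, per frame."""
    total = magnitudes.sum(axis=0)
    centroid = (freqs[:, None] * magnitudes).sum(axis=0) / total
    distance = np.abs(freqs[:, None] - centroid[None, :])
    return (distance * magnitudes).sum(axis=0) / total

freqs = np.array([100.0, 200.0, 300.0])   # hypothetical bin centres in Hz
magnitudes = np.array([[1.0, 0.0],
                       [0.0, 2.0],
                       [1.0, 0.0]])

bw = spectral_bandwidth(magnitudes, freqs)
print(bw)   # same centroid (200 Hz) in both frames, but very different spread
```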

BW applications

Music processing

22. Implementing Band Energy Ratio from Scratch with Python

  
import math
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display
import IPython.display as ipd

debussy_file = "./raw/22_audio/debussy.wav"
redhot_file = "./raw/22_audio/redhot.wav"

  
ipd.Audio(debussy_file)

  
ipd.Audio(redhot_file)

  
debussy, sr = librosa.load(debussy_file)
redhot, _ = librosa.load(redhot_file)

Extracting spectrograms

  
FRAME_SIZE = 2048
HOP_SIZE = 512

debussy_spec = librosa.stft(debussy, n_fft=FRAME_SIZE, hop_length=HOP_SIZE)
redhot_spec = librosa.stft(redhot, n_fft=FRAME_SIZE, hop_length=HOP_SIZE)

print(debussy_spec.shape)

(1025, 1292)

Calculate Band Energy Ratio

  
def calculate_split_frequency_bin(split_frequency, sample_rate, num_frequency_bins):
    frequency_range = sample_rate / 2
    frequency_delta_per_bin = frequency_range / num_frequency_bins
    split_frequency_bin = math.floor(split_frequency / frequency_delta_per_bin)
    return int(split_frequency_bin)

  
split_frequency_bin = calculate_split_frequency_bin(2000, 22050, 1025)
print(split_frequency_bin)

185
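
As a cross-check, the bin centre frequencies can be listed with NumPy. The helper above approximates the bin width as (sample_rate / 2) / num_frequency_bins; the exact spacing is sample_rate / n_fft. For 2000 Hz both give the same bin here. The variable names below are made up for this sketch.

```python
import numpy as np

sample_rate = 22050
n_fft = 2048   # FRAME_SIZE used for the STFT above

# Centre frequency (Hz) of each of the n_fft / 2 + 1 bins
bin_frequencies = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
bin_width = sample_rate / n_fft
split_bin = int(2000 // bin_width)

print(len(bin_frequencies), split_bin, bin_frequencies[split_bin])
```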

  
def band_energy_ratio(spectrogram, split_frequency, sample_rate):
    # spectrogram has shape (num_frequency_bins, num_frames)
    num_frequency_bins = spectrogram.shape[0]
    split_frequency_bin = calculate_split_frequency_bin(split_frequency, sample_rate, num_frequency_bins)
    ber_values = []
    
    # calculate power spectrogram and put frames on the first axis
    power_spectrogram = np.abs(spectrogram) ** 2
    power_spectrogram = power_spectrogram.T
    
    # calculate BER value for each frame
    for frame in power_spectrogram:
        sum_power_low_frequencies = frame[:split_frequency_bin].sum()
        sum_power_high_frequencies = frame[split_frequency_bin:].sum()
        ber_values.append(sum_power_low_frequencies / sum_power_high_frequencies)
    
    return np.array(ber_values)

  
ber_debussy = band_energy_ratio(debussy_spec, 2000, sr)
ber_redhot = band_energy_ratio(redhot_spec, 2000, sr)

print(f"{debussy_spec.T.shape}")
print(f"{ber_debussy.shape}")

(1292, 1025)
(1292,)

Visualise Band Energy Ratio curves

  
frames = range(len(ber_debussy))
t = librosa.frames_to_time(frames, hop_length=HOP_SIZE)

print(len(t))

1292
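
What `librosa.frames_to_time` does here, written out by hand. This sketch assumes the default sample rate of 22050, since `sr` is not passed in the call above.

```python
import numpy as np

sr = 22050
hop_length = 512
frames = np.arange(1292)

# Frame index -> start sample -> seconds
t = frames * hop_length / sr
print(t[0], t[1], t[-1])
```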

  
plt.figure(figsize=(15, 5))

plt.plot(t, ber_debussy, color="b")
plt.plot(t, ber_redhot, color="r")

plt.show()

23. Spectral centroid and bandwidth

  
debussy_file = "./raw/23_audio/debussy.wav"
redhot_file = "./raw/23_audio/redhot.wav"
duke_file = "./raw/23_audio/duke.wav"

debussy, sr = librosa.load(debussy_file)
redhot, _ = librosa.load(redhot_file)
duke, _ = librosa.load(duke_file)

Spectral centroid with Librosa

  
FRAME_SIZE = 1024
HOP_LENGTH = 512

sc_debussy = librosa.feature.spectral_centroid(y=debussy, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
sc_redhot = librosa.feature.spectral_centroid(y=redhot, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
sc_duke = librosa.feature.spectral_centroid(y=duke, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]

Visualise spectral centroid

  
# one time stamp per spectral-centroid frame
frames = range(len(sc_debussy))
t = librosa.frames_to_time(frames, hop_length=HOP_LENGTH)

plt.figure(figsize=(15, 5))

plt.plot(t, sc_debussy, color="b")
plt.plot(t, sc_redhot, color="r")
plt.plot(t, sc_duke, color="y")

plt.show()

Calculate bandwidth

  
ban_debussy = librosa.feature.spectral_bandwidth(y=debussy, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
ban_redhot = librosa.feature.spectral_bandwidth(y=redhot, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
ban_duke = librosa.feature.spectral_bandwidth(y=duke, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]

  
plt.figure(figsize=(15,5))

plt.plot(t, ban_debussy, color='b')
plt.plot(t, ban_redhot, color='r')
plt.plot(t, ban_duke, color='y')

plt.show()
