[DSP] Section 7: Audio Features

[REF | AudioSignalProcessingForML]

19. MFCCs (Mel-Spectrogram) Explained Easily

  • Mel-Frequency Cepstral Coefficients
  • Cepstral > Cepstrum > Spectrum

!!NOTE!!

|Cepstrum|Quefrency|Liftering|Rhamonic|
|---|---|---|---|
|Spectrum|Frequency|Filtering|Harmonic|

Computing the cepstrum

$C(x(t)) = F^{-1}[\log(F[x(t)])]$

  • $x(t)$: Time-domain signal
  • $F[x(t)]$: Spectrum
  • $\log(F[x(t)])$: Log spectrum
  • $F^{-1}[\log(F[x(t)])]$: Cepstrum (the spectrum of a log spectrum)

  • Signal > (DFT) > Power Spectrum > (log) > Log power spectrum > (IDFT) > Cepstrum (Quefrency)
  • Can be used for pitch detection (via the rhamonics)
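
The chain above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper name `real_cepstrum`, the `eps` constant, and the synthetic 200 Hz harmonic tone are all assumptions made for the example. The first rhamonic peak is expected near a quefrency of `sr / f0` samples, which is what makes the cepstrum usable for pitch detection.

```python
import numpy as np

def real_cepstrum(x, eps=1e-10):
    """Signal > DFT > log power spectrum > IDFT (hypothetical helper)."""
    spectrum = np.fft.fft(x)
    log_power = np.log(np.abs(spectrum) ** 2 + eps)  # eps avoids log(0)
    return np.fft.ifft(log_power).real

# Synthetic test signal (an assumption): 200 Hz fundamental plus 18 harmonics.
sr = 8000
f0 = 200
t = np.arange(2048) / sr
x = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 20))

cepstrum = real_cepstrum(x * np.hanning(len(x)))

# Look for the first rhamonic peak in a plausible pitch range (80 to 400 Hz).
q_min, q_max = sr // 400, sr // 80
peak_quefrency = q_min + np.argmax(cepstrum[q_min:q_max])
estimated_f0 = sr / peak_quefrency
print(estimated_f0)
```

The search range is restricted because the lowest quefrencies mostly describe the slowly varying spectral envelope, not the pitch.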

!!NOTE!!

  • Glottal pulse > Vocal tract > Speech signal
  • Log-spectrum > Spectral envelope > Spectral detail
  • Formants carry the identity of the sound: they are the peaks of the spectral envelope
    • In speech science and phonetics, a formant is a broad spectral maximum that results from an acoustic resonance of the human vocal tract
  • Speech > Vocal tract freq. response > Glottal pulse
  • Speech = Convolution of vocal tract freq. response with glottal pulse

Formalising speech

$X(t)=E(t) \cdot H(t)$
$\log(X (t) ) = \log( E(t) \cdot H(t) )$
$\log(X (t) ) = \log( E(t) ) + \log( H(t) )$
In the log domain: Speech = Glottal pulse + Vocal tract freq. response

Goal: Separating components

  • Apply a low-pass lifter to the cepstrum to discard the fast-varying (high-quefrency) part, i.e. remove $E(t)$
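
A rough sketch of this separation, assuming a synthetic harmonic frame in place of real speech: the log spectrum is taken to the cepstral domain, only the lowest quefrencies are kept, and the result is transformed back to give a smooth envelope. The helper name `lifter_log_spectrum`, the cut-off `n_keep=30`, and the test signal are illustrative assumptions, not values from the original material.

```python
import numpy as np

def lifter_log_spectrum(frame, n_keep=30, eps=1e-10):
    """Split a log spectrum into envelope and detail by low-pass liftering."""
    log_spectrum = np.log(np.abs(np.fft.fft(frame)) + eps)
    cepstrum = np.fft.ifft(log_spectrum).real
    # Keep the first n_keep quefrencies and their mirrored counterparts.
    mask = np.zeros(len(cepstrum))
    mask[:n_keep] = 1.0
    mask[len(cepstrum) - n_keep + 1:] = 1.0
    envelope = np.fft.fft(cepstrum * mask).real   # slowly varying part
    detail = log_spectrum - envelope              # fast-varying part
    return log_spectrum, envelope, detail

# Synthetic frame (an assumption): 150 Hz fundamental with 23 harmonics.
sr = 8000
t = np.arange(1024) / sr
frame = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 25))
frame = frame * np.hanning(len(frame))

log_spectrum, envelope, detail = lifter_log_spectrum(frame)

def roughness(v):
    """Sum of squared differences between neighbouring bins (circular)."""
    return np.sum((v - np.roll(v, 1)) ** 2)

print(roughness(envelope) < roughness(log_spectrum))
```

The mask is symmetric because the cepstrum of a real signal is itself symmetric; keeping only one side would make the reconstructed envelope complex-valued.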

Computing Mel-Frequency Cepstral Coefficients

  • Waveform > DFT > Log-Amplitude Spectrum > Mel-Scaling > Discrete Cosine Transform > MFCCs
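
The pipeline can be sketched for a single frame with NumPy only. Several things here are assumptions for illustration: the HTK-style mel formula `2595 * log10(1 + f / 700)`, the triangular filter construction, 40 mel bands, and all function names. Note also that this sketch applies the mel filterbank to the power spectrum and takes the log afterwards, which is the order used by common implementations as far as I know; the values will not match `librosa.feature.mfcc`, which uses its own mel scale and normalisation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, centre, hi = edges_hz[i:i + 3]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def dct2(x, n_out):
    """Unnormalised DCT-II of a 1-D array, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return basis @ x

def mfcc_frame(frame, sr, n_mels=40, n_mfcc=13):
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2                   # DFT > power spectrum
    mel_energies = mel_filterbank(n_mels, len(frame), sr) @ power  # mel scaling
    log_mel = np.log(mel_energies + 1e-10)                       # log
    return dct2(log_mel, n_mfcc)                                 # DCT > MFCCs

rng = np.random.default_rng(0)
frame = rng.standard_normal(2048)   # placeholder frame of white noise
coeffs = mfcc_frame(frame, sr=22050)
print(coeffs.shape)  # (13,)
```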

Why Discrete Cosine Transform?

  • Simplified version of the Fourier transform
  • Gives real-valued coefficients
  • Decorrelates energy in different mel bands
  • Reduces the number of dimensions needed to represent the spectrum
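
The real-valued output and the dimensionality reduction can be checked on a toy example. The sketch below builds an orthonormal DCT-II matrix by hand and applies it to a made-up smooth "log mel spectrum" with two broad bumps; the data, the 40-band size, and the helper name `dct_matrix` are assumptions for illustration only.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (each row is one cosine basis vector)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

n_bands = 40
bands = np.arange(n_bands)
# Made-up smooth log mel spectrum with two formant-like bumps.
log_mel = np.exp(-((bands - 8) / 5.0) ** 2) + 0.6 * np.exp(-((bands - 25) / 6.0) ** 2)

D = dct_matrix(n_bands)
coeffs = D @ log_mel                        # real-valued by construction
energy_first_13 = np.sum(coeffs[:13] ** 2) / np.sum(coeffs ** 2)

# Reconstruct from the first 13 coefficients only.
truncated = np.zeros(n_bands)
truncated[:13] = coeffs[:13]
reconstruction = D.T @ truncated            # transpose inverts an orthonormal matrix
max_error = np.max(np.abs(reconstruction - log_mel))
print(energy_first_13, max_error)
```

Because the input is smooth, almost all of its energy ends up in the low-order coefficients, so dropping the rest changes the reconstructed envelope very little.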

How many coefficients?

  • Traditionally: first 12-13 coefficients
  • First coefficients keep most information (e.g., formants, spectral envelope)
  • Use $\Delta$ and $\Delta \Delta$ MFCCs
  • Total 39 coefficients per frame
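
The count of 39 comes from stacking 13 MFCCs with their first and second time derivatives. A simplified sketch, assuming random numbers in place of real MFCCs and a plain finite difference (`np.gradient`) in place of the smoothed derivative that `librosa.feature.delta` computes:

```python
import numpy as np

rng = np.random.default_rng(0)
mfccs = rng.standard_normal((13, 100))   # placeholder: (coefficients, frames)

delta = np.gradient(mfccs, axis=1)       # first derivative over time
delta2 = np.gradient(delta, axis=1)      # second derivative over time

features = np.concatenate((mfccs, delta, delta2), axis=0)
print(features.shape)  # (39, 100)
```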

MFCCs advantages

  • Describe the “large” structures of the spectrum
  • Ignore fine spectral structures
  • Work well in speech and music processing

MFCCs disadvantages

  • Not robust to noise
  • Extensive knowledge engineering
  • Not efficient for synthesis

MFCCs applications

  • Speech processing
    • Speech recognition
    • Speaker recognition
  • Music processing
    • Music genre classification
    • Mood classification
    • Automatic tagging

20. Extracting MFCCs with Python

import os
import librosa
import librosa.display
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
base_dir = r"./raw/20_audio/"
audio_file = os.path.join(base_dir, "debussy.wav")

ipd.Audio(audio_file)
signal, sr = librosa.load(audio_file)

print(signal.shape)
# Output: (661500,)

Extract MFCCs

# Pass the audio as the keyword argument y (required by librosa 0.10 and later)
mfccs = librosa.feature.mfcc(y=signal, n_mfcc=13, sr=sr)

print(mfccs.shape)
# Output: (13, 1292)

Visualise MFCCs

plt.figure(figsize=(12, 5))
librosa.display.specshow(mfccs,
                         x_axis="time",
                         sr=sr)
plt.colorbar(format="%+2.f")
plt.title("MFCCs")
plt.show()

Calculate delta and delta2 MFCCs

delta_mfccs = librosa.feature.delta(mfccs)
delta2_mfccs = librosa.feature.delta(mfccs, order=2)

print(f"delta: {delta_mfccs.shape}\ndelta2: {delta2_mfccs.shape}")
# Output:
# delta: (13, 1292)
# delta2: (13, 1292)
plt.figure(figsize=(12, 5))
librosa.display.specshow(delta_mfccs,
                         x_axis="time",
                         sr=sr)
plt.colorbar(format="%+2.f")
plt.title("MFCCs Delta")
plt.show()

plt.figure(figsize=(12, 5))
librosa.display.specshow(delta2_mfccs,
                         x_axis="time",
                         sr=sr)
plt.colorbar(format="%+2.f")
plt.title("MFCCs Delta2")
plt.show()

comprehensive_mfccs = np.concatenate((mfccs, delta_mfccs, delta2_mfccs))

print(comprehensive_mfccs.shape)
# Output: (39, 1292)

21. Frequency-Domain Audio Features

Freq.-domain features

  • Band energy ratio (BER)
  • Spectral centroid (SC)
  • Bandwidth (BW)

Extracting freq.-domain features

  • Waveform > (STFT) > Spectrogram > Feature Computation

Math conventions

  • $m_t(n)$ -> Magnitude of signal at freq. bin $n$ and frame $t$
  • $N$ -> # freq. bins

Band Energy Ratio (BER)

  • Comparison of energy in the lower/higher freq. bands
  • Measure of how dominant low frequencies are

\(BER_t = { {\sum_{n=1}^{F-1} m_t(n)^2} \over {\sum_{n=F}^N m_t(n)^2} }\)

  • $m_t(n)^2$: Power at frame $t$ and freq. bin $n$
  • $F$: Split freq.
  • $\sum_{n=1}^{F-1} m_t(n)^2$: Power in the lower freq. bands
  • $\sum_{n=F}^N m_t(n)^2$: Power in the higher freq. bands
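
A small worked example of the formula for one frame, using made-up magnitudes. The formula counts bins from 1, so the first $F-1$ bins form the low band; with Python's 0-based slicing that is `power[:split_bin]` with `split_bin = F - 1`. The function name and the numbers are illustrative assumptions.

```python
import numpy as np

def band_energy_ratio_frame(magnitudes, split_bin):
    """BER of one frame: power below the split over power at and above it."""
    power = np.asarray(magnitudes, dtype=float) ** 2
    return power[:split_bin].sum() / power[split_bin:].sum()

magnitudes = [4.0, 3.0, 2.0, 1.0, 1.0, 1.0]
ber = band_energy_ratio_frame(magnitudes, split_bin=3)
# low band: 16 + 9 + 4 = 29, high band: 1 + 1 + 1 = 3, so BER = 29 / 3
print(ber)
```

A value well above 1 means the frame is dominated by low frequencies.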

BER applications

  • Music/Speech discrimination
  • Music Classification

Spectral centroid (SC)

  • Centre of gravity of magnitude spectrum
  • Freq. band where most of the energy is concentrated
  • Measure of “brightness” of sound
  • Weighted mean of the freq.

\(SC_t = { {\sum_{n=1}^N m_t(n) \cdot n} \over {\sum_{n=1}^N m_t(n)} }\)

  • $n$: freq. bin
  • $m_t(n)$: Weight for n
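
As a quick check of the formula, the sketch below computes the centroid of two made-up frames, with bins numbered from 1 as in the equation. The function name and the magnitudes are assumptions for illustration.

```python
import numpy as np

def spectral_centroid_frame(magnitudes):
    """Magnitude-weighted mean of the (1-based) frequency bin index."""
    m = np.asarray(magnitudes, dtype=float)
    n = np.arange(1, len(m) + 1)
    return np.sum(m * n) / np.sum(m)

symmetric = spectral_centroid_frame([1, 2, 3, 2, 1])  # (1+4+9+8+5) / 9 = 3.0
bright = spectral_centroid_frame([0, 0, 1, 2, 5])     # (3+8+25) / 8 = 4.5
print(symmetric, bright)
```

Moving energy towards the higher bins pulls the centroid upwards, which is why it is read as a measure of brightness.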

SC applications

  • Audio Classification
  • Music Classification

Bandwidth

  • Derived from spectral centroid
  • Spectral range around the centroid
  • Variance from the spectral centroid
  • Describe perceived timbre
  • Weighted mean of the distances of freq. bands from SC
  • Energy spread across frequency bands $\propto BW_t$

\(BW_t = { {\sum_{n=1}^N \left| n-SC_t \right| \cdot m_t(n) } \over {\sum_{n=1}^N m_t(n) } }\)

  • $m_t(n)$: Weight for n
  • $\left| n-SC_t \right|$: Distance of freq. band from spectral centroid
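
The same kind of toy check works for bandwidth. Both made-up frames below have their centroid at bin 3, but the energy is spread very differently. The function name and the magnitudes are assumptions for illustration.

```python
import numpy as np

def bandwidth_frame(magnitudes):
    """Magnitude-weighted mean distance of the (1-based) bins from the centroid."""
    m = np.asarray(magnitudes, dtype=float)
    n = np.arange(1, len(m) + 1)
    centroid = np.sum(m * n) / np.sum(m)
    return np.sum(np.abs(n - centroid) * m) / np.sum(m)

narrow = bandwidth_frame([0, 1, 8, 1, 0])  # (1 + 0 + 1) / 10 = 0.2
wide = bandwidth_frame([2, 2, 2, 2, 2])    # (4 + 2 + 0 + 2 + 4) / 10 = 1.2
print(narrow, wide)
```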

BW applications

  • Music processing

22. Implementing Band Energy Ratio from Scratch with Python

import math
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display
import IPython.display as ipd

debussy_file = "./raw/22_audio/debussy.wav"
redhot_file = "./raw/22_audio/redhot.wav"
ipd.Audio(debussy_file)
ipd.Audio(redhot_file)
debussy, sr = librosa.load(debussy_file)
redhot, _ = librosa.load(redhot_file)

Extracting spectrograms

FRAME_SIZE = 2048
HOP_SIZE = 512

debussy_spec = librosa.stft(debussy, n_fft=FRAME_SIZE, hop_length=HOP_SIZE)
redhot_spec = librosa.stft(redhot, n_fft=FRAME_SIZE, hop_length=HOP_SIZE)

print(debussy_spec.shape)
# Output: (1025, 1292)

Calculate Band Energy Ratio

def calculate_split_frequency_bin(split_frequency, sample_rate, num_frequency_bins):
    frequency_range = sample_rate / 2
    frequency_delta_per_bin = frequency_range / num_frequency_bins
    split_frequency_bin = math.floor(split_frequency / frequency_delta_per_bin)
    return int(split_frequency_bin)
split_frequency_bin = calculate_split_frequency_bin(2000, 22050, 1025)
print(split_frequency_bin)
# Output: 185
def band_energy_ratio(spectrogram, split_frequency, sample_rate):
    # spectrogram has shape (num_frequency_bins, num_frames), so the bin count is shape[0]
    split_frequency_bin = calculate_split_frequency_bin(split_frequency, sample_rate, spectrogram.shape[0])
    band_energy_ratio = []
    
    # calculate power spectrogram
    power_spectrogram = np.abs(spectrogram) ** 2
    power_spectrogram = power_spectrogram.T
    
    # calculate BER value for each frame
    for frame in power_spectrogram:
        sum_power_low_frequencies = frame[:split_frequency_bin].sum()
        sum_power_high_frequencies = frame[split_frequency_bin:].sum()
        band_energy_ratio_current_frame = sum_power_low_frequencies / sum_power_high_frequencies
        band_energy_ratio.append(band_energy_ratio_current_frame)
    
    return np.array(band_energy_ratio)
ber_debussy = band_energy_ratio(debussy_spec, 2000, sr)
ber_redhot = band_energy_ratio(redhot_spec, 2000, sr)

print(f"{debussy_spec.T.shape}")
print(f"{ber_debussy.shape}")
# Output:
# (1292, 1025)
# (1292,)
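
The per-frame loop in `band_energy_ratio` can also be written with array operations. The sketch below is one possible vectorised variant, not the original author's code: the function name is hypothetical, it assumes the `(frequency bins, frames)` layout returned by `librosa.stft`, and it takes the number of frequency bins from `spectrogram.shape[0]`. A random complex array stands in for a real spectrogram.

```python
import numpy as np

def band_energy_ratio_vectorised(spectrogram, split_frequency, sample_rate):
    """BER for every frame at once; spectrogram has shape (bins, frames)."""
    num_bins = spectrogram.shape[0]
    bin_width = (sample_rate / 2) / num_bins       # approximate Hz per bin
    split_bin = int(split_frequency // bin_width)
    power = np.abs(spectrogram) ** 2
    low = power[:split_bin].sum(axis=0)            # sum over low-frequency bins
    high = power[split_bin:].sum(axis=0)           # sum over high-frequency bins
    return low / high

# Placeholder spectrogram (an assumption): 1025 bins, 20 frames of complex noise.
rng = np.random.default_rng(0)
spec = rng.standard_normal((1025, 20)) + 1j * rng.standard_normal((1025, 20))

ber = band_energy_ratio_vectorised(spec, 2000, 22050)
print(ber.shape)  # (20,)
```

Like the loop version, this divides by the high-band power directly, so a completely silent frame would produce a division by zero; adding a small constant to the denominator is one way to guard against that.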

Visualise Band Energy Ratio curves

frames = range(len(ber_debussy))
t = librosa.frames_to_time(frames, hop_length=HOP_SIZE)

print(len(t))
# Output: 1292
plt.figure(figsize=(15, 5))

plt.plot(t, ber_debussy, color="b")
plt.plot(t, ber_redhot, color="r")

plt.show()

23. Spectral centroid and bandwidth

debussy_file = "./raw/23_audio/debussy.wav"
redhot_file = "./raw/23_audio/redhot.wav"
duke_file = "./raw/23_audio/duke.wav"

debussy, sr = librosa.load(debussy_file)
redhot, _ = librosa.load(redhot_file)
duke, _ = librosa.load(duke_file)

Spectral centroid with Librosa

FRAME_SIZE = 1024
HOP_LENGTH = 512

sc_debussy = librosa.feature.spectral_centroid(y=debussy, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
sc_redhot = librosa.feature.spectral_centroid(y=redhot, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
sc_duke = librosa.feature.spectral_centroid(y=duke, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]

Visualise spectral centroid

# Build the time axis from the centroid itself, not from a variable of an earlier section
frames = range(len(sc_debussy))
t = librosa.frames_to_time(frames, hop_length=HOP_LENGTH)

plt.figure(figsize=(15, 5))

plt.plot(t, sc_debussy, color="b")
plt.plot(t, sc_redhot, color="r")
plt.plot(t, sc_duke, color="y")

plt.show()
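
For comparison with the formula in section 21, the centroid can also be computed by hand from a magnitude spectrogram, this time weighting the bin centre frequencies in Hz instead of the bin indices. This is a sketch under assumptions: the function name is hypothetical, the input is expected in `(frequency bins, frames)` layout such as `np.abs(librosa.stft(...))`, and a synthetic spectrogram with a single active bin is used as a check.

```python
import numpy as np

def spectral_centroid_hz(magnitude, sample_rate):
    """Magnitude-weighted mean frequency per frame; magnitude is (bins, frames)."""
    n_bins = magnitude.shape[0]
    freqs = np.linspace(0, sample_rate / 2, n_bins)   # centre frequency of each bin
    return (freqs[:, None] * magnitude).sum(axis=0) / magnitude.sum(axis=0)

# Synthetic check (an assumption): all energy sits in bin 100 of 513.
magnitude = np.zeros((513, 3))
magnitude[100, :] = 1.0

centroid = spectral_centroid_hz(magnitude, 22050)
print(centroid)  # every frame equals the centre frequency of bin 100
```

With 513 bins (an FFT size of 1024) at 22050 Hz, bin 100 sits at `100 * 11025 / 512` Hz, so that is the value every frame should report.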

Calculate bandwidth

ban_debussy = librosa.feature.spectral_bandwidth(y=debussy, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
ban_redhot = librosa.feature.spectral_bandwidth(y=redhot, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
ban_duke = librosa.feature.spectral_bandwidth(y=duke, sr=sr, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH)[0]
plt.figure(figsize=(15,5))

plt.plot(t, ban_debussy, color='b')
plt.plot(t, ban_redhot, color='r')
plt.plot(t, ban_duke, color='y')

plt.show()
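
A hand-written version of bandwidth makes the link to the formula in section 21 explicit. The sketch below is an assumption-laden illustration: the function name is hypothetical, the input is a magnitude spectrogram in `(frequency bins, frames)` layout, and the exponent `p` is exposed as a parameter. With `p=1` it is the weighted mean distance from the centroid given earlier; to my knowledge `librosa.feature.spectral_bandwidth` defaults to `p=2`, so its curves are not expected to match the `p=1` formula exactly.

```python
import numpy as np

def spectral_bandwidth_hz(magnitude, sample_rate, p=1):
    """Order-p spread of frequency around the centroid; magnitude is (bins, frames)."""
    n_bins = magnitude.shape[0]
    freqs = np.linspace(0, sample_rate / 2, n_bins)
    weights = magnitude / magnitude.sum(axis=0, keepdims=True)
    centroid = (freqs[:, None] * weights).sum(axis=0)
    deviation = np.abs(freqs[:, None] - centroid[None, :])
    return (weights * deviation ** p).sum(axis=0) ** (1.0 / p)

# Synthetic check (an assumption): two equally strong bins, 100 and 200.
magnitude = np.zeros((513, 1))
magnitude[100, 0] = 1.0
magnitude[200, 0] = 1.0
freqs = np.linspace(0, 22050 / 2, 513)
expected = (freqs[200] - freqs[100]) / 2   # both bins are equally far from the centroid

bw_p1 = spectral_bandwidth_hz(magnitude, 22050, p=1)
bw_p2 = spectral_bandwidth_hz(magnitude, 22050, p=2)
print(bw_p1, bw_p2, expected)
```

In this symmetric case both exponents give the same number; on real audio, `p=2` gives more weight to energy far from the centroid.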

This post is licensed under CC BY 4.0 by the author.
