sadaco.apis.models.torchvggish.torchvggish package

Submodules

sadaco.apis.models.torchvggish.torchvggish.mel_features module

Defines routines to compute mel spectrogram features from audio waveform.

sadaco.apis.models.torchvggish.torchvggish.mel_features.frame(data, window_length, hop_length)

Convert array into a sequence of successive possibly overlapping frames.

An n-dimensional array of shape (num_samples, …) is converted into an (n+1)-D array of shape (num_frames, window_length, …), where each frame starts hop_length points after the preceding one.

This is accomplished using stride_tricks, so the original data is not copied. However, there is no zero-padding, so any incomplete frames at the end are not included.

Parameters
  • data – np.array of dimension N >= 1.

  • window_length – Number of samples in each frame.

  • hop_length – Advance (in samples) between each window.

Returns

(N+1)-D np.array with as many rows as there are complete frames that can be extracted.
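
The framing described above can be sketched with NumPy stride tricks. This is a minimal illustration under stated assumptions, not the package's own source: the name frame_sketch is hypothetical, and the frame count is assumed to be 1 + (num_samples - window_length) // hop_length so that incomplete trailing frames are dropped.

```python
import numpy as np

def frame_sketch(data, window_length, hop_length):
    """Hypothetical stand-in for frame(): a strided view of overlapping frames."""
    data = np.asarray(data)
    num_samples = data.shape[0]
    # Count only complete frames; leftover samples at the end are ignored.
    num_frames = max(0, 1 + (num_samples - window_length) // hop_length)
    shape = (num_frames, window_length) + data.shape[1:]
    # Advance hop_length samples between frames, one sample within a frame.
    strides = (data.strides[0] * hop_length,) + data.strides
    return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides)

samples = np.arange(10)
frames = frame_sketch(samples, window_length=4, hop_length=3)
print(frames)
```

Because the result is a view onto the original buffer, writing to it would also modify the input, so it is best treated as read-only.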

sadaco.apis.models.torchvggish.torchvggish.mel_features.periodic_hann(window_length)

Calculate a “periodic” Hann window.

The classic Hann window is defined as a raised cosine that starts and ends on zero, and where every value appears twice, except the middle point for an odd-length window. Matlab calls this a “symmetric” window and np.hanning() returns it. However, for Fourier analysis, this actually represents just over one cycle of a period N-1 cosine, and thus is not compactly expressed on a length-N Fourier basis. Instead, it’s better to use a raised cosine that ends just before the final zero value - i.e. a complete cycle of a period-N cosine. Matlab calls this a “periodic” window. This routine calculates it.

Parameters

window_length – The number of points in the returned window.

Returns

A 1D np.array containing the periodic Hann window.
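
To make the difference between the two windows concrete, the sketch below computes a period-N raised cosine and compares it with np.hanning(). The name periodic_hann_sketch is hypothetical; the formula is the standard periodic Hann definition rather than a copy of this module's source.

```python
import numpy as np

def periodic_hann_sketch(window_length):
    """Hypothetical stand-in: one full cycle of a period-N raised cosine."""
    n = np.arange(window_length)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / window_length)

N = 8
periodic = periodic_hann_sketch(N)
symmetric = np.hanning(N)  # the "symmetric" window, which starts and ends on zero

# The periodic window of length N equals the symmetric window of length N + 1
# with its final zero dropped.
matches = np.allclose(periodic, np.hanning(N + 1)[:-1])
```

The periodic window starts at zero but its last value is nonzero, which is what lets it tile cleanly across successive FFT frames.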

sadaco.apis.models.torchvggish.torchvggish.mel_features.stft_magnitude(signal, fft_length, hop_length=None, window_length=None)

Calculate the short-time Fourier transform magnitude.

Parameters
  • signal – 1D np.array of the input time-domain signal.

  • fft_length – Size of the FFT to apply.

  • hop_length – Advance (in samples) between each frame passed to FFT.

  • window_length – Length of each block of samples to pass to FFT.

Returns

2D np.array where each row contains the magnitudes of the fft_length/2+1 unique values of the FFT for the corresponding frame of input samples.
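
A rough sketch of how such a magnitude STFT can be assembled from framing, windowing, and a real FFT follows. The function name stft_magnitude_sketch and the test tone are hypothetical, and the use of a periodic Hann window is an assumption based on the presence of periodic_hann() in this module.

```python
import numpy as np

def stft_magnitude_sketch(signal, fft_length, hop_length, window_length):
    """Hypothetical stand-in: frame, window, then take |rfft| of each frame."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + (len(signal) - window_length) // hop_length
    # Row i of idx selects samples [i * hop_length, i * hop_length + window_length).
    idx = hop_length * np.arange(num_frames)[:, None] + np.arange(window_length)[None, :]
    frames = signal[idx]
    # Periodic Hann window (assumed choice of analysis window).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(window_length) / window_length)
    # rfft zero-pads each frame to fft_length and keeps fft_length // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames * window, n=fft_length))

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)  # one second of a 1 kHz sine
magnitudes = stft_magnitude_sketch(tone, fft_length=256, hop_length=80, window_length=200)
peak_bins = magnitudes.argmax(axis=1)
```

With these settings the bin spacing is 8000 / 256 = 31.25 Hz, so the 1 kHz tone falls on bin 32 in every frame.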

sadaco.apis.models.torchvggish.torchvggish.mel_features.hertz_to_mel(frequencies_hertz)

Convert frequencies to mel scale using HTK formula.

Parameters

frequencies_hertz – Scalar or np.array of frequencies in hertz.

Returns

Object of same size as frequencies_hertz containing corresponding values on the mel scale.
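
The HTK mel formula is commonly written as mel = 1127 * ln(1 + f / 700). The sketch below applies it with NumPy; the function and constant names are hypothetical, and the constants 1127 and 700 are the usual HTK values rather than something read from this module.

```python
import numpy as np

# Usual HTK constants (assumed to match the ones used by the module).
MEL_BREAK_FREQUENCY_HERTZ = 700.0
MEL_HIGH_FREQUENCY_Q = 1127.0

def hertz_to_mel_sketch(frequencies_hertz):
    """Hypothetical stand-in: HTK-style hertz-to-mel conversion."""
    hz = np.asarray(frequencies_hertz, dtype=float)
    return MEL_HIGH_FREQUENCY_Q * np.log(1.0 + hz / MEL_BREAK_FREQUENCY_HERTZ)

mels = hertz_to_mel_sketch([0.0, 125.0, 1000.0, 3800.0])
```

The scale is nearly linear below about 1 kHz and logarithmic above it, with 1000 Hz mapping to roughly 1000 mel.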

sadaco.apis.models.torchvggish.torchvggish.mel_features.spectrogram_to_mel_matrix(num_mel_bins=20, num_spectrogram_bins=129, audio_sample_rate=8000, lower_edge_hertz=125.0, upper_edge_hertz=3800.0)

Return a matrix that can post-multiply spectrogram rows to make mel.

Returns a np.array matrix A that can be used to post-multiply a matrix S of spectrogram values (STFT magnitudes) arranged as frames x bins to generate a “mel spectrogram” M of frames x num_mel_bins. M = S A.

The classic HTK algorithm exploits the complementarity of adjacent mel bands to multiply each FFT bin by only one mel weight, then add it, with positive and negative signs, to the two adjacent mel bands to which that bin contributes. Here, by expressing this operation as a matrix multiply, we go from num_fft multiplies per frame (plus around 2*num_fft adds) to around num_fft^2 multiplies and adds. However, because these are all presumably accomplished in a single call to np.dot(), it’s not clear which approach is faster in Python. The matrix multiplication has the attraction of being more general and flexible, and much easier to read.

Parameters
  • num_mel_bins – How many bands in the resulting mel spectrum. This is the number of columns in the output matrix.

  • num_spectrogram_bins – How many bins there are in the source spectrogram data, which is understood to be fft_size/2 + 1, i.e. the spectrogram only contains the nonredundant FFT bins.

  • audio_sample_rate – Samples per second of the audio at the input to the spectrogram. We need this to figure out the actual frequencies for each spectrogram bin, which dictates how they are mapped into mel.

  • lower_edge_hertz – Lower bound on the frequencies to be included in the mel spectrum. This corresponds to the lower edge of the lowest triangular band.

  • upper_edge_hertz – The desired top edge of the highest frequency band.

Returns

An np.array with shape (num_spectrogram_bins, num_mel_bins).

Raises

ValueError – if frequency edges are incorrectly ordered or out of range.
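
The M = S A relationship can be illustrated with a small triangular filterbank. This is a sketch under stated assumptions, not the module's source: mel_matrix_sketch and hz_to_mel are hypothetical names, band edges are assumed to be evenly spaced on the mel scale, and the validity checks are a guess at what "incorrectly ordered or out of range" covers.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style conversion (assumed constants 1127 and 700).
    return 1127.0 * np.log(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_matrix_sketch(num_mel_bins=20, num_spectrogram_bins=129,
                      audio_sample_rate=8000,
                      lower_edge_hertz=125.0, upper_edge_hertz=3800.0):
    """Hypothetical stand-in returning a (num_spectrogram_bins, num_mel_bins) matrix."""
    nyquist = audio_sample_rate / 2.0
    if lower_edge_hertz < 0.0 or lower_edge_hertz >= upper_edge_hertz:
        raise ValueError("frequency edges are incorrectly ordered")
    if upper_edge_hertz > nyquist:
        raise ValueError("upper_edge_hertz exceeds the Nyquist frequency")
    # Mel value at the center of every spectrogram bin.
    bin_mels = hz_to_mel(np.linspace(0.0, nyquist, num_spectrogram_bins))
    # num_mel_bins triangles need num_mel_bins + 2 edges on the mel axis.
    edges = np.linspace(hz_to_mel(lower_edge_hertz),
                        hz_to_mel(upper_edge_hertz), num_mel_bins + 2)
    weights = np.zeros((num_spectrogram_bins, num_mel_bins))
    for i in range(num_mel_bins):
        lower, center, upper = edges[i:i + 3]
        rising = (bin_mels - lower) / (center - lower)
        falling = (upper - bin_mels) / (upper - center)
        # Triangle: the smaller of the two slopes, floored at zero.
        weights[:, i] = np.maximum(0.0, np.minimum(rising, falling))
    return weights

A = mel_matrix_sketch()
S = np.ones((5, 129))   # five frames of a flat magnitude spectrum
M = S @ A               # frames x num_mel_bins
```

Each column of A holds one triangular band, so post-multiplying a frames x bins spectrogram sums every bin into the bands it overlaps.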

sadaco.apis.models.torchvggish.torchvggish.mel_features.log_mel_spectrogram(data, audio_sample_rate=8000, log_offset=0.0, window_length_secs=0.025, hop_length_secs=0.01, **kwargs)

Convert waveform to a log magnitude mel-frequency spectrogram.

Parameters
  • data – 1D np.array of waveform data.

  • audio_sample_rate – The sampling rate of data.

  • log_offset – Add this to values when taking log to avoid -Infs.

  • window_length_secs – Duration of each window to analyze.

  • hop_length_secs – Advance between successive analysis windows.

  • **kwargs – Additional arguments to pass to spectrogram_to_mel_matrix.

Returns

2D np.array of (num_frames, num_mel_bins) consisting of log mel filterbank magnitudes for successive frames.
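
The duration arguments have to be turned into sample counts before the STFT runs. The sketch below shows that arithmetic for a 16 kHz input. Rounding to the nearest sample and choosing the FFT length as the next power of two at or above the window length are assumptions taken from the reference VGGish code, not something this page states, and all variable names are illustrative.

```python
import math
import numpy as np

audio_sample_rate = 16000
window_length_secs = 0.025
hop_length_secs = 0.010

# Convert durations in seconds to whole numbers of samples.
window_length_samples = int(round(audio_sample_rate * window_length_secs))  # 400
hop_length_samples = int(round(audio_sample_rate * hop_length_secs))        # 160
# Assumed rule: smallest power of two that can hold one window.
fft_length = 2 ** int(math.ceil(math.log2(window_length_samples)))          # 512

# A small positive log_offset keeps silent bands finite after the log.
mel_magnitudes = np.array([[0.0, 0.5, 2.0]])
log_mel = np.log(mel_magnitudes + 0.01)
```

With log_offset left at its default of 0.0, any mel band with zero energy would produce -inf, which is why callers usually pass a small positive value.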

sadaco.apis.models.torchvggish.torchvggish.vggish module

class sadaco.apis.models.torchvggish.torchvggish.vggish.VGG(features)

Bases: torch.nn.modules.module.Module

forward(x)

training: bool

class sadaco.apis.models.torchvggish.torchvggish.vggish.Postprocessor

Bases: torch.nn.modules.module.Module

Post-processes VGGish embeddings. Returns a torch.Tensor instead of a numpy array in order to preserve the gradient.

“The initial release of AudioSet included 128-D VGGish embeddings for each segment of AudioSet. These released embeddings were produced by applying a PCA transformation (technically, a whitening transform is included as well) and 8-bit quantization to the raw embedding output from VGGish, in order to stay compatible with the YouTube-8M project which provides visual embeddings in the same format for a large set of YouTube videos. This class implements the same PCA (with whitening) and quantization transformations.”

__init__()

Constructs a postprocessor.

postprocess(embeddings_batch)

Applies tensor postprocessing to a batch of embeddings.

Parameters

embeddings_batch – A tensor of shape [batch_size, embedding_size] containing output from the embedding layer of VGGish.

Returns

A tensor of the same shape as the input, containing the PCA-transformed, quantized, and clipped version of the input.
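
The transformation can be sketched in NumPy as a PCA projection followed by clipping and 8-bit quantization. Everything here is illustrative: the real class works on torch tensors with learned PCA parameters, whereas this sketch uses random stand-in parameters, and the clipping range of [-2.0, 2.0] and the clip-then-quantize order are assumptions taken from the reference VGGish post-processor.

```python
import numpy as np

# Assumed values, following the reference VGGish parameters.
EMBEDDING_SIZE = 128
QUANTIZE_MIN_VAL = -2.0
QUANTIZE_MAX_VAL = 2.0

def postprocess_sketch(embeddings_batch, pca_matrix, pca_means):
    """Hypothetical stand-in: PCA (with whitening), clip, quantize to 0..255."""
    # Subtract the means and project; pca_means is a column vector.
    pca_applied = (pca_matrix @ (embeddings_batch.T - pca_means)).T
    clipped = np.clip(pca_applied, QUANTIZE_MIN_VAL, QUANTIZE_MAX_VAL)
    # Map [QUANTIZE_MIN_VAL, QUANTIZE_MAX_VAL] linearly onto [0, 255].
    scale = 255.0 / (QUANTIZE_MAX_VAL - QUANTIZE_MIN_VAL)
    return np.round((clipped - QUANTIZE_MIN_VAL) * scale)

rng = np.random.default_rng(0)
pca_matrix = rng.standard_normal((EMBEDDING_SIZE, EMBEDDING_SIZE)) * 0.1  # stand-in
pca_means = rng.standard_normal((EMBEDDING_SIZE, 1))                      # stand-in
batch = rng.standard_normal((4, EMBEDDING_SIZE))
quantized = postprocess_sketch(batch, pca_matrix, pca_means)
```

Rounding has zero gradient almost everywhere, so a differentiable implementation has to handle that step with care if gradients are meant to flow through it.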

forward(x)

training: bool

sadaco.apis.models.torchvggish.torchvggish.vggish.make_layers()

class sadaco.apis.models.torchvggish.torchvggish.vggish.VGGish(urls, device=None, pretrained=True, preprocess=True, postprocess=True, progress=True)

Bases: sadaco.apis.models.torchvggish.torchvggish.vggish.VGG

forward(x, fs=None)

training: bool

sadaco.apis.models.torchvggish.torchvggish.vggish_input module

Compute input examples for VGGish from audio waveform.

sadaco.apis.models.torchvggish.torchvggish.vggish_input.waveform_to_examples(data, sample_rate, return_tensor=True)

Converts audio waveform into an array of examples for VGGish.

Parameters
  • data – np.array of either one dimension (mono) or two dimensions (multi-channel, with the outer dimension representing channels). Each sample is generally expected to lie in the range [-1.0, +1.0], although this is not required.

  • sample_rate – Sample rate of data.

  • return_tensor – Return data as a PyTorch tensor ready for VGGish.

Returns

3-D np.array of shape [num_examples, num_frames, num_bands] which represents a sequence of examples, each of which contains a patch of log mel spectrogram, covering num_frames frames of audio and num_bands mel frequency bands, where the frame length is vggish_params.STFT_HOP_LENGTH_SECONDS.
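
The last step, cutting a long log mel spectrogram into fixed-size examples, can be sketched as below. The values of 96 frames and 64 bands per example are the usual VGGish settings and are assumed here rather than read from vggish_params. Non-overlapping examples are also an assumption, and group_into_examples is a hypothetical name.

```python
import numpy as np

NUM_FRAMES = 96   # assumed frames per example
NUM_BANDS = 64    # assumed mel bands per frame

def group_into_examples(log_mel):
    """Hypothetical stand-in: split (T, bands) into (num_examples, NUM_FRAMES, bands)."""
    num_examples = log_mel.shape[0] // NUM_FRAMES
    # Drop trailing frames that do not fill a complete example.
    trimmed = log_mel[: num_examples * NUM_FRAMES]
    return trimmed.reshape(num_examples, NUM_FRAMES, log_mel.shape[1])

log_mel = np.random.default_rng(0).standard_normal((300, NUM_BANDS))
examples = group_into_examples(log_mel)
```

With a 10 ms hop, 96 frames cover about 0.96 seconds of audio, so 300 frames yield three complete examples and 12 leftover frames.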

sadaco.apis.models.torchvggish.torchvggish.vggish_input.wavfile_to_examples(wav_file, return_tensor=True)

Convenience wrapper around waveform_to_examples() for a common WAV format.

Parameters
  • wav_file – String path to a file, or a file-like object. The file is assumed to contain WAV audio data with signed 16-bit PCM samples.

  • return_tensor – Return data as a PyTorch tensor ready for VGGish.

Returns

See waveform_to_examples.
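
Reading signed 16-bit PCM and scaling it into the [-1.0, +1.0] range expected by waveform_to_examples() might look like the sketch below. It uses the standard-library wave module purely for illustration; the wrapper itself may rely on a different audio reader. The function name read_pcm16_sketch and the division by 32768 are assumptions.

```python
import io
import wave
import numpy as np

def read_pcm16_sketch(wav_file):
    """Hypothetical helper: load 16-bit PCM WAV data as floats in [-1.0, 1.0)."""
    with wave.open(wav_file, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected signed 16-bit PCM samples")
        sample_rate = wf.getframerate()
        num_channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    # Little-endian int16, with interleaved channels reshaped to (samples, channels).
    pcm = np.frombuffer(raw, dtype="<i2").reshape(-1, num_channels)
    return pcm / 32768.0, sample_rate

# Build a tiny in-memory mono WAV file to exercise the helper.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(np.array([0, 16384, -32768, 32767], dtype="<i2").tobytes())
buffer.seek(0)

samples, sample_rate = read_pcm16_sketch(buffer)
```

Dividing by 32768 maps the most negative int16 value to exactly -1.0 and leaves the largest positive value just below +1.0.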

sadaco.apis.models.torchvggish.torchvggish.vggish_params module

Global parameters for the VGGish model.

See vggish_slim.py for more information.

Module contents