YAMNet

YAMNet sound classification network

Since R2021b

  • YAMNet block

Libraries:
Audio Toolbox / Deep Learning

Description

The YAMNet block uses a pretrained sound classification network, trained on the AudioSet dataset [1], to predict audio events from the AudioSet ontology.
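At the MATLAB command line, you can sketch an equivalent workflow with the Audio Toolbox functions yamnetPreprocess and yamnet (a minimal sketch, assuming those functions are available in your release; the audio file name is hypothetical):

% Read audio, convert it to YAMNet mel spectrograms, and classify.
[audioIn,fs] = audioread("speech.wav");  % hypothetical audio file
S = yamnetPreprocess(audioIn,fs);        % 96-by-64-by-1-by-N mel spectrograms
net = yamnet;                            % pretrained YAMNet network
sounds = classify(net,S)                 % one predicted label per spectrogram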

Ports

Input

Mel spectrograms, specified as a 96-by-64 matrix or a 96-by-64-by-1-by-N array, where:

  • 96 –– Represents the number of 25 ms frames in each mel spectrogram

  • 64 –– Represents the number of mel bands spanning 125 Hz to 7.5 kHz

  • N –– Represents the number of channels

You can use the YAMNet Preprocess block to generate mel spectrograms with these dimensions.

Data Types: single | double
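As a quick check of these dimensions, the command-line counterpart of the YAMNet Preprocess block, yamnetPreprocess, returns arrays of exactly this shape (a sketch; the synthetic signal is an assumption for illustration):

fs = 16e3;                         % YAMNet operates on 16 kHz audio
audioIn = randn(3*fs,1,"single");  % 3 s of synthetic audio, illustration only
S = yamnetPreprocess(audioIn,fs);
size(S)                            % 96-by-64-by-1-by-N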

Output

Predicted sound label, returned as an enumerated scalar.

Data Types: enumerated

Predicted activation or score values for each supported sound label, returned as a 1-by-521 vector, where 521 is the number of classes in YAMNet.

Data Types: single

Class labels for predicted scores, returned as a 1-by-521 vector.

Data Types: enumerated
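At the command line, the relationship between the scores and labels ports can be sketched as follows, assuming the network returned by yamnet exposes its 521 class names on its final layer:

net = yamnet;
S = yamnetPreprocess(audioIn,fs);   % mel spectrograms, as in the earlier sketch
scores = predict(net,S);            % one row of 521 activations per spectrogram
labels = net.Layers(end).Classes;   % 521 AudioSet class names
[~,idx] = max(scores,[],2);         % top-scoring class per spectrogram
topLabels = labels(idx)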

Parameters

Size of mini-batches to use for prediction, specified as a positive integer. Larger mini-batch sizes require more memory but can lead to faster predictions.

Data Types: int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
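This parameter plays the same role as the MiniBatchSize name-value argument of classify and predict at the command line, for example (a sketch under the same assumptions as above):

% Classify 64 spectrograms per batch, trading memory for throughput.
sounds = classify(net,S,MiniBatchSize=64);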

Enable the output port sound, which outputs the classified sound.

Enable the output ports scores and labels, which output all predicted scores and associated class labels.

Block Characteristics

Data Types: double | single
Direct Feedthrough: no
Multidimensional Signals: no
Variable-Size Signals: no
Zero-Crossing Detection: no

References

[1] Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–80. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952261.

[2] Hershey, Shawn, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–35. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952132.

Version History

Introduced in R2021b