Introducing BirdAVES: Self-Supervised Audio Foundation Model for Birds

6.20.2024

By Masato Hagiwara, Senior AI Research Scientist

We are excited to announce the public release of BirdAVES, a series of newly trained animal vocalization encoder models developed using self-supervision. Compared to AVES, the self-supervised foundation model that the Earth Species Project published last year, BirdAVES achieves an improvement of over 20% on bird-related datasets and tasks. In this blog post, we discuss the significance of modeling bird vocalizations and the impact of foundation models on analyzing animal vocalizations, and summarize experimental results comparing BirdAVES with other models, including AVES, BirdNET, and Perch.

Bird vocalizations and their ecological implications

Birds are one of the most well-studied taxonomic groups in the animal kingdom. Bird vocalizations have a wide range of ecological, behavioral, and conservation implications, and play an important role in studies related to behavior, population monitoring, and habitat analysis. 

Birds are often used as proxies for assessing ecosystem health, with population monitoring frequently conducted through the analysis of data collected via passive acoustic monitoring (PAM). Also, machine learning models trained to classify bird vocalizations have shown remarkable transferability to other domains of animal vocalizations (Ghani et al., 2023).

Foundation models and AVES

In recent years, deep neural networks (DNNs) have been increasingly used to monitor avian diversity by identifying species in data gathered from PAM systems (Stowell, 2022). Successful examples include convolutional neural network (CNN) models trained on large-scale bird vocalization data, such as BirdNET (Kahl et al., 2021) and Perch (Ghani et al., 2023).

However, many of these models have been trained on explicitly annotated data and are consequently limited by the availability of labeled datasets. While sufficient training data is available for certain taxonomic groups, such as common bird species, this limitation becomes a bigger issue when extending these models to rare and endangered species or other non-bird taxa that lack extensive labeled data. This scarcity of data poses a challenge to developing robust models for a wide range of species (Ghani et al., 2023).

In contrast, across other fields of machine learning, foundation models trained at scale using pure self-supervision have played a pivotal role in the AI revolution over the past few years. Notable examples include BERT (Devlin et al., 2018) and GPT-4 (OpenAI, 2023) for natural language, CLIP (Radford et al., 2021) for computer vision, and wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021) for human speech, among others.

Figure 1. Overview of AVES (a) pretraining and (b) fine-tuning, and (c) t-SNE plot of learned representations

We expect these large foundation models to play a key role in unlocking animal communication across species and have heavily invested in their development. 

Last year, we released the first-ever self-supervised foundation model for animal vocalizations, AVES (Hagiwara, 2022; Fig. 1). AVES can encode a wide range of animal vocalizations, achieving significant performance gains on many bioacoustic datasets and tasks compared to baseline CNN models such as ResNet (He et al., 2015) and VGGish (Hershey et al., 2016).

AVES has also been used in a number of internal and external projects, such as voxaboxen, ISPA (Hagiwara et al., 2024), and ocean soundscape monitoring (Calonge et al., 2024). 

Building on the success of AVES, we have now developed BirdAVES, a new series of models specifically tailored for bird vocalizations, as our next step to advance bioacoustic research.

Scaling law and BirdAVES

In other domains of machine learning, the scaling law—which states that the performance of machine learning models improves according to a power law curve as the training data increases—has been confirmed for large language models (Kaplan et al., 2020). Similar trends have been observed in computer vision (Zhai et al., 2021) and speech recognition (Radford et al., 2022).
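To make the power-law claim concrete: if error falls off as a power of the amount of training data, it appears as a straight line in log-log space, so the exponent can be estimated with a simple linear fit. The sketch below uses illustrative placeholder numbers, not measurements from our experiments.

```python
import numpy as np

# Illustrative placeholder values only (not measured results): amount of
# training data (hours of audio) and the corresponding validation error.
data_hours = np.array([100, 300, 1000, 3000, 10000], dtype=float)
val_error = np.array([0.42, 0.36, 0.31, 0.27, 0.23], dtype=float)

# A power law, error ≈ a * data^(-b), is linear in log-log space:
# log(error) = log(a) - b * log(data), so fit a straight line to the logs.
slope, intercept = np.polyfit(np.log(data_hours), np.log(val_error), deg=1)
a, b = np.exp(intercept), -slope

print(f"Fitted power law: error ≈ {a:.3f} * data^(-{b:.3f})")
```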

We significantly scaled up the training of AVES in terms of training data, model size, and compute power. Specifically:

  • Training Data: In addition to the core configuration used for AVES, we added a large amount of bird recordings from Xeno-canto and iNaturalist for self-supervised training of BirdAVES models.
  • Model Size: While the earlier AVES models were based on the HuBERT (Hsu et al., 2021) base configuration (~95M parameters), we have now successfully trained large models (~316M parameters) with significant performance improvements; the sketch after this list gives a rough sense of the size difference between the two configurations.
  • Compute: We significantly scaled up the training compute and increased the number of training steps for BirdAVES models, achieving a performance improvement of over 20%.
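As a rough comparison of the two configurations, the sketch below builds randomly initialized HuBERT base and large encoders using torchaudio's factory functions (assumed here for illustration) and counts their parameters; this is not how pretrained BirdAVES weights are loaded.

```python
import torchaudio

# Randomly initialized encoders in the standard HuBERT base/large configurations
# (the configurations the AVES-style models build on); used here only to compare
# sizes, not to load pretrained BirdAVES weights.
base = torchaudio.models.hubert_base()    # ~95M parameters
large = torchaudio.models.hubert_large()  # ~316M parameters

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"HuBERT base:  {count_params(base) / 1e6:.0f}M parameters")
print(f"HuBERT large: {count_params(large) / 1e6:.0f}M parameters")
```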

We evaluated the performance of the baseline AVES model released earlier (referred to as AVES-bio) and the newly trained BirdAVES models, which include biox-base, biox-large, and bioxn-large, each with different training configurations. 

We used BEANS (Hagiwara et al., 2022) as our benchmark. We computed the average metrics across all datasets, shown in the “BEANS avr. (all)” column of Table 1, and the metrics averaged over only the bird datasets (cbi, enabirds, dcase, and rfcx), shown in the “BEANS avr. (birds)” column. As the table shows, the newly trained BirdAVES models achieved a performance improvement of over 20% on the bird datasets and improved performance on various other datasets as well.
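For clarity, these averages are plain macro-averages of the per-dataset metrics. The sketch below shows the computation with hypothetical placeholder scores, not the actual values in Table 1.

```python
# Hypothetical per-dataset scores for one model (placeholders, not Table 1 values).
scores = {
    "watkins": 0.80, "bats": 0.70, "cbi": 0.55,
    "enabirds": 0.60, "dcase": 0.35, "rfcx": 0.15,
}
bird_datasets = {"cbi", "enabirds", "dcase", "rfcx"}

avg_all = sum(scores.values()) / len(scores)
avg_birds = sum(scores[d] for d in bird_datasets) / len(bird_datasets)

print(f"BEANS avr. (all):   {avg_all:.3f}")
print(f"BEANS avr. (birds): {avg_birds:.3f}")
```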

We also confirmed in preliminary studies that scaling the compute (the number of training steps) improves performance roughly according to a power law.

Table 1: Training details and performance comparison between AVES and BirdAVES models

Comparison with other bird-focused models

Additionally, we compared the performance of the AVES and BirdAVES models against BirdNET (Kahl et al., 2021, version 2.4) and Perch (Ghani et al., 2023, version 8). We extracted embeddings for the input audio from these models while freezing all model parameters, and used a single linear layer (also called a linear probe) to train a classification or detection model. We used the same task configurations on BEANS as in the previous section, except that we excluded the cbi dataset from the comparison, because BirdNET and Perch were trained using explicit Xeno-canto labels, which also form the basis of the cbi dataset; this overlap would bias the evaluation in favor of the supervised models. The results are shown in Table 2.
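As an illustration of this linear-probing setup (a sketch, not the exact training code we used), the snippet below freezes a pretrained encoder, mean-pools its frame-level embeddings into a clip-level vector, and trains only a single linear layer on top; `encoder`, `embedding_dim`, and `num_classes` are placeholders that depend on the model and task.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """A frozen embedding model with a single trainable linear layer on top."""

    def __init__(self, encoder, embedding_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze all encoder parameters
            p.requires_grad = False
        self.linear = nn.Linear(embedding_dim, num_classes)  # the only trained layer

    def forward(self, waveform):
        with torch.no_grad():
            frames = self.encoder(waveform)   # assumed shape: (batch, frames, dim)
        clip_embedding = frames.mean(dim=1)   # mean-pool frames into one clip vector
        return self.linear(clip_embedding)    # class logits

# Placeholders: `encoder` stands for any frozen embedding model, and the
# embedding dimension / number of classes depend on the model and the task.
# probe = LinearProbe(encoder, embedding_dim=768, num_classes=10)
# optimizer = torch.optim.Adam(probe.linear.parameters(), lr=1e-3)
```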

Table 2: Performance comparison of AVES, BirdAVES, BirdNET, and Perch models

BirdAVES models, although still lagging slightly behind the supervised models on bird datasets, have made remarkable progress through self-supervision alone. We also note that, unlike BirdNET and Perch, BirdAVES uses only a portion (33%) of the available Xeno-canto recordings, does not use other large-scale datasets such as the Macaulay Library, and does not rely on advanced data augmentation techniques.

Self-supervision has shown great benefits in the domain of generic audio processing (Baevski et al., 2020, Chen et al., 2021, Gong et al., 2021) and bioacoustics (Schäfer-Zimmermann et al., 2024). It is not difficult to imagine that similar or even greater performance improvements could be achieved by BirdAVES through scaling and incorporating data augmentation techniques.

Additionally, unlike traditional DNN models that use clip-level classification objectives, AVES and BirdAVES models are implemented using the transformer architecture with masked unit prediction as the pre-training objective. This allows them to produce fine-grained, per-frame (50ms) embeddings for the given input audio. So far, this capability has enabled downstream applications such as detection (Voxaboxen) and transcription (ISPA), which require detailed frame-by-frame descriptions of sound. Finally, AVES and BirdAVES are written in PyTorch, making them compatible and easily integrable with many modern large-scale machine learning pipelines.
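To show what these per-frame embeddings look like in practice, here is a minimal sketch that loads a model in the TorchAudio format and extracts frame-level features; the config and checkpoint file names are placeholders, and the exact loading recipe is documented in the AVES GitHub repository.

```python
import json
import torch
import torchaudio

# Placeholder file names; the actual TorchAudio-format config and checkpoint
# names are listed in the AVES GitHub repository.
with open("birdaves_torchaudio_config.json") as f:
    config = json.load(f)

model = torchaudio.models.wav2vec2_model(**config, aux_num_out=None)
model.load_state_dict(torch.load("birdaves_torchaudio.pt"))
model.eval()

# 16 kHz mono audio, shape (batch, samples); three seconds of silence as a stand-in.
waveform = torch.zeros(1, 3 * 16000)

with torch.no_grad():
    # extract_features returns the per-frame outputs of each transformer layer.
    features, _ = model.extract_features(waveform)

print(features[-1].shape)  # (batch, num_frames, hidden_dim) from the last layer
```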

Using BirdAVES

The code and pretrained model weights for BirdAVES are available in the AVES GitHub repository. Along with the original model weights trained with fairseq, we also provide the weights converted to the TorchAudio and ONNX formats. Supporting multiple formats makes BirdAVES easier to integrate into different applications and workflows, and more accessible and practical for a wider range of practitioners working in bioacoustics.
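For example, the ONNX-format weights can be run without PyTorch at all. A minimal sketch using onnxruntime follows; the checkpoint file name is a placeholder, and the input name is read from the exported graph rather than assumed.

```python
import numpy as np
import onnxruntime as ort

# Placeholder file name; use the ONNX checkpoint downloaded from the AVES repository.
session = ort.InferenceSession("birdaves.onnx")

# Read the input name from the exported graph instead of hard-coding it.
input_name = session.get_inputs()[0].name

# 16 kHz mono audio as a float32 array of shape (batch, samples).
waveform = np.zeros((1, 3 * 16000), dtype=np.float32)

outputs = session.run(None, {input_name: waveform})
print(outputs[0].shape)  # expected: per-frame embeddings, (batch, num_frames, hidden_dim)
```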

We encourage you to explore the AVES GitHub repository and start using BirdAVES in your projects to advance your research and applications in bioacoustics!
