Building An AI-Empowered Music Library with TensorFlow
January 16, 2020
A guest post by Tencent QQ Music Audio Engineering Team

Introduction

QQ Music is a music streaming service owned by the Tencent Music Entertainment Group (TME) and we serve 800M+ users globally. We have a massive music library with an extensive collection of albums and live music available in both audio and video formats. Building upon TensorFlow, we are able to effectively manage the music library by implementing many machine learning algorithms for automatic music tagging, music evaluation, voice separation, music defect detection and restoration, and so on. In this article, we’d like to share with you how we have built such an AI-empowered music library and our experience of using TensorFlow.

Building a training framework with TensorFlow

Based on TensorFlow, we built an ML training framework specifically for audio that covers feature extraction, model building, training strategy, and online deployment. It leverages many of TensorFlow's high-level APIs, which makes our algorithms convenient to implement.
Figure 1: Training block diagram based on tf.estimator

Feature extraction

To handle audio-related tasks, we first need to extract audio features. Here we mainly use the high-level signal processing APIs in tf.signal, which provide STFT, MFCC, and other common feature extraction functions. By using tf.py_func(), which takes NumPy arrays as input and returns NumPy arrays as output, we can even wrap other audio processing tools such as librosa as an op in a TensorFlow graph. The following code snippet wraps a librosa-based function that reads audio and extracts features as an op in the computation graph:
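from functools import partial

import librosa
import tensorflow as tf

def _load_wav(filename, gt_rate=16000):
    """Load an audio file with librosa, resampled to gt_rate, and return its features."""
    try:
        data, _ = librosa.load(filename, sr=gt_rate)
    except Exception as e:
        # Chain the original error so the failing file is easy to trace
        raise RuntimeError('failed to load the file %s' % filename) from e
    # TODO: the feature extraction step is elided in the original post;
    # it must produce a float32 NumPy array to match tf.float32 below
    features = ...
    return features

gt_rate = 16000  # target sample rate in Hz
read_wav = lambda x: tf.py_func(partial(_load_wav, gt_rate=gt_rate), [x], tf.float32)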
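To make the STFT features mentioned above more concrete, here is a minimal pure-Python sketch of what a magnitude STFT computes: the signal is cut into overlapping frames, each frame is multiplied by a Hann window (the default window of tf.signal.stft), and the magnitude of its DFT is taken. The function names `hann_window` and `stft_magnitude` and the toy signal are illustrative assumptions, not part of our framework; in practice tf.signal.stft does this far more efficiently.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def stft_magnitude(signal, frame_length, frame_step):
    """Return one list of frame_length // 2 + 1 magnitudes per frame."""
    window = hann_window(frame_length)
    num_bins = frame_length // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_length + 1, frame_step):
        windowed = [signal[start + n] * window[n] for n in range(frame_length)]
        spectrum = []
        for k in range(num_bins):
            # Naive DFT for bin k; real code would use an FFT.
            acc = sum(
                windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                for n in range(frame_length)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy tone completing 4 cycles per 32-sample frame, so its energy sits in bin 4.
tone = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(128)]
magnitudes = stft_magnitude(tone, frame_length=32, frame_step=16)
peak_bins = [frame.index(max(frame)) for frame in magnitudes]
```

Each row of `magnitudes` is one time frame and each column one frequency bin, which is the time-frequency layout that the models described later in this post operate on.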
We use tf.data to convert extracted features into TFRecord files or tensors that feed directly into our models for training. Specifically, we use the tf.data.Dataset.map() function to map each dataset element to another, with num_parallel_calls controlling how many elements are processed in parallel, as shown below. We then chain calls such as dataset.filter(lambda XXX).map(lambda XXX).apply(XXX) to perform other data conversions.
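def audio_dataset_from_fileslist(path=None, num_parallel_calls=4, gt_rate=16000):
    # load_fileslist is our own helper (not shown here); it is assumed to
    # return a tf.data.Dataset of audio filenames
    fn_dset = load_fileslist(path)
    # Wrap the librosa-based loader as a TensorFlow op
    read_wav = lambda x: tf.py_func(partial(_load_wav, gt_rate=gt_rate),
                                    [x],
                                    tf.float32)
    # Load and featurize several files in parallel
    audio_dset = fn_dset.map(read_wav,
                             num_parallel_calls=num_parallel_calls)
    return audio_dset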

Model building

We use high-level TensorFlow APIs such as tf.keras to build our neural networks, so we do not have to define the weights of each convolution layer by hand as low-level APIs require. Some pretrained models can be used directly or quickly customized for our needs, and fine-tuning existing models this way is very convenient.

Training strategy

During model training, we define the data input, model output, loss function, and gradient descent method, and then put them together for training. TensorFlow offers the tf.estimator module as a high-level API that encapsulates the whole ML lifecycle: training, evaluation, prediction, and exporting models for deployment. This module provides rich external interfaces, for example input_fn and model_fn: the former provides input data for training as mini-batches, and the latter builds the computation graphs for training, evaluation, and prediction. We can then directly call train() for training, evaluate() for evaluation, and predict() for prediction. More importantly, it supports a variety of distributed training strategies, such as single-machine multi-GPU and multi-machine multi-GPU. By using these distribution strategies, we can distribute our existing models and training code across multiple machines and GPUs with minimal code changes. Finally, the model files can be exported as Protobuf, SavedModel, or FlatBuffers (for TensorFlow Lite) for deployment in various places. The whole workflow is shown in Figure 1.

Model deployment

The weights in our models are saved in protobuf format, and TensorFlow Serving is used as the deployment tool. TensorFlow Serving achieves high throughput and availability through asynchronous calls, and it automatically batches inference requests to make the best use of our GPU resources. We deploy our trained models on Nvidia P40 GPUs, and each GPU is well utilized.

Sample use cases

Music tagging

Music tagging, also called music classification, refers to tagging music according to its genre, mood, scene, and more, by using Music Information Retrieval (MIR) techniques. Music tags are valuable because they help users better understand music and facilitate personalized recommendation.

Music tagging can be applied in the following scenarios:
  1. Building music channels
  2. Forming online music radios
  3. Improving the tag coverage of the music library and building an intelligent music library
  4. Recommending music based on users' personal interests
In our case, music tagging can be described as a two-stage process: training classification models using labeled data, and testing the models using new/unseen data. The precision and recall metrics of the classification model must both be good, which is a huge challenge given the massive size and great diversity of our music library. Furthermore, as we strive to remove human involvement from the tagging process, we have adopted deep learning to create a more accurate and robust classification system.
Our main improvements over other approaches are:
  1. A classification model based on the combined features of audio content and lyrics is used to improve the accuracy of music classification. Figure 2 is the block diagram of our audio classification system.
    Figure 2: Audio classification training block diagram
  2. Instead of using the conventional multiclass classification in deep learning, we use binary classification or even multilevel binary classification, which significantly improves the precision and recall rates. Figure 3 shows an example of 2-level classification to get the target label “pop piano”.
    Figure 3: Multilevel binary classification system
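The 2-level idea can be sketched as a small cascade of binary decisions. This is only an illustration under stated assumptions: the per-track scores, the 0.5 threshold, the ordering (a "pop" classifier first, then a "piano" classifier), and the function names `binary_decision` and `two_level_tag` are all hypothetical and are not taken from our production system.

```python
def binary_decision(score, threshold=0.5):
    """Turn a classifier score in [0, 1] into a yes/no decision."""
    return score >= threshold


def two_level_tag(scores, level_one="pop", level_two="piano", threshold=0.5):
    """Return 'pop piano' only if both binary classifiers fire, else None."""
    if not binary_decision(scores[level_one], threshold):
        return None  # rejected at level 1; level 2 is never consulted
    if not binary_decision(scores[level_two], threshold):
        return None  # passed level 1 but rejected at level 2
    return f"{level_one} {level_two}"


# Hypothetical classifier outputs for three tracks.
tracks = {
    "track_a": {"pop": 0.92, "piano": 0.81},
    "track_b": {"pop": 0.88, "piano": 0.12},
    "track_c": {"pop": 0.07, "piano": 0.95},
}
tags = {name: two_level_tag(scores) for name, scores in tracks.items()}
```

Only `track_a` receives the tag here. Because each level is its own binary problem, its threshold can be tuned separately to trade precision against recall.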

Voice separation

Voice separation, also known as singing voice separation, is a technique to separate the vocal sound and background music. It aims to decompose a music recording into two tracks: the singing voice on one track, and the accompanying instrumental music on the other.
Figure 4: Voice separation in the frequency domain
In the music domain, voice separation has a wide range of applications. For example:
  1. Generating high-quality accompaniments to build an accompaniment library
  2. Using the extracted voice for voiceprint recognition, voice conversion, sound correction template, intelligent MIDI, and more
  3. Combining it with ASR (Automatic Speech Recognition) technology to generate lyrics, align lyrics, and even automatically create a lyrics timestamp file, as shown in Figure 5 below
Figure 5: Generating lyrics timestamps based on voice separation
Voice separation is a highly popular research topic in the Music Information Retrieval (MIR) literature, and annual competitions such as the SiSEC MUS challenge attract an increasing number of participating teams. We built a CRNN (Convolutional Recurrent Neural Network) model, shown in Figure 6. It is a variation of the U-Net model, which uses a convolutional autoencoder with additional skip connections that bring detailed information lost in the encoding stage back into the decoding stage. The model can be trained on large datasets of pure accompaniments and vocals from our music library. It’s worth noting that our model does not directly output the accompaniment or the voice; instead, it outputs the corresponding Ideal Ratio Mask (IRM), which is then multiplied with the spectrum of the original signal to obtain the desired accompaniment or voice.
Figure 6: CRNN model based encoder-decoder architecture
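The masking step can be illustrated with a few lines of plain Python. This sketch assumes one common formulation of the IRM, the ratio of the target magnitude to the sum of target and interference magnitudes per time-frequency bin. The post does not specify the exact variant, and the helper names and toy magnitudes below are made up for illustration. In training, the mask is computed from known vocal and accompaniment tracks and serves as the target; at inference time the network predicts it from the mixture alone.

```python
def ideal_ratio_mask(target_mag, other_mag, eps=1e-8):
    """Per-bin ratio target / (target + other), a value between 0 and 1."""
    return [
        [t / (t + o + eps) for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target_mag, other_mag)
    ]


def apply_mask(mask, mixture_mag):
    """Element-wise product of a mask and a magnitude spectrogram."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_mag)
    ]


# Toy magnitude spectrograms: 2 frames x 3 frequency bins.
vocal_mag = [[0.9, 0.1, 0.0], [0.4, 0.4, 0.2]]
accomp_mag = [[0.1, 0.3, 0.5], [0.4, 0.0, 0.6]]
# Simplifying assumption: magnitudes add up in the mixture.
mixture_mag = [
    [v + a for v, a in zip(v_row, a_row)]
    for v_row, a_row in zip(vocal_mag, accomp_mag)
]

vocal_mask = ideal_ratio_mask(vocal_mag, accomp_mag)
accomp_mask = [[1.0 - m for m in row] for row in vocal_mask]

est_vocal = apply_mask(vocal_mask, mixture_mag)
est_accomp = apply_mask(accomp_mask, mixture_mag)
```

Under the additive assumption the two estimates recover the original tracks up to the small `eps` term, and the two masks sum to one in every bin, so a single predicted mask yields both the voice and the accompaniment.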

Defect detection

Defect detection refers to identifying abnormal music segments that adversely affect the users’ listening experience. Its purpose is to locate such segments so that they can be replaced or repaired in our music library. Moreover, it can prevent defective audio files from entering our music library in the first place. It is therefore critical to the sound quality of our music library, which in turn leads to a superior product experience and strong branding for QQ Music.
Figure 7 shows a few defective audio fragments in a song, which arise for historical reasons such as network transmission errors or codec issues. Detecting such defects is therefore necessary. The main technique involved is Sound Event Detection (SED), which can be based on three kinds of human annotation:
  1. Strong labeling: each event is labeled with start time and end time
  2. Weak labeling: each event has no time information, also known as presence/absence labeling
  3. Sequential labeling: the sequence of events without start/end time information
However, strong labels are both labor- and cost-intensive for human annotators to produce, so we developed a weakly supervised architecture based on the Convolutional Recurrent Neural Network (CRNN) with weak labeling (presence/absence labels). Our SED technique using weak labels can be formulated as a Multiple Instance Learning (MIL) problem. At a high level, our CRNN model predicts the probability of each sound event type at each frame. A pooling function aggregates the frame-level probabilities into a recording-level probability for each sound event type, and we can then predict whether the whole song contains defects based on the recording-level probabilities.
Finally, we use TensorFlow Serving to deploy our model so that we can scan our entire music library.
Figure 7: Defect detection process
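The pooling step in the MIL formulation above can be sketched as follows. The post does not say which pooling function is used, so this example simply compares three well-known choices (max, average, and linear softmax pooling) on made-up frame-level probabilities for a single hypothetical event type; the 0.5 decision threshold is also an assumption.

```python
def max_pool(frame_probs):
    """Recording-level probability = the most confident frame."""
    return max(frame_probs)


def average_pool(frame_probs):
    """Recording-level probability = mean over all frames."""
    return sum(frame_probs) / len(frame_probs)


def linear_softmax_pool(frame_probs, eps=1e-8):
    """Weighted mean where each frame's weight is its own probability."""
    return sum(p * p for p in frame_probs) / (sum(frame_probs) + eps)


# Hypothetical frame-level probabilities of a "click" defect over 8 frames;
# the defect is short and only shows up in frames 3 and 4.
click_probs = [0.02, 0.03, 0.05, 0.95, 0.90, 0.04, 0.02, 0.03]

recording_level = {
    "max": max_pool(click_probs),
    "average": average_pool(click_probs),
    "linear_softmax": linear_softmax_pool(click_probs),
}
has_defect = {name: prob >= 0.5 for name, prob in recording_level.items()}
```

With these numbers, average pooling dilutes the short defect below the threshold, while max and linear softmax pooling still flag the recording. This is why the choice of pooling function matters when only presence/absence labels are available.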

Music restoration

Some songs in the music library contain noise, such as click noise and background noise, introduced by the limitations of recording equipment (for example, mobile phone recordings) or by interference from the background environment.
In particular, some User Generated Content (UGC) naturally includes noise that affects the listening experience. We therefore take technical measures to filter out these noise signals, which effectively restores and enhances the sound quality and improves the entertainment experience of QQ Music. Music restoration can be deployed either in the cloud or on mobile devices to achieve real-time filtering of noise signals.
At a high level, we restore defective audio by extracting the STFT linear amplitude spectrum and passing it into a CRNN model. The whole model training block diagram is shown in Figure 8.
Figure 8: Audio denoising training block diagram
It is important to note that this model outputs an Ideal Ratio Mask (IRM), a form of time-frequency masking: it predicts a multiplicative time-frequency mask, as shown in Figure 9. We then multiply the predicted mask with the input time-frequency features to get the predicted time-frequency output. In other words, we multiply the predicted IRM with the amplitude spectrum of the noisy signal to get the clean amplitude spectrum used to restore the audio. Because our algorithm must run in real time, we have mainly focused on estimating the amplitude of the complex spectrum after the STFT while reusing the phase of the original noisy audio. Researchers are starting to extend their work to cover phase information, proposing new methods such as the Phase Sensitive Mask (PSM) and the complex Ideal Ratio Mask (cIRM) for estimating the phase of clean audio. We will explore these new methods in the future.
Figure 9: Audio denoising based on IRM
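Reusing the noisy phase, as described above, amounts to scaling the magnitude of each complex STFT bin by the predicted mask and leaving its angle untouched. A minimal sketch with Python's standard cmath module follows; the complex bin values and mask values are made up, and `restore_bin` is a hypothetical helper name.

```python
import cmath


def restore_bin(noisy_bin, mask_value):
    """Scale the magnitude of one complex STFT bin and keep its phase."""
    magnitude, phase = cmath.polar(noisy_bin)
    return cmath.rect(mask_value * magnitude, phase)


# One toy STFT frame of the noisy signal (3 frequency bins).
noisy_frame = [3 + 4j, -1 + 1j, 0.5 - 0.5j]
# Hypothetical mask predicted by the model for this frame.
predicted_mask = [0.8, 0.1, 0.5]

restored_frame = [
    restore_bin(x, m) for x, m in zip(noisy_frame, predicted_mask)
]
```

For a real, non-negative mask this is the same as multiplying the complex bin by the mask value, so no separate phase estimate is needed. The restored spectrum can then be converted back to a waveform with an inverse STFT, for example tf.signal.inverse_stft.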

Conclusion

In this article we’ve discussed how we adopted TensorFlow to build an AI-enhanced music library, and illustrated how to build deep learning training models to tackle several audio-related challenges in QQ Music.
In the future, we will continue to explore the “Music & AI” field, including lyrics generation, AI music composition, audio fingerprinting, music recognition, and query by humming. TensorFlow plays an important role in developing all these intelligent music solutions, which helps us bring the best music experience to our 800M+ customers.
