This is the second article on spoken language recognition based on the Mozilla Common Voice dataset. In the first part we discussed data selection and chose the optimal embedding. Let us now train several models and select the best one.
We will now train and evaluate the following models on the full data (40K samples, see the first part for more info on data selection and preprocessing):
· Convolutional neural network (CNN) model. We simply treat the language classification problem as classification of 2-dimensional images. CNN-based classifiers showed promising results in a language recognition TopCoder competition.
· CRNN model from Bartz et al. 2017. A CRNN combines the descriptive power of CNNs with the ability of RNNs to capture temporal features.
· CRNN model from Alashban et al. 2022. This is just another variation of the CRNN architecture.
· AttNN model from De Andrade et al. 2018. This model was initially proposed for speech recognition and subsequently applied to spoken language recognition in the Intelligent Museum project. In addition to convolution and LSTM units, this model has a subsequent attention block that is trained to weigh parts of the input sequence (namely the frames on which the Fourier transform is computed) according to their relevance for classification.
· CRNN* model: same architecture as AttNN, but no attention block.
· Time-delay neural network (TDNN) model. The model we test here was used to generate X-vector embeddings for spoken language recognition in Snyder et al. 2018. In our study, we bypass X-vector generation and directly train the network to classify languages.
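The attention weighting described for AttNN in the list above can be illustrated with a minimal sketch. It is not the architecture from De Andrade et al. 2018: the `frames` stand in for per-frame LSTM outputs, and the fixed `query` vector is a hypothetical placeholder for what would be a learned projection in a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, query):
    """Weigh each frame by its dot-product relevance to `query`.

    Returns the attention weights and the weighted sum of the frames.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frames))
               for d in range(dim)]
    return weights, context

# Toy input: 3 frames with 2 features each (stand-ins for LSTM outputs).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]  # hypothetical; learned in a real model
weights, context = attention_pool(frames, query)
print(weights)   # the third frame receives the largest weight
print(context)   # a single vector summarizing all frames
```

The classifier then sees one fixed-size `context` vector regardless of clip length, with the most relevant frames contributing the most.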
All models were trained on the same train/val/test split and the same mel spectrogram embeddings with the first 13 mel filterbank coefficients. The models can be found here.
The resulting learning curves on the validation set are shown in the figure below (each “epoch” refers to 1/8 of the dataset).
The following table shows mean and standard deviation for the accuracy based on 10 runs.
AttNN, TDNN, and our CRNN* model perform similarly, with AttNN ranking first at 92.4% accuracy. In contrast, CRNN (Bartz et al. 2017), CNN, and CRNN (Alashban et al. 2022) showed much weaker performance, with CRNN (Alashban et al. 2022) coming last at only 58.5% accuracy.
We then trained the winning AttNN model on the train and val sets and evaluated it on the test set. The test accuracy of 92.4% (92.4% for men and 92.3% for women) turned out to be close to the validation accuracy, which indicates that the model did not overfit to the validation set.
To understand the performance difference between the evaluated models, we first note that TDNN and AttNN were specifically designed for speech recognition tasks and already tested against previous benchmarks. This might be the reason why these models come out on top.
The performance gap between AttNN and our CRNN* model (the same architecture but without the attention block) suggests that the attention mechanism is relevant for spoken language recognition. The next model in the ranking, CRNN (Bartz et al. 2017), performs worse despite its similar architecture. This is probably because the default model hyperparameters are not optimal for the MCV dataset.
The CNN model does not possess any specific memory mechanism and comes next. Strictly speaking, the CNN has some notion of memory, since computing a convolution involves a fixed number of consecutive frames. Higher layers thus encapsulate information from even longer time intervals due to the hierarchical nature of CNNs. In fact, the TDNN model, which ranked second, might be viewed as a 1-D CNN. So, with more time invested in CNN architecture search, the CNN model might have performed close to TDNN.
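To make the "memory" of stacked convolutions concrete, here is a small sketch that counts how many input frames a single output frame depends on. The layer layouts are assumptions for illustration: `tdnn_like` loosely follows the frame-level context commonly cited for the x-vector TDNN, and `plain_cnn` is a hypothetical stack, not the CNN trained in this article.

```python
def receptive_field(layers):
    """Number of input frames seen by one output frame of stacked 1-D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs; stride 1 is assumed.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field

# Assumed frame-level layout, loosely following the x-vector TDNN.
tdnn_like = [(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)]
# Hypothetical plain CNN: three undilated kernels of width 3.
plain_cnn = [(3, 1), (3, 1), (3, 1)]

print(receptive_field(tdnn_like))  # 15
print(receptive_field(plain_cnn))  # 7
```

Dilation lets the TDNN-style stack cover a wider time span with the same number of layers, which is one way a tuned 1-D CNN could narrow the gap.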
The CRNN model from Alashban et al. 2022 surprisingly shows the worst accuracy. It is interesting that this model was initially designed to recognize languages in MCV and showed an accuracy of about 97%, as reported in the original study. Since the original code is not publicly available, it is difficult to determine the source of this large discrepancy.
In many cases a user regularly uses no more than 2 languages. In this case, a more appropriate metric of model performance is pairwise accuracy, which is simply the accuracy computed on a given pair of languages, ignoring the scores for all other languages.
The pairwise accuracy for the AttNN model on the test set is shown in the table below next to the confusion matrix, with the recall for individual languages on the diagonal. The average pairwise accuracy is 97%. Pairwise accuracy is expected to be higher than overall accuracy since only 2 languages need to be distinguished.
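A minimal sketch of how pairwise accuracy can be computed from model scores is given below. The scores and labels are made up for illustration and are not the results of this study; the function name and the three-language setup are likewise assumptions.

```python
from itertools import combinations

def pairwise_accuracy(scores, labels, lang_a, lang_b):
    """Accuracy on samples of lang_a or lang_b, comparing only those two scores."""
    correct = 0
    total = 0
    for row, label in zip(scores, labels):
        if label not in (lang_a, lang_b):
            continue  # samples of other languages are ignored
        predicted = lang_a if row[lang_a] >= row[lang_b] else lang_b
        correct += predicted == label
        total += 1
    return correct / total

# Hypothetical scores for 3 languages (columns: 0=de, 1=en, 2=es).
scores = [
    [0.50, 0.10, 0.40],  # true de
    [0.35, 0.25, 0.40],  # true de, but the 3-way argmax picks es
    [0.20, 0.70, 0.10],  # true en
    [0.10, 0.20, 0.70],  # true es
    [0.50, 0.20, 0.30],  # true es, but the 3-way argmax picks de
]
labels = [0, 0, 1, 2, 2]

accuracy = sum(max(range(3), key=row.__getitem__) == label
               for row, label in zip(scores, labels)) / len(labels)
pairs = {pair: pairwise_accuracy(scores, labels, *pair)
         for pair in combinations(range(3), 2)}
mean_pairwise = sum(pairs.values()) / len(pairs)

print(accuracy)  # 0.6
print(pairs)     # {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 1.0}
```

In this toy example the two misclassified samples are both de/es confusions, so they only hurt the de/es pair, and the mean pairwise accuracy (about 0.83) ends up above the overall accuracy of 0.6.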
The model distinguishes best between German (de) and Spanish (es) as well as between French (fr) and English (en) (98%). This is not surprising, as the sound systems of these languages are quite different.
Although we used softmax loss to train the model, it was previously reported that higher accuracy might be achieved in pairwise classification with tuplemax loss (Wan et al. 2019).
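For intuition, the sketch below compares the two losses on a single sample in plain Python. It reflects my reading of the tuplemax definition in Wan et al. 2019 (the mean, over all non-target languages, of the two-class cross-entropy between the target and that language); it is an assumption-laden illustration, not the PyTorch implementation used for the experiments here.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softmax_loss(logits, target):
    """Standard cross-entropy: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def tuplemax_loss(logits, target):
    """Mean two-class cross-entropy between the target and each other class.

    Uses -log(e^zy / (e^zy + e^zk)) = softplus(zk - zy).
    """
    others = [z for k, z in enumerate(logits) if k != target]
    return sum(softplus(z - logits[target]) for z in others) / len(others)

logits = [2.0, 1.5, -1.0, 0.0, 0.5]  # hypothetical scores for 5 languages
target = 0
print(softmax_loss(logits, target))
print(tuplemax_loss(logits, target))
```

With only two classes the two losses coincide. With more classes, each pairwise term ignores all competitors but one, so the tuplemax value never exceeds the softmax loss for the same sample.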
To study the effect of tuplemax loss, we retrained our model after implementing tuplemax loss in PyTorch (see here for implementation). The figure below compares the effect of softmax loss and tuplemax loss on accuracy and on pairwise accuracy when evaluated on the validation set.
As can be observed, tuplemax loss performs worse than softmax loss in overall accuracy (paired t-test, p = 0.002). Pairwise accuracy is also lower with tuplemax loss, but this difference is not statistically significant (paired t-test, p = 0.2).
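As a reminder of what the paired t-test measures, here is a sketch of the test statistic using only the standard library. The accuracy values are made up and are not the results of this study, and pairing the runs (for example by random seed) is an assumption about the experimental setup.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of a paired t-test on matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Made-up validation accuracies for matched runs, NOT the article's results.
softmax_acc = [0.92, 0.93, 0.91, 0.92, 0.94]
tuplemax_acc = [0.90, 0.92, 0.91, 0.90, 0.91]

t, dof = paired_t_statistic(softmax_acc, tuplemax_acc)
print(t, dof)
```

Converting `t` to a p-value requires the CDF of Student's t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` returns both the statistic and the p-value.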
In fact, even the original study fails to explain clearly why tuplemax loss should do better. Here is the example that the authors make:
The absolute value of loss does not actually mean much. With enough training iterations, this example might be classified correctly with one or the other loss.
In any case, tuplemax loss is not a universal solution, and the choice of loss function should be carefully evaluated for each given problem.
We reached 92% accuracy and 97% pairwise accuracy in spoken language recognition of short audio clips from the Mozilla Common Voice (MCV) dataset. German, English, Spanish, French, and Russian languages were considered.
In a preliminary study comparing mel spectrogram, MFCC, RASTA-PLP, and GFCC embeddings, we found that mel spectrograms with the first 13 filterbank coefficients resulted in the highest recognition accuracy.
We next compared the generalization performance of 6 neural network models: CNN, CRNN (Bartz et al. 2017), CRNN (Alashban et al. 2022), AttNN (De Andrade et al. 2018), CRNN*, and TDNN (Snyder et al. 2018). Among all the models, AttNN showed the best performance, which highlights the importance of LSTM and attention blocks for spoken language recognition.
Finally, we computed the pairwise accuracy and studied the effect of tuplemax loss. It turns out that tuplemax loss degrades accuracy and does not improve pairwise accuracy compared to softmax.
In conclusion, our results constitute a new benchmark for spoken language recognition on the Mozilla Common Voice dataset. Better results could be achieved in future studies by combining different embeddings and extensively investigating promising neural network architectures, e.g. transformers.
In Part III we will discuss which audio transformations might help to improve model performance.
- Alashban, Adal A., et al. “Spoken language identification system using convolutional recurrent neural network.” Applied Sciences 12.18 (2022): 9181.
- Bartz, Christian, et al. “Language identification using deep convolutional recurrent neural networks.” Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14–18, 2017, Proceedings, Part VI 24. Springer International Publishing, 2017.
- De Andrade, Douglas Coimbra, et al. “A neural attention model for speech command recognition.” arXiv preprint arXiv:1808.08929 (2018).
- Snyder, David, et al. “Spoken language recognition using x-vectors.” Odyssey. Vol. 2018. 2018.
- Wan, Li, et al. “Tuplemax loss for language identification.” ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.