Neil Zeghidour

Date: Wednesday, 30th May 2018 at 2pm
Place: LORIA, room C005
Speaker: Neil Zeghidour (Facebook AI Research & Ecole Normale Supérieure)

Title: End-to-end speech recognition from the raw waveform

State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. We study end-to-end systems trained directly from the raw waveform, introducing a trainable replacement of mel-filterbanks that uses a convolutional architecture, based on the scattering transform. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of melfilterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We then improve this model and another frontend previously proposed and based on gammatones. We perform open vocabulary experiments on Wall Street Journal and show a consistent and significant improvement in Word Error Rate of our trainable frontends over mel-filterbanks, even with random initialization.