Date: Tuesday, 17th April 2018 at 11am Place: LORIA, room A008 Speaker: Angela Fan (Facebook Research)
Title: Sequence to Sequence Learning for User-Controllable Abstractive Summarization
Abstract: The design of neural architectures for sequence-to-sequence tasks such as summarization is an active research field. I will first briefly discuss the architectural changes that enable our convolutional sequence-to-sequence model, such as replacing non-linearities with novel gated linear units and using multi-hop attention. The second part of the talk will discuss ways to train models for large-scale summarization that respect user preferences. Our model enables users to specify high-level attributes that control the shape of the final summaries to suit their needs. For example, users may want to specify the length of the summary or the portion of the document to summarize. With this input, the system can produce summaries that respect user preferences. Without user input, the control variables can be set automatically, and the model then outperforms comparable state-of-the-art summarization models.
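Gated linear units are usually written as GLU(A, B) = A ⊗ σ(B): a convolution outputs twice the desired number of channels, one half goes through a sigmoid, and the result multiplicatively gates the other half. A minimal NumPy sketch of the activation alone, assuming a (time, 2 × channels) layout; the input here is random, hypothetical data, and this is not taken from the speaker's code.

```python
import numpy as np


def glu(x, axis=-1):
    """Gated linear unit: split x into halves a, b along `axis`, return a * sigmoid(b)."""
    a, b = np.split(x, 2, axis=axis)  # the size along `axis` must be even
    return a * (1.0 / (1.0 + np.exp(-b)))


# Hypothetical convolution output: 3 time steps, 2 x 4 channels
x = np.random.default_rng(0).normal(size=(3, 8))
y = glu(x)
print(y.shape)  # (3, 4)
```

The usual motivation is that the sigmoid gate decides, per channel, how much of the ungated half passes through, while that half keeps a linear path for gradients.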
Hervé Bredin on Wednesday, April 25
Date: Wednesday, 25th April 2018 at 2pm Place: LORIA, room A008 Speaker: Hervé Bredin (LIMSI)
Title: Neural building blocks for speaker diarization
Speaker diarization is the task of determining “who speaks when” in an audio stream. Most diarization systems rely on statistical models to address four sub-tasks: speech activity detection (SAD), speaker change detection (SCD), speech turn clustering, and re-segmentation. First, following the recent success of recurrent neural networks (RNNs) for SAD and SCD, we propose to address re-segmentation with Long Short-Term Memory (LSTM) networks. Then, we propose to use affinity propagation on top of neural speaker embeddings for speech turn clustering, outperforming standard Hierarchical Agglomerative Clustering (HAC). Finally, all these modules are combined and jointly optimized to form a speaker diarization pipeline in which all but the clustering step are based on RNNs. We provide experimental results on the French broadcast dataset ETAPE, on which we reach state-of-the-art performance.
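Affinity propagation (Frey and Dueck's message-passing algorithm) exchanges "responsibility" and "availability" messages over a similarity matrix until a set of exemplars emerges, so the number of speakers does not have to be fixed in advance. A minimal NumPy sketch, assuming cosine similarity between speech-turn embeddings and the median off-diagonal similarity as every point's exemplar preference; the helper `cluster_speech_turns` and the 2-D toy embeddings are hypothetical, and this is not the speaker's pipeline.

```python
import numpy as np


def affinity_propagation(S, damping=0.5, n_iter=200):
    """Affinity propagation on similarity matrix S (diagonal = exemplar preferences).

    Returns one cluster label per row of S, or all -1 if no exemplar emerges.
    """
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))); a(k,k) is not clipped
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, self_avail)
        A = damping * A + (1 - damping) * A_new

    exemplars = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
    if exemplars.size == 0:
        return np.full(n, -1)
    labels = S[:, exemplars].argmax(axis=1)  # assign each point to its most similar exemplar
    labels[exemplars] = np.arange(exemplars.size)  # each exemplar labels its own cluster
    return labels


def cluster_speech_turns(embeddings):
    """Hypothetical helper: cosine similarities with the median similarity as preference."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T
    off_diag = S[~np.eye(len(S), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    return affinity_propagation(S)


# Six hypothetical speech-turn embeddings from two speakers
turns = np.array([[1.0, 0.25], [1.0, 0.0], [1.0, -0.2],
                  [-1.0, 0.15], [-1.0, 0.0], [-1.0, -0.3]])
print(cluster_speech_turns(turns))
```

Unlike HAC, this needs neither a distance threshold nor a target number of clusters; the preference on the diagonal of S controls how many exemplars (speakers) emerge, and raising it yields more clusters.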
Neil Zeghidour on Wednesday, May 30
Date: Wednesday, 30th May 2018 at 2pm Place: LORIA, room C005 Speaker: Neil Zeghidour (Facebook AI Research & Ecole Normale Supérieure)
Title: End-to-end speech recognition from the raw waveform
State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. We study end-to-end systems trained directly from the raw waveform, introducing a trainable replacement for mel-filterbanks that uses a convolutional architecture based on the scattering transform. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks and then fine-tuned jointly with the rest of the convolutional architecture. We perform phone recognition experiments on TIMIT and show that models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We then improve this model, as well as a previously proposed frontend based on gammatones. We perform open-vocabulary experiments on Wall Street Journal and show a consistent and significant improvement in word error rate of our trainable frontends over mel-filterbanks, even with random initialization.
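The abstract describes TD-filterbanks only at a high level. A minimal NumPy sketch of one plausible forward pass in that spirit, assuming complex Gabor band-pass filters with mel-spaced center frequencies, a squared modulus, a Hann-window low-pass filter with decimation, and log compression. The filter count, filter length, hop, frequency range and bandwidth heuristic are illustrative assumptions, not values from the talk, and the filters stay fixed at their mel-like initialization here; in the described system they would be learned jointly with the rest of the model.

```python
import numpy as np

SR = 16000      # assumed sample rate (Hz)
N_FILTERS = 40  # assumed number of filters, a common mel-filterbank size
WIDTH = 400     # assumed filter length: 25 ms at 16 kHz
HOP = 160       # assumed hop between output frames: 10 ms at 16 kHz


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def gabor_init(n_filters=N_FILTERS, width=WIDTH, sr=SR, fmin=60.0, fmax=7800.0):
    """Complex Gabor filters with mel-spaced centers, a rough stand-in for a mel filterbank."""
    hz = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2))
    centers = hz[1:-1]
    bandwidths = hz[2:] - hz[:-2]  # span of each triangular mel band (Hz)
    t = (np.arange(width) - width // 2) / sr
    filters = []
    for fc, bw in zip(centers, bandwidths):
        sigma = 1.0 / (np.pi * bw)  # heuristic: Gaussian whose frequency std is bw / 2
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        envelope /= envelope.sum()  # unit gain at the center frequency
        filters.append(envelope * np.exp(2j * np.pi * fc * t))
    return np.stack(filters), centers  # (n_filters, width) complex, (n_filters,) Hz


def td_filterbank(wave, filters, hop=HOP):
    """Band-pass, squared modulus, Hann low-pass, decimation, log compression."""
    lowpass = np.hanning(filters.shape[1])
    lowpass /= lowpass.sum()
    channels = []
    for f in filters:
        energy = np.abs(np.convolve(wave, f, mode="same")) ** 2
        smooth = np.convolve(energy, lowpass, mode="same")
        channels.append(smooth[::hop])  # keep one value per hop
    return np.log(np.stack(channels, axis=1) + 1e-6)  # (frames, n_filters)


filters, centers = gabor_init()
t = np.arange(SR // 2) / SR                 # 0.5 s test signal
tone = np.cos(2 * np.pi * centers[10] * t)  # pure tone at the 11th filter's center
feats = td_filterbank(tone, filters)
print(feats.shape)  # (50, 40)
```

Because each filter is normalized to unit gain at its center, a pure tone at one center frequency should give the largest average log-energy in that channel, which is a quick sanity check on the initialization.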