Keynote speakers




Ralf Schlueter

Faculty of Mathematics, Computer Science and Natural Sciences
HLTPR - Human Language Technology and Pattern Recognition
RWTH Aachen University

Title: Automatic Speech Recognition based on Neural Networks

In automatic speech recognition, as in many areas of machine learning, stochastic modeling relies more and more on neural networks. In both acoustic and language modeling, neural networks today mark the state of the art for large vocabulary continuous speech recognition, providing huge improvements over former approaches that were based solely on Gaussian mixture hidden Markov models and count-based language models. We give an overview of current activities in neural network based modeling for automatic speech recognition. This includes discussions of network topologies and cell types, training and optimization, choice of input features, adaptation and normalization, multitask training, as well as neural network based language modeling. Despite the clear progress obtained with neural network modeling in speech recognition, a lot remains to be done to obtain a consistent and self-contained neural network based modeling approach that ties in with the former state of the art. We will conclude with a discussion of open problems as well as potential future directions w.r.t. neural network integration into automatic speech recognition systems.

 




Attila Vékony

NNG Software Developing and Commercial Llc.

Title: Speech Recognition Challenges In The Car Navigation Industry

Until a few decades ago, machines that talk and understand human speech were only the subject of science fiction. Nowadays, Text to Speech (TTS) and Automatic Speech Recognition (ASR) have become reality, but they are still considered fancy. Automotive infotainment is a selling point for car manufacturers: it is a symbol of being hi-tech, and car commercials often feature the display of the head unit for a few seconds. As avoiding Driver Distraction has grown into a major design aspect, Speech Recognition is becoming trendy and almost compulsory. But let us see how far we have gotten.

In the first part, this talk will summarize the most popular Speech features in today's car navigation systems, and will look into the underlying technology, solutions and limitations widely applied in the industry. We will mention typical context designs, dialogue systems and address search, and we will show how the common technology leads to typical HMI solutions. We will point out the possibilities and limitations of on-board and server-based recognition, and consider why we will need to resort to exclusively offline solutions for a while in this industry. 

At this point we will have an overview of the ingredients, so the talk will focus on problematic and sub-optimal ASR features requested by automotive manufacturers, explaining why they negatively affect recognition accuracy. A workaround often leads to troublesome and seemingly unnecessary questions for the user, so it is not easy to compromise. In the last part, we will examine a certain address search scenario that is trivial for users and feasible with a server-based recognizer, but that remains an open question as of 2016 when done offline.

 



Nick Campbell

Speech Processing Lab, Trinity College Dublin

Title: Machine Processing of Dialogue States; Speculations on Conversational Entropy

This talk will describe our approach to conversational speech synthesis, illustrated with examples from the Herme dialogues. Herme was a small device that initiated conversations with passers-by in the Science Gallery at Trinity College in Dublin and managed to engage the majority in short conversations lasting up to about three minutes. Experience from that data collection and analyses of human-human conversational interactions have led us to propose a theory of Conversational Entropy wherein tight couplings become looser through time as topics decay and are refreshed by topic changes and conversational restarts. Laughter is a particular cue to this decay mechanism and might provide sufficient information for machines to intrude into human conversations without causing particular offense.