So, thank you for the nice introduction. My name is Petr Schwarz and I come from Brno University of Technology. At the beginning I was a researcher at the university, working in many different fields of speech recognition — speaker identification, ASR, and so on.

Then, around 2005, something strange happened to us. We were approached by a company, and the company said: we will give you some money, but we would like to have a different license for a recognizer that was publicly available. We said, okay, this is fine, let's do it — it will help us to finance the research. But it turned out to be quite a lot of work: it took about nine months just to negotiate the license with the university. And we realized two things: that there is interest from the commercial market which can bring additional money, and that we need to do it in a better way. So we started a company called Phonexia.
I would like to talk about two main topics today: how to get speech technologies to the market, and — because the two are much related — the problems we see from the user's point of view. At the beginning I will say a few words about the company, then about the main use cases, then about the technologies that are behind our products and how we move the technologies to products, and finally I will indicate some grand challenges.
People usually don't know what is in speech. But if you look at this slide, you can see that there is a lot. There is information about the speaker: it can be gender, it can be age, it can be the speaker identity, it can be, for example, the emotional state, the way the speaker speaks, and so on. There is the content: you can detect the language, you can detect dialects, you can do keyword spotting, or you can do full speech transcription; maybe the topic is interesting, or you can do something domain-specific on top of it. But there are other modalities too: you can have some information about the environment the speaker is in, about to whom the speaker speaks, and about other sounds around. And you can get a lot of information about the equipment that was used to record or transmit the voice: it can be the device, it can be the codec, and you can estimate the speech quality — for example whether a recording is good enough for transcription. All this is important, and users can benefit from this information.
A few words about the company. Phonexia was founded in 2006 as a startup from Brno University of Technology. It has its seat in the Czech Republic, in Brno, just a five-minute walk from the university campus. If we speak about users, we currently have users in more than twenty countries: governments and agencies, security forces, banks, telco operators, service providers and others. The company is still quite small and has grown so far without any external funding.
If we speak about the process of how to transfer technology from research to the market, there are several steps. Research plays an important role — it is usually done at universities or inside companies — and the goal there is to get the best technology. The main interest is accuracy; things like speed or the stability of the code are not of main importance at this stage. What was important for us — and probably also for you — is that at this stage we want to be as open as possible: there are evaluations with quantitative measurement, open-source toolkits, and so on.

But then you need to get the technology to users somehow, and for this you need to do the next step: you need to build a code base that is stable, that is fast, that has a well-defined API, documentation, and proper licensing. This is exactly what we do.

And then there is the last step: you need to build products for the customers. You can have nice technology, you can have nice interfaces, but if you don't have a product, you won't be able to sell. Here the full focus is on functionality, and this is done either by Phonexia or by other companies.
Now I will mention three main use cases, or three main customer domains. There are others, but I selected these three. The first one is call centers. There are two areas where we are active in call centers: one is quality control — how to ensure the quality of operators in the call center — and the other is data mining from voice data.

What is quality control about? In a call center you usually have some team leader or supervisor whose task is to evaluate how well the operators do their job: evaluation of the operators, analysis of the results for the team, and some reporting. If there are no speech technologies, usually only a few percent of recordings are inspected, and the supervisors listen to them manually. But if you deploy speech technologies, you are able to control a hundred percent of the calls, so you are able to get much better statistics.
And everything here is about cost. If we are able to reduce the number of supervisors, we lower the operating costs. It is also very important to shorten the calls: if you are able to find problems — for example operators that are not well trained — there is a possibility to say what the next training should look like. Usually the fluctuation of people in such positions is high, tens of percent, so if we are able to reduce the amount of training needed and shorten some calls, it saves money.

The main technologies behind quality control are keyword spotting and dialogue analysis, so you are able to get important statistics: how the dialogue starts, the number of speaker turns, speech and reaction times. Unfortunately not all call centers have ideal equipment, so if the two speakers of the conversation are not in separate audio channels, we need to do diarization. Then it is possible to deploy keyword spotting: there are obligatory phrases, there are words you don't want the operators to say, and of course script compliance — the operators should follow the script for all calls. And it is possible to add full speech transcription, mainly as a basis for further analysis in this task.
Now about the data mining — this is the other large topic for call centers. Here again we have two subtasks. One I would call handling call-center overload. Imagine that you have a call center with a few hundred people, and there is a large outage — suddenly thousands of people start calling to ask what happened to the service. You need to react really quickly: you need to find out what is wrong and maybe push some information to the IVR at an early stage. This is typically solved by a big screen in the call center showing the topics that are just being discussed and their percentages.

The other important use case is using speech technologies for business intelligence. Let me give an example of how it may be done. Imagine a company looking for places to build new fast-food restaurants. One approach is to go to a telephone operator and ask: please, could you give us statistics of where the people that visit our fast foods are during the day? The place with the highest concentration is a good place to start a new fast food. With speech technologies you can do the same, but you have more information — for example whether the person on the phone line is male or female, whether there are more people on the line, or whether the person was interested in some topic — and it helps you to push the whole business intelligence further. For this we usually use speech transcription and then some data mining on top of it. Of course it is possible to add gender identification and other technologies if you want.
The other big group of customers are banks. Banks of course have call centers, so what I mentioned on the past two slides is important here as well. But there are two other large tasks. The first one is that banks need to ensure security, but on the other side they need something that is pleasant for the user and does not bring too many complications. Here voice biometrics is very interesting: it can be voice biometry using a passphrase, or it can be voice biometry that is done seamlessly in the background during the call, using a text-independent speaker identification system.

The other task is fraud protection. Imagine there are people calling the bank, for example several times a day, each time with a different claimed identity, requesting loans. Without the technology it is hard to detect this; with a technology like speaker identification it is really simple.
Now about the intelligence agencies. With intelligence agencies the situation is usually that they have a really huge amount of data — an amount so large that they are not able to process it manually. The data can come from telecommunication networks, from the internet, and so on, and they are really looking for a needle in a haystack. For this it is possible to use a combination of technologies, and we are indeed using a combination: language identification, gender identification, speaker diarization, keyword spotting, speech transcription, data mining tools, and also correlation with other metadata, for example from text sources. And of course lawful interception and forensic speaker identification are very interesting use cases here.
Now I will move on to the technologies and tell you what is important for practical deployments. Here are some of the technologies; I won't speak about all of them, but you can come and ask if you have a question.

First, voice activity detection. I would say that this is the most important part for practical deployment. You can have very nice results, for example on evaluation databases, but when you go to a real target, the users are working with channels where a huge quantity of the traffic — it can be tens of percent — is not speech at all. It is some technical signal, like dialing tones, fax, and so on. If you don't have good voice activity detection built in, it is really hard to work with such channels.

So we are using a cascade: at the beginning an energy-based VAD to remove a very large portion of the silence; then technical-signal removal, like a tone detector and removal of patterns that appear in mobile-station signals; then a VAD based on F0 tracking, because speech has specific characteristics in F0 and in the behaviour of this F0 over time; and finally a phoneme-recognizer-based VAD to get a very precise segmentation.

As I said, it is a very important technology and there are still many challenges. The accuracy of the VAD directly affects the accuracy of the downstream technologies. Some sounds are really tricky: you can have music, there are other speaker sounds — people tend to laugh, cough, and so on — you have the alignment of silence, and there are many different technical signals. A big challenge is VAD under variable SNR and on distorted channels. What is also very important, I think, is an unsupervised, automatic way of training the VAD, because we know that we can tune it to one specific channel by training a good classifier on data from that channel, but how to generalize this is still difficult. And of course, distant microphones.
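The first stage of the cascade above — the energy-based VAD — can be sketched roughly as follows. This is a minimal illustration only: the function name, frame size and threshold are invented for the example, and a real detector would add the tone removal, F0 tracking and phoneme-based stages on top.

```python
import math

def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Label each frame as speech (True) or silence (False) by log energy.

    threshold_db is relative to the loudest frame, so the detector
    adapts to the overall level of the recording.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [10.0 * math.log10(sum(s * s for s in f) / frame_len + 1e-12)
                for f in frames]
    peak = max(energies)
    return [e > peak + threshold_db for e in energies]

# Toy signal: 2 frames of near-silence, 2 frames of a loud tone, 2 of silence.
quiet = [0.001] * 160
loud = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(160)]
decision = energy_vad(quiet + quiet + loud + loud + quiet + quiet)
print(decision)  # → [False, False, True, True, False, False]
```

A relative threshold like this already removes most silence cheaply; the later, more expensive stages then only have to look at the frames that survive.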
Now, language identification. Currently we are able to recognize about fifty languages. And what is even more important: the user can add a new language themselves. This is important especially for the intelligence community, because they will never tell you which languages are of greatest interest, and you won't be able to collect the data anyway — they have much easier access to such data.

We are using i-vector based technology. The first stage extracts a compact language print — less than a kilobyte per audio file. The technology behind it has several stages: feature extraction, then collection of statistics using a universal background model — usually a GMM — and then a projection into a subspace; the subspace is estimated on a large quantity of data to model the variability in the speech, and what we get is a low-dimensional estimate of the position in that subspace. This part is prepared by Phonexia. And then there is the other part, the classifier of languages — we use multi-class logistic regression here — and this is the part that can be trained by the user.
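The user-trainable part — multi-class logistic regression over the extracted vectors — can be sketched like this. It is a toy illustration, not Phonexia's code: the 2-D "i-vectors", the function names and the training settings are invented for the example, and real i-vectors would have hundreds of dimensions.

```python
import math, random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_mclr(xs, ys, n_classes, lr=0.5, epochs=200):
    """Multi-class logistic regression: one weight vector (plus bias) per language."""
    dim = len(xs[0])
    w = [[0.0] * (dim + 1) for _ in range(n_classes)]   # last weight is the bias
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            xb = list(x) + [1.0]
            p = softmax([sum(wi * xi for wi, xi in zip(w[c], xb))
                         for c in range(n_classes)])
            for c in range(n_classes):
                g = (1.0 if c == y else 0.0) - p[c]     # gradient of log-likelihood
                for j in range(dim + 1):
                    w[c][j] += lr * g * xb[j]
    return w

def classify(w, x):
    xb = list(x) + [1.0]
    scores = [sum(wi * xi for wi, xi in zip(wc, xb)) for wc in w]
    return scores.index(max(scores))

# Toy 2-D "i-vectors" for three languages, clustered in different regions.
random.seed(0)
xs, ys = [], []
centers = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0)]
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        xs.append((cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3)))
        ys.append(label)
w = train_mclr(xs, ys, n_classes=3)
print(classify(w, (2.1, -0.1)))  # → 0
```

Because only this small classifier is retrained, a user can add a language with a modest set of labeled recordings, without touching the heavy front-end.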
About speaker recognition. There are many tasks: speaker verification, speaker search, speaker spotting, link analysis, and sometimes social network analysis. We can work in text-independent or text-dependent mode. The approach is again i-vector based, and we use diarization where needed. What I think is important here: we have user-side system training for calibration, and that again helps people a lot. The pipeline is the same as in the case of language identification; the difference is that here we remove the variability that is not speaker variability. We also have a normalization of voiceprints — simply by mean subtraction — that can be done on the user side. And then there is the scoring, where we compare voiceprints with each other. This part is pretrained by Phonexia, but we allow our users to retrain or adapt the classifier. This is very important, because it is hard to get any recordings from clients; but if you deliver such a system to clients and they are able to adapt the system themselves, the amount of data can be really small — it can be, for example, fifty speakers with just a few recordings each. We saw that on normal telephone channels we are able to get about a forty percent improvement for new deployments, and if it is some special channel — for example radio communication — we saw a hundred percent improvement, just with this simple trick.
And of course what is also very important is calibration. In the NIST evaluations we rarely worry about duration, because the recordings are about two and a half minutes long; but if you have huge variability in duration, you need to do something with the short recordings, and we solve it by making the calibration duration-dependent.
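The user-side mean subtraction mentioned above can be illustrated with a toy sketch. The names and numbers here are invented, and the deployed system scores with a trainable classifier rather than plain cosine similarity, but the effect of removing a domain-wide offset is the same in spirit: a channel shift shared by all of the customer's recordings is taken out before comparison.

```python
import math

def mean_subtract(vectors, domain_mean):
    """Shift voiceprints by the mean of in-domain data — a cheap adaptation
    the customer can compute on their own recordings."""
    return [[v - m for v, m in zip(vec, domain_mean)] for vec in vectors]

def cosine_score(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two voiceprints of the same speaker plus an impostor,
# all shifted by a channel offset shared across the domain.
offset = [5.0, 5.0]
enroll = [1.0 + offset[0], 0.0 + offset[1]]
test = [0.9 + offset[0], 0.1 + offset[1]]
other = [0.0 + offset[0], 1.0 + offset[1]]

domain_mean = [sum(v[i] for v in (enroll, test, other)) / 3 for i in range(2)]
enroll_c, test_c, other_c = mean_subtract([enroll, test, other], domain_mean)
print(cosine_score(enroll_c, test_c) > cosine_score(enroll_c, other_c))  # → True
```

Without the subtraction, the large shared offset dominates both comparisons and the scores become nearly indistinguishable; with it, the speaker-specific directions separate cleanly.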
What are the challenges in language identification and speaker identification? I think the main challenge is very short recordings — it can be even less than three seconds. Very important for us is keeping the training and adaptation on the user side. Why less than three seconds? If you have a deployment of speaker identification in a bank, the people don't want to speak just to be verified; they would like to have the decision even before they stop speaking. So I would say that about ten seconds is the maximum that is accepted for enrollment, and you really have about three seconds for the verification. You can do this with text-dependent systems; it is harder with text-independent systems, but in the text-independent case both steps can be done in the background while the user is talking to the operator.

Then there is the question of how to ensure accuracy over a large number of acoustic channels and languages. The technologies are more and more channel- and language-independent, but there is still some dependence. What is also very important are graphical tools that help the user to visualize the information and to do the calibration — because if you don't do this for the user, the user will never do it themselves.

What we see as very challenging is language identification and gender identification over voice-over-IP networks, because there you have packets, and packets get lost; and when there are losses, the codecs are doing something — they either insert zeros or they synthesize something that sounds like speech. But that is not from the speaker; it is something that was generated by the decoder. So that is also a very important topic. And of course, distant microphones again.
Now I would like to say a few words about diarization, because this is a very important technology — useful for example in the call centers, but also for many other users. We are using two approaches. One approach is relatively simple and not so demanding: it is based on clustering of i-vectors — we basically split the audio into small chunks and do clustering of the i-vectors. And then there is a fully Bayesian approach to diarization, in the spirit of what Fabio Castaldo and Patrick Kenny worked on, which does quite well on the research benchmarks. The nice thing about this approach is that you don't make any hard decision during the process — you keep everything probabilistic and make the decision only at the end. This approach is, I would say, more accurate — though as you will see on my next slide, diarization is not fully solved — and the memory consumption, I mean, is quite small.
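The first, simpler approach — clustering the i-vectors of small chunks — can be sketched as plain agglomerative clustering. This is an illustrative toy with invented names and 2-D "i-vectors"; a real system clusters many high-dimensional vectors and often has to decide the number of speakers itself.

```python
import math

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def diarize(chunks, n_speakers):
    """Agglomerative clustering of per-chunk embeddings down to n_speakers
    clusters (average linkage on cosine distance)."""
    clusters = [[i] for i in range(len(chunks))]
    while len(clusters) > n_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(cosine_dist(chunks[a], chunks[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    labels = [0] * len(chunks)
    for lab, cl in enumerate(clusters):
        for idx in cl:
            labels[idx] = lab
    return labels

# Toy "i-vectors" for chunks of a two-speaker call, alternating turns.
chunks = [(1.0, 0.1), (0.9, 0.0), (0.0, 1.0), (0.1, 0.9), (1.1, 0.2), (0.0, 1.1)]
print(diarize(chunks, 2))  # → [0, 0, 1, 1, 0, 1]
```

The sensitivity mentioned below follows directly from this picture: a chunk containing laughter or music produces an embedding far from both speakers and can pull the clustering apart.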
So what are the challenges in diarization? In my point of view diarization is still a technology that needs quite a lot of research. It is very sensitive to initialization, and it is very sensitive to non-speech sounds. Usually the speakers are modeled by mixtures of Gaussians, and sounds that you haven't seen in your training data need to be handled somehow. For example, we asked the system to find two speakers, but in the output both real speakers ended up under one label, and the second label got the segments with other speaker sounds — laughing and so on. So the quality of your VAD is very important: if the input contains sounds other than speech, the adaptation can hardly work. It is also very sensitive to overlapped speech and to the channel.

What we see is that with current systems you can easily reach a diarization error rate close to one percent on some data. But what we also saw is that the remaining errors are not spread evenly: when the system fails, it tends to fail for the whole recording — the segmentation simply breaks down, usually for two speakers with very similar voices. But this happens. So I think there is still a challenge here: a lot was done during the past two years, but the challenge remains. And of course we can speak about distant microphones and the processing of, for example, meetings or video.
Now keyword spotting. We are using two approaches. One approach — something probably all of you know — is LVCSR-based keyword spotting. This is very accurate, but it is slow and expensive for development. The other keyword spotting that we are using is acoustic-based. The difference is that here we use a simple neural-network-based acoustic model, and there is no language model, or only a very simple one — so it is much cheaper for development. In the case of LVCSR we are speaking about hundreds of hours of training data; in the case of acoustic keyword spotting we are speaking about tens of hours of acoustic data, or even less.
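The idea behind acoustic keyword spotting — scoring a keyword's phone sequence directly against frame-level acoustic scores, with no language model — can be sketched with a small dynamic program. This is a hypothetical illustration: the names and the tiny phone set are invented, and a real system would use posteriors from the neural-network acoustic model and normalize scores into confidences.

```python
import math

def keyword_score(posteriors, keyword_phones, phone_index):
    """Best log-probability of the keyword's phone sequence as a monotonic
    path through frame-level phone posteriors (each phone spans >= 1 frame)."""
    k = [phone_index[p] for p in keyword_phones]
    NEG = float("-inf")
    dp = [NEG] * len(k)      # dp[i]: best score with first i+1 phones matched
    best = NEG
    for frame in posteriors:
        lp = [math.log(frame[idx] + 1e-12) for idx in k]
        new = [max(dp[0], 0.0) + lp[0]]            # keyword may start anywhere
        for i in range(1, len(k)):
            new.append(max(dp[i], dp[i - 1]) + lp[i])   # stay in phone or advance
        dp = new
        best = max(best, dp[-1])
    return best

phone_index = {"a": 0, "b": 1, "c": 2}
# Frames dominated by the phones "a b c" in order → keyword "a b c" scores high.
frames = [(0.9, 0.05, 0.05), (0.9, 0.05, 0.05),
          (0.05, 0.9, 0.05), (0.05, 0.05, 0.9)]
good = keyword_score(frames, ["a", "b", "c"], phone_index)
bad = keyword_score(frames, ["c", "b", "a"], phone_index)
print(good > bad)  # → True
```

Nothing here depends on a vocabulary or a language model, which is exactly why this style of spotting is so much cheaper to build than the LVCSR-based one.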
For speech transcription — this is probably not so important to explain, as many of you are working in this field — we are using a system based on bottleneck features in combination with other features, HLDA, VTLN, a GMM-based or neural-network-based system, speaker adaptation, and an n-gram language model, and we usually generate confusion networks.
What are the challenges here from the deployment point of view? Of course the accuracy is still important, but I would say it is not the most important challenge. The challenges are speed, lower memory consumption, how to train new systems automatically and cheaply — because we would like to do it for many domains — how to run hundreds of recognizers in parallel with efficient computation and shared resources, and also how to do feature normalization and speaker adaptation for any length of speech utterance. For example, if we transcribe a long audio source — lectures or whatever — we try to apply as much adaptation as possible. But if you are working with very short segments, like three seconds or less, the adaptation can hardly work, and usually you will see worse results. One solution is to remove the adaptation, but then the system will be less robust to the channel.
Now, how to sell speech transcription. What we found is that if you have speech transcription and you want to sell this technology on its own, it is quite hard. You need to have something that sits on top of the technology and presents the information to the users, because there is too much text and there are still some errors. Our experience is that the user will never be happy about the accuracy of a speech recognition system: if there are errors in words, the users complain about those; if the words are correct, they start complaining about prepositions and suffixes; if those are correct, they start complaining about punctuation marks or grammar. But if you provide some summarization and a good representation of how to look at the data, it will help you to sell the technology. We are doing it in two ways: either integration with existing text-based data-mining tools — the integration is usually done on the level of confusion networks — or we have our own analytic applications.
This is one way to present the output — this one was developed by a partner company. You have a search engine: here you can write a very complex query, here are the documents that were found, and here is the document itself. But you need to somehow write the query, and the query can be very complex, so here is a query editor. That is one possibility, but if you want more, we have also worked with topics: you can use a predefined set of topics, or you can go from the data — you can look at the results, you can look at what the correlation among words is, and you can let the system cluster the data into classes automatically. Or, once you have this, you can manually correct some data and then train statistical classifiers on it. And you can deploy tools, for example, to see how the topics evolve over time on the input.
Now I have two slides about how we transfer the code. What is the time, please? Okay, so I will go through it quickly.

This is how we transfer the code. I think it was in 2007 when we decided to write our own speech core. The reason was that we wanted to have something very stable, very fast, and with proper interfaces. The speech core has more than two thousand five hundred objects covering everything for speech processing; it is quite a lot of source code, and it is still easily maintainable.

How do we transfer from the research? The research is usually done using standard tools — I think you all know these toolkits: STK for HMMs, TNet for neural network training, Kaldi maintained by Dan Povey, and so on. But then we can go to our code base and implement the new system on top of our speech core quickly — in just a few days — often without a single line of C++ being written. Everything is done through a configuration file. The configuration file can look like this: you have some objects, and this description is mapped to the C++ interface — the objects and the functions that set their parameters — and then we have a framework to connect the objects together. If we need a new algorithm, we just go and add one simple object.
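The configuration-driven idea — objects described in a text file, mapped to code interfaces and wired together by the framework — can be mimicked in a few lines. This is a hypothetical Python sketch, not the actual C++ speech core: the component names, the config syntax and the registry are all invented for the example.

```python
# Hypothetical mini-framework: a config file names processing objects and
# their parameters; the engine instantiates them in order and chains them,
# so a new pipeline needs no new code, only a new config.

REGISTRY = {}

def component(name):
    def deco(cls):
        REGISTRY[name] = cls
        return cls
    return deco

@component("Gain")
class Gain:
    def __init__(self, factor):
        self.factor = float(factor)
    def process(self, frames):
        return [x * self.factor for x in frames]

@component("Clip")
class Clip:
    def __init__(self, limit):
        self.limit = float(limit)
    def process(self, frames):
        return [max(-self.limit, min(self.limit, x)) for x in frames]

def build_pipeline(config_text):
    """Each config line: ComponentName key=value ... — build and chain."""
    chain = []
    for line in config_text.strip().splitlines():
        name, *args = line.split()
        kwargs = dict(a.split("=") for a in args)
        chain.append(REGISTRY[name](**kwargs))
    def run(frames):
        for comp in chain:
            frames = comp.process(frames)
        return frames
    return run

pipeline = build_pipeline("""
Gain factor=2.0
Clip limit=1.0
""")
print(pipeline([0.2, 0.6, -0.8]))  # → [0.4, 1.0, -1.0]
```

Adding a new algorithm then really is "one simple object": register a new class and it becomes available to every configuration file.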
About the interfaces: we give the customers what they are used to, including locally specific interfaces. We don't want them to change their habits, so we maintain a large set of interfaces: C++, Java, C#, the MRCP protocol used in IVR systems — there is a nice open-source project for that — a REST interface to build our web-based services, and so on. This is the common framework for web-based solutions: a speech server, an application server, a database server and some clients. And this is just an example of our testing client.
Okay, now I will just summarize, in three slides, some ongoing challenges that I see. A very important challenge is data. As a small company, it is difficult for us to get training data; it is expensive, and the common approach of simply buying it does not scale for us. So we are working on cheaper ways to do it. I think a great inspiration here is Google, and we did something similar in language identification: if we are able to collect the data ourselves, we can use it for training. For language identification, for speech recognition systems deployed on calls — telephone speech — and for broadcast, one possibility that we explored was to use broadcast data: not the whole content, but automatically detecting the phone calls inside the broadcast. This ensures high variability of speakers, dialects and speaking styles; the language can be verified using automatic language identification, so we need only a small amount of data to bootstrap the approach; and the speaker variability can be checked with current speaker identification technology. If you would like to transcribe some of the speech, you can use crowdsourcing, and then there is fully unsupervised training or adaptation.

Currently we are discussing this with several companies, and we would like to form a consortium for it. We already have some experience: when we did a similar project for language identification with LDC, it turned out to be very successful, and it is now more or less mainstream in language identification. We have one more ongoing project for adaptation, backed by another company — a spinoff from BUT — and we believe that we can reduce the cost of developing a new recognizer, both the acoustic and the language models. So if you are interested and would like to learn more, just send me an email and we can discuss it.
The other challenge we see is that although we have quite robust technology, it is still hard to bring concrete numbers to a customer in advance. If we have a deployment in a new country, we never know what the final accuracy of the technology will be, or whether we will need to do adaptation — again, the collection projects that I mentioned on the previous slide would help here. Usually, when we speak about the technologies, we tell the customers that the technology is language-independent and channel-independent, but there is always some risk in a new installation. The only possible way I see to reduce this risk is to evaluate these technologies on many languages, and to know the results in advance, before the technology is deployed. So, for this, again the data collection projects can help, and we are thinking about extending some approaches to cover something like most of the world's spoken languages — because for language identification we already have a collection of about fifty languages, and it could grow rapidly.
And a final remark. What we see is that research focuses mainly on accuracy — most of the research articles describe some improvement in accuracy. But if we speak about the commercial market, I think that any improvement in speed, or anything that can reduce the cost of hardware, can help you to have a successful technology. In some deployments we saw that the hardware cost is really large — it can be fifty percent of the cost of the whole project. So, this is everything from me; thank you for your attention, and if you have any questions, please ask.
Q: How did you do it — did you not have to go to venture capital or something like that?
A: We were considering that approach too, but at the beginning it is hard to get money from venture capitalists. So we started by going directly to customers and negotiating some contracts. We started with a few contracts — basically custom development — earned some money on that custom development, kept developing the technology, and then turned it into products, and so on.
Q: I have a question: are your solutions on-site, or are they based on cloud services?
A: Actually both are possible, because we can deploy the technology on-site, but we also have REST-based interfaces that can be used, for example, for cloud deployments.
Q: But do you have many cloud deployments, or mostly on-premise?
A: No, most of our current deployments are local, on-site deployments. But there is a spinoff, ReplayWell, that uses the technology in the cloud — it does, for example, the recording of lectures, like here — and that is already cloud-based, including automatic processing of the lectures.
Q: More questions? You started off connected with the university — do you still have, say, joint projects, and are there any issues with that?
A: We cooperate with the university in different ways: we hire students, some people are at both Phonexia and the university, we have some contracts, and we have joint projects — with companies and also with the government. So it is sorted out in different ways, and it works.
Q: All right, that's all — thank you.
A: Thank you.