I'm going to be representing the University of Science and Technology of China, the National Engineering Laboratory of Speech and Language Information Processing.
This is a paper by Ma Jin, my Master's student, and some other collaborators. We asked him to build his own CNN, which he did, and then we asked him to try using it for something, which he did. So what I'm going to do is present what came out when he tried that.
We've got four stages: an introduction to how this works for language ID and the structure we're proposing, the proposed method itself, some experiments and analysis, and then maybe we'll end with a bit of thought on some future work.
Well, the first thing to ask is: what is language identification? It's just the task of taking a piece of speech and extracting language identification information from it. That information comes at different levels, as we know, and we can say it's acoustic information or phonetic information; we try to disassociate it from the characteristics of the speaker, as we'll see in a little while, and the field has shown a tendency to follow speaker recognition.
The state of the art, well, probably, and maybe this will change shortly, I don't know, but the state of the art is really GMM i-vectors. We've seen great gains there, but everybody is, you know, trying to find what's next.
Deep learning in particular allows us to take some of the advantages of supervised training: being able to extract discriminative information out of the data that we have. Especially when we have small amounts of training data, we can use transfer learning methods to train something which may well be discriminative on a related task of inferring language ID.
Some of these we've seen recently. There's the bottleneck network based i-vector representation by Yan Song and collaborators; I think that was last year at Interspeech, and there's also a poster yesterday, which you may have missed; the paper should be in the proceedings. We've seen deep CNN based neural network approaches here doing great things; that's in Transactions on ASLP. Then there are some approaches which are end-to-end methods, and we can look at some of the state of the art that's flowed through that: deep neural networks here, and that was, I guess, long short-term memory RNNs here, also at Interspeech.
So this is really extracting at a frame level and gathering sufficient statistics over an utterance in order to pull out language-specific identifiers. There's a recent approach using a convolutional neural network to attack short utterances, using the power of a CNN to pull out the information from these short utterances, and it seems to get over some of the problems in terms of utterance length.
We have a different method. We also think that using, say, MFCCs with a large context may be introducing too much information that a CNN or a DNN then has to remove. So we would be using some of our precious training data to remove information that, if we had a magic wand, probably shouldn't have been included in the first place in terms of input features. So what we're doing is slightly different.
We use a convolutional neural network, but we're not using the CNN to extract frame-level information per se. What we're actually doing, in this very wide, long, end-to-end type system, is starting off with PLP input features and a bottleneck DNN, just a standard bottleneck network. We take the bottleneck features here, add what could be quite a lot of context to them, and then feed that into a CNN, here with three layers, and finally a fully connected output. What we're getting is a language label directly at the output from this.
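To make that concrete, here is a minimal sketch of the pipeline in PyTorch; the dimensions, activations and kernel sizes are illustrative assumptions rather than the exact configuration from the paper, and input splicing for the DNN is omitted for brevity.

```python
import torch
import torch.nn as nn

class LidNet(nn.Module):
    """Sketch: frozen bottleneck DNN -> context stacking -> CNN
    -> pyramid pooling over time -> fully connected language output."""
    def __init__(self, feat_dim=48, bn_dim=50, context=21, n_langs=6):
        super().__init__()
        self.context = context
        # First half of a senone-trained DNN, up to the bottleneck (kept frozen).
        self.bottleneck = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim))
        for p in self.bottleneck.parameters():
            p.requires_grad = False
        # CNN over the time axis of context-stacked bottleneck features.
        self.cnn = nn.Sequential(
            nn.Conv1d(bn_dim * context, 1024, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(1024, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU())
        # Pyramid of adaptive max-pools -> fixed-length utterance vector.
        self.pyramid = nn.ModuleList(nn.AdaptiveMaxPool1d(b) for b in (1, 2, 4))
        self.fc = nn.Linear(256 * (1 + 2 + 4), n_langs)

    def forward(self, frames):                 # frames: (batch, time, feat_dim)
        bn = self.bottleneck(frames)           # (batch, time, bn_dim)
        ctx = bn.unfold(1, self.context, 1)    # (batch, T', bn_dim, context)
        ctx = ctx.flatten(2).transpose(1, 2)   # (batch, bn_dim*context, T')
        h = self.cnn(ctx)                      # (batch, 256, T')
        pooled = torch.cat([p(h).flatten(1) for p in self.pyramid], dim=1)
        return self.fc(pooled)                 # language scores, any input length
```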
So you can see why this is sort of attractive in terms of a system-level implementation, but to me it's kind of counterintuitive, because we tend to use CNNs to extract front-end information, I mean, in the related tasks that we've been trying; they've tended to work well for that. We did try things like stacks of MFCCs as input features to a CNN directly, and, well, maybe somebody else can do better than us, but it doesn't seem to work that well. So what we did was have a DNN followed by a CNN, and see how that works.
In a nutshell, the DNN transforms acoustic features into a compact representation. We do that frame by frame, add a context of multiple frames to the bottleneck features, feed that context into the CNN, and come out with something which should be discriminative in terms of language.
OK, so this is what we call the LID features. We think that the general acoustic features at the input, like I said, contain too much information, so we're trying to reduce the information burden on the trained system that follows.
Given the limited amount of training data, we don't really want to waste it. We know that we can have a deep neural network trained on senones: the end of it carries phonetic information, the beginning of it acoustic information, and somewhere in the middle of that network is a transformation, effectively from the acoustic to the phonetic. We take the bottleneck features, which we hope are a compact representation of the relevant information. I'm not sure that's entirely true, because there are plenty of approaches that take information from both the centre and the end of the DNN and seem to work well, especially with fusion.
Anyway, what we're doing that is kind of different is using spatial pyramid pooling at the output of the CNN. This allows us to take the front-end information and span it to the utterance level, which provides us with an utterance-length-invariant, fixed-dimension vector at this point.
So it can deal with arbitrary input sizes. The spatial pyramid pooling method is from the paper by Kaiming He; that's ECCV, computer vision, 2014, and it's designed to solve the problem of making the feature dimension invariant to the input size. This is a problem we face often, and it's a problem certain areas of image processing also face.
I think what's happened is we've got a kind of feedback loop, where the speech technology goes into image processing and then comes back to the speech field, and it cycles around.
So this is really inspired by a bag-of-words approach, and it comes through into spatial pyramid pooling, which uses a power-of-two stack of max-pooled features. OK, so it changes resolution by powers of two, and we can control quite finely how many features we want at the output. It's attractive in that it works well; the details are in the paper.
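As a minimal sketch of that idea applied along the time axis (a 1-D analogue of He et al.'s spatial pyramid pooling; the bin counts here are my assumption):

```python
import torch

def temporal_pyramid_pool(feats, levels=(1, 2, 4)):
    """Max-pool a variable-length (channels, time) feature map at several
    power-of-two resolutions and concatenate, giving a fixed-length vector."""
    channels, T = feats.shape
    pooled = []
    for bins in levels:                       # e.g. 1, 2, 4 bins across time
        for b in range(bins):
            lo = (b * T) // bins              # bin boundaries cover all frames
            hi = max(((b + 1) * T) // bins, lo + 1)
            pooled.append(feats[:, lo:hi].max(dim=1).values)
    return torch.cat(pooled)                  # length = channels * sum(levels)

# Any utterance length maps to the same output size:
short = temporal_pyramid_pool(torch.randn(256, 37))
long_ = temporal_pyramid_pool(torch.randn(256, 512))
assert short.shape == long_.shape             # (256 * 7,) in both cases
```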
So how do we put all this stuff together? Well, as shown in the diagram on the right here, what we're doing is taking a six-layer DNN which is trained with large-scale Switchboard data, taking the half of the network up to the bottleneck layer, and feeding that into a system that is then trained for language ID using LID training data.
Now, we found that if we take that information and feed it directly into a full CNN, given the training data that we're using, it will not converge to anything sensible, if at all; it just doesn't work. So what we have to do is build the network, i.e. the CNN, layer by layer. The DNN is already trained; that's fixed, that's fine. Then you start to build the CNN by training first one convolutional layer, then the second, then the third, each one topped with spatial pyramid pooling and a fully connected layer at the output to give us the direct language labels.
And that actually works. We can see, when we look at the results layer by layer, how the accuracy improves with the number of layers and with the size of the layers.
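Here is a rough sketch of that greedy, layer-by-layer recipe; `train_fn`, the layer widths and the pooling head are placeholders of mine, not the authors' exact setup.

```python
import torch.nn as nn

def grow_lid_cnn(train_fn, bn_loader, n_langs=6, widths=(1024, 256, 128)):
    """Grow a CNN on top of fixed bottleneck features one conv layer at a
    time; after each new layer a fresh pooling + output head is attached
    and only the new parts are trained, earlier layers stay frozen."""
    convs, in_ch = [], 50 * 21            # bottleneck dim x context frames
    for width in widths:
        new_conv = nn.Conv1d(in_ch, width, kernel_size=3, padding=1)
        head = nn.Sequential(nn.AdaptiveMaxPool1d(1), nn.Flatten(),
                             nn.Linear(width, n_langs))
        model = nn.Sequential(*convs, new_conv, nn.ReLU(), head)
        # Only the newest conv layer and the head receive gradients.
        train_fn(model, bn_loader,
                 trainable=list(new_conv.parameters()) + list(head.parameters()))
        for p in new_conv.parameters():
            p.requires_grad = False       # freeze before adding the next layer
        convs += [new_conv, nn.ReLU()]
        in_ch = width
    return nn.Sequential(*convs)
```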
It's quite interesting to see. The DNN is pretty standard: 48 features, that's 15 PLPs plus delta and delta-delta, plus pitch, with a context size of 21 frames; hidden layers of 1024, 1024, 50 (the bottleneck), 1024 and 1024; and 3020 senones at the output. We'll look at the structure of the CNN in a little while.
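For concreteness, that configuration as a sketch (layer order as I understood it; the sigmoid activations are an assumption):

```python
import torch.nn as nn

# 48 base features (15 PLPs + deltas + delta-deltas, plus pitch) spliced over
# a 21-frame context; hidden layers 1024-1024-50-1024-1024; 3020 senone outputs.
senone_dnn = nn.Sequential(
    nn.Linear(48 * 21, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 50), nn.Sigmoid(),   # 50-dim bottleneck: LID features tap here
    nn.Linear(50, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 3020),               # senone targets (softmax applied in loss)
)
```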
It is worth mentioning at this point, because it's a problem, that we create separate networks for the 30-second, 10-second and 3-second tasks. We would like to combine these, and we're trying to, but for the moment they're separately trained.
Our baselines are a GMM i-vector system and a bottleneck DNN i-vector system with LDA and WCCN, pretty much as we published previously.
So let's look at how this works and try to visualise some of these layers. What we have here is the post-pooling, fully connected layer information. Note this diagram comes from the paper: what we've done is take the activations over some utterances and compare different languages, just visually.
What we've done is take 35 randomly selected features from that stack, plotted here for two languages. On the left it's Dari; on the right it's Farsi, which I'm told are very similar languages. The top and the bottom are different segments from utterances. So comparing top and bottom we're looking at intra-language difference, and comparing left and right we're looking at inter-language difference: top and bottom is intra, left and right is inter.
We would hope to see large variability between languages and small variability within languages, and that's what we get. It gives us visual evidence to think that these statistics might well be discriminative for languages.
Moving along a bit further, down here what we're getting is frame-level information, and we like to call these LID senones; maybe that's not the best terminology, but just to explain how we get to that sort of conclusion: if we look at this information, and notice the scales on some of the plots, we see LID senones coming out of the system at frame level, with context, for a piece of speech here, another piece of speech there, a transition region between two parts of speech here, and a non-speech region just here.
What we tend to see when we visualise this is different LID senones activating and deactivating as we go through an utterance or between utterances, and we believe there is language discrimination information in that. If you look at the y-axis scale, you can see that when there's a non-speech region, around here, we get all sorts of things activating, but the amplitude of activation is quite low. This gives evidence to the idea that we probably have something which is language-specific, at least.
We also do something we call hybrid temporal evaluation. We split the 30-second, 10-second and 3-second data across separate networks and train them independently. We don't do quite the same degree of augmentation as others, but we do try to augment by cutting the 30-second speech into 10-second and 3-second regions. What we're doing there is trying to make up for the fact that the 3-second information is woefully inadequate in terms of statistics; we'll see later how that works out in terms of the performance of each.
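A minimal sketch of that cutting-based augmentation (the frame rate and the `load_lid_training_frames` loader are hypothetical assumptions of mine):

```python
def cut_segments(frames, seg_len):
    """Split one utterance's frames into non-overlapping seg_len chunks,
    dropping any short remainder."""
    return [frames[i:i + seg_len]
            for i in range(0, len(frames) - seg_len + 1, seg_len)]

FPS = 100                                    # frames per second at a 10 ms shift
utterances = load_lid_training_frames()      # hypothetical loader
train_30s = [u for u in utterances if len(u) >= 30 * FPS]
train_10s = [c for u in train_30s for c in cut_segments(u, 10 * FPS)]
train_3s  = [c for u in train_30s for c in cut_segments(u, 3 * FPS)]
```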
Unfortunately we only have data here from NIST LRE 2009, and for that we use only the six most confusable languages. It's a subset, a much quicker subset to do analysis on and to run experiments on, and if you look at our papers over the last few years, we tend to publish with these six languages first and then extend later; it seems worthwhile. It's about 150 hours of training data, Voice of America radio broadcasts and CTS, conversational telephone speech, and we split it up into the three different durations. We're looking at two baseline systems and our proposed network; more on the fusion later, everybody wants to do fusion at the end.
So let's look at the ways this structure can be adapted, because there are so many different parameters we could change in here. The first thing we wanted to do was look at the size of the context at the output of the DNN layers, and we're changing, if you can make it out just here, lower-case n. So what we're doing is keeping the same bottleneck network but stacking more of its output frames. We can see from the results for 30 seconds, 10 seconds and 3 seconds, in EER, that the bigger the context, in general, the better the results. Now bear in mind that we already have some context at the input here; that's also got context, 21 frames to be precise, so we're adding more context at this end and we're seeing benefit. It turns out that for the 10-second and 3-second tasks a context of 21, just here, tends to work better; for the 30-second task an even longer context is much better, probably because the data is longer. I think the problem is that the 3-second and 10-second data tends to saturate: we just cannot physically get enough information out of that data, no matter how much context we introduce.
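What varying that lower-case n amounts to is splicing n consecutive bottleneck frames into each CNN input column; a small numpy sketch (names and dimensions mine):

```python
import numpy as np

def splice(bn_feats, n=21):
    """Stack a sliding window of n bottleneck frames (T, d) -> (T-n+1, n*d),
    so each CNN input column sees n frames of context."""
    T, d = bn_feats.shape
    idx = np.arange(n)[None, :] + np.arange(T - n + 1)[:, None]  # (T-n+1, n)
    return bn_feats[idx].reshape(T - n + 1, n * d)

bn = np.random.randn(300, 50)      # 3 s of 50-dim bottleneck features at 10 ms
print(splice(bn, n=9).shape)       # (292, 450): less context
print(splice(bn, n=21).shape)      # (280, 1050): more context
```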
Moving on a little bit further, we can also experiment with how deep and how wide the CNN is. We do that down here with basically three different experiments. One is the LID net with a 1024-unit convolutional input layer, a single layer, feeding into the spatial pyramid pooling and the fully connected system; we train that system up and get about 9% to 16% EER on the three different durations. If we add another layer, so we have two convolutional layers, we bring that down by a reasonable amount for the 3 seconds, not quite so much for the 30 seconds; we're looking at 128, 256 or 512 as the size of that second layer in the CNN. For a third layer we checked 64 and 128, and we can see that basically, with increasing complexity, the results tend to improve, less so for the 30 seconds, more for the others.
For the hybrid temporal evaluation, what we're actually doing is using the 30-second network to evaluate 30-second data, the 10-second network to evaluate 30-second and 10-second data, and the 3-second network to evaluate everything. Unsurprisingly, the 3-second network performs best on the 3-second data and the 10-second network on the 10-second data, but for the 30-second data it's actually better to use the 10-second network than the 30-second one. This suggests that perhaps these networks are capturing different scales of information, so we fuse them together to get the results at the bottom, and we get a slight improvement there. But you will notice that we can only improve on the baseline system for the 30-second result.
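The fusion row at the bottom is score-level; a minimal sketch with equal weights (random placeholder matrices stand in for the three networks' outputs; real systems typically learn the weights, e.g. by logistic regression):

```python
import numpy as np

def fuse(score_mats, weights=None):
    """Weighted sum of per-network language score matrices (n_utts, n_langs)."""
    weights = weights or [1.0 / len(score_mats)] * len(score_mats)
    return sum(w * s for w, s in zip(weights, score_mats))

rng = np.random.default_rng(0)               # placeholders: 500 trials x 6 languages
scores_30, scores_10, scores_3 = (rng.normal(size=(500, 6)) for _ in range(3))
fused = fuse([scores_30, scores_10, scores_3])
predicted_language = fused.argmax(axis=1)    # fused decision per trial
```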
One more thing before we conclude: the i-vector system uses both zeroth and first order statistics, but this system effectively only uses zeroth order statistics.
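To pin down that distinction, the standard Baum-Welch statistics an i-vector front end collects are (textbook notation, not from this paper):

\[
N_c = \sum_t \gamma_t(c), \qquad F_c = \sum_t \gamma_t(c)\,x_t,
\]

where \(x_t\) is the frame-\(t\) feature vector and \(\gamma_t(c)\) the posterior of mixture component \(c\): \(N_c\) are the zeroth-order counts, \(F_c\) the first-order sums. Pooling that only registers which LID senones fire is analogous to keeping the \(N_c\) and discarding the \(F_c\).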
So pretty much top of the list of what we'll be looking at is whether we can incorporate more statistics, and whether we can build a comprehensive network that uses all scales and handles all scales simultaneously. So that's it: a weird and wonderful DNN-CNN hybrid. Thank you.
We have time for questions.
Thanks very much, that was very interesting. As far as I understood, you did some incremental training, so once you've trained one part of the network and then you extend the network, do the parameters of the first part stay fixed? You don't adapt them?
Yes, they're fixed; we fix that and then we build on it. Again, this is what happens when you ask a student to try different things: I probably wouldn't have done this myself, but it tends to work quite well.
So the rest is fixed? You mean the network is trained and you just change the last layer, and then do you retrain the whole system?
No, we don't retrain the whole system; we focus on the backend and we just train the last layer.
Thank you.
I think we have another question; we've got lots of time. You spoke a lot about the information flow through the neural network. If you've read some of Geoff Hinton's work on neural networks, he will tell you again and again that there is more information, in our case, in the speech than in the labels, so he advocates the use of generative models rather than discriminative ones, and as far as I can see yours is purely discriminative. I'd just like to hear any thoughts you have on that matter.
Actually it's interesting that you bring that up, because I was looking at some of the comments he was making recently. He was talking about the benefits of having a two-stage process, where we have one front end which is very good at picking out the most useful data from a large-scale dataset, and then a backend which is very good at using it, and that these two tasks are complementary: it's seldom that we can use one system that excels at both. He believes that both of them can be trained, and we seem to have done that, though the opposite way around to the way I would have imagined.
OK, thank you very much.