Okay, welcome to the morning session on acoustic modeling. We'll start off with a talk by Geoff Zweig. Well, let me introduce him. Actually, I'm really happy to introduce someone I have known since he was this high, but who has grown a lot since he was a graduate student. Anyway, this will be followed by the poster session on acoustic models. He was a graduate student at Berkeley, where he really did an amazing job and was already interested in trying all kinds of different models, which I've always liked. He went on to IBM to work, and he continued to work on graphical models, not only working on the theory but also working on implementations. And he got sucked into a lot of DARPA meetings, as have many of us. He moved on from there to Microsoft, where he's been since 2006. So he's well known in the field now, both for principled developments and also for implementations that have been useful to the community. So I'm happy to have Geoff here to give a talk on the rather interesting idea of segmental conditional random fields.
Thank you very much.
Okay, so I'd like to start today with a very high-level description of what the theme of the talk is going to be. I tried to put a little bit of thought in advance into what would be a good pictorial metaphor for what the talk would be about, and also something that is fitting to the beautiful location that we're in today.
When I did that, I decided the best thing I could come up with was this picture that you see here of a nineteenth-century clipper ship. These are very interesting things; they were basically the space shuttle of their day. They were designed to go absolutely as fast as possible, making trips to, say, London or Boston.
When you look at the ship, you see that they put a huge amount of thought and engineering into its design. In particular, if you look at those sails, they didn't just build a ship and then put one big square sail up on top of it. Instead, what they did was try in many ways to harness every aspect, every facet of the wind that they possibly could. So they have sails positioned in all different ways: some rectangular sails, some triangular sails, and that funny sail that you see there back at the end. The idea is to really pull out absolutely all the energy that you can get from the wind and use it to drive the ship forward.
That relates to what I'm talking about today, which is speech recognition systems that in a similar way harness together a large number of information sources to try to drive the speech recognizer forward in a faster and better way. This is going to lead to a discussion of log-linear models, segmental models, and then their synthesis in the form of segmental conditional random fields.
Here's an outline of the talk. I'll start with some motivation for the work. I'll go into the mathematical details of segmental conditional random fields, starting with hidden Markov models and then progressing through a sequence of models that lead to the SCRF. I'll talk about a specific implementation that my colleague Patrick Nguyen and I put together, the SCARF toolkit. I'll talk about the language modeling that's implemented there, which is sort of interesting, the inputs to the system, and then the features that it generates from them. I'll present some experimental results and research challenges, and a few concluding remarks.
Okay, so the motivation for this work is that state-of-the-art speech recognizers really look at speech frame by frame. We extract speech frames every ten milliseconds, we extract features, usually one kind of feature, for example PLPs or MFCCs, and we send those features into a time-synchronous recognizer that processes them and outputs words. I'm going to be the last person in the room to underestimate the power of that basic model and how good the performance is that you can get from working with that kind of model and doing a good job in terms of the basics. So a very good question to ask is how to improve that model in some way.
But that is not the question that I'm going to ask today. Instead I'm going to ask a different question, or I should say I will re-ask a question, because this is something that a number of people have looked at in the past. That is whether or not we could do better with a more general model. In particular, the questions I'd like to look into are whether we can move from a frame-wise analysis to a segmental analysis, from the use of real-valued feature vectors such as MFCCs and PLPs to more arbitrary feature functions, and whether we can design a system around the synthesis of disparate information sources.
What's going to be new in this is doing it in the context of log-linear modeling, and it's going to lead us to a model like the one that you see at the bottom of the picture here. In this model we have basically a two-layer model. At the top layer we are going to end up with states; these are going to be segmental states, prototypically representing words. At the bottom layer we'll have a sequence of observation streams, many observation streams, and these each provide some information. They can be many different kinds of information sources, for example the detection of a phoneme, the detection of a syllable, the detection of an energy burst, a template match score: all kinds of different information coming in through these multiple observation streams. Because they're general, like detections, they're not necessarily frame-synchronous, and you can have variable numbers of them in a fixed span of time across the different streams. We'll have a log-linear model that relates the states that we're hypothesizing to the observations that are hanging down below each state, blocked into words.
Okay, so I'd like to move on now and discuss SCRFs mathematically, starting first from hidden Markov models. Here's a depiction of a hidden Markov model; I think we're all familiar with this. The key thing that we're getting here is an estimate of the probability of the state sequence given an observation sequence. In this model, states usually represent context-dependent phones or sub-states of context-dependent phones, and the observations are most frequently spectral representations such as MFCCs or PLPs. The probability is given by the expression that you see there, where we go frame by frame and multiply in transition probabilities, the probability of a state at one time given the previous state, and then observation probabilities, the probability of an observation at a given time given that state. Those observation probabilities are most frequently Gaussians on MFCC or PLP features, whereas in hybrid systems you can also use neural net posteriors as input to the likelihood computation.
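The expression on the slide isn't reproduced in the transcript, but it is presumably the standard HMM decomposition, with transition and observation terms multiplied frame by frame:

```latex
P(q_1^T \mid o_1^T) \;\propto\; \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, P(o_t \mid q_t)
```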
Okay, so I think the first big conceptual step away from the hidden Markov model is maximum entropy Markov models. These were first investigated by Ratnaparkhi in the mid-nineties in the context of part-of-speech tagging for natural language processing, then generalized and formalized by McCallum and his colleagues in 2000, and then there was a seminal application of these to speech recognition by Jeff Kuo and Yuqing Gao in the mid-2000s.
The idea behind these models is to ask the question: what if we don't condition the observation on the state, but instead condition the state on the observation? If you look at the graph here, what's happened is that the arrow, instead of going down, is going up, and we're conditioning a state at a given time on the previous state and the current observation. States are still context-dependent phone states as they were before, but what we're going to get out of this whole operation is the ability to have potentially much richer observations than MFCCs down here. The probability of the state sequence given the observations in an MEMM is given by this expression here, where we go time frame by time frame and compute the probability of the current state given the previous state and the current observation.
How do we do that? The key is to use a small maximum entropy model and apply it at every time frame. What this maximum entropy model does is compute some feature functions that relate the state at the previous time to the state at the current time and the observation at the current time. Those feature functions can be arbitrary functions: they can return a real number or a binary number, and they can do an arbitrary computation. They get weighted by lambdas, which are the parameters of the model, summed over all the different kinds of features that you have, and then exponentiated. That is normalized by the sum, over all possible ways that you could assign values to the state there, of the same sort of expression.
This is doing two things: first, it's going to let us have arbitrary feature functions rather than, say, Gaussian mixtures, and second, it's inherently discriminative in that it has this normalization factor here.
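The local maximum entropy model just described is presumably of the standard MEMM form, a weighted sum of feature functions that is exponentiated and locally normalized, with the state-sequence probability given by the product of the local terms:

```latex
P(q_t \mid q_{t-1}, o_t) \;=\;
\frac{\exp\Big(\sum_k \lambda_k\, f_k(q_{t-1}, q_t, o_t)\Big)}
     {\sum_{q'} \exp\Big(\sum_k \lambda_k\, f_k(q_{t-1}, q', o_t)\Big)},
\qquad
P(q_1^T \mid o_1^T) \;=\; \prod_{t=1}^{T} P(q_t \mid q_{t-1}, o_t)
```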
I'm going to talk a lot about features, so I want to make sure that we're on the same page in terms of what exactly I mean by features and feature functions. Features, by the way, are distinct from observations: observations are things you actually see, and features are numbers that you compute using those observations as input. A nice way of thinking about the features is as a product of a state component and a linguistic component, I'm sorry, a state component and an acoustic component. I've illustrated a few possible state functions and acoustic functions in this table, along with the kinds of features that you extract from them.
One very simple state function is to ask the question: is the current state y? What's the current phone, or what's the current context-dependent phone, what's the value of that? And we just use a constant for the acoustic function. You multiply those together and you have a binary feature: the state either is this thing y, or it's not, zero or one. The weight that you learn on that is essentially a prior on that particular context-dependent state. A full transition function would be: the previous state was x and the current state is y, the previous phone was such-and-so and the current phone is such-and-so. We don't pay attention to the acoustics, we just use one, and that gives us a binary function that describes the transition.
We get somewhat more interesting features when we start actually using the acoustic function. One example of that is to say: the state function is "the current state is such-and-so", and, by the way, when I take my observation and plug it into my voicing detector, it comes out either yes it's voiced or no it's not voiced, and I get a binary feature when I multiply those two together. Yet another example: the state is such-and-so, and I happen to have a Gaussian mixture model for every state, and when I plug the observation into the Gaussian mixture model for that state I get a score, and I multiply the score by the indicator that I'm seeing that state, and that gives me a real-valued feature function. And so forth; you can get fairly sophisticated feature functions. This one down here, by the way, is the one that Kuo and Gao used in their MEMM work, where they looked at the rank of the Gaussian mixture model associated with a particular state compared to all the other states in the system.
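As a minimal sketch of these state-times-acoustic feature functions (the voicing detector and GMM scorer here are hypothetical placeholders, not the actual systems from the talk):

```python
# Sketch of feature functions built as a product of a state component and an
# acoustic component.  The detector and scorer arguments are placeholders.

def prior_feature(state, _obs, target_state):
    # state indicator times a constant acoustic function of 1: acts as a prior
    return 1.0 if state == target_state else 0.0

def voicing_feature(state, obs, target_state, is_voiced):
    # binary feature: the state is target_state AND the frame is judged voiced
    return 1.0 if state == target_state and is_voiced(obs) else 0.0

def gmm_feature(state, obs, target_state, gmm_score):
    # real-valued feature: a GMM score for the frame, gated by the state indicator
    return gmm_score(obs) if state == target_state else 0.0
```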
Let's move on to the conditional random field now. It turns out that under certain pathological conditions, if you use MEMMs, you can make a decision early on, and the transition structure just so happens to be set up in such a way that you would ignore the observations for the rest of the utterance, and you run into a problem. I think these are pathological conditions, but they can theoretically exist, and that motivated the development of conditional random fields, where rather than doing a bunch of local normalizations, making a bunch of local state-wise decisions, there's one global normalization over all possible state sequences. Because there is a global normalization, it doesn't make sense to have arrows in the picture: the arrows indicate where you're going to do the local normalization, and we're not doing a local normalization.
So the picture is this: the states are as with the maximum entropy model, the observations are as with the maximum entropy model, and the feature functions are as with the maximum entropy model. The thing that's different is that when you compute the probability of the state sequence given the observations, you normalize not locally, but once, globally, over all the possible ways that you can assign values to that state sequence.
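In other words, the CRF version presumably looks like this, with a single global normalization over all state sequences rather than a per-frame one:

```latex
P(q_1^T \mid o_1^T) \;=\;
\frac{\exp\Big(\sum_{t}\sum_k \lambda_k\, f_k(q_{t-1}, q_t, o_1^T)\Big)}
     {\sum_{q'_1 \ldots q'_T} \exp\Big(\sum_{t}\sum_k \lambda_k\, f_k(q'_{t-1}, q'_t, o_1^T)\Big)}
```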
That brings me now to the segmental version of the CRF, which is the main point of this talk. The key difference between the segmental version of the CRF and the previous version is that we're going to take the observations and we're now going to block them into groups that correspond to segments, and we're actually going to make those segments be the words. Conceptually they could be any kind of segment, they could be phone segments or syllable segments, but for the rest of this talk I'm going to refer to them as words. For each word we're going to block together a bunch of observations and associate them concretely with that state. Those observations, again, can be more general than MFCCs; for example, they could be phoneme detections or the detection of articulatory features.
There's some complexity that comes with this model, because even when we do training, where we know how many words there are, we don't know what the segmentation is, and so we have to consider all possible segmentations of the observations into the right number of words. In this picture here, for example, we have to consider segmenting the seven observations not just as two, two, and three, but maybe moving this one over here and having three associated with the first word, only one associated with the second word, and then three with the last. When you do decoding you don't even know how many words there are, so you have to consider both all possible numbers of segments and all possible segmentations given that number of segments.
This leads to an expression for segmental CRFs that you see here. It's written in terms of the edges that exist in the top layer of the graph. Each edge has a state to its left and a state to its right, and it has a group of observations that are linked together underneath it, and the segmentation is denoted by q. With that notation, the probability of a state sequence given the observations is given by the expression you see there, which is essentially the same as the expression for the regular CRF, except that now we have a sum over segmentations that are consistent with the number of segments that are hypothesized, or known during training.
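The expression on the slide is presumably along these lines, following the notation just described: edges e with a left state and a right state, a block of observations o_e under each edge, and segmentations q constrained to have as many segments as there are states:

```latex
P(s \mid o) \;=\;
\frac{\displaystyle\sum_{q:\,|q|=|s|}\ \prod_{e \in q}
      \exp\Big(\sum_k \lambda_k\, f_k\big(s^{\mathrm{left}}_e, s^{\mathrm{right}}_e, o_e\big)\Big)}
     {\displaystyle\sum_{s'}\ \sum_{q':\,|q'|=|s'|}\ \prod_{e \in q'}
      \exp\Big(\sum_k \lambda_k\, f_k\big(s'^{\mathrm{left}}_e, s'^{\mathrm{right}}_e, o_e\big)\Big)}
```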
Okay, so that was a lot of work to go through to introduce segmental features. Do we really need to introduce segmental features at all? Do we get anything from that? Because after all, with the CRF the state sequence is conditioned on the observations, and we've got the observations sitting there in front of us. Isn't that enough? Is there anything else you need? I think the answer is clearly yes, you do need to have boundaries, or at least you get more if you talk about concrete segment boundaries.
Here are a few examples of that. Suppose you want to use template match scores as feature functions: you have a segment, and you ask the question, what's the DTW distance between this segment and the closest example of the word that I'm hypothesizing in some database that I have? To do that you need to know where you start the alignment and where you end the alignment, so you need the boundaries; you get something from that which you don't have when you just say, here's a big blob of observations. Similarly, word durations: if you want to talk about a word duration model, you have to be precise about when the word starts and when the word ends, so that the duration is defined. It also turns out to be useful to have boundaries if you're incorporating scores from other models. Two examples of that are the HMM likelihoods and Fisher kernel scores that Layton and Gales have used, and the point process model scores that Jansen and Niyogi have proposed. Later in the talk I'll talk about detection subsequences as features, and there again we need to know the boundaries.
Okay, so before proceeding I'd like to emphasize that this is really building on a long tradition of work, and I want to go over and call out some of the components of that tradition. The first is log-linear models that use a frame-level Markov assumption. There, I think the key work was done by Jeff Kuo and Yuqing Gao with the maximum entropy Markov model; theirs really was the first to propose and exercise the power of using general feature functions. More or less simultaneously with that, hidden CRFs were proposed by Gunawardana and his colleagues, and then there was a very interesting paper at last year's ASRU where essentially an extra hidden variable is introduced into the CRF to represent Gaussian mixture components, with the intention of simulating MMI training in a conventional system. Jeremy Morris and Eric Fosler-Lussier did some fascinating initial work on applying CRFs to speech recognition. They used features such as neural net attribute posteriors, in particular the detection of sonority, voicing, manner of articulation and so forth, as feature functions that went into the model. They also proposed and experimented with the use of MLP phoneme posteriors as features, and proposed the use of something called the Crandem model, which is essentially a hybrid CRF/HMM model where CRF phone posteriors are used as state likelihood functions rather than the neural net posteriors of the standard hybrid system.
The second tradition I'd like to call out is the tradition of segmental log-linear models. The first use of these was termed semi-CRFs, by Sarawagi and Cohen, developed in natural language processing. Layton and Gales proposed something termed the conditional augmented statistical model, which is a segmental CRF that uses HMM scores and Fisher kernel scores. Zhang and Gales proposed the use of structured SVMs, which are essentially segmental CRFs with large-margin training. Lehr and Shafran have an interesting transducer representation that uses perceptron training and similarly achieves joint acoustic, language, and duration model training. And finally, Georg Heigold, Patrick Nguyen, and I have done a lot of work on flat direct models, which are essentially whole-sentence maximum entropy acoustic models, maxent models at the segment level, and you can think of the segmental models I'm talking about today as essentially stringing together a whole bunch of flat direct models, one for each segment.
It's also important to realize that there's significant previous work on classical segmental modeling and on detector-based ASR. The segmental modeling, I think, comes in two main threads. In one of these, the likelihoods are based on framewise computations, so you have a different number of scores contributing to each segment, and there's a long line of work here by Mari Ostendorf and her students and a number of other researchers that you see listed. Then, in a separate thread, there's the development of using a fixed-length segment representation for each segment, which Mari Ostendorf and colleagues looked at in the late nineties, and which Jim Glass more recently has worked on, contributing the use of phone likelihoods in the computation in a way that I think is similar to the normalization in the SCRF framework. I'm also going to talk about using detections, phone detections and multi-phone detections, and that relates to the work that I think Chin-Hui Lee and his colleagues did in their proposal of detector-based ASR, which combines detector information in a bottom-up way to do speech recognition.
Okay, so I'm going to move on now to the SCARF implementation, a specific implementation of a segmental CRF. What this is going to do is essentially extend the tradition that I've mentioned, and it's going to extend it with the synthesis of detector-based recognition, segmental modeling, and log-linear modeling. It's going to further develop some new features that weren't present before, in particular features termed existence, expectation, and Levenshtein features, and then it's going to extend that tradition with an adaptation to large-vocabulary speech recognition, by fusing finite-state language modeling into the segmental framework that I've been talking about.
Okay, so let's move on to that specific implementation. This is a toolkit that I developed with Patrick Nguyen. It's available from the web page that you see there; you can download it and play around with it. The features that I talk about next are specific to this implementation, and they're one way of realizing the general SCRF framework and using it for speech recognition, where you have to dot all the i's, cross all the t's, and make sure that everything works.
Okay, so let me start by talking about how language models are implemented there, because it's sort of a tricky issue. When I see a model like this, I think bigram language model: I see two states, they're next to each other, they're connected by an arrow, that's like the probability of one state given the preceding state, and that looks a whole lot like a bigram language model. So is that what we're talking about, are we just talking about bigram language models here? The answer is no. What we're going to do is actually model long-span context, by making these states refer to states in an underlying finite-state language model.
Here's an example of that. What you see on the left is a fragment from a finite-state language model. It's a trigram language model, so it has bigram history states: for example, there's the bigram history state "the dog", along with some other bigram history states. Sometimes we don't have all the trigrams in the world, so to decode an unseen trigram we need to be able to back off to a lower-order history state. For example, if we're in the history state "the dog", we might have to back off to the history state "dog", the one-word history state, and then we could decode a word that we haven't seen before in a trigram context, like "yelped", and then move to the history state "dog yelped". Finally, as a last resort, you can back off to the null history state, state three down there at the bottom, and just decode any word in the vocabulary.
Okay, so let's assume that we want to decode the sequence "the dog yelped". How would that look? We decode the first word, "the", and we end up in state seven here, having seen the history "the". Then we decode the word "dog"; that moves us around to state one, since we've now seen the bigram "the dog". Now suppose you want to decode "yelped". Right now we're in state one; we've gotten as far as "the dog", and that got us to state one here. To decode "yelped", we'd have to back off from state one to state two, and then we could decode the word "yelped" and end up in state six over here, "dog yelped". What this means is that by the time we get around to decoding the word "yelped", we know a lot more than that the last word was "dog": we actually know that the previous state was state one, which corresponds to the two-word history "the dog". So this is not a bigram language model that we have here; it actually reflects the semantics of the trigram language model that you see in the fragment on the left.
There are two ways that we can use this. One is to generate a basic language model score: if we provide the system with the finite-state language model, then we can just look up the language model cost of transitioning between states and use that as one of the features in the system. But more interestingly, we can create a binary feature for each arc in the language model. Now, these arcs in the language model are normally labeled with things like bigram probabilities, trigram probabilities, or back-off probabilities. What we're going to do is create a binary feature that just says, have I traversed this arc in transitioning from one state to the next? So for example, when we go from "the dog" to "dog yelped", we traverse two arcs: the arc from one to two, and then the arc from two to six. The weights, the lambdas that we learn in association with those arcs, are analogous to the back-off weights and the bigram weights of the normal language model, but we're actually learning what those weights are. What that means is that when we do training, we end up with a discriminatively trained language model, and actually a language model that we train in association with the acoustic model training, at the same time, jointly with the acoustic model training. So I think that's sort of an interesting phenomenon.
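A small sketch of the idea, not the SCARF implementation: walk a back-off n-gram automaton to decode a word, and record every arc traversed as a binary feature whose weight is learned. The state and arc names are made up to mirror the "the dog yelped" example:

```python
# Hypothetical back-off automaton: (history_state, word) -> next state.
# An entry with word=None is a back-off arc to a lower-order history state.
ARCS = {
    ("the dog", None): "dog",        # back off from the two-word history
    ("dog", "yelped"): "dog yelped", # bigram arc
    ("dog", None): "<null>",         # back off to the null history
    ("<null>", "yelped"): "yelped",  # unigram arc
}

def decode_word(state, word):
    """Return (next_state, arcs traversed); each arc becomes a binary feature."""
    traversed = []
    while (state, word) not in ARCS:
        traversed.append((state, None))   # take a back-off arc
        state = ARCS[(state, None)]
    traversed.append((state, word))
    return ARCS[(state, word)], traversed

# decode_word("the dog", "yelped")
#   -> ("dog yelped", [("the dog", None), ("dog", "yelped")])
# The learned weight on each arc feature plays the role of a back-off or n-gram
# weight, but it is trained jointly with the acoustic features.
```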
Okay, I'd like to talk about the inputs to the system now. The first inputs are detector inputs; a detection is simply a unit and its midpoint. An example of that is shown here. What we have are phone detections; this is from a voice search system, and it looks like the person is asking for "burgers". The way to read this is that we detected a "b" at time frame 790, an "er" a bit later, and so forth, and these correspond to the observations that are in red in the illustration here. You can also provide dictionaries that specify the expected sequence of detections for each word: for example, if we're going to decode "burgers", we expect "b", "er", "g", and so forth, the pronunciation of the word.
The second input is lattices that constrain the search space. The easiest way of getting these lattices is to use a conventional HMM system and use it to provide constraints on the search space. The way to read this is that from time 1221 to time 2560, a reasonable hypothesis is "workings". These times give us hypothesized segment boundaries, and the word gives us possible labelings of the state, and we're going to use those when we actually do the computations, to constrain the set of possibilities we have to consider.
A second kind of lattice input is user-defined features. If you happen to have a model that you think provides some measure of consistency between the word you're hypothesizing and the observations, you can plug it in as a user-defined feature like you see here. This lattice has a single feature that's been added, a dynamic time warping feature, and the particular one I've underlined in red here indicates that the DTW feature value for hypothesizing the word "fell" between frames 1911 and 2260 is 8.27. That feature corresponds to one of the features in the log-linear models that exist on those vertical edges.
Now, multiple inputs are very much encouraged. What you see here is a fragment of a lattice file that Kris Demuynck put together, and you can see it's got lots of different feature functions that he's defined. Essentially these features are the things that, to follow the metaphor I started with at the beginning, are analogous to the sails: they provide the information that pushes the whole thing forward, and we want to get as many of them as possible.
Okay, let's talk about some features that are automatically defined from the inputs. The user-defined features are ones you define yourself; you don't have to worry about them once you put them on a lattice. If you provide detector sequences, there's a set of features that can be automatically extracted, and the system will learn the weights of those features. Those are existence, expectation, and Levenshtein features, along with something called the baseline feature.
The idea of an existence feature is to measure whether a particular unit exists within the span of the word that you're hypothesizing. These are created for all word/unit pairs, and they have the advantage that you don't need any predefined pronunciation dictionary, but they have the disadvantage that you don't get any generalization ability across words. Here's an example. Suppose we're hypothesizing the word "accord", and we span the detections "k" and "ao". I would create a feature that says, okay, I'm hypothesizing "accord" and I detected a "k" in the span; that would be an existence feature. When you train the model, it would presumably get a positive weight, because presumably it's a good thing to detect a "k" if you're hypothesizing the word "accord". But there's no generalization ability across words here, so that's a completely different feature from the one you would have if you were hypothesizing "accordion", and there's no transfer of the weights or smoothing there.
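A minimal sketch of existence features, assuming the detections have already been restricted to the hypothesized word's span:

```python
# One indicator feature per (hypothesized word, detected unit) pair: it fires
# when that unit is detected anywhere inside the word's hypothesized span.

def existence_features(hyp_word, units_in_span):
    return {("exists", hyp_word, u): 1.0 for u in set(units_in_span)}

# existence_features("accord", ["k", "ao"]) yields features keyed on
# ("exists", "accord", "k") and ("exists", "accord", "ao"); none of these keys
# are shared with "accordion", which is why there is no generalization across words.
```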
The idea behind expectation features is to use a dictionary to avoid this and actually get generalization ability across words. There are three different kinds of expectation features, and I think I'll just go through them by example. Let's take the first one: suppose we're hypothesizing "accord" again and we detect a "k". We have a correct accept of the "k", because we expect to see it on the basis of the dictionary and we've actually detected it. Now that feature is very different from the earlier feature, because we can learn that detecting a "k" when you expect a "k" is a good thing in the context of training on the word "accord", and then use that same feature weight when we detect a "k" in association with the word "accordion" or some other word. The second kind of expectation feature is a false reject of the unit, and there's an example of that, where we expect to see a unit but we don't actually detect it. Finally, you can have a false accept of the unit, where you don't expect to see it based on your dictionary pronunciation, but it shows up in the things that you've detected, and the example illustrates that as well.
Levenshtein features are similar to expectation features, but they use stronger ordering constraints. The idea behind the Levenshtein features is to take the dictionary pronunciation of a word and the units that you've detected in association with that word, align them to each other to get the edit distance, and then create one feature for each kind of edit that you've had to make. So, to follow along in this example, where we expect "accord" and we see something like "t k ao r", we have a substitution for the "ax", a match of the "k", a match of the "ao" and the "r", and a delete of the "d". Again, presumably we can learn that matching an "ao" is a good thing and has a positive weight by seeing one set of words in the training data, and then use that to evaluate hypotheses of new words at test time, where we haven't seen those particular words, but they use these subword units whose feature values we've already learned.
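A sketch of how the Levenshtein features could be computed; the pronunciation and detection strings here are illustrative, and the SCARF toolkit's actual bookkeeping may differ:

```python
# Align the dictionary pronunciation to the detected units with edit distance,
# then emit one count per edit type per unit.

def levenshtein_features(expected, detected):
    n, m = len(expected), len(detected)
    D = [[0] * (m + 1) for _ in range(n + 1)]      # edit-distance DP table
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i-1][j-1] + (expected[i-1] != detected[j-1])
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    feats, i, j = {}, n, m                          # trace back, counting edits
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (expected[i-1] != detected[j-1]):
            kind = "match" if expected[i-1] == detected[j-1] else "sub"
            feats[(kind, expected[i-1])] = feats.get((kind, expected[i-1]), 0) + 1
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            feats[("del", expected[i-1])] = feats.get(("del", expected[i-1]), 0) + 1
            i -= 1
        else:
            feats[("ins", detected[j-1])] = feats.get(("ins", detected[j-1]), 0) + 1
            j -= 1
    return feats

# levenshtein_features(["ax","k","ao","r","d"], ["t","k","ao","r"])
#   -> {("sub","ax"):1, ("match","k"):1, ("match","ao"):1, ("match","r"):1, ("del","d"):1}
```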
Okay, the baseline feature is kind of an important feature, and I want to mention it here. I think many people in the room have had the experience of taking a system, having an interesting idea, a very novel scientific thing to try out, adding it in, and it gets worse. The idea behind the baseline feature is that we want something like the Hippocratic oath, where we're going to do no harm: we're going to have a system where you can add information to it and not go backward. So we're going to make it so that you can build on the best system that you have, by treating the output of that system as a word detector stream, the detection of words, and then defining a feature, this baseline feature, that sort of stabilizes the system. The definition of the baseline feature is that if you look at an arc that you're hypothesizing, and you look at what words you've detected underneath it, you get a plus one for the baseline feature if the hypothesized word covers exactly one baseline detection and the words are the same, and otherwise you get a minus one for this feature.
Here's an example of that. In the lattice, the sample path that we're evaluating is "random lee sort cardamom". The baseline system output was "randomly sort cards mom", detected at the vertical lines that you see here. So when we compute the baseline feature, we take the first arc, "random", and we say: how many words does it cover? One, that's good; is it the same word? No; minus one. Then we take "lee" and we say: how many words does it cover? None; that's no good, so we get a minus one. Then we take "sort": how many words does it cover? One; is it the same? Yes; okay, we get a plus one there. And finally, "cardamom" covers two words, not one like it's supposed to, so we get a minus one there also. It turns out, if you think about this, that the way to optimize the baseline score is to output exactly as many words as the baseline system output, and to make their identities exactly the same as the baseline identities. So if you give the baseline feature a high enough weight, the baseline output is guaranteed. In practice, of course, you don't just set that weight by hand; you add the feature to the system along with all the other features, and its weight is learned with everything else.
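A minimal sketch of the baseline feature for a single hypothesized arc, assuming the baseline output is given as (word, midpoint time) detections:

```python
def baseline_feature(hyp_word, start, end, baseline_detections):
    """+1 if the arc covers exactly one baseline detection with the same word,
    otherwise -1.  baseline_detections is a list of (word, midpoint) pairs."""
    covered = [w for (w, t) in baseline_detections if start <= t < end]
    return 1.0 if len(covered) == 1 and covered[0] == hyp_word else -1.0

# In the example above, an arc covering no baseline detection gets -1, while an
# arc covering exactly one baseline detection with a matching word gets +1.
```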
Okay, I'd like to move on now to some experimental results. The first of these has to do with using multi-phone detectors, detecting multi-phone units, in the context of voice search; there's nothing special about voice search here, it just happens to be the application we were using. The idea is to try to empirically find multi-phone units, sequences of phones that tell us a lot about words, then to train an HMM system whose units are these multi-phone units, do decoding with that HMM system, and take its output as a sequence of multi-phone detections. We're going to put that detector stream into the SCRF.
The main question here is: what are good phonetic subsequences to use? We're going to start by using every subsequence that occurs in the dictionary as a candidate. The expression for the mutual information between the unit u_j and the word w is given by this big mess that you see here, and the important thing to take away from it is that there is a tradeoff. It turns out that you want units that occur in about half of the words, so that when you get one of these binary detections you actually get a full bit of information, and from that standpoint single phones come close. But you also need units that can be reliably detected, because the best unit in the world isn't going to do you any good if you can't actually detect it, and from that point of view longer units are better.
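The exact expression on the slide isn't reproduced here, but the quantity being traded off is presumably the usual mutual information between the binary event "unit u_j is detected in the word" and the word identity W:

```latex
I(U_j; W) \;=\; \sum_{u \in \{0,1\}} \sum_{w} P(U_j = u, W = w)\,
                \log \frac{P(U_j = u, W = w)}{P(U_j = u)\,P(W = w)}
```

A unit present in about half of the words makes U_j close to one bit of entropy, while detector reliability determines how much of that bit actually survives.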
It turns out that if you do a phone decoding of the data, you can then compile statistics and choose the units that are best. My colleague Patrick Nguyen and I pursued a research stream along those lines, and you can look at this paper for the details.
If you do this and look at what the most informative units are in this particular voice search task, you see something sort of interesting: some of them are very short, like single phones, but then some of them are very long, like "california". So we get units that are sometimes short and frequent, and sometimes long, like "california", which is still pretty frequent, but less frequent. Okay, so what happens if we use multi-phone units? We started with a baseline system that was at about thirty-seven percent. If we added phone detections, that dropped the error rate by about a percent. If we use multi-phone units instead of phone units, that turns out to be better, so it was gratifying that using these multi-phone units instead of the simple phone units actually made a difference. Then if you use both together, it works a little bit better. If you use the three best multi-phone units that were detected, it's better yet. And finally, when we did discriminative training, that added a little bit more. So what you see here is that it is actually possible to exploit somewhat redundant information in this kind of framework.
The next kind of features I want to talk about are template features, and this is work that was done at the 2010 Johns Hopkins workshop, on Wall Street Journal, by my colleagues Kris Demuynck and Dirk Van Compernolle. In order to understand that work, I need to say just a little bit about the baseline template system that's used at Leuven University. The idea here is that you have a big speech database and you do forced alignment of all the utterances; those utterances are the rows in the top picture, and for each phone you know where its boundaries are, and that's what those square boxes are, those are phone boundaries. Then you get a new utterance, like the utterance you see at the bottom, and you try to explain it by going into this database that you have, pulling out phone templates, and then doing an alignment of those phone templates to the new speech such that you cover the whole of the new utterance. Since the original templates come with phone labels, you can then read off the phone sequence.
Okay, so suppose we have a system like that set up: is it possible to use features that are created from templates in the SCRF framework? It turns out that you can, and there are some interesting kinds of features that you can have. The idea is to create features based on the template matches that explain a hypothesis. What you see at the upper left is a hypothesis of the word "the", and we've further aligned it so that we know where the first phone, "dh", is and where the second phone, "iy", is. Then we go into the database and find all the close matches to those phones: template number thirty-five was a good match, number four hundred twenty-three was a good match, number twelve thousand eleven was a good match, and so forth. Given all those good matches, what are some features that we can get? One of them is a word ID feature: what fraction of the templates that you see stacked up here actually came from the word that we're hypothesizing, "the"? Another is position consistency: if the phone is word-initial, like the "dh", what fraction of the templates were word-initial in the original data? That's another interesting feature. Speaker ID entropy: are all the close matches just from one speaker? That would be a bad thing, because potentially it's a fluke. Degree of warping: if you look at how much you have to warp those examples to get them to fit, what's the average warp scale? Those are all features that provide some information and that you can put into the system. Kris Demuynck wrote an ICASSP paper that describes this in detail.
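A hedged sketch of how those template metadata features might be computed from a list of close-matching templates; the field names are made up for illustration and are not the toolkit's actual format:

```python
import math
from collections import Counter

def template_metadata_features(matches, hyp_word, hyp_is_word_initial):
    """matches: list of dicts with (hypothetical) keys
    'word', 'word_initial', 'speaker', 'warp_factor'."""
    n = len(matches)
    word_id = sum(m["word"] == hyp_word for m in matches) / n
    position = sum(m["word_initial"] == hyp_is_word_initial for m in matches) / n
    spk = Counter(m["speaker"] for m in matches)
    speaker_entropy = -sum((c / n) * math.log(c / n) for c in spk.values())
    avg_warp = sum(m["warp_factor"] for m in matches) / n
    return {"word_id": word_id,
            "position_consistency": position,
            "speaker_id_entropy": speaker_entropy,
            "avg_warp": avg_warp}
```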
If we look at the results, we started from a baseline template system at 8.2 percent. Adding the template metadata features provided an improvement, to 7.6 percent. If we then add HMM scores, we get to 6.8 percent; I have to say there that the HMM by itself was actually at 7.3, so 7.3 is sort of the baseline. And then adding phone detectors dropped it down finally to 6.6 percent, which is actually a very good number for the open-vocabulary 20K test set. Again, this is showing the effective use of multiple information sources.
Okay, the last experimental result I'd like to go over is a broadcast news system that we worked on, also at the 2010 CLSP workshop. I don't have time to go into detail on all the particular information sources that went into this; I just want to call out a few things. IBM was kind enough to donate their Attila system for use in creating a baseline system that constrained the search space. We created a word detector system at Microsoft Research that produced these word detections that you see here, detector streams. There were a number of real-valued information sources: Aren Jansen had a point process model that he worked on, Justine Kao worked on a duration model, and Les Atlas and some of the students had scores based on modulation features; those provided real-valued feature scores such as you see here. Then there were some deep neural net phone detectors, and Samuel Thomas looked at the use of MLP phoneme detections; those provided the discrete detection streams that you see at the very bottom there.
If we look at the results, and let's just move over to the test results, the baseline system that we built had a 15.7 percent word error rate. If we did the training with the SCARF baseline feature, there was a small improvement there; I think that has to do with the dynamic range of the baseline feature, plus or minus one, versus the dynamic range of the original likelihoods. Adding the word detectors provided about a percent, and adding the other feature scores added a bit more. Altogether we got about a 9.6 percent relative improvement, or about 27 percent of the gain possible given the lattices. Again, this indicates that you can take multiple kinds of information, put them into a system like this, and move in the right direction.
Okay, I want to just quickly go over a couple of research challenges. I won't spend much time here, because research challenges are things that haven't been done, and people are going to do what they're going to do anyway, but I'll just mention a few things that seem like they might be interesting. One of them would be to use an SCRF to boost HMMs. The motivation for this is that the use of the word detectors in the broadcast news system was actually very effective; we tried combination with ROVER and that didn't really work, but we were able to use them with this log-linear weighting. So the question is, can we use SCRFs in a more general boosting loop? The idea would be to train the system, take its word-level output, reweight the training data according to the boosting algorithm, upweighting the regions where we have mistakes, train a new system, and then treat the output of that system as a new detector stream to add into the overall group of systems.
A second question is the use of spectro-temporal receptive field models as detectors. Previously we've used HMM systems as detectors; I think it would be interesting to try to train STRF models to work as detectors and provide these detection streams. One way of approaching that would be to take a bunch of examples of phones or multi-phone units, in-class examples and out-of-class examples, train a maximum entropy classifier to make the distinction, and use the weight matrix of the maxent classifier essentially as a learned spectro-temporal receptive field.
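A speculative sketch of that last idea, using scikit-learn's logistic regression as the maximum entropy classifier and random placeholder data in place of real spectro-temporal patches:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_freq, n_frames = 40, 30                          # patch size: bands x frames
X_in = np.random.randn(200, n_freq * n_frames)     # in-class patches (placeholders)
X_out = np.random.randn(200, n_freq * n_frames)    # out-of-class patches
X = np.vstack([X_in, X_out])
y = np.array([1] * 200 + [0] * 200)

clf = LogisticRegression(max_iter=1000).fit(X, y)  # maxent classifier
strf = clf.coef_.reshape(n_freq, n_frames)         # weights read back as a learned
                                                   # spectro-temporal receptive field
```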
The last idea that I'll throw out is to try to make much larger scale use of template information than we have so far. We saw the Wall Street Journal results and the comments there, and I think maybe we could take that further. For example, in voice search systems we have an endless stream of data coming in, and we keep transcribing some of it, so we get more and more examples of phones and words and sub-word units and so forth. Could we take some of those same features that were described previously and use them on a much larger scale, as the data comes in, on an ongoing basis?
Okay, so I'd like to conclude here. I've talked today about segmental log-linear models, specifically segmental conditional random fields. I think these are a flexible framework for testing novel scientific ideas. In particular, they allow you to integrate diverse information sources: different types of information at different granularities, at the word level, the phone level, the frame level; information that comes at variable quality levels, some better than others; potentially redundant information sources; and generally speaking much more than we're currently using. And finally, I think there's a lot of interesting research left to do in this area. So thank you.
Okay, we have time for some questions, and we do have some. Please, if you want to ask one, step up and get close to the microphone; that's actually very helpful.
So in segmental models there's an issue of normalization, because you're comparing hypotheses with different numbers of segments, and so there's an issue of how you make sure that you don't favor hypotheses with fewer, longer segments. I was wondering how you deal with that.
Yeah, good question. You deal with it because when you do training you have to normalize by considering all possible segmentations in the denominator. When you do training, you know how many segments there are, you know how many words there are in the training hypothesis; that gives you a fixed number, maybe ten. Then you have this normalizer, where you have to consider all possible segmentations. If the system had a strong bias, say towards segmentations that only had one segment because there were fewer scores, that wouldn't work, because your denominator would then assign high weight to the wrong segmentations: it wouldn't assign high weight to the thing in the numerator that has ten segments, it would assign high weight to the hypotheses in the denominator that have just a single segment, and the objective function would be bad. So training takes care of that: because you're maximizing the objective function, the conditional likelihood of the training data, it has to assign parameter values such that it doesn't have that particular bias.
On one of those slides you were saying that you get a discriminatively trained language model implicitly by doing the joint training. My question is, doesn't that limit things? Because in order to train the language model that way, we need data that has acoustics with it, but usually for language modeling we have huge amounts of text data for which we may not have the corresponding acoustics. If we were to train a big language model from just text, how would we incorporate it?
So I think the way to do that is to annotate the lattice with the language model score that you get from this language model you train on lots and lots of data, so that score is going to get into the system. Then have a second language model, which you could think of as sort of a corrective language model, that is trained only on the data for which you have acoustics, and add those features in, in addition to the language model score from the basic language model.
Another question: does it do just one-pass decoding? I mean, from what I understand, you take lattices in to constrain your search space, but then what if I have a language model which is much more complicated than an n-gram and I wish to do decoding with it? Is it possible to output a lattice-like structure out of the decoder?
There's a question of in theory versus in the particular implementation that we've made. In the particular implementation that we've made, no: it takes lattices in and it produces one-best output. There's nothing about the theory or the framework that says you can't take lattices in and produce lattices out. I was just curious about the practical side, yeah. Okay.
Yeah, I just have one question. I think it's a good idea to combine the different sources of information, but this can also be done in a much simpler model, right, without using the concept of segments. You introduce the segments here, so what is the real benefit of that?
So I think the benefit is features that you can't express if you don't have the concept of the segment. An example of a feature where you need segment boundaries, probably the simplest example, is a word duration model: you really need to talk about when the word starts and when the word ends. Another example where I think it's useful is in template matching: if there's a hypothesis and you want to have a feature of the form, what is the DTW distance to the closest example in my training database of this word that I'm hypothesizing, it helps to have a boundary to start that DTW alignment and a boundary to end that DTW alignment. So I think the answer to the question is that by reasoning explicitly about segmentations, you can incorporate features that you can't incorporate if you reason only about frames.
But aren't you incorporating those features in a rather heuristic way? In some of the simpler models you have several levels, and you just combine them, map the whole thing, and train it. Can't we do that?
So my own personal philosophy is that if you care about features, if you care about information where the natural measure of that information is in terms of segments, then you're better off explicitly reasoning in terms of those units, in terms of segments, than somehow trying, implicitly or through the back door, to encode that information in some other way.
How many sorts of segments have you tried? Have you tried syllables, for instance? I would imagine, because many syllables are also monosyllabic words, that you might see some confusion in your word models.
Ah, syllables. Right, I didn't mention this as a research direction, but one thing I'm really interested in is being able to do decoding from scratch with a segmental model like this. I also didn't go into detail about the computational burden of using these models, but it turns out that it's proportional to the size of your vocabulary. So if you wanted to do bottom-up decoding from scratch, without reference to some initial lattices or an external system, you would need to use subword units, for example syllables, which are on the order of some thousands, or even better, phones. And for phones we have actually begun some initial experiments with doing bottom-up phone recognition, just at the segment level, with the pure segment model, where we just by brute force consider all possible segments and all possible phones.
Okay, let's thank the speaker.