So, in this session on features for speaker recognition we've got five papers that will be presented. There is a bit of time before we have to leave for this evening's event, so we can actually run a little bit over afterwards for discussion.
The first talk is "Feature extraction using two-dimensional autoregressive models for speaker recognition", from the Johns Hopkins group; who will be presenting the paper?
Thank you. I know we have some time constraints. The idea of this talk, last but not least, is that I want to use it also as a possible start of some discussion about features in general for speaker recognition, because I think we started that yesterday, and I came to realise that we have some issues in common.
So I have a few slides at the beginning which are perhaps more general than what I will talk about later. As always, if you have any questions during the presentation, please ask me immediately, don't feel shy. I mean, if we don't get through all the slides, that's fine; the risk otherwise is that not everybody here knows what I'm talking about, so just keep asking questions.
So the message is the following. In speech we have several information streams: there is the speaker identity, there is the message, and there is the environment, and all of this information is carried in the same signal. Any one of these streams can be the one you want: if you are not after the speaker, then maybe the environment or the message is what can be used. And for the speaker it is the same story: there are a number of things which you may consider as disturbing. If you want the speaker, then the message and the influence of the environment are sources of variability you would like to be invariant to, and the speaker identity is the piece of information you want to keep.
A recognizer consists of an analysis producing features, and a classifier. The analysis is typically designed from what we knew before: what we learned in school, or whatever we got from previous experience with the data. Then there is the classifier, and the classifier is typically trained. Nowadays the distinction between the two is somehow going away, because we can also train the feature extraction.
So, as I said, and this is exactly what we tried before: the outcome of this whole process should be the identity of the speaker. Ideally the process should somehow alleviate the disturbing sources of information and stress the information about the speaker, so you would like an analysis which suppresses the message and the influence of the environment and so on, and enhances the information about who is speaking. But of course, one thing you also learn over the years in speech research is that it is very often better to reuse as much as possible something which already exists, because that is what you have, or what you can easily get, and so on.
And that is how we ended up, in speaker recognition, with the same front end as in speech recognition. We know how to do this: you take the signal, you do some analysis along a frequency axis, so you get a sequence of vectors, each of them describing the signal in different frequency sub-bands. You typically ignore the phase, and you modify the spectrum to emulate, in quotes, hearing, because people believe that hearing is to some extent a feature extractor, and its properties might be useful.
So there is this spectral analysis, then some modifications, depending on the school of thought: the PLP people do different modifications than the MFCC people, and so on. Then in most cases we take a cosine transform; most likely there is some compression of the spectrum first, and the transform approximately decorrelates the features, and you get the cepstrum. And the cepstrum is what we have been using, both in speech and in speaker recognition, all these years.
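To make that pipeline concrete, here is a minimal sketch in Python, assuming a generic filterbank and log compression; the filterbank shape, frame size and number of coefficients are illustrative choices, not the exact PLP or MFCC recipe.

```python
import numpy as np
from scipy.fft import dct

def cepstrum_from_frame(frame, fbank, n_ceps=13):
    """Short-term cepstrum: power spectrum -> filterbank -> compression -> DCT."""
    spec = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum, phase ignored
    band_energies = fbank @ spec                  # hearing-like frequency integration
    compressed = np.log(band_energies + 1e-10)    # compression (log here; cube root in PLP)
    return dct(compressed, type=2, norm='ortho')[:n_ceps]  # DCT roughly decorrelates

# Toy usage with a crude rectangular filterbank (illustrative only).
n_fft, n_bands = 512, 20
frame = np.hanning(n_fft) * np.random.randn(n_fft)
edges = np.linspace(0, n_fft // 2 + 1, n_bands + 1, dtype=int)
fbank = np.zeros((n_bands, n_fft // 2 + 1))
for b in range(n_bands):
    fbank[b, edges[b]:edges[b + 1]] = 1.0
print(cepstrum_from_frame(frame, fbank))
```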
So it is basically a borrowed representation: we, the speaker recognition people, borrowed it from speech recognition, the speech recognition people borrowed it from speech coding, and so on. Basically we have been standing on each other's shoulders for a long time.
As I mentioned briefly, the data themselves bring variability. What were the sources of variability at that time? Different channels, different recording conditions, and so on. This information is of course not what the representation was designed for; it typically changes between training and test, because you are just changing channels, handsets, acoustic environments, and it is basically information which should not tell you who is speaking.
Briefly, the speaker recognition techniques, things like background models, joint factor analysis and so on, deal with this variability, in some cases embarrassingly well.
Now let's see how much of this machinery we still carry from those days. What you see here is a spectrogram; it is not the cepstrum itself, because the cepstrum is hard to look at, but it is what the cepstrum is computed from. As was suggested in the break, it might be worthwhile looking back into this basic analysis, because we now have much more data and very fancy processing techniques, and one may want to know how much of the variability is really dealt with there.
And the cues which could be useful for recognizing the speaker may actually be very different. Maybe the borrowed representation is even misleading, because it was designed with the message in mind rather than the speaker. There is work on other sources of information and other methods which might be more specific to speaker recognition, but that would be another story.
So, the results I will talk about are based on deriving the spectrum differently. Normally you window the signal, a couple of tens of milliseconds at a time, and after some preprocessing you fit an autoregressive model, as in linear prediction, and what you get is an all-pole approximation of the short-term power spectrum: a function of frequency, and a sequence of such spectra over time. You can also do it differently, and this is what we are presenting here. You take a fairly long stretch of the signal and do exactly the same thing, but on its cosine transform: you fit the autoregressive model on the cosine-transform coefficients belonging to a particular frequency band, and what you get is the temporal envelope of the signal in that band. If you do this band by band across a wide bandwidth, you end up with a time-frequency representation.
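A rough sketch of that idea, frequency-domain linear prediction, under simplifying assumptions: a plain DCT of the whole segment, one rectangular band of DCT coefficients, and an autocorrelation-method AR fit; the windowing details, band shapes and model orders of the actual system will differ.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def fdlp_envelope(segment, band, order=40, n_points=500):
    """All-pole model of the temporal envelope of one frequency band.

    segment : a long stretch of signal (hundreds of milliseconds or more)
    band    : slice selecting the DCT coefficients of the band of interest
    Returns the squared magnitude of the all-pole model, which here is a
    function of TIME across the segment, i.e. the band's temporal envelope.
    """
    c = dct(segment, type=2, norm='ortho')[band]                     # cosine transform, keep one band
    r = np.correlate(c, c, mode='full')[len(c) - 1:len(c) + order]   # autocorrelation, lags 0..order
    a = solve_toeplitz(r[:order], r[1:order + 1])                    # normal equations (Levinson-style)
    a = np.concatenate(([1.0], -a))                                  # prediction-error filter A(z)
    _, h = freqz([1.0], a, worN=n_points)                            # 1/A evaluated on a grid
    return np.abs(h) ** 2                                            # smooth temporal envelope

# Toy usage: envelope of the lowest-frequency band of a 1-second segment at 8 kHz.
fs = 8000
x = np.random.randn(fs)
env = fdlp_envelope(x, band=slice(0, fs // 8))
```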
I sometimes like to overlay the two views, because this is a very rich representation. And this is maybe closer to the way hearing is working, because what matters, both for speech and for the speaker, is which frequency components are modulated and how. Whether it really is better I cannot prove from a picture, but if you just look at it you might believe me.
So this is what we call frequency domain linear prediction, FDLP. As opposed to ordinary time-domain linear prediction, or perceptual linear prediction, here the prediction is done on the frequency-domain representation; the naming can be confusing, but I think it captures quite a bit of perception.
Here is an example. We have a signal, you fit the all-pole model of its temporal envelope, and you can also look at what is left after you take the envelope out, the carrier. You can do this in different frequency bands: the time-domain signal is split into bands, and in each band you get an envelope and a carrier. So you can resynthesize speech from the envelopes only, and you can also resynthesize speech from the carriers.
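To illustrate the envelope/carrier split behind that demo, here is a sketch using a simple band-pass filter and the Hilbert transform; the talk presumably used the FDLP envelopes themselves, so treat this as the underlying idea rather than the exact procedure.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_and_carrier(x, fs, lo_hz, hi_hz):
    """Split one frequency band into a slow temporal envelope and a carrier.

    env * carrier reconstructs the band; resynthesizing from the envelopes of
    all bands, or from the carriers alone, keeps different parts of the
    information in the signal.
    """
    sos = butter(4, [lo_hz, hi_hz], btype='bandpass', fs=fs, output='sos')
    band = sosfiltfilt(sos, x)
    analytic = hilbert(band)
    env = np.abs(analytic)                        # temporal envelope of the band
    carrier = band / (env + 1e-10)                # what is left after the envelope
    return env, carrier

# Toy usage: a 440 Hz tone with 3 Hz amplitude modulation.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t))
env, car = envelope_and_carrier(x, fs, 300, 600)
resynth = env * car                               # gives back the band-passed signal
```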
(Audio examples of the resynthesized signals are played.) If you listen, the envelopes carry most of the message. But the bottom line here is that what is useful for recognizing the message and what is useful for recognizing the speaker need not be the same thing.
Why is this interesting? In some ways, one thing is that you get this decomposition into components, which carry different information, also for speech. Another thing is that you get some robustness, and you have a representation of how the signal evolves in time within each band. But there are also some problems here: the model gives more weight to the high-energy parts, as we can see.
As I mentioned, if the signal goes through a channel, the distorted spectrum divided by the original is just the channel: different frequencies get scaled differently, depending on the channel. Within one sub-band this shows up essentially as the overall gain of the all-pole model, and that is something which, in this scheme, you can just ignore. So by throwing the gain away you keep the shape of the envelope, which makes the representation more robust in the presence of this kind of distortion and, to some extent, additive noise; the absolute level is just a mess anyway.
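A minimal sketch of that gain-normalization step, under the assumption that a fixed channel mostly rescales each sub-band envelope by a constant factor which can then be divided out (equivalently, the gain term of the all-pole model is simply ignored).

```python
import numpy as np

def normalize_band_gain(envelope):
    """Remove the overall gain of one sub-band temporal envelope.

    If the channel multiplies the band by a constant g, the envelope is
    scaled by roughly g**2; dividing by its own mean keeps only the shape,
    which is what survives a change of channel.
    """
    return envelope / (np.mean(envelope) + 1e-10)

# The normalized envelope is unchanged when the band is rescaled.
env = np.abs(np.random.randn(1000)) ** 2
assert np.allclose(normalize_band_gain(env), normalize_band_gain(4.0 * env))
```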
So basically what the people did is the following. The first thing is that speech occupies different frequency ranges, so you try to find a reasonable set of bands, and also a reasonable time span. Then they want to be able to use the standard speaker recognition techniques, the back-ends everybody knows, and for that they need something which looks like conventional cepstral features. So they take, at each time instant, the slice across frequency of the sub-band envelopes and fit a spectral all-pole model over it; you do this at every time step, so you model both time and frequency, a two-dimensional autoregressive model.
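A rough sketch of that second, spectral step: at one time instant, the slice of sub-band envelope values is treated as a power spectrum and an all-pole model is fitted across frequency. Here cepstra are then taken as the DCT of the log model spectrum, a simplification of the exact cepstrum recursion, and all orders and sizes are illustrative guesses rather than the paper's settings.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def spectral_ar_cepstrum(power_slice, order=12, n_ceps=13):
    """Fit an AR model ACROSS FREQUENCY to one time slice of the sub-band
    envelopes (the second dimension of the 2-D autoregressive model) and
    return cepstrum-like features of the smoothed slice."""
    # Autocorrelation sequence = inverse Fourier transform of the power
    # spectrum; mirror the one-sided slice so the transform is real.
    spec = np.concatenate([power_slice, power_slice[-2:0:-1]])
    r = np.fft.ifft(spec).real[:order + 1]
    a = solve_toeplitz(r[:order], r[1:order + 1])          # prediction coefficients
    a = np.concatenate(([1.0], -a))                        # error-filter polynomial
    _, h = freqz([1.0], a, worN=len(power_slice))          # all-pole (smoothed) slice
    return dct(np.log(np.abs(h) ** 2 + 1e-10), type=2, norm='ortho')[:n_ceps]

# Toy usage: one frame's worth of 64 sub-band envelope samples.
slice_of_envelopes = np.abs(np.random.randn(64)) ** 2 + 1.0
print(spectral_ar_cepstrum(slice_of_envelopes))
```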
In this example some of that has already been removed. Notice also that the stretch of signal being modelled is much longer than the conventional, very short analysis window.
(Largely inaudible: remarks on the first experiments, the task and the performance.)
I was hoping this was going to be... I think the distinction is expressed here, but at the same time, a classifier for speaker recognition could use the knowledge, the fact that for different areas of speech sounds the speaker information is carried by different parts of the model, and so on and so on. It is interesting that it doesn't take advantage of that, as somebody was pointing out.
(Inaudible question.)
No, it's such that every utterance is about a sentence long, so we just take the whole utterance. If we had a lot of speech, it could be chopped into segments of, say, one to five seconds, and then, depending on the length of the segment, we would choose the order of the model: roughly so many poles per second of signal. Within a segment, if you window the signal, typically the very first part of the data is not modelled well, which you can check, so we use the central part.
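A small illustration of that bookkeeping, under the stated rule of thumb that the all-pole model order grows with segment length; the poles-per-second value below is a placeholder, not the figure used by the authors.

```python
import numpy as np

def chop_and_pick_order(x, fs, seg_seconds=2.0, poles_per_second=40):
    """Chop an utterance into fixed-length segments and pick an FDLP model
    order proportional to each segment's duration (rule of thumb only)."""
    seg_len = int(seg_seconds * fs)
    segments = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]
    order = int(poles_per_second * seg_seconds)
    return segments, order

# Toy usage: a 7-second utterance at 8 kHz gives three 2-second segments.
fs = 8000
segments, order = chop_and_pick_order(np.random.randn(7 * fs), fs)
print(len(segments), order)   # 3 80
```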
I didn't say exactly that. What I said was that what might be interesting for the speaker is the other part: you run this process, which decomposes the signal into different components, and the envelope component was the one which was used here. But what struck me when I listened to the other component, the carrier, was that it sounded garbled, the information about the message was largely gone, and yet there was still some information, some information about the speaker. I don't think they ever went back to the original. We also took this other component and used it as if it were a speech signal; our phoneme recognizer got, what was it, fifty-five, fifty-four percent, about fifty percent accuracy.
So you can see that the same machinery still does something with respect to recognizing phonemes on it, even though, I mean, all the formants are gone and everything is flattened. So the assumption that it is not useful is not quite right: there is still something in there, also about the speaker. Of course, I can see that, so that might be right.
Of course, in all these cases you then ask about fusion, putting all these things together. As a matter of fact, I once tried a paper on speaker recognition which was called "Towards decreasing error rates", and one of the reviewers said that if you use these features the error rates do not actually decrease, and the paper was rejected. So there is something to be said here: if you are working on something new and you use it on its own, it is very likely that your performance degrades, the error rates go up; that is why that paper, too, had increasing error rates. But that was fifteen years ago, and now that we have these huge systems you start working on fusion: if you just go for a different source of information, you are very likely to see an improvement after fusion, and that is why people like this kind of research. With fusion you are very unlikely to increase the error rates: if you want to do something and it doesn't work on its own, you fuse it with what already works, and you can present it at the conference.
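A minimal sketch of that kind of score-level fusion: a weighted sum of two systems' scores, with the weight picked on held-out development trials. The selection criterion and the weight grid here are simple stand-ins for the logistic-regression fusion that is commonly used.

```python
import numpy as np

def fuse_scores(a_eval, b_eval, a_dev, b_dev, labels_dev):
    """Weighted-sum fusion of two speaker-verification score streams.

    The weight is chosen on development trials by maximizing the separation
    between target (label 1) and non-target (label 0) fused scores.
    """
    best_w, best_sep = 0.5, -np.inf
    for w in np.linspace(0.0, 1.0, 21):
        fused = w * a_dev + (1.0 - w) * b_dev
        tgt, non = fused[labels_dev == 1], fused[labels_dev == 0]
        sep = (tgt.mean() - non.mean()) / (tgt.std() + non.std() + 1e-10)
        if sep > best_sep:
            best_w, best_sep = w, sep
    return best_w * a_eval + (1.0 - best_w) * b_eval

# Toy usage with synthetic scores: a weak system can still help a stronger one.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 2000)
a = labels + rng.normal(0, 2.0, 2000)      # noisy system
b = labels + rng.normal(0, 1.0, 2000)      # stronger system
fused = fuse_scores(a, b, a, b, labels)
```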