0:00:15i
0:00:20she had your dark suit in greasy wash water
0:00:25so session number eight is on the topic of features for speaker recognition we've got five
0:00:31papers
0:00:32that will be presented in the session
0:00:34there is a bit of time yeah
0:00:37before the buses
0:00:38come for this evening's event
0:00:44so we can actually linger a little bit afterwards for
0:00:46discussion
0:00:48so the first talk is feature extraction using two-dimensional autoregressive models for
0:00:54speaker recognition from the johns hopkins group
0:00:57who will be presenting the paper
0:01:02oh
0:01:04i think you will want to watch the time constraints i'm just sitting
0:01:08here watching
0:01:10so the idea of the talk is not only to present the paper
0:01:15it is that i want to use it also if possible
0:01:20for some discussion about features in general for speaker recognition
0:01:24because i think we started yesterday and i came to realize that we have some
0:01:29issues just like the ones i mean here
0:01:31so i have a few slides at the beginning which perhaps
0:01:34will be a bit more general than what i want to talk about later
0:01:39and they will come back again
0:01:43i always like it if you have questions during the presentation
0:01:47please ask me immediately don't feel shy
0:01:50i mean if we don't get through all the slides
0:01:53that's fine as long as everybody's here and knows what i'm talking about
0:01:58so just keep asking questions
0:02:02so the business we are in is the following
0:02:07we have speech
0:02:09and speech carries several information streams
0:02:14there is the speaker
0:02:19there is the message and there is the environment
0:02:23and all of this information is in the speech signal
0:02:29any one of these can be the one you are after
0:02:37if you are not after the speaker the environment or the message
0:02:44can be what you use
0:02:47but for
0:02:47speaker recognition
0:02:50the speaker is what we want and there are a number of things
0:02:54which we may consider as disturbing in the audio
0:02:59one cares only about the speaker and would like to be invariant to the other
0:03:07sources of information in the audio
0:03:15which are in the same piece of signal together with the speaker information
0:03:21right
0:03:22the system involves
0:03:25analysis into features
0:03:27and then a classifier
0:03:30the analysis is the part which we design
0:03:35before
0:03:37we see the data
0:03:38it is based on what we learned in school or whatever we got from previous experience
0:03:44with the data
0:03:46and then there is a classifier and the classifier is typically trained
0:03:50nowadays the distinction between analysis and classification is somehow blurring because we may also train the feature
0:03:58extraction
0:03:59so
0:04:01that is
0:04:03roughly the setup
0:04:05and as i said this is exactly what we
0:04:10tried before
0:04:12so
0:04:14the outcome of this whole process should be in our case the identity of the speaker
0:04:19right so the goal of this process
0:04:23is somehow alleviating the unwanted sources
0:04:26of information
0:04:29and stressing the information about the speaker so you would like an analysis
0:04:34which somehow suppresses the
0:04:37message and the influence of the environment and so on and enhances the information
0:04:43about who is speaking
0:04:46but of course we have also found over the years in speech research
0:04:52that it is very often better
0:04:54to use as much as possible something which is already around because that is what
0:05:00you have or can easily get
0:05:03and so on
0:05:05and
0:05:05so in speaker recognition we ended up much the same way as in
0:05:12speech recognition
0:05:13we know how to process speech
0:05:19you take the signal and apply some frequency analysis a filter bank so you get
0:05:26a sequence of vectors each describing the signal in different frequency sub-bands
0:05:30you typically
0:05:32ignore the phase and you warp the frequency axis somehow auditory-like in quotes
0:05:38because we know that hearing is to some extent frequency selective
0:05:42and these properties might be useful
0:05:48and
0:05:50so the first step of the analysis is a short-term spectral analysis
0:05:57right here
0:05:59so this type of analysis
0:06:02is the standard one
0:06:03and then come some modifications depending on the school of thought
0:06:08that you have seen before
0:06:11the plp people do different modifications than the mfcc people and so on each has its own
0:06:18recipe
0:06:19and
0:06:22then we take a cosine transform in most cases
0:06:27most likely there is some compression of the dynamic range and the transform here approximately decorrelates the features
0:06:37and you get the cepstrum
0:06:39and cepstrum is what we have been using
0:06:42both in speech and speaker recognition all these years
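A minimal sketch of the chain just described (filter bank, log compression, cosine transform, cepstrum). This is not code from the talk; the sampling rate, frame length, number of filters and the mel warping are illustrative assumptions.

```python
# Sketch of the standard cepstral pipeline: filter-bank energies,
# log compression, cosine transform -> cepstral coefficients.
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters on a mel-warped ("auditory-like") frequency axis."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def cepstra(signal, sr=8000, frame_len=200, hop=80, n_filters=23, n_ceps=13):
    """Short-term spectra (phase ignored) -> sub-band energies -> log -> DCT."""
    n_fft = 256
    fb = mel_filterbank(n_filters, n_fft, sr)
    frames = np.array([signal[i:i + frame_len] * np.hamming(frame_len)
                       for i in range(0, len(signal) - frame_len, hop)])
    spectra = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    energies = spectra @ fb.T
    return dct(np.log(energies + 1e-10), norm='ortho')[:, :n_ceps]
```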
0:06:47so that's because
0:06:50we all share the same representation
0:06:53the speaker recognition people borrowed it from the speech recognition people
0:06:59who in turn took the representation from the speech coding people
0:07:06and so on and so basically we have been standing
0:07:08on the shoulders of giants
0:07:11right so that is as much as i wanted to mention briefly about the way it works
0:07:17so
0:07:18this slide is actually an old one
0:07:25what were the sources of variability at that time
0:07:31there were the different channels
0:07:40different speech sounds
0:07:48different speaking conditions
0:07:51this is information which of course
0:07:58we would like the design of the system to be invariant to
0:08:05this is something you typically cannot control
0:08:12just changing the channel alone can hurt you a lot
0:08:17and of course also the acoustic
0:08:20environment basically the space you are exposed to
0:08:25so this is the information
0:08:27which i feel should not be allowed to speak to the recognizer
0:08:32yeah so it's pretty funny because it's a little late to say it now but of course
0:08:40i should say briefly that speaker recognition techniques like the gmm background model
0:08:46joint factor analysis and so on can match speakers
0:08:50in some cases embarrassingly well
0:08:59so
0:09:01these sources of variability probably do not hurt as much as one might think
0:09:11now let's see how much of this machinery remains i mean from those days
0:09:17[inaudible]
0:09:53so this is like i think that
0:09:57yeah exactly
0:10:03you know this is a spectrum so it's not a cepstrum it is the spectrum
0:10:07[partly inaudible]
0:10:20yeah that's
0:10:22a suggestion from the break that it might be worthwhile looking back into these
0:10:28basic analysis steps
0:10:29because we have much more data and very fancy processing techniques and one may
0:10:36want to know how much
0:10:38variability yeah exactly how much variability there is
0:10:43[inaudible]
0:11:15yes
0:11:17and the set of techniques which could be useful for recognizing the speaker is actually very much bigger
0:11:23than that
0:11:26maybe
0:11:27it is a bit misleading because of what we use
0:11:32[partly inaudible]
0:11:51but at the same time the same holds for this work
0:11:54the work here also uses sources and methods applied to
0:11:59speech in general though these might be made more specific for speaker recognition
0:12:04but that would be another story so to the results
0:12:07the results
0:12:08i will talk about are based on deriving the spectrum in a somewhat different way
0:12:15normally you take a short segment of the signal say twenty milliseconds at a time
0:12:22and after some preprocessing you fit an autoregressive model to it
0:12:28and what you get is an all-pole log spectrum and we do this frame after frame
0:12:35so you get a sequence of
0:12:39spectra
0:12:41right
0:12:42the sequence is a function of time
0:12:44you can also do it differently and this is what i
0:12:49am presenting here
0:12:51you take a long segment of the signal even the whole utterance
0:12:55and do exactly the same thing
0:12:58but you apply the linear prediction to the cosine transform of the signal
0:13:02so then you are able to derive the model in a particular
0:13:09frequency sub-band
0:13:12and band by band you end up with a time-frequency representation
0:13:18just so you know i sometimes like to overlay this
0:13:25with the conventional short-term spectrogram when people do this
0:13:30spectral analysis
0:13:32and this is maybe closer to the way hearing is working because i don't see
0:13:37that
0:13:38the perception of speech and speaker
0:13:41tracks individual frequency components frame by frame
0:13:45so this is the reason you may want to somehow get such a representation
0:13:53into the system
0:13:56well
0:13:57i will not be able to prove it to you here
0:14:05but if you just look at the picture you might believe me
0:14:08okay
0:14:10so this is what we call frequency domain linear prediction it has been around for quite some time
0:14:16as opposed to ordinary time domain
0:14:16linear prediction
0:14:18or perceptual linear prediction so it can be seen as its dual
0:14:24and i think it keeps quite a bit of the perceptual motivation
0:14:28as well
0:14:31so here is one way to see it
0:14:34we have a signal
0:14:36you take its cosine transform
0:14:38and you fit the all-pole model to it
0:14:41which gives you the envelope
0:14:43and you can also
0:14:45see what is left after you take the envelope out
0:14:48that is the carrier
0:14:50and you can do this in different frequency bands
0:14:54so from this time domain signal you get per-band envelopes
0:14:59for each frequency band and the carriers that go with them
0:15:05so you can resynthesize speech from the envelopes only and one can also resynthesize speech from the carriers
0:15:12yeah
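A minimal sketch of the frequency domain linear prediction idea just described: take a long segment, compute its cosine transform, and fit an all-pole model to the transform coefficients of each sub-band, which approximates the temporal (Hilbert) envelope of that band. This is not the authors' code; the band layout, model order and envelope sampling are assumed, illustrative values.

```python
# Frequency domain linear prediction (FDLP) sketch: the AR model is fitted
# to DCT coefficients of a sub-band, so its "spectrum" is the temporal
# envelope of that band over the analyzed segment.
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method linear prediction: A(z) coefficients and gain."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])      # predictor coefficients
    return np.concatenate(([1.0], -a)), r[0] - a @ r[1:]

def fdlp_envelopes(signal, n_bands=8, order=40, n_points=200):
    """Per-band temporal envelopes of one long segment."""
    c = dct(signal, norm='ortho')                    # cosine transform of the segment
    edges = np.linspace(0, len(c), n_bands + 1).astype(int)
    envs = np.zeros((n_bands, n_points))
    for b in range(n_bands):
        band = c[edges[b]:edges[b + 1]]              # DCT coefficients of one band
        a, g = lpc(band, order)
        # power response of the all-pole model = smooth envelope over time
        w = np.exp(-2j * np.pi * np.outer(np.arange(order + 1),
                                          np.linspace(0.0, 0.5, n_points)))
        envs[b] = g / np.abs(a @ w) ** 2
    return envs
```

For a one-second segment at 8 kHz this gives, for each band, a smooth curve describing how the energy in that band evolves over the second; stacking the bands gives the time-frequency representation mentioned above, and the residual of each band would play the role of the carrier.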
0:15:16[inaudible]
0:16:09the carrier is what still carries the message
0:16:17but the bottom line here is that
0:16:19what we use for the speaker in this work is the envelope part of the signal
0:16:25one point is that actually you know
0:16:31in some ways you have separated the signal into
0:16:37components
0:16:39the envelope
0:16:43and the carrier
0:16:44of speech
0:16:50the other point here is that you get a certain
0:16:59robustness
0:17:03in the representation
0:17:07though of course you still have some problems where
0:17:15high energy components dominate as we can see
0:17:26so as i mentioned
0:17:31suppose the channel
0:17:35applies a different gain
0:17:40at different frequencies
0:17:43so the sub-band signal
0:17:45gets divided or multiplied by a scale factor
0:17:49and this is just a constant in each band
0:17:54it is different at different frequencies
0:17:57depending on the frequency response of the
0:17:59channel
0:18:00and if you look
0:18:03at the all-pole model of the sub-band
0:18:07you see that this scale factor just ends up
0:18:11in the gain of the model
0:18:14and the gain is what we essentially just ignore
0:18:19in this representation
0:18:22so
0:18:23whatever frequency dependent scaling the channel applies to
0:18:28the signal the envelopes we keep stay the same
0:18:36so
0:18:37the representation is
0:18:41to that extent
0:18:44invariant to such convolutive distortions and somewhat
0:18:46more robust in the presence of
0:18:49noise as well
0:18:53that is the basic idea
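To see what ignoring the gain buys, here is a tiny illustration built on the fdlp_envelopes sketch above; the white-noise test signal and the 3.7 gain are arbitrary assumptions, and a frequency dependent gain would scale each band separately with the same conclusion.

```python
# A per-band scale factor introduced by the channel only changes the gain of
# the all-pole model, so normalizing the gain out makes the envelopes
# invariant to it.
import numpy as np

def normalized_envelopes(signal, **kwargs):
    envs = fdlp_envelopes(signal, **kwargs)          # sketched earlier
    return envs / envs.mean(axis=1, keepdims=True)   # discard per-band gain

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)                        # one second at 8 kHz
y = 3.7 * x                                          # crude "channel" gain
print(np.allclose(normalized_envelopes(x), normalized_envelopes(y)))  # True
```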
0:19:00so basically what we do is the following
0:19:19the first thing is that
0:19:21the speech
0:19:23is analyzed
0:19:24in these different frequency ranges
0:19:29to find the envelopes
0:19:36as i showed
0:19:40and then we want to be able to use the conventional speaker recognition techniques which
0:19:47you
0:19:48all know and so on
0:19:50so then we just
0:19:52go to something which is smooth
0:19:54that is cepstra essentially
0:19:57this way at each time instant we take the frequency
0:20:02spectrum a slice across the bands
0:20:04and fit a spectral model over it
0:20:09and then you do this
0:20:13frame by frame
0:20:14over the time-frequency plane
0:20:17and this gives the features
0:20:21where the gain has already been removed
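One way to realize the second autoregressive stage just described, sketched below: for each time instant, the slice across frequency of the envelope matrix is treated as a power spectrum, an all-pole model is fitted across frequency, and its cepstrum is taken for a conventional back-end. This is not the authors' code; the model order, number of cepstra and the even-symmetric extension are assumptions, and with only a few bands the order must stay small.

```python
# Second AR stage of a two-dimensional autoregressive feature extraction:
# all-pole modelling across frequency of every time slice, then cepstra.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_from_autocorr(r):
    """A(z) = [1, -a1, ..., -ap] and prediction-error gain from r[0..p]."""
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a)), r[0] - a @ r[1:]

def lpc_to_cepstrum(a, gain, n_ceps):
    """Standard recursion from predictor coefficients to cepstral coefficients."""
    p = len(a) - 1
    c = np.zeros(n_ceps)
    c[0] = np.log(max(gain, 1e-10))
    for n in range(1, n_ceps):
        c[n] = (-a[n] if n <= p else 0.0) + sum(
            (k / n) * c[k] * (-a[n - k]) for k in range(max(1, n - p), n))
    return c

def two_d_ar_cepstra(envelopes, order=6, n_ceps=13):
    """Cepstra from the spectral slices of an FDLP envelope matrix (bands x time)."""
    feats = []
    for t in range(envelopes.shape[1]):
        spec = envelopes[:, t]                        # power across the sub-bands
        sym = np.concatenate([spec, spec[-2:0:-1]])   # even-symmetric extension
        r = np.fft.ifft(sym).real[:order + 1]         # autocorrelation of the slice
        a, g = lpc_from_autocorr(r)
        feats.append(lpc_to_cepstrum(a, g, n_ceps))
    return np.array(feats)
```

The per-frame cepstra can then go into whatever classifier one already has, which is the point the talk makes about reusing the existing machinery.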
0:20:36this is much longer
0:20:42than the conventional very short window
0:20:48[largely inaudible]
0:21:31[inaudible]
0:23:24i was hoping to get to that
0:23:30[partly inaudible]
0:23:36i think this distinction is expressed here
0:23:40but at the same time
0:23:42the features and the classifier for speech recognition as opposed to a classifier for speaker recognition
0:23:49use all the knowledge they can they take advantage of the fact that different
0:23:55speech sounds
0:23:57are handled by different parts of the model and so on and so on
0:24:00it is interesting
0:24:02that speaker recognition doesn't take advantage of that as somebody was pointing out
0:24:09so yeah that is
0:24:12[inaudible]
0:24:48no
0:24:51it's such that every utterance is about a sentence so we just take the whole
0:24:57utterance
0:24:58if we had a lot of speech we could chop it into segments of say
0:25:03five seconds and then depending on the length of the segment we would choose the
0:25:08order of the model roughly how many poles per second we expect
0:25:14so we model that whole segment of the signal
0:25:19before that if you take the signal you just remove the mean
0:25:23that is typically the first step
0:25:26[partly inaudible]
0:25:31so we use the whole sentence
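A tiny sketch of the preprocessing described in this answer: chop the recording into fixed-length segments, remove the mean of each, and let the model order grow with the segment length. The segment length and the poles-per-second value are assumed, illustrative numbers.

```python
# Segment-wise preprocessing before FDLP-style analysis: mean removal and a
# model order proportional to the segment duration (poles per second).
import numpy as np

def segments_for_analysis(signal, sr=8000, seg_seconds=5.0, poles_per_second=40):
    seg_len = int(seg_seconds * sr)
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        seg = np.asarray(signal[start:start + seg_len], dtype=float)
        seg -= seg.mean()                              # remove the mean first
        order = int(poles_per_second * seg_seconds)    # order scales with length
        yield seg, order
```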
0:25:35[inaudible]
0:25:51i didn't say exactly that
0:25:54what i said is that what might be interesting for the speaker is that when you run
0:25:57this process
0:25:59it decomposes the signal into different components and yeah
0:26:05the envelope was the one which was used here
0:26:09the carrier component was the one which was thrown away
0:26:11but what i found when i listened to it
0:26:14was that it sounded like it had lost the information about the message
0:26:21but what it still had was
0:26:25just
0:26:26some information about the speaker
0:26:30i don't think it is identical to the original
0:26:34the original signal
0:26:36the other part the envelope is used as it is for speech recognition
0:26:41that component
0:26:43we just use it as the speech signal for the utterance
0:26:47our phoneme recognizers get what was it fifty five percent
0:26:53about fifty percent accuracy
0:26:55so the machine can understand it to
0:26:59some extent with respect to recognizing phonemes
0:27:02somebody you know
0:27:17i mean what is happening at the top is that all the formants are gone
0:27:24and everything is flat
0:27:26and it is
0:27:28a bit
0:27:31surprising
0:27:38the point is not that it is useful as it is
0:27:43but that it also says something about the speaker
0:27:48[inaudible]
0:27:51oh of course i mean i can see that of course yeah so that might be right
0:27:58[inaudible]
0:28:12of course we fuse i mean you know again in all these cases you
0:28:17ask and you end up with fusion right you fuse things together i once tried
0:28:22to write a paper as a matter of fact it was on speaker recognition
0:28:26which was called towards decreasing error rates
0:28:29and one of the reviewers
0:28:32said that with the fusion used here
0:28:35it is not clear what is doing what
0:28:38and the paper was rejected so i have a saying about you know if
0:28:43you are working on something new
0:28:46of course if you use it on its own it is very likely that your
0:28:50performance
0:28:51compared to the others
0:28:52degrades that's why the new paper should have been towards increasing error rates
0:28:56but now since we have these huge systems
0:28:58and that was fifteen years ago and people started working on fusion
0:29:02if you just go for a different source if you have a
0:29:07different source of information you are very likely to get an improvement after fusion you'll see
0:29:13that and that's why you should research new things
0:29:18with the fusion you are very unlikely to increase error rates it will be all right
0:29:22if you want to do something new it doesn't work on its own and you put it with what
0:29:26works
0:29:27and you can present it at the conference
0:29:31[inaudible]