0:00:07our idea
0:00:08i am
0:00:08writing saving
0:00:10uh
0:00:11one of the
0:00:12research directions that we are working on at the University of Eastern Finland
0:00:17is concentrated on
0:00:19new features for
0:00:21speaker recognition
0:00:23uh
0:00:23it could be i have a
0:00:25line
0:00:25three in our laboratory that we are working on um
0:00:28oh
0:00:29let me say
0:00:29i'm a little meeting on and working on it
0:00:32and new features exploring area of new features for
0:00:35the speaker recognition
0:00:37now the current work is the application of
0:00:41some sort of features named
0:00:44weighted linear prediction features for
0:00:46speaker recognition
0:00:48and our work
0:00:50is
0:00:50done jointly with
0:00:52the group of Paavo Alku at Helsinki University of Technology, which nowadays is
0:00:57Aalto University
0:00:59let me just say that
0:01:01i'm not pretending that i know
0:01:03what is happening inside
0:01:06weighted linear prediction, i'm just
0:01:09presenting my understanding of
0:01:12what weighted linear prediction is
0:01:15and
0:01:15someone from that group is here
0:01:19to help me if
0:01:20i cannot describe something to you
0:01:23then
0:01:24so the concept
0:01:26we have
0:01:27sometimes the customers or let's say the users of
0:01:30speaker recognition technology
0:01:32want to use
0:01:33speaker recognition when they are
0:01:36outside of
0:01:37environment
0:01:38but we also have other users of speaker recognition technology who want to use it
0:01:44in any environment, in a high energy
0:01:46noise environment, in the street
0:01:48or with
0:01:48factory
0:01:49noise or whatever
0:01:50they are not in an office environment
0:01:52or the way to control
0:01:54uh and wonderful
0:01:55speech record
0:01:56then we are interested to know
0:01:58how our speaker recognition systems
0:02:00could
0:02:01degrade in performance
0:02:03by having this type of
0:02:04additive noise
0:02:08well
0:02:08just to describe what is our focus
0:02:11in this work, here is
0:02:15a typical speaker recognition system with different
0:02:18phases and different modules
0:02:20but our focus
0:02:22in this study is to
0:02:23see
0:02:24how feature extraction
0:02:25could affect
0:02:26speaker recognition
0:02:28the overall speaker recognition performance
0:02:32how typically we are doing
0:02:34feature extraction
0:02:36is that
0:02:36we window the frames
0:02:38we do spectrum estimation
0:02:40compute the MFCCs, do RASTA
0:02:42filtering
0:02:43appending delta and double delta
0:02:45frame dropping according to energy
0:02:47and cepstral mean and variance normalisation, this is something typical, we get a
0:02:52thirty six dimensional feature vector that we have in our experiments but
0:02:56uh this is just based on problems
0:02:57is that we have
0:02:59then
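[editor's sketch] The post-processing steps listed above (appending delta and double delta, frame dropping according to energy, cepstral mean and variance normalisation) can be sketched as follows. This is only an illustration under stated assumptions: the talk gives the final dimension as thirty six, and the split into 12 static coefficients plus their deltas and double deltas is assumed here, not stated. The function names, the 30 dB energy floor and the random stand-in data are hypothetical, and RASTA filtering is left out.

```python
import numpy as np

def add_deltas(static):
    """Append first and second time differences (delta, double delta) to each frame."""
    delta = np.gradient(static, axis=0)
    double_delta = np.gradient(delta, axis=0)
    return np.hstack([static, delta, double_delta])

def drop_low_energy(feats, frame_energy, floor_db=30.0):
    """Keep frames whose energy is within floor_db of the loudest frame (assumed rule)."""
    level_db = 10.0 * np.log10(frame_energy / np.max(frame_energy) + 1e-12)
    return feats[level_db > -floor_db]

def cmvn(feats):
    """Cepstral mean and variance normalisation over the whole utterance."""
    return (feats - feats.mean(axis=0)) / np.maximum(feats.std(axis=0), 1e-12)

# Random stand-ins for real data: 200 frames of 12 static coefficients (assumed split).
rng = np.random.default_rng(0)
static = rng.standard_normal((200, 12))
frame_energy = rng.uniform(0.5, 1.0, 200)

features = cmvn(drop_low_energy(add_deltas(static), frame_energy))
print(features.shape)
```

Only the step that produces the static coefficients changes when the spectrum estimator is swapped; the post-processing stays the same.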
0:03:00now the question is
0:03:03we are all the time using FFT
0:03:05to make the spectrum but is it really
0:03:08the best way
0:03:09that we can do it
0:03:10or
0:03:11another question is
0:03:12is it really that robust in additive noise conditions
0:03:18then going to LP
0:03:19it is
0:03:20something
0:03:21well known that
0:03:22estimating the spectrum could be done by
0:03:25linear prediction
0:03:27or by FFT estimation
0:03:28and they are
0:03:30just alternative ways to estimate the spectrum
0:03:34nobody says that
0:03:36LP is better for speaker recognition or
0:03:39FFT is better for
0:03:40speaker recognition or even any
0:03:42other
0:03:42speech processing application
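[editor's sketch] The two alternatives named above can be put side by side. The sketch below assumes the standard autocorrelation method for LP and a plain periodogram for the FFT spectrum. The function names, the model order and the FFT size are illustrative choices, not values taken from the talk.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Autocorrelation-method LP: solve R a = r for the predictor a[1..order]."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    gain = r[0] - np.dot(a, r[1:])  # energy of the prediction residual
    return a, gain

def lp_power_spectrum(a, gain, n_fft=512):
    """All-pole spectrum gain / |A(e^jw)|^2 with A(z) = 1 - sum_k a_k z^-k."""
    A = np.fft.rfft(np.concatenate([[1.0], -a]), n_fft)
    return gain / np.abs(A) ** 2

def fft_power_spectrum(frame, n_fft=512):
    """Periodogram of an (already windowed) frame."""
    return np.abs(np.fft.rfft(frame, n_fft)) ** 2
```

Both functions return a power spectrum on the same frequency grid, so either one can feed the same filterbank and cepstral steps.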
0:03:45and that
0:03:47now
0:03:47we are trying
0:03:48to
0:03:49see
0:03:51what is the performance of FFT
0:03:54LP
0:03:54and now introducing WLP
0:03:57the WLP
0:03:58is
0:03:59just
0:04:00targeted to put more stress
0:04:03on
0:04:03some regions of
0:04:04speech that
0:04:07let me say have
0:04:09more energy
0:04:10here we have
0:04:13we are weighting the
0:04:15energy of the error
0:04:16by a weighting function and where the weighting function comes from
0:04:20is that we are
0:04:22computing the
0:04:28yeah we are
0:04:30we are computing the
0:04:32weighting function as
0:04:34the immediate energy of the signal
0:04:37before the
0:04:38current sample, something like M samples before the current sample
0:04:41and put it as the
0:04:43weighting function
0:04:44where we are estimating the current
0:04:46sample based on the previous ones
0:04:49in this way
0:04:50it's possible
0:04:52to again set the derivatives of that
0:04:54weighted
0:04:54error with respect
0:04:56to the
0:04:57estimator
0:04:57coefficients to zero and
0:04:59it leads to the normal
0:05:01equations
0:05:02and we find
0:05:03the weights
0:05:04of the predictor
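[editor's sketch] The procedure just described can be written down under some assumptions. The error being minimised is taken to be E = sum_n w[n] * (s[n] - sum_k a[k] * s[n-k])^2, with the short time energy weight w[n] = s[n-1]^2 + ... + s[n-M]^2. Setting dE/da[j] = 0 gives the normal equations sum_k a[k] * sum_n w[n] s[n-k] s[n-j] = sum_n w[n] s[n] s[n-j], which the code solves directly. The function names, the zero padding before the frame start and the restriction of the sum to the frame itself are simplifications made here, not details confirmed by the talk.

```python
import numpy as np

def ste_weights(s, M):
    """w[n] = s[n-1]^2 + ... + s[n-M]^2, with samples before the frame taken as zero."""
    sq = np.concatenate([np.zeros(M), s ** 2])
    csum = np.concatenate([[0.0], np.cumsum(sq)])
    n = len(s)
    return csum[M : M + n] - csum[:n]

def wlp_coefficients(s, order, M, eps=1e-9):
    """Weighted LP: minimise sum_n w[n] * e[n]^2 by solving the weighted normal equations."""
    n = len(s)
    w = ste_weights(s, M) + eps  # eps keeps the system solvable for silent stretches
    padded = np.concatenate([np.zeros(order), s])
    # delayed[k, i] holds s[i - k] (zero when i - k < 0)
    delayed = np.stack([padded[order - k : order - k + n] for k in range(order + 1)])
    R = (delayed[1:] * w) @ delayed[1:].T  # sum_n w[n] s[n-k] s[n-j]
    r = (delayed[1:] * w) @ delayed[0]     # sum_n w[n] s[n]   s[n-j]
    return np.linalg.solve(R, r)
```

With a constant weight the same equations reduce to ordinary least squares linear prediction, so the weighting function is the only thing that separates WLP from LP in this sketch.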
0:05:05and
0:05:06it is
0:05:07maybe the history goes back to nineteen
0:05:09seventy five and after that
0:05:11again activated in nineteen ninety three
0:05:13the weighted linear prediction
0:05:16but
0:05:17let's see
0:05:18why
0:05:19we are choosing the STE
0:05:21short time energy
0:05:22for the weighting function of WLP
0:05:26it can be true that
0:05:27the regions of
0:05:28speech that have high energy
0:05:31are less contaminated with additive noise
0:05:34and uh
0:05:35it is a
0:05:36something some uh some sort of
0:05:38five
0:05:39but it is known we can have
0:05:42better estimation of the spectrum in the regions of
0:05:44speech that are
0:05:45less
0:05:46corrupted by noise
0:05:47and these regions of
0:05:48speech
0:05:49with
0:05:49higher
0:05:50short time
0:05:51energy
0:05:53it corresponds also
0:05:55to the region of the, i mean
0:05:57when you're talking, the regions of speech that have
0:05:59higher
0:06:00short time energy
0:06:02also correspond to the regions
0:06:04where
0:06:05our
0:06:06glottis
0:06:07is closed
0:06:08and the
0:06:09vocal tract system is disconnected
0:06:11from the rest of the speech production
0:06:13system
0:06:14and
0:06:14in this case we have some standing wave inside our vocal tract
0:06:19where
0:06:19if we want to compute
0:06:21formants of
0:06:22the speech signal
0:06:23we can have more prominent
0:06:25formant
0:06:26estimation
0:06:27of that
0:06:28speech signal
0:06:31well
0:06:32now what is the problem with WLP
0:06:35the LP normal equations are guaranteed to lead
0:06:38to a
0:06:39stable filter when we are
0:06:41computing the coefficients of the predictor
0:06:44now the problem with WLP is that it is not guaranteed
0:06:47to lead to a stable filter
0:06:49and this is a problem in
0:06:50speech synthesis
0:06:50for example
0:06:52so how we can
0:06:53what we can do
0:06:55is that
0:06:57instead of using
0:06:58some sort of
0:06:59weighting function
0:07:00we can decompose it into partial weights
0:07:02and apply them
0:07:04in
0:07:04this way
0:07:05to the estimator
0:07:06of the
0:07:09current sample
0:07:10and
0:07:10in this way
0:07:11we can come
0:07:12to such equations
0:07:14that are derived
0:07:15in the paper by Magi
0:07:17and
0:07:20they describe
0:07:21the behaviour of the
0:07:23total weight
0:07:24i mean these partial weights
0:07:26in the way
0:07:27that the
0:07:28final estimator coefficients
0:07:31should be in such a way that they lead to
0:07:34a stable filter
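[editor's sketch] The stability property discussed above can be checked numerically for any set of predictor coefficients. The sketch below is not the stabilised method itself, whose details are in the paper the speaker refers to. It only tests whether the all-pole filter 1 / A(z), with A(z) = 1 - sum_k a_k z^-k, has all its poles inside the unit circle. The function name is illustrative.

```python
import numpy as np

def is_stable(a):
    """True if every pole of 1 / A(z), A(z) = 1 - sum_k a_k z^-k, lies inside the unit circle."""
    # Multiplying A(z) by z^p gives z^p - a_1 z^(p-1) - ... - a_p, whose roots are the poles.
    poles = np.roots(np.concatenate([[1.0], -np.asarray(a, dtype=float)]))
    return bool(np.all(np.abs(poles) < 1.0))

print(is_stable([1.3, -0.6]))  # complex pole pair with magnitude sqrt(0.6)
print(is_stable([2.5, -1.0]))  # real poles at 2.0 and 0.5
```

A check like this could be run on predictor coefficients from any of the estimators to count how often an unstable filter comes out.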
0:07:36well
0:07:37i'm
0:07:38still not
0:07:38understanding completely what's happening here but in this paper
0:07:42it is described
0:07:44and for more details
0:07:46please
0:07:46you can refer to that
0:07:48paper
0:07:50well here
0:07:51i'm showing a
0:07:52frame and
0:07:53its spectrum estimation, it is a
0:07:56voiced
0:07:56frame
0:07:57from NIST two thousand
0:07:59two, sorry
0:08:01and the
0:08:04the same frame
0:08:05that we contaminated with factory noise
0:08:07with zero dB snr
0:08:09it is
0:08:11let me think obvious that
0:08:14when we are doing the
0:08:16spectrum estimation of the noisy signal
0:08:18there are
0:08:19some problems
0:08:20that
0:08:21are mainly caused
0:08:22by
0:08:25the noise signal and
0:08:26how it affects
0:08:27depends on the snr level, it depends on the noise that is added to the
0:08:31sample
0:08:32and to give you just more intuition what is
0:08:36zero dB factory noise i have here
0:08:38a
0:08:39speech file, just the
0:08:41speech file that
0:08:42we took this
0:08:43frame
0:08:43from, the whole
0:08:44speech file
0:08:49a little
0:08:51it'll the other way
0:08:54we go real but i don't know what or something
0:08:58yeah it was a clean sample from NIST two thousand
0:09:01two
0:09:01test set
0:09:03yeah
0:09:04yeah
0:09:05yeah
0:09:05yeah
0:09:06the other way
0:09:07the remote
0:09:09really
0:09:10or
0:09:11yeah
0:09:12and
0:09:13the same piece
0:09:13that we contaminated with zero dB
0:09:16additive noise
0:09:17well factory
0:09:19well
0:09:20now it shows
0:09:21what is really
0:09:23meant by zero dB
0:09:25snr
0:09:26yeah
0:09:27yeah
0:09:28yeah
0:09:28yeah
0:09:30coming to some results
0:09:33let me think, of the four
0:09:35spectrum estimation methods
0:09:37that we are talking about
0:09:38and using the NIST two thousand two
0:09:40corpus we had known or has some other type of
0:09:43speaker detection
0:09:44and using factory noise at
0:09:46zero dB
0:09:48snr
0:09:49here we can see that
0:09:50the methods are mainly grouped into
0:09:53the FFT method
0:09:54after that
0:09:54LP
0:09:55and let me say the weighted
0:09:57LP group
0:09:59WLP itself
0:09:59and
0:10:00SWLP
0:10:02yeah
0:10:03i should mention that NIST two thousand two
0:10:05is
0:10:07a database collected
0:10:12from
0:10:13mobile handsets mainly
0:10:15and it includes
0:10:16inside
0:10:17some convolutional noise and some additive noise
0:10:20although we are adding too much noise
0:10:23ourselves
0:10:25yeah
0:10:26we can
0:10:27see
0:10:27that there is really some difference between the performance of
0:10:31these features
0:10:32in additive noise environment
0:10:36we also tried
0:10:38just
0:10:38let me say one
0:10:39very famous
0:10:40speech enhancement method
0:10:42and
0:10:43used it as
0:10:44just an added block
0:10:46in our feature extraction
0:10:47to see what
0:10:48really one simple
0:10:50speech enhancement
0:10:51method
0:10:52adds to
0:10:53a speaker recognition system in additive noise environment
0:10:56and
0:10:57looking at the results
0:10:58it shows that yes there is
0:11:00some
0:11:01good improvement
0:11:03based on
0:11:04having speech
0:11:06enhancement or let's say the spectral
0:11:08subtraction algorithm
0:11:10but
0:11:11these results
0:11:12although they are much different
0:11:14i should say that
0:11:15our
0:11:18noise
0:11:19is
0:11:19stationary mostly
0:11:20and in the real world it is
0:11:23not really the case
0:11:26coming to
0:11:27some
0:11:27more recent data
0:11:29we wanted to
0:11:30see
0:11:31if these results from NIST two thousand two generalise to NIST two thousand
0:11:35eight and maybe NIST two thousand
0:11:37ten because
0:11:38we were one of the sites in the I4U submission to NIST
0:11:41two thousand
0:11:42ten SRE and this was
0:11:44our
0:11:44base system, i mean the contribution of our
0:11:47University of Eastern Finland was
0:11:49trying some
0:11:50new features
0:11:52for speaker recognition
0:11:54looking at the results
0:11:56let me see
0:11:57just
0:11:58somehow
0:11:59how them
0:12:00group
0:12:02the system here is
0:12:03'cause that's where we are with
0:12:05an A P
0:12:06and the condition is
0:12:07eight content second if you ask me why it contents that can be selected for
0:12:11evaluation 'cause i was working on a forecast for
0:12:14the speaker recognition and this was something
0:12:17well let me say
0:12:18somehow it has some metric nice
0:12:20how to and i selected here
0:12:22for the presentation but
0:12:23if uh
0:12:24we
0:12:24look at the other
0:12:26core test
0:12:26also
0:12:27they have this
0:12:28same
0:12:29uh interpretation
0:12:32looking at the results without NAP
0:12:35it says that
0:12:37SWLP
0:12:38based results
0:12:39are improving
0:12:41the DET curve
0:12:42in all the
0:12:44area, if
0:12:45i read the results correctly
0:12:48min dcf, equal error rate
0:12:51SWLP is improving compared to
0:12:55mfcc, here the red curves are for
0:12:57males, the blue are for females and the green one is for
0:13:03let me say
0:13:04all trials male and female
0:13:07coming to the results
0:13:09with NAP
0:13:10the effect of using SWLP
0:13:13is somehow rotating the DET curve in some sense because
0:13:17min dcf
0:13:17is getting better but
0:13:19equal error rate
0:13:20gets a bit worse
0:13:22but
0:13:22if
0:13:25you ask
0:13:25why it is
0:13:26happening
0:13:27i have no idea right now
0:13:29we just applied
0:13:31SWLP and
0:13:33we saw
0:13:34this
0:13:34effect
0:13:35but
0:13:36coming to the interpretation of
0:13:37why it happens needs more study
0:13:41well
0:13:43i think
0:13:44yes
0:13:45this was the point that i want to
0:13:46oh
0:13:47thank you
0:13:57okay questions we have the whole question could've
0:14:14just click less know that yes signal to noise ratio you use on the inside
0:14:19T
0:14:20no matter
0:14:21yeah and you also had yeah we'll deal
0:14:25in the you mentioned that and that you know performance was supported in the two hundred zero D B
0:14:32yes
0:14:33um
0:14:34my question is how did you measure the signal to noise ratio because you know
0:14:40what do you
0:14:42it sounded as yeah maybe it's me maybe
0:14:46he
0:14:46other people may not agree with the
0:14:48i thought i think that's the most signals so therefore maybe that's not zero D B maybe i did that
0:14:54idea
0:14:55women tend either minus ten
0:14:57higher
0:14:58i don't noise in it
0:14:59and then i mean uh yeah yeah uh i thought
0:15:02the audio you played
0:15:05that you called zero dB
0:15:07sounded is in
0:15:08the signal is only the stronger than uh you know zero D B situation
0:15:12well because i suspected that somebody will ask, i have here exactly the matlab code that we
0:15:17used to get it
0:15:18i can interpret here that we are measuring the energy of every frame of the
0:15:23speech signal and averaging them
0:15:25over the
0:15:26signal and over the noise and putting the desired
0:15:30snr
0:15:32snr here
0:15:34to get the gain
0:15:35and then
0:15:36we add the signal together with
0:15:38the noise
0:15:39and we get a signal with the
0:15:42average snr
0:15:45so
0:15:47you are meddling signal to noise
0:15:49yeah
0:15:49so by using that intense
0:15:52the
0:15:53uh rather than uh no i'm pretty you know the
0:15:57yeah, framing the signal and measuring the energy of the
0:16:00frames
0:16:01and averaging them over the
0:16:03signal
0:16:03and
0:16:05finding the
0:16:06relative gain between the noise and signal
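[editor's sketch] The mixing procedure described in this answer (frame both signals, average the frame energies, find the relative gain for the desired snr, add the scaled noise) can be sketched as below. The speaker mentions Matlab code that is not reproduced in the transcript. This Python version is an independent illustration, and the function names and the frame length of 200 samples are assumptions.

```python
import numpy as np

def mean_frame_energy(x, frame_len=200):
    """Average energy over non-overlapping frames (any trailing partial frame is ignored)."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(np.sum(frames ** 2, axis=1))

def mix_at_snr(speech, noise, snr_db, frame_len=200):
    """Scale the noise so the average frame energy ratio equals snr_db, then add it."""
    noise = noise[: len(speech)]  # assumes the noise recording is at least as long
    e_speech = mean_frame_energy(speech, frame_len)
    e_noise = mean_frame_energy(noise, frame_len)
    gain = np.sqrt(e_speech / (e_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain
```

Because the energies are averaged over all frames, including low energy ones, the noise can dominate in quiet stretches of the file even when the average ratio is zero dB.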
0:16:24i don't see any difference in these
0:16:26ah
0:16:29what kind of difference do you expect to see
0:16:31well that's noisy, i expected the spectrum to be
0:16:35flatter and then filled in with
0:16:37noise
0:16:38i mean this is a
0:16:39this looks
0:16:39because
0:16:41yeah this depends on the noise
0:16:43because this is factory
0:16:44noise, the noise that we use here
0:16:47is just factory noise
0:16:48i
0:16:48these
0:16:48right
0:16:49just
0:16:50uh i had these
0:16:50type of behaviour we just selected one right
0:16:53the effect of noise is not the same for all frames maybe you're right because
0:16:57the I S P X
0:16:58right
0:16:58but i think that by increasing the noise level the spectrum
0:17:02gets flatter and flatter and we are losing the information
0:17:06in the spectrum, but this is just a typical example to show
0:17:10how it works
0:17:28the other questions
0:17:30two questions
0:17:31we have it or not
0:17:32but maybe one more comment, that
0:17:35we used this SWLP in conjunction with mfcc and other features
0:17:40in the I4U submission
0:17:42is that we
0:17:43right somehow evaluated our system
0:17:45or just yeah he uh
0:17:47feature
0:17:48and then
0:17:48i mean
0:17:49score four
0:17:50so
0:17:50subsystem
0:17:51they use the other side
0:17:53sensing i for you and taking
0:17:55uh uh let me say
0:17:57uh
0:17:58using uh me
0:17:59having
0:18:00this type of
0:18:01them that they are
0:18:02uh ultra wide
0:18:04beat
0:18:05S A P
0:18:06a different type of
0:18:07score
0:18:08speaker
0:18:10or just
0:18:11one of the assumptions
0:18:13for your model is that the
0:18:15higher energy
0:18:17observations in the signal
0:18:19are the most reliable, right
0:18:21that's right, but as the energy of the noise is increasing
0:18:25they could be really just
0:18:28coloured by the noise
0:18:30okay i just
0:18:31the the other side of the uh the the body is also
0:18:35uh if you don't um
0:18:37uh
0:18:37the situation where you getting distortions because
0:18:40uh
0:18:40or are driving the channel for example
0:18:43um and it may be the case where the signal is actually one time
0:18:48and then you
0:18:49the
0:18:49could be
0:18:50um but maybe another
0:18:53indicated
0:18:53silver jews
0:18:54the work
0:18:54syllable are energy
0:18:56um observations
0:18:58well in this case you're right
0:19:00we don't know exactly what will happen if the signal is distorted
0:19:03by channel, by recording device or
0:19:07what about this
0:19:08formance us are just somehow done with the uh
0:19:12sounds and that
0:19:13uh
0:19:13all the signal exactly but if you ask me what will happen if all the signals here
0:19:18i will say that
0:19:19uh i think
0:19:20after the spectral L P spectrum that all
0:19:23fig
0:19:23the same way that we hope U S we'll get the fate
0:19:28really thank you very much