Hello. It is my great pleasure today to present our work on speaker detection in the wild: lessons learned from JSALT 2019. I would like to thank, first of all, of course, everyone who made this work possible.
So let's start. What data do we have around us? We have plenty of devices, like smartphones and recorders, and we can even get information from social media. We gather data and use it for downstream tasks. However, this data needs to be labeled to be useful, and with this labeling we can perform speaker detection.
One of our very first experiments was to use brute force, and it became the motivation to use diarization afterwards. We take the speech recording and obtain homogeneous segments from it. From those segments we compute embeddings, and we compare those embeddings with the target speaker's embedding to obtain a result. But then we added diarization, extracted the segments that belong to the same speaker, and obtained better results. So it was a good finding to do it this way.
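To make the brute-force idea concrete, here is a minimal sketch, with a generic embedding extractor `embed` and cosine scoring as hypothetical stand-ins for the real components (an illustration, not the actual JSALT code):

```python
# Minimal sketch of brute-force detection: embed every segment and compare
# each embedding against the target speaker's embedding. `embed` is a
# placeholder for any speaker-embedding extractor, e.g. an x-vector network.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_detect(segments, target_embedding, embed, threshold=0.5):
    """Return (is_present, scores): True if any segment matches the target."""
    scores = [cosine(embed(seg), target_embedding) for seg in segments]
    return max(scores) >= threshold, scores
```

With diarization in front, we would instead pool the segments that belong to the same speaker and score one embedding per speaker, which is what gave the better results.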
So this is the big picture, the whole pipeline. We have a recording and we are looking for John. The first stage is to apply voice activity detection; that means getting rid of all the silence. The second stage is to perform speaker type classification; that means tagging all the segments according to gender, or whether it is a kid or an adult, or even whether it is TV. Then comes speaker diarization, which answers the question of who spoke when, so it gathers together the segments that belong to the same speaker. Speaker detection answers the question of whether we have John in any segment, so it is a binary decision. And then we can look for John along the recording with speaker tracking.
to follow this type of and if we have challenges in are used as a
cocktail party that there is no
if any we have five psnr in the answer again is
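Put as a sketch, the staged pipeline chains together roughly like this; each stage is passed in as a callable because the real components are full models of their own (all names are hypothetical placeholders, not a toolkit API):

```python
# Hypothetical sketch of the five-stage pipeline described above.
def find_speaker(recording, target, vad, classify_types, diarize, detect, track):
    speech = vad(recording)            # stage 1: drop the silence
    typed = classify_types(speech)     # stage 2: gender / kid / adult / TV
    turns = diarize(typed)             # stage 3: who spoke when
    if detect(turns, target):          # stage 4: binary decision, is John here?
        return track(turns, target)    # stage 5: where John speaks
    return []                          # target speaker not found
```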
So let's take a look at some of the numbers on the diarization side. Here we can observe the results obtained on the datasets we tried, based on the provided baselines. We can observe that CHiME-5, with its very long recordings, and BabyTrain got very bad results. We concluded that these bad results arise because we are dealing with far-field microphones, noisy speech, overlapping speech, condition mismatch, non-cooperative speakers, and a bias towards English speech. So we wanted to study these conditions.
Now let's see some numbers on speaker recognition. For speaker recognition we compared two systems on two datasets: the first one is the SRI dataset and the second one is VOiCES, and we are comparing a close-talking microphone against a far-field one. We can observe that for the far-field microphone the equal error rate doubles or worse.
Our main goal, then, was to research, develop and benchmark speaker diarization and speaker recognition systems for real speech, using single microphones in realistic scenarios that include background noises such as television, music, or other people talking.
One of the key characteristics of the data is the setting. Is it like this one, where you are having a meeting? Or is it completely wild, like the one in CHiME-5, where people gathered together to have a party? Or is it a day-long recording, five hours or even longer? Or do we have a far-field microphone in another room that is catching the voice of the speaker?
To cover all these types of data sets we included these four corpora, AMI, SRI, CHiME-5 and BabyTrain, going from the easiest one to the most difficult one. AMI is a meeting domain, and we used it for both diarization and detection. SRI is a semi-controlled domain; we used it only for detection, not for diarization, because we do not have complete labels for all the speakers. CHiME-5, a dinner-party domain, we used for diarization only; we did not use it for detection because it usually has only four speakers, which is too few. And BabyTrain we used for both diarization and detection; it is completely wild and uncontrolled.
The modules that we explored, as I said before, are the diarization and the speaker detection. From the diarization we get the labels for all the speakers, and with the speaker detection we can track the speaker of interest. This is the picture of the diarization: we have a traditional modular system composed of enhancement, VAD, embedding extraction, scoring, clustering, resegmentation and overlap assignment. We have two types of enhancement, one at the signal level and another one at the embedding level. The boxes shown in orange are the ones that we explored.
Let's start with the enhancement at the signal level. We used an SNR-progressive multi-target LSTM-based speech enhancement model. The progressive multi-target network, or PMT, is divided into sequentially stacked blocks, with one LSTM layer and one fully connected layer enabling multi-target learning per block. The fully connected layer in every block is designed to regress an intermediate speech target with a higher SNR than the previous target. The first several progressive ratio masks are concatenated with the progressively enhanced log-power spectral features as targets. At test time, the audio enhanced by this model is fed directly to the back-end systems.
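A minimal PyTorch-style sketch of the progressive idea follows; the layer sizes and number of blocks are illustrative assumptions, and for brevity each block regresses only a log-power spectral target, whereas the real system concatenates progressive ratio masks with the enhanced LPS features:

```python
import torch
import torch.nn as nn

class PMTBlock(nn.Module):
    """One progressive block: an LSTM layer plus a fully connected layer
    that regresses an intermediate target at a higher SNR than the last."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h, _ = self.lstm(x)
        return h, self.fc(h)  # h feeds the next block; fc(h) is this block's target

class PMTEnhancer(nn.Module):
    def __init__(self, feat_dim=257, hidden=512, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [PMTBlock(feat_dim if i == 0 else hidden, hidden, feat_dim)
             for i in range(n_blocks)])

    def forward(self, noisy_lps):     # (batch, frames, feat_dim)
        outputs, x = [], noisy_lps
        for block in self.blocks:
            x, y = block(x)
            outputs.append(y)         # each output trained on a cleaner reference
        return outputs
```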
Now that we have a cleaner signal, we can explore the VAD. In this case we have two directions: the one on the top is based on MFCCs, and the one on the bottom is based on filterbank features, followed by an LSTM and fully connected layers. The output is speech versus non-speech. It is important to note that the lower branch is the one that we chose for our work.
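As a rough sketch, such a frame-level classifier could look like the following, assuming filterbank inputs and illustrative dimensions (not the actual model configuration):

```python
import torch
import torch.nn as nn

class SimpleVAD(nn.Module):
    """Frame-level speech / non-speech classifier: a sketch of the lower
    branch described above; all sizes are assumptions."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64),
                                  nn.ReLU(),
                                  nn.Linear(64, 2))

    def forward(self, fbank):      # (batch, frames, feat_dim)
        h, _ = self.lstm(fbank)
        return self.head(h)        # per-frame logits: speech vs. non-speech
```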
Although this is not part of the initial stages, it is also true that the embedding network is related to the performance, as shown in the table. We explored the extended TDNN trained on VoxCeleb, with and without augmentation, and we also explored a factored TDNN, likewise with augmentation. We can see that the factored TDNN got the best results on BabyTrain and AMI, and it was essentially even on CHiME-5, so we chose the factored TDNN for our experiments.
Now let's focus on the speech enhancement at the feature level. The idea here is to train an unsupervised speech enhancement system which can be used as a front-end preprocessing module to improve the quality of the features before they are passed to the embedding extractor. The main idea is to use an unsupervised adaptation system based on CycleGANs. We train a CycleGAN network using log filterbank features as input to each of the generator networks, with a clean source signal on the left and the real-domain data on the right. During testing, we map the test data towards the clean target domain, and these enhanced acoustic features are then used by the x-vector extractors. Even though the CycleGAN network was trained for dereverberation, we also tested it on noisy data sets, showing improvements.
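For intuition, here is a heavily simplified sketch of the CycleGAN objective, with the generators and discriminators passed in as callables and the cycle weight set to a common but assumed value:

```python
import torch
import torch.nn.functional as F

def cycle_gan_losses(G_clean2real, G_real2clean, D_clean, D_real, clean, real):
    """Unsupervised objective: adversarial terms push each generator's output
    toward the other domain; cycle terms force a round trip back to the
    input, so no paired clean/degraded data is needed."""
    fake_clean = G_real2clean(real)   # real features mapped to the clean domain
    fake_real = G_clean2real(clean)   # clean features mapped to the real domain
    adv = (F.mse_loss(D_clean(fake_clean), torch.ones_like(D_clean(fake_clean)))
           + F.mse_loss(D_real(fake_real), torch.ones_like(D_real(fake_real))))
    cyc = (F.l1_loss(G_clean2real(fake_clean), real)
           + F.l1_loss(G_real2clean(fake_real), clean))
    return adv + 10.0 * cyc           # cycle weight is an assumed value
```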
Now let's continue with the overlap detection. This architecture might look familiar: it is exactly the same as the one used for the VAD, but now trained so that it decides between overlapped and non-overlapped speech. It can also be used to act as the VAD itself, although the dedicated VAD approach showed better results.
Let's continue with the overlap assignment. From the VB resegmentation we get a posterior matrix for each of the speakers; in this example the most probable speakers are rows one and two. We can combine this with the overlap detector and also with the VAD. Merging these results, we get what we call the overlap assignment: in the regions where the overlap detector tells us that there are two speakers, we put there the two most probable speakers.
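A small numpy sketch of this merging step, assuming a frame-by-speaker posterior matrix plus boolean VAD and overlap masks as inputs (all names hypothetical):

```python
import numpy as np

def overlap_assignment(posteriors, vad, overlap):
    """posteriors: (frames, speakers); vad, overlap: boolean (frames,).
    Assign the most probable speaker to each speech frame, and add the
    second most probable speaker wherever the overlap detector fires."""
    order = np.argsort(-posteriors, axis=1)   # speakers sorted by posterior
    labels = [[] for _ in range(len(posteriors))]
    for t in range(len(posteriors)):
        if not vad[t]:
            continue                          # silence: no speaker assigned
        labels[t].append(order[t, 0])         # most probable speaker
        if overlap[t] and posteriors.shape[1] > 1:
            labels[t].append(order[t, 1])     # second speaker in overlap regions
    return labels
```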
At this point our diarization system is complete. But now the question is: which combination of all these modules gives good results?
So in our case we put together the TDNN VAD, the enhancement, the VB resegmentation and the overlap assignment. For all the corpora we got nice improvements. For example, for AMI we went from 14.9 percent diarization error rate to 13 percent. For the CHiME-5 corpus, with the same combination, we went from 69 percent diarization error rate to 63 percent. And finally, for BabyTrain we got a nice improvement, from 85 percent diarization error rate to 47 percent. It is important to note here that the TDNN VAD and the VB resegmentation really improved the system.
This is the speaker detection pipeline. We have the enhancement at the signal level and also at the embedding level, the diarization segmentation, the embedding extractor, the back-end, the calibration, and finally the speaker detection decision. The boxes in orange use the same techniques as in diarization. So we use the enhancement at two levels, the signal level and also the embedding level. The diarization segmentation is fed into the embedding extractor, and the pipeline continues from there. The embedding extractor, as we emphasized before, is a factored TDNN, which is what got the best results for speaker identification; we also used an enhancement module for this embedding extractor. Finally we have the back-end and the calibration: the back-end is a PLDA fed by the diarization output, with augmentation, and the calibration stage leads directly to the speaker detection decision.
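As a rough sketch of that final decision step, with cosine scoring standing in for the PLDA back-end and a fixed affine map standing in for the trained calibration (both simplifying assumptions):

```python
import numpy as np

def detect_speaker(cluster_embeddings, enroll_embedding, a=1.0, b=0.0, thr=0.0):
    """Score each diarized-cluster embedding against the enrollment, apply
    an affine calibration (a, b would normally be trained, e.g. by logistic
    regression), and let the best cluster decide presence."""
    def cosine(x, y):
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
    scores = [a * cosine(e, enroll_embedding) + b for e in cluster_embeddings]
    best = max(scores)
    return best >= thr, best
```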
The combination that gave the best results for all our corpora includes speech enhancement, spectral augmentation, and the PLDA with augmentation. It is important to note that all of these include the diarization as their first stage. So for AMI we got an improvement, going from 17 percent equal error rate to 2 percent equal error rate; in terms of minDCF and actual DCF, shown at the bottom, we can also see an improvement. For BabyTrain we can observe the same trend, going from 14 percent equal error rate to 9 percent equal error rate; at the bottom we can observe the minDCF and the actual DCF, where the minDCF improved but the actual DCF did not. For the SRI data our system also improved the results, going from 21 percent equal error rate to 16 percent equal error rate, with the minDCF and the actual DCF following the same trend and improving as well.
Finally, some takeaways I would like to mention. Diarization is a fundamental stage for performing speaker detection. There are some modules that are really needed to have a competitive system: of course a good enhancement, a good VAD, good embeddings, and overlap detection and assignment. Speaker detection depends not only on the diarization model, but also on the embedding extractor and the augmentation. The future directions of this work are as follows. For signal-to-signal enhancement and speaker separation we need some customization; it could be by dataset, by speaker, or by task. For the speech enhancement we have to explore other architectures, such as transformers, and large-scale training. For the VAD we need ways to handle domain mismatch, which could be done for example with domain-adversarial training. For the clustering we need unsupervised adaptation, taking the overlap into account during clustering, and also including the transcription in parallel with the speaker embeddings. And for the speaker detection, we need enhancement for the multi-speaker scenario; that means highlighting the speaker of interest, and also performing better clustering for short segments.
This is our amazing team; I would like to thank all of them very much. Thank you. Questions?