hello everyone and thank you for attending my presentation. i am going to present our work on linguistically aided speaker diarization using speaker role information.
first of all, let me describe our task and the related issues. in a generic setting, speaker diarization tries to answer the question "who spoke when".
given as input a raw speech signal, what is wanted is to partition the signal into speaker-homogeneous regions, without having any prior information about the speakers within the signal. conceptually, and traditionally, this task involves two steps. first, we want to segment the signal into speaker-homogeneous segments, and this can be done either in a uniform way or according to some speaker change detection. then, having those speaker segments, we want to cluster them into speaker groups.
but there are specific problems connected to this kind of clustering. in particular, if speakers within the conversation are quite similar in terms of their acoustic characteristics, then there is a risk of merging the corresponding clusters together. also, if there is too much noise or silence within the speech signal which has not been caught by voice activity detection, then we may construct clusters corresponding to those nuisances. as a result, the performance of the system is affected, even if we knew in advance the number of speakers in the conversation.
in this work we focus on scenarios where speakers have specific roles. for example, we may think of a classroom interaction, where we have a teacher and students; in an interview we will have an interviewer and an interviewee; and so on. the interesting feature of those scenarios is that different roles are usually associated with distinct linguistic cues. for example, in an interview we expect that the interviewer will ask most of the questions and the interviewee will answer those questions. or, in a doctor-patient conversation, we expect that the patient will describe their symptoms and the doctor will prescribe medications, and so on. so the question now is: can we use this language-related information, those linguistic patterns, to aid the diarization?
so, if we look at the traditional approach to the diarization problem, what we do, given the audio signal, is first segment it and then cluster, as already mentioned. instead, what we propose is to also process the textual information, which can, for example, come from an asr system, and use it to extract knowledge about the roles within the conversation; then we can use this knowledge to estimate the role profiles, where by profiles we mean the acoustic identities of the speakers in the conversation. now, since we have those profiles, we can convert the clustering problem into a classification one, and thus avoid the potential problems connected to clustering that we mentioned earlier.
in the next few slides i want to go into more detail on what those modules are and how we have implemented them. notice that in the first couple of steps of our system we only process the textual stream. so, given the text, the first step is that we want to segment it in such a way that, after this segmentation step, every segment is uttered by a single speaker. ideally, we would want a system that finds exactly where there is a speaker change in the conversation. instead, for simplicity, we assume that there is a single speaker per sentence, so we will segment at the sentence level. to that end, we view this problem as a sequence labeling, or sequence tagging, problem.
as illustrated here, we initially construct a character-level representation for each word, and we concatenate this representation with the word embedding of the corresponding word. this sequence of words is then fed into a bi-lstm network, which predicts a sequence of labels. the labels here are two: B denotes that a word is at the beginning of a sentence, and I denotes that a word is in the middle of a sentence, which essentially means every word which is not sentence-initial. so, a sentence here is each one of those sequences of words spanning from one B until, but not including, the next one.
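the way sentences are recovered from a predicted B/I tag sequence can be sketched as follows; this is a minimal illustration of the labeling scheme just described, with hypothetical function and variable names, not the actual tagger.

```python
def segment_sentences(words, labels):
    """Group a flat word stream into sentences using B/I tags:
    'B' marks a sentence-initial word, 'I' any other word, so a
    sentence runs from one 'B' up to (not including) the next 'B'."""
    sentences, current = [], []
    for word, label in zip(words, labels):
        if label == "B" and current:   # a new sentence starts: flush the old one
            sentences.append(current)
            current = []
        current.append(word)
    if current:                        # flush the final sentence
        sentences.append(current)
    return sentences
```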
now, having those segments, we want to assign a role to each of them. in the domain we are working on, we assume that we know a priori the roles that can appear. so, for each role we build a role-specific language model, and we also have a general language model; we then interpolate each role-specific language model with the general one, and the interpolation weights are optimized on a development set. once we have interpolated the language models, we can simply assign to each text segment the role that minimizes the corresponding perplexity.
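as a toy illustration of this perplexity-based role assignment, here is a sketch using add-alpha-smoothed unigram models as stand-ins for the interpolated language models of the actual system; all names and counts are hypothetical.

```python
import math

def unigram_perplexity(words, counts, vocab_size, alpha=0.1):
    """Perplexity of a word sequence under an add-alpha-smoothed
    unigram language model given by raw word counts."""
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab_size))
        for w in words
    )
    return math.exp(-log_prob / len(words))

def assign_role(segment, role_counts, vocab_size):
    """Assign the role whose language model gives the segment
    the lowest perplexity."""
    return min(role_counts,
               key=lambda r: unigram_perplexity(segment, role_counts[r], vocab_size))
```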
note that so far we have only operated on the text. since in the next step we want to estimate the acoustic identities of the speakers appearing in the conversation, we also need the audio, so here we need to align the text with the audio. if the textual information comes from an asr system, which would be the case in a real-world application, then this alignment information is already available at no extra cost. so, having those audio-aligned segments, we extract a speaker embedding, with an x-vector extractor, for each segment assigned to a particular role, and we can now define a profile for that role, that is the role's acoustic identity, as the average of all those speaker embeddings assigned to that role.
by doing so, however, we assume that the role assignments of all the segments are reliable. but we cannot be equally confident about all of them, and the reason is that, since we are dealing with conversational interactions, after the oversegmentation step we may have some very short sentences, for example backchannels like "yeah" or "mhm", which do not contain sufficient information for robust role recognition. so what we do instead is assign a confidence measure to each of those segments, and this confidence measure is based on the difference between the best perplexity we get from the language models and the second-best one. now we can again define a role profile as an average, but for this average we only take into account the segments for which the confidence is above some threshold, and this threshold is a tunable parameter of our system.
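the confidence-filtered profile estimation can be sketched like this; a minimal numpy sketch where the embeddings, confidences, threshold, and fallback behavior are placeholders, not details from the talk.

```python
import numpy as np

def role_profile(embeddings, confidences, threshold):
    """Estimate a role's acoustic profile as the average of the
    speaker embeddings of the segments assigned to that role,
    keeping only segments whose role-assignment confidence
    exceeds the threshold (falling back to all segments if
    none survive the cut)."""
    embeddings = np.asarray(embeddings, dtype=float)
    keep = np.asarray(confidences, dtype=float) > threshold
    if not keep.any():
        keep[:] = True
    return embeddings[keep].mean(axis=0)
```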
so, now that we have estimated the role profiles, we are ready to perform the diarization, where instead of clustering we can follow a classification approach. here we follow the traditional pipeline for diarization, where we first segment the speech signal uniformly with a sliding window and extract a speaker embedding for each resulting segment. we then compute the cosine similarity of each segment with all the role profiles we have just estimated, and the role assigned to each segment is the one whose profile is most similar to the segment, that is, the one that maximizes this cosine similarity score.
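this classification step, replacing the clustering, can be sketched as follows; the cosine similarity against the estimated role profiles is as described above, but the function names and toy vectors are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_segments(segment_embeddings, profiles):
    """Label each uniformly cut audio segment with the role whose
    profile embedding is most similar to it, instead of clustering."""
    return [max(profiles, key=lambda role: cosine(emb, profiles[role]))
            for emb in segment_embeddings]
```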
so this is the system we are proposing. we are going to evaluate it on dyadic doctor-patient interactions, where we have two roles, namely the doctor and the patient. we are also going to use a mix of corpora in order to train our sentence tagger and our language models; here in this table you can see the datasets and the sizes of the corpora we are using. i am not going to go into detail on the specific parameters that we used for the system and its several subsystems. i will just mention that the F1 score of our sentence tagger was about 0.8, and that the word error rate of the asr system we are using was about forty percent on our dataset, which is high, but expected, since these are challenging real-world medical conversations.
as baselines, we will use an audio-only and a language-only baseline. for the audio-only baseline, our comparison system is the traditional diarization approach that we have already mentioned, where we have a uniform segmentation and then clustering based on plda scoring. for the language-only baseline, we essentially run only the first steps of our text-based system: given the text, we segment it with our sentence tagger and we assign each segment to a role; then, the only thing we need to do in order to evaluate the diarization is to align the text with the audio. as already mentioned, if the text comes from an asr system, the alignment information is already available.
here are our results on the test data. we have experimented using either the reference transcripts or the asr transcripts, and using either our sentence tagger or an oracle text segmentation. here are our unimodal baselines, and here is the system that we have proposed. by looking at the numbers we can make some interesting observations and draw some interesting conclusions. first of all, if we compare the two baselines, we see that the results are better with the audio-only one, which means that the acoustic stream, as expected, contains more information for the task of speaker diarization, and this is why we propose using the lexical information only as a supplementary cue.
what is interesting to notice is that between the language-only system with the oracle segmentation and the one with the tagger-based segmentation there is a big performance gap. the reason for that is that, with the tagger-based oversegmentation, as already mentioned, we may have very short segments that do not contain sufficient information for role recognition. in our system, however, we use this information only indirectly, in order to aggregate the audio segments assigned to each role into that role's acoustic identity, so such inaccuracies are largely canceled out after this averaging.
a similar effect is observed if we compare the results using the reference versus the asr transcripts. since in our case we have a pretty high word error rate, we see a significant degradation in performance for the language-only system when using the asr output. however, when the transcripts are only used for the profile estimation, as we are doing in our proposed system, then the performance degradation is substantially smaller.
finally, what we see here is that if we estimate the profiles using not all the relevant segments but only the segments that we are most confident about, then we get a further performance improvement. instead of the threshold parameter that we introduced earlier, here we are using the percentage of the best segments per session, where by best segments i mean the segments that we are most confident about, and this is the parameter optimized on the development set.
a first observation to be made from this figure, where we have plotted the diarization error rate as a function of the number of segments per session kept during the profile estimation, is that, unless we use a very small number of segments per session, most of the time the performance is better than the audio-only baseline, which is illustrated by the dashed line we see here. also, if we compare the blue and the red lines, what we see is that, even though when using the sequence tagger (the red line) we have a slightly worse performance than with the oracle segmentation, if we carefully choose the number of segments to use, then the tagger performance approaches the oracle segmentation performance.
to sum up my presentation: today we proposed a system for speaker diarization in scenarios where speakers have specific roles. we used the lexical information associated with those roles in order to estimate the acoustic identities of the roles, and this enabled us to follow a classification approach instead of the clustering approaches commonly used for diarization. we evaluated our system on dyadic doctor-patient interactions, and we achieved a relative improvement of about thirty percent compared to the audio-only baseline.
so, this was my presentation. thank you very much for your attention.