Hello everyone, and thank you for joining my speaker diarization presentation. I'm going to present our work on linguistically aided speaker diarization using speaker role information.

First, in a few words, let me describe what our task is and what the specific issues are.

In the generic setting of speaker diarization, we want to answer the question "who spoke when". We are given as input a raw speech signal, and what is wanted is to partition the signal into speaker-homogeneous regions, without having any prior information about the speakers in the conversation.

Conceptually, and traditionally, this task involves two steps. First, we want to segment the signal into speaker-homogeneous segments, and this can be done either in a uniform way or according to some speaker change detection. Then, having those speaker segments, we want to cluster them into distinct speaker groups.
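To make this traditional two-step pipeline concrete, here is a minimal sketch assuming placeholder random embeddings and scikit-learn's agglomerative clustering; the window length, hop, embedding dimension, and number of speakers are illustrative choices rather than the exact settings of the systems discussed in this talk.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def uniform_segments(duration_s, win_s=1.5, hop_s=0.75):
    """Return (start, end) times of a sliding window over the recording."""
    starts = np.arange(0.0, max(duration_s - win_s, 0.0) + 1e-9, hop_s)
    return [(s, s + win_s) for s in starts]

# Step 1: uniform segmentation of a (hypothetical) 60-second recording.
segments = uniform_segments(duration_s=60.0)

# Placeholder per-segment speaker embeddings; a real system would extract these
# from the audio of each segment.
embeddings = np.random.randn(len(segments), 128)

# Step 2: cluster the segment embeddings into speaker groups (2 speakers assumed).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
for (start, end), spk in zip(segments[:5], labels[:5]):
    print(f"{start:5.2f}-{end:5.2f}s -> speaker {spk}")
```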

But there are specific problems connected to this step of clustering. In particular, if speakers within the conversation sound similar in terms of their acoustic characteristics, then there is always the risk of merging the corresponding clusters together.

Also, if there is too much noise or silence within the speech signal, which possibly has not been caught by voice activity detection, then we may construct clusters corresponding to those nuisances.

And as a result, this can affect the performance of the system, even if we knew in advance the number of speakers in the conversation.

In this work we focus on scenarios where the speakers have specific roles. For example, we may think of a doctor-patient interaction, a classroom interaction where we have the teacher and the students, an interview where we have the interviewer and the interviewee, and so on.

The interesting feature of those scenarios is that different roles are usually associated with distinct linguistic cues. For example, in an interview we expect that the interviewer will pose the questions and the interviewee will answer those questions. Or, in a medical conversation, we expect that the patient will describe their symptoms and the doctor will give medical instructions.

So the question now is: can we use language, and more specifically those linguistic patterns, to assist diarization?

So, if we recall the problem formulation for diarization in the traditional approach, what we do is this: given the audio signal, we first segment it, which is usually done with voice activity detection, and then we cluster the segments.

Instead, here we propose to also process the textual information, which can be derived from an ASR module and which gives us, to some extent, knowledge about who the speakers within the conversation are, and to use this knowledge to estimate their profiles. By profiles we mean the acoustic identities of the speakers in the conversation.

Now, since we have those profiles, we can convert the clustering problem into a classification one, and thus we can avoid the potential problems connected with clustering that we mentioned earlier.

In the next few slides I want to go into more detail on what those modules are and how we have implemented them.

So notice here that in the first couple of steps of our system we only process the textual stream. Given the text, the first step is to segment it, and we want every segment after this segmentation step to be uttered by a single speaker.

Ideally, we would want a system that knows exactly where there is a speaker change in the conversation. Instead, for simplicity, we assume that there is a single speaker per sentence, so we segment at the sentence level.

To do that, we view this problem as a sequence labeling, or sequence tagging, problem. As is illustrated here, we initially construct a character-level representation for each word, and then we concatenate this representation with the word embedding of the corresponding word. This sequence of words is then fed into a bidirectional LSTM, which predicts a sequence of labels. The labels here are two: B denotes that the word is at the beginning of a sentence, and the other label denotes that the word is in the middle of a sentence, which essentially means every word which is not starting a sentence. So our sentences are each one of those sequences of words, from one B until the next one.
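As a rough illustration of this tagger, here is a minimal PyTorch sketch of a bidirectional LSTM that labels every word as B (beginning of sentence) or M (middle of sentence); the character-level representation is a simple stand-in, and all dimensions and names are hypothetical rather than the actual configuration of the system described in the talk.

```python
import torch
import torch.nn as nn

class SentenceBoundaryTagger(nn.Module):
    """Tags each word as B (beginning of sentence) or M (middle of sentence)."""

    def __init__(self, vocab_size, word_dim=100, char_dim=30, hidden_dim=128, num_labels=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Stand-in for a character-level word representation; in practice this
        # could come from a char-CNN or char-RNN over the word's characters.
        self.char_proj = nn.Linear(char_dim, char_dim)
        self.bilstm = nn.LSTM(word_dim + char_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, word_ids, char_feats):
        # word_ids: (batch, seq_len); char_feats: (batch, seq_len, char_dim)
        w = self.word_emb(word_ids)
        c = torch.relu(self.char_proj(char_feats))
        x = torch.cat([w, c], dim=-1)   # concatenate char- and word-level reps
        h, _ = self.bilstm(x)           # contextualize with a BiLSTM
        return self.classifier(h)       # per-word logits over {B, M}

# Toy usage: 8 words, label 0 = B, label 1 = M.
model = SentenceBoundaryTagger(vocab_size=1000)
words = torch.randint(0, 1000, (1, 8))
chars = torch.randn(1, 8, 30)
labels = model(words, chars).argmax(dim=-1)   # sentence boundaries are the B positions
```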

Now, having those segments, we want to assign a role to each of them. Since we assume that we know a priori the roles in the domain we are working on, we build a role-specific language model for each role, and we also have a background language model built from generic text. We then interpolate the language models, and the interpolation weights are optimized on a development set. Once we have interpolated the language models, we can simply assign to each text segment the role that minimizes the corresponding perplexity.
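The following sketch illustrates the idea with simple unigram language models and a fixed interpolation weight; the actual system presumably uses stronger n-gram models with weights tuned on the development set, so the corpora, role names, and the value of `lam` here are only illustrative.

```python
from collections import Counter
import math

def train_unigram(texts):
    """Estimate a unigram LM (word -> probability) from a list of strings."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_logprob(word, role_lm, background_lm, lam, floor=1e-8):
    # Linear interpolation between the role-specific and the background LM.
    p = lam * role_lm.get(word, 0.0) + (1 - lam) * background_lm.get(word, floor)
    return math.log(max(p, floor))

def perplexity(segment, role_lm, background_lm, lam):
    words = segment.split()
    logprob = sum(interpolated_logprob(w, role_lm, background_lm, lam) for w in words)
    return math.exp(-logprob / max(len(words), 1))

def assign_role(segment, role_lms, background_lm, lam=0.7):
    """Assign the role whose interpolated LM gives the lowest perplexity."""
    ppls = {role: perplexity(segment, lm, background_lm, lam)
            for role, lm in role_lms.items()}
    return min(ppls, key=ppls.get), ppls

# Toy usage with hypothetical role corpora.
role_lms = {
    "doctor":  train_unigram(["take this medication twice a day"]),
    "patient": train_unigram(["i have been feeling a lot of pain"]),
}
background_lm = train_unigram(["the and a to of i you it is was for on with"])
role, ppls = assign_role("i feel a lot of pain", role_lms, background_lm)
```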

Note that so far we have only operated on the text. In the next step, in order to estimate the acoustic identities of the speakers in the conversation, we also need the audio, so here we need to align the text with the audio. The textual information comes from an ASR system, which means that in a real-world application this alignment information is already available at practically no extra cost.

So, having those role-annotated segments, we extract a speaker embedding with an x-vector extractor for each segment assigned to a specific role, and we can now define, as the acoustic identity of a role, the average of all the speaker embeddings assigned to that role.
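A minimal sketch of this profile estimation, assuming the per-segment embeddings have already been extracted and stored in a NumPy array; the role names and embedding dimension are placeholders.

```python
import numpy as np

def role_profiles(embeddings, roles):
    """Average the speaker embeddings of all segments assigned to each role.

    embeddings: (num_segments, emb_dim) array of per-segment speaker embeddings
    roles:      list of role labels, one per segment, from the text-based module
    """
    profiles = {}
    for role in set(roles):
        idx = [i for i, r in enumerate(roles) if r == role]
        profiles[role] = np.mean(embeddings[idx], axis=0)
    return profiles

# Toy usage with placeholder x-vector-like embeddings.
emb = np.random.randn(6, 128)
roles = ["doctor", "patient", "doctor", "patient", "patient", "doctor"]
profiles = role_profiles(emb, roles)   # {"doctor": 128-dim mean, "patient": 128-dim mean}
```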

By doing so, however, we assume that all the role assignments on the segments are correct. In practice we cannot be equally confident about all of them, and the reason is that, since we are dealing with conversational interactions, after the oversegmentation we may end up with some very short segments, for example segments of only one or two words, which do not contain sufficient information for robust role recognition.

So what we do instead is assign a confidence measure to each of those segments, and this confidence measure is the difference between the best perplexity we get from the role language models and the second-best one. We can then define a refined profile, again as an average, but now for this average we only take into account the segments for which the confidence is above some threshold, and this threshold is essentially a tunable parameter of our system.
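Here is a small sketch of that refinement, assuming the per-segment role perplexities from the language models are available; the margin-based confidence and the fallback behaviour when no segment passes the threshold are illustrative choices, not necessarily the exact ones used in the system.

```python
import numpy as np

def confidence(ppls_per_role):
    """Margin between the best (lowest) and second-best role perplexities.

    Requires at least two roles, as in the dyadic scenarios discussed here.
    """
    ranked = sorted(ppls_per_role.values())
    return ranked[1] - ranked[0]

def refined_profiles(embeddings, roles, confidences, threshold):
    """Average only the segments whose role assignment is trusted enough."""
    profiles = {}
    for role in set(roles):
        all_idx = [i for i, r in enumerate(roles) if r == role]
        idx = [i for i in all_idx if confidences[i] >= threshold]
        # Fall back to all segments of the role if none pass the threshold.
        profiles[role] = np.mean(embeddings[idx if idx else all_idx], axis=0)
    return profiles
```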

Now that we have estimated the role profiles, we are ready to perform diarization, where instead of clustering we can follow a classification approach. Here we follow the traditional approach for diarization, where we first segment the speech signal uniformly with a sliding window and extract a speaker embedding for each resulting segment. We then compute the similarity of each segment with all the role profiles we just estimated, and the role assigned to each segment is the one that is most similar to it, that is, the one that maximizes this similarity score.
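A minimal sketch of this classification step; cosine similarity is used here purely for illustration, since the talk only specifies that each segment is assigned the role whose profile it is most similar to, and the embeddings below are random placeholders.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def classify_segments(segment_embeddings, profiles):
    """Assign to each sliding-window segment the role with the most similar profile."""
    decisions = []
    for emb in segment_embeddings:
        scores = {role: cosine(emb, prof) for role, prof in profiles.items()}
        decisions.append(max(scores, key=scores.get))
    return decisions

# Toy usage with placeholder embeddings and two role profiles.
profiles = {"doctor": np.random.randn(128), "patient": np.random.randn(128)}
segs = np.random.randn(10, 128)
labels = classify_segments(segs, profiles)   # e.g. ["doctor", "patient", ...]
```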

So this is the system we are proposing, and we are going to evaluate it on dyadic clinical interactions where we have two roles, namely the therapist and the patient. We also use a mix of corpora in order to train our sentence tagger and our language models; here you can see the datasets and the sizes of the corpora we are using.

I am not going to go into detail on the specific parameters that we used for the system and its several subsystems. I will just mention that the F1 score of our sentence tagger was around 0.8, and that the word error rate of the ASR system we are using was about forty percent for our dataset, which is expected since it is a challenging dataset containing spontaneous medical conversations.

As for the baselines, we use an audio-only and a language-only baseline. For the audio-only baseline we use the traditional system I have already mentioned, where we have a uniform segmentation and then PLDA-based clustering.

For the language-only baseline we essentially run the first steps of our text-based system: given the text, we segment it with our sentence tagger and we assign each segment to a role. The only thing we then need to do in order to evaluate diarization is to align the text with the audio and, as I already mentioned, when the text comes from an ASR system this alignment information is already available.

Here are our results on the evaluation data. We have experimented both with the reference transcripts and the ASR transcripts, and with either our sentence tagger or an oracle text segmentation. Here are our unimodal baselines, and here is the system we propose. By looking at the numbers we can make some interesting observations and draw some conclusions.

First of all, if we compare the two baselines, we see that the results are substantially better with the audio-only one. This suggests that the acoustic stream, as expected, contains more information for the task of speaker diarization, and this is why we propose using the lexical information only as a supplementary cue.

What is also interesting to notice is that, for the language-only system, there is a big performance gap between the oracle segmentation and the automatic, tagger-based segmentation. The reason is that the tagger oversegments and, as I also mentioned, we may have very short segments that do not contain sufficient information for role recognition. However, in our system we use this information only as a means to aggregate the role segments and get the acoustic identity, the acoustic profile, of each role, so such inaccuracies largely cancel out and our system is robust to this oversegmentation.

A similar pattern is observed when we compare the results obtained with the reference versus the ASR transcripts. Because in our case we have a pretty high word error rate, there is a severe degradation in performance for the language-only system when using the ASR output. However, when the transcripts are only used for the profile estimation, as is done in our proposed system, the performance degradation is substantially smaller.

Finally, what we see here is that if we estimate the profiles using not all the role-relevant segments but only the segments that we are most confident about, then we get a further performance improvement. Instead of the threshold parameter that we introduced earlier, here we use the N best segments per session, where by best segments I mean the segments that we are most confident about, and N is a parameter optimized on the development set.

A first observation that can be made from this figure, where we have plotted the diarization error rate as a function of the number of segments per session used for the profile estimation, is that, unless we use a very small number of segments per session, most of the time the performance is better than the audio-only baseline, which is illustrated by the dashed line. Also, if we compare the blue and red lines, we see that even though the sentence tagger, which corresponds to the red line, gives slightly worse performance than the oracle segmentation, if we suitably choose the number of segments to use, the tagger performance approaches the oracle segmentation performance.

To sum up my presentation: today we proposed a system for speaker diarization in scenarios where the speakers have specific roles, and we use the lexical information associated with those roles in order to estimate the acoustic identities of the roles, which gives us the ability to follow a classification approach instead of the clustering approaches usually employed for diarization. We evaluated our system on dyadic clinical interactions and achieved a relative improvement of about thirty percent compared to the audio-only baseline.

So, this was my presentation. Thank you very much for your attention.