OK, yes, so let me open this session. First of all, I want to thank you again for accepting our invitation; it is very much appreciated, and I'm sure we will take full advantage of your presentation. Secondly, I'm really happy to have the opportunity to introduce you as our first speaker. I will keep it short, because I'm quite sure that almost everybody here already knows you, so you hardly need an introduction, but let me give a brief one anyway.
So, you got your master's degree from the University of Trento about ten years ago, and you then stayed there as a PhD student, working on distant-talking speech recognition, the topic on which you defended your thesis in 2017. You then moved to Mila, where you started working as a postdoc, collaborating closely with Yoshua Bengio. You have worked on several topics, mainly on learning representations, for speech but not only, and you are also one of the founders of the SpeechBrain initiative for building an open toolkit for speech and speaker recognition, which we will hear about. You already have a long list of papers on these topics, and I know you have prepared a very nice presentation for us.
Before starting, let me briefly explain how the session will work. We will first listen to a pre-recorded presentation by Mirco. During it, if you want to ask questions, please write them in the question-and-answer box; writing them early is welcome, since it gives Mirco time to prepare complete answers. Then we will have fifteen minutes of live questions and answers with Mirco. During this live session you can either use the question-and-answer box or raise your hand, and if you raise your hand we will let you ask your question live.
So Mirco, do you want to say a few words before we start the video?

Yes, thank you very much for the introduction, and hello everyone. I hope the audio within the video will be fine, but in the worst case you might have to increase the volume a little bit; let's see how it goes.

OK, I think we can start the video now... Sorry, it seems we have a small technical problem: we don't have the audio. It was working before, so it is better to switch back to the previous setup... We can't hear anything... Yes, can we do a quick test? OK, it's working.
Hi everyone, I'm Mirco Ravanelli, and I'm very happy to be here today, at least virtually. Let me thank the whole organization for the invitation and for the chance to share with the speech community this talk, entitled "Towards unsupervised training of speech representations". Self-supervised learning is a key topic in the whole machine learning field, and of course it is gaining traction within the speech community as well. So today I would like to share the experience that I have gained after working for two or three years on this topic.
OK, but before diving into self-supervised learning, let me summarize some of the limitations of supervised learning, which is the dominant paradigm these days. You can see deep learning as a way to learn hierarchical representations, where we start from low-level concepts and combine them to create higher-level, more abstract concepts. This learning paradigm is very general and is typically implemented through deep neural networks that are often trained in a supervised way using large annotated corpora. Even though this is not the only approach, and its great success in many practical applications is clear today, this paradigm has some limitations. What are these issues? For example, we need data, and not generic data but annotated data, and the annotation process can be tedious, expensive, and time-consuming, and sometimes requires domain experts. Moreover, supervised learning is data-hungry and also computationally demanding: of course, these days, to reach state-of-the-art performance in machine learning we need a lot of data, and a lot of data requires a lot of computation, limiting de facto the access to this technology to a restricted set of users.
Moreover, if we train a system in a supervised way, the representations that we learn might be biased towards a specific application. For instance, if we train a system for speaker identification, the representations discovered there would not work well for speech recognition. So we might want to learn some kind of general representation that makes transfer learning much easier and more effective. The third limitation is actually more philosophical, and it is that our brain does not use supervised learning only, but clearly combines different modalities. I'm pretty sure that combining different learning modalities is crucial to reach higher levels of artificial intelligence: we can combine supervised learning with contrastive learning, with imitation learning, with reinforcement learning, and of course with self-supervised learning.
So what is self-supervised learning? Self-supervised learning is a type of unsupervised learning where we do have a supervision, but the supervision is extracted from the signal itself. In self-supervised learning, thus, we don't have humans that have to create labels; instead, the labels are created basically for free, and we can create tons of them without any annotation cost. Normally, in self-supervised learning we apply some kind of known transformation to the input signal and use the resulting outcomes as labels, as targets.
Let me clarify this with some examples derived from the computer vision community, which was the first one studying this approach in depth. In the computer vision community, actually, they noticed quite early, earlier than the others, that by solving some kind of simple task we are able to train a neural network that learns some kind of meaningful representation. For instance, you can ask your neural network to solve a relative positioning task, where you have small patches of an image and you have to decide the relative position between them. You can ask your neural network to put the right colors onto an image, or to find the correct rotation of an image. All of these tasks are relatively easy, but if we design a system, a neural network, that learns how to solve them, we inherently require the system to have some kind of semantic knowledge of the world, or at least semantic knowledge of the image, and that can be really helpful to derive representations that are hopefully high-level and robust. And yes, self-supervised learning is extremely interesting and is gaining a lot of attention.
Let me show the famous slide by Yann LeCun illustrating the cake analogy: reinforcement learning is the cherry on the cake, supervised learning is the icing on the cake, and unsupervised or self-supervised learning is the cake itself, meaning that we believe this modality is definitely a key ingredient to develop intelligent systems.
OK, but what about the audio and speech field? As I mentioned before, there is a growing number of research works that go in the direction of self-supervised learning in audio and speech, and we have seen many of them even at this Interspeech. Here let me just highlight a few of them. In my opinion, the first work that clearly showed the potential of self-supervised learning in audio and speech is the contrastive predictive coding work by Aaron van den Oord, back in 2018. This work is mostly about predicting the future given the past. More recently, we have seen another very good work by Facebook, called wav2vec 2.0, where they were able to show impressive results with their approach, which employs some kind of masking technique similar to the one made popular by BERT. I also contributed to this field with the problem-agnostic speech encoder, which, as we will see later, explores multi-task self-supervised learning.
However, self-supervised learning of speech is really challenging. Why? First of all, because speech is characterized by high-dimensional data: we typically have long sequences of samples, which can moreover be of variable length. Last but not least, speech inherently entails a complex hierarchical structure that might be very difficult to infer without being guided by a strong supervision. Speech, in fact, is characterized by samples that can be combined into phonemes; from phonemes you can create the levels of syllables and words, and finally we have the meaning of the sentence. Inferring all these kinds of structure might be extremely difficult.
On my side, I started studying self-supervised learning when I started my postdoc, almost three years ago. At that time, people at Mila were doing research on self-supervised learning approaches based on mutual information, and I got so excited that I decided to study self-supervised learning approaches based on mutual information for learning speech representations. That led to the development of a technique called local info max (LIM) that I will describe in the next slides. After that, we further extended this technique using a multi-task self-supervised learning approach, and that led to the development of the problem-agnostic speech encoder, PASE, that we presented at Interspeech 2019. We then extended PASE with another technique, and came up with an improved system called PASE+; we recently presented this work at ICASSP.
OK, let's start from the mutual-information-based approach. What is mutual information? Mutual information is defined as the Kullback-Leibler divergence between the joint distribution of two random variables and the product of their marginals. Why is this important? Because with mutual information we can capture complex nonlinear relationships between random variables. If the two random variables are independent, the mutual information is zero, while if there is some kind of dependency between the variables, then the mutual information is greater than zero. This is very attractive; the issue is that mutual information is difficult to compute in high-dimensional spaces, and this has limited a lot its adoption, in particular within deep learning. However, one recent work, called mutual information neural estimation (MINE), found that it is possible to maximize or minimize mutual information within a framework that closely resembles GANs.
How does it work? Let's assume we can somehow draw some samples from the joint distribution; we call them positive samples (we will explain later how we can do that in practice). Let's also assume we can draw some samples from the product of the marginal distributions, and we call these negative samples. Then we can feed these positive and negative samples to a special neural network whose cost function is derived from the Donsker-Varadhan bound. This bound says that the true mutual information has a lower bound, and if we train this neural network to maximize this lower bound, we finally converge to an estimate of the mutual information.
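To make the mechanics of this estimator concrete, here is a minimal numerical sketch (my own toy example, not code from the talk): positives are drawn from the joint distribution of two correlated Gaussians, negatives from the shuffled marginals, and the Donsker-Varadhan bound is maximized over a deliberately simple bilinear critic instead of a trained neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated Gaussians: y depends on x, so their MI is positive.
n = 20000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)
y_neg = rng.permutation(y)  # shuffling breaks the pairing -> marginals

def dv_bound(t_pos, t_neg):
    """Donsker-Varadhan lower bound on MI given critic outputs t."""
    return t_pos.mean() - np.log(np.exp(t_neg).mean())

# Maximize the bound over a tiny bilinear critic T(x, y) = a*x*y
# (grid search stands in for the gradient ascent used by MINE).
best = max(dv_bound(a * x * y, a * x * y_neg)
           for a in np.linspace(0.1, 5.0, 50))

true_mi = 0.5 * np.log(1 + 1 / 0.25)  # analytic MI of this pair
print(round(best, 3), "<=", round(true_mi, 3))
```

With a critic this weak, the estimate stays well below the analytic value (about 0.8 nats); training a neural critic tightens the bound, which is exactly the game MINE plays.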
Inspired by this approach, I started thinking about mutual-information-based approaches designed specifically for speech. I then came up with a technique called local info max (LIM) that works in this way. First of all, we employ a sampling strategy that draws the positive and negative samples as follows: we choose a random chunk from a random sentence, and we call it c1; then we choose another random chunk from the same sentence, and we call it c2; and finally we choose another random chunk from another sentence, and we call it c_rnd. With these chunks we can do some interesting things. For instance, we can process c1, c2, and c_rnd with an encoder, which provides hopefully higher-level representations z1, z2, and z_rnd. Then we can derive positive and negative samples: if we concatenate z1 and z2, we create a sample from the joint distribution, a positive sample; it is a positive sample because we expect some kind of relation between these random variables, since they are extracted from the same signal. We can also create a negative sample by concatenating z1 and z_rnd, and this can be seen as a sample from the product of the marginal distributions.
After that, we employ a discriminator, which is fed with positive or negative samples, and the discriminator should figure out, basically, if it is given positive or negative examples, that is, if the representations come from the same sentence or from different ones. In this system, the discriminator's role is to estimate and maximize the mutual information. Moreover, the encoder and the discriminator are jointly trained from scratch, and this results in a cooperative game, not in an adversarial game like in GANs: in this case, the encoder and the discriminator must cooperate to learn a hopefully high-level representation.
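The sampling and pairing strategy just described can be sketched in a few lines (a hypothetical toy setup: `dataset` is simply a list of 1-D waveforms, and a mean/std summary stands in for the real convolutional encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lim_chunks(dataset, chunk_len=160):
    """LIM-style sampling: c1 and c2 from the same sentence,
    c_rnd from a different one."""
    i = int(rng.integers(len(dataset)))
    j = int(rng.integers(len(dataset)))
    while j == i:  # the negative must come from another sentence
        j = int(rng.integers(len(dataset)))

    def chunk(wav):
        start = int(rng.integers(len(wav) - chunk_len))
        return wav[start:start + chunk_len]

    return chunk(dataset[i]), chunk(dataset[i]), chunk(dataset[j])

# Toy "sentences" of variable length (the real input is raw speech).
data = [rng.normal(size=int(rng.integers(800, 1600))) for _ in range(10)]
c1, c2, c_rnd = sample_lim_chunks(data)

# Stand-in encoder: in LIM this is the SincNet-based network.
encode = lambda c: np.array([c.mean(), c.std()])
pos = np.concatenate([encode(c1), encode(c2)])     # joint -> positive
neg = np.concatenate([encode(c1), encode(c_rnd)])  # marginals -> negative
print(pos.shape, neg.shape)
```

The discriminator then receives `pos` and `neg` pairs and is trained to tell them apart.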
A good question here is: what do we actually learn by playing this game? With this game we basically learn speaker identities, or, if you prefer, speaker embeddings. Why? Because this approach is based on randomly sampling within the same sentence, and if we randomly sample within the same sentence, the most reliable, stationary factor that the system can disentangle is definitely the speaker identity. Remember that we assume a dataset that is just large enough, with a large variability of speakers, so that if we randomly sample two sentences, the probability of picking the same speaker is very low. So, overall, this can be seen as a system for learning speaker embeddings without providing to the system any explicit label on the speaker identity.
The encoder is fed by the raw speech samples directly. In the first layer of the convolutional architecture we use SincNet, which makes learning directly from the raw samples much easier: in fact, instead of using standard convolutional filters, we use parameterized band-pass filters that only learn the cutoff frequencies. SincNet turned out to be useful not only in supervised learning, but also in this self-supervised context, and I encourage you to read the reference paper if you would like to hear more about SincNet.
What are the strengths and the issues of LIM? Once trained, with LIM we are able to learn high-quality speaker representations which are competitive with the ones learned in a standard supervised way. Moreover, LIM is very simple and also computationally efficient: because we only use local information, we can parallelize a lot of the computations. The limitation of LIM is that the representations are very task-specific. As we have seen before, with LIM we can learn speaker embeddings, but what about the other kinds of information that are embedded in the speech signal, like phonemes, emotions, and many other things?
So, looking at these results, I asked myself: are we really sure that a single task is enough? Actually, most of the works trying to use self-supervised learning do it by solving a single task, but my experience suggests that one single task is not enough, because with a single task we can only capture a little part of the information in the signal that we might want. Based on this observation, we decided to start a new project called problem-agnostic speech encoder, PASE, where we wanted to learn more general representations by jointly tackling multiple self-supervised tasks. In PASE we have an ensemble of neural networks that must operate together to discover good speech representations.
So what is the intuition behind that? If we jointly solve multiple self-supervised tasks, we can expect that each task brings a different view on the speech signal, and if we put together different views on the same signal, we might have higher chances to obtain a more general and complete description of the signal. Moreover, a consensus across all these views is needed, and this imposes some kind of soft constraint on the representation that, we believe, can improve its robustness. So with this approach we were actually able to learn general, robust, and transferable features, thanks to jointly solving multiple tasks. Let me explain in the next slide, in more detail, how the system works.
PASE is based on an encoder that transforms the raw samples into a higher-level representation. The encoder is based on SincNet followed by seven convolutional blocks and a final nonlinear layer. One thing worth remarking is that we start from the raw signal, that is, from the lowest possible speech representation. After the encoder we have a bunch of workers, where each worker solves a different self-supervised task. One thing to remark is that the workers are very small neural networks. Why? Because if the workers are very simple and small neural networks, we force the encoder to provide a much more robust, and hopefully higher-level, representation. There are actually two types of workers: regression workers, that solve a regression task, and binary workers, that solve a binary classification task. The binary workers are similar to the one that we have seen before for the mutual-information approach.
As for the regression tasks, we have some workers that estimate some kind of known speech representations. For instance, we have one worker estimating the waveform back, in an autoencoder fashion; we estimate the log power spectrum; we estimate the mel-frequency cepstral coefficients; and we also estimate prosodic features such as voicing probability, zero-crossing rate, and energy. So why do we do something like that? Because this way we inject some kind of prior knowledge that can be very helpful in self-supervised learning. In particular, in the speech community we are well aware that there are some features that are very helpful, like MFCCs, so why not try to take advantage of that? This way, we are actually injecting this information inside our neural network.
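As a rough illustration of what a few of these regression targets look like, here is a sketch that computes frame-level log power spectrum, zero-crossing rate, and energy with plain NumPy (the frame sizes and the reduced feature list are my own simplifications; the actual workers also regress MFCCs, prosody, and the waveform itself):

```python
import numpy as np

def frame_targets(wav, frame=400, hop=160):
    """Per-frame handcrafted targets: log power spectrum,
    zero-crossing rate, and energy."""
    n = 1 + (len(wav) - frame) // hop
    logspec, zcr, energy = [], [], []
    window = np.hanning(frame)
    for t in range(n):
        x = wav[t * hop:t * hop + frame]
        spec = np.abs(np.fft.rfft(x * window)) ** 2
        logspec.append(np.log(spec + 1e-10))
        zcr.append(np.mean(np.abs(np.diff(np.sign(x))) > 0))
        energy.append(np.sum(x ** 2))
    return np.array(logspec), np.array(zcr), np.array(energy)

rng = np.random.default_rng(0)
wav = rng.normal(size=16000)  # one second of toy audio at 16 kHz
ls, z, e = frame_targets(wav)
print(ls.shape)  # frames x rfft bins
```

Each worker is then a small network regressing one of these target streams from the encoder output.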
In parallel to the regressors, we also have binary classification tasks. The binary classification tasks work similarly to what we have described for the mutual-information approach: basically, we sample three speech chunks, an anchor, a positive, and a negative, according to some kind of predefined strategy; we then process all these chunks with the PASE encoder; and then we train a discriminator with a binary cross-entropy loss that should figure out whether it is given a positive or a negative pair. So it is very similar to the LIM approach we described before; the only difference is the particular sampling strategy, because with different sampling strategies we can highlight different features. One sampling strategy that we adopt is the one proposed in local info max, which, as we have seen before, is able to learn speaker embeddings, or in general speaker identity. Together with that we have a similar strategy called global info max: here we play basically the same game, but we use larger chunks, and with larger chunks we hope to capture some kind of complementary information which is hopefully more global.
Finally, we propose another interesting task called sequence predictive coding. With this task we are hopefully able to capture some kind of information on the order of the sequence. It works in this way: we choose a random chunk from a random sentence, called the anchor chunk; then we choose another random chunk from the future of the same sentence, and this is the positive one; and then we choose another random chunk from the past of the same sentence, and this is the negative one. If we play this game, we are hopefully able to capture a little bit better how the sequence can evolve, and thus capture some kind of longer-context information that we were not able to capture with the previous tasks. This sequence predictive coding is similar to the contrastive predictive coding proposed by van den Oord et al.; the main difference is that in our work the negative samples, actually all the samples, are derived from the same sentence, not from other ones, because in this case we would like to focus only on how the signal evolves, and we don't want to capture other kinds of signal information, such as speaker identity, that we already capture with other tasks.
OK, but how can we use PASE inside a speech processing pipeline? Well, step one is the self-supervised training: we take the architecture that we have seen before and train it; in particular, we jointly train encoder and workers using standard SGD, by optimizing a loss which is computed as the average of each worker's cost. In our experiments we tried different alternatives, but we found that averaging the costs is the approach that works best, or at least very well.
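So the total objective is just the mean of the per-worker costs; the sketch below only shows this bookkeeping, and the loss values are invented for illustration:

```python
import numpy as np

def total_loss(worker_losses):
    """Multi-task objective: the plain average of the worker costs."""
    return float(np.mean(worker_losses))

# Hypothetical per-worker losses for one minibatch (values made up).
losses = {"waveform": 0.40, "log_spec": 0.90, "mfcc": 0.70,
          "prosody": 0.20, "lim": 0.65, "gim": 0.55, "spc": 0.60}
print(total_loss(list(losses.values())))  # average of the seven costs
```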
Once we have trained our architecture, without using any label, we can go to step two, which is the supervised fine-tuning. In this case, we get rid of all the workers and plug our encoder into a supervised classifier, which is trained with the little amount of supervised data available. Actually, here there are a couple of possible modalities. Number one is to use PASE as a standard feature extractor: in this case we freeze PASE during the supervised fine-tuning phase. The other approach uses PASE just as a pre-training: we start with the self-supervised parameters and fine-tune the encoder during the supervised fine-tuning phase. Of these two approaches, the second one usually turns out to be the best-performing one.
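The two modalities, frozen feature extractor versus fine-tuned pre-training, can be sketched as a tiny configuration switch (a hypothetical minimal interface, not the actual PASE code, where a model is just a dict carrying a `trainable` flag):

```python
def attach_downstream(encoder, classifier, mode="finetune"):
    """Reuse a pretrained self-supervised encoder in two ways:
    'frozen' keeps it as a fixed feature extractor, while
    'finetune' uses its weights only as an initialization."""
    if mode == "frozen":
        encoder["trainable"] = False
    elif mode == "finetune":
        encoder["trainable"] = True
    else:
        raise ValueError(f"unknown mode: {mode}")
    classifier["trainable"] = True  # the classifier always trains
    return encoder, classifier

enc = {"init": "self_supervised_checkpoint", "trainable": None}
clf = {"init": "random", "trainable": None}
enc, clf = attach_downstream(enc, clf, mode="finetune")
print(enc["trainable"], clf["trainable"])  # True True
```

In real frameworks the flag corresponds to excluding the encoder parameters from the optimizer or stopping their gradients.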
It is very important to remark that step number one, the self-supervised training, can, and should, be done only once. In fact, we have seen that after this self-supervised phase we obtain a general encoder that can be used for a large variety of speech tasks, like speech recognition, speaker recognition, speech enhancement, and many others. And you don't even have to train the self-supervised extractor by yourself: you can simply use the pre-trained parameters that we share along with the paper.
Well, this is not all about PASE. In fact, encouraged by the good results achieved with the original version, we decided to spend some time to go further: we revised the architecture and improved it. We took the opportunity of the JSALT workshop, organized by the Johns Hopkins University, to set up a team working on improving PASE, and as a result we came up with a new architecture called PASE+, where we introduced different types of improvements. First of all, we coupled PASE with on-the-fly data augmentation. Here we use standard speech contamination techniques like additive noise and reverberation, but we also add some kind of random zeros in the time-domain waveform, and we filter the signal with some kind of random band-stop filters in order to introduce zeros in the frequency domain as well.
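A simplified sketch of such on-the-fly contamination (additive noise, a zeroed time segment, and a random band of zeros in the frequency domain; realistic reverberation via impulse responses is omitted here) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminate(wav, sr=16000):
    """Distort a waveform on the fly: additive noise, time-domain
    zeros, and a dropped frequency band."""
    out = wav + 0.05 * rng.normal(size=len(wav))      # additive noise
    t0 = int(rng.integers(len(out) - sr // 10))
    out[t0:t0 + sr // 10] = 0.0                       # zeroed segment
    spec = np.fft.rfft(out)
    f0 = int(rng.integers(len(spec) - 100))
    spec[f0:f0 + 100] = 0.0                           # band drop
    return np.fft.irfft(spec, n=len(out))

clean = rng.normal(size=16000)                        # toy "speech"
noisy = contaminate(clean.copy())
# The workers still regress targets computed on `clean`, which
# implicitly asks the encoder to denoise.
print(noisy.shape)
```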
So why is this augmentation very important? Because it gives to the system some kind of robustness against noise, reverberation, and other environmental artifacts. A nice thing is that, since everything is done on the fly, every time we contaminate the sentences with a different distortion. Also, the workers are trained on the clean labels, extracted from the clean version of the signal, so we implicitly ask, this way, our system to perform some kind of denoising. Then we also designed a more robust encoder: we still have SincNet and the convolutional layers, but we have also added a quasi-recurrent neural network, which is an efficient way to introduce some kind of recurrence, and we added some skip connections that help the gradients backpropagate.
Then we improved a lot the workers. We have noticed that the more workers, the better it is, and so we definitely introduced a lot of additional workers, for instance workers that estimate various types of features over different context windows, et cetera. Overall, we could improve a lot the performance of the system on different speech tasks.
What do we learn with PASE? Here we show some t-SNE plots of the learned embeddings. In the first plot, you can see that PASE learns pretty well the speaker identities: you can clearly recognize that there are pretty well-defined clusters for the speakers. In the other plot, we show a similar t-SNE visualization for phonemes, and you can see that not everything is clustered equally well: you have some phonemes that cluster nicely, but you can also detect some phonemes which form less precise clusters. This means that we are actually learning some kind of phonetic representation, even without any phoneme label.
OK, we tried PASE on different speech tasks, and you can refer to the paper to see all the results, but here let me just discuss some of the numbers that we achieved on noisy ASR tasks, to highlight a little bit the robustness of the proposed approach. Furthermore, let me say that we have pre-trained PASE on LibriSpeech, without using the labels, and, very interestingly, we have noticed that we don't need a lot of data to train PASE: we just need something like fifty to one hundred hours of unlabelled speech, and these are enough to achieve pretty decent results. This is quite interesting, because usually standard self-supervised approaches rely on a lot, a lot of data; in our case we think that somehow we are more data-efficient because we employ a lot of workers trying to extract a lot of information from our speech signals.
On the left you can see the results obtained on DIRHA, which is a challenging task characterized by speech recorded in a domestic environment and corrupted by noise and reverberation. You can see here that PASE+ clearly outperforms traditional features and also combinations of traditional speech features. On the right you can see the results on CHiME-5. The CHiME-5 dinner-party scenario is probably the most challenging task available, where the speech is corrupted by significant amounts of noise and reverberation, plus a lot of other disturbances such as overlapped speech. Even in this pretty challenging scenario, we are able to slightly outperform the standard hand-crafted features and other current baselines.
Actually, the representations obtained with PASE are quite general, robust, and transferable, and we have successfully applied them to different tasks. Here we have seen speech recognition, but you can use PASE for speaker recognition, for speech enhancement, and for emotion recognition, and I am also aware of some works trying to use PASE for transfer learning across languages: you train PASE on English and then you use it for a task in another language, and it seems to show some kind of surprising robustness in this transfer. You can find the code and the pre-trained models on GitHub, and I encourage you to go there and play with PASE as well.
Let me conclude this part with some thoughts on self-supervised learning and the role that it can play in the future. As I mentioned in the first part of the presentation, I think that the future of intelligent machines lies in the combination of different learning modalities: we can combine supervised learning with unsupervised learning, imitation learning, reinforcement learning, and others. So I think there is a huge space here for future research in the direction where we basically combine, in a simple and elegant way, different learning modalities, and one of them could be self-supervised learning, but not only. This is very important these days because standard supervised learning is the dominant approach, but we are starting to see some limitations, and these limitations might become even clearer in the next years: supervised learning is demanding too much data and too much computation, and if we keep going in this direction, only a few labs and a few companies in the world will be able to train state-of-the-art systems. So I think that exploring different learning modalities is crucial, and especially self-supervised learning, because, as we have seen in this presentation, self-supervised learning can be extremely useful in the transfer learning scenario: with self-supervised learning we have the chance to learn a representation which is general enough to be used for several downstream tasks, and this is a really big advantage in terms of computational complexity and costs.
So I think the future paradigm will be, at a high level, similar to the first popular approach to deep learning, where we were able to initialize deep neural networks using unsupervised learning approaches and then fine-tune them. I think a self-supervised pre-training followed by transfer learning could pretty much be the future paradigm for speech, and it will likely play a major role in the pipeline. And yes, this is somehow similar to what we did in the past; the difference is that those first systems we were using for unsupervised pre-training were based on restricted Boltzmann machines, while right now we are using much more sophisticated techniques. But the idea is the same, and I believe it could play an important role in speech processing and, more in general, in machine learning in the near future.
If you are interested in this topic and you would like to read more on self-supervised learning in audio and speech, you can take a look at the ICML workshop on self-supervised learning in audio and speech that we have recently organized: you can go to the website, see all the presentations, and read all the papers. I think it has been a quite interesting initiative, and let me also highlight that there will be a similar initiative soon, and I encourage you to participate in that one as well.
Alright, since I have a few more minutes, I'm very happy to update you on another very exciting project I am leading these days, which is called SpeechBrain. SpeechBrain will be an open-source, all-in-one toolkit, entirely based on PyTorch, and our goal is to build a tool that can significantly speed up the research and development of speech and audio processing techniques. So we are building a toolkit which will be efficient, flexible, modular and, very importantly, easy to use. The main difference with the other existing toolkits is that SpeechBrain is specifically designed to address multiple speech tasks at the same time: with SpeechBrain you can do speech recognition, speech enhancement, speaker recognition, emotion recognition, multi-microphone signal processing, speaker diarization, and many other things.
Typically, all these tasks share the same underlying technology, which is deep learning, and there is thus no real reason why we should need a different repository for each different kind of speech application. What we want is something like our brain: a single tool that is able to process several speech applications at the same time. The main issue with the other toolkits is that most of them are designed for a single task. For instance, you can use Kaldi for speech recognition, but the idea behind it is creating something that can be extremely good at doing speech recognition; similarly, there are toolkits that are very good for speaker recognition. A toolkit that is explicitly designed to work with different tasks still does not exist, and people, when they have to implement complex pipelines involving different technologies, like speech enhancement plus speech recognition, or speech recognition plus speaker recognition, often have to jump from one toolkit to another. And of course jumping from one toolkit to another is very demanding: there can be different programming languages, different coding styles, different tensor formats, et cetera. Another issue is that, if we have different toolkits, it is very hard to combine the systems together and train them jointly, in a single system, for instance fully end-to-end, and we believe this is a very important research direction.
So we are actually working on that, and with SpeechBrain we are trying to develop a tool that will allow users to actually build these complex speech pipelines in an easy way. As for the timeline, we have worked a lot on it this year: we have a team, with a lot of people working on it and a lot of interest, and we are very close to a first release, which will happen, we estimate, within a couple of months. So I strongly encourage you to stay tuned, try SpeechBrain in the future, and give us your feedback.
Speaking of the project, it is also backed by several sponsors; thanks to their support the project is getting bigger, and we hope to have also the support of the whole speech community. Before concluding, let me give credit to my collaborators: the people that have been working with me on all the works presented in this talk, and, here, the team that is currently working on SpeechBrain. Let me thank all of them, because I think together we are working very well, and soon you will see the results of our hard work. Thank you very much for your attention, and I'm very happy now to reply to your questions.
Many thanks, Mirco, for this amazing presentation. I already have a set of questions for you. The first question from the chat asks whether PASE is less computationally demanding than standard fully supervised training.
Actually, I think I can take this opportunity to clarify this matter a little bit; there are a couple of things to consider. With PASE we are trying to learn not a task-specific representation but a general representation. This means that you can train your self-supervised network just once, and then you can use just a small amount of supervised data to train the downstream system. This naturally leads to computational advantages, because you have to train the big model only once, and then, when you move to the standard supervised learning part, if you have a good representation, the supervised part is going to be much easier.

The other good thing about PASE, which I didn't remark on much in the presentation but is worth remarking here a little bit, is that PASE is pretty data-efficient. We found very good results even just using something like fifty hours of speech, which is very little compared to what we see these days in self-supervised learning, where people are using thousands and thousands of hours of speech. We are data-efficient because, with the multiple workers, we somehow try to extract as much of the available information as possible from the signal; we are trying to do our best to be data-efficient and extract everything we can from the signal.

So there are really two things here: the fact that we are learning a general representation, so you can train PASE only one time and use it for multiple tasks; and also the multi-task part, which allows you to learn a reasonable representation even with a relatively small amount of unlabeled data.
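The compute argument in this answer, pretrain once and then fine-tune cheaply per task, can be put in back-of-envelope numbers. Everything below is made up for illustration; none of these figures come from the talk or the paper.

```python
# Hypothetical costs (invented numbers, for illustration only): the
# expensive self-supervised pretraining is paid once, while each
# downstream task only pays for a light supervised phase.
PRETRAIN_COST = 100.0   # pretend GPU-hours for the self-supervised phase
FULL_SUP_COST = 80.0    # pretend GPU-hours to train one task from scratch
FINETUNE_COST = 10.0    # pretend GPU-hours to fine-tune on a few labels
NUM_TASKS = 5

from_scratch = NUM_TASKS * FULL_SUP_COST                    # 5 separate systems
with_pretraining = PRETRAIN_COST + NUM_TASKS * FINETUNE_COST

print(from_scratch)       # 400.0
print(with_pretraining)   # 150.0
```

The crossover obviously depends on how many tasks reuse the encoder; with a single task the shared pretraining would not pay off under these toy numbers.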
And do you have other comments on this part?

Sorry, my wifi is very bad, so the question didn't reach me well, but anyway I'll try my best.

No problem. I have another question here from the chat: could you comment on the robustness of this self-supervised learning outside ideal conditions, for instance in noisy and reverberant environments?
Actually, we increased the robustness of PASE a lot when we revised it with PASE+. As I mentioned before, in PASE+ we basically combine self-supervised learning with on-the-fly data augmentation. What does that mean? Every time we process a sentence, we contaminate it with a different sequence of noises and a different reverberation, such that the system sees a different version of the sentence every time, with a different contamination each time. And at the output, our workers are not extracting their targets from the noisy signal but from the original clean one, so somehow our system is forced to denoise the features. This leads to the robustness we have seen before: we actually tried it on challenging tasks like the CHiME data, and it was surprisingly good; it increased robustness compared to standard approaches.
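A minimal sketch of this on-the-fly contamination idea, in plain Python with toy signals. The real PASE+ pipeline uses banks of recorded noises and room impulse responses; the function names and the 0.1 noise gain below are purely illustrative.

```python
import random

def contaminate(clean, noise_bank, reverb_bank):
    """Toy sketch of PASE+-style on-the-fly contamination: each call
    draws a fresh noise and a fresh 'reverb' so the same sentence looks
    different every epoch, while the worker targets stay clean."""
    noise = random.choice(noise_bank)
    rir = random.choice(reverb_bank)
    # Add noise sample-wise (lengths assumed equal for the sketch).
    noisy = [c + 0.1 * n for c, n in zip(clean, noise)]
    # Toy "reverberation": convolve with a short impulse response.
    out = [0.0] * len(noisy)
    for t in range(len(noisy)):
        for k, h in enumerate(rir):
            if t - k >= 0:
                out[t] += h * noisy[t - k]
    return out

def worker_targets(clean):
    """Targets are computed from the CLEAN signal (here just the
    waveform itself, standing in for clean MFCC/filterbank features),
    which is what forces the encoder to denoise its input."""
    return clean

clean = [1.0, 0.0, -1.0, 0.5]
noisy_in = contaminate(clean, noise_bank=[[0.2, -0.1, 0.3, 0.0]],
                       reverb_bank=[[1.0, 0.5]])
targets = worker_targets(clean)  # unchanged by the contamination
```

The key asymmetry is that `noisy_in` feeds the encoder while `targets` come from `clean`, mirroring the "predict clean targets from contaminated input" recipe described in the answer.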
Good, thank you. Let me ask some more questions. There is also a question about possible competition between the workers in PASE: couldn't it happen, for instance, that LIM and GIM consider the same segments, one as a positive example and the other as a negative one? What do you expect PASE to be able to learn in this case?
Actually, the set of workers that we tried is not random: we took the opportunity of this work to do a lot, a lot of experiments, and we came out with a subset of workers, a subset of ideas, that actually works for us. One of our concerns was: how is it possible to put together regression tasks, which are based for instance on mean squared error losses, with binary tasks, which are based on other kinds of losses like binary cross-entropy? How can we learn these things together? We thought there would be a big issue, but we realized that actually there is not: just by doing experiments, doing some kind of ablation of the workers, we noticed that the more workers we put together, the better it is.

And the same holds for LIM and GIM, which are actually different: LIM is based on small chunks of speech, and we think it learns local, maybe phonetic information, while GIM plays the same game but with larger chunks of one second, one and a half seconds, and we think it learns, hopefully, higher-level representations. So we found that the two at the same time are helpful, even though they clearly work on correlated subsets.
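One plausible reading of "just put the workers together" is that the heterogeneous worker losses are simply accumulated into a single objective backpropagated through the shared encoder. A toy plain-Python sketch; the unit weights are an assumption, and a real system may weight or average the workers differently.

```python
import math

def mse(pred, target):
    """Mean squared error, as used by regression workers (e.g. MFCC/prosody)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(prob, label):
    """Binary cross-entropy, as used by binary workers like LIM/GIM
    that classify positive vs negative pairs."""
    eps = 1e-7
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def total_worker_loss(regression_outputs, binary_outputs):
    """Accumulate all worker losses into one scalar objective.
    Weights are all 1 here, an illustrative simplification."""
    loss = 0.0
    for pred, target in regression_outputs:
        loss += mse(pred, target)
    for prob, label in binary_outputs:
        loss += bce(prob, label)
    return loss

L = total_worker_loss(
    regression_outputs=[([0.1, 0.2], [0.0, 0.2])],
    binary_outputs=[(0.9, 1), (0.2, 0)],
)
```

Even though the individual losses live on different scales, a single sum like this is the simplest way to train the shared encoder on all tasks at once.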
The next question is coming from the chat as well. The person was thinking about PASE, and the point is that PASE does not explicitly deal with within-speaker variability: none of the tasks forces the network to distinguish within-speaker variation from between-speaker variation. Do you foresee any problem in adding some supervised tasks, for instance ones where you have speaker labels?
Well, first of all, including supervised tasks totally makes sense; honestly, one can play with that and mix supervised and self-supervised tasks. Actually, some people have already done it: I saw some recent papers that try to do exactly that. In this paper, for PASE, we preferred to stay on the self-supervised side only, to make sure that what we actually evaluate at the output is a pure self-supervised learning approach.

As for PASE for speaker recognition and within-speaker variability: yes, PASE is not specifically designed for that, so it is not optimal, but we anyway learn some kind of speaker identity. We didn't test this too much, but we are confident that it can be quite competitive with standard systems; maybe we would have to revise the architecture a little bit for pure speaker recognition applications, because these days we see numbers that are impressive in terms of equal error rate on VoxCeleb. But the same idea, I think, could be extended and redesigned specifically to learn better speaker embeddings. Actually, our main target was more general: we wanted to learn a pretty general representation and see if it somehow works reasonably well for multiple tasks.
Thank you; that links nicely with the next question from the chat, which asks if you can comment on the fact that the system has no means to disentangle speaker and channel or session information, since you are using positive examples coming from within a single file.
Actually, what we do is the contamination I showed on the slide a moment ago. If we have sentence one, one time sentence one is contaminated with some kind of channel effect, some kind of reverberation, and the next time it is contaminated with another one. So maybe with this approach we limit this effect a little bit, but yes, there might be this issue.

Do you mean that the contamination you use mitigates the channel and session problem by itself?

Maybe it does not tackle the full problem, but at least it minimizes it, or reduces it, right? On the other hand, we would like to stay in the self-supervised domain, so we don't have speaker labels; we cannot say "let's jump to another signal from the same speaker", because in that case we would be using the labels. So the best we can do is to contaminate the sentence, change the reverberation and noise effects a little bit, and hope to learn something more robust to the channel.
Fine. Now we move to another question from the chat: the PASE model can be used from two perspectives, embedding extraction and model pre-training. Both should be effective, but which one may be better for speaker verification?
I think PASE could be used in both ways, you're right: it can be used for feature extraction, or embedding extraction, or for pre-training. My experience is that it works very well in a pre-training scenario: it is designed basically to pre-train your neural network with the self-supervised phase and then fine-tune it with a small amount of supervised data. This is basically the main application we have in mind for PASE. But we also tried it as a standard feature extractor, or embedding extractor, not for speaker recognition but for speech recognition, and it works quite well: if you freeze the encoder and plug the features you get there into a supervised classifier, it works well, but it works better if you jointly fine-tune the encoder and the classifier during the supervised phase.
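The two usage modes mentioned here can be sketched as follows. These are toy plain-Python stand-ins for the modules; the class names and the fake gradient values are illustrative, not the actual PASE API.

```python
class Encoder:
    """Stand-in for the pretrained PASE encoder."""
    def __init__(self):
        self.weight = 1.0     # pretend pretrained parameter
        self.frozen = False
    def step(self, grad, lr=0.1):
        if not self.frozen:   # frozen => pure feature-extractor mode
            self.weight -= lr * grad

class Classifier:
    """Stand-in for the downstream supervised head."""
    def __init__(self):
        self.weight = 0.5
    def step(self, grad, lr=0.1):
        self.weight -= lr * grad

def supervised_phase(encoder, clf, finetune_encoder):
    """One toy 'training step': both modules receive a (fake) gradient,
    but only the classifier moves when the encoder is frozen."""
    encoder.frozen = not finetune_encoder
    encoder.step(grad=0.2)
    clf.step(grad=0.2)
    return encoder.weight, clf.weight

frozen = supervised_phase(Encoder(), Classifier(), finetune_encoder=False)
joint = supervised_phase(Encoder(), Classifier(), finetune_encoder=True)
```

In the frozen case the encoder weight is untouched and only the head adapts; in the joint case both move, which matches the answer's observation that joint fine-tuning usually works better.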
Thank you. Let's come back to the workers, with a question about the temporal sequence worker: can you elaborate a bit more on the worker focused on the temporal sequence? In some cases, a segment from the future would reasonably contain the same thing as the current one. Is there some problem with this worker? Do you have some comments?
Definitely, that's a very nice question. Actually, the sequence worker is one of those that have an important impact on performance. As I mentioned, with a lot of ablation studies we tried to figure out the effect of each task: some were working well and improved things more, others less; the most important ones were the regressors and LIM and GIM.

And actually there is an important risk here: when you build a sample from the past or a sample from the future, you have to make sure you are not sampling within the receptive field of your convolutional neural network, otherwise the task becomes too easy. So what we have done is to make sure that the future sample is not too close to the anchor one, and not too far either: if it is too close, the risk is to learn nothing, basically, and if it is too far, the risk is that there isn't any reasonable correlation between the two anymore. So it's not easy to design this task. What we did was to sample the past and the future representations within some reasonable range. It could be interesting to study it better, I believe, if you have other ideas.
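The "not too close, not too far" constraint described in this answer can be sketched as a simple bounded sampler. The function name and the particular bounds are illustrative assumptions, not the values used in PASE.

```python
import random

def sample_future_index(anchor, receptive_field, max_offset, seq_len,
                        rng=random):
    """Sample a 'future' frame index for the sequence worker: it must
    sit beyond the encoder's receptive field (otherwise the task is
    trivially easy) but within max_offset of the anchor (otherwise
    anchor and target are no longer correlated)."""
    lo = anchor + receptive_field + 1          # not too close
    hi = min(anchor + max_offset, seq_len - 1)  # not too far
    if lo > hi:
        raise ValueError("no valid future frame in range")
    return rng.randint(lo, hi)

idx = sample_future_index(anchor=10, receptive_field=5,
                          max_offset=20, seq_len=100)
# idx is guaranteed to lie in [16, 30]
```

Sampling the past works symmetrically by subtracting the offsets instead of adding them.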
Let's move to another question from the chat: were you able to analyze what the learned filters focus on, for instance for extracting speaker-specific information?

Well, in this paper, the PASE paper, which is not only about speaker recognition, the filters that we learn are actually not that far away from the standard mel filters, where we basically have more filters located in the lower part of the spectrum and fewer filters in the higher part. With LIM, the technique that was designed to work only for speaker recognition, the filters showed some areas with higher filter density in the regions where the pitch and the formants of speech are more common, similar to what we have seen using SincNet with the supervised approach. But with PASE we are not allocating extra filters in those speech regions; we end up more or less on the same pattern as the standard mel filter scale.
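For reference, the mel spacing that this answer compares the learned filters against can be computed directly: centers equally spaced on the mel scale are denser at low frequencies. The choice of 40 filters and an 8 kHz upper edge below is an arbitrary illustrative configuration.

```python
import math

def hz_to_mel(f):
    """Standard HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_centers(n_filters, fmin=0.0, fmax=8000.0):
    """Center frequencies of a mel filter bank: equally spaced in mel,
    hence packed more densely at low frequencies in Hz."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (n_filters + 1))
            for i in range(1, n_filters + 1)]

centers = mel_centers(40)
low_band = sum(1 for c in centers if c < 2000.0)   # filters below 2 kHz
high_band = sum(1 for c in centers if c >= 2000.0)
# More filters fall below 2 kHz than above it.
```

This low-frequency crowding is the baseline pattern the answer says PASE's learned filters roughly reproduce, in contrast to the speaker-specific pitch/formant emphasis seen with LIM and SincNet.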
We are close to the conclusion. I don't have more open questions, but I have one myself, if possible. I would like you to explain a bit more what you said about unsupervised training reducing the biases of supervised training. It is my feeling that a bias is easier to find if you do supervised training, because you have some information on the data, some meta-information; with unsupervised training, it seems to me that you have less information, but that is no reason to have less bias in the final system. Could you explain?
Okay. The reason is that if you train your representation with supervised data, your representation could be biased toward that task. Specifically, for instance, if you train a speaker representation with speaker recognition, your representation is not good for speech recognition: it carries a bias toward speaker recognition. With self-supervised learning, at least in the way we are trying to do it, with multiple tasks et cetera, this risk is reduced, because you have the same representation that is good for both speech recognition and speaker recognition.
Thank you. I really want to thank you again; we are a bit over the official time, but before closing the session I will give the microphone to the organizers, who also want to thank you.

Yes, on behalf of the organizers, thank you for this very nice talk, the very wide overview of the topic, and all the discussion we had in this session.

Thanks for inviting me, that was really great. Thank you.

Okay, thanks to everybody again, and see you all tomorrow at the same time.

Definitely.

So, see you soon. Bye for now.