okay thank you very much, so the 276 system was a collaborative work between the brno university of technology and partner sites
so let's start, as was already introduced we had twenty-four languages to deal with, and we had a new metric; the list of the languages complicated things somewhat
so the new metric covers the 276 language pairs, and that's where the name of the system comes from: we had to select the twenty-four worst pairs in terms of minDCF and then compute the average, as we'll see
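The metric can be sketched as follows; this is a toy illustration of my own, where the priors and costs in `min_dcf` are placeholders rather than the official evaluation parameters:

```python
import numpy as np

def min_dcf(tgt, non, p_tgt=0.5, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all score thresholds for one pair."""
    scores = np.concatenate([tgt, non])
    labels = np.concatenate([np.ones(len(tgt)), np.zeros(len(non))])
    labels = labels[np.argsort(scores)]
    # sweep the threshold from below the lowest score upward:
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / len(tgt)])
    p_fa = np.concatenate([[1.0], 1 - np.cumsum(1 - labels) / len(non)])
    dcf = p_tgt * c_miss * p_miss + (1 - p_tgt) * c_fa * p_fa
    return dcf.min()

def avg_worst_pairs(pair_scores, n_worst=24):
    """Average minDCF over the n_worst hardest of the 276 language pairs.

    pair_scores: dict mapping a language pair to its (target, non-target)
    score arrays for that pair's detection task.
    """
    costs = sorted((min_dcf(t, n) for t, n in pair_scores.values()),
                   reverse=True)
    return float(np.mean(costs[:n_worst]))
```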
in order to be able to deal with those languages we had to collect data, so this is basically the list of data that we used: from the past evaluations there are some CallFriend, Fisher, and Mixer data from the SRE evaluations, previous LRE evaluations, OGI data of foreign-accented English, some speech data for the widely used European languages, Switchboard, and some broadcast data from the Voice of America and Radio Free Europe, and then some Iraqi Arabic conversational speech and Arabic broadcast speech
as was shown earlier, there were some languages for which we didn't have enough data, so what we did is that we added additional radio data from public sources: we used Radio Free Europe, Radio Free Asia, some Czech broadcasts, and others
the list of languages we covered this way included Czech, Farsi, Lao and Panjabi, Levantine and Maghrebi Arabic, Mandarin, and I guess there were a couple more
so what we did is phone-call detection: we detected the parts of the broadcasts where telephone conversations were present
and for each language we ran automatic speaker labeling, because when building the train and test sets we didn't want the speakers to overlap
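A speaker-disjoint split of this kind can be sketched like so; this is a minimal illustration of my own, where the speaker labels stand in for the output of the automatic speaker labeling:

```python
import random

def speaker_disjoint_split(utterances, train_frac=0.7, seed=0):
    """Split utterances so no speaker appears in both train and test.

    utterances: list of (utterance_id, speaker_label) pairs, where the
    speaker labels come from automatic speaker labeling/clustering.
    """
    speakers = sorted({spk for _, spk in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    cut = int(train_frac * len(speakers))
    train_spk = set(speakers[:cut])
    # assign whole speakers, never individual utterances, to a side
    train = [u for u, s in utterances if s in train_spk]
    test = [u for u, s in utterances if s not in train_spk]
    return train, test
```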
this is the scheme for our development set, so we used the LRE11 development data
we made two sets actually: the first one was the trusted data, which was based on the NIST thirty-second cut definition; we again ran automatic speaker labeling and split it into non-overlapping training and test parts
and then we took the entire conversations as well as the thirty-second excerpts, so a conversation side was represented by thirty-second segments, and all splits from one conversation side went to either the train or the test set, never both; again there was automatic speaker labeling
of course this gave us more data, but it was less reliable; we'll see in our contrastive system that doing this, having all the data at our disposal, helped a little bit
so to give a little bit of statistics on our dataset: we had the train set of about sixty-six thousand segments, which was based on all kinds of sources; the test set, which was about thirty-eight thousand segments, was based basically on the previous LRE evaluations; and then the dev set, which was based on the LRE11 development data
so, a little overview of our systems: we had a submission of three systems, one primary and two contrastive
the primary system consisted of one acoustic subsystem, which was based on i-vectors (you'll see the description later), and then three phonotactic subsystems, so it was a diverse set: a binary decision tree system based on an English tokenizer, a PCA-reduction system based on a Russian tokenizer, and a multinomial subspace i-vector system based on a Hungarian tokenizer
the first contrastive system was the same as the primary, except that we excluded the DEV2 data, that means the entire conversations; we'll see the results later
and the second contrastive system was just a fusion of the two best systems, the acoustic and the English phonotactic; the problem with the latter was that on the development data it gave very good results, but on the evaluation it ran into some kind of problem, as we'll see
so here is a little diagram of our system
at the very left we have the front ends, the acoustic i-vector extractor, the phonotactic i-vector extractor and the PCA, which basically convert the input into fixed-length vectors: i-vectors for the acoustic i-vector extractor, and what we also call i-vectors for the phonotactic i-vector extractor and the PCA; after that we do the scoring
then we have the binary decision tree model, which is basically based on a log-likelihood evaluation of the n-gram counts themselves, so it directly produces scores, which can then go to the pre-calibration
both the scoring and the pre-calibration are based on logistic regression, and the fusion is also based on logistic regression; out of it we get twenty-four scores, the log-likelihoods, and then we compute the pair-wise log-likelihood ratio for each of the pairs
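This last step, turning the twenty-four per-language log-likelihoods into pair-wise detection scores, can be sketched as a minimal illustration:

```python
import itertools

def pairwise_llrs(loglikes, languages):
    """Convert per-language log-likelihoods to pair-wise LLRs.

    For a language pair (a, b), the detection log-likelihood ratio is
    simply loglike_a - loglike_b, decided against a threshold of 0.
    The difference is invariant to adding a common offset to all
    log-likelihoods, and the sign of the LLR (the decision) is
    invariant to a positive rescaling of them.
    """
    return {(a, b): loglikes[i] - loglikes[j]
            for (i, a), (j, b) in
            itertools.combinations(enumerate(languages), 2)}
```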
this just shows how the data described in the previous section were used: the train database was used for the front-end training and for the scoring classifier training, while the dev and test databases were used for the backend, that is the pre-calibration and the fusion
so for the acoustic system we used a Hungarian phoneme recognizer based VAD, basically to remove the silence; then we used VTLN, feature dithering, and cepstral mean and variance normalization with RASTA processing, basically similar to what was presented earlier
the modeling was based on a full-covariance UBM with 2048 components, and the i-vector size was six hundred
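One step of this front end, the per-utterance cepstral mean and variance normalization, can be sketched as follows; a minimal version of my own, without the RASTA filtering or VTLN:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization over one utterance.

    features: (num_frames, num_coeffs) array of cepstral features,
    ideally restricted to the speech frames found by the VAD.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    # each coefficient track becomes zero-mean, unit-variance
    return (features - mu) / (sigma + eps)
```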
for the phonotactic systems we used a diversity of techniques on top of the tokenization
the PCA-based feature extraction was based on the Hungarian tokenizer: what we do is take the n-gram counts, take the square root of the counts, run PPCA on top of that, and reduce the dimensionality to six hundred; the result is then basically used in the same way as the acoustic i-vectors
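The count pre-processing can be sketched like this; note that I use a plain SVD-based PCA here as a stand-in for the probabilistic PCA used in the system, and the square root acts as a variance-stabilizing transform of the counts:

```python
import numpy as np

def fit_pca(counts, dim=600):
    """Fit PCA on square-rooted n-gram count vectors.

    counts: (num_utterances, num_ngrams) matrix of tokenizer n-gram
    counts.  Returns the mean and a (num_ngrams, dim) projection basis.
    """
    x = np.sqrt(counts)
    mean = x.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:dim].T

def project(counts, mean, basis):
    """Map raw counts to the low-dimensional i-vector-like features."""
    return (np.sqrt(counts) - mean) @ basis
```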
and then we had the multinomial subspace modeling of the trigram counts, which was based on the Russian tokenizer
this is something slightly newer, something that was presented earlier: it basically models the n-gram counts in a subspace of the simplex
the output of such an approach is also an i-vector-like feature, which we then again process in the same way as the i-vectors
and then we had the binary decision tree system, which is basically a novel technique where decision trees are used to cluster the n-gram counts, and a likelihood evaluation on this clustering is used to get the score
now the scoring for the acoustic i-vector and the two phonotactic i-vector systems: the input was the i-vector, six hundred dimensional, or one thousand dimensional in the case of the PCA
we performed length normalization and then within-class covariance normalization
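These two normalization steps can be sketched as follows; a minimal version of my own, where the WCCN transform is estimated on labeled training i-vectors:

```python
import numpy as np

def length_norm(ivectors):
    """Scale each i-vector to unit Euclidean length."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)

def wccn_transform(ivectors, labels):
    """Cholesky-based within-class covariance normalization.

    Returns a matrix B such that x @ B whitens the average
    within-class covariance of the training data.
    """
    dim = ivectors.shape[1]
    w = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        w += np.cov(ivectors[labels == c], rowvar=False)
    w /= len(classes)
    # B B^T = W^{-1}, so cov(x @ B) within classes becomes identity
    return np.linalg.cholesky(np.linalg.inv(w))
```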
after that, as the classifier we used regularized multiclass logistic regression with a cross-entropy objective function; the regularizer was an L2 regularizer, and the penalty was chosen without cross-validation
it was trained on the train database, and the output was a set of scores
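The scoring classifier can be sketched as a small softmax regression trained by gradient descent; this is a stand-in for the real trainer, and the hyperparameters here are placeholders, since the talk only specifies an L2 penalty and the cross-entropy objective:

```python
import numpy as np

def train_mclr(x, y, num_classes, l2=1.0, lr=0.1, iters=500):
    """Multiclass logistic regression, cross-entropy + L2 penalty."""
    n, d = x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(iters):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad_w = x.T @ (p - onehot) / n + l2 * w      # L2 on weights only
        grad_b = (p - onehot).mean(axis=0)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def score(x, w, b):
    """Per-class scores (unnormalized log posteriors)."""
    return x @ w + b
```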
what we did with each set of twenty-four scores is pre-calibration of each system: that was a full affine transform, and we used regularized logistic regression, which was trained on the test and dev databases
and in the end we fused the four systems with a constrained affine transform: instead of assigning each of the twenty-four scores of each of the systems an individual scale constant, we had one scale constant per system, and we had a single vector of offsets
this logistic regression was also a Bayesian, regularized logistic regression, and it was trained on the test and dev databases
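The constrained fusion then looks like this; the weights are shown as given here, whereas in the system they were trained by the regularized logistic regression:

```python
import numpy as np

def fuse(system_scores, alphas, offset):
    """Constrained affine fusion of several systems' score vectors.

    system_scores: list of (num_trials, num_languages) arrays, one per
    subsystem (24 languages in this evaluation)
    alphas: one scalar weight per subsystem, not per individual score
    offset: a single (num_languages,) offset vector shared by all systems
    """
    fused = sum(a * s for a, s in zip(alphas, system_scores))
    return fused + offset
```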
as I said, the decisions were done using the log-likelihood ratios that came out: the decisions for the 276 pairs were obtained by converting the twenty-four scores into log-likelihood ratios among all those pairs
for each pair this is just a subtraction of scores, which gives the detection score, and the decisions were all made at a threshold of zero
just a little comment: these decisions are invariant to scaling of the log-likelihoods, and calibration errors shared by both languages of a pair cancel out, so only the relative calibration matters
now a note on the analysis and the numbers we used when assessing the system: we fixed the twenty-four worst pairs on our thirty-second evaluation, and we compare three different numbers, the actual DCF, the minimum DCF, and a "star" DCF, which is based on the recipe mentioned on Monday, based on the log-likelihood pre-calibration
we will compare the development and evaluation sets, and we will show the comparison for eight systems: the four individual systems, that is the phonotactic subsystems and the acoustic i-vectors, and then four fusions, the primary, the first contrastive, the second contrastive, and also a three-system fusion which excluded the English phonotactic system, which somehow misbehaved, as we will see
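The calibration loss discussed below is simply the gap between the DCF at the fixed threshold and the minimum DCF; a minimal sketch of my own, with placeholder prior and costs:

```python
import numpy as np

def actual_dcf(tgt, non, threshold=0.0, p_tgt=0.5, c_miss=1.0, c_fa=1.0):
    """DCF at a fixed decision threshold (0 for calibrated LLRs).

    The gap between this value and the minimum DCF over all thresholds
    is the calibration loss.
    """
    p_miss = np.mean(np.asarray(tgt) < threshold)
    p_fa = np.mean(np.asarray(non) >= threshold)
    return p_tgt * c_miss * p_miss + (1 - p_tgt) * c_fa * p_fa
```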
so these are the results for the thirty-second condition, the one on which we fixed the worst pairs
here we can see the misbehavior of the English system: the left bars are for the development set and the right ones for the evaluation set, and while the trend looks as expected on the development data, the English phonotactic system falls apart on the evaluation
this is for the ten-second condition
so the contrastive-one system is the system where we excluded the DEV2 data, which comprised the entire segments, and we see that there is a slight hit compared to the primary system; these two systems, to remind you, are the very same except that in the calibration and scoring some data were left out
again, the English system badly misbehaves here
if we look at the difference between the blue and the right bar, which is the difference between the minimum and the actual DCF, we see the miscalibration; it was not a tragedy, but we didn't do the calibration very well
on the thirty-second condition, where the calibration is even simpler, the miscalibration is much more reasonable, especially for the fusions
here, comparing the contrastive-one system with the primary, we see that excluding the data really hurts; that's one thing
then there is the three-system fusion, which is equivalent to the primary with the English system excluded
we see that on the development set this didn't do much; in fact the fusion is slightly worse when excluding the English system, because the English performed very well on the development set, but the English system hurts really badly on the evaluation, so excluding it helps there
so again, just to summarize the observations: there was a big deterioration between the minDCF on the development data and on the evaluation data
there were no calibration disasters, but on the thirty seconds, as I pointed out, we could have done better
the binary tree system was kind of broken, and what we found out later is that if we applied similar dimensionality-reduction and scoring techniques to the English tokens as well, the system was good again; so the problem was due to the plain likelihood evaluation
and the acoustic system outperforms the phonotactic ones almost everywhere; there were only a couple of language pairs where the phonotactic was better
we also did an analysis of our system versus the MIT system, since MIT's was the best one
there was a weak correlation between the sites in the difficulty of pairs
the minDCFs were very similar, with the minDCF for the worst twenty-four pairs slightly worse for us than for MIT, while on the actual DCF we had a big calibration hit
there's an interesting plot here which compares some of the selected Arabic dialects versus the Slavic languages, where we suspected MIT might have had more data for the Arabic dialects
so we see that we do very poorly on some of the pairs, Iraqi Arabic versus Pashto et cetera, mostly due to the lack of data, while on the Slavic languages we do better on some selected pairs
so this is just to show that the amount of data really matters
here is a correlation plot between the pair scores of the BUT and the MIT systems, with BUT on one axis and MIT on the other
if the two systems behaved the same, all the points would be aligned on the diagonal, but we see that on some of the pairs we did very differently
and this shows the minDCF versus the actual DCF of the worst pairs for the MIT and the BUT systems
these are the averages at each point, and we see that on average MIT did better: the MIT points lie more or less on a single line, while our system's points are more scattered, so this again shows the calibration hit
so, to conclude: we built several systems, but we selected only four for the primary fusion
the acoustic system outperforms the phonotactic ones
for the phonotactic systems we tried different backends and saw that the dimensionality reduction really helps
we took a big hit on the English phonotactic system, and at the time of the evaluation we did not know why; we have since fixed the scoring
and probably we could use specialized detectors for selected pairs
yes, we used the shifted deltas: six MFCCs plus C0, with the shifted delta cepstra
as for the regularization, we used it both in our scoring and in our pre-calibration: a little L2 regularization