0:00:14 | Hi. |
0:00:15 | This is a presentation from the LEAP lab, Indian Institute of Science, Bangalore. |
0:00:22 | I will be presenting our paper, the LEAP system for the SRE19 CTS challenge: improvements in data analysis. |
0:00:31 | The goal of this paper [inaudible]. |
0:00:42 | Let's go on to the outline of this presentation. |
0:00:45 | I will first introduce a brief overview of how speaker recognition systems work, |
0:00:51 | discuss the SRE19 challenge performance metrics, |
0:00:56 | talk about the front-end and back-end modeling in our systems, |
0:01:01 | discuss the results of these systems, |
0:01:04 | and then some analysis of post-evaluation results before concluding the presentation. |
0:01:12 | This is a brief overview of how speaker verification, or speaker recognition, systems work. |
0:01:19 | In the first phase, we take the raw speech and extract features like MFCCs from it. |
0:01:26 | These features are then processed with some voice activity detection and normalization. |
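The front-end step described here can be sketched in a few lines. This is a minimal illustration, not the exact pipeline used in the paper: a toy energy-based voice activity detector plus cepstral mean and variance normalization, with the feature matrix, threshold, and sizes all being assumed placeholders.

```python
import numpy as np

def energy_vad(frames, threshold_db=-30.0):
    """Flag frames whose log energy is within threshold_db of the loudest
    frame. A minimal energy-based voice activity detector; real systems
    use more robust detectors, but the idea is the same: drop silence."""
    log_e = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    return log_e > (log_e.max() + threshold_db)

def cmvn(features):
    """Cepstral mean and variance normalization over the utterance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10
    return (features - mu) / sigma

# Toy usage: 100 frames of 20-dim "MFCC" features, half of them near-silent.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 20))
feats[50:] *= 0.001                     # simulate silent frames
voiced = energy_vad(feats)
clean = cmvn(feats[voiced])
```

Only the voiced frames survive, and the retained features end up zero-mean per dimension.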
0:01:33 | Then these features are given as input to train the deep neural network model parameters. |
0:01:40 | The most popular neural network based embedding extractors in the last few years have been the x-vector models. |
0:01:47 | Once the extractor training phase is done, we enter the PLDA training phase. |
0:01:54 | The extracted x-vectors have some processing done on them, like centering and LDA; |
0:02:00 | they are then unit length normalized before training the PLDA model. |
0:02:06 | Most popular state-of-the-art systems use a generative Gaussian PLDA model for the back-end system. |
0:02:14 | In the verification phase, we have a trial, which consists of an enrollment utterance and an utterance under test. |
0:02:22 | The objective of the speaker recognition system is to determine whether the test utterance belongs to the target speaker or a non-target speaker. |
0:02:34 | Thus, once we extract x-vector embeddings for the enrollment and test utterances, we compute log-likelihood ratio scores using the PLDA back-end model, |
0:02:47 | and using these scores we determine if the trial is a target one or a non-target one. |
0:02:57 | Let's look at the SRE19 performance metrics. |
0:03:01 | The NIST SRE challenge in 2019 consisted of two tracks: |
0:03:07 | the first one, speaker detection on conversational telephone speech, or CTS, |
0:03:13 | and the second was the multimedia speaker recognition. |
0:03:18 | Our work was on the first track, the CTS challenge. |
0:03:22 | The normalized detection cost function, or DCF, is defined as in equation 1: |
0:03:29 | C_Norm(beta, theta) = P_Miss(theta) + beta * P_FA(theta), |
0:03:38 | where P_Miss and P_FA are the probabilities of miss and false alarm, respectively. |
0:03:45 | A miss is when the speaker recognition system declares a target trial as a non-target one; that is, the system wrongly decides the enrollment and test utterances are not of the same speaker. |
0:04:00 | A false alarm is when a non-target trial is erroneously detected as a target trial. |
0:04:07 | P_Miss and P_FA are computed by applying a detection threshold theta on the log-likelihood ratios. |
0:04:15 | The primary cost metric of the NIST SRE19 for the conversational telephone speech is given by equation 2, |
0:04:24 | where beta_1 is equal to 99 and beta_2 is equal to 199. |
0:04:32 | The minimum detection cost, known as minDCF or C_min, is computed using the detection thresholds that minimize the detection cost. |
0:04:44 | Equation 3 aims to minimize equation 2 over the thresholds theta_1 and theta_2. |
0:04:52 | The equal error rate, EER, is the value of P_FA and P_Miss computed at the threshold at which P_FA and P_Miss are equal. |
0:05:02 | We report the results in terms of EER, C_min, and C_Primary for all of our systems. |
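The metrics described here can be sketched directly from their definitions. This is a minimal illustration on toy scores, with the beta values taken from the talk (99 and 199) and C_Primary assumed to be the average of the two minimum costs:

```python
import numpy as np

def error_rates(target_scores, nontarget_scores, threshold):
    """P_miss and P_fa for a given LLR detection threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)
    p_fa = np.mean(np.asarray(nontarget_scores) >= threshold)
    return p_miss, p_fa

def min_dcf(target_scores, nontarget_scores, beta):
    """Min over thresholds of C_norm(beta, theta) = P_miss + beta * P_fa."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    return min(p_m + beta * p_f
               for p_m, p_f in (error_rates(target_scores, nontarget_scores, t)
                                for t in thresholds))

def eer(target_scores, nontarget_scores):
    """Sweep thresholds until P_miss crosses P_fa; return their average."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    for t in thresholds:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        if p_miss >= p_fa:
            return (p_miss + p_fa) / 2
    return 0.5

# Toy scores: well-separated target and non-target LLRs.
tgt = np.array([2.0, 3.0, 4.0, 5.0])
non = np.array([-4.0, -3.0, -2.0, 1.0])
c_primary = (min_dcf(tgt, non, 99) + min_dcf(tgt, non, 199)) / 2
```

With perfectly separable toy scores, both the EER and the minimum costs are zero.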
0:05:11 | The SRE19 evaluation set consisted of over 2.5 million trials from 14,561 segments. |
0:05:22 | Let's look at the front-end modeling in our systems. |
0:05:26 | We trained three x-vector models with different subsets of the training data, which are described in the next slide. |
0:05:34 | We used the extended time delay neural network architecture. |
0:05:39 | The extended TDNN architecture consisted of twelve hidden layers and ReLU nonlinearities. |
0:05:46 | The model is trained to discriminate among the speakers in the training set. |
0:05:52 | The first ten hidden layers operate at the frame level, while the last two operate at the segment level. |
0:05:59 | There is a 1500-dimensional statistics pooling layer between the frame-level and segment-level layers; it computes the mean and standard deviation. |
0:06:11 | After training, embeddings are extracted from the 512-dimensional affine component of the eleventh layer, which is the first segment-level layer. |
0:06:22 | These embeddings are the x-vectors we use. |
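The statistics pooling step described here maps a variable-length sequence of frame-level activations to one fixed-length segment vector. A minimal sketch, with the frame count and dimensions as illustrative assumptions rather than the exact network sizes:

```python
import numpy as np

def stats_pooling(frame_outputs):
    """Map variable-length frame-level activations to a fixed segment
    vector by concatenating the per-dimension mean and standard
    deviation, doubling the dimension in the process."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])

# Toy usage: 200 frames of 750-dim activations -> one 1500-dim vector.
frames = np.random.default_rng(1).normal(size=(200, 750))
pooled = stats_pooling(frames)
```

The pooled vector is what the segment-level layers, and ultimately the x-vector embedding, are computed from.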
0:06:28 | This table describes the details of the training and development datasets used in the SRE19 evaluation systems. |
0:06:38 | XVec1, the x-vector-1 model, was trained entirely on the VoxCeleb corpus. |
0:06:46 | XVec2 used the Mixer 6 and Switchboard corpora. |
0:06:52 | XVec3 was the full x-vector system, which was trained on both VoxCeleb and previous SRE datasets. |
0:07:02 | The data partitions used in the back-end models of the individual systems submitted are indicated in table 2. |
0:07:14 | Now let's look at the back-end model. |
0:07:18 | Most of the popular systems in speaker verification use the generative Gaussian PLDA, or G-PLDA, as the back-end modeling approach. |
0:07:28 | Once the x-vectors are extracted, there is some preprocessing done on them: they are centered (the mean is removed), transformed using LDA, and then unit length normalized. |
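The preprocessing chain just described is only a few matrix operations. A minimal sketch, where the mean and LDA projection would be estimated on training data but are random placeholders here, and the 512-to-170 dimension reduction is an assumed example:

```python
import numpy as np

def preprocess(xvectors, mean, lda_matrix):
    """Center, LDA-project, and unit-length-normalize a batch of x-vectors."""
    centered = xvectors - mean                    # centering
    projected = centered @ lda_matrix             # LDA dimensionality reduction
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.maximum(norms, 1e-10)   # unit length normalization

rng = np.random.default_rng(2)
x = rng.normal(size=(10, 512))
mean = x.mean(axis=0)
lda = rng.normal(size=(512, 170))   # placeholder projection, 512 -> 170
processed = preprocess(x, mean, lda)
```

After this step every embedding lies on the unit sphere, which is the form the PLDA model is trained on.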
0:07:41 | The PLDA model on this processed x-vector of a particular recording is given by equation 4. |
0:07:49 | Here eta_r is the x-vector for the particular recording, omega is the latent speaker factor with a Gaussian prior, Phi characterizes the speaker subspace matrix, and epsilon_r is a Gaussian residual. |
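The generative model in equation 4 can be sketched as a sampler: eta_r = mu + Phi @ omega + eps_r. The sizes, the subspace matrix, and the noise level below are illustrative placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, spk_dim = 170, 50                        # illustrative sizes

mu = rng.normal(size=dim)                     # global mean
phi = 0.5 * rng.normal(size=(dim, spk_dim))   # speaker subspace matrix
sigma = 0.1                                   # residual std (kept diagonal/iso)

def sample_recording(omega):
    """Equation 4 as a generative sketch: eta_r = mu + Phi @ omega + eps_r,
    where omega is the latent speaker factor and eps_r is Gaussian noise."""
    eps = rng.normal(scale=sigma, size=dim)
    return mu + phi @ omega + eps

# Two recordings of the same speaker share one omega, so they land close.
omega = rng.normal(size=spk_dim)
rec_a = sample_recording(omega)
rec_b = sample_recording(omega)
other = sample_recording(rng.normal(size=spk_dim))
```

Same-speaker recordings differ only by the small residual, while different speakers differ through the subspace term, which is the structure the verification score exploits.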
0:08:06 | For the scoring, a pair of x-vectors, one from the enrollment recording, denoted eta_e, and one from the test recording, denoted eta_t, are used with the G-PLDA model to compute the log-likelihood ratio score given in equation 5. |
0:08:27 | Equation 5 is a quadratic function in eta_e and eta_t. |
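The quadratic form of the PLDA log-likelihood ratio can be written out directly. In this sketch the matrices P and Q, which would be derived from the trained PLDA parameters, are stand-in placeholders chosen only to show the shape of the computation:

```python
import numpy as np

def plda_llr(eta_e, eta_t, P, Q, const=0.0):
    """G-PLDA log-likelihood ratio as a quadratic function of the pair:
    s = eta_e' Q eta_e + eta_t' Q eta_t + 2 eta_e' P eta_t + const."""
    return (eta_e @ Q @ eta_e + eta_t @ Q @ eta_t
            + 2.0 * eta_e @ P @ eta_t + const)

rng = np.random.default_rng(4)
d = 8
P = np.eye(d)           # placeholder cross term
Q = -0.5 * np.eye(d)    # placeholder quadratic term
enroll = rng.normal(size=d)
test_same = enroll + 0.01 * rng.normal(size=d)   # near-duplicate embedding
test_diff = -enroll                              # very different embedding
```

Even with placeholder matrices, a matching pair scores higher than a mismatched one, which is the behavior the trial decision thresholds.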
0:08:32 | Along with the G-PLDA approach, we proposed a neural PLDA model, or NPLDA, for back-end modeling. |
0:08:45 | What we have here is a pairwise discriminative network. |
0:08:50 | The blue portion of the network corresponds to the enrollment embeddings, and the pink portion of the network corresponds to the test embedding. |
0:09:01 | We construct the preprocessing steps of the generative G-PLDA as layers in the neural network: |
0:09:10 | LDA as the first affine layer, |
0:09:14 | then unit length normalization as a nonlinear activation, |
0:09:18 | and then PLDA centering and diagonalization as another affine transformation. |
0:09:25 | The final pairwise scoring, which is given in equation 5 in the previous slide, is implemented as a quadratic layer. |
0:09:36 | The parameters of this model are optimized using an approximation of the minimum detection cost function, or minDCF. |
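The NPLDA-style forward pass and the smoothed cost can be sketched as plain functions. Everything below is an illustrative assumption: the parameters are random placeholders (in the actual system they are initialized from the generative G-PLDA and then trained discriminatively), and the sigmoid-smoothed cost is one common way to approximate the hard minDCF:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nplda_score(eta_e, eta_t, W1, b1, W2, b2, P, Q):
    """Pairwise discriminative (NPLDA-style) forward pass: affine (LDA-like)
    -> unit length norm as the nonlinearity -> affine (centering/diagonal-
    izing) -> quadratic scoring layer."""
    def branch(x):
        h = W1 @ x + b1                  # first affine layer (LDA)
        h = h / np.linalg.norm(h)        # length normalization
        return W2 @ h + b2               # second affine layer
    e, t = branch(eta_e), branch(eta_t)
    return e @ Q @ e + t @ Q @ t + 2.0 * e @ P @ t

def soft_detection_cost(scores, labels, beta, theta, alpha=10.0):
    """Differentiable approximation of C_norm = P_miss + beta * P_fa:
    the hard threshold count is replaced by a steep sigmoid."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_miss = np.mean(sigmoid(alpha * (theta - scores[labels == 1])))
    p_fa = np.mean(sigmoid(alpha * (scores[labels == 0] - theta)))
    return p_miss + beta * p_fa

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(4, 6)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
P, Q = np.eye(4), -0.5 * np.eye(4)
s = nplda_score(rng.normal(size=6), rng.normal(size=6), W1, b1, W2, b2, P, Q)
```

Because the smoothed cost is differentiable in the scores, gradients flow through the quadratic layer back into all the preprocessing layers.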
0:09:49 | Now let's look at our submitted systems and the results. |
0:09:54 | The table here shows details about the seven individual models that we submitted, and a couple of fusion systems. |
0:10:04 | The best individual system was the combination of XVec3, which is the full x-vector extractor, with the proposed NPLDA model. |
0:10:16 | For the SRE18 development set, it had a score of 5.31% EER and 0.28 C_min, |
0:10:25 | and the best scores for the SRE19 evaluation were 4.97% EER and 0.42 C_min. |
0:10:35 | The fusion systems gave some gains over the individual systems. |
0:10:41 | Overall, the full x-vector system XVec3 performs significantly better than the VoxCeleb-based XVec1 and the XVec2 systems, for any choice of back end. |
0:10:57 | The systems trained with the NPLDA back end beat their G-PLDA counterparts in C_Primary, and it is observed that the NPLDA model handles in-domain and out-of-domain data better than the Gaussian PLDA. |
0:11:13 | Let's talk about some post-evaluation experiments and analysis. |
0:11:18 | One of the factors that we found we did not do optimally was calibration. |
0:11:24 | In our previous work for SRE18, we proposed an alternative approach to calibration, where the target and non-target scores were modeled as Gaussian distributions with a shared variance. |
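Under that model, the calibrated log-likelihood ratio is an affine function of the raw score, so fitting the calibration reduces to estimating two means and a pooled variance. A minimal sketch of that idea (the toy scores are illustrative):

```python
import numpy as np

def gaussian_calibration(target_scores, nontarget_scores):
    """Model target and non-target scores as Gaussians with a shared
    variance; the calibrated LLR of a raw score s is then a*s + b.
    Returns (a, b)."""
    t = np.asarray(target_scores, dtype=float)
    n = np.asarray(nontarget_scores, dtype=float)
    mu_t, mu_n = t.mean(), n.mean()
    var = np.concatenate([t - mu_t, n - mu_n]).var()   # pooled variance
    a = (mu_t - mu_n) / var
    b = (mu_n ** 2 - mu_t ** 2) / (2.0 * var)
    return a, b

# Toy usage: classes that differ only in mean, symmetric about zero.
tgt = np.array([4.0, 5.0, 6.0])
non = np.array([-6.0, -5.0, -4.0])
a, b = gaussian_calibration(tgt, non)
calibrated = a * 0.0 + b     # calibrated LLR of a raw score of 0
```

The catch discussed on this slide is that a and b are only as good as the development data they are fitted on; a mismatched development set yields a mis-placed threshold.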
0:11:39 | As SRE19 did not have an explicitly matched development dataset provided, the aforementioned calibration using the SRE18 development dataset, when applied on SRE19, turned out to be ineffective. |
0:11:55 | This was done for all of our submitted systems, and thus the calibration was not as optimal as we wanted. |
0:12:03 | The graph on the right shows how the SRE18 development and SRE19 evaluation datasets are not matched, |
0:12:12 | and the thresholds chosen accordingly were not optimal for our submitted systems. |
0:12:21 | We performed some score normalization techniques to improve our scores. |
0:12:26 | We performed adaptive symmetric normalization, or AS-norm, using the SRE18 development unlabeled set as the cohort. |
0:12:36 | We achieved 24% relative improvement for XVec1, which is the VoxCeleb x-vector system, and 21% relative improvement for the full x-vector system XVec3, on the SRE18 development set. |
0:12:51 | We got comparatively lower but consistent improvements of about 14% on average across all our systems for the SRE19 evaluation set. |
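The AS-norm step can be sketched as two z-normalizations averaged together. The `top_k` value and the cohort scores below are illustrative assumptions; the system used the SRE18 development unlabeled set as the cohort:

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=3):
    """Adaptive symmetric score normalization (AS-norm) sketch: z-normalize
    the trial score with the statistics of the top-k highest-scoring cohort
    trials against the enrollment side and against the test side, then
    average the two normalized scores."""
    def z(stats):
        top = np.sort(np.asarray(stats))[-top_k:]    # adaptive cohort subset
        return (score - top.mean()) / (top.std() + 1e-10)
    return 0.5 * (z(enroll_cohort_scores) + z(test_cohort_scores))

# Toy usage: cohort scores against the enrollment and test sides.
enroll_cohort = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
test_cohort = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
normalized = as_norm(2.5, enroll_cohort, test_cohort)
```

Because each trial is normalized against its own adapted cohort statistics, scores become more comparable across trials, which is what makes a single calibration threshold work better.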
0:13:02 | The table shows the best values that we got for the SRE18 development and the SRE19 evaluation sets. |
0:13:11 | We got an EER of 4.7% and a C_min of 0.27 as best scores for the SRE18 development set, |
0:13:20 | and an EER of 4.51%, a C_min of 0.36, and a C_Primary of 0.39 for the SRE19 evaluation systems. |
0:13:33 | To summarize, we trained three x-vector extractors and back-end models on different partitions of the available datasets. |
0:13:42 | We also explored a novel discriminative back-end model called NPLDA, which is inspired by neural network architectures and the generative Gaussian PLDA model. |
0:13:54 | We observed that the NPLDA consistently outperforms the G-PLDA system for various datasets. |
0:14:02 | The errors that were caused by calibration with the mismatched development datasets were discussed, |
0:14:09 | as were the significant performance gains that were achieved by using the cohort-based AS-norm adaptive score normalization technique for various systems. |
0:14:21 | These are some of the references that we used. |
0:14:25 | Thank you. |