0:00:07 | Right, thanks. My name's Kornel, and this is joint work with my colleague, who is sitting up there; she's happy to take all of your questions afterwards. |
---|
0:00:18 | So, before we begin, I just want to lay down the definitions that I'm going to be using. This is my first time at this meeting, so I may be saying things very wrong, and I apologise for that in advance. I conceive of all features that you could compute from speech as falling into these four areas. |
---|
0:00:35 | On the left-hand side I consider coarse spectral features, and on the right-hand side fine spectral features; the top two panels contain things that characterise a single frame of speech, whereas at the bottom are things that characterise the trajectory, that is, things that model behaviour across frames. |
---|
0:00:53 | So all of the features that you are probably familiar with occupy this space, and prosodic features tend to be those that either model the fine structure in the spectrum, on the right, or model long-term dependencies, at the bottom. But we, in this paper, are going to look at only the so-called instantaneous prosodic features, namely those that characterise a single frame, and in particular we're going to be looking at pitch. |
---|
0:01:19 | Okay, so pitch is estimated using a pitch detector, which typically produces a best estimate of pitch; but that estimate is usually so noisy that a pitch detector is typically expected to produce an n-best list, and then a dynamic programming approach is used to prune that down to a single best estimate per frame. I'm going to refer to these two components together as pitch estimation. |
---|
0:01:42 | The best estimate per frame that comes out of this can be linearly or nonlinearly smoothed, can be normalised based on proximity to some kind of landmark, and then different kinds of features can be extracted from it. |
---|
0:01:55 | In a simple model, these things at the bottom here are, I assume, what you have been calling high-level feature computation, or high-level features, in this session. I hope I'm not disappointing anyone: we're actually going to be looking at this box, which is as low-level as it gets; we're going to claim that these features are as low level as MFCCs. |
---|
0:02:17 | Okay, so if we look at this box a little more closely: typically pitch estimation, or pitch detection, is a two-step process, where the source-domain signal we start from is an FFT. The first step is the computation of what I'm going to be calling a transform domain, and there are lots of alternatives here; let's say it is the autocorrelation spectrum. The second step is then simply finding the argmax. |
---|
0:02:45 | A lot of effort has gone into this process, and typically the effort is spent only on this first step, because the second step is so elementary that nobody really questions it. Most of the work on improving pitch detection has gone into making sure that this transform is one in which the argmax can be found reliably. |
---|
0:03:08 | What we're going to claim in this work is that you should just throw away that whole second step, and you should model the entire transform domain; and that's what this talk is about. |
---|
0:03:17 | So there are four parts to this talk: I am going to describe what I'm calling the harmonic structure transform, present some experiments, then some additional analysis, and I will conclude in three slides. |
---|
0:03:36 | Okay, so the particular pitch detection algorithm that we're going to look at was proposed by Schroeder in 1968, and it involves producing a new spectrum, the sigma spectrum, where at each frequency we have the sum of all the frequencies in the original FFT that are integer multiples of that candidate fundamental frequency. |
---|
0:03:58 | Very soon after he proposed it, this was dubbed harmonic compression, which is a distinctly nonlinear operation. I've demonstrated it over here on the right: basically, what ends up happening is that the spectrum is compressed, conceptually, by integer factors, and the results are then summed. Right. |
---|
0:04:18 | And the problem with harmonic compression is that it has led to people actually looking for implementations of this algorithm in exactly this way, so first compressing and then summing; and it turns out that this occupied people for about twenty years, a good part of the last century. |
---|
0:04:34 | A much better way to do this is to not do any compression at all, but to comb filter: you just add whatever is at whatever frequency you want, without first having to compress it towards the harmonic frequency, or fundamental frequency, that you're interested in. |
---|
0:04:51 | When you do this there are of course no compression difficulties; comb filtering is linear. We, in this work, are going to be defining all of our comb filters over the range of three hundred hertz to eight thousand hertz. If you have lots of such comb filters, you have a filter bank, and in this work we're going to have nominally four hundred filters in this filterbank, whose fundamental frequencies range from fifty to four hundred fifty hertz, spaced one hertz apart. |
---|
0:05:20 | So this is in continuous frequency space; of course we want a discrete-frequency-space filter, because we have discrete FFTs. There are lots of ways to do this, and I always like citing the work by Lindsay and colleagues, because this is work that actually influenced me, but it's probably not the first such work. |
---|
0:05:41 | What we're going to do in this work is a little bit different: we're going to assume that each tooth of the comb is triangular, and then we're going to simply Riemann-sample this, such that the discrete comb filter actually ends up looking like this; as you will note, it doesn't look harmonic at all. |
---|
0:06:03 | So what do you do with this now? If you have a set of such discrete comb filters, then they actually implement a filter bank, which has a matrix representation H; and it's very simple to use, you just perform a matrix multiplication with the FFT that you have in hand. |
---|
0:06:21 | We're also going to take the logarithm of the output of that filter bank, the same way that's done for the mel-frequency filterbank. |
---|
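A minimal sketch of the filterbank construction described above: triangular comb teeth at the integer multiples of each candidate F0, Riemann-sampled onto the discrete FFT bin centres and stacked into a matrix H. All parameter values here, the tooth half-width `tooth_bw` in particular, are illustrative assumptions, not the exact settings of this work.

```python
import numpy as np

def comb_filterbank(n_bins, sample_rate, f0_lo=50.0, f0_hi=450.0,
                    n_filters=400, band=(300.0, 8000.0), tooth_bw=20.0):
    """Comb filterbank matrix H (n_filters x n_bins): one row per
    candidate F0, with triangular teeth at its integer multiples,
    Riemann-sampled onto the discrete FFT bin centres."""
    bin_hz = np.arange(n_bins) * sample_rate / (2.0 * (n_bins - 1))
    f0s = np.linspace(f0_lo, f0_hi, n_filters)
    H = np.zeros((n_filters, n_bins))
    for i, f0 in enumerate(f0s):
        harmonics = np.arange(1, int(band[1] // f0) + 1) * f0
        for centre in harmonics[harmonics >= band[0]]:
            # triangular tooth of half-width tooth_bw, sampled at the bins
            H[i] += np.clip(1.0 - np.abs(bin_hz - centre) / tooth_bw, 0.0, None)
        H[i] /= H[i].sum()  # normalise each comb filter to unit mass
    return H
```

Applying the bank to a magnitude spectrum `x` is then just `np.log(H @ x + eps)`, in direct analogy with a mel filterbank.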
0:06:27 | Finally, from the energy that is found at the integer multiples of a specific candidate fundamental frequency, we're going to subtract the energy found everywhere else, and to do that we're going to form this complement transform, H tilde. |
---|
0:06:45 | I can demonstrate it over here: this is the column vector for a particular comb filter; we just form its unity complement, and that gives us this here, the corresponding column vector of H tilde. |
---|
0:07:00 | What this implements, of course, is a frame-level form of the harmonics-to-noise ratio, which is known to correlate with breathiness, hoarseness, or roughness of voicing, typically in pathological speech. Typically, the harmonics-to-noise ratio is computed only at the fundamental frequency, once that is known; what we're doing is computing it for all possible candidate fundamental frequencies, and then using that vector as a feature vector. |
---|
0:07:34 | Okay. So the elements of this vector are still correlated, and we decorrelate them in the way that anybody else would: we subtract the global mean, we form a decorrelation matrix, and then, after applying that matrix, we truncate, retaining only those dimensions that have a positive eigenvalue. |
---|
0:07:50 | We're going to call the output of this harmonic structure cepstral coefficients, for lack of a better term; it is simply a decorrelation of the logarithm of the output of the filterbank, minus a normalisation term, which is our H tilde here. We actually explore two different options for the decorrelation: PCA, and LDA, which you probably know more about than I do. |
---|
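Putting the pieces together, the per-frame feature described above is a vector of harmonics-to-noise-ratio-like values, one per candidate F0: the log of the comb-filtered energy minus the log of the energy passed by the filter's unity complement, optionally decorrelated and truncated. A sketch under assumed names (`V` and `n_keep` are illustrative, not from the talk):

```python
import numpy as np

def hscc_raw(x_mag, H, V=None, n_keep=None, eps=1e-10):
    """Per-frame harmonic-structure feature vector: for every candidate
    F0, the log energy passed by its comb filter minus the log energy
    passed by the filter's unity complement (an HNR-like quantity),
    then optionally decorrelated and truncated."""
    H_tilde = 1.0 - H                       # unity complement, filter by filter
    r = np.log(H @ x_mag + eps) - np.log(H_tilde @ x_mag + eps)
    if V is not None:
        r = V.T @ (r - r.mean())            # stand-in for subtracting the global
                                            # (training-set) mean and decorrelating
        if n_keep is not None:
            r = r[:n_keep]                  # keep only the leading dimensions
    return r
```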
0:08:12 | I have claimed that this is at the level of MFCCs, and I would like to try to convince you here that it's nearly identical from a functional point of view. |
---|
0:08:20 | The mel filterbank can of course also be implemented as a matrix, and if that matrix is M, you can see that at least this part inside here is approximately the same: a matrix multiplication followed by a logarithm. The decorrelating transforms are of course different. |
---|
0:08:36 | And it is sort of important, and in our case unfortunate, that our particular decorrelating matrix is data-dependent, whereas the MFCC one is not. |
---|
0:08:45 | But to compare M and H: the columns of M smear energy across frequencies that are related by adjacency, whereas the columns of H, the matrix that we're proposing here, smear energy across frequencies that are related by harmonicity. |
---|
0:09:06 | I also want to just say that this descends rather directly from our previous work using a representation called fundamental frequency variation, which models the instantaneous change in fundamental frequency without actually computing the fundamental frequency. |
---|
0:09:23 | So what we're doing here, in the current work, is: we take a frame of speech, we take its FFT, and then we take a bunch of idealised FFTs, which are the comb filters, the columns of capital H. We form the dot product of the frame that we are currently looking at with every one of these, and the locus of these dot products of course defines a trajectory, which is a function of an index that corresponds to candidate fundamental frequency. |
---|
0:09:51 | right so |
---|
0:09:54 | in contrast the F F E stuff that we've done before we take two frames we take the current frame |
---|
0:09:58 | the same as here |
---|
0:09:59 | but we also take |
---|
0:10:00 | the previous frame |
---|
0:10:02 | and we die like the previous frame by logarithmic factors |
---|
0:10:06 | by a range of them |
---|
0:10:07 | and then again we take the dot product |
---|
0:10:09 | oh |
---|
0:10:09 | the dilated |
---|
0:10:10 | previous frame |
---|
0:10:11 | with the current frame |
---|
0:10:13 | and then the locus of those dot product give us gives us another focus |
---|
0:10:16 | it is also a function of i wear |
---|
0:10:18 | i hear is the logarithmic dilation factor so |
---|
0:10:21 | this |
---|
0:10:22 | expresses |
---|
0:10:23 | expresses |
---|
0:10:24 | the key here nominally expresses the |
---|
0:10:27 | uh location of the peak sorry |
---|
0:10:28 | expresses the |
---|
0:10:30 | fundamental frequency in hertz |
---|
0:10:32 | and the location of the peak here expresses the |
---|
0:10:35 | rate of change of fundamental frequency and |
---|
0:10:37 | for sex |
---|
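The FFV construction described above, dilating the previous frame's spectrum by logarithmic factors and correlating each dilation with the current frame, can be sketched as follows; the dilation range and the normalisation are assumptions, not the exact FFV definition:

```python
import numpy as np

def ffv_curve(prev_mag, curr_mag, rhos):
    """Dilate the previous frame's magnitude spectrum by factors 2**rho
    and correlate each dilation with the current frame; the peak over
    rho tracks the change in F0 without ever estimating F0 itself."""
    bins = np.arange(len(prev_mag), dtype=float)
    curve = np.empty(len(rhos))
    for i, rho in enumerate(rhos):
        # resample: the dilated spectrum at bin b is prev at bin b / 2**rho
        dilated = np.interp(bins, bins * 2.0 ** rho, prev_mag)
        # normalised dot product, so stretching does not inflate the score
        curve[i] = np.dot(dilated, curr_mag) / (np.linalg.norm(dilated) + 1e-12)
    return curve
```

For a spectral peak that moves from bin 100 to bin 105 between frames, the curve peaks near the dilation factor 105/100.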
0:10:39 | Okay, so now I'm going to describe the experiments that we did to see whether this makes any sense at all. |
---|
0:10:46 | The data that we used is Wall Street Journal data, mostly coming from the WSJ corpus; the number of speakers that we have is one hundred and two female speakers and ninety-five male speakers, and we evaluate closed-set classification for each gender separately. We used ten-second trials; we had enough data to have five minutes of training data, a few minutes of development data, and three minutes of test data, corresponding to approximately fourteen hundred to eighteen hundred trials of ten seconds apiece. All of the data comes from a single microphone, and it's what we're calling matched multisession, which means that, for the majority of speakers, the data in both the training set and the test set is drawn from all of the sessions that are available for that speaker. |
---|
0:11:32 | Something that is not in the paper (we did this afterwards, but I thought that you might appreciate it) is that we built a system that's based just on pitch, and we extract pitch using standard sound-processing software in its default settings. The comparison isn't quite fair, because any pitch tracker in current use actually employs dynamic programming, so this system is actually using long-term constraints, whereas our system will not, because it treats frames independently. We ignore unvoiced frames, and we transform the voiced frames into the log domain. And what we see is that this system achieves accuracies of approximately eighteen percent and approximately twenty-seven percent, for females and males respectively. |
---|
0:12:17 | I get the feeling that my microphone is louder at some times than at others; is that true, or does it bother anyone? Okay, sorry. |
---|
0:12:27 | Okay, so for the system that we're proposing here, to explore this idea of modelling the entire transform-domain signal: we don't perform any preemphasis, partly because we are not using frequencies below three hundred hertz; since we throw those away, there is no DC component, and we decided not to bother with it. We have seventy-five percent frame overlap, with thirty-two-millisecond frames, and we use the Hann window instead of the Hamming window, which is in ubiquitous use. The number of dimensions and the number of Gaussians in the gender-dependent models is something I still need to discuss, in a coming slide. We don't use a universal background model, and we don't use any speech activity detection. |
---|
0:13:05 | So, in optimising the number of dimensions, what we have done here is create the most laconic model you can invent, a single Gaussian with diagonal covariance; we train our PCA or LDA transform on the training set, and we select that number of dimensions which maximises accuracy on the development set. |
---|
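The dimension-selection procedure just described (one diagonal-covariance Gaussian per speaker, keeping the dimensionality that maximises dev-set accuracy) might look like the following sketch; the array shapes and the per-vector decision rule are assumptions, since the talk scores ten-second trials rather than single frames:

```python
import numpy as np

def pick_dims(train_X, train_y, dev_X, dev_y, max_dims):
    """For d = 1 .. max_dims, fit one diagonal-covariance Gaussian per
    speaker on the first d (decorrelated) feature dimensions, classify
    the dev data by maximum log-likelihood, and keep the d with the
    highest dev accuracy."""
    speakers = np.unique(train_y)
    best_d, best_acc = 1, -1.0
    for d in range(1, max_dims + 1):
        means = np.array([train_X[train_y == s, :d].mean(axis=0) for s in speakers])
        varis = np.array([train_X[train_y == s, :d].var(axis=0) for s in speakers]) + 1e-6
        # log-likelihood of every dev vector under every speaker model
        diff = dev_X[:, None, :d] - means
        ll = -0.5 * (diff ** 2 / varis + np.log(2.0 * np.pi * varis)).sum(axis=-1)
        acc = float((speakers[ll.argmax(axis=1)] == dev_y).mean())
        if acc > best_acc:
            best_d, best_acc = d, acc
    return best_d, best_acc
```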
0:13:29 | Right, so we can see here that for the PCA transform we achieve an accuracy of about forty percent for the first principal components, and for the first, top-discriminant LDA components an accuracy of about eighty-five percent, for females, and slightly better for males; but that's approximately the ballpark. |
---|
0:13:51 | The lighter colours in these two plots represent longer trial durations. We decided not to do sixty-second and thirty-second trials, which is what we started out with, because the numbers were too high and it was difficult to compare them. |
---|
0:14:01 | So, this table summarises the performance of the HSCC system that I just described, once the number of Gaussians has been set to optimise dev-set accuracy; that number happened to be two hundred and fifty-six in our experiments. |
---|
0:14:19 | What we see here is that if you take the representation that a pitch tracker is exposed to, and you spend the time looking for the argmax in it, you achieve eighteen and twenty-seven percent; but if you don't bother doing that, and you just throw everything that that representation has into the model, then you achieve almost a hundred percent. |
---|
0:14:38 | So the claim here, based on these experiments, is that there is speaker-discriminant information beyond the argmax in these representation vectors, and of course discarding it yields performance that is not really comparable. Spending time improving argmax estimation appears unnecessary; and of course argmax estimation, here, is pitch estimation. |
---|
0:15:03 | Okay, so we also constructed a contrastive MFCC system, which is not really standard in the way it's built, but probably similar to the ones you build; we tried to retain as many similarities with the very simple HSCC system as we could. So we did apply preemphasis and a Hamming window, because that just happens to be the standard front-end feature processing in our ASR systems; we retain the twenty lowest-order MFCCs; but then we also don't use a universal background model or any speech activity detection. So in this respect the two systems are most comparable. |
---|
0:15:42 | And what we see when we compare these two systems is that essentially in every case, at least for this data and for these experiments that we did here, the HSCC representation outperforms the MFCC representation; but we're happy just saying that they're comparable in magnitude. |
---|
0:15:56 | We've also, just to be safe, applied an LDA to the MFCC system. This is also not a fair thing to do, because we haven't actually truncated or thrown away any dimensions after that: we take twenty dimensions and we rotate them. It leads to a negligible improvement. |
---|
0:16:15 | If we combine the HSCC and MFCC systems, we get improvements in every case, except for the dev set for males, where MFCCs don't seem to help; but other than that, in general, HSCCs at least benefit from combination with MFCCs. |
---|
0:16:35 | Okay, so given these results, I'm going to describe analyses of a couple of perturbations, because we were interested in seeing how lucky we were in just guessing at the parameters that actually drive the evaluation of our system. So we considered three different kinds of perturbations. |
---|
0:16:54 | One was changing the frequency range to which the filterbank is exposed; one is changing the number of comb filters in the filterbank; and the other is the throwing out of the so-called spectral-envelope information, which is what is contained in MFCCs. |
---|
0:17:14 | We ran a very simple version of this analysis, where we used only diagonal-covariance Gaussians, a single diagonal-covariance Gaussian per speaker, and we only show numbers on the dev set, because we find them sufficiently similar, at this granularity, to the eval-set numbers that we didn't actually bother computing the latter. As before, I'm going to plot accuracy as a function of the number of dimensions. |
---|
0:17:43 | So the first perturbation has to do with modifying the low-order cutoff: as I said, the HSCC system looks at frequencies between three hundred hertz and eight kilohertz, and it is interesting to see what happens if you choose a different value for this low-order, or low-frequency, cutoff. |
---|
0:17:58 | So the results here, for females on the left and males on the right, indicate that the three-hundred-hertz cutoff that we had chosen happens to correspond to the best performance. If we expose the algorithm also to the frequencies between zero and three hundred hertz, then for females we actually lose approximately four percent absolute here; the drop is much smaller for males. And moving the cutoff further up has a smaller effect, but it is also worse than keeping what we have. |
---|
0:18:35 | The second perturbation that we analysed was changing the upper limit: as I said, we had three hundred to eight thousand hertz to begin with, but it's interesting to see what happens if you cut it off at four thousand hertz, or at two thousand; this configuration in particular corresponds approximately to upsampled eight-kilohertz telephone audio. |
---|
0:18:57 | So here again, results for males on the right and for females on the left: what we see is that for males, reducing the number of high-frequency components that you look at in the FFT has a more drastic effect than for females. For females, actually, going down to four thousand hertz is only a drop of less than one percent absolute; but then dropping it further, you see drops of approximately three percent. I want to state that even under these sort of ridiculous ablation conditions, this still significantly outperforms a pitch tracker, although it is not known how well the pitch tracker would operate on three-hundred-to-two-thousand-hertz audio. |
---|
0:19:37 | So the third perturbation is in the transform domain: as I said at the very beginning, we have four hundred filters, spaced one hertz apart, and we are at liberty to choose however many filters we want; so it's interesting to see what happens if you double that number and space them half a hertz apart, or halve that number and space them two hertz apart. |
---|
0:20:01 | What the results show here, for females on the left again and for males on the right, is that increasing the resolution of the candidate fundamental frequencies with which you construct the filter bank actually leads to significant improvements: almost two percent absolute for females, and slightly smaller for males. And decreasing the resolution has a similarly sized negative impact, for both. |
---|
0:20:26 | Finally, the fact that the MFCC and HSCC features combine to improve performance in three out of four cases suggests that the two feature streams are complementary; but there was actually no proof of that until, sort of, now. |
---|
0:20:43 | So what we're going to do here is take the source-domain FFT, and we're going to lifter it, by transforming it into the real cepstrum, then throwing out the low-order cepstral coefficients, and then transforming it back into the spectral domain. |
---|
0:20:59 | And I want to say here that the low-order real cepstral coefficients correspond approximately to the low-order MFCC coefficients; so ablating real cepstral coefficients, which are of course computed without a filterbank, is very similar to removing exactly that information that's captured by MFCCs. |
---|
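The liftering operation described above can be sketched as follows: transform the log-magnitude spectrum into the real cepstrum, zero the low-order coefficients, and transform back. The even-symmetric extension is one assumed way of making the cepstrum real.

```python
import numpy as np

def lifter_out_envelope(log_mag, n_remove=13):
    """Remove spectral-envelope information from a log-magnitude
    spectrum: even-symmetric extension -> real cepstrum -> zero the
    n_remove lowest-order coefficients -> transform back."""
    spec = np.concatenate([log_mag, log_mag[-2:0:-1]])   # even symmetry
    ceps = np.fft.ifft(spec).real                        # real cepstrum
    ceps[:n_remove] = 0.0                                # envelope quefrencies
    if n_remove > 1:
        ceps[-(n_remove - 1):] = 0.0                     # their mirror images
    return np.fft.fft(ceps).real[:len(log_mag)]
```

A slow cosine across the spectrum (the envelope) is removed, while a fast harmonic ripple survives.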
0:21:20 | So in the HSCC system that you saw the performance of in the table, we actually don't do any liftering; but we could lifter out, say, the first thirteen low-order cepstral coefficients, which corresponds approximately to what people typically use in an ASR system, or the first twenty, which is what we used in our MFCC baseline that we saw earlier. And that is what is happening here. |
---|
0:21:42 | you can see that |
---|
0:21:43 | a personal comment on females is that removing the spectral envelope information actually improves performance here so |
---|
0:21:49 | if we throw away |
---|
0:21:50 | the first thirteen cepstral corpus are sort of the information contained in the first |
---|
0:21:53 | thirteen cepstral coefficients |
---|
0:21:55 | we get an improvement of about two percent absolute |
---|
0:21:57 | meaning that the spectral |
---|
0:21:59 | envelope information that's modelled in mfccs actually hurts here for women |
---|
0:22:03 | um |
---|
0:22:04 | it's also the case that if we throw out twenty of them |
---|
0:22:06 | we also do better than not throwing out any but it's already not as good as throwing out only thirteen |
---|
0:22:11 | which suggests that the |
---|
0:22:12 | the cepstral coefficients that are found between |
---|
0:22:14 | thirteen and twenty |
---|
0:22:16 | are useful |
---|
0:22:17 | um for males doing any kind of |
---|
0:22:20 | ablation |
---|
0:22:22 | sorry liftering seems to hurt |
---|
0:22:24 | but the pain is |
---|
0:22:25 | smaller here if you throw away the first thirteen it's negligible i believe |
---|
0:22:29 | it's |
---|
0:22:29 | one trial |
---|
0:22:32 | and uh i have no idea whether that's statistically significant |
---|
0:22:35 | so the findings of this is that um |
---|
0:22:38 | the this representation appears to be robust |
---|
0:22:40 | to perturbations of various sorts |
---|
0:22:42 | there is a spread of approximately five percent absolute |
---|
0:22:45 | um |
---|
0:22:46 | the performance for |
---|
0:22:47 | female speakers seems to be more sensitive to these perturbations than for males |
---|
0:22:51 | in both |
---|
0:22:52 | uh pleasing and displeasing directions and |
---|
0:22:54 | um |
---|
0:22:55 | it it's again important to say that and even under these perturbed conditions |
---|
0:22:59 | the |
---|
0:23:00 | the performance of these systems is |
---|
0:23:01 | vastly superior to |
---|
0:23:03 | um |
---|
0:23:04 | the performance that would be achieved if you spent a lot of time finding the argmax in the representation |
---|
0:23:09 | that pitch trackers are exposed to |
---|
0:23:11 | uh |
---|
0:23:12 | we don't know how a pitch tracker would perform |
---|
0:23:14 | on whisper |
---|
0:23:16 | so the summary of this talk |
---|
0:23:18 | um |
---|
0:23:18 | i still have three slides |
---|
0:23:20 | is that uh the information that's available |
---|
0:23:23 | to a |
---|
0:23:24 | standard pitch tracker |
---|
0:23:25 | because it is computed by this pitch tracker |
---|
0:23:27 | and then subsequently discarded is |
---|
0:23:29 | valuable for speaker recognition |
---|
0:23:31 | um |
---|
0:23:32 | and then the three points that i would like to |
---|
0:23:34 | pay specific attention to is that |
---|
0:23:36 | the performance achieved with these hscc features |
---|
0:23:39 | is comparable to that achieved with mfcc features |
---|
0:23:42 | um |
---|
0:23:43 | the information contained in these hscc features |
---|
0:23:46 | appears to be complementary to the information |
---|
0:23:48 | in |
---|
0:23:49 | mfccs |
---|
0:23:50 | and hscc modelling appears to be at least as easy as |
---|
0:23:53 | mfcc modelling |
---|
0:23:56 | um |
---|
0:23:57 | so |
---|
0:23:57 | that this evidence |
---|
0:23:58 | suggests as i probably said |
---|
0:24:00 | too often now |
---|
0:24:01 | that |
---|
0:24:03 | improving the estimation of pitch |
---|
0:24:05 | that is finding the argmax |
---|
0:24:08 | in this representation which is essentially the goal of pitch tracking |
---|
0:24:11 | um |
---|
0:24:11 | seems like |
---|
0:24:12 | an endeavour that doesn't |
---|
0:24:14 | warrant further time investment and |
---|
0:24:16 | uh_huh |
---|
0:24:16 | it's |
---|
0:24:17 | it's |
---|
0:24:17 | possible |
---|
0:24:18 | to simply model the entire transform domain |
---|
0:24:20 | and and do better |
---|
0:24:22 | um |
---|
0:24:23 | if pitch is required for other high level kind of features which of course we're ignoring here 'cause we're not |
---|
0:24:27 | doing any long |
---|
0:24:28 | distance |
---|
0:24:29 | um |
---|
0:24:30 | feature computation |
---|
0:24:32 | then |
---|
0:24:32 | at least |
---|
0:24:33 | that information should not be discarded |
---|
0:24:35 | even if it's not used to estimate pitch |
---|
0:24:37 | and if |
---|
0:24:39 | these ideas |
---|
0:24:40 | um generalised to other data types in other tasks then |
---|
0:24:44 | there there is some chance that this |
---|
0:24:46 | um it will lead to some form of paradigm shift |
---|
0:24:49 | in the way that prosody is modelled |
---|
0:24:51 | in speech |
---|
0:24:54 | so i wanna close with a couple of |
---|
0:24:56 | caveats um |
---|
0:24:57 | we we don't actually know how these features compare to other |
---|
0:25:00 | instantaneous |
---|
0:25:01 | uh prosody feature |
---|
0:25:03 | vectors right so it's possible that if you had a |
---|
0:25:05 | uh |
---|
0:25:05 | a vector that contains pitch and maybe harmonic to noise ratio and and maybe some other things that |
---|
0:25:10 | are |
---|
0:25:11 | computable |
---|
0:25:12 | you know instantaneously per frame and |
---|
0:25:14 | it would be much |
---|
0:25:15 | the difference would be much smaller |
---|
0:25:17 | um |
---|
0:25:18 | we don't know that |
---|
0:25:19 | at the current time |
---|
0:25:20 | we also don't know how this |
---|
0:25:21 | this representation performs under various |
---|
0:25:23 | various um |
---|
0:25:25 | mismatched conditions for example channel or session or distance from microphone |
---|
0:25:30 | or uh vocal effort |
---|
0:25:31 | so these are things that need to be |
---|
0:25:33 | explored and it's also quite possible that |
---|
0:25:39 | there are other classifiers that may be better |
---|
0:25:42 | suited to this |
---|
0:25:42 | um |
---|
0:25:43 | in particular the performance which wasn't bad with this single diagonal covariance gaussian suggests that |
---|
0:25:48 | maybe svms would do much better but the feature vectors are large right so |
---|
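The single diagonal-covariance Gaussian classifier mentioned here is easy to sketch; the function names and the 1e-6 variance floor below are my own illustrative choices, not the authors' code:

```python
import numpy as np

def fit_diag_gaussian(frames):
    """Fit a single diagonal-covariance Gaussian to one speaker's frames."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # variance floor for numerical safety
    return mu, var

def avg_log_likelihood(frames, mu, var):
    """Average per-frame log-likelihood of test frames under the model."""
    d = frames - mu
    per_frame = -0.5 * (np.log(2.0 * np.pi * var) + d * d / var).sum(axis=1)
    return per_frame.mean()
```

Scoring a trial would then compare the average log-likelihood of the test frames under competing speaker models.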
0:25:52 | um |
---|
0:25:53 | this presents some problems |
---|
0:25:55 | existing prosody systems of course |
---|
0:25:57 | focus a lot on long term features |
---|
0:26:00 | and we haven't |
---|
0:26:01 | attempted that here at all so |
---|
0:26:03 | um a simple thing to try would be to uh simply stack features from temporally adjacent frames |
---|
0:26:08 | or stack first differences |
---|
0:26:10 | of features but i |
---|
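Stacking temporally adjacent frames or their differences, as just suggested, might look like this minimal numpy sketch (function names, the context radius, and edge padding are my own assumptions):

```python
import numpy as np

def stack_context(feats, radius=2):
    """Stack each frame with its +/- radius temporal neighbours.
    feats: (T, D) -> (T, D * (2*radius + 1)); edges padded by repetition."""
    T, _ = feats.shape
    padded = np.pad(feats, ((radius, radius), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * radius + 1)])

def delta(feats, width=2):
    """Regression-style delta (first-difference) features over +/- width frames."""
    T, _ = feats.shape
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(w * (padded[width + w:width + w + T] - padded[width - w:width - w + T])
              for w in range(1, width + 1))
    return num / (2 * sum(w * w for w in range(1, width + 1)))
```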
0:26:11 | i think |
---|
0:26:12 | that |
---|
0:26:12 | probably the best thing to do is to |
---|
0:26:14 | simply compute the modulation spectrum over this |
---|
0:26:17 | just as one would over the spectrogram |
---|
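A modulation spectrum over a frame-level representation, computed as suggested, could be sketched as a magnitude DFT along the time axis of each feature dimension; the window length and non-overlapping blocking here are illustrative choices:

```python
import numpy as np

def modulation_spectrum(feats, win=50):
    """Magnitude DFT along the time axis of each feature dimension,
    over non-overlapping windows of `win` frames.
    feats: (T, D) -> (T // win, win // 2 + 1, D)."""
    T, D = feats.shape
    n_win = T // win
    blocks = feats[:n_win * win].reshape(n_win, win, D)
    return np.abs(np.fft.rfft(blocks, axis=1))
```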
0:26:19 | um |
---|
0:26:21 | and of course probably most importantly |
---|
0:26:22 | we would really like to have a data independent |
---|
0:26:25 | uh feature rotation which allows us to |
---|
0:26:27 | compress the feature space |
---|
0:26:30 | this would significantly improve understanding 'cause right now we just have this huge bag of numbers |
---|
0:26:34 | and it would allow |
---|
0:26:36 | us to apply some normal |
---|
0:26:38 | things that people apply like universal background models |
---|
0:26:41 | um |
---|
0:26:45 | and it would allow us to |
---|
0:26:47 | deploy it in other large tasks |
---|
0:26:50 | thank you |
---|
0:26:51 | thank you |
---|
0:27:06 | could you |
---|
0:27:07 | perhaps |
---|
0:27:08 | uh |
---|
0:27:09 | just help me understand your your last one |
---|
0:27:12 | uh |
---|
0:27:13 | please explain to me why you see some difficulty in applying your method to a ubm |
---|
0:27:20 | that's because you're |
---|
0:27:21 | feature vectors are very large or |
---|
0:27:25 | well the first thing is yeah so in the system that we describe most |
---|
0:27:29 | extensively here |
---|
0:27:30 | the feature vector has |
---|
0:27:31 | four hundred numbers |
---|
0:27:33 | um |
---|
0:27:33 | and so |
---|
0:27:35 | i have found it to be |
---|
0:27:37 | painful so that's four hundred numbers every |
---|
0:27:39 | ten milliseconds right |
---|
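For a rough sense of why four hundred numbers every ten milliseconds is painful, a back-of-the-envelope storage estimate (float32 storage is my assumption; the 400-dimension and 10 ms figures are from the talk):

```python
# Storage cost of 400-dimensional frames at a 10 ms frame step (100 Hz)
dims, frame_rate_hz, bytes_per_float = 400, 100, 4
bytes_per_second = dims * frame_rate_hz * bytes_per_float
mb_per_hour = bytes_per_second * 3600 / 1e6
print(mb_per_hour)  # 576.0 MB per hour of speech
```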
0:27:43 | does that answer your question actually |
---|
0:27:44 | let me say something more |
---|
0:27:46 | um |
---|
0:27:47 | we'd actually found that if you're looking at |
---|
0:27:49 | different kinds of mismatch |
---|
0:27:50 | you need to do some |
---|
0:27:52 | um homomorphic processing which actually |
---|
0:27:54 | increases the size of this feature vector and so it becomes even more painful |
---|
0:27:57 | and it's basically because we don't really know |
---|
0:28:00 | how to |
---|
0:28:00 | really properly model it |
---|
0:28:02 | with a data independent transform |
---|
0:28:04 | yeah |
---|
0:28:06 | okay thanks becomes uh |
---|
0:28:08 | it seems like it would be very worthwhile to try that based on those |
---|
0:28:12 | yeah |
---|
0:28:13 | on the nasdaq does it so well |
---|
0:28:15 | would be nice to think of ways to |
---|
0:28:17 | my proposal |
---|
0:28:18 | definitely and if any of you have any suggestions |
---|
0:28:20 | i would like to take |
---|
0:28:30 | do you have any thoughts on how this |
---|
0:28:32 | might behave on mismatched data |
---|
0:28:36 | um |
---|
0:28:38 | i do |
---|
0:28:39 | i have we have some thoughts |
---|
0:28:40 | so |
---|
0:28:41 | but we we don't have |
---|
0:28:42 | really the correct |
---|
0:28:42 | kinds of thoughts |
---|
0:28:43 | so |
---|
0:28:44 | um |
---|
0:28:46 | note also that the problem is that the other dataset that we've been playing with most recently after doing this |
---|
0:28:52 | is a far field dataset so |
---|
0:28:54 | and so everything is far field |
---|
0:28:55 | so there is a big change in what happens and we actually don't really know |
---|
0:28:59 | exactly where the change is so i guess we're now in the process of thinking about finding |
---|
0:29:02 | different data but |
---|
0:29:03 | we try to remember this table um |
---|
0:29:06 | so |
---|
0:29:06 | this is on something called the mixer five dataset |
---|
0:29:09 | which is |
---|
0:29:10 | which which contains lots of |
---|
0:29:11 | different channels but |
---|
0:29:12 | the nine channels that we use are all far field channels |
---|
0:29:16 | and um |
---|
0:29:17 | we we what we have there is we have uh |
---|
0:29:20 | two evaluation sets |
---|
0:29:22 | um |
---|
0:29:23 | one has |
---|
0:29:24 | session match and the other is session mismatch and then |
---|
0:29:27 | we |
---|
0:29:29 | we build models |
---|
0:29:30 | for data from every channel and apply them to that same channel that's the match channel condition |
---|
0:29:34 | and we also apply those models to data from every other channel and that's a mismatch channel conditions so that |
---|
0:29:41 | the mismatched channel condition consists of |
---|
0:29:41 | um |
---|
0:29:42 | i think it's an average of eight times nine numbers and the matched channel condition is |
---|
0:29:46 | an average of nine |
---|
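The matched/mismatched averaging just described, with a model per channel scored on the same channel and on every other channel, can be sketched as follows, assuming a square matrix of per-pair error rates (the function name is illustrative):

```python
import numpy as np

def matched_mismatched_means(err):
    """err[i, j]: error when a model trained on channel i is scored on
    channel j.  The matched condition averages the n diagonal entries;
    the mismatched condition averages the n*(n-1) off-diagonal entries
    (n = 9 channels in the talk)."""
    n = err.shape[0]
    diag = np.eye(n, dtype=bool)
    return err[diag].mean(), err[~diag].mean()
```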
0:29:48 | so |
---|
0:29:48 | what we see is that in channel |
---|
0:29:50 | match and channel |
---|
0:29:51 | so in session matched and channel matched conditions |
---|
0:29:54 | um |
---|
0:29:54 | we're doing something uninteresting there |
---|
0:29:57 | um |
---|
0:29:58 | but in |
---|
0:29:59 | what you see here is |
---|
0:30:01 | that session mismatch |
---|
0:30:02 | is is more painful than channel mismatch |
---|
0:30:04 | right |
---|
0:30:05 | and um |
---|
0:30:07 | there is a clear reversal here in the ordering between the mfcc system and the hscc |
---|
0:30:13 | system that we reported in this work |
---|
0:30:15 | um |
---|
0:30:16 | so |
---|
0:30:17 | oh yeah by the way |
---|
0:30:18 | so uh |
---|
0:30:20 | this |
---|
0:30:20 | this line this hscc row is the system that i just described |
---|
0:30:24 | the hscc new is something that we |
---|
0:30:27 | submitted this summer and that has been accepted which is where this |
---|
0:30:30 | this table comes from |
---|
0:30:31 | um |
---|
0:30:32 | but uh |
---|
0:30:34 | but the point is that um |
---|
0:30:35 | these numbers in this row |
---|
0:30:37 | are always smaller on average |
---|
0:30:44 | than the numbers in this row |
---|
0:30:46 | so |
---|
0:30:47 | uh |
---|
0:30:48 | i don't know if that answers your questions i can probably talk a little bit about the magnitude of these |
---|
0:30:52 | numbers but |
---|
0:30:53 | you're happy with this then |
---|
0:31:02 | but |
---|
0:31:03 | but i could see actually here that um |
---|
0:31:06 | i don't recall exactly but i think that |
---|
0:31:08 | we did the combination here and the combination leads to approximately ten percent |
---|
0:31:13 | absolute increase |
---|
0:31:14 | over this mfcc number |
---|
0:31:17 | on average over all conditions right |
---|
0:31:20 | and |
---|
0:31:21 | and asks you know processing side |
---|
0:31:23 | sorry |
---|
0:31:24 | yeah and asks you know persisting sign |
---|
0:31:26 | i think this proposal to choose |
---|
0:31:28 | please |
---|
0:31:28 | see mia |
---|
0:31:29 | two |
---|
0:31:30 | O D U these images |
---|
0:31:32 | cool |
---|
0:31:33 | oh if you these things |
---|
0:31:34 | use |
---|
0:31:35 | you know maybe |
---|
0:31:36 | holding |
---|
0:31:36 | well |
---|
0:31:37 | could you hold your microphone a little closer sorry |
---|
0:31:43 | okay |
---|
0:31:44 | um |
---|
0:31:45 | um i think these features |
---|
0:31:47 | this proposal suggests to me that band limited |
---|
0:31:50 | i a harmonic to noise ratios |
---|
0:31:52 | no |
---|
0:31:52 | i think in addition to harmonic to noise ratios disputes to be seen you know to |
---|
0:31:58 | oh |
---|
0:31:58 | i p2p these images |
---|
0:32:00 | or if you these two things |
---|
0:32:01 | which is useful |
---|
0:32:03 | made recordings |
---|
0:32:05 | mated |
---|
0:32:05 | in it decoding so mixed excitation |
---|
0:32:08 | and |
---|
0:32:09 | so |
---|
0:32:10 | you have in mind |
---|
0:32:11 | well i have not i i'm not sure that i got all of the things that you said |
---|
0:32:15 | um |
---|
0:32:16 | but if you said that there is something that's very similar to this |
---|
0:32:19 | yeah i would really like to talk with you about that system and we can do that offline |
---|
0:32:24 | or or you can use this |
---|
0:32:26 | right |
---|
0:32:27 | thank you |
---|
0:32:32 | just to come back to the |
---|
0:32:33 | four hundred dimensional features i think you |
---|
0:32:36 | mentioned those on one of the slides |
---|
0:32:39 | did you reduce that |
---|
0:32:40 | feature dimensionality before your modelling stage i'm sorry |
---|
0:32:44 | uh can you just repeat the very beginning of your question |
---|
0:32:49 | your |
---|
0:32:49 | your your features of |
---|
0:32:50 | four hundred dimensional yep |
---|
0:32:52 | so |
---|
0:32:53 | did you use an lda to reduce the dimensionality before your |
---|
0:32:58 | modelling |
---|
0:33:00 | yeah |
---|
0:33:00 | so what dimensionality did you use |
---|
0:33:03 | sorry i i uh |
---|
0:33:17 | so it turns out that |
---|
0:33:19 | it differs for males and females it was fifty two and fifty three |
---|
0:33:22 | i don't remember which gender doesn't matter |
---|
0:33:24 | is close enough |
---|
0:33:25 | okay because then it would seem it |
---|
0:33:28 | probably not a practical problem anymore |
---|
0:33:30 | uh |
---|
0:33:31 | we we typically use sixty dimensional features |
---|
0:33:34 | uh with the with the |
---|
0:33:36 | uh |
---|
0:33:37 | ubm that has two thousand components |
---|
0:33:40 | so that's doable |
---|
0:33:41 | right but the problem is that we would need to invert |
---|
0:33:44 | uh you know we need to compute the lda or pca transform |
---|
0:33:47 | over |
---|
0:33:48 | because these transforms are global |
---|
0:33:50 | right |
---|
0:33:51 | so we would need to compute |
---|
0:33:52 | a pca transform over |
---|
0:33:54 | two thousand features for the entire |
---|
0:33:57 | i don't know |
---|
0:33:58 | ubm training set if you will |
---|
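The global PCA estimation being discussed could be sketched as below; this is a generic numpy illustration under my own assumptions (function names, SVD-based fitting), not the authors' pipeline, and the dimensions in the example are illustrative rather than the real 400- or 2048-dimensional case:

```python
import numpy as np

def fit_pca(frames, out_dim):
    """Estimate one global PCA rotation over pooled training frames.
    frames: (N, D) matrix of all frames from the training set."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # principal directions via SVD of the centered data matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim]

def apply_pca(frames, mean, basis):
    """Project frames onto the learned low-dimensional basis."""
    return (frames - mean) @ basis.T
```

Because the transform is global, it is estimated once over the pooled (e.g. UBM) training data and then applied unchanged to every utterance.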
0:34:01 | do do do i understand you correctly |
---|
0:34:04 | oh |
---|
0:34:04 | did you say words um |
---|
0:34:07 | we have to transform is estimated remote will depend on where you are from sparse rooms |
---|
0:34:13 | cross |
---|
0:34:14 | no |
---|
0:34:15 | i |
---|
0:34:16 | what i meant |
---|
0:34:17 | if i gave that impression i didn't intend to |
---|
0:34:20 | uh |
---|
0:34:21 | so it's it's russell dimensional so much |
---|
0:34:25 | yes |
---|
0:34:26 | um remote closed so hard to see what would be um |
---|
0:34:30 | a problem when using a universal background model |
---|
0:34:33 | some of these features |
---|
0:34:35 | but you would use everything rooms who are |
---|
0:34:38 | for two dimensions in the morning |
---|
0:34:41 | right so |
---|
0:34:48 | i guess |
---|
0:35:02 | i see |
---|
0:35:03 | training and |
---|
0:35:05 | and extracting the features and the |
---|
0:35:08 | and change energies of all time |
---|
0:35:10 | and |
---|
0:35:11 | and it's a lower |
---|
0:35:13 | yeah right |
---|
0:35:14 | this |
---|
0:35:15 | yeah |
---|
0:35:16 | so basic |
---|
0:35:18 | basically yeah so we start from like |
---|
0:35:20 | i see |
---|
0:35:21 | and just |
---|
0:35:21 | and |
---|
0:35:22 | oh it's really uh you yeah |
---|
0:35:24 | chaney and then |
---|
0:35:25 | because |
---|
0:35:26 | based on this and |
---|
0:35:27 | there is that we do |
---|
0:35:28 | and happy and sad and change and that's the test set |
---|
0:35:32 | you'll often but |
---|
0:35:33 | and |
---|
0:35:34 | just |
---|
0:35:35 | yeah listing |
---|
0:35:36 | people recently |
---|
0:35:37 | we like to do it |
---|
0:35:39 | and |
---|
0:35:40 | due to some limitations |
---|
0:35:42 | a couple of it's not difficult imprints |
---|
0:35:45 | so many times it's just fine |
---|
0:35:47 | i mean |
---|
0:35:48 | i mean we just haven't gotten around to getting that far |
---|
0:35:50 | and like i said |
---|
0:35:52 | with |
---|
0:35:53 | feature vectors of the order of |
---|
0:35:55 | four hundred or eight hundred or even bigger |
---|
0:35:57 | and |
---|
0:35:58 | two thousand and forty eight after |
---|
0:36:00 | some homomorphic |
---|
0:36:02 | processing |
---|
0:36:03 | um |
---|
0:36:04 | just haven't gotten around to even estimating how much disk space we would need for a particular corpus |
---|
0:36:09 | so |
---|
0:36:10 | um |
---|
0:36:12 | that's essentially the the |
---|
0:36:14 | correct answer there but my |
---|
0:36:15 | thought always was that we were gonna attack this problem by making the feature vector smaller first |
---|
0:36:19 | rather than addressing the |
---|
0:36:21 | the |
---|
0:36:22 | the |
---|
0:36:22 | infrastructure problem |
---|
0:36:23 | than buying more disks right |
---|
0:36:25 | so |
---|
0:36:25 | okay |
---|
0:36:26 | okay |
---|
0:36:32 | okay but |
---|
0:36:33 | i think |
---|
0:36:33 | right now |
---|
0:36:35 | i |
---|