0:00:13 | we're gonna continue on the theme um |
---|
0:00:15 | looking at what we can get from biologically inspired auditory processing |
---|
0:00:18 | um he's been doing a lot of work on |
---|
0:00:20 | figuring out how you can extract |
---|
0:00:22 | useful features from auditory models |
---|
0:00:25 | um that can be used to do interesting tasks, making in this case explicit use of sparsity |
---|
0:00:29 | thanks |
---|
0:00:34 | nice to see so many, uh, old and new friends here |
---|
0:00:37 | old friends like Malcolm |
---|
0:00:38 | and friends like Josh |
---|
0:00:40 | so many people attending |
---|
0:00:42 | um |
---|
0:00:44 | the system i'm discussing today and the representations of audio that it uses were reported in considerable detail in |
---|
0:00:51 | i think the September |
---|
0:00:53 | Neural Computation journal, and then discussed again in a Signal Processing Magazine column around that time |
---|
0:00:59 | what's new in today's paper is |
---|
0:01:02 | the last couple of words in the title, where it says "in interference": we've done some more tests |
---|
0:01:07 | to see |
---|
0:01:09 | to what extent these |
---|
0:01:11 | kinds of auditory-model-based representations |
---|
0:01:14 | perform better than more conventional MFCC-like representations |
---|
0:01:19 | in mixed sounds |
---|
0:01:21 | so let me just describe what the representation is, since that's |
---|
0:01:24 | the main topic of this session |
---|
0:01:28 | uh |
---|
0:01:29 | we start off with, uh, a simulation that produces this |
---|
0:01:33 | this thing we call the |
---|
0:01:35 | neural activity pattern |
---|
0:01:38 | that appears on the screen |
---|
0:01:40 | um |
---|
0:01:41 | the neural activity pattern, you can think of that as sort of a cochleagram, or a |
---|
0:01:45 | representation of |
---|
0:01:48 | firing rate or firing probability of primary auditory neurons |
---|
0:01:52 | as a function of |
---|
0:01:53 | time and place |
---|
0:01:55 | with high frequencies at the |
---|
0:01:57 | top here, representing the base of the cochlea, and low frequencies at the bottom |
---|
0:02:01 | the apex of the cochlea |
---|
0:02:04 | from those we detect certain events called strobe events, as a way to |
---|
0:02:10 | make an efficient and realistic computation of what Roy Patterson has called the stabilised auditory image, which |
---|
0:02:16 | is a lot like the auditory correlogram that Malcolm and I did a lot of work on back in the |
---|
0:02:21 | eighties and nineties |
---|
0:02:23 | it makes these nice pictures; it's stabilised in the sense that, um, unlike this one that's kind of racing |
---|
0:02:29 | by with time, this one is like a |
---|
0:02:32 | like a triggered display on an oscilloscope, where the trigger events cause it to stand still |
---|
0:02:37 | so as you have pitch pulses, the |
---|
0:02:40 | this central feature at zero time lag just |
---|
0:02:43 | stays there, and as the pitch period |
---|
0:02:45 | varies, these other ones that look like copies of it move back and forth |
---|
0:02:49 | to have a spacing equal to the pitch period |
---|
0:02:52 | so you get these very nice looking dynamic movies here |
---|
0:02:55 | the problem that we've had with these over the years is figuring out how to go |
---|
0:02:59 | from this |
---|
0:03:00 | rich |
---|
0:03:02 | movie-like representation to some kind of features that we can use to do |
---|
0:03:06 | classification, recognition tasks, and things like that |
---|
0:03:10 | um |
---|
0:03:11 | when we started this at Google in two thousand and six, we were joined by Samy Bengio, who'd |
---|
0:03:17 | been doing very interesting work in |
---|
0:03:19 | image classification |
---|
0:03:21 | using a sort of |
---|
0:03:23 | high-dimensional sparse feature |
---|
0:03:26 | bag-of-visual-words approach, and the systems trained with those, and |
---|
0:03:31 | when he described that system to me i said that's exactly what we need |
---|
0:03:35 | to analyse these movies of sounds |
---|
0:03:38 | to get into a feature space on which we can train up |
---|
0:03:41 | classifiers |
---|
0:03:43 | so |
---|
0:03:44 | what we've done is, this next box we call multi-scale segmentation; that's kind of motivated by a lot of |
---|
0:03:51 | the work they do in visual analysis, where they try to detect |
---|
0:03:55 | features all over the image at different scales |
---|
0:03:58 | using different |
---|
0:03:59 | keypoints or other strategies, based on just looking at regions of the image |
---|
0:04:04 | and saying which of |
---|
0:04:06 | several features that region is close to, and doing that at multiple scales, so we came up with a |
---|
0:04:10 | way to do that |
---|
0:04:12 | we get a bunch of, um |
---|
0:04:14 | really just abstract features; they're, um |
---|
0:04:17 | they're sparse, so they're mostly zeros and occasionally some of them get ones, and |
---|
0:04:21 | that sparse coding gives you, for each |
---|
0:04:23 | frame of this movie |
---|
0:04:25 | this long vector |
---|
0:04:27 | that has some ones in it and a lot of zeros |
---|
0:04:29 | then when we aggregate that over sound files, we just basically add those up, and so what you get here |
---|
0:04:34 | in the |
---|
0:04:35 | the sum of all the sparse vectors is |
---|
0:04:37 | what's called a bag representation that tells you how many times each feature occurred; it's just a histogram really |
---|
0:04:44 | so it's a count of how many times each of those abstract features occurred in a sound file |
---|
0:04:48 | it's still relatively sparse, and that's the kind of feature vector that we use to represent |
---|
0:04:53 | the sound |
---|
0:04:54 | going over these stages in a little bit |
---|
0:04:56 | more detail, let's see what we do in there |
---|
0:04:58 | the |
---|
0:04:59 | the peripheral model, the cochlear |
---|
0:05:01 | simulation, is, you know, if you know anything about my work, you know i've spent the last thirty years |
---|
0:05:06 | or so working on a filter cascade as an approach to |
---|
0:05:09 | simulating the cochlea, because it's a way to connect the |
---|
0:05:13 | underlying wave propagation dynamics to an efficient digital signal processing filtering architecture |
---|
0:05:21 | in a way that |
---|
0:05:22 | lets you do both |
---|
0:05:24 | good models of the cochlea's linear filtering as well as easy ways to incorporate the nonlinear effects |
---|
0:05:30 | that get you |
---|
0:05:31 | compression of dynamic range |
---|
0:05:33 | generation of cubic distortion tones, and stuff like that |
---|
0:05:37 | it's basically just a cascade of these simple filter stages, some half-wave detectors, and |
---|
0:05:42 | these represent inner hair cells and send a signal that represents instantaneous neural firing probability |
---|
0:05:49 | and a feedback network that takes the output of the cochlea and sends it back to control the parameters |
---|
0:05:53 | of the filter stages |
---|
0:05:55 | by reducing the Q of the filters you can reduce the gain; in a cascade like this you |
---|
0:06:00 | don't have to change the Q or the bandwidth very much to |
---|
0:06:02 | change the gain a lot, so you get a |
---|
0:06:04 | pretty nice compressive |
---|
0:06:06 | result from that |
---|
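As a rough illustration of the kind of cascade being described, here is a minimal sketch in Python; the stage design, channel spacing, and parameter values are made-up assumptions, and the feedback network that adjusts Q per stage is omitted, so this is only a toy version, not the actual model.

```python
import numpy as np
from scipy.signal import lfilter

def resonator_stage(fs, f_pole, q):
    """One simple two-pole filter stage of the cascade (illustrative, not the real design)."""
    theta = 2 * np.pi * f_pole / fs
    r = 1.0 - theta / (2.0 * q)          # lowering q pulls the pole inward, lowering the gain
    b = [1.0 - r]
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    return b, a

def cascade_nap(x, fs=16000, n_stages=60, q=6.0):
    """Run a signal through the cascade; half-wave rectify each stage's output
    (the 'inner hair cell') to get a neural-activity-pattern-like array (channel x time)."""
    freqs = np.geomspace(0.4 * fs / 2, 60.0, n_stages)   # high (base) down to low (apex)
    nap = np.zeros((n_stages, len(x)))
    y = np.asarray(x, dtype=float)
    for ch, f in enumerate(freqs):
        b, a = resonator_stage(fs, f, q)
        y = lfilter(b, a, y)             # each stage filters the previous stage's output
        nap[ch] = np.maximum(y, 0.0)     # half-wave detector ~ instantaneous firing probability
    return nap
```

In the full model the feedback network would watch those rectified outputs and lower q per stage; because small Q changes compound through the cascade, that gives the large, compressive gain changes described above.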
0:06:09 | we stabilise the image using Roy Patterson's technique of strobed temporal integration |
---|
0:06:13 | sort of like looking at an oscilloscope, as i mentioned, where |
---|
0:06:16 | each line of this image is independently triggered so that at this zero time interval you get this |
---|
0:06:23 | nice stable vertical feature; it doesn't really mean anything, it's just kind of a zero point |
---|
0:06:27 | and the other stuff |
---|
0:06:28 | moves around relative to it as pitch changes and as formants go up and down and so on |
---|
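A rough sketch of strobed temporal integration as described here, with a crude local-peak strobe detector and hypothetical constants; Patterson's actual strobe criteria and weighting are more careful than this.

```python
import numpy as np

def stabilized_auditory_image(nap, max_lag=400, decay=0.97):
    """Trigger each channel of the NAP on its own strobe points (here, local peaks above
    the channel mean) and average the samples after each strobe into a (channel x lag)
    image, so periodic structure stands still like a triggered oscilloscope trace."""
    n_ch, n_t = nap.shape
    sai = np.zeros((n_ch, max_lag))
    for ch in range(n_ch):
        x = nap[ch]
        mean = x.mean()
        for t in range(1, n_t - max_lag):
            if x[t] > x[t - 1] and x[t] >= x[t + 1] and x[t] > mean:   # crude strobe event
                sai[ch] = decay * sai[ch] + (1 - decay) * x[t:t + max_lag]
    return sai
```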
0:06:32 | so this is a frame of speech, where the horizontal bands you see are resonances of the vocal |
---|
0:06:37 | tract there, or formants |
---|
0:06:39 | and then the |
---|
0:06:41 | pattern that repeats in the time-lag dimension |
---|
0:06:44 | are the pitch pulses |
---|
0:06:45 | other sounds that are less |
---|
0:06:48 | periodic than speech have |
---|
0:06:50 | different-looking but still very interesting and kind of |
---|
0:06:53 | characteristic and unique kinds of patterns |
---|
0:06:55 | and the problem was to try to |
---|
0:06:57 | summarise a complex sound file |
---|
0:07:00 | using some statistics of these patterns in a way that |
---|
0:07:02 | you could do recognition and retrieval and so on |
---|
0:07:05 | we did a |
---|
0:07:06 | retrieval and recognition task |
---|
0:07:10 | that we've reported in a couple of different contexts |
---|
0:07:13 | but, uh |
---|
0:07:14 | i'll show you the results in a second |
---|
0:07:17 | the um |
---|
0:07:18 | the features that we extract from these stabilised auditory images are pulled out of a bunch of different boxes; we |
---|
0:07:23 | have, like, you know, long skinny boxes and short fat boxes and |
---|
0:07:26 | small boxes and big boxes, and |
---|
0:07:28 | within each box, um |
---|
0:07:31 | for the current |
---|
0:07:32 | set of features we're using, we just |
---|
0:07:35 | do, uh |
---|
0:07:37 | sort of row and column marginals to reduce it to a somewhat lower dimensionality, and then we vector quantise |
---|
0:07:42 | that, and we do that at a fixed |
---|
0:07:45 | resolution, like thirty-two |
---|
0:07:48 | thirty-two rows and sixteen columns, which gives us a forty-eight-dimensional |
---|
0:07:53 | feature vector for each one of those boxes, and then those forty-eight dimensions go into a vector quantizer with |
---|
0:07:58 | um |
---|
0:07:59 | a different codebook for each different box |
---|
0:08:01 | size and position, so we get a whole bunch of codebooks and a whole bunch of vector quantisations |
---|
0:08:06 | the sizes can be quite large, up to several thousand per codebook |
---|
0:08:10 | several hundred thousand dimensions total |
---|
0:08:13 | it's sparse in the sense that only the one codeword that's closest gets a one and all the others get |
---|
0:08:18 | zeros |
---|
0:08:20 | so for each frame of the video we get |
---|
0:08:24 | you could think of this sparse code as being segmented, one segment for each codebook, and within each |
---|
0:08:29 | segment there's a single one |
---|
0:08:30 | you could use any kind of sparse code here |
---|
0:08:33 | and when you accumulate that over the frames to make a summary for the whole document, you |
---|
0:08:38 | just add it up |
---|
0:08:39 | as i said before |
---|
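A toy sketch of that feature pipeline, under assumed details: the box coordinates and codebooks below are hypothetical, and in practice the codebooks would be trained (for example by k-means) with one codebook per box size and position, as described above.

```python
import numpy as np

def box_marginals(sai, box, n_rows=32, n_cols=16):
    """Cut one rectangle out of the SAI and reduce it to row and column marginals,
    resampled to 32 + 16 = 48 values."""
    r0, r1, c0, c1 = box
    patch = sai[r0:r1, c0:c1]
    rows = patch.mean(axis=1)                      # marginal over lag
    cols = patch.mean(axis=0)                      # marginal over channel
    rows = np.interp(np.linspace(0, len(rows) - 1, n_rows), np.arange(len(rows)), rows)
    cols = np.interp(np.linspace(0, len(cols) - 1, n_cols), np.arange(len(cols)), cols)
    return np.concatenate([rows, cols])            # 48-dimensional descriptor

def sparse_code_frame(sai, boxes, codebooks):
    """Vector-quantize each box's descriptor against its own codebook: one segment per
    codebook, a single one per segment, zeros everywhere else."""
    segments = []
    for box, cb in zip(boxes, codebooks):          # cb: (n_codewords x 48) array
        v = box_marginals(sai, box)
        idx = np.argmin(((cb - v) ** 2).sum(axis=1))
        seg = np.zeros(len(cb))
        seg[idx] = 1.0
        segments.append(seg)
    return np.concatenate(segments)

def bag_of_features(sai_frames, boxes, codebooks):
    """Add up the per-frame sparse codes: a histogram of how often each abstract
    feature occurred in the sound file."""
    return sum(sparse_code_frame(s, boxes, codebooks) for s in sai_frames)
```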
0:08:42 | again, that's the |
---|
0:08:43 | overview of the |
---|
0:08:45 | the system |
---|
0:08:47 | what we do at the end here: we go to this document |
---|
0:08:51 | we take this document feature vector and we train up a ranking and retrieval system on that |
---|
0:08:57 | using Samy Bengio's PAMIR system; that stands for passive-aggressive model for |
---|
0:09:03 | image retrieval |
---|
0:09:04 | so we're doing the same thing for sound retrieval |
---|
0:09:08 | um um |
---|
0:09:09 | his student David Grangier is the lead author on that |
---|
0:09:12 | paper in oh seven |
---|
0:09:14 | it basically computes a scoring function between a |
---|
0:09:18 | query and an audio document; the query q here is a bag of words of |
---|
0:09:24 | the |
---|
0:09:25 | terms in a query |
---|
0:09:27 | like if i was searching for, say, a fast car |
---|
0:09:31 | it would |
---|
0:09:32 | it would look for audio documents that have a |
---|
0:09:35 | good score between |
---|
0:09:38 | the bag of words "fast" and "car" and the bag of abstract audio features that we have from |
---|
0:09:43 | that histogram |
---|
0:09:44 | and that scoring function is computed as just a bilinear transformation, so there's a weight matrix that |
---|
0:09:50 | simply |
---|
0:09:52 | maps the sparse audio features and the sparse query terms into this |
---|
0:09:57 | score for that query |
---|
0:10:01 | then all we had to do is train that weight matrix, and there's a |
---|
0:10:03 | simple, you know, stochastic gradient descent method for that |
---|
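A sketch of the bilinear scoring and a passive-aggressive-style update in the spirit of PAMIR; the margin, step-size bound, and other details here are simplified assumptions, not the published algorithm verbatim.

```python
import numpy as np

def score(W, q, d):
    """Bilinear relevance score between a sparse query vector q (tag terms) and a
    sparse audio-document vector d (the bag of auditory features): q^T W d."""
    return q @ W @ d

def pamir_step(W, q, d_pos, d_neg, C=1.0):
    """One training step on a triplet: if the relevant document d_pos does not beat the
    irrelevant one d_neg by a margin of 1 for query q, move W just enough to fix that."""
    loss = 1.0 - score(W, q, d_pos) + score(W, q, d_neg)
    if loss <= 0:
        return W                                   # margin already satisfied, do nothing
    grad = np.outer(q, d_pos - d_neg)              # direction that increases the margin
    norm = (grad ** 2).sum()
    if norm > 0:
        W += min(C, loss / norm) * grad            # bounded, passive-aggressive step
    return W
```

Because the query and document vectors are sparse, each step only touches a few rows and columns of W, which is part of why this kind of training can scale to very large feature spaces.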
0:10:07 | actually, um, a nice thing about the PAMIR method is that |
---|
0:10:11 | the algorithm is |
---|
0:10:13 | extremely fast; we could do large datasets |
---|
0:10:15 | so we were able to run many different |
---|
0:10:18 | experiments with variations on the features; this is a |
---|
0:10:21 | kind of scatter plot |
---|
0:10:23 | of many different |
---|
0:10:25 | variations on the auditory features as well as a bunch of MFCC variants |
---|
0:10:30 | the little x's here are |
---|
0:10:32 | mostly at one codebook size, but varying the number of |
---|
0:10:36 | MFCC coefficients and so on, um |
---|
0:10:39 | and the window length |
---|
0:10:41 | and a bunch of different things, and you can see here that the MFCC result is not too bad |
---|
0:10:46 | in terms of precision at top one in retrieval |
---|
0:10:49 | we can beat that by a fair amount with very large codebooks on the auditory |
---|
0:10:54 | features, but the difference in terms of what we could do with the best MFCC and what we could |
---|
0:10:59 | do with the best |
---|
0:11:00 | auditory-based system here was kind of small; i was a little disappointed in that |
---|
0:11:04 | so these are the results we reported before |
---|
0:11:07 | the line on top is kind of the |
---|
0:11:09 | convex hull; it shows that |
---|
0:11:11 | perhaps the most important thing |
---|
0:11:13 | here is just the size of the abstract feature space you use with this method |
---|
0:11:17 | it's nice that |
---|
0:11:19 | with this matrix, when you get up to a hundred thousand, uh |
---|
0:11:22 | a hundred thousand dimensions in your sparse feature vectors, and you've got |
---|
0:11:26 | say three thousand query terms |
---|
0:11:28 | that comes out to three hundred million |
---|
0:11:31 | elements that we're training in that matrix |
---|
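The parameter count worked out explicitly:

$$\underbrace{3{,}000}_{\text{query terms}} \times \underbrace{100{,}000}_{\text{audio feature dimensions}} = 3 \times 10^{8} \ \text{entries of } W .$$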
0:11:33 | you can train a three-hundred-million-parameter system here without overfitting, due to the nature of the |
---|
0:11:38 | uh |
---|
0:11:39 | the regularized training algorithm |
---|
0:11:41 | and it actually works quite nicely |
---|
0:11:44 | so what we did for the ICASSP paper was to |
---|
0:11:47 | see how well this works in interference; we actually took, um |
---|
0:11:51 | from a database of sound files we took pairs of files at random and added them together |
---|
0:11:55 | so you might have a |
---|
0:11:57 | a sound file |
---|
0:11:58 | whose |
---|
0:11:59 | tags say it represents a fast car, and another one that says it's a barking dog or something like that |
---|
0:12:04 | you just add the files together and you've got both sounds in there |
---|
0:12:07 | a person listening to it can still tell you what it is, typically |
---|
0:12:11 | what both things are, so we just take the union of the tags |
---|
0:12:15 | we truncated the sound file to the length of the shorter one, 'cause we noticed that in |
---|
0:12:20 | almost all cases a few seconds of the sound was enough to |
---|
0:12:24 | tell you what it was, and |
---|
0:12:26 | an extra thirty seconds of fast car didn't really help you any, so |
---|
0:12:31 | we just truncated everything, so we had sort of a nominal zero dB signal-to-noise ratio, if you like |
---|
0:12:36 | if you consider one of the sounds to be the true sound and the other one to be interference |
---|
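A sketch of how such a mixed-pair test item could be built; it assumes a dataset of (waveform, tag-set) pairs at comparable levels, and glosses over details like level normalisation.

```python
import random

def make_mixed_example(dataset):
    """Pick two sound files at random, truncate to the shorter one, add the waveforms,
    and take the union of their tags as the ground truth for the mixture."""
    (x1, tags1), (x2, tags2) = random.sample(dataset, 2)   # dataset: list of (np.array, set)
    n = min(len(x1), len(x2))                              # truncate to the shorter file
    mix = x1[:n] + x2[:n]                                  # each sound is the other's interference,
    return mix, tags1 | tags2                              # at a nominal 0 dB signal-to-noise ratio
```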
0:12:40 | then we did this same kind of ranking and retrieval task using PAMIR |
---|
0:12:45 | so given a query like "fast car" again, you still want to get that file that has the barking dog in |
---|
0:12:50 | it, because it has the |
---|
0:12:51 | the fast car tags |
---|
0:12:52 | you don't want that barking dog to interfere too much with the retrieval |
---|
0:12:57 | so we did that kind of test, and the |
---|
0:12:59 | the results showed |
---|
0:13:01 | a much bigger difference between the best MFCC system and the best stabilised-auditory-image-based system |
---|
0:13:08 | so the punch line of this experiment is that the sparse-coded |
---|
0:13:12 | stabilised auditory images show |
---|
0:13:14 | a bigger advantage over MFCC for sounds in interference |
---|
0:13:18 | than for clean sounds |
---|
0:13:20 | this is what we had hoped, based on the idea that these boxes in the stabilised auditory image, focusing on |
---|
0:13:26 | different regions, will |
---|
0:13:28 | will sometimes pick up regions |
---|
0:13:30 | where the sounds separate out, so that |
---|
0:13:32 | certain combinations of frequency bands and time-lag patterns will be robust |
---|
0:13:38 | that will represent that car and will still lead to the same code words, the same sparse features |
---|
0:13:44 | and, you know, the same with the dog, so that |
---|
0:13:46 | features for both sounds will be present |
---|
0:13:48 | in the sparse vector even though other features may be wiped out by the interference |
---|
0:13:54 | so it's this locality in this higher-dimensional space that |
---|
0:13:58 | was the motivation and we believe is the explanation for this, though we don't have |
---|
0:14:02 | a great way to test or prove that yet |
---|
0:14:05 | so |
---|
0:14:06 | in conclusion, the |
---|
0:14:09 | auditory representations do work pretty well |
---|
0:14:12 | MFCCs work |
---|
0:14:13 | pretty well too, if you run them through an appropriately powerful training system |
---|
0:14:17 | and the sparse coding works well, and when you put these together, sparse codes that are somewhat localised in |
---|
0:14:22 | the stabilised auditory image space |
---|
0:14:24 | take advantage of how the auditory system separates |
---|
0:14:27 | certain features of sound, at least at a fairly low level |
---|
0:14:31 | thank you |
---|
0:14:40 | any questions |
---|
0:14:45 | i think, for the sparseness to work well |
---|
0:14:48 | do you kind of have to condition the signal |
---|
0:14:50 | to have, somehow, some property in the original, or |
---|
0:14:57 | in the statistics |
---|
0:15:00 | of those mixes |
---|
0:15:02 | of the information |
---|
0:15:05 | such that |
---|
0:15:07 | it can facilitate the sparse code? we don't know, i guess |
---|
0:15:12 | the information |
---|
0:15:14 | seems like a lot more |
---|
0:15:15 | than what you'd even think here |
---|
0:15:18 | i mean, in each case it comes back to the waveform |
---|
0:15:20 | so that's my question: how do you relate |
---|
0:15:24 | these two different |
---|
0:15:25 | representations? |
---|
0:15:26 | yeah, so the |
---|
0:15:28 | yeah, the information is all there in the waveform, but it's not |
---|
0:15:32 | easily accessible directly to get a sparse code that |
---|
0:15:37 | corresponds to features of |
---|
0:15:39 | how you hear the sound; similarly, in a short-time spectral representation like the MFCCs, a lot of |
---|
0:15:46 | the information is there but some of it is lost by that sort of |
---|
0:15:50 | noncoherent spectral detection |
---|
0:15:52 | stuff, so that you no longer have an easy way to take advantage of |
---|
0:15:57 | the separation of things by pitch in the different |
---|
0:16:00 | regions of the auditory image |
---|
0:16:02 | by pitches or other characteristic time patterns in these non-periodic sounds |
---|
0:16:07 | so, um |
---|
0:16:10 | the idea was that this, you know, this |
---|
0:16:12 | duplex representation that Licklider proposed before |
---|
0:16:15 | for pitch captures a lot of different aspects of |
---|
0:16:18 | psychological pitch perception, and we thought that would be |
---|
0:16:22 | a better starting point than either short-time spectra or the waveform as a way to |
---|
0:16:28 | get abstract features that correlate with what you hear |
---|
0:16:31 | and we're not trying to suppress the interfering sounds; we're just trying to get a bag of features in which |
---|
0:16:35 | both sounds have |
---|
0:16:37 | some features come through |
---|
0:16:41 | a quick question |
---|
0:16:43 | right |
---|
0:16:44 | yeah, go ahead so we can hear you talk |
---|
0:16:46 | yeah, so i was just taking off from your comment at the start |
---|
0:16:55 | you've got your stabilised auditory image, um |
---|
0:16:58 | and, you know, there's also the, uh |
---|
0:17:03 | the cortical model |
---|
0:17:05 | and |
---|
0:17:06 | is that something that you would consider trying, to see how well it performs? |
---|
0:17:12 | uh, yes and no |
---|
0:17:16 | my representations are sort of, um |
---|
0:17:19 | midbrain level and below, probably |
---|
0:17:24 | the cortical stuff that Nima has, um |
---|
0:17:29 | it's amazing how well it works; like in his human experiments they can resynthesize the spectrograms of the sound that |
---|
0:17:34 | the human is paying attention to |
---|
0:17:36 | that sort of suggests that it's a representation that comes after |
---|
0:17:40 | some sound separation process |
---|
0:17:43 | and |
---|
0:17:43 | i think that is a layer we need to put in |
---|
0:17:46 | somehow, explicitly, before the cortical |
---|
0:17:50 | spectro-temporal receptive fields |
---|
0:17:53 | really make a lot of sense; if you get there directly without doing the separation first, i think |
---|
0:17:58 | you'll have the same problem as the other short-time spectral techniques, in that interfering sounds won't be |
---|
0:18:04 | um, the representation won't capture the differences between |
---|
0:18:08 | the, you know, the features of the interfering sounds |
---|
0:18:12 | you have to do some separation before you |
---|
0:18:16 | um, before you give up the fine time structure and go to a purely spectral |
---|
0:18:20 | or short-time spectral, spectro-temporal approach |
---|
0:18:23 | something else has to be done in there, so |
---|
0:18:27 | i was just talking to Morgan about that earlier, and Nima and i have talked before too, and |
---|
0:18:32 | we do want to figure out a way to put all these ideas together |
---|
0:18:34 | it's just not clear exactly |
---|
0:18:37 | how to |
---|
0:18:37 | how to bridge that right now |
---|
0:18:40 | but, you know, this area of spectral versus temporal has been going on for a long time |
---|
0:18:45 | and |
---|
0:18:45 | and it's not settled, yeah |
---|