0:00:15 | this paper presents a small-footprint multichannel keyword spotter |
---|
0:00:19 | and I am presenting on behalf of my co-authors |
---|
0:00:26 | this is an overview of our presentation |
---|
0:00:29 | we will first talk about the motivation of this work |
---|
0:00:32 | then we will go over the 3D-SVDF layer |
---|
0:00:36 | introduced in this paper |
---|
0:00:38 | followed by how it is used |
---|
0:00:40 | in the model architecture |
---|
0:00:43 | of the proposed keyword spotter |
---|
0:00:45 | next we will discuss the experimental setup |
---|
0:00:48 | after that we will show some results |
---|
0:00:50 | and cover the final conclusions from this paper |
---|
0:00:54 | finally |
---|
0:00:55 | we will show some future work |
---|
0:00:57 | that we want to |
---|
0:00:58 | explore |
---|
0:01:03 | voice assistants |
---|
0:01:04 | are increasingly popular |
---|
0:01:07 | keywords such as "Ok Google" are commonly used to initiate conversations with the voice assistant system |
---|
0:01:13 | keyword spotting with low latency becomes the core technological challenge to achieve this task |
---|
0:01:20 | since keyword spotters typically run on embedded devices such as phones and speakers, limited battery, RAM, and compute are |
---|
0:01:26 | available |
---|
0:01:28 | we want to detect the keyword |
---|
0:01:30 | with high accuracy |
---|
0:01:32 | we also want to have a small model size and low compute cost |
---|
0:01:36 | due to those reasons |
---|
0:01:39 | dual-microphone setups are widely used to increase noise robustness |
---|
0:01:42 | so it would be interesting to see if we can integrate |
---|
0:01:45 | noise robustness |
---|
0:01:48 | as part of an end-to-end neural network architecture |
---|
0:01:55 | we first recall the SVDF |
---|
0:01:59 | the SVDF idea originated in doing singular value decomposition |
---|
0:02:04 | of a fully connected weight matrix |
---|
0:02:08 | for a rank-one SVDF |
---|
0:02:10 | it decomposes the weight matrix into vectors along two dimensions |
---|
0:02:14 | as shown in the figure |
---|
0:02:16 | these two dimensions are interpreted as filters in the feature and time domains |
---|
0:02:22 | we refer to the filter in the feature domain as the feature filter, and the one |
---|
0:02:26 | in the time domain as the time filter |
---|
0:02:31 | first |
---|
0:02:31 | the input features from each frame |
---|
0:02:34 | are convolved with the one-dimensional feature filter |
---|
0:02:41 | then the output of the current stage |
---|
0:02:44 | is pushed into a memory buffer |
---|
0:02:48 | given a memory size |
---|
0:02:50 | the buffer stores the outputs for the past stages |
---|
0:02:55 | the stored states in the memory buffer are then convolved with the time-domain filter |
---|
0:03:00 | beta |
---|
0:03:01 | to produce the final output |
---|
0:03:06 | this pair of feature and time filters constitutes a single SVDF node |
---|
0:03:10 | in practice |
---|
0:03:11 | there are of course several nodes, which together achieve |
---|
0:03:14 | an N-dimensional output |
---|
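The per-node computation just described can be sketched in numpy. This is an illustrative sketch only, not the paper's implementation: the names `alpha` and `beta` for the feature and time filters, and all sizes, are assumptions made for the example.

```python
import numpy as np

# Hypothetical rank-one SVDF node: a 1-D feature filter, a memory
# buffer, and a 1-D time filter. Names and sizes are illustrative.
class SVDFNode:
    def __init__(self, feature_dim, memory_size, rng):
        self.alpha = rng.standard_normal(feature_dim)  # feature-domain filter
        self.beta = rng.standard_normal(memory_size)   # time-domain filter
        self.buffer = np.zeros(memory_size)            # past stage-1 outputs

    def step(self, frame):
        # Stage 1: filter the current frame in the feature domain.
        s = float(self.alpha @ frame)
        # Push into the memory buffer; the oldest output drops out.
        self.buffer = np.append(self.buffer[1:], s)
        # Stage 2: filter the buffered history in the time domain.
        return float(self.beta @ self.buffer)

rng = np.random.default_rng(0)
node = SVDFNode(feature_dim=40, memory_size=8, rng=rng)
outputs = [node.step(rng.standard_normal(40)) for _ in range(10)]
```

Running several such nodes in parallel and stacking their scalar outputs yields the layer's multi-dimensional output.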
0:03:18 | more concretely |
---|
0:03:19 | if the SVDF is used as the input layer of a speech model |
---|
0:03:23 | the feature filter |
---|
0:03:24 | would correspond to a transformation of the filterbank energies of each frame |
---|
0:03:29 | the memory buffer |
---|
0:03:31 | would contain the past filtered |
---|
0:03:36 | speech frames |
---|
0:03:38 | and the time filter output would correspond to a summary of those past frames |
---|
0:03:44 | the SVDF has been shown to perform well for single-channel models |
---|
0:03:48 | in the literature |
---|
0:03:52 | the 3D-SVDF |
---|
0:03:54 | extends the SVDF's existing two dimensions |
---|
0:03:57 | of feature and time to a new dimension |
---|
0:04:00 | channel |
---|
0:04:02 | 3D thus refers to the three dimensions: feature, time |
---|
0:04:06 | and channel |
---|
0:04:09 | in our example |
---|
0:04:10 | the filterbank energies from each channel are fed into their own SVDF layer |
---|
0:04:15 | each channel learns its own weights |
---|
0:04:18 | for its own time- and frequency-domain filters |
---|
0:04:22 | the outputs of all channels are concatenated |
---|
0:04:25 | after the input layer, so that the subsequent layers can further filter them |
---|
0:04:29 | exploiting the time-delay redundancy between the two channels |
---|
0:04:35 | the 3D-SVDF can also be considered as applying the SVDF on each channel and simply |
---|
0:04:40 | fusing the results |
---|
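A minimal sketch of this per-channel-then-concatenate idea follows. All layer sizes here are made up for illustration, and the per-node math is the same two-stage SVDF filtering as before, written in vectorized form; nothing below is the paper's actual configuration.

```python
import numpy as np

# Hypothetical 3D-SVDF front end: an independent bank of SVDF nodes per
# microphone channel; the per-channel outputs are concatenated.
def make_channel_layer(feature_dim, memory_size, num_nodes, rng):
    alphas = rng.standard_normal((num_nodes, feature_dim))  # feature filters
    betas = rng.standard_normal((num_nodes, memory_size))   # time filters
    state = {"buf": np.zeros((num_nodes, memory_size))}

    def step(frame):
        s = alphas @ frame                                  # stage 1, all nodes
        state["buf"] = np.concatenate([state["buf"][:, 1:], s[:, None]], axis=1)
        return np.sum(betas * state["buf"], axis=1)         # stage 2 per node

    return step

rng = np.random.default_rng(0)
channel_layers = [make_channel_layer(40, 8, 16, rng) for _ in range(2)]

def front_end(frames):
    # frames: one feature vector per channel; each channel keeps its own
    # weights, and the results are concatenated for the next layer.
    return np.concatenate([layer(f) for layer, f in zip(channel_layers, frames)])

out = front_end([rng.standard_normal(40), rng.standard_normal(40)])
# two channels of 16 nodes each -> a 32-dimensional concatenated vector
```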
0:04:44 | this approach enables the network to not only take advantage of the redundancy in the frequency |
---|
0:04:49 | features from each channel |
---|
0:04:51 | but also exploit the temporal variation across channels, and hence improve noise robustness |
---|
0:04:57 | this end-to-end approach also allows the following learnable signal filtering modules |
---|
0:05:03 | to leverage the multi-channel input |
---|
0:05:06 | in general |
---|
0:05:11 | this figure |
---|
0:05:13 | shows an architecture built with the 3D-SVDF |
---|
0:05:18 | at the input layer |
---|
0:05:19 | to enable noise-robust learning |
---|
0:05:22 | the 3D-SVDF |
---|
0:05:25 | takes the original features of each channel as input |
---|
0:05:29 | and emits the concatenated features from each channel |
---|
0:05:33 | as output |
---|
0:05:34 | a dense layer immediately follows the 3D-SVDF |
---|
0:05:38 | and sums the features from the channels together |
---|
0:05:41 | acting as a filter |
---|
0:05:44 | following the first 3D-SVDF layer |
---|
0:05:47 | there are two modules: the encoder |
---|
0:05:50 | and the decoder |
---|
0:05:53 | the encoder takes the filtered output of the 3D-SVDF layer as input |
---|
0:05:57 | and emits softmax probabilities for the phonemes making up the keyword |
---|
0:06:02 | internally |
---|
0:06:04 | the encoder consists of a stack of SVDF layers |
---|
0:06:08 | with some fully connected bottleneck layers in between the SVDF layers |
---|
0:06:13 | to further reduce the total number of |
---|
0:06:15 | parameters |
---|
0:06:17 | the decoder then takes |
---|
0:06:19 | the encoder's results as input |
---|
0:06:22 | and emits a decision |
---|
0:06:24 | as to whether the utterance contains the keyword or not |
---|
0:06:28 | the decoder consists of a stack of SVDF layers directly, with no bottlenecks |
---|
0:06:36 | and a softmax as the final activation |
---|
0:06:41 | the training uses frame-level cross-entropy |
---|
0:06:44 | losses to train the encoder and decoder |
---|
0:06:48 | the final model is a one-stage unified model where the encoder and decoder are jointly trained |
---|
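A frame-level cross-entropy loss is simply the mean negative log-softmax-probability of each frame's target label. This numpy version is a generic sketch of that loss, not the paper's training code:

```python
import numpy as np

# Generic frame-level cross-entropy: average negative log-likelihood of
# the target label at every frame, computed from raw logits.
def frame_cross_entropy(logits, labels):
    # logits: (num_frames, num_classes); labels: (num_frames,) int targets
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

A handy sanity check: with uniform logits over C classes the loss is log(C).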
0:06:59 | for the experimental setup, I will talk about the training and testing data sets |
---|
0:07:03 | for training data |
---|
0:07:05 | we use two point one million |
---|
0:07:07 | single-channel anonymized audio clips containing "Ok Google" |
---|
0:07:10 | and "Hey Google" |
---|
0:07:13 | to generate an additional channel |
---|
0:07:14 | from this mono data |
---|
0:07:17 | we use multistyle room simulation with a dual-microphone setup |
---|
0:07:22 | the simulation uses rooms of different dimensions |
---|
0:07:25 | and different microphone spacings |
---|
0:07:28 | for testing data |
---|
0:07:30 | we generated the keyword-containing utterances in the following way |
---|
0:07:35 | first, prompts containing the keywords were randomly generated |
---|
0:07:39 | then these prompts were spoken by volunteers |
---|
0:07:43 | and we recorded them with the dual-microphone setup to convert them into multi-channel audio |
---|
0:07:49 | we then augmented the set with multi-channel noise recorded with as similar a dual-microphone setup |
---|
0:07:55 | as possible |
---|
0:07:59 | in the table you can see the sizes of the different testing data sets used in this |
---|
0:08:03 | experiment |
---|
0:08:07 | further, to evaluate the single-channel baseline model on two-channel audio |
---|
0:08:12 | we evaluated two different strategies |
---|
0:08:15 | first |
---|
0:08:17 | we run the keyword detector |
---|
0:08:20 | on either channel one or channel zero |
---|
0:08:22 | ignoring the other channel entirely |
---|
0:08:26 | second, we run the single-channel keyword spotter |
---|
0:08:31 | on each channel independently |
---|
0:08:33 | given a fixed threshold |
---|
0:08:36 | we then use a boolean logical OR |
---|
0:08:39 | to produce the final result |
---|
0:08:40 | using the binary outcome of each channel |
---|
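That second strategy amounts to a one-line boolean combination. In this sketch, `score0` and `score1` stand in for the single-channel spotter's confidence score on each channel; the names are hypothetical:

```python
# Logical-OR combination of two independently thresholded channel scores:
# the detector fires if either channel's score clears the threshold.
def or_fusion(score0: float, score1: float, threshold: float) -> bool:
    return score0 >= threshold or score1 >= threshold

# e.g. if only channel 0 clears the threshold, the overall detector still fires
```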
0:08:45 | these strategies are used to evaluate the single-channel baseline |
---|
0:08:50 | since the 3D-SVDF accepts multi-channel input directly |
---|
0:08:54 | we use the output from the 3D-SVDF model directly as well |
---|
0:08:58 | with no need for such extraction strategies |
---|
0:09:02 | we also evaluate the single-channel model |
---|
0:09:05 | with a simple broadside beamformer |
---|
0:09:08 | to capture any improvement from a traditional signal enhancement approach |
---|
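For reference, a broadside beamformer in its simplest form is a delay-and-sum beamformer with zero delay, i.e. an average across time-aligned microphones. This is a generic sketch of that idea, not necessarily the exact enhancement front end used in the experiments:

```python
import numpy as np

# Zero-delay delay-and-sum: average the microphone signals, which
# reinforces sound arriving broadside (perpendicular to the mic axis)
# and attenuates uncorrelated noise across microphones.
def broadside_beamform(channels: np.ndarray) -> np.ndarray:
    # channels: (num_mics, num_samples), assumed already time-aligned
    return channels.mean(axis=0)
```

A signal identical on both microphones passes through unchanged, while uncorrelated noise is attenuated by the averaging.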
0:09:13 | we now present the results of the experiments performed |
---|
0:09:19 | keeping the false accept rate fixed at zero point one FA per hour |
---|
0:09:24 | we now present |
---|
0:09:25 | results in terms of the false reject rate |
---|
0:09:28 | as we can see |
---|
0:09:30 | given the same model size |
---|
0:09:33 | of thirty K |
---|
0:09:35 | the proposed two-channel model outperforms the single-channel baselines |
---|
0:09:39 | on both |
---|
0:09:40 | the clean testing set |
---|
0:09:42 | and the noisy TV set |
---|
0:09:48 | the relative improvement over the two-channel logical-OR combination baseline |
---|
0:09:53 | is twenty seven percent |
---|
0:09:55 | and thirty one percent |
---|
0:09:57 | on the clean and noisy sets respectively |
---|
0:10:02 | to further understand this result |
---|
0:10:05 | we |
---|
0:10:06 | create ROC curves |
---|
0:10:09 | to confirm the model quality improvement at various thresholds |
---|
0:10:14 | as we can see, our proposed model has the best performance |
---|
0:10:17 | compared with the single-channel model for all strategies discussed |
---|
0:10:21 | and compared with the beamformer baseline as well |
---|
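An ROC-style comparison of this kind comes from sweeping the decision threshold and measuring false-accept and false-reject rates at each point. The scores and labels in this sketch are purely illustrative toy data, not the paper's evaluation code:

```python
# Sweep the detection threshold over (score, is_keyword) pairs and
# report the (false-accept rate, false-reject rate) at each threshold.
def roc_points(scores, labels, thresholds):
    negatives = max(1, sum(1 for l in labels if not l))
    positives = max(1, sum(1 for l in labels if l))
    points = []
    for t in thresholds:
        fires = [s >= t for s in scores]
        fa = sum(f and not l for f, l in zip(fires, labels)) / negatives
        fr = sum(not f and l for f, l in zip(fires, labels)) / positives
        points.append((fa, fr))
    return points
```

Plotting these points for each model traces the curves compared in the talk.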
0:10:26 | on the clean set |
---|
0:10:27 | the gains from the 3D-SVDF |
---|
0:10:30 | are small |
---|
0:10:32 | but still nonnegligible |
---|
0:10:34 | and the filtering of the 3D-SVDF |
---|
0:10:37 | does not seem to interfere with performance |
---|
0:10:39 | in clean conditions |
---|
0:10:44 | our negative set is also multi-channel |
---|
0:10:47 | we hypothesize that some of the gains on the clean set |
---|
0:10:50 | stem from the 3D-SVDF serving as a learnable filter |
---|
0:10:54 | reducing confusable speech in the negative set |
---|
0:10:58 | and thus suppressing some false accepts |
---|
0:11:03 | we have seen some such false accepts in the past when experimenting with other signal |
---|
0:11:07 | enhancement techniques |
---|
0:11:10 | on the noisy test sets |
---|
0:11:12 | the gains for the 3D-SVDF are much larger |
---|
0:11:16 | we find the 3D-SVDF |
---|
0:11:19 | outperforming the baseline single-channel model of comparable size |
---|
0:11:24 | even when the baseline also includes a basic signal enhancement technique such as the broadside |
---|
0:11:28 | beamformer |
---|
0:11:31 | which in practice does not seem |
---|
0:11:36 | to add much performance |
---|
0:11:41 | on this difficult noisy set |
---|
0:11:45 | we therefore hypothesize that the larger gains in noise |
---|
0:11:49 | are a result of the 3D-SVDF's ability |
---|
0:11:52 | to learn better filtering of the multi-channel data |
---|
0:11:55 | and therefore to adapt to the specific conditions |
---|
0:11:59 | of the difficult noisy set |
---|
0:12:01 | better |
---|
0:12:02 | than the more general classical techniques |
---|
0:12:05 | such as beamforming |
---|
0:12:10 | in conclusion, this paper has proposed a new architecture for keyword spotting that utilizes multichannel microphone input |
---|
0:12:17 | signals to improve detection accuracy |
---|
0:12:20 | while still being able to maintain a small model size |
---|
0:12:26 | compared with a single-channel baseline which runs in parallel on each channel |
---|
0:12:30 | the proposed architecture |
---|
0:12:32 | reduces the false reject rate |
---|
0:12:35 | by twenty seven percent and thirty one percent relative |
---|
0:12:38 | on two-microphone clean and noisy test sets |
---|
0:12:41 | respectively |
---|
0:12:42 | at a fixed |
---|
0:12:43 | false accept rate |
---|
0:12:48 | as for future work |
---|
0:12:50 | there are many ideas in the literature on how to increase noise robustness with classical signal processing |
---|
0:12:56 | for example, adaptive beamforming |
---|
0:13:00 | and adaptive noise cancellation |
---|
0:13:04 | it would be interesting to see if we could further integrate |
---|
0:13:08 | these techniques |
---|
0:13:09 | as part of a learnable, end-to-end neural network architecture |
---|
0:13:16 | with this, we conclude our presentation |
---|
0:13:19 | of the small-footprint multichannel keyword spotter |
---|