0:00:14 | Okay. |
0:00:16 | Thank you, Anthony, for the nice introduction. |
0:00:19 | Actually, Anthony was in my lab for several years, four years, |
0:00:28 | around 2010, so I could introduce him very well too. |
0:00:35 | Okay, so first of all I would like to confirm that, indeed, |
0:00:41 | the Speaker and Language Characterization special interest group, SpLC, is the most vibrant special interest |
0:00:48 | group we know, |
0:00:49 | no doubt about it at all. You have |
0:00:53 | the presence of myself (I was, among other things, in the past a vice |
0:00:59 | president of ISCA) and |
0:01:01 | Jean-Francois, a past president of ISCA, who |
0:01:05 | must have come to show the support. |
0:01:09 | And I would also like to thank the organizers |
0:01:17 | for having brought Odyssey to Spain. I believe that many of us |
0:01:23 | have wanted to come to Bilbao and visit the Basque Country for a long |
0:01:27 | time, and this made one very good excuse for all of us |
0:01:31 | to come to this beautiful and |
0:01:33 | very impressive region. |
0:01:36 | A year ago, |
0:01:40 | I do not recall exactly who it was, but when the organizers extended |
0:01:43 | the invitation |
0:01:45 | and asked me |
0:01:46 | to talk about anti-spoofing, |
0:01:49 | I thought this would be a topic that |
0:01:55 | is very close to what we will be discussing: speaker recognition. |
0:01:59 | It also made me work very hard the past few days to put |
0:02:02 | together, to select, the slides for this presentation, because this topic |
0:02:06 | is actually a topic that my |
0:02:08 | PhD student |
0:02:10 | worked on; he |
0:02:14 | graduated two years ago, and now he told me that he is working at Apple |
0:02:17 | Computer. |
0:02:19 | He is not here, is he? Are you here? No. |
0:02:22 | I would like to start with thanks. I would like to thank |
0:02:31 | Nick Evans, Tomi Kinnunen, and their colleagues |
0:02:33 | for sharing with me the set of slides they presented; that saved me a |
0:02:37 | lot of time. They did a tutorial at APSIPA, the Asia |
0:02:42 | Pacific Signal and Information Processing Association annual summit and conference, |
0:02:48 | held in Hong Kong. Last December I attended |
0:02:53 | the talk, and then they gave me the set of slides, and I extracted quite a |
0:02:58 | number of them |
0:03:00 | from their presentations. I just want to say thanks to them. |
0:03:05 | And I also thank my student, and another student, Xiaohai, who |
0:03:10 | prepared some experiments just to make my talk complete. |
0:03:18 | Wonderful. So, |
0:03:20 | my topic will be anti-spoofing. I understand that anti-spoofing is |
0:03:24 | actually not a scientific discipline; it is a kind of application that goes with a |
0:03:30 | speaker recognition system. |
0:03:32 | And also, because it is not yet an established discipline, |
0:03:36 | I don't think there is a solid definition of it. What anti-spoofing |
0:03:41 | is, is anything that protects the security of a speaker recognition system; |
0:03:45 | that is what we |
0:03:46 | think about. So today I will only share with you some of the |
0:03:50 | experience that we had, with a touch on points where perhaps that experience can |
0:03:56 | spark further discussion during this |
0:04:00 | workshop. |
0:04:01 | Voice biometrics: actually it is just another name our community uses for speaker |
0:04:06 | recognition. In 2012 there was a report saying that eighteen of |
0:04:14 | the top banks in the world had adopted speaker recognition systems; actually, now |
0:04:22 | the number has increased tremendously. |
0:04:24 | Just a month ago (many of you may have seen the announcement) a bank announced |
0:04:29 | the launch of a voice authentication system |
0:04:35 | for |
0:04:37 | call center services, and |
0:04:39 | we were also part of this project. I just |
0:04:43 | tell people that, |
0:04:44 | for the first time, we were paid to become hackers of a system: |
0:04:49 | we were hired just to |
0:04:51 | evaluate the security features of the speaker recognition system, to attack it, to |
0:04:57 | the point. |
0:04:59 | This is a projection |
0:05:01 | that tries to gauge the market size of |
0:05:07 | the different kinds of biometrics |
0:05:10 | used in |
0:05:11 | banking and financial services, and of course maybe other areas, |
0:05:16 | and you can see that voice biometrics is actually one of the growth areas. |
0:05:22 | I know the color blends with my laptop screen, but it is just |
0:05:30 | the last one. |
0:05:32 | Anyhow, it shows that |
0:05:34 | we see tremendous growth, which must be compared with fingerprint, |
0:05:38 | because fingerprint is already a mature technology by this time. |
0:05:42 | When we talk to customers (I was working in an institute, so we faced a lot of |
0:05:48 | industry partners who need to deploy speaker recognition systems), |
0:05:53 | the question they ask is not so much how accurate the system is, because they |
0:05:57 | take that as a given: the system |
0:05:59 | must work, must work well. The question they usually ask is |
0:06:03 | how secure the system is in the face of attacks, |
0:06:09 | noise, and other things like that. |
0:06:11 | So, |
0:06:13 | recently, actually two or three years ago, we deployed a technology in |
0:06:18 | the Lenovo smartphone. If you get a Lenovo smartphone, the screen unlocking |
0:06:23 | is likely |
0:06:25 | to include a voice authentication system with our technology, and of course they |
0:06:30 | also asked for anti-spoofing for the voice, |
0:06:34 | noting that it |
0:06:35 | has to guard against replay attacks. |
0:06:40 | Later I will talk about it. |
0:06:42 | So today I will talk about four |
0:06:46 | main items. One is the types of spoofing attacks; then I will talk about |
0:06:51 | voice conversion, and the artifacts we may discover in the converted voice; |
0:06:56 | and also, lastly, |
0:06:59 | the ASVspoof automatic speaker verification |
0:07:02 | anti-spoofing evaluation campaign held last year. |
0:07:05 | I do not want to go through the details of the evaluation campaign, but I will talk |
0:07:10 | about |
0:07:11 | some of the observations. I suppose it is a different way to start. |
0:07:17 | Okay. So, typically, a speaker verification system takes voice as input, and with that |
0:07:22 | makes a decision to accept an identity claim or to reject it. |
0:07:28 | Most of the time we assume that the voice input is actually genuine, |
0:07:33 | live, person-produced speech. |
0:07:36 | In reality, that may not be true. |
0:07:39 | We can categorize all the possible attacks into these four types: impersonation, |
0:07:46 | which is getting a person to mimic, |
0:07:50 | to impersonate, your voice; |
0:07:52 | replay, where you manage to record somebody's voice and play it back to the system; |
0:07:57 | and speech synthesis and voice conversion, which are the scientific, |
0:08:01 | technological means of creating speech. |
0:08:07 | Well, there could be some other new methods being invented right now, I |
0:08:14 | suppose; but to date, the attacks can be categorized into |
0:08:18 | these four areas. |
0:08:20 | This table summarizes the |
0:08:23 | accessibility and the effectiveness (the risk) of the attacks to the system, and |
0:08:28 | the availability of the countermeasures. Accessibility means how easily |
0:08:34 | you have access to the technology to spoof a system. |
0:08:38 | There are studies on impersonation, where basically you get a person |
0:08:43 | to act as another person. |
0:08:46 | This is actually part of a very old performing art: usually you try |
0:08:52 | to learn to mimic |
0:08:53 | someone's voice. |
0:08:57 | Studies show that people like this may be able to |
0:09:02 | mimic another person very well to the human ear, but actually the voice may not be |
0:09:16 | a very strong attack, because the computer listens very differently |
0:09:21 | from the human ear. |
0:09:23 | And it is also difficult to train a person to mimic someone |
0:09:28 | else's voice. So basically it has low accessibility, and it does not |
0:09:32 | present a strong risk to a speaker verification system. |
0:09:39 | A replay attack is basically to have somebody's |
0:09:43 | voice recorded while they are talking, and then you play it back to the system, |
0:09:47 | which is low-tech. It is |
0:09:50 | usually discussed in the context of text-dependent verification; |
0:09:53 | if it is text-independent and you have to assemble someone's voice, |
0:09:58 | that is, if you edit the voice input, it basically falls into |
0:10:02 | the speech synthesis and voice conversion categories. So, |
0:10:07 | for replay attacks, |
0:10:11 | we evaluate the risk |
0:10:14 | mostly in the context of text-dependent speaker verification. |
0:10:20 | When we talk about the voice unlocking of the screen of the Lenovo phone: |
0:10:25 | for it we developed a system that is kind of |
0:10:29 | taking the unique features of a voice |
0:10:34 | to counter replay attacks. |
0:10:35 | We know that |
0:10:37 | the human |
0:10:40 | vocal system cannot repeat exactly the same |
0:10:43 | voice twice. So if you happen to be able to record all |
0:10:47 | the voices, then, when an attack comes in, |
0:10:50 | you compare the incoming voice with the data in storage; if they are exactly |
0:10:55 | the same in the timings, |
0:10:56 | this is a replay attack. |
0:10:58 | So we have a mechanism to do this, but there could also be other ways to |
0:11:01 | do this. For example, |
0:11:04 | there are studies, |
0:11:05 | by one group some years ago, on |
0:11:12 | detecting replay attacks. Obviously, the |
0:11:14 | idea is: when you replay, |
0:11:16 | it is a replay of a recording, and the recording usually is |
0:11:20 | taken from a far-field microphone, |
0:11:22 | so the signal carries |
0:11:23 | additional noise, reverberance, |
0:11:26 | and the acoustic effect of the room. |
0:11:29 | If you are able to characterize these, |
0:11:33 | you are able to detect the replay. A replay example here: this is the original speech. |
0:11:40 | [audio samples play] |
0:11:56 | So you hear the reverberation and the noise level in this; these |
0:12:00 | are |
0:12:03 | unique characteristics of far-field microphone recordings. If we detect these things, of course, |
0:12:07 | we can accept or reject a recorded voice. But this is |
0:12:12 | very difficult, because room acoustics change from place to place; it is very difficult to |
0:12:17 | build just one model |
0:12:19 | that can identify all the room acoustics. |
0:12:24 | Another technique, which I just mentioned, is called audio fingerprinting. |
0:12:29 | The idea is that we |
0:12:32 | keep |
0:12:33 | the voices |
0:12:33 | in storage, |
0:12:35 | at least those that were presented to the system. |
0:12:38 | Of course, to do that, we do not keep the recording as a whole; we keep |
0:12:42 | only part of it. |
0:12:44 | Think of it this way: when we do fingerprint recognition, the system does not actually keep |
0:12:48 | the picture |
0:12:49 | of the fingerprint; |
0:12:51 | you keep only the cues. For the enrollment voice, it is like the key points of |
0:12:55 | the fingerprint, |
0:12:56 | and the same goes for audio. |
0:13:00 | There is software that does this quite |
0:13:03 | well, Shazam and the like, where you can |
0:13:05 | record a piece of music and then you retrieve |
0:13:08 | it from the collection of audio in |
0:13:13 | the system. It is the same technology: you have a voice recording, |
0:13:18 | you compute the spectrogram, |
0:13:19 | and then you kind of binarize the spectrogram into pixels, and you remember only the |
0:13:25 | key points of those data, |
0:13:27 | the high-energy, high-contrast data points. |
0:13:30 | And actually you only need something like |
0:13:33 | forty bytes |
0:13:35 | to keep a recording of |
0:13:38 | five seconds. |
0:13:40 | So, |
0:13:40 | practically, you can store an almost unlimited number of entries in the |
0:13:45 | system. |
0:13:46 | When the test speech comes, you just compare, |
0:13:49 | one by one, and if there is an exact match, you just reject it, because |
0:13:53 | no one can produce a voice identical to a previous voice |
0:13:57 | down to the time signal and the noise. |
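The exact-match idea above (keep a compact fingerprint of every utterance the system has heard, and reject bit-exact repeats) can be sketched as a toy in a few lines. This is an illustrative sketch only, not the deployed system: the frame length and the one-peak-per-frame "key point" are arbitrary choices for the example, and a production fingerprinter would hash pairs of spectrogram peaks and tolerate noise.

```python
import cmath

def frame_peaks(signal, frame_len=64):
    """Toy audio fingerprint: the index of the highest-magnitude
    DFT bin in each frame (the 'key point' of that frame)."""
    peaks = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # magnitude of each DFT bin (positive frequencies only)
        mags = []
        for k in range(frame_len // 2):
            X = sum(x * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                    for n, x in enumerate(frame))
            mags.append(abs(X))
        peaks.append(max(range(len(mags)), key=mags.__getitem__))
    return tuple(peaks)

class ReplayFilter:
    """Store fingerprints of every utterance seen; an exact repeat
    of a stored fingerprint is flagged as a replay attack."""
    def __init__(self):
        self.seen = set()

    def is_replay(self, signal):
        fp = frame_peaks(signal)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False
```

A genuine talker repeating the same phrase never matches bit-exactly, so only a mechanical playback of a stored utterance trips the filter; that is the property the exact-timing check exploits.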
0:14:02 | Then come speech synthesis and voice conversion. These two share |
0:14:07 | many common |
0:14:09 | properties; for example, to |
0:14:12 | generate the voice, they rely on vocoders and statistical models to generate the features, |
0:14:17 | et cetera. |
0:14:19 | So today our focus will be on voice conversion, |
0:14:23 | and of course, as the two share so much, |
0:14:25 | much of the technique also applies |
0:14:27 | to speech synthesis detection. |
0:14:29 | When we do speaker verification, |
0:14:34 | we work on robust features. |
0:14:38 | The features have to be, of course, reliable; they have to be robust; |
0:14:42 | and so we see these as key ingredients. |
0:14:46 | Well, we try to find |
0:14:49 | features with both |
0:14:51 | properties, but |
0:14:52 | most of us use the short-term spectral features, because they are easy to extract |
0:14:57 | and are actually both reliable |
0:14:59 | and robust against noise, voice quality, |
0:15:03 | aging, health states, |
0:15:05 | and channel |
0:15:08 | variations; that is what has been the focus. |
0:15:11 | There are typically two types of features. One is based on the |
0:15:14 | voice production system, like the LPC features: you consider the vocal system as |
0:15:18 | an excitation source followed by a resonance filter, right? So you model |
0:15:24 | the excitation, the source, together with the filter; this is where you kind |
0:15:29 | of simulate the production system. |
0:15:30 | There is another type of feature that is formulated like, that is inspired by, the |
0:15:35 | peripheral auditory system: how we perceive sound. We do not hear part of |
0:15:39 | the signal; so, you can think of |
0:15:43 | the cochlea: we have the basilar membrane, |
0:15:49 | which acts like bandpass filters on the signals, |
0:15:52 | and we try to |
0:15:53 | derive features that |
0:15:55 | kind of follow |
0:15:58 | bandpass filters at different scales, the mel scale. |
0:16:02 | This set of parameters is called the auditory |
0:16:06 | features: things like MFCC and |
0:16:08 | many others; people also talk about the constant-Q transform, |
0:16:11 | et cetera. |
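The mel-scale bandpass idea behind auditory features like MFCC can be illustrated with a minimal triangular mel filterbank. This is a generic textbook sketch, not the talk's specific front end; the filter count, FFT size, and sample rate below are arbitrary choices for the example.

```python
import math

def hz_to_mel(f):
    # standard mel-scale formula: finer resolution at low frequencies
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=10, n_fft=256, sample_rate=8000):
    """Triangular filters spaced uniformly on the mel scale.
    Returns n_filters weight vectors over n_fft//2 + 1 FFT bins."""
    lo, hi = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [lo + i * (hi - lo) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate)
            for m in mel_points]
    bank = []
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):       # rising slope
            filt[k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling slope
            filt[k] = (right - k) / max(right - center, 1)
        bank.append(filt)
    return bank
```

Applying these weights to a frame's power spectrum, then taking logs and a DCT, is exactly the MFCC pipeline the talk alludes to.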
0:16:15 | Unfortunately, mostly, when we work on robustness, we try to extract the |
0:16:21 | unique characteristics, the speaker characteristics, |
0:16:24 | and we regard the rest as noise that we try to accommodate. |
0:16:28 | So, as a result, the more robust the speaker recognition system, the more |
0:16:32 | vulnerable it also is to attacks, because when we synthesize a voice there are |
0:16:37 | all kinds of variations, and if your features are very good at |
0:16:41 | overcoming |
0:16:43 | that kind of noise, your system actually becomes very vulnerable. So we |
0:16:48 | have kind of contradicting |
0:16:50 | requirements for the system: on one hand we want to detect the synthetic voice, which |
0:16:54 | is |
0:16:55 | unwanted, and on the other hand we want to be |
0:16:57 | robust, and these two things |
0:16:59 | do not point in the same direction. Therefore we cannot have one system that does both: |
0:17:04 | typically we have one system that does synthetic speech detection at the |
0:17:08 | front, as a filter, so that when |
0:17:10 | we decide that, yes, this |
0:17:13 | is not a synthetic voice, then the signal passes to the speaker |
0:17:16 | verification system. |
0:17:20 | Next I am going to talk about voice conversion. Voice conversion is actually now |
0:17:25 | very accessible: we can even go to amazon.com and you can buy |
0:17:30 | a box for ninety-nine point nine five dollars, |
0:17:33 | ready for use |
0:17:34 | out of the box, |
0:17:36 | and it actually allows you to change your voice, to masquerade your voice, |
0:17:42 | I mean to change your identity from one to another, or to |
0:17:47 | kind of use it |
0:17:49 | as a toy. |
0:17:55 | So basically, okay, it tunes the formants and the pitch, and you can |
0:18:01 | try to use this |
0:18:04 | to kind of fool |
0:18:06 | a speaker verification system, |
0:18:08 | at least in your |
0:18:10 | own room. |
0:18:13 | Of course, |
0:18:15 | if we understand well how voice conversion is done, maybe we can build a system to detect |
0:18:21 | the synthetic voice. A voice conversion system has basically three parts. |
0:18:26 | (These slides, like the former ones, are from the slides, I believe, that |
0:18:30 | Zhizheng presented. My student's |
0:18:33 | voice is very different from my voice; if one can do the analysis and |
0:18:37 | build a system that converts the one voice into the other, that must be a very strong |
0:18:42 | voice conversion system.) |
0:18:44 | So basically there are three modules: |
0:18:46 | to analyze, to convert the features, and |
0:18:49 | to synthesize. |
0:18:51 | We analyze because |
0:18:53 | it is very hard |
0:18:54 | to deal with the time-domain signal directly, so you transform it |
0:18:58 | into a domain where you can |
0:19:00 | manipulate things, |
0:19:01 | say, the frequency domain, |
0:19:03 | and then you convert the features, |
0:19:05 | you manipulate them the way you want, and then you put them back, |
0:19:09 | synthesizing, to generate the voice of another person. |
0:19:13 | The way we do that is called vocoding; actually it is |
0:19:17 | a technique that was very well studied in |
0:19:20 | the early days of communication. Why do people want to transmit signals with vocoding? They |
0:19:26 | want to compress the signal, |
0:19:28 | they want to |
0:19:29 | multiplex the signal, they want to encrypt the signals |
0:19:32 | with |
0:19:34 | codes, et cetera. |
0:19:35 | So they analyze the speech into features, into parameters; then they do what |
0:19:40 | they want, transmit them over the narrowband channel, and at the end, to |
0:19:44 | make sure that the signal can be reconstructed, they put the signal back |
0:19:48 | using the parameters. So this was the traditional framework for |
0:19:53 | communications, and |
0:19:55 | today we actually replace the transmission channel with |
0:19:59 | feature conversion, and that |
0:20:01 | allows us to do voice conversion. |
0:20:04 | There are all kinds of vocoders out there; |
0:20:09 | we just group them broadly into two categories of vocoders that people |
0:20:12 | use in speech synthesis. One family is very well known: the |
0:20:15 | sinusoidal vocoders. Basically, the idea is |
0:20:20 | to generate a signal that pleases our ears: we |
0:20:24 | know how humans hear voices, so the aim is to generate something which |
0:20:29 | sounds very natural to human ears, which is good. |
0:20:33 | So the idea is to decompose, |
0:20:35 | I mean, to decompose |
0:20:37 | the periodic sounds into a collection of |
0:20:40 | harmonics, and then, of course, to include |
0:20:44 | the modulated noise components. So you have the noise, which represents the fricatives, |
0:20:49 | and the harmonic components, which represent the vowels; you put these together and you can |
0:20:54 | regenerate the sound. |
0:20:57 | This |
0:20:58 | kind of vocoder |
0:20:59 | (there are studies where |
0:21:02 | people |
0:21:03 | evaluated them and found that they are actually very natural) has some issues. |
0:21:10 | Some of the issues are like this: |
0:21:13 | because you decompose into these harmonic components, the number of parameters they |
0:21:19 | need to describe the signal varies with |
0:21:23 | the signal itself, |
0:21:25 | with the fundamental frequency, the sampling rate, et cetera; these affect the numbers, so for every frame |
0:21:29 | we have a different number of parameters. That |
0:21:32 | presents a problem if you want to model it: |
0:21:34 | in a |
0:21:36 | statistical model we need the same number of parameters per frame to model. |
0:21:40 | Of course, they have also found ways to overcome this, so the studies |
0:21:44 | on this are focusing on how to manage the number of features in the data |
0:21:48 | and, on the other hand, how to manage the noise, because |
0:21:52 | harmonics, as you know, are |
0:21:54 | good for describing periodic signals but not very good at describing noise. |
0:21:59 | Another type of vocoder that sort of overcomes this is the source-filter model, which, |
0:22:05 | as I mentioned earlier, you can think of as the vocal production system: |
0:22:10 | you have the |
0:22:10 | source excitation and then the resonance filter, and you try to model |
0:22:14 | them both. |
0:22:15 | The good thing about this is |
0:22:18 | the parameters: for example, if you use linear predictive coding, |
0:22:23 | you can actually fix the number of parameters, |
0:22:26 | and that helps the statistical modelling. |
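The fixed-size parameterization just mentioned can be illustrated with a bare-bones LPC analysis via the Levinson-Durbin recursion: whatever the frame contains, an order-p analysis always yields exactly p predictor coefficients. This is a generic textbook sketch, not the specific system from the talk.

```python
def lpc(frame, order):
    """Levinson-Durbin recursion: fit an order-`order` all-pole model.
    Returns (a, err) where a[0] == 1.0 and a[1:] are the predictor
    coefficients of A(z) = 1 + a1*z^-1 + ... + ap*z^-p."""
    n = len(frame)
    # autocorrelation r[0..order]
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction residual
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

The point relevant here: the model size is fixed by `order` alone, which is what makes these features convenient for statistical modelling.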
0:22:29 | Of course, this also has a problem. You can compare it to the sinusoidal |
0:22:33 | encoding of the signal: here you do not encode the phase. |
0:22:35 | Some of the studies, like in music synthesis, the phase vocoder, handle the phase very carefully; |
0:22:40 | it allows you to scale in both the time and frequency domains, so we can |
0:22:45 | actually control the phase of the signal. There is no need to control the phase in the source-filter model: |
0:22:51 | you know this filter has to be |
0:22:55 | stable and causal, so you have all the poles set to |
0:23:01 | be within the |
0:23:02 | unit circle, |
0:23:04 | and because of all that it follows minimum phase. |
0:23:07 | This strategy of reconstructing the signal actually causes artifacts, |
0:23:12 | which is good for |
0:23:16 | artifact detection, synthetic speech detection. |
0:23:20 | So there is |
0:23:21 | a very simple study, started by Zhizheng a few years ago, where you |
0:23:26 | do a very simple test: you take a number of vocoders and you do |
0:23:31 | copy synthesis. You do nothing, you just |
0:23:34 | simply analyze into the features and recompose the signals, |
0:23:37 | and you see whether they can be detected as synthetic voice. |
0:23:41 | And the results show that with the modified group delay cepstral coefficients you |
0:23:46 | can do very well in detecting the synthetic voice. So there are artifacts in |
0:23:51 | the data, and the artifacts may not be easy to visualize, but |
0:23:57 | with the proper features |
0:23:58 | you can actually detect them. |
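Group delay, the negative derivative of the DFT phase, is the quantity underlying the modified group delay features mentioned here. A minimal sketch of the plain (unmodified) version uses the standard identity tau(k) = Re(Y(k) * conj(X(k))) / |X(k)|^2, where Y is the DFT of n*x[n]; the "modified" variant adds cepstral smoothing and compression that are omitted in this illustration.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, v in enumerate(x)) for k in range(n)]

def group_delay(x):
    """Group delay tau(k) = Re(Y(k) * conj(X(k))) / |X(k)|^2,
    where X = DFT(x) and Y = DFT(n * x[n])."""
    X = dft(x)
    Y = dft([i * v for i, v in enumerate(x)])
    return [(Y[k] * X[k].conjugate()).real / (abs(X[k]) ** 2)
            for k in range(len(x))]
```

A sanity check that also shows what the quantity measures: a pure k-sample delay has a linear phase, so its group delay is the constant k at every frequency bin. Phase artifacts introduced by minimum-phase reconstruction show up as deviations from the natural pattern.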
0:24:00 | So now, |
0:24:01 | after talking about the vocoders, let us talk about voice conversion. |
0:24:07 | In voice conversion, basically, you want to convert one person's |
0:24:15 | spectrum to another's. |
0:24:18 | The things that characterize people's |
0:24:21 | voices are quite a number of things. The main items are the formants: |
0:24:27 | the first and the second formant tell which |
0:24:31 | vowel it is, "ah" or "oo" or whichever, |
0:24:34 | but they also carry the personal traits. |
0:24:36 | We each also represent the vocal tract structure in a different way: people have different |
0:24:41 | formant structures and |
0:24:43 | maybe formant tracks. |
0:24:45 | Of course, you also have the fundamental frequency, which is the pitch, and also the |
0:24:49 | intensity, |
0:24:50 | the energy envelope. All these are very difficult to manipulate individually, so what |
0:24:56 | we usually do is |
0:24:58 | spectral conversion: map |
0:25:01 | the spectral envelope of one person's voice to another's, kind of a transform. |
0:25:06 | A typical example: [audio samples play] |
0:25:18 | So what we usually do is: you have what is |
0:25:20 | also called a parallel corpus, |
0:25:23 | where we have samples of the same content, and you do alignment; you can just |
0:25:27 | do a DTW alignment, and then you come up with the pairs of |
0:25:34 | features, right? |
0:25:35 | And then you |
0:25:37 | derive a conversion function from |
0:25:40 | these pairs. |
0:25:41 | You have all the pairs, supposedly enough to cover all the mappings, |
0:25:47 | and then you do the conversion at run time: |
0:25:49 | when an input comes, you have the |
0:25:51 | source features, you apply the conversion function, and then you come up with the output. |
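The DTW alignment step can be sketched as a generic dynamic-programming alignment of two feature sequences (scalar "frames" here for brevity), returning the aligned index pairs from which the conversion function would be trained. This is a textbook sketch, not any particular toolkit's implementation.

```python
def dtw_align(src, tgt, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping: return (total_cost, path) where path is a
    list of (i, j) index pairs aligning src[i] with tgt[j]."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # src frame repeats
                                 cost[i][j - 1],      # tgt frame repeats
                                 cost[i - 1][j - 1])  # both advance
    # backtrack to recover the aligned pairs
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return cost[n][m], list(reversed(path))
```

The returned `path` is exactly the list of (source frame, target frame) pairs that feed the conversion-function training described above.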
0:25:56 | There are many techniques. This slide is not for reading; it |
0:26:00 | was presented by Zhizheng as a sort of map of the progress of |
0:26:04 | this research, just to say that there are |
0:26:08 | many |
0:26:10 | conversion techniques: |
0:26:12 | using samples directly, |
0:26:14 | and linear regression, |
0:26:16 | a linear function to convert source to target. Another method people |
0:26:20 | use is kind of transfer learning: you know |
0:26:28 | that people transform from one person's voice to another's, and you can learn the transform |
0:26:33 | matrices from many pairs of people. Then, when you only have very few samples |
0:26:37 | for a new pair, you leverage |
0:26:39 | the transforms that were learned from other |
0:26:44 | people's pairs, and in this way you do not have to start from |
0:26:49 | scratch. That allows you to use less data, and with that to estimate a smaller |
0:26:54 | number of parameters, while achieving the same goal. |
0:26:57 | So I will |
0:27:00 | just touch upon a few |
0:27:04 | basic approaches. One is called codebook mapping. Basically, you do the same alignment: |
0:27:10 | you get the pairs, and you do vector quantization on the pairs, |
0:27:15 | and this gives paired codewords. At run time we only have one |
0:27:20 | side of the samples: |
0:27:21 | for example, the source is the red column |
0:27:24 | and the green is the target, and at run time the green you |
0:27:28 | do not have. So you quantize the source into these vectors, you get all the |
0:27:32 | paired codewords, and then you |
0:27:35 | string |
0:27:35 | the green ones together to generate the target voice. |
0:27:39 | Of course, this is a very elementary technique. |
0:27:44 | Imagine you do this: |
0:27:46 | you focus very much on the pairing, |
0:27:49 | on how the source and target match, but you do not care too much about the |
0:27:53 | continuity in the target; |
0:27:54 | therefore there is a lot of discontinuity in the generated voice. |
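The codebook mapping above can be sketched in a few lines: at run time, replace each source frame with the target codeword paired with its nearest source codeword, then string the results together. In this toy the "codebook" is just the aligned pairs themselves (no clustering step) and each "frame" is a single spectral value, which is enough to show both the lookup and the discontinuity problem.

```python
def build_codebook(aligned_pairs):
    """aligned_pairs: (source_frame, target_frame) tuples obtained
    from a DTW-aligned parallel corpus."""
    return list(aligned_pairs)

def convert(source_frames, codebook):
    """Map each incoming source frame to the target codeword of its
    nearest source codeword, then string the outputs together."""
    out = []
    for s in source_frames:
        nearest = min(codebook, key=lambda pair: abs(pair[0] - s))
        out.append(nearest[1])
    return out
```

Note how two nearly identical inputs straddling a quantization boundary jump to different target codewords; that hard switching is exactly where the audible discontinuity comes from.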
0:28:00 | Another technique is to kind of convert this into a continuous space. |
0:28:04 | If you do this, then you can have a formula like |
0:28:09 | this: you have x, the input, as the source, y as the output, and this |
0:28:13 | is a linear transformation. |
0:28:15 | You can think of it this way: |
0:28:17 | the previous one is kind of a codebook model, and this |
0:28:21 | is a continuous version of it, |
0:28:23 | really a continuous version of it, |
0:28:26 | and then, of course, this one generates a slightly smoother |
0:28:30 | voice. |
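The continuous version can be illustrated with the simplest possible instance: fit y = a*x + b by least squares over the aligned pairs, then apply it to new source frames. A real system would typically use a mixture model with one such transform per component; the single global scalar transform below is only meant to show the shape of the formula.

```python
def fit_linear(pairs):
    """Least-squares fit of y = a*x + b over aligned (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def convert(frames, a, b):
    # continuous by construction: nearby inputs map to nearby outputs
    return [a * x + b for x in frames]
```

Because the mapping is a continuous function rather than a table lookup, nearby source frames always land near each other in the output, which is the source of the smoother result.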
0:28:32 | Another technique: the previous two are kind of remembering the samples, |
0:28:37 | right? |
0:28:37 | In this one, instead of remembering the samples, we remember the |
0:28:43 | conversion, |
0:28:44 | the warping functions. You know the |
0:28:46 | source speaker and the target speaker; if you have enough samples, we can derive |
0:28:50 | a |
0:28:51 | warping function |
0:28:53 | between them, and we just remember this warping function. |
0:28:56 | At run time, when the test data comes, we apply the right |
0:28:59 | warping function to generate the target. |
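Remembering a warping function instead of samples can be illustrated with a piecewise-linear frequency warp: a handful of (source frequency, target frequency) anchor points, which in practice would be estimated from the aligned corpus, define a monotone mapping applied to every frame at run time. The anchor values in the test are made up for the example.

```python
def make_warp(anchors):
    """anchors: sorted (source_hz, target_hz) pairs defining a
    piecewise-linear, monotone frequency warping function."""
    def warp(f):
        if f <= anchors[0][0]:
            return anchors[0][1]
        for (s0, t0), (s1, t1) in zip(anchors, anchors[1:]):
            if f <= s1:
                # linear interpolation inside this segment
                return t0 + (t1 - t0) * (f - s0) / (s1 - s0)
        return anchors[-1][1]
    return warp
```

The whole conversion rule is carried by the few anchor points, so once they are estimated, no training samples need to be stored at all.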
0:29:04 | There is another technique called frame selection. |
0:29:08 | Frame selection |
0:29:08 | (we just talked about the codebook approach, which |
0:29:11 | basically does not care too much about the continuity in the target) |
0:29:14 | is one that does take it into consideration. |
0:29:17 | You have certain frames in the training data that occur close to |
0:29:21 | each other, with a similar pitch, similar |
0:29:26 | phonetic context, so they tend to go together. So we have a kind of |
0:29:29 | selection process driven not just by |
0:29:32 | the source-to-target distance but also by the target-to-target frame distance, to ensure |
0:29:37 | continuity. This one |
0:29:39 | gives us a little bit smoother |
0:29:41 | output. |
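The selection process just described, trading off source-to-target distance against target-to-target continuity, is essentially a shortest-path search over candidate frames. A toy sketch with scalar "frames" and a hypothetical weight on the continuity cost:

```python
def select_frames(source, candidates, cont_weight=1.0):
    """For each source frame, pick one candidate target frame, minimizing
    the target cost |candidate - source| plus cont_weight times the jump
    between consecutive selected frames (a Viterbi search)."""
    # best[c] = minimal cost of a selection path ending in candidate c
    best = [abs(c - source[0]) for c in candidates]
    back = []
    for s in source[1:]:
        prev = best[:]
        back.append([])
        best = []
        for c in candidates:
            # transition cost penalizes jumps between selected frames
            costs = [prev[p] + cont_weight * abs(candidates[p] - c)
                     for p in range(len(candidates))]
            p_best = min(range(len(candidates)), key=costs.__getitem__)
            back[-1].append(p_best)
            best.append(costs[p_best] + abs(c - s))
    # backtrack the cheapest path
    idx = min(range(len(candidates)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    return [candidates[i] for i in reversed(path)]
```

With the continuity weight set to zero this degenerates to the codebook-style frame-by-frame nearest pick; raising the weight makes the search prefer the smoother trajectory even when individual frames match slightly worse.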
0:29:43 | then of course there is the unit selection technique, which is very well known in
---|
0:29:47 | the field
---|
0:29:48 | of speech synthesis
---|
0:29:49 | where you have
---|
0:29:53 | sufficient samples: maybe you have twenty utterances or fifty utterances of a target speaker
---|
0:29:58 | and you can break them down into elementary components
---|
0:30:01 | and then
---|
0:30:03 | at run time, when you want to compose something, you just pull the samples together, concatenate them
---|
0:30:07 | into one piece, and play it back
---|
0:30:10 | this is actually one of the common ways of doing it; it is effectively building
---|
0:30:16 | a speech synthesis system; but think of this: if you do this, there is a
---|
0:30:20 | discontinuity between the
---|
0:30:22 | between the units
---|
0:30:25 | both in magnitude and phase, and this could be an artifact we can detect
---|
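The concatenation discontinuity mentioned above can be made concrete with a toy splice: two units of the same pitch but opposite phase are joined, and a simple frame-to-frame spectral distance spikes at the join. This is an illustrative experiment, not any published detector.

```python
import numpy as np

# Splice two waveform "units" with a phase mismatch and locate the join
# with a frame-to-frame magnitude-spectrum distance.
fs = 8000
t = np.arange(2000) / fs
unit_a = np.sin(2 * np.pi * 200 * t)
unit_b = np.sin(2 * np.pi * 200 * t + np.pi)   # same pitch, flipped phase
spliced = np.concatenate([unit_a, unit_b])     # join at sample 2000

frame, hop = 256, 128
frames = [spliced[i:i + frame] * np.hanning(frame)
          for i in range(0, len(spliced) - frame, hop)]
mags = [np.abs(np.fft.rfft(f)) for f in frames]
jumps = [np.linalg.norm(mags[i + 1] - mags[i]) for i in range(len(mags) - 1)]

peak = int(np.argmax(jumps))       # frame index with the biggest jump
join_frame = 2000 // hop           # frame index nearest the splice point
```

Away from the join the magnitude spectra of successive frames are nearly identical; the phase flip inside the analysis window produces the cancellation that the distance exposes.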
0:30:32 | so to summarize what I just said:
---|
0:30:35 | in voice compression and in speech synthesis
---|
0:30:41 | studies we have
---|
0:30:43 | subjective evaluation and objective evaluation
---|
0:30:46 | but neither of them addresses the spoofing strength of a synthetic voice
---|
0:30:53 | look at the unit selection case: in one of the examples you heard,
---|
0:30:56 | you saw that
---|
0:31:00 | a sample that you cannot even understand can still be a very
---|
0:31:04 | strong
---|
0:31:07 | attacking voice for a speaker verification system; so those evaluation tools are
---|
0:31:14 | fine for
---|
0:31:15 | a kind of
---|
0:31:18 | perceptual quality evaluation
---|
0:31:20 | but when it comes to spoofing detection, I believe it falls to us to
---|
0:31:24 | define the best ways to analyze the converted voice in terms of
---|
0:31:30 | the strength of its attack on the system; last year's ASVspoof evaluation campaign,
---|
0:31:37 | in my view, provided an objective benchmark that allows us to evaluate the strength
---|
0:31:42 | of synthetic voices and converted voices
---|
0:31:47 | okay, so next let me talk about
---|
0:31:52 | the effects,
---|
0:31:55 | the artifacts,
---|
0:32:00 | that reside in the synthetic voice that we can possibly detect
---|
0:32:04 | we know that we cannot easily visualize the artifacts; it is very difficult to see them
---|
0:32:07 | actually I
---|
0:32:09 | got my student group to try to
---|
0:32:12 | plot spectrograms to see the differences
---|
0:32:14 | and
---|
0:32:16 | there is no direct way of measuring them, but there are indirect ways of
---|
0:32:21 | modeling them, for example
---|
0:32:23 | if you know that the signal is discontinuous, of course you can use features that
---|
0:32:27 | represent this kind of discontinuity in the speech,
---|
0:32:32 | both in magnitude and in phase, to model
---|
0:32:36 | the data
---|
0:32:39 | there are two things that we should look into: one is the magnitude
---|
0:32:43 | and the other is the phase; I mean, this is standard signal
---|
0:32:47 | processing textbook material
---|
0:32:49 | what is important is that
---|
0:32:51 | in most speech recognition and speech synthesis research
---|
0:32:55 | we pay much more attention to the magnitude than to the phase
---|
0:33:01 | for the simple reason that
---|
0:33:03 | magnitude is easier to manage, it is easier to
---|
0:33:07 | visualise
---|
0:33:09 | and the phase is much more difficult
---|
0:33:15 | to describe, to associate its parameters with a physical meaning
---|
0:33:20 | and
---|
0:33:21 | but actually there is a lot of research in the literature on phase features for speech
---|
0:33:27 | recognition, and that provides
---|
0:33:29 | a kind of basis for us to
---|
0:33:33 | start this research
---|
0:33:35 | so in terms of magnitude
---|
0:33:39 | we all know that to analyze the speech signal we need to do the short-
---|
0:33:43 | time Fourier transform
---|
0:33:45 | whether you use
---|
0:33:47 | sinusoidal coding or a source-filter vocoder, you want to do this short-time
---|
0:33:55 | time-frequency analysis
---|
0:33:56 | and this introduces artifacts; you know that we do an FFT
---|
0:34:01 | with a fixed window length
---|
0:34:04 | and then you have
---|
0:34:06 | spectral
---|
0:34:09 | leakage, and you have
---|
0:34:10 | windowing effects; all these are artifacts
---|
0:34:14 | produced by the system in the process
---|
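The fixed-window leakage artifact is easy to demonstrate. The sketch below compares a sinusoid that lands exactly on an FFT bin with one that falls between bins; the frequencies and window are illustrative choices, not values from the talk.

```python
import numpy as np

# Spectral-leakage demo: an off-bin sinusoid spreads energy into
# neighbouring FFT bins, a bin-centred one stays confined.
N = 512
n = np.arange(N)
win = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)   # periodic Hann window

on_bin  = np.sin(2 * np.pi * 32.0 * n / N)    # exactly bin 32
off_bin = np.sin(2 * np.pi * 32.5 * n / N)    # between bins 32 and 33

def leakage(x):
    """Fraction of spectral energy outside the peak and its 2 side bins."""
    mag = np.abs(np.fft.rfft(x * win))
    peak = int(np.argmax(mag))
    keep = np.zeros_like(mag)
    keep[max(peak - 2, 0):peak + 3] = 1
    return np.sum((mag * (1 - keep)) ** 2) / np.sum(mag ** 2)

l_on, l_off = leakage(on_bin), leakage(off_bin)
```

For the bin-centred tone the Hann-windowed spectrum is confined to three bins, so `l_on` is essentially zero, while the half-bin offset produces measurable sidelobe energy in `l_off`: the "windowing effect" artifact.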
0:34:18 | and then you have this smoothing effect; you know that when we do
---|
0:34:22 | statistical modeling, almost all models are trained by
---|
0:34:26 | maximum likelihood estimation, right? what does maximum likelihood
---|
0:34:32 | estimation try to do?
---|
0:34:34 | it tries to give you the average over everything
---|
0:34:36 | because the average always gives you
---|
0:34:39 | the higher
---|
0:34:42 | probability, right
---|
0:34:44 | and that causes a problem:
---|
0:34:45 | the limited dynamic range of the
---|
0:34:48 | generated signals, and that could be an artifact that we
---|
0:34:52 | can detect
---|
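The averaging argument can be illustrated with a toy Gaussian model: the maximum-likelihood point estimate of a trajectory is its mean, and the mean has a visibly smaller dynamic range than any individual natural realization. The contour and noise level below are made up for the sketch.

```python
import numpy as np

# Toy illustration: averaging "natural" trajectories (the ML estimate under
# a Gaussian model) shrinks the dynamic range of the generated contour.
rng = np.random.default_rng(1)

t = np.linspace(0, 1, 100)
base = np.sin(2 * np.pi * 3 * t)
# 50 natural realizations: shared contour plus per-utterance detail
trajs = base + 0.3 * rng.normal(size=(50, 100))

ml_estimate = trajs.mean(axis=0)                 # the averaged output
range_natural = np.ptp(trajs, axis=1).mean()     # mean per-trajectory range
range_ml = np.ptp(ml_estimate)                   # range of the average
```

`range_ml` falls well below `range_natural`; this reduced dynamic range is exactly the smoothing artifact that a detector can look for in synthetic speech.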
0:34:53 | the same for phase
---|
0:34:55 | in fact phase is a bigger problem
---|
0:34:57 | often, as I said, when we do synthesis or we do
---|
0:35:00 | recognition, we use magnitude features and we actually
---|
0:35:04 | ignore the phase; I mean, we still think that phase continuity
---|
0:35:09 | is important, but we don't think that modeling the phase is as important
---|
0:35:13 | as modeling the magnitude; that also presents an opportunity for us to
---|
0:35:18 | detect artifacts: if we can model the patterns of the
---|
0:35:23 | phase
---|
0:35:25 | distribution seen in
---|
0:35:26 | natural speech, then we are able to detect synthetic speech
---|
0:35:31 | next, some examples;
---|
0:35:35 | I just really want to say that in short-time frequency
---|
0:35:39 | analysis you use a fixed-length window to analyze
---|
0:35:43 | the signal
---|
0:35:46 | and as a result you have
---|
0:35:49 | a kind of interference between
---|
0:35:52 | frequency bins,
---|
0:35:54 | the energy leaks across the frequencies
---|
0:35:58 | and at the same time, because you shift the window frame by frame with overlap,
---|
0:36:02 | you also have this smearing effect along the time axis; so you have
---|
0:36:07 | the interference across the frequency bins and you also have
---|
0:36:13 | the smearing effect along the time
---|
0:36:16 | axis
---|
0:36:18 | if we were able to detect
---|
0:36:20 | detect this, then this could be
---|
0:36:22 | something like
---|
0:36:24 | a signature of a synthetic voice
---|
0:36:29 | vocoders:
---|
0:36:31 | most vocoders actually try
---|
0:36:34 | to smooth the waveform as a result of this
---|
0:36:39 | short-time analysis
---|
0:36:44 | effect, and
---|
0:36:48 | as people say, you are actually using
---|
0:36:50 | one artifact to correct another artifact; so the short-time
---|
0:36:55 | spectral
---|
0:36:58 | leakage causes you problems, and then you use another smoothing
---|
0:37:02 | method to try to smooth everything out; so you have a quality problem and you iron it
---|
0:37:06 | out with a different technique
---|
0:37:09 | and you can
---|
0:37:09 | kind of extract the signal; but interestingly, because we use human
---|
0:37:15 | ears to assess the quality, after this smoothing the evaluation
---|
0:37:21 | says that
---|
0:37:23 | the sound quality is improved
---|
0:37:25 | but I believe there are artifacts inside that you can describe; and as I also mentioned,
---|
0:37:31 | we use statistical models,
---|
0:37:32 | either in voice conversion
---|
0:37:35 | or in
---|
0:37:39 | hidden Markov model based synthesis
---|
0:37:42 | and then we try to estimate and
---|
0:37:47 | generate the parameters using a maximum likelihood
---|
0:37:50 | criterion; they always give you the average; well, not always, there are other ways to
---|
0:37:56 | model the dynamics, so you might disagree with me, but
---|
0:38:02 | in general such systems give you a kind of
---|
0:38:05 | average
---|
0:38:06 | signal
---|
0:38:08 | that is, a limited dynamic range in the converted speech
---|
0:38:12 | in this example I just plot the spectrogram of the natural speech and of the copy-
---|
0:38:18 | synthesis speech, and here you can see that
---|
0:38:21 | actually
---|
0:38:23 | there are obvious differences in the spectral
---|
0:38:25 | domain
---|
0:38:27 | and these are the pitch patterns
---|
0:38:32 | generated by the HMM-based synthesis
---|
0:38:35 | whereas we know that in human speech
---|
0:38:38 | the pitch pattern is not as stable as in a synthetic voice; in a
---|
0:38:44 | paper from
---|
0:38:48 | 2005, one chart shows the
---|
0:38:54 | synthetic voice, which has a very straight pitch pattern
---|
0:38:58 | this is the pitch, and this is the autocorrelation of the time-
---|
0:39:02 | domain signal
---|
0:39:03 | and you see that natural speech actually has something like, you know,
---|
0:39:08 | vibrato: you have this broad
---|
0:39:11 | periodic modulation on top of the pitch
---|
0:39:14 | and also some pitch-level jitter
---|
0:39:16 | while the synthetic voice is rather strict
---|
0:39:20 | because of this, if we believe that
---|
0:39:23 | there is a lack of dynamic range in the synthetic voice and the converted voice, then
---|
0:39:28 | the dynamic range of the spectrogram can be used as a feature as well; there is
---|
0:39:34 | one paper by Tomi's group that talks about using only the
---|
0:39:37 | delta dynamic features
---|
0:39:42 | of the spectrogram as the features, ignoring the static features, to detect synthetic voice
---|
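The delta (dynamic) features mentioned above are computed with the standard regression formula over a few neighbouring frames. The sketch below is a generic implementation of that formula, not the exact recipe of any cited paper; a detector along these lines would keep only the deltas and discard the static coefficients.

```python
import numpy as np

# Delta features via the standard regression formula over +/- width frames.
def delta(feats, width=2):
    """feats: (frames, dims). Returns delta features of the same shape."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, width + 1))
    out = np.zeros_like(feats, dtype=float)
    for i in range(1, width + 1):
        out += i * (padded[width + i:width + i + len(feats)]
                    - padded[width - i:width - i + len(feats)])
    return out / denom

# On a linear ramp along time, interior deltas equal the slope.
static = np.tile(np.arange(10.0)[:, None], (1, 3))
dyn = delta(static)
```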
0:39:49 | there are also techniques to
---|
0:39:51 | model the temporal modulation features; you know, we have feature frames,
---|
0:39:57 | usually frame by frame with a ten-millisecond
---|
0:40:02 | shift, and you
---|
0:40:04 | cut a piece of the signal, say fifty frames, and you extract
---|
0:40:10 | a temporal feature, using the temporal trajectory to model it, and then use
---|
0:40:15 | this to
---|
0:40:16 | form a supervector to model
---|
0:40:21 | the dynamics of the magnitude
---|
0:40:27 | features, and it works
---|
0:40:32 | well as a complementary feature
---|
0:40:35 | in synthetic voice detection
---|
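A minimal version of that block-wise modulation supervector can be sketched as follows; the block length of 50 frames matches the example in the talk, but the DC removal and plain magnitude FFT are simplifying assumptions.

```python
import numpy as np

# Temporal-modulation supervector sketch: per feature dimension, Fourier-
# transform its trajectory over a block of frames and stack the modulation
# magnitudes of all dimensions into one long vector per block.
def modulation_supervector(feats, block=50):
    """feats: (frames, dims) -> (n_blocks, dims * (block//2 + 1))."""
    n_blocks = len(feats) // block
    out = []
    for b in range(n_blocks):
        seg = feats[b * block:(b + 1) * block]        # (block, dims)
        seg = seg - seg.mean(axis=0)                  # remove DC per dim
        mod = np.abs(np.fft.rfft(seg, axis=0))        # modulation spectrum
        out.append(mod.T.reshape(-1))                 # stack dimensions
    return np.array(out)

feats = np.random.default_rng(2).normal(size=(100, 4))
sv = modulation_supervector(feats)
```

Each row of `sv` describes how the features evolve over time within one block, which is the complementary dynamic information the talk refers to.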
0:40:38 | phase is something that deserves
---|
0:40:43 | our attention
---|
0:40:46 | why don't people use phase features? mostly because phase is difficult
---|
0:40:51 | to describe, and because it has
---|
0:40:55 | many unique properties; for example, we have this wrapping effect: when you want to see it
---|
0:40:59 | you have to unwrap it; if you plot the raw wrapped
---|
0:41:06 | phase of a signal you cannot see any patterns
---|
0:41:08 | but actually
---|
0:41:09 | if you think about it: you have a real
---|
0:41:13 | signal, and when we do the Fourier transform you have the
---|
0:41:17 | real part and the imaginary part, right? and then
---|
0:41:20 | the magnitude comes from these two parts, and the phase also comes
---|
0:41:24 | from these two parts, and
---|
0:41:26 | so
---|
0:41:27 | by right they should present similar patterns; and many people have
---|
0:41:31 | shown that
---|
0:41:33 | if you do the unwrapping properly, with proper normalization, you see similar patterns:
---|
0:41:37 | the phase feature and the magnitude feature look about the same
---|
0:41:44 | and that gives us opportunities, I mean new features, to look into
---|
0:41:49 | synthetic voice and converted voice; people do not pay enough
---|
0:41:53 | attention to phase,
---|
0:41:55 | so phase-based features become very useful for synthetic and spoofed voice detection
---|
0:42:01 | there have been
---|
0:42:03 | many papers on these techniques; one of them is the instantaneous
---|
0:42:09 | frequency, which is the time derivative of the phase; so basically you have
---|
0:42:13 | two frames
---|
0:42:15 | and their magnitudes may look very similar, but their
---|
0:42:19 | phase features
---|
0:42:20 | could be very different; the phase
---|
0:42:24 | spectra could be very different
---|
0:42:26 | very different
---|
0:42:27 | so
---|
0:42:28 | by taking their
---|
0:42:30 | difference
---|
0:42:32 | as a feature
---|
0:42:33 | you are able to capture something
---|
0:42:37 | strictly speaking, for every sample in the time domain there is actually a two-pi shift
---|
0:42:42 | of the phase to maintain the continuity; so if we want to do this we
---|
0:42:45 | have to kind of unwrap it
---|
0:42:47 | because usually we shift the window by ten or twenty milliseconds, not by every sample,
---|
0:42:52 | right
---|
0:42:52 | so when you take this difference you want to make sure that the
---|
0:42:55 | phase features are
---|
0:42:56 | continuous; you have to do
---|
0:42:59 | a kind of normalization
---|
0:43:01 | across frames
---|
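The phase-difference idea above, with the wrap-around normalization, is the classic phase-vocoder instantaneous-frequency estimator. The sketch below applies it to a pure tone whose frequency it then recovers; the sampling rate, frame size, and hop are illustrative values.

```python
import numpy as np

# Instantaneous frequency from the unwrapped phase difference between two
# successive analysis frames of a pure tone.
fs, f0 = 8000.0, 1000.0
n = np.arange(4096)
x = np.sin(2 * np.pi * f0 * n / fs)

frame, hop = 512, 128
k = int(round(f0 * frame / fs))                  # FFT bin closest to f0
ph1 = np.angle(np.fft.rfft(x[0:frame])[k])
ph2 = np.angle(np.fft.rfft(x[hop:hop + frame])[k])

# Expected phase advance of bin k over `hop` samples, then wrap the
# residual back into (-pi, pi] (the normalization step).
expected = 2 * np.pi * k * hop / frame
d = (ph2 - ph1 - expected + np.pi) % (2 * np.pi) - np.pi
inst_freq = (k / frame + d / (2 * np.pi * hop)) * fs   # in Hz
```

The wrap step is what the talk calls normalization: without it, the raw phase difference is ambiguous up to multiples of two pi.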
0:43:02 | and then there are the group delay features, which are the frequency derivative of the phase
---|
0:43:09 | say we have a signal like this
---|
0:43:11 | and you have the power spectrum, which shows the two resonance peaks here
---|
0:43:15 | you see, and then the
---|
0:43:18 | group delay also shows something like this
---|
0:43:20 | these features have a rather complex
---|
0:43:23 | mechanism, but at least they show
---|
0:43:27 | a resonance structure similar to the magnitude feature
---|
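Group delay (the negative frequency derivative of the phase) is usually computed without explicit unwrapping, via the standard identity that uses the spectra of x[n] and n*x[n]. The sketch below verifies it on a delayed impulse, whose group delay is flat at the delay value.

```python
import numpy as np

# Group delay via the standard identity:
#   gd(w) = (Xr*Yr + Xi*Yi) / |X(w)|^2,  with Y = FFT of n * x[n],
# which avoids phase unwrapping entirely.
def group_delay(x):
    n = np.arange(len(x))
    X = np.fft.rfft(x)
    Y = np.fft.rfft(n * x)
    eps = 1e-12                      # guard against zero-magnitude bins
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

# A delayed unit impulse: group delay should equal the delay at every bin.
x = np.zeros(64)
x[5] = 1.0
gd = group_delay(x)
```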
0:43:31 | these are plots of the different spectrograms and phase features that my
---|
0:43:39 | student group developed during last year's
---|
0:43:42 | ASVspoof evaluation; if you compare them you can see that the log magnitude spectrum
---|
0:43:49 | and
---|
0:43:50 | the group delay, if we process it properly, actually show
---|
0:43:55 | similar patterns
---|
0:43:56 | to the magnitude
---|
0:43:58 | and you have many other options: the modified group delay, the instantaneous frequency, and
---|
0:44:04 | the other features; you can see the papers for the specifics
---|
0:44:08 | and these
---|
0:44:09 | are all possible features for doing the detection
---|
0:44:14 | finally we come to last year's ASVspoof evaluation
---|
0:44:19 | it shows that
---|
0:44:21 | this is
---|
0:44:22 | the performance on the speaker recognition data if you just use the standard
---|
0:44:27 | GMM system, and then once you attack it with the spoofing voice, the synthetic voice,
---|
0:44:34 | the performance degrades dramatically
---|
0:44:41 | okay, in the evaluation there were five
---|
0:44:48 | synthetic voices
---|
0:44:49 | which were used as training and development data; these are called the known attacks:
---|
0:44:53 | you have access to the training data of the synthesizers
---|
0:44:58 | and there were another five that you don't have access to; you are not told how
---|
0:45:03 | they were generated
---|
0:45:04 | and then you are only given the evaluation data and are supposed to detect the synthetic voice
---|
0:45:09 | for all of them; so you typically use the five
---|
0:45:13 | known voices to train your system and
---|
0:45:17 | use the system to test
---|
0:45:19 | across the ten evaluation
---|
0:45:21 | voices; and this is a brief summary of the results; you can see that for
---|
0:45:28 | the known attacks the average performance is quite good
---|
0:45:34 | while for the unknown attacks the error rates are something like four times higher
---|
0:45:38 | of course, for the known attacks,
---|
0:45:42 | since they are known beforehand, you know,
---|
0:45:44 | we know the signals, so of course you can do something: you train
---|
0:45:48 | the detector using the samples and then you detect them
---|
0:45:54 | I would like to point out one particular attack, S10, which uses a synthesizer that is a
---|
0:46:00 | kind of outlier among the systems
---|
0:46:03 | you know
---|
0:46:04 | these are the sixteen submissions to the evaluation,
---|
0:46:08 | ranked by the performance
---|
0:46:10 | most of them did very well; to take one as an example, it did very
---|
0:46:15 | well for
---|
0:46:18 | all of the attacks
---|
0:46:20 | except S10, the unit selection synthesizer, right
---|
0:46:25 | otherwise it did reasonably well
---|
0:46:28 | but when it comes to S10, the equal error rate is very high
---|
0:46:32 | so basically all the features kind of failed
---|
0:46:35 | for this test
---|
0:46:37 | S10 is a TTS system
---|
0:46:40 | using
---|
0:46:42 | unit selection and replay
---|
0:46:44 | here is a sound clip to show you how it sounds; this is the test sample
---|
0:46:59 | if i should |
---|
0:47:02 | if i should |
---|
0:47:08 | actually
---|
0:47:10 | it says "if i should"; okay, you can really hear this;
---|
0:47:14 | this S10 presents the strongest
---|
0:47:20 | attack to the speaker recognition system; I believe that is because it is unit selection:
---|
0:47:24 | it works frame by frame,
---|
0:47:28 | and the frames are natural voice except at
---|
0:47:31 | the concatenation points, which represent only a minority of the
---|
0:47:36 | frames
---|
0:47:40 | nowadays everything must have a little bit of a deep neural network, so I also
---|
0:47:44 | include a neural network in my presentation
---|
0:47:49 | so this is a
---|
0:47:51 | very simple neural network; it is not deep,
---|
0:47:55 | there's just one layer; anyway, the neural network takes the speech as the input,
---|
0:48:00 | takes the features as input, four types of features,
---|
0:48:02 | and then generates an output
---|
0:48:04 | so
---|
0:48:04 | the closer the score is to
---|
0:48:06 | the right, the more natural the speech, and the further left,
---|
0:48:13 | the more synthetic the voice; and you can see that S10
---|
0:48:18 | overlaps with natural speech very much; S10 and natural speech give very
---|
0:48:22 | similar scores
---|
0:48:23 | which means the features that we have fail to differentiate them
---|
0:48:27 | so I moved on to another piece of research; in this very recent work
---|
0:48:31 | we take one hundred frames as the input to a
---|
0:48:34 | convolutional neural network, so you have filters that do pooling
---|
0:48:39 | and all this allows you to get a wider range of samples
---|
0:48:44 | the filters can actually cover
---|
0:48:47 | a wide context, to make sure that when there are
---|
0:48:50 | some transitions of
---|
0:48:54 | acoustic units at
---|
0:48:56 | their junctions
---|
0:48:59 | they are captured;
---|
0:49:00 | and we can see that S10 and natural speech now show a
---|
0:49:05 | good separation
---|
0:49:08 | and
---|
0:49:10 | as a follow-up study, I read quite a number of papers; one of them is
---|
0:49:15 | about the
---|
0:49:20 | feature that did
---|
0:49:21 | the best in the evaluation, which is based on a so-called auditory transform; basically the idea
---|
0:49:26 | is that
---|
0:49:29 | you have
---|
0:49:34 | a bank of
---|
0:49:38 | filters with different bandwidths and different center frequencies, so you are
---|
0:49:42 | applying filters at different center frequencies
---|
0:49:48 | with different bandwidths to get the coefficients
---|
0:49:52 | this is actually not new; the mel-scale cepstral coefficients already
---|
0:49:58 | do this, but the difference is that
---|
0:50:03 | this set of filters is similar to a kind of wavelet
---|
0:50:07 | in that
---|
0:50:07 | for low frequencies you have longer windows
---|
0:50:12 | but for high frequencies the filters
---|
0:50:15 | have shorter
---|
0:50:19 | response functions; in this way you get
---|
0:50:21 | different resolutions for
---|
0:50:23 | the different frequency bands
---|
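The geometry behind this constant-Q idea can be made explicit in a few lines: centre frequencies are geometrically spaced, and holding the quality factor Q = f / bandwidth constant makes the analysis window length inversely proportional to frequency. The sampling rate, starting frequency, and bin count below are illustrative values, not parameters from the cited work.

```python
import numpy as np

# Constant-Q analysis geometry: geometric centre frequencies and window
# lengths that shrink as frequency grows, so low bands get long windows
# (fine frequency resolution) and high bands get short ones (fine time
# resolution).
fs = 16000
bins_per_octave = 12
fmin, n_bins = 32.70, 60                        # 5 octaves of bins
Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)    # constant quality factor

freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
win_lengths = np.round(Q * fs / freqs).astype(int)   # samples per bin
```

This is exactly the wavelet-like trade-off described above: doubling the frequency halves the window, unlike the fixed-window STFT.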
0:50:27 | this is a paper, this is a
---|
0:50:30 | slide that was given to me by the authors; they
---|
0:50:36 | got a very impressive result that they are going to present at this workshop
---|
0:50:40 | so I don't want
---|
0:50:40 | to jump the gun
---|
0:50:42 | but I will try to share the idea with you
---|
0:50:44 | the effect you can see here is
---|
0:50:48 | a spectrogram that shows that using the
---|
0:50:52 | constant Q cepstral coefficients, which follow a similar concept to the auditory transform,
---|
0:50:57 | at the
---|
0:50:58 | low frequencies you get better frequency resolution but poorer time resolution
---|
0:51:04 | the poorer time resolution allows
---|
0:51:07 | us to
---|
0:51:07 | have a bigger window in terms of time; it has a bigger range
---|
0:51:11 | to cover
---|
0:51:13 | you know
---|
0:51:15 | the discontinuity of the features
---|
0:51:18 | at higher frequencies you get better time resolution
---|
0:51:22 | at the cost of frequency resolution
---|
0:51:26 | the details I will leave to their presentation
---|
0:51:30 | so
---|
0:51:31 | with these techniques they got a very impressive result: in the evaluation the best result
---|
0:51:36 | was an equal error rate of eight point
---|
0:51:38 | five percent
---|
0:51:40 | and
---|
0:51:40 | then with this
---|
0:51:43 | they achieved something like a one percent equal error rate; this is really impressive
---|
0:51:48 | okay, to summarize
---|
0:51:51 | so
---|
0:51:54 | spoofing detection has many challenges and opportunities
---|
0:51:59 | most systems assume
---|
0:52:01 | that the input speech is actually natural speech; I don't know whether this is an opportunity or
---|
0:52:05 | a challenge; it depends on whether you are the attacker or
---|
0:52:08 | the developer of a
---|
0:52:11 | more robust speaker verification system
---|
0:52:14 | many systems are very vulnerable to attacks, and we need to take special
---|
0:52:21 | care to address this issue
---|
0:52:26 | and
---|
0:52:27 | also, good perceptual quality does not equal
---|
0:52:30 | fewer artifacts; actually in speech synthesis
---|
0:52:34 | my impression is that we try to adjust
---|
0:52:38 | the output signal just to please the human ears, but actually in the spectral
---|
0:52:42 | domain the generated signal has a lot of artifacts
---|
0:52:47 | that have yet to be discovered
---|
0:52:49 | machines and humans listen differently
---|
0:52:52 | machines
---|
0:52:53 | mostly listen to frame-by-frame features; I remember the days when I was in the
---|
0:52:58 | company and we wanted to give a demonstration of a
---|
0:53:03 | TTS system talking to a dialog speech recognition system
---|
0:53:07 | every time I gave the demo
---|
0:53:09 | people were very happy
---|
0:53:12 | and
---|
0:53:12 | people thought that this was a
---|
0:53:15 | magical demonstration; but to me it was a safe demonstration, because with LPC features talking to
---|
0:53:20 | LPC features, every time, the recognizer got it correct; if I talked
---|
0:53:25 | to the system myself, the accuracy was maybe ninety-five percent
---|
0:53:30 | so machines and humans listen to different things, and we need to discover
---|
0:53:36 | more about this
---|
0:53:38 | and the studies from the last two years of
---|
0:53:43 | publications also show that features are more important than the classifier
---|
0:53:47 | or maybe we have not reached the level of having good features; so there is a lot
---|
0:53:52 | more study to be done on the features before fine-tuning the classifier
---|
0:53:55 | that's all
---|
0:53:57 | thank you
---|
0:54:04 | thank you for this presentation; so we have time for a couple of
---|
0:54:08 | questions
---|
0:54:09 | whenever you want, your questions can start
---|
0:54:18 | anyone
---|
0:54:27 | we get audio from terrorists
---|
0:54:29 | who obviously have the voice pitch-shifted; they use a pitch-stretching or pitch-bending algorithm, so
---|
0:54:36 | it sounds like they speak with a very low voice, either to disguise the voice or
---|
0:54:41 | to sound more threatening
---|
0:54:44 | the question is: is there a way
---|
0:54:47 | of inferring the degree of change that has been made to the pitch
---|
0:54:51 | it can be either just the fundamental frequency, or formants and fundamental, or just formants
---|
0:54:58 | or would we be at a loss of any way of knowing whether, and
---|
0:55:04 | to what extent, it is possible to infer
---|
0:55:07 | the degree of change that has been made, in order to
---|
0:55:10 | change it back
---|
0:55:14 | well
---|
0:55:16 | we don't have
---|
0:55:18 | I think, for forensics, a kind of visual tool that allows you to
---|
0:55:22 | do the analysis, but I believe that
---|
0:55:26 | the features that we just talked about, the instantaneous frequency, the
---|
0:55:32 | group delay, the modified group delay,
---|
0:55:35 | the cepstral coefficients, the constant Q cepstral coefficients, those are wonderful tools for you to
---|
0:55:44 | do a
---|
0:55:45 | comparison; let me just show you something, one second
---|
0:55:50 | so actually
---|
0:55:57 | in the lab, when we
---|
0:56:01 | analyzed the features
---|
0:56:02 | we did
---|
0:56:04 | observe some patterns; for example, this is the so-called
---|
0:56:09 | relative phase shift; this is natural speech, this is synthetic speech, and you
---|
0:56:14 | cannot hear
---|
0:56:17 | the difference, because they are all very natural; you cannot really
---|
0:56:22 | hear any differences, but
---|
0:56:24 | the phase-gram actually tells you something; so I believe that
---|
0:56:28 | maybe it can be used as a tool
---|
0:56:30 | I don't think anybody has really built it into a system for practical use yet
---|
0:56:38 | you is just |
---|
0:56:51 | so, very nice talk; I was very happy to see the breadth of work in the field
---|
0:56:56 | and some of our work has actually been covered by some of you folks here; I wanted
---|
0:57:01 | to make one comment: I think one of the fundamental challenges when you look at
---|
0:57:04 | voice conversion most of that research is really focused on humans being able to |
---|
0:57:11 | assess the quality, usually for human consumption, not necessarily for speaker recognition systems
---|
0:57:18 | so if you look at voice conversion technologies most end up focusing on making sure |
---|
0:57:24 | that the prosody is correct because that's something it's pretty easy to kind of assess |
---|
0:57:28 | as well as things like fundamental frequency and so forth; so I think in the mid
---|
0:57:33 | nineties we did some work where we took
---|
0:57:38 | the output of natural speech and segment-based speech synthesis and fed it into auditory hair
---|
0:57:45 | cell models and looked at the hair cell firing characteristics on the output; what we
---|
0:57:51 | saw was that in regular normal speech
---|
0:57:54 | there's an actual production evolution that takes place in the articulators |
---|
0:57:58 | the corresponding hair cell firing characteristics also have a natural variation
---|
0:58:03 | but in the synthetic side in segment based synthesis |
---|
0:58:07 | when you compute the hair cell firing characteristics, they don't necessarily behave the same way
---|
0:58:12 | so we found that was actually very interesting way to kind of bring kind of |
---|
0:58:17 | the signal processing side of the hearing into the speaker assessment side |
---|
0:58:23 | you could have really high quality speech synthesis
---|
0:58:26 | but the hair cell firing characteristics would be able to pick up the differences there
---|
0:58:32 | certainly; I think the example just now, S10, the unit selection system, is a good
---|
0:58:37 | example: you cannot hear anything wrong, but actually it is the strongest spoofing voice
---|
0:58:48 | yes, one last question and then after that we can break
---|
0:58:55 | so part of what you went through, you're talking about the different aspects of trying
---|
0:59:00 | to detect whether the voice was modified
---|
0:59:02 | right, and you're looking at things in there like
---|
0:59:06 | the pitch, the phase, and so forth; but where
---|
0:59:11 | speaker
---|
0:59:13 | verification is really being used, because of the variety of handsets being deployed
---|
0:59:17 | around, most speech coming into the systems is already going to go through some form of vocoder
---|
0:59:23 | so is that, by its nature, going to start to,
---|
0:59:27 | you know, you're gonna start detecting a lot of these artifacts that are
---|
0:59:30 | really gonna be natural artifacts of the communication system itself; and
---|
0:59:34 | the thing I've looked at is, it looks like most of this is based
---|
0:59:38 | on what inputs are happening right at the, from the lips into the system
---|
0:59:43 | so I think that's a critical question; I think the challenge now is to
---|
0:59:51 | model the different types of artifacts, the artifacts the system is susceptible to; for example,
---|
0:59:55 | as you said,
---|
0:59:57 | if the speech is going through a communication channel, there is audio coding
---|
1:00:01 | already
---|
1:00:03 | but most codecs do not really manipulate the
---|
1:00:07 | parameters; they just try to recover the signal as much as possible
---|
1:00:12 | so |
---|
1:00:14 | and |
---|
1:00:18 | at the moment the research focuses very much on
---|
1:00:22 | the features that are able to serve as forensic evidence
---|
1:00:25 | exactly; if we have good features, then we can model them
---|
1:00:30 | effectively
---|
1:00:31 | and by the way, this is mostly also the telephone channel; the different channels you mentioned,
---|
1:00:37 | the telephone channel, it is also,
---|
1:00:40 | you know,
---|
1:00:41 | analog channels, digital channels, all kinds of things
---|
1:00:46 | so
---|
1:00:50 | that could be an issue; people have also asked me about this: when we do
---|
1:00:53 | this kind of analysis it is all digital
---|
1:00:58 | but actually the process converts to analog and back again;
---|
1:01:01 | what the effects of that are,
---|
1:01:03 | we have not really studied
---|
1:01:09 | okay, thank you; I think we have to stick to the schedule, so thanks
---|
1:01:13 | again to our speaker
---|