Speech Transcript - THE NON-BAYESIAN RESTLESS MULTI-ARMED BANDIT: A CASE OF NEAR-LOGARITHMIC REGRET

0:00:15	that
0:00:17	it there is a joint work with uh
0:00:19	and i is due that of a seems to the found my
0:00:22	the
0:00:23	and you guy and uh a a separation much very from us C
0:00:27	um so first
0:00:29	the problem we are considering a a is that dynamic spectrum access and on the model
0:00:34	and the is illustrated that system where looking at to um
0:00:38	there are and need given then channels uh
0:00:41	i i don't get it it to a primary system that's using a slot transmission structure
0:00:46	and uh of second and their use of the objective here is to find the the i'd slot and the
0:00:52	limit its sensing
0:00:53	oh we a at any given so uh a set use the can only choose K channels to set
0:00:59	and if
0:01:00	the channel is i do it what transmit and uh
0:01:04	if you transmit if you patch what i'd slot you will receive one unit to work then essentially is you
0:01:09	get one packet delete
0:01:11	and the problem here is how to way i choose
0:01:14	which K channels was we should sense and the beginning of a
0:01:18	based on
0:01:20	observation as a visual have so far
0:01:21	and to answer this question the first thing we need to specify
0:01:25	is the model of channel all two
0:01:27	so if we look at
0:01:29	each channel so
0:01:30	each channel is essentially piece the i do is a stochastic process
0:01:34	and each channel is model as a stock as a process
0:01:37	with on parameter
0:01:39	so we do know what kind of stochastic process with be ways
0:01:42	but the prime in
0:01:44	so the simple liz
0:01:45	so the process we can be
0:01:47	i be here will be a i pro
0:01:50	so the channel or as the ah for channel i will the i people in the only with a known
0:01:55	means it
0:01:56	so see the i'd say is the probability channellise i'd in any given as
0:02:01	and
0:02:02	changing
0:02:02	i
0:02:04	and not
0:02:05	we want to choose
0:02:07	without knowing which channel was more likely to be bus so i don't
0:02:11	imagining for pink is with on the by S and you want to get as many had
0:02:16	oh
0:02:17	so now how do we choose
0:02:18	channel
0:02:19	um K out of
0:02:20	you know there is lot
0:02:22	and objective is to maximise house swoop
0:02:25	so what is the maximum throughput would we kind
0:02:28	even assume
0:02:29	we not the perfect
0:02:31	we know all the C
0:02:33	in that scenario we no we should always sense the channel with the largest to seat
0:02:38	with
0:02:39	the max probability
0:02:40	i
0:02:41	and you that case
0:02:42	i was we'll put we look at it is just see
0:02:45	as every i does well we get
0:02:46	when you in
0:02:48	the now without knowing all the see guys is it possible to achieve the same maximum throughput be fine by
0:02:54	the non model seen that
0:02:55	and if yes how
0:02:58	a turns of this is actually a well studied problem you've we take a D model
0:03:03	this is you was special case of more the amp and
0:03:06	vol
0:03:07	and the region problem we can explain using this slot machine with a dependent are
0:03:14	and that you just time count is that like one um to play
0:03:17	but we don't know the we word statistics
0:03:20	and the we word processor soon i
0:03:23	and maximise long term
0:03:25	i
0:03:26	as essence of this problem is in the tradeoff between exploitation and exploration
0:03:32	so you to it to like
0:03:33	we want to play a a a um that has a large is the sample me based on what we
0:03:37	have
0:03:38	is there are so
0:03:39	but that same time
0:03:40	we have to play each are sufficiently many both
0:03:44	a sufficient number of times it so that our some whole me is close enough to the true
0:03:49	oh that's essentially the
0:03:51	i which a problem
0:03:53	and uh you can be read was we formulate it in the following way
0:03:56	see the one to see that and is the we what mean of each are which is a little or
0:04:01	no
0:04:02	and the maximum we what we can i get over a reading of lance T
0:04:07	you
0:04:08	it has it's no is given by the maximal see that i use subscript the one here to in the
0:04:13	extreme the here
0:04:14	to see the white the largest one multiplied by right
0:04:17	which is achieved by always playing the bass ah
0:04:20	very into it
0:04:22	or not we don't know all the see so we don't know which way is the best um
0:04:26	we can design design a policy high
0:04:29	that tell us which um to play based on what we have
0:04:32	there are so far
0:04:33	and uh this is the re what we can get
0:04:36	for a given policy
0:04:38	and how do we know how to these policy is
0:04:41	it's measured by that you friends
0:04:43	to these ideals in their real way where you do not
0:04:46	so this is called the required
0:04:48	also called cost of nor learning that's the cost we pay
0:04:51	for not snowy them uh
0:04:53	and
0:04:54	very intuitive to this
0:04:56	every time we spend time
0:04:58	a a a a a a a that's not the
0:05:01	best
0:05:02	we waiting incur a loss of C that one minus seat
0:05:05	we are spending time
0:05:07	i
0:05:08	a multiplied by the expected time
0:05:10	we
0:05:11	yeah play a the ice best
0:05:13	so to minimize wet
0:05:14	we want to minimize a model
0:05:16	spend a each
0:05:18	and uh object here is to minimize the growth rate of the web
0:05:23	she
0:05:23	it's intuitive
0:05:25	whether always grew of with time because we and never stop playing
0:05:28	each a
0:05:29	to make sure our sample means zach
0:05:32	and we can also see
0:05:34	if we have a sampling year roles of whether with time T then we can achieve the maximum average reward
0:05:40	of the best uh
0:05:41	a find five
0:05:42	the normal models area because if you divide by T C these is not be know you with
0:05:49	and there is a problem is nice solved by at all
0:05:53	i five
0:05:54	they show that
0:05:55	the minimum required will street is lot
0:05:58	with
0:05:58	T
0:05:59	they also found the best leading constant uh and given in terms of kl divergence
0:06:04	and they the also construct a the policy to achieve peace past
0:06:08	well
0:06:09	but was rather complicated policy
0:06:11	and uh about ten to a fifty years later
0:06:14	there's simple policies to achieve lot
0:06:17	at
0:06:17	this is
0:06:18	essentially a index
0:06:20	type of policy based on simple me
0:06:22	one of the most uh uh uh well no one is
0:06:25	proposed by our uh
0:06:27	is the so called U C B one policy
0:06:29	so in this case what compute the in that
0:06:31	for each a are at each time T
0:06:34	the in is them play the sum of me of these are
0:06:38	class
0:06:39	this
0:06:39	turn is called of the confidence of on
0:06:42	used term is given by a square with the two of log T divide by G i
0:06:47	yeah i is the number of times we have played arm
0:06:50	i
0:06:51	not a car
0:06:53	so you can see
0:06:54	if we are playing a a a a significant a last log T used remote allow
0:07:00	it will make the index the large as
0:07:02	that we will have to
0:07:03	so the is always played around with a large
0:07:06	oh use actually were thing
0:07:08	the policy to at you spend a lot of keep time
0:07:11	each uh
0:07:12	which that's what line all be has shown that
0:07:15	what we have
0:07:16	so you've what it you know i be model though wow a channel occupancy
0:07:20	the problem is completely so
0:07:22	it's just a special case of the classic the um and
0:07:27	so you we consider a more general more um
0:07:30	complex model that's the each channel is a markov chain
0:07:33	to state B the i don't
0:07:35	it change according to this
0:07:37	a a transition that
0:07:38	here
0:07:39	and we don't know the transition problem it's we don't know P zero one Q one
0:07:43	at all the channels are still has had an equal they have the same transition
0:07:48	not
0:07:48	how do we design channels action policy
0:07:51	to maximise
0:07:52	swoop
0:07:53	turns out
0:07:54	this is a variation of the uh a more yeah and it problem but has never been self so we
0:08:00	are
0:08:00	essentially a on the first to to address
0:08:03	these
0:08:04	more down band
0:08:05	and um my
0:08:07	and the what to challenges here
0:08:10	so first here we can see if we have a markovian model for each channel
0:08:15	then the optimal policy even if we know the transition probability
0:08:19	is no longer the state home one channel
0:08:21	we need to
0:08:23	so each across channels based on our prediction or which channel is more likely to
0:08:27	i do like these
0:08:29	a fixed
0:08:29	because of the channel memory makes prediction
0:08:33	so that in this case we can not simply
0:08:36	minimize our required by funding at that time was spent on a bad channel
0:08:41	because
0:08:41	there's no really bad channel
0:08:43	say we need to switch the was smart we are all switching on much
0:08:48	so you actually large mine to or what is the best way
0:08:51	to switching on among channels based on how observation
0:08:55	and in general
0:08:56	there are infinite possibilities of
0:08:58	how to to switching
0:08:59	so the wheel are
0:09:00	from these infinite possibility which one is the bad
0:09:03	so that's really that difficult part
0:09:05	and perhaps the reason that this problem has not addressed
0:09:09	a lead to each
0:09:10	so actually turns out
0:09:12	if we do one of the transition probability
0:09:15	it's is this problem itself is complex enough not this is is actually the S is multi um band problem
0:09:21	and they're not that that makes
0:09:23	uh formulate that we do ninety eight eight
0:09:26	and this problem in general is peace space ha
0:09:29	that's shown but have them each one that's
0:09:32	the cool think here is we have a special recipe spend it
0:09:35	the think we are you ways here is the only a two state markov G channel each channel
0:09:40	and we have shown and they no model
0:09:42	the optimal policy is a simple now
0:09:46	and do this is uh a a a a real work uh um have down with a a couple of
0:09:50	clever with er
0:09:51	so the optimality is
0:09:53	current to and there's certain conditions
0:09:55	so if we have a key one one we to that P zero one which be is is that daisy
0:09:59	chain
0:10:00	it's more likely to stay rather than change state
0:10:03	then of to melody is generally um shoe
0:10:06	but that you are the think like you P was model that P zero once a sure what three channels
0:10:12	and the all K equal to and mine there's one
0:10:14	but we come it's can gently these holds for all and for
0:10:18	you you a rising case we have be able to prove
0:10:21	so but that's
0:10:22	assume we can this time your to project or is true then we know and the not model myopic policy
0:10:28	is the optimal
0:10:30	and the more significant to these on the the problem
0:10:33	is that the mouth you policy has a simple so a universal struck
0:10:39	oh which means a we need to know is what if P one mine's squid to that P zero one
0:10:43	we always switch in this way
0:10:46	as we stay at a i don't channel if this channel with current the sense is i don't who was
0:10:50	stay with
0:10:51	you it becomes a it then we switched to the channel visited the log is
0:10:56	that's are are way of speech
0:10:58	and if
0:10:59	you want one small the
0:11:00	if zero one then way switching a you from way
0:11:03	we actually stay if the channel was a the and the was which if becomes i with where
0:11:08	why you re what then we go to somewhere else
0:11:11	and how the with switch with switched to the channel most recently visited a are all channels with a T
0:11:17	that you most of all
0:11:18	you've there are not such an no
0:11:20	the we switch to the channel busy long as
0:11:23	anyway
0:11:24	there essentially a of this says is all optimal we of change
0:11:29	among channels
0:11:30	i'm only takes two possible
0:11:32	depending on the transition
0:11:34	so that's really
0:11:36	makes the are no problem
0:11:38	significant a simpler to some instead of searching my infinite possible ways of how to switching on channels
0:11:44	we know there are only two possible
0:11:47	if we don't to P zero one
0:11:48	what is either there this way or that
0:11:51	the
0:11:52	we can formulate a sort of conceptual more the and it
0:11:55	well
0:11:56	so we treat each way of channel switching as a a
0:12:00	you really can see that what is a a a remote on a problem
0:12:04	is nothing but a choice
0:12:05	a possible option
0:12:07	have and we don't know which of change the bass
0:12:09	as exactly what do we have
0:12:11	so then not we'll a which um is that what are we only have forms we of switching
0:12:16	so we want to do which one is the good one
0:12:18	because we don't know the transition probably so we don't know which one to choose
0:12:22	so it same we just read of the problem to the class more they are banded problem
0:12:27	not that i C there are two channel is we have to address
0:12:31	the first one the price are we have to answer is how long to play each i
0:12:35	so a region that we just play one as we got some word for the classic
0:12:39	model
0:12:40	but not a a a a a one way of switching on much
0:12:43	so we have to follow that we of switching among channels long enough you all there to see if i
0:12:50	so how long we play
0:12:52	and terms not to the optimal lattice of formal each way use spell star
0:12:57	depends on the transition probability
0:12:59	unfortunately fortunately and that's something we don't not
0:13:02	so that's one they we have to dress
0:13:04	and the out way is the reward in this case and the longer i be kinda
0:13:08	not even independent
0:13:10	oh i be of rows are
0:13:12	because
0:13:12	each um is are only able way of switching on among channels
0:13:15	and that they are also reaching a on the same as that of and J
0:13:19	they are no longer
0:13:20	and that right
0:13:22	so to address these two challenges that's our uh solution here
0:13:26	so first how long to play each uh
0:13:28	the optimal lattice of playing
0:13:30	each arm of only each we of a channel switching
0:13:34	depends on the parameters are no priming mean
0:13:37	so to get a a a round of was back to the trick we have here is
0:13:41	we play a each and with increasing that
0:13:43	so we do not fixed
0:13:45	a a specific lance for playing each a
0:13:47	but rather choose a sequence that rules with
0:13:51	and
0:13:51	we will pace
0:13:52	certain price in terms of required
0:13:55	but we can choose the increasing read of view sequence which really small a lot a lot of T
0:14:01	and
0:14:02	do the price a pay we pay will be how which we really
0:14:05	a
0:14:06	yeah a thing to deal with not i T samples
0:14:10	um what we did it is we um proved a modified each one of thing
0:14:15	which essentially is the chief for proving a lot required to in the classic i D and eight uh
0:14:21	model
0:14:22	so the uh channel thing an eh
0:14:25	uh are we should not foundation is all the conditional um
0:14:29	expectations should have should have the same me so that a conditional exactly expectation should be the same me
0:14:35	and then we come by to the probability that
0:14:38	the sample mean is to far from the truth
0:14:41	and in this case we how we don't have identical um every we sample anymore
0:14:46	so we need to to modify that because the conditional expectation no longer the same
0:14:50	but we can get a the result if the conditional expectations are with thing
0:14:54	a certain range of each other
0:14:56	so
0:14:57	then we can use similar techniques to prove there is
0:15:01	and the what we have shown here is
0:15:03	we can achieve a which very close to a log all there for the required
0:15:08	essentially
0:15:09	in in front of a lot we don't have a constant anymore is
0:15:12	it's but a you know increasing sequence
0:15:14	but this sequence can be a it is small
0:15:17	as slow as you want
0:15:18	so you choose
0:15:19	a small you want to you hug close you want to get a lot T and based on that
0:15:24	we can choose
0:15:25	the increasing the for each
0:15:27	L N
0:15:28	so i was give the details
0:15:30	on the relation between G T and uh L T actually very simple
0:15:34	is just showing at each time
0:15:36	what is the current L where you
0:15:39	and uh that's a it or what is conclude um
0:15:43	this paper here
0:15:44	so what
0:15:45	we have considered a a a a of the problem was a multi bit it by um dynamic spectrum access
0:15:52	um we believe this general approach would be useful for for solving a either
0:15:57	uh rest we spend it probably even a mark of general mark of C your process is on the primary
0:16:03	a what is the message here
0:16:05	the method here is if we are dealing with a small T are bandit problem
0:16:09	with a known
0:16:10	uh then them
0:16:12	and we want to achieve
0:16:14	the same i reach word as we can get and the not
0:16:18	so in general problem is intractable
0:16:21	but the problem may have a whole
0:16:24	if
0:16:24	and the the not model
0:16:26	the optimal policy has what we call a find that option
0:16:30	by means
0:16:32	there are only fun and two ways of
0:16:34	do we find the and optimal policy only has four that possible for
0:16:39	depending on the primary
0:16:42	don't we treat each possible for
0:16:44	as as a a a a we don't to know the prior
0:16:47	parameters then we tried to or which one is actually a one for that and a line on them
0:16:53	so that's the general message
0:16:54	but of course for each specific problem we have to but yeah sure
0:16:59	whether log
0:17:00	oh i'm sure close to log where can be but she
0:17:04	and in the context of dynamic spectrum access that's great we have been focusing are
0:17:08	the problem is really possible because house
0:17:11	of this i a universal structure of the our P policy for that
0:17:15	no model pace
0:17:16	and uh i a little bit advertisement
0:17:19	um i be giving a another problem related a problem being about forty minutes
0:17:25	uh
0:17:25	because this one
0:17:27	a uh we can only somehow for a very special cases where you have only to state markov chain and
0:17:33	all the channels are are are arms a stochastic identical
0:17:37	the what if we're dealing with a general fine state markov chain with the no in the transition problem
0:17:44	what can we do there
0:17:45	so this is paper actually is a dressing
0:17:48	this complete be general model
0:17:50	um but we can also achieve not require
0:17:53	but
0:17:54	the definition of the required it is uh
0:17:56	we
0:17:57	oh
0:17:58	oh
0:18:02	path
0:18:03	i don't one person something to sort of question from the speech
0:18:08	true
0:18:15	oh i yes
0:18:16	so that's essentially here
0:18:20	you that cool
0:18:21	yes yeah
0:18:22	so
0:18:23	that the required we get it is these G Q locked chi right so we want chi G to grow
0:18:28	very slow is time T
0:18:30	and uh so if you want to achieve a lot a lot should be
0:18:34	uh for chi chi
0:18:36	then
0:18:36	eventually do sequence
0:18:38	uh
0:18:39	this is just
0:18:41	the number of
0:18:43	but what is the value of L for the current
0:18:46	so
0:18:47	L one i is
0:18:48	well we start we play one i L L one
0:18:51	slot
0:18:52	so then for the first L once lots the these sequence was what have the same value a one
0:18:58	for the that
0:18:59	what a a a a a model be out to we play out to time
0:19:02	so for the next L tools as well as the but will be out
0:19:06	so
0:19:06	L is increasing
0:19:08	so these sequence will be also be in crazy but even slower or
0:19:12	because it's repeating the same value
0:19:15	but it
0:19:16	yeah
0:19:19	okay
0:19:20	or running times to time well just a record some
0:19:31	um
0:19:33	i field to see if a connection was not little huh but this is really trying to learn
0:19:38	um
0:19:38	trying to maximise this throughput put and there known parameters
0:19:42	but a a is more like the inclusion was your peer users here we even only have one
0:19:48	uh and secondary user
0:19:49	i i i probably miss
0:19:51	um
0:19:52	but an action that i i don't really see whether two problems i really to
0:19:56	um
0:19:57	maybe we can talk more data
0:19:59	you to to rooms
0:20:01	so so let's
0:20:02	that's
0:20:02	the speaker

THE NON-BAYESIAN RESTLESS MULTI-ARMED BANDIT: A CASE OF NEAR-LOGARITHMIC REGRET

Spectrum Sensing for Cognitive Radio

Presented by: Qing Zhao, Author(s): Wenhan Dai, Tsinghua University, China; Yi Gai, Bhaskar Krishnamachari, University of Southern California, United States; Qing Zhao, University of California Davis, United States