Nicola Botta / IdrisLibs2 / Commits

Commit 616138e3, authored Nov 23, 2020 by Nicola Botta
Added 'cval', 'cvalmax' and 'cvalargmax' and 'optExtLemma'.
Parent: dd7e4fb0
Changes: 1 file

SequentialDecisionProblems/NaiveTheory.lidr
...
@@ -13,7 +13,7 @@
> next : (t : Nat) -> (x : X t) -> Y t x -> M (X (S t))
-* Monadic sequential decision problem:
+* Decision problem:
> Val : Type
> zero : Val
...
@@ -32,33 +32,31 @@
For a fixed number of decision steps |n = S m|, the problem consists of
finding a sequence
-< [p0, p1, ..., pm]
+  [p0, p1, ..., pm]
of decision rules
-< p0 : (x : X 0) -> Y 0 x
-< p1 : (x : X 1) -> Y 1 x
-< ...
-< pm : (x : X m) -> Y m x
+  p0 : (x0 : X 0) -> Y 0 x0
+  p1 : (x1 : X 1) -> Y 1 x1
+  ...
+  pm : (xm : X m) -> Y m xm
-that, for any |x : X 0|, maximizes the |meas|-measure of the |<+>|-sum
+that, for any |x0 : X 0|, maximizes the |meas|-measure of the |<+>|-sum
of the |reward|-rewards obtained along the trajectories starting in |x|.
* The theory:
** Policies and policy sequences:
> Policy : (t : Nat) -> Type
> Policy t = (x : X t) -> Y t x
> data PolicySeq : (t : Nat) -> (n : Nat) -> Type where
> Nil : {t : Nat} -> PolicySeq t Z
> (::) : {t, n : Nat} -> Policy t -> PolicySeq (S t) n -> PolicySeq t (S n)
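For instance (a display-only sketch, not part of the file: it assumes policies
|p0 : Policy 0|, |p1 : Policy 1| and |p2 : Policy 2|), a policy sequence for
three decision steps starting at decision step zero is built by consing:

<  p0 :: p1 :: p2 :: Nil : PolicySeq 0 3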
** Value function:
> (<++>) : {A : Type} -> (f, g : A -> Val) -> A -> Val
> f <++> g = \ a => f a <+> g a
...
@@ -71,17 +69,17 @@ of the |reward|-rewards obtained along the trajectories starting in |x|.
> mx' : M (X (S t))
> mx' = next t x y
** Optimality of policy sequences
> OptPolicySeq : {t, n : Nat} -> Functor M => PolicySeq t n -> Type
> OptPolicySeq {t} {n} ps = (ps' : PolicySeq t n) -> (x : X t) -> val ps' x <= val ps x
** Bellman's principle of optimality (1957):
> OptExt : {t, n : Nat} -> Functor M => PolicySeq (S t) n -> Policy t -> Type
> OptExt {t} ps p = (p' : Policy t) -> (x : X t) -> val (p' :: ps) x <= val (p :: ps) x
Bellman's principle of optimality (1957):
> Bellman : {t, n : Nat} -> Functor M =>
> (ps : PolicySeq (S t) n) -> OptPolicySeq ps ->
> (p : Policy t) -> OptExt ps p ->
...
@@ -103,22 +101,23 @@ Bellman's principle of optimality (1957):
> s2 : (val (p' :: ps) x <= val (p :: ps) x)
> s2 = oep p' x
** Backwards induction
-The empty policy sequence is optimal:
+First, the empty policy sequence is optimal:
> nilOptPolicySeq : Functor M => OptPolicySeq Nil
> nilOptPolicySeq Nil x = NaiveTheory.lteRefl
-Now, provided that we can implement
+Provided that we can implement optimal extensions of arbitrary policy
+sequences
> optExt : {t, n : Nat} -> PolicySeq (S t) n -> Policy t
> optExtSpec : {t, n : Nat} -> Functor M =>
> (ps : PolicySeq (S t) n) -> OptExt ps (optExt ps)
-then
+then, we can compute optimal policy sequences backwards, starting from the
+empty policy sequence:
> bi : (t : Nat) -> (n : Nat) -> PolicySeq t n
> bi t Z = Nil
...
@@ -126,7 +125,7 @@ then
> where ps : PolicySeq (S t) n
> ps = bi (S t) n
-is correct "by construction"
+This generic backwards induction is correct "by construction":
> biLemma : Functor M => (t : Nat) -> (n : Nat) -> OptPolicySeq (bi t n)
> biLemma t Z = nilOptPolicySeq
...
@@ -134,32 +133,89 @@ is correct "by construction"
> where ps : PolicySeq (S t) n
> ps = bi (S t) n
> ops : OptPolicySeq ps
-> ops = ?kika -- biLemma (S t) n
+> ops = biLemma (S t) n
> p : Policy t
> p = optExt ps
> oep : OptExt ps p
-> oep = ?lala -- optExtSpec ps
+> oep = optExtSpec ps
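As a usage sketch (hypothetical, not part of the file: it assumes a concrete
problem in which the context above, including a |Functor M| instance, is in
scope), an optimal policy sequence for three decision steps starting at
|t = 0| and a proof of its optimality are obtained by

<  ps3  : PolicySeq 0 3
<  ps3  = bi 0 3
<  ops3 : OptPolicySeq ps3
<  ops3 = biLemma 0 3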
** Optimal extensions
The verified, generic implementation of backwards induction |bi|
naturally raises the question of under which conditions one can
implement
< optExt : {t, n : Nat} -> PolicySeq (S t) n -> Policy t
such that
< optExtSpec : {t, n : Nat} -> Functor M =>
< (ps : PolicySeq (S t) n) -> OptExt ps (optExt ps)
To this end, consider the function
> cval : {t, n : Nat} -> Functor M => PolicySeq (S t) n ->
> (x : X t) -> Y t x -> Val
> cval {t} ps x y = meas (map {f = M} (reward t x y <++> val ps) mx')
> where mx' : M (X (S t))
> mx' = next t x y
By definition of |val| and |cval|, one has
  val (p :: ps) x
    =
  meas (map {f = M} (reward t x (p x) <++> val ps) (next t x (p x)))
    =
  cval ps x (p x)
This suggests that, if we can maximize |cval|, that is, implement
> cvalmax : {t, n : Nat} -> PolicySeq (S t) n -> (x : X t) -> Val
> cvalargmax : {t, n : Nat} -> PolicySeq (S t) n -> (x : X t) -> Y t x
that fulfill
> cvalmaxSpec : {t, n : Nat} -> Functor M =>
> (ps : PolicySeq (S t) n) -> (x : X t) ->
> (y : Y t x) -> cval ps x y <= cvalmax ps x
> cvalargmaxSpec : {t, n : Nat} -> Functor M =>
> (ps : PolicySeq (S t) n) -> (x : X t) ->
> cvalmax ps x = cval ps x (cvalargmax ps x)
then we can implement optimal extensions of arbitrary policy
sequences. The following lemma shows that this intuition is
correct. With
> optExt = cvalargmax
, one has
> optExtLemma : {t, n : Nat} -> Functor M =>
> (ps : PolicySeq (S t) n) -> OptExt ps (optExt ps)
> optExtLemma {t} {n} ps p' x = s4 where
> p : Policy t
> p = optExt ps
> y : Y t x
> y = p x
> y' : Y t x
> y' = p' x
> s1 : cval ps x y' <= cvalmax ps x
> s1 = cvalmaxSpec ps x y'
> s2 : cval ps x y' <= cval ps x (cvalargmax ps x)
> s2 = replace {P = \ z => (cval ps x y' <= z)} (cvalargmaxSpec ps x) s1
> s3 : cval ps x y' <= cval ps x y
> s3 = s2
> s4 : val (p' :: ps) x <= val (p :: ps) x
> s4 = s3
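Under which conditions can |cvalmax| and |cvalargmax| themselves be
implemented? A natural sufficient condition (a hypothetical sketch, not part
of this commit: |controls| and |argmaxList1| are assumed helpers, the control
spaces |Y t x| are assumed finite and non-empty, |<=| is assumed to be a
decidable total preorder on |Val|, and a |Functor M| instance is assumed to
be in scope) is that one can enumerate the controls and pick a |cval|-maximal
one:

<  controls    : (t : Nat) -> (x : X t) -> List1 (Y t x)
<  argmaxList1 : {A : Type} -> (A -> Val) -> List1 A -> A
<  cvalargmax {t} ps x = argmaxList1 (cval ps x) (controls t x)
<  cvalmax        ps x = cval ps x (cvalargmax ps x)

With these definitions |cvalargmaxSpec| holds by |Refl|, and |cvalmaxSpec|
reduces to the maximality property of |argmaxList1|, provided |controls t x|
lists every element of |Y t x|.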
** Memoisation
> {-
> Bellman {t} ps ops p oep (p' :: ps') x =
> let y' = p' x in
> let mx' = next t x y' in
> let f' : ((x' : X (S t)) -> Val)
> = \ x' => reward t x y' x' <+> val ps' x' in
> let f : ((x' : X (S t)) -> Val)
> = \ x' => reward t x y' x' <+> val ps x' in
> let s0 : ((x' : X (S t)) -> f' x' <= f x')
> = \ x' => plusMon (NaiveTheory.lteRefl {v = reward t x y' x'}) (ops ps' x') in
> let s1 : (val (p' :: ps') x <= val (p' :: ps) x)
> = ?kuka in -- measMon f' f s0 mx' in
> let s2 : (val (p' :: ps) x <= val (p :: ps) x)
> = oep p' x in
> lteTrans s1 s2
> ---}