Commit 616138e3 authored by Nicola Botta

Added 'cval', 'cvalmax' and 'cvalargmax' and 'optExtLemma'.

parent dd7e4fb0
@@ -13,7 +13,7 @@
> next : (t : Nat) -> (x : X t) -> Y t x -> M (X (S t))

* Decision problem:

> Val : Type
> zero : Val
@@ -32,33 +32,31 @@
For a fixed number of decision steps |n = S m|, the problem consists of
finding a sequence

< [p0, p1, ..., pm]

of decision rules

< p0 : (x0 : X 0) -> Y 0 x0
< p1 : (x1 : X 1) -> Y 1 x1
< ...
< pm : (xm : X m) -> Y m xm

that, for any |x0 : X 0|, maximizes the |meas|-measure of the |<+>|-sum
of the |reward|-rewards obtained along the trajectories starting in |x0|.
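
To make the setting concrete, here is one possible (purely illustrative,
hypothetical) instantiation of the context parameters |M|, |X|, |Y|, |next|,
|Val|, |reward| and |meas| declared in the part of the file elided above: a
deterministic toy problem in which states are positions on a line, controls
choose between stepping right and staying put, and each transition is rewarded
with the position reached. It assumes |Identity| and |runIdentity| from
|Control.Monad.Identity| and is not part of this commit:

< M : Type -> Type
< M = Identity                       -- deterministic: exactly one successor state
< X : Nat -> Type
< X t = Nat                          -- states: positions on a line
< Y : (t : Nat) -> X t -> Type
< Y t x = Bool                       -- controls: step right (True) or stay (False)
< next : (t : Nat) -> (x : X t) -> Y t x -> M (X (S t))
< next t x True  = Id (S x)
< next t x False = Id x
< Val : Type
< Val = Nat                          -- values: |zero = Z|, |(<+>) = plus|, |(<=) = LTE|
< reward : (t : Nat) -> (x : X t) -> Y t x -> X (S t) -> Val
< reward t x y x' = x'               -- reward the position reached
< meas : M Val -> Val
< meas = runIdentity                 -- trivial measure for deterministic problems

For this instantiation the constant policy |\ x => True| (always step right)
is optimal at every decision step.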
* The theory:

** Policies and policy sequences:

> Policy : (t : Nat) -> Type
> Policy t = (x : X t) -> Y t x

> data PolicySeq : (t : Nat) -> (n : Nat) -> Type where
>   Nil  : {t : Nat} -> PolicySeq t Z
>   (::) : {t, n : Nat} -> Policy t -> PolicySeq (S t) n -> PolicySeq t (S n)
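
As a small usage sketch (the policies |p0 : Policy 0| and |p1 : Policy 1| are
hypothetical, not defined in this file), a policy sequence for two decision
steps starting at step zero is built with the list-like constructors:

< p0 : Policy 0
< p1 : Policy 1
< ps2 : PolicySeq 0 2
< ps2 = p0 :: p1 :: Nil              -- the sequence written |[p0, p1]| above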
** Value function:
> (<++>) : {A : Type} -> (f, g : A -> Val) -> A -> Val
> f <++> g = \ a => f a <+> g a
@@ -71,17 +69,17 @@ of the |reward|-rewards obtained along the trajectories starting in |x|.
> mx' : M (X (S t))
> mx' = next t x y
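
For reference, the full definition of |val|, whose first lines fall outside
this hunk, presumably has roughly the following shape; this is a sketch
reconstructed from the |where| clause shown above and from the |val|/|cval|
identity stated in the 'Optimal extensions' section below, not the verbatim
contents of the file:

< val : {t, n : Nat} -> Functor M => PolicySeq t n -> (x : X t) -> Val
< val {t} Nil x = zero
< val {t} (p :: ps) x = meas (map {f = M} (reward t x y <++> val ps) mx')
<   where y : Y t x
<         y = p x
<         mx' : M (X (S t))
<         mx' = next t x y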
** Optimality of policy sequences
> OptPolicySeq : {t, n : Nat} -> Functor M => PolicySeq t n -> Type
> OptPolicySeq {t} {n} ps = (ps' : PolicySeq t n) -> (x : X t) -> val ps' x <= val ps x
** Bellman's principle of optimality (1957):
> OptExt : {t, n : Nat} -> Functor M => PolicySeq (S t) n -> Policy t -> Type
> OptExt {t} ps p = (p' : Policy t) -> (x : X t) -> val (p' :: ps) x <= val (p :: ps) x
> Bellman : {t, n : Nat} -> Functor M =>
>           (ps : PolicySeq (S t) n) -> OptPolicySeq ps ->
>           (p : Policy t) -> OptExt ps p ->
@@ -103,22 +101,23 @@ Bellman's principle of optimality (1957):
> s2 : (val (p' :: ps) x <= val (p :: ps) x)
> s2 = oep p' x
** Backwards induction
First, the empty policy sequence is optimal:
> nilOptPolicySeq : Functor M => OptPolicySeq Nil
> nilOptPolicySeq Nil x = NaiveTheory.lteRefl
Provided that we can implement optimal extensions of arbitrary policy sequences
> optExt : {t, n : Nat} -> PolicySeq (S t) n -> Policy t

> optExtSpec : {t, n : Nat} -> Functor M =>
>              (ps : PolicySeq (S t) n) -> OptExt ps (optExt ps)
we can compute optimal policy sequences backwards, starting from the empty
policy sequence:
> bi : (t : Nat) -> (n : Nat) -> PolicySeq t n
> bi t Z = Nil
@@ -126,7 +125,7 @@ then
> where ps : PolicySeq (S t) n
>       ps = bi (S t) n
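
The head of this clause is hidden by the hunk; consistent with the |where|
block above and with |biLemma| below, it presumably reads

< bi t (S n) = optExt ps :: ps

so that each step prepends an optimal extension of the tail computed so far.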
This generic backwards induction is correct "by construction":
> biLemma : Functor M => (t : Nat) -> (n : Nat) -> OptPolicySeq (bi t n)
> biLemma t Z = nilOptPolicySeq
@@ -134,32 +133,89 @@ is correct "by construction"
> where ps : PolicySeq (S t) n
>       ps = bi (S t) n
>       ops : OptPolicySeq ps
>       ops = biLemma (S t) n
>       p : Policy t
>       p = optExt ps
>       oep : OptExt ps p
>       oep = optExtSpec ps
** Optimal extensions
The verified, generic implementation of backwards induction |bi|
naturally raises the question of under which conditions one can
implement
< optExt : {t, n : Nat} -> PolicySeq (S t) n -> Policy t
such that
< optExtSpec : {t, n : Nat} -> Functor M =>
< (ps : PolicySeq (S t) n) -> OptExt ps (optExt ps)
To this end, consider the function
> cval : {t, n : Nat} -> Functor M => PolicySeq (S t) n ->
>        (x : X t) -> Y t x -> Val
> cval {t} ps x y = meas (map {f = M} (reward t x y <++> val ps) mx')
>   where mx' : M (X (S t))
>         mx' = next t x y
By definition of |val| and |cval|, one has
< val (p :: ps) x
<   =
< meas (map {f = M} (reward t x (p x) <++> val ps) (next t x (p x)))
<   =
< cval ps x (p x)
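
Assuming |val| computes on |(::)| as sketched earlier, this identity holds
definitionally and could be recorded as a lemma (a sketch, not part of the
commit):

< valCvalLemma : {t, n : Nat} -> Functor M => (p : Policy t) -> (ps : PolicySeq (S t) n) ->
<                (x : X t) -> val (p :: ps) x = cval ps x (p x)
< valCvalLemma p ps x = Refl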
This suggests that, if we can maximize |cval|, that is, implement
> cvalmax : {t, n : Nat} -> PolicySeq (S t) n -> (x : X t) -> Val
> cvalargmax : {t, n : Nat} -> PolicySeq (S t) n -> (x : X t) -> Y t x
that fulfill
> cvalmaxSpec : {t, n : Nat} -> Functor M =>
>               (ps : PolicySeq (S t) n) -> (x : X t) ->
>               (y : Y t x) -> cval ps x y <= cvalmax ps x

> cvalargmaxSpec : {t, n : Nat} -> Functor M =>
>                  (ps : PolicySeq (S t) n) -> (x : X t) ->
>                  cvalmax ps x = cval ps x (cvalargmax ps x)
then we can implement optimal extensions of arbitrary policy
sequences. The following lemma shows that this intuition is
correct. With
> optExt = cvalargmax
, one has
> optExtLemma : {t, n : Nat} -> Functor M =>
>               (ps : PolicySeq (S t) n) -> OptExt ps (optExt ps)
> optExtLemma {t} {n} ps p' x = s4 where
>   p : Policy t
>   p = optExt ps
>   y : Y t x
>   y = p x
>   y' : Y t x
>   y' = p' x
>   s1 : cval ps x y' <= cvalmax ps x
>   s1 = cvalmaxSpec ps x y'
>   s2 : cval ps x y' <= cval ps x (cvalargmax ps x)
>   s2 = replace {P = \ z => (cval ps x y' <= z)} (cvalargmaxSpec ps x) s1
>   s3 : cval ps x y' <= cval ps x y
>   s3 = s2
>   s4 : val (p' :: ps) x <= val (p :: ps) x
>   s4 = s3
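
This leaves the question of when |cvalmax| and |cvalargmax| themselves can be
implemented. A minimal sketch, under the hypothetical assumptions that every
control space |Y t x| comes with a non-empty, exhaustive enumeration |enumY|
and that there is a selection function |argmaxList| picking a |cval|-maximal
element of a non-empty list (neither is part of this file), could read:

< enumY : (t : Nat) -> (x : X t) -> (ys : List (Y t x) ** NonEmpty ys)
< argmaxList : {A : Type} -> (A -> Val) -> (xs : List A) -> NonEmpty xs -> A

< cvalargmax {t} ps x = let (ys ** ne) = enumY t x in argmaxList (cval ps x) ys ne
< cvalmax ps x = cval ps x (cvalargmax ps x)

With |cvalmax| defined in terms of |cvalargmax|, |cvalargmaxSpec| holds by
|Refl|, and |cvalmaxSpec| reduces to the property that |argmaxList| returns an
element whose |cval| is at least that of every control in the enumeration.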
** Memoisation
> {-
> Bellman {t} ps ops p oep (p' :: ps') x =
>   let y' = p' x in
>   let mx' = next t x y' in
>   let f' : ((x' : X (S t)) -> Val)
>          = \ x' => reward t x y' x' <+> val ps' x' in
>   let f  : ((x' : X (S t)) -> Val)
>          = \ x' => reward t x y' x' <+> val ps x' in
>   let s0 : ((x' : X (S t)) -> f' x' <= f x')
>          = \ x' => plusMon (NaiveTheory.lteRefl {v = reward t x y' x'}) (ops ps' x') in
>   let s1 : (val (p' :: ps') x <= val (p' :: ps) x)
>          = ?kuka in -- measMon f' f s0 mx' in
>   let s2 : (val (p' :: ps) x <= val (p :: ps) x)
>          = oep p' x in
>   lteTrans s1 s2
> ---}