### Final.

parent c1f2186a
@@ -621,7 +621,7 @@ Optimal extensions:
 %endif
 Together with the result of \bottaetal (|biOptVal|, see
-appendix~\ref{appendix:bilemma} below) we can prove the correctness
+Appendix~\ref{appendix:bilemma} below) we can prove the correctness
 of monadic backward induction as corollary, using a generalised
 optimality of policy sequences predicate:
@@ -44,18 +44,19 @@ extensional equality of |val| and |val'|:
 \item[{\bf Condition 1.}] The measure needs to be left-inverse to |pure|:
 \footnote{The symbol |ExtEq| denotes \emph{extensional} equality,
-see appendix~\ref{appendix:monadLaws}}
+see Appendix~\ref{appendix:monadLaws}}

 < measPureSpec : meas . pure ExtEq id

 \begin{center}
-\begin{tikzcd}[column sep=2cm]
-%
-Val \arrow[d, "pure", swap] \arrow[dr, "id"] \\
-M Val \arrow[r, "meas", swap] & Val\\
-%
-\end{tikzcd}
+\includegraphics{img/diag1.eps}
+% \begin{tikzcd}[column sep=2cm]
+% %
+% Val \arrow[d, "pure", swap] \arrow[dr, "id"] \\
+% M Val \arrow[r, "meas", swap] & Val\\
+% %
+% \end{tikzcd}
 \end{center}
 % measJoinSpec

@@ -67,12 +68,13 @@ equal to applying it after |map meas|:
 < measJoinSpec : meas . join ExtEq meas . map meas

 \begin{center}
-\begin{tikzcd}[column sep=3cm]
-%
-M (M Val) \arrow[r, "map\ meas"] \arrow[d, "join", swap] & M Val \arrow[d, "meas" ] \\
-M Val \arrow[r, "meas", swap] & Val\\
-%
-\end{tikzcd}
+\includegraphics{img/diag2.eps}
+% \begin{tikzcd}[column sep=3cm]
+% %
+% M (M Val) \arrow[r, "map\ meas"] \arrow[d, "join", swap] & M Val \arrow[d, "meas" ] \\
+% M Val \arrow[r, "meas", swap] & Val\\
+% %
+% \end{tikzcd}
 \end{center}
 % measPlusSpec

@@ -87,14 +89,15 @@ applying |(v <+>)| after the measure:
 < (meas . map (v <+>)) mv = ((v <+>) . meas) mv

 \begin{center}
-\begin{tikzcd}[column sep=3cm]
-%
-M Val \arrow[r, "map (v\, \oplus)"] \arrow[d, "meas", swap] & M Val \arrow[d, "meas" ] \\
-Val \arrow[r, "(v\, \oplus)", swap] & Val\\
-%
-\end{tikzcd}
-\vspace{0.1cm}
+\includegraphics{img/diag3.eps}
+% \begin{tikzcd}[column sep=3cm]
+% %
+% M Val \arrow[r, "map (v\, \oplus)"] \arrow[d, "meas", swap] & M Val \arrow[d, "meas" ] \\
+% Val \arrow[r, "(v\, \oplus)", swap] & Val\\
+% %
+% \end{tikzcd}
+%\vspace{0.1cm}
 \end{center}
 \end{itemize}

@@ -146,15 +149,19 @@ is neutral for |*|.
 The second and the third condition require some arithmetic reasoning,
 so let us just consider them for two examples.
-Say we have distributions
+Let |a, b, c, d| be variables of type |Double| and say we have
+distributions
 %if False
-> parameters (a, b, c, d, e : Double)
+> parameters (a, b, c, d : Double)
 %endif

 > dps1 : Dist Double
 > dps1 = [(a, 0.5), (b, 0.3), (c, 0.2)]

 > dps2 : Dist Double
-> dps2 = [(d, 0.4), (e, 0.6)]
+> dps2 = [(a, 0.4), (d, 0.6)]

 > dpdps : Dist (Dist Double)
 > dpdps = [(dps1, 0.1), (dps2, 0.9)]

@@ -170,13 +177,13 @@ addition and multiplication:
 < (expected . distJoin) dpdps =
 <
-< expected [(a, 0.5 * 0.1), (b, 0.3 * 0.1), (c, 0.2 * 0.1), (d, 0.4 * 0.9), (e, 0.6 * 0.9)] =
+< expected [(a, 0.5 * 0.1), (b, 0.3 * 0.1), (c, 0.2 * 0.1), (a, 0.4 * 0.9), (d, 0.6 * 0.9)] =
 <
-< (a * 0.5 * 0.1) + (b * 0.3 * 0.1) + (c * 0.2 * 0.1) + (d * 0.4 * 0.9) + (e * 0.6 * 0.9) =
+< (a * 0.5 * 0.1) + (b * 0.3 * 0.1) + (c * 0.2 * 0.1) + (a * 0.4 * 0.9) + (d * 0.6 * 0.9) =
 <
-< ((a * 0.5 + b * 0.3 + c * 0.2) * 0.1 + (d * 0.4 + e * 0.6) * 0.9 =
+< ((a * 0.5 + b * 0.3 + c * 0.2) * 0.1 + (a * 0.4 + d * 0.6) * 0.9 =
 <
-< expected [(a * 0.5 + b * 0.3 + c * 0.2, 0.1) , (d * 0.4 + e * 0.6, 0.9)] =
+< expected [(a * 0.5 + b * 0.3 + c * 0.2, 0.1) , (a * 0.4 + d * 0.6, 0.9)] =
 <
 < (expected . distMap expected) dpdps
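The three measure conditions discussed in this hunk can be spot-checked numerically. The following Python sketch (not part of the commit; the helper names `expected`, `dist_map`, `dist_join` and the sample values chosen for |a, b, c, d| are ad hoc stand-ins for the paper's Idris definitions) verifies `measPureSpec`, `measJoinSpec`, and `measPlusSpec` for the expected-value measure on the distributions |dps1|, |dps2|, and |dpdps| appearing in the diff:

```python
# Finite distributions as lists of (value, probability) pairs.

def expected(d):
    """Expected value of a finite distribution [(value, prob), ...]."""
    return sum(v * p for v, p in d)

def pure(v):
    """Point distribution: the monadic unit."""
    return [(v, 1.0)]

def dist_map(f, d):
    """Functorial map over the values of a distribution."""
    return [(f(v), p) for v, p in d]

def dist_join(dd):
    """Flatten a distribution of distributions, multiplying probabilities."""
    return [(v, p * q) for d, q in dd for v, p in d]

# Arbitrary sample values for the variables a, b, c, d of the example.
a, b, c, d = 1.0, 2.0, 3.0, 4.0
dps1 = [(a, 0.5), (b, 0.3), (c, 0.2)]
dps2 = [(a, 0.4), (d, 0.6)]
dpdps = [(dps1, 0.1), (dps2, 0.9)]

# Condition 1 (measPureSpec): meas . pure == id
assert expected(pure(a)) == a

# Condition 2 (measJoinSpec): meas . join == meas . map meas
assert abs(expected(dist_join(dpdps))
           - expected(dist_map(expected, dpdps))) < 1e-12

# Condition 3 (measPlusSpec): meas . map (v <+>) == (v <+>) . meas,
# with <+> instantiated to ordinary addition.
v = 7.0
assert abs(expected(dist_map(lambda w: v + w, dps1))
           - (v + expected(dps1))) < 1e-12
```

Condition 2 is exactly the arithmetic rearrangement carried out step by step in the hunk above, with concrete numbers in place of the symbolic |a, b, c, d|.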
@@ -120,9 +120,9 @@
 The price to pay for including the non-emptiness premise is the
 additional condition |nextNotEmpty| on the transition function |next|
 that was already stated in Sec.~\ref{subsection:wrap-up}. Moreover,
 we have to postulate non-emptiness preservation laws for the
-monad operations (appendix~\ref{appendix:monadLaws}) and to prove an
+monad operations (Appendix~\ref{appendix:monadLaws}) and to prove an
 additional lemma about the preservation of non-emptiness
-(appendix~\ref{appendix:lemmas}).
+(Appendix~\ref{appendix:lemmas}).
 %
 Conceptually, it might seem cleaner to omit the non-emptiness
 condition: In this case, the remaining conditions would only concern
@@ -43,7 +43,8 @@
 control theory for finite-horizon, discrete-time SDPs. It extends
 mathematical formulations for stochastic SDPs \citep{bertsekas1995,
 bertsekasShreve96, puterman2014markov} to the general problem of
-optimal decision making under \emph{monadic} uncertainty.\\
+optimal decision making under \emph{monadic}
+uncertainty.

 For monadic SDPs, the framework provides a generic implementation of
 backward induction. It has been applied to study the impact of
 uncertainties on optimal emission policies

@@ -156,7 +157,8 @@
 Note that for deterministic problems it is unnecessary to parameterise
 the reward function over the next state as it is unique and can thus
 be obtained from the current state and control. But for
 non-deterministic problems it is useful to be able to assign rewards
-depending on the (uncertain) outcome of a transition.\\
+depending on the (uncertain) outcome of a transition.

 A few remarks are at place here.
 \begin{itemize}

@@ -477,7 +479,7 @@
 Like the reference value |zero| discussed above, |plusMonSpec| and
 |measMonSpec| are specification components of the BJI-framework that
 we have not discussed in Sec.~\ref{subsection:specification_components}.
 %
-We provide a proof of |Bellman| in appendix~\ref{appendix:Bellman}. As one
+We provide a proof of |Bellman| in Appendix~\ref{appendix:Bellman}. As one
 would expect, the proof makes essential use of the recursive
 definition of the function |val| discussed above.
 %

@@ -597,7 +599,7 @@
 orthogonal to the purpose of the current paper. For the same reason we
 have not addressed the question of how to make |bi| more efficient by
 tabulation. We briefly discuss the specification and implementation of
 optimal extensions
-in the BJI-framework in appendix~\ref{appendix:optimal_extension}.
+in the BJI-framework in Appendix~\ref{appendix:optimal_extension}.
 We refer the reader interested in tabulation of |bi| to
 \href{https://gitlab.pik-potsdam.de/botta/IdrisLibs/-/blob/master/SequentialDecisionProblems/TabBackwardsInduction.lidr}
 {SequentialDecisionProblems.TabBackwardsInduction}
@@ -156,5 +156,5 @@
 The sources of this document have been written in literal Idris and
 are available at \citep{IdrisLibsValVal}, together with some example
 code. All source files can be type checked with Idris~1.3.2.
-\vfill
-\pagebreak
\ No newline at end of file
+%\vfill
+%\pagebreak
\ No newline at end of file
main: main.tex
	latexmk -pdf main.tex

final: main
	latex main.tex
	dvips main.dvi
	ps2pdf main.ps

LHS2TEX = lhs2TeX
@@ -58,8 +58,8 @@
 on total rewards which is usually aggregated using the familiar
 \emph{expected value} measure. The value thus obtained is called the
 \emph{expected total reward} \citep[ch.~4.1.2]{puterman2014markov}
 and its role is central: It is the quantity that is to be optimised in
-an SDP.\\
-%
+an SDP.
 In monadic SDPs, the measure is generic, i.e. it is not fixed in
 advance but has to be given as part of the specification of a concrete
 problem. Therefore we will generalise the notion of \emph{expected total reward} to

@@ -76,8 +76,8 @@ Similarly, we define that \emph{solving a monadic SDP} consists in
 This means that when starting from any initial state at decision step
 |t|, following the computed list of rules for selecting controls will
 result in a value that is maximal as measure of the sum of rewards
-along all possible trajectories rooted in that initial state.\\
-%
+along all possible trajectories rooted in that initial state.
 Equivalently, rewards can instead be considered as \emph{costs} that
 need to be \emph{minimised}. This dual perspective is taken e.g. in
 \citep{bertsekas1995}. In the subsequent sections we will follow

@@ -141,22 +141,27 @@ Fig.~\ref{fig:examplesGraph}.
 \begin{subfigure}[b]{.35\textwidth-0.55cm}
 \centering
-\scalebox{0.7}{
-\small
-\input{NFAexample.tex}
-}
+%\scalebox{0.7}{
+\includegraphics[height=0.26\textheight]{img/fig1a.eps}
+% \small
+% \input{NFAexample.tex}
+%}
 \caption{Example~1}
 \label{fig:example1}
 \end{subfigure}
 \begin{subfigure}[b]{.65\textwidth}
 \centering
-\scalebox{0.7}{
-\input{SchedulingExample.tex}
-\vspace{0.3cm}
-}
+%\scalebox{0.7}{
+\includegraphics[height=0.25\textheight]{img/fig1b.eps}
+% \input{SchedulingExample.tex}
+%\vspace{0.3cm}
+% }
 \caption{Example~2}
 \label{fig:example2}

@@ -218,8 +223,8 @@ these lines we refer the interested reader to~\citep{esd-9-525-2018}.
 Scheduling problems serve as canonical examples in control theory
 textbooks.
 The one we present here is a slightly modified version of
-\citep[Example~1.1.2]{bertsekas1995}.\\
-%
+\citep[Example~1.1.2]{bertsekas1995}.

 Think of some machine in a factory that can perform different
 operations, say $A, B, C$ and $D$. Each of these operations is
 supposed to be performed once. The machine can only perform one
 operation at a
@@ -71,8 +71,8 @@ starting at that state:
 where we use |StateCtrlSeq| as type of trajectories. Essentially it is
 a non-empty list of (dependent) state/control pairs, with the
 exception of the base case
-which is a singleton just containing the last state reached.\\
-%
+which is a singleton just containing the last state reached.

 Furthermore, we can compute the \emph{total reward} for a single
 trajectory, i.e. its sum of rewards:

@@ -114,7 +114,7 @@
 of the sum of rewards along all possible trajectories of length |n|
 that are rooted in an initial state at step |t|. Again by analogy to
 the stochastic case, we define monadic
-backward induction to be correct, if for a given SDP the policy
+backward induction to be correct if, for a given SDP, the policy
 sequence computed by |bi| is the solution to the SDP. I.e., we
 consider |bi| to be correct if it meets the specification
 %

@@ -230,7 +230,8 @@
 and it is not ``obviously clear'' that |val| and |val'| are
 extensionally equal without further knowledge about |meas|. In the
 deterministic case, i.e. for |M = Id| and
-|meas = id|, they are indeed equal without imposing any further
+|meas = id|, |val ps x| and |val' ps x| are indeed equal for all |ps|
+and |x|, without imposing any further
 conditions (as we will see in Sec.~\ref{section:valval}). For the
 stochastic case, \cite[Theorem 4.2.1]{puterman2014markov} suggests
 that the equality
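The equality of the two value functions discussed in these hunks, |val| (defined by backward recursion) and |val'| (the measure of the total rewards over all trajectories), can be illustrated on a toy stochastic SDP. The Python sketch below is not the paper's framework: the two-state transition function, the reward, and the policy sequence are invented for the example, and the measure is fixed to the expected value. Only the shape of the two definitions follows the text.

```python
def expected(d):
    """Expected value of a finite distribution [(value, prob), ...]."""
    return sum(v * p for v, p in d)

def next_(t, x, y):
    """Toy stochastic transition: stay put or flip, each with prob 0.5."""
    return [(x, 0.5), ((x + y) % 2, 0.5)]

def reward(t, x, y, x1):
    """Toy reward depending on the control and the next state."""
    return float(x1 + y)

def val(ps, t, x):
    """Value by backward recursion: measure of reward + tail value."""
    if not ps:
        return 0.0
    p, *rest = ps
    y = p(x)
    return expected([(reward(t, x, y, x1) + val(rest, t + 1, x1), pr)
                     for x1, pr in next_(t, x, y)])

def trajectories(ps, t, x):
    """All (total reward, probability) pairs of trajectories from (t, x)."""
    if not ps:
        return [(0.0, 1.0)]
    p, *rest = ps
    y = p(x)
    return [(reward(t, x, y, x1) + r, pr * q)
            for x1, pr in next_(t, x, y)
            for r, q in trajectories(rest, t + 1, x1)]

def val_prime(ps, t, x):
    """Value as the measure of total rewards over all trajectories."""
    return expected(trajectories(ps, t, x))

ps = [lambda x: 1, lambda x: x]  # an arbitrary two-step policy sequence
for x in (0, 1):
    assert abs(val(ps, 0, x) - val_prime(ps, 0, x)) < 1e-12
```

That the two computations agree here is exactly what the measure conditions guarantee for the expected value; for a generic |meas|, the agreement is the content of the theorem the diff refers to.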
@@ -85,9 +85,9 @@
 interaction of the measure with the monad structure and the
 |<+>|-operator on |Val|.
 %
 Machine-checked proofs are given in the
-appendices~\ref{appendix:theorem},~\ref{appendix:biCorrectness} and
+Appendices~\ref{appendix:theorem},~\ref{appendix:biCorrectness} and
 \ref{appendix:lemmas}. The monad laws we use are stated in
-appendix~\ref{appendix:monadLaws}. In the remainder of this section,
+Appendix~\ref{appendix:monadLaws}. In the remainder of this section,
 we discuss semi-formal versions of the proofs.\\

 \paragraph*{Monad algebras.} \hspace{0.1cm} The first lemma allows us

@@ -229,8 +229,8 @@ that the equality holds.\\
 The induction hypothesis (|IH|) is: for all |x : X t|, |val' ps x =
 val ps x|. We have to show that |IH| implies that for all |p : Policy
 t| and |x : X t|, the equality
-|val' (p :: ps) x = val (p :: ps) x| holds.\\
-%
+|val' (p :: ps) x = val (p :: ps) x| holds.

 For brevity (and to economise on brackets), let in the following |y =
 p x|, |mx' = next t x y|, |r = reward t x y|, |trjps = trj ps|, and
 |consxy = ((x ** y) :::)|.

@@ -295,7 +295,7 @@
 may be uninteresting for a pen and paper proof, but turn out to be
 crucial in the setting of an intensional type theory -- like Idris --
 where function extensionality does not hold in general. In
 particular, we have to postulate that the functorial |map|
-preserves extensional equality (see appendix~\ref{appendix:monadLaws}
+preserves extensional equality (see Appendix~\ref{appendix:monadLaws}
 and \citep{botta2020extensional}) for Idris to accept the proof. In
 fact, most of the reasoning proceeds by replacing functions that are
 mapped onto monadic values by other functions that are only extensionally