% -*-Latex-*-
%if False

> module MonadicSDP

%endif

\section{Finite-horizon Sequential Decision Problems}
\label{section:SDPs}

In deterministic, non-deterministic and stochastic finite-horizon
SDPs, a decision maker seeks to control the evolution of a
\emph{(dynamical) system} at a finite number of \emph{decision steps}
by selecting certain \emph{controls} in sequence, one after the
other. The controls available to the decision maker at a given
decision step typically depend on the \emph{state} of the system at
that step.

In \emph{deterministic} problems, selecting a control in a state at
decision step |t : Nat| determines a unique next state at decision
step |t + 1| through a given \emph{transition function}.
%
In \emph{non-deterministic} problems, the transition function yields a
whole set of \emph{possible} states at the next decision step.
%
In \emph{stochastic} problems, the transition function yields a
\emph{probability distribution} on states at the next decision step.

The notion of \emph{monadic} problem generalises that of
deterministic, non-deterministic and stochastic problem through a
transition function that yields an |M|-structure of next states where
|M| is a monad.
%
For example, the identity monad can be applied to model deterministic
systems. Non-deterministic systems can be represented in terms of
transition functions that return lists (or some other representations
of sets) of next states. Stochastic systems can be represented in
terms of probability distribution monads \citep{giry1981,
  DBLP:journals/jfp/ErwigK06, DBLP:journals/scp/AudebaudP09,
  DBLP:journals/tcs/Jacobs11}.
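The three kinds of transition function can be sketched as follows. The sketch is our own and simply typed; the BJI-framework itself is dependently typed, and the state and control types below are mere placeholders:

```haskell
import Data.Functor.Identity (Identity (..))

type State   = Int   -- placeholder state space
type Control = Int   -- placeholder control space

-- Deterministic: M = Identity, exactly one next state.
nextDet :: State -> Control -> Identity State
nextDet x y = Identity (x + y)

-- Non-deterministic: M = [], a list of possible next states.
nextNonDet :: State -> Control -> [State]
nextNonDet x y = [x + y, x - y]

-- Stochastic: M = a finite probability distribution monad.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

nextStoch :: State -> Control -> Dist State
nextStoch x y = Dist [(x + y, 1/2), (x - y, 1/2)]
```

In each case the shape of the |M|-structure, not the state or control types, is what distinguishes the three problem classes.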
The uncertainty monad, the states, the controls and the next function
define what is often called a \emph{decision process}. The idea of
sequential decision problems is that each single decision yields a
\emph{reward} and these rewards add up to a \emph{total reward} over
all decision steps. Rewards are often represented by values of a
numeric type, and added up using the canonical addition. If the
transition function and thus the evolution of the system is not
deterministic, then the resulting possible total rewards need to be
aggregated to yield a single outcome value.
%
In stochastic SDPs, evolving the underlying stochastic system leads to
a probability distribution on total rewards which is usually
aggregated using the familiar \emph{expected value} measure. The value
thus obtained is called the \emph{expected total reward}
\citep[ch.~4.1.2]{puterman2014markov} and its role is central: it is
the quantity that is to be optimised in an SDP.

In monadic SDPs, the measure is generic, i.e.\ it is not fixed in
advance but has to be given as part of the specification of a concrete
problem. We will therefore generalise the notion of \emph{expected
  total reward} to a corresponding notion for monadic SDPs that we
call, by analogy, \emph{measured total reward} (see
Sec.~\ref{section:preparation}).

\emph{Solving a stochastic SDP} consists in \emph{finding a list of
  rules for selecting controls that maximises the expected total
  reward for |n| decision steps when starting at decision step |t|}.
Similarly, we define that \emph{solving a monadic SDP} consists in
\emph{finding a list of rules for selecting controls that maximises
  the measured total reward}.
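To make the role of the measure concrete, here is a sketch (our own names, not the framework's interface) of two such measures: the expected value for finite probability distributions, and the worst case for non-empty lists of possible total rewards:

```haskell
-- Finite probability distribution over outcome values.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

-- Expected value: the standard measure for stochastic SDPs.
expected :: Dist Rational -> Rational
expected (Dist xs) = sum [v * p | (v, p) <- xs]

-- Worst case: a natural measure for non-deterministic SDPs whose
-- possible total rewards are given as a non-empty list.
worstCase :: [Rational] -> Rational
worstCase = minimum
```

A monadic SDP must specify one such measure as part of the problem; it collapses the |M|-structure of total rewards into the single value to be optimised.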
This means that when starting from any initial state at decision step
|t|, following the computed list of rules for selecting controls will
result in a value that is maximal as a measure of the sums of rewards
along all possible trajectories rooted in that initial state.

Equivalently, rewards can instead be considered as \emph{costs} that
need to be \emph{minimised}. This dual perspective is taken e.g.\ in
\citep{bertsekas1995}. In the subsequent sections we will follow the
terminology of the BJI-framework and \citep{puterman2014markov} and
speak of ``rewards'', but our second example below will illustrate the
``cost'' perspective.

In mathematical theories of optimal control, the rules for selecting
controls are called \emph{policies}. A \emph{policy} for a decision
step is simply a function that maps each possible state to a
control. As mentioned above, the controls available in a given state
typically depend on that state, thus policies are dependently typed
functions. A list of such policies is called a \emph{policy sequence}.

The central idea underlying backward induction is to compute a
globally optimal solution of a multi-step SDP incrementally by solving
local optimisation problems at each decision step. This is captured by
\emph{Bellman's principle}: \emph{Extending an optimal policy sequence
  with an optimal policy yields again an optimal policy sequence}.
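For a deterministic toy problem, policies, policy sequences and backward induction by Bellman's principle can be sketched as below. All names and the concrete transition and reward functions are our own placeholders, not the framework's definitions, and the sketch is simply typed (it ignores the state-dependency of controls):

```haskell
type State   = Int
type Control = Int

-- Placeholder deterministic problem.
ctrls :: State -> [Control]          -- finitely many controls per state
ctrls _ = [0, 1]

next :: State -> Control -> State    -- deterministic transition
next x y = x + y

reward :: State -> Control -> State -> Double
reward _ _ x' = fromIntegral x'      -- reward of reaching state x'

-- Total reward of taking n further decisions optimally from a state.
val :: Int -> State -> Double
val 0 _ = 0
val n x = maximum [ reward x y x' + val (n - 1) x'
                  | y <- ctrls x, let x' = next x y ]

-- Backward induction: an optimal n-step policy sequence is an optimal
-- (n - 1)-step sequence extended with one more optimal policy.
bi :: Int -> [State -> Control]
bi 0 = []
bi n = p : bi (n - 1)
  where
    p x = snd (maximum [ (reward x y x' + val (n - 1) x', y)
                       | y <- ctrls x, let x' = next x y ])
```

Note that the number of decision steps |n| is an explicit input to |bi|, and that optimality of each local choice relies on the optimality of the shorter tail sequence, exactly as Bellman's principle states.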
However, as we will see in Sec.~\ref{subsection:counterEx}, one has to
carefully check whether for a given SDP backward induction is indeed
applicable. Two features are crucial for finite-horizon, monadic SDPs
to be solvable with the BJI-framework that we study in this paper: (1)
the number of decision steps has to be given explicitly as input to
the backward induction and (2) at each decision step, the number of
possible next states has to be \emph{finite}.
%
While (2) is a necessary condition for backward induction to be
computable, (1) is a genuine limitation of the BJI-framework: in many
SDPs, for example in a game of tic-tac-toe, the number of decision
steps is bounded but not known a priori.

Before we discuss the BJI-framework in the next section, we illustrate
the notion of sequential decision problem with two simple examples,
one in which the purpose is to maximise rewards and one in which the
purpose is to minimise costs. Rewards and costs in these examples are
just natural numbers and are summed up with ordinary addition. The
first example is a non-deterministic SDP. Although it is somewhat
oversimplified, it has the advantage of being tractable for
computations by hand while still being sufficient as basis for
illustrations in
sections~\ref{section:framework}--\ref{section:conditions}. The second
example is a deterministic SDP that stands for the important class of
scheduling SDPs. This problem highlights why dependent types are
necessary to model state and control spaces accurately.
As state and control spaces are finite in these simple examples, the
transition functions can be described by directed graphs. These are
given in Fig.~\ref{fig:examplesGraph}.

\begin{exm}[\emph{A toy climate problem}]
\label{subsection:example1SDPs}
\begin{figure}
\centering
\framebox{
\parbox{\textwidth-0.5cm}{
\vspace{0.2cm}
\begin{subfigure}[b]{.35\textwidth-0.55cm}
\centering
%\scalebox{0.7}{
\includegraphics[height=0.26\textheight]{img/fig1a.eps}
% \small
% \input{NFAexample.tex}
%}
\caption{Example~1}
\label{fig:example1}
\end{subfigure}
\begin{subfigure}[b]{.65\textwidth}
\centering
%\scalebox{0.7}{
\includegraphics[height=0.25\textheight]{img/fig1b.eps}
% \input{SchedulingExample.tex}
%\vspace{0.3cm}
% }
\caption{Example~2}
\label{fig:example2}
\end{subfigure}
\begin{small}
\caption{Transition graphs for the example SDPs described in
  Sec.~\ref{section:SDPs}. The edge labels denote pairs
  $\textit{control} \mid \textit{reward}$ for the associated
  transitions.
  In the first example, state and control spaces are constant over
  time, therefore we have omitted the temporal dimension.
  \label{fig:examplesGraph}
}
\end{small}
}
}
\end{figure}

Our first example is a variant of a stochastic climate science SDP
studied in \citep{esd-9-525-2018}, stripped down to a simple
non-deterministic SDP. At every decision step, the world may be in one
of two \emph{states}, namely \emph{Good} or \emph{Bad}, and the
\emph{controls} determine whether a \emph{Low} or a \emph{High} amount
of greenhouse gases is emitted into the atmosphere. If the world is in
the \emph{Good} state, choosing \emph{Low} emissions will definitely
keep the world in the \emph{Good} state, but the result of choosing
\emph{High} emissions is non-deterministic: the world may either stay
in the \emph{Good} state or tip to the \emph{Bad} state. Similarly, in
the \emph{Bad} state, \emph{High} emissions will definitely keep the
world in \emph{Bad}, while with \emph{Low} emissions it might either
stay in \emph{Bad} or recover and return to the \emph{Good} state. The
transitions just described define a non-deterministic \emph{transition
  function}. The \emph{rewards} associated with each transition are
determined by the respective control and reached state.

Now we can formulate an SDP: \emph{``Which policy sequence will
  maximise the worst-case sum of rewards along all possible
  trajectories when taking |n| decisions starting at decision step
  |t|?''}. In this simple example the question is not hard to answer:
always choose \emph{Low} emissions, independent of decision step and
state. The \emph{optimal policy sequence} for any |n| and |t| would
thus consist of |n| constant \emph{Low} functions.
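The state and control spaces and the non-deterministic transition function of this example can be read off directly from the description above (a sketch in Haskell-like notation; the reward values are omitted, as they are given by the edge labels of the transition graph):

```haskell
data State   = Good | Bad   deriving (Eq, Show)
data Control = Low  | High  deriving (Eq, Show)

-- Non-deterministic transition function:
-- Low in Good surely keeps Good; High in Good may keep Good or tip
-- to Bad; High in Bad surely keeps Bad; Low in Bad may stay Bad or
-- recover to Good.
next :: State -> Control -> [State]
next Good Low  = [Good]
next Good High = [Good, Bad]
next Bad  High = [Bad]
next Bad  Low  = [Bad, Good]
```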
But in a more realistic example the situation will be more involved:
every option will have its benefits and drawbacks encoded in a more
complicated reward function, uncertainties might come with different
probabilities, there might be intermediate states, different
combinations of control options, etc. For more along these lines we
refer the interested reader to~\citep{esd-9-525-2018}.
\end{exm}

\begin{exm}[\emph{Scheduling}]
\label{subsection:example2SDPs}

Scheduling problems serve as canonical examples in control theory
textbooks. The one we present here is a slightly modified version of
\citep[Example~1.1.2]{bertsekas1995}.

Think of some machine in a factory that can perform different
operations, say $A, B, C$ and $D$. Each of these operations is
supposed to be performed once. The machine can only perform one
operation at a time, thus an order has to be fixed in which to perform
the operations. Setting the machine up for each operation incurs a
specific cost that might vary according to which operation has been
performed before. Moreover, operation $B$ can only be performed after
operation $A$ has already been completed, and operation $D$ only after
operation $C$. It suffices to fix the order in which the first three
operations are to be performed as this uniquely determines which will
be the fourth task. The aim is now to choose an order that minimises
the total cost of performing the four operations.
This situation can be modelled as a problem with three decision steps:
the \emph{states at each step} are the sequences of operations already
performed, with the empty sequence at step 0. The \emph{controls at a
  decision step and in a state} are the operations which have not
already been performed at previous steps and which are permitted in
that state. For example, at decision step 0, only controls $A$ and $C$
are available because of the above constraint on performing $B$ and
$D$.
%
The \emph{transition} and \emph{cost} functions of the problem are
depicted by the graph in Fig.~\ref{fig:example2}. As the problem is
deterministic, picking a control will result in a unique next state
and each sequence of policies will result in a unique trajectory. In
this setting, solving the SDP reduces to finding a control sequence
that \emph{minimises the sum of costs along the single resulting
  trajectory}. In Fig.~\ref{fig:example2} this is the sequence
$CAB(D)$.

\end{exm}
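The state and control spaces of this scheduling example can be sketched as follows. This is our own simply typed rendering; in the BJI-framework the control space is a dependent type indexed by the state, which rules out ill-formed state/control combinations at the type level:

```haskell
data Op = A | B | C | D deriving (Eq, Show)

-- A state is the sequence of operations already performed.
type State = [Op]

-- Controls available in a state: operations not yet performed,
-- respecting the precedence constraints B-after-A and D-after-C.
controls :: State -> [Op]
controls done =
  [ op | op <- [A, B, C, D]
       , op `notElem` done
       , allowed op ]
  where
    allowed B = A `elem` done
    allowed D = C `elem` done
    allowed _ = True
```

For instance, `controls []` yields only the operations $A$ and $C$, matching the observation about decision step 0 above.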