% -*-Latex-*-
%if False
> module MonadicSDP
%endif
\section{Finite-horizon Sequential Decision Problems}
\label{section:SDPs}
In deterministic, non-deterministic and stochastic finite-horizon
SDPs, a decision maker seeks
to control the evolution of a \emph{(dynamical) system} at a finite number of
\emph{decision steps} by selecting certain \emph{controls} in sequence,
one after the other. The controls
available to the decision maker at a given decision step typically
depend on the \emph{state} of the system at that step.
In \emph{deterministic} problems, selecting a control in a state at
decision step |t : Nat| determines a unique next state at decision step
|t + 1| through a given \emph{transition function}.
%
In \emph{non-deterministic} problems, the transition function yields a
whole set of \emph{possible} states at the next decision step.
%
In \emph{stochastic} problems, the transition function yields a
\emph{probability distribution} on states at the next decision step.
The notion of \emph{monadic} problem generalises that of deterministic,
non-deterministic and stochastic problem through a transition
function that yields an |M|-structure of next states where |M| is a
monad.
%
For example, the identity monad can be applied to model deterministic
systems. Non-deterministic systems can be represented in terms of
transition functions that return lists (or some other representations
of sets) of next states. Stochastic systems can be represented in
terms of probability distribution monads \citep{giry1981,
DBLP:journals/jfp/ErwigK06, DBLP:journals/scp/AudebaudP09,
DBLP:journals/tcs/Jacobs11}.
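These three instances can be sketched in Haskell as follows; this is a
minimal illustration under simple types, and the names |Dist|, |S| and
the |next*| functions are ours, not the framework's:

```haskell
import Data.Functor.Identity (Identity (..))

-- A toy probability distribution monad: outcomes paired with weights.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

instance Functor Dist where
  fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

instance Applicative Dist where
  pure x = Dist [(x, 1)]
  Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

instance Monad Dist where
  Dist xs >>= k = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x) ]

data S = Good | Bad deriving (Eq, Show)

nextDet :: S -> Identity S        -- deterministic: exactly one next state
nextDet s = Identity s

nextNonDet :: S -> [S]            -- non-deterministic: possible next states
nextNonDet Good = [Good, Bad]
nextNonDet Bad  = [Bad]

nextStoch :: S -> Dist S          -- stochastic: a distribution on next states
nextStoch Good = Dist [(Good, 9 / 10), (Bad, 1 / 10)]
nextStoch Bad  = Dist [(Bad, 1)]
```

All three transition functions share the shape |X -> M X|, differing only
in the choice of the monad |M|.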
The uncertainty monad, the states, the controls and the transition function
define what is often called a \emph{decision process}.
The idea of sequential decision problems is that each single decision
yields a \emph{reward} and these rewards add up to a \emph{total
reward} over all decision steps. Rewards are often represented by
values of a numeric type, and added up using the canonical addition.
If the transition function, and thus
the evolution of the system, is not deterministic, then the resulting
possible total rewards need to be aggregated to yield a single outcome
value.
%
In stochastic SDPs, evolving the
underlying stochastic system leads to a probability distribution
on total rewards which is usually aggregated using the familiar
\emph{expected value} measure. The value thus obtained is called the
\emph{expected total reward} \citep[ch.~4.1.2]{puterman2014markov} and
its role is central: it is the quantity that is to be optimised in
an SDP.
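For non-deterministic evolutions, a natural aggregation is the worst-case
value; for stochastic ones, the expected value. Both can be sketched as
follows (illustrative Haskell definitions with |Double|-valued rewards,
not the framework's generic ones):

```haskell
-- Worst-case measure for non-deterministic outcomes: the least possible
-- total reward over all possible evolutions.
worstCase :: [Double] -> Double
worstCase = minimum

-- Expected-value measure for stochastic outcomes, here represented as a
-- finite list of (total reward, probability) pairs.
expectedValue :: [(Double, Double)] -> Double
expectedValue rps = sum [ r * p | (r, p) <- rps ]
```

Both functions collapse an uncertainty structure of total rewards into a
single value of the same numeric type.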
In monadic SDPs, the measure is generic, i.e. it is not fixed in advance
but has to be given as part of the specification of a concrete problem.
Therefore we will generalise the notion of \emph{expected total reward} to
a corresponding notion for monadic SDPs that, by analogy, we call the
\emph{measured total reward} (see Sec.~\ref{section:preparation}).
\emph{Solving a stochastic SDP} consists in \emph{finding a list of rules
for selecting controls that maximises the expected total reward for
|n| decision steps when starting at
decision step |t|}.
Similarly, we define that \emph{solving a monadic SDP} consists in
\emph{finding a list of rules for selecting controls that maximises
the measured total reward}.
This means that, when starting from any initial state at decision step
|t|, following the computed list of rules for selecting controls will
result in a value that is maximal as the measure of the sums of rewards
along all possible trajectories rooted in that initial state.
Equivalently, rewards can instead be considered as \emph{costs}
that need to be \emph{minimised}. This dual perspective is taken e.g.
in \citet{bertsekas1995}. In the subsequent sections we will follow
the terminology of the BJI-framework and of \citet{puterman2014markov}
and speak of ``rewards'', but our second example below will illustrate
the ``cost'' perspective.
In mathematical theories of optimal control, the rules for selecting
controls are called \emph{policies}. A \emph{policy} for a decision
step is simply a function that maps each possible state to a
control. As mentioned above, the controls available in a given
state typically depend on that state, thus policies are dependently typed
functions. A list of such policies is called a \emph{policy sequence}.
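In a simply-typed Haskell sketch, which cannot express the dependency of
controls on states that the framework captures with dependent types,
policies and policy sequences look as follows (the |State| and |Control|
types are those of the toy climate example below):

```haskell
data State   = Good | Bad  deriving (Eq, Show)
data Control = Low  | High deriving (Eq, Show)

-- A policy maps each state to a control; a policy sequence is a list of
-- policies, one per decision step.
type Policy    = State -> Control
type PolicySeq = [Policy]

-- The constant policy that always selects Low emissions:
alwaysLow :: Policy
alwaysLow _ = Low

-- A policy sequence for three decision steps:
lowForThreeSteps :: PolicySeq
lowForThreeSteps = replicate 3 alwaysLow
```

With dependent types, |Policy| instead maps each state |x| to an element
of the type of controls available in |x|.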
The central idea underlying backward induction is to compute a globally
optimal solution of a multi-step SDP incrementally by solving
local optimisation problems at each decision step. This is captured
by \emph{Bellman's principle}: \emph{Extending an optimal
policy sequence with an optimal policy yields again an optimal policy
sequence}. However, as we will see in Sec.~\ref{subsection:counterEx},
one has to check carefully whether backward induction is indeed
applicable to a given SDP.
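The incremental computation can be sketched as follows in a simply-typed
setting, specialised to |M = List| (non-determinism) with the worst-case
measure; the function names and the |Int|-valued rewards are illustrative
assumptions, not the framework's generic definitions:

```haskell
import Data.List (maximumBy)
import Data.Ord  (comparing)

-- Measured total reward of following a policy sequence from a state,
-- aggregated over all possible trajectories with the worst case (minimum):
val :: (s -> c -> [s])            -- non-deterministic transition function
    -> (s -> c -> s -> Int)       -- reward for a transition
    -> [s -> c]                   -- policy sequence
    -> s -> Int
val _    _      []       _ = 0
val next reward (p : ps) s =
  let c = p s
  in minimum [ reward s c s' + val next reward ps s' | s' <- next s c ]

-- Backward induction: an optimal policy sequence for n steps is obtained
-- by prepending an optimal one-step policy to an optimal policy sequence
-- for n - 1 steps (Bellman's principle).
bi :: (s -> [c])                  -- controls available in a state
   -> (s -> c -> [s])             -- transition function
   -> (s -> c -> s -> Int)        -- reward function
   -> Int                         -- number of steps, given explicitly
   -> [s -> c]
bi _     _    _      0 = []
bi ctrls next reward n =
  let ps  = bi ctrls next reward (n - 1)
      p s = snd (maximumBy (comparing fst)
                   [ (val next reward (const c : ps) s, c) | c <- ctrls s ])
  in p : ps
```

Note how the two crucial features surface in the sketch: |bi| recurses on
an explicitly given number of steps, and |val| only terminates because
each |next s c| is a finite list.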
Two features are crucial for finite-horizon, monadic SDPs to be
solvable with the BJI-framework that we study in this
paper: (1) the number of decision steps has to be given explicitly as
input to the backward induction and (2) at each decision step, the
number of possible next states has to be \emph{finite}.
%
While (2) is a necessary condition for backward induction to be
computable, (1) is a genuine limitation of the BJI-framework: in many SDPs,
for example in a game of tic-tac-toe, the number of decision steps is
bounded but not known a priori.
Before we discuss the BJI-framework in the next section, we illustrate
the notion of sequential decision problem with two simple examples,
one in which the purpose is to maximise rewards and one in which the
purpose is to minimise costs. Rewards and costs in these examples are
just natural numbers and are summed up with ordinary addition. The
first example is a non-deterministic SDP. Although it is somewhat
oversimplified, it has the advantage of being tractable for
computations by hand while still being sufficient as a basis for
illustrations in
sections~\ref{section:framework}--\ref{section:conditions}. The second
example is a deterministic SDP that stands for the important class of
scheduling SDPs. This problem highlights why dependent types are necessary
to model state and control spaces accurately. Since the state and
control spaces in these simple examples are finite, the transition
functions can be described by directed graphs. These are given in
Fig.~\ref{fig:examplesGraph}.
\begin{exm}[\emph{A toy climate problem}]
\label{subsection:example1SDPs}
\begin{figure}
\centering
\framebox{
\parbox{\textwidth-0.5cm}{
\vspace{0.2cm}
\begin{subfigure}[b]{.35\textwidth-0.55cm}
\centering
%\scalebox{0.7}{
\includegraphics[height=0.26\textheight]{img/fig1a.eps}
% \small
% \input{NFAexample.tex}
%}
\caption{Example~1}
\label{fig:example1}
\end{subfigure}
\begin{subfigure}[b]{.65\textwidth}
\centering
%\scalebox{0.7}{
\includegraphics[height=0.25\textheight]{img/fig1b.eps}
% \input{SchedulingExample.tex}
%\vspace{0.3cm}
% }
\caption{Example~2}
\label{fig:example2}
\end{subfigure}
\begin{small}
\caption{Transition graphs for the example SDPs described in
Sec.~\ref{section:SDPs}. The edge labels denote pairs
$\textit{control} \mid \textit{reward}$ for the associated
transitions. In the first example, state and control spaces
are constant over time; we have therefore omitted the temporal
dimension.
\label{fig:examplesGraph}
}
\end{small}
}
}
\end{figure}
Our first example is a variant of a
stochastic climate science SDP studied in \citep{esd-9-525-2018},
stripped down to a simple non-deterministic SDP. At every decision
step, the world may be in
one of two \emph{states}, namely \emph{Good} or \emph{Bad}, and the
\emph{controls} determine whether a \emph{Low} or a \emph{High} amount
of greenhouse gases is emitted into the
atmosphere. If the world is
in the \emph{Good} state, choosing \emph{Low} emissions will definitely
keep the world in the \emph{Good} state, but the result of choosing
\emph{High} emissions is non-deterministic: the world may either stay
in the \emph{Good} state or tip to the \emph{Bad} state. Similarly, in the \emph{Bad}
state, \emph{High} emissions will definitely keep the world in
\emph{Bad}, while with \emph{Low} emissions it might either stay in
\emph{Bad} or recover and return to the \emph{Good} state. The
transitions just described define a non-deterministic \emph{transition
function}. The \emph{rewards}
associated with each transition are determined by the respective control
and reached state. Now we can formulate an SDP: \emph{``Which
policy sequence will maximise the worst-case sum of rewards along all
possible trajectories when taking |n| decisions starting at decision
step |t|?''}. In this simple example the question is not hard to
answer: always choose \emph{Low} emissions, independent of decision step and
state. The \emph{optimal policy sequence} for any |n| and |t| would
thus consist of |n| constant \emph{Low} functions. But in a more
realistic example the
situation will be more involved: every option will have its benefits and
drawbacks encoded in a more complicated reward function, uncertainties
might come with different probabilities, there might be intermediate
states, different combinations of control options etc. For more along
these lines we refer the interested reader to~\citep{esd-9-525-2018}.
\end{exm}
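The non-deterministic transition function of this example can be written
down directly (a Haskell sketch; the reward labels of
Fig.~\ref{fig:example1} are omitted):

```haskell
data State   = Good | Bad  deriving (Eq, Show)
data Control = Low  | High deriving (Eq, Show)

-- The transition function of the toy climate problem:
next :: State -> Control -> [State]
next Good Low  = [Good]        -- Low in Good definitely keeps the world Good
next Good High = [Good, Bad]   -- High in Good may stay Good or tip to Bad
next Bad  High = [Bad]         -- High in Bad definitely keeps the world Bad
next Bad  Low  = [Bad, Good]   -- Low in Bad may stay Bad or recover to Good
```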
\begin{exm}[\emph{Scheduling}]
\label{subsection:example2SDPs}
Scheduling
problems serve as canonical examples in control theory textbooks. The
one we present here is a slightly modified version of
\citep[Example~1.1.2]{bertsekas1995}.
Think of some machine in a factory that can perform different
operations, say $A, B, C$ and $D$. Each of these operations is supposed
to be performed once. The machine can only perform one operation at a
time, thus an order has to be fixed in which to perform the
operations. Setting the machine up for each operation incurs a specific
cost that might vary according to which operation has been performed
before. Moreover, operation $B$ can only be performed after operation
$A$ has already been completed, and operation $D$ only after operation
$C$. It suffices to fix the order in which the first three operations
are to be performed as this uniquely determines which will be the
fourth task. The aim is now to choose an order that minimises
the total cost of performing the four operations.
This situation can be modelled as a problem with three
decision steps as follows: the \emph{states at each step} are the sequences of
operations already performed, with the empty sequence at step 0. The
\emph{controls at a decision step and in a state} are the operations which have
not already been performed at previous steps and which are permitted in
that state. For example, at decision step 0, only controls $A$ and $C$
are available because of the above constraint on performing $B$ and
$D$.
%
The \emph{transition} and \emph{cost} functions of the problem are depicted by
the graph in Fig.~\ref{fig:example2}. As the problem is deterministic,
picking a control will result in a unique next state and each sequence
of policies will result in a unique trajectory. In this setting,
solving the SDP
reduces to finding a control sequence that \emph{minimises the sum of costs
along the single resulting trajectory}. In Fig.~\ref{fig:example2} this
is the sequence $CAB(D)$.
\end{exm}
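The state and control spaces of this scheduling problem can be sketched
in Haskell as follows: states are the sequences of operations already
performed, and the |controls| function enforces the precedence
constraints ($B$ after $A$, $D$ after $C$). The costs are read off
Fig.~\ref{fig:example2} and therefore omitted here:

```haskell
data Op = A | B | C | D deriving (Eq, Show)

-- A state is the sequence of operations performed so far.
type State = [Op]

-- The controls available in a state: operations not yet performed whose
-- precedence constraints (B after A, D after C) are satisfied.
controls :: State -> [Op]
controls done =
  [ o | o <- [A, B, C, D], o `notElem` done, allowed o ]
  where
    allowed B = A `elem` done
    allowed D = C `elem` done
    allowed _ = True
```

At step 0 only $A$ and $C$ are available, as stated above; note that the
control space genuinely varies with the state, which is why the framework
models controls as a state-dependent type.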