% -*-Latex-*-

%if False

> module MonadicSDP

%endif


\section{Finite-horizon Sequential Decision Problems}
\label{section:SDPs}

In deterministic, non-deterministic and stochastic finite-horizon
SDPs, a decision maker seeks
to control the evolution of a \emph{(dynamical) system} at a finite number of
\emph{decision steps} by selecting certain \emph{controls} in sequence,
one after the other. The controls
available to the decision maker at a given decision step typically
depend on the \emph{state} of the system at that step.

In \emph{deterministic} problems, selecting a control in a state at
decision step |t : Nat|, determines a unique next state at decision step
|t + 1| through a given \emph{transition function}.
%
In \emph{non-deterministic} problems, the transition function yields a
whole set of \emph{possible} states at the next decision step.
%
In \emph{stochastic} problems, the transition function yields a
\emph{probability distribution} on states at the next decision step.
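
Schematically, the three flavours differ only in the result type of the
transition function. The following type-level sketch makes this explicit;
the names are illustrative, and the list-of-pairs representation of finite
probability distributions is a deliberately naive stand-in for a proper
distribution monad:

> -- Deterministic: a control in a state determines a unique next state.
> NextDet : (state, ctrl : Type) -> Type
> NextDet state ctrl = state -> ctrl -> state
>
> -- Non-deterministic: a list represents the set of possible next states.
> NextNonDet : (state, ctrl : Type) -> Type
> NextNonDet state ctrl = state -> ctrl -> List state
>
> -- A naive representation of finite probability distributions as
> -- value-probability pairs, used only for this sketch.
> SimpleProb : Type -> Type
> SimpleProb a = List (a, Double)
>
> -- Stochastic: a control in a state determines a distribution on next states.
> NextStoch : (state, ctrl : Type) -> Type
> NextStoch state ctrl = state -> ctrl -> SimpleProb state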

The notion of \emph{monadic} problem generalises that of deterministic,
non-deterministic and stochastic problem through a transition
function that yields an |M|-structure of next states where |M| is a
monad.
%
For example, the identity monad can be applied to model deterministic
systems. Non-deterministic systems can be represented in terms of
transition functions that return lists (or some other representations
of sets) of next states. Stochastic systems can be represented in
terms of probability distribution monads \citep{giry1981,
  DBLP:journals/jfp/ErwigK06, DBLP:journals/scp/AudebaudP09,
  DBLP:journals/tcs/Jacobs11}.
The uncertainty monad, the states, the controls and the transition function
define what is often called a \emph{decision process}.
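
All three cases are thus instances of one pattern: a transition function
into an |M|-structure of possible next states. A corresponding type-level
sketch, now also making explicit that state and control spaces may depend
on the decision step and that controls may depend on the state, could read
as follows (the names |NextT|, |m|, |x| and |y| are placeholders, not the
components of the framework introduced in Sec.~\ref{section:framework}):

> -- Monadic: m is the uncertainty monad, x t the state space at
> -- decision step t and y t st the control space in state st.
> -- Instantiating m with the identity monad, List or SimpleProb
> -- recovers the deterministic, non-deterministic and stochastic cases.
> NextT : (m : Type -> Type) -> (x : Nat -> Type) ->
>         (y : (t : Nat) -> x t -> Type) -> Type
> NextT m x y = (t : Nat) -> (st : x t) -> y t st -> m (x (S t))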

The idea of sequential decision problems is that each single decision
yields a \emph{reward} and these rewards add up to a \emph{total
  reward} over all decision steps. Rewards are often represented by
  values of a numeric type, and added up using the canonical addition.
  If the transition function and thus
the evolution of the system is not deterministic, then the resulting
possible total rewards need to be aggregated to yield a single outcome
value.
%
In stochastic SDPs, evolving the
underlying stochastic system leads to a probability distribution
on total rewards which is usually aggregated using the familiar
\emph{expected value} measure. The value thus obtained is called the
\emph{expected total reward} \citep[ch.~4.1.2]{puterman2014markov} and
its role is central: It is the quantity that is to be optimised in
an SDP.

In monadic SDPs, the measure is generic, i.e. it is not fixed in advance
but has to be given as part of the specification of a concrete problem. 
Therefore we will generalise the notion of \emph{expected total reward} to
a corresponding notion for monadic SDPs that, by analogy, we call
\emph{measured total reward} (see Sec.~\ref{section:preparation}).
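
A measure in this sense is just a function that aggregates an |M|-structure
of values into a single value. A minimal sketch, with illustrative names and
natural numbers standing in for a generic type of reward values, could read:

> -- A measure aggregates an m-structure of values into a single value.
> MeasureT : (m : Type -> Type) -> (val : Type) -> Type
> MeasureT m val = m val -> val
>
> -- A worst-case measure for the non-deterministic (list) case; the
> -- value returned for the empty list is an arbitrary default.
> worstCase : List Nat -> Nat
> worstCase []         = 0
> worstCase (v :: vs)  = foldl min v vs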

\emph{Solving a stochastic SDP} consists in \emph{finding a list of rules
  for selecting controls that maximises the expected total reward for
  |n| decision steps when starting at
  decision step |t|}.
Similarly, we define that \emph{solving a monadic SDP} consists in
\emph{finding a list of rules for selecting controls that maximises
  the measured total reward}.
This means that when starting from any initial state at decision step
|t|, following the computed list of rules for selecting controls will
result in a value that is maximal as a measure of the sums of rewards
along all possible trajectories rooted in that initial state.

Equivalently, rewards can instead be considered as \emph{costs}
that need to be \emph{minimised}. This dual perspective is taken e.g.
in \citep{bertsekas1995}. In the subsequent sections we will follow
the terminology of the BJI-framework and \citep{puterman2014markov}
and speak of ``rewards'', but our second example below will illustrate
the ``cost'' perspective. 

In mathematical theories of optimal control, the rules for selecting
controls are called \emph{policies}. A \emph{policy} for a decision
step is simply a function that maps each possible state to a
control. As mentioned above, the controls available in a given
state typically depend on that state; thus policies are dependently typed
functions. A list of such policies is called a \emph{policy sequence}.
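
In Idris, these notions can be sketched as follows; |x| and |y| are again
placeholder state and control spaces, and the framework's own definitions
appear in Sec.~\ref{section:framework}:

> -- A policy for step t maps each state to a control admissible in it.
> PolicyT : (x : Nat -> Type) -> (y : (t : Nat) -> x t -> Type) ->
>           (t : Nat) -> Type
> PolicyT x y t = (st : x t) -> y t st
>
> -- A policy sequence for n decision steps starting at step t.
> data PolicySeqT : (x : Nat -> Type) ->
>                   (y : (t : Nat) -> x t -> Type) ->
>                   (t, n : Nat) -> Type where
>   Nil  : PolicySeqT x y t Z
>   (::) : PolicyT x y t -> PolicySeqT x y (S t) n -> PolicySeqT x y t (S n)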

The central idea underlying backward induction is to compute a globally
optimal solution of a multi-step SDP incrementally by solving
local optimisation problems at each decision step. This is captured
by \emph{Bellman's principle}: \emph{Extending an optimal
  policy sequence with an optimal policy yields again an optimal policy
  sequence}. However, as we will see in Sec.~\ref{subsection:counterEx},
  one has to check carefully whether backward induction is indeed
  applicable to a given SDP.

Two features are crucial for finite-horizon, monadic SDPs to be
solvable with the BJI-framework that we study in this  
paper: (1) the number of decision steps has to be given explicitly as
input to the backward induction and (2) at each decision step, the
number of possible next states has to be \emph{finite}.
%
While (2) is a necessary condition for backward induction to be
computable, (1) is a genuine limitation of the BJI-framework: in many SDPs,
for example in a game of tic-tac-toe, the number of decision steps is
bounded but not known a priori.
  
Before we discuss the BJI-framework in the next section, we illustrate
the notion of sequential decision problem with two simple examples,
one in which the purpose is to maximise rewards and one in which the
purpose is to minimise costs. Rewards and costs in these examples are
just natural numbers and are summed up with ordinary addition. The
first example is a non-deterministic SDP. Although it is somewhat
oversimplified, it has the advantage of being tractable for
computations by hand while still being sufficient as basis for
illustrations in
sections~\ref{section:framework}--\ref{section:conditions}. The second
example is a deterministic SDP that stands for the important class of
scheduling SDPs. This problem highlights why dependent types are necessary
to model state and control spaces accurately. As the state and control
spaces in these simple examples are finite, the transition functions
can be described by directed graphs. These are given in
Fig.~\ref{fig:examplesGraph}.

\begin{exm}[\emph{A toy climate problem}]
\label{subsection:example1SDPs}

\begin{figure}
    \centering
    \framebox{
    \parbox{\textwidth-0.5cm}{
    \vspace{0.2cm}


    \begin{subfigure}[b]{.35\textwidth-0.55cm}
      \centering
      %\scalebox{0.7}{

        \includegraphics[height=0.26\textheight]{img/fig1a.eps}
        
      % \small
      % \input{NFAexample.tex}

      %}
      \caption{Example~1}
      \label{fig:example1}
    \end{subfigure}
    \begin{subfigure}[b]{.65\textwidth}
      \centering
      %\scalebox{0.7}{

        
        \includegraphics[height=0.25\textheight]{img/fig1b.eps}
        
        % \input{SchedulingExample.tex}
        %\vspace{0.3cm}
       % }

      \caption{Example~2}
      \label{fig:example2}
    \end{subfigure}


     \begin{small}
       \caption{Transition graphs for the example SDPs described in
        Sec.~\ref{section:SDPs}. The edge labels denote pairs
        $\textit{control} \mid  \textit{reward}$ for the associated
        transitions. In the first example, state and control spaces
        are constant over time; we have therefore omitted the temporal
        dimension.
        \label{fig:examplesGraph}
      }
      \end{small}
    }
  }

\end{figure}

Our first example is a variant of a
stochastic climate science SDP studied in \citep{esd-9-525-2018},
stripped down to a simple non-deterministic SDP. At every decision
step, the world may be in
one of two \emph{states}, namely \emph{Good} or \emph{Bad}, and the
\emph{controls} determine whether a \emph{Low} or a \emph{High} amount
of greenhouse gases is emitted into the
atmosphere. If the world is
in the \emph{Good} state, choosing \emph{Low} emissions will definitely
keep the world in the \emph{Good} state, but the result of choosing \emph{High}
emissions is non-deterministic: the world may either stay in the
\emph{Good} state or tip to the \emph{Bad} state. Similarly, in the \emph{Bad}
state, \emph{High} emissions will definitely keep the world in
\emph{Bad}, while with \emph{Low} emissions it might either stay in
\emph{Bad} or recover and return to the \emph{Good} state. The
transitions just described define a non-deterministic \emph{transition
  function}. The \emph{rewards}
associated with each transition are determined by the respective control
and reached state. Now we can formulate an SDP: \emph{``Which
policy sequence will maximise the worst-case sum of rewards along all
possible trajectories when taking |n| decisions starting at decision
step |t|?''}. In this simple example the question is not hard to
answer: always choose \emph{Low} emissions, independent of decision step and
state. The \emph{optimal policy sequence} for any |n| and |t| would
thus consist of |n| constant \emph{Low} functions. But in a more
realistic example the
situation will be more involved: every option will have its benefits and
drawbacks encoded in a more complicated reward function; uncertainties
might come with different probabilities; there might be intermediate
states, different combinations of control options, etc. For more along
these lines we refer the interested reader to~\citep{esd-9-525-2018}.
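
For concreteness, the states, controls and non-deterministic transition
function of the toy problem, as well as the constant \emph{Low} policy, can
be written down directly. The following is a minimal sketch with
illustrative names; the encoding actually used with the framework in the
following sections differs in detail:

> data ClimateState = Good | Bad
>
> data Emission = Low | High
>
> -- Non-deterministic transition function: a list collects the
> -- possible next states.
> climateNext : ClimateState -> Emission -> List ClimateState
> climateNext Good Low   = [Good]
> climateNext Good High  = [Good, Bad]
> climateNext Bad  Low   = [Bad, Good]
> climateNext Bad  High  = [Bad]
>
> -- The constant Low policy; an optimal policy sequence for n decision
> -- steps consists of n copies of it.
> constLow : ClimateState -> Emission
> constLow _ = Low
>
> optimalPolicies : (n : Nat) -> List (ClimateState -> Emission)
> optimalPolicies n = replicate n constLow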

\end{exm}

\begin{exm}[\emph{Scheduling}]
\label{subsection:example2SDPs}

Scheduling
problems serve as canonical examples in control theory textbooks. The
one we present here is a slightly modified version of
\citep[Example~1.1.2]{bertsekas1995}.

Think of some machine in a factory that can perform different
operations, say $A, B, C$ and $D$. Each of these operations is supposed
to be performed once. The machine can only perform one operation at a
time, thus an order has to be fixed in which to perform the
operations. Setting the machine up for each operation incurs a specific
cost that might vary according to which operation has been performed
before. Moreover, operation $B$ can only be performed after operation
$A$ has already been completed, and operation $D$ only after operation
$C$. It suffices to fix the order in which the first three operations
are to be performed as this uniquely determines which will be the
fourth operation. The aim is now to choose an order that minimises
  the total cost of performing the four operations.

This situation can be modelled as a problem with three
decision steps: The \emph{states at each step} are the sequences of
operations already performed, with the empty sequence at step 0. The
\emph{controls at a decision step and in a state} are the operations which have
not already been performed at previous steps and which are permitted in
that state. For example, at decision step 0, only controls $A$ and $C$
are available because of the above constraint on performing $B$ and
$D$.
%
The \emph{transition} and \emph{cost} functions of the problem are depicted by
the graph in Fig.~\ref{fig:example2}. As the problem is deterministic,
picking a control will result in a unique next state and each sequence
of policies will result in a unique trajectory. In this setting,
solving the SDP
reduces to finding a control sequence that \emph{minimises the sum of costs
along the single resulting trajectory}. In Fig.~\ref{fig:example2} this
is the sequence $CAB(D)$.
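
The state-dependence of the admissible controls is what calls for dependent
types when encoding such problems. As a first, non-dependently typed
approximation (a sketch with illustrative names in which a state is simply
the list of operations performed so far), the admissible operations can be
computed as follows:

> data Op = OpA | OpB | OpC | OpD
>
> Eq Op where
>   OpA == OpA = True
>   OpB == OpB = True
>   OpC == OpC = True
>   OpD == OpD = True
>   _   == _   = False
>
> -- Operations admissible in a state: those not yet performed whose
> -- prerequisites (OpA before OpB, OpC before OpD) have been met.
> admissible : List Op -> List Op
> admissible done = filter ok [OpA, OpB, OpC, OpD] where
>   ok : Op -> Bool
>   ok OpB = elem OpA done && not (elem OpB done)
>   ok OpD = elem OpC done && not (elem OpD done)
>   ok op  = not (elem op done)

In a dependently typed encoding, this state-dependence is instead reflected
in the type of controls itself, so that policies can, by construction, only
select admissible operations.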

\end{exm}