# «Algorithmic Information Theory and Foundations of Probability Alexander Shen LIF Marseille, CNRS & Univ. Aix–Marseille. On leave from IITP, RAS, ...»

Algorithmic Information Theory and

Foundations of Probability

Alexander Shen

LIF Marseille, CNRS & Univ. Aix–Marseille. On leave from IITP, RAS, Moscow.

Supported in part by NAFIT ANR-08-EMER-008-01 grant.

E-mail: alexander.shen@lif.univ-mrs.fr

Abstract. The question how and why mathematical probability theory

can be applied to the “real world” has been debated for centuries. We

try to survey the role of algorithmic information theory (Kolmogorov

complexity) in this debate.

1 Probability theory paradox One often describes the natural sciences framework as follows: a hypothesis is used to predict something, and the prediction is then checked against the observed actual behavior of the system; if there is a contradiction, the hypothesis needs to be changed.

Can we include probability theory in this framework? A statistical hypothesis (say, the assumption of a fair coin) should be then checked against the experimental data (results of coin tossing) and rejected if some discrepancy is found.

However, there is an obvious problem: The fair coin assumption says that in a series of, say, 1000 coin tossings all the 21000 possible outcomes (all 21000 bit strings of length 1000) have the same probability 2−1000. How can we say that some of them contradict the assumption while other do not?

The same paradox can be explained in a diﬀerent way. Consider a casino that wants to outsource the task of card shuﬄing to a special factory that produced shrink-wrapped well shuﬄed decks of cards. This factory would need some quality control department. It looks at the deck before shipping it to the customer, blocks some “badly shuﬄed” decks and approves some others as “well shuﬄed”.

But how is it possible if all n! orderings of n cards have the same probability?

2 Current best practice Whatever the philosophers say, statisticians have to perform their duties. Let us try to provide a description of their current “best practice” (see [7, 8]).

A. How a statistical hypothesis is applied. First of all, we have to admit that probability theory makes no predictions but only gives recommendations: if the probability (computed on the basis of the statistical hypothesis) of an event A is much smaller than the probability of an event B, then the possibility of the event B must be taken into consideration to a greater extent than the possibility of the event A (assuming the consequences are equally grave). For example, if the probability of A is smaller than the probability of being killed on the street by a meteorite, we usually ignore A completely (since we have to ignore event B anyway in our everyday life).

Borel [2] describes this principle as follows: “... Fewer than a million people live in Paris. Newspapers daily informus about the strange events or accidents that happen to some of them. Our life would be impossible if we were afraid of all adventures we read about. So one can say that from a practical viewpoint we can ignore events with probability less that one millionth... Often trying to avoid something bad we are confronted with even worse... To avoid this we must know well the probabilities of diﬀerent events” (Russian ed., pp. 159–160).

B. How a statistical hypothesis is tested. Here we cannot say na¨ ıvely that if we observe some event that has negligible probability according to our hypothesis, we reject this hypothesis. Indeed, this would mean that any 1000-bit sequence of the outcomes would make the fair coin assumption rejected (since this speciﬁc seqeunce has negligible probability 2−1000 ).

Here algorithmic information theory comes into play: We reject the hypothesis if we observe a simple event that has negligible probability according to this hypothesis. For example, if coin tossing produces thousand tails, this event is simple and has negligible probability, so we don’t believe the coin is fair. Both conditions (“simple” and “negligible probability”) are important: the event “the ﬁrst bit is a tail” is simple but has probability 1/2, so it does not discredit the coin. On the other hand, every sequence of outcomes has negligible probability 2−1000, but if it is not simple, its appearance does not discredits the fair coin assumption.

Often both parts of this scheme are united into a statement “events with small probabilities do not happen”. For example, Borel writes: “One must not be afraid to use the word “certainty” to designate a probability that is suﬃciently close to 1” ([3], Russian translation, p. 7). Sometimes this statement is called “Cournot principle”. But we prefer to distinguish between these two stages, because for the hypothesis testing the existence of a simple description of an event with negligible probability is important, and for application of the hypothesis it seems unimportant. (We can expect, however, that events interesting to us have simple descriptions because of their interest.)

**3 Simple events and events speciﬁed in advance**

Unfortunately, this scheme remains not very precise: the Kolmogorov complexity of an object x (deﬁned as the minimal length of the program that produces x) depends on the choice of programming language; we need also to ﬁx some way to describe the events in question. Both choices lead only to an O(1) change asymptotically; however, strictly speaking, due to this uncertainty we cannot say that one event has smaller complexity than the other one. (The word “negligible” is also not very precise.) On the other hand, the scheme described, while very vague, seems to be the best approximation to the current practice.

One of the possible ways to eliminate complexity in this picture is to say that a hypothesis is discredited if we observe a very unprobable event that was speciﬁed in advance (before the experiment). Here we come to the following question. Imagine that you make some experiment and get a sequence of thousand bits that looks random at ﬁrst. Then somebody comes and says “Look, if we consider every third bit in this sequence, the zeros and ones alternate”. Will you still believe in the fair coin hypothesis? Probably not, even if you haven’t thought about this event before looking at the sequence: the event is so simple that one could think about it. In fact, one may consider the union of all simple events that have small probability, and it still has small probability (if the bound for the complexity of a simple event is small compared to the number of coin tossing involved, which is a reasonable condition anyway). And this union can be considered as speciﬁed before the experiment (e.g., in this paper).

On the other hand, if the sequence repeats some other sequence observed earlier, we probably won’t believe it is obtained by coin tossing even if this earlier sequence had high complexity. One may explain this opinion saying the the entire sequence of observations is simple since it contains repetitions; however, the ﬁrst observation may be not covered by any probabilistic assumption. This could be taked into account by considering the conditional complexity of the event (with respect to all information available before the experiment).

The conclusion: we may remove one problematic requirement (being “simple” in a not well speciﬁed sense) and replace it by another problematic one (being speciﬁed before the observation).

**4 Frequency approach**

The most natural and common explanation of the notion of probability says that probability is the limit value of frequencies observed when the number of repetitions tends to inﬁnity. (This approach was advocated as the only possible basis for probability theory by Richard von Mises.) However, we cannot observe inﬁnite sequences, so the actual application of this deﬁnition should somehow deal with ﬁnite number of repetitions. And for ﬁnite number of repetitions our claim is not so strong: we do not guarantee that frequency of tails for a fair coin is exactly 1/2; we say only that it is highly improbable that it deviates signiﬁcantly from 1/2. Since the words “highly improbably” need to be interpreted, this leads to some kind of logical circle that makes the frequency approach much less convincing; to get out of this logical circle we need some version of Cournot principle.

Technically, the frequency approach can be related to the principles explained above. Indeed, the event “the number of tails in a 1 000 000 coin tossings deviates from 500 000 more than by 100 000” has a simple description and very small probability, so we reject the fair coin assumption if such an event happens (and ignore the dangers related to this event if we accept the fair coin assumption). In this way the belief that frequency should be close to probability (if the statistical hypothesis is chosen correctly) can be treated as the consequence of the principles explained above.

**5 Dynamical and statistical laws**

We have described how the probability theory is usually applied. But the fundamental question remains: well, probability theory describes (to some extent) the behavior of a symmetric coin or dice and turns out to be practically useful in many cases. But is it a new law of nature or some consequence of the known dynamical laws of classical mechanics? Can we somehow “prove” that a symmetric dice indeed has the same probabilities for all faces (if the starting point is high enough and initial linear and rotation speeds are high enough)?

Since it is not clear what kind of “proof” we would like to have, let us put the question in a more practical way. Assume that we have a dice that is not symmetric and we know exactly the position of its center of gravity. Can we use the laws of mechanics to ﬁnd the probabilities of diﬀerent outcomes?

It seems that this is possible, at least in principle. The laws of mecahnics determine the behavior of a dice (and therefore the outcome) if we know the initial point in the phase space (initial position and velocity) precisely. The phase space, therefore, is splitted into six parts that correspond to six outcomes.

In this sense there is no uncertainty or probabilities up to now. But these six parts are well mixed since very small modiﬁcations aﬀect the result, so if we consider a small (but not very small) part of the phase space around the initial conditions and any probability distribution on this part whose density does not change drastically, the measures of the six parts will follow the same proportion.

The last sentence can be transformed into a rigorous mathematical statement if we introduce speciﬁc assumptions about the size of the starting region in the phase space and variations of the density of the probability distribution on it. It then can be proved. Probably it is a rather diﬃcult mathematical problem not solved yet, but at least theoretically the laws of mechanics allow us to compute the probabilities of diﬀerent outcomes for a non-symmetic dice.

**6 Are “real” sequences complex?**

The argument in the preceding section would not convince a philosophically minded person. Well, we can (in principle) compute some numbers that can be interpreted as probabilities of the outcomes for a dice, and we do not need to ﬁx the distribution on the initial conditions, it is enough to assume that this distribution is smooth enough. But still we speak about probability distributions that are somehow externally imposed in addition to dynamical laws.

Essentially the same question can be reformulated as follows. Make 106 coin tosses and try to compress the resulting sequence of zeros and ones by a standard compression program, say, gzip. (Technically, you need ﬁrst to convert bit sequence into a byte sequence.) Repeat this experiment (coin tossing plus gzipping) as many times as you want, and this will never give you more that 1% compression. (Such a compression is possible for less than 2−10000 -fraction of all sequences.) This statement deserves to be called a law of nature: it can be

**checked experimentally in the same way as other laws are. So the question is:**

does this law of nature follows from dynamical laws we know?

To see where the problem is, it is convenient to simplify the situation. Imagine for a while that we have discrete time, phase space is [0, 1) and the dynamical law is x → T (x) = if 2x 1 then 2x else 2x − 1.

So we get a sequence of states x0, x1 = T (x0 ), x2 = T (x1 ),...; at each step we observe where the current state is — writing 0 if xn is in [0, 1/2) and 1 if xn is in [1/2, 1).

This tranformation T has the mixing property we spoke about; however, looking at it more closely, we see that a sequence of bits obtained is just the binary representation of the initial condition. So our process just reveals the initial condition bit by bit, and any statement about the resulting bit sequence (e.g., its incompressibility) is just a statement about the initial condition.

So what? Do we need to add to the dynamical laws just one more methaphysical law saying that world was created at the random (=incompressible) state? Indeed, algorithmic transformations (including dynamical laws) cannot increase signiﬁcantly the Kolmogorov complexity of the state, so if objects of high complexity exist in the (otherwise deterministic, as we assume for now) real world now, they should be there at the very beginning. (Note that it is difcult to explain the randomness observed saying that we just observe the world at random time or in a random place: the number of bits needed to encode the time and place in the world is not enough to explain an incompressible string of length, say 106, if we use currently popular estimates for the size and age of the world: the logarithms of the ratios of the maximal and minimal lengths (or time intervals) that exist in nature are negligible compared to 106 and therefore the position in space-time cannot determine a string of this complexity.

Should we conclude then that instead of playing the dice (as Einstein could put it), God provided concentrated randomness while creating the world?

**7 Randomness as ignorance: Blum – Micali – Yao**