# «Abstract We express some criticism about the deﬁnition of an algorithmic suﬃcient statistic and, in particular, of an algorithmic minimal ...»

Algorithmic Minimal Suﬃcient Statistic

Revisited

Nikolay Vereshchagin∗

July 10, 2009

Abstract

We express some criticism about the deﬁnition of an algorithmic

suﬃcient statistic and, in particular, of an algorithmic minimal suﬃcient statistic. We propose another deﬁnition, which might have better

properties.

1 Introduction

Let x be a binary string. A ﬁnite set A containing x is called an (algorithmic)

suﬃcient statistic of x if the sum of Kolmogorov complexity of A and the

**log-cardinality of A is close to Kolmogorov complexity C(x) of x:**

C(A) + log2 |A| ≈ C(x). (1) Let A∗ denote a minimal length description of A and i the index of x in the list of all elements of A arranged lexicographically. The equality (1) means that the two part description (A∗, i) of x is as concise as the minimal length code of x.

It turns out that A is a suﬃcient statistic of x iﬀ C(A|x) ≈ 0 and C(x|A) ≈ log |A|. The former equality means that the information in A∗ is a part of information in x. The latter equality means that x is a typical member of A: x has no regularities that allow to describe x given A in a Moscow State University ∗ shorter way than just by specifying its log |A|-bit index in A. Thus A∗ contains all useful information present in x and i contains only an accidental information (a noise).

Suﬃcient statistics may also contain a noise. For example, it happens for x being a random string and A = {x}. Is it true that for all x there is a suﬃcient statistic that contains no noise? To answer this question we can try to use the notion of a minimal suﬃcient statistics deﬁned in [3]. In this paper we argue that (1) this notion is not well deﬁned for some x (although for some x the notion is well deﬁned) and (2) even for those x for which the notion of a minimal suﬃcient statistic is well deﬁned not every minimal suﬃcient statistic qualiﬁes for “denoised version of x”. We propose another deﬁnition of a (minimal) suﬃcient statistic that might have better properties.

2 Suﬃcient statistics Let x be a given string of length n. The goal of algorithmic statistics is to “explain” x. As possible explanations we consider ﬁnite sets containing x. We call any ﬁnite A ∋ x a model for x. Every model A corresponds the statistical hypothesis “x was obtained by selecting a random element of x”. In which case is such hypothesis plausible? As argued in [4, 3, 5], it is plausible if C(x|A) ≈ log |A| and C(A|x) ≈ 0 (we prefer to avoid rigorous deﬁnitions up to a certain point; approximate equalities should be thought as equalities up to an additiveO(log n) term). In the expressions C(x|A), C(A|x) the set A is understood as a ﬁnite object. More precisely, we ﬁx any computable bijection A → [A] between ﬁnite sets of binary strings and binary strings and let C(x|A) = C(x|[A]), C(A|x) = C([A]|x).

As shown in [3, 5] this is equivalent to saying that C(A) + log |A| ≈ C(x).

Indeed, assume that A contains x and C(A) ≤ n. Then, given A the string x can be speciﬁed by its log |A|-bit index in A. Recalling the symmetry of information and omitting additive terms of order O(log n), we obtain

** C(x) ≤ C(x) + C(A|x) = C(A) + C(x|A) ≤ C(A) + log |A|.**

Assume now that C(x|A) ≈ log |A| and C(A|x) ≈ 0. Then all inequalities here become equalities and hence A is a suﬃcient statistic. Conversely, if C(x) ≈ C(A) + log |A| then the left hand side and the right hand side in these inequalities coincide. Thus C(x|A) ≈ log |A| and C(A|x) ≈ 0.

The inequality C(x) ≤ C(A) + log |A| (2) (which is true up to an additive O(log n) term) has the following meaning.

Consider the two part code (A∗, i) of x, consisting of the minimal program A∗ for x and log |A|-bit index of x in the list of all elements of A arranged lexicographically. The equality means that its total length C(A) + log |A| cannot exceed C(x). If C(A) + log |A| is close to C(x), we call A a suﬃcient statistic of x. To make this notion rigorous we have specify what means “close”. In [3] this is speciﬁed as follows: ﬁx a constant c and call A a suﬃcient statistic if |(C(A) + log |A|) − C(x)| ≤ c. (3) More precisely, [3] uses preﬁx complexity K in place of plain complexity C.

For preﬁx complexity the inequality (2) holds up to a constant error term.

If we choose c large enough then suﬃcient statistics exists, witnessed by A = {x}. (The paper [1] suggests to set c = 0 and to use C(x|n) and C(A|n) in place of K(x) and K(A) in the deﬁnition of a suﬃcient statistic. For such deﬁnition suﬃcient statistics might not exist.) To avoid the discussion on how small should be c let us call A ∋ x a c-suﬃcient statistic if (3) holds. The smaller c is the more suﬃcient A is.

This notion is non-vacuous only for c = O(log n) as the inequality (2) holds only with logarithmic precision.

3 Minimal suﬃcient statistics Naturally, we are interested in squeezing as much noise from the given string x as possible. What does it mean? Every suﬃcient statistic A identiﬁes log |A| bits of noise in x. Thus a suﬃcient statistic with maximal log |A| (and hence minimal C(A)) identiﬁes the maximal possible amount of noise in x. So we arrive at the notion of a minimal suﬃcient statistic: a suﬃcient statistic with minimal C(A) is called a minimal suﬃcient statistic (MSS).

Is this notion well deﬁned? Recall that actually we have only the notion of a c-suﬃcient statistic (where c is either a parameter, or a constant). That is, we have actually deﬁned the notion of a minimal c-suﬃcient statistic. Is this a good notion? We argue that for some strings x it is not (for every c).

There are strings x for which it is impossible identify MSS in an intuitively appealing way. For those x the complexity of the minimal c-suﬃcient statistic decreases much, as c increases a little.

To present such strings we need to recall a theorem from [7]. Let Sx stand

**for the structure set of x:**

The function hx is called the Kolmogorov structure function of x; for small i it might take inﬁnite values due to lack of models of small complexity. In contrast, the function gx is total for all x.

** As pointed by Kolmogorov [4], the structure set Sx of every string x of length n and Kolmogorov complexity k has the following three properties (we state the properties in terms of the function gx ):**

(1) gx (0) = k + O(1) (witnessed by A = {x}).

(2) gx (n) = O(log n) (witnessed by A = {0, 1}n.

(3) gx in non-increasing and gx (j + l) ≥ gx (j) − l − O(log l) for every j, l ∈ N.

For the proof of the last property see [5, 7]. Properties (1) and (3) imply that i + j ≥ k − O(log n) for every (i, j) ∈ Sx. Suﬃcient statistics correspond to those (i, j) ∈ Sx with i + j ≈ k. The line i + j = k is therefore called the suﬃciency line.

A result of [7, Remark IV.4] states that for every set g that satisﬁes (1)– (3) there is x of length n and complexity close to k such that gx is close to

**g. 1 More speciﬁcally, the following holds:**

Theorem 1 ([7]). Let g be any non-increasing function g : {0,..., n} → N such that g(0) = k, g(n) = 0 and such that g(j + l) ≥ gx (j) − l for every j, l ∈ N with j + l ≤ n. Then there is a string x of length n and complexity k ± ε such that |gx (j) − g(j)| ≤ ε for all j ≤ n. Here ε = O(log n + C(g)),

**where C(g) stands for the Kolmogorov complexity of the graph of g:**

C(g) = C({ j, g(j) | 0 ≤ j ≤ n}).

Actually, [7] provides the description of possible shapes of Sx in terms of the Kolmogorov structure function hx. We use here gx instead of hx, as in terms of gx the description is easier-to-understand.

We are ready to present strings for which the notion of a MSS is not well deﬁned. Fix a large n and let, say, k = n/2 and g(j) = max{k−jk/(k+α), 0}, where α = α(k) ≤ k is a computable function of k with natural values.

Then n, k, g satisfy all conditions of Theorem 1. Hence there is a string x of length n and complexity k + O(log n) with gx (j) = g(j) + O(log n) (note that C(g) = O(log n)). Its structure function is shown on Fig. 1. Choose α so that α/k is negligible (compared to k) but α is not.

Figure 1: The structure function of a string for which MSS is not well deﬁned For very small j the graph of gx is close to the suﬃciency line and for j = k + α it is already at a large distance α from it. As j increments by one, the value gx (j) + j − C(x) increases by at most k/(k + α) + O(log n), which is negligible. Therefore, it is not clear where the graph of gx leaves the suﬃciency line. The complexity of the minimal c-suﬃcient statistic is k − kc/α + O(k log n/α) and decreases fast as a function of c.

Thus there are strings for which it is hard to identify the complexity of MSS. There is also another minor point regarding minimal suﬃcient statistics. Namely, there is a string x for which the complexity of minimal sufcient statistic is well deﬁned but not all MSS qualify as denoised versions of x. Namely, some of them have a weird structure function. What kind of structure set we expect of a denoised string? To answer this question consider the following example. Let y be a string, m a natural number and z a string of length l(z) = m that is random relative to y. The latter means that C(z|y) ≥ m − β for a small β. Consider the string x = y, z. Intuitively, z is a noise in x. In other words, we can say that y is obtained from x by removing m bits of noise. What is the relation between the structure set of x and that of y?

** Theorem 2. Assume that z is a string of length m with C(z|y) ≥ m − β.**

Then for all j ≥ m we have gx (j) = gy (j − m) and for all j ≤ m we have gx (j) = C(y) + m − j = gy (0) + m − j. The equalities here hold up to O(log m + log C(y) + β) term.

Proof. In the proof we will ignore terms of order O(log m + log C(y) + β).

The easy part is the equality gx (j) = C(y) + m − j for j ≤ m. Indeed, we have gx (m) ≤ C(y) witnessed by A = { y, z ′ | |z ′ | = m}. On the other hand, gx (0) = C(x) = C(y) + C(z|y) = C(y) + m. Thus gx (j) should have maximal possible rate of decrease on the segment [0, m] to drop from C(y) + m to C(y).

Another easy part is the inequality gx (j) ≤ gy (j − m). Indeed, for every model A of y with |A| ≤ 2j−m consider the model

** A′ = A × {0, 1}m = { y ′, z ′ | y ′ ∈ A, |z ′ | = m}**

of cardinality at most 2j. Its complexity is at most that of |A|, which proves gx (j) ≤ gy (j − m).

The tricky part is the inverse inequality gx (j) ≥ gy (j − m). Let A be a model for x with |A| ≤ 2j and C(A) = gy (j). We need to show that there is a model of y of cardinality at most 2j−m and of the same (or lower) complexity.

We will prove it in a non-constructive way using a result from [7].

The ﬁrst idea is to consider the projection of A: {y ′ | y ′, z ′ ∈ A}.

However this set may be as large as A itself. Reduce it as follows. Consider the yth section of A: Ay = {z ′ | y, z ′ ∈ A}. Deﬁne i as the natural number such that 2i ≤ |Ay | 2i+1. Let A′ be the set of those y ′ whose y ′ th section has at least 2i elements. Then by counting arguments we have |A′ | ≤ 2j−i.

If i ≥ m, we are done. However, it might be not the case. To lower bound i, we will relate it to the conditional complexity of z given y and A. Indeed, we have C(z|A, y) ≤ i, as z can be identiﬁed by its ordinal number in yth section of A. Hence we know that log |A′ | ≤ j − C(z|A, y). Now we will

**improve A′ using a result of [7]:**

Lemma 3 (Lemma A.4 in [7]). For every A′ ∋ y there is A′′ ∋ y with C(A′′ ) ≤ C(A′ ) − C(A′ |y) and ⌊log |A′′ |⌋ = ⌊log |A′ |⌋.

By this lemma we get the inequality gy (j − C(z|A, y)) ≤ C(A′ ) − C(A′ |y).

Note that C(A′ ) − C(A′ |y) = I(y : A′ ) ≤ I(y : A) = C(A) − C(A|y), as C(A′ |A) is negligible. Thus we have gy (j − C(z|A, y)) ≤ C(A) − C(A|y).

We claim that by the property (3) of the structure set this inequality implies

**that gy (j − m) ≤ C(A). Indeed, as C(z|A, y) ≤ m we have by property (3):**

gy (j − m) ≤ m − C(z|A, y) + C(A) − C(A|y) ≤ m + C(A) − C(z|y) = C(A).

In all the above inequalities, we need to be careful about the error term, as they include A and thus the error term involves O(log C(A)). The set A is either a model of y or a model of x. W.l.o.g. we may assume that C(A) ≤ C(x) + O(1). Indeed, there is no need to consider models of y or x of larger complexity, as the models {y} and {x} have the least possible cardinality and their complexity is at most C(x) + O(1). Since C(x) ≤ C(y) + O(C(z|y)) ≤ C(y) + O(k), the term O(log C(A)) is absorbed by the general error term.

This theorem answers our question: if y is obtained from x by removing m bits of noise then we expect that gy satisfy Theorem 2. Now we will show that there are strings x as in Theorem 2 for which the notion of the MSS is well deﬁned but the structure function of some minimal suﬃcient statistics does not satisfy Theorem 2. The structure set of a ﬁnite set A of strings is deﬁned as that of [A]. It is not hard to see that if we switch to another computable bijection A → [A] the value of g[A] (j) changes by an additive constant. Thus SA and gA are well deﬁned for ﬁnite sets A.

Theorem 4. For every k there is a string y of length 2k and Kolmogorov complexity C(y) = k such that

(See Fig. 3.) Moreover, for every such z the string x = y, z has a model B of complexity C(B) = k and log-cardinality log |B| = k such that gB (j) = k for all j ≤ 2k. All equalities here hold up to O(log k) additive error term.

The structure set of x = y, z clearly leaves the suﬃciency line at the point j = k. Thus k is intuitively the complexity of minimal suﬃcient statistic and both models A = y × {0, 1}k and B are minimal suﬃcient statistics.