FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:   || 2 |

«Abstract We express some criticism about the definition of an algorithmic sufficient statistic and, in particular, of an algorithmic minimal ...»

-- [ Page 1 ] --

Algorithmic Minimal Sufficient Statistic


Nikolay Vereshchagin∗

July 10, 2009


We express some criticism about the definition of an algorithmic

sufficient statistic and, in particular, of an algorithmic minimal sufficient statistic. We propose another definition, which might have better


1 Introduction

Let x be a binary string. A finite set A containing x is called an (algorithmic)

sufficient statistic of x if the sum of Kolmogorov complexity of A and the

log-cardinality of A is close to Kolmogorov complexity C(x) of x:

C(A) + log2 |A| ≈ C(x). (1) Let A∗ denote a minimal length description of A and i the index of x in the list of all elements of A arranged lexicographically. The equality (1) means that the two part description (A∗, i) of x is as concise as the minimal length code of x.

It turns out that A is a sufficient statistic of x iff C(A|x) ≈ 0 and C(x|A) ≈ log |A|. The former equality means that the information in A∗ is a part of information in x. The latter equality means that x is a typical member of A: x has no regularities that allow to describe x given A in a Moscow State University ∗ shorter way than just by specifying its log |A|-bit index in A. Thus A∗ contains all useful information present in x and i contains only an accidental information (a noise).

Sufficient statistics may also contain a noise. For example, it happens for x being a random string and A = {x}. Is it true that for all x there is a sufficient statistic that contains no noise? To answer this question we can try to use the notion of a minimal sufficient statistics defined in [3]. In this paper we argue that (1) this notion is not well defined for some x (although for some x the notion is well defined) and (2) even for those x for which the notion of a minimal sufficient statistic is well defined not every minimal sufficient statistic qualifies for “denoised version of x”. We propose another definition of a (minimal) sufficient statistic that might have better properties.

2 Sufficient statistics Let x be a given string of length n. The goal of algorithmic statistics is to “explain” x. As possible explanations we consider finite sets containing x. We call any finite A ∋ x a model for x. Every model A corresponds the statistical hypothesis “x was obtained by selecting a random element of x”. In which case is such hypothesis plausible? As argued in [4, 3, 5], it is plausible if C(x|A) ≈ log |A| and C(A|x) ≈ 0 (we prefer to avoid rigorous definitions up to a certain point; approximate equalities should be thought as equalities up to an additiveO(log n) term). In the expressions C(x|A), C(A|x) the set A is understood as a finite object. More precisely, we fix any computable bijection A → [A] between finite sets of binary strings and binary strings and let C(x|A) = C(x|[A]), C(A|x) = C([A]|x).

As shown in [3, 5] this is equivalent to saying that C(A) + log |A| ≈ C(x).

Indeed, assume that A contains x and C(A) ≤ n. Then, given A the string x can be specified by its log |A|-bit index in A. Recalling the symmetry of information and omitting additive terms of order O(log n), we obtain

C(x) ≤ C(x) + C(A|x) = C(A) + C(x|A) ≤ C(A) + log |A|.

Assume now that C(x|A) ≈ log |A| and C(A|x) ≈ 0. Then all inequalities here become equalities and hence A is a sufficient statistic. Conversely, if C(x) ≈ C(A) + log |A| then the left hand side and the right hand side in these inequalities coincide. Thus C(x|A) ≈ log |A| and C(A|x) ≈ 0.

The inequality C(x) ≤ C(A) + log |A| (2) (which is true up to an additive O(log n) term) has the following meaning.

Consider the two part code (A∗, i) of x, consisting of the minimal program A∗ for x and log |A|-bit index of x in the list of all elements of A arranged lexicographically. The equality means that its total length C(A) + log |A| cannot exceed C(x). If C(A) + log |A| is close to C(x), we call A a sufficient statistic of x. To make this notion rigorous we have specify what means “close”. In [3] this is specified as follows: fix a constant c and call A a sufficient statistic if |(C(A) + log |A|) − C(x)| ≤ c. (3) More precisely, [3] uses prefix complexity K in place of plain complexity C.

For prefix complexity the inequality (2) holds up to a constant error term.

If we choose c large enough then sufficient statistics exists, witnessed by A = {x}. (The paper [1] suggests to set c = 0 and to use C(x|n) and C(A|n) in place of K(x) and K(A) in the definition of a sufficient statistic. For such definition sufficient statistics might not exist.) To avoid the discussion on how small should be c let us call A ∋ x a c-sufficient statistic if (3) holds. The smaller c is the more sufficient A is.

This notion is non-vacuous only for c = O(log n) as the inequality (2) holds only with logarithmic precision.

3 Minimal sufficient statistics Naturally, we are interested in squeezing as much noise from the given string x as possible. What does it mean? Every sufficient statistic A identifies log |A| bits of noise in x. Thus a sufficient statistic with maximal log |A| (and hence minimal C(A)) identifies the maximal possible amount of noise in x. So we arrive at the notion of a minimal sufficient statistic: a sufficient statistic with minimal C(A) is called a minimal sufficient statistic (MSS).

Is this notion well defined? Recall that actually we have only the notion of a c-sufficient statistic (where c is either a parameter, or a constant). That is, we have actually defined the notion of a minimal c-sufficient statistic. Is this a good notion? We argue that for some strings x it is not (for every c).

There are strings x for which it is impossible identify MSS in an intuitively appealing way. For those x the complexity of the minimal c-sufficient statistic decreases much, as c increases a little.

To present such strings we need to recall a theorem from [7]. Let Sx stand

for the structure set of x:

–  –  –

The function hx is called the Kolmogorov structure function of x; for small i it might take infinite values due to lack of models of small complexity. In contrast, the function gx is total for all x.

As pointed by Kolmogorov [4], the structure set Sx of every string x of length n and Kolmogorov complexity k has the following three properties (we state the properties in terms of the function gx ):

(1) gx (0) = k + O(1) (witnessed by A = {x}).

(2) gx (n) = O(log n) (witnessed by A = {0, 1}n.

(3) gx in non-increasing and gx (j + l) ≥ gx (j) − l − O(log l) for every j, l ∈ N.

For the proof of the last property see [5, 7]. Properties (1) and (3) imply that i + j ≥ k − O(log n) for every (i, j) ∈ Sx. Sufficient statistics correspond to those (i, j) ∈ Sx with i + j ≈ k. The line i + j = k is therefore called the sufficiency line.

A result of [7, Remark IV.4] states that for every set g that satisfies (1)– (3) there is x of length n and complexity close to k such that gx is close to

g. 1 More specifically, the following holds:

Theorem 1 ([7]). Let g be any non-increasing function g : {0,..., n} → N such that g(0) = k, g(n) = 0 and such that g(j + l) ≥ gx (j) − l for every j, l ∈ N with j + l ≤ n. Then there is a string x of length n and complexity k ± ε such that |gx (j) − g(j)| ≤ ε for all j ≤ n. Here ε = O(log n + C(g)),

where C(g) stands for the Kolmogorov complexity of the graph of g:

C(g) = C({ j, g(j) | 0 ≤ j ≤ n}).

Actually, [7] provides the description of possible shapes of Sx in terms of the Kolmogorov structure function hx. We use here gx instead of hx, as in terms of gx the description is easier-to-understand.

We are ready to present strings for which the notion of a MSS is not well defined. Fix a large n and let, say, k = n/2 and g(j) = max{k−jk/(k+α), 0}, where α = α(k) ≤ k is a computable function of k with natural values.

Then n, k, g satisfy all conditions of Theorem 1. Hence there is a string x of length n and complexity k + O(log n) with gx (j) = g(j) + O(log n) (note that C(g) = O(log n)). Its structure function is shown on Fig. 1. Choose α so that α/k is negligible (compared to k) but α is not.

–  –  –

Figure 1: The structure function of a string for which MSS is not well defined For very small j the graph of gx is close to the sufficiency line and for j = k + α it is already at a large distance α from it. As j increments by one, the value gx (j) + j − C(x) increases by at most k/(k + α) + O(log n), which is negligible. Therefore, it is not clear where the graph of gx leaves the sufficiency line. The complexity of the minimal c-sufficient statistic is k − kc/α + O(k log n/α) and decreases fast as a function of c.

Thus there are strings for which it is hard to identify the complexity of MSS. There is also another minor point regarding minimal sufficient statistics. Namely, there is a string x for which the complexity of minimal sufcient statistic is well defined but not all MSS qualify as denoised versions of x. Namely, some of them have a weird structure function. What kind of structure set we expect of a denoised string? To answer this question consider the following example. Let y be a string, m a natural number and z a string of length l(z) = m that is random relative to y. The latter means that C(z|y) ≥ m − β for a small β. Consider the string x = y, z. Intuitively, z is a noise in x. In other words, we can say that y is obtained from x by removing m bits of noise. What is the relation between the structure set of x and that of y?

Theorem 2. Assume that z is a string of length m with C(z|y) ≥ m − β.

Then for all j ≥ m we have gx (j) = gy (j − m) and for all j ≤ m we have gx (j) = C(y) + m − j = gy (0) + m − j. The equalities here hold up to O(log m + log C(y) + β) term.

–  –  –

Proof. In the proof we will ignore terms of order O(log m + log C(y) + β).

The easy part is the equality gx (j) = C(y) + m − j for j ≤ m. Indeed, we have gx (m) ≤ C(y) witnessed by A = { y, z ′ | |z ′ | = m}. On the other hand, gx (0) = C(x) = C(y) + C(z|y) = C(y) + m. Thus gx (j) should have maximal possible rate of decrease on the segment [0, m] to drop from C(y) + m to C(y).

Another easy part is the inequality gx (j) ≤ gy (j − m). Indeed, for every model A of y with |A| ≤ 2j−m consider the model

A′ = A × {0, 1}m = { y ′, z ′ | y ′ ∈ A, |z ′ | = m}

of cardinality at most 2j. Its complexity is at most that of |A|, which proves gx (j) ≤ gy (j − m).

The tricky part is the inverse inequality gx (j) ≥ gy (j − m). Let A be a model for x with |A| ≤ 2j and C(A) = gy (j). We need to show that there is a model of y of cardinality at most 2j−m and of the same (or lower) complexity.

We will prove it in a non-constructive way using a result from [7].

The first idea is to consider the projection of A: {y ′ | y ′, z ′ ∈ A}.

However this set may be as large as A itself. Reduce it as follows. Consider the yth section of A: Ay = {z ′ | y, z ′ ∈ A}. Define i as the natural number such that 2i ≤ |Ay | 2i+1. Let A′ be the set of those y ′ whose y ′ th section has at least 2i elements. Then by counting arguments we have |A′ | ≤ 2j−i.

If i ≥ m, we are done. However, it might be not the case. To lower bound i, we will relate it to the conditional complexity of z given y and A. Indeed, we have C(z|A, y) ≤ i, as z can be identified by its ordinal number in yth section of A. Hence we know that log |A′ | ≤ j − C(z|A, y). Now we will

improve A′ using a result of [7]:

Lemma 3 (Lemma A.4 in [7]). For every A′ ∋ y there is A′′ ∋ y with C(A′′ ) ≤ C(A′ ) − C(A′ |y) and ⌊log |A′′ |⌋ = ⌊log |A′ |⌋.

By this lemma we get the inequality gy (j − C(z|A, y)) ≤ C(A′ ) − C(A′ |y).

Note that C(A′ ) − C(A′ |y) = I(y : A′ ) ≤ I(y : A) = C(A) − C(A|y), as C(A′ |A) is negligible. Thus we have gy (j − C(z|A, y)) ≤ C(A) − C(A|y).

We claim that by the property (3) of the structure set this inequality implies

that gy (j − m) ≤ C(A). Indeed, as C(z|A, y) ≤ m we have by property (3):

gy (j − m) ≤ m − C(z|A, y) + C(A) − C(A|y) ≤ m + C(A) − C(z|y) = C(A).

In all the above inequalities, we need to be careful about the error term, as they include A and thus the error term involves O(log C(A)). The set A is either a model of y or a model of x. W.l.o.g. we may assume that C(A) ≤ C(x) + O(1). Indeed, there is no need to consider models of y or x of larger complexity, as the models {y} and {x} have the least possible cardinality and their complexity is at most C(x) + O(1). Since C(x) ≤ C(y) + O(C(z|y)) ≤ C(y) + O(k), the term O(log C(A)) is absorbed by the general error term.

This theorem answers our question: if y is obtained from x by removing m bits of noise then we expect that gy satisfy Theorem 2. Now we will show that there are strings x as in Theorem 2 for which the notion of the MSS is well defined but the structure function of some minimal sufficient statistics does not satisfy Theorem 2. The structure set of a finite set A of strings is defined as that of [A]. It is not hard to see that if we switch to another computable bijection A → [A] the value of g[A] (j) changes by an additive constant. Thus SA and gA are well defined for finite sets A.

Theorem 4. For every k there is a string y of length 2k and Kolmogorov complexity C(y) = k such that

–  –  –

(See Fig. 3.) Moreover, for every such z the string x = y, z has a model B of complexity C(B) = k and log-cardinality log |B| = k such that gB (j) = k for all j ≤ 2k. All equalities here hold up to O(log k) additive error term.

–  –  –

The structure set of x = y, z clearly leaves the sufficiency line at the point j = k. Thus k is intuitively the complexity of minimal sufficient statistic and both models A = y × {0, 1}k and B are minimal sufficient statistics.

Pages:   || 2 |

Similar works:

«Datalog Rewritings of Regular Path Queries using Views Nadime Francis Luc Segoufin ENS-Cachan, INRIA INRIA, ENS-Cachan Cristina Sirangelo LSV at ENS-Cachan, INRIA, CNRS Abstract We consider query answering using views on graph databases, i.e. databases structured as edge-labeled graphs. We mainly consider views and queries specified by Regular Path Queries (RPQ). These are queries selecting pairs of nodes in a graph database that are connected via a path whose sequence of edge labels belongs...»

«SKIDMORE COLLEGE CAREER DEVELOPMENT CENTER Resume & Cover Letter Guide Career Development Center Starbuck Center 815 North Broadway Saratoga Springs, NY 12866 http://ww.skidmore.edu/career Creative Thought Matters. Creative Thought Works. What is a Resume? A resume is a brief document that articulates a candidate’s most relevant and recent experiences to a potential employer or graduate / professional school program. A strong resume is targeted for a specific opportunity and focuses on a...»

«www.basistech.com info@basistech.com 617-386-2090 Case Study Forecasting the Future: The EMBERS Predictive Analytics Success Story Language Identifier RLI RLI ROSETTE Language Identifier Identify languages and encodings Base Linguistics RBL RBL ROSETTE Base Linguistics Search many languages with high accuracy Entity Extractor REX REX ROSETTE Entity Extractor Tag names of people, places, and organizations Entity Resolver RES RES ROSETTE Entity Resolver Make real-world connections in your data...»

«Children’s centre report Norfolk 4C West (Lots 31 and 32) Swaffham Infant & Nursery School, White Cross Road, Swaffham, PE37 7RF Inspection date 12–14 June 2013 This inspection: Requires improvement 3 Overall effectiveness Previous inspection: Not previously inspected Access to services by young children and families Requires improvement 3 The quality of practice and services Requires improvement 3 The effectiveness of leadership, governance and Requires improvement 3 management Summary of...»

«JNCC Report No. 374 The numbers of inshore waterbirds using the Greater Thames during the non-breeding season; an assessment of the area’s potential for qualification as a marine SPA Andy Webb, Ben J. Dean, Susan H. O’Brien, Ilka Söhle, Claire McSorley, James B. Reid, Peter A. Cranswick, Lucy E. Smith and Colette Hall May 2009 © JNCC, Peterborough 2009 ISSN 0963 8901 The numbers of inshore waterbirds using the Greater Thames during the non-breeding season; an assessment of the area’s...»

«Property tax experiment in Punjab, Pakistan Adnan Q. Khan, LSE Asim I. Khwaja, Harvard Benjamin A. Olken, MIT Grantee Final Report Accepted by 3ie: October 2014 Note to readers This impact evaluation has been submitted in partial fulfilment of the requirements of grant OW2.178 issued under Open Window 2. This version is being published online as it was received. A copy-edited and formatted version will be produced in the near future. All content is the sole responsibility of the authors and...»

«COMMUNICATION ON ENGAGEMENT (COE) Period covered by this Communication on Engagement For the Humanitarian Air, Water and Food Award – WAF Award, http://wafaward.org From: 01 January 2014 To: 31 December 2015 Part I. Statement of Continued Support by the Chief Executive or Equivalent January 6, 2015 To our stakeholders: I am pleased to confirm that The Humanitarian Water, Air, and Food Award – WAF Award reaffirms its support to the United Nations Global Compact and its Ten Principles in the...»

«Ottawa, Ontario Canada September 69th 2012 Dear ICCAI attendees, As the chair of the 11th annual International Conference on Complexity in Acute Illness (ICCAI), I would like to welcome you all to Ottawa, Canada’s capital city. We welcome many new attendees to the ICCAI this year, and hope that this will lead to insightful discussion throughout the meeting. The theme of this year’s meeting, “quantitative, innovative,...»

«An experimental study on the cement mortar reinforcement through recycled nylon fibers Saverio Spadeaa,*, Ilenia Farinaa, Anna Carrafiellob, Fernando Fraternalia a University of Salerno, Department of Civil Engineering Via Giovanni Paolo II, 84084 Fisciano (SA), Italy b Omega Plastic srl, Via Udine 6, 84091 Battipaglia (SA), Italy * Corresponding author: e-mail: sspadea@unisa.it; telephone/fax number: +39 089 964342. Abstract We investigate engineering applications of recycled nylon fibers...»

«Ciência & Educação (Bauru) ISSN: 1516-7313 revista@fc.unesp.br Universidade Estadual Paulista Júlio de Mesquita Filho Brasil Montino, Marisol; Petrucci, Diego; Ure, José Ernesto; Aleman, Alejandra; Pérez, Silvia Margarita UNA PROPUESTA DE TRABAJOS PRÁCTICOS DE LABORATORIO QUE FAVORECE EL APRENDIZAJE DE CONCEPTOS Ciência & Educação (Bauru), vol. 17, núm. 4, 2011, pp. 823-833 Universidade Estadual Paulista Júlio de Mesquita Filho São Paulo, Brasil Disponible en:...»

«The Throne of God by Woodrow Kroll Today’s Radio Study: Woodrow Kroll: Chapter 4 of Revelation, verse 1, “After these things I looked, and behold, a door standing open in heaven. And the first voice which I heard was like a trumpet speaking with me, saying, ‘Come up here, and I will show you things which must take place after this.’” I’m going to stop there because you’ll hear me repeat this again and again and again in our studies: it is very important, in understanding...»

«REGIONAL FISHERIES LIVELIHOODS PROGRAMME FOR SOUTH AND SOUTHEAST ASIA (RFLP) Fisheries and environmental profile of Negombo lagoon: a literature review (Activity 1.2.1 Prepare fisheries and environmental profile of Negombo lagoon using secondary data and survey reports) For the Regional Fisheries Livelihoods Programme for South and Southeast Asia Prepared by Leslie Joseph Co-management consultant-RFLP June 2011 DISCLAIMER AND COPYRIGHT TEXT This publication has been made with the financial...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.