From Terminologies to Classifications – the
Challenge of Information Reduction
Hans Rudolf STRAUBa, Maurus DUELLIa, Norbert FREIb, Hugo MOSIMANNa and
Semfinder AG, Kreuzlingen, Switzerland
Interstate University of Applied Sciences of Technology NTB, Buchs, Switzerland
Abstract. A description of a medical case can – as any statement about reality – contain more or less information. The aim of a classification is to express as much as possible with a minimum of words (classes). For this purpose the information contained in a terminology must be reduced. Is such a reduction an obvious process? In this paper we examine this question by considering practical aspects arising from the task of "teaching" computers automated ICD-10 coding of diagnoses in text form.
We first assess the extent of information reduction and then discuss the path along which this reduction takes place. The role and conditions of a true hierarchical structure are discussed, as well as the questions that stem from reduction of the many semantic dimensions to the single dimension of a formal hierarchy. Special attention is given to the sum/summands problem, a major challenge for automated classification in practice.
Are medical classifications necessary at all? Precisely because extracting class information from terminological data is not self-evident, the classification holds information which is not otherwise available.
1. Introduction
The information available about a medical case, a patient, is always less than the information that could theoretically be found at the moment of observation of the real case. The language that we use to describe the patient can be differentiated according to several characteristics: the divide between ontological and epistemological viewpoints has recently been discussed [1,2] and the discussion looks set to continue.
In this paper we do not emphasize this distinction, but we do look more closely at the question of granularity, i.e. of the information content of a language.
"Language" is used in a broad sense in this context and includes "free" natural languages, standardized and structured terminologies like SNOMED CT (with a fine granularity and a large information content) and classifications like ICD-10 (with a coarse granularity and a low information content). Of course, nobody believes that it would be possible to extract information in a fine granular language (terminology) from the information found in the terms or codes of a classification. But is it – on the other hand – possible to go in the other direction and assign a case to a classification with the aid of the terms in the terminology alone? At first glance this seems self-evident. A clo
Figure 1: From Reality to Terminologies to Classifications
2. Information reduction
2.1. From reality to observation
In reality, every hair of a patient can be counted. But this information is not what the physician wants to know. Nor does he want to know the condition of every single red blood cell. It is sufficient for him to know that most appear to be normal and that they occur in numbers within a certain range of normality. If anaemia is present, it is not necessary to know all the details of the single cells; it is sufficient to know their condition and numbers in general terms. Obviously only a very small part of the information relating to the real case is observed by the physicians, nurses and laboratories, yet this is not a shortcoming, but a desirable outcome, since we do not need every single piece of information to cure the patient. Too much information would confuse the observer and he wouldn't be able to see the wood for the trees.
The limited closeness of observation implies a reduction of information, but closeness is not the only aspect. The direction of observation also selects what can be observed at all. This selection is intended, too. The complaints of the patient direct the attention of the medical professionals. When a patient complains about acute abdominal pain, the doctor will most probably not perform a CT scan of the head.
All in all, the reduction of information content from reality to observation is obviously huge.
2.2. From observation to medical records
Not every observation is worth recording and of course only a small part of the information in doctors’ and nurses’ heads finds its way into medical records.
Information in the records can be in pictures, in numbers (quantitative) or in words (qualitative). For the purposes of this paper we confine ourselves to the words, since they carry the qualitative information, which is the main concern of terminologies and medical classifications.
2.3. From medical records to diagnoses
The diagnoses are usually a small part of the information in the medical record.
2.4. From diagnoses to codes and DRGs
Again there is a reduction of the amount of information and again this reduction is intended. The fewer codes or DRGs (diagnosis related groups) there are, the easier it is to compare cases statistically in groups.
2.5. Estimation of the information content on each layer of granularity
In Figure 1 the information content of the layers of granularity is estimated roughly.
The number of permitted instances in the layers provides an estimate of the information content of a selected instance (selective information content according to Shannon and MacKay).
DRG systems usually comprise several hundred groups, and usually fewer than a thousand. The ICD-10 has roughly 15,000 codes, depending on the version in question. SNOMED CT contains more than 1 million terms. Compared to these still small numbers, the information content of a real medical case is impossible to quantify. In Figure 1 it is shown as a cloud, which represents the huge amount of information as well as its lack of form at this stage of interpretation.
The numbers in brackets (and the points in the three quadrilaterals representing the interpretation layers) in Figure 1 reflect the fact that, although there are several ICD-10 codes for one case, there is by definition just one DRG for the same case. The information content of the single ICD-10 codes is multiplied and the information content of the whole is the product of the contents of the single codes. In Figure 1 we assume that each DRG has two codes. This is of course a rough estimate. Not every combination of codes is possible, but usually there are more than two diagnostic and therapeutic codes per case.
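The estimate described above can be sketched in a few lines. This is only an illustration under stated assumptions: all instances within a layer are treated as equally likely, the DRG count of 800 is an assumed value within the range given in the text, and the function name is hypothetical. Counting in bits, the numbers of possible combinations multiply while the bits add up.

```python
import math

# Approximate numbers of permitted instances per layer, as quoted in the text.
LAYER_SIZES = {
    "DRG": 800,               # assumption: an illustrative value below one thousand
    "ICD-10": 15_000,         # "roughly 15,000 codes"
    "SNOMED CT": 1_000_000,   # lower bound ("more than 1 million terms")
}


def selective_information_bits(n_instances: int, n_selected: int = 1) -> float:
    """Bits needed to pick n_selected independent instances out of
    n_instances equally likely ones (assumption: uniform distribution)."""
    return n_selected * math.log2(n_instances)


for layer, size in LAYER_SIZES.items():
    print(f"{layer}: {selective_information_bits(size):.1f} bits per selected instance")

# Two ICD-10 codes per case, as assumed for Figure 1: the number of
# combinations is 15,000 * 15,000, so the bits are twice those of one code.
print(f"two ICD-10 codes: {selective_information_bits(15_000, 2):.1f} bits")
```

Because not every combination of codes is possible, the figure for two codes is an upper bound rather than an exact value.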
What is true for the codes is true for the terms. Many terms combine to give one code. Not every term is used for ICD-10 coding. Therefore not only is the information content of one SNOMED clinical term reduced to one ICD-10 code, but several terms in the medical record lead to just one code.
2.6. Amount of the information reduction
As can be seen from Figure 1, the amount of information explodes when we go from the bottom (DRGs) to the top (free text in the medical record). The information in the real case (cloud) is again much richer than the information in the medical record (we shan’t offer a quantitative estimate at this point). In the other direction, from the real case to the codes and the DRGs, the information content of the medical case is radically reduced.
2.7. The coding process
Our group creates programs for automated ICD-10 coding with computers. The installations are designed around an inference machine, which reads the free text (noun phrases) and produces ICD-10 codes. If the input is not precise enough, the program requests the missing information in the form of a context-specific multiple-choice question. As an internal representation language we use concept molecules [12,14], which permit precise and structured modelling of the descriptive information content of the words in the physicians’ natural language as well as the information contained in the ICD-10 codes.
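The control flow of such a coding step (return a code, or ask a multiple-choice question when the input is not precise enough) can be illustrated with a deliberately small sketch. It does not reproduce the inference machine or the concept molecule representation: the rule base, the feature names, the placeholder codes "TOY-1" and "TOY-2" and the function name are all hypothetical, and the free-text analysis is replaced by a ready-made set of features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """A multiple-choice question asking for the missing information."""
    text: str
    choices: tuple


# Hypothetical toy rule base: required features -> placeholder code.
RULES = [
    (frozenset({"disease_x", "acute"}), "TOY-1"),
    (frozenset({"disease_x", "chronic"}), "TOY-2"),
]


def code_or_ask(features):
    """Return a code if exactly one rule is fully satisfied, a Question if the
    input is compatible with several rules but too imprecise, else None."""
    features = frozenset(features)
    exact = [code for required, code in RULES if required <= features]
    if len(exact) == 1:
        return exact[0]
    # Rules that the (less specific) input could still be refined into.
    candidates = [required for required, _ in RULES if features <= required]
    if candidates:
        missing = sorted(set().union(*candidates) - features)
        return Question("Please specify:", tuple(missing))
    return None  # nothing matches, or the input is contradictory


print(code_or_ask({"disease_x", "acute"}))
print(code_or_ask({"disease_x"}))
```

The second call yields a question instead of a code, which mirrors the situation in which the program has to request the missing information from the user.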
3. Is the result of the coding process naturally deducible?
3.1. Deduction in a hierarchical tree
A hierarchy (Figure 2) must satisfy two conditions: disjunctivity and unidirectionality.
Disjunctivity means that the siblings on each level are mutually exclusive. If a mammal is a dog, it cannot be a cat at the same time.
Unidirectionality in a hierarchy means that the branchings go in only one direction: mammals can be differentiated as dogs, cats, cows, elephants, etc. However, this differentiation cannot apply in the other direction: elephants are mammals and can never be fish. If a hierarchy were not strictly unidirectional, it would contain ring structures and would not be a hierarchy, but a net.
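The two conditions can be checked mechanically on a given structure. The following sketch is one possible reading, not a definition taken from this paper: disjunctivity is modelled as "no node has more than one direct parent", unidirectionality as "following the parent links never leads back to a node already visited". The function name and the example structures are hypothetical.

```python
def is_true_hierarchy(parents):
    """parents maps each node to the set of its direct parents."""
    # Disjunctivity (as modelled here): a node cannot belong to two
    # sibling classes at once, so it has at most one direct parent.
    if any(len(p) > 1 for p in parents.values()):
        return False
    # Unidirectionality: walking towards the root must never form a ring.
    for start in parents:
        seen = {start}
        node = start
        while parents.get(node):
            (node,) = parents[node]   # exactly one parent at this point
            if node in seen:
                return False
            seen.add(node)
    return True


zoology = {
    "vertebrate": set(),
    "mammal": {"vertebrate"},
    "fish": {"vertebrate"},
    "dog": {"mammal"},
    "elephant": {"mammal"},
}
net = {"a": set(), "b": set(), "c": {"a", "b"}}   # "c" has two parents
ring = {"a": {"b"}, "b": {"a"}}                   # ring structure

print(is_true_hierarchy(zoology), is_true_hierarchy(net), is_true_hierarchy(ring))
```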
If the two conditions apply, we have a true hierarchy and this means that we can easily draw conclusions from the leaves of the hierarchical tree back to the branches: if we know that the subject is an elephant, we can conclude that it is a mammal and that it is a vertebrate. Furthermore we can pass the properties of the elements in the upper layer to those in the lower layers. The elephant inherits all the properties of mammals as well as those of vertebrates.
This is a stroke of luck for knowledge representation: we don’t need to show all the information about elephants, dogs, cats etc. again for each species, as it is sufficient to show the common information just once at the upper level. This saves space in the representation and makes maintenance easier and more transparent.
A hierarchical tree is therefore ideal for knowledge representation purposes.
Properties are passed from the root to the leaves, from coarse granular to fine granular levels. Class information, however, is deduced in the opposite direction, from fine granular to coarse granular levels (elephant → mammal). This deduction is self-evident in a hierarchical tree, but is dependent on the two conditions explained above.
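Both directions can be shown on the zoological example in a minimal sketch. The data structure (one parent per node) and the property strings are illustrative assumptions, and the function names are hypothetical.

```python
# Each node has exactly one parent (None for the root): a true hierarchy.
PARENT = {
    "vertebrate": None,
    "mammal": "vertebrate",
    "fish": "vertebrate",
    "elephant": "mammal",
    "dog": "mammal",
}

# Properties are stored only once, at the level where they first apply.
OWN_PROPERTIES = {
    "vertebrate": {"has a backbone"},
    "mammal": {"nurses its young"},
    "elephant": {"has a trunk"},
}


def ancestors(node):
    """Class deduction, fine -> coarse: all classes above the given node."""
    chain = []
    node = PARENT[node]
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def properties(node):
    """Inheritance, coarse -> fine: own properties plus those of all ancestors."""
    props = set(OWN_PROPERTIES.get(node, set()))
    for ancestor in ancestors(node):
        props |= OWN_PROPERTIES.get(ancestor, set())
    return props


print(ancestors("elephant"))
print(sorted(properties("elephant")))
```

The deduction in `ancestors` is unambiguous only because every node has exactly one parent; with several parents per node the walk towards the root would branch.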
A natural deduction of this kind from fine to coarse granular levels would be exactly what we are striving for in the coding process described in Section 2.7. If the information reduction "funnel" in Figure 1 could be designed as a hierarchy, we could easily deduce the identity of a medical case on the coarse level from its description on the fine granular level. In other words, we could safely deduce the ICD-10 code from the description of the case in medical terms without external assistance.
Is this possible?
3.2. Difference between the zoological system and the system of diseases
Unfortunately the system of diseases cannot be arranged naturally in a hierarchical tree. The reason for this is linked to the two conditions required for a hierarchy: disjunctivity and unidirectionality both apply naturally in the case of animals and plants but are absent in the case of diseases.
In zoology the disjunctivity condition is naturally guaranteed by the fact that two species cannot mix (species barrier). Because cats and dogs cannot have offspring together, the two species are definitively disjunct.
The unidirectionality condition is based on the history of the evolution of species.
Since species have evolved along the unidirectional time line, this evolution cannot be reversed. Elephants cannot evolve into fish in the future.
The evolution of zoological species is, however, a special case in nature. The fact that this system is in the form of a perfect hierarchical tree is due to the history behind its evolution.
Such a history is absent in the development of diagnoses. Diseases do not evolve from other diseases as zoological species evolve over time from more ancient species.
Certainly diseases are related to each other. One disease can lead to another. But these relationships are much more complicated than the ones in zoology.
Because the two conditions, disjunctivity and unidirectionality, are not present in the system of diseases, this system does not occur naturally in the form of a hierarchy.
If we want to make it into a hierarchy for practical reasons – and there are good reasons for this! – we have to create it artificially. As soon as a structure is artificial, however, its shape can be altered and becomes arbitrary.
Statistical methods (variance reduction with regard to a target variable) can be used to perfect a system, as Fetter has done in creating the first DRG systems. Such statistically created systems are designed to serve specific needs (economic ones in the case of DRGs); they are artificial and do not have natural and impassable boundaries like the species barriers in zoology described above.
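To make "variance reduction with regard to a target variable" concrete, the sketch below compares two alternative groupings of the same six cases. It is a generic illustration and not the procedure actually used to build any DRG system; the target values (for example costs) are invented, and the function names are hypothetical.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def variance_reduction(groups):
    """Fraction of the total sum of squares of the target variable that
    disappears when the cases are split into the given groups.
    Assumes the target values are not all identical (total > 0)."""
    all_values = [v for values in groups.values() for v in values]
    total = sum_of_squares(all_values)
    within = sum(sum_of_squares(values) for values in groups.values())
    return 1 - within / total


# Two artificial groupings of the same invented target values.
grouping_a = {"group 1": [1, 2, 3], "group 2": [9, 10, 11]}
grouping_b = {"group 1": [1, 10, 3], "group 2": [9, 2, 11]}

print(variance_reduction(grouping_a))
print(variance_reduction(grouping_b))
```

Grouping A separates the cases far better with regard to the target variable than grouping B. A different target variable could favour a different grouping, which is one way of seeing why such boundaries are artificial.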
The ICD-10 classification is designed as a hierarchy. This does offer many advantages, but we have to remember that its structure is arbitrary – however well designed it may be. If we want to assign ICD-10 codes to diagnoses, we must reduce the complex information of the real case diagnoses until it fits into the artificial hierarchical tree. What information gets lost? Additional rules – inclusiva et exclusiva – are necessary for this task.
If we try to obtain ICD-10 codes automatically from natural language diagnoses, we can see how more complex structured information is arranged in a hierarchical tree.
In the next section we intend to show how this is done.
4. ICD-10 coding of arterial hypertension
4.1. Semantic dimensions (degrees of freedom)
If we want to code a diagnosis, we first have to analyse the characteristics used in the target coding system. The terms used to describe the codes are best arranged in groups of the same semantic "flavour".
Terms of the same "flavour" represent tokens of the same type. Usually, for each "flavour", just one token can be assigned independently to a diagnosis, so that the diagnosis has as many tokens assigned to it as there are "flavours". The "flavours" can be seen as semantic dimensions, as axes in a semantic space or as degrees of freedom, the latter in order to express the independence of each dimension. They are related, but not completely identical, to the partitions and features (qualities) of the semantic web. The exact differences between the methods of partitioning in the semantic web and the semantic dimensions depicted here, as well as the consequences of these differences, must be the subject of an additional paper yet to appear.
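The idea of independent semantic dimensions, each contributing at most one token to a diagnosis, can be sketched as a simple mapping. The dimension names and tokens below are hypothetical placeholders chosen for illustration; they are not taken from ICD-10 or from the analysis in this paper.

```python
from math import prod

# Hypothetical semantic dimensions ("flavours") with their permitted tokens.
DIMENSIONS = {
    "etiology": {"primary", "secondary"},
    "course": {"benign", "malignant"},
    "organ involvement": {"none", "heart", "kidney"},
}


def make_diagnosis(tokens):
    """Validate a diagnosis given as {dimension: token}. A dict holds at most
    one token per dimension, which mirrors the rule that just one token per
    "flavour" can be assigned."""
    for dimension, token in tokens.items():
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if token not in DIMENSIONS[dimension]:
            raise ValueError(f"{token!r} is not a token of {dimension!r}")
    return dict(tokens)


def n_combinations():
    """Independent dimensions multiply: one token chosen on every axis."""
    return prod(len(tokens) for tokens in DIMENSIONS.values())


diagnosis = make_diagnosis({"etiology": "secondary", "organ involvement": "kidney"})
print(diagnosis, n_combinations())
```

A formal hierarchy has to arrange all these combinations along a single axis, which is where the order of the dimensions becomes a design decision.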