Effects of Unreliable Group Profiling
by means of Data Mining
Tilburg University, Faculty of Law, P.O. Box 90153,
5000 LE Tilburg, The Netherlands
Abstract. With the rise of data mining technologies, group
profiling, i.e. ascribing characteristics to groups of people, has increasingly become a useful tool for policy-making,
direct marketing, etc. However, group profiles usually
contain statistics and therefore the characteristics of group profiles may be valid for the group and for individuals as members of that group, though not for individuals as such.
When individuals are judged by group characteristics they do not possess as individuals, this may strongly influence the advantages and disadvantages of using group profiles.
However, striving for more reliable group profiles only provides a partial solution to this problem, since perfectly reliable group profiles may still result in unjustifiable treatment of people. A broader solution to deal with the disadvantages of group profiles may be found in developing new ethical, legal, and technological standards that adequately recognize the possible harmful consequences of particular types of information.
Keywords. Data mining, KDD, group profiling, personal data, data protection, reliability, distributivity, security, selection, stigmatization, confrontation, ethics.
1 Introduction
Information and communication technologies have resulted in large databases with enormous amounts of data. From the need to discover knowledge from these large amounts of data, data-mining techniques have been developed in order to find patterns and relations in data. When characteristics are ascribed to people, we speak of profiles. Profiles concerning individuals are called personal profiles, sometimes also referred to as individual profiles or customer profiles. A personal profile is a property or a collection of properties of a particular individual. Profiles concerning a group of persons are referred to as group profiles. Thus, a group profile is a property or a collection of properties of a particular group of people.
Ascribing characteristics to individuals may be done either correctly or incorrectly.1 If an individual is being judged upon information that was wrongly ascribed to him, most legal systems provide opportunities to have the information changed or deleted, possibly combined with compensation of damages.
Group characteristics are more complex: they may be correct for the group as a whole and members of that group, though not for individuals as such. To explain the difference, we may use the following example.
Suppose in street A 80 percent of the people wear glasses. Without any further knowledge, it may be suggested that there is a high probability (80 percent) that a person living in street A wears glasses. This is when this person is regarded as a member of the group of people living in street A.
When these persons are considered as individuals as such, it will be clear immediately who wears glasses and who does not.
It may be argued that, when group characteristics are incorrectly ascribed to individuals, there should be a right for people to have information changed or deleted. However, since group data is often anonymous data, it is usually not protected by data protection laws.
Besides, most people are unaware of the group profiles they are being judged upon.2
2 Risks and benefits of group profiles
The use of group profiles may have various advantages and disadvantages.
Starting with some general advantages, the search for patterns and relations in data may provide overviews of large amounts of data, facilitate the handling and retrieving of information, and help the search for immanent structure in nature. More closely related to the goals of particular users, group profiles may enhance efficacy (achieving more of the goal) and efficiency (achieving the goal more easily). Here, efficiency often means cost efficiency. Group profiles usually require less information than individual profiles (although their reliability may be lower). Group data is usually anonymous data and, therefore, in most (notably European) countries it is not protected by data protection law, which means that no costly and time-consuming effort for obtaining informed consent has to be made.
Footnote 1: For inference errors that may occur when ascribing characteristics, see .
Footnote 2: Many authors call for more openness towards data subjects and the public in general concerning the collection and use of data. See for instance .
Group profiling also provides more opportunities for selecting targets.
For instance, members of a high-risk group for lung cancer may be identified and treated earlier, or people not interested in cars will no longer receive direct mail about the subject. So-called hit ratios will increase with the help of profiling, and new groups of customers or risk-bearers may also be discovered.
Most of the disadvantages of using group profiles are closely connected to their advantages. One of the main applications of group profiles is selection, as indicated above. However, much selection may be unwanted or unjustified. When selection for jobs is performed on the basis of medical profiles, this may soon lead to discrimination.3 Unjustified selection may also occur in cases of purchasing products, acquiring services, applying for loans, applying for benefits, etc.
Some of the group profiles constructed by companies, government, or researchers may also become ‘public knowledge,’ which may lead to the stigmatization of particular groups. Another disadvantage may occur when people are confronted with information about a group they belong to.
When supposedly healthy people are confronted with the fact that they will have only a limited lifetime left, this may upset their lives and the lives of others. In some cases, people may prefer not to know their prospects while they are healthy.
Although it may seem that group profiles lead to a more individual approach (e.g. by customization), the use of group profiles may in fact lead to de-individualization. This is a paradox, but group profiles result in a tendency of judging and treating people on the basis of their group characteristics instead of on their own individual characteristics and merits . Thus, the use of profiles may lead to a more one-sided treatment of individuals. As I will show in the next section, the effects of all these risks and benefits of group profiles are strongly influenced by the reliability of the profiles and their use.
3 Reliability
When discussing the reliability of group profiles, it is important to distinguish distributive group profiles from non-distributive group profiles. Distributivity means that a property in a group profile is valid for each individual member of a group; non-distributivity means that a property in a group profile is valid for the group and for individuals as members of that group, though not for those individuals as such .
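The distinction can be illustrated with a minimal sketch. The three-person group and its property names below are hypothetical and only echo the street A example from the introduction; the check simply applies the definition that a distributive property holds for every member.

```python
# Hypothetical group: each member is described by boolean properties.
# The data and property names are assumptions made for illustration.
group = [
    {"lives_in_street_A": True, "wears_glasses": True},
    {"lives_in_street_A": True, "wears_glasses": True},
    {"lives_in_street_A": True, "wears_glasses": False},
]

def share(members, prop):
    """Fraction of the group for which the property holds."""
    return sum(m[prop] for m in members) / len(members)

def is_distributive(members, prop):
    """A property is distributive only if every single member has it."""
    return all(m[prop] for m in members)

for prop in group[0]:
    kind = "distributive" if is_distributive(group, prop) else "non-distributive"
    print(f"{prop}: holds for {share(group, prop):.0%} of the group ({kind})")
```

In this sketch, living in street A is distributive, whereas wearing glasses is valid for the group as a statistic but not for the third member as an individual.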
The reliability of a group profile may influence the effects, both positive and negative, of the use of the profile. The reliability of a group profile may be divided into two factors. The first is the reliability of the profile itself and the second is the reliability of its use.
Footnote 3: A case study in the U.S. showed that discrimination as a result of access to genetic information resulted in loss of employment, loss of insurance coverage, or ineligibility for insurance. All cases of discrimination were based on the future potential of disease rather than existing (symptoms of) diseases .
The creation of group profiles consists of several steps, in which errors may occur . First, the data on which a group profile is based may contain errors, or the data may not be representative of the group it tries to describe. Furthermore, if samples are taken, the group should be large enough to give reliable results.
In the data preparation phase, data may be aggregated, missing data may be searched for, superfluous data may be deleted, etc. All these actions may lead to errors. For instance, missing data is often made up, as is shown by the fact that a disproportionately large number of people in databases are recorded as having been born on the 1st of January (1-1 is the easiest to type) .
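A simple plausibility check for this kind of made-up data can be sketched as follows. The function name, the sample records, and the assumption that real birthdays are spread roughly evenly over the year are all illustrative simplifications, not part of the analysis described here.

```python
from collections import Counter
from datetime import date

def placeholder_share(birth_dates, month=1, day=1):
    """Return (observed, expected) share of records on one calendar day.

    Assumes a non-empty list and, as a rough simplification, that true
    birthdays are uniformly distributed over the year.
    """
    counts = Counter((d.month, d.day) for d in birth_dates)
    observed = counts[(month, day)] / len(birth_dates)
    expected = 1 / 365.25
    return observed, expected

# Hypothetical records: 30 entries typed in as 1 January, 70 other dates.
records = [date(1970, 1, 1)] * 30 + [
    date(1970 + i % 20, 1 + i % 12, 2 + i % 27) for i in range(70)
]

observed, expected = placeholder_share(records)
print(f"observed {observed:.1%} on 1 January, expected about {expected:.1%}")
```

A large gap between the observed and the expected share does not prove that values were invented, but it flags the field as a candidate for closer inspection before mining.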
The actual data mining consists of a mathematical algorithm. There are different algorithms, each having its strengths and weaknesses. Using different data-mining programs to analyse the same database may lead to different group profiles. The choice of algorithm is very important, and the consequences of this choice for the reliability of the results should be recognized. For instance, in the case of a classification algorithm, the chosen classification criteria determine most of the resulting distribution of the subjects over the classes.
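The effect of the classification criteria can be sketched with a deliberately simple threshold classifier, which stands in for a real data-mining algorithm. The income figures, class labels, and both sets of boundaries are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def classify(value, boundaries, labels):
    """Assign a value to the class whose interval contains it."""
    return labels[bisect_right(boundaries, value)]

# Hypothetical yearly incomes (in thousands) and class labels.
incomes = [12, 18, 22, 25, 31, 38, 45, 60]
labels = ["low", "middle", "high"]

# Two equally defensible sets of class boundaries (both are assumptions).
criteria_a = [20, 40]
criteria_b = [30, 50]

dist_a = Counter(classify(x, criteria_a, labels) for x in incomes)
dist_b = Counter(classify(x, criteria_b, labels) for x in incomes)

print("criteria A:", dict(dist_a))
print("criteria B:", dict(dist_b))
```

The same eight subjects yield two low-income members under the first criteria and four under the second, so the size of the resulting "low income" group is largely a product of the chosen boundaries.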
As far as the reliability of the use of group profiles is concerned, this depends on the interpretation of the group profile and the actions that are taken upon (the interpretation of) the group profile. As was explained above, both the interpretation and the actions determined depend on whether people are regarded as members of the group or as individuals as such.
It should be noted that a perfectly reliable use of a group profile, i.e. 100 percent of the group members sharing the characteristic, does not necessarily imply that the results of the use are fair or desirable. This may occur especially in the case of negative characteristics, for instance, when all members of a group consisting solely of handicapped people are refused a particular insurance. Although the use of the group profile is perfectly reliable, it is not justified.
Note that the difference between regarding people as group members or as individuals is not applicable to future properties. For instance, an epidemiological group profile with the characteristic that 5 percent of a particular group will die from a heart attack does not provide any information on the question whether Mr. Smith, who is a member of this group, will die from a heart attack. And since Mr. Smith himself has no additional information on this, his perspective as a group member is no different from the perspective of someone outside the group.
In non-distributive profiles, not every group member has the group characteristic. The consequences of this differ depending on whether the characteristic is generally regarded as negative or positive. This is illustrated in Figure 1.
People in category A have the disadvantages of sharing the negative group characteristic and of being treated on the basis of this negative profile. This may result in an accumulation of negative things: first, there is the negative health prospect; on the basis of this prospect stigmatization and selection for jobs, insurances, etc., may follow.
In category B people have the disadvantage of being treated as if they have the negative characteristic, although this is not the case. There may be an opportunity for these people to prove or show they do not share the characteristic, but they are ‘guilty until proven innocent.’ Sometimes, proving exceptions is useless anyway, for instance when a computer system does not allow exceptions or when handling exceptions is too costly or time consuming.
Fig. 1. Not every group member necessarily has a group characteristic. This has different consequences depending on whether the characteristic is negative or positive.
Sometimes people in category B may have an advantage. This is the case when measures are taken to improve the situation of the people with the negative characteristic. For instance, when the government decides to grant extra money to a group with a very low income, some group members not sharing this characteristic may profit from this.
People in category C have the advantage of having the (positive) group characteristic as well as being treated on it. Similar to the people in category A, this may be accumulative. The group may get the best offers for jobs, insurance, loans, etc.
Finally, category D contains the people who do not share the positive group characteristic. Their advantage may be that they are being treated on a positive characteristic, but the disadvantage is that they are not recognized as not having the positive characteristic or even having a negative characteristic. Lack of such recognition may become a problem when measures are taken to help the people with negative characteristics.
For instance, people in category D may not be recognized as people running a high risk of getting colon cancer and are thus easily forgotten in government screening programs.
From Figure 1 it becomes clear that there is a difference between correct treatment and fair treatment. People in categories A and C can be said to be treated correctly since they are treated on a characteristic they in fact have. Whether this treatment is also fair remains to be seen. Accumulation of negative things for people in category A and of positive things for people in category C may lead to polarization.
People in categories B and D do not have the characteristics in the profiles of the groups they belong to and are therefore being treated incorrectly. Incorrect treatment very probably also implies unfair treatment since it does not take into account the actual situation people are in.
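The four categories of Figure 1 can be summarized in a short sketch. The function names and the example group of 100 members are hypothetical; the mapping itself follows the descriptions of categories A to D given above.

```python
from collections import Counter

def category(characteristic_is_negative, member_has_it):
    """Map a group member to category A, B, C, or D of Figure 1."""
    if characteristic_is_negative:
        return "A" if member_has_it else "B"
    return "C" if member_has_it else "D"

def treated_correctly(cat):
    """Members in A and C are treated on a characteristic they in fact have."""
    return cat in ("A", "C")

# Hypothetical group of 100 members with a negative group characteristic
# that 80 percent of the members actually have.
members_have_it = [True] * 80 + [False] * 20
counts = Counter(category(True, has) for has in members_have_it)

incorrect = sum(n for cat, n in counts.items() if not treated_correctly(cat))
print(dict(counts), "incorrectly treated:", incorrect)
```

Note that the sketch only captures correctness of treatment; whether the treatment of the members counted as correct is also fair cannot be read from such a table.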
4 Concluding remarks
As was shown in the previous sections, the reliability of a group profile may strongly influence its advantages and disadvantages. It is, however, clear that striving for more distributive group profiles will only provide a partial solution to this problem. Perfectly reliable profiles may still be used unjustifiably. The other end of the spectrum, i.e. prohibiting group profiles altogether, will not be a realistic solution either.
A broader solution to the disadvantages of group profiles will have to be sought in new ethical and legal standards posing smart restrictions on the availability and use of particular types of information. Such restrictions may be enforced by law and regulations in combination with several security techniques . Security techniques with regard to data mining do not only concern access controls, but also flow controls and inference controls , .