«T.C. BAHÇEŞEHİR ÜNİVERSİTESİ PREDICTING THE EXISTENCE OF MYCOBACTERIUM TUBERCULOSIS ON PATIENTS BY DATA MINING APPROACH Master Thesis Tamer ...»
PREDICTING THE EXISTENCE OF
MYCOBACTERIUM TUBERCULOSIS ON PATIENTS
BY DATA MINING APPROACH
Institute of Science
Computer Engineering Graduate Program
Master Thesis
Tamer UÇAR
SUPERVISOR: ASSOC. PROF. DR. ADEM KARAHOCA
İSTANBUL, 2009
T.C.
BAHÇEŞEHİR ÜNİVERSİTESİ
The Graduate School of Natural and Applied Sciences
Computer Engineering
Title of the Master’s Thesis : Predicting The Existence Of Mycobacterium Tuberculosis On Patients By Data Mining Approach
Name/Last Name of the Student : Tamer UÇAR
Date of Thesis Defense : 10.08.2009
The thesis has been approved by the Graduate School of Natural and Applied Sciences.
Signature
Prof. Dr. A. Bülent ÖZGÜLER
Director
This is to certify that we have read this thesis and that we find it fully adequate in scope, quality and content, as a thesis for the degree of Master of Science.
Examining Committee Members:
Assoc. Prof. Dr. Adem KARAHOCA (Supervisor) :
Asst. Prof. Dr. Yalçın ÇEKİÇ :
Prof. Dr. Nizamettin AYDIN :
ACKNOWLEDGEMENTS
I would like to thank all the people who have helped and inspired me during my study.
In particular, I offer my sincerest gratitude to my supervisor, Assoc. Prof. Dr. Adem Karahoca, who has supported me throughout my thesis with his experience and knowledge. It would have been impossible to complete this study without his encouragement, motivation and guidance.
I would like to express my gratitude to my father, Dr. Necmettin Uçar, and my brother, Dr. Tolga Uçar, for their professional insight. Without their support, the medical basis of this thesis could not have been established.
I owe my deepest gratitude to my mother, Nedret Uçar, for her endless love and support throughout my life. Not only in this study but in every moment of my life, her encouragement has made everything easier.
Finally, I would like to thank my fiancée, Elif Çöğürlü, for her everlasting love, endless support and encouragement in every part of my life.
Today, data mining methods are a very popular technique for solving many problems. Briefly defined, data mining is a set of mechanisms for extracting patterns from an existing data set. The extracted patterns are used to interpret existing or newly collected data and to derive meaningful information from them. Large-scale data are handled in many fields of study, and many different algorithms and approaches have been applied to turn these data into meaningful information.
The biomedical field is one of the areas in which data mining techniques can turn data into meaningful information. Many data mining studies have been carried out on topics such as the classification of cardiac beats, the analysis of MEG (magnetoencephalography) background activity in Alzheimer's disease, the prediction of inborn errors of metabolism in humans from metabolic biomarkers, and the estimation of Cyclosporine A levels in blood.
This study concentrates on the problem of classifying tuberculosis patients.
A definitive diagnosis of tuberculosis requires a test that determines whether the bacterium is present in the patient's phlegm. The result of this test becomes available after approximately 45 days. The aim of our study is to develop a system that uses data mining to diagnose tuberculosis as consistently as possible without waiting for the definitive medical test results. It is very important that the system works consistently, because patients who do not actually have tuberculosis but are classified as tuberculosis by the system would be put on a strong, intensive antibiotic treatment for 45 days for nothing and would be exposed to the side effects of drugs they took unnecessarily. Likewise, patients who actually have tuberculosis but are classified as non-tuberculosis by the system would go untreated for 45 days, start the required treatment program late, and their disease would progress further.
Based on the findings of our study, we observed that the ANFIS method is more consistent and reliable than the Bayesian Network, Multilayer Perceptron, PART, JRip and RSES methods for classifying tuberculosis patients.
Keywords: ANFIS, Biomedical, Patient Classification
Data mining techniques are very popular for solving a wide range of problems. Briefly, data mining is a mechanism for obtaining patterns from an existing data set. The extracted patterns are used to turn new or existing data into useful information. Large-scale data are collected in most fields, and many different algorithms and approaches are used to convert these data into information.
Biomedicine is one of the areas where data mining can be applied to convert data into information. Many studies have been carried out on topics such as the classification of cardiac beats, the analysis of MEG (magnetoencephalography) background activity in Alzheimer's disease, the prediction of metabolic biomarkers of human inborn errors of metabolism, and the prediction of Cyclosporine A blood levels.
This study focuses on the classification of tuberculosis patients. A definitive diagnosis of tuberculosis requires a medical test on the patient’s phlegm, and the result of this test is available only after about 45 days. The purpose of this study is to develop a data mining solution that diagnoses tuberculosis as accurately as possible and helps decide whether it is reasonable to start tuberculosis treatment on suspected patients without waiting for the definitive medical test results. The classification must be very accurate: patients classified as false positives would take strong antibiotics for 45 days for nothing and would have to deal with their side effects, while the treatment of patients classified as false negatives would be delayed for 45 days, during which their disease would worsen. Therefore, correct prediction of tuberculosis is a very important issue.
According to the findings of our study, we concluded that ANFIS is a more accurate and reliable method than the Bayesian Network, Multilayer Perceptron, PART, JRip and RSES methods for the classification of tuberculosis patients.
Keywords: ANFIS, Biomedical, Patient Classification
LIST OF TABLES
LIST OF FIGURES
1.1 PROBLEM DEFINITION
1.2.1 Tuberculosis and Data Mining
1.2.2 Biomedical and Data Mining
2. MATERIAL & METHODS
2.1 PREPARING TUBERCULOSIS DATA SET
2.2 ADAPTIVE NEURO FUZZY INFERENCE SYSTEM (ANFIS)
2.3 BAYESIAN NETWORK
2.4 MULTILAYER PERCEPTRON
2.5 RIPPER ALGORITHM (JRIP)
2.6 PARTIAL DECISION TREES
2.7 ROUGH NEURAL NETWORKS
2.8 STATISTICAL ACCURACY METRICS
2.8.1 Root Mean Squared Error
2.9 RECEIVER OPERATING CHARACTERISTIC
4. CONCLUSION AND FUTURE PLANS
Table 2.1: Full list of variables
Table 2.2: List of types and acceptable values of variables
Table 2.3: Ranking of variables
Table 2.4: Layers of ANFIS Algorithm
Table 2.5: Structure of a confusion matrix
Table 3.1: Benchmarking of methods
Table 3.2: Confusion matrix of Rough Set test data
Table 3.3: MATLAB code of generating and training FIS
Table 3.4: Confusion matrix of ANFIS test data
Table 4.1: Predicted classes and output codes
Figure 2.1: Distribution of patients by their age groups
Figure 2.2: First-order Sugeno fuzzy model
Figure 2.3: ANFIS Architecture
Figure 2.4: ANFIS model of fuzzy inference
Figure 2.5: Sample rule set of an ANFIS model
Figure 2.6: A sample membership function plot
Figure 2.7: A sample ROC space plot
Figure 3.1: ANFIS testing error plot
Figure 3.2: Surface plot of active specific lung lesion and calcific tissue existence parameters versus output
Figure 3.3: Surface plot of patient weight and age group parameters versus output
Figure 3.4: Plot of age group versus output
Figure 3.5: ROC plot of ANFIS test data
1.1 PROBLEM DEFINITION
Tuberculosis, which a few years ago was considered to be almost under control, has once again become a serious worldwide problem because of AIDS. The disease is caused by a bacterium called Mycobacterium tuberculosis. It can spread among humans, and patients who suffer from tuberculosis may die unless they receive the right treatment. The microorganism is widely found in humans, cattle, sheep and birds. Tuberculosis can affect any organ in the body, but most cases occur in the lungs (Davidson 1999, pp. 347-354).
Tuberculosis manifests differently in adults and children. The first encounter with the bacillus mostly happens in childhood. The microorganism first settles in the lymph glands located at the entry point of the lungs, and as a result those glands enlarge (hilar lymphadenopathy). This is called primary tuberculosis. Adult-type (secondary) tuberculosis follows a different scenario: in these cases, the person’s lung has already been infected with the microorganism. If the immune system is strong enough, the microorganism cannot cause illness but can keep itself alive. When the immune system weakens for some reason, the microorganism becomes active and begins to cause illness. Prostration, long-term illnesses, insomnia, tobacco and alcohol abuse, drug addiction, an irregular lifestyle, malnutrition and stress are some of the factors that weaken the immune system and provide a suitable basis for the illness to occur.
Unlike in primary tuberculosis, lesions spread to the lung parenchyma in secondary tuberculosis. Cavities (holes), which may cause the lung tissue to bleed, can also be seen in advanced phases of the illness (Harrison 1999, pp. 1007-1014).
Lung tuberculosis can be seen across a very wide age range; everybody from newborn babies to the elderly can be affected. The symptoms are cough, fatigue, exhaustion, anorexia, night sweats, fever (which does not exceed 37.5 °C) and, in advanced cases, cavities and hemoptysis (Özlü, Metintaş & Ardıç 2008, pp. 323).
To make a definitive diagnosis, the presence of the microorganism in the phlegm must be proven. However, some other microorganisms can be mistaken for Mycobacterium tuberculosis under the microscope. To avoid this problem, a special culture medium is prepared in which only Mycobacterium tuberculosis can reproduce. The phlegm sample obtained from the patient is inoculated onto this medium and kept at body temperature for 45 days. At the end of this period, the culture medium is checked for any sign that the bacteria have reproduced.
To cure tuberculosis, 4-5 different major antituberculotic antibiotics are used for 6-12 months. Some cases may heal without any treatment if the immune system is strong enough. After full recovery, the lung wounds caused by tuberculosis remain as calcific tissue. Unfortunately, untreated cases may result in the death of the patient (Harrison 1999).
A definitive diagnosis therefore takes 45 days. The aim of this study is to develop a data mining solution that diagnoses tuberculosis as accurately as possible and helps decide whether it is reasonable to start tuberculosis treatment on suspected patients without waiting for the definitive test results. The model must achieve high sensitivity and specificity: patients classified as false positives would take strong antibiotics for 45 days for nothing and would have to deal with their side effects, while the treatment of patients classified as false negatives would be delayed by 45 days, during which their disease would worsen. Therefore, correct prediction of tuberculosis is a very important issue.
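The sensitivity and specificity requirement stated above can be made concrete with a small calculation. The following is a minimal Python sketch, not the method used in this thesis; the class name `ConfusionCounts`, the function names and the counts are hypothetical and chosen only for illustration. It assumes the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts of a binary classifier (hypothetical container)."""
    true_positive: int    # tuberculosis patients classified as tuberculosis
    false_positive: int   # healthy patients classified as tuberculosis
    true_negative: int    # healthy patients classified as healthy
    false_negative: int   # tuberculosis patients classified as healthy


def sensitivity(c: ConfusionCounts) -> float:
    """Share of actual tuberculosis cases that are detected: TP / (TP + FN).

    Assumes at least one actual positive case, so the denominator is non-zero.
    """
    return c.true_positive / (c.true_positive + c.false_negative)


def specificity(c: ConfusionCounts) -> float:
    """Share of healthy patients that are correctly cleared: TN / (TN + FP).

    Assumes at least one actual negative case, so the denominator is non-zero.
    """
    return c.true_negative / (c.true_negative + c.false_positive)


# Made-up counts for illustration only; they are not results from this study.
example = ConfusionCounts(
    true_positive=40, false_positive=5, true_negative=45, false_negative=10
)

print(sensitivity(example))  # 0.8
print(specificity(example))  # 0.9
```

With these made-up counts, 10 of 50 actual tuberculosis patients would be missed and 5 of 50 healthy patients would be treated unnecessarily, which illustrates why both measures need to be high at the same time.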
1.2 BACKGROUND
Today, data mining techniques are used in many different areas. As mentioned earlier, this study focuses on predicting the existence of Mycobacterium tuberculosis in patients by using ANFIS. Besides this study, there are two other research papers on this issue. The following section discusses those studies and then reviews recent ANFIS-based research in the biomedical area.