Prediction and Factors Affecting of Chronic Kidney Disease Diagnosis using Artificial Neural Networks Model and Logistic Regression Model

Recent years have witnessed growing interest in intelligent classification techniques that rely on machine learning. Machine learning, one of the areas of Artificial Intelligence, has been widely used to assist medical experts and doctors in the prediction and diagnosis of different diseases. In this paper, we apply two machine learning algorithms to a problem in the domain of medical diagnosis and analyze their efficiency in predicting the results. The problem selected for the study is the diagnosis of Chronic Kidney Disease (CKD) and the factors affecting it. The dataset used for the study consists of 153 cases and 11 attributes of CKD patients. The objective of this research is to compare the performance of Artificial Neural Networks (ANNs) and Logistic Regression (LR) classifiers on the basis of the following criteria: accuracy, sensitivity, specificity, prevalence, and area under the ROC curve (AUC) for CKD prediction. The experimental results show that the ANNs classifier performs better than the Logistic Regression model, with an accuracy of 84.44%, sensitivity of 84.21%, specificity of 84.61% and AUC of 84.41%. In addition, according to the final fitted models, the factors with the clearest impact on chronic kidney disease patients are creatinine and urea.


Introduction
In recent years, as the number of patients with chronic kidney disease has increased, it has become necessary to study this disease and the factors affecting it using statistical methods and artificial intelligence techniques, which are receiving a great deal of interest.
Chronic kidney disease (CKD), also known as chronic renal failure or chronic kidney failure, is much more widespread than people realize; it often goes undetected and undiagnosed until the disease is well advanced. CKD is often diagnosed as a result of screening people known to be at risk of kidney problems, such as those with high blood pressure or diabetes and those who have relatives with CKD. It may also be identified when it leads to one of its recognized complications, such as cardiovascular disease or anemia. Reported causes include undetermined causes in 14%, pyelonephritis in 4.7%, glomerulonephritis in 4.3%, and polycystic kidney disease in 3.9% (Moukeh et al., 2009).
Machine learning is a branch of computational science that deals with systems that learn automatically from inputs. Classification is the main problem addressed in supervised machine learning.
Igor Kononenko (2001) presented a view of the use of machine learning techniques (1) for the interpretation of medical data, (2) for the intelligent analysis of medical data in the current scenario, and (3) for assisting physicians in the diagnosis of medical disorders in the future. The author suggested integrating machine learning techniques with existing instrumentation to promote the acceptance of machine learning in medicine. Chen et al. (2009) presented a comparative analysis of logistic regression, support vector machine and artificial neural network for the differential diagnosis of benign and malignant solid breast tumors using three-dimensional power Doppler imaging. ROC curve analysis showed that the diagnostic performances of the three models (LRA, SVM and NN) did not differ.
Abid Sarwar et al. (2013) compared the accuracy of Naïve Bayes, artificial neural network, and KNN algorithms for type II diabetes. Type II diabetes is a condition in which the pancreas is not able to produce the needed amount of insulin, or the cells are not able to use the insulin produced (insulin resistance), which leads to an abnormal glucose level in the blood. The results showed that the neural network, with 96% prediction accuracy, performs better than Naïve Bayes with 95% and KNN with 91%.
K.R. Lakshmi et al. (2014) proposed a performance evaluation of three data mining techniques for predicting kidney dialysis survivability. In that research, Artificial Neural Networks, Decision Tree and Logistic Regression were used to extract knowledge about the interaction between the studied variables and patient survival, and the performance of the three techniques was compared.
Vijayarani et al. (2015) worked on the prediction of kidney disease using data mining classification algorithms. Four types of kidney disease were predicted, namely Nephritic Syndrome, Chronic Kidney Disease, Acute Renal Failure and Chronic Glomerulonephritis. The supervised classification algorithms Support Vector Machine (SVM) and Artificial Neural Network (ANN) were used to predict the kidney disease. The experimental results show that ANN is the best classifier: its classification accuracy is higher than that of SVM.
Sharma et al. (2016) discussed different machine learning classification algorithms for the diagnosis of chronic kidney disease. The classification techniques used were: Decision Tree, Linear Discriminant classifier, Quadratic Discriminant classifier, Linear SVM, Quadratic SVM, Fine KNN, Medium KNN, Cosine KNN, Cubic KNN, Weighted KNN, Feed Forward Back Propagation Neural Network using Gradient Descent, and Feed Forward Back Propagation Neural Network [...].
The main aim of this study is to use two supervised machine learning algorithms to predict and diagnose chronic kidney disease in two groups of patients (presence and absence), and to identify the best classifier according to a number of performance evaluation criteria.
In addition, the study identifies the most important factors affecting chronic kidney disease.

Methodology

Artificial Neural Networks
Scientists have found that the cerebral cortex contains 22 billion neurons and 220 trillion connections between them. The mechanism that governs neuronal activity has driven neuroscientists, computer engineers and psychologists to try to simulate the human mind, with the ultimate aim of building large computer architectures that simulate its work.
An ANN is a data processing system that simulates the way natural neural networks work in humans or other organisms. It is made up of processing elements that relate to each other through a network of weighted links. The artificial neural network is an adaptive system: it changes its structure based on the information that passes through it during the so-called learning stage (Wu and Larty, 2000).
ANNs can solve many kinds of problems and have been applied in many areas. The most important are medicine, where applications of instant medicine relate to the principle of memory, as in the human mind, and to pathological signs and diagnosis; and telecommunications, for example the removal of the echo that may be produced in telephone lines. The practical use of these networks lies in the possibility of applying algorithms designed to change the weights of the connections between the artificial neurons in order to produce a response (Wu and Larty, 2000).
where:
$net_j^{(L+1)}$: total of the inputs multiplied by the weights for unit j in layer (L + 1).
$o_j^{(L+1)}$: the output of layer (L + 1).
$w_{ij}^{(L)}$: weights of layer L.
$\delta$: error value between layers.
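For reference, the quantities listed above are conventionally related as follows in a feed-forward network trained by backpropagation. This is the standard textbook formulation, offered only as a sketch: the symbols $net_j$, $o_j$, $w_{ij}$, $\delta_j$ and the activation function $f$ are assumed notation and may differ from the authors' original equations.

```latex
\begin{aligned}
net_j^{(L+1)} &= \sum_i w_{ij}^{(L)}\, o_i^{(L)}, \\
o_j^{(L+1)} &= f\!\left(net_j^{(L+1)}\right), \\
\delta_i^{(L)} &= f'\!\left(net_i^{(L)}\right) \sum_j w_{ij}^{(L)}\, \delta_j^{(L+1)} .
\end{aligned}
```

Here $w_{ij}^{(L)}$ is taken to connect unit i in layer L to unit j in layer (L + 1), so the first two lines describe the forward pass and the third line carries the error value back from layer (L + 1) to layer L.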

Artificial Neural Network Learning Algorithms
The weights in the artificial neural network represent the initial information from which the network will learn, so they must be updated during the training phase. Several algorithms are used for this update, depending on the type of network. In this study, we used the backpropagation algorithm, which is used in multilayer, nonlinear neural networks. It is implemented in two main stages (Livingstone, 2008): 1- Feed Forward Propagation. 2- Backpropagation.

First: Feed Forward Propagation
At this stage, no weight adjustment is made. The network begins by presenting the inputs to the processing elements of the input layer, which excites the units of this layer, and this excitation then spreads forward across the remaining layers of the network. That is, the output of any layer affects only the next layer, and there are no connections between the units within a single layer (Livingstone, 2008).

Second: BackPropagation
Backpropagation is the stage at which the weights are adjusted. The standard backpropagation algorithm is the gradient descent algorithm, in which the error signal is propagated in reverse, from the output back to the input. The network weights are adjusted by calculating the error between the output and the target. One iteration of the algorithm can be written as (Reed and Marks, 1999):

$$w_{k+1} = w_k - \alpha_k g_k$$

where $w_k$ is the weights vector, $g_k$ is the current gradient of the error with respect to the weights, and $\alpha_k$ is the learning rate.
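The two stages can be illustrated with a minimal sketch of one training iteration for a network with a single hidden layer. Everything here is an assumption made for illustration: the sigmoid activation, the squared-error loss, the learning rate `lr`, the function names and the example weights are not taken from the paper, whose models were fitted in R.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Feed-forward stage: no weights change, activations flow layer by layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output

def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    """Backpropagation stage: one gradient descent update, w <- w - lr * gradient,
    for the loss 0.5 * (output - target) ** 2 (an assumed loss)."""
    hidden, output = forward(x, W1, b1, W2, b2)
    # Error signal at the output unit (derivative of the loss through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    # Error signals of the hidden units, computed with the weights before the update.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    new_W2 = [w - lr * delta_out * h for w, h in zip(W2, hidden)]
    new_b2 = b2 - lr * delta_out
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hidden)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta_hidden)]
    return new_W1, new_b1, new_W2, new_b2

# Hypothetical example: two inputs, two hidden units, one output.
x, target = [1.0, 0.5], 1.0
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
W2, b2 = [0.2, -0.1], 0.0
_, before = forward(x, W1, b1, W2, b2)
W1, b1, W2, b2 = train_step(x, target, W1, b1, W2, b2)
_, after = forward(x, W1, b1, W2, b2)
print(before, after)
```

After the single update the output moves towards the target, which is the behaviour the gradient descent step is meant to produce.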

Logistic Regression Model
Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. Quite often the outcome variable is discrete, taking on two or more possible values. The logistic regression model is the most frequently used regression model for the analysis of such data (David et al., 2013). LR is a type of nonlinear regression model that describes the relationship between the dependent variable (the response) and a set of explanatory variables, where the dependent variable is qualitative. The dependent variable may take only two values, coded (0) or (1), which is the basis of binary logistic regression, the model studied in our research. Alternatively, the dependent variable may take more than two classes, which gives multinomial logistic regression. The explanatory variables can be continuous or discrete (Harrell, 2010).
The logistic regression model rests on the basic assumption that the dependent variable under study is binary and follows a Bernoulli distribution with the following probability function (Özkale and Arıcan, 2016):

$$P(Y_i = y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}$$

The probability ($\pi_i$) can be defined mathematically in terms of the explanatory variables and the logistic function as follows:

$$\pi_i = \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)}$$

where ($\beta$) is the vector of parameters and {$x_i$} is the row vector of independent variables. To simplify notation, we use the quantity $\pi(x) = E(Y \mid x)$ to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$
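As a small illustration of the model described above, the following sketch evaluates the logistic probability for one case and the Bernoulli log-likelihood of a coefficient vector. The function names, the convention that `beta[0]` is the intercept, and the toy data are assumptions made for the example, not values from the study.

```python
import math

def logistic_probability(x, beta):
    """pi(x) = exp(eta) / (1 + exp(eta)), with eta = beta[0] + sum(beta[k] * x[k-1])."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_log_likelihood(X, y, beta):
    """Sum over cases of y*log(pi) + (1 - y)*log(1 - pi)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = logistic_probability(xi, beta)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Hypothetical data: one explanatory variable, four cases.
X = [[0.5], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
null_fit = bernoulli_log_likelihood(X, y, [0.0, 0.0])    # every pi equals 0.5
better_fit = bernoulli_log_likelihood(X, y, [-3.0, 2.0])
print(null_fit, better_fit)
```

Fitting the model amounts to searching for the coefficients that maximise this log-likelihood; here the second coefficient vector describes the toy data better than the all-zero one.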

Performance Evaluation
In this part we discuss several methods for evaluating the performance of the two models (LR and ANNs).

A-Confusion Matrix
The classification matrix is a statistical indicator of how well the model fits the data. It classifies binary events by means of the confusion matrix, which shows the actual versus the predicted membership of each group (Soderstorm and Leitner, 1997).

Prevalence is the percentage of a population that is affected with a particular disease at a given time:

Prevalence = $\frac{TP + FN}{TP + FN + FP + TN}$ (10)

where TP, FN, FP and TN are the numbers of true positives, false negatives, false positives and true negatives in the confusion matrix.
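The criteria built on the confusion matrix can be computed as in the sketch below. The function name is hypothetical, and the four counts are not copied from the paper's tables: they are inferred from the percentages reported for the logistic regression model on the training dataset (108 cases, 43 with CKD, 97 classified correctly).

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and prevalence from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,        # correctly classified cases
        "sensitivity": tp / (tp + fn),        # diseased cases detected
        "specificity": tn / (tn + fp),        # healthy cases recognised
        "prevalence": (tp + fn) / total,      # share of diseased cases in the data
    }

# Counts inferred from the reported LR training results (an assumption).
lr_training = confusion_metrics(tp=35, fn=8, fp=3, tn=62)
for name, value in lr_training.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts the four values come out at 89.81%, 81.40%, 95.38% and 39.81%, matching the figures reported for the training dataset.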

E-The Area under curve (ROC curve)
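AUC is defined as a measure of the overall performance of the classifier scores across all possible values of the threshold (or cut-off point).
If the probability distributions of both detections and false alarms are known, the ROC curve can be created by plotting the cumulative distribution function of the detection probability (the area under the probability distribution from −∞ to the threshold) against that of the false-alarm probability. The area under the ROC curve is commonly used as a measure of the quality of probabilistic classification and is given by the following formula (Hosmer & Lemshow, 2000):

$$AUC = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(T' > T)\, f_1(T')\, f_0(T)\, dT'\, dT \qquad (11)$$

where $f_1$ and $f_0$ are the probability densities of the classifier score for the presence and absence classes, and $I(\cdot)$ is the indicator function.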
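AUC can equivalently be read as the probability that a randomly chosen case with the disease receives a higher score than a randomly chosen case without it. The sketch below estimates it by comparing every such pair; the function name and the scores are hypothetical, and this pairwise estimate is a stand-in for, not a copy of, whatever routine the authors used in R.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities of CKD.
with_ckd = [0.9, 0.8, 0.4]
without_ckd = [0.7, 0.3, 0.2]
auc = auc_from_scores(with_ckd, without_ckd)
print(auc)
```

In this toy example eight of the nine pairs are ordered correctly, so the estimate is 8/9; a classifier that separates the two groups perfectly gives 1.0, and one that scores at random gives about 0.5.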

Real Dataset Collection
The data cover 153 patients and 12 variables: 11 independent variables and one dependent variable (presence = 1 and absence = 0 of chronic kidney disease (CKD)), based on a blood (serum) test. There are no missing values in the data. The patients included in this study were recorded at Erbil Teaching Hospital during the period from 1 April 2018 to 30 June 2018.
The studied sample consists of 85 patients without CKD and 68 patients with CKD. The age of the patients ranged from 12 to 90 years, and the sample comprises 83 (54%) males and 70 (46%) females. Table 2 describes the data studied in this paper. The dataset is randomly partitioned into a training dataset and a test dataset: 70% of the samples (108 patients) are selected for training and the remaining 30% (45 patients) for testing. For a fair comparison, and to alleviate the effect of the data partition, the classification performance metrics of all the methods used are evaluated using 10-fold cross-validation, averaged over 10 partitions. All implementations of the study on the real data were carried out in R (version 3.4.4).
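The partitioning scheme described above can be sketched as follows. The study itself was run in R, so this Python version is only an illustration; the function names and the random seed are hypothetical, and the training size of 108 is passed explicitly because it is the figure the paper reports.

```python
import random

def train_test_split_indices(n_cases, n_train, seed=0):
    """Randomly assign case indices to a training set and a testing set."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

def k_fold_indices(indices, k=10, seed=0):
    """Split the given indices into k folds of nearly equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

train, test = train_test_split_indices(n_cases=153, n_train=108)
folds = k_fold_indices(train, k=10)
print(len(train), len(test), [len(fold) for fold in folds])
```

Each fold is held out once while the model is fitted on the other nine; repeating the whole procedure with different seeds corresponds to averaging over several partitions.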

Performance Evaluation of Models Applied
After partitioning the data into two groups (training and testing), we began building the model on the training dataset, which includes 108 cases (class 0 = 65 cases, class 1 = 43 cases), where 0 means absence of CKD and 1 means presence of CKD. The content of the response is unchanged: 65 cases of the absence class and 43 cases of the presence class were detected. From table 3 we can conclude that the logistic regression model correctly classified 97 of the 108 available cases. The model has an accuracy of 89.81%, and both the sensitivity and the specificity are greater than 80 percent (81.40% and 95.38%, respectively). That is, based on the independent variables entered, the model correctly predicts 81.40% of the patients with CKD and 95.38% of those without CKD. In addition, the prevalence of the disease in the community for this model on the training dataset is 39.81%.
From table 3, on the testing dataset the model achieved an accuracy of 82.22%, and both the sensitivity and the specificity are greater than 80 percent (84.21% and 80.77%, respectively). We also found that the prevalence of the disease in the population for this model on the testing dataset is 42.22%.

Iraqi Journal of Statistical Sciences (28) 2019 [151]
Another tool for measuring model performance is the Receiver Operating Characteristic (ROC) curve. It summarizes the model's accuracy through the area under the ROC curve (AUC). The AUC for the testing dataset of the LR model is 0.8249, as shown in Figure 3.

Figure (3) (ROC) curve of Testing dataset for LR
The ROC curve plots the sensitivity (y axis) against the specificity (x axis). Figure 3 shows that the area under the curve is 82.5%. The ROC curve is a metric used to check the quality of classifiers.
Table 5 summarizes the analysis of the confusion matrix. The neural network model correctly classified all 108 of the 108 available cases, without any classification errors. The model therefore has an accuracy of 100%, which means that the other metrics (model sensitivity, model specificity, etc.) are also equal to 100%. In addition, the prevalence of the disease in the community for this model on the training dataset is 39.81%.
These results show that the artificial neural network model achieved excellent results on the training dataset compared with the logistic regression model. From table 5, on the testing dataset the model achieved an accuracy of 84.44%, and both the sensitivity and the specificity are greater than 80 percent (84.21% and 84.61%, respectively). That is, based on the independent variables entered, the model correctly predicts 84.21% of the patients with CKD and 84.61% of those without CKD. We also found that the prevalence of the disease in the community for this model on the testing dataset is 42%.
We then compared the two classification models (LR and ANNs) on the following criteria: model accuracy, model sensitivity, model specificity, prevalence, and area under the ROC curve. Table 7 shows the comparison for the testing dataset only, because the best classifier is chosen on the basis of the testing dataset. The results showed that the ANNs model was better than LR: classification using the ANNs model was more accurate and more efficient. The specificity criterion for LR was 80.77% and improved to 84.61% with ANNs, which means that the latter model is preferred. The prevalence of the disease in the community is approximately 42% for all models.
The area under the ROC curve for logistic regression was 82.49%, while for the ANNs model it was 84.41%. The greater the area under the ROC curve, the better the classifier.

Fitted Final Model and Variables Importance
For the neural network model, the study data were fed to a network with a single hidden layer of eight neurons. This architecture was reached after many changes to the number of layers and to the number of nodes within each layer. Garson's algorithm (Garson, 1991) was used to evaluate relative variable importance (Beck, 2018).
Table 8 and figure 5 show the degree of importance of each factor affecting CKD patients under the ANNs model. Creatinine was the most influential factor on the dependent variable (class) at 40%, followed by urea at 18%, total bilirubin at 9%, age of patient at 8% and sex of patient at 7%.
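A minimal sketch of Garson-style relative importance for a network with one hidden layer and one output is given below. It follows one commonly cited statement of the formula, in which each input's absolute weight into a hidden unit is expressed as a share of that unit's incoming weights and then scaled by the unit's absolute output weight; the R implementation used in the study may differ in detail. The function name and the weights are hypothetical.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input.

    input_hidden[i][j]: weight from input i to hidden unit j.
    hidden_output[j]:   weight from hidden unit j to the output.
    """
    n_inputs = len(input_hidden)
    scores = [0.0] * n_inputs
    for j, v in enumerate(hidden_output):
        # Total absolute weight flowing into hidden unit j.
        column_total = sum(abs(input_hidden[r][j]) for r in range(n_inputs))
        for i in range(n_inputs):
            scores[i] += abs(input_hidden[i][j]) / column_total * abs(v)
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical weights: two inputs, two hidden units.
importance = garson_importance(
    input_hidden=[[1.0, 2.0], [3.0, 2.0]],
    hidden_output=[1.0, 1.0],
)
print(importance)
```

The result is a set of non-negative shares that sum to one, which is why the importances in a Garson-type analysis are reported as percentages. Because only absolute weights are used, the method indicates the strength of a factor's influence but not its direction.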

Figure (5) Importance Variables ANNs Model plot
The factors with the least influence on the dependent variable were phosphorus (4%), glucose (4%) and calcium (2%).
To determine the most important factors affecting the LR model, we used the varImp function in the caret package of R. Table 9 shows the degree of importance of each independent variable affecting CKD under LR: creatinine has the largest effect, followed by urea (100% and 95%, respectively), then smoking, and so on. Figure 6 shows the significance of the independent variables of the logistic regression model.
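For a logistic regression fit, caret's varImp is understood to rank variables by the absolute value of each coefficient's test statistic, and, when scaling is on, to rescale the scores so that the largest becomes 100 and the smallest 0. The sketch below reproduces that rescaling under this assumption. The function name is hypothetical and the z statistics are invented purely for illustration; they are not the study's estimates.

```python
def scaled_importance(z_statistics):
    """Rescale |z| so the most important variable scores 100 and the least scores 0."""
    magnitudes = {name: abs(z) for name, z in z_statistics.items()}
    lo, hi = min(magnitudes.values()), max(magnitudes.values())
    return {name: 100.0 * (m - lo) / (hi - lo) for name, m in magnitudes.items()}

# Hypothetical z statistics, NOT taken from the paper.
example = scaled_importance(
    {"creatinine": 4.0, "urea": -3.8, "smoking": 2.0, "calcium": 0.0}
)
print(example)
```

Read this way, a score of 95% for urea means that its test statistic is almost as large in magnitude as that of creatinine, not that it explains 95% of the outcome.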

Figure (6) Importance Variables LR Model plot
From the results above, the equation of the logistic regression model can be determined with the significant independent variables affecting CKD patients, as shown in Table 10.

Conclusions
Given the great importance of CKD and the deaths and health crises it may cause in the community, this study compared two classification methods for classifying CKD patients on the basis of blood tests. According to the evaluation criteria, ANNs was the best of the methods used in this study. We also concluded that, under both methods (LR and ANNs), creatinine and urea are the most influential and significant variables for patients with chronic kidney disease.
The study also concluded, from the sample taken from the community, that the prevalence of the disease in the community is high, which calls for action to reduce prevalence among the population.