Abstract
The logistic regression model is one of the modern statistical methods developed to predict qualitative response variables (nominal or ordinal). It is regarded as an alternative to simple and multiple linear regression, and it shares the concepts of that model: the overall effect of the set of independent variables on the dependent variable can be tested, and standard goodness-of-fit criteria can be used. In some cases the explanatory variables are correlated, which inflates the variances of the estimates; this is known as the problem of multicollinearity. This research reviews several biased methods for estimating the parameters of the logistic regression model that reduce the effect of multicollinearity among the variables. The methods were compared using the mean square error (MSE) criterion. They were applied to Monte Carlo simulation data to evaluate and compare their performance, and also to real data. Both the simulation results and the real application showed that the logistic ridge estimator performs better than the other methods.
Highlights
1- The simulation results showed that the best way to address the problem of multicollinearity is the ridge logistic regression method.
2- The higher the correlation coefficient value, the greater the MSE value.
3- As the number of explanatory variables (p) increases, the MSE increases for all estimators; however, the logistic ridge estimator (LRE) still performs better than the other estimators.
4- As the sample size increases, the MSE decreases for every value of the correlation coefficient and every number of explanatory variables considered.
5- The application to real data showed that ridge logistic regression is the best of the methods presented in this research, because it has the lowest mean square error; the standard errors of the estimated parameters were close across all methods.
Full Text
1- Introduction
Regression is a statistical method for studying the relationship between a dependent variable and one or more independent variables; it yields a mathematical equation that best represents this relationship. The logistic regression model is a special case of the generalized linear model and is the most common model for analyzing qualitative data. It relates the response to a linear predictor through a logarithmic (logit) transformation. It has several types, the most common of which is binary logistic regression, the only type used in this research. It is a powerful tool because it provides tests of the significance of the parameters and gives the researcher an idea of how much each independent variable affects the binary qualitative dependent variable. It also allows the effects of the independent variables to be compared, so the researcher can conclude that one variable is stronger than another in explaining the occurrence of the outcome of interest. In addition, logistic regression analysis can include qualitative independent variables and the effect of interactions between the independent variables on the binary dependent variable [Abbas,2012]. The researcher faces many problems, most of which arise when the assumptions of the analysis are not met under the ordinary least squares (OLS) method. One of them is multicollinearity, which affects the results of estimation and testing. It arises from an association between the explanatory variables and produces weak, unreliable estimates, because the variances of the estimators become inflated to an unacceptable degree; the OLS method cannot give good estimates when there is a linear relationship between the explanatory variables.
2- Logistic Regression
The logistic regression model is an important statistical model for analyzing binary data (0 or 1). The primary goal of most studies is to analyze and evaluate the relationships between a set of variables in order to obtain a formula that describes the model. The logistic regression model is used to describe the relationship between a discrete response variable and the explanatory variables, and to predict, estimate, and control the values of the dependent variable as the explanatory variables change [Farhood, 2014]. One characteristic of binary logistic regression is that the response variable (Y) follows the Bernoulli distribution: it takes the value (1) with probability (π), the probability of success, and the value (0) with probability (1-π), the probability of failure [Qasim,2011]. In linear regression, where the dependent and independent variables take continuous values, the model that links the variables is as follows:
$$Y = \beta_0 + \beta_1 x + e \qquad (1)$$
Here (Y) is a continuous observed variable and (e) is a random error. Writing E(Y) for the mean of the observed values of (Y) at a given value of the variable (x), the model can be written as follows:
$$E(Y \mid x) = \beta_0 + \beta_1 x \qquad (2)$$
In linear regression the right-hand side of the model can take any value in (-∞, +∞), but when the variable (Y) is binary, its expected value is:
$$E(Y \mid x) = P(Y = 1 \mid x) = \pi \qquad (3)$$
Thus the expected value of (Y) is confined to the interval [0, 1], while a linear predictor is unbounded, so the linear model is not applicable from the regression point of view. One way to solve this problem is to apply a suitable mathematical transformation to the dependent variable (Y). Since (0 ≤ π ≤ 1), the ratio (π / (1-π)) is a non-negative quantity in [0, ∞), and its natural logarithm (base e) takes values in (-∞, +∞). Therefore, the regression model with one explanatory variable can be written as follows:
$$\log_e\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x \qquad (4)$$
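If there is more than one explanatory variable, the model is formulated as follows:
$$\log_e\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} \qquad (5)$$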
where i = 1, 2, 3, ..., n; $\beta_0, \beta_1, \dots, \beta_p$ are the parameters to be estimated; and $x_{i1}, x_{i2}, \dots, x_{ip}$ are the explanatory variables.
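The quantity (π / (1-π)) is the odds of success, that is, the odds in favor of the event of interest, and its mathematical formula is as follows:
$$\frac{\pi_i}{1-\pi_i} = \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \qquad (6)$$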
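The probability formula for the logistic regression model is written as follows:
$$\pi_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})} \qquad (7)$$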
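The logit transformation and the probability formula can be illustrated with a minimal Python sketch. The function names `logit`, `inv_logit`, and `success_probability` are illustrative assumptions and are not part of the original study.

```python
import math


def logit(pi):
    """Log-odds log(pi / (1 - pi)) of a probability strictly between 0 and 1."""
    return math.log(pi / (1.0 - pi))


def inv_logit(eta):
    """Probability exp(eta) / (1 + exp(eta)) for a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def success_probability(beta0, betas, x):
    """Success probability for one observation x with intercept beta0 and slopes betas."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return inv_logit(eta)
```

The two functions are inverses of each other: the log-odds range over the whole real line, while the probabilities returned by `inv_logit` always lie between 0 and 1.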
The quantity Loge(π / (1-π)) is called the log odds of success. Logistic regression does not require many assumptions. It only requires that there is no correlation between the explanatory variables and that the number of observations in each group is large, commonly taken to be more than five times the number of parameters in the final model [Demosthenes, 2006]. The parameters of the logistic regression model are estimated by the Maximum Likelihood (ML) method, one of the best-known estimation methods in statistics. Assuming that the observations are independent, the log-likelihood function is defined by the following formula: [Hosmer and Lemeshow, 2000]
$$L = \sum_{i=1}^{n} \left[ y_i \log_e(\pi_i) + (1 - y_i) \log_e(1 - \pi_i) \right] \qquad (8)$$
Maximizing the likelihood function (L) means taking its derivative with respect to the parameters (β) and setting the result equal to zero, which gives the likelihood equations:
$$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{n} (y_i - \pi_i)\, x_i = 0 \qquad (9)$$
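Since equation (9) is nonlinear in the parameters, special methods are needed to solve it. Iteratively Re-Weighted Least Squares (IRLS) can be applied for this purpose, and the maximum likelihood estimator (MLE) of the parameters (β) is then obtained as follows:
$$\hat{\beta}_{ML} = (X'\hat{W}X)^{-1} X'\hat{W}\hat{z} = S^{-1} X'\hat{W}\hat{z} \qquad (10)$$
where $S = X'\hat{W}X$, $\hat{W} = \mathrm{diag}[\hat{\pi}_i(1-\hat{\pi}_i)]$, and $\hat{z}$ is the vector with elements $\hat{z}_i = \log_e\left(\hat{\pi}_i/(1-\hat{\pi}_i)\right) + (y_i - \hat{\pi}_i)/\left(\hat{\pi}_i(1-\hat{\pi}_i)\right)$.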
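The IRLS iteration can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard working-response form of the algorithm, with weights π(1-π) and information matrix S = X'WX. The function name `irls_logistic`, the iteration limit, and the tolerance are illustrative choices, not the authors' code.

```python
import numpy as np


def irls_logistic(X, y, max_iter=50, tol=1e-8):
    """Fit binary logistic regression by IRLS.

    X is the n-by-p design matrix (include a column of ones for an intercept),
    y is a vector of 0/1 responses. Returns (beta_hat, S) with S = X'WX.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor
        pi = 1.0 / (1.0 + np.exp(-eta))     # fitted probabilities
        w = pi * (1.0 - pi)                 # diagonal of W
        z = eta + (y - pi) / w              # working response
        S = X.T @ (w[:, None] * X)          # information matrix X'WX
        beta_new = np.linalg.solve(S, X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # Recompute S at the final estimate so it matches the returned beta.
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    S = X.T @ ((pi * (1.0 - pi))[:, None] * X)
    return beta, S
```

When the explanatory variables are strongly correlated, S is close to singular and the linear solve in each step becomes unstable, which is the situation the biased estimators are designed for.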
One disadvantage of the MLE is that its MSE becomes large when the explanatory variables are linearly dependent, which is the problem of multicollinearity. The condition number (CN) is used to test for multicollinearity among the variables and is defined by the following formula:
$$CN = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} \qquad (11)$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the matrix (S). A value of CN > 30 indicates strong multicollinearity among the explanatory variables [Inan and Erdogan, 2013]. Likewise, eigenvalues of the matrix (S) that are close to zero indicate multicollinearity among the variables, and this leads to an increase in the MSE. The mean square error of the estimator in equation (10) is given by the following formula: [Siray et al. 2015]
$$MSE(\hat{\beta}_{ML}) = \sum_{j=1}^{p} \frac{1}{\lambda_j} \qquad (12)$$
where $\lambda_1, \lambda_2, \dots, \lambda_p$ are the eigenvalues of the matrix (S).
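Both diagnostics depend only on the eigenvalues of (S), which a short Python sketch can make concrete. It assumes that CN is the square root of the ratio of the largest to the smallest eigenvalue and that the MSE of the ML estimator is the sum of the reciprocal eigenvalues. The eigenvalues used here are hypothetical and are not taken from the study.

```python
import math


def condition_number(eigenvalues):
    """Square root of the ratio of the largest to the smallest eigenvalue."""
    return math.sqrt(max(eigenvalues) / min(eigenvalues))


def mse_ml(eigenvalues):
    """MSE of the ML estimator as the sum of reciprocal eigenvalues of S."""
    return sum(1.0 / lam for lam in eigenvalues)


# Hypothetical eigenvalues: the smallest one is close to zero.
lams = [4.0, 1.0, 0.001]
cn = condition_number(lams)   # above 30, so strong multicollinearity is indicated
mse = mse_ml(lams)            # dominated by the term 1 / 0.001
```

A single eigenvalue near zero is enough to drive both quantities up, since the term 1/λ for that eigenvalue dominates the sum.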
3- Ridge Estimator
When there is multicollinearity, the maximum likelihood (ML) estimator suffers from inflated variances of the estimated parameters and from instability; this inflation appears in the diagonal elements of the inverse of the matrix (S). To solve this problem, [Schaefer et al., 1984] suggested a logistic ridge estimator (LRE), based on the ridge estimator first introduced by Hoerl and Kennard (1970) for estimating the parameters of the multiple linear regression model. The method adds a small positive constant (k), with (0≤ k ≤1), to the diagonal elements of the information matrix (S) in order to obtain a more accurate estimator, and in this way it weakens the links between the explanatory variables. The logistic ridge estimator is defined by the following formula: [Månsson and Shukur, 2011] and [Kibria et al. , 2012]
$$\hat{\beta}_{LRE} = (S + kI)^{-1} S \hat{\beta}_{ML} \qquad (13)$$
The estimator (ML) can be considered a special case of equation (13) when the value of (k = 0). The value of k in logistic regression models is found according to one of the following common formulas: [Schaefer et al., 1984] & [Smith et al., 1991]
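The mean square error of the estimator in equation (13) is given by the following formula:
$$MSE(\hat{\beta}_{LRE}) = \sum_{j=1}^{p} \frac{\lambda_j}{(\lambda_j + k)^2} + k^2 \sum_{j=1}^{p} \frac{\alpha_j^2}{(\lambda_j + k)^2} \qquad (15)$$
where $\alpha_j$ is the j-th element of $\alpha = \gamma'\beta$ and $\gamma$ is the matrix of eigenvectors of the matrix (S).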
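The ridge estimator and its MSE can be sketched in NumPy as follows, assuming the form (S + kI)^{-1} S β_ML for the estimator and a variance term plus a squared-bias term for the MSE. The names `ridge_logistic` and `mse_ridge` and the numerical values in the usage lines are hypothetical.

```python
import numpy as np


def ridge_logistic(S, beta_ml, k):
    """Logistic ridge estimate (S + kI)^{-1} S beta_ml for a constant k >= 0."""
    p = S.shape[0]
    return np.linalg.solve(S + k * np.eye(p), S @ beta_ml)


def mse_ridge(eigenvalues, alpha, k):
    """Variance term plus squared-bias term of the logistic ridge estimator."""
    lam = np.asarray(eigenvalues, dtype=float)
    a = np.asarray(alpha, dtype=float)
    variance = np.sum(lam / (lam + k) ** 2)
    bias_sq = k ** 2 * np.sum(a ** 2 / (lam + k) ** 2)
    return variance + bias_sq


# Hypothetical, strongly collinear information matrix and ML estimate.
S = np.array([[2.0, 1.9], [1.9, 2.0]])
beta_ml = np.array([1.0, -1.0])
beta_ridge = ridge_logistic(S, beta_ml, k=0.05)
```

Setting k = 0 returns the ML estimate unchanged, while any k > 0 shrinks the estimate and trades a small bias for a reduction in variance.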
4- Liu Estimator
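Liu's logistic estimator (LLE) was proposed by Månsson et al. (2012) as another solution to the problem of multicollinearity, and it is defined by the following formula:
$$\hat{\beta}_{LLE} = (S + I)^{-1} (S + dI) \hat{\beta}_{ML} \qquad (16)$$
where 0 < d < 1.
The mean square error of the estimator in equation (16) is given by the following formula:
$$MSE(\hat{\beta}_{LLE}) = \sum_{j=1}^{p} \frac{(\lambda_j + d)^2}{\lambda_j (\lambda_j + 1)^2} + (d - 1)^2 \sum_{j=1}^{p} \frac{\alpha_j^2}{(\lambda_j + 1)^2}$$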
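The dependence of the Liu estimator's MSE on the parameter d can be explored with a small Python sketch. It assumes the eigenvalue form of the MSE, with a variance term (λ + d)² / (λ(λ + 1)²) and a squared-bias term (d - 1)² α² / (λ + 1)². The function name `mse_liu` and the input values are hypothetical.

```python
def mse_liu(eigenvalues, alpha, d):
    """MSE of the Liu logistic estimator for eigenvalues of S and alpha = gamma' beta."""
    variance = sum((lam + d) ** 2 / (lam * (lam + 1.0) ** 2) for lam in eigenvalues)
    bias_sq = (d - 1.0) ** 2 * sum(
        a ** 2 / (lam + 1.0) ** 2 for lam, a in zip(eigenvalues, alpha)
    )
    return variance + bias_sq


# Hypothetical eigenvalues with one value close to zero.
lams = [3.9, 0.1]
alpha = [1.0, 1.0]
mse_at_half = mse_liu(lams, alpha, d=0.5)
mse_at_one = mse_liu(lams, alpha, d=1.0)   # equals the ML value, sum of 1 / lambda
```

At d = 1 the bias term vanishes and the expression reduces to the MSE of the ML estimator, so values of d below 1 can be compared directly against that baseline.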
5- Liu-Type Logistic Estimator
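The Liu-type estimator was suggested as a substitute for the ridge regression estimator in linear regression, and it is defined by the following formula:
$$\hat{\beta}_{k,d} = (X'X + kI)^{-1} (X'y - d\hat{\beta}_{OLS})$$
where $-\infty < d < \infty$, $k > 0$, and $\hat{\beta}_{OLS}$ is the estimate of the parameter β obtained by the least squares method. To deal with strong multicollinearity in the logistic model, a Liu-type logistic estimator (LLTE) has been proposed, which is defined by the following formula:
$$\hat{\beta}_{LLTE} = (S + kI)^{-1} (S - dI) \hat{\beta}_{ML}$$
The mean square error of this estimator is given by the following formula:
$$MSE(\hat{\beta}_{LLTE}) = \sum_{j=1}^{p} \frac{(\lambda_j - d)^2}{\lambda_j (\lambda_j + k)^2} + (d + k)^2 \sum_{j=1}^{p} \frac{\alpha_j^2}{(\lambda_j + k)^2}$$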
6- Two-parameter Logistic Estimator
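The two-parameter estimator was suggested by [Asar and Genc, 2017] as an alternative to the ridge regression estimator. For the logistic regression model, the two-parameter logistic estimator (LTPE) is defined by the formula:
$$\hat{\beta}_{LTPE} = (S + kI)^{-1} (S + kdI) \hat{\beta}_{ML} \qquad (23)$$
where 0 < d < 1 and k > 0.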
The LTPE combines two different estimators, the Liu logistic estimator (LLE) and the ridge logistic estimator (LRE). If (k = 1) in equation (23) we get the Liu logistic estimator, if (k = 0) we get the maximum likelihood estimator, and if (d = 0) we get the ridge logistic estimator. The mean square error of the estimator in equation (23) is given by the following formula:
$$MSE(\hat{\beta}_{LTPE}) = \sum_{j=1}^{p} \frac{(\lambda_j + kd)^2}{\lambda_j (\lambda_j + k)^2} + k^2 (d - 1)^2 \sum_{j=1}^{p} \frac{\alpha_j^2}{(\lambda_j + k)^2}$$
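The three special cases can be checked numerically with a short NumPy sketch. It assumes that the two-parameter estimator has the form (S + kI)^{-1} (S + kdI) β_ML, which is the form consistent with those special cases. The function name `two_parameter_logistic` and the matrix and vector values are hypothetical.

```python
import numpy as np


def two_parameter_logistic(S, beta_ml, k, d):
    """Two-parameter logistic estimate (S + kI)^{-1} (S + k d I) beta_ml."""
    identity = np.eye(S.shape[0])
    return np.linalg.solve(S + k * identity, (S + k * d * identity) @ beta_ml)


# Hypothetical information matrix and ML estimate.
S = np.array([[2.0, 1.9], [1.9, 2.0]])
beta_ml = np.array([1.0, -1.0])
identity = np.eye(2)

# k = 0 gives back the ML estimate.
same_as_ml = two_parameter_logistic(S, beta_ml, k=0.0, d=0.3)
# d = 0 gives the ridge form (S + kI)^{-1} S beta_ml.
same_as_ridge = two_parameter_logistic(S, beta_ml, k=0.4, d=0.0)
ridge = np.linalg.solve(S + 0.4 * identity, S @ beta_ml)
# k = 1 gives the Liu form (S + I)^{-1} (S + dI) beta_ml.
same_as_liu = two_parameter_logistic(S, beta_ml, k=1.0, d=0.3)
liu = np.linalg.solve(S + identity, (S + 0.3 * identity) @ beta_ml)
```

Writing the estimator as a single function of (k, d) makes it straightforward to evaluate all of these estimators with the same code path in a simulation.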
7- The practical side:
1- Simulation: To identify the best estimator, Monte Carlo simulation was used to compare the estimators described above using the mean square error criterion. The data were generated in MATLAB with sample sizes (n = 50,120,200). The following formula was used to generate the explanatory variables:
$$x_{ij} = (1 - \rho^2)^{1/2} w_{ij} + \rho\, w_{ip}, \quad i = 1, 2, \dots, n, \quad j = 1, 2, \dots, p$$
where ρ represents the value of the correlation between the explanatory variables in the studied model, and values were taken ( ).
n: represents the number of observations.
p: represents the number of explanatory variables, with values (p = 5,10).
wij: represents random numbers that follow the standard normal distribution.
wip: represents the values of the last column of the generated variables.
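The response variable for the (n) observations was generated according to the logistic regression model, with success probability:
$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$
The parameters were set equal, β1=β2=β3=…=βp, and their values were determined following [Kibria, 2003]. The experiment was repeated (1000) times, and the mean square error (MSE) was calculated according to the following formula:
$$MSE(\hat{\beta}) = \frac{1}{1000} \sum_{r=1}^{1000} (\hat{\beta}_r - \beta)'(\hat{\beta}_r - \beta)$$
where $\hat{\beta}_r$ represents the estimate obtained in replication r by the ML, LRE, LLE, LLTE, and LTPE estimators, respectively.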
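The simulation design can be sketched in NumPy as follows. Several details are assumptions made for illustration only: the shared normal column is generated as an extra (p+1)-th column, the responses are drawn as Bernoulli variables, and the function names are hypothetical. The study itself generated its data in MATLAB.

```python
import numpy as np


def generate_design(n, p, rho, rng):
    """Correlated explanatory variables: sqrt(1 - rho^2) * w_ij + rho * (shared column)."""
    w = rng.standard_normal((n, p + 1))        # assumption: extra shared column
    return np.sqrt(1.0 - rho ** 2) * w[:, :p] + rho * w[:, [p]]


def generate_response(X, beta, rng):
    """Binary responses drawn with success probability from the logistic model."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (rng.uniform(size=X.shape[0]) < pi).astype(float)


def monte_carlo_mse(estimates, beta):
    """Average squared distance between replicated estimates and the true beta."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(beta, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


rng = np.random.default_rng(1)
X = generate_design(n=200, p=5, rho=0.9, rng=rng)
beta_true = np.full(5, 1.0 / np.sqrt(5))       # assumption: equal coefficients, unit length
y = generate_response(X, beta_true, rng)
```

In a full experiment, each replication would generate new responses, fit every estimator, and store the estimates; `monte_carlo_mse` then summarizes each estimator over the replications.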
We conclude from the results of Table (1) the following three points:
1- As the correlation coefficient increases, the MSE increases for every number of explanatory variables (p) and every sample size (n) considered. In addition, the LRE estimator performs better than the other estimators.
2- As the number of explanatory variables (p) increases, the MSE increases for all estimators. However, the LRE estimator still performs better than the other estimators.
3- As the sample size increases, the MSE decreases for every value of the correlation coefficient and every number of explanatory variables considered.
Table (1): MSE values of the ML, LRE, LLE, LLTE, and LTPE estimators for different values of ρ, p, and n in the generated data.
2- Real data: The data concern anemia at two levels: acute (severe) anemia, coded (0), and chronic anemia, coded (1). The explanatory variables are gender (X1), age (X2), hemoglobin level (hp) (X3), blood ferritin level (X4), reticulocyte count, that is, the count of immature red blood cells (X5), MCV (X6), iron deficiency in the blood (X7), blood transferrin level (X8), anemia caused by hemorrhage (X9), anemia of chronic disease (X10), and anemia due to a decrease in red blood cells (X11). An initial analysis of the data in Minitab gave the following:
1- The number of patients with severe anemia is (67), or 47.9% of the sample, while the number with chronic anemia is (73), or 52.1%, as shown in Table (2).
Table (2): Number of patients with each type of anemia.
Type of anemia | Number of patients | Percentage (%)
Severe anemia | 67 | 47.9
Chronic anemia | 73 | 52.1
Total | 140 | 100.0
2- As for the number of males and females in the sample, they were as in Table (3) as follows:
Table (3): Number of males and females in the sample.
Gender | Number | Percentage (%)
Male | 83 | 59.3
Female | 57 | 40.7
Total | 140 | 100.0
To test for multicollinearity in the data, the eigenvalues of the matrix (S) were computed; they are shown in Table (4). The value CN = 726.9358 is greater than (30), which is evidence of multicollinearity among the explanatory variables.
Table (4): Eigenvalues of the matrix (S).
0.38 | 0.27 | 0.93 | 1.8 | 2.35 | 7.12 | 13.65 | 31.27 | 1762.75 | 33144.1 | 143077
The following table shows the estimated binary logistic regression parameters, standard errors, and MSE values for each of the ML, LRE, LLE, LLTE, and LTPE estimators. The best estimator is LRE, which has the lowest MSE.
Table (5): Estimated parameters, standard errors, and MSE values for ML, LRE, LLE, LLTE, and LTPE.
8- Conclusions:
1- The simulation results showed that the best way to address the problem of multicollinearity is the ridge logistic regression method.
2- The higher the correlation coefficient value, the greater the MSE value.
3- As the number of explanatory variables (p) increases, the MSE increases for all estimators; however, the logistic ridge estimator (LRE) still performs better than the other estimators.
4- As the sample size increases, the MSE decreases for every value of the correlation coefficient and every number of explanatory variables considered.
5- The application to real data showed that ridge logistic regression is the best of the methods presented in this research, because it has the lowest mean square error; the standard errors of the estimated parameters were close across all methods.