Document Type : Research Paper
Abstract
In this paper, we propose using the Bayes approach to compute Bayes weights that address the heterogeneity (heteroscedasticity) problem when estimating the parameters of the gamma regression model by the weighted least squares and iterative weighted least squares methods. The proposed method is compared with the classical method in a simulation study using data generated from a gamma distribution. The data are analyzed with MATLAB code written for this purpose, together with the statistical programs SPSS-25 and EasyFit-5.5. The aims of this study are to solve the problem of heteroscedastic random error variance in the gamma regression model with a proposed method based on Bayes weights, to estimate the best-fitting gamma regression model using these weights, and to compare the results of the classical and proposed methods through several statistical criteria. The results favor the proposed method over the classical method.
Highlights
Based on simulations and real data, the following conclusions and recommendations were made:
Based on the MSE criterion, models estimated using the proposed methods (BWLS and BIWLS) are more efficient than those estimated using the classical weighted methods (WLS and IWLS), which in turn are more efficient than models estimated by OLS. This leads to the conclusion that the proposed method remedies the heterogeneity problem and reduces the MSE.
In addition, the parameters of the models estimated using the proposed methods (BWLS and BIWLS) are more significant than those estimated using the classical methods (OLS, WLS, and IWLS). For example, in the real data, the proposed BIWLS method made a parameter significant that was not significant under any of the other methods applied (classical or proposed). For estimating gamma regression parameters, the BIWLS method proved to be the best.
The proposed method generated a larger F-statistic and a higher coefficient of determination (R²) than the classical method.
The following are recommended: estimating the Bayes weights used in BWLS and BIWLS for the other distributions mentioned in the introduction (Normal, Binomial, Negative Binomial, Exponential, and Poisson); applying the proposed methods (BWLS and BIWLS) more widely, since they are more efficient than the classical methods; and using Bayes regression to estimate the parameters.
Full Text
The gamma distribution can be used to analyze positive random variables in a variety of ways, so gamma regression models (GRM) appear in many experimental applications (Krishnamoorthy, 2006). One of the conditions assumed when estimating the parameters of the linear regression model is homogeneity of the random error variance: the variance takes the same constant value for all random errors, and hence the variance of the dependent variable (or of the residuals) is also homogeneous, given that the independent variables are fixed rather than random and the population parameters are fixed. In many cases this condition does not hold and the variance is not constant. This is called the problem of heterogeneity of the random error variance, and it leads to inefficient model estimators (Taha, 2021). Therefore, this paper adopts a Bayes approach to remedy the problem.
The gamma regression model is a specific type of generalized linear model (GZLM). It is used when the dependent variable of the regression model has a positively skewed distribution and the mean is proportional to the dispersion parameter (Amin M. et al. 2017). In a generalized linear model, the response variable distribution need only be a member of the exponential family, which includes the normal, Poisson, binomial, exponential, and gamma distributions. Furthermore, the normal-error linear model is just a special case of the GZLM, so the GZLM can be used as a unifying approach to many aspects of experimental modeling and data analysis (Montgomery, et al., 2012). McCullagh & Nelder (1989) presented a GRM in which the coefficient of variation is assumed constant for all observations. Recently, many extensions have been proposed, with special emphasis on heterogeneous data.
The Bayes approach combines the likelihood function, which reflects the information about the parameters contained in the data, with the prior distribution, which quantifies what is known about the parameters before observing the data. The prior distribution and likelihood function can be combined to form the posterior distribution, which represents total knowledge about the parameters after the data have been observed. Simple summaries of this distribution can be used to isolate quantities of interest and ultimately to draw substantive conclusions. Therefore, in this paper, weights for estimating the GZLM parameters are obtained from the inverse of the distribution variance using two methods, the classical method and the Bayes method, for the case in which the dependent variable follows a non-normal distribution (gamma) and the random error variance is non-homogeneous. The aims of this study are to solve the problem of heteroscedastic random error variance in the gamma regression model with a proposed method based on Bayes weights, to estimate the best-fitting gamma regression model using these weights, and to compare the results of the classical and proposed methods through statistical criteria such as the mean square error (MSE).
In this section, we present the theoretical tools used in this study: gamma regression and the Bayes approach.
2.1 Gamma Regression
Let yᵢ, i = 1, …, n, be independent random variables. Then the gamma regression model is defined as:
Y = Xβ + ε    (1)
where Y is a vector of observations of the dependent variable with dimension (n×1), X is a matrix of observations of the independent variables with dimension (n×(m+1)), β is a vector of unknown regression parameters with dimension ((m+1)×1), ε is a vector of random errors with dimension (n×1), n is the sample size and m is the number of independent variables (Cepeda, et al., 2016). The estimation of the gamma regression coefficients is presented (Atkinson, et al., 2007) as follows:
Then the mean and variance can be determined as follows:
A weighted least squares (WLS) method can be used to solve the heteroscedasticity problem for gamma regression models by multiplying equation (1) by a weight as follows:
where the weight is the inverse of the variance, i.e.
Using matrices, we get the WLS estimates as follows:
β̂ = (XᵀWX)⁻¹XᵀWY
where W is:
W = diag(w₁, w₂, …, wₙ)
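The following minimal sketch illustrates a weighted least squares fit with inverse-variance weights, assuming the standard normal-equations form of the estimator. The toy data, the variance values and the helper names `solve` and `wls` are hypothetical and are not taken from the MATLAB code in the Appendix; plain Python is used only to keep the example dependency-free.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def wls(X, y, w):
    """Weighted least squares: solve (X'WX) beta = X'Wy with W = diag(w)."""
    p = len(X[0])
    xtwx = [[sum(wi * xi[j] * xi[k] for xi, wi in zip(X, w)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * xi[j] * yi for xi, yi, wi in zip(X, y, w)) for j in range(p)]
    return solve(xtwx, xtwy)


# Hypothetical toy data: an intercept column plus one regressor.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.1, 4.9, 7.2, 8.8]
variances = [0.5, 1.0, 2.0, 4.0]          # assumed known error variances
weights = [1.0 / v for v in variances]    # each weight is the inverse of the variance
beta_hat = wls(X, y, weights)
print(beta_hat)
```

Observations with larger assumed variance receive smaller weight, so they pull the fitted line less than the more precise observations do.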
To find the estimated coefficients by iterative weighted least squares (IWLS), the gamma distribution function can be used (Atkinson, et al., 2007) as follows:
can be replaced by
By including in gamma distribution function (Atkinson, et al., 2007) as follows:
Canonical link is , is a known constant, it is acceptable to remove and to use the reciprocal function as the link function.
The model requires that μ > 0, but the reciprocal link implies that μ = 1/η, which might be negative. An alternative link function is the logarithmic function η = log(μ), which implies that μ = exp(η), which is always positive. Let us consider the model weights for the reciprocal and logarithmic links, respectively (Atkinson, et al., 2007).
Note that which implies that or
The weight that will be used in this study is ( ) as we found in equation (10).
For (IWLS), is the adjusted dependent variate which is a linearization of around is found as the following formula:
Thus, the maximum likelihood estimate (MLE) of the regression parameters obtained by the iterative method is:
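The iteration described above can be sketched as follows. This is a minimal illustration that assumes the logarithmic link and the gamma variance function V(μ) = μ², for which the working weights are all equal to one; the weight actually used in this study comes from equation (10) and may differ. The function names and the toy data are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_fit(X, z, w):
    """One weighted least squares step: solve (X'WX) beta = X'Wz."""
    p = len(X[0])
    xtwx = [[sum(wi * xi[j] * xi[k] for xi, wi in zip(X, w)) for k in range(p)]
            for j in range(p)]
    xtwz = [sum(wi * xi[j] * zi for xi, zi, wi in zip(X, z, w)) for j in range(p)]
    return solve(xtwx, xtwz)


def gamma_irls_log_link(X, y, iterations=25, tol=1e-10):
    """Iteratively reweighted least squares for a gamma model with eta = log(mu)."""
    beta = [0.0] * len(X[0])
    mu = list(y)  # start from the observations, which must all be positive
    for _ in range(iterations):
        eta = [math.log(m) for m in mu]
        # Adjusted dependent variate: z = eta + (y - mu) * d(eta)/d(mu), d(eta)/d(mu) = 1/mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Working weight 1 / (V(mu) * (d eta/d mu)^2) equals 1 when V(mu) = mu^2.
        w = [1.0] * len(y)
        new_beta = weighted_fit(X, z, w)
        change = max(abs(nb - b) for nb, b in zip(new_beta, beta))
        beta = new_beta
        mu = [math.exp(sum(b * x for b, x in zip(beta, xi))) for xi in X]
        if change < tol:
            break
    return beta


# Hypothetical positive responses observed at x = 0, 1, ..., 5.
X = [[1.0, float(i)] for i in range(6)]
y = [1.7, 2.1, 3.2, 3.9, 5.6, 7.1]
print(gamma_irls_log_link(X, y))
```

With the reciprocal link the same loop applies, but the derivative of the link and therefore the working weights and the adjusted variate change at every iteration.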
2.2 Bayes Approach
The Bayes approach is one of the major theoretical and practical frameworks for reasoning and decision making under uncertainty. The theory originates with Thomas Bayes in the 18th century. After being "forgotten" for a long time, it came to be appreciated in different application domains during various periods of the 20th century (Bruyninckx, 2002). In its general sense, Bayes estimation employs prior information about the unknown parameters to be estimated, expressed as the prior probability density function (prior PDF) and denoted f(θ). The probability distribution of the observations in the sample under study is a function of the variables (Y) conditional on the parameters (θ), called the likelihood function.
Combining the prior probability density function of the parameters with the likelihood function of the observations yields the Bayes estimator, which uses only the information-rich part of the posterior probability density function (Omar, et al. 2020). Bayes' theorem is defined (Box & Tiao, 1973) as follows:
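For reference, the general form of Bayes' theorem for a parameter θ and data y can be written as below. The prior is denoted f(θ) as in the text; the symbol L(y | θ) for the likelihood is an assumed notation.

```latex
f(\theta \mid y)
  = \frac{L(y \mid \theta)\, f(\theta)}{\int L(y \mid \theta)\, f(\theta)\, d\theta}
  \;\propto\; L(y \mid \theta)\, f(\theta)
```

The denominator does not depend on θ, which is why the posterior is often handled only up to proportionality.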
To estimate the variance using Bayes' theorem, the prior distribution and the likelihood function are needed. The gamma prior distribution is:
The values of the prior parameters represent the prior information about the parameters of the distribution, obtained from experience and past experiments or estimated from an initial sample. The likelihood function is obtained (Taha, 1997) as follows:
From Bayes' theorem in equation (12), we have:
Since function (13) has the same form as the gamma distribution, then
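As an illustration of the kind of conjugate update involved, the sketch below assumes gamma observations with a known shape and an unknown rate, and a gamma prior on that rate. Under these assumptions the posterior is again a gamma distribution. The parameterization, the function names and the numbers are hypothetical and may differ from the prior and likelihood used in this study.

```python
def gamma_rate_posterior(y, shape, prior_a, prior_b):
    """Posterior Gamma(a_n, b_n) for the rate of Gamma(shape, rate) observations.

    Assumes a Gamma(prior_a, prior_b) prior on the rate (shape/rate form).
    """
    a_n = prior_a + shape * len(y)   # prior shape plus n times the known data shape
    b_n = prior_b + sum(y)           # prior rate plus the sum of the observations
    return a_n, b_n


def gamma_mean_var(a, b):
    """Mean and variance of a Gamma(a, b) distribution in shape/rate form."""
    return a / b, a / b ** 2


# Hypothetical example: three observations, known shape 2, Gamma(1, 1) prior.
a_n, b_n = gamma_rate_posterior([1.0, 2.0, 3.0], shape=2.0, prior_a=1.0, prior_b=1.0)
posterior_mean, posterior_var = gamma_mean_var(a_n, b_n)
print(a_n, b_n, posterior_mean, posterior_var)
```

Because the posterior stays in the gamma family, its mean and variance are available in closed form and no numerical integration is needed.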
2.3 Proposed Method
We propose using the variance estimator of the posterior (final) distribution for the inverse gamma in formula (10) to estimate the Bayes weights, as in the following formula:
Based on the Bayes weight estimate in formula (17), the Bayes weighted least squares (BWLS) and Bayes iterative weighted least squares (BIWLS) methods are used to estimate the parameters of a gamma regression model whose random errors have heterogeneous variance.
To compare the proposed and classical methods, two types of data were used: a simulation study and a real-life study.
3.1 Simulation
The first experiment was simulated using a MATLAB program (Appendix) that generates a heterogeneous multiple regression model with three normally distributed independent variables, a sample size of 100, the parameter vector [5, -1.5, 1.5, 1], and independent and identically distributed random errors following Gamma(0.8, 10).
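A data-generating step of this kind might look like the following sketch. It is not the MATLAB program from the Appendix: it assumes standard normal regressors, an additive error, and that Gamma(0.8, 10) means shape 0.8 and scale 10. The function name `simulate` and the seed are hypothetical.

```python
import random


def simulate(n=100, beta=(5.0, -1.5, 1.5, 1.0), shape=0.8, scale=10.0, seed=0):
    """Generate (X, y) with y = X beta + gamma-distributed error."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        # Intercept column followed by three standard normal regressors (assumed).
        row = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(3)]
        # random.gammavariate takes the shape first and the scale second.
        error = rng.gammavariate(shape, scale)
        X.append(row)
        y.append(sum(b * x for b, x in zip(beta, row)) + error)
    return X, y


X, y = simulate()
print(len(X), len(y))
```

Fixing the seed makes a single run reproducible; repeated experiments would use a different seed for each replication.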
The linear models estimated using the classical methods (ordinary least squares (OLS), weighted least squares (WLS), and iterative weighted least squares (IWLS)) and the proposed methods (Bayes weighted least squares (BWLS) and Bayes iterative weighted least squares (BIWLS)) are summarized in table (1).
According to table (1), the OLS summary indicates that the model is not significant, since the calculated F (0.264) is less than the tabulated F at the 0.05 significance level with (3, 96) degrees of freedom, which equals 2.7114. For WLS and IWLS, the F-statistics (1922 and 21156) support the fit of the linear model to the data, since they are greater than the tabulated value. For the Bayes method, the F-statistics of BWLS and BIWLS (296849 and 3262101 respectively) strongly support the fit of the linear model to the data, since they are greater than the tabulated value, and they are more significant than the classical (WLS and IWLS) models.
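For an ordinary least squares fit with an intercept, the overall F-statistic can be recovered from the coefficient of determination, which gives a quick consistency check on reported values. The sketch below assumes this standard relationship; the function name `f_statistic` is hypothetical. With the OLS values from table (1) (R² = 0.0082, n = 100, three regressors) it gives roughly 0.26, in line with the reported F.

```python
def f_statistic(r_squared, n, k):
    """Overall F for a linear model with k regressors plus an intercept.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with (k, n - k - 1) degrees of freedom.
    """
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))


# OLS row of table (1): R^2 = 0.0082, n = 100 observations, k = 3 regressors.
print(f_statistic(0.0082, 100, 3))
```

Small differences from a reported F are expected when R² has been rounded to four decimals.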
For the test of homogeneity of the random error variance (H₀: homogeneity of random error variance vs H₁: heterogeneity of random error variance), Levene's test (based on mean) for the OLS model supports the alternative hypothesis (p-value = 0.049, which is less than α = 0.05), indicating that the random error is heterogeneous. The WLS and IWLS methods solve this problem: the random error variance, heterogeneous under OLS, becomes homogeneous under WLS and IWLS. This is supported by Levene's test (p-values = 0.052 and 0.056 respectively), which are greater than α = 0.05. The proposed methods (BWLS and BIWLS) also solve the heterogeneity problem of the random error variance, based on Levene's test (p-values = 0.052 and 0.056 respectively, both greater than α = 0.05).
Table (1): Analysis of (OLS, WLS, IWLS, BWLS and BIWLS) Models
Model | Term | Estimate | T-Test | R² | MSE | Sig. of model (F) | Test of Homogeneity: Levene's Test (Based on Mean)
OLS | (Constant) | 13.4002 | 3.0475 | 0.0082 | 59.4507 | 0.2636 | 4.000 (0.049)
OLS | x1 | -0.7032 | -0.2535 | | | |
OLS | x2 | -0.6031 | -0.3937 | | | |
OLS | x3 | 1.0455 | 0.7631 | | | |
WLS | (Constant) | 13.7350 | 24.3106 | 0.9836 | 0.9815 | 1922.2 | 3.881 (0.052)
WLS | x1 | -1.0653 | -2.9886 | | | |
WLS | x2 | -0.5531 | -2.8100 | | | |
WLS | x3 | 1.1547 | 6.5595 | | | |
IWLS | (Constant) | 7.1070 | 41.7315 | 0.9837 | 0.0892 | 21156 | 3.748 (0.056)
IWLS | x1 | 0.0422 | 0.3931 | | | |
IWLS | x2 | 0.9642 | 16.2509 | | | |
IWLS | x3 | 1.7083 | 32.1922 | | | |
BWLS | (Constant) | 13.7351 | 299.6393 | 0.9988 | 0.0065 | 296849 | 3.890 (0.052)
BWLS | x1 | -1.0654 | -36.8356 | | | |
BWLS | x2 | -0.5532 | -34.6344 | | | |
BWLS | x3 | 1.1546 | 80.8485 | | | |
BIWLS | (Constant) | 7.1069 | 513.9736 | 0.9989 | 0.0006 | 3262101 | 3.765 (0.056)
BIWLS | x1 | 0.0423 | 4.8420 | | | |
BIWLS | x2 | 0.9643 | 200.1494 | | | |
BIWLS | x3 | 1.7083 | 396.4856 | | | |
The OLS summary shows that none of the slope parameters is significant, consistent with the OLS model not being significant; this is supported by the t-test, with tabulated t = 1.985 at the 0.025 significance level and 99 degrees of freedom. WLS shows that all parameters became significant, since the absolute t-values (24.311, 2.989, 2.810, and 6.560) exceed the tabulated value. IWLS shows that the parameters β₀, β₂ and β₃ are significant because their absolute values (41.7315, 16.2509 and 32.1922 respectively) are greater than the tabulated value, and β₁ is not significant because its absolute value (0.3931) does not exceed the tabulated value. All parameters are significant in the proposed methods (BWLS and BIWLS), and they are more significant than the parameters estimated by the classical methods (WLS and IWLS) because their absolute calculated t-values are greater than the corresponding values in the classical methods.
The coefficient of determination is 0.82% for OLS, 98.36% for WLS, 98.37% for IWLS, and 99.88% and 99.89% for the proposed methods (BWLS and BIWLS respectively), which are better than the values for the classical methods. This shows the proportion of the variation in the dependent variable that is explained by the independent variables.
Table (2): The Average of Mean Square Error (MSE)
Sample size | Regression Coefficients | Parameters vector | OLS | WLS | IWLS | BWLS | BIWLS
30 | first | first | 78.886 | 1.1937 | 0.0948 | 0.0044 | 0.0003
30 | first | second | 78.886 | 1.303 | 0.0863 | 0.0159 | 0.0010
30 | second | first | 12.6149 | 1.8044 | 0.1769 | 0.0069 | 0.00046
30 | second | second | 12.6149 | 1.6016 | 0.1587 | 0.0292 | 0.0020
50 | first | first | 81.039 | 1.1151 | 0.0771 | 0.0055 | 0.0004
50 | first | second | 81.0388 | 1.0381 | 0.0700 | 0.0175 | 0.0013
50 | second | first | 12.9193 | 1.7270 | 0.1740 | 0.0094 | 0.0011
50 | second | second | 12.9193 | 1.5161 | 0.1560 | 0.0346 | 0.0031
100 | first | first | 80.002 | 1.0711 | 0.0941 | 0.0071 | 0.0007
100 | first | second | 80.002 | 0.987 | 0.0870 | 0.0204 | 0.0017
100 | second | first | 12.5845 | 1.6608 | 0.1415 | 0.0126 | 0.0010
100 | second | second | 12.5845 | 1.4497 | 0.1227 | 0.0414 | 0.0037
For the first sample, the MSE is 59.451 for OLS, 0.982 for WLS and 0.0892 for IWLS, so IWLS is the best of the classical methods. The MSE values of the proposed methods (BWLS and BIWLS) are 0.0065 and 0.0006 respectively, which are less than the values for the classical methods. Thus, using Bayes weights to estimate the parameters of a gamma regression model is more efficient than the classical generalized linear model approach.
Having found that Bayes weights outperformed the classical method in the first random sample, the experiment was repeated 1000 times for several sample sizes (30, 50 and 100), two settings of the regression coefficients, and two parameter vectors: the first is [5 -1.5 1.5 1] and the second is [2 0.5 -0.5 0.8]. The results of the proposed methods (BWLS and BIWLS) were compared with those of the classical methods (OLS, WLS and IWLS) using the average MSE criterion, as shown in table (2). The proposed methods were better than the classical methods in every case, because their average MSE values were lower for all sample sizes, regression coefficient settings and parameter vectors. The best classical method is IWLS, and BIWLS is better than all other methods in terms of average MSE. Thus, the models estimated by the proposed method were better.
In this part, real data are analyzed with gamma regression and then with the proposed method using Bayes weights; the classical and proposed methods are compared, and the method that gives better results is chosen. The data in the table (Appendix) were taken from the book Statistical Theory and Methodology in Science and Engineering (Brownlee, 1965, p. 454). The sample consists of 21 observations from 21 days of operation of a plant for the oxidation of ammonia to nitric acid. The dependent variable is stack loss (y), and the independent variables are air flow (x1), cooling water inlet temperature (x2) and acid concentration (x3). The hypotheses of the linear regression model were verified, and the results are summarized in table (3):
Table (3): Linear Model Hypotheses Tests
Test | Statistic | Value (p-value) | Critical value
Gamma distribution | K.S. | 0.145 (0.718) | 0.287
Gamma distribution | Chi-squared | 3.365 (0.339) | 7.815
Homogeneity | Levene's test (Based on Mean) | 17.587 (0.001) |
Autocorrelation | Durbin-Watson | 1.485 | dL = 1.125, dU = 1.538
Test of Multi-collinearity:
Model | Tolerance | VIF
Constant | |
x1 | 0.344 | 2.906
x2 | 0.389 | 2.573
x3 | 0.750 | 1.334
Table (3) reports the tests of the model hypotheses. For the hypothesis of a gamma distribution for the random error (H₀: the random error has a gamma distribution vs H₁: the random error does not have a gamma distribution), the Kolmogorov-Smirnov (K.S.) and chi-squared tests support the null hypothesis (p-values = 0.718 and 0.339 respectively, both greater than α = 0.05), so the random error has a gamma distribution. For the homogeneity of the random error variance (H₀: homogeneity of random error variance vs H₁: heterogeneity of random error variance), Levene's test (based on mean) rejects the null hypothesis (p-value = 0.001, which is less than α = 0.05), so the random error is heterogeneous. For autocorrelation of the random error (H₀: there is no autocorrelation problem vs H₁: there is an autocorrelation problem), the Durbin-Watson statistic falls in the inconclusive region (D.W. = 1.485 lies between dL = 1.125 and dU = 1.538); the model is therefore treated as usable, and it is taken that there is no autocorrelation problem among the random error values. For multicollinearity (H₀: there is no multicollinearity problem vs H₁: there is a multicollinearity problem), the variance inflation factor (VIF) supports the null hypothesis: there is no multicollinearity problem among the independent variables (VIF = 2.906, 2.573, and 1.334 respectively, all less than 3). Thus, it is concluded that the estimation hypotheses of the linear regression model are met, except for homogeneity of the random error variance. The estimated linear models (OLS, WLS, IWLS, BWLS and BIWLS) are summarized in table (4).
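Two of the diagnostics above follow directly from simple formulas. The sketch below assumes the usual definitions: the Durbin-Watson statistic as the sum of squared successive residual differences divided by the residual sum of squares, the bounds test for positive autocorrelation, and VIF as the reciprocal of tolerance. The function names and the toy residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def dw_region(d, d_lower, d_upper):
    """Classify d with the bounds test for positive autocorrelation."""
    if d < d_lower:
        return "positive autocorrelation"
    if d <= d_upper:
        return "inconclusive"
    return "no positive autocorrelation"


def vif_from_tolerance(tolerance):
    """Variance inflation factor as the reciprocal of tolerance."""
    return 1.0 / tolerance


# Values reported in table (3).
print(dw_region(1.485, 1.125, 1.538))
print([round(vif_from_tolerance(t), 3) for t in (0.344, 0.389, 0.750)])

# Hypothetical residuals that alternate in sign give a statistic above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
```

The VIF values computed from the rounded tolerances agree with the tabulated ones to about two decimal places.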
Table (4) shows that the OLS model is significant, since the calculated F (59.902) is greater than the tabulated F at the 0.05 significance level with (3, 17) degrees of freedom, which equals 3.197. The F-statistics for WLS (2169.9) and IWLS (23759.2) support the fit of the linear model to the data, since they are greater than the tabulated value. For the Bayes method, the F-statistics of BWLS and BIWLS (366528 and 4744277 respectively) strongly support the fit of the linear model to the data, since they are greater than the tabulated value, and they are more significant than the classical (WLS and IWLS) models.
For the test of homogeneity of the random error variance, the WLS and IWLS methods solve the heterogeneity problem: the variance, heterogeneous under OLS, becomes homogeneous under WLS and IWLS. This is supported by Levene's test (p-values = 0.439 and 0.126 respectively), which are greater than α = 0.05. The proposed methods (BWLS and BIWLS) also solve the heterogeneity problem of the random error variance, based on Levene's test (p-values = 0.439 and 0.128 respectively, both greater than α = 0.05).
Table (4): Analysis of (OLS, WLS, IWLS, BWLS and BIWLS) Models for Real Data
Model | Term | B | t-test | R² | MSE | Sig. of model (F) | Test of Homogeneity: Levene's test (Sig.)
OLS | (Constant) | -39.9197 | -3.3557 | 0.9136 | 10.519 | 59.902 | 17.587 (0.001)
OLS | x1 | 0.7156 | 5.3066 | | | |
OLS | x2 | 1.2953 | 3.5196 | | | |
OLS | x3 | -0.1521 | -0.9733 | | | |
WLS | (Constant) | -43.1594 | -20.898 | 0.9973 | 0.317 | 2169.9 | 0.627 (0.439)
WLS | x1 | 0.6731 | 28.751 | | | |
WLS | x2 | 0.9422 | 14.747 | | | |
WLS | x3 | -0.0010 | -0.036 | | | |
IWLS | (Constant) | -41.4991 | -66.493 | 0.9974 | 0.029 | 23759.2 | 2.575 (0.126)
IWLS | x1 | 0.6345 | 89.684 | | | |
IWLS | x2 | 0.9890 | 51.224 | | | |
IWLS | x3 | -0.0092 | -1.116 | | | |
BWLS | (Constant) | -43.1595 | -271.26 | 0.9997 | 0.0019 | 366528 | 0.626 (0.439)
BWLS | x1 | 0.6730 | 373.19 | | | |
BWLS | x2 | 0.9423 | 191.41 | | | |
BWLS | x3 | -0.0009 | -0.471 | | | |
BIWLS | (Constant) | -41.4990 | -938.36 | 0.9998 | 0.0002 | 4744277 | 2.548 (0.128)
BIWLS | x1 | 0.6344 | 1265.7 | | | |
BIWLS | x2 | 0.9889 | 722.90 | | | |
BIWLS | x3 | -0.0091 | -15.748 | | | |
For the test of parameter significance, OLS, WLS and IWLS show that the parameters β₀, β₁ and β₂ are significant because their absolute calculated t-values are greater than the tabulated t at the 0.05/2 significance level with 20 degrees of freedom, which equals 2.086, while β₃ is not significant because its absolute value does not exceed the tabulated value. The parameters β₀, β₁ and β₂ are more significant in IWLS than in OLS and WLS. The BWLS summary shows that the parameters β₀, β₁ and β₂ are more significant than in the classical methods. All parameters became significant in the proposed BIWLS method, and they are more significant than the parameters estimated by the classical methods because their absolute calculated t-values are greater than the corresponding values in the classical methods.
The coefficient of determination is 91.36% for OLS, 99.74% for WLS, 99.73% for IWLS, and 99.98% and 99.99% for the proposed methods (BWLS and BIWLS respectively), which are better than the values for the classical methods.
The MSE is 10.519 for OLS, 0.317 for WLS and 0.029 for IWLS, so IWLS has the lowest MSE of the classical methods. The MSE values of the proposed methods (BWLS and BIWLS) are 0.0019 and 0.0002 respectively, which are less than the values for the classical methods.