Comparison of statistical and machine learning methods in modelling of data with multicollinearity

Akhil Garg; Kang Tai

doi:10.1504/IJMIC.2013.053535

Akhil Garg, Kang Tai^*

^*Corresponding author for this work

Nanyang Technological University

Research output: Contribution to journal › Article › peer-review

126 Citations (Scopus)

Abstract

Multicollinearity arises in a dataset when the predictors are correlated with one another. Models derived from such data without a check on multicollinearity may lead to erroneous system analysis. This problem can be eliminated by selecting appropriate predictors from the dataset. Variable reduction methods such as B2, B4, VIF, KIF and factor analysis (FA) can be used to overcome it. These methods are particularly useful in conjunction with modelling methods that do not automate variable selection, such as artificial neural networks (ANN) and fuzzy logic. The literature reveals that the problem is well described in the field of statistics but has received little attention in the field of machine learning. In this paper, the multicollinearity problem is illustrated through the estimation of body fat content. Commonly used statistical methods such as stepwise regression, radial basis function partial least squares, partial robust M-regression, ridge regression and principal component regression are applied to this problem. The machine learning methods FA-ANN and genetic programming are also applied. The results are discussed, and the interpretation and comparison of the modelling methods are summarised to guide users on the proper techniques for tackling the multicollinearity problem.
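
One of the criteria named in the abstract, VIF (variance inflation factor), can be illustrated with a short sketch. The code below is not taken from the paper: it uses the standard textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors, and it runs on synthetic data invented for this example. The function name `variance_inflation_factors` and the variables `x1`, `x2`, `x3` are assumptions made for illustration only.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = samples, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_predictors = X.shape
    vifs = np.empty(n_predictors)
    for j in range(n_predictors):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on all other predictors plus an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = np.sum((target - target.mean()) ** 2)
        # VIF = 1 / (1 - R^2) = ss_tot / ss_res; infinite if the fit is exact.
        vifs[j] = np.inf if ss_res == 0.0 else ss_tot / ss_res
    return vifs


# Synthetic illustration only (NOT the body fat data used in the paper).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)  # almost a copy of x1, so collinear
x3 = rng.normal(size=200)              # independent of x1 and x2
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this synthetic case the two nearly identical predictors receive very large VIF values while the independent one stays close to 1. A cut-off of about 10 is a common rule of thumb for flagging a predictor, although the threshold is a convention and not something stated in the abstract.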

Original language	English
Pages (from-to)	295-312
Number of pages	18
Journal	International Journal of Modelling, Identification and Control
Volume	18
Issue number	4
DOIs	https://doi.org/10.1504/IJMIC.2013.053535
Publication status	Published - 2013
Externally published	Yes

Keywords

ANN
Artificial neural network
Factor analysis
Genetic programming
Machine learning
Multicollinearity
PCA
Principal component analysis
Regression
Statistics

Cite this

@article{f6923c88a4df4af6ac94ea9459f09fee,
  title = "Comparison of statistical and machine learning methods in modelling of data with multicollinearity",
  abstract = "Multicollinearity occurs in a dataset due to correlation between the predictors. Models derived from such data without a check on multicollinearity may lead to erroneous system analysis. This problem can be eliminated by the selection of appropriate predictors from the dataset. Variable reduction methods like B2, B4, VIF, KIF and factor analysis (FA) can be used to overcome this problem. Such methods are useful particularly when used in conjunction with modelling methods that do not automate variable selection, such as artificial neural network (ANN) and fuzzy logic. The literature reveals that the current problem is aptly described in the field of statistics but is paid little attention in the field of machine learning. In this paper, multicollinearity is presented involving the estimation of fat content inside the body. Commonly used statistical methods such as stepwise regression, radial basis function partial least squares, partial robust M-regression, ridge regression and principal component regression are applied to this problem. The machine learning methods FA-ANN and genetic programming are also applied. The results are discussed with the interpretation and comparison of the modelling methods summarised in order to guide users on the proper techniques for tackling the multicollinearity problem.",
  keywords = "ANN, Artificial neural network, Factor analysis, Genetic programming, Machine learning, Multicollinearity, PCA, Principal component analysis, Regression, Statistics",
  author = "Akhil Garg and Kang Tai",
  year = "2013",
  doi = "10.1504/IJMIC.2013.053535",
  language = "English",
  volume = "18",
  pages = "295--312",
  journal = "International Journal of Modelling, Identification and Control",
  issn = "1746-6172",
  number = "4",
}

TY - JOUR
T1 - Comparison of statistical and machine learning methods in modelling of data with multicollinearity
AU - Garg, Akhil
AU - Tai, Kang
PY - 2013
Y1 - 2013
N2 - Multicollinearity occurs in a dataset due to correlation between the predictors. Models derived from such data without a check on multicollinearity may lead to erroneous system analysis. This problem can be eliminated by the selection of appropriate predictors from the dataset. Variable reduction methods like B2, B4, VIF, KIF and factor analysis (FA) can be used to overcome this problem. Such methods are useful particularly when used in conjunction with modelling methods that do not automate variable selection, such as artificial neural network (ANN) and fuzzy logic. The literature reveals that the current problem is aptly described in the field of statistics but is paid little attention in the field of machine learning. In this paper, multicollinearity is presented involving the estimation of fat content inside the body. Commonly used statistical methods such as stepwise regression, radial basis function partial least squares, partial robust M-regression, ridge regression and principal component regression are applied to this problem. The machine learning methods FA-ANN and genetic programming are also applied. The results are discussed with the interpretation and comparison of the modelling methods summarised in order to guide users on the proper techniques for tackling the multicollinearity problem.
AB - Multicollinearity occurs in a dataset due to correlation between the predictors. Models derived from such data without a check on multicollinearity may lead to erroneous system analysis. This problem can be eliminated by the selection of appropriate predictors from the dataset. Variable reduction methods like B2, B4, VIF, KIF and factor analysis (FA) can be used to overcome this problem. Such methods are useful particularly when used in conjunction with modelling methods that do not automate variable selection, such as artificial neural network (ANN) and fuzzy logic. The literature reveals that the current problem is aptly described in the field of statistics but is paid little attention in the field of machine learning. In this paper, multicollinearity is presented involving the estimation of fat content inside the body. Commonly used statistical methods such as stepwise regression, radial basis function partial least squares, partial robust M-regression, ridge regression and principal component regression are applied to this problem. The machine learning methods FA-ANN and genetic programming are also applied. The results are discussed with the interpretation and comparison of the modelling methods summarised in order to guide users on the proper techniques for tackling the multicollinearity problem.
KW - ANN
KW - Artificial neural network
KW - Factor analysis
KW - Genetic programming
KW - Machine learning
KW - Multicollinearity
KW - PCA
KW - Principal component analysis
KW - Regression
KW - Statistics
UR - http://www.scopus.com/inward/record.url?scp=84876995897&partnerID=8YFLogxK
U2 - 10.1504/IJMIC.2013.053535
DO - 10.1504/IJMIC.2013.053535
M3 - Article
AN - SCOPUS:84876995897
SN - 1746-6172
VL - 18
SP - 295
EP - 312
JO - International Journal of Modelling, Identification and Control
JF - International Journal of Modelling, Identification and Control
IS - 4
ER -
