Business Intelligence Techniques for Missing Data Imputations
Background. Properly constructed decision support systems (DSS) for modelling and forecasting the behaviour of dynamic systems make it possible to take into account uncertainties of probabilistic, statistical and structural types, which results in higher-quality models and forecasts.
Objective. To consider the general reasons for losing (missing) data in statistical data analysis; to categorize missing data into several groups and identify the reasons for missing measurements; to provide a stepwise systemic methodology for uncertainty analysis and selection of data imputation techniques; to give an insight into some popular missing value imputation techniques and their possible applications.
Methods. The following methods were used to solve these problems: a data categorization approach from the business or practical point of view, which is necessary for discovering the reasons for the presence of systemic and/or random missing values; a modern systemic methodology for the analysis of uncertainty causes and missing value imputation; decision-tree-based imputation procedures; and the EM algorithm and a regression model approach to estimating missing data using forecasting functions.
Results. The main results of the study are: categorization of missing data into groups; development of a systemic methodology for the analysis of uncertainty causes and missing value imputation; an analysis of the possibilities of missing value imputation with decision trees, the EM algorithm and regression models; development of multistep forecasting functions on the basis of autoregression models; and an illustration of the application of some selected promising methods for missing data imputation.
Conclusions. We proposed a six-step systemic methodology for data imputation, which stresses that selecting the correct imputation method is tightly connected with a step-by-step analysis of the causes of the gaps and with finding an appropriate technique for filling them. The imputed values are sometimes rather far from the existing data and should be smoothed or even excluded from the sample as incorrect. For such cases a new probabilistic-regression method should be proposed that makes it possible to define the parameters of the probability interval for the regression used for missing data imputation. A series of computing experiments performed with the EM algorithm, a forecast-regression-based imputation technique and some other approaches shows that it is possible to reach high-quality results in the correct processing of data with missing values.
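As an illustration of the regression-based approach named above, the following minimal sketch fills a gap in a time series with forecasts from a first-order autoregression, y(k) = a0 + a1·y(k−1). It is not the authors' implementation: the AR(1) order, the least-squares fit on observed neighbouring pairs, and the names `fit_ar1` and `impute_ar1` are assumptions made only for this example. Applying the one-step forecast recursively across a gap yields the s-step forecast, which for AR(1) has the closed form ŷ(k+s) = a0·(1 + a1 + … + a1^(s−1)) + a1^s·y(k).

```python
import math


def fit_ar1(y):
    """Least-squares estimate of (a0, a1) in y[k] = a0 + a1 * y[k-1].

    Only pairs in which both neighbouring values are observed are used.
    """
    pairs = [(p, c) for p, c in zip(y[:-1], y[1:])
             if not math.isnan(p) and not math.isnan(c)]
    if len(pairs) < 2:
        raise ValueError("need at least two fully observed neighbouring pairs")
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in pairs)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in pairs)
    if sxx == 0:
        raise ValueError("lagged values are constant; slope is undefined")
    a1 = sxy / sxx
    a0 = mean_c - a1 * mean_p
    return a0, a1


def impute_ar1(y):
    """Fill NaN gaps by recursive one-step AR(1) forecasts.

    A gap of length s therefore receives the 1-, 2-, ..., s-step forecasts
    made from the last observed value before the gap. Missing values at the
    very start of the series have no predecessor and are left as NaN.
    """
    a0, a1 = fit_ar1(y)
    filled = list(y)
    for k in range(1, len(filled)):
        if math.isnan(filled[k]) and not math.isnan(filled[k - 1]):
            filled[k] = a0 + a1 * filled[k - 1]
    return filled, (a0, a1)


# Hypothetical series generated by y[k] = 1 + 0.5 * y[k-1], with a two-point gap.
nan = float("nan")
series = [10.0, 6.0, 4.0, 3.0, nan, nan, 2.125, 2.0625]
filled, (a0, a1) = impute_ar1(series)
```

In this sketch the forecasts are point estimates only; as the Conclusions note, imputed values that fall far from the existing data may still need smoothing or exclusion.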