Business Intelligence Techniques for Missing Data Imputations
DOI:
https://doi.org/10.20535/1810-0546.2015.5.60503
Keywords:
Uncertainties in data processing, Imputation of missing data, Systemic approach, Decision support system
Abstract
Background. Properly constructed decision support systems (DSS) for modelling and forecasting the behaviour of dynamic systems make it possible to take into account uncertainties of probabilistic, statistical and structural types, which results in higher quality of the developed models and estimated forecasts.
Objective. To consider the general reasons for losing (missing) data in statistical data analysis; to categorize missing data into several groups and identify the reasons for missing measurements; to provide a stepwise systemic methodology for uncertainty analysis and selection of data imputation techniques; to give an insight into some popular missing value imputation techniques and their possible applications.
Methods. To solve these problems the following methods were used: a data categorization approach from a business or practical point of view, which is necessary for discovering the reasons for the presence of systemic and/or random missing values; a modern systemic methodology for the analysis of uncertainty causes and missing value imputation; decision tree based imputation procedures; the EM algorithm; and a regression model approach that imputes missing data using forecasting functions.
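As an illustration of the EM idea named above, the following is a minimal sketch and not the procedure used in the study. It assumes a two-column sample that is roughly bivariate normal, with the first column fully observed and gaps (NaN) only in the second; the function name `em_impute` and the fixed iteration count are hypothetical choices made for this example.

```python
import numpy as np

def em_impute(data, n_iter=100):
    """EM-style imputation for a 2-column array; only column 1 may hold NaN.

    Assumes (for illustration) a bivariate normal model and a non-constant
    first column. Returns the filled second column and the estimated
    conditional variance of y given x.
    """
    data = np.asarray(data, dtype=float)
    x = data[:, 0]
    y = data[:, 1].copy()
    miss = np.isnan(y)
    # Initial guess: mean of the observed values.
    y[miss] = np.nanmean(data[:, 1])
    cond_var = 0.0
    for _ in range(n_iter):
        # M-step: update means and (co)variances from the completed data.
        # The cond_var term accounts for uncertainty in the imputed values.
        mx, my = x.mean(), y.mean()
        sxx = np.mean((x - mx) ** 2)
        sxy = np.mean((x - mx) * (y - my))
        syy = np.mean((y - my) ** 2) + miss.mean() * cond_var
        # E-step: conditional mean and variance of y given x.
        beta = sxy / sxx
        cond_var = syy - beta * sxy
        y[miss] = my + beta * (x[miss] - mx)
    return y, cond_var

# Usage: two gaps in an otherwise linear relationship.
sample = [[float(i), 2.0 * i + 1.0] for i in range(10)]
sample[3][1] = float("nan")
sample[7][1] = float("nan")
filled, cond_var = em_impute(sample)
```

For this simple missingness pattern the iterations converge to the regression of the second column on the first fitted to the complete cases; the value of EM shows in more general patterns, which this sketch does not cover.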
Results. The main results of the study are: categorization of missing data into groups; development of a systemic methodology for the analysis of uncertainty causes and missing value imputation; an analysis of the possibilities for missing value imputation with decision trees, the EM algorithm and regression models; development of multistep forecasting functions on the basis of autoregression models; illustration of the application of some selected promising methods for missing data imputation.
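A multistep forecasting function of the kind mentioned above can be sketched for the simplest case. The example assumes a first-order autoregression y(k) = a0 + a1·y(k−1) + e(k); the model order, the least-squares fit and the helper names `fit_ar1`, `forecast` and `impute_gaps` are assumptions made for illustration and are not taken from the paper. The s-step forecast from the last observed value is then a0·(1 + a1 + … + a1^(s−1)) + a1^s·y(k).

```python
import numpy as np

def fit_ar1(series):
    """Least-squares estimate of (a0, a1) from consecutive observed pairs."""
    s = np.asarray(series, dtype=float)
    prev, curr = s[:-1], s[1:]
    ok = ~np.isnan(prev) & ~np.isnan(curr)
    a1, a0 = np.polyfit(prev[ok], curr[ok], 1)  # slope first, then intercept
    return a0, a1

def forecast(y_last, a0, a1, steps):
    """Closed-form s-step forecast for the AR(1) model."""
    powers = a1 ** np.arange(steps)
    return a0 * powers.sum() + a1 ** steps * y_last

def impute_gaps(series):
    """Fill each run of NaN with forecasts from the last observed value."""
    s = np.asarray(series, dtype=float).copy()
    a0, a1 = fit_ar1(s)
    last, steps = None, 0
    for k in range(len(s)):
        if np.isnan(s[k]):
            if last is not None:  # leading gaps have no anchor and stay NaN
                steps += 1
                s[k] = forecast(last, a0, a1, steps)
        else:
            last, steps = s[k], 0
    return s

# Usage: a series with a gap of two consecutive missing values.
series = [10.0, 6.0, 4.0, 3.0, np.nan, np.nan, 2.125, 2.0625]
filled = impute_gaps(series)
```

Because forecast uncertainty grows with the number of steps, long gaps filled this way are likely to need the smoothing or screening discussed in the conclusions.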
Conclusions. We proposed a six-step systemic methodology for data imputation, which stresses that selecting the correct imputation method is tightly connected with a step-by-step analysis of the causes of the gaps and with finding an appropriate technique for filling them. The imputed values are sometimes rather far from the existing data and should be smoothed or even excluded from the sample as incorrect. For such cases a new probabilistic-regression method should be proposed, one that makes it possible to define the parameters of the probability interval for the regression used for missing data imputation. A series of computational experiments performed with the EM algorithm, a forecast regression based imputation technique and some other approaches shows that it is possible to reach high-quality results when processing data with missing values.
References
G. Svolba, Data Quality for Analytics Using SAS. Cary, NC: SAS Institute Inc., 2012, 340 p.
P. Gogishvili, “Determination of the vehicle location in case of incomplete GPS data”, Informatsiyni Tekhnolohiyi ta Kompyuterna Inzheneriya, no. 3, pp. 19–23, 2012 (in Russian).
N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Hoboken: John Wiley & Sons, Inc., 2005, 196 p.
M. Owen. (2005). Tukey's Biweight Correlation and the Breakdown [Online]. Available: http://pages.pomona.edu/~jsh04747/Student%20Theses/MaryOwen10.pdf
P. Breheny, Robust Regression [Online]. Available: http://web.as.uky.edu/statistics/users/pbreheny/764-F11/notes/12-1.pdf
F. Shi et al. (2013). Missing Value Estimation for Microarray Data by Bayesian Principal Component Analysis and Iterative Local Least Squares [Online]. Available: http://www.hindawi.com/journals/mpe/2013/162938/
T. Marwala. (2014). Flexibly-bounded Rationality and Marginalization of Irrationality Theories for Decision Makin [Online]. Available: http://arxiv.org/ftp/arxiv/papers/1306/1306.2025.pdf
G.J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. Hoboken: John Wiley & Sons, Inc., 2008, 359 p.
License
Copyright (c) 2017 NTUU KPI
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under CC BY 4.0 that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.