Business Intelligence Techniques for Missing Data Imputations

Natalia V. Kuznietsova; Petro I. Bidyuk

doi:10.20535/1810-0546.2015.5.60503

Authors

Natalia V. Kuznietsova, Institute for Applied System Analysis of the NTUU KPI, Ukraine https://orcid.org/0000-0002-1662-1974
Petro I. Bidyuk, Institute for Applied System Analysis of the NTUU KPI, Ukraine https://orcid.org/0000-0002-7421-3565

DOI:

https://doi.org/10.20535/1810-0546.2015.5.60503

Keywords:

Uncertainties in data processing, Imputation of missing data, Systemic approach, Decision support system

Abstract

Background. Properly constructed decision support systems (DSS) for modelling and forecasting the behaviour of dynamic systems make it possible to take into account uncertainties of probabilistic, statistical and structural types, which results in higher quality of the developed models and estimated forecasts.

Objective. To consider the general reasons for losing (missing) data in statistical data analysis; to categorize missing data into several groups and identify the reasons for missing measurements; to provide a stepwise systemic methodology for uncertainty analysis and selection of data imputation techniques; to give an insight into some popular missing value imputation techniques and their possible applications.

Methods. The following methods were used to solve these problems: a data categorization approach from a business or practical point of view, which is necessary for discovering the reasons for the presence of systemic and/or random missing values; a modern systemic methodology applied to the analysis of uncertainty causes and missing value imputation; decision tree based imputation procedures; the EM algorithm; and a regression model approach to forecasting missing data using forecasting functions.
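As an illustration of the EM-based approach named above, the following is a minimal sketch and not the authors' implementation. It assumes a bivariate normal model in which x is fully observed and some y values are missing at random; the function name em_impute and the example data are hypothetical. EM estimates the mean and covariance parameters, and each gap is then filled with the conditional mean of y given x.

```python
from typing import List, Optional


def em_impute(x: List[float], y: List[Optional[float]],
              n_iter: int = 200, tol: float = 1e-10) -> List[float]:
    """Fill missing y values (None) under an assumed bivariate normal model."""
    n = len(x)
    observed = [v for v in y if v is not None]
    # x is fully observed, so its mean and variance are fixed.
    mx = sum(x) / n
    sxx = sum((v - mx) ** 2 for v in x) / n
    # Starting values: moments of the observed y, zero covariance.
    my = sum(observed) / len(observed)
    syy = sum((v - my) ** 2 for v in observed) / len(observed)
    sxy = 0.0
    for _ in range(n_iter):
        # E-step: expected y and y^2 for the missing entries given x.
        beta = sxy / sxx
        resid_var = syy - beta * sxy
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = my + beta * (xi - mx)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(yi)
                ey2.append(yi * yi)
        # M-step: update the moments from the expected sufficient statistics.
        new_my = sum(ey) / n
        new_syy = sum(ey2) / n - new_my ** 2
        new_sxy = sum(xi * e for xi, e in zip(x, ey)) / n - mx * new_my
        change = abs(new_my - my) + abs(new_syy - syy) + abs(new_sxy - sxy)
        my, syy, sxy = new_my, new_syy, new_sxy
        if change < tol:
            break
    # Impute each gap with the conditional mean E[y | x].
    beta = sxy / sxx
    return [yi if yi is not None else my + beta * (xi - mx)
            for xi, yi in zip(x, y)]


# Hypothetical data: the observed points lie on y = 2x + 1.
filled = em_impute([1, 2, 3, 4, 5, 6], [3, 5, None, 9, None, 13])
```

In this setting the EM estimates coincide with a regression fitted to the complete cases, so the sketch also shows how the EM and regression approaches relate in the simplest case.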

Results. The main results of the study are: categorization of missing data into groups; development of a systemic methodology for the analysis of uncertainty causes and missing value imputation; an analysis of the possibilities of missing value imputation with decision trees, the EM algorithm and regression models; development of multistep forecasting functions on the basis of autoregression models; and an illustration of the application of some selected promising methods for missing data imputation.
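The idea of a multistep forecasting function built on an autoregression model can be sketched as follows. This is a hedged illustration only: the model order (AR(1)), the helper names fit_ar1, forecast_ar1 and impute_gaps, and the sample series are assumptions and are not taken from the paper. For y(k) = a0 + a1*y(k-1) + e(k), the s-step forecast from the last observed value y(k) is a0*(1 + a1 + ... + a1^(s-1)) + a1^s * y(k), and each gap is filled with the forecast for its distance from the last observation.

```python
from typing import List, Optional, Tuple


def fit_ar1(history: List[float]) -> Tuple[float, float]:
    """Least-squares estimates of a0, a1 in y(k) = a0 + a1*y(k-1) + e(k)."""
    prev, curr = history[:-1], history[1:]
    n = len(prev)
    mp, mc = sum(prev) / n, sum(curr) / n
    var = sum((p - mp) ** 2 for p in prev)
    cov = sum((p - mp) * (c - mc) for p, c in zip(prev, curr))
    a1 = cov / var
    return mc - a1 * mp, a1


def forecast_ar1(a0: float, a1: float, last: float, steps: int) -> float:
    """s-step forecast: a0 * sum_{i=0}^{s-1} a1^i + a1^s * y(k)."""
    return a0 * sum(a1 ** i for i in range(steps)) + a1 ** steps * last


def impute_gaps(series: List[Optional[float]]) -> List[float]:
    """Fill None entries with multistep AR(1) forecasts.

    Assumes at least three observed values precede the first gap;
    the model is fitted on that leading segment only.
    """
    if None not in series:
        return list(series)
    first_gap = series.index(None)
    a0, a1 = fit_ar1(series[:first_gap])
    filled = list(series)
    last_obs, steps = None, 0
    for i, value in enumerate(series):
        if value is None:
            steps += 1  # distance from the last observed value
            filled[i] = forecast_ar1(a0, a1, last_obs, steps)
        else:
            last_obs, steps = value, 0
    return filled


# Hypothetical series following y(k) = 1 + 0.5*y(k-1), with two gaps.
filled = impute_gaps([10.0, 6.0, 4.0, 3.0, None, None, 2.125])
```

Forecasting each gap directly from the last observed value, rather than from the previous imputed one, keeps the formula explicit. For an AR(1) model the two approaches give the same numbers.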

Conclusions. We proposed a six-step systemic methodology for data imputation, which stresses that selecting the correct imputation method is tightly connected with a step-by-step analysis of the causes of the gaps and with finding an appropriate technique for filling them. The imputed values are sometimes rather far from the existing data and should be smoothed or even excluded from the sample as incorrect. For such cases a new probabilistic-regression method should be proposed that allows the parameters of the probability interval for the regression to be defined for the purpose of missing data imputation. A series of computing experiments performed with the EM algorithm, a forecast regression based imputation technique and some other approaches shows that it is possible to reach high quality results in the correct processing of data with missing values.

Author Biographies

Natalia V. Kuznietsova, Institute for Applied System Analysis of the NTUU KPI

Candidate of Sciences (Engineering), senior lecturer

Petro I. Bidyuk, Institute for Applied System Analysis of the NTUU KPI

Doctor of Engineering, professor

