A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Categorical Data
Lukun Zheng
Pages - 1 - 12     |    Revised - 31-01-2018     |    Published - 30-04-2018
Volume - 7   Issue - 1    |    Publication Date - April 2018
Categorical Variable, Imputation Methods, Missing Value, Re-Imputation Accuracy Rate.
Missing data are often encountered in data sets and are a common problem for researchers in many fields. Observations may have missing values for many reasons; for instance, some respondents may leave some items unanswered. Missing data complicate statistical analyses, especially when a large fraction of the data is missing. Many methods have been developed for dealing with missing data, whether numeric or categorical. The performance of an imputation method is key to choosing which method to use, and it is usually evaluated by how well the method supports inference about target parameters under a statistical model. One important parameter is the expected imputation accuracy rate, which, however, relies heavily on assumptions about the type of missing data and the imputation method. For instance, it may require that the data be missing completely at random. The goal of the current study was to develop a two-step algorithm to evaluate the performance of imputation methods for missing categorical data. The evaluation is based on the re-imputation accuracy rate (RIAR) introduced in the current work. A simulation study based on real data is conducted to demonstrate how the evaluation algorithm works.
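The abstract does not spell out the two steps or the exact definition of RIAR, so the following is only a minimal sketch of one plausible reading, not the paper's algorithm. It assumes that step 1 imputes the originally missing entries and step 2 hides a random subset of the originally observed entries, re-imputes them, and reports the fraction recovered correctly. The names `mode_impute`, `reimputation_accuracy`, and `mask_fraction` are hypothetical, and mode imputation stands in for whichever imputation method is being evaluated.

```python
import random
from collections import Counter


def mode_impute(values):
    """Fill None entries with the most frequent observed category.

    Stand-in imputation method (an assumption); any function that
    maps a list with None gaps to a completed list could be used.
    """
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def reimputation_accuracy(values, impute, mask_fraction=0.2, seed=0):
    """Assumed reading of a re-imputation accuracy rate.

    Step 1: impute the originally missing entries.
    Step 2: hide a random subset of the originally observed entries
            in the completed data, re-impute, and score only the
            hidden cells against their known true categories.
    """
    rng = random.Random(seed)

    # Step 1: complete the data set with the method under evaluation.
    completed = impute(values)

    # Step 2: choose which known cells to hide.
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    k = max(1, round(mask_fraction * len(observed_idx)))
    hidden = rng.sample(observed_idx, k)

    masked = list(completed)
    for i in hidden:
        masked[i] = None

    # Re-impute and compare against the true values of the hidden cells.
    reimputed = impute(masked)
    hits = sum(reimputed[i] == values[i] for i in hidden)
    return hits / len(hidden)


if __name__ == "__main__":
    # Toy categorical variable with two missing responses.
    answers = ["yes", "no", "yes", None, "yes", "no", None, "yes", "yes", "no"]
    print(reimputation_accuracy(answers, mode_impute))
```

Under this reading, the score uses only cells whose true category is known, so it can be computed without assuming anything about why the original values are missing; whether this matches the paper's RIAR should be checked against the full text.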
Dr. Lukun Zheng
Tennessee Technological University - United States of America
