Efficiency Comparison in Replace Missing Value Using Regression Imputation, Multiple Imputation and Expectation Maximization for Classification in Data Mining
DOI
Creator Theeridsara Ngernwilai
Title Efficiency Comparison in Replace Missing Value Using Regression Imputation, Multiple Imputation and Expectation Maximization for Classification in Data Mining
Contributor Doungkaew Hunthong, Saichon Sinsomboonthong
Publisher Thammasat University
Publication Year 2563 BE (2020 CE)
Journal Title Thai Journal of Science and Technology
Journal Vol. 9
Journal No. 5
Page no. 575-588
Keyword missing value, regression imputation, multiple imputation, expectation maximization, K-nearest neighbor, decision tree, artificial neural network, support vector machine
URL Website https://www.tci-thaijo.org/
Website title THAIJO
ISSN 2286-7333
Abstract The objective of this research was to compare the efficiency of three missing-value replacement methods (regression imputation, multiple imputation, and expectation maximization) using four classification methods (K-nearest neighbor, decision tree, artificial neural network, and support vector machine) on six datasets containing missing values. The datasets were as follows: a dataset of liver disease in Andhra Pradesh, India, and a dataset of biopsy data on breast cancer patients, which had the smallest amount of missing values; a dataset of monoclonal gammopathy data and a dataset of issued and non-issued credit cards by a bank, which had a moderate amount of missing values; and a single-family loan-level dataset and a dataset of cardiovascular disease in Framingham, Massachusetts, which had the largest amount of missing values. With the SPSS software program, the efficiency of each classification method was indicated by its accuracy, mean squared error, and mean absolute error. Each dataset was divided into three parts in the ratio 70:20:10: training data (70 percent) used to build a model, validation data (20 percent) used to evaluate the model's error, and testing data (10 percent) used to test the model. The split was performed in the WEKA program with random seeds of 10, 20, 30, 40, and 50. For the liver disease dataset from Andhra Pradesh, India, the best classification method was the support vector machine with regression imputation, multiple imputation, and expectation maximization. For the biopsy dataset on breast cancer patients, the best classification method was the support vector machine with regression imputation and expectation maximization.
For the monoclonal gammopathy dataset, the best classification method was the artificial neural network with multiple imputation. For the dataset of issued and non-issued credit cards by a bank, the best classification method was the support vector machine with expectation maximization. For the single-family loan-level dataset, the best classification method was the decision tree with multiple imputation. For the cardiovascular disease dataset from Framingham, Massachusetts, the best classification method was the support vector machine with regression imputation, multiple imputation, and expectation maximization.
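The evaluation protocol described in the abstract (a 70:20:10 split repeated over five random seeds, scored by accuracy, mean squared error, and mean absolute error) can be illustrated with a minimal Python sketch. This is not the authors' procedure, which used WEKA and SPSS; the function names, the integer stand-in data, and the assumption that the error metrics are computed on numerically coded class labels are all hypothetical.

```python
import random

# Seeds reported in the abstract.
SEEDS = [10, 20, 30, 40, 50]


def split_70_20_10(rows, seed):
    """Shuffle rows with the given seed, then cut them into 70/20/10 parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_valid = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, about 10 percent
    return train, valid, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences (assumes numerically coded labels)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean of absolute differences (assumes numerically coded labels)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical stand-in for an already-imputed dataset of 100 records.
rows = list(range(100))
splits = {seed: split_70_20_10(rows, seed) for seed in SEEDS}
```

In a full comparison, each imputed dataset would be split once per seed, each classifier fitted on the training part, and the three metrics averaged over the five seeds.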