Authors
In Ho Cho, Jae-Kwang Kim, Yicheng Yang, Yonghyun Kwon, and Ashish Chapagain, Iowa State University (ISU), USA
Abstract
Machine learning (ML) advancements hinge upon data, the vital ingredient for training. Statistically curing the missing data is called imputation, and there are many imputation theories and tools. But they often require difficult statistical and/or discipline-specific assumptions, and general tools capable of curing large data are lacking. Fractional hot deck imputation (FHDI) can cure data by filling nonresponses with observed values (thus, "hot-deck") without resorting to assumptions. This review paper summarizes how FHDI evolves to an ultra data-oriented parallel version (UP-FHDI). Here, "ultra" data have concurrently large instances (big-n) and high dimensionality (big-p). The evolution is made possible with specialized parallelism and a fast variance estimation technique. Validations with scientific and engineering data confirm that UP-FHDI can cure ultra data (p > 10,000 & n > 1M), and the cured data sets can improve the prediction accuracy of subsequent ML. The evolved FHDI will help promote reliable ML with "cured" big data.
Keywords
Big Incomplete Data, Fractional Hot-Deck Imputation, Machine Learning, High-Dimensional Missing Data