Workshop: Machine Learning in HPC Environments
Authors: Justin Wozniak (Argonne National Laboratory (ANL), University of Chicago); Hyunseung Yoo (Argonne National Laboratory (ANL)); Jamaludin Mohd-Yusof (Los Alamos National Laboratory); and Bogdan Nicolae, Richard Turgeon, Nick Collier, Jonathan Ozik, Thomas Brettin, and Rick Stevens (Argonne National Laboratory (ANL))
Abstract: Machine learning in biomedicine relies on the availability of large, high-quality data sets. These corpora are used to train statistical or deep learning-based models that can be validated against other data sets and ultimately used to guide decisions. The quality of these data sets is an essential component of the quality of the models and their decisions. Thus, identifying and inspecting outlier data is critical for evaluating, curating, and using biomedical data sets. Many techniques are available to look for outlier data, but it is not clear how to evaluate the impact of such data on highly complex deep learning methods. In this paper, we use deep learning ensembles and workflows to construct a system for automatically identifying data subsets that have a large impact on the trained models. These effects can be quantified and presented to the user for further inspection, which could improve data quality overall. We then present results from running this method on the near-exascale Summit supercomputer.