Workshop Program


Location: Virtual

Date: Monday, November 13, 2023

Time: 9:00 AM - 5:30 PM CST


Coffee Break (10:00 AM - 10:30 AM)

MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Steven Farrell (Lawrence Berkeley National Laboratory) and others

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf(TM) is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons(TM) Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems, along with a systematic framework for their joint analysis and insights on implementations. Furthermore, we characterize each benchmark with compute, memory and I/O behaviours to parameterize extended roofline performance models.
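The roofline characterization mentioned in the abstract can be sketched in a few lines of Python. The function and peak numbers below are illustrative placeholders, not the paper's models or measurements.

```python
# Minimal roofline sketch: attainable performance is capped either by peak
# compute or by memory bandwidth times arithmetic intensity.
# Peak values in the example are illustrative placeholders, not measurements.

def attainable_gflops(arithmetic_intensity, peak_gflops, peak_gb_per_s):
    """arithmetic_intensity is FLOPs performed per byte moved from memory."""
    return min(peak_gflops, arithmetic_intensity * peak_gb_per_s)

# A kernel doing 2 FLOPs/byte on a machine with 1000 GFLOP/s peak compute
# and 100 GB/s memory bandwidth is memory-bound: min(1000, 2 * 100) = 200.
print(attainable_gflops(2.0, 1000.0, 100.0))  # -> 200.0
```

Characterizing each benchmark's compute, memory, and I/O behavior supplies the arithmetic-intensity inputs for such a model.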

HYPPO: A Surrogate-Based Multi-Level Parallelism Tool for Hyperparameter Optimization

Vincent Dumont (Lawrence Berkeley National Laboratory)

We present a new software tool, HYPPO, that enables automatic tuning of the hyperparameters of various deep learning models. Unlike other hyperparameter optimization methods, HYPPO uses adaptive surrogate models and directly accounts for uncertainty in model predictions to find accurate and reliable models that make robust predictions. Using asynchronous nested parallelism, we are able to significantly alleviate the computational burden of training complex architectures and quantifying the uncertainty. HYPPO is implemented in Python and can be used with both the TensorFlow and PyTorch libraries. We demonstrate various software features on time-series prediction and image classification problems as well as a scientific application in computed tomography image reconstruction. Finally, we show that we can reduce by an order of magnitude the number of evaluations necessary to find the optimal region in the hyperparameter space and reduce by two orders of magnitude the throughput for such an HPO process to complete.
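The surrogate-with-uncertainty idea can be illustrated with a self-contained toy loop. This is not HYPPO's API or algorithm; every name below is invented, and the surrogate and acquisition rule are deliberately crude stand-ins.

```python
import random

# Toy sketch of surrogate-based hyperparameter optimization with uncertainty
# (the general idea behind tools like HYPPO; not its real interface).

def objective(x):            # stand-in for an expensive model-training run
    return (x - 0.3) ** 2

def surrogate(x, history):
    # Crude surrogate: inverse-distance weighted mean of observed losses,
    # with uncertainty growing with distance to the nearest sample.
    dists = [(abs(x - xi), yi) for xi, yi in history]
    nearest = min(d for d, _ in dists)
    mean = (sum(y / (d + 1e-9) for d, y in dists)
            / sum(1 / (d + 1e-9) for d, _ in dists))
    return mean, nearest     # (prediction, crude uncertainty)

def acquisition(x, history):
    mean, unc = surrogate(x, history)
    return mean - 0.5 * unc  # lower-confidence bound: exploit, but explore

def optimize(n_iters=20, seed=0):
    rng = random.Random(seed)
    history = [(x, objective(x)) for x in (0.0, 1.0)]  # initial evaluations
    for _ in range(n_iters):
        candidates = [rng.random() for _ in range(100)]
        x = min(candidates, key=lambda c: acquisition(c, history))
        history.append((x, objective(x)))              # evaluate and record
    return min(history, key=lambda p: p[1])            # best (x, loss) found
```

A real tool would replace the toy surrogate with an adaptive model and run the expensive evaluations asynchronously in parallel.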

HPCFAIR: Enabling FAIR AI for HPC Applications

Gaurav Verma (Stony Brook University)

Artificial Intelligence (AI) is being adopted in different domains at an unprecedented scale. There is also significant interest in the scientific community in leveraging machine learning (ML) to run high-performance computing applications effectively at scale. Given the multiple efforts in this arena, work is often duplicated when existing rich datasets and ML models could be leveraged instead. The primary challenge is the lack of an ecosystem for reusing and reproducing models and datasets. In this work, we propose HPCFAIR, a modular, extensible framework that enables AI models to be Findable, Accessible, Interoperable, and Reproducible (FAIR). It provides users with a structured approach to search for, load, save, and reuse models in their codes. We present the design and implementation of our framework and highlight how it can be seamlessly integrated into ML-driven high-performance computing applications and scientific machine learning workloads.
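As a rough illustration of the search/load/save/reuse workflow such a framework enables, here is a minimal in-memory registry sketch. The class, method names, and metadata fields are hypothetical, not the HPCFAIR interface.

```python
# Hypothetical sketch of a FAIR-style model registry. Storing structured
# metadata alongside each artifact is what makes a model findable and
# reusable later; a real system would persist to shared storage.

class ModelRegistry:
    def __init__(self):
        self._store = {}

    def save(self, name, model, metadata):
        self._store[name] = {"model": model, "metadata": metadata}

    def search(self, **criteria):
        # Find models whose metadata matches every given key/value pair.
        return [name for name, entry in self._store.items()
                if all(entry["metadata"].get(k) == v
                       for k, v in criteria.items())]

    def load(self, name):
        return self._store[name]["model"]

registry = ModelRegistry()
registry.save("surrogate-v1", model=object(),
              metadata={"task": "regression", "framework": "pytorch"})
matches = registry.search(task="regression")  # -> ["surrogate-v1"]
```
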

High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function

Ada Sedova (Oak Ridge National Laboratory)

Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also seen significant benefits from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models for high-throughput data such as proteomics data. We showcase the methodologies our pipeline currently supports and detail future tasks for the pipeline to encompass, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.

Lunch Break (12:30 PM - 2:00 PM)

Afternoon Invited Talk: Large-Scale Language Model Training

Bryan Catanzaro (NVIDIA)

Coffee Break (3:00 PM - 3:30 PM)

Semantic-Aware Lossless Data Compression for Deep Learning Recommendation Model (DLRM)

Sarunya Pumma (Advanced Micro Devices)

Deep Learning Recommendation Model (DLRM), a new neural network for recommendation systems, introduces challenging requirements for deep neural network training and inference. The DLRM model is typically large and cannot fit in a single GPU's memory. When running on multiple GPUs, DLRM requires model parallelism for the bottom part of the model and data parallelism for the top part. Because of this hybrid-parallel scheme, all-to-all communication is used to join the top and bottom parts, and we have observed that this all-to-all communication is costly and is a bottleneck in DLRM training and inference. In this paper, we reduce the communication volume by using DLRM's properties to compress the transferred data without information loss. We demonstrate the benefits of our method by training DLRM TeraByte on AMD Instinct MI100 accelerators. The experimental results show a 38%-59% improvement in the time-to-solution of DLRM TeraByte training for FP32 and mixed precision.
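As a toy illustration of semantic-aware lossless compression (not the paper's actual scheme), the sketch below assumes that many values in a transferred payload are exact zeros and run-length encodes them, so the receiver reconstructs the data exactly.

```python
# Toy lossless compressor exploiting one assumed semantic property of the
# payload: long runs of exact zeros. Illustrative only; not the DLRM scheme.

def compress(values):
    out, i = [], 0
    while i < len(values):
        if values[i] == 0.0:
            run = 0
            while i < len(values) and values[i] == 0.0:
                run += 1
                i += 1
            out.append(("z", run))        # a run of zeros, stored as a count
        else:
            out.append(("v", values[i]))  # a literal non-zero value
            i += 1
    return out

def decompress(encoded):
    values = []
    for tag, payload in encoded:
        if tag == "z":
            values.extend([0.0] * payload)
        else:
            values.append(payload)
    return values
```

Because the transform is invertible, the all-to-all can ship the smaller encoded form with no loss of information.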

Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing

Logan Ward (Argonne National Laboratory)

Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65,536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
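The division of labor Colmena proposes (the user supplies the tasks and the steering logic; the framework handles dispatch and collation) can be caricatured in a single process. All names below are invented, and real Colmena runs tasks through Parsl on HPC resources.

```python
# Single-process caricature of ML-steered ensemble simulation. The user
# writes `simulate` and `select_next`; the "framework" loop dispatches
# tasks and collates results. Names and logic here are illustrative only.

def simulate(x):                      # user-supplied "simulation" task
    return -(x - 3) ** 2              # toy score to maximize

def select_next(results, candidates):
    # User-supplied steering logic: pick the untried candidate closest to
    # the best result so far (a stand-in for an ML proxy model's choice).
    best_x = max(results, key=results.get)
    untried = [c for c in candidates if c not in results]
    return min(untried, key=lambda c: abs(c - best_x))

def run_campaign(candidates, budget):
    results = {candidates[0]: simulate(candidates[0])}   # seed evaluation
    while len(results) < budget:
        x = select_next(results, candidates)
        results[x] = simulate(x)      # dispatch the task, collate the result
    return results
```

In the real framework the dispatch step is asynchronous and parallel, which is what lets the steering logic keep tens of thousands of CPUs busy.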

Production Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC

Wesley Brewer (General Dynamics Information Technology)

We explore how to optimally deploy different types of machine-learned surrogate models used in rotorcraft aerodynamics on HPC. We first developed three rotorcraft models at three different orders of magnitude (2M, 44M, and 212M trainable parameters) to use as test models. We tested three types of inference server deployments: (1) a Flask-based HTTP inference server, (2) TensorFlow Serving with the gRPC protocol, and (3) a RedisAI server with the RESP protocol. We investigated deployments on both DoD HPCMP's SCOUT and DOE OLCF's Summit POWER9 supercomputers, demonstrated the ability to perform inference on a million samples per second using 192 GPUs, and studied multiple scenarios on both NVIDIA T4 and V100 GPUs. We studied a range of concurrency levels on both the client side and the server side, and we provide optimal configuration advice based on the type of deployment. Finally, we provide a simple Python-based framework for benchmarking machine-learned surrogate models using the various inference servers.
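The core of such a benchmark is a client-side throughput measurement. The sketch below is a generic, hypothetical version with the network call replaced by a local stub so it is runnable; it is not the paper's framework.

```python
import time

# Generic client-side throughput measurement for an inference server.
# `send_request` stands in for an HTTP/gRPC/RESP call; here it is a local
# stub so the sketch runs without a server. Names are illustrative only.

def send_request(batch):
    return [x * 2.0 for x in batch]   # stub "model" response

def benchmark(num_requests, batch_size):
    start = time.perf_counter()
    for _ in range(num_requests):
        send_request([0.0] * batch_size)
    elapsed = time.perf_counter() - start
    return num_requests * batch_size / elapsed   # samples per second
```

Sweeping `batch_size` and the number of concurrent clients against each server type is how the deployment comparisons in the paper are framed.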

HPC Ontology: Towards a Unified Ontology for Managing Training Datasets and AI Models for High-Performance Computing

Chunhua Liao (Lawrence Livermore National Laboratory)

Machine learning (ML) techniques have been widely studied to address various challenges of productively and efficiently running large-scale scientific applications on heterogeneous supercomputers. However, it is extremely difficult to generate, access, and maintain the training datasets and AI models needed to accelerate ML-based research. The Future of Research Communications and e-Scholarship (FORCE11) has proposed the FAIR data principles describing Findability, Accessibility, Interoperability, and Reusability. In this paper, we present our ongoing work on designing an ontology for high-performance computing (the HPC ontology) in order to make training datasets and AI models FAIR. Our ontology provides controlled vocabularies, explicit semantics, and formal knowledge representations. Our design uses an extensible two-level pattern, capturing both high-level meta-information and low-level data content for software, hardware, experiments, workflows, training datasets, AI models, and so on. Preliminary evaluation shows that the HPC ontology is effective in annotating selected data and supporting a set of SPARQL queries.
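A toy version of ontology-style annotation and querying: a plain list of triples stands in for an RDF store and a pattern matcher stands in for SPARQL. The vocabulary terms are invented, not drawn from the actual HPC ontology.

```python
# Toy triple store and pattern matcher illustrating ontology-style
# annotation. A real deployment would use RDF and SPARQL; the subject,
# predicate, and object terms below are invented for illustration.

triples = [
    ("model:resnet50", "hpc:trainedOn", "system:summit"),
    ("model:resnet50", "hpc:framework", "pytorch"),
    ("dataset:cosmoflow", "hpc:usedBy", "model:cosmoflow-net"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [(s, p, o) for s, p, o in triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)]
```

Controlled vocabularies make such queries meaningful across datasets: every annotator uses the same predicate for the same relationship.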

Is Disaggregation Possible for HPC Cognitive Simulation?

Michael Wyatt (Lawrence Livermore National Laboratory)

Cognitive simulation (CogSim) is an important and emerging workflow for HPC scientific exploration and scientific machine learning (SciML). One challenging CogSim workload is the replacement of one component in a complex physical simulation with a fast, learned surrogate model that sits inside the computational loop. Executing this in-the-loop inference is particularly challenging because it requires frequent inference across multiple possible target models, can be on the simulation's critical path (latency bound), is subject to requests from multiple MPI ranks, and typically involves a small number of samples per request. In this paper, we explore the use of large, dedicated deep learning / AI accelerators that are disaggregated from compute nodes for this CogSim workload, and we compare the trade-offs of using these accelerators versus node-local GPU accelerators on leadership-class HPC systems.
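One way to see why many small per-rank requests are hard to serve efficiently is request batching: a disaggregated inference server can amortize model-invocation cost by grouping requests from many MPI ranks before each call. The sketch below is illustrative queueing logic only, not the paper's system.

```python
# Illustrative batching of small per-rank inference requests. Each pending
# item is a (rank, samples) pair; grouping them lets one model invocation
# serve many ranks. Names and structure are invented for illustration.

def batch_requests(pending, max_batch):
    """Group pending (rank, samples) requests into batches of at most max_batch."""
    return [pending[i:i + max_batch]
            for i in range(0, len(pending), max_batch)]

# Eight ranks each sending one small sample, batched four at a time:
pending = [(rank, [0.0]) for rank in range(8)]
batches = batch_requests(pending, max_batch=4)   # -> 2 batches of 4
```

The trade-off the paper studies follows from this: batching raises accelerator utilization but adds queueing delay on the simulation's critical path.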