May Workshop Report
The report on the S4PST May Workshop, hosted at the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville, is now available here.
Motivation
The US Department of Energy (DOE) Exascale Computing Project (ECP) has fostered and strengthened the use of modern software engineering practices for the development of applications and libraries, resulting in the coordinated and interoperable E4S and xSDK ecosystem stacks. While this is a cost-effective strategy, the technologies underpinning our HPC software are developed using programming systems and tools (PST). Today, our major PST stack consists of traditional high-performance computing (HPC) programming languages (Fortran, C, C++, Python) that offload ecosystem aspects to third-party implementations motivated by use cases outside science, due to the broader nature of these communities. The result is a massive number of specifications and variants, creating an expensive many-ecosystems orchestration that adds overhead costs for the consumers near the end of the pipeline in our development cycles. Despite efforts such as ECP, this dependence on legacy design decisions continues to lead the broader HPC community to a scattered and corrective maintenance model of the ecosystem aspects needed to enable performance portability, productivity, correctness, and reproducibility when building our software. These aspects, combined with the niche, one-off nature of scientific software, resulted in a separate and uncoordinated development model that was identified early on as unsustainable when crafting the vision for exascale computing software. This technical debt is expected only to grow, and to be paid by the future workforce in the pipeline, with the increasingly heterogeneous computing landscape in the post-Moore era.
Requirements for Sustainability
The latest generation of programming languages, e.g. Julia and Rust, embrace critical ecosystem aspects (e.g. packaging, tooling, instrumentation) as part of the overall development of the language, leading to a more productive experience that addresses today's needs. A noteworthy effort coming out of ECP in this direction is Spack, a unifying package manager targeting HPC facilities. Productivity gains are obtained through a rich, common layer structure that enhances the orchestration of the packaging and deployment of our HPC-targeted software, replacing the previously uncoordinated facility-specific and system-specific efforts. Another success story is the adoption of LLVM by major vendors and highly productive languages (e.g. Julia, Rust and Python/Numba) as the common compiler backend. LLVM empowers different communities by providing a common and coordinated development effort, which leads to a more productive experience as novel hardware architectures emerge and programming models and languages evolve. As a result, some ecosystem aspects merit a common structure that fosters a culture of collaboration across the broad and diverse HPC community. Nevertheless, for successful buy-in and adoption of sustainable practices and ecosystems by different communities, it is necessary to understand the uniqueness of HPC software and its nontrivial and highly specialized tasks. In this environment, it is key to understand societal requirements to expand the current pool of talent, as the people behind HPC efforts are highly skilled individuals, far from any type of commodity resource. Hence, sustainability requirements in this context can be classified at three different levels: technical, economic and social.
Technical requirements imply that modern software must be well tested, validated and deployed, and that results must reach a level of reproducibility sufficient to trust the scientific end products across the wide range of scales and heterogeneous platforms targeted in the DOE HPC landscape. The PST for DOE users are unique in their variety of open-source and vendor-specific APIs for parallelism and concurrency, their requirements for correctness, their validation and verification (V&V) processes for application output, and their support for a rich stack of legacy software and applications. Today, each project must address many of these aspects individually, generating siloed processes and interactions, which brings an excellent opportunity to rethink codesign for the future. Additionally, the current landscape of extreme heterogeneity in the post-Moore era and the data avalanche of AI data-driven workflows have led to an inflection point in the future of HPC, disrupting the successful monolithic model.
Economic requirements imply that the process behind the software lifecycle must have reasonable costs as more heterogeneous hardware and programming models become available in HPC. This process includes source code refactoring and software testing. More importantly, a larger cost is incurred by documentation and by design decisions or standardization processes such as feature addition and deprecation. Outreach activities such as training and consultation can be expensive without enough participants. Current costs in a corrective maintenance model are only expected to increase; this includes the costs of interactions between software maintainers and the community, and the accumulation of technical and social debt. The community should consider amortizing these debts by proactively introducing and rewarding new practices and PST designs that embrace ecosystem aspects as goals, so that sustainable software can be handed to the next generation of HPC experts.
Social requirements imply that the next generation of HPC practitioners needs to continue fostering a diverse and inclusive culture of predictive software quality, productivity, reproducibility, and maintenance to solve the important scientific challenges of tomorrow. DOE is in a unique position to help reduce social barriers towards a sustainable reward, retention, and community model, impacting the pipeline of people behind these efforts and investing in and promoting the future workforce that will bring new ideas to this new landscape.
S4PST plans to expand this proactive, rather than corrective, view towards a rich ecosystem that includes software system specification, validation, verification and correctness, quality assurance, and interoperability for the sustainability of programming systems and tools.
S4PST Mission
Our vision is to define a work plan that makes access to HPC for science more cost-effective by lowering technical, economic and social barriers to enable sustainability. The proposed community effort will deliver a comprehensive view and study that prioritizes the needs of the DOE mission. We propose a coordinated software ecosystem, S4PST, as an investment opportunity to make our programming systems and tools more accessible as we enter an inflection point in HPC. The current uncoordinated, corrective and vendor-driven approach to programming languages, models and ecosystems is only expected to become more expensive down the pipeline as more heterogeneous components are deployed in the future landscape of HPC. The current monolithic model is impacted by the end of Moore's Law, energy and economic bounds, and the wealth of data-driven AI workflows. Democratizing access through a sustainable ecosystem will allow the HPC community to grow to include traditionally underrepresented groups while maximizing the nation's strategic computing investments for science.
S4PST Objectives
Sustainable Community. The efforts envisioned towards a sustainable community for PST are: i) drafting a strategy for career development and retention of the workforce behind this effort, ii) enabling different levels of quality assurance to reduce sustainability costs that are currently scattered across users, iii) defining a project-specific reward mechanism (e.g. badging) to promote not only the sustainability and robustness of a single programming system but also a sustainable improvement of the interoperability of multiple programming systems and their tools, and iv) creating a venue for inviting emerging programming systems, languages, and tools to promote technical inclusiveness in the community. The existing initiatives in the ECP have established a set of policies and standards, improving the software quality for building large production software suites. This approach, however, eliminates the opportunities for the community to explore emerging, yet immature, programming systems and tools. Instead, we will develop policies and reward mechanisms to promote early adopters, with new flexible metrics of software quality assurance that clarify risks and mitigation to the PST community and users.
Community-Wide Technical Support. These efforts accomplish the goals for a sustained PST ecosystem, such as i) identifying, maintaining, and supporting current and future critical capabilities of the PST ecosystem, ii) tracking implementation, standard specification, and the latest capabilities, iii) establishing vendor and open-source points of contact, iv) facilitating dependency tracking and deployment via package managers (e.g., Spack, Fortran/Julia Package Manager, etc.), and v) tracking the interoperability of individual languages, programming systems, and tools. This type of information is often scattered, without any coordination, across different supercomputing facilities or small groups of experts. Having organized archives, such as tables describing language feature requirements for certain programming systems (such as C++17, specific compiler versions) and interoperability with major libraries (such as MPI), facilitates quick referencing.
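As a rough illustration of what such an organized archive could look like in machine-readable form, the Python sketch below stores language-standard and minimum-compiler requirements in a small table and answers a quick lookup. The system names (`example-portability-layer`, `example-solver`), the versions, and the helper functions are hypothetical placeholders, not real requirements of any project.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Version = Tuple[int, ...]


@dataclass
class Requirement:
    language_standard: str             # e.g. "C++17"
    min_compiler: Dict[str, Version]   # compiler name -> minimum supported version


# Hypothetical entries: names and versions are placeholders for illustration only.
REQUIREMENTS: Dict[str, Requirement] = {
    "example-portability-layer": Requirement("C++17", {"gcc": (9, 1), "clang": (10, 0)}),
    "example-solver": Requirement("C++14", {"gcc": (7, 3)}),
}


def parse_version(text: str) -> Version:
    """Turn a dotted version string such as "11.2.0" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def supports(system: str, compiler: str, version: str) -> bool:
    """Return True if the table lists `compiler` for `system` at or below `version`."""
    minimum = REQUIREMENTS[system].min_compiler.get(compiler)
    if minimum is None:
        return False  # compiler is not listed for this system
    return parse_version(version) >= minimum
```

A facility could then answer questions such as `supports("example-portability-layer", "gcc", "11.2.0")` from one shared table instead of consulting scattered notes.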
Training and Diversity. This effort consists of an aggressive community engagement and training effort, specifically targeting traditionally underrepresented minorities to diversify the pipeline of people, ensuring that technical and social debts are minimized for future generations while sustaining and expanding the existing solid connections built between DOE and the broader scientific communities. This is achieved by partnering with universities and other higher education institutions, offering training programs and internships, and working with university faculty to include PST expertise in study programs. Our goal is to define a training program that teaches students and application developers modern programming techniques to make code safer (e.g., smart pointers vs. raw pointers), cross-language and cross-platform interoperability, and, more importantly, principles and best practices of software testing and packaging. These ideas have been implemented in the IDEAS-ECP project, and we will further explore opportunities in non-HPC venues, as seen in the new scientific computing track of CppCon 2022, and in other programming systems conferences such as RustConf and JuliaCon.
Verification, Validation and Correctness. The main goal of this effort is to elaborate a strategy to reduce vendor dependence and identify gaps (e.g. bugs) in vendor and S4PST stacks. It is also our goal to continue building synergies with vendors through early, proactive verification and validation engagements that are independent of, but complementary to, vendor efforts. We will propose the application of proxy apps to identify important workloads and a hierarchical approach to perform different levels of validation: functionality, accuracy, and scalability. The goal is to ensure that DOE interests for HPC programming systems and tools are represented in open-source, vendor-independent verification and validation test suites. Such suites will capture important use cases from DOE HPC applications and, for those use cases, check that programming systems and tools conform to relevant programming model standards, support interoperability across multiple programming systems and runtime systems, and support portability across HPC's increasingly heterogeneous hardware ecosystem. Results from these suites will be useful to HPC users, programming systems and tools vendors, hardware vendors, and DOE program managers as they evaluate the suitability of programming systems and tools for DOE needs. While some relevant suites already exist, we will propose efforts to identify gaps, create new suites, and extend existing suites as needed to ensure that DOE's evolving requirements continue to be represented.
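One possible reading of the hierarchical approach is sketched below in Python: the three levels run in order, and a failure at a lower level stops the run, since accuracy or scalability results mean little for a kernel that does not work. Everything here is an assumption made for illustration: the `proxy_kernel` (a midpoint-rule estimate of pi) stands in for a real proxy app, and the tolerances, efficiency threshold, and function names are hypothetical rather than part of any S4PST suite.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LevelResult:
    level: str
    passed: bool
    detail: str


def proxy_kernel(n: int) -> float:
    """Hypothetical proxy app: midpoint-rule integral of 4/(1+x^2) on [0, 1]."""
    h = 1.0 / n
    return h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(n))


def check_functionality(kernel: Callable[[int], float]) -> LevelResult:
    # Level 1: the kernel runs and returns a finite floating-point value.
    try:
        value = kernel(1000)
    except Exception as exc:
        return LevelResult("functionality", False, f"raised {exc!r}")
    ok = isinstance(value, float) and math.isfinite(value)
    return LevelResult("functionality", ok, f"value={value}")


def check_accuracy(kernel: Callable[[int], float],
                   reference: float = math.pi, tol: float = 1e-6) -> LevelResult:
    # Level 2: the result matches a trusted reference within an assumed tolerance.
    err = abs(kernel(10_000) - reference)
    return LevelResult("accuracy", err <= tol, f"abs_error={err:.3e}")


def check_scalability(timings: Dict[int, float],
                      min_efficiency: float = 0.5) -> LevelResult:
    # Level 3: strong-scaling efficiency computed from caller-supplied wall times
    # (worker count -> seconds), relative to the smallest worker count.
    base_workers = min(timings)
    base_time = timings[base_workers]
    worst = min(base_time * base_workers / (t * w) for w, t in timings.items())
    return LevelResult("scalability", worst >= min_efficiency,
                       f"min_efficiency={worst:.2f}")


def run_hierarchy(checks: List[Callable[[], LevelResult]]) -> List[LevelResult]:
    """Run levels in order and stop at the first failing level."""
    results: List[LevelResult] = []
    for check in checks:
        result = check()
        results.append(result)
        if not result.passed:
            break
    return results
```

A suite would call `run_hierarchy` with one closure per level, for example `lambda: check_functionality(proxy_kernel)`, and pass measured timings from the target platform to `check_scalability`.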
Emerging Technologies. One of the major S4PST objectives is the definition of mechanisms able to drive (R&D, collaborative, shepherd, ...) efforts to guarantee that DOE priorities are part of the PST ecosystem. This will help to develop a more robust, functional, and sustainable PST ecosystem. Important DOE research priorities, such as performance portability, extreme heterogeneity, and automatic verification, among many others, will help to build a valuable long-term PST ecosystem that exceeds the current sustainability capacity and contributes significantly to key scientific milestones and transformational discoveries. While DOE, and ASCR specifically, maintains a very important software stack, it is part of ASCR's identity and our mission to continue to propose new ideas and technologies that can be integrated into the DOE portfolio. The value and return on investment for software sustainability of emerging technologies such as modern LLVM-based high-productivity languages and ecosystems (e.g. Julia, Rust, Python/Numba) and modern build systems (e.g. Meson) need to be understood in the DOE context, in conjunction with their success in the broader field of computing. This understanding will take into consideration previous PST-related efforts, such as the Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) program. This is crucial in the convergence of AI and HPC, as AI is driven by different community and business needs that do not necessarily focus on the scalability and science mission aspects of DOE HPC.
S4PST Contact
Keita Teranishi (PI), Oak Ridge National Laboratory, USA. teranishik@ornl.gov
William F. Godoy (co-PI), Oak Ridge National Laboratory, USA. godoywf@ornl.gov
Pedro Valero-Lara (co-PI), Oak Ridge National Laboratory, USA. valerolarap@ornl.gov