The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. It is used by the TOP500 list as a tool to rank systems. The benchmark allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the measured numbers serve as a good correction to the theoretical peak performance.
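To make the idea concrete, here is a minimal sketch of what a Linpack-style measurement does: solve a dense system, time it, and divide a conventional operation count by the elapsed time. It is not the HPL code. The function names are hypothetical, the solver is plain Gaussian elimination with partial pivoting in pure Python, and the operation count 2n^3/3 + 2n^2 is the one traditionally credited by the classic LINPACK benchmark.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b in place by Gaussian elimination with partial pivoting.

    Assumes the matrix is nonsingular (true for the random test matrix below).
    """
    n = len(a)
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
            b[i] -= f * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Operation count conventionally credited for an n x n solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def run_benchmark(n=60, seed=0):
    """Return (floating-point operations per second, max error) for one solve."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve_dense([row[:] for row in a], b[:])
    elapsed = time.perf_counter() - start
    max_err = max(abs(xi - 1.0) for xi in x)
    rate = linpack_flops(n) / elapsed if elapsed > 0 else float("inf")
    return rate, max_err
```

Calling `run_benchmark(n)` for increasing `n` mirrors, on a very small scale, how the benchmark lets the user scale the problem size. The rate from pure Python will be far below what an optimized implementation achieves on the same hardware.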
- HPCG Benchmark
The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is intended as a complement to the High Performance LINPACK (HPL) benchmark, currently used to rank the TOP500 computing systems. The computational and data access patterns of HPL are still representative of some important scalable applications, but not all. HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications.
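The kind of computation HPCG exercises can be illustrated with a small conjugate gradient solver. The sketch below is a simplification and not the HPCG reference code: it uses unpreconditioned conjugate gradients on a 1D Poisson matrix applied without storing the matrix, and all function names are hypothetical. Even so, it shows how the work is dominated by sparse matrix-vector products, dot products, and vector updates rather than dense matrix operations.

```python
def laplacian_matvec(x):
    """Return y = A x for the 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only as matvec."""
    x = [0.0] * len(b)
    r = b[:]              # residual b - A x for the initial guess x = 0
    p = r[:]              # first search direction
    rs_old = dot(r, r)
    for it in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, it
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

A typical call is `x, iterations = conjugate_gradient(laplacian_matvec, b)`. Each iteration touches every vector entry but performs only a few floating-point operations per entry, which is why memory access patterns matter more here than in a dense solve.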
- Parallel Deep Learning with Horovod
Deep learning is a class of machine learning algorithms in which layers of nonlinear processing units are used for feature extraction and transformation, with each successive layer taking the output from the previous layer as input. TensorFlow is one of the popular deep learning frameworks with parallel training functionality. Horovod acts as the communication layer of TensorFlow to accelerate the training process. Documentation and code examples of Horovod can be found at https://github.com/uber/horovod .
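The communication pattern behind data-parallel training can be sketched without any framework. The example below does not use the Horovod or TensorFlow APIs. It simulates several workers in one process, and `local_gradient`, `allreduce_mean`, and `train` are hypothetical names. Each worker computes a gradient on its own shard of the data, the gradients are averaged across workers (the role an allreduce collective plays), and every worker applies the same update so the model replicas stay identical.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for the collective: every worker receives the average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(shards, steps=200, lr=0.01):
    """Run synchronous data-parallel gradient descent; return one weight per worker."""
    weights = [0.0] * len(shards)  # one model replica per simulated worker
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = allreduce_mean(grads)  # communication step
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights
```

In a real run each worker would be a separate process, usually on its own GPU, and the averaging step would be carried out by the communication library over the interconnect.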
- OpenMC
OpenMC is an open source Monte Carlo particle transport code. OpenMC simulates neutrons moving stochastically through an arbitrarily defined model that represents a real-world experimental setup. The experiment could be as simple as a sphere of metal or as complicated as a full-scale nuclear reactor. Modern, portable input/output file formats are used in OpenMC: XML for input, and HDF5 for output. High performance parallel algorithms in OpenMC have demonstrated near-linear scaling to over 100,000 processors on modern supercomputers. Documentation of OpenMC can be found at https://openmc.readthedocs.io/en/stable/ .
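As an illustration of what "moving stochastically" means, the following toy model tracks particles through a one-dimensional slab. It does not use OpenMC or its input formats. The geometry, the single energy group, isotropic scattering, and the names `simulate_slab`, `sigma_t`, and `absorb_prob` are all simplifying assumptions made for this sketch.

```python
import math
import random


def simulate_slab(n_particles, thickness, sigma_t, absorb_prob, seed=1):
    """Return the (reflected, absorbed, transmitted) fractions for a 1D slab.

    sigma_t is the total macroscopic cross section (collisions per unit length);
    absorb_prob is the chance that a collision absorbs the particle.
    """
    rng = random.Random(seed)
    reflected = absorbed = transmitted = 0
    for _ in range(n_particles):
        x, mu = 0.0, 1.0  # start on the left face, moving to the right
        while True:
            # Distance to the next collision follows an exponential distribution.
            distance = -math.log(1.0 - rng.random()) / sigma_t
            x += mu * distance
            if x < 0.0:
                reflected += 1
                break
            if x > thickness:
                transmitted += 1
                break
            if rng.random() < absorb_prob:
                absorbed += 1
                break
            mu = rng.uniform(-1.0, 1.0)  # isotropic scattering: new direction cosine
    n = float(n_particles)
    return reflected / n, absorbed / n, transmitted / n
```

Because every particle history is independent, the outer loop can be divided among processors with very little communication, which is the property that lets Monte Carlo transport codes scale well.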
- Reproducibility Challenge
Once again, students in the cluster competition will be asked to replicate the results of a publication from the previous year's Supercomputing conference. For this challenge, you will take on the role of a reviewer of an SC17 paper that contains an Artifact Description appendix to see if its results are replicable.
For the past three years, SC has promoted the adoption of the Artifact Description (AD) policy endorsed by ACM. The conference has selected one of the papers from the past edition to become the benchmark for the SC SCC reproducibility challenge. The competitive selection includes the review of all the past edition SC papers with AD and the in-person interview of the finalist papers' authors.
The SCC committee is proud to announce the winning paper for this year's reproducibility challenge at SC18: "Extreme scale multi-physics simulations of the tsunamigenic 2004 Sumatra megathrust earthquake". The paper is co-authored by Carsten Uphoff, Sebastian Rettenberger, and Michael Bader from Technical University of Munich and Elizabeth H. Madden, Thomas Ulrich, Stephanie Wollherr, and Alice-Agnes Gabriel from Ludwig-Maximilians-Universität München.
In the paper, the authors describe the end-to-end optimization of the simulation code SeisSol (http://www.seissol.org/) that became necessary to run an extreme-scale, high-resolution simulation of the 2004 Sumatra-Andaman earthquake. The code used in the paper is available at https://github.com/SeisSol/SeisSol/tree/scc18 . Please make sure to use the scc18 branch.
The challenge will consist of two parts. One part will consist of a short interview with each team as in the other challenges. The other part will consist of a report describing your attempt to replicate the results from the paper. The report must be written in English, using a page size of US Letter, in PDF format, with 1 inch margins, and using a 12 point font size. You will also need to be able to create graphs and tables similar to those from the paper.
Most importantly, in the report, you will need to explain how you got your results and compare your results to the results of the authors. Your scores will not depend on whether your results match the results in the paper. The goal of the report is to state if you can or cannot replicate the results in the paper. If you cannot replicate the paper's results, you can still obtain full points by presenting your data and explaining why you cannot replicate the paper's results.
As a last note, you will only be required to report results for one compute architecture. If your cluster contains both a CPU and an accelerator, results for either one may be used for this part of the competition.
- Mystery Application
At the start of the competition, teams will be given a mystery application and its datasets. Students will be expected to build, optimize, and run this mystery application entirely at the competition.
- Power Shutoff Activity
Sometime during the 45.5 hours of the general competition, the power will be shut off at least once. The exact timing of the shutdown(s) is secret, and a shutdown may happen day or night. You and your team will need to know how to bring the hardware and software back from a full unscheduled power outage and how to resume any workload you were processing at that time. This exercise is designed to simulate real-world events that system staff must respond to. This activity will allow your team to demonstrate its systems skills by recovering the system.
This has happened before. During the first Student Cluster Competition, in 2007, the power to the Reno Convention Center suddenly failed. The entire show floor went dark. It turned out that the power coming to the convention center was inadequate for Supercomputing's high-performance machines.
Power was out for an hour or so, followed by what the press described as "the world's largest reboot". After the conference, crews were seen laying additional power cables across Virginia Street.
Our competition clusters, of course, went down. When the power was restored, some teams, who had been checkpointing their systems, resumed their computations quickly. Other teams, who had not been saving data, lost many hours of work and had to start over. The experience prompted discussions about checkpointing in the real world: the tradeoff between protection against possible disasters and the computation time that checkpointing costs.
Since power and other failures are the realities of modern computing systems, we would like to encourage cluster teams to understand the tradeoffs, and to consider what is needed in real life. We have turned this thought-provoking accident into an activity that captures the think-on-your-feet spirit of the first competition.
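The checkpointing tradeoff can be sketched in a few lines. The example below is an illustration only and is not tied to any competition application. The state layout, the file format, and the function names are hypothetical. It writes checkpoints atomically so that an interruption during the write does not corrupt the previous checkpoint, and it resumes from the last checkpoint on restart. The helper `young_interval` gives Young's first-order approximation for a checkpoint interval, sqrt(2 x checkpoint cost x mean time between failures), as one way to reason about how often to save.

```python
import json
import math
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # ask the OS to push the data to disk
    os.replace(tmp_path, path)    # the rename is atomic on POSIX file systems


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every few steps.

    fail_at simulates a power failure just before the given step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated power failure")
        state["total"] += state["step"]   # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


def young_interval(checkpoint_cost, mtbf):
    """Young's approximation for the time between checkpoints (same units as inputs)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

If `run` is interrupted and then called again with the same path, it repeats only the steps made since the last checkpoint. A smaller `checkpoint_every` loses less work after a failure but spends more time writing files.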
The power will be shut off at the breaker to both monitored circuits for all teams. Once the power is shut off, all teams will be asked to leave their booths, and each booth will be inspected to make sure that everything is powered off. All teams will be let back in their booths at the same time to begin the procedures for recovering from the power failure. The full power-off and restoration logistics will be provided on site before the competition begins.
Some rules to be aware of:
Only the undergraduate team members are allowed to participate in bringing the system back up. No vendor or advisor help is allowed. Advisor rules are listed here: http://www.studentclustercompetition.us/2018/overview.html
- The advisor is not allowed to provide technical assistance during the competition; however, he/she is encouraged to run for food and snacks for the team and to cheer during the long nights.
Failed hardware can be swapped based on the rules listed here: http://www.studentclustercompetition.us/2018/rules.html
- No changes to the physical configuration are permitted after the start of the competition. In the case of hardware failure, replacements can be made while supervised by an SCC committee member.
Battery backups are prohibited in the competition this year.
- No battery backup or uninterruptible power supply (UPS) systems are allowed to be used during the competition.
Teams that are not present in their booth will need to restore their clusters once they return.
- In this instance, clusters will be unplugged from the SCC PDU by the SCC committee after power is shut down and the cluster will remain unplugged until the team returns.
An SCC committee-initiated power shutdown will not happen during setup or benchmarking. Please let us know if you have any questions.
- Poster Session
The Overall SCC Winner will be the team with the highest combined score for its correctly completed workload of the four competition applications, the mystery application, the best benchmark run, the application interviews, and the HPC interview. The HPC interview will take into consideration the team's participation in the SC18 conference as well as its ability to wow the judges with its competition know-how.
Teams will also be required to attend parts of the conference beyond the Student Cluster Competition, and this participation will count toward their final score. Further details will be provided before the competition.