MLCommons, the open engineering consortium that helps the industry track machine learning performance, is introducing a new metric designed to more accurately assess and compare the performance of the world's fastest supercomputers. The new metric is part of MLPerf HPC v1.0, the latest release of the group's ML training performance benchmark suite for high-performance computing (HPC).
MLCommons released the inaugural MLPerf HPC results last year, measuring how quickly different systems could train a neural network. The initial benchmark suite has been used to measure systems that generally use somewhere between 500 and 4,000 processors or accelerators -- quite a bit smaller than the leading supercomputers.
But while the initial version worked well for many scientifically oriented workloads, it didn't really scale up to full supercomputing capabilities. For instance, at scale, the interconnect begins to matter a lot more.
"It's important to keep in mind that small systems and large systems behave very differently," David Kanter, the head of MLPerf, said in a briefing with reporters.
At the supercomputer scale, most systems run multiple jobs -- such as training ML models -- in parallel. So, in addition to the time-to-train metric, MLCommons added a throughput metric. It measures how many models per minute a system can train -- "a very good proxy for the aggregate machine learning capabilities of a supercomputer," Kanter said. It also captures the impact of shared resources, such as the storage system and interconnects.
Submitters can choose the size and number of instances they test, allowing them to exhibit different supercomputing capabilities. For this release, submitters also had to report their "strong-scaling" results -- the "time to train" metric.
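The difference between the two metrics can be sketched with a toy calculation. This is a hypothetical illustration, not the official MLPerf HPC scoring method: the function names, the invented timings, and the assumption that aggregate throughput is derived from the number of concurrent instances and the slowest instance's training time are all simplifications for demonstration.

```python
# Hypothetical sketch of the two MLPerf HPC v1.0-style metrics.
# All numbers are invented; real results come from full benchmark
# runs under the official rules, not from this toy calculation.

def time_to_train(start_minutes, end_minutes):
    """Strong-scaling metric: wall-clock minutes to train one model
    using the whole (or a large share of the) machine."""
    return end_minutes - start_minutes

def throughput_models_per_minute(num_instances, instance_train_minutes):
    """Weak-scaling-style metric (simplified assumption): with
    num_instances copies of the model training concurrently, count
    models finished per minute, bounded by the slowest instance."""
    return num_instances / max(instance_train_minutes)

# Example: one big run finishes in 30 minutes ...
print(time_to_train(0.0, 30.0))  # 30.0

# ... versus 16 concurrent smaller instances, slowest taking 30 minutes,
# which yields an aggregate rate of about 0.53 models per minute.
print(throughput_models_per_minute(16, [28.5, 30.0, 29.1] + [29.0] * 13))
```

The point of the second figure is that it stresses shared resources -- storage and interconnect -- in a way a single time-to-train run does not.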
For this benchmark round, MLCommons received submissions from eight supercomputing organizations, including Argonne National Laboratory, the Swiss National Supercomputing Centre, Fujitsu and Japan's Institute of Physical and Chemical Research (RIKEN), Helmholtz AI (a collaboration of the J