Google TPUv4 selectively joins NVIDIA MLPerf Training 2.0

MLPerf Training 2.0 is out. As usual, MLPerf Training is primarily an exercise for NVIDIA and its server OEMs; at this stage, its tests are exclusively, or almost exclusively, NVIDIA benchmarks. Still, there were a few other submissions, so we will focus on those.

Perhaps the coolest entry in the MLPerf Training 2.0 results was the Google TPUv4 results. Google ran large clusters of TPUv4 machines against NVIDIA’s large on-premises clusters and was able to show impressive performance:

Google TPUv4 vs Azure A100

Google positions Microsoft Azure as its competitor here, so it compares training cost at a whopping 4096-accelerator scale:

Google TPUv4 chip cost vs Azure A100 4096

That, along with the Intel Habana Gaudi2, was the most interesting data point.

Just to get an idea of how NVIDIA-focused MLPerf is at this point:

  • The ResNet ImageNet benchmark had some of the most diversity. Here, the Intel Habana Gaudi2 results, especially the 8x Gaudi2 submission with TensorFlow 2.8, appear fast compared to NVIDIA's MxNet scores. Google TPUv4 also appeared here with two results from clusters of thousands of TPUv4 chips.
  • KiTS19 medical image segmentation results were all reserved for NVIDIA in closed division
  • RetinaNet lightweight object detection was reserved for NVIDIA in the closed division
  • COCO heavy object detection was reserved for NVIDIA in the closed division
  • LibriSpeech RNN-T was NVIDIA only in closed division
  • The BERT Wikipedia NLP had by far the most diversity.
    • The Intel Habana Gaudi2 again looks like a strong contender against the NVIDIA A100 (although these are effectively different chip generations, given that Gaudi2 launched late in the A100's lifecycle.)
    • Graphcore machines make another disappointing appearance on the list.
    • Microsoft Azure with its NVIDIA A100 machines and Google GCP with its TPUv4 tested interesting clusters. Perhaps the real contest was between Google Cloud TPUv4 and NVIDIA’s large-scale A100 cluster benchmarking here, since both of these scaled to over 4000 accelerators in this test.
  • The DLRM recommendation engine test had only NVIDIA-based submissions at or under 128 accelerators. Google made a single 128-chip TPUv4 submission that could not really be compared to anything meaningful. That one result, at a different scale, was the only thing keeping this from being an NVIDIA-only test.
  • The Reinforcement MiniGo benchmark was reserved for NVIDIA.

There is a real lack of diversity with MLPerf Training. Fortunately, inference efforts seem to have better representation.

Last words

With so little competition, MLPerf Training might as well simply be called the NVIDIA MLPerf Training benchmark. Not only did NVIDIA-based systems produce the majority of the MLPerf Training 2.0 results, but had it not been for the single TPUv4 DLRM submission (noted above, at a different cluster scale), 75% of the closed-division workloads would have had only NVIDIA-based accelerator submissions. We use "NVIDIA-based" to cover both NVIDIA's own submissions and NVIDIA partner submissions that NVIDIA supports. Even counting the TPUv4 results as adding diversity to the systems tested, 5 of the 8 benchmarks had only NVIDIA accelerator results.

Intel Habana Gaudi2 case

Since there is no real competition, the training exercise should simply be called the "NVIDIA MLPerf Training" test. Luckily, Google and the folks at Intel Habana stepped in to make this round interesting. Realistically, NVIDIA has been building its AI ecosystem for years, so that is part of it. Graphcore's poor MLPerf Training showings also probably deter other training solutions from submitting. Still, the Gaudi2 looks interesting.

It should be noted that there were five submissions in the open division, each covering only one of the eight benchmarks, but with so few it is difficult to draw comparisons from them.

Richard V. Johnson