
   versus 2x Intel Xeon Platinum 8362. Neon crash impact is the max result test case. Results may vary.
5. MLN-080B: ANSYS® CFX® 2021.1 comparison based on AMD internal testing as of 09/27/2021 measuring the average time to run the Release 14.0 test case simulations (converted to jobs/day - higher is better) using a server with 2x AMD EPYC 75F3 utilizing 1TB (16x 64 GB DDR4-3200) versus 2x Intel Xeon Platinum 8380 utilizing 1TB (16x 64 GB DDR4-3200). Results may vary.
6. MLN-130A: ANSYS® Mechanical 2021 R2 comparison based on AMD internal testing as of 09/27/2021 measuring the average of all Release 2019 R2 test case simulations using a server with 2x AMD EPYC 75F3 versus 2x Intel Xeon Platinum 8380. Steady-state thermal analysis of a power supply module 5.3M (cg1) is the max result. Results may vary.
7. MI200-01 - World's fastest data center GPU is the AMD Instinct™ MI250X. Calculations conducted by AMD Performance Labs as of Sep 15, 2021, for the AMD Instinct™ MI250X (128GB HBM2e OAM module) accelerator at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak theoretical double precision (FP64 Matrix), 47.9 TFLOPS peak theoretical double precision (FP64), 95.7 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 47.9 TFLOPS peak theoretical single precision (FP32), 383.0 TFLOPS peak theoretical half precision (FP16), and 383.0 TFLOPS peak theoretical Bfloat16 format precision (BF16) floating-point performance (the first sketch following these notes reconstructs the MI250X FP64 figures). Calculations conducted by AMD Performance Labs as of Sep 18, 2020, for the AMD Instinct™ MI100 (32GB HBM2 PCIe® card) accelerator at 1,502 MHz peak boost engine clock resulted in 11.54 TFLOPS peak theoretical double precision (FP64), 46.1 TFLOPS peak theoretical single precision matrix (FP32), 23.1 TFLOPS peak theoretical single precision (FP32), and 184.6 TFLOPS peak theoretical half precision (FP16) floating-point performance. Published results on the NVIDIA Ampere A100 (80GB) GPU accelerator, boost engine clock of 1410 MHz, resulted in 19.5 TFLOPS peak double precision tensor cores (FP64 Tensor Core), 9.7 TFLOPS peak double precision (FP64), 19.5 TFLOPS peak single precision (FP32), 78 TFLOPS peak half precision (FP16), 312 TFLOPS peak half precision (FP16 Tensor Core), 39 TFLOPS peak Bfloat16 (BF16), and 312 TFLOPS peak Bfloat16 format precision (BF16 Tensor Core) theoretical floating-point performance. The TF32 data format is not IEEE compliant and is not included in this comparison. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, page 15, Table 1.
8. MI200-02 - Calculations conducted by AMD Performance Labs as of Sep 15, 2021, for the AMD Instinct™ MI250X accelerator (128GB HBM2e OAM module) at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak double precision matrix (FP64 Matrix) theoretical floating-point performance. Published results on the NVIDIA Ampere A100 (80GB) GPU accelerator resulted in 19.5 TFLOPS peak double precision (FP64 Tensor Core) theoretical floating-point performance. Results found at: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, page 15, Table 1.
9. MI200-07 - Calculations conducted by AMD Performance Labs as of Sep 21, 2021, for the AMD Instinct™ MI250X and MI250 (128GB HBM2e) OAM accelerators designed with AMD CDNA™ 2 6nm FinFET process technology at 1,600 MHz peak memory clock resulted in 3.2768 TB/s peak theoretical memory bandwidth performance. The MI250/MI250X memory bus interface is 4,096 bits times 2 die, and the memory data rate is 3.20 Gbps, for a total memory bandwidth of 3.2768 TB/s ((3.20 Gbps * (4,096 bits * 2))/8; worked through in the sketch after these notes). The highest published results on the NVIDIA Ampere A100 (80GB) SXM GPU accelerator resulted in 2.039 TB/s GPU memory bandwidth performance. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
10. MI200-15A - Testing conducted by AMD Performance Labs as of 10/7/2021, on a single-socket optimized AMD EPYC™ CPU server with 4x AMD Instinct™ MI250X OAM (128 GB HBM2e) 560W GPUs with AMD Infinity Fabric™ technology, using LAMMPS ReaxFF/C, patch_2Jul2021, plus AMD optimizations to LAMMPS and Kokkos that are not yet available upstream, resulted in a median score of 4x MI250X = 19,482,180.48 ATOM-Time Steps/s, vs. dual AMD EPYC 7742 @ 2.25GHz CPUs with 4x NVIDIA A100 SXM 80GB (400W) GPUs using the LAMMPS classical molecular dynamics package ReaxFF/C, patch_10Feb2021, which resulted in a published score of 8,850,000 (8.85E+06) ATOM-Time Steps/s (https://developer.nvidia.com/hpc-application-performance). 19,482,180.48/8,850,000 = 2.20x (220%) the performance, or 1.2x (120%) faster (checked in the last sketch after these notes). Container details found at: https://ngc.nvidia.com/catalog/containers/hpc:lammps. Information on LAMMPS: https://www.lammps.org/index.html. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.
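The MI250X peak-TFLOPS figures in notes 7 and 8 follow from clock rate times FLOPs per clock. A minimal sketch of that arithmetic in Python: the 1,700 MHz boost clock is quoted in the notes, but the compute-unit count and per-CU FLOPs-per-clock rates are assumptions taken from AMD's public CDNA 2 material, not from the notes themselves.

```python
# Reconstructing the MI250X peak-TFLOPS figures quoted in notes 7 and 8.
# The 1,700 MHz boost clock is from the notes; the CU count and per-CU
# FLOPs/clock rates are ASSUMPTIONS from AMD's public CDNA 2 specs.
boost_clock_ghz = 1.7
compute_units = 220  # assumed MI250X CU count (not stated in the notes)

flops_per_cu_per_clock = {
    "FP64 Matrix": 256,  # assumed matrix-core rate
    "FP64": 128,         # assumed vector rate
}

for fmt, rate in flops_per_cu_per_clock.items():
    tflops = boost_clock_ghz * compute_units * rate / 1000
    print(f"{fmt}: {tflops:.1f} TFLOPS")
# -> FP64 Matrix: 95.7 TFLOPS and FP64: 47.9 TFLOPS, matching the notes
```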

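Note 9's bandwidth figure can be checked by evaluating the note's own formula; every input below is quoted in the note itself.

```python
# Peak theoretical memory bandwidth of the MI250/MI250X, per note 9:
# (3.20 Gbps * (4,096 bits * 2 dies)) / 8 bits-per-byte.
data_rate_gbps = 3.20   # memory data rate per pin (from the note)
bus_width_bits = 4096   # memory bus interface per die (from the note)
num_dies = 2            # two dies per package (from the note)

bandwidth_gb_s = data_rate_gbps * (bus_width_bits * num_dies) / 8
print(f"{bandwidth_gb_s:.1f} GB/s = {bandwidth_gb_s / 1000:.4f} TB/s")
# -> 3276.8 GB/s = 3.2768 TB/s, the figure quoted in the note
```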

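Likewise, the speedup claimed in note 10 follows directly from the two quoted LAMMPS scores:

```python
# LAMMPS ReaxFF/C throughput ratio from note 10 (both scores quoted there).
mi250x_score = 19_482_180.48  # ATOM-Time Steps/s, 4x MI250X
a100_score = 8_850_000.0      # ATOM-Time Steps/s, 4x A100 SXM 80GB

ratio = mi250x_score / a100_score
print(f"{ratio:.2f}x the performance")  # -> 2.20x (220%)
print(f"{ratio - 1:.0%} faster")        # -> 120% faster
```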
                       Biography
  Raja Swaminathan is a Senior Fellow & Advanced Packaging Leader at AMD, Austin, Texas, USA. Prior to AMD, he was at Apple, where he architected and developed the packaging technologies for the M1x series of processors, and before that he was Principal Engineer, Silicon Package Architecture, at Intel. He holds 35 patents on semiconductor packaging technologies. He received his BS in Metallurgy from the Indian Institute of Technology, Madras, India, and a PhD in Materials Science from Carnegie Mellon U. Email: raja.swaminathan@amd.com


