
   versus 2x Intel Xeon Platinum 8362. Neon crash impact is the max result test case. Results may vary.
5. MLN-080B: ANSYS® CFX® 2021.1 comparison based on AMD internal testing as of 09/27/2021 measuring the average time to run the Release 14.0 test case simulations (converted to jobs/day - higher is better) using a server with 2x AMD EPYC 75F3 utilizing 1TB (16x 64 GB DDR4-3200) versus 2x Intel Xeon Platinum 8380 utilizing 1TB (16x 64 GB DDR4-3200). Results may vary.
6. MLN-130A: ANSYS® Mechanical 2021 R2 comparison based on AMD internal testing as of 09/27/2021 measuring the average of all Release 2019 R2 test case simulations using a server with 2x AMD EPYC 75F3 versus 2x Intel Xeon Platinum 8380. Steady-state thermal analysis of a power supply module 5.3M (cg1) is the max result. Results may vary.
7. MI200-01 - World's fastest data center GPU is the AMD Instinct™ MI250X. Calculations conducted by AMD Performance Labs as of Sep 15, 2021, for the AMD Instinct™ MI250X (128GB HBM2e OAM module) accelerator at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak theoretical double precision (FP64 Matrix), 47.9 TFLOPS peak theoretical double precision (FP64), 95.7 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 47.9 TFLOPS peak theoretical single precision (FP32), 383.0 TFLOPS peak theoretical half precision (FP16), and 383.0 TFLOPS peak theoretical Bfloat16 format precision (BF16) floating-point performance (the first sketch following these notes reconstructs the MI250X FP64 figures). Calculations conducted by AMD Performance Labs as of Sep 18, 2020, for the AMD Instinct™ MI100 (32GB HBM2 PCIe® card) accelerator at 1,502 MHz peak boost engine clock resulted in 11.54 TFLOPS peak theoretical double precision (FP64), 46.1 TFLOPS peak theoretical single precision matrix (FP32), 23.1 TFLOPS peak theoretical single precision (FP32), and 184.6 TFLOPS peak theoretical half precision (FP16) floating-point performance. Published results on the NVIDIA Ampere A100 (80GB) GPU accelerator, boost engine clock of 1410 MHz, resulted in 19.5 TFLOPS peak double precision tensor cores (FP64 Tensor Core), 9.7 TFLOPS peak double precision (FP64), 19.5 TFLOPS peak single precision (FP32), 78 TFLOPS peak half precision (FP16), 312 TFLOPS peak half precision (FP16 Tensor Core), 39 TFLOPS peak Bfloat16 (BF16), and 312 TFLOPS peak Bfloat16 format precision (BF16 Tensor Core) theoretical floating-point performance. The TF32 data format is not IEEE compliant and is not included in this comparison. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, page 15, Table 1.
8. MI200-02 - Calculations conducted by AMD Performance Labs as of Sep 15, 2021, for the AMD Instinct™ MI250X accelerator (128GB HBM2e OAM module) at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak double precision matrix (FP64 Matrix) theoretical floating-point performance. Published results on the NVIDIA Ampere A100 (80GB) GPU accelerator resulted in 19.5 TFLOPS peak double precision (FP64 Tensor Core) theoretical floating-point performance. Results found at: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, page 15, Table 1.
9. MI200-07 - Calculations conducted by AMD Performance Labs as of Sep 21, 2021, for the AMD Instinct™ MI250X and MI250 (128GB HBM2e) OAM accelerators designed with AMD CDNA™ 2 6nm FinFET process technology at 1,600 MHz peak memory clock resulted in 3.2768 TB/s peak theoretical memory bandwidth performance. The MI250/MI250X memory bus interface is 4,096 bits times 2 die, and the memory data rate is 3.20 Gbps, for a total memory bandwidth of 3.2768 TB/s ((3.20 Gbps * (4,096 bits * 2))/8; worked through in the sketch after these notes). The highest published results on the NVIDIA Ampere A100 (80GB) SXM GPU accelerator resulted in 2.039 TB/s GPU memory bandwidth performance. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
10. MI200-15A - Testing conducted by AMD Performance Labs as of 10/7/2021, on a single-socket optimized AMD EPYC™ CPU server with 4x AMD Instinct™ MI250X OAM (128 GB HBM2e) 560W GPUs with AMD Infinity Fabric™ technology, using LAMMPS ReaxFF/C, patch_2Jul2021, plus AMD optimizations to LAMMPS and Kokkos that are not yet available upstream, resulted in a median score of 4x MI250X = 19,482,180.48 ATOM-Time Steps/s, vs. dual AMD EPYC 7742 @ 2.25GHz CPUs with 4x NVIDIA A100 SXM 80GB (400W) GPUs using the LAMMPS classical molecular dynamics package ReaxFF/C, patch_10Feb2021, which resulted in a published score of 8,850,000 (8.85E+06) ATOM-Time Steps/s (https://developer.nvidia.com/hpc-application-performance). 19,482,180.48/8,850,000 = 2.20x (220%) the performance, or 1.2x (120%) faster (checked in the last sketch after these notes). Container details found at: https://ngc.nvidia.com/catalog/containers/hpc:lammps. Information on LAMMPS: https://www.lammps.org/index.html. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.
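The MI250X peak-TFLOPS figures in notes 7 and 8 follow from clock rate times FLOPs per clock. A minimal sketch of that arithmetic in Python: the 1,700 MHz boost clock is quoted in the notes, but the compute-unit count and per-CU FLOPs-per-clock rates are assumptions taken from AMD's public CDNA 2 material, not from the notes themselves.

```python
# Reconstructing the MI250X peak-TFLOPS figures quoted in notes 7 and 8.
# The 1,700 MHz boost clock is from the notes; the CU count and per-CU
# FLOPs/clock rates are ASSUMPTIONS from AMD's public CDNA 2 specs.
boost_clock_ghz = 1.7
compute_units = 220  # assumed MI250X CU count (not stated in the notes)

flops_per_cu_per_clock = {
    "FP64 Matrix": 256,  # assumed matrix-core rate
    "FP64": 128,         # assumed vector rate
}

for fmt, rate in flops_per_cu_per_clock.items():
    tflops = boost_clock_ghz * compute_units * rate / 1000
    print(f"{fmt}: {tflops:.1f} TFLOPS")
# -> FP64 Matrix: 95.7 TFLOPS and FP64: 47.9 TFLOPS, matching the notes
```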

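Note 9's bandwidth figure can be checked by evaluating the note's own formula; every input below is quoted in the note itself.

```python
# Peak theoretical memory bandwidth of the MI250/MI250X, per note 9:
# (3.20 Gbps * (4,096 bits * 2 dies)) / 8 bits-per-byte.
data_rate_gbps = 3.20   # memory data rate per pin (from the note)
bus_width_bits = 4096   # memory bus interface per die (from the note)
num_dies = 2            # two dies per package (from the note)

bandwidth_gb_s = data_rate_gbps * (bus_width_bits * num_dies) / 8
print(f"{bandwidth_gb_s:.1f} GB/s = {bandwidth_gb_s / 1000:.4f} TB/s")
# -> 3276.8 GB/s = 3.2768 TB/s, the figure quoted in the note
```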

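Likewise, the speedup claimed in note 10 follows directly from the two quoted LAMMPS scores:

```python
# LAMMPS ReaxFF/C throughput ratio from note 10 (both scores quoted there).
mi250x_score = 19_482_180.48  # ATOM-Time Steps/s, 4x MI250X
a100_score = 8_850_000.0      # ATOM-Time Steps/s, 4x A100 SXM 80GB

ratio = mi250x_score / a100_score
print(f"{ratio:.2f}x the performance")  # -> 2.20x (220%)
print(f"{ratio - 1:.0%} faster")        # -> 120% faster
```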
                       Biography
  Raja Swaminathan is a Senior Fellow & Advanced Packaging Leader at AMD, Austin, Texas, USA. Prior to AMD, he was at Apple, where he architected and developed the packaging technologies for the M1x series of processors, and before that he was Principal Engineer, Silicon Package Architecture, at Intel. He holds 35 patents on semiconductor packaging technologies. He received his BS in Metallurgy from the Indian Institute of Technology, Madras, India, and a PhD in Materials Science from Carnegie Mellon U. Email: raja.swaminathan@amd.com


