
than the caches in generalized processors, which guess at the best data to store, on-chip memory scratchpads storing precisely predictable data can be used to optimize data flow and maximize data reuse. Finally, it has been amply demonstrated that AI applications simply do not need the high precision required for many other workloads; the state of the art is to use 8 bits for training [4] and 4 bits (with even 2 bits possible) for inference [5] while preserving the accuracy achieved at higher precisions. The use of lower precision leverages the quadratic scaling of compute energy and area with the number of bits.
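To make the idea concrete, the sketch below fake-quantizes a weight tensor to a given bit width. It is only a minimal illustration of uniform symmetric quantization, not the specific training and inference schemes cited in [4] and [5]; the per-tensor scale factor and the tensor sizes are assumptions made for the example.

```python
import numpy as np

def fake_quantize(x, bits):
    """Illustrative symmetric uniform quantization: quantize then dequantize a tensor."""
    levels = 2 ** (bits - 1) - 1              # e.g., 127 levels for 8 bits, 7 for 4 bits
    scale = np.max(np.abs(x)) / levels        # per-tensor scale factor (an assumption here)
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale                          # dequantized values used by downstream layers

w = np.random.randn(256, 256).astype(np.float32)
for b in (8, 4, 2):
    mse = np.mean((w - fake_quantize(w, b)) ** 2)
    # Multiply-accumulate energy and area scale roughly quadratically with bit width,
    # so dropping from 16 bits to 4 bits saves on the order of 16x.
    print(f"{b}-bit quantization MSE: {mse:.5f}")
```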
With these drivers, several specialized components including graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and AI accelerators have been widely deployed to accelerate AI workloads. Because of the huge computational requirements, training workloads are often parallelized among multiple accelerators; nonetheless, large models, with billions of weights, may still take days to train. Although inference involves mathematical operations similar to those in training, there are many fewer operations and one processor is usually sufficient for inference workloads.
AI workloads have large memory requirements on account of the large size of models and the need to store intermediate data structures as data propagates through the network. The requirements for training are particularly taxing because of the need to store multiple training examples in memory (called a “minibatch”), the model weights, the weight updates, and the intermediate results at each layer (activations). Because of the large number of training examples, throughput is key. On the other hand, inference has fewer data structures, involving only the forward pass. If possible, storage of the full DNN model near memory is highly desirable to reduce latency and minimize the energy cost of data movement. Low latency is of prime importance for many inference applications because of the need for real-time processing.
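A rough back-of-envelope sketch of these training-memory contributions is given below. The minibatch size, parameter count, activation count, and the use of 32-bit values are illustrative assumptions, not measurements from this article.

```python
# Rough estimate of training memory: weights, weight updates, and per-layer
# activations held for a minibatch. All numbers are illustrative assumptions.
BYTES = 4                              # assume 32-bit values
minibatch = 256                        # training examples held in memory at once
num_weights = 25_000_000               # e.g., a ResNet-50-class model (~25M parameters)
activations_per_example = 10_000_000   # activation elements summed over layers (assumed)

weights_gb     = num_weights * BYTES / 1e9
updates_gb     = num_weights * BYTES / 1e9                      # same size as the weights
activations_gb = minibatch * activations_per_example * BYTES / 1e9

print(f"weights: {weights_gb:.2f} GB, updates: {updates_gb:.2f} GB, "
      f"activations: {activations_gb:.2f} GB (dominant at large minibatch)")
```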
Although great strides have been made in accelerating AI workloads through compute units such as GPUs and specialized accelerators, it is equally important to increase the memory bandwidth in tandem; otherwise the compute units can remain idle waiting for data, leading to unbalanced system performance. To support the high bandwidth requirements, memories such as high bandwidth memory (HBM) have been introduced [6]. An illustrative study found that, over a 10-year period, GPU compute performance increased by 41x while bandwidth increased only 10x [7]. To illustrate this point further, we simulated [8] the training of ResNet-50 (a common network for image classification) using a system of 16 interconnected AI accelerator chips, asking whether the maximum benefit to system performance would come from improving compute or from increasing memory bandwidth. As shown in Figure 2, for an accelerator that has already leveraged many tricks to improve compute performance, proportionate improvements in memory bandwidth at fixed compute capability far outperform further improvements in compute at fixed memory bandwidth. The maximum improvements are achieved by increasing compute capability and memory bandwidth in a balanced way, as illustrated by the top curve. This gap between compute performance and memory bandwidth presents an exciting opportunity for heterogeneous integration to contribute to AI.

Figure 2: System performance improvement comparing changes in compute vs. bandwidth.
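The compute-versus-bandwidth trade-off can be illustrated with a simple roofline-style model; this is only a sketch, not the simulator used in [8], and the peak compute, bandwidth, and arithmetic-intensity numbers are assumed for the example.

```python
# Roofline-style sketch: attainable throughput is limited either by peak compute
# or by memory bandwidth times arithmetic intensity. Values are illustrative.
def attainable_tflops(peak_tflops, bw_tbps, flops_per_byte):
    return min(peak_tflops, bw_tbps * flops_per_byte)

intensity = 50  # FLOPs performed per byte moved (assumed for a DNN layer mix)
baseline     = attainable_tflops(100, 1.0, intensity)   # 100 TFLOPs peak, 1 TB/s
more_compute = attainable_tflops(200, 1.0, intensity)   # 2x compute, same bandwidth
more_bw      = attainable_tflops(100, 2.0, intensity)   # same compute, 2x bandwidth

# When the workload is bandwidth-bound, doubling compute yields no gain,
# while doubling bandwidth doubles the attainable throughput.
print(f"baseline {baseline}, 2x compute {more_compute}, 2x bandwidth {more_bw} TFLOPs")
```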
As discussed earlier, AI training is highly computationally intensive, often requiring parallelization among many chips, or “workers.” Because communication costs increase with the number of parallel workers, high-bandwidth communication between AI chips also becomes a critical bottleneck where heterogeneous integration can help. For example, in data parallelism, each worker takes a subset of the training data and determines the changes to the model for its part of the data. However, as the training proceeds, these weight changes must be exchanged and synchronized so that all workers have an updated copy of the model. Several topologies such as hypercube, mesh, and torus with high-speed links are being employed to optimize chip-to-chip communication [9].
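A minimal sketch of this data-parallel exchange is shown below, with simulated workers averaging their weight updates each step. The toy loss, shard sizes, and learning rate are assumptions; the actual chip-to-chip collectives and topologies of [9] are far more elaborate.

```python
import numpy as np

# Data-parallel sketch: each "worker" computes an update on its own shard of the
# minibatch, then updates are averaged (an all-reduce) so every worker keeps an
# identical copy of the model.
rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 8, 0.1
weights = rng.standard_normal(dim)            # model replicated on every worker
data_shards = [rng.standard_normal((16, dim)) for _ in range(num_workers)]

for step in range(3):
    # Each worker computes a local gradient from its shard (toy quadratic loss).
    local_grads = [shard.T @ (shard @ weights) / len(shard) for shard in data_shards]
    # All-reduce: exchange and average the updates; this is the communication
    # step whose cost grows with the number of workers.
    avg_grad = np.mean(local_grads, axis=0)
    weights -= lr * avg_grad                  # every worker applies the same update
    print(f"step {step}: |grad| = {np.linalg.norm(avg_grad):.3f}")
```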
To summarize this section, AI compute requires dense compute modules capable of accelerating operations common to DNNs, such as matrix multiplication. AI memory demands high bandwidth to match the compute throughput and high capacity to support large DNN models. Finally, large training jobs require multiple chips connected by high-bandwidth, energy-efficient interfaces.

Classical packaging vs. heterogeneous integration
A classical packaging construct may comprise several individual first-level assemblies that are joined to a card or board, often referred to as discrete chip-on-package assembly. Alternatively, it may be a system on chip (SoC), where a single chip comprises an entire “system,” meaning it has logic, memory, and other entities that function in unison as a complete system; in this case, the chip is likely to be very expensive, and the packaging assembly yield has to be very close to perfect (100%). A different type of system integration is seen in the system in package (SiP) case, where a number of integrated circuit chips are attached to a single packaging substrate, creating a single module that performs all or most of the functions of a full system. The SiP case has the advantage that chips may be sourced from various suppliers, to create a cost-effective solution, such as found in mobile phones.

However, these classical packaging solutions fall short of delivering the key requirements of AI compute, i.e., 1) high compute density, 2) high bandwidth access to memory, and 3) chip-to-chip connectivity with high bandwidth, low latency, and low energy cost. Heterogeneous integration provides innovative solutions where classical options are unable to meet these
