
Enabling AI with heterogeneous integration
By Arvind Kumar, Mukta Farooq  [IBM Research]

Artificial intelligence (AI) applications have become a pervasive segment of the computing landscape and are poised to continue explosive growth for many years. Delivering the very large demands of compute, memory, and bandwidth required by AI has become a leading challenge in computing system design and provided a major incentive for the deployment of specialized components to accelerate these workloads. Heterogeneous integration [1] has risen to the forefront of technology focus because of the need to enable high interconnectivity between these diverse components, coupled with the need for a new technology paradigm to counter the diminishing returns of scaling. In this article, we address three fundamental questions framing the challenges in this area: 1) What are the compute, memory, and connectivity requirements for AI workloads? 2) What novel heterogeneous integration technologies are being developed to deliver continued gains in AI system performance? 3) What is the path forward to deploy these heterogeneous integration technologies to address these challenges? We conclude by describing the requirements of the heterogeneous integration platform needed to enable an upward trajectory for AI system performance.

Background
To understand the compute, memory, and connectivity requirements for AI workloads, we start with some history. The revolution in AI computing has been the product of three factors. First, voluminous amounts of data are being collected across many domains (social media and data from sensors are two example sources), and AI algorithms are adept at analyzing this unstructured data. Second, following many decades of little progress, there have been many recent advances in AI algorithms to gain insights into this data, attaining human-level accuracy at tasks such as speech recognition. Finally, there has been spectacular growth in computing capability fueled by scaling, which has led to the widespread availability of computing resources facilitated by the cloud. This third factor is the one on which we will focus, because its continued success is critical to sustaining the unrelenting growth of AI, making clear the need for new paradigms to address the diminishing benefits of scaling.

Before discussing the potential benefits of heterogeneous integration, we first summarize some fundamentals of AI [2]. The engine of AI computing is the deep neural network (DNN). As shown in Figure 1, a DNN consists of a series of layers that transform an input (e.g., the pixels of an image) to an output (e.g., the classification of what is in the image) by discovering the most important features (e.g., the distinct characteristics distinguishing a cat from a dog). A large DNN may consist of tens or even hundreds of layers (hence the name “deep”). Each layer consists of a matrix of parameters called weights that transforms the layer’s input into an output, which feeds into the next layer. Typical DNNs have tens of millions of weight parameters, but the number of parameters in very large models (e.g., for language translation) is approaching the trillion level. DNNs have found widespread application in many domains, with speech, language, and vision comprising three of the most common.

Figure 1: Example of a deep neural network (DNN).
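
To make this concrete, the sketch below shows a DNN of this form as a list of weight matrices, with an input propagated layer by layer. It is our illustration rather than code from the article: the layer sizes, the ReLU nonlinearity, and the random weights are all assumptions chosen for brevity.

```python
import numpy as np

# A DNN as a stack of layers: each layer is a weight matrix that
# transforms its input vector into the input of the next layer.
# Layer sizes here are illustrative, not taken from the article.
rng = np.random.default_rng(0)
layer_sizes = [784, 256, 128, 10]       # e.g., image pixels in, class scores out
weights = [rng.standard_normal((n_in, n_out)) * 0.01
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

def forward(x, weights):
    """Propagate one input through every layer of the network."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)      # matrix multiply + ReLU nonlinearity
    return x @ weights[-1]              # final layer produces the output scores

x = rng.standard_normal(784)            # one input example (e.g., an image)
print(forward(x, weights).shape)        # (10,) -- one score per output class
print(sum(w.size for w in weights))     # 234752 weight parameters in this toy net
```

Note that the work in each layer is a matrix multiplication, which, as discussed below, is why that operation dominates DNN computation.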

Training and inference phases of a DNN
Before a DNN can be used, it must first be trained, typically with a very large number of examples, in order to find the weight values. Each training example is sent forward through the network, and the weights of the network are adjusted based on the difference, or error, between the calculated output and the correct output during a subsequent backward pass through the network. Propagation from one layer to the next involves a set of computationally intensive steps, of which matrix multiplication of the layer’s input by the weights of that layer dominates the computation time [3]. This procedure, called backpropagation, has to be repeated many times over the full set of training examples, until the accuracy on a separate verification set saturates, representing the achievable accuracy of the network.
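
A minimal sketch of this backpropagation cycle follows, using a tiny two-layer network, a squared-error measure, and synthetic data as assumed stand-ins; a real training loop would iterate over the full training set and stop when accuracy on the verification set saturates.

```python
import numpy as np

# One backpropagation cycle, repeated: forward pass, error at the
# output, backward pass, weight update. Network size, learning rate,
# and synthetic data are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))     # 64 training examples, 8 features each
Y = rng.standard_normal((64, 1))     # the "correct outputs" for those examples
W1 = rng.standard_normal((8, 16)) * 0.1
W2 = rng.standard_normal((16, 1)) * 0.1
lr = 0.01                            # learning rate (assumed)

for step in range(1000):             # many passes over the training examples
    # Forward pass: matrix multiplications dominate the cost.
    H = np.maximum(X @ W1, 0.0)      # hidden layer with ReLU
    out = H @ W2                     # calculated output
    err = out - Y                    # difference vs. the correct output
    # Backward pass: propagate the error from the output toward the input.
    grad_W2 = H.T @ err / len(X)
    grad_H = err @ W2.T
    grad_H[H <= 0.0] = 0.0           # error does not flow through inactive ReLUs
    grad_W1 = X.T @ grad_H / len(X)
    # Update pass: adjust the weights to shrink the error.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(float(np.mean(err ** 2)))      # training error after the updates
```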

Once the model is trained, it can be deployed in a phase called inference. Unlike the training phase, which involves forward, backward, and update passes through the network for each training example, inference involves only the forward pass. Moreover, it is often possible to downsize the model after training while preserving accuracy, reducing the computational burden even further.
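
The article does not prescribe a particular downsizing technique; pruning and reduced-precision quantization are two common choices. As one hypothetical illustration, the sketch below quantizes trained 32-bit weights to 8-bit integers, cutting their memory and bandwidth footprint by 4x in exchange for a small approximation error.

```python
import numpy as np

# Post-training downsizing via symmetric int8 quantization of the
# weights (an illustrative scheme, not one named in the article).
rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal((256, 256)).astype(np.float32)  # "trained" weights

scale = np.abs(w_fp32).max() / 127.0             # map the weight range onto int8
w_int8 = np.round(w_fp32 / scale).astype(np.int8)
w_restored = w_int8.astype(np.float32) * scale   # dequantize for comparison

print(w_fp32.nbytes, "->", w_int8.nbytes)        # 262144 -> 65536 bytes (4x smaller)
print(float(np.abs(w_fp32 - w_restored).max()))  # worst-case quantization error
```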

Computational requirements
We now discuss the compute, memory, and connectivity requirements for training and inference. Several key drivers have emerged to accelerate AI workload computation [3]. First, specialized accelerators have architectures designed to speed up matrix multiplication, the dominant operation in DNN computation. Second, unlike many other workloads, the memory access patterns and order of instructions are completely deterministic (set by the DNN), so that specialized accelerators with dataflow architectures can achieve very high compute utilization. Rather
