
Learning to Optimize Tensor Programs

Tianqi Chen1 Lianmin Zheng2 Eddie Yan1 Ziheng Jiang1 Thierry Moreau1

Luis Ceze1 Carlos Guestrin1 Arvind Krishnamurthy1

1Paul G. Allen School of Computer Science & Engineering, University of Washington   2Shanghai Jiao Tong University

Abstract

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, current systems rely on manually optimized libraries, e.g., cuDNN, that support only a narrow range of server-class GPUs. Such reliance limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search using effective model transfer across workloads. Experimental results show that our framework delivers performance that is competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs.

1 Introduction

Deep learning (DL) has become ubiquitous in our daily lives. DL models can now recognize images [23], understand natural language [38], play games [27], and automate system decisions (e.g., device placement [26] and indexing [21]). Tensor operators, such as matrix multiplication and high dimensional convolution, are basic building blocks of DL models. Scalable learning systems [1, 4, 8, 2] rely on manually optimized, high-performance tensor operation libraries, such as cuDNN, that are optimized for a narrow range of hardware devices. To optimize a tensor operator, programmers must choose from many implementations that are logically equivalent but differ dramatically in performance due to differences in threading, memory reuse, pipelining, and other hardware factors. Supporting diverse hardware back-ends therefore requires tremendous engineering effort. Even on currently supported hardware, developing DL frameworks and models is limited by the set of optimized operators in libraries, preventing optimizations (such as operator fusion) that can produce unsupported operators.

This research explores the following question: can we use learning to alleviate this engineering burden and automatically optimize tensor operator programs for a given hardware platform? Our affirmative answer is based on statistical cost models we built that predict program run time using a given low-level program. These cost models, which guide our exploration of the space of possible programs, use transferable representations that generalize across different workloads to accelerate search. We make the following contributions:

• We formalize the new problem of learning to optimize tensor programs and summarize its key characteristics.

• We propose a machine learning framework to solve this problem.

• We further accelerate the optimization by 2× to 10× using transfer learning.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.


compute expression e:

    A = t.placeholder((1024, 1024))
    B = t.placeholder((1024, 1024))
    k = t.reduce_axis((0, 1024))
    C = t.compute((1024, 1024), lambda y, x: t.sum(A[k, y] * B[k, x], axis=k))

default code x0:

    for y in range(1024):
      for x in range(1024):
        C[y][x] = 0
        for k in range(1024):
          C[y][x] += A[k][y] * B[k][x]

schedule s1 (loop tiling):

    yo, xo, yi, xi = s[C].tile(y, x, ty, tx)
    s[C].reorder(yo, xo, k, yi, xi)

x1 = g(e, s1):

    for yo in range(1024 / ty):
      for xo in range(1024 / tx):
        C[yo*ty:yo*ty+ty][xo*tx:xo*tx+tx] = 0
        for k in range(1024):
          for yi in range(ty):
            for xi in range(tx):
              C[yo*ty+yi][xo*tx+xi] += A[k][yo*ty+yi] * B[k][xo*tx+xi]

schedule s2 (tiling, map to micro-kernel intrinsics):

    yo, xo, ko, yi, xi, ki = s[C].tile(y, x, k, 8, 8, 8)
    s[C].tensorize(yi, intrin.gemm8x8)

x2 = g(e, s2):

    for yo in range(128):
      for xo in range(128):
        intrin.fill_zero(C[yo*8:yo*8+8][xo*8:xo*8+8])
        for ko in range(128):
          intrin.fused_gemm8x8_add(
            C[yo*8:yo*8+8][xo*8:xo*8+8],
            A[ko*8:ko*8+8][yo*8:yo*8+8],
            B[ko*8:ko*8+8][xo*8:xo*8+8])

Figure 1: Sample problem. For a given tensor operator specification (Cij = ∑k AkiBkj), there are multiple possible low-level program implementations, each with different choices of loop order, tiling size, and other options. Each choice creates a logically equivalent program with different performance. Our problem is to explore the space of programs to find the fastest implementation.

We provide a detailed empirical analysis of component design choices in this framework. Experimental results on real-world DL workloads show that our framework yields end-to-end performance improvements ranging from 1.2× to 3.8× over existing frameworks.

2 Problem Formalization

We begin by walking through the motivating example in Figure 1. To enable automatic code generation, we specify tensor operators using index expressions (e.g., Cij = ∑k AkiBkj). Let E denote the space of index expressions. The index expression leaves many low-level implementation details, such as loop order, memory scope, and threading, unspecified. As a result, we can generate multiple variants of low-level code that are logically equivalent to the expression for a given e ∈ E. We use Se to denote the space of possible transformations (schedules) from e to low-level code. For an s ∈ Se, let x = g(e, s) be the generated low-level code. Here, g represents a compiler framework that generates low-level code from e, s. We are interested in minimizing f(x), which is the real run time cost on the hardware. Importantly, we do not know an analytical expression for f(x) but can query it by running experiments on the hardware. For a given tuple of (g, e, Se, f), our problem can be formalized as the following objective:

    argmin_{s ∈ Se} f(g(e, s))    (1)
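To make objective (1) concrete, the following minimal Python sketch treats schedule search as black-box minimization over Se; the helpers compile_program (playing the role of g) and measure_runtime (playing the role of f) are hypothetical stand-ins, not part of the framework's API:

    # A minimal sketch of objective (1) as exhaustive black-box search.
    # compile_program and measure_runtime are hypothetical stand-ins
    # for g(e, s) and f(x), respectively.
    def find_best_schedule(e, schedule_space, compile_program, measure_runtime):
        best_s, best_cost = None, float("inf")
        for s in schedule_space:          # infeasible in practice: |Se| is huge
            x = compile_program(e, s)     # x = g(e, s), generated low-level code
            cost = measure_runtime(x)     # f(x), measured on real hardware
            if cost < best_cost:
                best_s, best_cost = s, cost
        return best_s

Exhaustive enumeration like this is infeasible when Se contains billions of variants, which is what motivates the learned cost model of Section 3.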

This problem formalization is similar to that of traditional hyper-parameter optimization problems [34, 33, 35, 13, 17, 25] but with several key differentiating characteristics:

Relatively Low Experiment Cost. Traditionally, hyper-parameter optimization problems incur a high cost to query f, viz., running experiments could take hours or days. However, the cost of compiling and running a tensor program is a few seconds. This property requires that model training and inference be fast; otherwise, there is no benefit over profiling execution on real hardware. It also means that we can collect more training data during optimization.

Domain-Specific Problem Structure. Most existing hyper-parameter optimization algorithms treat the problem as a black box. As we optimize programs, we can leverage their rich structures to build effective models.

Large Quantity of Similar Operators. An end-to-end DL system must optimize tensor operator programs for different input sizes, shapes, and data layout configurations. These tasks are similar and can offer opportunities for transfer learning.

We describe two key prerequisites for automatic code generation that is competitive with hand-optimized code. (1) We need to define an exhaustive search space Se that covers all hardware-aware optimizations in hand-tuned libraries. (2) We need to efficiently find an optimal schedule in Se. There are many domain-specific languages (DSLs) for code generation [32, 36, 15, 37, 20, 30], each with a different E, Se, and g. Polyhedral models [5, 42, 41] are a popular choice for Se; they model the loop domains as integer linear constraints. An alternative approach, originating from Halide [32], defines a schedule space using a set of transformation primitives. Improving Se is an important research direction that is beyond the scope of this paper; we pick a rich Se and focus on schedule optimization in the rest of the paper.


[Figure 2 diagram: an Expression e and its schedule space Se enter the Schedule Space Exploration Module, which queries the Cost Model f̂(x); selected configurations (e, s) go to the Code Generator x = g(e, s) and run in the Hardware Environment, producing the Objective Function value f(x); this experiment feedback is stored as history data D and used to update the cost model.]

Figure 2: Framework for learning to optimize tensor programs.

We use primitives from an existing code generation framework [9] to form Se. Our search space includes multi-level tiling on each loop axis, loop ordering, shared memory caching for GPUs, and annotations such as unrolling and vectorization. The search space size |Se| can be on the order of billions of possible implementations for a single GPU operator. As we show in Section 6, our choice of Se contains programs competitive with hand-optimized libraries.
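To make the combinatorics concrete, here is a minimal Python sketch (not our implementation; `tile_candidates` and `enumerate_space` are illustrative helpers) that enumerates a toy schedule space of per-axis tile sizes plus unroll/vectorize flags. Real spaces also include loop ordering and GPU-specific caching, which multiplies the count by several more orders of magnitude.

    # Illustrative only: enumerate a tiny schedule space to show how quickly
    # the number of configurations grows. The actual space is built from
    # code generation primitives in [9].
    import itertools

    def tile_candidates(extent):
        """Divisors of a loop extent, used as candidate tile sizes."""
        return [t for t in range(1, extent + 1) if extent % t == 0]

    def enumerate_space(loop_extents, unroll_opts=(0, 1), vec_opts=(0, 1)):
        axes = [tile_candidates(e) for e in loop_extents]  # tiling choices per axis
        for tiles in itertools.product(*axes):
            for unroll in unroll_opts:
                for vec in vec_opts:
                    yield {"tiles": tiles, "unroll": unroll, "vectorize": vec}

    # Even three axes with modest extents yield hundreds of configurations;
    # multi-level tiling on realistic loop nests reaches billions.
    print(sum(1 for _ in enumerate_space([224, 64, 7])))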

3 Learning to Optimize Tensor Programs

We propose a machine learning (ML)-based framework to solve this problem. Figure 2 presents the framework and its modules. We build a statistical cost model f̂(x) to estimate the cost of each low-level program x. An exploration module proposes new schedule configurations to run on the hardware. The run time statistics are collected in a database D = {(e_i, s_i, c_i)}, which can in turn be used to update f̂. We discuss module-specific design choices in the following subsections.

3.1 Statistical Cost Model

The first statistical model we support is based on gradient boosted trees (GBTs) [11]. We extract domain-specific features from a given low-level abstract syntax tree (AST) x. The features include loop structure information (e.g., memory access count and data reuse ratio) and generic annotations (e.g., vectorization, unrolling, thread binding). We use XGBoost [7], which has proven to be a strong feature-based model in past problems. Our second model is a TreeGRU [39], which recursively encodes a low-level AST into an embedding vector. We map the embedding vector to a final predicted cost using a linear layer.

GBT and TreeGRU represent two distinct ML approaches; both are valuable, but they offer different trade-offs. GBT relies on carefully engineered features and makes fast predictions on CPUs. TreeGRU, the deep learning-based approach, is extensible and requires no feature engineering, but it is slower to train and to query. We apply batching to the TreeGRU model and use GPU acceleration to make training and prediction fast enough to be usable in our framework.

3.2 Training Objective Function

We can choose from multiple objective functions to train a statistical cost model for a given collection of data D = {(e_i, s_i, c_i)}. A common choice is the regression loss function

    ∑_i (f̂(x_i) − c_i)²,

which encourages the model to predict cost accurately. On the other hand, since the selection process cares only about the relative order of program run times rather than their absolute values, we can instead use the following rank loss function [6]:

    ∑_{i,j} log(1 + exp(−sign(c_i − c_j) · (f̂(x_i) − f̂(x_j)))).    (2)

We can use the prediction f̂(x) to select the top-performing implementations.
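As a concrete reference, the following NumPy sketch computes the rank loss of Equation (2) for a batch of predictions; `pred` holds f̂(x_i) and `cost` the measured run times c_i (illustrative code, not our implementation).

    import numpy as np

    def rank_loss(pred, cost):
        # pairwise differences f̂(x_i) − f̂(x_j) and sign(c_i − c_j)
        diff = pred[:, None] - pred[None, :]
        sign = np.sign(cost[:, None] - cost[None, :])
        mask = ~np.eye(len(pred), dtype=bool)  # drop i == j pairs
        return np.sum(np.log1p(np.exp(-sign * diff))[mask])

    pred = np.array([0.2, 1.5, 0.9])   # predicted costs
    cost = np.array([1.0, 3.0, 2.0])   # measured run times
    print(rank_loss(pred, cost))

In practice this loss is minimized with gradients for the neural model, or via the pairwise rank objective built into a GBT library such as XGBoost.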

3.3 Exploration Module

The exploration module controls the search loop, which is summarized in Algorithm 1. At each iteration, it must pick a batch of candidate programs based on f̂(x) and query f(x) on real hardware. Because the search space is too large, we cannot simply enumerate all of Se and pick the top-b candidates. Instead, we use simulated annealing [19] with f̂(x) as the energy function.


Algorithm 1: Learning to Optimize Tensor Programs
Input:  Transformation space Se
Output: Selected schedule configuration s*

D ← ∅
while n_trials < max_n_trials do
    // Pick the next promising batch
    Q ← run parallel simulated annealing to collect candidates in Se using energy function f̂
    S ← run greedy submodular optimization to pick a (1 − ε)b-subset from Q by maximizing Equation 3
    S ← S ∪ {randomly sample εb candidates}
    // Run measurement on hardware environment
    for s in S do
        c ← f(g(e, s));  D ← D ∪ {(e, s, c)}
    end
    // Update cost model
    update f̂ using D
    n_trials ← n_trials + b
end
s* ← history best schedule configuration

Specifically, we use a batch of parallel Markov chains to improve the prediction throughput of the statistical cost model. We select the top-performing batch of candidates to run on real hardware, and the collected performance data is used to update f̂. We make the states of the Markov chains persistent across f̂ updates. We also apply an ε-greedy strategy to select εb candidates (e.g., ε = 0.05) at random to ensure exploration.
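A simplified sketch of this loop follows, assuming hypothetical helpers `random_config()` and `mutate(s)` (a random neighbor in Se), a cost model with a `predict` method (f̂), and `measure(s)` (f on real hardware); a real implementation would batch predictions across chains.

    import math
    import random

    def explore(cost_model, measure, n_chains=128, batch=64, eps=0.05,
                n_steps=500, temp=1.0):
        chains = [random_config() for _ in range(n_chains)]   # persistent chain states
        energy = [cost_model.predict(s) for s in chains]      # f̂ as the energy
        for _ in range(n_steps):
            for i in range(n_chains):
                s_new = mutate(chains[i])                     # random move in Se
                e_new = cost_model.predict(s_new)
                # accept downhill moves always, uphill moves with Metropolis probability
                if e_new < energy[i] or random.random() < math.exp((energy[i] - e_new) / temp):
                    chains[i], energy[i] = s_new, e_new
        ranked = sorted(zip(energy, chains), key=lambda p: p[0])
        picked = [s for _, s in ranked[:int((1 - eps) * batch)]]         # exploit
        picked += [random_config() for _ in range(batch - len(picked))]  # ε-greedy explore
        return [(s, measure(s)) for s in picked]              # new entries for D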

Diversity-Aware Exploration. We consider both quality and diversity when selecting b candidates for hardware evaluation. Assume that the schedule configuration s can be decomposed into m components, s = [s_1, s_2, · · ·, s_m]. We maximize the following objective to select the candidate set S from the top λb candidates:

    L(S) = −∑_{s∈S} f̂(g(e, s)) + α ∑_{j=1..m} |∪_{s∈S} {s_j}|.    (3)

The first term encourages us to pick candidates with low predicted run time cost. The second term counts the number of distinct configuration components covered by S. L(S) is a submodular function, so we can apply the greedy algorithm [29, 22] to obtain an approximate solution, as sketched below.
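This is an illustrative sketch of the greedy selection, assuming each candidate is a pair (components, predicted_cost) drawn from the top-λb list; each step picks the candidate with the largest marginal gain of L(S).

    def greedy_diverse_select(cands, b, alpha=1.0):
        # cands: list of (components, predicted_cost) pairs
        m = len(cands[0][0])
        selected, covered = [], [set() for _ in range(m)]
        remaining = list(cands)
        while remaining and len(selected) < b:
            def gain(cand):
                comps, cost = cand
                newly = sum(1 for j, v in enumerate(comps) if v not in covered[j])
                return -cost + alpha * newly       # marginal gain of L(S)
            best = max(remaining, key=gain)
            remaining.remove(best)
            selected.append(best)
            for j, v in enumerate(best[0]):        # update component coverage
                covered[j].add(v)
        return selected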

Uncertainty Estimator. Bayesian optimization methods [34, 33, 35, 17] use acquisition functions other than the mean when an uncertainty estimate of f̂ is available. Typical choices include expected improvement (EI) and upper confidence bound (UCB). We can use bootstrapping to obtain the model's uncertainty estimate and validate the effectiveness of these methods. As we will see in Section 6, considering uncertainty does not improve the search in our problem. However, the choice of acquisition function remains a worthy candidate for further exploration.
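A sketch of the bootstrap procedure we refer to, under an assumed sklearn-style interface (`train_fn(X, y)` returns a model with `predict`); since we minimize cost, the UCB-style acquisition becomes an optimistic lower bound, mean minus κ·std.

    import numpy as np

    def bootstrap_fit(train_fn, X, y, n_models=5, seed=0):
        # train several models on resampled data to approximate model uncertainty
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
            models.append(train_fn(X[idx], y[idx]))
        return models

    def acquisition(models, X, kappa=1.0):
        # optimistic bound on predicted cost; lower scores are tried first
        preds = np.stack([m.predict(X) for m in models])
        return preds.mean(axis=0) - kappa * preds.std(axis=0)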

4 Accelerating Optimization via Transfer Learning

Thus far, we have focused only on learning to optimize a single tensor operator workload. In practice, we need to optimize many tensor operators with different input shapes and data types. In a real-world setting, the system collects historical data D′ from previously seen workloads. We can apply transfer learning to use D′ effectively and speed up the optimization.

The key to transfer learning is to create a transferable representation that is invariant to the source and target domains. We can then share the cost model across domains using the common representation. Different choices of representation offer different levels of invariance.

A common practice in Bayesian optimization methods is to use the configuration s directly as the model's input. However, the search space specification can change for different workloads or when the user specifies a new search space for the same workload. The configuration representation s is therefore not invariant to changes in the search space.


[Figure 3: Possible ways to encode the low-level loop AST. Panels: (a) low-level AST of the example loop nest below; (b) loop context vectors (per-loop touched memory and outer loop length); (c) vanilla TreeGRU; (d) context encoded TreeGRU, which soft-scatters loop embeddings into a final embedding. The example program of panel (a):]

    for y in range(8):
        for x in range(8):
            C[y][x] = 0
            for k in range(8):
                C[y][x] += A[k][y] * B[k][x]

Workload  C1       C2     C3     C4      C5      C6       C7       C8       C9       C10      C11      C12
H, W      224,224  56,56  56,56  56,56   56,56   28,28    28,28    28,28    14,14    14,14    14,14    7,7
IC, OC    3,64     64,64  64,64  64,128  64,128  128,128  128,256  128,256  256,256  256,512  256,512  512,512
K, S      7,2      3,1    1,1    3,2     1,2     3,1      3,2      1,2      3,1      3,2      1,2      3,1

Table 1: Configurations of all conv2d operators in single-batch ResNet-18 inference. H, W denotes height and width, IC input channels, OC output channels, K kernel size, and S stride size.

On the other hand, the low-level loop AST x (Figure 3a) is a shared representation of programs that is invariant to the search space. To leverage this invariance, our cost model f̂(x) takes the low-level loop AST x as input. We still need to encode x into a vector space to perform prediction, and the specific encoding of x can also yield different levels of invariance.

Context Relation Features for GBT. We define context features at each loop level to represent loop characteristics. A simple representation of context features is a vector (e.g., in Figure 3b, each loop has a row of features). Context features are informative but, crucially, cannot generalize across different loop nest patterns; we define context relation features to overcome this issue.

To build context relation features, we treat context vectors as a bag of points and extract features that model relations between feature axes. Formally, let Z be the context feature matrix such that Z_ki corresponds to the i-th feature of loop k. We define a set of log2-spaced constant thresholds β = [β_1, β_2, · · ·, β_m]. The relation feature between features i and j is defined as

    R^(ij)_t = max_{k ∈ {k | Z_kj < β_t}} Z_ki.

This encoding summarizes useful relations, such as loop count vs. touched memory size (related to the memory hierarchy of the access), that affect run time cost. The sketch below makes the construction concrete.
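This is illustrative code only; in particular, the default value of 0 when no loop falls under a threshold is our assumption.

    import numpy as np

    def relation_features(Z, beta):
        # Z[k, i]: i-th context feature of loop k; beta: log2-spaced thresholds
        n_loops, n_feat = Z.shape
        R = np.zeros((n_feat, n_feat, len(beta)))
        for i in range(n_feat):
            for j in range(n_feat):
                for t, b in enumerate(beta):
                    mask = Z[:, j] < b            # loops k with Z_kj < beta_t
                    R[i, j, t] = Z[mask, i].max() if mask.any() else 0.0
        return R.ravel()                          # flat feature vector for the GBT

    Z = np.array([[64.0, 1.0],                    # toy context matrix:
                  [ 8.0, 8.0],                    # rows = loops, columns =
                  [ 8.0, 64.0]])                  # (touched memory, loop length)
    print(relation_features(Z, [2.0, 8.0, 32.0, 128.0]))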

Context Encoded TreeGRU. An invariant representation also exists for the neural model. Figure 3c shows a way to encode the program by learning an embedding vector for each identifier and summarizing the AST with a TreeGRU. This model works well for a single workload. However, the set of loop variables can change across domains, and we have no embeddings for new loop variables. We instead encode each loop variable using the context vector extracted for the GBT and use it to summarize the AST (Figure 3d). We scatter each loop-level embedding h into m vectors using the rule out_i = softmax(W^T h)_i · h. Conceptually, the softmax classifies the loop level into a memory slot of out. We then sum the scattered vectors over all loop levels to obtain the final embedding, as in the sketch below.
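A sketch of the soft-scatter step, assuming loop embeddings H (one row per loop level) and a learned projection W; flattening the slots into the final embedding is our illustrative choice.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def soft_scatter(H, W):
        # H: (n_loops, d) loop-level embeddings; W: (d, m) slot projection
        out = np.zeros((W.shape[1], H.shape[1]))   # m memory slots of size d
        for h in H:
            w = softmax(W.T @ h)                   # soft slot assignment
            out += w[:, None] * h[None, :]         # out_i = softmax(W^T h)_i * h
        return out.ravel()                         # summed over loops, flattened

    rng = np.random.default_rng(0)
    H, W = rng.normal(size=(3, 8)), rng.normal(size=(8, 4))
    print(soft_scatter(H, W).shape)                # (32,)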

Once we have a transferable representation, we can use a simple transfer learning method that combines a global model with an in-domain local model, as follows:

    f̂(x) = f̂^(global)(x) + f̂^(local)(x).    (4)

The global model f̂^(global)(x) is trained on D′ using the invariant representation; it makes effective initial predictions before we have sufficient data to fit f̂^(local)(x).
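One simple way to realize Equation (4) is to fit the local model on the residual the global model leaves on in-domain data (a sketch under assumed sklearn-style `fit`/`predict` conventions; residual fitting is our illustrative choice, not prescribed by the equation).

    import numpy as np

    class TransferCostModel:
        # f̂(x) = f̂_global(x) + f̂_local(x), per Equation (4)
        def __init__(self, global_model, local_model):
            self.g, self.l = global_model, local_model

        def fit_local(self, X, y):
            # local model learns what the pre-trained global model misses
            residual = np.asarray(y) - self.g.predict(X)
            self.l.fit(X, residual)

        def predict(self, X):
            return self.g.predict(X) + self.l.predict(X)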

5 Prior Work

Black-box optimization (auto-tuning) is used in high-performance computing libraries such as ATLAS [43] and FFTW [12]. Alternatively, a hardware-dependent cost model can be built to guide the search [28, 5]. Polyhedral methods [5, 42] use integer linear programming to optimize cost. Tensor Comprehensions [41] combines both approaches, using black-box optimization to choose thread block parameters and polyhedral optimization to generate internal loops. Black-box approaches can require many experiment trials to explore a huge Se. On the other hand, predefined cost models may not be accurate enough to capture the complexity of modern hardware and must be manually redefined for each new hardware target.


[Figure 4: Statistical cost model vs. genetic algorithm (GA) and random search (Random), evaluated on an NVIDIA TITAN X (y-axis: TFLOPS; x-axis: number of trials; panels: C1, C2, C5, C6). 'Number of trials' corresponds to the number of evaluations on real hardware. We also conducted two hardware evaluations per trial in Random ×2 and GA ×2. Both the GBT- and TreeGRU-based models converged faster and achieved better results than the black-box baselines.]

[Figure 5: Rank vs. regression objective functions, evaluated on an NVIDIA TITAN X (panels: C1, C2, C5, C6). The rank-based objective either outperformed or matched the regression-based objective in the presented results.]

Statistical cost models have previously been applied to optimize SAT solvers [17, 18]. We apply this idea to our problem and build a domain-specific cost model that enables effective transfer among workloads. A recent trend is to use deep neural networks to perform program analysis [3, 10]. Our new problem setting and experiment environment can serve as a testbed for unexplored research opportunities in related directions.

6 Experiments

6.1 Component Evaluations

We first evaluated each design choice in the framework. Component evaluations were based on the convolution workloads in ResNet-18 [14] for ImageNet classification (Table 1). Due to space limitations, we show component evaluation results only on representative workloads; the complete set of results is reported in the supplementary material. All methods compared in this subsection were initialized with no historical data; Section 6.2 evaluates the transfer learning setting.

Importance of Statistical Cost Model. Figure 4 compares the performance of the statistical cost models to black-box methods. Both the GBT and TreeGRU models outperformed the black-box methods and found operators that were 2× faster than those found with random search. This result is particularly interesting compared to prior results in hyper-parameter tuning [25], where model-based approaches were shown to work only as well as random search. Our statistical models benefit from domain-specific modeling and help the framework find better configurations.

Choice of Objective Function. Figure 5 compares the two objective functions on both types of models. In most cases, the rank-based objective was slightly better than the regression-based one: ranking may sidestep the potentially harder task of modeling absolute cost values. We chose rank as our default objective.

Impact of Diversity-Aware Exploration. Figure 6 evaluates the impact of the diversity-aware exploration objective. Most of the evaluated workloads showed neither a positive nor a negative effect from diversity-based selection. However, diversity-aware exploration improved C6, which suggests the approach can be useful. We adopted the diversity-aware strategy because it can help, has no meaningful negative impact, and negligibly affects run time.


[Figure 6: Impact of diversity-aware selection with different choices of λ (λ = 1, 2, 4), evaluated on an NVIDIA TITAN X (panels: C1, C2, C5, C6). Diversity-aware selection had no positive or negative impact on most of the evaluated workloads.]

[Figure 7: Impact of uncertainty-aware acquisition functions (expected improvement, upper confidence bound, mean), evaluated on an NVIDIA TITAN X (panels: C1, C2, C5, C6). Uncertainty-aware acquisition functions yielded no improvements in our evaluations.]

[Figure 8: Impact of transfer learning (panels: C1–C6 → C7, C1–C6 → C8, C1–C6 → C9, C1–C6 → Matmul-1024; methods: GBT-Transfer, TreeGRU-Transfer, GBT, TreeGRU). Transfer-based models quickly found better solutions.]

Impact of Uncertainty Estimator. Figure 7 evaluates the usefulness of uncertainty-aware acquisition functions. Uncertainty was measured by training five models using bootstrapping. We used the regression objective in this setting, matching its use in most Bayesian optimization methods. The results show that uncertainty estimation was not as important in our problem, possibly because our models were trained on more samples than traditional hyper-parameter optimization problems.

6.2 Transfer Learning Evaluations

The evaluations presented so far used no historical data. This subsection evaluates the improvements obtainable with transfer learning.

Improvements by Transfer. We first evaluated the general improvements made possible by transfer learning. We randomly picked samples from D′ collected from C1, C2, C3, C4, C5, C6 and used them to form the source domain (30,000 samples in the TITAN X experiment and 20,000 samples in the ARM GPU and ARM A53 experiments). We then compared the performance of transfer-enabled methods to learning from scratch on target workloads C7, C8, C9. Results are shown in Figure 8. Overall, transfer learning yielded a 2× to 10× speedup. This approach is especially important for real DL compilation systems, which continuously optimize incoming workloads.

Invariant Representation and Domain Distance. As discussed in Section 4, different representations have different levels of invariance. We used three scenarios to study the relationship between domain distance and the invariance of feature representations: (1) running optimizations on only one target domain; (2) C1–C6 → C7: C1–C6 as the source domain and C7 as the target domain (transfer within the same operator type); and (3) C1–C6 → Matmul-1024: C1–C6 as the source domain and matrix multiplication as the target domain (transfer across operator types).


[Figure 9: Comparison of different representations in different transfer domain settings (panels: (a) TITAN X, C7 in domain; (b) TITAN X, C1–C6 → C7; (c) TITAN X, C1–C6 → Matmul-1024; (d) Mali GPU C1–C6 → A53 CPU C7; methods: GBT on configuration s, GBT on flattened loop context x, GBT on context relation R, GBT without transfer). The configuration-based model can be viewed as a typical Bayesian optimization approach (a batched version of SMAC [17]). Models using configuration space features worked well within a domain but were less useful across domains. Flattened AST features worked well when transferring across convolution workloads but were not useful across operator types. The context relation representation enabled effective transfer across operator types.]

[Figure 10: Single-operator performance on the TITAN X and the ARM CPU; additional ARM GPU (Mali) results are provided in the supplementary material. Panels: (a) optimization curves in wall clock time, with cuDNN v7, TensorFlow Lite, and ARM Compute Library v18.03 as the baselines for the TITAN X, ARM A53, and ARM Mali-T860, respectively; (b) NVIDIA TITAN X single op; (c) ARM Cortex-A53 single op. We also included a weight pre-transformed Winograd kernel [24] for 3×3 conv2d (AutoTVM PT). AutoTVM generated programs that were competitive with hardware-specific libraries.]

Results (Figure 9) show the need for more invariance when domains are farther apart. Using our transferable feature representation, our model generalized across different input shapes and operator types. We also ran a preliminary study on transfer from an ARM Mali GPU to an ARM Cortex-A53 (Figure 9d), showing that the proposed representation enabled transfer across devices. Developing an invariant feature representation remains a difficult problem worthy of additional research.

6.3 End-to-End Evaluation

Thus far, our evaluation has focused on specific design choices in our framework. We now turn to the natural follow-up question: can learning to optimize tensor programs improve real-world deep learning systems on diverse hardware targets? We call our framework AutoTVM. We compared our approach to existing DL frameworks backed by highly engineered hardware-specific libraries on diverse hardware back-ends: a server-class GPU, an embedded CPU, and a mobile GPU. Note that AutoTVM performs optimization and code generation with no external operator library.

We first evaluated single-operator optimization against baselines that used hardware-specific libraries. The baselines were: cuDNN v7 for the NVIDIA GPU, TFLite (commit 7558b085) for the Cortex-A53, and the ARM Compute Library (v18.03) for the ARM Mali GPU.


[Figure 11: End-to-end performance across back-ends (panels: (a) NVIDIA TITAN X, (b) ARM Cortex-A53, (c) ARM Mali-T860; baselines include TensorFlow XLA, TensorFlow, MXNet, TensorFlow Lite, and the ARM Compute Library). AutoTVM outperforms the baseline methods. DCGAN and LSTM are not reported on the A53 and Mali because they are not yet supported by the baseline systems.]

We also included TensorComprehensions (commit ef644ba) [41] as an additional baseline for the TITAN X. (According to personal communication [40], TC is not yet intended for compute-bound problems; it nonetheless provides a useful reference baseline.) TensorComprehensions used 2 random seeds × 25 generations × 200 population for each operator, and padding was removed (TC does not yet support padding). The results are shown in Figure 10. AutoTVM generated high-performance tensor programs across different hardware back-ends.

Further, we embedded our framework into an existing DL graph compiler stack and performed end-to-end workload evaluation. We evaluated real-world end-to-end DL inference workloads, including ResNet [14], MobileNet [16], an LSTM language model [44], Deep Q Network (DQN) [27], and Deep Convolutional Generative Adversarial Networks (DCGAN) [31]. Our baselines were: MXNet (v1.1) and TensorFlow (v1.7) for the GPU, TFLite (commit 7558b085) for the Cortex-A53, and the ARM Compute Library (v18.03) for the ARM Mali GPU. Results are summarized in Figure 11. AutoTVM improved end-to-end performance by 1.2× to 3.8×. These improvements came from both tensor program optimization and operator fusion; the latter would be impossible with libraries that support only a limited set of operators.

7 Discussion and Conclusion

We presented AutoTVM: a machine learning-based framework that automatically optimizes the implementation of tensor operators in deep learning systems. Our statistical cost model allows effective model sharing between workloads and speeds up the optimization process via model transfer. The positive experimental results of this new approach show promise for DL deployment. Beyond our solution framework, the specific characteristics of this new problem make it an ideal testbed for innovations in related areas, such as neural program modeling, Bayesian optimization, transfer learning, and reinforcement learning. On the systems side, learning to optimize tensor programs can enable more fused operators, data layouts, and data types across diverse hardware back-ends, all crucial to improving DL systems. Our framework can be found at https://tvm.ai.

Acknowledgement

We would like to thank members of the Sampa, SAMPL, and Systems groups at the Allen School for their feedback on the work and manuscript. This work was supported in part by a Google PhD Fellowship for Tianqi Chen, ONR award #N00014-16-1-2795, NSF under grants CCF-1518703, CNS-1614717, and CCF-1723352, and gifts from Intel (under the CAPA program), Oracle, Huawei, and anonymous sources.

References

[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Amit Agarwal, Eldar Akchurin, Chris Basoglu, Guoguo Chen, Scott Cyphers, Jasha Droppo, Adam Eversole, Brian Guenter, Mark Hillebrand, Ryan Hoens, Xuedong Huang, Zhiheng Huang, Vladimir Ivanov, Alexey Kamenev, Philipp Kranen, Oleksii Kuchaiev, Wolfgang Manousek, Avner May, Bhaskar Mitra, Olivier Nano, Gaizka Navarro, Alexey Orlov, Marko Padmilac, Hari Parthasarathi, Baolin Peng, Alexey Reznichenko, Frank Seide, Michael L. Seltzer, Malcolm Slaney, Andreas Stolcke, Yongqiang Wang, Huaming Wang, Kaisheng Yao, Dong Yu, Yu Zhang, and Geoffrey Zweig. An introduction to computational networks and the computational network toolkit. Technical Report MSR-TR-2014-112, August 2014.

[3] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018.

[4] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[5] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 101–113. ACM, 2008.

[6] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 89–96, New York, NY, USA, 2005. ACM.

[7] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.

[8] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems (LearningSys'15), 2015.

[9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.

[10] Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. CoRR, abs/1802.03691, 2018.

[11] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

[12] M. Frigo and S. G. Johnson. FFTW: an adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 3, pages 1381–1384, May 1998.

[13] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1487–1495. ACM, 2017.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.

[15] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. Futhark: Purely functional GPU-programming with nested parallelism and in-place array updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, pages 556–571, New York, NY, USA, 2017. ACM.

[16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.


[17] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization, LION'05, pages 507–523, Berlin, Heidelberg, 2011. Springer-Verlag.

[18] Frank Hutter, Lin Xu, Holger Hoos, and Kevin Leyton-Brown. Algorithm runtime prediction: Methods and evaluation (extended abstract). In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 4197–4201, 2015.

[19] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[20] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA):77:1–77:29, October 2017.

[21] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. CoRR, abs/1712.01208, 2017.

[22] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.

[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

[24] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4013–4021, 2016.

[25] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR, abs/1603.06560, 2016.

[26] Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2430–2439, 2017.

[27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[28] Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Trans. Graph., 35(4):83:1–83:11, July 2016.

[29] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.

[30] Shoumik Palkar, James J. Thomas, Deepak Narayanan, Anil Shanbhag, Rahul Palamuttam, Holger Pirk, Malte Schwarzkopf, Saman P. Amarasinghe, Samuel Madden, and Matei Zaharia. Weld: Rethinking the interface between data-intensive applications. CoRR, abs/1709.06416, 2017.

[31] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[32] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519–530, New York, NY, USA, 2013. ACM.

[33] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, Jan 2016.

[34] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS'12, pages 2951–2959, USA, 2012.

[35] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 2171–2180, 2015.


[36] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. Lift: A functional data-parallel IR for high-performance GPU code generation. In Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO '17, pages 74–85, Piscataway, NJ, USA, 2017. IEEE Press.

[37] Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Anand R. Atreya, Kunle Olukotun, Tiark Rompf, and Martin Odersky. OptiML: An implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pages 609–616, USA, 2011.

[38] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA, 2014. MIT Press.

[39] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[40] Nicolas Vasilache. Personal communication.

[41] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018.

[42] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim., 9(4):54:1–54:23, January 2013.

[43] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC '98, pages 1–27, Washington, DC, USA, 1998. IEEE Computer Society.

[44] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
