SIRIUS IMPLICATIONS FOR FUTURE WAREHOUSE-SCALE COMPUTERS
Demand is expected to grow significantly for cloud services that deliver sophisticated artificial intelligence on the critical path of user queries, as is the case with intelligent personal assistants such as Apple's Siri. If the prediction of the trend is correct, these types of applications will likely consume most of the world's computing cycles. The Sirius project was motivated to investigate what this future might look like and how cloud architectures should evolve to achieve it.
"...Ultimately, that's why [Clarity Lab] is running [the] Sirius project. The Apples and Googles and ... Microsoft know how this new breed of service operates, but the rest of the world doesn't. And they need to."1
In this article, we discuss Sirius, an open end-to-end intelligent personal assistant (IPA) application, modeled after popular IPA services such as Apple's Siri. Sirius leverages well-established open infrastructures for speech recognition, image recognition, and question-answering systems. We use Sirius to investigate the performance, power, and cost implications of hardware accelerator-based server architectures for future datacenter designs. Among the popular acceleration options, including GPUs, Intel Phi, and field-programmable gate arrays (FPGAs), the FPGA-accelerated server is the best server option for a homogeneous datacenter design when the design objective is to minimize latency or maximize energy efficiency with a latency constraint. The FPGA achieves an average 16 times reduction in query latency across various query types over the baseline multicore system. GPUs provide the highest total cost of ownership (TCO) reduction on average. GPU-accelerated servers can achieve an average 10 times query-latency reduction, translating to a 2.6 times TCO reduction (while FPGA-accelerated servers achieve a 1.4 times TCO reduction). When excluding FPGAs as an acceleration option, GPUs provide the best latency and cost reduction among the remaining accelerator choices. Replacing FPGAs with GPUs leads to a 66 percent longer latency, but in return achieves a 47 percent TCO reduction.
Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ronald G. Dreslinski, and Trevor Mudge, University of Michigan; Vinicius Petrucci, Federal University of Bahia; Lingjia Tang and Jason Mars, University of Michigan

Published by the IEEE Computer Society. 0272-1732/16/$33.00 © 2016 IEEE

Motivation

Siri, Google's Google Now, and Microsoft's Cortana represent a class of emerging Web service applications known as IPAs. An IPA is an application that uses inputs such as the user's voice, vision (images), and contextual information to provide assistance by answering questions in natural language, making recommendations, and performing actions. IPAs are emerging as one of the fastest-growing Internet services; they have recently been deployed on well-known platforms such as iOS, Android, and Windows Phone, making them ubiquitous on mobile devices worldwide. In addition, the usage scenarios for IPAs are rapidly increasing, with recent offerings in wearable technologies such as smart watches and glasses. Recent projections predict that the wearables market will reach 485 million annual device shipments by 2018.
In this article, we present Sirius, the first open source voice and vision IPA. Prior to the release of Sirius, large companies such as Apple, Google, Microsoft, and Amazon had a monopoly on the intelligent assistant application space. But without access to a representative open intelligent assistant application, researchers cannot investigate the system architecture implications of this type of workload. Sirius is the first effort to serve this purpose. The Sirius study provides insights on the landscape of current accelerator hardware in datacenters for a future in which demand for intelligent assistants grows radically.
Specifically, our study found that IPAs differ from many Web service workloads present in modern warehouse-scale computers (WSCs). In contrast to the queries of traditional browser-centric services, IPA queries stream through software components that leverage recent advances in speech recognition, natural language processing (NLP), and computer vision to provide users with a speech- and/or image-driven, contextually based question-and-answer system. Owing to the computational intensity of these components and the large data-driven models they use, service providers house the required computation in massive datacenter platforms in lieu of performing the computation on the mobile devices themselves. This offloading approach is used by both Siri and Google Now, as they send compressed recordings of voice commands and queries to datacenters for speech recognition and semantic extraction.2 However, datacenters have been designed and tuned for traditional Web services, and questions arise as to whether the current design employed by modern datacenters, composed of general-purpose servers, is suitable for emerging IPA workloads.
In particular, IPA queries require a significant amount of computational resources compared to traditional text-based Web services such as search. The computational resources required for a single leaf query, for example, are in excess of 100 times those of traditional Web search. Figure 1 illustrates the scaling of computational resources in a modern datacenter required to sustain an equivalent throughput of IPA queries. Because of the looming scalability gap shown in the figure, both academia and industry have expressed significant interest in leveraging hardware acceleration in datacenters using various platforms, such as GPUs, many-core coprocessors, and FPGAs, to achieve high performance and energy efficiency. To gain further insight on whether there are sufficient acceleration opportunities for IPA workloads and what the best acceleration platform is, we must address the following challenges:
- Identify critical computational and performance bottlenecks throughout the end-to-end lifetime of an IPA query.
- Understand the performance, energy, and cost tradeoffs among popular accelerator options given the characteristics of IPA workloads.
- Design future server and datacenter solutions that can meet the amount of future user demand while being cost and energy efficient.
However, the lack of a representative, publicly available, end-to-end IPA system is prohibitive for investigating the design space of future accelerator-based server designs for this emerging workload. To address this challenge, we constructed Sirius as an end-to-end, stand-alone IPA service that implements an IPA's core functionalities, such as speech recognition, image matching, NLP, and a question-and-answer system. Sirius takes as input user-dictated speech and images captured by a camera. A voice command primarily exercises speech recognition on the server side to execute a command on the mobile device. A voice query additionally leverages a sophisticated NLP question-and-answer system to produce a natural language response to the user. A voice and image question, such as "When does this restaurant close?" coupled with an image of the restaurant, also leverages image matching with an image database and combines the matching output with the voice query to select the best answer for the user.

Figure 1. Impact of higher computational requirements for intelligent personal assistant (IPA) queries on datacenters. Illustrated is the scaling of computational resources in a modern datacenter required to sustain an equivalent throughput of IPA queries compared to Web search.

MAY/JUNE 2016
Sirius also provides the Sirius Suite, a benchmark suite composed of the seven computational bottlenecks on a query's pathway, which together represent 92 percent of that query's execution time. The Sirius Suite also includes implementations of these bottlenecks for a spectrum of computational substrates, including CPU, GPU, FPGAs, and many-core (Phi), all of which are included in the open source release.3 Sirius was the top-trending open source project on GitHub for the first few weeks after its release. In fact, Sirius and the Sirius Suite have been downloaded tens of thousands of times since their release and are already being used in research papers as well as production projects.
In designing Sirius, we performed investigative research by going to several large companies and talking with key engineers on relevant teams. We asked for insights as to the algorithms and approaches actually used in production, and, using this information, we integrated three services built using well-established open source projects representative of those found in commercial systems. These included Carnegie Mellon University's Sphinx,4 which represented speech recognition based on the widely used Gaussian mixture model (GMM); Kaldi5 and RWTH Aachen University's Speech Recognition System (RASR),6 which represented the industry's recent trend toward speech recognition based on the deep neural network (DNN); OpenEphyra, which represented the state-of-the-art question-and-answer system based on IBM's Watson7; and SURF8 implemented using OpenCV,9 which represented the state-of-the-art image-matching algorithms widely used in various production applications. We used these open source projects to steer the construction of both Sirius and the Sirius Suite.
Sirius: An End-to-End IPA

In this section, we describe the design objectives for Sirius, then present an overview of Sirius and a taxonomy of the query types it supports. Finally, we detail the underlying algorithms and techniques used by Sirius.
Design Objectives

We had three key objectives when designing Sirius. The first objective was completeness: Sirius should provide a complete IPA service that takes the input of human voice and images and provides a response to the user's question in natural language. The second objective was representativeness: the computational techniques used by Sirius to provide this response should be representative of state-of-the-art approaches used in commercial domains. The third objective was deployability: Sirius should be deployable and fully functional on real systems.
TOP PICKS

IEEE MICRO

Overview: Life of an IPA Query

Figure 2 presents a high-level diagram of the end-to-end Sirius query pipeline. The life of a query begins with a user's voice and/or image input through a mobile device. The audio is processed by an automatic speech recognition (ASR) front end that translates the user's speech question into text. The text then goes through a query classifier that decides whether the speech is an action or a question. For an action, the command is sent back to the mobile device for execution. Otherwise, the Sirius back end receives the question in plain text. The question-answering service extracts information from the input, searches its database, and chooses the best answer to return to the user. If an image accompanies the speech input, Sirius uses computer-vision techniques to match the input image to its image database and return relevant information about the matched image using the image-matching service. For example, a user can ask, "What time does this restaurant close?" while capturing an image of the restaurant via smart glasses.3 Sirius can then return an answer to the query based not only on the user's speech but also on information from the image. Table A in "Sirius Web Extra" summarizes the queries supported by Sirius (see http://extras.computer.org/extra/mmi2016030042s1.pdf).
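The control flow of this pipeline can be sketched as follows. The stage functions and the keyword-based classifier here are illustrative stand-ins invented for this sketch, not code from the Sirius release; Sirius's real classifier and back-end services are full trained components.

```python
# Sketch of the Sirius query pipeline described above. Each stage below
# is a stand-in: Sirius itself runs full ASR, question-answering, and
# image-matching services at these points.

def automatic_speech_recognition(audio):
    # Stand-in: pretend the audio payload is already a transcript.
    return audio.strip()

def classify_query(text):
    # Stand-in classifier: decide whether the transcript is a device
    # action or a question for the back end.
    action_verbs = ("call", "text", "open", "play", "set")
    return "action" if text.lower().startswith(action_verbs) else "question"

def question_answering(text):
    # Stand-in for the OpenEphyra-based question-answering service.
    return f"answer({text})"

def image_matching(image):
    # Stand-in for the SURF-based image-matching service.
    return f"matched({image})"

def handle_query(audio, image=None):
    """Route a query through ASR, the classifier, and the back end."""
    text = automatic_speech_recognition(audio)
    if classify_query(text) == "action":
        return ("execute_on_device", text)          # sent back to the mobile device
    if image is not None:
        text = f"{text} [{image_matching(image)}]"  # add image context to the question
    return ("answer", question_answering(text))

print(handle_query("set an alarm for 7am"))
print(handle_query("when does this restaurant close", image="storefront.jpg"))
```

The point of the sketch is the routing: a voice command never reaches the back end, while a voice-and-image question exercises both the image-matching and question-answering services.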
Design of Sirius: IPA Services and Algorithmic Components

We leverage open infrastructures that use the same algorithms as commercial applications. Speech recognition in Google Voice has used speaker-independent GMMs and hidden Markov models (HMMs) and is adopting DNNs. The OpenEphyra framework used for question answering is an open source release from Carnegie Mellon University's prior research collaboration with IBM on the Watson system.7 OpenEphyra's NLP techniques, including conditional random fields (CRFs), have been recognized as state of the art and are used at Google and in other industry question-answering systems.10 Our image-matching pipeline design is based on the widely used SURF algorithm. We implement SURF using the open source computer vision (OpenCV9) library, which is employed in commercial products from companies including Google, IBM, and Microsoft.
Automatic Speech Recognition Pipeline

The ASR inputs are feature vectors representing the speech segment. The ASR component relies on a combination of an HMM and either a GMM or a DNN. Sirius's GMM-based ASR uses Sphinx,4 whereas the DNN-based ASR includes Kaldi5 and RASR.6

As Figure 3 shows, the HMM builds a tree of states for the current speech frame using input feature vectors. The GMM or DNN scores the probability of the state transitions in the tree, and the Viterbi algorithm11 then searches for the most likely path based on these scores. The path with the highest probability represents the final translated text output. The GMM scores HMM state transitions by mapping an input feature vector into a multidimensional coordinate system and iteratively scores the features against the trained acoustic model.
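The decoding step can be made concrete with a minimal Viterbi sketch. The two-state model, transition probabilities, and binary "energy" feature below are invented purely for illustration; a production decoder such as Sphinx or Kaldi searches far larger HMM state trees and works entirely in log space for numerical stability, as done here.

```python
# Minimal Viterbi decoding sketch: find the most likely HMM state path
# given per-frame acoustic scores (the role the GMM or DNN plays here).
import math

def viterbi(frames, states, log_trans, log_emit, log_start):
    # best[s] = log probability of the best path ending in state s
    best = {s: log_start[s] + log_emit[s](frames[0]) for s in states}
    back = []
    for frame in frames[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p, arg = max((prev[q] + log_trans[(q, s)], q) for q in states)
            best[s] = p + log_emit[s](frame)
            ptr[s] = arg
        back.append(ptr)
    # Trace the best final state back through the stored pointers.
    state = max(best, key=best.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# Toy two-state model (invented): "sil" emits low-energy frames (0),
# "speech" emits high-energy frames (1), and states tend to persist.
states = ("sil", "speech")
log_start = {"sil": math.log(0.8), "speech": math.log(0.2)}
log_trans = {("sil", "sil"): math.log(0.7), ("sil", "speech"): math.log(0.3),
             ("speech", "sil"): math.log(0.3), ("speech", "speech"): math.log(0.7)}
log_emit = {"sil": lambda f: math.log(0.9 if f == 0 else 0.1),
            "speech": lambda f: math.log(0.9 if f == 1 else 0.1)}

print(viterbi([0, 1, 1, 0], states, log_trans, log_emit, log_start))
# → ['sil', 'speech', 'speech', 'sil']
```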
Figure 2. End-to-end diagram of the Sirius pipeline. Sirius is built of three core components that communicate with one another to service IPA queries.

Image-Matching Pipeline

The image-matching pipeline receives an input image, attempts to match it against images in a preprocessed image database, and returns information about the matched images. Image keypoints are extracted from the input image using the SURF algorithm.8

In feature extraction, the image is downsampled and convolved multiple times to find interesting keypoints at different scales (see Figure 4). The keypoints are passed to the feature descriptor component, where they are assigned an orientation vector, and similarly oriented keypoints are grouped into feature descriptors. The descriptors from the input image are matched to preclustered descriptors representing the database images using an approximate nearest-neighbor search.
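The matching step can be sketched with a brute-force nearest-neighbor search over toy two-dimensional descriptors. Note the simplifications: Sirius uses an approximate nearest-neighbor search over preclustered descriptors (real SURF descriptors are 64- or 128-dimensional), and the distance-ratio test below is a common matching heuristic assumed here for illustration rather than taken from the Sirius design.

```python
# Brute-force descriptor matching sketch. Each descriptor is a plain
# tuple of floats; the database is assumed to hold at least two entries
# so the ratio test has a second-best neighbor to compare against.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(query, database, ratio=0.75):
    """Return (query_index, db_index) pairs that pass the distance-ratio test."""
    matches = []
    for qi, qd in enumerate(query):
        # Rank database descriptors by distance to this query descriptor.
        ranked = sorted(range(len(database)), key=lambda di: euclidean(qd, database[di]))
        best, second = ranked[0], ranked[1]
        # Keep the match only if it is clearly better than the runner-up.
        if euclidean(qd, database[best]) < ratio * euclidean(qd, database[second]):
            matches.append((qi, best))
    return matches

db = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(match_descriptors([(0.1, 0.0), (0.5, 0.5)], db))
# → [(0, 0)]: the second descriptor is equidistant from several database
#   entries, so the ratio test rejects it as ambiguous.
```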
Question-Answering Pipeline

The text output from the ASR is passed to OpenEphyra,7 which uses word stemming, regular expression matching, and part-of-speech tagging. Figure 5 shows a diagram of the OpenEphyra engine incorporating these components, generating Web search queries, and filtering the returned results. The Porter Stemming12 algorithm (stemmer) exposes a word's root by matching and truncating common word endings. OpenEphyra also uses a suite of regular-expression patterns to match common query words. The CRF classifier2 takes a sentence, the position of each word in the sentence, and the label of the current and previous word as input to make predictions on the part of speech for each word of an input query. Filters using the same techniques extract information from the returned documents; the document with the highest overall score after score aggregation is returned.
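The stemmer's suffix-truncation idea can be illustrated with a toy sketch. The suffix list and minimum-stem-length rule below are simplifications invented for this example; the actual Porter algorithm applies several ordered rule phases with conditions on the measure of the remaining stem.

```python
# Simplified suffix-stripping sketch in the spirit of the Porter stemmer
# used by OpenEphyra: expose a word's root by truncating common endings.
# Longest suffixes are listed first so they match before their tails.
SUFFIXES = ("ational", "ization", "fulness", "ement", "tion", "ness",
            "ing", "ed", "ly", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        # Require at least three letters of stem so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "elected" → "elect" mirrors the "-ed" truncation shown in Figure 5.
print([stem(w) for w in ["elected", "running", "presidents", "kindness"]])
# → ['elect', 'runn', 'president', 'kind']
```

The over-stemming of "running" to "runn" shows why the real algorithm needs its extra rule phases (a later Porter step restores or normalizes such stems).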
Figure 3. Automatic speech recognition pipeline. The decoding stage receives speech features and uses either a Gaussian mixture model or a deep neural network in combination with a hidden Markov model to transcribe the speech to text.
Figure 4. Image-matching pipeline. Keypoints are first extracted from the image and are used to build descriptors (groupings of keypoints). These descriptors are used to find similar images in the database.
Real-System Analysis for Sirius

In this section, we present a real-system analysis of Sirius. The experiments throughout this section are performed using an Intel Haswell server.
Scalability Gap

To gain insights on the required resource scaling for IPA queries in modern datacenters, we juxtapose the computational demand of an average Sirius query with that of an average Web search query. We compare the average query latency for both applications on a single core at a very low load.

Figure 6a presents the average latency of both Web search using open source Apache Nutch (http://nutch.apache.org) and Sirius queries. The average Nutch-based Web search query latency is 91 ms on the Haswell-based server. Sirius's query latency is significantly longer, averaging around 15 s across 42 queries spanning our three query classes (voice command, VC; voice query, VQ; and voice image query, VIQ). Based on this significant difference in computational demand, we perform a back-of-the-envelope calculation of how the computational resources (machines) in current datacenters must scale to match the throughput in queries for IPAs and Web search.

Figure 6b presents the number of machines needed to support IPA queries as the number of queries increases. The x-axis shows the ratio between IPA queries and traditional Web search queries. The y-axis shows the ratio of computational resources needed to support IPA queries relative to Web search queries. Current datacenter infrastructures will need to scale their computational resources to 165 times their current size when the number of IPA queries scales to match the number of Web search queries. We refer to this throughput difference as the scalability gap.
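The back-of-the-envelope calculation follows directly from the two measured latencies: if the number of machines scales with per-query compute at equal query throughput, the latency ratio approximates the resource-scaling factor.

```python
# Reproducing the scalability-gap estimate from the latencies reported
# above: machines needed scale with per-query compute, so the ratio of
# average query latencies approximates the required resource scaling.
web_search_latency = 0.091   # seconds per average Nutch query (from the text)
ipa_query_latency = 15.0     # seconds per average Sirius query (from the text)

scaling = ipa_query_latency / web_search_latency
print(f"~{scaling:.0f}x more machines at equal query throughput")  # ~165x
```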
Figure 5. OpenEphyra question-answering pipeline. The QA system uses a combination of natural language processing techniques to generate search queries for the document database and score the returned queries to pick out the best answer.

Cycle Breakdown of Sirius Services

To identify computational bottlenecks, we perform top-down profiling of hot algorithmic components for each service. Figure 7 presents the average cycle breakdown results. We identify the architectural bottlenecks for these hot components to investigate the performance improvement potential for a general-purpose processor. From our analysis, even with all stall cycles removed (that is, perfect branch prediction, infinite cache, and so on), the maximum speedup is bound at about three times. Considering the orders-of-magnitude difference indicated by the scalability gap, further acceleration is needed to bridge the gap.
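The roughly three-times bound follows directly from the cycle breakdown: with all stall cycles removed, only busy cycles remain, so the ideal speedup is 1 / (1 - stall fraction). The two-thirds stall fraction below is assumed for illustration to reproduce the reported bound; it is not a figure from the paper.

```python
# Upper bound on general-purpose speedup when all stall cycles are
# removed: speedup = total_cycles / busy_cycles = 1 / (1 - stall_fraction).
def ideal_speedup(stall_fraction):
    return 1.0 / (1.0 - stall_fraction)

# A stall fraction of about 2/3 (assumed) yields the ~3x bound above.
print(f"{ideal_speedup(2 / 3):.1f}x")  # 3.0x
```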
Accelerating Sirius

In this section, we describe the platforms and methodology used to accelerate the key components of Sirius. We also present and discuss the results of accelerating each of these components across four different accelerator platforms.
Accelerator Platforms

We used a total of four platforms, summarized in Table B online, to accelerate Sirius. Our baseline platform was an Intel Xeon Haswell CPU running single-threaded kernels.
Each accelerator platform has advantages and disadvantages. The multicore CPU offers a high clock frequency and is not limited by branch divergence, but it has the fewest threads available. The GPU is massively parallel, but it is also power hungry, requires a custom ISA, has large data transfer overheads, and offers limited branch-divergence handling. The Intel Phi is many-core with a standard programming model (same ISA), offers optional porting and compiler help, handles branch divergence, and has high bandwidth; on the other hand, it has data transfer overheads and relies on the compiler. Finally, FPGAs can be tailored to implement efficient computation and data layout for the workload, but they also run at a much lower clock frequency, are expensive, and are hard to develop for and maintain through software updates.
Figure 6. Scalability gap and latency. (a) Average latency of Web search using Apache Nutch and Sirius queries. (b) The number of machines needed to support IPA queries as the number of queries increases. (c) Latency across query types.
Figure 7. Cycle breakdown per service. (a) ASR (Sphinx); (b) ASR (RASR); (c) QA (OpenEphyra); (d) IMM (SURF). Across the services, several hot components emerge as good candidates for acceleration.
Sirius Suite: A Collection of IPA Computational Bottlenecks

We extracted Sirius's key computational bottlenecks to construct a suite of benchmarks called the Sirius Suite. The Sirius Suite and its implementations across the described accelerator platforms are available alongside the end-to-end Sirius application.3 We ported existing open source C/C++ implementations available for each algorithmic component to our target platforms. We additionally implemented stand-alone C/C++ benchmarks based on Sirius's source code where none were currently available. For each Sirius Suite benchmark, we built an input set representative of IPA queries. The baseline implementations are summarized in column 3 of Table 1. The table also shows the granularity at which each thread performs the computation on the accelerators.
Porting Methodology

The common porting methodology used across all platforms is to exploit the large amount of data-level parallelism available throughout the processing of a single IPA query.
Multicore CPU

We used the Pthread library to accelerate the kernels on the multicore platform by dividing the size of the data. Each thread is responsible for a range of data over a fixed number of iterations. This approach lets each thread run concurrently and independently, synchronizing only at the end of the execution.
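The partitioning strategy can be sketched as follows, using Python threads purely to show how the input range is divided among workers. The actual Sirius Suite kernels implement this in C/C++ with Pthreads; Python threads would not accelerate CPU-bound work, so this sketch illustrates only the structure, not the speedup.

```python
# Sketch of the data-partitioning strategy described above: each thread
# owns a contiguous slice of the input and writes only its own result
# range, so no locking is needed until the final join.
import threading

def parallel_map(func, data, num_threads=4):
    results = [None] * len(data)

    def worker(start, stop):
        for i in range(start, stop):
            results[i] = func(data[i])

    chunk = (len(data) + num_threads - 1) // num_threads  # ceiling division
    threads = [threading.Thread(target=worker, args=(i, min(i + chunk, len(data))))
               for i in range(0, len(data), chunk)]
    for t in threads:
        t.start()
    for t in threads:      # synchronize only at the end of execution
        t.join()
    return results

print(parallel_map(lambda x: x * x, list(range(10))))  # squares of 0..9
```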
GPU

We used Nvidia's CUDA library to port the Sirius components to the Nvidia GPU. To implement each CUDA kernel, we varied and configured the GPU block and grid sizes to achieve high resource utilization, matching the input data to the best thread layout. We ported additional string manipulation functions not currently supported in CUDA for the stemmer kernel.
Intel Phi

We ported our Pthread versions to the Intel Phi platform, leveraging the target compiler's ability to parallelize the loops on the target platform. To investigate this platform's potential to facilitate ease of programming, we used the standard programming model and custom compiler to extract performance from the platform. As such, the results represent what can be accomplished with minimal programmer effort.
FPGA

We used previously published details of FPGA implementations for several of our Sirius benchmarks in this work. We designed our own FPGA implementations for GMM and stemmer and evaluated them on a Xilinx FPGA. Our full paper offers more details on the FPGA designs.14

Table 1. The Sirius Suite and granularity of parallelism.

Service                            | Benchmark                        | Baseline                                                          | Input set                        | Granularity
Automatic speech recognition (ASR) | Gaussian mixture model (GMM)     | Sphinx4                                                           | Hidden Markov model (HMM) states | For each HMM state
ASR                                | Deep neural network (DNN)        | RASR6                                                             | HMM states                       | For each matrix multiplication
Question answering (QA)            | Porter Stemming (stemmer)        | Porter12                                                          | 4M word list                     | For each individual word
QA                                 | Regular expression (regex)       | Super Light Regular Expression Library (SLRE; http://cesanta.com) | 100 expressions/400 sentences    | For each regex-sentence pair
QA                                 | Conditional random fields (CRFs) | CRFSuite13                                                        | CoNLL-2000 shared task           | For each sentence
Image matching (IMM)               | Feature extraction (FE)          | SURF8                                                             | JPEG image                       | For each image tile
IMM                                | Feature description (FD)         | SURF8                                                             | Vector of keypoints              | For each keypoint
Accelerator Results

Figure 8 presents the performance speedup achieved by the Sirius kernels running on each accelerator platform. For the numbers from prior literature, we scaled the FPGA speedup number to match our FPGA platform based on the fabric usage and area reported in prior work. We used numbers from the literature for kernels (regex and CRF) that were already ported to the GPU and yielded better speedups than our implementations.
ASR

The GMM implementation had the best performance on the GPU (70 times) after optimizations. The FPGA implementation using a single GMM core achieved a speedup of 56 times; when fully utilizing the FPGA fabric, we achieved a 169 times speedup using three GMM cores. RWTH's DNN includes both multithreaded and GPU versions out of the box. RWTH's DNN parallelizes the entire framework (both HMM search and DNN scoring) and achieves good speedup in both cases. In the cases where we used a custom kernel or cited literature, we assumed a 3.7 times speedup for the HMM15 as a reasonable lower bound.
Question Answering

The NLP algorithms as a whole have similar performance across platforms because of the nature of the workload: high input variability with many test statements causes high branch divergence. The FPGA stemmer implementation achieved a 6 times speedup over the baseline with a single core using only 17 percent of the FPGA. Scaling the number of cores to fully utilize the FPGA's resources yielded a 30 times speedup over the baseline.
Image Matching

The image-processing kernels achieved the best speedup on the GPU, which uses heavily optimized OpenCV9 SURF implementations, yielding speedups of 10.5 and 120.5 times for feature extraction and feature description, respectively. The tiled multicore version yields good speedup, but the performance does not scale as well on the Phi because the number of tiles is fixed, which means there is little advantage to having more threads available. The GPU version has better performance because it uses a data layout explicitly optimized for a larger number of threads.
Implications for Future Server Design

We first investigated the end-to-end latency reduction and power efficiency achieved across server configurations for Sirius's services, including ASR, question answering, and image matching.
Latency Improvement

Figure 9 presents the end-to-end query latency across Sirius's services on a single leaf node configured with each accelerator. For question answering, we focused on the NLP components comprising 88 percent of the question-answering cycles, because search has already been well studied.

Our baseline in this figure, CMP, is the latency of the original algorithm implementations of Sirius running on a single core of an Intel Haswell server, described in Table B online. CMP (subquery) is our Pthreaded implementation of each service exploiting parallelism within a single query. This is executed on four cores (eight threads) of the Intel Haswell server. CMP (subquery) in general achieves a 25 percent latency reduction over the baseline. Across all services, the GPU and FPGA significantly reduce the query latency.

Figure 8. Heat map of acceleration results. The heat map presents the results of accelerating each of the seven kernels (x-axis) across the four platforms (y-axis), with the darker color representing higher speedups.
Datacenter Design

Next, we evaluate multiple design choices for datacenters composed of accelerated servers to improve performance (throughput) and reduce the TCO. We also investigate each platform's energy efficiency and throughput improvement, which can be found online in the "Energy Efficiency" and "Throughput Improvement" sections of the web extra.
TCO Analysis

Improving throughput lets us reduce the amount of computing resources (servers) needed to serve a given load. However, this may not necessarily lead to a reduction in a datacenter's TCO. Although reducing the number of machines leads to a reduction in the datacenter construction cost and power/cooling infrastructure cost, we may increase the per-server capital or operational expenditure, either from the additional accelerator purchase cost or from energy cost.
We performed our TCO analysis using the TCO model recently proposed by Google.16 Table D online gives the parameters used in our TCO model. We based the server price and power usage on the following server configuration based on the Open Compute Project: one Intel Xeon E3 1240 V3 3.4-GHz CPU, 32 Gbytes of RAM, and two 4-Tbyte disks.

Figure 10 presents the datacenter TCO with various accelerator options, normalized by the TCO achieved by a datacenter that uses only CMPs. FPGAs and GPUs provide high TCO reduction. We further discuss the TCO results, derive our datacenter designs, and present query-level results online.
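The structure of this tradeoff can be sketched with a toy cost model: accelerators shrink the number of servers needed for a fixed load but raise per-server capital and power cost. Every figure below is an invented placeholder, and the `datacenter_tco` breakdown is a simplification for illustration, not Google's TCO model, whose actual parameters appear in Table D online.

```python
# Toy TCO sketch: total cost = per-server capital and infrastructure
# cost plus a lifetime energy cost proportional to server power draw.
# All numbers are invented placeholders for illustration only.
def datacenter_tco(servers, server_capex, watts_per_server,
                   infra_cost_per_server, energy_cost_per_watt):
    capex = servers * (server_capex + infra_cost_per_server)
    opex = servers * watts_per_server * energy_cost_per_watt
    return capex + opex

# Baseline: 1,000 CMP-only servers for a given load (hypothetical).
baseline = datacenter_tco(1000, 2000, 300, 1500, 8.0)

# Hypothetical accelerated design: 10x throughput per server, but each
# server costs more to buy and burns more power.
accelerated = datacenter_tco(100, 2000 + 3000, 300 + 200, 1500, 8.0)

print(f"TCO reduction: {baseline / accelerated:.1f}x")
```

Even with inflated per-server costs, the order-of-magnitude drop in server count dominates, which is the mechanism behind the GPU and FPGA TCO reductions reported above.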
Overall Latency Reduction Results

Using the derived datacenter design and query-level results, Figure 11 presents the latency reduction of these two accelerated datacenters and shows how homogeneous accelerated datacenters can significantly reduce the scalability gap, from the current 165 times resource scaling shown in Figure 6 down to 16 and 10 times for GPU- and FPGA-accelerated datacenters, respectively.
S ince the release of Sirius, hundreds ofcompanies around the world have down-
loaded the Sirius code base. Large companies
Figure 9. Latency across platforms for each service: ASR with GMM/HMM and DNN/HMM acoustic models (an asterisk marks latencies that include the DNN and HMM combined), QA (QA_Stemmer, QA_CRF, QA_Regex, and QA_Other), and IMM (IMM_FE and IMM_FD), each measured on the CMP, CMP (subquery), GPU, Phi, and FPGA platforms. The speedups translate into latency gains, with the x-axis presenting the end-to-end latency of a single query across the four accelerator platforms.
MAY/JUNE 2016 51
such as Ford, Intel, GE, and Wells Fargo, as well as numerous medium-sized and start-up companies, have demonstrated interest in integrating Sirius into their products. With the release of Sirius, we have opened up the type of technology previously expected only from the big cloud companies. Now any smaller outfit can have a nicely packaged end-to-end solution and a set of tools to design its own IPA. Thus, Sirius represents the democratization of intelligent assistants. Now the technology is in the hands of everyone. MICRO
References
1. C. Metz, "Voice Control Will Force an Overhaul of the Whole Internet," Wired, 24 Mar. 2015; www.wired.com/2015/03/voice-control-will-force-overhaul-whole-internet.
2. J. Lafferty, A. McCallum, and F.C.N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," ACM, 2001.
3. "Sirius: An Open End-to-End Voice and Vision Personal Assistant," 2015; http://sirius.clarity-lab.org.
4. D. Huggins-Daines et al., "Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, 2006; doi:10.1109/ICASSP.2006.1659988.
5. D. Povey et al., "The Kaldi Speech Recognition Toolkit," Proc. IEEE Workshop Automatic Speech Recognition and Understanding, 2011; http://infoscience.epfl.ch/record/192584.
6. D. Rybach et al., "RASR—The RWTH Aachen University Open Source Speech Recognition Toolkit," Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2011.
7. D. Ferrucci et al., "Building Watson: An Overview of the DeepQA Project," AI Magazine, vol. 31, no. 3, 2010, pp. 59–79.
8. H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision–ECCV 2006, Springer, 2006, pp. 404–417.
9. G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Media, 2008.
10. J. Dean et al., "Large Scale Distributed Deep Networks," Proc. 25th Conf. Advances in Neural Information Processing Systems, 2012, pp. 1232–1240.
11. G.D. Forney Jr., "The Viterbi Algorithm," Proc. IEEE, vol. 61, no. 3, 1973, pp. 268–278.
12. M.F. Porter, "An Algorithm for Suffix Stripping," Program: Electronic Library and Information Systems, vol. 14, no. 3, 1980, pp. 130–137.
13. N. Okazaki, "CRFSuite: A Fast Implementation of Conditional Random Fields (CRFs)," blog, 2007; www.chokkan.org/software/crfsuite.
Figure 10. Total cost of ownership across platforms (CMP subquery, GPU, Phi, and FPGA) for each service (ASR with GMM/HMM, ASR with DNN/HMM, QA, and IMM). Bars show the benefits of accelerator DCs normalized to the TCO achieved by a DC that uses only CMPs.
Figure 11. Bridging the scalability gap. The significant latency reductions using GPUs and FPGAs in homogeneous DCs help bridge the scalability gap; the figure plots the number of machines needed (log scale) against the fraction of IPA queries in the query mix (QIPA/QWS) for IPA queries without acceleration, with GPUs, and with FPGAs, alongside per-query latencies for web search and for IPA queries on GPU- and FPGA-accelerated servers.
14. J. Hauswald et al., "Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2015, pp. 223–238.
15. J. Chong, E. Gonina, and K. Keutzer, "Efficient Automatic Speech Recognition on the GPU," GPU Computing Gems Emerald Edition, Morgan Kaufmann, 2011.
16. L.A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd ed., Morgan & Claypool, 2013.
Johann Hauswald is a PhD candidate in the Computer Science and Engineering Department at the University of Michigan. He received an MS in computer science and engineering from the University of Michigan. Contact him at [email protected].

Michael A. Laurenzano is a PhD candidate in the Computer Science and Engineering Department at the University of Michigan. He received an MS in computer science and engineering from the University of California, San Diego. Contact him at [email protected].

Yunqi Zhang is a PhD candidate in the Computer Science and Engineering Department at the University of Michigan. He received an MS in computer science and engineering from the University of California, San Diego. Contact him at [email protected].

Cheng Li is a PhD student in the Computer Science and Engineering Department at the University of Illinois, Urbana–Champaign. She received an MS in computer science and engineering from the University of Michigan, where she completed the work for this article. Contact her at [email protected].

Austin Rovinski is an undergraduate student in electrical engineering at the University of Michigan. Contact him at [email protected].

Arjun Khurana is a master's student in the Computer Science and Engineering Department at the University of Michigan. He received a BSE in electrical engineering from the University of Michigan. Contact him at [email protected].

Ronald G. Dreslinski is an assistant professor in the Computer Science and Engineering Department at the University of Michigan. He received a PhD in computer science and engineering from the University of Michigan. Contact him at [email protected].

Trevor Mudge is the Bredt Family Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor. He received a PhD in computer science from the University of Illinois, Urbana–Champaign. Contact him at [email protected].

Vinicius Petrucci is an assistant professor in the Department of Computer Science at the Federal University of Bahia, Brazil. He received a PhD in computer science from the Fluminense Federal University, Brazil. Contact him at [email protected].

Lingjia Tang is an assistant professor in the Computer Science and Engineering Department at the University of Michigan. She received a PhD in computer science from the University of Virginia. Contact her at

Jason Mars is an assistant professor in the Computer Science and Engineering Department at the University of Michigan. He received a PhD in computer science from the University of Virginia. Contact him at