WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task...

41
WNGT 2020 Efficiency Shared Task Kenneth Heafield, 1 Yusuke Oda, Graham Neubig https://www.aclweb.org/anthology/2020.ngt-1.1 https://sites.google.com/view/wngt20/efficiency-task 1 Corruptly, both organizer and participant.

Transcript of WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task...

Page 1: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

WNGT 2020 Efficiency Shared TaskKenneth Heafield,1 Yusuke Oda, Graham Neubig

https://www.aclweb.org/anthology/2020.ngt-1.1https://sites.google.com/view/wngt20/efficiency-task

1Corruptly, both organizer and participant.

Page 2: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 2

Page 3: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Goal: Efficient Machine Translation

Present task: inference → productionFuture task: efficient training?

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 3

Page 4: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Data Condition

WMT 2019 English–German constrained news task.

State-of-the-art systems submit to the latest WMT=⇒ There is no such thing as state-of-the-art on WMT14!

Also, recycle WMT 2019 systems as teachers.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 4

Page 5: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Awkward timing with WMT

2020 training data not final at start, test set unavailable at end.Root cause: WNGT at ACL, WMT at EMNLP.Coordinate with WMT more?

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 5

Page 6: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Test Set

Last year≈1s to translate =⇒ too smallBanned a team for memorizing known test set

Before deadline1 million sentences≤100 space-separated words/sentenceUnspecified test set hidden in input

After deadlineWMT plus filler: EMEA, Tatoeba, German FederalShuffled, also score parallel filler datahttp://data.statmt.org/heafield/wngt20/test/

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 6

Page 7: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Test Set

Last year≈1s to translate =⇒ too smallBanned a team for memorizing known test set

Before deadline1 million sentences≤100 space-separated words/sentenceUnspecified test set hidden in input

After deadlineWMT plus filler: EMEA, Tatoeba, German FederalShuffled, also score parallel filler datahttp://data.statmt.org/heafield/wngt20/test/

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 7

Page 8: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Test Set

Last year≈1s to translate =⇒ too smallBanned a team for memorizing known test set

Before deadline1 million sentences≤100 space-separated words/sentenceUnspecified test set hidden in input

After deadlineWMT plus filler: EMEA, Tatoeba, German FederalShuffled, also score parallel filler datahttp://data.statmt.org/heafield/wngt20/test/

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 8

Page 9: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Approximately Measuring Quality

Need a surprise evaluation set. WMT20 not ready yet.→ Uh, average old WMT test sets?→ WMT12 has sentences longer than 100 words.→ WMT1*: average sacrebleu of WMT11, WMT13–19

See paper supplement for individual WMT scores.Problem: participants likely tuned on WMT sets.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 9

Page 10: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

BLEU?

“use human evaluation to verify claims in experiments that use metrics suchas BLEU” –Reviewer of my EU project

“BLEU has been surpassed by various other metrics”–Mathur et al, ACL 2020→ Submitted fast Czech systems to WMT20 with Charles University.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grantagreement No 825303.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 10

Page 11: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Hardware

Recent hardware with 8-bit optimization:

GPU NVidia T4g4dn.xlarge on Amazon Web Services $0.526/hr

CPU Intel Xeon Platinum 8275CL (Cascade Lake) dual socketc5.metal on Amazon Web Services $4.08/hrSingle-core and all-core tracks (48 physical cores)

Provided credits for participants to develop with.

Amazon, Intel, and NVidia have contributed to my research.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 11

Page 12: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Three teams

Multiple submission encouraged!GPU CPU 1 core CPU all core

NiuTrans 4 0 1OpenNMT 4 4 4UEdin 4 2 5

UEdin’s CPU submissions had a memory leak → shown with/without fix.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 12

Page 13: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Pareto Comparison

Submissions have varying quality and efficiency.Unclear how much quality loss to tolerate.

Pareto comparison: quality ≥ baseline and efficiency ≥ baseline.

More efficient with same quality. . . or better quality with same efficiency.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 13

Page 14: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Speed

Primary: wall clock time.Words per second based on 15,048,961 untokenized words.

Supplementary data: CPU time.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 14

Page 15: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

GPU speed

28

30

32

34

36

0 5 10 15 20 25

WM

T1*

BLEU

Thousand words per real second

NiuTransOpenNMT

UEdin

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 15

Page 16: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

CPU single core speed

28

30

32

34

36

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

WM

T1*

BLEU

Thousand words per real second

OpenNMTUEdin

UEdin Fix

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 16

Page 17: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

CPU all core speed

28

30

32

34

36

0 20 40 60 80 100 120

WM

T1*

BLEU

Thousand words per real second

NiuTransOpenNMT

UEdin

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 17

Page 18: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Cost

28

30

32

34

36

0 20 40 60 80 100 120 140 160

WM

T1*

BLEU

Million Words per USD

NiuTrans: GPUOpenNMT: GPU

UEdin: GPUCPUCPUCPU

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 18

Page 19: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Disk

Model size: parameters, BPE, shortlists, etc.

Total Docker size: model, part of Ubuntu, codeOpenNMT won Docker with 122–308 MB; others 432–933 MB.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19

Page 20: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Model size, all platforms

28

30

32

34

36

0 50 100 150 200 250 300 350 400 450

WM

T1*

BLEU

Model size (MB)

NiuTransOpenNMT

UEdin

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 20

Page 21: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Peak RAM usage

GPU: polling nvidia-smiCPU: memory.max usage in bytes

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 21

Page 22: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

GPU RAM

28

30

32

34

36

1 2 4

WM

T1*

BLEU

GPU RAM (GB)

NiuTransOpenNMT

UEdin

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 22

Page 23: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

CPU single core RAM

28

30

32

34

36

0.25 0.5 1 2 4 8 16 32 64 128

WM

T1*

BLEU

CPU RAM (GB)

OpenNMTUEdin

UEdin Fix

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 23

Page 24: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

CPU all core RAM

28

30

32

34

36

1 2 4 8 16 32 64

WM

T1*

BLEU

CPU RAM (GB)

NiuTransOpenNMT

UEdinUEdin Fix

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 24

Page 25: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Efficiency Task

All participants had something Pareto optimal.

System descriptions:https://sites.google.com/view/wngt20/programme

I am opening the task for rolling submission.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 25

Page 26: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

What’s missing

Allowed batching in all conditions→ What about latency?

Where are the non-autoregressive people?→ Non-autoregressive: a case study in poor evaluation.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 26

Page 27: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Latency

Latency is average time to translate one sentence.Experiments with Edinburgh’s systems; sorry I asked too late.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 27

Page 28: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Batching is Important for Speed

28

29

30

31

32

33

34

35

36

0 1 2 3 4 5 6 7 8 9 10

WM

T1*

BLEU

Thousand words per second

Batch on GPUBatch on CPU core

Single on GPUSingle on CPU core

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 28

Page 29: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Latency: 10.3–71.7 ms!

28

29

30

31

32

33

34

35

36

0 10 20 30 40 50 60 70 80

WM

T1*

BLEU

Latency (ms)

GPUCPU core

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 29

Page 30: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Autoregressive MT latency is 10.3–71.7 ms, often <30 ms.

So what’s up with this table from Jiatao Gu et al (2018)�

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 30

Page 31: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Replicating Gu et al (2018)’s setup

Do not try this at home or work.WMT14State-of-the-art is latest WMT.Just don’t claim state-of-the-art like Wang et al (2018) did.

Tokenized BLEUTokenization differences =⇒ BLEU scores are not comparable.But many non-autoregressive papers compare anyway.Use sacrebleu instead.

P100, latency on IWSLT 2016 en-de dev.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 31

Page 32: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Replicating Gu et al (2018)’s setup

Do not try this at home or work.WMT14State-of-the-art is latest WMT.Just don’t claim state-of-the-art like Wang et al (2018) did.

Tokenized BLEUTokenization differences =⇒ BLEU scores are not comparable.But many non-autoregressive papers compare anyway.Use sacrebleu instead.

P100, latency on IWSLT 2016 en-de dev.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 32

Page 33: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Real baselines for Gu et al (2018)

16

18

20

22

24

26

28

30

0 100 200 300 400 500 600 700

Weir

dly

Toke

nize

dW

MT1

4BL

EU

Latency (ms) on P100, IWSLT 2016 dev

Gu et al (2018) autoregressiveGu et al (2018) non-autoregressive

Marian 2018

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 33

Page 34: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Real baselines for Gu et al (2018)

16

18

20

22

24

26

28

30

0 100 200 300 400 500 600 700

Weir

dly

Toke

nize

dW

MT1

4BL

EU

Latency (ms) on P100, IWSLT 2016 dev

Gu et al (2018) autoregressiveGu et al (2018) non-autoregressive

Marian 2018Marian 2019

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 34

Page 35: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Research doesn’t have to be state-of-the-art.Just mention stronger baselines, not 60x weaker straws.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 35

Page 36: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Arguably this is what the shared task explores.Here are some easy things you could have done:

1 Model distillation for autoregressive, since it’s used fornon-autoregressive

2 Use 1–2 decoder layers in autoregressive models3 Averaged attention network

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 36

Page 37: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Recommendations

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 37

Page 38: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Use sacrebleu.You can’t compare against a paper that didn’t.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 38

Page 39: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Don’t have to be state-of-the-art.Just cite it or put it in your table.Strawman baselines are misleading.

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 39

Page 40: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Lots of baselines

1 Fewer parameters or layers2 Quantize3 Prune4 Beam size5 Shortlisting6 Simplify architecture7 Model distillation8 Early exit9 Non-autoregressive

Show your method is a better trade-off via Pareto optimality.Don’t trust papers that get X speedup for “small” Y BLEU loss!

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 40

Page 41: WNGT 2020 Efficiency Shared Task · OpenNMT won Docker with 122–308 MB; others 432–933 MB. Task Definition Efficiency Results Latency Non-autoregressive Recommendation 19. Model

Conclusion

Currently no evidence that non-autoregressive is competitive.

We’re implementing it in Marian.

WNGT 2020 efficiency task is rolling, send me dockers!

Task Definition Efficiency Results Latency Non-autoregressive Recommendation 41