CMU Sinbad's Submission for the DSTC7 AVSD Challenge
Ramon Sanabria*, Shruti Palaskar* and Florian Metze
Language Technologies Institute, Carnegie Mellon University
*equal contribution
Source: workshop.colips.org/dstc7/papers/30_slides.pdf
Task Description
● Generate system responses in a dialog about an input video.
● Dialog systems need to understand scenes.
● Task:
  ○ Visual question answering (VQA).
  ○ Video description.
Investigating Different Visual Features
Visual Features
● Place/Scene Recognition
  ○ http://places2.csail.mit.edu/download.html
  ○ Source: Image. Target: Scene Context.
  ○ Example prediction: "riding arena"
● Object Recognition
  ○ http://www.image-net.org/
  ○ Source: Image. Target: Object.
  ○ Example prediction: "tree"
● Action Features
  ○ Hara et al. 2018
  ○ Source: Video (16 frames). Target: Action.
  ○ Example prediction: "arranging flowers"
DSTC7 Baseline: Model Details (Alamri et al. 2018)
● Sequence-to-sequence model with naive fusion (simple concatenation of the modality features).
● Encoder: 2 layers, 128 units (one 2-layer BiGRU per input stream).
● Attentive decoder: 2 layers, 128 units, producing the answer.
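The baseline's naive fusion is plain feature concatenation. A minimal numpy sketch; the feature sizes below are illustrative placeholders, not the dimensions used in the submission:

```python
import numpy as np

def naive_fusion(features):
    """Naive fusion: concatenate per-modality feature vectors into one input."""
    return np.concatenate(features, axis=-1)

# Hypothetical per-modality posteriors (sizes chosen for illustration only).
place = np.random.rand(365)    # e.g. scene posteriors
obj = np.random.rand(1000)     # e.g. object posteriors
action = np.random.rand(400)   # e.g. action posteriors

fused = naive_fusion([place, obj, action])
print(fused.shape)  # (1765,)
```

The fused vector is then fed to the encoder as a single sequence element; no learned weighting between modalities is involved at this stage.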
Feature Comparison: Baseline Results

| Features | BLEU-4 | ROUGE-L |
|----------|--------|---------|
| Baseline | 0.084  | 0.291   |
| Place    | 0.082  | 0.286   |
| Object   | 0.083  | 0.287   |
| Action   | 0.079  | 0.284   |
| All      | 0.085  | 0.287   |

● Results on the prototype set.
Our Models
Model Details
● Sequence-to-sequence model with attention (Bahdanau et al. 2015).
● Bidirectional GRU encoder, conditional GRU decoder.
● Encoder: 2 layers, 256 units (2-layer BiGRU).
● Decoder: 2 layers, 256 units (attentive decoder producing the answer).
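The decoder's attention is the additive scoring of Bahdanau et al. 2015. A minimal numpy sketch with toy dimensions and random placeholder weights (the actual models use 2-layer, 256-unit GRUs trained with nmtpytorch):

```python
import numpy as np

def bahdanau_attention(query, keys, Wq, Wk, v):
    """Additive attention: score_t = v . tanh(Wq q + Wk k_t).

    query: decoder state, shape (d_dec,)
    keys:  encoder states, shape (T, d_enc)
    Wq, Wk, v: learned projections (random placeholders here)
    Returns (weights over the T encoder states, context vector).
    """
    scores = np.tanh(query @ Wq + keys @ Wk) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over time steps
    context = weights @ keys                      # (d_enc,)
    return weights, context

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
keys = rng.normal(size=(T, d_enc))
query = rng.normal(size=d_dec)
Wq = rng.normal(size=(d_dec, d_att))
Wk = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

weights, context = bahdanau_attention(query, keys, Wq, Wk, v)
# weights sum to 1: a distribution over encoder time steps
```

At each decoding step the context vector is combined with the decoder state to predict the next answer token.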
Text-only Model
● Summary + Question → Text Encoder (2-layer BiGRU) → Attentive Decoder → Answer.
● BLEU-4: 0.105 | ROUGE-L: 0.330
Action-only Model
● ResNeXt action predictions → 2-layer BiGRU encoder → Decoder → Answer.
● BLEU-4: 0.085 | ROUGE-L: 0.294
Hierarchical Attention for Text + Video (Libovicky et al. 2017)
● The Text Encoder and the ResNeXt action predictions feed a Multimodal Decoder that generates over the vocabulary.
● BLEU-4: 0.112 | ROUGE-L: 0.338
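Hierarchical attention (Libovicky et al. 2017) first attends within each modality, then attends a second time over the resulting per-modality context vectors. A simplified numpy sketch; it uses bilinear dot-product scoring as a stand-in for the additive attention actually used, with random placeholder weights and a shared dimensionality across modalities:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, states, W):
    """First level: attention over one modality's encoder states."""
    weights = softmax(states @ W @ query)
    return weights @ states                     # this modality's context

def hierarchical_attention(query, modalities, Ws, Wc):
    """Second level: attention over the per-modality contexts."""
    contexts = np.stack([attend(query, s, W) for s, W in zip(modalities, Ws)])
    beta = softmax(contexts @ Wc @ query)       # weights over modalities
    return beta @ contexts                      # fused multimodal context

rng = np.random.default_rng(1)
d = 6
text_states = rng.normal(size=(7, d))    # text encoder states (toy sizes)
video_states = rng.normal(size=(4, d))   # per-segment action features
query = rng.normal(size=d)
Ws = [rng.normal(size=(d, d)) for _ in range(2)]
Wc = rng.normal(size=(d, d))

fused = hierarchical_attention(query, [text_states, video_states], Ws, Wc)
print(fused.shape)  # (6,)
```

The second-level weights let the decoder shift between text and video evidence per output token, rather than fixing one mixture for the whole answer.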
All Models: Comparison (DSTC)

| Model        | BLEU-4 | ROUGE-L |
|--------------|--------|---------|
| Text-only    | 0.105  | 0.330   |
| Video-only   | 0.085  | 0.294   |
| Text + Video | 0.112  | 0.338   |

● Video-only is competitive with the other models.
● Text + Video improves over Text-only (but not drastically).
Investigating the Use of External Data
External Data: the How2 Dataset (Sanabria et al. 2018)
● Video
● Speech
● English transcript
● Portuguese transcript
● Summary
Model Fine-tuned with How2 Data
[Diagram: the same architecture is trained on How2 tasks (video, audio, transcript, caption, translation, and transcript + video → summary) and fine-tuned on the DSTC task (summary + question, video, audio → answer).]
Summarization with How2 Data (Libovicky et al. 2018)
● Summarization
  ○ Present a subset of the information in a more compact form (possibly across modalities).
● "Description" field
  ○ 2-3 sentences of metadata: template-based, provided by the uploader.
  ○ An "informative", abstractive summary of a how-to video.
  ○ Should generate interest in a potential viewer.
Attention over the Video Features
[Figure: attention over video segments separated by cuts — talking and preparing the brush; close-up of brushstrokes with hand; close-up of brushstrokes without hand; black frames at the end.]
Text-only Model Fine-tuned with How2
● Pre-training task: Transcript → Summary (How2); fine-tuning task: Summary + Question → Answer (DSTC).
● Text Encoder (2-layer BiGRU) → Attentive Decoder → Answer.
● BLEU-4: 0.114 | ROUGE-L: 0.337
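The fine-tuning recipe keeps one set of parameters: first train them on the large external task, then continue training on the target task. A tiny numpy sketch with a linear model and synthetic data standing in for the seq2seq model and the two datasets; the smaller fine-tuning step size is an assumption for illustration:

```python
import numpy as np

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def train(w, X, y, lr=0.1, steps=200):
    """Plain gradient descent on mean-squared error."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X_pre, y_pre = rng.normal(size=(100, 3)), rng.normal(size=100)  # "How2" stand-in
X_ft, y_ft = rng.normal(size=(20, 3)), rng.normal(size=20)      # "DSTC" stand-in

w_pre = train(np.zeros(3), X_pre, y_pre)     # pre-train on the external task
w = train(w_pre, X_ft, y_ft, lr=0.01)        # fine-tune: same weights carry over
```

The fine-tuned weights fit the target data at least as well as the pre-trained weights alone, which is the point of the transfer: the external data provides the initialization, the target data the final adaptation.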
Video-only Model Fine-tuned with How2
● Pre-training task: Video → Summary (How2); fine-tuning task: Video → Answer (DSTC).
● ResNeXt action predictions → Bidirectional RNN → Decoder over the vocabulary.
● BLEU-4: 0.086 | ROUGE-L: 0.300
Text + Video Model Fine-tuned with How2 (Libovicky et al. 2017)
● The Text Encoder and the ResNeXt action predictions feed a Multimodal Decoder that generates over the vocabulary.
● BLEU-4: 0.113 | ROUGE-L: 0.339
All Models: Comparison + Human Ratings

| Model        | BLEU-4 (DSTC) | ROUGE-L (DSTC) | Human (DSTC) | BLEU-4 (+How2) | ROUGE-L (+How2) | Human (+How2) |
|--------------|---------------|----------------|--------------|----------------|-----------------|---------------|
| Text-only    | 0.105         | 0.330          | -            | 0.114          | 0.337           | 3.394         |
| Video-only   | 0.085         | 0.294          | -            | 0.086          | 0.300           | -             |
| Text + Video | 0.112         | 0.338          | 3.491        | 0.113          | 0.339           | 3.459         |

● How2 data improves the Text-only model.
● Video-only is competitive with the other models.
● Text + Video improves over Text-only, but not drastically.
● Our models perform well on human evaluation as well.
Model Outputs

Question: is he talking or reading out loud ?
Answer: no , he is not talking at all .

Question: what 's in the mug ?
Answer: i don 't know , i can 't see the inside .

Question: hello . did someone come to the door ?
Answer: no and it is a window that he is standing in front of .

Question: are they talking in the video ?
Answer: not really no i don 't hear anything
References
● "Attention Strategies for Multi-Source Sequence-to-Sequence Learning", Jindřich Libovický, Jindřich Helcl.
● "Hierarchical Question-Image Co-Attention for Visual Question Answering", Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh.
● "NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems", Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, Loïc Barrault.
● "How2: A Large-scale Dataset for Multimodal Language Understanding", Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze.
● "Multimodal Abstractive Summarization for Open-Domain Videos", Jindřich Libovický, Shruti Palaskar, Spandana Gella, Florian Metze.
Thank you to the organizers!

Data: https://github.com/srvk/how2-dataset (How2 dataset)
Code: https://github.com/lium-lst/nmtpytorch