
General Question Answering for the MS MARCO Dataset

TJ Gilbrough, Lane Aasen, and Omar Alhadlaq
{tgilbrou, aaasen, hadlaq}@cs.washington.edu

Abstract

Machine comprehension is an integral piece in the quest for general artificial intelligence. Recently published models have utilized attention mechanisms to achieve promising results on question answering datasets such as SQuAD. In this paper, we attempt to extend these models to the MS MARCO dataset in order to build a general question answering system, capable of tapping into the knowledge held within a wide array of internet documents.

1 Introduction

Traditionally, search engines have focused on referring people to web pages with answers to their questions. More recently, some search engines have begun to directly answer simple, factual questions using curated databases of basic knowledge.

Figure 1: Google directly answers the question 'what is the radius of the earth?'

However, current approaches to question answering are rigid and limited to encyclopedic questions. What if we could create a system that could utilize the entirety of the public internet to answer complex questions spanning the sum of human knowledge? This would be the holy grail of question answering. It would revolutionize search engines, personal assistants, and other artificial intelligence products.

Recent advancements in machine comprehension have made it possible to create a basic version of such a system. In October of 2016, the Stanford Question Answering Dataset (SQuAD) was published, which contains over 100,000 questions based on over 500 Wikipedia articles. When SQuAD was released, it was significantly larger than previous reading comprehension datasets. However, since SQuAD is based on a relatively small and well-curated set of Wikipedia articles, it is a poor choice for training reading comprehension models for the entire public internet.

At NIPS 2016, Microsoft released the MS MARCO (Microsoft MAchine Reading COmprehension) dataset. Questions in the MS MARCO dataset are sampled from real-world Bing queries, and answers are human-generated based on excerpts from Bing search results. This makes MS MARCO the most promising dataset so far for training general purpose question answering models.

Recent research into recurrent neural networks and attention mechanisms has resulted in considerable improvements to machine comprehension models. Models such as BiDAF (Seo et al., 2016) and Dynamic Coattention Networks (Xiong et al., 2016) have achieved impressive results on the SQuAD dataset and pushed the state of the art forward. In this paper, we hope to adapt these and other models to the more complex MS MARCO dataset, with the goal of creating a truly general purpose question answering system with access to the entirety of human knowledge.

2 MS MARCO Dataset

To create the MS MARCO dataset, people were shown real Bing search queries along with passages extracted from the top Bing search results. They were then asked to select the relevant passages and use them to answer the question.

The MS MARCO dataset contains 102,023 question-answer pairs with a train-validation-test split of approximately 80-10-10. There are between 1 and 12 passages for each question, with an average of 8.2. Passages are between 19 and 1,167 characters in length, with an average of 422. There are an average of 1.07 'selected' passages, passages marked as the source of the answer, for each question. Virtually all of the dataset is in English.

Unlike previous reading comprehension datasets like SQuAD, the answers in MS MARCO may not come directly from the context passages. Only about 60% of answers are contained in the context passages verbatim. This presents a unique and novel challenge, since models must have some generative capabilities in order to achieve high performance on the MS MARCO dataset.

Questions are separated into five types based on the form of their answer. These labels are provided for all questions, including those in the test dataset.

• Description (54.6%)

– e.g. What is vitamin A used for?

• Numeric (27.6%)

– e.g. How long does it take for a pulse of laser light to reach the moon?

• Entity (10.4%)

– e.g. What is a hummingbird moth?

• Location (4.9%)

– e.g. Where do Roseate Spoonbills live?

• Person (2.5%)

– e.g. Who first invented the anemometer?

Of these question types, description questions are both the most common and the most complex. Description answers commonly synthesize information from multiple passages and are only taken verbatim from the context passages 54% of the time, versus 68% for non-description answers.

Answers to description questions can also have much more variability than those to other types of questions. For example, there are thousands of correct answers to the question 'What is vitamin A used for?', which makes training a model for MS MARCO even more complex. This variability in human answers explains why humans can only achieve Bleu-1 scores of 46/100 and Rouge-L scores of 47/100 on the validation set, which is quite low.

3 Methods

We viewed this project as having two main modules: the answer extractor and the passage relevance model. The answer extractor is given a passage and a query and is expected to find the answer to that query within the given passage. The passage relevance model ranks passages according to how relevant they are to the given query.

The output of the answer extractor is simply the predicted start and end index of the answer in the text. Along with these indexes, we can also extract the probabilities of those indexes by taking a softmax of the logits produced by the model.

To pick the final answer for the query, we pass each of the passages through the answer extractor. From this we get the probability of the start and end indexes of the answer within each individual passage. Define P(s_p = i) and P(e_p = j) as the probability that the answer starts at index i and ends at index j, respectively, within passage p. We also pass each passage to the passage relevance model to get its relevance weight, w_p. To pick the best answer, we pick the passage and indexes with the following formula:

$$\arg\max_{p,i,j} \left( w_p \cdot P(s_p = i) \cdot P(e_p = j) \right)$$
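As a concrete illustration, the sketch below (our own simplification, not the exact production code; the function and variable names are hypothetical) converts per-passage start and end logits into probabilities with a softmax and then applies the arg max above:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def select_answer(passages, start_logits, end_logits, relevance_weights):
    # passages: list of token lists; start_logits / end_logits: one array per
    # passage from the answer extractor; relevance_weights: w_p from the
    # passage relevance model.
    best_score, best_span = -1.0, None
    for p, (s, e, w_p) in enumerate(zip(start_logits, end_logits, relevance_weights)):
        p_start, p_end = softmax(np.asarray(s)), softmax(np.asarray(e))
        i, j = int(p_start.argmax()), int(p_end.argmax())
        score = w_p * p_start[i] * p_end[j]        # w_p * P(s_p = i) * P(e_p = j)
        if score > best_score:
            best_score, best_span = score, (p, i, j)
    p, i, j = best_span
    # If j < i the slice is empty, i.e. no answer is produced (see section 4.4.4).
    return " ".join(passages[p][i:j + 1])
```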

3.1 Training

The only training in the model takes place within the answer extractor. Therefore, we trained the answer extractor separately on queries with single passages before using it in our multi-passage model. Since many of the answers to the queries cannot be found verbatim within the passages, we had to filter the training data. To do this, we selected only samples whose answer was contained verbatim within a 'selected' passage. A selected passage is one with a 1 in its is_selected field within the dataset. The reason the passage must have been selected is that we needed to ensure that the answer is used in the correct context. For example, if the query is 'where is the Space Needle?' and the true answer is 'Seattle', we would not want to train the model on a passage that happened to mention Seattle but had no reference to the Space Needle. It would only confuse the model.
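A minimal sketch of this filtering step, assuming the field names from the MS MARCO JSON release ('query', 'answers', 'passages', 'passage_text', 'is_selected') and a simple case-insensitive substring match:

```python
def build_training_set(samples):
    # Keep only (query, passage, answer span) triples where the answer appears
    # verbatim inside a passage marked is_selected == 1.
    filtered = []
    for sample in samples:
        answer = sample["answers"][0].strip() if sample["answers"] else ""
        if not answer:
            continue
        for passage in sample["passages"]:
            text = passage["passage_text"]
            if passage["is_selected"] == 1 and answer.lower() in text.lower():
                start = text.lower().index(answer.lower())
                filtered.append({
                    "query": sample["query"],
                    "passage": text,
                    "answer_start": start,
                    "answer_end": start + len(answer),
                })
    return filtered
```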

In addition to filtering the training data, we decided to train five separate models for the five different query types. Each of the query types has distinctive patterns of where the answer falls in the passage, and by training five separate answer extractors, the model can exploit these patterns. One disadvantage of this approach is that we are further reducing the training data that each answer extractor uses. Since we were limited in our resources and time, this was a trade-off we were willing to take.

3.2 Word Embeddings and Unknown Tokens

The first step in each of the answer extractor models is to convert the tokenized text into a series of word-level embeddings. To reduce the complexity of the model, we utilized pretrained GloVe embeddings (Pennington et al., 2014). To experiment with different word embedding sizes, we primarily stuck with the Wikipedia + Gigaword word embeddings, which come in sizes 50, 100, 200, and 300. The vocabulary size for each of these sets of embeddings is 400,000.
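For reference, loading the pretrained vectors amounts to parsing the whitespace-separated GloVe text files into a lookup table. The sketch below is a generic loader (the file name glove.6B.100d.txt and the zero-vector fallback are illustrative; in our model unknown words instead map to the trainable class vectors described below):

```python
import numpy as np

def load_glove(path, dim):
    # Parse a GloVe text file (e.g. glove.6B.100d.txt) into {word: vector}.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vec = np.asarray(parts[1:], dtype=np.float32)
            if vec.shape[0] == dim:          # skip malformed lines, if any
                vectors[parts[0]] = vec
    return vectors

def embed_tokens(tokens, vectors, dim):
    # Zero vectors here stand in for the trainable unknown-class vectors.
    return np.stack([vectors.get(t.lower(), np.zeros(dim, dtype=np.float32))
                     for t in tokens])
```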

Query Type     Val Unk %   Test Unk %
Person         1.969       1.926
Location       2.134       1.990
Numeric        2.110       2.099
Entity         1.802       1.803
Description    1.917       1.915

Table 1: Percentage of unknown words in each query type

Even with a large vocabulary, a good portion of the words still do not have corresponding pretrained embeddings. Displayed in table 1 is the percentage of unknown tokens in the multi-passage datasets. Instead of embedding these words as the zero vector, we implemented special, trainable unknown vectors that represent particular unknown classes. The four classes that we settled on, with a classification sketch following the list, were:

• numeric: unknown token that contains a number in it

• punctuation: unknown token consisting entirely of punctuation

• alphabetic: unknown token made up of letters a-z (often rare nouns or misspellings)

• wildcard: unknown token that does not match the above classes
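A rough sketch of how an out-of-vocabulary token can be assigned to one of the four classes (the exact matching rules here are our assumptions, not a specification):

```python
import string

def unknown_class(token):
    # Map an out-of-vocabulary token to one of the four trainable unknown classes.
    if any(ch.isdigit() for ch in token):
        return "numeric"        # contains a number
    if all(ch in string.punctuation for ch in token):
        return "punctuation"    # entirely punctuation
    if token.isalpha() and token.isascii():
        return "alphabetic"     # letters a-z, often rare nouns or misspellings
    return "wildcard"           # everything else
```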

Figure 2: Percentage of unknown classes for each query type

In figure 2 we see the distribution of unknown tokens of the different query types across our defined unknown classes. In all but the numeric dataset, the majority of the unknown tokens fell into the wildcard class. This means that most of the unknown tokens are some combination of alphanumeric characters with some sort of punctuation, or even foreign language or unicode characters. If given more time, other classes we would have liked to experiment with include capitalized words, foreign language words, numerical tokens with units, years, zip codes, etc.

3.3 Answer-Extractor Models

We found that the answer extractor model is relatively well defined, with many papers written on the subject. In fact, the Stanford Question Answering Dataset (Rajpurkar et al., 2016) leaderboard provided many of these papers. The difference between the SQuAD and MS MARCO datasets is that for SQuAD, each sample contains multiple queries given a single passage, while with MS MARCO, each sample contains multiple passages given a single query. We therefore based our advanced answer extractor models on previous work for the SQuAD dataset.

3.3.1 Baseline

For our baseline model, we started with a very basic design. The model has two bi-directional GRU RNNs. The first RNN takes a matrix of the word embeddings of the question, Q ∈ R^{d×J}, where d is the size of the word embeddings and J is the maximum question length. The second RNN takes the word embeddings of the context, X ∈ R^{d×T}, where T is the maximum context length. The output of each RNN is the concatenation of its backward and forward hidden states, hence we obtain two matrices: U ∈ R^{2d×J} from the question RNN, and H ∈ R^{2d×T} from the context RNN. U is a matrix representing a vector for each word in the question, such that U = [q_1 ... q_J]. The question representation we later consider is an average of the word vectors output by the question RNN:

$$\bar{q} = \frac{1}{J} \sum_{i=1}^{J} q_i$$

The question representation q̄ and the output of the context RNN H are concatenated and fed into a third bi-directional dynamic GRU RNN. Then, we have a dense output layer that takes the output of the third RNN. From the output layer we get the logits for the starting and ending indices of the answer, from which we take the arg max. All three RNNs use dropout as a regularization technique to reduce overfitting during training.
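The following is a minimal sketch of this baseline in PyTorch (the project itself was built in TensorFlow; the layer sizes, dropout rate, batch-first tensor layout, and the name BaselineExtractor are assumptions for illustration):

```python
import torch
import torch.nn as nn

class BaselineExtractor(nn.Module):
    # Hidden size and dropout rate are assumed values, not the paper's.
    def __init__(self, embed_dim=100, hidden=100, p_drop=0.2):
        super().__init__()
        self.q_rnn = nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.c_rnn = nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.mix_rnn = nn.GRU(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(2 * hidden, 2)        # start and end logits per token

    def forward(self, q_emb, c_emb):
        # q_emb: (batch, J, d) question embeddings, c_emb: (batch, T, d) context embeddings
        U, _ = self.q_rnn(q_emb)                   # (batch, J, 2h)
        H, _ = self.c_rnn(c_emb)                   # (batch, T, 2h)
        q_bar = self.drop(U).mean(dim=1, keepdim=True)   # average question representation
        q_tiled = q_bar.expand(-1, H.size(1), -1)  # repeat q_bar for every context position
        M, _ = self.mix_rnn(torch.cat([self.drop(H), q_tiled], dim=-1))
        logits = self.out(self.drop(M))            # (batch, T, 2)
        return logits[..., 0], logits[..., 1]      # start logits, end logits
```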

3.3.2 Attention

The attention model is very similar to the baseline model. The only difference is in defining the question representation q̄ that we use in the third RNN. Previously, we averaged the word vectors coming from the question RNN. In the attention model, for each context word x_k we compute a question representation q̄_k, which is a weighted average with trained weights:

$$\bar{q}_k = \frac{1}{J} \sum_{i=1}^{J} p_{k,i} \, q_i$$

such that $\sum_i p_{k,i} = 1$. The attention weights are learned with a softmax and a dense layer:

$$p_{k,i} = \mathrm{softmax}\left( W [x_k ; q_i ; x_k \circ q_i] + b \right)$$

where ◦ is the element-wise product and ; is row-wise concatenation.
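A sketch of this attention step in the same PyTorch style (again an illustration: the 1/J factor is kept as written above, and equal hidden sizes for the question and context RNN outputs are assumed):

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    def __init__(self, h):
        super().__init__()
        self.score = nn.Linear(3 * h, 1)   # W[x_k; q_i; x_k ∘ q_i] + b

    def forward(self, X, Q):
        # X: (batch, T, h) context RNN outputs, Q: (batch, J, h) question RNN outputs
        T, J = X.size(1), Q.size(1)
        x = X.unsqueeze(2).expand(-1, -1, J, -1)        # (batch, T, J, h)
        q = Q.unsqueeze(1).expand(-1, T, -1, -1)        # (batch, T, J, h)
        feats = torch.cat([x, q, x * q], dim=-1)        # (batch, T, J, 3h)
        p = torch.softmax(self.score(feats).squeeze(-1), dim=-1)  # p_{k,i}, sums to 1 over i
        q_bar = torch.einsum('btj,bjh->bth', p, Q) / J  # weighted average per context word
        return q_bar
```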

Figure 3: The Baseline and the Attention model

3.3.3 Coattention

The main component of the coattention model is the coattention encoder, an attention mechanism that accounts for the important words in the passage in light of the question, and the important words in the question in light of the passage. Intuitively, this is like reading the question first and then searching the passage for the answer by looking for words that seem particularly relevant. Instead of the dynamic decoder proposed in Dynamic Coattention Networks (Xiong et al., 2016), we used a simple dense layer for each of the start and end outputs. We found that the iterative approach would be ideal in theory, but we could not get it to converge nicely. Further details on the implementation can be found here.
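A condensed sketch of the coattention encoder (following Xiong et al., 2016, but omitting the sentinel vectors and the dynamic decoder, consistent with our use of dense output layers; the tensor shapes and names are assumptions):

```python
import torch
import torch.nn as nn

class CoattentionEncoder(nn.Module):
    def __init__(self, h):
        super().__init__()
        self.fusion = nn.GRU(3 * h, h, bidirectional=True, batch_first=True)

    def forward(self, D, Q):
        # D: (batch, m, h) passage encodings, Q: (batch, n, h) question encodings
        L = torch.bmm(D, Q.transpose(1, 2))           # (batch, m, n) affinity matrix
        A_Q = torch.softmax(L, dim=1)                 # attention over passage words, per question word
        A_D = torch.softmax(L, dim=2)                 # attention over question words, per passage word
        C_Q = torch.bmm(A_Q.transpose(1, 2), D)       # (batch, n, h) passage summaries for the question
        C_D = torch.bmm(A_D, torch.cat([Q, C_Q], 2))  # (batch, m, 2h) coattention context
        U, _ = self.fusion(torch.cat([D, C_D], 2))    # (batch, m, 2h) fused representation
        return U                                      # fed to dense layers for start/end logits
```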

3.3.4 BiDAF

BiDAF is a model designed by the University of Washington and the Allen Institute for Artificial Intelligence (Seo et al., 2016) which performed very well on the SQuAD dataset. The BiDAF model consists of six layers, of which we implemented five: word embedding, phrase embedding, attention, modeling, and output. Given more time, we could also have implemented the character embedding layer. Further details on the implementation can be found here.

3.4 Passage Relevance

With the passage relevance model, there were multiple different directions we could take. Since the passages for each sample all contain very similar tokens, we struggled to find training data and train a model that could pick out just the single passage that is relevant to the query. In addition, since the answer is commonly contained in multiple passages, there are opportunities to exploit this when considering all the passages. Therefore, the best option was to have a model assign a relevance weight to each passage and then rank the passages based on their weights.

3.4.1 TF-IDF

Term frequency-inverse document frequency, or TF-IDF for short, is a numerical statistic that represents how important a word is to a document in a corpus. The statistic counts the number of occurrences of a particular word in the document and then offsets it by how often the word appears in the corpus. We relied on this statistic to determine how relevant a passage is to the given query, compared to the other passages provided.

Define tf(t, d) to be the term frequency of term t in document d. Most of the time this is simply the number of times t appears in d, but we tweaked it to be a binary variable indicating whether the term appears in the document at all. Since every passage is relevant to the query in some way, we just want the ones that contain the key words the query is looking for. Next, define idf(t, D) to be the inverse document frequency of term t in corpus D, calculated as:

$$\mathrm{idf}(t, D) = \log\left( \frac{N}{\mathrm{df}_t} \right)$$

where N is the number of documents in D, and df_t is the number of documents in D that contain t. Finally, the TF-IDF value can be calculated for a term t in document d within corpus D:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

For the model, we compute the TF-IDF matrix, where the first row is the query, each subsequent row is a passage, and each column is a word in the vocabulary. Once the matrix is computed, rows can be compared with any similarity metric. We decided to use a linear kernel, which, through experimentation, gave us the best results.

To compute the passage relevance weights, we compute the similarity of the first row (the query) with each of the other rows (the passages) in the matrix. Finally, we normalize the weights so that they sum to one.
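A sketch of this relevance model using scikit-learn (an assumption on our part, not necessarily the code we ran; note that scikit-learn uses a smoothed variant of the idf formula above, and binary=True gives the 0/1 term frequencies we describe):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def passage_relevance_weights(query, passages):
    # Binary term frequencies, IDF over the query + passages, linear kernel
    # between the query row and each passage row, weights normalized to sum to one.
    vectorizer = TfidfVectorizer(binary=True)
    matrix = vectorizer.fit_transform([query] + passages)
    sims = linear_kernel(matrix[0:1], matrix[1:]).ravel()
    if sims.sum() == 0:                       # degenerate case: no term overlap
        return np.full(len(passages), 1.0 / len(passages))
    return sims / sims.sum()
```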

3.5 Evaluation

To evaluate the performance of the model, Microsoft has released evaluation scripts for the MS MARCO dataset.¹

¹ The evaluation scripts can be found on the MS MARCO website: http://www.msmarco.org/submission.aspx

Model                Bleu-1   Rouge-L
ReasoNet Baseline    14.83    19.20
R-Net (Leader)       42.22    42.89
Human Performance    46.00    47.00

Table 2: MS MARCO benchmark performances

Given both the true and predicted answers, the scripts output the Bleu-1 and Rouge-L scores. The Bleu-1 score measures the precision of unigrams, while the Rouge-L score measures the recall of the longest co-occurring in-sequence n-grams. In other words, the Bleu score reflects how many of the words in the predicted answers appear in the reference answers, while the Rouge score reflects how many of the words in the reference answers appear in the predicted answers. Table 2 captures benchmark performances at the time of writing this paper.
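For intuition only, the simplified functions below illustrate the two quantities (clipped unigram precision and LCS-based recall); they are not the official MS MARCO scripts, which also handle multiple references, the brevity penalty, and the Rouge-L F-measure:

```python
from collections import Counter

def unigram_precision(pred_tokens, ref_tokens):
    # Simplified Bleu-1: clipped unigram precision, ignoring the brevity penalty.
    ref_counts = Counter(ref_tokens)
    clipped = sum(min(c, ref_counts[t]) for t, c in Counter(pred_tokens).items())
    return clipped / max(len(pred_tokens), 1)

def lcs_recall(pred_tokens, ref_tokens):
    # Rouge-L-style recall: longest common subsequence length over reference length.
    m, n = len(pred_tokens), len(ref_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred_tokens[i] == ref_tokens[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(n, 1)
```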

4 Results

4.1 Answer-Extractor Results

Model         Bleu-1   Rouge-L
Baseline      46.6     53.8
Attention     52.6     51.8
Coattention   50.6     57.0
BiDAF         53.9     57.6

Table 3: Answer extractor results on selected passages

Feeding the answer extractor just the selected passages for each sample, we used the evaluation scripts to score how well the answer extractors did. These results are displayed in table 3. Clearly, the more complex the model became, the better the results became as well. Consequently, what it came down to was how much the multiple passages would affect these results. With multiple passages, there is much more noise surrounding the answer, as well as variation in how the answer is expressed.

4.2 Passage Relevance Results

Our passage relevance model selects the 'selected' passage as the most relevant just 20.2% of the time. Although this is quite low, often the answer is contained in more than just the selected passage, so this is acceptable. Analyzing the cumulative accuracy of the model in figure 4, it can be noted that the passage relevance model gives

Figure 4: Passage relevance ranking accuracy

at least somewhat of an advantage over random ranking. In particular, the model does a nice job of burying irrelevant passages at the bottom of the ranking. Since our TF-IDF calculations use binary variables for the term frequencies, multiple passages score similarly to each other. This is by design in our model. We want the answer extractor confidence to play a large role in selecting the answer, rather than being overshadowed just because a particular passage uses key terms more than others.

4.3 Multi-Passage Results

Model                    Bleu-1   Rouge-L
Baseline (Microsoft's)   14.83    19.20
Baseline (Ours)          28.92    26.88
Attention                25.88    25.01
Coattention              26.56    25.48
BiDAF                    29.22    26.30

Table 4: Multi-passage results

Combining the answer extractor and passage relevance models, we then evaluated our multi-passage model. The resulting scores of the four models are displayed in table 4. Quite surprisingly, we did not see the same trend as we did in the answer extractor models. We believe this is due to the different probability distributions produced by the different models. The more complex models may have produced more confident predictions for irrelevant or shorter passages. Since our method for selecting the final answer out of the multiple candidate answers does not account for this, the results from the more complex multi-passage models suffer.

Drilling further into what produced these results, we analyzed the performance of the models on the different query types (table 5). The first thing to note is that no one model performed best for every answer type. While the baseline model produced the most consistent results, each model had its strong suit. One other trend to note is that the larger the dataset, the better the results generally were. This is a consequence of our decision to train five separate answer extractors for the five query types. The description answer extractors produced much better results than the person type, despite the larger variation in description answers. We believe this is because the description answer extractors had much more training data to learn from. To account for this, something to look into in the future may be methods involving transfer learning: having a main answer extractor look at every query type, and appending layers to the model for the particular query types.

4.4 Error Analysis

Through analyzing the answers given by our models, we have identified several sources of error, which are enumerated below. Some of these errors would be relatively simple to address given more time, while others would require a significant amount of further research to fix.

4.4.1 Lack of Comprehension

This is the most glaring problem with our model, and also the most difficult to solve. In many cases, our model produces probable but incorrect answers. It recognizes patterns but fails to comprehend enough about the question and context to produce correct answers. For example, figure 5 shows how the baseline model provides a probable but incorrect answer.

Figure 5: Answer from the baseline model for the question 'Average hours of sleep college students get.'

Bleu-1
Query Type    Baseline   Attention   Coattention   BiDAF
Person        16.32      19.05       20.43         13.86
Location      23.26      27.61       21.87         20.88
Entity        25.84      18.99       23.51         16.91
Numeric       17.27      15.72       18.29         14.43
Description   28.17      25.37       26.56         28.96

Rouge-L
Query Type    Baseline   Attention   Coattention   BiDAF
Person        29.91      26.06       26.21         24.21
Location      24.86      27.40       23.61         23.89
Entity        22.51      19.33       20.28         18.00
Numeric       29.53      28.49       29.61         30.39
Description   26.27      23.59       24.05         25.97

Table 5: Multi-passage results on individual query types. Numbers in bold highlight best results for each query type.

4.4.2 Answer Length

The correct level of detail for an answer is subjective and varies significantly across questions. For example, figure 6 shows three possible answers to the question 'What is the pay scale for medical secretary?'. The first, and shortest, includes just the median salary. This answer is acceptable, but doesn't fully answer the question. The next answer includes the average, median, minimum, and maximum salaries for medical secretaries, while the last includes the same information as well as hourly pay rates. In this case, the 'correct' answer is entirely subjective and depends on what exactly the user is searching for. There is no simple fix to balance precision and recall like this; it is a complex problem that will be a persistent source of error for any question answering system.

4.4.3 Passage Relevance

When using only the passages which contain answers, our models achieve scores similar to those at the top of the MS MARCO leaderboard. When we introduce irrelevant passages, those scores drop significantly. Our model for predicting passage relevance generally selects passages that contain some answer to the provided question. The problem is that the passages it selects may not be the same ones that were selected when the MS MARCO dataset was created. A more complex neural model might be able to achieve higher performance on passage relevance, but the variability of human answers will still be a significant source of error for any model trained on this dataset.

Figure 6: Answer from the coattention model for the question 'What is the pay scale for medical secretary?'

4.4.4 No Answers

In some cases, our model produces no answer whatsoever. When selecting an answer, we have the probability that the answer starts and ends on each token. Our generated answer starts with the most probable start token and ends with the most probable end token. When the most probable end precedes the most probable start, we return no answer. There are many simple fixes to this issue; some lie within the answer extractors themselves, others in the way we select the final answer. One such fix is sketched below.
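The sketch below shows one such fix under our own assumptions (a joint search over valid spans with a hypothetical length cap, rather than independent arg maxes), which guarantees a non-empty answer:

```python
def best_valid_span(p_start, p_end, max_len=30):
    # Search only spans with start <= end (and a length cap) instead of taking
    # the start and end arg maxes independently.
    best, best_span = -1.0, (0, 0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(len(p_end), i + max_len)):
            score = ps * p_end[j]
            if score > best:
                best, best_span = score, (i, j)
    return best_span
```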

Figure 7: The attention model provides no answer to the question 'What type of substance is ephedrine?' because the most probable end token precedes the most probable start token.

4.4.5 Numeric Answers

Our model has little understanding of numeric values, which poses a significant problem since more than one quarter of MS MARCO questions are numeric. Our models have no understanding of scale or how units relate. We are completely dependent on how well the GloVe embeddings encode the relations between units, which we suspect is not well. In addition, since the models are just looking for patterns within the passages, they do not understand whether a particular number fits the appropriate numeric range for the answer. To solve this problem, we would have to investigate how to better encode the relation between units and scale.

5 Related Work

Microsoft Research Redmond (Shen et al., 2016) achieved state-of-the-art performance on multiple reading comprehension datasets and performed well on the MS MARCO dataset. ReasoNet, their model, uses a multi-turn process, in which it repeatedly processes the context and the question after digesting intermediate information, with different attention weights in each iteration. Results show that multi-turn reasoning has generally outperformed single-turn reasoning, which basically utilizes an attention mechanism to emphasize specific parts of the context that are relevant to the question. Moreover, ReasoNet uses reinforcement learning to dynamically decide when to terminate the inference process in reading comprehension tasks.

Microsoft Research Asia (Wang et al., 2017) currently has the best performance on the MS MARCO leaderboard. Their model, Prediction, uses an end-to-end neural network based on Match-LSTM and Pointer Net. Match-LSTM is used to predict textual entailment: given two sentences, a premise and a hypothesis, it predicts whether the premise entails the hypothesis. Pointer Net generates an output sequence whose tokens come from the input. Microsoft Research Asia propose two ways to use Pointer Net: a sequence model, which outputs pointers to the tokens of the answer in the context, and a boundaries model, which outputs pointers to the boundaries (first and last tokens) of the answer in the context.

6 Conclusion

We proposed a machine comprehension model that is able to take in multiple passages and a query and produce an answer to the question. Our model consists of two main modules, the answer extractor and the passage relevance model. On the MS MARCO dataset, the model surpassed the baseline performance posted by Microsoft Research and started to approach other competitors. With more time to refine the model and experiment with different approaches, now that we have become acquainted with the field, we feel that we could really compete with some of the more advanced models from more experienced researchers.

With less than 10 weeks to ramp up and learn deep learning with no prior experience, learn TensorFlow, and put together an advanced question answering system, we feel we produced respectable results. Looking back, there are many things we would have done differently and routes that we wish we had explored, but that is all part of the learning process.

6.1 Future Work

Obviously, we can greatly improve upon the current results. The first place to look is the passage relevance model. We believe this was the largest limiting factor for our results. Although TF-IDF works very well in practice, that is often for retrieving documents from giant, broad corpora, not for differentiating between very similar passages. An initial idea would be to look into other datasets for some sort of training data for a model. Another great place to look for models would be the Fake News Challenge. In that task, the model is given a headline and a body of text, and the model has to output whether the two are related or not. Adapting an effective model from the Fake News Challenge to our task of finding relevant passages could lead to a great improvement in the results.

Another opportunity for improvement lies in the fact that many of the passages contain the correct answer. If each of these passages is passed through the answer extractor, ideally the correct answer will be chosen multiple times. Exploiting this fact, we could cluster/merge similar answers to give greater confidence to persistent answers.

7.13% of the samples contain yes or no answers; this is something we chose to overlook for the time being. In order to score well on this dataset, this property of the data must be addressed. Initially, building a classifier to identify whether the query is a 'yes or no' query would be the first step. From there, if the query is a 'yes or no' query, we would run the answer extractor on the query and passage to get the answer. Once we have this answer, we would run it through one final classifier, giving it the query and answer, and it would output either 'yes' or 'no'. Although our current model answers many of these questions correctly, it gives the detailed answer instead of just 'yes' or 'no' and therefore receives no credit for the answer.

The final remark addresses what we believe is the gold standard for the model. Making quite the jump, the ideal model would be a language-generative one. Instead of just pointing to where the answer is, the model would need to encode an understanding of the answer and then generate the language to express this answer. This would take care of many of the issues our model has, but would take much more research on the topic to create even just a baseline model.

References

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. http://www.aclweb.org/anthology/D14-1162.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR abs/1606.05250. http://arxiv.org/abs/1606.05250.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR abs/1611.01603. http://arxiv.org/abs/1611.01603.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2016. ReasoNet: Learning to stop reading in machine comprehension. CoRR abs/1609.05284. http://arxiv.org/abs/1609.05284.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. CoRR abs/1611.01604. http://arxiv.org/abs/1611.01604.