Exploratory Search Missions for TREC Topics
ceur-ws.org/Vol-1033/extras/paper3-poster.pdf

Martin Potthast, Matthias Hagen, Michael Völske, Benno Stein
Bauhaus-Universität Weimar, 99421 Weimar, Germany

[Figure: system diagram with the labels Author, Topic, Editor, Revision log, ChatNoir SE, Query log, ClueWeb API, and ClueWeb.]

We report on the construction of a new text reuse corpus comprising writing interactions and exploratory search missions.

- 150 essays (based on TREC Web Track topics 2009-2011)
- 12 professional writers hired on a crowdsourcing platform
- Long essay writing task, researching sources using a custom ClueWeb09 search engine
- Writing and search engine interactions recorded in high detail

Corpus Overview

Authors

Writer Demographics

Age                   Gender                Native language(s)
Minimum         24    Female        67%     English   67%
Median          37    Male          33%     Filipino  25%
Maximum         65                          Hindi     17%

Academic degree       Country of origin     Second language(s)
Postgraduate    41%   UK            25%     English   33%
Undergraduate   25%   Philippines   25%     French    17%
None            17%   USA           17%     Afrikaans, Dutch, German, Spanish, Swedish  each 8%
n/a             17%   India         17%     None       8%
                      Australia      8%
                      South Africa   8%

Years of writing      Search engines used   Search frequency
Minimum          2    Google        92%     Daily     83%
Median           8    Bing          33%     Weekly     8%
Standard dev.    6    Yahoo         25%     n/a        8%
Maximum         20    Others         8%

Topics

Example topic:
Obama’s family. Write about President Barack Obama’s family history, including genealogy, national origins, places and dates of birth, etc. Where did Barack Obama’s parents and grandparents come from? Also include a brief biography of Obama’s mother.

Original topic 001 of the TREC Web Track 2009:
Query. obama family tree
Description. Find information on President Barack Obama’s family history, including genealogy, national origins, places and dates of birth, etc.
Sub-topic 1. Find the TIME magazine photo essay “Barack Obama’s Family Tree.”
Sub-topic 2. Where did Barack Obama’s parents and grandparents come from?
Sub-topic 3. Find biographical information on Barack Obama’s mother.

Query log

Corpus                         Distribution
Characteristic      Σ          min    avg     max    stdev
Writers             12
Topics              150
Topics / Writer                1      12.5    33     9.3
Queries             13,651
Queries / Topic                4      91.0    616    83.1
Clicks              16,739
Clicks / Topic                 12     111.6   443    80.3
Clicks / Query                 0      0.8     76     2.2
Sessions            931
Sessions / Topic               1      12.3    149    18.9
Days                201
Days / Topic                   1      4.9     17     2.7
Hours               2,068
Hours / Writer                 3      129.3   679    167.3
Hours / Topic                  3      7.5     10     2.5
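As a quick consistency check on the query log table, the averages for topics per writer, queries per topic, and clicks per topic equal the corresponding totals divided by the number of writers or topics. The Python sketch below reproduces those three values. The dictionary keys and the helper name per_unit_average are illustrative choices, not part of the corpus. The other average rows are not simple ratios of the totals shown, so they are left out.

```python
# Totals taken from the Σ column of the query log table.
totals = {"writers": 12, "topics": 150, "queries": 13651, "clicks": 16739}


def per_unit_average(total: int, units: int) -> float:
    """Average of `total` over `units`, rounded to one decimal as in the table."""
    return round(total / units, 1)


topics_per_writer = per_unit_average(totals["topics"], totals["writers"])
queries_per_topic = per_unit_average(totals["queries"], totals["topics"])
clicks_per_topic = per_unit_average(totals["clicks"], totals["topics"])

print(topics_per_writer, queries_per_topic, clicks_per_topic)
```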

Search mission data will be made available as the Webis-Query-Log-12 (http://www.webis.de/research/corpora)

Data Collection

[Figure: grid of 150 small plots, one per search mission, arranged in columns A to J and rows 1 to 15. Each plot is labeled with its topic ID and its number of queries.]

Panel labels in source order (topic ID: number of queries):
047:165  112:33  023:20  024:16  044:158  037:210  052:58  066:70  064:113  142:23
003:18  028:119  140:28  090:27  053:196  136:23  080:347  017:109  027:248  085:40
018:148  013:153  048:113  082:154  110:319  095:64  069:26  009:30  150:18  010:208
123:35  072:24  026:34  088:114  022:284  084:46  102:52  004:60  012:52  098:48
029:66  075:97  134:50  107:138  040:36  079:42  148:34  015:70  014:34  056:57
099:120  049:616  126:74  145:101  062:62  111:32  118:69  149:106  130:4  131:136
039:28  005:108  114:98  143:47  089:46  021:10  121:55  007:50  139:88  045:48
087:198  086:94  031:218  120:48  058:198  081:112  030:76  061:20  019:147  001:170
096:139  091:56  108:106  008:323  016:70  146:60  109:74  093:104  038:51  094:42
133:301  054:111  083:69  034:44  065:150  144:274  041:48  105:92  060:155  127:99
138:241  106:58  097:84  051:181  011:40  002:135  035:46  059:118  067:185  115:14
116:29  025:133  070:61  073:17  124:23  050:78  129:24  063:66  055:80  078:33
117:68  104:12  141:162  125:60  006:76  071:62  128:108  103:22  068:42  076:42
135:75  113:69  046:18  119:147  042:208  020:30  147:24  122:173  137:16  132:16
032:52  077:26  057:36  074:9  036:60  101:8  043:30  033:42  092:74  100:64

Spectrum of search behavior
- Percentage of queries submitted over time for all 150 search missions
- Ranges from the majority of queries issued at the start of the task (A1) to most queries issued towards the end (J15)
- In between, queries are submitted in bursts (e.g., F9) or show a linear increase (A10)
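To make the notion of "percentage of queries submitted over time" concrete, the following Python sketch computes such a curve for a single search mission. It is an illustration under stated assumptions, not the procedure used to produce the poster's figure: the function names, the evenly spaced sampling, and the use of the area under the curve to tell front-loaded from back-loaded missions are all hypothetical choices, and the timestamps are made up.

```python
from bisect import bisect_right


def query_fraction_curve(query_times, start, end, steps=10):
    """Fraction of a mission's queries submitted by each of steps + 1
    evenly spaced points in time between start and end (assumed sampling)."""
    if end <= start:
        raise ValueError("end must be later than start")
    times = sorted(query_times)
    if not times:
        raise ValueError("need at least one query")
    curve = []
    for i in range(steps + 1):
        t = start + (end - start) * i / steps
        # Number of queries with timestamp <= t, as a share of all queries.
        curve.append(bisect_right(times, t) / len(times))
    return curve


def area_under_curve(curve):
    """Trapezoidal area under the curve on a time axis normalized to [0, 1]."""
    steps = len(curve) - 1
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:])) / steps


# Made-up timestamps: one mission searches early, the other late.
front_loaded = query_fraction_curve([0, 1, 2, 3], start=0, end=100, steps=4)
back_loaded = query_fraction_curve([97, 98, 99, 100], start=0, end=100, steps=4)
print(front_loaded, area_under_curve(front_loaded))
print(back_loaded, area_under_curve(back_loaded))
```

With this sketch, a mission whose queries cluster at the start yields a curve that rises early and has a large area, while a mission whose queries cluster at the end yields a small area. Bursts would show up as steps in the curve, and a steady query rate as a roughly straight line.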

Correlation of searching and writing
- Evidence of distinct text reuse strategies (build-up and boil-down)
- Only the former clearly reflected in the query log

[Figure: two panels, Author 5 (18 topics) and Author 24 (13 topics), each plotting the average query distribution and the average text length over time; both axes run from 0 to 1.]

First Conclusions
- Query frequency by itself is a poor predictor of task completion
- Heavy reliance on the search engine indicates a need to better support exploratory tasks

Main Findings

Web Technology and Information Systems, Bauhaus-Universität Weimar, www.webis.de