A quick tour of the datasets for VLDB 2008 (does not include datasets already in the UCR archive)

Page 1

A quick tour of the datasets for VLDB 2008

(does not include datasets already in the UCR archive)

Page 2

Number of training objects 80

Number of testing objects 2320

Number of classes 8

Length of time series 1024

Euclidean Distance accuracy 95.05%

Some Name

The dataset came from blah blah blah blah

Why is it difficult?

• Blah blah
• Blah blah
• Blah blah

Formatting Note

This is the one-nearest-neighbor (1NN), Euclidean distance (ED) accuracy for just the training set, measured using leave-one-out.

I measured the accuracy of 1NN-ED on the training set (only).

This was to make sure we do not have any formatting misunderstandings.

You should test 1NN-ED on the training set (only) and see if you get the same answers. Do this first; otherwise we may waste time.
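
The check described above can be sketched as follows. This is a minimal illustration, not the script used to produce the numbers on these slides: the function names are made up for the example, and it assumes the training set has already been loaded into a list of equal-length series plus a parallel list of class labels (the file format is not specified here).

```python
import math


def squared_ed(a, b):
    """Squared Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("time series must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_1nn_ed_accuracy(series, labels):
    """Leave-one-out accuracy of 1NN under Euclidean distance.

    Each object is classified by its nearest neighbour among all the
    other objects; the squared distance is enough because the square
    root does not change which neighbour is nearest.
    """
    correct = 0
    for i, query in enumerate(series):
        best_label, best_dist = None, math.inf
        for j, candidate in enumerate(series):
            if i == j:
                continue  # the held-out object may not match itself
            dist = squared_ed(query, candidate)
            if dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(series)
```

Multiplying the returned fraction by 100 gives a figure comparable to the "Euclidean Distance accuracy" line on each page. Ties are broken in favour of the earlier object, which is an arbitrary choice.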

Page 3

Number of training objects 55

Number of testing objects 2345

Number of classes 8

Length of time series 1024

Euclidean Distance accuracy 98.18%

This figure is from [a]. The only changes we made were to flip the data left to right and to z-normalize it.

This dataset is described in Mallat, S. G. (1998), A Wavelet Tour of Signal Processing, San Diego: Academic Press. However the data we used was donated by Jeong [a].

The data was obtained by randomly choosing 55 objects for the training set and choosing the rest for the testing set. Each time series was also reversed.
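
A rough sketch of the preparation steps just described (reverse each series, z-normalize, random train/test split). The function names and the seed are illustrative, and the use of the population standard deviation (dividing by n) is an assumption; the slides do not say which variant was used.

```python
import math
import random


def znorm(ts):
    """Z-normalize a series to zero mean and unit (population) std."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in ts) / n)
    if std == 0:
        return [0.0] * n  # constant series: avoid division by zero
    return [(x - mean) / std for x in ts]


def reverse_and_znorm(ts):
    """Flip a series left to right, then z-normalize it."""
    return znorm(list(reversed(ts)))


def random_split(n_objects, n_train, seed=0):
    """Return (train_indices, test_indices) for a random split."""
    rng = random.Random(seed)
    indices = list(range(n_objects))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])
```

For this dataset the split would be called as `random_split(2400, 55)`, which leaves 2345 objects for testing.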

[a] M. K. Jeong, J. C. Lu, X. Huo, B. Vidakovic, and D. Chen (2006), "Wavelet-based Data Reduction Techniques for Process Fault Detection," Technometrics, 48(1), 26-40. http://web.utk.edu/~mjeong/

MALLAT TECHNOMETRICS

Why is it difficult?

• Many classes.
• Some classes are globally similar and have only local differences.
• Small training set. (In [a], using 1024 instances for training, a decision tree got 96.87% accuracy. Since this was too easy, we reduced the size of the training set significantly.)

Page 4

Number of training objects 67

Number of testing objects 1029

Number of classes 2

Length of time series 24

Euclidean Distance accuracy 95.522%

See Keogh ICDM06

Eamonn Keogh, Li Wei, Xiaopeng Xi, Stefano Lonardi, Jin Shieh, Scott Sirowy (2006). Intelligent Icons: Integrating Lite-Weight Data Mining and Visualization into GUI Operating Systems. ICDM 2006.

ItalyPowerDemand (3 years)

Task: Distinguish days from October to March (inclusive) from days from April to September.
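
The class definition can be written as a one-line rule on the calendar month. The function name is illustrative, and which class number corresponds to which half of the year is not stated here, so the sketch returns a boolean instead of a class id.

```python
from datetime import date


def is_october_to_march(day: date) -> bool:
    """True for days from October to March (inclusive), False for April to September."""
    return day.month >= 10 or day.month <= 3
```

The borderline days mentioned on this page sit right at the two switch points of this rule (late September versus early October, and late March versus early April).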

Why is it difficult?

• Borderline days (late Sep vs. early Oct)
• Unusual days (soccer games etc.)
• Undersampled data?
• August is radically different from the rest of the summer months.

[Figure from Keogh ICDM06]

Page 5

Number of training objects 40

Number of testing objects 1380

Number of classes 4

Length of time series 1639

Euclidean Distance accuracy 85.00%

CinC_ECG_torso

Task: Data is taken from ECG data for multiple torso-surface sites. There are 4 classes (4 different people).

Why is it difficult?

• See the gray strip on the figure. Depending on location on the body, the peak can be positive, neutral or negative. Similar remarks apply to all features.
• The figure shows aligned data, but the challenge data is slightly out of alignment.

Page 6

Number of training objects 155

Number of testing objects 308

Number of classes 5

Length of time series 1092

Euclidean Distance accuracy 51.61%

Haptics

Task: Data is taken from 5 people entering their “passgraph” on a touchscreen. We only consider the X-axis.

Why is it difficult?

• Small training set.
• I think (but have not checked this) that the high variability at the beginning and end of the time series is just noise.
• We are looking at just the X-axis for simplicity; we should also be looking at the Y-axis, pen pressure, pen acceleration…

Novel Shoulder-Surfing Resistant Haptic-based Graphical Password. Behzad Malek, Mauricio Orozco, Abdulmotaleb El Saddik.

[Figure: 4 sample time series (before normalizing)]

Page 7

Number of training objects 25

Number of testing objects 995

Number of classes 6

Length of time series 398

Euclidean Distance accuracy 84.0%

Symbols

Task: Thirteen people participated in this experiment. They were asked to copy a randomly appearing symbol as best they could. There were 3 possible symbols, and each person contributed about 30 attempts.

Why is it difficult?

• Individuality of the 13 participants.
• Each of the 6 classes looks only at the X-axis or the Y-axis; we really should have 3 classes looking at both the X and Y axes.
• Two of the symbols are very similar on the Y-axis.
• Small training set.

This dataset was created for the contest by Jill Brady, a grad student at UCR. We gratefully acknowledge her.

[Figure: two panels, X-axis and Y-axis]

Page 8

Number of training objects 381

Number of testing objects 760

Number of classes 10

Length of time series 99

Euclidean Distance accuracy 72.178%

MedicalImages

Task: The data are histograms of pixel intensity of medical images. The classes are “different human body regions.”

Why is it difficult?

• It is not clear that treating the raw data as time series is the best overall approach for this problem, but the original authors do report success with a “time warping” measure.
• The original time series are of different lengths, and some are very short; making them all the same length may have introduced artifacts.
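
For readers unfamiliar with warping measures, here is a minimal sketch of classic unconstrained dynamic time warping (DTW). It is offered only as a generic illustration; it is not necessarily the “time warping” measure the original authors used, and the function name is made up for this example. Note that DTW accepts series of different lengths.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW distance with squared point-wise cost.

    Uses two rows of the cumulative-cost table, so memory is O(len(b)).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both series must be non-empty")
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])
```

For example, `dtw_distance([0, 1, 2], [0, 1, 1, 2])` is zero because the repeated `1` is absorbed by the warping, whereas Euclidean distance is not even defined for series of unequal length.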

This dataset was donated by Joaquim C. Felipe, Agma J. M. Traina and Caetano Traina Jr.


Page 9

Number of training objects 20

Number of testing objects 601

Number of classes 2

Length of time series 70

Euclidean Distance accuracy 90.0%

SonyAIBORobotSurface

Task: The robot has roll/pitch/yaw accelerometers; here we looked at just the X-axis. The task is to detect the surface being walked on.

Why is it difficult?

• Noisy data.
• Small training set. See the figure at left; with enough data it looks easy.

This dataset was donated by Manuela Veloso and Douglas Vail of Carnegie Mellon University

Red: Cement. Blue: Carpet.

Page 10

Number of training objects 27

Number of testing objects 953

Number of classes 2

Length of time series 65

Euclidean Distance accuracy 85.185%

SonyAIBORobotSurfaceII

Task: The robot has roll/pitch/yaw accelerometers; here we looked at just the Z-axis. The task is to detect the surface being walked on.

Why is it difficult?

• Noisy data.
• Small training set. See the figure at left; with enough data it looks easier.

This dataset was donated by Manuela Veloso and Douglas Vail of Carnegie Mellon University

Red: Cement. Blue: Carpet or Field.

Page 11

Number of training objects 23

Number of testing objects 1139

Number of classes 2

Length of time series 82

Euclidean Distance accuracy 78.261%

TwoLeadECG

Task: The time series is taken from the MIT-BIH Long-Term ECG Database (ltdb), record ltdb/15814, beginning at time 420 and ending at 1019. The task is to distinguish between signal 0 and signal 1.

Why is it difficult?

• Subtle distinctions.
• Small training set.
• The beat extractor does not produce perfect alignment, but after using EM to align the signal (figure at left) it is clear that certain parts of the signal are more informative.

Page 12

Number of training objects 1000

Number of testing objects 8236

Number of classes 3

Length of time series 1024

Euclidean Distance accuracy 86.00%

StarLightCurves

Task: Time series are star light curves falling into three classes.

Why is it difficult?

• Two of the three classes are quite similar.
• Large dataset (but the real datasets have billions of these!)
• Phase was aligned using standard astronomy tricks. However, when we tried circular-shift-invariant Euclidean distance (see [a]), our accuracy improved, suggesting the alignment is not perfect.

1 - CEPH, 2 - EB, 3 - RRL

[a] Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee Lee and Michail Vlachos (2006). LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures. VLDB 2006.
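
The circular-shift-invariant Euclidean distance mentioned for StarLightCurves can be sketched by brute force: try every rotation of one series and keep the smallest distance. This is only an illustration with made-up function names; it is not the indexed method of [a], which exists precisely to avoid this exhaustive search.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def circular_shift_ed(a, b):
    """Smallest Euclidean distance between a and any circular shift of b.

    Brute force: O(n) shifts, each costing O(n), so O(n^2) per pair.
    """
    if len(a) != len(b):
        raise ValueError("time series must have the same length")
    b = list(b)
    return min(euclidean(a, b[s:] + b[:s]) for s in range(len(b)))
```

Because shift zero is among the candidates, this distance is never larger than the plain Euclidean distance, so any phase misalignment left in the data can only help it.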

Page 13

Number of training objects 16

Number of testing objects 306

Number of classes 4

Length of time series 345

Euclidean Distance accuracy 93.75%

DiatomSizeReduction

Task: “Each successive generation of a clonally reproducing diatom is slightly smaller than its forebears.” [a]

Why is it difficult?

• Small training set.
• Possible errors caused by the image processing step.
• Change in scale of diatoms shows up as “warping”.

[a] http://rbg-web2.rbge.org.uk/DIADIST/index.htm?srseries.htm&main
[b] Xiaopeng Xi, et al. (2007). Finding Motifs in Database of Shapes. SDM'07.

[Figure from [b]: Eunotia tenella, Gomphonema augur, Fragilariforma bicapitata, Stauroneis smithii (many omitted)]

Page 14

Number of training objects 20

Number of testing objects 1252

Number of classes 2

Length of time series 84

Euclidean Distance accuracy 75.00%

Motes

Task: Sensor data used in paper [b]. Here the task is to distinguish between sensor q8calibHumid and sensor q8calibHumTemp. The raw data has dropouts, which I left in.

Why is it difficult?

• Small training set.
• Lots of dropouts (however, when the noise is removed, it should be very easy).
• Here the dropouts had value zero, but after z-normalization these values changed. It would have been easier to do smart smoothing if the data were not normalized.

[a] Raw data from Carlos Guestrin (CMU). Classification version by Keogh.
[b] Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos: Online Latent Variable Detection in Sensor Networks. ICDE 2005: 1126-1127
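
A small worked sketch of the last point, using made-up readings. In raw data a dropout is easy to spot (value zero) and to patch by interpolation; after z-normalization the same samples sit at -mean/std, a value that differs from series to series. The function names are illustrative, the interpolation is just one simple form of smoothing, and the population standard deviation is an assumption.

```python
import math


def znorm(ts):
    """Z-normalize a series to zero mean and unit (population) std."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in ts) / n)
    return [(x - mean) / std for x in ts] if std else [0.0] * n


def fill_dropouts(raw, dropout_value=0.0):
    """Linearly interpolate over raw samples equal to dropout_value."""
    good = [i for i, x in enumerate(raw) if x != dropout_value]
    if not good:
        raise ValueError("series contains only dropouts")
    out = [float(x) for x in raw]
    for i, x in enumerate(raw):
        if x != dropout_value:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:        # leading dropout: copy next good value
            out[i] = float(raw[right])
        elif right is None:     # trailing dropout: copy last good value
            out[i] = float(raw[left])
        else:
            w = (i - left) / (right - left)
            out[i] = raw[left] * (1 - w) + raw[right] * w
    return out


raw = [20.0, 21.0, 0.0, 22.0, 0.0, 21.0]   # hypothetical readings
z = znorm(raw)                              # dropouts now at -mean/std
patched = fill_dropouts(raw)                # only possible on raw data
```

On the normalized contest data the dropout value is no longer zero, so a participant would first have to guess it (for example by looking for an exactly repeated value within each series) before any such patching.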


Page 15

Number of training objects 487

Number of testing objects 3840

Number of classes 3

Length of time series 166

Euclidean Distance accuracy 63.383%

ChlorineConcentration

Task: Sensor data used in paper [b]. Multiple sensors have spatial correlation; I arbitrarily divided them into 3 sets.

Why is it difficult?

• The borderline cases are hard to classify. However, with more data it would be easy. For example, when I randomly sample k items from the labeled test set and do 1NN-ED classification, I get:

k = 1000 -> 76.5% accuracy
k = 2000 -> 89.85% accuracy
k = 3000 -> 96.8% accuracy
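
One way such a learning-curve experiment could be set up is sketched below. The slide does not say how the k sampled items were evaluated, so this sketch assumes they serve as the reference set and the remaining labeled items as queries; the function names and the seed are illustrative.

```python
import random


def nn1_ed_predict(ref_series, ref_labels, query):
    """Label of the reference series nearest to query (squared ED)."""
    best = min(
        range(len(ref_series)),
        key=lambda j: sum((x - y) ** 2 for x, y in zip(ref_series[j], query)),
    )
    return ref_labels[best]


def sampled_accuracy(series, labels, k, seed=0):
    """1NN-ED accuracy using k random items as the reference set.

    The remaining items are used as queries (an assumption, see above).
    """
    if not 0 < k < len(series):
        raise ValueError("k must be between 1 and len(series) - 1")
    rng = random.Random(seed)
    order = list(range(len(series)))
    rng.shuffle(order)
    ref, rest = order[:k], order[k:]
    ref_series = [series[i] for i in ref]
    ref_labels = [labels[i] for i in ref]
    hits = sum(
        nn1_ed_predict(ref_series, ref_labels, series[i]) == labels[i]
        for i in rest
    )
    return hits / len(rest)
```

Calling `sampled_accuracy` for k = 1000, 2000 and 3000 on the labeled data would give a curve of the same kind as the one reported above, though the exact numbers depend on the random sample.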

[a] Stacia Thompson and Jeanne M. VanBriesen (CMU). Classification version by Keogh.
[b] Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos: Online Latent Variable Detection in Sensor Networks. ICDE 2005: 1126-1127


Page 16

Number of training objects 23

Number of testing objects 861

Number of classes 2

Length of time series 136

Euclidean Distance accuracy 82.609%

ECGFiveDays

Task: Data is from a 67-year-old male. The two classes are simply 1) ECG date: 12/11/1990 and 2) ECG date: 17/11/1990.

Why is it difficult?

• The wandering baseline was not removed; this shows up as linear drift.
• The beat extractor does not produce perfect alignment, but after using EM to align the signal (figure at left) it is clear that certain parts of the signal are more informative.
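
Since the wandering baseline appears as linear drift within each beat, one simple option a participant might try is to subtract a least-squares straight line from each series. This is a sketch under that assumption, not something that was applied to the contest data; the function name is made up.

```python
def remove_linear_drift(ts):
    """Subtract the least-squares straight line from a series."""
    n = len(ts)
    if n < 2:
        raise ValueError("need at least two samples")
    t_mean = (n - 1) / 2
    y_mean = sum(ts) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ts))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (slope * t + intercept) for t, y in enumerate(ts)]
```

A pure ramp such as `[3 + 0.5 * t for t in range(10)]` comes back as (numerically) all zeros, and any detrended series sums to zero.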

[Figure labels: Wandering baseline; Excerpt of Class 1]

Page 17

Number of training objects 100

Number of testing objects 550

Number of classes 7

Length of time series 1882

Euclidean Distance accuracy 30.00%

InlineSkate

Task: This data was collected from experiments with inline speed skaters on a treadmill. Each time series represents an angular measurement of the ankle during one movement cycle. Cycles were of different lengths; we made them all the same length.
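
How the cycles were brought to a common length is not described here. One common approach is linear interpolation onto a fixed number of evenly spaced points, sketched below with an illustrative function name.

```python
def resample_to_length(ts, target_len):
    """Linearly interpolate a series onto target_len evenly spaced points.

    The first and last samples are preserved exactly.
    """
    n = len(ts)
    if n < 2 or target_len < 2:
        raise ValueError("need at least two samples in and out")
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)  # fractional index into ts
        lo = min(int(pos), n - 2)             # left neighbour
        frac = pos - lo
        out.append(ts[lo] * (1 - frac) + ts[lo + 1] * frac)
    return out
```

For this dataset every cycle would be mapped with `resample_to_length(cycle, 1882)`. Stretching short cycles this way is one possible source of the “warping” listed as a difficulty.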

Why is it difficult?

• Lots of “warping”.
• Long time series (for algorithms that scale poorly in dimensionality).
• The “cycle” extraction algorithm might not be perfect (this was done before we saw the data).

The data was provided by Fabian Moerchen and Olaf Hoos.


Page 18

Number of training objects 200

Number of testing objects 2050

Number of classes 14

Length of time series 131

Euclidean Distance accuracy 75.50%

FacesUCR

Task: This data consists of faces of grad students transformed into “time series”.

Why is it difficult?

• Variation of head angle and expression.
• Some have glasses/no glasses versions.
• All grad students look alike (well, some do).
• The transformation algorithm is a little brittle (we have since found more robust techniques).

Photographs by Chotirat "Ann" Ratanamahatana, image conversion by Xiaopeng Xi and Eamonn Keogh

Page 19

Number of training objects 267

Number of testing objects 638

Number of classes 25

Length of time series 270

Euclidean Distance accuracy 58.80%

WordsSynonyms

Task: This dataset consists of word profiles for George Washington's manuscripts. This dataset is the “50-words” dataset, remapped to 25 classes. The data was flipped left-right so that it would not be recognized.

Why is it difficult?

• There are two ways to be a member of each class.
• In this case, length normalization clearly does throw away useful info.
• Errors from the difficult task of OCR on old documents.


The data was provided by Toni M. Rath and R. Manmatha.

The time series representation of words is known to be very competitive with other representations [a].

Here the results might not be competitive because we are only using one (of four) time series per word, we are normalizing, and we have small training sets.

[a] Word spotting for historical documents. Toni M. Rath and R. Manmatha International Journal on Document Analysis and Recognition. Volume 9, Numbers 2-4 / April, 2007