Stat Design3 18 09

STAT Design (I) 3/18/09


Page 1: Stat Design3 18 09

STAT Design (I)

3/18/09

Page 2: Stat Design3 18 09

Domain

Core

Application

Technical Service

Weka API MinorThird API

Example Impl

Foundation

Util Math

John Impl Weka Wrapper

Framework Clients

Extension of framework provided with the package

Extension of framework done by the user

Page 3: Stat Design3 18 09

Proposed packages V1

edu.cmu.statproject.core
edu.cmu.statproject.impl
edu.cmu.statproject.impl.hideki
edu.cmu.statproject.impl.example
edu.cmu.statproject.impl.shilpa
edu.cmu.statproject.wrapper
edu.cmu.statproject.wrapper.weka
edu.cmu.statproject.wrapper.minorthird
edu.cmu.statproject.math
edu.cmu.statproject.util

Proposed packages V2

edu.cmu.statproject.core
edu.cmu.statproject.math
edu.cmu.statproject.util
edu.cmu.statproject.hideki
edu.cmu.statproject.example
edu.cmu.statproject.shilpa
edu.cmu.statproject.weka
edu.cmu.statproject.minorthird

Different implementations of .core

Page 4: Stat Design3 18 09

CorpusReader

+ read(filename: String): Corpus
+ read(ds: DataSource): Corpus

Corpus

- documents: List<Document>
- pm: PartitionManager
- partitionFilter: String

+ add(d: Document)
+ remove(id: int)
+ get(id: int): Document
+ getPartitionManager(): PartitionManager
+ setPartitionManager(pm: PartitionManager)
+ iterator(): Iterator<Document>

Document

- id: int
- annotations: List<Annotation>
- labels: List<String>
- text: String
…

+ getAnnotations(id: int): List<Annotation>
+ getLabels(): List<String>
+ getText(): String
+ setText(text: String)
+ addLabel(label: String)

Annotator

+ annotate(c: Corpus)
+ annotate(d: Document)

FeatureExtractor

+ transform(c: Corpus): Dataset
+ transform(d: Document): Instance

Dataset

See next diagram for “Machine Learning” side

Annotation

- id: int
- begin: int
- end: int
- label: List<String>

+ getText(): String

edu.cmu.statproject.core: Text Processing

PartitionManager

DataSource
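The Corpus, Document and Annotation boxes above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the constructors, the `getId()` accessors, the back-reference from `Annotation` to its `Document`, and the reading of `begin`/`end` as character offsets are not given in the diagram, and the PartitionManager wiring is left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** A document: an id, its raw text, and document-level labels. */
class Document {
    private final int id;
    private String text;
    private final List<String> labels = new ArrayList<>();

    Document(int id, String text) { // hypothetical constructor
        this.id = id;
        this.text = text;
    }

    int getId() { return id; } // hypothetical accessor
    String getText() { return text; }
    void setText(String text) { this.text = text; }
    void addLabel(String label) { labels.add(label); }
    List<String> getLabels() { return labels; }
}

/** A span over a document's text. */
class Annotation {
    private final int id;
    private final int begin;    // assumption: inclusive character offset
    private final int end;      // assumption: exclusive character offset
    private final Document doc; // assumption: back-reference used to resolve the span

    Annotation(int id, int begin, int end, Document doc) {
        this.id = id;
        this.begin = begin;
        this.end = end;
        this.doc = doc;
    }

    int getId() { return id; }

    /** Returns the covered text, read from the owning document. */
    String getText() { return doc.getText().substring(begin, end); }
}

/** An iterable collection of documents. */
class Corpus implements Iterable<Document> {
    private final List<Document> documents = new ArrayList<>();

    void add(Document d) { documents.add(d); }

    /** Assumption: id is the document id, not a list position. Returns null if absent. */
    Document get(int id) {
        for (Document d : documents) {
            if (d.getId() == id) return d;
        }
        return null;
    }

    void remove(int id) { documents.removeIf(d -> d.getId() == id); }

    @Override
    public Iterator<Document> iterator() { return documents.iterator(); }
}
```

Implementing `Iterable<Document>` lets client code write `for (Document d : corpus)`, which matches the `iterator()` method in the diagram.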

Page 5: Stat Design3 18 09

FeatureExtractor

+ transform(c: Corpus): Dataset
+ transform(d: Document): Instance

Dataset

- instances: List<Instance>
- featureNames: List<String>
- featureIDs: HashMap<String, Integer>
- labelNames: List<String>
- labelIDs: HashMap<String, Integer>
- pm: PartitionManager
- partitionFilter: String

+ add(ins: Instance)
+ remove(idx: int)
+ get(idx: int): Instance
+ getPartitionManager(): PartitionManager
+ setPartitionManager(pm: PartitionManager)
+ iterator(): Iterator<Instance>
+ getSubset(partitionName: String): Dataset

Learner

- name: String
- settings: Settings

+ learn(d: Dataset): Model

Model

- name: String

Instance

- id: int
- featureIDs: int[]
- featureValues: double[]
- sequence: int[]
- labels: int[]

+ getFeatureIDs(): int[]
+ getFeatureValues(): double[]
+ getSequence(): int[]
+ getLabel(): int
+ getLabels(): int[]

Classifier

- name: String
- m: Model
- settings: Settings

+ setModel(m: Model)
+ classify(d: Dataset): Classification

Classification

- predictions: List<String>
- classifierName: String
- modelName: String

ClassificationEvaluator

- settings : Settings

+ eval(cl: Classification, d: Dataset): ClassificationEvaluation

ClassificationEvaluation

edu.cmu.statproject.core: Machine Learning

PartitionManager
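To show how the Learner, Model, Classifier, Classification and ClassificationEvaluator roles fit together, here is a deliberately tiny sketch built around a majority-class baseline. Everything concrete in it is an assumption made for illustration: the `Majority*` classes are hypothetical, `Dataset` is reduced to a list of gold labels, and `eval` returns a plain accuracy value instead of a ClassificationEvaluation object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the diagram's Dataset: only the gold label of each instance is kept. */
class Dataset {
    final List<String> goldLabels;

    Dataset(List<String> goldLabels) { this.goldLabels = goldLabels; }

    int size() { return goldLabels.size(); }
}

interface Model { String name(); }

interface Learner { Model learn(Dataset d); }

class Classification {
    final List<String> predictions = new ArrayList<>();
}

/** Hypothetical model that remembers only the most frequent training label. */
class MajorityModel implements Model {
    final String majorityLabel;

    MajorityModel(String majorityLabel) { this.majorityLabel = majorityLabel; }

    @Override
    public String name() { return "majority"; }
}

class MajorityLearner implements Learner {
    @Override
    public Model learn(Dataset d) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : d.goldLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) { // ties keep the label that reached the count first
                best = label;
                bestCount = c;
            }
        }
        return new MajorityModel(best);
    }
}

class MajorityClassifier {
    private MajorityModel m;

    void setModel(Model m) { this.m = (MajorityModel) m; }

    /** Predicts the stored majority label for every instance. */
    Classification classify(Dataset d) {
        Classification c = new Classification();
        for (int i = 0; i < d.size(); i++) {
            c.predictions.add(m.majorityLabel);
        }
        return c;
    }
}

class ClassificationEvaluator {
    /** Accuracy: the fraction of predictions that equal the gold label. */
    double eval(Classification cl, Dataset d) {
        int correct = 0;
        for (int i = 0; i < d.size(); i++) {
            if (cl.predictions.get(i).equals(d.goldLabels.get(i))) correct++;
        }
        return d.size() == 0 ? 0.0 : (double) correct / d.size();
    }
}
```

The point of the separation is that a Model is the only thing passed from training to classification, so a learner from one wrapper (for example a Weka-backed one) could in principle be swapped without touching the evaluation code.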

Page 6: Stat Design3 18 09

Example itemOrder: 6 2 1 3 8 7 4 5

PartitionManager

- nparts: int
- partSizes: int[]
- partNames: String[]
- itemOrder: int[]

// Constructors
# PartitionManager(d: Dataset)
# PartitionManager(c: Corpus)
# PartitionManager(itemCount: int)

+ size()

// Partitioning
+ split(npart: int)
+ setPartName(partNo: int, partName: String)
+ split(partSizes: int[], partNames: String[])
+ split(partRatios: double[], partNames: String[])

// Cross Validation Methods
+ splitForCrossValidation(npart: int, partNames: String[])
+ setCurrentFold(foldNo: int)

// Changing the order in which items are processed
+ shuffle()
+ shuffle(partNo: int)
+ shuffle(partName: String)

// Getters and setters are not shown for brevity
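The cross-validation methods are only named in the diagram, so the following is a guess at the underlying bookkeeping rather than the project's implementation. It assumes the items are cut into contiguous folds of near-equal size and that selecting a fold makes that slice the test part and everything else the training part. `CrossValidationFolds` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class CrossValidationFolds {
    /** Start offset of fold f when itemCount items are cut into nFolds contiguous folds.
     *  The first (itemCount % nFolds) folds each receive one extra item. */
    static int foldStart(int itemCount, int nFolds, int f) {
        int base = itemCount / nFolds;
        int extra = itemCount % nFolds;
        return f * base + Math.min(f, extra);
    }

    /** Positions that form the test part when the given fold is current. */
    static List<Integer> testPositions(int itemCount, int nFolds, int fold) {
        List<Integer> out = new ArrayList<>();
        int from = foldStart(itemCount, nFolds, fold);
        int to = foldStart(itemCount, nFolds, fold + 1);
        for (int i = from; i < to; i++) {
            out.add(i);
        }
        return out;
    }

    /** Positions that form the training part: everything outside the current fold. */
    static List<Integer> trainPositions(int itemCount, int nFolds, int fold) {
        List<Integer> out = new ArrayList<>();
        int from = foldStart(itemCount, nFolds, fold);
        int to = foldStart(itemCount, nFolds, fold + 1);
        for (int i = 0; i < itemCount; i++) {
            if (i < from || i >= to) out.add(i);
        }
        return out;
    }
}
```

The returned values are positions into `itemOrder`, not item IDs, so shuffling beforehand changes which items land in each fold without changing this logic.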

edu.cmu.statproject.core: Partitioning and Ordering

Dataset, Corpus

How Partition Manager Works:

pm.split(new int[]{2, 3, 1, 2}, new String[]{"name1", "name2", "name1", "name3"});

partNo  partName  itemOrder
0       name1     6 2
1       name2     1 3 8
2       name1     7
3       name3     4 5

Sample Code:

Dataset d;
...
// Default split method (without explicit splitter)
PartitionManager pm = new PartitionManager(d.size());
d.setPartitionManager(pm);
pm.shuffle();
pm.split(new double[]{0.8, 0.2}, new String[]{"train", "test"});

// Learn from the training subset
NaiveBayesLearner nbLearner = new NaiveBayesLearner();
NaiveBayesModel nbModel = nbLearner.learn(d.getSubset("train"));

// Classify the test subset
NaiveBayesClassifier nbCf = new NaiveBayesClassifier(nbModel);
NaiveBayesClassification nbCt = nbCf.classify(d.getSubset("test"));

// Or use cross validation
pm.splitForCrossValidation(10, new String[]{"train", "test"});
for (int i = 0; i < 10; ++i) {
    pm.setCurrentFold(i);
    nbModel = nbLearner.learn(d.getSubset("train"));
    nbCf.setModel(nbModel); // use the model trained on this fold
    nbCf.classify(d.getSubset("test"));
}
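The behaviour illustrated above (an item order cut into consecutive named parts, where a name may be reused) could be implemented along these lines. This is a sketch under assumptions, not the project's code: `setItemOrder`, `itemsIn`, the seeded `shuffle(long)`, the default part name "all", and the rule that the last part absorbs rounding error in the ratio-based split are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class PartitionManager {
    private int[] itemOrder;
    private int[] partSizes;
    private String[] partNames;

    PartitionManager(int itemCount) {
        itemOrder = new int[itemCount];
        for (int i = 0; i < itemCount; i++) itemOrder[i] = i;
        partSizes = new int[] { itemCount };
        partNames = new String[] { "all" }; // hypothetical default name
    }

    int size() { return itemOrder.length; }

    /** Hypothetical setter, standing in for the getters and setters the slide omits. */
    void setItemOrder(int[] order) { itemOrder = order.clone(); }

    /** Cuts itemOrder into consecutive parts of the given sizes. */
    void split(int[] sizes, String[] names) {
        int total = 0;
        for (int s : sizes) {
            if (s < 0) throw new IllegalArgumentException("negative part size");
            total += s;
        }
        if (total != itemOrder.length || sizes.length != names.length) {
            throw new IllegalArgumentException("sizes must cover all items and match names");
        }
        partSizes = sizes.clone();
        partNames = names.clone();
    }

    /** Ratio-based split; assumes the ratios sum to 1. The last part takes the remainder. */
    void split(double[] ratios, String[] names) {
        int[] sizes = new int[ratios.length];
        int assigned = 0;
        for (int i = 0; i < ratios.length - 1; i++) {
            sizes[i] = (int) Math.round(ratios[i] * itemOrder.length);
            assigned += sizes[i];
        }
        sizes[ratios.length - 1] = itemOrder.length - assigned;
        split(sizes, names);
    }

    /** Shuffles the whole item order; the seed parameter is added here for repeatability. */
    void shuffle(long seed) {
        List<Integer> tmp = new ArrayList<>();
        for (int i : itemOrder) tmp.add(i);
        Collections.shuffle(tmp, new Random(seed));
        for (int i = 0; i < tmp.size(); i++) itemOrder[i] = tmp.get(i);
    }

    /** Items of every part carrying the given name, in itemOrder sequence. */
    List<Integer> itemsIn(String partName) {
        List<Integer> items = new ArrayList<>();
        int start = 0;
        for (int p = 0; p < partSizes.length; p++) {
            if (partNames[p].equals(partName)) {
                for (int i = start; i < start + partSizes[p]; i++) items.add(itemOrder[i]);
            }
            start += partSizes[p];
        }
        return items;
    }
}
```

With the item order 6 2 1 3 8 7 4 5 and the four-part split from the slide, `itemsIn("name1")` in this sketch collects the items of both parts named name1, which is one plausible reading of how a subset lookup by partition name would treat a reused name.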

Page 7: Stat Design3 18 09

Corpus

Model

Dataset

<<interface>>

Persistable

+ save(filename: String)
+ load(filename: String)
+ save(ds: DataSource)
+ load(ds: DataSource)

CorpusWriter

+ write(c: Corpus, filename: String)…

CorpusReader

+ read(filename: String): Corpus…

DatasetReader

+ read(filename: String): Dataset…

DatasetWriter

+ write(d: Dataset, filename: String)…

Classification

Persistence

Settings

java.util.Properties
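If the diagram is read as Settings being a subclass of java.util.Properties (an assumption, since only the two names survive in the transcript), the filename-based half of Persistable could be sketched as below. The DataSource overloads are left out because DataSource is not specified, and wrapping I/O failures in UncheckedIOException is a choice made here, not something the slides state.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Filename-based part of the Persistable interface from the diagram. */
interface Persistable {
    void save(String filename);
    void load(String filename);
}

/** Assumption: Settings extends java.util.Properties and is itself Persistable. */
class Settings extends Properties implements Persistable {
    private static final long serialVersionUID = 1L;

    /** Writes all key/value pairs in the standard .properties text format. */
    @Override
    public void save(String filename) {
        try (Writer w = Files.newBufferedWriter(Paths.get(filename))) {
            store(w, "STAT settings"); // Properties.store(Writer, comments)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads key/value pairs from a .properties file into this object. */
    @Override
    public void load(String filename) {
        try (Reader r = Files.newBufferedReader(Paths.get(filename))) {
            load(r); // Properties.load(Reader)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reusing Properties gives Learner, Classifier and ClassificationEvaluator a string-keyed configuration object with a ready-made text format, at the cost of every value being a String that callers must parse.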