Stat Design3 18 09

Post on 14-Dec-2014


STAT Design (I)

3/18/09

Domain

Core

Application

Technical Service

Weka API MinorThird API

Example Impl

Foundation

Util Math

John Impl Weka Wrapper

Framework

Clients

Extension of framework provided with the package

Extension of framework done by the user

Proposed packages V1

edu.cmu.statproject.core
edu.cmu.statproject.impl
edu.cmu.statproject.wrapper
edu.cmu.statproject.math
edu.cmu.statproject.util
edu.cmu.statproject.impl.hideki
edu.cmu.statproject.impl.example
edu.cmu.statproject.impl.shilpa
edu.cmu.statproject.wrapper.weka
edu.cmu.statproject.wrapper.minorthird

Proposed packages V2

edu.cmu.statproject.core
edu.cmu.statproject.math
edu.cmu.statproject.util
edu.cmu.statproject.hideki
edu.cmu.statproject.example
edu.cmu.statproject.shilpa
edu.cmu.statproject.weka
edu.cmu.statproject.minorthird

Different implementations of .core

CorpusReader

+ read(filename: String): Corpus
+ read(ds: DataSource): Corpus

Corpus

- documents: List<Document>
- pm: PartitionManager
- partitionFilter: String

+ add(d: Document)
+ remove(id: int)
+ get(id: int): Document
+ getPartitionManager(): PartitionManager
+ setPartitionManager(pm: PartitionManager)
+ iterator(): Iterator<Document>

Document

- id: int
- annotations: List<Annotation>
- labels: List<String>
- text: String
…

+ getAnnotations(id: int): List<Annotation>
+ getLabels(): List<String>
+ getText(): String
+ setText(text: String)
+ addLabel(label: String)

Annotator

+ annotate(c: Corpus)
+ annotate(d: Document)

FeatureExtractor

+ transform(c: Corpus): Dataset
+ transform(d: Document): Instance

Dataset (see the next diagram for the “Machine Learning” side)

Annotation

- id: int
- begin: int
- end: int
- label: List<String>

+ getText(): String
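The begin and end fields suggest that an Annotation marks a character span of its Document's text, with getText() returning that span. A minimal sketch under that assumption follows. The names TextDocument and SpanAnnotation, the half-open [begin, end) convention, the single String label, and the back-reference from annotation to document are illustrative choices, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Document: holds the text and its annotations.
final class TextDocument {
    private final String text;
    private final List<SpanAnnotation> annotations = new ArrayList<>();

    TextDocument(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }

    // Creates an annotation over [begin, end) and registers it on this document.
    SpanAnnotation annotate(int begin, int end, String label) {
        if (begin < 0 || end > text.length() || begin > end) {
            throw new IllegalArgumentException("span out of range");
        }
        SpanAnnotation a = new SpanAnnotation(this, annotations.size(), begin, end, label);
        annotations.add(a);
        return a;
    }

    List<SpanAnnotation> getAnnotations() {
        return annotations;
    }
}

// Hypothetical stand-in for Annotation: character offsets into the owning document.
final class SpanAnnotation {
    private final TextDocument doc;
    final int id;
    final int begin;
    final int end;
    final String label;

    SpanAnnotation(TextDocument doc, int id, int begin, int end, String label) {
        this.doc = doc;
        this.id = id;
        this.begin = begin;
        this.end = end;
        this.label = label;
    }

    // Returns the covered text, computed from the offsets rather than stored.
    String getText() {
        return doc.getText().substring(begin, end);
    }
}
```

Storing offsets instead of copied strings keeps annotations cheap, but it also means they would go stale if setText() replaced the document text.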

edu.cmu.statproject.core: Text Processing


PartitionManager

DataSource

FeatureExtractor

+ transform(c: Corpus): Dataset
+ transform(d: Document): Instance

Dataset

- instances: List<Instance>
- featureNames: List<String>
- featureIDs: HashMap<String, Integer>
- labelNames: List<String>
- labelIDs: HashMap<String, Integer>
- pm: PartitionManager
- partitionFilter: String

+ add(ins: Instance)
+ remove(idx: int)
+ get(idx: int): Instance
+ getPartitionManager(): PartitionManager
+ setPartitionManager(pm: PartitionManager)
+ iterator(): Iterator<Instance>
+ getSubset(partitionName: String): Dataset
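The featureNames list and featureIDs map on Dataset look like a two-way index between feature names and integer IDs. The sketch below shows one way such an index could be filled while a bag-of-words extractor turns text into parallel ID and value arrays. FeatureIndex, BagOfWords, and SparseVector are hypothetical names, and whitespace tokenization with lowercasing is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical two-way index: name -> ID through a map, ID -> name through a list.
final class FeatureIndex {
    private final List<String> featureNames = new ArrayList<>();
    private final Map<String, Integer> featureIDs = new HashMap<>();

    // Returns the existing ID of a feature, or assigns the next free one.
    int idOf(String name) {
        Integer id = featureIDs.get(name);
        if (id == null) {
            id = featureNames.size();
            featureNames.add(name);
            featureIDs.put(name, id);
        }
        return id;
    }

    String nameOf(int id) {
        return featureNames.get(id);
    }

    int size() {
        return featureNames.size();
    }
}

// Parallel arrays in the style of Instance.featureIDs and Instance.featureValues.
final class SparseVector {
    final int[] featureIDs;
    final double[] featureValues;

    SparseVector(int[] featureIDs, double[] featureValues) {
        this.featureIDs = featureIDs;
        this.featureValues = featureValues;
    }
}

final class BagOfWords {
    // Counts whitespace-separated tokens; the result is sorted by feature ID.
    static SparseVector transform(String text, FeatureIndex index) {
        TreeMap<Integer, Double> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace yields one empty token
            }
            counts.merge(index.idOf(token), 1.0, Double::sum);
        }
        int[] ids = new int[counts.size()];
        double[] values = new double[counts.size()];
        int i = 0;
        for (Map.Entry<Integer, Double> e : counts.entrySet()) {
            ids[i] = e.getKey();
            values[i] = e.getValue();
            i++;
        }
        return new SparseVector(ids, values);
    }
}
```

Sharing one index across all documents is what makes feature IDs comparable between instances of the same Dataset.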

Learner

- name: String
- settings: Settings

+ learn(d: Dataset): Model

Model

- name: String

Instance

- id: int
- featureIDs: int[]
- featureValues: double[]
- sequence: int[]
- labels: int[]

+ getFeatureIDs(): int[]
+ getFeatureValues(): double[]
+ getSequence(): int[]
+ getLabel(): int
+ getLabels(): int[]
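The parallel featureIDs and featureValues arrays read like a sparse vector: only non-zero features are stored. A minimal sketch of how a learner might consume that layout is shown here. SparseInstance is a hypothetical name, and the requirement that IDs are sorted ascending without duplicates is an assumption the slides do not state.

```java
import java.util.Arrays;

// Hypothetical sparse instance; featureIDs is assumed sorted ascending, no duplicates.
final class SparseInstance {
    private final int[] featureIDs;
    private final double[] featureValues;

    SparseInstance(int[] featureIDs, double[] featureValues) {
        if (featureIDs.length != featureValues.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.featureIDs = featureIDs;
        this.featureValues = featureValues;
    }

    // Value of one feature; features that are not stored count as zero.
    double valueOf(int featureID) {
        int pos = Arrays.binarySearch(featureIDs, featureID);
        return pos >= 0 ? featureValues[pos] : 0.0;
    }

    // Dot product with a dense weight vector, touching only the stored entries.
    double dot(double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < featureIDs.length; i++) {
            sum += featureValues[i] * weights[featureIDs[i]];
        }
        return sum;
    }
}
```

The dot product costs time proportional to the number of stored features rather than the size of the feature space, which is the usual reason for this layout in text classification.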

Classifier

- name: String
- m: Model
- settings: Settings

+ setModel(m: Model)
+ classify(d: Dataset): Classification

Classification

- predictions: List<String>
- classifierName: String
- modelName: String
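The diagram separates training (Learner produces a Model) from prediction (Classifier holds a Model and classifies). The sketch below walks through that flow with the simplest possible learner, a majority-class baseline. The class names are hypothetical, and integer label arrays stand in for Dataset and Classification to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: remembers only the most frequent training label.
final class MajorityModel {
    final int label;

    MajorityModel(int label) {
        this.label = label;
    }
}

// Plays the role of Learner: learn(...) returns a model and keeps no state itself.
final class MajorityLearner {
    MajorityModel learn(int[] labels) {
        if (labels.length == 0) {
            throw new IllegalArgumentException("empty training set");
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int y : labels) {
            counts.merge(y, 1, Integer::sum);
        }
        int best = labels[0];
        int bestCount = 0;
        // TreeMap iterates in ascending label order, so ties go to the smallest label.
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return new MajorityModel(best);
    }
}

// Plays the role of Classifier: the model is injected through setModel(...).
final class MajorityClassifier {
    private MajorityModel model;

    void setModel(MajorityModel m) {
        this.model = m;
    }

    int[] classify(int instanceCount) {
        if (model == null) {
            throw new IllegalStateException("no model set");
        }
        int[] predictions = new int[instanceCount];
        Arrays.fill(predictions, model.label);
        return predictions;
    }
}
```

Keeping the model as a separate object is what lets it be saved, reloaded, and swapped into an existing classifier between cross-validation folds.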

ClassificationEvaluator

- settings: Settings

+ eval(cl: Classification, d: Dataset): ClassificationEvaluation

ClassificationEvaluation
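The slides leave the contents of ClassificationEvaluation open. As an illustration only, the sketch below computes accuracy by comparing predicted label names with gold label names; the AccuracyEvaluator name and the choice of accuracy as the metric are assumptions.

```java
import java.util.List;

// Hypothetical evaluator: fraction of predictions that match the gold labels.
final class AccuracyEvaluator {
    static double accuracy(List<String> predictions, List<String> gold) {
        if (predictions.size() != gold.size()) {
            throw new IllegalArgumentException("prediction and gold counts differ");
        }
        if (gold.isEmpty()) {
            throw new IllegalArgumentException("nothing to evaluate");
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (predictions.get(i).equals(gold.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }
}
```

In the design above, the predictions would come from Classification.predictions and the gold labels from the Dataset passed to eval.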

edu.cmu.statproject.core: Machine Learning

PartitionManager

itemOrder (values shown in the figure): 6 2 1 3 8 7 4 5

PartitionManager

- nparts: int
- partSizes: int[]
- partNames: String[]
- itemOrder: int[]

// Constructors
# PartitionManager(d: Dataset)
# PartitionManager(c: Corpus)
# PartitionManager(itemCount: int)

+ size()

// Partitioning
+ split(npart: int)
+ setPartName(partNo: int, partName: String)
+ split(partSizes: int[], partNames: String[])
+ split(partRatios: double[], partNames: String[])

// Cross Validation Methods
+ splitForCrossValidation(npart: int, partNames: String[])
+ setCurrentFold(foldNo: int)

// Changing the order in which items are processed
+ shuffle()
+ shuffle(partNo: int)
+ shuffle(partName: String)

// Getters and setters are not shown for brevity

edu.cmu.statproject.core: Partitioning and Ordering
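The itemOrder, partSizes, and partNames fields suggest that a partition is just a permutation of item indexes cut into consecutive, named runs. The sketch below implements that reading. PartitionSketch and itemsOf are hypothetical names, zero-based item indexes are assumed, and merging parts that share a name is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical partition manager: a permutation plus consecutive named parts.
final class PartitionSketch {
    private final int[] itemOrder;
    private int[] partSizes;
    private String[] partNames;

    PartitionSketch(int itemCount) {
        itemOrder = new int[itemCount];
        for (int i = 0; i < itemCount; i++) {
            itemOrder[i] = i;
        }
        partSizes = new int[]{itemCount};
        partNames = new String[]{"all"};
    }

    // Fisher-Yates shuffle of the processing order.
    void shuffle(Random rng) {
        for (int i = itemOrder.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = itemOrder[i];
            itemOrder[i] = itemOrder[j];
            itemOrder[j] = tmp;
        }
    }

    // Cuts the ordered items into consecutive parts of the given sizes.
    void split(int[] sizes, String[] names) {
        int total = 0;
        for (int s : sizes) {
            if (s < 0) {
                throw new IllegalArgumentException("negative part size");
            }
            total += s;
        }
        if (sizes.length != names.length || total != itemOrder.length) {
            throw new IllegalArgumentException("sizes must match names and cover all items");
        }
        partSizes = sizes.clone();
        partNames = names.clone();
    }

    // Ratio variant: rounds each part and gives the remainder to the last part.
    void split(double[] ratios, String[] names) {
        int[] sizes = new int[ratios.length];
        int assigned = 0;
        for (int p = 0; p < ratios.length - 1; p++) {
            sizes[p] = (int) Math.round(ratios[p] * itemOrder.length);
            assigned += sizes[p];
        }
        sizes[ratios.length - 1] = itemOrder.length - assigned;
        split(sizes, names);
    }

    // Item indexes of every part carrying the given name, in processing order.
    List<Integer> itemsOf(String partName) {
        List<Integer> items = new ArrayList<>();
        int offset = 0;
        for (int p = 0; p < partSizes.length; p++) {
            if (partNames[p].equals(partName)) {
                for (int k = 0; k < partSizes[p]; k++) {
                    items.add(itemOrder[offset + k]);
                }
            }
            offset += partSizes[p];
        }
        return items;
    }
}
```

Because the data itself is never moved, a Dataset and a Corpus can share one partition object, and getSubset(partitionName) can be answered from index lists alone.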

How Partition Manager Works:

pm.split(new int[]{2, 3, 1, 2}, new String[]{"name1", "name2", "name1", "name3"});

[Figure: the items of a Dataset or Corpus, taken in itemOrder, are cut into consecutive parts.]

partNo   partName   size
0        name1      2
1        name2      3
2        name1      1
3        name3      2

Sample Code:

Dataset d;
...
// Default split method (without explicit splitter)
PartitionManager pm = new PartitionManager(d.size());
d.setPartitionManager(pm);
pm.shuffle();
pm.split(new double[]{0.8, 0.2}, new String[]{"train", "test"});

// Learn from the training subset
NaiveBayesLearner nbLearner = new NaiveBayesLearner();
NaiveBayesModel nbModel = nbLearner.learn(d.getSubset("train"));

// Classify the test subset
NaiveBayesClassifier nbCf = new NaiveBayesClassifier(nbModel);
NaiveBayesClassification nbCt = nbCf.classify(d.getSubset("test"));

// Or use cross validation
pm.splitForCrossValidation(10, new String[]{"train", "test"});
for (int i = 0; i < 10; ++i) {
    pm.setCurrentFold(i);
    nbModel = nbLearner.learn(d.getSubset("train"));
    nbCf.setModel(nbModel); // use the model trained on this fold
    nbCf.classify(d.getSubset("test"));
}
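Since split(partSizes, partNames) lets several parts share one name, cross validation could be expressed with the same mechanism: cut the items into k folds, then name the current fold "test" and every other fold "train". The sketch below computes those two arrays. It is one possible reading of splitForCrossValidation and setCurrentFold, not the documented behavior, and CrossValidationSketch, foldSizes, and namesForFold are hypothetical names.

```java
// Hypothetical helper: builds the arguments for split(partSizes, partNames) per fold.
final class CrossValidationSketch {

    // Sizes of k nearly equal folds; the first (itemCount % k) folds get one extra item.
    static int[] foldSizes(int itemCount, int k) {
        if (k <= 0 || k > itemCount) {
            throw new IllegalArgumentException("need 1 <= k <= itemCount");
        }
        int[] sizes = new int[k];
        for (int f = 0; f < k; f++) {
            sizes[f] = itemCount / k + (f < itemCount % k ? 1 : 0);
        }
        return sizes;
    }

    // Part names when fold currentFold is held out for testing.
    static String[] namesForFold(int k, int currentFold, String trainName, String testName) {
        if (currentFold < 0 || currentFold >= k) {
            throw new IllegalArgumentException("fold out of range");
        }
        String[] names = new String[k];
        for (int f = 0; f < k; f++) {
            names[f] = (f == currentFold) ? testName : trainName;
        }
        return names;
    }
}
```

Under this reading, setCurrentFold(i) only relabels the parts; the item order and fold boundaries stay fixed, so every item is tested exactly once across the k folds.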

Corpus

Model

Dataset

<<interface>>

Persistable

+ save(filename: String)
+ load(filename: String)
+ save(ds: DataSource)
+ load(ds: DataSource)
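A minimal sketch of the filename-based half of Persistable follows. The slides do not specify a file format, so the example stores a single name as UTF-8 text. PersistableSketch, NamedModel, and PersistenceDemo are hypothetical names, and wrapping I/O errors in UncheckedIOException is an illustrative choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Mirrors the two filename-based methods of the Persistable interface.
interface PersistableSketch {
    void save(String filename);

    void load(String filename);
}

// Hypothetical model that persists only its name.
final class NamedModel implements PersistableSketch {
    private String name;

    NamedModel(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public void save(String filename) {
        try {
            Files.write(Paths.get(filename), name.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void load(String filename) {
        try {
            name = new String(Files.readAllBytes(Paths.get(filename)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class PersistenceDemo {
    // Saves a model to a temporary file and loads it back into a fresh object.
    static String roundTrip(String name) {
        try {
            Path tmp = Files.createTempFile("model", ".txt");
            try {
                new NamedModel(name).save(tmp.toString());
                NamedModel restored = new NamedModel("");
                restored.load(tmp.toString());
                return restored.getName();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Having load mutate an existing object, as the slide's signature implies, means callers must construct an empty instance first; a static factory would be an alternative design.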

CorpusWriter

+ write(c: Corpus, filename: String)…

CorpusReader

+ read(filename: String): Corpus…

DatasetReader

+ read(filename: String): Dataset…

DatasetWriter

+ write(d: Dataset, filename: String)…

Classification

Persistence

Settings

java.util.Properties
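The slide pairs Settings with java.util.Properties without showing how the two are related. One possibility, sketched below, is a thin wrapper that adds typed getters with defaults on top of string-valued properties. SettingsSketch and its method names are hypothetical, and composition is chosen here only for illustration; inheritance would fit the slide equally well.

```java
import java.util.Properties;

// Hypothetical Settings: string properties with typed, defaulted accessors.
final class SettingsSketch {
    private final Properties props = new Properties();

    void set(String key, String value) {
        props.setProperty(key, value);
    }

    String get(String key, String defaultValue) {
        return props.getProperty(key, defaultValue);
    }

    // Falls back to the default when the key is absent; malformed numbers still throw.
    int getInt(String key, int defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    double getDouble(String key, double defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Double.parseDouble(raw.trim());
    }
}
```

A Learner or Classifier could then read its hyperparameters from one object, for example `settings.getDouble("smoothing", 1.0)`, where the key name is again only an example.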