STAT Design (I)
3/18/09
Domain
Core
Application
Technical Service
Weka API MinorThird API
Example Impl
Foundation
Util Math
John Impl Weka Wrapper
Framework Clients
Extension of framework provided with the package
Extension of framework done by the user
edu.cmu.statproject.core
edu.cmu.statproject.impl
edu.cmu.statproject.wrapper
edu.cmu.statproject.math
edu.cmu.statproject.util
edu.cmu.statproject.impl.hideki
edu.cmu.statproject.impl.example
edu.cmu.statproject.impl.shilpa
edu.cmu.statproject.wrapper.weka
edu.cmu.statproject.wrapper.minorthird
…
…
Proposed packages V1
edu.cmu.statproject.core
edu.cmu.statproject.math
edu.cmu.statproject.util
edu.cmu.statproject.hideki
edu.cmu.statproject.example
edu.cmu.statproject.shilpa
edu.cmu.statproject.weka
edu.cmu.statproject.minorthird
Proposed packages V2
Different implementations of .core
CorpusReader
+ read(filename: String): Corpus
+ read(ds: DataSource): Corpus
- documents: List<Document>
- pm: PartitionManager
- partitionFilter: String
Corpus
+ add(d: Document)
+ remove(id: int)
+ get(id: int): Document
+ getPartitionManager(): PartitionManager
+ setPartitionManager(pm: PartitionManager)
+ iterator(): Iterator<Document>
Document
- id: int
- annotations: List<Annotation>
- labels: List<String>
- text: String
…
+ getAnnotations(id: int): List<Annotation>
+ getLabels(): List<String>
+ getText(): String
+ setText(text: String)
+ addLabel(label: String)
*
Annotator
+ annotate(c: Corpus)
+ annotate(d: Document)
FeatureExtractor
+ transform(c: Corpus): Dataset
+ transform(d: Document): Instance
Dataset
See next diagram for “Machine Learning” side
Annotation
- id: int
- begin: int
- end: int
- label: List<String>
+ getText(): String
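The Annotation box stores only begin and end offsets, yet exposes getText(). A minimal sketch of how that could work, assuming each Annotation keeps a reference to its owning Document and that [begin, end) are character offsets into the document text. Neither assumption is stated on the slide, and the `annotate(begin, end, label)` helper and the back-reference field are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified Document: only the members needed for this illustration.
class Document {
    private final int id;
    private String text;
    private final List<Annotation> annotations = new ArrayList<>();

    Document(int id, String text) {
        this.id = id;
        this.text = text;
    }

    int getId() { return id; }
    String getText() { return text; }
    void setText(String text) { this.text = text; }
    List<Annotation> getAnnotations() { return annotations; }

    // Hypothetical helper (not on the slide): adds an annotation over [begin, end).
    Annotation annotate(int begin, int end, String label) {
        if (begin < 0 || end > text.length() || begin > end) {
            throw new IllegalArgumentException("span out of range");
        }
        Annotation a = new Annotation(annotations.size(), this, begin, end, label);
        annotations.add(a);
        return a;
    }
}

class Annotation {
    private final int id;
    private final Document owner; // assumed back-reference, not shown on the slide
    private final int begin;
    private final int end;
    private final List<String> label = new ArrayList<>();

    Annotation(int id, Document owner, int begin, int end, String firstLabel) {
        this.id = id;
        this.owner = owner;
        this.begin = begin;
        this.end = end;
        this.label.add(firstLabel);
    }

    int getId() { return id; }
    List<String> getLabel() { return label; }

    // The covered text is derived from the owning document, so it is never stored twice.
    String getText() {
        return owner.getText().substring(begin, end);
    }
}
```

Deriving the text from offsets keeps annotations cheap to store, at the cost of going stale if setText() changes the document after annotation.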
edu.cmu.statproject.core: Text Processing
*
PartitionManager
DataSource
FeatureExtractor
+ transform(c: Corpus): Dataset
+ transform(d: Document): Instance
Dataset
- instances: List<Instance>
- featureNames: List<String>
- featureIDs: HashMap<String, Integer>
- labelNames: List<String>
- labelIDs: HashMap<String, Integer>
- pm: PartitionManager
- partitionFilter: String
+ add(ins: Instance)
+ remove(idx: int)
+ get(idx: int): Instance
+ getPartitionManager(): PartitionManager
+ setPartitionManager(pm: PartitionManager)
+ iterator(): Iterator<Instance>
+ getSubset(partitionName: String): Dataset
Learner
- name: String
- settings: Settings
+ learn(d: Dataset): Model
Model
- name: String
Instance
- id: int
- featureIDs: int[]
- featureValues: double[]
- sequence: int[]
- labels: int[]
+ getFeatureIDs(): int[]
+ getFeatureValues(): double[]
+ getSequence(): int[]
+ getLabel(): int
+ getLabels(): int[]
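The paired featureIDs and featureValues arrays suggest a sparse representation in which only non-zero features are stored. The sketch below illustrates that reading with a dot product against a dense weight vector. The class name `SparseInstance` and the `dot` method are hypothetical, and treating the two arrays as parallel is an assumption, since the slide does not say so explicitly.

```java
// Hypothetical sketch of a sparse instance; not the project's actual Instance class.
class SparseInstance {
    private final int[] featureIDs;       // indices of the features that are present
    private final double[] featureValues; // featureValues[i] belongs to featureIDs[i] (assumed)

    SparseInstance(int[] featureIDs, double[] featureValues) {
        if (featureIDs.length != featureValues.length) {
            throw new IllegalArgumentException("IDs and values must have the same length");
        }
        this.featureIDs = featureIDs.clone();
        this.featureValues = featureValues.clone();
    }

    int[] getFeatureIDs() { return featureIDs.clone(); }
    double[] getFeatureValues() { return featureValues.clone(); }

    // Dot product with a dense weight vector; features beyond the vector's length count as weight 0.
    double dot(double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < featureIDs.length; i++) {
            int id = featureIDs[i];
            if (id >= 0 && id < weights.length) {
                sum += featureValues[i] * weights[id];
            }
        }
        return sum;
    }
}
```

With this layout the cost of scoring an instance depends on the number of active features rather than on the size of the feature vocabulary held by the Dataset.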
*
Classifier
- name: String
- m: Model
- settings: Settings
+ setModel(m: Model)
+ classify(d: Dataset): Classification
1
Classification
- predictions: List<String>
- classifierName: String
- modelName: String
ClassificationEvaluator
- settings : Settings
+ eval(cl: Classification, d: Dataset): ClassificationEvaluation
ClassificationEvaluation
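The slide leaves the contents of ClassificationEvaluation open. As one possible measure an evaluator could report, the sketch below computes plain accuracy from a list of predicted labels and a list of gold labels. The class `AccuracySketch` and its method are hypothetical, and it assumes gold labels are available as strings in the same order as the predictions.

```java
import java.util.List;

// Hypothetical helper; the slide does not define which measures the evaluator computes.
class AccuracySketch {
    // Fraction of positions where the predicted label equals the gold label.
    static double accuracy(List<String> predictions, List<String> goldLabels) {
        if (predictions.size() != goldLabels.size()) {
            throw new IllegalArgumentException("prediction and gold lists differ in length");
        }
        if (predictions.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        int correct = 0;
        for (int i = 0; i < predictions.size(); i++) {
            if (predictions.get(i).equals(goldLabels.get(i))) {
                correct++;
            }
        }
        return (double) correct / predictions.size();
    }
}
```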
edu.cmu.statproject.core: Machine Learning
PartitionManager
itemOrder: 6 2 1 3 8 7 4 5
PartitionManager
- nparts: int
- partSizes: int[]
- partNames: String[]
- itemOrder: int[]
// Constructors
# PartitionManager(d: Dataset)
# PartitionManager(c: Corpus)
# PartitionManager(itemCount: int)
+ size()
// Partitioning
+ split(npart: int)
+ setPartName(partNo: int, partName: String)
+ split(partSizes: int[], partNames: String[])
+ split(partRatios: double[], partNames: String[])
// Cross Validation Methods
+ splitForCrossValidation(npart: int, partNames: String[])
+ setCurrentFold(foldNo: int)
// Changing the order in which items are processed
+ shuffle()
+ shuffle(partNo: int)
+ shuffle(partName: String)
// Getters and setters are not shown for brevity
edu.cmu.statproject.core: Partitioning and Ordering
Diagram: a Dataset or Corpus is described by three rows: partName, partNo, itemOrder.
partNo    0      1      2      3
partName  name1  name2  name1  name3
Dataset d;
...
// Default split method (without explicit splitter)
PartitionManager pm = new PartitionManager(d.size());
d.setPartitionManager(pm);
pm.shuffle();
pm.split(new double[]{0.8, 0.2}, new String[]{"train", "test"});

// Learn from the training subset
NaiveBayesLearner nbLearner = new NaiveBayesLearner();
NaiveBayesModel nbModel = nbLearner.learn(d.getSubset("train"));

// Classify the test subset dataset
NaiveBayesClassifier nbCf = new NaiveBayesClassifier(nbModel);
NaiveBayesClassification nbCt = nbCf.classify(d.getSubset("test"));

// Or use cross validation
pm.splitForCrossValidation(10, new String[]{"train", "test"});
for (int i = 0; i < 10; ++i) {
    pm.setCurrentFold(i);
    nbModel = nbLearner.learn(d.getSubset("train"));
    nbCf.classify(d.getSubset("test"));
}
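The slides do not say how setCurrentFold decides which items are in "train" and which are in "test". One plausible scheme, sketched below purely as an assumption, assigns positions in itemOrder to folds round-robin and treats the current fold as the test part. The class `FoldSketch` and both methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of fold assignment; not the project's PartitionManager.
class FoldSketch {
    // Items whose position in itemOrder falls into the given fold (round-robin assignment).
    static List<Integer> testItems(int[] itemOrder, int nFolds, int foldNo) {
        checkArgs(nFolds, foldNo);
        List<Integer> test = new ArrayList<>();
        for (int pos = 0; pos < itemOrder.length; pos++) {
            if (pos % nFolds == foldNo) {
                test.add(itemOrder[pos]);
            }
        }
        return test;
    }

    // All remaining items form the training part for that fold.
    static List<Integer> trainItems(int[] itemOrder, int nFolds, int foldNo) {
        checkArgs(nFolds, foldNo);
        List<Integer> train = new ArrayList<>();
        for (int pos = 0; pos < itemOrder.length; pos++) {
            if (pos % nFolds != foldNo) {
                train.add(itemOrder[pos]);
            }
        }
        return train;
    }

    private static void checkArgs(int nFolds, int foldNo) {
        if (nFolds < 2 || foldNo < 0 || foldNo >= nFolds) {
            throw new IllegalArgumentException("invalid fold configuration");
        }
    }
}
```

Under this scheme every item appears in the test part of exactly one fold, which is the property cross validation relies on. Contiguous blocks would satisfy it equally well.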
How Partition Manager Works:
Sample Code:
pm.split(new int[]{2, 3, 1, 2}, new String[]{"name1", "name2", "name1", "name3"});
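In the call above, two parts share the name "name1". The sketch below shows one way such a split could be resolved into item lists, assuming that parts are contiguous runs of itemOrder and that a lookup by name returns the union of all parts with that name. The class `PartitionSketch`, its `itemsIn` method, and the item order used in the usage note are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of size-based partitioning with named parts.
class PartitionSketch {
    private final int[] itemOrder;
    private int[] partSizes = new int[0];
    private String[] partNames = new String[0];

    PartitionSketch(int[] itemOrder) {
        this.itemOrder = itemOrder.clone();
    }

    // Records consecutive parts of the given sizes; the sizes must cover every item exactly once.
    void split(int[] partSizes, String[] partNames) {
        int total = 0;
        for (int size : partSizes) {
            total += size;
        }
        if (partSizes.length != partNames.length || total != itemOrder.length) {
            throw new IllegalArgumentException("sizes and names must match and cover all items");
        }
        this.partSizes = partSizes.clone();
        this.partNames = partNames.clone();
    }

    // Collects the items of every part carrying the given name, in itemOrder sequence.
    List<Integer> itemsIn(String partName) {
        List<Integer> items = new ArrayList<>();
        int start = 0;
        for (int p = 0; p < partSizes.length; p++) {
            if (partNames[p].equals(partName)) {
                for (int i = start; i < start + partSizes[p]; i++) {
                    items.add(itemOrder[i]);
                }
            }
            start += partSizes[p];
        }
        return items;
    }
}
```

For example, with an item order of 6 2 1 3 8 7 4 5 and sizes {2, 3, 1, 2}, the parts are [6, 2], [1, 3, 8], [7], and [4, 5], so `itemsIn("name1")` would yield 6, 2, 7 under these assumptions.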
Corpus
…
…
Model
…
…
Dataset
…
…
<<interface>>
Persistable
+ save(filename: String)
+ load(filename: String)
+ save(ds: DataSource)
+ load(ds: DataSource)
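A minimal Java sketch of the filename-based half of Persistable follows. The DataSource overloads are omitted because the slides do not define DataSource. The `NamedModel` class, the checked IOException, and the plain-text file format are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Filename-based subset of the interface on the slide.
interface Persistable {
    void save(String filename) throws IOException;
    void load(String filename) throws IOException;
}

// Hypothetical implementer that persists a single field as UTF-8 text.
class NamedModel implements Persistable {
    private String name;

    NamedModel(String name) {
        this.name = name;
    }

    String getName() { return name; }

    @Override
    public void save(String filename) throws IOException {
        Files.write(Paths.get(filename), name.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void load(String filename) throws IOException {
        name = new String(Files.readAllBytes(Paths.get(filename)), StandardCharsets.UTF_8);
    }
}
```

Because Corpus, Dataset, Model, and Classification would all implement the same interface, client code could save and restore any of them without knowing the concrete type.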
CorpusWriter
…
+ write(c: Corpus, filename: String)…
CorpusReader
…
+ read(filename: String): Corpus…
DatasetReader
…
+ read(filename: String): Dataset…
DatasetWriter
…
+ write(d: Dataset, filename: String)…
Classification
…
…
Persistence
Settings
…
…
java.util.Properties
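The diagram pairs Settings with java.util.Properties without listing members. The sketch below assumes Settings extends Properties and adds typed getters with defaults. The `getInt` and `getDouble` methods are hypothetical conveniences, not members shown on the slide.

```java
import java.util.Properties;

// Assumed relationship: Settings is a java.util.Properties with typed accessors.
class Settings extends Properties {
    private static final long serialVersionUID = 1L;

    // Returns the value as an int, or the default if the key is missing or malformed.
    int getInt(String key, int defaultValue) {
        String value = getProperty(key);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Returns the value as a double, or the default if the key is missing or malformed.
    double getDouble(String key, double defaultValue) {
        String value = getProperty(key);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Double.parseDouble(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```

Building on Properties would give Learner, Classifier, and ClassificationEvaluator settings that can be read from and written to standard .properties files through the inherited load and store methods.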