Classification of Transcription Start Sites in the Human Genome
Ann He, Chloe Siebach, and Zandra Ho, under the direction of Nathan Boley
Introduction to Machine Learning, Stanford University
With the dramatic growth of genomic sequence data, there have been new initiatives to annotate the data using machine learning techniques. One aspect of epigenomic annotation is the labeling of gene regions surrounding transcription start sites (TSS) as promoters or enhancers, the two main classes of functional regions that influence the rate of protein synthesis from DNA. As such, the ability to distinguish between these regions is imperative to our understanding of gene regulation.
In computational genetics, the k-mers of a DNA sequence refer to all of its possible subsequences of length k.
For each sequence in our data set, we created a feature vector containing the count of each distinct 6-mer appearing in that sequence.
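The feature construction described above can be sketched as follows; the function name is ours, and we assume a fixed lexicographic ordering over all 4^k possible k-mers so that feature indices are consistent across sequences.

```python
from collections import Counter
from itertools import product

def kmer_feature_vector(seq, k=6):
    """Count every length-k substring of seq and return a feature
    vector over all 4**k possible k-mers (illustrative sketch)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic ordering keeps feature indices consistent.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] for kmer in all_kmers]
```

For k = 6 this yields a 4096-dimensional count vector per sequence.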
We would like to thank our project mentor Dr. Nathan Boley for his guidance throughout this project. All data is from the Roadmap Epigenomics Project.
Random Forest and Boosting
Neural Nets
One-hot encoding converts DNA sequences into matrices (above) that can be used by neural nets. We considered two neural net architectures (below).
As the model was being trained, it recorded loss on the training and validation sets. To prevent overfitting, training stopped when validation error started to increase.
Using 10-fold cross validation, we found architecture 2 to outperform architecture 1 based on AUPRC. Moving forward with NN2, we again used 10-fold cross validation to determine the optimal number of filters for our first convolutional layer.
The number of estimators is a parameter in both random forest and gradient boosting. We used 10-fold cross validation to determine the optimal number for both. For boosting we considered AUROC, and for random forest we considered both AUROC and AUPRC.
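The 10-fold splitting used throughout can be sketched as below; this is a minimal stand-in for a library utility such as scikit-learn's `KFold`, and the hyperparameter sweep in the comment (values and `score_model` helper) is purely illustrative.

```python
def kfold_splits(n, k=10):
    """Yield (train_idx, val_idx) index lists for k-fold cross
    validation over n examples (minimal sketch)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

# A sweep over the number of estimators would average a score per fold:
# for n_est in [50, 100, 200, 400]:          # hypothetical grid
#     scores = [score_model(n_est, tr, va) for tr, va in kfold_splits(N)]
```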
GCGGCGCGCCGCCCGCGCCGCGGCCCGCCG
GGGCGGCCGCCCCGGCCGCGCGCGCCGCGG
CGCCGCCCGCCCGGGCGGCGGCGGGCGGCG
GGCGGCGGGCCCGCGCCGGGGGCCCAGGCA
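Sequences like those above are one-hot encoded before being fed to the neural nets. A minimal sketch of that encoding follows; the A, C, G, T column ordering is our assumption, as the poster does not specify one.

```python
def one_hot(seq):
    """Encode a DNA string as a len(seq) x 4 binary matrix, with
    columns ordered A, C, G, T (ordering is an assumption)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    matrix = [[0, 0, 0, 0] for _ in seq]
    for i, base in enumerate(seq):
        matrix[i][index[base]] = 1
    return matrix
```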
NN1: Convolution, Dropout, Flatten, Dense, Dropout, Activation, Dense, Dropout, Activation, Dense, Activation
NN2: Convolution, Dropout, Maxpool, Convolution, Dropout, Maxpool, Flatten, Dense
Above we display the sensitivities of two filters before (left) and after (right) convergence.
We extracted the top 10 most important k-mers for classifying a TSS region as a promoter or enhancer. Highlighted in red are the k-mers that random forest and boosting had in common.
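Extracting a top-10 ranking from per-feature importance scores (e.g. a random forest's `feature_importances_` array) can be sketched as below; the names and scores here are illustrative, not the poster's actual values.

```python
def top_k_features(importances, feature_names, k=10):
    """Return the k feature names with the highest importance scores
    (illustrative sketch)."""
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# The overlap between two models' rankings is just a set intersection:
# shared = set(top_k_features(rf_imp, kmers)) & set(top_k_features(gb_imp, kmers))
```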
The results of our optimized random forest and boosting models on the test data set are as follows:
Random Forest: AUPRC = 0.989, AUROC = 0.885
Gradient Boosting: AUPRC = 0.988, AUROC = 0.881
The results of our optimized neural net 2 architecture on the test data set are as follows: AUROC = 0.947, AUPRC = 0.994.