Alignment-free sequence comparison
Analysis of Biological Sequences 140.638
Why not just align the sequences?
• alignment scoring can be arbitrary
• current alignment algorithms are not scalable: tedious and slow to do sequence alignment on a large scale (especially short read sequencing)
• sequences may not align to each other well enough to give recognizable distances (gaps etc.)
• alignment algorithms assume generally collinear sequences
Why not just align the sequences?
• below “twilight zone” of 60-65% identity (nucleotide) or 20-35% identity (protein), alignments are not accurate
• memory and time consuming (prohibitive for multiple genomes)
• algorithms make implicit assumptions about evolutionary trajectories of sequences being compared
resolution-free sequence comparison methods
• word counting/composition comparison
• Universal sequence maps (CGR)
• Kolmogorov complexity
• Complete composition vectors
word-based distance
word-based distances
• word size 2-6 works well for protein comparisons
• 8-10nt words useful for DNA or RNA
• long words (~25nt) can distinguish very closely related bacterial species in metagenomics applications
word-based approach
word-based distances
• determine relative frequency of each word:
Oij = # times word Oj appears in sequence i
fij = frequency of word Oj in sequence i
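The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function names `word_counts` and `word_frequencies` are hypothetical, words are taken as overlapping windows of length k, and the frequency is assumed to be the count divided by the total number of words in the sequence.

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def word_frequencies(seq, k):
    """Relative frequency of each observed word: count / total number of words."""
    counts = word_counts(seq, k)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Example: the 2-letter words of a short DNA string.
freqs = word_frequencies("ACGTACGT", 2)
```

Words that never occur are simply absent from the dictionary, which keeps memory use proportional to the number of distinct words seen rather than to the full 4^k word space.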
word-based distances
comparing two sequences x and y:
used this method to compare mitochondrial DNA from primate species
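The comparison formula on this slide did not survive in the transcript. As a stand-in, the sketch below uses the Euclidean distance between the two word-frequency vectors, which is one common choice for word-based distances; treat the metric and the function names as assumptions rather than the slide's exact method.

```python
from collections import Counter
from math import sqrt

def word_frequencies(seq, k):
    """Relative frequencies of overlapping words of length k."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def euclidean_word_distance(x, y, k):
    """Euclidean distance between word-frequency vectors (assumed metric)."""
    fx, fy = word_frequencies(x, k), word_frequencies(y, k)
    words = set(fx) | set(fy)  # union, so words missing from one sequence count as 0
    return sqrt(sum((fx.get(w, 0.0) - fy.get(w, 0.0)) ** 2 for w in words))
```

Identical sequences give a distance of 0, and sequences with no words in common give the largest values.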
improved word-based methods
a single change in a word creates a new word -- biologically realistic?
instead use word neighborhoods, e.g. CATTATT, CATTATA, CATTAAT...
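One simple way to build such a neighborhood is to take every word within one mismatch of the original. The sketch below assumes that definition (Hamming distance of at most 1 over the DNA alphabet); the function name is hypothetical, and other neighborhood definitions, weighted or otherwise, are possible.

```python
def hamming_neighborhood(word, alphabet="ACGT"):
    """Return the word itself plus every word that differs at exactly one position."""
    neighbors = {word}
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                neighbors.add(word[:i] + letter + word[i + 1:])
    return neighbors

hood = hamming_neighborhood("CATTATT")
```

For a word of length 7 this yields 1 + 7 × 3 = 22 words, including CATTATA and CATTAAT from the slide.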
N2 similarity score
N2 similarity score
• defines a (potentially weighted) set of words that are the “neighborhood” of any word
• compute word neighborhood counts
• correct for inter-variable frequency (e.g. observations of CAAAA and AAAAA are strongly correlated)
• correct for word covariance
• normalize so that all word frequencies sum to one
N2 similarity score: distinguishing enhancers
D2 score
if XW is the count of word W in the sequence X,
D2 is Poisson distributed
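The formula itself is missing from the transcript. In its usual form, D2 is the sum over all words W of the product of the two counts, D2 = Σ_W X_W · Y_W. A minimal sketch under that definition, with hypothetical function names:

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y, k):
    """D2 = sum over words W of X_W * Y_W (Counter returns 0 for unseen words)."""
    cx, cy = word_counts(x, k), word_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items())
```

Only words present in both sequences contribute, so iterating over the words of one sequence is enough.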
improvements to D2 score
D2S is normally distributed, D2* is the sum of independent normally distributed variables
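The two variants are usually written in terms of centered counts X̃_W = X_W − n̄·p_W, where n̄ = n − k + 1 is the number of words in the sequence and p_W is the probability of word W under a background model. Then D2S = Σ_W X̃_W·Ỹ_W / sqrt(X̃_W² + Ỹ_W²) and D2* = Σ_W X̃_W·Ỹ_W / sqrt(n̄·m̄·p_W^X·p_W^Y). The sketch below assumes the simplest background model, independent letters with frequencies estimated from each sequence separately, and sequences containing only A, C, G, T; published implementations may use Markov backgrounds or pooled estimates instead.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def centered_counts(seq, k, alphabet="ACGT"):
    """Return ({word: X_w - n_bar * p_w}, {word: p_w}, n_bar) under an i.i.d. letter model."""
    n_bar = len(seq) - k + 1
    letter_p = {a: seq.count(a) / len(seq) for a in alphabet}
    counts = Counter(seq[i:i + k] for i in range(n_bar))
    centered, p = {}, {}
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        p[w] = prod(letter_p[a] for a in letters)  # word probability = product of letter probabilities
        centered[w] = counts[w] - n_bar * p[w]
    return centered, p, n_bar

def d2_star_and_d2s(x, y, k):
    """Return (D2*, D2S) for sequences x and y at word length k."""
    xc, px, n_bar = centered_counts(x, k)
    yc, py, m_bar = centered_counts(y, k)
    d2_star = d2s = 0.0
    for w in xc:
        num = xc[w] * yc[w]
        var = n_bar * m_bar * px[w] * py[w]
        if var > 0:  # skip words with zero background probability
            d2_star += num / sqrt(var)
        norm = sqrt(xc[w] ** 2 + yc[w] ** 2)
        if norm > 0:  # skip words where both centered counts are zero
            d2s += num / norm
    return d2_star, d2s
```

Enumerating all 4^k words is fine for small k such as the 5-tuples used later in the deck, but becomes expensive for long words.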
metagenomics with 5-tuples
k-tuple scores and metagenomics
metagenomics with D2*
clustering of gut bacteria from foregut fermenters, hindgut fermenters, and carnivores
more metagenomics
speeding up kmer distances
CAFE workflow
source sequences can be whole genomes, contigs, or short reads
CAFE results
resolution-free sequence comparison methods
• word counting/composition comparison
• Complete composition vectors
• Universal sequence maps (CGR)
• Kolmogorov complexity
Universal sequence maps
• Chaos theory / Chaos game representation
• Iterative functions to represent biological sequences
• Can be generalized to any order alphabet (thus “universal”)
Chaos game representation
• Plot sequence in a square with vertices labeled A, C, T, G
• 1st nt is plotted halfway between the center of the square and the vertex labeled with that nt
• Subsequent nts are plotted halfway between previous dot and the vertex labeled with the new nt
-> 2D plot representing 1° DNA sequence for ANY length
• patterns are usually fractal
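The plotting rule translates directly into an iteration over the sequence. The sketch below is illustrative only: the assignment of nucleotides to corners (`CORNERS`) is an assumption, since the slide's figure is not part of the transcript, and the function name is hypothetical.

```python
# Assumed corner layout on the unit square; any fixed assignment works the same way.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq, corners=CORNERS):
    """Return one (x, y) point per nucleotide, starting from the center of the square."""
    x, y = 0.5, 0.5
    points = []
    for nt in seq:
        cx, cy = corners[nt]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the nucleotide's corner
        points.append((x, y))
    return points

points = cgr_points("ACG")
```

Each point depends on the entire preceding sequence, but most strongly on the most recent nucleotides, which is why sequences ending in the same word land in the same small region of the square.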
Chaos game representation
[Figure: CGR square with vertices labeled C, A, T, G, showing the example sequence ACG]
Chaos game representation
Chaos game representation
Sierpiński Sieve
Chaos game representation
Genomic signature: dinucleotide & trinucleotide relative abundance profiles distinguish between organisms and sequence segments, and can be used in phylogenetic analysis
We see less variation of CGR along genomes than between genomes -> related to genomic signature?
Chaos game representation
What determines the pattern in a CGR?
• Short nucleotide frequencies don’t solely determine the pattern
• For a DNA sequence, one can construct a simulated sequence with the same length and nucleotide composition
• IF CGRs are the same, then nucleotide and dinucleotide frequencies are all that’s important
making CGR computable
• Hatje & Kollmar divide the CGR grid to outline short oligos & then get frequencies of those oligos
• Almeida et al proved that the length of the common prefix between two CGR is the dissimilarity distance
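The grid idea in the first bullet can be sketched as follows: divide the unit square into 2^k × 2^k cells and count the CGR points in each cell. Each cell then corresponds to exactly one word of length k, so the cell counts are oligo counts. This is an illustration of the general idea, not Hatje & Kollmar's implementation; the corner layout and function name are assumptions.

```python
from collections import Counter

# Assumed corner layout on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def fcgr(seq, k, corners=CORNERS):
    """Count CGR points per cell of a 2^k x 2^k grid; one cell per word of length k."""
    size = 2 ** k
    x, y = 0.5, 0.5
    grid = Counter()
    for i, nt in enumerate(seq):
        cx, cy = corners[nt]
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= k - 1:  # from here on the point encodes a full k-letter word
            # Clamp the index: floating-point rounding on long homopolymer runs can reach 1.0.
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[(col, row)] += 1
    return grid
```

For example, `fcgr("ACGTACGT", 2)` fills four cells, one each for AC, CG, GT and TA, with counts matching the 2-letter word counts of the sequence.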
3D Chaos game representation (HPV)
computing feature vectors from 3D CGR (HPV)
Kolmogorov complexity
• K(x) is the length of the shortest binary program that can compute the string x on a binary computer
• K(x|y) is the length of the shortest binary program that can compute the string x, given the information from y
• NID(x,y) = max[K(x|y), K(y|x)]/max[K(y), K(x)] is the normalized information distance (0 ≤ NID ≤ 1)
• Can be shown that NID(x,y) can express all other distances between x and y!
. . . Not computable though :(
Kolmogorov complexity
NID(x,y) = max[K(x|y), K(y|x)]/max[K(y), K(x)]
NCD = normalized compression distance, approximation to NID
NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)])/max[C(x),C(y)]
Where C(x) is the compressed size of the string x, C(xy) is the compressed size of the string x concatenated to y, etc. Can just use gzip!
x: AAAAAAA
y: ACGAATA
xy: AAAAAAAACGAATA
yx: ACGAATAAAAAAAA
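A minimal sketch of the NCD formula above, using Python's standard `gzip` module as the compressor. The function names are hypothetical, and C(·) is taken here to be the compressed size in bytes.

```python
import gzip

def compressed_size(data: bytes) -> int:
    """C(.): number of bytes after gzip compression."""
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    """NCD(x,y) = (min[C(xy), C(yx)] - min[C(x), C(y)]) / max[C(x), C(y)]."""
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy, cyx = compressed_size(bx + by), compressed_size(by + bx)
    return (min(cxy, cyx) - min(cx, cy)) / max(cx, cy)
```

Very short strings such as the 7-letter x and y above are dominated by gzip's fixed header overhead, so the score only becomes informative for sequences of at least a few hundred letters.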
>seq1TAGAAATAAATGGAAAGTCAGTAAATGTGTGGCCTGTTAAAATTCTTGGAGAATATACATCACCACTTTCCTCCAAAAATGGGAATAGAATTAGTTCGAATAATTTAGAGAAAAGCACCAACAAACAAATCCACTCAGAATTCTCCATTTCTAGATTGCCCAGAACTAGGCCACGGCAGCTGGGTTCTGAGCAAGACAGTGAGGTTTTCCCTTCCGACCAGGGTGTCAAGAAGAATTGTAAGCAGATTGAATCTGCTAAATTATTACCTGATACACCCGTTCAATTCATACCTCCAAATACATTGAACCTTCGTAGCTTTACCAAGATCATAAAGAGACTGGCTGAACTGCATCCAGAAGTCAGCAGAGACCATATTATAAATGCACTTCAGGAAGTGAGAATAAGACATAAAGGTTTTCTGAATGGCTTATCTATTACTACTATTGTGGAGATGACTTCATCTCTTCTGAAAAACTCTGCTTCCAGTTAGGAATTCAAAAAACAATAAAGAGAACTTCCTTGGAAAGTGTGTTTCCTCCTTCAGAGAATGTTCTACAGCACTTAGGAAAAAGTAGTAATAACAAGATGATGTAATTAAATAGGCTCTATAAATGGGCTAAGCTGTTAAAATATTCTACTTTATATCCCTCCTTTAAAATCTAGCAACAGTTGTCTATACAATATTAAGATCTTCTCTATATATTTAAAGTTAAAATATAATTTTTAATAAGTTTTTAAATTTTTTTATTTCAATTTTGTTACTTAGAACATTAAGATGCATATTTGTGATCTAAAGAAATTGTCTTGTCCATTTTAAAAACCTTTATTAAGTCACTTTTAAAATGTATTGACCAAGAAGGAGGTTTGTTGTTACATCAATGTTTGTGAAATGATTTCCATACATAAAAAATGTAATTTACCTGAACTTTGTCTTAAGACTCTTACATTGGATTATAGGATAACAGATAAATAAACTGTATAGATACATTCAGTATCATACAACATTTTGGAATGTGTATGCTTTCAGGCTTCCAAGATAATTAAATTACTAGAAATAAATGGAAAGTCAGTAAATGTGTGGCCTGTTAAAATTCTTGGAGAATATACATCACCACTTTCCTCCAAAAATGGGAATAGAATTAGTTCGAATAATTTAGAGAAAAGCACCAACAAACAAATCCACTCAGAATTCTCCATTTCTAGATTGCCCAGAACTAGGCCACGGCAGCTGGGTTCTGAGCAAGACAGTGAGGTTTTCCCTTCCGACCAGGGTGTCAAGAAGAATTGTAAGCAGATTGAATCTGCTAAATTATTACCTGATACACCCGTTCAATTCATACCTCCAAATACATTGAACCTTCGTAGCTTTACCAAGATCATAAAGAGACTGGCTGAACTGCATCCAGAAGTCAGCAGAGACCATATTATAAATGCACTTCAGGAAGTGAGAATAAGACATAAAGGTTTTCTGAATGGCTTATCTATTACTACTATTGTGGAGATGACTTCATCTCTTCTGAAAAACTCTGCTTCCAGTTAGGAATTCAAAAAACAATAAAGAGAACTTCCTTGGAAAGTGTGTTTCCTCCTTCAGAGAATGTTCTACAGCACTTAGGAAAAAGTAGTAATAACAAGATGATGTAATTAAATAGGCTCTATAAATGGGCTAAGCTGTTAAAATATTCTACTTTATATCCCTCCTTTAAAATCTAGCAACAGTTGTCTATACAATATTAAGATCTTCTCTATATATTTAAAGTTAAAATATAATTTTTAATAAGTTTTTAAATTTTTTTATTTCAATTTTGTTACTTAGAACATTAAGATGCATATTTGTGATCTAAAGAAATTGTCTTGTCCATTTTAAAAACCTTTATTAAGTCACTTTTAAAATGTATTGACCAAGAAGGAGGTTTGTTGTTACATCAATGTTTGTGAAATGATTTCCATACATAAAAAATGTAATTTACCTGAACTTTGTCTTAAGACTCTTA
CATTGGATTATAGGATAACAGATAAATAAACTGTATAGATACATTCAGTATCATACAACATTTTGGAATGTGTATGCTTTCAGGCTTCCAAGATAATTAAATTAC
>seq2ATTTATAGAGAAGCCAGTGTTAAGCCGTACTTAAGGTTCACATTTGTAATGAAATAGGTAACTGGGCCTCCACAAGTTCCATGGGAATCGCAGACTAACCATTTGGTTTTCCTCTGCCTCATTTTCTCCTCCTCCTCCTGCTCCTCCTCTTCCTCCTCCCCTCTCTTTAGCATCCTCCTCCTCCTTCTTCTTCTACATCCTCCTTTTCCTCTTCCTCCTCCATCTTCTCCTCTCCTTCTCCTCTTCCTCCCCTTCTTCATCTATTCATTCTTCCTTGAGCCTCCTGGCCCACTAGGGCCCTTCTATCTTGCATCACCTCTGCCCTCTCAAGGCATGCAATATCCTGTATCTCATTCTTCCTTTAGTTCAGCTGCCTTCTCTTCACATGGTGGTCTATCTTGGGCTGTCTGCTCAGACCACATCTCACCCAATTTCCTTGCTACATTCCCAGTGGACAAGCCCGGTGATTCACTCTTGATCTTTGGACAATATTCAGAATGAAGCAGGAAGAAAGCAAGCGGTAGTCTTTTGTGAGTACCTAAGTCTTCATTTTTCTTCAGGTCCTTTCTTATTGCCTTTAAGAGGAACATAATTCTTCATCAGCTATCATAGCCTCAGAGCAAGCCTTGTCACTTGGAGCTGTATCTTCAGGTTTCACCTTTTCCTTTGTAGGCATGAAGGTCCTCTCCAAGAACTCAGCAAAGCTGACTGGACCCAGGCATTTCTTTCTGTTCTCCTGGAAGTCTGCAGGAAGACAGCTCCTGGGCCTTTTCTTCCTCCAGCCAACCCAGTCTCCTTCACCCAAGGTGACCCATGGCGTGCGGGGAGAAGGGGGGCTCTATCTGAGTGGGCTTTTTCCTGAGTCCAAACCAGATGCTTCCTTCTCCATACGATTGTCAGCTGGCTTCACTTTTCATATTATTTTAAGCTTTAATTATTTTTCTCTCCTTGCAGAGCAACAATTGTGGTAATAAAACCAGATACCAACTCTTATCTCAGGTTAGTAATAAAGTTGTTGCCTACTATCTAGAAATGTACCTGCCTTTTCTTTTTTCTTTCCTTTTCTCTTTCCTTTCCTTTCCTTTCCTTTCCTTTTCGTTTCTTTTCTTTTCTGTAAAATGTGGCAATTTACAGGTTGGGATGTATCACCGTTGGTGGAGTGTTTACCTAGCTAGTATATACAAAGCCCTTGTTTAAATTCCTAGCACTGGGTAGGTATGGTGACTCGTGTCTGTAATCTCAGAACTCTACAGGTAGACATGTGGGAAGCAGAAATTCATCCTCAGCATACAGTGAGTTTCAAGTTGGCCTGACCCAGAAGAACTCAGGGGAAAAAAGCTGATGTCTTTTCTCTCTCTCTCTCTCTCTTTCTCTCTCTCTCTCTCTCTCTCTCTCTTTCTCTCTTTCATAATTCTTTTGGTAGAGAGAAGGAAAGAGATGAACATGTATTAAGTTCCCTGGTATCTACCAAATTTGTGTATTACTTGTCGGTTAATATTATAACAAACATTAAATTGTATTCAGAACCATATTTTGATTATTATCTTTGTGTGCTTTGGATCTCACGACAGTAATAGTTACCTGAGGTGCTTAACTACCGTTTCTGTGACAGTAAATTATTTAAGTTTACTCTCTCCCTCTACAGCCCAACAGTGTGTAGTTTGTATGGTTCATTTGTTGTTGGCTTGTTGTTATTGATGTTGTTTGTGTTGCTGATGCTAGAGTCTGGGGCCTTGGACATATTCGGCAGGCAAATGCTCCACCACTGAGCCTCCAGCCACTTTGCTGGAGGTTTTTGTAGCTGTAGATTGTAATGAAGAAGTTTTTCATCTTTTATATTTGAAAAAGATACCACGGCACGATACACAGCTACAACCAATGCACTAAGATAAATAACCAACCCAACAGAGTGACATTATGATGCAGTAGTTGTAAGAATCAATTTAAAAGATATATCACTTCATCCTTGGGTTTGCCTATGTTCTCATCTGTGAG
ATTTAAAATCTTTTGAAACATTGAATGAAGCCTCTCATCTATCATCAACTGCCATTAAATATCACATATTCACAGCTGGAGAAATGGACCAGCCGACATCCGGAA
sequence comparison by gzip
bytes filename
2130 seq1
2130 seq2
4260 seq1.seq1
4260 seq2.seq2
4260 seq1.seq2
4260 seq2.seq1
sequence comparison by gzip
bytes filename
422 seq1.gz
735 seq2.gz
443 seq1.seq1.gz
766 seq2.seq2.gz
1104 seq1.seq2.gz
1088 seq2.seq1.gz
NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)])/max[C(x),C(y)]
= (min(1104/4260, 1088/4260) - min(766/4260, 443/4260))/max(766/4260, 443/4260)
= (1088/4260 - 443/4260) / (766/4260) = 0.84
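The arithmetic can be checked by plugging in the byte counts listed above. Note that the slide's calculation takes the self-concatenated files (seq1.seq1, seq2.seq2) as C(x) and C(y), so every term is a compressed fraction of the same 4260-byte input; the variable names below are illustrative.

```python
# Compressed sizes in bytes, copied from the gzip listing above.
sizes = {"seq1.seq1": 443, "seq2.seq2": 766, "seq1.seq2": 1104, "seq2.seq1": 1088}
uncompressed = 4260  # every concatenated file is 4260 bytes before compression

c_x = sizes["seq1.seq1"] / uncompressed
c_y = sizes["seq2.seq2"] / uncompressed
c_xy = sizes["seq1.seq2"] / uncompressed
c_yx = sizes["seq2.seq1"] / uncompressed

ncd_value = (min(c_xy, c_yx) - min(c_x, c_y)) / max(c_x, c_y)
print(round(ncd_value, 2))  # 0.84
```

Because all four files have the same uncompressed size, the 4260 cancels and the result equals (1088 − 443) / 766.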