bioinformatics - RIP Tutorial · from: bioinformatics It is an unofficial and free bioinformatics...

bioinformatics

#bioinformat

ics

1

1: 2

2

Examples 2

2

.GFF 2

DNA 3

2: BLAST 5

Examples 5

DNA blastdb 5

nucl blastdbfasta 5

ubuntu 5

blastdbGI 6

3: FASTA 7

Examples 7

AWKFASTA 7

1 7

UniprotFASTA 7

4: fastq 9

Examples 9

9

Awk 9

5: 11

Examples 11

GC 11

6: 12

Examples 12

12

MAF 12

GCT 13

fasta 13

7: 15

Examples 15

15

sambam 15

16

You can share this PDF with anyone you feel could benefit from it, downloaded the latest version from: bioinformatics

It is an unofficial and free bioinformatics ebook created for educational purposes. All the content is extracted from Stack Overflow Documentation, which is written by many hardworking individuals at Stack Overflow. It is neither affiliated with Stack Overflow nor official bioinformatics.

The content is released under Creative Commons BY-SA, and the list of contributors to each chapter are provided in the credits section at the end of this book. Images may be copyright of their respective owners unless otherwise specified. All trademarks and registered trademarks are the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct nor accurate, please send your feedback and corrections to [email protected]

https://riptutorial.com/ja/home 1

http://riptutorial.com/ebook/bioinformaticshttps://archive.org/details/documentation-dump.7zmailto:[email protected]

1: バイオインフォマティクスのまりバイオインフォマティクスは、データをするためのとソフトウェアツールをするなです。

バイオインフォマティクスのトピックはのとおりです。

シーケンス••

モデリング•およびタンパクの•

Examples

Wikipediaバイオインフォマティクスは、データをするためのとソフトウェアツールをするなです。なとして、バイオインフォマティクスは、コンピュータサイエンス、、、およびエンジニアリングをみわせて、データをおよびします。バイオインフォマティクスは、およびをいたクエリーのin silicoにされています。

.GFFファイルパーサーバッファーとしてのみをするフィルター

""" A [GFF parser script][1] in Python for [www.VigiLab.org][2] Description: - That performs buffered reading, and filtering (see: @filter) of .GFF input file (e.g. "[./toy.gff][3]") to keep only rows whose field (column) values are equal to "transcript"... Args: - None (yet) Returns: - None (yet) Related: - [1]: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/README.md - [2]: http://www.vigilab.org/ - [3]: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/toy.gff """ gene_to_field = {} # dict whose keys: genes represented (i.e. later slice-able/index-able) as 1..n, values, where n = 8 total #fields (cols) of a gff row, whose version is unknown but example is: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/toy.gff gene_i = 0 with open("./toy.gff", "r") as fi: print("Reading GFF file into: gene_to_field (dict), index as such: gene_to_field[gene_i], where gene_i is between 1-to-n...") while True: # breaks once there are no more lines in the input .gff file, see "@break"


line = fi.readline().rstrip() # no need for trailing newline chars ("\n") if line == "": # @break break line_split = line.split("\t") # turn a line of input data into a list, each element = different field value, e.g. [...,"transcript",...] if line_split[2] != "transcript": # @@filter incoming rows so only those with "transcript" are not skipped by "continue" continue gene_i += 1 # indexing starts from 1 (i.e. [1] = first gene) ends at n ##@TEST: sometimes 4.00 instead of 4.0 (trivial) # some @deprecated code, but may be useful one day #if not (str(line_split[5])==str(float(line_split[5]))): # print("oops") # print("\t"+str(line_split[5])+"___"+str(float(line_split[5]))) # create a dict key, for gene_to_field dict, and set its values according to list elements in line_split gene_to_field[gene_i] = { \ "c1_reference_seq":line_split[0],# e.g. 'scaffold_150' \ "c2_source":line_split[1],# e.g. 'GWSUNI' \ "c3_type":line_split[2],# e.g. 'transcript' \ "c4_start":int(line_split[3]),# e.g. '1372' \ "c5_end":int(line_split[4]),# e.g. '2031' \ "c6_score":float(line_split[5]),# e.g. '45.89' \ "c7_strand":line_split[6],# e.g. '+' \ "c8_phase":line_split[7],# e.g. '.' @Note: codon frame (0,1,2) \ "c9_attributes":line_split[8]# e.g. \ }

にえるためのDNAのマッピングの

くのは、DNAのにすることができます。えば、のレベルをりたければ、そのmRNAをDNAにコピーし、られたDNAをそれぞれし、それらのをゲノムにマップしてから、するのをするそののプロキシとしての RNA-seq 。のには、ゲノムの3Dをすること、ヒストンマークをきめること、およびRNA-DNAをマッピングすることがまれる。なDNAでわれたなののリストはここでつけることができません。

には、ウェットラボのいコートとゴーグルをている々が、されたDNAサンプルをるためのをし、する。その、バイオインフォマティアンコンピュータをしてコーヒーをむは、 FASTQファイルとしてコードされたこれらのをし、それらをゲノムにマッピングし、をBAMファイルとしてします。

のにると、これはバイオインフォマティシャンがFASTQファイルからBAMファイルをするですLinuxシステムを。

STAR --genomeDir path/to/reference/genome --outSAMtype BAM --readFilesIn my_reads.fastq


http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.htmlhttp://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.htmlhttp://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.htmlhttp://science.sciencemag.org/content/326/5950/289http://science.sciencemag.org/content/326/5950/289http://science.sciencemag.org/content/326/5950/289http://science.sciencemag.org/content/326/5950/289http://science.sciencemag.org/content/326/5950/289http://science.sciencemag.org/content/326/5950/289http://science.sciencemag.org/content/316/5830/1497http://science.sciencemag.org/content/316/5830/1497http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011http://dx.doi.org/10.1016/j.cub.2017.01.011https://liorpachter.wordpress.com/seq/https://en.wikipedia.org/wiki/FASTQ_formathttps://en.wikipedia.org/wiki/FASTQ_formathttp://software.broadinstitute.org/software/igv/bamhttp://software.broadinstitute.org/software/igv/bamhttp://software.broadinstitute.org/software/igv/bamhttp://software.broadinstitute.org/software/igv/bam

STARは、スプライスアライナーであるmRNAにしるエクソン - イントロンに。

PSマッピングがされると、クリエイティブパートがされます。ここでは、バイオインフォマティクスが、データがにのあるパターンまたはノイズからまれたをしているかどうかをチェックするためのテストをした。

オンラインでバイオインフォマティクスのまりをむ https://riptutorial.com/ja/bioinformatics/topic/3960/バイオインフォマティクスのまり


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4631051/https://riptutorial.com/ja/bioinformatics/topic/3960/%E3%83%90%E3%82%A4%E3%82%AA%E3%82%A4%E3%83%B3%E3%83%95%E3%82%A9%E3%83%9E%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E3%81%AE%E5%A7%8B%E3%81%BE%E3%82%8Ahttps://riptutorial.com/ja/bioinformatics/topic/3960/%E3%83%90%E3%82%A4%E3%82%AA%E3%82%A4%E3%83%B3%E3%83%95%E3%82%A9%E3%83%9E%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E3%81%AE%E5%A7%8B%E3%81%BE%E3%82%8Ahttps://riptutorial.com/ja/bioinformatics/topic/3960/%E3%83%90%E3%82%A4%E3%82%AA%E3%82%A4%E3%83%B3%E3%83%95%E3%82%A9%E3%83%9E%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E3%81%AE%E5%A7%8B%E3%81%BE%E3%82%8Ahttps://riptutorial.com/ja/bioinformatics/topic/3960/%E3%83%90%E3%82%A4%E3%82%AA%E3%82%A4%E3%83%B3%E3%83%95%E3%82%A9%E3%83%9E%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E3%81%AE%E5%A7%8B%E3%81%BE%E3%82%8Ahttps://riptutorial.com/ja/bioinformatics/topic/3960/%E3%83%90%E3%82%A4%E3%82%AA%E3%82%A4%E3%83%B3%E3%83%95%E3%82%A9%E3%83%9E%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E3%81%AE%E5%A7%8B%E3%81%BE%E3%82%8A

2: BLAST

Examples

DNA blastdbをする

とをするには、のblastdbをするがあります。これは、blastをインストールするときにまれるmakeblastdbをしてわれます。

makeblastdb -in -dbtype nucl -out

したがって、のレコードをむreference.fastaファイルがあるとします。

>reference_1 ATCGATAAA >reference_2 ATCGATCCC

のコマンドをします。

makeblastdb -in reference.fasta -dbtype nucl -out my_database

これにより、のファイルがされます。

my_database.nhr•my_database.nin•my_database.nsq•

データベースファイルには、 -outきがけられていることにしてください。

nucl blastdbからfastaをする

blastdbcmdをしてfastaファイルからされたblastdbからfastaシーケンスをすることができます。このblastdbcmdは、makeblastdbのインストールにインストールするmakeblastdbます。

blastdbcmd -entry all -db -out

あなたは、データベースとばれていたmy_databaseファイルがまれているmy_database.nhr 、 my_database.nsq 、 my_database.nin 、あなたのFASTAのファイルをびすことがしたかったreference.fastaあなたはのことをします

blastdbcmd -entry all -db my_database -out reference.fasta

ubuntuにをインストールする


apt-get install ncbi-blast+

あらかじめインストールされているバージョンをすることができます

http://packages.ubuntu.com/xenial/ncbi-blast+

blastdbからGIとタクシーをする

blastdbcmdをしてblastdbcmdからデータをすることができます。blastdbcmdはblastインストールにめるがあります。のオプションから、 -outfmtとして、どのメタデータをどのでめるかをできます。

マニュアルページから

-outfmt Output format, where the available format specifiers are: %f means sequence in FASTA format %s means sequence data (without defline) %a means accession %g means gi %o means ordinal id (OID) %i means sequence id %t means sequence title %l means sequence length %h means sequence hash value %T means taxid %X means leaf-node taxids %e means membership integer %L means common taxonomic name %C means common taxonomic names for leaf-node taxids %S means scientific name %N means scientific names for leaf-node taxids %B means BLAST name %K means taxonomic super kingdom %P means PIG

このコードはblastdbからgiとtaxidをするをしています。このでは、 NCBI 16SMicrobial ftpblastdbをしました。

# Example: # blastdbcmd -db -entry all -outfmt "%g %T" -out blastdbcmd -db 16SMicrobial -entry all -outfmt "%g %T" -out 16SMicrobial.gi_taxid.tsv

のようなファイル16SMicrobial.gi_taxid.tsvがされます。

939733319 526714 636559958 429001 645319546 629680

オンラインでBLASTをむ https://riptutorial.com/ja/bioinformatics/topic/5371/blast


http://packages.ubuntu.com/xenial/ncbi-blast+ftp://ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gzhttps://riptutorial.com/ja/bioinformatics/topic/5371/blast

3: FASTAの。

Examples

AWKでFASTAシーケンスをする

1ずつむawk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

このawkスクリプトをのようにむことができます

の $0 がfastaヘッダ ^> のようにまる。これがのシーケンスでないは、キャリッジリターンをします。 (N>0?"\n":"")に $0 がき、そのに \t がきます。そしてのをす next;

•

の $0 がfastaヘッダのようにしない、これはデフォルトのawkパターンです。キャリッジリターンなしでをするだけです。

•

END には、のシーケンスのキャリッジリターンのみをします。•

UniprotからFASTAをする

UniProtの10ののFASTAシーケンスをダウンロードししてください

$ curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"|\ gunzip -c |\ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\ head >sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL >sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) GN=FV3-002L PE=4 SV=1 MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCARIKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSLAERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADCKCINRASDPVYQKVKTLHAYPDQCWYVPCAADVGELKMGTQRDTPTNCPTQVCQIVFNMLDDGSVTMDDVKNTINCDFSKYVPPPPPPKPTPPTPPTPPTPPTPPTPPTPPTPRPVHNRKVMFFVAGAVLVAILISTVRW >sp|Q197F8|002R_IIV3 Uncharacterized protein 002R OS=Invertebrate iridescent virus 3 GN=IIV3-002R PE=4 SV=1 MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLPLEIKLDIMQYLSWEQISWCKHPWLWTRWYKDNVVRVSAITFEDFQREYAFPEKIQEIHFTDTRAEEIKAILETTPNVTRLVIRRIDDMNYNTHGDLGLDDLEFLTHLMVEDACGFTDFWAPSLTHLTIKNLDMHPRWFGPVMDGIKSMQSTLKYLYIFETYGVNKPFVQWCTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYRVEDNKFHRRYMYSTILHKRDTDWVENNPLKTPAQVEMYKFLLRISQLNRDGTGYESDSDPENEHFDDESFSSGEEDSSDEDDPTWAPDSDDSDWETETEEEPSVAARILEKGKLTITNLMKSLGFKPKPKKIQSIDRYFCSLDSNYNSEDEDFEYDSDSEDDDSDSEDDC >sp|Q197F7|003L_IIV3 Uncharacterized protein 003L OS=Invertebrate iridescent virus 3 GN=IIV3-003L PE=4 SV=1 MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDTSLNARSLTTTPSLTTCTPPSLAACTPPTSLGMVDSPPHINPPRRIGTLCFDFGSAKSPQRCECVASDRPSTTSNTAPDTYRLLITNSKTRKNNYGTCRLEPLTYGI >sp|Q6GZX2|003R_FRG3G Uncharacterized protein 3R OS=Frog virus 3 (isolate Goorha) GN=FV3-003R PE=3 SV=1


MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSDFKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDNVSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGDAHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMRAAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEYDLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS >sp|Q6GZX1|004R_FRG3G Uncharacterized protein 004R OS=Frog virus 3 (isolate Goorha) GN=FV3-004R PE=4 SV=1 MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY >sp|Q197F5|005L_IIV3 Uncharacterized protein 005L OS=Invertebrate iridescent virus 3 GN=IIV3-005L PE=3 SV=1 MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTCVHSRCAHCSSDGTLCSPEDPTMVWPCCPESSCQLVVGLPSLVNHYNCLPNQCTDSSQCPGGFGCMTRRSKCELCKADGEACNSPYLDWRKDKECCSGYCHTEARGLEGVCIDPKKIFCTPKNPWQLAPYPPSYHQPTTLRPPTSLYDSWLMSGFLVKSTTAPSTQEEEDDY >sp|Q6GZX0|005R_FRG3G Uncharacterized protein 005R OS=Frog virus 3 (isolate Goorha) GN=FV3-005R PE=4 SV=1 MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFVKRNTGKRLPIGKRSNLYVRICDLSGTIYMGETFILESWEELYLPEPTKMEVLGTLESCCGIPPFPEWIVMVGEDQCVYAYGDEEILLFAYSVKQLVEEGIQETGISYKYPDDISDVDEEVLQQDEEIQKIRKKTREFVDKDAQEFQDFLNSLDASLLS >sp|Q91G88|006L_IIV6 Putative KilA-N domain-containing protein 006L OS=Invertebrate iridescent virus 6 GN=IIV6-006L PE=3 SV=1 MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGKRFVDWNKTLRSKKLIQYYETRCDIKTESLLYEIKGDNNDEITKQITGTYLPKEFILDIASWISVEFYDKCNNIIINYFVNEYKTMDKKTLQSKINEVEEKMQKLLNEKEEELQEKNDKIDELILFSKRMEEDRKKDREMMIKQEKMLRELGIHLEDVSSQNNELIEKVDEQVEQNAVLNFKIDNIQNKLEIAVEDRAPQPKQNLKRERFILLKRNDDYYPYYTIRAQDINARSALKRQKNLYNEVSVLLDLTCHPNSKTLYVRVKDELKQKGVVFNLCKVSISNSKINEEELIKAMETINDEKRDV >sp|Q6GZW9|006R_FRG3G Uncharacterized protein 006R OS=Frog virus 3 (isolate Goorha) GN=FV3-006R PE=4 SV=1 MYKMYFLKDQKFSLSGTIRINDKTQSEYGSVWCPGLSITGLHHDAIDHNMFEEMETEIIEYLGPWVQAEYRRIKG

オンラインでFASTAの。をむ https://riptutorial.com/ja/bioinformatics/topic/4194/fastaの-


https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-https://riptutorial.com/ja/bioinformatics/topic/4194/fasta%E9%85%8D%E5%88%97%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96-

4: fastqファイルの

Examples

ペーストの

$ gunzip -c input.fastq.gz | paste - - - - | head @IL31_4368:1:1:996:8507/2 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA + FFCEFFFEEFFFFFFFEFFEFFFEFCFC

オンラインでfastqファイルのをむ https://riptutorial.com/ja/bioinformatics/topic/4286/fastqファイルの


https://riptutorial.com/ja/bioinformatics/topic/4286/fastq%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96https://riptutorial.com/ja/bioinformatics/topic/4286/fastq%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96https://riptutorial.com/ja/bioinformatics/topic/4286/fastq%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96https://riptutorial.com/ja/bioinformatics/topic/4286/fastq%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96https://riptutorial.com/ja/bioinformatics/topic/4286/fastq%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96https://riptutorial.com/ja/bioinformatics/topic/4286/fastq%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%9A%E5%BD%A2%E5%8C%96

5: シーケンス

Examples

シーケンスのGCをする

およびにおいて、GCまたはグアニン - シトシン、してGCは、グアニンまたはシトシンのいずれかであるDNAののパーセンテージであるアデニンおよびチミン。

BioPythonの

>>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> from Bio.SeqUtils import GC >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna) >>> GC(my_seq) 46.875

BioRubyの

bioruby> require 'bio' bioruby> seq = Bio::Sequence::NA.new("atgcatgcaaaa") ==> "atgcatgcaaaa" bioruby> seq.gc_percent ==> 33

Rをって

# Load the SeqinR package. library("seqinr") mysequence

6: なファイル

Examples

ファスタ

FASTAファイルフォーマットは、するとして1つまたはのヌクレオチドまたはアミノをすためにされる。シーケンスは、シーケンスのにある>でまるコメントでがけられます。コメントは、、シーケンスのソースデータベースまたはソフトウェアをすることによってされたでフォーマットされます。えば

>gi|62241013|ref|NP_001014431.1| RAC-alpha serine/threonine-protein kinase [Homo sapiens] MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQLMKTERPRP NTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRSGSPSDNSGAEEMEVSLAK PKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKKEVIVAKDEVAHTLTENRVLQNSRHPFL TALKYSFQTHDRLCFVMEYANGGELFFHLSRERVFSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENL MLDKDGHIKITDFGLCKEGIKDGATMKTFCGTPEYLAPEVLEDNDYGRAVDWWGLGVVMYEMMCGRLPFY NQDHEKLFELILMEEIRFPRTLGPEAKSLLSGLLKKDPKQRLGGGSEDAKEIMQHRFFAGIVWQHVYEKK LSPPFKPQVTSETDTRYFDEEFTAQMITITPPDQDDSMECVDSERRPHFPQFSYSASGTA

のは、NCBIタンパクデータベースからりしたヒトAKT1のアイソフォームのアミノをす。ヘッダーラインは、このがGI ID 62241013およびタンパクID NP_001014431.1でされることをする。このタンパクは、 RAC-alpha serine/threonine-protein kinaseとばれ、、 Homo sapiensする。

フォーマットMAF

MAFファイルフォーマットは、でされるDNAをするためのタブりのテキストファイルフォーマットであり、したヌクレオチドをすためののフォーマットファイルタイプとはなる。のヘッダーとは、さまざまなソースのファイルによってなることがありますが、でされているとのはのとおりです。

Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1


http://www.ncbi.nlm.nih.gov/genbank/sequenceids/https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specificationhttps://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification

Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status4 Validation_Status4 Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID

TCGAからなものなど、くのMAFファイルには、バリアントにのがされています。これらのカラムは、するのためのヌクレオチドID、なコドンまたはアミノ、QCメトリック、などをむことができる。

GCT

GCTファイルフォーマットは、されたまたはRNAiデータをするためにされる、タブりのテキストファイルフォーマットであり、にはマイクロアレイチップからられる。このデータは、1つのに1つのきまたはプローブ、およびごとに1つのチップサンプルをえてでされています。えば

#1.2 22215 2 Name Description Tumor_One Normal_One 1007_s_at DDR1 -0.214548 -0.18069 1053_at RFC2 0.868853 -1.330921 117_at HSPA6 1.124814 0.933021 121_at PAX8 -0.825381 0.102078 1255_g_at GUCA1A -0.734896 -0.184104 1294_at UBE1L -0.366741 -1.209838

このでは、のはGCTファイルのバージョンをします。このは1.2です。 2は、データの 22215 とサンプル 2 をします。ヘッダーは、をNameチッププローブセットとのためにDescriptionおよびアッセイされるサンプルのシンボルプローブセットカバー Tumor_OneとNormal_One 。ヘッダをえたデータは、のプローブセットこの、Affymetrixチッププローブセット、するする、およびサンプルのをする。サンプルデータのは、アッセイタイプとにじてなりますが、はきです。

fastaでのシーケンスみ

これは、fastaのシーケンスみのためのPythonのです。

パラメーター

filenameString - シーケンスをfastaできむためのファイル。•seqString - DNAまたはRNA。•


http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCThttp://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT

idString - されたシーケンスのIDです。•descString - されたシーケンスのい。•

import math def save_fsta(filename,seq,id,desc): fo = open(filename+'.fa',"a") header= str(id)+' \n' fo.write(header) count=math.floor(len(seq)/80+1) iteration = range(count) for i in iteration: fo.write(seq[80*(i):80*(i+1)]+'\n') fo.write('\n \n') fo.close()

のは、 textwrapをすることtextwrap

import textwrap def save_fasta(filename,seq, id, desc): filename+='.fa' with open(filename, 'w') as f: f.write('>'+id+' \n'); text = textwrap.wrap(seq,80); for x in text: f.write(x+'\n');

オンラインでなファイルをむ https://riptutorial.com/ja/bioinformatics/topic/4034/なファイル


https://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8Fhttps://riptutorial.com/ja/bioinformatics/topic/4034/%E4%B8%80%E8%88%AC%E7%9A%84%E3%81%AA%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8F

7: なサムツール

Examples

バムファイルのごとのレコードをカウントする

samtools idxstats thing.bam

samをbamにするそしてもう

Samtoolsはサムとバムののにできます

-bは、ファイルがBAMであることをします。•-Sは、stdoutがSAMであるがあることをします。•

samtools view -sB thing.bam > thing.sam

そしてサムとバムのの

samtools view thing.sam > thing.bam samtools sort thing.bam thing samtools index thing.bam

これにより、ソートされたインデックスきのBAMがされます。これにより、ファイルthing.bamとthing.bam.baiがされます。バムをするにはインデックスファイルがです。

オンラインでなサムツールをむ https://riptutorial.com/ja/bioinformatics/topic/6886/なサムツール


https://riptutorial.com/ja/bioinformatics/topic/6886/%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E3%82%B5%E3%83%A0%E3%83%84%E3%83%BC%E3%83%ABhttps://riptutorial.com/ja/bioinformatics/topic/6886/%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E3%82%B5%E3%83%A0%E3%83%84%E3%83%BC%E3%83%ABhttps://riptutorial.com/ja/bioinformatics/topic/6886/%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E3%82%B5%E3%83%A0%E3%83%84%E3%83%BC%E3%83%ABhttps://riptutorial.com/ja/bioinformatics/topic/6886/%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E3%82%B5%E3%83%A0%E3%83%84%E3%83%BC%E3%83%ABhttps://riptutorial.com/ja/bioinformatics/topic/6886/%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E3%82%B5%E3%83%A0%E3%83%84%E3%83%BC%E3%83%ABhttps://riptutorial.com/ja/bioinformatics/topic/6886/%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E3%82%B5%E3%83%A0%E3%83%84%E3%83%BC%E3%83%AB

クレジット

S. No

Contributors

1バイオインフォマティクスのまり

BioGeek, Community, hello_there_andy, Marcelo, Pierre

2 BLAST amblina

3 FASTAの。 Pierre

4 fastqファイルの Pierre

5 シーケンス BioGeek, Pierre, zx8754

6 なファイル Razik, woemler

7 なサムツール amblina


https://riptutorial.com/ja/contributor/50065/biogeekhttps://riptutorial.com/ja/contributor/-1/communityhttps://riptutorial.com/ja/contributor/3011648/hello-there-andyhttps://riptutorial.com/ja/contributor/2479811/marcelohttps://riptutorial.com/ja/contributor/58082/pierrehttps://riptutorial.com/ja/contributor/3759777/amblinahttps://riptutorial.com/ja/contributor/58082/pierrehttps://riptutorial.com/ja/contributor/58082/pierrehttps://riptutorial.com/ja/contributor/50065/biogeekhttps://riptutorial.com/ja/contributor/58082/pierrehttps://riptutorial.com/ja/contributor/680068/zx8754https://riptutorial.com/ja/contributor/3157961/razikhttps://riptutorial.com/ja/contributor/1458983/woemlerhttps://riptutorial.com/ja/contributor/3759777/amblina

約章 1: バイオインフォマティクスの始まり備考Examples定義.GFFファイルパーサー（バッファーとして）行のみを保持するフィルター生物学的問題に答えるためのDNA配列のマッピングの使用

章 2: BLASTExamplesDNA blastdbを作成するnucl blastdbからfasta配列を抽出するubuntuに爆風をインストールするblastdbからGIとタクシーを抽出する

章 3: FASTA配列の線形化。ExamplesAWKでFASTAシーケンスを線形化する

1行ずつ読むUniprotからFASTA配列を線形化する

章 4: fastqファイルの線形化Examplesペーストの使用Awkを使う

章 5: シーケンス解析ExamplesシーケンスのGC％を計算する

章 6: 一般的なファイル形式Examplesファスタ突然変異注釈フォーマット（MAF）GCTfasta形式でのシーケンス書込み

章 7: 基本的なサムツールExamplesバムファイル内の参照ごとのレコード数をカウントするsamをbamに変換する（そしてもう一度）

クレジット

bioinformatics - RIP Tutorial · from: bioinformatics It is an unofficial and free bioinformatics...

Documents

Transcript of bioinformatics - RIP Tutorial · from: bioinformatics It is an unofficial and free bioinformatics...