Bioinformatics Resources - Rostlab · 2016-04-22 · BioinfRes SoSe 16 Organization Lecture: Friday...
BioinfRes SoSe 16
Bioinformatics Resources
Lecture & Exercises
Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12

Bioinformatics Resources
● Organization
● Schedule
● Overview

Organization
● Lecture:
- Friday 9-12, i.e. 9.30-11.45 o'clock, with a 10-15 min break in between
- Room 00.13.009A
● Exercise:
- Monday 14-16 o'clock, room 00.08.038, starting Mon, May 2nd
- Friday 13-15 o'clock, room 01.09.014, starting Fri, Apr. 29th

Team Behind the Course

Putative Schedule
Apr. 22nd   Intro, General Overview (1. sh.)
Apr. 29th   Sequence Databases (2. sh.)
May 6th     No lecture
May 13th    Sequence Databases (3. sh.)
May 20th    Structure Databases (4. sh.)*
May 27th    SQL (5. sh.)
Jun 3rd     SQL (6. sh.)
Jun 10th    No-SQL (7. sh.)
Jun 17th    No-SQL (8. sh.)*
Jun 24th    JavaScript / UI (9. sh.)
Jul 1st     Web Services (10. sh.)
Jul 8th     Bioinformatics Suites / Forums
Jul 15th    Wrap Up, Q&A
* These exercises can earn you a bonus

Schedule Details
● No lecture on May 6th
● No exercise on Fri, May 13th and Mon, May 16th
● Exercise sheets are published on Fridays and discussed Fri/Mon the week after
● Last sheet/exercise: Jul 4th, Fri/Mon 8th/11th
● Exam (working date): August 5th, to be discussed with the audience

Overview
● lecture is new and considered beta1
● second iteration
● no prior syllabus available, and the content is subject to change
● depending on the progress of the lecture, individual topics could be added or dropped
● the sequence of topics might be shuffled
● hybrid nature: presentations of existing resources are blended with back-end and front-end technology

Exercises
● Exercises help to convert knowledge into a skill
● practical application of topics covered in the lecture
● active exploration of bioinformatics resources
● implementing various parts of a bioinformatics resource
● use Python/Biopython as common platform

Meaning
● What does "resource" actually mean?
● a Google query for "Bioinformatics Resource" yields about 20 million hits
● the hits fall roughly into three categories:
- databases
- tools
- service centers

Working on a Definition
● a collection of information which is useful for research in the life sciences / computational biology
● contains the information itself
● provides appropriate interfaces to access the information
● may provide tools for interactive data analysis

Genbank/NCBI
● NIH genetic sequence database
● annotated collection of all publicly available DNA sequences
● part of the International Nucleotide Sequence Database Collaboration, together with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL)

Genbank II
● new release every 2 months
● retrievable via FTP from the NCBI website
● current release is 213.0, April 15, 2016
● 211,423,912,047 bases from 191,739,511 reported sequences
● (187,893,826,750 bases from 181,336,445 reported sequences in Feb 2015)
● Genbank flat file format

Genbank III
● three main divisions: CoreNucleotide, dbEST, dbGSS
● querying via Entrez Nucleotide
● interactive BLAST analysis with user sequences
● programmatic access via NCBI E-utilities
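
As an illustration of programmatic access, the sketch below only builds an E-utilities efetch URL and does not send a request. The function name build_efetch_url and the accession placeholder are made up for this example; it is written for Python 3.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch endpoint.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gb", retmode="text"):
    """Return the efetch URL for one nucleotide record (no request is sent)."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": rettype,   # "gb" = Genbank flat file, "fasta" = FASTA
        "retmode": retmode,
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


# "EXAMPLE_ACCESSION" is a placeholder, not a real identifier.
url = build_efetch_url("EXAMPLE_ACCESSION", rettype="fasta")
print(url)
```

The resulting URL could be opened with any HTTP client; Biopython wraps the same service in its Bio.Entrez module.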

Swissprot
● official name: UniProtKB/Swiss-Prot
● history
● current release: 2016_04
● 548208 sequence entries
● (550960 sequence entries, 195282524 amino acids abstracted from 235893 references last year)
● manually annotated

Swissprot/Uniprot II
● manual annotation process
● standard operating procedure
● controlled vocabularies
● guidelines
● offered services: BLAST, Align, ID mapping
● associated services

Other Uniprot Services
● TrEMBL
● Proteomes
● UniRef
● UniParc
● programmatic access

PDB
● History
● 118,087 structures, incl. 115,169 proteins
● (108124 structures, incl. 100450 proteins last year)
● PDB formats
● data upload/validation
● data dictionaries

PDB II
● retrieval
● programmatic access
● visualization with the different views
● file format transitions: PDB and mmCIF

SCOP/e
● Structural Classification of Proteins
● history, current version is SCOPe 2.05
● changes in SCOPe
● access
● needed/recommended additional software

PFAM
● PFAM
- current version is 29.0, December 2015
- what it is about
- categories
- interactive use
- programmatic access

Prosite
● Prosite
- current version 20.125, Apr 5th, 2016
- UniRule format and ProRule
- access
- typical use and interfaces

PubMed and discussion forums
● What is it for
● Search opportunities
● Linking to other information sources
● Search strategies
● A tour through various discussion forums

File Formats*
● High-throughput data:
- BAM, SAM
- VCF
● Newick tree file format
● Genbank/EMBL
● PDB: mmCIF
* mostly integrated
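
To give a feeling for one of these formats, here is a minimal sketch of a Newick reader in Python 3. It is an illustration only: it assumes unquoted labels, discards branch lengths, ignores comments, and the tree string is made up. Biopython's Bio.Phylo module would be the tool for real files.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, children) tuples.

    Branch lengths (":0.1") are read and discarded; quoted labels and
    comments are not handled.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
        # Read the label (and optional ":length") up to the next delimiter.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        label = text[start:pos].split(":")[0]
        return (label, children)

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected trailing text")
    return tree


def leaf_names(node):
    """Collect the names of all leaves at or below a node."""
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4)root;")
print(leaf_names(tree))
```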

File Formats
● Equivalence and transformations between different formats
● XML formats
● RDF formats

SQL
● SQL basics
● data types
● table creation and manipulation
● join
● select
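
The points above can be tried out without a database server. The following sketch uses Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL; the table names, column names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Table creation with typed columns.
cur.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE protein ("
    " id INTEGER PRIMARY KEY,"
    " organism_id INTEGER NOT NULL REFERENCES organism(id),"
    " name TEXT NOT NULL,"
    " length INTEGER)"
)

# Manipulation: insert a few made-up rows.
cur.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(1, "organism_a"), (2, "organism_b")],
)
cur.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [(1, 1, "protein_x", 120), (2, 1, "protein_y", 300), (3, 2, "protein_z", 85)],
)

# Select with a join: proteins longer than 100 residues plus their organism.
rows = cur.execute(
    "SELECT p.name, o.name, p.length"
    " FROM protein AS p JOIN organism AS o ON p.organism_id = o.id"
    " WHERE p.length > 100 ORDER BY p.length"
).fetchall()
print(rows)
conn.close()
```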

SQL II
● keys
● indexes
● performance influence of indexes
● similarity search vs substrings
● permissions
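
One way to see the effect of an index is to ask the database for its query plan. The sketch below again uses sqlite3 as a stand-in; the table, the index name idx_protein_accession and the data are assumptions made for this example, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (id INTEGER PRIMARY KEY, accession TEXT, sequence TEXT)"
)
conn.executemany(
    "INSERT INTO protein (accession, sequence) VALUES (?, ?)",
    [("ACC%05d" % i, "MKV" * (i % 7 + 1)) for i in range(1000)],
)


def query_plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


exact = "SELECT * FROM protein WHERE accession = 'ACC00042'"
substring = "SELECT * FROM protein WHERE accession LIKE '%0042%'"

plan_before = query_plan(exact)          # no index yet: full table scan
conn.execute("CREATE INDEX idx_protein_accession ON protein (accession)")
plan_after = query_plan(exact)           # exact match can use the index
plan_substring = query_plan(substring)   # leading wildcard: scan again

print("before index:", plan_before)
print("after index: ", plan_after)
print("substring:   ", plan_substring)
```

The exact-match query switches from a scan to an index search once the index exists, while the substring query with a leading wildcard still has to look at every row.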

SQL III
● transactions
● setup, administration, backup
● programmatic access
● MySQL, PostgreSQL
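
A transaction groups several statements so that either all of them take effect or none does. The following sketch shows this with sqlite3 (a stand-in for MySQL or PostgreSQL); the table and labels are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, label TEXT UNIQUE)")

# First transaction: succeeds and is committed on leaving the block.
with conn:
    conn.execute("INSERT INTO sample (label) VALUES ('s1')")

# Second transaction: the duplicate label violates UNIQUE, so the whole
# block, including the valid 's2' insert, is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO sample (label) VALUES ('s2')")
        conn.execute("INSERT INTO sample (label) VALUES ('s1')")
except sqlite3.IntegrityError as err:
    print("rolled back:", err)

labels = [row[0] for row in conn.execute("SELECT label FROM sample ORDER BY label")]
print(labels)
```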

SQL IV
● general hints for database design
● do's and don'ts
● normalization ultralight

NoSQL
● definitions of NoSQL
● advantages/disadvantages
● underlying theory
● typical use cases
● types of No-SQL database
● query (languages)
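
The different types of NoSQL database mostly differ in the shape of the data they store. The sketch below models three common shapes (key-value, document, graph) with plain Python structures only; no database driver is involved, and all identifiers and values are invented.

```python
import json

# Key-value view: an opaque value looked up by a single key.
key_value_store = {"protein:P1": "MKVLA"}

# Document view: one self-describing, nested record per entity
# (the style used by document stores).
document = {
    "_id": "P1",
    "name": "example protein",
    "sequence": "MKVLA",
    "annotations": [
        {"type": "domain", "start": 1, "end": 3},
        {"type": "site", "position": 5},
    ],
}

# Graph view: nodes plus labelled edges (the style used by graph databases).
edges = [("P1", "INTERACTS_WITH", "P2"), ("P2", "INTERACTS_WITH", "P3")]


def neighbours(node):
    """Return all nodes reachable from `node` over one outgoing edge."""
    return [target for source, _label, target in edges if source == node]


# A document can be filtered without joins and serialised as JSON.
domains = [a for a in document["annotations"] if a["type"] == "domain"]
print(json.dumps(domains))
print(neighbours("P1"))
```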

NoSQL Systems
● MongoDB
● CouchDB
● Neo4J
● programmatic access

(Storing Facts)*
● triple stores
● data model
● RDF refresher
● query language: SPARQL
● examples
* optional, might be dropped
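
The triple data model can be illustrated in a few lines of Python. This is a toy sketch, not an RDF library: the facts, the predicate names and the match helper are all made up, and None plays the role of a query variable.

```python
# Each fact is a (subject, predicate, object) triple; all values are made up.
triples = [
    ("protein:P1", "hasName", "example protein"),
    ("protein:P1", "foundIn", "organism:O1"),
    ("protein:P2", "foundIn", "organism:O1"),
    ("organism:O1", "hasName", "example organism"),
]


def match(pattern, store):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        triple
        for triple in store
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


# Loosely comparable to a SPARQL pattern "?s foundIn organism:O1".
subjects = [s for s, _p, _o in match((None, "foundIn", "organism:O1"), triples)]
print(subjects)
```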

Programming Libraries
● roadshow of programming libraries dedicated to bioinformatics:
● BioPerl
● Biopython
● BioJS
● visualization

Graphical User Interfaces
● principles
● interaction modes
● modelling

Graphical User Interfaces*
● interactive user interfaces with JavaScript
● language basics
● programming model
● client/server communication with JSON
* to be confirmed
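
Client/server communication with JSON boils down to serialising a data structure to text on one side and parsing it on the other. The sketch below simulates both sides in one Python 3 process; handle_request is a hypothetical handler and no network is involved.

```python
import json


def handle_request(body):
    """Hypothetical server-side handler: JSON text in, JSON text out."""
    request = json.loads(body)
    sequence = request["sequence"].upper()
    response = {
        "id": request["id"],
        "length": len(sequence),
        "gc": sequence.count("G") + sequence.count("C"),
    }
    return json.dumps(response)


# Client side: serialise the request, "send" it, parse the reply.
request_body = json.dumps({"id": 1, "sequence": "catgtagactag"})
reply = json.loads(handle_request(request_body))
print(reply)
```

In a browser, the client half would use JSON.stringify and JSON.parse in the same way.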

JavaScript
● libraries for data visualization / bioinformatics
● BioJS
● D3

Client/Server Models
● CGI
● Web services
● Remote Procedure Calls / CORBA
● security considerations

Authentication/Encryption
● authentication models
● communication encryption
● data/result encryption
● legal privacy issues
● data access models

Web Services I
● types of web services
● web service components
● integration of web services in software

Web Services II
● client side interfaces to web services
● server side interfaces to web services
● Apache configuration for web services
● required modules
● configuration
● performance

Bioinformatics Suites
● where to find
● installation/configuration
● workflow systems: e.g. Taverna, ....
● EMBOSS, STADEN
● bio-.....
● .....

Selected Bioinformatics Suites
● Aquaria
● PredictProtein
● ....

Summary I
● aim of this module:
- shape the concept of a bioinformatics resource
- become familiar with some of the most prominent examples out there
- get in touch with the underlying technology
- gather ideas and experience on how to realize a new bioinformatics resource

Summary II
● hands-on (interaction) experience with existing resources
● back-end technology, i.e. various database models
● front-end technology to realize the UI / design rationales
● communication models

Grading:
● graded by a written exam, 90/100 min
● scheduled day xxx depends on:
- available room
- number of participants
● exam admission: no admission limit
● with sufficient performance in the two marked exercises you can earn a bonus
● the bonus applies only if you pass the exam

Exercises
● Exploration of available resources
● simple to intermediate programming tasks
● publication/presentation of the task in week x
● solutions in week x+1

Exercises II
● 10 exercise sheets
● work in groups of 2 for the bonus
● discussion with the audience

Exercises III
● groups are fixed for the bonus
● new sheets are published on Friday
● submission is due on Friday morning for all groups
● two slots for exercises

Questions & Answers

Programming Exercises
● we will use Python for our programming exercises
● scripting language
● a basic understanding of Python should be sufficient to understand the presented code snippets
● lively community for support and development

Programming Exercises II
● object oriented
● good integration with database systems and web access
● good integration with sophisticated data analysis tools like NumPy, SciPy, matplotlib
● Biopython

Structure your research work
● computational biology is data driven
● results matter -> more results matter more
● unlike e.g. software development, there is no final release version after which all prior bugs/versions are abandoned
● appropriate documentation of the experiments is important to reconstruct the intermediate steps, otherwise you may end up with result01-result1000 files

Our preferred Software Setup
● Anaconda
● IPython notebooks

Anaconda
● Python distribution (https://www.continuum.io)
● clever package manager: conda
● allows several complete installations with different configurations next to each other in user space
● no privileges needed
● your host system is not modified
● works with Windows, OS X, Linux

Some snippets from the conda cheat sheet
● http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf
● use "conda create -n xxx biopython" to create a new environment xxx and install biopython
● use "(source) activate xxx" to activate this environment in your shell
● allows different versions of python to be installed at the same time

IPython/Jupyter
● http://jupyter.org
● supports many different languages, we use it for python
● use conda to install the package: conda install jupyter
● easy start of notebook: jupyter notebook

Advantages of a Notebook
● allows a seamless integration of:
- (rich) text
- (live) code
- visualizations
● ties together your analysis script, the results and an interpretation/discussion
● you can archive and share the notebooks easily

Biopython
● http://biopython.org
● if installed: "import Bio" loads it in your scripts
taken from http://biopython.org/wiki/Getting_Started:
    from Bio.Seq import Seq
    # create a sequence object
    my_seq = Seq('CATGTAGACTAG')

    # print out some details about it
    print 'seq %s is %i bases long' % (my_seq, len(my_seq))
    print 'reverse complement is %s' % my_seq.reverse_complement()
    print 'protein translation is %s' % my_seq.translate()

Biopython
    seq CATGTAGACTAG is 12 bases long
    reverse complement is CTAGTCTACATG
    protein translation is HVD*

taken from http://biopython.org/wiki/SeqIO:
    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    for record in SeqIO.parse(handle, "fasta"):
        print record.id
    handle.close()

    from Bio import SeqIO
    record = SeqIO.read(open("single.fasta"), "fasta")

Biopython
● advantage of an easier/clearer syntax than Perl
● modeled on BioPerl
● supports a lot of common bioinformatics file formats
● supports access to online services like NCBI, Expasy...
● more interfaces for bioinformatics software
● http://biopython.org/DIST/docs/tutorial/Tutorial.html

Dedicated Data Structures
● sequence (Seq): besides the sequence of residues, it also accepts an Alphabet object -> a kind of type safety for DNA and protein sequences
● typical functions like complement(), reverse_complement()
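
To show what complement() and reverse_complement() compute, here is a plain-Python 3 sketch that does not need Biopython. The check_dna helper is a made-up, much cruder stand-in for the alphabet idea; it is not how Biopython implements it.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Replace every base by its complementary base."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence and read it in the opposite direction."""
    return complement(seq)[::-1]


def check_dna(seq):
    """Crude stand-in for an alphabet check: reject non-ACGT letters."""
    unexpected = set(seq.upper()) - set("ACGT")
    if unexpected:
        raise ValueError("not a DNA sequence: %s" % sorted(unexpected))
    return seq


dna = check_dna("CATGTAGACTAG")
print(complement(dna))
print(reverse_complement(dna))
```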

Dedicated Data Structures
● parsing functions for different sequence formats
● parsing functions for alignment formats that know about the different components
● as well as the respective output functions
● different translation tables
● various predefined alphabets
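
What "different translation tables" means can be sketched in plain Python 3. The code below builds the standard genetic code and derives the vertebrate mitochondrial code from it by overriding four codons; the function names are made up for this example and are not Biopython's API.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG codon order.
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def make_table(amino_acids):
    """Map the 64 codons (TTT, TTC, ..., GGG) to one-letter amino acids."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


STANDARD_TABLE = make_table(STANDARD_AMINO_ACIDS)
# The vertebrate mitochondrial code differs from the standard in four codons.
MITO_TABLE = dict(STANDARD_TABLE, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table=STANDARD_TABLE):
    """Translate a DNA string codon by codon; trailing bases are ignored."""
    dna = dna.upper()
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


print(translate("CATGTAGACTAG"))
print(translate("ATGATATGA"), translate("ATGATATGA", MITO_TABLE))
```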

Python Basics
● https://docs.python.org/2/tutorial/index.html
● good interactive handling, i.e. you can develop and evaluate your code directly in the python shell
● later you can include it in your script
● basic data types:
- numerical types comparable to Perl, C, Java
- strings
- boolean

Sequence Types
● support an easy membership check for an element
● mutable types: List, Bytearray
● immutable: String, Tuple
● slicing: act on subsets, not only on single elements
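
A short sketch of these properties, written for Python 3 (the slides themselves use Python 2 print statements); the variable names are arbitrary.

```python
bases = ["A", "C", "G", "T"]       # list: mutable
codon = ("A", "T", "G")            # tuple: immutable
dna = "CATGTAGACTAG"               # string: immutable

# Membership tests work on all sequence types.
print("G" in bases, "TAG" in dna)

bases.append("N")                  # lists can be changed in place
try:
    codon[0] = "C"                 # tuples cannot
except TypeError as err:
    print("tuple is immutable:", err)

# Slicing acts on subsets instead of single elements.
first_codon = dna[:3]
every_third = dna[::3]
last_three = dna[-3:]
print(first_codon, every_third, last_three)

buf = bytearray(b"CATG")           # bytearray: a mutable byte sequence
buf[0] = ord("G")
print(buf.decode("ascii"))
```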

Other Collection Types
● Set: every element exists only once
● Dictionary:
- can store key/value pairs
- key has to be immutable (hashable)
● all collection types support iterators
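
The following Python 3 sketch illustrates sets, dictionaries and the hashability requirement for keys; the example data are made up.

```python
residues = set("MKVLAKVM")         # duplicates collapse
print(sorted(residues))

counts = {}                        # dictionary: key -> value
for residue in "MKVLAKVM":
    counts[residue] = counts.get(residue, 0) + 1

# Keys must be hashable: a tuple works, a list does not.
contacts = {("A", 10): "helix"}
try:
    contacts[["A", 11]] = "sheet"
except TypeError as err:
    print("unhashable key:", err)

# All collection types can be iterated.
for residue, count in sorted(counts.items()):
    print(residue, count)
```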

Important Syntax
● whitespace (tabs, spaces) and ":" are used to structure the code in blocks, similar to {} in other languages
● same indentation == same block
● usual control structures available
    for w in words:
        print w, len(w)

    # if you want to iterate by numbers you
    # have to use range()
    for i in range(len(a)):
        print i, a[i]

Important Syntax
● Definition of functions:
    def fib(n):  # write Fibonacci series up to n
        """Print a Fibonacci series up to n."""
        a, b = 0, 1
        while a < n:
            print a,
            a, b = b, a+b
● Arguments can be passed by:
- name
- position
● Arguments can have default values -> optional in the call
● Packages are loaded with the import statement
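
Passing arguments by position, by name and via default values can be sketched as follows. This Python 3 variant of fib returns a list instead of printing, and the start parameter is an addition made for this illustration.

```python
def fib(n, start=(0, 1)):
    """Return the Fibonacci-style series below n as a list.

    `start` has a default value, so it is optional in the call.
    """
    a, b = start
    series = []
    while a < n:
        series.append(a)
        a, b = b, a + b
    return series


print(fib(20))                   # argument passed by position
print(fib(n=20))                 # argument passed by name
print(fib(20, start=(2, 3)))     # overriding the default value
```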