Linux 4 biology -...

Post on 04-Feb-2018

Linux for Biology

DEDAN GITHAE, BIOINFORMATICIAN

BecA-ILRI Hub

Importance of computers to biology

◦ Availability of vast research data shared online

◦ Automated analysis leading to the generation of massive data

◦ Interaction with other research communities and shared databases

◦ Speed and efficiency in processing, storage and data mining

Big Data: Volume, Variety, Velocity & Veracity

Volume:

◦ More content already generated and available over open access

◦ More content being generated per run as a result of technology advancement

◦ Costs cheaper over time

Velocity:

◦ Technology making data generation faster and more efficient

Variety:

◦ Sequences, annotation, structures, image processing

Veracity:

◦ Some ambiguities, inconsistencies, incomplete data, model approximations

Other computational tasks: analysis and interpretation

Biology activities:

◦ Prediction: functional and structural

◦ Pattern recognition: domains, homology

◦ Sequence alignments

◦ Statistical analysis

◦ Structural modelling

◦ Genetic diversity and interactions between organisms, between populations

Linux

What is Linux? A family

◦ of free and open-source software

◦ operating system

◦ distributions built around the Linux kernel.

What is Linux? A family (Ubuntu? Fedora? Mint? Debian? openSUSE?)

◦ of free: anyone is freely licensed to use, copy, study, and change the software in any way

◦ and open-source software: the source code is openly shared, so that people are encouraged to voluntarily improve the design of the software

◦ operating system: system software that manages computer hardware and software resources and provides common services for computer programs

◦ distributions built around the Linux kernel: the kernel is the part of the operating system that mediates access to system resources, e.g. input/output requests from software, translating them into data-processing instructions for the central processing unit

Kernel

Some applications to biological tasks

◦ Repetitive tasks: processing several sequences

◦ Automating analysis processes: scripts / piping to programs

◦ Text processing: regex; grep; sed

◦ Extracting fields using cut/awk

◦ We’ll see more of this in the tutorial
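The tasks listed above can be sketched in a few lines of shell. This is a minimal illustration, not material from the tutorial: the FASTA file, its sequence names and the variable names are all made up for the example.

```shell
#!/usr/bin/env bash
set -e

# Build a tiny example FASTA file (hypothetical sequences, illustration only).
workdir=$(mktemp -d)
cat > "$workdir/example.fasta" <<'EOF'
>seq1 Drosophila melanogaster
ATGCGTACGTTAGC
>seq2 Drosophila simulans
ATGCGTTCGTTAGC
>seq3 Anopheles gambiae
ATGAGTACGATAGC
EOF

# grep with a regex: count the sequences (header lines start with ">").
n_seqs=$(grep -c '^>' "$workdir/example.fasta")
echo "Number of sequences: $n_seqs"

# grep + cut + sed piped together: extract the sequence IDs
# (first space-separated field, with the leading ">" removed).
ids=$(grep '^>' "$workdir/example.fasta" | cut -d ' ' -f 1 | sed 's/^>//')
echo "$ids"

# awk: count the sequences per genus (second field of each header line).
genus_counts=$(awk '/^>/ { count[$2]++ } END { for (g in count) print g, count[g] }' \
    "$workdir/example.fasta" | sort)
echo "$genus_counts"

# Repetitive task: run the same step once for every sequence ID.
for id in $ids; do
    echo "Processing $id"
done

# Clean up the temporary directory.
rm -rf "$workdir"
```

Each command reads plain text and writes plain text, which is why they can be chained with pipes and reused unchanged on much larger files.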

The ILRI High Performance Computing (HPC) Cluster

Users log in to HPC (the master)

To log in:

ssh userX@hpc.ilri.cgiar.org

then “jump” to the rest of the cluster (the computing servers).

To do this, type

interactive

Software: to check whether the software (and the version) you need is installed, type

module avail

To use a software package, e.g. BLAST, type

module load blast

To see which software is ready for use (loaded), type

module list
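The list printed by module avail can be long, so it may help to filter it. The sketch below assumes a common behaviour of module systems, namely that the listing is written to standard error rather than standard output; if that does not hold on your cluster, the redirection is simply harmless. The variable module_status is only there for illustration.

```shell
#!/usr/bin/env bash

# The "module" command is normally defined only on the cluster,
# so check for it first and fall back gracefully elsewhere.
if type module >/dev/null 2>&1; then
    # Assumption: "module avail" writes its listing to stderr, so merge
    # stderr into stdout (2>&1) before filtering with grep.
    module avail 2>&1 | grep -i blast || echo "No blast module found"
    module_status="available"
else
    echo "The module command is not available on this machine"
    module_status="unavailable"
fi
```

The -i option makes grep ignore case, so BLAST, Blast and blast all match.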

SLURM: Simple Linux Utility for Resource Management

Interactive jobs have a time limit of 8 hours. If you are running a longer job, write a batch script to schedule it.

How do we write scripts?

Writing a Slurm script

◦ For the available options, type

sbatch -u [ man sbatch for a detailed explanation of usage ]

Example of a batch script

#!/usr/bin/env bash

#SBATCH -p batch

#SBATCH -J blastn

#SBATCH -n 4

# load the blast module

module load blast/2.6.0+

# run the blast with 4 CPU threads (cores)

blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4

To run the script, type

sbatch [ scriptname.sbatch ]
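After submitting, you will usually want to know the job ID and where the output goes. The following is a sketch under a few assumptions: the script name blastn.sbatch and the helper parse_job_id are hypothetical, and the output file name relies on Slurm's default of slurm-<jobid>.out in the submission directory, which a cluster or script can override.

```shell
#!/usr/bin/env bash

# Hypothetical script name, used only for illustration.
script="blastn.sbatch"

# Hypothetical helper: "sbatch --parsable" prints "jobid" or
# "jobid;cluster"; keep only the job ID part.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1 && [ -f "$script" ]; then
    # Submit the script and capture the job ID.
    job_id=$(parse_job_id "$(sbatch --parsable "$script")")
    echo "Submitted job $job_id"

    # List your own pending and running jobs.
    squeue -u "$USER"

    # Default output location, unless the script sets its own.
    echo "Output will be written to slurm-${job_id}.out"
else
    echo "sbatch or $script not found; run this on the cluster"
fi
```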

Best practice: overview

Run the job on the computing node

interactive

Make a directory in the scratch space, and “go” there

mkdir -p /var/scratch/userX ; cd $_

Create the script

Run the script

sbatch [scriptname.sbatch]
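The steps above could be combined roughly as follows, run after typing interactive. Treat it as a sketch: the file name blastn.sbatch and the variables workdir and syntax_ok are made up for the example, the fallback to a temporary directory exists only so the sketch also runs away from the cluster, and -num_threads 4 is the BLAST+ option for requesting the four threads mentioned in the script's comment.

```shell
#!/usr/bin/env bash
set -e

# Scratch space: /var/scratch/<user> on the cluster (as on the slide);
# fall back to a temporary directory where /var/scratch does not exist.
if [ -d /var/scratch ] && [ -w /var/scratch ]; then
    workdir="/var/scratch/${USER:-userX}"
else
    workdir="$(mktemp -d)"
fi
mkdir -p "$workdir"
cd "$workdir"

# Create the batch script with a here-document.
cat > blastn.sbatch <<'EOF'
#!/usr/bin/env bash
#SBATCH -p batch
#SBATCH -J blastn
#SBATCH -n 4

# load the blast module
module load blast/2.6.0+

# run the blast with 4 CPU threads (cores)
blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4
EOF

# Check the script for shell syntax errors without executing it.
bash -n blastn.sbatch && syntax_ok=yes

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch blastn.sbatch
else
    echo "sbatch not found; script written to $workdir/blastn.sbatch"
fi
```

Working in the scratch space keeps large temporary files out of your home directory.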

Enjoy!