Unix commands for beginners
D. Puthier, TAGC/Inserm, U1090; Defrance, ULB; Stéphanie Le Gras, IGBMC; Blanchet, IFB
MATE Desktop: Demo
Quick overview.
Installation: http://www.france-bioinformatique.fr/?q=fr/core/cellule-infrastructure/documentation-cloud
Dashboard: https://cloud.france-bioinformatique.fr/cloud/instance/
The terminal…
Demo: type ‘ls’ in the terminal (list files)
# list files
root@vm: ls
How can I speak to the terminal?
● Answer: you can speak in BASH (Bourne Again Shell) *
○ BASH is one of numerous shell dialects (ksh, csh, zsh,...).
○ All these shell languages are extremely similar.
○ These languages are based on commands.
○ These modular commands allow one to perform tasks.
* A reference (a pun) to the first shell language, written by Stephen Bourne :)
Command prototype(s) (1)
● One command performs a task (sort, select, open, align reads,...).
● A command has arguments that may be optional and modify the way it works.
● These arguments may take some values.
● Most of the time an instruction (command line) starts with a command name (or the path to the command).
● In the example below, ‘-v’ is read ‘minus v’.
# Argument without any associated value
# Depending on the command, -v means verbose, version (or something else)
fastqc -v
# An argument with an associated value
man -k jpeg
Command prototype(s) (2)
● Most of the time arguments can be written in their short or long form (more explicit/better readability).
● Long forms are generally preceded by ‘--’ (for instance ‘minus minus apropos’).
# Long form without any associated value.
fastqc --version
# Long form with an associated value.
man --apropos jpeg
Getting help!
Call your friends or, better, use man (manual)
# Demo
root@vm: man ls # getting help about ls
root@vm: man man # getting help about man ...
Help shortcuts:
/foo : search for ‘foo’.
n : (next) next occurrence of ‘foo’.
N : (previous) previous occurrence of ‘foo’.
q : quit help page.
Our first command: ls
The ls command and some of its arguments
● ls can take several arguments.
● Main arguments:
○ -l : (long) get a lot of information.
○ -a : (all) show all files including hidden files *.
○ -1 : show results as 1 column.
○ -t : (time) sort results by date/time.
○ -r : (reverse) reverse sort order.
● One can combine arguments
○ ls -l -a
○ ls -la
* Under Linux, hidden files start with a ‘.’ (e.g. ‘.thehiddenfile.txt’).
The ls command and some of its arguments
# Demo
root@vm: ls # list files
root@vm: ls -a # list files including hidden files *
root@vm: ls -l # get a lot of information about files
root@vm: ls -1 # list files (one column)
root@vm: ls -t # list files by modification date **
# Combining arguments
root@vm: ls -rtl # lot of info, sort by date, reverse order
* WARNING with spaces. An instruction should start with a command. The ls-a command does not exist!
** Without -t, the default is case-sensitive sorting by name.
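The arguments above can be tried safely in a throwaway directory. The block below is only a sketch: the directory comes from mktemp, and the file names (visible.txt, .hidden.txt) are made up for the illustration.

```shell
# Work in a temporary directory so nothing in your home is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create one visible file and one hidden file (hypothetical names)
touch visible.txt .hidden.txt

# Plain ls does not show the hidden file
plain=$(ls)

# ls -a also lists hidden entries (including . and ..)
all=$(ls -a)

# Arguments can be combined: long format, all entries
ls -la

echo "ls    -> $plain"
echo "ls -a -> $all"
```

Note that `ls-a` (without the space) would be looked up as a command name and fail.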
Create directories and files
File system tree
● The file system can be viewed as a tree in which nodes are directories or files.
● This tree has a root: /
● The root folder (/) contains
○ A folder named root and various additional folders *
■ On the IFB machine, your root folder contains a Documents folder
* On the IFB VM, you are root (the sysadmin); this is a particular case.
How should I refer to a file/directory?
● 1) By specifying the path from the root. Absolute path.
e.g. /root/Documents /root/Music
● 2) By referring to the current location/directory (the working/current directory). Relative path.
Syntax for relative paths
# The parent directory, relative to the working directory
..
# Two directories up
../..
# Three
../../..
# The current working directory
./
File system: Demo
root@vm: pwd # The current working directory (/root)
root@vm: cd /root/Documents # We go into Documents
root@vm: pwd # /root/Documents
root@vm: cd .. # go up one level (/root)
root@vm: cd /root/Music # Go to the Music folder
root@vm: pwd # /root/Music
root@vm: cd ../.. # Go to the root of the file system
root@vm: ls # You should see the root directory
root@vm: cd /root/Music # Let’s go to the /root/Music directory
root@vm: cd ../Documents # And to the Documents folder
pwd (print working directory); cd (change directory). *
* Use completion (Tab key) for files, directories and commands.
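To make the difference between absolute and relative paths concrete, here is a small sketch that does not depend on the IFB VM layout. The directory tree is hypothetical and is built under a temporary directory.

```shell
# Build a small, hypothetical tree in a temporary directory
top=$(mktemp -d)
mkdir "$top/Documents" "$top/Music"

# Absolute path: starts from the root of the file system (/)
cd "$top/Music"
here_abs=$(pwd)

# Relative path: starts from the current working directory
cd ../Documents          # up one level, then down into Documents
here_rel=$(pwd)

echo "$here_abs"
echo "$here_rel"
```

Both cd commands end up in sibling directories; only the way the destination is written differs.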
File system: some hints
● If you are the root user, your data are stored in /root
○ i.e. your ‘user directory’ or home.
● ~ (tilde) expands to the path of your home directory (same as $HOME).
root@vm: cd / # At the root
root@vm: pwd # /
root@vm: cd ~/Documents # The Documents directory of your home folder.
root@vm: cd ~ # go to your home dir.
root@vm: cd /usr/local/bin # Go to /usr/local/bin
root@vm: ls ~ # list files in your ‘home’ directory
root@vm: cd ~/Music # Go to the Music folder inside your home dir.
root@vm: cd # == cd ~
Make directories
● We will use the mkdir (make directory) command.
root@vm: mkdir projet_roscoff # Create a directory
root@vm: cd ./projet_roscoff # == cd projet_roscoff *
root@vm: mkdir rna-seq # Create a sub-folder
root@vm: mkdir chip-seq dna-seq # and several more at once
root@vm: ls -1 # list files and folders
root@vm: cd chip-seq # == cd ./chip-seq
root@vm: pwd # the current working dir
root@vm: cd ../.. # go back home
* ./ is optional most of the time
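As a complement, nested directories can be created in one step with the -p argument of mkdir. This is a sketch that is not part of the original demo; the directory name projet_demo is hypothetical.

```shell
base=$(mktemp -d)
cd "$base"

# -p creates missing parent directories and does not fail
# if the directory already exists
mkdir -p projet_demo/rna-seq projet_demo/chip-seq

# Running it again is harmless thanks to -p
mkdir -p projet_demo/rna-seq

created=$(ls -1 projet_demo)
echo "$created"
```

Without -p, mkdir projet_demo/rna-seq would fail as long as projet_demo itself does not exist.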
Hands on
● 1) Go to ~/projet_roscoff/chip-seq
● 2) From this directory, create a directory named annotations in ~/projet_roscoff/
● 3) Go inside the annotations directory
● 4) Check you are in the right place
● 5) Go back home.
# Solution
root@vm: cd ~/projet_roscoff/chip-seq
root@vm: mkdir ../annotations
root@vm: cd ../annotations
root@vm: pwd # check you are in the right place
root@vm: cd
Manipulate files
Download and uncompress files
● We will use the wget command to download files.
● To uncompress, we will use gunzip if the file was compressed with the gzip algorithm (extension .gz).
root@vm: cd ~/projet_roscoff/annotations # move into annotations
# Download the file
root@vm: wget http://pedagogix-tagc.univ-mrs.fr/courses/data/roscoff/hg19_exons.bed.gz
root@vm: ls # the compressed file
root@vm: gunzip hg19_exons.bed.gz # uncompress it
root@vm: ls # the file has lost its .gz extension
The hg19_exons.bed file
Contains the coordinates (start/end) of human exons in bed format.
Bed format (Bed6) ( http://genome.ucsc.edu/FAQ/FAQformat.html#format1 ) *
Tab-delimited format (how can we check that?)
Chromosome Start End Name Score Strand (Others…)
* Start and End positions are always given relative to the 5’/3’ orientation of the + strand. Coordinates are ‘zero-based, half-open’.
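One possible way to check that a file is tab-delimited, and to see what ‘zero-based, half-open’ implies for lengths, is sketched below. The two BED lines are made up for the illustration; awk is used here although the course introduces it later.

```shell
# A two-line, made-up BED6 sample written with explicit tab characters
sample=$(mktemp)
printf 'chr1\t100\t200\texonA\t0\t+\n'  > "$sample"
printf 'chr2\t150\t175\texonB\t0\t-\n' >> "$sample"

# If the file is tab-delimited, every line has 6 tab-separated fields
fields=$(awk -F '\t' '{print NF}' "$sample" | sort -u)

# Zero-based, half-open coordinates: length = end - start
lengths=$(awk -F '\t' '{print $4, $3 - $2}' "$sample")

echo "fields per line: $fields"
echo "$lengths"
```

If the columns were separated by spaces instead of tabs, the field count would be 1 rather than 6.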
Visualising file content
● With a pager: less or more (they do more or less the same).
● With head or tail to display the first n or last n lines of a file.
● The cat command sends the file content to the screen. <ctrl> + c to cancel.
● The shortcuts for less are the same as for the man command.
Shortcuts in less:
↑ : go up.
↓ : go down.
< : go to first line.
> : go to last line.
/foo : search for ‘foo’.
n : next occurrence of foo.
N : previous occurrence of foo.
q : quit.
Hands on
● 1) Look at the first ten lines of hg19_exons.bed with head.
● 2) Look at the last ten lines of hg19_exons.bed with tail.
● 3) Go through the hg19_exons.bed file with less.
● 4) Send the file content to the screen with cat.
# Solution
root@vm: head -n 10 hg19_exons.bed
root@vm: tail -n 10 hg19_exons.bed
root@vm: less hg19_exons.bed
root@vm: cat hg19_exons.bed
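The -n argument of head and tail can be tried without the exon file. The sketch below assumes the seq command is available (it is on most Linux systems) and uses a generated file of 100 numbered lines.

```shell
# Generate a 100-line file containing the numbers 1 to 100
numbers=$(mktemp)
seq 1 100 > "$numbers"

# First three lines
first=$(head -n 3 "$numbers")

# Last two lines
last=$(tail -n 2 "$numbers")

echo "$first"
echo "$last"
```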
Counting lines
This can be done with the wc (word count) command with the -l (lines) argument.
root@vm: wc -l hg19_exons.bed # 484127 exons
Extract columns
● Use the cut command with the -f (field) argument.
● The columns must be tab-separated; otherwise use the -d (‘delimiter’) argument.
root@vm: cut -f1 hg19_exons.bed # Column 1
root@vm: cut -f1,2 hg19_exons.bed # Columns 1 and 2
root@vm: cut -f3-5 hg19_exons.bed # Columns from 3 to 5
root@vm: cut -f3- hg19_exons.bed # Column 3 to the last column
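When the columns are not separated by tabs, the -d argument tells cut which delimiter to use. A minimal sketch on a made-up comma-separated file:

```shell
# A made-up comma-separated file
csv=$(mktemp)
printf 'gene,chrom,start\ngeneA,chr1,100\n' > "$csv"

# -d sets the delimiter, -f selects the fields
col2=$(cut -d ',' -f 2 "$csv")
col1_3=$(cut -d ',' -f 1,3 "$csv")

echo "$col2"
echo "$col1_3"
```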
Sort a file
● One should use the sort command (alphabetic sorting by default).
○ -k (key): e.g.
■ -k1,1: sort by column 1.
■ -k2,2nr: sort by column 2 using numeric sorting in reverse order.
■ -k2,2g: sort by column 2 (general numeric sorting, which handles decimals).
Example: sort hg19_exons.bed by chromosome, then by increasing genomic coordinate:
root@vm: sort -k1,1 -k2,2n hg19_exons.bed
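The effect of the n modifier is easiest to see on a tiny file. The three lines below are made up; the sketch compares a numeric sort on column 2 with a plain text sort on the same column.

```shell
unsorted=$(mktemp)
printf 'chr2\t300\t400\nchr1\t1000\t1100\nchr1\t50\t80\n' > "$unsorted"

# -k1,1  : sort on column 1 only (as text)
# -k2,2n : then on column 2 only, as numbers, in increasing order
sorted=$(sort -k1,1 -k2,2n "$unsorted")

# Without n, column 2 is compared as text, so 1000 comes before 50
as_text=$(sort -k1,1 -k2,2 "$unsorted")

echo "$sorted"
echo "---"
echo "$as_text"
```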
Redirections
Command pipes
[Figure: a command with its Input, Output and Error streams, shown for three commands]
● Standard input: a file or text stream.
● Standard output: screen by default.
● Standard error: may be captured for logging purposes.
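The three streams can be observed with a single ls call on one existing and one missing file. This is a sketch with hypothetical file names (present.txt, missing.txt, out.log, err.log).

```shell
work=$(mktemp -d)
cd "$work"

# ls prints existing names on standard output and complaints on standard error.
# '>' captures standard output, '2>' captures standard error.
touch present.txt
ls present.txt missing.txt > out.log 2> err.log || true

# A pipe (|) connects the standard output of one command
# to the standard input of the next
n_out=$(cat out.log | wc -l)

echo "stdout lines: $n_out"
echo "stderr saved in err.log:"
cat err.log
```

The `|| true` is only there because ls returns a non-zero status when a file is missing.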
Demo: command pipes
Get the list of chromosomes present in the file
root@vm: cut -f1 hg19_exons.bed | sort | uniq # the non-redundant list of chromosomes
Get the list of chromosomes present in the file and their counts
root@vm: cut -f1 hg19_exons.bed | sort | uniq -c # -c for ‘count’
Count the number of non-coding transcripts (containing ‘NR_’).
root@vm: cut -f4 hg19_exons.bed | grep "NR_" | sort | uniq | wc -l # 11675
Note: the uniq command removes duplicates from a sorted file.
Note: the grep command searches for a character string.
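The same kind of pipeline can be checked by hand on a very small file. The four lines below are made up, and the final awk step only swaps the two columns printed by uniq -c for readability.

```shell
bed=$(mktemp)
printf 'chr1\t0\t10\nchr2\t5\t9\nchr1\t20\t30\nchr1\t40\t50\n' > "$bed"

# Non-redundant list of chromosomes
chroms=$(cut -f1 "$bed" | sort | uniq)

# Same list with the number of lines per chromosome,
# most frequent first (sort -k1,1nr sorts on the count)
counts=$(cut -f1 "$bed" | sort | uniq -c | sort -k1,1nr | awk '{print $2, $1}')

echo "$chroms"
echo "$counts"
```

The sort before uniq matters: uniq only collapses identical lines that are adjacent.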
Exercises (graded)
● How many exons are on chromosome 22?
● What is the most frequent chrom-start-end tuple?
○ i.e. the most frequent exon.
Solution
root@vm: grep -w chr22 hg19_exons.bed | wc -l # n = 259
root@vm: cut -f1-3 hg19_exons.bed | sort | uniq -c | sort -n | tail -n 1 # 77 chrY
Exercise
● What is the genome fraction covered by exons?
○ We must perform the operation below.
[Figure: schematic of overlapping exons]
● Let us see how to do that...
Using Bedtools
Bedtools
● A software suite to perform arithmetic operations on genomic coordinates.
○ http://bedtools.readthedocs.org/en/latest/content/overview.html
● Some example usages:
○ Extend/slop regions.
○ Compare regions (intersect).
○ Merge regions.
○ Format conversion.
○ …
● The bedtools command is associated with a set of sub-commands.
Exercise with bedtools
● Use bedtools with the -h argument.
○ What do you see?
● Ask for some help about the merge command (bedtools merge -h)
○ Look at the arguments.
○ Read the note at the end of the help message. Why is it important?
Solution
root@vm: bedtools -h # the full set of sub-commands
root@vm: bedtools merge -h # use the -i argument
# The note says that the genomic regions must be sorted beforehand.
Exercise
● Use bedtools sort and bedtools merge to merge overlapping regions/exons.
root@vm: bedtools sort -i hg19_exons.bed | bedtools merge
How to save results to a file?
● Use the > redirection operator.
○ It overwrites the file if it already exists.
● >> can be used to append lines to an existing file.
root@vm: bedtools sort -i hg19_exons.bed | bedtools merge > hg19_exons_merged.bed
root@vm: ls # A new file was created
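The difference between > and >> is easy to verify with echo. A minimal sketch, writing to a temporary file:

```shell
out=$(mktemp)

# '>' creates the file, or overwrites it if it already exists
echo "first run"  > "$out"
echo "second run" > "$out"      # the line "first run" is lost

# '>>' appends to the end of the file instead
echo "third run" >> "$out"

cat "$out"
```

Because > silently replaces existing content, it is worth checking the target file name before running a long command.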
Some arithmetic with awk
Awk
● Awk is a command available on most Linux systems.
● Awk has its own language.
● Awk allows one to write one-liners (and more).
● The prototype of an awk command is the following:
awk 'BEGIN{action} {action} END{action}' file
● Each set of braces is associated with a particular task:
BEGIN{before opening the file}
{for each line}
END{after reading all lines}
Awk
● Awk has special variables.
● Examples:
FS: Field Separator.
OFS: Output Field Separator.
NR: Number of Records (the number of the current line).
NF: Number of Fields (in the current line).
$0: The current line
$1,$2,$3 (...): columns 1, 2 or 3 (...) of the current line
Example
# print columns 2 and 1
# \t is the tabulation character
root@vm: awk 'BEGIN{FS="\t"}{print $2,$1}' hg19_exons.bed
# print columns 2 and 1 with tabulated output
root@vm: awk 'BEGIN{FS=OFS="\t"}{print $2,$1}' hg19_exons.bed
# print columns 2 and 1 with tabulated output and line number
root@vm: awk 'BEGIN{FS=OFS="\t"}{print NR,$2,$1}' hg19_exons.bed
# Compute end - start (the fragment length) for each line
root@vm: awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' hg19_exons.bed
Exercise: compute the sum of fragment sizes (awk)
# Compute, at each line, the cumulative sum of the fragment sizes
# Note that “;” separates instructions
# s is a variable that we initialise to 0
# The last value printed is 75861726
root@vm: awk 'BEGIN{FS="\t"; s=0}{s=s+$3-$2; print s}' hg19_exons_merged.bed
# Or
root@vm: awk 'BEGIN{FS="\t"; s=0}{s=s+$3-$2}END{print s}' hg19_exons_merged.bed
# Get your calculators out (you can use R).
# 75861726/3.2e9*100
# ~ 2.37 % of the genome is covered
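The whole computation, including the final percentage, can also be done in awk. The sketch below uses three made-up merged regions and a hypothetical genome size of 1000 bases, so the result can be checked by hand.

```shell
merged=$(mktemp)
printf 'chr1\t0\t100\nchr1\t200\t250\nchr2\t10\t60\n' > "$merged"

# Sum of (end - start) over all lines, printed once at the end
total=$(awk 'BEGIN{FS="\t"; s=0} {s = s + $3 - $2} END{print s}' "$merged")

# Percentage of a hypothetical 1000-base genome covered by these regions
pct=$(awk -v covered="$total" -v genome=1000 'BEGIN{printf "%.1f", covered / genome * 100}')

echo "covered bases: $total"
echo "covered fraction: $pct %"
```

Here the regions cover 100 + 50 + 50 = 200 bases, i.e. 20.0 % of the hypothetical genome.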
Going further with awk
● The prototype of an awk command can be extended by adding ‘patterns’ (selectors or criteria).
awk 'BEGIN{} pattern {} END{}' file
● The criterion can be a regular expression (see below) or a logical expression.
# Example: test whether a equals b. Print if true.
awk 'a == b {print}' file
# Example: print if column 1 matches a regular expression.
awk '$1 ~ /regExp/ {print}' file
Examples with patterns
# The first line
root@vm: awk 'NR == 1 {print}' hg19_exons_merged.bed
# Lines 2 to 10
root@vm: awk 'BEGIN{OFS="\t"} NR >= 2 && NR <= 10 {print NR, $0}' hg19_exons_merged.bed
# Lines whose column 1 contains the string ‘chr19’.
root@vm: awk '$1 ~ /chr19/ {print}' hg19_exons_merged.bed
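Both kinds of pattern (logical expression and regular expression) can be tested on a few made-up lines. In this sketch the threshold of 50 bases is an arbitrary choice for the illustration.

```shell
regions=$(mktemp)
printf 'chr1\t0\t100\nchr19\t5\t50\nchr2\t10\t2000\nchr19\t60\t90\n' > "$regions"

# Logical expression: keep regions longer than 50 bases
long_regions=$(awk 'BEGIN{FS=OFS="\t"} $3 - $2 > 50 {print $1, $3 - $2}' "$regions")

# Regular expression: count lines whose column 1 is exactly chr19
chr19_count=$(awk 'BEGIN{FS="\t"} $1 ~ /^chr19$/ {n++} END{print n + 0}' "$regions")

echo "$long_regions"
echo "chr19 lines: $chr19_count"
```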
Regular expressions
● They describe a pattern in a character string.
.        any single character
[a-z]    a lowercase letter (range, e.g. [u-w])
[A-Z]    an uppercase letter (range, e.g. [U-W])
[ABc]    A or B or c
[^ABab]  any character other than A, B, a and b
^        start of line
$        end of line
x*       the character x, zero or more times
x+       the character x, one or more times
x{n,m}   the character x repeated n to m times
Examples
\.txt$            any string ending with ".txt"
^[A-Z]            a string starting with an uppercase letter
^.{4,6}\.txt$     four to six characters followed by ".txt"
^[A-Z].*\.txt$    a string starting with an uppercase letter and ending with ".txt"
^$                an empty string
^[^0-9]*\.sh$     a string containing no digits and ending with ".sh"
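Some of these patterns can be tried with grep -E (extended regular expressions) on a short list of made-up file names. This is only a sketch; LC_ALL=C is set as a precaution so that ranges such as [A-Z] behave the same on every system.

```shell
export LC_ALL=C   # make ranges such as [A-Z] behave predictably

names=$(mktemp)
printf 'Report.txt\nnotes.txt\nrun.sh\nrun2.sh\nabcd.txt\n\n' > "$names"

# Ends with ".txt" (the dot is escaped, $ anchors the end of the line)
txt=$(grep -c -E '\.txt$' "$names")

# Starts with an uppercase letter and ends with ".txt"
upper_txt=$(grep -E '^[A-Z].*\.txt$' "$names")

# No digit anywhere before the final ".sh"
no_digit_sh=$(grep -E '^[^0-9]*\.sh$' "$names")

# Empty lines
empty=$(grep -c -E '^$' "$names")

echo "$txt $upper_txt $no_digit_sh $empty"
```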
Exercise
● Using grep (global regular expression print), build a regular expression that retrieves, from the hg19_exons_merged.bed file, the lines whose column 1 is chr1, chr2 or chr9 (and no other chromosome, whatever the file may contain).
● Using awk and a pattern, build a regular expression that retrieves, from the hg19_exons_merged.bed file, the lines whose column 1 is chr1, chr2 or chr9 (and nothing else, whatever the file may contain).
Solutions
# grep: chr1, chr2 and chr9 (and nothing else!)
# Note the use of -P (Perl) to get an extended regular expression language,
# useful here to handle \t. Does not work on Mac.
root@vm: grep -P "^chr[129]\t" hg19_exons_merged.bed
# awk: chr1, chr2 and chr9 (and nothing else!)
root@vm: awk '$1 ~ /^chr[129]$/ {print}' hg19_exons_merged.bed
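Where grep -P is not available, one possible workaround is to place a literal tab character in the pattern. The sketch below selects chr1, chr2 and chr9 as asked in the exercise, on a made-up file that also contains chr10 and chr19 to show why the anchors matter.

```shell
bedfile=$(mktemp)
printf 'chr1\t0\t10\nchr10\t0\t10\nchr2\t0\t10\nchr19\t0\t10\nchr9\t0\t10\n' > "$bedfile"

# A literal tab character, so that the pattern does not rely on grep -P
tab=$(printf '\t')

# The tab right after the character class rejects chr10, chr19, ...
with_grep=$(grep -E "^chr[129]${tab}" "$bedfile" | cut -f1)

# Same selection with awk: ^ and $ anchor the whole first column
with_awk=$(awk 'BEGIN{FS="\t"} $1 ~ /^chr[129]$/ {print $1}' "$bedfile")

echo "$with_grep"
echo "$with_awk"
```

Without the trailing tab (grep) or the $ anchor (awk), chr10 and chr19 would also be selected.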
Thank you