Essential Skills for Bioinformatics: Unix/Linux

Announcement

• No class on March 16

• Make-up class on March 17, 15:00-16:15

Working with remote machines

• Most data-processing tasks in bioinformatics require more computing power than we have on our workstations.

• We must work with larger servers or computing clusters.

• We will learn how to work with remote machines.

PuTTY

• PuTTY is an SSH and Telnet client for the Windows platform.

• Secure shell (SSH) is a cryptographic network protocol for connecting to another machine over a network. It is encrypted (secure for sending passwords, editing private files, etc.) and available on every Unix system.

• To download it, go to http://www.putty.org/
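Under the hood, connecting with PuTTY is equivalent to running the ssh program on a Unix system. A minimal sketch (the user and host names here are hypothetical):

$ ssh username@biocluster.example.edu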

PuTTY: How to use

[Screenshot slides]

Unix shell

• A Unix shell is a command-line interpreter that provides a command-line user interface.

• The shell serves as the interface to large bioinformatics programs, as an interactive console to inspect data and intermediate results, and as the infrastructure for pipelines and workflows.

Why do we use Unix in bioinformatics?

[Diagram: Input → single large program → Output]

Pros:
- Customized to a particular project
- Computationally efficient

Cons:
- Not generalizable to future projects
- Fragile
- Difficult to modify
- Error prone

Why do we use Unix in bioinformatics?

[Diagram: Input → program1 → program2 → program3 → program4 → Output]

The Unix shell was designed to allow users to easily build complex programs by interfacing smaller modular programs together.

The Unix shell provides a way for these programs to talk to each other (pipes) and to write to and read from files (redirection).

The advantage of the modular approach

• Easier to spot errors and figure out where they are occurring. Each component is independent, which makes it easier to inspect intermediate results for inconsistencies and isolate problematic steps.

• Allows us to experiment with alternative methods and approaches, as separate components can be easily swapped out for other components.

• Allows us to choose tools and languages that are appropriate for specific tasks. For example, we can combine command-line tools for interactively exploring data, Python for more sophisticated scripting, and R for statistical analysis.

• Modular programs are reusable and applicable to many types of data.

Project directories and directory structures

• Creating a well-organized directory structure is the foundation of a reproducible bioinformatics project.

• The actual process is simple: laying out a project only entails creating a few directories with mkdir and empty README files with touch.

• All files and directories used in your project should live in a single project directory with a clear name.

Example: SNP calling in maize (Zea mays)

zmays-snps/data: contains all raw and intermediate data
zmays-snps/data/seqs: NGS raw files
zmays-snps/scripts: general project-wide scripts
zmays-snps/analysis: contains many smaller analyses – analyzing the quality of your raw sequences, the aligner output, and the final data that will produce figures and tables for a paper
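A minimal sketch of how this layout could be created with mkdir (the -p flag creates parent directories as needed):

$ mkdir zmays-snps
$ mkdir -p zmays-snps/data/seqs
$ mkdir zmays-snps/scripts zmays-snps/analysis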

Naming files and directories

• It is best to use only letters, numbers, underscores, and dashes in file and directory names.

• Do not use spaces.

• Unix doesn’t require file extensions, but including extensions in filenames helps indicate the type of each file.

Project documentation

• Your bioinformatics project should be well documented.

• Document your methods and workflows: Any command that produces results used in your work needs to be documented. Be sure to document any command-line options (including default options).

• Document the origin of all data in your project directory: You need to keep track of where data was downloaded from, who gave it to you, and any other relevant information.

Project documentation

• Document when you downloaded data: It’s important to include when the data was downloaded, as the external data source (such as a website or server) might change in the future.

• Record data version information: Many databases have explicit release numbers, version numbers, or names (e.g., Ensembl GRCm38 release 87 for mouse genome).

• Document the versions of the software that you ran: This is important for reproducible research.

Project documentation

• All of this information is best stored in a plain-text README file.

• Let’s create an empty README file using touch. touch updates the access and modification times of a file to the current time, or creates the file if it doesn’t exist.
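For example, following the zmays-snps layout above:

$ touch zmays-snps/README zmays-snps/data/README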

Organizing data to automate file processing tasks

• Automating file processing tasks is an integral part of bioinformatics. Organizing data into subdirectories and using clear and consistent file naming schemes is imperative.

• Both of these practices allow us to programmatically refer to files. Doing something programmatically means doing it through code rather than manually, using a method that can effortlessly scale to multiple files.

Organizing data to automate file processing tasks

Let’s create some fake empty data files to see how consistent names help with programmatically working with files. Suppose we have three maize samples, “A”, “B”, and “C”, and paired-end sequencing data for each:
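A sketch of how these empty files could be created with touch (the file names are illustrative; _R1 and _R2 denote the two reads of each pair):

$ cd zmays-snps/data/seqs
$ touch zmaysA_R1.fastq zmaysA_R2.fastq zmaysB_R1.fastq zmaysB_R2.fastq zmaysC_R1.fastq zmaysC_R2.fastq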

Shell expansion

• Shell expansion is when your shell expands text for you so you don’t have to type it out.

• cd ~: Your shell expands the tilde character to the full path to your home directory.

Shell expansion

• Wildcards like the asterisk (*) are expanded by your shell to all matching files.

• Brace expansion creates strings by expanding out the comma-separated values inside the braces.
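For example, brace expansion can generate all six sample file names from the previous slide in a single command (a sketch, run inside zmays-snps/data/seqs):

$ echo zmays{A,B,C}_R{1,2}.fastq
zmaysA_R1.fastq zmaysA_R2.fastq zmaysB_R1.fastq zmaysB_R2.fastq zmaysC_R1.fastq zmaysC_R2.fastq
$ touch zmays{A,B,C}_R{1,2}.fastq      # creates all six empty files at once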

Shell expansion by wildcards

Suppose that we wanted to programmatically retrieve all files that have the sample name zmaysB rather than having to manually specify each file. To do this, we can use a Unix shell wildcard:
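For example (assuming the fake sample files created earlier):

$ ls zmaysB*
zmaysB_R1.fastq  zmaysB_R2.fastq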

Shell expansion by wildcards

In general, it’s best to be as restrictive as possible with wildcards. This protects against accidental matches.

?: matches only a single character
*: matches zero or more characters
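A sketch contrasting the two, using the fake sample files:

$ ls zmays?_R1.fastq      # ? matches exactly one character
zmaysA_R1.fastq  zmaysB_R1.fastq  zmaysC_R1.fastq
$ ls zmaysB_R*            # * matches zero or more characters
zmaysB_R1.fastq  zmaysB_R2.fastq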

Shell expansion by wildcards

• Range operators: [AB] matches either A or B, and [A-C] matches any character in the range A through C.

• Numeric ranges: [0-9] matches any single digit.
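Illustrative examples, again using the fake sample files:

$ ls zmays[AB]_R1.fastq       # matches sample A or B
zmaysA_R1.fastq  zmaysB_R1.fastq
$ ls zmays[A-C]_R1.fastq      # matches the range A through C
zmaysA_R1.fastq  zmaysB_R1.fastq  zmaysC_R1.fastq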

Leading zeros

• A useful trick is to use leading zeros (e.g. file-001.txt rather than file-1.txt) when naming files. This is useful because lexicographically sorting files leads to the correct ordering.
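A quick sketch of why this matters (ls lists files in lexicographic order):

$ touch file-1.txt file-2.txt file-10.txt
$ ls file-*
file-1.txt  file-10.txt  file-2.txt
$ touch file-001.txt file-002.txt file-010.txt
$ ls file-0*
file-001.txt  file-002.txt  file-010.txt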


Working with streams

• The text data in bioinformatics is often too large to fit in memory. Text streams in Unix allow us to do processing on a stream of data rather than holding it all in memory.

• The Unix shell simplifies tasks like combining large files by leveraging streams. Using streams prevents us from unnecessarily loading large files into memory. Instead, we can combine large files by printing their contents to the standard output stream and redirect this stream from our terminal to the file we wish to save the combined results to.

Working with streams

• We can look at the tb1-protein.fasta file by using cat to print it to standard output:
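For example (tb1-protein.fasta is the example file from the slide):

$ cat tb1-protein.fasta      # prints the file's contents to the terminal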

Working with streams

• cat also concatenates multiple files.

• To save these concatenated results to a file, you need to redirect this standard output stream from your terminal screen to a file.

Redirecting standard out to a file

• We use the operators > or >> to redirect standard output to a file. The operator > redirects standard output to a file and overwrites any existing contents of the file, whereas the latter operator >> appends to the file. If there isn’t an existing file, both operators will create it before redirecting output to it.
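A sketch of both operators (tga1-protein.fasta and the output file name are illustrative):

$ cat tb1-protein.fasta tga1-protein.fasta > zea-proteins.fasta   # > overwrites zea-proteins.fasta
$ cat tb1-protein.fasta >> zea-proteins.fasta                     # >> appends to it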

Redirecting standard error

A separate stream is needed for errors, warnings, and messages meant to be read by the user. Standard error is a stream for this purpose. Standard error is by default directed to your terminal. In practice, we often want to redirect the standard error stream to a file so messages, errors, and warnings are logged to a file we can check later.

Redirecting standard error

There is a reason why standard error’s redirect operator (2>) has a 2 in it. All open files, including streams, on Unix are assigned a unique integer known as a file descriptor. Unix’s three standard streams – standard input, standard output, and standard error – are given the file descriptors 0, 1, and 2, respectively.
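A sketch that logs standard output and standard error to separate files (the missing file forces ls to write an error; the exact message varies by system):

$ ls -l tb1-protein.fasta missing.fasta > listing.txt 2> errors.log
$ cat errors.log
ls: cannot access 'missing.fasta': No such file or directory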

Redirecting standard error

Occasionally a program will produce messages we don’t need or care about. Unix systems have a special fake disk, /dev/null, that we can redirect such output to. Output written to /dev/null disappears.
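For example (program1 and its arguments are hypothetical):

$ program1 input.txt > results.txt 2> /dev/null   # discard everything written to standard error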

Standard input redirection

• Normally standard input comes from your keyboard, but with the < redirection operator you can read standard input directly from a file.

• In practice, it is more common to use pipes or to pass files as arguments.
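A sketch of the difference (wc -l counts lines):

$ wc -l < tb1-protein.fasta     # wc reads from standard input; no file name in the output
$ wc -l tb1-protein.fasta       # more common: the file is passed as an argument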

Unix pipe

• Unix pipes are similar to the redirect operators, except rather than redirecting a program’s standard output stream to a file, pipes redirect it to another program’s standard input. Only standard output is piped to the next command; standard error is still printed to your terminal screen.

• We use pipes in bioinformatics not only because they are a useful way of building pipelines, but because they’re faster than reading and writing intermediate results to disk.
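For example, a pipe can send grep’s standard output straight into wc’s standard input with no intermediate file (a sketch, using the example FASTA file from earlier):

$ grep -v '^>' tb1-protein.fasta | wc -l    # count sequence lines, skipping FASTA header lines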

Unix pipe

• For large NGS data, writing to or reading from a disk can slow things down considerably. Additionally, unnecessarily redirecting output to a file uses up disk space.

• Pipes allow us to build larger, more complex tools from smaller modular parts. It doesn’t matter what language a program is written in. Pipes will work between anything as long as both programs understand the data passed between them.
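As a sketch of this modularity, a pipeline can chain several small, independent tools (assuming the FASTA files contain only header and sequence lines):

$ cat *.fasta | grep -v '^>' | tr -d '\n' | wc -c    # total number of sequence characters across all files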