Panati and webPanati – Information Systems for SNPs

Mark Wright Hamilton1, Marcelo Gonçalves Narciso2, Genevieve DeClerk1, Susan McCouch1

Abstract

This paper describes two software tools, Panati and webPanati. Panati runs in a Unix environment, reads data from the Illumina GenomeAnalyzer (Solexa), and reports the SNPs it finds. webPanati presents these data on the web, and users can take its output for their own SNP analyses. The data described here are from rice, but Panati and webPanati can be used to obtain SNPs from other crops. Both tools are free software and are easy to install and use, and webPanati produces files that can be used as input to the Flapjack system for SNP analysis.

1. Background

Panati (PANATI, 2010) is a set of programs for scanning short-read resequencing data against a reference sequence, typically for the detection of single nucleotide polymorphisms (SNPs) or small insertions or deletions. It is a fast and flexible map-to-reference alignment and polymorphism discovery program for short-read next-generation sequencing data. Features include gapped alignment, paired-end mapping, arbitrary read lengths, and fast multi-threaded execution.

Panati reads data from the Illumina GenomeAnalyzer (Solexa) and reports SNPs. Obtaining the SNP output requires running a sequence of commands; this paper shows how to run them and how to retrieve the data through Panati's interface, webPanati. webPanati reads the output of Panati, shows the SNP results on the web (in any browser), and delivers output as HTML pages and files. It also produces the input files for Flapjack (FLAPJACK, 2010), a tool that runs locally on the user's computer, which, as data sets grow larger and larger, is much more efficient than running over the web.

This paper describes what Panati is, how to install it, and how to take the output of Panati and load it into webPanati, which offers SNP queries on the web and gives output that users can analyze further.
These tools are free software and are being used in a project to obtain SNPs from 85 rice varieties, described in (RICESNP, 2010).

2. Implementation

This section describes Panati and webPanati and how they are installed and used.

2.1 Panati

Panati is a sequence alignment program developed to map short-read next-generation resequencing data to a reference genome, with the specific purpose of finding and characterizing SNP variation and short insertions or deletions (indels). In other words, Panati aligns resequencing data and calls variants, which makes it possible to discover SNPs. Panati is written in the C programming language and can be configured to run on any computer with a C compiler. As a custom-designed pipeline for the analysis of next-generation sequencing data, it can be used to align resequencing data coming off the Illumina GenomeAnalyzer IIx and call SNPs against a reference variety, for example the Nipponbare rice genome. It is available under the GNU public license, and a first version of a web interface (webPanati) presents Panati results on the Internet.

1 Dept. of Plant Breeding and Genetics, 240 Emerson Hall, Cornell University, Ithaca, NY 14853
2 Embrapa Rice and Bean, Dept. of Biotechnology, Santo Antonio de Goias, GO 75375-000

2.2 - About Panati installation

The Panati source code can be downloaded with the svn command (PANATICODE, 2010):

    svn co https://panati.svn.sourceforge.net/svnroot/panati panati

The code is placed in a newly created directory named panati. This directory contains the files of the Panati system, which runs on 64-bit Unix machines. To compile the code, run the “make” command in this directory. Once compiled, Panati can take one variety as the reference (A) and a second variety (B) and report the SNPs of B, with their positions, relative to A.

2.3 - How to use Panati

Assume that the user has a reference sequence in a FASTA-format file named refseq.seq and two FASTQ files (an example is shown further on) holding the forward and matching reverse reads from an Illumina GenomeAnalyzer (Solexa) run. The steps to obtain the SNP data are as follows.

Step 1. Execute the command

    ./panati-build -f ./refseq.seq -o refseq.pindex -l refseq -w 16 -s 1 -m 1000000000 &

The executable panati-build is created by the “make” command, and refseq.seq is the reference file.
In this example, the reference variety is Nipponbare (12 chromosomes), which is available from the NCBI FTP site (FTPNCBI, 2010) or the Gramene FTP site (FTPGRAMENE, 2010). The output of this command is the file refseq.pindex, which is used as input in Step 3.

Step 2. Execute the command

    ./fastq-qc -f 10216343_61G6JAAXX_s_2_1_sequence.txt -r 10216343_61G6JAAXX_s_2_2_sequence.txt -i 300 -o qc-input.reads -t 30 --3p-trim=15 -z &

The files 10216343_61G6JAAXX_s_2_1_sequence.txt and 10216343_61G6JAAXX_s_2_2_sequence.txt are Illumina output (Illumina Solexa GAIII format); the second file holds the reverse reads that match the forward reads in the first. Both are text files and contain the reads of the variety being compared with the reference. The first lines of 10216343_61G6JAAXX_s_2_1_sequence.txt illustrate the content of this file:

    @HWI-EAS339_0001:2:1:1010:6302#0/1
    NTGGCCGAAGCCCGATAGGCATGAACTCATCAGGGGAACTGCAANTGNTNNNGGACATCGTGCTCANANNNNNNNNNNNNNNNNNN
    +HWI-EAS339_0001:2:1:1010:6302#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWI-EAS339_0001:2:1:1010:3552#0/1
    NGCCAAGTCGACATCAGTCCGGCCGGCACGGCCCGTCCTACGCGNTGNGNNNGGCCGGTCTAGCCCNCNNNNNGNNNCNNNNNNNN
    +HWI-EAS339_0001:2:1:1010:3552#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

The first lines of 10216343_61G6JAAXX_s_2_2_sequence.txt are:

    @HWI-EAS339_0001:2:1:1010:6302#0/2
    NNNNNTCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCTTAATACGGGCTTCTCACNNNNNNNNNNNNNNN
    +HWI-EAS339_0001:2:1:1010:6302#0/2
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWI-EAS339_0001:2:1:1010:3552#0/2
    NNNNNACAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGACTTATTCCAGGCATAACANNNNNNNNNNNNNNN
    +HWI-EAS339_0001:2:1:1010:3552#0/2
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Each of the two files is almost 5 GB in size.

Step 3. Steps 1 and 2 produce the files refseq.pindex and qc-input.reads, which are the input to the command

    ./panati -r refseq.pindex -f qc-input.reads -l 1024 -m 0.10 -g 0.10 -o my-qc-input.panati --scan-shift=16 &

Step 4. The output of Step 3, my-qc-input.panati, is the input to the command

    ./coverage-report -p my-qc-input.panati -d 5 -m 0 > output4.txt

This prints a tab-delimited text report to standard output (redirected here to output4.txt) giving the coverage from depth 0 (no reads covering the reference sequence bases) to depth 5 (set with the -d option).
The columns of the coverage report are:

C1 - depth
C2 - Mb of reference covered at exactly this depth
C3 - % of reference sequence covered at exactly this depth
C4 - Mb of reference covered at or above this depth level
C5 - % of reference covered at or above this depth level
C6 - number of SNPs with variant allele occurring exactly depth times
C7 - % of SNPs with variant allele occurring exactly depth times
C8 - number of SNPs with variant allele occurring at least depth times
C9 - % of SNPs with variant allele occurring at least depth times
C10 - SNPs/kb if this depth level were used as a threshold for calling

For the Sertaneja rice variety, output4.txt is

    C1  C2      C3     C4       C5      C6      C7     C8       C9      C10
    0   72.591  19.4%  373.707  100.0%
    1   41.3    11.0%  301.1    80.6%   326229  32.3%  1010114  100.0%  3.35
    2   46.3    12.4%  259.9    69.5%   185632  18.4%  683885   67.7%   2.63
    3   47.8    12.8%  213.5    57.1%   147283  14.6%  498253   49.3%   2.33
    4   44.3    11.8%  165.7    44.4%   114311  11.3%  350970   34.7%   2.12
    5   121.5   32.5%  121.5    32.5%   236659  23.4%  236659   23.4%   1.95

Step 5. Combining PANATI results

Prepare a tab-delimited text file with two columns: a label for the sample, and the panati file holding the results of the PANATI scan. For example:

    sample1    sample1.panati
    sample2    sample2.panati
    sample3    sample3.panati

Suppose this is stored in a file named "my-sample-list.txt". Then run the command

    cat my-sample-list.txt | ./combine-samples -d 2 -r 0 -p refseq.pindex > haplotypes.tsv

For the Sertaneja rice variety, my-sample-list.txt has a single row, “sample1 my-qc-input.panati”. When the command finishes, it has written the file haplotypes.tsv, which lists all the SNPs and their positions relative to Nipponbare. For example, the first 6 lines of this file are

    refseq       pos  offset  ref.allele  var.allele  n.obs  n.variant  sample1  n.var  n.obs
    chr01|13101  172  0       C           T           1      1          T        2      2
    chr01|13101  247  0       G           A           1      1          A        2      2
    chr01|13101  248  0       A           C           1      1          C        2      2
    chr01|13101  276  0       T           C           1      1          C        2      2
    chr01|13101  324  0       C           T           1      1          T        4      4

The main columns of this output are the chromosome, the SNP position, the reference allele (Nipponbare, in this case) and the variety allele (Sertaneja, in this case). The file has hundreds of thousands of rows, and these results must be inserted into a database.
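Because haplotypes.tsv can run to hundreds of thousands of rows, a quick per-chromosome count is a convenient sanity check before the file is loaded into a database. The sketch below is not part of the Panati package. It assumes the layout shown above (a header line followed by whitespace-separated rows whose first field has the form chrNN|id), and the function name count_snps_per_chromosome is hypothetical.

```shell
#!/bin/sh
# Sketch (not part of Panati): count SNP rows per chromosome in a
# combine-samples output file such as haplotypes.tsv.
# Assumes the first field looks like "chr01|13101"; the header line and
# any row without a "|" in the first field are skipped.
count_snps_per_chromosome() {
    awk 'index($1, "|") > 1 {
             chrom = substr($1, 1, index($1, "|") - 1)   # e.g. "chr01"
             count[chrom]++
         }
         END {
             for (c in count) printf "%s\t%d\n", c, count[c]
         }' "$1" | sort
}

# Usage: count_snps_per_chromosome haplotypes.tsv
```

The output is one tab-separated line per chromosome label, sorted by label, which can be compared against the SNP totals in the coverage report.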
To load the results into a database, a script reads the Panati output (haplotypes.tsv) and writes a SQL script (input.sql) that inserts the data. webPanati has a table “posOfSNP” with the fields refseq, pos, offset, refAllele, varAllele, numObs, numVariant, sample, nVar, nObs and chromoNumber. These fields correspond to the columns of the Panati output, plus the chromosome number extracted from the refseq column. The script that converts the Panati output into insert statements for the posOfSNP table, insert.sh, contains the following shell code:

    #!/bin/sh
    # insert.sh: convert haplotypes.tsv (first argument) into SQL insert statements.
    # The header line and the chrSy/chrUn entries are skipped.
    grep -v refseq "$1" | grep -v "chrSy" | grep -v "chrUn" |
    awk '{
        # chromosome number: text before "|" in refseq, without "chr" and leading zeros
        chrom = substr($1, 1, index($1, "|") - 1)
        sub(/^chr0*/, "", chrom)
        print "insert into posOfSNP(refseq, pos, offset, refAllele, varAllele, numObs, numVariant, sample, nVar, nObs, chromoNumber) values(\"" $1 "\", " $2 ", " $3 ", \"" $4 "\", \"" $5 "\", " $6 ", " $7 ", \"" $8 "\", " $9 ", " $10 ", " chrom ");"
    }'

The SQL output is created by the command

    insert.sh haplotypes.tsv > input.sql

The script insert.sh, which runs in a Unix shell environment, is a simple option. The contents of haplotypes.tsv could equally be loaded into the posOfSNP table using other languages such as Java, C++ or Perl. The webPanati package also includes a C program, insert.c, that reads haplotypes.tsv and writes input.sql just as insert.sh does, but the script is simpler to use.

The webPanati database is named “snp”. The command that loads the Panati output into the posOfSNP table of the snp database is

    mysql -u <user> -p<password> snp < input.sql

If the snp database does not exist, the user can create it with the command

    mysqladmin -u root -p create snp

More information about installing and using Panati is available at http://panati.sourceforge.net/README.quickstart

2.4 - WebPanati Interface – Installation and how to use

Once the Panati data are in the webPanati database (“snp”), a web interface makes it easy for users to view the SNP data. This interface is currently running in a laboratory at Cornell University and will soon be available to everyone as free software. It was written in Java/JSP (server side) and JavaScript (client side), with MySQL (MYSQL, 2010) as the DBMS (database management system). AJAX (AJAX, 2010) and GreyBox (GREYBOX, 2010) are used on the client side to show results faster.
The interface offers a set of queries requested by biologists and researchers and shows SNPs in an HTML page or in an output file (text or CSV). In addition, webPanati can produce output for the Flapjack system (FLAPJACK, 2010), which displays the data graphically. webPanati is being used to store rice data, but it can store data for other crops (maize, bean, wheat, and so on). Its database is very simple, with few tables, but it can hold gigabytes of data and still perform well. The tables are indexed (on the fields for SNP position, chromosome number and variety), which gives fast access to the data. Without an index, a query over a large range of positions (4,000,000 records, for example) can take 3 minutes or more; with the index, it takes approximately 10 seconds.

Installing webPanati is easy. It requires the Tomcat software (TOMCAT, 2010), Java, and the MySQL DBMS on the server. The webPanati system must be placed in a directory of the Tomcat installation (for example, the webapps/ROOT/panati directory of Tomcat) so that the user can reach the webPanati page over the web (for example, http://server_address/panati). A SQL script creates the tables, and a manual explains how to insert data into the webPanati database. webPanati can be installed with a few commands on Windows, Linux, Solaris, AIX, FreeBSD, and so on, and it runs in any browser. The tool provides data over the web (HTML and text/CSV files), and a username and password are required for access. The initial page is shown in Figure 1.

Figure 1 – Initial page of webPanati

After a successful login, a page with the SNP search options appears, as shown in Figure 2. In this work the SNPs are from rice, covering 85 different rice varieties, and the reference variety for SNP calling is Nipponbare.

Figure 2 – Options of webPanati

The interface has six options for retrieving SNP data, and more functions will be added in the future. When the user chooses an option, its suboptions appear on the same page, as shown in Figure 3.

Figure 3 – Suboptions of the “Allele comparison in two varieties” option

All six options on this page have suboptions, like those shown in Figure 3. After the user chooses the varieties, the chromosome number, the SNP interval and where the output is to be shown, the system displays the results. For example, in the option “Allele comparison in two varieties”, if the varieties are Sertaneja and Caiapo, the chromosome number is 1, the interval is 1 to 200000 and the output (the Format suboption in Figure 3) is to be shown in an HTML page, the results are those in Figure 4.

Figure 4 – Output of the comparison between Caiapo and Sertaneja

All HTML results are shown in the same page as the options and suboptions, so that the user does not lose the last option chosen. A control in the upper right corner closes the results and returns to the options page. This is possible because the system uses the GreyBox JavaScript library (GREYBOX, 2010). The other options are: Coordinate-based SNP Search, SNP search (gene information and SNP distance), Common-by-descent search, Search for tri and quad-allelic SNP, and FlapJack Files.

As another example, the option “Coordinate-based SNP Search” has suboptions for choosing the varieties, the chromosome and the initial and final positions. Suppose that the user has chosen three varieties (Caiapo, Nipponbare and Sertaneja), chromosome 1, and SNP positions from 1 to 200000. Figure 5 shows the suboptions filled in for this example. The varieties available for selection are all those present in the Panati database.

Figure 5. Example of suboptions of Coordinate-based SNP Search

After the user submits this form, the results appear as shown in Figure 6.

Figure 6. Results of Coordinate-based SNP Search

All the other options are similar, in the sense that the user chooses suboptions and submits a form, and the results can be shown in an HTML page or written to a file. The other options are:

a) SNP search (gene information and SNP distance). This option returns the results for all the chosen varieties, with the SNP positions and the distance between each position and the previous one.

b) Common-by-descent search. With this option the user can find SNPs, and their positions, relative to a chosen variety other than Nipponbare. An example of this option is shown in Figure 7.

Figure 7. Example of Common-by-descent search

c) Search for tri and quad-allelic SNP. This option shows which SNP positions have 2, 3 or 4 different alleles across all the varieties in the database.

d) FlapJack Files. This option creates two files (a map file and a genotype file) that are used as input to the Flapjack software, which displays the analysis graphically.

Flapjack can routinely handle the large data volumes generated by the high-throughput Illumina SNP platform and comparable genotyping technologies. It is a new tool to facilitate analysis of these data types. Its visualizations are rendered in real time, allowing rapid navigation and comparisons between lines, markers and chromosomes. Examples of the contents of the map.txt and genotype.txt files are given below.

Map file: map.txt

    1_1   1  24
    1_2   1  124
    1_3   1  224
    1_4   1  324
    1_5   1  424
    1_6   1  524
    1_7   1  624
    1_8   1  724
    1_9   1  824
    1_10  1  924

Genotype file: genotype.txt

                1_1  1_2  1_3  1_4  1_5  1_6  1_7  1_8  1_9  1_10
    Sertaneja   A    T    C    T    A    T    G    A    G    A
    Nipponbare  T    A    G    A    T    A    C    T    C    T

These two files are an example of the output of the “FlapJack Files” option. Flapjack has an option for importing files (map.txt and genotype.txt), which is shown in Figure 8.
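How the two files relate to the Panati output can be illustrated with a short sketch. This is not the webPanati implementation, which reads from the snp database; it is an illustration that derives the same two layouts directly from a single-sample haplotypes.tsv. The function names, the marker naming scheme (chromosome number, underscore, running index) and the use of columns 4 and 8 as the reference and sample alleles are assumptions based on the examples in this paper, and the exact header conventions that Flapjack expects should be checked against its documentation.

```shell
#!/bin/sh
# Sketch only: derive Flapjack-style map and genotype text from a
# single-sample haplotypes.tsv (columns as in Step 5 of Section 2.3).
# Markers are named <chromosome>_<n>, where n counts SNPs along the chromosome.
# Rows are taken as-is; filter out chrSy/chrUn beforehand if they are unwanted.

# make_map FILE: one line per SNP with marker, chromosome, position (tab-separated)
make_map() {
    awk 'index($1, "|") > 1 {
             chrom = substr($1, 1, index($1, "|") - 1)
             sub(/^chr0*/, "", chrom)            # "chr01" -> "1"
             n[chrom]++
             printf "%s_%d\t%s\t%s\n", chrom, n[chrom], chrom, $2
         }' "$1"
}

# make_genotype FILE REF_NAME SAMPLE_NAME: a header of marker names, then one
# row of alleles for the reference (column 4) and one for the sample (column 8)
make_genotype() {
    awk -v ref="$2" -v smp="$3" '
         index($1, "|") > 1 {
             chrom = substr($1, 1, index($1, "|") - 1)
             sub(/^chr0*/, "", chrom)
             n[chrom]++
             header  = header  "\t" chrom "_" n[chrom]
             ref_row = ref_row "\t" $4
             smp_row = smp_row "\t" $8
         }
         END {
             print substr(header, 2)   # drop the leading tab
             print ref ref_row
             print smp smp_row
         }' "$1"
}

# Usage:
#   make_map haplotypes.tsv > map.txt
#   make_genotype haplotypes.tsv Nipponbare Sertaneja > genotype.txt
```

Keeping the marker names identical in both files is what lets Flapjack match each genotype column to a map position.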

Figure 8 – Import of map and genotype files

After these files are imported, Flapjack shows the varieties and the SNPs, as in Figure 9.

Figure 9. Flapjack showing SNPs according to the input files (map.txt and genotype.txt)

Based on the input of map, genotype and trait data, Flapjack is able to provide a number of alternative graphical genotype views, with individual alleles coloured by state, frequency or similarity to a given standard line (FLAPJACK, 2010). Flapjack runs in a desktop environment, has versions for Windows, Linux, Solaris and OS X, and is easy to install and use. Its advantages are that it offers several options (at least equivalent to those of an Internet genome browser), that it does not depend on a remote server (it runs on the user's machine) and that it is faster. It is also easier to save data and carry out analyses, and the user can configure details such as colours, positions, selections, and so on. Figure 10 shows Flapjack handling a large volume of SNP data.

Figure 10. Flapjack showing SNPs (large data volumes)

In the future, webPanati will gain further options to export data to other kinds of genome browser, for example GBrowse (GBROWSER, 2010) and Chado (CHADO, 2010). Flapjack is simpler to install and use, but some users prefer GBrowse because its displays can be viewed by anyone, anywhere. webPanati will therefore offer more options to serve these users.

3 - Results and Discussion

Section 2 (Implementation) shows how to install and use Panati and webPanati. The advantages of Panati are that it is free software, that it is suited to Illumina Solexa data, and that it is easy to install and use. The user can write a shell script that runs the five Panati steps automatically and creates the SQL script used by webPanati. The time needed to run Panati depends on the CPU and RAM of the machine and on the volume of input data. In the worst case, the time to run all the Panati steps was approximately 30 minutes per rice variety, as shown in Table 1.

    Step  Variety    Time (s)
    1     Caiapo     149
    2     Caiapo     112
    3     Caiapo     1416
    4     Caiapo     16
    5     Caiapo     23
    1     Sertaneja  146
    2     Sertaneja  115
    3     Sertaneja  1430
    4     Sertaneja  16
    5     Sertaneja  25

Table 1 – Time to execute Panati steps

These are the average times for each step, measured on a Linux CentOS machine (64-bit, 32 GB RAM, 16 CPUs at 2.7 GHz). The five steps sum to 1716 s for Caiapo and 1732 s for Sertaneja, that is, just under 30 minutes per variety. If the machine has more than one CPU, Panati can use more than one processor, specifically in Step 3. To do this, add the option “--n-threads=X” to the panati command, where X is the number of processors. For example, if the computer has 16 processors, the command for Step 3 is

    ./panati -r refseq.pindex -f qc-input.reads -l 1024 -m 0.10 -g 0.10 -o my-qc-input.panati --scan-shift=16 --n-threads=16 &

If the user runs the Unix “top” command while Step 3 is executing with --n-threads=16, it shows:

    top - 22:20:45 up 8 days, 5:54, 2 users, load average: 15.99, 14.84, 9.19
    Tasks: 242 total, 1 running, 241 sleeping, 0 stopped, 0 zombie
    Cpu(s): 99.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 32949440k total, 21630384k used, 11319056k free, 312372k buffers
    Swap: 34996216k total, 0k used, 34996216k free, 17433000k cached

      PID USER     PR  NI  VIRT   RES   SHR  S  %CPU    %MEM  TIME+      COMMAND
    24564 bionarc  18   0  11.5g  3.2g  876  S  1599.1  10.3  205:38.51  panati

The last line of this output is the panati process, whose %CPU is 1599.1, meaning that nearly all 16 processors are busy. Panati therefore makes excellent use of multi-threading, typically attaining about a 14-fold higher throughput than Panati run as a single thread (that is, using only a single CPU). Adding the --n-threads option to Step 3 thus speeds up Panati considerably. Without it, Step 3 for the Caiapo rice variety is much slower (more than 1 hour). Users whose computers have more than one processor can therefore run Panati faster.

The webPanati interface provides the main queries that biologists want, and one of its options produces input files for Flapjack.
webPanati is free software, so users can add more queries to its code, and output for a genome browser (GBrowse) will be added in the future. It is very simple to install: unzip the code in the Tomcat directory and execute the SQL scripts that create and populate the tables. These scripts are in the webPanati package. With webPanati, any user can easily obtain Panati output and use it for several kinds of analysis, choosing a query and specifying the interval of SNP positions to search in the database. The SNP positions in the database “snp” are indexed, which increases query performance. For example, a search for SNPs in a position interval (start = 1, final position = 1000000) takes more than 3 minutes without an index. With the index, it takes 10 seconds in the worst case.
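The indexing described above can be sketched as follows. The paper states only that the position, chromosome number and variety fields are indexed; the index name idx_snp_lookup, the column order and the example query are assumptions, and only the posOfSNP field names listed in Section 2.3 are used (that list does not name a variety column, so none appears here). The two hypothetical helper functions simply print SQL, which can then be piped to the mysql client.

```shell
#!/bin/sh
# Sketch: SQL for an index on the SNP coordinates and a range query that can use it.
# The index name and column order are assumptions; the table and column names
# come from the posOfSNP field list in Section 2.3.

# index_sql: print a CREATE INDEX statement covering chromosome and position
index_sql() {
    printf 'CREATE INDEX idx_snp_lookup ON posOfSNP (chromoNumber, pos);\n'
}

# range_query_sql CHROMOSOME START END: print a query for the SNPs on one
# chromosome whose position lies between START and END (inclusive)
range_query_sql() {
    printf 'SELECT pos, refAllele, varAllele FROM posOfSNP WHERE chromoNumber = %d AND pos BETWEEN %d AND %d;\n' "$1" "$2" "$3"
}

# Usage (mysql prompts for the password):
#   index_sql | mysql -u <user> -p snp
#   range_query_sql 1 1 1000000 | mysql -u <user> -p snp
```

Putting the chromosome number first in the index lets MySQL narrow the search to one chromosome and then scan only the requested position range, which is the access pattern of the interval searches described above.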

Thus search performance, the main concern for a system of this kind, is good even over large search intervals. Finally, the SNP data of the 85 rice varieties (RICESNP, 2010) need at least 340 GB in the MySQL database “snp” (approximately 4 GB per variety), so the MySQL server needs at least 340 GB of storage for the SNP data alone.

4 - Conclusion

Panati is a good tool for taking data from the Illumina GenomeAnalyzer (Solexa) and producing SNP output. Its output is a file with the SNP positions, the alleles of the reference variety and of the other variety, and further data. It is easy to install in a Unix environment, although some knowledge of Unix is needed.

webPanati takes the output of Panati and presents it on the web. It puts the SNP data in a convenient format for users to view, retrieve and save as HTML or as a file (text or CSV). With the output files in Flapjack format, further analyses are possible. webPanati is thus a good interface between the Panati command line and the user, or between Panati and Flapjack.

webPanati was developed to enable easy access to Panati data for non-programmers who are not equipped to manage large datasets like this. Configuring webPanati involves indexing the data in a relational database that structures the data so that questions can be asked of it easily and results are returned quickly. Layered on top of this data warehouse is a simple, user-friendly web-based interface that offers guided searches which facilitate interpretation of the SNP and indel calls made by Panati. webPanati allows the user to view Panati data based on genome coordinates and on the subsets of cultivars that are of most interest. Additionally, webPanati offers features that can help the researcher examine the extent of polymorphism or the degree of evolutionary commonality within specific genomic regions between specific subsets of germplasm.

Bibliography

AJAX. Available at http://www.xul.fr/en-xml-ajax.html. Accessed October 13, 2010.
CHADO. Available at http://gmod.org/wiki/Chado. Accessed October 13, 2010.
FLAPJACK. Available at http://bioinf.scri.ac.uk/flapjack. Accessed October 13, 2010.
FTPGRAMENE. Available at ftp://ftp.gramene.org/pub/gramene/release30/data/fasta/oryza_sativa/dna/. Accessed October 13, 2010.
FTPNCBI. Available at ftp://ftp.ncbi.nih.gov/genomes/Oryza_sativa/. Accessed October 13, 2010.
GBROWSER. Available at http://gmod.org/wiki/GBrowse. Accessed October 13, 2010.
GREYBOX. Available at http://orangoo.com/labs/GreyBox/. Accessed October 13, 2010.
MYSQL. Available at http://www.mysql.com. Accessed October 13, 2010.
PANATI. Available at http://panati.sourceforge.net. Accessed October 13, 2010.
PANATICODE. Available at http://sourceforge.net/projects/panati/develop. Accessed October 13, 2010.
RICESNP. Available at http://www.ricesnp.org/. Accessed October 13, 2010.
TOMCAT. Available at http://tomcat.apache.org/. Accessed October 13, 2010.
