Converting Large NCBI Databases into SAS Rosa SJ Lin Division of Statistical Genomics Washington University in Saint Louis June 30, 2008

Page 1:

Converting Large NCBI Databases into SAS

Rosa SJ Lin

Division of Statistical Genomics Washington University in Saint Louis

June 30, 2008

Page 2:

NCBI (http://www.ncbi.nlm.nih.gov)

Contains a large number of databases. Most important are:
- GenBank
- PubMed
- RefSeq
- Online Mendelian Inheritance in Man (OMIM)
- dbSNP

Page 3:

dbSNP Database

Page 4:

NCBI dbSNP

Contains information about SNPs

Submitted data is given an ss number (e.g. ss52079780)

If the data meets the criteria, a reference SNP is created, which has an rs number (e.g. rs530)

Page 5:

dbSNP Data (1) - Each record spans a varying number of lines, and each line has a varying length
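Because line lengths vary, the raw file is usually easiest to handle by first reading every line whole into one character variable. The following is a minimal sketch of that idea, not the author's code: the file name dbsnp_sample.flat, the dataset name rawlines, the variable linelen and the 1000-character limit are all assumptions made for illustration.

```sas
/* Hypothetical fileref and path: point this at a downloaded dbSNP flat file. */
filename din "dbsnp_sample.flat";

data rawlines;
   /* TRUNCOVER stops a short line from pulling in data from the next line.
      LRECL=1000 is an assumed upper bound on the line length. */
   infile din truncover lrecl=1000;

   /* Read the whole line, whatever its length, into one variable. */
   input snpline $char1000.;

   /* Length of the line without trailing blanks (0 for an empty line). */
   linelen = lengthn(snpline);
run;
```

Once each line sits in a single variable, the record type can be decided by looking at the text of the line rather than at fixed column positions.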

Page 6:

dbSNP Data (2)

Page 7:

dbSNP Data (3)

Page 8:

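Using the SCAN and INDEX functions to assist in reading data (1)

data ncbisnp;
   length rs $12;
   /* din is a fileref assumed to be assigned earlier with a FILENAME statement */
   infile din firstobs=1 missover pad;
   input snpline $132.;
   /* lines containing "updated" carry the rs number in the first |-delimited field */
   if index(snpline, "updated") > 0 then do;
      rs = compress(scan(snpline, 1, "|"));
      output;
   end;
run;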
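To see what each function contributes, the nested calls can be unpacked one step at a time on a single line. This is only an illustrative sketch: the sample line is made up in the pipe-delimited style the slide code assumes, and the helper variable field is hypothetical.

```sas
data _null_;
   length snpline $132 field $132 alleles $20;

   /* Made-up sample line, not real dbSNP data. */
   snpline = "SNP | alleles='A/G' | het=0.5";

   /* INDEX returns the position of the substring, or 0 if it is absent. */
   if index(snpline, "alleles=") > 0 then do;
      field   = scan(snpline, 2, "|");   /* second |-delimited field, blanks included */
      field   = compress(field);         /* remove all blanks from the field */
      alleles = substr(field, 9);        /* skip the 8 characters of "alleles=" */
      put alleles=;                      /* write the result to the SAS log */
   end;
run;
```

Testing with INDEX before calling SCAN means each line type is only parsed by the branch written for it, which matters when every record mixes several kinds of lines.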

Page 9:

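Using the SCAN and INDEX functions to assist in reading data (2)

/* alleles: skip the 8-character "alleles=" prefix of the second field */
if index(snpline, "alleles=") > 0 then do;
   alleles = substr(compress(scan(snpline, 2, "|")), 9);
   output;
end;

/* reference assembly line: chromosome from field 3 (after its first 4 characters), position from field 4 */
if index(snpline, "assembly=reference") > 0 then do;
   chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
   posc = compress(scan(snpline, 4, "|"));
   output;
end;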

Page 10:

Use the RETAIN statement to make a variable keep its value from one iteration of the DATA step to the next.

retain markname rs alleles;
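RETAIN matters here because the rs number, the alleles and the position of one SNP arrive on different input lines, that is, in different iterations of the DATA step. The sketch below shows one possible way to combine the pieces into a single row per SNP. It is an assumption-laden illustration rather than the author's program: the in-stream sample lines are made up (rs531 and all positions are hypothetical), markname is left out because the slides do not define it, and writing the row when the reference assembly line is reached is a design choice made for this example.

```sas
data ncbisnp (keep=rs alleles chrom posc);
   length rs $12 alleles $20 posc $30;
   retain rs alleles;              /* carry values forward from earlier lines */
   infile datalines truncover;
   input snpline $char80.;

   /* First line of a record: field 1 is the rs number. */
   if index(snpline, "updated") > 0 then do;
      rs = compress(scan(snpline, 1, "|"));
      alleles = " ";               /* reset so values do not leak across records */
   end;
   else if index(snpline, "alleles=") > 0 then
      alleles = substr(compress(scan(snpline, 2, "|")), 9);
   else if index(snpline, "assembly=reference") > 0 then do;
      chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
      posc  = compress(scan(snpline, 4, "|"));
      output;                      /* one row per SNP, written at the position line */
   end;
datalines;
rs530 | human | 9606 | snp | updated 2008-01-01
SNP | alleles='A/G' | het=0.5
CTG | assembly=reference | chr=7 | chr-pos=12345
rs531 | human | 9606 | snp | updated 2008-01-01
SNP | alleles='C/T' | het=0.3
CTG | assembly=reference | chr=12 | chr-pos=67890
;
run;
```

Without the RETAIN statement, rs and alleles would be reset to missing at the start of each iteration, so they would be blank by the time the position line triggers the OUTPUT. As in the slide code, posc keeps the full text of the fourth field.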

Page 11:

dbSNP Data (4)

Page 12:

Output SAS Dataset

Page 13:

Readings:

Kim L Kolbe et al., SUGI 22: “Advanced Techniques for Reading Difficult and Unusual Flat Files”.

Clinton S Rickards, SUGI 24: “Reading External Files Using SAS® Software”.