Searching for uncommon sequences on STN
Jim Brown FIZ Karlsruhe, Inc.
Agenda
• Peptide/protein sequence searching – Definition of 20/22 common amino acids – Uncommon amino acids
• FEATURE TABLE (FEAT) – In DGENE, USGENE and PCTGEN
• NOTES field (NTE) – In REGISTRY
– Variables for amino acids • B, J and Z
– Modification of peptides/proteins
Agenda (cont.)
• Nucleic acid sequence searching – Definition of 5/6 common nucleotides – Uncommon/modified nucleotides
• FEATURE TABLE (FEAT) – In DGENE, USGENE and PCTGEN
• NOTES field (NTE) – In REGISTRY
– Variables for nucleotides • R, Y, M, K, S, W, B, D, H, V, N
– Modification of nucleic acids
20 Common amino acids
Amino acid   3 letter designation   1 letter designation
Alanine Ala A
Arginine Arg R
Asparagine Asn N
Aspartic acid Asp D
Cysteine Cys C
Glutamic acid Glu E
Glutamine Gln Q
Glycine Gly G
Histidine His H
Isoleucine Ile I
20 Common amino acids (cont.)
Amino acid   3 letter designation   1 letter designation
Leucine Leu L
Lysine Lys K
Methionine Met M
Phenylalanine Phe F
Proline Pro P
Serine Ser S
Threonine Thr T
Tryptophan Trp W
Tyrosine Tyr Y
Valine Val V
22 Common amino acids
Amino acid   3 letter designation   1 letter designation
Pyrrolysine Pyl O
Selenocysteine Scy U
Note: These two amino acids are only searchable in REGISTRY. Pyrrolysine was added to REGISTRY in 2006; selenocysteine is covered through all time periods. Only amino acids listed in WIPO ST.25 are allowed in formal listings in patents. These two amino acids are not listed in WIPO ST.25, therefore they are not used in DGENE, USGENE and PCTGEN.
• DGENE only recognizes 20 common amino acids
• O and U are represented by X in the sequence and Selenocysteine or Pyrrolysine will be in the FEATURE TABLE
L1 ANSWER x OF 45 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ 1 kfqicvsxgy rr
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+=======================
Modified-site|8       |note     |"Selenocysteine"
Example in DGENE
• USGENE only recognizes 20 common amino acids
• O and U are represented by X in the sequence and Selenocysteine or Pyrrolysine will be in the FEATURE TABLE
L2 ANSWER x OF 13 USGENE COPYRIGHT 2013 SEQUENCEBASE CORP on STN
SEQ 1 gpssggXg
FEATURE TABLE:
Key       |Location|
==========+========+=======================
MOD_RES   |7..7    |Selenocysteine
Example in USGENE
• PCTGEN only recognizes 20 common amino acids
• O and U are represented by X in the sequence and Selenocysteine or Pyrrolysine will be in the FEATURE TABLE
L6 ANSWER x OF 43 PCTGEN COPYRIGHT 2013 WIPO on STN
SEQ 1 rrXlwdqgn
FEATURE TABLE:
Key       |Location|
==========+========+=======================
          |        |Synthetically generated
          |        |peptide
VARIANT   |3       |Xaa = Cysteine or
          |        |selenocysteine
Example in PCTGEN
• REGISTRY recognizes 22 common amino acids – Including O and U
L4 ANSWER x OF 137 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 MVSPUWTW
Example in REGISTRY
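The difference between the databases can be sketched in a few lines of Python. This is only an illustration of the rule on the preceding slides (O and U appear as X in DGENE, USGENE and PCTGEN, with the residue named in the FEATURE TABLE); the helper name `to_patent_db_query` is hypothetical and not an STN feature.

```python
# Residues that REGISTRY accepts but DGENE, USGENE and PCTGEN show as X.
NON_ST25_RESIDUES = {"O": "Pyrrolysine", "U": "Selenocysteine"}


def to_patent_db_query(sequence):
    """Return (query, features) for a REGISTRY-style sequence.

    query    -- the sequence with O/U replaced by X
    features -- (1-based position, residue name) pairs that one would
                expect to find described in the FEATURE TABLE
    """
    query_chars = []
    features = []
    for position, residue in enumerate(sequence.upper(), start=1):
        if residue in NON_ST25_RESIDUES:
            query_chars.append("X")
            features.append((position, NON_ST25_RESIDUES[residue]))
        else:
            query_chars.append(residue)
    return "".join(query_chars), features


query, features = to_patent_db_query("MVSPUWTW")
print(query)     # MVSPXWTW
print(features)  # [(5, 'Selenocysteine')]
```

The returned names are the terms to combine with /FEAT after the sequence search.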
Sequences with uncommon amino acids
• Uncommon amino acids defined as – Anything other than 20/22 common amino acids
• In DGENE, USGENE, and PCTGEN, BLAST and Sequence Code Match (SCM) searches recognize only the 20 common amino acid designations and the 3 variable designations (B, X and Z)
Sequences with uncommon amino acids
• For protein/peptide searches with uncommon amino acids, use wildcard symbols in search query – ‘X’ in query for BLAST search queries – ‘.’ in query for SCM search queries
• Search FEATURE TABLE
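As a minimal sketch of the wildcard rule above, the following Python helper masks the uncommon positions of a peptide with the symbol that matches the search type. The function name and the use of "?" as a placeholder for the uncommon residue are assumptions made for illustration only.

```python
# Wildcard symbols taken from the slide; the helper itself is hypothetical.
WILDCARDS = {"BLAST": "X", "SCM": "."}


def mask_uncommon(sequence, uncommon_positions, search_type):
    """Replace residues at the given 1-based positions with the wildcard
    symbol for the chosen search type ("BLAST" or "SCM")."""
    wildcard = WILDCARDS[search_type.upper()]
    residues = list(sequence.upper())
    for position in uncommon_positions:
        residues[position - 1] = wildcard
    return "".join(residues)


# "?" marks the uncommon residue before masking.
print(mask_uncommon("LTCLASY?WL", [8], "BLAST"))  # LTCLASYXWL
print(mask_uncommon("LTCLASY?WL", [8], "SCM"))    # LTCLASY.WL
```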
Example in DGENE
=> FILE DGENE
FILE 'DGENE' ENTERED AT 15:31:54 ON 14 FEB 2013 COPYRIGHT (C) 2013 THOMSON REUTERS
=> RUN GETSEQ LTCLASYXWL/SQSP
RUN GETSEQ AT 15:32:11 ON 14 FEB 2013 COPYRIGHT (C) 2013 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 2 LTCLASYXWL/SQSP
=> S L1 AND (HOMOCYST? OR HCY)/FEAT
13888 HOMOCYST?/FEAT
56 HCY/FEAT
L2 1 L1 AND (HOMOCYST? OR HCY)/FEAT
=> D SEQ FEAT
L2 ANSWER 1 OF 1 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ 1 kltclasyxw lf
       ========= =
HITS AT: 2-11
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+=======================
Modified-site|9       |note     |"Homocysteine"
Sequences with uncommon amino acids
• CAS has a list of ‘standardized uncommon amino acids’ with three letter designations – Unique to the CAS databases
• In REGISTRY, SCM searches can recognize ‘standardized uncommon amino acid designations’ – Enter the three-letter designation in single quotes
• e.g. 'HCY'
Note: To access the entire list of standardized uncommon amino acids while in REGISTRY, use HELP AAU.
=> FILE REG
FILE 'REGISTRY' ENTERED AT 17:56:27 ON 22 FEB 2013 USE IS SUBJECT TO THE TERMS OF YOUR STN CUSTOMER AGREEMENT. PLEASE SEE "HELP USAGETERMS" FOR DETAILS. COPYRIGHT (C) 2013 American Chemical Society (ACS) . . .
=> HELP AAU . . .
3-Letter Code   Name
-------------   ----
Aaa             alpha-amino acid
Aad             2-aminoadipic acid (2-aminohexanedioic acid)
Aan             alpha-asparagine
Abu             2-aminobutanoic acid
Aca             2-aminocapric acid (2-aminodecanoic acid)
. . .
Har             homoarginine
Hcy             homocysteine
Hhs             homohistidine
Hiv             2-hydroxyisovaleric acid
Hse             homoserine
. . .
Codes for Standardized Uncommon Amino Acids
Example in REGISTRY
=> S LTCLASY'HCY'WL/SQSP
L1 1 LTCLASY'HCY'WL/SQSP
=> D SEQ NTE
L1 ANSWER 1 OF 1 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 KLTCLASYXW LF
       ========= =
HITS AT: 2-11
NTE modified
-------------------------------------------------------------------
type            ------ location ------   description
-------------------------------------------------------------------
terminal mod.   Phe-12         -         C-terminal amide
uncommon        Hcy-9          -         -
-------------------------------------------------------------------
HCY = Homocysteine
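The quoting convention for standardized uncommon amino acids can be sketched as a small query builder. This is an illustrative helper only (the name `registry_scm_query` is hypothetical); it simply assembles the same string that is typed after `=> S` in the REGISTRY example.

```python
def registry_scm_query(residues, field="SQSP"):
    """Build a REGISTRY SCM search term.

    residues -- list of one-letter codes and/or three-letter codes for
                standardized uncommon amino acids (e.g. "Hcy")
    Three-letter codes are upper-cased and wrapped in single quotes.
    """
    parts = []
    for residue in residues:
        if len(residue) == 1:
            parts.append(residue.upper())
        elif len(residue) == 3:
            parts.append("'" + residue.upper() + "'")
        else:
            raise ValueError("expected a 1- or 3-letter code: %r" % residue)
    return "".join(parts) + "/" + field


term = registry_scm_query(list("LTCLASY") + ["Hcy"] + list("WL"))
print(term)  # LTCLASY'HCY'WL/SQSP
```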
Sequences with uncommon amino acids
• To search for uncommon amino acids not on list of ‘standardized uncommon amino acids’, use wildcard symbols in search queries – In DGENE, USGENE and PCTGEN
• Use ‘X’ in BLAST search queries – Search name variations in FEATURE TABLE
• Use ‘.’ in SCM search queries – Search name variations in FEATURE TABLE
Sequences with uncommon amino acids
• To search for uncommon amino acids not on the list of ‘standardized uncommon amino acids’, use wildcard symbols in search queries – In REGISTRY
• Use ‘X’ in BLAST search queries – Search for ‘uncommon’ in the NTE field
• Use ‘.’ in SCM search queries – Search for ‘uncommon’ in the NTE field
Search example
Search Question: Find all peptides that contain the following sequence –
KKLKQKLAELLENLLERFLDLVX
where X=azaproline
=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:34:29 ON 29 JAN 2013 COPYRIGHT (C) 2013 THOMSON REUTERS
. . .
=> RUN BLAST KKLKQKLAELLENLLERFLDLVX/SQP -F F -M PAM30 -W 2 -E 10000
BLAST Version 2.2 . . .
Search example in DGENE
Note: Changes in BLAST parameters - 1. Turned low complexity filter off. 2. Changed matrix to PAM30. 3. Changed word size to 2. 4. Changed expectation value to 10,000.
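The four parameter changes in the note can be captured in a small command builder. This is a sketch that only reproduces the command string shown on the slide; the function name is hypothetical, and the flag values are taken from the slide rather than from any STN reference.

```python
def short_peptide_blast_command(sequence, expectation=10000):
    """Assemble the RUN BLAST line used for short peptide queries.

    -F F      low complexity filter off
    -M PAM30  PAM30 scoring matrix
    -W 2      word size 2
    -E n      expectation value
    """
    return "RUN BLAST %s/SQP -F F -M PAM30 -W 2 -E %d" % (
        sequence.upper(),
        expectation,
    )


print(short_peptide_blast_command("KKLKQKLAELLENLLERFLDLVX"))
# RUN BLAST KKLKQKLAELLENLLERFLDLVX/SQP -F F -M PAM30 -W 2 -E 10000
```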
10000 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10000.0
QUERY SELF SCORE VALUE IS 72
BEST ANSWER SCORE VALUE IS 72
[Histogram: Similarity Score (axis marks 72 and 36) versus Answer Count (axis marks 2000 to 10000)]
Search example in DGENE (cont.)
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY % (BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%)
ENTER (ALL) OR ? :50%
L1 RUN STATEMENT CREATED
L1 508 KKLKQKLAELLENLLERFLDLVX/SQP.-F F -M PAM30 -W 2 -E 10000
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
=> S L1 AND ((AZA (1W) PRO?) OR AZAPRO?)/FEAT
996 AZA/FEAT
367154 PRO?/FEAT
115 AZA (1W) PRO?
1 AZAPRO?/FEAT
L2 85 L1 AND ((AZA (1W) PRO?) OR AZAPRO?)/FEAT
=> SOR L2 SCORE D
PROCESSING COMPLETED FOR L2
L3 85 SOR L2 SCORE D
Search example in DGENE (cont.)
Note: Searching for azapro? as one word, or as two words or hyphenated (aza-pro?).
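The name-variation pattern from this search can be sketched as a helper that covers both spellings of a prefixed residue name. The helper name is hypothetical; the operators `(1W)` and `?` are copied from the search statement on the slide, and no other STN syntax is assumed.

```python
def feat_name_variants(prefix, stem, field="FEAT"):
    """Build a search term that covers 'prefix stem', 'prefix-stem'
    and 'prefixstem' spellings, each with right truncation (?)."""
    prefix = prefix.upper()
    stem = stem.upper()
    separated = "(%s (1W) %s?)" % (prefix, stem)  # two words or hyphenated
    joined = "%s%s?" % (prefix, stem)             # written as one word
    return "(%s OR %s)/%s" % (separated, joined, field)


print(feat_name_variants("aza", "pro"))
# ((AZA (1W) PRO?) OR AZAPRO?)/FEAT
```

The result can be combined with an L-number, as in `S L1 AND ((AZA (1W) PRO?) OR AZAPRO?)/FEAT`.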
=> D BIB SCORE SEQ FEAT
L3 ANSWER 1 OF 85 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
AN AYH82102 peptide DGENE
TI New peptides with specific amino acid residue useful to treat or prevent e.g. dyslipidemia, cardiovascular disease e.g. atherosclerosis and restenosis, endothelial dysfunction, macrovascular disorder and microvascular disorder.
IN Dasseux J; Schwendeman A S; Zhu L
PA (CERE-N) CERENIS THERAPEUTICS SA.
PI WO 2010093918 A1 20100819 210
AI WO 2010-US24096 20100212
PRAI US 2009-152960P 20090216
. . .
PSL Disclosure; SEQ ID NO 229
DT Patent
LA English
OS 2010-K33647 [56]
DESC Cardiovascular disease treatment related peptide, SEQ ID 229.
SCORE 72 100% of query self score 72
SEQ 1 kklkqklael lenllerfld lvx
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+=================
Modified-site|23      |note     |"aza-proline"
Search example in DGENE (cont.)
Feature table identifies amino acid at position 23 as aza-proline.
Search example in REGISTRY
1. Click on BLAST button on main STN Express.
2. This opens the Results Set Manager, which allows you to run a new sequence search.
Search example in REGISTRY (cont.)
1. Type in sequence to be searched (or read from file)
2. Optional: Give the search a result name.
3. Click OK.
Search example in REGISTRY (cont.)
For this example, choose BLASTp button.
Search example in REGISTRY (cont.)
Choose which references are to be seen. For this example, only sequences which appear in at least one patent document are chosen.
Search example in REGISTRY (cont.)
As this is a short sequence, check Show Additional Options box.
Search example in REGISTRY (cont.)
Parameters changed:
1. Low complexity filter is turned off.
2. Word size is changed to 2.
3. Expectation value is changed to 1000.
4. The Weight Matrix is changed to PAM-30.
5. Max No. of Answers is changed to 1000.
Search example in REGISTRY (cont.)
Sequence search is added to Results Set Manager. The current status is Running.
Search example in REGISTRY (cont.)
When the status is Complete, highlight the name and click on View Results.
Search example in REGISTRY (cont.)
Decide which sequences are of interest and check box next to those sequences.
Search example in REGISTRY (cont.)
Click on Get STN Data button.
Search example in REGISTRY (cont.)
Choose the appropriate option. In this example, Sequence Records was chosen.
Search example in REGISTRY (cont.)
If you wish to save your transcript, name it here. Tip: For consistency, use the same name as the saved sequence.
Search example in REGISTRY (cont.)
Aza-proline is not a standardized uncommon amino acid, so ‘uncommon’ will be used in the Notes (NTE) field.
=> S L17 AND UNCOMMON/NTE
753988 UNCOMMON/NTE
L18 133 L17 AND UNCOMMON/NTE
=> D SEQ NTE 1-3
L18 ANSWER 1 OF 133 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 KLKQKLAELL ENLLERFLDL VX
**RELATED SEQUENCES AVAILABLE WITH SEQLINK**
NTE
-----------------------------------------------------------------
type        ------ location ------   description
-----------------------------------------------------------------
uncommon    Aaa-22         -         -
-----------------------------------------------------------------
Search example in REG (cont.)
Note: The amino acid designation is Aaa, which stands for ‘other specific amino acid’, as opposed to Xaa (X), which would designate ‘any amino acid’. This allows for another level of search refinement.
L18 ANSWER 2 OF 133 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 KLKQKXAELG ENLLERFLDL VX
**RELATED SEQUENCES AVAILABLE WITH SEQLINK**
NTE
-----------------------------------------------------------------
type        ------ location ------   description
-----------------------------------------------------------------
uncommon    Aaa-6          -         -
uncommon    Aaa-22         -         -
-----------------------------------------------------------------
L18 ANSWER 3 OF 133 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 KLKQKLAELL EQLLDKFLEL AX
**RELATED SEQUENCES AVAILABLE WITH SEQLINK**
NTE
-----------------------------------------------------------------
type        ------ location ------   description
-----------------------------------------------------------------
uncommon    Aaa-22         -         -
-----------------------------------------------------------------
Search example in REG (cont.)
Variables used in sequence searching
1 letter designation   3 letter designation   Represents
B Asx Aspartic acid or Asparagine
J Xle Isoleucine or Leucine (only works in REG; added in 2006)
X Xxx Uncommon or unspecified
Z Glx Glutamic acid or Glutamine
Variables in BLAST sequence searching
• For DGENE, USGENE, PCTGEN – When B or Z is used in a BLAST search, the exact amino acids they represent score as a positive (+) match, not an identity match
=> FILE PCTGEN
FILE 'PCTGEN' ENTERED AT 16:00:49 ON 14 FEB 2013 COPYRIGHT (C) 2013 WIPO
=> RUN BLAST MGBNFQ/SQP -F F -M PAM30 -W 2 -E 20000
BLAST Version 2.2.20 . . .
Note: Changes in BLAST parameters - 1. Turned low complexity filter off. 2. Changed matrix to PAM30. 3. Changed word size to 2. 4. Changed expectation value to 20,000.
Variables in BLAST sequence searching
446 ANSWERS FOUND BELOW EXPECTATION VALUE OF 20000.0
QUERY SELF SCORE VALUE IS 24
BEST ANSWER SCORE VALUE IS 24
[Histogram: Similarity Score (axis marks 24 and 12) versus Answer Count (axis marks 90 to 450)]
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY % (BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%)
ENTER (ALL) OR ? :60%
L1 RUN STATEMENT CREATED
L1 446 MGBNFQ/SQP.-F F -M PAM30 -W 2 -E 20000
Variables in BLAST sequence searching
=> SOR SCORE D
PROCESSING COMPLETED FOR L1
L2 446 SOR L1 SCORE D
=> D SCORE ALIGN 1 5
L2 ANSWER 1 OF 446 PCTGEN COPYRIGHT 2013 WIPO on STN
SCORE 24 100% of query self score 24
BLASTALIGN
Query = 6 letters
Length = 29
Score = 23.5 bits (48), Expect = 1e-05
Identities = 6/6 (100%), Positives = 6/6 (100%)
Query: 1  MGBNFQ 6
          MGBNFQ
Sbjct: 24 MGBNFQ 29
L2 ANSWER 5 OF 446 PCTGEN COPYRIGHT 2013 WIPO on STN
SCORE 24 100% of query self score 24
BLASTALIGN
Query = 6 letters
Length = 498
Score = 23.5 bits (48), Expect = 2e-04
Identities = 5/6 (83%), Positives = 6/6 (100%)
Query: 1  MGBNFQ 6
          MG+NFQ
Sbjct: 40 MGDNFQ 45
The exact amino acid is a positive (+) match, not an identity match.
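The two alignments can be reproduced with a deliberately simplified model in Python. It is an assumption-laden sketch: real BLAST counts a position as positive whenever the scoring matrix gives a positive score, whereas this toy version only treats a variable (B or Z) paired with one of its specific amino acids as positive.

```python
# Variables and the specific amino acids they stand for (from the table above).
AMBIGUOUS = {"B": {"D", "N"}, "Z": {"E", "Q"}}


def count_matches(query, subject):
    """Return (identities, positives) for two equal-length, ungapped
    sequences under the simplified rule described above."""
    identities = 0
    positives = 0
    for q, s in zip(query.upper(), subject.upper()):
        if q == s:
            identities += 1
            positives += 1
        elif s in AMBIGUOUS.get(q, ()) or q in AMBIGUOUS.get(s, ()):
            positives += 1  # variable vs. specific residue: "+" only
    return identities, positives


print(count_matches("MGBNFQ", "MGBNFQ"))  # (6, 6)
print(count_matches("MGBNFQ", "MGDNFQ"))  # (5, 6)
```

Both records reach the same score in the display above, so sorting by score alone does not separate records containing B from records containing D or N.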
Variables in SCM sequence searching
• For DGENE, USGENE, and PCTGEN – J is not an acceptable letter for sequence searching – B and Z will only capture records where the sequence has those designations, not the corresponding specific amino acids they represent
– To include specific amino acids in SCM search query, use [ ]
• Example: For TLGIVZPI subsequence search use – RUN GETSEQ TLGIV[ZEQ]PI/SQSP
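The bracket rule can be sketched as a small Python helper that rewrites B and Z into bracketed alternatives before the GETSEQ run. The helper name is hypothetical, and the order of letters inside the brackets simply follows the two examples shown on the slides.

```python
# Bracket contents as written in the slide examples: [ZEQ] and [BND].
SCM_EXPANSIONS = {"B": "BND", "Z": "ZEQ"}


def getseq_with_alternatives(subsequence, field="SQSP"):
    """Return a RUN GETSEQ line in which each variable (B, Z) is replaced
    by a bracketed set containing the variable and its specific residues."""
    parts = []
    for residue in subsequence.upper():
        if residue in SCM_EXPANSIONS:
            parts.append("[" + SCM_EXPANSIONS[residue] + "]")
        else:
            parts.append(residue)
    return "RUN GETSEQ " + "".join(parts) + "/" + field


print(getseq_with_alternatives("TLGIVZPI"))  # RUN GETSEQ TLGIV[ZEQ]PI/SQSP
print(getseq_with_alternatives("FAEBGK"))    # RUN GETSEQ FAE[BND]GK/SQSP
```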
=> FILE DGENE
FILE 'DGENE' ENTERED AT 22:39:25 ON 18 FEB 2013 COPYRIGHT (C) 2013 THOMSON REUTERS . . .
=> RUN GETSEQ FAEBGK/SQSP
RUN GETSEQ AT 22:39:35 ON 18 FEB 2013 COPYRIGHT (C) 2013 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 1 FAEBGK/SQSP
=> D SEQ
L1 ANSWER 1 OF 1 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 tyvpkefnae tftfhadict lsekerqikk qtalvelvkh kpkatkeqlk
     51 avmddfaafv ekcckaddke tcfaebgkkl vaasqaalgl
                                ======
HITS AT: 73-78
Variable SCM search example in DGENE
This search strategy will capture only those sequences that have a B in them, not sequences that have N or D.
=> RUN GETSEQ FAE[BND]GK/SQSP
RUN GETSEQ AT 22:46:15 ON 18 FEB 2013 COPYRIGHT (C) 2013 FIZ KARLSRUHE GMBH
L2 RUN STATEMENT CREATED
L2 100 FAE[BND]GK/SQSP
=> S L2 NOT L1
L3 99 L2 NOT L1
=> D SEQ
L3 ANSWER 1 OF 99 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 dtavtnkqnf stdviyqvft drfldgnpsn nptgaafdgt csnlklycgg
     51 dwqglinkin dnyfsdlgvt alwisqpven ifatinysgv tntayhgywa
    101 rdfkktnpyf gtmtdfqnlv tsahakgiki iidfapnhtf pametdtsfa
                                                            ==
    151 engklydngs lvggytndtn gyfhhnggsd fstlengiyk nlydladlnh
        ====
    201 nnstidtyfk daiklwldmg vdgirvdavk hmpqgwqknw mssiyahkpv
    251 ftfgewflgs aasdadntdf anesgmslld frfnsavrnv frdntsnmya
. . .
    651 kkngatitwe ggsnhtfttp tsgtatvtvn wq
HITS AT: 149-154
Variable SCM search example in DGENE
This search strategy will capture those sequences that have a B, N or D in them.
Variables in SCM sequence searching
• For REGISTRY – J, B and Z are searchable – J, B and Z will capture records where the sequence has those designations, and sequences with the corresponding specific amino acids they represent
=> FILE REG
FILE 'REGISTRY' ENTERED AT 22:55:32 ON 18 FEB 2013 USE IS SUBJECT TO THE TERMS OF YOUR STN CUSTOMER AGREEMENT. PLEASE SEE "HELP USAGETERMS" FOR DETAILS. COPYRIGHT (C) 2013 American Chemical Society (ACS)
. . .
=> S FAEBGK/SQSP
L1 122 FAEBGK/SQSP
Variable SCM search example in REG
=> D SEQ 1-122
. . .
L1 ANSWER x OF 122 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 MIRRTLPILL MILLAGCNQE SGASKEPGEH REVIQGMHTQ FIKVTEGQNQ
     51 WYEMAISVDD SNTFRMPVFF AEDGKLVRVD DKQARKLFDR WLKERAKGIA
                            = =====
    101 AFSSVDEQVG FKGPFLALDV KR
HITS AT: 70-75
. . .
L1 ANSWER x OF 122 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 TPDAERTMLT HLGISITLQK SDVDLEKLKS SSISYIEGYL WDGQGTKEAS
     51 LLTMEESKKN GVKVAYTYSD PFCVNRSRED FIRLTKEYFD IVFCNTEEAK
    101 ALSQREDKLE ALKFISGLSA LVFMTDSANG AYFAENGKIS HVDG
                                           ======
HITS AT: 133-138
. . .
Variable SCM search example in REG
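The REGISTRY behaviour seen in these answers can be imitated with a regular expression, purely to illustrate why FAEBGK retrieves records containing FAEDGK and FAENGK. This is an emulation written for this transcript, not a description of how REGISTRY is implemented; the function name is hypothetical.

```python
import re

# In REGISTRY a variable matches itself and the residues it represents.
REGISTRY_VARIABLES = {"B": "BDN", "J": "JIL", "Z": "ZEQ"}


def registry_like_pattern(query):
    """Compile a regex that matches the query the way the slides describe
    REGISTRY SCM matching for the variables B, J and Z."""
    parts = []
    for residue in query.upper():
        if residue in REGISTRY_VARIABLES:
            parts.append("[" + REGISTRY_VARIABLES[residue] + "]")
        else:
            parts.append(re.escape(residue))
    return re.compile("".join(parts))


pattern = registry_like_pattern("FAEBGK")
# Fragments taken from the two REGISTRY answers shown above.
print(pattern.search("PVFFAEDGKLVR").group())  # FAEDGK
print(pattern.search("AYFAENGKIS").group())    # FAENGK
```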
Modification of peptides/proteins
• Modifications are discussed in – FEATURE TABLE
• DGENE, USGENE and PCTGEN – Notes field
• REGISTRY
• Modification info may include – Stereochemistry – Modification(s) made at specific amino acid site(s)
• CAS standardized blocking groups • Non-standardized modifications
Modification of amino acid residues
• In DGENE, USGENE, and PCTGEN – Amino acid residue representation
• Original amino acid or X – FEATURE TABLE (/FEAT)
• Look for keywords and variations/spellings/abbreviations
Example of stereochemistry info
L1 ANSWER 2 OF 54958 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
AN BAJ42897 peptide DGENE
TI New peptide compound having specific amino acid sequence, useful for treating e.g. thrombosis, thrombophlebitis, unstable angina, myocardial infarction, stroke, sepsis, tumor metastasis, inflammatory arthritis.
IN Huang T; Chang C; Chung C
PA (UNTU) UNIV TAIWAN NAT.
PI WO 2012172427 A2 20121220 46
AI WO 2012-IB1345 20120614
PRAI US 2011-496742P 20110614
PSL Claim 3; SEQ ID NO 20
DT Patent
LA English
OS 2012-R37989 [03]
DESC Platelet aggregation inhibition related-peptide (d-form Tro-6), SEQ:20.
SEQ3 1 Cys-Lys-Trp-Met-Asn-Val
FEATURE TABLE:
Key            |Location|Qualifier|
===============+========+=========+=======================
Misc-difference|2       |note     |"D-form residue"
Misc-difference|5       |note     |"D-form residue"
This is a BIB SEQ3 FEAT custom display. Notice the three-letter abbreviations for the amino acids with a SEQ3 display.
L1 ANSWER 1 OF 597267 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
AN BAJ43913 peptide DGENE
TI Extracting a peptide from a reaction mixture resulting from a peptide coupling reaction comprises adding organic solvents and water to the reaction mixture.
IN Monnaie D; Forni L; Giraud M
PA (LONZ) LONZA LTD.
   (LONZ) LONZA BRAINE SA.
PI WO 2012171984 A1 20121220 81
AI WO 2012-EP61257 20120614
PRAI EP 2011-170094 20110616
     US 2011-497642P 20110616
     US 2011-498100P 20110617
PSL Example; SEQ ID NO 17
DT Patent
LA English
OS 2012-R38071 [02]
DESC Solid phase peptide synthesis related peptide, SEQ ID 17.
SEQ 1 lwvns
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+============================
Modified-site|2       |note     |"Modified with
             |        |         |tert-butoxycarbonyl (Boc)"
Modified-site|4       |note     |"Modified with trityl (Trt)"
Modified-site|5       |note     |"C-terminal amide; Modified
             |        |         |with tBu"
Example of modification of amino acid residues
These are value-added modification annotations provided by the Thomson Reuters (TR) indexer, not by the applicant.
Modification of amino acid residues
• In REGISTRY – Amino acid residue representation
• Original amino acid or X – Notes field (/NTE)
• Look for keywords, standardized abbreviations and variations/spellings/abbreviations
Modification of amino acid residues
L1 ANSWER 1 OF 664 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 VEQKYGQFPQ G
NTE modified (modifications unspecified)
========
----------------------------------------------------------------------
type           ------ location ------   description
----------------------------------------------------------------------
modification   Val-1          -         (9h-fluoren-9-ylmethoxy) carbonyl
modification   Glu-2          -         1,1-dimethylethyl<t-Bu>
modification   Gln-3          -         triphenylmethyl<Trit>
modification   Lys-4          -         (1,1-dimethylethoxy) carbonyl<Boc>
modification   Tyr-5          -         1,1-dimethylethyl<t-Bu>
modification   Gln-7          -         triphenylmethyl<Trit>
modification   Gln-10         -         triphenylmethyl<Trit>
----------------------------------------------------------------------
REFERENCE 1
AN 157:693426 CA
TI Peptide-lipid conjugates that bind lipopolysaccharide and their therapeutic use
IN Tice, Thomas; Woeher, Torsten
PA Evonik Degussa Corporation, USA
SO PCT Int. Appl., 50pp. CODEN: PIXXD2
Modification of amino acid residues
The Notes field can list many different modifications.
DT Patent
LA English
FAN.CNT 1
PATENT NO.        KIND   DATE       APPLICATION NO.   DATE
---------------   ----   --------   ---------------   --------
PI WO 2012148891  A1     20121101   WO 2012-US34757   20120424
W: AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW
RW: AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR, BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG, BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, SZ, TZ, UG, ZM, ZW, AM, AZ, BY, KG, KZ, MD, RU, TJ, TM
US 20120294924    A1     20121122   US 2012-454211    20120424
PRAI US 2011-480596P 20110429
RE.CNT 4 THERE ARE 4 CITED REFERENCES AVAILABLE FOR THIS RECORD
ALL CITATIONS AVAILABLE IN THE RE FORMAT
Modification of amino acid residues (cont.)
Agenda
• Nucleic acid sequence searching – Definition of common nucleotides – Uncommon nucleotides
• FEATURE TABLE (FEAT) – In DGENE, USGENE and PCTGEN
• NOTES field (NTE) – In REGISTRY
– Variables for nucleotides • R, Y, M, K, S, W, B, D, H, V, N
– Modification of nucleic acids
5/6 Common nucleotides
1 letter designation   Represents
A Adenine
G Guanine
C Cytosine
T Thymine
U Uracil
I Inosine (REGISTRY only)
Note: Inosine is only searchable in REGISTRY. Only nucleotides listed in WIPO ST.25 are allowed in formal listings in patents. Inosine is not listed in WIPO ST.25, therefore it is not used in DGENE, USGENE and PCTGEN.
Sequences with uncommon nucleotides
• Uncommon nucleotides defined as – Anything other than 5/6 common nucleotides
• In DGENE, USGENE, and PCTGEN, BLAST and SCM searches recognize only the 5 common nucleotide designations and the 11 variable nucleotide designations
Sequences with uncommon nucleotides
• For nucleic acid searches with uncommon nucleotides, use wildcard symbols in search query – ‘N’ in query for BLAST search queries – ‘.’ in query for SCM search queries
• Search FEATURE TABLE
=> FILE DGENE
FILE 'DGENE' ENTERED AT 23:45:59 ON 18 FEB 2013 COPYRIGHT (C) 2013 THOMSON REUTERS
. . .
=> RUN GETSEQ UUCNG/SQSN
RUN GETSEQ AT 23:54:36 ON 18 FEB 2013 COPYRIGHT (C) 2013 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 18 UUCNG/SQSN
=> S L1 AND INOSINE/FEAT
2153 INOSINE/FEAT
L2 2 L1 AND INOSINE/FEAT
Inosine example in DGENE
=> D SEQ FEAT
L2 ANSWER 1 OF 2 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ 1 uucuagccuu cnggagucag ggc
              == ===
HITS AT: 9-13
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+=======================
modified_base|12      |*tag= a  |
             |        |mod_base |i
             |        |note     |"Optionally inosine"
Inosine example in DGENE (cont.)
=> FILE REG
FILE 'REGISTRY' ENTERED AT 00:03:52 ON 19 FEB 2013 USE IS SUBJECT TO THE TERMS OF YOUR STN CUSTOMER AGREEMENT. PLEASE SEE "HELP USAGETERMS" FOR DETAILS. COPYRIGHT (C) 2013 American Chemical Society (ACS)
. . .
=> S UUCIG/SQSN
L3 32 UUCIG/SQSN
=> D SEQ 1-5
L3 ANSWER x OF 32 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ 1 uucuagccuu ciggagucag ggc
              == ===
HITS AT: 9-13
Inosine example in REG
CAS shortcut descriptors for “modified base”
=> FILE REG
FILE 'REGISTRY' ENTERED AT 15:13:46 ON 19 FEB 2013 USE IS SUBJECT TO THE TERMS OF YOUR STN CUSTOMER AGREEMENT. PLEASE SEE "HELP USAGETERMS" FOR DETAILS. COPYRIGHT (C) 2013 American Chemical Society (ACS)
. . .
=> S AC4C/NTE
L1 76 AC4C/NTE
=> D SEQ NTE
L1 ANSWER 9 OF 76 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 ggagagaugg ccgagcgguc uaaggcgcug guuuiaggca ccagucccuu
     51 cgggggcgug gguucgaauc ccacucucuu cacca
NTE modified
----------------------------------------------------------------------
type            ------ location ------   description
----------------------------------------------------------------------
modified base   c-12                     ac4c
modified base   u-19                     hu
modified base   u-63                     m5u
----------------------------------------------------------------------
Ac4c in REG
=> FILE DGENE
FILE 'DGENE' ENTERED AT 14:58:58 ON 19 FEB 2013 COPYRIGHT (C) 2013 THOMSON REUTERS . . .
=> S (ACETYLCYTIDINE OR AC4C)/FEAT
4 ACETYLCYTIDINE/FEAT
67 AC4C/FEAT
L1 68 (ACETYLCYTIDINE OR AC4C)/FEAT
=> D SEQ FEAT 1 x
L1 ANSWER 1 OF 68 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ 1 xvxeiqlxhq xarwiqxkx
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+==============================
Modified-site|1       |note     |"1-aminocyclopentane-1-carboxy
             |        |         |lic acid (Ac5c)"
Modified-site|3       |label    |Aib
Modified-site|8       |label    |Nle
Modified-site|11      |note     |"Homoarginine (hR)"
Modified-site|17      |label    |Aib
Modified-site|18      |note     |"Modified with
             |        |         |N-epsilon-1'-alkyl beta-D-
             |        |         |glucuronyl"
Modified-site|19      |note     |"1-aminocyclo
             |        |         |butane-1-carboxylic acid
             |        |         |(Ac4c). C- terminal amide"
Ac4c in DGENE
This is a peptide sequence, and ac4c has a meaning other than what we were looking for.
L1 ANSWER x OF 68 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 tcccaatccc aatcccaatc ccaatcccaa tcccaatccc aatcccaatc
     51 ccaa
FEATURE TABLE:
Key          |Location|Qualifier   |
=============+========+============+==============================
misc_binding |1..10   |*tag= b     |
             |        |bound_moiety|"HT54 nanocircle splint DNA 2"
             |        |note        |"Forms double-stranded region
             |        |            |with bases 10-1 of splint
             |        |            |DNA"
misc_feature |1       |*tag= a     |
             |        |label       |Ligation_site
             |        |note        |"HT54 precursor is
             |        |            |circularised by ligation of T1
             |        |            | to A54"
modified_base|2..4    |*tag= c     |
             |        |mod_base    |ac4c
             |        |note        |"Optionally 4-acetylcytidine"
. . .
Ac4c in DGENE (cont.)
This is a nucleotide sequence, and ac4c has the correct meaning we were looking for.
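When a FEATURE TABLE term such as ac4c retrieves both peptide and nucleotide records, exported hits can be triaged with a rough heuristic. The Python sketch below is an assumption made for illustration: it only checks whether every letter of the displayed sequence is a nucleotide code, so a peptide made up solely of such letters would be misclassified. The molecule type printed in the record itself (for example "peptide" in the AN line of a DGENE BIB display) is the more reliable indicator.

```python
# Common nucleotides plus the variable designations listed in this deck.
NUCLEOTIDE_LETTERS = set("acgtu" + "rymkswbdhvn")


def looks_like_nucleotide(sequence):
    """Heuristic: True if every letter in the sequence is a nucleotide
    code. Spaces and digits from the STN display are ignored."""
    letters = [c for c in sequence.lower() if c.isalpha()]
    return bool(letters) and all(c in NUCLEOTIDE_LETTERS for c in letters)


print(looks_like_nucleotide("xvxeiqlxhq xarwiqxkx"))    # False (peptide hit)
print(looks_like_nucleotide("tcccaatccc aatcccaatc"))  # True (nucleotide hit)
```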
Variables used in sequence searching
1 letter designation   Represents
A Adenine
G Guanine
C Cytosine
T Thymine
U Uracil
1 letter designation   Represents
R   Guanine or adenine
Y   Thymine/uracil or cytosine
M   Adenine or cytosine
K   Guanine or thymine/uracil
S   Guanine or cytosine
W   Adenine or thymine/uracil
B   Guanine, cytosine or thymine/uracil
D   Adenine, guanine or thymine/uracil
H   Adenine, cytosine or thymine/uracil
V   Adenine, cytosine or guanine
N   Adenine, guanine, cytosine, thymine/uracil, unknown or other
Variables in BLAST sequence searching
• For DGENE, USGENE, PCTGEN – Using the nucleotide variables in a BLAST search, the exact nucleotide would not be an identity match
L1 ANSWER 1 OF 10000 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SCORE 31 100% of query self score 31
BLASTALIGN
Query = 17 letters
Length = 62371
Score = 31.4 bits (15), Expect = 2e-04
Identities = 16/17 (94%)
Strand = Plus / Plus
Query: 1     taarttcttctgcagtt 17
             ||| |||||||||||||
Sbjct: 41353 taaattcttctgcagtt 41369
Variables in SCM sequence searching
• For DGENE, USGENE, and PCTGEN – R, Y, M, K, S, W, B, D, H, V, and N will only capture records where the sequence has those designations, not the corresponding specific nucleotides they represent
– To include specific nucleotides in SCM search query, use [ ]
• Example: For TCASCC subsequence search use – RUN GETSEQ TCA[SGC]CC/SQSN
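The same bracket technique can be sketched for nucleotides. The mapping below follows the variable table in this deck; the helper name is hypothetical, listing both T and U inside the brackets for "thymine/uracil" is an assumption, and N is deliberately left unchanged because the deck treats it separately (with ‘.’ as the SCM wildcard).

```python
# Variable -> specific nucleotides, in the order given in the table.
NUCLEOTIDE_VARIABLES = {
    "R": "GA", "Y": "TUC", "M": "AC", "K": "GTU", "S": "GC", "W": "ATU",
    "B": "GCTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
}


def bracket_nucleotide_variables(subsequence):
    """Rewrite each two- or three-base variable as a bracketed set made of
    the variable itself plus the nucleotides it represents."""
    parts = []
    for base in subsequence.upper():
        if base in NUCLEOTIDE_VARIABLES:
            parts.append("[" + base + NUCLEOTIDE_VARIABLES[base] + "]")
        else:
            parts.append(base)  # A, C, G, T, U and N pass through unchanged
    return "".join(parts)


print("RUN GETSEQ " + bracket_nucleotide_variables("TCASCC") + "/SQSN")
# RUN GETSEQ TCA[SGC]CC/SQSN
```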
=> FILE DGENE
FILE 'DGENE' ENTERED AT 16:10:33 ON 19 FEB 2013 COPYRIGHT (C) 2013 THOMSON REUTERS
FILE LAST UPDATED: 15 FEB 2013 <20130215/UP> MOST RECENT PUBLICATION DATE: 17 JAN 2013 <20130117/PD> . . .
=> RUN GETSEQ TCASCCTA/SQSN
RUN GETSEQ AT 16:10:46 ON 19 FEB 2013 COPYRIGHT (C) 2013 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 7 TCASCCTA/SQSN
=> D SEQ
L1 ANSWER 1 OF 7 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 ccaggcagag tgacagttct gtgagttttc tactgtgcaa agcagagctg
     51 gtttttcatt ttttatagcg tcascctatt caaagtgaat ataagctttc
                              ========
    101 acatgtgttg tctgactcta tcctcaaatc agctccatga ggtaagaaat
. . .
HITS AT: 71-78
Variable SCM search example in DGENE
This is only capturing sequences where S is in the sequence.
=> FILE DGENE
FILE 'DGENE' ENTERED AT 11:23:08 ON 06 FEB 2013 COPYRIGHT (C) 2013 THOMSON REUTERS . . .
=> RUN GETSEQ TCA[SGC]CCTA/SQSN
. . .
Number of answers 214621 will create 9 Answer Sets
L1 RUN STATEMENT CREATED
L1 25000 TCA[SGC]CCTA/SQSN
L2 RUN STATEMENT CREATED
L2 25000 TCA[SGC]CCTA/SQSN
. . .
L9 RUN STATEMENT CREATED
L9 14621 TCA[SGC]CCTA/SQSN
=> S L1-L9 AND SQL<200
21860725 SQL<200
L10 7575 (L1 OR L2 OR L3 OR L4 OR L5 OR L6 . . . OR L9) AND SQL<200
Variable SCM search example in DGENE
This search captures sequences in which S, G or C appears at that position.
=> D SEQ
. . .
L10 ANSWER x OF 7575 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 aggaggcctc agcctatat
                == ======
HITS AT: 9-17

L10 ANSWER x OF 7575 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 gaaagcagtc accctatccg ctgatcagcc tcatg
                == ======
HITS AT: 9-17

L10 ANSWER x OF 7575 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 ttttcagatc tccattacta ggccaggata gcccgagggg gaagaggagc
     51 aagtttttca scctacggga gctccgggtc tgcctaattt ttccgcccct
               === =====
    101 cccagccgaa aaacccatca g
HITS AT: 58-65
Variable SCM search example in DGENE
Variables in SCM sequence searching
• For REGISTRY
– R, Y, M, K, S, W, B, D, H, V and N are searchable
• Be careful with N, as it works like ‘.’
– They capture records whose sequence contains those letters, as well as records with the corresponding specific nucleotides they represent
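The REGISTRY behaviour described above can be modelled with a regular expression. The Python sketch below uses the `re` module purely as a stand-in for the matching rules on this slide; it is not how STN implements the search, and the names `registry_like_pattern` and `first_hit` are made up for the illustration. The two test fragments are copied from REGISTRY records displayed in this deck.

```python
import re

# IUPAC ambiguity codes and the specific nucleotides each one represents.
IUPAC = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def registry_like_pattern(query):
    """Build a regex in which each variable matches itself or any
    nucleotide it represents, and N matches anything (like '.')."""
    parts = []
    for ch in query.upper():
        if ch == "N":
            parts.append(".")
        elif ch in IUPAC:
            parts.append("[" + ch + IUPAC[ch] + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def first_hit(query, record):
    """Return the 1-based (start, end) of the first match, or None."""
    m = registry_like_pattern(query).search(record)
    return (m.start() + 1, m.end()) if m else None


# Fragments of two displayed records, with the 10-letter grouping removed.
with_c = ("actctttaga tctggcattc aaactgtctg tgttttgacc "
          "atcaccctag").replace(" ", "")
with_s = ("ttttcagatc tccattacta ggccaggata gcccgagggg gaagaggagc "
          "aagtttttca scctacggga gctccgggtc tgcctaattt "
          "ttccgcccct").replace(" ", "")

print(first_hit("TCASCCTA", with_c))  # record with a specific c at the S position
print(first_hit("TCASCCTA", with_s))  # record with a literal s at the S position
```

Both records are hits for the same query, which is the difference from DGENE, USGENE and PCTGEN, where the unbracketed query would retrieve only the record containing the literal s.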
=> FILE REG
FILE 'REGISTRY' ENTERED AT 17:54:42 ON 06 FEB 2013
USE IS SUBJECT TO THE TERMS OF YOUR STN CUSTOMER AGREEMENT.
PLEASE SEE "HELP USAGETERMS" FOR DETAILS.
COPYRIGHT (C) 2013 American Chemical Society (ACS)
. . .
=> S TCASCCTA/SQSN
L1 734794 TCASCCTA/SQSN
=> S L1 AND SQL<200
17249391 SQL<200
L2 11074 L1 AND SQL<200
Variable SCM search example in REG
=> D SEQ
. . .
L2 ANSWER x OF 11074 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 actctttaga tctggcattc aaactgtctg tgttttgacc atcaccctag
                                                     ========
     51 atcactgcct sttaccattt taggagtata gtttgaaatt ctgactgatt
    101 ttaattggct ctgttcaact c
HITS AT: 42-49

L2 ANSWER x OF 11074 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 tgcttaattg attatatctt ccttgtcatt ttgttccttc tttctgttta
     51 attagcaaaa yggtgtctta taattctgga acagcaaaca aaatttttca
    101 agtcagccta cttctaacac t
          ========
HITS AT: 103-110

L2 ANSWER x OF 11074 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 ttttcagatc tccattacta ggccaggata gcccgagggg gaagaggagc
     51 aagtttttca scctacggga gctccgggtc tgcctaattt ttccgcccct
               === =====
    101 cccagccgaa aaacccatca g
HITS AT: 58-65
Variable SCM search example in REG
Nucleic acid sequences with modifications
• To search for nucleic acid sequences with modifications, use the normally occurring nucleotide symbol or wildcard symbols in search queries
– In DGENE, USGENE and PCTGEN
• Use ‘N’ in BLAST search queries as a wildcard
– Search name variations in FEATURE TABLE
• Use ‘.’ in SCM search queries as a wildcard
– Search name variations in FEATURE TABLE
Nucleic acid sequences with modifications
• To search for uncommon nucleic acid sequences with modifications, use the normally occurring nucleotide symbol or wildcard symbols in search queries
– In REGISTRY
• Use ‘N’ in BLAST search queries as a wildcard
– Search for standard abbreviations in the NTE field
– Search keywords (consider variations)
• Use ‘.’ in SCM search queries as a wildcard
– Search for standard abbreviations in the NTE field
– Search keywords (consider variations)
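Putting the wildcard at the modified position can be scripted when many queries are needed. The Python sketch below is a hypothetical helper (`mask_positions` is not an STN command) that swaps the residues at given 1-based positions for the wildcard of the chosen search type; the 18-mer and the choice of position 10 are illustrative inputs.

```python
def mask_positions(seq, positions, wildcard):
    """Return seq with the residues at the given 1-based positions
    replaced by the wildcard symbol."""
    chars = list(seq)
    for pos in positions:
        if not 1 <= pos <= len(chars):
            raise ValueError("position %d is outside the sequence" % pos)
        chars[pos - 1] = wildcard
    return "".join(chars)


seq = "ctatctgucgttctctgu"   # illustrative 18-mer
modified = [10]              # illustrative modified position

print(mask_positions(seq, modified, "N"))  # query string for a BLAST search
print(mask_positions(seq, modified, "."))  # query string for an SCM search
```

The modification itself is then searched separately by name in the FEATURE TABLE or NTE field, as the bullets above describe.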
CAS list of modifications
=> S DEAZA?/FEAT
L1 1060 DEAZA?/FEAT
=> D SEQ FEAT 3
L1 ANSWER 3 OF 1060 DGENE COPYRIGHT 2013 THOMSON REUTERS on STN
SEQ   1 ctatctgucg ttctctgu
FEATURE TABLE:
Key          |Location|Qualifier|
=============+========+=========+=========================
modified_base|1..18   |*tag= a  |
             |        |mod_base |OTHER
             |        |note     |"OTHER = Phosphorothioate
             |        |         |backbone"
modified_base|7..8    |*tag= b  |
             |        |mod_base |OTHER
             |        |note     |"OTHER = Modified with
             |        |         |2'-O-Me"
modified_base|10      |*tag= c  |
             |        |mod_base |OTHER
             |        |note     |"OTHER= 7-deaza-dG"
Deaza in DGENE
=> FILE REG
FILE 'REGISTRY' ENTERED AT 15:53:51 ON 19 FEB 2013
USE IS SUBJECT TO THE TERMS OF YOUR STN CUSTOMER AGREEMENT.
PLEASE SEE "HELP USAGETERMS" FOR DETAILS.
COPYRIGHT (C) 2013 American Chemical Society (ACS)
. . .
=> S DEAZA/NTE
L1 180 DEAZA/NTE
=> D SEQ NTE 1
L1 ANSWER 1 OF 180 REGISTRY COPYRIGHT 2013 ACS on STN
SEQ   1 tcagtattag cagtccgcg
SEQ   1 gcggactgct aat
**RELATED SEQUENCES AVAILABLE WITH SEQLINK**
NTE multistranded (2) modified
----------------------------------------------------------------------
type            ------ location ------      description
----------------------------------------------------------------------
modified base   a-11[2]                     3-deaza
----------------------------------------------------------------------
Deaza in REG
Summary
• Peptide/protein sequence searching
– Uncommon amino acids
• FEATURE TABLE (FEAT) or NOTES field (NTE)
– Variables for amino acids
– Modifications
• Nucleic acid sequence searching
– Uncommon nucleotides
• FEATURE TABLE (FEAT) or NOTES field (NTE)
– Variables for nucleotides
– Modifications
Resources
• Biosequence Searching on STN web site – http://www.stn-international.com/biosequence_searching.html
• DGENE workshop manual – http://www.stn-international.com/dgene_wm.html
• USGENE workshop manual – http://www.stn-international.com/usgene_wm.html
• STN quick reference cards – http://www.cas.org/training/stn/commands-qrc
• CAS coverage of sequences – http://www.cas.org/content/chemical-substances/sequences
Acknowledgements
• Rob Austin • Alice Humel Denton • Lora Burgess
FIZ Karlsruhe [email protected] Support and Training: www.stn-international.de
CAS E-mail: [email protected] Support and Training: www.cas.org
For more information …