Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using...

30
Owing to continuous developments of high-throughput experimental technologies, ever-increasing amounts of data are being generated in functional genomics and proteomics. We are developing a new generation of databases and computational technologies, beyond the traditional genome databases and sequence analysis tools, for making full use of such large-scale data in biomedical applications, espe- cially for elucidating cellular functions as behaviors of complex interaction systems. 1. Comprehensive repository for community genome annotation Toshiaki Katayama, Mari Watanabe and Mi- noru Kanehisa KEGG DAS is an advanced genome database system providing DAS (Distributed Annotation System) service for all organisms in the GENOME and GENES databases in KEGG (Kyoto Encyclopedia of Genes and Genomes). Currently, KEGG DAS contains 6,943,951 anno- tations for genome sequences of 441 organisms. The KEGG DAS server provides gene annota- tions linked to the KEGG PATHWAY and LIGAND databases, as well as the SSDB data- base containing paralog, ortholog and motif in- formation. In addition to the coding genes, in- formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno- tation of the intergenic regions of the genome. We have been developing the server based on open source software including BioRuby, BioPerl, BioDAS and GMOD/GBrowse to make the system consistent with the existing open standards. The contents of the KEGG DAS data- base can be accessed graphically in a web browser using GBrowse GUI (graphical user in- terface) and also programatically by the DAS protocol. The DAS, which is an XML over HTTP data retrieving protocol, enables the user to write various kinds of automated programs for analyzing genome sequences and annotations. For example, by combining KEGG DAS with KEGG API, a program to retrieve upstream se- quences of a given set of genes which have similar expression patterns on the same path- way, can be written very easily. GBrowse, the graphical interface, enables user to browse, search, zoom and visualize a particular region of the genome. Moreover, users are also able to add their own annotations onto the GBrowse Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野 シークエンスデータ情報処理分野 Professor Minoru Kanehisa, Ph.D. Research Associate Toshiaki Katayama, M.Sc. Research Associate Shuichi Kawashima, M.Sc. Lecturer Tetsuo Shibuya, Ph.D. Research Associate Michihiro Araki, Ph.D. 授(委嘱)理学博士 理学修士 理学修士 理学博士 薬学博士 136

Transcript of Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using...

Page 1: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

Owing to continuous developments of high-throughput experimental technologies,ever-increasing amounts of data are being generated in functional genomics andproteomics. We are developing a new generation of databases and computationaltechnologies, beyond the traditional genome databases and sequence analysistools, for making full use of such large-scale data in biomedical applications, espe-cially for elucidating cellular functions as behaviors of complex interaction systems.

1. Comprehensive repository for communitygenome annotation

Toshiaki Katayama, Mari Watanabe and Mi-noru Kanehisa

KEGG DAS is an advanced genome databasesystem providing DAS (Distributed AnnotationSystem) service for all organisms in theGENOME and GENES databases in KEGG(Kyoto Encyclopedia of Genes and Genomes).Currently, KEGG DAS contains 6,943,951 anno-tations for genome sequences of 441 organisms.The KEGG DAS server provides gene annota-tions linked to the KEGG PATHWAY andLIGAND databases, as well as the SSDB data-base containing paralog, ortholog and motif in-formation. In addition to the coding genes, in-formation of non-coding RNAs predicted usingRfam database is also provided to fill the anno-tation of the intergenic regions of the genome.

We have been developing the server based onopen source software including BioRuby,BioPerl, BioDAS and GMOD/GBrowse to makethe system consistent with the existing openstandards. The contents of the KEGG DAS data-base can be accessed graphically in a webbrowser using GBrowse GUI (graphical user in-terface) and also programatically by the DASprotocol. The DAS, which is an XML over HTTPdata retrieving protocol, enables the user towrite various kinds of automated programs foranalyzing genome sequences and annotations.For example, by combining KEGG DAS withKEGG API, a program to retrieve upstream se-quences of a given set of genes which havesimilar expression patterns on the same path-way, can be written very easily. GBrowse, thegraphical interface, enables user to browse,search, zoom and visualize a particular regionof the genome. Moreover, users are also able toadd their own annotations onto the GBrowse

Human Genome Center

Laboratory of Genome DatabaseLaboratory of Sequence Analysisゲノムデータベース分野シークエンスデータ情報処理分野

Professor Minoru Kanehisa, Ph.D.Research Associate Toshiaki Katayama, M.Sc.Research Associate Shuichi Kawashima, M.Sc.Lecturer Tetsuo Shibuya, Ph.D.Research Associate Michihiro Araki, Ph.D.

教 授(委嘱) 理学博士 金 久 實助 手 理学修士 片 山 俊 明助 手 理学修士 川 島 秀 一講 師 理学博士 渋 谷 哲 朗助 手 薬学博士 荒 木 通 啓

136

Page 2: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

view by providing another DAS server or bysimply uploading their own data as a file. Thisfunctionality enables researchers to add variousannotations on the genome and by sharing theirannotations with the community they can con-tinuously refine the genome annotation, so-called “community annotation.” This year wehave updated the GBrowse genome browser tothe latest version, which has improved user in-terface. We are also responsible to Japanese lo-calization of the browser. The KEGG DAS isweekly updated and freely available at http://das.hgc.jp/.

2. Automatic assignments of orthologs andparalogs in complete genomes

Toshiaki Katayama and Minoru Kanehisa

Accession of the number of sequenced geno-mes made it difficult to characterize orthologousrelationship among organisms. We are develop-ing a computational method for finding appro-priate orthologous gene clusters automatically. Itis based on a graph analysis of the KEGG SSDBdatabase, containing sequence similarity rela-tions among all the genes in the completelysequenced genomes. The nodes of the SSDBgraph are genes and the edges are the Smith-Waterman sequence similarity scores computedby the SSEARCH program. The edges are notonly weighted but also directed, indicating thebest (top-scoring) hit when a gene in an organ-ism is compared against all genes in another or-ganism. Thus, a highly connected cluster ofnodes containing a number of bidirectional besthits might be considered an ortholog clusterconsisting of functionally identical genes. Such acluster can be found by our heuristic method forfinding quasi-cliques, but the SSDB graph is toolarge to perform quasi-clique finding at a time.Therefore, we introduce a hierarchy (evolution-ary relationship) of organisms and treat theSSDB graph as a nested graph. However, themethod still requires large computation timealong with the number of organism increases,we are trying to refine the process to make itfaster and accurate.

3. SOAP/WSDL interface for the KEGG sys-tem

Toshiaki Katayama, Shuichi Kawashima andMinoru Kanehisa

We have continued to develop KEGG API, aweb service to facilitate usability of the KEGGsystem. KEGG is a suite of databases and associ-ated software, integrating our current knowl-

edge of molecular interaction/reaction pathwaysand other systemic functions (PATHWAY andBRITE databases), the information about thegenomic space (GENES database), and informa-tion about the chemical space (LIGAND data-base). KEGG API provides valuable means to re-trieve various kinds of information stored in theKEGG and has become an increasingly popularmode of access. Recently, we have introducedseveral new methods to search compounds,drugs and glycans by their structure, mass, com-position and annotations. Additionally, themethods to colorize the PATHWAY diagram isenhanced to control the different elements shar-ing the same name can be distinguished by theirID. The KEGG API is available at http://www.genome.jp/kegg/soap/.

4. EGassembler: web server for large-scaleclustering and assembling ESTs andgenomic DNA fragments

Ali Masoudi-Nejad, Shuichi Kawashima, Koi-chiro Tonomura, Masanori Suzuki, MinoruKanehisa

EST sequencing has proven to be an economi-cally feasible alternative for gene discovery inspecies lacking a draft genome sequence. Ongo-ing large-scale EST sequencing projects feel theneed for bioinformatics tools to facilitate uni-form ESTs handling. This brings about a re-newed importance to a universal tool for proc-essing and functional annotation of large sets ofESTs in order to cover the complete transcrip-tome of an organism. EGassembler (http://egas-sembler.hgc.jp/) is a web server, which providesan automated as well as a user-customizedanalysis tool for cleaning, repeat masking, vec-tor trimming, organelle masking, clustering andassembling of ESTs and genomic fragments. It isalso designed to serve as a standalone web ap-plication for each one of those processes. Theweb server is freely available and provides thecommunity with a unique all-in-one online ap-plication web service for large scale ESTs andgenomic DNA clustering and assembling, espe-cially for EST processing and annotation pro-jects.

5. New version of MAGEST database

Shuichi Kawashima and Minoru Kanehisa

We have developed a new version of MAG-EST database. Previously MAGEST was de-signed as a database for maternal gene expres-sion information for an ascidian, Halocynthiaroretzi . In the new version of MAGEST, it is ex-

137

Page 3: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

tended to four embryonic developmental stagesincluding the following additional stages, e.g.early cleavage, early gastrula and early neurula.Furthermore, we constructed gene clusters andassembled sequences of ESTs by EGassembler.These clusters enabled us to cross-refer the geneexpression of the four different stages. The newweb site is accessible at http://magest.hgc.jp/.Now we are comparing the MAGEST sequenceswith other species of ascidian, Ciona sp. Becausethe phylogenetic position of H. roretzi is a goodoutgroup for Enterogonia to which Ciona sp. be-longs, we expect that the result from this com-parative analysis lead us to understand the di-versification of gene families among Urocho-data.

6. SSS: a sequence similarity search service

Toshiaki Katayama, Kazuhiro Ohi, MinoruKanehisa

There are various services in the world to findsimilar sequences from the database, such as thefamous BLAST service provided at NCBI. How-ever, the method to search and the database tobe searched could not be added from outside.To provide our super computer resources at theHuman Genome Center to the research commu-nity, we started to develop a new service for thesequence similarity search, SSS. In SSS, user canselect the search algorithm from BLAST, FASTA,SSEARCH, TRANS and EXONERATE. This va-riety of options is unique among the publicservices. Then user is prompted to select appro-priate database depending on the algorithm se-lected and the search is executed. On the back-end, we implemented the search system on theSun Grid Engine to provide efficient resourceson distributed computers. As a result, we areable to provide time consuming services such asTRANS and EXONERATE in addition to thepopular algorithms. The SSS service is freelyavailable at http://sss.hgc.jp/.

7. High performance database entry retrievalsystem

Kazutomo Ushijima, Chiharu Kawagoe, Toshi-aki Katayama, Shuichi Kawashima, KentaNakai, Minoru Kanehisa

Recently, the number of entries in biologicaldatabases is exponentially increasing year byyear. For example, there were 10,106,023 entriesin the GenBank database in the year 2000, whichhas now grown to 65,771,589 (Release 156 +daily updates). In order for such a vast amountof data to be searched at a high speed, we have

developed a high performance database entryretrieval system, named HiGet. The HiGet sys-tem is constructed on the HiRDB, a commercialORDBMS (Object-oriented Relational DatabaseManagement System) developed by Hitachi,Ltd. It is publicly accessible on the Web page athttp://higet.hgc.jp/ or SOAP based web serviceat http://higet.hgc.jp/soap/. HiGet can executefull text search on various biological databases.In addition to the original plain format, the sys-tem contains data in the XML format in order toprovide a field specific search facility. When acomplicated search condition is issued to thesystem, the search processing is executed effi-ciently by combining several types of indices toreduce the number of records to be processedwithin the system. Current searchable databasesare GenBank, UniProt, Prosite, OMIM, PDB andRefSeq. We are planning to include other valu-able databases and also planning to develop aninter-database search interface and a complexsearch facility combining keyword search andsequence similarity search.

8. Development of algorithms for biosyn-thetic process analysis

Kohichi Suematsu, Tetsuo Shibuya, MichihiroAraki, and Minoru Kanehisa

We developed algorithms for identifying bio-synthetic process of medicinal products by util-izing the database of sub-molecular buildingblocks in biosynthetic processes. The problem isto find the most reasonable decomposition of agraph into subgraphs which are annotated inthe database. We have developed new efficientalgorithms and tools based on the algorithmsfor the problem, though the problem is a verydifficult NP-hard problem.

9. Geometric Suffix Tree: A New Data struc-ture for Indexing Protein Structures

Tetsuo Shibuya

Protein structure analysis is one of the mostimportant research issues in the post-genomicera, and faster and more accurate index datastructures for such 3-D structures are highly de-sired for research on proteins. We proposed anew data structure for indexing protein 3-Dstructures. For strings, there are many efficientindexing structures such as suffix trees, but ithas been considered very difficult to designsuch sophisticated data structures against 3-Dstructures like proteins. Our index structure isbased on the suffix trees and is called the geo-metric suffix tree. By using the geometric suffix

138

Page 4: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

tree for a set of protein structures, we can searchfor all of their substructures whose RMSDs (rootmean square deviations) or URMSDs (unit-vector root mean square deviations) to a givenquery 3-D structure are not larger than a givenbound. Though there are O (N 2) substructures,our data structure requires only O (N ) spacewhere N is the sum of lengths of the set of pro-teins. We propose an O (N 2) construction algo-rithm for it, while a naive algorithm would re-quire O (N 3) time to construct it. Moreover wepropose an efficient search algorithm. Experi-ments show that we can search for similar struc-tures much faster than naive RMSD computa-tion against all over the database. The experi-ments also show that the construction time ofthe geometric suffix tree is practically almostlinear to the size of the database, when appliedto a protein structure database.

10. Protein Function Prediction based on 3-DStructure Motifs

Hiroki Sakai and Tetsuo Shibuya

Protein functions are said to be determined byits 3-D structures, but not all functions havebeen known to be related to some 3-D structuremotifs. The geometric suffix tree, a data struc-ture for indexing 3-D protein structures, whichis also developed by us, enables comprehensiveenumeration of all the possible structural motifsamong given set of proteins. We are developinga new algorithm based on the support vectormachine that decides protein’s function from the3-D structure of a protein. This algorithm util-izes all the possible 3-D motifs found by usingthe geometric suffix tree.

11. Genotype Clustering based on HiddenMarkov Models

Ritsuko Onuki and Tetsuo Shibuya

Analysis of single nucleotide polymorphisms(SNPs) is one of the most important researchtopics in the post-genomic era. Recently, thegenotype data increases exponentially and de-velopment of tools for analyzing these data ishighly desired. We are developing a new algo-rithm for clustering genotypes, by utilizing thehidden markov model (HMM) that infers haplo-types from genotpyes.

12. The Origin of Chemical Diversity in Bio-synthetic Circuits of Medicinal NaturalProducts

Michihiro Araki, Tetsuo Shibuya, Kohichi Sue-

matsu, and Minoru Kanehisa

Medicinal natural products have been the ma-jor sources of bioactive compounds with diversepharmacological activities, and are enzymicallysynthesized as secondary metabolites for specificbiological purposes. The diverse activities areobviously explained by extensive diversities inthe chemical structures. Although several hy-potheses have been proposed to understand theorigin of the chemical diversity, no systematicanalyses have been done so far. We thus definemolecular building blocks required for describ-ing the chemical information of natural productsto elucidate how chemical diversities are de-signed in the biosynthetic circuits. Each naturalproduct is expressed as a combination of mo-lecular building blocks with corresponding en-zymatic information as links between buildingblocks to be collected in a database. The knowl-edge database constructed from various re-sources enables us to identify the system struc-tures of the biosynthetic circuits. We also extractdistinctive network structures consisted of bothchemical and genomic information, which willbe useful for understanding the origin of chemi-cal diversity in the biosynthetic circuits.

13. Extracting A Strategy for Drug Design byTracing Drug Development

Daichi Shigemizu, Michihiro Araki and Mi-noru Kanehisa

Current drugs are mostly derived by modifi-cation of known drug structures or from leadstructures to be optimized for targeting newmolecules or obtaining improved efficacy. To ex-plore a design rule in drug development, it isnecessary to focus on the empirical modifica-tions to construct a knowledge base for tracingthe chemical evolutions. We have been collect-ing data on the drug evolutions from databasesand literatures, which has already been imple-mented on the drug structure maps in theKEGG DRUG database. In order to trace thechemical development in the database, binaryrelations of chemical structures were defined be-tween before and after the chemical changes.We subsequently compared chemical structuresin the binary relations to define functional at-oms and groups supposed to be important fordrug development. We have been collecting aseries of such transformation patterns in theprocess of drug development to analyze andprovide a design strategy in drug development.

14. Comprehensive Analysis of DistinctivePolyketide and Nonribosomal Peptide

139

Page 5: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

Structural Motifs Encoded in MicrobialGenomes

Yohsuke Minowa, Michihiro Araki and Mi-noru Kanehisa

We developed a highly accurate method topredict polyketide (PK) and nonribosomal pep-tide (NRP) structures encoded in the microbialgenome. PKs/NRPs are polymers of carbonyl orpeptidyl chains synthesized by polyketide syn-thases (PKS) and nonribosomal peptide syn-thetases (NRPS). We analyzed domain se-quences corresponding to specific substrates andphysical interactions between PKSs/NRPSs inorder to predict which substrates are selectedand assembled into highly ordered chemicalstructures. The predicted PKs/NRPs were repre-

sented as the sequences of carbonyl/peptidylunits to extract the structural motifs effeciently.We applied our method to 4,529 PKSs/NRPSsand found 619 PKs/NRPs. We also collected1,449 PKs/NRPs whose chemical structureshave been experimentally determined. Thestructural sequences were compared using theSmith-Waterman algorithm, and clustered into271 clusters. From the compound clusters, weextracted 33 structural motifs which are signifi-cantly related with their bioactivities. The inte-grative analysis of genomic and chemical infor-mation here will provide a strategy to predictthe chemical structures, the biosynthetic path-ways, and the biological activities of PKs/NRPs,which is useful for the rational design of novelPKs/NRPs.

Publications

Yoshizawa AC, Kawashima S, Okuda S, FujitaM, Itoh M, Moriya Y, Hattori M, Kanehisa M.Extracting sequence motifs and the phyloge-netic features of SNARE-dependent mem-brane traffic. Traffic. 7(8): 1104-1118, 2006.

Masoudi-Nejad A, Tonomura K, Kawashima S,Moriya Y, Suzuki M, Itoh M, Kanehisa M,Endo T, Goto S. EGassembler: online bioinfor-matics service for large-scale processing, clus-tering and assembling ESTs and genomicDNA fragments. Nucleic Acids Res. 34: W459-62, 2006.

Kanehisa, M., Goto, S., Hattori, M., Aoki-Kino-shita, K.F., Itoh, M., Kawashima, S., Kata-yama, T., Araki, M., and Hirakawa, M. Fromgenomics to chemical genomics: new develop-ments in KEGG. Nucleic Acids Res. 34, D354-357 (2006).

Hashimoto, K., Goto, S., Kawano, S., Aoki-Kino-shita, K.F., Ueda, N., Hamajima, M., Kawa-saki, T., and Kanehisa, M. KEGG as a glycomeinformatics resource. Glycobiology 16, 63R-70R (2006).

Honda, W., Kawashima, S., Kanehisa, M. Me-tabolite antigens and pathway incompatibility.

Genome Inform. Ser. Workshop Genome In-form. 2006, 17(1): 184-193, 2006

Shibuya, T., Geometric Suffix Tree: A New In-dex Structure for Protein 3-D Structures, Com-binatorial Pattern Matching 2006 (CPM 2006),LNCS 4009, Barcelona, July 5-7, 2006, pp. 84-93.

Suematsu, K., Araki, M., and Shibuya, T. Graphtiling algorithm for molecular synthesis analy-sis, SIGAL 106, May 18, 2006, Ikaho, pp. 25-32.

Shibuya, T., Geometric Suffix Tree: A New In-dex Structure for Protein 3-D Structures, IPSJSIG Notes SIGAL 105-3, March 18 2006, At-sugi, pp. 17-24.

Shibuya, T., Combinatorial Structural PatternMatching, The Second Japan-Taiwan BilateralSymposium on Bioinformatics, 2006.

Shibuya, T., Indexing Structures for Biomolecu-lar Structures, The First Japan-Taiwan Bilat-eral Symposium on Bioinformatics, 2006.バイオインフォマティクス事典,日本バイオインフォマティクス学会編,共立出版,2006.荒木通啓,金久實「創薬科学のためのKEGG」化学工業 Vol.57 №3,42―46,2006.

140

Page 6: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

The aim of the research at this laboratory is to establish computational methodolo-gies for discovering and interpreting information of nucleic acid sequences, pro-teins and some other experimental data arising from researches in Genome Sci-ence. Our current interests are focused on Computational Systems Biology and itsrelated computational techniques. Apart from the research activity, the laboratoryhas been providing bioinformatics software tools and has been taking a leadingpart in organizing an international forum for Genome Informatics.

1. Computational Systems Biology

a. Modeling and estimation of dynamic EGFRpathway by data assimilation approach us-ing time series proteomic data

Shinya Tasaki, Masao Nagasaki, MasaakiOyama, Hiroko Hata, Kazuko Ueno, RyoYoshida, Tomoyuki Higuchi, Sumio Sugano,Satoru Miyano

Cell Illustrator is a model building tool basedon the Hybrid Functional Petri net with exten-sion (HFPNe). By using Cell Illustrator, we havesucceeded in modeling biological pathways, e.g.,metabolic pathways, gene regulatory networks,microRNA regulatory networks, cell signalingnetworks, and cell-cell interactions. The recentdevelopment of tandem mass spectrometry cou-pled with liquid chromatography (LC/MS/MS)technology has enabled researchers to quantifythe dynamic profile of a wide range of proteinswithin the cell. The proteomic data obtained byusing LC/MS/MS has been considerably usefulfor introducing dynamics to the HFPNe model.Here, we report the first introduction of the time

-series proteomic data to our HFPNe model. Weconstructed an epidermal growth factor receptorsignal transduction pathway model (EFGRmodel) by using the biological data available inthe literature. Then, the kinetic parameters weredetermined in the data assimilation (DA) frame-work with some manual tuning so as to fit theproteomic data published by Blagoev et al. (Nat.Biotechnol., 22: 1139. 1145, 2004). This in silicomodel was further refined by adding or remov-ing some regulation loops using biological back-ground knowledge. The DA framework wasused to select the most plausible model fromamong the refined models. By using the pro-teomic data, we semi-automatically constructeda well-tuned EGFR HFPNe model by using theCell Illustrator coupled with the DA framework.

b. Genomic data assimilation for estimatinghybrid functional Petri net from time-course gene expression data

Masao Nagasaki, Rui Yamaguchi, Ryo Yoshi-da, Seiya Imoto, Atsushi Doi, YoshinoriTamada, Hiroshi Matsuno, Satoru Miyano, To-moyuki Higuchi

Human Genome Center

Laboratory of DNA Information AnalysisDNA情報解析分野

Professor Satoru Miyano, Ph.D.Assistant Professor Seiya Imoto, Ph.D.Assistant Professor Masao Nagasaki, Ph.D.Project Assistant Professor Ryo Yoshida, Ph.D.

教 授 理学博士 宮 野 悟助 手 博士(数理学) 井 元 清 哉助 手 博士(理学) 長 � 正 朗特任助手 博士(理学) 吉 田 亮

141

Page 7: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

We propose an automatic constructionmethod of the hybrid functional Petri net as asimulation model of biological pathways. Theproblems we consider are how we choose thevalues of parameters and how we set the net-work structure. Usually, we tune these un-known factors empirically so that the simulationresults are consistent with biological knowledge.Obviously, this approach has the limitation inthe size of network of interest. To extend the ca-pability of the simulation model, we propose theuse of data assimilation approach that was origi-nally established in the field of geophysicalsimulation science. We provide genomic data as-similation framework that establishes a link be-tween our simulation model and observed datalike microarray gene expression data by using anonlinear state space model. A key idea of ourgenomic data assimilation is that the unknownparameters in simulation model are convertedas the parameter of the state space model andthe estimates are obtained as the maximum aposteriori estimators. In the parameter estima-tion process, the simulation model is used togenerate the system model in the state spacemodel. Such a formulation enables us to handleboth the model construction and the parametertuning within a framework of the Bayesian sta-tistical inferences. In particular, the Bayesian ap-proach provides us a way of controlling overfit-ting during the parameter estimations that is es-sential for constructing a reliable biologicalpathway. We demonstrate the effectiveness ofour approach using synthetic data. As a result,parameter estimation using genomic data as-similation works very well and the networkstructure is suitably selected.

c. Cell fate simulation model of gustatoryneurons with microRNAs double-negativefeedback loop by hybrid functional Petrinet with extension

Ayumu Saito, Masao Nagasaki, Atushi Doi,Kazuko Ueno, Satoru Miyano

Biological regulatory networks have been ex-tensively researched. Recently, the microRNAregulation has been analyzed and its importancehas increasingly emerged. We have applied theHybrid Functional Petri net with extension(HFPNe) model and succeeded in creatingmodel biological pathways, e.g. metabolic path-ways, gene regulatory networks, cell signalingnetworks, and cell-cell interaction models withone of the HFPNe implementations Cell Illustra-tor. Thus, we have applied HFPNe to modelregulatory networks that involve a new keyregulator microRNA. As a test case, we selected

the cell fate determination model of two gusta-tory neurons of Caenorhabditis elegans…ASEleft (ASEL) and ASE right (ASER). These neu-rons are morphologically bilaterally symmetricbut physically asymmetric in function. Johnstonet al. have suggested that their cell fate is deter-mined by the double-negative feedback loop in-volving the lsy-6 and mir-273 microRNAs. Oursimulation model confirms their hypothesis. Inaddition, other well-known mutants that are re-lated with the double-negative feedback loopare also well-modeled. The new upstream regu-lator of lsy-6 (lsy-2) that is mentioned in anotherpaper is also integrated into this model for themechanism of switching between ASEL andASER without any contradictions. Therefore, theHFPNe-based modeling will be one of thepromising modeling methods and simulation ar-chitectures that illustrate microRNA regulatorynetworks.

d. A combined pathway to simulate CDK-dependent phosphorylation and ARF-dependent stabilization for p53 transcrip-tional activity

Atsushi Doi, Masao Nagasaki, Kazuko Ueno,Hiroshi Matsuno, Satoru Miyano

The protein p53 is phosphorylated by a mem-ber of protein kinases such as CDK7, and stabi-lized by the protein ARF. The phosphorylationand stabilization of p53 is believed to enhanceits transcriptional activity and act simultane-ously. Biological pathways composed of expertsknowledge obtained from the literature are in-cluding these activation mechanisms. However,the map of biological pathways does not reflectthe combination effect of phosphorylation andstabilization. We have conducted some simula-tions of biological pathways with hybrid func-tional Petri net (HFPN) after careful reading ofpapers. In this paper, we constructed the HFPNbased biological pathway of CDK-dependentphosphorylation pathway and combine withARF-dependent pathway described previously,to observe the effect of the phosphorylation onthe stabilization with simulation-based valida-tion.

e. Analysis of gene networks for drug targetdiscovery and validation

Seiya Imoto, Yoshinori Tamada, Christopher J.Savoie, Satoru Miyano

Understanding responses of cellular systemfor a dosing molecule is one of the most impor-tant problems in pharmacogenomics. In this

142

Page 8: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

chapter, we describe computational methods foridentifying and validating drug target genesbased on the gene network estimated from mi-croarray gene expression data. We use twotypes of microarray gene expression data: One isgene disruptant microarray data and the other istime-course drug response microarray data. Forthis purpose, the information of gene networksplays an essential role and is unattainable fromclustering methods, which are the standard forgene expression analysis. The gene network isestimated from disruptant microarray data bythe Bayesian network model and then the pro-posed method automatically identifies sets ofgenes or gene regulatory pathways affected bythe drug. We use an actual example from analy-sis of Saccharomyces cerevisiae gene expressionprofile data to express a concrete strategy forthe application of gene network information to-ward drug target discovery. Key words: Drugtarget, gene network, Bayesian network, Booleannetwork, microarray data, Bayes statistics.

f. Finding module-based gene networks intime-course gene expression data withstate space models

Rui Yamaguchi, Ryo Yoshida, Seiya Imoto,Tomoyuki Higuchi, Satoru Miyano

We discuss the use of the state space modelsto analyze time-course microarray gene expres-sion data. Typical features of time-course mi-croarray data are short time-course and high-dimensional observational vector. By these as-pects, conventional statistical models such as themultivariate autoregressive models lead to un-suitable results due to the overfitting. The statespace models have the potential to overcomethis problem by the dimension reduction proc-ess. This paper provides (1) a survey of the statespace models, (2) a review of some existing re-searches using the state space models for time-course microarray data together with a biologi-cal meaning of the model, (3) a solution for thelack of parameter identifiability, and (4) identifi-cation of biological system that is a central topicin bioinformatics. Finally, we show the useful-ness of the state space models for time-coursemicroarray data by the analysis of Saccharomy-ces cerevisiae cell cycle gene expression data.

g. Structural modeling and analysis of sig-naling pathways based on Petri nets

Chen Li, Shunichi Suzuki, Qi-Wei Ge, MitsuruNakata, Hiroshi Matsuno, Satoru Miyano

The purpose of this study is to discuss how to

model and analyze signaling pathways by usingPetri net. Firstly, we propose a modelingmethod based on Petri net by paying attentionto the molecular interactions and mechanisms.Then, we introduce a new notion “activationtransduction component” in order to describe anenzymic activation process of reactions in sig-naling pathways and shows its correspondenceto a so-called elementary T-invariant in the Petrinet models. Further, we design an algorithm toeffectively find basic enzymic activation proc-esses by obtaining a series of elementary T-invariants in the Petri net models. Finally, wedemonstrate how our method is practically usedin modeling and analyzing signaling pathwaymediated by thrombopoietin as an example.

2. Statistical and Computational KnowledgeDiscovery

a. A statistical framework for genome-widediscovery of biomarker splice variationswith GeneChip Human Exon 1.0 ST arrays

Ryo Yoshida, Kazuyuki Numata, Seiya Imoto,Masao Nagasaki, Atsushi Doi, Kazuko Ueno,Satoru Miyano

Alternative splicing is an important regulatorymechanism that generates multiple mRNA tran-scripts which are transcribed into functionallydiverse proteins. According to the current stud-ies, aberrant transcripts due to splicing muta-tions are known to cause for 15% of genetic dis-eases. Therefore understanding regulatory me-chanism of alternative splicing is essential foridentifying potential biomarkers for severaltypes of human diseases. Most recently, adventof GeneChip� Human Exon 1.0 ST Array en-ables us to measure genome-wide expressionprofiles of over one million exons. With thisnew microarray platform, analysis of functionalgene expressions could be extended to detectnot only differentially expressed genes, but alsoa set of specific-splicing events that are differen-tially observed between one or more experimen-tal conditions, e.g. tumor or normal control cells.In this study, we address the statistical problemsto identify differentially observed splicing vari-ations from exon expression profiles. The pro-posed method is organized according to the fol-lowing process: (1) Data preprocessing for re-moving systematic biases from the probe inten-sities. (2) Whole transcript analysis with theanalysis of variance (ANOVA) to identify a setof loci that cause the alternative splicing-relatedto a certain disease. We test the proposed statis-tical approach on exon expression profiles ofcolorectal carcinoma. The applicability is veri-

143

Page 9: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

fied and discussed in relation to the existing bio-logical knowledge. This paper intends to high-light the potential role of statistical analysis ofall exon microarray data. Our work is an impor-tant first step toward development of more ad-vanced statistical technology. Supplementary in-formation and materials are available from http://bonsai . ims .u-tokyou.ac . jp/~yoshidar / IBSB2006_ExonArray.htm

b. Machine learning prediction of amino acidpatterns in protein N-myristoylation

Ryo Okada, Chigusa Miyakawa, ManabuSugii, Hiroshi Matsuno, Satoru Miyano

Protein N-myristoylation is the lipid modifica-tion in which the 14-th carbon saturated fattyacid binds covalently to N-terminal of virus-based and eukaryotic protein. In this study, wesuggest an approach to predict the pattern of N-myristoylation signal using the machine learn-ing system BONSAI. BONSAI finds rules incombination of an alphabet indexings and deci-sion trees. Computational experiments withBONSAI classified amino acid residues depend-ing on effect for N-myristoylation and foundrules of the alphabet indexing. In addition,BONSAI suggested new requirements for theposition of an amino acid in N-myristoylationsingal.

c. Contribution of comparative fish studies togeneral endocrinology: structure and func-tion of some osmoregulatory hormones

Yoshio Takei, Akatsuki Kawakoshi, TakehiroTsukada, Shinya Yuge, Maho Ogoshi, KojiInoue, Susumu Hyodo, Hideo Bannai, SatoruMiyano

Fish endocrinologists are commonly moti-vated to pursue their research driven by their

own interests in these aquatic animals. How-ever, the data obtained in fish studies not onlysatisfy their own interests but often contributemore generally to the studies of other verte-brates, including mammals. The life of fishes ischaracterized by the aquatic habitat, which de-mands many physiological adjustments distinctfrom the terrestrial life. Among them, body fluidregulation is of particular importance as thebody fluids are exposed to media of varying sa-linities only across the thin respiratory epitheliaof the gills. Endocrine systems play pivotal rolesin the homeostatic control of body fluid balance.Judging from the habitat-dependent controlmechanisms, some osmoregulatory hormones offish should have undergone functional and mo-lecular evolution during the ecological transitionto the terrestrial life. In fact, water-regulatinghormones such as vasopressin are essential forsurvival on the land, whereas ion-regulatinghormones such as natriuretic peptides, guany-lins and adrenomedullins are diversified and ex-hibit more critical functions in aquatic species.In this short review, we introduce some exam-ples illustrating how comparative fish studiescontribute to general endocrinology by takingadvantage of such differences between fishesand tetrapods. In a functional context, fish stud-ies often afford a deeper understanding of theessential actions of a hormone across vertebratetaxa. Using the natriuretic peptide family as anexample, we suggest that more functional stud-ies on fishes will bring similar rewards of un-derstanding. At the molecular level, recent es-tablishment of genome databases in fishes andmammals brings clues to the evolutionary his-tory of hormone molecules via a comparativegenomic approach. Because of the functionaland molecular diversification of ion-regulatinghormones in fishes, this approach sometimesleads to the discovery of new hormones intetrapods as exemplified by adrenomedullin 2.

Publications

1. DeLisi, C., Kanehisa, M., Heinrich, R., Miy-ano, S. (Eds.) Genome Informatics. 17 (1),2006.

2. Doi, A., Nagasaki, M., Matsuno, H., Miy-ano, S. Simulation based validation of the p53 transcriptional activity with hybrid unc-tional Petri net. In Silico Biology. 6 (1-2): 1-13, 2006.

3. Doi, A., Nagasaki, M., Ueno, K., Matsuno,H., Miyano, S.A combined pathway tosimulate CDK-dependent phosphorylationand ARF-dependent stabilization for p53

transcriptional activity. Genome Informatics.17 (1): 112-123, 2006.

4. Imoto, S., Miyano, S. Bayesian network ap-proach to estimate gene networks. in A.Mittal, A. Kassim and T. Tan (Eds.), Baye-sian Network Technologies: Applicationsand Graphical Models, Idea Group Publish-ers, USA. In press.

5. Imoto, S., Tamada, Y., Araki, H., Yasuda,K., Print, C.G., Charnock-Jones, S.D., Sand-ers, D., Savoie, C.J., Tashiro, K., Kuhara, S.,Miyano, S. Computational strategy for dis-

144

Page 10: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

covering druggable gene networks fromgenome-wide RNA expression profiles. Pa-cific Symposium on Biocomputing. 11: 559-571, 2006.

6. Imoto, S., Tamada, Y., Savoie, C.J., Miyano,S., Analysis of gene networks for drug tar-get discovery and validation. Methods inMolecular Biology. 360: 33-56, 2006.

7. Imoto, S., Higuchi, T., Goto, T., Miyano, S.Error tolerant model for incorporating bio-logical knowledge with expression data inestimating gene networks. Statistical Meth-odology. 3 (1): 1-16, 2006.

8. Jeong, E., Miyano, S.A weighted profilebased method for protein-RNA interactingresidue prediction. Transactions on Compu-tational Systems Biology. 123-139, 2006.

9. Li, C., Suzuki, S., Ge, Q.-W., Nakata, M.,Matsuno, H., Miyano, S. Structural model-ing and analysis of signaling pathwaysbased on petri nets. J. Bioinf. Comput. Biol.4 (5): 1119-1140, 2006.

10. Matsuno, H., Inouye, S.-T., Okitsu, Y., Fujii,Y., Miyano, S.A new regulatory interactionssuggested by simulations for circadian ge-netic control mechanism in mammals. J.Bioinf. Comput. Biol. 4 (1): 139-154, 2006.

11. Matsuno, H., Li, C., Miyano, S. Petri netbased description for systematic under-standing of biological pathways. IEICETrans. Fundamentals. E89-A (11): 3166-3174,2006.

12. Miyano, S. (Ed.) Special RECOMB 2005 Is-sue. J. Comp. Biol. 13 (2), 2006.

13. Nagasaki, M., Yamaguchi, R., Yoshida, R.,Imoto, S., Doi, A., Tamada, Y., Matsuno, H.,Miyano, S., Higuchi, T. Genomic data as-similation for estimating hybrid functionalPetri net from time-course gene expressiondata. Genome Informatics. 17 (1). 46-61,2006.

14. Nakamichi, R., Imoto, S., Miyano, S. Statisti-cal model selection method to analyze com-binatorial effects of SNPs and environ-mental factors for binary disease. Interna-tional Journal on Artificial IntelligenceTools. 15 (5), 711-724, 2006.

15. Okada, R., Sugii, M., Matsuno, H., Miyano,S. Machine learning prediction of aminoacid patterns in protein N-myristoylation.Lecture Notes in Bioinformatics. 4146: 4-14,2006.

16. Saito, A., Nagasaki, M., Doi, A., Ueno, K.,Miyano, S. Cell fate simulation model ofgustatory neurons with microRNAs double-negative feedback loops by hybrid func-tional Petri net with extension. Genome In-formatics 17 (1): 100-111, 2006.

17. Sakakibara, Y., Smith, T.F., Kanehisa, M.,Miyano, S., Takagi, T. (Eds.) Genome Infor-matics. 17 (2), 2006.

18. Takei, Y., Kawakoshi, A., Tsukada, T., Yuge,S., Ogoshi, M., Inoue, K., Hyodo, S., Bannai,H., Miyano, S. Contribution of comparativefish studies to general endocrinology: struc-ture and function of some osmoregulatoryhormones. J. Experimental Zoology. Part A,Comparative Experimental Biology. 305 (9):787-798, 2006.

19. Tamada, Y., Imoto, S., Miyano, S. Estimat-ing gene networks from gene expressiondata utilizing biological information. Proc.Inst. Statist. Math. 54 (2): 333-356, 2006.

20. Tasaki, S., Nagasaki, M., Oyama, M., Hata,H., Ueno, K., Yoshida, R., Higuchi, T.,Sugano, S., Miyano, S. Modeling and esti-mation of dynamic EGFR pathway by dataassimilation approach using time series pro-teomic data. Genome Informatics. 17 (2):226-228, 2006.

21. Washio, T., Higuchi, T., Imoto, S., Tamada,Y., Sato, K., Motoda, H. Graph mining andits application to statistical modeling. Proc.Inst. Statist. Math. 54 (2): 315-331, 2006.

22. Yamaguchi, R., Yoshida, R., Imoto, S.,Higuchi, T., Miyano, S. Finding module-based gene networks in time-course geneexpression data with state space models.IEEE Signal Processing Magazine. In press.

23. Yoshida, R., Higuchi, T., Imoto, S., Miyano,S. ArrayCluster: an analytic tool for cluster-ing, data visualization and module finderon gene expression profiles. Bioinformatics.22 (12): 1538-1539, 2006.

24. Yoshida, R., Numata, K., Imoto, S., Na-gasaki, M., Doi, A., Ueno, K., Miyano, S.Astatistical framework for genome-wide dis-covery of biomarker splice variations withGeneChip Human Exon 1.0 ST arrays.Genome Informatics. 17 (1): 88-99, 2006.

25.宮野 悟,江口至洋,金久 實,高木利久,中井謙太(編).バイオインフォマティクス事典.共立出版.2006.

145

Page 11: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

The major goal of the Human Genome Project is to identify genes of medical im-portance, and to develop new diagnostic and therapeutic tools. We have been at-tempting to isolate genes involving in carcinogenesis and also those causing orpredisposing to various diseases as well as those related to drug efficacies andadverse reactions. By means of technologies developed through the genome pro-ject including a high-resolution SNP map, a large-scale DNA sequencing, and thecDNA microarray method, we have isolated a number of biologically and/or medi-cally important genes.

1. Genes playing significant roles in humancancers

Toyomasa Katagiri, Yataro Daigo, HidewakiNakagawa, Koichi Matsuda, Hitoshi Zembutsu,Ryo Takata, Mitsugu Kanehira, Koji Taka-hashi, Chiyuki Furukawa, Atsushi Takano,Nobuhisa Ishikawa, Tatsuya Kato, SatoshiHayama, Chie Suzuki , Akira Togashi ,Kazuhito Morioka, Tomohide Kidokoro, ChizuTanikawa, Asahi Hishida, Miki Akiyama, Sa-chiko Dobashi, Meng-Lay Lin, Jae-Hyun Park,Tomomi Ueki, Yosuke Harada, Koichiro Inaki,

Chikako Fukukawa, Koji Ueda, Minh HueNguyen, Yuria Mano, Masaya Taniwaki, Ryo-hei Nishino, Daizaburo Hirata, TakumiYamabuki, Nagato Sato, Kenji Tamura,Masayo Hosokawa, Su-Youn Chung, MotohideUemura, Akio Takehara, Arata Shimo, Eiji Hi-rota, and Yusuke Nakamura

(1) Chemosensitivity

To predict the efficacy of the M-VAC neoadju-vant chemotherapy for invasive bladder cancers,we previously established the method to calcu-

Human Genome Center

Laboratory of Molecular MedicineLaboratory of Genome TechnologyDivision of Advanced Clinical Proteomicsゲノムシークエンス解析分野シークエンス技術開発分野先端臨床プロテオミクス共同研究ユニット

Professor Yusuke Nakamura, M.D., Ph.D.Assistant Professor Hidewaki Nakagawa, M.D., Ph.D.Assistant Professor Koichi Matsuda, M.D., Ph.D.Assistant Professor Hitoshi Zembutsu, M.D., Ph.D.Associate Professor Toyomasa Katagiri, Ph.D.Assistant Professor Ryuji Hamamoto, Ph.D.Project Associate Professor Yataro Daigo, M.D., Ph.D.

教 授 医学博士 中 村 祐 輔助 手 医学博士 中 川 英 刀助 手 医学博士 松 田 浩 一助 手 医学博士 前 佛 均助教授 医学博士 片 桐 豊 雅助 手 理学博士 浜 本 隆 二特任助教授 医学博士 醍 醐 弥太郎

146

Page 12: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

late the prediction score on the basis of expres-sion profiles of 14 predictive genes. This scoringsystem had clearly distinguished the respondergroup from the non-responder group. To furthervalidate the clinical significance of the system,we applied it to 22 additional cases of bladdercancer patients and found that the scoring sys-tem correctly predicted clinical response for 19of the 22 test cases. The group of patients withpositive predictive scores had significantlylonger survival than that with negative scores.When we compared our results with the previ-ous report describing the prognosis of the pa-tients with cystectomy alone, the results implythat patients with positive scores are likely tohave benefit by having M-VAC neoadjuvantchemotherapy, but that the chemotherapywould shorten lives of patients with negativescores. We are confident that our prediction sys-tem to M-VAC therapy should provide opportu-nities for achieving better prognosis and im-proving quality of life of patients. Taken to-gether, our data suggest that the goal of “per-sonalized medicine,” prescribing the appropriatetreatment regimen for each patient, may beachievable by selecting specific sets of genes fortheir predictive values.

Philadelphia-chromosome-positive acute lym-phoblastic leukemia (Ph+ALL) revealed one ofthe poorest prognoses among leukemias due toits high incidence of relapse. Although morethan 96% of patients with Ph-positive ALLachieved complete remission (CR) by theimatinib-combined chemotherapy in a phase IIstudy conducted by the Japan adult leukemiastudy group (JALSG), 25% of them experiencedrelapse. To establish a prediction system for arisk of relapse after CR, we analyzed gene ex-pression profiles of 23 bone marrow samplesfrom patients with Ph+ALL using cDNA mi-croarray consisting of 27,648 cDNA sequences.Using the 19 randomly-selected test cases, weidentified 16 genes that were expressed signifi-cantly differently between “non-relapse” (8 pa-tients) and “relapse” (11 patients) groups; fromthe list of 16 genes, we selected the 6 “predic-tive” genes that showed significant differencesand constructed a numerical prediction scoringsystem by which the non-relapse group (withpositive scores) was clearly separated from therelapse group (with negative scores). Scoring of4 cases that were reserved from the original 23cases predicted correctly the responses toimatinib-combined chemotherapy. In addition,three cases, that were resistant to the imatinib-combined therapy and failed to induce remis-sion, also revealed the negative scores. Becausereal-time reverse transcription-PCR data werehighly concordant with the cDNA microarray

data for those 6 genes, we developed a quantita-tive reverse transcription-PCR based predictionsystem that could be feasible for routine clinicaluse. Our results suggest that possibility of therelapse after complete remission by the imatinib-combined chemotherapy can be predicted byexpression patterns in this set of genes, leadingto achievement of “personalized therapy” fortreatment of this disease.

(2) Lung cancer

We found co-transactivation of CDCA1 (celldivision associated 1) and KNTC2 (kinetocoreassociated 2), members of the evolutionarily-conserved centromere protein complex, in non-small cell lung carcinomas (NSCLCs). Immuno-histochemical analysis using lung-cancer tissuemicroarray confirmed high levels of CDCA1 andKNTC2 proteins in the great majority of lungcancers of various histological types. Their ele-vated expressions were associated with poorerprognosis of NSCLC patients. Knockdown ofeither CDCA1 or KNTC2 expression withsiRNA significantly suppressed growth ofNSCLC cells. Furthermore, inhibition of theirbinding by a cell-permeable peptide carrying theCDCA1-derived 19 amino-acid peptide (11R-CDCA1398-416) that correspond to the binding do-main to KNTC2 effectively suppressed growthof NSCLC cells. As our data imply that humanCDCA1 and KNTC2 appear to fall in the cate-gory of cancer-testis antigens and their simulta-neous up-regulation is a frequent and importantfeature of cell growth/survival of lung-cancer,selective suppression of CDCA1 or KNTC2 ac-tivity, and/or inhibition of the CDCA1-KNTC2complex formation could be a promising thera-peutic target for treatment of lung cancer.

We also identified abundant expression ofneuromedin U (NMU) in the great majority oflung cancers. Immunohistochemical analysisdemonstrated significant association of NMU ex-pression with poorer prognosis of NSCLC pa-tients. Treatment of NSCLC cells with siRNAagainst NMU suppressed its expression and in-hibited growth of the cells; on the other hand,induction of exogenous expression of NMU con-ferred growth-promoting activity and enhancedthe cell mobility in vitro. We found that two Gprotein-coupled receptors, growth hormonesecretagogue receptor 1b (GHSR1b) and neuro-tensin receptor 1 (NTSR1), were also over-expressed in NSCLC cells and a hetero-dimercomplex of these receptors functioned as anNMU receptor. The NMU-receptor interactionsubsequently induced generation of a secondmessenger, cAMP, to activate its downstreamgenes including transcription factors and cell cy-

147

Page 13: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

cle regulators. Treatment of NSCLC cells withsiRNAs for GHSR or NTSR1 suppressed expres-sion of those genes and the growth of NSCLCcells. These data strongly implied that targetingthe NMU signaling pathway would be a prom-ising therapeutic strategy for treatment of lungcancers.

In addition, we found over-expression of aMAPJD (Myc-associated protein with JmjC do-main) gene in the great majority of NSCLCcases. Induction of exogenous expression ofMAPJD into NIH3T3 cells conferred growth-promoting activity. Concordantly, in vitro sup-pression of MAPJD expression with siRNA ef-fectively suppressed growth of NSCLC cells inwhich MAPJD was over-expressed. We foundfour candidate MAPJD-target genes, SBNO1,TGFBRAP1, RIOK1 , and RASGEF1A, whichwere the most significantly induced by exoge-nous MAPJD expression. Through interactionwith MYC protein, MAPJD transactivates a setof genes including kinases and cell signal trans-ducers that are possibly related to proliferationof lung cancer cells. As our data imply thatMAPJD is a novel member of the MYC tran-scriptional complex and its activation is a com-mon feature of lung-cancer, selective suppres-sion of this pathway could be a promisingtherapeutic target for treatment of lung cancers.

To identify molecules that might serve asbiomarkers or targets for development of novelmolecular therapies, we have been screeninggenes encoding transmembrane/secretory pro-teins that are up-regulated in lung cancers, us-ing cDNA microarrays coupled with purificationof tumor cells by laser microdissection. A geneencoding seizure-related 6 homolog (mouse)-like2 (SEZ6L2) protein, was chosen as a candidatefor such molecule. Semi-quantitative RT-PCRand western-blot analyses documented in-creased expression of SEZ6L2 in the majority ofprimary lung cancers and lung-cancer cell linesexamined. SEZ6L2 protein was proven to bepresent on the surface of lung-cancer cells byflow cytometrical analysis using anti-SEZ6L2 an-tibody. Immunohistochemical staining for tumortissue microarray consisting of 440 archivedlung-cancer specimens detected positive SEZ6L2staining in 327 (78%) of 420 non-small cell lungcancers (NSCLCs) and 13 (65%) of 20 small-celllung cancers (SCLCs) examined. Moreover,NSCLC patients whose tumors revealed ahigher level of SEZ6L2 expression sufferedshorter tumor-specific survival compared tothose with no SEZ6L2 expression. These resultsindicate that SEZ6L2 should be a useful prog-nostic marker of lung cancers.

Furthermore, to characterize the molecularmechanisms involved in carcinogenesis and pro-

gression of small-cell lung cancer (SCLC) andidentify molecules to be applicable as novel di-agnostic markers and/or for development ofmolecular-targeted drugs, we applied cDNA mi-croarray profile analysis coupled with purifica-tion of cancer cells by laser-microbeam microdis-section (LMM). Expression profiles of 32,256genes in 15 SCLCs identified 252 genes thatwere commonly up-regulated and 851 tran-scripts that were down-regulated in SCLC cellscompared with non-cancerous lung tissue cells.An unsupervised clustering algorithm applied tothe expression data easily distinguished SCLCfrom the other major histological type of non-small cell lung cancer (NSCLC) and identified475 genes that may represent distinct molecularfeatures of each of the two histological types. Inparticular, SCLC was characterized by alteredexpression of genes related to neuroendocrinecell differentiation and/or growth such as ASCL1, NRCAM , and INSM1 . We also identified 68genes that were abundantly expressed both inadvanced SCLCs and advanced adenocarcino-mas (ADCs), both of which had been obtainedfrom patients with extensive chemotherapytreatment. Some of them are known to be tran-scription factors and/or gene expression regula-tors such as TAF5L, TFCP2L4, PHF20, LMO4,TCF20, RFX2 , and DKFZp547I048 as well asthose encoding nucleotide-binding proteins suchas C9orf76, EHD3 , and GIMAP4 . Our data pro-vide valuable information for better understand-ing of lung carcinogenesis and chemoresistance.

(3) Pancreatic cancer

Pancreatic ductal adenocarcinoma (PDAC)shows the worst mortality among the commonmalignancies, with a 5-year survival rate of only4%, and the majority of PDAC patients are di-agnosed at an advanced stage, in which no ef-fective therapy is available at present. Althoughthe proportion of curable cases is still not sohigh, surgical resection of early-staged PDACs isthe only way to cure the disease. Hence, estab-lishment of a screening strategy to detect early-staged PDACs by novel serological markers isurgently required, and development of novelmolecular therapies for PDAC treatment is alsoeagerly expected. We here report over-expression of REG4, a new member of the REGfamily, in PDAC cells on the basis of thegenome-wide cDNA microarray analysis as wellas RT-PCR and immunohistochemical analysis.We also detected significant elevation of REG4in serum of some parts of patients with early-staged PDACs by our ELISA system, indicatingthe possibility of REG4 as a new serologicalmarker of PDACs. Furthermore, we found that

148

Page 14: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

knockdown of the endogenous REG4 expressionin PDAC cell lines with siRNA caused decreaseof cell viability. Concordantly, addition of re-combinant REG4 to the culture medium en-hanced growth of PDAC cell line in a dose-dependent manner. A monoclonal antibodyagainst REG4 neutralized its growth-promotingeffects and attenuated significantly the growthof PDAC cells. These findings implicate thatREG4 is a promising tumor marker to screenearly-staged PDAC and also that neutralizationof REG4 by the antibody may offer us novel po-tential tools for treatment of PDACs.

Among dozens of up-regulated genes inPDAC cells, we also focused on one tyrosinekinase receptor, Eph receptor A4 (EphA4), as amolecular target for PDAC therapy. Immunohis-tochemical analysis validated EphA4 overex-pression in approximately a half of PDAC tis-sues. To investigate its biological function inPDAC cells, we knocked down EphA4 expres-sion by siRNA, which drastically attenuatedPDAC cell viability. Concordantly to the siRNAexperiment, PDAC-derivative cells that were de-signed to constitutively express exogenousEphA4 showed more rapid growth rate thancells transfected with mock vector, suggestingthe growth-promoting effect of EphA4 on PDACcells. Furthermore, the expression analysis forephrin ligand family members indicated co-existence of ephrinA3 ligand in PDAC cells withEphA4 receptor, and knockdown of ephrinA3 bysiRNA also attenuated PDAC cell viability aswell as EphA4 . These results implicate that theEphA4-ephrinA3 pathway is likely to be apromising molecular target for pancreatic cancertherapy.

(4) Prostate cancer

Through genome-wide cDNA microarrayanalysis coupled with microdissection of pros-tate cancer cells, we identified MICAL2-PV(Molecule Interacting with CasL-2 Prostate CancerVariants) , novel splicing variants of MICAL2 ,showing overexpression in PC cells. Northernblot analysis demonstrated that MICAL2-PVswere cancer-testis specific transcripts. Immuno-histochemical analysis using an antibody gener-ated specific to MICAL2-PV revealed that MI-CAL2-PV was expressed in the cytoplasm ofcancer cells with various staining patterns andintensities, while it was not or hardly detectablein adjacent normal prostate epithelium orprostatic intraepithelial neoplasia (PIN). Interest-ingly, immunohistochemical analysis of 105 PCspecimens on the tissue microarray indicatedthat MICAL2-PV expression status was signifi-cantly correlated with Gleason scores (P <0.0001)

or tumor classification (p<0.0001). Furthermore,the expression levels of MICAL2-PVs were alsoconcordant to those of c-Met, a marker of tumoraggressiveness, with statistical significance (p=0.0018). To investigate its biological function inPC cells in vitro, we knocked down endogenousMICAL2-PVs in PC cells by siRNA, which re-sulted in the significant reduction of PC cell vi-ability. Our findings suggest that MICAL2-PVsis likely to be involved in tumor progression oraggressiveness of PC, and could be a candidateas a novel molecular marker and/or a target fortreatment of advanced PCs.

(5) Breast Cancer

We previously reported that up-regulation ofSMYD3, a histone H3 lysine-4 specific methyl-transferase, plays a key role in the proliferationof colorectal carcinoma (CRC) and hepatocellu-lar carcinoma (HCC). In this study, we revealthat SMYD3 expression is also elevated in agreat majority of breast cancer tissues. Similarlyto CRC and HCC, silencing of SMYD3 bysiRNA to this gene resulted in the inhibitedgrowth of breast cancer cells, suggesting that theincreased SMYD3 expression is also essential forthe proliferation of the breast cancer cells. More-over, we show here that SMYD3 could promotebreast carcinogenesis by directly regulating theexpression of the proto-oncogene WNT10B .These data imply that augmented SMYD3 playsa crucial role in breast carcinogenesis, and thatinhibition of SMYD3 should be a novel thera-peutic strategy for treatment of breast cancer.

Cancer therapy directing at specific moleculartargets in signaling pathways of cancer cellssuch as Tamoxifen, aromatase inhibitors andtrastuzumab has been proven its usefulness fortreatment of advanced breast cancers. However,increases of the risk of endometrial cancer bylong-term tamoxifen administration as well asthose of bone fracture due to osteoporosis inpostmenopausal women with aromatase inhibi-tor prescription are recognized as their side ef-fects. Due to the emergence of these side effectsand also drug resistance, it is necessary tosearch novel targets for molecularly-orientateddrugs on the basis of well-characterized mecha-nisms of action. Using the accurate genome-wide expression profiles of breast cancers, wefound maternal embryonic leucine-zipper kinase(MELK) that was significantly overexpressed inthe great majority of breast cancer cells. To as-sess a possible role of MELK in mammary car-cinogenesis, we knocked down the expression ofendogenous MELK in breast cancer cell-lines bymeans of the mammalian vector-based RNA in-terference (RNAi) technique. Furthermore, we

149

Page 15: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

identified a long isoform of Bcl-G (Bcl-GL), a pro-apoptotic member of Bcl-2 family, as a possiblesubstrate (s) for the MELK kinase by pull-downassay with wild-type-and kinase-dead-MELK re-combinant proteins. Finally, we performedTUNEL assay and FACS analysis to measure theproportions of sub-G1 population to investigateMELK is involved in apoptosis cascade throughthe Bcl-GL-related pathway. The multiple humantissues-and cancer cell lines-northern blot analy-ses demonstrated that MELK was overexpressedat a significantly high level in a great majorityof breast cancers and cancer cell lines, but notexpressed in normal vital organs (heart, liver,lung, hidney). Suppression of MELK expressionwith small-interfering RNA significantly inhib-ited growth of human breast cancer cells. Wealso found that MELK protein physically inter-acted with Bcl-GL protein, a pro-apoptotic mem-ber of the Bcl-2 family through its N-terminalregion. Subsequent immunocomplex kinase as-say showed Bcl-GL was specifically phospho-rylated by MELK in vitro. TUNEL assay andFACS analysis revealed that overexpression ofWT-MELK suppressed Bcl-GL-induced apoptosis,while that of D150A-MELK did not. Our find-ings suggest that kinase activity of MELK islikely to be involved in mammary carcinogene-sis through inhibition of pro-apoptotic functionof Bcl-GL. The kinase activity of MELK shouldbe a promising target for development ofmolecular-targeting therapy for patients withbreast cancers.

We also focused on one gene that encodesPBK/TOPK including a kinase domain. North-ern blot analyses using mRNAs of normal hu-man organs, breast cancer tissues, and cancercell-lines indicated this molecule to be a novelcancer/testis antigen. Reduction of PBK/TOPKexpression by siRNA resulted in significant sup-pression of cell growth probably due to dys-function in the cytokinetic process. Immunocyto-chemical analysis with anti-PBK/TOPK anti-body implicated a critical role of PBK/TOPK inan early step of mitosis. PBK/TOPK could phos-phorylate histone H3 at Ser10 in viro and invivo, and medicated its growth-promoting effectthrough histone H3 modification. Since PBK/TOPK is the cancer/testis antigen and its kinasefunction is likely to be related to its oncogenicactivity, we suggest PBK/TOPK to be a promis-ing molecular target for breast cancer therapy.

Among the up-regulated genes, we further fo-cused on functional significance of PRC1 (pro-tein regulator of cytokinesis 1) in developmentof breast cancer. Western blot analysis usingbreast cancer cell-lines revealed a significant in-crease of endogenous PRC1 level in G2/Mphase. Treatment of breast cancer cells with

small-interfering RNAs (siRNAs) against PRC1effectively suppressed PRC1 expression and in-hibited the growth of breast cancer cells, T47Dand HBC5. Furthermore, we found interactionof PRC1 and KIF2C/MCAK (Kinesin familymember 2C/Mitotic centromere-associated kine-sin) by co-immunoprecipitation and immuno-blotting using COS-7 cells in which these mole-cules were introduced exogenously. These find-ings suggest a possible involvement of the PRC1-KIF2C/MCAK complex in breast tumorigenesisand that this complex should be a promisingtarget for development of novel treatment forbreast cancer.

(6) Colon cancer

Through a genome-wide cDNA microarrayanalysis, we identified a number of genes whoseexpression was up-regulated frequently in col-orectal cancer. Among them, we here report agene termed FAM84A that was expressed innone of 22 normal tissues examined except thetestis. Although immunocytochemical stainingrevealed localization of FAM84A protein in thesub-cellular membrane region, the staining wasobserved limitedly in the region lacking the at-tachment with neighboring cells. In addition, wefound that exogenous FAM84A expression in-creased cell motility in NIH3T3 cells, and thatphosphorylation of serine38 of FAM84A was as-sociated with morphology of cells. Our resultsindicate a possibility that up-regulation of FAM84A plays a critical role in progression of coloncancer.

(7) Renal cancer

In order to clarify the molecular mechanisminvolved in renal carcinogenesis, and identifymolecular targets for diagnosis and treatment,we analyzed genome-wide gene expression pro-files of 15 surgical specimens of clear cell renalcell carcinoma (RCC), compared to normal renalcortex, using a combination of laser microbeammicrodissection (LMM) with a cDNA microarrayrepresenting 27,648 genes. We identified 257genes that were commonly up-regulated and721 genes that were down-regulated in RCCs.Interestingly, none of top 24 up-regulated genesthat showed most significant differences in in-formative RCC-cases were included in previ-ously reports describing expression profiles ofccRCC using RNAs isolated from bulk tissues.These findings suggest that it is important topurify as much as possible the populations ofcancerous and normal epithelial cells obtainedfrom surgical specimens. Among significantly-transactivated genes, we in this study focused

150

Page 16: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

on Semaphorin 5B (SEMA5B ) and knocked-down its expression in RCC cells by small-interfering RNA (siRNA). Effective down-regulation of its expression levels in RCC cellssignificantly attenuated RCC cell viability. Inconclusion, our data should be helpful for a bet-ter understanding of the tumorigenesis of RCCand should contribute to the development of di-agnostic tumor marker and molecular-targetingtherapy for patients with RCC.

(8) Bladder cancer

To disclose molecular mechanism of bladdercancer, the second most common genitourinarytumor, we had previously performed genome-wide expression profile analysis of 26 bladdercancers by means of cDNA microarray repre-senting 27,648 genes. Among genes that weresignificantly up-regulated in the majority ofbladder cancers, we here report identification ofMPHOSPH1 (M-phase phosphoprotein 1) as a can-didate molecule for drug development for blad-der cancer. Northern blot analyses using mRNAsof normal human organs and cancer cell-linesindicated this molecule to be a novel cancer/tes-tis antigen. Introduction of MPHOSPH1 intoNIH3T3 cells significantly enhanced cell growthat in vitro and in vivo conditions. We subsequ-ently found an interaction between MPHOSPH1and PRC1 (Protein Regulator of Cytokinesis 1),which was also up-regulated in bladder cancercells. Immunocytochemical analysis revealed co-localization of endogenous MPHOSPH1 andPRC1 proteins in bladder cancer cells. Interest-ingly, knockdown of either of MPHOSPH1 orPRC1 expression with specific siRNAs causedsignificant increase of multi-nuclear cells andsubsequent cell death of bladder cancer cells.Our results imply that the MPHOSPH1/PRC1complex is likely to play a crucial role in blad-der carcinogenesis and that inhibition of theMPHOSPH1/PRC1 expression or their interac-tion should be novel therapeutic targets forbladder cancers.

(9) Cholangiocarcinoma

Intrahepatic cholangiocarcinoma (ICC) is thesecond most common primary cancer in theliver, and its incidence is highest in northeasternpart of Thailand. ICCs in this region are knownto be associated with infection with liver flukes,particularly Opisthorchis viverrini (OV), as wellas nitrosamines from food. To clarify molecularmechanisms of ICC associated with or withoutliver flukes, we analyzed gene-expression pro-files of fluke-associated ICCs from 20 Thai pa-tients, and compared their profiles with those of

20 Japanese ICCs that were not associated withfluke by means of laser-microbeam-micro-dissection and a cDNA microarray containing27,648 genes. We identified 77 commonly upre-gulated genes and 325 commonly downregu-lated genes in the two ICC groups. Unsuper-vised hierarchical cluster analysis separated the40 ICCs into two major branches almost com-pletely according to the fluke status. The puta-tive signature of liver fluke-associated ICC ex-hibited elevated expression of genes involved inxenobiotic metabolism (UGT2B11, UGT1A10,CHST4, SULT1C1), while that of non-liver fluke-associated ICC represented enhanced expressionof genes related to growth factor signaling(TGFBI, PGF, IGFBP1, IGFBP3). Additional ran-dom permutation tests identified a total of 52genes whose expression levels were significantlydifferent between the two groups. We also iden-tified 17 genes associated with macroscopic typeof fluke-associated ICCs. These data may notonly contribute to clarification of common andfluke-specific mechanisms underlying ICC, butalso serve as a starting point for the identifica-tion of novel diagnostic markers and/or thera-peutic targets for the disease.

(10) Biomarker for esophageal and lung can-cer

Gene-expression profile analysis of lung andesophageal carcinomas revealed that Dikkopf-1(DKK1) was highly transactivated in the greatmajority of lung cancers and esophagealsquamous-cell carcinomas (ESCCs). Immunohis-tochemical staining using tumor tissue microar-rays consisting of 279 archived non-small celllung cancers (NSCLCs) and 280 ESCC speci-mens demonstrated that a high level of DKK1expression was associated with poor prognosisof patients with NSCLC as well as ESCC, andmultivariate analysis confirmed its independentprognostic value for NSCLC. In addition, weidentified that exogenous expression of DKK1increased the migratory activity of mammaliancells, suggesting that DKK1 may play a signifi-cant role in progression of human cancer. Weestablished an ELISA system to measure serumlevels of DKK1 and found that serum DKK1 lev-els were significantly higher in lung andesophageal cancer patients than in healthy con-trols. The proportion of the DKK1-positive caseswas 126 (70.0%) of 180 NSCLC, 59 (69.4%) of 85SCLC, and 51 (63.0%) of 81 ESCC patients,while only 10 (4.8%) of 207 healthy volunteerswere falsely diagnosed as positive. A combinedELISA assays for both DKK1 and CEA increasedsensitivity, and classified 82.2% of the NSCLCpatients as positive while only 7.7% of healthy

151

Page 17: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

volunteers were falsely diagnosed to be positive.The use of both DKK1 and ProGRP increasedsensitivity to detect SCLCs up to 89.4%, whilefalse positive rate in healthy donors were only6.3%. Our data imply that DKK1 should be use-ful as a novel diagnostic/prognostic biomarkerin clinic and probably as a therapeutic target forlung and esophageal cancer.

2. Common diseases

(1) cerebral infarction

Authors: Michiaki Kubo1,2,4, Jun Hata1,2,4, Toshi-haru Ninomiya1,2, Koichi Matsuda4, Koji Yo-nemoto1, Toshiaki Nakano2,3, Tomonaga Mat-sushita2,4, Keiko Yamanaka-Yamazaki4, YozoOhnishi5, Susumu Saito5, Takanori Kitazono2,Setsuro Ibayashi2, Katsuo Sueishi3, MitsuoIida2, Yusuke Nakamura4, and Yutaka Kiyo-hara1.: 1Department of Environmental Medi-cine, 2Department of Medicine and Clinical Sci-ence, 3Pathophysiological and Experimental Pa-thology, Graduate School of Medical Sciences,Kyushu University, Fukuoka 812-8582, Japan.4Laboratory of Molecular Medicine, HumanGenome Center, Institute of Medical Science,University of Tokyo, Tokyo 108-8639, Japan.

5Laboratory for Genotyping, SNP ResearchCenter, the Institute of Physical and ChemicalResearch (RIKEN), Yokohama 230-0045, Ja-pan.

Cerebral infarction is the most common typeof stroke and often causes long-term disability.To investigate the genetic contribution to cere-bral infarction, we conducted a case-controlstudy using 52,608 gene-based tag-SNPs selectedfrom JSNP database. Here we reported that onenon-synonymous SNP in a member of proteinkinase C (PKC) family, PRKCH , was signifi-cantly associated with lacunar infarction in twoindependent Japanese samples (p=3.2x10-7,crude odds ratio of 1.39). This SNP was likely toaffect the PKC activity. Furthermore, a 14-yearfollow-up cohort study in Hisayama (Fukuoka,Japan) supported involvement of this SNP forthe development of cerebral infarction (p=0.03,age-and sex-adjusted hazard ratio of 2.83). Wealso found that PKCη was mainly expressed invascular endothelial cells and foamy macro-phages in human atherosclerotic lesions, and itsexpression was enhanced as the lesion type pro-gressed. Our results support a role for PRKCHin the pathogenesis of cerebral infarction.

Publications

1. R. Hamamoto, F.P. Silva, M. Tsuge, T.Nishidate, T. Katagiri, Y. Nakamura and Y.Furukawa: Enhanced SMYD3 expression isessential for the growth of breast cancercells. Cancer Science, 97: 113-118, 2006

2. K. Obama, T. Kato, S. Hasegawa, S. Satoh,Y. Nakamura, and Y. Furukawa: Overex-pression of peptidyl-prolyl isomerase-like 1is associated with the growth of colon can-cer cells. Clinical Cancer Research, 12: 70-76,2006

3. A. Iida, S. Saito, A. Sekine, A. Takahashi, N.Kamatani, and Y. Nakamura: Japanese sin-gle nucleotide polymorphism database for267 possible drug-related genes. Cancer Sci-ence, 97: 16-24, 2006

4. T. Kikuchi, Y. Daigo, N. Ishikawa, T.Katagiri, T. Tsunoda, S. Yoshida, and Y.Nakamura: Expression profiles of metastaticbrain tumor from lung adenocarcinomas oncDNA microarray. International Journal ofOncology, 28: 799-805, 2006

5. T. Mushiroda, Y. Ohnishi, S. Saito, A. Taka-hashi, Y. Kikuchi, S. Saito, H. Shimomura, Y.Wanibuchi, T. Suzuki, N. Kamatani, and Y.Nakamura: Association of VKORC1 andCYP2C9 polymorphisms with warfarin dose

requirements in Japanese patients. Journal ofHuman Genetics, 51: 249-253, 2006

6. T. Suda, T. Tsunoda, N. Uchida, T. Watan-abe, S. Hasegawa, S. Satoh, S. Ohgi, Y. Fu-rukawa, Y. Nakamura, and H. Tahara: Iden-tification of secernin 1 as a novel immuno-therapy target for gastric cancer using theexpression profiles of cDNA microarray.Cancer Science, 97: 411-419, 2006

7. K. Yoshiura, A. Kinoshita, T. Ishida, A. Ni-nokata, T. Ishikawa, T. Kaname, M. Bannai,K. Tokunaga, S. Sonoda, R. Komaki, M.Ihara, V. Saenko, A.G. Kaimovich, I. Sekine,K. Komatsu, H. Takahashi, M. Nakashima,N. Sosonkina, C.K. Mapendano, M. Ghad-ami, M. Nomura, D.-S. Linag, D.-K. Kim, A.Garidkhuu, N. Natsume, T. Ohta, H. To-mita, K. Hirayama, M. Ishibashi, A. Taka-hashi, N. Saito, S. Saitou, Y. Nakamura, andN, Niikawa: A SNP in the ABCC11 gene isthe determinant of human earwax type. Na-ture Genetics, 38: 324-330, 2006

8. T. Yamabuki, Y. Daigo, T. Kato, S. Hayama,T. Tsunoda, M. Miyamoto, T. Ito, M. Fujita,M. Hosokawa, S. Kondo, and Y. Nakamura:Genome-wide gene expression profile analy-sis of esophageal squamous cell carcinomas.

152

Page 18: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

International Journal of Oncology, 28: 1375-1384, 2006

9. A. Iida, H. Kizawa, Y. Nakamura, and S.Ikegawa: High-resolution SNP map ofASPN, a susceptibility gene for osteoarthri-tis. Journal of Human Genetics, 51: 151-154,2006

10. F.P. Silva, R. Hamamoto, Y. Furukawa, andY. Nakamura: TIPUH1 encodes a novelKRAB zinc-finger protein highly expressedin human hepatocellular carcinomas. Onco-gene, 25: 5063-5070, 2006

11. K. Nakashima, T. Hirota, Y. Suzuki, A. Mat-suda, M. Akahoshi, M. Shimizu, A. Jodo, S.Doi, K. Fujita, M. Ebisawa, S. Yoshihara,T.Enomoto, T. Shirakawa, F. Kishi, Y. Naka-mura, and M. Tamari: Association of the RIP2 gene with childhood atopic asthma. Alleol-ogy International, 55: 77-83, 2006

12. K. Nakashima, T. Hirota, K. Obara, M.Shimizu, A. Jodo, M. Kameda, S. Doi, K. Fu-jita, T. Shirakawa, T. Enomoto, F. Kishi, S.Yoshihara, K. Matsumoto, H. Saito, Y.Suzuki, Y. Nakamura, and M. Tamari: Anassociation study of asthma and related phe-notypes with polymorphisms in negativeregulator molecules of the TLR signalingpathway. Journal of Human Genetics, 51:284-291, 2006

13. S. Ashida, M. Furihata, T. Katagiri, K. Ta-mura, Y. Anazawa, H. Yoshioka, T. Miki, T.Fujioka, T. Shuin, Y. Nakamura, and H.Nakagawa: Expression of novel molecules,MICAL2-PV (MICAL2 prostate cancer vari-ants), increases with high gleason score andprostate cancer progression. Clinical CancerResearch, 12: 2767-2773, 2006

14. N. Ishikawa, Y. Daigo, A. Takano, M. Tani-waki, T. Kato, S. Tanaka, W. Yasui, Y. Take-shima, K. Inai, H. Nishimura, E. Tsuchiya,N. Kohno, and Y. Nakamura: Characteriza-tion of SEZ6L2 cell-surface protein as anovel prognostic marker for lung cancer.Cancer Science, 97737-745, 2006

15. T. Kobayashi, T. Masaki, M. Sugiyama, Y.Atomi, Y. Furukawa, and Y. Nakamura: Agene encoding a family with sequence simi-larity 84, member A (FAM84A) enhancedmigration of human colon cancer cells. Inter-national Journal of Oncology, 29: 341-347,2006

16. M. Taniwaki, Y. Daigo, N. Ishikawa, A.Takano, T. Tsunoda, W. Yasui, K. Inai, N.Kohno, and Y. Nakamura: Gene expressionprofiles of small-cell lung cancers: molecularsignatures of lung cancer. International Jour-nal of Oncology, 29: 567-575, 2006

17. E. Hirota, L. Yan, T. Tsunoda, S. Ashida, M.Fujime, T. Shuin, T. Miki, Y. Nakamura and

T. Katagiri: Genome-wide gene expressionprofiles of clear cell renal cell carcinoma; In-ternational Journal of Oncology, 29: 799-827,2006

18. J.-H. Park, M.-L. Lin, T. Nishidate, Y. Naka-mura, and T. Katagiri: PDZ-binding kinase/T-LAK cell-originated protein kinase, a puta-tive cancer/testis antigen with an oncogenicactivity in breast cancer. Cancer Research,66: 9186-9195, 2006

19. K. Ozaki, H. Sato, A. Iida, H. Mizuno, T.Nakamura, Y. Miyamoto, A. Takahashi, T.Tsunoda, S. Ikegawa, N. Kamatani, M. Hori,Y. Nakamura, and T. Tanaka: A functionalSNP in PSMA6 confers risk of myocardialinfarction in the Japanese population. Na-ture Genetics, 38: 921-925, 2006

20. N. Jinawath, Y. Chamgramol, Y. Furukawa,K. Obama, T. Tsunoda, B. Sripa, C. Pairo-jkul, and Y. Nakamura: Comparison of gene-expression profiles between opisthorchis vi-verrini and non-opisthorchis viverrini associ-ated human intrahepatic cholangiocarci-noma. Hepatology, 44: 1025-1038, 2006

21. Y. Nakamura, M. Futamura, H. Kamino, K.Yoshida, Y. Nakamura, and H. Arakawa:Identification of p53-46F as a super p53 withan enhanced ability to induce p53-depen-dent apoptosis. Cancer Science, 97: 633-641,2006

22. A. Takehara, H. Eguchi, H. Ohigashi, O.Ishikawa, T. Kasugai, M. Hosokawa, T.Katagiri, Y. Nakamura, and H. Nakagawa:Novel tumor marker REG4 detected in se-rum of patients with resectable pancreaticcancer and feasibility for antibody therapytargeting REG4. Cancer Science, 97: 1191-1197, 2006

23. M. Iiizumi, H. Nakagawa, M. Hosokawa, A.Takehara, C. Su-Youn, T. Nakamura, T.Katagiri, H. Eguchi, H. Ohigashi, O.Ishikawa, Y. Nakamura, and H. Nakagawa:EphA4 receptor, overexpressed in pancreaticductal adenocarcinoma, promotes cancer cellgrowth. Cancer Science, 97: 1211-1216, 2006

24. K. Takahashi, C. Furukawa, A. Takano, N.Ishikawa, T. Kato, S. Hayama, C. Suzuki, W.Yasui, K. Inai, S. Sone, T. Ito, H. Nishimura,E. Tsuchiya, Y. Nakamura and Y. Daigo:The neuromedin u-growth hormone secre-tagogue receptor 1b/neurotensin receptor 1oncogenic signaling pathway as a therapeu-tic target for lung cancer. Cancer Research,66: 9408-9419, 2006

25. S. Hayama, Y. Daigo, T. Kato, N. Ishikawa,T. Yamabuki, M. Miyamoto, T. Ito, E.Tsuchiya, S. Kondo, and Y. Nakamura: Acti-vation of CDCA1-KNTC2, members of cen-tromere protein complex, involved in pul-

153

Page 19: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

monary carcinogenesis. Cancer Research, 66:10339-10348, 2006

26. K. Ura, K. Obama, S. Satoh, Y. Sakai, Y.Nakamura, and Y. Furukawa: EnhancedRASGEF1A expression is involved in thegrowth and migration of intrahepatic cho-langiocarcinoma. Clinical Cancer Research,12: 6611-6616, 2006

27. N. Ishii, K. Ozaki, H. Sato, H. Mizuno, S.Saito, A. Takahashi, Y. Miyamoto, S. Ike-gawa, N. Kamatani, M. Hori, S. Saito, Y.Nakamura and T. Tanaka: Identification of anovel non-coding RNA, MIAT, that confersrisk of myocardial Infarction. Journal of Hu-man Genetics, 51: 1087-1099, 2006

28. S. Mahasirimongkol, W. Chantratita, S.Promso, E. Pasomsab, N. Jinawath, W. Jong-jaroenprasert, V. Lulitanond, P. Kritta-yapoositpot, S. Tongsima, P. Sawanpany-alert, N. Kamatani, Y. Nakamura, T. Sura:

Similarity of the allele frequency and linkagedisequilibrium pattern of single nucleotidepolymorphisms in drug-related gene loci be-tween Thai and northern East Asian popula-tions: implications for tagging SNP selectionin Thais. Journal of Human Genetics, 51: 896-904, 2006

29. Y. Nakazaki, H. Hase, H. Inoue, Y. Beppu,M. Xin, G. Sakaguchi, R. Kurita, S. Asano, Y.Nakamura, and K. Tani: Serial analysis ofgene expression in progressing and regress-ing mouse tumor implicates the involvementof RANTES andTARC in antitumor immuneresponses. Molecular Therapy, 14: 599-606,2006

30. M. Kato, A. Sekine, Y. Ohnishi, T.A. John-son, T. Tanaka, Y. Nakamura and T. Tsu-noda: Linkage disequilibrium of evolutionar-ily conserved regions in the human genome.BMC Genomics, 7: 326, 2006

154

Page 20: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

The mission of our laboratory is to conduct computational (“in silico”) studies onthe functional aspects of genome information. Roughly speaking, genome informa-tion represents what kind of proteins/RNAs are synthesized on what conditions.Thus, our study includes the structural analysis of molecular function of each geneproduct as well as the analysis of its regulatory information, which will lead us tothe understanding of its cellular role represented by the networks of inter-gene in-teraction.

1. Conservation of regulation systems in fir-micutes

Nicolas Sierro and Kenta Nakai

Due to the probable co-regulation by a com-mon transcription factor of genes showing asimilar expression profile, investigation of theirpromoter regions is an important step towardsthe understanding of global cell regulation net-works. By coupling the current knowledgeabout experimentally proven transcriptional re-gulation with the currently available raw geneticdata, a better understanding of the similaritiesand differences between various species couldbe obtained. Most of the bacterial transcriptionalregulation data have however been obtained intwo specific organisms, Escherichia coli and Bacil-lus subtilis , and comparative genomics is there-fore necessary to evaluate to which extent theacquired knowledge is applicable to other bacte-rial species. Therefore, the annotated proteins of66 complete firmicutes genomes, including thatof B. subtilis, were compared to each other in or-der to build clusters of homologous proteinsand their upstream intergenic regions. Two ap-proaches were then used to analyze these up-stream intergenic regions: the mapping of

known B. subtilis transcription factor bindingsites on every genome using the position spe-cific weight matrices provided by DBTBS, andthe analysis of the upstream intergenic regionconservation for each cluster of homologousgenes.

To provide a comprehensive yet easy to un-derstand and interpret representation of theknown motif conservation pattern, new toolswere developed that can generate a graphic foreach cluster indicating on one side in whichstains homologous proteins are found, and onthe other side whether or not these proteins pos-sess a certain transcription factor binding site intheir upstream intergenic region. With thismethod, the existence of different regulation sys-tems for the CtsR and HrcA heat shock responseregulons within the firmicutes could for instancebe shown. Our data suggest for example thatthe mollicutes, which are characterized by avery small genome size and lack CtsR, haveplaced the regulation of genes typically regu-lated by CtsR under the control of HrcA, or thatin the Staphylococcus strains the HrcA regulonsis not only regulated by HrcA itself, but also byCtsR.

The analysis of the upstream intergenic regionconservation for each cluster of homologous

Human Genome Center

Laboratory of Functional Analysis In Silico機能解析イン・シリコ分野

Professor Kenta Nakai, Ph.D.Associate Professor Kengo Kinoshita, Ph.D.

教 授 理学博士 中 井 謙 太助教授 理学博士 木 下 賢 吾

155

Page 21: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

genes is also of interest because it is expectedthat regions involved in gene regulation aremore conserved than regions with no particularfunction. To investigate this conservation, eachcluster was therefore divided in subclustersbased both on the bacterial genus and the sizeof the intergenic regions and aligned by Clus-talW. The aligned sequences were used to calcu-late the sequence conservation and position spe-cific weight matrices generated for the con-served regions. By comparing these matrices toeach other, motifs corresponding to both knowntranscription factor binding sites and unknownconserved regions could be obtained; the analy-sis of the latter group being currently underway.

By carrying out comparative analysis of alarge number of related genomes, concentratingparticularly on the conservation of the promoterregions of homologous genes, and using the ex-perimental data available in literature, signifi-cant differences in the regulatory networks ofnot only a single strain, but of whole genuscould be highlighted. This approach will there-fore not only allow a refinement of the currentunderstanding of bacterial regulation networks,but also provide new input for experimental re-search by helping in the design of experimentsand the interpretation of their results.

2. Construction of a promoter model fortissue-specific expression in Ciona intesti-nalis

Alex Vandenbon, Takehiro Kusakabe1, andKenta Nakai: 1Graduate School of Life Science,University of Hyogo

Transcriptional regulation of gene expressionin eukaryotes is controlled by transcription fac-tors binding cis-regulatory elements in regula-tory sequences. Genes containing binding sitesfor the same set of transcription factors in theirnon-coding regions can be expected to be co-regulated at the transcriptional level. However,using merely the presence of these binding sitesto predict new target genes for tissue-specificexpression might result in high numbers of falsepositives, as the computational prediction ofthese binding sites has been shown to sufferfrom very bad specificity. To increase accuracyof the prediction of new target genes we con-structed a promoter model that does not onlytake into account the presence of cis-regulatoryelements, but also their order in the regulatorysequence and the distances between pairs of ele-ments. We trained this model on 5 sets of tissue-specifically expressed genes of C. intestinalis andused it to predict new target genes in a genome-wide set of promoter sequences. Blast results for

high-scoring genes indicate that this promotermodel might be useful for predicting promisingcandidates for wet-lab experiments.

3. Analysis of trans-splicing in Ciona intesti-nalis

Li Shuang, Riu Yamashita, Takehiro Kusak-abe1 and Kenta Nakai

Trans-splicing, in which the original 5’-ends ofsome pre-mRNAs are discarded and replaced bythe 5’-region of a certain SL RNA, has been re-ported in Ciona intestinalis. We analyzed thisphenomenon using its EST and genome se-quences. A conserved head sequence, which iscalled Spliced-Leader (SL), ATTCTATTTGAATAAG, and could not be mapped to the genomesequence, has been found in 419 of 2,077 5’-ESTs. Depending on the existence of this Spliced-Leader sequence, we classified the 5’-ESTs into2 groups: SL+ and SL-. Using the UniGenedatabase, the gene names of each SL+ and SL-groups were obtained. The fact that there wereonly small overlaps between these groups sug-gests that trans-splicing occurs on a specificgene set. By regarding the number of EST clonesfor each gene as its expression strength, wefound that the expression level of SL+ genes issignificantly high while that of SL- genes var-ies greatly. In addition, by regarding the num-ber of libraries where a gene’s EST(s) are ob-served as a degree of anti-tissue/developmentalstage specificity, we found that the gene expres-sion between SL+ and SL- groups shows asignificant difference at the developmental stageof juvenile. We further used the Gene Ontologyinformation of human homologous genes to an-notate corresponding Ciona genes. As a result,we found that a significant number of SL-genes seem to be related to ribosomal functionswhile mitochondrion-related genes are foundmore frequently in SL+ genes.

4. Updating the database of transcriptionalstart sites (DBTSS)

Riu Yamashita, Yutaka Suzuki2, Sumio Suga-no2, and Kenta Nakai: 2Graduate School ofFrontier Sciences

DBTSS (http://dbtss.hgc.jp) was constructedin 2002 based on experimentally-determined 5’-end clones. During this year, we have addedseveral features. First, the number of clones cor-responding to human genes has significantly in-creased, from 190,964 to 1,359,000. Second, wedefined putative promoter groups by clusteringTSSs within a 500 base range because the con-

156

Page 22: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

tent of our database is now large enough to ana-lyze the existence of multiple promoters for eachgene. If a gene has several putative TSS clusters,we regard them as the evidence of alternativepromoters. We found 8,308 human genes and4,276 mouse genes which have alternative pro-moters. Third, DBTSS now supports a functionthat enables detailed comparison between anypair of TSSs present in it. Finally, we haveadded TSS information of zebrafish (15,189TSSs: 32,263 clones), malaria (6,908 TSSs: 10,236clones), and schyzon (14,029 TSSs: 22,923clones).

5. Comprehensive detection of human termi-nal oligo-pyrimidine (TOP) genes andanalysis of their characteristics

Riu Yamashita, Yutaka Suzuki2, Nono Tomita-Takeuchi2, Hiroyuki Wakaguri2, Takuya Ueda2,Sumio Sugano2, and Kenta Nakai

It is known that several genes have a terminaloligo-pyrimidine sequence at the 5’-end of theirmRNA, and are hence called TOP genes. Thesegenes are also known to be transcriptionally ortranslationally regulated. But it is not knownhow many of these genes are present in the hu-man genome. We performed a detection of TOPgene candidates using accurate TSS informationprovided by DBTSS. By using a position specificweight matrix constructed from 48 known TOPgenes, we could detect 1,645 candidate TOPgenes from the 13,717 human genes in our data-base. These 1,645 genes were broadly expressedcompared with the rest of genes (p<e-200). 239of these 1,645 genes satisfied the same criteriafor TOP genes also in their mouse homologs.We experimentally validated 83 of the 239 can-didates, and found 41 (49%) of them are transla-tionally regulated. We also suggest that thistranslational regulation is affected by the lengthof mRNAs.

6. Comparative studies of alternative promot-ers of human and mouse genes

Katsuki Tsuritani3, Takuma Irie2, Riu Yama-shita, Yuta Sakakibara2, Hiroyuki Wakaguri2,Akinori Kanai2, Junko Mizushima-Sugano2, Su-mio Sugano2, Kenta Nakai, and YutakaSuzuki2: 3Taisho Pharmaceutical, Co. Ltd.

It gradually becomes clear that a large popu-lation of human genes are regulated by morethan one alternative promoters (APs). Here wereport large-scale comparative studies of puta-tive alternative promoters (PAPs) between hu-man and mouse counterpart genes. We classi-

fied the 17,245 putative promoter regions (PPRs)in 5,764 PAP-containing human genes into threecategories: “conserved”, “marginal” and “non-conserved” according to the results of sequencecomparison. The conserved PPRs are major andrich in CpG-island. In contrast, the non-conservedPPRs are minor and rich in repetitive elements.The latter also are similar to intergenic region inthe distribution of CG content, and they pro-duce more transcripts that encode small or noproteins than the conserved ones. Systematic lu-ciferase assays of these PPRs revealed that bothclasses of PPRs did have promoter activity, butthat their strength ranges were significantly dif-ferent. Furthermore, we demonstrated that thesecharacteristic features of the non-conservedPPRs are shared with the PPRs of previouslydiscovered putative non-protein coding tran-scripts. Taken together, our data suggest thatthere are two distinct classes of promoters inhumans, with the latter class of promotersemerging frequently during evolution.

7. Relationships between protein conserva-tion and promoter conservation revealedby comparative sequence analysis be-tween human and mouse

Hirokazu Chiba, Riu Yamashita, Kengo Ki-noshita, and Kenta Nakai

Comparative sequence analysis is a powerfultool to extract functional or evolutionary infor-mation from genomes of organisms. With theuse of complete genome sequences, there havebeen many studies comparing protein sequencesor promoter sequences to provide insights intogenomics. However, the relationships betweenprotein conservation and promoter conservationare poorly understood. In this study, we com-pared not only protein sequences but also pro-moter sequences for 6,901 human and mouseorthologous genes to address this issue. First,we carried out the comparison of promoter se-quences, and examined the relationship betweengene function and promoter conservation. Newfunctional categories with significant promoterconservation levels were identified in additionto the ones already reported previously. Next,the relationship between protein conservationand promoter conservation was examined. Thecorrelation between them was weak, suggestingthat protein sequences and promoter sequencesare under different kinds of evolutionary pres-sures. Specifically, the ‘ribosome’ categoryshows significantly low promoter conservation,in spite of high protein conservation; while in-versely, the ‘extracellular matrix’ category showssignificantly high promoter conservation, in

157

Page 23: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

spite of low protein conservation.

8. MLV integration sites analysis usingDBTSS

Yoshiaki Tanaka, Riu Yamashita, and KentaNakai

Historically, it was thought that the integra-tion sites of retroviruses were at random. Re-cently some researchers compared integrationsites with the NCBI reference sequences (RefSeq)database, and reported Murine Leukemia Virus(MLV) is inserted preferentially close to tran-scription start sites (TSSs). However, these re-sults rely on the TSS information in RefSeq, andit is known that many RefSeq genes do not havean accurate 5’ end. To obtain precise correlationbetween MLV integration sites and TSSs, wecompared MLV integration sites with the Data-base of Transcription Start Sites (DBTSS), whichcovers precise TSSs. From this result, we couldobtain a clear peak in TSS±2kb, which was nar-rower than previous reports. The integration fre-quency in TSS±5kb had a higher percentagethan previous reports. Now we are analyzingthe features of MLV integration sites using thisresult.

9. Computational analysis of microRNA rec-ognition sites

Keishin Nishida, Riu Yamashita, Kengo Ki-noshita, and Kenta Nakai

microRNAs (miRNAs) are ~22-nucleotide-long RNAs responsible for posttranscriptionalregulation of genes by pairing with mRNA. It isknown that the miRNA 5’-terminal region rec-ognition of the mRNA 3’-untranslated regionleads to translation inhibition or mRNA cleav-age. This miRNA region is called “seed”. How-ever, although recent research increases the seedimportance, concrete seed position and lengthare not defined. Our research detects the detailof seed position and length from an experimen-tally supported miRNA-target dataset. Thosedata indicate seed tendencies for translationdownregulation and mRNA cleavage are differ-ent. We apply this analysis to the public mi-croarray dataset. Two datasets indicate clearseeds. Many datasets do not have clear seeds,indicating a dependency on the quality of themicroarray.

10. ATTED-II: a database of co-expressedgenes and cis-elements for identifying co-regulated gene groups in Arabidopsis

Takeshi Obayashi, Kengo Kinoshita, KentaNakai, Masayuki Shibaoka4, Shinpei Hayashi4,Motoshi Saeki4, Daisuke Shibata5, KazukiSaito6, and Hiroyuki Ohta4: 4Tokyo Institute ofTechnology, 5Kazusa DNA Research Institute, 6

Chiba University

Publicly available database of co-expressedgene sets would be a valuable tool for a widevariety of experimental designs. We constructedan Arabidopsis thaliana trans-factor and cis-element prediction database (ATTED-II) thatprovides co-regulated gene relationships basedon co-expressed genes deduced from microarraydata and the predicted cis-elements. ATTED-II(http://atted.hgc.jp) includes the following fea-tures: (i) lists and networks of co-expressedgenes calculated from 58 publicly available ex-perimental series (1,388 GeneChip data) in A.thaliana; (ii) prediction of cis-regulatory elementsto predict co-regulated genes amongst the co-expressed genes; and (iii) visual representationof expression patterns for individual genes.ATTED-II can thus help researchers to clarifythe function and regulation of particular genesand gene networks.

11. COXPRES: co-expressed gene databasefor mouse and human

Takeshi Obayashi and Kengo Kinoshita

The number of publicly available gene expres-sion data is abundant in mouse and human, andis ten or twenty times larger than that for Arabi-dopsis. However there is no database of co-expressed genes such as ATTED-II, although theinformation is very valuable to predict genefunction. We are thus constructing a new data-base named COXPRES (co-expression) for co-expressed genes in mouse and human fromsuch publicly available gene expression data.The information of gene co-expression is calcu-lated from thousands of oligonucleotide mi-croarray (GeneChip) data and then representedas gene lists and gene networks. This informa-tion of gene co-expression will widely promoteexperimental researches on mouse and human.

12. Assessing gene similarity with their co-expression and its use in prediction ofgene function

Takeshi Obayashi, Atsushi Takabayashi7 ,Noriko Ishikawa7, Fumihiko Sato7, and Kengo

158

Page 24: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

Kinoshita: 7Kyoto University

The information of gene co-expression is valu-able to predict gene functions. The performanceof gene prediction depends on the quality of thegene co-expression data and the method of pre-diction. The quality of gene co-expression datadepends on not only the quality and quantity ofthe original data but also on the normalizingmethod and the method used to calculate corre-lation coefficient between gene expression pat-terns. To compare these methods, we developeda new system to automatically predict genefunction from its co-expression data. Using thissystem, we assessed the methods to constructgene co-expression data. In addition to compu-tational prediction, the methods are comparedby large-scale gene disruption.

13. Analyses of homo-oligomer interfaces ofproteins from the complementarity of mo-lecular surface, electrostatic potential andhydrophobicity

Yuko Tsuchiya, Kengo Kinoshita, and HarukiNakamura8: 8Osaka University

To extract the structural interacting patternsbetween proteins, we developed a method to es-timate the complementarities for hydrophobic-ity, electrostatic potential on the molecular sur-faces and shape of the surfaces in protein-protein interfaces. We have found that the homo-oligomer interfaces can be classified into fivegroups according to the structure of the inter-face; cyclic-oligomer, twisted-dimer, dimer-parallel , dimer-perpendicular and dimer-circular. As the results of correlation analysesbetween the shape classification and the prop-erty complementary, new characteristic trendsas the possible necessary conditions of protein-protein interactions are emerging.

14. PreBI: prediction of biological interfacesof proteins in crystals

Yuko Tsuchiya, Kengo Kinoshita, NobutoshiIto9, and Haruki Nakamura8: 9Tokyo Medicaland Dental University

PreBI is a WWW server for predicting biologi-cal interfaces in protein crystal structures ac-cording to the complementarities of the electro-static potential, hydrophobicity and shape of theinterfaces, along with the area of the interfaces.The results can be checked through our interac-tive viewer. (http://pre-s.protein.osaka-u.ac.jp/¯prebi/)

15. Probabilistic alignment detects remotehomology in a pair of protein sequenceswithout homologous sequence information

Ryotaro Koike10, Kengo Kinoshita, and AkinoriKidera10: 10Yokohama City University

Dynamic programming algorithm and its heu-ristics are the fundamental methods for similar-ity searches of biological sequences. Includingadditional information, such as homologous se-quences in the profile comparison, has refinedthe detection power of sequence comparison.We described a new approach, probabilisticalignment (PA), which gives improved detectionpower using only a pair of amino acid se-quences. Receiver operating characteristic (ROC)analysis showed that the PA method is far supe-rior to BLAST, and that its sensitivity and selec-tivity approach to those of PSI-BLAST. Particu-larly for orphan proteins, PA exhibits much bet-ter performance than PSI-BLAST.

16. Prediction of disordered region and terti-ary structure of proteins from its aminoacid sequence

Takashi Ishida and Kengo Kinoshita

Identification of protein intrinsically disor-dered or unstructured regions is useful for pro-tein structure determination and protein tertiarystructure prediction. We developed a method topredict protein-disordered regions from itsamino acid sequence. We achieved high predic-tion accuracy by combining the prediction fromlocal sequence information and the predictionfrom global sequence alignment. At the sametime, prediction of protein tertiary structurefrom amino acid sequence without the informa-tion of structure of homologues is still a bigproblem. We developed a new potential func-tion based on contact number prediction toevaluate the matching between an amino acidsequence and a tertiary structure, and appliedthis potential to protein tertiary structure predic-tion.

17. Int-surf: a database for interacting sitesfor protein-protein interaction

Miho Higurashi, Takashi Ishida, and KengoKinoshita

Int-surf is a database of interacting sites ofproteins. For all PDB entries, interacting siteswith other proteins, small molecules and DNAare calculated from the atomic coordinates inPDB. The information of the interacting sites of

159

Page 25: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

homologous proteins is mapped onto each pro-tein structure, so users can see possible interac-tion sites of the considering protein, when itscomplex structure is not available, if the com-plex structures of close homologues are avail-able. Using jV version 3, an interactive 3Dviewer program, Int-surf allow users to observethe interactions with interactive visualization.

18. Molecular dynamics study of lipid mem-branes containing cholesterol

Naoya Fujita, Takashi Ishida, and Kengo Ki-noshita

Cholesterol, as a component of raft structureswhich are important in signal transduction, pro-tein transport, etc., is a fundamental molecule inmost eukaryotic cell membranes. A Moleculardynamics simulation was applied to dipalmi-toylphosphatidylcholine (DPPC) with severalconcentrations of cholesterol. Comparing a purephospholipid membrane and a cholesterol-containing one suggests that sterol modifies themembrane to a higher order phase. These mixedbilayers could be useful as surrounding systemsfor eukaryotic membrane proteins.

Publications

Horton, P., Park, K.-J., Obayashi, T., and Nakai,K. Protein subcellular localization predictionwith WoLF PSORT, In Proc. 4th Asia-PacificBIOINFORMATICS Conference (APBC2006)Edited by Jiang, T. et al. (Imperial CollegePress), pp. 39-48, 2006.

Kimura, K., Watanabe, A., Suzuki, Y., Ota, T.,Nishikawa, T., Yamashita, R., Yamamoto, J.,Sekine, M., Tsuritani, K., Ishii, S., Sugiyama,T., Saito, K., Isono, Y., Irie, R., Kushida, N.,Yoneyama, T., Otsuka, R., Kanda, K., Yokoi,T., Kondo, H., Wagatsuma, M., Murakawa, K.,Ishida, S., Ishibashi, T., Takahashi-Fujii, A.,Tanase, T., Nagai, K., Kikuchi, H., Nakai, K.,Isogai, T., and Sugano, S. Diversification oftranscriptional modulation: large-scale identi-fication and characterization of putative alter-native promoters of human genes, GenomeRes., 16: 55-65, 2006.

Sierro, N., Kusakabe, T., Park, K.-J., Yamashita,R., Kinoshita, K., and Nakai, K. DBTGR: a da-tabase of tunicate promoters and their regula-tory elements, Nucl. Acids Res., 34: D552-D555, 2006.

Yamashita, R., Suzuki, Y., Wakaguri, H., Tsuri-tani, K., Nakai, K., and Sugano, S. DBTSS: Da-tabase of Human Transcription Start Sites,Progress Report 2006, Nucl. Acids Res., 34: D86-D89, 2006.

Cheong, J., Yamada, Y., Yamashita, R., Irie, T.,Kanai, A., Wakaguri, H., Nakai, K., Ito, T.,Saito, I., Sugano, S., and Suzuki, Y. Diverse

DNA methylation status at alternative pro-moters of human genes in various normal tis-sues, DNA Res., 13(14): 155-167, 2006.

Tsuchiya, Y., Kinoshita, K., and Nakamura, H.Analysis of homo-oligomer interfaces of pro-teins from the complementarity of molecularsurface, electrostatic potential and hydropho-bicity, Protein Eng Des Sel., 19: 421-429, 2006.

Tsuchiya, Y., Kinoshita, K., Ito, N., and Naka-mura, N., PreBI: Prediction of biological inter-faces of proteins in crystals, Nucl Acid Res,34: W320-324, 2006.

Kinoshita, K., Kusunoki, M., and Miyai, K.Analysis of the three-dimensional structure ofthe “CXGXC” motif in the CMGCC andCAGYC regions of alpha-and beta-subunits ofhuman chorionic gonadotropin: Importance ofGlycine residue in the motif, Endocrine J., 53,51-38, 2006.

Ishida, T., Nakamura, S., and Shimizu, K. Poten-tial for assessing quality of protein structurebased on contact number prediction. Proteins64: 940-947, 2006.中井謙太,生物学の未来を考える,蛋白質核酸酵素 51(12):1704―1707,2006.中井謙太,第8章:バイオインフォマティクス~ホモロジー解析からシステム生物学まで~,遺伝子工学集中マスター,山本雅・仙波憲太郎編集,羊土社,pp.102―111,2006.木下賢吾,eF-siteを利用した機能部位の予測,バイオテクノロジージャーナル,6:444―447,2006.

160

Page 26: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

The main projects of our laboratory are to discover new biological meaning at mo-lecular level from vast array of biological data by various statistical approaches,and to train the researchers for the right use of statistical techniques. The sub-jects under investigation are as follows: Extraction of useful information from time-course gene expression data by using time-series models, development of pa-rameter estimation methods for bio-simulation models with statistical approaches,and building bio-pathway models with wide variety of background knowledge.

1. Extraction of Useful Information from TimeCourse Gene Expression Data with Statis-tical Time Series Models

a. State-space Approach with the MaximumLikelihood Principle to Identify the SystemGenerating Time-course Gene ExpressionData of Yeast

Rui Yamaguchi and Tomoyuki Higuchi1: 1Insti-tute of Statistical Mathematics

We use linear Gaussian state-space models toanalyze time-course gene expression data. Theyare modeled to be generated from hidden statevariable in a system. To identify the system, weestimate parameters of the model by EM algo-rithm and determine dimension of the statevariable by BIC. We apply this method to apublished cell-cycle gene expression data ofyeast (Saccharomyces cerevisiae). The determineddimension of the state variable are comparedwith those reported in other papers, which wereobtained by other parameter estimation methodsand criteria.

b. Finding Module-Based Gene Networkswith State-Space Models

Rui Yamaguchi, Ryo Yoshida2, Seiya Imoto2,Tomoyuki Higuchi2, Satoru Miyano2: 2HumanGenome Center, Laboratory of DNA Informa-tion Analysis

We discuss the use of the state space modelsto analyze time-course microarray gene expres-sion data. Typical features of time-course mi-croarray data are short time-course and high-dimensional observational vector. By these as-pects, conventional statistical models such as themultivariate autoregressive models lead to un-suitable results due to the overfitting. The statespace models have the potential to overcomethis problem by the dimension reduction proc-ess. This study provides (1) a survey of the statespace models, (2) a review of some existing re-searches using the state space models for time-course microarray data together with a biologi-cal meaning of the model, (3) a solution for thelack of parameter identifiability, and (4) identifi-cation of biological system that is a central topic

Human Genome Center

Laboratory of Biostatistics(Biostatistics Training Unit)バイオスタティスティクス人材養成ユニット

Project Lecturer Rui Yamaguchi, Ph.D.Project Research Associate Atsushi Doi, Ph.D.

特任講師 博士(理学) 山 口 類特任助手 博士(理学) 土 井 淳

161

Page 27: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

in bioinformatics. Finally, we show the useful-ness of the state space models for time-coursemicroarray data by the analysis of Saccharomycescerevisiae cell cycle gene expression data. As aresult, we estimated the number of the genemodules, and a gene-module network.

2. Development of Parameter EstimationMethods for Bio-Simulation Models withStatistical Approaches

a. Genomic Data Assimilation for EstimatingHybrid Functional Petri Net from Time-Course Gene Expression Data

Masao Nagasaki2 , Rui Yamaguchi , RyoYoshida2, Seiya Imoto2, Atsushi Doi, YoshinoriTamada1, Hiroshi Matsuno3, Satoru Miyano2,and Tomoyuki Higuchi1: 3Fuculty of Science,Yamaguchi University

For simulation models of biological pathways,in most of cases, parameters to govern them aretuned by experts empirically at the present time.In this study we take a novel approach for thisarea, in order to estimate these parameters andto select the best model from several candidateones by using observed data. The approach iscalled the data assimilation (DA) of which theconcept is to incorporate information from ob-served data into a simulation model. We can ex-pect to obtaine more plausible results from thesimulation model by this approach. From thepoint of view of the statistical modeling, DA canbe realized by solving an inverse problem to es-timate unknown state and parameters of a simu-lation model by using observed data. To formu-late the problem, we use a statistical time seriesmodel which is called a nonlinear state spacemodel (SSM). Using an SSM, we can employ ef-fective statistical methods to estimate parame-ters, i.e. a particle filter, which is based onMonte Carlo simulation. In order to examine theapplicability of the approach for biological simu-lation models, the methods was applied to amodel of circadian rhythm represented by a hy-brid functional Petri net (HFPN) with a synthe-sized data.

3. Building Bio-Pathway Models with Back-ground Knowledge

a. Simulation-Based Validation of the p53Transcriptional Activity with Hybrid Func-tional Petri Net

Atsushi Doi, Masao Nagasaki2, Hiroshi Mat-suno3, and Satoru Miyano2

MDM2 and p19ARF are essential proteins incancer pathways forming a complex with pro-tein p53 to control the transcriptional activity ofprotein p53. It is confirmed that protein p53loses its transcriptional activity by forming thefunctional dimer with protein MDM2. However,it is still unclear that protein p53 keeps its tran-scriptional activity when it forms the trimerwith proteins MDM2 and p19ARF. We have ob-served mutual behaviors among genes p53,MDM2, p19ARF and their products on a com-putational model with hybrid functional Petrinet (HFPN) which is constructed based on infor-mation described in the literature. The simula-tion results suggested that protein p53 shouldhave the transcriptional activity in the forms ofthe trimer of proteins p53, MDM2, and p19ARF.This paper also discusses the advantages ofHFPN based modeling method in terms of path-way description for simulations.

b. A Combined Pathway to Simulate CDK-Dependent Phosphorylation and ARF-Dependent Stabilization for p53 Transcrip-tional Activity

Atsushi Doi, Masao Nagasaki2, Hiroshi Mat-suno3, and Satoru Miyano2

The protein p53 is phosphorylated by a mem-ber of protein kinases such as CDK7, and stabi-lized by the protein ARF. The phosphorylationand stabilization of p53 is believed to enhanceits transcriptional activity and act simultane-ously. Biological pathways composed of expertsknowledge obtained from the literature are in-cluding these activation mechanisms. However,the map of biological pathways does not reflectthe combination effect of phosphorylation andstabilization. We have conducted some simula-tions of biological pathways with hybrid func-tional Petri net (HFPN) after careful reading ofpapers. In this paper, we constructed the HFPNbased biological pathway of CDK-dependentphosphorylation pathway and combine withARF-dependent pathway described previously,to observe the effect of the phosphorylation onthe stabilization with simulation-based valida-tion.

162

Page 28: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

Publications

Doi, A., Nagasaki, M., Matsuno, H., and Miya-no, S., Simulation based validation of the p53transcriptional activity with hybrid functionalPetri net, In Silico Biol., 6(0001), 2006.<http://www.bioinfo.de/isb/2006/06/0001>.

Doi, A., Nagasaki, M., Matsuno, H., and Miya-no, S., A combined pathway to simulate CDK-dependent phosphorylation and ARF-depen-dent stabilization for p53 transcriptional activ-ity, Genome inform., 17(1): 112-123, 2006.

Nagasaki, M., Yamaguchi, R., Yoshida, R.,Imoto, S., Doi, A., Tamada, Y., Matsuno, H.,

Miyano, S., and Higuchi, T., Genomic DataAssimilation for Estimating Hybrid FunctionalPetri Net from Time-Course Gene ExpressionData, Genome Informatics (IBSB2006), 17(1),46-61, 2006.

Yamaguchi, R., and Higuchi, T., State-space Ap-proach with the Maximum Likelihood Princi-ple to Identify the System Generating Time-Course Gene Expression Date of Yeast, Inter-national Journal of Data Mining and Bioinfor-matics, 1(1): 77-87, 2006.

163

Page 29: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

Department of Public Policy has launched since September 2007 as a new andthe first social scientific section on medical sciences featuring genomics. We valueconducting empirical studies in many fields of social science for future policy mak-ing. We work for three major missions; public policy studies on translational re-search, its application to healthcare and its impact on social security, practical ad-vices and survey for research projects to build public trust, and “minority-centered”scientific communication.

1. Public policy studies on translational re-search, its application to healthcare andits impact on social security

Exploring the human genome give us lots ofcritical keys to approach human health and dis-ease and to develop new diagnostic tools andpreventive methods. However, ethical, legal andsocial aspects have been featured since thelaunch of the Human Genome Project. Variousethical concerns have been raised concerningthese scientific quest as well as expectation forchange. Ethical discussions, however, tend to gobefore reviewing empirical evidences and facts.Our major mission is to conduct empiricalanalysis by interdisciplinary approach for futurepublic discussions. We collaborate with expertsof several fields of social sciences; e.g. sociology,anthropology, psychology, disability studies,economics, finance, policy science, law, market-ing, STS (science, technology and society) and soon. One of the urgent issues should be an em-pirical analysis of social and financial aspects ofpersonalized medicine. We’re examining socialand financial validity and draw regulatoryframework of clinical introduction of drug sensi-tivity genetic testing, by estimating saving costsand impact on healthcare insurance system.Also we’re planning to conduct surveys towards

research participants and candidates to evaluatethe consent process of ongoing genomic studiesto suggest future change or correction of ethicalguidelines for genomic research.

2. Practical advices and survey for researchprojects to build public trust

Medical scientists sometimes feel anxietyabout protection of human participants and ethi-cal issues. We help them to picture the ethically-valid roadmap towards their goals and suggestregulatory scheme for social responsibility ofscience, which includes creating consent formsand brochures with high readability, consideringprocedures of informing participants and obtain-ing consents, considering ethical balance of thewhole protocol before applying ethical reviewboards, and public disclosure throughout the re-search process.

3. “Minority-centered” scientific communica-tion

Japanese government emphasizes the impor-tance of promoting communication between re-searchers, engineers and society in Science andTechnology Basic Plan (2006). In a word “soci-ety”, we have to imagine various people with

Human Genome Center

Department of Public Policy公共政策研究分野

Associate Professor Kaori Muto, Ph.D. 助教授 武 藤 香 織

164

Page 30: Human Genome Center Laboratory of Genome Database ...formation of non-coding RNAs predicted using Rfam database is also provided to fill the anno-tation of the intergenic regions of

diverse values. We will organize a range of pub-lic engagement projects and science participationevents and particularly we focus on “minority-centered” scientific communication, empoweredby disability studies and gay/lesbian studies.We feature perspectives from people with chro-nic disorders, the disabled/challenged, LGBT/queers and ethnic minorities. These people arenot exactly “minority” in our society, but theirclaims towards science are not easy to be heard.

We conduct surveys and suggest policy agendahow science and policy works throughout thesepractices.

Collaborating with modern artists, we thinkfuture of science and society. One of our relatedprojects is “Delivery by Male Project” (2004-)with Hiroko Okada, which features ethical andsocial issues of male pregnancy using assistedreproductive technologies.

165