Page 1

NETTAB 2005, Naples

Retrieving factual data and documents using IMGT-ML in the IMGT information system®

Denys Chaume, CNRS/IGH

Page 2

Table of Contents

• The IMGT information system®
• IMGT-ONTOLOGY
• IMGT-ML
• IMGT-ML, a query language
• Use examples :
  – Database queries (LIGM-DB)
  – Tool queries (Junction Analysis)
  – Document queries (SEFID)
• Conclusion

Page 3

Part 1 summary

• IMGT, the international ImMunoGeneTics information system®
• Relations between subsystems
• IMGT-ONTOLOGY
• IMGT-ML Schemas
• IMGT-ML : seqData
• IMGT-ML Architecture

Page 4

IMGT, the international ImMunoGeneTics information system®

• A high-quality integrated knowledge resource specialized in the immunoglobulins, T cell receptors, major histocompatibility complex and related proteins of the immune system of human and other vertebrates.
• Created in 1989 by M.-P. Lefranc (CNRS/UM II)
• On the Web since 1995
• 90,000 sequences, 110 species
• The international reference for immunogenetics

Page 5

IMGT information system®

6 databases
• IMGT/LIGM-DB
• IMGT/GENE-DB
• IMGT/PRIMER-DB
• IMGT/PROTEIN-DB
• IMGT/3Dstructure-DB
• IMGT/MHC-DB

10 processing tools
• IMGT/V-QUEST
• IMGT/JunctionAnalysis
• IMGT/GeneView
• IMGT/GeneSearch
• IMGT/LocusView
• IMGT/Allele-Align
• IMGT/PhyloGene
• IMGT/StructuralQuery
• IMGT/GeneFrequency
• IMGT/GeneInfo

~8000 HTML documents
• IMGT Scientific Chart
• IMGT Repertoire
• IMGT Index
• IMGT Bloc-notes

Page 6

Relations between subsystems

[Diagram: the IMGT databases and tools drawn as nodes linked by labelled relations. The links themselves are not recoverable from the transcript.]

Nodes : IMGT/LIGM-DB, IMGT/GENE-DB, IMGT/PRIMER-DB, IMGT/PROTEIN-DB, IMGT/3Dstructure-DB, IMGT/V-QUEST, IMGT/JunctionAnalysis, IMGT/GeneView, IMGT/GeneSearch, IMGT/LocusView, IMGT/Allele-Align, IMGT/PhyloGene, IMGT/StructuralQuery, IMGT/GeneFrequency

Relation labels : primer/sequences; genes/reference sequences; sequences/mutations; mutation/genes; genes/phylogeny; genes/3D structures; 3D structures query; protein/3D structures; nucleotide/protein sequences; gene localization (twice); sequences/specificity; gene and allele identification (twice); junction analysis

Page 7

IMGT-ONTOLOGY

• Identification : Characteristics
• Description : Annotation
• Classification : Genome
• Obtention : Origin, methodology
• Numerotation : IMGT unique numbering

Page 8

IMGT-ML Schemas

[Diagram: the IMGT-ML schema packages and the external schemas. Labels grouped by package as far as the transcript allows.]

• IMGTOntology (Identification, Description, Classification, Obtention, Numerotation) : defines simpleTypes and complexTypes
• IMGTData (imgt, knowledge, seqData) : defines elements using types defined by IMGTOntology schemas
• IMGTQuery (querySeqData, queryKnowledge, domain, responseTemplate) : defines complementary elements to formulate queries
• External schemas (biological & structural)

Namespace : http://www.imgt.org/IMGT-ML

Page 9

IMGT-ML : seqData

[Diagram: content model of the seqData element. Labels as extracted; "?" marks an optional element, "+" a repeatable one, "req@" a required attribute and "opt@" an optional attribute. Which element carries which attribute group is not recoverable from the transcript.]

Elements : seqData, catalogue, simpleCatalogEntry, classification, identification, annotation, sequence, references, extRef, Literature, keywords, Numerotation (twice)

Attribute groups, in extraction order :
• opt@moddate, req@credate, req@id
• opt@name, req@numacc, req@id
• opt@refid, req@reltype, opt@secid, req@reftype, req@dbid
• req@seqid, opt@complement, req@seqlen

Page 10

IMGT-ML Architecture

[Diagram labels, in extraction order]
• XML Schema (simpleTypes & complexTypes)
• IMGT-ONTOLOGY
• Documentation
• Biological expertise
• Data Modeling
• Controlled vocabulary (nomenclature)
• Distribution format
• Data consistency
• Web service interactions
• XSLT

Page 11

Part 2 summary

• IMGT-ML : a database query language
  – Why it works
• IMGTQuery package
• Examples
  – IMGT/LIGM-DB query
  – Junction Analysis tool

Page 12

A very simple database

Data are vectors : (author, title, year)

                  author                                title                  year
primitive types   char 10                               varchar 255            int
user types        starts with an uppercase character    printable characters   >1970
actual data       Chaume                                IMGT, the internat….   1997

Page 13

Data definition domains

[Figure: three labelled domains]
• Primitive domain
• User domain (rules)
• Actual domain (data)

Page 14

Actual data XML representation

<litRefList>
  <litRef>
    <author name="Chaume" />
    <title>IMGT, the internat…. </title>
    <date year="1997" />
  </litRef>
  …
</litRefList>
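
The earlier slide describes the data as vectors (author, title, year), and this slide gives their XML form. As a minimal sketch of that correspondence, assuming Python's standard `xml.etree.ElementTree` module (the talk does not name any parsing library), the vectors can be read back from the XML. The names `LIT_REF_XML` and `to_vectors` are illustrative only.

```python
import xml.etree.ElementTree as ET

# The single record shown on the slide (the title is truncated in the source).
LIT_REF_XML = """
<litRefList>
  <litRef>
    <author name="Chaume" />
    <title>IMGT, the internat….</title>
    <date year="1997" />
  </litRef>
</litRefList>
"""


def to_vectors(xml_text):
    """Return one (author, title, year) tuple per litRef element."""
    root = ET.fromstring(xml_text.strip())
    vectors = []
    for lit_ref in root.findall("litRef"):
        author = lit_ref.find("author").get("name")
        title = lit_ref.findtext("title").strip()
        year = int(lit_ref.find("date").get("year"))
        vectors.append((author, title, year))
    return vectors


if __name__ == "__main__":
    for vector in to_vectors(LIT_REF_XML):
        print(vector)
```

Each tuple is one point of the "actual domain" discussed on the following slides.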

Page 15

XML schema

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" >
  <xs:element name="litRefList">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="litRef" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="litRef">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="author" />
        <xs:element ref="title" />
        <xs:element ref="date" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  .....

Page 16

XML schema 1

  .....
  <xs:element name="author">
    <xs:complexType>
      <xs:attribute name="name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string" />
  <xs:element name="date">
    <xs:complexType>
      <xs:attribute name="year" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  .....

Page 17

Schema 1 definition domain

[Figure label: Schema 1 domain]

Page 18

XML schema 2

  .....
  <xs:element name="author">
    <xs:complexType>
      <xs:attribute name="name" >
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:maxLength value="10" />
            <xs:pattern value="[A-Z][a-z]+"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="title">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="255" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="date">
    <xs:complexType>
      <xs:attribute name="year" >
        <xs:simpleType>
          <xs:restriction base="xs:integer">
            <xs:minInclusive value="1970"/>
            <xs:maxExclusive value="10000"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  .....
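
The restrictions in schema 2 express the "user types" of the simple database: an author name of at most 10 characters starting with an uppercase letter, a title of at most 255 characters, and a year from 1970 up to (but excluding) 10000. The sketch below restates those rules in plain Python, only to make the user domain concrete; it is not a schema validator, and the function name `in_user_domain` is an assumption. XML Schema patterns always apply to the whole value, which `re.fullmatch` mirrors.

```python
import re

# XSD patterns are implicitly anchored at both ends, hence fullmatch below.
AUTHOR_PATTERN = re.compile(r"[A-Z][a-z]+")


def in_user_domain(author, title, year):
    """True when the (author, title, year) vector satisfies the schema 2 rules."""
    author_ok = len(author) <= 10 and AUTHOR_PATTERN.fullmatch(author) is not None
    title_ok = len(title) <= 255
    year_ok = 1970 <= year < 10000
    return author_ok and title_ok and year_ok


if __name__ == "__main__":
    print(in_user_domain("Chaume", "IMGT, the internat….", 1997))
    print(in_user_domain("chaume", "IMGT, the internat….", 1997))
```

A vector that passes these checks lies inside the user domain; the actual data are a subset of it.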

Page 19

Schema 2 definition domains

[Figure label: Schema 2 domain]

Page 20

XML instance and instance domain

<litRefList>
  <litRef>
    <author name="Chaume" />
  </litRef>
</litRefList>

  .....
  <xs:element name="author">
    <xs:complexType>
      <xs:attribute name="name">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="Chaume" />
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string" />
  <xs:element name="date">
    <xs:complexType>
      <xs:attribute name="year" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  .....

Page 21

Instance and query result domains

The result of the query is the intersection of the "instance domain" and the "actual domain".

[Figure label: Instance domain]
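
One way to read this statement: a partial XML instance fixes some values and leaves the rest free, and the query result is the set of stored records that agree with everything the instance fixes. The following is a minimal sketch of that idea, not the IMGT implementation. The second record in `DATA` is invented for illustration, and the names `matches` and `run_query` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Actual domain: the first record comes from the slides, the second is invented.
DATA = """
<litRefList>
  <litRef>
    <author name="Chaume" />
    <title>IMGT, the internat….</title>
    <date year="1997" />
  </litRef>
  <litRef>
    <author name="Doe" />
    <title>A made-up title</title>
    <date year="2001" />
  </litRef>
</litRefList>
"""

# Instance domain: the partial instance shown on the previous slide.
QUERY = """
<litRefList>
  <litRef>
    <author name="Chaume" />
  </litRef>
</litRefList>
"""


def matches(template, candidate):
    """True if candidate agrees with every tag, attribute, text and child
    that the template specifies; anything the template omits is free."""
    if template.tag != candidate.tag:
        return False
    for name, value in template.attrib.items():
        if candidate.get(name) != value:
            return False
    wanted_text = (template.text or "").strip()
    if wanted_text and wanted_text != (candidate.text or "").strip():
        return False
    return all(
        any(matches(t_child, c_child) for c_child in candidate)
        for t_child in template
    )


def run_query(query_xml, data_xml):
    """Return the litRef records lying in both the instance and actual domains."""
    query_ref = ET.fromstring(query_xml.strip()).find("litRef")
    data = ET.fromstring(data_xml.strip())
    return [ref for ref in data.findall("litRef") if matches(query_ref, ref)]


if __name__ == "__main__":
    for ref in run_query(QUERY, DATA):
        print(ref.find("author").get("name"), ref.find("date").get("year"))
```

An empty `<litRef/>` template fixes nothing, so it would return every record.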

Page 22

IMGTQuery package

[Diagram of the package elements; grouping reconstructed from the labels]
• querySeqData : contains seqData (+)
• queryKnowledge : @data is one of labelList, keywordList, functionalityList, chainTypeList, specificityList, moleculeTypeList, configurationList
• domain : allowed in any seqData sub-element; attributes @complement (boolean) and @of (@xxxx or text()); facets enumeration, minInclusive, maxInclusive, pattern, minExclusive, maxExclusive
• responseTemplate : contains seqData

Namespace : http://www.imgt.org/IMGT-ML/IMGTQuery

Page 23

Part 3 summary

• Example of IMGT/LIGM-DB queries
• Request from Identification concept
• AND operator (request)
• AND operator (result)
• OR operator
• Request with domain restriction
• Result control

Page 24

Example of IMGT/LIGM-DB queries

Request :

<q:querySeqData>
  <seqData>
    <simpleCatalogEntry numacc="M26678" />
  </seqData>
</q:querySeqData>

Result :

<seqDataList>
  <seqData>
    <catalogEntry id="M26678" >
      <simpleCatalogEntry numacc="M26678" name="MMIGKZZZ"/>
    </catalogEntry>
    <identification>
      <partIdent moleculeType="DNA" configuration="rearranged">
        <taxon taxonName="Mus musculus"/>
      </partIdent>
    </identification>
    <classification>
      <group name="IGKV">
        <subgroup name="IGKV8"/>
      </group>
    </classification>
    ….
  </seqData>
</seqDataList>

Page 25

Request from Identification concept

Request :

<q:querySeqData>
  <seqData>
    <partIdent chainType="Ig-Light">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
</q:querySeqData>

Result :

<seqDataList>
  <seqData id="AF001788" >
    <catalogue >
      ….
    </catalogue>
    <identification>
      <partIdent chainType="Ig-Light">
        <taxon taxonName="Homo sapiens"/>
      </partIdent>
      ….
    </identification>
  </seqData>
  <seqData id="AF001799" >
    <catalogue >
      ….
  </seqData>
</seqDataList>

Page 26

AND operator (request)

<q:querySeqData>
  <seqData>
    <annotation>
      <entity name="C-GENE"/>
      <region name="D-REGION"/>
    </annotation>
  </seqData>
</q:querySeqData>

AND = intersection

Page 27

AND operator (result)

<seqDataList>
  <seqData id="U97590" >
    <catalogue> …. </catalogue>
    <identification> …. </identification>
    <annotation>
      <cluster name="D-J-C-CLUSTER">
        <start location="1"/>
        <entity name="D-GENE">
          <start location="1"/>
          <region name="D-REGION">
            <start location="464"/>
            <end location="475"/>
          </region>
          <end location="600"/>
        </entity>
        ….
        <entity name="C-GENE" partial="true">
          <start location="4334"/>
          ….
          <end location="5440"/>
        </entity>
        <end location="5440"/>
      </cluster>
    </annotation>
    ….
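
In the request, the two constraints sit side by side inside one seqData, and the result shows them satisfied at different depths of the annotation tree (the D-REGION inside a D-GENE entity, the C-GENE entity elsewhere in the same cluster). A sketch of such an AND evaluation, assuming that each constraint may be satisfied by any descendant of the record, could look like this. The record `EXAMPLE1` is invented, U97590 is abridged from the slide, and `found_in` and `and_match` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def found_in(constraint, scope):
    """True if some descendant of scope has the constraint's tag and attributes
    and, recursively, satisfies the constraint's own children."""
    for node in scope.iter(constraint.tag):
        if node is scope:
            continue  # only look below the current scope
        attrs_ok = all(node.get(k) == v for k, v in constraint.attrib.items())
        if attrs_ok and all(found_in(child, node) for child in constraint):
            return True
    return False


def and_match(query_seq_data, record):
    """AND: every constraint inside the query's seqData must be found."""
    return all(found_in(constraint, record) for constraint in query_seq_data)


QUERY = ET.fromstring("""<seqData>
  <annotation>
    <entity name="C-GENE"/>
    <region name="D-REGION"/>
  </annotation>
</seqData>""")

# U97590 is abridged from the result slide; EXAMPLE1 is an invented record
# that has a C-GENE entity but no D-REGION.
RECORDS = ET.fromstring("""<seqDataList>
  <seqData id="U97590">
    <annotation>
      <cluster name="D-J-C-CLUSTER">
        <entity name="D-GENE">
          <region name="D-REGION"/>
        </entity>
        <entity name="C-GENE" partial="true"/>
      </cluster>
    </annotation>
  </seqData>
  <seqData id="EXAMPLE1">
    <annotation>
      <entity name="C-GENE"/>
    </annotation>
  </seqData>
</seqDataList>""")

if __name__ == "__main__":
    print([r.get("id") for r in RECORDS if and_match(QUERY, r)])
```

Only the record that satisfies both constraints is kept, which is the intersection of the two individual result sets.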

Page 28

OR operator

<request>
  <seqData>
    <partIdent chainType="Ig-Light-Kappa" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
  <seqData>
    <partIdent chainType="Ig-Light-Lambda" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
</request>

<response>
  <seqData> <catalogEntry id="A25907" > …. </seqData>
  <seqData> <catalogEntry id="AF026482" > …. </seqData>
  <seqData> <catalogEntry id="AF026483" > …. </seqData>
  <seqData> <catalogEntry id="AF026484" > …. </seqData>
  …
</response>

OR = union
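
Here each seqData in the request is an alternative, and a record is returned as soon as one alternative matches it. The sketch below illustrates that union on three invented records (`K1`, `L1`, `H1` are not real accession numbers); the matching is deliberately reduced to the partIdent attributes and the taxon name, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def part_ident_matches(alternative, record):
    """True if the record has a partIdent agreeing with the alternative's
    partIdent attributes and taxon name."""
    wanted = alternative.find("partIdent")
    wanted_taxon = wanted.find("taxon").get("taxonName")
    for part in record.iter("partIdent"):
        attrs_ok = all(part.get(k) == v for k, v in wanted.attrib.items())
        taxon = part.find("taxon")
        if attrs_ok and taxon is not None and taxon.get("taxonName") == wanted_taxon:
            return True
    return False


def or_query(request, records):
    """OR: keep a record as soon as one seqData alternative matches it."""
    alternatives = request.findall("seqData")
    return [
        record.get("id")
        for record in records
        if any(part_ident_matches(alt, record) for alt in alternatives)
    ]


REQUEST = ET.fromstring("""<request>
  <seqData>
    <partIdent chainType="Ig-Light-Kappa" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
  <seqData>
    <partIdent chainType="Ig-Light-Lambda" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
</request>""")

# Invented records, for illustration only.
RECORDS = ET.fromstring("""<seqDataList>
  <seqData id="K1"><identification>
    <partIdent chainType="Ig-Light-Kappa" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/></partIdent></identification></seqData>
  <seqData id="L1"><identification>
    <partIdent chainType="Ig-Light-Lambda" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/></partIdent></identification></seqData>
  <seqData id="H1"><identification>
    <partIdent chainType="Ig-Heavy" functionality="functional">
      <taxon taxonName="Homo sapiens"/></partIdent></identification></seqData>
</seqDataList>""")

if __name__ == "__main__":
    print(or_query(REQUEST, RECORDS))
```

Because each record is tested once against all alternatives, a record matching several alternatives still appears only once in the union.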

Page 29

Request with domain restriction

<q:querySeqData>
  <seqData>
    <sequence>
      <q:domain of="@length">
        <q:maxInclusive value="400"/>
        <q:minInclusive value="100"/>
      </q:domain>
    </sequence>
  </seqData>
</q:querySeqData>

<q:querySeqData>
  <seqData>
    <sequence length="300"/>
  </seqData>
</q:querySeqData>
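
The first request replaces the exact value `length="300"` of the second with a range, using the facet names of XML Schema inside a `q:domain` element. A minimal sketch of how such a restriction could be evaluated against a sequence element follows. It assumes the prefix `q` is bound to the IMGTQuery namespace given on the package slide, handles only the four numeric facets, and ignores `@complement`, `enumeration` and `pattern`; `in_domain` is a hypothetical name.

```python
import operator
import xml.etree.ElementTree as ET

# Namespace taken from the IMGTQuery package slide; binding it to "q" is assumed.
Q_NS = "http://www.imgt.org/IMGT-ML/IMGTQuery"

# Numeric facets only: each maps to the comparison "value OP bound".
FACETS = {
    "minInclusive": operator.ge,
    "minExclusive": operator.gt,
    "maxInclusive": operator.le,
    "maxExclusive": operator.lt,
}


def in_domain(domain, element):
    """True if the attribute named by @of satisfies every facet of the domain."""
    attribute = domain.get("of").lstrip("@")
    raw = element.get(attribute)
    if raw is None:
        return False
    value = int(raw)
    for facet in domain:
        facet_name = facet.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
        bound = int(facet.get("value"))
        if not FACETS[facet_name](value, bound):
            return False
    return True


QUERY = ET.fromstring(f"""<q:querySeqData xmlns:q="{Q_NS}">
  <seqData>
    <sequence>
      <q:domain of="@length">
        <q:maxInclusive value="400"/>
        <q:minInclusive value="100"/>
      </q:domain>
    </sequence>
  </seqData>
</q:querySeqData>""")

DOMAIN = QUERY.find(f".//{{{Q_NS}}}domain")

if __name__ == "__main__":
    for length in ("50", "300", "401"):
        sequence = ET.Element("sequence", {"length": length})
        print(length, in_domain(DOMAIN, sequence))
```

With this reading, the exact-value request is the special case where the domain contains a single value.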

Page 30

Result control

<q:querySeqData>
  <seqData>
    <partIdent chainType="Ig" moleculeType="cDNA" configuration="rearranged"
               specificity="anti-thyroid peroxidase (TPO)" >
      <taxon taxonName="Homo sapiens" />
    </partIdent>
  </seqData>
</q:querySeqData>

<q:responseTemplate>
  <seqData>
    <classification><gene/></classification>
  </seqData>
</q:responseTemplate>

Expected result :

<seqDataList nb="27">
  <seqData id="AF306350">
    <classification>
      <gene name="IGHV1-69"/>
      <gene name="IGHD3-10"/>
      <gene name="IGHJ6"/>
    </classification>
  </seqData>
  <seqData id="AF306376">
    <classification>
      <gene name="IGHV1-3"/>
      <gene name="IGHD4-4"/>
      <gene name="IGHJ4"/>
    </classification>
  </seqData>
  .. ..
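
The responseTemplate acts as a projection: of each matching record, only the elements named in the template are returned. The sketch below shows one way to implement such a projection, under the assumption that a record keeps its own attributes (such as `id`) and that a template leaf keeps the matching element's attributes and text but none of its children. The identification part of `RECORD` is assumed from the query; only the classification comes from the expected result. `project` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def project(template, record):
    """Copy record, keeping only the children whose tags appear in template."""
    out = ET.Element(record.tag, dict(record.attrib))
    if len(template) == 0:
        out.text = record.text  # template leaf: keep attributes and text only
        return out
    for wanted in template:
        for child in record.findall(wanted.tag):
            out.append(project(wanted, child))
    return out


TEMPLATE = ET.fromstring(
    "<seqData><classification><gene/></classification></seqData>"
)

# Classification as in the expected result; identification is assumed.
RECORD = ET.fromstring("""<seqData id="AF306350">
  <identification>
    <partIdent chainType="Ig" moleculeType="cDNA" configuration="rearranged">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </identification>
  <classification>
    <gene name="IGHV1-69"/>
    <gene name="IGHD3-10"/>
    <gene name="IGHJ6"/>
  </classification>
</seqData>""")

if __name__ == "__main__":
    print(ET.tostring(project(TEMPLATE, RECORD), encoding="unicode"))
```

Separating selection (the query) from projection (the template) lets the same query feed tools that need different parts of a record.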

Page 31

Tools : Junction Analysis

• Input : a sequence list, each sequence has a V gene and a J gene.
• Output : each sequence is annotated with the locations of the V, D and J genes and of the N and P regions, and the sequences are aligned.

[Diagram: seqDataList / seqData / classification → IMGTjcta → seqDataList / seqData / annotation, alignement, proSequence]

Page 32

Literature document queries (ORIEL project)

How can the IMGT-ONTOLOGY vocabulary be used to index literature documents?

• Authors do not use this vocabulary
• GO terms are too poor to describe genes involved in immunoglobulin and T cell receptor synthesis
• Existing search engines do not index text with this vocabulary

Page 33

SEFID Prototype search engine pipeline

[Pipeline diagram; stages and their annotations grouped as far as the transcript allows]

• IMGT-ML query
• IMGT/LIGM-DB SOAP server at EBI (IMGT SOAP server; EBI SOAP server; can be IMGT or DDBJ SOAP server as well)
• Literature ref. : Accession #s, Medline #s
• Doc2Loc (E-BioSci Doc2Loc SOAP server) : Abstract locations; Location (NCBI, INIST, SUDOC…)
• Abstracts
• Text analysis (LT-POS, …) : Oriel SOAP server implementing LT-Chunk from LTG (Edinburgh) (home development)
• Useful words
• Statistics analysis : Oriel SOAP server, word frequencies (fingerprint) (home development)
• Word signatures
• General purpose search engine (E-BioSci, Collexis, …) : E-BioSci SOAP server, or any other search engine (google)
• Full text publications
• Doc2Loc

Page 34

Conclusion

• IMGT-ML, very close to IMGT-ONTOLOGY, is consistent with the biology

• Using the same IMGT-ML format as input and output of modules (Web services) allows them to be chained

• This eases the development of IMGT-CHOREOGRAPHY, which is our next development.
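
The chaining argument can be illustrated with a small sketch: if every module consumes and produces a seqDataList, modules compose like ordinary functions. The two modules below are hypothetical stand-ins for Web services, the records are invented, and `chain` is an illustrative helper, not part of IMGT-ML. Only the `nb` attribute on seqDataList is taken from the "Result control" slide.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def keep_rearranged(seq_data_list):
    """Hypothetical module: keep only records whose partIdent is rearranged."""
    out = ET.Element("seqDataList")
    for record in seq_data_list.findall("seqData"):
        part = record.find(".//partIdent")
        if part is not None and part.get("configuration") == "rearranged":
            out.append(record)
    return out


def count_records(seq_data_list):
    """Hypothetical module: store the number of records in the nb attribute."""
    records = seq_data_list.findall("seqData")
    out = ET.Element("seqDataList", {"nb": str(len(records))})
    out.extend(records)
    return out


def chain(*modules):
    """Compose modules left to right; each maps a seqDataList to a seqDataList."""
    return lambda document: reduce(lambda doc, module: module(doc), modules, document)


# Invented input records, for illustration only.
INPUT = ET.fromstring("""<seqDataList>
  <seqData id="REC1"><identification>
    <partIdent configuration="rearranged"/></identification></seqData>
  <seqData id="REC2"><identification>
    <partIdent configuration="germline"/></identification></seqData>
</seqDataList>""")

if __name__ == "__main__":
    pipeline = chain(keep_rearranged, count_records)
    result = pipeline(INPUT)
    print(result.get("nb"), [r.get("id") for r in result])
```

Because the output of one module is a valid input for the next, new modules can be inserted into the chain without adapting their neighbours.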

Page 35

People

• Kora Combres (CNRS/IGH, ORIEL project)

• Véronique Giudicelli (CNRS/IGH, IMGT-ONTOLOGY)

• Professor Marie-Paule Lefranc (UM II, CNRS/IGH, IMGT project)

• Denys Chaume (CNRS/IGH, IMGT, ORIEL)