Summer Training Report

Summer Internship Report Aditya Dhoke Roll. No - 04005013 Department of Computer Science and Engineering Indian Institute of Technology Bombay ,India. Guide Prof. Amit Sheth August 5, 2007


  • Summer Internship Report

    Aditya Dhoke, Roll No. 04005013

    Department of Computer Science and Engineering, Indian Institute of Technology Bombay, India.

    Guide Prof. Amit Sheth

    August 5, 2007

  • Abstract

    In this report, I describe the projects and implementations, namely Entity Spotter, Semantic Browser and Web Portal, on which I worked with Kno.e.sis lab members during my summer internship in 2007. I also provide details of the presentation that I gave as part of my curriculum. Finally, the conclusion states what I gained from this internship and the future endeavors that I plan.

  • 0.1 Entity Spotter

    The implementation reads a set of entities and relations with their corresponding IDs, builds a tree data structure similar to a Trie, and uses it to tag the entities in text.

    In the initial stage, the program takes as input the concepts and relationships along with their alphanumeric IDs and creates a tree structure in which each node holds a hashmap. In this hashmap (a set of key-value pairs), each key is a word within an entity and each value is another node in the tree. As words are read, the tree is traversed from top to bottom, using the current word as the key and following the value to the node below it. If the sequence of words read matches an entity, the ID of that entity is fetched from the current node. The implementation also handles the case where one entity is a prefix of another: if the longer entity does not match, it backtracks to tag the shorter entity.

    The method runs in O(log(m)*n) time, where m is the number of entities and n is the number of words in the input file, as opposed to the brute-force technique, which would take O(m*n) time. The data structure is shown in Figure 1. The figure shows the storage of two entities, gene mutation and gene abnormality, and their corresponding IDs D130 and D432. If gene mutation occurs in the text, the hash table of the uppermost node is consulted first, followed by the hash table of the left node. This leads to the node in which the ID D130 is stored.

    Figure 1: Tree structure for storing entities
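
    The lookup described above can be sketched roughly as follows. This is a minimal illustration and not the original implementation: the names Node, build_trie and spot_entities are hypothetical, words are split on whitespace only, and matching is case-insensitive by assumption. Each node holds a dictionary from word to child node, and the tagger falls back to the longest complete entity seen when a longer candidate fails to match.

```python
class Node:
    """One trie node: children keyed by word, plus an optional entity ID."""

    def __init__(self):
        self.children = {}     # word -> Node
        self.entity_id = None  # set only where a complete entity ends


def build_trie(entities):
    """Build a word-level trie from a {phrase: id} mapping."""
    root = Node()
    for phrase, entity_id in entities.items():
        node = root
        for word in phrase.lower().split():
            node = node.children.setdefault(word, Node())
        node.entity_id = entity_id
    return root


def spot_entities(root, text):
    """Return (start, end, id) word spans for the longest entity at each position."""
    words = text.lower().split()
    tags = []
    i = 0
    while i < len(words):
        node = root
        last_match = None  # most recent complete entity seen on this walk
        j = i
        while j < len(words) and words[j] in node.children:
            node = node.children[words[j]]
            j += 1
            if node.entity_id is not None:
                last_match = (i, j, node.entity_id)
        if last_match:
            # Backtrack to the longest complete entity, even if the walk went further.
            tags.append(last_match)
            i = last_match[1]
        else:
            i += 1
    return tags


# The two entities from Figure 1.
trie = build_trie({"gene mutation": "D130", "gene abnormality": "D432"})
print(spot_entities(trie, "A gene mutation differs from a gene abnormality"))
```

    Each tuple gives the start and end word offsets of a tagged entity together with its ID. Punctuation handling is left out of this sketch.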

  • 0.2 Presentation: Ontology Summarization Based on RDF Sentence Graph

    The aim of the presentation was to put forward the idea in [1]. Computer scientists at Southeast University, China, introduced the novel idea of the RDF Sentence Graph, which they used for summarizing ontologies in RDF format. Given an ontology (an RDF graph) along with the length of the summary and a preference, RDF sentences are detected and a graph is built in which each node is an RDF sentence. The summarization problem is thereby reduced to finding salient nodes in the graph. After this, the salient nodes are re-ranked to obtain a more appropriate summary. Degree Centrality, Shortest-Path-based Centrality, Eigenvector Centrality and Weighted HITS are the methods that were used for finding salience. The workflow is shown in Figure 2.

    Figure 2: Workflow for Ontology Summarization
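
    As a rough illustration of the salience step, the sketch below ranks nodes by degree centrality, the simplest of the measures listed above. It is a simplification under stated assumptions: the graph is treated as undirected and unweighted, the node names and edges are made up for the example, and the function names are hypothetical. The graph construction and re-ranking described in [1] are not reproduced here.

```python
def degree_centrality(graph):
    """Map each node to degree / (n - 1) for an undirected adjacency dict."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def top_k(graph, k):
    """Return the k most central nodes, breaking ties by node name."""
    scores = degree_centrality(graph)
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


# Hypothetical graph: each node stands for one RDF sentence.
sentence_graph = {
    "s1": {"s2", "s3", "s4"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2"},
    "s4": {"s1"},
}
print(top_k(sentence_graph, 2))
```

    Here the requested summary length plays the role of k: the k highest-scoring RDF sentences would form the candidate summary before any re-ranking.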

    0.3 Semantic Browser

    Semantic Browser is a tool for browsing semantically connected PubMed abstracts. The user can traverse the documents along with the RDF generated from the text. The aim of the application is to evaluate the quality and authenticity of the RDF that has been created from the text. It has been built as a web application for platform independence.

  • 0.3.1 Data Storage

    The RDF statements are stored as a persistent object with a Trie structure. The abstracts and their PMIDs are indexed using a Lucene index [7]. The persistent object and the indexes are created off-line and stored on the server.

    0.3.2 Data Exchange

    Data exchange is done using AJAX. Parameters are passed from the client side to a JSP, which in turn queries information on the server side. The JSP converts the retrieved data to XML. The XML data is parsed on the client side using the DOM and is then styled with CSS to make it readable to the user.
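
    The round trip above can be illustrated with a small sketch. It is written in Python for brevity rather than in JSP and JavaScript, and the XML element names (abstract, statement, subject, relation, object), the function names and the sample values are all assumptions, since the report does not give the actual schema. One function plays the server role of serializing RDF statements to XML, and the other plays the client role of reading them back through a DOM parser.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def statements_to_xml(pmid, statements):
    """Serialize (subject, relation, object) triples for one abstract as XML."""
    root = ET.Element("abstract", {"pmid": pmid})
    for subject, relation, obj in statements:
        stmt = ET.SubElement(root, "statement")
        ET.SubElement(stmt, "subject").text = subject
        ET.SubElement(stmt, "relation").text = relation
        ET.SubElement(stmt, "object").text = obj
    return ET.tostring(root, encoding="unicode")


def parse_statements(xml_text):
    """Read the triples back with a DOM parser, as the client side would."""
    dom = minidom.parseString(xml_text)
    triples = []
    for stmt in dom.getElementsByTagName("statement"):
        fields = [stmt.getElementsByTagName(tag)[0].firstChild.data
                  for tag in ("subject", "relation", "object")]
        triples.append(tuple(fields))
    return triples


# Placeholder PMID and a made-up statement, for illustration only.
payload = statements_to_xml("00000000", [("gene mutation", "causes", "disease")])
print(parse_statements(payload))
```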

    0.3.3 Functionality

    The entities and relations in the abstract are highlighted. When the user hovers over an entity (the subject), the corresponding relation and object of the RDF statement are listed. The PMIDs of the files in which this statement occurs are also displayed. Two search boxes are provided, one for PMID and the other for keyword. As the user types, suggestions appear in a drop-down menu.

    Figure 3: Phases of Semantic Browser
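
    The search-box suggestions could be produced by a prefix lookup over a sorted list of terms, as in the sketch below. This is one possible approach and not necessarily the one used in the browser; the function name suggest, the limit parameter and the sample terms are hypothetical, and the terms are assumed to be stored in lower case.

```python
from bisect import bisect_left


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from a sorted list that start with `prefix`."""
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


terms = sorted(["gene abnormality", "gene mutation", "genome", "protein"])
print(suggest(terms, "gen"))
```

    Because the list is sorted, all terms sharing a prefix are contiguous, so a binary search finds the first candidate and the scan stops at the first non-match.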

  • 0.4 Web Portal

    I worked on the library web page of Kno.e.sis. The resources were displayed on the web using a tool named Exhibit, which provides an interface for browsing through the resources. Earlier, it fetched data in JSON format that was created manually from the spreadsheets. Now, the data is read directly from the spreadsheet. The data in the spreadsheet (a Google Spreadsheet) was cleaned up using a Java library so that every lab member's name appears only once, irrespective of whether he or she uses initials or the canonical form.
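
    The name clean-up can be illustrated with a simple heuristic. The sketch below is an assumption about how such normalization might work and is not the Java library that was actually used: it groups name variants by first initial and last name and maps every variant to the longest form in its group. The function names are hypothetical.

```python
def name_key(name):
    """Key a name by (first initial, last name), ignoring case and periods."""
    parts = name.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())


def canonicalize(names):
    """Map every variant to the longest variant sharing its key."""
    canonical = {}
    for name in names:
        key = name_key(name)
        if key not in canonical or len(name) > len(canonical[key]):
            canonical[key] = name
    return {name: canonical[name_key(name)] for name in names}


variants = ["A. Sheth", "Amit Sheth", "Amit P. Sheth", "C. Ramakrishnan", "Cartic Ramakrishnan"]
for variant, full in canonicalize(variants).items():
    print(variant, "->", full)
```

    A heuristic like this would wrongly merge two members who share a first initial and a last name, so a real clean-up would need a manual check or extra fields to tell them apart.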

    0.5 Acknowledgements

    I am grateful to Prof. Amit Sheth for giving me the opportunity to work in his lab. I am thankful to Cartic Ramakrishnan for his consistent guidance and support.

    0.6 Conclusion

    At the end of the internship, most of my queries about research in the Semantic Web and its future prospects have been answered. I acquainted myself with different areas of the Semantic Web by interacting with the lab members. I have discovered the research topic that interests me and consequently want to pursue a Ph.D. in that topic.

  • Bibliography

    [1] Xiang Zhang, Gong Cheng, Yuzhong Qu, Ontology Summarization Based on RDF Sentence Graph, World Wide Web Conference, 2007.

    [2] V. Bush, As We May Think, The Atlantic Monthly, 176(1), 1945, p. 101-108.

    [3] Cartic Ramakrishnan, Krys J. Kochut, Amit P. Sheth, A Framework for Schema-Driven Relationship Discovery from Unstructured Text, ISWC, 2006, p. 583-596.

    [4] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL, 1999.

    [5] Partha Pratim Talukdar, Thorsten Brants, Mark Liberman, Fernando Pereira, A Context Pattern Induction Method for Named Entity Extraction, Proceedings of the 10th Conference on Computational Natural Language Learning, June 2006.

    [6] Eugene Agichtein, Luis Gravano, Snowball: Extracting Relations from Large Plain-Text Collections, ACM DL, 2000.

    [7] lucene.apache.org
