CiteGraph: A Citation Network System for MEDLINE Articles and Analysis

Qing Zhang1,2, Hong Yu1,3

1 University of Massachusetts Medical School, Worcester, MA, USA

2 University of Wisconsin-Milwaukee, Milwaukee, WI, USA

3 VA Central Massachusetts, Leeds, MA, USA

CiteGraph, MedInfo 2013

Outline

Introduction

Background

Method

Evaluation

Analysis

Introduction

Citation networks are important for
• Information retrieval
• Bibliometric measures such as the Journal Impact Factor and the H-index

Co-authorship networks are important as well

Few citation networks are available for research

We built CiteGraph

Background

Citation network analysis
• Power-law distribution in citation networks
• Article ranking: HITS and PageRank
• Community structure of physics fields
• Citation network tool for a given legal issue, built on the legal document citation network

Co-authorship network analysis
• Research collaboration patterns
• Author authority: Erdős number

Literature search
• CiteSeerX, Google Scholar

The CiteGraph Data

Citation Network Example

Challenges

(1) Yu, H and Lee M. 2006. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556.

(2) Hong Yu and Minsuk Lee. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556. 2006.

(3) Yu H, Lee H. 2006. Accessing Bioscience Images from Abstract Sentences. Bioinformatics: 22 (14), e547–e556.

Methods

Mapping between articles

Mapping articles to the PubMed ID

Author name disambiguation

Methods

If two of the following matching results are true, we consider the two entities (for example, a citation and an article) to be matched

Title matching
• The set of tokens contained in one title field is a subset of the tokens in the other, or
• the number of tokens common to both fields is more than 80% of the size of the larger of the two fields

Author list matching
• The two lists of surnames have a one-to-one mapping
• The surnames in one entity (the citation) are fully contained in the surname set of the second (the article)

Journal name matching
• Stop words such as “of” are removed
• If the number of common initials in the journal titles is greater than 80% of the tokens in the longer journal name, the names are considered equivalent
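The matching rules on this slide can be sketched roughly as follows. This is only an illustration, not the CiteGraph implementation: the function names, the tokenizer, the stop-word list beyond "of", the reading of "two of the following" as "at least two", and the treatment of the author rule as a subset test are all assumptions.

```python
import re
from collections import Counter

# Assumed stop-word list; the slide names only "of" as an example.
STOP_WORDS = {"of", "the", "and", "in", "for", "on", "a", "an"}


def tokens(text):
    """Lowercase a field and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_match(a, b, threshold=0.8):
    """Subset test, or shared tokens exceeding 80% of the larger token set."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return False
    if ta <= tb or tb <= ta:
        return True
    return len(ta & tb) > threshold * max(len(ta), len(tb))


def author_match(citation_surnames, article_surnames):
    """Citation surnames fully contained in the article's surname set.

    An exact one-to-one mapping is a special case of this subset test.
    """
    c = {s.lower() for s in citation_surnames}
    a = {s.lower() for s in article_surnames}
    return bool(c) and c <= a


def journal_match(a, b, threshold=0.8):
    """Compare initials of non-stop-word tokens in the two journal names."""
    ia = [t[0] for t in tokens(a) if t not in STOP_WORDS]
    ib = [t[0] for t in tokens(b) if t not in STOP_WORDS]
    if not ia or not ib:
        return False
    common = sum((Counter(ia) & Counter(ib)).values())
    return common > threshold * max(len(ia), len(ib))


def is_match(citation, article):
    """Declare a match when at least two of the three field tests succeed."""
    votes = [
        title_match(citation["title"], article["title"]),
        author_match(citation["surnames"], article["surnames"]),
        journal_match(citation["journal"], article["journal"]),
    ]
    return sum(votes) >= 2


citation = {
    "title": "Accessing bioscience images from abstract sentences",
    "surnames": ["Yu", "Lee"],
    "journal": "Bioinformatics",
}
article = {
    "title": "Accessing Bioscience Images from Abstract Sentences",
    "surnames": ["Yu", "Lee"],
    "journal": "Bioinformatics",
}
print(is_match(citation, article))
```

Requiring agreement on two fields rather than all three tolerates an error in a single field, such as a misspelled author initial or an abbreviated journal name.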

Evaluation Results

Task              Precision  Recall  F1    Inter-Annotator Agreement (Kappa)
Citation Mapping  1          0.96    0.98  1
PMID Mapping      0.99       0.99    0.99  1

• Seven annotators were invited to annotate the citation mapping and PMID mapping results

• Each annotator was presented with 20 matching results for each task
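The F1 column is consistent with the standard harmonic mean of precision and recall. A minimal check, assuming that standard definition (the slide does not state the formula, and `f1_score` is a name chosen here for illustration):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the table above.
citation_f1 = f1_score(1.0, 0.96)
pmid_f1 = f1_score(0.99, 0.99)
print(round(citation_f1, 2), round(pmid_f1, 2))  # prints: 0.98 0.99
```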

The CiteGraph Statistics

1.65 M articles

6.35 M citations

1.37 M authors

The CiteGraph Statistics

log y = 1.06 - 2.45 * log x (p < 0.05, t-test)

Livak KJ., Schmittgen TD., Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods. 2001 Dec;25(4):402-8.
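A straight line in log-log space, like the regression reported on this slide, is the usual signature of a power-law distribution. The sketch below shows how such a line can be fitted with ordinary least squares. It makes several assumptions: the slide does not say what x and y denote or which logarithm base was used (base 10 is assumed here), the data points are synthetic values generated from the reported line rather than CiteGraph data, and `fit_loglog` is a hypothetical helper.

```python
import math


def fit_loglog(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic points lying exactly on the reported line (not real data).
xs = [1, 2, 5, 10, 20, 50, 100]
ys = [10 ** (1.06 - 2.45 * math.log10(x)) for x in xs]

intercept, slope = fit_loglog(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

On real counts the points scatter around the line, and the slope estimate is sensitive to how the sparse tail of the distribution is binned.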

The CiteGraph Statistics

Largest connected component: 1.27 million authors (92.7%)

Second largest connected component: 35 authors
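Component sizes like these can be computed from a co-authorship edge list with a union-find structure. The following is a minimal sketch on a toy graph; the slides do not say how the components were actually computed, and `component_sizes` and the sample edges are illustrative only.

```python
from collections import Counter


def component_sizes(edges):
    """Return connected-component sizes, largest first, for an edge list."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)


# Toy co-authorship edges: A-B-C form one component, D-E another.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(component_sizes(toy_edges))  # prints: [3, 2]
```

Authors with no co-authors never appear in an edge list, so they would have to be added separately as components of size one.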

The CiteGraph Statistics

Co-authorship spans range from 1 to 35 years, and 83.7% of author pairs appear together only once.
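One way to compute a co-authorship year span per author pair is sketched below. It assumes the span is counted inclusively (last year minus first year plus one), which would explain a minimum of 1, but the slides do not define the measure. The input format and the function name are hypothetical.

```python
from itertools import combinations


def coauthor_year_spans(articles):
    """Map each author pair to its inclusive span of co-publication years.

    `articles` is an iterable of (year, author_list) tuples.
    """
    first_last = {}
    for year, authors in articles:
        for pair in combinations(sorted(set(authors)), 2):
            lo, hi = first_last.get(pair, (year, year))
            first_last[pair] = (min(lo, year), max(hi, year))
    return {pair: hi - lo + 1 for pair, (lo, hi) in first_last.items()}


# Toy input: A and B publish together in 2006 and again in 2010.
toy_articles = [(2006, ["A", "B"]), (2010, ["A", "B", "C"])]
spans = coauthor_year_spans(toy_articles)
print(spans[("A", "B")], spans[("A", "C")], spans[("B", "C")])  # prints: 5 1 1
```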

The CiteGraph Statistics

Measure                  Mean   Median  Std    Max  Min
# of Co-authors          11     6       14     671  0
Co-authorship Year Span  1.521  1       1.576  35   1

* The largest component is excluded when calculating the statistics in the table. Its size is 1.27 million authors (92.7% of authors).

Trends

Conclusion

We created citation and co-authorship networks from full-text biomedical literature

Our networks are accurate and large in scale, and they can benefit the biomedical text mining community in
• Article ranking
• Research collaboration recommendation
• Social network analysis

The network database is available for download upon request

Acknowledgement

National Institutes of Health grant 1R01GM095476 to Hong Yu

A start-up fund from the University of Massachusetts Medical School to Hong Yu

National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR000161