Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Let it Flow: Government Documents, PDF (or PDF + XML) to reflowable EPUB. Open Book Hackathon, NYPL, January 11-13, 2014

description

NYPL Labs hosted the Open Book Hack Weekend at the New York Public Library, featuring open source digital book and content development based on HTML5, EPUB, and the Open Web Platform. Readium Foundation, O’Reilly Media, Perseus Books, Hypothes.is, Google, and Datalogics sponsored the event. I served as the user experience designer for my team, which included Dave Mayo of Harvard University, Julia Pollacks of the Bronx Community College, and Jeremy Baron. We worked on a way to convert PDF documents to reflowable EPUB format, focusing on government documents, including a sample of slip opinions from the United States Supreme Court. Sample documents and scripts are located at https://github.com/pobocks/pdf2freedom and our group presentation is below. More information about the Open Book Hack event is available at the OpenBook2014 GitHub site: https://github.com/openbook2014/nypl-hack-weekend/wiki/Hack-ideas

Transcript of Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Page 1: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Let it Flow: Government Documents
PDF (or PDF + XML) to reflowable EPUB

Open Book Hackathon, NYPL
January 11-13, 2014

Page 2: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Proposed Tasks: Legal ePUB
What would users want?

● Remove hard line breaks for continuous, reflowable display in multiple formats/modes.
● Parse metadata from header information, pagination and other semantic content.
● Parse citation of previous opinions or laws and link to cited legislation.
● Create stylesheet and page markup for display on devices.

It is important to:

● Ensure that data is formatted so that it flows naturally regardless of device used.
● Maintain information about pagination for citation and discovery purposes.

Page 3: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Proposed Tasks: Legal ePUB

We only had time for one thing.

Page 4: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Supreme Court Slip Opinions

Page 5: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Supreme Court Slip Opinions

● Supreme Court slip opinion documents are available in PDF format.
● Dockets may contain multiple document types:
○ Syllabus
○ Opinion of the Court
○ Per Curiam
○ Dissent

Page 6: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Metadata: SCOTUS Slip Opinions

Citation

Document Type

Notice

Presiding Court

Docket Number

Case Names/Parties to the Case

Date Decided

Justice

Opinion

Front matter for each document contains several centered blocks of content. At first glance, the documents contain minimal structural information.

Page 7: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format

<text top="171" left="676" width="15" height="16" font="0">1 </text>
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
<text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
<text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
<text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
<text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>

We used Poppler to convert Supreme Court slip opinion documents (PDF) to XML. This revealed position and size attributes for lines of text, but no structured semantic information.
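The XML on this slide has the shape that Poppler's `pdftohtml -xml` writes, so that is likely the command used, though the slides do not say. As a minimal sketch, assuming that format, the per-line records can be loaded with Python's standard library. The `Line` record and `read_lines` helper are illustrative names, not taken from the team's scripts, and `SAMPLE` is a shortened copy of the excerpt above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Shortened copy of the slide's excerpt, wrapped in the pdf2xml/page elements.
SAMPLE = """<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
</page>
</pdf2xml>"""


@dataclass
class Line:
    page: int
    top: int
    left: int
    width: int
    height: int
    font: str
    text: str


def read_lines(xml_string):
    """Return one Line per <text> element, in document order."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            lines.append(Line(
                page=int(page.get("number")),
                top=int(t.get("top")),
                left=int(t.get("left")),
                width=int(t.get("width")),
                height=int(t.get("height")),
                font=t.get("font"),
                # itertext() also collects text nested inside <i> and <b>.
                text="".join(t.itertext()).strip(),
            ))
    return lines


lines = read_lines(SAMPLE)
```

Each record keeps only position, size, and font id, which is all the structural information the conversion exposes.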

Page 8: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: Font ID

We may be able to use position and size attributes to identify content parts of the document. For example, different parts of the document have varying positions, font sizes and line heights.

Page 9: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: Font ID

In this document, the font attribute is 1 for the Note section, 3 for the names of the parties to the case and 4 for the name of the presiding court.

Page 10: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: <page> attributes

Font sizes are defined by <fontspec> elements that follow the opening <page> tag in the XML document. In this example, the “Supreme Court of the United States” heading has height="20" and font="4", which corresponds to fontspec id="4", defined as Times 20pt black (color #000000).

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">

<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<fontspec id="5" size="14" family="Times" color="#000000"/>
<fontspec id="6" size="11" family="Times" color="#000000"/>

...

<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>

Font ID 4 is Times 20pt
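The lookup from a line's font attribute to its fontspec can be sketched as follows. This is an illustration using Python's standard library, not the team's code; `font_table` is a hypothetical helper, and `SAMPLE` is a shortened document built from the fontspec and text lines shown on these slides.

```python
import xml.etree.ElementTree as ET

# Shortened sample: two fontspecs on page 1, one more introduced on page 2.
SAMPLE = """<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="7" size="11" family="Times" color="#ff0000"/>
</page>
</pdf2xml>"""


def font_table(root):
    """Map font id -> fontspec attributes, collected across every page."""
    return {
        f.get("id"): {
            "size": int(f.get("size")),
            "family": f.get("family"),
            "color": f.get("color"),
        }
        for f in root.iter("fontspec")
    }


root = ET.fromstring(SAMPLE)
fonts = font_table(root)

# Resolve the heading's font attribute to its fontspec.
heading = root.find(".//text[@font='4']")
spec = fonts[heading.get("font")]
description = f"{spec['family']} {spec['size']}pt {spec['color']}"
```

Collecting fontspecs from the whole document rather than from a single page matters because later pages only declare the ids they introduce.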

Page 11: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: <page> attributes

Font specifications are defined within the page element. If a subsequent page requires new font IDs, they are added directly after that page's opening <page> tag.

<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<fontspec id="5" size="14" family="Times" color="#000000"/>
<fontspec id="6" size="11" family="Times" color="#000000"/>

<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="7" size="11" family="Times" color="#ff0000"/>

<page number="3" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="8" size="7" family="Times" color="#000000"/>

<page number="5" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="9" size="4" family="Times" color="#000000"/>

...

Page 12: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: Inconsistencies

Font sizes for specific id attributes are not consistent across all documents, BUT...

Docket 12-609

<page number="1" position="absolute" top="0" left="0" height="1188" width="918">

<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<fontspec id="5" size="14" family="Times" color="#000000"/>
<fontspec id="6" size="11" family="Times" color="#000000"/>

In this document, the “Supreme Court of the United States” heading has font id 4.

Docket 12-729

<page number="1" position="absolute" top="0" left="0" height="1188" width="918">

<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="11" family="Times" color="#000000"/>
<fontspec id="5" size="20" family="Times" color="#000000"/>
<fontspec id="6" size="14" family="Times" color="#000000"/>
<fontspec id="7" size="14" family="Times" color="#000000"/>

In this document, the “Supreme Court of the United States” heading has font id 5.
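One way around the shifting ids, sketched here as an assumption rather than as the team's method, is to select lines by the size declared in the fontspec instead of by the id itself. The `doc` and `text_at_size` helpers are hypothetical; the fontspec sizes come from the two dockets above, and the heading line for docket 12-729 is assumed to match the 12-609 line apart from its font id.

```python
import xml.etree.ElementTree as ET


def doc(fontspecs, heading_font):
    """Build a one-page test document from (id, size) pairs and a heading font id."""
    specs = "".join(
        f'<fontspec id="{i}" size="{s}" family="Times" color="#000000"/>'
        for i, s in fontspecs
    )
    return ET.fromstring(
        f'<pdf2xml><page number="1">{specs}'
        f'<text top="307" left="235" width="453" height="20" font="{heading_font}">'
        f"<b>SUPREME COURT OF THE UNITED STATES </b></text></page></pdf2xml>"
    )


DOCKET_12_609 = doc([(0, 11), (1, 8), (2, 8), (3, 14), (4, 20), (5, 14), (6, 11)], 4)
DOCKET_12_729 = doc([(0, 11), (1, 8), (2, 8), (3, 14), (4, 11), (5, 20), (6, 14), (7, 14)], 5)


def text_at_size(root, size):
    """Return the text of every line whose font resolves to the given size."""
    ids = {f.get("id") for f in root.iter("fontspec") if int(f.get("size")) == size}
    return [
        "".join(t.itertext()).strip()
        for t in root.iter("text")
        if t.get("font") in ids
    ]


court_609 = text_at_size(DOCKET_12_609, 20)
court_729 = text_at_size(DOCKET_12_729, 20)
```

Both calls return the same heading even though the underlying font id is 4 in one document and 5 in the other.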

Page 13: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: Commonalities

...we noticed that position attributes such as left margin and line height were consistent across all documents for many content parts. These content parts are listed below.

Content Parts with consistent text attribute values:

● Page elements have height="1188" and width="918" on all documents.
● Decision date has the same attributes, left="395" and height="16", on all documents.
● “Cite as” is always top="171" left="366" width="189" height="16" font="0"
● Type of Opinion, e.g. “(Slip Opinion)”, is always top="174" left="234" font="1"
● “Opinion of the Court” is always top="171" left="366" width="189" height="16" font="0"
● “Note” and “Notices” are always left="272" height="13" font="1"
○ Exception: sometimes the first line is indented at left="284"
● The last line of a Note or Notice is always a reference to another court case, which always has top="280" left="272" width="343" height="13" font="1"
● The name of the case (usually [partyA] v. [partyB]) is always height="20" font="3".
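Rules like the ones listed above can be expressed as a small classifier. The sketch below is illustrative only: the `classify` function and its label names are hypothetical, it covers just three of the content parts, and it deliberately keys on position and line height rather than font id, since ids vary between documents. The left="235" height="20" rule for the court name is the one the deck uses when parsing metadata.

```python
def classify(line):
    """Label a line from its top/left/height values; return 'unknown' if no rule fits."""
    top, left, height = line["top"], line["left"], line["height"]
    if top == 174 and left == 234:
        return "opinion_type"   # e.g. "(Slip Opinion)"
    if height == 13 and left in (272, 284):
        return "note"           # 284 is the indented first line of a Note
    if height == 20 and left == 235:
        return "court_name"
    return "unknown"


# Attribute values copied from the slide's excerpt.
slip = {"top": 174, "left": 234, "height": 13}
note_first = {"top": 237, "left": 284, "height": 13}
note_rest = {"top": 247, "left": 272, "height": 13}
court = {"top": 307, "left": 235, "height": 20}
syllabus = {"top": 206, "left": 432, "height": 16}

labels = [classify(x) for x in (slip, note_first, note_rest, court, syllabus)]
```

Exact-match rules are brittle, so a real converter would probably need tolerances and a larger sample of opinions to confirm that these values hold.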

Page 14: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

XML Format: left & height attributes

<text top="171" left="676" width="15" height="16" font="0">1 </text>
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
<text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
<text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
<text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
<text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>

Position of the left margin and line height may indicate a single text block. In this example, most of the lines of the paragraph are indented to 272, while the first line of the paragraph is indented to 284. Still, each line has the same line height.
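Grouping by left margin and line height is also what removes the hard line breaks. A minimal sketch, assuming consecutive lines belong to one block when their heights match and their left margins differ by no more than a small indent: the `reflow` helper and the `INDENT_TOLERANCE` value are assumptions for illustration, and hyphenated line endings are not handled.

```python
# Assumed tolerance; the first-line indent in the example is 284 - 272 = 12.
INDENT_TOLERANCE = 15


def reflow(lines):
    """Merge consecutive lines that share a height and a similar left margin."""
    blocks = []
    for line in lines:
        prev = blocks[-1] if blocks else None
        if (prev is not None
                and prev["height"] == line["height"]
                and abs(prev["left"] - line["left"]) <= INDENT_TOLERANCE):
            prev["parts"].append(line["text"].strip())
            # Track the block's body margin, not the indented first line.
            prev["left"] = min(prev["left"], line["left"])
        else:
            blocks.append({
                "left": line["left"],
                "height": line["height"],
                "parts": [line["text"].strip()],
            })
    # Joining with a space replaces each hard line break.
    return [" ".join(b["parts"]) for b in blocks]


sample = [
    {"left": 432, "height": 16, "text": "Syllabus "},
    {"left": 284, "height": 13,
     "text": "NOTE: Where it is feasible, a syllabus (headnote) will be released, as is"},
    {"left": 272, "height": 13,
     "text": "being done in connection with this case, at the time the opinion is issued."},
    {"left": 235, "height": 20, "text": "SUPREME COURT OF THE UNITED STATES "},
]

paragraphs = reflow(sample)
```

The two Note lines come back as a single paragraph, while the heading lines above and below stay separate because their heights differ.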

Page 15: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Other Standard Format Elements

Page Headings
Footnotes

Page header content alternates on even and odd pages, but the position and text attributes for the included content are the same. Footnotes are always marked in superscript within the text and displayed after a rule at the bottom of the page.

Page 16: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Tools for Text Mining

● Text Encoding Initiative
● Court Listener: https://www.courtlistener.com/
● Semantic Parsing for Legal Texts (conference proceedings): http://www.lrec-conf.org/proceedings/lrec2012/workshops/27.LREC%202012%20Workshop%20Proceedings%20SPLeT.pdf
● Search and Replace

We looked at a number of pre-processing tools to add semantic information to the documents, such as identifying docket number, case names, decision dates, presiding court, Justice name, etc., as well as cited legislation and prior court cases.

Page 17: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Text Encoding Initiative

Text Encoding Initiative allows content producers to tag parts of a document with semantic information.

● footnotes glossing or commenting on any passage could be added;
● pointers linking parts of this text to others could be added;
● proper names of various kinds could be distinguished from the surrounding text;
● detailed bibliographic information about the text's provenance and context could be prefixed to it;
● a linguistic analysis of the passage into sentences, clauses, words, etc., could be provided, each unit being associated with appropriate category codes;
● the text could be segmented into narrative or discourse units;
● systematic analysis or interpretation of the text could be included in the encoding, with potentially complex alignment or linkage between the text and the analysis, or between the text and one or more translations of it;
● passages in the text could be linked to images or sound held on other media.

Source: TEI Lite: Encoding for Interchange: an introduction to the TEI <http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html>

Page 18: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Parsing Metadata

Court Name in XML:

<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>

Could be rendered as the following, because the court name always has attributes left="235" and height="20":

HTML: <p class="Supreme Court of the United States"><b>SUPREME COURT OF THE UNITED STATES </b></p>

XML: <name id="Supreme Court of the United States"><b>SUPREME COURT OF THE UNITED STATES </b></name>

TEI: <rs type="organization" key="SCOTUS"><b>SUPREME COURT OF THE UNITED STATES</b></rs>

As we discovered, certain semantically related blocks of text have common text attributes, like position, line height and font id. We can add semantic markup to identify these blocks.
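Once a line has a label, emitting the markup is a small step. The sketch below shows the HTML case only, under assumptions: `to_html` is a hypothetical helper, and the class names `court-name` and `note` are illustrative, not the ones used in the deck.

```python
from html import escape


def to_html(label, text):
    """Wrap a labelled line in a <p> whose class carries the semantic label."""
    # Escape both values so characters such as & stay valid in the output.
    return f'<p class="{escape(label, quote=True)}">{escape(text)}</p>'


court = to_html("court-name", "SUPREME COURT OF THE UNITED STATES")
cited = to_html("note", "See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337.")
```

A single-token class value is used here because an HTML class attribute containing spaces is read as several separate classes.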

Page 19: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Parsing Citations

● Parse citation of previous opinions or laws and link to cited legislation.
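The slides do not show how citations were to be parsed. As one possible starting point, a regular expression can pick out United States Reports citations of the form seen in the sample Note ("200 U. S. 321, 337"). The pattern and the `find_us_citations` helper are assumptions for illustration; they cover only this one citation style and do not build links.

```python
import re

# volume, "U. S." (with or without the inner space), first page, optional pin cite
US_REPORTS = re.compile(r"(\d+)\s+U\.\s?S\.\s+(\d+)(?:,\s+(\d+))?")


def find_us_citations(text):
    """Return (volume, page, pin) tuples; pin is None when no pin cite is given."""
    return [
        (int(m.group(1)), int(m.group(2)), int(m.group(3)) if m.group(3) else None)
        for m in US_REPORTS.finditer(text)
    ]


note = "See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337."
citations = find_us_citations(note)
```

Turning each tuple into a link would need a target scheme, which is left open here.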


Page 20: Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

Creating the UI

● Create stylesheet and page markup for display on devices.
