More Tools

Copyright © Software Carpentry 2010

This work is licensed under the Creative Commons Attribution License

See http://software-carpentry.org/license.html for more information.

Regular Expressions

Thousands of papers and theses written in LaTeX

Granger's work on graphs \cite{dd-gr2007,gr2009},
particularly ones obeying Snape's Inequality
\cite{ snape87 } (but see \cite{quirrell89}),
has opened up new lines of research.  However,
studies at Unseen University \cite{stibbons2002,
stibbons2008} highlight several dangers.
⋮

All share a common bibliography

Want to see how often citations appear together

First step: extract citation sets from documents

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},
particularly ones obeying Snape's Inequality
\cite{ snape87 } (but see \cite{quirrell89}),
has opened up new lines of research.  However,
studies at Unseen University \cite{stibbons2002,
stibbons2008} highlight several dangers.
⋮

Multiple labels separated by commas

May be white space (including line breaks)

And multiple citations per line

Idea #1: capture everything in cite{…} in a group

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()
('X',)

What about multiple citations?

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()
('X} b \\cite{Y',)

Matching is greedy: '.+' runs on to the last '}' in the string
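One way to see the greedy behaviour is to compare it with a non-greedy quantifier. The sketch below is not from the slides: it is written for Python 3 (the slides use Python 2 print statements), uses raw strings for the patterns, and the variable names are only illustrative.

```python
import re

text = 'a \\cite{X} b \\cite{Y} c'

# Greedy: '.+' takes as much as it can, so the group runs to the last '}'.
greedy = re.search(r'cite{(.+)}', text).group(1)

# Non-greedy: '.+?' takes as little as it can, so the group stops at the first '}'.
lazy = re.search(r'cite{(.+?)}', text).group(1)

print(repr(greedy))  # 'X} b \\cite{Y'
print(repr(lazy))    # 'X'
```

The slides take a different route (a negated character set), which states the intent more directly: a label may contain anything except the closing brace.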

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()
('X',)

What about multiple citations?

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()
('X',)

Need to extract all matches, not just the first
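A side benefit of the negated set is that it copes with citations broken across lines, which the sample text contains. The following Python 3 sketch (not from the slides; the test string and variable names are made up for illustration) contrasts it with '.', which does not match a newline unless re.DOTALL is given.

```python
import re

# A citation whose label list is split across two lines.
text = 'see \\cite{stibbons2002,\nstibbons2008} for details'

# '.' stops at the newline, so no closing '}' is reached and the search fails.
dot_match = re.search(r'cite{(.+)}', text)

# '[^}]' matches any character other than '}', newlines included.
negated_match = re.search(r'cite{([^}]+)}', text)

print(dot_match)               # None
print(negated_match.group(1))  # prints the two labels, still separated by ',\n'
```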

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her knowledge of her language's libraries."

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')
['X', 'Y']

What about spaces?

print re.findall('cite{([^}]+)}', 'a \\cite{ X} b \\cite{Y } c')
[' X', 'Y ']

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')
['X', 'Y ']

Still capturing the space after 'Y': the greedy '[^}]+' takes it before '\\s*' gets a chance

Match the word-to-nonword transition as well

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c')
['X', 'Y']
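The two options mentioned here, tidying up with strip() after matching or putting word boundaries in the pattern, can be compared side by side. This is a Python 3 sketch rather than the slides' Python 2, and the variable names are illustrative only.

```python
import re

text = 'a \\cite{ X} b \\cite{Y } c'

# Option 1: capture loosely, then strip the padding from each result.
stripped = [label.strip() for label in re.findall(r'cite{([^}]+)}', text)]

# Option 2: require a word boundary at each end of the captured group.
bounded = re.findall(r'cite{\s*\b([^}]+)\b\s*}', text)

print(stripped)  # ['X', 'Y']
print(bounded)   # ['X', 'Y']
```

One caveat with the '\b' version: it assumes each citation's content starts and ends with a word character (a letter, digit, or underscore). A citation such as \cite{foo+} would not match at all, whereas the strip() version would still return it.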

What about multiple labels in a single citation?

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')
['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')
['X, Y, Z']

Separating the labels within the same pattern can actually be done, but it's very complex

Simpler: use re.split() to break each match on '\\s*,\\s*'
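A small Python 3 sketch of the splitting step on its own, with a made-up input string, shows why the separator pattern includes the optional white space around the comma.

```python
import re

raw = 'X, Y ,Z'

# Split on a comma together with any white space on either side of it.
labels = re.split(r'\s*,\s*', raw)

# For comparison: a plain string split keeps the padding.
padded = raw.split(',')

print(labels)  # ['X', 'Y', 'Z']
print(padded)  # ['X', ' Y ', 'Z']
```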

# Start with a working skeleton.
def get_citations(text):
    '''Return the set of all citation tags found in a block of text.'''
    return set()

if __name__ == '__main__':
    test = '''\
Granger's work on graphs \cite{dd-gr2007,gr2009},
particularly ones obeying Snape's Inequality
\cite{ snape87 } (but see \cite{quirrell89}),
has opened up new lines of research.  However,
studies at Unseen University \cite{stibbons2002,
stibbons2008} highlight several dangers.'''

    print get_citations(test)

set([])

import re

CITE = 'cite{\\s*\\b([^}]+)\\b\\s*}'
SPLIT = '\\s*,\\s*'

def get_citations(text):
    '''Return the set of all citation tags found in a block of text.'''

    result = set()
    match = re.findall(CITE, text)
    if match:
        for citation in match:
            cites = re.split(SPLIT, citation)
            for c in cites:
                result.add(c)

    return result

import re

# Compile the patterns once so the pattern objects can be reused.
CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}')
SPLIT = re.compile('\\s*,\\s*')

def get_citations(text):
    '''Return the set of all citation tags found in a block of text.'''

    result = set()
    match = CITE.findall(text)
    if match:
        for citations in match:
            label_list = SPLIT.split(citations)
            for label in label_list:
                result.add(label)

    return result
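The compiled version behaves the same as passing the pattern string to the module-level functions; compiling mainly gives the pattern a name and lets it be reused. The Python 3 sketch below checks this, and also writes the pattern as a raw string, which avoids doubling every backslash. The raw-string spelling is a stylistic assumption of this sketch, not something the slides use.

```python
import re

# Raw string: backslashes reach the regex engine without doubling.
CITE = re.compile(r'cite{\s*\b([^}]+)\b\s*}')

text = 'a \\cite{ X} b \\cite{Y } c'

# The raw string spells exactly the same pattern as the doubled-backslash form.
same_pattern = CITE.pattern == 'cite{\\s*\\b([^}]+)\\b\\s*}'

# The compiled object's findall gives the same result as re.findall.
same_result = CITE.findall(text) == re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', text)

print(CITE.findall(text))  # ['X', 'Y']
print(same_pattern)        # True
print(same_result)         # True
```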

# Now test it all out.

if __name__ == '__main__':
    test = '''\
Granger's work on graphs \cite{dd-gr2007,gr2009},
particularly ones obeying Snape's Inequality
\cite{ snape87 } (but see \cite{quirrell89}),
has opened up new lines of research.  However,
studies at Unseen University \cite{stibbons2002,
stibbons2008} highlight several dangers.'''

    print get_citations(test)

set(['gr2009', 'stibbons2002', 'dd-gr2007', 'stibbons2008', 'snape87', 'quirrell89'])
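Extracting citation sets was described as the first step towards seeing how often citations appear together. A possible next step is sketched below. It is not part of the original slides: count_pairs and the two sample documents are hypothetical, the code is Python 3, and get_citations is a compact rewrite of the function above.

```python
import re
from collections import Counter
from itertools import combinations

CITE = re.compile(r'cite{\s*\b([^}]+)\b\s*}')
SPLIT = re.compile(r'\s*,\s*')

def get_citations(text):
    """Return the set of all citation labels found in a block of text."""
    labels = set()
    for group in CITE.findall(text):
        labels.update(SPLIT.split(group))
    return labels

def count_pairs(documents):
    """Count how many documents cite each unordered pair of labels."""
    pairs = Counter()
    for text in documents:
        # Sort so that each pair has one canonical (alphabetical) order.
        for pair in combinations(sorted(get_citations(text)), 2):
            pairs[pair] += 1
    return pairs

# Two made-up documents, for illustration only.
documents = [
    r'Graphs \cite{gr2009, snape87} and \cite{quirrell89}.',
    r'More graphs \cite{gr2009,snape87}.',
]

pairs = count_pairs(documents)
print(pairs[('gr2009', 'snape87')])     # 2
print(pairs[('gr2009', 'quirrell89')])  # 1
```

This counts co-occurrence per document; counting per \cite command or per paragraph would need a different grouping of the text.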

June 2010

created by

Greg Wilson
