Stakeholders in memoQ Server Projects
A Quick Overview
Regular Expression
[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
Matching Text
202ca4c2-749d-4f54-ae02-fdf19939ef10
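A quick way to check the expression against the matching text is sketched below. Python's re module is only a stand-in here (memoQ uses the .NET engine), and the variable names are illustrative.

```python
import re

# Pattern and sample text taken from the slide above.
GUID_PATTERN = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
sample = "202ca4c2-749d-4f54-ae02-fdf19939ef10"

# fullmatch succeeds only if the whole string fits the pattern.
is_guid = re.fullmatch(GUID_PATTERN, sample) is not None

# Upper-case hex digits are outside [0-9a-f], so this variant is rejected.
is_upper_guid = re.fullmatch(GUID_PATTERN, sample.upper()) is not None
```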
The Scary Bit
What Are Regular Expressions?
• They are not a programming language
• Symbols that describe a text pattern
• Used to match, search and manipulate text
• A more powerful “Search and replace”
• Called “regex” for short
• There are several regex engines or “flavours”
• memoQ uses Microsoft .NET
How Long Does It Take to Learn a New Language?
*http://www.effectivelanguagelearning.com/language-guide/language-difficulty
How Long Does It Take to Learn Regex?
You can start creating your own basic expressions within a few minutes.
SIGH OF RELIEF
What Are They Used For?
• Search and match:
– Email addresses
– URLs
– Tags and placeholders
– Phone number formats
– Alternate spellings
– Consistency checks (e.g. lower case v. upper case)
– Trailing spaces
– Punctuation sequences (for segmentation)
– Other repetitive/sequential text
Where in memoQ?
Two Types of Regex Text
Literal characters
bomb
bomb
bomber
A-bomb
The bomb went off.
Bombs off.
b o m b
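The literal expression bomb can be tried against the sample lines with a small sketch. Python's re module stands in for the .NET engine, and in this sketch matching is case-sensitive unless a flag is set; check how the memoQ feature you are using treats case.

```python
import re

samples = ["bomb", "bomber", "A-bomb", "The bomb went off.", "Bombs off."]

# Literal characters match themselves, anywhere in the text.
case_sensitive = [s for s in samples if re.search(r"bomb", s)]

# With the IGNORECASE flag, "Bombs off." matches as well.
case_insensitive = [s for s in samples if re.search(r"bomb", s, re.IGNORECASE)]
```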
Metacharacters
\
.
*
?
+
[]
-
|
()
{}
$
^
Metacharacters
. Any character
* Preceding item zero or more times
? Preceding item zero or one time
+ Preceding item one or more times
[ Begin character set
] End character set
- Separator in ranges
| Either or
{} Bean counting (how many times the preceding item repeats)
^ Start of segment // Negate a character set
$ End of segment
( Begin group
) End group
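A few of these metacharacters in action, as a minimal sketch. Python's re module is used as a stand-in for the .NET engine, and the sample words are made up for illustration.

```python
import re

# ? makes the preceding item optional: both spellings match.
spellings = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# + needs at least one occurrence, * also accepts none.
plus_matches = [w for w in ["ac", "abc", "abbc"] if re.fullmatch(r"ab+c", w)]
star_matches = [w for w in ["ac", "abc", "abbc"] if re.fullmatch(r"ab*c", w)]

# | means either/or, {} counts repetitions, . is any single character.
either = re.findall(r"grey|gray", "grey or gray")
three_digits = re.findall(r"[0-9]{3}", "Room 101, floor 7")
any_char = re.findall(r"b.t", "bat bet b t")

# ^ and $ anchor the match to the start and end of the text.
starts_with_the = re.search(r"^The", "The end") is not None
ends_with_the = re.search(r"The$", "The end") is not None
```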
Character Sets
A character set matches any one of the characters in the set, and only once unless a quantifier (*, +, ? or bean counting {}) follows it.
[a-z] Lower case
[A-Z] Upper case
[a-zA-Z] Any case
[0-9] Digits
[0-9A-Za-z] Digits + letters
\p{Ll} Lower + special letters
\p{Lu} Upper + special letters
\p{L} Any case + special letters
Can be negated using ^
[^0-9] Any character except a digit
Can be combined
[0-9a-e ,]
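Character sets can be tried out with a short sketch like the one below. It uses Python's re module as a stand-in; that module does not support the \p{...} classes, so they are left out, and the sample strings are invented.

```python
import re

# A set matches exactly one character unless a quantifier follows it.
one_letter = re.findall(r"[a-z]", "ab1")
run_of_letters = re.findall(r"[a-z]+", "ab1cd")

# ^ as the first character inside the brackets negates the set.
non_digits = re.findall(r"[^0-9]+", "tel 555 ext 12")

# Ranges and single characters can be combined in one set.
combined = re.findall(r"[0-9a-e ,]+", "12, 3a xyz")

# A range follows character-code order: A-z also covers the
# punctuation that sits between Z and a, such as the underscore.
underscore_in_range = re.fullmatch(r"[A-z]", "_") is not None
```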
Shorthand Character Sets
\d Digit
\w Digit, letter or underscore
\s Whitespace
\b Word boundary (beginning OR end of word)
\t Tab
\r Carriage return
\n New line
\D Not a digit
\W Not a digit, letter or underscore
\S Not a whitespace
\tag memoQ tag
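The shorthands can be checked with a minimal sketch. Python's re module is a stand-in for the .NET engine; \tag is specific to memoQ and is not available here, and the sample text is invented.

```python
import re

text = "Order 66:\t3 items\n"

digits = re.findall(r"\d+", text)          # ['66', '3']
words = re.findall(r"\w+", text)           # ['Order', '66', '3', 'items']
whitespace = re.findall(r"\s", text)       # [' ', '\t', ' ', '\n']
non_digits = re.findall(r"\D+", "a1b22c")  # ['a', 'b', 'c']

# \b matches a position, not a character: "cat" as a whole word only.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")

# \w also accepts the underscore.
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None
```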
“Escaping” Metacharacters
If you need to match a special character in the text, you will have to “escape” it, or mark it for its literal meaning.
This is achieved by putting a backslash in front of it.
\(
\)
\{
\}
\$
\^
\!
\\
\.
\?
\*
\+
\[
\]
\-
\|
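The effect of escaping is easiest to see side by side. The sketch below uses Python's re module as a stand-in; re.escape is a Python helper, not a memoQ feature, and the sample text is invented.

```python
import re

text = "Price (incl. VAT): $5.00"

# Unescaped, . matches any character, so "5x00" is accepted too.
loose = re.search(r"5.00", "5x00") is not None

# Escaped, \. only matches a literal full stop.
strict = re.search(r"5\.00", "5x00") is not None

# Escaping the brackets and the dollar sign matches them literally.
literal = re.search(r"\(incl\. VAT\): \$5\.00", text) is not None

# re.escape adds the backslashes for you.
auto = re.search(re.escape("(incl. VAT)"), text) is not None
```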
Find and Replace
Replace expressions allow you to choose which parts of the text to replace and which parts to keep as they are. This is achieved via groups (): in the replacement, $1 stands for the text matched by the first group.
Search: (\d{1,})\s{1,}[mM][gG]
Replace: $1 mg
Finds: 225 mG
Replaces with: 225 mg
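The same search and replace can be sketched in Python. Note the assumption: Python's re module writes the first group as \1 in the replacement, whereas the $1 on the slide is the .NET form used by memoQ. The sample sentence is invented.

```python
import re

SEARCH = r"(\d{1,})\s{1,}[mM][gG]"

# \1 is Python's spelling of the slide's $1 (first group).
REPLACE = r"\1 mg"

# The number is kept, spacing and case of the unit are normalised.
result = re.sub(SEARCH, REPLACE, "Take 225 mG daily, then 50   MG weekly.")
```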
Greedy v. Lazy
Dangers of Greediness
By default, regular expressions are greedy, so it is a good habit to limit your expressions as much as possible to avoid matching more text than you intend to. Use the non-greedy marker ? after * and +. Example:
Text: “All purées contain at least 10% of the main ingredient, unless otherwise specified in the purée description.”
pur.*\b will match everything from “purées” to “description”, because .* grabs as much text as it can before the last word boundary.
pur.*?\b will match only “purées”, because .*? stops at the first word boundary it reaches.
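The difference between the two expressions can be verified with a small sketch. Python's re module is a stand-in for the .NET engine; like .NET, it treats é as a word character, which is what makes the lazy match stop after the whole word.

```python
import re

text = ("All purées contain at least 10% of the main ingredient, "
        "unless otherwise specified in the purée description.")

# Greedy: runs to the last word boundary in the text.
greedy = re.search(r"pur.*\b", text).group()

# Lazy: stops at the first word boundary after "pur".
lazy = re.search(r"pur.*?\b", text).group()
```

Here greedy covers everything from the first purées up to, but not including, the final full stop, while lazy is just the single word.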
Auto-Translation: Practical Cases
• Email addresses
\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
• URLs
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?
• Phone numbers
\d{5}\s\d{6} matches 01908 443300
\d{5}-\d{6} matches 01908-443300
\+\d{2}\s\(0\)\s\d{4}\s\d{6} matches +44 (0) 1908 443300
• Duplicate word pairs*
(\b\w+ \w+\b) \b\1\b
*Published by Max B. on the Yahoo mQ group
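The email, phone and duplicate-pair expressions above can be exercised with a minimal sketch. Python's re module stands in for the .NET engine; the email address and the sentences are invented for illustration.

```python
import re

EMAIL = r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
PHONES = [
    (r"\d{5}\s\d{6}", "01908 443300"),
    (r"\d{5}-\d{6}", "01908-443300"),
    (r"\+\d{2}\s\(0\)\s\d{4}\s\d{6}", "+44 (0) 1908 443300"),
]
DUPLICATE_PAIR = r"(\b\w+ \w+\b) \b\1\b"

# Pull the address out of running text.
email_found = re.search(EMAIL, "Write to jane.doe@example.co.uk today.").group()

# Each phone pattern must cover its sample number completely.
phones_ok = all(re.fullmatch(pattern, number) for pattern, number in PHONES)

# \1 repeats whatever the first group matched, here a pair of words.
duplicate = re.search(DUPLICATE_PAIR, "Open the file the file menu.")
repeated_pair = duplicate.group(1)
```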
Segmentation: Practical Case
SOURCE: “Manufactured in China (PRC) for the UK market. Ingredients: Lemon Grass Purée (15%), Red Chilli Purée (11%), Onion, Water, Coconut Milk, Red Pepper, Galangal (5%), Sugar (Sulphites), Lime Juice From Concentrate (Sulphites), Salt, Rapeseed Oil, Garlic Purée, Rice Wine Vinegar (Sulphites), Lime Leaves (2.5%), Yeast Extract, Chilli Flakes, Cornflour, Tamarind Paste, Coriander, Cayenne Pepper, Paprika Extract.”
SOLUTION: Split the segment before an opening bracket if the closing bracket is followed by a comma, a space and an upper case letter. In the rule below, #!# marks the position of the split.
[\s]+#!#\([\s]*[\p{L}0-9]*\.?\d*\s*%?\),\s+\p{Lu}
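The splitting behaviour can be approximated outside memoQ. The sketch below is an adaptation, not the rule itself: Python's re module has no #!# marker and no \p{...} classes, so the part after the marker becomes a lookahead, [\p{L}0-9] becomes [^\W_], and \p{Lu} becomes [A-Z] (ASCII only). The function name is hypothetical, and unlike memoQ this sketch drops the whitespace at the split point.

```python
import re

# Approximation of the rule above for Python's re module (assumptions):
#   #!#          -> the text after the marker goes into a lookahead (?=...)
#   [\p{L}0-9]   -> [^\W_]  (letters and digits)
#   \p{Lu}       -> [A-Z]   (ASCII upper case only)
SPLIT_POINT = re.compile(r"\s+(?=\(\s*[^\W_]*\.?\d*\s*%?\),\s+[A-Z])")

def split_ingredients(text):
    """Split before a bracket whose closing bracket is followed by ', X'."""
    return SPLIT_POINT.split(text)

parts = split_ingredients(
    "Lemon Grass Purée (15%), Red Chilli Purée (11%), Onion, Water"
)

# "(PRC)" is followed by a space, not a comma, so nothing is split here.
unchanged = split_ingredients("Manufactured in China (PRC) for the UK market.")
```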
Regex Tagger: Practical Case
SOURCE: “Dear [%$FIRSTNAME%] [%$LASTNAME%], Your online order placed on [%$WEBSITE%] on [%$DATE%] and processed as the authorized vendor of [%$RANGE%] products, has been successfully completed (order number: [%$REFNO%]). Please note that [%if $ORDER != ""%][%$ORDER%][%else%] [%$COMPANY%] will appear on your bank statement, instead of [%$RANGE%].”
SOLUTION: Create a cascading filter (Plain text + Regex tagger) and add the expression below to the tagger.
\[%.*?%\]
OR, if you want to be more strict:
\[%[a-z]+%\]
\[%\$[A-Z]+%\]
\[%if .*?\!\=.*?%\]
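To see which placeholders the tagger expressions would pick up, here is a minimal sketch in Python (a stand-in for the .NET engine) run on a shortened version of the source text. The strict rules are joined with | for convenience, and the conditional rule is written with lazy quantifiers so that it stops at the nearest %].

```python
import re

text = ('Dear [%$FIRSTNAME%] [%$LASTNAME%], please note that '
        '[%if $ORDER != ""%][%$ORDER%][%else%] [%$COMPANY%] will appear.')

# Generic rule: anything between [% and the nearest %].
generic = re.findall(r"\[%.*?%\]", text)

# Stricter rules, one per tag shape, tried as alternatives.
STRICT = r"\[%[a-z]+%\]|\[%\$[A-Z]+%\]|\[%if .*?!=.*?%\]"
strict = re.findall(STRICT, text)

# A greedy .* in the conditional rule runs on to the last %] in the text.
greedy_if = re.search(r"\[%if .*!=.*%\]", text).group()
```

Both rule sets find the same six tags in this sample, while the greedy variant swallows everything from the conditional up to the end of the last tag.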
Resources
• Regex 101
https://regex101.com/
• Regex Pal
http://www.regexpal.com/
• Using regular expressions in memoQ (Basic level), by Miklós Urbán
https://www.memoq.com/recorded-webinars
• “Do the magic: Regular Expressions in FrameMaker”, by Marek Pawelec
https://blogs.adobe.com/techcomm/2016/03/framemaker-regular-expressions.html
• memoQ Yahoo Group
https://groups.yahoo.com/neo/groups/
• Regex Hero
http://regexhero.net/reference/
• Regex Cheat Sheet
https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
Queries and Feedback
Please send any comments, questions or feedback to: