Java Course 7: Text processing, Charsets & Encodings

Lecture 7 from the IAG0040 Java course at TTÜ. See the accompanying source code written during the lectures: https://github.com/angryziber/java-course Do you know the difference between a charset and an encoding? Every programmer nowadays MUST understand these terms, how they work, and how to use them. Otherwise we will constantly face broken software that refuses to work properly with international characters.

Text processing, Charsets & Encodings

Java course - IAG0040

Anton Keks, 2011

String processing

● The following classes provide String processing: String, StringBuilder/Buffer, StringTokenizer

● All primitives can be converted to/from Strings using their wrapper classes (e.g. Integer, Float, etc)

● java.util.regex provides regular expressions

● java.text package provides classes and interfaces for parsing and formatting text, dates, numbers, and messages in a manner independent of natural languages
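
The conversions between primitives and Strings mentioned above can be sketched as follows. The class name StringConversionDemo and the sum() helper are made up for this illustration; the wrapper-class and StringTokenizer calls are standard JDK methods.

```java
import java.util.StringTokenizer;

class StringConversionDemo {
    /** Sums whitespace-separated integers found in the text. */
    static int sum(String text) {
        int total = 0;
        StringTokenizer tokens = new StringTokenizer(text); // splits on whitespace by default
        while (tokens.hasMoreTokens()) {
            total += Integer.parseInt(tokens.nextToken()); // String -> int
        }
        return total;
    }

    public static void main(String[] args) {
        String fromInt = Integer.toString(42);    // int -> String
        String fromFloat = String.valueOf(2.5f);  // float -> String
        float parsed = Float.parseFloat("2.5");   // String -> float
        System.out.println(fromInt + " " + fromFloat + " " + parsed);
        System.out.println(sum("10 20 30"));
    }
}
```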

Locales

● Java also supports locales, just like most OSs

● A java.util.Locale object represents a specific geographical, political, or cultural region.

– There is a default locale, which is used by some String operations (e.g. toUpperCase) and formatters in the java.text package.

– A Locale is initialized with an ISO 2-letter language code (lower case), an ISO 2-letter country code (upper case), and a variant. The latter two are optional.

● e.g. “de”, “et_EE”, “en_GB”
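
A small sketch of why the locale matters for String operations such as toUpperCase. The class LocaleDemo and its upper() helper are hypothetical names. The sketch obtains locales with Locale.forLanguageTag(), assuming Java 7 or newer, because the Locale constructors are deprecated in recent Java versions.

```java
import java.util.Locale;

class LocaleDemo {
    /** Upper-cases text using the rules of the given locale instead of the default one. */
    static String upper(String text, Locale locale) {
        return text.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale estonian = Locale.forLanguageTag("et-EE"); // language "et", country "EE"
        System.out.println(estonian + ": " + estonian.getLanguage() + " / " + estonian.getCountry());
        System.out.println("Default locale: " + Locale.getDefault());

        // Turkish upper-cases "i" to a dotted capital I (U+0130), English to a plain "I"
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(upper("title", Locale.ENGLISH));
        System.out.println(upper("title", turkish));
    }
}
```

Passing the locale explicitly, as upper() does, keeps the result independent of the machine's default locale.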

Localization

● ResourceBundle classes can be used for localization of your programs

– ResourceBundles contain locale-specific objects, e.g. Strings

– ListResourceBundle and PropertyResourceBundle are simple implementations

– ResourceBundle.getBundle(...) returns a locale-specific bundle
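
A minimal sketch of reading locale-specific Strings from bundles. To stay self-contained it builds PropertyResourceBundle objects from in-memory text instead of files; the class BundleDemo, the bundleOf() helper, the "greeting" key and the Messages file names in the comment are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class BundleDemo {
    /** Builds a bundle from in-memory .properties text (a stand-in for a real file). */
    static ResourceBundle bundleOf(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle english = bundleOf("greeting=Hello\n");
        ResourceBundle estonian = bundleOf("greeting=Tere\n");
        System.out.println(english.getString("greeting"));
        System.out.println(estonian.getString("greeting"));
        // In a real program, files such as Messages.properties and Messages_et.properties
        // (hypothetical names) would sit on the classpath and be selected with:
        // ResourceBundle.getBundle("Messages", Locale.forLanguageTag("et"));
    }
}
```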

Natural language comparison

● String.compareTo() does lexicographical comparison, i.e. it compares character codes

● Collators are used for locale-sensitive comparison/sorting, according to the rules of the specific language/locale

– java.text.Collator implements Comparator<Object>, so it can be passed to sorting methods for Strings

– Use Collator.getInstance(...) for obtaining one

– RuleBasedCollator is the common implementation, allows specification of own rules
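
The difference between the two kinds of comparison can be sketched like this. CollatorDemo and its two helper methods are hypothetical names; the accented letter is written as the escape \u00e9 so that the encoding of the source file does not matter.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollatorDemo {
    /** Returns a copy sorted by character codes (String.compareTo). */
    static List<String> sortByCodes(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Returns a copy sorted by the collation rules of the given locale. */
    static List<String> sortByLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "\u00e9clair", "banana", "Apple");
        // By character codes the accented word lands after "zebra" (code 233 > 122)
        System.out.println(sortByCodes(words));
        // The collator treats the accented e as a variant of "e"
        System.out.println(sortByLocale(words, Locale.ENGLISH));
    }
}
```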

StringBuffer vs String

● A StringBuilder (and StringBuffer) is a mutable String

● Use it for complex String processing, especially when doing a lot of concatenations in a loop

● The Java compiler translates the '+' operator on Strings into equivalent StringBuilder calls (since Java 9 it emits an optimized equivalent with the same result)

– String s = a + b + 25; is the same as

– String s = new StringBuilder().append(a).append(b).append(25).toString();

– There are many different append() methods for all primitive types as well as any objects. For an arbitrary object, toString() is called.

● StringBuffer, StringBuilder, and String implement CharSequence

● StringBuilder has the same methods as StringBuffer, but is a bit faster, because it is not thread safe (not internally synchronized)
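
A short sketch of concatenation in a loop with a single StringBuilder. JoinDemo and join() are names chosen for this example only.

```java
class JoinDemo {
    /** Joins the items with a separator, reusing one StringBuilder for all appends. */
    static String join(String[] items, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < items.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(items[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"a", "b", "c"}, ", "));

        // The '+' form and the explicit StringBuilder form produce equal Strings
        String a = "x", b = "y";
        String viaPlus = a + b + 25;
        String viaBuilder = new StringBuilder().append(a).append(b).append(25).toString();
        System.out.println(viaPlus.equals(viaBuilder));
    }
}
```

Using '+=' on a String inside the loop would instead create a new String object on every iteration.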

Formatting and Parsing

● Locale-specific formatting and parsing is provided by java.text.

● java.text.Format is an abstract base class for

– DateFormat (SimpleDateFormat) – date and time. Calendar is used for manipulation of date and time.

– NumberFormat (ChoiceFormat, DecimalFormat) – numbers, currencies, percentages, etc

– MessageFormat – for complex concatenated messages

– all of them provide various format and parse methods

– all of them can be initialized for the default or specified locale using provided static methods

– all of them can be created directly, specifying the custom format
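
The Format subclasses listed above can be tried out with a sketch like the one below. FormatDemo and its helper methods are hypothetical. Locales and the UTC time zone are passed explicitly so that the results do not depend on the machine's defaults.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class FormatDemo {
    /** Formats a number using the conventions of the given locale. */
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    /** Parses a number written in the conventions of the given locale. */
    static double parseNumber(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    /** Formats a date in UTC with a custom SimpleDateFormat pattern. */
    static String formatDate(Date date, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    /** Fills the {0}, {1}, ... placeholders of a message pattern. */
    static String message(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.ENGLISH).format(args);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.891, Locale.US));
        System.out.println(formatNumber(1234567.891, Locale.GERMANY));
        System.out.println(parseNumber("1.234,5", Locale.GERMANY));
        System.out.println(formatDate(new Date(0L), "yyyy-MM-dd"));
        System.out.println(message("{0} has {1} messages", "Anton", 3));
    }
}
```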

Regular expressions

● Regular expressions are patterns that allow easy searching and matching of textual data. They are built into many languages, like Perl and PHP, and are widely used on the Unix command line

● Regular expression classes are in the java.util.regex package.

● In Java, they are written as Strings, but must be 'compiled' by Pattern.compile() before use.

● However, many String methods provide convenient 'shortcuts', like split(), matches(), replaceFirst(), replaceAll(), etc

● Pattern is an immutable compiled representation, which can be used for creation of mutable Matcher objects.

● Use Patterns directly in case you intend to reuse the regexp
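
A sketch contrasting a reused, precompiled Pattern with the String shortcut methods. RegexReuseDemo, DIGITS and countNumbers() are names invented for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexReuseDemo {
    // Compiled once and reused; a Pattern is immutable and can be shared
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Counts runs of digits, creating a fresh (mutable) Matcher for each call. */
    static int countNumbers(String text) {
        Matcher matcher = DIGITS.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumbers("a1 b22 c333"));

        // String shortcuts compile the expression on every call
        System.out.println("2011-10-05".matches("\\d{4}-\\d{2}-\\d{2}"));
        System.out.println(Arrays.toString("a, b,c".split("\\s*,\\s*")));
        System.out.println("a1b22".replaceAll("\\d+", "#"));
    }
}
```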

Regular Expressions (cont)

● Read javadoc of the Pattern class!

– . (a dot) matches any character

– [] matches any one of the characters listed inside, e.g. [abc]

– \s, \S, \d, \w, etc save you typing sometimes (note: double escaping is needed within String literals, e.g. “\\s”)

– ?, +, * specify the number of occurrences of the preceding element: 0 or 1, 1 or more, any number respectively

– () - define groups (they can be accessed individually)

– | means 'or', e.g. (dog|cat) matches both “dog” and “cat”

– ^ and $ match beginning and end of a line, respectively

– \b matches word boundary
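
The syntax elements above can be combined as in the following sketch. RegexSyntaxDemo, the "key = 123" line format and both helper methods are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSyntaxDemo {
    // ^ and $ anchor the input, each () is a group, \\w+ and \\d+ mean "one or more"
    private static final Pattern KEY_VALUE = Pattern.compile("^(\\w+)\\s*=\\s*(\\d+)$");

    // \\b keeps the alternatives from matching inside longer words
    private static final Pattern PET = Pattern.compile("\\b(dog|cat)\\b");

    /** Returns the number from a "key = 123" line, or -1 if the line does not match. */
    static int valueOf(String line) {
        Matcher matcher = KEY_VALUE.matcher(line);
        return matcher.matches() ? Integer.parseInt(matcher.group(2)) : -1;
    }

    /** True if the text contains "dog" or "cat" as a whole word. */
    static boolean mentionsPet(String text) {
        return PET.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(valueOf("port = 8080"));
        System.out.println(valueOf("port = abc"));
        System.out.println(mentionsPet("my cat sleeps"));
        System.out.println(mentionsPet("concatenate"));
    }
}
```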

Scanning

● java.util.Scanner can be used for parsing Strings, InputStreams, Readers, or Files

● It uses either built-in or custom regular expressions for parsing input data, and is sensitive to either the default or a specified Locale

● The default delimiter is whitespace (one or more whitespace characters); a custom delimiter may be set using the useDelimiter() method

● It implements Iterator<String>, therefore has hasNext() and next() methods, various type-specific methods, e.g. hasNextInt(), nextInt(), etc, as well as finding and skipping facilities

● Can be used for parsing the standard input:

– Scanner s = new Scanner(System.in); int n = s.nextInt();
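
A sketch of Scanner working on plain Strings instead of the standard input, which makes it easy to experiment with. ScannerDemo, sumInts() and splitCsv() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerDemo {
    /** Sums all integer tokens in the text, skipping tokens that are not integers. */
    static int sumInts(String text) {
        try (Scanner scanner = new Scanner(text).useLocale(Locale.ROOT)) {
            int sum = 0;
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    sum += scanner.nextInt();
                } else {
                    scanner.next(); // discard a non-integer token
                }
            }
            return sum;
        }
    }

    /** Splits comma-separated values using a custom delimiter expression. */
    static List<String> splitCsv(String line) {
        List<String> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(line).useDelimiter("\\s*,\\s*")) {
            while (scanner.hasNext()) {
                values.add(scanner.next());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("10 apples and 20 pears"));
        System.out.println(splitCsv("a, b ,c"));
    }
}
```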

Charsets and encodings

● In the 21st century, there is no excuse for any programmer not to know charsets and encodings well

● Charsets map glyphs (symbols) to numeric codes

● Charsets are represented by character encodings (actual bits and bytes that are stored in files)

● Fonts must support charsets in order to display texts in respective encodings properly

● Example:

– Glyph (symbol): A

– Numeric code: 65 (ASCII charset)

– Encoding: 0x41 == 1000001 b (ASCII 7-bit encoding)
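
The glyph, numeric code and encoding from the example above can be inspected in Java roughly like this. GlyphCodeBytesDemo and its helpers are names made up for the sketch, which assumes Java 7 or newer for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class GlyphCodeBytesDemo {
    /** Numeric code of a symbol that fits into a single char. */
    static int codeOf(char symbol) {
        return symbol;
    }

    /** Hex dump of the bytes produced by encoding the text as US-ASCII. */
    static String asciiHex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            hex.append(String.format("%02X ", b));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));                       // numeric code
        System.out.println(asciiHex("A"));                     // encoded byte in hex
        System.out.println(Integer.toBinaryString(codeOf('A'))); // the same value in bits
    }
}
```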

ASCII

● American Standard Code for Information Interchange

● Created in 1963, ANSI in 1967, ISO-646 in 1972

● Allowed for text exchange between computers

● Only 7 bits are defined, nowadays called US-ASCII

● 0-31 and 127 – control chars

● 32-126 – printable (32 is the space)

● Was designed for the English language

ASCII extensions

● ASCII is enough only for Latin, English, Hawaiian and Swahili

● For most other languages a number of 8-bit ASCII extensions were developed, incompatible with each other

● ISO-8859 was an attempt to standardize them by defining the upper 128 characters in 8-bit wide bytes

– All of them keep the lower 128 codes (7 bits) the same as ASCII

– ISO-8859-1 (Latin-1) – Western European

– ISO-8859-4 – Northern European, ISO-8859-13 – Baltic, WIN-1257 – MS Baltic (modified ISO)

– ISO-8859-5, KOI8-R – Cyrillic, WIN-1251 – MS Cyrillic (different from ISO)

– Many of them are still used today in legacy systems or formats
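
The incompatibility of these 8-bit extensions can be seen by decoding the same byte with different charsets: 0xE4 is "ä" in ISO-8859-1 but a Cyrillic letter in the Cyrillic charsets. SameByteDemo and decode() are hypothetical names, and the sketch assumes that the JRE ships the listed charsets, so it checks Charset.isSupported() first.

```java
import java.nio.charset.Charset;

class SameByteDemo {
    /** Decodes a single byte using the named charset. */
    static String decode(int byteValue, String charsetName) {
        return new String(new byte[] {(byte) byteValue}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = {"ISO-8859-1", "ISO-8859-5", "KOI8-R", "windows-1251"};
        for (String name : names) {
            if (Charset.isSupported(name)) { // not every JRE ships every charset
                String decoded = decode(0xE4, name);
                // Print the code point, so the output does not depend on the console encoding
                System.out.printf("%-12s U+%04X%n", name, (int) decoded.charAt(0));
            }
        }
    }
}
```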

Unicode (UCS, ISO-10646)

● Unicode solves the problem of incompatible charsets

● Unicode defines standardized numeric codes (code points) for most glyphs used in the world

– Code points are abstract – they don't define representation

– First 256 code points correspond to ISO-8859-1

– 16 bit BMP (Basic Multilingual Plane) – most modern languages (including Chinese, Japanese, etc)

– More planes for other scripts (mathematical symbols, musical notation, ancient alphabets, etc)

● Apart from UCS, Unicode defines formatting and combining rules as well (e.g. for bidirectional text)

Unicode encodings

● Define representation of code points in bits and bytes

● Fixed-width UCS-2 (2 bytes, BMP only) and UCS-4 (4 bytes)

● UTF (Unicode Transformation Format)

– All of them can encode any Unicode code point

– UTF-8 – variable size from 1 to 4 bytes (the original design allowed up to 6; usually no longer than 3 bytes, compatible with ASCII), the most popular and compact

– UTF-16 – 2 or 4 bytes, 2 bytes for BMP code points, 4 bytes for other planes

– UTF-32 – constant size, 4 bytes per character, 'raw' unicode

– UTF-7 – 7-bit safe encoding (less popular nowadays)
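
The sizes listed above can be checked with a sketch like the following. EncodedLengthDemo and bytes() are made-up names. The big-endian variants are used because Java's plain "UTF-16" encoder prepends a byte order mark, and "UTF-32BE" is assumed to be available in the JRE (it is not among the charsets every Java platform must support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    /** Number of bytes the text occupies in the given encoding. */
    static int bytes(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+0041, U+00E4, U+20AC and U+1D11E (outside the BMP, written as a surrogate pair)
        String[] samples = {"A", "\u00e4", "\u20ac", "\uD834\uDD1E"};
        Charset utf32 = Charset.forName("UTF-32BE"); // assumption: supported by this JRE
        for (String sample : samples) {
            System.out.printf("U+%04X: UTF-8=%d UTF-16=%d UTF-32=%d%n",
                    sample.codePointAt(0),
                    bytes(sample, StandardCharsets.UTF_8),
                    bytes(sample, StandardCharsets.UTF_16BE),
                    bytes(sample, utf32));
        }
    }
}
```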

Charsets and Java

● char and String are UTF-16

– Beware that length(), indexOf(), etc operate on chars (UTF-16 code units), not Unicode code points, and can therefore return 'logically wrong' values for characters outside the BMP, which take two chars (a surrogate pair) – this was a performance decision

● Encoding conversions are built-in

– Encoded text is binary data for Java, therefore stored in bytes

– There always exists a default encoding (traditionally the one the OS uses)

– Charset class is provided for encoding/decoding, enumeration, etc

– s.getBytes(...) - encodes a String

– new String(...) - decodes raw bytes to a String

– System.out automatically converts to the default encoding; System.in supplies raw bytes, which a Scanner or InputStreamReader decodes using the default encoding unless told otherwise
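
A closing sketch of these points: chars versus code points, and encoding and decoding with explicit charsets. JavaCharsetDemo and reinterpret() are hypothetical names; non-ASCII characters are written as escapes so that the encoding of the source file does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JavaCharsetDemo {
    /** Encodes with one charset and decodes with another, as a misconfigured program would. */
    static String reinterpret(String text, Charset from, Charset to) {
        return new String(text.getBytes(from), to);
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, a single code point outside the BMP
        System.out.println(clef.length());                         // counts chars
        System.out.println(clef.codePointCount(0, clef.length())); // counts code points

        System.out.println("Default encoding: " + Charset.defaultCharset());

        String text = "\u00e4"; // "a" with diaeresis
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // encode
        String decoded = new String(utf8, StandardCharsets.UTF_8);  // decode
        System.out.println(decoded.equals(text));

        // Decoding UTF-8 bytes as ISO-8859-1 yields two wrong characters instead of one
        String broken = reinterpret(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(broken.length());
    }
}
```

Always naming the charset in getBytes(...) and new String(...) avoids depending on the default encoding of the machine the program happens to run on.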