Character escaping utilities.
Classes
CharEscapers
Utility functions for encoding and decoding URIs.
Escaper
An object that converts literal text into a format safe for inclusion in a particular context (such as an XML document). Typically (but not always), the inverse process of "unescaping" the text is performed automatically by the relevant parser.
For example, an XML escaper would convert the literal string "Foo<Bar>" into "Foo&lt;Bar&gt;" to prevent "<Bar>" from being confused with an XML tag. When the resulting XML document is parsed, the parser API will return this text as the original literal string "Foo<Bar>".
An Escaper instance is required to be stateless and safe when used concurrently by multiple threads.
Several popular escapers are defined as constants in the class CharEscapers.
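The "Foo<Bar>" example can be sketched as a small stateless class. This is only an illustration of the idea, not the library's implementation: the class name XmlEscapeSketch and the choice of five escaped characters are assumptions made for the example.

```java
// Hypothetical sketch of an XML text escaper; not part of the library.
final class XmlEscapeSketch {
    private XmlEscapeSketch() {}

    /** Replaces characters that are significant in XML markup with entity references. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Foo<Bar>")); // prints Foo&lt;Bar&gt;
    }
}
```

The method keeps no fields and touches only its argument and a local buffer, which is one simple way to meet the statelessness and thread-safety requirement.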
PercentEscaper
A UnicodeEscaper that escapes some set of Java characters using the URI percent-encoding scheme. The set of safe characters (those which remain unescaped) is specified on construction.
For details on escaping URIs for use in web pages, see RFC 3986, section 2.4 and appendix A.
When encoding a String, the following rules apply:
- The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
- Any additionally specified safe characters remain the same.
- If plusForSpace is true, the space character " " is converted into a plus sign "+".
- All other characters are converted into one or more bytes using UTF-8 encoding. Each byte is then represented by the 3-character string "%XY", where "XY" is the two-digit, uppercase, hexadecimal representation of the byte value.
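The four rules above can be sketched as a standalone method. This is a minimal illustration under stated assumptions, not the PercentEscaper source: the class name PercentEncodingSketch, the parameter name extraSafeChars, and the choice to let a safe space take precedence over plusForSpace are all hypothetical, and malformed input such as an unpaired surrogate is not rejected here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the encoding rules; not the library's PercentEscaper.
final class PercentEncodingSketch {
    private static final char[] UPPER_HEX = "0123456789ABCDEF".toCharArray();

    private PercentEncodingSketch() {}

    static String escape(String input, String extraSafeChars, boolean plusForSpace) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Work on whole code points so surrogate pairs are encoded as one character.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isAsciiAlphanumeric(cp) || extraSafeChars.indexOf(cp) >= 0) {
                out.appendCodePoint(cp);            // rules 1 and 2: keep as-is
            } else if (plusForSpace && cp == ' ') {
                out.append('+');                    // rule 3: space becomes plus
            } else {
                // Rule 4: UTF-8 encode, then write each byte as %XY in uppercase hex.
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(UPPER_HEX[(b >> 4) & 0xF]).append(UPPER_HEX[b & 0xF]);
                }
            }
        }
        return out.toString();
    }

    private static boolean isAsciiAlphanumeric(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || (cp >= '0' && cp <= '9');
    }

    public static void main(String[] args) {
        System.out.println(escape("a b/c", "-_.~", true)); // prints a+b%2Fc
    }
}
```

A character outside the ASCII range expands to several escapes because each UTF-8 byte is written separately; for instance, U+00E9 becomes "%C3%A9".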
RFC 3986 defines the set of unreserved characters as the ASCII letters and digits plus "-", "_", "~", and ".". It goes on to state:
URIs that differ in the replacement of an unreserved character with its corresponding
percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI
comparison implementations do not always perform normalization prior to comparison (see Section
6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT
(%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by
URI producers and, when found in a URI, should be decoded to their corresponding unreserved
characters by URI normalizers.
Note: This escaper produces uppercase hexadecimal sequences. From RFC 3986:
"URI producers and normalizers should use uppercase hexadecimal digits for all
percent-encodings."
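The two RFC 3986 recommendations quoted above (decode percent-encoded unreserved characters, and use uppercase hexadecimal digits) can be illustrated with a small normalizer. This is a hedged sketch only: the class UnreservedNormalizerSketch and its normalize method are hypothetical names, the library is not known to offer such a method, and the sketch leaves malformed sequences such as "%zz" untouched.

```java
// Hypothetical sketch of the normalization RFC 3986 recommends; not a library API.
final class UnreservedNormalizerSketch {
    private static final char[] UPPER_HEX = "0123456789ABCDEF".toCharArray();

    private UnreservedNormalizerSketch() {}

    static String normalize(String uri) {
        StringBuilder out = new StringBuilder(uri.length());
        int i = 0;
        while (i < uri.length()) {
            char c = uri.charAt(i);
            if (c == '%' && i + 2 < uri.length()) {
                int hi = hexValue(uri.charAt(i + 1));
                int lo = hexValue(uri.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        out.append((char) value);   // decode ALPHA, DIGIT, "-", ".", "_", "~"
                    } else {
                        // Keep the escape, but rewrite its hex digits in uppercase.
                        out.append('%').append(UPPER_HEX[hi]).append(UPPER_HEX[lo]);
                    }
                    i += 3;
                    continue;
                }
            }
            out.append(c);                          // literal character or malformed escape
            i++;
        }
        return out.toString();
    }

    private static boolean isUnreserved(int v) {
        return (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z') || (v >= '0' && v <= '9')
                || v == '-' || v == '.' || v == '_' || v == '~';
    }

    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(normalize("%7Euser%2fdocs")); // prints ~user%2Fdocs
    }
}
```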
UnicodeEscaper
An Escaper that converts literal text into a format safe for inclusion in a particular context (such as an XML document). Typically (but not always), the inverse process of "unescaping" the text is performed automatically by the relevant parser.
For example, an XML escaper would convert the literal string "Foo<Bar>" into "Foo&lt;Bar&gt;" to prevent "<Bar>" from being confused with an XML tag. When the resulting XML document is parsed, the parser API will return this text as the original literal string "Foo<Bar>".
Because there are important reasons, including potential security issues, to handle Unicode correctly, you should favor UnicodeEscaper wherever possible if you are considering implementing a new escaper.
A UnicodeEscaper instance is required to be stateless and safe when used concurrently by multiple threads.
Several popular escapers are defined as constants in the class CharEscapers. To create your own escapers, extend this class and implement the escape(int) method.
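The pattern of supplying a per-code-point escape(int) method can be sketched without the library on the classpath. Everything below is a stand-in written for illustration: CodePointEscaperSketch is a hypothetical base class, not the real UnicodeEscaper, and the convention that escape(int) returns null for "leave unchanged" is an assumption of this sketch. Check the actual UnicodeEscaper reference for its abstract methods before subclassing it.

```java
import java.util.Locale;

// Hypothetical stand-in for a code-point-based escaper base class.
abstract class CodePointEscaperSketch {
    /** Returns the replacement for one code point, or null to leave it unchanged. */
    protected abstract char[] escape(int codePoint);

    /** Walks the text by code point so a surrogate pair reaches escape(int) as one value. */
    public final String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            char[] replacement = escape(cp);
            if (replacement == null) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
        }
        return out.toString();
    }
}

// Example subclass: writes every non-ASCII code point as an XML numeric character reference.
final class NonAsciiXmlEscaperSketch extends CodePointEscaperSketch {
    @Override
    protected char[] escape(int codePoint) {
        if (codePoint < 0x80) {
            return null; // ASCII passes through untouched
        }
        String hex = Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
        return ("&#x" + hex + ";").toCharArray();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as two Java chars but is escaped as a single reference.
        System.out.println(new NonAsciiXmlEscaperSketch().escape("a\uD83D\uDE00b")); // prints a&#x1F600;b
    }
}
```

Working on code points rather than individual chars is what keeps a surrogate pair from being split into two meaningless escapes, which is the kind of Unicode handling error the text warns about. The subclass holds no fields, so it is stateless and can be shared between threads.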