Trac internals deal almost exclusively with Unicode text, represented by unicode instances. The main advantage of using unicode over UTF-8 encoded str (as was the case before version 0.10) is that the text transformation functions in this module operate safely on individual characters and never risk cutting a multi-byte sequence in the middle. Similar issues with Python string handling routines are avoided as well, such as surprising results when splitting text into lines. For example, did you know that “Priorità” is encoded as 'Priorit\xc3\xa0' in UTF-8? Calling strip() on this value in some locales can cut away the trailing \xa0, and the result is no longer valid UTF-8.
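The hazard is easy to reproduce. The following is only an illustration, not code from this module, and it assumes Python 3 notation, where bytes plays the role of the old str and str the role of unicode. The trailing byte is dropped by hand to imitate an over-eager byte-level strip.

```python
# Python 3 notation (assumption): bytes ~ old str, str ~ old unicode.
text = "Priorit\u00e0"          # "Priorità" as text
raw = text.encode("utf-8")      # b'Priorit\xc3\xa0'

# Text operations count characters; byte operations count bytes.
char_count = len(text)
byte_count = len(raw)

# Dropping the last byte leaves a dangling \xc3 lead byte,
# so the remainder can no longer be decoded as UTF-8.
damaged = raw[:-1]
try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
```

Working on the text value instead (for example text.strip()) cannot split the character, because "à" is a single code point there.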
The drawback is that most of the outside world, while ultimately “Unicode”, is definitely not unicode. This is why we need to convert back and forth between str and unicode at the boundaries of the system, and more often than not we even have to guess which encoding an incoming str uses.
Encoding unicode to str is usually done directly by calling encode() on the unicode instance, while decoding is preferably left to the to_unicode helper function, which converts str to unicode in a robust way that is guaranteed to succeed.
Convert input to a unicode object.
For a str object, we’ll first try to decode the bytes using the given charset encoding (or UTF-8 if none is specified), then we fall back to the latin1 encoding which might be correct or not, but at least preserves the original byte sequence by mapping each byte to the corresponding unicode code point in the range U+0000 to U+00FF.
For anything else, a simple unicode() conversion is attempted, with special care taken with Exception objects.
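The decoding strategy described above can be sketched as follows. This is a minimal approximation in Python 3 notation (bytes in, str out), not the actual implementation; the name to_text_sketch and the way exception arguments are joined are assumptions made for illustration.

```python
def to_text_sketch(value, charset=None):
    """Hypothetical stand-in for the conversion described above."""
    if isinstance(value, bytes):
        try:
            # First attempt: the given charset, or UTF-8 by default.
            return value.decode(charset or "utf-8")
        except (UnicodeDecodeError, LookupError):
            # latin1 maps every byte to U+0000..U+00FF, so it cannot fail.
            return value.decode("latin1")
    if isinstance(value, Exception):
        # Simplification (assumption): join the exception arguments as text.
        return ", ".join(to_text_sketch(arg) for arg in value.args)
    return str(value)
```

The latin1 fallback may produce the wrong characters, but the original bytes can always be recovered by encoding the result back to latin1.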
Convert an Exception to a unicode object.
In addition to to_unicode, this representation of the exception also contains the class name and optionally the traceback.
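As a rough illustration of that representation, the sketch below prefixes the message with the class name and optionally prepends the formatted traceback. It assumes Python 3; the function name, the with_traceback flag and the exact layout are hypothetical and not taken from the module.

```python
import traceback

def exception_to_text_sketch(exc, with_traceback=False):
    """Hypothetical helper: 'ClassName: message', optionally with traceback."""
    message = "%s: %s" % (exc.__class__.__name__, exc)
    if with_traceback:
        # format_exception returns a list of lines ending in newlines.
        tb_text = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
        message = "\n%s\n%s" % (tb_text, message)
    return message
```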
A unicode aware version of urllib.quote.
A unicode aware version of urllib.quote_plus.
A unicode aware version of urllib.unquote.
Parameters: str – UTF-8 encoded str value (for example, as obtained by unicode_quote).
Return type: unicode
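To show what quoting and unquoting look like for non-ASCII text, here is a small round trip. It uses the Python 3 standard library functions urllib.parse.quote, quote_plus and unquote as stand-ins (an assumption: they already encode text as UTF-8), not the helpers documented here.

```python
from urllib.parse import quote, quote_plus, unquote

label = "Priorit\u00e0 alta/1"

# Non-ASCII characters are percent-encoded from their UTF-8 bytes.
quoted = quote(label, safe="/")    # '/' is left untouched
quoted_plus = quote_plus(label)    # spaces become '+', '/' is escaped

# Unquoting decodes the percent-escapes back to text.
restored = unquote(quoted)
```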
A unicode aware version of urllib.urlencode.
Values set to empty are converted to the key alone, without the equal sign.
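The "key alone" behaviour can be sketched as below. This is an approximation in Python 3, not the module's code: urlencode_sketch is a hypothetical name, and None is used as a stand-in for the "empty" marker, which is an assumption.

```python
from urllib.parse import quote_plus

def urlencode_sketch(params):
    """params: list of (key, value) pairs; None stands in for 'empty'."""
    parts = []
    for key, value in params:
        key = quote_plus(str(key))
        if value is None:
            # Empty value: emit the key without an equal sign.
            parts.append(key)
        else:
            parts.append(key + "=" + quote_plus(str(value)))
    return "&".join(parts)

query = urlencode_sketch([("q", "Priorit\u00e0 alta"), ("verbose", None)])
```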
Quote strings for query string
Quote strings for inclusion in single or double quote delimited Javascript strings
Embed the given string in a double quote delimited Javascript string (conforming to the JSON spec).
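Since the result is meant to conform to the JSON spec, the standard json module gives a reasonable picture of the output. The sketch below is an assumption-laden stand-in (the name to_js_string_sketch is hypothetical), and the real helper may escape additional characters.

```python
import json

def to_js_string_sketch(text):
    """Hypothetical stand-in: a double-quoted, JSON-conforming string literal."""
    return json.dumps(text)

literal = to_js_string_sketch('say "hi"\n')
```

Parsing the literal with json.loads gives back the original text, which is a convenient way to check the escaping.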
Convert a filesystem path to unicode, using the filesystem encoding.
Return the appropriate encoding for the given stream.
Output the given arguments to the console, encoding the output as appropriate.
Parameters: kwargs – newline controls whether a newline will be appended (defaults to True).
Do a console_print on sys.stdout.
Do a console_print on sys.stderr.
Read one line from the console and convert it to unicode as appropriate.
Pretty print content size information with appropriate unit.
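One plausible way to pick the unit is sketched below. The thresholds, unit names, the fmt parameter and the function name are all assumptions made for illustration, not the documented behaviour.

```python
def pretty_size_sketch(size, fmt="%.1f"):
    """Hypothetical helper: format a byte count with a binary unit."""
    if size is None:
        return ""
    if size < 1024:
        return "%d byte%s" % (size, "" if size == 1 else "s")
    value = float(size)
    unit = "KB"
    for unit in ("KB", "MB", "GB", "TB"):
        value /= 1024.0
        if value < 1024:
            break  # the value now fits the current unit
    return (fmt + " %s") % (value, unit)
```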
Make a path breakable after path separators, and conversely, avoid breaking at spaces.
Normalize whitespace in a string, by replacing special spaces by normal spaces and removing zero-width spaces.
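A minimal sketch of this normalization, in Python 3, could look as follows. Which characters count as "special spaces" is an assumption here (only the no-break space U+00A0), as are the parameter names and defaults.

```python
def normalize_whitespace_sketch(text, to_space="\u00a0", remove="\u200b"):
    """Hypothetical helper; the character sets are assumptions."""
    if not text:
        return text
    for ch in to_space:
        text = text.replace(ch, " ")   # special space -> normal space
    for ch in remove:
        text = text.replace(ch, "")    # zero-width space -> dropped
    return text
```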
Remove (one level of) enclosing single or double quotes.
New in version 1.0.
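"One level" means that only the outermost pair of matching quotes is removed. A hedged sketch, with a hypothetical function name:

```python
def unquote_label_sketch(text):
    """Hypothetical helper: strip one pair of matching enclosing quotes."""
    if text and len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        return text[1:-1]
    return text
```

For instance, a double-quoted value that itself contains single quotes keeps the inner quotes.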
Fix end-of-lines in a text.
Expand tab characters '\t' into spaces.
Replace anything looking like an e-mail address ('@something') with a trailing ellipsis ('@…')
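The effect can be approximated with a regular expression. The pattern below is an assumption about what "looks like" the domain part of an address, and the function name is hypothetical; the actual helper may match differently.

```python
import re

# Assumed pattern: '@' followed by word characters, dots or hyphens.
_AT_DOMAIN = re.compile(r"@[\w.-]+")

def obfuscate_sketch(text):
    """Hypothetical helper: replace '@something' with '@' plus an ellipsis."""
    return _AT_DOMAIN.sub("@\u2026", text)
```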
Determine the column width of text in Unicode characters.
Characters in the East Asian Fullwidth (F) or East Asian Wide (W) categories have a column width of 2. Characters in the East Asian Halfwidth (H) or East Asian Narrow (Na) categories have a column width of 1.
The ambiwidth parameter sets the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice as wide as US-ASCII characters, which is what CJK users expect.
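The width rules can be sketched with the standard unicodedata module. This is an illustrative approximation with a hypothetical function name, not the module's implementation; every character outside the wide categories is counted as 1 column.

```python
import unicodedata

def text_width_sketch(text, ambiwidth=1):
    """Hypothetical helper: column width based on East Asian Width."""
    wide = ("F", "W", "A") if ambiwidth == 2 else ("F", "W")
    return sum(2 if unicodedata.east_asian_width(ch) in wide else 1
               for ch in text)
```

With this sketch, three CJK ideographs occupy six columns, while a Greek letter such as "α" (category A) occupies one or two columns depending on ambiwidth.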
Print data according to a tabular layout.
Truncates text to a length less than or equal to maxlen characters.
This tries to be (a bit) clever and attempts to find a proper word boundary for doing so.
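A word-boundary truncation of this kind might look like the sketch below. The suffix, the default maxlen and the function name are assumptions for illustration; the real heuristic may differ.

```python
def shorten_line_sketch(text, maxlen=75, suffix=" ..."):
    """Hypothetical helper: cut at the last space or newline that fits."""
    if len(text) <= maxlen:
        return text
    limit = max(maxlen - len(suffix), 0)
    # Look for the last word boundary at or before the limit.
    cut = max(text.rfind(" ", 0, limit + 1), text.rfind("\n", 0, limit + 1))
    if cut <= 0:
        cut = limit  # no boundary found: hard cut
    return text[:cut] + suffix

short = shorten_line_sketch("The quick brown fox jumps over the lazy dog", 20)
```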
Strips Unicode whitespace and zero-width spaces (ZWSP) from text.
Wraps the single paragraph in t, which contains unicode characters. Every line is at most cols characters long.
The ambiwidth parameter sets the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice as wide as US-ASCII characters, which is what CJK users expect.
Safe conversion of text to base64 representation using utf-8 bytes.
Strips newlines from output unless strip_newlines is False.
Safe conversion of text to unicode based on utf-8 bytes.
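The two base64 conversions form a round trip through UTF-8 bytes. The sketch below uses the Python 3 base64 module; the function names are hypothetical and the details (for example where newlines are inserted) are assumptions rather than the module's exact behaviour.

```python
import base64

def text_to_base64_sketch(text, strip_newlines=True):
    """Hypothetical helper: text -> UTF-8 bytes -> base64 text."""
    encoded = base64.encodebytes(text.encode("utf-8")).decode("ascii")
    if strip_newlines:
        # encodebytes inserts line breaks; drop them on request.
        encoded = encoded.replace("\n", "")
    return encoded

def text_from_base64_sketch(encoded):
    """Hypothetical helper: base64 text -> UTF-8 bytes -> text."""
    return base64.b64decode(encoded).decode("utf-8")

encoded = text_to_base64_sketch("Priorit\u00e0")
decoded = text_from_base64_sketch(encoded)
```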