trac.util.text – Text manipulation

The Unicode toolbox

Trac internals deal almost exclusively with Unicode text, represented by unicode instances. The main advantage of using unicode over UTF-8 encoded str (as was the case before version 0.10) is that the text transformation functions in the present module operate safely on individual characters and never risk cutting a multi-byte sequence in the middle. Similar issues with Python string handling routines are avoided as well, such as surprising results when splitting text into lines. For example, did you know that “Priorità” is encoded as 'Priorit\xc3\xa0' in UTF-8? Calling strip() on this value in some locales can cut away the trailing \xa0, and the result is no longer valid UTF-8...
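The multi-byte hazard described above can be reproduced in a few lines. This is a minimal sketch written for Python 3, where bytes plays the role of this page's str and str the role of unicode; the helper is_valid_utf8 is a hypothetical name, and the locale-dependent strip() is simulated by stripping the byte 0xA0 explicitly.

```python
text = "Priorità"
encoded = text.encode("utf-8")          # b'Priorit\xc3\xa0'

# Simulate a byte-level strip that treats 0xA0 (NO-BREAK SPACE in
# Latin-1) as whitespace, as a locale-aware strip() might do.
damaged = encoded.rstrip(b"\xa0")       # b'Priorit\xc3'


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Stripping at the character level cannot split a multi-byte sequence:
# the trailing NO-BREAK SPACE is removed and the accented letter survives.
stripped_text = (text + "\u00a0").strip()
```

Here is_valid_utf8(encoded) is True while is_valid_utf8(damaged) is False, because the lead byte 0xC3 has lost its continuation byte.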

The drawback is that most of the outside world, while eventually “Unicode”, is definitely not unicode. This is why we need to convert back and forth between str and unicode at the boundaries of the system. And more often than not we even have to guess which encoding is used in the incoming str strings.

Encoding unicode to str is usually directly performed by calling encode() on the unicode instance, while decoding is preferably left to the to_unicode helper function, which converts str to unicode in a robust and guaranteed successful way.

trac.util.text.to_unicode(text, charset=None)

Convert input to a unicode object.

For a str object, we’ll first try to decode the bytes using the given charset encoding (or UTF-8 if none is specified), then we fall back to the latin1 encoding which might be correct or not, but at least preserves the original byte sequence by mapping each byte to the corresponding unicode code point in the range U+0000 to U+00FF.

For anything else, a simple unicode() conversion is attempted, with special care taken with Exception objects.
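The decoding strategy described above can be sketched as follows. This is an illustration in Python 3 terms (bytes in, str out), not Trac's implementation; the name to_unicode_sketch is hypothetical, and the special handling of Exception objects is deliberately left out.

```python
def to_unicode_sketch(text, charset=None):
    """Decode bytes with a latin1 fallback; convert anything else with str()."""
    if isinstance(text, bytes):
        try:
            # First attempt: the given charset, or UTF-8 if none is given.
            return text.decode(charset or "utf-8")
        except UnicodeDecodeError:
            # latin1 maps every byte to U+0000..U+00FF, so this cannot fail
            # and the original byte sequence remains recoverable.
            return text.decode("latin1")
    return str(text)
```

The latin1 fallback may produce the wrong characters, but it never raises, which is what makes the conversion safe to call at system boundaries.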

trac.util.text.exception_to_unicode(e, traceback=False)

Convert an Exception to a unicode object.

In addition to what to_unicode produces, this representation of the exception also contains the class name and, optionally, the traceback.
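As a rough illustration of such a representation, the sketch below prefixes the message with the class name and optionally prepends a formatted traceback. The exact layout ("ClassName: message", traceback first) is an assumption of this sketch, as is the name exception_to_text_sketch; it is written for Python 3.

```python
import traceback as traceback_module


def exception_to_text_sketch(e, traceback=False):
    """Describe an exception as 'ClassName: message', optionally with traceback."""
    message = "%s: %s" % (type(e).__name__, e)
    if traceback:
        # format_exception returns a list of lines ending in newlines.
        lines = traceback_module.format_exception(type(e), e, e.__traceback__)
        message = "%s\n%s" % ("".join(lines), message)
    return message
```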

Web utilities

trac.util.text.unicode_quote(value, safe='/')

A unicode aware version of urllib.quote.

Parameters:
  • value – anything that converts to a str. If unicode input is given, it will be UTF-8 encoded.
  • safe – as in quote, the characters that would otherwise be quoted but shouldn’t here (defaults to ‘/’)
trac.util.text.unicode_quote_plus(value, safe='')

A unicode aware version of urllib.quote_plus.

Parameters:
  • value – anything that converts to a str. If unicode input is given, it will be UTF-8 encoded.
  • safe – as in quote_plus, the characters that would otherwise be quoted but shouldn’t here (defaults to the empty string)
trac.util.text.unicode_unquote(value)

A unicode aware version of urllib.unquote.

Parameters:
  • value – UTF-8 encoded str value (for example, as obtained by unicode_quote).
Return type: unicode
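The quoting and unquoting pair can be sketched with the Python 3 standard library, where urllib.quote and urllib.unquote live in urllib.parse. The function names below are hypothetical stand-ins, not Trac's code; the point is only that text is UTF-8 encoded before percent-encoding and decoded as UTF-8 afterwards.

```python
from urllib.parse import quote, unquote


def unicode_quote_sketch(value, safe="/"):
    """Percent-encode value, encoding text as UTF-8 first."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    elif not isinstance(value, bytes):
        # "Anything that converts to a str": numbers, for example.
        value = str(value).encode("utf-8")
    return quote(value, safe=safe)


def unicode_unquote_sketch(value):
    """Decode percent-escapes and interpret the resulting bytes as UTF-8."""
    return unquote(value, encoding="utf-8")
```

With these definitions, unicode_quote_sketch("Priorità/x y") gives 'Priorit%C3%A0/x%20y', and unicode_unquote_sketch restores the original text.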
trac.util.text.unicode_urlencode(params, safe='')

A unicode aware version of urllib.urlencode.

Values set to empty are converted to the key alone, without the equal sign.
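The "key alone" behaviour can be illustrated with a small sketch. EMPTY is a hypothetical local sentinel standing in for trac.util.text.empty, urlencode_sketch is a hypothetical name, and sorting dictionary items is a choice made here only to get deterministic output.

```python
from urllib.parse import quote_plus

EMPTY = object()  # hypothetical stand-in for trac.util.text.empty


def urlencode_sketch(params, safe=""):
    """Build a query string; values equal to EMPTY yield the bare key."""
    if isinstance(params, dict):
        params = sorted(params.items())
    parts = []
    for key, value in params:
        quoted_key = quote_plus(str(key), safe=safe)
        if value is EMPTY:
            # Missing value: emit the key alone, without the equal sign.
            parts.append(quoted_key)
        else:
            parts.append(quoted_key + "=" + quote_plus(str(value), safe=safe))
    return "&".join(parts)
```

Note the difference between a missing value and a present but empty one: ('flag', EMPTY) becomes 'flag', whereas ('flag', '') becomes 'flag='.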

trac.util.text.quote_query_string(text)

Quote strings for use in a query string.

trac.util.text.javascript_quote(text)

Quote strings for inclusion in single or double quote delimited JavaScript strings.

trac.util.text.to_js_string(text)

Embed the given string in a double quote delimited JavaScript string (conforming to the JSON spec).
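One way to obtain a JSON-conformant, double quote delimited literal is the standard json module. The sketch below is an illustration under that assumption, not Trac's implementation; the extra escaping of <, > and & for safe embedding in HTML script blocks is an addition of this sketch, and to_js_string_sketch is a hypothetical name.

```python
import json


def to_js_string_sketch(text):
    """Return text as a double-quoted, JSON-conformant string literal."""
    literal = json.dumps(text)
    # Assumed hardening step: replace characters that are significant in
    # HTML with \uXXXX escapes, which JSON parsers decode transparently.
    for char in "<>&":
        literal = literal.replace(char, "\\u%04x" % ord(char))
    return literal
```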

Console and file system

trac.util.text.path_to_unicode(path)

Convert a filesystem path to unicode, using the filesystem encoding.

trac.util.text.stream_encoding(stream)

Return the appropriate encoding for the given stream.

trac.util.text.console_print(out, *args, **kwargs)

Output the given arguments to the console, encoding the output as appropriate.

Parameters:
  • kwargs – newline controls whether a newline will be appended (defaults to True)
trac.util.text.printout(*args, **kwargs)

Do a console_print on sys.stdout.

trac.util.text.printerr(*args, **kwargs)

Do a console_print on sys.stderr.

trac.util.text.raw_input(prompt)

Input one line from the console and convert it to unicode as appropriate.

Miscellaneous

trac.util.text.empty

A special tag object evaluating to the empty string, used as a marker for a missing value (as opposed to a present but empty value).

class trac.util.text.unicode_passwd

Conceal the actual content of the string when repr is called.
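The idea can be sketched as a string subclass that overrides __repr__. The class name PasswordText and the mask '*******' are assumptions of this Python 3 sketch.

```python
class PasswordText(str):
    """Text whose repr() never reveals the actual content."""

    def __repr__(self):
        return "*******"
```

The value still behaves like ordinary text for comparisons and formatting; only repr(), and therefore the repr of any container holding it (as often seen in logs and debug output), is masked.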

trac.util.text.levenshtein_distance(lhs, rhs)

Return the Levenshtein distance between two strings.
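For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. A standard dynamic-programming sketch (not necessarily the implementation used here; the function name is hypothetical) looks like this:

```python
def levenshtein_distance_sketch(lhs, rhs):
    """Return the edit distance between lhs and rhs using two rolling rows."""
    if len(lhs) < len(rhs):
        lhs, rhs = rhs, lhs  # keep the rows as short as possible
    previous = list(range(len(rhs) + 1))
    for i, left_char in enumerate(lhs, start=1):
        current = [i]
        for j, right_char in enumerate(rhs, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (left_char != right_char)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]
```

For example, the distance between "kitten" and "sitting" is 3.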

Text formatting

trac.util.text.pretty_size(size, format='%.1f')

Pretty print content size information with appropriate unit.

Parameters:
  • size – number of bytes
  • format – can be used to adjust the precision shown
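A minimal sketch of this kind of formatting follows. The 1024-based steps, the unit names and the plain "bytes" wording below 1 KB are assumptions of the sketch, as is the name pretty_size_sketch.

```python
def pretty_size_sketch(size, format="%.1f"):
    """Format a byte count with a unit, e.g. 2048 -> '2.0 KB'."""
    if size is None:
        return ""
    if size < 1024:
        return "1 byte" if size == 1 else "%d bytes" % size
    units = ["KB", "MB", "GB", "TB"]
    value = float(size)
    for unit in units:
        value /= 1024.0
        # Stop at the first unit that fits, or at the largest one.
        if value < 1024 or unit == units[-1]:
            return (format + " %s") % (value, unit)
```

The format parameter only affects the numeric part, so pretty_size_sketch(1536, '%.2f') gives '1.50 KB'.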
trac.util.text.breakable_path(path)

Make a path breakable after path separators, and conversely, avoid breaking at spaces.

trac.util.text.normalize_whitespace(text, to_space=u'\xa0', remove=u'\u200b')

Normalize whitespace in a string, by replacing special spaces by normal spaces and removing zero-width spaces.
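The two parameters can be read as sets of characters: every character in to_space becomes a normal space, and every character in remove is dropped. A sketch under that reading (Python 3, hypothetical function name):

```python
def normalize_whitespace_sketch(text, to_space="\xa0", remove="\u200b"):
    """Replace special spaces by ' ' and delete zero-width spaces."""
    if not text:
        return text
    for char in to_space:
        text = text.replace(char, " ")
    for char in remove:
        text = text.replace(char, "")
    return text
```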

trac.util.text.unquote_label(txt)

Remove (one level of) enclosing single or double quotes.

New in version 1.0.
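"One level" means that only the outermost matching pair of quotes is removed. A small sketch of that behaviour, with a hypothetical name and an added guard for strings shorter than two characters:

```python
def unquote_label_sketch(txt):
    """Strip one pair of matching enclosing quotes, if present."""
    if txt and len(txt) >= 2 and txt[0] in "'\"" and txt[0] == txt[-1]:
        return txt[1:-1]
    return txt
```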

trac.util.text.fix_eol(text, eol)

Fix end-of-lines in a text.
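One plausible reading of "fixing" is that every line ending, whatever its style, is rewritten to the requested eol. The sketch below assumes exactly that and is not taken from Trac's source; fix_eol_sketch is a hypothetical name.

```python
import re

# CRLF must be tried first so that it is replaced as a single line ending.
_EOL_RE = re.compile(r"\r\n|\r|\n")


def fix_eol_sketch(text, eol):
    """Rewrite all CRLF, CR and LF line endings in text to eol."""
    # A function is used as replacement so eol is inserted literally.
    return _EOL_RE.sub(lambda match: eol, text)
```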

trac.util.text.expandtabs(s, tabstop=8, ignoring=None)

Expand tab characters '\t' into spaces.

Parameters:
  • tabstop – number of space characters per tab (defaults to the canonical 8)
  • ignoring – if not None, the expansion will be “smart” and go from one tabstop to the next. In addition, this parameter lists characters which can be ignored when computing the indent.
trac.util.text.obfuscate_email_address(address)

Replace anything looking like an e-mail address ('@something') with a trailing ellipsis ('@…')
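The effect can be approximated with a regular expression that replaces the '@something' part by '@…'. The pattern below is an assumption of this sketch (it stops at whitespace and angle brackets) and may differ from what Trac actually matches.

```python
import re

# '@' followed by a run of characters that can belong to a domain part.
_AT_DOMAIN_RE = re.compile(r"@[^\s@<>]+")


def obfuscate_email_address_sketch(address):
    """Replace the '@domain' part of e-mail addresses with '@…'."""
    if not address:
        return address
    return _AT_DOMAIN_RE.sub("@\u2026", address)
```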

trac.util.text.text_width(text, ambiwidth=1)

Determine the column width of text in Unicode characters.

Characters in the East Asian Fullwidth (F) or East Asian Wide (W) categories have a column width of 2. Other characters, such as those in the East Asian Halfwidth (H) or East Asian Narrow (Na) categories, have a column width of 1.

The ambiwidth parameter gives the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice as wide as US-ASCII characters, which is what CJK users expect.

cf. http://www.unicode.org/reports/tr11/.
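The width rules above map directly onto unicodedata.east_asian_width from the standard library. A minimal Python 3 sketch, with the hypothetical name text_width_sketch:

```python
import unicodedata


def text_width_sketch(text, ambiwidth=1):
    """Return the number of terminal columns needed to display text."""
    # Fullwidth and Wide always count double; Ambiguous only if requested.
    double_width = "FWA" if ambiwidth == 2 else "FW"
    return sum(
        2 if unicodedata.east_asian_width(char) in double_width else 1
        for char in text
    )
```

For instance, "日本語" occupies 6 columns, while the Ambiguous character "±" occupies 1 column by default and 2 with ambiwidth=2.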

trac.util.text.print_table(data, headers=None, sep=' ', out=None, ambiwidth=None)

Print data according to a tabular layout.

Parameters:
  • data – a sequence of rows; assume all rows are of equal length.
  • headers – an optional row containing column headers; must be of the same length as each row in data.
  • sep – column separator
  • out – output file descriptor (None means use sys.stdout)
  • ambiwidth – column width of East Asian Ambiguous (A) characters. If None, ambiwidth is detected from the locale settings. Otherwise, the value is passed to the ambiwidth parameter of text_width.
trac.util.text.shorten_line(text, maxlen=75)

Truncates text to a length less than or equal to maxlen characters.

This tries to be (a bit) clever and attempts to find a proper word boundary for doing so.
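The word-boundary heuristic can be sketched as follows. The ' ...' suffix and the suffix parameter are assumptions of this sketch, which reserves room for the suffix so that the result stays within maxlen (assuming maxlen is at least as long as the suffix).

```python
def shorten_line_sketch(text, maxlen=75, suffix=" ..."):
    """Cut text at a word boundary so the result fits in maxlen characters."""
    if len(text or "") <= maxlen:
        return text
    # Leave room for the suffix, then look backwards for a space or newline.
    limit = max(maxlen - len(suffix), 0)
    cut = max(text.rfind(" ", 0, limit + 1), text.rfind("\n", 0, limit + 1))
    if cut <= 0:
        cut = limit  # no usable boundary: cut in the middle of the word
    return text[:cut] + suffix
```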

trac.util.text.stripws(text, leading=True, trailing=True)

Strips unicode white-spaces and ZWSPs from text.

Parameters:
  • leading – strips leading spaces from text unless leading is False.
  • trailing – strips trailing spaces from text unless trailing is False.
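A plain strip() does not treat the zero-width space (U+200B) as whitespace, which is why it has to be handled explicitly. A regular-expression sketch in Python 3, with hypothetical names:

```python
import re

# \s matches Unicode whitespace for str patterns; ZWSP is added explicitly.
_LEADING_WS_RE = re.compile(r"\A[\s\u200b]+")
_TRAILING_WS_RE = re.compile(r"[\s\u200b]+\Z")


def stripws_sketch(text, leading=True, trailing=True):
    """Strip Unicode whitespace and zero-width spaces from the ends of text."""
    if leading:
        text = _LEADING_WS_RE.sub("", text)
    if trailing:
        text = _TRAILING_WS_RE.sub("", text)
    return text
```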
trac.util.text.wrap(t, cols=75, initial_indent='', subsequent_indent='', linesep='\n', ambiwidth=1)

Wraps the single paragraph in t, which contains unicode characters. Every line is at most cols characters long.

The ambiwidth parameter gives the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice as wide as US-ASCII characters, which is what CJK users expect.

Conversion utilities

trac.util.text.unicode_to_base64(text, strip_newlines=True)

Safe conversion of text to base64 representation using utf-8 bytes.

Strips newlines from output unless strip_newlines is False.

trac.util.text.unicode_from_base64(text)

Safe conversion of text to unicode based on utf-8 bytes.
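The base64 round trip can be sketched with the standard base64 module. This is a Python 3 illustration with hypothetical function names; base64.encodebytes is used because it inserts newlines, which makes the strip_newlines option meaningful.

```python
import base64


def unicode_to_base64_sketch(text, strip_newlines=True):
    """Encode text as base64 over its UTF-8 bytes."""
    encoded = base64.encodebytes(text.encode("utf-8")).decode("ascii")
    if strip_newlines:
        encoded = encoded.replace("\n", "")
    return encoded


def unicode_from_base64_sketch(text):
    """Decode base64 text and interpret the bytes as UTF-8."""
    return base64.b64decode(text).decode("utf-8")
```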

trac.util.text.to_utf8(text, charset='latin1')

Convert a string to a UTF-8 str object.

If the input is not a unicode object, we assume the encoding is already UTF-8, ISO Latin-1, or as specified by the optional charset parameter.
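The guessing order described above can be sketched like this, in Python 3 terms (bytes for the encoded form, str for unicode). The name to_utf8_sketch and the exact order of attempts are assumptions based on the description, not a copy of Trac's code.

```python
def to_utf8_sketch(text, charset="latin1"):
    """Return UTF-8 encoded bytes for text, guessing the input encoding."""
    if isinstance(text, str):
        return text.encode("utf-8")
    try:
        # Already valid UTF-8: return the bytes unchanged.
        text.decode("utf-8")
        return text
    except UnicodeDecodeError:
        try:
            decoded = text.decode(charset)
        except UnicodeDecodeError:
            # latin1 accepts any byte sequence, so this always succeeds.
            decoded = text.decode("latin1")
        return decoded.encode("utf-8")
```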