`trac.util.text` – Text manipulation¶

The Unicode toolbox¶

Trac internals deal almost exclusively with Unicode text, represented by unicode instances. The main advantage of using unicode over UTF-8 encoded str (as was the case before version 0.10) is that the text transformation functions in this module operate safely on individual characters and never risk cutting a multi-byte sequence in the middle. Similar issues with Python string handling routines are avoided as well, such as surprising results when splitting text into lines. For example, did you know that “Priorità” is encoded as 'Priorit\xc3\xa0' in UTF-8? Calling strip() on this value in some locales can cut away the trailing \xa0, and the result is no longer valid UTF-8.
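The hazard can be reproduced in a few lines. The following is a minimal sketch written for Python 3, where bytes plays the role of the Python 2 str. The explicit rstrip(b'\xa0') stands in for the locale-dependent strip() behaviour and is an assumption made for illustration only.

```python
# Python 3 sketch: bytes stands in for the Python 2 str type.
encoded = 'Priorit\u00e0'.encode('utf-8')   # the two final bytes are 0xC3 0xA0

# Simulate a byte-level strip that treats 0xA0 (NBSP in latin1) as whitespace.
damaged = encoded.rstrip(b'\xa0')


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(encoded))   # the intact byte string
print(is_valid_utf8(damaged))   # the byte string with its last byte stripped
```

Stripping the final byte leaves a dangling lead byte 0xC3, which is why the damaged value can no longer be decoded.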

The drawback is that most of the outside world, while possibly “Unicode”, is definitely not unicode. This is why we need to convert back and forth between str and unicode at the boundaries of the system. More often than not, we even have to guess which encoding is used in the incoming str strings.

Encoding unicode to str is usually performed directly by calling encode() on the unicode instance, while decoding is preferably left to the to_unicode helper function, which converts str to unicode robustly and is guaranteed to succeed.

trac.util.text.to_unicode(text, charset=None)¶

Convert input to a unicode object.

For a str object, we first try to decode the bytes using the given charset encoding (or UTF-8 if none is specified). If that fails, we fall back to the latin1 encoding, which may or may not be correct but at least preserves the original byte sequence by mapping each byte to the corresponding Unicode code point in the range U+0000 to U+00FF.

For anything else, a simple unicode() conversion is attempted, with special care taken with Exception objects.
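The decoding strategy described above can be sketched as follows. This is a Python 3 analogue under stated assumptions, not Trac's implementation: decode_with_fallback is a hypothetical name, bytes stands in for the Python 2 str, and the special handling of Exception objects is left out.

```python
def decode_with_fallback(data, charset=None):
    """Hypothetical analogue of the strategy described for to_unicode."""
    if isinstance(data, bytes):
        try:
            # First attempt: the requested charset, or UTF-8 by default.
            return data.decode(charset or 'utf-8')
        except UnicodeDecodeError:
            # latin1 maps every byte to U+0000..U+00FF, so this cannot fail
            # and the original bytes stay recoverable.
            return data.decode('latin1')
    # Anything that is not a byte string gets a plain text conversion.
    return str(data)


text = decode_with_fallback(b'caf\xe9')           # not valid UTF-8
print(text.encode('latin1') == b'caf\xe9')        # original bytes preserved
```

Because the fallback is lossless at the byte level, a caller that later learns the real encoding can still re-encode with latin1 and decode again.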

trac.util.text.exception_to_unicode(e, traceback=False)¶

Convert an Exception to a unicode object.

In addition to the text produced by to_unicode, this representation of the exception also contains the class name and, optionally, the traceback.

Web utilities¶

trac.util.text.unicode_quote(value, safe='/')¶

A unicode aware version of urllib.quote.

Parameters:

value – anything that converts to a `str`. If `unicode` input is given, it will be UTF-8 encoded.
safe – as in `quote`, the characters that would otherwise be quoted but shouldn’t here (defaults to ‘/’)

trac.util.text.unicode_quote_plus(value, safe='')¶

A unicode aware version of urllib.quote_plus.

Parameters:

value – anything that converts to a `str`. If `unicode` input is given, it will be UTF-8 encoded.
safe – as in `quote_plus`, the characters that would otherwise be quoted but shouldn’t here (defaults to the empty string)

trac.util.text.unicode_unquote(value)¶

A unicode aware version of urllib.unquote.

Parameters:	value – UTF-8 encoded `str` value (for example, as obtained by `unicode_quote`).
Return type:	`unicode`
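To illustrate what the three quoting helpers above are for, here is a short round trip using Python 3's urllib.parse, which encodes text as UTF-8 by default. It is only an analogue of the behaviour described, not the Trac functions themselves, and the sample path is made up.

```python
from urllib.parse import quote, quote_plus, unquote

path = 'wiki/Priorit\u00e0 alta'           # hypothetical sample value

# With safe='/', the slash is kept and the space becomes %20.
quoted = quote(path, safe='/')

# With safe='', the slash is escaped too and the space becomes '+'.
quoted_plus = quote_plus(path, safe='')

print(quoted)
print(quoted_plus)
print(unquote(quoted) == path)             # unquoting restores the text
```

In both cases the non-ASCII character is first encoded to its UTF-8 bytes and each byte is then percent-escaped.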

trac.util.text.unicode_urlencode(params, safe='')¶

A unicode aware version of urllib.urlencode.

Values set to empty (the marker object described under Miscellaneous) are emitted as the key alone, without the equals sign.
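The "key alone" rule can be sketched like this. The sketch assumes that "empty" refers to a marker object compared by identity, in the spirit of trac.util.text.empty. The names _Empty, EMPTY and urlencode_sketch are hypothetical, and the code is a Python 3 illustration rather than Trac's implementation.

```python
from urllib.parse import quote_plus


class _Empty(str):
    """Hypothetical stand-in for a 'missing value' marker."""


EMPTY = _Empty()


def urlencode_sketch(params, safe=''):
    """Encode key/value pairs, emitting only the key for EMPTY values."""
    pairs = params.items() if isinstance(params, dict) else params
    parts = []
    for key, value in pairs:
        if value is EMPTY:
            # Missing value: key alone, no equals sign.
            parts.append(quote_plus(str(key), safe))
        else:
            parts.append(quote_plus(str(key), safe) + '=' +
                         quote_plus(str(value), safe))
    return '&'.join(parts)


print(urlencode_sketch([('q', 'a b'), ('verbose', EMPTY), ('x', '')]))
```

Note the difference between the marker and an ordinary empty string: the former yields `verbose`, the latter yields `x=`.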

trac.util.text.quote_query_string(text)¶: Quote strings for use in a query string.

trac.util.text.javascript_quote(text)¶: Quote strings for inclusion in single or double quote delimited JavaScript strings.

trac.util.text.to_js_string(text)¶: Embed the given string in a double quote delimited JavaScript string (conforming to the JSON spec).

Console and file system¶

trac.util.text.path_to_unicode(path)¶: Convert a filesystem path to unicode, using the filesystem encoding.

trac.util.text.stream_encoding(stream)¶: Return the appropriate encoding for the given stream.

trac.util.text.console_print(out, *args, **kwargs)¶

Output the given arguments to the console, encoding the output as appropriate.

Parameters:	kwargs – `newline` controls whether a newline will be appended (defaults to `True`)

trac.util.text.printout(*args, **kwargs)¶: Do a console_print on sys.stdout.

trac.util.text.printerr(*args, **kwargs)¶: Do a console_print on sys.stderr.

trac.util.text.raw_input(prompt)¶: Read one line from the console and convert it to unicode as appropriate.

Miscellaneous¶

trac.util.text.empty¶: A special tag object evaluating to the empty string, used as a marker for a missing value (as opposed to a present but empty value).

class trac.util.text.unicode_passwd¶: Conceal the actual content of the string when repr is called.
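The idea behind such a class can be shown with a small str subclass. ConcealedStr is a hypothetical name, the mask text is an assumption, and the sketch targets Python 3 rather than the unicode type used by Trac.

```python
class ConcealedStr(str):
    """A string whose repr() hides the content (illustrative only)."""

    def __repr__(self):
        return '*******'


secret = ConcealedStr('hunter2')

print(repr(secret))                 # masked
print(repr({'password': secret}))   # containers call repr() on their values
print(secret == 'hunter2')          # the value itself is unchanged
```

This protects against accidental disclosure in logs and debug dumps that rely on repr(), while leaving normal string operations untouched.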

trac.util.text.levenshtein_distance(lhs, rhs)¶: Return the Levenshtein distance between two strings.
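For readers unfamiliar with the metric, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The following is a standard dynamic-programming sketch with a hypothetical function name. It is not claimed to match Trac's implementation.

```python
def levenshtein(lhs, rhs):
    """Return the edit distance between lhs and rhs."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(lhs) < len(rhs):
        lhs, rhs = rhs, lhs
    # previous[j] is the distance between the processed prefix of lhs
    # and the first j characters of rhs.
    previous = list(range(len(rhs) + 1))
    for i, a in enumerate(lhs, start=1):
        current = [i]
        for j, b in enumerate(rhs, start=1):
            current.append(min(
                previous[j] + 1,               # delete a
                current[j - 1] + 1,            # insert b
                previous[j - 1] + (a != b),    # substitute a by b
            ))
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))
```

Only two rows of the distance table are kept at a time, so memory use is proportional to the shorter string.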

Text formatting¶

trac.util.text.pretty_size(size, format='%.1f')¶

Pretty print content size information with appropriate unit.

Parameters:

size – number of bytes
format – can be used to adjust the precision shown
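As a rough illustration of how the format parameter affects the result, here is a sketch of a size formatter. The 1024-based steps, the unit labels and the "bytes" wording for small values are assumptions, and pretty_size_sketch is a hypothetical name.

```python
def pretty_size_sketch(size, format='%.1f'):
    """Format a byte count with a unit (assumes 1024-based units)."""
    if size < 1024:
        return '%d bytes' % size
    units = ['KB', 'MB', 'GB', 'TB']
    value = float(size)
    for unit in units:
        value /= 1024.0
        # Stop at the first unit that fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return (format + ' %s') % (value, unit)


print(pretty_size_sketch(1536))
print(pretty_size_sketch(3 * 1024 * 1024, format='%.2f'))
```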

trac.util.text.breakable_path(path)¶: Make a path breakable after path separators, and conversely, avoid breaking at spaces.

trac.util.text.normalize_whitespace(text, to_space=u'\xa0', remove=u'\u200b')¶: Normalize whitespace in a string by replacing special spaces with normal spaces and removing zero-width spaces.
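One way to read the two keyword parameters is as character sets: every character in to_space becomes a plain space, and every character in remove is dropped. The sketch below follows that reading using str.translate in Python 3. The function name is hypothetical and the interpretation is an assumption based on the signature.

```python
def normalize_ws_sketch(text, to_space='\xa0', remove='\u200b'):
    """Replace 'special' spaces with ' ' and drop zero-width characters."""
    table = {ord(ch): ' ' for ch in to_space}        # e.g. NBSP -> space
    table.update({ord(ch): None for ch in remove})   # e.g. ZWSP -> removed
    return text.translate(table)


print(normalize_ws_sketch('a\xa0b\u200bc'))
```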

trac.util.text.unquote_label(txt)¶: Remove (one level of) enclosing single or double quotes.

New in version 1.0.
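"One level" means that only the outermost matching pair of quotes is removed. A minimal sketch of that behaviour, with a hypothetical function name and written for Python 3:

```python
def unquote_label_sketch(txt):
    """Strip a single pair of matching enclosing quotes, if present."""
    if len(txt) >= 2 and txt[0] == txt[-1] and txt[0] in '\'"':
        return txt[1:-1]
    return txt


print(unquote_label_sketch('"release notes"'))
print(unquote_label_sketch('"\'nested\'"'))   # inner quotes are kept
```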

trac.util.text.fix_eol(text, eol)¶: Fix end-of-lines in a text.

trac.util.text.expandtabs(s, tabstop=8, ignoring=None)¶

Expand tab characters '\t' into spaces.

Parameters:

tabstop – number of space characters per tab (defaults to the canonical 8)
ignoring – if not `None`, the expansion will be “smart” and go from one tabstop to the next. In addition, this parameter lists characters which can be ignored when computing the indent.

trac.util.text.obfuscate_email_address(address)¶: Replace anything looking like an e-mail address ('@something') with a trailing ellipsis ('@…')

trac.util.text.text_width(text, ambiwidth=1)¶

Determine the column width of text in Unicode characters.

Characters classified as East Asian Fullwidth (F) or East Asian Wide (W) have a column width of 2. Other characters, such as East Asian Halfwidth (H) or East Asian Narrow (Na), have a column width of 1.

The ambiwidth parameter sets the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice the width of US-ASCII characters, which is what CJK users expect.

See http://www.unicode.org/reports/tr11/.
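The width rules above map directly onto the East Asian Width property exposed by Python's unicodedata module. The following Python 3 sketch uses a hypothetical function name and ignores combining and control characters, which a production implementation may treat differently.

```python
import unicodedata


def text_width_sketch(text, ambiwidth=1):
    """Sum per-character column widths based on East Asian Width."""
    width = 0
    for ch in text:
        category = unicodedata.east_asian_width(ch)
        if category in ('F', 'W'):
            width += 2            # Fullwidth or Wide
        elif category == 'A':
            width += ambiwidth    # Ambiguous: caller decides
        else:
            width += 1            # Halfwidth, Narrow, Neutral
    return width


print(text_width_sketch('abc'))
print(text_width_sketch('\u65e5\u672c\u8a9e'))        # three Wide characters
print(text_width_sketch('\u03b1', ambiwidth=2))       # Greek alpha is Ambiguous
```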

trac.util.text.print_table(data, headers=None, sep=' ', out=None, ambiwidth=None)¶

Print data according to a tabular layout.

Parameters:

data – a sequence of rows; all rows are assumed to be of equal length.
headers – an optional row containing column headers; must be of the same length as each row in data.
sep – column separator
out – output file descriptor (None means use sys.stdout)
ambiwidth – column width of East Asian Ambiguous (A) characters. If None, ambiwidth is detected from the locale settings. Otherwise, the value is passed to the ambiwidth parameter of text_width.

trac.util.text.shorten_line(text, maxlen=75)¶

Truncates text to a length of at most maxlen characters.

This tries to be (a bit) clever and attempts to find a proper word boundary for doing so.
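A word-boundary truncation of this kind can be sketched as below. The ' ...' marker, the choice to count it against maxlen and the function name are all assumptions, so the actual helper may differ in these details.

```python
def shorten_sketch(text, maxlen=75, marker=' ...'):
    """Cut text at the last space that still leaves room for the marker.

    Assumes maxlen is larger than len(marker).
    """
    if len(text) <= maxlen:
        return text
    limit = maxlen - len(marker)
    # Look for the last space at or before the limit.
    cut = text.rfind(' ', 0, limit + 1)
    if cut <= 0:
        cut = limit               # no usable boundary: hard cut
    return text[:cut] + marker


print(shorten_sketch('the quick brown fox jumps', maxlen=15))
```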

trac.util.text.stripws(text, leading=True, trailing=True)¶

Strips Unicode whitespace and zero-width spaces (ZWSPs) from text.

Parameters:

leading – strips leading spaces from `text` unless `leading` is `False`.
trailing – strips trailing spaces from `text` unless `trailing` is `False`.
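The built-in strip() does not treat U+200B as whitespace, which is the gap this helper fills. A regular-expression sketch of the described behaviour for Python 3, with hypothetical names:

```python
import re

# \s matches Unicode whitespace for str patterns; U+200B is added explicitly.
_LEADING = re.compile(r'\A[\s\u200b]+')
_TRAILING = re.compile(r'[\s\u200b]+\Z')


def stripws_sketch(text, leading=True, trailing=True):
    """Strip whitespace and zero-width spaces from either end of text."""
    if leading:
        text = _LEADING.sub('', text)
    if trailing:
        text = _TRAILING.sub('', text)
    return text


print(repr(stripws_sketch('\u200b  a b \u200b\n')))
print(repr(stripws_sketch(' a ', leading=False)))
```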

trac.util.text.wrap(t, cols=75, initial_indent='', subsequent_indent='', linesep='\n', ambiwidth=1)¶

Wraps the single paragraph in t, which contains unicode characters. Every line is at most cols characters long.

The ambiwidth parameter sets the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice the width of US-ASCII characters, which is what CJK users expect.

Conversion utilities¶

trac.util.text.unicode_to_base64(text, strip_newlines=True)¶

Safe conversion of text to base64 representation using utf-8 bytes.

Strips newlines from output unless strip_newlines is False.

trac.util.text.unicode_from_base64(text)¶: Safe conversion of text to unicode based on utf-8 bytes.
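The pair of conversions can be sketched with the standard base64 module. This is a Python 3 illustration with hypothetical function names. It assumes the newline-producing encoder is base64.encodebytes, which wraps its output at 76 characters.

```python
import base64


def to_base64_sketch(text, strip_newlines=True):
    """Encode text as UTF-8, then as base64 (returned as text)."""
    encoded = base64.encodebytes(text.encode('utf-8')).decode('ascii')
    return encoded.replace('\n', '') if strip_newlines else encoded


def from_base64_sketch(text):
    """Decode base64 text back to the original string via UTF-8."""
    return base64.b64decode(text).decode('utf-8')


original = 'Priorit\u00e0 ' * 20          # long enough to trigger wrapping
one_line = to_base64_sketch(original)
wrapped = to_base64_sketch(original, strip_newlines=False)

print('\n' in one_line, '\n' in wrapped)
print(from_base64_sketch(one_line) == original)
```

Stripping newlines matters when the encoded value ends up in a single-line context such as an HTTP header or a cookie.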

trac.util.text.to_utf8(text, charset='latin1')¶

Convert a string to a UTF-8 str object.

If the input is not a unicode object, we assume the encoding is already UTF-8, ISO Latin-1, or as specified by the optional charset parameter.
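The assumption chain described above (already UTF-8, else the given charset, else Latin-1) can be sketched as follows. This is a Python 3 analogue in which bytes stands in for the Python 2 str, and to_utf8_sketch is a hypothetical name.

```python
def to_utf8_sketch(data, charset='latin1'):
    """Return UTF-8 bytes for text or for bytes in a guessed encoding."""
    if isinstance(data, str):
        return data.encode('utf-8')
    try:
        # Valid UTF-8 already: return the input unchanged.
        data.decode('utf-8')
        return data
    except UnicodeDecodeError:
        pass
    try:
        text = data.decode(charset)
    except UnicodeDecodeError:
        # latin1 accepts any byte sequence, so this always works.
        text = data.decode('latin1')
    return text.encode('utf-8')


print(to_utf8_sketch(b'caf\xe9'))      # Latin-1 input is re-encoded
```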
