TIP 413: Unicode Support for 'string is space' and 'string trim'

Login
Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        08-Oct-2012
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.6
Tcl-Branch:     tip-318-update

Abstract

This TIP is in fact a re-consideration of [318], in that it attempts to define, once and for all, for which characters string is space should return 1 and which characters string trim should trim.

Rationale

Intuitively, string is space and string trim should treat the same characters as space, but currently that's not the case, even after the implementation of [318]. The unicode standard advanced to version 6.2 now (at the time of this writing), but also Java and .NET have their own views on what whitespace should be. Let's try to learn from them.

Defining the Tcl Space Set

The NUL character has the function as string separator, which could be considered in the same group as LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029). It's a very useful character to be stripped. It even had the Whitespace property in Unicode 2.0. The problem with considering this character as space is that its visible representation is not specified, it even should not occur in normal text. Therefore, it is not in the "Tcl space set", but it is very useful to let it be stripped by string trim.

The Unicode standard changed in time, which resulted in whitespace characters being removed (deprecated) and added. The String.Trim() method in .NET 3.5 stripped zero width space (U+200B) and zero width no-break space (U+FEFF) from strings, but later .NET versions don't do that any more. The "Tcl space set" should not depend on that: If characters are deprecated in future Unicode versions, and because of that Whitespace properties are changed, they will not be removed from the "Tcl space set". But if new whitespace characters are added in future Unicode standards, they will be added to the "Tcl space set" as well, influencing both string is space and string trim.

The 3 characters that are in the "Tcl space set" but not in the current Unicode whitespace set are discussed now.

Most obvious is zero width no-break space (U+FEFF), which is a very useful character to be stripped, as it is used now as Byte Order Mark (BOM). It has no visible representation, and - in fact - no meaning at all within Tcl, as Tcl is UTF-8 internally already. It should not occur anywhere else in the string, but in the past it could as being a zero-width no-break space. It had the White_Space property in Unicode 2.0, but later versions of Unicode do not; the use of the BOM as a space was deprecated.

When the use as space was deprecated for (U+FEFF), another character was put forward as replacement for it: word joiner (U+2060). As this character has no visible representation, and has no meaning at all when at the start or the end of a string, it makes sense to include it in the "Tcl space set" as well, and more so because its predecessor had the White_Space property.

Finally, zero width space (U+200B), had the White_Space property in Unicode 3.0. In the current Unicode Charts it is still listed as being a space, even though the White_Space property was removed later. Therefore it should be in the "Tcl space set" as well.

Specification

This document proposes:

The string trimleft and string trimright commands will also be modified, as they track string trim.

Compatibility

For the ASCII set, the only change is the addition of 3 characters to string trim. For Unicode there are more changes, but all added characters are either rarely used, either intuitively expected to be trimmed by string trim. I don't think that any code will be adversely affected by this change, it will probably fix more bugs than that it breaks any existing code.

Alternatives

  1. NUL could be added to string is space, but that would be in conflict with what POSIX isspace() function does.

  2. NUL could be left out of the string trim set.

  3. Additional characters I considered being part of the set:

        break permitted here (U+0082)
        no break here (U+0083)
        zero width joiner (U+200C)
        zero width non-joiner (U+200D)
    

    Those are clearly useful characters to be stripped, as they have no meaning and no visible appearance at the beginning or end of a string. But they are not spaces, so it would diverge the two commands.

Reference Implementation

A reference implementation is available in the Tcl fossil repository on the tip-318-update branch https://core.tcl-lang.org/tcl/timeline?r=tip-318-update .

Copyright

This document has been placed in the public domain.