TIP:            114
Title:          Eliminate Octal Parsing of Leading Zero Integer Strings
Author:         Don Porter <dgp@users.sf.net>
Created:        16-Oct-2007
Type:           Project
State:          Draft
Vote:           Done
Version:        $Revision: 2.3 $
Tcl-Version:    9.0
Post-History:   
Keywords:       octal

~Abstract

This TIP proposes elimination of Tcl's practice of using octal
notation to interpret a string with a leading zero when an
integer value is expected.

~History and Rationale

Many Tcl commands have places in their syntax
where an integer value may be accepted.  Routines such as
'''Tcl_GetInt()''' perform the task of parsing an integer value
from the string value in these places.  Ultimately, these routines
have been built on C standard library functions such as
'''strtol()'''.  Due to this implementation choice, Tcl integer
parsing has inherited features from '''strtol()''' including
the feature that a leading zero in a string has been taken
as a signal that the string is an integer value in octal format.

Several programmers and programs have hit this feature by
surprise, resulting in nasty bugs such as:

| % proc m date {
|      lassign [split $date -] y m d
|      return [string index _JFMAMJJASOND $m]
|   }
| % m 2007-02-14
| F
| % m 2007-12-25
| D
| % m 2007-09-26
| bad index "09": must be integer?[+-]integer? or end?[+-]integer? (looks like invalid octal number)

There are very few places in Tcl scripts where this feature
is actually useful.  Octal format for integers simply isn't
encountered all that often in most programming tasks tackled by
Tcl scripts.  The main counterexample is the use of octal format
integers to describe filesystem permissions on Unix systems.
The Tcl commands that operate on filesystem permission values
are '''open''' and '''file attributes''', and it is a simple matter
to directly code them to recognize octal format, rather than have
them rely on octal parsing as a general integer value recognition
feature.  On the HEAD, these commands have already been so revised.
With those few cases accounted for, it's been observed that
removing this feature of Tcl integer parsing "will likely fix
more scripts than it breaks."

The opportunity to make this change in Tcl 8.5 arises because
we've already replaced our old parsing routines based on
'''strtol()''' with our own number parser [249].

~ Proposal

Revise all integer parsing in Tcl by making modifications to
the '''TclParseNumber()''' routine.  With reference to the
state machine graph in [249], we change the exit edges of
state ''integer[[1]]''.  Characters '''0''' - '''7''' and
characters '''8''' - '''9''' should now lead to state ''integer[[4]]'',
so that they continue decimal parsing, and not octal parsing.
The states ''integer[[2]]'' and ''error[[5]]'' will now be
accessible only if the character '''o''' or '''O''' is seen
while in state ''integer[[1]]'' and there will no longer be
any exit from those states when the characters '''.''' or '''e'''
or '''E''' are observed.
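
As a brief illustration of the intended outcome (a sketch only; the
variable names are made up for this example, and the '''0o''' prefix
assumes a Tcl that already has the number parser of [249]), the
following commands give the same results whether or not the proposed
change is in effect, because none of them relies on a bare leading
zero.  Octal input is requested explicitly with the '''0o''' prefix
or with '''scan %o''', and decimal input with '''scan %d''':

| # Explicit octal: the 0o prefix still reaches the octal parsing states.
| set perms [expr {0o644}]       ;# 420
| # Explicit conversions do not depend on the leading-zero convention.
| scan 010 %o fromOctal          ;# fromOctal is 8
| scan 010 %d fromDecimal        ;# fromDecimal is 10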

This change to '''TclParseNumber()''' is achieved with a
'''#define KILL_OCTAL''' in the file ''tclStrToD.c''.

~ Compatibility

This change is an incompatibility.  It's long
been believed that such a change should not happen until Tcl 9
because of this, but over time the consensus belief has developed
that far fewer programs and programmers will be harmed by the
incompatibility than will be helped by removing the misfeature.

That said, the incompatibility is serious.  The same string in
the same place in a script can now have a completely different
meaning.  Before the change:

| % lindex {a b c d e f g h i j k} 010
| i

After the change:

| % lindex {a b c d e f g h i j k} 010
| k

This is not the usual situation where a new feature causes scripts
that were errors to become non-errors, which is a compatible change.

This is also not a situation where a change causes legal scripts
to become errors.  Such a change would break scripts, but would
at least leave behind scripts that raise noisy errors alerting
about the breakage.

This is the most serious kind of incompatibility, where we replace
a working script with another working script that does something
completely different.  An illustration of the problem from Tcl's
own test suite highlights the danger.  Some of Tcl's tests in
'''io.test''' depend on the umask value, so that value is captured:

| set umaskValue [exec /bin/sh -c umask]

Note that the shell command '''umask''' returns a mask value as
an integer in octal format.  The test suite has relied on Tcl's
built-in ability to recognize this format, and the expected
result of test '''io-40.3''' has been computed:

| format %04o [expr {0666 & ~$umaskValue}]

After the proposed change, ''$umaskValue'' is treated as a decimal
number, and the wrong expected result is computed.  (This test
has already been updated on the HEAD to avoid such problems.)
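
One way to make such a computation independent of the parser's
treatment of leading zeros is sketched here.  It is an illustration
only, not necessarily the revision made to the test suite; the
variables ''umaskBits'' and ''expected'' are introduced just for this
example, and the '''0o''' prefix assumes the number parser of [249].

| set umaskValue [exec /bin/sh -c umask]
| # Convert the shell's octal text explicitly instead of relying on expr.
| scan $umaskValue %o umaskBits
| # Write the constant with the 0o prefix so that it is octal either way.
| set expected [format %04o [expr {0o666 & ~$umaskBits}]]

With a umask of '''022''', this yields '''0644''' both before and
after the proposed change.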

It is not difficult to imagine more serious problems in scripts
that make use of the result returned by the shell command '''umask'''
where a file might be created or modified with completely unintended
permissions as a result of the proposed change.  Such scripts
might easily raise security concerns.

Even in the light of the judgment that such (hopefully rare) compatibility
issues are acceptable in exchange for the benefits of purging the
misfeature, we really ought to consider seriously how we can alert
those migrating to Tcl 8.5 to this possibility and to the need to
examine their scripts for this issue.
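
As one rough aid to such an examination, the text of a script can be
searched for integer-like words that begin with a zero.  The helper
below is a hypothetical sketch, not part of Tcl, and only a
heuristic: it reports candidates for human review, it will flag
harmless matches (for example the fractional digits in ''1.05''),
and it cannot see values that arrive at run time, such as the
output of '''umask'''.

| # Hypothetical helper: return {lineNumber token} pairs for words made
| # of digits that start with 0 and have at least one further digit.
| proc findLeadingZeroInts {script} {
|     set hits {}
|     set lineNo 0
|     foreach line [split $script \n] {
|         incr lineNo
|         foreach tok [regexp -all -inline {\m0[0-9]+\M} $line] {
|             lappend hits [list $lineNo $tok]
|         }
|     }
|     return $hits
| }
| # Usage: only the first line contains a leading-zero word.
| findLeadingZeroInts "lindex \$l 010\nset x 10"   ;# {1 010}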

Besides the impact on Tcl commands, this change may also cause
incompatibilities in extensions, to the extent their commands
rely on Tcl's integer parsing to support octal notation.

~ Rejected Alternatives

Motivated largely by the serious incompatibilities lurking here,
a few people have suggested that some means be provided to toggle
Tcl's integer parsing behavior between two modes, one which
recognizes octal and one which does not.  This idea appears inspired
in part by the '''::tcl_precision''' variable, which has long
exercised control over Tcl's floating point number formatting.

While the motivation may be well-intended, this alternative is basically
unworkable, and can't really help anybody.  The point of the proposed
change is to make simple code work as simple coders expect it to.
Our original example proc can already be corrected like so:

| proc m date {
|    scan $date %d-%d-%d y m d
|    return [string index _JFMAMJJASOND $m]
| }

The point of this proposal is to make the original code just work.
It doesn't help to offer this complexity as a solution:

| % proc m date {
|      set mode [tcl::unsupported::octal]
|      tcl::unsupported::octal off
|      lassign [split $date -] y m d
|      set result [string index _JFMAMJJASOND $m]
|      tcl::unsupported::octal $mode
|      return $result
|   }

No one would choose that over just fixing the code
to use '''scan''', in which case the mode toggle isn't needed.
Also, this kind of management of a shared mode setting cannot
(easily and cheaply) be avoided, because at the point we need
to control the Tcl number parser, the most specific context
we have is the thread, so the mode has to be set thread-wide.

Likewise, the (hopefully rare) set of scripts that would actually
want to turn octal parsing back on are not going to announce themselves.
In order to know that 

| tcl::unsupported::octal on

needs to be added to a script to make it function correctly,
some kind of audit has to reach that conclusion, and once that
conclusion is reached and the issues are understood, it's just
as easy to insert '''scan %o''' in the proper places as it would
be to insert the '''tcl::unsupported::octal''' stopgap.

In short, any coder in a position to consider using a
'''tcl::unsupported::octal''' tool would quickly decide
against it in favor of just fixing their code.  Thus users
of this feature are mythical, and it will not be implemented.

~ Note

This TIP has been ''explicitly'' rejected as a feature for Tcl 8.5.
Consensus was that the type of breakage it inherently induces is not
acceptable in a minor version change.

~Copyright

This document is placed in the public domain.

