TIP 17: Redo Tcl's filesystem

Login
Author:         Vince Darley <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        17-Nov-2000
Post-History:   
Tcl-Version:    8.4.0

Abstract

Many of the most exciting recent developments in Tcl have involved putting virtual file systems in a file (e.g. Prowrap, Freewrap, Wrap, TclKit) but these have been largely ad hoc hacks of various internal APIs. This TIP seeks to replace this with a common underlying API that will, in addition, make porting of Tcl to new platforms a simpler task as well.

Overview

There are two current drawbacks to Tcl's filesystem implementation:

Prowrap (http://sourceforge.net/projects/tclpro\), Freewrap (http://home.nycap.rr.com/dlabelle/freewrap/freewrap.html\), Wrap (http://members1.chello.nl/~j.nijtmans/wrap.html\), TclKit (http://www.equi4.com/jcw/wiki.cgi/19.html\), ... are all attempts to provide an ability to place Tcl scripts and other data inside a single file (or just a small number of files). The best and simplest way to achieve that task (and many other useful tasks) is to let Tcl handle the contents of a single 'wrapped document' as if it were a filesystem: the contents may be opened, sourced, stat'd, copied, globbed, etc. Also note that at the European Tcl/Tk meeting, the (equal) second-ranked request for Tcl was support for standalone executables (http://mini.net/cgi-bin/wikit/837.html\).

This TIP suggests that Tcl's core be modified to allow non-native filesystems to be plugged in to the core, and hence allow perfect virtual filesystems to exist. The implementations provided by all of the above tools are very far from perfect. The most obvious types of virtual filesystem which should be supported are:

but the main point is that all filesystem access should occur through a hookable interface, so that Tcl neither knows nor cares what type of filesystem it is dealing with.

Furthermore this hookable interface should be Tcl_Obj based, providing a new 'Path' object type, which should be designed with two goals in mind:

If all of these goals are achieved, Tcl will have a new filesystem which is both more efficient and more powerful than the existing implementation.

Technical discussion

  1. Virtual filesystems

    An examination of the core shows that a very limited support was added to tclIOUtil.c in June 1998 (presumably by Scriptics to support prowrap) so that TclStat, TclAccess and Tcl_OpenFileChannel commands could be intercepted. (See http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/tcl/generic/tclIOUtil.c?rev=1.2&content-type=text/x-cvsweb-markup&cvsroot=tcl\)

    This TIP seeks to provide a complete implementation of virtual file system support, rather than these piecemeal functions.

    Fortunately, since Tcl is already abstracted across three different filesystem types (through the Tclp...) functions, it is not that big a task to abstract away to any generic filesystem.

    One goal of this TIP is to allow an extension to be written so that one can implement a virtual filesystem entirely in Tcl: i.e. to provide sufficient hooks into Tcl's core so that an extension can capture all filesystem requests and divert them if desired. The goal is not to provide Tcl-level hooks in Tcl's core. Such hooks will only be at the C level, and an extension would be required to expose them to the Tcl level.

  2. Objectified filesystem interface.

    Every filesystem access in Tcl's core usually involves several calls to 'access', 'stat', etc.

    For example 'file atime $path' requires two calls to 'stat' and one call to 'utime', all with the same $path argument. Each of these requires a conversion from the same Utf path to the same native string representation. No caching is performed, so each of these goes through Tcl_UtfToExternal. Often Tcl code will use the same $path objects for an entire sequence of Tcl 'file' operations. Clearly a representation which cached the native path would speed up all of these operations (except the first).

    The second reason why objectification is desirable is that in a pluggable-fs environment we must determine, for each file operation, which filesystem to use (whether native, a mounted .zip file, a remote FTP site, etc.). If this information can be cached for a particular path, again we will not need to recalculate it at every step. A similar technique to that used by Tcl's bytecode compilation will be used: each cached object will have a "filesystemEpoch" counter, so that we can tell with each access whether the filesystem has been modified (and we must discard the cached information). Mounting/unmounting filesystems will obviously modify the filesystemEpoch.

    A relatively complete implementation of this TIP, and a sample "vfs" extension now exist, and have been tested through TclKit. On both the "virtual" and "objectification" parts of this tip, the implementation is known to be stable and complete (at least on Windows): TclKit can operate through this new vfs implementation without the need to override a single Tcl core command at the script level, and all reasonable filesystem tests (cmdAH.test, fCmd.test, fileName.test, io.test) pass in a scripted document. Commands which operate on files (image, source, etc.) and extensions like Image, Winico can be made to work in a TclKit automatically! There is still some room for optimisation in some parts of the new objectification code (which wasn't possible in the old string-based API). The current implementation has great efficiency gains for vfs's implemented at the script level, since the same Path objects can be passed through the entire process, without an intermediate conversion (and string duplication which would otherwise be required). The combination of caching and objectification changes the existing list of steps from

    Tcl_Obj -> string -> filesystem -> convert-to-native -> native-call
    

    or (with vfs hooked in)

    Tcl_Obj -> string -> vfilesystem -> pick-filesystem
                              -> convert-to-native -> native-call
    

    and

    Tcl_Obj -> string -> vfilesystem -> pick-filesystem
                              -> Tcl_NewStringObj -> Tcl-vfs-call
    

    to

    Tcl_Obj -> vfilesystem -> native-call
    

    and

    Tcl_Obj -> vfilesystem -> Tcl-vfs-call
    

A final side-benefit of this proposal would be that it further modularises the core of Tcl, so that one could, in principle:

However these final two points are explicitly not the goal of this TIP! I simply want to improve Tcl to add vfs support, and the best way to do that seems (to me) to be along the lines of this TIP.

Proposal

The changes to Tcl's core for virtual filesystem support are actually very minor. Every occurrence of a Tclp-filesystem call must be replaced by a call to a hookable procedure. The current filesystem structure (defined in tclInt.h) and hookable procedure list is as follows (for documentation on this structure, see Documentation section below):

/*
 * struct Tcl_Filesystem:
 *
 * One such structure exists for each type (kind) of filesystem.
 * It collects together in one place all the functions that are
 * part of the specific filesystem.  Tcl always accesses the
 * filesystem through one of these structures.
 * 
 * Not all entries need be non-NULL; any which are NULL are simply
 * ignored.  However, a complete filesystem should provide all of
 * these functions.
 */

typedef struct Tcl_Filesystem {
    CONST char *typeName;   /* The name of the filesystem. */
    int structureLength;    /* Length of this structure, so future
                             * binary compatibility can be assured. */
    Tcl_FilesystemVersion version;  
                            /* Version of the filesystem type. */
    TclfsPathInFilesystem_ *pathInFilesystemProc;
                            /* Function to check whether a path is in this
                             * filesystem */
    TclfsDupInternalRep_ *dupInternalRepProc;
                            /* Function to duplicate internal fs rep */ 
    TclfsFreeInternalRep_ *freeInternalRepProc;
                            /* Function to free internal fs rep */ 
    TclfsInternalToNormalizedProc_ *internalToNormalizedProc_;
                            /* Function to convert internal representation
                             * to a normalized path */
    TclfsConvertToInternalProc_ *convertToInternalProc_;
			    /* Function to convert object to an
			     * internal representation */
    TclfsStatProc_ *statProc; /* Function to process a 'Tcl_Stat()' call */
    TclfsAccessProc_ *accessProc;            
                            /* Function to process a 'Tcl_Access()' call */
    TclfsOpenFileChannelProc_ *openFileChannelProc; 
                            /* Function to process a 'Tcl_OpenFileChannel()' call */
    TclfsMatchInDirectoryProc_ *matchInDirectoryProc;  
                            /* Function to process a 'Tcl_MatchInDirectory()' */
    TclfsGetCwdProc_ *getCwdProc;     
                            /* Function to process a 'Tcl_GetCwd()' call */
    TclfsChdirProc_ *chdirProc;            
                            /* Function to process a 'Tcl_Chdir()' call */
    TclfsLstatProc_ *lstatProc;            
                            /* Function to process a 'Tcl_Lstat()' call */
    TclfsCopyFileProc_ *copyFileProc; 
                            /* Function to process a 'Tcl_CopyFile()' call */
    TclfsDeleteFileProc_ *deleteFileProc;            
                            /* Function to process a 'Tcl_DeleteFile()' call */
    TclfsRenameFileProc_ *renameFileProc;            
                            /* Function to process a 'Tcl_RenameFile()' call */
    TclfsCreateDirectoryProc_ *createDirectoryProc;            
                            /* Function to process a 'Tcl_CreateDirectory()' call */
    TclfsCopyDirectoryProc_ *copyDirectoryProc;            
                            /* Function to process a 'Tcl_CopyDirectory()' call */
    TclfsRemoveDirectoryProc_ *removeDirectoryProc;            
                            /* Function to process a 'Tcl_RemoveDirectory()' call */
    TclfsLoadFileProc_ *loadFileProc; 
                            /* Function to process a 'Tcl_LoadFile()' call */
    TclfsUnloadFileProc_ *unloadFileProc;            
                            /* Function to unload a previously successfully
                             * loaded file */
    TclfsReadlinkProc_ *readlinkProc; 
                            /* Function to process a 'Tcl_Readlink()' call */
    TclfsListVolumesProc_ *listVolumesProc;            
                            /* Function to list any filesystem volumes added
                             * by this filesystem */
    TclfsFileAttrStringsProc_ *fileAttrStringsProc;
                            /* Function to list all attributes strings which
                             * are valid for this filesystem */
    TclfsFileAttrsGetProc_ *fileAttrsGetProc;
                            /* Function to process a 'Tcl_FileAttrsGet()' call */
    TclfsFileAttrsSetProc_ *fileAttrsSetProc;
                            /* Function to process a 'Tcl_FileAttrsSet()' call */
    TclfsUtimeProc_ *utimeProc;       
                            /* Function to process a 'Tcl_Utime()' call */
    TclfsNormalizePathProc_ *normalizePathProc;       
                            /* Function to normalize a path */
} Tcl_Filesystem;

Once that is done, almost no more changes need be made to Tcl's core. We must simply add code (to tclIOUtil.c and declarations to tclInt.h) to implement the hookable functions and to provide a simple API by which extensions can hook into the new filesystem support.

This gives us the simplest level of vfs. Most remaining changes are objectifying the way Tcl's core uses filesystems. Many of these changes actually simplify the core, for example, we replace:

        case FILE_COPY: {
            int result;
            char **argv;

            argv = StringifyObjects(objc, objv);
            result = TclFileCopyCmd(interp, objc, argv);
            ckfree((char *) argv);
            return result;
        }            

with

        case FILE_COPY: {
            return TclFileCopyCmd(interp, objc, objv);
        }            

and the Unix versions of stat, access, chdir are as simple as:

int TclpObjStat(Tcl_Obj *pathPtr, struct stat *buf) {
    return stat(Tclfs_GetNativePath(pathPtr), buf);
}
int TclpObjChdir(Tcl_Obj *pathPtr) {
    return chdir(Tclfs_GetNativePath(pathPtr));
}
int TclpObjAccess(Tcl_Obj *pathPtr, int mode) {
    return access(Tclfs_GetNativePath(pathPtr));
}

There are a few other small changes required, some which are absolutely necessary, and some which make the implementation of a Tcl-level vfs much simpler and more robust:

As mentioned above an implementation of all of this now exists. The modified Tcl core passes all Tcl tests, and works with a new version of TclKit (which can itself pass all reasonable tests when executing the test suite inside a scripted document). It has been tested with a variety of wrapped demos (tclhttpd, bwidgets, widgets, alphatk), and performs very well. The patch is available from the author of this TIP, and a version (possibly not the most recent) can be downloaded from: ftp://ftp.ucsd.edu/pub/alpha/tcl/tclobjvfs.diff

Documentation: vfs-aware extensions

All calls to filesystem functions in Tcl's core and in extensions should preferably be made through the new API defined in tclInt.decls. All these functions have the prefix 'Tclfs_'. Of course extensions which call older string-based APIs (e.g. Tcl_OpenFileChannel) will still work, but will not benefit from the efficiency of the cached object representation. Most of these functions are not commonly used by extensions (e.g. Tclfs_CopyDirectory), so only the most common are listed here:

    Tcl_Channel Tclfs_OpenFileChannel(Tcl_Interp *interp, Tcl_Obj *pathPtr,
                                      char *modeString, int permissions)
    int         Tclfs_EvalFile(Tcl_Interp *interp, Tcl_Obj *fileName)
    int         Tclfs_Stat(Tcl_Obj *pathPtr, struct stat *buf)
    int         Tclfs_Access(Tcl_Obj *pathPtr, int mode)

These replace the equivalent string-based APIs.

    int      Tclfs_ConvertToPathType(Tcl_Interp *interp, Tcl_Obj *pathPtr)

Attempts to convert the given object to a path type. This is a little more than a simple wrapper around Tcl_ConvertToType(interp, pathPtr, &tclFsPathType). If it returns TCL_ERROR, the object is not a valid path. In this sense it is very similar to Tcl_TranslateFileName for the existing string-based API. It should be called before attempting to pass an object to any of the other filesystem APIs (again in much the same way as Tcl_TranslateFileName was used in the core).

    int      Tclfs_EqualPaths(Tcl_Obj* firstPtr, Tcl_Obj* secondPtr)
    Tcl_Obj* Tclfs_SplitPath(Tcl_Obj* pathPtr, int *lenPtr)
    Tcl_Obj* Tclfs_JoinPath(Tcl_Obj *listObj, int elements)
    Tcl_Obj* Tclfs_JoinToPath(Tcl_Obj *basePtr, int objc, Tcl_Obj *CONST objv[])

These all manipulate paths. They return Tcl_Obj* with refCounts of zero.

    Tcl_Obj* Tclfs_GetNormalizedPath(Tcl_Interp *interp, Tcl_Obj* pathObjPtr)
    char*    Tclfs_GetTranslatedPath(Tcl_Interp *interp, Tcl_Obj* pathPtr)

and finally:

    char*    Tclfs_GetNativePath(Tcl_Obj* pathObjPtr)

which is used by native filesystems only, and is a shorthand for getting at the cached native representation for MacOS, Windows or Unix (as appropriate). This is always a string based representation, but may really be of type TCHAR* on Windows, for example.

Documentation: writing a new filesystem

The objPtr->internalRep.otherValuePtr field is a pointer to one of these structures, for objects of "path" type.

typedef struct FsPath {
    char *translatedPathPtr;    /* Name without any ~user sequences */
    Tcl_Obj *normPathPtr;       /* Normalized absolute path, without .
                                 * or .. sequences, and without ~user
                                 * sequences. */
    Tcl_Obj *cwdPtr;            /* If null, path is absolute, else
                                 * this points to the cwd object used
                                 * for this path.  We have a refCount
                                 * on the object. */ 
    ClientData nativePathPtr;   /* Native representation of this path,
                                 * which is filesystem dependent. */
    int filesystemEpoch;        /* Used to ensure the path representation
                                 * was generated during the correct
                                 * filesystem epoch.  The epoch changes
                                 * when filesystem-mounts are changed. */ 
    struct Tcl_FilesystemRecord *fsRecPtr;
                                /* Pointer to the filesystem record 
                                 * entry to use for this path. */
} FsPath;

Path to filesystem mapping:

int TclfsPathInFilesystem_ (Tcl_Obj *pathPtr, ClientData *clientDataPtr)

Is the given path in this filesystem? This function should return either -1, or it should return TCL_OK. If it returns TCL_OK, it may wish to set the clientData parameter to point to a filesystem specific representation of the path. (The native filesystem actually postpones the calculation of the native representation until it is requested, but TclKit's vfs immediately allocates a structure containing an int and Tcl_Obj* which it uses as an internal representation).

Internal representation manipulation:

void TclfsFreeInternalRep_ (ClientData clientData)
ClientData TclfsDupInternalRep_ (ClientData clientData)

These two are called to duplicate and free the clientData field of the FsPath structure. If they are NULL, they are ignored (and on duplication, the new object's clientData field is set to NULL).

Path normalization:

int TclfsNormalizePathProc_ (Tcl_Interp *interp, Tcl_Obj *pathPtr, 
                             int nextCheckpoint)

This function should check the string representation of pathPtr, starting at character index 'nextCheckpoint', and convert it from that point onwards (if possible) to a filesystem-specific unique form. It should return the character index one beyond where is could no longer apply (e.g. pointing to a directory separator or end of string). That index is then passed on to the next filesystem to continue. Most filesystems do not support path ambiguity, in which case the function need not be implemented at all (a NULL entry in the lookup table is acceptable). For example on Windows, the path "c:/PROGRA~1/tcl/tclkit.exe/lib" would be modified to "C:/Program Files/Tcl/tclkit.exe/lib" by the core's normalization procedure, which would return '31', pointing to the '/' in-between .exe and lib.

File manipulation:

For each filesystem function which is implemented, these procs should be declared:

    TclfsAccessProc_ *accessProc;            
    TclfsStatProc_ *statProc; 
    TclfsOpenFileChannelProc_ *openFileChannelProc; 
    TclfsMatchFilesTypesProc_ *matchFilesTypesProc;  
    TclfsLstatProc_ *lstatProc;            
    TclfsCopyFileProc_ *copyFileProc; 
    TclfsRenameFileProc_ *renameFileProc;            
    TclfsCopyDirectoryProc_ *copyDirectoryProc;            
    TclfsDeleteFileProc_ *deleteFileProc;            
    TclfsCreateDirectoryProc_ *createDirectoryProc;            
    TclfsRemoveDirectoryProc_ *removeDirectoryProc;            
    TclfsLoadFileProc_ *loadFileProc; 
    TclfsReadlinkProc_ *readlinkProc; 
    TclfsListVolumesProc_ *listVolumesProc;            
    TclfsFileAttrStringsProc_ *fileAttrStringsProc;
    TclfsFileAttrsGetProc_ *fileAttrsGetProc;
    TclfsFileAttrsSetProc_ *fileAttrsSetProc;
    TclfsUtimeProc_ *utimeProc;       

In fact, copy/rename file need not be implemented, because Tcl will fallback: from rename to copy and delete, and from copy to open-r/open-w/fcopy/mtime when necessary. However a filesystem may well be able to implement these more efficiently than that.

Cd/pwd support:

    TclfsChdirProc_ *chdirProc;            
    TclfsGetCwdProc_ *getCwdProc;     

the chdir proc need only return TCL_OK if the path is a valid directory, and TCL_ERROR otherwise. There is no need to remember the path in any way. Native filesystems will of course want to make the appropriate system calls to change the real cwd. Most filesystems will not implement the 'Cwd' proc, since Tcl keeps track of the cwd for you. However, the native filesystem should implement it.

Unload file support:

    TclfsUnloadFileProc_ *unloadFileProc;            

This function is called automatically by Tcl's core to unload a file, if this filesystem was the one which successfully loaded the file initially.

Philosophy

This TIP is influenced by the thoughts behind the TkGS project (http://sourceforge.net/projects/tkgs/\). Whereas TkGS provides a general and efficient graphics system, the aim of this TIP is to provide a similarly general and efficient filesystem.

Alternatives

  1. Alternatives to adding vfs support

    TclKit manages a pretty good job of vfs support. It is limited by the inadequacy of overriding at the Tcl level. Prowrap is limited by the inability to glob, load, cd, pwd, etc.

    There are currently no better alternatives: if Tcl's C core calls C functions directly (as it does), or if extensions call C functions directly (as they do), then complete vfs support requires a patch like this to Tcl's core.

  2. Alternatives to objectification

    A previous patch added string-based vfs support to Tcl's core, and required very few core changes at all. It could be adopted instead of an objectified filesystem. This would make Tcl's filesystem more complete, but would not make it any more efficient. Also it is much harder to implement complete 'glob' emulation without the newer API.

Objections

Won't all these hooks slow down Tcl's core a lot?

There are actually remarkably few changes required, so the only slowdown would occur if additional filesystems are hooked into the core. This is similar to the impact of the 'stacked channels' implementation. With the objectified filesystem, this does actually speed up Tcl's core (as remarked above, 'file exists' is 20% faster on Windows 2000).

Won't this break backwards compatibility ("The Tcl question")?

Not at all. With the current vfs patch, the entire test suite passes as before, even with an extra 'reporting' filesystem activated. Most reasonable tests now pass even when the test suite is embedded in a wrapped document.

Won't this make Tcl's core more complex??

Adding a Tcl_Obj interface is definitely a bit more complex in some areas than the existing string-based system, but in other areas it cleans things up a lot. Indeed, one result will be that Tcl's filesystem is properly abstracted away, which conceptually simplifies the core (there will be 10-15 functions which are called for all filesystem access, whether it is native or virtual).

Future thoughts

This section contains items which are outside the scope of this TIP, but it was thought useful to raise and have documented for the record.

This patch still places the native filesystem in a preferential position, and it is hard-coded as the tail of the fs-lookup list. There are two changes which could be made in the future:

Also,

Copyright

This document has been placed in the public domain.