TIP:            234
Title:          Add Support For Zlib Compression
Version:        $Revision: 1.21 $
Author:         Pascal Scheffers <pascal@scheffers.net>
State:          Final
Type:           Project
Vote:           Done
Created:        08-Dec-2004
Post-History:   
Keywords:       Tcl,zip,gzip,deflate
Tcl-Version:    8.6

~ Abstract

This TIP proposes a new core package with commands to handle compression and
decompression using the Zlib compression library.

~ Rationale

The Zlib compression algorithm is a widely used method for compression of
files and streams. It is the algorithm used for .gz and (most) .zip files, as
well as one of the standard compression algorithms in the HTTP protocol
specifications.

Including support for Zlib compression in the core would enable the use of
compressed VFS files, fast pure Tcl implementations of gzip and zip utilities
and the use of compression in various network protocols.

A compressed VFS would be of great benefit to the new '''clock'''
implementation [173], which brings a long a large number of small files which
contain the timezone data. Although this would also require support for a VFS
file format in the core. One possible candidate would be the Tcl Read Only fs
(trofs), or perhaps a zip file VFS (only a tclvfs zip handler exists at the
time of writing). Such a compressed VFS is outside the scope of this TIP, but
would be much easier in the future based on top of it.

~~ History and Implementation Notes

The specification and implementation for the package and command originally
came from tclkit. This was wrapped in a TEA compliant package as a stand alone
package. The reference implementation is a full rewrite, retaining the public
API of the tclkit ''zlib'' command.

The gzip support and C Language API are not part of the original ''zlib''
extension. The streaming decompression is functionaly equivalent to tclkit
'''zlib sinflate''', but uses a different command names. Streaming compression
is new.

The package version for this release is 2.0 because the private API from the
original command has been removed. Alternatively, the package version can be
1.2 indicating new features were added and no existing public APIs were
changed.

~ Dependency Issues

The package utilizes zlib/libz from the gzip project
[http://www.gzip.org/zlib/]. The license of this project/library is compatible
with the Tcl license, and it also compiles on most, if not all, platforms
where Tcl compiles.

For ease of use, the core distribution shall include a copy of libz under
''tcl/contrib''. This copy will be built and used automatically when autoconf
cannot find zlib.h during the configure stage.

~ Streaming

For large files (where large is a relative value, of course), streaming
compression and decompression is required. This is implemented by using
temporary commands, which can be fed small amounts of data, yielding small
chunks of (de)compressed data.

~ Tcl API

~~ Block Compression and Decompression

There are three compressed formats supported by this command:

 * ''compress'' - the output contains raw deflate data, with no zlib/gzip
   headers or trailers and no checksum value.

 * ''deflate'' - the output contains data in zlib format, with zlib header and
   trailer using an Adler-32 checksum

 * ''gzip'' - the output contains data in gzip format, with empty gzip
   filename, no extra data, no comment, no modification time (set to zero), no
   header crc and the operating system will be set to 255 (unknown).

Data is treated as binary, meaning that all input and output is going to be
converted and treated as byte arrays in Tcl.

 > '''zlib compress''' ''data'' ?'''-level''' ''level''?

Returns raw deflated byte-array version of binary data ''data'', at an
optional compression ''level''. The compression level must be between 0 and 9:
1 gives best speed, 9 gives best compression, 0 gives no compression at all
(the input data is simply copied a block at a time).

 > '''zlib decompress''' ''compressedData'' 

Decompresses a raw deflated byte array as obtained from '''zlib compress'''.

 > '''zlib deflate''' ''data'' ?'''-level''' ''level''?

Returns zlib-compressed version of ''data'', at an optional compression
'''level'''. The compression level must be between 0 and 9: 1 gives best
speed, 9 gives best compression, 0 gives no compression at all (the input data
is simply copied a block at a time).

 > '''zlib inflate''' ''deflatedData'' 

Decompresses the zlib-compressed data as obtained from '''zlib deflate'''.

 > '''zlib gzip''' ''data'' ?'''-level''' ''level''? ?'''-header'''
   ''gzipHeaderDict''?

Returns gzip-compressed ''data'', at an optional compression ''level''. The
compression level must be between 0 and 9: 1 gives best speed, 9 gives best
compression, 0 gives no compression at all (the input data is simply copied a
block at a time).

When header dict is not given with the '''-header''' option, the gzip header
will have no file name, no extra data, no comment, no modification time (set
to zero), no header crc, and the operating system will be set to 255
(unknown).

The header dict may contain:

 * '''crc''' -
   integer: CRC-32 of the uncompressed data.

 * '''filename''' -
   string: original file name.

 * '''os''' -
   integer: Operating system/file system used (see RFC 1952
   [http://www.ietf.org/rfc/rfc1952.txt] for list of codes).

 * '''size''' -
   integer: uncompressed size modulo 2**32.

 * '''time''' -
   integer: unix mtime in seconds since 1970-1-1, suitable for use with
   '''clock format'''.

 * '''type''' -
   flag: '''binary''' for binary data, '''text''' for "probably text".

Other fields may be added in the future.

 > '''zlib gunzip''' ''gzipData'' ?'''-headerVar''' ''headerDictVarName''?

Decompresses the gzip data as obtained from '''zlib gzip''' or any gzip file.

The command returns the uncompressed data. The optional '''-headerVar'''
variable name will be filled with the available header fields. If a field does
not exist in the gzip header, it will not be present in the dict. For example,
the original filename, comment and crc are optional header fields and will be
not set in the dict if they do not exist.

Note that '''compress'''/'''decompress''', '''deflate'''/'''inflate''' and
'''gzip'''/'''gunzip''' must be used in pairs.

~~ Streaming Compression and Decompression

Streaming is handled in one of two ways. Either by '''push'''ing a
transformation onto a channel's transformation stack, or by a worker command
which is created by calling the '''zlib''' command's '''stream''' subcommand.

~~ Channel Transformations

 > '''zlib push'''
   '''deflate'''|'''inflate'''|'''compress'''|'''decompress'''|'''gzip'''|'''gunzip'''
   ''channel''
   ?''-level level''?
   ?''-limit count''?
   ?''-header gzipHeaderDict''?
   ?''-headerVar headerDictVarName''?

Pushes the requested transformation onto the channel stack. The compression
level must be between 0 and 9: 1 gives best speed, 9 gives best compression, 0
gives no compression at all (the input data is simply copied a block at a
time). The '''-limit''' option specifies the maximum number of bytes to read
from the channel. This is mainly intended to specify how much compressed
should be read from a non-seekable channel.

The '''-header''' and '''-headerVar''' are only used for '''gzip''' and
'''gunzip''' modes respectively. See the previous section for their
definition.

Additional '''chan''' commands are enabled after pushing a zlib
transformation:

 > '''chan adler32''' ''channelId''

Returns the Adler32 checksum for the data. Continuously updated during
compression, available only at the of decompression.

 > '''chan fullflush''' ''channelId''

Performs a fullflush on the compression output.

At the end of the data during compression, simply '''chan pop''' to finalize
compression and flush any remaining compressed data.

At the end of compressed data, the channel will return EOF until the
transformation is popped from the channel. If no '''-limit''' was specified,
the current access position of the channel is undefined.

When the base channel or transform returns EOF, compression will automatically
finalize. When EOF occurs during decompression but the compressed stream is not
yet at EOF, an error will be raised. 

~~ The zlib stream command

 > '''zlib stream'''
   '''deflate'''|'''inflate'''|'''compress'''|'''decompress'''|'''gzip'''|'''gunzip'''
   ?''-level level''?
   ?''-header gzipHeaderDict''?
   ?''-headerVar headerDictVarName''?

Returns a command name which will perform the requested operation in a
streaming fashion. The compression level value, ''level'', is only used when
compressing data.

The '''-header''' and '''-headerVar''' are only used for '''gzip''' and
'''gunzip''' modes respectively. See earlier in this TIP for their definition.

~~~ Stream Worker Command

The stream worker command is used to actually compress and decompress in
smaller chunks than the input and/or output.

 > ''stream'' '''put''' ?''-flush|-fullflush|-finalize''? ''data''

Adds data to be (de)compression. The flags '''-flush''', '''-fullflush''' and
'''-finalize''' are mutually exclusive and indicate the desired flushing of
the stream. '''-finalize''' is used to indicate the last block of data while
compressing. After '''-finalize''', no more data can be added to be
compressed. For decompression, after '''-finalize''' you can still add more
data for decompression.

 > ''stream'' '''flush'''

The next invoke of the stream's '''get''' subcommand will try to get the most
data from the stream. While compressing, calling [[''stream'' '''flush''']]
often will degrade the compression ratio as it forces all remaining input to
be output immediately.

 > ''stream'' '''fullflush'''

Like the '''flush''' subcommand, the next '''get''' subcommand invoked on the
stream will try to get the most data from the stream. Additionally, the
compressor will output extra data to enable recovery from this point in the
datastream.

 > ''stream'' '''finalize'''

For compression, this signals the end of the input data; no more data can be
added to the stream after the '''finalize''' subcommand is called. For
decompression, this functions the same as the '''flush''' subcommand.

 > ''stream'' '''get''' ?'''-count''' ''count''?

Gets (de)compressed data from the stream. The optional ''count'' parameter
specifies the maximum number of bytes to read from the stream. Especially for
decompression, it is strongly recommended to specify a ''count''.

 > ''stream'' '''eof'''

Returns 0 while the end of the compressed stream has not been reached. Returns
1 when the end of compressed stream was reached or the last data has been put
to the stream and '''-finalize''' was specified, or [[''stream''
'''finalize''']] has been called while compressing data.

When [[''stream'' '''eof''']] is returning true, and [[''stream'' '''get'''
?''count''?] returns an empty string, you will have obtained all data from the
stream.

 > ''stream'' '''checksum'''

Returns the Adler-32 or CRC-32 checksum of the uncompressed data. For
compressing streams, this value is updated on each '''$stream put'''. For
decompressing streams, the value will only match the adler32 of the
decompressed string after the last [[''stream'' '''get''']] returned an empty
string. Which type of checksum is computed (Adler-32 or CRC-32) depends on the
compression format of the stream.

 > ''stream'' '''close'''

Deletes the ''stream'' worker command and all storage associated with it.
Discards any remaining input and output. After this command, the ''stream''
command cannot be used anymore.

~~ Checksums

 > '''zlib crc32''' ''data'' ?'''-startValue''' ''startValue''?

Calculates a standard ''CRC-32'' checksum, with an optional start value for
incremental calculations.

 > '''zlib adler32''' ''data'' ?'''-startValue''' ''startValue''?

Calculates a quick ''Adler-32'' checksum, with an optional start value for
incremental calculations.

~ C API

~~ Synopsys

Tcl_Obj *
'''Tcl_ZlibDeflate'''(''interp, format, data, level, dictObj'')

Tcl_Obj *
'''Tcl_ZlibInflate'''(''interp, format, data, dictObj'')

unsigned int
'''Tcl_ZlibCRC32'''(''initValue, bytes, length'')

unsigned int
'''Tcl_ZlibAdler32'''(''initValue, bytes, length'')

int
'''Tcl_ZlibStreamInit'''(''interp, mode, format, level, dictObj, zshandlePtr'')

Tcl_Obj *
'''Tcl_ZlibStreamGetCommandName'''(''zshandle'')

int
'''Tcl_ZlibStreamEof'''(''zshandle'')

int
'''Tcl_ZlibStreamClose'''(''zshandle'')

int
'''Tcl_ZlibStreamAdler32'''(''zshandle'')

int
'''Tcl_ZlibStreamPut'''(''zshandle, dataObj, flush'')

int
'''Tcl_ZlibStreamGet'''(''zshandle, dataObj, count'')

~~ Arguments

 Tcl_Interp *interp (in): Optional interpreter to use for error reporting.

 int format (in): Compressed data format. For compression and decompression
   either '''TCL_ZLIB_FORMAT_RAW''', '''TCL_ZLIB_FORMAT_ZLIB''' or
   '''TCL_ZLIB_FORMAT_GZIP'''. A fourth value, '''TCL_ZLIB_FORMAT_AUTO''' is
   available for decompression, which can be used when decompressing either
   GZIP or ZLIB formatted data. Decompression of RAW data requires specifying
   the format as RAW.

 int mode (in): Compress or decompress mode. Either '''TCL_ZLIB_INFLATE''' or
   '''TCL_ZLIB_DEFLATE'''.

 Tcl_Obj *data (in): The input data for compression or decompression. Will be
   interpreted as a bytearray object.

 int level (in): The compression level. Must either be between 0 and 9 (1
   gives best speed, 9 gives best compression, 0 gives no compression at all
   with the input data is simply copied a block at a time) or -1 to get a
   default level that balances speed and compressed size. This parameter is
   ignored by decompressing streams.

 const char *bytes (in/out): On input, an array of bytes for calculation of
   checksums or compression/decompression. On output, an array of bytes to
   copy compressed or decompressed data into.

 int length (in): number of bytes to calculate the checksum on, or the size of
   ''bytes'' buffer to read from or write to.

 unsigned int initValue (in): start value value for the crc-32 or adler-32
   calculation.

 Tcl_Obj *dictPtr (in): A reference to a dict containing any additional
   options for the stream handler. This is used to pass options such as
   '''-limit''', '''-header''', etc. See the Tcl command documentation for a
   list of options supported for a particular format and mode. If NULL, will
   be treated as if it is an empty dict.

 Tcl_ZlibStream **zshandlePtr (out): Pointer to an integer to receive the
   handle to the stream. All subsequent ''Tcl_ZlibStream*''() calls require
   this handle.

 Tcl_ZlibStream *zshandle (in): Handle for the stream.

 Tcl_Obj *dataObj (in/out): A bytearray object to read the streamed data from
   ('''Tcl_ZlibStreamPut''') or write the streamed data to
   ('''Tcl_ZlibStreamGet''').

 int flush (in): Flush parameter. '''TCL_ZLIB_NO_FLUSH''',
   '''TCL_ZLIB_FLUSH''', '''TCL_ZLIB_FULLFLUSH''' or '''TCL_ZLIB_FINALIZE'''.

 int count (in): Maximum number of bytes to be written to the ''dataObj''
   Tcl_Obj. The special flag value -1 means get all bytes.

~~ Functions

 > '''Tcl_ZlibDeflate()'''

Depending on the ''type'' flag, this function returns a ''Tcl_Obj *'' with a
zero reference count containing the compressed data in either raw deflate
format, zlib format or gzip format. If an error happens during compression,
this function will return NULL and store a message in the Tcl interpreter.

 > '''Tcl_ZlibInflate()'''

This function returns a ''Tcl_Obj *'' with a zero reference count containing
the decompressed data. The ''buffersize'' argument may be used as a hint if
the decompressed size is know before decompression. If an error happens during
decompression, this function will return NULL and store a message in the Tcl
interpreter.

 > '''Tcl_ZlibCRC32()'''

This function returns the standard CRC-32 calculation. The ''startvalue''
should contain the previously returned value for streaming calculations, or
zero for the first block.

 > '''Tcl_ZlibAdler32()'''

This function returns a quick Adler-32 calculation. The ''startvalue'' should
contain the previously returned value for streaming calculations, or zero for
the first block.

 > '''Tcl_ZlibStreamInit()'''

This function initializes the internal state for compression or decompression
and creates the Tcl worker command for use at the script level. Returns TCL_OK
when initialization was succesful.

 > '''Tcl_ZlibStreamGetCommandName()'''

This function returns a ''Tcl_Obj *'' which contains the fully qualified
stream worker command name associated with this stream.

 > '''Tcl_ZlibStreamEof()'''

This function returns 0 or 1 depending on the state of the (de)compressor. For
decompression, eof is reached when the entire compressed stream has been
decompressed. For compression, eof is reached when the stream has been flushed
with '''TCL_ZLIB_FINALIZE'''.

 > '''Tcl_ZlibStreamClose()'''

This function frees up all memory associated with this stream, deletes the Tcl
worker command and discards all remaining input and output data.

 > '''Tcl_ZlibStreamAdler32()'''

This function returns the Adler-32 checksum of the uncompressed data up to this
point. For decompressing streams, the checksum will only match the checksum of
uncompressed data when ''Tcl_ZlibStreamGet'' returns an empty string.

 > '''Tcl_ZlibStreamPut()'''

This function is used to copy data to the stream from the given buffer. For
compression, the final block of data, which may be an empty string, must be
indicated with '''TCL_ZLIB_FINALIZE''' as the flush parameter. The number of
bytes read from the supplied buffer is returned (or -1 on error).

 > '''Tcl_ZlibStreamGet()'''

This function is used to copy the data from the stream to the given buffer.
The number of bytes written to the supplied buffer is returned (or -1 on
error).

~ Usage

Zlib support is to form part of Tcl's standard API: no special measures will
be needed for Tcl code or C-implemented extensions to make use of it.

~ Safe Interpreters

These commands only work on data already available to a safe interpreter and
are therefore safe make available in the safe interpreter.

~ Reference Implementation

An old version the reference implementation is available at the subversion
repository [http://svn.scheffers.net/zlib]. Alternatively, a recent snapshot
is available [http://svn.scheffers.net/zlib.tar.gz]. This reference
implementation includes a copy of zlib-1.2.1 [http://www.gzip.org].

The reference implementation currently implements a version 1.8 of this TIP.

~ Copyright

This document has been placed in the public domain.

