 

Sun Software Product Internationalization Taxonomy

 
 

4.3.3.1 Transfer Encoding (8-Bit Clean)


Description

A transfer encoding is a reversible transformation that maps a data set containing a wide range of bytes to and from a restricted set of bytes. For example, a transfer encoding can map a data set of 8-bit text to 7-bit text and vice versa. Transfer encoding is used to create a "tunnel" between two cooperating applications, which enables them to exchange data bytes that would otherwise be discarded or corrupted by the interface between them. The transfer encoding is applied to the data stream before the stream is sent to the interface, and is removed (decoded) when the stream is retrieved from the interface. The following diagram shows an overview of transfer encoding.
Figure 4-1. Transfer Encoding
The most common use of transfer encodings is to send binary or 8-bit data over a communications channel that only supports 7-bit text. For example, the 8-bit data might include JPEG images, audio streams, and international text. Transfer encodings are also used to protect the data against unwanted modifications made by the interface, such as line wrapping or whitespace normalization.
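The reversible mapping described above can be illustrated with a minimal sketch. It assumes base64 as the transfer encoding and uses a made-up payload containing every byte value; neither choice comes from this section.

```python
import base64

# Hypothetical 8-bit payload: every possible byte value, including NUL
# and the control characters.
payload = bytes(range(256))

# Encode: the output uses only a 64-character ASCII alphabet plus "=".
encoded = base64.b64encode(payload)

# Decode: the transformation is reversible, so the original bytes return.
decoded = base64.b64decode(encoded)

# Every encoded byte fits in 7 bits, so a 7-bit channel can carry it.
is_seven_bit = all(byte < 0x80 for byte in encoded)

# The price is size: the encoded form is about one third larger.
growth = len(encoded) - len(payload)
```

The growth value is the bandwidth cost that makes a transfer encoding layer worth avoiding when the interface does not require it.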
Transfer encoding is a distinct layer with its own interfaces, properties, and configuration. Avoid a transfer encoding layer whenever possible: it introduces complexity, often has to be managed out of band, and reduces effective bandwidth because it increases the size of the data set. Many communication protocols treat the need to operate over restricted interfaces as part of their design, and so avoid the need for a transfer encoding layer. Many codesets have also been defined to be compatible with specific well-known interfaces, and likewise avoid the need for transfer encoding. In general, a transfer encoding is necessary only when a legacy interface is called upon to deliver a new application or data type that the interface was never intended to handle. This section examines the following concepts:
  • Signaling
  • Idempotency
  • Transfer Encoding Data Flow
Signaling
Historical transfer encoding mechanisms have not provided any reliable means for the sender to signal the receiver that a transfer encoding was being used. Instead, ad hoc signaling mechanisms were used. For example, if a user received an email message that contained the word begin followed by a filename, followed by many lines of gibberish, it could reasonably be inferred that the message contained an attachment that had been encoded using uuencode. When setting up a Usenet connection, the two administrators would agree on which compression techniques and transfer encodings they would use for their specific connection. If the sender and receiver did not agree on the transfer encoding to be used, the receiver software would either deliver garbage or return an error diagnostic.
Multipurpose Internet Mail Extensions (MIME, RFC 2045) is one of the few specifications that provide a robust signaling mechanism for the use of transfer encoding. Each body part within a MIME message contains a header field, Content-Transfer-Encoding, which specifies the transfer encoding that has been applied to that part. Implementors are encouraged to pick the encoding that yields optimal results for their original data. Optimal is usually defined as the encoding type that results in the least growth in the size of the body part.
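As a rough sketch of the "least growth" rule, the following compares two MIME transfer encodings and reports the smaller result. The function name pick_encoding is hypothetical, and the sketch considers only base64 and quoted-printable; a real MIME implementation would also consider the 7bit, 8bit, and binary values and the line-length rules.

```python
import base64
import quopri


def pick_encoding(body: bytes):
    """Return (encoding name, encoded body) for the smallest encoded form.

    Hypothetical helper: only two candidate encodings are compared.
    """
    candidates = {
        "quoted-printable": quopri.encodestring(body),
        "base64": base64.encodebytes(body),
    }
    name = min(candidates, key=lambda key: len(candidates[key]))
    return name, candidates[name]


def signal_header(name: str) -> str:
    """Build the MIME header field that signals the chosen encoding."""
    return "Content-Transfer-Encoding: " + name
```

Mostly-ASCII text with a few accented characters tends to favor quoted-printable, while arbitrary binary data tends to favor base64.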
Idempotency
No known Internet transfer encoding is idempotent; that is, if you apply the transfer encoding to a stream that has already been encoded, the stream is encoded again and must be decoded twice to obtain the original data.
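A short sketch of this behavior, assuming base64 as the example encoding and an arbitrary three-byte payload:

```python
import base64

original = b"\xe9t\xe9"  # hypothetical 8-bit payload

once = base64.b64encode(original)
twice = base64.b64encode(once)  # encoding already-encoded data

# One decode pass only recovers the first encoded form.
after_one_decode = base64.b64decode(twice)

# A second decode pass is needed to recover the original bytes.
after_two_decodes = base64.b64decode(after_one_decode)
```

Because the second encoding pass changes the data again, a receiver that is not told how many layers were applied cannot know when to stop decoding, which is one reason signaling matters.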
Transfer Encoding Data Flow
The following steps describe a complete transfer encoding data flow:
  1. The sender opens the interface to the encoder. If the encoder supports more than one type of encoding, the sender can specify which type of encoding to use or allow the encoder to choose an optimal type. It is important that the sender is able to override the encoder in selecting an encoding type, since some international standards require specific transfer encoding types. For example, Japanese email permits only base64 encoding in the message header fields.
  2. The encoder negotiates with the restricted interface to determine what restrictions are in force. For historical interfaces like a UNIX pipe, this is often not possible; the restrictions are an intrinsic property of the interface and are not negotiable (or even visible) at runtime. Some interfaces, however, do provide a means to relax some of the restrictions. Extended SMTP, for example, can allow 8-bit text if both the sender and receiver agree to it.
  3. The encoder examines the data stream to determine what transfer encoding is necessary, if any. If the encoder only supports a single encoding type, or if the sender has explicitly specified an encoding type, then this step is omitted. If the data stream is large and buffering is not available, this step might be impractical.
  4. The encoder checks the restrictions of the interface against the characteristics of the source data stream, and chooses the most efficient transfer encoding type that still complies with those restrictions. The ideal case is an identity encoding.
  5. The encoder writes its signal field to the restricted interface, indicating the type of encoding used. The signal might be a preamble string, like the uuencode begin line, or it might be part of a larger syntax like the MIME Content-Transfer-Encoding header field.
  6. The encoder writes the encoded data stream to the restricted interface.
  7. The restricted interface makes the encoded data stream available to the decoder.
  8. The decoder opens the restricted interface and reads the signal field to identify the encoding type. The decoder must perform this check even if it only supports a single type. The decoder must be able to flag an error if, for example, the sender was configured to use uuencode, but the receiver was configured to use btoa.
  9. The decoder reads the encoded data from the restricted interface and verifies that the encoded data is correct. Depending on the protocol specification, the decoder can choose to ignore certain types of errors or try to reasonably recover from the error. For example, uudecode skips added ASCII spaces at the beginning of a line and can fill in missing spaces at the end of a line. If a decoding error occurs from which the decoder cannot recover, for example, the letter G in a hexadecimal value, the decoder must notify someone of the problem. Depending on the application, the decoder can return an error to the sender, notify the receiver, record the problem in a log file, or all of these.
  10. The decoder writes the original data stream to the receiver.
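The steps above can be sketched in a few lines. Everything named here is an assumption made for illustration: the X-Transfer-Encoding signal line, the eight_bit_ok flag that stands in for the negotiation result of step 2, and the three encoding types. A real protocol defines its own signal syntax and type list.

```python
import base64
import binascii
from typing import Optional

# Hypothetical signal line; real protocols use their own syntax, such as
# the uuencode "begin" line or the MIME Content-Transfer-Encoding field.
SIGNAL_PREFIX = b"X-Transfer-Encoding: "

ENCODERS = {
    "identity": lambda data: data,
    "base64": base64.b64encode,
    "hex": binascii.hexlify,
}
DECODERS = {
    "identity": lambda data: data,
    "base64": lambda data: base64.b64decode(data, validate=True),
    "hex": binascii.unhexlify,
}


def encode(data: bytes, eight_bit_ok: bool, forced: Optional[str] = None) -> bytes:
    """Steps 1 to 6: choose an encoding type, write the signal, write the data."""
    if forced is not None:
        kind = forced  # the sender overrides the encoder's choice
    elif eight_bit_ok or all(32 <= byte < 127 for byte in data):
        kind = "identity"  # ideal case: no transformation needed
    else:
        kind = "base64"
    return SIGNAL_PREFIX + kind.encode("ascii") + b"\n" + ENCODERS[kind](data)


def decode(stream: bytes) -> bytes:
    """Steps 8 to 10: check the signal, verify the data, return the original."""
    header, separator, body = stream.partition(b"\n")
    if not separator or not header.startswith(SIGNAL_PREFIX):
        raise ValueError("missing transfer encoding signal")
    kind = header[len(SIGNAL_PREFIX):].decode("ascii")
    if kind not in DECODERS:
        raise ValueError("unsupported transfer encoding: " + kind)
    try:
        return DECODERS[kind](body)
    except ValueError as error:  # binascii.Error is a subclass of ValueError
        raise ValueError("corrupt " + kind + " data") from error
```

In this sketch an unrecoverable error, such as the letter G in hexadecimal data, surfaces as a ValueError; whether the caller logs it, notifies the receiver, or returns it to the sender is left to the application.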

Command Line Interface

All transfer encodings should be exposed through a command line interface, both for debugging and to facilitate their use in scripts. Historically, transfer encodings like uuencode and btoa were implemented only in command line utilities. The notion of transfer encodings as a feature of a larger application, such as an email client or news reader, is relatively recent.
The encoder reads the source stream from the standard input or from a user-specified file. The decoder writes the original source stream to its standard output or to a user-specified file. The I/O interface must be "binary-clean" and able to read, buffer, and write all byte values, including all control characters and the NUL character. No special handling should be given to any characters. In C programming, this rules out all of the line-oriented stdio functions like fgets(3S) and fputs(3S). Byte-oriented functions must be used instead, like read(2), fread(3S), write(2), and fwrite(3S).
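The same binary-clean rule applies outside C. The following minimal sketch, in Python, reads and writes byte streams only and never treats the source data as text. The function names and the 57-byte chunk size are assumptions chosen for illustration (57 input bytes encode to exactly 76 base64 characters).

```python
import base64
import io


def encode_stream(src, dst, chunk_size: int = 57) -> None:
    """Read raw bytes from src and write one base64 line per chunk to dst.

    Both arguments must be binary streams.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(base64.b64encode(chunk) + b"\n")


def decode_stream(src, dst) -> None:
    """Decode each base64 line independently and write the raw bytes to dst."""
    for line in src:
        dst.write(base64.b64decode(line.strip()))


# Usage with in-memory streams. A command line tool would pass
# sys.stdin.buffer and sys.stdout.buffer instead of the text-mode
# sys.stdin and sys.stdout.
payload = bytes(range(256))  # includes NUL and all control characters
wire = io.BytesIO()
encode_stream(io.BytesIO(payload), wire)
restored = io.BytesIO()
decode_stream(io.BytesIO(wire.getvalue()), restored)
```

Only the encoded side is handled line by line, which is safe because the encoded alphabet never contains a newline.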
If the transfer encoding supports more than one encoding type, the choice must be exposed on the command line as an option, with the most robust encoding type as the default.
Both the encoder and the decoder command line interfaces should be implemented as a small wrapper around an API. For more information, see "Application Programming Interfaces."
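A command line front end along these lines might look like the sketch below. The option name -t/--type, the function names, and the two encoding types are assumptions; base64 is treated as the more robust default here because its output contains no whitespace-sensitive or 8-bit bytes.

```python
import argparse
import base64
import io
import quopri

# Hypothetical encoding table; the API functions do all of the real work.
ENCODERS = {
    "base64": base64.encodebytes,
    "quoted-printable": quopri.encodestring,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Apply a transfer encoding.")
    parser.add_argument("-t", "--type", choices=sorted(ENCODERS), default="base64",
                        help="encoding type (default: base64)")
    parser.add_argument("file", nargs="?",
                        help="input file; standard input is read when omitted")
    return parser


def run(argv, stdin, stdout) -> int:
    """Thin wrapper: parse options, read bytes, call the API, write bytes."""
    args = build_parser().parse_args(argv)
    if args.file:
        with open(args.file, "rb") as handle:
            data = handle.read()
    else:
        data = stdin.read()
    stdout.write(ENCODERS[args.type](data))
    return 0


# Usage with in-memory binary streams; no -t option, so the default applies.
out = io.BytesIO()
status = run([], io.BytesIO(b"caf\xe9\n"), out)
```

A script entry point would call run(sys.argv[1:], sys.stdin.buffer, sys.stdout.buffer) so that both streams stay binary-clean.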

Character Interface

Not applicable.

Graphical Interface

Not applicable.

Application Protocols

Transfer encoding is an application protocol layer that is specifically designed to tunnel international text and binary data through storage and interchange interfaces that do not support 8-bit or binary data.

Storage and Interchange

A transfer encoding does not have any storage or interchange capabilities. However, transfer encoding does pose a design problem for the implementor of a storage and interchange interface. Consider the following scenario:
  1. An email message is composed by a client application. The message contains a binary attachment.
  2. The client delivers the message via SMTP to a mail server. Because SMTP does not support binary data, the client applies MIME base64 transfer encoding to the attachment. The mail server inserts the message in a queue; that is, a storage interface.
  3. At a later time, the mail server reads the message from the queue, and delivers it to a message store using a proprietary binary-clean interface.
  4. An end-user retrieves the message from the message store using IMAP. Because IMAP does not support binary data, base64 must be applied to the attachment.
  5. The receiving client removes the base64 transfer encoding and allows the user to save the attachment as a disk file.
While two of the interfaces in this scenario are restricted, most are binary-clean. This leaves the implementors of the mail server queue and the message store with a choice:
  • Should they remove the transfer encoding when storing the message, and then reapply the transfer encoding when the message is retrieved?
  • Should they just leave the message alone and store it encoded?
  • Does the answer change if the message is routed using a binary-clean interface, like UNIX-to-UNIX Copy Protocol (UUCP)?
  • Should there be a separate interface that can be called on demand to alter a message's transfer encodings just before delivery?
When MIME was first published and MIME-aware email clients were first available, all servers and storage interfaces left the transfer encoding untouched. This was done partly to minimize computational load on the servers, but mostly because developers could not justify the programming effort involved in managing the transfer encodings on the server.
As end-user demands for improved network performance and message fidelity have increased, however, servers have become increasingly sophisticated in their handling of transfer encoding. This trend will continue, based on the latest standards drafts.

Application Programming Interfaces (APIs)

Any implementation of a transfer encoding should be first exposed as an API, to facilitate the development of both command line tools and integration into large programs. A good API will offer the applications developer flexibility in when and how to apply encoding and decoding. For an explanation of the parameters that must be exposed to the API, see "Transfer Encoding Data Flow."
The API to a transfer encoding layer must be binary-clean and must use buffer length parameters rather than NUL-terminated strings, because the data can legitimately contain NUL bytes.
When a higher-level protocol or service supports the use of transfer encodings, it is important that the APIs that support that protocol or service also support the transfer encodings. Many high-level API designers have omitted this step, moving the burden of supporting the transfer encodings onto the application developer. The interface must still expose flexibility to the applications developer; however, some APIs have so thoroughly buried the transfer encoding interface that the application is unable to obtain vital information or meet specific encoding requirements, such as those for Japanese email.

Requirements for Compliance

Command Line Interface

A compliant transfer encoding interface guarantees that the binary data stream written by the decoder (the consumer) is identical to the stream that was read by the encoder (the provider). This is a straightforward software quality assurance problem that lends itself well to automated testing. The test suite should:
  • Check for the natural boundary conditions in the encoding type
  • Include characters known to cause failures in the restricted interface
  • Be first run with the encoder and decoder directly connected and then run with a restricted interface between them
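A round-trip test of this kind can be sketched as follows. The restricted_interface function is a made-up stand-in for a legacy channel (it clears the high bit and strips trailing whitespace from each line), and the test cases assume base64, whose natural boundaries fall at multiples of three input bytes.

```python
import base64


def restricted_interface(data: bytes) -> bytes:
    """Hypothetical legacy channel: 7-bit only, trailing whitespace stripped."""
    lines = data.split(b"\n")
    return b"\n".join(
        bytes(byte & 0x7F for byte in line).rstrip(b" \t") for line in lines
    )


def round_trips(payload: bytes, encode, decode, channel=lambda data: data) -> bool:
    """True when the decoder output is identical to the encoder input."""
    return decode(channel(encode(payload))) == payload


CASES = [
    b"", b"a", b"ab", b"abc", b"abcd",   # boundary lengths around a 3-byte group
    bytes(range(256)),                   # every byte value
    b"\x00\xff\x80",                     # NUL and high-bit bytes
    b"trailing space \nnext line\t\n",   # whitespace the channel would strip
]

# First run: encoder and decoder directly connected.
direct_ok = all(round_trips(case, base64.b64encode, base64.b64decode)
                for case in CASES)

# Second run: the restricted interface sits between them.
restricted_ok = all(
    round_trips(case, base64.b64encode, base64.b64decode, restricted_interface)
    for case in CASES
)
```

Running the same cases without any encoding through restricted_interface is a useful control: payloads with high-bit bytes or trailing whitespace should fail, which confirms that the simulated restriction is actually being exercised.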
A transfer encoding should not pass any parameters other than the encoding type from the encoder to the decoder. Passing text parameters is complex and better handled by the layer above the transfer encoding.
At least one known transfer encoding implementation does pass a text parameter that is supplied on the command line: uuencode passes a file name. This parameter is restricted to a 7-bit ASCII subset that varies from platform to platform, illustrating exactly why transfer encoding should not be overloaded with parameter passing.

Character Interface

No requirement.

Graphical Interface

No requirement.

Application Protocols

See the requirements under "Command Line Interface."

Storage and Interchange

See the requirements under "Command Line Interface."

Application Programming Interfaces

See the requirements under "Command Line Interface."