Ascii85

This is an old revision of this page, as edited by Yesudeep (talk | contribs) at 17:38, 6 August 2011 (→‎Resources: I've added Mercurial and my own implementations for base85 in Python.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Ascii85 (also called "Base85") is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data (making the encoded size ¹⁄₄ larger than the original, assuming eight bits per ASCII character), it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data (¹⁄₃ increase, assuming eight bits per ASCII character).

Its main modern use is in Adobe's PostScript and Portable Document Format file formats.

Basic idea

The basic need for a binary-to-text encoding comes from a need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only human-readable text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), and may require line breaks at certain maximum intervals, and may not maintain whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.

4 bytes can represent 232 = 4,294,967,296 possible values. 5 radix-85 digits provide 855 = 4,437,053,125 possible values, enough to provide for a unique representation for each possible 32-bit value. Further, notice that five radix-84 digits provide 845 = 4,182,119,424 representable values. The implication of this is that 85 is the minimum possible integral base that will represent four bytes in five characters, hence its choice.

When encoding, each group of 4 bytes is taken as a 32-bit binary number, most significant byte first (Ascii85 uses a big-endian convention). This is converted, by repeatedly dividing by 85 and taking the remainder, into 5 radix-85 digits. Then each digit (again, most significant first) is encoded as an ASCII printable character by adding 33 to it, giving the ASCII characters 33 ("!") to 117 ("u").

Because all-zero data is quite common, an exception is made for the sake of data compression, and an all-zero group is encoded as a single character "z" instead of "!!!!!".

Groups of characters that decode to a value greater than 232 − 1 (encoded as "s8W-!") will cause a decoding error, as will "z" characters in the middle of a group. White space between the characters is ignored and may occur anywhere to accommodate line-length limitations.

One disadvantage of Ascii85 is that encoded data may contain characters such as backslash and quote, which have special meaning in many programming languages and in some text-based protocols.

btoa version

The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and hexadecimal) and three 32-bit checksums. The decoder needs to use the file length to see how much of the group was padding. This program also introduced the special "z" short form for an all-zero group. Version 4.2 added a "y" exception for a group of all ASCII space characters (0x20202020).

Adobe version

Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. In particular, Adobe uses the delimiter "~>" to mark the end of an Ascii85-encoded string, and represents the length by truncating the final group: If the last block of source bytes contains less than 4 bytes, the block is padded with up to three null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.

The reverse is applied when decoding: The last block is padded to 5 bytes with the Ascii85 character "u", and as many bytes as were added as padding are omitted from the end of the output (see example).

NOTE: The padding is not arbitrary. Converting from binary to base 64 only regroups bits and does not change them or their order (a high bit in binary does not affect the low bits in the base64 representation). In converting a binary number to base85 (85 is not a power of two) high bits do affect the low order base85 digits and conversely. Padding the binary low (with zero bits) while encoding and padding the base85 value high (with 'u's) in decoding assures that the high order bits are preserved (the zero padding in the binary gives enough room so that a small addition is trapped and there is no "carry" to the high bits).

In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.

Adobe's specification does not support the "y" exception.

Example

A quote from Thomas Hobbes's Leviathan:

Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.

initially encoded as US-ASCII and then reencoded in Ascii85 is as follows:

<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>
Text content M a n ... s u r e
ASCII 77 97 110 32 ... 115 117 114 101
Bit pattern 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 0 0 0 ... 0 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 1
32-bit Value 1,298,230,816 = 24×854 + 73×853 + 80×852 + 78×85 + 61 ... 1,937,076,837 = 37×854 + 9×853 + 17×852 + 44×85 + 22
Base 85 (+33) 24 (57) 73 (106) 80 (113) 78 (111) 61 (94) ... 37 (70) 9 (42) 17 (50) 44 (77) 22 (55)
ASCII 9 j q o ^ ... F * 2 M 7

Since the last 4-tuple is incomplete, it must be padded with three zero bytes:

Text content . \0 \0 \0
ASCII 46 0 0 0
Bit pattern 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
32-bit Value 771,751,936 = 14×854 + 66×853 + 56×852 + 74×85 + 46
Base 85 (+33) 14 (47) 66 (99) 56 (89) 74 (107) 46 (79)
ASCII / c Y k O

Since three bytes of padding had to be added, the three final characters 'YkO' are omitted from the output.

Decoding is done inversely, except that the last 5-tuple is padded with 'u' characters:

ASCII / c u u u
Base 85 (+33) 14 (47) 66 (99) 84 (117) 84 (117) 84 (117)
32-bit Value 771,955,124 = 14×854 + 66×853 + 84×852 + 84×85 + 84
Bit pattern 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0
ASCII 46 3 25 180
Text content . [ ETX ] [ EM ] undefined

Since the input had to be padded with three 'u' bytes, the last three bytes of the output are ignored and we end up with the original period.

Note that the input sentence does not contain 4 consecutive zero bytes, so the example does not show the use of the 'z' abbreviation.

Compatibility

The Ascii85 encoding is compatible with 7-bit and 8-bit MIME, while having less overhead than Base64.

One potential compatibility issue of Ascii85 is that 'single' and "double" quotation marks, <angle> brackets and ampersands (&) cannot be used unescaped in markup languages like XML or SGML.

RFC 1924 version

Published on April 1, 1996, and thus presumably not meant to be taken completely seriously, informational RFC 1924: "A Compact Representation of IPv6 Addresses" by Robert Elz suggests a base-85 encoding of IPv6 addresses. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.

The proposed character set is, in order, 09, AZ, az, and then the 23 characters !#$%&()*+-;<=>?@^_`{|}~. The highest possible representable address, 2128−1 = 74×8519 + 53×8518 + 5×8517 + ..., would be encoded as =r54lj&NUUO~Hi%c2ym0.

While the RFC chose a different character set in order to prevent the use of certain problematic characters ("',./:[]\), it still requires escaping for SGML-based protocols, notably for XML.

Resources

See also