Compression
About compression
Compression is a key feature of Web Risk. Compression significantly reduces bandwidth requirements, which is particularly relevant for mobile devices. The Web Risk server currently supports Rice compression. More compression methods might be added in the future.
Compression is set using the
supportedCompressions
field and
CompressionType
.
Clients should use the RICE and RAW compression types. Web Risk uses the
COMPRESSION_TYPE_UNSPECIFIED
type when the compression type is not set (RAW
compression will be substituted).
The Web Risk server will also use standard HTTP compression to further compress responses, regardless of the compression type selected, as long as the client sets the correct HTTP compression header. To learn more, see the Wikipedia article on HTTP compression.
Rice compression
As noted, the Web Risk server currently supports Rice compression. For more inforomation, see the Wikipedia article Golomb coding.
Compression/decompression
The
RiceDeltaEncoding
object represents the Rice-Golomb encoded data and is used to send compressed
removal indices or compressed 4-byte hash prefixes. Hash prefixes longer than
4 bytes will not be compressed, and will be served in raw format instead.
For removal indices, the list of indices is sorted in ascending order and then delta encoded using RICE encoding. For additions, the 4-byte hash prefixes are re-interpreted as little-endian uint32s, sorted in ascending order, and then delta encoded using RICE encoding. Note the difference in hash format between RICE compression and RAW: raw hashes are lexicographically sorted bytes, whereas RICE hashes are uint32s sorted in acsending order (after decompression).
That is, the list of integers [1, 5, 7, 13] will be encoded as 1 (the first value) and the deltas [4, 2, 6].
The first value is stored in the firstValue
field and the deltas are encoded
using a Golomb-Rice encoder. The Rice parameter k
(see below) is stored in
riceParameter
. The numEntries
field contains the number of deltas encoded
in the Rice encoder (3 in our example above, not 4). The encodedData
field
contains the actual encoded deltas.
Encoder/decoder
In the Rice encoder/decoder every delta n
is encoded as q
and r
where
n = (q<<k) + r
(or, n = q * (2**k) + r
). k
is a constant and a parameter
of the Rice encoder/decoder. The values for q
and r
are encoded in the bit
stream using different encoding schemes.
The quotient q
is encoded in unary coding followed by a 0. That is, 3 would be
encoded as 1110, 4 as 11110 and 7 as 11111110. The quotient q
is decoded
first.
The remainder r
is encoded using truncated binary encoding. Only the least
significant k
bits of r
are written (and therefore read) from the bit
stream. The remainder r
is decoded after having decoded q
.
Bit encoder/decoder
The Rice encoder relies on a bit encoder/decoder where single bits can be
appended to the bit encoder; that is, to encode a quotient q
that could be
only two bits long.
The bit encoder is a list of 8-bit bytes. Bits are set from the lowest significant bit in the first byte to the highest significant bit in the first byte. If a byte has all its bits already set, a new byte, initialized to zero, is appended to the end of the byte list. If the last byte is not fully used, its highest significant bits are set to zero. Example:
Bits Added | BitEncoder After Adding Bits |
---|---|
[] | |
0 | [00000000] |
1 | [00000010] |
1 | [00000110] |
1,0,1 | [00101110] |
0,0,0 | [00101110, 00000000] |
1,1,0 | [00101110, 00000110] |