Huffman Codes

A Loss-less data compression technique which varies the length of the encoded symbol in proportion to its information content, that is the more often a symbol or token is used, the shorter the binary string used to represent it in the compressed stream.

How it works

  • Before preforming this algorithm, we need to know some information. Consider...

Char

Code

Freq

Total Bits

A

00

10

20

B

01

15

30

C

10

3

6

Total Bits = 56

  • Char: The character within the text itself

  • Code: Arbitrary bit representation for an identification pattern. In this case, with 3 characters, we can use 2 bits to represent all of them individually, and uniquely since ceil(log2(3))=2ceil(log_2(3)) = 2. Representing 8 objects would require 3 bits.

  • Freq: Amount of occurrences

  • Total Bits: The total amount of spaces required to contain all the information. to total_bits = bit_length * freq.

Our Goal is to reduce the number of bits requires to represent this particular text.

  • Within a prefix tree, 0 means descending left, and 1 means descending right. To represent the prefix of a character, we descend down the prefix tree to find its corresponding code.

  • The prefix tree is originally represented through the combinations we've originally assigned them by default. (With 7 chars, we need 3 bits, so the height of the tree is 3).

  • This is the uncompressed prefix tree.

High Level Algorithm:

  1. Analyze data and obtain required information.

  2. Using our character list, and frequency, begin by finding the two minimum frequencies and adding them together. In this case, it was the S and new line.

  3. Create a root node between the two minimum nodes, and have their representation be the sum of the two nodes.

  4. Step 2.

  • Once the final tree is formed, we add left and right representations, and recreate our prefix representations. As long as their there are at least one differing frequency, Huffman Encoding will ensure that the total number of bits will get reduced.

Char

Code

Freq

Total Bits

A

110

10

30

E

10

15

30

I

00

12

24

S

11111

3

15

T

1110

4

16

P

01

3

26

\n

11110

1

5

Why it works

  • By choosing the summing the lowest frequency characters first, the algorithm ensures that their prefix tree code will contain the longest chain because the process adds an additional height level.

  • By representing the less frequent nodes with the longest chain, and the most frequent nodes with the shortest chain, we can represent more frequently used characters with fewer bits, and lower frequency used characters with more bits. So the overall data will remain lossless, but space efficiency will increase.

Last updated