Commit 1a75eee

Adding draft of huffman encoding chapter.
1 parent 934bc5c commit 1a75eee

5 files changed, +324 -4 lines changed

chapters/data_compression/data_compression.md

Lines changed: 1 addition & 1 deletion
@@ -142,4 +142,4 @@ L_2 &= 0.1\times 3 + 0.2 \times 3 + 0.3 \times 2 + 0.4 \times 1 = 1.9
 $$
 
 Here, it's clear that $$L_2 < L_1$$, and thus the second set of codewords compresses our data more than the first.
-This measure can be used as a direct test of certain simple data compression techniques, notably those created by Shannon, Fano, and Huffman, which will be covered soon!
+This measure can be used as a direct test of certain simple data compression techniques, notably those created by Shannon, Fano, and [Huffman](huffman/huffman.md), which will be covered soon!
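
As a rough illustration of the measure used in this hunk, the expected codeword length is the sum of each symbol's probability times the length of its codeword. The sketch below recomputes $$L_2$$ from the numbers in the expression above; the function name `expected_length` is made up for this example and does not appear in the repository.

```rust
/// Expected codeword length: sum of probability * codeword length over all symbols.
/// `expected_length` is a hypothetical helper written for this illustration.
fn expected_length(probabilities: &[f64], lengths: &[usize]) -> f64 {
    probabilities
        .iter()
        .zip(lengths.iter())
        .map(|(p, &len)| p * len as f64)
        .sum()
}

fn main() {
    // Probabilities and codeword lengths taken from the L_2 expression above.
    let probabilities = [0.1, 0.2, 0.3, 0.4];
    let lengths = [3, 3, 2, 1];
    println!("L_2 = {:.1}", expected_length(&probabilities, &lengths));
}
```

A smaller value means fewer bits per symbol on average, which is the sense in which one set of codewords compresses better than another.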
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# This is for the PriorityQueue
using DataStructures

struct Leaf
    weight::Int64
    key::Char
end

struct Branch
    right::Union{Leaf, Branch}
    left::Union{Leaf, Branch}
    weight::Int64
end

const Node = Union{Leaf, Branch}
isbranch(branch::Branch) = true
isbranch(other::T) where {T} = false

function codebook_recurse(leaf::Leaf, code::String,
                          dict::Dict{Char,String})
    dict[leaf.key] = code
end

function codebook_recurse(branch::Branch, code::String,
                          dict::Dict{Char,String})
    codebook_recurse(branch.left, string(code, "1"), dict)
    codebook_recurse(branch.right, string(code, "0"), dict)
end

# This will depth-first search through the tree
# to create bitstrings for each character.
# Note: Any depth-first search method will work
# This outputs encoding Dict to be used for encoding
function create_codebook(n::Node)
    codebook = Dict{Char,String}()
    codebook_recurse(n, "", codebook)
    return codebook
end

# This outputs huffman tree to generate dictionary for encoding
function create_tree(phrase::String)

    # creating weights, keyed directly by character
    weights = PriorityQueue{Char, Int64}()
    for i in phrase
        if (haskey(weights, i))
            weights[i] += 1
        else
            weights[i] = 1
        end
    end

    # Creating all nodes to iterate through
    nodes = PriorityQueue{Node, Int64}()
    while(length(weights) > 0)
        # peek returns a key => priority pair; dequeue! returns only the key
        weight = peek(weights)[2]
        key = dequeue!(weights)
        temp_node = Leaf(weight, key)
        enqueue!(nodes, temp_node, weight)
    end

    # Merge the two lightest nodes until only the root is left
    while(length(nodes) > 1)
        node1 = dequeue!(nodes)
        node2 = dequeue!(nodes)
        temp_node = Branch(node1, node2, node1.weight + node2.weight)
        enqueue!(nodes, temp_node, temp_node.weight)
    end

    huffman_tree = dequeue!(nodes)
    return huffman_tree

end

function encode(codebook::Dict{Char, String}, phrase::String)
    final_bitstring = ""
    for i in phrase
        final_bitstring = final_bitstring * codebook[i]
    end

    return final_bitstring
end

function decode(huffman_tree::Node, bitstring::String)
    current = huffman_tree
    final_string = ""
    for i in bitstring
        # '1' follows the left child, matching codebook_recurse
        if (i == '1')
            current = current.left
        else
            current = current.right
        end
        if (!isbranch(current))
            final_string = final_string * string(current.key)
            current = huffman_tree
        end
    end

    return final_string
end

function two_pass_huffman(phrase::String)
    huffman_tree = create_tree(phrase)
    codebook = create_codebook(huffman_tree)
    println(codebook)
    bitstring = encode(codebook, phrase)
    final_string = decode(huffman_tree, bitstring)
    println(bitstring)
    println(final_string)
end

two_pass_huffman("bibbity bobbity")
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
extern crate itertools;

use std::cmp::{Ord, Ordering, PartialOrd};
use std::collections::{BinaryHeap, HashMap};

use itertools::Itertools;

#[derive(Debug)]
enum HuffmanTree {
    Branch {
        count: i32,
        left: Box<HuffmanTree>,
        right: Box<HuffmanTree>,
    },
    Leaf {
        count: i32,
        value: char,
    },
}

impl PartialEq for HuffmanTree {
    fn eq(&self, other: &Self) -> bool {
        self.count() == other.count()
    }
}

impl Eq for HuffmanTree {}

// The comparisons below are reversed on purpose: `BinaryHeap` is a max-heap,
// so reversing the order makes it pop the node with the smallest count first.
impl PartialOrd for HuffmanTree {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        other.count().partial_cmp(&self.count())
    }
}

impl Ord for HuffmanTree {
    fn cmp(&self, other: &Self) -> Ordering {
        other.count().cmp(&self.count())
    }
}

#[derive(Debug)]
struct Codebook {
    codebook: HashMap<char, String>,
    tree: HuffmanTree,
}

impl HuffmanTree {
    pub fn from(input: &str) -> Self {
        // Count how often each character appears in the input
        let counts = input.chars().fold(HashMap::new(), |mut map, c| {
            *map.entry(c).or_insert(0) += 1;
            map
        });
        let mut queue = counts
            .iter()
            .map(|(&value, &count)| HuffmanTree::Leaf { value, count })
            .collect::<BinaryHeap<HuffmanTree>>();

        // Merge the two least frequent nodes until only the root is left
        while queue.len() > 1 {
            let left = queue.pop().unwrap();
            let right = queue.pop().unwrap();
            queue.push(HuffmanTree::Branch {
                count: left.count() + right.count(),
                left: Box::new(left),
                right: Box::new(right),
            })
        }

        queue.pop().expect("The Huffman tree has to have a root")
    }

    pub fn count(&self) -> i32 {
        match *self {
            HuffmanTree::Branch { count, .. } => count,
            HuffmanTree::Leaf { count, .. } => count,
        }
    }

    pub fn make_codebook(self) -> Codebook {
        let mut codebook = HashMap::new();
        self.dfs(String::from(""), &mut codebook);
        Codebook {
            codebook,
            tree: self,
        }
    }

    pub fn decode(&self, input: &str) -> String {
        let mut result = String::from("");
        let mut start = 0;
        while !input[start..].is_empty() {
            start += self.decode_dfs(&input[start..], &mut result);
        }
        result
    }

    // Decodes one character and returns the number of bits it consumed
    fn decode_dfs(&self, input: &str, result: &mut String) -> usize {
        let current = input.chars().next();
        match *self {
            HuffmanTree::Branch { ref left, .. } if current == Some('0') => {
                1 + left.decode_dfs(&input[1..], result)
            }
            HuffmanTree::Branch { ref right, .. } if current == Some('1') => {
                1 + right.decode_dfs(&input[1..], result)
            }
            HuffmanTree::Leaf { value, .. } => {
                result.push(value);
                0
            }
            _ => panic!("Unexpected end of input"),
        }
    }

    // Depth-first search: left children append a '0', right children a '1'
    fn dfs(&self, code: String, codebook: &mut HashMap<char, String>) {
        match *self {
            HuffmanTree::Branch {
                ref left,
                ref right,
                ..
            } => {
                left.dfs(code.clone() + "0", codebook);
                right.dfs(code.clone() + "1", codebook);
            }
            HuffmanTree::Leaf { value, .. } => {
                codebook.insert(value, code);
            }
        }
    }
}

impl Codebook {
    fn encode(&self, input: &str) -> String {
        input.chars().map(|c| &self.codebook[&c]).join("")
    }

    fn decode(&self, input: &str) -> String {
        self.tree.decode(input)
    }
}

fn main() {
    let input = "hello, world";

    let tree = HuffmanTree::from(input);
    let codebook = tree.make_codebook();
    let encoded = codebook.encode(input);
    let decoded = codebook.decode(&encoded);

    // Uncomment this line if you want to see the codebook/tree
    // println!("{:#?}", codebook);
    println!("{}", encoded);
    println!("{}", decoded);
}

chapters/data_compression/huffman/huffman.md

Lines changed: 59 additions & 3 deletions
@@ -24,9 +24,65 @@ $$
 # Huffman Encoding
 
 If there were ever a data compression method to take the world by storm, it would be Huffman encoding.
-In fact, this was the method that got me into methods to begin with.
+In fact, this was the method that got me into computational methods to begin with.
 I distinctly remember sitting in my data compression class and talking about the great information theorists Claude Shannon and Robert Fano, when suddenly my professor introduced a new kid to the mix: David Huffman.
-He managed to rip the heart out of the methods described by leaders of the field and create a data compression method that was easier to understand and implement, while also providing more robust results.
+He managed to rip the heart out of the methods described by leaders of the field and create a data compression method that was easier to understand and implement, while also providing more robust results. Apparently, this was all done for a school project!
+
 It was in that moment, I knew I would never amount to anything.
+I have since accepted that fact and moved on.
+
+Huffman encoding follows from the problem described in the [Data Compression](../data_compression.md) section.
+We have a string that we want to encode into bits.
+Huffman encoding ensures that our encoded bitstring is as small as possible for a code that gives each character its own codeword, without losing any information.
+Because it is both lossless and guarantees the smallest possible bit length, it outright replaces both Shannon and Shannon-Fano encoding in most cases, which is a little weird because the method was devised while Huffman was taking a course from Fano himself!
+
+The idea is somewhat straightforward in principle, but a little difficult to code in practice.
+By creating a binary tree of the input alphabet, every character can be given a unique bit representation simply by assigning a binary value to each child and reading off those values along the path from the root node to the character's leaf node.
+
+So now the question is: how do we create a binary tree?
+Well, here we build it from the bottom up like so:
+
+1. Order all characters according to how frequently they appear in the input string, with the most frequent character at the top of the list. Be sure to keep track of the frequencies, too!
+2. Combine the two nodes with the smallest frequencies into a new node whose frequency is their sum.
+3. Keep doing step 2 until only one node, the root, remains and the tree is complete.
+4. Read the tree backwards, from the root node down to each leaf, concatenating the bits along the way to form that character's codeword. Keep all codewords and put them into your final set of codewords (sometimes called a codebook).
+5. Encode your phrase with the codebook.
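
To see steps 1 through 3 in action without building the whole tree, here is a minimal Rust sketch that tracks only the node weights. It relies on the fact that each merge adds one bit to every character beneath the new node, so the sum of all merged weights is the total length of the encoded bitstring. The function name `huffman_total_bits` is an assumption made for this example and is not part of the chapter's code.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Total number of bits a Huffman code needs for `phrase`.
/// Hypothetical helper: it merges weights only and never stores the tree.
fn huffman_total_bits(phrase: &str) -> usize {
    // Step 1: count how often each character appears.
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in phrase.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }

    // `Reverse` turns Rust's max-heap into a min-heap on the weights.
    let mut heap: BinaryHeap<Reverse<usize>> =
        counts.values().map(|&w| Reverse(w)).collect();

    // Steps 2 and 3: merge the two smallest weights until one node remains.
    let mut total_bits = 0;
    while heap.len() > 1 {
        let Reverse(a) = heap.pop().unwrap();
        let Reverse(b) = heap.pop().unwrap();
        // Every character under the merged node gains one bit.
        total_bits += a + b;
        heap.push(Reverse(a + b));
    }
    total_bits
}

fn main() {
    println!("{}", huffman_total_bits("bibbity bobbity")); // prints 35
}
```

Note that a phrase with only one distinct character never triggers a merge, so this sketch reports 0 bits for it; a real encoder would have to treat that case separately.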
+
+And that's it.
+Here's an image of what this might look like for the phrase `bibbity_bobbity`:
+
+<p align="center">
+    <img src="res/huffman_tree.png" width="500" height="500" />
+</p>
+
+This will create a codebook that looks like this:
+
+| Character | Bit Representation |
+| --------- | ------------------ |
+| _b_       |                  0 |
+| _i_       |                100 |
+| _t_       |                101 |
+| _y_       |                110 |
+| _o_       |               1110 |
+| _\__      |               1111 |
+
+and `bibbity bobbity` (with the space shown as `_` in the table) becomes `01000010010111011110111000100101110`.
+As mentioned, this uses the minimum number of bits possible for a character-by-character encoding: 35 bits here, compared to 120 bits for 15 characters stored at 8 bits each.
+The fact that this algorithm is both conceptually simple and provably useful is rather extraordinary to me and is why Huffman encoding will always hold a special place in my heart.
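
As a quick check of the table above, the following Rust sketch hard-codes that codebook, encodes the phrase, and decodes it again by growing a candidate codeword one bit at a time. The helper names `codebook`, `encode`, and `decode` are chosen for this illustration only, and the space character stands in for the `_` row of the table.

```rust
use std::collections::HashMap;

/// Codebook copied from the table above; ' ' stands for the `_` row.
fn codebook() -> HashMap<char, &'static str> {
    [('b', "0"), ('i', "100"), ('t', "101"), ('y', "110"), ('o', "1110"), (' ', "1111")]
        .iter()
        .copied()
        .collect()
}

/// Concatenate the codeword of every character in `phrase`.
/// Panics if a character is missing from the codebook.
fn encode(codebook: &HashMap<char, &str>, phrase: &str) -> String {
    phrase.chars().map(|c| codebook[&c]).collect()
}

/// Decode by extending a candidate codeword bit by bit until it matches an entry.
/// This works because no codeword is a prefix of another.
/// Assumes `bits` contains only the ASCII characters '0' and '1'.
fn decode(codebook: &HashMap<char, &str>, bits: &str) -> String {
    let reverse: HashMap<&str, char> =
        codebook.iter().map(|(&c, &code)| (code, c)).collect();
    let mut decoded = String::new();
    let mut start = 0;
    for end in 1..=bits.len() {
        if let Some(&c) = reverse.get(&bits[start..end]) {
            decoded.push(c);
            start = end;
        }
    }
    decoded
}

fn main() {
    let book = codebook();
    let bits = encode(&book, "bibbity bobbity");
    println!("{}", bits);
    println!("{}", decode(&book, &bits));
}
```

Decoding with a lookup table like this is only one option; walking the tree bit by bit, as the chapter's implementations do, avoids building the reverse map.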
+
+# Example Code
+In code, this can be a little tricky. It requires a method to continually sort the nodes as you add more and more nodes to the system.
+The most straightforward way to do this in some languages is with a priority queue, but depending on the language, this might be more or less appropriate.
+In addition, to read the tree backwards, some sort of [Depth First Search](../../tree_traversal/tree_traversal.md) needs to be implemented.
+Whether you use a stack or straight-up recursion also depends on the language, but the recursive method is a little easier to understand in most cases.
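
For comparison with the recursive approach, here is a minimal sketch of the stack-based alternative in Rust. The `Tree` type and the function `codebook_with_stack` are hypothetical and deliberately simpler than the types used in the chapter's code; the sketch assumes left children append a `0` and right children a `1`.

```rust
use std::collections::HashMap;

// Hypothetical minimal tree type for this sketch (weights are omitted).
enum Tree {
    Leaf(char),
    Branch(Box<Tree>, Box<Tree>),
}

/// Build a codebook with an explicit stack instead of recursion.
fn codebook_with_stack(root: &Tree) -> HashMap<char, String> {
    let mut codebook = HashMap::new();
    // Each stack entry pairs a node with the bits on the path leading to it.
    let mut stack = vec![(root, String::new())];
    while let Some((node, code)) = stack.pop() {
        match node {
            Tree::Leaf(c) => {
                codebook.insert(*c, code);
            }
            Tree::Branch(left, right) => {
                // `&**left` borrows the tree stored inside the Box.
                stack.push((&**left, format!("{}0", code)));
                stack.push((&**right, format!("{}1", code)));
            }
        }
    }
    codebook
}

fn main() {
    // A three-leaf tree: 'b' directly under the root, 'i' and 't' one level deeper.
    let tree = Tree::Branch(
        Box::new(Tree::Leaf('b')),
        Box::new(Tree::Branch(
            Box::new(Tree::Leaf('i')),
            Box::new(Tree::Leaf('t')),
        )),
    );
    println!("{:?}", codebook_with_stack(&tree));
}
```

The stack version visits the same nodes as the recursive one; it mainly helps in languages or situations where deep recursion is a concern.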
+
+{% method %}
+{% sample lang="jl" %}
+### Julia
+[import, lang:"julia"](code/julia/huffman.jl)
+{% sample lang="rs" %}
+### Rust
+[import, lang:"rust"](code/rust/huffman.rs)
+{% endmethod %}
+
 
-Anyway, I'll be updating this chapter on Huffman encoding soon with all the information you could want / need, so stay tuned!