
Added Huffman with Haskell #77


Merged
merged 3 commits into from
Apr 18, 2018
57 changes: 57 additions & 0 deletions chapters/data_compression/huffman/code/haskell/huffman.hs
@@ -0,0 +1,57 @@
import qualified Data.Map as M
import Data.List (insert, sort)

-- A Huffman tree: leaves hold a frequency and a symbol; internal nodes
-- hold the combined frequency of their children.
data Tree a = Leaf Int a
            | Node Int (Tree a) (Tree a)
            deriving (Show, Eq)

freq :: Tree a -> Int
freq (Leaf i _)   = i
freq (Node i _ _) = i

-- Order trees by frequency so `insert` keeps the work list sorted.
instance (Eq a) => Ord (Tree a) where
  compare t1 t2 = compare (freq t1) (freq t2)

getFrequencies :: Ord a => [a] -> [(Int, a)]
getFrequencies = toSortedList . M.fromListWith (+) . flip zip (repeat 1)
  where toSortedList = sort . map swap . M.toList
        swap (a, i) = (i, a)

-- Repeatedly merge the two lowest-frequency trees until one remains.
buildTree :: (Ord a) => [a] -> Maybe (Tree a)
buildTree = build . map (uncurry Leaf) . getFrequencies
  where build []         = Nothing
        build [t]        = Just t
        build (t1:t2:ts) = build $ insert (Node (freq t1 + freq t2) t1 t2) ts

data Bit = Zero | One

instance Show Bit where
  show Zero = "0"
  show One  = "1"

encode :: (Ord a) => [a] -> (Maybe (Tree a), [Bit])
encode s = (tree, msg)
  where
    tree = buildTree s
    msg = concatMap (table M.!) s
    table = case tree of
      Nothing -> M.empty
      Just t  -> M.fromList $ mkTable (t, [])
    mkTable (Leaf _ a, p)     = [(a, reverse p)]
    mkTable (Node _ t1 t2, p) = concatMap mkTable [(t1, Zero : p), (t2, One : p)]

decode :: (Ord a) => Maybe (Tree a) -> [Bit] -> [a]
decode Nothing _  = []
decode (Just t) m = path t m
  where path (Leaf _ a) bits             = a : path t bits
        path (Node _ t1 _) (Zero : bits) = path t1 bits
        path (Node _ _ t2) (One : bits)  = path t2 bits
        path _ _ = []

main :: IO ()
main = do
  let msg = "bibbity bobbity"
      (tree, encoded) = encode msg
      decoded = decode tree encoded
  putStrLn $ "Encoding \"" ++ msg ++ "\": " ++ concatMap show encoded
  putStrLn $ "Length: " ++ show (length encoded)
  putStrLn $ "Decoding: " ++ decoded
23 changes: 12 additions & 11 deletions chapters/data_compression/huffman/huffman.md
@@ -1,7 +1,7 @@
<script>
MathJax.Hub.Queue(["Typeset",MathJax.Hub]);
</script>
$$
\newcommand{\d}{\mathrm{d}}
\newcommand{\bff}{\boldsymbol{f}}
\newcommand{\bfg}{\boldsymbol{g}}
@@ -23,7 +23,7 @@ $$

# Huffman Encoding

If there were ever a data compression method to take the world by storm, it would be Huffman encoding.
In fact, this was the method that got me into computational methods to begin with.
I distinctly remember sitting in my data compression class, talking about the great information theorists Claude Shannon and Robert Fano, when suddenly my professor introduced a new kid to the mix: David Huffman.
He managed to rip the heart out of the methods described by these leaders of the field and create a data compression method that was easier to understand and implement while also providing more robust results. Apparently, this was all done for a school project!
@@ -34,10 +34,10 @@ I have since accepted that fact and moved on.
Huffman encoding follows from the problem described in the [Data Compression](../data_compression.md) section.
We have a string that we want to encode into bits.
Huffman encoding ensures that our encoded bitstring is as small as possible without losing any information.
Because it is both lossless and guarantees the smallest possible bit length, it outright replaces both Shannon and Shannon-Fano encoding in most cases, which is a little weird because the method was devised while Huffman was taking a course from Fano himself!

The idea is somewhat straightforward in principle, but a little difficult to code in practice.
By creating a binary tree of the input alphabet, every character can be given a unique bit representation simply by assigning a binary value to each child and reading off the path from the root node to that character's leaf node.
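That read-off can be sketched in a few lines of Haskell: each left edge contributes a `0`, each right edge a `1`, and the accumulated path at a leaf is that character's codeword. (This is a minimal, self-contained sketch with its own tiny tree type and a made-up example tree, not the chapter's full implementation.)

```haskell
-- Minimal sketch of reading codewords off a binary tree: left edges are
-- '0', right edges are '1', and a codeword is the path from root to leaf.
data Tree a = Leaf a | Node (Tree a) (Tree a)

codewords :: Tree a -> [(a, String)]
codewords = go ""
  where go path (Leaf a)   = [(a, reverse path)]
        go path (Node l r) = go ('0' : path) l ++ go ('1' : path) r

-- For the (hypothetical) tree ((a b) c):
-- 'a' gets "00", 'b' gets "01", and 'c' gets "1".
example :: [(Char, String)]
example = codewords (Node (Node (Leaf 'a') (Leaf 'b')) (Leaf 'c'))
```

Note that shallower leaves get shorter codewords, which is exactly why the tree is built with rarer characters placed deeper.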

So now the question is: how do we create a binary tree?
Well, here we build it from the bottom up like so:
@@ -48,7 +48,7 @@ Well, here we build it from the bottom up like so:
4. Read the tree backwards from the root node and concatenate the final bitstring codeword. Keep all codewords and put them into your final set of codewords (sometimes called a codebook).
5. Encode your phrase with the codebook.
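The bottom-up build in the steps above can be sketched with a sorted list standing in for a priority queue: keep the weighted trees in frequency order, repeatedly merge the two lightest, and stop when one tree remains. (A minimal sketch with its own tiny tree type; the input frequencies in the comment are hypothetical.)

```haskell
import Data.List (insertBy, sortOn)
import Data.Ord (comparing)

data Tree = Leaf Int Char | Node Int Tree Tree deriving Show

weight :: Tree -> Int
weight (Leaf w _)   = w
weight (Node w _ _) = w

-- Steps 1-3: sort the leaves by frequency, then repeatedly pop the two
-- lightest trees and re-insert their merge, keeping the list sorted.
build :: [(Char, Int)] -> Maybe Tree
build = go . sortOn weight . map (\(c, w) -> Leaf w c)
  where
    go []         = Nothing
    go [t]        = Just t
    go (a:b:rest) = go (insertBy (comparing weight) merged rest)
      where merged = Node (weight a + weight b) a b
```

For example, `build [('a', 3), ('b', 2), ('c', 1)]` first merges `c` and `b` into a weight-3 node and then merges that with `a`, giving a root of weight 6.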

And that's it.
Here's an image of what this might look like for the phrase `bibbity_bobbity`:

<p align="center">
@@ -66,14 +66,14 @@ This will create a codebook that looks like this:
| _o_ | 1110 |
| ___ | 1111 |

and `bibbity_bobbity` becomes `01000010010111011110111000100101110`.
As mentioned, this uses the minimum number of bits possible for encoding.
The fact that this algorithm is both conceptually simple and provably useful is rather extraordinary to me and is why Huffman encoding will always hold a special place in my heart.
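The 35-bit total above can be checked without reconstructing the exact codebook (which depends on how ties are broken): every merge while building the tree adds one bit to the codeword of every symbol beneath it, so the encoded length is the sum of the merged weights. A small sketch under that observation:

```haskell
import Data.List (group, insert, sort)

-- Each merge of two subtrees lengthens the codeword of every symbol
-- beneath them by one bit, so the total encoded length is the sum of
-- all merged weights produced while building the tree.
encodedLength :: Ord a => [a] -> Int
encodedLength = go 0 . sort . map length . group . sort
  where
    go acc (a:b:rest) = go (acc + a + b) (insert (a + b) rest)
    go acc _          = acc

-- encodedLength "bibbity bobbity" evaluates to 35,
-- matching the length of the bitstring above.
```

This also gives a quick way to compare Huffman's output length against other prefix codes without generating any bits at all.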

# Example Code
In code, this can be a little tricky. It requires a method to continually sort the nodes as you add more and more nodes to the system.
The most straightforward way to do this in some languages is with a priority queue, but depending on the language, this might be more or less appropriate.
In addition, to read the tree backwards, some sort of [Depth First Search](../../tree_traversal/tree_traversal.md) needs to be implemented.
Whether you use a stack or straight-up recursion also depends on the language, but the recursive method is a little easier to understand in most cases.

{% method %}
@@ -83,6 +83,7 @@ Whether you use a stack or straight-up recursion also depends on the language, b
{% sample lang="rs" %}
### Rust
[import, lang:"rust"](code/rust/huffman.rs)
{% sample lang="hs" %}
### Haskell
[import, lang:"haskell"](code/haskell/huffman.hs)
{% endmethod %}