Add Python Implementation of Huffman Encoding #98

Merged
merged 3 commits into from
May 1, 2018
116 changes: 116 additions & 0 deletions chapters/data_compression/huffman/code/python/huffman.py
@@ -0,0 +1,116 @@
# Huffman Encoding
# Python 2.7+
# Submitted by Matthew Giallourakis

from collections import Counter

# constructs the tree
def build_tree(message):

    # get sorted list of character,frequency pairs
@Butt4cak3 (Contributor), May 1, 2018:

Sorry, I'm super nitpicky here because this is a comment, but commas are usually followed by spaces.

I mentioned the spaces after commas in another comment already. Oops!

    frequencies = Counter(message)
    trees = frequencies.most_common()

    # while there is more than one tree
    while len(trees) > 1:

        # pop off the two trees of least weight from the trees list
        tree_left,weight_left = trees.pop()
        tree_right,weight_right = trees.pop()

        # combine the nodes and add back to the nodes list
        new_tree = [tree_left,tree_right]
        new_weight = weight_left+weight_right
        trees.append((new_tree,new_weight))

        # sort the trees list by weight
        trees = sorted(trees, key=lambda n: n[1], reverse=True)
Contributor:

You don't have to sort the entire trees list after each iteration. I know that it will always be pretty small, but I think it would be nicer to find the right place in the list and use list.insert() instead of list.append() followed by sorted().

# Find the first tree that has a weight smaller than new_weight and return its index in the list
# If no such tree can be found, use len(trees) instead to append
index = next((i for i, tree in enumerate(trees) if tree[1] < new_weight), len(trees))

# Insert the new tree there
trees.insert(index, (new_tree, new_weight))

Contributor Author:

I was thinking of doing an insert, but I thought it would be a little harder to explain and detract from the point of the code. I'll replace it with your code (thanks!) and do the more efficient option from now on.
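
For reference, here is a minimal sketch of how the suggested insertion might slot into the loop. The function name `build_tree_with_insert` is hypothetical and this is not necessarily the code that was merged. Because the new tree is placed after any existing trees of equal weight, it should produce the same ordering as appending and then running a stable descending sort.

```python
from collections import Counter


def build_tree_with_insert(message):
    """Variant of build_tree that keeps the list ordered with list.insert().

    The name is hypothetical; the body follows the submitted build_tree
    with the reviewer's suggestion swapped in for append + sorted.
    """
    # (symbol, weight) pairs, heaviest first
    trees = Counter(message).most_common()

    while len(trees) > 1:
        # the two lightest trees sit at the end of the list
        tree_left, weight_left = trees.pop()
        tree_right, weight_right = trees.pop()

        new_tree = [tree_left, tree_right]
        new_weight = weight_left + weight_right

        # index of the first tree lighter than the new one,
        # or len(trees) if there is none (equivalent to appending)
        index = next((i for i, tree in enumerate(trees)
                      if tree[1] < new_weight), len(trees))
        trees.insert(index, (new_tree, new_weight))

    return trees[0][0]


print(build_tree_with_insert('bibbity_bobbity'))
```

Each insertion is a linear scan rather than a full sort, which matters little for a handful of symbols but keeps the intent ("put the new tree where it belongs") visible in the code.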


    tree = trees[0][0]
    return tree

# constructs the mapping with recursion
def build_mapping(tree,code=''):
Contributor:

You seem to not like spaces between comma-separated identifiers. To stay consistent with other code examples and code outside the AAA you should probably put spaces between function parameters, list items, etc.

@foldsters (Contributor Author), May 1, 2018:

This one is personal preference: I usually use white space to indicate order of operations, so ((v,k) for k,v in mapping) tells me that k and v are on the same step, while ((v, k) for k, v in mapping) looks like (v, k) for k, is one step and v in mapping is another. But I see where you're coming from, and I'll use sentence syntax from now on.

Contributor:

I forgot to mention that this goes for pretty much all operators, too (a + b instead of a+b).

I can see how it makes sense in your example and maybe it's okay to omit the space in some cases if it really improves readability. But we generally like spaces here. :D

Contributor Author:

Sure, I've never programmed in an environment where other people needed to look at my code, so I appreciate the pointers!

Contributor:

Very good. Code review is weird because I'm always afraid of sounding like "YOU'RE DOING IT WRONG! YOU SHOULD DO IT LIKE ME AND YOUR CODE IS BAD!" but we seem to be on the same page!
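
As a small illustration of the spacing discussed in this thread, the sketch below puts the two styles side by side. The variable names are made up for the example. The spaced form matches PEP 8, which also suggests tightening whitespace around higher-priority operators to show grouping, a convention close to the author's habit.

```python
# Spacing as used in the submission (valid Python, but denser).
pairs = [('b','0'),('i','111')]
inverse_tight = dict([(v,k) for k,v in pairs])

# Same logic with a space after each comma, as requested in the review.
pairs = [('b', '0'), ('i', '111')]
inverse_spaced = dict([(v, k) for k, v in pairs])

# PEP 8 allows dropping the spaces around higher-priority operators
# to show grouping, e.g. multiplication inside an addition.
x, y = 3, 4
hypot2 = x*x + y*y

print(inverse_tight == inverse_spaced)
print(hypot2)
```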


    results = []

    # split the tree
    left_tree,right_tree = tree

    # if the left node has children, find the mapping of those children
    # else pair the character with the current code + 0
    if type(left_tree) is list:
        results += build_mapping(left_tree,code+'0')
    else:
        results.append((left_tree,code+'0'))

    # if the right node has children, find the mapping of those children
    # else pair the character with the current code + 1
    if type(right_tree) is list:
        results += build_mapping(right_tree,code+'1')
    else:
        results.append((right_tree,code+'1'))

    return results

# encodes the message
def encode(mapping,message):

    encoding = ""
Contributor:

You use double quotes here and in a few other places as well, while you used single quotes in others. You should stick to one or the other and since single quotes are more common in Python and because other code examples in the AAA already use them, I recommend you change all your double quotes to single quotes.

Contributor Author:

Whoops, I didn't even notice that I did that! Fixing that up now.

Contributor:

The variable name of this confused me for a second. Maybe code or encoded are better names here?

Contributor Author:

True, I'll make the variables more descriptive.


    # build a char -> code dictionary
    forward_dict = dict(mapping)

    # replace each character with its code
    for char in message:
        encoding += forward_dict[char]

    return encoding

# decodes a message
def decode(mapping,encoding):

    message = ""
    key = ""

    # build a code -> char dictionary
    inverse_dict = dict([(v,k) for k,v in mapping])

    # for each bit in the encoding
    # if the bit is in the dictionary, replace the bit with the paired character
    # else look at the bit and the following bits together until a match occurs
    # move to the next bit not yet looked at
    for index,bit in enumerate(encoding):
        key += bit
        if key in inverse_dict:
            message += inverse_dict[key]
            key = ""

    return message

def main():

    # test example
    message = "bibbity_bobbity"
    tree = build_tree(message)
    mapping = build_mapping(tree)
    encoding = encode(mapping,message)
    decoding = decode(mapping,encoding)

    print('message: '+message)
    print('tree: '+str(tree))
    print('mapping: '+str(mapping))
    print('encoding: '+encoding)
    print('decoding: '+decoding)

    # prints the following:
    # (ties between equal-weight symbols, e.g. '_' and 'o', may be
    # broken differently depending on the Python version)
    #
    # message: bibbity_bobbity
    # tree: ['b', [[['_', 'o'], 'y'], ['t', 'i']]]
    # mapping: [('b', '0'), ('_', '1000'), ('o', '1001'),
    #           ('y', '101'), ('t', '110'), ('i', '111')]
    # encoding: 01110011111010110000100100111110101
    # decoding: bibbity_bobbity

if __name__ == '__main__':
    main()
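
The bit-by-bit loop in decode relies on the codes being prefix-free: no code is the start of another, so the first match is always the right one. The sketch below checks that property for the mapping listed in the expected output and repeats the greedy decode on the listed encoding. The helper names `is_prefix_free` and `greedy_decode` are hypothetical and not part of the submission.

```python
from itertools import permutations

# Mapping copied from the expected output in the submitted file.
mapping = [('b', '0'), ('_', '1000'), ('o', '1001'),
           ('y', '101'), ('t', '110'), ('i', '111')]


def is_prefix_free(mapping):
    """Return True if no code is a prefix of a different code."""
    codes = [code for _, code in mapping]
    return not any(a.startswith(b) for a, b in permutations(codes, 2))


def greedy_decode(mapping, bits):
    """Hypothetical stand-in for decode(): emit a symbol as soon as the
    accumulated bits match a code. Only safe for prefix-free codes."""
    inverse = {code: char for char, code in mapping}
    message, key = [], ''
    for bit in bits:
        key += bit
        if key in inverse:
            message.append(inverse[key])
            key = ''
    return ''.join(message)


print(is_prefix_free(mapping))
print(greedy_decode(mapping, '01110011111010110000100100111110101'))
```

A mapping such as [('a', '0'), ('b', '01')] would fail the check, and a greedy decoder would then emit 'a' before it ever saw the full code for 'b'.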