📜  GATE | GATE-CS-2007 | Question 10(1)

📅  Last modified: 2023-12-03 14:58:27.930000             🧑  Author: Mango

GATE-CS-2007 Problem 10

This problem asks us to implement the Huffman coding algorithm to compress and decompress a given string of characters. Huffman coding is a lossless data compression algorithm that assigns variable-length binary codes to symbols based on how often they occur. The more frequently a symbol appears in the string, the shorter its code, so the encoded string needs fewer bits overall than a fixed-length encoding whenever the frequencies are uneven.
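As a rough illustration of the saving, the sketch below compares a fixed-length encoding with a Huffman encoding for 'abacabadabacaba', the sample string used in the implementation further down. The `code_lengths` table is an assumption written out by hand: it lists the code lengths a Huffman tree yields for the frequencies 8, 4, 2 and 1.

```python
from collections import Counter
from math import ceil, log2

text = 'abacabadabacaba'
freq = Counter(text)  # a: 8, b: 4, c: 2, d: 1

# Assumed Huffman code lengths for this string (not computed here).
code_lengths = {'a': 1, 'b': 2, 'c': 3, 'd': 3}

# A fixed-length code needs ceil(log2(4)) = 2 bits for each of the 15 characters.
fixed_bits = len(text) * ceil(log2(len(freq)))

# A Huffman code spends frequency * code length bits on each symbol.
huffman_bits = sum(freq[s] * code_lengths[s] for s in freq)

print(fixed_bits, huffman_bits)  # 30 25
```

The frequent symbol 'a' costs one bit per occurrence, which more than pays for the three-bit codes of the rare symbols 'c' and 'd'.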

Solution Approach

To implement Huffman coding, we need to perform the following steps:

  1. Count the frequency of each character in the input string.
  2. Build a Huffman tree based on the frequency counts, where the frequency of each character is the weight of the node.
  3. Assign codes to each leaf node, where the code of a leaf node is the path taken from the root of the tree to the leaf node. In this process, assign 0 to the left edge and 1 to the right edge.
  4. Encode the input string using the generated Huffman codes.
  5. Decode the encoded string using the Huffman tree or, equivalently, the code table derived from it.

Code Implementation

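from heapq import heappush, heappop, heapify
from collections import Counter
from itertools import count

def encode(input_string):
    # Step 1: count how often each character occurs.
    freq = Counter(input_string)
    if not freq:
        return '', {}

    # Heap entries are [weight, tie_breaker, node]. A leaf node is a
    # character; an internal node is a (left, right) tuple. The unique
    # tie_breaker keeps heapq from ever comparing two nodes directly.
    order = count()
    pq = [[weight, next(order), char] for char, weight in freq.items()]
    heapify(pq)

    # Step 2: repeatedly merge the two lightest nodes.
    while len(pq) > 1:
        left = heappop(pq)
        right = heappop(pq)
        heappush(pq, [left[0] + right[0], next(order), (left[2], right[2])])

    codes = {}

    # Step 3: walk the tree, appending '0' for left and '1' for right.
    def assign_codes(node, prefix=''):
        if isinstance(node, str):
            # A lone symbol would get an empty code, so give it '0'.
            codes[node] = prefix or '0'
        else:
            assign_codes(node[0], prefix + '0')
            assign_codes(node[1], prefix + '1')

    assign_codes(heappop(pq)[2])

    # Step 4: replace each character with its code.
    encoded_string = ''.join(codes[c] for c in input_string)

    return encoded_string, codes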

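def decode(encoded_string, codes):
    # Invert the table so each bit pattern maps back to its character.
    rev_codes = {v: k for k, v in codes.items()}
    decoded_chars = []
    current = ''
    for bit in encoded_string:
        current += bit
        # Huffman codes are prefix-free, so the first match is the only one.
        if current in rev_codes:
            decoded_chars.append(rev_codes[current])
            current = ''
    # Leftover bits mean the input was truncated or not produced by these codes.
    if current:
        raise ValueError('encoded string ends in the middle of a code')
    return ''.join(decoded_chars)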

input_string = 'abacabadabacaba'
encoded_string, codes = encode(input_string)
print('The encoded string using Huffman coding:', encoded_string)
decoded_string = decode(encoded_string, codes)
print('The decoded string:', decoded_string)

Explanation

We first count the frequency of each character in the input string. Then we create a priority queue (a binary heap), where each entry stores a frequency together with the character or subtree it represents. We repeatedly extract the two entries with the smallest frequencies, create a parent node whose frequency is the sum of the two, and push the parent back onto the queue. We repeat this until only one entry remains, which is the root of the Huffman tree.
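To make the merge order concrete, here is a minimal sketch that tracks only the node weights and records each merge. The helper name `merge_weights` is introduced here for illustration and is not part of the implementation above.

```python
from collections import Counter
from heapq import heapify, heappop, heappush

def merge_weights(text):
    """Return the (left, right, parent) weights of each merge, in order."""
    heap = list(Counter(text).values())
    heapify(heap)
    merges = []
    while len(heap) > 1:
        left = heappop(heap)    # smallest remaining weight
        right = heappop(heap)   # second smallest
        heappush(heap, left + right)
        merges.append((left, right, left + right))
    return merges

print(merge_weights('abacabadabacaba'))
# [(1, 2, 3), (3, 4, 7), (7, 8, 15)]
```

The parent weights sum to 3 + 7 + 15 = 25, which is the length of the encoded string in bits: each character's frequency is counted once in every ancestor, that is, once per bit of its code.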

We then traverse the Huffman tree to assign codes to each leaf node, by recursively traversing the left and right subtrees and adding a '0' for the left subtree and a '1' for the right subtree. We store the generated codes in a dictionary.
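Because every character sits at a leaf, no code can be a prefix of another. The sketch below checks that property on a code table. The table shown is an assumption: it is one valid Huffman table for 'abacabadabacaba', and the exact bits depend on how ties and the left/right order fall out. `is_prefix_free` is a hypothetical helper name.

```python
from itertools import permutations

def is_prefix_free(codes):
    """True if no code in the table is a prefix of a different entry's code."""
    return not any(b.startswith(a) for a, b in permutations(codes.values(), 2))

# Assumed example table: one valid Huffman table for 'abacabadabacaba'.
example_codes = {'a': '1', 'b': '01', 'c': '001', 'd': '000'}

print(is_prefix_free(example_codes))         # True
print(is_prefix_free({'a': '1', 'b': '10'})) # False: '1' is a prefix of '10'
```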

After generating the Huffman codes, we encode the input string by replacing each character with its corresponding code. To decode, we scan the encoded string from left to right, extending the current substring one bit at a time until it matches a code in the dictionary, then output the corresponding character and start over at the next bit. Because no Huffman code is a prefix of another, the first match is always the correct one.
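Decoding can also follow the tree directly, as step 5 of the approach describes. The sketch below is one possible way to do that, assuming only the code table is available: it rebuilds the tree as nested dictionaries and walks it one bit at a time. The names `build_trie` and `decode_with_trie` and the example table are illustrative assumptions, and every code is assumed to be non-empty.

```python
def build_trie(codes):
    """Rebuild the code tree as nested dicts keyed by '0' and '1'."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol  # the last bit leads to the leaf
    return root

def decode_with_trie(bits, codes):
    root = build_trie(codes)
    node = root
    out = []
    for bit in bits:
        node = node[bit]           # follow the left ('0') or right ('1') edge
        if isinstance(node, str):  # reached a leaf: emit it and restart at the root
            out.append(node)
            node = root
    return ''.join(out)

# Assumed example table: one valid Huffman table for the sample string.
example_codes = {'a': '1', 'b': '01', 'c': '001', 'd': '000'}
bits = ''.join(example_codes[c] for c in 'abacabadabacaba')
print(len(bits))                              # 25
print(decode_with_trie(bits, example_codes))  # abacabadabacaba
```

Walking the tree does constant work per bit, whereas the dictionary approach builds and hashes a growing substring for every bit it reads.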

Conclusion

In this article, we implemented the Huffman coding algorithm to compress and decompress a given string of characters. Huffman coding is one stage of the DEFLATE format used by zip and gzip, where it is combined with LZ77 matching, and it provides a simple yet effective way to compress data without losing any information.