Entropy Solution: Huffman Coding with a Priority Queue

Entropy

Problem Description

An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with “wasted” or “extra” information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; English text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.

English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in English text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it’s easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a “prefix-free variable-length” encoding.

In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.

Consider the text “AAAAABCD”. Using ASCII, encoding this would require 64 bits. If, instead, we encode “A” with the bit pattern “00”, “B” with “01”, “C” with “10”, and “D” with “11” then we can encode this text in only 16 bits; the resulting bit pattern would be “0000000000011011”. This is still a fixed-length encoding, however; we’re using two bits per glyph instead of eight. Since the glyph “A” occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode “A” with “0”, “B” with “10”, “C” with “110”, and “D” with “111”. (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to “0000010110111”, a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you’ll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
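
The left-to-right decoding described above can be sketched in C++ as follows. This is only an illustration, not part of the original problem: the helper names exampleTable, encodeText and decodeBits are hypothetical, and bits are stored as '0'/'1' characters for readability.

```cpp
#include <map>
#include <string>

// Hypothetical helper: the optimal table quoted above for "AAAAABCD".
std::map<char, std::string> exampleTable() {
    std::map<char, std::string> table;
    table['A'] = "0";
    table['B'] = "10";
    table['C'] = "110";
    table['D'] = "111";
    return table;
}

// Hypothetical helper: concatenates the bit pattern of every glyph.
std::string encodeText(const std::string& text,
                       const std::map<char, std::string>& table) {
    std::string bits;
    for (char c : text) bits += table.at(c);
    return bits;
}

// Hypothetical helper: reads the stream bit by bit and emits a glyph as
// soon as the buffered bits match a code. Because no code is a prefix of
// another, the first match is always the right one.
std::string decodeBits(const std::string& bits,
                       const std::map<char, std::string>& table) {
    std::map<std::string, char> reverse;
    for (const auto& entry : table) reverse[entry.second] = entry.first;

    std::string text, buffer;
    for (char bit : bits) {
        buffer += bit;
        auto it = reverse.find(buffer);
        if (it != reverse.end()) {
            text += it->second;
            buffer.clear();
        }
    }
    return text;
}
```

Calling encodeText("AAAAABCD", exampleTable()) should give the 13-bit pattern from the text, and decodeBits on that pattern should give back the original string.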

As a second example, consider the text “THE CAT IN THE HAT”. In this text, the letter “T” and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters “C”, “I” and “N” only occur once, however, so they will have the longest codes.

There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with “00”, “A” with “100”, “C” with “1110”, “E” with “1111”, “H” with “110”, “I” with “1010”, “N” with “1011” and “T” with “01”. The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
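
As a quick sanity check of the figures above, the following sketch tests that the quoted table is prefix-free and counts the bits it needs for the message. The function names catTable, isPrefixFree and encodedBits are made up for this illustration.

```cpp
#include <map>
#include <string>

// Hypothetical helper: the code table quoted in the problem statement.
std::map<char, std::string> catTable() {
    std::map<char, std::string> table;
    table[' '] = "00";
    table['A'] = "100";
    table['C'] = "1110";
    table['E'] = "1111";
    table['H'] = "110";
    table['I'] = "1010";
    table['N'] = "1011";
    table['T'] = "01";
    return table;
}

// Returns true if no code in the table is a prefix of another code.
bool isPrefixFree(const std::map<char, std::string>& table) {
    for (const auto& a : table) {
        for (const auto& b : table) {
            if (a.first == b.first) continue;
            // Does b's code start with a's code?
            if (b.second.compare(0, a.second.size(), a.second) == 0) {
                return false;
            }
        }
    }
    return true;
}

// Total number of bits needed to encode text with the given table.
int encodedBits(const std::string& text,
                const std::map<char, std::string>& table) {
    int bits = 0;
    for (char c : text) bits += static_cast<int>(table.at(c).size());
    return bits;
}
```

For “THE CAT IN THE HAT” (18 characters, so 144 bits in 8-bit ASCII), encodedBits with this table should come to 51.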

Input Description

The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word “END” as the text string. This line should not be processed.

Output Description

For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.
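
One possible way to produce a line in this format is sketched below. The name formatResult is hypothetical, and the sketch assumes encodedBits is greater than zero, which holds for any non-empty input string.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds "<ascii bits> <encoded bits> <ratio>" with
// the ratio printed to one digit after the decimal point.
// Assumes encodedBits > 0.
std::string formatResult(int asciiBits, int encodedBits) {
    std::ostringstream out;
    double ratio = static_cast<double>(asciiBits) / encodedBits;
    out << asciiBits << ' ' << encodedBits << ' '
        << std::fixed << std::setprecision(1) << ratio;
    return out.str();
}
```

With the two examples from the problem statement, formatResult(64, 13) and formatResult(144, 51) should reproduce the sample output lines.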

Sample Input 1

AAAAABCD
THE_CAT_IN_THE_HAT
END

Sample Output 1

64 13 4.9
144 51 2.8

Original Problem

POJ 1521 (link)

Problem Summary

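Read strings one per line until a line containing only "END" ends the input. For each string, compute its length in bits under 8-bit ASCII encoding and under Huffman encoding, then output both lengths and the ratio of the first to the second.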

Approach

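ASCII encoding: each character takes 8 bits, so the length is 8 times the length of the string.
Huffman encoding: first count how many times each character occurs, then store the counts in a priority queue (a min-heap). Following the greedy strategy used to build a Huffman tree, repeatedly take the two smallest counts out of the heap, add them, and push the sum back, until only one element is left. After each merge, add the merged sum to a running total (ans in the code); the final total is the length of the Huffman-encoded string. The one special case is a string with a single distinct character: no merge happens, and the answer is simply that character's count, since each occurrence still needs one bit.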
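
Why does the running total of merged sums equal the encoded length? Each merge pushes every character under the two merged nodes one level deeper in the tree, that is, it adds one bit to each of their occurrences, and the merged sum is exactly the number of those occurrences. The sketch below makes this visible by building the tree explicitly and computing the length as frequency times depth. It is an illustration under my own naming (huffmanCodeLengths and huffmanBits are hypothetical helpers), not the author's solution.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: code length (tree depth) of each distinct character.
std::map<char, int> huffmanCodeLengths(const std::string& text) {
    std::map<char, int> freq;
    for (char c : text) ++freq[c];

    std::vector<int> parent;    // parent[i] = parent node of node i, -1 for the root
    std::vector<char> symbols;  // symbols[i] = character stored in leaf i
    typedef std::pair<int, int> Item;  // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;

    for (const auto& entry : freq) {
        heap.push(Item(entry.second, static_cast<int>(parent.size())));
        symbols.push_back(entry.first);
        parent.push_back(-1);
    }

    // Greedy step: merge the two lightest nodes under a new parent.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        int merged = static_cast<int>(parent.size());
        parent.push_back(-1);
        parent[a.second] = merged;
        parent[b.second] = merged;
        heap.push(Item(a.first + b.first, merged));
    }

    std::map<char, int> lengths;
    for (int i = 0; i < static_cast<int>(symbols.size()); ++i) {
        int depth = 0;
        for (int node = i; parent[node] != -1; node = parent[node]) ++depth;
        // A lone distinct character sits at the root but still needs 1 bit.
        lengths[symbols[i]] = depth > 0 ? depth : 1;
    }
    return lengths;
}

// Encoded length = sum over all characters of their code length.
int huffmanBits(const std::string& text) {
    std::map<char, int> lengths = huffmanCodeLengths(text);
    int bits = 0;
    for (char c : text) bits += lengths.at(c);
    return bits;
}
```

For "AAAAABCD" the counts are 5, 1, 1, 1. The merges produce 2, 3 and 8, which sum to 13, and huffmanBits arrives at the same 13 by giving A a 1-bit code and B, C, D codes of 2 or 3 bits.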

Code

#include <iostream>
#include <string>
#include <map>
#include <queue>
#include <vector>
#include <functional>
#include <iomanip>
using namespace std;

int main()
{
    ios::sync_with_stdio(0);
    cin.tie(0);
    string s;
    // Stop at the "END" sentinel, or at end of input if the sentinel is missing
    while (cin >> s && s != "END")
    {
        map<char, int> mp; // count how often each character occurs
        for (size_t i = 0; i < s.size(); i++)
        {
            mp[s[i]]++;
        }
        priority_queue<int, vector<int>, greater<int> > pq; // min-heap
        for (map<char, int>::iterator i = mp.begin(); i != mp.end(); i++)
        {
            pq.push(i->second); // push each frequency into the priority queue
        }
        int ans = 0;
        if (pq.size() == 1) // special case: with one distinct character, the answer is its frequency
        {
            ans = pq.top();
        }
        while (pq.size() > 1) // greedy step: merge the two smallest frequencies
        {
            int x = pq.top();
            pq.pop();
            int y = pq.top();
            pq.pop();
            int sum = x + y;
            pq.push(sum);
            ans += sum;
        }
        int len = static_cast<int>(s.size()) * 8;
        double proportion = (double)len / (double)ans; // compression ratio
        cout << len << ' ' << ans << ' ' << fixed << setprecision(1) << proportion << '\n';
    }
    return 0;
}
