Chinese Text Compression Algorithm Based on Variable-Length Encoding Set Expansion

Abstract: To achieve a high compression ratio on Chinese text, the proposed algorithm exploits the characteristics of written Chinese. It takes variable-length, high-probability Chinese character strings as its coding units, defines a fixed dictionary set over them, and encodes the text with a matching strategy chosen for a high compression ratio. The encoding process fits the partly Markov nature of natural language, and the compression ratio stays evenly distributed across texts of different lengths and writing styles. Overall, the algorithm achieves a relatively high compression ratio.
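
To make the idea of coding against a fixed dictionary of variable-length strings concrete, the following is a minimal sketch, not the paper's actual method. It assumes a greedy longest-match strategy, which the abstract does not specify, and the names `encode`, `decode`, and the sample dictionary are hypothetical illustrations.

```python
# Sketch only: greedy longest-match is an ASSUMED matching strategy,
# and the dictionary below is a made-up example, not the paper's.

def encode(text, dictionary):
    """Turn text into tokens: ("D", i) for dictionary entry i,
    ("C", ch) for a single character with no dictionary match."""
    index = {s: i for i, s in enumerate(dictionary)}
    max_len = max((len(s) for s in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in index:
                tokens.append(("D", index[candidate]))
                pos += length
                break
        else:
            # No dictionary string starts here: emit the raw character.
            tokens.append(("C", text[pos]))
            pos += 1
    return tokens


def decode(tokens, dictionary):
    """Invert encode(): expand dictionary references, copy literals."""
    return "".join(dictionary[v] if kind == "D" else v for kind, v in tokens)


if __name__ == "__main__":
    sample_dictionary = ["中文", "文本压缩", "压缩", "算法"]
    sample_text = "中文文本压缩算法研究"
    tokens = encode(sample_text, sample_dictionary)
    print(tokens)
    print(decode(tokens, sample_dictionary) == sample_text)
```

In this sketch ten characters become five tokens. A real coder would then assign variable-length bit codes to the tokens, and a different matching strategy could choose another segmentation of the same text.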