跳转到内容

散列函數:修订间差异

维基百科,自由的百科全书
删除的内容 添加的内容
Dragontail留言 | 贡献
无编辑摘要
Dragontail留言 | 贡献
第133行: 第133行:
class of these correction codes are [[cyclic redundancy check]]s and [[Reed-Solomon code]]s.
class of these correction codes are [[cyclic redundancy check]]s and [[Reed-Solomon code]]s.


===Audio identification===
===语音识别===
For audio identification such as finding out whether an [[MP3]] file matches one of a list of known
For audio identification such as finding out whether an [[MP3]] file matches one of a list of known



2006年7月1日 (六) 23:00的版本

本文章正在翻译,目前已完成 20%,原文在http://en.wikipedia.org/wiki/Hash_function

散列函数(或散列算法)是一种从任何一种数据中创建小的数字“指纹”的方法。该函数将数据打乱混合,重新创建一个叫做散列值的指纹。散列值通常用来代表一个短的随机字母和数字组成的字符串。好的散列函数在输入域中很少出现散列冲突。在散列表数据处理中,不抑制冲突来区别数据,会使得数据库记录更难找到。

一种典型的散列函数

散列函数的性质

所有散列函数都有如下一个基本特性:如果两个散列值是不相同的(根据同一函数),那么这两个散列值的原始输入也是不相同的。这个特性是散列函数具有确定性的结果。但另一方面,如果两个散列值相同,可能两个输入值是相同的,但并不能肯定二者一定相等。输入一些数据计算出散列值,然后部分改变输入值,根据散列函数强混淆特性,我们得出一个完全不同的散列值。

典型的散列函数都有无限定义域,比如任意长度的字节字符串,和有限的值域,比如固定长度的比特串。在某些情况下,散列函数可以设计成具有相同大小的定义域和值域间的一一映射一一映射的散列函数也称为排列。可逆性可以通过使用一系列的对于输入值的可逆“混合”运算而得到。

散列函数的应用

Because of the variety of applications for hash functions (details below), they are often tailored to the application. For example, cryptographic hash functions assume the existence of an

adversary who can deliberately try to find inputs with the same hash value. A well designed

cryptographic hash function is a "one-way" operation: there is no practical way to calculate a

particular data input that will result in a desired hash value, so it is also very difficult to

forge. Functions intended for cryptographic hashing, such as MD5, are commonly used

as stock hash functions.

Functions for error detection and correction focus on distinguishing cases in which data has been

disturbed by random processes. When hash functions are used for checksums, the relatively small hash

value can be used to verify that a data file of any size has not been altered.


加密

Main article: cryptographic hash function

A typical cryptographic one-way function is not one-to-one and makes an effective hash function; a

typical cryptographic trapdoor function is one-to-one and makes an effective randomization

function.

散列表

Main article: Hash table

Hash tables, a major application for hash functions, enable fast lookup of a data record given its

key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or

access information.) For example, keys in an English dictionary would be English words, and their

associated records would contain definitions. In this case, the hash function must map alphabetic

strings to indexes for the hash table's internal array.

The generally impossible/impractical ideal for a hash table's hash function is to map each key to a

unique index (see perfect hashing), because this guarantees access to each data record in the

first probe into the table.

Hash functions that are truly random with uniform output (including most [[Cryptographic hash

function|cryptographic hash functions]]) are good in that, on average, only one or two probes will be

needed (depending on the load factor). Perhaps as important is that excessive [[hash

collision|collision]] rates with random hash functions are highly improbable—if not

computationally infeasible for an adversary. However, a small,

predictable number of collisions is virtually inevitable (see birthday paradox).

In many cases, a heuristic hash function can yield many fewer

collisions than a random hash function. Heuristic functions take advantage of

regularities in likely sets of keys. For example, one could design a heuristic hash

function such that file names such as FILE0000.CHK, FILE0001.CHK,

FILE0002.CHK, etc. map to successive indices of the table, meaning that such sequences will

not collide. Beating a random hash function on "good" sets of keys usually means performing much

worse on "bad" sets of keys, which can arise naturally—not just through [[Computer

insecurity|attacks]]. Bad performance of a hash table's hash function means that lookup can degrade

to a costly linear search.

Aside from minimizing collisions, the hash function for a hash table should also be fast relative to

the cost of retrieving a record in the table, as the goal of minimizing collisions is minimizing the

time needed to retrieve a desired record. Consequently, the optimal balance of performance

characteristics depends on the application.

One of the most respected hash functions for use in typical hash tables is Bob Jenkins'

LOOKUP2 hash function, published in an

article in Dr. Dobb's Journal. The hash function

performs well as long as there is no adversary, for it is trivially reversible and

useless as a cryptographic hash function.

错误校正

Main article: Error correction and detection

Using a hash function to detect errors in transmission is straightforward. The hash function is

computed for the data at the sender, and the value of this hash is sent with the data. The hash

function is performed again at the receiving end, and if the hash values do not match, an error has

occurred at some point during the transmission. This is called a redundancy check.

For error correction, a distribution of likely perturbations is assumed at least approximately.

Perturbations to a string are then classified into large (improbable) and small (probable) errors.

The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x

efficiently if s is small. Such hash functions are known as error correction codes. Important sub-

class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.

语音识别

For audio identification such as finding out whether an MP3 file matches one of a list of known

items, one could use a conventional hash function such as MD5, but this would be very sensitive to

highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or

implementations or changes in volume. Using something like MD5 is useful as a first pass to find

exactly identical files, but another more advanced algorithm is required to find all identical items.

Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing

algorithms do exist that are robust to these minor differences. Most of the algorithms available

are not extremely robust, but some are so robust that they can identify music played on loud-speakers

in a noisy room. One example of this in practical use is the service Shazam

[1]. Customers call a number and place their telephone near a speaker. The

service then analyses the music, and compares it to known hash values in its database. The name of the

music is sent to the user (for a charge!).

Rabin-Karp string search algorithm

Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works

in O(n) time on average. It is based on the use of hashing to compare strings.

Origins of the term

The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to

"chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept,

in a memo dated January 1953; the term hash came into use some ten years later.

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are

then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value")

is made to be a definite size, 160 bits (which may be either smaller or larger than the domain),

through the use of modular division.

参阅

References

Hashing for Content Identification]

外部链接

(C/C++/Pascal/Java/Ruby)]

Encoder/Decoder with MD2, MD4, MD5, SHA1+2, etc. hashing algorithms]

block sanity checking for EftPos developers.