散列函數:修订间差异
Dragontail(留言 | 贡献) 无编辑摘要 |
Dragontail(留言 | 贡献) |
||
第133行: | 第133行: | ||
class of these correction codes are [[cyclic redundancy check]]s and [[Reed-Solomon code]]s. |
class of these correction codes are [[cyclic redundancy check]]s and [[Reed-Solomon code]]s. |
||
=== |
===语音识别=== |
||
For audio identification such as finding out whether an [[MP3]] file matches one of a list of known |
For audio identification such as finding out whether an [[MP3]] file matches one of a list of known |
||
2006年7月1日 (六) 23:00的版本
本文章正在翻译,目前已完成 20%,原文在http://en.wikipedia.org/wiki/Hash_function
散列函数(或散列算法)是一种从任何一种数据中创建小的数字“指纹”的方法。该函数将数据打乱混合,重新创建一个叫做散列值的指纹。散列值通常用来代表一个短的随机字母和数字组成的字符串。好的散列函数在输入域中很少出现散列冲突。在散列表和数据处理中,不抑制冲突来区别数据,会使得数据库记录更难找到。
散列函数的性质
所有散列函数都有如下一个基本特性:如果两个散列值是不相同的(根据同一函数),那么这两个散列值的原始输入也是不相同的。这个特性是散列函数具有确定性的结果。但另一方面,如果两个散列值相同,可能两个输入值是相同的,但并不能肯定二者一定相等。输入一些数据计算出散列值,然后部分改变输入值,根据散列函数强混淆特性,我们得出一个完全不同的散列值。
典型的散列函数都有无限定义域,比如任意长度的字节字符串,和有限的值域,比如固定长度的比特串。在某些情况下,散列函数可以设计成具有相同大小的定义域和值域间的一一映射。一一映射的散列函数也称为排列。可逆性可以通过使用一系列的对于输入值的可逆“混合”运算而得到。
散列函数的应用
Because of the variety of applications for hash functions (details below), they are often tailored to the application. For example, cryptographic hash functions assume the existence of an
adversary who can deliberately try to find inputs with the same hash value. A well designed
cryptographic hash function is a "one-way" operation: there is no practical way to calculate a
particular data input that will result in a desired hash value, so it is also very difficult to
forge. Functions intended for cryptographic hashing, such as MD5, are commonly used
as stock hash functions.
Functions for error detection and correction focus on distinguishing cases in which data has been
disturbed by random processes. When hash functions are used for checksums, the relatively small hash
value can be used to verify that a data file of any size has not been altered.
加密
- Main article: cryptographic hash function
A typical cryptographic one-way function is not one-to-one and makes an effective hash function; a
typical cryptographic trapdoor function is one-to-one and makes an effective randomization
function.
散列表
- Main article: Hash table
Hash tables, a major application for hash functions, enable fast lookup of a data record given its
key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or
access information.) For example, keys in an English dictionary would be English words, and their
associated records would contain definitions. In this case, the hash function must map alphabetic
strings to indexes for the hash table's internal array.
The generally impossible/impractical ideal for a hash table's hash function is to map each key to a
unique index (see perfect hashing), because this guarantees access to each data record in the
first probe into the table.
Hash functions that are truly random with uniform output (including most [[Cryptographic hash
function|cryptographic hash functions]]) are good in that, on average, only one or two probes will be
needed (depending on the load factor). Perhaps as important is that excessive [[hash
collision|collision]] rates with random hash functions are highly improbable—if not
computationally infeasible for an adversary. However, a small,
predictable number of collisions is virtually inevitable (see birthday paradox).
In many cases, a heuristic hash function can yield many fewer
collisions than a random hash function. Heuristic functions take advantage of
regularities in likely sets of keys. For example, one could design a heuristic hash
function such that file names such as FILE0000.CHK, FILE0001.CHK,
FILE0002.CHK, etc. map to successive indices of the table, meaning that such sequences will
not collide. Beating a random hash function on "good" sets of keys usually means performing much
worse on "bad" sets of keys, which can arise naturally—not just through [[Computer
insecurity|attacks]]. Bad performance of a hash table's hash function means that lookup can degrade
to a costly linear search.
Aside from minimizing collisions, the hash function for a hash table should also be fast relative to
the cost of retrieving a record in the table, as the goal of minimizing collisions is minimizing the
time needed to retrieve a desired record. Consequently, the optimal balance of performance
characteristics depends on the application.
One of the most respected hash functions for use in typical hash tables is Bob Jenkins'
LOOKUP2 hash function, published in an
article in Dr. Dobb's Journal. The hash function
performs well as long as there is no adversary, for it is trivially reversible and
useless as a cryptographic hash function.
错误校正
- Main article: Error correction and detection
Using a hash function to detect errors in transmission is straightforward. The hash function is
computed for the data at the sender, and the value of this hash is sent with the data. The hash
function is performed again at the receiving end, and if the hash values do not match, an error has
occurred at some point during the transmission. This is called a redundancy check.
For error correction, a distribution of likely perturbations is assumed at least approximately.
Perturbations to a string are then classified into large (improbable) and small (probable) errors.
The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x
efficiently if s is small. Such hash functions are known as error correction codes. Important sub-
class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.
语音识别
For audio identification such as finding out whether an MP3 file matches one of a list of known
items, one could use a conventional hash function such as MD5, but this would be very sensitive to
highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or
implementations or changes in volume. Using something like MD5 is useful as a first pass to find
exactly identical files, but another more advanced algorithm is required to find all identical items.
Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing
algorithms do exist that are robust to these minor differences. Most of the algorithms available
are not extremely robust, but some are so robust that they can identify music played on loud-speakers
in a noisy room. One example of this in practical use is the service Shazam
[1]. Customers call a number and place their telephone near a speaker. The
service then analyses the music, and compares it to known hash values in its database. The name of the
music is sent to the user (for a charge!).
Rabin-Karp string search algorithm
Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works
in O(n) time on average. It is based on the use of hashing to compare strings.
Origins of the term
The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to
"chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept,
in a memo dated January 1953; the term hash came into use some ten years later.
In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are
then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value")
is made to be a definite size, 160 bits (which may be either smaller or larger than the domain),
through the use of modular division.
参阅
- Cryptography
- Cryptographic hash function
- HMAC
- Geometric hashing
- Message digest
- Perfect hash function
- Rabin-Karp string search algorithm
- Zobrist hashing
- Bloom filter
- Hash table
- Hash list
- Hash tree
- Coalesced hashing
- Transposition Table
References
- [http://www.extra.research.philips.com/natlab/download/audiofp/cbmi01audiohashv1.0.pdf Robust Audio
Hashing for Content Identification]
外部链接
- [http://www.partow.net/programming/hashfunctions/index.html General purpose hash function algorithms
(C/C++/Pascal/Java/Ruby)]
- Hash Functions for Hash Table Lookup by Bob Jenkins
- Hash Functions by Paul Hsieh
- What is a hash function? from RSA Laboratories
- [http://www.paulschou.com/tools/xlate/ Online Char (ASCII), HEX, Binary, Base64, etc...
Encoder/Decoder with MD2, MD4, MD5, SHA1+2, etc. hashing algorithms]
- Crypto-Toolbox - Online cryptography, hashing and PIN
block sanity checking for EftPos developers.