去重 De-Duplication

定义

De-duplication – the process of identifying and avoiding essentially identical web pages

With respect to web crawling, de-duplication essentially refers to the identification of identical and nearly identical web pages and indexing only a single version to return as a search result

对于相同或者相近的网页,只留下一个版本

[例子]

  • the URL’s host name can be distinct (virtual hosts),
  • the URL’s protocol can be distinct (http, https),
  • the URL’s path and/or page name can be distinct

不同的地址指向相同的网站

http://maps.google.com
https://www.google.com/maps

相类似的网站

比如一个网页前后相隔几秒的版本

镜像

Mirroring is the systematic replication of web pages across hosts.
是造成duplicate的一个重要原因

Host1/α and Host2/β are mirrors iff:

1
2
3
For all (or most) paths p such that http://Host1/α/p exists
when http://Host2/β/p exists as well
with identical (or near identical) content, and vice versa.

例如 https://www.apache.org/dyn/closer.cgi

[解释]

  • 在云计算中也经常提到De-duplication的概念,但是与此处所说的并不一样

解决Duplication/Near-Duplication的问题

Duplication

[方法] compute fingerprints using cryptographic hashing, 如 SHA-1 和 MD5。有序排列,方便查找logN

Near-Duplication

[方法] compute the syntactic similarity, and use a similarity threshold to detect near-duplicates

这里具体介绍两种方法

  • 通过计算w-Shingles和Jaccard Similarity来得到相似度

[参考视频] https://www.youtube.com/watch?v=bQAYY8INBxg
[视频笔记]
image
image
image
image

  • 通过fingerprints来计算similarity
    [SimHash] Simhash is one where similar items are hashed to similar hash values
    (by similar we mean the bitwise Hamming distance between hash values)
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    pick a hashsize, lets say 32 bits
    let V = [0] * 32 # (ie a vector of 32 zeros)
    break the input phrase up into shingles. e.g.
    'the cat sat on the mat'.shingles(2)
    =>#<Set: {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}>
    hash each feature using a normal 32-bit hash algorithm. e.g.
    "th".hash = -502157718 "he".hash = -369049682 ...
    for each hash
    if bit_i of hash is set then add 1 to V[i]
    if bit_i of hash is not set then take 1 from V[i]
    simhash bit_i is 1 if V[i] > 0 and 0 otherwise

[通过SimHash找到相似的页面]

  • 如果两个SimHash的Hamming distance相近,则如果将所有SimHash有序排列,它们会位于比较相近的位置
  • 数字位左移或者右移相同的位置,其相对Hamming distance不改变
    因此可以这样做
    1
    2
    3
    rotate the bits B times (B is the length of vector)
    sort
    check adjancent