Jade-Jade-Jade

Stay true to your original aspiration, and you will reach your goal



Starting a New Series: A Menswear Guide for Young Men (Not for My Own Use)

Posted on 2017-10-19 | Category: Everyday Life

I genuinely like how simple and clean menswear is; the shops all look alike, unlike women's clothing, which is dazzling to browse…
Since many of my friends are male programmers, I decided to write a short post on men's outfits.
I hope it helps most ordinary guys; if you are tall with long legs and a good build, just wear whatever you like, it is hard to look bad…
— Preface

Principles

  1. No wrinkles, no wrinkles, no wrinkles. Do not wear a wrinkled jacket or a wrinkled t-shirt.
    To avoid wrinkles, smooth clothes out properly after washing, and pay attention to quality when buying!
  2. Keep your hair grease-free. No mousse or styling products needed; clean and fresh is enough.
  3. Don't let your face get too shiny either; a face wash will do. Basic skincare is necessary, I think.
  4. Don't slouch.
  5. Some guys have narrow shoulders, and I dislike seeing them in soft, shapeless jackets; that only makes the shoulders look narrower and less manly.
  6. Organize your clothes into ready-made outfits, or at least know the pairings, e.g. know exactly which tops go with a pair of pants before you buy them. It saves a lot of time.

Brand Recommendations

Year-Round

https://www.zhihu.com/question/58162135

Bright Colors

I have noticed that many guys love bright colors, mainly orange, yellow, and blue, and always in the purest shades! My art teacher once told us that small children tend to love high-saturation colors, while grown-ups come to prefer slightly grayed ones. Haha, so if you love pure colors, you have a big childlike heart :)

Many posts on Zhihu push the black-white-gray "classy" look, but we should not abandon our beloved flashiness just to look sharp! I find pure colors adorable, especially when someone usually serious suddenly shows up in a bright color; it feels wonderfully lively!

I rarely see people in pure-color pants; bright colors mostly show up in jackets and t-shirts, plus all kinds of flashy shoes.

Clothes

Posted on 2017-10-14 | Category: Everyday Life

Nice clothes can cheer you up.
— Preface

Store Roundup

Urban Outfitters

https://www.urbanoutfitters.com

I first came across this brand in Harvard Square. I had met 橙子 for lunch, and afterwards we wandered around the square. It is a fun store: it sells clothes, shoes, and hats, but also instant cameras and records.

Later I ran into it again in New York and realized it is a chain. Now in Los Angeles, carless, I miss those Boston days of strolling and window-shopping at will.

The online store carries plenty, but it lacks the feeling of digging for treasure in a physical shop.

百武西

Taobao flagship store link

I discovered 百武西 while wandering the Yaohan mall in my hometown, at the turn of an escalator; the retro, literary decor drew me right over. It is a lifestyle brand that pays homage to the 1930s. Its mannequins have faces without sharp angles and figures without exaggerated curves, yet each looks like a young student stepping slowly out of a book, with the delicacy and youth unique to that age.

They are on Tmall and JD and have an official site, but I still love browsing the physical store piece by piece.

& Other Stories

https://www.stories.com/us/Ready-to-wear

An H&M brand. Young, fashionable, affordable. Bookmarked for now; I haven't bought anything yet.

Zara

Attractive prices and some style, but I find the quality underwhelming. I have only ever bought one pair of black pants there.

Pixie Market

https://www.pixiemarket.com/

A high-street brand, trendy. I fell for it at first sight and bought several pieces, but the quality does not justify the price. I won't buy again.

DKNY

https://www.dkny.com/us/sets/clothing-view-all

My latest discovery: a good store hiding in the outlets.

Tommy

http://usa.tommy.com/en/NEWARRIVALS-WOMEN

Minimalist Style

Sometimes I want to wear red lipstick for once, or I have just bought a new bag, and minimalist clothing makes such accents stand out all the more.
It is also the best business-casual: simple, easy, and neat.

That said, pieces cut in the minimalist style but in colors that are anything but muted are better tried on and matched in a physical store.

Uniqlo

Oddly enough, I never once bought Uniqlo in China; my first purchase was in the US, where I could not stop browsing and bought a pile in one go. I am not a regular Uniqlo customer, with exactly one summer shopping trip to my name, but that trip put the brand on my wishlist.

Everlane

https://www.everlane.com/collections/womens-all

I haven't bought from them yet, but the official site looks promising, and the prices are not particularly high.

Cos

https://www.cosstores.com/us/

I first tried it in senior year, before going to the US, when a friend recommended it at a mall in Hefei. Only after arriving in the US did I belatedly realize it is an H&M brand. Nice clothes, but I want to try them on in a physical store; I don't quite dare to order online directly.

Acne Studios

http://www.acnestudios.com/us/en/woman/outerwear/

Priced on the high side, but the official site really is my taste. The prices are genuinely a bit steep… must work harder…

frankandoak

https://www.frankandoak.com/women/featured/new-in

Shoes

Keds

http://www.keds.com/en/home

The American stand-in for old-Beijing cloth shoes… yes, I am a die-hard fan of old-Beijing cloth shoes…

A First Look at Web Crawling

Posted on 2017-10-02 | Category: Information Retrieval and Search Engines

[Notes attached as images]

De-Duplication

Posted on 2017-09-25 | Category: Information Retrieval and Search Engines

Definition

De-duplication – the process of identifying and avoiding essentially identical web pages

With respect to web crawling, de-duplication essentially refers to the identification of identical and nearly identical web pages and indexing only a single version to return as a search result

For identical or near-identical pages, keep only one version.

[Examples]

  • the URL’s host name can be distinct (virtual hosts),
  • the URL’s protocol can be distinct (http, https),
  • the URL’s path and/or page name can be distinct

Different URLs can point to the same site:

http://maps.google.com
https://www.google.com/maps

Near-identical pages

For example, two versions of the same page captured a few seconds apart.

Mirroring

Mirroring is the systematic replication of web pages across hosts.
Mirroring is a major source of duplicates.

Host1/α and Host2/β are mirrors iff:

    For all (or most) paths p such that http://Host1/α/p exists,
    http://Host2/β/p exists as well
    with identical (or near identical) content, and vice versa.

Example: https://www.apache.org/dyn/closer.cgi

[Note]

  • De-duplication also comes up often in cloud computing (storage de-duplication), but that is a different problem from the one discussed here

Handling Duplication and Near-Duplication

Duplication

[Method] Compute fingerprints using cryptographic hashing, e.g. SHA-1 or MD5. Keep the fingerprints sorted so that a lookup takes O(log N).
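The fingerprinting step can be sketched in Python. `fingerprint` and `dedupe` are hypothetical helper names of my own; a hash set stands in for the fingerprint store (a sorted fingerprint file with binary search would give the O(log N) lookup mentioned above):

```python
import hashlib

def fingerprint(page_text: str) -> str:
    """Collapse a whole page into a fixed-size cryptographic fingerprint."""
    return hashlib.sha1(page_text.encode("utf-8")).hexdigest()

def dedupe(pages):
    """Keep only the first copy of every exact-duplicate page."""
    seen = set()
    unique = []
    for page in pages:
        fp = fingerprint(page)
        if fp not in seen:  # with a sorted fingerprint file this is the O(log N) lookup
            seen.add(fp)
            unique.append(page)
    return unique
```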

Near-Duplication

[Method] compute the syntactic similarity, and use a similarity threshold to detect near-duplicates

Two specific approaches are introduced here:

  • Compute w-shingles and Jaccard similarity to measure how alike two pages are

[Reference video] https://www.youtube.com/watch?v=bQAYY8INBxg
[Video notes: images omitted]
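A minimal sketch of w-shingling and Jaccard similarity; the function names are my own, and real systems shingle at the character or word level depending on the task:

```python
def shingles(text: str, w: int = 2):
    """The set of w-shingles: every run of w consecutive words."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages whose shingle sets have a Jaccard similarity above some threshold (say 0.9) would be flagged as near-duplicates.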

  • Compute similarity via fingerprints
    [SimHash] SimHash is a hash in which similar items are hashed to similar hash values
    (similar meaning a small bitwise Hamming distance between the values):

        pick a hash size, say 32 bits
        let V = [0] * 32  (i.e. a vector of 32 zeros)
        break the input phrase up into shingles, e.g.
          'the cat sat on the mat'.shingles(2)
          => {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}
        hash each feature using a normal 32-bit hash algorithm, e.g.
          "th".hash = -502157718, "he".hash = -369049682, ...
        for each hash:
          if bit_i of the hash is set, add 1 to V[i]
          if bit_i of the hash is not set, subtract 1 from V[i]
        bit_i of the simhash is 1 if V[i] > 0, and 0 otherwise
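The pseudocode above can be turned into a small Python sketch. Truncated MD5 stands in for the "normal 32-bit hash", since Python's built-in `hash()` is salted per process and would not be stable:

```python
import hashlib

def simhash(text: str, bits: int = 32) -> int:
    """SimHash over 2-character shingles, following the pseudocode above."""
    v = [0] * bits
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    for g in grams:
        # stable per-feature hash: first 4 bytes of MD5
        h = int.from_bytes(hashlib.md5(g.encode("utf-8")).digest()[:4], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Similar inputs tend to share most shingles, so their vote vectors V, and hence their fingerprints, tend to agree in most bit positions.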

[Finding similar pages via SimHash]

  • If two SimHash values have a small Hamming distance, then sorting all SimHash values tends to place them near each other
  • Rotating every fingerprint left or right by the same number of bit positions leaves all pairwise Hamming distances unchanged,
    so we can do the following:

        rotate the bits B times (B is the fingerprint length in bits)
        after each rotation: sort, then check adjacent entries
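A toy sketch of that rotate-sort-check procedure; `near_pairs` is a hypothetical helper of my own, and production systems use much longer fingerprints and block-wise permutations rather than single-bit rotations:

```python
def rotate(x: int, b: int) -> int:
    """Rotate a b-bit fingerprint left by one bit."""
    mask = (1 << b) - 1
    return ((x << 1) | (x >> (b - 1))) & mask

def ham(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_pairs(fps, b, max_dist=3):
    """Rotate all fingerprints b times; after every rotation, sort and compare
    adjacent entries. Rotation preserves pairwise Hamming distance, so close
    fingerprints tend to become sort-neighbors in some rotation."""
    found = set()
    cur = list(enumerate(fps))          # remember original indices
    for _ in range(b):
        cur = [(i, rotate(x, b)) for i, x in cur]
        by_value = sorted(cur, key=lambda t: t[1])
        for (i, x), (j, y) in zip(by_value, by_value[1:]):
            if ham(x, y) <= max_dist:
                found.add((min(i, j), max(i, j)))
    return found
```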

Search Engine Evaluation

Posted on 2017-09-25 | Category: Information Retrieval and Search Engines

How do we measure the quality of search engines?

Recall and Precision

$$Precision = \frac{\text{relevant items retrieved}}{\text{all retrieved items}}$$ $$Recall = \frac{\text{relevant items retrieved}}{\text{all relevant items}}$$

—              Relevant             Nonrelevant
Retrieved      True Positive (tp)   False Positive (fp)
Not Retrieved  False Negative (fn)  True Negative (tn)

$$Precision = \frac{tp}{tp+fp}$$ $$Recall = \frac{tp}{tp+fn}$$ $$Accuracy = \frac{tp+tn}{tp+fp+fn+tn}$$

[Explanation]: "Positive" here means a positive judgment by the system, i.e. retrieved; "negative" means not retrieved. So a true positive is a correct positive judgment (the result really is relevant), while a false negative is an incorrect negative judgment (the result is actually relevant).

[Accuracy]: \(Accuracy = \frac{tp+tn}{tp+fp+fn+tn}\). However, accuracy is usually not used as the evaluation criterion: in a heavily skewed collection where almost everything is nonrelevant, a system that retrieves nothing already scores near-perfect accuracy.

[An interesting piece of analysis]:
The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
Typical web surfers would like every result on the first page to be relevant (high precision) but have not the slightest interest in knowing let alone looking at every document that is relevant.
In web applications, precision is more important than recall.
In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it.
In professional search, we care more about high recall and will accept relatively low precision to achieve it.
Individuals searching their hard disks are also often interested in high recall searches.
Hard-disk search is another case where high recall is expected.
Nevertheless, the two quantities clearly trade off against one another: you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries! Recall is a non-decreasing function of the number of documents retrieved. On the other hand, in a good system, precision usually decreases as the number of documents retrieved is increased. In general we want to get some amount of recall while tolerating only a certain percentage of false positives.
In short, precision and recall constrain each other. You can always reach a recall of 1 by retrieving every document for every query, but precision will then be very low. In a good system, precision typically falls as the number of retrieved documents grows; in general we accept a certain share of false positives in exchange for better recall.

Pythagorean Means

  • Arithmetic mean

$$A=\frac{1}{n}\sum_{i=1}^n{x_i}$$

  • Geometric mean

$$G = \sqrt[n]{x_1 x_2 \cdots x_n}$$

  • Harmonic mean

$$H = \left(\frac{\sum_{i=1}^n{x_i^{-1}}}{n}\right)^{-1}$$

F Measure

weighted harmonic mean of precision and recall

$$F = \frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}
= \frac{({\beta}^2+1)PR}{\beta^2P+R}, \space\space\space {\beta}^2=\frac{1-\alpha}{\alpha} $$

  • Balanced \(F_1\)measure: with \(\beta=1\), \(F=\frac{2PR}{P+R}\)
  • Values of \(\beta<1\) emphasize precision
  • Values of \(\beta>1\) emphasize recall
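The formulas above transcribe directly into code; this sketch uses the same beta parametrization as the F formula:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: ((beta^2 + 1) * P * R) / (beta^2 * P + R)."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)
```

With the default beta = 1 this is the balanced F1; a larger beta pulls the score toward recall, a smaller one toward precision.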

Calculating Recall/Precision at Fixed Positions

The precision, recall, and F measures above are set-based: they evaluate an unordered set of documents.
A typical search engine, however, returns ranked results. This section covers how to evaluate a ranked list.

  • Average Precision of the Relevant Documents

[Example]

Relevant documents = 6

Rank  Relevant (R) or Nonrelevant (N)  Recall  Precision
1     R                                0.17    1.0
2     N                                0.17    0.5
3     R                                0.33    0.67
4     R                                0.5     0.75
5     R                                0.67    0.8
6     R                                0.83    0.83
7     N                                0.83    0.71
8     N                                0.83    0.63
9     N                                0.83    0.56
10    R                                1.0     0.6

Average Precision of the Relevant Documents = (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6)/6 ≈ 0.78

  • Averaging Across Queries

The example above covers a single query; how do we evaluate over several queries?

[Method 1]
As above: (sum of the precision at every relevant document)/(total number of relevant documents)
[figure]

$$(1 + .67 + .5 + .44 + .5 + .5 + .4 + .43)/8 = 0.55$$

[Method 2]
Mean Average Precision (MAP)
This is the measure most commonly used in research papers.
$$MAP=\frac{\sum_{q=1}^Q{Avg{P(q)}}}{Q}, \quad Q=\text{number of queries}$$
[figure]
$$AvgP(1)=(1.0+0.67+0.5+0.44+0.5)/5=0.62$$ $$AvgP(2)=(0.5+0.4+0.43)/3=0.44$$ $$MAP=(0.62+0.44)/2=0.53$$
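A sketch of average precision and MAP as defined above, taking each query as a list of binary relevance judgments in rank order; it reproduces the single-query example from the previous section (AP = 0.775 ≈ 0.78):

```python
def average_precision(relevance):
    """relevance: 0/1 judgments in rank order. Average the precision
    measured at the rank of each relevant document."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """MAP: the mean of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)
```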

Drawbacks of evaluating a search engine with MAP:

  1. Macro-averaging: each query counts equally, regardless of how common or hard it is
  2. MAP assumes the user wants to find many relevant documents for each query, which does not describe a user who only needs one good answer
  3. MAP requires many relevance judgments over the document collection

Discounted Cumulative Gain

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically, proportional to the position of the result.
The basic idea: if a relevant document appears far down the result list, it should be penalized accordingly; the further down, the larger the penalty.

[figure]
[Example]
[figure]
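One common formulation of DCG as a sketch: the first result keeps its full gain, and every later gain is discounted by log2 of its rank (another widely used variant discounts every position by log2(i + 1) instead):

```python
from math import log2

def dcg(gains):
    """DCG with the classic discount: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(g if i == 1 else g / log2(i)
               for i, g in enumerate(gains, start=1))
```

Swapping a high-gain document to a lower rank lowers the score, which is exactly the penalty described above.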

Limitations of precision- and recall-based measures:

  • They should be averaged over large document collections and query ensembles
  • They need human relevance assessments, which are not entirely reliable
  • Assessments have to be binary: relevance is treated as all-or-nothing
  • They are heavily skewed by collection/authorship, so results may not generalize across domains

Non-relevance-based measures

Search engines also use measures that are not based on relevance judgments:

  • Click-through on first result: Not very reliable if you look at a single click-through … but pretty reliable in the aggregate.
  • Studies of user behavior in the lab
  • A/B testing

Inverted Indexing

Posted on 2017-09-20 | Category: Information Retrieval and Search Engines

Definition of an Inverted index

An inverted index is typically composed of a vector containing all distinct words of the text collection in lexicographical order (which is called the vocabulary) and for each word in the vocabulary, a list of all documents (and text positions) in which that word occurs

In plain terms: a mapping from each word to the set of documents that contain that word.

Note: the words here are usually preprocessed; common processing steps include:

  • Case folding: converting all uppercase letters to lower case
  • Stemming: reducing words to their morphological root
  • Stop words: removing words that are so common they provide no information

Use cases for an inverted index

Suppose we have this query:

    Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Approach 1: Brute-Force Search

Find all plays containing Brutus and Caesar, then scan them word by word and drop any play that contains Calpurnia.

Drawbacks:

  • Too slow
  • Requires a lot of space
  • Inflexible; it does not support other operations, such as "Find the word Romans near countrymen"

Approach 2: Term-Document Incidence Matrix

This uses the bit-map idea. Build a matrix in which entry (t, d) = 1 if term t occurs in document d, and 0 otherwise. For example:

[figure]

For the previous query we then only need (the last operand is the bitwise complement of Calpurnia's row):

110100 AND 110111 AND 101111 = 100100

So the two plays matching the query are:

"Antony and Cleopatra", "Hamlet"
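The same Boolean query, done with Python integers as bit vectors; Calpurnia's row is taken as 010000 (the complement of 101111), and the complement is masked so it stays within the six document bits:

```python
# One bit per play; bit set = the term occurs in that play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

N_DOCS = 6
mask = (1 << N_DOCS) - 1                 # complement only within the 6 bits
answer = brutus & caesar & (~calpurnia & mask)
```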

Drawbacks:

  • Requires a lot of space, and the matrix is usually extremely sparse:

    Given 1 million documents and 500,000 terms,
    the (term x document) matrix in this case will have size 500K x 1M,
    i.e. half a trillion 0's and 1's.
Approach 3: Inverted Index

As the data structure for the postings we choose linked lists rather than arrays; the advantages of lists are:

  • Dynamic space allocation
  • Easy insertion of terms into documents
  • There is space overhead for the pointers, though this is not too serious

The construction of an inverted index is shown in the figure below.
[figure]

Example:

    1) Brutus AND Caesar
       - Locate Brutus in the dictionary and retrieve its postings:
         Brutus: 2->4->8->16->32->64->128
       - Locate Caesar in the dictionary and retrieve its postings:
         Caesar: 1->2->3->5->8->13->21->34
       - "Merge" the two postings lists and select the documents in common
         (postings are document ids):
         Brutus: 2->4->8->16->32->64->128
         Caesar: 1->2->3->5->8->13->21->34
         => 2->8
       Complexity: O(m + n)

    2) Brutus AND Calpurnia AND Caesar
       Brutus:    2->4->8->16
       Caesar:    1->2->3->5->8->13->21->34
       Calpurnia: 2->16
       Merge in order from the shortest list to the longest:
       execute the query as (Calpurnia AND Brutus) AND Caesar.

       INTERSECT(<t1, ..., tn>)
         terms  <- SORTBYINCREASINGFREQUENCY(<t1, ..., tn>)
         result <- postings(first(terms))
         terms  <- rest(terms)
         while terms != NIL and result != NIL
           do result <- INTERSECT(result, postings(first(terms)))
              terms  <- rest(terms)
         return result

As I recall, merging k sorted lists can be done in O(nk log k), where n is the average list length.
[LeetCode reference: https://leetcode.com/problems/merge-k-sorted-lists/description/]
The book probably avoids that algorithm for good reasons:
  - in practice the list lengths are anything but uniform
  - the merge can stop as soon as the shortest list is exhausted; a complete k-way merge is unnecessary
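The pairwise O(m + n) merge can be sketched in Python with array-backed postings standing in for the linked lists:

```python
def intersect(p1, p2):
    """Two-pointer 'merge' of two sorted postings lists in O(m + n)."""
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return common
```

On the Brutus/Caesar example above, this returns the common documents 2 and 8.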

Skip Pointers

Related video explanation

Goal: speed up the merging of postings.

How it works: see the figure below.
[figure]

Algorithm

    INTERSECTWITHSKIPS(p1, p2)
      answer <- <>
      while p1 != NIL and p2 != NIL
        do if docID(p1) == docID(p2)
             then ADD(answer, docID(p1))
                  p1 <- next(p1)
                  p2 <- next(p2)
           else if docID(p1) < docID(p2)
             then if hasSkip(p1) and (docID(skip(p1)) <= docID(p2))
                    then while hasSkip(p1) and (docID(skip(p1)) <= docID(p2))
                           do p1 <- skip(p1)
                    else p1 <- next(p1)
           else if hasSkip(p2) and (docID(skip(p2)) <= docID(p1))
                  then while hasSkip(p2) and (docID(skip(p2)) <= docID(p1))
                         do p2 <- skip(p2)
                  else p2 <- next(p2)
      return answer

When it applies:

  1. The index is relatively static, with few updates: building effective skip pointers is easy if an index is relatively static; it is harder if a postings list keeps changing because of updates. A malicious deletion strategy can render skip lists ineffective.
  2. Skip pointers only help for AND queries, not for OR queries.

Question: How many skip pointers should we add?

  1. Trade-off:
  • More skips mean shorter skip spans, so a skip is more likely to succeed, but there are many comparisons against skip pointers
  • Fewer skips mean fewer pointer comparisons, but long skip spans yield few successful skips
  2. Simple heuristic:
  • Use the square root of the length: for a postings list of length P, use √P evenly spaced skip pointers.
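A sketch of the same merge with the √P heuristic, over array-backed postings; `intersect_with_skips` is my own name, and a "skip pointer" here is simply a jump of `isqrt(len(p))` array slots:

```python
from math import isqrt

def intersect_with_skips(p1, p2):
    """Postings merge where each list gets sqrt(P) evenly spaced skips."""
    s1 = max(isqrt(len(p1)), 1)          # skip span for each list
    s2 = max(isqrt(len(p2)), 1)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if it does not overshoot p2[j]
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```

It returns exactly what the plain two-pointer merge returns, just with fewer element comparisons on long runs of non-matching docIDs.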

How hardware affects speed

Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you’re running a search engine with everything in memory then the equation changes again.

Phrase Queries

The inverted-index techniques so far are term-based, but in practice we often query a phrase rather than a single word, e.g. "stanford university".

Two approaches are described here.

Approach 1: Bi-word indexes (2-grams, 2-shingles)

How it works

  • A bi-word (or 2-gram) is a consecutive pair of terms in some text. For example, "Friends, Romans, Countrymen" generates the bi-words "friends romans" and "romans countrymen"
  • Each of these bi-words is then added to the dictionary as a term; that is, bi-words take the place of the single-term entries

Drawbacks

  • Vocabulary explosion: bi-words blow up the size of the vocabulary database
  • Phrases can be longer than two words: queries longer than 2 words have to be broken into bi-word segments
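Generating the bi-words can be sketched as follows; the tokenization here is a crude letters-only split, just for illustration:

```python
import re

def biwords(text: str):
    """Lowercase, strip punctuation, and emit consecutive term pairs."""
    terms = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]
```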
Approach 2: Positional indexes

How it works

Compared with the plain inverted index, we additionally record the positions at which each term occurs:

    <Term, TotalFrequency> -> <docID, Frequency>
    becomes
    <Term, TotalFrequency> -> <docID, Frequency, Positions>

[figure]

Say we want to query:

    to be or not to be

We start by finding "to be", as follows:

    1. Find the documents that contain both terms, to and be.
    2. Look up the positions of both terms in those documents; whenever an
       occurrence of be sits exactly one position after an occurrence of to,
       we have a match. From the figure above, for instance, one possible match is:
         to: <...; 4:<..., 429, 433>; ...>
         be: <...; 4:<..., 430, 434>; ...>
Algorithm

    POSITIONALINTERSECT(p1, p2, k)
      answer <- <>
      while p1 != NIL and p2 != NIL
        do if docID(p1) == docID(p2)
             then l <- <>
                  pp1 <- positions(p1)
                  pp2 <- positions(p2)
                  while pp1 != NIL
                    do while pp2 != NIL
                         do if |pos(pp1) - pos(pp2)| <= k
                              then ADD(l, pos(pp2))
                            else if pos(pp2) > pos(pp1)
                              then break
                            pp2 <- next(pp2)
                       while l != <> and |l[0] - pos(pp1)| > k
                         do DELETE(l[0])
                       for each ps in l
                         do ADD(answer, <docID(p1), pos(pp1), ps>)
                       pp1 <- next(pp1)
                  p1 <- next(p1)
                  p2 <- next(p2)
           else if docID(p1) < docID(p2)
             then p1 <- next(p1)
           else p2 <- next(p2)
      return answer
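A simplified positional lookup for the two-term phrase case (`term2` directly follows `term1`, i.e. a fixed offset of +1 rather than the general window k); the index layout is a toy dict version of the positional postings above:

```python
def phrase_matches(index, term1, term2):
    """All (docID, position) pairs where term2 occurs exactly one position
    after term1. `index` maps term -> {docID: sorted position list}."""
    hits = []
    for doc in sorted(index[term1].keys() & index[term2].keys()):
        later = set(index[term2][doc])
        for p in index[term1][doc]:
            if p + 1 in later:
                hits.append((doc, p))
    return hits
```

On the "to be" example above, document 4 with to at 429 and 433 and be at 430 and 434 yields two matches.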

[CLRS Chapter 32] String Matching

Posted on 2017-09-17 | Category: Algorithms

Introduction to Algorithms, Chapter 32: String Matching

I have been studying search engines lately, which again touched on string-processing knowledge, so I decided to read Chapter 32 of Introduction to Algorithms once more and leave some notes. I will summarize the rest another time.

The book introduces four methods:

Naive string-matching algorithm (brute force)

The Rabin-Karp algorithm

Finite automata

KMP

My notes: [images omitted]
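Of the four, KMP is the one I reach for most; a sketch using the prefix function (the π table from CLRS):

```python
def prefix_function(p: str):
    """pi[i] = length of the longest proper prefix of p[:i+1] that is
    also a suffix of it."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[k] != p[i]:
            k = pi[k - 1]          # fall back to the next shorter border
        if p[k] == p[i]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(text: str, pattern: str):
    """Start indices of all occurrences of pattern in text, in O(n + m)."""
    pi = prefix_function(pattern)
    out, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and pattern[k] != c:
            k = pi[k - 1]
        if pattern[k] == c:
            k += 1
        if k == len(pattern):
            out.append(i - k + 1)
            k = pi[k - 1]          # keep scanning for overlapping matches
    return out
```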

References:

Introduction to Algorithms
and https://niuye.info/fms-kmp/

Staying True to the Original Aspiration

Posted on 2017-08-11 | Category: Opening

A preface to this blog, written by a pseudo-programmer who keeps pondering a career change while sinking ever deeper into the coding world.

Before 27, focus on the code.

Now and then, think of poetry and faraway places.

Keep planting wishes for my own life, and remember to tend them on time.


On the Profession

  • Yaser Abu-Mostafa

  • Deep Learning (CN/EN) | Udacity

  • GoodFellow

  • Clean Code

  • Thinking in Java

  • Elm

  • Algorithms

  • Spider

On Interests

  • An English grammar book

  • 朗读者 (The Reader)
18 posts
8 categories
© 2018 Jade-Jade-Jade
Powered by Hexo
Theme - NexT.Gemini