Jade-Jade-Jade

Stay true to your original aspiration, and you will reach your goal



Starting a New Series: A Menswear Guide for Young Men (Not for My Own Use)

Posted on 2017-10-19 | Category: Everyday Life

I genuinely like how simple and clean menswear is; the shops all look alike, unlike women's clothing, which is dazzling to browse…
Since many of my friends are male programmers, I decided to write a short post on men's outfits.
I hope it helps most ordinary guys; if you are tall with long legs and a good build, just wear whatever you like, it is hard to look bad…
— Preface

Principles

  1. No wrinkles, no wrinkles, no wrinkles. Do not wear a wrinkled jacket or a wrinkled t-shirt.
    To avoid wrinkles, smooth clothes out properly after washing, and pay attention to quality when buying!
  2. Keep your hair grease-free. No mousse or styling products needed; clean and fresh is enough.
  3. Don't let your face get too shiny either; a face wash will do. Basic skincare is necessary, I think.
  4. Don't slouch.
  5. Some guys have narrow shoulders, and I dislike seeing them in soft, shapeless jackets; that only makes the shoulders look narrower and less manly.
  6. Organize your clothes into ready-made outfits, or at least know the pairings, e.g. know exactly which tops go with a pair of pants before you buy them. It saves a lot of time.

Brand Recommendations

Year-Round

https://www.zhihu.com/question/58162135

Bright Colors

I have noticed that many guys love bright colors, mainly orange, yellow, and blue, and always in the purest shades! My art teacher once told us that small children tend to love high-saturation colors, while grown-ups come to prefer slightly grayed ones. Haha, so if you love pure colors, you have a big childlike heart :)

Many posts on Zhihu push the black-white-gray "classy" look, but we should not abandon our beloved flashiness just to look sharp! I find pure colors adorable, especially when someone usually serious suddenly shows up in a bright color; it feels wonderfully lively!

I rarely see people in pure-color pants; bright colors mostly show up in jackets and t-shirts, plus all kinds of flashy shoes.

Clothes

Posted on 2017-10-14 | Category: Everyday Life

Nice clothes can cheer you up.
— Preface

Store Roundup

Urban Outfitters

https://www.urbanoutfitters.com

I first came across this brand in Harvard Square. I had met 橙子 for lunch, and afterwards we wandered around the square. It is a fun store: it sells clothes, shoes, and hats, but also instant cameras and records.

Later I ran into it again in New York and realized it is a chain. Now in Los Angeles, carless, I miss those Boston days of strolling and window-shopping at will.

The online store carries plenty, but it lacks the feeling of digging for treasure in a physical shop.

百武西

Taobao flagship store link

I discovered 百武西 while wandering the Yaohan mall in my hometown, at the turn of an escalator; the retro, literary decor drew me right over. It is a lifestyle brand that pays homage to the 1930s. Its mannequins have faces without sharp angles and figures without exaggerated curves, yet each looks like a young student stepping slowly out of a book, with the delicacy and youth unique to that age.

They are on Tmall and JD and have an official site, but I still love browsing the physical store piece by piece.

& Other Stories

https://www.stories.com/us/Ready-to-wear

An H&M brand. Young, fashionable, affordable. Bookmarked for now; I haven't bought anything yet.

Zara

Attractive prices and some style, but I find the quality underwhelming. I have only ever bought one pair of black pants there.

Pixie Market

https://www.pixiemarket.com/

A high-street brand, trendy. I fell for it at first sight and bought several pieces, but the quality does not justify the price. I won't buy again.

DKNY

https://www.dkny.com/us/sets/clothing-view-all

My latest discovery: a good store hiding in the outlets.

Tommy

http://usa.tommy.com/en/NEWARRIVALS-WOMEN

Minimalist Style

Sometimes I want to wear red lipstick for once, or I have just bought a new bag, and minimalist clothing makes such accents stand out all the more.
It is also the best business-casual: simple, easy, and neat.

That said, pieces cut in the minimalist style but in colors that are anything but muted are better tried on and matched in a physical store.

Uniqlo

Oddly enough, I never once bought Uniqlo in China; my first purchase was in the US, where I could not stop browsing and bought a pile in one go. I am not a regular Uniqlo customer, with exactly one summer shopping trip to my name, but that trip put the brand on my wishlist.

Everlane

https://www.everlane.com/collections/womens-all

I haven't bought from them yet, but the official site looks promising, and the prices are not particularly high.

Cos

https://www.cosstores.com/us/

I first tried it in senior year, before going to the US, when a friend recommended it at a mall in Hefei. Only after arriving in the US did I belatedly realize it is an H&M brand. Nice clothes, but I want to try them on in a physical store; I don't quite dare to order online directly.

Acne Studios

http://www.acnestudios.com/us/en/woman/outerwear/

Priced on the high side, but the official site really is my taste. The prices are genuinely a bit steep… must work harder…

frankandoak

https://www.frankandoak.com/women/featured/new-in

Shoes

Keds

http://www.keds.com/en/home

The American stand-in for old-Beijing cloth shoes… yes, I am a die-hard fan of old-Beijing cloth shoes…

A First Look at Web Crawling

Posted on 2017-10-02 | Category: Information Retrieval and Search Engines

[Notes attached as images]

De-Duplication

Posted on 2017-09-25 | Category: Information Retrieval and Search Engines

Definition

De-duplication – the process of identifying and avoiding essentially identical web pages

With respect to web crawling, de-duplication essentially refers to the identification of identical and nearly identical web pages and indexing only a single version to return as a search result

For identical or near-identical pages, keep only one version.

[Examples]

  • the URL’s host name can be distinct (virtual hosts),
  • the URL’s protocol can be distinct (http, https),
  • the URL’s path and/or page name can be distinct

Different URLs can point to the same site:

http://maps.google.com
https://www.google.com/maps

Near-identical pages

For example, two versions of the same page captured a few seconds apart.

Mirroring

Mirroring is the systematic replication of web pages across hosts.
Mirroring is a major source of duplicates.

Host1/α and Host2/β are mirrors iff:

    For all (or most) paths p such that http://Host1/α/p exists,
    http://Host2/β/p exists as well
    with identical (or near identical) content, and vice versa.

Example: https://www.apache.org/dyn/closer.cgi

[Note]

  • De-duplication also comes up often in cloud computing (storage de-duplication), but that is a different problem from the one discussed here

Handling Duplication and Near-Duplication

Duplication

[Method] Compute fingerprints using cryptographic hashing, e.g. SHA-1 or MD5. Keep the fingerprints sorted so that a lookup takes O(log N).
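The fingerprinting step can be sketched in Python. `fingerprint` and `dedupe` are hypothetical helper names of my own; a hash set stands in for the fingerprint store (a sorted fingerprint file with binary search would give the O(log N) lookup mentioned above):

```python
import hashlib

def fingerprint(page_text: str) -> str:
    """Collapse a whole page into a fixed-size cryptographic fingerprint."""
    return hashlib.sha1(page_text.encode("utf-8")).hexdigest()

def dedupe(pages):
    """Keep only the first copy of every exact-duplicate page."""
    seen = set()
    unique = []
    for page in pages:
        fp = fingerprint(page)
        if fp not in seen:  # with a sorted fingerprint file this is the O(log N) lookup
            seen.add(fp)
            unique.append(page)
    return unique
```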

Near-Duplication

[Method] compute the syntactic similarity, and use a similarity threshold to detect near-duplicates

Two specific approaches are introduced here:

  • Compute w-shingles and Jaccard similarity to measure how alike two pages are

[Reference video] https://www.youtube.com/watch?v=bQAYY8INBxg
[Video notes: images omitted]
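A minimal sketch of w-shingling and Jaccard similarity; the function names are my own, and real systems shingle at the character or word level depending on the task:

```python
def shingles(text: str, w: int = 2):
    """The set of w-shingles: every run of w consecutive words."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages whose shingle sets have a Jaccard similarity above some threshold (say 0.9) would be flagged as near-duplicates.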

  • Compute similarity via fingerprints
    [SimHash] SimHash is a hash in which similar items are hashed to similar hash values
    (similar meaning a small bitwise Hamming distance between the values):

        pick a hash size, say 32 bits
        let V = [0] * 32  (i.e. a vector of 32 zeros)
        break the input phrase up into shingles, e.g.
          'the cat sat on the mat'.shingles(2)
          => {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}
        hash each feature using a normal 32-bit hash algorithm, e.g.
          "th".hash = -502157718, "he".hash = -369049682, ...
        for each hash:
          if bit_i of the hash is set, add 1 to V[i]
          if bit_i of the hash is not set, subtract 1 from V[i]
        bit_i of the simhash is 1 if V[i] > 0, and 0 otherwise
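The pseudocode above can be turned into a small Python sketch. Truncated MD5 stands in for the "normal 32-bit hash", since Python's built-in `hash()` is salted per process and would not be stable:

```python
import hashlib

def simhash(text: str, bits: int = 32) -> int:
    """SimHash over 2-character shingles, following the pseudocode above."""
    v = [0] * bits
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    for g in grams:
        # stable per-feature hash: first 4 bytes of MD5
        h = int.from_bytes(hashlib.md5(g.encode("utf-8")).digest()[:4], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Similar inputs tend to share most shingles, so their vote vectors V, and hence their fingerprints, tend to agree in most bit positions.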

[Finding similar pages via SimHash]

  • If two SimHash values have a small Hamming distance, then sorting all SimHash values tends to place them near each other
  • Rotating every fingerprint left or right by the same number of bit positions leaves all pairwise Hamming distances unchanged,
    so we can do the following:

        rotate the bits B times (B is the fingerprint length in bits)
        after each rotation: sort, then check adjacent entries
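A toy sketch of that rotate-sort-check procedure; `near_pairs` is a hypothetical helper of my own, and production systems use much longer fingerprints and block-wise permutations rather than single-bit rotations:

```python
def rotate(x: int, b: int) -> int:
    """Rotate a b-bit fingerprint left by one bit."""
    mask = (1 << b) - 1
    return ((x << 1) | (x >> (b - 1))) & mask

def ham(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_pairs(fps, b, max_dist=3):
    """Rotate all fingerprints b times; after every rotation, sort and compare
    adjacent entries. Rotation preserves pairwise Hamming distance, so close
    fingerprints tend to become sort-neighbors in some rotation."""
    found = set()
    cur = list(enumerate(fps))          # remember original indices
    for _ in range(b):
        cur = [(i, rotate(x, b)) for i, x in cur]
        by_value = sorted(cur, key=lambda t: t[1])
        for (i, x), (j, y) in zip(by_value, by_value[1:]):
            if ham(x, y) <= max_dist:
                found.add((min(i, j), max(i, j)))
    return found
```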

Search Engine Evaluation

Posted on 2017-09-25 | Category: Information Retrieval and Search Engines

How do we measure the quality of search engines?

Recall and Precision

$$Precision = \frac{\text{relevant items retrieved}}{\text{all retrieved items}}$$ $$Recall = \frac{\text{relevant items retrieved}}{\text{all relevant items}}$$

—              Relevant             Nonrelevant
Retrieved      True Positive (tp)   False Positive (fp)
Not Retrieved  False Negative (fn)  True Negative (tn)

$$Precision = \frac{tp}{tp+fp}$$ $$Recall = \frac{tp}{tp+fn}$$ $$Accuracy = \frac{tp+tn}{tp+fp+fn+tn}$$

[Explanation]: "Positive" here means a positive judgment by the system, i.e. retrieved; "negative" means not retrieved. So a true positive is a correct positive judgment (the result really is relevant), while a false negative is an incorrect negative judgment (the result is actually relevant).

[Accuracy]: \(Accuracy = \frac{tp+tn}{tp+fp+fn+tn}\). However, accuracy is usually not used as the evaluation criterion: in a heavily skewed collection where almost everything is nonrelevant, a system that retrieves nothing already scores near-perfect accuracy.

[An interesting piece of analysis]:
The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
Typical web surfers would like every result on the first page to be relevant (high precision) but have not the slightest interest in knowing let alone looking at every document that is relevant.
In web applications, precision is more important than recall.
In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it.
In professional search, we care more about high recall and will accept relatively low precision to achieve it.
Individuals searching their hard disks are also often interested in high recall searches.
Hard-disk search is another case where high recall is expected.
Nevertheless, the two quantities clearly trade off against one another: you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries! Recall is a non-decreasing function of the number of documents retrieved. On the other hand, in a good system, precision usually decreases as the number of documents retrieved is increased. In general we want to get some amount of recall while tolerating only a certain percentage of false positives.
In short, precision and recall constrain each other. You can always reach a recall of 1 by retrieving every document for every query, but precision will then be very low. In a good system, precision typically falls as the number of retrieved documents grows; in general we accept a certain share of false positives in exchange for better recall.

Pythagorean Means

  • Arithmetic mean

$$A=\frac{1}{n}\sum_{i=1}^n{x_i}$$

  • Geometric mean

$$G = \sqrt[n]{x_1 x_2 \cdots x_n}$$

  • Harmonic mean

$$H = \left(\frac{\sum_{i=1}^n{x_i^{-1}}}{n}\right)^{-1}$$

F Measure

weighted harmonic mean of precision and recall

$$F = \frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}
= \frac{({\beta}^2+1)PR}{\beta^2P+R}, \space\space\space {\beta}^2=\frac{1-\alpha}{\alpha} $$

  • Balanced \(F_1\)measure: with \(\beta=1\), \(F=\frac{2PR}{P+R}\)
  • Values of \(\beta<1\) emphasize precision
  • Values of \(\beta>1\) emphasize recall
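The formulas above transcribe directly into code; this sketch uses the same beta parametrization as the F formula:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: ((beta^2 + 1) * P * R) / (beta^2 * P + R)."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)
```

With the default beta = 1 this is the balanced F1; a larger beta pulls the score toward recall, a smaller one toward precision.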

Calculating Recall/Precision at Fixed Positions

The precision, recall, and F measures above are set-based: they evaluate an unordered set of documents.
A typical search engine, however, returns ranked results. This section covers how to evaluate a ranked list.

  • Average Precision of the Relevant Documents

[Example]

Relevant documents = 6

Rank  Relevant (R) or Nonrelevant (N)  Recall  Precision
1     R                                0.17    1.0
2     N                                0.17    0.5
3     R                                0.33    0.67
4     R                                0.5     0.75
5     R                                0.67    0.8
6     R                                0.83    0.83
7     N                                0.83    0.71
8     N                                0.83    0.63
9     N                                0.83    0.56
10    R                                1.0     0.6

Average Precision of the Relevant Documents = (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6)/6 ≈ 0.78

  • Averaging Across Queries

The example above covers a single query; how do we evaluate over several queries?

[Method 1]
As above: (sum of the precision at every relevant document)/(total number of relevant documents)
[figure]

$$(1 + .67 + .5 + .44 + .5 + .5 + .4 + .43)/8 = 0.55$$

[Method 2]
Mean Average Precision (MAP)
This is the measure most commonly used in research papers.
$$MAP=\frac{\sum_{q=1}^Q{Avg{P(q)}}}{Q}, \quad Q=\text{number of queries}$$
[figure]
$$AvgP(1)=(1.0+0.67+0.5+0.44+0.5)/5=0.62$$ $$AvgP(2)=(0.5+0.4+0.43)/3=0.44$$ $$MAP=(0.62+0.44)/2=0.53$$
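A sketch of average precision and MAP as defined above, taking each query as a list of binary relevance judgments in rank order; it reproduces the single-query example from the previous section (AP = 0.775 ≈ 0.78):

```python
def average_precision(relevance):
    """relevance: 0/1 judgments in rank order. Average the precision
    measured at the rank of each relevant document."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """MAP: the mean of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)
```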

Drawbacks of evaluating a search engine with MAP:

  1. Macro-averaging: each query counts equally, regardless of how common or hard it is
  2. MAP assumes the user wants to find many relevant documents for each query, which does not describe a user who only needs one good answer
  3. MAP requires many relevance judgments over the document collection

Discounted Cumulative Gain

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically, proportional to the position of the result.
The basic idea: if a relevant document appears far down the result list, it should be penalized accordingly; the further down, the larger the penalty.

[figure]
[Example]
[figure]
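One common formulation of DCG as a sketch: the first result keeps its full gain, and every later gain is discounted by log2 of its rank (another widely used variant discounts every position by log2(i + 1) instead):

```python
from math import log2

def dcg(gains):
    """DCG with the classic discount: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(g if i == 1 else g / log2(i)
               for i, g in enumerate(gains, start=1))
```

Swapping a high-gain document to a lower rank lowers the score, which is exactly the penalty described above.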

Limitations of precision- and recall-based measures:

  • They should be averaged over large document collections and query ensembles
  • They need human relevance assessments, which are not entirely reliable
  • Assessments have to be binary: relevance is treated as all-or-nothing
  • They are heavily skewed by collection/authorship, so results may not generalize across domains

Non-relevance-based measures

Search engines also use measures that are not based on relevance judgments:

  • Click-through on first result: Not very reliable if you look at a single click-through … but pretty reliable in the aggregate.
  • Studies of user behavior in the lab
  • A/B testing

Inverted Indexing

Posted on 2017-09-20 | Category: Information Retrieval and Search Engines

Definition of an Inverted index

An inverted index is typically composed of a vector containing all distinct words of the text collection in lexicographical order (which is called the vocabulary) and for each word in the vocabulary, a list of all documents (and text positions) in which that word occurs

In plain terms: a mapping from each word to the set of documents that contain that word.

Note: the words here are usually preprocessed; common processing steps include:

  • Case folding: converting all uppercase letters to lower case
  • Stemming: reducing words to their morphological root
  • Stop words: removing words that are so common they provide no information

Use cases for an inverted index

Suppose we have this query:

    Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Approach 1: Brute-Force Search

Find all plays containing Brutus and Caesar, then scan them word by word and drop any play that contains Calpurnia.

Drawbacks:

  • Too slow
  • Requires a lot of space
  • Inflexible; it does not support other operations, such as "Find the word Romans near countrymen"

Approach 2: Term-Document Incidence Matrix

This uses the bit-map idea. Build a matrix in which entry (t, d) = 1 if term t occurs in document d, and 0 otherwise. For example:

[figure]

For the previous query we then only need (the last operand is the bitwise complement of Calpurnia's row):

110100 AND 110111 AND 101111 = 100100

So the two plays matching the query are:

"Antony and Cleopatra", "Hamlet"
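The same Boolean query, done with Python integers as bit vectors; Calpurnia's row is taken as 010000 (the complement of 101111), and the complement is masked so it stays within the six document bits:

```python
# One bit per play; bit set = the term occurs in that play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

N_DOCS = 6
mask = (1 << N_DOCS) - 1                 # complement only within the 6 bits
answer = brutus & caesar & (~calpurnia & mask)
```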

Drawbacks:

  • Requires a lot of space, and the matrix is usually extremely sparse:

    Given 1 million documents and 500,000 terms,
    the (term x document) matrix in this case will have size 500K x 1M,
    i.e. half a trillion 0's and 1's.
Approach 3: Inverted Index

As the data structure for the postings we choose linked lists rather than arrays; the advantages of lists are:

  • Dynamic space allocation
  • Easy insertion of terms into documents
  • There is space overhead for the pointers, though this is not too serious

The construction of an inverted index is shown in the figure below.
[figure]

Example:

    1) Brutus AND Caesar
       - Locate Brutus in the dictionary and retrieve its postings:
         Brutus: 2->4->8->16->32->64->128
       - Locate Caesar in the dictionary and retrieve its postings:
         Caesar: 1->2->3->5->8->13->21->34
       - "Merge" the two postings lists and select the documents in common
         (postings are document ids):
         Brutus: 2->4->8->16->32->64->128
         Caesar: 1->2->3->5->8->13->21->34
         => 2->8
       Complexity: O(m + n)

    2) Brutus AND Calpurnia AND Caesar
       Brutus:    2->4->8->16
       Caesar:    1->2->3->5->8->13->21->34
       Calpurnia: 2->16
       Merge in order from the shortest list to the longest:
       execute the query as (Calpurnia AND Brutus) AND Caesar.

       INTERSECT(<t1, ..., tn>)
         terms  <- SORTBYINCREASINGFREQUENCY(<t1, ..., tn>)
         result <- postings(first(terms))
         terms  <- rest(terms)
         while terms != NIL and result != NIL
           do result <- INTERSECT(result, postings(first(terms)))
              terms  <- rest(terms)
         return result

As I recall, merging k sorted lists can be done in O(nk log k), where n is the average list length.
[LeetCode reference: https://leetcode.com/problems/merge-k-sorted-lists/description/]
The book probably avoids that algorithm for good reasons:
  - in practice the list lengths are anything but uniform
  - the merge can stop as soon as the shortest list is exhausted; a complete k-way merge is unnecessary
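The pairwise O(m + n) merge can be sketched in Python with array-backed postings standing in for the linked lists:

```python
def intersect(p1, p2):
    """Two-pointer 'merge' of two sorted postings lists in O(m + n)."""
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return common
```

On the Brutus/Caesar example above, this returns the common documents 2 and 8.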

Skip Pointers

Related video explanation

Goal: speed up the merging of postings.

How it works: see the figure below.
[figure]

Algorithm

    INTERSECTWITHSKIPS(p1, p2)
      answer <- <>
      while p1 != NIL and p2 != NIL
        do if docID(p1) == docID(p2)
             then ADD(answer, docID(p1))
                  p1 <- next(p1)
                  p2 <- next(p2)
           else if docID(p1) < docID(p2)
             then if hasSkip(p1) and (docID(skip(p1)) <= docID(p2))
                    then while hasSkip(p1) and (docID(skip(p1)) <= docID(p2))
                           do p1 <- skip(p1)
                    else p1 <- next(p1)
           else if hasSkip(p2) and (docID(skip(p2)) <= docID(p1))
                  then while hasSkip(p2) and (docID(skip(p2)) <= docID(p1))
                         do p2 <- skip(p2)
                  else p2 <- next(p2)
      return answer

When it applies:

  1. The index is relatively static, with few updates: building effective skip pointers is easy if an index is relatively static; it is harder if a postings list keeps changing because of updates. A malicious deletion strategy can render skip lists ineffective.
  2. Skip pointers only help for AND queries, not for OR queries.

Question: How many skip pointers should we add?

  1. Trade-off:
  • More skips mean shorter skip spans, so a skip is more likely to succeed, but there are many comparisons against skip pointers
  • Fewer skips mean fewer pointer comparisons, but long skip spans yield few successful skips
  2. Simple heuristic:
  • Use the square root of the length: for a postings list of length P, use √P evenly spaced skip pointers.
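A sketch of the same merge with the √P heuristic, over array-backed postings; `intersect_with_skips` is my own name, and a "skip pointer" here is simply a jump of `isqrt(len(p))` array slots:

```python
from math import isqrt

def intersect_with_skips(p1, p2):
    """Postings merge where each list gets sqrt(P) evenly spaced skips."""
    s1 = max(isqrt(len(p1)), 1)          # skip span for each list
    s2 = max(isqrt(len(p2)), 1)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if it does not overshoot p2[j]
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```

It returns exactly what the plain two-pointer merge returns, just with fewer element comparisons on long runs of non-matching docIDs.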

How hardware affects speed

Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you’re running a search engine with everything in memory then the equation changes again.

Phrase Queries

The inverted-index techniques so far are term-based, but in practice we often query a phrase rather than a single word, e.g. "stanford university".

Two approaches are described here.

Approach 1: Bi-word indexes (2-grams, 2-shingles)

How it works

  • A bi-word (or 2-gram) is a consecutive pair of terms in some text. For example, "Friends, Romans, Countrymen" generates the bi-words "friends romans" and "romans countrymen"
  • Each of these bi-words is then added to the dictionary as a term; that is, bi-words take the place of the single-term entries

Drawbacks

  • Vocabulary explosion: bi-words blow up the size of the vocabulary database
  • Phrases can be longer than two words: queries longer than 2 words have to be broken into bi-word segments
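Generating the bi-words can be sketched as follows; the tokenization here is a crude letters-only split, just for illustration:

```python
import re

def biwords(text: str):
    """Lowercase, strip punctuation, and emit consecutive term pairs."""
    terms = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]
```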
Approach 2: Positional indexes

How it works

Compared with the plain inverted index, we additionally record the positions at which each term occurs:

    <Term, TotalFrequency> -> <docID, Frequency>
    becomes
    <Term, TotalFrequency> -> <docID, Frequency, Positions>

[figure]

Say we want to query:

    to be or not to be

We start by finding "to be", as follows:

    1. Find the documents that contain both terms, to and be.
    2. Look up the positions of both terms in those documents; whenever an
       occurrence of be sits exactly one position after an occurrence of to,
       we have a match. From the figure above, for instance, one possible match is:
         to: <...; 4:<..., 429, 433>; ...>
         be: <...; 4:<..., 430, 434>; ...>
Algorithm

    POSITIONALINTERSECT(p1, p2, k)
      answer <- <>
      while p1 != NIL and p2 != NIL
        do if docID(p1) == docID(p2)
             then l <- <>
                  pp1 <- positions(p1)
                  pp2 <- positions(p2)
                  while pp1 != NIL
                    do while pp2 != NIL
                         do if |pos(pp1) - pos(pp2)| <= k
                              then ADD(l, pos(pp2))
                            else if pos(pp2) > pos(pp1)
                              then break
                            pp2 <- next(pp2)
                       while l != <> and |l[0] - pos(pp1)| > k
                         do DELETE(l[0])
                       for each ps in l
                         do ADD(answer, <docID(p1), pos(pp1), ps>)
                       pp1 <- next(pp1)
                  p1 <- next(p1)
                  p2 <- next(p2)
           else if docID(p1) < docID(p2)
             then p1 <- next(p1)
           else p2 <- next(p2)
      return answer
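A simplified positional lookup for the two-term phrase case (`term2` directly follows `term1`, i.e. a fixed offset of +1 rather than the general window k); the index layout is a toy dict version of the positional postings above:

```python
def phrase_matches(index, term1, term2):
    """All (docID, position) pairs where term2 occurs exactly one position
    after term1. `index` maps term -> {docID: sorted position list}."""
    hits = []
    for doc in sorted(index[term1].keys() & index[term2].keys()):
        later = set(index[term2][doc])
        for p in index[term1][doc]:
            if p + 1 in later:
                hits.append((doc, p))
    return hits
```

On the "to be" example above, document 4 with to at 429 and 433 and be at 430 and 434 yields two matches.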

[CLRS Chapter 32] String Matching

Posted on 2017-09-17 | Category: Algorithms

Introduction to Algorithms, Chapter 32: String Matching

I have been studying search engines lately, which again touched on string-processing knowledge, so I decided to read Chapter 32 of Introduction to Algorithms once more and leave some notes. I will summarize the rest another time.

The book introduces four methods:

Naive string-matching algorithm (brute force)

The Rabin-Karp algorithm

Finite automata

KMP

My notes: [images omitted]
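Of the four, KMP is the one I reach for most; a sketch using the prefix function (the π table from CLRS):

```python
def prefix_function(p: str):
    """pi[i] = length of the longest proper prefix of p[:i+1] that is
    also a suffix of it."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[k] != p[i]:
            k = pi[k - 1]          # fall back to the next shorter border
        if p[k] == p[i]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(text: str, pattern: str):
    """Start indices of all occurrences of pattern in text, in O(n + m)."""
    pi = prefix_function(pattern)
    out, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and pattern[k] != c:
            k = pi[k - 1]
        if pattern[k] == c:
            k += 1
        if k == len(pattern):
            out.append(i - k + 1)
            k = pi[k - 1]          # keep scanning for overlapping matches
    return out
```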

References:

Introduction to Algorithms
and https://niuye.info/fms-kmp/

Staying True to the Original Aspiration

Posted on 2017-08-11 | Category: Opening

A preface to this blog, written by a pseudo-programmer who keeps pondering a career change while sinking ever deeper into the coding world.

Before 27, focus on the code.

Now and then, think of poetry and faraway places.

Keep planting wishes for my own life, and remember to tend them on time.


On the Profession

  • Yaser Abu-Mostafa

  • Deep Learning (CN/EN) | Udacity

  • GoodFellow

  • Clean Code

  • Thinking in Java

  • Elm

  • Algorithms

  • Spider

On Interests

  • An English grammar book

  • 朗读者 (The Reader)
18 posts
8 categories
© 2018 Jade-Jade-Jade
Powered by Hexo
Theme - NexT.Gemini