<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Duncan&#39;s Blog</title>


  <link href="/atom.xml" rel="self"/>

  <link href="https://github.com/DuncanZhou/"/>
  <updated>2019-09-23T06:48:36.344Z</updated>
  <id>https://github.com/DuncanZhou/</id>

  <author>
    <name>duncan</name>

  </author>

  <generator uri="http://hexo.io/">Hexo</generator>

  <entry>
    <title>Spark Notes</title>
    <link href="https://github.com/DuncanZhou/2019/09/23/Spark%E7%AC%94%E8%AE%B0/"/>
    <id>https://github.com/DuncanZhou/2019/09/23/Spark笔记/</id>
    <published>2019-09-22T16:00:00.000Z</published>
    <updated>2019-09-23T06:48:36.344Z</updated>

    <content type="html"><![CDATA[<h3 id="Spark笔记"><a href="#Spark笔记" class="headerlink" title="Spark笔记"></a>Spark Notes</h3><h3 id="1-数据结构方式"><a href="#1-数据结构方式" class="headerlink" title="1.数据结构方式"></a>1. Data structure</h3><p>The RDD is the data structure Spark uses to process data. An RDD can be created by loading data in two ways:</p><ul><li>parallelize an existing collection in the program, e.g. an Array</li><li>read from an external source: CSV files, Hive, etc.</li></ul><h3 id="2-RDD操作类型"><a href="#2-RDD操作类型" class="headerlink" title="2.RDD操作类型"></a>2. RDD operation types</h3><p>2.1 RDDs are evaluated lazily: a transformation is only computed when its result is actually needed.</p><p>2.2 If an RDD is used repeatedly, it can be persisted with persist.</p><p>2.3 There are several ways to package the functions passed to Spark:</p><ul><li>for static methods, create an <strong>object</strong></li><li>for methods together with fields, parameters and so on, create a <strong>class</strong></li></ul><p>2.4 Common transformations</p><div class="table-container"><table><thead><tr><th style="text-align:left">Transformation</th><th style="text-align:left">Meaning</th></tr></thead><tbody><tr><td style="text-align:left"><strong>map</strong>(<em>func</em>)</td><td style="text-align:left">Return a new distributed dataset formed by passing each element of the source through a function <em>func</em>.</td></tr><tr><td style="text-align:left"><strong>filter</strong>(<em>func</em>)</td><td style="text-align:left">Return a new dataset formed by selecting those elements of the source on which <em>func</em> returns true.</td></tr><tr><td style="text-align:left"><strong>flatMap</strong>(<em>func</em>)</td><td style="text-align:left">Similar to map, but each input item can be mapped to 0 or more output items (so <em>func</em> should return a Seq rather than a single item).</td></tr><tr><td style="text-align:left"><strong>mapPartitions</strong>(<em>func</em>)</td><td style="text-align:left">Similar to map, but runs separately on each partition (block) of the RDD, so <em>func</em> must be of type Iterator&lt;T&gt; =&gt; Iterator&lt;U&gt; when running on an RDD of type T.</td></tr><tr><td style="text-align:left"><strong>mapPartitionsWithIndex</strong>(<em>func</em>)</td><td style="text-align:left">Similar to mapPartitions, but also provides <em>func</em> with an integer value representing the index 
of the partition, so <em>func</em> must be of type (Int, Iterator&lt;T&gt;) =&gt; Iterator&lt;U&gt; when running on an RDD of type T.</td></tr><tr><td style="text-align:left"><strong>sample</strong>(<em>withReplacement</em>, <em>fraction</em>, <em>seed</em>)</td><td style="text-align:left">Sample a fraction <em>fraction</em> of the data, with or without replacement, using a given random number generator seed.</td></tr><tr><td style="text-align:left"><strong>union</strong>(<em>otherDataset</em>)</td><td style="text-align:left">Return a new dataset that contains the union of the elements in the source dataset and the argument.</td></tr><tr><td style="text-align:left"><strong>intersection</strong>(<em>otherDataset</em>)</td><td style="text-align:left">Return a new RDD that contains the intersection of elements in the source dataset and the argument.</td></tr><tr><td style="text-align:left"><strong>distinct</strong>([<em>numPartitions</em>])</td><td style="text-align:left">Return a new dataset that contains the distinct elements of the source dataset.</td></tr><tr><td style="text-align:left"><strong>groupByKey</strong>([<em>numPartitions</em>])</td><td style="text-align:left">When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs.  <strong>Note:</strong> If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better performance.  <strong>Note:</strong> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. 
You can pass an optional <code>numPartitions</code> argument to set a different number of tasks.</td></tr><tr><td style="text-align:left"><strong>reduceByKey</strong>(<em>func</em>, [<em>numPartitions</em>])</td><td style="text-align:left">When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function <em>func</em>, which must be of type (V,V) =&gt; V. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument.</td></tr><tr><td style="text-align:left"><strong>aggregateByKey</strong>(<em>zeroValue</em>)(<em>seqOp</em>, <em>combOp</em>, [<em>numPartitions</em>])</td><td style="text-align:left">When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument.</td></tr><tr><td style="text-align:left"><strong>sortByKey</strong>([<em>ascending</em>], [<em>numPartitions</em>])</td><td style="text-align:left">When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean <code>ascending</code> argument.</td></tr><tr><td style="text-align:left"><strong>join</strong>(<em>otherDataset</em>, [<em>numPartitions</em>])</td><td style="text-align:left">When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. 
Outer joins are supported through <code>leftOuterJoin</code>, <code>rightOuterJoin</code>, and <code>fullOuterJoin</code>.</td></tr><tr><td style="text-align:left"><strong>cogroup</strong>(<em>otherDataset</em>, [<em>numPartitions</em>])</td><td style="text-align:left">When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called <code>groupWith</code>.</td></tr><tr><td style="text-align:left"><strong>cartesian</strong>(<em>otherDataset</em>)</td><td style="text-align:left">When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).</td></tr><tr><td style="text-align:left"><strong>pipe</strong>(<em>command</em>, <em>[envVars]</em>)</td><td style="text-align:left">Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings.</td></tr><tr><td style="text-align:left"><strong>coalesce</strong>(<em>numPartitions</em>)</td><td style="text-align:left">Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.</td></tr><tr><td style="text-align:left"><strong>repartition</strong>(<em>numPartitions</em>)</td><td style="text-align:left">Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.</td></tr><tr><td style="text-align:left"><strong>repartitionAndSortWithinPartitions</strong>(<em>partitioner</em>)</td><td style="text-align:left">Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. 
This is more efficient than calling <code>repartition</code> and then sorting within each partition because it can push the sorting down into the shuffle machinery.</td></tr></tbody></table></div><h3 id="3-创建DataFrame的三种方式"><a href="#3-创建DataFrame的三种方式" class="headerlink" title="3.创建DataFrame的三种方式"></a>3. Three ways to create a DataFrame</h3><ul><li><p>use the toDF function</p></li><li><p>use the createDataFrame function</p></li><li>create it directly from a file</li></ul><h3 id="4-scala的vector和spark包中vector不一样"><a href="#4-scala的vector和spark包中vector不一样" class="headerlink" title="4.scala的vector和spark包中vector不一样"></a>4. Scala's Vector is not the same type as the Vector in the Spark packages</h3><h3 id="5-Spark优化：（美团Spark）"><a href="#5-Spark优化：（美团Spark）" class="headerlink" title="5.Spark优化：（美团Spark）"></a>5. Spark tuning (Meituan tech blog)</h3><p>Basics: <a href="https://tech.meituan.com/2016/04/29/spark-tuning-basic.html" target="_blank" rel="external">https://tech.meituan.com/2016/04/29/spark-tuning-basic.html</a></p><p>Advanced: <a href="https://tech.meituan.com/2016/05/12/spark-tuning-pro.html" target="_blank" rel="external">https://tech.meituan.com/2016/05/12/spark-tuning-pro.html</a></p><h3 id="6-Spark保留运行环境（用于查错）"><a href="#6-Spark保留运行环境（用于查错）" class="headerlink" title="6.Spark保留运行环境（用于查错）"></a>6. Preserving the Spark staging files (for debugging)</h3><figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">--conf spark.yarn.preserve.staging.files=true</div></pre></td></tr></table></figure><h3 id="7-宽依赖和窄依赖"><a href="#7-宽依赖和窄依赖" class="headerlink" title="7.宽依赖和窄依赖"></a>7. Wide and narrow dependencies</h3><ul><li><strong>Narrow dependency</strong>: each partition of the parent RDD is used by at most one partition of the child RDD, and a child partition usually depends on a constant number of parent partitions (e.g. map, filter, union).</li><li><strong>Wide dependency</strong>: each partition of the parent RDD may be used by several partitions of the child RDD, and a child partition usually depends on all partitions of the parent RDD (e.g. groupByKey, partitionBy).</li><li>Comparison: <strong>a wide dependency usually corresponds to a shuffle</strong>. At run time the data of one parent partition has to be sent to several child partitions, which may involve transferring data between multiple nodes.</li></ul><h3 id="8-ORC格式和PARQUET格式文件对比"><a href="#8-ORC格式和PARQUET格式文件对比" class="headerlink" 
title="8.ORC格式和PARQUET格式文件对比"></a>8. ORC vs. Parquet file formats</h3><blockquote><p>At the time of writing, Impala does not support querying tables stored as ORC.</p></blockquote><h3 id="9-left-anti-join（某个字段过滤用）"><a href="#9-left-anti-join（某个字段过滤用）" class="headerlink" title="9.left anti join（某个字段过滤用）"></a>9. left anti join (for filtering on a column)</h3><ul><li>left semi join -&gt; exists</li><li>left anti join -&gt; not exists</li></ul><h3 id="10-Shuffle过程数据倾斜"><a href="#10-Shuffle过程数据倾斜" class="headerlink" title="10.Shuffle过程数据倾斜"></a>10. Data skew during shuffle</h3><blockquote><p>As in Hive, data skew arises during the shuffle, so the summary below uses Hive's shuffle as the example. The root cause is that the keys are unevenly distributed after the shuffle: a large share of the keys lands on a single reduce node and overloads it. Once the other nodes have finished, the job still has to wait for this node, so the whole job is blocked by it.</p><p>There are two main ways to address the problem:</p><ul><li>avoid the shuffle wherever possible;</li><li>if a shuffle is needed, keep the key distribution across the reduce nodes as even as possible.</li></ul></blockquote><p>The approaches are summarized below:</p><hr><p>Solutions: MapJoin, adding random prefixes, using a list bucketing table</p><ul><li>mapjoin</li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">-- mapjoin settings</div><div class="line">set hive.auto.convert.join = true;</div><div class="line">set hive.mapjoin.smalltable.filesize=25000000;</div></pre></td></tr></table></figure><ul><li>manually split the join into two parts</li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">select t1.* </div><div class="line">from t1 join t2 on t1.key=t2.key</div><div class="line"></div><div class="line">-- split into the following SQL (A is the skewed key):</div><div class="line"></div><div class="line">select t1.* </div><div class="line">from t1 join t2 on t1.key=t2.key</div><div class="line">where t1.key=A</div><div class="line">union all </div><div 
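class="line">select t1.*</div><div class="line">from t1 join t2 on t1.key=t2.key</div><div class="line">where t1.key&lt;&gt;A</div></pre></td></tr></table></figure><ul><li>when the small table is too large for a comfortable mapjoin, add one of N random prefixes to the keys of the large table and replicate the small table N times, once per prefix</li><li>use a Skewed Table or a List Bucketing Table</li></ul>]]></content>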

    <summary type="html">


        &lt;h3 id=&quot;Spark笔记&quot;&gt;&lt;a href=&quot;#Spark笔记&quot; class=&quot;headerlink&quot; title=&quot;Spark笔记&quot;&gt;&lt;/a&gt;Spark笔记&lt;/h3&gt;&lt;h3 id=&quot;1-数据结构方式&quot;&gt;&lt;a href=&quot;#1-数据结构方式&quot; class=&quot;headerli


    </summary>

      <category term="Learning" scheme="https://github.com/DuncanZhou//categories/Learning/"/>


      <category term="大数据" scheme="https://github.com/DuncanZhou//tags/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>

  </entry>

  <entry>
    <title>Scala Notes</title>
    <link href="https://github.com/DuncanZhou/2019/09/23/Scala%E7%AC%94%E8%AE%B0/"/>
    <id>https://github.com/DuncanZhou/2019/09/23/Scala笔记/</id>
    <published>2019-09-22T16:00:00.000Z</published>
    <updated>2019-09-23T06:49:47.411Z</updated>

    <content type="html"><![CDATA[<h2 id="Scala笔记"><a href="#Scala笔记" class="headerlink" title="Scala笔记"></a>Scala Notes</h2><h3 id="1-四种操作符的区别和联系"><a href="#1-四种操作符的区别和联系" class="headerlink" title="1.四种操作符的区别和联系"></a>1. Four operators: differences and relationships</h3><ul><li><p>:: is called cons (construct) and prepends an element to a list. x::list prepends x to list. Building a list:</p><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="number">2</span>::<span class="number">1</span>::<span class="number">2</span>::<span class="string">"bar"</span>::<span class="string">"foo"</span>::<span class="type">Nil</span> <span class="comment">// List[Any] = List(2, 1, 2, bar, foo)</span></div></pre></td></tr></table></figure></li><li><p>:+ appends an element at the end and +: prepends an element at the head.</p></li><li><p>++ concatenates two collections</p></li><li><p>::: can only be used to concatenate two Lists</p></li></ul><h3 id="2-日期操作-经常用到，所以记录下"><a href="#2-日期操作-经常用到，所以记录下" class="headerlink" title="2.日期操作(经常用到，所以记录下)"></a>2. Date operations (used often, so noted here)</h3><ul><li><p>Get the timestamp of 00:00 today</p><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> java.text.<span class="type">SimpleDateFormat</span></div><div class="line"><span class="keyword">import</span> java.util.<span class="type">Date</span></div><div class="line"><span class="keyword">val</span> dateFormat = <span class="keyword">new</span> <span class="type">SimpleDateFormat</span>(<span class="string">"yyyy-MM-dd"</span>)</div><div class="line"><span class="keyword">val</span> cur = dateFormat.parse(dateFormat.format(<span class="keyword">new</span> <span class="type">Date</span>())).getTime</div></pre></td></tr></table></figure></li><li><p>Convert a formatted date to a timestamp</p><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> dateFormat = <span class="keyword">new</span> <span class="type">SimpleDateFormat</span>(<span 
class="string">"yyyy-MM-dd"</span>)</div><div class="line"><span class="comment">//val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")</span></div><div class="line"><span class="keyword">val</span> timestamp = dateFormat.parse(dateFormat.format(<span class="keyword">new</span> <span class="type">Date</span>())).getTime</div></pre></td></tr></table></figure></li><li><p>Convert a timestamp to a date</p><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> dateFormat = <span class="keyword">new</span> <span class="type">SimpleDateFormat</span>(<span class="string">"yyyy-MM-dd"</span>)</div><div class="line"><span class="keyword">val</span> date = dateFormat.format(<span class="keyword">new</span> <span class="type">Date</span>(timestamp))</div></pre></td></tr></table></figure></li></ul><h3 id="3-删除目录或文件"><a href="#3-删除目录或文件" class="headerlink" title="3.删除目录或文件"></a>3. Deleting a directory or file</h3><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> java.io.<span class="type">File</span></div><div class="line"><span class="keyword">import</span> java.nio.file.&#123;<span class="type">Files</span>, <span class="type">Paths</span>&#125;</div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">dirDel</span></span>(path: <span class="type">File</span>): <span class="type">Unit</span> = &#123;</div><div class="line">        <span class="keyword">if</span> (!path.exists())</div><div class="line">          <span class="keyword">return</span></div><div 
class="line">        <span class="keyword">else</span> <span class="keyword">if</span> (path.isFile()) &#123;</div><div class="line">          path.delete()</div><div class="line">          <span class="keyword">return</span></div><div class="line">        &#125;</div><div class="line">        <span class="keyword">val</span> file: <span class="type">Array</span>[<span class="type">File</span>] = path.listFiles()</div><div class="line">        <span class="keyword">for</span> (d &lt;- file) &#123;</div><div class="line">          dirDel(d)</div><div class="line">        &#125;</div><div class="line">        path.delete()</div><div class="line">      &#125;</div><div class="line">      <span class="keyword">if</span>(<span class="type">Files</span>.exists(<span class="type">Paths</span>.get(path))) &#123;</div><div class="line">        <span class="comment">// delete the directory's contents first, then the directory itself</span></div><div class="line">        dirDel(<span class="type">Paths</span>.get(path).toFile)</div><div class="line">      &#125;</div></pre></td></tr></table></figure>]]></content>

    <summary type="html">


        &lt;h2 id=&quot;Scala笔记&quot;&gt;&lt;a href=&quot;#Scala笔记&quot; class=&quot;headerlink&quot; title=&quot;Scala笔记&quot;&gt;&lt;/a&gt;Scala笔记&lt;/h2&gt;&lt;h3 id=&quot;1-四种操作符的区别和联系&quot;&gt;&lt;a href=&quot;#1-四种操作符的区别和联系&quot; class


    </summary>

      <category term="Learning" scheme="https://github.com/DuncanZhou//categories/Learning/"/>


      <category term="语言学习" scheme="https://github.com/DuncanZhou//tags/%E8%AF%AD%E8%A8%80%E5%AD%A6%E4%B9%A0/"/>

  </entry>

  <entry>
    <title>Learning Redis</title>
    <link href="https://github.com/DuncanZhou/2019/09/23/Redis%E7%AC%94%E8%AE%B0/"/>
    <id>https://github.com/DuncanZhou/2019/09/23/Redis笔记/</id>
    <published>2019-09-22T16:00:00.000Z</published>
    <updated>2019-09-23T06:50:23.424Z</updated>

    <content type="html"><![CDATA[<h3 id="Redis"><a href="#Redis" class="headerlink" title="Redis"></a>Redis</h3><h3 id="1、为什么使用Redis数据库"><a href="#1、为什么使用Redis数据库" class="headerlink" title="1、为什么使用Redis数据库"></a>1. Why use Redis</h3><ul><li><strong>Very high performance</strong>: Redis can serve roughly 110,000 reads/s and 81,000 writes/s.</li><li>Rich data types: Redis supports binary-safe Strings, Lists, Hashes, Sets and Ordered Sets.</li><li>Atomicity: every single Redis operation is atomic, meaning it either executes successfully or does not execute at all. Several operations can also be grouped into a transaction by wrapping them in MULTI and EXEC.</li><li>Rich features: Redis also supports publish/subscribe, notifications, key expiration and more.</li></ul><h3 id="2-数据类型"><a href="#2-数据类型" class="headerlink" title="2.数据类型"></a>2. Data types</h3><ul><li><p>string</p></li><li><p>hash: a collection of key-value pairs</p><p><code>… hashset1 key1 val1 key2 val2</code></p><p><code>HGET hashset1 key1</code></p></li><li><p>list</p><p><code>… list1 val1</code></p></li><li><p>set</p><p><code>sadd set1 val1</code></p><p><code>… set1</code></p></li><li><p>zset (sorted set, each member carries a score)</p><p><code>zadd key score member</code></p></li></ul><h3 id="3-set和hset区别"><a href="#3-set和hset区别" class="headerlink" title="3.set和hset区别"></a>3. Difference between set and hset</h3><ul><li>set stores data as a plain key-value pair and supports an expiration time. Its time complexity is O(1).</li><li>hset stores data in a hash table. An expiration time can only be set on the outer key; an individual field cannot have its own timeout.</li></ul><hr><p>Use cases compared: <strong>set stores a single large piece of unstructured text, while hset stores structured data</strong>. One hash holds one record, each field holds one attribute of that record, and the value is the attribute's value.</p><h3 id="4-Scan操作，keys-操作，线上谨慎使用"><a href="#4-Scan操作，keys-操作，线上谨慎使用" class="headerlink" title="4.Scan操作，keys()操作，线上谨慎使用"></a>4. SCAN and keys(): use with caution in production</h3><p>SCAN walks the keyspace in batches of about N keys per call and keeps only the keys that match the given prefix pattern.</p><figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">public</span> HashSet&lt;String&gt; <span class="title">getKeysByScan</span><span class="params">(String pattern)</span> </span>&#123;</div><div class="line">        ScanParams scanParams = <span class="keyword">new</span> ScanParams().count(<span class="number">1000</span>).match(pattern);</div><div class="line">        HashSet&lt;String&gt; allKeys = <span class="keyword">new</span> HashSet&lt;&gt;();</div><div class="line">        cluster.getClusterNodes().values().forEach((pool) -&gt; &#123;</div><div class="line">            String cur = ScanParams.SCAN_POINTER_START;</div><div class="line">            <span class="keyword">do</span> &#123;</div><div class="line">                <span class="keyword">try</span> (Jedis jedis = pool.getResource()) &#123;</div><div class="line">                    ScanResult&lt;String&gt; scanResult = jedis.scan(cur, scanParams);</div><div class="line">                    allKeys.addAll(scanResult.getResult());</div><div class="line">                    cur = 
scanResult.getStringCursor();</div><div class="line">                &#125;</div><div class="line">            &#125; <span class="keyword">while</span> (!cur.equals(ScanParams.SCAN_POINTER_START));</div><div class="line">        &#125;);</div><div class="line">        <span class="keyword">return</span> allKeys;</div><div class="line">    &#125;</div></pre></td></tr></table></figure>]]></content>

    <summary type="html">


        &lt;h3 id=&quot;Redis&quot;&gt;&lt;a href=&quot;#Redis&quot; class=&quot;headerlink&quot; title=&quot;Redis&quot;&gt;&lt;/a&gt;Redis&lt;/h3&gt;&lt;h3 id=&quot;1、为什么使用Redis数据库&quot;&gt;&lt;a href=&quot;#1、为什么使用Redis数据库&quot; class=&quot;he


    </summary>

      <category term="Learning" scheme="https://github.com/DuncanZhou//categories/Learning/"/>


      <category term="数据库" scheme="https://github.com/DuncanZhou//tags/%E6%95%B0%E6%8D%AE%E5%BA%93/"/>

  </entry>

  <entry>
    <title>Flink Study Notes</title>
    <link href="https://github.com/DuncanZhou/2019/09/05/Flink%E7%AC%94%E8%AE%B0/"/>
    <id>https://github.com/DuncanZhou/2019/09/05/Flink笔记/</id>
    <published>2019-09-05T03:10:23.580Z</published>
    <updated>2019-09-05T03:18:22.352Z</updated>

    <content type="html"><![CDATA[<h2 id="Flink笔记"><a href="#Flink笔记" class="headerlink" title="Flink笔记"></a>Flink Notes</h2><h3 id="1-数据集类型"><a href="#1-数据集类型" class="headerlink" title="1.数据集类型"></a>1. Dataset types</h3><ul><li>Bounded dataset: has a time boundary, so during processing the data is guaranteed to start and end within some time range. Served by the DataSet API.</li><li>Unbounded dataset: data that has been produced continuously from the start and keeps arriving. Served by the DataStream API.</li></ul><h3 id="2-Flink编程接口"><a href="#2-Flink编程接口" class="headerlink" title="2.Flink编程接口"></a>2. Flink programming interfaces</h3><ul><li>Flink SQL</li><li>Table API: adds schema information on top of the in-memory DataSet and DataStream, <strong>abstracting the data types into a table structure</strong></li><li>DataStream API and DataSet API</li><li>Stateful Stream Processing API</li></ul><h3 id="3-程序结构"><a href="#3-程序结构" class="headerlink" title="3.程序结构"></a>3. Program structure</h3><ul><li><p>Set up the execution environment:</p><ul><li><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> env = <span class="type">StreamExecutionEnvironment</span>.getExecutionEnvironment</div></pre></td></tr></table></figure></li><li><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="comment">/* set the parallelism to 5 */</span> <span class="keyword">val</span> env = <span class="type">StreamExecutionEnvironment</span>.createLocalEnvironment(<span class="number">5</span>)</div></pre></td></tr></table></figure></li><li><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> env = <span class="type">StreamExecutionEnvironment</span>.createRemoteEnvironment(<span class="string">"JobManagerHost"</span>,<span class="number">6021</span>,<span class="number">5</span>,<span class="string">"/user/application.jar"</span>)</div></pre></td></tr></table></figure></li></ul></li><li><p>Initialize the data:</p><p>Convert external data into a DataStream&lt;T&gt; or a DataSet&lt;T&gt;</p></li><li><p>Apply the transformation logic:</p><ul><li>for complex logic, implement the MapFunction interface and pass the implementing class to map()</li><li>anonymous functions</li><li>the RichFunction interface</li></ul></li><li><p>Specify the partition key</p></li><li><p>Partition by the first field and sum the second field</p><p><code>val result = 
DataStream.keyBy(0).sum(1)</code></p></li><li><p>Output the result</p><ul><li>file-based output</li><li>console output</li><li>Connector</li></ul></li><li><p>Trigger the program</p><p>Call execute() on the ExecutionEnvironment</p></li></ul><h3 id="4-数据类型"><a href="#4-数据类型" class="headerlink" title="4.数据类型"></a>4. Data types</h3><ul><li>Primitive types</li><li>Tuple types such as Tuple2</li><li>Scala case classes</li><li>POJOs: types for complex data structures</li><li>Flink Value types: IntValue, DoubleValue, StringValue</li><li>Special types: List, Map, Either, Option, Try</li></ul><h3 id="5-DataStream-API"><a href="#5-DataStream-API" class="headerlink" title="5.DataStream API"></a>5. DataStream API</h3><ul><li>DataSource<ul><li>Built-in sources<ul><li>file source</li><li>socket source</li><li>collection source</li></ul></li><li>External sources<ul><li>Kafka</li></ul></li></ul></li><li>Transformation<ul><li>Single-DataStream operations: Map, FlatMap, Filter, KeyBy, Reduce, aggregation functions (min, max, sum)</li><li>Multi-DataStream operations: Union, Connect, CoMap, CoFlatMap, Split, Select, Iterate</li></ul></li><li>DataSink<ul><li>file system</li><li>Kafka</li><li>Apache Cassandra</li><li>HDFS</li><li>RabbitMQ</li></ul></li></ul><h3 id="6-时间概念"><a href="#6-时间概念" class="headerlink" title="6.时间概念"></a>6. Notions of time</h3><ul><li>Event Time (when the event was generated)</li><li>Ingestion Time (when the event entered Flink)</li><li><p>Processing Time (when the event is processed)</p><p>(to be continued, 2019-09-05)</p></li></ul>]]></content>

    <summary type="html">


        &lt;h2 id=&quot;Flink笔记&quot;&gt;&lt;a href=&quot;#Flink笔记&quot; class=&quot;headerlink&quot; title=&quot;Flink笔记&quot;&gt;&lt;/a&gt;Flink笔记&lt;/h2&gt;&lt;h3 id=&quot;1-数据集类型&quot;&gt;&lt;a href=&quot;#1-数据集类型&quot; class=&quot;headerlink


    </summary>

      <category term="Streaming Data" scheme="https://github.com/DuncanZhou//categories/Streaming-Data/"/>


      <category term="实时数据" scheme="https://github.com/DuncanZhou//tags/%E5%AE%9E%E6%97%B6%E6%95%B0%E6%8D%AE/"/>

  </entry>

  <entry>
    <title>FM &amp; FFM &amp; DeepFM</title>
    <link href="https://github.com/DuncanZhou/2019/08/18/FM/"/>
    <id>https://github.com/DuncanZhou/2019/08/18/FM/</id>
    <published>2019-08-18T06:17:40.416Z</published>
    <updated>2019-08-18T13:56:31.566Z</updated>

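    <content type="html"><![CDATA[<h3 id="模型表示为：因子分解机（Factorization-Machine）"><a href="#模型表示为：因子分解机（Factorization-Machine）" class="headerlink" title="模型表示为：因子分解机（Factorization Machine）"></a>Model: Factorization Machine (FM)</h3><h3 id="1-概念"><a href="#1-概念" class="headerlink" title="1.概念"></a>1. Concept</h3><p>In problems such as ad click prediction, the goal is to predict whether a user will click an ad from the user profile, the ad slot and other features. One-hot encoding the categorical features makes the feature dimension explode and leaves the feature data sparse. FM's main strength is that it learns well from such sparse data.</p><hr><p>FM can handle three kinds of problems:</p><ul><li><strong>Regression</strong>: use the minimum mean squared error as the optimization criterion</li><li><strong>Binary classification</strong>: add an activation function such as sigmoid or tanh</li><li><strong>Ranking</strong>: retrieve candidates by predicted score</li></ul><h3 id="2-优点"><a href="#2-优点" class="headerlink" title="2.优点"></a>2. Advantages</h3><ul><li>Parameters can be estimated reasonably even <strong>on very sparse data</strong></li><li>The <strong>complexity</strong> of the FM model <strong>is linear</strong></li><li>FM is a <strong>general model</strong> that can be applied whenever the features are real-valued</li></ul><h3 id="3-为什么有效？模型细节"><a href="#3-为什么有效？模型细节" class="headerlink" title="3.为什么有效？模型细节"></a>3. Why does it work? Model details</h3><ul><li><p>An ordinary linear model treats each feature independently and ignores the relationships between features. In practice, <strong>many features are correlated</strong>. </p><blockquote><p>Example: in e-commerce, men buy more beer and women buy more cosmetics, so gender and purchase category are related.</p></blockquote></li><li><p>Model</p><ul><li><p>The ordinary linear model is</p><script type="math/tex; mode=display">y=w_0+\sum_{i=1}^{n}w_ix_i</script></li><li><p>The factorization machine of degree 2 is:</p><script type="math/tex; mode=display">y=w_0+\sum_{i=1}^{n}w_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle v_i,v_j\rangle x_ix_j</script><p>where $\langle v_i,v_j\rangle$ denotes the dot product of two vectors of size $k$. Compared with the linear model, FM adds the feature-interaction term at the end.</p></li></ul></li><li><p>How are the interaction weights estimated?</p><p>For each feature component $x_i$, construct an auxiliary vector $v_i=(v_{i1},v_{i2},\dots,v_{ik})$ and use $v_iv_j^T$ to estimate the interaction weight $w_{ij}$.</p></li><li><p>Choosing k</p><p>A larger k can capture more feature dimensions, but k should not be chosen too large.</p></li><li><p>Why is the model's complexity linear?</p><p>The pairwise term can be rewritten so that it costs $O(kn)$ instead of $O(kn^2)$:</p><script type="math/tex; mode=display">\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle v_i,v_j\rangle x_ix_j=\frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n}v_{if}x_i\right)^2-\sum_{i=1}^{n}v_{if}^2x_i^2\right]</script></li><li><p>Training</p><p>The model is fitted with stochastic gradient descent.</p></li></ul><hr><h3 id="局部感知因子分解机（FFM）"><a href="#局部感知因子分解机（FFM）" class="headerlink" title="局部感知因子分解机（FFM）"></a>Field-aware Factorization Machine (FFM)</h3><h3 id="1-基于FM改进之处"><a href="#1-基于FM改进之处" class="headerlink" 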
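title="1.基于FM改进之处"></a>1. Improvements over FM</h3><p>Features are very sparse after one-hot encoding, so the numeric features produced by one-hot encoding the same categorical feature can be grouped into the same field.</p><p>In FFM, each feature therefore learns a separate latent vector for every field of the other features, so the latent vector depends on the field as well as on the feature. If the $n$ features of a sample belong to $f$ fields, the second-order term of FFM has $nf$ latent vectors, whereas in FM each feature has only one latent vector. With latent vectors of length $k$, the second-order term of FFM has $nfk$ parameters, far more than FM's $nk$.</p><h3 id="2-模型"><a href="#2-模型" class="headerlink" title="2.模型"></a>2. Model</h3><script type="math/tex; mode=display">y=w_0+\sum_{i=1}^{n}w_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle v_{i,f_j},v_{j,f_i}\rangle x_ix_j</script><p>where $f_j$ denotes the field that feature $j$ belongs to.</p><h3 id="3-求解"><a href="#3-求解" class="headerlink" title="3.求解"></a>3. Training</h3><p>Stochastic gradient descent, the same as for FM</p><h3 id="4-应用"><a href="#4-应用" class="headerlink" title="4.应用"></a>4. Application</h3><p>To use FFM, every feature must be converted to the format “field_id:feat_id:value”, where field_id is the index of the field the feature belongs to, feat_id is the feature index, and value is the feature value. </p><hr><h3 id="DeepFM"><a href="#DeepFM" class="headerlink" title="DeepFM"></a>DeepFM</h3><h3 id="1-概念-1"><a href="#1-概念-1" class="headerlink" title="1.概念"></a>1. Concept</h3><p>DeepFM aims to learn <strong>low-order and high-order</strong> feature interactions at the same time. It consists of an FM part and a DNN part that share the same input at the bottom. The model can be written as:</p><script type="math/tex; mode=display">y=\mathrm{sigmoid}(y_{FM}+y_{DNN})</script><p>Here, low-order and high-order refer to the order of the feature combinations. FM can in theory model high-order combinations, but because of the computational cost only second-order combinations are normally used. FM therefore handles the second-order combinations and the DNN handles the high-order ones.</p><h3 id="2-优势"><a href="#2-优势" class="headerlink" title="2.优势"></a>2. Advantages</h3><ul><li>Combines high-order and low-order feature interactions (FM + DNN)</li><li>End-to-end model with no manual feature engineering (DNN)</li><li>Shares the same input and embedding parameters, so training is efficient (FFM is used for pre-training to obtain the embedded vectors)</li><li>A new metric, “Gini Normalization”, was used when evaluating the model</li></ul>]]></content>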

    <summary type="html">


        &lt;h3 id=&quot;模型表示为：因子分解机（Factorization-Machine）&quot;&gt;&lt;a href=&quot;#模型表示为：因子分解机（Factorization-Machine）&quot; class=&quot;headerlink&quot; title=&quot;模型表示为：因子分解机（Factorizatio


    </summary>

      <category term="MachineLearning" scheme="https://github.com/DuncanZhou//categories/MachineLearning/"/>


      <category term="MachineLearning" scheme="https://github.com/DuncanZhou//tags/MachineLearning/"/>

  </entry>

  <entry>
    <title>Chinese input problem in Sublime Text on Ubuntu</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/ubuntu-sublime/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/ubuntu-sublime/</id>
    <published>2019-08-17T08:14:59.465Z</published>
    <updated>2018-01-19T03:44:06.575Z</updated>

    <content type="html"><![CDATA[<h1 id="ubuntu下安装的sublime-text中文不能输入问题："><a href="#ubuntu下安装的sublime-text中文不能输入问题：" class="headerlink" title="ubuntu下安装的sublime text中文不能输入问题："></a>Chinese input does not work in Sublime Text installed on Ubuntu:</h1><h2 id="a-保存下面的代码到文件sublime-imfix-c-位于-目录"><a href="#a-保存下面的代码到文件sublime-imfix-c-位于-目录" class="headerlink" title="a.保存下面的代码到文件sublime_imfix.c(位于~目录)"></a>a. Save the following code to the file sublime_imfix.c (in the ~ directory)</h2><p>#include "gtk/gtkimcontext.h"<br>void gtk_im_context_set_client_window (GtkIMContext *context, GdkWindow *window)<br>{<br>GtkIMContextClass *klass;<br>g_return_if_fail (GTK_IS_IM_CONTEXT (context));<br>klass = GTK_IM_CONTEXT_GET_CLASS (context);<br>if (klass-&gt;set_client_window)<br>klass-&gt;set_client_window (context, window);<br>g_object_set_data(G_OBJECT(context),"window",window);<br>if(!GDK_IS_WINDOW (window))<br>return;<br>int width = gdk_window_get_width(window);<br>int height = gdk_window_get_height(window);<br>if(width != 0 &amp;&amp; height !=0)<br>gtk_im_context_focus_in(context);<br>}</p><h2 id="b-将上一步的代码编译成共享库libsublime-imfix-so，命令"><a href="#b-将上一步的代码编译成共享库libsublime-imfix-so，命令" class="headerlink" title="b.将上一步的代码编译成共享库libsublime-imfix.so，命令"></a>b. Compile the code from the previous step into the shared library libsublime-imfix.so with the command</h2><p>gcc -shared -o libsublime-imfix.so sublime_imfix.c  <code>$(pkg-config --libs --cflags gtk+-2.0)</code> -fPIC<br>    Note: if you see "gtk/gtkimcontext.h: No such file or directory", the required dependencies are missing. Install them with:</p><p>sudo apt-get install build-essential libgtk2.0-dev</p><h2 id="c-将libsublime-imfix-so拷贝到sublime-text所在文件夹"><a href="#c-将libsublime-imfix-so拷贝到sublime-text所在文件夹" class="headerlink" title="c.将libsublime-imfix.so拷贝到sublime_text所在文件夹"></a>c. Move libsublime-imfix.so into the folder that contains sublime_text</h2><p>sudo mv libsublime-imfix.so /opt/sublime_text/</p><h2 id="d-修改文件-usr-bin-subl的内容"><a href="#d-修改文件-usr-bin-subl的内容" class="headerlink" title="d.修改文件/usr/bin/subl的内容"></a>d. Edit the contents of /usr/bin/subl</h2><p>sudo gedit /usr/bin/subl<br>Change</p><p>#!/bin/sh</p><p>exec /opt/sublime_text/sublime_text 
"$@"</p><p>to</p><p>#!/bin/sh</p><p>LD_PRELOAD=/opt/sublime_text/libsublime-imfix.so exec /opt/sublime_text/sublime_text "$@"</p><p>Original article: <a href="http://www.jianshu.com/p/1f3a3e4f4e92" target="_blank" rel="external">http://www.jianshu.com/p/1f3a3e4f4e92</a></p>]]></content>

    <summary type="html">


        &lt;h1 id=&quot;ubuntu下安装的sublime-text中文不能输入问题：&quot;&gt;&lt;a href=&quot;#ubuntu下安装的sublime-text中文不能输入问题：&quot; class=&quot;headerlink&quot; title=&quot;ubuntu下安装的sublime text中文不能输入问题


    </summary>

      <category term="Note" scheme="https://github.com/DuncanZhou//categories/Note/"/>


  </entry>

  <entry>
    <title>Twitter User Data Profiling</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/TwitterUsersDataProfiling/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/TwitterUsersDataProfiling/</id>
    <published>2019-08-17T08:14:59.458Z</published>
    <updated>2018-01-19T03:43:58.286Z</updated>

    <content type="html"><![CDATA[<h3 id="1-概念"><a href="#1-概念" class="headerlink" title="1.概念"></a>1. Concept</h3><blockquote><p>Data profiling: One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata.[1]<br>Data profiling is the set of activities and processes to determine the meta-data about a given dataset.[1]</p><p>Broadly speaking, a data profile is a subset or a result that describes the original sample data. A simple approach is to compute averages, sums, or the most frequent values. A more challenging task is to find relationships across multiple columns, such as functional or order dependencies.</p></blockquote><p>Traditional data profiling covers data exploration, data cleansing, and data integration. Later, data management and big data analytics also emerged as use cases.</p><p>In particular, because big data is large and diverse, querying, storing, and aggregating it with traditional techniques is expensive, so data profiling becomes very important here.</p><blockquote><p>Data profiling is an important preparatory task to determine which data to mine, how to import data into various tools, and how to interpret the results.[1]</p></blockquote><p><img src="http://i1.buimg.com/1949/2fc64d8931d34670.png" alt="Data Profiling"></p><h4 id="Data-Profiling和Data-Mining的比较"><a href="#Data-Profiling和Data-Mining的比较" class="headerlink" title="Data Profiling和Data Mining的比较"></a>Data Profiling vs. Data Mining</h4><p>1. Distinction by the object of analysis: <strong>Instance</strong> vs. <strong>schema</strong> or <strong>column</strong> vs. <strong>rows</strong><br>2. Distinction by the goal of the task: <strong>Description of existing data</strong> vs. 
<strong>new insights beyond existing data</strong>.</p><h3 id="2-动机或用例"><a href="#2-动机或用例" class="headerlink" title="2.动机或用例"></a>2. Motivation and use cases</h3><blockquote><p>Purposes of data profiling:</p><ul><li>Data Exploration</li><li>Database management</li><li>Database reverse engineering</li><li>Data integration</li><li>Big data analytics</li></ul></blockquote><h3 id="3-方法"><a href="#3-方法" class="headerlink" title="3.方法"></a>3. Methods</h3><p>1. Rely on a relational database and run SQL queries (this cannot find the dependencies among all attribute columns)<br>  Single-column and multi-column analysis<br>2. Search for the best solution with heuristic algorithms<br>  A heuristic algorithm searches for the best solution it can find <strong>within an acceptable computational cost</strong>, but <strong>it does not guarantee that the result is feasible or optimal</strong>, and in most cases it cannot even state how close the result is to the optimum.<br>3. Clustering algorithm -&gt; filtering<br>4. Dynamic programming over each dimension to find the subset</p><h3 id="4-twitter数据集人物特征选取"><a href="#4-twitter数据集人物特征选取" class="headerlink" title="4.twitter数据集人物特征选取"></a>4. Selecting user features from the Twitter dataset</h3><ul><li>Location features (reflect the spatio-temporal distribution of users; useful for POI recommendation)</li><li>Activity features (usable for cluster analysis)</li><li>Influence features (usable for cluster analysis)</li><li>Tweet features (reflect users' interests; useful for recommender systems)</li><li>Temporal features</li></ul><h4 id="特征处理"><a href="#特征处理" class="headerlink" title="特征处理"></a>Feature processing</h4><p>1. Extraction<br>2. Normalization (most typically rescaling, i.e. mapping all data onto the interval [0,1])</p><blockquote><p><strong>Common data normalization methods:</strong></p><ul><li>min-max, a linear transformation of the raw data</li><li>log transformation</li><li>atan transformation</li><li>z-score standardization</li><li>Decimal scaling</li><li>Logistic/Softmax transformation</li><li>Softmax function</li><li>Fuzzy quantization</li></ul></blockquote><p>Reason for selecting a feature: the feature represents the user's …, and is useful for … .</p><h3 id="5-twitter-data-profiling思路"><a href="#5-twitter-data-profiling思路" class="headerlink" title="5.twitter data profiling思路"></a><font color="red">5. Approach to Twitter data profiling</font></h3><p><strong>Motivation</strong><br>Representativeness of clustering results:</p><blockquote><p>Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers.</p></blockquote><p><strong>Evaluation criteria:</strong><br><img src="https://ooo.0o0.ooo/2017/06/22/594bcb6c616ec.png" 
alt=""></p><p><strong>Feature extraction</strong><br>Direct: location (time zone), Followers/Following, category<br>Indirect: Activity, Influence, *InterestTags</p><p><strong>Distance definition</strong><br>Ordered attributes: Minkowski distance (Euclidean distance when p=2)<br>Unordered attributes: VDM</p><p><strong>Methods</strong></p><ul><li>1. Clustering (LVQ)</li><li>2. Define a graph structure and search it</li></ul><p><strong>Challenges</strong></p><ul><li>a. Defining the criterion that measures how well the profile subset represents the original set</li><li>b. Determining k, the size of the ProfileSet</li><li>c. Finding the ProfileSet (Representation of Clustering[2])</li><li>d. Optimizing the search algorithm</li></ul><h3 id="5-参考文献"><a href="#5-参考文献" class="headerlink" title="5.参考文献"></a>6. References</h3><p>1. Data Profiling: A Tutorial. SIGMOD 2017<br>2. Data Clustering: A Review. IEEE Computer Society</p>]]></content>

    <summary type="html">


        &lt;h3 id=&quot;1-概念&quot;&gt;&lt;a href=&quot;#1-概念&quot; class=&quot;headerlink&quot; title=&quot;1.概念&quot;&gt;&lt;/a&gt;1.概念&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;数据摘要:One of the crucial requirements before comsu


    </summary>

      <category term="Paper" scheme="https://github.com/DuncanZhou//categories/Paper/"/>


  </entry>

  <entry>
    <title>Syncing to Tencent Cloud</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/tencentcloud/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/tencentcloud/</id>
    <published>2019-08-17T08:14:59.453Z</published>
    <updated>2018-08-22T09:08:30.844Z</updated>

    <content type="html"><![CDATA[<p>My blog will soon be mirrored to the Tencent Cloud+ Community. Everyone is invited to join: <a href="https://cloud.tencent.com/developer/support-plan?invite_code=cibtnefnj6my" target="_blank" rel="external">https://cloud.tencent.com/developer/support-plan?invite_code=cibtnefnj6my</a></p>]]></content>

    <summary type="html">


        &lt;p&gt;我的博客即将搬运同步至腾讯云+社区，邀请大家一同入驻：&lt;a href=&quot;https://cloud.tencent.com/developer/support-plan?invite_code=cibtnefnj6my&quot; target=&quot;_blank&quot; rel=&quot;exter


    </summary>

      <category term="Life" scheme="https://github.com/DuncanZhou//categories/Life/"/>


  </entry>

  <entry>
    <title>Learning Support Vector Machines (SVM)</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/SVM-Support%20vector%20Machine/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/SVM-Support vector Machine/</id>
    <published>2019-08-17T08:14:59.447Z</published>
    <updated>2018-01-19T03:43:50.675Z</updated>

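    <content type="html"><![CDATA[<h1 id="支持向量机-SVM-Support-Vector-Machine-："><a href="#支持向量机-SVM-Support-Vector-Machine-：" class="headerlink" title="支持向量机(SVM-Support Vector Machine)："></a>Support Vector Machine (SVM):</h1><h2 id="定义"><a href="#定义" class="headerlink" title="定义"></a>Definition</h2><p>1. SVM is a classification algorithm, at its core a <em>binary classification model</em>, and it can be used for both classification and regression. It improves the learner's generalization by seeking minimal structural risk, minimizing the empirical risk and the confidence interval together, so that good statistical regularities can be obtained even when the sample size is small.</p><blockquote><p><em>i.e.</em><font color="red">Given a sample set containing positive and negative examples, the goal of SVM is to find a hyperplane that splits the samples, separating the positive examples from the negative ones. The split is not arbitrary: the principle is to maximize the margin between the positive and negative examples, which gives the best robustness.</font></p></blockquote><p>2. <em>Basic formulas</em>: in the sample space, the separating hyperplane is given by the linear equation: <img src="https://github.com/DuncanZhou/images/raw/master/1.PNG" alt="线性方程"><br>The distance from any point x in the sample space to the hyperplane (w, b) is <img src="https://github.com/DuncanZhou/images/raw/master/2.PNG" alt="距离"><br>Assuming the samples are classified correctly, <img src="https://github.com/DuncanZhou/images/raw/master/3.PNG" alt=""> and the "margin" is <img src="https://github.com/DuncanZhou/images/raw/master/4.PNG" alt=""><br>So the goal now is to find the <em>"maximum margin"</em><img src="https://github.com/DuncanZhou/images/raw/master/5.PNG" alt=""><br>This is the basic form of SVM.</p><p>3. Transforming the "maximum margin" problem while solving it (converting it to the dual problem)</p><blockquote><p>Maximization -&gt; minimization -&gt; convex quadratic programming | <font color="red">Lagrange multipliers</font></p></blockquote><h2 id="线性划分-gt-非线性划分"><a href="#线性划分-gt-非线性划分" class="headerlink" title="线性划分 -&gt; 非线性划分"></a>Linear separation -&gt; nonlinear separation</h2><h3 id="1-问题"><a href="#1-问题" class="headerlink" title="1.问题"></a>1. Problem</h3><p><img src="https://github.com/DuncanZhou/images/raw/master/6.PNG" alt=""><br>The discussion so far assumed that the samples are linearly separable. In real tasks, however, the original sample space may not contain a hyperplane that correctly separates the two classes (for example the XOR problem). <font color="red">For such problems, the original sample space can be mapped into a higher-dimensional feature space.</font> (Fortunately, if the original space is finite-dimensional, there always exists a higher-dimensional feature space in which the samples are separable.)</p><h3 id="2-解决方案"><a href="#2-解决方案" class="headerlink" title="2.解决方案"></a>2. Solution</h3><p>The "maximum margin" solution after the mapping<br><img src="https://github.com/DuncanZhou/images/raw/master/7.PNG" alt=""></p><h3 id="3-涉及到的问题"><a href="#3-涉及到的问题" class="headerlink" title="3.涉及到的问题"></a>3. The issue involved</h3><p>Solving this requires the inner product of samples x_i and x_j after they are mapped into the feature space. The feature space may have very high, even infinite, dimension, so computing this inner product directly is usually hard. To avoid the problem, one introduces a function, the kernel function <img 
src="https://github.com/DuncanZhou/images/raw/master/8.PNG" alt=""><br>Solving then gives <img src="https://github.com/DuncanZhou/images/raw/master/9.PNG" alt=""></p><h3 id="4-常用的核函数"><a href="#4-常用的核函数" class="headerlink" title="4.常用的核函数"></a>4. Common kernel functions</h3><p><img src="https://github.com/DuncanZhou/images/raw/master/10.PNG" alt=""></p><h2 id="软间隔与正则化"><a href="#软间隔与正则化" class="headerlink" title="软间隔与正则化"></a>Soft margin and regularization</h2><blockquote><p><em>Soft margin</em>: in real tasks it is often hard to find a kernel function that makes the training set linearly separable in the feature space. Even if such a kernel happens to be found, it is hard to tell whether the apparent linear separability is merely the result of overfitting.</p></blockquote><p><em>One way to address this is to allow the SVM to make mistakes on some samples</em>, as in <img src="https://github.com/DuncanZhou/images/raw/master/11.PNG" alt=""></p><p>That is, while maximizing the margin, also keep the number of samples that violate the constraint as small as possible. <img src="https://github.com/DuncanZhou/images/raw/master/15.PNG" alt=""></p><p>Three commonly used surrogate loss functions: <img src="https://github.com/DuncanZhou/images/raw/master/12.PNG" alt=""></p><p>What they have in common: <img src="https://github.com/DuncanZhou/images/raw/master/13.PNG" alt=""></p><h2 id="支持向量回归（Support-Vector-Regression）"><a href="#支持向量回归（Support-Vector-Regression）" class="headerlink" title="支持向量回归（Support Vector Regression）"></a>Support Vector Regression (SVR)</h2><p>Given samples D={(x1,y1),(x2,y2),…}, we want to learn a regression model whose output f(x) is as close to y as possible, where w and b are the parameters to be determined.<br>A traditional regression model computes the loss directly from the difference between the model output f(x) and the true output y, and the loss is zero only when f(x) equals y exactly. In contrast, SVR tolerates a deviation of at most e between f(x) and y: any deviation no larger than e counts as zero error. The SVR problem is formalized as <img src="https://github.com/DuncanZhou/images/raw/master/14.PNG" alt=""></p>]]></content>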

    <summary type="html">


        &lt;h1 id=&quot;支持向量机-SVM-Support-Vector-Machine-：&quot;&gt;&lt;a href=&quot;#支持向量机-SVM-Support-Vector-Machine-：&quot; class=&quot;headerlink&quot; title=&quot;支持向量机(SVM-Support Vector


    </summary>

      <category term="Seminar" scheme="https://github.com/DuncanZhou//categories/Seminar/"/>


  </entry>

  <entry>
    <title>Learning Support Vector Machines (Supplement)</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/SVM(Extension)/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/SVM(Extension)/</id>
    <published>2019-08-17T08:14:59.441Z</published>
    <updated>2018-01-19T03:43:42.063Z</updated>

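    <content type="html"><![CDATA[<h1 id="SMO算法-Sequential-Minimal-Optimization"><a href="#SMO算法-Sequential-Minimal-Optimization" class="headerlink" title="SMO算法(Sequential Minimal Optimization)"></a>The SMO Algorithm (Sequential Minimal Optimization)</h1><h2 id="1-定义"><a href="#1-定义" class="headerlink" title="1.定义"></a>1. Definition</h2><blockquote><p>SMO is used to train SVMs. It breaks a large optimization problem into many small ones. These small problems are usually easy to solve, and solving them sequentially gives exactly the same result as solving them together as a whole.</p></blockquote><h2 id="2-目标及原理"><a href="#2-目标及原理" class="headerlink" title="2.目标及原理"></a>2. Goal and principle</h2><blockquote><p>The goal of SMO is to find a set of alphas and b; once the alphas are known, the weight vector w can be computed from them.</p><p>Each iteration picks two alphas to optimize. Once a suitable pair is found, one is increased and the other decreased. "Suitable" means the two alphas must satisfy certain conditions: first, both must lie outside the margin boundary; second, they must not yet have been clipped to the bounds or lie on the boundary.</p></blockquote><h2 id="3-调参"><a href="#3-调参" class="headerlink" title="3.调参"></a>3. Tuning</h2><blockquote><p>The SVM has two parameters, <font color="red">C</font> and <font color="red">K1</font>. C is the penalty coefficient, i.e. the tolerance for errors. The larger C is, the less error is tolerated and the easier it is to overfit; the smaller C is, the easier it is to underfit.</p><p>k1 is the parameter that comes with the RBF function when it is used as the kernel. It implicitly determines how the data are distributed after being mapped into the new feature space: the larger k1, the fewer support vectors; the smaller k1, the more support vectors. The number of support vectors affects the speed of training and prediction.</p></blockquote>]]></content>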

    <summary type="html">


        &lt;h1 id=&quot;SMO算法-Sequential-Minimal-Optimization&quot;&gt;&lt;a href=&quot;#SMO算法-Sequential-Minimal-Optimization&quot; class=&quot;headerlink&quot; title=&quot;SMO算法(Sequential M


    </summary>

      <category term="Learning" scheme="https://github.com/DuncanZhou//categories/Learning/"/>


  </entry>

  <entry>
    <title>A Summary of Hyperparameter Search Methods</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/SuperParas/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/SuperParas/</id>
    <published>2019-08-17T08:14:59.436Z</published>
    <updated>2018-08-20T03:03:04.691Z</updated>

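    <content type="html"><![CDATA[<h3 id="1-网格搜索"><a href="#1-网格搜索" class="headerlink" title="1.网格搜索"></a>1. Grid search</h3><p>Grid search determines the best value by evaluating every point in the search range, returning the maximum of the objective function or the minimum of the loss function. Given a large enough search range and a small enough step size, grid search can find the global maximum or minimum, up to the resolution of the grid.</p><p>In practice, when using grid search to find the best hyperparameter set, people usually start with a wide search range and a large step size to locate where the global maximum or minimum is likely to be. They then narrow the range and the step size to reach a more precise optimum.</p><h3 id="2-随机搜索"><a href="#2-随机搜索" class="headerlink" title="2.随机搜索"></a>2. Random search</h3><p>Random search is similar in spirit to grid search, except that it no longer tests every value between the upper and lower bounds and instead samples points at random from the search range. Its rationale is that if the set of random sample points is large enough, it will also find the global maximum or minimum, or an approximation of it.</p><p>Because it only samples the search range, random search is usually faster than grid search. But, like the quick (non-automated) version of grid search, its result is not guaranteed.</p><h3 id="3-基于梯度的优化"><a href="#3-基于梯度的优化" class="headerlink" title="3.基于梯度的优化"></a>3. Gradient-based optimization</h3><h3 id="4-贝叶斯优化"><a href="#4-贝叶斯优化" class="headerlink" title="4.贝叶斯优化"></a>4. Bayesian optimization</h3><p>Bayesian optimization takes a completely different approach from grid search and random search when looking for the parameters that reach the global optimum. Grid search and random search ignore the information from previous points when testing a new one, whereas Bayesian optimization makes full use of it. It works by learning the shape of the objective function and finding parameters that move the result toward the global maximum. To learn that shape, it assumes a prior distribution over the objective together with an acquisition function. Each time the objective is evaluated at a new sample point, the result is used to update the prior into a posterior. The algorithm then tests the point where, according to the posterior, the global optimum is most likely to be.</p><p>Supplement:</p><p><img src="https://raw.githubusercontent.com/DuncanZhou/images/master/alipsipng.png" alt="PSI"></p>]]></content>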

    <summary type="html">


        &lt;h3 id=&quot;1-网格搜索&quot;&gt;&lt;a href=&quot;#1-网格搜索&quot; class=&quot;headerlink&quot; title=&quot;1.网格搜索&quot;&gt;&lt;/a&gt;1.网格搜索&lt;/h3&gt;&lt;p&gt;网格搜索通过查找搜索范围内的所有的点，来确定最优值。它返回目标函数的最大值或损失函数的最小值。给出较大的搜索


    </summary>

      <category term="MachineLearning" scheme="https://github.com/DuncanZhou//categories/MachineLearning/"/>


      <category term="MachineLearning" scheme="https://github.com/DuncanZhou//tags/MachineLearning/"/>

  </entry>

  <entry>
    <title>Learning Hive SQL</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/SQL_Learning/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/SQL_Learning/</id>
    <published>2019-08-17T08:14:59.419Z</published>
    <updated>2018-08-17T07:20:10.019Z</updated>

    <content type="html"><![CDATA[<h3 id="partition-by"><a href="#partition-by" class="headerlink" title="partition by"></a>partition by</h3><blockquote><p>The partition by keyword is part of analytic functions. It differs from aggregate functions in that it can return multiple records per group, whereas an aggregate function generally returns a single record holding the statistic. partition by groups the result set; if it is not specified, the whole result set is treated as one group.</p></blockquote><p>Example: a Student table holds the student id, score, and class (Grade); rank the students by score within each class. (partition by)</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">select *,row_number() over(partition by Grade order by Score desc) as Sequence from Student</div></pre></td></tr></table></figure><h3 id="lateral-view"><a href="#lateral-view" class="headerlink" title="lateral view"></a>lateral view</h3><h3 id="explode-posexplode"><a href="#explode-posexplode" class="headerlink" title="explode / posexplode"></a>explode / posexplode</h3><blockquote><p>explode splits one row into multiple rows; posexplode does the same and also returns the position of each element, so that several exploded columns can be matched up by position.</p></blockquote><h3 id="窗口函数"><a href="#窗口函数" class="headerlink" title="窗口函数"></a>Window functions</h3><h4 id="a-first-value"><a href="#a-first-value" class="headerlink" title="a. first_value"></a>a. 
first_value</h4><p>    The first value in the partition after sorting, up to the current row</p><h4 id="b-last-value"><a href="#b-last-value" class="headerlink" title="b.last_value"></a>b.last_value</h4><p>    The last value in the partition after sorting, up to the current row</p><h4 id="c-lead-col-n-default"><a href="#c-lead-col-n-default" class="headerlink" title="c.lead(col,n,default)"></a>c.lead(col,n,default)</h4><p>    Returns the value n rows below the current row within the window. The first argument is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when the value n rows below is NULL; if not specified, it is NULL)</p><h4 id="d-lag-col-n-default"><a href="#d-lag-col-n-default" class="headerlink" title="d.lag(col,n,default)"></a>d.lag(col,n,default)</h4><p>    The opposite of lead: returns the value n rows above the current row within the window. The first argument is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when the value n rows above is NULL; if not specified, it is NULL)</p><h4 id="c-聚集函数-over-partition-by-col1-order-by-col-rows-range-between-UNBOUNDED-num-preceding-and-num-FOLLOWING-current-ROW"><a href="#c-聚集函数-over-partition-by-col1-order-by-col-rows-range-between-UNBOUNDED-num-preceding-and-num-FOLLOWING-current-ROW" class="headerlink" title="c.聚集函数 + over + (partition by col1 [order by col (rows | range) between (UNBOUNDED | [num]) preceding and (num FOLLOWING | current ROW))"></a>c. Aggregate function + over + (partition by col1 [order by col (rows | range) between (UNBOUNDED | [num]) preceding and (num FOLLOWING | current ROW)])</h4><h4 id="d-ROW-NUMBER"><a href="#d-ROW-NUMBER" class="headerlink" title="d.ROW_NUMBER()"></a>d.ROW_NUMBER()</h4><p>    Starting from 1, generates the sequence number of each record within the partition, in order</p><h4 id="e-RANK"><a href="#e-RANK" class="headerlink" title="e.RANK()"></a>e.RANK()</h4><p>    The rank of the row within the partition; ties leave gaps in the ranking</p><h4 id="f-DENSE-RANK"><a href="#f-DENSE-RANK" class="headerlink" title="f.DENSE_RANK()"></a>f.DENSE_RANK()</h4><p>    The rank of the row within the partition; ties leave no gaps in the ranking</p><h4 id="g-CUME-DIST"><a href="#g-CUME-DIST" class="headerlink" title="g.CUME_DIST()"></a>g.CUME_DIST()</h4><p>    (number of rows with a value less than or equal to the current value) / (total number of rows in the partition)</p><h4 id="h-PERCENT-RANK"><a href="#h-PERCENT-RANK" class="headerlink" title="h.PERCENT_RANK ()"></a>h.PERCENT_RANK ()</h4><p>    (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)</p><h4 id="i-NTILE-n"><a href="#i-NTILE-n" class="headerlink" 
title="i.NTILE(n)"></a>i.NTILE(n)</h4><p>    Splits the ordered partition into n buckets and returns the bucket number of the current row; if the rows do not divide evenly, the extra rows go to the first buckets by default</p><p>Note:</p><ul><li>FROM clause: parsed top to bottom and left to right, with the tables processed from the last one backward, so <strong>put the smaller tables last where possible</strong></li><li>WHERE clause: evaluated bottom-up and right to left, so <strong>put the conditions that filter out the most records at the end of the WHERE clause</strong></li><li>GROUP BY clause: filter out unneeded records before the group by and <strong>avoid filtering with having</strong></li><li>HAVING clause: use sparingly</li><li>SELECT clause: avoid *, list the column names instead</li><li>ORDER BY clause: sorts by the listed columns from left to right</li><li>join: <strong>put the largest table on the far right of the join where possible</strong></li></ul>]]></content>

    <summary type="html">


        &lt;h3 id=&quot;partition-by&quot;&gt;&lt;a href=&quot;#partition-by&quot; class=&quot;headerlink&quot; title=&quot;partition by&quot;&gt;&lt;/a&gt;partition by&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;partition by关键字是分


    </summary>

      <category term="SQL" scheme="https://github.com/DuncanZhou//categories/SQL/"/>


      <category term="SQL" scheme="https://github.com/DuncanZhou//tags/SQL/"/>

  </entry>

  <entry>
    <title>PySpark Notes</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/SparkDataFrameLearning/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/SparkDataFrameLearning/</id>
    <published>2019-08-17T08:14:59.393Z</published>
    <updated>2018-08-10T09:58:26.702Z</updated>

    <content type="html"><![CDATA[<h2 id="Spark-DataFrame学习"><a href="#Spark-DataFrame学习" class="headerlink" title="Spark DataFrame学习"></a>Learning Spark DataFrame</h2><h3 id="1-文件的读取"><a href="#1-文件的读取" class="headerlink" title="1. 文件的读取"></a>1. Reading files</h3><p>1.1 spark.read.json() / spark.read.parquet(), or spark.read.load(path,format="parquet") / spark.read.load(path,format="json")</p><p>1.2 Interacting with the database: spark.sql("")</p><h3 id="2-函数使用"><a href="#2-函数使用" class="headerlink" title="2.函数使用"></a>2. Using functions</h3><ul><li><p>2.1 df.printSchema() - show the table schema</p></li><li><p>2.2 df.select(col) - select the values of a column</p></li><li><p>2.3 df.show([int n])  - show the values [of the first n rows]</p></li><li><p>2.4 df.filter(condition) - keep the rows that satisfy the condition</p></li><li><p>2.5 df.groupby(col).count() </p><p>df.groupby(col).agg(func.min(c),func.max(c),func.sum(c)) - aggregate functions (here func stands for pyspark.sql.functions and c for the column to aggregate)</p></li><li><p>2.6 spark.createDataFrame([(),(),(),()…,()],(col1,col2,col3,…,coln))</p></li><li><p>2.7 User-defined (pandas) UDFs</p></li></ul><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="meta">@pandas_udf("col1 type,col2 type,...,coln type",PandasUDFType.GROUPED_MAP)</span></div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">f</span><span class="params">(pdf)</span>:</span></div><div class="line">    <span class="keyword">pass</span></div></pre></td></tr></table></figure><p>df.groupby(col).apply(f).show()</p>]]></content>

    <summary type="html">


        &lt;h2 id=&quot;Spark-DataFrame学习&quot;&gt;&lt;a href=&quot;#Spark-DataFrame学习&quot; class=&quot;headerlink&quot; title=&quot;Spark DataFrame学习&quot;&gt;&lt;/a&gt;Spark DataFrame学习&lt;/h2&gt;&lt;h3 id=&quot;1-文件的


    </summary>

      <category term="Learning" scheme="https://github.com/DuncanZhou//categories/Learning/"/>


      <category term="Spark" scheme="https://github.com/DuncanZhou//tags/Spark/"/>

  </entry>

  <entry>
    <title>Model Notes</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/SomeModels/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/SomeModels/</id>
    <published>2019-08-17T08:14:59.388Z</published>
    <updated>2018-08-10T09:58:14.448Z</updated>

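    <content type="html"><![CDATA[<h2 id="实战模型记录"><a href="#实战模型记录" class="headerlink" title="实战模型记录"></a>Notes on Models Used in Practice</h2><h3 id="1-GBDT（Gradient-Boosting-Decision-Tree）"><a href="#1-GBDT（Gradient-Boosting-Decision-Tree）" class="headerlink" title="1.GBDT（Gradient Boosting Decision Tree）"></a>1. GBDT (Gradient Boosting Decision Tree)</h3><ul><li>The trees in GBDT are <strong>regression trees (not classification trees)</strong>. GBDT is used for regression and, with adjustments, can also be used for classification.</li><li>Regression tree: the overall procedure is similar to a classification tree. <strong>The difference is that every node of a regression tree gets a predicted value</strong>; taking age as an example, the predicted value is the average age of all the people belonging to that node. When splitting, every threshold of every feature is tried to find the best split point, <strong>but the criterion is no longer entropy (information gain) but the minimum squared error</strong>. <strong>Splitting stops when the attribute values are unique or a preset stopping condition is met (such as an upper limit on the number of leaves)</strong></li><li>Boosting tree: a boosting tree iterates over multiple regression trees that decide together. <strong>With the squared-error loss</strong>, <strong>each regression tree learns the residual left by the combined prediction of all previous trees</strong>, i.e. it fits a residual regression tree for the current step.</li><li><strong>Gradient boosting decision tree:</strong> when the loss is the squared loss or the exponential loss, each optimization step is simple; with the squared loss, for instance, each step just fits a residual regression tree. But <strong>for general loss functions each step is often not that easy</strong> (e.g. the absolute loss and the Huber loss), so gradient boosting fits each tree to the negative gradient of the loss, as an approximation of the residual.</li></ul><h3 id="2-XGBoost（eXtreme-Gradient-Boosting）"><a href="#2-XGBoost（eXtreme-Gradient-Boosting）" class="headerlink" title="2.XGBoost（eXtreme Gradient Boosting）"></a>2. XGBoost (eXtreme Gradient Boosting)</h3><p>Compared with GBDT:</p><ul><li>1. GBDT uses CART as the base learner; xgboost <strong>also supports linear base learners</strong>.</li><li>2. GBDT uses only first-order derivative information during optimization; <strong>xgboost takes a second-order Taylor expansion of the cost function</strong> and uses both first and second derivatives.</li><li>3. xgboost <strong>adds a regularization term</strong> to the cost function to control model complexity. The term has two parts: the number of leaf nodes and the output scores of the leaf nodes.</li><li>4. Split finding: <strong>a greedy algorithm and an approximate algorithm</strong></li><li>5. Parallelism is supported <strong>at the feature level</strong>: the data are pre-sorted and stored in a block structure. When a node is split, the gain of each feature has to be computed, and <strong>the gains of the different features can be computed by multiple threads</strong>.</li></ul><h3 id="3-LightGBM"><a href="#3-LightGBM" class="headerlink" title="3.LightGBM"></a>3. LightGBM</h3><p>Optimizations</p><ul><li>1. Histogram algorithm: continuous floating-point feature values are first discretized into k integers, and at the same time a histogram with k bins is built. While traversing the data, statistics are accumulated in the histogram using the discretized value as the index. After one pass over the data the histogram holds the required statistics, and the best split point is then found by traversing the discrete values of the histogram.</li><li>2. Leaf-wise growth with a depth limit: each time, <strong>find the leaf with the largest split gain among all current leaves, split it, and repeat</strong>. Compared with level-wise growth, for the same number of splits leaf-wise growth reduces more error and gives better accuracy.</li></ul><h3 id="4-RandomForest"><a href="#4-RandomForest" class="headerlink" 
title="4.RandomForest"></a>4. RandomForest</h3><p><strong>m training sets are generated with the bootstrap</strong>, and <strong>a decision tree is built on each training set</strong>. When a node looks for a feature to split on, it does not search all features for the one that maximizes the criterion (such as information gain); instead it <strong>randomly draws a subset of the features</strong> and finds the best split among those. Prediction follows the bagging strategy: voting for classification, averaging for regression.</p>]]></content>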

    <summary type="html">


        &lt;h2 id=&quot;实战模型记录&quot;&gt;&lt;a href=&quot;#实战模型记录&quot; class=&quot;headerlink&quot; title=&quot;实战模型记录&quot;&gt;&lt;/a&gt;实战模型记录&lt;/h2&gt;&lt;h3 id=&quot;1-GBDT（Gradient-Boosting-Decision-Tree）&quot;&gt;&lt;a href=


    </summary>

      <category term="Data Mining" scheme="https://github.com/DuncanZhou//categories/Data-Mining/"/>


      <category term="MachineLearning" scheme="https://github.com/DuncanZhou//tags/MachineLearning/"/>

  </entry>

  <entry>
    <title>Tianchi: Semiconductor Quality Prediction</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/Semiconduction/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/Semiconduction/</id>
    <published>2019-08-17T08:14:59.381Z</published>
    <updated>2018-01-19T03:43:06.296Z</updated>

    <content type="html"><![CDATA[<h2 id="天池-半导体质量预测"><a href="#天池-半导体质量预测" class="headerlink" title="天池-半导体质量预测"></a>Tianchi: Semiconductor Quality Prediction</h2><p>I have recently been taking part in a Tianchi competition; the issues encountered along the way are recorded below:</p><h3 id="1-特征的选择"><a href="#1-特征的选择" class="headerlink" title="1.特征的选择?"></a>1. Feature selection?</h3><blockquote><p><strong>Feature selection methods</strong>: 1) embedded 2) filter 3) wrapper</p></blockquote><p><strong>1) Data cleaning:</strong></p><ul><li>1. Remove duplicate columns</li><li>2. Encode categorical features with sklearn (one-hot, Label Encoder, etc.)</li><li>3. After filling missing values with the mean, remove redundant columns (zero variance, duplicate columns)</li></ul><p>After cleaning, the features dropped from over 8000 dimensions to over 3400.</p><ul><li>4. Some features are entirely NaN; drop those columns as well</li></ul><blockquote><p>Summary: after data cleaning the total feature dimension is 3342; random forest MSE: 0.03612</p></blockquote><p><strong>2) Feature selection:</strong></p><ul><li>Embedded: analyze feature importance through the model; the most common way is feature selection via <strong>regularization</strong>.</li><li>Filter: evaluate the correlation between each single feature and the target, sort, and keep the top correlated features. (Drawback: it ignores interactions between features and may wrongly discard features that are useful in combination.)</li><li>Wrapper: treat feature selection as a search over feature subsets; screen candidate subsets and evaluate them with the model.</li></ul><blockquote><p>Approach:</p><ul><li>Filter: sorting by the feature_importances_ of a single random forest and keeping 44 features, namely<br>[‘310X207’, ‘210X158’, ‘311X7’, ‘330X1132’, ‘220X13’, ‘310X149’, ‘750X883’, ‘210X207’, ‘312X144’, ‘210X192’, ‘312X61’, ‘312X66’, ‘440AX77’, ‘220X197’, ‘310X153’, ‘330X1190’, ‘344X252’, ‘310X33’, ‘210X174’, ‘440AX95’, ‘312X777’, ‘330X102’, ‘440AX187’, ‘340X161’, ‘312X55’, ‘330X590’, ‘210X89’, ‘330X1129’, ‘210X164’, ‘210X188’, ‘330X1146’, ‘310X119’, ‘360X1049’, ‘440AX182’, ‘750X640’, ‘440AX65’, ‘312X789’, ‘311X154’, ‘310X43’, ‘312X782’, ‘312X555’, ‘420X4’, ‘312X785’, ‘210X229’]</li></ul></blockquote><ul><li>Wrapper: selecting 200 features, using the random forest's performance as the evaluation metric<br>set([‘440AX98’, ‘310X207’, ‘210X158’, ‘261X641’, ‘261X269’, ‘312X61’, ‘312X66’, ‘220X197’, ‘520X317’, ‘400X151’, ‘400X150’, ‘400X153’, ‘330X594’, ‘330X590’, ‘210X164’, ‘210X8’, ‘420X4’, ‘330X1223’, ‘310X117’, ‘261X524’, ‘310X119’, ‘261X607’, ‘750X640’, ‘210X126’, ‘311X154’, ‘312X555’, ‘261X689’, ‘520X245’, ‘261X477’, ‘750X883’, ‘330X589’, ‘261X590’, ‘300X8’, ‘261X591’, ‘261X468’, ‘440AX77’, ‘300X3’, ‘220X179’, ‘330X1190’, ‘220X177’, ‘220X176’, ‘220X175’, ‘220X174’, ‘220X173’, ‘220X172’, ‘220X171’, 
‘220X170’, ‘261X464’, ‘TOOL (#2)’, ‘330X351’, ‘330X102’, ‘330X355’, ‘330X354’, ‘330X1049’, ‘330X1042’, ‘312X57’, ‘312X55’, ‘210X89’, ‘330X1040’, ‘330X1043’, ‘330X1129’, ‘261X460’, ‘330X1044’, ‘261X462’, ‘330X1046’, ‘210X188’, ‘330X353’, ‘360X1049’, ‘440AX66’, ‘440AX67’, ‘440AX64’, ‘440AX65’, ‘330X135’, ‘330X134’, ‘312X144’, ‘330X133’, ‘330X132’, ‘330X139’, ‘312X782’, ‘210X174’, ‘312X785’, ‘312X789’, ‘261X608’, ‘261X609’, ‘520X312’, ‘520X313’, ‘520X314’, ‘330X1228’, ‘420X33’, ‘330X1132’, ‘261X600’, ‘261X601’, ‘330X641’, ‘330X1226’, ‘330X1221’, ‘330X1220’, ‘210X206’, ‘210X207’, ‘261X598’, ‘261X599’, ‘340X105’, ‘340X107’, ‘210X190’, ‘210X191’, ‘210X192’, ‘261X593’, ‘261X594’, ‘261X596’, ‘261X597’, ‘220X557’, ‘220X551’, ‘310X37’, ‘310X36’, ‘310X34’, ‘310X33’, ‘261X268’, ‘310X31’, ‘310X30’, ‘440AX95’, ‘210X3’, ‘210X4’, ‘210X5’, ‘210X6’, ‘210X7’, ‘312X777’, ‘210X9’, ‘261X260’, ‘261X261’, ‘261X266’, ‘261X267’, ‘261X264’, ‘261X265’, ‘520X246’, ‘520X247’, ‘261X736’, ‘261X737’, ‘520X242’, ‘520X243’, ‘310X153’, ‘344X252’, ‘440AX90’, ‘261X262’, ‘330X1146’, ‘440AX182’, ‘440AX187’, ‘261X687’, ‘261X688’, ‘310X43’, ‘330X157’, ‘330X404’, ‘261X512’, ‘261X513’, ‘330X401’, ‘520X55’, ‘330X403’, ‘261X517’, ‘261X518’, ‘261X519’, ‘311X7’, ‘330X409’, ‘330X159’, ‘330X158’, ‘330X461’, ‘520X333’, ‘220X13’, ‘310X149’, ‘520X244’, ‘261X338’, ‘330X1249’, ‘330X1248’, ‘300X7’, ‘261X330’, ‘261X331’, ‘340X161’, ‘261X333’, ‘330X1247’, ‘344X121’, ‘261X336’, ‘330X1244’, ‘520X240’, ‘330X1230’, ‘520X241’, ‘330X1241’, ‘261X335’, ‘220X535’, ‘210X129’, ‘210X128’, ‘220X531’, ‘220X530’, ‘210X125’, ‘210X124’, ‘210X127’, ‘261X526’, ‘210X121’, ‘210X120’, ‘210X123’, ‘261X230’, ‘261X592’, ‘440AX123’, ‘261X742’, ‘440AX99’, ‘311X83’, ‘220X178’, ‘330X535’, ‘210X229’])</li></ul><p>1) After feature extraction, the xgboost MSE is 0.0325341683406<br>2) The average MSE of a single random forest under 5-fold cross-validation is 0.0288353227614<br>(max_depth=None,n_estimators=160,min_samples_leaf=2,max_features=n_features)</p><p>Intersecting the features selected by the model's feature_importances_ with those selected by RFE gives:<br>[‘210X158’, 
‘330X1228’, ‘330X1132’, ‘220X13’, ‘310X149’, ‘750X883’, ‘330X589’, ‘210X207’, ‘440AX77’, ‘312X66’, ‘210X192’, ‘330X1190’, ‘310X33’, ‘312X555’, ‘310X31’, ‘310X30’, ‘440AX95’, ‘210X6’, ‘210X8’, ‘330X102’, ‘340X161’, ‘312X57’, ‘310X153’, ‘330X590’, ‘210X89’, ‘330X1129’, ‘210X164’, ‘312X777’, ‘210X188’, ‘330X1146’, ‘310X119’, ‘750X640’, ‘311X7’, ‘312X144’, ‘310X43’, ‘312X782’, ‘210X174’, ‘420X4’, ‘210X229’, ‘312X785’, ‘312X789’]</p><h3 id="2-缺失值的处理"><a href="#2-缺失值的处理" class="headerlink" title="2.缺失值的处理?"></a>2. Handling missing values?</h3><ul><li>Fill with an arbitrary value</li><li>Fill with the mean</li></ul><h3 id="3-维数降维"><a href="#3-维数降维" class="headerlink" title="3.维数降维?"></a>3. Dimensionality reduction?</h3><h3 id="4-模型的选择"><a href="#4-模型的选择" class="headerlink" title="4.模型的选择?"></a>4. Model selection?</h3><ol><li>Random Forest</li><li>GBDT(Gradient Boosting Decision Tree)<blockquote><p>A note on how GBDT evolved: Regression Decision Tree -&gt; Boosting Decision Tree -&gt; Gradient Boosting Decision Tree. GBDT optimizes learning with an additive model and forward stagewise fitting. It is an iterative, additive decision-tree algorithm: it builds a set of weak learners (trees) and sums the outputs of the trees to form the final prediction. Drawbacks: 1) high computational cost 2) not well suited to high-dimensional sparse features</p></blockquote></li></ol><p>3.Xgboost</p><blockquote><p>xgboost is a very strong implementation of boosting trees:</p><ul><li>Explicitly adds the tree model complexity to the objective as a regularization term</li><li>The derivation uses second derivatives, via a second-order Taylor expansion</li><li>Implements an approximate split-finding algorithm</li><li>Exploits feature sparsity</li><li>Parallel computation</li></ul></blockquote><p>xgboost trains far faster than traditional GBDT, on the order of 10x.</p><blockquote><p>Summary: switching to the xgboost model with the parameters below, the MSE is 0.0320532717482<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">params=&#123;&apos;booster&apos;:&apos;gbtree&apos;,</div><div class="line">    &apos;objective&apos;: &apos;reg:linear&apos;,</div><div class="line">    &apos;eval_metric&apos;: &apos;rmse&apos;,</div><div 
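class="line">    &apos;max_depth&apos;:4,</div><div class="line">    &apos;lambda&apos;:6,</div><div class="line">    &apos;subsample&apos;:0.75,</div><div class="line">    &apos;colsample_bytree&apos;:1,</div><div class="line">    &apos;min_child_weight&apos;:1,</div><div class="line">    &apos;eta&apos;: 0.04,</div><div class="line">    &apos;seed&apos;:0,</div><div class="line">    &apos;nthread&apos;:8,</div><div class="line">     &apos;silent&apos;:0&#125;</div></pre></td></tr></table></figure></p></blockquote><h3 id="5-实践过程"><a href="#5-实践过程" class="headerlink" title="5. What I did in practice"></a>5. What I did in practice</h3><p>1. Feature selection: drop columns that are entirely NaN, drop columns with more than 200 NaN values, drop object-typed columns, drop duplicate columns, and keep only columns whose Pearson correlation coefficient is &gt;0.2, which leaves a little over 5,600 features.<br>This step is crude. Possible improvements: </p><ul><li>1) Bring the object-typed columns back in</li><li>2) Reduce the feature dimensionality further; PCA is worth trying</li><li>3) Add features derived from the time columns</li></ul><p>2. Model selection: a single linear regression model scores an MSE of about 0.0388 offline but 0.0446 online. The random forest regressor I used earlier scores 0.0297 offline but 0.0493 online. Given that the offline set has only 500 samples, this suggests that the offline and online data do not follow the same distribution, or at least differ considerably, and that offline training is probably overfitting. On 2017-12-24 I submitted a weighted-average blend of three regression models.</p>]]></content>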

    <summary type="html">


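        &lt;h2 id=&quot;天池-半导体质量预测&quot;&gt;&lt;a href=&quot;#天池-半导体质量预测&quot; class=&quot;headerlink&quot; title=&quot;Tianchi: Semiconductor Quality Prediction&quot;&gt;&lt;/a&gt;Tianchi: Semiconductor Quality Prediction&lt;/h2&gt;&lt;p&gt;I have recently been taking part in a Tianchi competition; the problems I ran into along the way are recorded below:&lt;/p&gt;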
&lt;h3 id


    </summary>

      <category term="Competition" scheme="https://github.com/DuncanZhou//categories/Competition/"/>


      <category term="Competition" scheme="https://github.com/DuncanZhou//tags/Competition/"/>

  </entry>

  <entry>
    <title>Extracting Representative Users from Social Networks</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/Representatives/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/Representatives/</id>
    <published>2019-08-17T08:14:59.329Z</published>
    <updated>2018-01-19T03:42:40.139Z</updated>

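    <content type="html"><![CDATA[<h3 id="1-为什么要做这个问题"><a href="#1-为什么要做这个问题" class="headerlink" title="1. Why work on this problem"></a>1. Why work on this problem</h3><h4 id="1-1-从社会应用角度"><a href="#1-1-从社会应用角度" class="headerlink" title="1.1 From the perspective of real-world applications"></a>1.1 From the perspective of real-world applications</h4><ul><li>In HCI (human-computer interaction), surveys and user-feedback collection mainly target representative users.</li><li>The habits and interests of representative users reflect the interests and concerns of the user base as a whole, which helps with ad targeting and item recommendation.</li><li>With the number of social network users growing steadily, only a representative subset extracted from that large population is human-readable; such a subset aids data analysis and acts as a summary of the data.</li></ul><h4 id="1-2-从科研方法的角度"><a href="#1-2-从科研方法的角度" class="headerlink" title="1.2 From the perspective of research methodology"></a>1.2 From the perspective of research methodology</h4><ul><li>Selecting, from a large set of models or data points, a subset that preserves the characteristics of the original dataset is an important problem in machine learning, computer vision, data analysis, and recommender systems.</li><li>In machine learning, prototype subsets are selected to support classification algorithms.</li></ul><h3 id="2-怎样定义代表性"><a href="#2-怎样定义代表性" class="headerlink" title="2. How to define representativeness"></a>2. How to define representativeness</h3><blockquote><p>Note: Unlike influence maximization in social networks, the goal of finding representative users is to extract "average" users who can statistically represent the characteristics of the whole original population.</p></blockquote><h4 id="2-1-代表性用户具备的条件"><a href="#2-1-代表性用户具备的条件" class="headerlink" title="2.1 Conditions a representative user should satisfy:"></a>2.1 Conditions a representative user should satisfy:</h4><p><font color="blue">Version 1.</font></p><ul><li>1. In terms of <font color="green">attribute features</font>, they represent the attribute features of the users in the original dataset well (habits, personality traits, domain, and so on); that is, relative to the original users they incur <strong><font color="red">low feature loss</font></strong></li><li>2. In terms of <font color="green">distribution</font>, the representative subset should fit the sample distribution of the original dataset as closely as possible; that is, it incurs <strong><font color="red">low distribution loss</font></strong> (for example, the subset should reproduce the distribution of people across domains in the original dataset)</li><li>3. In terms of <font color="green">diversity</font>, the members of the representative subset should serve as <font color="red"><strong>typical</strong></font> figures of their domains, so people from different domains within the subset need to stay distinct from one another; that is, the subset needs <strong><font color="red">high internal diversity, or low internal similarity</font></strong></li></ul><p><font color="blue">Version 2.</font></p><ul><li>1. In terms of <font color="green">features</font>, they represent the attribute features of the users in the original dataset well (habits, personality traits, domain, and so on); that is, relative to the original users they incur <strong><font color="red">low feature loss</font></strong></li><li>2. In terms of <font color="green">distribution</font>, subject to condition (1) the representative subset should be as dispersed, or sparse, as possible, so that it can recover the distribution of the original dataset as far as possible; that is, P <strong><font color="red">is sparse</font></strong>;<br>- Note: requiring only <strong>minimal feature loss</strong> could cause the representative subset to cluster inside large groups of mutually similar users, so that the distribution of the original dataset is lost.</li></ul><p>For now I lean toward Version 1.</p><h4 id="2-2-问题定义"><a href="#2-2-问题定义" 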
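class="headerlink" title="2.2 Problem definition:"></a>2.2 Problem definition:</h4><p>Find a representative subset P of the set of people in the original dataset such that</p><ul><li>a) P satisfies the definition of representativeness above</li><li>b) P is the smallest such representative set</li></ul><h4 id="2-3-Novel之处或者contibution"><a href="#2-3-Novel之处或者contibution" class="headerlink" title="2.3 Novelty or contribution:"></a>2.3 Novelty or contribution:</h4><ul><li>1. The selection of representatives considers both aspects together, whereas most earlier papers consider only one</li><li>2. The size of the representative set does not need to be fixed in advance.</li></ul><p>Each user is represented as a vector built from their attributes, and representativeness between people is defined by the distance between these vectors.<br>Taking the Twitter social topology as an example, when user A follows user B there is a directed edge from A to B.</p><h3 id="3-如何具体评价子集的代表性"><a href="#3-如何具体评价子集的代表性" class="headerlink" title="3. How to evaluate a subset's representativeness concretely"></a>3. How to evaluate a subset's representativeness concretely</h3><h3 id="4-方法"><a href="#4-方法" class="headerlink" title="4. Method"></a>4. Method</h3>]]></content>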

    <summary type="html">


        &lt;h3 id=&quot;1-为什么要做这个问题&quot;&gt;&lt;a href=&quot;#1-为什么要做这个问题&quot; class=&quot;headerlink&quot; title=&quot;1. Why work on this problem&quot;&gt;&lt;/a&gt;1. Why work on this problem&lt;/h3&gt;&lt;h4 id=&quot;1-1-从社会应用角度&quot;&gt;&lt;a href=&quot;#1-1-从社


    </summary>

      <category term="Paper" scheme="https://github.com/DuncanZhou//categories/Paper/"/>


  </entry>

  <entry>
    <title>Recommendation Algorithms</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/RecommendationNotes/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/RecommendationNotes/</id>
    <published>2019-08-17T08:14:59.318Z</published>
    <updated>2018-08-20T02:55:15.439Z</updated>

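    <content type="html"><![CDATA[<h3 id="算法分类"><a href="#算法分类" class="headerlink" title="Algorithm categories"></a>Algorithm categories</h3><h3 id="1-基于内容-用户的推荐"><a href="#1-基于内容-用户的推荐" class="headerlink" title="1. Content-based / user-based recommendation"></a>1. Content-based / user-based recommendation</h3><p>These rely mainly on computing similarity and then recommending.</p><ul><li>Recommend based on <strong>user information</strong></li><li>Recommend based on <strong>content or item information</strong></li></ul><h3 id="2-协同过滤"><a href="#2-协同过滤" class="headerlink" title="2. Collaborative filtering"></a>2. Collaborative filtering</h3><p>User behavior is used to compute the relatedness between users or between items.</p><ul><li><p><strong>User-based collaborative recommendation</strong>: centered on people</p><table><tbody><tr><td>Xiao Zhang</td><td>product management, Google, growth</td></tr><tr><td>Xiao Ming</td><td>product management, Google, Bitcoin</td></tr><tr><td>Xiao Wu</td><td>Bitcoin, blockchain, Ether</td></tr></tbody></table><p><strong>This is a list of the topics each user follows. Xiao Zhang and Xiao Ming clearly follow more similar topics, so Bitcoin can be recommended to Xiao Zhang.</strong></p></li><li><p><strong>Item-based collaborative recommendation</strong></p><p>Centered on items: build a similarity matrix between items.</p><table><tbody><tr><td>product management</td><td>Xiao Zhang, Xiao Ming</td></tr><tr><td>Google</td><td>Xiao Zhang, Xiao Ming</td></tr><tr><td>Bitcoin</td><td>Xiao Ming, Xiao Wu</td></tr></tbody></table><p>Xiao Zhang and Xiao Ming both viewed product management and Google, which suggests the two topics are similar, so later <strong>users who view Google-related content can be recommended product-management</strong> content.</p></li></ul><h3 id="3-基于知识的推荐"><a href="#3-基于知识的推荐" class="headerlink" title="3. Knowledge-based recommendation"></a>3. Knowledge-based recommendation</h3><p>Recommend according to a complete set of rules and learning paths for a particular domain; Khan Academy's knowledge tree is an example.</p><p>Addendum: (image from Zhihu user shawn1943, with thanks)</p><p><img src="https://pic1.zhimg.com/80/v2-9f88f829b59ddb4f1e0571c46c158d1c_hd.png" alt=""></p>]]></content>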
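The user-based example from the post above (Xiao Zhang, Xiao Ming, Xiao Wu and the topics they follow) can be sketched in a few lines. This is only an illustration under assumptions: the post does not name a similarity measure, so Jaccard similarity is used here, and `jaccard` and `recommend` are hypothetical helper names.

```python
# Topics each user follows, taken from the example table in the post.
follows = {
    "Xiao Zhang": {"product management", "Google", "growth"},
    "Xiao Ming": {"product management", "Google", "Bitcoin"},
    "Xiao Wu": {"Bitcoin", "blockchain", "Ether"},
}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def recommend(user, follows):
    """Suggest topics followed by the most similar other user.

    Hypothetical helper; Jaccard similarity is an assumption, not the post's choice.
    """
    others = [name for name in follows if name != user]
    nearest = max(others, key=lambda name: jaccard(follows[user], follows[name]))
    # Recommend what the nearest neighbour follows that the user does not.
    return nearest, sorted(follows[nearest] - follows[user])


nearest, suggestions = recommend("Xiao Zhang", follows)
print(nearest, suggestions)
```

Xiao Zhang shares two of four distinct topics with Xiao Ming and none with Xiao Wu, so Xiao Ming is the nearest neighbour and Bitcoin is the suggestion, which matches the conclusion in the post.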

    <summary type="html">


        &lt;h3 id=&quot;算法分类&quot;&gt;&lt;a href=&quot;#算法分类&quot; class=&quot;headerlink&quot; title=&quot;Algorithm categories&quot;&gt;&lt;/a&gt;Algorithm categories&lt;/h3&gt;&lt;h3 id=&quot;1-基于内容-用户的推荐&quot;&gt;&lt;a href=&quot;#1-基于内容-用户的推荐&quot; class=&quot;headerlink&quot;


    </summary>

      <category term="Recommendation" scheme="https://github.com/DuncanZhou//categories/Recommendation/"/>


      <category term="Recommendation" scheme="https://github.com/DuncanZhou//tags/Recommendation/"/>

  </entry>

  <entry>
    <title>Study Notes on Recommendation Research Directions</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/RecommendationLearning/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/RecommendationLearning/</id>
    <published>2019-08-17T08:14:59.311Z</published>
    <updated>2018-01-19T03:42:32.528Z</updated>

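    <content type="html"><![CDATA[<h3 id="综述"><a href="#综述" class="headerlink" title="Overview"></a>Overview</h3><p>Current research directions in recommendation include:<br>1.Temporal Context-Aware Recommendation<br>2.Spatial Recommendation for Out-of-Town Users<br>3.Location-based and Real-time Recommendation<br>4.Efficiency of Online Recommendation</p><p>Supplementary notes:</p><blockquote><p>Online learning means that training happens in real time on streaming data: rather than using the full sample set for each training run, it starts from the previously trained model and updates the model once for every new sample. One method of this kind is OGD (online gradient descent).</p><p>Batch learning, also called offline learning, uses the full sample set for every training run, so it can run into trouble when the dataset is too large.</p></blockquote><p>Traditional recommender systems widely use <font color="blue">collaborative filtering</font> and <font color="blue">content-based filtering</font>.</p><p>Collaborative filtering divides into</p><blockquote><p>memory-based recommendation and model-based recommendation (matrix factorization)</p></blockquote><p>Context-Aware Recommender Systems (CARS) comprise three paradigms: contextual pre-filtering, contextual post-filtering, and contextual modeling.</p>]]></content>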
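The online learning idea described in the notes above, updating an existing model once per incoming sample, can be sketched as follows. This is a minimal illustration, assuming a linear model with squared-error loss; `ogd_step`, the learning rate, and the simulated stream are all hypothetical and do not come from the original notes.

```python
def ogd_step(weights, bias, features, target, learning_rate=0.05):
    """One online gradient descent update on a single sample.

    Loss is 0.5 * (prediction - target) ** 2, so the gradient with
    respect to each weight is error * feature, and error for the bias.
    """
    prediction = sum(w * x for w, x in zip(weights, features)) + bias
    error = prediction - target
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Simulated stream of (features, target) pairs generated from y = 2 * x + 1.
stream = [([x / 10.0], 2.0 * (x / 10.0) + 1.0) for x in range(10)] * 500

weights, bias = [0.0], 0.0
for features, target in stream:
    # Each update uses only the sample that just arrived, starting from
    # the model produced by all earlier updates.
    weights, bias = ogd_step(weights, bias, features, target)

print(weights, bias)
```

A batch learner would instead recompute the fit from the full sample set on every training run, which is the contrast the notes draw.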

    <summary type="html">


        &lt;h3 id=&quot;综述&quot;&gt;&lt;a href=&quot;#综述&quot; class=&quot;headerlink&quot; title=&quot;Overview&quot;&gt;&lt;/a&gt;Overview&lt;/h3&gt;&lt;p&gt;Current research directions in recommendation include:&lt;br&gt;1.Temporal Context-Aware Recommendation&lt;br&gt;2.Spa


    </summary>

      <category term="Paper" scheme="https://github.com/DuncanZhou//categories/Paper/"/>


  </entry>

  <entry>
    <title>python-MPI Installation Commands</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/Python-MPI/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/Python-MPI/</id>
    <published>2019-08-17T08:14:59.306Z</published>
    <updated>2018-01-19T03:42:23.594Z</updated>

    <content type="html"><![CDATA[<h3 id="在Ubuntu下安装MPI环境-python环境"><a href="#在Ubuntu下安装MPI环境-python环境" class="headerlink" title="Installing MPI on Ubuntu (for Python)"></a>Installing MPI on Ubuntu (for Python)</h3><p><font color="red">Step1:</font> Install Python</p><p><font color="red">Step2:</font> sudo apt-get install openmpi-bin</p><p><font color="red">Step3:</font> sudo apt-get install libopenmpi-dev</p><p><font color="red">Step4:</font> sudo apt-get install python-mpi4py</p><p><strong>(Do not skip Step 3)</strong></p>]]></content>

    <summary type="html">


        &lt;h3 id=&quot;在Ubuntu下安装MPI环境-python环境&quot;&gt;&lt;a href=&quot;#在Ubuntu下安装MPI环境-python环境&quot; class=&quot;headerlink&quot; title=&quot;Installing MPI on Ubuntu (for Python)&quot;&gt;&lt;/a&gt;Installing MPI on Ubuntu


    </summary>

      <category term="Note" scheme="https://github.com/DuncanZhou//categories/Note/"/>


  </entry>

  <entry>
    <title>Building a Min-Heap in Python</title>
    <link href="https://github.com/DuncanZhou/2019/08/17/python-minHeap/"/>
    <id>https://github.com/DuncanZhou/2019/08/17/python-minHeap/</id>
    <published>2019-08-17T08:14:59.300Z</published>
    <updated>2018-01-19T03:42:14.983Z</updated>

    <content type="html"><![CDATA[<p>I recently needed a min-heap in an experiment, so I am recording the implementation here for future reference.<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">import heapq</div><div class="line"># Define a min-heap</div><div class="line">class MinHeap(object):</div><div class="line"></div><div class="line">    # Accepts tuples; by default items are ordered by their second element</div><div class="line">    def __init__(self, initial=None, key=lambda x:x[1]):</div><div class="line">        self.key = key</div><div class="line">        if initial:</div><div class="line">            self._data = [(key(item), item) for item in initial]</div><div class="line">            heapq.heapify(self._data)</div><div class="line">        else:</div><div class="line">            self._data = []</div><div class="line"></div><div class="line">    def push(self, item):</div><div class="line">        heapq.heappush(self._data, (self.key(item), item))</div><div class="line"></div><div class="line">    # Remove and return the item with the smallest key</div><div class="line">    def pop(self):</div><div class="line">        return heapq.heappop(self._data)[1]</div><div class="line"></div><div class="line">    def size(self):</div><div class="line">        return len(self._data)</div></pre></td></tr></table></figure></p>]]></content>
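One common use for a min-heap like the one in the post above is keeping only the k highest-scoring items while scanning a list once. The sketch below assumes `(name, score)` tuples ordered by their second element, as in the post's default `key`; `top_k` is a hypothetical helper that uses `heapq` directly and is not part of the original post.

```python
import heapq


def top_k(items, k, key=lambda x: x[1]):
    """Return the k items with the largest key, largest first.

    Hypothetical helper for illustration; not part of the original post.
    """
    if 0 >= k:
        return []
    heap = []  # min-heap of (key, arrival index, item); the root is the smallest key kept
    for index, item in enumerate(items):
        entry = (key(item), index, item)
        if k > len(heap):
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new item beats the smallest kept key, so it takes that slot.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in sorted(heap, reverse=True)]


users = [("a", 0.3), ("b", 0.9), ("c", 0.1), ("d", 0.7)]
best = top_k(users, 2)
print(best)
```

The arrival index in each heap entry is a design choice: it breaks ties between equal keys, so the items themselves are never compared. In the class from the post, two entries with equal keys fall back to comparing the items, which raises an error if the items are not comparable.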

    <summary type="html">


        &lt;p&gt;I recently needed a min-heap in an experiment, so I am recording the implementation here for future reference.&lt;br&gt;&lt;figure class=&quot;highlight plain&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;div class=&quot;line&quot;&gt;1&lt;/div&gt;&lt;div class=&quot;line


    </summary>

      <category term="Note" scheme="https://github.com/DuncanZhou//categories/Note/"/>


  </entry>

</feed>