+1
0

hyperloglog

ning 发表于 2014年10月11日 13:55 | Hits: 6214
Tag: all

Table of Contents

1   what

  • Hyperloglog是 counting 算法
  • Hyperloglog is an approximate technique for computing the number of distinct entries in a set, It does this while using a small amount of memory. For instance, to achieve 99% accuracy, it needs only 16 KB.
  • It is derived from the LogLog counting technique

但是, 不就是需要一个计数么? i++ 不就行了么?

需求:

  • 计算每天访问某个网站的uniqips
  • 计算google 每天uniqquery 个数.

我们怎么处理呢:

脚本:           sort | uniq | wc -l
map-reduce:     sort | uniq | wc -l
程序:           通过一个大字典去重.

它们的特点是:

  • sort方法时间复杂度O(n log N)
  • 去重方法 需要很大内存来去重, 空间复杂度O(n)

2   how

Here I’ll cover only the basic idea using a very clever example found at [3]. Imagine you tell me you spent your day flipping a coin, counting how many times you encountered a non interrupted run of heads. If you tell me that the maximum run was of 3 heads, I can imagine that you did not really flipped the coin a lot of times. If instead your longest run was 13, you probably spent a lot of time flipping the coin.

Long story short this is what HyperLogLog does: it hashes every new element you observe. Part of the hash is used to index a register, The other part of the hash is used to count the longest run of leading zeroes in the hash (our run of heads).

3   redis实现

PFADD var element element … element
PFCOUNT var
PFMERGE dst src src src … src

The commands prefix is “PF” in honor of Philippe Flajolet [4].

4   ref

很有意思.

5   思考

如何动态维护top10

  • redis的简单kv做计数, 再加个zset.
  • zset里面放当前最大的N个, 每次来一个query, 更新hash中的value, 同时看, 如果当前value > zset.min 把zset最后一个踢掉, 把当前这个插入到zset.

6   其它算法

这一类算法可以叫做: Streaming_algorithm

http://en.wikipedia.org/wiki/Streaming_algorithm

  • Streams can be denoted as an ordered sequence of points (or "updates")
  • 但是通常 too large to be stored

模型: - cash register - turnstile model - sliding window

7   典型问题

  • Frequency moments

  • Heavy hitters (top 10 ) Find the most frequent (popular) elements in a data stream. Some notable algorithms are:

    • Karp-Papadimitriou-Shenker algorithm
    • Count-Min sketch
    • Sticky sampling
    • Lossy counting
    • Multi-stage Bloom filters
  • Trend detection[edit] Detecting trends in data streams is often done using an heavy hitters algorithm as listed above

  • Counting distinct elements
    • hyperloglog
  • Entropy

  • Online learning

7.1   Lossy Counting

Say you're looking at the traffic for facebook profiles. You have billions of hits. You want to find which profiles are accessed the most often. You could keep a count for each profile, but then you'd have a very large number of counts to keep track of, the vast majority of which would be meaningless.

With lossy counting, you periodically remove very low count elements from the table. The most-frequently accessed profiles would almost never have low counts anyway, and if they did, they wouldn't be likely to stay there for long.

-- 其实就相当于用redis做count, expire调的比较小(比如1 day)

There are a large number of refinements to the algorithm. But the basic idea is this -- to find the heavy hitters without having to track every element, periodically purge your counts of any elements that don't seem likely to be heavy hitters based on the data so far.

7.2   Sticky Counting

抽样计算counting.

原文链接: /hyperloglog.html

0     +1

我要给这篇文章打分:

可以不填写评论, 而只是打分. 如果发表评论, 你可以给的分值是-5到+5, 否则, 你只能评-1, +1两种分数. 你的评论可能需要审核.

评价列表(1)

  • +1 guest voted at 2022-07-16 14:58:44