Estimating distinct values in streams using CVM Algorithm

Sun, 19 May 2024

I recently found an interesting article Computer Scientists Invent an Efficient New Way to Count written by Quantamagazine.

What got my attention was that they described how to estimate the number of distinct values in a list by applying an algorithm that uses coin flips and rounds.

So I wanted to write the algorithm myself using just the steps described in the article mentioned above.

Scenarios where counting distinct values would be useful

Counting distinct value is an easy task if the volume of values is small. But when we talk about BigData (billions or trillions of values) that are streamed into a system, this simple activity starts to be challenging.

Data analysis and reporting

Analyzing survey responses to determine the number of unique products or brands mentioned by respondents.
Counting the number of unique customers who made purchases in a specific period.
Counting the number of distinct items sold to understand product diversity and popularity.

Online Commerce & Finance

Determining the number of distinct products sold to manage stock levels and reordering.
Identifying the number of unique customers who belong to different demographic groups for targeted marketing.
Determining the number of unique transactions or distinct merchants a customer has interacted with.
Counting the distinct assets or securities in an investment portfolio.

Social Media and Web Analytics

Counting the number of unique users who visited a website or interacted with a social media post.
Identifying the distinct topics or hashtags used in a set of social media posts.

Healthcare

Counting the number of distinct diagnoses or treatments recorded in a patient database.
Tracking the unique adverse events reported by participants in a clinical trial.

and much more...

My take on the CVM algorithm implementation

// hamlet is a string taken from here -> https://gist.github.com/provpup/2fc41686eab7400b796b
// the actualy number of distinct values is 5697
const words = hamlet.match(/\b\w+\b/g)

// Simulate coin flips
function tails () {
    return Math.random() < 0.5
}

// Performs as many coin flips needed for each round
function flip_by_round (rounds) {
    for (let round = 1; round <= rounds; round++) {
        if (!tails()) {
            return false
        }
    }
    return true
}

// The actual algorithm
function cvm (whiteboard_size) {
    let whiteboard = []
    let round = 0

    // 1st whiteboard fill
    // We add all unique values in the whiteboard until it reaches the limit
    // When it reaches the limit we flip a coin for each value
    // heads -> we keep it
    // tails -> we ditch it
    // After this step we probably end up with a whiteboard that's ~50% filled
    for (let word of words) {
        if (!whiteboard.includes(word)) {
            whiteboard.push(word)
            words.splice(words.indexOf(word), 1)
        }

        if (whiteboard.length >= whiteboard_size) {
            whiteboard = whiteboard.filter(() => tails())
            round++
            break
        }
    }

    // Start rounds
    // Before processing any new words, we test if the whiteboard is filled.
    // If the whiteboard is filled, we flip a coin again for each value
    // We add all new words to the whiteboard
    // For each word (it will actually do something only for words that are added to the list)
    // round 1 -> we flip a coin, if it's tails -> we remove it from the list
    // round 2 -> we flip a coin, -> if it's tails -> we remove it from the list
    //                            -> if it's heads -> we flip a coin -> if it's heads, we keep it
                                                                     -> if it's tails, we ditch the word
    // round n -> we flip a coin, remove it on tails and only keep it on n heads
    for (let word of words) {
        if (whiteboard.length >= whiteboard_size) {
            whiteboard = whiteboard.filter(() => tails())
            round++
        }

        if (!whiteboard.includes(word)) {
            whiteboard.push(word)
        }

        if (!flip_by_round(round)) {
            whiteboard.splice(whiteboard.indexOf(word), 1)
        }
    }

    console.log('total rounds:', round)
    console.log('whiteboard size:', whiteboard.length)
    console.log('estimated count:', whiteboard.length / (1 / Math.pow(2, round)))
}

cvm(100)
// total rounds: 6
// whiteboard size: 87
// estimated count: 5568

cvm(500)
// total rounds: 4
// whiteboard size: 374
// estimated count: 5984

cvm(1000)
// total rounds: 3
// whiteboard size: 719
// estimated count: 5752

The whiteboard_size parameter defines how many values we can store. The higher the value, the more accurate the estimation is. For reference, the text has 5697 distinct words.

The Challenge

After the first attempts of writing the algo I didn't get the expected results. So I started investigating why my results were different than those presented in the article.

The following phrase from Quantamagazine confused me:

Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

It should've actually sounded like this:

Round 1. Keep going through Hamlet, adding new words as you go. For each word flip the coin. If it’s tails, delete the word; heads, and the word stays on the list.

So we flip the coin for each word, not only for coins that were in the list, but also for coins that were not in the list and were just added.

Resources

Categories: javascript, nodejs, how-to, algorithm