Explore this snippet with some demo data here.
Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater. One method for calculating a CDF is as follows:
select
-- Use a row_number window function to get the position of this row
(row_number() over (order by <quantity> asc)) / (select count(*) from <table>) cdf,
<quantity>,
from <table>where
quantity- the column containing the metric of interesttable- the table name
Using total Spotify streams as an example data source, let's identify:
table- this is calledrawquantity- this is the columnstreams
then the query becomes:
select
(row_number() over (order by streams asc)) / (select count(*) from raw) frac,
streams,
from raw| frac | streams |
|---|---|
| 0 | 374359 |
| 0 | 375308 |
| ... | ... |
| 1 | 7778950 |