[{"authors":["admin"],"categories":null,"content":"{first name}souza{last name} at gmail dot com\nI'm an independent researcher and engineer at 3778 Healthcare working with Machine Learning for Healthcare. Previously, I was Head of Data Science at Linx Impulse and a Researcher at the CERTI Foundation where I worked with Signal Processing and Embedded Systems. I'm interested in Statistical/Computational Learning and Information Theory. Originally, I am from Florianópolis, Brazil but I've lived in New Jersey, Orlando, Toronto and São Paulo as well as other smaller cities in the south of Brazil. I enjoy reading, playing american football and KSP ","date":-62135596800,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":-62135596800,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"{first name}souza{last name} at gmail dot com\nI'm an independent researcher and engineer at 3778 Healthcare working with Machine Learning for Healthcare. Previously, I was Head of Data Science at Linx Impulse and a Researcher at the CERTI Foundation where I worked with Signal Processing and Embedded Systems. I'm interested in Statistical/Computational Learning and Information Theory. 
Originally, I am from Florianópolis, Brazil, but I've lived in New Jersey, Orlando, Toronto, and São Paulo, as well as other smaller cities in the south of Brazil.","tags":null,"title":"Daniel Severo","type":"authors"},{"authors":null,"categories":[],"content":"","date":1582433444,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1582433444,"objectID":"9f29c86b17a202c1f653204aff8fa475","permalink":"/pon/","publishdate":"2019-09-23T01:50:44-03:00","relpermalink":"/pon/","section":"","summary":"A distributed consensus mechanism for securing content novelty: Proof of Novelty.","tags":[],"title":"Proof of Novelty","type":"page"},{"authors":[],"categories":[],"content":"This post is a solution to the problem taken from Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. Vol. 4. New York, NY, USA: AMLBook, 2012.\nQuoted text refers to the original problem statement, verbatim.\nFor more solutions, see dsevero.com/blog.\nConsider leaving a Star if this helps you.\n A sample of heads and tails is created by tossing a coin a number of times independently. Assume we have a number of coins that generate different samples independently. For a given coin, let the probability of heads (probability of error) be $\\mu$. The probability of obtaining $k$ heads in $N$ tosses of this coin is given by the binomial distribution: $$ P\\left[ k \\mid N, \\mu \\right] = {N\\choose k} \\mu^k \\left(1 - \\mu\\right)^{N-k}$$ Remember that the training error $\\nu$ is $\\frac{k}{N}$\n The learning model used in this chapter is the following: assume you have $N$ data points sampled independently from some unknown distribution $\\mathbf{x}_n \\sim P$, targets $y_n = f(\\mathbf{x_n})$ and a set of hypotheses (e.g. machine learning models) $h \\in \\mathcal{H}$ of size $\\mid \\mathcal{H} \\mid = M$. A coin flipping experiment is used to draw conclusions on the accuracy of binary classifiers. 
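To make the correspondence concrete, here is a minimal numpy sketch of the coin-for-hypothesis analogy; the labels and predictions are synthetic, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3  # data points per hypothesis, number of hypotheses

# hypothetical binary targets and predictions for M classifiers on N points
y = rng.integers(0, 2, size=(M, N))
preds = rng.integers(0, 2, size=(M, N))

# a head (1) is a misclassification, a tail (0) a correct prediction
flips = (preds != y).astype(int)
k = flips.sum(axis=1)  # heads per coin = errors per hypothesis
nu = k / N             # training error of each hypothesis
```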
The $n$-th flip of a coin is the evaluation of some hypothesis $h$ on point $(\\mathbf{x}_n, y_n)$. Heads (numerically, 1) represents an error $h(\\mathbf{x}_n) \\neq y_n$, while tails is a successful prediction. In the case of $M$ coins, we have $M$ hypotheses and $NM$ data points $(x_{m,n}, y_{m,n})$\nThe objective of this problem is to show that, given a large enough set of hypotheses $\\mathcal{H}$, the probability of obtaining low training error on at least one $h \\in \\mathcal{H}$ is high if the data is i.i.d. Therefore, we should be careful when evaluating models even if we have followed the standard train, test and validation split procedure.\nHow does this translate to practice? Say you have a training dataset $\\mathcal{D}$ and $M$ models $h_m \\in \\mathcal{H}$ that you wish to evaluate. You sample (with replacement) $N$ points $\\mathbf{x}_{m,n} \\in \\mathcal{D}$ (e.g. mini-batch training) for each $h_m$ (i.e. a total of $NM$ points). What is the probability that at least one hypothesis will have zero in-sample error?\n (a) Assume the sample size $(N)$ is $10$. If all the coins have $\\mu = 0.05$ compute the probability that at least one coin will have $\\nu = 0$ for the case of $1$ coin, $1,000$ coins, $1,000,000$ coins. Repeat for $\\mu = 0.8$.\n Let $k_m$ be the number of heads for each coin. Since $\\nu=0$ implies that $k=0$, we need to calculate\n$$ P\\left[ k_1=0 \\vee k_2=0 \\vee \\dots \\vee k_m=0 \\right] = P\\left[ \\bigvee\\limits_{m} k_m = 0 \\right]$$\nHere, we employ the common trick of computing the probability of the complement.\nNote that the following step stems from the fact that $\\mathbf{x}_{m,n}$ are independent. If we had used the same set of $N$ points for all $h_m$ (i.e. 
$\\mathbf{x}_{m,n} \\rightarrow \\mathbf{x}_{n})$, the set of $k_m$ would not be independent, since looking at a specific $k_m$ would give you information regarding some other $k_{m^\\prime}$.\n$$ \\begin{aligned} P\\left[ \\bigvee\\limits_{m} k_m = 0 \\right] \u0026amp;= 1 - P\\left[ \\bigwedge\\limits_{m} k_m \u0026gt; 0 \\right] \\\\\n\u0026amp;= 1 - \\prod\\limits_{m}P\\left[ k_m \u0026gt; 0 \\right] \\end{aligned} $$\nSumming over the values of $k$ and using the fact that $\\sum\\limits_{k=0}^N P\\left[k\\right] = 1$ we can compute\n$$ \\begin{aligned} P\\left[ k_m \u0026gt; 0 \\right] \u0026amp;= \\sum\\limits_{k=1}^N P\\left[k\\right] \\\\\n\u0026amp;= \\sum\\limits_{k=0}^N P\\left[k\\right] - P\\left[0\\right] \\\\\n\u0026amp;= 1 - \\left(1 - \\mu\\right)^N \\end{aligned} $$\nThus, resulting in\n$$P\\left[ \\bigvee\\limits_{m} k_m = 0 \\right] = 1 - \\left( 1 - \\left(1 - \\mu\\right)^N \\right)^M$$\nThe result is intuitive. For a single coin, if $\\left(1 - \\mu\\right)^N$ is the probability that all $N$ flips result in tails, the complement $1 - \\left(1 - \\mu\\right)^N$ is the probability that at least one flip will result in heads. For this to happen to all $M$ coins, we get $\\left( 1 - \\left(1 - \\mu\\right)^N \\right)^M$. 
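As a sanity check, the closed form can be compared against a direct simulation; this is a minimal numpy sketch, and the seed, trial count and values of $\\mu$, $N$, $M$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, N, M, trials = 0.05, 10, 1_000, 2_000

# heads per coin, for `trials` independent experiments of M coins each
k = rng.binomial(n=N, p=mu, size=(trials, M))

# fraction of experiments where at least one coin has k_m = 0
empirical = (k == 0).any(axis=1).mean()
closed_form = 1 - (1 - (1 - mu)**N)**M
```

With these values the closed form is essentially $1$, and the simulated fraction agrees.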
Similarly, the probability of the complement is $1 - \\left( 1 - \\left(1 - \\mu\\right)^N \\right)^M$ and can be interpreted as the probability that at least one coin out of $M$ will have all flips out of $N$ resulting in tails.\nLet\u0026rsquo;s take a look at this in Python.\nimport matplotlib.pyplot as plt import pandas as pd def prob_zero_error(μ: 'true probability of error', M: 'number of hypotheses', N: 'number of data points'): return 1 - (1 - (1 - μ)**N)**M d = [{'μ': μ, 'M': M, 'p': prob_zero_error(μ, M, N=10)} for μ in [0.05, 0.5, 0.8] for M in [1, 1_000, 1_000_000]] pd.DataFrame(d).pivot('M', 'μ', 'p').to_html() μ 0.05 0.5 0.8 M 1 0.598737 0.000977 1.024000e-07 1000 1.000000 0.623576 1.023948e-04 1000000 1.000000 1.000000 9.733159e-02 We\u0026rsquo;ve included the results for $\\mu = 0.5$, which represents a reasonable error rate for an untrained binary classification model. The middle cell tells us that a sample of size $NM = 10^4$ evaluated on $M=10^3$ hypotheses (with $10$ samples each) has a $62.36\\%$ chance of at least one hypothesis having error zero.\nLet\u0026rsquo;s take a look at the asymptotic properties of $P(N,M) = 1 - \\left( 1 - \\left(1 - \\mu\\right)^N \\right)^M$ for $\\mu \\in (0, 1)$.\n$$\\lim\\limits_{M \\rightarrow \\infty} P(N,M) = 1$$ $$\\lim\\limits_{N \\rightarrow \\infty} P(N,M) = 0$$\nIntuitively, evaluating on more datapoints $N$ should make it harder for all points (coins) to have zero error (tails) for any number of hypotheses. Using a larger hypothesis set $\\mid\\mathcal{H}\\mid = M$ is analogous to brute forcing the appearance of $k=0$ through repetitive attempts.\nIf we want to bound this probability (for the sake of sanity) to some value $\\lambda$, how should we choose $M$ and $N$? 
Solving independently for $N$ and $M$ in $P(N,M) \\leq \\lambda$\n$$M \\leq \\frac{\\log\\left(1 - \\lambda\\right)}{\\log\\left(1 - \\left(1 - \\mu\\right)^N\\right)}$$ $$N \\geq \\frac{\\log\\left(1 - \\sqrt[M]{1 - \\lambda} \\right)}{\\log\\left(1 - \\mu\\right)}$$\nWe can use these results to calculate how fast the number of hypotheses $M$ can grow with respect to the number of datapoints $N$ for a fixed probability of zero error $\\lambda$, and vice versa.\n (b) For the case $N = 6$ and $2$ coins with $\\mu = 0.5$ for both coins, plot the probability $$P[ \\max\\limits_i \\mid \\nu_i - \\mu_i \\mid \u0026gt; \\epsilon ]$$ for $\\epsilon$ in the range $[0, 1]$ (the max is over coins). On the same plot show the bound that would be obtained using the Hoeffding Inequality. Remember that for a single coin, the Hoeffding bound is $$P[\\mid \\nu - \\mu \\mid \u0026gt; \\epsilon ] \\leq 2e^{-2N\\epsilon^2}$$\n import matplotlib.pyplot as plt import numpy as np plt.style.use('ggplot') N = 6 M = 2 μ = 0.5 def hoeffding_bound(ε, N, M=1): return 2*M*np.exp(-2*N*ε**2) def P(N, M, ε_space, μ): k = np.random.binomial(n=N, p=μ, size=(1_000, M)) P = np.abs(k/N - μ).max(axis=1) return [(P \u0026gt; ε).mean() for ε in ε_space] ε_space = np.linspace(0, 1, 100) plt.figure(figsize=(12,5)) plt.plot(ε_space, hoeffding_bound(ε_space, N), '--', ε_space, hoeffding_bound(ε_space, N, M=3), '--', ε_space, P(6, 2, ε_space, μ), ε_space, P(6, 3, ε_space, μ)); plt.title('Average over $1000$ iterations of\\n' '$\\max \\{ \\mid k_1/6 - 0.5 \\mid,' '\\mid k_2/6 - 0.5 \\mid\\} \u0026gt; \\epsilon $\\n', fontsize=20) plt.legend(['Hoeffding Bound ' '($M=1 \\\\rightarrow 2e^{-12\\epsilon^2}$)', 'Hoeffding Bound ' '($M=3 \\\\rightarrow 6e^{-12\\epsilon^2}$)', '$M=2$', '$M=3$'], fontsize=18) plt.yticks(fontsize=18) plt.xticks(fontsize=18) plt.ylim(0, 2) plt.xlabel('$\\epsilon$', fontsize=20); Notice how the Hoeffding Bound is violated for $M=3$ if multiple hypotheses are not properly accounted for.\n[Hint: Use 
$P[A \\text{ or } B] = P[A] + P[B] - P[A \\text{ and } B] = P[A] + P[B] - P[A]P[B]$, where the last equality follows by independence, to evaluate $P[\\max \\dots]$. -- Questions, suggestions or corrections? Message me on twitter or create a pull request at dsevero/dsevero.com\nConsider leaving a Star if this helps you.\n","date":1576190247,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1576190247,"objectID":"741165db73fa6f2c5d2bb7259e027d99","permalink":"/blog/lfd-p17/","publishdate":"2019-12-12T19:37:27-03:00","relpermalink":"/blog/lfd-p17/","section":"blog","summary":"This post is a solution to the problem taken from Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. Vol. 4. New York, NY, USA: AMLBook, 2012.\nQuoted text refers to the original problem statement, verbatim.\nFor more solutions, see dsevero.com/blog.\nConsider leaving a Star if this helps you.\n A sample of heads and tails is created by tossing a coin a number of times independently. Assume we have a number of coins that generate different samples independently.","tags":[],"title":"Learning From Data, Problem 1.7","type":"blog"},{"authors":[],"categories":[],"content":" Here we will show you how to properly use the Python Data Analysis Library (pandas) and numpy. The agenda is:\n How to load data from csv files The basic pandas objects: DataFrames and Series Handling Time-Series data Resampling (optional) From pandas to numpy Simple Linear Regression Consider leaving a Star if this helps you.\nThe following ipython magic (this is literally the name) will enable plots made by matplotlib to be rendered inside this notebook.\n%matplotlib inline import pandas as pd import numpy as np import matplotlib.pyplot as plt plt.style.use('ggplot') # changes the plotting style 1. 
Load data The file data/monthly-milk-production-pounds-p.csv contains the average monthly milk production, in pounds, of cows from Jan/1962 to Dec/1975. More information can be found here: https://datamarket.com/data/set/22ox/monthly-milk-production-pounds-per-cow-jan-62-dec-75\nFirst, we must load this data with pandas for further analysis.\ndf = pd.read_csv('data/monthly-milk-production-pounds-p.csv') df.head() Month Monthly milk production: pounds per cow. Jan 62 ? Dec 75 0 1962-01 589 1 1962-02 561 2 1962-03 640 3 1962-04 656 4 1962-05 727 type(df) pandas.core.frame.DataFrame Calling .head() truncates the dataset to the first 5 lines (plus the header). Notice that the type of df is a pandas DataFrame. This is similar to an Excel table, but much more powerful. Since pandas is a widely used library, Jupyter automatically shows the dataframe as formatted HTML.\n2. The basic pandas objects: DataFrames and Series Let\u0026rsquo;s take a look at each column individually.\ndf['Month'].head() 0 1962-01 1 1962-02 2 1962-03 3 1962-04 4 1962-05 Name: Month, dtype: object df['Monthly milk production: pounds per cow. Jan 62 ? Dec 75'].head() 0 589 1 561 2 640 3 656 4 727 Name: Monthly milk production: pounds per cow. Jan 62 ? Dec 75, dtype: int64 type(df['Month']) pandas.core.series.Series A pandas Series is the second basic type. In a nutshell, Series are made up of values and an index. For both columns, the index can be seen printed on the far left and the elements are 0, 1, 2, 3, and 4. The values are the points of interest (e.g. dates for the Month column and 589, 561, 640, 656 and 727 for the other).\nA pandas DataFrame is made up of multiple Series, each representing a column, and an index.\nThe columns of a DataFrame can be accessed through slicing (as previously shown). 
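As an aside, slicing also accepts a list of labels, which returns a DataFrame rather than a Series. A small self-contained sketch, with toy data standing in for the milk dataset:

```python
import pandas as pd

# a tiny stand-in for the milk DataFrame loaded above
df = pd.DataFrame({'month': ['1962-01', '1962-02'], 'milk': [589, 561]})

single = df['milk']              # one label       -> pandas Series
several = df[['month', 'milk']]  # list of labels  -> pandas DataFrame
```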
Since the names are hard to write, we can change them like so:\ndf.columns = ['month', 'milk'] 3. Handling Time-Series data df.head() month milk 0 1962-01 589 1 1962-02 561 2 1962-03 640 3 1962-04 656 4 1962-05 727 df.info() \u0026lt;class 'pandas.core.frame.DataFrame'\u0026gt; RangeIndex: 168 entries, 0 to 167 Data columns (total 2 columns): month 168 non-null object milk 168 non-null int64 dtypes: int64(1), object(1) memory usage: 2.7+ KB The .info() function gives us some insight on which data-types are being used to represent the values of each column. Notice how the milk column is of type int64. Hence, we can perform arithmetic and plotting operations like so:\ndf['milk'].plot(); df['milk'].mean() 754.7083333333334 df['milk'].var() 10445.764720558882 The month column is of type object. This is Python\u0026rsquo;s way of telling you that this column is of mixed type. Hence, it is a little bit trickier to manipulate. Due to the internals of pandas, a Series that has all values of type str will still be referred to as of type object. This is the case of the month column.\ndf['month'].apply(type).unique() array([\u0026lt;class 'str'\u0026gt;], dtype=object) The .apply function will apply the argument function (in this case type) to every single element of the series. unique will return to us the unique values of the series (i.e. it will drop all duplicates). Calling both together lets us see what data-types are present in the Series. As can be seen, all are of type str.\npandas has a built-in timestamp data-type. 
It works like so.\npd.Timestamp('now') Timestamp('2019-10-25 14:42:25.259875') pd.Timestamp('1992-03-23') Timestamp('1992-03-23 00:00:00') pd.Timestamp('1992-03-23 04') Timestamp('1992-03-23 04:00:00') Internally, pandas stores a date as the amount of time that has passed since 1970-01-01 00:00:00. This date is represented as pd.Timestamp(0). This is useful for linear regression, since it allows us to convert timestamp data to integers without loss of reference.\npd.Timestamp(0) Timestamp('1970-01-01 00:00:00') pd.Timestamp('now') \u0026gt; pd.Timestamp('1992-03-23 04') True We can transform the month column into pd.Timestamp values with pd.to_datetime and set it as the index of a new time-series.\ndf['month'] = pd.to_datetime(df['month']) s = df.set_index('month')['milk'] s.head() month 1962-01-01 589 1962-02-01 561 1962-03-01 640 1962-04-01 656 1962-05-01 727 Name: milk, dtype: int64 s.index[0] Timestamp('1962-01-01 00:00:00') s.values[0] 589 s.plot(); Notice how the x-axis of the above plot differs from the first one of this same section, since the index of s is a timestamp-like type. The timestamp index of s can also be manipulated, and time-aware slices are now available.\ns.index.min() Timestamp('1962-01-01 00:00:00') s.index.max() Timestamp('1975-12-01 00:00:00') s['1970'].plot(); s['1970':'1972'].plot(style='o--'); 4. Resampling (optional) Looking at the plots it is pretty clear that the data trend is rising, but it fluctuates yearly, reaching a local peak around June. How can we calculate the yearly mean as an attempt to smooth out the data? Luckily, since s is a time-series (i.e. it has a time index and numeric values), we can use the .resample function. This will allow us to group the data chronologically, given that we supply an aggregating function (e.g. mean, std, var, median, etc.).\ns.resample('12M').mean().plot(style='o--'); s.resample('6M').mean().plot(style='o--'); 5. 
From pandas to numpy Numpy provides vector data-types and operations making it easy to work with linear algebra. In fact, this works so well, that pandas is actually built on top of numpy. The values of a pandas Series, and the values of the index are numpy ndarrays.\ntype(s.values) numpy.ndarray type(s.index.values) numpy.ndarray s.head().values array([589, 561, 640, 656, 727]) s.head().index.values array(['1962-01-01T00:00:00.000000000', '1962-02-01T00:00:00.000000000', '1962-03-01T00:00:00.000000000', '1962-04-01T00:00:00.000000000', '1962-05-01T00:00:00.000000000'], dtype='datetime64[ns]') s.values.dot(s.values) # dot product 97434667 s.values + s.values array([1178, 1122, 1280, 1312, 1454, 1394, 1280, 1198, 1136, 1154, 1106, 1164, 1200, 1132, 1306, 1346, 1484, 1432, 1320, 1234, 1166, 1174, 1130, 1196, 1256, 1236, 1376, 1410, 1540, 1472, 1356, 1278, 1208, 1222, 1188, 1268, 1316, 1244, 1418, 1444, 1564, 1512, 1404, 1306, 1230, 1242, 1204, 1270, 1354, 1270, 1472, 1510, 1622, 1596, 1470, 1394, 1322, 1334, 1290, 1376, 1426, 1334, 1524, 1568, 1674, 1634, 1534, 1444, 1362, 1374, 1320, 1396, 1434, 1392, 1550, 1592, 1716, 1652, 1566, 1480, 1402, 1412, 1354, 1422, 1468, 1380, 1570, 1610, 1742, 1690, 1602, 1528, 1450, 1446, 1380, 1468, 1500, 1414, 1614, 1648, 1772, 1718, 1638, 1566, 1480, 1494, 1422, 1502, 1608, 1512, 1720, 1756, 1884, 1826, 1738, 1668, 1580, 1600, 1526, 1600, 1652, 1598, 1780, 1800, 1922, 1870, 1788, 1710, 1618, 1620, 1532, 1610, 1642, 1546, 1766, 1796, 1914, 1848, 1762, 1674, 1568, 1582, 1520, 1604, 1656, 1556, 1778, 1804, 1938, 1894, 1816, 1734, 1630, 1624, 1546, 1626, 1668, 1564, 1784, 1806, 1932, 1874, 1792, 1716, 1634, 1654, 1594, 1686]) s.values * s.values array([346921, 314721, 409600, 430336, 528529, 485809, 409600, 358801, 322624, 332929, 305809, 338724, 360000, 320356, 426409, 452929, 550564, 512656, 435600, 380689, 339889, 344569, 319225, 357604, 394384, 381924, 473344, 497025, 592900, 541696, 459684, 408321, 364816, 373321, 352836, 401956, 
432964, 386884, 502681, 521284, 611524, 571536, 492804, 426409, 378225, 385641, 362404, 403225, 458329, 403225, 541696, 570025, 657721, 636804, 540225, 485809, 436921, 444889, 416025, 473344, 508369, 444889, 580644, 614656, 700569, 667489, 588289, 521284, 463761, 471969, 435600, 487204, 514089, 484416, 600625, 633616, 736164, 682276, 613089, 547600, 491401, 498436, 458329, 505521, 538756, 476100, 616225, 648025, 758641, 714025, 641601, 583696, 525625, 522729, 476100, 538756, 562500, 499849, 651249, 678976, 784996, 737881, 670761, 613089, 547600, 558009, 505521, 564001, 646416, 571536, 739600, 770884, 887364, 833569, 755161, 695556, 624100, 640000, 582169, 640000, 682276, 638401, 792100, 810000, 923521, 874225, 799236, 731025, 654481, 656100, 586756, 648025, 674041, 597529, 779689, 806404, 915849, 853776, 776161, 700569, 614656, 625681, 577600, 643204, 685584, 605284, 790321, 813604, 938961, 896809, 824464, 751689, 664225, 659344, 597529, 660969, 695556, 611524, 795664, 815409, 933156, 877969, 802816, 736164, 667489, 683929, 635209, 710649]) The above examples are just for show. You can do the same thing directly with pandas Series objects and it will use numpy behind the scenes.\ns.dot(s) == s.values.dot(s.values) True 6. Simple Linear Regression. Side note: python accepts non-ascii type characters. So it is possible to use greek letters as variables. Try this: type in \\alpha and press the TAB key in any cell.\nα = 1 β = 2 α + β 3 Here we will implement a simple linear regression to illustrate the full usage of pandas with numpy. 
For a single variable with intercept: $y = \\alpha + \\beta x$, the closed form solution is:\n$$\\beta = \\frac{cov(x,y)}{var(x)}$$ $$\\alpha = \\bar{y} - \\beta \\bar{x}$$\nwhere $\\bar{y}$ and $\\bar{x}$ are the average values of the vectors $y$ and $x$, respectively.\ny = s x = (s.index - pd.Timestamp(0)).days.values β = np.cov([x,y])[0][1]/x.var() α = y.mean() - β*x.mean() α 776.0355721497116 β 0.05592981987998409 ( s .to_frame() # transforms s back into a DataFrame .assign(regression = α + β*x) # creates a new column called regression with values α + β*x .plot() # plots all columns ); The above programming style is called method chaining, and is highly recommended for clarity.\n","date":1572024810,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1572024810,"objectID":"0a96eb672f02bf7f0246d97c65c1be1d","permalink":"/blog/pandas/","publishdate":"2019-10-25T14:33:30-03:00","relpermalink":"/blog/pandas/","section":"blog","summary":"Here we will show you how to properly use the Python Data Analysis Library (pandas) and numpy. 
The agenda is:\n How to load data from csv files The basic pandas objects: DataFrames and Series Handling Time-Series data Resampling (optional) From pandas to numpy Simple Linear Regression Consider leaving a Star if this helps you.\nThe following ipython magic (this is literally the name) will enable plots made by matplotlib to be rendered inside this notebook.","tags":[],"title":"Handling Time-series in Pandas and Numpy.","type":"blog"},{"authors":null,"categories":[],"content":"","date":1569214244,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1569214244,"objectID":"343ff1a5b2b31855a18d8ecc72433f91","permalink":"/blog/ziggurat/","publishdate":"2019-09-23T01:50:44-03:00","relpermalink":"/blog/ziggurat/","section":"blog","summary":"Mathematical proof of functionality of a highly efficient pseudo-random number generator: The Ziggurat Method","tags":[],"title":"A Report on the Ziggurat Method","type":"blog"},{"authors":null,"categories":null,"content":" Creating custom reports and machine learning models with pandas can be cumbersome with limited hardware resources (memory and CPU). Financial constraints can make spawning cloud instances to side-step this issue impractical, while adding the complexity of libraries such as Apache Spark isn\u0026rsquo;t worth the trouble and stalls data exploration. How can we keep the simplicity and power of pandas, while extending it to be out-of-core and parallel?\nEnter Dask: a flexible parallel computing library for analytic computing. With it we will create a linear regression model to predict read time in Medium posts using a Kaggle dataset, while comparing the equivalent implementation with pandas.\nConsider leaving a Star if this helps you.\n1. Kaggle data We will be using the official kaggle api to automate our data fetching process.\n Log on to kaggle and enter the How good is your Medium article? competition. Configure the official kaggle api following these steps. 
For this tutorial we will need only 1 file.\nkaggle competitions download -c how-good-is-your-medium-article -f train.json.gz The reason for decompressing will become clear later.\ngunzip -k ~/.kaggle/competitions/how-good-is-your-medium-article/train.json.gz This sample file will help us speed up the analysis.\nhead -n5 ~/.kaggle/competitions/how-good-is-your-medium-article/train.json \u0026gt; \\ ~/.kaggle/competitions/how-good-is-your-medium-article/train-sample.json 2. Exploration. Despite the extension being json, our data is stored as jsonl. This means that each line of train.json is a valid JSON document.\nhead -n1 ~/.kaggle/competitions/how-good-is-your-medium-article/train.json | jq 'del(.content)' { \u0026quot;_id\u0026quot;: \u0026quot;https://medium.com/policy/medium-terms-of-service-9db0094a1e0f\u0026quot;, \u0026quot;_timestamp\u0026quot;: 1520035195.282891, \u0026quot;_spider\u0026quot;: \u0026quot;medium\u0026quot;, \u0026quot;url\u0026quot;: \u0026quot;https://medium.com/policy/medium-terms-of-service-9db0094a1e0f\u0026quot;, \u0026quot;domain\u0026quot;: \u0026quot;medium.com\u0026quot;, \u0026quot;published\u0026quot;: { \u0026quot;$date\u0026quot;: \u0026quot;2012-08-13T22:54:53.510Z\u0026quot; }, \u0026quot;title\u0026quot;: \u0026quot;Medium Terms of Service – Medium Policy – Medium\u0026quot;, \u0026quot;author\u0026quot;: { \u0026quot;name\u0026quot;: null, \u0026quot;url\u0026quot;: \u0026quot;https://medium.com/@Medium\u0026quot;, \u0026quot;twitter\u0026quot;: \u0026quot;@Medium\u0026quot; }, \u0026quot;image_url\u0026quot;: null, \u0026quot;tags\u0026quot;: [], \u0026quot;link_tags\u0026quot;: { \u0026quot;canonical\u0026quot;: \u0026quot;https://medium.com/policy/medium-terms-of-service-9db0094a1e0f\u0026quot;, \u0026quot;publisher\u0026quot;: \u0026quot;https://plus.google.com/103654360130207659246\u0026quot;, \u0026quot;author\u0026quot;: \u0026quot;https://medium.com/@Medium\u0026quot;, \u0026quot;search\u0026quot;: 
\u0026quot;/osd.xml\u0026quot;, \u0026quot;alternate\u0026quot;: \u0026quot;android-app://com.medium.reader/https/medium.com/p/9db0094a1e0f\u0026quot;, \u0026quot;stylesheet\u0026quot;: \u0026quot;https://cdn-static-1.medium.com/_/fp/css/main-branding-base.Ch8g7KPCoGXbtKfJaVXo_w.css\u0026quot;, \u0026quot;icon\u0026quot;: \u0026quot;https://cdn-static-1.medium.com/_/fp/icons/favicon-rebrand-medium.3Y6xpZ-0FSdWDnPM3hSBIA.ico\u0026quot;, \u0026quot;apple-touch-icon\u0026quot;: \u0026quot;https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\u0026quot;, \u0026quot;mask-icon\u0026quot;: \u0026quot;https://cdn-static-1.medium.com/_/fp/icons/monogram-mask.KPLCSFEZviQN0jQ7veN2RQ.svg\u0026quot; }, \u0026quot;meta_tags\u0026quot;: { \u0026quot;viewport\u0026quot;: \u0026quot;width=device-width, initial-scale=1\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;Medium Terms of Service – Medium Policy – Medium\u0026quot;, \u0026quot;referrer\u0026quot;: \u0026quot;unsafe-url\u0026quot;, \u0026quot;description\u0026quot;: \u0026quot;These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…\u0026quot;, \u0026quot;theme-color\u0026quot;: \u0026quot;#000000\u0026quot;, \u0026quot;og:title\u0026quot;: \u0026quot;Medium Terms of Service – Medium Policy – Medium\u0026quot;, \u0026quot;og:url\u0026quot;: \u0026quot;https://medium.com/policy/medium-terms-of-service-9db0094a1e0f\u0026quot;, \u0026quot;fb:app_id\u0026quot;: \u0026quot;542599432471018\u0026quot;, \u0026quot;og:description\u0026quot;: \u0026quot;These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). 
By using…\u0026quot;, \u0026quot;twitter:description\u0026quot;: \u0026quot;These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…\u0026quot;, \u0026quot;author\u0026quot;: \u0026quot;Medium\u0026quot;, \u0026quot;og:type\u0026quot;: \u0026quot;article\u0026quot;, \u0026quot;twitter:card\u0026quot;: \u0026quot;summary\u0026quot;, \u0026quot;article:publisher\u0026quot;: \u0026quot;https://www.facebook.com/medium\u0026quot;, \u0026quot;article:author\u0026quot;: \u0026quot;https://medium.com/@Medium\u0026quot;, \u0026quot;robots\u0026quot;: \u0026quot;index, follow\u0026quot;, \u0026quot;article:published_time\u0026quot;: \u0026quot;2012-08-13T22:54:53.510Z\u0026quot;, \u0026quot;twitter:creator\u0026quot;: \u0026quot;@Medium\u0026quot;, \u0026quot;twitter:site\u0026quot;: \u0026quot;@Medium\u0026quot;, \u0026quot;og:site_name\u0026quot;: \u0026quot;Medium\u0026quot;, \u0026quot;twitter:label1\u0026quot;: \u0026quot;Reading time\u0026quot;, \u0026quot;twitter:data1\u0026quot;: \u0026quot;5 min read\u0026quot;, \u0026quot;twitter:app:name:iphone\u0026quot;: \u0026quot;Medium\u0026quot;, \u0026quot;twitter:app:id:iphone\u0026quot;: \u0026quot;828256236\u0026quot;, \u0026quot;twitter:app:url:iphone\u0026quot;: \u0026quot;medium://p/9db0094a1e0f\u0026quot;, \u0026quot;al:ios:app_name\u0026quot;: \u0026quot;Medium\u0026quot;, \u0026quot;al:ios:app_store_id\u0026quot;: \u0026quot;828256236\u0026quot;, \u0026quot;al:android:package\u0026quot;: \u0026quot;com.medium.reader\u0026quot;, \u0026quot;al:android:app_name\u0026quot;: \u0026quot;Medium\u0026quot;, \u0026quot;al:ios:url\u0026quot;: \u0026quot;medium://p/9db0094a1e0f\u0026quot;, \u0026quot;al:android:url\u0026quot;: \u0026quot;medium://p/9db0094a1e0f\u0026quot;, \u0026quot;al:web:url\u0026quot;: \u0026quot;https://medium.com/policy/medium-terms-of-service-9db0094a1e0f\u0026quot; 
} } I\u0026rsquo;ve omitted the content field due to its verbosity. Our problem requires that we use the fields published.$date and meta_tags.twitter:data1.\nhead -n10 train.json | jq '[.published[\u0026quot;$date\u0026quot;], .meta_tags[\u0026quot;twitter:data1\u0026quot;]] | @csv' -r \u0026quot;2012-08-13T22:54:53.510Z\u0026quot;,\u0026quot;5 min read\u0026quot; \u0026quot;2015-08-03T07:44:50.331Z\u0026quot;,\u0026quot;7 min read\u0026quot; \u0026quot;2017-02-05T13:08:17.410Z\u0026quot;,\u0026quot;2 min read\u0026quot; \u0026quot;2017-05-06T08:16:30.776Z\u0026quot;,\u0026quot;3 min read\u0026quot; \u0026quot;2017-06-04T14:46:25.772Z\u0026quot;,\u0026quot;4 min read\u0026quot; \u0026quot;2017-04-02T16:21:15.171Z\u0026quot;,\u0026quot;7 min read\u0026quot; \u0026quot;2016-08-15T04:16:02.103Z\u0026quot;,\u0026quot;12 min read\u0026quot; \u0026quot;2015-01-14T21:31:07.568Z\u0026quot;,\u0026quot;5 min read\u0026quot; \u0026quot;2014-02-11T04:11:54.771Z\u0026quot;,\u0026quot;4 min read\u0026quot; \u0026quot;2015-10-25T02:58:05.551Z\u0026quot;,\u0026quot;8 min read\u0026quot; 3. Building the time-series: The good, the bad and the ugly. %matplotlib inline import json import pandas as pd import numpy as np import os import dask.bag as db from toolz.curried import get from typing import Dict HOME = os.environ['HOME'] KAGGLE_DATASET_HOME = '.kaggle/competitions/how-good-is-your-medium-article/' train_file = f'{HOME}/{KAGGLE_DATASET_HOME}/train.json' train_sample_file = f'{HOME}/{KAGGLE_DATASET_HOME}/train-sample.json' MEGABYTES = 1024**2 The Ugly read_json loads each json as a record, parsing each object beforehand.\n( pd .read_json(train_sample_file, lines=True) [['published', 'meta_tags']] ) published meta_tags 0 {'$date': '2012-08-13T22:54:53.510Z'} {'viewport': 'width=device-width, initial-scal... 
1 {'$date': '2015-08-03T07:44:50.331Z'} {'viewport': 'width=device-width, initial-scal... 2 {'$date': '2017-02-05T13:08:17.410Z'} {'viewport': 'width=device-width, initial-scal... 3 {'$date': '2017-05-06T08:16:30.776Z'} {'viewport': 'width=device-width, initial-scal... 4 {'$date': '2017-06-04T14:46:25.772Z'} {'viewport': 'width=device-width, initial-scal... Both columns have object values. Our fields of interest can be extracted and assigned to a new column using the assign function.\n( _ .assign( published_timestamp = lambda df: df['published'].apply(dict.get, args=('$date',)), read_time = lambda df: df['meta_tags'].apply(dict.get, args=('twitter:data1',)), ) ) published meta_tags published_timestamp read_time 0 {'$date': '2012-08-13T22:54:53.510Z'} {'viewport': 'width=device-width, initial-scal... 2012-08-13T22:54:53.510Z 5 min read 1 {'$date': '2015-08-03T07:44:50.331Z'} {'viewport': 'width=device-width, initial-scal... 2015-08-03T07:44:50.331Z 7 min read 2 {'$date': '2017-02-05T13:08:17.410Z'} {'viewport': 'width=device-width, initial-scal... 2017-02-05T13:08:17.410Z 2 min read 3 {'$date': '2017-05-06T08:16:30.776Z'} {'viewport': 'width=device-width, initial-scal... 2017-05-06T08:16:30.776Z 3 min read 4 {'$date': '2017-06-04T14:46:25.772Z'} {'viewport': 'width=device-width, initial-scal... 2017-06-04T14:46:25.772Z 4 min read Extracting the time value in read_time can be done with pd.Series.str processing methods.
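As a quick aside, the vectorized .str accessor behaves like applying the same string operation to each element of the Series. A minimal sketch with hypothetical read-time strings (not taken from the dataset):

```python
import pandas as pd

# Hypothetical read-time strings, in the same format as the Medium metadata
s = pd.Series(['5 min read', '12 min read'])

# Vectorized equivalent of applying "x.split(' ')[0]" to each element x
minutes = s.str.split(' ').str[0].astype(int)
print(minutes.tolist())  # [5, 12]
```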
When called, the equivalent function is applied to each value, hence .str.split(' ').str[0] is equivalent to '5 min read'.split(' ')[0].\nastype casts our columns to the necessary dtypes.\n( _ .assign(read_time = lambda df: df['read_time'].str.split(' ').str[0]) .astype({ 'read_time': int, 'published_timestamp': 'datetime64[ns]' }) .set_index('published_timestamp') ['read_time'] .to_frame() ) read_time published_timestamp 2012-08-13 22:54:53.510 5 2015-08-03 07:44:50.331 7 2017-02-05 13:08:17.410 2 2017-05-06 08:16:30.776 3 2017-06-04 14:46:25.772 4 The Bad The issue with The Ugly solution is that read_json loads the entire dataset into memory before slicing the necessary columns (published and meta_tags). Pre-processing our data with pure Python consumes less RAM.\ndef make_datum(x): return { 'published_timestamp': x['published']['$date'], 'read_time': x['meta_tags']['twitter:data1'] } with open(train_sample_file, 'r') as f: bad_df = pd.DataFrame([make_datum(json.loads(x)) for x in f]) bad_df published_timestamp read_time 0 2012-08-13T22:54:53.510Z 5 min read 1 2015-08-03T07:44:50.331Z 7 min read 2 2017-02-05T13:08:17.410Z 2 min read 3 2017-05-06T08:16:30.776Z 3 min read 4 2017-06-04T14:46:25.772Z 4 min read ( _ .assign(read_time = lambda x: x['read_time'].str.split(' ').str[0]) .astype({ 'published_timestamp': 'datetime64[ns]', 'read_time': int }) .set_index('published_timestamp') ['read_time'] .to_frame() ) read_time published_timestamp 2012-08-13 22:54:53.510 5 2015-08-03 07:44:50.331 7 2017-02-05
13:08:17.410 2 2017-05-06 08:16:30.776 3 2017-06-04 14:46:25.772 4 The Good Dask allows us to build lazy computational graphs. For example, db.read_text will return a reference to each line of our jsonl file. Then, .map applies json.loads to each line and .to_dataframe casts the data to a dask DataFrame, preserving only the columns we explicitly tell it to (in this case published and meta_tags). The rest of the code proceeds analogously to the previous implementations. The only difference is that dask won\u0026rsquo;t actually process anything until we call the .compute method, which returns a pandas DataFrame. In other words, a dask DataFrame is a lazy version of a pandas DataFrame. The same is true for Series.\nNotice how we pass the blocksize parameter as 100 MB. Since our file is 2 GB, dask creates 20 independent partitions. Most methods that are called (like .map and .assign) run in parallel, potentially speeding up computation significantly. Memory is also spared, since we only load the fields we need.\ndag = ( db .read_text(train_file, blocksize=100*MEGABYTES) .map(json.loads) .to_dataframe({ 'published': object, 'meta_tags': object }) .assign( published_timestamp=lambda df: df['published'].apply(get('$date')), read_time=lambda df: df['meta_tags'].apply(get('twitter:data1')).str.split(' ').str[0], ) .astype({ 'published_timestamp': 'datetime64[ns]', 'read_time': int }) [['published_timestamp', 'read_time']] ) dag Dask DataFrame Structure: published_timestamp read_time npartitions=20 datetime64[ns] int64 ... ... ... ... ... ... ... ... ... Dask Name: getitem, 120 tasks Other methods like .head(N) also force the dataframe to be computed.
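The lazy behaviour can be seen on a toy bag, without reading any file. A minimal sketch with hypothetical records (not part of the original analysis):

```python
import dask.bag as db

# Build a graph over toy records; nothing is executed at this point
dag = (
    db.from_sequence(
        [{'read_time': '5 min read'}, {'read_time': '7 min read'}],
        npartitions=2,
    )
    .map(lambda x: int(x['read_time'].split(' ')[0]))
)

# Only .compute() (or .head(), etc.) triggers execution of the graph
print(dag.compute())  # [5, 7]
```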
Since .head(N) needs only the first N rows, dask will partially evaluate the graph so that only those rows are processed.\n( _ .head() .set_index('published_timestamp') ) read_time published_timestamp 2012-08-13 22:54:53.510 5 2015-08-03 07:44:50.331 7 2017-02-05 13:08:17.410 2 2017-05-06 08:16:30.776 3 2017-06-04 14:46:25.772 4 4. Prediction Here we will implement simple linear regression for a single variable with intercept, $y = \\alpha + \\beta x$. The closed-form solution is:\n$$\\beta = \\frac{\\text{cov}(x,y)}{\\text{var}(x)}$$ $$\\alpha = \\bar{y} - \\beta \\bar{x}$$\nwhere $\\bar{y}$ and $\\bar{x}$ are the average values of the vectors $y$ and $x$, respectively.\ndef linear_regression(y: np.array, x: np.array, prefix='') -\u0026gt; Dict[str, float]: M = np.cov(x, y) beta = M[0,1]/M[0,0] alpha = y.mean() - beta*x.mean() return { prefix + 'alpha': alpha, prefix + 'beta': beta } df = ( dag .compute() .set_index('published_timestamp') ['2015':] ['read_time'] .groupby(lambda i: pd.to_datetime(i.strftime('%Y/%m'))) .agg(['mean', 'sum']) ) df.head() mean sum 2015-01-01 8.380952 4928 2015-02-01 7.887564 4630 2015-03-01 7.907840 5749 2015-04-01 7.667149 5298 2015-05-01 8.307506 6862 df.plot( style=['o--', 'og--'], figsize=(12,6), subplots=True, title='Medium read time', fontsize=12 ); df_pred = ( df .assign(**linear_regression(df['mean'], df.index.asi8, prefix='mean_')) .assign(**linear_regression(df['sum'], df.index.asi8, prefix='sum_')) .assign(mean_pred = lambda z: z['mean_alpha'] + z['mean_beta']*z.index.asi8) .assign(sum_pred = lambda z: z['sum_alpha'] + z['sum_beta']*z.index.asi8) ) df_pred[['mean_alpha', 'mean_beta', 'sum_alpha', 'sum_beta', 'mean_pred', 'sum_pred']].head()
mean_alpha mean_beta sum_alpha sum_beta mean_pred sum_pred 2015-01-01 27.130799 -1.351069e-17 -407905.660648 2.892460e-13 7.944669 2844.030846 2015-02-01 27.130799 -1.351069e-17 -407905.660648 2.892460e-13 7.908482 3618.747348 2015-03-01 27.130799 -1.351069e-17 -407905.660648 2.892460e-13 7.875797 4318.491286 2015-04-01 27.130799 -1.351069e-17 -407905.660648 2.892460e-13 7.839610 5093.207789 2015-05-01 27.130799 -1.351069e-17 -407905.660648 2.892460e-13 7.804590 5842.933436 ( df_pred [['mean', 'mean_pred']] .plot(figsize=(12,5), style=['o--', '--']) ); ( df_pred [['sum', 'sum_pred']] .plot(figsize=(12,5), style=['o--', '--']) ); ","date":1521417600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1521417600,"objectID":"9f05f5b19a60edcdba7c3b8d47883997","permalink":"/blog/dask/","publishdate":"2018-03-19T00:00:00Z","relpermalink":"/blog/dask/","section":"blog","summary":"How can we keep the simplicity and power of pandas, while extending it to be out-of-core and parallel?","tags":null,"title":"Ad hoc Big Data Analysis with Dask","type":"blog"},{"authors":null,"categories":null,"content":"","date":-62135596800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":-62135596800,"objectID":"c4ed11351608279928032a6eb74e6f37","permalink":"/opensource/dask/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/opensource/dask/","section":"opensource","summary":"Parallel computing with task scheduling","tags":null,"title":"Dask","type":"opensource"},{"authors":null,"categories":null,"content":"","date":-62135596800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":-62135596800,"objectID":"f2177a3e7f615bb63287da6aadfdc05b","permalink":"/opensource/dask-ml/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/opensource/dask-ml/","section":"opensource","summary":"Scalable Machine Learning
with Dask","tags":null,"title":"Dask-ML","type":"opensource"},{"authors":null,"categories":null,"content":"Things I think are worth reading. Consider leaving a Star if this helps you.\nStatistical Learning Theory\n Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [book] [lectures] Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. Vol. 4. New York, NY, USA:: AMLBook, 2012. [book] ISIT 2018 - S. Kannan, H. Kim \u0026amp; S. Oh - Deep learning and information theory An Emerging Interface [video] [slides] Ashish Khisti. ECE1508: Introduction to Statistical Learning [course] Information Theory\n Shannon, Claude Elwood. \u0026ldquo;A mathematical theory of communication.\u0026rdquo; Bell system technical journal 27.3 (1948): 379-423. [paper] Cover, Thomas M., and Joy A. Thomas. Elements of information theory. John Wiley \u0026amp; Sons, 2012. Gleick, James. The information: A history, a theory, a flood. Vintage, 2012. Probabilistic Graphical Models\n MAC6916: Probabilistic Graphical Models [course] DAFT Beautifully rendered probabilistic graphical models. [code] [docs] Deep Learning Theory\n Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016. [book] Machine Learning for Health\n CSC2541HS: Topics in Machine Learning [course] Ghassemi, Marzyeh, et al. \u0026ldquo;Opportunities in machine learning for healthcare.\u0026rdquo; arXiv preprint arXiv:1806.00388 (2018). [paper] Python Data Analysis\n Brandon Rhodes - Pandas From The Ground Up - PyCon 2015 [video] Tom Augspurger. Modern Pandas. [post] ","date":-62135596800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":-62135596800,"objectID":"46049c794a9829e53b0f15bbb4f2969a","permalink":"/reading-list/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/reading-list/","section":"","summary":"Things I think are worth reading. 
Consider leaving a Star if this helps you.\nStatistical Learning Theory\n Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [book] [lectures] Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. Vol. 4. New York, NY, USA:: AMLBook, 2012. [book] ISIT 2018 - S. Kannan, H. Kim \u0026amp; S. Oh - Deep learning and information theory An Emerging Interface [video] [slides] Ashish Khisti.","tags":null,"title":"Reading List","type":"page"}]