Normal distribution in Python
Working on theoretical statistics, all of my work in my PhD was done in R. But for production code in industry, for the sake of speed and easier maintenance, I have taken up to implement some of my ideas in Python even when the prototyping is done in R. I did not expect to be learning this much by just looking at a normal distribution.
scipy.stats vs math
Suppose we want to compute the CDF of a standard normal. An R user like me who is used to pnorm(z) would probably write this
But another way is to use the erf function, given by
\[ \text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \,dt. \]
We can then write
In fact this is the given example in the Python documentations for the math library.
For my purpose, this function needs to be run many times, so speed is definitely important here. We run it for 10000 times:
scipy took: 0.9739241569999999
math took: 0.011614696000000091
Computing using erf was a lot faster than using scipy, but that should not come as a surprise. Even in R, a loop without any vectorization is bound to be slow. After all, the strength of scipy is that it works well with numpy arrays. So let’s vectorize it:
array([0.84134475, 0.84134475, 0.84134475, ..., 0.84134475, 0.84134475,
0.84134475])
scipy took: 0.013408950000000086
Too bad that in my use case, the CDF of the normal distribution has to be computed in a loop, so I will go with the erf route…
erf vs erfc
…which brings us to another question. How accurate is computing the CDF based on erf? The part where we add erf to 1.0 means that anything that is close to or smaller than the machine epsilon will get erased. What if we do care about the magnitude of the CDF far in the tail of the normal? With the time constraint, we cannot really call norm.logcdf here.
[2.8665157186802404e-07, 9.865876449133282e-10, 1.2798095916366492e-12, 6.106226635438361e-16, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
We can instead use erfc, defined as
\[ \text{erfc}(z) = 1 - \text{erf}(z). \]
Now we have
[2.866515718791945e-07, 9.865876450377014e-10, 1.2798125438858348e-12, 6.22096057427182e-16, 1.1285884059538425e-19, 7.619853024160593e-24, 1.9106595744986828e-28, 1.7764821120777016e-33, 6.117164399549921e-39, 7.793536819192798e-45]
It preserved many more digits. It is curious how the Python documentations chose to use this example for erf instead of erfc. To check if this is correct, I run in R:
[1] 2.866516e-07 9.865876e-10 1.279813e-12 6.220961e-16 1.128588e-19
[6] 7.619853e-24 1.910660e-28 1.776482e-33 6.117164e-39 7.793537e-45
which gave the exactly same numbers. Did R perhaps implement pnorm using erfc as well?