Mazeinda provides functions for detecting and testing the significance of pairwise Monotonic Association for Zero-Inflated non-negative Data with the measure proposed by Pimentel(2009)[1]. This package can handle any degree of zero inflation as well as non-zero ties. The unique requirement is the independence of the entries of the vectors to be correlated. Pairwise association and testing of a large number of vectors can be expedited with parallel computations thanks to the packages “foreach” and “doMC”.
Recall that two pairs of jointly distributed non-negative
continuous random variables (Xi, Yi)
and (Xj, Yj)
are said to be concordant if (Xi − Xj)(Yi − Yj) > 0
and discordant if (Xi − Xj)(Yi − Yj) < 0.
There are 16 possible different realizations of two pairs of the
realization of the random variables (xi, yi)
and (xj, yj)
displayed in Table below.
To calculate the probability of concordance we apply the law of total probability using the two left most columns of Table , it follows that Since for positive values of the continuous random variables X and Y we have P(Xi < Xj) = 1 − P(Xi > Xj) and P(Yi < Yj) = 1 − P(Yi > Yj), reduces to
The same logic can be applied for the calculation the probability of discordance using the first and third columns of Table to show that
Finally, Kendall’s tau formula is the difference of and so that
##Simulations The choice of the measure proposed in Pimentel(2009)[1] over the one by Pimentel, Niewiadomska-Bugajb and Wang(2015) [3] is motivated by the proof above and by the following simulations which suggest that the method considered in this package has superior performance in terms of bias and power.
Let:
The zero-inflated data is derived from a beta zero-inflated distribution with parameters (0.5, 1) and probability mass at 0 of 0.3, 0.5 and 0.8. The first figure below shows the histograms of τb − τp and the following the histogram of τb − τt obtained with 10.000 vectors with 30 entries.
While in the histograms for 0.8 inflation τt and τp seem to behave similarly and the mode in the histogram for 0.8 of the difference d is close to zero, the histograms for 0.3 and 0.5 inflation suggest that τt is a better estimator for Kendall’s τb as the mode is centered at 0.
The means of the values plotted in the histograms above confirm these results:
tau_p | tau_t | |
---|---|---|
infl=0.3 | 0.1328465 | 0.0021578 |
infl=0.5 | 0.0536823 | -0.0005492 |
infl=0.8 | -0.0024742 | 0.0327359 |
##Comments about τ* and code The formula for the variance of τ* proposed in Pimentel’s thesis [1] cannot be applied when the parameters pij attain the values 0 or 1: the proof relies on the delta method which can be applied only when functions are differentiable. At the extrema of the interval [0,1] no function is differentiable, therefore, whenever any of the parameters pij attains the values 1 or 0, the standard R cor.test function for Kendall’s correlation is used to obtain p-values, when applicable, as the function does not work in case of vectors with zero variance.
In case of non-zero ties, τ11 is calculated as Kendall’s τb. If all the non-zero entries of at least one vector are tied, τ11 os set to 0. This choice is in accordance with the treatment of ties in Kendall (1970)[2]. τ11 is set to 0 also in the case of only one pair of non-zero corresponding entries.
##Examples Mazeinda provides for the user three functions: for obtaining a matrix of association values, $test \textunderscore associations$ for obtaining the p-values of the output of associate and which gives non-zero values only if the association is significantly different from 0. It should be noted that independence of two variates implies a zero value of τ* , but it is not a sufficient and necessary condition and this affects the power of the test, as dependent cases might get rejected. All these functions can be called with two vectors as arguments or with two matrices with vectors to be correlated as columns. If the vectors to be correlated belong to one matrix m, m should be specified as the argument m1 and as the argument m2.
Note that if the parameters pij attain the values 0 or 1, since the R function cor.test must be used, the function combine reports the value of Kendall’s τb given by the stardart R cor function, if significantly different from 0. The R cor cannot handle vectors with standard deviation 0, therefore such vectors will yield NA. Below are some examples; warnings are suppressed because the cor.test function prints “cannot compute exact p-value with ties”.
library(mazeinda)
set.seed(234)
x=abs(rBEZI(50, mu = 0.9, sigma = 1, nu = 0.7))
y=abs(rBEZI(50, mu = 0.9, sigma = 1, nu = 0.2))
associate(x,y)
## [1] -0.0256
## [1] 0.6678331
## [1] 0
x=matrix(abs(rBEZI(50, mu = 0.9, sigma = 1, nu = 0.6)),5)
y=matrix(abs(rBEZI(30, mu = 0.9, sigma = 1, nu = 0.3)),5)
knitr::kable(associate(x,y), digits=3, caption="associate(x,y)")
0.120 | -0.756 | 0.080 | 0.080 | -0.120 | -0.120 |
0.316 | -0.500 | 0.378 | -0.378 | 0.000 | -0.632 |
0.527 | 0.240 | -0.080 | -0.080 | 0.316 | -0.527 |
0.000 | 0.667 | -0.378 | -0.378 | 0.316 | 0.000 |
-0.120 | -0.080 | 0.714 | -0.571 | -0.359 | -0.837 |
0.359 | -0.080 | -0.571 | 1.000 | 0.359 | 0.598 |
0.359 | 0.630 | -0.571 | 0.080 | 0.598 | 0.120 |
0.105 | -0.444 | -0.080 | -0.080 | -0.105 | -0.105 |
-0.600 | -0.316 | 0.359 | -0.120 | -0.800 | 0.000 |
0.316 | -0.500 | 0.378 | -0.378 | 0.000 | -0.632 |
0.782 | 0.087 | 0.355 | 0.355 | 0.782 | 0.782 |
0.480 | 0.277 | 0.429 | 0.429 | 1.000 | 0.157 |
0.207 | 0.232 | 0.645 | 0.645 | 0.448 | 0.207 |
1.000 | 0.147 | 0.429 | 0.429 | 0.480 | 1.000 |
0.782 | 0.645 | 0.119 | 0.213 | 0.405 | 0.052 |
0.405 | 0.645 | 0.213 | 0.029 | 0.405 | 0.166 |
0.405 | 0.154 | 0.213 | 0.355 | 0.166 | 0.782 |
0.801 | 0.298 | 0.645 | 0.645 | 0.801 | 0.801 |
0.233 | 0.448 | 0.405 | 0.782 | 0.083 | 1.000 |
0.480 | 0.277 | 0.429 | 0.429 | 1.000 | 0.157 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
knitr::kable(suppressWarnings(associate(x,x)),
digits=3, caption="associate all vectors in the matrix x pairwise")
1.000 | 0.756 | -0.080 | -0.378 | 0.080 | 0.080 | -0.571 | 0.882 | 0.359 | 0.756 |
0.756 | 1.000 | 0.667 | -0.250 | 0.756 | -0.378 | -0.378 | 0.667 | 0.000 | 1.000 |
-0.080 | 0.667 | 1.000 | 0.333 | -0.080 | -0.080 | 0.378 | 0.240 | -0.316 | 0.667 |
-0.378 | -0.250 | 0.333 | 1.000 | -0.378 | -0.378 | 0.756 | 0.000 | -0.316 | -0.250 |
0.080 | 0.756 | -0.080 | -0.378 | 1.000 | -0.571 | -0.571 | -0.080 | 0.120 | 0.756 |
0.080 | -0.378 | -0.080 | -0.378 | -0.571 | 1.000 | 0.080 | -0.080 | -0.120 | -0.378 |
-0.571 | -0.378 | 0.378 | 0.756 | -0.571 | 0.080 | 1.000 | -0.080 | -0.598 | -0.378 |
0.882 | 0.667 | 0.240 | 0.000 | -0.080 | -0.080 | -0.080 | 1.000 | 0.316 | 0.667 |
0.359 | 0.000 | -0.316 | -0.316 | 0.120 | -0.120 | -0.598 | 0.316 | 1.000 | 0.000 |
0.756 | 1.000 | 0.667 | -0.250 | 0.756 | -0.378 | -0.378 | 0.667 | 0.000 | 1.000 |
If the number of vectors to be associated is extensive,
computations can be done in parallel for any of the functions in the
package by specifying “parallel=TRUE” and “n_cor=n”, where n is the
number for cores desired.
The parameters pij’s
necessary for calculating τ*
are by default calculated as proportions based on the entries of each
pair of vectors to be associated. If the parameters are known, they can
be provided to any of the functions in the package by specifying the
values as arguments. Only p11, p01 and p10 can be specified.
Note that if one of these parameters is specified as an argument, the
other two must be specified as well. The parameters pij’s
can be estimated as by the population mean, calculated using all the
pairs of column vectors from the matrices specified as the arguments d1
and d2.
knitr::kable(associate(x,y, estimator="own", p11=0.1,p10=0.1, p01=0.1), digits=3,
caption="parameters specified")
0.130 | 0.12 | 0.12 | 0.12 | 0.130 | 0.120 |
0.120 | 0.12 | 0.12 | 0.12 | 0.120 | 0.120 |
0.117 | 0.13 | 0.12 | 0.12 | 0.110 | 0.110 |
0.120 | 0.12 | 0.12 | 0.12 | 0.120 | 0.120 |
0.120 | 0.12 | 0.11 | 0.12 | 0.120 | 0.120 |
0.130 | 0.12 | 0.12 | 0.13 | 0.130 | 0.110 |
0.110 | 0.13 | 0.12 | 0.12 | 0.110 | 0.110 |
0.123 | 0.12 | 0.12 | 0.12 | 0.117 | 0.130 |
0.117 | 0.11 | 0.13 | 0.12 | 0.110 | 0.123 |
0.120 | 0.12 | 0.12 | 0.12 | 0.120 | 0.120 |
m1=matrix(abs(rBEZI(50, mu = 0.9, sigma = 1, nu = 0.7)),5)
m2=matrix(abs(rBEZI(30, mu = 0.9, sigma = 1, nu = 0.3)),5)
knitr::kable(associate(x,y, estimator="mean", d1=m1,d2=m2), digits=3,
caption="parameters estimated as population mean")
0.044 | 0.000 | 0.000 | 0.000 | 0.044 | 0.000 |
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
-0.015 | 0.044 | 0.000 | 0.000 | -0.044 | -0.044 |
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
0.000 | 0.000 | -0.044 | 0.000 | 0.000 | 0.000 |
0.044 | 0.000 | 0.000 | 0.044 | 0.044 | -0.044 |
-0.044 | 0.044 | 0.000 | 0.000 | -0.044 | -0.044 |
0.015 | 0.000 | 0.000 | 0.000 | -0.015 | 0.044 |
-0.015 | -0.044 | 0.044 | 0.000 | -0.044 | 0.015 |
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |