The TDAvec
package provides implementations of
several vector summaries of persistence diagrams studied in Topological
Data Analysis (TDA). Each is obtained by discretizing the associated
summary function computed from a persistence diagram. The summary
functions included in this package are
For improved computational efficiency, all code behind the vector
summaries is written in C++
using the Rcpp
package. Whenever applicable, when compare our code with existing
implementations in terms of accuracy and run-time cost. In this
vignette, we illustrate the basic usage of the TDAvec
package using simple examples.
Let’s first load the required libraries.
library(TDAvec)
library(TDA) # to compute persistence diagrams
library(microbenchmark) # to compare computational costs
Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:
N <- 100 # point cloud size
set.seed(123)
X <- circleUnif(N) + rnorm(2*N,mean = 0,sd = 0.2)
# plot the point cloud
plot(X,pch = 20,asp = 1)
Next, we use the TDA
package to compute the persistence
diagram (PD) using the Vietoris-Rips filtration built on top of the
point cloud X.
D <- ripsDiag(X,maxdimension = 1,maxscale = 2)$diagram
sum(D[,1]==0) # number of connected components
#> [1] 100
sum(D[,1]==1) # number of loops
#> [1] 13
sum(D[,1]==2) # number of voids
#> [1] 0
In the ripsDiag()
function, maxdimension
is
the maximum homological dimension of the topological features to be
computed (connected components if maxdimension=0; connected components
and loops if 1; connected components, loops and voids if 2, etc.)
maxscale
is the maximum value of the scale parameter of the
filtration (which we set equal to 2 since the points are sampled from a
circle with diameter 2).
The persistence diagram D
has 100 connected components (the same as the point cloud size), 13
loops (one with long life-span, the rest are short-lived) and 0 voids
along with their birth
and death
values. To
plot the diagram, we can use the plot()
function.
In the plot, the solid dots and red triangles represent connected components and loops respectively.
Let’s compute a vector summary of the persistence landscape (PL) for homological dimension H0.
# sequence of scale values to vectorize the summary function
scaleSeq = seq(0,2,length.out=11)
# compute the PL vector summary for homological dimension H_0
computePL(D,homDim = 0,scaleSeq,k=1)
#> [1] 0.0 0.2 0.4 0.6 0.8 1.0 0.8 0.6 0.4 0.2 0.0
Here, the vectorization is performed by evaluating the PL function at
each element of scaleSeq
(i.e. 0.0, 0.2, 0.4, …, 2.0) and arranging the
values into a vector. The parameter k
in
computePL()
is the order of the persistence landscape
function (by default k = 1).
To compute an H1
vector summary, set the homDim
argument equal to 1:
# compute the PL vectory summary for homological dimension H_1
computePL(D,homDim = 1,scaleSeq,k=1)
#> [1] 0.000000000 0.014217630 0.000931983 0.191114536 0.391114536 0.269212150
#> [7] 0.069212150 0.000000000 0.000000000 0.000000000 0.000000000
The TDA
package also provides an implementation of the
persistence landscapes. Below we compare the two implementations in
terms of accuracy of results and run-time cost.
pl1 <- computePL(D,homDim = 0,k=1,scaleSeq)
pl2 <- as.vector(landscape(D,dimension = 0,KK = 1, tseq = scaleSeq))
all.equal(pl1,pl2) # -> TRUE (the results are the same)
#> [1] TRUE
compCost <- microbenchmark(
computePL(D,homDim = 0,k=1,scaleSeq),
landscape(D,dimension = 0,KK = 1, tseq = scaleSeq),
times = 500
)
sm <- summary(compCost)
costRatioPL <- sm$mean[2]/sm$mean[1] # ratio of computational time means
For homological dimension H0, it took
TDA::landscape()
about 181 times more time than
TDAvec::computePL()
.
To discretize all the other univariate summary functions (i.e., persistence silhouette, persistent entropy summary function, Euler characteristic curve, normalized life curve and Betti curve), we employ a different vectorization scheme. Instead of evaluating a summary function at increasing scales and arranging the values into a vector, we compute the average values of the summary function between two consecutive scale points using integration. More specifically, if f is a (univariate) summary function and t1, t2, …, tn are increasing scale values, we discretize f as:
where Δtk = tk + 1 − tk, k = 1, …, n − 1. For the above five summary functions, the computation of integrals can be done analytically and efficiently implemented. Note that in this case, vector summaries lie in ℝn − 1, where the dimension is one less than the number of scale values.
The TDA::silhouette()
function vectorizes the
persistence silhouette function the same way as the
TDA::landscape()
function does. However,
TDA::silhouette()
and TDAvec::computePS()
provide similar results if a dense grid of scale values is used for
vectorization:
n <- 101
scaleSeq = seq(0,2,length.out=n)
ps1 <- computePS(D,homDim = 0, p = 1,scaleSeq)
ps2 <- as.vector(silhouette(D,dimension = 0,p = 1,tseq = scaleSeq))
# plot two vector summaries
plot(scaleSeq[1:(n-1)]+1/(n-1),ps1,
type="l",col="red",xlab="x",ylab="y",lty=1)
lines(scaleSeq,ps2,type='l',col='blue',lty=2)
legend(1.48, 0.122, legend=c("TDAvec","TDA"),
col=c("red","blue"),lty=1:2,cex=0.7)
compCost <- microbenchmark(
computePS(D,homDim = 0,p = 1,scaleSeq),
silhouette(D,dimension = 0,p = 1,tseq = scaleSeq),
times = 500
)
sm <- summary(compCost)
costRatioPS <- sm$mean[2]/sm$mean[1]
print(costRatioPS)
#> [1] 3.854356
The p in
TDAvec::computePS()
and TDA::silhouette()
is
the power of the weights for the silhouette function (by default p = 1). For the above example, the
former was about 4 times faster to than the latter for homological
dimension H0.
The syntax and usage of the remaining univariate summary functions are very similar.
# sequence of scale values to vectorize the summary function
scaleSeq = seq(0,2,length.out=11)
# Persistent Entropy Summary (PES) function
# compute PES for homological dimension H0
computePES(D,homDim = 0,scaleSeq)
#> [1] 4.8268301 1.0173158 0.3720045 0.3720045 0.3720045 0.3720045 0.3720045
#> [8] 0.3720045 0.3720045 0.3720045
# compute PES for homological dimension H1
computePES(D,homDim = 1,scaleSeq)
#> [1] 0.05677674 0.26458225 0.33845861 0.34789260 0.34789260 0.34789260
#> [7] 0.12039198 0.00000000 0.00000000 0.00000000
# Euler Characteristic Curve (ECC)
computeECC(D,maxhomDim = 1,scaleSeq) # maxhomDim = maximal homological dimension considered
#> [1] 65.64393340 5.93893485 -0.02801848 0.00000000 0.00000000 0.00000000
#> [7] 0.65393925 1.00000000 1.00000000 1.00000000
# Vector of Averaged Bettis (VAB) - a vectorization of Betti Curve
# compute VAB for homological dimension H0
computeVAB(D,homDim = 0,scaleSeq)
#> [1] 65.928137 7.313179 1.000000 1.000000 1.000000 1.000000 1.000000
#> [8] 1.000000 1.000000 1.000000
# compute VAB for homological dimension H1
computeVAB(D,homDim = 1,scaleSeq)
#> [1] 0.2842034 1.3742440 1.0280185 1.0000000 1.0000000 1.0000000 0.3460608
#> [8] 0.0000000 0.0000000 0.0000000
# Normalized Life (NL) Curve
# compute NL for homological dimension H0
computeNL(D,homDim = 0,scaleSeq)
#> [1] 0.8171309 0.2341008 0.1230901 0.1230901 0.1230901 0.1230901 0.1230901
#> [8] 0.1230901 0.1230901 0.1230901
# compute NL for homological dimension H1
computeNL(D,homDim = 1,scaleSeq)
#> [1] 0.01309940 0.06318557 0.68241758 0.71307326 0.71307326 0.71307326
#> [7] 0.24676667 0.00000000 0.00000000 0.00000000
To discretize the persistence surface and persistence block, we first need to switch from the birth-death to the birth-persistence coordinates.
The resulting vector summaries are called the persistence image (PI) and the vectorized of persistence block (VPB) respectively.
# Persistence Image (PI)
resB <- 5 # resolution (or grid size) along the birth axis
resP <- 5 # resolution (or grid size) along the persistence axis
# find min and max persistence values
minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
# construct one-dimensional grid of scale values
ySeqH0 <- seq(minPH0,maxPH0,length.out=resP+1)
# default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
sigma <- 0.5*(maxPH0-minPH0)/resP
# compute PI for homological dimension H_0
computePI(D,homDim=0,xSeq=NA,ySeqH0,sigma)
#> [1] 4.56670703 0.96562735 0.01135007 0.02272340 0.47724987
# Vectorized Persistence Block (VPB)
tau <- 0.3 # parameter in [0,1] which controls the size of blocks around each point of the diagram
# compute VPB for homological dimension H_0
computeVPB(D,homDim = 0,xSeq=NA,ySeqH0,tau)
#> [1] 3.90196609 0.06216766 0.00000000 0.77425106 1.80204108
Since the H0 features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.
For homological dimension H1, the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. For the VPB summary, since the blocks around each point of the persistence diagram have different sizes, we construct the grid with scale values spread out non-uniformly (i.e. the rectangular grid cells have different dimensions). In experiments, this construction of the grid tends to provide better performance over the grid with equal cell sizes.
# PI
# find min & max birth and persistence values
minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
xSeqH1 <- seq(minBH1,maxBH1,length.out=resB+1)
ySeqH1 <- seq(minPH1,maxPH1,length.out=resP+1)
sigma <- 0.5*(maxPH1-minPH1)/resP
# compute PI for homological dimension H_1
computePI(D,homDim=1,xSeqH1,ySeqH1,sigma)
#> [1] 4.371601e-02 4.955567e-02 4.512378e-02 3.299957e-02 1.864442e-02
#> [6] 7.527654e-03 7.760847e-03 6.305986e-03 4.250972e-03 2.278547e-03
#> [11] 6.038006e-05 5.841014e-05 4.427399e-05 3.094726e-05 2.021064e-05
#> [16] 1.834675e-04 8.921009e-04 2.714215e-03 5.171823e-03 6.175378e-03
#> [21] 3.853840e-03 1.874023e-02 5.701774e-02 1.086451e-01 1.297270e-01
# VPB
xSeqH1 <- unique(quantile(D[D[,1]==1,2],probs = seq(0,1,by=0.2)))
ySeqH1 <- unique(quantile(D[D[,1]==1,3],probs = seq(0,1,by=0.2)))
tau <- 0.3
# compute VPB for homological dimension H_1
computeVPB(D,homDim = 1,xSeqH1,ySeqH1,tau)
#> [1] 0.0000000000 0.0004995598 0.0090337733 0.0305098818 0.0057973687
#> [6] 0.0095194496 0.0261053157 0.0203673036 0.0042870130 0.0107446149
#> [11] 0.0425768331 0.1053468321 0.1458807241 0.0000000000 0.0000000000
#> [16] 0.0328035481 0.0276785592 0.1089088594 0.1318822254 0.0168304895
#> [21] 0.2939394561 0.3162986126 0.3344510989 0.3567486932 0.3636226941
As a note, the code for computePI()
is adopted from the
pers.image()
function (available in the
kernelTDA
package) with minor modifications. For example,
pers.image()
uses a one-dimensional grid for homological
dimension H0
regardless of the filtration type. In contrast, computePI()
uses a one-dimensional grid only if additionally the birth values are
the same (which may not be true for some filtrations such as the
lower-star filtration). Moreover, pers.image()
uses a
square grid (e.g., 10 by 10) for vectorization, whereas
computePI()
is not limited to such a grid and can compute
vector summaries using a rectangular grid (e.g., 10 by 20).
Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77-102.
Chazal, F., Fasy, B. T., Lecci, F., Rinaldo, A., & Wasserman, L. (2014, June). Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the thirtieth annual symposium on Computational geometry (pp. 474-483).
Atienza, N., Gonzalez-Díaz, R., & Soriano-Trigueros, M. (2020). On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition, 107, 107509.
Richardson, E., & Werman, M. (2014). Efficient classification using the Euler characteristic. Pattern Recognition Letters, 49, 99-106.
Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 108.
Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. Advances in Computational Mathematics, 48(1), 1-42.
Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., … & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18.
Chan, K. C., Islambekov, U., Luchinsky, A., & Sanders, R. (2022). A computationally efficient framework for vector representation of persistence diagrams. Journal of Machine Learning Research, 23, 1-33.