
SLOPE (Bogdan et al. 2015) stands for sorted L1 penalized estimation and is a generalization of OSCAR (Bondell and Reich 2008). As the name suggests, SLOPE is a type of $\ell_1$-regularization. More specifically, SLOPE fits generalized linear models regularized with the sorted $\ell_1$ norm. The objective in SLOPE is

$$\operatorname*{minimize}_{\beta} \left\{ f(\beta) + J(\beta; \lambda) \right\},$$

where $f(\beta)$ is typically the negative log-likelihood of some model in the family of generalized linear models and

$$J(\beta; \lambda) = \sum_{j=1}^p \lambda_j |\beta|_{(j)}, \qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0,$$

is the sorted $\ell_1$ norm, in which $|\beta|_{(1)} \geq |\beta|_{(2)} \geq \cdots \geq |\beta|_{(p)}$ are the coefficients sorted by absolute value.
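To make the penalty concrete, here is a minimal sketch in base R. The helper `sorted_l1_norm()` is my own illustration, not part of the SLOPE package: it pairs the largest coefficient (in absolute value) with the largest penalty weight, the second largest with the second largest, and so on.

```r
# Illustration only (not the package's code): compute the sorted L1 norm
# for a decreasing lambda sequence.
sorted_l1_norm <- function(beta, lambda) {
  stopifnot(length(beta) == length(lambda), all(diff(lambda) <= 0))
  sum(lambda * sort(abs(beta), decreasing = TRUE))
}

sorted_l1_norm(beta = c(0.5, -2, 1), lambda = c(3, 2, 1))
# 3 * 2 + 2 * 1 + 1 * 0.5 = 8.5
```

Setting all entries of `lambda` equal recovers a plain lasso penalty, which is exactly the special case noted in the footnote below.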

Some people will note that this penalty is a generalization of the standard $\ell_1$ norm penalty^{1}. As such, SLOPE is a type of sparse regression—just like the lasso. Unlike the lasso, however, SLOPE gracefully handles correlated features. Whereas the lasso often discards all but a few among a set of correlated features (Jia and Yu 2010), SLOPE instead *clusters* such features together by giving the features in each cluster the same coefficient in absolute value.

SLOPE 0.2.0 is a new version of the R package SLOPE featuring a range of improvements over the previous package. If you are completely new to the package, please start with the introductory vignette.

Previously, SLOPE only featured ordinary least-squares regression. Now the package also features logistic, Poisson, and multinomial regression. Just as in other similar packages, this is enabled simply by setting, for instance, `family = "binomial"` for logistic regression.

```
library(SLOPE)
fit <- SLOPE(wine$x, wine$y, family = "multinomial")
```

By default, SLOPE now fits a full regularization path instead of a model for a single penalty sequence at a time. This behavior is analogous to the default behavior of glmnet.

`plot(fit)`

The package now uses predictor screening rules to vastly improve performance in the high-dimensional regime, where predictors far outnumber observations. Screening rules are part of what makes other related packages such as glmnet so efficient. In SLOPE, we use a variant of the strong screening rules for the lasso (Tibshirani et al. 2012).
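To give a flavor of how screening works, here is a hedged sketch of the *sequential strong rule for the lasso* from Tibshirani et al. (2012)—the plain lasso rule, not SLOPE's actual variant, and `strong_rule_keep()` is a hypothetical helper of my own. Given the residuals at the previous penalty `lambda_prev`, a predictor can be discarded at the new penalty `lambda` if its absolute gradient is below `2 * lambda - lambda_prev`.

```r
# Sketch of the sequential strong rule for the lasso (not SLOPE's variant):
# keep predictor j at penalty lambda if |x_j' r| >= 2 * lambda - lambda_prev,
# where r are the residuals at the previous penalty lambda_prev.
strong_rule_keep <- function(X, r, lambda, lambda_prev) {
  drop(abs(crossprod(X, r))) >= 2 * lambda - lambda_prev
}

set.seed(1)
X <- scale(matrix(rnorm(200 * 50), 200, 50))
r <- rnorm(200)
keep <- strong_rule_keep(X, r, lambda = 0.9, lambda_prev = 1)
sum(keep)  # number of predictors that survive the screen and enter the solver
```

Predictors that fail the check are simply excluded from the optimization; the rule is not infallible, so implementations verify the discarded set against the KKT conditions afterwards.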

```
xy <- SLOPE:::randomProblem(100, 1000)
system.time({SLOPE(xy$x, xy$y, screen = TRUE)})
```

```
   user  system elapsed
  1.888   0.005   0.294
```

`system.time({SLOPE(xy$x, xy$y, screen = FALSE)})`

```
   user  system elapsed
  7.848   0.017   1.257
```

There is now a function `trainSLOPE()`, which can be used to run cross-validation for optimal selection of `sigma` and `q`. Here, we run 8-fold cross-validation repeated 5 times.

```
# 8-fold cross-validation repeated 5 times
tune <- trainSLOPE(
  subset(mtcars, select = c("mpg", "drat", "wt")),
  mtcars$hp,
  q = c(0.1, 0.2),
  number = 8,
  repeats = 5
)
plot(tune)
```

In addition, the package now also features a function `caretSLOPE()` that can be used via the excellent caret package, which enables a swath of resampling methods and comparisons.

All of the performance-critical code for SLOPE has been rewritten in C++. In addition, the package now features an ADMM solver for `family = "gaussian"`, enabled by setting `solver = "admm"` in the call to `SLOPE()`. Preliminary testing shows that this solver is faster for many designs, particularly when there is high correlation among predictors.

SLOPE now also allows sparse design matrices of classes from the Matrix package.

For a full list of changes, please see the changelog.

Bogdan, Małgorzata, Ewout van den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J. Candès. 2015. “SLOPE – Adaptive Variable Selection via Convex Optimization.” *The Annals of Applied Statistics* 9 (3): 1103–40. https://doi.org/10.1214/15-AOAS842.

Bondell, Howard D., and Brian J. Reich. 2008. “Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR.” *Biometrics* 64 (1): 115–23. https://doi.org/10.1111/j.1541-0420.2007.00843.x.

Jia, J., and B. Yu. 2010. “On Model Selection Consistency of the Elastic Net When p ≫ n.” *Statistica Sinica* 20 (2): 595–611.

Tibshirani, Robert, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J. Tibshirani. 2012. “Strong Rules for Discarding Predictors in Lasso-Type Problems.” *Journal of the Royal Statistical Society. Series B: Statistical Methodology* 74 (2): 245–66. https://doi.org/c4bb85.

Simply set $\lambda_j = \lambda$ for all $j$ and you get the lasso penalty.↩︎

The purpose of my R package eulerr is to fit and *visualize* Euler diagrams. Besides the various intricacies involved in fitting the diagrams, there are many interesting problems involved in their visualization. One of these is the labeling of the overlaps.

Naturally, simply positioning the labels at the shapes’ centers fails more often than not. Nevertheless, this strategy is employed by **venneuler**, for instance, and the plots usually demand manual tuning.

```
# an example set combination
s <- c(
  "SE" = 13,
  "Treat" = 28,
  "Anti-CCP" = 101,
  "DAS28" = 91,
  "SE&Treat" = 1,
  "SE&DAS28" = 14,
  "Treat&Anti-CCP" = 6,
  "SE&Anti-CCP&DAS28" = 1
)

library(venneuler, quietly = TRUE)
fit_venneuler <- venneuler(s)
plot(fit_venneuler)
```

Until now, I solved this in **eulerr** by, for each overlap, filling one of the involved shapes (circles or ellipses) with points and then numerically optimizing the location of the label using a Nelder–Mead optimizer. However, given that finding the distance between a point and an ellipse—at least a rotated one—itself requires a numerical solution (Eberly 2013), this procedure turned out to be quite inefficient.

R has powerful plotting functionality in general but lacks the capability to draw ellipses as true curves. High-resolution polygons are thankfully a readily available remedy and have been used in **eulerr** for several versions now.

The upside of using polygons, however, is that they are usually much easier, even if sometimes less efficient, to work with. For instance, they make constructing separate shapes for each overlap a breeze using the polyclip package (Johnson and Baddeley 2018).

And because basically all shapes in digital maps are polygons, there happens to exist a multitude of other useful tools for a wide variety of polygon-related tasks. One of these turned out to be precisely what I needed: polylabel, from the Mapbox suite. Because the details of the library have already been explained elsewhere, I will spare you them; briefly put, it uses quadtree binning to divide the polygon into square bins, pruning away dead ends. It is efficient and will, according to the authors, find a point that is “guaranteed to be a global optimum within the given precision”.

Because it appeared to be such a valuable tool for R users, I decided to write a wrapper for the C++ header for polylabel and bundle it as an R package.

```
# install.packages("polylabelr")
library(polylabelr)

# a concave polygon with a hole
x <- c(0, 6, 3, 9, 10, 12, 4, 0, NA, 2, 5, 3)
y <- c(0, 0, 1, 3, 1, 5, 3, 0, NA, 1, 2, 2)

# locate the pole of inaccessibility
p <- poi(x, y, precision = 0.01)

plot.new()
plot.window(
  range(x, na.rm = TRUE),
  range(y, na.rm = TRUE),
  asp = 1
)
polypath(x, y, col = "grey90", rule = "evenodd")
points(p, cex = 2, pch = 16)
```

The package is available on CRAN; the source code is located at https://github.com/jolars/polylabelr and is documented at https://jolars.github.io/polylabelr/.

To come back around to where we started, **polylabelr** has now been employed in the development branch of **eulerr**, where it is used to quickly and appropriately place the labels of the diagram.

```
library(eulerr)
plot(euler(s))
```

Eberly, David. 2013. “Distance from a Point to an Ellipse, an Ellipsoid, or a Hyperellipsoid.” Geometric Tools. June 28, 2013. https://www.geometrictools.com/Documentation/DistancePointEllipseEllipsoid.pdf.

Johnson, Angus, and Adrian Baddeley. 2018. *Polyclip: Polygon Clipping* (version 1.9-1). https://CRAN.R-project.org/package=polyclip.

In qualpalr, we want to pick `n` colors so that the minimal pairwise distance among them is maximized; that is, we want the most similar pair of colors to be as dissimilar as possible.

This turns out to be much less trivial than one would suspect, which posts on Computational Science, MATLAB Central, Stack Overflow, and Computer Science can attest to.

Until now, qualpalr solved this problem with a greedy approach. If we, for instance, want to find `n` points, we did the following.

```
M <- Compute a distance matrix of all points in the sample
X <- Select the two most distant points from M
for i = 3:n
  X(i) <- Select the point in M that maximizes the
          minimum distance to all points in X
```

In R, this code looked like this (in two dimensions):

```
set.seed(1)

# find n points
n <- 3

mat <- matrix(runif(100), ncol = 2)
dmat <- as.matrix(stats::dist(mat))

ind <- integer(n)
ind[1:2] <- as.vector(arrayInd(which.max(dmat), .dim = dim(dmat)))

for (i in 3:n) {
  mm <- dmat[ind, -ind, drop = FALSE]
  k <- which.max(mm[(1:ncol(mm) - 1) * nrow(mm) + max.col(t(-mm))])
  ind[i] <- as.numeric(dimnames(mm)[[2]][k])
}

par(mfrow = c(1, 2), mai = c(0.5, 0.5, 0.1, 0.1))
plot(mat, asp = 1, xlab = "", ylab = "")
plot(mat, asp = 1, xlab = "", ylab = "")
points(mat[ind, ], pch = 19)
text(mat[ind, ], adj = c(0, -1.5))
```

While this greedy procedure is fast and works well for large values of `n`, it is quite inefficient in the example above. It is plain to see that other subsets of 3 points would have a larger minimum distance, but because we base our selection on the 2 points previously chosen to be maximally distant, the algorithm has to pick a suboptimal third point. The minimum distance in our example is 0.7641338.

The solution I came up with is based on an algorithm from Schlömer et al. (Schlömer, Heck, and Deussen 2011), who devised a method to partition a set of points into subsets while maximizing the minimal distance. They used Delaunay triangulations, but I decided to simply use the distance matrix instead. The algorithm works as follows.

```
M <- Compute a distance matrix of all points in the sample
S <- Sample n points randomly from M
repeat
  for i = 1:n
    M <- Add S(i) back into M
    S(i) <- Find the point in M\S with max min distance to any point in S
until S did not change
```

Iteratively, we put one point from our candidate subset (S) back into the original set (M) and check all distances between the points in S and those in M to find the point with the largest minimum distance. Rinse and repeat until we are only putting back the same points we started the loop with, which eventually always happens. Let’s see how this works on the same data set we used above.

```
r <- sample.int(nrow(dmat), n)

repeat {
  r_old <- r
  for (i in 1:n) {
    mm <- dmat[r[-i], -r[-i], drop = FALSE]
    k <- which.max(mm[(1:ncol(mm) - 1) * nrow(mm) + max.col(t(-mm))])
    r[i] <- as.numeric(dimnames(mm)[[2]][k])
  }
  if (identical(r_old, r)) break
}

par(mfrow = c(1, 2), mai = c(0.5, 0.5, 0.1, 0.1))
plot(mat, asp = 1, xlab = "", ylab = "")
plot(mat, asp = 1, xlab = "", ylab = "")
points(mat[r, ], pch = 19)
text(mat[r, ], adj = c(0, -1.5))
```

Here, we end up with a minimum distance of 0.8619587. In qualpalr, this means that we now achieve slightly more distinct colors.

The new algorithm is slightly slower than the old, greedy approach, and slightly more verbose:

```
f_greedy <- function(data, n) {
  dmat <- as.matrix(stats::dist(data))
  ind <- integer(n)
  ind[1:2] <- as.vector(arrayInd(which.max(dmat), .dim = dim(dmat)))
  for (i in 3:n) {
    mm <- dmat[ind, -ind, drop = FALSE]
    k <- which.max(mm[(1:ncol(mm) - 1) * nrow(mm) + max.col(t(-mm))])
    ind[i] <- as.numeric(dimnames(mm)[[2]][k])
  }
  ind
}

f_new <- function(data, n) {
  dmat <- as.matrix(stats::dist(data))
  r <- sample.int(nrow(dmat), n)
  repeat {
    r_old <- r
    for (i in 1:n) {
      mm <- dmat[r[-i], -r[-i], drop = FALSE]
      k <- which.max(mm[(1:ncol(mm) - 1) * nrow(mm) + max.col(t(-mm))])
      r[i] <- as.numeric(dimnames(mm)[[2]][k])
    }
    if (identical(r_old, r)) return(r)
  }
}
```

```
n <- 5
data <- matrix(runif(900), ncol = 3)
microbenchmark::microbenchmark(f_greedy(data, n), f_new(data, n), times = 1000L)
```

```
Unit: milliseconds
              expr      min       lq     mean   median       uq      max neval cld
 f_greedy(data, n) 1.311413 1.346007 1.692128 1.363831 1.407456 40.94923  1000  a
    f_new(data, n) 1.621765 1.934945 2.420683 2.183483 2.489382 15.32823  1000   b
```

The newest development version of qualpalr now uses this updated algorithm, which has also been generalized and included as a new function, `farthest_points()`, in my R package euclidr.

Schlömer, Thomas, Daniel Heck, and Oliver Deussen. 2011. “Farthest-Point Optimized Point Sets with Maximized Minimum Distance.” In, 135. ACM Press. https://doi.org/bpmnsh.

R features a number of packages that produce Euler and/or Venn diagrams; some of the more prominent ones (on CRAN) are

- eVenn,
- VennDiagram,
- venn,
- colorfulVennPlot, and
- venneuler.

The last of these (venneuler) serves as the primary inspiration for this package, along with the refinements that Ben Fredrickson has presented on his blog and made available in his JavaScript library venn.js.

venneuler, however, is written in Java, preventing R users from browsing the source code (unless they are also literate in Java) or contributing. Furthermore, venneuler is known to produce imperfect output for set configurations that have perfect solutions. Consider, for instance, the following example, in which the intersection between `A` and `B` is unwanted.

```
library(venneuler, quietly = TRUE)
venn_fit <- venneuler(c(A = 75, B = 50, "A&B" = 0))
plot(venn_fit)
```

eulerr is based on the improvements to **venneuler** that Ben Fredrickson introduced with **venn.js** but has been coded from scratch, uses different optimizers, and returns the residuals and stress statistic that venneuler features.

Currently, it is possible to provide input to `eulerr` as either

- a named numeric vector or
- a matrix of logicals with columns representing sets and rows the set relationships for each observation.

```
library(eulerr)

# Input in the form of a named numeric vector
fit1 <- euler(c("A" = 25, "B" = 5, "C" = 5,
                "A&B" = 5, "A&C" = 5, "B&C" = 3,
                "A&B&C" = 3))

# Input as a matrix of logicals
set.seed(1)
mat <- cbind(
  A = sample(c(TRUE, TRUE, FALSE), size = 50, replace = TRUE),
  B = sample(c(TRUE, FALSE), size = 50, replace = TRUE),
  C = sample(c(TRUE, FALSE, FALSE, FALSE), size = 50, replace = TRUE)
)
fit2 <- euler(mat)
```

We inspect our results by printing the `eulerr` object:

`fit2`

```
      original fitted residuals regionError
A           13     13         0       0.008
B            4      4         0       0.002
C            0      0         0       0.000
A&B         17     17         0       0.010
A&C          5      5         0       0.003
B&C          1      0         1       0.024
A&B&C        2      2         0       0.001

diagError: 0.024
stress:    0.002
```

or directly access and plot the residuals.

```
# Cleveland dot plot of the residuals
dotchart(resid(fit2))
abline(v = 0, lty = 3)
```

This shows us that the `B&C` intersection could not be fit exactly in `fit2`. Given that these residuals are on the scale of the original values, however, they are arguably of little concern.

For an overall measure of the fit of the solution, we use the same stress statistic that Leland Wilkinson presented in his academic paper on venneuler (Wilkinson 2012), which is given by the sums of squared residuals divided by the total sums of squares:

$$\text{stress} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n y_i^2}.$$
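To make that sentence concrete, here is a hedged sketch of my own (not eulerr’s internal code) that computes the statistic from the original and fitted values printed for `fit2` above:

```r
# Illustration only: stress as the sum of squared residuals divided by
# the total sum of squares of the original areas.
stress <- function(original, fitted) {
  sum((original - fitted)^2) / sum(original^2)
}

# original and fitted values from the printed fit2 output
original <- c(13, 4, 0, 17, 5, 1, 2)
fitted   <- c(13, 4, 0, 17, 5, 0, 2)
stress(original, fitted)  # ~0.00198, matching fit2$stress below
```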

We fetch it from the `stress` attribute of the `eulerr` object.

`fit2$stress`

`[1] 0.00198`

We can now be confident that eulerr provides a reasonable representation of our input. Were it otherwise, we would do best to stop here and look for another way to visualize our data. (I suggest the excellent UpSetR package.)

Now we get to the fun part: plotting our diagram. This is easy, as well as highly customizable, with eulerr.

```
plot(fit2)

# Change fill colors, border type (remove), and font face.
plot(
  fit2,
  fills = c("dodgerblue4", "plum2", "seashell2"),
  edges = list(lty = 1:3),
  labels = list(font = 2)
)
```

eulerr’s default color palette is taken from qualpalr – another package that I have developed – which uses color difference algorithms to generate distinct qualitative color palettes.

Details of the implementation will be left for a future vignette but almost completely resemble the approach documented here.

eulerr would not be possible without Ben Fredrickson’s work on venn.js or Leland Wilkinson’s venneuler.

Wilkinson, L. 2012. “Exact and Approximate Area-Proportional Circular Venn and Euler Diagrams.” *IEEE Transactions on Visualization and Computer Graphics* 18 (2): 321–31. https://doi.org/10.1109/TVCG.2011.56.

With the advent of colorbrewer there now exist good options for generating color palettes for sequential, diverging, and qualitative data. In R, these palettes can be accessed via the popular RColorBrewer package. Those palettes, however, are limited to a fixed number of colors. This isn’t much of a problem for sequential or diverging data since we can interpolate colors to any range we desire:

```
pal <- RColorBrewer::brewer.pal(4, "PuBuGn")
color_ramp <- colorRampPalette(pal, space = "Lab")
```
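To see the interpolation in action without depending on RColorBrewer, here is a sketch with the four PuBuGn colors hard-coded (hex values copied by hand, so treat them as an assumption):

```r
# Hard-coded 4-color PuBuGn palette (assumed values), interpolated to any n
pal <- c("#F6EFF7", "#BDC9E1", "#67A9CF", "#02818A")
color_ramp <- grDevices::colorRampPalette(pal, space = "Lab")
color_ramp(8)  # eight interpolated hex colors
```

Because `color_ramp()` is a function of `n`, the fixed palette size stops being a constraint for sequential data.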

There is not, however, an analogue for qualitative color palettes that will take you beyond the limit of 8–12 colors of colorbrewer’s qualitative palettes. There is also no customization in colorbrewer. Other R packages, such as colorspace, offer this, but they are primarily adapted to sequential and diverging data – not qualitative data.

This is where qualpalr comes in. qualpalr provides the user with a convenient way of generating distinct qualitative color palettes, primarily for use in R graphics. Given `n` (the number of colors to generate), along with a subset of the HSL color space (a cylindrical representation of the RGB color space), `qualpalr` attempts to find the `n` colors in the provided color subspace that *maximize the smallest pairwise color difference*. This is done by projecting the colors from the HSL color space to the DIN99d space. DIN99d is (approximately) perceptually uniform, that is, the Euclidean distance between two colors in the space is proportional to their perceived difference.

`qualpalr` relies on one basic function, `qualpal()`, which takes as its input `n` (the number of colors to generate) and `colorspace`, which can be either

- a list of numeric vectors `h` (hue from -360 to 360), `s` (saturation from 0 to 1), and `l` (lightness from 0 to 1), all of length 2, specifying a min and max, or
- a character vector specifying one of the predefined color subspaces, which at the time of writing are *pretty*, *pretty_dark*, *rainbow*, and *pastels*.

```
library(qualpalr)

pal <- qualpal(
  n = 5,
  list(
    h = c(0, 360),
    s = c(0.4, 0.6),
    l = c(0.5, 0.85)
  )
)

# Adapt the color space to deuteranopia
pal <- qualpal(n = 5, colorspace = "pretty", cvd = "deutan")
```

The resulting object, `pal`, is a list with several color tables and a distance matrix based on the DIN99d color difference formula.

`pal`

```
----------------------------------------
Colors in the HSL color space

          Hue Saturation Lightness
#73CA6F   117       0.46      0.61
#D37DAD   327       0.50      0.66
#C6DBE8   203       0.42      0.84
#6C7DCC   229       0.48      0.61
#D0A373    31       0.50      0.63

----------------------------------------
DIN99d color difference distance matrix

        #73CA6F #D37DAD #C6DBE8 #6C7DCC
#D37DAD      28
#C6DBE8      19      21
#6C7DCC      27      19      19
#D0A373      19      18      20      25
```

Methods for `pairs` and `plot` have been written for `qualpal` objects to help visualize the results.

`plot(pal)`

`pairs(pal, colorspace = "DIN99d", asp = 1)`

The colors are normally used in R by fetching the `hex` attribute of the palette. And so it is straightforward to use the output to, say, color the provinces of France (Figure 3).

```
library(maps)
map("france", fill = TRUE, col = pal$hex, mar = c(0, 0, 0, 0))
```

`qualpal` begins by generating a point cloud out of the HSL color subspace provided by the user, using a quasi-random torus sequence from randtoolbox. Here is the color subset in HSL with settings `h = c(-200, 120), s = c(0.3, 0.8), l = c(0.4, 0.9)`.

The function then proceeds by projecting these colors into the sRGB space (Figure 5).

It then continues by projecting the colors, first into the XYZ space, then CIELab (not shown here), and then finally the DIN99d space (Figure 6).

The DIN99d color space (Cui et al. 2002) is a Euclidean, perceptually uniform color space. This means that the difference between two colors is equal to the Euclidean distance between them. We take advantage of this by computing a distance matrix of all the colors in the subset, finding their pairwise color differences. We then apply a power transformation (Huang et al. 2015) to fine-tune these differences.

To select the `n` colors that the user wanted, we proceed greedily: first, we find the two most distant points, then we find the third point that maximizes the minimum distance to the previously selected points. This is repeated until `n` points have been selected. These points are then returned to the user; below is an example using `n = 5`.

At the time of writing, qualpalr only works in the sRGB color space with the CIE Standard Illuminant D65 reference white.

The greedy search to find distinct colors is crude. Particularly when searching for few colors, the greedy algorithm will lead to sub-optimal results. Other solutions to finding points that maximize the smallest pairwise distance among them are welcome.

Bruce Lindbloom’s webpage has been instrumental in making qualpalr. Also thanks to i want hue, which inspired me to make qualpalr.

Cui, G., M. R. Luo, B. Rigg, G. Roesler, and K. Witt. 2002. “Uniform Colour Spaces Based on the DIN99 Colour-Difference Formula.” *Color Research & Application* 27 (4): 282–90. https://doi.org/cz7764.

Huang, Min, Guihua Cui, Manuel Melgosa, Manuel Sánchez-Marañón, Changjun Li, M. Ronnier Luo, and Haoxue Liu. 2015. “Power Functions Improving the Performance of Color-Difference Formulas.” *Optics Express* 23 (1): 597. https://doi.org/gcsk6f.