rai_toolbox.optim.L1qFrankWolfe
- class rai_toolbox.optim.L1qFrankWolfe(params, *, q, epsilon, dq=0.0, lr=2.0, use_default_lr_schedule=True, param_ndim=-1, div_by_zero_eps=1.1754943508222875e-38, generator=<torch._C.Generator object>)[source]
A Frank-Wolfe [1] optimizer that, when computing the LMO, sparsifies a parameter's gradient. Each updated parameter is constrained to fall within an \(\epsilon\)-sized ball in \(L^1\) space, centered on the origin.
The sparsification process retains only the signs (i.e., \(\pm 1\)) of the gradient's elements. The transformation is applied to the gradient in accordance with `param_ndim`.
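To make the transformation concrete, here is a minimal sketch (not the toolbox's internal code) of one reading of this LMO that is consistent with the worked example below: the smallest-magnitude fraction `q` of the gradient's elements is zeroed out, the remaining elements are reduced to their signs, and the result is scaled onto the \(\epsilon\)-radius \(L^1\) ball.

>>> import torch as tr
>>> grad = tr.tensor([0.0, 1.0, 2.0])
>>> q, epsilon = 0.30, 1.8
>>> direction = -tr.sign(grad)                      # sign-only descent direction
>>> n_drop = round(q * grad.numel())                # assumed: zero out the q-fraction smallest |grad| entries
>>> direction[grad.abs().argsort()[:n_drop]] = 0.0  # sparsify
>>> epsilon * direction / direction.abs().sum()     # scale onto the L1 ball of radius epsilon
tensor([ 0.0000, -0.9000, -0.9000])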
Notes
This parameter-transforming optimizer is useful for visual concept probing [2].
References
[2] Roberts, Jay, and Theodoros Tsiligkaridis. Controllably Sparse Perturbations of Robust Classifiers for Explaining Predictions and Probing Learned Concepts. (2021).
- __init__(params, *, q, epsilon, dq=0.0, lr=2.0, use_default_lr_schedule=True, param_ndim=-1, div_by_zero_eps=1.1754943508222875e-38, generator=<torch._C.Generator object>)[source]
- Parameters:
- params : Sequence[Tensor] | Iterable[Mapping[str, Any]]
Iterable of parameters or dicts defining parameter groups.
- q : float
Specifies the (fractional) percentile threshold for sparsifying the gradient: only the absolute-largest gradient elements, i.e., those above the q-th percentile by magnitude, are retained. E.g., `q=0.9` means that only the largest 10% of the gradient's elements (by magnitude) are retained. Must be within `[0.0, 1.0]`. The sparsification is applied to the gradient in accordance with `param_ndim`.
- epsilon : float
Specifies the radius of the \(L^1\)-space ball that all parameters will be projected into after each optimization step.
- dq : float, optional (default=0.0)
If specified, the sparsity factor for each gradient transformation will be drawn from a uniform distribution over \([q - dq, q + dq] \subseteq [0.0, 1.0]\).
- lr : float, optional (default=2.0)
Indicates the weight with which the LMO contributes to the parameter update. See `use_default_lr_schedule` for additional details. If `use_default_lr_schedule=False`, then `lr` must be in the domain `[0, 1]`.
- use_default_lr_schedule : bool, optional (default=True)
If `True`, then the per-parameter "learning rate" is scaled by \(\hat{l_r} = l_r / (l_r + k)\), where k is the update index for that parameter, which starts at 0 (a short numeric sketch of this schedule follows the parameter list).
- param_ndim : Union[int, None], optional (default=-1)
Determines how a parameter and its gradient are temporarily reshaped prior to being passed to both `_pre_step_transform_` and `_post_step_transform_`. By default, the transformation broadcasts over the tensor's first dimension in a batch-like style. This can be specified per param-group:
  - A positive number determines the dimensionality of the tensor that the transformation will act on.
  - A negative number indicates the 'offset' from the dimensionality of the tensor (see "Notes" for examples).
  - `None` means that the transformation will be applied directly to the tensor without any broadcasting.
See `ParamTransformingOptimizer` for more details and examples.
- defaults : Optional[Dict[str, Any]]
Specifies default parameters for all parameter groups.
- div_by_zero_eps : float, optional (default=`torch.finfo(torch.float32).tiny`)
Prevents div-by-zero error in learning rate schedule.
- generator : torch.Generator, optional (default=`torch.default_generator`)
Controls the RNG source.
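For concreteness, here is a quick numeric sketch of the default learning rate schedule described above (values computed directly from the formula \(\hat{l_r} = l_r / (l_r + k)\), not taken from the library). With `lr=2.0`, the effective step size for a parameter's first few updates shrinks as:

>>> lr = 2.0
>>> [round(lr / (lr + k), 3) for k in range(4)]  # update indices k = 0, 1, 2, 3
[1.0, 0.667, 0.5, 0.4]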
Examples
Using `L1qFrankWolfe`, we'll sparsify the parameter's gradient to retain the signs of the top 70% of its elements, and we'll constrain the updated parameter to fall within an \(L^1\)-ball of radius 1.8.

>>> import torch as tr
>>> from rai_toolbox.optim import L1qFrankWolfe
Creating a parameter for our optimizer to update, and our optimizer. We specify `param_ndim=None` so that the sparsification/normalization occurs on the gradient without any broadcasting.

>>> x = tr.tensor([1.0, 1.0, 1.0], requires_grad=True)
>>> optim = L1qFrankWolfe(
...     [x],
...     q=0.30,
...     epsilon=1.8,
...     param_ndim=None,
... )
Performing a simple calculation with `x` and backpropagating to create a gradient.

>>> (tr.tensor([0.0, 1.0, 2.0]) * x).sum().backward()
Performing a step with our optimizer uses the Frank-Wolfe algorithm to update its parameter; the parameter is updated with an LMO based on a sparsified, sign-only gradient. Note that the updated parameter falls within/on the \(L^1\)-ball of radius 1.8.

>>> optim.step()
>>> x  # the updated parameter; has an L1-norm of 1.8
tensor([ 0.0000, -0.9000, -0.9000], requires_grad=True)
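As a quick follow-up check (not part of the original example), we can confirm that the updated parameter sits on the \(L^1\)-ball of radius 1.8:

>>> tr.allclose(x.detach().norm(p=1), tr.tensor(1.8))
True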
Methods
__init__(params, *, q, epsilon[, dq, lr, ...])