rai_toolbox.optim.L1qFrankWolfe#

class rai_toolbox.optim.L1qFrankWolfe(params, *, q, epsilon, dq=0.0, lr=2.0, use_default_lr_schedule=True, param_ndim=-1, div_by_zero_eps=1.1754943508222875e-38, generator=<torch._C.Generator object>)[source]#

A Frank-Wolfe [1] optimizer that, when computing the LMO, sparsifies a parameter’s gradient. Each updated parameter is constrained to fall within an \(\epsilon\)-sized ball in \(L^1\) space, centered on the origin.

The sparsification process zeroes out all but the absolute-largest elements of the gradient and retains only the signs (i.e., \(\pm 1\)) of the elements that survive. The transformation is applied to the gradient in accordance with param_ndim.
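To make this concrete, below is a minimal sketch, in plain PyTorch, of the kind of sparsified, sign-only LMO described above. The helper l1q_lmo_sketch and its exact thresholding rule are illustrative assumptions, not the toolbox’s actual implementation.

>>> import torch as tr
>>> def l1q_lmo_sketch(grad, q, epsilon):
...     # Hypothetical helper (not part of rai_toolbox): zero out all but the
...     # absolute-largest gradient elements, keep only the signs of the
...     # survivors, and scale the result so that its L1-norm equals epsilon.
...     keep = grad.abs() >= tr.quantile(grad.abs(), q)
...     sparse_signs = tr.sign(grad) * keep
...     return -epsilon * sparse_signs / sparse_signs.abs().sum()
>>> lmo = l1q_lmo_sketch(tr.tensor([0.0, 1.0, 2.0]), q=0.3, epsilon=1.8)
>>> lmo.abs().sum()  # the LMO lies on the boundary of the epsilon-sized L1 ball
tensor(1.8000)

Compare with the single-step update in the Examples section below.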

Notes

This parameter-transforming optimizer is useful for visual concept probing [2].

References

[1] Frank, Marguerite, and Philip Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly 3.1-2 (1956): 95-110.

[2] Roberts, Jay, and Theodoros Tsiligkaridis. Controllably Sparse Perturbations of Robust Classifiers for Explaining Predictions and Probing Learned Concepts. (2021).

__init__(params, *, q, epsilon, dq=0.0, lr=2.0, use_default_lr_schedule=True, param_ndim=-1, div_by_zero_eps=1.1754943508222875e-38, generator=<torch._C.Generator object>)[source]#
Parameters:
params : Sequence[Tensor] | Iterable[Mapping[str, Any]]

Iterable of parameters or dicts defining parameter groups.

q : float

Specifies the (fractional) percentile used to sparsify the gradient: only those elements whose magnitudes lie at or above the q-th percentile are retained. E.g., `q=0.9` means that only the gradient elements at or above the 90th percentile in magnitude (roughly the largest 10%) will be retained.

Must be within [0.0, 1.0]. The sparsification is applied to the gradient in accordance with param_ndim.

epsilon : float

Specifies the radius of the \(L^1\)-space ball that each parameter will be projected into after each optimization step.

dq : float, optional (default=0.0)

If specified, the sparsity factor for each gradient transformation will be drawn from a uniform distribution over \([q - dq, q + dq] \subseteq [0.0, 1.0]\).

lr : float, optional (default=2.0)

Indicates the weight with which the LMO contributes to the parameter update. See use_default_lr_schedule for additional details. If use_default_lr_schedule=False, then lr must be in the domain [0, 1].

use_default_lr_schedule : bool, optional (default=True)

If True, then the per-parameter “learning rate” is scaled by \(\hat{l_r} = l_r / (l_r + k)\), where k is the update index for that parameter, which starts at 0 (see the short illustration after this parameter list).

param_ndim : Union[int, None], optional (default=-1)

Determines how a parameter and its gradient are temporarily reshaped prior to being passed to both _pre_step_transform_ and _post_step_transform_. By default, the transformation broadcasts over the tensor’s first dimension in a batch-like style. This can be specified per param-group:

  • A positive number determines the dimensionality of the tensor that the transformation will act on.

  • A negative number indicates the ‘offset’ from the dimensionality of the tensor (see “Notes” for examples).

  • None means that the transformation will be applied directly to the tensor without any broadcasting.

See ParamTransformingOptimizer for more details and examples.

defaults : Optional[Dict[str, Any]]

Specifies default parameters for all parameter groups.

div_by_zero_eps : float, optional (default=`torch.finfo(torch.float32).tiny`)

Prevents div-by-zero error in learning rate schedule.

generator : torch.Generator, optional (default=`torch.default_generator`)

Controls the RNG source.
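To illustrate the default schedule described under use_default_lr_schedule, here is how the per-parameter step weight \(\hat{l_r} = l_r / (l_r + k)\) decays for the default lr=2.0 (plain Python arithmetic, shown only for illustration):

>>> lr = 2.0
>>> [round(lr / (lr + k), 3) for k in range(4)]  # update indices k = 0, 1, 2, 3
[1.0, 0.667, 0.5, 0.4]

The first update (k=0) applies the LMO at full weight, which is consistent with the single-step result shown in the Examples below.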

Examples

Using L1qFrankWolfe, we’ll sparsify the parameter’s gradient to retain the signs of its top 70% of elements (by magnitude), and we’ll constrain the updated parameter to fall within an \(L^1\)-ball of radius 1.8.

>>> import torch as tr
>>> from rai_toolbox.optim import L1qFrankWolfe

Creating a parameter for our optimizer to update, along with the optimizer itself. We specify param_ndim=None so that the sparsification/normalization is applied to the whole gradient without any broadcasting.

>>> x = tr.tensor([1.0, 1.0, 1.0], requires_grad=True)
>>> optim = L1qFrankWolfe(
...     [x],
...     q=0.30,
...     epsilon=1.8,
...     param_ndim=None,
... )

Performing a simple calculation with x and backpropagating to create a gradient.

>>> (tr.tensor([0.0, 1.0, 2.0]) * x).sum().backward()

Performing a step with our optimizer updates its parameter via the Frank-Wolfe algorithm; the resulting parameter was produced by an LMO based on a sparsified, sign-only gradient. Note that the parameter falls within/on the \(L^1\)-ball of radius 1.8.

>>> optim.step()
>>> x  # the updated parameter; has a L1-norm of 1.8
tensor([ 0.0000, -0.9000, -0.9000], requires_grad=True)
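
As a quick check, we can confirm the \(L^1\) constraint directly using standard torch calls:

>>> x.detach().norm(p=1)  # L1-norm of the updated parameter
tensor(1.8000)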

Methods

__init__(params, *, q, epsilon[, dq, lr, ...])
