rai_toolbox.optim.L1qFrankWolfe
- class rai_toolbox.optim.L1qFrankWolfe(params, *, q, epsilon, dq=0.0, lr=2.0, use_default_lr_schedule=True, param_ndim=-1, div_by_zero_eps=1.1754943508222875e-38, generator=<torch._C.Generator object>)[source]
A Frank-Wolfe [1] optimizer that, when computing the LMO, sparsifies a parameter's gradient. Each updated parameter is constrained to fall within an \(\epsilon\)-sized ball in \(L^1\) space, centered on the origin.
The sparsification process retains only the signs (i.e., \(\pm 1\)) of the gradient's elements. The transformation is applied to the gradient in accordance with `param_ndim`.
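To make the transformation concrete, here is a minimal sketch (not the toolbox's internal code) of one reading of this LMO that is consistent with the worked example below: the smallest-magnitude fraction `q` of the gradient's elements is zeroed out, the remaining elements are reduced to their signs, and the result is scaled onto the \(\epsilon\)-radius \(L^1\) ball.

>>> import torch as tr
>>> grad = tr.tensor([0.0, 1.0, 2.0])
>>> q, epsilon = 0.30, 1.8
>>> direction = -tr.sign(grad)                      # sign-only descent direction
>>> n_drop = round(q * grad.numel())                # assumed: zero out the q-fraction smallest |grad| entries
>>> direction[grad.abs().argsort()[:n_drop]] = 0.0  # sparsify
>>> epsilon * direction / direction.abs().sum()     # scale onto the L1 ball of radius epsilon
tensor([ 0.0000, -0.9000, -0.9000])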
Notes
This parameter-transforming optimizer is useful for visual concept probing [2].
References
[2] Roberts, Jay, and Theodoros Tsiligkaridis. Controllably Sparse Perturbations of Robust Classifiers for Explaining Predictions and Probing Learned Concepts. (2021).
- __init__(params, *, q, epsilon, dq=0.0, lr=2.0, use_default_lr_schedule=True, param_ndim=-1, div_by_zero_eps=1.1754943508222875e-38, generator=<torch._C.Generator object>)[source]
- Parameters:
- params : Sequence[Tensor] | Iterable[Mapping[str, Any]]
Iterable of parameters or dicts defining parameter groups.
- q : float
Specifies the (fractional) percentile threshold for sparsifying the gradient: only the absolute-largest gradient elements, i.e., those above the q-th percentile by magnitude, are retained. E.g., `q=0.9` means that only the largest 10% of the gradient's elements (by magnitude) are retained. Must be within `[0.0, 1.0]`. The sparsification is applied to the gradient in accordance with `param_ndim`.
- epsilon : float
Specifies the radius of the \(L^1\)-space ball that all parameters will be projected into after each optimization step.
- dq : float, optional (default=0.0)
If specified, the sparsity factor for each gradient transformation will be drawn from a uniform distribution over \([q - dq, q + dq] \subseteq [0.0, 1.0]\).
- lr : float, optional (default=2.0)
Indicates the weight with which the LMO contributes to the parameter update. See `use_default_lr_schedule` for additional details. If `use_default_lr_schedule=False`, then `lr` must be in the domain `[0, 1]`.
- use_default_lr_schedule : bool, optional (default=True)
If `True`, then the per-parameter "learning rate" is scaled by \(\hat{l_r} = l_r / (l_r + k)\), where k is the update index for that parameter, which starts at 0 (a short numeric sketch of this schedule follows the parameter list).
- param_ndim : Union[int, None], optional (default=-1)
Determines how a parameter and its gradient are temporarily reshaped prior to being passed to both `_pre_step_transform_` and `_post_step_transform_`. By default, the transformation broadcasts over the tensor's first dimension in a batch-like style. This can be specified per param-group:
  - A positive number determines the dimensionality of the tensor that the transformation will act on.
  - A negative number indicates the 'offset' from the dimensionality of the tensor (see "Notes" for examples).
  - `None` means that the transformation will be applied directly to the tensor without any broadcasting.
See `ParamTransformingOptimizer` for more details and examples.
- defaults : Optional[Dict[str, Any]]
Specifies default parameters for all parameter groups.
- div_by_zero_eps : float, optional (default=`torch.finfo(torch.float32).tiny`)
Prevents div-by-zero error in learning rate schedule.
- generator : torch.Generator, optional (default=`torch.default_generator`)
Controls the RNG source.
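For concreteness, here is a quick numeric sketch of the default learning rate schedule described above (values computed directly from the formula \(\hat{l_r} = l_r / (l_r + k)\), not taken from the library). With `lr=2.0`, the effective step size for a parameter's first few updates shrinks as:

>>> lr = 2.0
>>> [round(lr / (lr + k), 3) for k in range(4)]  # update indices k = 0, 1, 2, 3
[1.0, 0.667, 0.5, 0.4]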
Examples
Using `L1qFrankWolfe`, we'll sparsify the parameter's gradient to retain the signs of the top 70% of its elements, and we'll constrain the updated parameter to fall within an \(L^1\)-ball of radius 1.8.

>>> import torch as tr
>>> from rai_toolbox.optim import L1qFrankWolfe
Creating a parameter for our optimizer to update, and our optimizer. We specify `param_ndim=None` so that the sparsification/normalization occurs on the gradient without any broadcasting.

>>> x = tr.tensor([1.0, 1.0, 1.0], requires_grad=True)
>>> optim = L1qFrankWolfe(
...     [x],
...     q=0.30,
...     epsilon=1.8,
...     param_ndim=None,
... )
Performing a simple calculation with `x` and backpropagating to create a gradient.

>>> (tr.tensor([0.0, 1.0, 2.0]) * x).sum().backward()
Performing a step with our optimizer uses the Frank-Wolfe algorithm to update its parameter; the parameter is updated with an LMO based on a sparsified, sign-only gradient. Note that the updated parameter falls within/on the \(L^1\)-ball of radius 1.8.

>>> optim.step()
>>> x  # the updated parameter; has an L1-norm of 1.8
tensor([ 0.0000, -0.9000, -0.9000], requires_grad=True)
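As a quick follow-up check (not part of the original example), we can confirm that the updated parameter sits on the \(L^1\)-ball of radius 1.8:

>>> tr.allclose(x.detach().norm(p=1), tr.tensor(1.8))
True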
Methods
__init__(params, *, q, epsilon[, dq, lr, ...])