rai_toolbox.optim.TopQGradientOptimizer#
- class rai_toolbox.optim.TopQGradientOptimizer(params, InnerOpt=<class 'torch.optim.sgd.SGD'>, *, q=<required parameter>, dq=0.0, param_ndim=-1, defaults=None, generator=<torch._C.Generator object>, **inner_opt_kwargs)[source]#
A gradient-transforming optimizer that zeros the elements of a gradient whose absolute magnitudes fall below the Qth percentile. InnerOpt.step() is then used to update the corresponding parameter.
- __init__(params, InnerOpt=<class 'torch.optim.sgd.SGD'>, *, q=<required parameter>, dq=0.0, param_ndim=-1, defaults=None, generator=<torch._C.Generator object>, **inner_opt_kwargs)[source]#
- Parameters:
- params : Sequence[Tensor] | Iterable[Mapping[str, Any]]
Iterable of parameters or dicts defining parameter groups.
- InnerOpt : Type[Optimizer] | Partial[Optimizer], optional (default=`torch.optim.SGD`)
The optimizer that updates the parameters after their gradients have been transformed.
- q : float
Specifies the (fractional) percentile of absolute-largest gradient elements to retain when sparsifying the gradient. E.g., q=0.9 means that only the gradient elements whose magnitudes exceed the 90th percentile (i.e. the largest 10%) will be retained. Must be within [0.0, 1.0]. The sparsification is applied to the gradient in accordance with param_ndim.
- dq : float, optional (default=0.0)
If specified, the sparsity factor for each gradient transformation will be drawn from a uniform distribution over \([q - dq,\, q + dq] \cap [0.0, 1.0]\).
- param_ndim : Union[int, None], optional (default=-1)
Determines how a parameter and its gradient are temporarily reshaped prior to being passed to both _pre_step_transform_ and _post_step_transform_. By default, the transformation broadcasts over the tensor’s first dimension in a batch-like style. This can be specified per param-group.
- A positive number determines the dimensionality of the tensor that the transformation will act on.
- A negative number indicates the ‘offset’ from the dimensionality of the tensor (see “Notes” for examples).
- None means that the transformation will be applied directly to the tensor without any broadcasting.
See ParamTransformingOptimizer for more details and examples.
- grad_scale : float, optional (default=1.0)
Multiplies each gradient in-place after the in-place transformation is performed. This can be specified per param-group.
- grad_bias : float, optional (default=0.0)
Added to each gradient in-place after the in-place transformation is performed. This can be specified per param-group.
- defaults : Optional[Dict[str, Any]]
Specifies default parameters for all parameter groups.
- generator : torch.Generator, optional (default=`torch.default_generator`)
Controls the RNG source.
- **inner_opt_kwargs : Any
Named arguments used to initialize InnerOpt.
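To make the effect of q concrete, the sparsification described above can be pictured with the following standalone sketch. This is only an illustration of the documented behavior, not the toolbox’s internal implementation (the helper name topq_sparsify is invented here, and details such as rounding and tie-breaking may differ):

>>> import torch as tr
>>> def topq_sparsify(grad, q):
...     # zero the fraction `q` of gradient elements with the smallest magnitudes
...     num_zero = round(q * grad.numel())
...     flat = grad.flatten().clone()
...     if num_zero > 0:
...         _, idx = flat.abs().topk(num_zero, largest=False)
...         flat[idx] = 0.0
...     return flat.reshape(grad.shape)
>>> topq_sparsify(tr.tensor([[0.5, 1.0], [-2.5, 0.3]]), q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])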
Examples
Let’s use TopQGradientOptimizer along with a standard SGD-step with a learning rate of 1.0. We’ll sparsify the gradient of a 2D parameter using varying percentile values. We set param_ndim=None so that no broadcasting occurs.

>>> import torch as tr
>>> from rai_toolbox.optim import TopQGradientOptimizer

>>> gradient = tr.tensor([[0.5, 1.0],
...                       [-2.5, 0.30]])
>>> for q in [0.0, 0.25, 0.5, 0.75, 1.0]:
...     x = tr.ones((2, 2), requires_grad=True)
...     optim = TopQGradientOptimizer(params=[x], lr=1.0, q=q, param_ndim=None)
...     x.backward(gradient=gradient)
...     optim.step()
...     print(f"grad (q={q})\n{x.grad}\nx:\n{x}\n---")
grad (q=0.0)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.3000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 0.7000]], requires_grad=True)
---
grad (q=0.25)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=0.75)
tensor([[ 0.0000,  0.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 1.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=1.0)
tensor([[0., 0.],
        [0., 0.]])
x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
---
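The inner optimizer need not be SGD. Any optimizer type (or a partial’d optimizer) can be passed via InnerOpt, with its named arguments forwarded through **inner_opt_kwargs. A brief sketch, reusing gradient from above and shown without outputs since Adam’s update values differ from the SGD results:

>>> from torch.optim import Adam
>>> x = tr.ones((2, 2), requires_grad=True)
>>> optim = TopQGradientOptimizer(params=[x], InnerOpt=Adam, lr=0.1, q=0.5, param_ndim=None)
>>> x.backward(gradient=gradient)
>>> optim.step()  # the gradient is sparsified as before, then Adam performs the update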
We’ll repeat the first exercise using param_ndim=1 so that the top-Q sparsification is applied to each row independently (i.e. it is “broadcast” over each 1D sub-tensor in our gradient).

>>> gradient = tr.tensor([[0.5, 1.0],
...                       [-2.5, 0.30]])
>>> for q in [0.0, 0.5, 1.0]:
...     x = tr.ones((2, 2), requires_grad=True)
...     optim = TopQGradientOptimizer(params=[x], lr=1.0, q=q, param_ndim=1)
...     x.backward(gradient=gradient)
...     optim.step()
...     print(f"grad (q={q})\n{x.grad}\nx:\n{x}\n---")
grad (q=0.0)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.3000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 0.7000]], requires_grad=True)
---
grad (q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=1.0)
tensor([[0., 0.],
        [0., 0.]])
x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
---
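Neither dq nor generator is exercised above. The following sketch shows how they might be combined to randomize the sparsity level on each step while keeping the randomness reproducible; no output is shown because the drawn sparsity factor, and hence the result, depends on the generator’s state:

>>> rng = tr.Generator().manual_seed(0)
>>> gradient = tr.tensor([[0.5, 1.0],
...                       [-2.5, 0.30]])
>>> x = tr.ones((2, 2), requires_grad=True)
>>> optim = TopQGradientOptimizer(
...     params=[x], lr=1.0, q=0.5, dq=0.25, param_ndim=None, generator=rng
... )
>>> x.backward(gradient=gradient)
>>> optim.step()  # this step's sparsity factor is drawn uniformly from [0.25, 0.75]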
Methods
__init__(params[, InnerOpt, q, dq, ...])