
class rai_toolbox.optim.TopQGradientOptimizer(params, InnerOpt=<class 'torch.optim.sgd.SGD'>, *, q=<required parameter>, dq=0.0, param_ndim=-1, defaults=None, generator=<torch._C.Generator object>, **inner_opt_kwargs)

A gradient-transforming optimizer that zeros the elements of a gradient whose absolute magnitudes fall below the Qth percentile. InnerOpt.step() is then used to update the corresponding parameter.

__init__(params, InnerOpt=<class 'torch.optim.sgd.SGD'>, *, q=<required parameter>, dq=0.0, param_ndim=-1, defaults=None, generator=<torch._C.Generator object>, **inner_opt_kwargs)

params : Sequence[Tensor] | Iterable[Mapping[str, Any]]

Iterable of parameters or dicts defining parameter groups.

InnerOpt : Type[Optimizer] | Partial[Optimizer], optional (default=`torch.optim.SGD`)

The optimizer that updates the parameters after their gradients have been transformed.


q : float

Specifies the (fractional) percentile of absolute-largest gradient elements to retain when sparsifying the gradient. E.g., q=0.9 means that only the gradient elements whose absolute magnitudes do not fall below the 90th percentile will be retained; all other elements are set to zero.

Must be within [0.0, 1.0]. The sparsification is applied to the gradient in accordance with param_ndim.
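The thresholding described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `top_q_sparsify` is hypothetical, and the rule that `round(q * n)` elements are zeroed is an assumption chosen so that the sketch agrees with the examples further down this page.

```python
def top_q_sparsify(grad, q):
    """Zero the smallest-magnitude entries of a flat list of gradient values.

    Assumption (not taken from the library source): the number of zeroed
    entries is round(q * len(grad)).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0.0, 1.0]")
    num_zeroed = round(q * len(grad))
    # Indices ordered from smallest to largest absolute magnitude.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]))
    zeroed = set(order[:num_zeroed])
    return [0.0 if i in zeroed else g for i, g in enumerate(grad)]


flat_gradient = [0.5, 1.0, -2.5, 0.3]
print(top_q_sparsify(flat_gradient, 0.25))  # [0.5, 1.0, -2.5, 0.0]
print(top_q_sparsify(flat_gradient, 0.75))  # [0.0, 0.0, -2.5, 0.0]
```

Larger values of q therefore produce sparser gradients: q=0.0 leaves the gradient untouched, while q=1.0 zeros every element.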

dq : float, optional (default=0.0)

If specified, the sparsity factor for each gradient transformation will be drawn from a uniform distribution over \([q - dq, q + dq] \subseteq [0.0, 1.0]\).
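As a rough sketch of the sampling described above, the snippet below draws a sparsity factor uniformly from the interval around q. The function name `sample_sparsity_factor` is hypothetical, Python's `random.Random` stands in for the `torch.Generator` that the optimizer accepts, and the range check is an assumption about how an out-of-bounds interval would be handled.

```python
import random


def sample_sparsity_factor(q, dq, rng):
    """Draw a sparsity factor uniformly from [q - dq, q + dq]."""
    low, high = q - dq, q + dq
    # Assumed validation: the whole interval must lie inside [0.0, 1.0].
    if low < 0.0 or high > 1.0:
        raise ValueError("[q - dq, q + dq] must lie within [0.0, 1.0]")
    return rng.uniform(low, high)


rng = random.Random(0)  # stand-in for a seeded torch.Generator
draws = [sample_sparsity_factor(0.5, 0.25, rng) for _ in range(5)]
print(all(0.25 <= d <= 0.75 for d in draws))  # True
print(sample_sparsity_factor(0.5, 0.0, rng))  # 0.5
```

With dq=0.0 the interval collapses to a single point, so every transformation uses exactly q.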

param_ndim : Union[int, None], optional (default=-1)

Determines how a parameter and its gradient are temporarily reshaped prior to being passed to both _pre_step_transform_ and _post_step_transform_. By default, the transformation broadcasts over the tensor’s first dimension in a batch-like style. This can be specified per param-group.

  • A positive number determines the dimensionality of the tensor that the transformation will act on.

  • A negative number indicates the ‘offset’ from the dimensionality of the tensor (see “Notes” for examples).

  • None means that the transformation will be applied directly to the tensor without any broadcasting.

See ParamTransformingOptimizer for more details and examples.
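The three cases listed above can be illustrated by computing, for a given tensor shape, the shape of the sub-tensors that the transformation acts on and the leading dimensions it is broadcast over. This is a minimal sketch: the helper `split_shape` is hypothetical, and it assumes the transformation acts on the trailing dimensions of the tensor. It does not cover the edge cases handled by ParamTransformingOptimizer.

```python
def split_shape(shape, param_ndim):
    """Return (sub_tensor_shape, broadcast_shape) for a given param_ndim.

    Assumption: the transformation acts on the trailing dimensions and is
    broadcast over the leading ones.
    """
    ndim = len(shape)
    if param_ndim is None:
        # The transformation sees the whole tensor; nothing is broadcast.
        return tuple(shape), ()
    # A negative value is an offset from the tensor's own dimensionality.
    n = param_ndim if param_ndim > 0 else ndim + param_ndim
    if not 0 < n <= ndim:
        raise ValueError("param_ndim is not covered by this sketch")
    return tuple(shape[ndim - n:]), tuple(shape[:ndim - n])


print(split_shape((2, 2), 1))      # ((2,), (2,))
print(split_shape((4, 3, 5), -1))  # ((3, 5), (4,))
print(split_shape((2, 2), None))   # ((2, 2), ())
```

For a shape-(2, 2) gradient, param_ndim=1 means each of the two rows is transformed independently, whereas param_ndim=None means the whole matrix is transformed at once.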

grad_scale : float, optional (default=1.0)

Multiplies each gradient in-place after the in-place transformation is performed. This can be specified per param-group.

grad_bias : float, optional (default=0.0)

Added to each gradient in-place after the in-place transformation is performed. This can be specified per param-group.

defaults : Optional[Dict[str, Any]]

Specifies default parameters for all parameter groups.

generator : torch.Generator, optional (default=`torch.default_generator`)

Controls the RNG source.


**inner_opt_kwargs

Named arguments used to initialize InnerOpt.


Let’s use TopQGradientOptimizer along with a standard SGD step with a learning rate of 1.0 (lr is forwarded to InnerOpt as a named argument). We’ll sparsify the gradient of a 2D parameter using varying percentile values. We set param_ndim=None so that no broadcasting occurs.

>>> import torch as tr
>>> from rai_toolbox.optim import TopQGradientOptimizer
>>> gradient = tr.tensor([[0.5,   1.0],
...                       [-2.5, 0.30]])
>>> for q in [0.0, 0.25, 0.5, 0.75, 1.0]:
...     x = tr.ones((2, 2), requires_grad=True)
...     optim = TopQGradientOptimizer(params=[x], lr=1.0, q=q, param_ndim=None)
...     x.backward(gradient=gradient)
...     optim.step()
...     print(f"grad (q={q})\n{x.grad}\nx:\n{x}\n---")
grad (q=0.0)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.3000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 0.7000]], requires_grad=True)
---
grad (q=0.25)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=0.75)
tensor([[ 0.0000,  0.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 1.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=1.0)
tensor([[0., 0.],
        [0., 0.]])
x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
---

We’ll repeat this exercise using param_ndim=1 so that the top-Q sparsification is applied to each row independently (i.e. it is “broadcast” over each 1D sub-tensor in our gradient).

>>> gradient = tr.tensor([[0.5,   1.0],
...                       [-2.5, 0.30]])
>>> for q in [0.0, 0.5, 1.0]:
...     x = tr.ones((2, 2), requires_grad=True)
...     optim = TopQGradientOptimizer(params=[x], lr=1.0, q=q, param_ndim=1)
...     x.backward(gradient=gradient)
...     optim.step()
...     print(f"grad (q={q})\n{x.grad}\nx:\n{x}\n---")
grad (q=0.0)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.3000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 0.7000]], requires_grad=True)
---
grad (q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=1.0)
tensor([[0., 0.],
        [0., 0.]])
x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
---
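To make the row-wise result above easier to verify by hand, here is a dependency-free sketch of the same computation for q=0.5: each row of the gradient is sparsified independently, and the update x - lr * grad is then applied. The helper names are hypothetical, nested lists stand in for tensors, and zeroing `round(q * n)` entries per row is an assumption that matches the printed results rather than a statement about the library's internals.

```python
def zero_smallest(values, q):
    """Zero the round(q * n) smallest-magnitude entries of a list (assumed rule)."""
    num_zeroed = round(q * len(values))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    zeroed = set(order[:num_zeroed])
    return [0.0 if i in zeroed else v for i, v in enumerate(values)]


def sparsified_sgd_step(x, grad, q, lr=1.0):
    """Sparsify each row of `grad` independently, then apply x - lr * grad."""
    sparse_grad = [zero_smallest(row, q) for row in grad]
    new_x = [
        [xi - lr * gi for xi, gi in zip(x_row, g_row)]
        for x_row, g_row in zip(x, sparse_grad)
    ]
    return sparse_grad, new_x


x = [[1.0, 1.0], [1.0, 1.0]]
gradient = [[0.5, 1.0], [-2.5, 0.30]]
sparse_grad, new_x = sparsified_sgd_step(x, gradient, q=0.5)
print(sparse_grad)  # [[0.0, 1.0], [-2.5, 0.0]]
print(new_x)        # [[1.0, 0.0], [3.5, 1.0]]
```

In each row, the entry with the smaller absolute value is zeroed, so only the parameters paired with the surviving gradient entries change.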


