rai_toolbox.optim.TopQGradientOptimizer#
- class rai_toolbox.optim.TopQGradientOptimizer(params, InnerOpt=<class 'torch.optim.sgd.SGD'>, *, q=<required parameter>, dq=0.0, param_ndim=-1, defaults=None, generator=<torch._C.Generator object>, **inner_opt_kwargs)[source]#
A gradient-transforming optimizer that zeros the elements of a gradient whose absolute magnitudes fall below the Qth percentile. InnerOpt.step() is then used to update the corresponding parameter.
- __init__(params, InnerOpt=<class 'torch.optim.sgd.SGD'>, *, q=<required parameter>, dq=0.0, param_ndim=-1, defaults=None, generator=<torch._C.Generator object>, **inner_opt_kwargs)[source]#
- Parameters:
- params : Sequence[Tensor] | Iterable[Mapping[str, Any]]
Iterable of parameters or dicts defining parameter groups.
- InnerOpt : Type[Optimizer] | Partial[Optimizer], optional (default=`torch.optim.SGD`)
The optimizer that updates the parameters after their gradients have been transformed.
- q : float
Specifies the (fractional) percentile of absolute-largest gradient elements to retain when sparsifying the gradient. E.g., q=0.9 means that only the gradient elements whose magnitudes exceed the 90th percentile (i.e. the largest 10%) will be retained. Must be within [0.0, 1.0]. The sparsification is applied to the gradient in accordance with param_ndim.
- dq : float, optional (default=0.0)
If specified, the sparsity factor for each gradient transformation will be drawn from a uniform distribution over \([q - dq,\, q + dq] \cap [0.0, 1.0]\).
- param_ndim : Union[int, None], optional (default=-1)
Determines how a parameter and its gradient are temporarily reshaped prior to being passed to both _pre_step_transform_ and _post_step_transform_. By default, the transformation broadcasts over the tensor’s first dimension in a batch-like style. This can be specified per param-group.
- A positive number determines the dimensionality of the tensor that the transformation will act on.
- A negative number indicates the ‘offset’ from the dimensionality of the tensor (see “Notes” for examples).
- None means that the transformation will be applied directly to the tensor without any broadcasting.
See ParamTransformingOptimizer for more details and examples.
- grad_scale : float, optional (default=1.0)
Multiplies each gradient in-place after the in-place transformation is performed. This can be specified per param-group.
- grad_bias : float, optional (default=0.0)
Added to each gradient in-place after the in-place transformation is performed. This can be specified per param-group.
- defaults : Optional[Dict[str, Any]]
Specifies default parameters for all parameter groups.
- generator : torch.Generator, optional (default=`torch.default_generator`)
Controls the RNG source.
- **inner_opt_kwargs : Any
Named arguments used to initialize InnerOpt.
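To make the effect of q concrete, the sparsification described above can be pictured with the following standalone sketch. This is only an illustration of the documented behavior, not the toolbox’s internal implementation (the helper name topq_sparsify is invented here, and details such as rounding and tie-breaking may differ):

>>> import torch as tr
>>> def topq_sparsify(grad, q):
...     # zero the fraction `q` of gradient elements with the smallest magnitudes
...     num_zero = round(q * grad.numel())
...     flat = grad.flatten().clone()
...     if num_zero > 0:
...         _, idx = flat.abs().topk(num_zero, largest=False)
...         flat[idx] = 0.0
...     return flat.reshape(grad.shape)
>>> topq_sparsify(tr.tensor([[0.5, 1.0], [-2.5, 0.3]]), q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])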
Examples
Let’s use TopQGradientOptimizer along with a standard SGD-step with a learning rate of 1.0. We’ll sparsify the gradient of a 2D parameter using varying percentile values. We set param_ndim=None so that no broadcasting occurs.

>>> import torch as tr
>>> from rai_toolbox.optim import TopQGradientOptimizer

>>> gradient = tr.tensor([[0.5, 1.0],
...                       [-2.5, 0.30]])
>>> for q in [0.0, 0.25, 0.5, 0.75, 1.0]:
...     x = tr.ones((2, 2), requires_grad=True)
...     optim = TopQGradientOptimizer(params=[x], lr=1.0, q=q, param_ndim=None)
...     x.backward(gradient=gradient)
...     optim.step()
...     print(f"grad (q={q})\n{x.grad}\nx:\n{x}\n---")
grad (q=0.0)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.3000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 0.7000]], requires_grad=True)
---
grad (q=0.25)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=0.75)
tensor([[ 0.0000,  0.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 1.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=1.0)
tensor([[0., 0.],
        [0., 0.]])
x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
---
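The inner optimizer need not be SGD. Any optimizer type (or a partial’d optimizer) can be passed via InnerOpt, with its named arguments forwarded through **inner_opt_kwargs. A brief sketch, reusing gradient from above and shown without outputs since Adam’s update values differ from the SGD results:

>>> from torch.optim import Adam
>>> x = tr.ones((2, 2), requires_grad=True)
>>> optim = TopQGradientOptimizer(params=[x], InnerOpt=Adam, lr=0.1, q=0.5, param_ndim=None)
>>> x.backward(gradient=gradient)
>>> optim.step()  # the gradient is sparsified as before, then Adam performs the update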
We’ll repeat the first exercise using param_ndim=1 so that the top-Q sparsification is applied to each row independently (i.e. it is “broadcast” over each 1D sub-tensor in our gradient).

>>> gradient = tr.tensor([[0.5, 1.0],
...                       [-2.5, 0.30]])
>>> for q in [0.0, 0.5, 1.0]:
...     x = tr.ones((2, 2), requires_grad=True)
...     optim = TopQGradientOptimizer(params=[x], lr=1.0, q=q, param_ndim=1)
...     x.backward(gradient=gradient)
...     optim.step()
...     print(f"grad (q={q})\n{x.grad}\nx:\n{x}\n---")
grad (q=0.0)
tensor([[ 0.5000,  1.0000],
        [-2.5000,  0.3000]])
x:
tensor([[0.5000, 0.0000],
        [3.5000, 0.7000]], requires_grad=True)
---
grad (q=0.5)
tensor([[ 0.0000,  1.0000],
        [-2.5000,  0.0000]])
x:
tensor([[1.0000, 0.0000],
        [3.5000, 1.0000]], requires_grad=True)
---
grad (q=1.0)
tensor([[0., 0.],
        [0., 0.]])
x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
---
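Neither dq nor generator is exercised above. The following sketch shows how they might be combined to randomize the sparsity level on each step while keeping the randomness reproducible; no output is shown because the drawn sparsity factor, and hence the result, depends on the generator’s state:

>>> rng = tr.Generator().manual_seed(0)
>>> gradient = tr.tensor([[0.5, 1.0],
...                       [-2.5, 0.30]])
>>> x = tr.ones((2, 2), requires_grad=True)
>>> optim = TopQGradientOptimizer(
...     params=[x], lr=1.0, q=0.5, dq=0.25, param_ndim=None, generator=rng
... )
>>> x.backward(gradient=gradient)
>>> optim.step()  # this step's sparsity factor is drawn uniformly from [0.25, 0.75]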
Methods
__init__(params[, InnerOpt, q, dq, ...])