chop.stochastic¶
Stochastic optimizers.
This module contains stochastic first-order optimizers. These are meant to be used in place of optimizers such as SGD or Adam for training a model over batches of a dataset. The API in this module is inspired by torch.optim.
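A minimal usage sketch of the torch.optim-style API, assuming a small model and an L-infinity-ball constraint (the constraint class chop.constraints.LinfBall and its radius argument are assumptions used only for illustration; any constraint object accepted by these optimizers follows the same pattern):

    import torch
    import chop.constraints
    import chop.stochastic

    # Small model and synthetic batch for illustration.
    model = torch.nn.Linear(10, 2)
    criterion = torch.nn.CrossEntropyLoss()
    data = torch.randn(32, 10)
    target = torch.randint(0, 2, (32,))

    # Assumed constraint: L-infinity ball of radius 1.0.
    constraint = chop.constraints.LinfBall(1.0)

    # Same pattern as torch.optim: build the optimizer over the parameters,
    # then alternate zero_grad / backward / step over batches.
    optimizer = chop.stochastic.PGD(model.parameters(), constraint, lr=0.1)

    for _ in range(10):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()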
Functions

backtracking_step_size – Backtracking step-size finding routine for FW-like algorithms.

Classes

FrankWolfe – Class for the Stochastic Frank-Wolfe algorithm given in Mokhtari et al.
PGD – Projected Gradient Descent.
PGDMadry – What Madry et al. call PGD.
PairwiseFrankWolfe – Pairwise Frank-Wolfe algorithm.
class chop.stochastic.FrankWolfe(params, constraint, lr=0.1, momentum=0.9)[source]¶
Class for the Stochastic Frank-Wolfe algorithm given in Mokhtari et al. This is essentially Frank-Wolfe with momentum. We use the tricks from [Pokutta, Spiegel, Zimmer, 2020]: https://arxiv.org/abs/2010.07243. A usage sketch follows the method descriptions below.
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters
param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
property certificate¶
A generator over the current convergence certificate estimate for each optimized parameter.
load_state_dict(state_dict)[source]¶
Loads the optimizer state.
Parameters
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
state_dict()[source]¶
Returns the state of the optimizer as a dict. It contains two entries:
- state – a dict holding current optimization state. Its content differs between optimizer classes.
- param_groups – a dict containing all parameter groups.
step(closure=None)[source]¶
Performs a single optimization step.
Parameters
closure – A closure that reevaluates the model and returns the loss.
zero_grad(set_to_none: bool = False)[source]¶
Sets the gradients of all optimized torch.Tensors to zero.
Parameters
set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
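A minimal sketch of training with FrankWolfe and inspecting the certificate property (a convergence certificate estimate per parameter). The chop.constraints.L1Ball constraint and its radius argument are assumptions used only for illustration:

    import torch
    import chop.constraints
    import chop.stochastic

    # Small model and synthetic batch for illustration.
    model = torch.nn.Linear(20, 2)
    criterion = torch.nn.CrossEntropyLoss()
    data = torch.randn(16, 20)
    target = torch.randint(0, 2, (16,))

    # Assumed constraint: L1 ball of radius 5.0.
    constraint = chop.constraints.L1Ball(5.0)
    optimizer = chop.stochastic.FrankWolfe(model.parameters(), constraint,
                                           lr=0.1, momentum=0.9)

    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

    # `certificate` is a generator yielding one estimate per parameter.
    for cert in optimizer.certificate:
        print(cert)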
class chop.stochastic.PGD(params, constraint, lr=0.1)[source]¶
Projected Gradient Descent. A usage sketch follows the method descriptions below.
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters
param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
property certificate¶
A generator over the current convergence certificate estimate for each optimized parameter.
load_state_dict(state_dict)[source]¶
Loads the optimizer state.
Parameters
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
state_dict()[source]¶
Returns the state of the optimizer as a dict. It contains two entries:
- state – a dict holding current optimization state. Its content differs between optimizer classes.
- param_groups – a dict containing all parameter groups.
step(closure=None)[source]¶
Performs a single optimization step (parameter update).
Parameters
closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
Note
Unless otherwise specified, this function should not modify the .grad field of the parameters.
zero_grad(set_to_none: bool = False)[source]¶
Sets the gradients of all optimized torch.Tensors to zero.
Parameters
set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
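A minimal sketch of checkpointing a PGD optimizer with state_dict() and load_state_dict(), following the usual torch.optim pattern. The chop.constraints.L2Ball constraint and its radius argument are assumptions used only for illustration:

    import torch
    import chop.constraints
    import chop.stochastic

    model = torch.nn.Linear(10, 1)
    # Assumed constraint: L2 ball of radius 1.0.
    constraint = chop.constraints.L2Ball(1.0)
    optimizer = chop.stochastic.PGD(model.parameters(), constraint, lr=0.05)

    # One training step on synthetic data.
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    optimizer.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()

    # Save and restore optimizer state, exactly as with torch.optim.
    checkpoint = {"model": model.state_dict(),
                  "optimizer": optimizer.state_dict()}
    torch.save(checkpoint, "checkpoint.pt")

    restored = torch.load("checkpoint.pt")
    model.load_state_dict(restored["model"])
    optimizer.load_state_dict(restored["optimizer"])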
class chop.stochastic.PGDMadry(params, constraint, lr)[source]¶
What Madry et al. call PGD. A usage sketch follows the method descriptions below.
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters
param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
property certificate¶
A generator over the current convergence certificate estimate for each optimized parameter.
load_state_dict(state_dict)[source]¶
Loads the optimizer state.
Parameters
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
state_dict()[source]¶
Returns the state of the optimizer as a dict. It contains two entries:
- state – a dict holding current optimization state. Its content differs between optimizer classes.
- param_groups – a dict containing all parameter groups.
step(step_size=None, closure=None)[source]¶
Performs a single optimization step (parameter update).
Parameters
closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
Note
Unless otherwise specified, this function should not modify the .grad field of the parameters.
zero_grad(set_to_none: bool = False)[source]¶
Sets the gradients of all optimized torch.Tensors to zero.
Parameters
set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
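A minimal sketch of PGDMadry optimizing a single constrained tensor. The chop.constraints.LinfBall constraint is an assumption used only for illustration, and since the exact effect of the optional step_size argument to step() is not documented above, the sketch sticks to the default call:

    import torch
    import chop.constraints
    import chop.stochastic

    # A single tensor constrained to an assumed L-infinity ball of radius 0.3.
    delta = torch.zeros(10, requires_grad=True)
    constraint = chop.constraints.LinfBall(0.3)
    optimizer = chop.stochastic.PGDMadry([delta], constraint, lr=0.1)

    target = torch.ones(10)

    for _ in range(5):
        optimizer.zero_grad()
        loss = ((delta - target) ** 2).sum()
        loss.backward()
        optimizer.step()  # step(step_size=...) is also accepted per the signature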
class chop.stochastic.PairwiseFrankWolfe(params, constraint, lr=0.1, momentum=0.9)[source]¶
Pairwise Frank-Wolfe algorithm. A usage sketch follows the method descriptions below.
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters
param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
load_state_dict(state_dict)[source]¶
Loads the optimizer state.
Parameters
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
state_dict()[source]¶
Returns the state of the optimizer as a dict. It contains two entries:
- state – a dict holding current optimization state. Its content differs between optimizer classes.
- param_groups – a dict containing all parameter groups.
step(closure)[source]¶
Performs a single optimization step (parameter update).
Parameters
closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
Note
Unless otherwise specified, this function should not modify the .grad field of the parameters.
zero_grad(set_to_none: bool = False)[source]¶
Sets the gradients of all optimized torch.Tensors to zero.
Parameters
set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have a lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:
1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
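A minimal sketch of PairwiseFrankWolfe, whose step(closure) takes a closure that reevaluates the model and returns the loss. The chop.constraints.L1Ball constraint is an assumption, and calling backward() inside the closure follows the usual torch.optim closure convention, which is assumed to apply here:

    import torch
    import chop.constraints
    import chop.stochastic

    model = torch.nn.Linear(10, 1)
    # Assumed constraint: L1 ball of radius 1.0 (pairwise FW typically needs a
    # constraint set whose extreme points can be tracked).
    constraint = chop.constraints.L1Ball(1.0)
    optimizer = chop.stochastic.PairwiseFrankWolfe(model.parameters(), constraint,
                                                   lr=0.1, momentum=0.9)

    x, y = torch.randn(8, 10), torch.randn(8, 1)

    def closure():
        # Reevaluate the model and return the loss, as expected by step().
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        return loss

    optimizer.step(closure)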
chop.stochastic.backtracking_step_size(x, f_t, old_f_t, f_grad, certificate, lipschitz_t, max_step_size, update_direction, norm_update_direction)[source]¶
Backtracking step-size finding routine for FW-like algorithms.
Parameters
x – array-like, shape (n_features,). Current iterate.
f_t – float. Value of the objective function at the current iterate.
old_f_t – float. Value of the objective function at the previous iterate.
f_grad – callable. Returns the objective function value and gradient at its argument.
certificate – float. FW gap.
lipschitz_t – float. Current value of the Lipschitz estimate.
max_step_size – float. Maximum admissible step-size.
update_direction – array-like, shape (n_features,). Update direction given by the FW variant.
norm_update_direction – float. Squared L2 norm of update_direction.
Returns
step_size_t – float. Step-size to be used to compute the next iterate.
lipschitz_t – float. Updated value for the Lipschitz estimate.
f_next – float. Objective function evaluated at x + step_size_t * d_t.
grad_next – array-like. Gradient evaluated at x + step_size_t * d_t.
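A hedged sketch of calling backtracking_step_size directly on a toy quadratic, matching the documented signature; the four-value unpacking follows the Returns section above, while the data types (torch tensors), the choice of previous iterate, and the linear-oracle vertex are assumptions for illustration:

    import torch
    import chop.stochastic

    # Toy quadratic objective f(x) = 0.5 * ||x||^2; returns (value, gradient).
    def f_grad(x):
        return 0.5 * (x ** 2).sum().item(), x.clone()

    x = torch.tensor([1.0, -2.0, 0.5])        # current iterate
    x_prev = torch.tensor([1.5, -2.5, 1.0])   # assumed previous iterate
    f_t, grad_t = f_grad(x)
    old_f_t, _ = f_grad(x_prev)

    # Assumed FW update direction towards a vertex s of the feasible set.
    s = torch.tensor([0.0, 3.0, 0.0])
    update_direction = s - x
    certificate = -(grad_t * update_direction).sum().item()       # FW gap
    norm_update_direction = (update_direction ** 2).sum().item()  # squared L2 norm

    step_size, lipschitz_t, f_next, grad_next = chop.stochastic.backtracking_step_size(
        x, f_t, old_f_t, f_grad, certificate,
        lipschitz_t=1.0, max_step_size=1.0,
        update_direction=update_direction,
        norm_update_direction=norm_update_direction,
    )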