

  • Full-batch GD (the original version)
  • SGD (pick a single random sample for each update)
  • Mini-batch SGD
  • Learning rate finder (fastai)
  • SGD with restarts (fastai)
  • Differential learning rates: different layers use different learning rates (fastai)
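
The first three variants differ only in how many samples feed each gradient estimate. A minimal pure-Python sketch, assuming a 1-D model y = w * x with squared-error loss and made-up, noise-free data (the function names and hyperparameters are illustrative, not from fastai):

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.05, epochs=200, seed=0):
    """Run gradient descent; batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)  # visit the samples in a new random order each epoch
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            w -= lr * gradient(w, batch)
    return w

# Toy data generated by y = 3x (hypothetical, noise-free).
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

w_full = train(data, batch_size=len(data))  # full-batch GD: one update per epoch
w_sgd = train(data, batch_size=1)           # SGD: one sample per update
w_mini = train(data, batch_size=4)          # mini-batch SGD
```

All three should end up close to w = 3 on this toy problem; with noisy data the smaller batch sizes would give noisier but cheaper updates.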


Jacobian and Hessian matrices


Original version

Momentum SGD (gradient descent with momentum)
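
As a rough illustration, one common formulation of the momentum update keeps a velocity that accumulates past gradients. The sketch below assumes a toy objective f(w) = w^2; `momentum_step`, `lr` and `beta` are names and values chosen here for illustration:

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """v accumulates a decaying sum of past gradients; w moves along v."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Minimise f(w) = w^2 (gradient 2w), starting from w = 5.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=2 * w)
```

With beta = 0 this reduces to a plain gradient descent step.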

Weight decay

Weight decay is the coefficient of the regularization term used in ML regularization; it carries over naturally to gradient descent in neural networks.
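
A small sketch of how that coefficient enters an update, assuming an L2 penalty written as (weight_decay / 2) * ||w||^2 (the factor 1/2 is a convention chosen here; libraries differ in the exact form):

```python
def sgd_step(w, grad, lr, weight_decay=0.0):
    """One SGD step on the loss L(w) + (weight_decay / 2) * ||w||^2.

    The penalty contributes weight_decay * w to the gradient, so each
    step also shrinks every weight toward zero.
    """
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

w = [1.0, -2.0]
zero_grad = [0.0, 0.0]  # pretend the data loss is flat at this point

no_decay = sgd_step(w, zero_grad, lr=0.1)                      # weights unchanged
with_decay = sgd_step(w, zero_grad, lr=0.1, weight_decay=0.5)  # weights shrink
```

Even with a zero data gradient, the decayed step multiplies each weight by (1 - lr * weight_decay), which is where the name comes from.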

Batch normalization


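All of these approaches include the idea of an embedding: a movie is represented by `factors` features, and a user's preference for movies is also represented by `factors` features (the two counts are kept equal here to simplify the computation).

In Ng's course:

X is the movie feature matrix (n_movies x factors); Y is the user feature matrix (n_users x factors).

X*Y' (matrix multiplication, with Y' the transpose of Y) gives all predicted ratings at once. X and Y are estimated in the manner of linear regression.

(n_movies x factors) x (factors x n_users) = n_movies x n_users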
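
The shape bookkeeping above can be checked with a tiny pure-Python sketch; the feature values are made up, and `matmul_transpose` is a helper defined here only for illustration:

```python
def matmul_transpose(X, Y):
    """Return X * Y', where the rows of X and Y are feature vectors."""
    return [[sum(xf * yf for xf, yf in zip(x_row, y_row)) for y_row in Y]
            for x_row in X]

# Hypothetical values: 3 movies, 2 users, 2 factors.
X = [[1.0, 0.0],   # movie 0
     [0.0, 1.0],   # movie 1
     [0.5, 0.5]]   # movie 2
Y = [[4.0, 1.0],   # user 0
     [2.0, 5.0]]   # user 1

pred = matmul_transpose(X, Y)  # n_movies x n_users
```

Entry pred[i][j] is the dot product of movie i's features with user j's features, i.e. the predicted rating of movie i by user j.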
  1. Direct implementation:

Take the dot product of the movie embedding and the user embedding, add the biases, apply a sigmoid, and train directly with momentum SGD.

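import torch
import torch.nn as nn
import torch.nn.functional as F

# n_factors, min_rating and max_rating are expected to be defined globally.

def get_emb(ni, nf):
    # Embedding table with ni rows (one per index) and nf columns.
    e = nn.Embedding(ni, nf)
    return e

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1)
        ]]

    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        # Per-row dot product of the user and movie embeddings.
        um = (self.u(users)* self.m(movies)).sum(1)
        # Note: the biases of both variables are simply added on.
        res = um + self.ub(users).squeeze() + self.mb(movies).squeeze()
        # Extra step: squash the result into [0, 1] with a sigmoid, then rescale to the real rating range.
        res = torch.sigmoid(res) * (max_rating-min_rating) + min_rating
        return res.view(-1, 1)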

  2. Network with hidden/output layers:

movie embedding + user embedding, concatenated, as the NN input layer

a hidden layer (dropout + ReLU), size: 2*factors -> 10

an output layer, size: 10 -> 1 (dropout + sigmoid; the sigmoid is not strictly required)

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, nh=10, p1=0.05, p2=0.5):
        super().__init__()
        (self.u, self.m) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors)]]
        self.lin1 = nn.Linear(n_factors*2, nh)
        self.lin2 = nn.Linear(nh, 1)
        self.drop1 = nn.Dropout(p1)
        self.drop2 = nn.Dropout(p2)

    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        # Concatenate the two embeddings, then apply dropout to the input.
        x = self.drop1(torch.cat([self.u(users),self.m(movies)], dim=1))
        # Hidden layer: linear -> ReLU -> dropout.
        x = self.drop2(F.relu(self.lin1(x)))
        # Sigmoid output rescaled to the range (min_rating-0.5, max_rating+0.5).
        return torch.sigmoid(self.lin2(x)) * (max_rating-min_rating+1) + min_rating-0.5
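
The two models rescale their sigmoid output slightly differently. The sketch below compares the two formulas in plain Python; the rating range 0.5 to 5.0 is an assumption for illustration, and the helper names are made up here:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scale_tight(z, min_rating, max_rating):
    """Map a raw score into (min_rating, max_rating), as in EmbeddingDotBias."""
    return sigmoid(z) * (max_rating - min_rating) + min_rating

def scale_padded(z, min_rating, max_rating):
    """Map into (min_rating - 0.5, max_rating + 0.5), as in EmbeddingNet."""
    return sigmoid(z) * (max_rating - min_rating + 1) + min_rating - 0.5

# Assumed rating range, for illustration only.
min_rating, max_rating = 0.5, 5.0

mid_tight = scale_tight(0.0, min_rating, max_rating)     # midpoint of the range
mid_padded = scale_padded(0.0, min_rating, max_rating)   # same midpoint
top_tight = scale_tight(10.0, min_rating, max_rating)    # approaches max_rating from below
top_padded = scale_padded(10.0, min_rating, max_rating)  # can go past max_rating
```

With the tight version the model can only approach the extreme ratings asymptotically; the half-point of padding lets it actually reach them without needing a very large raw score.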

What is an embedding

Why You Need to Start Using Embedding Layers

Turns positive integers (indexes) into dense vectors of fixed size.

Can this be understood as one-hot encoding the index (m dimensions) and then applying one fully connected layer (m x n), so that the embedding has n dimensions? That is:

index --> 1 x m (one-hot) --> (1 x m) x (m x n), n < m --> 1 x n

Because the embedded vectors also get updated during the training process of the deep neural network, we can explore what words are similar to each other in a multi-dimensional space. By using dimensionality reduction techniques like t-SNE these similarities can be visualized.
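
The one-hot reading of an embedding asked about above can be checked numerically. A minimal sketch, assuming a made-up 4 x 2 weight table (m = 4, n = 2) and helper names introduced here for illustration:

```python
def one_hot(index, m):
    return [1.0 if i == index else 0.0 for i in range(m)]

def vec_matmul(v, W):
    """Multiply a 1 x m row vector by an m x n matrix."""
    n = len(W[0])
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(n)]

# Hypothetical embedding table: m = 4 indices, n = 2 dimensions.
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6],
     [0.7, 0.8]]

index = 2
via_one_hot = vec_matmul(one_hot(index, len(W)), W)  # (1 x m) x (m x n)
via_lookup = W[index]                                # plain row lookup
```

Both routes give the same row of W, so an embedding layer behaves like a bias-free linear layer applied to a one-hot input, implemented as a cheap table lookup instead of a matrix product.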