Output-independent gradient w.r.t. network weights that holds one quantity constant

Suppose I have a simple MLP.

And I have the gradient of some loss function with respect to the output layer, giving G = [0, -1] (i.e., increasing the second output variable decreases the loss function).

If I take the gradient of G with respect to my network parameters and apply a gradient-descent weight update, the second output variable will increase. Nothing is said about the first output variable, though, and the scaled gradient update will almost certainly change it (either increasing or decreasing it).
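A minimal sketch of this effect, assuming a hypothetical 2-2-2 sigmoid MLP with hand-picked weights and a learning rate of 0.1 (none of these numbers come from the question). It backpropagates G = [0, -1] through every parameter and compares the outputs before and after one step:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-2-2 MLP; returns (hidden, out)."""
    hidden = [sigmoid(sum(x[i] * W1[i][j] for i in range(2)) + b1[j])
              for j in range(2)]
    out = [sum(hidden[j] * W2[j][k] for j in range(2)) + b2[k]
           for k in range(2)]
    return hidden, out

# Hypothetical input and weights, chosen only for illustration.
x = [1.0, 2.0]
W1 = [[0.1, -0.2], [0.3, 0.4]]   # input -> hidden
b1 = [0.0, 0.0]
W2 = [[0.5, -0.3], [0.6, 0.7]]   # hidden -> output
b2 = [0.0, 0.0]
G = [0.0, -1.0]                  # dL/d(out): only the second output matters
lr = 0.1

hidden, out_before = forward(x, W1, b1, W2, b2)

# Backpropagate G to the hidden pre-activations.
d_hidden = [sum(W2[j][k] * G[k] for k in range(2)) for j in range(2)]
d_z = [d_hidden[j] * hidden[j] * (1.0 - hidden[j]) for j in range(2)]

# One plain gradient-descent step on ALL parameters.
new_W2 = [[W2[j][k] - lr * hidden[j] * G[k] for k in range(2)] for j in range(2)]
new_b2 = [b2[k] - lr * G[k] for k in range(2)]
new_W1 = [[W1[i][j] - lr * x[i] * d_z[j] for j in range(2)] for i in range(2)]
new_b1 = [b1[j] - lr * d_z[j] for j in range(2)]

_, out_after = forward(x, new_W1, new_b1, new_W2, new_b2)

print("out before:", out_before)
print("out after: ", out_after)
```

The first column of W2 and b2[0] receive a zero gradient because G[0] = 0, yet the first output still moves, because the update to the shared input-to-hidden weights changes the hidden activations that feed both output units.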

How can I modify my loss function or any gradient computation to ensure that the first output does not change?

2017-02-11 Robert

Update: I misunderstood the question. Here is the new answer.

To do this, you need to update the connections between the hidden layer and the second output unit while keeping the connections between the hidden layer and the first output unit fixed.

The first approach is to introduce two sets of variables: one for the connections between the hidden layer and the first output unit, and one for the rest. You can then combine them using tf.stack and pass a var_list to get the corresponding derivatives. It looks something like this (for illustration only; not tested; use with care):

out1 = tf.matmul(hidden, W_h_to_out1) + b_h_to_out1 
out2 = tf.matmul(hidden, W_h_to_out2) + b_h_to_out2 
out = tf.stack([out1, out2]) 
out = tf.transpose(tf.reshape(out, [2, -1])) 
loss = some_function_of(out) 
optimizer = tf.train.GradientDescentOptimizer(0.1) 
train_op_second_unit = optimizer.minimize(loss, var_list=[W_h_to_out2, b_h_to_out2])
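To see why restricting the update to the second unit's variables keeps the first output fixed, here is a minimal pure-Python sketch. The weights, the input, and the helper names (hidden_layer, output_unit) are hypothetical and only mirror the W_h_to_out1 / W_h_to_out2 split above; the loss is assumed to have dL/d(out2) = -1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(x, W1, b1):
    """Sigmoid hidden layer; W1 is indexed [input][hidden]."""
    return [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
            for j in range(len(b1))]

def output_unit(hidden, w, b):
    """A single linear output unit with its own weights and bias."""
    return sum(h * wi for h, wi in zip(hidden, w)) + b

# Hypothetical values, chosen only for illustration.
x = [1.0, 2.0]
W1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.0, 0.0]
w_out1, b_out1 = [0.5, 0.6], 0.0    # hidden -> first output unit (frozen)
w_out2, b_out2 = [-0.3, 0.7], 0.0   # hidden -> second output unit (trained)

hidden = hidden_layer(x, W1, b1)
out1_before = output_unit(hidden, w_out1, b_out1)
out2_before = output_unit(hidden, w_out2, b_out2)

# With dL/d(out2) = -1: dL/dw_out2[j] = -hidden[j] and dL/db_out2 = -1.
# Only the second unit's variables are stepped, like var_list above.
lr = 0.1
w_out2 = [w + lr * h for w, h in zip(w_out2, hidden)]
b_out2 = b_out2 + lr

out1_after = output_unit(hidden, w_out1, b_out1)
out2_after = output_unit(hidden, w_out2, b_out2)

print("out1:", out1_before, "->", out1_after)
print("out2:", out2_before, "->", out2_after)
```

Because neither the hidden layer nor the first unit's weights are touched, out1 is computed from exactly the same numbers before and after the step, while out2 increases.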

The other approach is to use a mask. This is easier to implement and more flexible when you work with some frameworks (e.g. slim, Keras, etc.), and it is the one I recommend. The idea is to hide the first output unit from the loss function while leaving the second output unit unchanged. This can be done with a binary variable: multiply something by 1 if you want to keep it, and multiply it by 0 to drop it. Here is the code:

import tensorflow as tf 
import numpy as np 

# let's make our tiny dataset: (x, y) pairs, where x = (x1, x2, x3), y = (y1, y2), 
# and y1 = x1+x2+x3, y2 = x1^2+x2^2+x3^2 

# n_sample data points 
n_sample = 8 
data_x = np.random.random((n_sample, 3)) 
data_y = np.zeros((n_sample, 2)) 
data_y[:, 0] += np.sum(data_x, axis=1) 
data_y[:, 1] += np.sum(data_x**2, axis=1) 
data_y += 0.01 * np.random.random((n_sample, 2)) # add some noise 


# build graph 
# suppose we have a network of shape [3, 4, 2], i.e.: one hidden layer of size 4. 

x = tf.placeholder(tf.float32, shape=[None, 3], name='x') 
y = tf.placeholder(tf.float32, shape=[None, 2], name='y') 
mask = tf.placeholder(tf.float32, shape=[None, 2], name='mask') 

W1 = tf.Variable(tf.random_normal(shape=[3, 4], stddev=0.1), name='W1') 
b1 = tf.Variable(tf.random_normal(shape=[4], stddev=0.1), name='b1') 
hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1) 
W2 = tf.Variable(tf.random_normal(shape=[4, 2], stddev=0.1), name='W2') 
b2 = tf.Variable(tf.random_normal(shape=[2], stddev=0.1), name='b2') 
out = tf.matmul(hidden, W2) + b2 
loss = tf.reduce_mean(tf.square(out - y)) 

# multiply out by mask, so out[:, 0] is "invisible" to the loss and no gradient is propagated through it
masked_out = mask * out 
loss2 = tf.reduce_mean(tf.square(masked_out - y)) 

optimizer = tf.train.GradientDescentOptimizer(0.1) 
train_op_all = optimizer.minimize(loss) # update all variables in the network 
train_op12 = optimizer.minimize(loss, var_list=[W2, b2]) # update hidden -> output layer 
train_op2 = optimizer.minimize(loss2, var_list=[W2, b2]) # update hidden -> second output unit 


sess = tf.InteractiveSession() 
sess.run(tf.global_variables_initializer()) 
mask_out1 = np.zeros((n_sample, 2)) 
mask_out1[:, 1] += 1.0 
# print(mask_out1) 
print(sess.run([hidden, out, loss, loss2], feed_dict={x: data_x, y: data_y, mask: mask_out1})) 

# In this case, only out2 is updated. You should see both loss and loss2 decrease.
sess.run(train_op2, feed_dict={x: data_x, y:data_y, mask: mask_out1}) 
print(sess.run([hidden, out, loss, loss2], feed_dict={x: data_x, y:data_y, mask: mask_out1})) 

# In this case, both out1 and out2 are updated. You should see both loss and loss2 decrease.
sess.run(train_op12, feed_dict={x: data_x, y:data_y, mask: mask_out1}) 
print(sess.run([hidden, out, loss, loss2], feed_dict={x: data_x, y:data_y, mask: mask_out1})) 

# In this case, everything is updated. You should see both loss and loss2 decrease.
sess.run(train_op_all, feed_dict={x: data_x, y:data_y, mask: mask_out1}) 
print(sess.run([hidden, out, loss, loss2], feed_dict={x: data_x, y:data_y, mask: mask_out1})) 
sess.close()
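The effect of the mask on the gradient can be checked by hand. The sketch below is pure Python with hypothetical numbers for hidden, out and y; it mirrors loss2 = mean((mask * out - y)^2), computes its gradient with respect to out analytically, and compares that with a finite-difference estimate. It is meant as an illustration of the idea, not as a replacement for the TensorFlow code.

```python
def masked_loss(out, y, mask):
    """Mean of (mask * out - y)^2 over all entries, mirroring loss2."""
    count = len(out) * len(out[0])
    return sum((m * o - t) ** 2
               for row_o, row_y, row_m in zip(out, y, mask)
               for o, t, m in zip(row_o, row_y, row_m)) / count

def grad_wrt_out(out, y, mask):
    """Analytic gradient of masked_loss with respect to each entry of out."""
    count = len(out) * len(out[0])
    return [[2.0 * (m * o - t) * m / count
             for o, t, m in zip(row_o, row_y, row_m)]
            for row_o, row_y, row_m in zip(out, y, mask)]

def numeric_grad(out, y, mask, n, k, eps=1e-6):
    """Central finite-difference estimate of d(masked_loss)/d(out[n][k])."""
    plus = [row[:] for row in out]
    minus = [row[:] for row in out]
    plus[n][k] += eps
    minus[n][k] -= eps
    return (masked_loss(plus, y, mask) - masked_loss(minus, y, mask)) / (2 * eps)

# Hypothetical batch of two samples, chosen only for illustration.
hidden = [[0.6, 0.7], [0.2, 0.9]]   # hidden activations, one row per sample
out = [[0.4, 0.2], [0.1, 0.9]]      # network outputs
y = [[1.0, 0.5], [0.3, 0.6]]        # targets
mask = [[0.0, 1.0], [0.0, 1.0]]     # hide the first output, keep the second

grad = grad_wrt_out(out, y, mask)

# Chain rule: dL/dW2[j][k] = sum over samples of hidden[n][j] * grad[n][k].
grad_W2 = [[sum(hidden[n][j] * grad[n][k] for n in range(2)) for k in range(2)]
           for j in range(2)]

print("dL/d(out):", grad)
print("dL/dW2:   ", grad_W2)
```

The first column of both gradients is zero, so a gradient step on W2 and b2 leaves the weights of the first output unit alone. This only keeps out1 itself fixed if the hidden layer is also excluded from the update, as train_op2 does with var_list=[W2, b2].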

======================= The old answer is below ==============================

To get the derivatives with respect to different variables, you can pass a var_list to decide which variables to update. Here is an example:

import tensorflow as tf 
import numpy as np 

# let's make our tiny dataset: (x, y) pairs, where x = (x1, x2, x3), y = (y1, y2), 
# and y1 = x1+x2+x3, y2 = x1^2+x2^2+x3^2 

# n_sample data points 
n_sample = 8 
data_x = np.random.random((n_sample, 3)) 
data_y = np.zeros((n_sample, 2)) 
data_y[:, 0] += np.sum(data_x, axis=1) 
data_y[:, 1] += np.sum(data_x**2, axis=1) 
data_y += 0.01 * np.random.random((n_sample, 2)) # add some noise 


# build graph 
# suppose we have a network of shape [3, 4, 2], i.e.: one hidden layer of size 4. 

x = tf.placeholder(tf.float32, shape=[None, 3], name='x') 
y = tf.placeholder(tf.float32, shape=[None, 2], name='y') 

W1 = tf.Variable(tf.random_normal(shape=[3, 4], stddev=0.1), name='W1') 
b1 = tf.Variable(tf.random_normal(shape=[4], stddev=0.1), name='b1') 
hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1) 
W2 = tf.Variable(tf.random_normal(shape=[4, 2], stddev=0.1), name='W2') 
b2 = tf.Variable(tf.random_normal(shape=[2], stddev=0.1), name='b2') 
out = tf.matmul(hidden, W2) + b2 

loss = tf.reduce_mean(tf.square(out - y)) 
optimizer = tf.train.GradientDescentOptimizer(0.1) 
# You can pass a variable list to decide which variable(s) to update.
train_op_second_layer = optimizer.minimize(loss, var_list=[W2, b2]) 
# If there is no var_list, all variables will be updated. 
train_op_all = optimizer.minimize(loss) 

sess = tf.InteractiveSession() 
sess.run(tf.global_variables_initializer()) 
print(sess.run([W1, b1, W2, b2, loss], feed_dict={x: data_x, y:data_y})) 

# In this case, only W2 and b2 are updated. You should see the loss decrease.
sess.run(train_op_second_layer, feed_dict={x: data_x, y:data_y}) 
print(sess.run([W1, b1, W2, b2, loss], feed_dict={x: data_x, y:data_y})) 

# In this case, all variables are updated. You should see the loss decrease.
sess.run(train_op_all, feed_dict={x: data_x, y:data_y}) 
print(sess.run([W1, b1, W2, b2, loss], feed_dict={x: data_x, y:data_y})) 
sess.close()

2017-02-17 06:32:56 soloice

How about setting 'trainable=False' on the [Variable](https://www.tensorflow.org/versions/r0.12/api_docs/python/state_ops/variables)? – xxi

That is not the same thing. The problem is that both outputs are affected by a change in the weights: applying the gradient of the output with respect to the weights changes both outputs, but we want a gradient that somehow leaves the first output constant after the gradient step. – Robert

@Robert Oh, I see. I misunderstood your question. I will update my answer. – soloice
