Flashcards in C1 NN & DL Deck (21)


1

## ReLU

###
Rectified Linear Unit

Activation function; a breakthrough over sigmoid, whose ~0 gradient for large |z| slows training, whereas ReLU keeps a gradient of 1 for z > 0 so training is faster

2

## Neuron and notation

###
x -> o -> y

Labeled data: m pairs (x, y)

3

## Neural Network

### Several layers of densely connected nodes that learn the relations between inputs and outputs

4

## NN types and applications

###
Standard NN - structured data

CNN - unstructured data (image)

RNN - temporal data (audio, language)

5

## Binary classification

###
(x, y), x in R^nx, y in {0,1}

m training examples

m = mtrain (vs mtest); training set = {(x1,y1), ..., (xm,ym)}

X = [x1 ... xm] in R^(nx*m)

Y = [y1 ... ym] in R^(1*m)
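
A minimal numpy sketch of these stacked matrices (nx, m and the data are made up for illustration):

```python
import numpy as np

nx, m = 4, 3                          # 4 features, 3 training examples
X = np.random.randn(nx, m)            # shape (nx, m): examples are columns
Y = np.random.randint(0, 2, (1, m))   # shape (1, m): one label per example

print(X.shape)  # (4, 3)
print(Y.shape)  # (1, 3)
```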

6

## Logistic regression (problem)

###
Algorithm for binary classification

Predict ŷ = P(y=1 | x), x in R^nx

Parameters w in R^nx, b in R

ŷ = σ(wTx+b) = σ(z), where z = wTx+b is the linear part and σ(z) = 1/(1+e^-z) squashes it so that 0 < ŷ < 1 (0.5 at z=0)

Learn w, b so that ŷ(i) ≈ y(i)
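
A short sketch of the prediction step; the parameters and input here are hypothetical, only the shapes matter:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx = 4
w = np.random.randn(nx, 1)   # hypothetical parameters
b = 0.0
x = np.random.randn(nx, 1)   # one input example

z = w.T @ x + b        # linear part, shape (1, 1)
y_hat = sigmoid(z)     # ŷ = P(y=1 | x), strictly between 0 and 1
print(y_hat.item())
```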

7

## Logistic regression (loss and cost functions)

###
Given ŷ = P(y=1 | x) for {(x1, y1),...,(xm,ym)}, we want ŷ≈y

Loss/error function to minimise ℒ(ŷ,y) tells how good ŷ is, applied to a single training sample.

We want to maximize If y=1: P(y | x) = ŷ, if y=0: P(y | x) = 1 - ŷ

So P(y | x) = ŷ^y(1-ŷ)^(1-y) (log is a strictly monotonically increasing function, so maximising x ⇔ maximising log x)

logP(y | x) = ylogŷ + (1-y)log(1-ŷ)

ℒ(ŷ,y) = -(ylogŷ + (1-y)log(1-ŷ)) (- because we minimise the loss)

Cost function defines the cost of the parameters

P(Y | X) = ∏(i=1 > m) P(y(i) | x(i))

logP(Y | X) = ∑logP(y(i) | x(i))

J(w,b) = 1/m ∑ℒ(ŷ(i),y(i)) (1/m is a scaling factor; minimising J ⇔ maximising logP(Y | X))

(assuming training examples are i.i.d. - independent and identically distributed - and using log∏x = ∑logx)

We look for w, b minimising J(w,b)
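
A minimal sketch of the cost computation (the predictions and labels below are made up just to show the formula):

```python
import numpy as np

def cost(Y_hat, Y):
    # J(w, b) = 1/m Σ ℒ(ŷ(i), y(i)) with the cross-entropy loss above
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

Y_hat = np.array([[0.9, 0.2, 0.7]])
Y     = np.array([[1,   0,   1  ]])
print(cost(Y_hat, Y))  # small value: predictions mostly agree with the labels
```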

8

## Gradient descent

###
J is convex, hence there are no bad local optima and gradient descent converges to the global optimum

Repeat

w = w - ɑ ∂J(w,b)/∂w

b = b - ɑ ∂J(w,b)/∂b

With ɑ the learning rate
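
A toy sketch of the update rule on a simple convex cost (a made-up function, not the logistic cost):

```python
# J(w, b) = (w - 3)^2 + (b + 1)^2, global optimum at (3, -1)
alpha = 0.1             # learning rate ɑ
w, b = 0.0, 0.0
for _ in range(100):
    dw = 2 * (w - 3)    # ∂J/∂w
    db = 2 * (b + 1)    # ∂J/∂b
    w -= alpha * dw
    b -= alpha * db
print(w, b)             # converges towards (3, -1)
```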

9

## Derivatives (general, log)

###
f'(a) = lim(h→0) [f(a+h) - f(a)]/h (how much a small nudge on the input changes the output)

log'(x) = 1/x
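
A quick finite-difference check of log'(x) = 1/x, evaluated at x = 2 with a small nudge h:

```python
import numpy as np

x, h = 2.0, 1e-6
numeric = (np.log(x + h) - np.log(x)) / h
print(numeric, 1 / x)   # both ≈ 0.5
```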

10

## Computational Graph for J(a,b,c) = 3(a + bc)

###
Forward vs backward propagation.

Useful for optimising an output variable J through intermediate variables: a left-to-right pass computes J, a right-to-left pass efficiently computes the derivatives using the chain rule.

u = bc, v = a+u, J = 3v

dJ/du = dJ/dv · dv/du (chain rule)
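
A sketch of both passes on this graph, with arbitrary example values a = 5, b = 3, c = 2:

```python
a, b, c = 5.0, 3.0, 2.0

# Forward pass (left to right)
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33

# Backward pass (right to left), chain rule
dJ_dv = 3.0
dJ_du = dJ_dv * 1.0   # dv/du = 1
dJ_da = dJ_dv * 1.0   # dv/da = 1
dJ_db = dJ_du * c     # du/db = c
dJ_dc = dJ_du * b     # du/dc = b
print(dJ_da, dJ_db, dJ_dc)   # 3.0 6.0 9.0
```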

11

## Gradient Descent for Logistic Regression

###
x, w, b -> z = wTx+b -> ŷ = σ(z) -> ℒ(ŷ,y)

∂L/∂ŷ = -y/ŷ + (1-y)/(1-ŷ)

∂L/∂z = ∂L/∂ŷ.∂ŷ/∂z = ŷ(1-ŷ)∂L/∂ŷ = ŷ - y

∂L/∂wi = ∂L/∂z.∂z/∂wi = xi*(ŷ - y)

∂L/∂b = ∂L/∂z.∂z/∂b = ŷ - y

∂J(w,b)/∂wi = 1/m ∑ ∂ℒ(ŷ(i),y(i))/∂wi

Vectorized algorithm (A = ŷ over all m samples; avoid explicit for loops!)

Z = w.T X + b

A = σ(Z)

dZ = A - Y

dw = 1/m X dZ.T

db = 1/m np.sum(dZ)

w = w - ɑ.dw

b = b - ɑ.db

(Single iteration)
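
A runnable numpy version of one iteration, on a made-up random dataset (nx features, m examples):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, m, alpha = 4, 100, 0.05
X = np.random.randn(nx, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
w = np.zeros((nx, 1))
b = 0.0

Z = w.T @ X + b            # (1, m)
A = sigmoid(Z)             # predictions for all m examples at once
dZ = A - Y                 # (1, m)
dw = X @ dZ.T / m          # (nx, 1)
db = np.sum(dZ) / m
w -= alpha * dw
b -= alpha * db
```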

12

## Numpy (broadcasting, shapes, rank)

###
broadcasting: (m,n) +-*/ (1,n) -> (m,n)

For the smaller array, each dimension must either match or be 1.

[a b c] may be a rank-1 array of shape (3,). Use .reshape(1,3) to make it an explicit row vector [a b c]! (otherwise e.g. .T has no effect)
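
A small sketch showing both the broadcasting rule and the rank-1 pitfall:

```python
import numpy as np

# Broadcasting: (m, n) op (1, n) -> (m, n)
A = np.ones((3, 4))
row = np.arange(4).reshape(1, 4)
print((A + row).shape)         # (3, 4): row is stretched over the 3 rows

# Rank-1 pitfall
a = np.array([1.0, 2.0, 3.0])  # shape (3,), rank 1
print(a.T.shape)               # still (3,): transpose has no effect
b = a.reshape(1, 3)            # explicit row vector
print(b.T.shape)               # (3, 1): a proper column vector
```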

13

## Neural networks representation

###
Input layer, hidden layer, output layer (called a 2-layer NN: the input layer is not counted)

X = a^[0]

a^[1] = column(a1[1] ... a4[1])

a^[2] = ŷ

Number of nodes nx=n[0], n[1]...

Parameters:

- W[1], b[1] (n[1],nx), (n[1],1)

- W[2], b[2] (n[2],n[1]), (n[2],1)
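
A sketch of parameter initialisation with these shapes (the layer sizes are hypothetical):

```python
import numpy as np

n0, n1, n2 = 3, 4, 1                  # n0 = nx inputs, n1 hidden units, n2 = 1 output
W1 = np.random.randn(n1, n0) * 0.01   # (n[1], n[0])
b1 = np.zeros((n1, 1))                # (n[1], 1)
W2 = np.random.randn(n2, n1) * 0.01   # (n[2], n[1])
b2 = np.zeros((n2, 1))                # (n[2], 1)
```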

14

## Neural network output (one sample x)

###
For each node i:

- zi[1] = wi[1]Tx+bi[1]

- ai[1] = σ(zi[1])

(Picture...)

Stacked per layer (nodes):

- z[1] = W[1]x+b[1], a[1] = σ(z[1])

- z[2] = W[2]a[1]+b[2], a[2] = σ(z[2])

where:

- W[i] = rows(w1[i]T ... wn[i][i]T), shape (n[i], n[i-1])

- x = a[0] = col(x1 ... xnx)

- b[i] = col(b1[i] ... bn[i][i])

- z[i] = col(z1[i] ... zn[i][i])

15

## Neural network output (vectorized)

###
m samples, a[layer](sample)

X = [x(1) ... x(m)] (nx,m)

Z[1] = W[1]X+b[1], A[1]=σ(Z[1])

Z[2] = W[2]A[1]+b[2], A[2]=σ(Z[2])

Where:

Z[i]=[z[i](1) ... z[i](m)] (n[i],m)

A[i]=[a[i](1) ... a[i](m)] (n[i],m)
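
A sketch of the vectorized forward pass; all sizes are made up, and sigmoid is used for both layers as in the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n0, n1, n2, m = 3, 4, 1, 5
X = np.random.randn(n0, m)            # columns are the samples x(1) ... x(m)
W1, b1 = np.random.randn(n1, n0), np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1), np.zeros((n2, 1))

Z1 = W1 @ X + b1     # (n1, m); b1 broadcasts over the m columns
A1 = sigmoid(Z1)     # (n1, m)
Z2 = W2 @ A1 + b2    # (n2, m)
A2 = sigmoid(Z2)     # (n2, m): ŷ for every sample
```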

16

## Activation functions

###
Sigmoid: a = 1/(1+e^-z), range (0,1), value 0.5 at z=0

Hyperbolic tangent: tanh(z) = (e^z-e^-z)/(e^z+e^-z), a shifted/rescaled sigmoid, range (-1,1), value 0 at z=0

ReLU: a = max(0,z), not differentiable at 0

Leaky ReLU: a = max(0.01z, z)
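
A compact sketch of the four activations in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))      # in (0, 1), 0.5 at z = 0
print(np.tanh(z))      # in (-1, 1), 0 at z = 0
print(relu(z))         # [0. 0. 2.]
print(leaky_relu(z))   # [-0.02 0. 2.]
```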

17

## Why non-linear activation fct

### Linear activations are useless because a composition of linear functions is still linear (no new functions are discovered). A linear activation can be used at the output layer to predict a real value (e.g. a price).

18

## Derivatives activation fct

###
Sigmoid: a(1-a)

Tanh: 1-a^2

ReLU: 0 if z<0, 1 otherwise (undefined at 0)

Leaky ReLU: 0.01 if z<0, 1 otherwise (undefined at 0)
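
A sketch of these derivatives as used in backprop (sigmoid and tanh written in terms of the activation a, the ReLU variants in terms of z):

```python
import numpy as np

def dsigmoid(a):
    return a * (1 - a)

def dtanh(a):
    return 1 - a ** 2

def drelu(z):
    return (z > 0).astype(float)       # 0 for z < 0, 1 for z > 0

def dleaky_relu(z):
    return np.where(z > 0, 1.0, 0.01)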

19

## Backpropagation neural network

###
dZ[2] = A[2]-Y, dW[2] = 1/m dZ[2]A[1]T, db[2] = 1/m ΣdZ[2]

dZ[1] = W[2]TdZ[2] * g[1]'(Z[1]) (element-wise), dW[1] = 1/m dZ[1]XT, db[1] = 1/m ΣdZ[1]

Numpy: compute db with np.sum(dZ, axis=1, keepdims=True)

Derived from the computation graph of ℒ(a,y)
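
A sketch of the backward pass for the 2-layer network, assuming the forward pass above with sigmoid for both layers (so g[1]'(Z1) = A1·(1-A1)):

```python
import numpy as np

def backward(X, Y, W2, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y                                   # (n2, m)
    dW2 = dZ2 @ A1.T / m                           # (n2, n1)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (n2, 1)
    dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))           # element-wise product
    dW1 = dZ1 @ X.T / m                            # (n1, n0)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (n1, 1)
    return dW1, db1, dW2, db2
```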

20

## Gradient descent neural network

###
Cost function J(W[1],b[1],W[2],b[2]) = 1/m Σℒ(ŷ,y)

Compute predictions ŷ(i)


dparams...

Update param = param - α·dparam
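
An end-to-end training-loop sketch tying forward pass, backward pass and updates together (sigmoid activations, random toy data; all sizes and hyperparameters are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n0, n1, n2, m, alpha = 3, 4, 1, 50, 0.5
X = np.random.randn(n0, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1, b1 = np.random.randn(n1, n0) * 0.01, np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1) * 0.01, np.zeros((n2, 1))

for _ in range(1000):
    # Forward: compute predictions A2 = ŷ
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # Backward: dparams
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # Update: param = param - α·dparam
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```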

21