ECE 597B: INTELLIGENT SYSTEMS


Professor Weibo Gong

211E Knowles Engineering Building
(413)-545-0384
e-mail: gong@ecs.umass.edu

Chapter 2. Neural Networks


CONTENTS


Background and Perspective

Back to Contents

Potential and Limitations

Back to Contents

Terminology of Neural Networks

Back to Contents

Neural Network Operation

  • Basic concept of neuron operation
  • Network updates until equilibrium is reached
Back to Contents

Neuron Types and Models

  • Discrete
    • Activations have only two possible values: {0,1}, {-1,1}
    • Deterministic update equation
    • Stochastic update equation
  • Continuous
    • Linear
    • Nonlinear
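
    As a concrete sketch (standard forms, e.g. Hertz, Krogh and Palmer; the notation is assumed, not taken from the original slides), the update equations for these neuron types can be written:

        V_i = \mathrm{sgn}\Big(\sum_j w_{ij} V_j - \theta_i\Big)   \quad \text{(deterministic discrete, } V_i \in \{-1,+1\}\text{)}
        \Pr(V_i = +1) = \frac{1}{1 + e^{-2\beta h_i}}, \quad h_i = \sum_j w_{ij} V_j - \theta_i   \quad \text{(stochastic discrete, } T = 1/\beta\text{)}
        V_i = g(h_i), \quad g \text{ linear or sigmoidal, e.g. } g(h) = \tanh(\beta h)   \quad \text{(continuous)}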
Back to Contents

Update Mechanisms

  • Discrete neurons
    • Synchronous
      • Apply update to each unit simultaneously
      • Appropriate for simulation
      • Requires clocking mechanism in hardware
    • Asynchronous
      • Apply update to a randomly selected unit i
      • Apply update to each unit randomly, with some probability of update per second
      • Appropriate for real-time, hardware implementation
  • Continuous neurons
    • Synchronous
    • Asynchronous
    • Continuous updating
      • Alternative
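
    As an illustrative sketch (not code from the course; discrete {-1,+1} units with an assumed weight matrix w and thresholds theta), synchronous and asynchronous updates differ only in how many units are refreshed per step:

        import numpy as np

        def sgn(x):
            # Sign convention: treat 0 as +1 so activations stay in {-1, +1}
            return np.where(x >= 0, 1, -1)

        def synchronous_step(V, w, theta):
            """Update every unit simultaneously from the current activations."""
            return sgn(w @ V - theta)

        def asynchronous_step(V, w, theta, rng):
            """Update one randomly chosen unit i, leaving the others unchanged."""
            i = rng.integers(len(V))
            V = V.copy()
            V[i] = 1 if (w[i] @ V - theta[i]) >= 0 else -1
            return V

        # Example usage with a random symmetric weight matrix (zero diagonal)
        rng = np.random.default_rng(0)
        V = rng.choice([-1, 1], size=5)
        w = rng.normal(size=(5, 5)); w = (w + w.T) / 2; np.fill_diagonal(w, 0)
        theta = np.zeros(5)
        for _ in range(100):
            V = asynchronous_step(V, w, theta, rng)   # repeat until V stops changing

    Iterating until no unit changes is one simple way to detect that an equilibrium has been reached.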
Back to Contents

Equilibrium States

  • Concept
    A set of initial activations {Vi} that do not change as the network is updated
  • Notation
  • Condition
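
    In this notation, a minimal statement of the equilibrium condition (assuming the deterministic update rule above):

        V_i^{e} = \mathrm{sgn}\Big(\sum_j w_{ij} V_j^{e} - \theta_i\Big) \quad \text{for all } i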
Back to Contents

Attractors: Stable Activations

  • Concept
    • Equilibrium activation Ve
    • Initial activations near Ve converge to Ve
  • Picture
Back to Contents

Learning

  • Behavior of network depends on weights
  • Adjust weights to improve performance
  • Learning paradigms
    • Supervised learning (learning with a teacher)
    • Unsupervised learning
  • Stability
Back to Contents

Issues

  • What is the best architecture?
    • Number of layers
    • Number and organization of connections
    • Type of activation functions
    • Type of updates
    • Number of neurons
  • How should the network be programmed?
    • Learning or pre-selected weights
    • Learning mechanism
    • Real-time or off-line
    • Training sets
  • What can the various types of networks do?
    • Problems to be solved
    • Speed
    • Ability to learn
  • How should the network be implemented?
Back to Contents

Associative Memory

  • Objective: Recall a stored pattern from a nearby input pattern
    • Given stored patterns
    • Find the stored pattern closest (in Hamming distance) to a new input pattern z
  • Hopfield Model
  • Discrete Neuron Models
  • Asynchronous Updates
    • Pick unit (neuron) i at random
    • Apply deterministic update rule
  • Hebb Rule
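
    The Hebb rule in its standard form for p stored patterns x^1, ..., x^p of N ±1 bits (a sketch following Hertz et al.; the normalization is assumed):

        w_{ij} = \frac{1}{N} \sum_{m=1}^{p} x_i^{m} x_j^{m}, \qquad w_{ii} = 0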
Back to Contents

Hopfield Model

Back to Contents

Equilibrium States

  • Would like original patterns to be equilibria
  • Actual
    • Orthogonal patterns
    • Random patterns with p << N
Back to Contents

Attractors: Energy Function

  • Want network to evolve to stored patterns
  • Energy function
  • Patterns (equilibria) are minima of the energy function
  • Energy function cannot increase at each update
    • Assume unit i has been updated
    • The resulting change in energy is:
  • Stored patterns are attractors (retrieval states)
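
    For reference, the standard Hopfield energy function and the change caused by updating unit i from V_i to V_i' (assuming symmetric weights, w_{ij} = w_{ji}):

        E = -\tfrac{1}{2} \sum_{i \ne j} w_{ij} V_i V_j + \sum_i \theta_i V_i
        \Delta E = -(V_i' - V_i)\Big(\sum_{j \ne i} w_{ij} V_j - \theta_i\Big) \le 0

    The inequality holds because the update rule sets V_i' to the sign of the quantity in parentheses, so the two factors never have opposite signs.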
Back to Contents

Spurious Minima

  • Negatives of patterns (easily handled)
  • Mixture states
    • Linear combinations of odd numbers of patterns
  • Spin glass states
    • Not correlated with any patterns
    • Occur when p is large compared to N
Back to Contents

Stochastic Networks

  • Probabilistic Updates
    • Selected unit updated by:
    • Pseudo-temperature
  • Expected unit value (mean field theory)
  • Patterns will be stable for
  • Mixture patterns will not be stable for
  • Network pattern determined by averages of units
  • Simulated annealing
    • Let pseudo-temperature decline:
    • Units approach
    • Mixture and spin glass patterns can be avoided
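
    A sketch of the standard forms behind these bullets (Glauber-type dynamics; notation as above, with pseudo-temperature T = 1/β):

        \Pr(V_i \to +1) = \frac{1}{1 + e^{-2\beta h_i}}, \qquad h_i = \sum_j w_{ij} V_j - \theta_i
        \langle V_i \rangle = \tanh\big(\beta \langle h_i \rangle\big) \quad \text{(mean-field approximation)}

    In the p << N regime the retrieval (pattern) states remain stable for T < 1, while the symmetric mixture states lose stability at a lower temperature (roughly T ≈ 0.46 for three-pattern mixtures, per Hertz et al.), which is why letting the temperature decline through this range suppresses mixture and spin glass states.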
Back to Contents

Network Capacity

  • How many patterns (pmax) can be stored and accurately retrieved?
  • Measures of retrieval accuracy
    • Probability of bit error: Prob(Vi ≠ xi)
    • Probability of exact pattern retrieval
  • Assume random patterns
    • Orthogonal patterns: pmax ≈ N
    • Correlated patterns: decorrelate at input or modify weights to decorrelate
Back to Contents

Network Capacity: Results

  • Deterministic network
    • 0.37% bit error, stable:
  • Stochastic network: phase diagram
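
    For reference, the standard results these bullets summarize (as given in Hertz et al.; treat as a sketch): pmax ≈ 0.138 N for the deterministic network with a small residual bit-error rate (the ≈ 0.37% figure above), pmax ≈ N / (2 ln N) if essentially every bit of a stored pattern must be recalled correctly, and for the stochastic network retrieval is possible below a critical line αc(T) in the (T, α = p/N) phase diagram, with αc(0) ≈ 0.138.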
Back to Contents

Continuous Neurons

  • Neuron behavior
  • Update mechanisms
    • Synchronous
    • Asynchronous
    • Continuous
  • Discrete update: analysis identical to stochastic network
  • Continuous update: only desired equilibria, stable
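
    The continuous-update dynamics referred to here are usually written as a system of differential equations (a sketch of the standard continuous Hopfield form; the internal state u_i, the time constant τ, and the external input I_i are assumed symbols):

        \tau \frac{du_i}{dt} = -u_i + \sum_j w_{ij} V_j + I_i, \qquad V_i = g(u_i)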
Back to Contents

Application to Optimization

  • Integer programming with quadratic objectives and linear constraints
    • Use penalty terms to incorporate constraints in objective
    • Objective becomes the energy function
    • Use change in energy to identify weights and thresholds
  • Applications include:
    • Traveling salesman problem
    • Graph bipartitioning
    • Image processing
Back to Contents

Optimization Example: Weighted Matching Problem

  • Minimize total distance of pairwise connections of points
    • N Points
    • Distance dij between points i and j
    • All points must be connected to exactly one other point
Back to Contents

Mathematical Formulation

  • Variables
  • Objective and Constraints
  • Energy Function: Enforce constraints via penalty terms
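
    A sketch of the formulation (following the treatment of this example in Hertz et al.; the penalty weight γ is an assumed symbol): introduce binary units n_ij ∈ {0,1} for each pair of points, with n_ij = 1 meaning points i and j are linked and n_ij = n_ji.

        \min \ \sum_{i<j} d_{ij}\, n_{ij} \qquad \text{s.t.} \quad \sum_{j \ne i} n_{ij} = 1 \ \ \text{for every } i
        E = \sum_{i<j} d_{ij}\, n_{ij} \;+\; \frac{\gamma}{2} \sum_i \Big(1 - \sum_{j \ne i} n_{ij}\Big)^2

    The penalty weight γ is chosen large enough that violating a constraint costs more than any distance saving.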
Back to Contents

Network Implementation

  • Stochastic units with flip probability
    where ΔHij is the energy change
  • Energy change:
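
    The flip probability referred to above is the usual stochastic acceptance rule (a sketch; T is the pseudo-temperature and ΔHij is the energy change that flipping unit n_ij would cause):

        \Pr(\text{flip } n_{ij}) = \frac{1}{1 + e^{\Delta H_{ij}/T}}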
Back to Contents

Interpretation

  • Compare with the energy change of a stochastic network to select:
    • Thresholds:
    • Weights:
  • Illustration: 4 points, 6 units
  • Can also implement with continuous, deterministic units
Back to Contents

Outline

  • Fundamentals of Neural Networks
  • Operation and analysis of neural networks
  • Learning in neural networks
  • Potential applications in elevator dispatching
    Emphasis: Understand the basic issues and their implications
Back to Contents

Learning

  • Concept: Adjust weights to obtain desired outputs for a given input set
  • Supervised Learning
    • Apply input pattern, observe output
    • Adjust weights based on error between actual and desired output
  • Unsupervised learning
    • Network must discover patterns, features, correlations, etc.
    • Requires redundancy in input
    • Hebbian learning
    • Competitive learning
Back to Contents

Procedure for Supervised Learning

  • Define training set
    • Set of input patterns
    • Corresponding desired outputs
  • Incremental updates
    • Randomly select input pattern
    • Update weights based on output
  • Batch updates
    • Apply all inputs before updates
  • Training epoch is one cycle through training set
Back to Contents

Simple Perceptrons

  • One layer, feedforward network structure
  • Deterministic units (continuous or discrete)
  • Thresholds represented by an additional clamped unit
  • Output for pattern m should be the target z^m:
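
    In symbols (a sketch with assumed notation: inputs x_k^m, weights w_ik, targets z_i^m):

        O_i^{m} = g\Big(\sum_k w_{ik}\, x_k^{m}\Big), \qquad \text{require } O_i^{m} = z_i^{m} \text{ for every training pattern } m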
Back to Contents

Representable Functions

  • Discrete units
    • Patterns in input space must be linearly separable
    • Simple Boolean functions: AND, OR
    • XOR cannot be represented
  • Continuous units
    • Patterns in input space must be linearly independent
    • Restrictive
Back to Contents

Learning for Threshold Units

  • Weight update
    where
  • Interpretation
    • Increase weight (increase output) if error and input have the same sign
    • η is the learning rate
  • Delta rule
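
    A minimal statement of the rule in the same notation (the standard perceptron / delta form):

        \Delta w_{ik} = \eta\, \big(z_i^{m} - O_i^{m}\big)\, x_k^{m}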
Back to Contents

Gradient Learning for Continuous Units

  • Error measure
  • Adjust weights by taking a step in the direction of the negative gradient
    • Gradient

      For g(x) = tanh(βx)
  • Delta rule
  • Alternative error functions
    • Relative entropy
    • Same delta rule with
    • Allows application to binary decision problems
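
    Written out for the quadratic error measure (a sketch in the same notation):

        E[w] = \frac{1}{2} \sum_{m,i} \big(z_i^{m} - O_i^{m}\big)^2, \qquad O_i^{m} = g(h_i^{m}), \quad h_i^{m} = \sum_k w_{ik}\, x_k^{m}
        \Delta w_{ik} = -\eta\, \frac{\partial E}{\partial w_{ik}} = \eta \sum_m \big(z_i^{m} - O_i^{m}\big)\, g'(h_i^{m})\, x_k^{m}
        g(x) = \tanh(\beta x) \ \Rightarrow\ g'(x) = \beta\, \big(1 - g(x)^2\big)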
Back to Contents

Multi-layer Perceptrons

  • Output for pattern m should be the target z^m:
Back to Contents

Representable Functions

  • Any function can be represented arbitrarily closely
    • Continuous functions by a two-layer network (1 hidden layer)
    • Discontinuous functions by a three-layer network (2 hidden layers)
    • May not be the best architectures
  • Can use continuous, nonlinear units
Back to Contents

Learning in Multi-Layer Networks: Back-Propagation

  • History (J. Hertz et al.): the back-propagation algorithm was invented independently several times, by Bryson and Ho [1969], Werbos [1974], Parker [1985], and Rumelhart et al. [1986].
  • Gradient descent based on error function
  • Error function
  • Delta rule
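
    As an illustrative sketch (not code from the course), the delta rule and the backward propagation of deltas for a two-layer tanh network can be written in a few lines of Python; the XOR example at the end uses the fact, noted earlier, that XOR needs a hidden layer:

        import numpy as np

        def train(X, Z, n_hidden=8, eta=0.2, epochs=5000, seed=0):
            """X: (P, n_in) input patterns, Z: (P, n_out) targets in (-1, 1)."""
            rng = np.random.default_rng(seed)
            W1 = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))   # input  -> hidden
            W2 = rng.normal(scale=0.1, size=(Z.shape[1], n_hidden))   # hidden -> output
            for _ in range(epochs):
                # Forward pass
                H = np.tanh(X @ W1.T)                 # hidden activations
                O = np.tanh(H @ W2.T)                 # network outputs
                # Backward pass: deltas = error * derivative of tanh
                d_out = (Z - O) * (1.0 - O**2)        # output-layer deltas
                d_hid = (d_out @ W2) * (1.0 - H**2)   # propagate deltas backwards
                # Delta rule for both layers (batch gradient descent)
                W2 += eta * d_out.T @ H
                W1 += eta * d_hid.T @ X
            return W1, W2

        # Example: learn XOR (not linearly separable, so a hidden layer is needed)
        X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
        Z = np.array([[-1], [1], [1], [-1]], dtype=float)
        W1, W2 = train(X, Z)
        # Typically recovers the XOR targets [-1, 1, 1, -1] after training
        print(np.sign(np.tanh(np.tanh(X @ W1.T) @ W2.T)))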
Back to Contents

Network Implementation of Back-Propagation

  • Dual network to propagate deltas backwards
Back to Contents

Multi-Layer Perceptron Applications

  • NETtalk (Sejnowski and Rosenberg)
    • Pronunciation of English text
    • Two layers, 80 hidden units, 26 output (phoneme) units
    • Trained on 1024 words with phonemes
    • Results:
      • Intelligible speech after 10 training epochs
      • 95% accuracy after 50 epochs
  • Sonar Target Recognition (Gorman and Sejnowski, 1988)
    • Distinguish between sonar returns from underwater rocks and metal cylinders
    • Inputs were FFT of return signal
    • Two layers, 0-24 hidden units, 2 output units (rock, cylinder)
    • Results: 100% accuracy after 200 epochs with 12 hidden units
      • No improvement for >12 hidden units
Back to Contents

Performance

  • Learning rate
  • Accuracy
  • Many factors
    • Number of layers
    • Number of units
    • Network structure
    • Input representation
    It is important to represent as much a priori information as possible
Back to Contents

Generalization

    NERF: Network Efficiently Representable Functions
    (Denker et al., 1987)

    Boolean functions for which the total number of units grows only polynomially in the number of bits of the input.

    It is hard to find a useful Boolean function that is not a NERF.

    K. Siu, V. Roychowdhury and T. Kailath, IEEE Trans. on Computers, Dec. 1991:

    Any symmetric Boolean function (in n variables) can be computed with O(√n) threshold gates in a depth-3 network. Any Boolean function (in n variables) can be computed with O(2^(n/2)) threshold gates in a depth-3 network.

    Capacity paradox: Consider N input bits and 1 output bit. There are 2^N possible input patterns and therefore 2^(2^N) possible rules. For example, when N = 30 there are about 2^(10^9) possible rules!

    However, reasonable rules are specified by no more than N^k bits for some small k. We should only consider NERFs, not general Boolean functions.

    Generalization

    • Networks can extend input in sensible ways
      • Identify relationships not readily apparent in original data
      • Represent features as codes in hidden layer
    • Networks can extend input in nonsensical ways
      • Too many units allows data to be overfit
      • Generalization of noise
Back to Contents

Other Supervised Learning

  • Recurrent networks
    • Modification to back propagation
    • Can be implemented without explicit matrix inversion
  • Learning with a critic
    • Limited performance feedback
      • Feedback is correct/incorrect or good/bad
      • No quantitative information on performance
    • Construct a learning target based on feedback
      • Associative reward-penalty (Barto and Anandan)
      • Back-propagation for remaining layers
Back to Contents

Unsupervised Learning

  • Hebbian Learning
    • Extract redundancies from data
    • Maximize output when input is similar to earlier inputs
    • Applications
      • Principal component analysis
      • Clustering
      • Feature Mapping
  • Competitive learning
    • Only one output unit can be on
      • Unit that wins inhibits all others
      • Output units are called winner-take-all or grandmother cells
    • Applications are clustering or categorization
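
    As a minimal sketch (standard forms, not necessarily the exact rules in the original slides): plain Hebbian learning changes a weight in proportion to the product of input and output activity, while competitive learning moves only the winning unit's weight vector toward the input.

        \Delta w_{k} = \eta\, V\, x_k \qquad \text{(Hebbian)}
        \Delta w_{i^{*}k} = \eta\, \big(x_k - w_{i^{*}k}\big) \qquad \text{(competitive, winning unit } i^{*}\text{)}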
Back to Contents

Learning Summary

  • Learning adapts network computation to observed data
    • Supervised learning compares network output with target data
    • Unsupervised learning categorizes
    • Training can be on- or off-line
  • Learning performance is a function of network architecture
    • Slow if network is mismatched to problem
    • Can provide significant generalization from training data
Back to Contents

Statistical Regression

    Problem:

    Exploring the relationship between some response y and a number of predictor variables x = (x1, x2, ..., xk). Namely, find a function g such that

    y - g(x)

    is as small as possible.

    If the structure of g(x) is given then the problem reduces to the determination of the parameters.

    Example: g(x) is a polynomial?
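
    For the polynomial case, a minimal least-squares sketch (the synthetic data and true coefficients below are assumptions for illustration only):

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.linspace(0, 1, 50)
        y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.05, size=x.size)  # synthetic data

        coeffs = np.polyfit(x, y, deg=2)     # determine the parameters of g
        g = np.poly1d(coeffs)
        print(coeffs)                        # approximately [-3, 2, 1]
        print(np.max(np.abs(y - g(x))))      # residuals y - g(x) are small

    The same parameter-fitting view carries over to the neural network model below: the weights play the role of the polynomial coefficients, but g is adjusted by gradient-based learning rather than solved in closed form.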

Back to Contents

Neural Net for Regression: Example

    Projectile Ballistics System:

    h = horizontal distance that the projectile travels until it hits the ground.

    Linear Regression Model:

    Response Surface Model:

    Neural Network Model

Back to Contents

Neural Net for Simulation: An Idea

Back to Contents

Back to course home page


Comments, Questions and Suggestions to Yuanjiang Ou
Web Page Manager, at you@ecs.umass.edu

Last Update: 01/10/97