ECE 655: Fault-Tolerant Systems

 
Instructor: C.M. Krishna.

e-mail: krishna@ecs.umass.edu.

Phone: (413) 545-0766.  
 
Office Hours: Tuesdays and Thursdays: 1-2 PM.



General Description

Computers and networks are increasingly used in critical applications, where system failures can be expensive or even catastrophic. Example applications include aircraft fly-by-wire control, automobile control, computers used in medical systems, spacecraft, and databases in a large variety of financial and enterprise applications. The overall reliability expected of a computer system in these applications far exceeds that of any individual computer. This course is about how to build highly reliable systems that continue to function acceptably even after a number of their components (hardware or software) have failed.


Grading
               Test 1........................................20
               Test 2........................................20
               Final examination.............................35
               Paper and presentation........................10
               Homework......................................15
 
 

Main Topics
  • Introduction to fault tolerance.
  • Measures of fault-tolerance.
  • Exploiting and managing redundancy in:
    • Hardware.
    • Software.
    • Time.
    • Data.
  • Network fault tolerance.
  • Issues in distributed systems.
    • Byzantine generals algorithm.
    • Fault-tolerant clock synchronization.
    • Reliable remote procedure calls.
  • Reliability evaluation techniques.




Announcements


Solutions to the Final Exam

There is a typo on page 25 of Chapter 4 (networks), in the treatment of the depth-first routing algorithm for hypercubes. SC(A) is the set of relative addresses of the nodes that would be visited if we traveled on each of the dimensions listed in set A. (The text incorrectly says that it is the set of nodes.) The relative address of node x with respect to node y is (x XOR y): its 1 bits mark the dimensions that need to be traversed to get to x from y (or vice versa).
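The relative-address definition above can be illustrated with a small Python sketch. It is not taken from the chapter: the function names are hypothetical, nodes are assumed to be labeled with non-negative integers, and dimension 0 is assumed to be the least significant bit. It does not implement SC(A) or the depth-first routing algorithm itself, only the XOR relation between two node labels.

```python
def relative_address(x, y):
    """Relative address of node x with respect to node y (symmetric in x and y)."""
    return x ^ y


def dimensions_to_traverse(x, y):
    """Dimensions (bit positions) in which x and y differ.

    Assumption for this sketch: dimension 0 is the least significant bit.
    """
    rel = relative_address(x, y)
    return {d for d in range(rel.bit_length()) if (rel >> d) & 1}


def follow_dimensions(start, dims):
    """Node reached from start after crossing one link in each listed dimension."""
    node = start
    for d in sorted(dims):
        node ^= 1 << d  # crossing a link in dimension d flips bit d
    return node


if __name__ == "__main__":
    x, y = 0b1011, 0b0110
    dims = dimensions_to_traverse(x, y)
    print(format(relative_address(x, y), "04b"), sorted(dims))
    print(follow_dimensions(y, dims) == x)
```

Because XOR is symmetric, the same set of dimensions leads from x to y and from y to x, which is why the definition holds "or vice versa".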




Homework
Homework 1: Questions 1, 2a, 2b, 8, 11, 17, 19 of Chapter 2. Due October 5.

Homework 2: Questions 1, 5, 6, 11, 12 of Chapter 4. Due October 17.

Homework 3: Questions 2, 3, 6, 7, 9 of Chapter 3. Due November 21.

Homework 4: Questions 3, 5a, 7 of Chapter 6; Questions 2, 4a of Chapter 5. Due November 28. Solutions


Readings
Readings will be handed out on paper or posted on this site. There is no textbook that you need to purchase for this course.

Slides: HWFT Part 1
Slides: HWFT Part 2
Slides: HWFT Part 3
Slides: Networks Part 1
Slides: Networks Part 2
Slides: Networks Part 3
Slides: Data Replication
Slides: Checkpointing Part 1
Slides: Checkpointing Part 2
Slides: Checkpointing Part 3
Slides: Coding
Slides: Coding Part 2
Slides: Software Fault Tolerance Part 1
Slides: Software Fault Tolerance Part 2
Byzantine Generals Algorithm
Slides: Byzantine Generals Algorithm
Slides: Hardware Clock Synchronization