ECE 655: Fault-Tolerant Systems
Instructor: C.M. Krishna.
e-mail: krishna@ecs.umass.edu.
Phone: (413) 545-0766.
Office Hours: Tuesdays and Thursdays, 1-2 PM.
General Description
Computers and networks are increasingly used in critical
applications, where system failures can be expensive or even
catastrophic. Example applications include aircraft fly-by-wire control,
automobile control, computers used in medical systems, spacecraft,
and databases in a large variety of financial and enterprise applications.
The overall reliability expected of a computer system in these applications
far exceeds that of any individual computer. This course is about how
to build a highly reliable system that continues to function acceptably
even after a number of its components (hardware or software)
have failed.
Grading
Test 1........................................20
Test 2........................................20
Final examination.............................35
Paper and presentation........................10
Homework......................................15
Main Topics
- Introduction to fault tolerance.
- Measures of fault-tolerance.
- Exploiting and managing redundancy in:
  - Hardware.
  - Software.
  - Time.
  - Data.
- Network fault tolerance.
- Issues in distributed systems.
- Byzantine generals algorithm.
- Fault-tolerant clock synchronization.
- Reliable remote procedure calls.
- Reliability evaluation techniques.
Announcements
Solutions to the Final Exam
There is a typo on page 25 of Chapter 4 (networks), in the treatment
of the depth-first routing algorithm for hypercubes. SC(A) is the
set of relative addresses of the nodes that would be visited if we
traveled on each of the dimensions listed in set A. (The text says
it is the set of nodes.) The relative address of node x with
respect to node y is (x XOR y): its 1 bits mark the dimensions that need
to be traversed to get to x from y (or vice versa).
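As an illustration of the relative-address definition above, here is a
minimal Python sketch. The function names and the convention that
dimension 0 is the least significant bit are assumptions made for this
example; they are not taken from the chapter.

```python
def relative_address(x, y):
    """Relative address of node x with respect to node y in a hypercube."""
    return x ^ y


def dimensions_to_traverse(x, y):
    """Set of dimensions on which x and y differ.

    Assumption: dimension d corresponds to bit d, with 0 the least
    significant bit.
    """
    rel = relative_address(x, y)
    return {d for d in range(rel.bit_length()) if (rel >> d) & 1}


def walk(y, dims):
    """Start at node y, cross each dimension in dims once, return the end node."""
    node = y
    for d in dims:
        node ^= 1 << d  # crossing dimension d flips bit d of the address
    return node


x, y = 0b0110, 0b0011
print(format(relative_address(x, y), "04b"))       # 0101
print(sorted(dimensions_to_traverse(x, y)))        # [0, 2]
print(walk(y, dimensions_to_traverse(x, y)) == x)  # True
```

Because XOR is symmetric, swapping x and y gives the same relative
address, and the order in which the dimensions are crossed does not
change the destination.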
Homework
Homework 1: Questions 1, 2a, 2b, 8, 11, 17, 19 of Chapter 2. Due October 5.
Homework 2: Questions 1, 5, 6, 11, 12 of Chapter 4. Due October 17.
Homework 3: Questions 2, 3, 6, 7, 9 of Chapter 3. Due November 21.
Homework 4: Questions 3, 5a, 7 of Chapter 6; Questions 2, 4a of Chapter 5.
Due November 28.
Solutions
Readings
Readings will be handed out on paper or posted on this site. There is
no textbook that you need to purchase for this course.
Slides: HWFT Part 1
Slides: HWFT Part 2
Slides: HWFT Part 3
Slides: Networks Part 1
Slides: Networks Part 2
Slides: Networks Part 3
Slides: Data Replication
Slides: Checkpointing Part 1
Slides: Checkpointing Part 2
Slides: Checkpointing Part 3
Slides: Coding
Slides: Coding Part 2
Slides: Software Fault Tolerance Part 1
Slides: Software Fault Tolerance Part 2
Byzantine Generals Algorithm
Slides: Byzantine Generals Algorithm
Slides: Hardware Clock Synchronization