Non-Blocking Synchronization and System Design Michael Greenwald Non-blocking synchronization (NBS) has significant advantages over blocking synchronization. It allows the same synchronized code to run in uniprocessors, asynchronous handlers, and on shared memory multiprocessors. It is deadlock-free. It aids fault-tolerance: processor failure almost never results in corrupted data structures. It eliminates interference between synchronization and the process scheduler. It increases total system throughput by allowing other processes to proceed even if a single process modifying a shared data structure is delayed. These advantages are becoming even more important as we see an increase in the use of parallelism and multi-processors, an increase in real-time programming, increased use of signals and class-libraries, and as the cost of a delay (e.g. page-fault, cache miss, or descheduling) increases relative to processor speed. This thesis demonstrates that non-blocking synchronization (NBS) is practical as the sole co-ordination mechanism in well-designed systems that provide the necessary underlying primitives. In support of this claim: * I show that careful design and implementation of operating system software for efficiency, reliability, and modularity makes implementing simple, efficient non-blocking synchronization far easier. * I demonstrate that DCAS (Double-Compare-and-Swap) is the necessary and sufficient primitive for implementing efficient non-blocking synchronization. * I demonstrate that an efficient, hardware, DCAS implementation is practical for RISC processors. This thesis presents non-blocking implementations of several common data-structures that exploit properties of well-designed systems and depend on DCAS functionality. These out-perform all previously published non-blocking implementations and perform comparably to spin-locks under no contention. These include all the performance critical data structures used in the Cache Kernel. I present several algorithms for CASn: An O(n) non-blocking version, an O(n^2) disjoint-access-parallel version, and an O(np) wait-free version. I present a contention-reduction technique based on DCAS that performs as well as the best previously published techniques, yet still provides fault tolerance and OS independence. I present an implementation of dynamic, software transactional memory that supports multi-object updates, and has O(w) cost (for w writes in an update). The algorithm makes progress in O(1/P) steps with probability P, and is thus ``effectively non-blocking''. This allows a relatively efficient universal transformation from any synchronization using locks to a non-blocking version. Finally, I present a design of an efficient, hardware, DCAS implementation that is specific to the R4000 processor; however, the observations that make implementation practical are {\em generally} applicable. In short, the incremental costs of adding binary atomic synchronization primitives are very low, given that designers have already implemented unary atomic synchronization primitives.