Overview
- Distributed Systems are a group of computers that work together to accomplish some task
- Independent failure modes
- Connected by a network with its own failure modes
-
Sources
- MIT Distributed Systems Playlist
- Paper for every lecture, the lectures aren’t going to make much sense if you haven’t read the papers. Read a paper rapidly and efficiently
- UW 452 Distributed Systems
- MIT Distributed Systems Playlist
-
Infrastructure - Abstractions
- Storage
- Communication systems
- as a tool computers need to communicate
- Computation system
-
Implementation
- RPC
- goal is to mask the fact that we’re communicating over an unreliable network
- Threads
- Programming technique that allows us to harness multi-core computers
- Way of structuring concurrent operations
- Locks
- Because we’re going to spend lots of time on concurrency control
- RPC
-
Fault Tolerance
- Generally
- If a system has a single computer, it can stay up for years without crashing, it’s pretty reliable
- However, if a system has 1000 computers, you will have many failures a day
- Big scale turns events into constant problems
- The ability to mask failures, the ability to proceed without failures has to be built into the design
- Availability
- Some systems are designed so that under some failures, the system will keep operating despite the failure
- A good available system needs to be recoverable as well
- Recoverability
- If something goes wrong, maybe servers will stop working
- The system should be able to recover and continue as if nothing happened (w/o any loss of correctness)
- They need to save the data
- Tools for fault tolerance
- Non volatile storage
- hard drives to store a checkpoint or log a state of a system
- Then its recovered, you can read the saved data
- Management - its expensive to update… so u need clever ways to write to non-volatile storage so you won’t write it a lot
- Replication
- Replicated copies that has same system state
- Problem - the replicas will not be in sync and not be replicas
- Non volatile storage
- Generally
-
Consistency
- Maybe we’re building a KV service, and it has
put(k,v)get(k) -> v
- If we’re replicating the state, we need to think about how to keep consistency among replicas
- There are strongly consistent and weakly consistent systems
- strong
- guaranteed to see the updated data
- very expensive spec to implement, lots of communication
- Also we want the failures to be independent…
- weak
- makes no guarantee
- sometimes this is useful because its less expensive
- strong
- Maybe we’re building a KV service, and it has