Distributed Systems

Overview

Distributed Systems are a group of computers that work together to accomplish some task

Independent failure modes

Connected by a network with its own failure modes

Intro - Reasons and Challenges
Course project for Distributed Systems - key-value store
Two vs Three Tier Architecture
Lab Framework
RPC (Remote Procedure Call)
Sources
- MIT Distributed Systems Playlist
  - Paper for every lecture, the lectures aren’t going to make much sense if you haven’t read the papers. Read a paper rapidly and efficiently
- UW 452 Distributed Systems
Infrastructure - Abstractions
- Storage
- Communication systems
  - as a tool computers need to communicate
- Computation system
Implementation
- RPC
  - goal is to mask the fact that we’re communicating over an unreliable network
- Threads
  - Programming technique that allows us to harness multi-core computers
  - Way of structuring concurrent operations
- Locks
  - Because we’re going to spend lots of time on concurrency control
Fault Tolerance
- Generally
  - If a system has a single computer, it can stay up for years without crashing, it’s pretty reliable
  - However, if a system has 1000 computers, you will have many failures a day
  - Big scale turns events into constant problems
  - The ability to mask failures, the ability to proceed without failures has to be built into the design
- Availability
  - Some systems are designed so that under some failures, the system will keep operating despite the failure
  - A good available system needs to be recoverable as well
- Recoverability
  - If something goes wrong, maybe servers will stop working
  - The system should be able to recover and continue as if nothing happened (w/o any loss of correctness)
  - They need to save the data
- Tools for fault tolerance
  - Non volatile storage
    - hard drives to store a checkpoint or log a state of a system
    - Then its recovered, you can read the saved data
    - Management - its expensive to update… so u need clever ways to write to non-volatile storage so you won’t write it a lot
  - Replication
    - Replicated copies that has same system state
    - Problem - the replicas will not be in sync and not be replicas
Consistency
- Maybe we’re building a KV service, and it has
  - put(k,v)
  - get(k) -> v
- If we’re replicating the state, we need to think about how to keep consistency among replicas
- There are strongly consistent and weakly consistent systems
  - strong
    - guaranteed to see the updated data
    - very expensive spec to implement, lots of communication
    - Also we want the failures to be independent…
  - weak
    - makes no guarantee
    - sometimes this is useful because its less expensive

leejunkim

Explorer

Distributed Systems

Graph View