
Consensus, Raft and Rafter


Tech talk on consensus protocols, with a focus on Raft and Andrew Stone's Erlang implementation: rafter.

Presented at Erlang NYC: http://www.meetup.com/Erlang-NYC/events/131394712/

Tom Santero

August 01, 2013

Transcript

  1. Failure Mode 1: single node with async disk writes. Data is written to the fs buffer, the user is sent an acknowledgement, then power goes out. Data not yet written to disk is LOST and the system is UNAVAILABLE. Single-disk solutions: fsync, battery backup, prayer.
  2. Failure Mode 2: master/slave with asynchronous replication. Data is written by the user and acknowledged; data is synced on the Primary, but the Primary crashes.
  3. Consistent or Available? The Primary has failed; data not yet written to the Secondary has already been ack’d to the Client:
      if promote_secondary() == true { stderr("data loss"); } else { stderr("system unavailable"); }
  4. “The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications.”
  5. Consensus: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity.
  6. Consensus: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity (the value must have been proposed).
  7. Consensus: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity (the value must have been proposed).
  8. The same properties, grouped into Safety and Liveness: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity (the value must have been proposed).
  9. Safety and Liveness: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity (the value must have been proposed).
  10. Safety and Liveness: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity (the value must have been proposed).
  11. Safety and Liveness: termination (non-faulty processes eventually decide on a value); agreement (processes that decide do so on the same value); validity, i.e. non-triviality (the value must have been proposed).
  12. Motivation: RAMCloud. Large-scale, general-purpose, distributed storage; all data lives in DRAM; strong consistency model. https://ramcloud.stanford.edu/
  13. Motivation: RAMCloud. Large-scale, general-purpose, distributed storage; all data lives in DRAM; strong consistency model; 100-byte object reads in 5μs. https://ramcloud.stanford.edu/
  14. John Ousterhout and Diego Ongaro, “In Search of an Understandable Consensus Algorithm”. https://ramcloud.stanford.edu/raft.pdf
  15. “Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture is unsuitable for building practical systems, requiring complex changes to create an efficient and complete solution. As a result, both system builders and students struggle with Paxos.”
  16. [Diagram: three servers, each with a Log, a State Machine (SM) and a Consensus module (C), plus a Client.] 1. client makes request to Leader
  17. [Same diagram.] 2. consensus module manages request
  18. [Same diagram.] 3. persist instruction to local log
  19. [Same diagram.] 4. leader replicates command to other machines
  20. [Same diagram.] 5. command recorded to local machines’ logs
  21. [Same diagram.] 7. command forwarded to state machines for processing
  22. [Same diagram.] 7. command forwarded to state machines for processing
  23. [Same diagram.] 8. SM processes command, ACKs to client
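  A minimal Erlang sketch of the leader-side flow above. Everything here (the module name, replicate/2, the append_entries and appended messages) is illustrative rather than rafter's actual code; the log is just an in-memory tuple and error handling is omitted.

      -module(raft_flow_sketch).
      -export([replicate/2]).

      %% Hypothetical leader-side flow for steps 3-8 above, not rafter's API.
      replicate(Command, FollowerPids) ->
          ClusterSize = length(FollowerPids) + 1,
          %% step 3: persist the command to the leader's local log (in-memory here)
          Entry = {erlang:monotonic_time(), Command},
          %% step 4: replicate the entry to the other machines
          [Pid ! {append_entries, self(), Entry} || Pid <- FollowerPids],
          %% steps 5-7: wait until a majority (leader included) has recorded the
          %% entry, after which it is safe to hand it to the state machine
          ok = wait_for_majority(ClusterSize, 1),
          %% step 8: the caller can now acknowledge the client
          {ok, Entry}.

      wait_for_majority(ClusterSize, Acks) when Acks > ClusterSize div 2 ->
          ok;
      wait_for_majority(ClusterSize, Acks) ->
          receive
              {appended, _FollowerPid} ->
                  wait_for_majority(ClusterSize, Acks + 1)
          end.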
  24. Why does that work? It is the job of the consensus module (C) to: manage the replicated logs; determine when it is safe to pass an entry to the state machine for execution; it only requires majority participation.
  25. Why does that work? It is the job of the consensus module (C) to: manage the replicated logs and determine when it is safe to pass an entry to the state machine for execution (Safety); it only requires majority participation (Liveness).
  26. 1. Select 1/N servers to act as Leader. 2. Leader ensures Safety and Linearizability. 3. Detect crashes + elect new Leader. 4. Maintain consistency after Leadership “coups”. 5. Depose old Leaders if they return. 6. Manage cluster topology.
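  Terms are the mechanism behind points 3-5: every Raft message carries the sender's term, and any node that sees a higher term than its own adopts it and reverts to follower, which is how returning stale leaders get deposed. An illustrative fragment (not rafter's code), with node state modelled as a plain {Term, Role} tuple:

      %% Hypothetical: step down if an incoming message carries a newer term.
      maybe_step_down(MsgTerm, {CurrentTerm, _Role}) when MsgTerm > CurrentTerm ->
          {MsgTerm, follower};
      maybe_step_down(_MsgTerm, State) ->
          State.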
  27. Possible Server Roles: Leader, Follower, Candidate. Leader: at most one valid Leader at a time; receives commands from clients; commits entries; sends heartbeats.
  28. Possible Server Roles: Leader, Follower, Candidate. Follower: replicates state changes; passive member of the cluster during normal operation; votes for Candidates.
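  Expressed as a type, the roles are simply an enumeration; rafter models them as gen_fsm states (see rafter_consensus_fsm later) rather than a data value, so the line below is purely illustrative:

      -type role() :: leader | follower | candidate.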
  29. Potential Use Cases: Distributed Lock Manager, Database Transactions, Automated Failover, Configuration Management (http://coreos.com/blog/distributed-configuration-with-etcd/index.html), Service Discovery, etc.
  30. What: a labor of love, a work in progress; a library for building strongly consistent distributed systems in Erlang; implements the raft consensus protocol in Erlang; the fundamental abstraction is the replicated log.
  31. Replicated Log: the API operates on log entries; log entries contain commands; commands are transparent to Rafter; systems build on top of rafter with pluggable state machines that process commands upon log entry commit.
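  One way to picture that abstraction: a replicated entry carries bookkeeping that rafter understands (index, term) plus a command it never interprets. The record below is illustrative, not rafter's exact definition:

      %% Illustrative log entry: rafter replicates entries in order; the
      %% command inside is opaque and only interpreted by the pluggable
      %% state machine when the entry commits.
      -record(log_entry, {
                index   :: non_neg_integer(),
                term    :: non_neg_integer(),
                command :: term()            % e.g. {set, Key, Value}
               }).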
  32. Erlang: A Concurrent Language. Processes are the fundamental abstraction; processes can only communicate by sending each other messages; processes do not share state; processes are managed by supervisor processes in a hierarchy.
  33. Erlang: A Concurrent Language
      loop() ->
          receive
              {From, Msg} ->
                  From ! Msg,
                  loop()
          end.

      %% Spawn 100,000 echo servers
      Pids = [spawn(fun loop/0) || _ <- lists:seq(1, 100000)].

      %% Send a message to the first process (lists:nth/2 is 1-indexed)
      lists:nth(1, Pids) ! {self(), ayo}.
  34. Erlang: A Functional Language
      • Single Assignment Variables
      • Tail-Recursion
      • Pattern Matching
        {op, {set, Key, Val}} = {op, {set, <<"job">>, <<"developer">>}}
      • Bit Syntax
        Header = <<Sha1:20/binary, Type:8, Term:64, Index:64, DataSize:32>>
  35. Erlang: A Distributed Language
      Location Transparency: processes can send messages to other processes without having to know whether the other process is local.
      %% Send to a local gen_server process
      gen_server:cast(peer1, do_something)

      %% Send to a gen_server on another machine ({RegisteredName, Node})
      gen_server:cast({peer1, '[email protected]'}, do_something)

      %% wrapped in a function with a variable name for a clean client API
      do_something(Name) -> gen_server:cast(Name, do_something).

      %% Using the API
      Result = do_something(peer1).
  36. Erlang: A Reliable Language. Erlang embraces “fail-fast”: code for the good case, fail otherwise. Supervisors relaunch failed processes; links and monitors alert other processes of failure. This avoids coding most error paths and helps prevent logic errors from propagating.
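  A small illustration of the links-and-monitors point (generic Erlang, not rafter code): instead of wrapping every call in defensive error handling, a process can monitor another and simply react when it dies.

      %% Monitor another process and turn its crash into an ordinary message.
      watch(Pid) ->
          Ref = erlang:monitor(process, Pid),
          receive
              {'DOWN', Ref, process, Pid, Reason} ->
                  {process_died, Reason}
          end.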
  37. OTP. OTP is a set of modules and standards that simplifies the building of reliable, well-engineered Erlang applications. The gen_server, gen_fsm and gen_event modules are the most important parts of OTP. They wrap processes as server “behaviors” in order to facilitate building common, standardized distributed applications that integrate well with the Erlang Runtime.
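  For readers new to OTP, a complete if trivial gen_server looks roughly like this; it is a generic echo server, not part of rafter:

      -module(echo_server).
      -behaviour(gen_server).
      -export([start_link/0, echo/1]).
      -export([init/1, handle_call/3, handle_cast/2]).

      %% client API
      start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
      echo(Msg)    -> gen_server:call(?MODULE, {echo, Msg}).

      %% gen_server callbacks
      init([]) -> {ok, no_state}.
      handle_call({echo, Msg}, _From, State) -> {reply, Msg, State}.
      handle_cast(_Msg, State) -> {noreply, State}.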
  38. Peers: each peer is made up of two supervised processes: a gen_fsm that implements the raft consensus fsm, and a gen_server that wraps the persistent log. An API module hides the implementation.
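  That structure might be wired up with a per-peer supervisor along these lines; the child specs and the start_link/1 arguments are assumptions for illustration, not rafter's exact layout:

      %% Hypothetical per-peer supervisor init/1: restart both processes
      %% together if either one crashes.
      init([Name]) ->
          LogSpec = {rafter_log,
                     {rafter_log, start_link, [Name]},
                     permanent, 5000, worker, [rafter_log]},
          FsmSpec = {rafter_consensus_fsm,
                     {rafter_consensus_fsm, start_link, [Name]},
                     permanent, 5000, worker, [rafter_consensus_fsm]},
          {ok, {{one_for_all, 5, 10}, [LogSpec, FsmSpec]}}.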
  39. Rafter API
      • The entire user API lives in rafter.erl
      • rafter:start_node(peer1, kv_sm).
      • rafter:set_config(peer1, [peer1, peer2, peer3, peer4, peer5]).
      • rafter:op(peer1, {set, <<"Omar">>, <<"gonna get got">>}).
      • rafter:op(peer1, {get, <<"Omar">>}).
  40. Output State Machines: commands are applied in order to each peer's state machine as their entries are committed; all peers in a consensus group can only run one type of state machine, passed in during start_node/2; each state machine must export apply/1.
  41. Hypothetical KV store (in kv_sm.erl)
      %% API
      set(Key, Value) ->
          Peer = get_local_peer(),
          rafter:op(Peer, {set, Key, Value}).

      %% State machine callbacks
      apply({set, Key, Value}) -> ets:insert(kv_sm_store, {Key, Value});
      apply({get, Key}) -> ets:lookup(kv_sm_store, Key).
  42. rafter_consensus_fsm: a gen_fsm that implements Raft; 3 states: follower, candidate, leader; messages are sent and received between FSMs according to the raft protocol; state-handling functions pattern match on messages to simplify and shorten handler clauses.
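  The pattern-matching point looks roughly like this in a gen_fsm. The clauses below are simplified illustrations (real Raft vote handling also checks whether the candidate's log is up to date), not rafter's actual code:

      -record(state, {term = 0, voted_for = undefined}).

      %% Each clause matches one protocol message in the follower state.
      follower({request_vote, CandidateId, Term}, #state{term = Current} = S)
        when Term > Current ->
          %% newer term: adopt it and grant the vote
          {next_state, follower, S#state{term = Term, voted_for = CandidateId}};
      follower({request_vote, _CandidateId, _Term}, S) ->
          %% stale or duplicate request: ignore
          {next_state, follower, S};
      follower(election_timeout, S) ->
          %% heard nothing from a leader: stand for election
          {next_state, candidate, S#state{term = S#state.term + 1}}.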
  43. rafter_log.erl: the log API used by rafter_consensus_fsm and rafter_config; uses binary pattern matching for reading logs; writes entries to an append-only log; state machine commands are encoded with term_to_binary/1.
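  Reading an entry back with the bit syntax from the earlier slide might look like the sketch below; the actual on-disk layout and decoding live in rafter_log.erl, so treat this as illustrative:

      %% Decode one entry from a binary using the header format shown earlier.
      decode_entry(<<Sha1:20/binary, Type:8, Term:64, Index:64, DataSize:32,
                     Data:DataSize/binary, Rest/binary>>) ->
          Command = binary_to_term(Data),   % commands were written with term_to_binary/1
          {{Sha1, Type, Term, Index, Command}, Rest}.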
  44. rafter_config.erl: Rafter handles dynamic reconfiguration of its clusters at runtime. Depending upon the configuration of the cluster, different code paths need navigating, such as whether a majority of votes has been received. Instead of embedding this logic in the consensus fsm, it was abstracted out into a module of pure functions.
  45. rafter_config.erl API
      -spec quorum_min(peer(), #config{}, dict()) -> non_neg_integer().
      -spec has_vote(peer(), #config{}) -> boolean().
      -spec allow_config(#config{}, list(peer())) -> boolean().
      -spec voters(#config{}) -> list(peer()).
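  Because these are pure functions over the configuration, they are easy to test in isolation. A simplified majority check in the same spirit (illustrative, not the real quorum logic, which also depends on the cluster configuration):

      %% True once more than half of the voters have responded.
      -spec has_majority(list(peer()), list(peer())) -> boolean().
      has_majority(Voters, Responded) ->
          Acks = length([V || V <- Voters, lists:member(V, Responded)]),
          Acks > length(Voters) div 2.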
  46. Property Based Testing: uses Erlang QuickCheck; too complex to get into now; come hear me talk about it at Erlang Factory Lite in Berlin! (shameless plug)
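  To give a flavor of the shape of such a test, here is a generic QuickCheck-style property, not one of rafter's actual properties:

      %% Requires Erlang QuickCheck; ?FORALL generates random inputs and
      %% checks the boolean expression for each of them.
      -include_lib("eqc/include/eqc.hrl").

      prop_append_length() ->
          ?FORALL({A, B}, {list(int()), list(int())},
                  length(A ++ B) =:= length(A) + length(B)).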
  47. Thanks. Andy Gross: introducing us to Raft. Diego Ongaro: writing Raft, clarifying Tom’s understanding, reviewing slides. Chris Meiklejohn (http://thinkdistributed.io): being an inspiration. Justin Sheehy: reviewing slides, correcting poor assumptions. Reid Draper: helping rubber duck solutions. Kelly McLaughlin: helping rubber duck solutions. John Daily: for his consistent pedantry concerning Tom’s abuse of English. Basho: letting us indulge our intellect on the company’s dime (we’re hiring).