Monday 28 September 2015

Couchdb, MVCC and Conflicts while Replicating

Setup

In my last post regarding Multiversion Concurrency Control, we saw what it takes to enter conflicting versions of a document into a single couch instance. You have to be somewhat resourceful.

But the real fun with the couch comes from its distributed nature.
We will see that the rules change a bit when we talk about more than one instance and use replication to synchronize them.

Here's the setup for playing around with MVCC on two couches:

First Pi:

  • hostname: frodo
  • Model: Pi B+ (ARMv6h)
  • OS: Arch Linux ARM
  • Couchdb 1.6.1_4 (taken from Arch Linux ARM Repository)

Second Pi:

  • hostname: arwen

Again, everything will be done via curl. Data will not be passed directly on the command line but will always be read from a file (due to strange behaviour of my curl on Windows).

We assume two users entering data into their respective Pis. Arwen uses the couch on her arwen-Pi while Frodo uses his frodo-Pi. Eventually they will exchange their work via replication.

Preparation

Let's start from scratch by creating the database on each Pi respectively:
#
# create the database on arwen
curl -X PUT http://arwen:5984/mvcc
#
# response
{"ok":true}
#
# create the database on frodo
curl -X PUT http://frodo:5984/mvcc
#
# response
{"ok":true}
#

...And Go

Arwen inserts her document first:
#
# Arwen inserts her Doc rep_mydoc_u1_1.json
{
  "content": "U1_1"
}
#
curl -H "Content-Type: application/json" -d @rep_mydoc_u1_1.json -X PUT http://arwen:5984/mvcc/mydoc 
#response
{
  "ok":true,
  "id":"mydoc",
  "rev":"1-3557461c60a30b0d156f8b36a1bdcf9f"
}
#

Arwen wants to share her document with Frodo. She submits a Push Replication Request into the _replicator database of her Pi to trigger the replication:
#
# Arwen shares her doc with frodo via replication
# She initiates a push replication from arwen to frodo
# push_a2f_01.json:
{
  "source": "mvcc", 
  "target": "http://frodo:5984/mvcc"
}
#
curl -H "Content-Type: application/json" -d @push_a2f_01.json -X PUT http://arwen:5984/_replicator/a2s01
#
#response
{
  "ok":true,
  "id":"a2s01",
  "rev":"1-0088a4a381404b513bf0586d08d6ce80"
}
#
Taking a look into Arwen's couch.log tells us that the replication took place:
#
Document `a2s01` triggered replication `6018cd9109568fed438add0722e9bccb`
starting new replication `6018cd9109568fed438add0722e9bccb` at <0.31751.2> (`mvcc` -> `http://frodo:5984/mvcc/`)
recording a checkpoint for `mvcc` -> `http://frodo:5984/mvcc/` at source update_seq 1
Replication `6018cd9109568fed438add0722e9bccb` finished (triggered by document `a2s01`)
#

Please note that these Pis do not know about each other. The replication request is the only point of contact. This request requires Arwen to know about a frodo-Pi.
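
Since the request document is the only place where one Pi names the other, it can be handy to generate it instead of typing it. The following is a minimal sketch, not something the couch provides: the function name replication_doc is made up for this example. It simply prints the same two-field JSON that push_a2f_01.json contains.

```shell
# replication_doc SOURCE TARGET
# Hypothetical helper (not part of CouchDB): prints a replication request
# body with the same two fields as push_a2f_01.json.
# Assumes SOURCE and TARGET contain no characters that need JSON escaping.
replication_doc() {
  printf '{\n  "source": "%s",\n  "target": "%s"\n}\n' "$1" "$2"
}

# Push from the local mvcc database to frodo (printed to stdout):
replication_doc "mvcc" "http://frodo:5984/mvcc"
```

Redirect the output into a file and submit it with the same curl call as before. Swapping the two arguments turns the push request into a pull request.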

OK, Frodo should have Arwen's document on his Pi now:
#
# Frodo should now have the document too:
curl  http://frodo:5984/mvcc/mydoc
#response
{
  "_id":"mydoc",
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content":"U1_1"
}
#

Now Arwen and Frodo both continue to work on their respective copy of the document and eventually save their work:
#
# Arwen edits her document on arwen
# rep_mydoc_u1_2.json:
{
  "_rev": "1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U1_2"
}
#
curl -H "Content-Type: application/json" -d @rep_mydoc_u1_2.json -X PUT http://arwen:5984/mvcc/mydoc
# response
{
  "ok":true,
  "id":"mydoc",
  "rev":"2-2686fb85c0681a3d8c411617f048f94f"
}
#
# Frodo does the same on frodo
# rep_mydoc_u2_2.json:
{
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U2_2"
} 
#
curl -H "Content-Type: application/json" -d @rep_mydoc_u2_2.json -X PUT http://frodo:5984/mvcc/mydoc
# response
{
  "ok":true,
  "id":"mydoc",
  "rev":"2-03b64efa2cd6619f46bcbe618fa791f9"
}
#
Each Pi now holds a different version of the document.
Frodo initiates a full sync by first triggering a push replication to Arwen, followed by a pull replication from Arwen. As Frodo now takes the lead, both requests will be submitted into the _replicator DB of his Pi:
# Frodo pushes his stuff to Arwen
# push request push_f2a_01.json:
{
  "source": "mvcc", 
  "target": "http://arwen:5984/mvcc"
}
#
curl -H "Content-Type: application/json" -d @push_f2a_01.json -X PUT http://frodo:5984/_replicator/f2a01
#response
{
  "ok":true,
  "id":"f2a01",
  "rev":"1-d20099b5d5b65eb05271be0204d8100a"
}
#
# Next Frodo pulls from Arwen
# pull request pull_a2f_01.json:
{
  "source": "http://arwen:5984/mvcc", 
  "target": "mvcc"
}
curl -H "Content-Type: application/json" -d @pull_a2f_01.json -X PUT http://frodo:5984/_replicator/a2f01
#response
{
  "ok":true,
  "id":"a2f01",
  "rev":"1-26926753f759498b86ece4e48fdb0e5f"
}
#
What would be our expectation after syncing both Pis?
The same document was edited on different hosts. After the new versions had been submitted, each host then held the old and a new version of the document. Both hosts may claim to hold the current version of the document with equal rights.
After a full sync, we expect this:

  • there is identical data on both hosts
  • each host holds the old version and both "new" versions of the document
So, let's see.
We're going to check by requesting the current version and conflicting versions if any.
Let's check on Arwen first:
#
# there should be a conflict on arwen now... 
curl  http://arwen:5984/mvcc/mydoc?conflicts=true
#response
{
  "_id":"mydoc",
  "_rev":"2-2686fb85c0681a3d8c411617f048f94f",
  "content":"U1_2",
  "_conflicts":["2-03b64efa2cd6619f46bcbe618fa791f9"]
}
#
The current version is the one that Arwen herself submitted.
As expected, there is a conflict.

What does it look like on Frodo's Pi?
#
# there should be a conflict on frodo too... 
curl  http://frodo:5984/mvcc/mydoc?conflicts=true
#response
{
  "_id":"mydoc",
  "_rev":"2-2686fb85c0681a3d8c411617f048f94f",
  "content":"U1_2",
  "_conflicts":["2-03b64efa2cd6619f46bcbe618fa791f9"]
}
#
On Frodo's Pi we find the same situation. Arwen's document is delivered as current. Frodo's version constitutes the conflict.
The couch keeps its promise to deliver the same "winning" version on both nodes.
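
Both nodes arrive at the same winner without talking to each other because the choice depends only on the revisions themselves. As far as I understand the rule for non-deleted revisions, the revision with the higher number in front of the dash wins, and a tie is broken by comparing the hash part. The sketch below mimics that rule with sort. The name pick_winner is invented for this example, and the real algorithm additionally takes deleted revisions into account.

```shell
# pick_winner REV...
# Hypothetical helper mimicking the couch's deterministic choice:
# sort by the number before the dash (numerically), then by the hash
# (bytewise), and take the last, i.e. highest, entry.
# Deleted revisions are ignored in this sketch.
pick_winner() {
  printf '%s\n' "$@" | LC_ALL=C sort -t- -k1,1n -k2,2 | tail -n 1
}

# Arwen's and Frodo's second revisions from the example above:
pick_winner "2-2686fb85c0681a3d8c411617f048f94f" \
            "2-03b64efa2cd6619f46bcbe618fa791f9"
```

For these two revisions it prints Arwen's 2-2686fb85c0681a3d8c411617f048f94f, which is the version both Pis delivered as current.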

Summary

As far as conflicts are concerned, working in a distributed setup changes the rules completely.
On a single node, the couch is quite strict about avoiding conflicts. You need a bulk update with a special mode switched on to create one.
Once you decide to go distributed, the priorities change. When replicating between nodes, pushing or pulling your data successfully becomes the main objective. The goal is to save data over a network. As the nodes operate completely independently of one another, conflicts cannot be avoided.

Well, if you need to go for distributed and want your nodes to be independent, this is the price you have to pay. As economics teaches us: there is no such thing as a free lunch. This seems to hold true for the computer scientist's menu too.




Sunday 20 September 2015

Installing Couchdb 1.6.1 on a Raspberry Pi Model 2

Couchdb 1.6.1 on a Pi 2

Update 03.11.2015

The Erlang Solutions Repository now contains a new version of Erlang. The major version is now 18. This is too high for the couch in version 1.6.x.
For this reason, please omit the step of including the Erlang Solutions Repository.
Just rely on what you get from the default Raspbian/Debian repos.
I still have to verify this with Wheezy, but for the new Raspbian Jessie image this does the trick.

Installing the couch version 1.6.1

This will probably be my shortest post ever.
Last week I installed couchdb version 1.6.1 on my Raspberry Pi Model 2.
I did this for two reasons. One was to have the couch on Pi 2. The second was to see if my own install instructions are still valid for couch 1.6.1 and Pi Model 2. Two readers reported problems, so I was a little worried.

But everything worked well and it was soon time to relax.
I followed the instructions using copy and paste, with only two exceptions:
The instructions are still valid. They worked for me and should do so for you.

Have fun.

Tuesday 10 February 2015

CouchDB - MVCC and Conflicts

This is a small entry about couchDB's Multi Version Concurrency Control mechanism and what it takes to have conflicting documents end up on the couch.
Though MVCC is well covered by couchDB's documentation, I wanted to see it in action with my own Pies :-)

Setup

I have couchDB installed on two Pies, gandalf and samwise. On gandalf, the couchdb version is 1.6.0 whereas on samwise it is a 1.5.1.

We will first create conflicts on a single node (gandalf) and then on two nodes by means of master-master replication.
Curl will be used to talk to the couches (note: my curl shell on Windows does not like mixing " and ' which is why I have to put all the JSON data I want to send via curl into files).

If you want to replay this on your system, make sure to not only adjust IP addresses or host names but also substitute the revision values (_rev) with the ones you'll receive as response.
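
Copying revision values by hand gets tedious quickly. Here is a small sketch of how the substitution could be scripted, assuming a shell with sed is available. The names extract_rev and update_doc are invented for this example. extract_rev pulls the value of "rev" (or "_rev") out of a response, and update_doc prints an update body of the kind used throughout this entry.

```shell
# extract_rev: reads a couchdb JSON response on stdin and prints the value
# of its "rev" or "_rev" field. Hypothetical helper; a plain sed pattern,
# not a real JSON parser.
extract_rev() {
  sed -n 's/.*"_\{0,1\}rev" *: *"\([^"]*\)".*/\1/p'
}

# update_doc REV CONTENT: prints an update body referencing revision REV.
# Assumes CONTENT needs no JSON escaping.
update_doc() {
  printf '{\n  "_rev": "%s",\n  "content": "%s"\n}\n' "$1" "$2"
}

# Example with a canned response instead of a live couch:
rev=$(echo '{"ok":true,"id":"mydoc","rev":"1-3557461c60a30b0d156f8b36a1bdcf9f"}' | extract_rev)
update_doc "$rev" "U1_2"
```

In practice the echo would be replaced by the curl call whose response you want to read, and the output of update_doc would be redirected into the file you pass to curl with -d @file.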
All curl commands and their respective responses are genuine. Responses are formatted for better readability.

What is a Conflict?

Before we start, let's agree on what a conflict is:
A conflict is a state where two or more versions of a document branch from a common root version. Only the leaves of conflicting branches are considered to be in conflict with each other.
Let's try to create this on couchdb.

We'll be acting on behalf of two users, first on one and then on two couchdb nodes.

Conflicts on a Single Node

On a single instance of couchdb, it is not possible to create a conflict when performing single document updates. If you want to update a document, you have to supply the latest revision of this document's revision tree. If you do not have this revision, your update will be rejected.

If you want to end up with a conflict, i.e. two revisions branching from a single common revision, you have to use couchdb's bulk update feature. But that's not all it takes. In addition you have to use the bulk update in the special "All-or-Nothing" mode.

It is not so easy to create a conflict on a single instance, but let's see.

We start by creating a database called mvcc on gandalf:
# check if couchdb is running
curl http://gandalf:5984
# response:
{"couchdb":"Welcome",
 "uuid":"360325151b6a3c70595a522b36f52037",
 "version":"1.6.0",
 "vendor":{"name":"The Apache Software Foundation",
 "version":"1.6.0"}
}
#
# create database "mvcc"
curl -X PUT http://gandalf:5984/mvcc
# response:
{"ok":true}

User 1 inserts an initial version of a document into the database. The document is stored in file mydoc_u1_1.json and looks like this:
{
  "content": "U1_1"
} 


curl -H "Content-Type: application/json" -d @mydoc_u1_1.json -X PUT http://gandalf:5984/mvcc/mydoc
# response:
{"ok":true,
 "id":"mydoc",
 "rev":"1-3557461c60a30b0d156f8b36a1bdcf9f"
}


User 2 reads the document and takes down the revision in order to use it for the update he plans.
# User 2 reads the doc...
curl -X GET http://gandalf:5984/mvcc/mydoc
# response:
{"_id":"mydoc",
 "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
 "content":"U1_1"
}


Both users are now holding the same revision of the document and both plan to update the document. User 1 is faster and places his update.
# here is the updated doc (mydoc_u1_2.json)
{
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U1_2"
}
#
# ...and the update...
curl -H "Content-Type: application/json" -d @mydoc_u1_2.json -X PUT http://gandalf:5984/mvcc/mydoc
# response:
{"ok":true,
 "id":"mydoc",
 "rev":"2-2686fb85c0681a3d8c411617f048f94f"
}


Done. We have a second revision of the document. User 2 will now submit his update, but he still holds revision 1. Here is his update.
# here is the document (mydoc_u2_1.json)...
{
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U2_1"
}
#
# Note that it indeed references revision 1 of the document
# Now the update itself:
curl -H "Content-Type: application/json" -d @mydoc_u2_1.json -X PUT http://gandalf:5984/mvcc/mydoc
# response: 
{"error":"conflict",
 "reason":"Document update conflict."
}


Here we see the expected result: you are not allowed to update a document if you do not hold its latest revision. Put differently, you can only update the latest revision of a document; you cannot branch the document, at least not in single-document update mode.
User 2 may be a bit slow, but he is resourceful. He knows about couchdb's bulk update interface and that this is a way to fork a branch from revision 1. So here is what he does:
# this is the bulk doc (bulk_u2_1.json):
{
"docs": [{
  "_id": "mydoc",
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U2_1"
}]
}


Granted, this is some sorry bulk file, consisting only of a single document...
# user 2 tries the bulk interface:
curl -H "Content-Type: application/json" -d @bulk_u2_1.json -X POST http://gandalf:5984/mvcc/_bulk_docs
# response: 
[{"id":"mydoc",
  "error":"conflict",
  "reason":"Document update conflict."
}]


Same result as before. Using the bulk interface is not enough. It has to be used with the all-or-nothing option. This is what user 2 tries next.
Now the bulk document contains the all_or_nothing property.
# bulk-doc: bulk_u2_2.json
{
"all_or_nothing": true,
"docs": [{
  "_id": "mydoc",
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U2_1"
}]
}
#
# ...and now the update:
curl -H "Content-Type: application/json" -d @bulk_u2_2.json -X POST http://gandalf:5984/mvcc/_bulk_docs
# response:
[{"ok":true,
  "id":"mydoc",
  "rev":"2-ba85ce56711c69f7d6200935357d79f9"
}]


Success: this time, the update was accepted. We now have one root-revision and two revisions branching from that root revision:
# the revision tree:
root:     1-3557461c60a30b0d156f8b36a1bdcf9f
branch 1:   2-2686fb85c0681a3d8c411617f048f94f
branch 2:   2-ba85ce56711c69f7d6200935357d79f9


Now that we finally have a conflict, how does couchdb deal with it? Let's simply retrieve the document and see what we get.
# a simple get...
curl  http://gandalf:5984/mvcc/mydoc
# response:
{"_id":"mydoc",
 "_rev":"2-ba85ce56711c69f7d6200935357d79f9",
 "content":"U2_1"
}


Couchdb determines a "winner" and does not let the conflict surface as long as you do not specifically ask for it.
Let's ask for it.
# fetch current document and all conflicting revisions...
curl  http://gandalf:5984/mvcc/mydoc?conflicts=true
# response:
{"_id":"mydoc",
 "_rev":"2-ba85ce56711c69f7d6200935357d79f9",
 "content":"U2_1",
 "_conflicts":["2-2686fb85c0681a3d8c411617f048f94f"]
}


Couchdb presents the revision inserted by user 2 as the winning revision. The version introduced by user 1 appears in the conflicts list.
User 1 may not be aware of the fact that his revision is no longer in favor. He continues to update his branch of the document.
# user 1 updates his branch of the document (mydoc_u1_3.json)
{
  "_rev":"2-2686fb85c0681a3d8c411617f048f94f",
  "content": "U1_3"
}
#
# here is the update:
curl -H "Content-Type: application/json" -d @mydoc_u1_3.json -X PUT http://gandalf:5984/mvcc/mydoc
#response
{"ok":true,
 "id":"mydoc",
 "rev":"3-627f10af94aaf3f31a20c9277c68219a"}


No problem with this update. This means that once a document is branched, each branch can be updated in its own right. In our case the branch user 1 maintains is now one revision longer than the branch maintained by user 2. Let's see what this means in terms of conflicting documents and which branch couchdb now elects to be the winner.
We do a regular GET with the conflicts option enabled.
# GET the winning revision and all conflicting revisions:
curl  http://gandalf:5984/mvcc/mydoc?conflicts=true
# response:
{"_id":"mydoc",
 "_rev":"3-627f10af94aaf3f31a20c9277c68219a",
 "content":"U1_3",
 "_conflicts":["2-ba85ce56711c69f7d6200935357d79f9"]
}


We can conclude two things from the result of this GET. One is that the winning branch has changed. The branch of user 1, which has the highest revision number, is now the winner. Another thing to notice is that the conflict moved up the document tree into its leaves.
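
To make the last point concrete: only revisions without a successor, the leaves, can win or show up in _conflicts. Here is a small sketch that derives the leaves from the parent/child pairs of our revision tree. The function name leaves and the input format (one "parent child" pair per line) are my own choices for illustration, not something couchdb provides.

```shell
# leaves: reads "PARENT CHILD" revision pairs on stdin and prints every
# revision that is never a parent, i.e. the leaves of the tree.
# Hypothetical helper; assumes the tree has at least one pair.
leaves() {
  awk '{ parent[$1] = 1; child[$2] = 1 }
       END { for (r in child) if (!(r in parent)) print r }' | LC_ALL=C sort
}

# The revision tree from this entry after user 1's third revision:
leaves <<'EOF'
1-3557461c60a30b0d156f8b36a1bdcf9f 2-2686fb85c0681a3d8c411617f048f94f
1-3557461c60a30b0d156f8b36a1bdcf9f 2-ba85ce56711c69f7d6200935357d79f9
2-2686fb85c0681a3d8c411617f048f94f 3-627f10af94aaf3f31a20c9277c68219a
EOF
```

It prints 2-ba85ce56711c69f7d6200935357d79f9 and 3-627f10af94aaf3f31a20c9277c68219a, the two revisions the GET returned as conflict and winner. Revision 2-2686fb85c0681a3d8c411617f048f94f no longer appears because it now has a successor.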

Summary

Short summary on "Conflicts on a Single Couchdb Instance":

  • It's not that easy to produce a conflict on a single instance
  • Once you have one, you are free to ignore it; couchdb will always decide on a winning revision
  • In spite of couchdb picking a winner, you are free to follow and work on any branch you please
  • With every change on any branch, the dice are rolled again and a new winner may turn up
That's it for now on working with a single instance. The next entry will deal with two instances (running on two Pies of course :-) and master-master replication between them.