Why Are Distributed File-systems Fun?

It’s been quite a while since I last published a post. I started a new job at Elastifile at the beginning of 2014, and it has taken up most of my spare time 🙂 I’ve resolved to get back to blogging, and I hope I succeed…

At Elastifile, we’ve built a distributed file-system that is truly scalable, and I’d like to share with you why creating such a file-system is interesting, challenging, and ultimately a lot of fun!

Let’s start by considering what exactly we expect from a distributed file-system. One of the basic expectations is to be able to access the file-system from different computers, and to have a consistent view of the file-system. For example, if I create a file “foo.txt” on one computer, I’d expect to be able to read this file from any other computer.

Time in a Distributed File-system

Suppose we have a directory containing two files, “a.txt” and “b.txt”, and I run “ls -lt” inside it. It should list the contents of the directory, ordered by last-modified time. Here’s an example:

$ ls -lt
total 0
-rw-r--r-- 1 ezra wheel 0 Nov 7 21:48 a.txt
-rw-r--r-- 1 ezra wheel 0 Nov 7 21:47 b.txt

Notice that the later modified file appears first. I can change the last modified time of “b.txt”, and it will appear first:

$ touch b.txt
$ ls -lt
total 0
-rw-r--r-- 1 ezra wheel 0 Nov 7 21:49 b.txt
-rw-r--r-- 1 ezra wheel 0 Nov 7 21:48 a.txt

What’s going on behind the scenes? Each file has a last-modified timestamp attached to it, and when the file is updated (e.g., by executing “touch b.txt”) that timestamp is updated.
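The same timestamp is visible programmatically. Here’s a minimal Python sketch (the file name is just for illustration) showing that “touching” a file updates its mtime:

```python
import os
import tempfile
import time

# Create a file and read its last-modified timestamp (mtime).
path = os.path.join(tempfile.mkdtemp(), "b.txt")
with open(path, "w"):
    pass
before = os.stat(path).st_mtime

time.sleep(0.01)
# Equivalent of `touch b.txt`: set atime/mtime to the current time.
os.utime(path)
after = os.stat(path).st_mtime

assert after >= before  # the timestamp moved forward
```

`ls -lt` is simply sorting directory entries by this `st_mtime` value, newest first.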

Where does this timestamp value come from? It shouldn’t be based on the client’s clock, because clients’ clocks can be wrong. So it should be based on the server’s clock. But which server? If we’re building a truly scalable file-system, we need to be able to scale out by adding servers. This inherently means we have more than one server, and thus more than one clock (each server has its own).

We could decide that timestamps are taken from one server only, but that has two main drawbacks: a) that server is a single point of failure; b) a single server has a limit on how many operations it can handle concurrently, so again we have a scaling issue.

So in any scalable solution, timestamps will be taken by different servers. This isn’t too bad. We can say file “a.txt” has its timestamp taken by server X, while file “b.txt” has its timestamp taken by server Y. That sounds pretty good.

However, now we have an oddity:

/tmp/bar$ touch a.txt
/tmp/bar$ touch b.txt
/tmp/bar$ ls -lt
total 0
-rw-r--r-- 1 ezra wheel 0 Nov 7 22:04 a.txt
-rw-r--r-- 1 ezra wheel 0 Nov 7 22:03 b.txt

The timestamp of “a.txt” should be earlier than that of “b.txt”, but it isn’t! How can this happen? If server X and server Y have slightly different times, then even though “a.txt” was modified before “b.txt”, it could have received a timestamp that is later than “b.txt”’s.
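A toy simulation makes the reordering concrete. Everything here is hypothetical (the server names, the 2-second skew, the “true time” values); real clock skew is usually much smaller, but the effect is the same:

```python
# Toy model: two servers whose clocks disagree by a fixed skew.
# The skew and times below are made-up values for illustration.

class Server:
    def __init__(self, skew_seconds):
        self.skew = skew_seconds

    def now(self, true_time):
        # The timestamp this server would assign at the given true time.
        return true_time + self.skew

server_x = Server(skew_seconds=2.0)  # X's clock runs 2 seconds fast
server_y = Server(skew_seconds=0.0)  # Y's clock is accurate

# a.txt is touched first (true time 100), b.txt second (true time 101),
# but each file gets its timestamp from a different server.
mtime_a = server_x.now(100.0)  # 102.0
mtime_b = server_y.now(101.0)  # 101.0

# Even though a.txt was modified first, its timestamp is later:
assert mtime_a > mtime_b
```

Any amount of skew, however tiny, can produce this inversion for modifications that happen close enough together in time.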

The interesting aspect of this problem is that we can never guarantee that both servers have exactly the same time, even if they have really accurate clocks. On the other hand, any attempt to serialize these timestamps will inevitably limit the scalability of the file-system. However, there are a few things that can be done – I’ll let you enjoy thinking about them.

Directory Structure in a Distributed File-system

Another example of the complexity of a distributed file-system is the directory structure. Directories can contain within them files and/or other directories. Consider the following example:

$ mkdir bar
$ mkdir foo
$ mv foo bar
$ ll bar
total 0
drwxr-xr-x 2 ezra wheel 68B Nov 7 22:15 foo

We created two directories, “bar” and “foo”, then moved “foo” into “bar”.

We can now try to place “bar” under “foo”:

$ mv bar bar/foo
mv: rename bar to bar/foo/bar: Invalid argument

Luckily, something prevented us from doing that. Otherwise, we’d have an infinite loop: “foo” a sub-dir of “bar”, which is a sub-dir of “foo”, and so on. This example is a simple one; much more elaborate cycles are possible (e.g., moving a parent directory into a sub-dir that is nested 20 levels deep).

Let’s consider what’s required to detect such invalid operations: we need to traverse the directory structure and guarantee that no loop is created by this “mv” operation. Moreover, we need to guarantee this while other “mv” operations might be executing concurrently, changing the structure of the directory tree as we traverse it. Lastly, since our file-system is distributed and scalable, we can’t keep all directories on a single server, so we must allow different servers to handle different directories. Therefore, the aforementioned traversal is going to be a distributed traversal. Sounds like a lot of fun, right?
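The invariant being enforced can be sketched in a few lines. This is a single-process toy, not how a distributed file-system actually stores its namespace: the dict mapping each directory to its parent is a stand-in for metadata that would really be spread across servers, and the concurrency and distribution issues discussed above are exactly what the sketch leaves out.

```python
# Hypothetical in-memory namespace: directory -> parent directory.
parent = {"/": None, "/bar": "/", "/bar/foo": "/bar"}

def is_descendant(node, ancestor):
    """Walk up the parent chain from `node`; True if we pass `ancestor`."""
    while node is not None:
        if node == ancestor:
            return True
        node = parent[node]
    return False

def mv(src, dst_dir):
    # Refuse to move a directory into itself or one of its own
    # descendants -- that would create a cycle in the "tree".
    if src == dst_dir or is_descendant(dst_dir, src):
        raise ValueError("Invalid argument")
    # ... perform the actual rename ...

try:
    mv("/bar", "/bar/foo")  # same case as the shell example above
except ValueError as e:
    print(e)  # Invalid argument
```

In a real distributed file-system, that upward walk crosses server boundaries and races against concurrent moves, which is where the problem gets hard.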

TL;DR

To sum up, for a distributed file-system to be truly scalable, it must allow for multiple servers to handle different files and/or directories. Once multiple servers handle different entities, guaranteeing specific relationships between these entities (such as having a sequence of timestamps, or no loops in the directory structure) becomes a distributed task.

Interesting? Yup. Challenging? Definitely. Fun? I think so 🙂

6 thoughts on “Why Are Distributed File-systems Fun?”

    • Thanks.
      Please notice that good clock synchronization between the servers reduces the probability of the scenario described in the post, but doesn’t prevent it. Moreover, the precision of clock sync protocols (like PTP) depends on the network’s behavior; if the network is misbehaving, the time difference between nodes increases, and with it increases the probability of “reordering” of timestamps.

  1. Hmmm,
    This bit below; if server Y ALWAYS takes the timestamp then as soon as B is touched, B will be later than A.

    The timestamp of “a.txt” should be earlier than that of “b.txt”, but it isn’t! How can this happen? If server X and server Y have slightly different times, then even though “a.txt” was modified before “b.txt”, it could have received a timestamp that is later than “b.txt”’s.

    • You are correct. Indeed, if server Y is the one giving timestamps to all files, then the described scenario can’t happen. However, this means server Y is a serialization (and thus contention) point. In such a case, the file system isn’t truly scalable: in an extreme case, think of clients whose only activity is to continuously choose a random file F and then “touch F”. More such clients are added as long as the system can handle them. Eventually, the system won’t be able to serve additional clients (since they’re all being served by server Y).
      If you want to build a (scalable) system, in which by adding more servers you can serve more clients, then your solution won’t work.

      • This is interesting. What other problems can one encounter in a distributed file system? Is there any reference to such problems?
        How about using timestamp range as used in true time (spanner) or timestamp as a counter used in CRDTs

      • I’m not aware of such a reference, but I plan to write another post about the complexities hard-links bring into the mix.

        Regarding timestamps, it is possible to overcome the issues by easing the requirements. For example, the CRDT solution you suggested will make the ‘mtime’ property eventually consistent, rather than strongly consistent – which isn’t what’s expected of POSIX filesystems.
        I think it is very interesting to simplify the requirements in the right places, such that we still gain most of the capabilities of a filesystem, while allowing it to be truly scalable; but that’s for another post.
