Use Case

I've just started working with Amazon's S3 buckets to hold a centralised filesystem that supports a distributed workflow system. The tasks in the workflow run on different physical machines in a variety of locations, so we need efficient ways of synchronising just small sub-sections of the local files with a bucket.

The Plan

Amazon's API allows listing objects by a key prefix, i.e. searching for all the files in a particular folder or its sub-folders. This is a great way of synchronising folders that might contain sub-folders; however, we also need to list the same files from the local file system.

The second task is then comparing the files. In our system the synchronisation is only performed in one direction at a time (pull or push), and therefore we can calculate which files have changed by comparing the two listings.

Implementation

Get the current Amazon file list

I'm using Amazon's own .NET API for this example. The first task is to request all the objects within a particular folder. First we create the S3 client:

AmazonS3Client client = new AmazonS3Client("awsAccessKeyId", "awsSecretAccessKey");

Then we get all the files (S3 objects) under the desired folder using a ListObjectsRequest, and pull the keys and their corresponding ETags out into a dictionary for later (a single response holds at most 1,000 keys, so larger folders would need paging):

ListObjectsResponse folderObjects = client.ListObjects(new ListObjectsRequest() { BucketName = "dbradley-test-bucket", Prefix = "test/folder" });
Dictionary<string, string> remoteObjects = folderObjects.S3Objects.ToDictionary(obj => obj.Key, obj => obj.ETag);
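One thing to watch when comparing later: the ETag value may come back wrapped in double quotes, and objects uploaded in multiple parts have an ETag of the form "hex-partCount" that is not an MD5 of the whole file. Whether the quotes are present in a given SDK version is an assumption here, so the following is only a defensive sketch; the `ETagHelper` class and its method names are hypothetical and not part of Amazon's API.

```csharp
using System;

public static class ETagHelper
{
    // Strips surrounding double quotes (if any) and lower-cases the value so
    // it can be compared with a locally computed hex MD5 string.
    public static string Normalise(string etag)
    {
        if (etag == null) throw new ArgumentNullException("etag");
        return etag.Trim().Trim('"').ToLowerInvariant();
    }

    // True only for a 32-character hex value. Multipart ETags such as
    // "<hex>-3" fail this check and cannot be compared with a local MD5.
    public static bool LooksLikePlainMd5(string etag)
    {
        string value = Normalise(etag);
        if (value.Length != 32) return false;
        foreach (char c in value)
        {
            bool isHex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
            if (!isHex) return false;
        }
        return true;
    }
}
```

With something like this, the dictionary above could be built with `obj => ETagHelper.Normalise(obj.ETag)` as the value selector.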

Get the current local file list

Getting the local files into a similar format takes a little more work, as we need every file in every sub-folder together with a path that matches its key on Amazon. The approach is therefore to write a recursive function that digs down into all the sub-directories.

The output of this function needs to be something that's comparable with the previous result from the Amazon bucket: a dictionary mapping the file path to its MD5 hash.

The first step is to be able to generate an "Amazon-compatible" checksum of a file. We can use the ComputeHash function of the MD5CryptoServiceProvider class. This can simply be passed a stream and will return the hash as a byte array. However, to turn this byte array into a hex-encoded string we use the BitConverter.ToString method, then strip the dashes and lower the case so that it will match the ETag returned by Amazon.

Note: There's probably a more efficient method of doing the conversion from byte array to hex, but this will do for now!

Therefore the hashing function looks something like:

string hash = BitConverter.ToString(crypto.ComputeHash(fileStream)).Replace("-", string.Empty).ToLower();
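As one possible alternative to the BitConverter approach, the sketch below formats each byte directly as two lowercase hex digits, which avoids building the intermediate dashed string. The `Md5Hex` class name is hypothetical, and it uses `MD5.Create()` rather than constructing the provider class directly; treat it as an illustration rather than a measured optimisation.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class Md5Hex
{
    // Hashes everything readable from the stream and returns the MD5 as a
    // 32-character lowercase hex string.
    public static string FromStream(Stream stream)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(stream);
            var hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                // "x2" formats a byte as two lowercase hex digits.
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```

For example, `Md5Hex.FromStream(new MemoryStream(Encoding.ASCII.GetBytes("abc")))` returns "900150983cd24fb0d6963f7d28e17f72".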

The next consideration is the time it takes to calculate these hashes. Even the most efficient MD5 implementations take a significant amount of time, especially with big files. Therefore, rather than returning a dictionary of file paths mapping to the actual MD5 hash string, we will return the paths mapping to a function which, only when run, returns the MD5 hash of the given file. We can define this using an anonymous delegate which takes no input:

delegate
{
    using (var stream = file.OpenRead())
    {
        return BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
    }
}
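To make the deferred behaviour concrete, here is a small self-contained sketch. The `DeferredHash` class and its method names are hypothetical. `For` returns a function that reads and hashes the file only when it is invoked, and `Cached` wraps that in a `Lazy<string>` so repeated calls do not re-read the file; the caching is an addition of mine rather than something the code in this post does.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class DeferredHash
{
    // Returns a function that opens and hashes the file each time it is called.
    // Nothing is read from disk until then.
    public static Func<string> For(FileInfo file)
    {
        return delegate
        {
            using (MD5 md5 = MD5.Create())
            using (FileStream stream = file.OpenRead())
            {
                return BitConverter.ToString(md5.ComputeHash(stream))
                    .Replace("-", string.Empty)
                    .ToLowerInvariant();
            }
        };
    }

    // Same idea, but the hash is computed on the first call and remembered,
    // so later calls return the stored value without touching the file.
    public static Func<string> Cached(FileInfo file)
    {
        var lazy = new Lazy<string>(For(file));
        return () => lazy.Value;
    }
}
```

Because the work happens at call time, a file that is never compared (for instance, one that does not exist on the other side) is never hashed at all.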

Going back to the recursive function, we need to make sure that the file keys match those on Amazon. Amazon paths look something like "test/folder/file.txt", and therefore we need to make all of our local paths relative to a specific folder. For simplicity we will define two root functions:

  1. Get all the files within a directory (and assume that the given directory is the root directory in Amazon).
  2. Get all the files within a directory and specify the current directory's path on Amazon.

Each of these functions will then call the internal recursive method. This internal method simply returns the keys and hash functions of each file in its current directory, combined with the keys and hash functions of each of its sub-directories.

Bringing it all together

So, finally, here's the code to get a local directory as a set of Amazon-compatible paths, each mapping to a function that returns the file's Amazon-compatible MD5 hash.

public static Dictionary<string, Func<string>> GetLocalFileKeys(DirectoryInfo directory)
{
    return GetLocalFileKeys(directory, string.Empty, new MD5CryptoServiceProvider()).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
}

public static Dictionary<string, Func<string>> GetLocalFileKeys(DirectoryInfo directory, string rootPath)
{
    return GetLocalFileKeys(directory, rootPath, new MD5CryptoServiceProvider()).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
}


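private static IEnumerable<KeyValuePair<string, Func<string>>> GetLocalFileKeys(DirectoryInfo directory, string currentPath, MD5CryptoServiceProvider crypto)
{
    if (directory == null)
        throw new ArgumentNullException("directory", "directory is null.");

    // Key prefix for this directory: empty at the bucket root, otherwise "some/path/".
    // This avoids keys with a leading or doubled slash such as "/file.txt" or "sub//file.txt".
    string prefix = string.IsNullOrEmpty(currentPath) ? string.Empty : currentPath.TrimEnd('/') + "/";

    return directory.EnumerateFiles().Select
    (
        file => new KeyValuePair<string, Func<string>>
        (
            prefix + file.Name,
            // Deferred: the file is only opened and hashed when this function is invoked.
            delegate
            {
                using (var stream = file.OpenRead())
                {
                    return BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
                }
            }
        )
    )
    // Concat rather than Union: every key is already unique, so no de-duplication is needed.
    .Concat
    (
        directory.EnumerateDirectories().SelectMany
        (
            childDir => GetLocalFileKeys(childDir, prefix + childDir.Name, crypto)
        )
    );
}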

One observation about the internal function is that it returns an IEnumerable of KeyValuePair rather than an actual dictionary. This is because a dictionary can't have a whole collection of new pairs added to it at once, which is what we need when calling the function recursively so that the results are presented as one flat collection.
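With both listings available, the one-directional comparison described in the plan could be sketched roughly as follows for a push. This is only an illustration under my own assumptions: the `PushPlanner` and `Plan` names are hypothetical, remote ETags are assumed to be plain MD5 values (possibly quoted), and treating remote keys with no local file as deletions is a policy choice rather than something this post prescribes.

```csharp
using System;
using System.Collections.Generic;

public static class PushPlanner
{
    public sealed class Plan
    {
        public readonly List<string> ToUpload = new List<string>();
        public readonly List<string> ToDelete = new List<string>();
    }

    // local:  key -> function returning the file's MD5 (evaluated on demand)
    // remote: key -> ETag as listed from the bucket
    public static Plan Build(IDictionary<string, Func<string>> local, IDictionary<string, string> remote)
    {
        var plan = new Plan();
        foreach (KeyValuePair<string, Func<string>> entry in local)
        {
            string remoteETag;
            // A key missing remotely is uploaded without hashing (the || short-circuits);
            // otherwise the local hash is computed and compared with the ETag.
            if (!remote.TryGetValue(entry.Key, out remoteETag) ||
                !string.Equals(entry.Value(), remoteETag.Trim('"'), StringComparison.OrdinalIgnoreCase))
            {
                plan.ToUpload.Add(entry.Key);
            }
        }
        foreach (string key in remote.Keys)
        {
            // Assumption: anything remote with no local counterpart is removed on push.
            if (!local.ContainsKey(key)) plan.ToDelete.Add(key);
        }
        return plan;
    }
}
```

A pull would be the mirror image, with the roles of the two dictionaries swapped. Note that only files present on both sides ever have their hash function invoked, which is the payoff of returning functions instead of precomputed hashes.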