Storage Utilities in Practice: ZFS Snapshot to Amazon S3

 
By Paul Monday, August 2007  

This article compares the cost of local disk backup solutions to that of offsite storage utilities and offers a sample implementation guide.

Overview

Backup and archiving serve a variety of purposes and are commonly used throughout companies. Traditional backup and archiving operations target tape libraries, like the Sun StorageTek SL500 Modular Library System. Increasingly, companies are using local disk storage to facilitate a perceived decrease in access time and latency as well as to create a "simple" solution. With the decreased access time and latency, archives can remain highly accessible and restore operations can meet increasingly aggressive service-level agreements.

Smaller companies want the advantages of local disk solutions (performance and online browsing features), a level of offsite backup and archiving for disaster recovery, and the simplicity of working only with their own disk solutions. They are finding that storage utilities provide highly competitive offerings. Companies such as SmugMug and ElephantDrive are using storage utilities to reduce their hardware costs and offload system administration costs to the storage utility. The storage utility, in turn, can create better economies of scale: it can use hierarchical storage management, build large systems at a lower cost per gigabyte, and purchase or build tools for managing large amounts of data, amortizing these costs across many customers. These savings are passed back to the smaller companies. Further, a company well versed in data management can provide a safer and more reliable solution than a small company whose core competency is application software or managing store inventory.

Considerations and Requirements

When looking at hosting solutions and storage utilities, there are many things to keep in mind. Many of these things are measurable, but some are not, such as the core expertise of your company. Here are some considerations and requirements that I laid out as I took a quick look at building my own solution versus leveraging a storage utility. I wanted my solution to provide these benefits:

  • Up to 12 terabytes of accessible storage with offsite backup (assume 1TB per month utilized)
  • A lower monthly cost for backups than a locally hosted solution amortized over one year, including the following costs:
    • Hardware expense
    • Power expense
    • Remote hosting facility expense
    • Bandwidth expense
    • System administration costs

Assumptions

I then made a few assumptions about my storage consumption and environment. Note how round all the numbers are. As a company, you would certainly want to be much more aggressive in mapping your expectations, because every dollar counts and every point of accuracy will help you manage those dollars. I assume the following:

  • Monthly usage: Assume 1TB backed up per month
  • Restores: Assume that two 500GB restores must occur in separate months throughout the year
  • Access time: Assume "normal" network latency is acceptable
  • Access rate: Assume 100 Mbps network connection is acceptable
  • Reliability: Assume colocated solution has same availability as storage utility

Out of Scope

Of course, there are many things I simply felt were out of scope as I tried to create an apples-to-apples comparison. As individuals become more sophisticated with their storage management abilities, more solutions become readily available that can help decrease costs. My goal was simplicity, so here are some things I didn't consider to be in scope:

  • Complex local storage and network solutions (assumption is a single X4500 being backed up over the network)
  • Tape backup (cohosting solution does not go down in price for hosting tape)

Architecture Overview

I assume a "simple" point-to-point file-based backup infrastructure is in place, as shown in the illustration below.

 
Figure: A "simple" point-to-point file-based backup infrastructure, with the source system on one side of a firewall and the storage utility on the other.

Implementation Guide for ZFS to Amazon S3

Implementation Overview

The ZFS file system was chosen simply as a practical starting point. Other file systems should be usable but could provide different performance characteristics and certainly different snapshot and restore abilities. If you find a substantial difference in a particular file system, please feel free to send me the implementation and I will post it on my blog in your name. We are also piloting a wiki site and I intend to make this pattern "editable" so you could branch the pattern and add your own solution as well, or change this one.

Amazon Simple Storage Service (Amazon S3) is an interesting choice for a storage utility. I chose it for the following reasons:

  • High usage in the industry
  • Stable and predictable pricing model that is the lowest in the industry, so this represents a good "best case"
  • Trusted source of storage

The costs are as follows for hosting at Amazon S3:

  • Consume 1TB per month, aggregating to 12TB in the last month of the year
    • Hardware expense: $0
    • Power expense: $0 (included in remote hosting cost)
    • Storage Utility expense: $11,700 (1TB + 2TB + ... + 12TB @ $0.15 / GB-Month: $150 + $300 + ... + $1,800)
    • Bandwidth expense: $1,380 (Inbound: 12 * $0.10/GB * 1000 = $1200, Outbound: 2 * 500GB * 0.18/GB = $180)
    • System administration costs: $0
    • Total Cost of Ownership (TCO) for first year: $13,080
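
The arithmetic behind these figures can be checked with a short program. This is a minimal sketch that uses only the 2007 prices and usage assumptions quoted in this article; the class and method names are my own, and amounts are held in cents so the sums stay exact.

```java
import java.util.Locale;

class S3FirstYearCost {
    // Prices quoted in this article (2007), expressed in cents.
    static final long STORAGE_CENTS_PER_GB_MONTH = 15;
    static final long INBOUND_CENTS_PER_GB = 10;
    static final long OUTBOUND_CENTS_PER_GB = 18;
    static final long GB_PER_TB = 1000;

    /** Storage charge when 1TB is added each month: 1TB + 2TB + ... + months TB. */
    static long storageCents(int months) {
        long total = 0;
        for (int storedTb = 1; storedTb <= months; storedTb++) {
            total += storedTb * GB_PER_TB * STORAGE_CENTS_PER_GB_MONTH;
        }
        return total;
    }

    /** Transfer charge: 1TB uploaded per month plus the given restores. */
    static long bandwidthCents(int months, int restores, long restoreGb) {
        long inbound = months * GB_PER_TB * INBOUND_CENTS_PER_GB;
        long outbound = restores * restoreGb * OUTBOUND_CENTS_PER_GB;
        return inbound + outbound;
    }

    static long totalCents(int months, int restores, long restoreGb) {
        return storageCents(months) + bandwidthCents(months, restores, restoreGb);
    }

    static String dollars(long cents) {
        return String.format(Locale.US, "$%,d", cents / 100);
    }

    public static void main(String[] args) {
        System.out.println("Storage:   " + dollars(storageCents(12)));
        System.out.println("Bandwidth: " + dollars(bandwidthCents(12, 2, 500)));
        System.out.println("Total:     " + dollars(totalCents(12, 2, 500)));
    }
}
```

Changing the constants or the arguments to totalCents is a quick way to see how sensitive the first-year total is to your own usage pattern.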

One of the original requirements of this particular pattern was to lower the cost for our storage backups. SmugMug did a simple analysis of their savings by using the Amazon S3 storage utility versus purchasing their own disk, and they saved about $340,000 in seven months.

It is extremely important that you take the time to do a full cost analysis based on your own internal information. Amazon S3 charges a flat rate per gigabyte per month, so the bill for storing 12TB of data recurs every month. Over the course of several years, the cumulative Amazon S3 cost for storing 12TB of data will cross the cost of owning a server with 12TB of storage.
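
To see where the two cost curves might cross, a sketch along the following lines can help. The S3 storage price is the one quoted above, but the cost of the owned server is NOT given in this article: the up-front and monthly figures passed in main are purely hypothetical placeholders that you would replace with your own numbers. Bandwidth charges are left out to keep the comparison simple.

```java
class S3BreakEven {
    static final long STORAGE_CENTS_PER_GB_MONTH = 15; // price quoted in this article
    static final long GB_PER_TB = 1000;
    static final int MAX_TB = 12; // storage grows 1TB per month, then stays at 12TB

    /** Cumulative S3 storage charges, in cents, after the given number of months. */
    static long cumulativeStorageCents(int months) {
        long total = 0;
        for (int month = 1; month <= months; month++) {
            long storedTb = Math.min(month, MAX_TB);
            total += storedTb * GB_PER_TB * STORAGE_CENTS_PER_GB_MONTH;
        }
        return total;
    }

    /**
     * First month in which cumulative S3 storage cost exceeds the cost of
     * owning, or -1 if that does not happen within the horizon.
     */
    static int crossoverMonth(long ownedUpFrontCents, long ownedPerMonthCents, int horizonMonths) {
        for (int month = 1; month <= horizonMonths; month++) {
            long owned = ownedUpFrontCents + ownedPerMonthCents * month;
            if (cumulativeStorageCents(month) > owned) {
                return month;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL inputs, not from the article: $20,000 up front
        // and $500 per month to own and host a server.
        long upFront = 2_000_000;
        long perMonth = 50_000;
        System.out.println("Crossover month: " + crossoverMonth(upFront, perMonth, 60));
    }
}
```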

Implementation Assumptions

My implementation makes the following assumptions:

Bill of Materials
Implementation Steps

This version of the implementation is going to take the "simple is elegant" route. We use the built-in capabilities of ZFS to take snapshots, save them, and restore them, and we couple this with simple pipes to move data between our file system and Amazon S3. A "snapshot" is defined by the Storage Networking Industry Association (SNIA) to be: "A fully usable copy of a defined collection of data that contains an image of the data as it appeared at the point in time at which the copy was initiated. A snapshot may be either a duplicate or a replicate of the data it represents." The built-in ZFS snapshot capabilities are very good and are not an add-on, as they are for most file systems.

We can also restore our file system from Amazon S3 back to ZFS using the reverse flow of the process here.

Step 1: Basic Setup

To set up our system:

  1. Create one storage pool:

    zpool create media c0t1d0
    

  2. Create a file system within the media storage pool:

    zfs create media/mypictures
    

  3. Change the mountpoint to /export/media/mypictures:

    zfs set mountpoint=/export/media/mypictures media/mypictures
    

  4. Set to share over NFS:

    zfs set sharenfs=on media/mypictures
    

  5. Turn on compression:

    zfs set compression=on media/mypictures
    

  6. Copy a set of pictures into the /export/media/mypictures directory (enough to create an acceptable snapshot).

Step 2: Snapshot and Store

Our first process consists of creating a snapshot and sending it to Amazon's S3 for backup. There are some assumptions that must be made at this point:

  • You have an existing file system or volume that you would like to back up (as created above)
  • The snapshot size will fit within the constraints of Amazon's S3 license or your own service contract with Amazon S3

The steps we take are:

  • Create a snapshot of the file system
  • Turn the snapshot into a data stream
  • Compress the snapshot
  • Send it to Amazon S3 with appropriate metadata
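
The first three steps above can also be driven from a single Java program rather than typed by hand. The following is a minimal sketch under some assumptions of my own: the zfs command is on the PATH, the program runs with enough privilege to snapshot the file system, and the class and method names are hypothetical. The upload step is left to the S3 client program shown later in this article.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.zip.GZIPOutputStream;

class SnapshotPipeline {
    /** Builds a dated snapshot name such as media/mypictures@20070607. */
    static String snapshotName(String fileSystem, LocalDate date) {
        return fileSystem + "@" + date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Gzips everything read from in onto out; returns the uncompressed byte count. */
    static long gzipCopy(InputStream in, OutputStream out) throws IOException {
        long total = 0;
        byte[] buffer = new byte[64 * 1024];
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                gz.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    /** Runs a command and fails if it exits with a non-zero status. */
    static void run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", command));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: SnapshotPipeline filesystem output.gz");
            return;
        }
        String snapshot = snapshotName(args[0], LocalDate.now());

        // Step 1: create the snapshot.
        run("zfs", "snapshot", snapshot);

        // Steps 2 and 3: stream the snapshot and compress it into a file.
        Process send = new ProcessBuilder("zfs", "send", snapshot)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (InputStream in = send.getInputStream();
             OutputStream out = new FileOutputStream(new File(args[1]))) {
            gzipCopy(in, out);
        }
        if (send.waitFor() != 0) {
            throw new IOException("zfs send failed for " + snapshot);
        }
    }
}
```

Streaming through a fixed-size buffer means the program never holds the whole snapshot in memory, which matters once the file system grows beyond a few gigabytes.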

To create a snapshot and store it:

  1. To create a snapshot, use the zfs snapshot command. I will snapshot the entire /export/media/mypictures directory and name the snapshot "20070607" using this command:

    zfs snapshot media/mypictures@20070607
    

    The snapshot should initially take up no additional space in my file system. As files change, the space consumed by the snapshot grows, because the original versions of the changed data must be preserved. Still, saving the snapshot to a file requires the full amount of space, because the file contains the entire snapshot of the data (which happens to be all of the original data).

  2. It is relatively easy to turn the snapshot itself into a stream of data by using the send command to send the snapshot to a file:

    zfs send media/mypictures@20070607 > ~/backups/20070607
    

  3. We can also insert compression into the pipe, so the actual command I used is this one:

    zfs send media/mypictures@20070607 | gzip > ~/backups/20070607.gz
    

    Mileage varies with compression on snapshots.

  4. Finally, we can send the snapshot to Amazon S3.

I assume that a "bucket" is already created and that we are merely sending the final compressed snapshot to the Amazon S3 bucket. To be honest, I tried using Curl, Perl, and a variety of other things but I couldn't quickly get the right libraries to create the signatures. I just hate scrounging around the Internet for the right this or that and changing compilation flags and recompiling and so on. So, I went with the Java - REST approach.
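
For readers who do want to build the signature themselves, here is a minimal sketch of the legacy Amazon S3 REST authentication scheme as I understand it: the request is reduced to a "string to sign", which is signed with HMAC-SHA1 using the secret key and then Base64-encoded. The class name, the bucket, and the credentials below are hypothetical placeholders, and current AWS services use the newer Signature Version 4 instead, so treat this as an illustration of the idea rather than production code.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class S3LegacySignature {
    /** Plain HMAC-SHA1 over the data with the given key. */
    static byte[] hmacSha1(byte[] key, byte[] data) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(key, "HmacSHA1"));
        return mac.doFinal(data);
    }

    /** Joins the request elements in the order the legacy scheme expects. */
    static String stringToSign(String verb, String contentMd5, String contentType,
                               String date, String canonicalizedAmzHeaders,
                               String canonicalizedResource) {
        return verb + "\n" + contentMd5 + "\n" + contentType + "\n" + date + "\n"
                + canonicalizedAmzHeaders + canonicalizedResource;
    }

    /** Builds the value of the Authorization header. */
    static String authorizationHeader(String accessKeyId, String secretKey, String toSign)
            throws GeneralSecurityException {
        byte[] signature = hmacSha1(secretKey.getBytes(StandardCharsets.UTF_8),
                                    toSign.getBytes(StandardCharsets.UTF_8));
        return "AWS " + accessKeyId + ":" + Base64.getEncoder().encodeToString(signature);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical bucket, key, and credentials, for illustration only.
        String toSign = stringToSign("PUT", "", "application/x-gzip",
                "Thu, 07 Jun 2007 12:00:00 GMT", "", "/mybucket/20070607.gz");
        System.out.println(authorizationHeader("EXAMPLEKEYID", "example-secret", toSign));
    }
}
```

The Date value in the string to sign must match the Date header actually sent with the request, which is one of the details that makes hand-rolled scripts fiddly.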

Use the Amazon S3 Library for REST in Java. It has classes for all of your favorite Amazon S3 operations and is quite easy to use. I created the following "simple" program, which takes a key and the location of my compressed snapshot and uploads it. (This program is based on the samples from Amazon S3.)

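import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Package name as used by the Amazon S3 sample library; adjust if your copy differs.
import com.amazon.s3.AWSAuthConnection;
import com.amazon.s3.S3Object;

public class SendSnapshot {
    // Placeholders: set these to your own credentials and bucket before running.
    static final String awsAccessKeyId = "<INSERT YOUR AWS ACCESS KEY ID HERE>";
    static final String awsSecretAccessKey = "<INSERT YOUR AWS SECRET ACCESS KEY HERE>";
    static final String bucketName = "<INSERT YOUR BUCKET NAME HERE>";

    public static void main(String args[]) throws Exception {
        if (awsAccessKeyId.startsWith("<INSERT")) {
            System.err.println("Please examine SendSnapshot.java and update it with your credentials");
            System.exit(-1);
        }

        if (args.length < 2) {
            System.err.println("Send snapshot key and location with program: SendSnapshot key path");
            System.exit(-1);
        }

        // The library takes the whole object as one byte array, so the
        // file must fit in a Java array (and in memory).
        File file = new File(args[1]);
        long length = file.length();
        if (length > Integer.MAX_VALUE) {
            System.err.println("File too large: " + args[1]);
            System.exit(-1);
        }

        byte[] bytes = new byte[(int) length];
        try (InputStream is = new FileInputStream(file)) {
            int offset = 0;
            int numRead = 0;
            while (offset < bytes.length
                    && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
                offset += numRead;
            }

            // Ensure all the bytes have been read in
            if (offset < bytes.length) {
                throw new IOException("Could not completely read file " + args[1]);
            }
        } catch (IOException ioe) {
            System.err.println("Error reading file: " + args[1]);
            System.exit(-1);
        }

        AWSAuthConnection conn =
                new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey);

        System.out.println("----- putting object -----");
        S3Object object = new S3Object(bytes, null);
        Map<String, List<String>> headers = new TreeMap<String, List<String>>();
        // The snapshot is gzip-compressed binary data, not plain text
        headers.put("Content-Type", Arrays.asList("application/x-gzip"));
        System.out.println(
                conn.put(bucketName, args[0], object, headers).connection.getResponseMessage()
                );

        System.out.println("----- listing bucket -----");
        System.out.println(conn.listBucket(bucketName, null, null, null, null).entries);
    }
}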

Send up your snapshot and you are GOOD TO GO!

Step 3: Retrieve and Restore

The process of retrieving and restoring the snapshot when you lose your data or want to return to a previous time in your history is relatively simple as well. Simply reverse the process above. Here is the Java code (using the Amazon S3 Java REST libraries again).

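import java.io.FileOutputStream;
import java.io.IOException;

// Package name as used by the Amazon S3 sample library; adjust if your copy differs.
import com.amazon.s3.AWSAuthConnection;
import com.amazon.s3.S3Object;

public class GetSnapshot {
    // Placeholders: set these to your own credentials and bucket before running.
    static final String awsAccessKeyId = "<INSERT YOUR AWS ACCESS KEY ID HERE>";
    static final String awsSecretAccessKey = "<INSERT YOUR AWS SECRET ACCESS KEY HERE>";
    static final String bucketName = "<INSERT YOUR BUCKET NAME HERE>";

    public static void main(String args[]) throws Exception {
        if (awsAccessKeyId.startsWith("<INSERT")) {
            System.err.println("Please examine GetSnapshot.java and update it with your credentials");
            System.exit(-1);
        }

        if (args.length < 2) {
            System.err.println("Get snapshot and write data needs 2 parameters: GetSnapshot key path");
            System.exit(-1);
        }

        AWSAuthConnection conn =
            new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey);

        System.out.println("----- getting object -----");
        S3Object object = conn.get(bucketName, args[0], null).object;
        if (object == null) {
            // No object came back, for example because the key does not exist
            System.err.println("Could not retrieve object: " + args[0]);
            System.exit(-1);
        }

        // Write the downloaded bytes to the requested path
        try (FileOutputStream fos = new FileOutputStream(args[1])) {
            fos.write(object.data);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(-1);
        }
    }
}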

When you have retrieved your gzipped snapshot, decompress it with this command:

# gunzip 20070607.gz

Now you have to decide what to do with your snapshot. I renamed the existing mypictures file system and restored my old snapshot in its place to give me a complete "time travel" back to my snapshot. Here are the commands:

# zfs rename media/mypictures media/mypictures.old
# zfs receive media/mypictures < 20070607

That's it! Going to /export/media/mypictures brings me to the pictures I snapshotted on June 7, 2007!

Billing

Billing with Amazon S3 was simple. I entered my credit information before using the service, and I receive a monthly bill. Here is a copy of the bill I received for all of the storage used in the process of building this article.

Greetings from Amazon Web Services,

This e-mail confirms that your latest billing statement is available on the AWS 
web site. Your account will be charged the following:

Total: $0.09

Please see the Account Activity area of the AWS web site for detailed account 
information:

http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=activity-summary

Thank you for your continuing interest in Amazon Web Services.

Sincerely,

Amazon Web Services

This message was produced and distributed by Amazon Web Services LLC, 1200 12th 
Avenue South, Seattle, Washington 98144-2734

Issues With Implementation

Feel free to tackle any of these issues and feed the solutions back to me, or add to the list of issues if you perceive there to be more.

  • Amazon S3 size limitations: In the "Terms of Use" Amazon S3 specifies the following: "You may not, however, store 'objects' (as described in the user documentation) that contain more than 5Gb of data, or own more than 100 'buckets' (as described in the user documentation) at any one time." As a result, one would want to slice up the snapshot appropriately so as to conform to the Amazon S3 limitations, or possibly work with Amazon S3 to find another solution. The limitation is completely reasonable though, due to lengths and limitations in the HTTPS protocol itself.
  • Encryption: Data should be encrypted appropriately before being sent to a third-party storage site.
  • Cron Job: Timely snapshots could be made automatically.
  • Non-Java: It would be nice to do the whole process from scripts, but I got hung up on the key generation so I hopped to my native tongue (Java code).
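
As a starting point for the size limitation above, the compressed snapshot could be cut into numbered parts before upload, with each part stored under its own key. This is a minimal sketch with hypothetical class, method, and file names; it assumes the quoted limit means 5 gigabytes per object and uses a conservative decimal value for it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SnapshotSplitter {
    // Assumption: the limit quoted from the Terms of Use is 5 gigabytes per object.
    static final long MAX_PART_BYTES = 5L * 1000 * 1000 * 1000;

    /** Splits the stream into files base.part000, base.part001, ... of at most maxPartBytes each. */
    static List<Path> split(InputStream in, Path dir, String baseName, long maxPartBytes)
            throws IOException {
        if (maxPartBytes <= 0) {
            throw new IllegalArgumentException("maxPartBytes must be positive");
        }
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[64 * 1024];
        OutputStream out = null;
        long written = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                int offset = 0;
                while (offset < n) {
                    // Start a new part when there is none yet or the current one is full.
                    if (out == null || written == maxPartBytes) {
                        if (out != null) {
                            out.close();
                        }
                        Path part = dir.resolve(String.format("%s.part%03d", baseName, parts.size()));
                        parts.add(part);
                        out = Files.newOutputStream(part);
                        written = 0;
                    }
                    int chunk = (int) Math.min(n - offset, maxPartBytes - written);
                    out.write(buffer, offset, chunk);
                    offset += chunk;
                    written += chunk;
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: SnapshotSplitter snapshot.gz outputDir");
            return;
        }
        Path source = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(source)) {
            Path dir = Paths.get(args[1]);
            for (Path part : split(in, dir, source.getFileName().toString(), MAX_PART_BYTES)) {
                System.out.println(part);
            }
        }
    }
}
```

On restore, the parts would need to be downloaded and concatenated in order before the result is passed to gunzip and zfs receive.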