The Management of NFS Performance With Solaris ZFS

 
By Tom Haynes and Doug McCallum, July 2007  

Sun Microsystems' addition of the Solaris zettabyte file system (ZFS) to its storage products exposed a Network File System (NFS) performance problem based on how file systems are exported or shared to remote clients. Note that this article will use the terms export and share to refer to a path in a file system that clients can mount, or graft, into their local file systems.

As with many performance issues, the root of the problem was the scaling of services: Solaris ZFS allowed users to go from a handful of exports to many thousands. In the past, the Sun NFS team had solved a related problem by adding an authentication cache to deal with increased numbers of clients accessing the exports. Another cache for sharing a loaded export was added to reduce the problem that ZFS exposed. But the team also had to reduce the cost of actually doing the share itself.

Scaling issues are still defined by how NFS boxes were deployed up to 20 years ago. At that time, there were normally dozens of clients and several file systems on the server. The cost to authenticate a client was negligible. Also consider that the clients were probably on the same network segment and that the file server was probably also a name server -- primary or backup. With only a couple of file systems and a restriction of not exporting children of an export, machines could load the exports into memory quite quickly.

However, large intranets, automounters, compute farms, and the increase of NFS exports with Solaris ZFS have changed all of these early assumptions. This article both provides an overview of these challenges and describes the changes required to meet today's environment.

Note: This article refers to man pages for the OpenSolaris OS versions and not for the Solaris 10 OS versions. You can download the OpenSolaris OS man pages in tar format.

Large Intranets

The first change was the development of large intranets. Consider a model in which every floor in a building has both its own subnet and a different business unit -- for example, marketing on the second floor and engineering on the fifth floor. Then magnify this by many buildings on a campus. The ramifications are that remote access might have to traverse many subnet segments; name services might be down or slow enough to be considered dead; and an admin might want to restrict access to a server from a range of hosts.

So an export might change from something like this:

share -F nfs -d "home dirs" /export/home

to this:

share -F nfs -o rw=engineering -d "home dirs" /export/home

In the first form, there is no need to query a name server to determine access rights: All hosts are allowed rw access by default. In the second form, a trusting approach is used to do access checking only on mount requests. If a client is granted access rights, then it gets the file handle of the root of the export point. Then the NFS protocol requests are trusted to be authenticated because they have a valid file handle.

From a performance view, this is a big win: mount requests are rare compared to NFS requests. But from a security view, this is a big loss: It is pretty easy to snoop traffic and spoof a file handle. Once you get a valid file handle, you have access rights to that share.

Before this article turns to the solution, ask yourself: Why are name service calls bad? The answer is that traffic to the network is at best comparable to traffic to disk. The real problem is what happens when the name server is unreachable, whether because of a network problem or a server that is down. If every NFS request were to take 20 seconds, how long would it be before admins started removing host restrictions from exports? Then, add into the mix that every entry in the host access list may result in a 20-second query to the name server.

Authentication Cache

The comparison to disk gives the solution: the mount daemon mountd(1M) needs a cache of whether or not a given host is granted or denied access to a given export. In 1995 and 1996, Brent Callaghan presented at Connectathon about work done at Sun to address these issues. When a mount request arrives, an authentication cache is checked to see whether the client has access permissions to that share. With a cache hit, the resulting permission is returned. With a cache miss, the name servers are checked to resolve host names and netgroups.
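The hit-or-miss behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the mountd implementation: the class name AuthCache, the resolve callback, and the sample host names are all hypothetical, and the slow name service check is replaced by a simple set lookup.

```python
class AuthCache:
    """Cache of (client host, export path) -> access decision."""

    def __init__(self, resolve):
        # resolve(host, export) stands in for the slow name service check.
        self._resolve = resolve
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def check(self, host, export):
        key = (host, export)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        allowed = self._resolve(host, export)
        # Denials are cached as well as grants, so repeated rejected
        # requests do not go back to the name servers.
        self._entries[key] = allowed
        return allowed


def slow_lookup(host, export):
    # Hypothetical stand-in: a real server would resolve host names
    # and netgroups here.
    engineering = {"eng1", "eng2"}
    return host in engineering


cache = AuthCache(slow_lookup)
print(cache.check("eng1", "/export/home"))  # miss, resolved to True
print(cache.check("eng1", "/export/home"))  # hit, no lookup needed
print(cache.check("mkt1", "/export/home"))  # miss, resolved to False
print(cache.hits, cache.misses)
```

Only the first request for each host and export pair pays for the lookup; every later request is answered from memory.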

But checking only mountd requests is not sufficient. First, a server can reboot and wipe out the existing permissions for clients to access a share. Second, it is still possible to spoof the NFS requests. The solution is to authenticate every NFS request. The same algorithm should be followed to populate the authentication cache and to keep the reply time down. Note that clients are designed to be somewhat forgiving for mountd requests, which are typically even sent over User Datagram Protocol (UDP). But they are draconian in wanting quick replies to NFS requests.

In his 1996 presentation NFS Client Authentication (PDF), Brent presented two graphs showing the expected behavior of the authentication cache growth and traffic to the name servers. These graphs are not based on hard data, so blindly following them can lead to confusion about performance. Figure 1 shows the graph of authentication cache growth.

 
Figure 1: Authentication Cache Growth (cache size over time)

You might assume at first that the cache has no limit. But Brent clearly accounts for "Reclaim," which is either back pressure from a memory manager or from data getting stale. At those points in the graph, the server is removing entries in the cache.
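The two reclaim triggers, memory pressure and stale data, can be sketched as a bounded cache with an age limit. This is an assumption-laden illustration: the article does not say which eviction policy the Solaris OS uses, so the least-recently-used policy, the class name ReclaimingCache, and the max_entries and max_age parameters are all hypothetical.

```python
from collections import OrderedDict


class ReclaimingCache:
    def __init__(self, max_entries, max_age, clock):
        self._max_entries = max_entries
        self._max_age = max_age
        self._clock = clock            # callable returning the current time
        self._entries = OrderedDict()  # key -> (value, time stored)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self._clock() - stored_at > self._max_age:
            del self._entries[key]     # stale entry: reclaim it
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self._entries[key] = (value, self._clock())
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            # Size pressure: drop the least recently used entry.
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)


now = [0]  # fake clock so the example is repeatable
cache = ReclaimingCache(max_entries=2, max_age=60, clock=lambda: now[0])
cache.put(("eng1", "/export/home"), True)
cache.put(("eng2", "/export/home"), True)
cache.put(("eng3", "/export/home"), True)   # pushes out the eng1 entry
print(len(cache))
now[0] = 120
print(cache.get(("eng3", "/export/home")))  # too old, so reclaimed
```

Every reclaimed entry becomes a cache miss the next time that client sends a request, which is why reclaiming matters for name server traffic.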

Now look at Figure 2 from Brent's presentation, which shows the traffic to the name servers.

 
Figure 2: Authentication Name Service Traffic (traffic over time)

You can see that the traffic dies off as the cache fills up.

The performance issues that can develop are related to several questions:

  • How large can the cache get?
  • How long does it take for a request to come back from the name servers?
  • In the case of a reboot, how long does it take to refill the cache?

These are not new questions, but once again, in the type of NFS client deployments possible in 1996, it was safe to ignore them. The performance was satisfactory.

Automounters

Besides the growth of the business campus, the second factor that really started to impact performance was the adoption of automounters. An automounter is a service that automatically and implicitly mounts exports on a client, as opposed to the manual explicit mounting that an admin can do. A simple way for an automounter to work is to query an NFS server for all of its available exports. This is actually a single remote procedure call (RPC) in the mountd protocol and is basically the core algorithm of a showmount -e. The issue is that most servers are not optimized with respect to generating this list. Again, the mountd RPC calls are rare, and the Solaris Operating System designers made a performance trade-off between consuming main memory and having the data on disk. You will find more about this in a later section of this article.

Another simple algorithm for an automounter is to mount everything from that server. This appears to be a smart shortcut on the part of the client. If an application wants to access one share on a server, it will likely want to look at the other shares.

With a small number of shares, short host access lists, and a small set of clients, this does not appear to hurt performance. But if any of these increases -- the number of shares, the number of hosts in the access lists, or the number of clients -- then this approach creates a serious bottleneck. And if all three increase, this reduces application performance.

You may wonder why the number of hosts in the access lists becomes an issue. The reason is that each entry in the list is a potential call to a name server. And if the NFS server does not cache name server results, does not cache netgroup expansions, or is not a name server, then each host can be a call over the network. The robustness of your name server cache daemon (nscd(1M)) can seriously impact the robustness and performance of your mountd(1M) and NFS daemon (nfsd(1M)).
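A rough upper bound shows how the three factors multiply. The sketch below assumes the worst case described above: nothing is cached, every client mounts every share, and every access list entry costs one name service call. The function name and the sample numbers are illustrative only.

```python
def potential_lookups(clients, shares, access_list_entries):
    """Worst case with no caching: one name service call per client,
    per share, per access list entry."""
    return clients * shares * access_list_entries


# A small deployment of the kind the original design assumed.
print(potential_lookups(clients=12, shares=10, access_list_entries=5))

# A compute-farm job on a server with many ZFS exports.
print(potential_lookups(clients=512, shares=1500, access_list_entries=5))
```

The first case yields hundreds of potential lookups; the second yields millions, which is why caching at every layer becomes essential.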

Compute Farms

The third change affecting scaling issues was the increased attention to and deployment of large-scale client farms. Circuit layout, simulation, image rendering, and so on have all driven enterprises to cluster together larger numbers of clients to solve a common problem. And in these solutions, NFS is used to store intermediate and final results. There might not be any storage on the clients; the cluster might not be permitted to use local storage; or the storage on the client might not be reliable, where reliability includes regular archival backup. Enterprises have current deployments of -- or plans for -- from one to more than 20,000 clients in such compute farms.

The problem now becomes not only the number of clients that need to be cached but the temporal synchronization. A job is started, and 512 nodes are selected to process the components. At that point, all the jobs will probably open a control file, load tools, and start working. Each NFS server is going to create a storm to the name servers when processing cache misses. And remember, depending on the cache size and memory-scavenging algorithm in place, authentication cache and nscd entries may be reclaimed.

After a period of computation, all of the nodes will start to write out results at roughly the same time. They may each have their own data file, but these will most likely be on one file server and in a common share, if not in a directory. At that point, there may be cache misses again. Combine this one task with others in the cluster, and it is easy to imagine storms of reclaiming cache entries and name server lookups on the resulting cache misses.

Another factor to consider is that these compute farms may be distributed across the country or globe. If one part of the company needs additional resources, it may borrow cycles from another site. And the NFS mounts may be on the wide area network (WAN). Even with minimal writing or reading, without careful planning, some access checks may still cause name server requests to go across the WAN. As the time to process name server requests increases, so does the NFS response time on a cache miss. This in turn may cause the client to resend the same request, which gets a cache miss and restarts the communication with the name server.

Growth of NFS Exports With Solaris ZFS

The authentication cache has been in the Solaris OS for over 10 years, and Sun hasn't been deluged with complaints about either loading shares into memory at boot up or cache miss loads in compute farms. Obviously, scaling concerns aside, the cache is working. But users are raising complaints about boot times now. What has changed? The answer is that Sun started shipping Solaris ZFS in OpenSolaris OS and in Solaris 10 OS.

Why would that matter? The Solaris OS has the property that an export cannot have a descendant also shared. So, in order to get a large number of shares, you either need a large number of file systems or a large number of high-level directories. You either share the root of a file system or organize exports based on a directory structure. For whatever reason, the approach of organizing exports this way never caught on. And because a server normally contains only a couple of file systems, the total number of shares is small.

With the deployment of Solaris ZFS, the ZFS storage pool is not shared, but the individual ZFS data sets can be. And the easiest way to test Solaris ZFS is to use it for home directories on a file server. Sun engineering did just that on one of the company's production file servers. Until Solaris ZFS was installed on the file server, it had about 12 exports. One week before the home directories were migrated totally to ZFS, the server had about 300 exports. Right after the migration, it had about 1300 exports. The server currently has about 1500 exports.

And the Sun NFS team started seeing a serious impact on the loading of shares at boot time. As previously mentioned, a design originally made to handle 10 exports -- even one that was redesigned 10 years ago -- does not scale well when going to 1500 exports. And Sun has customers that report going to 15,000 exports.

The following sections provide the breakdown of what was going on with the file server and what the Sun NFS team did to fix the loading of shares, which included a positive impact on reducing cache miss authentication checks.

In-Kernel Sharetab

A huge issue was that the sharetab(4) contents resided on disk as the file /etc/dfs/sharetab. The kernel kept a list of exported paths in memory, but the options resided on disk. So every time a host access list had to be checked, mountd had to load /etc/dfs/sharetab, process every entry until it found the matching path, and then get the options. As an aside, imagine the 512 nodes all starting a job at the same time, using an automounter that mounted everything and a large number of shares. Although the sharetab file would essentially be paged into memory, the process would still be painful.
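The per-request cost of that lookup can be sketched as a linear scan over the file contents. The sample below assumes the tab-separated field order documented in sharetab(4) (path, resource, file system type, options, description); the sample entries and the function name are hypothetical.

```python
# Assumed layout per line: path, resource, fstype, options, description.
SAMPLE_SHARETAB = (
    "/export/home\t-\tnfs\trw=engineering\thome dirs\n"
    "/export/tools\t-\tnfs\tro\ttools\n"
)


def find_share_options(sharetab_text, path):
    """Scan entries in order until the path matches.

    Returns (options, entries scanned); options is None if not found.
    """
    scanned = 0
    for line in sharetab_text.splitlines():
        if not line.strip():
            continue
        scanned += 1
        fields = line.split("\t")
        if fields[0] == path:
            return fields[3], scanned
    return None, scanned


print(find_share_options(SAMPLE_SHARETAB, "/export/tools"))
print(find_share_options(SAMPLE_SHARETAB, "/export/missing"))
```

With two entries the scan is trivial, but the number of entries examined grows with the position of the share in the file, and the whole file is reprocessed for every access check.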

The very first attempt at reducing the impact of Solaris ZFS was to add a userland cache of the sharetab in mountd. Because /etc/dfs/sharetab was supposed to be read-only -- it was not -- and the kernel never added new shares, this was an acceptable stopgap. But although it helped the case of cache misses for authentication, it did not help very much for sharing file systems at boot time.

If you look at that process in more detail, each entry in the file /etc/dfs/dfstab is processed serially and added to /etc/dfs/sharetab. With the addition of Solaris ZFS and the fact that those share options do not live in dfstab(4), you have to iterate over the set of ZFS data sets to find the ones that need to have an export.

It sounds simple enough:

  1. Open the sharetab file.
  2. Lock it to keep other applications from writing to it.
  3. Open the dfstab file.
    1. Process each share.
    2. Write the result to the sharetab file.
  4. Close the dfstab file.
  5. Iterate over each ZFS shared data set.
    1. Process each share.
    2. Write the result to the sharetab file.
  6. Close the sharetab file.

That would be quicker than what was really happening. Remember that writing out a share was a rare event and could also happen long after the system was up as a result of the share(1M) command. And a good coding technique is to reuse common code.

The real way that the algorithm processed each share was to do the following:

  1. Open the dfstab file.
    1. Process each share.
    2. Open the sharetab file.
    3. Lock it to keep other applications from writing to it.
    4. Search to see if the share is already in the file.
      1. If so, overwrite the old copy.
      2. If not, append the share to the end of the file.
    5. Close the sharetab file.
  2. Close the dfstab file.
  3. Iterate over each ZFS shared data set.
    1. Process each share.
    2. Open the sharetab file.
    3. Lock it to keep other applications from writing to it.
    4. Search to see if the share is already in the file.
      1. If so, overwrite the old copy.
      2. If not, append the share to the end of the file.

Imagine that process happening 15,000 times and your system not being available for NFS access until every share is loaded. The Sun NFS team was benchmarking this scenario and had to give up. By the way, a later section of this article will show some numbers for the load times.
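To see why the second algorithm scales so poorly, the two loaders above can be modeled in memory while counting file opens and entries searched. This is a simplified cost model, not the Solaris code: it treats the sharetab as a Python list, ignores the difference between dfstab entries and ZFS data sets, and uses hypothetical function names and paths.

```python
def load_single_pass(shares):
    """Idealized loader: open the sharetab once and append every share."""
    opens = 1
    table = []
    for share in shares:
        table.append(share)
    return table, opens, 0  # no existing entries are searched


def load_per_share(shares):
    """Loader as described: reopen and search the sharetab for every share."""
    opens = 0
    scanned = 0
    table = []
    for share in shares:
        opens += 1                      # open and lock the sharetab
        for index, existing in enumerate(table):
            scanned += 1
            if existing == share:
                table[index] = share    # overwrite the old copy
                break
        else:
            table.append(share)         # not found: append to the end
    return table, opens, scanned


paths = [f"/tank/home/user{n}" for n in range(1500)]
_, opens, scanned = load_single_pass(paths)
print("single pass:", opens, "open,", scanned, "entries searched")
_, opens, scanned = load_per_share(paths)
print("per share:  ", opens, "opens,", scanned, "entries searched")
```

For n distinct shares, the per-share loader searches n(n-1)/2 existing entries, so the work grows with the square of the number of shares, and that is before counting the disk I/O and locking on every open.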

The team fixed this issue by storing the sharetab in the kernel. They decided to place the sharetab there and not in userland because it was possible for a userland program to be turned off or die. Also, it is a better design to have a generic sharetab and not one tied to a protocol and its associated daemons.

The Sun NFS team's second design choice was to keep the sharetab accessible through the file /etc/dfs/sharetab. They did this by making a new file system that was mounted on that path. When the file is read, the contents are taken directly from the kernel. A bonus is that the team can now enforce that the sharetab cannot be modified by any other application, including vi(1). But any other process can still read the file, and the change did not break any third-party application monitoring the file to see whether shares were changed.

By placing the sharetab in the kernel, the team was able to add a syscall to allow the sharemgr(1M) to add or delete shares. A later section of this article will discuss the sharemgr(1M) further. The new algorithm is as follows:

  1. Open the dfstab file.
    1. Process each share.
    2. Use a system call to send the share to the kernel.
      1. Hash the share path to get a linked list.
      2. Search the list for a copy and delete it.
      3. Append the share to the end of the list.
  2. Close the dfstab file.
  3. Iterate over each ZFS shared data set.
    1. Process each share.
    2. Use a system call to send the share to the kernel.
      1. Hash the share path to get a linked list.
      2. Search the list for a copy and delete it.
      3. Append the share to the end of the list.
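The hash-and-replace steps can be sketched as a small chained hash table. This is an illustration of the technique only: the kernel code is not shown in this article, so the class name, the bucket count, and the use of Python lists in place of linked lists are all assumptions.

```python
class KernelSharetabSketch:
    def __init__(self, bucket_count=64):
        # One list per bucket stands in for each linked list.
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, path):
        # Hash the share path to pick a list.
        return self._buckets[hash(path) % len(self._buckets)]

    def add(self, path, options):
        bucket = self._bucket(path)
        # Search only this list for an older copy and delete it.
        bucket[:] = [entry for entry in bucket if entry[0] != path]
        # Append the share to the end of the list.
        bucket.append((path, options))

    def lookup(self, path):
        for entry_path, options in self._bucket(path):
            if entry_path == path:
                return options
        return None

    def __len__(self):
        return sum(len(bucket) for bucket in self._buckets)


table = KernelSharetabSketch()
table.add("/export/home", "rw")
table.add("/export/home", "rw=engineering")  # replaces the older copy
print(len(table), table.lookup("/export/home"))
```

Because each add or lookup touches only one short list, the cost per share stays roughly constant instead of growing with the total number of shares.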

The cost of the syscall is much less than the costs associated with the file open to the sharetab, getting a write lock, reading from disk, conducting a linear search, writing to disk, and closing the file. The reduced cost will be shown later in this article.

The Share Manager

The Sun NFS team also wanted to be able to manage shares from the command line interface (CLI) in a manner that was protocol independent. They wanted to be able to script the management, as well as to be able to extend this solution through plug-in modules. The team introduced the sharemgr(1M) to do all of that. Another Connectathon presentation, The Management of Shares (PDF), provides a good overview of the sharemgr command.

But the team also wanted to be able to introduce parallelism in the loading of shares. Remember that the basic algorithm is sequential because you are either loading the dfstab(4) or iterating over the Solaris ZFS data sets. The sharemgr introduces share groups, which are named collections of shares, each associated with a Service Management Facility (SMF) service instance. Each share group can be processed by its own thread, so they can load in parallel. By the way, the sharemgr shipped before the in-kernel sharetab; until the sharetab moved into the kernel, share groups could not actually load in parallel because the write lock on the sharetab(4) file was a bottleneck.
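The idea of one thread per share group can be sketched with a thread pool. This shows the structure only and is not how sharemgr is implemented: the group names, paths, and the publish callback are hypothetical, and a Python dictionary guarded by a lock stands in for the shared table that every group writes to.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def load_group(name, shares, publish):
    """Load every share in one group; runs in that group's own thread."""
    for path in shares:
        publish(path, name)
    return name, len(shares)


def load_all_groups(groups, publish):
    """Start one worker per share group and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = [pool.submit(load_group, name, shares, publish)
                   for name, shares in groups.items()]
        return dict(future.result() for future in futures)


table = {}
table_lock = threading.Lock()


def publish(path, group):
    # The lock is held only for the brief in-memory update.
    with table_lock:
        table[path] = group


groups = {
    "default": ["/export/tools"],
    "zfs": ["/tank/home/a", "/tank/home/b"],
}
print(load_all_groups(groups, publish))
print(sorted(table))
```

The design point is the length of the critical section: a lock held for a quick in-memory update lets the groups make progress side by side, whereas a write lock held across file reads and writes serializes them.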

The sharemgr also alleviated some issues in the background. For example, the unshareall(1M) script used to iterate over the sharetab(4) and call share(1M) to remove shares. With the addition of sharemgr, the looping over sharetab and each call to share was removed. Instead, a single call to sharemgr is issued.

This single call to sharemgr was also applied to the way the Solaris ZFS code was sharing or unsharing exports. Instead of looping over the set of ZFS data sets to be processed and calling popen(3C) on each one to invoke share(1M), the code issues a single call to sharemgr(1M). The largest contributing factor to the popen being slow is the time necessary to pull in large amounts of the configuration data on each invocation of share and unshare, with each subsequent call needing more data in the share case.

A secondary performance enhancement comes from using the sharemgr library interfaces within the zfs(1M) command. This eliminates the call to popen that was being used to enable the share(s) after setting the sharenfs property on a ZFS data set.

Some Hard Data

How did the sharemgr and in-kernel sharetab changes impact the loading and unloading of shares for Solaris ZFS? The Sun NFS team ran a set of tests in which they varied the number of shares and got the resulting times in hours, minutes, and seconds. Table 1 shows the results, expressed as H:MM:SS. The team performed the "On" tests using zfs set sharenfs=on and the "Off" tests using zfs set sharenfs=off and allowing ZFS property inheritance to trigger all of the shares being enabled or disabled. The improvements refer to the elimination of the popen calls.

Table 1: Time to Turn Shares On and Off, by Number of Shares (H:MM:SS)

Approach                                 100 On    100 Off   2900 On   2900 Off  5000 On   5000 Off  15,150 On  15,150 Off
Before sharemgr improvements             0:07:35   0:12:23   > 1 week  > 1 week  > 1 week  > 1 week  > 1 week   > 1 week
After sharemgr improvements              0:00:07   0:00:13   0:02:55   0:05:40   0:08:41   0:33:24   1:06:36    2:35:36
Use of sharemgr plus in-kernel sharetab  0:00:0.7  0:00:0.4  0:00:54   0:03:39   0:02:37   0:12:08   0:19:40    2:12:14

In the base case, the Sun NFS team terminated any run with more than 100 shares after one week of processing. This happened for the 2900-share cases, and the team made no attempt at the larger configurations, so the "> 1 week" entries for 5000 and 15,150 shares are inferred rather than measured. You can see that both the sharemgr and the in-kernel sharetab approaches made significant improvements in loading and unloading shares. The team is particularly interested in the loading case, that is, turning shares on, because that is what creates a bottleneck for an NFS server to boot and to start responding to clients. The team has not run corresponding tests for cache miss lookups, but team members suspect they would see improvements there as well. In all cases, the startup or shutdown of the shares is done serially for these tests.
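To put the load-time improvement in relative terms, the "On" timings from Table 1 can be converted to seconds and compared. The values below are copied from the table; the helper names are illustrative, and the script only does arithmetic on the published numbers.

```python
def to_seconds(hms):
    """Convert an H:MM:SS string (seconds may be fractional) to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def speedup(before, after):
    return to_seconds(before) / to_seconds(after)


# "On" times from Table 1: (after sharemgr improvements,
#                            sharemgr plus in-kernel sharetab)
ON_TIMES = {
    100: ("0:00:07", "0:00:0.7"),
    2900: ("0:02:55", "0:00:54"),
    5000: ("0:08:41", "0:02:37"),
    15150: ("1:06:36", "0:19:40"),
}

for shares, (sharemgr_only, with_kernel) in ON_TIMES.items():
    ratio = speedup(sharemgr_only, with_kernel)
    print(f"{shares:>6} shares: in-kernel sharetab loads {ratio:.1f}x faster")

# The only base-case load that completed: 100 shares in 0:07:35.
print(f"100 shares, base case vs. final: {speedup('0:07:35', '0:00:0.7'):.0f}x")
```

For the three larger configurations, the in-kernel sharetab loads shares a little more than three times faster than sharemgr alone.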

There is additional room for improvement. An interesting test would be to load 15,000 shares and look at how long it takes to load and to unload a single share. As Doug McCallum points out on his blog entry Recent Performance Improvement in ZFS Handling of Shares, this would be a great OpenSolaris project for someone wanting to learn more about NFS and Solaris ZFS.

This article has raised many interesting points about how scaling impacts performance, and it has provided an overview of how the Sun NFS team approached solutions.

References

Multiple Flavors per Export (PDF), Brent Callaghan, Connectathon 1995, Mar. 13, 1995
NFS Client Authentication (PDF), Brent Callaghan, Connectathon 1996, Feb. 27, 1996
Scaling NFS Services (PDF), Tom Haynes, Connectathon 2006, Mar. 2, 2006
The Management of Shares (PDF), Tom Haynes and Doug McCallum, Connectathon 2007, Feb. 5, 2007
Recent Performance Improvement in ZFS Handling of Shares, Doug McCallum's Share Manager Weblog, May 8, 2007
Download: OpenSolaris OS man pages in tar format

About the Authors

Tom Haynes is an NFS developer for Sun Microsystems, and scaling issues in the management of exports are attracted to him.

Doug McCallum is an NFS developer for Sun Microsystems with a strong interest in manageability issues for file sharing.
