Sun Microsystems' addition of the Solaris zettabyte file system (ZFS) to its storage products exposed a Network File System (NFS) performance problem rooted in how file systems are exported, or shared, to remote clients. Note that this article uses the terms export and share to refer to a path in a file system that clients can mount, or graft, into their local file systems.

As with many performance issues, the root of the problem was scaling: Solaris ZFS allowed users to go from a handful of exports to many thousands. The Sun NFS team's solution to a related problem in the past had been to add an authentication cache to deal with increased numbers of clients accessing the exports. Another cache for sharing a loaded export was added to reduce the problem that ZFS exposed. But the team also had to reduce the cost of actually doing the share itself.

Scaling issues are still defined by how NFS boxes were deployed up to 20 years ago. At that time, there were normally dozens of clients and several file systems on the server. The cost to authenticate a client was negligible. Also consider that the clients were probably on the same network segment and that the file server was probably also a name server -- primary or backup. With only a couple of file systems and a restriction of not exporting children of an export, machines could load the exports into memory quite quickly.

However, large intranets, automounters, compute farms, and the increase of NFS exports with Solaris ZFS have changed all of these early assumptions. This article provides an overview of these challenges and describes the changes required to meet today's environment.

Note: This article refers to man pages for the OpenSolaris OS versions and not for the Solaris 10 OS versions. You can download the OpenSolaris OS man pages in

Large Intranets
The first change was the development of large intranets. Consider a model in which every floor in a building has both its own subnet and a different business unit -- for example, marketing on the second floor and engineering on the fifth floor. Then magnify this by many buildings on a campus. The ramifications are that remote access might have to traverse many subnet segments; name services might be down or slow enough to be considered dead; and an admin might want to restrict access to a server from a range of hosts. So an export might change from something like this:
to this:
In the first form, there is no need to query a name server to determine access rights: All hosts are allowed. From a performance view, this is a big win.

Before this article turns to the solution, ask yourself: Why are name service calls bad? The answer is that traffic to the network is at best comparable to traffic to disk. The real problem is what happens when the name server is unreachable, whether because of a network problem or a server that is down. If every NFS request were to take 20 seconds, how long would it be before admins started removing host restrictions from exports? Then, add into the mix that every entry in the host access list may result in a 20-second query to the name server.

Authentication Cache
The comparison to disk gives the solution: the mount daemon

But checking only

In his 1996 presentation NFS Client Authentication (PDF), Brent presented two graphs showing the expected behavior of the authentication cache growth and traffic to the name servers. These graphs are not based on hard data, so blindly following them can lead to confusion about performance. Figure 1 shows the graph of authentication cache growth.
You might assume at first that the cache has no limit. But Brent clearly accounts for "Reclaim," which is either back pressure from a memory manager or from data getting stale. At those points in the graph, the server is removing entries in the cache. Now look at Figure 2 from Brent's presentation, which shows the traffic to the name servers.
You can see that the traffic dies off as the cache fills up. The performance issues that can develop are related to several questions:
These are not new questions, but once again, in the type of NFS client deployments possible in 1996, it was safe to ignore them. The performance was satisfactory.

Automounters
Besides the growth of the business campus, the second factor that really started to impact performance was the adoption of automounters. An automounter is a service that automatically and implicitly mounts exports on a client, as opposed to the manual, explicit mounting that an admin can do. A simple way for an automounter to work is to query an NFS server for all of its available exports. This is actually a single remote procedure call (RPC) in the

Another simple algorithm for an automounter is to mount everything from that server. This appears to be a smart shortcut on the part of the client: If an application wants to access one share on a server, it will likely want to look at the other shares. With a small number of shares, short host access lists, and a small set of clients, this does not appear to hurt performance. But if any one of these increases -- the number of shares, the number of hosts in the access lists, or the number of clients -- this approach creates a serious bottleneck. And if all three increase, the costs multiply and application performance suffers.

You may wonder why the number of hosts in the access lists becomes an issue. The reason is that each entry in the list is a potential call to a name server. And if the NFS server does not cache name server results, does not cache netgroup expansions, or is not a name server, then each host can be a call over the network.
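The multiplication described above can be illustrated with a small counting model. The sketch below is not the Solaris mount daemon's logic; the function names (`make_resolver`, `check_access`, `mount_everything`), the export paths, and the simple dictionary cache are hypothetical, chosen only to show how lookups grow when every client mounts every share and nothing is cached.

```python
def make_resolver():
    """Return a fake name-service lookup plus a counter of calls made."""
    calls = {"count": 0}

    def matches(client, entry):
        # Stand-in for a host or netgroup lookup that goes over the network.
        calls["count"] += 1
        return client == entry

    return matches, calls


def check_access(client, share, access_lists, matches, cache=None):
    """Walk the share's access list, consulting the cache first if given."""
    key = (client, share)
    if cache is not None and key in cache:
        return cache[key]
    # Each entry examined is one potential call to the name server.
    allowed = any(matches(client, entry) for entry in access_lists[share])
    if cache is not None:
        cache[key] = allowed
    return allowed


def mount_everything(clients, access_lists, rounds, use_cache):
    """Every client mounts every share `rounds` times; return lookup count."""
    matches, calls = make_resolver()
    cache = {} if use_cache else None
    for _ in range(rounds):
        for client in clients:
            for share in access_lists:
                check_access(client, share, access_lists, matches, cache)
    return calls["count"]


if __name__ == "__main__":
    clients = ["c0", "c1"]  # hypothetical clients, none in any access list
    access_lists = {
        "/export/a": ["h0", "h1", "h2"],
        "/export/b": ["h0", "h1", "h2"],
    }
    print(mount_everything(clients, access_lists, rounds=3, use_cache=False))
    print(mount_everything(clients, access_lists, rounds=3, use_cache=True))
```

With two clients, two shares, three entries per access list, and three rounds of mounting, the uncached run makes 36 lookups and the cached run makes 12. At 20 seconds per lookup against an unreachable name server, that is the difference between 12 minutes and 4 minutes of waiting, and the gap widens as any of the three factors grows.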
The robustness of your name server cache daemon (

Compute Farms
The third change affecting scaling issues was the increased attention to and deployment of large-scale client farms. Circuit layout, simulation, image rendering, and so on have all driven enterprises to cluster together larger numbers of clients to solve a common problem. And in these solutions, NFS is used to store intermediate and final results. There might not be any storage on the clients; the cluster might not be permitted to use local storage; or the storage on the client might not be considered reliable, where reliability includes regular archival backup. Enterprises have current deployments of -- or plans for -- from one to more than 20,000 clients in such compute farms.

The problem now becomes not only the number of clients that need to be cached but the temporal synchronization. A job is started, and 512 nodes are selected to process the components. At that point, all the jobs will probably open a control file, load tools, and start working. Each NFS server is going to create a storm to the name servers when processing cache misses. And remember, depending on the cache size and memory-scavenging algorithm in place, authentication cache and

After a period of computation, all of the nodes will start to write out results at roughly the same time. They may each have their own data file, but these will most likely be on one file server and in a common share, if not in a common directory. At that point, there may be cache misses again. Combine this one task with others in the cluster, and it is easy to imagine storms of reclaiming cache entries and name server lookups on the resulting cache misses.

Another factor to consider is that these compute farms may be distributed across the country or globe. If one part of the company needs additional resources, it may borrow cycles from another site. And the NFS

Growth of NFS Exports With Solaris ZFS
The authentication cache has been in the Solaris OS for over 10 years, and Sun hasn't been deluged with complaints about either loading shares into memory at boot up or cache miss loads in compute farms. Obviously, scaling concerns aside, the cache is working. But users are raising complaints about boot times now. What has changed? The answer is that Sun started shipping Solaris ZFS in OpenSolaris OS and in Solaris 10 OS.

Why would that matter? The Solaris OS has the property that an export cannot have a descendant also shared. So, in order to get a large number of shares, you either need a large number of file systems or a large number of high-level directories. You either share the root of a file system or organize exports based on a directory structure. For whatever reason, the approach of organizing exports this way never caught on. And because a server normally contains only a couple of file systems, the total number of shares is small.

With the deployment of Solaris ZFS, the ZFS storage pool is not shared, but the individual ZFS data sets can be. And the easiest way to test Solaris ZFS is to use it for home directories on a file server. Sun engineering did just that on one of the company's production file servers. Until Solaris ZFS was installed on the file server, it had about 12 exports. One week before the home directories were migrated totally to ZFS, the server had about 300 exports. Right after the migration, it had about 1300 exports. The server currently has about 1500 exports. And the Sun NFS team started seeing a serious impact on the loading of shares at boot time.

As previously mentioned, a design originally made to handle 10 exports -- even one that was redesigned 10 years ago -- does not scale well when going to 1500 exports. And Sun has customers that report going to 15,000 exports.
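The restriction on nested exports can be sketched as a simple path check. This is a hedged illustration, not Solaris code: the function name `conflicts`, the `(file_system, path)` tuples, and the example names are hypothetical, and the sketch assumes the rule applies only within a single file system. Under that assumption, giving each home directory its own ZFS data set is what lets every one of them be shared.

```python
from pathlib import PurePosixPath


def conflicts(existing, candidate):
    """Return True if `candidate` is the same path as, an ancestor of, or a
    descendant of an already shared path in the same file system.

    Each share is a (file_system, path) tuple; both names are hypothetical.
    """
    cand_fs, cand_path = candidate
    cand = PurePosixPath(cand_path)
    for fs, path in existing:
        if fs != cand_fs:
            # Assumption: shares in different file systems never conflict.
            continue
        other = PurePosixPath(path)
        if cand == other or cand in other.parents or other in cand.parents:
            return True
    return False


if __name__ == "__main__":
    shared = [("ufs0", "/export/home")]
    # One traditional file system: a home directory below the share conflicts.
    print(conflicts(shared, ("ufs0", "/export/home/alice")))
    # A separate data set per user: no conflict, so each can be its own share.
    print(conflicts(shared, ("pool/home/alice", "/export/home/alice")))
```

In this model, a server with a couple of traditional file systems tops out at a few shares unless the admin reorganizes directories, while a server with one data set per user gains one potential share per user.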
The following sections provide the breakdown of what was going on with the file server and what the Sun NFS team did to fix the loading of shares, a fix that also reduced the cost of authentication checks on cache misses.

In-Kernel Sharetab

A huge issue was that the

The very first attempt at reducing the impact of Solaris ZFS was to add a

If you look at that process in more detail, each entry in the file

It sounds simple enough:

That would be quicker than what was really happening. Remember that writing out a share was a rare event and could also happen long after the system was up as a result of the

The real way that the algorithm processed each share was to do the following:

Imagine that process happening 15,000 times and your system not being available for NFS access until every share is loaded. The Sun NFS team was benchmarking this scenario and had to give up. By the way, a later section of this article will show some numbers for the load times.

The team fixed this issue by storing the

The Sun NFS team's second design choice was to keep the

By placing the

The cost of the

The Share Manager
The Sun NFS team also wanted to be able to manage shares from the command line interface (CLI) in a manner that was protocol independent. They wanted to be able to script the management, as well as to be able to extend this solution through plug-in modules. The team introduced the

But the team also wanted to be able to introduce parallelism in the loading of shares. Remember that the basic algorithm is sequential because you are either loading the

The

This single call to

A secondary performance enhancement comes from using the
Some Hard Data
How did the

With the base case, the Sun NFS team terminated runs for more than 100 shares after one week of processing. This was done for the 2900 cases, and the team made no attempt for the larger configurations. You can see that both the

There is additional room for improvement. An interesting test would be to load 15,000 shares and look at how long it takes to load and to unload a single share. As Doug McCallum points out on his blog entry Recent Performance Improvement in ZFS Handling of Shares, this would be a great OpenSolaris project for someone wanting to learn more about NFS and Solaris ZFS.

This article has raised many interesting points about how scaling impacts performance, and it has provided an overview of how the Sun NFS team approached solutions.

References
Multiple Flavors per Export (PDF), Brent Callaghan, Connectathon 1995, Mar. 13, 1995
About the Authors
Tom Haynes is an NFS developer for Sun Microsystems, and scaling issues in the management of exports are attracted to him.

Doug McCallum is an NFS developer for Sun Microsystems with a strong interest in manageability issues for file sharing.