NewMountDesignSpec
From Linux NFS
Introduction
This wiki page is a working design specification for the new text-based NFS mount API. Here we discuss use cases, requirement statements, error reporting, and design specifications, in addition to minute behavioral details of mounting NFS shares. The purpose of this discussion is to understand how to implement the new interface, and to construct the basis of unit tests for both the legacy user-space mount command and the new in-kernel mount client.
Requirements
There are several broad requirements for the new text-based NFS mount API.
- Scalability - Allow for thousands of NFS mount points, and a large number of simultaneous mount operations
- No user-space dependency on a versioned binary blob for passing NFS mount options to the kernel
- Support version fallback - If NFS version 4 is not supported, fall back to version 3; if version 3 is not supported, fall back to version 2
- Support transport protocol fallback - If TCP is not supported, fall back to UDP
- Provide reasonable default behavior in the presence of network firewalls and misconfigured servers
- Facilitate new features - IPv6, RDMA, FS cache should be easy to introduce
- Better error reporting - Report and log useful, relevant, clear error messages when a failure has occurred
- Update and clarify NFS mount documentation
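The version and transport fallback requirements above can be read as an ordered list of attempts. The following is a minimal sketch only: the name `fallback_candidates` is hypothetical, and trying every transport for one version before dropping to the next lower version is an assumption, since the requirements do not say how the two fallbacks combine.

```python
from itertools import product

# Preference orders taken from the requirements list: newest NFS version
# first, TCP before UDP.
VERSIONS = (4, 3, 2)
TRANSPORTS = ("tcp", "udp")


def fallback_candidates(versions=VERSIONS, transports=TRANSPORTS):
    """Return (version, transport) pairs in the order they would be tried.

    Assumption: all transports are tried for one version before falling
    back to the next lower version.
    """
    return list(product(versions, transports))
```

A caller would walk this list and stop at the first combination the server accepts.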
Use Cases
To mount a remote share using NFS version 2, use the \fBnfs\fR file system type and specify the \fInfsvers=2\fR mount option. To mount using NFS version 3, use the \fBnfs\fR file system type and specify the \fInfsvers=3\fR mount option. To mount using NFS version 4, use the \fBnfs4\fR file system type (the \fInfsvers\fR mount option is not supported for the \fBnfs4\fR file system type).
Here is an example from an \fI/etc/fstab\fP file for an NFS version 3 mount over TCP.
server:/export/share /mnt nfs nfsvers=3,proto=tcp
Here is an example for an NFS version 4 mount over TCP using Kerberos 5 mutual authentication.
server:/export/share /mnt nfs4 sec=krb5
Design Specifications
The discussion of NFSv2/v3 mounting is significantly more complicated than the discussion of NFSv4 mounting, because NFS versions 2 and 3 rely on separate auxiliary protocols for mounting and locking.
Return Codes and Error Reporting
Currently, mount's error messages are problematic in several ways.
- Some error messages are incorrect.
- Some error messages are repeated.
- Some errors are never reported.
- Some error messages are too specific to be useful to an average administrator. For example, reporting an "RPC program/version mismatch occurred" is not helpful if the real problem is that "proto=udp" is not supported.
- Some error messages are too general to be useful. For example, "mount.nfs: not a directory" is just an errno string; more specific information would point to a course of corrective action.
Perhaps a clear, concise error message could be reported on the command line, with the full detail written to the system log? That's easy enough with in-kernel mount option parsing!
mount(2) API return codes
The mount.nfs program needs to distinguish between temporary problems and permanent errors in order to determine whether it's worth retrying a mount request in the background.
For text-based NFS mounts, the version/protocol fallback mechanism should occur in user space -- certainly fallback policy is easier to set and implement in user space, but the kernel must provide specific information about how a mount request failed so that user space can make an appropriate choice about the next step to try.
The current mount(2) API is described in a man page. The man page describes a set of generic error return codes, which we excerpt here. It also suggests that we can add specific error codes for NFS mounts.
RETURN VALUE
    On success, zero is returned. On error, -1 is returned, and errno is set appropriately.
ERRORS
    The error values given below result from filesystem type independent errors. Each filesystem type may have its own special errors and its own special behavior. See the kernel source code for details.
    EACCES
        A component of a path was not searchable. (See also path_resolution(2).) Or, mounting a read-only filesystem was attempted without giving the MS_RDONLY flag. Or, the block device source is located on a filesystem mounted with the MS_NODEV option.
    EAGAIN
        A call to umount2() specifying MNT_EXPIRE successfully marked an unbusy file system as expired.
    EBUSY
        source is already mounted. Or, it cannot be remounted read-only, because it still holds files open for writing. Or, it cannot be mounted on target because target is still busy (it is the working directory of some task, the mount point of another device, has open files, etc.). Or, it could not be unmounted because it is busy.
    EFAULT
        One of the pointer arguments points outside the user address space.
    EINVAL
        source had an invalid superblock. Or, a remount (MS_REMOUNT) was attempted, but source was not already mounted on target. Or, a move (MS_MOVE) was attempted, but source was not a mount point, or was ’/’. Or, an unmount was attempted, but target was not a mount point. Or, umount2() was called with MNT_EXPIRE and either MNT_DETACH or MNT_FORCE.
    ELOOP
        Too many link encountered during pathname resolution. Or, a move was attempted, while target is a descendant of source.
    EMFILE
        (In case no block device is required:) Table of dummy devices is full.
    ENAMETOOLONG
        A pathname was longer than MAXPATHLEN.
    ENODEV
        filesystemtype not configured in the kernel.
    ENOENT
        A pathname was empty or had a nonexistent component.
    ENOMEM
        The kernel could not allocate a free page to copy filenames or data into.
    ENOTBLK
        source is not a block device (and a device was required).
    ENOTDIR
        The second argument, or a prefix of the first argument, is not a directory.
    ENXIO
        The major number of the block device source is out of range.
    EPERM
        The caller does not have the required privileges.
Here are some additional return codes I recommend for NFS mounts, just as a start. These should allow a calling program to report a reasonably specific error message, and decide whether and how to retry the request.
    EBADF
        The mount option string could not be parsed, or an unrecognized option was specified, or a keyword option was specified with a value that is out of range.
This is a permanent mount error. The calling program should not retry this request with the same options.
    ESTALE
        The server denied access to the requested share.
    ETIMEDOUT
        The kernel's mount attempt timed out after n seconds (I think n is 15).
These are temporary errors. The calling program may choose to retry this request using the same options, or fail immediately.
    EPROTONOSUPPORT
        The server reports that the program, version, or transport protocol is not currently available.
    ECONNREFUSED
        The kernel's mount connection attempt was refused by the server at the network transport layer.
These are temporary errors. The calling program can attempt to recover by adjusting the options and retrying the request.
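The three groups of proposed return codes map naturally onto three recovery actions in the calling program. The sketch below illustrates that mapping; the function name `next_step` and the action strings are hypothetical, and treating any error outside the proposal as permanent is an assumption rather than part of the proposal.

```python
import errno

# Proposed NFS-specific mount(2) errors, grouped by the recovery action
# described in the text. The set names are illustrative only.
PERMANENT = {errno.EBADF}
RETRY_SAME_OPTIONS = {errno.ESTALE, errno.ETIMEDOUT}
RETRY_ADJUSTED_OPTIONS = {errno.EPROTONOSUPPORT, errno.ECONNREFUSED}


def next_step(err):
    """Map an errno from a failed NFS mount to a coarse recovery action."""
    if err in PERMANENT:
        return "fail"
    if err in RETRY_SAME_OPTIONS:
        return "retry"
    if err in RETRY_ADJUSTED_OPTIONS:
        return "adjust-and-retry"
    # Assumption: errors not listed in the proposal are treated as permanent.
    return "fail"
```

A program such as mount.nfs could use the "retry" result to decide whether a background mount is worthwhile, and the "adjust-and-retry" result to drive version or transport fallback.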
Discussion of Individual NFS Mount Options
There are four classes of mount options for \fBnfs\fR and \fBnfs4\fR file systems. Fix this: All four classes of options are specified as normal NFS mount options because there is only one way to specify mount options in the .I /etc/fstab file.
First, there are generic mount options available to all Linux file systems, such as "ro" or "sync". See .BR mount (8) for a description of generic mount options available for all file systems.
Second, some mount options can determine how the mount command behaves, such as "mountport" or "retry". These options have no effect after the mount operation has completed, but might be used to mount an NFS share through a network firewall.
Third, some mount options determine how the NFS client behaves during normal operation, such as "rsize" and "wsize". These may be used to tune performance, or change the client's caching or file locking behavior.
Fourth, mount options such as "timeo" or "retrans" can control aspects of Remote Procedure Call behavior. NFS clients send requests to NFS servers via Remote Procedure Calls, or RPCs. RPCs handle per-request authentication, adjust request parameters for different byte endianness on client and server, and retransmit requests that may have been lost by the network or server.
Note that some options take the form of .I keyword=value while some options are boolean, taking either the form of .I keyword or .I nokeyword. All options which do not use the .I keyword=value form use the boolean form, except for .I hard/soft, .I udp/tcp, and .I fg/bg.
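The option forms described above can be illustrated with a small parser. This is a sketch under stated assumptions, not the kernel's parser: the name `parse_options` is hypothetical, and stripping a leading "no" to find the boolean keyword is a naive rule chosen for illustration.

```python
# Options the text lists as exceptions to the keyword/nokeyword form.
PAIRED_OPTIONS = {"hard", "soft", "udp", "tcp", "fg", "bg"}


def parse_options(text):
    """Split a comma-separated option string into (keyword, value) pairs.

    keyword=value options keep their value as a string. Paired options are
    reported as (name, True). Boolean options are reported as
    (keyword, True) or, for the nokeyword form, (keyword, False).
    """
    parsed = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            keyword, value = token.split("=", 1)
            parsed.append((keyword, value))
        elif token in PAIRED_OPTIONS:
            parsed.append((token, True))
        elif token.startswith("no"):
            # Assumption: any other option starting with "no" is a negation.
            parsed.append((token[2:], False))
        else:
            parsed.append((token, True))
    return parsed
```

For example, `parse_options("nfsvers=3,proto=tcp,hard,nolock")` yields one pair per option, with `nolock` reported as `("lock", False)`.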
Valid options for either the nfs or nfs4 file system type
To Do
- Tabularize this section
- Add status information about each option
- Tested (legacy / text-based)
- Works, does not work as documented (legacy / text-based)
- Implementation/fix priority
- Details about how it works and/or how it should work
.TP 1.5i soft | hard Determines the recovery behavior of the RPC client after an RPC request times out. If neither option is specified, or if the \fIhard\fR option is specified, the RPC is retried indefinitely. If the \fIsoft\fR option is specified, then the RPC client fails the RPC request after a major timeout occurs, and causes the NFS client to return an error to the calling application.
.TP 1.5i timeo=\fIn The time, in tenths of a second, that the RPC client waits for a response before timing out an RPC request. The default value is 600 (60 seconds) for NFS over TCP. On a UDP transport, the Linux RPC client uses an adaptive algorithm to estimate the timeout value for frequently used request types such as READ and WRITE, and uses the \fItimeo=\fR setting for infrequently used requests such as FSINFO. The \fItimeo=\fR value defaults to 7 (0.7 seconds) for NFS over UDP. After each timeout, the RPC client may retransmit the timed out request, or it may take some other action depending on the settings of the \fIhard\fR or \fIretrans=\fR options.
.TP 1.5i retrans=\fIn The number of RPC timeouts that must occur before a major timeout occurs. The default is 3 timeouts. If the file system is mounted with the \fIhard\fR option, the RPC client will generate a "server not responding" message after a major timeout, then continue to retransmit the request. If the file system is mounted with the \fIsoft\fR option, the RPC client will abandon the request after a major timeout, and cause NFS to return an error to the application.
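Because \fItimeo=\fR is expressed in tenths of a second, the defaults are easy to misread. The sketch below converts the value to seconds using the defaults quoted above; the names `DEFAULT_TIMEO` and `timeo_seconds` are hypothetical, and the sketch does not model the adaptive UDP estimator or any backoff between retransmissions.

```python
# Default timeo values from the text, in tenths of a second.
DEFAULT_TIMEO = {"tcp": 600, "udp": 7}


def timeo_seconds(proto, timeo=None):
    """Return the initial per-request RPC timeout in seconds.

    proto is "tcp" or "udp"; timeo is the mount option value in tenths of
    a second, or None to use the transport's default.
    """
    if timeo is None:
        timeo = DEFAULT_TIMEO[proto]
    return timeo / 10
```

With no explicit value this gives 60 seconds for TCP and 0.7 seconds for UDP.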
.TP 1.5i rsize=\fIn The maximum number of bytes in each network READ request that the NFS client can use when reading data from a file on an NFS server; the actual data payload size of each NFS READ request is equal to or smaller than the \fIrsize\fP value. The \fIrsize\fP value is a positive integral multiple of 1024, and the largest value supported by the Linux NFS client is 1,048,576 bytes. Specified values that are not a multiple of 1024 are rounded down to the closest multiple of 1024, and specified values smaller than 1024 are replaced with a default of 4096. If an \fIrsize\fP value is not specified, or if a value is specified but is larger than the maximum that either the client or the server supports, the client and server negotiate the largest \fIrsize\fP value that both will support. The \fIrsize\fP option as specified on the .BR mount (8) command line appears in the .I /etc/mtab file, but the effective \fIrsize\fP value negotiated by the client and server is reported in the .I /proc/mounts file.
.TP 1.5i wsize=\fIn The maximum number of bytes per network WRITE request that the NFS client can use when writing data to a file on an NFS server. See the description of the \fIrsize\fP option for more details.
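The client-side adjustment of a requested \fIrsize\fP or \fIwsize\fP value can be sketched as follows. The name `normalize_io_size` is hypothetical, and capping an oversized request at the client maximum is an assumption standing in for the client/server negotiation, which this sketch does not model.

```python
IO_SIZE_MAX = 1_048_576   # largest value supported by the Linux NFS client
IO_SIZE_FALLBACK = 4096   # replaces requests smaller than 1024


def normalize_io_size(requested):
    """Return the rsize/wsize value the client would start from."""
    if requested < 1024:
        return IO_SIZE_FALLBACK
    # Assumption: requests above the client maximum are capped here; the
    # real value is then negotiated with the server.
    capped = min(requested, IO_SIZE_MAX)
    # Round down to a multiple of 1024.
    return capped - (capped % 1024)
```

For instance, a request of 5000 bytes is rounded down to 4096, and a request of 512 bytes is replaced with 4096.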
.TP 1.5i acregmin=\fIn The minimum time in seconds that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. The default is 3 seconds.
.TP 1.5i acregmax=\fIn The maximum time in seconds that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. The default is 60 seconds.
.TP 1.5i acdirmin=\fIn The minimum time in seconds that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. The default is 30 seconds.
.TP 1.5i acdirmax=\fIn The maximum time in seconds that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. The default is 60 seconds.
.TP 1.5i actimeo=\fIn Using actimeo sets all of .I acregmin, .I acregmax, .I acdirmin, and .I acdirmax to the same value. There is no default value.
.TP 1.5i bg | fg This mount option determines how the .BR mount (8) command behaves if an attempt to mount a remote share fails. The \fIfg\fR option causes .BR mount (8) to exit with an error status if any part of the mount request times out or fails outright. This is called a "foreground" mount, and is the default behavior if neither \fIfg\fR nor \fIbg\fR is specified. If the \fIbg\fR option is specified, a timeout or failure causes the .BR mount (8) command to fork a child which continues to attempt to mount the remote share. The parent immediately returns with a zero exit code. This is known as a "background" mount. If the local mount point directory is missing, the .BR mount (8) command treats that as if the mount request timed out. This permits nested NFS mounts.
.TP 1.5i retry=\fIn The number of minutes to retry an NFS mount operation in the foreground or background before giving up. The default value for foreground mounts is 2 minutes. The default value for background mounts is 10000 minutes, which is roughly one week.
.TP 1.5i sec=\fImode The RPCGSS security flavor to use for accessing files on this mount point. If the \fIsec=\fR option is not specified, or if \fIsec=sys\fR is specified, the RPC client uses the AUTH_SYS security flavor for all RPC operations on this mount point. Valid security flavors are .BR none , .BR sys , .BR krb5 , .BR krb5i , .BR krb5p , .BR lkey , .BR lkeyi , .BR lkeyp , .BR spkm , .BR spkmi , and .BR spkmp . See the SECURITY CONSIDERATIONS section for details.
.TP 1.5i sharecache Determines how the client's data cache is shared between mount points that mount the same remote share. If the option is not specified, or the \fIsharecache\fR option is specified, then all mounts of the same remote share on a client use the same data cache. If the \fInosharecache\fR option is specified, then files under that mount point are cached separately from files under other mount points that may be accessing the same remote share. As of kernel 2.6.18, this is legacy caching behavior, and is considered a data risk since two cached copies of the same file on the same client can become out of sync following an update of one of the copies.
Valid options for the nfs file system type
.TP 1.5i proto=\fInetid The transport protocol used by the RPC client to transmit requests to the NFS server for this mount point. The value of \fInetid\fR can be either .B udp or .B tcp. Each transport protocol uses different default .I retrans and .I timeo settings; see the description of these two mount options for details. .I NB: This mount option controls both how the .BR mount (8) command communicates with the portmapper and the MNT and NFS server, and what transport protocol the in\-kernel NFS client uses to transmit requests to the NFS server. Specifying .I proto=tcp forces all traffic from the mount command and the NFS client to use TCP. Specifying .I proto=udp forces all traffic types to use UDP. If the \fIproto=\fR mount option is not specified, the .BR mount (8) command chooses the best transport for each type of request (GETPORT, MNT, and NFS), and by default the in\-kernel NFS client uses the TCP protocol. If the server doesn't support one or the other protocol, the .BR mount (8) command attempts to discover which protocol is supported and use that.
.TP 1.5i port=\fIn The numeric value of the port used by the remote NFS service. If the \fIport=\fR option is not specified, or if the specified port value is 0, then the NFS client uses the NFS service port provided by the remote portmapper service. If any other value is specified, then the NFS client uses that value as the destination port when connecting to the remote NFS service. If the remote host's NFS service is not registered with its portmapper, or if the NFS service is not available on the specified port, the mount fails.
.TP 1.5i .I namlen=n When an NFS server does not support version two of the RPC mount protocol, this option can be used to specify the maximum length of a filename that is supported on the remote filesystem. This is used to support the POSIX pathconf functions. The default is 255 characters.
.TP 1.5i .I mountport=n The numeric value of the .B mountd port.
.TP 1.5i .I mounthost=name The name of the host running .B mountd .
.TP 1.5i .I mountprog=n Use an alternate RPC program number to contact the mount daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value is 100005 which is the standard RPC mount daemon program number.
.TP 1.5i .I mountvers=n Use an alternate RPC version number to contact the mount daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value depends on which kernel you are using.
.TP 1.5i .I nfsprog=n Use an alternate RPC program number to contact the NFS daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value is 100003 which is the standard RPC NFS daemon program number.
.TP 1.5i .I nfsvers=n Use an alternate RPC version number to contact the NFS daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value depends on which kernel you are using.
.TP 1.5i .I vers=n vers is an alternative to nfsvers and is compatible with many other operating systems.
.TP 1.5i .I nolock Disable NFS locking. Do not start lockd. This is appropriate for mounting the root filesystem or .B /usr or .BR /var . These filesystems are typically either read-only or not shared, and in those cases, remote locking is not needed. This also needs to be used with some old NFS servers that don't support locking. .br Note that applications can still get locks on files, but the locks only provide exclusion locally. Other clients mounting the same filesystem will not be able to detect the locks.
.TP 1.5i .I intr If an NFS file operation has a major timeout and it is hard mounted, then allow signals to interrupt the file operation and cause it to return EINTR to the calling program. The default is to not allow file operations to be interrupted.
.TP 1.5i .I posix Mount the NFS filesystem using POSIX semantics. This allows an NFS filesystem to properly support the POSIX pathconf command by querying the mount server for the maximum length of a filename. To do this, the remote host must support version two of the RPC mount protocol. Many NFS servers support only version one.
.TP 1.5i .I nocto Suppress the retrieval of new attributes when creating a file.
.TP 1.5i .I noac Disable all forms of attribute caching entirely. This exacts a significant performance penalty but it allows two different NFS clients to get reasonable results when both clients are actively writing to a common export on the server.
.TP 1.5i .I noacl Disables Access Control List (ACL) processing.
.TP 1.5i .I nordirplus Disables NFSv3 READDIRPLUS RPCs. Use this option when mounting servers that don't support or have broken READDIRPLUS implementations.
Valid options for the nfs4 file system type
.TP 1.5i proto=\fInetid The transport protocol used by the RPC client to transmit requests to the NFS server. The value of \fInetid\fR can be either .B udp or .B tcp. All NFS version 4 servers are required to support TCP, so the default transport protocol for NFS version 4 is TCP.
.TP 1.5i port=\fIn The numeric value of the port used by the remote NFS service. If the \fIport=\fR option is not specified, the NFS client uses the standard NFS port number of 2049 without checking the remote portmapper service. If the specified port value is 0, then the NFS client uses the NFS service port provided by the remote portmapper service. If any other value is specified, then the NFS client uses that value as the destination port when connecting to the remote NFS service. If the remote host's NFS service is not registered with its portmapper, or if the NFS service is not available on the specified port, the mount fails.
.TP 1.5i .I clientaddr=n On a multi-homed client, this causes the client to use a specific callback address when communicating with an NFS version 4 server. This option is currently ignored.
.TP 1.5i .I intr If an NFS file operation has a major timeout and it is hard mounted, then allow signals to interrupt the file operation and cause it to return EINTR to the calling program. The default is to not allow file operations to be interrupted.
.TP 1.5i .I nocto Suppress the retrieval of new attributes when creating a file.
.TP 1.5i .I noac Disable attribute caching, and force synchronous writes. This exacts a server performance penalty but it allows two different NFS clients to get reasonably good results when both clients are actively writing to a common filesystem on the server.
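The \fIport=\fR option behaves differently for the nfs and nfs4 file system types, as described in the two option lists above. The sketch below captures only that decision; the name `destination_port` is hypothetical, and returning None to mean "ask the remote portmapper" is a convention chosen for this illustration.

```python
NFS_PORT = 2049  # standard NFS port number


def destination_port(fstype, port=None):
    """Return the fixed destination port, or None to query the portmapper.

    fstype is "nfs" or "nfs4"; port is the value of the port= option, or
    None when the option was not specified.
    """
    if fstype not in ("nfs", "nfs4"):
        raise ValueError("unsupported file system type: " + fstype)
    if port is None:
        # nfs4 skips the portmapper by default; nfs consults it.
        return NFS_PORT if fstype == "nfs4" else None
    if port == 0:
        return None
    return port
```

In both cases an explicit non-zero value is used as given, so the two file system types differ only when the option is omitted.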
Security Considerations
NFS provides access control for data, but depends on its RPC implementation to provide authentication of NFS requests. Traditional NFS access control mimics the standard mode bit access control provided in local file systems. Traditional RPC authentication uses a number to represent each user (usually the user's own uid), a number to represent the user's group (the user's gid), and a set of up to 16 auxiliary group numbers to represent other groups of which the user may be a member. File data and user ID values appear in the clear on the network.
Moreover, NFS versions 2 and 3 use separate protocols for mounting, for locking and unlocking files, and for reporting system status of clients and servers. These auxiliary protocols use no authentication.
In addition to combining all the auxiliary protocols into a single protocol, NFS version 4 introduces more advanced forms of access control, authentication, and in-transit data protection. Linux also implements the proprietary NFSv3 access control list implementation built into Solaris, but never standardized, and allows the use of advanced authentication modes for NFS version 2 and version 3 mounts.
The NFS version 4 specification mandates NFSv4 ACLs, RPCGSS authentication, and RPCGSS security flavors that provide per-RPC integrity checking and encryption, and it applies to all NFS version 4 operations including mounting, file locking, and so on. Note that Linux does not yet implement security mode negotiation between NFS version 4 clients and servers.
A mount option enables the RPCGSS security mode that is in effect on a given NFS mount point. Using the sec=krb5 mount option provides a cryptographic proof of a user's identity in each RPC request that passes between client and server. This makes a very strong guarantee about who is accessing what data on the server.
Two other flavors of Kerberos security are supported as well. krb5i provides a cryptographically strong guarantee that the data in each RPC request has not been tampered with. krb5p encrypts every RPC request so the data is not exposed at all during transit on networks between NFS client and server. There can be some performance impact when using integrity checking or encryption, however.
Support for other forms of cryptographic security is also available, including lipkey and SPKM3.
Citations
fstab(5), mount(8), umount(8), mount.nfs(5), umount.nfs(5), exports(5), nfsd(8), rpc.idmapd(8), rpc.gssd(8), rpc.svcgssd(8), kerberos(1)
- RFC 768 for the UDP specification.
- RFC 793 for the TCP specification.
- RFC 1094 for the NFS version 2 specification.
- RFC 1813 for the NFS version 3 specification.
- RFC 1832 for the XDR specification.
- RFC 1833 for the RPC bind specification.
- RFC 2203 for the RPCSEC_GSS protocol specification.
- RFC 3530 for the NFS version 4 specification.