NewMountDesignSpec

From Linux NFS

Revision as of 19:14, 24 August 2007 by Chucklever (Talk | contribs)

Introduction

This wiki page is a working design specification for the new text-based NFS mount API. Here we discuss use cases, requirement statements, error reporting, and design specifications, in addition to minute behavioral details of mounting NFS shares. The purpose of this discussion is to understand how to implement the new interface, and to construct the basis of unit tests for both the legacy user-space mount command and the new in-kernel mount client.

Requirements

There are several broad requirements for the new text-based NFS mount API.

  1. Scalability - Allow for thousands of NFS mount points, and a large number of simultaneous mount operations
  2. No user-space dependency on a versioned binary blob for passing NFS mount options to the kernel
  3. Support version fallback - If NFS version 4 is not supported, fall back to version 3; if version 3 is not supported, fall back to version 2
  4. Support transport protocol fallback - If TCP is not supported, fall back to UDP
  5. Provide reasonable default behavior in the presence of network firewalls and misconfigured servers
  6. Facilitate new features - IPv6, RDMA, FS cache should be easy to introduce
  7. Better error reporting - Report and log useful, relevant, clear error messages when a failure has occurred
  8. Update and clarify NFS mount documentation
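
As a rough illustration of requirements 3 and 4, the sketch below enumerates the (version, transport) combinations a mount helper might try. The function name and the interleaving are assumptions: the requirements give the order within each dimension, but not how version fallback and transport fallback combine. Here every transport is tried before dropping to the next lower NFS version.

```python
from itertools import product

# Preference orders taken from requirements 3 and 4.
VERSION_ORDER = (4, 3, 2)
PROTO_ORDER = ("tcp", "udp")


def fallback_candidates(versions=VERSION_ORDER, protos=PROTO_ORDER):
    """Yield (version, proto) pairs in the order they might be attempted.

    Assumption (not stated in the requirements): transport fallback is
    exhausted before falling back to the next lower NFS version.
    """
    for version, proto in product(versions, protos):
        yield version, proto
```

Iterating over `fallback_candidates()` starts with NFS version 4 over TCP and ends with NFS version 2 over UDP.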

Use Cases

To mount a remote share using NFS version 2, use the nfs file system type and specify the nfsvers=2 mount option. To mount using NFS version 3, use the nfs file system type and specify the nfsvers=3 mount option. To mount using NFS version 4, use the nfs4 file system type (the nfsvers mount option is not supported for the nfs4 file system type).

Here is an example from an /etc/fstab file for an NFS version 3 mount over TCP.

 server:/export/share    /mnt            nfs             nfsvers=3,proto=tcp

Here is an example for an NFS version 4 mount over TCP using Kerberos 5 mutual authentication.

 server:/export/share    /mnt            nfs4            sec=krb5

Design Specifications

Obviously the discussion of NFSv2/v3 mounting will be significantly more complicated than NFSv4 mounting.

Mounting NFS version 2 and version 3 shares

Mounting NFS version 4 shares

Return Codes and Error Reporting

Currently mount's error messages are very problematic.

  1. Some error messages are incorrect.
  2. Some error messages are repeated.
  3. Some errors are never reported.
  4. Some error messages are too specific to be useful to an average administrator. For example, reporting an "RPC program/version mismatch occurred" is not helpful if the real problem is that "proto=udp" is not supported.
  5. Some error messages are too general to be useful. For example, reporting "mount.nfs: not a directory" is obviously an errno string, but more specific information would provide a course of corrective action.

Perhaps a clear error message can be reported to the command line, and a lot of detail should be reported in the system log? That's easy enough with in-kernel mount option parsing!

mount(2) API return codes

The mount.nfs program needs to distinguish between temporary problems and permanent errors in order to determine whether it's worth retrying a mount request in the background.

For text-based NFS mounts, the version/protocol fallback mechanism should occur in user space -- certainly fallback policy is easier to set and implement in user space, but the kernel must provide specific information about how a mount request failed so that user space can make an appropriate choice about the next step to try.

The current mount(2) API is described in a man page. The man page describes a set of generic error return codes, which we excerpt here. It also suggests that we can add specific error codes for NFS mounts.

RETURN VALUE
       On  success,  zero is returned.  On error, -1 is returned, and errno is
       set appropriately.

ERRORS
       The error values given below result from  filesystem  type  independent
       errors.  Each  filesystem  type may have its own special errors and its
       own special behavior.  See the kernel source code for details.

       EACCES A component of a path was not searchable. (See also path_resolu-
              tion(2).)   Or,  mounting  a  read-only filesystem was attempted
              without giving the MS_RDONLY flag.  Or, the block device  source
              is located on a filesystem mounted with the MS_NODEV option.

       EAGAIN A call to umount2() specifying MNT_EXPIRE successfully marked an
              unbusy file system as expired.

       EBUSY  source is already mounted. Or, it cannot be remounted read-only,
              because it still holds files open for writing.  Or, it cannot be
              mounted on target because target is still busy (it is the  work-
              ing  directory  of some task, the mount point of another device,
              has open files, etc.).  Or, it could not be unmounted because it
              is busy.

       EFAULT One  of  the  pointer  arguments points outside the user address
              space.

       EINVAL source had an invalid superblock.  Or,  a  remount  (MS_REMOUNT)
              was  attempted,  but  source  was not already mounted on target.
              Or, a move (MS_MOVE) was attempted, but source was not  a  mount
              point, or was ’/’.  Or, an unmount was attempted, but target was
              not a mount point.  Or, umount2() was called with MNT_EXPIRE and
              either MNT_DETACH or MNT_FORCE.

       ELOOP  Too  many  link  encountered  during pathname resolution.  Or, a
              move was attempted, while target is a descendant of source.

       EMFILE (In case no block device is required:) Table of dummy devices is
              full.

       ENAMETOOLONG
              A pathname was longer than MAXPATHLEN.

       ENODEV filesystemtype not configured in the kernel.

       ENOENT A pathname was empty or had a nonexistent component.

       ENOMEM The  kernel  could not allocate a free page to copy filenames or
              data into.

       ENOTBLK
              source is not a block device (and a device was required).

       ENOTDIR
              The second argument, or a prefix of the first argument, is not a
              directory.

       ENXIO  The major number of the block device source is out of range.

       EPERM  The caller does not have the required privileges.

Here are some additional return codes I recommend for NFS mounts, just as a start. These should allow a calling program to report a reasonably specific error message, and decide whether and how to retry the request.

       EBADF  The mount option  string was not able to be parsed,  or an unre-
              cognized option was specified, or a keyword option was specified
              with a value that is out of range.

This is a permanent mount error. The calling program should not retry this request with the same options.

       ESTALE The server denied access to the requested share.

       ETIMEDOUT
              The kernel's mount attempt timed out after n seconds  (I think n
              is 15).

These are temporary errors. The calling program may choose to retry this request using the same options, or fail immediately.

       EPROTONOSUPPORT
              The server reports that the program, version,  or transport pro-
              tocol is not currently available.

       ECONNREFUSED
              The kernel's mount connection  attempt was refused by the server
              at the network transport layer.

These are temporary errors. The calling program can attempt to recover by adjusting the options and retrying the request.
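
The three groups of proposed return codes could be consumed by a calling program roughly as follows. This is only a sketch: the `Action` names and the `next_action` helper are hypothetical, and treating every unlisted errno value as permanent is an assumption rather than something the proposal states.

```python
import errno
from enum import Enum


class Action(Enum):
    FAIL = "fail"                      # permanent error: do not retry
    RETRY_SAME = "retry-same"          # temporary: may retry with the same options
    RETRY_ADJUSTED = "retry-adjusted"  # temporary: adjust options, then retry


# Proposed NFS-specific errno values mapped to the next step described above.
PROPOSED_ACTIONS = {
    errno.EBADF: Action.FAIL,
    errno.ESTALE: Action.RETRY_SAME,
    errno.ETIMEDOUT: Action.RETRY_SAME,
    errno.EPROTONOSUPPORT: Action.RETRY_ADJUSTED,
    errno.ECONNREFUSED: Action.RETRY_ADJUSTED,
}


def next_action(err):
    """Return the suggested next step for an errno value from mount(2).

    Assumption: any errno outside the proposed list is treated as permanent.
    """
    return PROPOSED_ACTIONS.get(err, Action.FAIL)
```

A background mount would keep retrying only while `next_action()` returns something other than `Action.FAIL`.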

Discussion of Individual NFS Mount Options

There are four classes of mount options for nfs and nfs4 file systems. Fix this: All four classes of options are specified as normal NFS mount options because there is only one way to specify mount options in the /etc/fstab file.

First, there are generic mount options available to all Linux file systems, such as "ro" or "sync". See mount(8) for a description of generic mount options available for all file systems.

Second, some mount options can determine how the mount command behaves, such as "mountport" or "retry". These options have no effect after the mount operation has completed, but might be used to mount an NFS share through a network firewall.

Third, some mount options determine how the NFS client behaves during normal operation, such as "rsize" and "wsize". These may be used to tune performance, or change the client's caching or file locking behavior.

Fourth, mount options such as "timeout" or "retrans" can control aspects of Remote Procedure Call behavior. NFS clients send requests to NFS servers via Remote Procedure Calls, or RPCs. RPCs handle per-request authentication, adjust request parameters for different byte endianness on client and server, and retransmit requests that may have been lost by the network or server.

Note that some options take the form keyword=value, while others are boolean, taking either the form keyword or nokeyword. All options which do not use the keyword=value form use the boolean form, except for hard/soft, udp/tcp, and fg/bg.
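
The option forms described here can be illustrated with a small parser. This is a sketch, not the in-kernel parser: `parse_options` and `PAIRED_OPTIONS` are hypothetical names, and the "no" prefix rule is deliberately naive (it assumes no ordinary keyword happens to start with "no").

```python
# Option pairs that are exceptions to the keyword/nokeyword boolean form.
PAIRED_OPTIONS = {"hard", "soft", "udp", "tcp", "fg", "bg"}


def parse_options(text):
    """Split a comma-separated mount option string into a dict.

    keyword=value      -> {"keyword": "value"}  (value kept as a string)
    keyword            -> {"keyword": True}
    nokeyword          -> {"keyword": False}
    hard/soft, udp/tcp, fg/bg are stored under their own names as True.
    """
    options = {}
    for item in filter(None, (part.strip() for part in text.split(","))):
        if "=" in item:
            key, value = item.split("=", 1)
            options[key] = value
        elif item in PAIRED_OPTIONS:
            options[item] = True
        elif item.startswith("no") and len(item) > 2:
            options[item[2:]] = False
        else:
            options[item] = True
    return options
```

For example, `parse_options("nfsvers=3,proto=tcp,hard,nolock")` stores `nfsvers` and `proto` as strings, `hard` as `True`, and `lock` as `False`.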

Valid options for either the nfs or nfs4 file system type

To Do

  • Format this section
  • Add status information about each option
    • Tested (legacy / text-based)
    • Works, does not work as documented (legacy / text-based)
    • Implementation/fix priority
    • Details about how it works and/or how it should work

soft | hard

Description
Determines the recovery behavior of the RPC client after an RPC request times out. If neither option is specified, or if the hard option is specified, the RPC is retried indefinitely. If the soft option is specified, then the RPC client fails the RPC request after a major timeout occurs, and causes the NFS client to return an error to the calling application.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

timeo=n

Description
The value, in tenths of a second, before timing out an RPC request. The default value is 600 (60 seconds) for NFS over TCP. On a UDP transport, the Linux RPC client uses an adaptive algorithm to estimate the timeout value for frequently used request types such as READ and WRITE, and uses the timeo= setting for infrequently used requests such as FSINFO. The timeo= value defaults to 7 (0.7 seconds) for NFS over UDP. After each timeout, the RPC client may retransmit the timed-out request, or it may take some other action depending on the settings of the hard or retrans= options.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.
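
Because timeo= is expressed in tenths of a second, a small conversion helper can make the defaults easier to read. The sketch below uses only the two defaults given in the description; the names `DEFAULT_TIMEO` and `timeo_seconds` are hypothetical, and the adaptive UDP estimator is not modeled.

```python
# Default timeo= values in tenths of a second, per transport.
DEFAULT_TIMEO = {"tcp": 600, "udp": 7}


def timeo_seconds(proto, timeo=None):
    """Return the RPC timeout in seconds for a transport.

    `timeo` is the value of the timeo= option in tenths of a second;
    when it is None, the per-transport default is used.
    """
    tenths = DEFAULT_TIMEO[proto] if timeo is None else timeo
    return tenths / 10
```

With no timeo= option, `timeo_seconds("tcp")` gives the 60 second default mentioned in the description.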

retrans=n

Description
The number of RPC timeouts that must occur before a major timeout occurs. The default is 3 timeouts. If the file system is mounted with the hard option, the RPC client will generate a "server not responding" message after a major timeout, then continue to retransmit the request. If the file system is mounted with the soft option, the RPC client will abandon the request after a major timeout, and cause NFS to return an error to the application.

Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

rsize=n

Description
The maximum number of bytes in each network READ request that the NFS client can use when reading data from a file on an NFS server; the actual data payload size of each NFS READ request is equal to or smaller than the rsize value. The rsize value is a positive integral multiple of 1024, and the largest value supported by the Linux NFS client is 1,048,576 bytes. Specified values that are not a multiple of 1024 are rounded down to the closest multiple of 1024, and specified values smaller than 1024 are replaced with a default of 4096. If an rsize value is not specified, or if a value is specified but is larger than the maximum that either the client or the server supports, the client and server negotiate the largest rsize value that both will support. The rsize option as specified on the mount(8) command line appears in the /etc/mtab file, but the effective rsize value negotiated by the client and server is reported in the /proc/mounts file.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.
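
The client-side adjustment of a requested rsize can be sketched as below. The function name is hypothetical, and capping oversized requests at the client maximum is an assumption: the description only says that such values lead to negotiation with the server, which this sketch does not model.

```python
UNIT = 1024              # rsize is an integral multiple of this
DEFAULT_RSIZE = 4096     # substituted when the request is smaller than 1024
MAX_RSIZE = 1_048_576    # largest value supported by the Linux NFS client


def normalize_rsize(requested):
    """Return the rsize the client would use before server negotiation."""
    if requested < UNIT:
        return DEFAULT_RSIZE
    if requested > MAX_RSIZE:
        # Assumption: cap at the client maximum; the server may lower it further.
        return MAX_RSIZE
    # Round down to the closest multiple of 1024.
    return requested - (requested % UNIT)
```

The same adjustment would apply to wsize, assuming it follows the rsize rules as its description suggests.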

wsize=n

Description
The maximum number of bytes per network WRITE request that the NFS client can use when writing data to a file on an NFS server. See the description of the rsize option for more details.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

acregmin=n

Description
The minimum time in seconds that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. The default is 3 seconds.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

acregmax=n

Description
The maximum time in seconds that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. The default is 60 seconds.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

acdirmin=n

The minimum time in seconds that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. The default is 30 seconds.

acdirmax=n

The maximum time in seconds that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. The default is 60 seconds.

actimeo=n

Using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the same value. There is no default value.
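
The relationship between actimeo and the four individual attribute cache options might be expressed as follows. This is a sketch with hypothetical names; letting actimeo override any individually specified value is an assumption, since precedence between the options is not specified here.

```python
# Defaults, in seconds, from the descriptions of the four options.
ATTR_CACHE_DEFAULTS = {
    "acregmin": 3,
    "acregmax": 60,
    "acdirmin": 30,
    "acdirmax": 60,
}


def attribute_cache_timeouts(options):
    """Return effective attribute cache timeouts for a dict of mount options."""
    result = dict(ATTR_CACHE_DEFAULTS)
    for key in result:
        if key in options:
            result[key] = int(options[key])
    if "actimeo" in options:
        # Assumption: actimeo wins over individually specified values.
        result = dict.fromkeys(result, int(options["actimeo"]))
    return result
```

For instance, passing `{"actimeo": "10"}` sets all four timeouts to 10 seconds.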

bg | fg

Description
This mount option determines how the mount(8) command behaves if an attempt to mount a remote share fails. The fg option causes mount(8) to exit with an error status if any part of the mount request times out or fails outright. This is called a "foreground" mount, and is the default behavior if neither fg nor bg is specified. If the bg option is specified, a timeout or failure causes the mount(8) command to fork a child which continues to attempt to mount the remote share. The parent immediately returns with a zero exit code. This is known as a "background" mount. If the local mount point directory is missing, the mount(8) command treats that as if the mount request timed out. This permits nested NFS mounts.

Testing status
Tested with legacy mount.nfs; works for v2/v3, not for v4
Tested with text-based mount.nfs; does not work for any version
Implementation
The mount.nfs command must distinguish between permanent mount errors (such as a bad mount option), which prevent the mount request as specified from ever being valid, and temporary errors (such as an unreachable server), which might allow the mount request as specified to complete at some future point. See the discussion of mount(2) return codes for more detail.

retry=n

Description
The number of minutes to retry an NFS mount operation in the foreground or background before giving up. The default value for foreground mounts is 2 minutes. The default value for background mounts is 10000 minutes, which is roughly one week.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
The ten thousand minute default might be too long. Perhaps foreground mounts should also use a much shorter default.
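
The default selection for retry= can be written down in a few lines. The helper name is hypothetical, and it assumes options arrive as a dict in which bg appears as a true value and retry as a number or numeric string.

```python
FOREGROUND_RETRY_MINUTES = 2
BACKGROUND_RETRY_MINUTES = 10000  # about 6.9 days


def retry_minutes(options):
    """Return how many minutes a mount should be retried before giving up."""
    if "retry" in options:
        return int(options["retry"])
    if options.get("bg"):
        return BACKGROUND_RETRY_MINUTES
    return FOREGROUND_RETRY_MINUTES
```

An explicit retry= value takes effect for both foreground and background mounts; only the defaults differ.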

sec=mode

Description
The RPCGSS security flavor to use for accessing files on this mount point. If the sec= option is not specified, or if sec=sys is specified, the RPC client uses the AUTH_SYS security flavor for all RPC operations on this mount point. Valid security flavors are none, sys, krb5, krb5i, krb5p, lkey, lkeyi, lkeyp, spkm, spkmi, and spkmp. See the SECURITY CONSIDERATIONS section for details.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

sharecache

Description
Determines how the client's data cache is shared between mount points that mount the same remote share. If the option is not specified, or the sharecache option is specified, then all mounts of the same remote share on a client use the same data cache. If the nosharecache option is specified, then files under that mount point are cached separately from files under other mount points that may be accessing the same remote share. As of kernel 2.6.18, the nosharecache behavior is legacy caching behavior, and is considered a data risk since two cached copies of the same file on the same client can become out of sync following an update of one of the copies.
Testing status
Not tested with legacy mount.nfs
Not tested with text-based mount.nfs
Implementation
No notes.

Valid options for the nfs file system type

proto=netid

The transport protocol used by the RPC client to transmit requests to the NFS server for this mount point. The value of netid can be either udp or tcp. Each transport protocol uses different default retrans and timeo settings; see the description of these two mount options for details.

NB: This mount option controls both how the mount(8) command communicates with the portmapper and the MNT and NFS server, and what transport protocol the in-kernel NFS client uses to transmit requests to the NFS server. Specifying proto=tcp forces all traffic from the mount command and the NFS client to use TCP. Specifying proto=udp forces all traffic types to use UDP. If the proto= mount option is not specified, the mount(8) command chooses the best transport for each type of request (GETPORT, MNT, and NFS), and by default the in-kernel NFS client uses the TCP protocol. If the server doesn't support one or the other protocol, the mount(8) command attempts to discover which protocol is supported and use that.

port=n

The numeric value of the port used by the remote NFS service. If the port= option is not specified, or if the specified port value is 0, then the NFS client uses the NFS service port provided by the remote portmapper service. If any other value is specified, then the NFS client uses that value as the destination port when connecting to the remote NFS service. If the remote host's NFS service is not registered with its portmapper, or if the NFS service is not available on the specified port, the mount fails.

namlen=n

When an NFS server does not support version two of the RPC mount protocol, this option can be used to specify the maximum length of a filename that is supported on the remote filesystem. This is used to support the POSIX pathconf functions. The default is 255 characters.

mountport=n

The numeric value of the mountd port.

mounthost=name

The name of the host running mountd.

mountprog=n

Use an alternate RPC program number to contact the mount daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value is 100005, which is the standard RPC mount daemon program number.

mountvers=n

Use an alternate RPC version number to contact the mount daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value depends on which kernel you are using.

nfsprog=n

Use an alternate RPC program number to contact the NFS daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value is 100003, which is the standard RPC NFS daemon program number.

nfsvers=n

Use an alternate RPC version number to contact the NFS daemon on the remote host. This option is useful for hosts that can run multiple NFS servers. The default value depends on which kernel you are using.

vers=n

vers is an alternative to nfsvers and is compatible with many other operating systems.

nolock

Disable NFS locking. Do not start lockd. This is appropriate for mounting the root filesystem or /usr or /var. These filesystems are typically either read-only or not shared, and in those cases, remote locking is not needed. This also needs to be used with some old NFS servers that don't support locking.
Note that applications can still get locks on files, but the locks only provide exclusion locally. Other clients mounting the same filesystem will not be able to detect the locks.

intr

If an NFS file operation has a major timeout and it is hard mounted, then allow signals to interrupt the file operation and cause it to return EINTR to the calling program. The default is to not allow file operations to be interrupted.

posix

Mount the NFS filesystem using POSIX semantics. This allows an NFS filesystem to properly support the POSIX pathconf command by querying the mount server for the maximum length of a filename. To do this, the remote host must support version two of the RPC mount protocol. Many NFS servers support only version one.

nocto

Suppress the retrieval of new attributes when creating a file.

noac

Disable all forms of attribute caching entirely. This exacts a significant performance penalty but it allows two different NFS clients to get reasonable results when both clients are actively writing to a common export on the server.

noacl

Disables Access Control List (ACL) processing.

nordirplus

Disables NFSv3 READDIRPLUS RPCs. Use this option when mounting servers that don't support or have broken READDIRPLUS implementations.

Valid options for the nfs4 file system type

proto=netid

The transport protocol used by the RPC client to transmit requests to the NFS server. The value of netid can be either udp or tcp. All NFS version 4 servers are required to support TCP, so the default transport protocol for NFS version 4 is TCP.

port=n

The numeric value of the port used by the remote NFS service. If the port= option is not specified, the NFS client uses the standard NFS port number of 2049 without checking the remote portmapper service. If the specified port value is 0, then the NFS client uses the NFS service port provided by the remote portmapper service. If any other value is specified, then the NFS client uses that value as the destination port when connecting to the remote NFS service. If the remote host's NFS service is not registered with its portmapper, or if the NFS service is not available on the specified port, the mount fails.
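
The port= option behaves differently for the nfs and nfs4 file system types when it is omitted. The sketch below restates both descriptions side by side; the function name and the use of `None` to mean "ask the remote portmapper" are illustrative assumptions.

```python
NFS_STANDARD_PORT = 2049
USE_PORTMAPPER = None  # sentinel: query the remote portmapper for the port


def destination_port(fstype, port=None):
    """Return the NFS destination port, or USE_PORTMAPPER.

    `port` is the value of the port= option, or None when the option
    was not specified.
    """
    if fstype not in ("nfs", "nfs4"):
        raise ValueError(f"unsupported file system type: {fstype}")
    if port is None:
        # nfs4 skips the portmapper entirely when port= is omitted.
        return NFS_STANDARD_PORT if fstype == "nfs4" else USE_PORTMAPPER
    if port == 0:
        return USE_PORTMAPPER
    return port
```

Only the unspecified case differs between the two types; port=0 and explicit port numbers are handled the same way for both.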

clientaddr=n

On a multi-homed client, this causes the client to use a specific callback address when communicating with an NFS version 4 server. This option is currently ignored.

intr

If an NFS file operation has a major timeout and it is hard mounted, then allow signals to interrupt the file operation and cause it to return EINTR to the calling program. The default is to not allow file operations to be interrupted.

nocto

Suppress the retrieval of new attributes when creating a file.

noac

Disable attribute caching, and force synchronous writes. This exacts a server performance penalty but it allows two different NFS clients to get reasonably good results when both clients are actively writing to a common filesystem on the server.

Security Considerations

NFS provides access control for data, but depends on its RPC implementation to provide authentication of NFS requests. Traditional NFS access control mimics the standard mode bit access control provided in local file systems. Traditional RPC authentication uses a number to represent each user (usually the user's own uid), a number to represent the user's group (the user's gid), and a set of up to 16 auxiliary group numbers to represent other groups of which the user may be a member. File data and user ID values appear in the clear on the network.

Moreover, NFS versions 2 and 3 use separate protocols for mounting, for locking and unlocking files, and for reporting system status of clients and servers. These auxiliary protocols use no authentication.

In addition to combining all the auxiliary protocols into a single protocol, NFS version 4 introduces more advanced forms of access control, authentication, and in-transit data protection. Linux also implements the proprietary NFSv3 access control list implementation built into Solaris, but never standardized, and allows the use of advanced authentication modes for NFS version 2 and version 3 mounts.

The NFS version 4 specification mandates NFSv4 ACLs, RPCGSS authentication, and RPCGSS security flavors that provide per-RPC integrity checking and encryption, and it applies to all NFS version 4 operations including mounting, file locking, and so on. Note that Linux does not yet implement security mode negotiation between NFS version 4 clients and servers.

A mount option enables the RPCGSS security mode that is in effect on a given NFS mount point. Using the sec=krb5 mount option provides a cryptographic proof of a user's identity in each RPC request that passes between client and server. This makes a very strong guarantee about who is accessing what data on the server.

Two other flavors of Kerberos security are supported as well. krb5i provides a cryptographically strong guarantee that the data in each RPC request has not been tampered with. And krb5p encrypts every RPC request so the data is not exposed at all during transit on networks between NFS client and server. There can be some performance impact when using integrity checking or encryption, however.

Support for other forms of cryptographic security is also available, including lipkey and SPKM3.

Citations

fstab(5), mount(8), umount(8), mount.nfs(5), umount.nfs(5), exports(5), nfsd(8), rpc.idmapd(8), rpc.gssd(8), rpc.svcgssd(8), kerberos(1)

  • RFC 768 for the UDP specification.
  • RFC 793 for the TCP specification.
  • RFC 1094 for the NFS version 2 specification.
  • RFC 1813 for the NFS version 3 specification.
  • RFC 1832 for the XDR specification.
  • RFC 1833 for the RPC bind specification.
  • RFC 2203 for the RPCSEC GSS API protocol specification.
  • RFC 3530 for the NFS version 4 specification.