CITI ASC status
From Linux NFS
Revision as of 18:21, 23 October 2006
University of Michigan/CITI NFSv4 ASC alliance
Status of October 2006
Task 1. pNFS Demonstration
Demonstration of pNFS with multiple back end methods (PVFS2 and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely
Development
We updated the Linux pNFS client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19.
We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.
The Linux client separates common NFS code from NFSv2/3/4 code by using version-specific operations. We rewrote the Linux pNFS client to use its own set of version-specific operations. This provides a controlled interface to the pNFS code, and eases updating the code to new kernel versions.
Four client layout modules are in development.
- File layout driver (CITI, Network Appliance, and IBM Almaden).
- PVFS2 layout driver (CITI).
- Object layout driver (Panasas).
- Block layout driver (CITI).
To accommodate the requirements of the multiple layout drivers, we expanded the policy interface between the layout driver and generic pNFS client. This interface allows layout drivers to set the following policies:
- stripe size
- writeback cache gathering policies
- blocksize
- read and write threshold
- timing of layoutget invocation
- determine if I/O uses pagecache or direct method
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.
We improved the interface to the underlying file system on the Linux pNFS server. The new interface is being used by the Panasas object layout server, the IBM GPFS server, and the PVFS2 server.
We are coding the pNFS layout management service and file system interfaces on the Linux pNFS server to do a better job of bookkeeping so that we can extend the layout recall implementation, which is limited to a single layout.
We have continued to develop the PVFS2 layout driver and PVFS2 support in the pNFS server. The layout driver I/O interface supports direct access, page cache access with NFSv4 readahead and writeback, and the O_DIRECT access method. In addition, PVFS2 now supports the pNFS file-based layout, which lets pNFS clients choose how they access the file system.
We demonstrated how pNFS can improve the overall write performance of parallel file systems by using direct, parallel I/O for large write requests and the NFSv4 storage protocol for small write requests. To switch between them, we added a write threshold to the layout driver. Write requests smaller than the threshold follow the slower NFSv4 data path. Write requests larger than the threshold follow the faster layout driver data path. D. Hildebrand, L. Ward, and P. Honeyman, "Large Files, Small Writes, and pNFS," in Proc. of the 20th ACM International Conf. on Supercomputing, Cairns, Australia, 2006.
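The threshold rule described above can be sketched in a few lines. This is only an illustration, not the layout driver code: the function name, the path labels, and the 64 KiB value in the usage note are hypothetical, and the treatment of a request exactly equal to the threshold is an assumption, since the text does not specify it.

```python
def choose_write_path(request_bytes, write_threshold):
    """Pick the data path for one write request.

    Requests larger than the threshold use the layout driver (direct,
    parallel I/O); smaller requests use the NFSv4 data path.
    Assumption: a request equal to the threshold takes the NFSv4 path.
    """
    if request_bytes > write_threshold:
        return "layout_driver"
    return "nfsv4"
```

For example, with a hypothetical threshold of 64 KiB, `choose_write_path(4096, 65536)` selects the NFSv4 path and `choose_write_path(1 << 20, 65536)` selects the layout driver path.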
We improved the performance and scalability of pNFS file-based access with parallel file systems. Our design, named Direct-pNFS, augmented the file-based architecture to enable file-based pNFS clients to bypass intermediate data servers and access heterogeneous data stores directly. Direct access is possible by ensuring file-based layouts match the data layout in the underlying file system and giving pNFS clients the tools to effectively interpret and utilize this information. Experiments with Direct-pNFS demonstrate I/O throughput that equals or outperforms the exported parallel file system across a range of workloads. D. Hildebrand and P. Honeyman, "Direct-pNFS: Simple, Transparent, and Versatile Access to Parallel File Systems," CITI Technical Report 06-8, October 2006.
We developed prototype implementations of pNFS operations:
- OP_GETDEVICELIST,
- OP_GETDEVICEINFO,
- OP_LAYOUTGET,
- OP_LAYOUTCOMMIT,
- OP_LAYOUTRETURN and
- OP_CB_LAYOUTRECALL
We continue to test the ability of our prototype to send direct I/O data to data servers.
Milestones
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS capable underlying file systems.
We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.
The following pNFS implementations were tested.
File Layout
- Clients: Linux, Solaris
- Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris, PVFS2
Object layout
- Client: Linux
- Servers: Linux, Panasas
Block layout
- Client: Linux
- Server: EMC
PVFS2 layout
- Client: Linux
- Server: Linux
Activities
Our current Linux pNFS implementation uses a single whole file layout. We are extending the layout cache on the client and layout management on the server to support multiple layouts and small byte ranges.
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.
We continue to measure I/O performance.
We joined the Ultralight project and are testing pNFS I/O using pNFS clients on 10 GbE against pNFS clusters on 1 GbE. The Linux pNFS client is included in the Ultralight kernel and distributed to Ultralight sites, providing opportunities for future long-haul WAN testing.
Task 2. Client Migration
Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work.
Status
When a file system moves, the old server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the old server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim.
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide.
Server reboot recovery requires servers to save the clientid of active clients in stable storage. The present server implementation does this by writing directly to a filesystem via the vfs layer. A new server instance reads the state from stable storage, again directly via the vfs. We are rewriting this implementation to use a pipefs upcall/downcall interface instead of directly using the vfs layer, and are expanding the interface to support an upcall/downcall of all of a client's in-memory state. The userland daemon can then support server-to-server state transfer to the corresponding daemon on a new server. We have a prototype of the new upcall/downcall interface, and have yet to prototype the server-to-server state transfer.
It remains to inform clients that state established with the old server remains valid on the new server. The IETF NFSv4 working group is considering solutions for the NFSv4.1 protocol, but NFSv4.0 clients will not have support for this feature. We will therefore need to provide Linux specific implementation support - perhaps a mount option or a /proc flag, or simply to try to use an old clientid against a new server on migration.
Task 3. Lock Analysis
Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).
Background
The NFSv4 protocol supports three different lock-like operations: opens, byte-range locks, and delegations.
Opens
Unlike previous versions of NFS, NFSv4 has an on-the-wire OPEN operation.
The OPEN call includes the expected access mode, which may be read, write, or both. It also includes a "deny" mode, which may be read, write, both, or none. The server fails any open whose access mode overlaps the deny mode of an existing open, or whose deny mode overlaps the access mode of an existing open.
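The overlap rule can be written down compactly if access and deny modes are treated as bit masks. The following is a minimal sketch of that check, not the server's implementation; the constant values and function names are chosen for illustration.

```python
# Access and deny modes as bit masks (values chosen for illustration).
NONE = 0
READ = 1
WRITE = 2
BOTH = READ | WRITE


def opens_conflict(existing, requested):
    """Each open is an (access, deny) pair of bit masks.

    Two opens conflict if the requested access overlaps the existing
    deny mode, or the requested deny mode overlaps the existing access.
    """
    ex_access, ex_deny = existing
    rq_access, rq_deny = requested
    return bool(rq_access & ex_deny) or bool(rq_deny & ex_access)


def server_allows(existing_opens, requested):
    """True if the requested open conflicts with none of the existing opens."""
    return not any(opens_conflict(e, requested) for e in existing_opens)
```

For instance, a write open is refused when an existing open has `(READ, WRITE)`, that is, read access with writes denied, but is allowed alongside `(READ, NONE)`.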
Deny modes are not currently used by UNIX-like clients, our main focus, so we don't study this case.
However, all clients still perform an OPEN each time an application opens a file, for several reasons: to ensure correct behavior in the presence of Windows clients, to request delegations, and to establish the state necessary to get POSIX byte-range locks, among others.
All versions of NFS also tie data caching to open and close: data is flushed before close, and attributes revalidated before open, in such a way as to guarantee that the data seen after an open will always reflect writes any other client performed using file descriptors that were closed before the open.
Byte-range locks
POSIX byte-range locks are managed by applications using fcntl(). Each lock request has a byte-range and a type of read or write. Read locks conflict only with write locks, whereas write locks conflict with any other locks. Applications may perform read locks only on files which they have open for read, and write locks only on files which they have open for write.
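The conflict rule for byte-range locks can be sketched as a small predicate. This is an illustrative model, not kernel code: the `Lock` tuple and its field names are hypothetical, ranges are half-open, and the rule that two locks with the same owner never conflict reflects ordinary POSIX behavior (a new lock from the same process replaces the old one) rather than anything stated in the text above.

```python
from collections import namedtuple

# end is exclusive; kind is "read" or "write" (hypothetical model type)
Lock = namedtuple("Lock", "owner start end kind")


def ranges_overlap(a, b):
    """True if the half-open ranges [start, end) share at least one byte."""
    return a.start < b.end and b.start < a.end


def locks_conflict(a, b):
    """Read locks conflict only with write locks; write locks conflict
    with any overlapping lock held by a different owner."""
    if a.owner == b.owner:
        return False
    if not ranges_overlap(a, b):
        return False
    return a.kind == "write" or b.kind == "write"
```

Two overlapping read locks from different owners coexist, while a write lock overlapping either of them conflicts.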
Byte-range locks are normally advisory; that is, they do not conflict with IO operations. Mandatory locking is supported by many Unix-like operating systems, but appears to be rarely used.
The NLM sideband protocol enables byte-range locks for versions of NFS earlier than NFSv4. NFSv4 incorporates byte-range locking into the main protocol. This makes it possible to support mandatory byte-range locking, but mandatory byte-range locking over NFSv4 is not supported by the Linux implementation, and no support is planned at this time.
As with opens, byte-range locks also affect data caching: unlocks are not allowed to succeed until modified data in the locked range is written to the server, and locks must revalidate file data. Thus writes performed under a lock that has since been unlocked will be visible to any reader that locks the region after the unlock.
Delegations
A server may optionally return a "delegation" with the response to any open call. Delegations may be of type read or write. Servers must guarantee that no client ever holds a read delegation on a file that another client has open for write, or has a write delegation for. Similarly, no client may hold a write delegation on a file that another client has open for read.
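A server-side check for these guarantees might look like the sketch below. It is a simplified model, not the Linux server's logic; the function and its argument shapes are hypothetical. The read-delegation rule follows the text directly. For write delegations the sketch is stricter than the text: it assumes that any open or delegation held by another client blocks the grant.

```python
def may_grant_delegation(kind, client, opens, delegations):
    """Decide whether `client` may be given a delegation of type `kind`.

    opens:       list of (client, mode), mode in {"read", "write", "both"}
    delegations: list of (client, kind), kind in {"read", "write"}
    """
    others_open = [mode for c, mode in opens if c != client]
    others_deleg = [k for c, k in delegations if c != client]
    if kind == "read":
        # No other client may have the file open for write or hold a
        # write delegation on it.
        writers = any(m in ("write", "both") for m in others_open)
        return not writers and "write" not in others_deleg
    if kind == "write":
        # Assumption: any open or delegation held elsewhere blocks the grant.
        return not others_open and not others_deleg
    raise ValueError(kind)
```

Because a server is never required to give out a delegation, a `True` result here means only that granting one would not break the guarantees.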
A server is never required to give out a delegation. Also, it may ask for the delegation back at any time, at which point the client is required to do what is necessary to establish on the server any opens or locks which it has performed locally before returning the delegation. Once returned, the client cannot regain the delegation without performing another open.
An NFS client is not normally synchronously notified of changes performed by another client, but as long as a client holds a delegation, the above rules guarantee that it will be. In theory it might be possible for applications to take advantage of this increased cache consistency. However, this is not useful in practice since a server is never required to give out a delegation. Also, a server can ask for a delegation back at any time.
Thus clients do not expose the existence of delegations to applications the way they do opens and locks. Instead, clients use delegations to provide increased performance: delegations allow clients to perform open and lock calls locally. With a read delegation, read opens and read locks may be performed without contacting the server; with a write delegation, any opens and locks may be performed without contacting the server. This also relieves the client of the responsibility to flush dirty data and revalidate data caches.
When a server recalls a delegation, the client is required to perform opens, locks, and writes to the server as necessary to inform the server of any state that the client has established only locally. Conflicting opens will be delayed until this process is completed.
Lock performance test harness
In the performance measurements that follow, we used a single client and server. We also ran the experiments on the client hardware with the local file system for comparison.
Client
- IBM/Lenovo Thinkpad T43
- 2GHz Pentium M CPU
- 512 MB RAM
- 1000bT NIC
- 5400 RPM Ultra-ATA 80GB HD
- running 2.6.17-CITI
Server
- 1GHz Athlon 64 3000+ CPU
- 512 MB RAM
- 1000bT NIC
- 7200 RPM SATA-II 80GB HD
File lock measurements
Lacking examples of real-world lock-intensive workloads, we have performed a few microbenchmarks to measure such things as the cost of acquiring a single lock with and without a delegation.
To measure the performance of whole file locking, we use a benchmark that opens N files, then obtains a lock on each file. We measure the elapsed time of the loop that obtains the locks. We ran the microbenchmark on three configurations:
- Local file system (reiserFS)
- NFS with no delegations
- NFS with delegations
To test with no delegations, we disabled file leasing, which disables delegations as a side effect.
For most cases, we ran the test 10 times and averaged the results.
* variance was nil?
With delegations enabled and 1,000 files, we average the result of two runs, not 10, because the server limited the number of delegations to something less than 3,000.
* lame
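For readers who want to reproduce the shape of this measurement, here is a rough sketch of the whole-file lock microbenchmark. It is not the original test program: the function names and file naming are hypothetical, and the use of an exclusive lock is an assumption, since the text does not say whether read or write locks were taken.

```python
import fcntl
import os
import statistics
import tempfile
import time


def time_lock_loop(directory, n_files):
    """Open n_files files, then time only the loop that locks each one."""
    fds = []
    try:
        for i in range(n_files):
            path = os.path.join(directory, f"lockfile.{i}")
            fds.append(os.open(path, os.O_RDWR | os.O_CREAT, 0o600))
        start = time.perf_counter()
        for fd in fds:
            # Default length 0 from offset 0 covers the whole file.
            # Assumption: exclusive (write) locks.
            fcntl.lockf(fd, fcntl.LOCK_EX)
        elapsed = time.perf_counter() - start
    finally:
        for fd in fds:
            os.close(fd)  # closing the descriptor releases its locks
    return elapsed


def run_benchmark(directory, n_files, runs=10):
    """Repeat the timed loop and summarize it as in the results below."""
    samples = [time_lock_loop(directory, n_files) for _ in range(runs)]
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }
```

Pointing `directory` at a local file system or at an NFS mount (with or without delegations) would give the three configurations; `run_benchmark(tempfile.mkdtemp(), 100)` is a quick local smoke test.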
File lock measurement results
Local file system
--- 1 RUNS ---
0.000024 : mean lock time
0.000027 : median
0.000008 : std dev
--- 10 RUNS ---
0.000072 : mean lock time
0.000069 : median
0.000007 : std dev
--- 100 RUNS ---
0.000645 : mean lock time
0.000640 : median
0.000009 : std dev
--- 1000 RUNS ---
0.008288 : mean lock time
0.006606 : median
0.002138 : std dev
NFS no read delegation
--- 1 RUNS ---
0.000511 : mean lock time
0.000303 : median
0.000534 : std dev
--- 10 RUNS ---
0.002827 : mean lock time
0.002665 : median
0.000342 : std dev
--- 100 RUNS ---
0.026948 : mean lock time
0.026573 : median
0.001114 : std dev
--- 1000 RUNS ---
0.305191 : mean lock time
0.296449 : median
0.034397 : std dev
NFS w/ read delegation
--- OVER 1 RUNS ---
0.000028 : mean lock time
0.000030 : median
0.000006 : std dev
--- OVER 10 RUNS ---
0.000091 : mean lock time
0.000091 : median
0.000000 : std dev
--- OVER 100 RUNS ---
0.000757 : mean lock time
0.000721 : median
0.000106 : std dev
--- OVER 2 RUNS ---
0.008038 : mean lock time
0.008038 : median
0.000049 : std dev
File lock performance discussion
The most obvious result is that, as hoped, locks have much lower latency when performed in the presence of a delegation; the latency is, in fact, nearly identical to that in the local case.
We have not attempted to understand the growth in latency as the number of files increases. We expect the explanation to not be related specifically to locking.
(XXX: More pertinent to the scalability question would be network load (count rpc's in each case) and server load (look at cpu load or whatever on the server)).
Byte-range lock measurements
To measure the cost of byte-range locks, we focused on the cost of splitting and joining locks. (All locks discussed in this section are POSIX byte-range locks.)
For lock splitting, we created a 30 MB file, locked the entire file, then unlocked non-contiguous ranges. Each unlock operation splits the initial lock.
* We really should run this test with multiple clients
* We should test both sequential and random order
* We should probably run this one with contiguous ranges ...
* it should not make a difference, right?
To measure the cost of lock joining, we ran a complementary test: we locked non-contiguous regions of the file, then locked the entire range. The ranges are non-contiguous to avoid coalescing locks as we proceed.
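The split and join tests can be approximated with `fcntl.lockf`, as in the sketch below. It is not the original test program, and all names are hypothetical. In particular, the one-byte gap between regions is an inference: it reproduces the region counts shown in the results (for example 29971 regions of 1000 bytes in a 30,000,000-byte file), but the original spacing is not documented here.

```python
import fcntl
import os
import tempfile
import time


def _timed(fn):
    """Run fn() and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def regions(file_size, region_size, gap=1):
    """Start offsets of non-contiguous regions (assumed 1-byte gap)."""
    return list(range(0, file_size, region_size + gap))


def split_test(fd, file_size, region_size):
    """Lock the whole file, then unlock each region, splitting the lock."""
    offsets = regions(file_size, region_size)
    t_lock = _timed(lambda: fcntl.lockf(fd, fcntl.LOCK_EX, 0, 0, os.SEEK_SET))

    def unlock_all():
        for off in offsets:
            fcntl.lockf(fd, fcntl.LOCK_UN, region_size, off, os.SEEK_SET)

    t_split = _timed(unlock_all)
    fcntl.lockf(fd, fcntl.LOCK_UN)  # drop whatever fragments remain
    return len(offsets), t_lock, t_split


def join_test(fd, file_size, region_size):
    """Lock each region, then lock the whole file, joining the locks."""
    offsets = regions(file_size, region_size)

    def lock_all():
        for off in offsets:
            fcntl.lockf(fd, fcntl.LOCK_EX, region_size, off, os.SEEK_SET)

    t_regions = _timed(lock_all)
    t_join = _timed(lambda: fcntl.lockf(fd, fcntl.LOCK_EX, 0, 0, os.SEEK_SET))
    fcntl.lockf(fd, fcntl.LOCK_UN)
    return len(offsets), t_regions, t_join
```

The file descriptor must be open for writing because the sketch takes exclusive locks. Locks may extend past end of file, so the file does not need to be pre-sized for a quick trial.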
We measured performance for four segment sizes:
- 3 segments, each 10,000,000 bytes
- 30 segments, each 1,000,000 bytes
- 300 segments, each 100,000 bytes
- 3,000 segments, each 10,000 bytes
As before, we ran each test 10 times and averaged the results. Variance was negligible. Between runs, we unmounted and remounted the server.
* why is that important?
We ran the join and split tests on the local file system, on NFS with no delegations, and again on NFS with delegations. After each test with delegations, we verified that the delegation was in place by opening the file for writing at the end of the run and noting the resulting DELEGRETURN.
Byte-range lock split results
Local file system
0.000016 secs - lock whole file
0.000021 secs - unlock 3 10000000-byte regions (split)
0.000016 secs - lock whole file
0.000190 secs - unlock 30 1000000-byte regions (split)
0.000015 secs - lock whole file
0.002985 secs - unlock 300 100000-byte regions (split)
0.000015 secs - lock whole file
0.177999 secs - unlock 3000 10000-byte regions (split)
0.000016 secs - lock whole file
23.569079 secs - unlock 29971 1000-byte regions (split)
NFS, no delegations
0.000276 secs - lock whole file
0.000704 secs - unlock 3 10000000-byte regions (split)
0.000325 secs - lock whole file
0.007067 secs - unlock 30 1000000-byte regions (split)
0.000276 secs - lock whole file
0.073822 secs - unlock 300 100000-byte regions (split)
0.000271 secs - lock whole file
1.099715 secs - unlock 3000 10000-byte regions (split)
0.000289 secs - lock whole file
74.407294 secs - unlock 29971 1000-byte regions (split)
NFS with delegations
0.000016 secs - lock whole file
0.000026 secs - unlock 3 10000000-byte regions (split)
0.000016 secs - lock whole file
0.000248 secs - unlock 30 1000000-byte regions (split)
0.000017 secs - lock whole file
0.004350 secs - unlock 300 100000-byte regions (split)
0.000016 secs - lock whole file
0.225387 secs - unlock 3000 10000-byte regions (split)
0.000017 secs - lock whole file
22.961603 secs - unlock 29971 1000-byte regions (split)
Byte-range lock join results
Local file system
0.000028 secs - lock 3 10000000-byte regions
0.000010 secs - lock whole file (join)
0.000212 secs - lock 30 1000000-byte regions
0.000031 secs - lock whole file (join)
0.004222 secs - lock 300 100000-byte regions
0.000407 secs - lock whole file (join)
0.369237 secs - lock 3000 10000-byte regions
0.004469 secs - lock whole file (join)
43.966929 secs - lock 29971 1000-byte regions
0.030219 secs - lock whole file (join)
NFS, no delegations
0.000750 secs - lock 3 10000000-byte regions
0.000246 secs - lock whole file (join)
0.007616 secs - lock 30 1000000-byte regions
0.000307 secs - lock whole file (join)
0.081856 secs - lock 300 100000-byte regions
0.001215 secs - lock whole file (join)
1.548707 secs - lock 3000 10000-byte regions
0.011581 secs - lock whole file (join)
133.975178 secs - lock 29971 1000-byte regions
0.120294 secs - lock whole file (join)
NFS with delegations
0.000032 secs - lock 3 10000000-byte regions
0.000012 secs - lock whole file (join)
0.000284 secs - lock 30 1000000-byte regions
0.000046 secs - lock whole file (join)
0.006794 secs - lock 300 100000-byte regions
0.000558 secs - lock whole file (join)
0.347239 secs - lock 3000 10000-byte regions
0.002566 secs - lock whole file (join)
42.846999 secs - lock 29971 1000-byte regions
0.029043 secs - lock whole file (join)
Delegation recall with byte-range locks
Earlier, we saw the performance advantage for a client holding delegations when acquiring locks. In these tests, we examine the performance penalty to the server when recalling a delegation in the face of numerous client locks.
To test the performance of delegation recall, the client opens a 30 MB file, acquires a number of byte-range locks, then idles. The server then opens the file for writing, which induces a delegation recall callback. This forces the client to release its locks locally and establish them on the server. The client then relinquishes the delegation.
* we should also test with the server opening for read, forcing the client to re-establish locks ...
We vary the number of locks, testing performance with 1, 2, 3, 4, 5, 10, 25, 50, 100, 250, 500, 1,000, 2,500, 5,000, and 10,000 locks. I started with a 30M file; I'd mount, open/close that file (to get a delegation stateid to use), re-open it and request various numbers of byte-range locks, then sleep. I un/remounted between each run.
Meanwhile, back on the server, I'd wait until the sleeping started, then time opening that file (locally on the server) for write. The times below reflect the cost of a single delegation recall and some number of LOCK commands getting flushed before the open(w) succeeds.
[client]$ uhu && hmd && newerlock -f 1 && locknhold midfile 10000 10
[server]$ newerlock -f 1 -w
LOCAL FS -- time open(w) -- no NFS involvement --
OPEN time: 0.000051 seconds
-- w/ 1 FDR, 0 locks --
OPEN time: 0.001025 seconds
-- w/ 1 FDR, 1 lock --
OPEN time: 0.001521 seconds
-- w/ 1 FDR, 2 locks --
OPEN time: 0.001726 seconds
-- w/ 1 FDR, 3 locks --
OPEN time: 0.002064 seconds
-- w/ 1 FDR, 4 locks --
OPEN time: 0.002235 seconds
-- w/ 1 FDR, 5 locks --
OPEN time: 0.002482 seconds
-- w/ 1 FDR, 10 locks --
OPEN time: 0.003648 seconds
-- w/ 1 FDR, 25 locks --
OPEN time: 0.007320 seconds
-- w/ 1 FDR, 50 locks --
OPEN time: 0.013309 seconds
-- w/ 1 FDR, 100 locks --
OPEN time: 0.025317 seconds
-- w/ 1 FDR, 250 locks --
OPEN time: 0.063221 seconds
-- w/ 1 FDR, 500 locks --
OPEN time: 0.128633 seconds
-- w/ 1 FDR, 1000 locks --
OPEN time: 0.295346 seconds
-- w/ 1 FDR, 2500 locks --
OPEN time: 0.842576 seconds
-- w/ 1 FDR, 5000 locks --
OPEN time: 2.358167 seconds
-- w/ 1 FDR, 10000 locks --
OPEN time: 7.409892 seconds
-- w/ 1 FDR, 15000 locks --
OPEN time: 14.412268 seconds
-- w/ 1 FDR, 25000 locks --
OPEN time: 36.535290 seconds
-- w/ 1 FDR, 50000 locks --
OPEN time: 90.007199 seconds
... If my lock-holding program terminated (I had it time out) before all recalls were done, it'd "go off while I was cleaning it":
[17179902.828000] BUG: unable to handle kernel paging request at virtual address 6b6b6ba3
[17179902.828000] printing eip:
[17179902.828000] e10c4dc5
[17179902.828000] *pde = 00000000
[17179902.828000] Oops: 0000 [#1]
[17179902.828000] Modules linked in: nfs lockd sunrpc st nvram cpufreq_userspace speedstep_centrino\
freq_table thermal processor fan button battery ac sr_mod
[17179902.828000] CPU: 0
[17179902.828000] EIP: 0060:[<e10c4dc5>] Not tainted VLI
[17179902.828000] EFLAGS: 00010202 (2.6.17-CITI_NFS4_ALL-1-default-dirty #27)
[17179902.828000] EIP is at nfs_delegation_claim_locks+0x16/0x8a [nfs]
[17179902.828000] eax: dd844d24 ebx: 6b6b6b6b ecx: dc6f1eb0 edx: dc471640
[17179902.828000] esi: 00000000 edi: dc73edbc ebp: dbc01fa8 esp: dbc01f9c
[17179902.828000] ds: 007b es: 007b ss: 0068
[17179902.828000] Process nfsv4-delegretu (pid: 2411, threadinfo=dbc00000 task=ddaa1a50)
[17179902.828000] Stack: dcc17d14 00000000 dd844d24 dbc01fc8 e10c4ead dcc17d14 dd844d24 dc73ec84
[17179902.828000] dcc17b98 dd19e96c dc73edbc dbc01fe4 e10c552f dc73edbc dc73ec84 e10c5421
[17179902.828000] 00000000 00000000 00000000 c01012dd de33beb0 00000000 00000000 5a5a5a5a
[17179902.828000] Call Trace:
[17179902.828000] [<c0103871>] show_stack_log_lvl+0x7f/0x87
[17179902.828000] [<c01039c2>] show_registers+0x112/0x17b
[17179902.828000] [<c0103b85>] die+0xe2/0x1b0
[17179902.828000] [<c01137fd>] do_page_fault+0x469/0x561
[17179902.828000] [<c0103553>] error_code+0x4f/0x54
[17179902.828000] [<e10c4ead>] nfs_delegation_claim_opens+0x74/0xa2 [nfs]
[17179902.828000] [<e10c552f>] recall_thread+0x10e/0x15e [nfs]
[17179902.828000] [<c01012dd>] kernel_thread_helper+0x5/0xb
[17179902.828000] Code: c0 74 07 50 e8 62 a2 18 00 58 53 e8 f5 67 08 df 8b 5d fc c9 c3 55 89 e5 8b\
45 0c 57 56 53 8b 78 1c 8b 9f dc 00 00 00 85 db 74 4e <f6> 43 38 03 74 44 8b 43 34 8b 55 08 39 90\
80 00 00 00 75 36 53
[17179902.828000] EIP: [<e10c4dc5>] nfs_delegation_claim_locks+0x16/0x8a [nfs] SS:ESP 0068:dbc01f9c
In this test, a client acquires N read locks on a single delegated file and holds them. We then open that file for write on the server, and time the open.
Note that the open cannot succeed until the client has established all N of its cached locks on the server and returned the delegation.
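One way to read the OPEN times is to divide out the number of locks. The sketch below does that arithmetic on a subset of the figures reported above, subtracting the 0-lock recall time as a baseline. It is only a convenience for inspecting the data. The helper name is hypothetical, and the single-trial figures do not support strong conclusions.

```python
# (number of locks, OPEN time in seconds), copied from the results above
RECALL_TIMES = [
    (0, 0.001025), (1, 0.001521), (10, 0.003648), (100, 0.025317),
    (1000, 0.295346), (10000, 7.409892), (50000, 90.007199),
]


def per_lock_cost(samples):
    """Marginal seconds per lock, relative to the 0-lock recall time."""
    base = dict(samples)[0]
    return [(n, (t - base) / n) for n, t in samples if n > 0]


for n, cost in per_lock_cost(RECALL_TIMES):
    print(f"{n:>6} locks: {cost * 1e6:9.1f} us per lock")
```

On these figures the per-lock cost is higher at 50,000 locks than at 1,000, so the recall time appears to grow faster than linearly in the number of locks.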
(XXX: No idea what to conclude here.)
Lock acquisition
approach is basic:
- patch sys_fcntl() so that if flock.l_whence is 20+<normal>, set FL_DMR_TIME flag and fix l_whence.
- patch __posix_lock_file_conf() so that if FL_DMR_TIME is in fl_flags, print operation type and time it completed.
- use simple test program with parent locking, stalling, kids attempt, parent unlocks, first kid gets it.
- parent, lock-acquiring kid, and last-kid-on-locklist use l_whence hack to get timings.
[herdtime source] [patch against 2.6.18-rc2-CITI+fair-queueing] [patch against 2.6.17-CITI]
LOCAL FS
[1] 1160683264.152631 1160683264.153135 +0.000504s
[10] 1160683010.200763 1160683010.201253 +0.000490s 1160683010.201767 +0.001004s
[50] 1160683020.229437 1160683020.229984 +0.000547s 1160683020.230804 +0.001367s
[100] 1160683030.274120 1160683030.274676 +0.000556s 1160683030.275923 +0.001803s
[200] 1160683040.366856 1160683040.367427 +0.000571s 1160683040.369732 +0.002876s
[300] 1160683050.547605 1160683050.548179 +0.000574s 1160683050.551835 +0.004230s
[400] 1160683323.292749 1160683323.293315 +0.000566s 1160683323.298616 +0.005867s
[500] 1160683333.697860 1160683333.699169 +0.001309s 1160683333.706350 +0.008490s
[600] 1160683344.286276 1160683344.286870 +0.000594s 1160683344.296097 +0.009821s
[700] 1160683355.019052 1160683355.019669 +0.000617s 1160683355.031266 +0.012214s
[800] 1160683365.987829 1160683365.988448 +0.000619s 1160683366.002810 +0.014981s
[900] 1160683377.072625 1160683377.073232 +0.000607s 1160683377.090353 +0.017728s
[1000] 1160683388.457439 1160683388.458050 +0.000611s 1160683388.478413 +0.020974s
LOCAL FS -- fair-queueing
[1 kid]
1160681590.335047
1160681590.335115 +68us
[10 kids]
1160679535.642665
1160679535.642738 +73us
[50 kids]
1160679573.885154
1160679573.885289 +135us
[100]
1160679659.802738
1160679659.803091 +353us
[200]
1160679696.385868
1160679696.387069 +1201us
[300]
1160679723.117016
1160679723.119654 +2638us
[400]
1160679752.476966
1160679752.481666 +4700us
[500]
1160679803.618752
1160679803.626097 +7345us
[600]
1160679835.503997
1160679835.514580 +10583us
[700]
1160679861.705542
1160679861.719921 +14379us
[800]
1160679887.295459
1160679887.314191 +18732us
[900]
1160679920.966662
1160679920.990381 +23719us
[1000]
1160679946.961694
1160679946.990909 +29215us
When an unlock or downgrade makes a lock available to waiting processes, the existing code wakes all of them, even when we know that not all will be able to acquire the lock.
The addition to the locking code of an unrelated feature necessary for NFSv4 has made it desirable to modify the code so that it wakes only processes that will be able to acquire the lock.
To measure the effect of this change, we had N processes request an exclusive whole-file lock on a single file that was already locked by another process. We then unlocked the file, and measured two times: the time until the first waiting process successfully acquired the lock, and the time until all the processes have retried the lock, failed, and gone back to sleep.
We also ran the test with our "fair queueing" patch applied, in which case we measure only the time till one of the waiters has acquired the lock (since the other waiters are not woken in this case).
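The difference between the two wake-up policies can be illustrated with a toy model that only counts wake-ups. It does not model the kernel's lock code or its timing. The function names are hypothetical, and picking the first waiter as the winner in the wake-all case is a simplification, since in practice the scheduler decides.

```python
from collections import deque


def release_wake_all(waiters):
    """Unpatched behaviour: every waiter is woken and retries the lock.

    One waiter wins; the rest fail and go back to sleep.
    Simplification: the first waiter is taken to be the winner.
    """
    woken = list(waiters)
    winner = woken[0] if woken else None
    back_to_sleep = deque(woken[1:])
    return winner, len(woken), back_to_sleep


def release_wake_one(waiters):
    """Patched ("fair queueing") behaviour: wake only the waiter that
    will acquire the lock; the others are never woken."""
    queue = deque(waiters)
    winner = queue.popleft() if queue else None
    return winner, (1 if winner is not None else 0), queue
```

With 1,000 waiters the first policy performs 1,000 wake-ups per unlock and the second performs one, which is why only the time to first acquisition is meaningful for the patched kernel.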
The results show that with fewer than 100 waiting processes, the patched kernel is able to grant the lock to the first waiter more quickly than the unpatched kernel could.
However, for larger N the time for the patched kernel to grant the acquired lock appears to be more than the time for the unpatched kernel to grant the lock and return all waiting processes to the waitqueue. We do not yet understand the reasons for this.
(XXX: Are these times the result of one trial? What's the variance like?)
Catch-all section to delete later
demonstration of caching and lock performance with scaling sounds sorta like the various timing tests we already have.. "scaling" -- multiclient stuff? you'd talked about polling strategies before? ..under various levels of conflict "conflict"? like recalled delegations when you have local locks? .. or do they just mean "what happens when locklists get long?" local/NFS byte-range numbers and herdtime numbers byte-range locks (look at, e.g., lock-splitting issues) can/should we try to come up with a more-representative case than a 30M file, perhaps? local/NFS byte-range and whole-file numbers
tests with atro.citi.umich.edu as server
-- local/NFS whole-file locking
-- local/NFS byte-range locking
-- local/NFS delegation recall with locks
-- local/NFS recalling multiple delegations
-- NFS cost of delegated OPEN
-- local herd-time: timing unlock-to-locklist-quiescence
various testing minions in CVS: CVSROOT=/afs/citi.umich.edu/projects/CVS-richterd module: nfs/TESTING err.. little tool info
random garbage.
-- ?.. don't like this test overmuch. -d // VFS herd-thru: how many lock/write/unlocks per second with N processes?
-- ? NFS herd-thru: how many lock/write/unlocks per second with N processes? (no delegations) tests client polling strategy
-- ? NFS herd-time-1: what's time from unlock-to-locklist-quiescence with N processes? (no delegations) tests client polling strategy
-- ? kinda covered already. -d // byte-range: lock whole file, then unlock 1 byte. compare to normal whole-file unlock.
? -- how long does it take to recall 1 delegation from 1 guy vs. recalling 1 delegation from 100 guys? (svr: apikia; clts: pugna, atro, bogo, dragonwell, guangzhou, shenzhen, l99(?), la1(?), spin(?), rip(?)) ? -- how long does it take to recall 1 delegation from 1 guy vs. recalling 100 delegations from 1 guy? well, how do we trigger all those recalls simultaneously? ? -- how do we test cost-of-recall? maybe estimate and recommend decent delay-factor on the server and try it (is it better to have the server stall for X ms in order to save client 20X ms from generic ERR_DELAY pause?) maybe recommend possibility of making it adaptive? (e.g., factor in per-client RTT, "good behavior", etc) but what's the advantage if we just refuse to grant after that one break? -- since we must avoid ping-ponging when, e.g., two guys are both appending to a logfile --> if serially breaking a bunch of delegations all at once, we get to do them in (numFiles * X ms) instead of ~ (numFiles * 20X ms) .. to avoid the Opera-startup-problem, just implement the "i-don't-break-my-own" + "upgrade-delegation-type" and forget about delays altogether?
Task 4. Directory Delegations
Analysis of directory delegations – how well does it work and when, when does it totally not work.
Background
Directory delegations promise to extend the usefulness of dentry caching in two ways. First, the client is no longer forced to revalidate the dentry cache after a timeout. Second, while positive caching can be treated as a hint, negative caching without cache invalidation violates open-to-close semantics. Directory delegations allow the client to cache negative results.
For example, if a client opens a file that does not exist, it issues an OPEN RPC that fails. But a subsequent open of the same file might succeed, if the file is created in the interim. Open-to-close semantics requires that the newly created file be seen by the client, so the earlier negative result cannot be cached. Consequently, each subsequent open of the same non-existent file also requires an OPEN RPC to be sent to the server. This example is played out repeatedly when the shell searches for executables in PATH or when the linker searches for shared libraries in LD_LIBRARY_PATH.
With directory delegations, the server callback mechanism can guarantee that no entries have been added or modified in a cached directory, which allows consistent negative caching and eliminates repeated checks for non-existent files.
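The effect on RPC counts can be illustrated with a toy model. This is only a sketch of the idea, not the Linux client or server code. The `Server` and `Client` classes and all of their methods are hypothetical names invented for the example, and a real recall is an asynchronous callback rather than a direct method call.

```python
class Server:
    """Toy directory server that recalls delegations before changing the directory."""

    def __init__(self, entries=()):
        self.entries = set(entries)
        self.delegates = []

    def grant(self, client):
        """Hand a directory delegation to a client (hypothetical policy: always grant)."""
        self.delegates.append(client)
        client.delegated = True

    def create(self, name):
        """Add an entry, recalling every outstanding delegation first."""
        for client in self.delegates:
            client.recall()
        self.delegates.clear()
        self.entries.add(name)


class Client:
    """Toy client that caches negative lookups only while it holds a delegation."""

    def __init__(self, server):
        self.server = server
        self.delegated = False
        self.negative = set()
        self.open_rpcs = 0

    def open(self, name):
        if self.delegated and name in self.negative:
            return False                  # answered from the negative cache, no RPC
        self.open_rpcs += 1               # stands in for an OPEN RPC
        found = name in self.server.entries
        if not found and self.delegated:
            self.negative.add(name)       # safe: the server will recall before a create
        return found

    def recall(self):
        """Delegation recalled: cached negative results can no longer be trusted."""
        self.delegated = False
        self.negative.clear()


server = Server({"ls"})
client = Client(server)
server.grant(client)
for _ in range(3):
    client.open("gcc")                    # first call goes to the server, the rest hit the cache
```

In this model, three opens of a missing name cost one RPC while the delegation is held. Once the server creates the file, the recall empties the negative cache and the next open goes back to the server and succeeds.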
Status
We implemented directory delegations in the Linux NFSv4 client and server.
Our server implementation follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.
We implemented a /proc interface on the server to enable or disable directory delegation at run time. At startup, the client queries the server for directory delegation support.
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.
Testing
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.
Task 5. NFS Server Load
How do you specify/measure NFS Server load?
Status
To frame the task, consider identical symmetric servers with a cluster file system back end and a task running on one of them. Can we compare the load on the servers to determine whether there would be a benefit to migrating a client from one to the other?
Answering this question requires that we define a model of load based on measurable quanta.
Given a model, the next step is to write a tool that collects the factors that influence load and to measure how accurately the model predicts performance.
Goals
If an application is running at less than peak performance, the load model should tell us whether the bottleneck is in the server, the client, or elsewhere.
If the bottleneck is in the server, one option for improving application performance is replacing server components with faster ones. Another option is to add servers. A third option is to migrate the application to a lightly-loaded server.
* Actually, the second option is fruitless without the third.
Factors that influence server load
Disks
The rate at which a single file in a server file system can be read or written depends on many factors, including characteristics of the disk hardware (rotation speed, access latency, etc.), the disk controller, the bus, the layout of files on the disk, the size of the transfer, and the degree of caching. The overall bandwidth of a file system also depends on the degree of striping and distribution of requests across disks.
The iostat command can reveal a bottleneck due to server disks if seek or transfer rates approach maximum values. For a given server configuration, these values can be measured directly. It might be possible to predict these values for a given hardware ensemble.
CPU
Server threads compete with one another and with the operating system for access to the CPU. Excess offered load can exhaust the availability of server threads.
* how would we know if this were to happen?
* would it suffice to simply allocate more threads?
* or are there pathological cases to consider?
Overall CPU utilization can be measured, also with iostat, but there may be other factors influencing the allocation of CPU to server threads. For example, excessive pressure on the memory or interrupt subsystem can force the operating system to intervene.
Interrupts
Interrupt rates can be measured with
* i forget :-(
For a given hardware configuration, a threshold can be measured experimentally.
Memory
The memory subsystem is complex and varies among operating systems. Applications compete with one another for virtual memory. Often, they also compete with the file system, which uses the virtual memory subsystem for its in-memory cache.
Often, excess demand for memory is reflected by early eviction of pages in virtual memory. The vmstat command shows the pageout rate, which does not measure early eviction, but does reflect overall memory pressure.
Network
Network utilization is the ratio of delivered bandwidth to maximum available bandwidth. Maximum available bandwidth is a property of network hardware. Delivered bandwidth can be measured with the netstat command.
Full-duplex network technologies can deliver maximal bandwidth in both directions simultaneously, while half-duplex network technologies share the medium between the two directions, so the sum of the traffic in both directions is limited to the maximum bandwidth.
* i believe that is a true statement ...
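As a rough illustration of the ratio defined above, the sketch below turns byte counts sampled over an interval into a utilization figure. The function name and parameters are assumptions made for the example. For a full-duplex link it takes the busier of the two directions as the bottleneck, which is one reasonable choice and not the only one.

```python
def network_utilization(rx_bytes, tx_bytes, interval_s, link_bps, full_duplex=True):
    """Return delivered bandwidth / maximum bandwidth for one sampling interval.

    rx_bytes, tx_bytes: bytes received and sent during the interval
    interval_s:         length of the interval in seconds
    link_bps:           nominal link speed in bits per second
    """
    rx_bps = 8 * rx_bytes / interval_s
    tx_bps = 8 * tx_bytes / interval_s
    if full_duplex:
        # Each direction has the full link rate; report the busier direction.
        delivered = max(rx_bps, tx_bps)
    else:
        # Both directions share the medium, so their sum is what is bounded.
        delivered = rx_bps + tx_bps
    return delivered / link_bps
```

For example, 62,500,000 bytes in each direction during one second on a 1 Gb/s link gives 0.5 when the link is full-duplex and 1.0 when it is half-duplex.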
Measuring load
Each measured value can be expressed as a ratio between 0 (idle) and 1 (at capacity). For each value, there is a program that consumes the corresponding resource.
The overall performance of a server can be tested by measuring NFS performance directly with microbenchmarks. Candidate microbenchmarks include NULL RPC, small READ RPC, large READ RPC, small WRITE RPC, and large WRITE RPC.
The usefulness of a measured value can be tested by comparing microbenchmark performance as the resource is consumed.
It is useful to sample the instantaneous values, and to track them over time with a damping function that shows the averages over the last second, minute, five minutes, etc.
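One possible damping function is an exponentially weighted moving average, similar in spirit to the UNIX load average. The sketch below is an assumption about how this could be done, not a description of an existing tool. The class name, the window lengths, and the clamping of the ratio to the range 0 to 1 are all choices made for the example.

```python
import math


class DampedRatio:
    """Track a resource ratio (0 = idle, 1 = at capacity) over several time windows."""

    def __init__(self, windows=(1.0, 60.0, 300.0)):
        # One running average per window length, in seconds.
        self.averages = {w: 0.0 for w in windows}

    def update(self, used, capacity, dt):
        """Fold in one sample taken dt seconds after the previous one.

        Returns the instantaneous ratio for this sample.
        """
        ratio = min(max(used / capacity, 0.0), 1.0)
        for window in self.averages:
            # Exponential decay: older samples lose weight with time constant `window`.
            alpha = 1.0 - math.exp(-dt / window)
            self.averages[window] += alpha * (ratio - self.averages[window])
        return ratio


disk = DampedRatio()
for _ in range(10):
    disk.update(used=50, capacity=100, dt=1.0)   # ten seconds at half capacity
```

After ten seconds at a constant ratio of 0.5, the one-second average has nearly converged to 0.5 while the one-minute and five-minute averages are still rising, so a short burst and a sustained load can be told apart.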
How do we check usefulness of this information?
boot with reduced resources somehow, see if increasing resources increases performance as predicted?
Disk bandwidth
vary size of raid arrays, bandwidth of disk interfaces?
Or run another process that soaks up some percentage of bandwidth??
CPU load
CPU throttling??
Just try different totally random machines? Vary workload? How do we get a light vs. heavy workload?
How do we measure performance of each? (Increasing clients until we see performance degradation due to server bottlenecks would be the obvious thing to do....)
Measures of load
what do we use to determine if our measure of load is correct?
- single rpc latency measured from a client?
- time to complete some other task, measured from a single client (not actually involved in loading the server)?
- rpc's per second?
Configuration parameters on server that can be varied
- number of server threads
- number of connections per server thread
- request queue lengths (# of bytes waiting in tcp socket)
Some special situations that can be problems (from Chuck)
- reboot recovery: everyone is recovering at once.
- mount storms: a lab full of clients may all mount at once, or a cluster job may trigger automount from all clients at once.
Possible benchmark sources, for this and locking scalability
postmark
looks pretty primitive: mixture of reads, writes, creates, unlinks. No locks.
filebench
also no locking. Haven't figured out exactly what the various loads do. Is there actually an active developer community?
See Bull.net's list?
- Bonnie++
- FStress
- dbench: simulates filesystem activity created by a samba server running the proprietary SMB benchmark "netbench". Maybe not so useful.
- Do-it-ourselves modify postmark or filebench? set up a mailserver (e.g.), send it fake mail. get traces from working servers
* Dean: Not sure what you guys are looking for here ("...for this..."), but iozone can use locks and nothing will stress out your server like IOR since it can scale to 1000s of clients. If you are looking for metadata tests, LLNL and other labs have some (mdtest, etc.). Look at http://lbs.sourceforge.net/, http://www.llnl.gov/asci/purple/benchmarks/limited/, http://www.cs.dartmouth.edu/pario/examples.html, http://www.llnl.gov/icc/lc/siop/downloads/download.html