Cthon06 Meeting Notes
Linux pNFS Implementation meeting at Connectathon 2006
Note that these are raw, unprocessed notes!
pNFS client: walk through mount, open, and I/O WRT pNFS operations. LD = layout driver
0) register LD
1) MOUNT
a) FS_LAYOUT_TYPES - currently only one
b) Match returned types to LD
c) register SB for pnfs private area use
d) LD -> client GETDEVICELIST
ISSUE: why do GETDEVICELIST at mount? current client doesn't have a clientid until first open - device list might change
- PVFS2 doesn't use GETDEVICELIST
- file layout could wait until open
- block (object) will call immediately to determine connectivity
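Steps 0) and 1a)-1b) can be sketched roughly as below. This is not the prototype's code: struct layout_driver, register_layout_driver() and match_layout_driver() are hypothetical names, and the layout type numbers are the ones later standardized in RFC 5661, which the 2006 prototype may not have used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout type numbers as later standardized in RFC 5661 (assumption:
 * the 2006 prototype may have used different values). */
enum layout_type {
    LAYOUT_NFSV4_FILES  = 1,
    LAYOUT_OSD2_OBJECTS = 2,
    LAYOUT_BLOCK_VOLUME = 3,
};

/* Hypothetical layout-driver descriptor, not the kernel's real struct. */
struct layout_driver {
    uint32_t    id;    /* layout type this driver handles */
    const char *name;
};

#define MAX_DRIVERS 8
static const struct layout_driver *drivers[MAX_DRIVERS];
static size_t ndrivers;

/* Step 0: a driver registers itself before any mount happens.
 * Returns 0 on success, -1 if the table is full or the type is taken. */
int register_layout_driver(const struct layout_driver *ld)
{
    assert(ld != NULL);
    if (ndrivers == MAX_DRIVERS)
        return -1;
    for (size_t i = 0; i < ndrivers; i++)
        if (drivers[i]->id == ld->id)
            return -1;              /* one driver per layout type */
    drivers[ndrivers++] = ld;
    return 0;
}

/* Step 1b: pick the first server-advertised type that has a registered
 * driver. NULL means no match, i.e. fall back to plain NFS I/O. */
const struct layout_driver *match_layout_driver(const uint32_t *types,
                                                size_t ntypes)
{
    for (size_t i = 0; i < ntypes; i++)
        for (size_t j = 0; j < ndrivers; j++)
            if (drivers[j]->id == types[i])
                return drivers[j];
    return NULL;
}
```

Whether the matched driver then issues GETDEVICELIST at mount or defers it is exactly the open ISSUE above, so the sketch stops at the match.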
2) LAYOUTGET
a) LAYOUTGET in OPEN compound? opportunity for LD to say whether or not to do LAYOUTGET
ISSUE:
- might already have a layout
- blocks case
- OPEN references hardlink? one open for write, one open for read
b) currently waits until first I/O, then LD -> LAYOUTGET
QUESTION: file size - do we get a layout or not?
trond - client. when the vm asks for io, it is on a per-page or larger extent basis. nfs client tracks this because it determines whether it can coalesce io or not. obvious to nfs whether or not to call into pNFS
marc - wants server policy, e.g. ask server
garth - what if you already have a layout?
no way for server to communicate.
brent - ask client layout driver
trond - nfs_flush_list(): list of requests, wsize requests on wire. in pnfs case call into LD to see if it wants to deal with it.
brent: who sets up parameters on layout get?
dean: client asks for io size, server decides layout size
trond: 5 byte file, why should client ask for a layout?
benny: load distribution (10,000 5 byte files going through MDS)
trond/bruce: MDS needs to hold state resources for each layout
brent/marc: cache small files in data server, need to redirect
ask mds prior to getting layout. or
garth: IO threshold for pNFS? per file attr for getattr?
layout driver should have some control over the range asked?
c) where is it stored? layout is stored off inode.
ISSUE: private pointer off inode, private area, hang as much stuff that you want to - all private to LD
trond: may want to reuse vfs locking code - new type of file_lock. don't want to add lock management code into the nfs client.
- properties of posix locks
- coalesce locks
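The coalescing property mentioned here can be illustrated on plain byte ranges. This sketch deliberately does not use the VFS file_lock code trond refers to; struct range and coalesce_ranges() are hypothetical names, and the input is assumed to be sorted by start offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start; uint64_t end; };   /* half-open: [start, end) */

/* Merge overlapping or touching ranges in place.
 * Assumes the array is sorted by start. Returns the new element count. */
size_t coalesce_ranges(struct range *r, size_t n)
{
    assert(r != NULL || n == 0);
    if (n == 0)
        return 0;
    size_t out = 0;                      /* index of the range being grown */
    for (size_t i = 1; i < n; i++) {
        if (r[i].start <= r[out].end) {
            /* Overlaps or touches: extend the current range if needed. */
            if (r[i].end > r[out].end)
                r[out].end = r[i].end;
        } else {
            r[++out] = r[i];             /* gap: start a new range */
        }
    }
    return out + 1;
}
```

Layout segments hung off the inode could be tracked the same way, which is presumably why reusing existing range-management code looked attractive.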
3) READ
two ways: standard page cache (normal nfs clients), PVFS2 way.
regular: read ahead code in VM to determine actual size. calls into nfs code to read. needs to id the layout (get the whole inode, layout pointer). gives page list. regular read
brent: any hint from file system?
trond: primitive - largest you can accept, some hooks
dean: different read ahead size for data servers
garth: read aheads span stripes by just a small amount
trond: modification to vm read-ahead code, currently reads in PAGE_SIZE
brent: preferred max matches up to the stripe - stripe aligned?
trond: always page aligned - chunk size aligned
read-ahead: application decides how much to read ahead
brent: important - asking one DS for 1K and asking 64 DSs for 1K each take the same amount of time
trond: end up filling up much more of the page cache and confusing VM LRU ...
random access vs.
brent: read ahead needs chunk alignment within page alignment notion
dean: currently have an interface to ask for this now (in terms of PAGES). stripe size multiple of PAGE_SIZE
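The chunk alignment brent asks for comes down to rounding a read-ahead window outward to stripe-unit boundaries. A minimal sketch, assuming a 4096-byte page and a stripe unit that is a multiple of the page size (as dean notes); align_to_stripe() and struct byte_range are hypothetical names, not an existing VM interface.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed page size for this sketch */

struct byte_range { uint64_t start; uint64_t end; };   /* end is exclusive */

/* Expand [offset, offset + len) outward to stripe-unit boundaries. */
struct byte_range align_to_stripe(uint64_t offset, uint64_t len,
                                  uint64_t stripe_unit)
{
    assert(stripe_unit != 0 && stripe_unit % PAGE_SIZE_BYTES == 0);
    struct byte_range r;
    uint64_t end = offset + len;
    r.start = offset - offset % stripe_unit;                      /* round down */
    r.end = (end + stripe_unit - 1) / stripe_unit * stripe_unit;  /* round up */
    return r;
}
```

With a 64K stripe unit, a 1000-byte window starting at offset 65000 expands to two full stripe units (0 to 131072). That is garth's case of a read-ahead spanning stripes by a small amount, and it also shows trond's concern about filling far more of the page cache than was asked for.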
trond: O_DIRECT needs some consideration.
dean: can't get more than a half meg out of the VM
d) last void * pointer in the read/write interface. lots of nfs code and then needs to call LD, so used to need type checking: declare struct blah; in PVFS2
trond: setting it up as a cookie is done when you have two or more that need it, not when you have one.
ISSUE:
4) WRITE
a) uses standard nfs coalesce code to construct wsize chunks (stripe size)
different for O_DIRECT - map user mem into vm space, not file system specific.
dichotomy - like to convert the O_DIRECT AIO support, only for O_DIRECT. normal code: writes go into the vm and return. generic AIO code hacky - waiting for locks, etc
b) sent to write function. write page list
c) marks pages for commit: close, fsync, memory pressure
d) instead of calling regular commit, calls LD commit. LD commit - to data servers.
ISSUE: if not using nfs code for commits (e.g. your own), when can you clear pages? LD needs to clear its own pages on the commit operation
5) LAYOUTCOMMIT:
called on fsync, stat calls, close, lock, locku, on trunc setattr? triggered by user land call.
ISSUE: garth: preferred that it is done prior to any getattr, but getattr is always called.
benny: only the stat system call getattr.
don't overwrite the file size on client on getattr if haven't sent layout commit.
added flag to inode structure: writes have happened, no layout commit.
set flag when writes are issued. used to trigger a LAYOUTCOMMIT on fsync, etc
and then cleared.
ISSUE: which creds to use? add a pointer to the write creds, bump counter. use nfs_open_context
ISSUE: who constructs LAYOUTCOMMIT - currently generic pnfs. needs to be an LD call because different byte ranges etc.
LD->LAYOUTCOMMIT send an array of commits. block layout uses layout update structure, object
Small files
currently three round trips: OPEN, READ, CLOSE. whole file, current stateid = down to one