Cthon06 Meeting Notes


Linux pNFS Implementation meeting at Connectathon 2006

Note that these are raw, unprocessed notes!

pNFS client: walk through mount, open, and I/O WRT pNFS operations. LD = layout driver

0) register LD

1) MOUNT

       a) FS_LAYOUT_TYPES - currently only one
       b) Match returned types to LD (see the registration sketch below)
       c) register SB for pnfs private area use
       d) LD -> client GETDEVICELIST

ISSUE: why do GETDEVICELIST at mount? the current client doesn't have a clientid until the first open, and the device list might change

       - pVFS2 doesn't use GETDEVICELIST
       - file layout could wait until open
       - block (object) will call immediately to determine connectivity
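
The mount-time flow above (register an LD, match the type in fs_layout_types to a driver, let the driver decide about GETDEVICELIST) can be pictured with a small stand-alone sketch. This is not the actual Linux pNFS code; the structure and function names (layoutdriver_ops, register_layoutdriver, match_layout_type, initialize_mount) are hypothetical, and only the layout type numbers follow the NFSv4.1 spec.

/* Sketch only: hypothetical names, not the real Linux pNFS interfaces. */
#include <stdio.h>
#include <stddef.h>

/* Layout type numbers as defined in the NFSv4.1 spec. */
#define LAYOUT_NFSV4_FILES   1
#define LAYOUT_OSD2_OBJECTS  2
#define LAYOUT_BLOCK_VOLUME  3
#define LAYOUT_TYPE_MAX      4

struct layoutdriver_ops {
    int id;                                    /* layout type this LD handles */
    const char *name;
    int (*initialize_mount)(void *sb_private); /* e.g. issue GETDEVICELIST */
};

static struct layoutdriver_ops *registered_lds[LAYOUT_TYPE_MAX];

/* Step 0: a layout driver registers itself. */
static void register_layoutdriver(struct layoutdriver_ops *ld)
{
    registered_lds[ld->id] = ld;
}

/* Step 1b: match the server's fs_layout_types value to a registered LD. */
static struct layoutdriver_ops *match_layout_type(int fs_layout_type)
{
    if (fs_layout_type <= 0 || fs_layout_type >= LAYOUT_TYPE_MAX)
        return NULL;
    return registered_lds[fs_layout_type];
}

/* Example driver: a file-layout driver might defer GETDEVICELIST until open,
 * while a block/object driver would call it here to determine connectivity. */
static int files_initialize_mount(void *sb_private)
{
    printf("files LD: deferring GETDEVICELIST until first open\n");
    return 0;
}

static struct layoutdriver_ops files_ld = {
    .id = LAYOUT_NFSV4_FILES,
    .name = "nfsv4-files",
    .initialize_mount = files_initialize_mount,
};

int main(void)
{
    register_layoutdriver(&files_ld);

    struct layoutdriver_ops *ld = match_layout_type(LAYOUT_NFSV4_FILES);
    if (!ld) {
        printf("no layout driver for this fs_layout_type\n");
        return 1;
    }
    printf("mount: matched layout driver %s\n", ld->name);
    return ld->initialize_mount(NULL);  /* step 1d: driver decides what to do */
}

The initialize_mount hook is exactly where the ISSUE above lands: each driver can decide whether GETDEVICELIST happens at mount or is deferred.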


2) LAYOUTGET

       a) layoutget in OPEN compound?
        opportunity for LD to say whether or not to do LAYOUTGET

ISSUE:

       - might already have a layout
       - blocks case: what if the OPEN references a hardlink?
               one open for write, one open for read
       b) currently waits until first I/O
               LD -> LAYOUTGET

QUESTION: file size - do we get a layout or not?

trond - on the client, the VM asks for I/O on a per-page or larger-extent basis. the NFS client tracks this because in one case it can coalesce the I/O and in the other it can't; it is obvious to the NFS client whether or not to call into pNFS

marc - wants server policy, e.g. ask the server

garth - what if you already have a layout?

     no way for the server to communicate.

brent - ask client layout driver

trond - nfs_flush_list() builds a list of requests and puts wsize requests on the wire. in the pNFS case, call into the LD to see if it wants to deal with it.
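
A rough sketch of trond's point, under the assumption of a hypothetical layout-driver hook (try_flush is not a real kernel interface): the flush path offers the coalesced request list to the LD first and falls back to the regular NFS path if it declines.

/* Sketch only, not the real nfs_flush_list(). */
#include <stddef.h>
#include <stdio.h>

struct io_request {
    unsigned long long offset;
    size_t count;
};

struct layoutdriver {
    /* return 1 if the LD takes the I/O, 0 to let regular NFS handle it */
    int (*try_flush)(struct io_request *reqs, int nreq);
};

static void nfs_regular_flush(struct io_request *reqs, int nreq)
{
    printf("regular NFS path: %d wsize request(s) to the MDS\n", nreq);
}

static void flush_list(struct layoutdriver *ld, struct io_request *reqs, int nreq)
{
    if (ld && ld->try_flush && ld->try_flush(reqs, nreq))
        return;                      /* layout driver sent the I/O itself */
    nfs_regular_flush(reqs, nreq);
}

static int files_try_flush(struct io_request *reqs, int nreq)
{
    printf("files LD: striping %d request(s) across data servers\n", nreq);
    return 1;
}

int main(void)
{
    struct layoutdriver files_ld = { .try_flush = files_try_flush };
    struct io_request reqs[] = { { 0, 65536 }, { 65536, 65536 } };

    flush_list(&files_ld, reqs, 2);  /* LD takes the I/O */
    flush_list(NULL, reqs, 2);       /* no pNFS: regular path */
    return 0;
}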

brent: who sets up the parameters on LAYOUTGET?

dean: the client asks for an I/O size, the server decides the layout size

trond: for a 5 byte file, why should the client ask for a layout?

benny: load distribution (10,000 5-byte files all going through the MDS)

trond/bruce: the MDS needs to keep state resources for each layout

brent/marc: cache small files in the data server, need to redirect

ask the MDS prior to getting a layout? or garth: an I/O threshold for pNFS? a per-file attr for getattr?

should the layout driver have some control over the range asked for?
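
One way to read the threshold idea from this discussion, as a sketch only: a hypothetical per-file pnfs_io_threshold attribute below which the client skips LAYOUTGET and sends the I/O through the MDS. Neither the attribute nor the helper exists as discussed here; this is just the shape of the check.

/* Sketch of the "I/O threshold" idea; names are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

struct inode_info {
    unsigned long long size;
    unsigned long long pnfs_io_threshold;  /* hypothetical per-file attribute */
};

static bool want_layout(const struct inode_info *ino, unsigned long long io_size)
{
    /* Small file and small I/O: not worth the LAYOUTGET and the layout
     * state the MDS would have to keep; go through the MDS directly. */
    if (ino->size < ino->pnfs_io_threshold && io_size < ino->pnfs_io_threshold)
        return false;
    return true;
}

int main(void)
{
    struct inode_info tiny = { .size = 5, .pnfs_io_threshold = 65536 };
    printf("5-byte file: %s\n", want_layout(&tiny, 5) ? "LAYOUTGET" : "via MDS");
    return 0;
}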

       c) where is it stored?
               layout is stored off inode.

ISSUE: a private pointer off the inode, a private area to hang as much stuff off of as you want - all private to the LD
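
A minimal sketch of the storage arrangement being described, with hypothetical structure names: the layout hangs off the NFS inode and carries an opaque pointer that only the layout driver interprets.

/* Sketch only: not the real in-kernel structures. */
#include <stdlib.h>
#include <stdio.h>

struct pnfs_layout {
    unsigned long long offset;   /* byte range this layout covers */
    unsigned long long length;
    int iomode;                  /* read, or read/write */
    void *ld_private;            /* the LD hangs whatever it wants here */
};

struct nfs_inode {
    /* ... regular NFS inode fields ... */
    struct pnfs_layout *layout;  /* NULL until the first LAYOUTGET */
};

int main(void)
{
    struct nfs_inode ino = { .layout = NULL };

    /* First LAYOUTGET attaches a layout to the inode. */
    ino.layout = calloc(1, sizeof(*ino.layout));
    ino.layout->length = ~0ULL;  /* e.g. whole file */

    printf("layout covers %llu bytes from offset %llu\n",
           ino.layout->length, ino.layout->offset);

    free(ino.layout);
    return 0;
}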

trond: may want to reuse the VFS locking code - a new type of file_lock; don't want to add lock management code to the NFS client (see the range-merge sketch after this list).

       - properties of posix locks
               - coalesce locks
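
The "coalesce like POSIX locks" property can be illustrated with a tiny stand-alone range-merge routine; this is not VFS code, just the merging rule for adjacent or overlapping byte ranges.

/* Sketch: merge adjacent or overlapping [start, end) ranges. */
#include <stdio.h>

struct range { unsigned long long start, end; };  /* half-open: [start, end) */

static int try_coalesce(struct range *a, const struct range *b)
{
    if (b->start > a->end || a->start > b->end)
        return 0;                       /* disjoint: cannot merge */
    if (b->start < a->start) a->start = b->start;
    if (b->end   > a->end)   a->end   = b->end;
    return 1;
}

int main(void)
{
    struct range held = { 0, 4096 }, new = { 4096, 8192 };
    if (try_coalesce(&held, &new))
        printf("merged range: [%llu, %llu)\n", held.start, held.end);
    return 0;
}
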
3) READ
       two ways: the standard page cache path of normal NFS clients,
               or the pVFS2 way.
       regular: read-ahead code in the VM determines the actual size, then calls
               into the NFS code to read.
               needs to identify the layout (it has the whole inode, hence the layout pointer).
               hands over a page list; otherwise a regular read.

brent: any hint from the file system?

trond: primitive - the largest you can accept; some hooks

dean: different read ahead size for data servers

garth: read-aheads span stripes by just a small amount

trond: needs a modification to the VM read-ahead code, which currently reads in PAGE_SIZE units

brent: preferred max matches up to the stripe - stripe aligned?

trond: always page aligned - chunk size aligned

read-ahead: application decides how much to read ahead

brent: important - asking one DS for 1K and asking 64 DSs for 1K each take the same amount of time

trond: you end up filling up much more of the page cache and confusing the VM LRU ...

random access vs. ...

brent: read-ahead needs a notion of chunk alignment on top of page alignment

dean: there is currently an interface to ask for this (in terms of PAGES); the stripe size is a multiple of PAGE_SIZE
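
A sketch of the stripe-aligned read-ahead sizing being discussed, assuming (as dean notes) that the stripe size is a multiple of PAGE_SIZE; the helper name is made up.

/* Sketch: round the read-ahead window up to a whole number of stripe units. */
#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long ra_pages_stripe_aligned(unsigned long ra_pages,
                                             unsigned long stripe_bytes)
{
    unsigned long stripe_pages = stripe_bytes / PAGE_SIZE;

    if (stripe_pages == 0)
        return ra_pages;                 /* no stripe info: leave unchanged */
    /* Round up so a read-ahead never spans a stripe "by just a small amount". */
    return ((ra_pages + stripe_pages - 1) / stripe_pages) * stripe_pages;
}

int main(void)
{
    printf("%lu pages\n", ra_pages_stripe_aligned(32, 64 * 1024)); /* -> 32 */
    printf("%lu pages\n", ra_pages_stripe_aligned(20, 64 * 1024)); /* -> 32 */
    return 0;
}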

trond: O_DIRECT needs some consideration.


dean: can't get more than a half meg out of the VM

               d) the last void * pointer in the read/write interface:
               lots of NFS code runs and then needs to call the LD, so the pointer is used to carry LD state.
               needs type checking: declare a struct blah for it, as pVFS2 does.
       trond: setting it up as a cookie is done when you have two or more
       users that need it, not when you have one.
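
A sketch of the opaque cookie in item d): the generic read/write path carries a void * it never interprets, and the layout driver casts it back to its own declared struct (the pvfs2_read_state type here is purely illustrative).

/* Sketch only: illustrative names, not the real read/write interface. */
#include <stdio.h>

struct read_args {
    unsigned long long offset;
    unsigned long count;
    void *ld_cookie;        /* opaque: only the layout driver looks inside */
};

struct pvfs2_read_state { int server_index; };  /* hypothetical LD-side type */

static void pvfs2_read(struct read_args *args)
{
    struct pvfs2_read_state *st = args->ld_cookie;  /* LD-side type check */
    printf("read %lu bytes at %llu from server %d\n",
           args->count, args->offset, st->server_index);
}

int main(void)
{
    struct pvfs2_read_state st = { .server_index = 3 };
    struct read_args args = { .offset = 0, .count = 4096, .ld_cookie = &st };
    pvfs2_read(&args);
    return 0;
}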

ISSUE:

4) WRITE
               a) uses the standard coalescing (stripe size) NFS code to construct wsize chunks
                       different for O_DIRECT - maps user memory into VM space;
                       not file system specific. dichotomy - would like to convert
                       the O_DIRECT path.
                       AIO support only for O_DIRECT. normal code: writes
                       go into the VM, and it returns.
                       generic AIO code is hacky - waiting for locks, etc.
               b) sent to the write function. write page list
               c) marks pages for commit
                       close, fsync, memory pressure
               d) instead of calling the regular commit, calls the LD commit.
                       LD commit - to the data servers.

ISSUE: if not using the NFS code for commits (e.g. your own), when can you clear the pages? the LD needs to clear its own pages on its commit operation
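
A sketch of the commit split in item d) and the ISSUE above, with hypothetical names: if the LD does its own COMMIT to the data servers, it also clears the pages it committed; otherwise the regular NFS commit path does both.

/* Sketch only: not the real commit path. */
#include <stdio.h>

struct page_list { int npages; };

struct layoutdriver {
    /* return number of pages committed and cleared, or -1 to decline */
    int (*commit)(struct page_list *pages);
};

static void nfs_regular_commit(struct page_list *pages)
{
    printf("COMMIT to the MDS, then clear %d page(s)\n", pages->npages);
    pages->npages = 0;
}

static void commit_dirty_pages(struct layoutdriver *ld, struct page_list *pages)
{
    if (ld && ld->commit && ld->commit(pages) >= 0)
        return;                  /* LD committed and cleared its own pages */
    nfs_regular_commit(pages);
}

static int files_commit(struct page_list *pages)
{
    int n = pages->npages;
    printf("files LD: COMMIT to each data server, clear %d page(s)\n", n);
    pages->npages = 0;
    return n;
}

int main(void)
{
    struct layoutdriver files_ld = { .commit = files_commit };
    struct page_list dirty = { .npages = 16 };

    commit_dirty_pages(&files_ld, &dirty);  /* LD handles commit and cleanup */
    return 0;
}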


5) LAYOUTCOMMIT:

               called on fsync, stat calls, close, lock, locku,
               on a truncating setattr? triggered by a user-land call.

ISSUE: garth: preferred that it is done prior to any getattr, but getattr is always called.

benny: only the stat system call does a getattr.

don't overwrite the file size on the client on getattr if we haven't sent a LAYOUTCOMMIT.


added a flag to the inode structure: writes have happened, no LAYOUTCOMMIT yet. the flag is set when writes are issued, used to trigger a LAYOUTCOMMIT on fsync, etc., and then cleared.
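
A sketch of the flag described above (the field name is made up): set when writes go out through a layout, checked on fsync and similar triggers to decide whether a LAYOUTCOMMIT is needed, then cleared.

/* Sketch only: hypothetical flag name, not the real nfs_inode. */
#include <stdbool.h>
#include <stdio.h>

struct nfs_inode {
    bool need_layoutcommit;   /* writes done through the layout, not committed */
};

static void pnfs_write_done(struct nfs_inode *ino)
{
    ino->need_layoutcommit = true;
}

static void nfs_fsync(struct nfs_inode *ino)
{
    if (ino->need_layoutcommit) {
        printf("sending LAYOUTCOMMIT before returning from fsync\n");
        ino->need_layoutcommit = false;
    }
}

int main(void)
{
    struct nfs_inode ino = { false };
    pnfs_write_done(&ino);
    nfs_fsync(&ino);   /* triggers LAYOUTCOMMIT */
    nfs_fsync(&ino);   /* nothing to do */
    return 0;
}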

ISSUE: which creds to use? add a pointer to the write creds and bump the counter; use the NFS open context

ISSUE: who constructs the LAYOUTCOMMIT? currently generic pNFS; needs to be an LD call because of different byte ranges etc.

LD -> LAYOUTCOMMIT: send an array of commits. the block layout uses the layout update structure; object ...


Small files

       currently three round trips: OPEN, READ (whole file), CLOSE
               using the current stateid = down to one
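
A conceptual sketch of the "down to one" idea: OPEN, READ (using the current stateid from the OPEN), and CLOSE batched into a single NFSv4 compound so a whole small file comes back in one round trip. This is just the op sequence, not client code.

/* Sketch: the op sequence for a one-round-trip small-file read. */
#include <stdio.h>

enum nfs_op { OP_PUTFH, OP_OPEN, OP_READ, OP_CLOSE };

int main(void)
{
    enum nfs_op compound[] = { OP_PUTFH, OP_OPEN, OP_READ, OP_CLOSE };
    const char *names[] = { "PUTFH", "OPEN", "READ", "CLOSE" };
    int n = sizeof(compound) / sizeof(compound[0]);

    printf("one compound, %d ops:", n);
    for (int i = 0; i < n; i++)
        printf(" %s", names[compound[i]]);
    printf("\n");
    return 0;
}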