Tuesday, January 25, 2011

Daemons die with bus error when their binaries live on NFS

We have some daemons executing on a number of hosts.

The daemon executables are very large binaries hosted on NFS.

When the binaries are updated on the NFS server, the daemons that were already running sometimes drop dead with a Bus error. My assumption is that the binaries are being replaced on the server in a way that's invisible to the VFS layer on the NFS clients, so the running daemons end up loading pages from the updated binary, which of course leads to madness.

We tried moving the new binaries into place with mv instead of copying them with cp, but that doesn't seem to fix it.

I'm considering simply mlock()'ing the binary in the daemon startup script, but surely there are magic NFS options or semantics that we should be abusing. Is there a better way to fix this?

  • This is a common issue with NFS. When you remove the file on the server, the NFS client still believes the stat information it has cached is correct, so when it goes to reload pages from the file it gets a bus error.

    What you want to do is move the existing binary aside, put the new binary in place, and remove the old one only after every machine has started using the new binary. Apache runs into the same problem when it mmaps served files from NFS and those files change underneath it.

    mbac32768: While that solves my immediate problem, it opens up a race condition. There's a span of time after the existing binary has been moved out of the way, but before the new binary has been moved into place, during which the binary won't exist at all.
    From karmawhore
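
One way to follow the answer's steps without the window the comment describes is to give the old binary a second name first, then rename the new binary over the live path. The sketch below is only an illustration, not something proposed in the thread: swap_binary and the paths are hypothetical names, it assumes it runs where the new binary already sits on the same filesystem as the live one (rename is only atomic within a filesystem), and it assumes that keeping the old inode linked is enough to keep already-running daemons happy, which is the answer's premise and has not been verified here.

```python
import os


def swap_binary(live_path, new_path, backup_path):
    """Replace live_path with new_path while keeping the old file alive.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Give the old binary a second name. The hard link keeps the old
    # inode around after the live name is re-pointed. os.link raises
    # FileExistsError if backup_path is already taken, which avoids
    # silently discarding an earlier backup that may still be in use.
    os.link(live_path, backup_path)

    # Rename the new binary over the live name. On POSIX this is a
    # single atomic step, so live_path always refers to either the old
    # or the new binary and is never missing.
    os.replace(new_path, live_path)

    return backup_path
```

Usage would look like swap_binary("/nfs/bin/daemon", "/nfs/bin/daemon.new", "/nfs/bin/daemon.old"), with those paths being placeholders. The backup would then be removed by hand once every host has restarted onto the new binary, as the answer suggests.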
