#476 closed defect (fixed)
stuck PT / name servers
Reported by: | epruesse | Owned by: | devel |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Library (other) | Version: | |
Keywords: | Cc: |
Description
I can confirm that PT servers sometimes get "stuck" and keep running at 100% CPU. I just had a few on my notebook, although I cannot reproduce it now. Attaching GDB was not possible: GDB claims that the program is already being debugged and that bin/arb_pt_server does not exist.
My best guess is that there is something wrong with the select loop in aisc_accept_calls in the presence of signals (SIGPIPE and SIGHUP come to mind). Combining select() with signals is tricky, IIRC.
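For illustration, one common way to make a select() call tolerate signals is to retry when it fails with EINTR. The following is only a minimal sketch under that assumption, not the code in aisc_accept_calls; the helper name `wait_readable` is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not ARB code): wait until fd becomes readable.
 * Returns 1 if readable, 0 on timeout, -1 on a real error.
 * A signal arriving during select() makes it fail with EINTR;
 * that is not a real error, so the call is retried. */
int wait_readable(int fd, long timeout_ms) {
    for (;;) {
        fd_set readfds;
        struct timeval tv;

        /* select() may modify both arguments, so rebuild them on every retry. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec  = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (rc > 0) return 1;          /* fd is readable */
        if (rc == 0) return 0;         /* timed out */
        if (errno != EINTR) return -1; /* genuine failure */
        /* EINTR: interrupted by a signal, loop and wait again. */
    }
}
```

Note that this sketch restarts the full timeout after every interruption, so a steady stream of signals could delay the timeout indefinitely; a real server might track the remaining time instead.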
While I don't see how this should affect a running PT server, it should be noted that both on my box and at MPI the binaries can "switch" underneath a running PT server (although it's a symlink at MPI, so the inode will remain identical).
Change History (5)
comment:1 Changed 11 years ago by epruesse
comment:2 Changed 11 years ago by epruesse
and here's an strace:
write(2, "AISC server: pipe broken\n", 25) = -1 EPIPE (Broken pipe)
rt_sigreturn(0xffffffff^C) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
[repeat forever]
comment:3 follow-up: ↓ 4 Changed 11 years ago by epruesse
- Resolution set to fixed
- Status changed from new to closed
fixed by r11760
Signal handling is dangerous business though. There could be more issues.
comment:4 in reply to: ↑ 3 Changed 11 years ago by westram
comment:5 Changed 11 years ago by epruesse
I got lucky and used strace, which showed what was likely happening better than gdb and the fflush() thing did. See above: strace was showing tons of write()s that each failed with EPIPE. So I looked for the place where that message was issued and realized that the fputs() itself caused the signal to be re-emitted, since the message is written to fd 2, which is itself the broken pipe.
I guess the fputs() stuck out to me because I'd once read a book on Unix network programming that spent a lot of text on the difficulties of getting select() and signals right. It's like exceptions, only worse. In both cases there isn't much you can safely do in the handlers, because you mustn't cause another error while in there. For signals there's the additional worry that you don't even know where in the code you are: you can be in the middle of external library code as well.
stack trace on a stuck one: