Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#476 closed defect (fixed)

stuck PT / name servers

Reported by: epruesse
Owned by: devel
Priority: normal
Milestone:
Component: Library (other)
Version:
Keywords:
Cc:

Description

I can confirm that PT servers sometimes get "stuck" and keep running at 100% CPU. I just had a few on my notebook, though I cannot reproduce it now. Attaching GDB was not possible: GDB claims that the program is already being debugged and that bin/arb_pt_server does not exist.

My best guess is that something is wrong with the select loop in aisc_accept_calls in the presence of signals (SIGPIPE and SIGHUP come to mind). Combining select() with signals is complicated, IIRC.
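
For reference, the usual concern with select() and signals is that a caught signal interrupts the call, which then returns -1 with errno set to EINTR. Below is a minimal sketch of a wait that retries in that case. `wait_readable` is a hypothetical helper made up for illustration; it is not the code in aisc_accept_calls.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not ARB code): wait until fd is readable or the
 * timeout expires. Returns 1 if readable, 0 on timeout, -1 on a real error. */
int wait_readable(int fd, int timeout_sec) {
    for (;;) {
        fd_set readfds;
        struct timeval tv;

        /* select() may modify both the set and the timeout,
         * so rebuild them on every iteration */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec  = timeout_sec;
        tv.tv_usec = 0;

        int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (rc > 0) return 1;          /* fd is readable */
        if (rc == 0) return 0;         /* timed out */
        if (errno != EINTR) return -1; /* genuine failure */
        /* EINTR: a signal handler ran; retry with a fresh timeout */
    }
}
```

Note that retrying restarts the full timeout, which is acceptable for a sketch but may matter in a real server loop.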

While I don't see how this should affect a running PT server, it should be noted that both on my box and at MPI the binaries can "switch" underneath a running PT server (although it's a symlink at MPI, so the inode will remain identical).

Change History (5)

comment:1 Changed 10 years ago by epruesse

stack trace on a stuck one:

#0  0x00007f7b9a9e8ab0 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f7b9a983383 in _IO_new_file_write (f=0x7f7b9ac8d780, data=0x7f7b9be3d000, n=34) at fileops.c:1276
#2  0x00007f7b9a9849d5 in new_do_write (fp=0x7f7b9ac8d780, 
    data=0x7f7b9be3d000 "[ptserver '-boot' took 1d0h4m24s]\nnning.\naric.arb.pt', 1.48 Gb) from disk\ntializing:\n- opening connection...\nWarning: old socket file '/tmp/arb_bsafaric_pt.socket' failed to unlink\n- init internal str"..., to_do=34) at fileops.c:530
#3  _IO_new_do_write (fp=0x7f7b9ac8d780, 
    data=0x7f7b9be3d000 "[ptserver '-boot' took 1d0h4m24s]\nnning.\naric.arb.pt', 1.48 Gb) from disk\ntializing:\n- opening connection...\nWarning: old socket file '/tmp/arb_bsafaric_pt.socket' failed to unlink\n- init internal str"..., to_do=34) at fileops.c:503
#4  0x00007f7b9a983b38 in _IO_new_file_sync (fp=0x1) at fileops.c:905
#5  0x00007f7b9a9781ea in _IO_fflush (fp=0x7f7b9ac8d780) at iofflush.c:43
#6  0x0000000000415677 in ARB_main(int, char**) ()
#7  0x00007f7b9a92ec4d in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>, init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fff14fc8528) at libc-start.c:226
#8  0x00000000004073f9 in _start ()

comment:2 Changed 10 years ago by epruesse

and here's an strace:

write(2, "AISC server: pipe broken\n", 25) = -1 EPIPE (Broken pipe)
rt_sigreturn(0xffffffff^C)                = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
[repeat forever]

comment:3 follow-up: Changed 10 years ago by epruesse

  • Resolution set to fixed
  • Status changed from new to closed

fixed by r11760

Signal handling is dangerous business though. There could be more issues.

comment:4 in reply to: ↑ 3 Changed 10 years ago by westram

Replying to epruesse:

> fixed by r11760

Never heard of that, good to know.

How did you track that down to that point in code?

comment:5 Changed 10 years ago by epruesse

I got lucky and used strace, which showed what was likely happening better than gdb and the fflush() frame in the stack trace did. See above: strace showed tons of write()s that each failed with EPIPE. So I looked for the place where that message ("AISC server: pipe broken") is issued and realized that the fputs() itself caused the signal to be re-emitted: writing the message to the broken pipe raises SIGPIPE again.

I guess the fputs() stuck out to me because I'd once read a book on Unix network programming that spent a lot of text on the difficulties of getting select + signals right. It's like exceptions, only worse. There isn't much you can safely do in the handlers in both cases because you mustn't cause another error while in there. For signals there's the additional worry that you don't even know where in the code you are: you can be in the middle of external library code as well.
