Can't capture signals from simple python application

support
docker

#1

Hi everyone,

I have been trying - with little to no avail - to capture SIGTERM from my application.
Sadly, when stopped, the container exits with return code 165 (???) and none of the code in the signal handler gets executed.

Here’s an overview of my setup:
Dockerfile

FROM resin/intel-nuc-python:3.5
WORKDIR /usr/src/app
COPY ./services/test2 ./
CMD ["python", "test2.py"]

test2.py

#!/usr/bin/env python3

import time
import signal
import logging


class Example():
  def __init__(self):
    logging.info('initing')
    self.running = True
    signal.signal(signal.SIGINT, self._int)
    signal.signal(signal.SIGTERM, self._term)

  def _int(self, signum=None, frame=None):
    logging.info('SIGINT called')
    logging.debug('signum: ', signum)
    logging.debug('frame: ', frame)
    self.running = False

  def _term(self, signum=None, frame=None):
    logging.info('SIGTERM called')
    logging.debug('signum: ', signum)
    logging.debug('frame: ', frame)
    self.running = False

  def run(self):
    while self.running:
      logging.info('running')
      time.sleep(1)

if __name__ == '__main__':
  logFormat = '%(asctime)s [%(levelname)s] %(message)s'
  logging.basicConfig(level=logging.DEBUG, format=logFormat)
  a = Example()
  a.run()

Log trail

Recreating wdb_toolbelt_resin_test2_1 ... done
Attaching to wdb_toolbelt_resin_test2_1
test2_1  | 2018-07-20 15:16:27,170 [INFO] initing
test2_1  | 2018-07-20 15:16:27,170 [INFO] running
test2_1  | 2018-07-20 15:16:28,171 [INFO] running
test2_1  | 2018-07-20 15:16:29,172 [INFO] running
[...]
test2_1  | 2018-07-20 15:16:45,190 [INFO] running
test2_1  | 2018-07-20 15:16:46,191 [INFO] running
<<< HERE I PRESS CTRL+C >>>
wdb_toolbelt_resin_test2_1 exited with code 165

Besides the fact that I don’t get the logging messages from the handler (not even by using docker log), I’m also puzzled by the 165 (same on this small example as on my actual code), indicating an exit code 165-128=37 (ENOLCK - No record locks available) which is probably the most cryptic thing I’ve seen all week :stuck_out_tongue:

Thanks in advance.


#2

Hey Miccio,

We are working to replicate this issue internally and will report back on our findings,

Thanks!


#5

Hey @miccio, sorry, it took a bit of digging to see what’s happening here. Is it correct, that the “log trail” you show there is running the project locally on your computer with docker-compose? Not running on a device.

So what I’ve found so far:

  • if you run the code with docker compose up, and hit CTRL-C, you won’t see any extra logs there, even though the container might have caught the signal. You can test this by adding some more sleep to the signal handler, and see that the stopping takes longer time there.
  • I’ve checked that the signal you want to catch is indeed SIGINT, docker will send a that first before sending a SIGKILL if needed
  • We’ve found that there’s an issue with our current base images, and signals are not actually passed down to the application. This seems to be traced back to the time when tini was integrated into the base images, as a more container friendly initsystem changelog. We are still investigating, but it looks like that since that switch to tini on 2017-06-20, the signal capture might not have worked. We’ll keep you posted!

In the meantime, I’ve put the code you had here into a test project to try things out:

Here if you see the Dockerfile, there are a number of different base images. Ours with a date tag. So for example using resin/intel-nuc-python:3.5-20170610 (or I used resin/intel-nuc-python:3.5-slim-20170610, the slim version for quicker downloads), that should work for you.

For example here it is that earlier image deployed on a resin device, and hitting container restart to send a signal, see the middle section, to show it’s working:

We’ll keep you posted regarding our investigation on the newer images, in the meantime likely pre-tini date-tagged, or alternatively non-resin base images are recommended.

Thanks for the report, and let us know if anything is unclear above, hope to get this sorted very soon!


#6

Hi @imrehg
and thanks for the extremely exhaustive answer! I am glad to hear that you guys are looking into this issue, as proper handling of termination signals is quite crucial to a few of our applications.

Here is my feedback regarding your notes:

  • when I collected the log trail, I was actually running the container using docker-compose remotely on a local device
  • to ensure that I wasn’t just seeing the logger simply cutting the logs right after termination thereby skipping the last few entries, I later inspected the logs (docker-compose logs [container-name]) and set up a plethora of tests involving side-effects, which seemed to prove that our signal handler code was not being executed
  • interesting about how I should listen to SIGINT rather than SIGTERM, which is what docker docs mostly refer to… I actually included both in my test application just to be safe, but I would like to read more about it. Could you please point me to the relevant docs?
  • I assumed it has something related to tini, which was mentioned countless times throughout my journey troubleshooting the issue. Sadly, even its own README seems pretty vague about what it actually does to signal, citing “It ensures that the default signal handlers work for the software you run in your Docker image. For example, with Tini, SIGTERM properly terminates your process even if you didn’t explicitly install a signal handler for it. :point_left: this one sentence honestly leaves me puzzled

Again thanks for taking the time investigating this. We will try using images with the intel-nuc-python:3.5-slim-20170610 or similar as you suggested. Hope you guys can sort this out soon too :wink:


Riccardo


#7

Hi @miccio one correction on my part, looking at the screenshot above, it is indeed SIGTERM that docker uses to clean things up (before the uncapturable SIGKILL). SIGNINT is the one that happens on CTRL-C so I think I got mixed up in the long debugging. Thanks for calling it out!

Is that a resinOS device? Since that’s not the usual path that we use, so I’m still not 100% sure how you are doing it (but probably got close enough when testing myself locally, on just a dev laptop), but the main thing is a the base image according to the above investigation, so it shouldn’t matter much.

Just one note, that I think there’s some time between sending the SIGTERM and then the SIGKILL by docker, will have to check how’s it on resinOS, but I think you’d have ~15 seconds to clean things up before the SIGKILL hits. Will have to recheck how much is it exactly, but to my best current knowledge it’s 15s on current resinOS/supervisor versions. It might not be the same value when you are running with your local docker (as the value is configurable, and I think we do configure it)


#8

Thanks for calling it out!

Glad I could help! :wink:

I’m still not 100% sure how you are doing it

I will try to clarify: I created a docker-compose project, based on the python multi-container template project and consisting of exactly one service. The service uses the Dockerfile attached to my first message. The deployment occurs using docker-compose with DOCKER_API_VERSION=1.13 DOCKER_HOST=tcp:// etc etc set. The box runs on resinOS-development.
Nevertheless, I am also led to believe that your local environment should not affect the issue, which you could successfully replicate.

Just one note, that I think there’s some time between sending the SIGTERM and then the SIGKILL by docker, will have to check how’s it on resinOS, but I think you’d have ~15 seconds to clean things up before the SIGKILL hits. Will have to recheck how much is it exactly, but to my best current knowledge it’s 15s on current resinOS/supervisor versions. It might not be the same value when you are running with your local docker (as the value is configurable, and I think we do configure it)

Regarding this last note of yours, according to your docker-compose reference you seem to fully support stop_grace_period, a feature which we’re not using yet but are planning on (as you may expect, it makes the cleanup time configurable). Could you confirm that - in light of your assumptions - the feature is actually supported?

Lastly, I’m curious to know whether you’re also getting the cryptic 165-128=37 exit code for the container (and application) :smile:

Have a great weekend!


#9

hey @imrehg,
do you have any update on this?

Thanks in advance!


#10

Sorry @miccio, no new update yet, our base images team is working on this…


#13

Just a quick update, @miccio, we’ve opened an issue on the base images repository, where you can follow the progress or leave comments. Thanks again for highlighting this!