App Container has vanished


#1

Hi there,

Just encountered a very odd error, which seems to have rendered one of our dev boxes unusable. Here’s the chain of events:

  1. Yesterday we moved a device from one app to another. At that point the app was in an error state, which caused continuous restarts of the container.
  2. The device was left in this error state. I’m not sure how often it restarted, but it was certainly hundreds of times.
  3. At some point overnight the device turned itself off. Since then, the following error is reported every time the device is restarted:

29.03.17 10:47:02 (+0200) Failed to start application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75' due to 'Container command not found or does not exist.'

I’ve looked at the root OS, and sure enough, docker ps gives me:

root@efd141c:~# docker ps
CONTAINER ID   IMAGE                    COMMAND        CREATED        STATUS         PORTS   NAMES
55ac49a051eb   resin/amd64-supervisor   "/sbin/init"   22 hours ago   Up 8 minutes           resin_supervisor

I am unable to purge the device:

29.03.17 10:56:20 (+0200) Purging /data
29.03.17 10:56:20 (+0200) Killing application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75'
29.03.17 10:56:20 (+0200) Killed application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75'
29.03.17 10:56:20 (+0200) Purged /data
29.03.17 10:56:20 (+0200) Installing application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75'
29.03.17 10:56:20 (+0200) Installed application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75'
29.03.17 10:56:20 (+0200) Starting application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75'
29.03.17 10:56:20 (+0200) Failed to start application 'registry.resin.io/chilli/ad1a5ff57a83bab1a5813a71a0fba3b0394ced75' due to 'Container command not found or does not exist.'
29.03.17 10:56:20 (+0200) Error purging /data: Error: (HTTP code 500) server error - Container command not found or does not exist.

I then moved the device to another app (one that isn’t in an error state), and the container started up properly with that app.

In short, it looks like the robustness of resin was proven again: even with a seemingly totally dead container, there was still a reasonably straightforward way to recover remotely. However, I’m still quite curious what happened for the resin app (and device) to get into this state, and why purging didn’t work. Is there some kind of internal error state that was triggered by the constant container restarts?

Cheers,
Corin


#2

Hey Corin,

I think you have hit an edge case with the constant restarts. The supervisor code tries to stop/kill the application and then purge the data. Since your application is constantly restarting or failing to start, the supervisor gives up. I have captured this as an issue on resin-supervisor: https://github.com/resin-io/resin-supervisor/issues/412

Thanks for reporting!
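
For anyone reading along, here is a rough model of why the purge can report an error even though /data itself was wiped. Going by the order of the log lines in the first post, the purge seems to be followed by a reinstall and restart of the app, and the failed start is what surfaces as "Error purging /data". This is only a sketch under that assumption and not the actual supervisor code: `purge_cycle`, `StartError`, `broken_start` and `working_start` are made-up names for illustration.

```python
class StartError(Exception):
    """Raised when the (hypothetical) container engine cannot start the app."""


def purge_cycle(app, start_app, log):
    """Model the kill -> purge -> install -> start order seen in the log.

    Returns True if the whole cycle completed, False if the final start failed.
    """
    log("Purging /data")
    try:
        log(f"Killing application '{app}'")
        log(f"Killed application '{app}'")
        log("Purged /data")  # the data wipe itself has succeeded by here
        log(f"Installing application '{app}'")
        log(f"Installed application '{app}'")
        log(f"Starting application '{app}'")
        start_app(app)  # the only step that can fail in this model
        log(f"Started application '{app}'")
        return True
    except StartError as err:
        # The start failure is reported against the purge as a whole.
        log(f"Failed to start application '{app}' due to '{err}'")
        log(f"Error purging /data: {err}")
        return False


def broken_start(app):
    """Stand-in for an app whose container command cannot be found."""
    raise StartError("Container command not found or does not exist.")


def working_start(app):
    """Stand-in for an app that starts normally."""
    return None


if __name__ == "__main__":
    events = []
    ok = purge_cycle("registry.example/app/abc123", broken_start, events.append)
    print("cycle completed:", ok)
    for line in events:
        print(line)
```

In this model the "Purged /data" line is always logged before the start is attempted, which would be consistent with the purge having worked while the broken app still failed to come back up.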


#3

Thanks for the feedback. It sounds very much like an edge case, and not really something we’re going to hit if our container is working correctly.