Services reported running but containers not responding

support

#1

We have a multi container setup with two services, called hardware and controller. In the resin dashboard they are both reported as running with no errors, however, when attempting to start a terminal session with either container the shell is unable to spawn.

The following message is displayed on the shell continually, I have attempted to restart the running containers but to no avail. The shell does work successfully when connecting to the Host OS which gives access to dmesg output to shine a little more light on the issue.

Spawning shell...
Application container must be running for a terminal to be started.
SSH session disconnected

Here is a section of the output from dmesg shortly after the controller service was last known to be running. We know that the service was running up until 7:15 as there are logs available which show the system operating at that time.

Any insight as to what is occurring in the following lines would be much appreciated? I have left the system powered on in its current state in order to make debugging possible. I am sure that manually power cycling the system would resolve the issue however I would like to avoid this solution as it will not be a solution available to us in production.

[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28493 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28495 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28512 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28514 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28513 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28515 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28520 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28522 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28521 (udevadm) with signal SIGABRT.
[Fri Jul 20 07:22:28 2018] systemd[1]: balena.service: Killing process 28523 (udevadm) with signal SIGABRT.

In order to paint the complete picture it is relevant to say that 7 minutes prior to the above logs (precisely the time time that we have the last logs available from the controller service) there was another malfunction on the system that I believe is due to our hardware setup. It could be the case that the hardware malfunction lead to balena killing the services in such a way that they were not recovered or not registered to be malfunctioning.

I am including the full output from dmesg that occurred prior to the services not responding. This could provide context or be completely irrelevant. If anyone can provide any insight into what is occurring in the following output that would be much appreciated though it is not my main concern at the moment.

[Fri Jul 20 07:15:08 2018] Internal error: Oops - undefined instruction: 0 [#1] SMP ARM
[Fri Jul 20 07:15:08 2018] Modules linked in: ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE nf_nat_masquerade_ipv4 br_netfilter nf_conntrack_netlink nfnetlink bnep hci_uart btbcm bluetooth i2c_dev brcmfmac brcmutil cfg80211 spidev rfkill cdc_acm bcm2835_gpiomem snd_bcm2835 snd_pcm snd_timer snd spi_bcm2835 i2c_bcm2835 fixed uio_pdrv_genirq uio sch_fq_codel
[Fri Jul 20 07:15:08 2018] CPU: 0 PID: 27537 Comm: udevadm Not tainted 4.9.59 #2
[Fri Jul 20 07:15:08 2018] Hardware name: BCM2835
[Fri Jul 20 07:15:08 2018] task: b7740ec0 task.stack: b0a76000
[Fri Jul 20 07:15:08 2018] PC is at di_write_lock+0x2c/0x40
[Fri Jul 20 07:15:08 2018] LR is at down_write+0x1c/0x6c
[Fri Jul 20 07:15:08 2018] pc : [<8043acf8>]    lr : [<807ee27c>]    psr: 60040013
                           sp : b0a77ca8  ip : b0a77c90  fp : b0a77cbc
[Fri Jul 20 07:15:08 2018] r10: b978e6e8  r9 : bc32f410  r8 : b0a77d9c
[Fri Jul 20 07:15:08 2018] r7 : b5570800  r6 : 00000000  r5 : 00000000  r4 : ae8a0f68
[Fri Jul 20 07:15:08 2018] r3 : b7740ec0  r2 : ffff0001  r1 : 00000000  r0 : b97c1838
[Fri Jul 20 07:15:08 2018] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[Fri Jul 20 07:15:08 2018] Control: 10c5383d  Table: 3097806a  DAC: 00000055
[Fri Jul 20 07:15:08 2018] Process udevadm (pid: 27537, stack limit = 0xb0a76210)
[Fri Jul 20 07:15:08 2018] Stack: (0xb0a77ca8 to 0xb0a78000)
[Fri Jul 20 07:15:08 2018] 7ca0:                   ae8a0f68 00000029 b0a77ce4 b0a77cc0 80429cd4 8043acd8
[Fri Jul 20 07:15:08 2018] 7cc0: b5570800 ae8a0f68 ae8a0f68 b0a77da4 b0a77d9c bc32f410 b0a77d54 b0a77ce8
[Fri Jul 20 07:15:08 2018] 7ce0: 8043c958 80429c40 00000000 be5b8c34 00000001 024200ca 00000000 80429a10
[Fri Jul 20 07:15:08 2018] 7d00: 20030113 00000000 80f74ac0 807ee21c b5570c10 8029cc30 00004010 b0a77d28
[Fri Jul 20 07:15:08 2018] 7d20: 8029cc30 807f06dc b0a77da0 b0a77e68 80f0414c ae8a0f68 b0a77da4 b0a77d9c
[Fri Jul 20 07:15:08 2018] 7d40: bc32f410 b978e6e8 b0a77d94 b0a77d58 8028ce54 8043c924 b0a77d7c b0a77da0
[Fri Jul 20 07:15:08 2018] 7d60: 80442798 0004093b b5570800 b0a77e68 00000001 80f0414c 00000000 2f2f2f2f
[Fri Jul 20 07:15:08 2018] 7d80: 00000000 00000018 b0a77ddc b0a77d98 8028f46c 8028cbfc fffffff6 2f2f2f2f
[Fri Jul 20 07:15:08 2018] 7da0: b0a77f5c 00000000 b0a77dcc 0004093b 8028d84c b8e7b015 b0a77e68 fefefeff
[Fri Jul 20 07:15:08 2018] 7dc0: 00000000 2f2f2f2f 00000000 00000018 b0a77e44 b0a77de0 8028f9d8 8028f42c
[Fri Jul 20 07:15:08 2018] 7de0: 8027a8cc 80f0414c 61c88647 80808080 d0d0d0d0 b0a77e68 00000000 00000003
[Fri Jul 20 07:15:08 2018] 7e00: 80117a18 8028e4ec b0a77e68 00004040 b0a77e44 0004093b 8028e4ec b0a77e68
[Fri Jul 20 07:15:08 2018] 7e20: ffffff9c b0a77f5c b0a77e68 00004000 b0a77f5c 00000000 b0a77e64 b0a77e48
[Fri Jul 20 07:15:08 2018] 7e40: 8028fce4 8028f71c 80f0414c ffffff9c b8e7b000 b0a77e68 b0a77f1c b0a77e68
[Fri Jul 20 07:15:08 2018] 7e60: 80291a14 8028fcbc bc32f410 b978e6e8 d58c7b9b 00000003 b8e7b011 b0a77fb0
[Fri Jul 20 07:15:08 2018] 7e80: bc32f410 b978e6e8 bb5a7838 00004010 00000002 000015ca 00000000 00000000
[Fri Jul 20 07:15:08 2018] 7ea0: 00000000 b0a77ea8 b0a77eec b0a77eb8 8026d980 802793b0 00000001 00000000
[Fri Jul 20 07:15:08 2018] 7ec0: 00000800 00004000 80f0555c b8e7b000 80f0555c 7eb948bc b8e7b000 00000000
[Fri Jul 20 07:15:08 2018] 7ee0: b0a77f1c 00000002 ffffff9c 804d150c b0a77f5c 0004093b 00004000 ffffff9c
[Fri Jul 20 07:15:08 2018] 7f00: b0a77f5c 80f0414c 7eb94cbc 00000400 b0a77f44 b0a77f20 80291b98 80291970
[Fri Jul 20 07:15:08 2018] 7f20: 00000000 00000000 b0a77f44 ffffff9c 7eb948bc b0a77f58 b0a77f94 b0a77f48
[Fri Jul 20 07:15:08 2018] 7f40: 80286d30 80291b50 b0a77f58 00000000 b0a77f6c 00000002 00000000 80186d0c
[Fri Jul 20 07:15:08 2018] 7f60: b0a77f94 0004093b 80179a3c 008daa00 008d3170 7eb948bc 00000055 80108948
[Fri Jul 20 07:15:08 2018] 7f80: b0a76000 00000000 b0a77fa4 b0a77f98 80286e20 80286cd4 00000000 b0a77fa8
[Fri Jul 20 07:15:08 2018] 7fa0: 80108780 80286e04 008daa00 008d3170 7eb948bc 7eb94cbc 00000400 ed080300
[Fri Jul 20 07:15:08 2018] 7fc0: 008daa00 008d3170 7eb948bc 00000055 7eb94cbc 008c0008 76ff9f10 00000000
[Fri Jul 20 07:15:08 2018] 7fe0: 00050d94 7eb9483c 00025037 76f4b186 60040030 7eb948bc 00000000 00000000
[Fri Jul 20 07:15:08 2018] [<8043acf8>] (di_write_lock) from [<80429cd4>] (aufs_read_lock+0xa0/0xc0)
[Fri Jul 20 07:15:08 2018] [<80429cd4>] (aufs_read_lock) from [<8043c958>] (aufs_d_revalidate+0x40/0x594)
[Fri Jul 20 07:15:08 2018] [<8043c958>] (aufs_d_revalidate) from [<8028ce54>] (lookup_fast+0x264/0x2fc)
[Fri Jul 20 07:15:08 2018] [<8028ce54>] (lookup_fast) from [<8028f46c>] (walk_component+0x4c/0x2f0)
[Fri Jul 20 07:15:08 2018] [<8028f46c>] (walk_component) from [<8028f9d8>] (link_path_walk+0x2c8/0x534)
[Fri Jul 20 07:15:08 2018] [<8028f9d8>] (link_path_walk) from [<8028fce4>] (path_lookupat+0x34/0x114)
[Fri Jul 20 07:15:08 2018] [<8028fce4>] (path_lookupat) from [<80291a14>] (filename_lookup+0xb0/0x120)
[Fri Jul 20 07:15:08 2018] [<80291a14>] (filename_lookup) from [<80291b98>] (user_path_at_empty+0x54/0x5c)
[Fri Jul 20 07:15:08 2018] [<80291b98>] (user_path_at_empty) from [<80286d30>] (SyS_readlinkat+0x68/0x130)
[Fri Jul 20 07:15:08 2018] [<80286d30>] (SyS_readlinkat) from [<80286e20>] (SyS_readlink+0x28/0x2c)
[Fri Jul 20 07:15:08 2018] [<80286e20>] (SyS_readlink) from [<80108780>] (ret_fast_syscall+0x0/0x1c)
[Fri Jul 20 07:15:08 2018] Code: e1a05001 e2800004 eb0ecd5a e5940028 (bd906cd4)
[Fri Jul 20 07:15:08 2018] ---[ end trace c2c8db724535b9c7 ]---
[Fri Jul 20 07:15:08 2018] cdc_acm 1-1.5:1.0: failed to set dtr/rts
[Fri Jul 20 07:15:08 2018] usb 1-1.3: USB disconnect, device number 12
[Fri Jul 20 07:15:08 2018] usb 1-1.3: new full-speed USB device number 17 using dwc_otg
[Fri Jul 20 07:15:08 2018] usb 1-1.3: New USB device found, idVendor=16c0, idProduct=0483
[Fri Jul 20 07:15:08 2018] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[Fri Jul 20 07:15:08 2018] usb 1-1.3: Product: USB Serial
[Fri Jul 20 07:15:08 2018] usb 1-1.3: Manufacturer: Teensyduino
[Fri Jul 20 07:15:08 2018] usb 1-1.3: SerialNumber: 3792990
[Fri Jul 20 07:15:08 2018] cdc_acm 1-1.3:1.0: ttyACM0: USB ACM device
[Fri Jul 20 07:15:08 2018] usb 1-1.5: USB disconnect, device number 16
[Fri Jul 20 07:15:09 2018] usb 1-1.5: new full-speed USB device number 18 using dwc_otg
[Fri Jul 20 07:15:09 2018] usb 1-1.5: New USB device found, idVendor=16c0, idProduct=0483
[Fri Jul 20 07:15:09 2018] usb 1-1.5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[Fri Jul 20 07:15:09 2018] usb 1-1.5: Product: USB Serial
[Fri Jul 20 07:15:09 2018] usb 1-1.5: Manufacturer: Teensyduino
[Fri Jul 20 07:15:09 2018] usb 1-1.5: SerialNumber: 3500980
[Fri Jul 20 07:15:09 2018] cdc_acm 1-1.5:1.0: ttyACM1: USB ACM device

#4

Hi, can you paste the output of

systemctl --state=failed

and

journalctl --no-pager -u balena

please?


#5

Thanks for your quick response, here are the outputs requested. I have just shown head and tail for the journalctl output as there were 8550 lines in total. Please let me know if there is anything else that would be useful.

root@efa3717:~# systemctl --state=failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
root@efa3717:~# journalctl --no-pager -u balena | head -n 2
-- Logs begin at Tue 2018-07-31 04:59:38 UTC, end at Tue 2018-07-31 05:03:50 UTC. --
Jul 31 04:59:38 efa3717 healthdog[17429]: time="2018-07-31T04:59:38.812194612Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = Unimplemented desc = unknown service types.API"
root@efa3717:~# journalctl --no-pager -u balena | tail -n 2
Jul 31 05:04:09 efa3717 healthdog[17429]: time="2018-07-31T05:04:09.530633079Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = Unimplemented desc = unknown service types.API"
Jul 31 05:04:09 efa3717 healthdog[17429]: time="2018-07-31T05:04:09.531686407Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = Unimplemented desc = unknown service types.API"

#16

can you try a reboot from the dashboard and let us know if this fixes the issue please? the hostOS seems to be functioning correctly and should be able to cope with a restart and upon restart balena should be able to correctly start the user container


#17

Following an attempt to restart or reboot from the dashboard I get the same error. I am tempted to power cycle the device manually however am concerned that doing so will prevent the root of the issue being identified.

33


#18

This sounds really similar to the issue I am experiencing on one of my devices:


#20

Can you share the dashboard link of the device so we can take a look please?


#22

I am afraid that I have power cycled the device and its no longer booting so there is nothing that can be done to debug further. Thanks all the same


#23

I have managed to get the device into much the same state, the dashboard link is https://dashboard.resin.io/devices/67bbe840a0ff3812d2e551c0c95e519e/summary