Updated docker-compose support


#1

So I’ve been playing with the multi-container setup from https://resin.io/engineering/our-first-experiments-with-multi-container-apps/ over the last few days. It was a good start, but the second commit to the repo broke the build.

It also used old docker and compose versions that aren’t compatible with the v2 docker repos, which are now pretty much mandatory when using docker hub itself.

I made a few changes to make it a bit more resilient too:

  • Updated docker
  • Updated docker-compose
  • Update to latest dind wrapper
  • Add error checking and restart of stack on container failures
  • Add persistent caching of docker builds
  • Add auto-cleanup of untagged builds
  • Add auto-cleanup of exited containers on boot
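For anyone who wants a feel for the cleanup and restart items above, here is a minimal sketch. It is not taken from the repo: the function names and the 5-second pause are made up for illustration, and it assumes docker and docker-compose are on the PATH inside the dind container.

```shell
#!/bin/sh
# Hypothetical helpers, not the repo's actual start script.

# Remove containers that have exited (e.g. left over from a previous boot).
cleanup_exited_containers() {
    ids=$(docker ps -aq -f status=exited)
    # Only call `docker rm` when there is something to remove.
    [ -z "$ids" ] || docker rm $ids
}

# Remove untagged ("dangling") images left behind by rebuilds.
cleanup_untagged_images() {
    ids=$(docker images -q -f dangling=true)
    [ -z "$ids" ] || docker rmi $ids
}

# Bring the stack up; when any container stops, compose exits and we start over.
# The 5-second pause is an arbitrary choice for this sketch.
run_stack_forever() {
    while true; do
        docker-compose up --abort-on-container-exit || true
        echo "stack exited, restarting" >&2
        sleep 5
    done
}
```

A start script could call cleanup_exited_containers and cleanup_untagged_images once at boot and then hand over to run_stack_forever.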

It’s running pretty well for me now. Hopefully someone can find another use for it; I’m using it now for a few services.


#2

Really nice work @justin8!

Any interest in doing a slightly more detailed write up for our blog?

Craig


#3

Yeah, I could do that. Just PM or email me about it


#4

Hey,

first of all, thank you @justin8 for sharing this. Our current application is based on your approach while we wait for first class multicontainer support.

Disclaimer: This reply is not intended as a request to fix this feature, as we understand it has been released out of goodwill. It is just to inform the community of a possible bug.

We are experiencing problems when trying to run this under the latest version of resin.io available in the dashboard. Using the latest commit on justin’s github we obtain:

  • On version Resin OS 2.0.6+rev3

Everything works as expected :slight_smile:

  • On version Resin OS 2.0.8+rev1

It does not work at all, producing the log at the end of the post. It also hangs the application forever: it does not respond to restarts or reboots, and the device freezes if moved to another application.

We are using the exact same setup, so the only change is the resin version (this implies a major bump in the supervisor version, from 4.3.1 to 5.1.0). We are not even sure if the error is on our end (i.e. network setup)…

The log, as promised. Many of the errors and warnings are also present in the working version without causing any issue. The problem arises when trying to fetch images from the docker registry:

19.07.17 18:48:15 (+0200) WARN[0003] Error getting v2 registry: Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io: no such host 

Obviously the device has internet access…

full log:

19.07.17 18:48:11 (+0200) ++ find docker-compose.yml static-server transmission -maxdepth 0 -type d
19.07.17 18:48:11 (+0200) + containers='static-server
19.07.17 18:48:11 (+0200) transmission'
19.07.17 18:48:11 (+0200) + mount -o bind /data /var/lib/docker
19.07.17 18:48:11 (+0200) + docker info
19.07.17 18:48:11 (+0200) + wrapdocker echo
19.07.17 18:48:12 (+0200) INFO[0000] New containerd process, pid: 117
19.07.17 18:48:12 (+0200)             
19.07.17 18:48:13 (+0200) INFO[0001] Graph migration to content-addressability took 0.00 seconds 
19.07.17 18:48:14 (+0200) INFO[0001] Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address version=1.11.1
19.07.17 18:48:14 (+0200) INFO[0002] API listen on /var/run/docker.sock           
19.07.17 18:48:14 (+0200) ERRO[0002] Couldn't run auplink before unmount /var/lib/docker/tmp/docker-aufs-union490127216: exec: "auplink": executable file not found in $PATH 
19.07.17 18:48:14 (+0200) 
19.07.17 18:48:14 (+0200) + for i in '$containers'
19.07.17 18:48:14 (+0200) + docker-compose build static-server
19.07.17 18:48:15 (+0200) Building static-server
19.07.17 18:48:15 (+0200) WARN[0003] Error getting v2 registry: Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io: no such host 
19.07.17 18:48:15 (+0200) ERRO[0003] Attempting next endpoint for pull after error: Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io: no such host 
19.07.17 18:48:15 (+0200) Step 1 : FROM resin/armhf-alpine
19.07.17 18:48:15 (+0200) ERRO[0003] Not continuing with pull after error: Error while pulling image: Get https://index.docker.io/v1/repositories/resin/armhf-alpine/images: dial tcp: lookup index.docker.io: no such host 

#5

That’s odd. Are you able to get a shell on the machine?

Testing how docker is working in the container there should give some good direction to find the cause of the error.
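As a rough idea of what to check from a shell in the container, the sketch below lists the configured nameservers and tests whether a name resolves. The function names are made up for this example, and it assumes awk and getent are available in the image.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers for a shell inside the container.

# Print the nameservers listed in a resolv.conf-style file
# (defaults to /etc/resolv.conf when no path is given).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Succeed only if the name resolves through the system resolver.
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1
}
```

Running list_nameservers followed by something like can_resolve registry-1.docker.io || echo "lookup failed" should show whether the "no such host" error comes from the container's resolver configuration.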


#6

Hi justin,

I’ve been trying to debug this as much as possible. First of all, I found another part of the log that could be of use; I do not know why I left it out of my previous post.

An error is being raised when executing wrapdocker (last line):

20.07.17 10:19:35 (+0000) /sbin/udevd
20.07.17 10:19:36 (+0000) ++ find docker-compose.yml static-server transmission -maxdepth 0 -type d
20.07.17 10:19:36 (+0000) + containers='static-server
20.07.17 10:19:36 (+0000) transmission'
20.07.17 10:19:36 (+0000) + mount -o bind /data /var/lib/docker
20.07.17 10:19:36 (+0000) + docker info
20.07.17 10:19:36 (+0000) + wrapdocker echo
20.07.17 10:19:37 (+0000) ln: failed to create symbolic link ‘/sys/fs/cgroup/systemd/name=systemd’: 

I’ve also tried to debug things inside the container, but it keeps restarting every 5 or 10 seconds, so it is virtually impossible to run any commands. I managed to ssh into it successfully only once, for about 5 minutes. docker info and docker images ran with a huge lag, but the output seemed correct.

I’ve also tried to run the container in local mode from the host, but I can’t reproduce the exact environment in which resin-supervisor runs the container (I even tried mounting the /data directory). The start script fails:

mount -o bind /data /var/lib/docker
returns
mount: permission denied
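One possible explanation, offered only as a guess: mount needs the CAP_SYS_ADMIN capability, which a container started without --privileged normally lacks. The sketch below checks the effective capability mask of the current process; the function names are made up, and it assumes a Linux /proc filesystem.

```shell
#!/bin/sh
# Hypothetical check, assuming the failure is a missing capability.

# Print the effective capability mask (hex) of the current process.
current_capeff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

# Succeed if the given hex mask includes CAP_SYS_ADMIN (capability number 21).
has_cap_sys_admin() {
    [ $(( (0x$1 >> 21) & 1 )) -eq 1 ]
}
```

Inside the container, has_cap_sys_admin "$(current_capeff)" || echo "no CAP_SYS_ADMIN" would indicate whether the bind mount can work at all in that environment.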

Another odd thing is that while the application is restarting every 5 seconds, the device does not respond to any resin command: it does not update, restart, reboot, or purge data, and it cannot be moved to another application. From resin’s point of view it is like a bricked device (in fact I had to reflash the SD card in order to update the code).

My other devices using the previous OS version still work fine…


#7

Hey @Jarias,

@dagrooms52 and I have reproduced this. It appears that this may be caused by the change to DNSMasq or possibly the update to docker 17.03, and that we haven’t tested dind since this was added.

We run host networking to containers by default, so wiping the iptables rules also has the nasty knock-on effect of stopping the supervisor from being able to request update images (and probably more).

It should be possible to create and use another network bridge as explained here: https://stackoverflow.com/questions/32334167/is-it-possible-to-start-multiple-docker-daemons-on-the-same-machine/34058675#34058675 but I haven’t had any luck so far. I’m posting it here in case you have a chance to try it before I have another go at it.

I’ll let you know if I make any headway.
Craig


#9

When using a separate network bridge to start the docker daemon, internet access dies until I switch DNS lookup from “172.17.0.1” to “8.8.8.8”. It does open me up to using the docker daemon from inside my container, although all Resin docker operations (most importantly the upgrade operation) become unavailable. If starting the docker daemon is specified in the start command of the Dockerfile, this essentially bricks the device; if I just specify an interactive shell to start in, the device can be power cycled to refresh the image, and is then able to pull updates. This is useful to know for anyone else who is trying to debug the issue, as it keeps you from having to re-flash the card every time.

172.17.0.1 is the same address Resin is using in their docker bridge, so something makes the docker daemon running in the final container interfere with the original docker bridge, although the inner docker daemon is using a different bridge.

On my working device, the docker0 bridge is at 172.17.0.1 and /etc/resolv.conf points to 127.0.0.2. It is running ResinOS 2.0.4, docker version 17.06-ce. It uses the exact same wrapdocker that I’m using on the Ras Pi.

It seems that somewhere between ResinOS 2.0.4 and ResinOS 2.2.0, the OS became more dependent on communicating through the docker0 bridge. It would be great to know what exactly changed and why.

Update: I did some tests on the OS versions I have available, which include 2.0.4, 2.0.6, 2.0.8, and 2.2.0. The latest version that will successfully run the hack in its current state is 2.0.6. This is the last version that resolves DNS through 127.0.0.2; after it, the devices switch to using the docker IP address.
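For reference, the separate-bridge setup described in this post could look roughly like the sketch below. It only prints the commands rather than running them. The bridge name, address, socket and directory paths are all placeholders, not the values actually used here. It also assumes a Docker release that ships dockerd and --data-root; older releases use docker daemon and --graph instead.

```shell
#!/bin/sh
# Hypothetical dry run: prints the commands, does not execute them.
# $1 = bridge name, $2 = bridge address in CIDR form, $3 = DNS server
inner_daemon_commands() {
    bridge=$1; cidr=$2; dns=$3
    # Create a dedicated bridge so the inner daemon stays off docker0.
    echo "ip link add name $bridge type bridge"
    echo "ip addr add $cidr dev $bridge"
    echo "ip link set $bridge up"
    # Start the inner daemon on that bridge with an explicit DNS server
    # and its own socket, pidfile and state directories.
    echo "dockerd --bridge=$bridge --dns=$dns" \
         "--host=unix:///var/run/docker-inner.sock" \
         "--pidfile=/var/run/docker-inner.pid" \
         "--exec-root=/var/run/docker-inner" \
         "--data-root=/var/lib/docker-inner"
}
```

For example, inner_daemon_commands dind0 172.30.0.1/16 8.8.8.8 prints four lines that can be reviewed before running them by hand.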


#10

@dagrooms52, one thing I think is notable is that resinOS 2.0.6 was the last version of resinOS with docker v1.10.3; all later versions have docker 17.03.1 as the docker engine. My suspicion is that something in the way docker-in-docker works broke with this version bump.


#12

I’m also experiencing this issue using a fork of justin8’s multi container app. My device shows up online in the dashboard, but the application dies as soon as it is trying to download some content with an error

tcp: lookup index.docker.io: no such host

EDIT: removed a chunk of text here because I realized I had pushed some changes to my /etc/resolv.conf file while trying to debug the issue above


#13

@davo regarding the changes to resolv.conf, did you try explicitly stating the IP address for index.docker.io?
https://linuxconfig.org/docker-dial-tcp-lookup-index-docker-io-no-such-host-fix


#14

For posterity, a fork with some hacks to make docker-in-docker/compose run on 2.3.0+rev1 (x86_64).

Notable changes include setting DOCKER_OPTS=--dns 8.8.8.8 --dns 8.8.4.4 and adding DNAT iptables rules to re-write DNS on the docker-in-docker host.
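The fork's actual rules are not quoted in this thread, so the following is only a guess at their general shape. DOCKER_OPTS is copied from the post; the dns_dnat_rules name and the choice of the PREROUTING chain are assumptions. The function prints the iptables commands instead of applying them.

```shell
#!/bin/sh
# DNS options for the inner daemon, as quoted in the post above.
DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4"

# Hypothetical helper: print (not apply) NAT rules that rewrite DNS traffic
# arriving at the docker-in-docker host so that it goes to the server in $1.
# The PREROUTING chain is an assumption, not taken from the fork.
dns_dnat_rules() {
    for proto in udp tcp; do
        echo "iptables -t nat -A PREROUTING -p $proto --dport 53 -j DNAT --to-destination $1:53"
    done
}
```

Calling dns_dnat_rules 8.8.8.8 prints one rule for UDP and one for TCP.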

– ab1


#15

God bless you @ab1. I have been struggling to get multi container to work on a 2.3.0 device for days now after an update from 2.0.6. Your “hacks” really do the trick. I also like that you use Alpine for a lighter base image.

Do you still consider this a WIP? Do you think it could be used in production?

@craig-mulligan: could you share an update regarding multi container support? And/or did you have any luck making this work on recent versions of resinOS?

EDIT : never mind @craig-mulligan, I just stumbled upon State update: Multicontainer


#16

@lv82 thank you. We are running this approach currently in production on a couple of Intel NUC devices, which appear to be more or less stable.

Given resin.io are imminently coming out with native multi-container support, I would be inclined to wait…


#17

An update from my side: I’ve been able to successfully run multi container apps with version 2.9.7+rev1


#18

I tried deploying @ab1’s fork to a resin device on a raspberrypi-3 and I get the following error when the app within resin starts building the docker images for the multiple apps:

25.02.18 23:56:21 (-0600) + docker-compose build static-server
25.02.18 23:56:28 (-0600) Building static-server
25.02.18 23:56:28 (-0600) Step 1/4 : FROM resin/armhf-alpine
25.02.18 23:56:28 (-0600)  ---> 6d41b3c0006a
25.02.18 23:56:28 (-0600) Step 2/4 : RUN apk add --update ruby
25.02.18 23:56:28 (-0600)  ---> Running in 3b9cf943e212
25.02.18 23:56:30 (-0600) fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/armhf/APKINDEX.tar.gz
25.02.18 23:56:35 (-0600) fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/armhf/APKINDEX.tar.gz
25.02.18 23:56:35 (-0600) ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.7/main: temporary error (try again later)
25.02.18 23:56:35 (-0600) WARNING: Ignoring APKINDEX.70c88391.tar.gz: No such file or directory
25.02.18 23:56:40 (-0600) ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.7/community: temporary error (try again later)
25.02.18 23:56:40 (-0600) WARNING: Ignoring APKINDEX.5022a8a2.tar.gz: No such file or directory
25.02.18 23:56:40 (-0600) ERROR: unsatisfiable constraints:
25.02.18 23:56:40 (-0600)   ruby (missing):
25.02.18 23:56:40 (-0600)     required by: world[ruby]
25.02.18 23:56:41 (-0600) ERROR: Service 'static-server' failed to build: The command '/bin/sh -c apk add --update ruby' returned a non-zero code: 1

I suspect some kind of DNS issue? You guys seen this before and have any suggestions?

@davo Did you need to make any changes to get 2.9.7+rev1 working with this approach?


#19

@ab1: I would love to wait for the official support but we are stuck on 2.3.0+rev1 right now. Our 3G modem no longer works with earlier versions. So until we figure this out, I think I will stick to your code. It works beautifully.

@davo: You mean with native support from resinOS? Or with ab1’s hacks?

donfmorrison: I successfully deployed ab1’s code on a PiComputeModule 3 running resinOS 2.3.0+rev1(dev). Which release are you on?

PS: apologies donfmorrison, new users cannot mention more than two users with “@” in their posts…


#20

Glad you got it working. “Temporary error” in this context means (likely) DNS resolution failure. This hack is quite brittle and will fail when the underlying OS changes Docker bridge config. However a simple change to daemon.json should resolve…
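The post does not say which daemon.json change is meant. One plausible candidate, shown purely as an assumption, is pinning the DNS servers that the inner daemon hands to its containers via the "dns" key. The write_daemon_json name is made up; a real daemon reads /etc/docker/daemon.json by default.

```shell
#!/bin/sh
# Hypothetical helper: write a daemon.json that sets explicit DNS servers.
# $1 = output path (the daemon's default location is /etc/docker/daemon.json).
write_daemon_json() {
    cat > "$1" <<'EOF'
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
}
```

The inner daemon would need a restart after the file is written for the setting to take effect.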


#21

@lv82 I’m using 2.9.7+rev2 right now. I’ll try it with 2.3.0+rev1 if @ab1’s recommendation to modify daemon.json doesn’t work. I’m sure DNS resolution is the issue, and getting dnsmasq to use the proper network should fix this.


#22

@donfmorrison @lv82 I had to use ab1’s recommendation while using 2.6, and AFAIK Resin hasn’t released the native support, so I am not using either of those two approaches.

I’m just using the barebones docker-compose example that justin8 posted at the beginning of this thread. It wasn’t working on 2.6.0-ish versions, but now it is with 2.9.7. FYI, I am using a Raspberry Pi 3.

Here is my app in case you are curious: https://github.com/davoclavo/openag_resin/blob/master/Dockerfile