I noticed that my drone builds sometimes failed with an error like “Failed to connect to…. Operation timed out”. Most of the time I just killed the docker container of the drone agent and the problem was gone. The last time, however, this did not help (yes, fixing symptoms without understanding the cause is not the best idea).
Which part of the error message is not clear? Let me explain the setup first and my findings afterwards.
The system is set up as a rancher.com managed docker environment with cattle as the orchestration type. Normally I have two hosts up and running in this cattle environment. Each of the two stacks represents one service together with all the containers required to run it. The WebLoadBalancer is the main entry point to the docker environment: it handles things like SSL offloading and is the only place where ports are opened on the host itself. So all https://…. calls are routed through the web load balancer.
I had to take down one of the hosts for maintenance and kept it down longer because the proxmox.com VM host was under maintenance too. As designed, rancher moved the drone agent to the one remaining host. Up to this point, everything worked as expected.
The next day I pushed a change to a repository and expected drone to pick it up and run the build. The build job got scheduled, but the build failed after two minutes. I restarted the build with the same result. I killed the drone agent container and restarted the build again, still with the same result. I had to investigate this in more detail, but even after checking network connectivity, dns records, the rancher configuration, etc., I had no idea why git fetch in the first build step ran into a timeout. I did not find any issues with my setup, and yet something had to be wrong. Gogs could trigger the build via webhooks, and drone was able to log in against the gogs backend. So these services could definitely talk to each other.
Digging deeper, I saw that drone uses a container “plugins/git” to fetch the git repository from the gogs server. This container is started and destroyed by drone itself and is not part of the rancher infrastructure. Rancher has no clue about this container or about the network connectivity that would have to be configured for it. At first glance I saw no issues, though, and at 01:15 I decided to take a break.
My next idea was to boot up the second docker host to restore the situation from before the problem became persistent. I forced rancher to move the drone agent container to the second host, and there the build ran without an error. Now I understood what the issue was. It looks like the “plugins/git” container is not able to reach the “public-ip” of the host it runs on to get the sources. If the container runs on a different host than gogs, the same request works.
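A small connectivity probe would have made this visible much earlier. The following is only a minimal sketch of such a check, not something drone or rancher provides: the function name `can_connect` and the host name in the usage comment are made up for illustration. The idea is to run it once from a plain container on the host where gogs runs and once from a container on the other host, then compare the results.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False


# Example usage (hypothetical host name, replace with the public name of gogs):
#   print(can_connect("git.example.com", 443))
```

If the probe returns False only when it runs on the same host as gogs, that points to the host's own public address not being reachable from inside such a container, which matches what the failing git fetch showed.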
My solution is quite simple. I adjusted the scheduling rules of the drone agent so that it is never started on the host where gogs runs, which ensures that the fetch always goes from one host to the other. There is also another solution: combine gogs and drone into one stack and link them using their docker managed network addresses. I did not test this solution, but it should work as well, since I have seen other tutorials that use such a single full stack instead of two stacks like I do.