Thanks for that bit of information. However, a couple of things to note:
- The reason I mentioned GKE (though there are other options) is specifically so that you aren’t responsible for the correct operation of the Kubernetes cluster. It’s the same as was mentioned before: “If something breaks, Travis is an email away”. The same applies here, except that you contact the company responsible for hosting your cluster (in this case, that would be Google).
- We actually don’t have any operations people, because we outsource the hosting of the cluster. Yes, we have two or three people who are familiar with Kubernetes, but not on an operations level, because that isn’t needed.
- We’ve had zero downtime of our (hosted) Kubernetes cluster in the last three years or so (other than the one or two Google Cloud wide outages, unrelated to Kubernetes), so I’m very confident in its capabilities and stability.
That’s interesting to know. So I’m assuming there is a team of several volunteers who have access to the Travis configuration? That situation would be similar to having a team with access to the Git repo that hosts the Jenkins configuration.
Indeed, it would require knowledge of Jenkins (this is the one thing that does require set-up knowledge, because you need to configure it yourself instead of having Travis do it for you, but that’s also why I’m offering to help out), and it would require some knowledge of Kubernetes, not from an operations perspective but from a user perspective (as in, “how to use these 10 CLI commands”). I don’t think that’s impossible to achieve, but maybe this is where we disagree?
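To give a concrete idea of what that user-level knowledge looks like, the day-to-day interaction is roughly the commands below (the `jenkins` namespace and the pod name are just placeholders for illustration, not our actual setup):

```
# See which build agents are currently running
kubectl get pods --namespace jenkins

# Inspect the output of a specific build agent
kubectl logs jenkins-agent-abc123 --namespace jenkins

# Debug an agent that won't start (events, resource limits, etc.)
kubectl describe pod jenkins-agent-abc123 --namespace jenkins

# Remove a stuck agent; Jenkins will schedule a fresh one
kubectl delete pod jenkins-agent-abc123 --namespace jenkins
```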
Some of our services have fast suites, some don’t. The longest takes about 45 minutes; Rust’s is (I believe) an hour or so? But it really doesn’t matter, since the speed of a suite has no impact on how many jobs you can run in parallel in this set-up. You can configure it to (automatically) spin up 2 machines if your queue needs that many resources to run everything in parallel, or go to 10 if required. Your configuration determines the lower and upper bounds.
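As a rough sketch of what setting those bounds looks like on GKE (the cluster and node pool names here are made up, and the exact flags depend on how the cluster was created), the cluster autoscaler is enabled with a minimum and maximum node count:

```
gcloud container clusters update ci-cluster \
  --node-pool default-pool \
  --enable-autoscaling \
  --min-nodes 2 \
  --max-nodes 10 \
  --zone us-central1-a
```

Kubernetes then adds nodes when queued jobs can’t be scheduled within the current capacity, and removes them again once the queue drains.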
I still believe there’s nothing I’ve read here, or seen in the configurations and the Travis runs, that makes this fundamentally impossible. While it is different from the current set-up, I don’t think it’s worse, and in my experience, done right it can actually deliver much more CI capacity at lower cost (both in hours and in hardware).
I’m not saying getting there is easy, but I did want to throw this curveball and offer my insights/help in case any of you think it’s worth considering.