moleculer — deployment thoughts

Dan Kuida
Hitchhikers guide to the (software) galaxy
Nov 4, 2018


Key takeaways

  1. the Node.js cluster module and containers !== friends; don’t use them together
  2. moleculer services split into containers/pods extremely well and are easily mixed and matched
  3. the retry feature is extremely useful and works well
  4. the circuit breaker, at least in this limited scenario, is not that useful (but it’s complicated)

Background

It seems as if these days everything is a microservice; from day zero of a project the first word is microservice. Everyone talks about it, everyone knows about it. But in fact, in the several dozen interviews I conducted over the last few months, the candidates who confidently stated they knew the subject had probably learned about it from rumours, or by watching a flawed implementation at their place of work.

The same is true of knowledge of scrum and agile in the industry (or rather, the lack of it), but that’s a different topic.

Unfortunately, many people do not do the due diligence of reading the original material and definitions; they just trust what they see others do, keep failing in the same places, and reinvent solutions that someone else has probably already tried and seen fail.

sidenote: I think that is one of the core pitfalls of software development in general; developers are too arrogant to learn from others before they start to implement

There are many principles around distributed systems and microservices in particular (here, here, here and, last, my all-time favourite), many of which deal with handling errors, which in such systems are a fact of life rather than a rare case.

Recently moleculer caught my attention, to the extent that it just seemed too good to be true: the pluggable transport, ES7 syntax, versioning, fault tolerance, service resolution, and more and more.

One thing that seemed odd at first glance, and which touches a pillar of microservices, is deployment: it seemed that the whole structure had to be kept in one folder, and, on top of that, the official moleculer runner did not feel natural inside a distributed kubernetes environment. There was, in fact, an example of how to use docker compose, but that led to even more suspicion because:

if you put everything into the same folder, that means the same repository; someone, somewhere, sometime will make a reference between the files, and that will lead to tight coupling of the code, which is against the principle of autonomous deployment

Intro

All the code for the following example can be found in the following repositories on github, and on dockerhub as containers.

Disclaimer: like all spike code, the following does not represent my coding standards; it was written to reach the goal as fast as possible rather than to be beautiful code.

API-gateway, container

users-service, container

posts-service, container

Part one — let’s build 3 autonomous services

The first step is to build 3 dummy services that could be deployed autonomously and see how they interact.

moleculer-cli was used to generate those services. Even though the cli would normally generate them in 3 folders under one project, I ran it three times in different folders and initiated three different GitHub repos. Some cleaning was applied here and there, and the three services were ready.

The API gateway is separated into its own service: too often that main core interface towards the outside world starts to suffer from the “general folder” syndrome and becomes a general garbage can.

The actual services really just return arrays of data, nothing more.

In order to test our deployment process fast and conveniently, a drone.io build file, .drone.yml, is thrown in so the containers auto-publish to dockerhub. I run drone.io locally on my kubernetes cluster, thanks to the amazing addition of kubernetes as an orchestrator and the whole blessed separation of orchestration in containerd 1.1.

Adding transport

At this point a transport is needed. I did not want to use redis for pub/sub, for the simple sole reason that if you look at the enterprise version of redis, you have to work against sentinels, and you cannot subscribe against a sentinel, only publish. The kafka transport was not yet production grade based on the documentation, so I gave NATS a try. If you have read my previous posts, I am a big fan of rancher, so I just used a NATS installation from the catalog (helm) and had it up and running in a few minutes.

Each moleculer node needs a unique name, so the easiest solution is to use the built-in kubernetes environment variable HOSTNAME, which is the name of the pod.

To sum it up, at this point we have:

  • an API gateway
  • 2 services, registered on the gateway, that communicate with each other
  • a transport up and running
  • containers on docker hub

Two last points that didn’t feel right

point 1

I did not like the idea of using a runner I knew little about (and it is good that I didn’t; more on that later), so with a quick look into the documentation the following runner was created.

my entry point for “npm start”:

const { ServiceBroker } = require('moleculer');
const config = require('../moleculer.config');

// create a broker from the shared config, load every service
// definition found in the local ./services folder, then start
const broker = new ServiceBroker(config);
broker.loadServices('./services');
broker.start();

point 2

Another piece of advice was to extract variables into environment variables rather than using moleculer.config.js. This practice of atomic environment variables flying around may feel “devopsy”, but it is very brittle; on the other hand, code/configuration has to be DRY like any other part of the application. So I created a centralized folder, shared between the services, with all the configuration files in a structure that reuses and centralizes the common variables (log server host and port, redis and DB for mutual endpoints where applicable), and mapped it into each container.

an example of what such a configuration looks like:

The test phase

In order to see how resilient the services are out of the box, a jMeter test was created that runs 1000 calls with 10 concurrent users, at first without any fault tolerance, just as-is. A configuration option is exposed that kills the process after a certain number of calls; that is of course the most radical case that could happen, as normally a process would return a 500 rather than crash.

NATS ran on my windows kubernetes cluster, and the SUT was on a separate ubuntu 18.04 machine with a rancher-deployed cluster, all on that same machine (being etcd, control plane and node).

here is a base run, without any failures, for reference:

summary =   1000 in 00:00:02 =  592.8/s Avg:     8 Min:     4 Max:    27 Err:     0 (0.00%)
Tidying up ... @ Fri Nov 02 06:24:08 CET 2018 (1541136248036)
... end of run

now 1 pod with failure every 100 calls

summary +      1 in 00:00:00 =    7.2/s Avg:    61 Min:    61 Max:    61 Err:     0 (0.00%) Active: 2 Started: 2 Finished: 0
summary +    999 in 00:00:22 =   45.7/s Avg:   214 Min:     1 Max:  5026 Err:   802 (80.28%) Active: 0 Started: 10 Finished: 10
summary =   1000 in 00:00:22 =   45.4/s Avg:   214 Min:     1 Max:  5026 Err:   802 (80.20%)

As expected, most of the calls failed: the moment a pod died, it took the control plane quite some time to figure that out and deploy a new pod. During that period the API gateway was aware that there was no one to serve the calls, and failed them immediately (in a standard TCP implementation, the API gateway could have been left waiting on those calls and become stuck itself, to the extent of not being able to serve calls to other, functional endpoints).

Let’s try the same with 2 pods. Of course, to keep the same error rate we have to change the failure frequency to fail every 50 calls on each pod.

A little explanation of my expectations: the documentation advises using the internal load balancer, under the assumption that nodes have several services inside them. But I have isolated the services, so there is no point in an internal load balancer; on the other hand, unlike with TCP traffic, there is no kubernetes load balancing of the service either. So every pod that is killed or added works against the NATS transport and is balanced by moleculer (pretty amazing, don’t you think?).

summary =   1000 in 00:00:19 =   51.6/s Avg:   187 Min:     1 Max:  5016 Err:   804 (80.40%)

As expected, we got the same success rate.

With the switch of one boolean in moleculer.config.js, retry is enabled with those same 2 pods:

summary +     56 in 00:00:21 =    2.7/s Avg:  1807 Min:     2 Max:  5117 Err:    13 (23.21%) Active: 0 Started: 10 Finished: 10
summary =   1000 in 00:01:09 =   14.5/s Avg:   454 Min:     2 Max:  5126 Err:   683 (68.30%)

It is an improvement, from 80% failure down to 68%: not a major one, but an improvement, and out of the box. And this is not even a real-life failure scenario.

Let’s understand what is happening here: the API gateway does what it should and retries, but since we have only 2 pods, blasted with calls and killed all the time, and with the k8s scheduler on the same machine, they just do not get revived in time to serve the requests.

Let’s increase the number of pods to 5 and go back to no retry; of course, for fairness we need to change the error rate to every 20 calls.

summary =   1000 in 00:00:22 =   45.3/s Avg:   214 Min:     2 Max:  5025 Err:   810 (81.00%)

Obviously, we are back to the same error rate.

Let us enable retry with the default parameters:

summary +     15 in 00:00:05 =    3.2/s Avg:  2500 Min:   106 Max:  5107 Err:    15 (100.00%) Active: 0 Started: 10 Finished: 10
summary =   1000 in 00:02:55 =    5.7/s Avg:  1298 Min:     2 Max:  5135 Err:   615 (61.50%)

That was not quite what I expected: I would expect a better improvement in error rate with more pods, but in fact with the default configuration what we get is

requestTimeout: 5 * 1000,
retryPolicy: {
    enabled: true,
    retries: 3,
    delay: 100,
    maxDelay: 1000,
    factor: 2,
    check: (err) => err && !!err.retryable
},

which means that our final retry actually happens in much less time than the 5 second request timeout: with a delay of 100ms, a factor of 2 and 3 retries, the total back-off is 100 + 200 + 400 = 700ms, far too fast for a pod to be revived

see here http://www.wolframalpha.com/input/?i=Sum%5B100*x%5Ek,+%7Bk,+0,+2%7D%5D+%3D+1000
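The arithmetic is easy to verify with a small sketch of the exponential back-off that the retryPolicy describes (this models the delay formula, not moleculer’s internal code):

```javascript
// Total wait added by an exponential back-off retry policy:
// attempt k waits delay * factor^k ms, capped at maxDelay.
function totalRetryDelay({ retries, delay, factor, maxDelay }) {
    let total = 0;
    for (let k = 0; k < retries; k++) {
        total += Math.min(delay * factor ** k, maxDelay);
    }
    return total;
}

// defaults: 100 + 200 + 400 = 700ms, far below the 5s request timeout
console.log(totalRetryDelay({ retries: 3, delay: 100, factor: 2, maxDelay: 1000 })); // prints 700
```

With the relaxed settings used further on (delay 500ms, factor 2, retries 3), the same sum is 500 + 1000 + 2000 = 3500ms, which together with a 15s timeout gives the pods more room to come back.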

Let us increase the timeout and the retry parameters to allow the pods time to revive:

requestTimeout: 15 * 1000,
retryPolicy: {
    enabled: true,
    retries: 3,
    delay: 500,
    maxDelay: 12000,
    factor: 2,
    check: (err) => err && !!err.retryable
},

And the result, of course, shows better resiliency:

summary =   1000 in 00:03:02 =    5.5/s Avg:  1256 Min:     2 Max: 15532 Err:   544 (54.40%)

Mid summary and conclusions

  • The retry process works well; even at such an error rate we get a significant increase in successful responses.
  • The actual retry works out of the box, without the need for many lines of code, testing and scenarios.
  • The real bottleneck is the time it takes the k8s scheduler to revive the failing pods and enable the traffic flow again.

I tried to help k8s detect a bit faster that a pod is dead, by enabling a health check on the running nodejs pod:

#!/usr/bin/env bash
# liveness check: is the moleculer runner process still running?
ALIVE=$(ps aux | grep -c "[n]ode bin/runner.js")
if [ "${ALIVE}" -eq 0 ]; then
    echo bad
    exit 1
fi
exit 0

But to no avail: it still took more time to actually revive the pod than to detect there was an issue.

I slimmed the container down to an alpine base, but the improvement was negligible. The idea, of course, was to improve the time a pod takes to load back, while still maintaining the session without an error.

Alternative approach suitable in some cases

If we notify the transport gracefully that we are about to terminate, with a

await ctx.broker.stop();

before terminating the process, then the run finishes much faster, but with more errors: the transport is aware that there are no available services to accept the request, so the retries do not make a lot of difference.

summary +     20 in 00:00:00 =  123.5/s Avg:    81 Min:     3 Max:   511 Err:     3 (15.00%) Active: 0 Started: 10 Finished: 10
summary =   1000 in 00:00:47 =   21.4/s Avg:   365 Min:     1 Max:  1026 Err:   698 (69.80%)

But that requires adding quite some error handling on the serving side (though with moleculer even that is extremely easy).
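One way to wire it up is a tiny helper around broker.stop() (a sketch: the helper name and the injectable proc parameter are mine, not from the article’s repos):

```javascript
// Sketch: stop the broker on SIGTERM so the transporter learns this node
// is leaving before the process dies. registerGracefulShutdown is a
// hypothetical helper; proc is injectable so the wiring stays testable.
function registerGracefulShutdown(broker, proc = process) {
    proc.on('SIGTERM', async () => {
        await broker.stop(); // deregister from the transport first
        proc.exit(0);
    });
}

module.exports = registerGracefulShutdown;
```

Kubernetes sends SIGTERM before killing a pod, so this is a natural place to announce the departure.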

Depending on the situation that could be useful or not, but you would need to add that error handling all over the place. Another approach is to turn on the circuit breaker pattern, which inverts the responsibilities: instead of relying on the serving side to notify that it is not functioning (if it suddenly crashed, it often cannot), the calling side discovers the failure itself.

summary =   1000 in 00:00:20 =   50.2/s Avg:   193 Min:     2 Max:  5015 Err:   810 (81.00%)
Tidying up ... @ Sun Nov 04 01:43:56 CET 2018 (1541292236660)

This stops traffic to the failing service and fails fast, allowing you to serve from a cache or an alternative source.

Improving

Of course, the case where we hammer the service with 10 concurrent users at a rapid pace, with this kind of error rate, is extreme. Adding a more appropriate number of pods, even at an error rate of 1%, would leave enough live pods to serve the requests.

Let’s drop the concurrent calls to 3 and increase the pods to 6, while keeping the error rate at 1% (every 17 calls) with the default retry policy (otherwise we would always fail on 1% of the calls).

summary +     22 in 00:00:11 =    2.0/s Avg:   506 Min:   505 Max:   509 Err:    22 (100.00%) Active: 0 Started: 3 Finished: 3
summary =    999 in 00:06:58 =    2.4/s Avg:  1160 Min:     2 Max: 15520 Err:   438 (43.84%)

And the error rate drops significantly.

Sum of the phase

  • If we manage to make the scheduler start the pods faster, we will get a radical improvement
  • the circuit breaker improves failure detection on the client side almost as well as an actual notification from the serving side (not that I am implying the latter should be neglected)
  • a fair number of pods, combined with moleculer’s great, built-in, easy-to-set-up resilience features, gets you to an extremely reliable service infrastructure in moments

Checking moleculer runner

As the last phase, I wanted to see what the moleculer runner really does. After reviewing the code, I saw that:

  • the convenience of loading .env comes, obviously, from the great dotenv library, so importing and using it directly is a no-brainer
  • the multiple processes are run with the use of cluster

Now, that last part concerned me, and as expected, running with cluster means more time to load the process.

default retry settings, 5 pods

summary +    603 in 00:00:09 =   63.6/s Avg:   108 Min:   103 Max:   125 Err:   603 (100.00%) Active: 0 Started: 10 Finished: 10
summary =   1000 in 00:00:52 =   19.2/s Avg:   486 Min:     3 Max:  5127 Err:   733 (73.30%)

The advice is to avoid using cluster together with docker, be it for 1 or for many processes.

Summary

moleculer is a great tool to start building your microservices with. It brings many features out of the box that you would otherwise have a hard time building, testing and integrating.

Deployment on a cluster is easy, and can be done without relying on unfamiliar or dedicated tools and runners (the same applies to webpack in production environments for web apps).

The failure recovery tools in particular allow your cluster to recover rapidly, and to signal its state to other nodes proactively.

What is next

  • I will definitely be testing it more on production services
  • In one of the projects we use oracle extensively, so I will be working on a moleculer-db-adapter for typeorm.
