This is the second article in a new series about Node.js. We began by covering the fundamental pattern for controlling flow in Node.js: the callback. In this article we’ll discuss how we can build on this basic pattern to coordinate asynchronous operations. I hope you enjoy it!
You can find some of the code samples in this article in this GitHub repo.
- Pedro Teixeira, CTO, YLD!
Orchestrating callbacks
It’s not only the Node core API that uses the callback pattern: a multitude of third-party libraries on npm use it too. This allows you to use a flow-control library like `async` to compose them in almost any way you want.
To use `async`, first install it in the root directory of your project:
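```shell
$ npm install async
```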
Series
The simplest case of I/O orchestration is the case where you want to chain a series of I/O calls together, one after another, and interrupt them if an error occurs.
This example runs two operations, each of them removing one file:
async_series_1.js:
The `async.series` function takes two arguments: the first is an array of functions to be invoked in sequence. These functions (our `removeFile1` and `removeFile2`) must share the same signature: a callback function as the first and sole argument. The second argument to `async.series` is a final callback that gets called when all of the functions have finished, or when one of them produces an error.
If you place two files named `file1.txt` and `file2.txt` in the current directory, you will be able to run this without an error:
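```shell
$ node async_series_1.js
```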
If you run this again you should get an error on the `done` callback, because `file1.txt` has already been removed:
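```shell
$ node async_series_1.js
Error: ENOENT: no such file or directory, unlink 'file1.txt'
```

(The exact error output varies with the Node.js version.)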
Here you can see that the series flow breaks the first time an error occurs, and in this case the final callback function (`done`) gets called with an error as the first argument.
Parallel
If the I/O you’re doing is independent, you can reduce the total time a set of operations take by doing them in parallel.
The only thing we changed here was to replace `async.series` with `async.parallel`, causing these two I/O operations to be performed in parallel.
If one of these operations fails, the first one to fail triggers the `done` callback with an error. When any of the remaining pending calls finish after that, the `done` callback doesn't get called again. (Remember: a callback function is never called twice; this is part of the contract.)
This is one of the drawbacks of using I/O in parallel instead of in series: you’re only able to handle the first error that happens. If you need to handle each one of them, either avoid parallel calls by using `async.series`, or give each individual operation a custom callback that handles its error.
Waterfall
The reason for using `async.series` would be that the I/O calls are dependent. Imagine that you first need to fetch a value from a remote system, such as a database server, in order to be able to create the next request. Here is such an example:
Notice that we’re passing `befriend`’s third argument, a callback function, directly as the last argument of the `async.series` call. Here you can already see a big advantage of the callback pattern: you don't have to adapt between different function signature patterns – you can pass the callback function along directly as an argument.
This `befriend` function creates a "friend" link in a remote database. It uses `async.series` to compose two functions: the first finds out whether the link already exists, and the second creates the friendship link, but only if that link doesn't already exist.
The `findIfAlreadyFriended` function then has to check the remote system to find out whether such a link exists, and update the shared `alreadyFriends` Boolean variable. This Boolean is then used by the next function in the list, the `friend` function, to decide whether to do the link insertion or to call back immediately.
We designed this `befriend` function not to error out if such a friend link already exists. You could also design it to call back with an error, which would break out of the `async.series` flow, if that's the behaviour you want.
You can entirely avoid keeping a shared variable between functions by using the `async.waterfall` method:
Knowing that the `Friends.exists` function already calls back with a signature of `(err, exists)`, we use that to our advantage: we let `async.waterfall` pass whatever the results of one callback are into the next function, keeping the state you need in the call stack instead of in an outer scope.
As in all the async flow-control directives, `async.waterfall` also breaks on the first error, handing it off to the final callback.
Notice here that the `friend` function always has to invoke the `cb` callback, no matter what. If it doesn't, the `befriend` callback will not get invoked in some cases. This is one of the hardest parts of using callbacks: making sure they always get invoked exactly once, no matter what.
Collections
If, instead of having a set of different functions, you want to apply the same function to a set of objects, you can use async’s collection iteration utilities.
Parallel iteration
As an example let’s say that, given a list of documents, you want to insert all of the documents into this remote database:
Here we see that `async.each` accepts three arguments:
- The collection (a JavaScript array)
- The iterator function
- A final callback function, called when all iterations are complete, or on the first occasion that an error occurs.
Our iterator function, `insertDocument`, will then be called by async for each document, in parallel. This function accepts one object as its first argument and a callback function as its last argument; that callback is to be called when an error occurs or when this particular operation finishes.
Given that our `insertDocument` is only calling one function, we could use JavaScript's `Function#bind` and reduce our example to:
Parallel, but with limit
The `async.each` function is quite handy, but it's only appropriate when we know that the maximum length of the collection is relatively small. If the collection is too large, we risk:
- overloading the receiving system. For instance, if, as in this example, we’re invoking this procedure on a remote system, too many of those calls in parallel risk overburdening that system. Worse still, if many of these operations are performed, each one while answering a client request, we multiply this effect by the number of parallel client requests.
- overloading Node.js memory. Each of these calls inevitably allocates some resources in our Node.js process: structures, closures and file descriptors. These resources will eventually be released once the operation finishes, or garbage-collected in the near future; but too many of these in parallel risks putting too much pressure on memory, eventually leading to resource exhaustion.
- blocking the event loop. Each call that starts one of these operations blocks the event loop for a short period of time. Multiply that period by the number of objects you’re iterating over, and you may be blocking the event loop for a long time, possibly degrading the experience of other users currently being served by your Node.js process.
To prevent this you can either cap the collection, thus losing documents, or you can simply use `async.eachLimit`, which allows you to impose a limit on the number of outstanding operations:
By using `async.eachLimit` here, we're specifying that we allow a maximum of five outstanding operations: that is, at any time, there are at most five ongoing document-insert calls. In this way you can help reduce the pressure on an external service, as well as reduce the pressure on memory and other resources.
Bear in mind, though, that this technique doesn’t reduce the overall pressure that your Node.js process puts on an external service. If, for instance, you’re going to perform this operation for every user request, even a local parallel limit of five like this one may not be enough, since the total effect is the parallel limit multiplied by the number of requests being served at any given time. (To reduce that, you can use other patterns, such as a global queue or a connection pool.)
Note that the time it takes for a client request to complete may increase as you tighten the limit on outstanding parallel operations, which in turn may increase the number of pending parallel requests. This limit can be fine-tuned according to your application's load patterns.
Next Article
In the next article of this series we’ll analyze how, still leveraging the callback pattern, you can use a work queue to further tighten and control the flow of operations.
You can find the previous post on Flow Control here: