Platform Nuts & Bolts: Flexible decision-making with Rule Engines

After receiving a lot of questions from readers about specific details, I understood that I had packed a lot of information about platform principles and the mechanisms of building platforms into my previous article. I realized that I need to break down the process of building platforms a little more and give more hands-on details. In this “Platform Nuts & Bolts” series, I am going to write a set of follow-up articles that look into the specifics of how exactly we can build platform systems.

How do we know what to do?

After I laid a lot of emphasis on state machines and workflows as critical tools on the road to building platformized software, many people reached out to ask which state machine libraries I recommend and what the pros and cons are of different workflow management tools like JBPM, Cadence, Airflow etc. These are good questions, and I will get to them in later articles. Today I want to discuss something more fundamental: what should a platform service managing an entity do when an action is taken on that entity?

Let’s take a concrete example. Imagine an order management service managing an order entity which can be cancelled. What exactly should happen on an order cancellation? We might want to change the state of the order and its order items to “cancelled”, issue a refund to the buyer from the seller’s account, notify the buyer and the seller, and free up inventory so that further orders can be taken against it.

However, this is just one view of things. A properly platformized order management service may offer the above as default behaviour, but it must also allow its tenants to customize it. One tenant might want to notify the buyer but not the seller, another might not want to free up inventory or issue a refund immediately (for whatever reason), and so on. There is the further complication that even a single tenant might have different rules for different types of order. For example, a high lifetime value customer can be refunded immediately but not others, digital inventory can be considered restocked immediately but not physical inventory, etc. There are as many possibilities as there are ways to make business decisions, and an order management platform should ideally enable them all.


The Tenant is King

The most straightforward way is for the order management service to let the tenants specify each of these behaviours in the cancellation request. The order management platform exposes an API with hooks for each of these behaviours and then trusts the callers to do the right thing for their use-case. For example, we can build this by making each behaviour a query parameter. Such a REST API might look like this:

{order id}/cancel?notifyBuyer=true&notifySeller=false&doRefund=true&releaseInventory=false

We can immediately see many objectionable things about this API. It leaks implementation details by surfacing internal code paths as booleans, it is difficult to extend to more actions, and it makes it very difficult for the tenant to govern what happens in which scenario, which matters a great deal if a tenant wants to build many different behaviours. The platform need not understand the upstream use cases, but in this model it offers zero visibility into the business rules.
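To make these objections concrete, here is a minimal Python sketch of what a handler behind such a flag-driven API might look like. The function name `cancel_order` and the action strings are hypothetical, chosen only for illustration; the four flags mirror the query parameters shown earlier.

```python
def cancel_order(order_id: str, *, notify_buyer: bool, notify_seller: bool,
                 do_refund: bool, release_inventory: bool) -> list[str]:
    """Hypothetical flag-driven handler: each flag maps to an internal code path."""
    actions = [f"mark {order_id} cancelled"]
    if do_refund:
        actions.append("refund buyer")
    if release_inventory:
        actions.append("release inventory")
    if notify_buyer:
        actions.append("notify buyer")
    if notify_seller:
        actions.append("notify seller")
    return actions


# The caller, not the platform, has to encode the business rule on every request.
print(cancel_order("order-42", notify_buyer=True, notify_seller=False,
                   do_refund=True, release_inventory=False))
```

Every new behaviour means another parameter and another branch, and nothing on the platform side records why a caller chose a particular combination.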

Also, as we will see below, the choices that the tenant needs to make are not limited to order cancellation. There are multiple platform services that will come into play in this process (payments, communications, inventory etc) and in this model, the tenant has to know, understand, and specify the behaviour of every single step of the workflow in every cancellation request.

This is obviously not the best way to design an order cancellation API.

Rule Engines FTW!

A better way of achieving the kind of flexible behaviour we need is to stop thinking of the things that need to be done as independent, and instead club combinations of them into workflows, each of which satisfies one or more specific tenant use-cases. A tenant has workflows defined for different order cancellation scenarios, and the problem now becomes one of picking the right workflow for the scenario. The scenario is defined by characteristics of the order and order items, and which characteristics to use is (potentially) unique to every tenant.

This kind of requirement is best managed using a rule engine (aka business rule management system, rule system etc). We define all the characteristics that a tenant wants to use in making the decision as the inputs of a rule engine, and the output of the engine is the identifier of the workflow to be executed. We can now construct one such rule system per tenant and expose a vastly simpler order cancellation API which doesn’t take behavioural input from the caller but rather picks the right rule system for the tenant, evaluates the scenario to identify the right workflow, and triggers it via the appropriate means. At a low level, we can use something like a factory or a resource locator to hide the details of how to trigger the workflow and to make it easy to add more workflows over time.
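The following is a minimal Python sketch of this idea, not the API of any particular rule engine. The tenant name, the input characteristics (`customer_tier`, `inventory_type`), and the workflow identifiers are all invented for illustration. It also assumes the simplest possible matching strategy: rules are checked in order and the first one whose conditions all match wins. A plain dictionary stands in for the factory or resource locator.

```python
from typing import Callable

# A rule is (conditions, workflow_id); an empty conditions dict matches anything.
Rule = tuple[dict[str, str], str]

# Hypothetical per-tenant rule systems; rules are checked in order.
RULE_SYSTEMS: dict[str, list[Rule]] = {
    "tenant_a": [
        ({"customer_tier": "high_value"}, "refund_now"),
        ({"inventory_type": "digital"}, "restock_now"),
        ({}, "default_cancellation"),
    ],
}

# Registry standing in for a factory or resource locator: workflow id -> runnable.
WORKFLOWS: dict[str, Callable[[str], str]] = {
    "refund_now": lambda order_id: f"{order_id}: refund issued immediately",
    "restock_now": lambda order_id: f"{order_id}: inventory restocked immediately",
    "default_cancellation": lambda order_id: f"{order_id}: default cancellation",
}


def pick_workflow(tenant: str, order: dict[str, str]) -> str:
    """Return the workflow id of the first rule whose conditions all match."""
    for conditions, workflow_id in RULE_SYSTEMS[tenant]:
        if all(order.get(key) == value for key, value in conditions.items()):
            return workflow_id
    raise LookupError(f"no rule matched for tenant {tenant!r}")


def cancel(tenant: str, order_id: str, order: dict[str, str]) -> str:
    """The caller supplies only the order; the tenant's rules decide what happens."""
    return WORKFLOWS[pick_workflow(tenant, order)](order_id)


print(cancel("tenant_a", "order-42",
             {"customer_tier": "high_value", "inventory_type": "physical"}))
```

Adding a workflow here means adding one registry entry and one rule; the signature of `cancel` stays the same.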

Figure: Example of rules for order cancellation

The above is a contrived but entirely plausible set of rules for handling order cancellation requests. As you can imagine, a different set of input criteria mapping to another set of workflows is also possible if we want to run the business a little differently.

I want to be very clear that I’m not talking about the implementation details of any of these pieces. As the order management platform, we can use whichever rule system we like (I, of course, recommend Rulette for most cases) and any workflow engine we like (it could be some local code, a tenant API, or anything in between). The idea is to identify the building blocks which let platform services exhibit flexibility, and rule engines serve as very effective decision makers.

In my opinion, the biggest advantage of this model is that all tenant rule systems are maintained in the platform, and each one of these serves as implicit documentation for the tenant’s business rules (along with the specification of the actual workflows, of course). This can be very useful in understanding overall system behaviour and debugging, especially if our platform is large and has many tenants. Developers no longer need to look up the variant behaviours in code or rely on tribal knowledge - they are available almost like a configuration.

The second good part about this model is, of course, that we can now have a very simple cancellation API which talks only in terms of the order entity and the action being performed on it:

{order id}/cancel

This is clearly far simpler to maintain since it doesn’t change if some new action is added to some tenant’s cancellation process. Those details are hidden inside the workflow and in a one-time rule configuration, without impacting the API at all.

Figure: Rule system per tenant with rules mapping to many workflows

The last benefit of this model is that it allows us to define specific behaviour inside the components responsible for it. Let’s look deeper into this.

It’s rule engines all the way down

As I briefly mentioned above, while we might think that sending a notification to a buyer is a simple operation in the context of an order cancellation, it is actually full of choices in its own right. What kind of notification (email/SMS/in-app)? Which provider should be used? What template should be used? Similar problems exist in other aspects such as payments (which payment gateway to use, what taxation rules apply, international versus local cards, pre-paid or not etc).

If we use the first model of flexibility described above, ALL of these choices have to be made up front by the tenant (which is fine - they are the tenant’s choices anyway) and they have to be sent in via the order management platform. This is where things become thorny, because it is not really the job of the order management system (OMS) to accept or understand these extra parameters. At this point the cancellation API in the OMS becomes totally incomprehensible.

However, if we are talking about other platform services involved in the order cancellation process, we know that each of them has the same rule system and workflow capabilities. Therefore, instead of having to specify the entire behaviour in the order management system, we can break it up across each of the platform services involved. We can build multiple rule systems across multiple services, each of which deals only with configuring the behaviour of that particular platform system and its workflows. The rules of each service will be defined in its own domain language and will map to workflows visible only to that service and its tenants.

Figure: Workflows fan out using domain-specific rule systems in platform services

We can visualize an order cancellation request first hitting the order management service and being satisfied by a workflow, every step of which hits a different platform service, triggering the rule engines and workflows of each of these services, and so on. The entire graph of these operations comprises the business workflow of the tenant order cancellation.
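This fan-out can be sketched in Python as below. Everything here is an assumption made for illustration: the three rule sets, their inputs (`has_app`, `card_region`, `customer_tier`), and the channel and gateway names. Per-tenant lookup is left out to keep the sketch short. The point is that the order management workflow only names the services it calls, while each service consults its own rules in its own vocabulary.

```python
from typing import Any

Rules = list[tuple[dict[str, Any], str]]


def first_match(rules: Rules, facts: dict[str, Any]) -> str:
    """Return the outcome of the first rule whose conditions all hold."""
    for conditions, outcome in rules:
        if all(facts.get(key) == value for key, value in conditions.items()):
            return outcome
    raise LookupError("no rule matched")


# Each service keeps its own rules, expressed in its own domain terms.
NOTIFICATION_RULES: Rules = [({"has_app": True}, "in-app"), ({}, "email")]
PAYMENT_RULES: Rules = [({"card_region": "international"}, "gateway_x"), ({}, "gateway_y")]
ORDER_RULES: Rules = [({"customer_tier": "high_value"}, "refund_and_notify"), ({}, "notify_only")]


def notify(buyer: dict[str, Any]) -> str:
    """Notification service: picks the channel; the order service never sees this choice."""
    return f"notify via {first_match(NOTIFICATION_RULES, buyer)}"


def refund(payment: dict[str, Any]) -> str:
    """Payment service: picks the gateway from its own rules."""
    return f"refund via {first_match(PAYMENT_RULES, payment)}"


def cancel_order(order: dict[str, Any]) -> list[str]:
    """Order service: picks a workflow, whose steps delegate to other services."""
    workflow = first_match(ORDER_RULES, order)
    steps = ["order cancelled"]
    if workflow == "refund_and_notify":
        steps.append(refund(order["payment"]))
    steps.append(notify(order["buyer"]))
    return steps


print(cancel_order({
    "customer_tier": "high_value",
    "payment": {"card_region": "international"},
    "buyer": {"has_app": False},
}))
```

Changing how notifications are routed touches only `NOTIFICATION_RULES`; the order service and its API are unaffected.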

I hope this article has thrown some more light on how we can use rule engines not only to make the behaviour of our platform services configurable but also to consolidate business rules so that they are easy to access and understand.


