(MODAClouds) Sub-system orchestration
In the context of the MODAClouds project, each of the three main run-time sub-systems (i.e. monitoring, self-adaptation, and execution) is built to be highly independent and reusable. Thus their contact surface must be reduced to a small set of well-defined interactions, carried over de-facto standard solutions.
Moreover, each sub-system is itself a distributed system, made of multiple running components that must communicate with each other. For example, in the context of the monitoring sub-system, there are the various data collectors, the monitoring "coordinator" (a better name for this is still pending), various data aggregators, a data streaming middleware (C-SPARQL), etc.
Thus the role of the execution platform is to provide the needed services that enable these interactions, plus a few more facilities like initial configuration, containment, etc.
The current document focuses on the alternatives that could be used to fulfill the requirements listed below.
It must be noted that each of these alternatives has its trade-offs: some require more work from the execution platform itself, others from the sub-systems themselves; some might not even work with certain PaaS-based deployments; etc.
- initial configuration of the sub-system components; (intra-sub-system)
- further re-configuration of the sub-system components; (intra-sub-system)
- communication between different sub-systems; (inter-sub-system)
- discovery of sub-system components; (both intra- and inter-sub-system)
It must be noted that the proposed solutions are intended only for operation and maintenance purposes (as described above, covering only configuration, commands, and signaling), and not for normal data access (which is handled through custom solutions not supplied by the execution platform).
Although, by design, the MODAClouds sub-system components run over the mOSAIC platform, which in turn runs over a cluster of VMs obtained from IaaS providers, some of the sub-system modules live inside the application itself, and thus potentially inside a PaaS. As a consequence, a constraint is that the chosen solution should be accessible from within a PaaS. However, because some PaaS providers are very restrictive, especially Google App Engine, the safest choice is an HTTP-based solution (whose pros and cons are discussed in the sections that follow).
We also assume that the main programming language (for the sub-system services) is Java, thus the chosen solution must offer proper libraries for the Java environment. However, it is not required that the underlying solution itself be Java-based, only that it support a Java-based client.
Regarding data formats, we assume that JSON is almost ubiquitous nowadays and is the simplest, yet still efficient, solution for data encoding. However, if possible, XML or binary encodings should also be supported, although possibly with some restrictions.
Regarding the scale of the communication and configuration payloads, we assume that each exchanged payload is well under 1 MiB (an acceptable average being a few tens of KiB), and that the throughput is well under a few tens of exchanges per second (an acceptable average being a couple per second). Regarding latency, we assume that up to a second is acceptable, with an average under a hundred milliseconds.
Moreover, we assume that the developers are willing to make minor adjustments to the way their components communicate, and that security, availability, and scalability are not priorities, although they could be taken into consideration as secondary concerns. (Security is mainly achieved by controlling access at the VM / security group level. Scalability is not a major issue due to the limitations we have assumed in terms of payload size and throughput. Availability is the only one that must be covered on a case-by-case basis.)
An obvious conclusion of the above is that performance is a non-issue, and that flexibility and suitability are the priorities.
Any of the solutions presented hereafter falls into one of the following categories (each line provides a different perspective):
- centralized vs. decentralized --- in the first case we require a middleware component that bridges between the various sub-systems, while the second approach moves some of the logic into the client components themselves;
- push (e.g. publish-subscribe) vs. pull (e.g. polling) --- the first implies the need for a middleware but lowers the latency and overhead, while the second is better suited to a decentralized environment;
- fully HTTP-compatible vs. binary protocols;
- messaging (i.e. stateless communication) vs. records (i.e. stateful databases) --- these answer how a global view of the overall system is built: in the first case each component must collect every message and aggregate the full view itself, while in the second there is an "official" (although eventually-consistent) "image" of the system;
However, the alternatives that follow are not mutually exclusive; they can be combined, each being used where it makes the most sense.
The simplest solution to configuration and communication is the "database" approach, where all the sub-systems maintain a set of records that can be used both for initial and further configuration, and for exchanging messages (i.e. commands) between components, including discovery.
The actual implementation details differ based on the underlying technology (i.e. SQL / NoSQL); however, in most cases it boils down to the following simple steps:
- each component receives a unique and constant identifier that must be conveyed (or obtained) when starting / restarting;
- discovery of the components is simply achieved by each component creating a record in a special collection (i.e. table, database, etc.) when it starts (and is ready to process commands); this record should then be deleted by the component when it stops; however, to resolve "stale" components (i.e. those that still have a record but are no longer running), an expiry mechanism could be implemented, which requires that a component refreshes its entry at regular intervals;
- one-way communication (i.e. commands) is similarly achieved by creating records keyed with the targeted component identifier; two-way communication (i.e. request-reply) can be achieved by using a composed key made of the component identifier and a unique request identifier; (in effect, we are emulating message queues;)
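The steps above can be sketched as follows; the key prefixes, the TTL value, and the in-memory map (standing in for the actual database) are illustrative assumptions, not a prescribed schema.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the record-keying scheme described above; a ConcurrentHashMap
// stands in for the actual database, and all key prefixes are assumptions.
public class RecordStore {
    private final Map<String, String> records = new ConcurrentHashMap<>();

    // discovery: one record per live component, refreshed at regular intervals
    public static String registryKey(String componentId) {
        return "registry/" + componentId;
    }

    // one-way commands: records keyed by the targeted component identifier;
    // the request identifier additionally enables request-reply correlation
    public static String commandKey(String componentId, String requestId) {
        return "commands/" + componentId + "/" + requestId;
    }

    // a registry entry is stale if it was not refreshed within the TTL
    public static boolean isStale(long lastRefreshMillis, long nowMillis, long ttlMillis) {
        return nowMillis - lastRefreshMillis > ttlMillis;
    }

    public void register(String componentId, String endpoint) {
        records.put(registryKey(componentId), endpoint);
    }

    public void deregister(String componentId) {
        records.remove(registryKey(componentId));
    }
}
```

Note that the expiry check is deliberately a pure function of timestamps, so it works the same whether the store is local or remote.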
The following are a few clear advantages:
- there is a clear global state of the entire system, especially regarding the current configuration of each component;
- it enables the components to be stateless (with regard to their configuration), requiring only a constant identifier to be conveyed to each component on start / restart;
- debugging the interactions is simplified because there is a single entry point to be tapped;
The following are a few clear disadvantages:
- publish-subscribe mechanisms (including broadcasts) are impractical to implement because they involve high contention; thus only unicast (i.e. exact-target) delivery is achievable;
- latency and overhead are heavily impacted because polling is usually involved;
Note that although most database solutions offer only binary protocols, it is easy to provide an HTTP-based proxy, especially for NoSQL solutions, which have a much simpler interface. Moreover, availability, especially read availability, is a non-issue because most solutions allow at least read replicas.
We (IeAT) recommend that such a solution should be used at least for configuration purposes like the ones described below. In such a situation reconfiguration is as simple as changing the stored entry and signaling the component to reconfigure itself.
- specifying endpoints of core support services or core sub-system components, like the C-SPARQL endpoint, or the RabbitMQ endpoint described in the next sections;
- specifying run-time parameters or policies, like in the case of monitoring collectors the metrics to be collected and their frequency, etc.;
Moreover we (IeAT) suggest that it could also be used for discovery of components, as was described at the beginning of this section.
The best example is Redis, which efficiently implements both a KV-like interface and a simple publish-subscribe one, and it is successfully used in many Web 2.0 application deployments.
However, we (IeAT) suggest that the best technical solution is instead CouchDB, which, although less feature-rich than Redis, offers unique facilities that are most useful for our use case:
- multi-master replication, meaning that we can have multiple database instances (some on the provisioned VMs, some even on the operator's station); thus updates can be sent to any of these instances (most likely the one closest to the writer); similarly for reads;
- it uses JSON as the native encoding (although binary data can be stored as attachments), which, coupled with its views feature (similar to SQL views, although implemented as map-reduce), could offer interesting ways to interact with the stored data (especially for features such as discovery);
- it uses HTTP as the native protocol, thus not requiring any proxies;
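As an illustration of the HTTP-native interface, updating a component's configuration record amounts to a single PUT of a JSON document; the host, database name, document identifier, and fields below are illustrative assumptions, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a CouchDB configuration update over plain HTTP (JDK 11+);
// the endpoint, database, document name, and JSON fields are hypothetical.
public class CouchConfigUpdate {
    public static HttpRequest buildPut(String base, String database, String docId, String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(base + "/" + database + "/" + docId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildPut(
                "http://couchdb.internal:5984", "configuration", "monitoring-collector-01",
                "{\"csparql-endpoint\": \"http://csparql.internal:8175\", \"frequency-seconds\": 30}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Because the protocol is plain HTTP, the same request could be issued with any HTTP client, including from restrictive PaaS environments.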
There are also alternatives in this category, such as ZooKeeper (harder to bootstrap, manage, and use), Doozer (not very mature), etcd (fewer features / a simpler model), etc.
Probably the simplest solution from the execution platform's point of view is to just aid in the discovery of the components and leave the actual communication (be it HTTP or not) as their responsibility.
This implies that the execution platform has to expose only two support services:
- one HTTP-based service, directly visible by the components, that allows the components to manipulate the DNS records;
- one hidden and used behind the scenes by the OS libraries, which answers to DNS queries made by the components, based on the records previously manipulated;
The registration would work as below:
- the component has to find its "external" IP address; (i.e. in the case of EC2, and if the component communicates only inside the cloud, this address is simply found by using the usual OS-level API; however, if the component must be addressable from the Internet (or from other EC2 regions), it needs to find its public IP address, either by consulting the EC2 ZeroConf service, or, if running in a mOSAIC container, via the specific methods;)
- it is assumed that for each component type / class (i.e. monitoring data collectors, each type of agent, etc.) there is a well-known (i.e. constant) hostname;
- it is assumed that the component and its clients, besides the well-known hostname, also have a well-known port; (which could be specified through a configuration parameter);
- for registering, the component creates a new type A record for the well-known hostname, whose value is the "external" IP address;
- for de-registering, the component deletes the record created above;
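The two registration steps could look as follows against the HTTP-based DNS manipulation service mentioned earlier; the service URL and its path layout are purely hypothetical (the document does not specify them), and the requests are only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of (de-)registering an A record through a hypothetical HTTP-based
// DNS service; every endpoint and path below is an illustrative assumption.
public class DnsRegistration {
    public static HttpRequest buildRegister(String dnsService, String hostname, String externalIp) {
        return HttpRequest.newBuilder()
                .uri(URI.create(dnsService + "/records/a/" + hostname + "/" + externalIp))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static HttpRequest buildDeregister(String dnsService, String hostname, String externalIp) {
        return HttpRequest.newBuilder()
                .uri(URI.create(dnsService + "/records/a/" + hostname + "/" + externalIp))
                .DELETE()
                .build();
    }
}
```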
The discovery would work as below:
- if the client needs to communicate with any of the multiple instances of the same component type, it just needs to use as an endpoint the well-known hostname (and port) instead of the IP address;
- on the other hand if the client needs to broadcast a request to all the instances, it will have to first resolve all the DNS records for the well-known hostname, and individually make a call to each of the targets by using an endpoint formed from the concrete IP address and well-known port;
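The broadcast case above amounts to resolving all A records for the well-known hostname and forming one endpoint per instance; a minimal sketch (the addresses and port are examples, and actual resolution would go through the OS resolver, e.g. `InetAddress.getAllByName`):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: turn the resolved A records of a well-known hostname into
// one concrete HTTP endpoint per component instance.
public class DnsBroadcast {
    public static List<String> endpoints(List<String> resolvedAddresses, int wellKnownPort) {
        return resolvedAddresses.stream()
                .map(address -> "http://" + address + ":" + wellKnownPort + "/")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // e.g. two instances registered under the same well-known hostname
        System.out.println(endpoints(List.of("10.0.0.11", "10.0.0.12"), 8080));
    }
}
```

The caller would then issue the same request against each endpoint individually, which is exactly the broadcast overhead noted in the disadvantages below.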
The following are a few clear advantages:
- simplicity on the client side, especially in the case of singleton component types, or in those circumstances when the request must reach any of the component type instances;
- it works out-of-the box with any application that allows hostnames as endpoints;
- it enables a true SOA-based approach, where components communicate freely;
The following are a few clear disadvantages:
- the simplicity of usage on the client side is almost nullified when there is a need to broadcast the same request to multiple component instances (as in the case of reconfiguration commands);
- it almost certainly does not work on the application side when deployed in a non-IaaS environment; (because implementing such a solution requires replacing the DNS resolver used by a VM with a custom one that intercepts some hostnames and delegates the others to the "normal" provider DNS resolver;)
We (IeAT) use such a solution in mOSAIC for VM discovery in IaaS environments where UDP broadcasts or multicasts are disallowed.
Moreover we (IeAT) suggest that such a solution is best suited for those components that should be accessible from the Internet (like the various WUI's).
However, although technically possible, it is impractical to use it between internal components.
Please note that although AMQP is a protocol specification for messaging / message queues, we are specifically targeting its model (based on exchanges, queues, bindings, and routing keys), which is slightly more powerful than the JMS model (and certainly clearer and more explicit). Moreover, RabbitMQ (the most popular AMQP implementation) has libraries for many programming languages, unlike JMS, which targets mainly Java-based developments.
Such a solution would enable the implementation of a centralized communication between components, such as:
- one-way commands or signals (such as re-configuration, startup / shutdown / heart-beat signals, etc.) targeting either individual components, or all instances of the same component type; (by using a fanout exchange, and one queue per component;)
- request-reply interactions between two components; (by using a direct exchange, and one queue per component;)
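The two interaction patterns map onto the AMQP model roughly as follows; the exchange, queue, and routing-key names are an illustrative convention of ours, not something mandated by AMQP or by the project.

```java
// Sketch of an exchange / queue / routing-key layout for the two patterns
// above; all names form a hypothetical convention.
public class AmqpTopology {
    // broadcast signals: one fanout exchange per component type; every
    // instance binds its own queue to it (the routing key is ignored)
    public static String signalExchange(String componentType) {
        return "signals." + componentType; // declared with type "fanout"
    }

    // unicast commands: a single direct exchange, routed by component identifier
    public static final String COMMAND_EXCHANGE = "commands"; // type "direct"

    public static String commandRoutingKey(String componentId) {
        return componentId;
    }

    // each component consumes from its own private queue
    public static String componentQueue(String componentId) {
        return "inbox." + componentId;
    }

    // request-reply correlation: replies are routed back using the
    // requester's identifier plus a unique request identifier
    public static String replyRoutingKey(String requesterId, String requestId) {
        return requesterId + "." + requestId;
    }
}
```

With RabbitMQ the actual declarations would be made through the client library's channel operations; the naming scheme is the part that matters here, since it is what both sender and receiver must agree on.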
The following are clear advantages:
- the simplest solution from the component developer point of view, both in terms of API simplicity, and setup procedure (the component is always a client, thus all it needs is the endpoint of the AMQP middleware);
- low latency, low overhead, and high throughput are easily achieved;
- fully asynchronous communication, in which the sender and receiver don't have to be on-line at the same time; (i.e. if reconfiguration requires a component to restart, it won't lose any messages sent in the interim, and the sender isn't required to implement a retry mechanism;)
- debugging the interactions is simplified by having a single point where we can tap the exchanged messages;
To our (IeAT) knowledge the only "clear" disadvantage (within the boundaries we've set ourselves in the Constraints and assumptions section) is the fact that AMQP is a binary protocol, thus potentially raising issues in very strictly controlled PaaS solutions like Google App Engine.
However, in such a situation it would be easy to provide an HTTP-based proxy for the two common operations of publish and subscribe.
We (IeAT) recommend the usage of such a solution, based on RabbitMQ, for the purpose of component communication; coupled with the CouchDB-based object store, it could offer a complete solution for all the issues targeted in this document.
Although there are a few AMQP alternatives, like ActiveMQ, HornetQ, etc., we (IeAT) strongly suggest RabbitMQ for its performance and stability.
The previously presented solution, based on AMQP, requires the developer to explicitly implement or adapt their component to use the AMQP library. But there could be situations when this is not easily achievable, either because the application is closed-source or third-party, or because the chosen framework places heavy constraints on interactions. However, most likely the component will implement an HTTP-based protocol.
In such a situation we propose to implement intercepting (forward) and relaying (reverse) proxies that redirect the requests to the proper targets via an internal, non HTTP-based, transport protocol.
The main benefits of such an approach would be full discovery and transport transparency, including retries and possibly security enhancements. However only unicast (i.e. client to server) interactions are possible, and possibly broadcasts for certain methods (like PUT or POST).
Some details of this approach can be found at the following link: https://dev.modaclouds.eu/redmine/projects/modaclouds/wiki/Wp6SubsystemsHttpRouter.
Another communication alternative would be to use the mOSAIC native component communication mechanism, briefly drafted at the following link http://wiki.volution.ro/Mosaic/Notes/Hub.
However such an approach, although novel, is less suited in the MODAClouds project, where there is a high independence between the various sub-systems.
As noted in the introduction, all solutions have their advantages and disadvantages, but out of all of them we (IeAT) suggest combining the following:
- the CouchDB-based object store (described in Object stores) for storing all the configuration details of sub-system components; (thus passing to the actual components only a pointer to the database record, consisting of a URL, which can then be retrieved through a simple HTTP client;)
- the RabbitMQ-based message exchange (described in AMQP-based messaging) for sending commands and signals between the sub-system components;
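The combined bootstrap flow then reduces to handing each component a single URL; a minimal sketch (the field name and the crude flat-JSON lookup are illustrative assumptions, and a real component would use a proper JSON library):

```java
// Sketch of the combined approach: the component receives only the URL of
// its configuration record, fetches it over HTTP, and reads the RabbitMQ
// endpoint from the returned JSON; the field name is a hypothetical example.
public class Bootstrap {
    /** Extract a top-level string field from a flat JSON object (sketch only). */
    public static String field(String json, String name) {
        String marker = "\"" + name + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) return null;
        start += marker.length();
        int end = json.indexOf('"', start);
        return json.substring(start, end);
    }

    public static void main(String[] args) {
        // in practice this document would be the body returned by a GET on the
        // configuration URL passed to the component at start-up
        String document = "{\"amqp-endpoint\":\"amqp://10.0.0.5:5672\"}";
        System.out.println(field(document, "amqp-endpoint"));
    }
}
```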
The other presented solutions (DNS-based discovery, HTTP router, or mOSAIC component hub), being more targeted, are less suited in the context of the MODAClouds project, although the DNS-based discovery and the HTTP router could still be used at later stages of the project.