(MODAClouds) Software components best practices

Packaging, configuration, and execution

Important

This document is a draft!

1 Overview

This document is mainly addressed to the MODAClouds developers who are tasked with writing software to be deployed and executed as part of the MODAClouds run-time. In general these software components run on IaaS infrastructures alongside the application itself, although they are managed via the mOSAIC deployable PaaS, or potentially other PaaS-like solutions.

(Similar concerns are also addressed in [12App] and [VendorsPlea].)

2 Requirements, constraints

The following are some requirements and restrictions that must be obeyed by the software components (and their developers), in order to have seamless deployment and execution.

2.1 Operating system, architecture, dependencies

The software should be runnable on OpenSUSE 13.1 (or later minor versions, 13.x). However if the software runs on a different but sufficiently recent and widely known distribution (like Ubuntu, Debian, CentOS, Fedora), it should be easy to port it to OpenSUSE. In general, for software written in interpreted (including byte-code compiled) languages (like Java, Python, Ruby, NodeJS, etc.) there should be almost zero porting effort.

The software should be built for the 64 bit architecture (i.e. x86-64). If this is not possible, the 32 bit architecture should work by adding as a dependency the 32 bit libraries available in the distribution (e.g. the glibc-32bit package in OpenSUSE 13.1).

The developer should pay attention to the various dependencies installed via the OS package manager, especially if he uses another distribution than the targeted OpenSUSE 13.1. The following details are especially important:

the versions used, as different distributions have different versions available; the developer should check if his software component can run with the dependent library at the version present in the OpenSUSE 13.1 distribution;
the "optional" or "recommended" dependencies that are installed as indirect dependencies of the software component's direct dependencies; for example some distributions install by default extra dependencies of various packages, while others don't, thus when switching distributions and requesting the same direct dependencies to be installed, the full set of installed packages might differ drastically; therefore the developer should carefully note all the dependencies he needs;

2.2 Execution privileges

Because the software components are going to be executed in isolated environments, based on lightweight virtualization technologies like LXC, the developers should not expect elevated privileges during the execution, such as:

execution as root, including via SETUID / SETGID file mode bits; (although the various system calls like getuid / getgid could return 0, i.e. root identifier, the available rights are that of plain users;)
binding sockets on ports under 1024;
installing new software;
changing OS-wide configuration files (i.e. /etc);
changing a process' limits via the system call setrlimit;
accessing (or creating) device nodes (i.e. /dev), except the usual ones like /dev/null, /dev/zero, /dev/urandom, etc.;
changing the network interface configuration;
changing the system time;
in general any system call that requires a capability to be set, as described in man 7 capabilities;

Moreover for debugging purposes the developers can't "log in" to the running container, therefore any interactions with the executing components must be explicitly provided by the component itself, for example via a web-service (or dedicated URLs) used solely for such debugging purposes.

2.3 File-system access

In addition to the constraints described in the previous section, the developers should take into account that although the execution environment "seems" to be a full-blown operating system, it is in fact more constrained, especially when it comes to file-system access. In principle a well-behaving software component should not try to write files except inside the designated scratch directory (exported as the TMPDIR environment variable, and usually available under /tmp).

This implies that the wrapper-scripts (i.e. the ones that create the configuration files prior to the actual component execution) should copy the configuration file templates to the temporary folder, change them in-place there, and then point them to the component which should load these updated files instead.
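The template handling described above could be sketched as follows, assuming (hypothetically) that the templates use `${name}` placeholders; the function name and the `component-` directory prefix are illustrative only and are not prescribed by the platform.

```python
import os
import shutil
import string
import tempfile


def render_configuration(template_path, values):
    """Copy a configuration template into the scratch directory and fill it in there."""
    # TMPDIR is the designated scratch directory; fall back to /tmp if it is unset.
    scratch = os.environ.get("TMPDIR", "/tmp")
    # Use a private sub-directory so that concurrent runs do not collide.
    work_dir = tempfile.mkdtemp(prefix="component-", dir=scratch)
    target_path = os.path.join(work_dir, os.path.basename(template_path))
    # Never modify the installed template: work on a copy.
    shutil.copyfile(template_path, target_path)
    with open(target_path, "r", encoding="utf-8") as stream:
        template = string.Template(stream.read())
    # substitute() raises KeyError if a placeholder has no value.
    content = template.substitute(values)
    with open(target_path, "w", encoding="utf-8") as stream:
        stream.write(content)
    return target_path
```

The returned path is the one that should be handed to the component (for example as the value of its configuration argument); the installed template itself is left untouched.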

(See also the discussion about resource constraints below.)

2.4 Network access

Just like in the previous two cases, the isolated environment in which software components are executed implies certain constraints also in regard to network access. Fortunately, unlike in the case of file-system access, in most cases the constraints will not interfere with the running components, however it is good to know the limitations, especially when debugging faulty behaviour:

Isolated TCP/IP stack

Although the component has access, as expected, to the two network interfaces, one for loopback (lo), and another one for Ethernet (eth0), each with proper IP addresses configured, it must be noted that these are themselves virtual interfaces, thus two components running on the same host will not share these devices and therefore will not be able to communicate with each other using these configured addresses. (See below for ways to enable such communication.)

On the positive side, the software component is free to bind to any port (outside the privileged range from 1 to 1023) on the loopback address range (127.0.0.0/8), and use these endpoints to communicate internally (for example in case the component has multiple running processes communicating via TCP/IP).

Binding TCP/IP endpoints

However binding on the IP address belonging to the Ethernet interface (eth0) should be done only on ports designated by the controlling platform via custom mechanisms.

Isolated IPC stack

Like in the case of TCP/IP, all other IPC mechanisms are isolated and only available to the executing software component. (This includes UNIX domain sockets, message queues, shared memory, semaphores, etc.)

UDP broadcast or multicast

Although unicast UDP is available (under the same restrictions as described above in the TCP/IP section), broadcast or multicast are not available, even between the components running on the same host. (As a matter of fact, broadcast and multicast are not allowed in almost all IaaS cloud environments, therefore their usefulness is limited.)

DNS queries

Because of security considerations, and to ease the integration with other platform sub-systems, all DNS queries could be resolved by a custom resolver integrated into the platform, regardless of the target DNS server. (I.e. inside the containers the developer can use any setting inside /etc/resolv.conf, however UDP packets with the destination port 53, are going to be redirected to an internal resolver, thus ignoring the configured destination.)

Outbound access

Although currently there is no filtering in-place for outbound connections (i.e. connections initiated by the software components themselves) the following should be kept in mind:

any connections are allowed between the components belonging to the same platform (see below for details);
for connections targeted at other hosts on the internet, or even inside the same cloud provider, one should try to limit the used protocols to HTTP/HTTPS, and only where strictly required by the involved technology to use others (like SMTP, AMQP, SSH, etc.);
special care should be given to protocols that transport sensitive data, especially authentication credentials, in case of non TLS-based ones;

Non-TCP or non-UDP protocols

Except local IPC mechanisms, and TCP or UDP, no other protocols are allowed (like SCTP, RDP, etc.).

2.5 Network endpoints

The previous section on network access dealt mostly with outbound communication (i.e. connections initiated by the component going outwards) and "intra-component" communication (i.e. in case of multi-process components, how these can communicate). Inbound communication, i.e. accepting client connections by the software components, either from within or outside the application instance, is covered by this section.

An "endpoint" refers to a TCP or UDP address, composed of an IP address and a port, used by components to "bind" (or "listen" depending on the used library), and by their clients to "connect". Although these two activities, that is "bind" and "connect", have well-defined APIs and semantics, a third important activity, that of "advertisement" or "discovery" --- i.e. how the clients find out the component's endpoint --- is completely left at the disposal of the developer.

Moreover in a cloud-like environment, the endpoint used to "bind" by the component is not always the same one as the one used to "connect" by the clients, therefore the developer must take this consideration into account.

For example, taking into consideration application deployments on IaaS environments, such as the well-known Amazon EC2 environment or compatible ones, we usually have two major types of IP addresses, public and private ones; the former are reachable from the internet and billed, and the latter are reachable only from within the same cloud instance and are not billed. Therefore even in such a basic scenario, the software component must bind on the private endpoint, but when it comes to advertisement, it has to take into account whether the client (to which the advertisement is made) is running within the same cloud instance and is part of the same application instance, in which case it replies with its private endpoint; otherwise, if the client is "on the outside" or not part of the same application instance, it replies with the public endpoint. In such cases the port of the two endpoints is usually the same, and only the IP address changes. The underlying technique is that of NAT (Network Address Translation), more exactly DNAT (Destination NAT): from the outside the connecting client "sees" only the public IP address (i.e. it can only connect to the public IP address), while the binding software component "sees" only the private IP address (i.e. it can only bind on the private IP address). The mapping between the two is done by the cloud routing infrastructure, and the application must be aware of it.

However in PaaS environments the situation is even more complex due to the addition of yet another layer of virtualization (the containers), because now on the same host we have multiple running components, each with an isolated network stack. Therefore, besides the private and public endpoints, the developers must also be aware of the container endpoint, whose IP address, as specified in the previous section, is bound to the private Ethernet interface eth0. Fortunately the main difference between the IaaS scenario and the PaaS scenario is that the component must bind on the specified container endpoint, while the advertisement remains similar to that in the IaaS scenario, that is replying with either the public or private endpoint, depending on where the advertised-to client is situated relative to the component. The other notable difference is that, whereas in the IaaS scenario the port of the two endpoints is the same, in the PaaS scenario the three endpoints are completely different (i.e. both the IP address and the port differ between the public, private and container endpoints).

In order to better distinguish between the three sets of endpoints, and to better convey their use case, we propose to use the following terminology:

Bind endpoint

It is the endpoint on which a software component must bind and listen, or accept, inbound client connections. Technically this endpoint is composed of the container IP address and a designated port. For example in Java it is the IP address and port given to the ServerSocket constructor; or the IP address and port to configure in the proper configuration file for most server solutions.

Cluster endpoint

It is the endpoint which should be advertised to clients connecting from within the same cloud instance. Technically this endpoint is composed of the host private IP address and a designated port, which is different from the port of the bind endpoint.

Public endpoint

Similar to the cluster one, it is the endpoint which should be advertised to connecting clients, but in this case to those connecting from the internet, or to those that run inside the same cloud instance but do not belong to the same application instance. Technically it is composed of the host public IP address and a designated port, possibly different from the cluster one, and certainly different from the bind port.

Multi-cloud endpoint

Although currently not implemented, in case of multi-cloud deployments there should be another designated endpoint, similar to the cluster one, to be used by connecting clients part of the same application instance, but running in another cloud instance. Technically this is composed of an IP address tunneled between the cloud instances, and a designated port.

Loopback / local endpoint

For completeness, this endpoint should be used between different processes running inside the same container. Technically it is composed of the loopback IP address (127.0.0.1 or one in the range 127.0.0.0/8), and any port, the choice being left at the disposal of the developer and software component.
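As an illustration of the terminology above, the choice of the endpoint to advertise could be sketched as below. The `Endpoint` type, the function name and the two boolean flags are hypothetical; how a component learns where its client runs is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    ip: str
    port: int


def endpoint_to_advertise(endpoints, same_cloud, same_application):
    """Pick the endpoint to hand out to a client, given where the client runs."""
    if same_cloud and same_application:
        return endpoints["cluster"]
    # Clients on the internet, or in the same cloud instance but belonging to
    # another application instance, get the public endpoint. The multi-cloud
    # endpoint is not implemented yet, so such clients also get the public one.
    return endpoints["public"]
```

Note that the bind endpoint never appears in the result: it is only meaningful inside the container and should not be advertised to clients.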

Notes

Important

In most cases software is configured to bind on the "any" IP address (i.e. 0.0.0.0). Because this is a potential security issue, it is recommended to explicitly specify a proper IP address instead (i.e. that of the bind endpoint in our case).

Important

When describing the set of different endpoints, the word "designated" was used in all cases (except the loopback endpoint) because the values defining each endpoint (i.e. the IP address and port) should be obtained from run-time configuration, each time the component is started (instead of from hard-coded constants). The reason is that due to the dynamic nature of cloud deployment these values can't be predicted, and most likely are randomly generated.
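A minimal sketch of binding on an explicit address, using only the Python standard library, might look as follows. The function name is illustrative, and the IP address and port are assumed to come from run-time configuration rather than from constants.

```python
import ipaddress
import socket


def bind_listener(ip, port):
    """Bind a TCP listener on an explicit, configured endpoint."""
    address = ipaddress.ip_address(ip)  # raises ValueError if not a numeric IP
    if address.is_unspecified:
        # 0.0.0.0 (or ::) would listen on every interface of the container.
        raise ValueError("refusing to bind on the 'any' address: %s" % ip)
    family = socket.AF_INET6 if address.version == 6 else socket.AF_INET
    listener = socket.socket(family, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((str(address), port))
    listener.listen()
    return listener
```

Rejecting the "any" address outright is a design choice of this sketch; a component could instead log a warning and continue.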

2.6 Resources

Because the software components are deployed on PaaS-like solutions, thus co-located with other software components, their resource consumption should be kept low enough not to overwhelm the hosting VM or interfere with the proper execution of their neighbors.

In general such considerations are not needed because most software run-times can auto-detect the available resources and tune themselves accordingly. However in the current case of PaaS-like deployments such auto-tuning algorithms can fail to detect the actual resource limits, and instead detect the resources of the whole VM, out of which they have only a small slice. (This is not a limitation of the PaaS-like solutions, but stems from the fact that the employed lightweight virtualization solutions, like LXC, are still in early adoption, and there is a lack of proper detection from the side of the run-time environments.)

The following are some suggested limits:

CPU

It should be assumed that only 1, or at most 2, CPU's (or virtual CPU's) are going to be put at the disposal of the software components. This implies proper configuration of thread pools, say 4-8 threads for I/O, and only 2-4 threads for background jobs.

RAM

It should be assumed that only 1, or at most 2, GiB of RAM is available for the software component. This implies proper configuration of memory heaps, like Java's -Xmx. However keeping in mind that the run-time might also need additional memory for itself, or for the native libraries, it is advisable to assume that only 75% of the memory should be allocated to the various heaps and memory pools.

Disk capacity and retention

Each software component will receive, via configuration options, a temporary folder where it can store non-persistent data, up to 2, or at most 4, GiB. It should be assumed (and it happens in reality) that after the component is shut down or restarted, the stored data is removed. (All persistent data should be stored in database-like solutions, like for example the object store.)

Disk bandwidth

Although currently the employed lightweight virtualization solutions, like LXC, don't offer proper tools to limit or prioritize the disk bandwidth of the containers, the host operating system employs a fair-share policy between the running processes and threads. This implies that the software components should be conservative with regard to disk I/O, especially the ones that spawn multiple background processes or threads.

Network bandwidth

Similarly to the disk bandwidth described above, the software components should be conservative with regard to network I/O.

I/O bandwidth in cloud environments (cumulated disk and network bandwidth)

It should be noted that in most cloud environments, especially Amazon EC2, the disk and network I/O rates are cumulated against the VM limits, thus if on the same host VM we have one network intensive component, and one disk intensive one, their I/O patterns will interfere.

Please note that these limitations are not arbitrary, but based on current best-practices and limitations imposed by other real-life PaaS solutions like Heroku [Heroku-Limitations] [Heroku-SotA].
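The CPU and RAM suggestions in this section can be turned into concrete settings with a little arithmetic. The sketch below assumes the available memory is supplied by the deployment (it is not auto-detected, for the reasons given above); the function names and default values are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def jvm_heap_option(available_mib, heap_fraction=0.75):
    """Build a -Xmx option leaving room for the run-time and native libraries."""
    return "-Xmx%dm" % int(available_mib * heap_fraction)


def make_thread_pools(io_threads=8, background_threads=2):
    """Size the pools explicitly instead of relying on CPU auto-detection."""
    return (ThreadPoolExecutor(max_workers=io_threads),
            ThreadPoolExecutor(max_workers=background_threads))
```

With 1 GiB (1024 MiB) available, `jvm_heap_option(1024)` yields `-Xmx768m`, i.e. 75% of the memory for the heap and the rest for the run-time itself.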

3 Configuration

3.1 Parameters

All developed software components must be easily configurable both at deployment and at execution, via at least one of the mechanisms described below, exposing all those configuration parameters that might change from one deployment to another, which might include the following.

It is advised that all these parameters are both syntactically and semantically verified during the component startup, and any encountered error or inconsistency (such as specifying two mutually exclusive parameters) should abort the component startup, and clearly indicate the reason through logging with an error level message.

(Similar concerns are also addressed in [12App-Config] and [Heroku-Configuration].)

Network endpoints

All network endpoints, used for binding, advertising, and connecting, should be configurable for each individual execution. In case of TCP/IP endpoints, this implies both the IP address and port, expressed either as a single string value (for example <host>:<port>), or preferably as two separate values, one for the IP and one for the port. The <host> part should be expressed as a numeric IP address in case of binding endpoints, and as a FQDN in the other cases if possible. (See the Network endpoints section for details.)

In case these parameters are not explicitly given, the component should abort the startup, and indicate the reason through logging with an error level message. If the developer chooses to either infer or hard-code these parameters, they should have as an IP address one in the loopback range (i.e. 127.0.0.0/8), and all these chosen values should be indicated through logging with a warning level message.
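The startup verification of a bind endpoint given as a single `<host>:<port>` string could be sketched as follows. This is an illustration assuming IPv4 addresses; the function names, the logger name and the exit status are hypothetical.

```python
import ipaddress
import logging
import sys

logger = logging.getLogger("component")


def parse_bind_endpoint(value):
    """Parse '<ip>:<port>' into (ip, port); raise ValueError on any inconsistency."""
    if value is None:
        raise ValueError("bind endpoint not configured")
    host, separator, port_text = value.rpartition(":")
    if not separator or not host:
        raise ValueError("expected <ip>:<port>, got %r" % value)
    ipaddress.ip_address(host)  # bind endpoints must use a numeric IP address
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("port is not a number: %r" % port_text)
    port = int(port_text)
    if not 1024 <= port <= 65535:
        # Ports under 1024 can't be bound without elevated privileges.
        raise ValueError("port out of range: %d" % port)
    return host, port


def load_bind_endpoint_or_abort(value):
    """Abort the startup with an error level message if the endpoint is invalid."""
    try:
        return parse_bind_endpoint(value)
    except ValueError as error:
        logger.error("invalid bind endpoint: %s", error)
        sys.exit(1)
```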

Credentials (or other security tokens)

In addition to the network endpoints of various remote services, for the ones requiring authentication, the credentials should be configurable for each individual execution. In case these parameters are not explicitly given, the component should abort the startup and indicate the reason through logging with an error level message.

File-system locations

Any paths required during the execution should be configurable for each individual execution, especially those that are used for read-write access (such as database, spool, or temporary locations).

These locations should be expressed as absolute paths; in the case relative paths are allowed, they should be considered as relative to the configuration file.

It is advisable to segregate the files into separate locations based on their type, like for example:

configuration files, used either as templates for the actual configuration files or for defaults;
"public" executable files, either native executables or scripts, usable by both users or other tools, and whose location should be added to the PATH environment variable;
"private" executable files, used only by the software component itself;
libraries required for execution (like *.so, *.jar, etc.);
static files, such as constant databases, or web resources (i.e. HTML, CSS, etc.), that don't change during the execution of the component;
database or spool folders, used to store the long-term data of the component during its execution; (please note that, as explained in the Resources section, these are not persistent after the component finishes the execution;)
temporary folders, used to store the short-term data of the component during its execution;

In case these locations are not explicitly given, the component could try to infer them, however the chosen values should be indicated through logging with a warning level message. Based on the previously listed types the following heuristics can be used:

in accordance with the packaging guidelines, described in the Packaging section, the developer should assume that most of the locations share a common prefix like /opt/<package>, that we'll call $PREFIX, which could be resolved as follows: assuming that the executed binary (or wrapper script) is installed in ${PREFIX}/bin, we can use the executable path's grand-parent;
${PREFIX}/etc for configuration files;
${PREFIX}/bin for "public" executable files;
${PREFIX}/libexec or ${PREFIX}/lib for "private" executable files;
${PREFIX}/lib for libraries;
${PREFIX}/share for static files;
${PREFIX}/var for databases or spools;
${TMPDIR} or /tmp for temporaries;

Important

Under no circumstances should the component use the current working directory as a reference point in resolving any of these locations.
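The heuristics listed above could be sketched as follows; the function names and the dictionary keys are illustrative, and ${PREFIX}/libexec is arbitrarily picked for the "private" executables.

```python
import os
import sys


def resolve_prefix(executable=None):
    """Infer ${PREFIX} as the grand-parent of the executable (${PREFIX}/bin/<name>)."""
    executable = executable or sys.argv[0]
    # realpath() yields an absolute path and follows symlinks, so a link
    # such as /usr/bin/<name> still resolves to the real install location.
    return os.path.dirname(os.path.dirname(os.path.realpath(executable)))


def default_locations(prefix):
    """Map each type of file to its inferred location under ${PREFIX}."""
    return {
        "configuration": os.path.join(prefix, "etc"),
        "public-executables": os.path.join(prefix, "bin"),
        "private-executables": os.path.join(prefix, "libexec"),
        "libraries": os.path.join(prefix, "lib"),
        "static": os.path.join(prefix, "share"),
        "data": os.path.join(prefix, "var"),
        "temporary": os.environ.get("TMPDIR", "/tmp"),
    }
```

Once the prefix is resolved, every location is derived from it (or from TMPDIR), so later changes of the working directory have no effect on them.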

Run-time specific parameters

Although not as critical as the previous categories, certain run-times allow changing parameters to optimize the execution towards specific use-cases, and in general these are configured by wrapper scripts. The developer could choose to either expose the most important ones as configuration parameters, or preferably expose a set of profiles which imply certain values.

For example in the case of an Apache Tomcat based application, a couple of profiles like high-resources, medium-resources and low-resources, could limit the amount of memory given to the JVM and the size of the various thread pools based on the amount of available resources.

Miscellaneous

Other parameters that might be of use are:

logging levels
Especially useful in debugging a misbehaving component. (See also the Logging section.)
timeouts, or retries
Useful when deploying in resource-constrained environments.

3.2 Overriding

In general a component should employ at least one of the configuration mechanisms described below. Usually some parameters are configured exclusively via a single mechanism (like only through a configuration file), while others could be configured through multiple ones (like through either a configuration file or an environment variable). In the latter case (i.e. multiple mechanisms) the following rules should apply when determining the actual value:

parameters conveyed through process arguments should always have the highest priority;
then parameters conveyed through process environment;
then parameters expressed in the configuration file given explicitly at startup; (such as --configuration ...;)
then parameters expressed in the "default" configuration file; (such as ${PREFIX}/etc/...;)
then parameters inferred through various heuristics; (such as based on the deployment environment;)
then parameters hard-coded in the source code;

It is advisable that if the same value is expressed via multiple mechanisms (such as process argument or environment, and explicit configuration file), then the fact that some low-priority values are ignored should be indicated through logging with a warning level message.
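The priority rules above could be sketched as below. The source names and the function are hypothetical; the sketch warns only about ignored values that were explicitly expressed (arguments, environment, configuration files), which is one possible reading of the advice.

```python
import logging

logger = logging.getLogger("component")

# Sources ordered from highest to lowest priority, as listed above.
PRIORITY = ("arguments", "environment", "explicit-file",
            "default-file", "inferred", "hard-coded")
# Only values expressed by the operator are worth a warning when ignored.
EXPLICIT = PRIORITY[:4]


def resolve_parameter(name, candidates):
    """Return the highest-priority value; `candidates` maps a source to its value."""
    chosen = None
    for source in PRIORITY:
        if candidates.get(source) is None:
            continue
        if chosen is None:
            chosen = source
        elif source in EXPLICIT:
            logger.warning("parameter %s: value from %s ignored, overridden by %s",
                           name, source, chosen)
    if chosen is None:
        raise KeyError(name)
    return candidates[chosen]
```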

3.3 Process arguments

FIXME: To be revised!

This is the list of values obtained as main's argv argument in C / Java (or sys.argv in Python, and so on).

The following rules should apply:

the syntax should follow the "GNU long format" (like --parameter, --parameter=<value>, or --parameter <value>) or similar;
any argument following the -- token should be treated as non-parameters; (i.e. if the executable or wrapper script expects any arguments, such as file paths to operate on, the arguments following -- should be treated as such;)
it is discouraged to allow "melding" multiple parameters in a single word (like -av instead of -a -v);

Important

Any process argument that is not expected (or incorrect) should cause the component to abort the startup and indicate the reason through logging with an error level message. This includes checking for the existence of any non-parameter arguments (i.e. not starting with --, or following the -- token).

It is advised to use standard libraries for argument parsing, like the ones below:

Java
Apache Commons CLI (de-facto standard library);
JewelCli (lightweight and more "modern" alternative);
Go
Go standard "flag" module;
Python
Python standard "argparse" module;
NodeJS
FIXME: To be defined!;
Ruby
FIXME: To be defined!;
C / C++
POSIX "getopt";
Google "gflags" (C++ only);
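For instance, with the Python standard "argparse" module the rules above could be sketched as follows. The parameter names (--configuration, --bind-ip, --bind-port) and the program name are hypothetical; the `error` override only replaces the default usage message with an error level log record.

```python
import argparse
import logging

logger = logging.getLogger("component")


class AbortingParser(argparse.ArgumentParser):
    def error(self, message):
        # Report through logging at error level, then abort the startup.
        logger.error("invalid arguments: %s", message)
        self.exit(2)


def parse_arguments(argv):
    # allow_abbrev=False rejects shortened forms such as --bind-p.
    parser = AbortingParser(prog="component", allow_abbrev=False)
    parser.add_argument("--configuration", metavar="<path>")
    parser.add_argument("--bind-ip", metavar="<ip>", required=True)
    parser.add_argument("--bind-port", metavar="<port>", type=int, required=True)
    # No positional arguments are declared, so any non-parameter argument
    # (including those following the -- token) is reported as an error.
    return parser.parse_args(argv)
```

Both `--bind-port=31001` and `--bind-port 31001` are accepted, while unknown parameters, missing required ones, and stray words abort the startup with exit status 2.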

3.4 Process environment

FIXME: To be written!

prefix the environment variables with tokens like MODACLOUDS_COMPONENT_A_...;
when configuring connection endpoints, use the target component prefix; (suppose component "A" needs to connect to component "B", thus the environment variable used by component "A" to specify the endpoint of component "B" should be prefixed like MODACLOUDS_COMPONENT_B_...; this eases the development by allowing the usage of a "shared" environment for both;)
when configuring binding endpoints, in addition to the prefix, suffix them with ..._BIND; thus the environment variable used for advertisement is without the suffix and can coincide with the one used for connection (as described above);
if a wrapper script needs to specify environment variables that should not be used by other parties, prefix them with two underscores like __MODACLOUDS_COMPONENT_A_...;
don't expect any other environment variable except PATH;
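The naming rules above could be sketched as below. The part of the variable name following the component prefix is not fixed by this document, so the `_ENDPOINT_IP` and `_ENDPOINT_PORT` tokens used here are purely hypothetical, as are the function names.

```python
import os

PREFIX = "MODACLOUDS_COMPONENT_"


def endpoint_variables(component, bind=False):
    """Names of the variables holding a component's endpoint (IP and port)."""
    # The "_ENDPOINT_IP" / "_ENDPOINT_PORT" tokens are assumptions of this sketch.
    base = PREFIX + component.upper() + "_ENDPOINT"
    suffix = "_BIND" if bind else ""
    return base + "_IP" + suffix, base + "_PORT" + suffix


def read_endpoint(component, bind=False, environ=os.environ):
    """Read an endpoint from the environment; raise KeyError if it is missing."""
    ip_name, port_name = endpoint_variables(component, bind)
    missing = [name for name in (ip_name, port_name) if name not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return environ[ip_name], int(environ[port_name])
```

With this scheme component "A" would call `read_endpoint("b")` to find where to connect, while component "B" would call `read_endpoint("b", bind=True)` to find where to bind, and both could run from the same "shared" environment.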

3.5 Configuration files

FIXME: To be written!

use "standard" formats such as (in order of "preference") JSON, INI, XML, YAML, Java *.properties;
read all the configuration files only once during the component startup, and never re-open them again; (it might be that the configuration file is something like /dev/fd/5 which points to an open descriptor, etc.;)
treat all the configuration files as sequential access only (i.e. once read they can't be rewound), preferably loading and parsing them in memory; (see the previous case of synthetic files;)
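A single-pass loader following the rules above could be sketched like this, assuming a JSON configuration file encoded as UTF-8; the function name is illustrative.

```python
import json


def load_configuration(source):
    """Read the whole configuration in one sequential pass and parse it in memory."""
    # `source` is a path such as /dev/fd/5; Python's open() also accepts an
    # already open file descriptor number. No seek() and no second open().
    with open(source, "rb") as stream:
        data = stream.read()
    return json.loads(data.decode("utf-8"))
```

Because the content is fully read before parsing, the same code works for regular files, pipes and other non-seekable sources.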

It is advised to use standard libraries for configuration file parsing, like the ones below:

Java
TOML-compliant;
FIXME: To be defined!;
Go
TOML;
FIXME: To be defined!;
Python
TOML-compliant;
FIXME: To be defined!;
NodeJS
TOML-compliant;
FIXME: To be defined!;
Ruby
TOML-compliant;
FIXME: To be defined!;
C / C++
FIXME: To be defined!;

Note

A unique, strange but at the same time interesting, technique is used by the Mongrel2 web server (see Mongrel2 Manual -- Managing, http://mongrel2.org/manual/book-finalch4.html).

3.6 Remote resources

FIXME: To be written!

all dynamic configuration parameters (i.e. that can change after the component has started) should be loaded from remote resources;
cache the content of these remote resources, either in memory or in temporary or spool folders, as the remote resources can become unavailable at any moment;
retry retrieval before giving up; (see the Retry patterns section;)
use the object store for such purposes;
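The caching advice above could be sketched as follows. The `fetch` callable is a hypothetical stand-in for whatever client retrieves the resource (for example from the object store) and is assumed to return JSON bytes or raise an OSError subclass on failure.

```python
import json
import os


def load_dynamic_parameters(fetch, cache_path):
    """Fetch dynamic parameters, falling back to the last cached copy on failure."""
    try:
        data = fetch()  # hypothetical callable returning the resource as bytes
    except OSError:
        # The remote resource is unavailable: reuse the cached content.
        with open(cache_path, "rb") as stream:
            return json.loads(stream.read())
    parameters = json.loads(data)  # validate before replacing the cache
    # Write to a temporary file and rename it, so that a crash never leaves
    # a half-written cache file behind.
    temporary = cache_path + ".tmp"
    with open(temporary, "wb") as stream:
        stream.write(data)
    os.replace(temporary, cache_path)
    return parameters
```

The cache file is expected to live in the temporary or spool folder given to the component; if the very first retrieval fails there is no cached copy, and the error from opening the cache propagates to the caller.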

4 Execution

4.1 Inter-component communication

FIXME: To be written!

use only configured network-endpoints;
use "cluster-local" (i.e. private) endpoints whenever possible;

4.2 Logging

FIXME: To be written!

use stderr for logging;
use ASCII-only (or UTF-8 if necessary), with no control characters;
use \n as logging record terminator;
keep logging record under 4 KiB (see man 7 pipe, search for PIPE_BUF);
...

It is advised to use standard libraries for logging, like the ones below:

Java
Logback as logging backend;
SLF4J as logging interface;
SLF4J -- Binding legacy APIs as logging adapters for other libraries;
Go
Go standard "log" module;
Python
Python standard "logging" module;
NodeJS
FIXME: To be defined!;
Ruby
FIXME: To be defined!;
C / C++
Google "glog" (C++ only);
FIXME: To be defined!;
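With the Python standard "logging" module, the rules listed in this section could be sketched as below. The formatter class, the record layout and the escaping scheme are choices made for this illustration only.

```python
import logging
import sys

MAX_RECORD_BYTES = 4096  # PIPE_BUF on Linux, see man 7 pipe


class SingleLineFormatter(logging.Formatter):
    def format(self, record):
        text = super().format(record)
        # Escape control characters (including embedded newlines) and
        # non-ASCII characters, keeping printable ASCII as it is.
        text = "".join(
            c if " " <= c <= "~" else c.encode("unicode_escape").decode("ascii")
            for c in text)
        # Leave room for the terminating "\n" added by the handler.
        return text[:MAX_RECORD_BYTES - 1]


def configure_logging(level=logging.INFO, stream=None):
    """Send all records to stderr, one "\\n"-terminated ASCII line per record."""
    handler = logging.StreamHandler(sys.stderr if stream is None else stream)
    handler.setFormatter(SingleLineFormatter(
        "%(asctime)s %(levelname)s %(name)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return handler
```

Since every character of the output is ASCII, the length in characters equals the length in bytes, so each record (terminator included) fits in 4 KiB.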

(Similar concerns are also addressed in [12App-Logs] and [LogsAreStreams]).

4.3 Wrapper scripts

FIXME: To be written!

use exec at the end of wrapper scripts;
write a wrapper application in a compiled language (C / Go);
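In a shell wrapper the last command would be prefixed with `exec`; the same effect in a Python wrapper could be sketched with `os.execve`, as below. The function names and the example values in the comments are hypothetical.

```python
import os


def build_invocation(executable, arguments, overrides, environ=None):
    """Return the (argv, environment) pair the component will be started with."""
    environment = dict(os.environ if environ is None else environ)
    environment.update(overrides)
    # By convention argv[0] is the name of the program itself.
    return [executable] + list(arguments), environment


def exec_component(executable, arguments, overrides):
    """Replace the wrapper process with the component; does not return on success."""
    argv, environment = build_invocation(executable, arguments, overrides)
    # `executable` must be a path: execve() does not search PATH.
    os.execve(executable, argv, environment)
```

Because the wrapper process is replaced rather than kept as a parent, signals sent by the platform reach the component directly and its exit status is reported unchanged.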

4.4 Retry patterns

FIXME: To be written!

retry with exponential backoff;
exit after a limited number of retries;
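The two rules above could be sketched as follows. The default number of attempts, the delays, and the choice of OSError as the retryable failure are assumptions of this illustration.

```python
import time


def retry(operation, attempts=5, initial_delay=0.5, factor=2.0, sleep=time.sleep):
    """Call operation() until it succeeds, multiplying the delay after each failure."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                # Give up: the caller should log the error and exit.
                raise
            sleep(delay)
            delay *= factor
```

With the defaults the delays between attempts are 0.5, 1, 2 and 4 seconds; the `sleep` parameter exists only so that the waiting can be replaced during testing.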

4.5 "Heartbeat" or "Watchdog"

FIXME: To be written!

if possible employ a heartbeat / watchdog solution; (i.e. sub-process that continuously polls via a custom mechanism the proper functionality;)
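A polling loop of this kind could be sketched as below; it is meant to run in a dedicated sub-process or thread. The `check` and `on_failure` callables, the interval and the failure threshold are all hypothetical, and the `rounds` parameter exists only to bound the loop in tests.

```python
import time


def watchdog(check, on_failure, interval=5.0, max_failures=3,
             sleep=time.sleep, rounds=None):
    """Poll check() and call on_failure() after max_failures consecutive failures."""
    failures = 0
    performed = 0
    while rounds is None or performed < rounds:
        performed += 1
        try:
            healthy = bool(check())
        except Exception:
            healthy = False  # a crashing probe counts as a failed probe
        failures = 0 if healthy else failures + 1
        if failures >= max_failures:
            on_failure()
            return False
        sleep(interval)
    return True
```

Requiring several consecutive failures before reacting avoids restarting a component because of a single slow response.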

4.6 Testing

FIXME: To be written!

Useful tools to ease execution while developing:

https://github.com/ddollar/foreman;
https://github.com/ddollar/forego;
https://github.com/mattn/goreman;
http://cr.yp.to/daemontools/envdir.html;
http://direnv.net/;