Pipelines
VirtualMetric DataStream pipelines were designed to automate large-volume data processing. They can be used to extract values from various sources, transform or convert these values, enrich them by correlating them with other available information, and forward them to various destinations for consumption.
In short, they are aimed at helping you design and implement your telemetry workflow.
Definitions
A pipeline, in its simplest form, is a chain of processors that run sequentially, operating on the incoming data streamed from providers and directing it to destinations for consumption.
The sources can be devices, networks, or other pipelines, and the targets can be consoles, files, storage systems, or other pipelines. Sources and targets can be connected to each other in one-to-one, one-to-many, many-to-one, and many-to-many configurations.
Each incoming stream can be queried with criteria specific to the information contained therein, and each outgoing stream can be enriched with inferences made from correlations for use on the destination side.
Schematically:
In real-life scenarios, the complexity of the configurations will vary based on the requirements of the consumers and the intended destinations.
Configurations
A pipeline can consume data from multiple providers and, once it has finished processing the incoming streams, direct some or all of the processed data to a specific consumer in a one-to-one scheme:
It can also consume data from a single provider and deliver the processed results to multiple consumers:
Finally—and possibly in most real-life scenarios—there may be multiple providers and consumers connected to each other in much more complex arrangements:
Effectively, each source may be the target of an upstream pipeline, and each target may serve as the source for a downstream one. The detail to keep in mind is that each provider side is delegating some processing to the next pipeline based on the established requirements of the consumer side. The pipeline is acting as the middleman for the data interchange.
Even if the pipeline routes some of the data without any processing, this will be due to the demands of the consumer side, so the pipeline is still performing a meaningful role by forwarding the data as per the consumer's policy.
Included Middle
Notwithstanding the above, there is also a middle stage where most software solutions do their real work: Routing. This stage in turn depends on two other interim stages that also act as consumers and providers: the Pre-processing and Post-processing stages, respectively.
As a side note, the implementation of these interim stages is based on a transformation known as normalization, i.e. preparing the data for easier handling further down the chain.
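For illustration only, normalization might look like the following: a raw log line recast into a keyed structure. The field names below are hypothetical and not prescribed by DataStream; the point is simply that a single opaque string becomes a set of addressable fields:
# Before: the raw event is a single opaque string
raw: "<134>Jan 12 10:22:03 fw01 denied tcp 10.1.2.3:51233 -> 203.0.113.7:443"

# After: the same event as discrete, addressable fields (names are illustrative)
event:
  host: fw01
  action: denied
  protocol: tcp
  source_ip: 10.1.2.3
  source_port: 51233
  destination_ip: 203.0.113.7
  destination_port: 443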
Pipelines are, once again, our primary tool for implementing these interim stages. Schematically:
As can be seen, routes are right in the middle of this setup.
The graphics provided above are intended to emphasize one thing: the primary purpose of pipelines is to simplify telemetry through division of labor.
Data Streams
In any telemetry operation, the incoming raw data streams will naturally have their own structure. However, this structure may not, and frequently does not, lend itself to use for a specific purpose. Trying to use it as is would be time-consuming, processor-intensive, and possibly error-prone. Therefore, it is unwise to implement a curation strategy on the source side, where the data originates.
Similarly, if the curated data needs to be transformed or enriched as per the requirements of a downstream consumer, it is unwise to try to delegate these transformation tasks to the target side where the data is expected to be in an easily consumable format.
And these are exactly the places where pipelines enter the picture.
Divide and Rule
A pipeline is essentially based on the principle of divide-and-rule: divide the operations to be performed on the available data into the smallest meaningful and coherent actions, sequence them in a logical order based on the expected inputs and outputs, and run them as a chain.
The basic layout of the configuration for a typical pipeline looks like so:
pipeline:
  description: "What the pipeline does"
  processors:
    - processor-1:
        field: foo
        value: "A"
    - processor-2:
        description: "What processor-2 does"
        field: bar
        value: true
    - processor-3:
        field: baz
        value: 10
This pipeline is created based on the assumption that the incoming data stream will have the following structure:
data:
  - foo: ["A", "B", "C", ...]
  - baz: [5, 10, 1, ...]
  - bar: true
Note that the pipeline definition contains a mixture of required and optional fields and that, in this form, it disregards the order of fields in the source data. All it cares about is that
- the enumerated fields exist, and
- they contain the expected types of data.
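For instance, assuming only these three fields matter, the same pipeline would accept a stream whose fields arrive in a different order just as readily:
data:
  - baz: [5, 10, 1, ...]
  - bar: true
  - foo: ["A", "B", "C", ...]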
Simplified Functionality
The above pipeline may appear incomplete: it does not explicitly specify what will happen to the selected values when the processors are run. Assume that, after this pipeline has been run, the data appears like so:
data:
  - foo: ["Z", "B", "C", ...]
  - baz: [5, 100, 1, ...]
  - bar: true
One can easily notice that the value A in foo has become Z, that the 10 in baz has become 100, and that bar is left untouched since it was already true.
Although this pipeline may appear underspecified, the example is intended to illustrate that what a sequence of processors does can be as simple as this. Each processor performs only a single operation on a specific field, and that is precisely where a pipeline's power lies.
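To make this concrete, here is a minimal sketch of how such single-purpose processors could be spelled out. The set processor name and its match option are assumptions used for illustration; only the field/value layout is taken from the example above:
pipeline:
  description: "Rewrite matched values, one field per processor"
  processors:
    - set:
        description: "Replace the value A in foo with Z"
        field: foo
        match: "A"   # hypothetical option: apply only where the current value is "A"
        value: "Z"
    - set:
        description: "Replace the value 10 in baz with 100"
        field: baz
        match: 10
        value: 100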
Once the incoming data is normalized, i.e. converted into a structure from which items can easily be extracted, its values can be curated, transformed, and enriched into a desirable form on the fly.
This brings us to the concepts of curation, transformation, and enrichment.
Streamlined Streams
As we have said earlier, the incoming data streams will have their own structure. As such, they will at best be only partially suitable for analysis and, later, decision making. In its raw form, data frequently is not, or at least may not be, what it seems to be. It has to be sorted and sifted through to pick out the relevant information.
The process of selecting or discarding items based on specific criteria is called curation. This involves checking whether a field's value matches or contains the values or fragments of values we are looking for.
After this, the remaining data may need to be converted into forms making them more suitable for analysis and use. This second phase is called transformation.
Finally, the data may contain hints or fragments of information which, when correlated with other available information, can yield insights that are prerequisite to analysis and use. Adding such correlated information in order to render the data more relevant, or to increase its relevance, is known as enrichment.
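As a rough sketch of how these three stages might line up in a single pipeline, consider the fragment below. The processor names (drop, lowercase, geoip) and their options are assumptions chosen for illustration rather than a prescribed DataStream vocabulary; what matters is the curate, then transform, then enrich ordering:
pipeline:
  description: "Curate, transform, and enrich a hypothetical firewall stream"
  processors:
    # Curation: discard events that do not match the criteria of interest
    - drop:
        if: "action != 'denied'"   # hypothetical condition syntax
    # Transformation: convert the remaining values into a more usable form
    - lowercase:
        field: protocol
    # Enrichment: correlate a field with other available information
    - geoip:
        field: source_ip
        target_field: source_geo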
The overall process can be represented schematically as:
It is through this type of seamless three-stage design that a pipeline truly shines and proves itself indispensable for telemetry.