Metrics Engine

The UltraESB metrics engine has two main parts: the in-memory metrics engine and the ElasticSearch based metrics engine. More details are given in the respective sections. The metrics engine is configured in the ultra-metrics.xml file in the conf/monitoring directory, which is statically imported into ultra-root.xml. The import line for ultra-metrics is commented out by default, so to enable the metrics engine you first have to uncomment that line; the main configuration beans in ultra-metrics.xml are shown below.
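
For reference, the import in ultra-root.xml, once uncommented, looks similar to the following (the exact resource path may differ in your distribution):

<import resource="monitoring/ultra-metrics.xml"/>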

<bean id="esMetricsCollector" class="org.adroitlogic.metrics.core.ESMetricsCollectorImpl">
    <property name="primaryMetricsTemplateValidTime" value="20000"/>
    <property name="metricsReportGenerationTimePeriod" value="30000"/>
    <property name="metricsTemplateSyncTimePeriod" value="20000"/>
    <property name="esStatisticsPublisher" ref="es-pub"/>
    <property name="metricsReportGenerationInitialDelay" value="20000"/>
    <property name="metricsTemplateSyncInitialDelay" value="10000"/>
</bean>

<bean id="in-memory-metrics-engine" class="org.adroitlogic.metrics.core.InMemoryMetricsCollectorImpl"/>

<bean id="metrics-engine" class="org.adroitlogic.metrics.core.MetricsEngineImpl">
    <constructor-arg ref="esMetricsCollector"/>
    <constructor-arg ref="in-memory-metrics-engine"/>
</bean>

Here we can see the esMetricsCollector bean, which publishes statistics to ElasticSearch, and the in-memory-metrics-engine bean, which is the in-memory metrics engine. You can use either one of these or both for statistics by changing the constructor-args of the "metrics-engine" bean. More details of this configuration can be found in the respective sections.
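
For example, to collect statistics only in memory, the "metrics-engine" bean could be declared with just the in-memory collector; this is a sketch based on the beans above, so verify against your UltraESB version that a single-collector constructor is supported.

<bean id="metrics-engine" class="org.adroitlogic.metrics.core.MetricsEngineImpl">
    <constructor-arg ref="in-memory-metrics-engine"/>
</bean>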

InMemory Metrics Engine

This guide describes the InMemory Metrics Engine used in UltraESB, along with information related to its internal architecture.

Overall Architecture

The UltraESB in-memory metrics engine is not a full-featured metrics implementation; rather, it focuses on delivering a high-performance, near real-time, light-weight solution for monitoring the system over a pre-configured window of time. It uses an RRDtool-like architecture to collect and store metrics records. However, it DOES NOT persist the records, meaning that it cannot provide historical analysis of the metrics records or any sort of long-standing analytical features.

It is capable of delivering a summarized view of the metrics in a given window, where this window is configurable. It is NOT capable of giving out the absolute data points, as it keeps only a summarization within a pre-configured step time. The metrics engine uses a circular-buffer-like in-memory data structure to keep all the metrics records, where the circular buffer contains a set of objects, each of which represents one step time of the metrics engine. The step time is the smallest time window for which the metrics engine can report metrics summarizations; it can also be considered the general accuracy of the metrics summary that the engine is capable of producing. The number of records in this circular buffer is determined by the configured step time and the metrics window time.

[Figure: metrics diagram]

Any metrics data point reported to the in-memory metrics engine should have a metrics stream associated with it, which may span all or a subset of the step records in the metrics window. A given stream can have zero or more reports within a given step time. There are different metrics stream types, which are discussed below under the Metrics Stream Types section.

The in-memory metrics engine keeps a reference to one of the steps in this buffer as its active record, which is used to record any metrics data points reported to the engine. When a new step record is inserted into the window (the underlying implementation actually reuses an existing step record for efficiency), the oldest record in the window is evicted, keeping a constant number of step records in the window at any given time.
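
The following is a conceptual sketch of this circular buffer of step records, assuming a simple count/sum summary per step; it only illustrates the idea and is not the actual UltraESB implementation.

// Conceptual sketch: a fixed-size ring of step summaries where advancing the
// active slot reuses (and thereby evicts) the oldest record.
public class StepRingBuffer {

    private final long[][] steps;   // one summary slot per step: [0] = count, [1] = sum
    private int active = 0;

    public StepRingBuffer(int numberOfSteps) {
        this.steps = new long[numberOfSteps][2];
    }

    // called for every data point reported within the current step
    public void report(long value) {
        steps[active][0]++;         // number of reports in this step
        steps[active][1] += value;  // accumulated value for this step
    }

    // called once per step time; the oldest slot is reset and becomes the new active record
    public void advanceStep() {
        active = (active + 1) % steps.length;
        steps[active][0] = 0;
        steps[active][1] = 0;
    }
}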

Both absolute data point recording and step record eviction can be intercepted by writing custom extensions to the in-memory metrics engine which we will be discussing in a later section of this guide.

Metrics Window and Step Time

The key parameters that control the dimensions of the in-memory metrics engine, and the resources it consumes, are the metrics window time and the step time. The metrics window is a moving window holding the metrics records for the past window time. So, in effect, the maximum history for which the in-memory metrics engine can provide metrics information is bounded by the metrics window time.

The step time is the lower bound of the metrics record history, as the in-memory metrics engine accumulates the metrics data it receives for one step time before the record is retired to the moving window. The in-memory metrics engine cycles this process for every metrics stream reported to it. The step time is also the finest granularity at which the in-memory metrics engine can provide metrics information, as it only keeps a summarized view per step time. The most important reasons for this are;

  • to provide metrics with the least possible memory footprint

  • to avoid looping over many records when presenting data to the presentation layer for monitoring

The limitations introduced by this summarization are the loss of the actual data points and a time window for any violation checks and alerting criteria. Since the in-memory metrics engine is optimized to run on this step time, any exceptional behavior will only be notified at step-time intervals, so you need to carefully analyze your requirements prior to selecting the step time and the window time to satisfy the non-functional requirements of the integration system.

For configuration, the in-memory metrics engine requires the step time and the number of steps, where the number of steps (N) is given by the following formula, given that the metrics window time is (W) and the step time is (S).

Number of steps (N) = ceiling(W/S)

While a particular deployment may want more granular metrics information, which requires "S" to be minimized, certain other deployments might need more history, which requires "W" to be maximized. If you want both granular metrics information and a long history, you can still configure the engine accordingly, but you need to be careful about the memory consumption (please refer to the section on the Resource Consumption Formula) and the performance aspects too.

Note
As you might have already understood from the architectural design of the in-memory metrics engine, these parameters are static for a single runtime instance and cannot be changed dynamically at runtime.

Ideal values for "S" and "W" for a production deployment are 3 seconds and 15 minutes, meaning that the system will keep a maximum of 300 step records (N = 900 s / 3 s) in the in-memory circular buffer to provide a metrics history of 15 minutes. This is also the default configuration of the in-memory metrics engine.

Metrics Stream Types

There are different types of metrics streams defined in the in-memory metrics engine, each of which is specialized in presenting data of a specific nature. These metrics stream types are as follows;

Counter Metrics Streams

Counter streams simply count a particular event, or occurrences of an action. An example counter stream would be the received message count of a proxy service. Counters are heavily used in UltraESB to report metrics; they take the least amount of memory compared to the other types of metrics streams and do not carry histogram data, as a histogram is not naturally meaningful for a counter. A counter record exposes the following metrics data about itself;

Data Name Description

count

The number of occurrences

rate

Rate of the occurrences of an action as a per second value

Following operations can be invoked on a counter;

Operation Name Description

increment

Increases the action occurrence count

decrement

Decreases the action occurrence count

Gauge Metrics Streams

Gauge streams collect information about the gauge of a particular quantity observed over time. An example gauge stream would be the received message size via a given transport. Gauges are used in UltraESB mainly to report message sizes, system load, CPU utilization and memory usage. They take a bit more memory compared to the other types of metrics streams; however, they do carry histogram data, as a histogram of a gauge is very important to understand the behavior of that particular gauge over time. A gauge record exposes the following metrics data about itself;

Data Name Description

sum

Sum of the gauges reported in the complete step window

max

Maximum gauge value reported in this step window

min

Minimum non-zero gauge value reported in this step window

mean

Mathematical mean (average) of all the gauges reported in this step window

averageRate

Average data flow rate of the reported gauges in this step window

count

Number of gauges records in this step window

median

Statistical median of the recorded gauge distribution in this step window

75th percentile

75th percentile of the recorded gauge distribution in this step window

95th percentile

95th percentile of the recorded gauge distribution in this step window

98th percentile

98th percentile of the recorded gauge distribution in this step window

99th percentile

99th percentile of the recorded gauge distribution in this step window

99.9th percentile

99.9th percentile of the recorded gauge distribution in this step window

99.99th percentile

99.99th percentile of the recorded gauge distribution in this step window

Following operations can be invoked on a gauge;

Operation Name Description

add

Adds a gauge value point to the metrics stream

Timer Metrics Streams

Timer streams collect information about the elapsed time of a particular action over time. Example timer streams would be the incoming message processing time of a proxy service or the response time (round-trip time) of an endpoint. Timers are used in UltraESB mainly to measure latencies, such as message processing time, response time and execution times. A timer record is very similar in memory consumption to a gauge record and carries histogram data. The histogram data, including the percentiles, plays a key role as it helps identify outliers in the time series and find, for example, the 99th percentile of the processing time, such that the distribution gives the user an understanding of the general case leaving the exceptions out. A timer record exposes the following metrics data about itself;

Data Name Description

max

Maximum elapsed time reported in this step window

min

Minimum non-zero elapsed time reported in this step window

mean

Mathematical mean (average) of all elapsed times reported in this step window

count

Number of timers recorded in this step window

median

Statistical median of the elapsed time distribution in this step window

75th percentile

75th percentile of the elapsed time distribution in this step window

95th percentile

95th percentile of the elapsed time distribution in this step window

98th percentile

98th percentile of the elapsed time distribution in this step window

99th percentile

99th percentile of the elapsed time distribution in this step window

99.9th percentile

99.9th percentile of the elapsed time distribution in this step window

99.99th percentile

99.99th percentile of the elapsed time distribution in this step window

Following operations can be invoked on a timer;

Operation Name Description

reportTime

Reports an elapsed time data point to the metrics stream

Event Metrics Streams

Event streams collect information about a set of events of a particular action over time. Example event streams would be the message times event stream of a proxy service or the thread usage event stream of the work manager. Event streams are used in UltraESB mainly to measure the relative average of an event with respect to a common base, the time difference between events in an event stream, or to alert on exceptional events such as the first use of the secondary thread pool in the work manager. The memory consumption of an event record depends on the number of possible events per stream. An event record exposes the following metrics data about itself;

Data Name Description

relativeAverage

The relative average of a particular event in the stream, with compared to the common base time

Metrics Engine Tuner

The tuner is designed to tune the in-memory metrics engine and its behavior. It does not directly affect the amount of resources that the engine will utilize; rather, it defines the operational behavior of the in-memory metrics engine. However, it may affect the resources if you configure any of the metrics streams to be turned off. In general the tuner does not affect the major sizing parameters of the in-memory metrics engine.

The tuner can be configured to turn off the metrics window completely, so that the metrics data points have no effect at all, or the metrics collection can be enabled while the metrics window is disabled, so that the recorded data points are notified to any registered external data point listeners but the window is not maintained in memory.

Another feature of the tuner is the ability to map system default stream names to user-defined stream names. For example, let's say the built-in stream for memory usage, which is "ue:memory:usedGauge", needs to be changed to something like "app.memory-usage"; that can be done by specifying a mapping from the default name to the custom name in the tuner. It is also possible to use wildcards for mapping, where you can map stream names that start with "ue:" (default system streams) into "myApp." and so on.

The tuner can also be configured to disable alerts completely from a single point. This is of great use in cases where you have different alerts configured to be triggered for critical events, but for a particular instance you do not want any of the alerts to be triggered, for example during a performance test. For those circumstances you can directly disable all alerts from the in-memory metrics engine tuner.

Historical Analysis and Third-party Tools Integration

If you need historical analysis or heavy-weight processing of metrics streams, such as CEP on the metrics data, it is recommended that a third-party solution be integrated with the in-memory metrics engine. The core in-memory metrics engine facilitates this integration through two mechanisms, each with its own advantages and disadvantages, so it is up to the user to carefully select the one that suits the third-party system being integrated. These two integration points are;

Metrics Data Point Listener

The metrics engine reports the absolute metrics data points, as and when it receives them, to the data point listeners registered with it. This section focuses on the architectural aspects of these listeners.

There are a few key aspects that you need to know architecturally about the data point listeners, the first of which is the performance of a data point listener. A data point listener gets notified of the absolute data points of the in-memory metrics engine and is invoked by the in-memory metrics engine itself. So, to keep the in-memory metrics engine lean, the data point listener should record the data points to the external party as efficiently as possible. If the third-party system, and the purpose of using it, is not time critical (no need to be real-time), one possible option would be to batch process the data points, using a queue to accumulate data points and post them in batches to the external system.
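
The following is a minimal sketch of such a batching listener. The onDataPoint callback name and its signature are hypothetical placeholders, not the actual UltraESB listener API; only the queue-and-flush pattern is the point being illustrated.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchingDataPointListener {

    static class DataPoint {
        final String stream;
        final long value;
        final long timestamp;
        DataPoint(String stream, long value, long timestamp) {
            this.stream = stream;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final BlockingQueue<DataPoint> queue = new LinkedBlockingQueue<>(10000);
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public BatchingDataPointListener() {
        // flush the accumulated data points to the external system every 5 seconds
        flusher.scheduleAtFixedRate(this::flush, 5, 5, TimeUnit.SECONDS);
    }

    // hypothetical callback invoked by the metrics engine for every absolute data point;
    // it must return quickly so that it does not slow down the reporting thread
    public void onDataPoint(String streamName, long value, long timestamp) {
        // offer() never blocks; if the buffer is full the data point is simply dropped
        queue.offer(new DataPoint(streamName, value, timestamp));
    }

    private void flush() {
        List<DataPoint> batch = new ArrayList<>();
        queue.drainTo(batch);
        if (!batch.isEmpty()) {
            postToExternalSystem(batch);
        }
    }

    private void postToExternalSystem(List<DataPoint> batch) {
        // placeholder for the actual integration, e.g. a bulk HTTP call to the monitoring backend
    }
}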

In an effort to provide a built-in mechanism that makes these data point listeners asynchronous and keeps them from affecting the data flow of the main task, the UltraESB metrics implementation provides an asynchronous version of the data point listener, where the asynchronous execution of the listener is guaranteed by the engine itself. If you need all data points to be recorded to the external system as they occur, you may use this option.

Metrics Record Eviction Handler

The eviction handler, as the name suggests, is called only when a given metrics step record is evicted from the metrics window. Theoretically, any subscriber listening for metrics records through the eviction handler will receive a record only after a window time, as the step record is evicted at the end of its life-cycle in the metrics window. Another limitation of the eviction handler is that it only receives the summarization of the step time and NOT the absolute data points. So, in a nutshell, if you need either near real-time analysis or absolute metrics data points, this is not the option for you. However, this option has the advantage of the least overhead on the runtime, as it reports the complete metrics activity set of a step time just once, and it is guaranteed that you will receive each summarized metrics record one window time after it was recorded, and not before.

This option is useful if you need some sort of persistence layer for the metrics records summarized per step time, where you can then analyze historical data points of a given stream beyond your configured metrics window time, using an external tool to query the persisted metrics data points.
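
As a sketch of this approach, an eviction handler could append each evicted step summary to a file (or a database) for later analysis. Again, the handler callback and the summary fields shown here are hypothetical placeholders; the actual UltraESB extension API may differ.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CsvEvictionHandler {

    private final Path target = Paths.get("metrics-history.csv");

    // hypothetical callback invoked once per evicted step record of a stream,
    // carrying only the step summary (never the absolute data points)
    public void onEviction(String streamName, long stepStartMillis, long count, double mean) {
        String line = String.format("%d,%s,%d,%.3f%n", stepStartMillis, streamName, count, mean);
        try {
            Files.write(target, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // never propagate failures back into the metrics engine
        }
    }
}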

Alerting and Notifications

The near real-time, high-performance nature of the in-memory metrics engine makes it very handy for alerting and notifications. Alerts are configured with a JSON formatted file, which is discussed in detail under the Alert Configuration section. Alerts also inherit the same summarization semantics as the core in-memory metrics engine, so they can be raised only at step-time intervals. Once a step record completes its active life time, it gets stored in the metrics window for a window amount of time. In the meantime, every step record that is retired to the metrics window asynchronously evaluates the alert configurations associated with the metrics streams that it carries.

The asynchronous behavior of the alert processing makes sure that it adds the least possible overhead to the messages being processed (the data flow) through the ESB. It also enables notifications on alerts, as the alert generation and all alert-related activity is completely disconnected from the message processing. This disconnected nature makes sure it does not affect the throughput of the ESB server, provided that the ESB is not running at 100% resource utilization.

To understand the alerts and the configuration of an alert it is very important to understand the alert functions.

Alert Functions

An alert function is the element that defines how the metrics stream data is evaluated against the criteria defined in the alert configuration. Different metrics record types are associated with different alert functions, and the table below shows which alert functions apply to which metrics record types.

Alert Function | Description | Counter | Gauge | Timer | Event
SUM | Sum of the values reported in the step | NO | YES | YES | NO
AVERAGE | Average of the values reported in the step | NO | YES | YES | NO
RATE | Rate of data flow in the step | YES | YES | NO | NO
COUNT | Number of data reports in the step | YES | YES | YES | YES
MIN | Minimum (non zero) value reported in the step | NO | YES | YES | NO
MAX | Maximum (finite) value reported in the step | NO | YES | YES | NO
DIFF | Relative time difference between 2 events | NO | NO | NO | YES

Alert Severity

The alert severity can be configured per alert in the alert configuration. There are 5 alert severities defined for in-memory metrics engine alert configurations, as listed below. It is recommended that users keep the following criteria in mind when assigning a severity to an alert, as that will help anybody reading the alert understand its importance.

Severity Name Description

FATAL

The system has encountered a state from which it cannot recover by itself

CRITICAL

A condition whose effect on the live system is critical

ERROR

Unintended behavior of the system

WARNING

Unexpected behavior that may be recovered by the system on its own

INFORMATION

A harmless alert notifying that the system has reached a certain milestone or event

Resource Consumption Formula

This section presents a resource consumption formula for the in-memory metrics engine, from which it is easy to derive sizing metrics for the in-memory metrics engine of UltraESB.

Let's first list the general sizes of the individual stream types;

Record Type | General Memory Consumption (bytes) | Total Consumption per Entry: GMC + 16 (int key) + 32 (entry itself)
CounterRecord | 48 | 48 + 48 = 96
TimerRecord | 80 | 80 + 48 = 128
GaugeRecord | 80 | 80 + 48 = 128
EventRecord | 184 | 184 + 48 = 232

Now let's assume a setup with the following artifact counts;

Artifact Type | Number of Items | Counter Records | Timer Records | Gauge Records | Event Records
Transport Senders + Listeners | T | 4 | - | 2 | -
Work Managers | W | 4 | - | - | 1
File Caches | F | 5 | - | - | 1
Proxy Services | P | 6 | 2 | - | -
Sequences | S | 4 | 1 | - | -
Endpoints | E | 9 | 1 | - | -
Interceptors | I | 6 | 1 | - | -
System | 1 | 4 | - | 4 | -

Based on the above table, we can derive a formula for the memory consumption of the metrics records in terms of the artifact counts;

Memory Consumption per Step Time (M/N) = (4 x 96 + 2 x 128) T + (4 x 96 + 1 x 232) W + (5 x 96 + 1 x 232) F + (6 x 96 + 2 x 128) P + (4 x 96 + 1 x 128) S + (9 x 96 + 1 x 128) E + (6 x 96 + 1 x 128) I + 4 x 96 + 4 x 128

Memory consumption complete formula per step time

M/N = 640 T + 616 W + 712 F + 832 P + 512 S + 992 E + 704 I + 896

Now let's assume a general setup with only 1 file cache and 1 work manager (which covers 99% of the cases), and assume the HTTP and HTTPS case, which results in 4 transport entries (2 listeners and 2 senders), and no interceptors. Then the formula becomes a function of the deployment unit artifacts.

M/N = 640 x 4 + 616 x 1 + 712 x 1 + 832 P + 512 S + 992 E + 704 x 0 + 896

General memory consumption for HTTP/S per step time

M/N = 832 P + 512 S + 992 E + 4784

Total memory consumption of the in-memory metrics engine can now be derived as;

MTot = (832 P + 512 S + 992 E + 4784) x (N + 2)

where N is the number of steps.
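
As a quick sanity check of the formula, the snippet below evaluates it for the default configuration (N = 300) and a small deployment of 2 proxy services, 5 sequences and 4 endpoints; the class and method names are purely illustrative.

public class MetricsMemoryEstimate {

    // M/N = 832 P + 512 S + 992 E + 4784 (bytes per step, for the HTTP/S setup above)
    static long perStepBytes(int proxies, int sequences, int endpoints) {
        return 832L * proxies + 512L * sequences + 992L * endpoints + 4784L;
    }

    public static void main(String[] args) {
        int p = 2, s = 5, e = 4, n = 300;
        long total = perStepBytes(p, s, e) * (n + 2);   // MTot = (M/N) x (N + 2)
        System.out.printf("~%,d bytes (~%.1f MB)%n", total, total / (1024.0 * 1024.0));
        // prints roughly 3,918,752 bytes (~3.7 MB), matching the first row of the sizing table below
    }
}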

Deriving sizing metrics based on this formula for services, sequences, endpoints and so on, for the default in-memory metrics engine configuration where N is 300:

Proxy Services | Sequences | Endpoints | Theoretical Memory Consumption of Metrics Records | Measured Memory Usage (Metrics Window) | Measured Memory Usage (Complete Engine)
2 | 5 | 4 | 3,918,752 = 3.7 MB | 4,718,376 = 4.5 MB | 9,730,296 = 9.3 MB
5 | 10 | 10 | 7,243,168 = 6.9 MB | 8,374,632 = 8 MB | 15,674,488 = 14.9 MB
5 | 10 | 50 | 19,226,528 = 18.3 MB | 21,011,688 = 20 MB | 33,755,768 = 32.2 MB
10 | 20 | 20 | 13,041,568 = 12.4 MB | 14,230,696 = 13.6 MB | 25,564,272 = 24.4 MB
100 | 200 | 200 | 117,412,768 = 112 MB | 128,591,208 = 122.6 MB | 212,584,272 = 202.7 MB
1000 | 2000 | 2000 | 1,161,124,768 = 1 GB | - | -
1000 | 10 | 2000 | 853,423,008 = 813.9 MB | 900,374,152 = 858.7 MB | 1,444,303,552 = 1.3 GB

Built-in Metrics Streams

By default, UltraESB reports as many metrics streams as it can, giving users a good view of the system information for monitoring. This is possible due to the lean architecture of the UltraESB metrics engine and the ability to configure the metrics window and the step time of your choice to match your requirements, with a good memory profile and the least performance overhead.

Value Metrics Streams

The reported value metrics streams are explained in the following table.

Repetitive Streams
Any stream name which contains ${name} will be repeated for each item of that category. For example, in an UltraESB system which has 2 proxy services named "foo" and "bar", the stream proxy:${name}:handledErrorInCount will be reported as 2 streams, named proxy:foo:handledErrorInCount and proxy:bar:handledErrorInCount.
Category Stream Name Metrics Type Description

System Information

ue:memory:usedGauge

Gauge

Heap Memory usage of the UltraESB process reported in bytes

ue:system:loadAvgGauge

Gauge

CPU load average of the UltraESB process reported as X, where X/1000 gives the load average to 3 decimal points

ue:system:openFDGauge

Gauge

The file descriptor usage of the UltraESB process, in other words how many files are currently open in UltraESB

ue:thread:activeGauge

Gauge

Number of active Java threads in the UltraESB, including system threads

ue:messages:receivedCount

Counter

Total number of messages received by the UltraESB irrespective of the transports or proxy services

ue:messages:sentCount

Counter

Total number of messages sent out by the UltraESB irrespective of the transports or proxy services

ue:fc:fileUsage

Counter

Global file cache usage of the UltraESB across all used file caches

ue:wm:threadUsage

Counter

Global thread usage (task executions) across all work managers of the UltraESB

Transport Metrics

trp:${name}:receivedMessageCount

Counter

Number of received messages through this transport

trp:${name}:receivingFaultCount

Counter

Number of faults encountered while receiving messages through this transport

trp:${name}:receivedBytesGauge

Gauge

Amount of bytes received by this transport

trp:${name}:sentMessageCount

Counter

Number of messages sent by this transport

trp:${name}:sendingFaultCount

Counter

Number of faults encountered while sending messages via this transport

trp:${name}:sentBytesGauge

Gauge

Amount of bytes sent by this transport

Proxy Service Metrics

proxy:${name}:processedInCount

Counter

Number of incoming messages processed by this proxy service

proxy:${name}:processedOutCount

Counter

Number of outgoing messages processed by this proxy service

proxy:${name}:handledErrorInCount

Counter

Number of errors of incoming messages handled by this proxy service

proxy:${name}:handledErrorOutCount

Counter

Number of errors of outgoing messages handled by this proxy service

proxy:${name}:failedInCount

Counter

Number of incoming message un-handled failures in this proxy service

proxy:${name}:failedOutCount

Counter

Number of outgoing message un-handled failures in this proxy service

proxy:${name}:processingInTime

Timer

Amount of time it takes to process an incoming message in this proxy service reported in nano seconds

proxy:${name}:processingOutTime

Timer

Amount of time it takes to process an outgoing message in this proxy service reported in nano seconds

proxy:${name}:responseTime

Timer

Amount of time between message acceptance and response by the proxy service

Sequence Metrics

seq:${name}:successCount

Counter

Number of messages that have been executed in this sequence successfully (without any errors)

seq:${name}:handledErrorCount

Counter

Number of messages that encountered an error which was handled by an error sequence, within this sequence

seq:${name}:unhandledErrorCount

Counter

Number of messages that encountered an error which was not handled within this sequence

seq:${name}:totalProcessedCount

Counter

All messages that reached this sequence

seq:${name}:executionTime

Timer

The number of nano seconds it took for a message to execute within this sequence

Endpoint Metrics

ep:${name}:processedMessageCount

Counter

Number of messages that have reached this endpoint, including fail-over messages

ep:${name}:uniqueMessageCount

Counter

Number of unique messages that have reached this endpoint (excluding fail-over messages)

ep:${name}:failOverMessageCount

Counter

Number of fail-over messages that have reached this endpoint

ep:${name}:successfulMessageCount

Counter

Number of messages that got successfully delivered over this endpoint

ep:${name}:suspendErrorSendingMessageCount

Counter

Number of suspended errors that are encountered while sending the message to the remote EPR via this endpoint

ep:${name}:suspendErrorReceivingMessageCount

Counter

Number of suspended errors that are encountered while receiving the response from the remote EPR in this endpoint

ep:${name}:temporaryErrorSendingMessageCount

Counter

Number of temporary errors that are encountered while sending the message to the remote EPR via this endpoint

ep:${name}:temporaryErrorReceivingMessageCount

Counter

Number of temporary errors that are encountered while receiving the response from the remote EPR in this endpoint

ep:${name}:otherErrorsCount

Counter

Number of other uncategorized errors that are encountered within this endpoint

ep:${name}:responseTime

Timer

The response (round-trip) time of this endpoint

File Cache Metrics

fc:${name}:fileCacheUsage

Counter

Number of files used from this file cache

fc:${name}:fileCacheOverflow

Counter

Number of files that overflowed to disk from this file cache

fc:${name}:createdFiles

Counter

Number of created files within the file cache

fc:${name}:filesInUse

Counter

Number of files currently acquired for processing from this file cache

fc:${name}:overflowFilesInUse

Counter

Number of files currently acquired for processing as overflowed files from this file cache

Work Manager Metrics

wm:${name}:primaryThreadUsage

Counter

Number of tasks that got invoked with primary threads from this work manager

wm:${name}:secondaryThreadUsage

Counter

Number of tasks that got invoked with secondary threads from this work manager

wm:${name}:wipMessages

Counter

Number of messages that are currently being processed (work-in-progress) by a thread of this work manager

wm:${name}:messageQueueSize

Counter

Number of messages that are queued to be executed by this work manager

Interceptor Metrics

int:${name}:processedCount

Counter

Number of interceptor invocations

int:${name}:successCount

Counter

Number of successful interceptor invocations, without an error

int:${name}:handledErrorCount

Counter

Number of errors that got handled by an error handler in an interceptor

int:${name}:unhandledErrorCount

Counter

Number of errors that are not handled in an interceptor

int:${name}:acceptedCount

Counter

Number of invocations accepted by the interceptor (number of times it returned true)

int:${name}:rejectedCount

Counter

Number of invocations rejected by the interceptor (number of times it returned false)

int:${name}:executionTime

Timer

Execution time of the interception

Event Metrics Streams
Category Stream Name Event Name Description

Proxy Service Metrics

proxy:${name}:messageTimes

proxy:in:start

Start time of the incoming message processing relative to the base time within this proxy service

proxy:in:end

End time of the incoming message processing relative to the base time within this proxy service

proxy:out:start

Start time of the outgoing message processing relative to the base time within this proxy service

proxy:out:end

End time of the outgoing message processing relative to the base time within this proxy service

in:seq:${name}:start

Start time of the in-sequence execution of this proxy service. Note that if you call another sequence within the proxy in sequence this event will be triggered twice with the ${name} changed for the 2 sequences

in:seq:${name}:end

End time of the in-sequence execution of this proxy service. Note that if you call another sequence within the proxy in sequence this event will be triggered twice with the ${name} changed for the 2 sequences

out:seq:${name}:start

Start time of the out-sequence execution of this proxy service. Note that if you call another sequence within the proxy out sequence this event will be triggered twice with the ${name} changed for the 2 sequences

out:seq:${name}:end

End time of the out-sequence execution of this proxy service. Note that if you call another sequence within the proxy out sequence this event will be triggered twice with the ${name} changed for the 2 sequences

in:ep:${name}:start

Start time of the in-destination processing. If there are more than one in destinations all respective destinations will trigger this event.

in:ep:${name}:end

End time of the in-destination processing. If there are more than one in destinations all respective destinations will trigger this event.

out:ep:${name}:start

Start time of the out-destination processing. If there are more than one out destinations all respective destinations will trigger this event.

out:ep:${name}:end

End time of the out-destination processing. If there are more than one out destinations all respective destinations will trigger this event.

Sequence Metrics

seq:${name}:messageTimes

seq:start

Sequence execution start time relative to the base time

seq:end

Sequence execution end time relative to the base time

Endpoint Metrics

ep:${name}:messageTimes

ep:start

Endpoint processing start time relative to the base time

ep:end

Endpoint processing end time relative to the base time

File Cache Metrics

fc:${name}:fileCacheState

Work Manager Metrics

wm:${name}:threadPoolEvents

ElasticSearch based Metrics Engine

The ElasticSearch based metrics engine publishes statistics to ElasticSearch. ElasticSearch is a search engine based on Lucene which provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Statistics published to ElasticSearch can be monitored through AdroitLogic Integration Monitor (IMonitor), which is an independent web application.

AdroitLogic Integration Monitor - IMonitor
AdroitLogic Integration Monitor executes as an independent web application, and allows easy management of a single UltraESB instance or a cluster of instances. Whether it is a single instance or a cluster of ESB nodes, IMonitor delivers business-level statistics and monitoring at their best. Apart from the operational statistics, IMonitor provides friendly troubleshooting and diagnostics capabilities. It is your step towards improved organisational efficiency, saving hours of developer time. Note that IMonitor replaces UConsole, which shipped with previous UltraESB releases, and is covered separately in the AdroitLogic - Integration Monitor User Guide.

Quite similar to the in-memory metrics engine, any metrics data point reported to the ElasticSearch based metrics engine should have a metrics stream associated with it. All metrics streams of type counter and gauge reported to the in-memory metrics engine are also reported to the ElasticSearch based metrics engine.

Configuration

The following are the configuration beans related to the ElasticSearch based metrics engine, which reside in ultra-metrics.xml. As you can see below, there are three beans in the configuration.

  • es-server - ElasticSearch server configuration

  • esMetricsCollector - Metrics collection component in ElasticSearch based metrics engine

  • es-pub - Metrics publishing component in ElasticSearch based metrics engine

<bean id="es-server" class="org.adroitlogic.metrics.core.InternalESStatisticsServer"/>

<!-- If you want to use an standalone elasticsearch server for statistics, uncomment this bean and
make sure to comment the above bean "es-server" which is for internal (embedded) es server -->
<!--bean id="es-server" class="org.adroitlogic.metrics.core.ExternalESStatisticsServer">
    <constructor-arg>
        <map>
            <entry key="localhost" value="9300"/>
        </map>
    </constructor-arg>
    <constructor-arg value="elasticsearch" type="java.lang.String"/>
</bean-->

<bean id="es-pub" class="org.adroitlogic.metrics.core.ElasticSearchStatisticsPublisher">
    <property name="esStatisticsServer" ref="es-server"/>

    <property name="esIndexName" value="ultraesb"/>
    <property name="esRecordType" value="msg_statistics"/>
    <property name="esMetricsDefaultTTL" value="15m"/>
    <property name="waitForYellowTimeoutInMillis" value="5000"/>
    <property name="awaitCloseTimeoutInMillis" value="10000"/>
    <!--<property name="esMetricsDefaultTTL" value="15m"/>-->
    <property name="numberOfShards" value="1"/>
    <property name="numberOfReplicas" value="1"/>
    <property name="indexRefreshInterval" value="5s"/>

    <!-- Elasticsearch BulkProcessor configuration -->
    <property name="bulkActions" value="1000"/>
    <property name="bulkSizeInMB" value="5"/>
    <property name="flushIntervalInSeconds" value="60"/>
    <property name="concurrentRequests" value="1"/>
    <property name="backOffExponentialTimeinMillis" value="100"/>
    <property name="backOffExponentialRetryAttempts" value="3"/>
    <property name="optimized" value="true"/>
</bean>

<bean id="esMetricsCollector" class="org.adroitlogic.metrics.core.ESMetricsCollectorImpl">
    <property name="primaryMetricsTemplateValidTime" value="20000"/>
    <property name="metricsReportGenerationTimePeriod" value="30000"/>
    <property name="metricsTemplateSyncTimePeriod" value="20000"/>
    <property name="esStatisticsPublisher" ref="es-pub"/>
    <property name="metricsReportGenerationInitialDelay" value="20000"/>
    <property name="metricsTemplateSyncInitialDelay" value="10000"/>
</bean>

ElasticSearch Server Configuration

The es-server bean is the ElasticSearch server configuration to which the metrics will be published. As you can see above, there are two configurations for "es-server", with one commented out. The first "es-server" bean (the uncommented one) is for the embedded ElasticSearch server shipped with the UltraESB 2.6 release. The embedded ES server is mainly for integration test purposes. The ES yaml configuration for the embedded server can be found in the conf/monitoring directory.

Use stand-alone ElasticSearch servers in production
It is recommended to use a stand-alone ElasticSearch server in production.

The commented-out "es-server" bean configuration is used to publish statistics to a stand-alone ElasticSearch server. First, a stand-alone ES server must be set up for the UltraESB to publish statistics to. After setting up the external ES server, the host and the port on which the external server is listening, and the cluster name of the external server, must be provided in the bean configuration.
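
For example, assuming the external server listens on es.example.com:9300 and the cluster is named my-es-cluster (both values are illustrative), the bean would look as follows; remember to comment out the internal "es-server" bean shown above.

<bean id="es-server" class="org.adroitlogic.metrics.core.ExternalESStatisticsServer">
    <constructor-arg>
        <map>
            <entry key="es.example.com" value="9300"/>
        </map>
    </constructor-arg>
    <constructor-arg value="my-es-cluster" type="java.lang.String"/>
</bean>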

Publishing Metrics to ElasticSearch

The bean es-pub is responsible for configuring the component which publishes statistics to ElasticSearch. Understanding this might require prior knowledge of ElasticSearch, and hence only the few most important points are covered here.

esIndexName

UltraESB statistics will be published under this index name; from a relational database perspective, it is roughly analogous to a database name.

esRecordType

This is another level of categorization, which can be loosely thought of as a table in a relational database. Please note that this association with relational databases is only for general understanding; ElasticSearch and relational databases actually work quite differently.

esMetricsDefaultTTL

We can give a TTL value for the statistics which are published to ElasticSearch. In the default configuration it is set to 15 minutes.

Do not delete ElasticSearch data directory manually
Deleting the ElasticSearch data directory manually may cause issues in ElasticSearch and is not the recommended way to delete data. Refer to the ElasticSearch documentation on how to delete an index.

Bulk Processor Configuration

The ElasticSearch BulkProcessor is used when publishing statistics in the ES based metrics engine. This configuration allows you to configure and tune the BulkProcessor along with some custom optimizations. For more details on the BulkProcessor configuration parameters, please refer to the ElasticSearch documentation.
