Metrics

This section aims to be a comprehensive list of all of the metrics that Finagle exposes. The metrics are organized by layer and then by class.

Some of the stats are only for clients, some only for servers, and some are for both. Some stats are only visible when certain optional classes are used.

NB: Finagle sometimes uses RollupStatsReceivers internally, which will take stats like “failures/twitter/TimeoutException” and roll them up, aggregating into “failures/twitter” and also “failures”. For example, if there are 3 “failures/twitter/TimeoutException” counted, and 4 “failures/twitter/ConnectTimeoutException”, then it will count 7 for “failures/twitter”.
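The rollup arithmetic described above can be sketched in a few lines. This is an illustration in Python, not Finagle's RollupStatsReceiver implementation; the `rollup` helper and the dictionary-based counters are hypothetical.

```python
from collections import Counter

def rollup(counts):
    """Add each '/'-separated counter to every one of its prefixes."""
    totals = Counter()
    for name, value in counts.items():
        parts = name.split("/")
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth])] += value
    return totals

observed = {
    "failures/twitter/TimeoutException": 3,
    "failures/twitter/ConnectTimeoutException": 4,
}
rolled = rollup(observed)
print(rolled["failures/twitter"])  # 7
print(rolled["failures"])          # 7
```

The leaf counters keep their own values; only the parent scopes receive the sums.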

Public

These stats come from the public interface, and are the ones that you should look at first to figure out whether a client is abusing you, or you are misusing a downstream service. They are also useful in diagnosing what contributes to request latency.

StatsFilter

requests
A counter of the total number of successes + failures.
success
A counter of the total number of successes.
request_latency_ms
A histogram of the latency of requests in milliseconds.
pending
A gauge of the current total number of outstanding requests.
failures/<exception_name>+
A counter of the number of times a specific exception has been thrown. If you are using a ResponseClassifier that classifies non-Exceptions as failures, it will use a synthetic Exception, com.twitter.finagle.service.ResponseClassificationSyntheticException, to account for these. See the FAQ for more details.
failures
A counter of the number of times any failure has been observed.
sourcedfailures/<source_service_name>{/<exception_name>}+
A counter of the number of times a specific SourcedException or sourced Failure has been thrown. Sourced failures include additional information on what service caused the failure.
sourcedfailures/<source_service_name>
A counter of the number of times any SourcedException or sourced Failure has been thrown from this service. Sourced failures include additional information on what service caused the failure.
sourcedfailures
A counter of the number of times any SourcedException or sourced Failure has been thrown. Sourced failures include additional information on what service caused the failure.
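Because requests counts successes plus failures, a success rate can be derived from the requests and success counters. The following is a minimal sketch in Python; the `success_rate` helper and the sample numbers are hypothetical, and a real dashboard would apply the same arithmetic to counter deltas over a reporting interval.

```python
def success_rate(requests, success):
    """Fraction of requests that succeeded; None when there was no traffic."""
    if requests == 0:
        return None
    return success / requests

# Hypothetical counter deltas over one reporting interval.
requests_delta = 2000
success_delta = 1990

failures_delta = requests_delta - success_delta
rate = success_rate(requests_delta, success_delta)
print(failures_delta, rate)
```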

StatsFactoryWrapper

failures/<exception_class_name>
A counter of the number of times Service creation has failed with this specific exception.
failures
A counter of the number of times Service creation has failed.
service_acquisition_latency_ms
A stat of the latency, in milliseconds, to acquire a service. This entails establishing a connection or waiting for a connection from a pool.

ServerStatsFilter

handletime_us
A stat of the time it takes to handle the request in microseconds. This is how long it takes to set up the chain of Futures to be used in the response without waiting for the response. Large values suggest blocking code on a Finagle thread.
transit_latency_ms
A stat that attempts to measure (wall time) transit times between hops, e.g., from client to server. Be aware that clock drift between hosts, stop-the-world pauses, and queue backups can contribute here. Not supported by all protocols.

RequestSemaphoreFilter

request_concurrency
A gauge of the total number of current concurrent requests.
request_queue_size
A gauge of the total number of requests which are waiting because of the limit on simultaneous requests.

PayloadSizeFilter (enabled for Mux, HTTP (non-chunked), Thrift)

request_payload_bytes
A histogram of the number of bytes per request’s payload.
response_payload_bytes
A histogram of the number of bytes per response’s payload.

Construction

These stats are about setting up services in Finagle, and expose whether you are having trouble making services.

ClientBuilder

codec_connection_preparation_latency_ms
A histogram of the length of time it takes to prepare a connection and get back a service, regardless of success or failure.

StatsServiceFactory

available
A gauge of whether the underlying factory is available (1) or not (0). Finagle uses this primarily to decide whether a host is eligible for new connections in the load balancer.

Finagle

These metrics track various Finagle internals.

FuturePool

These metrics correspond to the state of FuturePool.unboundedPool and FuturePool.interruptibleUnboundedPool. Only one set of metrics is exported as they share their underlying “thread pool”.

finagle/future_pool/pool_size
A gauge of the number of threads in the pool.
finagle/future_pool/active_tasks
A gauge of the number of tasks actively executing.
finagle/future_pool/completed_tasks
A gauge of the number of total tasks that have completed execution.

Scheduler

scheduler/dispatches
A gauge of the number of dispatches performed by the com.twitter.concurrent.Scheduler.
scheduler/blocking_ms
A gauge of how much time, in milliseconds, the com.twitter.concurrent.Scheduler is spending doing blocking operations on threads that have opted into tracking. Of the built-in Schedulers, this is only enabled for the com.twitter.concurrent.LocalScheduler which is the default Scheduler implementation. Note that this does not include time spent doing blocking code outside of com.twitter.util.Await.result/Await.ready. For example, Future(someSlowSynchronousIO) would not be accounted for in this metric.

Timer

finagle/timer/pending_tasks
A stat of the number of pending tasks to run for HashedWheelTimer.Default.
finagle/timer/deviation_ms
A stat of the deviation in milliseconds of tasks scheduled on HashedWheelTimer.Default from their expected time.

ClientRegistry

finagle/clientregistry/size
A gauge of the current number of clients registered in the ClientRegistry.

Name Resolution

inet/dns/queue_size
A gauge of the current number of DNS resolutions waiting for lookup in InetResolver.
inet/dns/dns_lookups
A counter of the number of DNS lookups attempted by InetResolver.
inet/dns/dns_lookup_failures
A counter of the number of DNS lookups attempted by InetResolver and failed.
inet/dns/lookup_ms
A histogram of the latency, in milliseconds, of the time to lookup every host (successfully or not) in a com.twitter.finagle.Addr.
inet/dns/successes
A counter of the number of com.twitter.finagle.Addr instances with at least one resolved host.
inet/dns/failures
A counter of the number of com.twitter.finagle.Addr instances with no resolved hosts.
inet/dns/cache/size
A gauge of the approximate number of cached DNS resolutions in FixedInetResolver.
inet/dns/cache/evicts
A gauge of the number of times a cached DNS resolution has been evicted from FixedInetResolver.
inet/dns/cache/hit_rate
A gauge of the ratio of DNS lookups which were already cached by FixedInetResolver.

Netty 4

These metrics are exported from Finagle’s underlying transport implementation, the Netty 4 library, and are available under finagle/netty4 on any instance running Finagle with Netty 4.

NOTE: All pooling metrics are only exported when pooling is enabled (default: disabled) and only account for direct memory.
pooling/allocations/huge
A gauge (a counter) of the total number of HUGE direct allocations (i.e., unpooled allocations that exceed the current chunk size).
pooling/allocations/normal
A gauge (a counter) of the total number of NORMAL direct allocations (i.e., less than the current chunk size).
pooling/allocations/small
A gauge (a counter) of the total number of SMALL direct allocations (i.e., less than a page size, 8192 bytes).
pooling/allocations/tiny
A gauge (a counter) of the total number of TINY direct allocations (i.e., less than 512 bytes).
pooling/deallocations/huge
A gauge (a counter) of the total number of HUGE direct deallocations (i.e., unpooled allocations that exceed the current chunk size).
pooling/deallocations/normal
A gauge (a counter) of the total number of NORMAL direct deallocations (i.e., less than the current chunk size).
pooling/deallocations/small
A gauge (a counter) of the total number of SMALL direct deallocations (i.e., less than a page size, 8192 bytes).
pooling/deallocations/tiny
A gauge (a counter) of the total number of TINY direct deallocations (i.e., less than 512 bytes).
reference_leaks
A counter of detected reference leaks. See longer note on com.twitter.finagle.netty4.trackReferenceLeaks for details.
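The size classes used by the pooling metrics can be sketched as a simple classifier. This is an illustration in Python and not Netty's allocator code: the `size_class` helper is hypothetical, the 16 MiB chunk size is an assumed example value, and the handling of sizes exactly on a boundary is an assumption.

```python
PAGE_SIZE = 8192                      # bytes, as stated in the descriptions above
EXAMPLE_CHUNK_SIZE = 16 * 1024 * 1024  # assumed value, for illustration only

def size_class(nbytes, chunk_size=EXAMPLE_CHUNK_SIZE):
    """Map an allocation size to the bucket its metric would fall under."""
    if nbytes < 512:
        return "tiny"
    if nbytes < PAGE_SIZE:
        return "small"
    if nbytes <= chunk_size:
        return "normal"
    return "huge"  # exceeds the chunk size, so it is not pooled

for n in (100, 4096, 8192, EXAMPLE_CHUNK_SIZE + 1):
    print(n, size_class(n))
```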

Load Balancing

The client stats under the loadbalancer scope expose the innards of what’s going on with load balancing, and the management of equivalent groups of hosts.

All Balancers

size
A gauge of the number of nodes being balanced across.
available
A gauge of the number of available nodes as seen by the load balancer. These nodes are ready to receive traffic.
busy
A gauge of the number of busy nodes as seen by the load balancer. These nodes are currently unavailable for service.
closed
A gauge of the number of closed nodes as seen by the load balancer. These nodes will never be available for service.
load
A gauge of the total load over all nodes being balanced across.
meanweight
A gauge tracking the arithmetic mean of the weights of the endpoints being load-balanced across. Does not apply to HeapLeastLoaded.
adds
A counter of the number of hosts added to the loadbalancer.
removes
A counter of the number of hosts removed from the loadbalancer.
rebuilds
A counter of the number of times the loadbalancer rebuilds its state (triggered by either an underlying namer or failing nodes).
updates
A counter of the number of times the underlying namer triggers the loadbalancer to rebuild its state (e.g., because the server set has changed). Note that these kinds of events are usually collapsed, so the actual number of rebuilds is usually less than the number of updates.
max_effort_exhausted
A counter of the number of times a balancer failed to find a node that was Status.Open within com.twitter.finagle.loadbalancer.Balancer.maxEffort attempts. When this occurs, a non-open node may be selected for that request.
algorithm/{type}
A gauge exported with the name of the algorithm used for load balancing.
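To make max_effort_exhausted concrete, here is a minimal sketch of a bounded selection loop. It is written in Python and is not Finagle's balancer code: `pick_node`, the uniform random choice, and the `(name, is_open)` node representation are all assumptions made for illustration.

```python
import random

def pick_node(nodes, max_effort, stats, rng=random):
    """Try up to max_effort random picks for an open node.

    nodes is a list of (name, is_open) pairs; the chosen pair is returned.
    """
    choice = None
    for _ in range(max_effort):
        choice = rng.choice(nodes)
        if choice[1]:
            return choice
    # No open node was found within the allowed attempts.
    stats["max_effort_exhausted"] = stats.get("max_effort_exhausted", 0) + 1
    return choice  # may be a non-open node

stats = {}
all_busy = [("a", False), ("b", False)]
picked = pick_node(all_busy, max_effort=5, stats=stats)
print(picked, stats)
```

When every node is non-open the counter increments once per request, which is why a steadily rising value usually points at a cluster-wide availability problem rather than a single bad host.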

ApertureLoadBandBalancer

aperture
A gauge of the width of the window over which endpoints are load-balanced.
coordinate
The process global coordinate for the process as sampled by the Aperture implementation.
use_deterministic_ordering
1 if the Aperture implementation uses deterministic ordering, 0 otherwise.
coordinate_updates
A counter of the number of times the Aperture implementation receives updates from the DeterministicOrdering process global.

Fail Fast

The client stats under the failfast scope give insight into how Finagle handles services where it can’t establish a connection.

FailFastFactory

marked_dead
A counter of how many times the host has been marked dead due to connection problems.
unhealthy_for_ms
A gauge of how long, in milliseconds, Finagle has been trying to reestablish a connection.
unhealthy_num_tries
A gauge of the number of times the Factory has tried to reestablish a connection.

Failure Accrual

The client stats under the failure_accrual scope track how FailureAccrualFactory manages failures.

FailureAccrualFactory

removed_for_ms
A counter of the total time in milliseconds any host has spent in dead state due to failure accrual.
probes
A counter of the number of requests sent through failure accrual while a host was marked dead to probe for revival.
removals
A counter of how many times any host has been removed due to failure accrual. Note that there is no specificity on which host in the cluster has been removed, so a high value here could be one problem-child or aggregate problems across all hosts.
revivals
A counter of how many times a previously-removed host has been reactivated after the penalty period has elapsed.
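How the four failure accrual stats relate to one another can be sketched with a toy state machine. This Python model is not FailureAccrualFactory: the `AccrualSketch` class, the consecutive-failures policy, the fixed penalty, and the rule that a successful probe is what revives the host are all assumptions chosen to keep the example small.

```python
class AccrualSketch:
    """Toy model of one host under failure accrual (hypothetical policy)."""

    def __init__(self, max_failures, penalty_ms):
        self.max_failures = max_failures
        self.penalty_ms = penalty_ms
        self.consecutive_failures = 0
        self.dead_since = None  # timestamp of removal, or None while healthy
        self.stats = {"removals": 0, "probes": 0, "revivals": 0, "removed_for_ms": 0}

    def record(self, now_ms, ok):
        """Record the outcome of a request observed at now_ms."""
        if self.dead_since is None:
            self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
            if self.consecutive_failures >= self.max_failures:
                self.dead_since = now_ms
                self.stats["removals"] += 1
            return
        if now_ms - self.dead_since < self.penalty_ms:
            return  # penalty period still running; outcome ignored in this model
        self.stats["probes"] += 1
        if ok:
            self.stats["revivals"] += 1
            self.stats["removed_for_ms"] += now_ms - self.dead_since
            self.dead_since = None
            self.consecutive_failures = 0

host = AccrualSketch(max_failures=3, penalty_ms=1000)
events = [
    (0, False), (10, False), (20, False),  # third failure removes the host
    (500, True),                           # inside the penalty period
    (1200, False),                         # failed probe, host stays dead
    (1500, True),                          # successful probe revives the host
]
for t, ok in events:
    host.record(t, ok)
print(host.stats)
```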

Idle Apoptosis

These client stats keep track of how frequently Services are closed due to prolonged idleness.

ExpiringService

idle
A counter of the number of times the service has expired from staying idle for too long in between requests.
lifetime
A counter of the number of times the service has exceeded its lifetime expiration duration.

Rate Limiting

These client stats show how much you’re hitting your rate limit if you’re using rate limiting.

RateLimitingFilter

refused
A counter of the number of refused connections by the rate limiting filter.

Pooling

These client stats help you keep track of connection churn.

CachingPool

pool_cached
A gauge of the number of connections cached.

WatermarkPool

pool_waiters
A gauge of the number of clients waiting on connections.
pool_size
A gauge of the number of connections that are currently alive, either in use or not.
pool_num_waited
A counter of the number of times there were no connections immediately available and the client waited for a connection.
pool_num_too_many_waiters
A counter of the number of times there were no connections immediately available and there were already too many waiters.
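The relationship between the waiter stats can be sketched with a small model of a bounded pool. This is a Python illustration and not the WatermarkPool implementation: the `WatermarkSketch` class, its `high_watermark` and `max_waiters` parameters, and the choice not to count rejected callers as having waited are assumptions.

```python
from collections import deque

class WatermarkSketch:
    """Toy pool with a connection cap and a bounded queue of waiters."""

    def __init__(self, high_watermark, max_waiters):
        self.high_watermark = high_watermark
        self.max_waiters = max_waiters
        self.in_use = 0
        self.waiters = deque()
        self.stats = {"pool_num_waited": 0, "pool_num_too_many_waiters": 0}

    def acquire(self, who):
        if self.in_use < self.high_watermark:
            self.in_use += 1
            return "connected"
        if len(self.waiters) >= self.max_waiters:
            self.stats["pool_num_too_many_waiters"] += 1
            return "rejected"
        self.waiters.append(who)
        self.stats["pool_num_waited"] += 1
        return "waiting"

    def release(self):
        """Hand the connection to the oldest waiter, or free it."""
        if self.waiters:
            return self.waiters.popleft()
        self.in_use -= 1
        return None

pool = WatermarkSketch(high_watermark=1, max_waiters=1)
results = [pool.acquire(name) for name in ("a", "b", "c")]
print(results, pool.stats, len(pool.waiters))
```

In this model `len(pool.waiters)` plays the role of the pool_waiters gauge and `pool.in_use` the role of pool_size.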

SingletonPool

conn/fail
A counter of the number of times the connection could not be established and must be retried.
conn/dead
A counter of the number of times the connection succeeded once, but later died and must be retried.

PendingRequestFilter

These stats represent information about the behavior of PendingRequestFilter.

pending_requests/rejected
A counter of the number of requests that have been rejected by this filter.

Retries

These metrics track the retries of failed requests via the Retries module.

Requeues represent requests that were automatically retried by Finagle. Only failures which are known to be safe are eligible to be requeued. The number of retries allowed is controlled by a dynamic budget, RetryBudget.

For clients built using ClientBuilder, the retries stat represents retries handled by the configured RetryPolicy. Note that application level failures are not included, which is particularly important for protocols that include exceptions, such as Thrift. The number of retries allowed is controlled by the same dynamic budget used for requeues.

Somewhat confusingly, for clients created via ClientBuilder there is an additional set of metrics, scoped to tries, that comes from StatsFilter. Those metrics represent logical requests, while the metrics below are for the physical requests, including the retries. You can replicate this behavior for clients built with the Stack API by wrapping the service with a StatsFilter scoped to tries.

retries
A stat of the number of times requests are retried as per a policy defined by the RetryPolicy from a ClientBuilder.
retries/requeues
A counter of the number of times requests are requeued. Failed requests which are eligible for requeues are failures which are known to be safe — see com.twitter.finagle.service.RetryPolicy.RetryableWriteException.
retries/requeues_per_request
A stat of the number of times requests are requeued.
retries/budget
A gauge of the currently available retry budget.
retries/budget_exhausted
A counter of the number of times when the budget is exhausted.
retries/request_limit
A counter of the number of times the limit of retry attempts for a logical request has been reached.
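The idea of a dynamic retry budget can be sketched as a credit account: requests earn a fraction of a retry, and each retry spends a whole one. This Python model is a simplification and not Finagle's RetryBudget: the `BudgetSketch` class and its parameter are hypothetical, and it ignores any time window or minimum retry rate that a real budget may apply.

```python
class BudgetSketch:
    """Toy retry budget using integer credit (hundredths of a retry)."""

    def __init__(self, percent_can_retry):
        self.percent = percent_can_retry  # 20 means 20 retries per 100 requests
        self.credit = 0
        self.budget_exhausted = 0

    def deposit(self):
        """Called once per request; earns a fraction of a retry."""
        self.credit += self.percent

    def try_withdraw(self):
        """Called before a retry; returns False when no budget is left."""
        if self.credit >= 100:
            self.credit -= 100
            return True
        self.budget_exhausted += 1
        return False

    def balance(self):
        """Whole retries currently available (compare retries/budget)."""
        return self.credit // 100

budget = BudgetSketch(percent_can_retry=20)
for _ in range(10):
    budget.deposit()
balance_before = budget.balance()
attempts = [budget.try_withdraw() for _ in range(3)]
print(balance_before, attempts, budget.budget_exhausted)
```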

Dispatching

Metrics scoped under dispatcher represent information about a client’s dispatching layer.

Depending on the underlying protocol, dispatchers may have different request queueing rules.

serial/queue_size
A gauge of the number of pending requests, used by serial dispatchers, which can have only a single request per connection at a time.
pipelining/pending
A gauge of the number of pipelined requests currently outstanding, used by pipelining dispatchers.

Admission Control

The stats under the admission_control scope show stats for the different admission control strategies.

Deadline Admission Control

admission_control/deadline/exceeded
A counter of the number of requests whose deadline has expired.
admission_control/deadline/expired_ms
A stat of the elapsed time since expiry if a deadline has expired, in milliseconds.
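The two deadline stats can be illustrated with a small recording function. This is a Python sketch, not Finagle's deadline handling: the `record_deadline` helper, the millisecond timestamps, and the list standing in for a stat are hypothetical, and the sketch only records the observation without deciding what happens to the request.

```python
def record_deadline(now_ms, deadline_ms, stats):
    """Record an expired deadline; returns True if the deadline had expired."""
    if now_ms <= deadline_ms:
        return False
    stats["exceeded"] += 1
    stats["expired_ms"].append(now_ms - deadline_ms)  # time elapsed since expiry
    return True

stats = {"exceeded": 0, "expired_ms": []}
on_time = record_deadline(now_ms=1000, deadline_ms=1200, stats=stats)
late = record_deadline(now_ms=1500, deadline_ms=1200, stats=stats)
print(on_time, late, stats)
```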

Nack Admission Control

These metrics reflect the behavior of the NackAdmissionFilter.

dropped_requests
A counter of the number of requests probabilistically dropped.

Threshold Failure Detector

The client metrics under the mux/failuredetector scope track the behavior of out-of-band RTT-based failure detection. They only apply to the mux protocol.

ThresholdFailureDetector

ping
A counter of the number of pings sent to remote peers.
ping_latency_us
A stat of round trip ping latencies in microseconds.
marked_busy
A counter of the number of times the endpoints are marked busy.
revivals
A counter of the number of times the endpoints revive.
close
A counter of the number of endpoints that are closed.

Transport

These metrics pertain to where the Finagle abstraction ends and the bytes are sent over the wire. Understanding these stats often requires deep knowledge of the protocol, or individual transport (e.g. Netty) internals.

Netty Transporter

connect_latency_ms
A histogram of the length of time it takes for a connection to succeed, in milliseconds.
failed_connect_latency_ms
A histogram of the length of time it takes for a connection to fail, in milliseconds.
cancelled_connects
A counter of the number of attempts to connect that were cancelled before they succeeded.

ServerBridge

read_timeout
A counter of the number of times the netty channel has caught a ReadTimeoutException while reading.
write_timeout
A counter of the number of times the netty channel has caught a WriteTimeoutException while writing.

ChannelRequestStatsHandler

connection_requests
A histogram of the number of requests received over the lifetime of a connection.

ChannelStatsHandler

connects
A counter of the total number of successful connections made.
closes
A counter of the total number of channel close operations initiated. To see the total number of closes completed, use the total count from one of the “connection_duration”, “connection_received_bytes”, or “connection_sent_bytes” histograms.
connection_duration
A histogram of the duration of the lifetime of a connection.
connection_received_bytes
A histogram of the number of bytes received over the lifetime of a connection.
connection_sent_bytes
A histogram of the number of bytes sent over the lifetime of a connection.
received_bytes
A counter of the total number of received bytes.
sent_bytes
A counter of the total number of sent bytes.
writableDuration
A gauge of the length of time the socket has been writable in the channel.
unwritableDuration
A gauge of the length of time the socket has been unwritable in the channel.
connections
A gauge of the total number of connections that are currently open in the channel.
exn/<exception_name>+
A counter of the number of times a specific exception has been thrown within a Netty pipeline.

IdleChannelHandler

disconnects/{READER_IDLE,WRITER_IDLE}
A counter of the number of times a connection was disconnected because of a given idle state.

Thrift

srv/thrift/buffer/resetCount
A counter for the number of times the thrift server re-initialized the buffer for thrift responses. The thrift server maintains a growable reusable buffer for responses. Once the buffer reaches the threshold size it is discarded and reset to a smaller size. This is done to accommodate variable response sizes. A high resetCount means the server is allocating and releasing memory frequently. Use the com.twitter.finagle.Thrift.param.MaxReusableBufferSize param to set the max buffer size to the size of a typical thrift response for your server.
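The effect of the maximum reusable buffer size on resetCount can be sketched with a toy buffer. This Python model is not the Thrift server's buffer code: the `ReusableBufferSketch` class, the initial size, and the sample response sizes are hypothetical, and the exact threshold comparison is an assumption.

```python
class ReusableBufferSketch:
    """Toy growable buffer that is reset once it reaches a size threshold."""

    def __init__(self, initial_size, max_reusable_size):
        self.initial_size = initial_size
        self.max_reusable_size = max_reusable_size
        self.capacity = initial_size
        self.reset_count = 0

    def write_response(self, nbytes):
        # Grow to fit the response, then decide whether to keep the buffer.
        self.capacity = max(self.capacity, nbytes)
        if self.capacity >= self.max_reusable_size:
            self.capacity = self.initial_size
            self.reset_count += 1

responses = [1000, 20000, 3000, 50000]  # hypothetical response sizes in bytes

small_limit = ReusableBufferSketch(initial_size=512, max_reusable_size=16384)
large_limit = ReusableBufferSketch(initial_size=512, max_reusable_size=65536)
for size in responses:
    small_limit.write_response(size)
    large_limit.write_response(size)
print(small_limit.reset_count, large_limit.reset_count)
```

With the smaller limit, every large response forces a reset; raising the limit above the typical response size keeps the grown buffer in place.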

RecvBufferSizeStatsHandler (when Netty 4 pooling is enabled)

transport/receive_buffer_bytes
A histogram of the receive buffer size in bytes. This metric is useful when it comes to tuning pooling of receive buffers in finagle-netty4 (can be enabled with a flag: -com.twitter.finagle.netty.poolReceiveBuffers). For maximum throughput, the pool’s chunk size should be bigger than receive_buffer_bytes.max of any client/server running on a given JVM.

Service Discovery

These metrics track the state of name resolution and service discovery.

Name Resolution

Finagle clients resolve names into sets of network addresses to which sockets can be opened. A number of the moving parts involved in this process are cached (i.e. Dtabs, Names, and NameTrees). The following stats are recorded under the namer/{dtabcache,namecache,nametreecache} scopes to provide visibility into this caching.

misses
A counter of the number of cache misses.
evicts
A counter of the number of cache evictions.
expires
A counter of the number of idle ServiceFactorys that were actively evicted.
idle
A gauge of the number of cached idle ServiceFactorys.
oneshots
A counter of the number of “one-off” ServiceFactorys that are created in the event that no idle ServiceFactorys are cached.
namer/bind_latency_us
A stat of the total time, in microseconds, spent resolving Names.

Initial Resolution

finagle/clientregistry/initialresolution_ms

A counter of the time spent waiting for client resolution via ClientRegistry.expAllRegisteredClientsResolved.

Address Stabilization

Resolved addresses (represented as an instance of Addr) are stabilized in two ways:

  1. ZooKeeper failures will not cause a previously bound address to fail.
  2. When a member leaves a cluster, its removal is delayed.

Note that hosts added to a cluster are reflected immediately.

The following metrics are scoped under the concatenation of zk2/ and the ServerSet’s ZooKeeper path.

limbo
A gauge tracking the number of endpoints that are in “limbo”. When a member leaves a cluster, it is placed in limbo. Hosts in limbo are still presented to the load balancer as belonging to the cluster, but are staged for removal. They are removed if they do not recover within an interval bound by the ZooKeeper session timeout.
size
A gauge tracking the total size of the live cluster, not including members in limbo.
zkHealth
A gauge tracking the health of the underlying zk client as seen by the resolver. The possible values are Unknown(0), Healthy(1), Unhealthy(2), and Probation(3).
observed_serversets
A gauge tracking the number of clusters whose membership status is currently being tracked within the process. This metric differs from session_cache_size below in that it tracks live clusters rather than the total number of cached sessions.

ZooKeeper Diagnostics

The following stats reflect diagnostic information about the ZooKeeper sessions opened for the purposes of service discovery.

Under the `zk2` scope

session_cache_size
A gauge tracking the number of distinct logical clusters whose membership status has been tracked within the process.
entries/read_ms
A histogram of the latency, in milliseconds, of reading entry znodes.
entries/parse_ms
A histogram of the latency, in milliseconds, of parsing the data within entry znodes.
vectors/read_ms
A histogram of the latency, in milliseconds, of reading vector znodes.
vectors/parse_ms
A histogram of the latency, in milliseconds, of parsing the data within vector znodes.

Under the `zkclient` scope

ephemeral_successes
A counter of the number of successful ephemeral node creations.
ephemeral_failures
A counter of the number of failed ephemeral node creations.
ephemeral_latency_ms
A histogram of the latency, in milliseconds, of ephemeral node creation.
watch_successes
A counter of the number of successful watch-related operations (i.e. “watch exists”, “get watch data”, and “get child watches” operations).
watch_failures
A counter of the number of failed watch-related operations.
watch_latency_ms
A histogram of the latency, in milliseconds, of watch-related operations.
read_successes
A counter of the number of successful ZooKeeper reads.
read_failures
A counter of the number of failed ZooKeeper reads.
read_latency_ms
A histogram of the latency, in milliseconds, of ZooKeeper reads.
write_successes
A counter of the number of successful ZooKeeper writes.
write_failures
A counter of the number of failed ZooKeeper writes.
write_latency_ms
A histogram of the latency, in milliseconds, of ZooKeeper writes.
multi_successes
A counter of the number of successful transactional operations.
multi_failures
A counter of the number of failed transactional operations.
multi_latency_ms
A histogram of the latency, in milliseconds, of transactional operations.
session_sync_connected
A counter of the number of read-write session transitions.
session_connected_read_only
A counter of the number of read-only session transitions.
session_no_sync_connected
Unused (should always be 0).
session_sasl_authenticated
A counter of the number of sessions upgraded to SASL.
session_auth_failed
A counter of the number of session authentication failures.
session_disconnected
A counter of the number of temporary session disconnects.
session_expired
A counter of the number of session expirations.

Toggles

These metrics correspond to feature toggles.

toggles/<libraryName>/checksum
A gauge summarizing the current state of a ToggleMap which may be useful for comparing state across a cluster or over time.

HTTP

These stats pertain to the HTTP protocol.

nacks
A counter of the number of retryable HTTP 503 responses the HTTP server returns. Those responses are automatically retried by the Finagle HTTP client.
nonretryable_nacks
A counter of the number of non-retryable HTTP 503 responses the HTTP server returns. Those responses are not automatically retried.

These metrics are added by StatsFilter and can be enabled by using .withHttpStats on Http.Client and Http.Server.

status/<statusCode>
A counter of the number of responses received, or returned for servers, that had this statusCode.
status/<statusClass>
Same as status/statusCode but aggregated per category, e.g. all 500 range responses count as 5XX for this counter.
time/<statusCode>
A histogram on duration in milliseconds per HTTP status code.
time/<statusCategory>
A histogram on duration in milliseconds per HTTP status code category.
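The per-code and per-category aggregation can be sketched as follows. This is a Python illustration rather than the StatsFilter implementation: the `status_class` and `record` helpers and the dictionary standing in for a stats receiver are hypothetical.

```python
def status_class(code):
    """Map a status code to its category, e.g. 503 -> '5XX'."""
    return f"{code // 100}XX"

def record(stats, code, elapsed_ms):
    """Count the response and its duration under both code and category."""
    for key in (str(code), status_class(code)):
        stats[f"status/{key}"] = stats.get(f"status/{key}", 0) + 1
        stats.setdefault(f"time/{key}", []).append(elapsed_ms)

stats = {}
record(stats, 503, 12)
record(stats, 500, 30)
record(stats, 200, 5)
print(stats["status/5XX"], stats["status/503"], stats["time/5XX"])
```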

HTTP2

These stats pertain to HTTP2 only.

<server_label>/upgrade/success
A counter of http2 upgrades and new prior knowledge connections server side.
<client_label>/upgrade/success
A counter of http2 upgrades and new prior knowledge connections client side.

Memcached

These stats pertain to the Memcached protocol.

<label>/redistributes
A counter of the number of times the cache ring has been rebuilt. This occurs whenever a node has been ejected or revived, or the set of nodes changes.
<label>/joins
A counter of the number of times a node has been added to the cache ring because the backing set of servers has changed.
<label>/leaves
A counter of the number of times a node has been removed from the cache ring because the backing set of servers has changed.
<label>/ejections
A counter of the number of times a node has been ejected from the cache ring.
<label>/revivals
A counter of the number of times an ejected node has been re-added to the cache ring.

Mux

These stats pertain to the Mux protocol.

<server_label>/mux/draining
A counter of the number of times the server has initiated session draining.
<server_label>/mux/drained
A counter of the number of times the server has successfully completed the draining protocol within its allotted time.
<client_label>/mux/draining
A counter of the number of times a server initiated session draining.
<client_label>/mux/drained
A counter of the number of times server-initiated draining completed successfully.
<server_label>/mux/duplicate_tag
A counter of the number of requests with a tag while a server is processing another request with the same tag.
<server_label>/mux/orphaned_tdiscard
A counter of the number of Tdiscard messages for which the server does not have a corresponding request. This happens when a server has already responded to the request when it receives a Tdiscard.
clienthangup
A counter of the number of times sessions have been abruptly terminated by the client.
serverhangup
A counter of the number of times sessions have been abruptly terminated by the server.
<label>/mux/framer/write_stream_bytes
A histogram of the number of bytes written to the transport when mux framing is enabled.
<label>/mux/framer/read_stream_bytes
A histogram of the number of bytes read from the transport when mux framing is enabled.
<label>/mux/framer/pending_write_streams
A gauge of the number of outstanding write streams when mux framing is enabled.
<label>/mux/framer/pending_read_streams
A gauge of the number of outstanding read streams when mux framing is enabled.
<label>/mux/framer/write_window_bytes
A gauge indicating the maximum size of fragments when mux framing is enabled. A value of -1 means that writes are not fragmented.
<label>/mux/transport/read/failures/
A counter indicating any exceptions that occur on the transport read path for mux. This includes exceptions in handshaking, thrift downgrading (for servers), etc.
<label>/mux/transport/write/failures/
A counter indicating any exceptions that occur on the transport write path for mux. This includes exceptions in handshaking, thrift downgrading (for servers), etc.

ThriftMux

These stats pertain to the ThriftMux protocol.

<server_label>/thriftmux/connects
A counter of the number of times the server has created a ThriftMux connection. This does not include downgraded Thrift connections.
<server_label>/thriftmux/downgraded_connects
A counter of the number of times the server has created a downgraded connection for “plain” Thrift.