System Monitoring

Monitoring your Self-Managed Commerce services helps you catch issues before they affect availability, avoid outages, and manage and operate the services effectively.

There are many metrics and logs to monitor in a deployment of Self-Managed Commerce. If you are using CloudOps for Kubernetes, there are some additional metrics and logs to monitor. Some of these require additional tooling, such as an Application Performance Monitoring (APM) tool or monitoring through AWS services.

General Application Monitoring

You should monitor the Self-Managed Commerce services and check whether they are operational. An alert should be triggered if the service is not healthy for more than five minutes.
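
The five-minute rule above can be sketched as a small check over a series of health samples. This is only an illustration: the function name, the sample format, and the way samples are collected are assumptions, and most monitoring tools provide an equivalent "for" or "duration" setting on their alert rules.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Threshold taken from the text above; adjust to your own alerting policy.
UNHEALTHY_GRACE = timedelta(minutes=5)


def should_alert(samples: Iterable[Tuple[datetime, bool]],
                 grace: timedelta = UNHEALTHY_GRACE) -> bool:
    """Return True if the service has been continuously unhealthy for longer than `grace`.

    `samples` is a time-ordered sequence of (timestamp, healthy) pairs
    (a hypothetical format chosen for this sketch).
    """
    unhealthy_since: Optional[datetime] = None
    latest: Optional[datetime] = None
    for timestamp, healthy in samples:
        latest = timestamp
        if healthy:
            unhealthy_since = None          # any healthy sample resets the streak
        elif unhealthy_since is None:
            unhealthy_since = timestamp     # start of a new unhealthy streak
    if unhealthy_since is None or latest is None:
        return False
    return latest - unhealthy_since > grace
```

A single healthy sample resets the timer, so brief failures that recover on their own do not trigger an alert.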

Health Check Endpoints

Self-Managed Commerce applications have URIs that can be used for server health monitoring. They are not intended for use by client endpoints.

Cortex Endpoint

The Cortex service endpoint /cortex/healthcheck checks that all resource bundles have started. If the check fails, it returns a 5xx status code that indicates the nature of the problem. The standard health check URI, /cortex/status, can also be used for an extended health check.

Server(s): Cortex
Success Response: 200 OK
Failure Response: 5xx

Commerce Manager Endpoint

The Commerce Manager service has the healthcheck endpoint /cm/?servicehandler=status.

Server(s): Commerce Manager
Success Response: 200 OK
Failure Response: 503

General Application Endpoint

The other application services have the healthcheck endpoint /context/status, where context is the service you are checking.

Server(s):
- Cortex
- Batch Server
- Search Server
- Integration Server
Success Response: 200 OK
Failure Response: 503

Responses with human-readable and machine-readable information can be obtained using these URIs:

/<context>/status/info.html
/<context>/status/info.json
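
As a rough sketch of how a monitor might poll the status endpoints described above, the following uses only the Python standard library. The base URL is a placeholder, and the function names and the "healthy", "unhealthy", "unknown", and "unreachable" labels are assumptions made for this example, not part of Self-Managed Commerce.

```python
import urllib.error
import urllib.request


def status_url(base_url: str, context: str) -> str:
    """Build the /<context>/status URL for a service, for example context="cortex"."""
    return f"{base_url.rstrip('/')}/{context.strip('/')}/status"


def classify(status_code: int) -> str:
    """Map an HTTP status code to a health label (labels are this sketch's own)."""
    if status_code == 200:
        return "healthy"
    if 500 <= status_code <= 599:
        return "unhealthy"
    return "unknown"


def check(url: str, timeout: float = 5.0) -> str:
    """Request a health check URL and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx responses; the code is still usable.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar network problems.
        return "unreachable"
```

For example, `check(status_url("http://localhost:8080", "cortex"))` would request http://localhost:8080/cortex/status, where the host and port are hypothetical.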

Monitoring the Java Native Memory and Heap Usage

You should monitor how much native memory the Java processes are using and how close that is to the memory available to each process. Make sure that usage is stable and that no memory leaks are present. If a Java process tries to allocate more native memory than is available, the JVM may behave in unexpected ways or be killed by the OS. Make sure to monitor the overall JVM size and all the individual memory segments and generations, such as eden, survivor, metaspace, and stacks.

Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor this.

Monitoring the JVM Garbage Collection

You should monitor the garbage collection statistics for the general health of your applications. Garbage collection should run regularly and quickly. Make sure to monitor for any changes in garbage collection behavior, such as frequent full collections, long collection pauses, or high garbage collection CPU usage.

Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor the server health.

Monitoring Database Open Connections

You should monitor how many connections the Self-Managed Commerce applications are making to the database. You want to ensure that the number of connections made by the applications does not exceed the connection limit of the database during a Cortex scaling event. If you reach the maximum, new connections will need to wait.

If you deployed Self-Managed Commerce with CloudOps for Kubernetes, you can find these metrics in AWS CloudWatch under AWS RDS DatabaseConnections, or you can acquire an Application Performance Monitoring (APM) tool to monitor this.

Monitoring the JDBC Connection Pool

You should monitor the JDBC connection pool usage for each Self-Managed Commerce application to make sure that connections are being used and returned to the pool appropriately. Connections that hang and are never returned can prevent future requests from being processed.

Each Self-Managed Commerce application uses a JDBC connection pool to connect to the Self-Managed Commerce database. This is managed by Tomcat. Each instance of Tomcat has its own pool.

Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor this.
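
Because each Tomcat instance has its own pool, a worst-case estimate of database connections is the sum, over all services, of the maximum number of instances multiplied by the maximum pool size per instance. The sketch below shows that arithmetic. The service names, replica counts, pool sizes, and database limit are hypothetical values chosen for illustration; substitute the values from your own deployment.

```python
from typing import Dict


def worst_case_connections(max_instances: Dict[str, int],
                           max_pool_size: Dict[str, int]) -> int:
    """Sum of (maximum instances x maximum pool size) across services."""
    return sum(max_instances[name] * max_pool_size[name] for name in max_instances)


def fits_within_limit(connections: int, database_limit: int) -> bool:
    """True if the estimated connection count stays within the database limit."""
    return connections <= database_limit


# Hypothetical example values, not defaults of any product.
instances = {"cortex": 10, "search": 2}
pool_sizes = {"cortex": 20, "search": 10}
estimate = worst_case_connections(instances, pool_sizes)  # 10*20 + 2*10 = 220
```

Comparing this estimate against the database connection limit before raising the Cortex autoscaling maximum can show whether a scaling event could exhaust database connections.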

Monitoring Tomcat Threads

You should monitor the number of threads each Tomcat instance is using, because each Tomcat instance has a maximum number of threads available to do work. If all threads are busy and new Tomcat instances cannot be brought online, API requests may go unanswered and API clients may receive connection timeouts or similar errors.

Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor this.

Monitoring Application Logs

You should monitor the application logs for any errors and exceptions as they are an indication of the application’s health and functionality.

If you deployed Self-Managed Commerce with CloudOps for Kubernetes, you can find the logs in the Kubernetes container logs. Typically a log aggregator tool, not provided, is used to observe and monitor the logs.

Monitoring ehCache Usage

You should monitor cached objects stored in the Java heap. The ehCache is an in-memory cache that stores Java objects. It is used by the Self-Managed Commerce application cache and by the OpenJPA persistence framework that the application uses. The ehCache limits caches by the number of objects rather than by memory usage. Because the cache is stored in the Java heap, a cache whose objects are larger than expected can use all available memory, fill the heap, and cause an Out Of Memory (OOM) error.

If you deployed Self-Managed Commerce with CloudOps for Kubernetes, you can acquire an Application Performance Monitoring (APM) tool, not provided, to monitor this. You can also monitor this through JMX.

For more information on the JMX metrics, see Metrics Through the JMX.

Monitoring Response Time Latency

You should monitor the response time of the various applications and check whether they are responding more slowly than usual. An increase in application response times indicates that a change in the system has introduced an issue or there is a resource constraint that needs to be resolved.

Typically, synthetic monitoring or frontend monitoring solutions, not provided, are used to monitor this.

Application Monitoring With Kubernetes

Monitoring Pod Restarts

You should monitor the Self-Managed Commerce services and how often the pods are restarting. An alert should be triggered if the pods in a service restart once every five minutes or more often, because this indicates that the pods are frequently in an unhealthy state. If a search primary pod is restarting, the freshness of indexed search data can be affected. If a search replica pod is restarting, current application functionality and the availability and response time of search results from Cortex can be affected.

If you used CloudOps for Kubernetes to create your Self-Managed Commerce stack, the services are deployed in the namespace specified by the kubernetesNickname parameter of the Jenkins job deploy-or-delete-commerce-stack. You can use the kube-state-metrics metric kube_pod_container_status_restarts_total to monitor the containers in your Self-Managed Commerce namespace for each of the services.

You should set your monitor to constantly check your services.
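
The restart threshold can be evaluated from two readings of the kube_pod_container_status_restarts_total counter taken some time apart. The following is a minimal sketch of that calculation. The function name and parameters are assumptions for this example, and in practice a metrics system's own rate functions would normally do this work.

```python
def restarts_too_frequent(count_start: int,
                          count_end: int,
                          window_seconds: float,
                          max_per_five_minutes: float = 1.0) -> bool:
    """Return True if restarts over the window average one per five minutes or more.

    `count_start` and `count_end` are two readings of the restart counter for the
    same container, taken `window_seconds` apart.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if count_end >= count_start:
        restarts = count_end - count_start
    else:
        # The counter went backwards, which suggests it was reset (for example,
        # the pod was recreated). Count only the restarts seen since the reset.
        restarts = count_end
    rate_per_five_minutes = restarts * 300.0 / window_seconds
    return rate_per_five_minutes >= max_per_five_minutes
```

A longer window smooths out a single isolated restart, while a shorter window reacts faster to a crash loop.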

Monitoring the Deployment Pod Count

If you deployed Self-Managed Commerce with CloudOps for Kubernetes, you should monitor whether the desired and running pod counts of the Self-Managed Commerce services match. An alert should be triggered if a deployment's desired and running pod counts are persistently out of sync, because this can indicate that the deployment cannot scale, that new pods cannot start, or that some larger issue is occurring.

The services are deployed in the namespace specified by the kubernetesNickname parameter of the Jenkins job deploy-or-delete-commerce-stack. You can use metrics from kube-state-metrics to monitor the pod count for each of the services.

You should set your monitor to constantly check your services.
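
One way to express "persistently out of sync" is to require several consecutive samples in which the desired and running counts differ. The sketch below assumes you have already collected (desired, running) pairs for a deployment, for example from kube-state-metrics metrics such as kube_deployment_spec_replicas and kube_deployment_status_replicas_available. Confirm the metric names against your kube-state-metrics version. The function name and the default of five samples are assumptions for illustration.

```python
from typing import Sequence, Tuple


def persistently_out_of_sync(samples: Sequence[Tuple[int, int]],
                             min_consecutive: int = 5) -> bool:
    """Return True if the most recent `min_consecutive` samples all show a mismatch.

    Each sample is a (desired_pods, running_pods) pair, ordered oldest to newest.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough history to call the mismatch persistent
    recent = samples[-min_consecutive:]
    return all(desired != running for desired, running in recent)
```

Requiring consecutive mismatches avoids alerting on the short gap that normally occurs while a rollout or scaling event is in progress.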

Monitoring Cortex Replicas

If you deployed Self-Managed Commerce with CloudOps for Kubernetes and turned on horizontal pod autoscaling, you should monitor whether the number of Cortex replicas has reached the configured maximum. An alert should be triggered if the horizontal pod autoscaler has created the maximum allowed number of Cortex replicas, because there is a good chance that it needs to continue scaling to meet demand but is being prevented by the configured maximum.

The Cortex service is deployed in the namespace specified by the kubernetesNickname parameter of the Jenkins job deploy-or-delete-commerce-stack. You can monitor the state of the replicas with the following horizontal pod autoscaler metrics from kube-state-metrics:

kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas

You should set your monitor to constantly check your Cortex replicas.
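
The comparison between the two metrics above amounts to checking, for each horizontal pod autoscaler, whether the current replica count has reached the configured maximum. A minimal sketch follows. The dictionaries stand in for metric values already fetched from your metrics system, keyed by autoscaler name, and the names and values used are hypothetical.

```python
from typing import Dict, List


def autoscalers_at_max(current_replicas: Dict[str, int],
                       max_replicas: Dict[str, int]) -> List[str]:
    """Return the names of autoscalers whose current replicas have reached the maximum.

    `current_replicas` holds values of kube_horizontalpodautoscaler_status_current_replicas
    and `max_replicas` holds values of kube_horizontalpodautoscaler_spec_max_replicas,
    both keyed by autoscaler name.
    """
    return sorted(
        name
        for name, current in current_replicas.items()
        if name in max_replicas and current >= max_replicas[name]
    )
```

An empty result means that every autoscaler still has room to scale.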

Monitoring the ActiveMQ Service

Monitoring Disk Space

If the ActiveMQ disk is full, the broker cannot receive new messages and you will see error messages in your logs.

You should monitor your KahaDB disk and set an alert when the disk reaches 80% capacity.

If you deployed Self-Managed Commerce with CloudOps for Kubernetes, the container uses a persistent volume claim attached to the ActiveMQ pod. The ActiveMQ KahaDB disk space is configured to be 300 GB. In CloudOps for Kubernetes release 2.11 and newer, it uses Amazon Elastic File System (EFS) for persistent storage. In release 2.10 and older, single-replica ActiveMQ deployments use EBS.

There is no known way to monitor the KahaDB disk in CloudOps for Kubernetes at this point. The data should be available through JMX, but note that JMX is not exposed by default.

Monitoring the Continuous Integration Tooling

Monitoring the Nexus Service

The Nexus disk fills up as you build Self-Managed Commerce. If the disk is too full, builds and CI pipelines will no longer work, preventing you from pushing Self-Managed Commerce fixes and features to Self-Managed Commerce systems.

You should monitor your Nexus disk and set an alert when the disk reaches 80% capacity.

If you used CloudOps for Kubernetes to create your Nexus server, the disk is a persistent volume claim attached to the Nexus pod in the default namespace. It is a 256 GB Amazon Elastic Block Store (EBS) volume. CloudOps for Kubernetes includes Jenkins jobs that invoke Nexus jobs to delete extra release artifacts. The following kubelet metrics are available on the metrics server to monitor your disk space:

kubelet_volume_stats_available_bytes
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_used_bytes

Depending on how frequently Self-Managed Commerce is built, you should set your monitor to check at least once a week.
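
The 80% threshold for the Nexus volume can be derived from kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes. The following sketch shows the calculation, assuming both values have already been read for the Nexus persistent volume claim. The function names are this example's own.

```python
def volume_usage_ratio(used_bytes: int, capacity_bytes: int) -> float:
    """Fraction of the volume in use, from the kubelet used and capacity metrics."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes


def disk_alert(used_bytes: int, capacity_bytes: int, threshold: float = 0.80) -> bool:
    """Return True when usage has reached the alert threshold (80% by default)."""
    return volume_usage_ratio(used_bytes, capacity_bytes) >= threshold
```

For example, on a volume with a capacity of 256 GiB, this check starts to alert once about 205 GiB is in use.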

Monitoring the Web Application Firewall (WAF) Service

Monitoring the WAF Logs

If you are using a Web Application Firewall (WAF) in front of Self-Managed Commerce services, you can monitor the firewall logs to determine when firewall rules are triggered. Typically, access to the application will be blocked when a rule is triggered. A rule being triggered can indicate that the WAF has blocked an unauthorized action, or that the WAF rules are too restrictive and blocked a legitimate request to one of the Self-Managed Commerce services.

If you deployed Self-Managed Commerce with CloudOps for Kubernetes, you may have enabled the optional ModSecurity web application firewall (WAF). If an HTTP 403 error appears in the ModSecurity WAF logs, it indicates that a ModSecurity rule was triggered. You can find the ModSecurity container logs in the Kubernetes namespace modsec. To monitor these logs, you can install a log aggregator tool.
