System Monitoring
Monitoring your Elastic Path Commerce services helps you detect issues before they affect availability, avoid outages, and manage and operate the services effectively.
There are many metrics and logs to monitor in a deployment of Elastic Path Commerce. If you are using CloudOps for Kubernetes, there are some additional metrics and logs to monitor. Some of these require additional tooling, such as an Application Performance Monitoring (APM) tool, or monitoring through AWS services.
General Application Monitoring
You should monitor the Elastic Path Commerce services and check whether they are operational. An alert should be triggered if the service is not healthy for more than five minutes.
Health Check Endpoints
Elastic Path Commerce applications have URIs that can be used for server health monitoring. They are not intended for use by client endpoints.
Cortex Endpoint
The Cortex service endpoint /cortex/healthcheck checks that all resource bundles have started. Different 5xx status codes are returned to indicate the nature of the problem. The standard health check URI, /cortex/status, can also be used for an extended health check.
Server(s): Cortex
Success Response: 200 OK
Failure Response: 5xx
Commerce Manager Endpoint
The Commerce Manager service has the health check endpoint /cm/?servicehandler=status.
Server(s): Commerce Manager
Success Response: 200 OK
Failure Response: 503
General Application Endpoint
The other application services have the health check endpoint /context/status, where context is the service you are checking.
Server(s):
- Cortex
- Batch Server
- Search Server
- Integration Server
Success Response: 200 OK
Failure Response: 503
Responses with human and machine readable information can be obtained using these URIs:
/<context>/status/info.html
/<context>/status/info.json
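As a rough illustration of the alerting rule described earlier (alert when a service has been unhealthy for more than five minutes), the following Python sketch probes a health check URI and tracks how long it has been failing. The helper names (probe, is_healthy, UnhealthyTimer) are hypothetical and not part of Elastic Path Commerce; only the 200 OK success response and the five-minute threshold come from this page.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def is_healthy(status):
    """Treat only 200 OK as healthy, matching the success response above."""
    return status == 200


class UnhealthyTimer:
    """Tracks how long a service has been continuously unhealthy."""

    def __init__(self, threshold_seconds=300):
        self.threshold_seconds = threshold_seconds
        self.unhealthy_since = None

    def record(self, healthy, now):
        """Record one probe result; return True if an alert should fire."""
        if healthy:
            self.unhealthy_since = None
            return False
        if self.unhealthy_since is None:
            self.unhealthy_since = now
        return now - self.unhealthy_since > self.threshold_seconds
```

A monitor could call probe() against, for example, the /cortex/healthcheck URI on your own host at a fixed interval and pass is_healthy(status) together with time.monotonic() to record(). In practice an APM or uptime-monitoring tool would normally take this role.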
Monitoring the Java Native Memory and Heap Usage
You should monitor how much native memory the Java processes are using and how close that is to the memory available to each process. Make sure that usage is stable and that no memory leaks are present. If a Java process tries to allocate more native memory than is available, the JVM may behave in unexpected ways or be killed by the OS. Make sure to monitor the overall JVM size and all the individual memory segments and generations, such as eden, survivor, metaspace, and thread stacks.
Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor this.
Monitoring the JVM Garbage Collection
You should monitor the garbage collection statistics for the general health of your applications. Garbage collection should run regularly and quickly. Make sure to monitor for any changes in garbage collection behavior, such as frequent full collections, long collection pauses, or high garbage collection CPU usage.
Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor the server health.
Monitoring Database Open Connections
You should monitor how many connections the Elastic Path Commerce applications are making to the database. You want to ensure that the number of connections made by the applications does not exceed the connection limit of the database during a Cortex scaling event. If the limit is reached, new connections must wait.
If you deployed Elastic Path Commerce with CloudOps for Kubernetes, you can find these metrics in AWS CloudWatch under AWS RDS DatabaseConnections, or you can acquire an Application Performance Monitoring (APM) tool to monitor this.
Monitoring the JDBC Connection Pool
You should monitor the JDBC connection pool usage for each Elastic Path Commerce application to make sure that connections are being used and returned to the pool appropriately, and are not held indefinitely, which would prevent future requests from being processed.
Each Elastic Path Commerce application uses a JDBC connection pool to connect to the Elastic Path Commerce database. This is managed by Tomcat. Each instance of Tomcat has its own pool.
Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor this.
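Because each Tomcat instance has its own pool, one way to reason about the database connection limit is to estimate the worst case as the sum, over all services, of the maximum replica count multiplied by the pool's maximum size. The sketch below shows that arithmetic. The field names and all numbers are hypothetical, and it assumes that the pool's configured maximum is the only source of connections from each instance.

```python
def worst_case_connections(services):
    """Sum max_replicas * pool_max over all services.

    services: list of dicts with hypothetical keys "max_replicas" and
    "pool_max" (the pool's configured maximum size per Tomcat instance).
    """
    return sum(s["max_replicas"] * s["pool_max"] for s in services)


def connection_headroom(services, db_max_connections):
    """Connections left on the database if every pool were fully used."""
    return db_max_connections - worst_case_connections(services)


# Illustrative numbers only; substitute your own configuration.
example_services = [
    {"name": "cortex", "max_replicas": 10, "pool_max": 50},
    {"name": "batch", "max_replicas": 1, "pool_max": 20},
]
```

A negative headroom suggests that a scaling event could push the applications past the database connection limit.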
Monitoring Tomcat Threads
You should monitor the number of threads each Tomcat instance is using because each Tomcat instance has a maximum number of threads available to do work. If all threads are busy and new Tomcat instances cannot be brought online, API requests may go unanswered and API clients may receive connection timeouts or similar errors.
Typically an Application Performance Monitoring (APM) tool, not provided, is used to monitor this.
Monitoring Application Logs
You should monitor the application logs for any errors and exceptions as they are an indication of the application’s health and functionality.
If you deployed Elastic Path Commerce with CloudOps for Kubernetes, you can find the logs in the Kubernetes container logs. Typically a log aggregator tool, not provided, is used to observe and monitor the logs.
Monitoring ehCache Usage
You should monitor cached objects stored in the Java heap. The ehCache is an in-memory cache that stores Java objects, leveraged by the Elastic Path Commerce application cache and by the OpenJPA persistence framework used by the application. The ehCache limits caches based on the number of objects instead of memory usage. Because the cache is stored in the Java heap, if the cached objects are larger than expected, a cache could fill the heap and cause an Out Of Memory (OOM) error.
If you deployed Elastic Path Commerce with CloudOps for Kubernetes, you can acquire an Application Performance Monitoring (APM) tool, not provided, to monitor this. You can also monitor this through JMX.
- For more information on the JMX metrics, see Metrics Through the JMX.
Monitoring Response Time Latency
You should monitor the response time of the various applications and check whether they are responding more slowly than usual. An increase in application response times indicates that a change in the system has introduced an issue or there is a resource constraint that needs to be resolved.
Typically, synthetic monitoring or frontend monitoring solutions, not provided, are used to monitor this.
Application Monitoring With Kubernetes
Monitoring Pod Restarts
You should monitor the Elastic Path Commerce services and how often the pods are restarting. An alert should be triggered if the pods in a service restart once every five minutes or more often, as this indicates that the pods are frequently in an unhealthy state. If a search primary pod is restarting, it can affect the freshness of indexed search data. If a search replica pod is restarting, it can affect current application functionality and the availability and response time of search results from Cortex.
If you used CloudOps for Kubernetes to create your Elastic Path Commerce stack, the services are deployed in the namespace specified by the kubernetesNickname parameter of the Jenkins job deploy-or-delete-commerce-stack. You can use the kube-state-metrics metric kube_pod_container_status_restarts_total to monitor the containers in your Elastic Path Commerce namespace for each of the services.
You should set your monitor to constantly check your services.
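The restart metric is a cumulative counter, so an alert is usually based on how much it increases over a time window. The following Python sketch shows one way to evaluate the "once every five minutes or more often" rule from a list of counter samples. The function names and the 15-minute window are assumptions for illustration; a monitoring tool would typically express the same rule in its own query language.

```python
def restart_increase(samples, now, window_seconds):
    """Increase of a cumulative restart counter over the trailing window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A counter reset (for example, a replaced pod) is treated as no increase,
    which is a simplification.
    """
    in_window = [v for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return 0
    return max(in_window[-1] - in_window[0], 0)


def restarting_too_often(samples, now, window_seconds=900, interval_seconds=300):
    """True if restarts in the window average one per interval or more."""
    allowed = window_seconds / interval_seconds
    return restart_increase(samples, now, window_seconds) >= allowed
```

With the defaults, three or more restarts of a container within 15 minutes would trigger the alert.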
Monitoring the Deployment Pod Count
If you deployed Elastic Path Commerce with CloudOps for Kubernetes, you should monitor whether the desired and running pod counts of the Elastic Path Commerce services match. An alert should be triggered if the deployments have desired and running pod counts that are persistently out of sync, as this can indicate that the deployment cannot scale, that new pods cannot start, or that some larger issue is occurring.
The services are deployed in the namespace specified by the kubernetesNickname parameter of the Jenkins job deploy-or-delete-commerce-stack. You can use metrics from kube-state-metrics to monitor the pod count for each of the services.
You should set your monitor to constantly check your services.
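"Persistently out of sync" can be made concrete by requiring the mismatch to hold for several consecutive observations, so that a normal rollout or scale-up does not raise an alert. The sketch below assumes that you sample a desired and an available replica count per deployment, for example from the kube-state-metrics metrics kube_deployment_spec_replicas and kube_deployment_status_replicas_available. The choice of those two metrics and the threshold of five samples are assumptions, not something this page prescribes.

```python
def persistently_out_of_sync(observations, min_consecutive=5):
    """True if the most recent observations all show a mismatch.

    observations: iterable of (desired, available) pairs, oldest first.
    Only the trailing streak of mismatches is counted, so a deployment
    that has recovered does not alert.
    """
    streak = 0
    for desired, available in observations:
        streak = streak + 1 if desired != available else 0
    return streak >= min_consecutive
```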
Monitoring Cortex Replicas
If you deployed Elastic Path Commerce with CloudOps for Kubernetes and turned on horizontal pod autoscaling, you should monitor the number of Cortex replicas and whether it has reached the configured maximum. An alert should be triggered if the horizontal pod autoscaler has created the maximum allowed number of Cortex replicas, as there is a good chance that it wants to continue scaling to meet demand but is being prevented by the configured maximum.
The Cortex service is deployed in the namespace specified by the kubernetesNickname parameter of the Jenkins job deploy-or-delete-commerce-stack. You can monitor the state of the replicas with the following horizontal pod autoscaler metrics from kube-state-metrics:
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas
You should set your monitor to constantly check your Cortex replicas.
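The comparison between the two metrics above can be sketched as follows. The function takes two dictionaries keyed by autoscaler name, which is a hypothetical shape for illustration; how you fetch the values depends on your monitoring tool.

```python
def hpas_at_max(current_replicas, max_replicas):
    """Return the names of autoscalers whose current replicas have hit the max.

    current_replicas: dict of name -> kube_horizontalpodautoscaler_status_current_replicas
    max_replicas:     dict of name -> kube_horizontalpodautoscaler_spec_max_replicas
    Autoscalers missing from either dict are ignored.
    """
    return sorted(
        name
        for name, current in current_replicas.items()
        if name in max_replicas and current >= max_replicas[name]
    )
```

An alert would fire whenever the returned list is not empty.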
Monitoring the ActiveMQ Service
Monitoring Disk Space
If the ActiveMQ disk is full, the broker cannot receive new messages and you will see error messages in your logs.
You should monitor your KahaDB disk and set an alert when the disk reaches 80% capacity.
If you deployed Elastic Path Commerce with CloudOps for Kubernetes, the container uses a persistent volume claim attached to the ActiveMQ pod. The ActiveMQ KahaDB disk is configured with 300 GB of space. In CloudOps for Kubernetes release 2.11 and newer, it uses Amazon Elastic File System (EFS) for persistent storage. In release 2.10 and older, the single-replica ActiveMQ deployments use Amazon Elastic Block Store (EBS).
There is currently no known way to monitor the KahaDB disk in CloudOps for Kubernetes. The data should be available through JMX, but note that JMX is not configured to be exposed out of the box.
Monitoring the Continuous Integration Tooling
Monitoring the Nexus Service
The Nexus disk fills up as you build Elastic Path Commerce. If the disk is too full, builds and CI pipelines will no longer work, preventing you from pushing Elastic Path Commerce fixes and features to Elastic Path Commerce systems.
You should monitor your Nexus disk and set an alert when the disk reaches 80% capacity.
If you used CloudOps for Kubernetes to create your Nexus server, the disk is a persistent volume claim attached to the Nexus pod in the default namespace. It is a 256 GB Amazon Elastic Block Store (EBS) volume. CloudOps for Kubernetes includes Jenkins jobs that invoke Nexus jobs to delete extra release artifacts. The following kubelet metrics are available on the metrics server to monitor your disk space:
kubelet_volume_stats_available_bytes
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_used_bytes
Depending on how frequently Elastic Path Commerce is built, you should set your monitor to check at least once a week.
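The 80% alert can be derived from the used and capacity metrics listed above. The following sketch shows the calculation; the function names are hypothetical, and the byte values would come from kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes for the Nexus volume.

```python
def disk_usage_ratio(used_bytes, capacity_bytes):
    """Fraction of the volume in use, between 0.0 and 1.0."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes


def disk_alert(used_bytes, capacity_bytes, threshold=0.80):
    """True when usage has reached the alert threshold (80% by default)."""
    return disk_usage_ratio(used_bytes, capacity_bytes) >= threshold


GIB = 2**30  # bytes per gibibyte, used only for the examples below
```

For a 256 GB volume, disk_alert(205 * GIB, 256 * GIB) evaluates to True and disk_alert(200 * GIB, 256 * GIB) evaluates to False.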
Monitoring the Web Application Firewall (WAF) Service
Monitoring the WAF Logs
If you are using a Web Application Firewall (WAF) in front of Elastic Path Commerce services, you can monitor the firewall logs to determine when firewall rules are triggered. Typically, access to the application will be blocked when a rule is triggered. A rule being triggered can indicate that the WAF has blocked an unauthorized action, or that the WAF rules are too restrictive and blocked a legitimate request to one of the Elastic Path Commerce services.
If you deployed Elastic Path Commerce with CloudOps for Kubernetes, you may have enabled the optional ModSecurity web application firewall (WAF). If an HTTP 403 error appears in the ModSecurity WAF logs, it indicates that a ModSecurity rule was triggered. You can find the ModSecurity container logs in the Kubernetes namespace modsec. To monitor these logs, you can install a log aggregator tool.