What’s happening in your SAP systems? Find out with Pacemaker Alerts

By mullaned2002

March 7, 2022

When critical services fail, businesses risk losing revenue, productivity, and trust. That’s why Google Cloud customers running SAP applications choose to deploy high availability (HA) systems on Google Cloud.

In these deployments, Linux operating system clustering provides application and guest awareness of the application state and automates recovery actions in case of failure, including a cluster node or resource failure, a node failover, or a failed action.

Pacemaker is the most popular software Linux administrators use to manage their HA clusters. Its capabilities include automating notifications about events (failover, fencing, and node, attribute, and resource events) and reporting on those events. With automated alerts and reports, Linux administrators can not only learn about events as they happen, but also make sure other stakeholders are alerted to take action when critical events occur. They can even review past events to assess the overall health of their HA systems.

Here, we break down the steps for setting up automated alerts for HA cluster events and for reporting on those alerts.

How to Deploy the Alert Script

To set up event-based alerts, you’ll need to take the following steps to execute the script.

1. Download the script file ‘gcp_crm_alert.sh’ from https://github.com/GoogleCloudPlatform/pacemaker-alerts-cloud-logging

2. As the root user, make the script executable and run the deployment with:

chmod +x ./gcp_crm_alert.sh
./gcp_crm_alert.sh -d

3. Confirm that the deployment runs successfully. If it does, you will see the following INFO log messages:

On Red Hat Enterprise Linux (RHEL):

gcp_crm_alert.sh:2022-01-24T23:48:30+0000:INFO:'pcs alert recipient add gcp_cluster_alert value=gcp_cluster_alerts id=gcp_cluster_alert_recepient options value=/var/log/crm_alerts_log' rc=0

On SUSE Linux Enterprise Server (SLES):

gcp_crm_alert.sh:2022-01-25T00:13:27+00:00:INFO:'crm configure alert gcp_cluster_alert /usr/share/pacemaker/alerts/gcp_crm_alert.sh meta timeout=10s timestamp-format=%Y-%m-%dT%H:%M:%S.%06NZ to { /var/log/crm_alerts_log attributes gcloud_timeout=5 gcloud_cmd=/usr/bin/gcloud }' rc=0

Now, when a cluster node event, resource event, node failover, or failed action occurs, Pacemaker triggers the alert mechanism. For further details on the alerting agent, check out the Pacemaker Explained documentation.

How to Use Cloud Logging for Alert Reporting

Alerted events are published in Cloud Logging. Below is an example of the log record payload, where the cluster alert key-value pairs get recorded in the jsonPayload node.

{
  "insertId": "ktildwg1o3fbim",
  "jsonPayload": {
    "CRM_alert_recipient": "/var/log/crm_alerts_log",
    "CRM_alert_attribute_name": "",
    "CRM_alert_kind": "resource",
    "CRM_alert_status": "0",
    "CRM_alert_rsc": "STONITH-sapecc-scs",
    "CRM_alert_rc": "0",
    "CRM_alert_timestamp_usec": "",
    "CRM_alert_interval": "0",
    "CRM_alert_node_sequence": "21",
    "CRM_alert_task": "start",
    "CRM_alert_nodeid": "",
    "CRM_alert_timestamp": "2022-01-25T00:17:06.515313Z",
    "CRM_alert_timestamp_epoch": "",
    "CRM_alert_desc": "ok",
    "CRM_alert_target_rc": "0",
    "CRM_alert_version": "1.1.15",
    "CRM_alert_attribute_value": "",
    "CRM_alert_node": "sapecc-ers",
    "CRM_alert_exec_time": ""
  },
  "resource": {
    "type": "global",
    "labels": {
      "project_id": "gcp-tse-sap-on-gcp-lab"
    }
  },
  "timestamp": "2022-01-25T00:17:09.662557309Z",
  "severity": "INFO",
  "logName": "projects/gcp-tse-sap-on-gcp-lab/logs/sapecc-ers%2F%2Fvar%2Flog%2Fcrm_alerts_log",
  "receiveTimestamp": "2022-01-25T00:17:09.662557309Z"
}

To get notified of a resource event — for example, when the HANA topology resource monitor fails — you can use the following filter for the alerting definition:

jsonPayload.CRM_alert_node=("hana-venus" OR "hana-mercury")
-jsonPayload.CRM_alert_status="0"
jsonPayload.CRM_alert_rsc="rsc_SAPHanaTopology_SBX_HDB00"
jsonPayload.CRM_alert_task="monitor"
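
If you maintain several clusters, the filter above could be assembled from parameters rather than typed by hand. The following is a minimal sketch, not part of gcp_crm_alert.sh: build_resource_filter is a hypothetical helper name, and the commented gcloud call assumes an installed and authenticated gcloud CLI.

```shell
#!/usr/bin/env bash
# Sketch: assemble the resource-monitor filter from parameters.
# build_resource_filter is a hypothetical helper, not part of gcp_crm_alert.sh.
# Usage: build_resource_filter RESOURCE TASK NODE [NODE...]
build_resource_filter() {
  local resource="$1"
  local task="$2"
  shift 2
  local nodes=""
  local node
  for node in "$@"; do
    # Join the quoted node names with OR, as in the filter above.
    nodes="${nodes:+$nodes OR }\"$node\""
  done
  printf 'jsonPayload.CRM_alert_node=(%s)\n' "$nodes"
  # The leading "-" negates the condition: match any status other than "0".
  printf -- '-jsonPayload.CRM_alert_status="0"\n'
  printf 'jsonPayload.CRM_alert_rsc="%s"\n' "$resource"
  printf 'jsonPayload.CRM_alert_task="%s"\n' "$task"
}

# Assumed usage, to preview matching entries before defining the alert:
# gcloud logging read "$(build_resource_filter rsc_SAPHanaTopology_SBX_HDB00 monitor hana-venus hana-mercury)" --limit=10 --format=json
```

Each condition goes on its own line because Cloud Logging combines conditions on separate lines with an implicit AND, so the generated text matches the hand-written filter.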

To define an alert for a fencing event, you can apply this filter:

jsonPayload.CRM_alert_node=("hana-venus" OR "hana-mercury")
jsonPayload.CRM_alert_kind="fencing"

The fencing log entry is recorded with WARNING severity, and the additional detail it carries is also helpful for defining more specific filtering criteria:

{
  "insertId": "1plznskfjsxt82",
  "jsonPayload": {
    "CRM_alert_attribute_value": "",
    "CRM_alert_recipient": "/var/log/crm_alerts_log",
    "CRM_alert_rsc": "",
    "CRM_alert_rc": "0",
    "CRM_alert_timestamp_usec": "529261",
    "CRM_alert_desc": "Operation reboot of hana-mercury by hana-venus for crmd.2361@hana-venus: OK (ref=2a9bf814-9adf-4247-af3f-94ac254fc3ca)",
    "CRM_alert_target_rc": "",
    "CRM_alert_nodeid": "",
    "CRM_alert_kind": "fencing",
    "CRM_alert_node_sequence": "33",
    "CRM_alert_task": "st_notify_fence",
    "CRM_alert_status": "",
    "CRM_alert_exec_time": "",
    "CRM_alert_attribute_name": "",
    "CRM_alert_timestamp_epoch": "1643072786",
    "CRM_alert_version": "1.1.19",
    "CRM_alert_timestamp": "2022-01-25T01:06:26.529261Z",
    "CRM_alert_interval": "",
    "CRM_alert_node": "hana-mercury"
  },
  "resource": {
    "type": "global",
    "labels": {
      "project_id": "gcp-tse-sap-on-gcp-lab"
    }
  },
  "timestamp": "2022-01-25T01:06:27.267017052Z",
  "severity": "WARNING",
  "logName": "projects/gcp-tse-sap-on-gcp-lab/logs/hana-venus%2F%2Fvar%2Flog%2Fcrm_alerts_log",
  "receiveTimestamp": "2022-01-25T01:06:27.267017052Z"
}

Alerts can be delivered through multiple channels, including text and email. Below is an example of an email notification for our earlier example, when we defined an alert for a HANA topology resource monitor failure:

You can write and apply filters to your log-based alerts to isolate certain types of incidents and analyze events over time. For example, the following filter surfaces resource events that occurred within a two-hour window on a specific date:

timestamp>="2022-01-25T00:00:00Z" timestamp<="2022-01-25T02:00:00Z"
jsonPayload.CRM_alert_kind="resource"

With the ability to analyze these logged alerts over time, you can determine whether event patterns warrant any action.
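
One way to look for patterns is to count the logged events by kind over a time window. The sketch below is an illustration under stated assumptions: count_by_kind is a hypothetical helper that reads one CRM_alert_kind value per line, and the commented gcloud call assumes an authenticated gcloud CLI with read access to the project's logs.

```shell
#!/usr/bin/env bash
# Sketch: count alert events by kind.
# count_by_kind is a hypothetical helper; it reads one CRM_alert_kind value
# per line on standard input, skips blank lines, and prints "kind count"
# pairs sorted by kind.
count_by_kind() {
  awk 'NF { count[$0]++ } END { for (kind in count) print kind, count[kind] }' | sort
}

# Assumed usage: list the kind of every alert entry in the window, then count.
# gcloud logging read \
#   'timestamp>="2022-01-25T00:00:00Z" timestamp<="2022-01-25T02:00:00Z" jsonPayload.CRM_alert_kind:*' \
#   --format='value(jsonPayload.CRM_alert_kind)' | count_by_kind
```

A sudden rise in the count for fencing or resource events in a given window is a reasonable cue to inspect the individual entries.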

[SIDEBAR]

The alert script prints details to standard output and to the log file /var/log/crm_alerts_log, which can grow over time. We recommend managing the log file with the Linux logrotate service to limit the file system space it uses. Use the following command to create the necessary logrotate setting for the alerting log file:

cat > /etc/logrotate.d/crm_alerts_log << END-OF-FILE
/var/log/crm_alerts_log {
    create 0660 root root
    rotate 7
    size 10M
    missingok
    compress
    delaycompress
    copytruncate
    dateext
    dateformat -%Y%m%d-%s
    notifempty
}
END-OF-FILE

[END SIDEBAR]

Tips for Troubleshooting

When you first deploy your alert script, how can you tell for certain that you’ve done it correctly? Use the following commands to test it out:

In RHEL:
pcs alert show

In SLES:
sudo crm config show | grep -A3 gcp_cluster_alert

If the script is deployed correctly, you should see output like the following:

In RHEL:

Alerts:
 Alert: gcp_cluster_alert (path=/usr/share/pacemaker/alerts/gcp_crm_alert.sh)
  Description: "Cluster alerting for hana-node-X"
  Options: gcloud_cmd=/usr/bin/gcloud gcloud_timeout=5
  Meta options: timeout=10s timestamp-format=%Y-%m-%dT%H:%M:%S.%06NZ
  Recipients:
   Recipient: gcp_cluster_alert_recepient (value=gcp_cluster_alerts)
    Options: value=/var/log/crm_alerts_log

In SLES:

alert gcp_cluster_alert "/usr/share/pacemaker/alerts/gcp_crm_alert.sh" \
    meta timeout=10s timestamp-format="%Y-%m-%dT%H:%M:%S.%06NZ" \
    to "/var/log/crm_alerts_log" attributes gcloud_timeout=5 gcloud_cmd="/usr/bin/gcloud"

If the commands do not display the alerts properly, re-deploy the script.
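
The check above can also be scripted, for example as part of a post-deployment test. This is a rough sketch rather than part of the alert script: alert_is_configured is a hypothetical helper that only looks for the alert name gcp_cluster_alert in whatever output is piped into it.

```shell
#!/usr/bin/env bash
# Sketch: check whether the alert appears in the cluster configuration output.
# alert_is_configured is a hypothetical helper; it reads the output of
# "pcs alert show" (RHEL) or "crm config show" (SLES) on standard input.
alert_is_configured() {
  if grep -q 'gcp_cluster_alert'; then
    echo "alert configured"
  else
    echo "alert missing: re-deploy with ./gcp_crm_alert.sh -d"
    return 1
  fi
}

# Assumed usage on RHEL:
# pcs alert show | alert_is_configured
# Assumed usage on SLES:
# sudo crm config show | alert_is_configured
```

The function returns a non-zero status when the alert name is absent, so it can gate later steps in a script.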

If there is an issue with the script, or if the Cloud Logging records do not appear as expected, examine the script log file /var/log/crm_alerts_log. Errors and warnings can be filtered with:

egrep '(ERROR|WARN)' /var/log/crm_alerts_log
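
For a quick health summary, the same patterns can be counted instead of listed. The following is a minimal sketch: summarize_alert_log is a hypothetical helper, and it assumes that ERROR and WARN appear literally in the relevant log lines, as the command above implies.

```shell
#!/usr/bin/env bash
# Sketch: count ERROR and WARN lines in the alert script log.
# summarize_alert_log is a hypothetical helper; the default path is the log
# file named in this article.
summarize_alert_log() {
  local log_file="${1:-/var/log/crm_alerts_log}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  local errors
  local warnings
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function usable under "set -e".
  errors="$(grep -c 'ERROR' "$log_file" || true)"
  warnings="$(grep -c 'WARN' "$log_file" || true)"
  echo "errors=$errors warnings=$warnings"
}

# Assumed usage:
# summarize_alert_log
# summarize_alert_log /var/log/crm_alerts_log
```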

Any Pacemaker alert failures will be recorded in the messages and/or Pacemaker log. To examine recent alert failures, use the following command:

egrep '(gcp_crm_alert.sh|gcp_cluster_alert)' /var/log/messages /var/log/pacemaker.log

Keep in mind, though, that the Pacemaker log location may be different in your system from the one in the example above.

From reactive to proactive

Your SAP applications are too critical to risk outages. The most effective way to manage high availability clusters for your SAP systems on Google Cloud is to take full advantage of Pacemaker’s alerting capabilities, so you can be proactive in ensuring your systems are healthy and available.

Learn more about running SAP on Google Cloud.
