UPDATE: The previously scheduled maintenance has been rescheduled for February 14, 2018, 10:00 am CST – 2:00 pm CST to account for additional changes and scheduling conflicts with other maintenance and releases.
Stackify will be releasing a major update to a subsystem of the Retrace platform that affects how alerts are processed and how users are notified of those alerts.
The aim of this release is partly to improve system performance and resolve some outstanding defects. Most importantly, though, we have incorporated a tremendous amount of valuable feedback from our users on how to make this part of the Retrace platform a better overall experience.
February 14, 2018, 10:00 am CST – 2:00 pm CST (rescheduled from February 7). Please subscribe to updates at http://stackify.statuspage.io/ to receive real-time updates to this schedule.
Any alerts that are open at the time of this release will automatically be closed, and will re-open according to their configured rules (e.g. ‘Warning’ when CPU > 70% after 10 minutes) beginning immediately after the update. If the open alert was subject to a “Snooze,” “Acknowledge,” or “Maintenance Window,” no notifications will be re-sent. Otherwise, if the monitor is still in an alert state, notifications will be re-sent as configured. This “duplicate” notification will be a one-time event.
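The re-notification behavior described above can be read as a small decision rule. The sketch below is illustrative only: the `OpenAlert` type and its field names are hypothetical and are not part of the Retrace API.

```python
from dataclasses import dataclass


@dataclass
class OpenAlert:
    # Hypothetical fields, named for this illustration only.
    suppressed: bool      # alert was under Snooze, Acknowledge, or a Maintenance Window
    still_alerting: bool  # monitor still violates its configured rule after the update


def should_resend_notification(alert: OpenAlert) -> bool:
    """Return True if the one-time "duplicate" notification would be sent."""
    if alert.suppressed:
        # Snoozed, acknowledged, or in-maintenance alerts stay quiet.
        return False
    # Otherwise a notification goes out only if the monitor is still alerting.
    return alert.still_alerting
```

Under this reading, an acknowledged alert stays silent after the update, while an unacknowledged alert whose monitor is still over its threshold notifies once more.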
During this time period, for critical monitoring, we recommend that users log in to the Retrace portal and watch for alerted monitors, since notifications may be delayed or suspended during this maintenance window. This applies to all forms of notification, including email, SMS, and Slack integration.
The new Alert detail view better illustrates:
- The transitions that a particular monitor passed through while it was alerting
- The scope of the alert
- What applications and devices were impacted
- When notifications were sent
Alert Detail Features
- Current Status across all associated applications and devices
- Timeline view of alert status
- Snooze/Maintenance window visualization
- Notification indicator
- Acknowledgement indicator
The timeline visualization is a composite view of the alert status across all associated applications and devices. The highest level of severity is displayed in the graph. The dark vertical lines indicate a status transition on one or more associated applications or devices.
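To make the composite view concrete, here is a minimal sketch of how such a timeline could be derived. It assumes each associated resource reports a status at the same sequence of time steps. The `Severity` levels, the resource names, and the `composite_timeline` function are assumptions made for illustration, not Retrace's actual implementation.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed severity levels; higher value means more severe.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def composite_timeline(per_resource):
    """Combine per-resource status series into one timeline.

    per_resource maps a resource name to a list of Severity values,
    one per time step (all lists the same length).
    Returns (composite, transitions): the highest severity at each step,
    and the step indexes where at least one resource changed status.
    """
    series = list(per_resource.values())
    # The graph shows the highest severity across all resources at each step.
    composite = [max(step) for step in zip(*series)]
    # A vertical line marks any step where one or more resources changed status.
    transitions = [
        i for i in range(1, len(composite))
        if any(s[i] != s[i - 1] for s in series)
    ]
    return composite, transitions


statuses = {
    "web-01": [Severity.OK, Severity.WARNING, Severity.WARNING, Severity.WARNING, Severity.OK],
    "web-02": [Severity.OK, Severity.OK, Severity.CRITICAL, Severity.CRITICAL, Severity.CRITICAL],
}
composite, transitions = composite_timeline(statuses)
```

In this example the last step is marked as a transition even though the composite severity stays at critical, because one resource recovered while the other did not.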
If the monitor being displayed is associated with multiple items, the Associations section will be present at the bottom of the dialog. Expanding the section displays everywhere the monitor is being used and allows the user to navigate to each association.
This example illustrates that applications can set independent thresholds for the same monitor.
Historical alert data has been archived and will continue to be accessible from a secondary view via a link on the alerts history page.
Email and SMS notifications have been revised to provide clearer content and more actionable details about what is alerting and what the impacts are.
Email details now include:
- Name of affected resource
- Monitor type
- Link to alert detail in the Stackify portal
- Sparkline showing recent history of monitor
- List of all associated resources
Notification groups now support:
- Contact-specific message delivery preferences for each notification type
- Slack integration at a more granular level
- Slack integration @channel messages
Slack webhook integration can now be included in (or removed from) individual notification groups, rather than set at a “global” level. For customers who currently have Slack integration enabled, Slack will be added to all Notification Groups at the time of this update to preserve existing behavior. For customers enabling Slack integration in the future, Slack is not a default member of their Notification Groups and should be added to individual groups to begin receiving Slack notifications.
Alert configuration rules have two parts: a value threshold and a duration threshold.
- The value threshold is the point above or below which a reading is considered abnormal (e.g., X > Y, where Y is the value threshold).
- The duration threshold determines how long the value threshold must be exceeded before an alert is raised.
Currently, Retrace allows two methods to set the duration threshold:
- As a unit of time (e.g. 5 minutes continuously)
- As a number of repetitions for a result (e.g. 5 times consecutively).
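As a rough illustration of the time-based method, the sketch below checks whether a series of readings has stayed above a value threshold continuously for at least a given duration. How Retrace evaluates rules internally is not described here, so the function name, the sample format, and the evaluation logic are all assumptions.

```python
from datetime import datetime, timedelta


def breaches_for_duration(samples, value_threshold, duration):
    """Return True if readings stay above value_threshold for at least `duration`.

    samples: list of (timestamp, value) tuples sorted by timestamp ascending.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > value_threshold:
            if breach_start is None:
                # First reading of a new continuous breach.
                breach_start = timestamp
            if timestamp - breach_start >= duration:
                return True
        else:
            # Any normal reading resets the continuous-breach clock.
            breach_start = None
    return False


start = datetime(2018, 2, 14, 10, 0)
readings = [(start + timedelta(minutes=2 * i), cpu) for i, cpu in enumerate([75, 80, 72, 90])]
alerting = breaches_for_duration(readings, value_threshold=70, duration=timedelta(minutes=5))
```

Here the CPU stays above 70% from 10:00 through 10:06, so a 5-minute duration threshold is met. A single reading at or below 70% in the middle would restart the clock.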
Stackify has determined that the result-count method isn’t as reliable or accurate as the time-based duration threshold. Since any result-count rule could also be written as a time-based rule, Stackify will be removing the result-count method for determining duration.
Because result-count duration thresholds will no longer be supported, all existing configurations of this type will be automatically converted to time-based thresholds, based on the current check frequency of the monitor configuration.
Before: “Check CPU every 2 minutes, alert as ‘Warning’ if value is over 70% for 5 consecutive checks.”
After: “Check CPU every 2 minutes, alert as ‘Warning’ if value is over 70% for 10 minutes.”
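The example above is consistent with multiplying the check interval by the number of consecutive checks. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the exact conversion Stackify applies is assumed from this single example rather than confirmed.

```python
from datetime import timedelta


def result_count_to_duration(check_interval: timedelta, consecutive_checks: int) -> timedelta:
    """Convert a result-count threshold into an equivalent time-based duration.

    Assumes duration = check interval x number of consecutive checks,
    which matches the 2-minute / 5-check example (10 minutes).
    """
    if consecutive_checks < 1:
        raise ValueError("consecutive_checks must be at least 1")
    return check_interval * consecutive_checks


converted = result_count_to_duration(timedelta(minutes=2), 5)
```

With a 2-minute check interval and 5 consecutive checks, `converted` is a 10-minute duration, matching the converted rule quoted above.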
Stackify has made every effort to minimize customer impact during the deployment of this release. There is no expected downtime for the Retrace portal, and all client systems will continue to be monitored. However, notifications may be delayed or suspended during this maintenance window, so customers are encouraged to monitor their systems proactively by viewing alerts in the Retrace portal.
We appreciate your business and sincerely hope that this update will be well received and will enhance your ability to produce and maintain high-quality software systems.