Resolved - ClaimSearch Anti-Fraud Service Disruption
Incident Report for ISO ClaimSearch
Postmortem

TIMING:

February 14, 2024, 4:13 PM ET to February 16, 2024, 10:57 AM ET

DESCRIPTION:

ClaimSearch customers were unable to log in to ClaimSearch services.

IMPACT:

ClaimDirector was unavailable to customers on Thursday, Feb 15 and Friday, Feb 16. The outage also affected NICB Services and the Visual Platform, and caused processing delays due to high queue depth in the System-to-System interfaces (XML, FTP, Web).

ROOT CAUSE:

On Wednesday, February 14th, ClaimDirector's scoring queues began alerting in the late afternoon. By February 15th, the high database load had escalated into a major outage. The DBA team determined that insufficient statistics gathering on the involved party table, combined with the table's growth over time, led to bad query plans and performance degradation in the database. As a result, ClaimDirector was unavailable, NICB Services and the Visual Platform were affected, and high queue depth caused processing delays in the System-to-System interfaces (XML, FTP, Web).
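
The following is a minimal diagnostic sketch, not part of the incident record: it shows one way to check whether Postgres has recently gathered statistics on a large, fast-growing table, using the standard pg_stat_user_tables view. The connection string and the table name "involved_party" are assumptions for illustration only.

    # Hedged sketch: inspect analyze history and row counts for a large table.
    # The DSN and table name are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=claimsearch")  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze
            FROM pg_stat_user_tables
            WHERE relname = %s
            """,
            ("involved_party",),  # hypothetical table name
        )
        # A stale last_analyze/last_autoanalyze on a table that has grown
        # significantly is a common precursor to bad query plans.
        print(cur.fetchone())
    conn.close()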

CORRECTIVE ACTION:

·         The DBAs ran VACUUM on the impacted tables (see the sketch after this list).

·         ClaimDirector tasks were brought down once it was determined that these tasks were causing unusually high database load.

·         The Engineering and DBA teams tuned queries to improve database performance.

·         The Engineering teams implemented a temporary fix to disable tokenization in ClaimDirector and re-enabled the ClaimDirector tasks.

·         The DBAs added reader nodes to the Postgres database to process the backlog in the queues.
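
As a hedged illustration of the vacuum step above (not the DBAs' actual commands), the sketch below runs VACUUM (ANALYZE) against an impacted table using psycopg2. The DSN and table name are assumptions; note that VACUUM must run outside a transaction block, hence autocommit.

    # Hedged sketch of "run vacuum on impacted tables"; names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=claimsearch")  # hypothetical DSN
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    with conn.cursor() as cur:
        for table in ("involved_party",):  # hypothetical impacted table(s)
            # Identifiers cannot be bound as parameters; the table names here
            # are trusted, hard-coded values.
            cur.execute(f"VACUUM (ANALYZE, VERBOSE) {table};")
    conn.close()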

PREVENTATIVE MEASURES:

  1. Increase the statistics target for the relevant column(s) from the default of 100 to 1000 and re-analyze the table in production. This will proactively set the statistics target for any tables over a certain size threshold (see the sketch after this list).
  2. Create an SOP for query performance: start by vacuum-analyzing the table(s); if that does not improve performance, raise default_statistics_target for the session and re-analyze the table(s).
  3. Look into baselining queries and alerting on performance degradation.
  4. Implement tooling that can help with diagnosing and troubleshooting Postgres-related issues.
  5. Create an SOP for VACUUM and VACUUM ANALYZE.
  6. Move the large tables from the OLTP database to a Data Warehouse or Data Store.
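
The sketch below illustrates measures 1 and 2 under stated assumptions (psycopg2; hypothetical table and column names): raising a per-column statistics target from the default of 100 to 1000 and re-analyzing, and raising default_statistics_target for the current session as the SOP fallback.

    # Hedged sketch of preventative measures 1 and 2; names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=claimsearch")  # hypothetical DSN
    conn.autocommit = True
    with conn.cursor() as cur:
        # Measure 1: raise the per-column statistics target, then re-analyze so
        # the planner rebuilds its statistics with the larger sample.
        cur.execute("ALTER TABLE involved_party ALTER COLUMN claim_id SET STATISTICS 1000;")
        cur.execute("ANALYZE involved_party;")

        # Measure 2 (session-level fallback from the SOP): raise the default
        # target for this session only and re-analyze.
        cur.execute("SET default_statistics_target = 1000;")
        cur.execute("ANALYZE involved_party;")
    conn.close()
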
Posted Mar 01, 2024 - 11:14 EST

Resolved
This incident has been resolved.
Posted Feb 23, 2024 - 12:46 EST
Update
The service disruption for Verisk Anti-Fraud ClaimSearch is now resolved. Systems were monitored through the weekend. We are scheduling a retrospective and will post the findings on this page once that process is completed. We apologize for the inconvenience.
Posted Feb 20, 2024 - 09:55 EST
Update
We are pleased to report that we have implemented a workaround for the claims that run through the claims scoring process. We are currently processing through the backlog. We will be monitoring the progress through the weekend and will provide updates as warranted. Thank you again for your patience as we worked through this issue.
Posted Feb 16, 2024 - 16:09 EST
Monitoring
Thank you for your patience as we continue to work through several issues which have impacted system performance. Many of the processes have returned to normal; however, there could still be residual delays in receiving match reports due to system backlogs, especially for claims that run through the claims scoring processes. Although there may still be delays, no claims have been lost and no action is required by customers; all claims will be processed once the queues are cleared. Once all issues are resolved and the root cause is determined, we will share that information. We apologize for any impact to your claim processing and are working diligently to resolve all issues.
Posted Feb 16, 2024 - 09:03 EST
Update
We have implemented a fix for the partial outage which impacted ClaimSearch Anti-Fraud Services. We have a backlog to process and expect to work through the bulk of it overnight. Thank you for your patience while we work through this issue. This will be the last update this evening - we will update again in the morning.
Posted Feb 15, 2024 - 20:19 EST
Identified
We have identified the cause of the disruption impacting ClaimSearch Anti-Fraud Services and are actively working on implementing a solution.
Posted Feb 15, 2024 - 15:11 EST
Update
We continue to actively work on the ClaimSearch Anti-Fraud service disruption. We will need to process the backlog once a fix is implemented. We apologize for the inconvenience and thank you for your patience.
Posted Feb 15, 2024 - 13:32 EST
Update
We continue to actively investigate the ClaimSearch Anti-Fraud service disruption and appreciate your patience as we work to identify the root cause. Again, we apologize for the inconvenience.
Posted Feb 15, 2024 - 11:40 EST
Update
We again apologize for the inconvenience. The Anti-Fraud issue is being investigated, and we're working towards a resolution.
Posted Feb 15, 2024 - 10:16 EST
Update
While we investigate the Anti-Fraud Service Disruption, we are committed to keeping you informed about our progress toward a solution. We apologize for the inconvenience.
Posted Feb 15, 2024 - 09:02 EST
Update
We are continuing to investigate this issue.
Posted Feb 15, 2024 - 08:32 EST
Update
In addition to the services listed above, NICB Services and VIN Monitoring are also impacted.
Posted Feb 15, 2024 - 08:32 EST
Investigating
Our team is currently investigating the issue affecting ClaimSearch Anti-Fraud Services. Rest assured, we're working diligently to restore normal service. This is impacting ClaimDirector, XML Throughput, and Visual ClaimSearch Match Reports.
Posted Feb 15, 2024 - 08:05 EST
This incident affected: System-to-System Interfaces (FTP, XML, MQ), Visual Platform, NICB, and ClaimDirector.