RETROSPECTIVE SUMMARY:
Incident date: 3/24/2023
Resolution date: 3/24/2023
Retrospective date: 3/25/2023
CUSTOMER IMPACT: XML Delays between 2PM and 4PM ET. Average throughput time was ~18 minutes.
ROOT CAUSE:
In 2019, the table was created in DynamoDB table to capture the caching SSA Keys related information for UF Search application. On March 24, 2023, the table was deleted, but it had references of that DynamoDB Table in UF Search code. During scale up, new task was trying to create the DynamoDB Table and since the role had no create permission, task was failing during startup. This resulted in high queue depth alert triggered for XML UF Search queue.
NOTE: The role in 2019 had create table privilege – This table is no longer needed, the code was cleaned up to remove the table and avengers can delete the table. There was a mismatch in the roles between ACCP and prod. This table was identified for cleanup (contained 28GB of data). There is currently a process in place for this – the CFT (Cloud Formation Template). When the initial work was done in ~2019, this process was not in place.
MONITORING:
Are there any improvements in monitoring and alerts? Yes
RCA CATEGORY: Defect
CORRECTIVE ACTION:
Restoring the table took longer than expected, so the Avengers team granted table creation access in the PROD role to resolve the issue.
PREVENTIVE ACTION ITEMS BY POINT OF FAILURE: