Increased latency for pipelines in Hosted environment

Incident Report for Etleap

Postmortem

On October 20, 2025, between 07:13 UTC and 22:25 UTC, we experienced a disruption affecting multiple services due to the widespread outage in the AWS us-east-1 region, from which we operate our US deployment.

Pipeline Operations

From 07:13 UTC to 09:21 UTC, pipeline activities were unavailable due to outages in several dependent AWS services, including DynamoDB, SQS, SNS, and Glue. Between 07:30 UTC and 08:28:48 UTC, we were unable to send SNS notifications for completed activities. Beginning at 09:21 UTC, new activities could be initiated; however, recovery was delayed as EC2 instance provisioning was throttled, limiting our ability to restore capacity promptly. Full recovery of all pipeline activities was achieved by 18:40 UTC.

Throughout the day, we observed certain source and destination connections failing to be extracted from or loaded to, due to those third parties' own use of AWS infrastructure. For more information, we recommend reviewing their status pages.

Webhook/Event Stream Services

Between 08:20 UTC and 17:35 UTC, webhook (event stream) endpoints were unable to receive events. This was caused by insufficient EC2 capacity and networking issues between our load balancer and EC2 targets. Recovery began at 17:35 UTC, with intermittent connectivity issues persisting until 22:25 UTC, at which point all services and capacity were fully restored. During this period, some webhook calls received HTTP 502 responses. Depending on how the sender is configured, these calls may have been retried until the events were processed successfully.

Customer Impact

During the outage, the webhook endpoints returned HTTP 502 responses. Any messages that received this response should be retried.
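
For senders that do not retry automatically, re-sending any events that received a 502 will ensure they are processed. Below is a minimal sketch of such retry logic with exponential backoff; the endpoint URL, payload shape, and retry limits are illustrative assumptions, not Etleap-provided values.

    import time

    import requests

    # Illustrative sketch only: the endpoint URL, payload shape, and retry
    # limits below are assumptions for demonstration, not Etleap specifics.
    WEBHOOK_URL = "https://example.invalid/etleap-event-stream"
    MAX_ATTEMPTS = 5

    def send_with_retry(payload: dict) -> requests.Response:
        """POST an event, retrying with exponential backoff on HTTP 502."""
        for attempt in range(MAX_ATTEMPTS):
            response = requests.post(WEBHOOK_URL, json=payload, timeout=10)
            if response.status_code != 502:
                return response  # delivered, or a non-retryable error
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s
        return response  # still 502 after MAX_ATTEMPTS; surface to the caller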

Any SNS notifications for activities completed between 07:30 UTC and 08:28:48 UTC were not sent.

Posted Oct 21, 2025 - 14:29 PDT

Resolved

We are seeing that most pipelines and dbt schedules have recovered. Etleap Support has reached out directly to customers affected by the remaining issues, which we are working to address.

For private deployments, Etleap Support has reached out if any remedial steps are required.
Posted Oct 20, 2025 - 14:43 PDT

Update

We have addressed capacity and networking issues for our streaming ingest endpoints.
Posted Oct 20, 2025 - 13:36 PDT

Update

We are seeing connectivity issues affecting our streaming endpoints and are investigating the root cause.
Posted Oct 20, 2025 - 13:13 PDT

Update

Event stream ingestion has fully recovered. We are continuing to monitor pipeline recovery.
Posted Oct 20, 2025 - 12:43 PDT

Update

We are able to provision extra capacity for our event streaming endpoints and are seeing a reduction in error rates. We are continuing to monitor the situation.
Posted Oct 20, 2025 - 11:17 PDT

Update

We are seeing increased connectivity issues with our event streaming endpoints due to underlying networking problems.
Posted Oct 20, 2025 - 10:55 PDT

Update

We are seeing pipeline and dbt schedule latencies recovering, and we are continuing to monitor the overall recovery.
Posted Oct 20, 2025 - 10:46 PDT

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 10:42 PDT

Update

We were able to recover some capacity for our Event Stream endpoints. They are currently able to receive data; however, request latencies are still high. We are working to provision extra capacity.
Posted Oct 20, 2025 - 10:41 PDT

Update

The AWS outage in US East (N. Virginia) is still ongoing. AWS has identified the root cause as an internal networking issue and has throttled requests for new EC2 instances. This is currently causing potential outages to the following Etleap components:
- Pipelines - may become latent as EMR may fail to scale up for increases in demand.
- dbt Schedules - may become latent as EMR may fail to scale up for increases in demand.
- Event Streams - may fail to read from sources, as the autoscaling group that serves these connections fails to provision new instances.

Our AI and UI components are currently fully operational.
Posted Oct 20, 2025 - 09:35 PDT

Update

We were experiencing increased errors in both the UI and API in the US hosted environment due to a credentials error caused by the ongoing AWS outage. We have implemented a fix and are seeing a decreased rate of errors in both the API and UI.
Posted Oct 20, 2025 - 06:41 PDT

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 06:31 PDT

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 06:22 PDT

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 06:15 PDT

Update

VPCs deployed in US East (N. Virginia) are also affected.
Posted Oct 20, 2025 - 04:29 PDT

Monitoring

AWS outage: operational issues affecting multiple services in US East (N. Virginia).
Outage first reported Mon 20 October 07:11 UTC (12:11am PDT).
Outage started recovering Mon 20 October 09:27 UTC (02:27am PDT).
Posted Oct 20, 2025 - 03:15 PDT
This incident affected: US Hosted App (app.etleap.com) (UI, Pipelines, API, Event Streams, dbt Schedules) and Private Deployments (VPCs) (Pipelines).