Achieving Operational Efficiency and Optimising Cost by Migrating to ECS Fargate with Graviton

November 26, 2024

This is a guest article that was co-authored with Anirban Sinha, a Senior Technical Account Manager at AWS.

CleverTap’s Log Collector is a high-performance application designed to efficiently handle the influx of data from user devices. Initially, we operated the Log Collector on ECS EC2 instances. While EC2 offered flexibility, we faced challenges with complex infrastructure management, including manual tasks like patching, scaling, and capacity planning. This led to increased operational overhead and potential limitations in scalability.

To address these challenges, we strategically migrated the Log Collector to ECS Fargate. This transition significantly improved security, operational efficiency, and cost optimization. In this blog post, we’ll discuss our migration experience, highlighting key challenges and the effective strategies we employed to overcome them. We explore how CleverTap successfully navigated this transition, leveraging best practices to optimize our event collection system.

Background

Previously, managing and scaling EC2 instances on ECS required significant operational effort from our team. We sought a more streamlined solution, so we turned to AWS ECS Fargate. By adopting the serverless compute engine Fargate, we eliminated the need for infrastructure management, allowing us to focus solely on application development. This simplified approach reduced our operational overhead, accelerated our time-to-market, and potentially lowered our costs. Additionally, ECS Fargate’s automatic scaling further optimized our resource utilization, preventing the inefficiencies of overprovisioning or underutilization.

Migration Approach & Challenges 

In addition to general migration challenges such as managing dependencies, performance, and compatibility, CleverTap faced several specific challenges when migrating the Log Collector to Fargate.

Remote Management for ECS Fargate 

When migrating from EC2 to Fargate, CleverTap required a remote management solution similar to EC2 Run Command. While ECS Exec on Fargate enables in-container command execution, it cannot be directly integrated with AWS Systems Manager.

To achieve secure remote management, CleverTap implemented a multi-faceted approach:

  • AWS Systems Manager for secure jump server management
  • A jump server (bastion host) as a secure access point
  • ECS Exec to establish secure connections from the jump server to containers within the ECS cluster

This setup offers centralized access management through the jump server, simplifying security policies. It also provides granular IAM control by allowing the configuration of roles to restrict ECS Exec access to specific sources, such as the jump server’s IP address.
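The IAM restriction described above can be sketched as a policy document. This is a minimal illustration, not CleverTap's actual policy: the ARNs, account ID, and jump server address are hypothetical placeholders, and it assumes requests reach the ECS API from the jump server's public address, so that the `aws:SourceIp` condition key applies.

```python
import json


def build_ecs_exec_policy(cluster_arn: str, task_arn_pattern: str, jump_server_cidr: str) -> dict:
    """Build an IAM policy that allows ECS Exec only from one source address range."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEcsExecFromJumpServerOnly",
                "Effect": "Allow",
                "Action": "ecs:ExecuteCommand",
                "Resource": [cluster_arn, task_arn_pattern],
                # aws:SourceIp matches the caller's address as seen by the AWS API.
                # Assumption: calls do not go through a VPC endpoint.
                "Condition": {"IpAddress": {"aws:SourceIp": jump_server_cidr}},
            }
        ],
    }


if __name__ == "__main__":
    # All identifiers below are hypothetical placeholders.
    policy = build_ecs_exec_policy(
        cluster_arn="arn:aws:ecs:us-east-1:111122223333:cluster/log-collector",
        task_arn_pattern="arn:aws:ecs:us-east-1:111122223333:task/log-collector/*",
        jump_server_cidr="203.0.113.10/32",
    )
    print(json.dumps(policy, indent=2))
```

Attaching a policy like this to the role used on the jump server keeps the ECS Exec permission in one place, which fits the centralized access model described above.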

Deploying Web Apps in Read-Only Containers to Improve Security

To improve security, CleverTap deployed their web applications in read-only containers. This approach would prevent someone who exploits a vulnerability in the web server from uploading and running malicious files. However, CleverTap encountered a limitation with Amazon ECS Exec, which doesn’t work with containers set to read-only mode. This is because the SSM agent, which enables ECS Exec functionality, requires write access to create temporary files and directories.

To address this challenge, CleverTap implemented a workaround. Instead of directly enforcing read-only access, they limited the container’s capabilities by creating a custom IAM policy with restricted permissions or using the pre-built “ReadOnlyAccess” policy. CleverTap then attached this policy to the IAM role assigned to the ECS task execution.

Additionally, CleverTap used bind mounts to configure specific directories within the container, such as tmpfs locations, as writable locations for logs and temporary files.
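As a rough sketch of the writable-directory idea, the helper below builds the `volumes` and `mountPoints` fragments of an ECS task definition for a list of paths. The paths and volume names are assumptions for illustration; the source does not list the directories CleverTap mounted.

```python
def writable_scratch_mounts(paths):
    """Return (volumes, mount_points) fragments for an ECS task definition.

    Each path gets its own task-scoped volume, mounted read-write into the container.
    """
    volumes = []
    mount_points = []
    for index, path in enumerate(paths):
        name = f"scratch-{index}"  # hypothetical volume naming scheme
        # A volume with only a name is backed by the task's ephemeral storage.
        volumes.append({"name": name})
        mount_points.append(
            {"sourceVolume": name, "containerPath": path, "readOnly": False}
        )
    return volumes, mount_points


if __name__ == "__main__":
    # Hypothetical writable locations for logs and temporary files.
    volumes, mount_points = writable_scratch_mounts(["/tmp", "/var/log/app"])
    print(volumes)
    print(mount_points)
```

The `volumes` list belongs at the top level of the task definition, and `mountPoints` belongs inside the relevant container definition.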

It’s important to note that these “workarounds” are not officially supported by AWS, and they could potentially stop working in the future due to changes made in Amazon ECS.

Log Integration Approach with Splunk

CleverTap configured their Amazon ECS task definitions to ship system output logs to Splunk through the Splunk HTTP Event Collector (HEC). By default, the Splunk log driver runs in “blocking” mode. This caused ECS Fargate containers to fail their health checks and prevented the ECS service from reaching the desired count.

To resolve the issue, the Splunk Log driver mode was changed to non-blocking. However, CleverTap raised concerns about losing logs, as non-blocking mode does not guarantee that all events will be logged.

To increase the reliability of container logging, the following steps were taken:

  1. Stress test the environment to configure the desired buffer value. This will help to determine the maximum amount of data that can be buffered before logs are lost.
  2. Monitor CloudWatch metrics such as ProcessedBytes to understand if there is a change in network data pattern. This will help to identify potential problems with the Splunk HTTP Event Collector (HEC) endpoint.
  3. Right-size HEC endpoint containers based on expected traffic volume. This will help to ensure that the HEC endpoints have enough resources to handle the expected load.
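The log driver settings discussed above can be sketched as a `logConfiguration` fragment for a container definition. The HEC URL, secret ARN, and the `4m` buffer size are illustrative assumptions; the right buffer value is whatever the stress test in step 1 indicates. The second helper is simple arithmetic for estimating how long a buffer can absorb logs while the HEC endpoint is unreachable.

```python
def splunk_log_configuration(hec_url: str, token_secret_arn: str, max_buffer_size: str = "4m") -> dict:
    """Build a container logConfiguration that uses the Splunk driver in non-blocking mode."""
    return {
        "logDriver": "splunk",
        "options": {
            "splunk-url": hec_url,
            # non-blocking: the application keeps running when the buffer is full,
            # at the cost of dropping log messages.
            "mode": "non-blocking",
            "max-buffer-size": max_buffer_size,
        },
        # Keep the HEC token out of the task definition by referencing a secret.
        "secretOptions": [{"name": "splunk-token", "valueFrom": token_secret_arn}],
    }


def buffer_headroom_seconds(buffer_bytes: int, log_bytes_per_second: float) -> float:
    """Estimate how long the buffer can absorb logs if nothing is being delivered."""
    if log_bytes_per_second <= 0:
        raise ValueError("log_bytes_per_second must be positive")
    return buffer_bytes / log_bytes_per_second


if __name__ == "__main__":
    # Hypothetical endpoint and secret.
    config = splunk_log_configuration(
        "https://splunk.example.com:8088",
        "arn:aws:secretsmanager:us-east-1:111122223333:secret:hec-token",
    )
    print(config)
    # A 4 MiB buffer at 512 KiB/s of log output.
    print(buffer_headroom_seconds(4 * 1024 * 1024, 512 * 1024))
```

An estimate like this only bounds the problem. The stress test remains the way to confirm the value under realistic traffic.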

Mitigating Downtime During Heap Dumps 

CleverTap requires the ability to capture Java heap dumps for troubleshooting memory-related application issues. While taking a heap dump typically causes a brief pause in the JVM’s execution, this can lead to the application exceeding the Elastic Load Balancer (ELB) health check timeout threshold. This, in turn, triggers the termination of the container by Amazon ECS Fargate before the heap dump process finishes. To address this challenge and minimize downtime during heap dump capture, CleverTap has implemented an automated process. This process dynamically adjusts the ELB health check configuration. Here’s the breakdown:

  1. Temporary Timeout Increase: Upon initiating the heap dump capture, the process automatically increases the ELB health check timeout value. This provides sufficient time for the JVM to complete the dump without being flagged as unhealthy.
  2. Heap Dump Completion: Once the heap dump is successfully captured, the process promptly reverts the ELB health check timeout to its optimal value. This ensures the application remains responsive to subsequent health checks.
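The two steps above map naturally onto a context manager that raises the timeout, runs the heap dump, and always restores the original value. This is a sketch rather than CleverTap's actual automation. It assumes a client with the same `describe_target_groups` and `modify_target_group` calls as boto3's `elbv2` client, and it includes a small in-memory stand-in so the flow can be exercised without AWS access. The ARN and timeout values are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def extended_health_check_timeout(elbv2, target_group_arn, temporary_timeout_seconds):
    """Temporarily raise a target group's health check timeout, then restore it."""
    described = elbv2.describe_target_groups(TargetGroupArns=[target_group_arn])
    original = described["TargetGroups"][0]["HealthCheckTimeoutSeconds"]
    elbv2.modify_target_group(
        TargetGroupArn=target_group_arn,
        HealthCheckTimeoutSeconds=temporary_timeout_seconds,
    )
    try:
        yield original
    finally:
        # Restore the original timeout even if the heap dump fails.
        elbv2.modify_target_group(
            TargetGroupArn=target_group_arn,
            HealthCheckTimeoutSeconds=original,
        )


class InMemoryElbv2:
    """Stand-in for an elbv2 client, for dry runs only (not an AWS API)."""

    def __init__(self, timeout):
        self.timeout = timeout

    def describe_target_groups(self, TargetGroupArns):
        return {"TargetGroups": [{"HealthCheckTimeoutSeconds": self.timeout}]}

    def modify_target_group(self, TargetGroupArn, HealthCheckTimeoutSeconds):
        self.timeout = HealthCheckTimeoutSeconds


if __name__ == "__main__":
    client = InMemoryElbv2(timeout=5)
    with extended_health_check_timeout(client, "hypothetical-target-group-arn", 60):
        print("timeout during heap dump:", client.timeout)
    print("timeout after heap dump:", client.timeout)
```

Depending on the load balancer type, the health check interval may also need to be raised so that it stays above the temporary timeout.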

Multi-architecture Container Images for Fargate

CleverTap adopted multi-architecture container images for Fargate to capitalize on the cost-performance benefits of different processor types while maintaining the adaptability inherent in containerized workloads. These images, stored in Amazon ECR, utilize layers and manifests to specify runtime characteristics (ARM64 or x86_64) and enable the container runtime to automatically select the appropriate image based on the underlying system architecture, allowing CleverTap to seamlessly switch between CPU architectures.
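To illustrate how a manifest-based selection works, the sketch below picks the image digest that matches a task's CPU architecture from an image index (manifest list). It is a simplified model of what the container runtime does, not its actual implementation, and the digests are placeholder values.

```python
# ECS runtimePlatform.cpuArchitecture values mapped to image index architecture names.
ECS_TO_IMAGE_ARCH = {"X86_64": "amd64", "ARM64": "arm64"}


def select_manifest_digest(image_index: dict, cpu_architecture: str, os_name: str = "linux") -> str:
    """Return the digest of the manifest matching the requested architecture and OS."""
    wanted_arch = ECS_TO_IMAGE_ARCH[cpu_architecture]
    for manifest in image_index["manifests"]:
        platform = manifest.get("platform", {})
        if platform.get("architecture") == wanted_arch and platform.get("os") == os_name:
            return manifest["digest"]
    raise LookupError(f"no manifest for {os_name}/{wanted_arch}")


# Placeholder index with made-up digests, shaped like a multi-architecture image index.
EXAMPLE_INDEX = {
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:placeholder-amd64", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:placeholder-arm64", "platform": {"architecture": "arm64", "os": "linux"}},
    ],
}

if __name__ == "__main__":
    # A Graviton-based Fargate task declares ARM64 in its task definition.
    runtime_platform = {"cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX"}
    print(select_manifest_digest(EXAMPLE_INDEX, runtime_platform["cpuArchitecture"]))
```

Because the same image tag resolves to either architecture, switching a service between x86_64 and Graviton comes down to changing `cpuArchitecture` in the task definition.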

Securing Fargate Workloads with the Default Credential Provider Chain 

Since AWS Fargate doesn’t work with ECS Instance Profiles (designed for EC2 instances in ECS clusters), CleverTap opted for a more secure approach using the default credentials provider chain of the AWS SDK for Java. This chain is implemented by the DefaultCredentialsProvider class in the AWS SDK for Java 2.x.

The provider chain sequentially checks each place where the default configuration for supplying temporary credentials can be set, and then selects the first valid set of credentials it finds. This allows the SDK to automatically locate and retrieve the default configuration settings without requiring manual management of credentials.

The default credentials provider chain searches for configuration in your environment using a predefined sequence of locations, such as Java system properties, environment variables, and the web identity token from the AWS Security Token Service.
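The "first valid source wins" behaviour can be modelled in a few lines. The sketch below is a conceptual illustration in Python, not the SDK's implementation: it only checks whether the well-known property and environment variable names are present, it never fetches credentials, and it omits some locations the real chain also consults (for example, shared configuration files).

```python
def first_matching_source(system_properties: dict, environment: dict):
    """Return the name of the first credential source whose settings are present."""
    chain = [
        (
            "java-system-properties",
            lambda: "aws.accessKeyId" in system_properties
            and "aws.secretAccessKey" in system_properties,
        ),
        (
            "environment-variables",
            lambda: "AWS_ACCESS_KEY_ID" in environment
            and "AWS_SECRET_ACCESS_KEY" in environment,
        ),
        (
            "web-identity-token",
            lambda: "AWS_WEB_IDENTITY_TOKEN_FILE" in environment
            and "AWS_ROLE_ARN" in environment,
        ),
        (
            # On ECS, the agent injects this variable so the SDK can fetch task role credentials.
            "container-credentials",
            lambda: "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" in environment
            or "AWS_CONTAINER_CREDENTIALS_FULL_URI" in environment,
        ),
    ]
    for name, is_configured in chain:
        if is_configured():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical Fargate task environment: only the container credentials URI is set.
    fargate_env = {"AWS_CONTAINER_CREDENTIALS_RELATIVE_URI": "/v2/credentials/example"}
    print(first_matching_source({}, fargate_env))
```

In a Fargate task with no static keys configured, the earlier sources are empty, so the chain falls through to the task's temporary container credentials.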

While Instance Profiles are useful for EC2-based ECS deployments, the default chain offers better security and scalability for Fargate tasks.

Key benefits of CleverTap’s approach

Reduced Attack Surface: Container images don’t store long-lived credentials, minimizing damage from potential attacker access.

Principle of Least Privilege: IAM roles with specific permissions can be defined for Fargate tasks, granting access only to necessary resources.

The credentials for an Amazon ECS task are isolated at both the task definition level and the container level. At the task definition level, the credentials are shared across all containers within the task, granting the permissions the entire task needs to interact with other AWS services. At the container level, the credentials are specific to each individual container, allowing more granular control over the permissions granted to different container types. For example, the log collector container and the Splunk container each have access only to the specific AWS services they need. The default credentials provider chain makes this differentiation possible: it simplifies credential management and lets containers access the required resources without storing sensitive credentials inside the containers.

Benefits of Migrating

CleverTap’s migration to ECS Fargate yielded substantial improvements in elasticity, security, operational efficiency, and cost optimization. Our experience underscores ECS Fargate’s ability to deliver operational simplification and cost-effectiveness. This blog post serves as a valuable roadmap for organizations contemplating a similar migration.

Operational Simplification: ECS Fargate offers significant operational and management advantages for CleverTap’s DevOps team. To ensure a seamless migration without downtime, we implemented a parallel service approach. This involved utilizing Route 53 to gradually route 5-10% of traffic to the new load balancer while simultaneously modifying the CloudFront origin.
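A gradual traffic shift of this kind can be expressed as a pair of Route 53 weighted records. The sketch below builds the change batch for such a shift. The record name, load balancer hostnames, set identifiers, and the use of CNAME records are assumptions for illustration, since the source does not describe the exact record configuration.

```python
def weighted_change_batch(record_name: str, old_target: str, new_target: str,
                          new_percent: int, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that sends new_percent of traffic to new_target."""
    if not 0 <= new_percent <= 100:
        raise ValueError("new_percent must be between 0 and 100")

    def weighted_record(set_identifier: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_identifier,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": f"Shift {new_percent}% of traffic to the new load balancer",
        "Changes": [
            weighted_record("ecs-ec2", old_target, 100 - new_percent),
            weighted_record("ecs-fargate", new_target, new_percent),
        ],
    }


if __name__ == "__main__":
    # Hypothetical names: start by sending 10% of traffic to the Fargate service.
    batch = weighted_change_batch(
        "collector.example.com.",
        "old-lb.example.com",
        "new-lb.example.com",
        new_percent=10,
    )
    print(batch)
```

Re-running the same change with a higher percentage continues the shift, and setting it back to 0 acts as a rollback.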

Blue-Green deployments further streamlined the process by eliminating the need for manual replica creation. This approach allowed for a smoother and more efficient transition.

Deployments became significantly faster, with only the pool and container requiring boot-up. This resulted in an approximate 5x improvement in container start-up times compared to the traditional EC2 launch type, demonstrating the efficiency gains offered by ECS Fargate.

Elasticity/scalability: ECS Fargate’s granular scaling capabilities offer a significant advantage over the ECS EC2 launch type. By enabling Log Collector applications to scale up or down at the task level rather than the instance level, Fargate allows for more precise resource allocation. This flexibility has proven beneficial for CleverTap, resulting in a 50% reduction in scaling operation time. CleverTap can now scale their resources up or down within 40 seconds, ensuring optimal resource utilization and cost efficiency.

Security: CleverTap employs a multi-tenant Log Collector application running on ECS Fargate. Each task is isolated in a separate container, reducing the attack surface and preventing potential compromises from spreading across tasks or infrastructure. Fargate’s immutable containers hinder attackers’ ability to persist or make malicious changes, enhancing data security across customers.

Cost Optimization: The granular scaling capabilities of ECS Fargate allow for precise resource allocation, while Graviton-based tasks can reduce CPU resource requirements by up to 20% compared to traditional x86-based instances. This powerful combination empowers organizations to optimize their containerized workloads, resulting in a 25% reduction in compute costs when compared to the ECS EC2-based launch type.

Conclusion

This blog post explores the key considerations and challenges faced when migrating CleverTap’s log collector to Amazon ECS Fargate. It details innovative solutions implemented to address these challenges, including secure remote management, overcoming read-only container limitations, and minimizing downtime during heap dumps. By migrating to ECS Fargate, CleverTap significantly improved elasticity, security, and operational efficiency.
