logdistributor

Improvements and Future Work

This document outlines potential improvements and future work for the log distributor project.

Improvements

Given more time, the following improvements could be made to address the identified issues and enhance the overall system:

gRPC (Protobufs) Instead of HTTP (Increases performance and throughput)

Special consideration was given to using gRPC instead of HTTP for communication between the log emitter, the distributor, and the analysers, but due to the added implementation complexity of Akka gRPC I chose to stick with a REST-style HTTP API for now. gRPC provides better performance and type safety than HTTP, especially for high-throughput applications. The log packets could be defined using Protocol Buffers, which would allow for more efficient serialization and deserialization and would also ease integration with other systems that use gRPC, such as microservice architectures. All three services (distributor, analysers, and frontend) could use the auto-generated gRPC stubs for communication, reducing the amount of boilerplate code needed for HTTP handling.
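
Roughly, the sketch below shows hand-written stand-ins for the kinds of messages and service interface Akka gRPC would generate from a Protocol Buffers definition of the log packet; the field names, the LogAnalyser service, and the SendLog call are assumptions for the example, not the project's actual schema.

```scala
import scala.concurrent.Future

// Hand-written stand-ins for what Akka gRPC would generate from a .proto such as:
//   message LogPacket { string id = 1; string payload = 2; int64 timestamp = 3; }
//   message Ack       { bool ok = 1; }
//   service LogAnalyser { rpc SendLog (LogPacket) returns (Ack); }
// (fields and service name are illustrative, not the project's real schema)
final case class LogPacket(id: String, payload: String, timestamp: Long)
final case class Ack(ok: Boolean)

// The generated gRPC client would implement an interface like this, so a
// distributor worker could forward packets with no JSON handling at all.
trait LogAnalyserService {
  def sendLog(in: LogPacket): Future[Ack]
}

final class GrpcForwarder(client: LogAnalyserService) {
  def forward(packet: LogPacket): Future[Ack] = client.sendLog(packet)
}
```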

Using a clustered setup for the distributor

The current implementation of the distributor service is a single instance. For better scalability and fault tolerance, it would be beneficial to implement a clustered setup using Akka Cluster or Apache Pekko Cluster. This would allow multiple instances of the distributor to run in parallel, distributing the load and providing redundancy in case of failures.
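
A minimal sketch of bootstrapping a cluster-aware distributor node with Akka Cluster Typed (Pekko's API is analogous); the system name, self-join, and configuration are assumptions for a local demo, not the project's deployment setup.

```scala
import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.typed.{Cluster, Join}

// Requires the akka-cluster-typed module and `akka.actor.provider = cluster`
// (plus remote/seed-node settings) in application.conf.
object ClusteredDistributor {
  def main(args: Array[String]): Unit = {
    val system  = ActorSystem(Behaviors.empty[Nothing], "log-distributor")
    val cluster = Cluster(system)
    // Self-join only bootstraps a single-node cluster for local testing;
    // a real deployment would join the configured seed nodes instead.
    cluster.manager ! Join(cluster.selfMember.address)
  }
}
```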

Dynamic Scaling of Worker Actors

The number of worker actors could be adjusted dynamically based on load and the number of concurrent users. The first step would be to study how load affects system performance and what effect the size of the worker pool has on throughput and latency. With that understanding, a mechanism could be implemented that monitors metrics such as CPU usage, memory usage, and response times and scales the number of workers up or down accordingly. This would help the system handle varying loads efficiently without overloading any single worker actor.
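
As a sketch of such a mechanism, a hypothetical manager actor could grow or shrink a pool of worker actors based on a reported load metric; the message names, thresholds, and pool bounds below are illustrative assumptions, not measured values.

```scala
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// Illustrative messages: the real project would report whichever load metric
// it already measures (mailbox/queue depth, response time, CPU, ...).
sealed trait ScalerCommand
final case class LoadReport(queueDepth: Int) extends ScalerCommand

object WorkerScaler {
  def apply[T](workerBehavior: Behavior[T], min: Int = 2, max: Int = 16): Behavior[ScalerCommand] =
    Behaviors.setup { context =>
      def running(workers: List[ActorRef[T]]): Behavior[ScalerCommand] =
        Behaviors.receiveMessage {
          case LoadReport(depth) if depth > 100 && workers.size < max =>
            // Scale out while the backlog keeps growing (threshold is a placeholder)
            running(context.spawnAnonymous(workerBehavior) :: workers)
          case LoadReport(depth) if depth < 10 && workers.size > min =>
            // Scale in once the backlog has drained
            context.stop(workers.head)
            running(workers.tail)
          case _ =>
            Behaviors.same
        }
      running(List.fill(min)(context.spawnAnonymous(workerBehavior)))
    }
}
```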

Better Management of Shared State

Approach 1

Using CRDTs for the distributor instead of a singleton shared state

The current implementation of the distributor uses a singleton shared state to manage the registered analysers and their health status. This approach can become a bottleneck and a single point of failure. Using Conflict-free Replicated Data Types (CRDTs) instead would allow for more distributed and resilient state management: CRDTs let multiple nodes update the state independently while guaranteeing eventual consistency, which is ideal for a distributed system like the log distributor. A CRDT would be a particularly good fit for maintaining the traffic-routing information.
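
A library-independent sketch of the idea: a last-writer-wins map whose merge function is commutative, associative, and idempotent, so replicas of the routing table converge without coordination. Akka and Pekko Distributed Data ship production-ready equivalents such as LWWMap; the types below are illustrative only.

```scala
// Sketch of a last-writer-wins map that could hold the analyser routing table.
final case class LwwEntry[V](value: V, timestamp: Long, nodeId: String)

final case class LwwMap[K, V](entries: Map[K, LwwEntry[V]] = Map.empty[K, LwwEntry[V]]) {

  def put(key: K, value: V, timestamp: Long, nodeId: String): LwwMap[K, V] =
    copy(entries = entries.updated(key, LwwEntry(value, timestamp, nodeId)))

  // Merge is commutative, associative and idempotent, so any two replicas
  // that have seen the same updates converge to the same state.
  def merge(other: LwwMap[K, V]): LwwMap[K, V] =
    LwwMap((entries.keySet ++ other.entries.keySet).iterator.map { key =>
      val winner = (entries.get(key), other.entries.get(key)) match {
        case (Some(a), Some(b)) => if (newer(a, b)) a else b
        case (Some(a), None)    => a
        case (None, b)          => b.get // key must exist in `other`
      }
      key -> winner
    }.toMap)

  private def newer(a: LwwEntry[V], b: LwwEntry[V]): Boolean =
    a.timestamp > b.timestamp ||
      (a.timestamp == b.timestamp && a.nodeId.compareTo(b.nodeId) >= 0) // deterministic tie-break
}
```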

Using an independent failure and recovery detector per worker

A CRDT would not be a good fit for failure and recovery detection of the analysers, however. Instead, each worker actor should run its own failure and recovery detector. This allows more granular control over health monitoring and reduces the risk of false positives or negatives in health checks: each worker independently determines the health status of its assigned analyser, improving overall system reliability.
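
A minimal Akka Typed sketch of a worker that schedules its own periodic health probe; the probe function, message names, and 5-second interval are assumptions for illustration, not the project's actual health-check logic.

```scala
import scala.concurrent.duration._
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

// `probe` stands in for the worker's own HTTP/gRPC health call to its analyser
// (a real implementation would run the call asynchronously and pipe the result
// back to the actor instead of blocking in the message handler).
sealed trait WorkerCommand
case object CheckAnalyserHealth extends WorkerCommand

object AnalyserWorker {
  def apply(analyserUrl: String, probe: String => Boolean): Behavior[WorkerCommand] =
    Behaviors.withTimers { timers =>
      // Each worker schedules its own checks, so one slow analyser cannot
      // distort the health view the other workers have of theirs.
      timers.startTimerWithFixedDelay(CheckAnalyserHealth, 5.seconds)
      monitoring(analyserUrl, probe, healthy = true)
    }

  private def monitoring(url: String, probe: String => Boolean, healthy: Boolean): Behavior[WorkerCommand] =
    Behaviors.receiveMessage { case CheckAnalyserHealth =>
      monitoring(url, probe, healthy = probe(url)) // worker-local verdict, no shared registry
    }
}
```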

Approach 2

Not using any shared state at all

An alternative approach is to avoid shared state altogether. Each worker actor would maintain its own local state about the analysers it interacts with, eliminating the centralised singleton state and reducing contention between actors. Each worker would be responsible for its own health checks and routing decisions, allowing for greater flexibility and scalability. If every worker applies the configured distribution ratio to its own share of the traffic, the aggregate distribution of packets across the analysers remains balanced.

This approach needs independent failure and recovery detection per worker as well, similar to the CRDT approach. Each worker would need to implement its own logic for detecting analyser failures and recovering from them, ensuring that the system remains responsive and resilient.
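
A sketch of how a worker could make ratio-respecting routing decisions purely from its own local counters; the analyser IDs, weights, and deficit heuristic are illustrative assumptions rather than the project's implementation.

```scala
import scala.collection.mutable

// Per-worker routing with no shared state: each worker tracks only the packets
// it has forwarded itself. If every worker applies the same weights, the
// aggregate split across analysers still converges to the configured ratio.
final case class Analyser(id: String, weight: Double) // weights are fractions summing to 1.0

final class LocalRouter(analysers: Seq[Analyser]) {
  private val sent = mutable.Map(analysers.map(_.id -> 0L): _*)

  /** Pick the healthy analyser whose observed share lags furthest behind its weight. */
  def pick(healthy: Set[String]): Option[Analyser] = {
    val candidates = analysers.filter(a => healthy(a.id))
    val total      = math.max(1L, candidates.map(a => sent(a.id)).sum).toDouble
    val chosen     = candidates.maxByOption(a => a.weight - sent(a.id) / total)
    chosen.foreach(a => sent(a.id) += 1)
    chosen
  }
}
```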

FastAPI with Multiple Workers

The FastAPI analysers currently run with a single worker process, because they keep in-memory state about the packets they have processed and their health status. Under high load this becomes a performance bottleneck, as all requests are handled by one worker. Running multiple workers would allow better parallelism and higher throughput; to make that possible, the state would have to move into shared storage (e.g. Redis or a database) or a distributed cache so that all workers see the same information. The FastAPI service could then handle more requests concurrently while still maintaining the necessary state.

Comprehensive Testing Suite

To ensure the robustness of the system, a more comprehensive testing suite is needed. This should include:

Unit Tests for Individual Components

Unit tests should be implemented for the individual components of the system, including the distributor's routing and health-check logic, the worker actors, the analyser endpoints, and the frontend's API client; a sketch of one such test follows below.
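
For example, the per-worker failure detection logic can be unit-tested in isolation. The sketch below assumes ScalaTest and uses a hypothetical ConsecutiveFailureDetector defined inline so the example is self-contained; it is not the project's actual component.

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical component, defined inline only for the example: an analyser is
// considered unhealthy after `threshold` consecutive failed probes.
final case class ConsecutiveFailureDetector(threshold: Int, failures: Int = 0) {
  def record(success: Boolean): ConsecutiveFailureDetector =
    if (success) copy(failures = 0) else copy(failures = failures + 1)
  def healthy: Boolean = failures < threshold
}

class FailureDetectorSpec extends AnyFunSuite {

  test("an analyser is marked unhealthy only after three consecutive failures") {
    val afterTwo = ConsecutiveFailureDetector(threshold = 3).record(false).record(false)
    assert(afterTwo.healthy)
    assert(!afterTwo.record(false).healthy)
  }

  test("a successful probe resets the failure count") {
    val recovered = ConsecutiveFailureDetector(threshold = 3)
      .record(false).record(false).record(true).record(false)
    assert(recovered.healthy)
  }
}
```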

Integration Tests for End-to-End Scenarios

Integration tests could use the existing run scripts to drive a full simulation, collect the logs from all the services, and then check the collected logs for correctness. An example of this testing strategy was used in my previous project, Distributed Algorithms.

Possible implementation architecture: a test harness starts the distributor, the analysers, and the log emitter via the run scripts, gathers their logs into one place, and then runs assertions over them. This may be more complicated than strictly necessary, but it exercises the whole pipeline end to end.

Testing for log correctness means checking properties that must hold across the combined logs: for example, that every packet the emitter sent was received by exactly one analyser, that no packets were lost or duplicated, and that the observed split between analysers matches the configured distribution ratio.
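
A sketch of such a correctness check, assuming purely for illustration that the emitter logs lines of the form "SENT <packetId>" and each analyser logs "RECEIVED <packetId>"; the real log format and file locations would come from the run scripts.

```scala
import scala.io.Source

// Reads the emitter log and the analyser logs passed as arguments and checks
// that every sent packet id was received exactly once across all analysers.
object LogCorrectnessCheck {

  private def ids(path: String, tag: String): Seq[String] = {
    val src = Source.fromFile(path)
    try src.getLines().filter(_.startsWith(tag)).map(_.stripPrefix(tag).trim).toList
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val sent     = ids("logs/emitter.log", "SENT")
    val received = args.toList.flatMap(path => ids(path, "RECEIVED")) // analyser log paths

    val receivedCounts = received.groupBy(identity).view.mapValues(_.size).toMap
    val lost           = sent.filterNot(receivedCounts.contains)
    val duplicated     = receivedCounts.collect { case (id, n) if n > 1 => id }

    assert(lost.isEmpty, s"packets never received: ${lost.take(5)}")
    assert(duplicated.isEmpty, s"packets received more than once: ${duplicated.take(5)}")
    println(s"OK: ${sent.size} packets sent, each received exactly once")
  }
}
```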

Apache Pekko Instead of Akka (Solves the licensing issue)

The project currently uses Akka HTTP for the distributor service. However, Akka changed its license from the open-source Apache License 2.0 to the source-available Business Source License, which may not be suitable for all users. Apache Pekko is a fork of Akka that retains the Apache License 2.0, making it a more suitable choice for future development. Migrating to Pekko would keep the project aligned with open source principles.
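
The migration itself is mostly a dependency swap and a package rename, roughly as sketched below for build.sbt; the version numbers are illustrative and should be checked against the current Pekko releases.

```scala
// build.sbt sketch: swapping the Akka artifacts for their Apache Pekko forks.
// Version numbers are illustrative; check the current Pekko releases.
val PekkoVersion     = "1.0.2"
val PekkoHttpVersion = "1.0.1"

libraryDependencies ++= Seq(
  "org.apache.pekko" %% "pekko-actor-typed" % PekkoVersion,
  "org.apache.pekko" %% "pekko-stream"      % PekkoVersion,
  "org.apache.pekko" %% "pekko-http"        % PekkoHttpVersion
)

// In the source code the migration is mostly a package rename, e.g.
//   import akka.http.scaladsl.server.Directives._
// becomes
//   import org.apache.pekko.http.scaladsl.server.Directives._
// and configuration keys move from the akka.* to the pekko.* namespace.
```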