This document outlines potential improvements and future work for the log distributor project. Given more time, the following changes could be made to address the identified issues and enhance the overall system.
Special consideration was given to using gRPC instead of HTTP for communication between the log emitter, the distributor, and the analysers, but due to the added implementation complexity of Akka gRPC, I chose to stick with REST over HTTP for now. gRPC offers better performance and type safety than HTTP, especially for high-throughput applications. The log packets could be defined using Protocol Buffers, allowing more efficient serialization and deserialization, and enabling better integration with other gRPC-based systems such as microservices architectures. All three services (distributor, analysers, and frontend) could use the autogenerated gRPC stubs for communication, reducing the boilerplate needed for HTTP handling.
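As a rough sketch, the log packet and the analyser-facing RPC could be described in a Protocol Buffers file like the following. All field, message, and service names here are assumptions; the real schema would mirror the existing HTTP payload:

```protobuf
syntax = "proto3";

package logdistributor;

// Hypothetical log packet; fields would come from the current JSON payload
message LogPacket {
  int64 id = 1;
  int64 timestamp = 2;
  string level = 3;
  string message = 4;
}

message AnalyseAck {
  bool accepted = 1;
}

// The distributor would call this instead of POSTing JSON over HTTP
service Analyser {
  rpc Analyse(LogPacket) returns (AnalyseAck);
}
```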
The current implementation of the distributor service is a single instance. For better scalability and fault tolerance, it would be beneficial to implement a clustered setup using Akka Cluster or Apache Pekko Cluster. This would allow multiple instances of the distributor to run in parallel, distributing the load and providing redundancy in case of failures.
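A minimal sketch of what joining such a cluster could look like with Akka Typed. The system name is illustrative, and a real deployment would use configured seed nodes or Cluster Bootstrap rather than a self-join:

```scala
import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.typed.{Cluster, Join}

object DistributorNode extends App {
  // Each distributor instance starts its own actor system in the same cluster
  val system = ActorSystem(Behaviors.empty[Nothing], "distributor-cluster")
  val cluster = Cluster(system)

  // Self-join is only sensible for the first node; in practice the seed
  // nodes would come from configuration or Akka Cluster Bootstrap
  cluster.manager ! Join(cluster.selfMember.address)
}
```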
Dynamic scaling of worker actors is another potential improvement: adjusting the number of worker actors based on the load and the number of concurrent users. This would first require studying how load affects the performance of the system and what effect the worker pool size has on it. With that understanding established, a mechanism could be implemented to scale the number of workers dynamically based on the current load and user count, for example by monitoring metrics such as CPU usage, memory usage, and response times, and adjusting the pool accordingly. This would help the system handle varying loads efficiently without overloading any single worker actor.
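Akka's classic routing layer already ships a pressure-based resizer that approximates this idea; a minimal sketch, assuming a hypothetical Worker actor and illustrative pool bounds:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.{DefaultResizer, RoundRobinPool}

// Stand-in for the project's real worker actor
class Worker extends Actor {
  def receive: Receive = {
    case _ => () // route the packet to an analyser (omitted)
  }
}

object DynamicWorkers extends App {
  val system = ActorSystem("distributor")

  // Grow the pool up to 16 workers when mailboxes back up, shrink when idle
  val resizer = DefaultResizer(lowerBound = 2, upperBound = 16)
  val workers = system.actorOf(
    RoundRobinPool(nrOfInstances = 2, resizer = Some(resizer)).props(Props[Worker]()),
    "workers")
}
```

A production mechanism would likely replace the mailbox-pressure heuristic with the CPU, memory, and response-time metrics mentioned above.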
The current implementation of the distributor uses a singleton shared state to manage the registered analysers and their health status. This approach can lead to bottlenecks and a single point of failure. Using Conflict-free Replicated Data Types (CRDTs) instead would allow for more distributed and resilient state management: CRDTs let multiple nodes update the state independently while guaranteeing eventual consistency, which is ideal for a distributed system like the log distributor. A CRDT would be a particularly good fit for maintaining the traffic-routing information.
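Akka's Distributed Data module provides replicated CRDTs out of the box; a minimal sketch that keeps per-analyser traffic weights in a replicated LWWMap (the key and entry names are illustrative):

```scala
import akka.actor.ActorSystem
import akka.cluster.ddata.{DistributedData, LWWMap, LWWMapKey, SelfUniqueAddress}
import akka.cluster.ddata.Replicator._

// Assumes the system is configured with akka.actor.provider = "cluster"
object RoutingState extends App {
  val system = ActorSystem("distributor-cluster")
  implicit val node: SelfUniqueAddress = DistributedData(system).selfUniqueAddress
  val replicator = DistributedData(system).replicator

  // Replicated map of analyser id -> traffic weight (names are illustrative)
  val RoutingKey = LWWMapKey[String, Double]("analyser-weights")

  // Local update; the replicator gossips it to the other distributor nodes
  replicator ! Update(RoutingKey, LWWMap.empty[String, Double], WriteLocal) { map =>
    map :+ ("analyser-1" -> 0.5)
  }
}
```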
A CRDT would not, however, be a good choice for the failure and recovery detection of analysers. Instead, each worker actor should have its own failure and recovery detector. This allows more granular control over the health monitoring of analysers and reduces the risk of false positives or negatives in health checks: each worker can independently determine the health status of its assigned analyser, improving overall system reliability.
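A minimal sketch of such a per-worker detector as a timer-driven typed actor. The health endpoint and probe mechanics are assumptions, and the actual HTTP call is elided:

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration._

// Hypothetical per-worker health monitor for one assigned analyser
object HealthMonitor {
  sealed trait Command
  private case object Probe extends Command
  final case class ProbeResult(healthy: Boolean) extends Command

  def apply(analyserUrl: String): Behavior[Command] =
    Behaviors.withTimers { timers =>
      // Poll the assigned analyser on a fixed schedule
      timers.startTimerAtFixedRate(Probe, 5.seconds)
      monitor(analyserUrl, healthy = true)
    }

  private def monitor(url: String, healthy: Boolean): Behavior[Command] =
    Behaviors.receiveMessage {
      case Probe =>
        // GET url + "/health" and pipe the outcome back as ProbeResult (omitted)
        Behaviors.same
      case ProbeResult(ok) =>
        // A state change could be reported to the owning worker here
        monitor(url, ok)
    }
}
```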
An alternative approach is to avoid shared state altogether: each worker actor maintains its own local state about the analysers it interacts with. This eliminates the need for a centralised singleton state and reduces contention between actors. Each worker is then responsible for its own health checks and routing decisions, allowing greater flexibility and scalability. The routing ratio, if maintained properly by every worker, would still ensure that packets are distributed evenly across the analysers.
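A minimal sketch of this per-worker local state, where each worker picks a healthy analyser with probability proportional to its weight (all names are illustrative):

```scala
import scala.util.Random

// Illustrative per-worker routing state; each worker actor owns its own copy
final case class AnalyserState(url: String, healthy: Boolean, weight: Double)

final class LocalRouter(private var analysers: Vector[AnalyserState]) {

  // Mark an analyser up or down after the worker's own health check
  def setHealth(url: String, healthy: Boolean): Unit =
    analysers = analysers.map(a => if (a.url == url) a.copy(healthy = healthy) else a)

  // Pick a healthy analyser with probability proportional to its weight
  def pick(): Option[AnalyserState] = {
    val healthy = analysers.filter(_.healthy)
    if (healthy.isEmpty) None
    else {
      val r = Random.nextDouble() * healthy.map(_.weight).sum
      var acc = 0.0
      healthy.find { a => acc += a.weight; acc >= r }
    }
  }
}
```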
As with the CRDT-based design, this shared-nothing approach also needs independent failure and recovery detection per worker. Each worker would implement its own logic for detecting analyser failures and recovering from them, keeping the system responsive and resilient.
The FastAPI analysers currently run with a single worker each, because they keep in-process state about the packets they have processed and their health status. Under high load this becomes a performance bottleneck, since all requests are handled by one worker. Running multiple workers would allow better parallelism and improved throughput: the workers could keep their state in shared storage (e.g., Redis or a database) or a distributed cache, letting the FastAPI service handle more requests concurrently while still maintaining the necessary state.
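A minimal sketch of the analyser side, in Python since that is what the FastAPI services are written in. Endpoint and key names are illustrative; the idea is simply that a processed-packet counter lives in Redis so several workers can share it:

```python
from fastapi import FastAPI
import redis

app = FastAPI()
store = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.post("/analyse")
def analyse(packet: dict):
    # A counter in Redis is safe across workers; an in-process dict is not
    processed = store.incr("packets_processed")
    return {"processed": processed}

@app.get("/health")
def health():
    return {"status": "ok", "processed": int(store.get("packets_processed") or 0)}
```

The service could then be started with several workers, for example `uvicorn analyser:app --workers 4`, without the counters diverging between processes.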
To ensure the robustness of the system, a more comprehensive testing suite is needed. This should include:
- Unit tests for the individual components of the system, such as the distributor's routing logic, the worker actors, and the analyser health checks (see the sketch after this list).
- Integration tests using run scripts for the simulation, collecting the logs from all the services, and then checking those logs for correctness. An example of this testing strategy was used in my previous project, Distributed Algorithms.
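As a minimal sketch of the unit-test side, using Akka's ActorTestKit with ScalaTest (the message type and worker behaviour are stand-ins for the project's real protocol):

```scala
import akka.actor.testkit.typed.scaladsl.ActorTestKit
import akka.actor.typed.ActorRef
import akka.actor.typed.scaladsl.Behaviors
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Stand-in for the project's real log packet message
final case class LogPacket(id: Long, body: String)

class WorkerSpec extends AnyFunSuite with BeforeAndAfterAll {
  private val testKit = ActorTestKit()
  override def afterAll(): Unit = testKit.shutdownTestKit()

  // A trivial worker that forwards every packet to its assigned analyser
  private def worker(analyser: ActorRef[LogPacket]) =
    Behaviors.receiveMessage[LogPacket] { p => analyser ! p; Behaviors.same }

  test("a worker forwards packets to its analyser unchanged") {
    val analyser = testKit.createTestProbe[LogPacket]("analyser")
    val ref = testKit.spawn(worker(analyser.ref))
    ref ! LogPacket(1L, "hello")
    analyser.expectMessage(LogPacket(1L, "hello"))
  }
}
```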
Finally, a change that might be too complicated to tackle right away, but is worth considering:
The project currently uses Akka HTTP for the distributor service. However, Akka has changed its licensing model from open source to a commercial Business Source License, which may not be suitable for all users. Apache Pekko is a fork of Akka that retains the original Apache 2.0 open source license, making it a more suitable choice for future development. Migrating to Pekko would ensure continued compatibility with open source principles.
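The migration itself is largely mechanical, since Pekko preserves Akka's APIs under a new package prefix; a sketch of the build and import changes (version numbers are illustrative, so check the current Pekko releases):

```scala
// build.sbt: replacing Akka artifacts with their Pekko equivalents
libraryDependencies ++= Seq(
  "org.apache.pekko" %% "pekko-actor-typed" % "1.0.2",
  "org.apache.pekko" %% "pekko-stream"      % "1.0.2",
  "org.apache.pekko" %% "pekko-http"        % "1.0.1"
)

// In source files the package prefix changes accordingly, e.g.
// import akka.http.scaladsl.Http          becomes
// import org.apache.pekko.http.scaladsl.Http
```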