Distributed Systems
Career Paths
How to interpret this table?
You may choose this advanced topic if you enjoy doing the things listed under “They usually do” and are fine with not doing the things listed under “They usually do NOT do”.
Alternatively, you may choose it if you are interested in applying for the listed job roles and want to practice work that is close to those roles.
| Job title | They usually do | They usually do NOT do | Real-life examples |
|---|---|---|---|
| Distributed Systems Engineer | Design and implement systems running on multiple machines, handle communication, partial failures, scaling, and consistency | Build single-process applications, focus only on local logic without networking concerns | Distributed chat system, microservice-based backend |
| Cloud / Platform Engineer | Deploy and operate services across multiple nodes, use containers and virtualization, handle scaling and monitoring | Design UI/UX or single-device applications | Kubernetes-based services, cloud-hosted APIs |
| Software Engineer (distributed focus) | Implement service-to-service communication, event-driven or reactive logic, measure performance under load | Write purely sequential, single-machine programs | Event-driven processing pipelines, service meshes |
Affected SDLC Phases
If a team chooses this advanced topic, the system design, implementation, testing, and deployment phases are most strongly affected. Design and planning must explicitly address distribution, communication, and failure modes. Testing focuses on performance, scalability, and user-visible system behavior under load. Deployment includes containerization, multi-instance execution, and basic cloud or cluster setup.
Affected Tasks
Features are defined
Minimum Viable Product (MVP)
By the end of this task, your team has defined at least six distributed-system features that require multiple running instances. Each feature must clearly explain which parts of the system run on different machines and how they communicate.
Technical Details
Before defining features, the team must choose a distributed paradigm (e.g. microservices, event-driven, reactive, actor-based) and state this clearly in README.md. This choice applies to all features.
Use a consistent feature issue template; a hypothetical example is sketched after this list. Each feature must include:
- Components or services involved
- Communication method (HTTP, messaging, events, streams, etc.)
- Expected behavior when multiple instances run simultaneously
- Done criteria observable at system level
Avoid vague items like “make it distributed” or “add cloud support”.
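A hypothetical example of a well-formed feature (the service names, endpoints, and behavior are illustrative, not prescribed):
- Feature: order placement across services, where an order-service accepts orders and notifies an inventory-service
- Components or services involved: order-service, inventory-service
- Communication method: HTTP POST from order-service to inventory-service
- Expected behavior when multiple instances run simultaneously: two order-service instances accept orders concurrently and the inventory count stays consistent
- Done criteria: placing orders through either instance yields the same inventory total, observable via a status endpoint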
Quality
High-quality feature definitions make distribution explicit and understandable. Communication paths are clear, responsibilities are separated, and the chosen paradigm is applied consistently. Features are realistic, testable, and require true multi-instance execution.
System is working
Minimum Viable Product (MVP)
By the end of this task, your team demonstrates a working distributed system running on at least two instances on different machines simultaneously. The demo must show real interaction between instances and correct system behavior.
Technical Details
The demo must include:
- At least two running instances on different computers (or clearly separated virtual machines/containers)
- Real communication between instances
- A user-visible or measurable outcome of that interaction
Instances may run locally, in containers, or in the cloud, but they must be independent processes. Startup and demo steps must be documented in README.md.
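A minimal sketch of what such a demo could look like, assuming plain HTTP and only the Python standard library; the ports, the single-peer setup, and the /count and /total endpoints are illustrative choices, not requirements:

```python
# instance.py: a sketch of one instance; run one copy per machine or container.
# Illustrative usage: python instance.py 8000 http://other-host:8000
import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

PORT = int(sys.argv[1])   # port this instance listens on
PEER = sys.argv[2]        # base URL of the other instance (assumes exactly one peer)
local_hits = 0            # simple per-instance state

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global local_hits
        if self.path == "/count":
            # Report how many requests this instance has served so far.
            local_hits += 1
            body = {"port": PORT, "local_hits": local_hits}
        elif self.path == "/total":
            # Real inter-instance communication: ask the peer for its count
            # and combine it with ours, a measurable outcome of the interaction.
            local_hits += 1
            peer_hits = json.load(urlopen(f"{PEER}/count"))["local_hits"]
            body = {"combined_hits": local_hits + peer_hits}
        else:
            self.send_error(404)
            return
        data = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Starting one copy on each machine and requesting /total from either of them shows real communication between independent processes and a user-visible combined result; the same idea carries over to whichever paradigm and transport the team actually chose.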
Quality
High-quality demos clearly show that the system is truly distributed. The system behaves correctly when scaled up or down (e.g. adding or removing instances). Extra quality comes from showing basic resilience, clear logs, and understandable system behavior.
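One inexpensive way to make multi-instance behavior understandable is to tag every log line with an instance identifier. A minimal sketch using Python's standard logging module; the INSTANCE_ID environment variable is an illustrative convention, not a requirement:

```python
import logging
import os
import socket

# Identify this instance; fall back to hostname and PID if no explicit ID is set.
INSTANCE_ID = os.environ.get("INSTANCE_ID", f"{socket.gethostname()}-{os.getpid()}")

logging.basicConfig(
    level=logging.INFO,
    format=f"%(asctime)s [{INSTANCE_ID}] %(levelname)s %(message)s",
)
log = logging.getLogger("demo")

log.info("instance started")
log.info("handling request forwarded by peer")
```

When several instances write to a shared console or log aggregator, the tag makes it obvious which instance handled which request while scaling up or down.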
Bug fixing, performance and scalability
Minimum Viable Product (MVP)
During development, your team must report and fix one reproducible distributed-system issue that appears in a multi-instance setup. The issue must be demonstrated under load or stress (e.g. increased concurrency, higher request rate, or scaling up instances) and shown to improve after a light fix or configuration change.
Technical Details
The issue must be identified using a performance or stress test.
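A minimal load-test sketch, assuming the system exposes an HTTP endpoint; the URL, request count, and concurrency level are placeholder values. It records latency, throughput, and error rate, the metrics asked for below:

```python
# loadtest.py: fire N concurrent requests and report latency, throughput, and errors.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://localhost:8000/total"   # assumption: the endpoint under test
REQUESTS = 200                        # illustrative load level
CONCURRENCY = 20                      # illustrative number of parallel clients

def one_request(_):
    start = time.perf_counter()
    try:
        urlopen(URL, timeout=5).read()
        return time.perf_counter() - start, None
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        return time.perf_counter() - start, exc

started = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(REQUESTS)))
elapsed = time.perf_counter() - started

latencies = sorted(lat for lat, err in results if err is None)
errors = sum(1 for _, err in results if err is not None)

print(f"throughput : {len(latencies) / elapsed:.1f} req/s")
print(f"error rate : {errors / REQUESTS:.1%}")
if latencies:
    print(f"median lat : {statistics.median(latencies) * 1000:.1f} ms")
    print(f"p95 lat    : {latencies[int(0.95 * len(latencies))] * 1000:.1f} ms")
```

Running the identical script against the pre-fix and post-fix deployments produces directly comparable before and after numbers for the report.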
The report must include:
- System configuration (number of instances, environment, deployment method)
- Load or stress scenario used to expose the problem
- Measured metrics before the fix (e.g. response time, throughput, error rate)
- Observed failure or degradation (e.g. timeouts, errors, uneven load, race conditions)
After applying a fix (code change, configuration tweak, or scaling adjustment), re-run the exact same test and document:
- Measured metrics after the fix
- Clear evidence of improvement or stabilization
The fix should be lightweight and realistic (not a full redesign). Tests and results must be reproducible and documented.
Quality
High-quality work shows a clear link between load, observed system-level problems, and the applied fix. Metrics are meaningful and compared before vs after. The team explains why the issue occurs in a distributed setup and how the fix mitigates it. The solution improves behavior under load without breaking single-instance execution or other features.