Distributed Systems

Career Paths

How to interpret this table?

You may choose this advanced topic if you like doing the things listed under “they usually do”, and you are fine with not doing the things listed under “they usually do not do”.

Alternatively, you may choose it if you are interested in applying for the listed job roles and want to practice work that is close to those roles.

| Job title | They usually do | They usually do NOT do | Real-life examples |
|---|---|---|---|
| Distributed Systems Engineer | Design and implement systems running on multiple machines; handle communication, partial failures, scaling, and consistency | Build single-process applications; focus only on local logic without networking concerns | Distributed chat system, microservice-based backend |
| Cloud / Platform Engineer | Deploy and operate services across multiple nodes; use containers and virtualization; handle scaling and monitoring | Design UI/UX or single-device applications | Kubernetes-based services, cloud-hosted APIs |
| Software Engineer (distributed focus) | Implement service-to-service communication and event-driven or reactive logic; measure performance under load | Write purely sequential, single-machine programs | Event-driven processing pipelines, service meshes |

Affected SDLC Phases

If a team chooses this advanced topic, the system design, implementation, testing, and deployment phases are affected most strongly. Planning must explicitly address distribution, communication, and failure modes. Testing focuses on performance, scalability, and the behavior end users observe under load. Deployment includes containerization, multi-instance execution, and a basic cloud or cluster setup.

Affected Tasks

Features are defined

Minimum Viable Product (MVP)

By the end of this task, your team has defined at least six distributed-system features that require multiple running instances. Each feature description must clearly state which parts of the system run on different machines and how they communicate.

Technical Details

Before defining features, the team must choose a distributed paradigm (e.g. microservices, event-driven, reactive, actor-based) and state this clearly in README.md. This choice applies to all features.
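
To make the paradigm choice concrete, the sketch below shows roughly what a single event-driven component could look like. It is only an illustration, not a requirement: the port, the one-JSON-event-per-line wire format, and the printed output are assumptions made for this example.

```python
# Minimal event-driven sketch (illustrative only): one instance consumes
# events that producer instances send as one JSON object per line over TCP.
import json
import socketserver

class EventHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Each line received from a producer instance is treated as one event.
        for line in self.rfile:
            event = json.loads(line)
            print(f"received event {event.get('type')!r} from {self.client_address}")

if __name__ == "__main__":
    # Producer instances on other machines connect to this port and emit events.
    with socketserver.TCPServer(("0.0.0.0", 9000), EventHandler) as server:
        server.serve_forever()
```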

Use a consistent feature issue template. Each feature must include:
- Components or services involved
- Communication method (HTTP, messaging, events, streams, etc.)
- Expected behavior when multiple instances run simultaneously
- Done criteria observable at system level

Avoid vague items like “make it distributed” or “add cloud support”.

Quality

High-quality feature definitions make distribution explicit and understandable. Communication paths are clear, responsibilities are separated, and the chosen paradigm is applied consistently. Features are realistic, testable, and require true multi-instance execution.

System is working

Minimum Viable Product (MVP)

By the end of this task, your team demonstrates a working distributed system with at least two instances running on different machines simultaneously. The demo must show real interaction between the instances and correct system behavior.

Technical Details

The demo must include:
- At least two running instances on different computers (or clearly separated virtual machines/containers)
- Real communication between instances
- A user-visible or measurable outcome of that interaction

Instances may run locally, in containers, or in the cloud, but they must be independent processes. Startup and demo steps must be documented in README.md.
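
A minimal sketch of what one such instance could look like is shown below, assuming a plain-HTTP setup in which each instance learns its peer's address from a PEER_URL environment variable (the variable name and the /ping path are assumptions for this example, not part of the task):

```python
# Minimal sketch of one instance: answers /ping itself and, for any other
# path, calls its peer instance so the demo shows real cross-instance traffic.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

PEER_URL = os.environ.get("PEER_URL")        # e.g. http://other-machine:8000
PORT = int(os.environ.get("PORT", "8000"))

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            body = {"instance": PORT, "status": "ok"}
        else:
            # Forward to the peer instance and include its reply in ours.
            with urlopen(f"{PEER_URL}/ping", timeout=2) as resp:
                body = {"instance": PORT, "peer_reply": json.load(resp)}
        data = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Starting this script on two machines, each with PEER_URL pointing at the other, and requesting any path other than /ping on either instance would return a response that visibly contains the peer's reply, which is the kind of user-visible outcome of cross-instance interaction the demo asks for.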

Quality

High-quality demos clearly show that the system is truly distributed. The system behaves correctly when scaled up or down (e.g. adding or removing instances). Extra quality comes from showing basic resilience, clear logs, and understandable system behavior.

Bug fixing, performance and scalability

Minimum Viable Product (MVP)

During development, your team must report and fix one reproducible distributed-system issue that appears in a multi-instance setup. The issue must be demonstrated under load or stress (e.g. increased concurrency, higher request rate, or scaling up instances), and the system must be shown to improve after a lightweight fix or configuration change.

Technical Details

The issue must be identified using a performance or stress test.

The report must include:
- System configuration (number of instances, environment, deployment method)
- Load or stress scenario used to expose the problem
- Measured metrics before the fix (e.g. response time, throughput, error rate)
- Observed failure or degradation (e.g. timeouts, errors, uneven load, race conditions)

After applying a fix (code change, configuration tweak, or scaling adjustment), re-run the exact same test and document:
- Measured metrics after the fix
- Clear evidence of improvement or stabilization

The fix should be lightweight and realistic (not a full redesign). Tests and results must be reproducible and documented.
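
A reproducible stress test does not need heavy tooling; a small script that fires concurrent requests and records latency and error rate is often enough. The sketch below assumes an HTTP endpoint; the URL, request count, and concurrency level are placeholders to be replaced by the scenario documented in the report.

```python
# Minimal load-test sketch: send REQUESTS concurrent GETs and report the
# error rate plus median and p95 latency, before and after the fix.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.error import URLError
from urllib.request import urlopen

URL = "http://localhost:8000/chain"   # endpoint under test (placeholder)
REQUESTS = 500
CONCURRENCY = 50

def call_once(_):
    start = time.perf_counter()
    try:
        with urlopen(URL, timeout=5):
            pass
        return time.perf_counter() - start, True
    except (URLError, OSError):
        return time.perf_counter() - start, False

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(call_once, range(REQUESTS)))
    latencies = sorted(t for t, ok in results if ok)
    errors = sum(1 for _, ok in results if not ok)
    print(f"requests: {REQUESTS}, errors: {errors}")
    if latencies:
        p95 = latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)]
        print(f"median latency: {latencies[len(latencies) // 2] * 1000:.1f} ms")
        print(f"p95 latency:    {p95 * 1000:.1f} ms")
```

Running the same script with identical parameters before and after the change yields the before/after metrics that the report requires.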

Quality

High-quality work shows a clear link between load, observed system-level problems, and the applied fix. Metrics are meaningful and compared before vs after. The team explains why the issue occurs in a distributed setup and how the fix mitigates it. The solution improves behavior under load without breaking single-instance execution or other features.