The Architecture Problem and its Organizational Implications

Feb 12, 2022

The architecture problem is what software engineering leaders think about all the time: how can they keep their teams moving fast with confidence? At the end of the day, this is as valid a business problem as any application feature the development team will ship: it is reflected in both development velocity and the cost of human resources. You don’t need to be an engineering leader to think like one. It is beneficial to consider the architecture problem early in your career so that you pick up relevant techniques, habits, and conceptual models over time.

At the individual level, there are things one can do to become an effective engineer, which is a topic for another time. At the level of an engineering organization, architects are on the hook for guiding the systems so that they remain evolvable over time. Often the processes and structures architects desire and encourage are in conflict with the immediate business need of shipping XYZ feature ASAP, which requires architects to take a stance and protect the desirable structural traits of their systems. A well-structured system missing a feature can always be extended with that feature at relatively low cost, whereas updating an ill-structured system might turn out to be entirely cost-prohibitive, if not simply depressing.

Desired System Traits

An evolvable architecture can adapt to ever-changing reality with relative ease. It is the kind of system that can “withstand the test of time”. Plainly speaking, systems with evolvable architecture survive over time. Of course, if a system needs to be retired, that should be relatively simple to do as well, which suggests the dependents of the retired system should also be evolvable.

To allow eng teams to move fast with confidence, the systems should be independently developable, independently deployable, easy to operate, and easy to maintain.

Independent Developability

In the software development world, there are frequent stories about a feature that one team needs but that is being developed and owned by another team. A common perspective in that case is to wait so as not to waste effort reinventing the wheel. The result is that the folks who need the feature are blocked, sometimes for a few quarters or years, sometimes forever.

Another example: an internal team is developing an in-house database based on their cutting-edge research results, and your team is asked to adopt it. Weeks later you find out this database does not meet the SLO for your usage pattern; improvement requests are made, and then your team waits for quarters for improvements that may or may not land.

Independent developability prevents such frustrating situations through appropriate decoupling. Essentially, one team’s development effort should not be blocked by that of another. This requires separating code for the business domain from code that pulls in dependencies. The business logic won’t know about any of the dependencies, or at least it will know only their interfaces and not their implementations, by leveraging dependency injection. In an object-oriented language like Java, this often shows up as the factory pattern, while in multi-paradigm or functional languages a function interface will suffice.
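As a minimal sketch of that separation, assuming a made-up pricing example (the names RateSource, FixedRates, price_in, and price_with are hypothetical and not from any real codebase), the business logic below depends only on an interface or a plain function, and the concrete dependency is injected by the caller:

```python
from typing import Callable, Protocol


class RateSource(Protocol):
    """Interface the business logic depends on (hypothetical name)."""

    def rate(self, currency: str) -> float: ...


def price_in(currency: str, usd_amount: float, rates: RateSource) -> float:
    """Business logic: it sees only the RateSource interface."""
    return round(usd_amount * rates.rate(currency), 2)


class FixedRates:
    """Stand-in implementation, usable while another team's service is not ready."""

    def __init__(self, table: dict[str, float]) -> None:
        self._table = table

    def rate(self, currency: str) -> float:
        return self._table[currency]


def price_with(lookup: Callable[[str], float], currency: str, usd_amount: float) -> float:
    """Functional variant: the injected dependency is just a function."""
    return round(usd_amount * lookup(currency), 2)
```

With this shape, a team can develop and test price_in against FixedRates today and swap in a client for the real service later without touching the business logic.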

Even if there is an integration between teams, say foo team will use bar team’s API / database, such integration should be protected by an anti-corruption layer. Treat the API you are integrating with as only one of the many options you might choose from, and hide it behind an interface. Ideally this is done in the prototyping phase, where multiple options are explored, although it is not uncommon to do it later in the development lifecycle, when you find out the dependency has fallen short of your expectations and needs to be replaced by something else.
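A rough sketch of such a layer follows. Everything here is an assumption made for illustration: BarClient and its payload shape stand in for whatever bar team actually exposes, and Customer and CustomerDirectory are hypothetical names for foo team's own model and interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Customer:
    """Domain model owned by foo team (hypothetical)."""

    customer_id: str
    email: str


class CustomerDirectory(Protocol):
    """The only view of the dependency that foo's business logic sees."""

    def find(self, customer_id: str) -> Customer: ...


class BarClient:
    """Hypothetical stand-in for bar team's API client and its payload shape."""

    def get_user(self, uid: str) -> dict:
        return {"uid": uid, "contact": {"mail": f"{uid}@example.com"}}


class BarCustomerDirectory:
    """Anti-corruption layer: translates bar's payload into foo's model."""

    def __init__(self, client: BarClient) -> None:
        self._client = client

    def find(self, customer_id: str) -> Customer:
        raw = self._client.get_user(customer_id)
        return Customer(customer_id=raw["uid"], email=raw["contact"]["mail"])


def welcome_line(directory: CustomerDirectory, customer_id: str) -> str:
    """Business logic: unaware of bar's field names or client."""
    return f"Welcome, {directory.find(customer_id).email}"
```

If bar's API later proves unsuitable, only BarCustomerDirectory needs a replacement; welcome_line and the rest of foo's code keep working against CustomerDirectory.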

Independent Deployability

Once the code change is reviewed, approved, and submitted, it needs to be deployed to production. If your team owns the binary that will be released, things are simple. Sometimes this is not the case. For example, you might be implementing a plugin-style API that will be compiled into a single monolithic binary, and your particular plugin is statically linked instead of dynamically linked as a separate package, e.g. DLL / jar files. Another example is implementing a service API and having another team release and maintain the server that actually exposes the service. In these cases, the code owner has limited control over when the latest code will arrive in production. A frequent and robust continuous delivery process will partially mitigate this, yet the asymmetry between ownership and control still exists.

Microservices are a popular solution for this, although using them widely requires an easy-to-use template for common server functionality such as logging, monitoring, permission control, etc. The takeaway is that sometimes, in order to obtain separate deployability, it might be necessary to push decoupling all the way to the service level.

Once source code is split into separate deployment units, it is necessary to ensure that binaries running different versions are compatible with each other. Write code with compatibility in mind: handling data that might be at different versions is generally easier than requiring a coordinated release, where sets of binaries or data only work together when using a specific pairing of versions. It’s essentially a coordination problem, mirroring a distributed transaction.
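One way to picture version-tolerant code is a reader that accepts records written by older and newer binaries alike. The sketch below is illustrative only: the Profile record, its JSON layout, and the "version" field are all assumptions, not a format described in this article.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    first_name: str
    last_name: str


def parse_profile(payload: str) -> Profile:
    """Read a profile record regardless of which binary version wrote it."""
    record = json.loads(payload)
    # Records written before versioning was introduced are treated as v1.
    version = record.get("version", 1)
    if version == 1:
        # Hypothetical v1 layout: a single "name" field, split on the first space.
        first, _, last = record["name"].partition(" ")
        return Profile(first, last)
    # v2 and later: read the known fields and ignore any fields
    # that a newer writer may have added.
    return Profile(record["first_name"], record["last_name"])
```

Because the reader tolerates both older and newer payloads, the writer and the reader can be released in either order instead of in lockstep.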

Easy to Operate

Code eventually becomes programs that run in different places to serve business needs. Someone needs to be responsible for operating these programs. For a team without dedicated SREs or DevOps, the team itself will be in charge of operating the programs it writes. For example, a team owning a search API will need to watch for any trouble the API server runs into, such as too many errors, server crashes, etc.

For a system to be easy to operate, its operational needs should be clearly stated and discoverable. If latency is a requirement, then there should be monitoring and alerts set up for it, and potentially even load tests and staged releases to detect any latency-related issues. New hires should be able to discover and learn about the requirements as part of the team’s processes and documentation, rather than only when an outage takes place.
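To make "clearly stated" concrete, a latency requirement can be written down as a check that anyone on the team can read. The following is only a sketch under assumptions: the function names, the nearest-rank percentile method, and the thresholds are illustrative, and a real setup would normally live in the team's monitoring system rather than in application code.

```python
import math
from typing import Optional, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_alert(samples_ms: Sequence[float], slo_ms: float, pct: float = 99.0) -> Optional[str]:
    """Return an alert message if the chosen percentile exceeds the SLO, else None."""
    observed = percentile(samples_ms, pct)
    if observed > slo_ms:
        return f"ALERT: p{pct:g} latency {observed:g} ms exceeds SLO {slo_ms:g} ms"
    return None
```

The same function could be run against load-test results or the first stage of a staged release, so the requirement is exercised before an outage rather than discovered during one.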

Sometimes the team might not be aware of all the operational needs and gets caught by surprise when a dependency breaks down. This is not uncommon for systems that have grown too large and hard to understand, where developers take a long time to grasp the operational implications of their systems.

Easy to Maintain

Over time, bugs will be discovered and need to be fixed; new feature requests will come up and be implemented. The codebase will undergo constant change. Maintainability can be measured in terms of how easy it is to make these changes in an existing codebase. A codebase that’s easy to maintain makes it really clear where a certain type of change should be made, and ideally there is one and only one such place.

We can imagine a really simple program that calls two expressions: foo() and bar(). Now a new feature request asks for another expression, baz(), to be called. Where should this new call be added? There are at least five possibilities:

Before foo()
Inside foo()
After foo() and before bar()
Inside bar()
After bar()

This process of finding and figuring out the “right” place to make a change is called spelunking. It can be not only difficult and time-consuming but also dangerous: introducing a change in the wrong place can introduce hard-to-understand bugs into your systems.
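A toy sketch of why the choice matters, using a trace list to record call order. The trace parameter, foo_with_baz, and other_caller are hypothetical additions for illustration; only foo(), bar(), and baz() come from the example above. Placing baz() between foo() and bar() affects only this program, while placing it inside foo() silently changes every other caller of foo():

```python
def foo(trace: list[str]) -> None:
    trace.append("foo")


def bar(trace: list[str]) -> None:
    trace.append("bar")


def baz(trace: list[str]) -> None:
    trace.append("baz")


def program(trace: list[str]) -> list[str]:
    foo(trace)
    baz(trace)  # option 3: after foo() and before bar()
    bar(trace)
    return trace


def foo_with_baz(trace: list[str]) -> None:
    # Option 2: baz() inside foo(); every caller of foo() now runs baz() too.
    trace.append("foo")
    baz(trace)


def other_caller(trace: list[str], foo_impl=foo) -> list[str]:
    """Some unrelated code path that also happens to call foo()."""
    foo_impl(trace)
    return trace
```

Here other_caller never asked for baz(), yet with foo_with_baz it gets it anyway, which is the kind of hard-to-trace side effect a misplaced change can cause.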

Organizational Considerations

Most of the desired architectural attributes ask for a system with clear boundaries, i.e. a well-defined but often small architectural quantum. The metaphor of size here mostly refers to the issue of coupling: a system can be rather complex, but the complexity should mostly come from the domain problem the system is aiming to tackle rather than from its dependencies and interactions with other systems. Achieving such system boundaries usually requires having corresponding organizational boundaries in place. However, team structures are usually not as soft as software and tend to lag behind quite a bit; consequently, systems tend to form their own boundaries based on business needs while the organizational structure may look nothing alike.

Conway’s Law

Conway’s Law states:

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.

If we put this into the context of the evolution of a software product, a codebase will usually start as a monolith, as there are only a handful of developers and everyone works together rather closely. Later, as systems get bigger, they get split into smaller components given the business and operational needs, e.g. to have separate releases or to decouple sources of failure. Team structures, however, don’t change as easily, not to mention there are usually other dominating factors that contribute to how teams are structured.

When the team and the systems it owns are not isomorphic, all sorts of problems happen:

People on the same team might start working in silos on totally different things, and team synergy declines
Certain systems will become unowned yet still widely used, especially if it’s hard to pin down a team whose role maps to the system’s role relative to other systems
Any development effort might be meddled with by a lot of folks with different opinions, as there are no clear system boundaries or integration points, often leaving it undefined who the stakeholders actually are

There are some exceptions where certain engineers (e.g. Solvers) need to work across a larger codebase to perform their critical functions. For most engineers and teams, however, having a big team owning big systems is generally less satisfactory than having smaller teams owning smaller systems which communicate among themselves with clear interfaces.

Cross-Functional Teams

There is also the question of how teams should be organized. Traditionally it tends to be by job function, e.g. there will be a developer team, a DevOps team, business analysts, database admins, QA, etc. We are well aware of the story that a project that would take a fullstack engineer one month would take much longer if it’s handed to a pair consisting of a frontend engineer and a backend engineer.

The story is similar when we consider the ownership and lifecycle of software systems. A collective of individuals, e.g. the owners of this particular software product, needs to be responsible for every aspect of the system over time: development, deployment, operation, and maintenance. Suppose the release of the software requires collecting some metrics based on experiments and custom queries before pushing the new version to production. The cycle time will be much shorter if a single person can perform all of that vs. having the metric verification manually done by a business analyst / release manager and having the production push stamped by a DevOps engineer.

The idea of a cross-functional team is to have all the required functions in a single team, or at least to have functions such as DevOps mostly automated so they require minimal intervention. The team is organized by business domain instead of job function, and an organization will then end up with multiple cross-functional teams, each of which can execute and iterate independently. They will depend on each other only at the interface level, and won’t otherwise block each other or let themselves be blocked by the progress of another team.

Putting together cross-functional teams is definitely no simple task. At Alphabet, even if most of the infrastructure tooling for various job functions is well documented and/or highly automated, there aren’t always the right opportunities to train engineers to pick up the entire set of skills. It requires deliberate effort and investment once engineering leaders agree that building a cross-functional team that executes independently should be a priority.