⚡️ TL;DR
- Managing risk is critical when building complex systems. IcePanel’s Flows, Tags, and Dependencies View make it easier to get a comprehensive perspective of risk.
- To define risk, consider the scope (context) and characteristics (criteria) of your systems, like security, reliability, and elasticity.
- Spot risks by using a scorecard to score how likely the worst-case scenario is and how severe its impact would be. Make educated guesses to set initial risk ratings in your diagrams.
- Validate and adjust these scores with real-world data, using observability tools to track performance over time.
📖 Overview
Building complex systems is fundamentally about managing risk. After all the blood, sweat, and tears to define your as-is architecture, you’re rewarded with the opportunity to evaluate risk. You may have spotted a few concerning areas in your diagram already, but the focus now turns to looking at risk more rigorously.
In this guide, we’ll discuss how to define, identify, and measure architectural risk using IcePanel and Mark Richards’s framework, which is covered in his three-part video series. A link to the series can be found at the end of this article.
✍️ Defining architectural risk
Architectural risk can mean many things. Broadly, we can think about risk in terms of what we’re assessing (the context) and what we’re defining as risk (the criteria). Before assessing anything, align with your team on both dimensions.
Context risk is about scope — are we looking at risk from a service or domain (business area) perspective? Looking at the risk of services allows us to dive into the specific units of the system with precision. However, services commonly interact with one another to support a workflow and rarely operate in isolation, so it may make more sense to look at risk by domain. If you have the time, it may be helpful to look at it from both a service and domain perspective.
Criteria risk is about getting specific about the architectural characteristics you want to prioritize. This can include things like:
- Feasibility
- Security
- Reliability
- Responsiveness
- Availability
- Data integrity
- Elasticity
Mark’s Architecture Characteristics Worksheet is an excellent resource for defining these characteristics as a team. Choose at most 3 criteria that are of critical importance to your team. There’s no such thing as a risk-free system, and designing systems is ultimately about making purposeful tradeoffs.
After defining your context and criteria, create an architectural risk assessment scorecard and start by filling out the column and row headers.
Mark Richards (Assessing Architectural Risk Part 1) — Risk scorecard.
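As a rough illustration, an empty scorecard can be sketched as a small grid. This assumes criteria as rows and contexts (services or domains) as columns. The header names below are hypothetical placeholders, not values from Mark's worksheet, so swap in whatever your team agreed on.

```python
from typing import Optional

# Hypothetical headers: choose your own criteria (at most 3) and contexts.
CRITERIA = ["Elasticity", "Reliability", "Security"]
CONTEXTS = ["Test taker", "Auto grader"]


def empty_scorecard(
    criteria: list[str], contexts: list[str]
) -> dict[str, dict[str, Optional[int]]]:
    """Build a grid with one row per criterion and one cell per context.

    Every cell starts as None because no risk score has been estimated yet.
    """
    return {
        criterion: {context: None for context in contexts}
        for criterion in criteria
    }


scorecard = empty_scorecard(CRITERIA, CONTEXTS)
```

Leaving cells as `None` rather than `0` keeps "not yet assessed" distinct from any real score.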
🔍 Identifying architectural risk
With your scorecard headers filled out, you can now start looking at your architecture to estimate risk scores. If you haven’t diagrammed your as-is state by now, stop here and get that done before reading ahead. Check out our Getting Started article if you need help.
Mark Richards recommends scoring based on likelihood and impact. At this stage, your ratings will be based on assumptions. The goal here is to get directionally correct scores, which you can later confirm in the last step when measuring risk. Aligning on high-risk areas can already kick-start discussions on improving your design.
Let’s look at Mark’s student test platform example to put this into practice. The platform allows students to authenticate, answer test questions, and have their answers graded and saved in a database. The system consists of several services and databases that we can visualize in the C4 model. Mark examines the elasticity (how well it scales with traffic/requests) of the Test taking context diagrammed below.
Level 2 App Diagram of the Test system (Mark’s example also included admin services that we’re omitting).
To evaluate elasticity, we make worst-case assumptions, like what happens if 120,000 students are taking the test at the same time? We then go through the diagram using a Flow and tag things based on likelihood and impact.
Test-taking Flow in IcePanel.
After going through the Flow, we can make a few assumptions:
- If the test taker service goes down or experiences slow performance, it will severely impact the experience of students taking the test (impact = high; 3). However, the likelihood of this happening isn’t high because it’s only reading from the db (likelihood = medium; 2).
- If the message queue performance degrades with too many requests, the impact will also be high because student answers will no longer be graded and saved (impact = high; 3). The auto grader has to do both reads and writes, which increases the likelihood that, in a high-volume scenario, the queue will fill up (likelihood = high; 3).
We can layer on risk scores using tags in IcePanel:
- Create two Tag groups called ‘Elasticity likelihood’ and ‘Elasticity impact’
- In each group, create a tag for each numeric score from 1 to 3
- Assign a likelihood and an impact score to each service
- Create another Tag group, ‘Elasticity risk’
- Create tags for each numeric score from 1 to 9
- Multiply each service’s likelihood and impact scores to get its final risk score, and assign the matching tag
Mark suggests using the highest risk value across the services as the overall domain rating. In this case, it’s 9, which comes from the message queue scenario (likelihood 3 × impact 3). In IcePanel, we can represent this ‘domain’ rating by adding the risk rating to the system as a tag.
Level 2 App Diagram with elasticity risk scores tagged.
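The arithmetic behind these tags can be sketched in a few lines. This is a minimal illustration rather than anything built into IcePanel. The `ServiceRisk` class and `domain_rating` function are hypothetical names, and the ratings are the two assumptions from the walkthrough above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRisk:
    """Likelihood and impact ratings for one service, each 1 (low) to 3 (high)."""

    name: str
    likelihood: int
    impact: int

    def __post_init__(self) -> None:
        # Reject ratings outside the 1-3 scale used by the scorecard.
        for label, value in (("likelihood", self.likelihood), ("impact", self.impact)):
            if value not in (1, 2, 3):
                raise ValueError(f"{label} must be 1, 2, or 3, got {value}")

    @property
    def risk(self) -> int:
        """Final risk score: likelihood multiplied by impact."""
        return self.likelihood * self.impact


def domain_rating(services: list[ServiceRisk]) -> int:
    """Overall domain rating: the highest risk score across the services."""
    return max(service.risk for service in services)


# Ratings taken from the two assumptions in the walkthrough above.
services = [
    ServiceRisk("Test taker service", likelihood=2, impact=3),
    ServiceRisk("Message queue", likelihood=3, impact=3),
]

for service in services:
    print(f"{service.name}: risk {service.risk}")
print(f"Domain rating: {domain_rating(services)}")
```

This prints a risk of 6 for the test taker service, 9 for the message queue, and a domain rating of 9. With inputs limited to 1 through 3, only the products 1, 2, 3, 4, 6, and 9 can occur, so tags for 5, 7, and 8 will go unused.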
📈 Measuring architectural risk
The last step is to confirm the subjective risk scores with real-life quantitative data. We’ll want to answer key questions like:
- Were our assumptions about the risk scores accurate?
- Are the risk criteria getting better or worse over time?
- What changes in the architecture are impacting the risk criteria?
- Have we achieved our goals for the risk criteria?
To answer these questions, we’ll need to look at observability tools to analyze architecture fitness functions. These are charts based on trends or thresholds over time. In the student testing system, we’ll record things like response times from the services and queue depth over time. Use this data to refine your scores and update the ratings in your scorecard and in IcePanel.
Example of architectural fitness functions from Mark Richards in Lesson 129 — Assessing Architectural Risk (Part 3)
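To make the two styles of fitness function concrete, here is a minimal sketch of a threshold check and a trend check over recorded samples. The sample values, the limits, and the function names are all made up for illustration. In practice the data would come from your observability tool, and this is not Mark's implementation.

```python
from statistics import mean


def exceeds_threshold(samples: list[float], limit: float) -> bool:
    """Threshold check: True if any recorded sample goes above the limit."""
    return any(value > limit for value in samples)


def is_trending_up(samples: list[float], window: int = 3) -> bool:
    """Trend check: True if the mean of the last `window` samples is higher
    than the mean of the first `window` samples."""
    if len(samples) < 2 * window:
        raise ValueError("need at least 2 * window samples to compare")
    return mean(samples[-window:]) > mean(samples[:window])


# Made-up samples, oldest first. Limits are hypothetical targets.
queue_depth = [120, 150, 180, 400, 950, 1400]
response_time_ms = [180, 175, 190, 170, 172, 168]
QUEUE_DEPTH_LIMIT = 1000
RESPONSE_TIME_LIMIT_MS = 500

queue_at_risk = (
    exceeds_threshold(queue_depth, QUEUE_DEPTH_LIMIT)
    or is_trending_up(queue_depth)
)
response_at_risk = (
    exceeds_threshold(response_time_ms, RESPONSE_TIME_LIMIT_MS)
    or is_trending_up(response_time_ms)
)
```

With these made-up numbers, the queue depth is flagged and the response time is not. That would support keeping a high elasticity likelihood for the message queue while revisiting the rating for the service being measured.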
This is one of the areas we’re exploring at IcePanel to help you complete the final step in measuring risk. We see an opportunity to connect objects to external observability tools and surface this information in your diagrams. If you have any ideas, we’d love to hear them—email us at mail@icepanel.io.
🏁 Final thoughts
Building complex systems is all about managing risk, and IcePanel helps by allowing you to get a full picture of your systems. In this guide, we covered defining risk, scoring it with a scorecard, assigning tags to objects in IcePanel, and validating those scores with real-world data. We look forward to exploring further improvements to connect objects in IcePanel to reality so you can close the loop in your risk analysis.