Our approach to software maintainability: benchmarked code metrics offer perspective

What is good quality software?

Finding an objective measure is tough. Software might be called good when it behaves the way you expect it to, when it withstands unforeseen situations, when it is secure against abuse, or when it complies with requirements.

What all these definitions have in common is the need for maintenance: the honorable task of developers and the roles surrounding them. Maintainable code is efficient, effective, workable, and above all easy to read for the developers that come after you. When we talk about technical debt, it is easy to imagine its compounding effects: over time, the technical imperfections of code gather interest on top of the original debt. The metaphor does not stretch limitlessly, since not all debt must necessarily be repaid (this is essentially a business trade-off), but it is well understood ("cruft" never quite made it past developer argot). Technical debt might be just harmless unrealized potential, but we often see that at a certain point teams get stuck in a vicious cycle. At that point, code quality becomes very noticeable to developers, clients, and the business. See also our discussion on technical debt in an agile development process. Essentially, bad code is bad for business. The inverse is not necessarily always true, but what is certain is that you need a grip on quality to deliver business value in a predictable and efficient way.

Setting a quality benchmark

Generally, to assess the quality of something, you need a sense of context, and each piece of software may have a widely different context. To set objective thresholds for code quality, at SIG we use technology-independent code measurements and compare those to a benchmark. A benchmark on quality is meaningful because it shows you an unbiased norm of how well you are doing. The context for this benchmark is "the current state of the software development market". This means that you can compare your source code to the code that others are maintaining.

To compare different programming technologies with each other, the metrics represent abstractions that occur universally, such as the volume of pieces of code and the complexity of the decision paths within them. In that way, system size can be normalized to "person-months" or "person-years": indications of the amount of developer work done per time period. Those numbers are themselves based on benchmarks.
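To make the idea concrete, here is a minimal sketch in Python. It is purely illustrative: the `person_months` helper and the single productivity figure are assumptions for this example, not SIG's benchmarked, per-technology calibration values.

```python
# Hypothetical illustration of normalizing system volume to developer effort.
# The productivity figure is a placeholder assumption, NOT a SIG benchmark value;
# the real calibration is per technology and derived from benchmark data.
ASSUMED_LINES_PER_PERSON_MONTH = 1_000

def person_months(lines_of_code: int,
                  lines_per_person_month: int = ASSUMED_LINES_PER_PERSON_MONTH) -> float:
    """Estimate the build effort of a code base in person-months."""
    return lines_of_code / lines_per_person_month

def person_years(lines_of_code: int) -> float:
    """The same estimate expressed in person-years."""
    return person_months(lines_of_code) / 12

# Under this placeholder assumption, a 300,000-line system is roughly 25 person-years.
print(round(person_years(300_000), 1))
```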

Below is a short video introduction to this approach.

Summarizing, Sigrid compares analysis results for your system against a benchmark of 10,000+ industry systems. This benchmark set is selected and calibrated (rebalanced) yearly to keep up with the current state of software development. "Balanced" here can be understood as a representative spread of the "system population". It includes old and new technologies alike, from legacy systems to modern JavaScript frameworks. In terms of technologies, the set is skewed towards the programming languages that are currently most common, because that best represents the current state of the market. The metrics underlying the benchmark approximately follow a normal distribution, which serves as a sanity check that the set is a fair representation and allows statistical inference about the "population" of software systems.

Benchmarking maintainability from 1 to 5 stars

The code quality score relative to this benchmark is expressed as a star rating on a scale from 1 to 5 stars, following a 5%-30%-30%-30%-5% distribution. Technically, the underlying metrics range from 0.5 to 5.5 stars. This is partly a matter of convention, but it also avoids a "0" rating, because 0 is not a meaningful end of a quality scale. The middle 30% of systems falls between 2.5 and 3.5, and all scores within this range are rated 3 stars: market average.

To be precise: even though 50% of systems are necessarily below the benchmark average (3.0), only 35% of systems score below the 3-star threshold (below 2.5), and 35% score above it (above 3.5). To avoid suggesting extreme precision, it is helpful to think of these stars as ranges, such that a 3.4-star score would be considered "within the expected range of market average, on the higher end". Note that rounding is always downwards, with a maximum of 2 decimals of precision, so a 1.49-star score counts as 1 star.
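As a minimal sketch of this mapping, assuming the 0.5-5.5 scale, the half-star thresholds, and the downward rounding described above (the function name and the exact boundary handling are illustrative assumptions, not Sigrid's actual implementation):

```python
import math

# Star bands on the 0.5-5.5 scale:
# [0.5, 1.5) -> 1 star, [1.5, 2.5) -> 2 stars, [2.5, 3.5) -> 3 stars,
# [3.5, 4.5) -> 4 stars, [4.5, 5.5] -> 5 stars.

def to_whole_stars(rating: float) -> int:
    """Map a 0.5-5.5 benchmark rating to a whole-star score (1-5)."""
    if not 0.5 <= rating <= 5.5:
        raise ValueError("rating must be between 0.5 and 5.5")
    # Round downwards at a maximum of 2 decimals, as described above.
    truncated = math.floor(rating * 100) / 100
    # Shift by half a star so that the x.5 boundaries mark the band edges.
    return min(5, int(truncated + 0.5))

print(to_whole_stars(1.49))  # 1 star
print(to_whole_stars(3.4))   # 3 stars: market average, on the higher end
print(to_whole_stars(3.5))   # 4 stars: above the market-average band
```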

The system cloud: maintainability and system volume

One of the findings that the SIG maintainability benchmark consistently confirms is that system volume correlates negatively with maintainability. This applies generally, across sectors and technologies: large systems tend to be harder to maintain than smaller systems. It holds even when we take SIG model bias into account (where a higher system volume in itself scores lower on maintainability because of, for example, increasing analysis and coordination effort). From experience we know that large systems can still score very well at the "code unit level" (the smallest unit of programming logic), but complexity increases quickly at the architectural level. As an illustration, a simplified visualization of what is commonly known as the "SIG cloud" is shown below. The X-axis is a logarithmic scale of system volume (in person-years) and the Y-axis represents the maintainability scale from 0.5 to 5.5 stars.

Star ratings are predictive of speed and costs

From experience with its datasets of systems, SIG has established that there is a strong correlation, and a cause-and-effect relationship, between this star rating and development speed and system maintenance costs. For example, maintenance and ownership costs for 4-star systems are two times lower than for 2-star systems, and their development speed is up to four times higher. See the graph below:

Further reading and interpreting metrics