
Tracking bug report frequency

Started by Don Y September 4, 2023
Anyone else use bug reporting frequency as a gross indicator
of system stability?

On Mon, 4 Sep 2023 06:30:44 -0700, Don Y <blockedofcourse@foo.invalid>
wrote:

>Anyone else use bug reporting frequency as a gross indicator
>of system stability?
One bug is an engineering failure and should be fixed immediately.
On Monday, September 4, 2023 at 11:30:55 PM UTC+10, Don Y wrote:
> Anyone else use bug reporting frequency as a gross indicator
> of system stability?
IBM claimed to be doing that some thirty years ago. They also claimed to
have used their debugging system on Bill Gates' MS-DOS software and made
it much more reliable. It must have been truly appalling when Bill Gates
originally offered it to them for the IBM PC.

-- Bill Sloman, Sydney
On 04/09/2023 14:30, Don Y wrote:
> Anyone else use bug reporting frequency as a gross indicator
> of system stability?
Just about everyone who runs a beta test program. MTBF is another metric
that can be used for something that is intended to run 24/7 and recover
gracefully from anything that may happen to it.

It is inevitable that a new release will have some bugs and minor
differences from its predecessor that real life users will find PDQ. The
trick is to gain enough information from each in service failure to
identify and fix the root cause bug in a single iteration and without
breaking something else. Modern optimisers make that more difficult now
than it used to be back when I was involved in commercial development.

-- Martin Brown
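For a running system, the MTBF tally can be as simple as averaging the
gaps between failure timestamps. A minimal sketch in Python; the
timestamps and the function name are illustrative, not from any real
log:

from datetime import datetime

failures = [
    datetime(2023, 8, 1, 3, 14),    # illustrative failure log
    datetime(2023, 8, 9, 17, 2),
    datetime(2023, 8, 30, 22, 45),
]

def mtbf_hours(failure_times):
    """Mean time between consecutive failures, in hours."""
    times = sorted(failure_times)
    gaps = [(b - a).total_seconds() / 3600.0
            for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

print(f"MTBF ~ {mtbf_hours(failures):.0f} hours")   # ~358 hours here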
On Tue, 5 Sep 2023 13:13:51 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:

>On 04/09/2023 14:30, Don Y wrote:
>> Anyone else use bug reporting frequency as a gross indicator
>> of system stability?
>
>Just about everyone who runs a beta test program.
>MTBF is another metric that can be used for something that is intended
>to run 24/7 and recover gracefully from anything that may happen to it.
>
>It is inevitable that a new release will have some bugs and minor
>differences from its predecessor that real life users will find PDQ.
That's the story of software: bugs are inevitable, so why bother to be careful coding or testing? You can always wait for bug reports from users and post regular fixes of the worst ones.
>The trick is to gain enough information from each in service failure to
>identify and fix the root cause bug in a single iteration and without
>breaking something else. Modern optimisers make that more difficult now
>than it used to be back when I was involved in commercial development.
There have been various drives to write reliable code, but none were
popular. Quite the contrary: the software world loves abstraction and
ever new, bizarre languages... namely playing games instead of coding
boring, reliable applications in some klunky, reliable language.

Electronic design, and FPGA coding, are intended to be bug-free first
pass and often are, when done right.

FPGAs are halfway software, so the coders tend to be less careful than
hardware designers. FPGA bug fixes are easy, so why bother to read your
own code?

That's ironic, when you think about it. The hardest bits, the physical
electronics, have the fewest bugs.
On Tue, 05 Sep 2023 08:57:22 -0700, John Larkin
<jlarkin@highlandSNIPMEtechnology.com> wrote:

>On Tue, 5 Sep 2023 13:13:51 +0100, Martin Brown
><'''newspam'''@nonad.co.uk> wrote:
>
>[...]
>
>There have been various drives to write reliable code, but none were
>popular. Quite the contrary: the software world loves abstraction and
>ever new, bizarre languages... namely playing games instead of coding
>boring, reliable applications in some klunky, reliable language.
>
>Electronic design, and FPGA coding, are intended to be bug-free first
>pass and often are, when done right.
>
>FPGAs are halfway software, so the coders tend to be less careful than
>hardware designers. FPGA bug fixes are easy, so why bother to read
>your own code?
>
>That's ironic, when you think about it. The hardest bits, the physical
>electronics, have the fewest bugs.
There is a complication. Modern software is tens of millions of lines of
code, far exceeding the inspection capabilities of humans. Hardware is
far simpler in terms of lines of FPGA code. But it's creeping up.

On a project some decades ago, the customer wanted us to verify every
path through the code, which was about 100,000 lines (large at the time)
of C or assembler (don't recall, doesn't actually matter).

In round numbers, one in five lines of code is an IF statement, so in
100,000 lines of code there will be 20,000 IF statements. So, there are
up to 2^20000 unique paths through the code. Which chokes my HP
calculator, so we must resort to logarithms, yielding 10^6021, which is
a *very* large number. The age of the Universe is only 14 billion years,
call it 10^10 years, so one would never be able to test even a tiny
fraction of the possible paths.

The customer withdrew the requirement.

Joe Gwinn
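The arithmetic checks out; the logarithm step is a one-liner (Python
here, purely to reproduce the figures in the post):

import math

if_statements = 100_000 // 5             # one in five lines is an IF
exponent = if_statements * math.log10(2)
print(f"2^{if_statements} ~ 10^{exponent:.0f}")   # prints: 2^20000 ~ 10^6021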
On 05/09/2023 16:57, John Larkin wrote:
> On Tue, 5 Sep 2023 13:13:51 +0100, Martin Brown
> <'''newspam'''@nonad.co.uk> wrote:
>
>> [...]
>> It is inevitable that a new release will have some bugs and minor
>> differences from its predecessor that real life users will find PDQ.
>
> That's the story of software: bugs are inevitable, so why bother to be
> careful coding or testing? You can always wait for bug reports from
> users and post regular fixes of the worst ones.
Don't blame the engineers for that - it is the "ship it and be damned"
senior management that is responsible for most buggy code being shipped.
Even more so now that 1+GB upgrades are essentially free. :(

First to market is worth enough that people live with buggy code. The
worst major release I can recall in a very long time was MS Excel 2007
(although bugs in Vista took a lot more flak - rather unfairly IMHO).

(which reminds me it is a MS patch Tuesday today)
>> The trick is to gain enough information from each in service failure to
>> identify and fix the root cause bug in a single iteration and without
>> breaking something else. Modern optimisers make that more difficult now
>> than it used to be back when I was involved in commercial development.
>
> There have been various drives to write reliable code, but none were
> popular. Quite the contrary: the software world loves abstraction and
> ever new, bizarre languages... namely playing games instead of coding
> boring, reliable applications in some klunky, reliable language.
The only ones which actually could be truly relied upon used formal
mathematical proof techniques to ensure reliability. Very few
practitioners are able to do it properly and it is pretty much reserved
for ultra high reliability safety and mission critical code.

It could all be done to that standard iff commercial developers and
their customers were prepared to pay for it. However, they want it now
and they keep changing their minds about what it is they actually want,
so the goalposts are forever shifting around. That sort of functionality
creep is much less common in hardware.

UK's NATS system is supposedly six-sigma coding, but its misbehaviour on
Bank Holiday Monday peak travel time was somewhat disastrous. It seems
someone managed to input the halt and catch fire instruction and the
buffers ran out before they were able to fix it. There will be a
technical report out in due course - my guess is that they have reduced
overheads and no longer have some of the key people who understand its
internals. Malformed flight plan data should not have been able to kill
it stone dead - but apparently that is exactly what happened!

https://www.ft.com/content/9fe22207-5867-4c4f-972b-620cdab10790

(might be paywalled) If so, Google "UK air traffic control outage caused
by unusual flight plan data"
> Electronic design, and FPGA coding, are intended to be bug-free first
> pass and often are, when done right.
But that is done using design and simulation *software* which you fail
to acknowledge is actually pretty good. If you had to do it with pencil
and paper you would be there forever.
> FPGAs are halfway software, so the coders tend to be less careful than
> hardware designers. FPGA bug fixes are easy, so why bother to read
> your own code?
>
> That's ironic, when you think about it. The hardest bits, the physical
> electronics, have the fewest bugs.
So do physical mechanical interlocks. I don't trust software or even
electronic interlocks to protect me compared to a damn great beam stop
and a padlock on it with the key in my pocket.

-- Martin Brown
On 05/09/2023 17:45, Joe Gwinn wrote:
> On Tue, 05 Sep 2023 08:57:22 -0700, John Larkin
> <jlarkin@highlandSNIPMEtechnology.com> wrote:
>
> [...]
>
> There is a complication. Modern software is tens of millions of lines
> of code, far exceeding the inspection capabilities of humans. Hardware
> is far simpler in terms of lines of FPGA code. But it's creeping up.
>
> On a project some decades ago, the customer wanted us to verify every
> path through the code, which was about 100,000 lines (large at the
> time) of C or assembler (don't recall, doesn't actually matter).
>
> In round numbers, one in five lines of code is an IF statement, so in
> 100,000 lines of code there will be 20,000 IF statements. So, there
> are up to 2^20000 unique paths through the code. Which chokes my HP
Although that is true, it is also true that a small number of cunningly
constructed test datasets can explore a very high proportion of the most
frequently traversed paths in any given codebase. One snag is that
testing is invariably cut short by management when development overruns.

The bits that fail to get explored tend to be weird error recovery
routines. I recall one latent on the VAX for ages which was that when it
ran out of IO handles (because someone was opening them inside a loop)
the first thing the recovery routine tried to do was open an IO channel!
> calculator, so we must resort to logarithms, yielding 10^6021, which
> is a *very* large number. The age of the Universe is only 14 billion
> years, call it 10^10 years, so one would never be able to test even a
> tiny fraction of the possible paths.
McCabe's complexity metric provides a way to test paths in components
and subsystems reasonably thoroughly and catch most of the common
programmer errors. Static dataflow analysis is also a lot better now
than in the past.

Then you only need at most 40000 test vectors to take each branch of
every binary if statement (60000 if it is Fortran with 3-way branches
all used). That is a rather more tractable number (although still
large). Any routine with too high a CCI count is practically certain to
contain latent bugs - which makes it worth looking at more carefully.

-- Martin Brown
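For a sense of what counting decision points looks like in practice,
here is a rough sketch of McCabe's metric for a single Python function;
it counts only the obvious branch nodes (real tools, such as the mccabe
package, also handle boolean operators and more):

import ast

def cyclomatic_complexity(source: str) -> int:
    """McCabe's metric for one function: decision points + 1."""
    decisions = (ast.If, ast.While, ast.For, ast.IfExp, ast.ExceptHandler)
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))

src = """
def classify(x):
    if x < 0:
        return "negative"
    for d in range(2, 10):
        if x % d == 0:
            return "divisible"
    return "other"
"""
print(cyclomatic_complexity(src))   # 4: three decision points + 1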
On 9/5/2023 5:13 AM, Martin Brown wrote:
> On 04/09/2023 14:30, Don Y wrote:
>> Anyone else use bug reporting frequency as a gross indicator
>> of system stability?
>
> Just about everyone who runs a beta test program.
> MTBF is another metric that can be used for something that is intended
> to run 24/7 and recover gracefully from anything that may happen to it.
I'm looking at the pre-release period (you wouldn't want to release
something that wasn't "stable").

I commit often (dozens of times a day) so I can have a record of each
problem encountered and, thereafter, how it was "fixed". As the number
of messages related to fixups decreases, confidence in the codebase
rises.
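That tally is trivial to automate. A minimal sketch, assuming the
repository is at hand and that fixes are identifiable by the word "fix"
in the commit subject (both are assumptions, not anything specified
above):

import subprocess
from collections import Counter

# Week-stamped commit subjects from the current repository.
log = subprocess.run(
    ["git", "log", "--date=format:%Y-%W", "--pretty=%ad %s"],
    capture_output=True, text=True, check=True,
).stdout

fixes_per_week = Counter()
for line in log.splitlines():
    week, _, subject = line.partition(" ")
    if "fix" in subject.lower():
        fixes_per_week[week] += 1

# A falling bar length is the "confidence rising" signal.
for week in sorted(fixes_per_week):
    print(week, "#" * fixes_per_week[week])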
> It is inevitable that a new release will have some bugs and minor
> differences from its predecessor that real life users will find PDQ.
The "bugs" that tend to show up after release are specification shortcomings. E.g., I had a case where a guy wired a motor incorrectly and the software just kept driving it further and further from it's desired setpoint -- until it smashed into the "wrong" limit switches (which, of course, weren't examined because it wasn't SUPPOSED to be traveling in that direction). When you've got 7-figures at stake, you can't resort to blaming the "electrician" for the failure ("Why didn't the software sense that it was running the wrong way?" Um, why didn't it sense that the electrician's wife had been ragging on him before he came to work and left him in a distracted mood??) Bugs (as in "coding errors") should never leave the lab.
> The trick is to gain enough information from each in service failure to
> identify and fix the root cause bug in a single iteration and without
> breaking something else. Modern optimisers make that more difficult now
> than it used to be back when I was involved in commercial development.
Good problem decomposition goes a long way towards that goal. If you try
to do "too much" you quickly overwhelm the developer's ability to manage
complexity (7 items in STM?). And, as you can't *see* the entire
implementation, there's nothing to REMIND you of some salient issue that
might impact your local efforts.

[Hence the value of eschewing globals and the languages that
tolerate/encourage them! This dramatically cuts down the number of ways
X can influence Y.]
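A toy contrast of the globals point, in Python (names are illustrative):
with module-level state, every function in the module is a suspect when
the value goes wrong; passed explicitly, the ways X can influence Y are
visible at the call sites.

# Global style: any function in the module might have touched `total`.
total = 0

def add_global(x):
    global total
    total += x

# Explicit style: influence is confined to arguments and return values.
def add_explicit(total, x):
    return total + x

t = 0
t = add_explicit(t, 5)
print(t)   # 5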
On Tue, 05 Sep 2023 12:45:01 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:

>[...]
>
>There is a complication. Modern software is tens of millions of lines
>of code, far exceeding the inspection capabilities of humans.
After you type a line of code, read it. When we did that, entire
applications often worked first try.

>Hardware
>is far simpler in terms of lines of FPGA code. But it's creeping up.
FPGAs are at least (usually) organized state machines. Mistakes are
typically hard failures, not low-rate bugs discovered in the field.
Avoiding race and metastability hazards is common practice.
>On a project some decades ago, the customer wanted us to verify every
>path through the code, which was about 100,000 lines (large at the
>time) of C or assembler (don't recall, doesn't actually matter).
Software provability was a brief fad once. It wasn't popular or, as code is now done, possible.
>In round numbers, one in five lines of code is an IF statement, so in
>100,000 lines of code there will be 20,000 IF statements. So, there
>are up to 2^20000 unique paths through the code. Which chokes my HP
>calculator, so we must resort to logarithms, yielding 10^6021, which
>is a *very* large number. The age of the Universe is only 14 billion
>years, call it 10^10 years, so one would never be able to test even a
>tiny fraction of the possible paths.
An FPGA is usually coded as a state machine, where the designer understands that the machine has a finite number of states and handles every one. A computer program has an impossibly large number of states, unknown and certainly not managed. Code is like hairball async logic design.
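The state-machine discipline translates directly to software: enumerate
every state and make the dispatcher reject anything unhandled. A minimal
sketch (the states are invented for illustration):

from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    FAULT = auto()

def step(state, start, error):
    """Advance one tick; every state is handled explicitly."""
    if state is State.IDLE:
        return State.RUNNING if start else State.IDLE
    if state is State.RUNNING:
        return State.FAULT if error else State.RUNNING
    if state is State.FAULT:
        return State.FAULT                  # latched until reset
    raise AssertionError(f"unhandled state: {state}")   # no silent drift

print(step(State.IDLE, start=True, error=False))   # State.RUNNING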
>The customer withdrew the requirement.
It was naive of him to want correct code.