On Thu, 14 Oct 2021 17:10:37 +0100, Tom Gardner
<spamjunk@blueyonder.co.uk> wrote:
>On 14/10/21 16:26, Joe Gwinn wrote:
>> On Wed, 13 Oct 2021 00:57:53 +0100, Tom Gardner
>> <spamjunk@blueyonder.co.uk> wrote:
>>
>>> On 13/10/21 00:14, Joe Gwinn wrote:
>>>>
>>>>>>> That is all or partially hidden by the operation of the L1/L2/L3
>>>>>>> caches in processors, but all the interacting C language
>>>>>>> features (e.g. const, volatile, etc) have to be got right.
>>>>>>> If incorrect, then subtle, rare, unreproducible errors will
>>>>>>> occur.
>>>>>> Also true. It's the programmers' job to understand all this. Not
>>>>>> that many understand the hardware that deeply, but enough do.
>>>>> Most of them understand (and I use that word loosely) only
>>>>> enough to allow them to copy-and-paste "solutions" from
>>>>> stackexchange. Once it compiles and passes their inadequate
>>>>> unit tests, it works - by definition.
>>>>
>>>> That is certainly true. There is a wide dynamic range of programmer
>>>> skill. Back in the day, I was a rarity, being bilingual (hardware and
>>>> software), and it allowed me to solve some pretty wild bugs fairly
>>>> easily, because I had access to a level of information not commonly
>>>> available to pure software folks.
>>>
>>> I was too, everything from low noise analogue, conventional
>>> digital, "micros", RT software, cellphone modelling and measuring,
>>> and even some CRUD database stuff.
>>>
>>> When I'm feeling mischievous, usually in a pub, I'll tell
>>> people that I don't know where the boundary between hardware
>>> and software actually is.
>>>
>>> They are usually aghast at first. After mentioning microcode,
>>> the way modern ISAs are decomposed into RISC-like micro-ops
>>> inside the processor, FPGAs, emulation, etc, the reactions
>>> are one of two kinds
>>> - slightly aggressive denial, usually accompanied by looks
>>> of bewilderment and incomprehension
>>> - amusement, and delight at the philosophical questions
>>>
>>> Guess which people I trust (technically) more!
>>
>> Heh. Did they buy you a beer?
>>
>>
>>>> War story. Something like ten years ago, the C++ tribe was unable to
>>>> figure out why the radar software would go casters-up on startup. This
>>>> was likely a million lines of code at least. When it fell over, no
>>>> error messages or other information was printed. This problem
>>>> endured for months.
>>>
>>> Java, and other modern languages, are usually much
>>> better in that respect - aggressive use of exceptions
>>> and full stack traces really help.
>>
>> In theory, so does C++. For all the good it did.
>>
>> And C++ stack traces can be pretty hard to follow. But in my above
>> example, there was no stack trace to pore over. That's where the
>> kernel debugger came in. The kernel knows who is waiting on what, and
>> where in the application code the request was made.
>
>The only time this is likely to happen in a Java application
>is if the JVM is broken. I have seen config statements to
>the effect of "don't HotSpot optimise ByteArrays if using
>JRE 1.4.16". No idea how they noticed and isolated that
>as the bug!
The first three teams didn't figure it out?
>>> Another was debugging the 68000 SBC and its RTOS, where we
>>> had purchased both and both were buggy. Oh the (unproductive)
>>> fun we had!
>>
>> We had lots of problems with 68000 SBCs that didn't work either, but
>> eventually figured out who _not_ to buy from. Which was most of the
>> then vendors, who mostly vanished over time. One assumes that word
>> got around.
>
>Interesting. From memory (1988!) we had three, and different
>ones had a different 1/4 of their memory non-functional.
I also recall lots of problems with correct implementation of such
instructions as Test-and-Set, which require the bus and bus interfaces
to cooperate when used in a multiprocessor setup.

Also, backplanes that could not handle conflicting atomic operations
from different SBCs on the same bus. The symptom was that the
backplane locked up, requiring a power cycle to recover control.

I wrote a short test program for that, called "TASbasher". One ran an
instance on each SBC, with mutually prime numbers of NOPs in the
loops, so the two instances could not get comfortable in an
alternating cycle. Vulnerable backplanes would lock up in a second or
two. And the hardware folk ran out of fix-your-buggy-software
excuses.
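
From memory, the guts of it amounted to something like this - a
sketch in C rather than the original 68000 assembly, with a local
byte and GCC's atomic builtins standing in for the shared backplane
memory and the TAS instruction:

/* TASbasher sketch - a from-memory reconstruction, not the original.
   One instance runs on each SBC, all pointed at the same location in
   shared bus memory.  The NOP counts on the boards are chosen
   mutually prime so the instances cannot settle into a polite
   alternating rhythm. */

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* On the real hardware this pointer aims at a byte of shared
       backplane (e.g. VMEbus) memory; here, a local stand-in. */
    static unsigned char shared_byte;
    volatile unsigned char *lock = &shared_byte;
    long nops = (argc > 1) ? atol(argv[1]) : 7;   /* e.g. 7 vs. 11 */
    unsigned long cycles = 0;

    for (;;) {
        /* On a 68000 this is the TAS instruction: an indivisible
           read-modify-write bus cycle.  Conflicting TAS cycles from
           two SBCs are exactly what broke weak backplanes. */
        while (__atomic_test_and_set((void *)lock, __ATOMIC_SEQ_CST))
            ;                                  /* spin until we win */
        __atomic_clear((void *)lock, __ATOMIC_SEQ_CST);

        for (long i = 0; i < nops; i++)        /* stagger the loops */
            __asm__ volatile ("nop");

        if (++cycles % 100000 == 0)
            printf("still alive after %lu lock cycles\n", cycles);
    }
}

A vulnerable backplane died long before the first progress message.
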
>> As for RTOSes, we used MTOS, which did work.
>
>It /might/ have been MTOS (but 1988 etc). I captured bus
>transactions to determine that when one RTOS call was made
>with a parameter, that did not reappear when the relevant
>task awoke. That was really "fun", given that instructions
>seen on the bus were not necessarily executed - the only
>way to tell was to execute the instructions on paper, and
>discount irrelevant prefetches.
>
>The RTOS vendor traced the problem to some assembly
>code in the port to the specific SBC, and fixed it
>speedily.
MTOS had the usual teething problems, but I don't recall that one.
Then again, it had survived some malicious benchmark tests.
>
>> Again, there were many
>> RTOSes that didn't work. In many cases the problem was bugs; in
>> others, the design itself. A classic design flaw was an inter-task
>> messaging facility that could not handle a circular path, where
>> A -> B -> C -> A, and so on.
>>
>> This meant that the RTOS could handle only synchronous activities,
>> which was crippling in ERT, because the order of arrival of events is
>> necessarily random, and all orders will happen; a system built on a
>> synchronous RTOS would immediately lock up. This led to a very
>> simple but deadly RTOS benchmarking architecture. This architecture
>> also works on middleware.
>
>It is why "higher level" design patterns are so useful,
>including in embedded systems. The traditional mutex/semaphore
>is necessary, but not sufficient.
Well, at the time, design patterns were well in the future. All they
did was describe and name bits of the ERT lore - not a bad thing, but
hardly a revelation either.
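
To make that circular-path benchmark concrete: in rough outline it
amounts to the following, sketched here with POSIX queues and threads
standing in for the RTOS primitives under test (the real thing was
not written this way). A messaging layer that can only do synchronous
rendezvous deadlocks the moment every task is blocked sending to a
neighbour that is itself blocked sending:

/* Circular-path messaging test: three tasks pass a token around a
   ring, A -> B -> C -> A.  A healthy asynchronous messaging layer
   runs this to completion; a synchronous-only one locks up. */

#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>

#define NTASKS 3
#define LAPS   100000

static mqd_t q[NTASKS];

/* Each task waits for the token on its own queue, then passes it on
   to the next task in the ring. */
static void *task(void *arg)
{
    long id = (long)arg;
    char buf[16];

    for (long i = 0; i < LAPS; i++) {
        mq_receive(q[id], buf, sizeof buf, NULL);
        mq_send(q[(id + 1) % NTASKS], buf, 6, 0);
    }
    return NULL;
}

int main(void)
{
    struct mq_attr attr = { .mq_maxmsg = 4, .mq_msgsize = 16 };
    pthread_t t[NTASKS];
    char name[16];

    for (long i = 0; i < NTASKS; i++) {
        snprintf(name, sizeof name, "/ring%ld", i);
        mq_unlink(name);                /* ignore ENOENT on first run */
        q[i] = mq_open(name, O_CREAT | O_RDWR, 0600, &attr);
        pthread_create(&t[i], NULL, task, (void *)i);
    }

    mq_send(q[0], "token", 6, 0);       /* inject one circulating token */

    for (long i = 0; i < NTASKS; i++)
        pthread_join(t[i], NULL);
    puts("survived the circular path");
    return 0;
}

If the last line never prints, the vendor has some explaining to do.
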
>In the Java world, Doug Lea transliterated useful design patterns
>found in the real time community into Java classes. They were
>eventually included in the standard Java libraries.
>
>Most of my architectures seem to be what is sometimes called
>the half-async-half-sync pattern:
> - create event (from a task or hardware interrupt)
> - put event in queue, and return pronto
> - loop, sucking event from queue, processing it to completion,
> often creating an event and yielding
An oldie but goodie. We would timestamp event records at creation,
and then process them with whatever responsiveness that kind of event
needed, and in order where order mattered.
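
In skeletal form it looks like this - a sketch, with a bare
single-producer ring standing in for whatever queue primitive the
RTOS actually provided:

/* Half-sync/half-async sketch: producers (task or ISR context) stamp
   an event and drop it in a queue, returning pronto; one consumer
   loop drains the queue and processes each event to completion. */

#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    uint64_t t_created_ns;   /* stamped at creation */
    int      kind;
    int      payload;
} event_t;

#define QSIZE 64
static event_t ring[QSIZE];
static volatile unsigned head, tail;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

/* Called from the producing context: stamp, enqueue, return. */
static int post_event(int kind, int payload)
{
    unsigned h = head;
    if (h - tail == QSIZE)
        return -1;                           /* queue full: shed load */
    ring[h % QSIZE] = (event_t){ now_ns(), kind, payload };
    head = h + 1;
    return 0;
}

/* The synchronous half: drain and process, oldest first. */
static void event_loop(void)
{
    while (head != tail) {
        event_t ev = ring[tail % QSIZE];
        tail++;
        uint64_t age = now_ns() - ev.t_created_ns;
        printf("event kind %d waited %llu ns\n",
               ev.kind, (unsigned long long)age);
        /* ... dispatch on ev.kind, possibly posting follow-on events */
    }
}

int main(void)
{
    post_event(1, 42);
    post_event(2, 7);
    event_loop();
    return 0;
}

The timestamp is what lets the loop respect each event kind's
deadline, and it costs nearly nothing at creation time.
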
>A variation on that is a telecom system where there are
>many calls each with their own distinct event flow, where
>each call can have multiple outstanding events from different
>sources that must be processed in the order of reception.
>
>In that case
> - for each remote event source, a task sucks on the incoming
> events
> - each event is queued for the relevant call, and the relevant
> call is queued in a global "work to be done" queue
> - a set of worker threads (~1 per core) takes the next call in
> the global queue, takes the first event in that call's queue,
> and processes it to completion
This sounds like what I described just above.
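
The detail worth sketching is the two-level queue, which preserves
event order within a call while letting a pool of workers scale
across calls. A minimal single-threaded rendition (names invented,
locking elided - real code needs a mutex or equivalent around both
lists):

#include <stdio.h>
#include <stdlib.h>

typedef struct event { struct event *next; int data; } event_t;

typedef struct call {
    struct call *next_ready;     /* link in the global ready list  */
    event_t *eq_head, *eq_tail;  /* this call's in-order event FIFO */
    int id;
} call_t;

static call_t *ready_head, *ready_tail;

static void mark_ready(call_t *c)
{
    c->next_ready = NULL;
    if (ready_tail) ready_tail->next_ready = c; else ready_head = c;
    ready_tail = c;
}

/* Append an event to its call; if the call was idle, mark it ready. */
static void call_push_event(call_t *c, int data)
{
    event_t *e = malloc(sizeof *e);
    e->next = NULL; e->data = data;
    if (c->eq_tail) c->eq_tail->next = e; else c->eq_head = e;
    c->eq_tail = e;
    if (c->eq_head == e)          /* queue was empty before this push */
        mark_ready(c);
}

/* One worker iteration: next ready call, first event, to completion. */
static int worker_step(void)
{
    call_t *c = ready_head;
    if (!c) return 0;
    ready_head = c->next_ready;
    if (!ready_head) ready_tail = NULL;

    event_t *e = c->eq_head;
    c->eq_head = e->next;
    if (!c->eq_head) c->eq_tail = NULL;
    printf("call %d: processing event %d\n", c->id, e->data);
    free(e);

    if (c->eq_head)               /* more work: back of the line */
        mark_ready(c);
    return 1;
}

int main(void)
{
    call_t a = { .id = 1 }, b = { .id = 2 };
    call_push_event(&a, 10); call_push_event(&a, 11);
    call_push_event(&b, 20);
    while (worker_step()) ;
    return 0;
}

Re-queueing a call at the back of the ready list after each event is
what keeps one busy call from starving the others.
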
>> One also tested for priority inversion failures. While speed is also
>> measured, this and the circular path are problems no matter the speed
>> of the processor or RTOS.
>
>I manage to structure my systems into three priority levels:
> - hardware interrupt
> - panic and commit seppuku
> - everything else
>
>My brain is too feeble to cope with anything else.
The old hard-frame periodic RTOSes worked that way, but it doesn't
scale all that well. Nobody had the panic-and-die option, though -
there was always some form of elasticity, sometimes called a "rubber
clock", because customers would not tolerate such fragility.

Present-day ERT systems have many priority levels, basically for
better overall responsiveness, taking into account that while the
tasks are all ERT, some are more urgent than others.

A classic example is a weather radar, where a 3D volume scan takes
about five minutes to collect all the data. The problem is that the
intensity and kind of weather passing through coverage varies
randomly, and there can be too much data to handle.

The data reduction algorithms form a data-driven data-flow machine.
The objective is not to lose raw radar data even if the data
processing falls behind, especially in the heaviest of weather just
short of blowing the radar tower away.

So the highest priority is given to the interface to the radar
hardware, and the components of the data-flow machine have a gradient
of priorities, such that if one level falls behind, the earlier
layers will carry on, and the overall data reduction will eventually
catch up. But of course, in heavy enough weather, this becomes
impossible.

So, in parallel, the number of free memory blocks is tracked, and if
it falls below some threshold, the later processing steps are simply
omitted, so that in extremis some outputs are not produced at all.
Given that the raw data was collected, anything can be generated
after the storm passes, up to when the radar tower went over the
moon.
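
The shedding mechanism itself is almost embarrassingly simple. In
outline (stage names and thresholds invented for illustration, not
taken from any real radar):

/* Load-shedding sketch for the data-flow pipeline described above:
   free buffer blocks are the backpressure signal.  Stage thresholds
   form a gradient - as the free pool shrinks, the latest (least
   urgent) stages are skipped first, so raw-data capture never
   starves. */

#include <stdio.h>

typedef struct {
    const char *name;
    int min_free_blocks;     /* skip this stage below this level */
} stage_t;

static const stage_t stages[] = {
    { "ingest raw data",       0 },   /* never shed: highest priority */
    { "moment estimation",    64 },
    { "clutter filtering",   128 },
    { "product generation",  256 },   /* first to go in heavy weather */
};

static int free_blocks = 150;         /* tracked by the allocator */

static void process_volume_scan(void)
{
    for (unsigned i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        if (free_blocks < stages[i].min_free_blocks) {
            printf("shedding '%s' (free=%d)\n",
                   stages[i].name, free_blocks);
            continue;                 /* omit this stage's outputs */
        }
        printf("running  '%s'\n", stages[i].name);
    }
}

int main(void)
{
    process_volume_scan();
    return 0;
}

Because ingest has a threshold of zero, it runs no matter what, and
everything downstream can be regenerated later from the raw data.
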
Joe Gwinn