Reply by Joseph Gwinn April 11, 2019
On Apr 9, 2019, Tim Williams wrote
(in article <q8it1k$3qi$1@dont-email.me>):

> "Joseph Gwinn" <joegwinn@comcast.net> wrote in message
> news:0001HW.225C3CD800A246FA70000E6E32EF@news.giganews.com...
> > They coded in plain C, inspected the generated assembly code, and tweaked the C code until the assembly code was clean and fast. It turned out that the resulting code was largely portable in that all C compilers generated clean, fast code from the same tweaked C source code, after the source code was tweaked to the first two compilers.
>
> Expanding on this some more --
>
> Just in my humble experience alone, your writing style can massively affect code generation.
>
> The optimizer is terribly, terribly far from exhaustive. (It /could/ be exhaustive -- but then users would complain of hours or years of compilation time for almost no benefit, and that's no good!) If it doesn't figure out any simple tricks, it's just going to pick the best, mediocre solution and let that be that.
>
> And mind this affects execution time about as much as it does code size. Often, compact code runs faster, especially on simpler embedded platforms. (Yeah, when pipelines and caches get involved, unrolled and inlined operations get more attractive, and the discrepancy between compact and fast code can grow.)
>
> Things the optimizer is likely to check can range from modestly unrolling or reordering loops, to factoring numerical expressions, to inlining functions and operating on the resulting mega-function, and more. All of these grow quickly in complexity, however, and the pursuit can become self-defeating.
>
> A recent example was a bit-packing function, on an 8-bit platform with hardware multiply. I wrote this a few different ways. By far, the worst was a mega-expression: between macros and carefully indented and inspected sub-expressions, the whole operation can be expressed entirely numerically. That's technically fine, but the compiler really throws up its proverbial hands and basically ends up writing out the expression long-form without any reuse of sub-expressions, or registers even(!).
>
> The next best was using a bunch of variables (which are allocated on the stack normally, but these are optimized out quickly when there are enough free registers to put them there instead, which was the case here) to hold intermediate steps, and repeating common steps in a short loop. But keep in mind that, if the variables are allocated in registers... you can't loop over them. Doing it with a loop forces it to allocate stack, get the pointer, and index the variables. Plus the memory accesses themselves, which are slower. It is not without overhead! (This is a much better deal on, say, classic 16-bit machines (x86, 68k), and most everything since.) It might even try it with the loop and array, then try it unrolled with registers, and keep the unrolled version because it's simply better!
>
> I forget what I ended up with; I think I sliced the bit pattern differently, still using a loop but getting better reuse. It's still ugly, like hundreds of bytes for something nearly trivial if the data were byte aligned.
>
> All of this optimizing is subject to the constraints of the functions executed within each expression or statement. C functions can do literally anything. Side effects are the bane of optimization. If the optimizer can't reason about being able to move a function up or down the expression tree, it's simply forced to treat that as a sequence point. (Sure, it could reason about the function's contents as if inlining it, but that would be extremely costly.)
>
> As far as I know, the optimizer is bad at guessing what functions do, in terms of side effects, so it can help greatly (and this is why they put the features in there) to add hints about the nature of the function (e.g., using const with parameters, writing pure functions when possible, etc.).
>
> (FYI, most of my experience centers around GCC. Most of this is motivated by my own observations, illuminated by some of the official documentation about the optimizing step.)

Yes, this matches my experience as well. Assembly code rules!

Joe Gwinn
Reply by April 10, 2019
On Wed, 10 Apr 2019 11:26:48 +0200, Gerhard Hoffmann <dk4xp@arcor.de>
wrote:

>On 10.04.19 at 06:40, upsidedown@downunder.com wrote:
>
>> Yes, TMS 9900 was interesting, but with main memory cycle times in
>> the order of one microsecond it was s_l_o_w. However, these days with
>> modern caches and cache management it would make sense.
>>
>
>No, never ever. That design is bad to the bone.
>
>There is a direct and unbreakable link between the computer's fastest
>and slowest operations.
>
>In the time one can do sub (Rdest,flags), Rsrc1, Rsrc2
>one has to decide in addition:
>- Is it in L1 cache?
>  if not: is it in L2 cache?
>  if not: is it in L3 cache?
>  if not: is it at least in L1 page tables...L2 page tables?
>  if not: is it swapped out -> trap, needs software to handle
>  if not: does it exist at all -> trap
>
>All that 3 times for a simple instruction. In addition to the normal
>complications for pipelining, multi issue, speculative execution.

While this is true of any register in the register set after switching the workspace pointer, a read into cache usually loads more than a single register: it loads a full cache line, containing multiple or even all registers. After this initial load, all active registers are already in the cache and no more loads from main memory are needed.

Better yet, loading a new value into the workspace pointer should automatically load the full register set in one or more cache lines. This would be better for cache consistency.

Another issue about cache consistency is how to restore the modified values into main memory. Either use "write through" to immediately save individual modified cached register values, or "write back" the whole register set into main memory just prior to workspace pointer reload.

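The two policies named above can be sketched as a toy model in C. This is only an illustration under stated assumptions, not a description of any real cache controller: memory is word-indexed, the 16 cached registers stand in for one or more cache lines, and every name here (`load_workspace`, `reg_write_back`, `reg_write_through`, `mem_writes`) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define NREGS 16u

static uint16_t main_memory[256];    /* toy main memory, word-indexed (assumption) */
static uint16_t cached_regs[NREGS];  /* stands in for the cache line(s) holding the workspace */
static uint16_t cached_wp;           /* workspace pointer as a word index into main_memory */
static unsigned mem_writes;          /* counts words written to main memory */

/* Workspace pointer reload: write the whole old register set back,
   then bulk-load the new register set in one go. */
static void load_workspace(uint16_t new_wp)
{
    memcpy(&main_memory[cached_wp], cached_regs, sizeof cached_regs);
    mem_writes += NREGS;
    cached_wp = new_wp;
    memcpy(cached_regs, &main_memory[new_wp], sizeof cached_regs);
}

/* "Write back": update only the cached copy; main memory is stale
   until the next workspace pointer reload. */
static void reg_write_back(unsigned n, uint16_t value)
{
    cached_regs[n & 15u] = value;
}

/* "Write through": update the cached copy and main memory together. */
static void reg_write_through(unsigned n, uint16_t value)
{
    cached_regs[n & 15u] = value;
    main_memory[cached_wp + (n & 15u)] = value;
    mem_writes++;
}
```

In this model write-through costs one memory write per register write, while write-back costs a fixed 16 words per workspace switch regardless of how many registers changed. Which is cheaper depends on how often registers are written between switches.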
>Even large register files or a huge number of renaming registers or
>too large caches at a certain level are bad. Not having constant
>replacement pressure means that the resource is too fat and therefore
>too slow.

Register renaming is just reloading register blocks, often with two or more workspace pointers.
>Some registers are badly needed to keep the complexity out of EVERY
>instruction.

Using bulk load of the whole register set does just that.
>
>cheers,
>Gerhard
Reply by Gerhard Hoffmann April 10, 2019
On 10.04.19 at 06:40, upsidedown@downunder.com wrote:

> Yes, TMS 9900 was interesting, but with main memory cycle times in
> the order of one microsecond it was s_l_o_w. However, these days with
> modern caches and cache management it would make sense.
>

No, never ever. That design is bad to the bone.

There is a direct and unbreakable link between the computer's fastest and slowest operations.

In the time one can do sub (Rdest,flags), Rsrc1, Rsrc2
one has to decide in addition:
- Is it in L1 cache?
  if not: is it in L2 cache?
  if not: is it in L3 cache?
  if not: is it at least in L1 page tables...L2 page tables?
  if not: is it swapped out -> trap, needs software to handle
  if not: does it exist at all -> trap

All that 3 times for a simple instruction. In addition to the normal complications for pipelining, multi issue, speculative execution.

Even large register files or a huge number of renaming registers or too large caches at a certain level are bad. Not having constant replacement pressure means that the resource is too fat and therefore too slow.

Some registers are badly needed to keep the complexity out of EVERY instruction.

cheers,
Gerhard
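The decision chain in the list above can be written out as a small C sketch. It is a toy model only: the type and function names (`operand_state`, `resolve_operand`, `resolve_sub`) are hypothetical, and real hardware makes these checks in parallel pipelined logic rather than as sequential code.

```c
#include <stdbool.h>

/* Possible outcomes, ordered from fastest to slowest. */
typedef enum {
    HIT_L1,
    HIT_L2,
    HIT_L3,
    HIT_RAM,              /* found via the page tables, resident in main memory */
    TRAP_SWAPPED_OUT,     /* page exists but is swapped out; software must handle it */
    TRAP_NO_SUCH_ADDRESS  /* the address does not exist at all */
} access_result;

/* Hypothetical snapshot of where one memory-resident register currently lives. */
typedef struct {
    bool in_l1, in_l2, in_l3, in_page_tables, swapped_out;
} operand_state;

/* Walk the chain once for a single operand. */
static access_result resolve_operand(const operand_state *s)
{
    if (s->in_l1)          return HIT_L1;
    if (s->in_l2)          return HIT_L2;
    if (s->in_l3)          return HIT_L3;
    if (s->in_page_tables) return HIT_RAM;
    if (s->swapped_out)    return TRAP_SWAPPED_OUT;
    return TRAP_NO_SUCH_ADDRESS;
}

/* "All that 3 times": a three-operand sub is only as fast as its slowest operand. */
static access_result resolve_sub(const operand_state ops[3])
{
    access_result worst = HIT_L1;
    for (int i = 0; i < 3; i++) {
        access_result r = resolve_operand(&ops[i]);
        if (r > worst)
            worst = r;
    }
    return worst;
}
```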
Reply by Tauno Voipio April 10, 2019
On 10.4.19 03:30, Tim Williams wrote:
> "Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message > news:q8j19j$nld$1@solani.org... >> National tried to implement indexing over registers in their 16032 / >> 32032 processors. They didn't get it to work. Also, it was a bad idea. >> When the array is small, then there is no real advantage. When it is >> large, it will not fit into the registers. > > Cool.&#4294967295; In that, it was something to try, that didn't work out, but now > we know. > > There's been a number of architectures with memory mapped registers (not > including memory-as-register architectures, which are worse :^) ), but > they never really seem to stick around.&#4294967295; IIRC, AVR for example was > introduced with it, but most examples today have dropped it. > > In some cases, you can still hack it (self modifying code), but that's > almost always worse (due to instruction pipelines and caches).&#4294967295; Other > times, it's specifically forbidden (physical ROM, Harvard architecture, > execute-only memory spaces, etc.). > > Perhaps ironically, the feature still lives on, but in a more limited > way, as register renaming is a well established feature in more advanced > hardware. > > Tim
One moderately successful was Sun / Oracle Sparc, with 24 of the 32 registers windowed into the stack. A different story is that it needs the branch delay slots in the traditional RISC way, and it makes the assembly code pretty difficult to read (and still more to write). -- -TV
Reply by April 10, 2019
On Tue, 9 Apr 2019 19:30:18 -0500, "Tim Williams"
<tiwill@seventransistorlabs.com> wrote:

>"Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message
>news:q8j19j$nld$1@solani.org...
>> National tried to implement indexing over registers in their 16032 /
>> 32032 processors. They didn't get it to work. Also, it was a bad idea.
>> When the array is small, then there is no real advantage. When it is
>> large, it will not fit into the registers.
>
>Cool. In that, it was something to try, that didn't work out, but now we
>know.
>
>There's been a number of architectures with memory mapped registers (not
>including memory-as-register architectures, which are worse :^) ), but they
>never really seem to stick around.

Yes, TMS 9900 was interesting, but with main memory cycle times in the order of one microsecond it was s_l_o_w. However, these days with modern caches and cache management it would make sense.
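For readers who have not met the TMS 9900 scheme: its 16 general registers are not on-chip storage but 16 consecutive words of main memory starting at the address held in the workspace pointer (WP). A minimal C sketch of that idea follows. The memory size, the word-indexed addressing, and the function names are assumptions made for illustration, and the caller is assumed to keep the workspace inside the toy memory.

```c
#include <stdint.h>

#define MEM_WORDS 1024u               /* toy memory size (assumption) */

static uint16_t memory[MEM_WORDS];    /* main memory, word-indexed for simplicity */
static uint16_t wp;                   /* workspace pointer, here a word index */

/* Register Rn is simply the n-th word of the current workspace,
   so every register access is a memory access. */
static uint16_t reg_read(unsigned n)
{
    return memory[wp + (n & 15u)];
}

static void reg_write(unsigned n, uint16_t value)
{
    memory[wp + (n & 15u)] = value;
}

/* A context switch only changes the pointer; no registers are copied. */
static void switch_workspace(uint16_t new_wp)
{
    wp = new_wp;
}
```

The sketch shows both sides of the argument in this thread: switching context is a single pointer update, but each register read or write pays for a trip through the memory hierarchy.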
Reply by Gerhard Hoffmann April 9, 2019
On 10.04.19 at 02:30, Tim Williams wrote:
> "Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message
> news:q8j19j$nld$1@solani.org...
>> National tried to implement indexing over registers in their 16032 /
>> 32032 processors. They didn't get it to work. Also, it was a bad idea.
>> When the array is small, then there is no real advantage. When it is
>> large, it will not fit into the registers.
>
> Cool. In that, it was something to try, that didn't work out, but now
> we know.

The whole family was quite buggy. Some friends of mine tried to port Andy Tanenbaum's p-code machine to it.
> There's been a number of architectures with memory mapped registers

DEC system 10. That wasn't bad.

> (not including memory-as-register architectures, which are worse :^) ), but
> they never really seem to stick around.

OMG! TMS9900. What a turd.

regards, Gerhard
Reply by Tim Williams April 9, 2019
"Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message 
news:q8j19j$nld$1@solani.org...
> National tried to implement indexing over registers in their 16032 /
> 32032 processors. They didn't get it to work. Also, it was a bad idea.
> When the array is small, then there is no real advantage. When it is
> large, it will not fit into the registers.

Cool. In that, it was something to try, that didn't work out, but now we know.

There's been a number of architectures with memory mapped registers (not including memory-as-register architectures, which are worse :^) ), but they never really seem to stick around. IIRC, AVR for example was introduced with it, but most examples today have dropped it.

In some cases, you can still hack it (self modifying code), but that's almost always worse (due to instruction pipelines and caches). Other times, it's specifically forbidden (physical ROM, Harvard architecture, execute-only memory spaces, etc.).

Perhaps ironically, the feature still lives on, but in a more limited way, as register renaming is a well established feature in more advanced hardware.

Tim

-- 
Seven Transistor Labs, LLC
Electrical Engineering Consultation and Design
Website: https://www.seventransistorlabs.com/
Reply by Gerhard Hoffmann April 9, 2019
On 09.04.19 at 21:50, Tim Williams wrote:

> The next best was using a bunch of variables (which are allocated on the
> stack normally, but these are optimized out quickly when there are
> enough free registers to put them there instead, which was the case
> here) to hold intermediate steps, and repeating common steps in a short
> loop. But keep in mind that, if the variables are allocated in
> registers... you can't loop over them. Doing it with a loop, forces it
> to allocate stack, get the pointer, and index the variables. Plus the
> memory accesses themselves, which are slower. It is not without

National tried to implement indexing over registers in their 16032 / 32032 processors. They didn't get it to work. Also, it was a bad idea. When the array is small, then there is no real advantage. When it is large, it will not fit into the registers.

regards, Gerhard
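The trade-off in the quoted paragraph (named scalars that can stay in registers versus an array that a loop can index) can be sketched in C. This is not the bit-packing routine from the original post, which is not shown in the thread. It is a hypothetical example that packs four 6-bit fields into three bytes, written both ways so the generated assembly can be compared on a given compiler and target.

```c
#include <stdint.h>

/* Variant 1: named scalars. Nothing here needs an address, so a compiler
   is free to keep a, b, c, d in registers if enough are available. */
void pack4x6_scalars(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                     uint8_t out[3])
{
    a &= 0x3F; b &= 0x3F; c &= 0x3F; d &= 0x3F;
    out[0] = (uint8_t)((a << 2) | (b >> 4));
    out[1] = (uint8_t)((b << 4) | (c >> 2));
    out[2] = (uint8_t)((c << 6) | d);
}

/* Variant 2: loop over an array. Indexing in[i] needs addressable storage,
   so unless the compiler unrolls the loop, the fields are fetched through
   a pointer, and the 32-bit accumulator is costly on an 8-bit target. */
void pack4x6_loop(const uint8_t in[4], uint8_t out[3])
{
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc = (acc << 6) | (in[i] & 0x3Fu);
    out[0] = (uint8_t)(acc >> 16);
    out[1] = (uint8_t)(acc >> 8);
    out[2] = (uint8_t)acc;
}
```

Both variants produce the same bytes. Which one compiles to smaller or faster code depends on the compiler, the optimization level, and the target, so the only reliable check is to inspect the assembly, as suggested earlier in the thread.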
Reply by Tim Williams April 9, 2019
"Joseph Gwinn" <joegwinn@comcast.net> wrote in message 
news:0001HW.225C3CD800A246FA70000E6E32EF@news.giganews.com...
> They coded in plain C, inspected the generated assembly code, and tweaked the C code until the assembly code was clean and fast. It turned out that the resulting code was largely portable in that all C compilers generated clean, fast code from the same tweaked C source code, after the source code was tweaked to the first two compilers.

Expanding on this some more --

Just in my humble experience alone, your writing style can massively affect code generation.

The optimizer is terribly, terribly far from exhaustive. (It /could/ be exhaustive -- but then users would complain of hours or years of compilation time for almost no benefit, and that's no good!) If it doesn't figure out any simple tricks, it's just going to pick the best, mediocre solution and let that be that.

And mind this affects execution time about as much as it does code size. Often, compact code runs faster, especially on simpler embedded platforms. (Yeah, when pipelines and caches get involved, unrolled and inlined operations get more attractive, and the discrepancy between compact and fast code can grow.)

Things the optimizer is likely to check can range from modestly unrolling or reordering loops, to factoring numerical expressions, to inlining functions and operating on the resulting mega-function, and more. All of these grow quickly in complexity, however, and the pursuit can become self-defeating.

A recent example was a bit-packing function, on an 8-bit platform with hardware multiply. I wrote this a few different ways. By far, the worst was a mega-expression: between macros and carefully indented and inspected sub-expressions, the whole operation can be expressed entirely numerically. That's technically fine, but the compiler really throws up its proverbial hands and basically ends up writing out the expression long-form without any reuse of sub-expressions, or registers even(!).

The next best was using a bunch of variables (which are allocated on the stack normally, but these are optimized out quickly when there are enough free registers to put them there instead, which was the case here) to hold intermediate steps, and repeating common steps in a short loop. But keep in mind that, if the variables are allocated in registers... you can't loop over them. Doing it with a loop forces it to allocate stack, get the pointer, and index the variables. Plus the memory accesses themselves, which are slower. It is not without overhead! (This is a much better deal on, say, classic 16-bit machines (x86, 68k), and most everything since.) It might even try it with the loop and array, then try it unrolled with registers, and keep the unrolled version because it's simply better!

I forget what I ended up with; I think I sliced the bit pattern differently, still using a loop but getting better reuse. It's still ugly, like hundreds of bytes for something nearly trivial if the data were byte aligned.

All of this optimizing is subject to the constraints of the functions executed within each expression or statement. C functions can do literally anything. Side effects are the bane of optimization. If the optimizer can't reason about being able to move a function up or down the expression tree, it's simply forced to treat that as a sequence point. (Sure, it could reason about the function's contents as if inlining it, but that would be extremely costly.)

As far as I know, the optimizer is bad at guessing what functions do, in terms of side effects, so it can help greatly (and this is why they put the features in there) to add hints about the nature of the function (e.g., using const with parameters, writing pure functions when possible, etc.).

(FYI, most of my experience centers around GCC. Most of this is motivated by my own observations, illuminated by some of the official documentation about the optimizing step.)

Tim

-- 
Seven Transistor Labs, LLC
Electrical Engineering Consultation and Design
Website: https://www.seventransistorlabs.com/
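One concrete form the "hints about the nature of the function" can take in GCC (Clang accepts the same syntax) is the `const` and `pure` function attributes. The sketch below is a minimal illustration and not code from the post: `parity8`, `checksum`, and `checksum_twice` are hypothetical names. The attributes are promises made by the programmer. The compiler does not verify them, and whether it actually merges repeated calls depends on the compiler version and optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. The result depends only on the argument value and
   the function touches no memory, so "const" is a truthful promise. */
__attribute__((const))
static uint8_t parity8(uint8_t x)
{
    x ^= (uint8_t)(x >> 4);
    x ^= (uint8_t)(x >> 2);
    x ^= (uint8_t)(x >> 1);
    return x & 1u;
}

/* Hypothetical helper. It reads memory through buf but changes nothing
   the caller can observe, so the weaker "pure" promise applies. */
__attribute__((pure))
static uint16_t checksum(const uint8_t *buf, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint16_t)(sum + buf[i]);
    return sum;
}

/* With the hint in place the compiler is allowed (not obliged) to call
   checksum() once here and reuse the result. */
static uint16_t checksum_twice(const uint8_t *buf, size_t len)
{
    return (uint16_t)(checksum(buf, len) + checksum(buf, len));
}
```

Compiling with `-O2 -S` and reading the output for `checksum_twice` is the quickest way to see whether a given compiler took the hint.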
Reply by Joseph Gwinn April 9, 2019
On Apr 9, 2019, Tim Williams wrote
(in article <q8hc77$nr7$1@dont-email.me>):

> "Joseph Gwinn" <joegwinn@comcast.net> wrote in message
> news:0001HW.225C3CD800A246FA70000E6E32EF@news.giganews.com...
> > Twenty years ago, when I was involved with developing what is now called Middleware for a radar fire-control system for ship self defense against cruise missiles (no-kidding realtime - the final exam arrived at Mach 2, and fumbling was bad for your health), our main vendor had an interesting approach. They coded in plain C, inspected the generated assembly code, and tweaked the C code until the assembly code was clean and fast. It turned out that the resulting code was largely portable in that all C compilers generated clean, fast code from the same tweaked C source code, after the source code was tweaked to the first two compilers.

I should mention that the mission code was written in Ada83, but the infrastructure was in C. It's a long story, but Ada was a major cause of many problems. This was the last Ada project in that area - the Ada mandate was rescinded during the project.

> Schools need to teach this [more often?]. If you're working in a low level language* like C, you need to think about the end product: ultimately, you're telling the compiler how to write a program. C is not exactly a specification or description language, but that's not an invalid perspective to take.

The original intent of C was to develop a portable equivalent to assembly language, so UNIX could be ported from computer type to computer type. This is documented in K&R. But whatever the language level, C won the language wars hands down. Ada83 was an early casualty.

> (Personally, I've seen VHDL taught that way, as describing instances of logic gates through semantic structures, but not C. I did get ASM and then C from the same prof as did many other students on the same track, but without the explicit call to inspect one's machine code, I doubt anyone made the connection.)
>
> *Face it, C is low. Well, "medium" would be more charitable. It's assembler, cleaned up with richer macros and an optimizer. If it were a high level language, it would know better than to give mere /developers/ the pointers^Hkeys to the missile silos!

One of the problems with Ada was precisely that it attempted to prevent errors by constraining the programmers. Having right hand tied to left foot does prevent mistakes, but it turns out that the price was too high.

Ada83 enforced a 1970s academic theory of how programs ought to be structured, from people who had zero experience of embedded realtime programming. There was zero support for hardware interfaces (no shared memory, no volatile variables, etc). The whole issue of priority inversion was simply missed - priority inversion was in the embedded-realtime lore, but had neither a name nor a literature, and so was invisible to academics.

And Ada83 was locked down, and could not be changed to fix any of the many problems. So, we fixed such problems in the C-coded runtime, where Lady Ada could neither see nor touch - she lived in an artificial space designed to never surprise her.

> As for the matter of writing mission-critical software in C, I will withhold judgement on that... :^)

Too late. Most military mission-critical code is written in C++ these days, with direct hardware control in ANSI C. The operating system is some form of Linux.

Ada retains a niche in safety-critical code, where DO-178 is implemented. But DO-178 is so heavy that it actually does not matter what language one uses, and one can argue that assembler is safer than any HOL, because with assembler nothing is hidden.

Joe Gwinn