On Apr 9, 2019, Tim Williams wrote
(in article <q8it1k$3qi$1@dont-email.me>):

> "Joseph Gwinn" <joegwinn@comcast.net> wrote in message
> news:0001HW.225C3CD800A246FA70000E6E32EF@news.giganews.com...
> > They coded in plain C, inspected the generated assembly code, and
> > tweaked the C code until the assembly code was clean and fast. It turned out
> > that the resulting code was largely portable in that all C compilers
> > generated clean, fast code from the same tweaked C source code, after the
> > source code was tweaked to the first two compilers.
>
> Expanding on this some more --
>
> Just in my humble experience alone, your writing style can massively affect
> code generation.
>
> The optimizer is terribly, terribly far from exhaustive. (It /could/ be
> exhaustive -- but then users would complain of hours or years of compilation
> time for almost no benefit, and that's no good!) If it doesn't figure out
> any simple tricks, it's just going to pick the best, mediocre solution and
> let that be that.
>
> And mind this affects execution time about as much as it does code size.
> Often, compact code runs faster, especially on simpler embedded platforms.
> (Yeah, when pipelines and caches get involved, unrolled and inlined
> operations get more attractive, and the discrepancy between compact and fast
> code can grow.)
>
> Things the optimizer is likely to check, can range from modestly unrolling
> or reordering loops, to factoring numerical expressions, to inlining
> functions and operating on the resulting mega-function, and more. All of
> these grow quickly in complexity, however, and the pursuit can become
> self-defeating.
>
> A recent example was a bit-packing function, on an 8-bit platform with
> hardware multiply. I wrote this a few different ways. By far, the worst
> was a mega-expression: between macros and carefully indented and inspected
> sub-expressions, the whole operation can be expressed entirely numerically.
> That's technically fine, but the compiler really throws up its proverbial
> hands and basically ends up writing out the expression long-form without any
> reuse of sub-expressions, or registers even(!).
>
> The next best was using a bunch of variables (which are allocated on the
> stack normally, but these are optimized out quickly when there are enough
> free registers to put them there instead, which was the case here) to hold
> intermediate steps, and repeating common steps in a short loop. But keep in
> mind that, if the variables are allocated in registers... you can't loop
> over them. Doing it with a loop, forces it to allocate stack, get the
> pointer, and index the variables. Plus the memory accesses themselves,
> which are slower. It is not without overhead! (This is a much better deal
> on, say, classic 16-bit machines (x86, 68k), and most everything since.) It
> might even try it with the loop and array, then try it unrolled with
> registers, and keep the unrolled version because it's simply better!
>
> I forget what I ended up with; I think I sliced the bit pattern differently,
> still using a loop but getting better reuse. It's still ugly, like hundreds
> of bytes for something nearly trivial if the data were byte aligned.
>
> All of this optimizing is subject to the constraints of the functions
> executed within each expression or statement. C functions can do literally
> anything. Side effects are the bane of optimization. If the optimizer
> can't reason about being able to move a function up or down the expression
> tree, it's simply forced to treat that as a sequence point. (Sure, it could
> reason about the function's contents as if inlining it, but that would be
> extremely costly.)
>
> As far as I know, the optimizer is bad at guessing what functions do, in
> terms of side effects, so it can help greatly (and this is why they put the
> features in there) to add hints about the nature of the function (e.g.,
> using const with parameters, writing pure functions when possible, etc.).
>
> (FYI, most of my experience centers around GCC. Most of this is motivated
> by my own observations, illuminated by some of the official documentation
> about the optimizing step.)

Yes, this matches my experience as well. Assembly code rules!

Joe Gwinn

On Wed, 10 Apr 2019 11:26:48 +0200, Gerhard Hoffmann <dk4xp@arcor.de>
wrote:

>On 10.04.19 at 06:40, upsidedown@downunder.com wrote:
>
>> Yes, TMS 9900 was interesting, but with  main memory cycle times in
>> the order of one microsecond  it was s_l_o_w. However, these days with
>> modern caches and cache management it would make sense.
>> 
>
>No, never ever. That design is bad to the bone.
>
>There is a direct and unbreakable link between the computer's fastest
>and slowest operations.
>
>
>In the time one can do   sub  (Rdest,flags),  Rsrc1, Rsrc2
>
>one has to decide in addition:
>- It is in L1 cache
>    if not: is it in L2 cache
>      if not: is it in L3 cache
>         if not: is it at least in L1 page tables...L2 page tables
>           if not: is swapped out -> trap, needs software to handle
>           if not: does it exist at all -> trap
>
>All that 3 times for a simple instruction. In addition to the normal
>complications for pipelining, multi issue, speculative execution.

While this is true of any register in the register set right after
switching the workspace pointer, a read into cache usually fetches more
than a single register: it loads a full cache line, containing several
or even all of the registers. After this initial load, all active
registers are already in the cache and no more loads from main memory
are needed.

Better yet, loading a new value into the workspace pointer should
automatically load the full register set in one or more cache lines.
This would be better for cache consistency.

Another issue with cache consistency is how to get the modified
values back into main memory. Either use "write through" to save each
modified cached register value immediately, or "write back" the
whole register set into main memory just before the workspace pointer
is reloaded.
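
A minimal sketch of the "write back" variant, only to illustrate the idea
above. The TMS 9900 does keep its 16 workspace registers in main memory at
the address held in the workspace pointer; everything else here (the names
cpu_t, load_wp, reg_read, reg_write, and the on-chip copy itself) is
hypothetical and is not how any real part works.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NREGS 16  /* TMS 9900: 16 workspace registers of 16 bits each */

/* Hypothetical model: registers live in main memory at *wp, and the
 * CPU keeps a fast on-chip copy of the active register set. */
typedef struct {
    uint16_t *wp;            /* workspace pointer into "main memory" */
    uint16_t  cache[NREGS];  /* cached copy of the active workspace */
} cpu_t;

/* Reload the workspace pointer: write the old register set back in
 * one block, then bulk-load the new one in one block. */
static void load_wp(cpu_t *cpu, uint16_t *new_wp)
{
    assert(new_wp != NULL);
    if (cpu->wp != NULL)
        memcpy(cpu->wp, cpu->cache, sizeof cpu->cache);  /* write back */
    cpu->wp = new_wp;
    memcpy(cpu->cache, new_wp, sizeof cpu->cache);       /* bulk load */
}

/* Between reloads, register accesses never touch main memory. */
static uint16_t reg_read(const cpu_t *cpu, unsigned r)
{
    return cpu->cache[r % NREGS];
}

static void reg_write(cpu_t *cpu, unsigned r, uint16_t v)
{
    cpu->cache[r % NREGS] = v;
}
```

With write-back, main memory holds stale register values until the next
workspace pointer reload; a write-through variant would also store to
cpu->wp[r] inside reg_write.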

>Even large register files or a huge number of renaming registers or
>too large caches at a certain level are bad. Not having constant
>replacement pressure means that the resource is too fat and therefore
>too slow.

Register renaming is just reloading register blocks, often with two or
more workspace pointers.

>Some registers are badly needed to keep the complexity out of EVERY
>instruction.

Using bulk load of the whole register set does just that.

>
>cheers,
>Gerhard

On 10.04.19 at 06:40, upsidedown@downunder.com wrote:

> Yes, TMS 9900 was interesting, but with  main memory cycle times in
> the order of one microsecond  it was s_l_o_w. However, these days with
> modern caches and cache management it would make sense.
> 

No, never ever. That design is bad to the bone.

There is a direct and unbreakable link between the computer's fastest
and slowest operations.


In the time one can do   sub  (Rdest,flags),  Rsrc1, Rsrc2

one has to decide in addition:
- It is in L1 cache
    if not: is it in L2 cache
      if not: is it in L3 cache
         if not: is it at least in L1 page tables...L2 page tables
           if not: is swapped out -> trap, needs software to handle
           if not: does it exist at all -> trap

All that 3 times for a simple instruction. In addition to the normal
complications for pipelining, multi issue, speculative execution.

Even large register files or a huge number of renaming registers or
too large caches at a certain level are bad. Not having constant
replacement pressure means that the resource is too fat and therefore
too slow.

Some registers are badly needed to keep the complexity out of EVERY
instruction.

cheers,
Gerhard

On 10.4.19 03:30, Tim Williams wrote:
> "Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message 
> news:q8j19j$nld$1@solani.org...
>> National tried to implement indexing over registers in their 16032 /
>> 32032 processors. They didn't get it to work. Also, it was a bad idea.
>> When the array is small, then there is no real advantage. When it is
>> large, it will not fit into the registers.
> 
> Cool.  In that, it was something to try, that didn't work out, but now 
> we know.
> 
> There's been a number of architectures with memory mapped registers (not 
> including memory-as-register architectures, which are worse :^) ), but 
> they never really seem to stick around.  IIRC, AVR for example was 
> introduced with it, but most examples today have dropped it.
> 
> In some cases, you can still hack it (self modifying code), but that's 
> almost always worse (due to instruction pipelines and caches).  Other 
> times, it's specifically forbidden (physical ROM, Harvard architecture, 
> execute-only memory spaces, etc.).
> 
> Perhaps ironically, the feature still lives on, but in a more limited 
> way, as register renaming is a well established feature in more advanced 
> hardware.
> 
> Tim


One moderately successful example was Sun / Oracle SPARC, with 24 of
the 32 registers windowed into the stack.

A different story is that it needs the branch delay slots in the
traditional RISC way, and that makes the assembly code pretty difficult
to read (and even more so to write).

-- 

-TV

On Tue, 9 Apr 2019 19:30:18 -0500, "Tim Williams"
<tiwill@seventransistorlabs.com> wrote:

>"Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message 
>news:q8j19j$nld$1@solani.org...
>> National tried to implement indexing over registers in their 16032 /
>> 32032 processors. They didn't get it to work. Also, it was a bad idea.
>> When the array is small, then there is no real advantage. When it is
>> large, it will not fit into the registers.
>
>Cool.  In that, it was something to try, that didn't work out, but now we 
>know.
>
>There's been a number of architectures with memory mapped registers (not 
>including memory-as-register architectures, which are worse :^) ), but they 
>never really seem to stick around.  

Yes, TMS 9900 was interesting, but with  main memory cycle times in
the order of one microsecond  it was s_l_o_w. However, these days with
modern caches and cache management it would make sense.

On 10.04.19 at 02:30, Tim Williams wrote:
> "Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message 
> news:q8j19j$nld$1@solani.org...
>> National tried to implement indexing over registers in their 16032 /
>> 32032 processors. They didn't get it to work. Also, it was a bad idea.
>> When the array is small, then there is no real advantage. When it is
>> large, it will not fit into the registers.
> 
> Cool.  In that, it was something to try, that didn't work out, but now 
> we know.

The whole family was quite buggy. Some friends of mine tried to port
Andy Tanenbaum's p-code machine to it.


> There's been a number of architectures with memory mapped registers 

DEC system 10. That wasn't bad.

> (not
> including memory-as-register architectures, which are worse :^) ), but 
> they never really seem to stick around.  

OMG!   TMS9900.  What a turd.


regards,
Gerhard

"Gerhard Hoffmann" <dk4xp@arcor.de> wrote in message 
news:q8j19j$nld$1@solani.org...
> National tried to implement indexing over registers in their 16032 /
> 32032 processors. They didn't get it to work. Also, it was a bad idea.
> When the array is small, then there is no real advantage. When it is
> large, it will not fit into the registers.

Cool.  In that, it was something to try, that didn't work out, but now we 
know.

There's been a number of architectures with memory mapped registers (not 
including memory-as-register architectures, which are worse :^) ), but they 
never really seem to stick around.  IIRC, AVR for example was introduced 
with it, but most examples today have dropped it.

In some cases, you can still hack it (self modifying code), but that's 
almost always worse (due to instruction pipelines and caches).  Other times, 
it's specifically forbidden (physical ROM, Harvard architecture, 
execute-only memory spaces, etc.).

Perhaps ironically, the feature still lives on, but in a more limited way, 
as register renaming is a well established feature in more advanced 
hardware.

Tim

-- 
Seven Transistor Labs, LLC
Electrical Engineering Consultation and Design
Website: https://www.seventransistorlabs.com/

On 09.04.19 at 21:50, Tim Williams wrote:

> The next best was using a bunch of variables (which are allocated on the 
> stack normally, but these are optimized out quickly when there are 
> enough free registers to put them there instead, which was the case 
> here) to hold intermediate steps, and repeating common steps in a short 
> loop.  But keep in mind that, if the variables are allocated in 
> registers... you can't loop over them.  Doing it with a loop, forces it 
> to allocate stack, get the pointer, and index the variables.  Plus the 
> memory accesses themselves, which are slower.  It is not without 

National tried to implement indexing over registers in their 16032 /
32032 processors. They didn't get it to work. Also, it was a bad idea.
When the array is small, then there is no real advantage. When it is
large, it will not fit into the registers.

regards, Gerhard

"Joseph Gwinn" <joegwinn@comcast.net> wrote in message 
news:0001HW.225C3CD800A246FA70000E6E32EF@news.giganews.com...
> They coded in plain C, inspected the generated assembly code, and
> tweaked the C code until the assembly code was clean and fast. It turned out
> that the resulting code was largely portable in that all C compilers
> generated clean, fast code from the same tweaked C source code, after the
> source code was tweaked to the first two compilers.
>

Expanding on this some more --

Just in my humble experience alone, your writing style can massively affect 
code generation.

The optimizer is terribly, terribly far from exhaustive.  (It /could/ be 
exhaustive -- but then users would complain of hours or years of compilation 
time for almost no benefit, and that's no good!)  If it doesn't figure out 
any simple tricks, it's just going to pick the best, mediocre solution and 
let that be that.

And mind this affects execution time about as much as it does code size. 
Often, compact code runs faster, especially on simpler embedded platforms. 
(Yeah, when pipelines and caches get involved, unrolled and inlined 
operations get more attractive, and the discrepancy between compact and fast 
code can grow.)

Things the optimizer is likely to check, can range from modestly unrolling 
or reordering loops, to factoring numerical expressions, to inlining 
functions and operating on the resulting mega-function, and more.  All of 
these grow quickly in complexity, however, and the pursuit can become 
self-defeating.

A recent example was a bit-packing function, on an 8-bit platform with 
hardware multiply.  I wrote this a few different ways.  By far, the worst 
was a mega-expression: between macros and carefully indented and inspected 
sub-expressions, the whole operation can be expressed entirely numerically. 
That's technically fine, but the compiler really throws up its proverbial 
hands and basically ends up writing out the expression long-form without any 
reuse of sub-expressions, or registers even(!).

The next best was using a bunch of variables (which are allocated on the 
stack normally, but these are optimized out quickly when there are enough 
free registers to put them there instead, which was the case here) to hold 
intermediate steps, and repeating common steps in a short loop.  But keep in 
mind that, if the variables are allocated in registers... you can't loop 
over them.  Doing it with a loop, forces it to allocate stack, get the 
pointer, and index the variables.  Plus the memory accesses themselves, 
which are slower.  It is not without overhead!  (This is a much better deal 
on, say, classic 16-bit machines (x86, 68k), and most everything since.)  It 
might even try it with the loop and array, then try it unrolled with 
registers, and keep the unrolled version because it's simply better!

I forget what I ended up with; I think I sliced the bit pattern differently, 
still using a loop but getting better reuse.  It's still ugly, like hundreds 
of bytes for something nearly trivial if the data were byte aligned.
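
For illustration only (this is not the code from the post, which is not
shown): a sketch of the two styles contrasted above, packing four 6-bit
fields into three bytes. The function names and the 6-bit layout are
assumptions. The first variant uses named locals that a compiler can keep
in registers; the second is a loop with a bit accumulator. Which one
compiles better depends on the target and compiler, so inspect the
generated assembly rather than assume.

```c
#include <assert.h>
#include <stdint.h>

/* Variant A: straight-line code with named temporaries. Nothing is
 * indexed, so the temporaries need no stack array. */
void pack6_unrolled(const uint8_t in[4], uint8_t out[3])
{
    uint8_t a = in[0] & 0x3F;
    uint8_t b = in[1] & 0x3F;
    uint8_t c = in[2] & 0x3F;
    uint8_t d = in[3] & 0x3F;

    out[0] = (uint8_t)((a << 2) | (b >> 4));
    out[1] = (uint8_t)((b << 4) | (c >> 2));
    out[2] = (uint8_t)((c << 6) | d);
}

/* Variant B: loop with a bit accumulator. More compact source, but the
 * indexed accesses and variable shifts can cost more on a small 8-bit
 * core. */
void pack6_loop(const uint8_t in[4], uint8_t out[3])
{
    uint16_t acc = 0;   /* pending bits, newest in the low end */
    uint8_t nbits = 0;  /* number of pending bits in acc */
    uint8_t o = 0;      /* next output index */

    for (uint8_t i = 0; i < 4; i++) {
        /* unsigned shift so wrap-around is well defined with 16-bit int */
        acc = (uint16_t)(((unsigned)acc << 6) | (in[i] & 0x3Fu));
        nbits = (uint8_t)(nbits + 6);
        if (nbits >= 8) {
            nbits = (uint8_t)(nbits - 8);
            out[o++] = (uint8_t)(acc >> nbits);
        }
    }
}
```

Both variants produce the same three bytes for the same input, so either
can serve as a reference when checking the other.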

All of this optimizing is subject to the constraints of the functions 
executed within each expression or statement.  C functions can do literally 
anything.  Side effects are the bane of optimization.  If the optimizer 
can't reason about being able to move a function up or down the expression 
tree, it's simply forced to treat that as a sequence point.  (Sure, it could 
reason about the function's contents as if inlining it, but that would be 
extremely costly.)

As far as I know, the optimizer is bad at guessing what functions do, in 
terms of side effects, so it can help greatly (and this is why they put the 
features in there) to add hints about the nature of the function (e.g., 
using const with parameters, writing pure functions when possible, etc.).
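
As a rough sketch of such hints, assuming GCC (or a compatible compiler
such as Clang): the function attributes "const" and "pure" are documented
GCC extensions. The functions swap_nibbles, checksum and twice_checksum
are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "const": the result depends only on the argument values; the function
 * reads no memory through pointers and has no side effects. */
__attribute__((const))
static uint8_t swap_nibbles(uint8_t x)
{
    return (uint8_t)((x << 4) | (x >> 4));
}

/* "pure": no side effects, but it may read memory (here via buf). The
 * const qualifier on the parameter documents that buf is only read. */
__attribute__((pure))
static uint16_t checksum(const uint8_t *buf, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint16_t)(sum + buf[i]);
    return sum;
}

/* With the hint, the compiler is allowed to evaluate checksum() once
 * and reuse the result, since nothing is written in between. */
static uint16_t twice_checksum(const uint8_t *buf, size_t len)
{
    return (uint16_t)(checksum(buf, len) + checksum(buf, len));
}
```

The attributes matter most when only a declaration is visible (for example
in a header), because then the compiler cannot look at the body. Here the
bodies are in the same file, so the optimizer may reach the same
conclusion on its own.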

(FYI, most of my experience centers around GCC.  Most of this is motivated 
by my own observations, illuminated by some of the official documentation 
about the optimizing step.)

Tim

-- 
Seven Transistor Labs, LLC
Electrical Engineering Consultation and Design
Website: https://www.seventransistorlabs.com/

On Apr 9, 2019, Tim Williams wrote
(in article <q8hc77$nr7$1@dont-email.me>):

> "Joseph Gwinn" <joegwinn@comcast.net> wrote in message
> news:0001HW.225C3CD800A246FA70000E6E32EF@news.giganews.com...
> > Twenty years ago, when I was involved with developing what is now called
> > Middleware for a radar fire-control system for ship self defense against
> > cruise missiles (no-kidding realtime - the final exam arrived at Mach 2, and
> > fumbling was bad for your health), our main vendor had an interesting
> > approach. They coded in plain C, inspected the generated assembly code, and
> > tweaked the C code until the assembly code was clean and fast. It turned out
> > that the resulting code was largely portable in that all C compilers
> > generated clean, fast code from the same tweaked C source code, after the
> > source code was tweaked to the first two compilers.

I should mention that the mission code was written in Ada83, but the 
infrastructure was in C. It's a long story, but Ada was a major cause of 
many problems. This was the last Ada project in that area - the Ada mandate 
was rescinded during the project.

.
> Schools need to teach this [more often?]. If you're working in a low level
> language* like C, you need to think about the end product: ultimately,
> you're telling the compiler how to write a program. C is not exactly a
> specification or description language, but that's not an invalid perspective
> to take.

The original intent of C was to develop a portable equivalent to assembly 
language, so UNIX could be ported from computer type to computer type. This 
is documented in K&R.

But whatever the language level, C won the language wars hands down. Ada83 
was an early casualty.

.
> (Personally, I've seen VHDL taught that way, as describing instances of
> logic gates through semantic structures, but not C. I did get ASM and then
> C from the same prof as did many other students on the same track, but
> without the explicit call to inspect one's machine code, I doubt anyone made
> the connection.)
>
> *Face it, C is low. Well, "medium" would be more charitable. It's
> assembler, cleaned up with richer macros and an optimizer. If it were a
> high level language, it would know better than to give mere /developers/ the
> pointers^Hkeys to the missile silos!

One of the problems with Ada was precisely that it attempted to prevent 
errors by constraining the programmers. Having right hand tied to left foot 
does prevent mistakes, but it turns out that the price was too high.

Ada83 enforced a 1970s academic theory of how programs ought to be 
structured, from people who had zero experience of embedded realtime 
programming. There was zero support for hardware interfaces (no shared 
memory, no volatile variables, etc). The whole issue of priority inversion 
was simply missed - priority inversion was in the embedded-realtime lore, but 
had neither a name nor a literature, and so was invisible to academics. And 
Ada83 was locked down, and could not be changed to fix any of the many 
problems. So, we fixed such problems in the C-coded runtime, where Lady Ada 
could neither see nor touch - she lived in an artificial space designed to 
never surprise her.

.
> As for the matter of writing mission-critical software in C, I will withhold
> judgement on that... :^)
Too late. Most military mission-critical code is written in C++ these days, 
with direct hardware control in ANSI C. The operating system is some form of 
Linux.

Ada retains a niche in safety-critical code, where DO-178 is implemented. But 
DO-178 is so heavy that it actually does not matter what language one uses, 
and one can argue that assembler is safer than any HOL, because with 
assembler nothing is hidden.

Joe Gwinn