
supercomputer progress

Started by Unknown April 26, 2022
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

>Mike Monett wrote:
>> Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:
>>
>>> John Larkin wrote:
>>>> On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none> wrote:
>>>>
>>>>> On 4/28/22 11:26, boB wrote:
>>>>>
>>>>>> I would love to have a super computer to run LTspice.
>>>>>>
>>>>> I thought one of the problems with LTspice (and spice in general)
>>>>> performance is that the algorithms don't parallelize very well.
>>>>
>>>> LT runs on multiple cores now. I'd love the next gen LT Spice to run
>>>> on an Nvidia card. 100x at least.
>>>>
>>>
>>> The "number of threads" setting doesn't do anything very dramatic,
>>> though, at least last time I tried.  Splitting up the calculation
>>> between cores would require all of them to communicate a couple of times
>>> per time step, but lots of other simulation codes do that.
>>>
>>> The main trouble is that the matrix defining the connectivity between
>>> nodes is highly irregular in general.
>>>
>>> Parallelizing that efficiently might well need a special-purpose
>>> compiler, sort of similar to the profile-guided optimizer in the guts of
>>> the FFTW code for computing DFTs.  Probably not at all impossible, but
>>> not that straightforward to implement.
>>>
>>> Cheers
>>>
>>> Phil Hobbs
>>
>> Supercomputers have thousands or hundreds of thousands of cores.
>>
>> Quote:
>>
>> "Cerebras Systems has unveiled its new Wafer Scale Engine 2 processor with
>> a record-setting 2.6 trillion transistors and 850,000 AI-optimized cores.
>> It's built for supercomputing tasks, and it's the second time since 2019
>> that Los Altos, California-based Cerebras has unveiled a chip that is
>> basically an entire wafer."
>>
>> https://venturebeat.com/2021/04/20/cerebras-systems-launches-new-ai-
>> supercomputing-processor-with-2-6-trillion-transistors/
>
>Number of cores isn't the problem.  For fairly tightly-coupled tasks
>such as simulations, the issue is interconnect latency between cores,
>and the required bandwidth goes roughly as the cube of Moore's law, so
>it ran out of gas long ago.
>
>One thing that zillions of cores could do for SPICE is to do all the
>stepped parameter runs simultaneously.  At that point all you need is
>infinite bandwidth to disk.
This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law#:~:text=In%20computer%20architecture%2C%20Amdahl's%20law,system%20whose%20resources%20are%20improved>

Joe Gwinn
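For anyone who hasn't run the numbers, here is a minimal sketch of what Amdahl's Law predicts, assuming only the textbook form speedup = 1/((1-p) + p/N); it's an illustration, not code from any post in the thread. Even a 1% serial fraction caps an 850,000-core machine below 100x.

#include <cstdio>
#include <initializer_list>

// speedup(N) = 1 / ((1 - p) + p/N), p = parallelizable fraction, N = cores
static double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double cores[] = {4, 64, 1024, 850000};   // 850k ~ one Cerebras wafer
    for (double p : {0.50, 0.95, 0.99}) {
        for (double n : cores)
            std::printf("p = %.2f  N = %7.0f  speedup = %8.2f\n",
                        p, n, amdahl(p, n));
        std::printf("\n");
    }
    return 0;
}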
Joe Gwinn wrote:
> On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>
> [...]
>
>> Number of cores isn't the problem.  For fairly tightly-coupled
>> tasks such as simulations, the issue is interconnect latency
>> between cores, and the required bandwidth goes roughly as the cube
>> of Moore's law, so it ran out of gas long ago.
>>
>> One thing that zillions of cores could do for SPICE is to do all
>> the stepped parameter runs simultaneously.  At that point all you
>> need is infinite bandwidth to disk.
>
> This whole hairball is summarized in Amdahl's Law:
>
> <https://en.wikipedia.org/wiki/Amdahl%27s_law#:~:text=In%20computer%20architecture%2C%20Amdahl's%20law,system%20whose%20resources%20are%20improved>
Not exactly.  There's very little serial execution required to
parallelize parameter stepping, or even genetic-algorithm optimization.

Communications overhead isn't strictly serial either--N processors can
have several times N communication channels.  It's mostly a latency issue.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
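For what it's worth, here is a minimal sketch of the parameter-stepping parallelism being described, assuming each stepped value simply gets its own independent batch run of the simulator. The "ngspice -b" command line and the little RC netlist are assumptions for illustration only, not anything LTspice actually exposes.

#include <cstdlib>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Write a per-run netlist with one stepped resistor value and launch an
// independent batch simulation on it.  (Netlist template and the
// "ngspice -b" invocation are illustrative assumptions.)
int run_one(double r_value, int idx) {
    std::string net = "run" + std::to_string(idx) + ".cir";
    std::ofstream f(net);
    f << "* stepped run " << idx << "\n"
      << "V1 in 0 PULSE(0 1 0 1n 1n 1u 2u)\n"
      << "R1 in out " << r_value << "\n"
      << "C1 out 0 1n\n"
      << ".tran 1n 10u\n"
      << ".end\n";
    f.close();
    std::string cmd = "ngspice -b -r run" + std::to_string(idx) + ".raw " + net;
    return std::system(cmd.c_str());
}

int main() {
    std::vector<std::future<int>> jobs;
    int idx = 0;
    for (double r = 100.0; r <= 10e3; r *= 2.0)   // the stepped parameter
        jobs.push_back(std::async(std::launch::async, run_one, r, idx++));
    for (auto &j : jobs) j.get();                 // wait for every run
    return 0;
}

Each run touches only its own netlist and raw file, so the only shared resource is the disk--which is the point about needing "infinite bandwidth to disk" rather than a low-latency interconnect.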
On Fri, 29 Apr 2022 20:51:43 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

>Joe Gwinn wrote:
>> On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
>> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>>
>> [...]
>>
>>> One thing that zillions of cores could do for SPICE is to do all
>>> the stepped parameter runs simultaneously.  At that point all you
>>> need is infinite bandwidth to disk.
>>
>> This whole hairball is summarized in Amdahl's Law:
>>
>> <https://en.wikipedia.org/wiki/Amdahl%27s_law#:~:text=In%20computer%20architecture%2C%20Amdahl's%20law,system%20whose%20resources%20are%20improved>
>
>Not exactly.  There's very little serial execution required to
>parallelize parameter stepping, or even genetic-algorithm optimization.
>
>Communications overhead isn't strictly serial either--N processors can
>have several times N communication channels.  It's mostly a latency issue.
In general, yes.  But far too far down in the weeds.

Amdahl's Law is easier to explain to a business manager who thinks that
parallelism solves all performance issues, if only the engineers would
stop carping and do their jobs.

And then there are the architectures that would do wondrous things, if
only light were not so damn slow.

Joe Gwinn
On Saturday, April 30, 2022 at 9:04:50 AM UTC-4, Joe Gwinn wrote:
> On Fri, 29 Apr 2022 20:51:43 -0400, Phil Hobbs
> <pcdhSpamM...@electrooptical.net> wrote:
>
> [...]
>
> >Not exactly.  There's very little serial execution required to
> >parallelize parameter stepping, or even genetic-algorithm optimization.
> >
> >Communications overhead isn't strictly serial either--N processors can
> >have several times N communication channels.  It's mostly a latency issue.
>
> In general, yes.  But far too far down in the weeds.
>
> Amdahl's Law is easier to explain to a business manager who thinks
> that parallelism solves all performance issues, if only the engineers
> would stop carping and do their jobs.
>
> And then there are the architectures that would do wondrous things, if
> only light were not so damn slow.
People often focus on the fact that the size of the chip limits the speed,
without considering how the size might be reduced (and the speed increased)
using multi-valued logic.  I suppose the devil is in the details, but if
more information can be carried on fewer wires, the routing area of a chip
can be reduced, speeding up the entire chip.

I've only heard of memory-type circuits being implemented with multi-valued
logic, since the bulk of the die area is storage and that shrinks
considerably.  I believe they are up to 16 levels per cell, i.e. four bits
in place of one, but mostly I see TLC, which stores 8 levels per cell.
Logic chips using multi-valued logic are much harder to find.  Obviously
there are issues with making them work well.

--

Rick C.

+- Get 1,000 miles of free Supercharging
+- Tesla referral code - https://ts.la/richard11209
On 30/04/2022 01:51, Phil Hobbs wrote:
> Joe Gwinn wrote:
>> On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
>> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>>
>>> Number of cores isn't the problem.  For fairly tightly-coupled
>>> tasks such as simulations, the issue is interconnect latency
>>> between cores, and the required bandwidth goes roughly as the cube
>>> of Moore's law, so it ran out of gas long ago.
>>>
>>> One thing that zillions of cores could do for SPICE is to do all
>>> the stepped parameter runs simultaneously.  At that point all you
>>> need is infinite bandwidth to disk.
Parallelism for exploring a wide range of starting parameters and then
evolving them based on how well the model fits seems to be in vogue now, e.g.

https://arxiv.org/abs/1804.04737
>> This whole hairball is summarized in Amdahl's Law:
>>
>> <https://en.wikipedia.org/wiki/Amdahl%27s_law#:~:text=In%20computer%20architecture%2C%20Amdahl's%20law,system%20whose%20resources%20are%20improved>
>>
>
> Not exactly.  There's very little serial execution required to
> parallelize parameter stepping, or even genetic-algorithm optimization.
>
> Communications overhead isn't strictly serial either--N processors can
> have several times N communication channels.  It's mostly a latency
> issue.
Anyone who has ever done it quickly learns that by far the most important,
highest-priority task is not the computation itself but the management
required to keep all of the cores doing useful work!

It is easy to have all cores working flat out, but if most of the
parallelised work being done so quickly will later be shown to be redundant
due to some higher-level pruning algorithm, all you are doing is generating
more heat for only a minuscule performance gain (if that).

SIMD has made quite a performance improvement for some problems on the
Intel and AMD platforms.  The compilers still haven't quite caught up with
the hardware though.  Alignment is now a rather annoying issue if you care
about avoiding unnecessary cache misses and pipeline stalls.

You can align your own structures correctly but can do nothing about
virtual structures that the compiler creates and puts on the stack
misaligned, spanning two cache lines.  The result is code which executes
with two distinct characteristic times depending on where the cache line
boundaries are in relation to the top of stack when it is called!

It really only matters in the very deepest levels of computationally
intensive code, which is probably why they don't try quite hard enough.
Most people probably wouldn't notice ~5% changes unless they were
benchmarking or monitoring MSRs for cache misses and pipeline stalls.

--
Regards,
Martin Brown
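A small illustration of the alignment point, assuming a typical 64-byte cache line; the struct names and the forced alignas are a sketch rather than anything from Martin's code.

#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kCacheLine = 64;   // typical x86 line size (assumption)

struct Hot {                             // 48 bytes, alignof == 8:
    double coeff[6];                     // may straddle two cache lines
};

struct alignas(kCacheLine) HotAligned {  // same payload, forced onto a
    double coeff[6];                     // cache-line boundary
};

int main() {
    HotAligned a{};
    auto addr = reinterpret_cast<std::uintptr_t>(&a);
    std::printf("alignof(Hot)        = %zu\n", alignof(Hot));         // 8
    std::printf("alignof(HotAligned) = %zu\n", alignof(HotAligned));  // 64
    std::printf("&a %% 64             = %zu\n",
                static_cast<std::size_t>(addr % kCacheLine));         // 0
    return 0;
}

The catch Martin points out is that this only helps for structures you declare yourself; the temporaries the compiler invents on the stack get whatever alignment the ABI hands them.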
Martin Brown wrote:
> On 30/04/2022 01:51, Phil Hobbs wrote:
>> Joe Gwinn wrote:
>>> On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
>>> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>>>
>> [...]
>>
>> Communications overhead isn't strictly serial either--N processors can
>> have several times N communication channels.  It's mostly a latency
>> issue.
>
> Anyone who has ever done it quickly learns that by far the most
> important, highest-priority task is not the computation itself but
> the management required to keep all of the cores doing useful work!
Yup. In my big EM code, that's handled by The Cluster Script From Hell. ;)
>
> It is easy to have all cores working flat out, but if most of the
> parallelised work being done so quickly will later be shown to be
> redundant due to some higher-level pruning algorithm, all you are doing
> is generating more heat for only a minuscule performance gain (if that).
>
> [...]
>
> It really only matters in the very deepest levels of computationally
> intensive code, which is probably why they don't try quite hard enough.
> Most people probably wouldn't notice ~5% changes unless they were
> benchmarking or monitoring MSRs for cache misses and pipeline stalls.
Well, your average hardcore numerical guy would probably just buy two
clusters and pick the one that finished first. ;)

Fifteen or so years ago, I got about a 3:1 improvement in FDTD speed by
precomputing a strategy that let me iterate over a list containing runs of
voxels with the same material, vs. just putting a big switch statement
inside a triply-nested loop (the usual approach).

I mentioned it to another EM simulation guy at a conference once, who said,
"So what?  I'd just get a bigger cluster."

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
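A rough sketch of that run-list idea, with a placeholder field update and material table standing in for the real FDTD kernel (an illustration of the technique, not Phil's implementation): the grid is pre-scanned once into runs of consecutive voxels sharing a material, so the per-time-step inner loop carries no per-voxel switch.

#include <cstddef>
#include <vector>

struct Run { std::size_t start, len; int material; };

// Pre-scan, done once outside the time-stepping loop: collapse consecutive
// voxels that share a material index into runs.
std::vector<Run> build_runs(const std::vector<int> &material_of_voxel) {
    std::vector<Run> runs;
    std::size_t i = 0, n = material_of_voxel.size();
    while (i < n) {
        std::size_t j = i;
        while (j < n && material_of_voxel[j] == material_of_voxel[i]) ++j;
        runs.push_back({i, j - i, material_of_voxel[i]});
        i = j;
    }
    return runs;
}

// Hot loop, once per time step: the material coefficient is hoisted out of
// the inner loop, so there is no per-voxel switch or table lookup.
void update_fields(std::vector<double> &e, const std::vector<Run> &runs,
                   const std::vector<double> &coeff_for_material) {
    for (const Run &r : runs) {
        const double c = coeff_for_material[r.material];
        for (std::size_t k = r.start; k < r.start + r.len; ++k)
            e[k] *= c;     // stand-in for the real E/H update
    }
}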
Martin Brown <'''newspam'''@nonad.co.uk> wrote:
> On 28/04/2022 18:47, Jeroen Belleman wrote:
>> On 2022-04-28 18:26, boB wrote:
>> [...]
>>> I would love to have a super computer to run LTspice.
>>>
>>> boB
>>
>> In fact, what you have on your desk *is* a super computer,
>> in the 1970's meaning of the words.  It's just that it's
>> bogged down running bloatware.
>
> Indeed.  The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
> a whopping for the time 128MB of fast core memory with 40GB of disk.  The
what is fast core memory?
> one I used had an amazing for the time 1TB tape cassette backing store.
> It did 600 MFLOPs with the right sort of parallel vector code.
>
> That was back in the day when you needed special permission to use more
> than 4MB of core on the timesharing IBM 3081 (approx 7 MIPS).
>
> Current Intel 12 gen CPU desktops are ~4GHz, 16GB ram and >1TB of disk
> (and the upper limits are even higher).  That combo does ~66,000 MFLOPS.
>
> Spice simulation doesn't scale particularly well to large-scale
> multiprocessor environments due to many long-range interactions.
>
On Friday, April 29, 2022 at 7:30:55 AM UTC-7, jla...@highlandsniptechnology.com wrote:

> Climate simulation uses enormous multi-CPU supercomputer rigs.
Not so; it's WEATHER mapping and prediction that uses the complex data sets
from a varied bunch of sensing locations around the globe to make a 3-D map
of the planet's atmosphere.

Climate is a much cruder problem; no details required.  Much of the
greenhouse-gas analysis comes out of models that a PC spreadsheet would
handle easily.
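To put a number on the "a spreadsheet would handle it" claim, here is a zero-dimensional energy-balance toy (a textbook exercise, not anything from the post): the whole "model" is a handful of arithmetic operations, versus the gridded 3-D atmosphere a weather code has to time-step.

#include <cmath>
#include <cstdio>

int main() {
    const double S = 1361.0;        // solar constant, W/m^2
    const double albedo = 0.30;     // planetary albedo
    const double sigma = 5.670e-8;  // Stefan-Boltzmann constant, W/m^2/K^4
    // Equilibrium: absorbed solar flux = emitted infrared flux.
    double T = std::pow(S * (1.0 - albedo) / (4.0 * sigma), 0.25);
    std::printf("bare-rock equilibrium temperature: %.1f K\n", T);  // ~255 K
    return 0;
}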
On 05/03/2022 03:12 PM, Cydrome Leader wrote:
> Martin Brown <'''newspam'''@nonad.co.uk> wrote:
>> On 28/04/2022 18:47, Jeroen Belleman wrote:
>>> On 2022-04-28 18:26, boB wrote:
>>> [...]
>>>> I would love to have a super computer to run LTspice.
>>>>
>>>> boB
>>>
>>> In fact, what you have on your desk *is* a super computer,
>>> in the 1970's meaning of the words.  It's just that it's
>>> bogged down running bloatware.
>>
>> Indeed.  The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
>> a whopping for the time 128MB of fast core memory with 40GB of disk.  The
>
> what is fast core memory?
>
A very expensive item:

https://en.wikipedia.org/wiki/Magnetic-core_memory

Fortunately, by the X-MP's time SRAMs had replaced magnetic core.
On Tuesday, May 3, 2022 at 5:12:59 PM UTC-4, Cydrome Leader wrote:
> Martin Brown <'''newspam'''@nonad.co.uk> wrote:
> > On 28/04/2022 18:47, Jeroen Belleman wrote:
> > [...]
> >
> > Indeed.  The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
> > a whopping for the time 128MB of fast core memory with 40GB of disk.  The
> what is fast core memory?
An oxymoron.

--

Rick C.

++ Get 1,000 miles of free Supercharging
++ Tesla referral code - https://ts.la/richard11209