
Burn-in strategy

Started by Don Y December 2, 2022
On 12/6/2022 10:05 AM, legg wrote:
> On Mon, 5 Dec 2022 05:09:48 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>
>> On 12/4/2022 10:46 AM, Joe Gwinn wrote:
>>> I was doing much the same in the late 1970s. We had a number of new SEL 32/55 midi computers, with this brand new semiconductor RAM memory (replacing magnetic core memory), and were having lots of early failures.
>>>
>>> So, I decided to give them some hot summer days: The computers were looping on a memory test, as before, but now with their air intakes partially blocked by cardboard, with a thermocouple in the core so we could adjust the cardboard to achieve the max allowed temperature.
>>>
>>> Initially, delivered units would fail within a day. We would remove the cardboard et al. and call the vendor, who would then find and replace the failed memory. Rinse and repeat.
>>>
>>> Pretty soon, the vendor instituted a hot screening program before delivery, it being far cheaper to fix in the factory than in the field, and in a year or two semiconductor memory field reliability had improved greatly.
>>
>> But the vendor likely didn't just "block the vents" and *hope* ALL the early faults would manifest in the first 24 hours.
>>
>> Instead, he likely stressed a sample population over a longer period of time and recorded the failure rates over time -- looking for the "knee" at which the failure rate leveled off. Longer burn-in times would just needlessly shorten the useful life of the device; shorter would risk some number of infant-mortality failures slipping through to manifest at the customer.
>>
>> It seems that most folks have a naive understanding of how burn-in is supposed to work: that "simply" plugging the unit in before sale is enough to catch the early failures. Unless you know where (in time) those failures are probabilistically going to manifest, how can you know that 24, 48, 72, or 168 hours is "enough"? Or that 60C is the best temperature to accelerate failures? (My residential devices have to *operate* at 60C. And at -40C.)
>>
>> [If you're not going to approach it with a scientific basis, you're likely just looking to capitalize on your customers' ignorance: "We burn in our products for ## hours to ensure quality." Yeah. Right. "Then why did OUR unit shit the bed after two weeks?"]
>
> The best temperature to accelerate failures is the operating limit which the design is intended to address, under functioning conditions that produce the highest intended self-generated rise.
Yes. My point was that naively assuming N hours at T degrees is just silly. You need to characterize your failure pattern before you can figure out how long and at which conditions you should stress the design.
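As a concrete (if simplified) sketch of that characterization step, here is one way to locate the "knee": fit a Weibull distribution to hours-to-failure from a stressed sample population, then burn in until the hazard rate has flattened. Everything here -- the failure times, the flatness threshold, the uncensored fit -- is invented purely for illustration; real burn-in data would need censoring handled.

import numpy as np
from scipy.stats import weibull_min

# Hours-to-failure observed on a stressed sample population
# (made-up numbers; all units assumed run to failure -- no censoring).
hours_to_failure = np.array([2.0, 5.0, 9.0, 14.0, 22.0, 35.0,
                             51.0, 80.0, 130.0, 210.0])

# Fit a two-parameter Weibull (location pinned at zero).
beta, _, eta = weibull_min.fit(hours_to_failure, floc=0)

def hazard(t):
    # Weibull hazard: failures per hour among units still alive at t.
    return (beta / eta) * (t / eta) ** (beta - 1.0)

# beta < 1 means a falling hazard -- the infant-mortality regime.
print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")

# Burn in until the hazard is within 10% of its long-run (flat) value,
# instead of guessing "24/48/72 hours".
t = np.linspace(1.0, 1000.0, 10000)
h_floor = hazard(t[-1])
knee = t[np.argmax(hazard(t) <= 1.1 * h_floor)]
print(f"hazard flattens around t = {knee:.0f} h -> candidate burn-in time")

The same fit also tells you whether burn-in helps at all: if beta comes out at or above 1, there is no infant-mortality regime to screen out, and every stress hour only consumes useful life.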
> If you have access to early testing, you'll have some idea of the margins for functional operation that this limit condition provides, and the accompanying MTBF calculation for this previously-measured condition.
>
> It is only when margins to the limits are actually exceeded that predicted life is possibly compromised.
Any time "operating" comes at the expense of "remaining useful life". If you can assume that the time spent operating is << the expected useful life, then you can ignore it.

OTOH, if your "usage in test/burn-in" represents a significant portion of the useful life of the device, approaching it willy-nilly can be costly.

E.g., there are SIMM/DIMM connectors that are rated for a *handful* of insertion cycles. You'd not want to design a test plan that called for them to be exercised *dozens* of times -- and then wonder why their reliability suffered post-test!
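The arithmetic is trivial but worth making explicit in the test plan. A hypothetical sketch (the rating, reserve, and insertion counts are all invented):

# Rated mating cycles for the socket (from its datasheet) and the
# cycles we must leave for field service -- both invented here.
RATED_CYCLES = 25
FIELD_RESERVE = 10

planned_insertions = {
    "incoming inspection": 2,
    "board-level test": 3,
    "burn-in fixture": 4,
    "final assembly": 1,
}

used = sum(planned_insertions.values())
print(f"test plan consumes {used} of {RATED_CYCLES} rated cycles")
assert RATED_CYCLES - used >= FIELD_RESERVE, \
    "test plan eats into the connector's field life"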
> Complete thermal cycling is impractical for simple burn-in. It is usually restricted in application to design verification or later sample process quality assurance.
Most devices operate in a narrow temperature range -- especially if deployed in human-occupied environments. Something intended for use in a lab will likely see constant temperatures.

OTOH, there are classes of devices that are not constrained by "human habitation". E.g., an outdoor weather station will typically see 100+C variations over its lifetime -- though likely only ~30C in any given (short) interval. Here, I expect to encounter 0F to ~140F as a normal yearly range. In North Dakota, it might be -40F to +100F, etc. If you don't want to design separate AZ and ND models, you have to test the design at the union of those conditions.

Then there are devices that are intended to operate in environments where the operating conditions are varied by necessity (e.g., many manufacturing processes).
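Deriving the test envelope is then just the union of the deployment ranges -- a one-liner, shown here with illustrative numbers (deg F):

# Deployment climates (illustrative yearly extremes, deg F).
climates = {
    "Arizona": (0.0, 140.0),
    "North Dakota": (-40.0, 100.0),
}

# One model for all markets -> test at the union of the ranges.
lo = min(low for low, _ in climates.values())
hi = max(high for _, high in climates.values())
print(f"test envelope: {lo:+.0f} F to {hi:+.0f} F")  # -40 F to +140 F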
> Cold cycling tolerance is relevant to consumer products mainly to demonstrate air-shipment worthiness.
Or, it's -26F outside and will be that way for a few days!
> For burn-in, simple on-off cycling to allow stress over self-generated temperature swings is considered adequate.
On Tuesday, 6 December 2022 at 19:38:38 UTC, Don Y wrote:
> [...]
> > For burn-in, simple on-off cycling to allow stress over self-generated temperature swings is considered adequate.
There are some assumptions about the thermodynamics of the failure mechanisms built into many accelerated testing scenarios which are sometimes not justified.

John
On 12/6/2022 3:10 PM, John Walliker wrote:
> There are some assumptions about the thermodynamics of the failure mechanisms built into many accelerated testing scenarios which are sometimes not justified.
Again, my point is that you need to understand your design and the environment in which it will operate before you can create a burn-in strategy. And you need to monitor the EXPECTED failures in your burn-in process to determine whether the characterization of the product needs to be updated (new model).

Places that take this seriously usually have staff on hand to keep track of the process for all such products. After release to manufacturing, engineering only gets re-involved when "they" discover something has changed (component suppliers, process, etc.).

Early in my career, I used an MNOS WAROM in a design. The *suggested* burn-in regimen would have caused 100% failures in the first few *hours*, as the device was only guaranteed for ~1,000 (that's "one thousand") write cycles. So we had to come up with a different way of exercising the design that didn't involve repeated accesses to that device. Similar issues exist today with MLC flash, etc.
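The back-of-envelope behind that 100%-failure claim, with an assumed loop rate (the pass period is invented for illustration; the ~1,000-cycle endurance is the part's guarantee):

# Guaranteed write endurance of the MNOS part, and an assumed
# burn-in loop that writes it once per pass, one pass every 10 s.
ENDURANCE_CYCLES = 1_000
WRITES_PER_PASS = 1
PASSES_PER_HOUR = 3600 / 10  # assumed loop rate

hours_to_wearout = ENDURANCE_CYCLES / (WRITES_PER_PASS * PASSES_PER_HOUR)
print(f"endurance exhausted after ~{hours_to_wearout:.1f} h of burn-in")
# ~2.8 h: any 24/48/72-hour regimen kills the part, so the
# exerciser must skip writes to that device entirely.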
John Walliker wrote:
> [...]
>
> There are some assumptions about the thermodynamics of the failure mechanisms built into many accelerated testing scenarios which are sometimes not justified.
>
> John
The idea that all failure modes follow an Arrhenius temperature dependence over many orders of magnitude is completely up a pole. MIL-HDBK-217 really ought to be relegated to a museum.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
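For readers who haven't met it, the Arrhenius arithmetic being criticized looks like the sketch below. It holds only if a single thermally activated failure mechanism with one activation energy dominates -- exactly the assumption in question -- and the 0.7 eV default used here is a commonly quoted catch-all, not a law of nature.

import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    # Acceleration factor between use and stress temperature,
    # assuming one dominant mechanism with activation energy ea_ev.
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# 48 h at 85 C "equals" roughly 26x that at 40 C -- if, and only if,
# every relevant failure mode actually follows the model.
af = arrhenius_af(40.0, 85.0)
print(f"AF = {af:.0f}; 48 h of stress ~ {48 * af:.0f} h of use")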