A memory test that kills laptops - almost a detective story

(UPD: a photo of the board has been added alongside the schematics)
(UPD2: information from the libreboot IRC channel)



Recently we had a heartbreaking story: two Lenovo T500 laptops died in one morning. If only one had died, nobody would have thought much of it. But two in one morning is too much! Moreover, at least one of them (and this is confirmed by three users!) worked normally until the last minute, was turned off with the power button, was carried 100 meters to a meeting room and ... did not turn on.


Naturally, we first tried all the home-grown resuscitation methods: replacing the battery, replacing the power adapter, pulling the battery to cut power completely, resetting the CMOS, and so on. The result? Exactly zero - the laptops remained bricks.


We began reconstructing the sequence of events to find at least some clue. The following emerged:



Obviously, the death of the laptops had to be related to one of three things: the power adapter, the projector, or memtest. But which one exactly?


First, we checked the projector. We found nothing incriminating, and we also learned that other laptops had been connected to it later the same day and remained alive and well. Second, we checked the power adapter: its output voltage seemed to be on the low side, so it was put into quarantine.


The laptops were handed to a service center, which returned them with the following verdict: "motherboard failure, no spare parts available!". We had to open up the carcasses ourselves (fortunately, schematics and service manuals for the older ThinkPad series can be found online).


By this time, by process of elimination, memtest had become the prime suspect in the death of the laptops. But it was far from obvious how exactly it could have done it. After all, there was still a chance that the death of the laptops was a rare and unpleasant coincidence. But no! Or yes ... In short, we still do not know for sure.


Here a digression is needed on how power management is built in IBM / Lenovo laptops (at least in the older series). In simpler devices, power management is handled either by the processor / chipset or by a specialized controller on the motherboard (the System Controller, also known as the Embedded Controller, EC). Roughly speaking, this component is responsible for the spinal-reflex functions of a laptop: switching between power sources, charging the battery, battery identification / vendor lock-in, and the like. But not at IBM / Lenovo!


IBM engineers apparently assumed that the EC firmware might contain bugs or that the controller might suddenly hang. Of course, the EC has its own watchdog, but that is not a panacea. Therefore, the EC is only responsible for generating high-level power management signals. The power switches themselves are enabled and disabled by two specialized chips (and not blindly, but by checking the EC's requests against the readings of the thermal sensors, the presence on the buses of the voltages required for the next step, and so on). These chips are RINKAN (the meaning of the name is unknown) and PMH_7 (Power Management Hub rev7).


RINKAN in its natural surroundings

image


Please note that RINKAN has no connection to the CPU buses - it is fundamentally unreachable for the processor. One of the important (and non-obvious) functions of RINKAN is generating a stable 3.3v supply on the VCC3SW bus (let's call it the starting bus). Since there are no inductors near it, it can be assumed that this regulator is built as a simple linear one. That is, somewhere inside sits a pass transistor with its supporting circuitry, which drops the excess voltage across its own resistance and leaves only 3.3v at the output. This regulator is powered from the VREGIN20 pin, to which all of the laptop's power sources (docking station, power adapter, main battery and ultrabay battery) are connected through diodes. That is, it runs at all times (which is why it is a low-power design: its own current consumption has to be very small!)


PMH_7 in the working environment

image


PMH is a more intelligent chip. At a minimum, it is connected to the EC via the SPI bus. In addition, it switches a whole bunch of voltages and clock signals on the laptop's motherboard on and off. Both chips are custom parts with no datasheets available. Since Lenovo / IBM uses the same custom chips across different product lines, some PMH pins are not used in the T500. However, it is unlikely that they were left floating. Typical recommendations are to tie unused pins either to the supply rail or to ground. Remember this.


Despite the lack of documentation, the Coreboot project team managed (by comparing the laptops of the T60, T40 and later series, where the RINKAN / PMH functions were still split between chips of a lower degree of integration) to dig up something interesting. PMH is accessible from the CPU address space. Not directly, of course, but through the EC - but still accessible! UPD: it is connected to the ICH via the LPC bus (Low Pin Count, an ISA equivalent). To drive a PMH pin high or low, they use the following sequence of operations (pmh7.c):


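    outb(reg, EC_LENOVO_PMH7_ADDR);              /* select the PMH register */
    val = inb(EC_LENOVO_PMH7_DATA);              /* read its current value */
    outb(reg, EC_LENOVO_PMH7_ADDR);              /* select the same register again */
    outb(val | (1 << bit), EC_LENOVO_PMH7_DATA); /* write it back with the bit set */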

That is, we first write the number of the PMH register to an EC register (mapped into the address space of the CPU), and then we can read or write the contents of that PMH register. Suppose we want to turn on the backlight (PMH pin 55): we set bit 2 in register 0x55 - everything is simple.
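
The read-modify-write pattern above can be tried out safely without touching any hardware. Below is a minimal sketch that replaces outb / inb with stand-ins working on an in-memory array. The port numbers SIM_PMH7_ADDR and SIM_PMH7_DATA and all pmh7_sim_* names are hypothetical and exist only for this illustration; they are not coreboot's API, and the real port values are defined in the coreboot sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical port numbers, used only by this simulation. */
#define SIM_PMH7_ADDR 0x10
#define SIM_PMH7_DATA 0x11

static uint8_t sim_regs[256]; /* simulated PMH register file */
static uint8_t sim_index;     /* register selected by the last address write */

/* Stand-in for outb(): updates the simulated state, never real hardware. */
void sim_outb(uint8_t val, uint16_t port)
{
    if (port == SIM_PMH7_ADDR)
        sim_index = val;
    else if (port == SIM_PMH7_DATA)
        sim_regs[sim_index] = val;
}

/* Stand-in for inb(): reads the currently selected simulated register. */
uint8_t sim_inb(uint16_t port)
{
    return (port == SIM_PMH7_DATA) ? sim_regs[sim_index] : 0xff;
}

uint8_t pmh7_sim_read(uint8_t reg)
{
    sim_outb(reg, SIM_PMH7_ADDR);  /* select the register */
    return sim_inb(SIM_PMH7_DATA); /* read its value */
}

void pmh7_sim_write(uint8_t reg, uint8_t val)
{
    sim_outb(reg, SIM_PMH7_ADDR);  /* select the register */
    sim_outb(val, SIM_PMH7_DATA);  /* store the new value */
}

/* Read-modify-write: only the requested bit changes. */
void pmh7_sim_set_bit(uint8_t reg, unsigned bit)
{
    assert(bit < 8);
    pmh7_sim_write(reg, pmh7_sim_read(reg) | (uint8_t)(1u << bit));
}

void pmh7_sim_clear_bit(uint8_t reg, unsigned bit)
{
    assert(bit < 8);
    pmh7_sim_write(reg, pmh7_sim_read(reg) & (uint8_t)~(1u << bit));
}
```

For example, if the simulated register 0x55 holds 0x01, then pmh7_sim_set_bit(0x55, 2) leaves 0x05 in it, and all other registers stay untouched. The point of the read step is exactly this: a single pin is changed without disturbing its neighbours in the same register.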


UPD: colleagues from the Libreboot project consider the short-circuit-through-PMH scenario described in this article unlikely; moreover, the current limit in RINKAN should be at the level of 55 mA.


Unfortunately, memtest does roughly the same thing - it reads and writes different values to different areas of memory. In theory, the BIOS should describe the memory areas reserved for I / O devices, and memtest should not write anything there - but ... it did! And, apparently, at some point it drove the wrong PMH pin high or low. As a result, the VCC3SW power bus was shorted to ground through the output transistor of that PMH pin ...
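
The rule "do not write to areas the firmware marked as reserved" can be sketched as a simple range check against a memory map. This is only an illustration of the principle, not memtest's actual code: the struct mem_region layout, the region types and the function name range_is_testable are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory map entry (loosely modelled on the
 * kind of table a BIOS reports to the operating system). */
enum region_type { REGION_USABLE, REGION_RESERVED };

struct mem_region {
    uint64_t base;
    uint64_t length;
    enum region_type type;
};

/* Returns true only if [addr, addr + len) lies entirely inside a single
 * usable region. Anything touching a reserved area or a hole is rejected. */
bool range_is_testable(const struct mem_region *map, size_t count,
                       uint64_t addr, uint64_t len)
{
    if (len == 0 || addr + len < addr) /* empty range or address overflow */
        return false;

    for (size_t i = 0; i < count; i++) {
        const struct mem_region *r = &map[i];

        if (r->type != REGION_USABLE)
            continue;
        if (addr >= r->base &&
            addr - r->base <= r->length &&
            len <= r->length - (addr - r->base))
            return true;
    }
    return false;
}
```

The check is deliberately conservative: a range that straddles two adjacent usable regions is also rejected, so a caller would have to split it at the region boundary. A tester that skips this kind of check, or that receives an incomplete map from the BIOS, can end up writing test patterns into device registers instead of RAM.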


What happened next? RINKAN began to heat up. As the current grew, the PMH transistor, working as a fully open switch, carried it without any problems, while the half-open transistor in the RINKAN LDO fared much worse. But outwardly this did not show at all: while the laptop is on, nothing is fed from the low-power 3.3v source, and power comes from dedicated high-power DC / DC converters that feed the main 3.3 and 5 volt buses.
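
A rough estimate shows why a linear regulator takes the hit in this scenario: its pass transistor dissipates the full difference between input and output voltage multiplied by the load current. The sketch below only does this arithmetic. The 20 V input is an assumption (suggested by the pin name VREGIN20; on battery the real value would be lower), the 55 mA figure is the current limit mentioned by the Libreboot colleagues, and the function name is hypothetical.

```c
#include <assert.h>

/* Power dissipated in the pass element of a linear regulator, in watts:
 * P = (V_in - V_out) * I_load. */
double ldo_pass_dissipation_w(double v_in, double v_out, double i_load)
{
    assert(v_in >= v_out && i_load >= 0.0);
    return (v_in - v_out) * i_load;
}
```

Under these assumptions, an output shorted to ground (V_out close to 0) at the 55 mA limit gives about 20 V * 0.055 A = 1.1 W in the pass transistor, whereas a healthy 3.3v output loaded with, say, 1 mA gives only about 17 mW. Whether the RINKAN package can survive the first figure for long is unknown, since there is no datasheet.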


And when the power button was pressed, the main buses were de-energized. By then there was no supply on the 3.3v starting bus either! And the laptop turned into a pumpkin ... that is, a brick.


UPD: an alternative theory (omz + libreboot)


Service centers are well aware of RINKAN's tendency to fail. Colleagues from Libreboot add that this is especially true of the controllers made by Toshiba (the ones from ROHM are said to be better). In that case memtest was innocent all along, and the almost simultaneous failure of the two laptops occurred:



Diagnostic results:


The first laptop has a dual-graphics COR5SOPV3 board. On the VCC3SW bus there were only 1.2 volts instead of 3.3. The resistance to ground was about 400 ohms. We carefully unsoldered and lifted the output pin of the RINKAN voltage regulator. The resistance of the bus immediately rose to hundreds of kilo-ohms. We fed 3.3v from an external source - and the laptop came to life.


The board during repair

The chip with the white sticker is the Embedded controller, the one in the middle with wires is RINKAN, and the last one, without a sticker, is PMH.


image


In the end, we picked an external low-power LDO (LP2930-3.3), which now feeds the starting bus instead of RINKAN. Testing showed that the clinical death it went through left its mark on the character of the device: the laptop refuses to turn on if the battery is inserted but the adapter is not connected. To turn it on, you have to remove the battery, connect the power adapter, and only then insert the battery back. All other functions (charging, running on battery, sleep, etc.) work without problems; only powering on requires this ritual. We did not dig any deeper and solved the issue administratively: use sleep or reboot instead of shutting down. The first sufferer was lucky!


The second one was not ... It has a C5ISOVP board with integrated graphics: there was no voltage on the bus at all, and the resistance to ground was tens of ohms. Lifting the VCC3SW pin did not help - the same low resistance showed up on VREGIN20. We lifted that pin too, connected external power to the starting bus, and saw 3.3 and 5 volts on the main buses. However, despite the encouraging start, there was no Power-good signal at the output of PMH / RINKAN, and the system could not start. Apparently, the internal logic of the chips is damaged, and that cannot be cured ...


It is very likely that memtest can kill laptops in this way from the T6x series up to and including the T420 / 520 series. Starting with the T430 / 530, the way of communicating with the EC was changed, and writing to the PMH registers is no longer possible in principle. Perhaps only certain BIOS or EC versions are affected. A bug report has been filed with the Debian maintainers of the package; maybe together with upstream they will find something ...


The exact reason for the failure of the two laptops after running memtest is unknown. An experiment that should establish whether memtest causes abnormal current draw from the starting power bus is planned, but no date has been set yet. The results will be reported separately.


Before running memtest on Lenovo laptops from the T6x series up to and including the T420 / 520, weigh the potential risk against the benefit. If you do run the test, whether or not it led to bricking or freezing of the laptop, please write the result in the comments, indicating the laptop model and the duration of the test.


That's all - good luck!

Source: https://habr.com/ru/post/413469/

