Low Voltage, Intermittent Communication, No Regeneration - $1,200 EGR Valve?
A case study of diagnostic strategy and working with what you have. Doesn't matter the make, model, or product line (on road or off).
John Deere 244K-II (FT4) - 1LU244KTVZB046717 w/ Yanmar 4TNV98CT engine.
Before we delve into the laundry list of codes, let me give you a little info here. 244K-II is a machine made for John Deere by Liebherr and equipped with a Yanmar engine. So we have a few different languages going on here; it can make it interesting at times.
Codes & Brief Description (the description is pretty much all I get in the service information - code set parameter documentation is almost non-existent):
- P068A - ECU Main Relay Early Opening (main relay is internal to the ECU) (*turning battery disconnect off with machine running will cause this)
- P1424 - DPF OP Interface Back-Up Mode
- P1421 - Stationary Regeneration Standby
- P2459 - Regen Defect (Stationary Regen Not Performed)
- U1303 - Y_DPFIF CAN Message Reception Timeout
- … - CCU Improper Shutdown
- … - CAN 1 Abnormal Update Rate
- … - CAN 2 Abnormal Update Rate
- … - CCU Operating Hours Not Saved
- … - Battery Voltage Low - Engine Running
- … - CCU EEPROM Error
This case study starts out, as most of mine do, with a phone call from a road technician. Customer had called in and requested a forced service regeneration because machine had de-rated and was essentially useless for loading salt. Fairly common occurrence for operators to short trip machines like this or inhibit regeneration due to lack of understanding of how to properly run the machine. Technician hooked up Service Advisor to the machine and was getting ready to start regeneration when he noticed a laundry list of codes popping up, a red STOP Engine light, and an incredibly annoying beep (all designed to force an operator to stop; sadly it's rarely enough).
JD is pretty convenient with certain machines; they have a test called the DTC Priority test. It takes a list of codes such as this one and cross-references them. It then gives you a code to begin with and filters down so that you're focused on direct faults instead of a fault code caused by another fault code. This machine was not one of them, and you'll notice that they aren't labeled in the DTC list; that means manual look-up. I recognized the … (battery voltage low - engine running), so I had the tech check out the belt drive, charging system, and battery while I pulled up the descriptions. The engine load profile was also brought up at this time as the initial complaint was regeneration related.
As you can see this machine has spent the vast majority of its life idling with little to no load on it. That is a big no-no on these new engines with DOC/DPF and SCR (selective catalytic reduction). I get a call back from the tech at this point: alternator drive belt was loose and battery voltage was down to 11.5V with the engine running. He tightened the belt and cleared the codes (we both agreed that codes relating directly or indirectly to low voltage would likely not re-occur). Low voltage codes would be our … (low battery voltage); this caused our P068A (main relay opening early due to low voltage), which in turn caused our … (improper shutdown, as it has a cycle down time after key off to store all its data), … (EEPROM error) and … (CCU operating hours not saved). The CCU (chassis control unit) is fed 5V from the ECU and remains powered up after the key is switched off to save data. If it loses power (due to the ECU losing power) it will throw all sorts of codes.
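The grouping described above can be sketched as a small cause-and-effect lookup. This is only an illustration in Python: the links are the ones stated in the write-up, codes whose numbers aren't given are keyed by their description, and the names `CAUSED_BY`, `root_cause`, and `group_by_root` are made up for the example (this is not how the DTC Priority test works internally, which isn't documented here).

```python
# Cause -> effect links as described in the write-up. Codes whose numbers
# are not given in the post are keyed by their description instead.
LOW_VOLTAGE = "Battery Voltage Low - Engine Running"
MAIN_RELAY = "P068A ECU Main Relay Early Opening"

CAUSED_BY = {
    MAIN_RELAY: LOW_VOLTAGE,
    "CCU Improper Shutdown": MAIN_RELAY,
    "CCU EEPROM Error": MAIN_RELAY,
    "CCU Operating Hours Not Saved": MAIN_RELAY,
}


def root_cause(code, caused_by=CAUSED_BY):
    """Follow the chain upstream until a code with no listed cause is reached."""
    seen = set()
    while code in caused_by and code not in seen:
        seen.add(code)  # guard against an accidental loop in the table
        code = caused_by[code]
    return code


def group_by_root(codes, caused_by=CAUSED_BY):
    """Bucket every active code under the code at the top of its chain."""
    groups = {}
    for code in codes:
        groups.setdefault(root_cause(code, caused_by), []).append(code)
    return groups
```

Feeding the active code list through `group_by_root` leaves the low voltage code as one bucket holding five entries, while anything with no listed cause (U1303, for instance) stays in its own bucket to be chased separately.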
After charging the battery up I had the tech perform a Tuple Error Correction test (internal computer memory correction essentially), reprogram of the CCU with the updated Liebherr software, and finally a Chassis Configuration Test (this ensures that the CCU is operational and correctly configured for the specific machine). The only codes that re-occurred were our regeneration defect codes, our U1303, and our CAN abnormal update rate codes. Watching some live data recordings from the tech didn't bring anything to my attention, so I packed up some equipment and headed out to meet.
I knew from the theory of operation that the exhaust after-treatment system is on its own leg of the CANbus (CAN 1). I hadn't seen any data to suggest that any components weren't working so I pulled a schematic and found CAN 1 only runs to the EGR valve. The EGR valve has power, ground, CAN H, CAN L wires. Nothing else on that leg. Bi-directional control of the EGR valve was attempted but unsuccessful (I believed at the time that this was due to the regen codes locking out some of the controls as a failsafe, because EGR is commanded fully closed to perform regeneration). I was able to initiate a service regeneration and this gave me a solid 2 hours to think about my direction and the system. I decided that we would attempt bi-directional controls again (they failed).
Seeing no other real direction, and wanting to get rid of communication errors before chasing anything else, I brought out the scope. Watching the CAN data, back-probed at the EGR valve, I didn't see anything out of the ordinary. I attempted bi-directional control once more while monitoring the CAN data and I saw this waveform. I am no network communication expert but I didn't like how it looked, and I also didn't like how I had no control over the valve. So I unplugged the valve and immediately the signal went back to a "standard" CAN signal. Plugging the valve back in caused the signal to continue to degrade. I unplugged the valve once again after this and continued on.
I cleared all codes and the only one that re-occurred was U1303 - CAN message timeout. Feeling 90% confident I next-day aired a new EGR valve and gaskets in. Made the repair and scoped the signal again. It was exactly what I expected to see.
So a confirmed issue where a CANbus EGR valve was able to scramble the network enough to cause communication errors. Definitely a first for me. Those more adept at interpreting network waveforms may be able to confirm or deny how much difference that network signal has. I know I replaced the EGR valve, loaded salt spreaders with the machine for about an hour and had no fault re-occurrences. I now also had bi-directional control of the EGR valve, as well as proper computer controlled operation while running.
Final verification was putting the machine into regeneration from the operator's seat; it went in and passed perfectly. Full repair verification complete.
Lesson here is that even when faced with a scary amount of codes there is no need to get nervous. In fact it is just more data pointing you in the right direction. The more strange the failure is, the easier it can be to track down. How many more variables are there to a simple low power complaint with no codes compared to a list of 13 codes all pointing in a general direction? By studying theory of operation and using logic we were able to knock out a lot of codes just by grouping them and tightening a loose belt. Then we were able to track down to one specific component based off a schematic. Using some network and tooling knowledge we were able to get to a relatively confident diagnosis and then confirm as best as we could given the lack of concrete SI.
I promise one day I will post a nice short 1 paragraph tech tip. Thanks to everyone who trudges through these write-ups and I hope they are useful in some way, shape or form.
Hey Chris, great write up. I would rather, as you said, "trudge through" a long write up with lots of detail than read a short write up lacking pertinent information. Thanks for taking the time.
I appreciate that. I've written them up for years for my own reference. I'm happy to be able to contribute them for others' use.
Chris, I wish the Verus had better resolution... but the first waveform looks like the other modules are trying to transmit the so-called "error frame" upon detecting a garbled message previously. Sort of like CAN bus SOS. If you connected headphones to the bus, you would hear ABBA playing. Umm, I'll see myself out.
I agree that the Verus isn't the best, but it does serve the purpose for my HD stuff. The time base for both is 1ms per screen. I've never been a fan of only being able to zoom out on Snap-On scope captures.
A Pico would have been infinitely easier to get good captures with. One day...
I think the main problem was how the signal would stay high or low without a full transition. I have to research some more, but I believe the signal was just far away enough from a full transition in this implementation that the network was getting confused.
As long as you don't start singing Dancing Queen I think we can overlook ABBA references.
Excellent observation, Chris -- there are lots of long pulses in the "bad waveform". However, it seems to be intentional:
"every bit stream of more than 5 bits of the same polarity, dominant or recessive, is considered an error condition. As a matter of fact, CAN uses this rule to send an error frame, which contains of (minimum) 6 consecutive dominant bits. Each node in the network will recognize the violation of the standard and initiate the appropriate response."
Quote from here: copperhilltechnologies.com/can-bus-guide…
My take on this is that it's not obvious from the waveform what is wrong with the EGR messages, but the reaction of other modules can be seen clearly. Good enough for diagnostics purposes, right?
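The rule quoted above can be sketched in a few lines of Python. This is a minimal illustration, assuming the bus has already been decoded into a list of bit values (0 = dominant, 1 = recessive) sampled once per bit time; the function name `first_stuff_violation` is made up for the example. It also ignores that stuffing only applies inside a frame (an idle bus sits recessive for long stretches without that being an error).

```python
def first_stuff_violation(bits, limit=5):
    """Return the index of the bit that makes a same-polarity run longer
    than `limit`, or None if every run is `limit` bits or shorter."""
    run_value, run_len = None, 0
    for i, bit in enumerate(bits):
        if bit == run_value:
            run_len += 1
        else:
            run_value, run_len = bit, 1
        if run_len > limit:
            return i
    return None


# Five dominant bits followed by a stuff bit of opposite polarity: legal.
legal = [0, 1, 0, 0, 0, 0, 0, 1, 0]
# Six dominant bits in a row: what an active error flag looks like.
error_flag = [0, 1, 0, 0, 0, 0, 0, 0, 1]

print(first_stuff_violation(legal))       # None
print(first_stuff_violation(error_flag))  # 7
```

On a scope capture this is the "long pulse" to look for: a dominant stretch six bit times wide or wider means some node is flagging an error, even if the message that triggered it isn't obvious.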
I'll read up on that more. Thanks Dmitriy. I'll have to think about the variables here some more as they have been bothering me a bit.
One way to confirm would be to log that CAN message and then inject it back into the comm network and look for what results occur. Message priority might be a concern but it will flag eventually I would imagine.
I've only played with CANbus data logging on my pick up, you can do some fun stuff, but it's rather tedious to be of diagnostic value in all but the most strange of failures. For now at least.
I think the periods of time "without" full transitions is more indicative of a failing transceiver. It seems to me that the error frame transmissions should still modulate at the normal voltage levels. If anyone knows that to be incorrect please reply.
That's kind of where I was leaning Bob, that the problem was in the lack of full transition and not so much the packets themselves. I hope I phrased that correctly.
I'm doing some reading into transceiver architecture and failure modes now. It makes sense with transceiver design (just looking at generic architecture), that one failure mode would be lack of full transition.
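One rough way to put a number on "lack of full transition" is to look at the differential voltage (CAN H minus CAN L) sample by sample. The sketch below is an illustration only: it assumes high-speed CAN receiver thresholds of 0.9 V and above for dominant and 0.5 V and below for recessive, the sample voltages are invented, and the function names are made up. Nothing here comes from the actual scope files.

```python
DOMINANT_MIN_V = 0.9   # assumed receiver threshold: at or above is dominant
RECESSIVE_MAX_V = 0.5  # assumed receiver threshold: at or below is recessive


def classify(can_h, can_l):
    """Classify one sample by its differential voltage."""
    vdiff = can_h - can_l
    if vdiff >= DOMINANT_MIN_V:
        return "dominant"
    if vdiff <= RECESSIVE_MAX_V:
        return "recessive"
    return "indeterminate"


def indeterminate_fraction(samples):
    """Fraction of (can_h, can_l) samples that land between the thresholds."""
    if not samples:
        return 0.0
    bad = sum(1 for h, l in samples if classify(h, l) == "indeterminate")
    return bad / len(samples)


# Invented samples: a healthy dominant, a healthy recessive, and two
# samples where the pair never separates far enough to count as dominant.
samples = [(3.5, 1.5), (2.5, 2.5), (2.9, 2.2), (2.8, 2.2)]
print([classify(h, l) for h, l in samples])
print(indeterminate_fraction(samples))
```

A few indeterminate samples are normal on every edge, so the fraction matters more than any single hit; a healthy bus should only pass through the middle band briefly, while a weak transceiver would be expected to linger there.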
This was a warranty repair, so I have to hold the part for 90 days. If they don't call the part back I plan on doing some disassembly and testing with a bench CANbus system. If it happens, I'll be posting up a case study about it.
Hey, no fair, you've replaced a "known bad" waveform with a different one! The old one was not nearly as dysfunctional as the new one -- here the transitions are indeed screwed up.
Could you re-upload the old one for comparison purposes?
My apologies. I was attempting to have both known bad up there. And.....it's fixed. Scope files uploaded as well. There is a brief glitch in the new EGR valve capture where the signal gets a little wonky, but clears out right away and didn't return (monitored the CANbus on and off while it ran through regen). I did cycle the key after this capture, so it was very likely that everything was still a bit off until it had time to do its power down cycle, save, and then self-check on reboot. Similar to key switch cycle after a reprogramming event in automotive.
Thanks, Chris, I will take a look soon. You've given us lots of food for thought lately!
Always enjoy the HD write ups, thank you
Long as it's helpful to at least one person I'll post up case studies. I've got notebooks full of them.
An excellent example of not succumbing to paralysis by analysis. Your …e study illustrates what many techs are exposed to. Unlike you, they end up like this:
I was taught that there is a hierarchy to diagnosing multiple codes:
1: Voltage Issues - a high side or low side issue can also cause communication, component or performance issues.
2: Communication Issues - a communication issue can also cause component or performance issues.
3: Component Issues - a component issue can also cause a performance issue.
4: Performance Issues - these are always diagnosed last.
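The hierarchy above amounts to a sort order, which can be sketched in Python as follows. The category assigned to each code here is an assumption made purely for illustration (the service information doesn't classify them this way), and the names `PRIORITY` and `diagnostic_order` are made up.

```python
# Lower number = diagnose first, following the four-step hierarchy.
PRIORITY = {"voltage": 1, "communication": 2, "component": 3, "performance": 4}


def diagnostic_order(codes):
    """Sort (description, category) pairs by the hierarchy.

    sorted() is stable, so codes in the same category keep their input order.
    """
    return sorted(codes, key=lambda code: PRIORITY[code[1]])


# Categories below are illustrative guesses, not taken from any documentation.
codes = [
    ("P2459 Regen Defect (Stationary Regen Not Performed)", "performance"),
    ("U1303 Y_DPFIF CAN Message Reception Timeout", "communication"),
    ("CCU EEPROM Error", "component"),
    ("Battery Voltage Low - Engine Running", "voltage"),
]

for description, category in diagnostic_order(codes):
    print(f"{PRIORITY[category]}: {description}")
```

With that ordering the low voltage code comes out on top and the regen defect lands last, even though the regen was the original complaint.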
Based upon the little that I know about you, if you think about it, you very well may see that techs get all balled up by attacking #4 first. In a way, it makes sense. That's what the complaint is. Your …e study illustrated all 4 types of issues. That is a great …e study.
A silly question if you don't mind. The code list shows multiple Active Codes with 0 Counts. Was this caused by the "…" and why you reflashed the module?
I believe so. We have Deere, Liebherr, and Yanmar all thrown together on these machines so a lot of conflicting or missing service information. Generally I have found that codes with a 0 count are caused by something else.
It's almost as if the fault detection strategy is telling you to treat them as secondary codes, which in this case they were. It may be that they are "soft" codes or it may be pending codes as part of FT4 emissions criteria; I haven't researched that far into it yet.
Anytime I see an EEPROM error on these, it always causes a myriad of strange codes with 0 counts. I perform the tuple correction, followed by a reflash. That usually corrects it.
Thank you, Chris! That is a … read.
I even learned a new word- tuple. Sounds interesting. ;)
Glad you enjoyed it Marlin. I try to keep a solid thought process and game plan for every diagnosis (doesn't always happen). I know I learn all sorts of things by reading other people's diagnostic strategies. I find that my overall organizational process doesn't really change at all between automotive/equipment. Only differences are which tests I might choose based on ease of access.
I have another case study that I'll probably be uploading tomorrow. It beat me up pretty good and ran me around in circles for about 4 hours.
"I find that my overall organizational process doesn't really change at all between automotive/equipment. Only differences are which tests I might choose based on ease of access." That will be one of the most important observations that you will ever make.
So the new word you learned "tuple"...
Can you let us in on its meaning?
Tuple is a computer term that takes on rather specific meanings depending on its context but for here the generic definition should be sufficient.
When you are using a relational database (think of it as a spreadsheet of computer logic and sensor values, for example: IF key is in crank [value 1] THEN send 12v on starter solenoid excite wire [relational value 6]; these are made up values for illustration) there are rows and columns of data, as if you took an engine fuel map and made an Excel spreadsheet.
A row of data can also be called a tuple. The module runs a self test utilizing a checksum (a known-good value computed from the original data and compared against the current data), and if an error is found it reboots, for lack of a better term. The errors generally occur from low battery voltage scrambling modules. Reference this video one of our members, Dmitriy, made for some additional information on voltage and bricking modules. Voltage fluctuations can cause improper data writing to the module and create all sorts of messes when the data doesn't jibe; think Microsoft's blue screen of death.
The event that occurs during tuple error correction isn't the same as a reprogramming event, but the data/module is rebooted so to say. It gets rather in depth on the computer level and I fear my understanding of the deeper software/hardware level interactions is lacking.
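As a rough picture of a per-row checksum self test, here is a small Python sketch. It is only an analogy: CRC32 from the standard library stands in for whatever check the module actually uses (that isn't documented in anything I have), the table rows reuse the made up values from above, and the function names are hypothetical.

```python
import zlib


def row_checksum(row):
    """CRC32 over a text rendering of the row (a stand-in checksum)."""
    return zlib.crc32(repr(row).encode("utf-8"))


def find_corrupt_rows(table, stored_checksums):
    """Return indexes of rows whose current checksum no longer matches
    the value saved when the data was known good."""
    return [
        i for i, row in enumerate(table)
        if row_checksum(row) != stored_checksums[i]
    ]


# Made up rows, each one a tuple: (condition, value, action, relational value)
table = [
    ("key_in_crank", 1, "starter_solenoid_excite_12v", 6),
    ("key_off", 0, "save_operating_hours", 2),
]
stored = [row_checksum(row) for row in table]  # saved while data is good

# Simulate a bad write during a voltage dip: one value in row 1 changes.
table[1] = ("key_off", 0, "save_operating_hours", 7)

print(find_corrupt_rows(table, stored))  # [1]
```

The point is just that the check can say which row no longer matches; restoring it to "factory spec" is a separate step that this sketch doesn't attempt.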
Suffice it to say, it attempts to restore the affected module back to "factory spec".
I hope that helps.
Dmitriy is correct. I first observed it with 8 different GM vehicles and a Grand Caravan. On the GMs the complaint was the airbag lamp was illuminated. The Chrysler was a no start. What they had in common was the VIN had been erased in all of them. How it happened was the battery in each was so low, when attempting to start the vehicle using a jump pack, it got stupid (technical term). Correction of the symptom was straight-forward. Program the PCM on the GMs. PCM Replace on the Chrysler.
I relayed that story to a GM engineer. He told me that it was impossible for that to happen since the VIN is stored in ROM. :) I think that we, and possibly he now, know better.
As an aside, this can happen to a PC also. I have one here that it happened to. If I ever find that ubiquitous free time and get off of my lazy ass, I'll see if I can recover it. It happened because I didn't feel like putting another battery in a laptop that I was "in the process of" replacing. Well, the PC got replaced but I didn't get all of the data off of the old. It died while I was transferring data. I wasn't totally surprised. A lot of the files wouldn't transfer because they were corrupted. I also learned a very valuable lesson. Backups don't work when you change the O/S.
As an aside 2: See if this helps answer some questions on tuple in programming. Please keep in mind that I'm not a programmer. Because of that, I'm quite a bit out of my league. Still, see if this helps explain it a little more. I came across it the other day while looking into the upcoming C++ 20 release.
I'm always happy to have your input, you always seem to be able to pull out the data that I just couldn't quite reach for. Surprisingly I haven't found nearly as much info on programming language in off-road ECUs but I would imagine it is still C, C++, C+³¹⁸⁴⁹, or whatever new variations will be coming. That makes your link all the more relevant.
I believe that link will let Roger and others head in the right direction if they want to dig deeper.
No promises but I wonder if this may help when looking at an orphaned module. (Something that was awaiting me in my Inbox when I got home tonight.)