When you have a ruleset problem, the thing to do is to turn on the nvcorrd trace and see why what you expected to happen is not happening. You do that with "nvcdebug -d all", and the results are written to nvcorrd.alog and nvcorrd.blog. When a trap is received, you'll see the eye-catcher "Received a trap", and when nvcorrd is finished with it, you'll see "Finished with the trap". Everything that happens in between is the processing nvcorrd did on it.
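If the trace is large, it can help to cut it into one block per trap using those two eye-catchers. What follows is only a minimal sketch in Python, not a NetView tool: the function name split_trap_blocks is made up for illustration, and it assumes nothing about the trace format beyond the two eye-catcher strings appearing somewhere in a line.

```python
from typing import Iterable, List

# Eye-catcher strings as quoted in the post; the rest of each
# trace line is treated as opaque text.
START = "Received a trap"
END = "Finished with the trap"


def split_trap_blocks(lines: Iterable[str]) -> List[List[str]]:
    """Group trace lines into one block per trap (hypothetical helper).

    A block runs from a line containing START through the next line
    containing END. Lines outside any block are ignored, and a block
    that never sees its END line is dropped.
    """
    blocks: List[List[str]] = []
    current = None  # lines of the block being collected, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if START in line:
            # A new trap begins; any unfinished block is discarded.
            current = [line]
        elif current is not None:
            current.append(line)
            if END in line:
                blocks.append(current)
                current = None
    return blocks
```

To use it, open your copy of nvcorrd.alog (the location depends on your installation), pass the file object to split_trap_blocks, and then look through the blocks for the Node Down and Node Up traps you care about.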
That said, I think I already know what you are going to find. You are looking at the ruleset as a human being would, as a purely logical construct, not as a computer algorithm which executes in real time. The problem is that when the Reset-on-Match releases the held Node Down, nvcorrd immediately proceeds to the next node for it, the Pass-on-Match. But there is nothing for it to match because the Node Up event, which triggered the Reset, has not yet been stored in the cache for the Pass-on-Match. That processing will come after the Node Down is released. In short, this is a timing issue.
So what can you do? You have to insert another step between the Reset and the Pass, one that gives nvcorrd time to finish processing the Node Down for the moment, so that it can go back and store the Node Up. I'm indebted to my colleague, Paul Stroud, who was the first to think of one workable solution. Insert another Reset-on-Match after the first one and connect the Node Down to it as Input One. Set the interval to anywhere from 30 seconds to a minute, and connect its output to the Pass-on-Match as Input One. The trick is to have nothing connected to that second Reset-on-Match as Input Two. That way the Node Down event will just be held for the interval and then released. Once the Node Down is stored in the cache for the second Reset, processing for it ceases, and nvcorrd can then go back to the Node Up and store it in the Pass.
The only catch to this is that the triggering Node Up is sent to TEC about a minute later than the Node Down, but that should not matter much to your TEC rules, since they can evaluate what's already in the reception store as well as the current event.
Incidentally, that's why the IBM direction was to do this kind of correlation in TEC, where timing issues were not as relevant.
Hope this helps
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group