Last time that I spent more than 2 hours on a webex session with clients was four years ago when I troubleshoot the connectivity issue inside client's firewall by using Wireshark. Since then, I usually resolved the issues very quickly, so I was really not ready for the surprise today.
The issue came from SecurityManager. The client got "Invalid timezone" error when sync with their QA app server. We've been reading log files they sent to us, but I asked for a webex since the log files did not give me good findings.
The client was waiting for us when we called them, so everything was very smooth at the beginning. They stopped one of their SecurityManager app server, one nMarket app server so that we did not have to worry about clustering. I spent time setting up debug categories, retrieving POST data sent from SecurityManager server. Eventually I was able to reproduce the error from a web browser by using servlet. Everything led us to believe that something was wrong with the OutboundAPI User View. When I ran from the browser, the user view gave me the Invalid Timezone error, however, if I ran from database directly, the view returned data with no error!
What really troubled me was the tool they use to connect to Oracle database. The client has Oracle SQL Developer on the machine, however, he did not know how to use it well. I did not either. So I had to open Google on local machine the same time. I can modify the view, however I was not able to save the view to test!
Since we do not use that tool here at work, I turned help to the client and he had to get their DBA join the webex. It turned out that the DBA did not know either! We had to switch to DBA's desktop to use her Toad for Oracle. Good that the DBA is really good at the tool and SQL itself and she gave me very good advice when I started debugging.
Interesting findings came. When we ran a query against all 253 user login history records, the exception happened and we traced the error to two local time to GMT conversion PLSQL package calls. I started to use binary search algorithm to find the culprit record. It was a joy to do that in a webex. Not after too long, we find the problem was with one database record which contained a date of 11/06/2011 01:42:53AM value. Why the view works with all other records but not this one? Haha, it was daylight switching hour! I wouldn't be surprised if it's a bug in our code.
A quick test SQL script proved that it's true -- the PLSQL function could not handle that time conversion! Since it's client's QA environment, we just changed the data in the table and it worked! The value itself means that one employee in the client company logged into our product at DST switching hour. I feel I am so lucky now -- I was on call for SecurityManager DST change issues and the client only tested in QA but not their PROD environment. Otherwise I would be waked up that morning... .
I felt not bad after the webex. It was a good story and good finding and we would never find it out if we were just asking for log files in emails. Thanks Ross, Russel and Koma!
One lesson I learned from the webex is that I need to get familiar with other Oracle development tools other than the PLSQL Developer that we are using at work.
No comments:
Post a Comment