Nothing lasts forever, like my X session. Not even 10 seconds, in fact - not even 5.
At work, where I'm running Solaris 10 with the new Java Desktop System (Sun's branded version of Gnome, which AFAICT has bugger all to do with Java), I ran into a strange problem. Actually, several, which cost me my entire morning. How grumpy am I?
So the first problem was some weird hardware issue. I come in, coffee in hand (Map from Food Inc at the station, a nice drop), ready to start the day. I turn on the screen and hit a key, ready for the beast to awaken.
But - nothing. It doesn't respond to my cajoling, not even to the threat of hot liquid being poured in through the air vents. NumLock doesn't do anything, a sure sign that there's some hardware issue. I try ssh-ing from my laptop, nothing. I try pinging it, nothing. He's dead, Jim.
I mutter to myself as I reboot it and wait for the thing to start up, and eventually up comes the login screen. I go to log in as normal, expecting to have to trawl the logs to find out why the entire machine locked up solid. Did it miss me that much?
I go to log in, get a blank screen, a few seconds of activity, then dropped back to the login screen. How rude! No error, no message, nothing. I try again, thinking that it couldn't have been a wrong password. Weird. I try a failsafe session, and that does work. But I'm not getting any logs whatsoever that clue me in on what's going wrong. And if there's somewhere other than the 27 different directory trees Solaris uses for logging that I should examine, I'd love to know.
Anyway, I decide that the slightly crufty dtgreet is clearly not a match for the funky gdm as far as login greeters go, so I decide to change it, then at least I can get some decent logging to find out what's going wrong.
Trawling Sun's site was useless. But thanks to a useful post by Bill Rushmore, I discovered that I could switch off the old login greeter and move to GDM (which is arguably better). The mojo is, first shut off the dtlogin:
# /usr/dt/bin/dtconfig -d
Then enable GDM in SMF:
# svccfg import /var/svc/manifest/application/gdm2-login.xml
# svcadm enable application/gdm2-login
(You may not need to import the service, YMMV.)
This time when I log in, I get the dreaded error:
Your session only lasted less than 10 seconds...
Great. I try the xterm failsafe option, that works fine. I even try the failsafe Gnome option, and interestingly, that does work. Okay, what does Mr. xsession-errors have to say? The last few lines:
/etc/X11/gdm/Xsession: Beginning session setup...
/etc/X11/gdm/Xsession: Setup done, will execute: /bin/ssh-agent -- /usr/dt/config/Xsession.jds
I put in some tracing and checks to make sure the agent is really there. There's no error after that, so what could be the problem? I log in with a failsafe session and run the Xsession manually, which works. Weirdness.
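For the curious, the sort of tracing I mean looks roughly like the sketch below. It is only an illustration: the run_traced function and the TRACE_LOG path are names I made up for this post, not anything that ships with gdm or Solaris. The idea is to run a command, capture everything it prints, and record its exit status, so a silent failure at least leaves a number behind.

```shell
# Hypothetical tracing helper for a session script.
# TRACE_LOG and run_traced are illustrative names, not part of gdm.
TRACE_LOG=${TRACE_LOG:-${HOME:-/tmp}/.xsession-trace}

run_traced() {
    # Record what we are about to run.
    echo "running: $*" >> "$TRACE_LOG"

    # Run it, sending stdout and stderr to the log, and keep the exit status.
    status=0
    "$@" >> "$TRACE_LOG" 2>&1 || status=$?

    # A command that dies quietly still leaves its exit status here.
    echo "exit status $status: $1" >> "$TRACE_LOG"
    return "$status"
}
```

In a test copy of the session script you could then call something like run_traced /bin/ssh-agent -- /usr/dt/config/Xsession.jds and read the log afterwards from a failsafe session.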
After much googling, I find plenty of other people have had the same problem. Various well-meaning people have suggested various odd reasons as to the cause, proffering even more odd, generally kludgy solutions.
I check permissions on my home directory, I check the umask, startup scripts, the agent is running, my coffee is cold, and by now I've even checked the oil in the car. Just to be sure, you know?
One post offers a tantalising hint, however. Someone suggests that there was a premature exit in his ~/.bash_profile. I didn't have that, but it was worth checking out. I moved my .profile aside, tried the login again, and - bingo! - it worked!
Okay, so the problem was in my startup script. I first checked the syntax, making sure everything was kosher. Then I started commenting things out, in that time honoured debugging tradition. And sure enough, I found the culprit.
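In hindsight, a small script could have done the first two passes for me. What follows is a minimal sketch, not a general tool: check_profile is a hypothetical name, and it only looks for the one variable that bit me. It parses the startup file with sh -n (syntax only, nothing is executed), then sources it in an otherwise empty environment and reports whether LD_LIBRARY_PATH comes out set.

```shell
# Hypothetical helper: check_profile FILE
# Pass an absolute path, since "." searches PATH for names without a slash.
check_profile() {
    profile=$1

    # Step 1: syntax only. sh -n parses the file without running it.
    if ! sh -n "$profile"; then
        echo "syntax error in $profile"
        return 1
    fi

    # Step 2: semantics. Source the file in a near-empty environment
    # and see whether it leaves LD_LIBRARY_PATH set.
    value=$(env -i HOME="$HOME" PATH="$PATH" \
        sh -c '. "$1" >/dev/null 2>&1; printf %s "$LD_LIBRARY_PATH"' sh "$profile")

    if [ -n "$value" ]; then
        echo "sets LD_LIBRARY_PATH=$value"
        return 2
    fi
    echo "ok"
}
```

Run it as check_profile "$HOME/.profile". It will not catch everything (a profile that exits early simply reports "ok"), so commenting things out one at a time is still the fallback.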
The day before, I had added /usr/local/lib to my LD_LIBRARY_PATH so that some of the extensions and apps I had installed could be found. (Yes, I'm aware there's a "proper" solution for this: /etc/ld.so.conf on Linux, or crle(1) on Solaris. We all have our lazy moments, ok?) As a convenience, I had put this in my ~/.profile. This worked fine, and I left myself logged in overnight.
But when I was forced to log in again after the reboot to get over the first problem, obviously ssh-agent is smart enough to detect that LD_LIBRARY_PATH is a potential security hole, and is thus not to be trusted. So the whole thing failed silently, with nary so much as a whiff of an error message to clue me in.
So kids, the lessons from this story are:
- Assume nothing
- Turn up the logging!
- Don't ignore any error messages, no matter how insignificant they may seem
- Change one thing at a time, and test one change at a time (otherwise you won't know which change made it work and you won't have understood the solution).
- Check the syntax independently from the semantics (here the syntax was valid, the semantics were not)
- Do not set LD_LIBRARY_PATH in your profile - add it to your ~/.bashrc if you must, so it's only set for interactive shells.
- Shortcuts can cut you back in the long run.
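If you really must take the LD_LIBRARY_PATH shortcut, here is a rough sketch of how I'd fence it in. Treat it as one possible ~/.bashrc fragment under my own assumptions: append_lib_path is a made-up helper name, and the check on $- is simply the usual way to tell whether the shell is interactive.

```shell
# Hypothetical ~/.bashrc fragment; append_lib_path is an illustrative name.
append_lib_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            # Already listed, so leave the path alone.
            ;;
        *)
            # Append, adding a ":" separator only if the path is non-empty.
            LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# $- contains "i" only in interactive shells, so a graphical login
# that runs a non-interactive shell never sees the change.
case $- in
    *i*) append_lib_path /usr/local/lib ;;
esac
```

The duplicate check means that nested shells don't keep growing the path, and the interactive guard keeps it away from whatever the login manager starts on your behalf.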
And if you're a software developer, for the love of $DEITY put in as much logging and tracing as you can!