About three weeks ago, I decided to upgrade our server from OS-X 10.5 to 10.6. The Apple installation disks were 10.6.3, though the latest update was by then 10.6.4.
Let me first quickly outline the nature of the network, as it’s very likely that the types of issues I encountered would not manifest in a different setting. The room can be described as an island network: i.e., the 30 client machines therein are the only ones accessing the server, and the server does not act as a webserver. All clients run 10.5 (except for one with 10.6, which will eventually be used to clone all others to 10.6) and are connected with ethernet cabling through three 1Gb switch-hubs. The ADSL modem also acts as the DHCP server, though all the clients (and the server) have been allocated a manual address.
Note should also be made that the network worked, on the whole, very effectively (apart from the slow ADSL account, and another problem with the latest update to Safari to which I shall return).
Another aspect of this network is that there is an older (errr… some would call it a dinosaur) model Lexmark laserprinter networked therein.
An IMPORTANT note should really be made that 10.6 is, in many important ways, quite unlike 10.5 and earlier. The main claim from Apple is that it only works on Intel-based machines, and is thus ‘leaner’. That’s undoubtedly true, but as a user I also wish they had written in bold headline that 10.6 DOES NOT SUPPORT APPLETALK! This may seem minor, but it took me quite some time to get my MacBook Pro talking to various printers again, and I know other people who had similar problems. Basically, printers and the like now need to be accessed via IPP (unless the printer can be accessed with Bonjour, which thankfully most recent ones can). With older printers, that also means finding out how to alter the IP settings on the printer (in some cases, a waste of a few hours of fiddling).
But let’s return to the Server upgrade.
I of course did not do a backup… though I did have a backup from a few days earlier for a worst-case scenario. For now, however, let’s just say that I had no effective backup.
Running the installer appeared faultless, and even updated to Server 10.6.4 without a seeming hitch.
But now, NONE of the 200+ users were able to login from the client machines.
The Log-in is really a two-step process: the Server acts as the LDAP database, holding the username and password with which the user accesses the machine (except for the single local admin user); and the Server also acts as the location for users’ directories (folders and files).
With 14/20 hindsight, it appears that the LDAP was working fine, but that the real issue came with incorrect directory permissions. But that’s with 14/20 hindsight – and I do not even claim 20/20 hindsight as there are other small issues that may indicate that the problem was a little more than this.
To also quickly skip through the many other considerations I took into account in seeking to isolate the problem(s) from what may have been more general problems on the network: I switched off all except one client computer, replaced cables, used only one switch, restarted the Server and all machines, etc. As eventually became clear, these steps were unnecessary, as the network itself proved faultless (still, they are steps that one needs to take, I suppose, even if they prove a waste of time).
Part of the problem was that the reverse mapping was not correctly set: something that 10.5 never properly created, but also something that 10.6 requires. So this I did manually. After trying various other tweaks here and there (and I’m reasonably ok with OD networks), I eventually called support which, given the nature of the problem, transferred me to a very helpful specialist based in the USA. Many of the same procedures I had already undertaken were repeated (which only confirmed that I had at least been on the right track). We then took the more severe step of destroying the Open Directory and recreating it (a list of users and groups had previously been exported to another location, to avoid having to re-enter each name).
Everything appeared fine, as I was able to login as a ‘test’ user. I thanked him, hung up… and the problem recurred, with even that same user unable to log in again after logging out!
At that stage, I thought it possible that some invisible and corrupted database file was causing havoc, so I decided to take a rather radical step: after re-exporting all user documents to another drive, I ZEROed the server, and re-installed OS-X 10.6.3 Server afresh, and created a couple of NEW users totally afresh as well (in case the problem was in the user database). Checking reverse settings with Terminal, it also appeared that all was fine.
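For readers who want to script the kind of reverse-mapping check mentioned above, here is a minimal sketch in Python. It is not the Terminal check I actually ran; it simply resolves a name to an address and then asks what name that address maps back to. The function name `check_reverse_mapping` and the hostname `server.example.com` are my own placeholders, and the two resolver arguments exist only so the logic can be tried without a live DNS.

```python
import socket


def check_reverse_mapping(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Check that hostname -> IP -> hostname round-trips.

    Returns (ok, detail). `forward` and `reverse` default to the system
    resolver; they are parameters only so that the check can be exercised
    with stand-in functions (an assumption of this sketch).
    """
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {hostname} failed: {exc}"

    try:
        name, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"{hostname} -> {ip}, but reverse lookup failed: {exc}"

    # Compare case-insensitively and ignore a trailing dot.
    wanted = hostname.lower().rstrip(".")
    found = {n.lower().rstrip(".") for n in [name, *aliases]}
    if wanted in found:
        return True, f"{hostname} -> {ip} -> {name}"
    return False, f"{hostname} -> {ip}, but {ip} maps back to {name}"


# Example (hypothetical hostname; substitute the server's real name):
# ok, detail = check_reverse_mapping("server.example.com")
# print("OK" if ok else "MISMATCH", detail)
```

A result of `False` here would only show that forward and reverse records disagree from the machine running the script; it says nothing about why.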
Forwarder IP Addresses
Information from the server installation also gave me one new suggestion: that the DHCP point to the server’s address, and that I did, as well as setting Forwarder IP addresses to the ISP’s DNS. That I also checked, and they appear to the left in the lower panel.
I want to return to this image later, for there’s something else that appears different here….
Quite frankly, by that stage, and after two weeks of fiddling around with settings that are really, for those amongst us that tend to work with servers often enough, relatively straightforward, I accepted defeat. Either there was a bug in the server software, a fault in the installation disks, or something else entirely. I decided to call a local person who also runs OS-X Server courses… and lucky I had not called him a week earlier, for he mentioned that only a few days earlier he had encountered a somewhat similar problem which was fixed by the REVERSE MAPPING BEING RECREATED USING NAMES THAT DO NOT MIMIC THE COMPUTER NAME.
Anyway, I thought that I’d spent too much time on the thing, and contracted him to come and assist.
(If in the Melbourne area, I highly recommend him: Richard Gynes at www.designwyse.com.au)
Installing afresh for the third time
Turns out that the problem was not as obvious as it appears… we had ongoing problems so decided to AGAIN wipe the Server, and re-install afresh.
JUST in case part of the problem was caused by my installation disks, or by the 10.6.3 installation, we used his 10.6.2 disks (which we later updated once installation was complete). As per the previous times, we also installed everything manually, and even used Apple’s example network names (as is evident in the reverse mapping info above).
Next was CREATING a NEW main zone and a new machine name to reverse map, for the existing one cannot be deleted until a new one is created (i.e., there always needs to be one). Having done all this, the next step was to delete the self-made zone, and rename the ones created to match those shown above.
Terminal checks showed that all appeared (again) fine.
The ONLY differences between all that we had done previously and this time are: firstly, the zone/machine reverse mappings were entered fully manually; and secondly, ‘Localzones’ appeared in the ‘recursive queries’ in the DNS settings info box (as in image above left), though why it was missing the other times remains inexplicable.
We also reset the LDAP on the Client test machine to see the now newly named network (this, incidentally, had also been done each time before).
Now for testing it with a test account… and the same thing again, though with one difference: the delay at login made it obvious that the LDAP DB was being correctly read, but that the user’s directory, which sits on the server, could not be accessed.
It should also be mentioned that Home directory locations had been specified from the AFP, and user accounts appeared to have been properly created. Users were imported (as were groups); passwords of course had to be reset.
In comes Passenger to the rescue
Using Passenger, the directories were re-configured so that permissions reflected users… after that, all worked. What I still don’t get is why things had not operated the way they ought to in the first place.
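In hindsight, a quick check of who owned each home directory might have pointed at the permissions much earlier. The sketch below is in Python and is emphatically not what Passenger does; it only lists directories whose owner does not match the directory name. It assumes that each home directory is named after its user's short name, and the path `/Users` in the example is a placeholder for wherever the homes actually live. The `owner_of` parameter is my own addition so that the logic can be tried without real accounts.

```python
import os
import pwd  # Unix-only standard library module


def owner_name(path):
    """Return the short name of the account owning `path`.

    Falls back to the numeric uid (as a string) when the uid cannot be
    resolved to an account name.
    """
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def find_mismatched_homes(users_root, owner_of=owner_name):
    """List (directory, owner) pairs where the owner is not the same-named user.

    Assumption of this sketch: each home directory is named after its
    user's short name. Hidden entries and plain files are skipped.
    """
    mismatches = []
    for entry in sorted(os.listdir(users_root)):
        path = os.path.join(users_root, entry)
        if entry.startswith(".") or not os.path.isdir(path):
            continue
        owner = owner_of(path)
        if owner != entry:
            mismatches.append((entry, owner))
    return mismatches


# Example (hypothetical path; run it on the server itself):
# for home, owner in find_mismatched_homes("/Users"):
#     print(f"{home}: owned by {owner}")
```

Ownership is only one part of a permission set, so an empty list from this would not prove the directories are correctly configured.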
All in all, a rather painful process.