+===### LEAR SYSTEM ADMINISTRATION LOG ###=====================================+
| This log contains updates on maintenance operations and ongoing events.      |
+==============================================================================+

---------- TODO: ----------

* gpuhost4: MIXED IDs, quickfix: installed two TitanX Pascal GPUs
* revert cable swap between GPUHOST6 and GPUHOST7
  /!\ observe what happens next, gpuhost6 disconnections could come back.
      it might not be a good idea to reverse this at all, but rather to re-label the cables
* find the suppliers of the installed GPUs, useful for RMAs

---------- LOG: ----------

2018-11-20: problem of network overhead on the cluster (genome assembly by dchen)
  solution: local copy of the database on node{51-54}-thoth (1TB local disk) to avoid network access
  tmp dir for data: /local_scratch/tmp/
  to copy data: edit the script edgar:/root/copy_data_to_cpu_cluster.sh and run it

2018-09-05: double disk failure on curan
  need for 4TB drives
  took one from hydrus (replaced by a 2TB drive)
  move /scratch/scylla to /scratch/clear/legacy_homedir and remove the scratch from scylla to get a second 4TB drive

2018-08-28: disk failure on hydrus and scylla
  use 4TB drives as replacements

2018-08-06: Vulnerability in Django on Fedora 21
  * reinstall of curan on Xubuntu 16.04
  * deactivate the condition for the vulnerability to be usable on hypnos.
    in "/usr/lib/python2.7/site-packages/django/conf/setting.py":
    - set APPEND_SLASH = False
    - comment out the "django.middleware.common.CommonMiddleware" module and related entries

2018-05-31: reinstall of abo with Windows
  * Admin account:
    - login: .\admin
    - password: Koala_3105
  * ToDo: fix GRUB and boot on the Linux partition (unplugged at the moment)

2018-05-25: reinstall hydrus

2018-04-26:
  * Set up idrac on all gpuhosts (except gpuhost1 and gpuhost11, not working).
    Remote access is possible via http (https://gpuhostXXidrac) or via ssh (ssh gpuhostXXidrac) and the racadm CLI.
  * disable updates on all OAR-related packages (because of a dependency issue):
      apt-mark hold
      apt-mark unhold   # to enable updates again
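For reference, typical racadm usage over the idrac ssh interface looks like this (a generic sketch of standard racadm subcommands, not copied from an actual session; substitute the matching gpuhostXXidrac name):

  ssh root@gpuhostXXidrac          # then, at the iDRAC prompt:
  racadm getsysinfo                # service tag, firmware versions, power state
  racadm serveraction powerstatus  # query chassis power
  racadm serveraction powercycle   # hard power-cycle a wedged node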
2018-04-25:
  * reinstall edgar OS (Xubuntu 16.04)
  * setting up the new OAR server
    --> fixing cgi loading (enable cgi with 'a2enmod cgi')
  * fix X11 forwarding on a remote ssh host:
      $ ssh user@host
      $ su
      # xauth merge /home/user/.Xauthority

2018-04-03: troubleshooting sssd service failing to restart
  rm /var/run/sssd.pid

2018-03-26: change disks of /scratch/kent (1TB -> 4TB)

2017-10-06: couldn't mount /scratch/pascal on several machines
  --> fix: /etc/exports contained references to "thoth" instead of "lear"

2017-10-05: move cpu cluster enclosure to new rack

2017-10-04: reinstall of pascal with Ubuntu 16.04 (Xavier Martin)

2017-10-02: move clear and scratch enclosure to new rack
  Re-do NAT configuration on SFP+ port (see sysadmin.html)

2017-09-06: reinstall adrian (dg)

2017-08-04: update and reboot all cpu cluster nodes (dg)
  except node52 (busy at the time)
  all remaining nodes in the cpu cluster are listed in cpu_cluster_nodes.txt in /root on clear

2017-08-03: reinstall taurus (dg)
  deactivate the Nouveau graphics driver:
  1) add the following lines to /etc/modprobe.d/blacklist.conf:
       blacklist nouveau
       blacklist lbm-nouveau
       options nouveau modeset=0
       alias nouveau off
       alias lbm-nouveau off
  2) disable the kernel Nouveau with the following command:
       echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
  3) build the new initramfs with:
       update-initramfs -u
  4) reboot

2017-07-10: move /scratch/adrian to /scratch/zeus (dg)
  reinstall both adrian and zeus

2017-06-23: wipe last MB of hard drive:
  dd bs=512 if=/dev/zero of=/dev/sda count=2048 seek=$((`blockdev --getsz /dev/sda` - 2048))

2017-05-15: edgar would not mount home directories
  FIX: systemctl enable rpcbind
       systemctl restart rpcbind

2017-04-19: ncdu utility to show recursive directory size

2017-02-28: /dev/md0 /local_scratch auto defaults,nofail,x-systemd.device-timeout=4 0 0

2016-11-02: gpuhost12 sysdisk slot 0 failed, rebuilding

2016-09-20: UBUNTU MIGRATION on GPU/CPU clusters

2016-09-20: edgar OAR hangs
  --> stop mariadb, stop oar-server, kill -9 Almighty:bipbip, restart mariadb, restart oar-server

2016-08-02: installed some machines with Xubuntu 16.04
  -> some recent machines had trouble: couldn't boot after install.
     use the "laptop install" version; will be fixed by Jean-Marc soon.
     otherwise, use the old version of fdisk and set the bootable flag on partition 1

2016-06-15: installing idrac on gpuhosts for easier access
  -> root:to123to
  -> done: 6, 8, 10, 11, 12, 13

2016-06-13: couldn't access homes and scratches from the CPU cluster, GPU cluster, various desktops
  -> updated certificate for LDAP, all machines without puppet were concerned
  -> problem with machines actively used.. have to restart
  -> new certificate in "/home/clear/sys/config_fc21/authconfig_downloaded.pem"

2016-05-12: SSSD fix for LEAR/THOTH group write permissions
  - stop SSSD
  - empty the SSS cache: sss_cache -E
  - edit /etc/sssd/sssd.conf:
      #ldap_schema = rfc2307
      ldap_schema = rfc2307bis
  - restart SSSD

2016-05-10: when compiling and the linker can't find a library, use:
  export LDFLAGS="-L/usr/lib64/..."

2016-04-26: NFS and disk access on clear are superslow
  -> /etc/sysconfig/nfs: added '--no-nfs-version 4', no change apparently
  -> rpcdebug is useful to see what's happening, but doesn't give whole file paths..
     NFS works with file handles, and converting them to full paths is a PITA, I don't even see how
  -> partial FIX: using a fanotify-based solution, found what files were accessed.
     Then ran cssh on all machines to find where the traffic was coming from.
     The source was an Eclipse instance running on curan, deep-scanning a large dataset by mistake.
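For the record, the rpcdebug invocations are along these lines (standard flags; the output goes to the kernel log, so watch dmesg or /var/log/messages):

  rpcdebug -m nfs -s all    # enable all NFS client debug messages
  rpcdebug -m nfsd -s all   # same for the server side, e.g. on clear
  rpcdebug -m nfs -c all    # clear the flags again when done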
2016-04-22: to edit network interface files: /etc/sysconfig/network-scripts/ifcfg-<interface-name>

2016-04-08: to avoid a boot hangup due to a failed fstab entry mount, append the following options to scratches:
  ,nofail,x-systemd.device-timeout=10

2016-04-06: forced sector reallocation on cornwall's system disk, commands used:
  dd if=/dev/sda of=somewhere            # to find out which sector is bad
  hdparm --read-sector 40730418 /dev/sda
  hdparm --write-sector 40730418 /dev/sda

2016-04-04: ran fsck on the SSD drive of node42, by inserting it in node41. Made a bunch of repairs.
  Seems ok now. Both nodes activated in oar.
  old issue: node42, rpmdb open failed with Input/output error.
  can't delete the files /var/lib/rpm/__db.001 .. 003

2016-03-21: had to reboot bigimbaz
  -> DO NOT RELAUNCH THE BIGIMBAZ DEMO SERVER, it went into uninterruptible sleep and required a machine reboot

2016-02-19: installed new R730 gpuhosts
  -> route all cables, tag them at both ends
  -> setup KVM + route KVM cabling, remote access HELL YEAH!

2016-02-17: clear: moved /usr/share to /local_sysdisk, '/' full

2016-02-11: changed /etc/matplotlibrc on cluster nodes to allow X-less rendering
  -> backend: Agg

2016-02-05: tip to redirect a program's output while it is running:
  gdb -p PID
  p dup2(open("/tmp/stdout", 1), 1)
  p dup2(open("/tmp/stderr", 2), 2)
  detach
  quit

2016-02-04: clear has almost no space left in "/"

2016-02-04: gpuhost5 rebooted for no apparent reason, Feb 3 ~13:00, cooling seems to work

2016-02-03: edgar, tried to enable nfs cto in /etc/nfsmount.conf

2016-02-01: SERVER ROOM - CLEARCMC2 POWER SUPPLY FAILURE
  -> several nodes were automatically shut down and impossible to start
  -> took spare from clearcmc

2016-01-25: cornwall, systemd looping too fast
  -> FIX: systemctl daemon-reexec

2016-01-25: cooling unit fixed, gpuhost3 moved back

2016-01-25: moved gpuhost3 from H111/H113 to H115/H117, the cooling unit is defective

2016-01-21: algorab, KDE: polkit keeps asking for a password to update
  -> created file /etc/polkit-1/localauthority/50-local.d/allowuserupdate.pkla
  -> content:
       [Allow User Updates]
       Identity=*
       Action=org.freedesktop.packagekit.system-update
       ResultAny=no
       ResultInactive=no
       ResultActive=yes

2016-01-21: gpuhost3 went into deep sleep; it has behaved strangely in recent days.
  The machine might have gone into limp mode after overheating (suspected).
  -> the cooling unit doesn't seem to turn on, check

2016-01-21: INVERTING NETWORK CABLE, GPUHOST6 and GPUHOST7

2016-01-20: gpuhost6, weird network problems. Not due to the machine itself;
  went to try and debug with J-P Augé, but it went back to normal during the process.

2016-01-19: if a user can't SSH somewhere, check the permissions on their homedir. they *have* to be 755

2016-01-06: pascal had a CPU lockup, machine non-responsive, had to forcefully reboot.
  misc: when rebooting, have to restart autofs and httpd by hand

2015-12-08: OLD KERNEL 3.10.93: managed to automount scratches!!
  You have to force the NFSv3 protocol (v4 support appeared in 3.11 onwards)
  -> /etc/autofs.conf: mount_nfs_default_protocol = 3
  -> /etc/sysconfig/autofs: MOUNT_NFS_DEFAULT_PROTOCOL=3
  -> edit /etc/nfsmount.conf, uncomment the "Defaultvers" and "Nfsvers" values, set them to "3"
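To check which NFS version a mount actually negotiated after these changes (generic verification commands, not part of the original entry):

  nfsstat -m        # lists each NFS mount with its options, look for vers=3
  mount | grep nfs  # same information from the mount table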
2015-12-08: taurus scratch active

2015-11-30: new taurus and new algorab installed

2015-11-30: setup mongodb on cordelia
  - create repository file:
      [mongodb]
      name=MongoDB Repository
      baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
      gpgcheck=0
      enabled=1
  - install packages "mongodb-org mongodb-org-server"
  - move /var/lib/mongo to /local_sysdisk and create a symlink

2015-11-30: put new video card in adrian (artifacts on screen)

2015-11-30: put new video card in zephyrus (beeped at boot), had to pry off the locking cable

2015-11-30: removed dying scratch disk from horatio

2015-11-24: gpuhost11 setup complete

2015-11-24: /scratch/gpuhost3 and /scratch/artemis OK

2015-11-09: clear back online, used these commands:
  -> ifconfig eth1 10.0.0.254
  -> systemctl restart dhcpd
  -> mount /local_scratch2 ...
  -> systemctl restart sssd autofs nfs-server

2015-11-06: clear crash, machine rebooted but cluster down and scratches more or less accessible

2015-10-28: replaced gpuhost1 sysdisk slot 1 (RAID1)

2015-10-19: curan has 25GB of vaguely allocated memory
  -> in /proc/meminfo, it shows up as "Active(anon)"
  -> slabtop: shows up as kmalloc-64
  -> does not belong to any particular process; tried to increase "vfs_cache_pressure" to no avail
  -> trying fix: add "apm=off acpi=off" to the grub options, seems to have worked for someone else

2015-10-13: finally set up gpuhost1 properly. Some thoughts for the next install:
  -> pull out all GPUs before you do anything
  -> install the OS
  -> modify /etc/default/grub: remove "rhgb", add "vmalloc=1024M nouveau.modeset=0 rd.driver.blacklist=nouveau"
  -> grub2-mkconfig -o /boot/grub2/grub.cfg
  -> shut down the system, add one NVIDIA card
  -> boot, install the NVIDIA driver
  -> shut down, install all 4 GPUs

2015-10-07: some mdadm scratches vanish after a reboot (gpuhost3, artemis)
  -> cause: the partition table is there but there is no partition, consequently no "raid" flag,
     which means mdadm won't check these drives for auto-mount
  -> fix that doesn't work:
     recreate the arrays (follow the same procedure as for artemis; be careful that the number of drives and the RAID level match)
     run "mdadm --examine --scan --config=/etc/mdadm.conf > /etc/mdadm.conf"
     this saves a configuration file with the mdadm RAID name and disk UUIDs
     update: it does not help, artemis still hangs at boot

2015-10-02: gpuhost4, trying to find which GPU arrangement makes the machine reboot
  -> 2x GTX980: stable for 4 days, even with jobs running
  -> 1x TitanX (top slot), 1x GTX980: stable for more than 1 day
  -> 2x TitanX (bottom slot is the new one):

2015-09-28: Shreyas had a job on gpuhost7 that didn't die, despite the OAR job ID being terminated
  (and replaced by another job, which was running)

2015-09-28: machine stuck at boot, failed to start nfs-statd (gpuhost3)
  -> systemctl enable rpcbind
  -> reboot

2015-09-25: pascal - new RAID volume (2*1TB RAID1)
  -> mirroring existing scratch (/dev/sdb)
  -> once the mirror is complete, will convert RAID1 to RAID5 with the original disk

2015-09-25: problem with NVIDIA drivers
  kernel update + driver update (304.125 -> 304.128)
  on new kernels, can't install 304.125 anymore

2015-09-25: gpuhost4 rebooted again, seems to reboot every day between 18:00 - 23:00.
  I've pulled out the two GPUs, we'll see if it reboots then.
  If it doesn't, I'll add one GPU and then a second.
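When chasing spontaneous reboots like this, correlating boot times against cron and job activity helps; generic commands, nothing gpuhost4-specific:

  last -x reboot shutdown | head   # timestamps of recent (re)boots and shutdowns
  journalctl --list-boots          # boot IDs with first/last message times (systemd machines)
  journalctl -b -1 -e              # tail of the log from the previous boot, i.e. right before the crash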
2015-09-24: unmounted /scratch/hypnos (was making too much noise), looking for a machine to offload it to

2015-09-23: add gpumem property on edgar OAR

2015-09-22: gpuhost4 and gpuhost5 reboot for no apparent reason, quite often
  -> in the logs, seemed related to /admin/teste-rpmdb.sh failing; fixed it. Will see if it holds.
  -> doesn't hold, rebooted 23 Sept at 18:01, right after a cron pass

2015-09-16: gpuhost1 install won't work. waiting for Jean-Marc Joseph to come back (1 or 2 weeks)

2015-09-15: upgraded NVIDIA driver on gpuhost2-10 for CUDA 7.5

2015-09-14: new remote Windows desktop procedure
  -> install remmina-plugins-rdp remmina
  -> connect to server rds-gra.grenoble.inria.fr, domain AD.inria.fr, using RDP and NLA security

2015-09-14: fixing adrian disk /dev/mapper control
  -> dmraid -r    # discovers disks tagged as under RAID control
  -> dmraid -rE   # removes RAID metadata on disks found (asks Y/N for each disk)
  -> dmsetup remove /dev/mapper/ddf1_44656c6c202020201000005d10281f4742bd335698b3486e

2015-09-14: gemini: new RAID1 disk (initialized as a degraded array) gives Input/Output errors,
  doesn't show in fdisk -l || DEAD DISK
  adrian: new RAID1 disks: one of the disks does not want to be mounted / used as an mdadm disk,
  shows as busy. a weird "/dev/mapper-*" entry appears in fdisk -l

2015-09-11: hypnos, RAID1 2x2TB

2015-09-10: gpuhost6-10 - RAID1 sysdisks

2015-09-10: orion (Federico Pierucci) installed in room E116 (BEBOP team), still connected on .21 subnet

2015-09-09: removed 5x 500G disks from adrian and hypnos
  added second system disk to gpuhost8-9
  added second system disk to gpuhost1, restarted install (after proper disk initialization)

2015-09-09: configured top-of-rack switch, disabled spanning tree
  -> used serial-to-serial port, the usb adapter didn't work at all
  -> connected top-of-rack LEAR to top-of-rack MI
  -> thanks to Jean-Pierre's configuration, clear can now ping gpuhost6-10 over 10.0.0.*
  -> had to disable spanning tree, otherwise the switch would be blacklisted by the MI switch

2015-09-08: to allow user mounts, open file /usr/share/polkit-1/actions/org.freedesktop.udisks2.policy:
  -> find action org.freedesktop.udisks2.filesystem-mount-system
  -> change it to 'yes'

2015-09-08: gpuhost1: system disk failed, difficulties at reinstall

2015-09-07: artemis: /dev/md0 not found /!\
  -> array not recognized at boot
  -> mdadm --examine --scan returns an empty result
  -> individual disks did not show RAID info with "mdadm --examine /dev/sdb"
  -> recreated the array with: # mdadm -Cv /dev/md0 -l1 -n2 /dev/sdb /dev/sdc
  -> data magically reappeared...

2015-09-07: power configuration on gpuhost6-10 changed, now draws equal power from each power supply

2015-09-07: some cluster nodes (node51) and 'clear' have an atlas/numpy problem:
  on "import numpy", python can't find /usr/lib64/atlas/...
  -> /etc/ld.so.conf.d/atlas.conf exists and is referenced on both machines
  -> UPDATE: fixed itself for an unknown reason, may be related to installing a package? this is weird

2015-09-07: cluster nodes didn't have /var/log/messages: installed package "rsyslog", added to fc21_post_install.sh

2015-09-07: added package "gstreamer-plugins-good" to all cluster nodes, added to post_install

2015-09-04: created a user wiki at http://lear/local/wiki

2015-08-31: orion: /var/ full, "/var/log/secure-20150831" was 2.3G of garbage.
  Observed problem: couldn't update svn (NFS file locking didn't work anymore).

2015-08-27: gpuhost7 raid conversion to level 6 finished overnight.
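The RAID 5->6 conversions mentioned here and in the entries below follow the usual mdadm grow procedure; a minimal sketch, with placeholder device names and disk count:

  mdadm --add /dev/md0 /dev/sdX    # add the new drive as a spare first
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-grow.backup
                                   # reshape runs in the background
  cat /proc/mdstat                 # watch the reshape progress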
2015-08-25: added two 6T drives to scratch/gpuhost6, started raid rebuild (the RAID 5->6 conversion finished yesterday).
  Dell replacement disk received last week.

2015-08-24: node50-53: the atlas fix didn't work; it created a "libsatlas.so.3.10_backup" symlink
  that pointed to the threaded version and deleted the single-threaded version

2015-08-17: installed 2x Titan-X in gpuhost2, shelved 2x GTX980

2015-08-17: removed old gpuhost6 scratch
  gpuhost1 scratch --> new gpuhost6 scratch + ADDED A DISK, CONVERTING TO RAID6 (have to finish background check first)
  gpuhost7 scratch + ADDED A DISK, CONVERTING TO RAID6 (started 17/08/2015)
  Gained two 6TB spares

2015-08-17: opened Dell support case for failing 6TB disk
  ( http://www.dell.com/support/incidents-online/fr/fr/frdhs1/Case/srRequest/915367820 )

2015-07-27: gpuhost6: DISK FAILING/FAILED. Backing up Shreyas' data.

2015-07-22: gpuhost4, gpuhost5 OK: included in @net_lear

2015-07-16: can't mount /home/lear or /scratch/clear from gpuhost4, gpuhost5

2015-07-16: change hostnames cepheus, goneril -> gpuhost4, gpuhost5.

2015-07-16: two new Dell T7910 machines, apollo and artemis. install FC21

2015-07-15: install 4* M420 cluster nodes. reconfigured clear for PXE install.

2015-07-10: VNC: xfce doesn't work anymore, use MATE.
  cp /home/clear/sys/misc/xstartup ~USERNAME/.vnc/xstartup

2015-07-10: gpuhost6: /var/lib/systemd/coredump -> /dev/null

2015-07-10: received 2* T7910, 4* M420 cluster nodes.

-- CLEAR UPGRADE --
2015-07-03: weird behavior: jobs don't source .bashrc; according to Nicolas they used to.
  Tempfix: add 'source ~/.bashrc' before the job command.
2015-07-03: clear upgrade done.
2015-07-02: upgrading clear, reinstall is in progress. puppet didn't generate a certificate
-- CLEAR UPGRADE --

2015-06-23: loan 2TB drive to Matthijs (temporary, 1 week)

2015-06-23: /scratch/gpuhost7: NEW, 24TB, RAID 5 ; gpuhost8,9,10: NEW scratch 2-3 disks (RAID1-5)

2015-06-19: /scratch/gpuhost6: NEW, 24TB, RAID 5

2015-06-19: gpuhost1 wouldn't configure its own interface eno1, had to rename and edit the interface config files.

2015-06-19: POWER OUTAGE in the cluster room, includes DNS.
  This explains why all jobs accessing NFS mounts died overnight.

2015-06-15: gpuhost2 froze with no jobs running on it. Barely responsive, had to forcefully reboot.
  Nothing in the logs..

2015-06-15: all gpuhosts are switched to OAR reservation

2015-06-09: rolled out OAR GPU reservation for cepheus, gpuhost1, gpuhost3. The GPU nodes' frontend is edgar.

2015-06-09: to disable puppet:
  puppet agent --disable; systemctl disable puppet; systemctl stop puppet;

2015-06-08: faster RAID rebuild:
  echo 50000 > /proc/sys/dev/raid/speed_limit_min
  echo 200000 > /proc/sys/dev/raid/speed_limit_max
  echo 8192 > /sys/block/md0/md/stripe_cache_size

2015-06-03: gloucester: sound problem, knotify4 still hogging the sound card.
  Current fix: mv /usr/bin/knotify{,_back}

2015-06-03: albireo: yum would hang on any command.
  Fix: cd /var/lib/rpm ; /usr/lib/rpm/rpmdb_recover;

2015-06-02: can't print A3: in Print->Page Handling, select "Shrink to Printable Area", tick only "Auto Rotate and Center"

2015-06-02: ran memtest86 on gpuhost2, no errors showed up.

2015-06-02: clear/drawgantt-svg fixed

2015-06-01: screensaver not working (GNOME3)

2015-06-01: GNOME3 not running after update
  -> rename ~/.config/dconf and log in again

2015-05-26: gpuhost2 still has weird behavior after reboot. Happens with an HDFS-over-sshfs mountpoint (can't create a file).

2015-05-21: gpuhost2 weird I/O errors on scratch.
  Only happens in python; the file is transferred completely, but save() doesn't finish and waits for a timeout.
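To see where a hung save() is blocked, attaching strace to the python process is a reasonable first step (a generic sketch, not what was actually run at the time):

  strace -f -p <PID> -e trace=write,fsync,close   # shows the syscall the process is stuck in
  ls -l /proc/<PID>/fd                            # maps file descriptors to the file on scratch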
2015-05-20: gloucester sound not working, knotify4 was hogging the sound card.
  To find the process: lsof /dev/snd/*

2015-05-19: gpuhost2 is showing memory errors:
  [4868605.167066] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x815d5a offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 channel_mask:1 rank:1)
  [4868608.316828] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
  [4868608.316834] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 11: 8c000047000800c2
  [4868608.316835] EDAC sbridge MC1: TSC 0
  [4868608.316837] EDAC sbridge MC1: ADDR 81c43a000
  [4868608.316838] EDAC sbridge MC1: MISC 908400800080e8c
  [4868608.316840] EDAC sbridge MC1: PROCESSOR 0:306f2 TIME 1431998906 SOCKET 0 APIC 0

2015-04-30: GTX extracted from cepheus, installed in gpuhost1. 256MB of dedicated memory isn't enough, set to 512MB.

2015-04-29: kent dropped from the LEAR network, will be used as Julien Bardonnet's machine

2015-04-29: Titan Black back from RMA, installed in cepheus, OK.

2015-04-27: /etc/sudoers file on cluster nodes: added "yum" and "rpm"

2015-04-23: disabled puppet on gpuhosts

2015-04-16: node22 has a dead memory module, currently running on 44 GB of memory.

2015-04-15: /var/ is filling up quickly on all machines, mostly due to yum cache and core dumps.
  -> cron on edgar every two hours as a temp fix
  -> change core dump behaviour (Matthijs)
  -> doesn't work on gloucester? (can't find the script)

2015-04-15: cornwall is in a weird state: two systemd-libs packages exist.
  Updates fail; deleting the duplicates == deleting almost the whole system...

2015-04-08: node15 couldn't mount lear-home. fix: enable the "rpcbind" service

2015-04-01: automatic kernel updates wreak havoc on GPU machines
  -> enforced by the MI, /etc/yum.conf is overwritten by puppet
  -> static nvidia drivers need to be re-installed (the kernel module is compiled against the kernel)
  -> akmod (automatic kernel module) only works for 340.XX
  -> TEST FIX: auto driver-reinstall script at boot, checks whether nvidia-smi returns without errors
     (sketched below). deployed on: gpuhost1,2,3, gemini, hydrus (machines with 346.XX drivers)

2015-03-30: RAID1 on albireo.

2015-03-30: aquarius system drive dead. New drive + system OK.

2015-03-29: gpuhost1 case is opened. can't close the case with the locking mechanism
  /!\ it seems the mechanism needs to be removed entirely, not just the unlock latch
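The TEST FIX in the 2015-04-01 entry above presumably boils down to a boot-time check of this shape (a sketch; the reinstall script path is hypothetical and the deployed version may differ):

  #!/bin/sh
  # run at boot: if the nvidia module no longer matches the kernel, reinstall the driver
  if ! nvidia-smi > /dev/null 2>&1; then
      /root/install_nvidia_driver.sh   # hypothetical reinstall script
  fi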
////////// Fedora 21 \\\\\\\\\\\\\\\\\\\

=== Userland problems encountered

(x) CUDA examples don't work on desktop machines
  -> FIX: install the latest NVIDIA driver; those bundled in the CUDA package don't work
  -> /home/clear/sys/config_fc21/drivers
  -> todo: bundle the driver install in the post_install script

(x) pulseaudio crashes on some machines (polyxo)
  -> FIX: yum remove alsa-plugins-pulseaudio

(x) dir /softs/ is not mounted on gpuhost3

(x) MATE: fonts behave weirdly (disappearing underscores, aliasing)
  -> System > Control Center > Appearance > Fonts > Subpixel smoothing (LCDs)

(x) numpy and co. are single-threaded
  -> the atlas naming convention changed; numpy is linked against the single-threaded version by default
  -> solution: softlink the single-threaded .so to the multi-threaded .so, applied during post_install_fc21

(x) yael moans about atlas libraries not found
  -> the atlas naming convention changed; you need to recompile yael with the correct .so links.
     The ATLAS blas and lapack libraries are now in /usr/lib64/atlas/libsatlas.so.3, so set
       LAPACKLDFLAGS=/usr/lib64/atlas/libsatlas.so.3
     # due to the remark above, this will link to the *multithreaded* version of BLAS.
     An up-to-date yael version is available here: /home/clear/xmartin/MED/usr/yael_v438_fc21

(x) VNC -> Gnome doesn't work, use XFCE instead:
  replace the .vnc/xstartup file with:
    exec /bin/sh /etc/xdg/xfce4/xinitrc

(x) MATE: missing packages, appended to post_install script

=== System config problems encountered

(x) exports, auto.master, etc... aren't copied despite the post_install_fc21 script
  -> without those, /scratch/ directories are not accessible
  -> quickfix: copy them "by hand", or use /home/clear/sys/finalize_install.sh
  -> fix: datadir is now /home/clear/sys/config_fc21, always accessible during install

(x) zephyrus doesn't boot at all after the initial PXE install.
  there is either a "grub rescue" mode where "linux" can't be run (no boot),
  or the boot sequence stalls at "Attempting boot from hard drive"
  -> fix: install Ubuntu through PXE (seems to wipe the drive in a clean and thorough fashion), then install Fedora

(x) node46, node48 keep looping in an install cycle
  -> explanation: can't write to disk properly
  -> fix: create RAID0 with the two small sysdisks, this wipes enough metadata..

(x) prospero can't install properly
  -> fix: couldn't choose the proper PXE boot, had to unplug the cable to 10.0.0.X

(x) antonio will not install
  -> write errors during install ("not allowed", perhaps SELinux-related)
  -> tried a fix: wipe the drive with dd if=/dev/zero, same errors
  -> FIXED: had to ask the MI to force SELinux off at the bootloader level..
     SELinux remnants of previous installations could persist in the MBR.
     /!\ you have to do a "dd" precisely to wipe the MBR, don't use blocksize=1M
     (see the dd sketch at the end of this section)

(x) antonio installed but crashes on reboot
  -> copied prospero's whole disk image using block copy, wouldn't boot (UUID errors)
  -> boot in the rescue kernel, update fstab with the new UUIDs, run "dracut --regenerate-all --force"

(x) Titan Black is not properly acknowledged by nvidia-smi (faulty card?)
  -> (comes from goneril) tried to switch the card to a different T7610
  -> tried to use a different slot
  -> always the same error in dmesg: "nvdrm failed to load vbios"
  -> the card does give a video feed during boot
  -> TBC: RMA e-network 31 March
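Re the MBR note in the antonio item above: wiping exactly the first sector (boot code + partition table) looks like this (a generic sketch; double-check the target device before running dd):

  dd if=/dev/zero of=/dev/sdX bs=512 count=1   # one 512-byte sector: the MBR only
  # a bs=1M variant would overwrite far more than the MBR, hence the warning above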
=== FC21 machines

** CLUSTER MACHINES **
All cluster machines have been switched to Fedora 21. (05/03/2015)

** DESKTOP MACHINES **
( ) cordelia: can't open the case, there is a cut cable that is not earmarked. none of the keys worked..
(x) exadius (09/03/2015)
(x) burgundy (09/03/2015)
(x) algorab (09/03/2015)
(x) zeus (09/03/2015)
(x) aquarius (09/03/2015)
(x) kent (09/03/2015)
(x) prospero (06/03/2015)
(x) zephyrus (06/03/2015) /!\ lock is seized, stuck key
(x) cepheus (05/03/2015)
(x) goneril (05/03/2015)
(x) gemini (05/03/2015)
(x) gloucester (05/03/2015)
(x) oswald (05/03/2015)
(x) orion (04/03/2015)
(x) phoenix (04/03/2015)
(x) lennox (03/03/2015)
(x) adrian (03/03/2015)
(x) curan (03/03/2015)
(x) taurus (03/03/2015)
(x) horatio (02/03/2015)
(x) edmund (27/02/2015)
(x) polyxo (27/02/2015)
(x) edgar (26/02/2015) OK, 2TB=2TB*2. ??
    (2/03/15): /etc/fstab was overwritten; the md0 entry was deleted and replaced by a non-existent sdb5
(x) hydrus (26/02/2015) OK, 2TB=2TB*2
(x) scylla (26/02/2015) OK, 1TB=1TB*2
(x) regan (25/02/2015)
(x) gpuhost2 (24/02/2015)
(x) gpuhost3 (24/02/2015)
(x) gpuhost1 (20/02/2015)
(x) albireo (13/02/2015)

=== Next machines
( ) pascal (use 1TB drive from edgar). /!\ kernel bug

\\\\\\\\\\\\\\\\\\\\\\\\///////////////////////////////

---------- LOG: ----------

2015-03-06: Titan GPUs are out of production. Ordered GTX 980 instead.

2015-02-25: New machines (gpuhost2, gpuhost3) are installed with FC21.

2015-02-20
----------
Automounter problems solved on gpuhost1. GPUs have not arrived yet...
Doubled size of /scratch/bigimbaz.
Connection problems on pascal due to new MI configuration. Working on a fix...

2015-02-19
----------
New machine (gpuhost1) is installed with FC21. Problems with the automounter...

2015-02-17
----------
Received new machine for the cluster.
Preparing to retire /scratch2/bigimbaz. Still problems with /home/bigimbaz.

2015-02-16
----------
Start of this page.
12:00 cluster back to normal.
We decide to migrate the desktop machines to FC21 during this week or next week.

2015-02-15
----------
Still problems with some machines in the cluster. Not all machines can be rebooted.

2015-02-14
----------
Several machines in the cluster hung due to a faulty mount from /home/bigimbaz. Trying to reboot them.

2015-02-13
----------
Install of Fedora 21 on albireo. Seems to work fine.

2015-02-12
----------
Replaced video card of scylla by an older one. Downgraded the nvidia driver.