+===### LEAR SYSTEM ADMINISTRATION LOG ###=====================================+
| This log contains updates on maintenance operations and ongoing events.      |
+==============================================================================+

---------- TODO: ----------

* gpuhost4: MIXED IDs, quickfix: installed two TitanX Pascal GPUs
* revert cable swap between GPUHOST6 and GPUHOST7
  /!\ observe what happens next, gpuhost6 disconnections could come back.
      it might not be a good idea to reverse this at all, but rather to re-label the cables
* find the suppliers of the installed GPUs, useful for RMAs

---------- LOG: ----------

2018-11-20: problem of network overhead on the cluster (genome assembly by dchen)
  solution: local copy of the database on node{51-54}-thoth (1TB local disk) to avoid network access
  tmp dir for data: /local_scratch/tmp/
  to copy data: edit the script edgar:/root/copy_data_to_cpu_cluster.sh and run it

2018-09-05: double disk failure on curan
  need for 4TB drives
  took one from hydrus (replaced by a 2TB drive)
  move /scratch/scylla to /scratch/clear/legacy_homedir and remove the scratch from scylla to get a second 4TB drive

2018-08-28: disk failure on hydrus and scylla
  use 4TB drives as replacements

2018-08-06: Vulnerability in Django on Fedora 21
  * reinstall of curan on Xubuntu 16.04
  * deactivate the condition for the vulnerability to be usable on hypnos.
    in "/usr/lib/python2.7/site-packages/django/conf/setting.py":
    - set APPEND_SLASH = False
    - comment out the "django.middleware.common.CommonMiddleware" module and related entries

2018-05-31: reinstall of abo with Windows
  * Admin account:
    - login: .\admin
    - password: Koala_3105
  * ToDo: fix GRUB and boot on the Linux partition (unplugged at the moment)

2018-05-25: reinstall hydrus

2018-04-26:
  * Set up idrac on all gpuhosts (except gpuhost1 and gpuhost11, not working).
    Remote access is possible via http (https://gpuhostXXidrac) or via ssh (ssh gpuhostXXidrac) and the racadm CLI.
  * disable updates on all OAR-related packages (because of a dependency issue):
      apt-mark hold
      apt-mark unhold   # to enable updates again
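For reference, typical racadm usage over the idrac ssh interface looks like this (a generic sketch of standard racadm subcommands, not copied from an actual session; substitute the matching gpuhostXXidrac name):

  ssh root@gpuhostXXidrac          # then, at the iDRAC prompt:
  racadm getsysinfo                # service tag, firmware versions, power state
  racadm serveraction powerstatus  # query chassis power
  racadm serveraction powercycle   # hard power-cycle a wedged node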
2018-04-25:
  * reinstall edgar OS (Xubuntu 16.04)
  * setting up the new OAR server
    --> fixing cgi loading (enable cgi with 'a2enmod cgi')
  * fix X11 forwarding on a remote ssh host:
      $ ssh user@host
      $ su
      # xauth merge /home/user/.Xauthority

2018-04-03: troubleshooting sssd service failing to restart
  rm /var/run/sssd.pid

2018-03-26: change disks of /scratch/kent (1TB -> 4TB)

2017-10-06: couldn't mount /scratch/pascal on several machines
  --> fix: /etc/exports contained references to "thoth" instead of "lear"

2017-10-05: move cpu cluster enclosure to new rack

2017-10-04: reinstall of pascal with Ubuntu 16.04 (Xavier Martin)

2017-10-02: move clear and scratch enclosure to new rack
  Re-do NAT configuration on SFP+ port (see sysadmin.html)

2017-09-06: reinstall adrian (dg)

2017-08-04: update and reboot all cpu cluster nodes (dg)
  except node52 (busy at the time)
  all remaining nodes in the cpu cluster are listed in cpu_cluster_nodes.txt in /root on clear

2017-08-03: reinstall taurus (dg)
  deactivate the Nouveau graphics driver:
  1) add the following lines to /etc/modprobe.d/blacklist.conf:
       blacklist nouveau
       blacklist lbm-nouveau
       options nouveau modeset=0
       alias nouveau off
       alias lbm-nouveau off
  2) disable the kernel Nouveau with the following command:
       echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
  3) build the new initramfs with:
       update-initramfs -u
  4) reboot

2017-07-10: move /scratch/adrian to /scratch/zeus (dg)
  reinstall both adrian and zeus

2017-06-23: wipe last MB of hard drive:
  dd bs=512 if=/dev/zero of=/dev/sda count=2048 seek=$((`blockdev --getsz /dev/sda` - 2048))

2017-05-15: edgar would not mount home directories
  FIX: systemctl enable rpcbind
       systemctl restart rpcbind

2017-04-19: ncdu utility to show recursive directory size

2017-02-28: /dev/md0 /local_scratch auto defaults,nofail,x-systemd.device-timeout=4 0 0

2016-11-02: gpuhost12 sysdisk slot 0 failed, rebuilding

2016-09-20: UBUNTU MIGRATION on GPU/CPU clusters

2016-09-20: edgar OAR hangs
  --> stop mariadb, stop oar-server, kill -9 Almighty:bipbip, restart mariadb, restart oar-server

2016-08-02: installed some machines with Xubuntu 16.04
  -> some recent machines had trouble: couldn't boot after install.
     use the "laptop install" version; will be fixed by Jean-Marc soon.
     otherwise, use the old version of fdisk and set the bootable flag on partition 1

2016-06-15: installing idrac on gpuhosts for easier access
  -> root:to123to
  -> done: 6, 8, 10, 11, 12, 13

2016-06-13: couldn't access homes and scratches from the CPU cluster, GPU cluster, various desktops
  -> updated certificate for LDAP, all machines without puppet were concerned
  -> problem with machines actively used.. have to restart
  -> new certificate in "/home/clear/sys/config_fc21/authconfig_downloaded.pem"

2016-05-12: SSSD fix for LEAR/THOTH group write permissions
  - stop SSSD
  - empty the SSS cache: sss_cache -E
  - edit /etc/sssd/sssd.conf:
      #ldap_schema = rfc2307
      ldap_schema = rfc2307bis
  - restart SSSD

2016-05-10: when compiling and the linker can't find a library, use:
  export LDFLAGS="-L/usr/lib64/..."

2016-04-26: NFS and disk access on clear are superslow
  -> /etc/sysconfig/nfs: added '--no-nfs-version 4', no change apparently
  -> rpcdebug is useful to see what's happening, but doesn't give whole file paths..
     NFS works with file handles, and converting them to full paths is a PITA, I don't even see how
  -> partial FIX: using a fanotify-based solution, found what files were accessed.
     Then ran cssh on all machines to find where the traffic was coming from.
     The source was an Eclipse instance running on curan, deep-scanning a large dataset by mistake.
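For the record, the rpcdebug invocations are along these lines (standard flags; the output goes to the kernel log, so watch dmesg or /var/log/messages):

  rpcdebug -m nfs -s all    # enable all NFS client debug messages
  rpcdebug -m nfsd -s all   # same for the server side, e.g. on clear
  rpcdebug -m nfs -c all    # clear the flags again when done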
2016-04-22: to edit network interface files: /etc/sysconfig/network-scripts/ifcfg-<interface-name>

2016-04-08: to avoid a boot hangup due to a failed fstab entry mount, append the following options to scratches:
  ,nofail,x-systemd.device-timeout=10

2016-04-06: forced sector reallocation on cornwall's system disk, commands used:
  dd if=/dev/sda of=somewhere            # to find out which sector is bad
  hdparm --read-sector 40730418 /dev/sda
  hdparm --write-sector 40730418 /dev/sda

2016-04-04: ran fsck on the SSD drive of node42, by inserting it in node41. Made a bunch of repairs.
  Seems ok now. Both nodes activated in oar.
  old issue: node42, rpmdb open failed with Input/output error.
  can't delete the files /var/lib/rpm/__db.001 .. 003

2016-03-21: had to reboot bigimbaz
  -> DO NOT RELAUNCH THE BIGIMBAZ DEMO SERVER, it went into uninterruptible sleep and required a machine reboot

2016-02-19: installed new R730 gpuhosts
  -> route all cables, tag them at both ends
  -> setup KVM + route KVM cabling, remote access HELL YEAH!

2016-02-17: clear: moved /usr/share to /local_sysdisk, '/' full

2016-02-11: changed /etc/matplotlibrc on cluster nodes to allow X-less rendering
  -> backend: Agg

2016-02-05: tip to redirect a program's output while it is running:
  gdb -p PID
  p dup2(open("/tmp/stdout", 1), 1)
  p dup2(open("/tmp/stderr", 2), 2)
  detach
  quit

2016-02-04: clear has almost no space left in "/"

2016-02-04: gpuhost5 rebooted for no apparent reason, Feb 3 ~13:00, cooling seems to work

2016-02-03: edgar, tried to enable nfs cto in /etc/nfsmount.conf

2016-02-01: SERVER ROOM - CLEARCMC2 POWER SUPPLY FAILURE
  -> several nodes were automatically shut down and impossible to start
  -> took spare from clearcmc

2016-01-25: cornwall, systemd looping too fast
  -> FIX: systemctl daemon-reexec

2016-01-25: cooling unit fixed, gpuhost3 moved back

2016-01-25: moved gpuhost3 from H111/H113 to H115/H117, the cooling unit is defective

2016-01-21: algorab, KDE: polkit keeps asking for a password to update
  -> created file /etc/polkit-1/localauthority/50-local.d/allowuserupdate.pkla
  -> content:
       [Allow User Updates]
       Identity=*
       Action=org.freedesktop.packagekit.system-update
       ResultAny=no
       ResultInactive=no
       ResultActive=yes

2016-01-21: gpuhost3 went into deep sleep; it has behaved strangely in recent days.
  The machine might have gone into limp mode after overheating (suspected).
  -> the cooling unit doesn't seem to turn on, check

2016-01-21: INVERTING NETWORK CABLE, GPUHOST6 and GPUHOST7

2016-01-20: gpuhost6, weird network problems. Not due to the machine itself;
  went to try and debug with J-P Augé, but it went back to normal during the process.

2016-01-19: if a user can't SSH somewhere, check the permissions on their homedir. they *have* to be 755

2016-01-06: pascal had a CPU lockup, machine non-responsive, had to forcefully reboot.
  misc: when rebooting, have to restart autofs and httpd by hand

2015-12-08: OLD KERNEL 3.10.93: managed to automount scratches!!
  You have to force the NFSv3 protocol (v4 support appeared in 3.11 onwards)
  -> /etc/autofs.conf: mount_nfs_default_protocol = 3
  -> /etc/sysconfig/autofs: MOUNT_NFS_DEFAULT_PROTOCOL=3
  -> edit /etc/nfsmount.conf, uncomment the "Defaultvers" and "Nfsvers" values, set them to "3"
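To check which NFS version a mount actually negotiated after these changes (generic verification commands, not part of the original entry):

  nfsstat -m        # lists each NFS mount with its options, look for vers=3
  mount | grep nfs  # same information from the mount table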
2015-12-08: taurus scratch active

2015-11-30: new taurus and new algorab installed

2015-11-30: setup mongodb on cordelia
  - create repository file:
      [mongodb]
      name=MongoDB Repository
      baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
      gpgcheck=0
      enabled=1
  - install packages "mongodb-org mongodb-org-server"
  - move /var/lib/mongo to /local_sysdisk and create a symlink

2015-11-30: put new video card in adrian (artifacts on screen)

2015-11-30: put new video card in zephyrus (beeped at boot), had to pry off the locking cable

2015-11-30: removed dying scratch disk from horatio

2015-11-24: gpuhost11 setup complete

2015-11-24: /scratch/gpuhost3 and /scratch/artemis OK

2015-11-09: clear back online, used these commands:
  -> ifconfig eth1 10.0.0.254
  -> systemctl restart dhcpd
  -> mount /local_scratch2 ...
  -> systemctl restart sssd autofs nfs-server

2015-11-06: clear crash, machine rebooted but cluster down and scratches more or less accessible

2015-10-28: replaced gpuhost1 sysdisk slot 1 (RAID1)

2015-10-19: curan has 25GB of vaguely allocated memory
  -> in /proc/meminfo, it shows up as "Active(anon)"
  -> slabtop: shows up as kmalloc-64
  -> does not belong to any particular process; tried to increase "vfs_cache_pressure" to no avail
  -> trying fix: add "apm=off acpi=off" to the grub options, seems to have worked for someone else

2015-10-13: finally set up gpuhost1 properly. Some thoughts for the next install:
  -> pull out all GPUs before you do anything
  -> install the OS
  -> modify /etc/default/grub: remove "rhgb", add "vmalloc=1024M nouveau.modeset=0 rd.driver.blacklist=nouveau"
  -> grub2-mkconfig -o /boot/grub2/grub.cfg
  -> shut down the system, add one NVIDIA card
  -> boot, install the NVIDIA driver
  -> shut down, install all 4 GPUs

2015-10-07: some mdadm scratches vanish after a reboot (gpuhost3, artemis)
  -> cause: the partition table is there but there is no partition, consequently no "raid" flag,
     which means mdadm won't check these drives for auto-mount
  -> fix that doesn't work:
     recreate the arrays (follow the same procedure as for artemis; be careful that the number of drives and the RAID level match)
     run "mdadm --examine --scan --config=/etc/mdadm.conf > /etc/mdadm.conf"
     this saves a configuration file with the mdadm RAID name and disk UUIDs
     update: it does not help, artemis still hangs at boot

2015-10-02: gpuhost4, trying to find which GPU arrangement makes the machine reboot
  -> 2x GTX980: stable for 4 days, even with jobs running
  -> 1x TitanX (top slot), 1x GTX980: stable for more than 1 day
  -> 2x TitanX (bottom slot is the new one):

2015-09-28: Shreyas had a job on gpuhost7 that didn't die, despite the OAR job ID being terminated
  (and replaced by another job, which was running)

2015-09-28: machine stuck at boot, failed to start nfs-statd (gpuhost3)
  -> systemctl enable rpcbind
  -> reboot

2015-09-25: pascal - new RAID volume (2*1TB RAID1)
  -> mirroring existing scratch (/dev/sdb)
  -> once the mirror is complete, will convert RAID1 to RAID5 with the original disk

2015-09-25: problem with NVIDIA drivers
  kernel update + driver update (304.125 -> 304.128)
  on new kernels, can't install 304.125 anymore

2015-09-25: gpuhost4 rebooted again, seems to reboot every day between 18:00 - 23:00.
  I've pulled out the two GPUs, we'll see if it reboots then.
  If it doesn't, I'll add one GPU and then a second.
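When chasing spontaneous reboots like this, correlating boot times against cron and job activity helps; generic commands, nothing gpuhost4-specific:

  last -x reboot shutdown | head   # timestamps of recent (re)boots and shutdowns
  journalctl --list-boots          # boot IDs with first/last message times (systemd machines)
  journalctl -b -1 -e              # tail of the log from the previous boot, i.e. right before the crash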
2015-09-24: unmounted /scratch/hypnos (was making too much noise), looking for a machine to offload it to

2015-09-23: add gpumem property on edgar OAR

2015-09-22: gpuhost4 and gpuhost5 reboot for no apparent reason, quite often
  -> in the logs, seemed related to /admin/teste-rpmdb.sh failing; fixed it. Will see if it holds.
  -> doesn't hold, rebooted 23 Sept at 18:01, right after a cron pass

2015-09-16: gpuhost1 install won't work. waiting for Jean-Marc Joseph to come back (1 or 2 weeks)

2015-09-15: upgraded NVIDIA driver on gpuhost2-10 for CUDA 7.5

2015-09-14: new remote Windows desktop procedure
  -> install remmina-plugins-rdp remmina
  -> connect to server rds-gra.grenoble.inria.fr, domain AD.inria.fr, using RDP and NLA security

2015-09-14: fixing adrian disk /dev/mapper control
  -> dmraid -r    # discovers disks tagged as under RAID control
  -> dmraid -rE   # removes RAID metadata on disks found (asks Y/N for each disk)
  -> dmsetup remove /dev/mapper/ddf1_44656c6c202020201000005d10281f4742bd335698b3486e

2015-09-14: gemini: new RAID1 disk (initialized as a degraded array) gives Input/Output errors,
  doesn't show in fdisk -l || DEAD DISK
  adrian: new RAID1 disks: one of the disks does not want to be mounted / used as an mdadm disk,
  shows as busy. a weird "/dev/mapper-*" entry appears in fdisk -l

2015-09-11: hypnos, RAID1 2x2TB

2015-09-10: gpuhost6-10 - RAID1 sysdisks

2015-09-10: orion (Federico Pierucci) installed in room E116 (BEBOP team), still connected on .21 subnet

2015-09-09: removed 5x 500G disks from adrian and hypnos
  added second system disk to gpuhost8-9
  added second system disk to gpuhost1, restarted install (after proper disk initialization)

2015-09-09: configured top-of-rack switch, disabled spanning tree
  -> used serial-to-serial port, the usb adapter didn't work at all
  -> connected top-of-rack LEAR to top-of-rack MI
  -> thanks to Jean-Pierre's configuration, clear can now ping gpuhost6-10 over 10.0.0.*
  -> had to disable spanning tree, otherwise the switch would be blacklisted by the MI switch

2015-09-08: to allow user mounts, open file /usr/share/polkit-1/actions/org.freedesktop.udisks2.policy:
  -> find action org.freedesktop.udisks2.filesystem-mount-system
  -> change it to 'yes'

2015-09-08: gpuhost1: system disk failed, difficulties at reinstall

2015-09-07: artemis: /dev/md0 not found /!\
  -> array not recognized at boot
  -> mdadm --examine --scan returns an empty result
  -> individual disks did not show RAID info with "mdadm --examine /dev/sdb"
  -> recreated the array with: # mdadm -Cv /dev/md0 -l1 -n2 /dev/sdb /dev/sdc
  -> data magically reappeared...

2015-09-07: power configuration on gpuhost6-10 changed, now draws equal power from each power supply

2015-09-07: some cluster nodes (node51) and 'clear' have an atlas/numpy problem:
  on "import numpy", python can't find /usr/lib64/atlas/...
  -> /etc/ld.so.conf.d/atlas.conf exists and is referenced on both machines
  -> UPDATE: fixed itself for an unknown reason, may be related to installing a package? this is weird

2015-09-07: cluster nodes didn't have /var/log/messages: installed package "rsyslog", added to fc21_post_install.sh

2015-09-07: added package "gstreamer-plugins-good" to all cluster nodes, added to post_install

2015-09-04: created a user wiki at http://lear/local/wiki

2015-08-31: orion: /var/ full, "/var/log/secure-20150831" was 2.3G of garbage.
  Observed problem: couldn't update svn (NFS file locking didn't work anymore).

2015-08-27: gpuhost7 raid conversion to level 6 finished overnight.
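The RAID 5->6 conversions mentioned here and in the entries below follow the usual mdadm grow procedure; a minimal sketch, with placeholder device names and disk count:

  mdadm --add /dev/md0 /dev/sdX    # add the new drive as a spare first
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-grow.backup
                                   # reshape runs in the background
  cat /proc/mdstat                 # watch the reshape progress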
2015-08-25: added two 6T drives to scratch/gpuhost6, started raid rebuild (the RAID 5->6 conversion finished yesterday).
  Dell replacement disk received last week.

2015-08-24: node50-53: the atlas fix didn't work; it created a "libsatlas.so.3.10_backup" symlink
  that pointed to the threaded version and deleted the single-threaded version

2015-08-17: installed 2x Titan-X in gpuhost2, shelved 2x GTX980

2015-08-17: removed old gpuhost6 scratch
  gpuhost1 scratch --> new gpuhost6 scratch + ADDED A DISK, CONVERTING TO RAID6 (have to finish background check first)
  gpuhost7 scratch + ADDED A DISK, CONVERTING TO RAID6 (started 17/08/2015)
  Gained two 6TB spares

2015-08-17: opened Dell support case for failing 6TB disk
  ( http://www.dell.com/support/incidents-online/fr/fr/frdhs1/Case/srRequest/915367820 )

2015-07-27: gpuhost6: DISK FAILING/FAILED. Backing up Shreyas' data.

2015-07-22: gpuhost4, gpuhost5 OK: included in @net_lear

2015-07-16: can't mount /home/lear or /scratch/clear from gpuhost4, gpuhost5

2015-07-16: change hostnames cepheus, goneril -> gpuhost4, gpuhost5.

2015-07-16: two new Dell T7910 machines, apollo and artemis. install FC21

2015-07-15: install 4* M420 cluster nodes. reconfigured clear for PXE install.

2015-07-10: VNC: xfce doesn't work anymore, use MATE.
  cp /home/clear/sys/misc/xstartup ~USERNAME/.vnc/xstartup

2015-07-10: gpuhost6: /var/lib/systemd/coredump -> /dev/null

2015-07-10: received 2* T7910, 4* M420 cluster nodes.

-- CLEAR UPGRADE --
2015-07-03: weird behavior: jobs don't source .bashrc; according to Nicolas they used to.
  Tempfix: add 'source ~/.bashrc' before the job command.
2015-07-03: clear upgrade done.
2015-07-02: upgrading clear, reinstall is in progress. puppet didn't generate a certificate
-- CLEAR UPGRADE --

2015-06-23: loan 2TB drive to Matthijs (temporary, 1 week)

2015-06-23: /scratch/gpuhost7: NEW, 24TB, RAID 5 ; gpuhost8,9,10: NEW scratch 2-3 disks (RAID1-5)

2015-06-19: /scratch/gpuhost6: NEW, 24TB, RAID 5

2015-06-19: gpuhost1 wouldn't configure its own interface eno1, had to rename and edit the interface config files.

2015-06-19: POWER OUTAGE in the cluster room, includes DNS.
  This explains why all jobs accessing NFS mounts died overnight.

2015-06-15: gpuhost2 froze with no jobs running on it. Barely responsive, had to forcefully reboot.
  Nothing in the logs..

2015-06-15: all gpuhosts are switched to OAR reservation

2015-06-09: rolled out OAR GPU reservation for cepheus, gpuhost1, gpuhost3. The GPU nodes' frontend is edgar.

2015-06-09: to disable puppet:
  puppet agent --disable; systemctl disable puppet; systemctl stop puppet;

2015-06-08: faster RAID rebuild:
  echo 50000 > /proc/sys/dev/raid/speed_limit_min
  echo 200000 > /proc/sys/dev/raid/speed_limit_max
  echo 8192 > /sys/block/md0/md/stripe_cache_size

2015-06-03: gloucester: sound problem, knotify4 still hogging the sound card.
  Current fix: mv /usr/bin/knotify{,_back}

2015-06-03: albireo: yum would hang on any command.
  Fix: cd /var/lib/rpm ; /usr/lib/rpm/rpmdb_recover;

2015-06-02: can't print A3: in Print->Page Handling, select "Shrink to Printable Area", tick only "Auto Rotate and Center"

2015-06-02: ran memtest86 on gpuhost2, no errors showed up.

2015-06-02: clear/drawgantt-svg fixed

2015-06-01: screensaver not working (GNOME3)

2015-06-01: GNOME3 not running after update
  -> rename ~/.config/dconf and log in again

2015-05-26: gpuhost2 still has weird behavior after reboot. Happens with an HDFS-over-sshfs mountpoint (can't create a file).

2015-05-21: gpuhost2 weird I/O errors on scratch.
  Only happens in python; the file is transferred completely, but save() doesn't finish and waits for a timeout.
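To see where a hung save() is blocked, attaching strace to the python process is a reasonable first step (a generic sketch, not what was actually run at the time):

  strace -f -p <PID> -e trace=write,fsync,close   # shows the syscall the process is stuck in
  ls -l /proc/<PID>/fd                            # maps file descriptors to the file on scratch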
2015-05-20: gloucester sound not working, knotify4 was hogging the sound card.
  To find the process: lsof /dev/snd/*

2015-05-19: gpuhost2 is showing memory errors:
  [4868605.167066] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x815d5a offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 channel_mask:1 rank:1)
  [4868608.316828] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
  [4868608.316834] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 11: 8c000047000800c2
  [4868608.316835] EDAC sbridge MC1: TSC 0
  [4868608.316837] EDAC sbridge MC1: ADDR 81c43a000
  [4868608.316838] EDAC sbridge MC1: MISC 908400800080e8c
  [4868608.316840] EDAC sbridge MC1: PROCESSOR 0:306f2 TIME 1431998906 SOCKET 0 APIC 0

2015-04-30: GTX extracted from cepheus, installed in gpuhost1. 256MB of dedicated memory isn't enough, set to 512MB.

2015-04-29: kent dropped from the LEAR network, will be used as Julien Bardonnet's machine

2015-04-29: Titan Black back from RMA, installed in cepheus, OK.

2015-04-27: /etc/sudoers file on cluster nodes: added "yum" and "rpm"

2015-04-23: disabled puppet on gpuhosts

2015-04-16: node22 has a dead memory module, currently running on 44 GB of memory.

2015-04-15: /var/ is filling up quickly on all machines, mostly due to yum cache and core dumps.
  -> cron on edgar every two hours as a temp fix
  -> change core dump behaviour (Matthijs)
  -> doesn't work on gloucester? (can't find the script)

2015-04-15: cornwall is in a weird state: two systemd-libs packages exist.
  Updates fail; deleting the duplicates == deleting almost the whole system...

2015-04-08: node15 couldn't mount lear-home. fix: enable the "rpcbind" service

2015-04-01: automatic kernel updates wreak havoc on GPU machines
  -> enforced by the MI, /etc/yum.conf is overwritten by puppet
  -> static nvidia drivers need to be re-installed (the kernel module is compiled against the kernel)
  -> akmod (automatic kernel module) only works for 340.XX
  -> TEST FIX: auto driver-reinstall script at boot, checks whether nvidia-smi returns without errors
     (sketched below). deployed on: gpuhost1,2,3, gemini, hydrus (machines with 346.XX drivers)

2015-03-30: RAID1 on albireo.

2015-03-30: aquarius system drive dead. New drive + system OK.

2015-03-29: gpuhost1 case is opened. can't close the case with the locking mechanism
  /!\ it seems the mechanism needs to be removed entirely, not just the unlock latch
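The TEST FIX in the 2015-04-01 entry above presumably boils down to a boot-time check of this shape (a sketch; the reinstall script path is hypothetical and the deployed version may differ):

  #!/bin/sh
  # run at boot: if the nvidia module no longer matches the kernel, reinstall the driver
  if ! nvidia-smi > /dev/null 2>&1; then
      /root/install_nvidia_driver.sh   # hypothetical reinstall script
  fi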
////////// Fedora 21 \\\\\\\\\\\\\\\\\\\

=== Userland problems encountered

(x) CUDA examples don't work on desktop machines
  -> FIX: install the latest NVIDIA driver; those bundled in the CUDA package don't work
  -> /home/clear/sys/config_fc21/drivers
  -> todo: bundle the driver install in the post_install script

(x) pulseaudio crashes on some machines (polyxo)
  -> FIX: yum remove alsa-plugins-pulseaudio

(x) dir /softs/ is not mounted on gpuhost3

(x) MATE: fonts behave weirdly (disappearing underscores, aliasing)
  -> System > Control Center > Appearance > Fonts > Subpixel smoothing (LCDs)

(x) numpy and co. are single-threaded
  -> the atlas naming convention changed; numpy is linked against the single-threaded version by default
  -> solution: softlink the single-threaded .so to the multi-threaded .so, applied during post_install_fc21

(x) yael moans about atlas libraries not found
  -> the atlas naming convention changed; you need to recompile yael with the correct .so links.
     The ATLAS blas and lapack libraries are now in /usr/lib64/atlas/libsatlas.so.3, so set
       LAPACKLDFLAGS=/usr/lib64/atlas/libsatlas.so.3
     # due to the remark above, this will link to the *multithreaded* version of BLAS.
     An up-to-date yael version is available here: /home/clear/xmartin/MED/usr/yael_v438_fc21

(x) VNC -> Gnome doesn't work, use XFCE instead:
  replace the .vnc/xstartup file with:
    exec /bin/sh /etc/xdg/xfce4/xinitrc

(x) MATE: missing packages, appended to post_install script

=== System config problems encountered

(x) exports, auto.master, etc... aren't copied despite the post_install_fc21 script
  -> without those, /scratch/ directories are not accessible
  -> quickfix: copy them "by hand", or use /home/clear/sys/finalize_install.sh
  -> fix: datadir is now /home/clear/sys/config_fc21, always accessible during install

(x) zephyrus doesn't boot at all after the initial PXE install.
  there is either a "grub rescue" mode where "linux" can't be run (no boot),
  or the boot sequence stalls at "Attempting boot from hard drive"
  -> fix: install Ubuntu through PXE (seems to wipe the drive in a clean and thorough fashion), then install Fedora

(x) node46, node48 keep looping in an install cycle
  -> explanation: can't write to disk properly
  -> fix: create RAID0 with the two small sysdisks, this wipes enough metadata..

(x) prospero can't install properly
  -> fix: couldn't choose the proper PXE boot, had to unplug the cable to 10.0.0.X

(x) antonio will not install
  -> write errors during install ("not allowed", perhaps SELinux-related)
  -> tried a fix: wipe the drive with dd if=/dev/zero, same errors
  -> FIXED: had to ask the MI to force SELinux off at the bootloader level..
     SELinux remnants of previous installations could persist in the MBR.
     /!\ you have to do a "dd" precisely to wipe the MBR, don't use blocksize=1M
     (see the dd sketch at the end of this section)

(x) antonio installed but crashes on reboot
  -> copied prospero's whole disk image using block copy, wouldn't boot (UUID errors)
  -> boot in the rescue kernel, update fstab with the new UUIDs, run "dracut --regenerate-all --force"

(x) Titan Black is not properly acknowledged by nvidia-smi (faulty card?)
  -> (comes from goneril) tried to switch the card to a different T7610
  -> tried to use a different slot
  -> always the same error in dmesg: "nvdrm failed to load vbios"
  -> the card does give a video feed during boot
  -> TBC: RMA e-network 31 March
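Re the MBR note in the antonio item above: wiping exactly the first sector (boot code + partition table) looks like this (a generic sketch; double-check the target device before running dd):

  dd if=/dev/zero of=/dev/sdX bs=512 count=1   # one 512-byte sector: the MBR only
  # a bs=1M variant would overwrite far more than the MBR, hence the warning above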
=== FC21 machines

** CLUSTER MACHINES **
All cluster machines have been switched to Fedora 21. (05/03/2015)

** DESKTOP MACHINES **
( ) cordelia: can't open the case, there is a cut cable that is not earmarked. none of the keys worked..
(x) exadius (09/03/2015)
(x) burgundy (09/03/2015)
(x) algorab (09/03/2015)
(x) zeus (09/03/2015)
(x) aquarius (09/03/2015)
(x) kent (09/03/2015)
(x) prospero (06/03/2015)
(x) zephyrus (06/03/2015) /!\ lock is seized, stuck key
(x) cepheus (05/03/2015)
(x) goneril (05/03/2015)
(x) gemini (05/03/2015)
(x) gloucester (05/03/2015)
(x) oswald (05/03/2015)
(x) orion (04/03/2015)
(x) phoenix (04/03/2015)
(x) lennox (03/03/2015)
(x) adrian (03/03/2015)
(x) curan (03/03/2015)
(x) taurus (03/03/2015)
(x) horatio (02/03/2015)
(x) edmund (27/02/2015)
(x) polyxo (27/02/2015)
(x) edgar (26/02/2015) OK, 2TB=2TB*2. ??
    (2/03/15): /etc/fstab was overwritten; the md0 entry was deleted and replaced by a non-existent sdb5
(x) hydrus (26/02/2015) OK, 2TB=2TB*2
(x) scylla (26/02/2015) OK, 1TB=1TB*2
(x) regan (25/02/2015)
(x) gpuhost2 (24/02/2015)
(x) gpuhost3 (24/02/2015)
(x) gpuhost1 (20/02/2015)
(x) albireo (13/02/2015)

=== Next machines
( ) pascal (use 1TB drive from edgar). /!\ kernel bug

\\\\\\\\\\\\\\\\\\\\\\\\///////////////////////////////

---------- LOG: ----------

2015-03-06: Titan GPUs are out of production. Ordered GTX 980 instead.

2015-02-25: New machines (gpuhost2, gpuhost3) are installed with FC21.

2015-02-20
----------
Automounter problems solved on gpuhost1. GPUs have not arrived yet...
Doubled size of /scratch/bigimbaz.
Connection problems on pascal due to new MI configuration. Working on a fix...

2015-02-19
----------
New machine (gpuhost1) is installed with FC21. Problems with the automounter...

2015-02-17
----------
Received new machine for the cluster.
Preparing to retire /scratch2/bigimbaz. Still problems with /home/bigimbaz.

2015-02-16
----------
Start of this page.
12:00 cluster back to normal.
We decide to migrate the desktop machines to FC21 during this week or next week.

2015-02-15
----------
Still problems with some machines in the cluster. Not all machines can be rebooted.

2015-02-14
----------
Several machines in the cluster hung due to a faulty mount from /home/bigimbaz. Trying to reboot them.

2015-02-13
----------
Install of Fedora 21 on albireo. Seems to work fine.

2015-02-12
----------
Replaced video card of scylla by an older one. Downgraded the nvidia driver.