
  • Raspberry Pi Cluster V: Deploying a NextJs App on Ubuntu Server 20.04

    Intro

    In the last part I opened up my primary node to the Internet. We’re now in a position to make public-facing applications that will eventually connect up with microservices that harness the distributed computing power of the 4 RPi nodes in our cluster.

    Before we can make such parallel applications, we need to be able to deploy a web-facing interface to the primary node in our cluster. Not only is that a generally important thing to be able to do, but it allows us to separate web-specific code/processes from what we might call “computing” or “business-logic” code/processes (i.e. a microservices architecture).

    So in this post (and the next few), I am going to go through the necessary steps to get a MERN stack up and running on our primary node. This is not a cluster-specific task; it is something every full-stack developer needs to know how to do on a linux server.

    Tech Stack Overview

    In the last part, we used AWS Route 53 to set up a domain pointing to our primary node. I mentioned that you need to have a web server like Apache running to check that everything is working, namely the port forwarding and dynamic DNS cronjob.

    We are going to continue on here by creating a customer-facing application with the following features:

    • Full set up of Apache operating as our gateway/proxy web server
    • SSL Certification with certbot
    • NextJs as our application server (providing the “Express”, “React” and “Node” parts of our MERN stack)
    • User signup and authentication with:
      • AWS Cognito as our Authentication Server
      • MongoDB as our general/business-logic DB

    Apache & Certbot Setup

    Apache 101

    This section is aimed at beginners setting up Apache for the first time. Apache is a web server. Its job is to receive an http/https request and return a response. That response is usually one of three things:

    1. A copy of a file on the filesystem
    2. HTML representing the content within the filesystem with links to download individual files (a ‘file browser’)
    3. A response that Apache gets back from another server that Apache “proxied” your original request to.

    In my opinion, absolutely every web developer needs to know how to set up Apache and/or Nginx with SSL certification to be able to accomplish these three things. I tend to use Apache because I am just more used to it than Nginx.

    An important concept in Apache is that of a “virtual host”. The server that Apache runs on can host multiple applications. You might want to serve some files in a folder to the internet at one subdomain (e.g. myfiles.mydomain.com), a react app at another domain (e.g. react.mydomain.com), and an API with JSON responses at yet another domain (e.g. api.mydomain.com).

    In all three of these example subdomains, you are setting up the DNS server to point the subdomain to the same IP Address — the public IP Address of your home in this project’s case. Since all of these requests arrive at the same Apache server listening on port 443, we need to configure Apache to separate them and hand each one to the appropriate process running on our machine. The main way to separate requests in Apache is by the target subdomain, which is done by creating a “virtual host” within the Apache configuration, as demonstrated below.

    Installing Apache and Certbot on Ubuntu 20.04

    Installing apache and certbot on Ubuntu 20.04 is quite straightforward.

    sudo apt install apache2
    sudo snap install core
    sudo snap refresh core
    sudo apt remove certbot
    sudo snap install --classic certbot
    sudo ln -s /snap/bin/certbot /usr/bin/certbot

    We also need to enable the following apache modules with the a2enmod tool (“Apache2-Enable-Module”) that gets installed along with the apache service:

    sudo a2enmod proxy_http proxy macro

    Make sure you have a dynamic domain name pointing to your public IP address, and run the certbot wizard with automatic apache configuration:

    sudo certbot --apache

    If this is the first time you have run it, it will prompt you for an email, domain names, and whether to set up automatic redirects from http to https. (I recommend you do.) It will then modify your configuration files in /etc/apache2/sites-available. The file /etc/apache2/sites-available/000-default-le-ssl.conf looks something like this:

    <IfModule mod_ssl.c>
    <VirtualHost *:443>
            # ...        
            ServerAdmin webmaster@localhost
            DocumentRoot /var/www/html
            # ...
            ErrorLog ${APACHE_LOG_DIR}/error.log
            CustomLog ${APACHE_LOG_DIR}/access.log combined
            # ...
            ServerName www.yourdomain.com
            SSLCertificateFile /etc/letsencrypt/live/www.yourdomain.com/fullchain.pem
            SSLCertificateKeyFile /etc/letsencrypt/live/www.yourdomain.com/privkey.pem
            Include /etc/letsencrypt/options-ssl-apache.conf
    </VirtualHost>
    </IfModule>

    There is quite a lot of boilerplate stuff going on in this single virtual host. It basically says “create a virtual host so that requests received on port 443 whose target URL has the subdomain www.yourdomain.com will get served a file from the directory /var/www/html; decrypt using the information within these SSL files; if errors occur, log them in the default location, etc.”.

    Since we might want to have lots of virtual hosts set up on this machine, each with certbot SSL certification, we will want to avoid having to repeat all of this boilerplate.

    To do this, let’s first disable this configuration with sudo a2dissite 000-default-le-ssl.conf.

    Now let’s create a fresh configuration file with sudo touch /etc/apache2/sites-available/mysites.conf and add the following text:

    <IfModule mod_ssl.c>
    <Macro SSLStuff>
        ServerAdmin webmaster@localhost
        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined
        Include /etc/letsencrypt/options-ssl-apache.conf
        SSLCertificateFile /etc/letsencrypt/live/www.yourdomain.com/fullchain.pem
        SSLCertificateKeyFile /etc/letsencrypt/live/www.yourdomain.com/privkey.pem
    </Macro>
    
    <VirtualHost _default_:443>
        Use SSLStuff
        DocumentRoot /var/www/notfound
    </VirtualHost>
    <VirtualHost *:443>
        Use SSLStuff
        ServerName www.yourdomain.com
        ProxyPass / http://127.0.0.1:5050/
        ProxyPassReverse / http://127.0.0.1:5050/
    </VirtualHost>
    </IfModule>

    Here we are making use of the apache “macro” module we enabled earlier to define the boilerplate configurations that we want all of our virtual hosts to have. By including the line Use SSLStuff in a virtual host, we thereby include everything we defined in the SSLStuff block.

    This configuration has two virtual hosts. The first one is a default; if a request is received without a recognized domain, then serve files from /var/www/notfound. (You of course need to create such a dir, and, at minimum, have an index.html file therein with a “Not found” message.)

    The second virtual host tells Apache to take any request sent to www.yourdomain.com and forward it to localhost on port 5050 where, presumably, a separate server process will be listening for http requests. This port is arbitrary, and is where we will be setting up our NextJs app.
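
    Since mysites.conf lives in sites-available, remember to enable it and check the syntax before reloading Apache (a2ensite is the standard counterpart to the a2dissite command we used above):

    sudo a2ensite mysites.conf
    sudo apache2ctl configtest
    sudo systemctl reload apache2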

    Whenever you change apache configurations, you of course need to restart apache with sudo systemctl restart apache2. To quickly test that the proxied route is working, install node (I always recommend doing this with nvm), install a simple server with npm i -g http-server, create a test index.html file somewhere on your filesystem, and run http-server -p 5050.
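
    For reference, that quick test might look something like the following sketch (the ~/proxytest directory name is just an example):

    # install a recent node via nvm (see the nvm README for its install one-liner)
    nvm install --lts
    npm i -g http-server
    mkdir ~/proxytest && echo "hello through apache" > ~/proxytest/index.html
    cd ~/proxytest && http-server -p 5050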

    Now visit the proxied domain and confirm that you are receiving the content of the index.html file you just created. The great thing about this setup is that Apache is acting as a single encryption gateway on port 443 for all of your apps, so you don’t need to worry about SSL configuration on your individual application servers; all of our inner microservices are safe!

    Expanding Virtual Hosts

    There will inevitably come a time when you want to add more virtual hosts for new applications on the same server. Say that I want to have a folder for serving miscellaneous files to the world.

    First, you need to go back to your DNS interface (AWS Route 53 in my case), and add a new subdomain pointing to your public IP Address.

    Next, in my case, where I am using a script to dynamically update the AWS-controlled IP Address that my domain points to (as I described in the last part of this cluster series), I need to open up crontab -e and add a line for this new domain.

    Next, I need to change the apache configuration by adding another virtual host and restarting apache:

    <VirtualHost *:443>
        Use SSLStuff
        DocumentRoot /var/www/miscweb
        ServerName misc.yourdomain.com
    </VirtualHost>

    Next, we need to create a dir at /var/www/miscweb (with /var/www being the conventional location for all dirs served by apache). Since /var/www has strict read/write permissions requiring sudo, and since I don’t want to have to remember to use sudo every time I want to edit a file therein, I tend to create the real folder in my home dir and link it there with, in this case:

    sudo ln -fs /home/myhome/miscweb /var/www/miscweb

    Next, I need to rerun certbot to expand the domains listed in my active certificate. This is done with the following command:

    sudo certbot certonly --apache --cert-name www.yourdomain.com --expand -d \
    www.yourdomain.com,\
    misc.yourdomain.com

    Notice that when you run this expansion command you have to specify ALL of the domains to be included in the updated certificate including those that had been listed therein previously; it’s not enough to specify just the ones you want to add. Since it can be hard to keep up with all of your domains, I recommend that you keep track of this command with all of your active domains in a text file somewhere on your server. When you want to add another domain, first edit this file with one domain on each line and then copy that new command to the terminal to perform the update.
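
    As a sketch of that idea, you could keep a file like domains.txt (one domain per line; the file name is just a suggestion) and rebuild the expansion command from it:

    # domains.txt contains every domain on the certificate, one per line
    DOMAINS=$(paste -sd, domains.txt)
    sudo certbot certonly --apache --cert-name www.yourdomain.com --expand -d "$DOMAINS"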

    If you want to prevent the user from browsing the files within ~/miscweb, then you need to place an index.html file in there. Add a simple message like “Welcome to my file browser for misc sharing” and check that it works by restarting apache and visiting the domain with https.

    Quick Deploy of NextJs

    We’ll talk more about nextJs in the next part. For now, we’ll do a very quick deployment of nextJs just to get the ball rolling.

    Normally, I’d develop my nextJs app on my laptop, push changes to github or gitlab, pull those changes down on the server, and restart it. However, since node is already installed on the RPi primary node, we can just give it a quick start by doing the following:

    • Install pm2 with npm i -g pm2
    • Create a fresh nextJs app in your home directory with cd ~; npx create-next-app --typescript
    • Move into the dir of the project you just created and edit the start script to include the port you will proxy to: "start": "next start -p 5050"
    • To run the app temporarily, first build it with npm run build, then run npm run start and visit the corresponding domain in the browser to see your nextJs boilerplate app in service
    • To run the app indefinitely (even after you log out of the ssh shell, etc.), you can use pm2 to run it as a managed background process like so: pm2 start npm --name "NextJsDemo" -- start $PWD -p 5050
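
    Putting those steps together, a rough sketch of the whole sequence might look like this (the app and process names are just examples):

    npm i -g pm2
    cd ~ && npx create-next-app my-next-app --typescript
    cd my-next-app
    # edit package.json so the start script reads: "start": "next start -p 5050"
    npm run build
    pm2 start npm --name "NextJsDemo" -- start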

    NextJs has several great features. First, it will pre-render all of your react pages for fast loading and strong SEO. Second, it comes with express-like API functionality built-in. Go to /api/hello at your domain to see the built-in demo route in action, and the corresponding code in pages/api/hello.ts.

    More on NextJs in the next part!

  • Raspberry Pi Cluster Part IV: Opening Up to the Internet

    Intro

    We want to be able to serve content from the cluster, and access it, over the Internet from a home connection. To do that, we need to set up our home router with port forwarding, and open up ports for SSH, HTTP and HTTPS.

    All traffic goes through our primary node (RPi1). If I want to connect to a non-primary node from the outside world, then I can just ssh into RPi1 and then ssh onto the target node.

    So far, we have only been able to move around the network by using passwords. This is not ideal. Having to type in a password each time slows us down and makes our automations trickier to secure. It’s also just not good practice to use passwords, since hackers will spam your ssh servers with brute-force password attacks.

    Network SSH Inter Communication

    So we want to be able to ssh from our laptop into any node, and from any node to any node, without using a password. We do this with private-public key pairs.

    First, if you have not done so already, create a public-private key pair on your unix laptop with ssh-keygen -t rsa and skip adding a passphrase:

    ❯ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/me/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/me/.ssh/id_rsa
    Your public key has been saved in /home/me/.ssh/id_rsa.pub
    The key fingerprint is:
    ....
    The key's randomart image is:
    +---[RSA 3072]----+
    |      . +.o.     |
    ...
    |       .o++Boo   |
    +----[SHA256]-----+

    This will generate two files (“public” and “private”) in ~/.ssh (id_rsa.pub and id_rsa respectively). Now, in order to ssh into another host, we need to copy the content of the public key file into the file ~/.ssh/authorized_keys on the destination server. Once that is in place, you can ssh without needing a password.

    Rather than manually copying the content of the public key file into a remote host, Unix machines provide a script called `ssh-copy-id` that does this for you. Running ssh-copy-id name@server will copy the default id_rsa.pub file content into the /home/name/.ssh/authorized_keys file on the server. If you want to use a non-default public key file, you can specify it with ssh-copy-id -i /path/to/key.pub name@server.

    Once the public-private keys are available on the laptop, we can copy/paste this script to copy the public portion into each of the nodes. Each time, we’ll need to enter that node’s password, but afterwards we can ssh in without one.

    while read SERVER
    do
        ssh-copy-id user@"${SERVER}"
    done <<\EOF
    rpi1
    rpi2
    rpi3
    rpi4
    EOF

    Next, we want to be able to ssh from any node into any other node. We could just copy the private key we just created on the laptop into each node, but this is not the safest practice. So, in this case, I went into each node and repeated the process (created a public-private pair and then ran the above script).

    Securing SSH Connections

    Raspberry Pis are notorious for getting pwned by bots as soon as they’re opened up to the internet. One way to help ensure the security of your RPi is to use Two-Factor Authentication (2FA).

    I am not going to do that because it creates added complexity to keep up with, and the other measures I’ll be taking are good enough.

    Now that we have set up the ssh keys on our nodes, we need to switch off the ability to ssh in using a password — especially on our primary node, which will be open to the internet. We do this by editing the file /etc/ssh/sshd_config and changing the line PasswordAuthentication yes to PasswordAuthentication no. Having done that, the only way to ssh in now is with public-private key pairs, and they only exist on my laptop. (If my laptop gets lost or destroyed, then the only way to access these nodes will be by directly connecting them to a monitor with keyboard, and logging in with a password.)

    The next precaution we’ll take is to change the port on which the sshd service runs on the primary node from the default 22 to some random port. We do this by uncommenting the line #Port 22 in the sshd_config file and changing the number to, say, Port 60022 and restarting the sshd service.

    Having made this change, you will need to specify this non-default port whenever you try to get into that node, e.g. ssh -p 60022 user@rpi1. A bit annoying but, I have heard through the grapevine, this will stop 99% of hacker bots in their tracks.
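
    As a sketch, the two sshd_config edits described above, plus a syntax check and restart, might look like this (double-check the file by hand before restarting; a typo here can lock you out of a headless node):

    sudo sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication no/' /etc/ssh/sshd_config
    sudo sed -i 's/^#\?Port 22$/Port 60022/' /etc/ssh/sshd_config
    sudo sshd -t                  # validate the config before restarting
    sudo systemctl restart ssh    # the systemd unit is called "ssh" on Ubuntu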

    Finally, we can install a service called fail2ban with the usual sudo apt install fail2ban. This is a service that scans log files for suspicious behavior (e.g. failed logins, email-searching bots) and takes action against malicious hosts by modifying the firewall for a period of time (e.g. banning the IP address).
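
    fail2ban ships with a jail definition for sshd; since we moved ssh off port 22, it is worth telling it about the new port. A minimal /etc/fail2ban/jail.local might look like the following sketch (the maxretry and bantime values are just illustrative), followed by sudo systemctl restart fail2ban:

    [sshd]
    enabled  = true
    port     = 60022
    maxretry = 5
    bantime  = 3600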

    With these three measures in place, we can be confident in the security of the cluster when opening it up to the internet.

    Dynamic DNS

    To open up our cluster to the internet, we need to get our home’s public IP address. This is assigned to your home router/modem box by your Internet Service Provider (ISP). In my case, I am using a basic Xfinity service with very meager upload speeds. I’ve used Xfinity for several years now, and it tends to provide fairly stable IP addresses, and does not seem to care if you use your public IP address to host content. (By contrast, I once tried setting up port-forwarding for a friend who had a basic home internet connection provided by Cox, and Cox seemed to actively adapt to block this friend’s outgoing traffic. I.e. Cox wants you to upgrade to a business account to serve content from your home connection.)

    To see your public IP Address, run curl http://checkip.amazonaws.com.

    We want to point a domain name towards our home IP address, but since ISPs can change your public IP address without notice, we need to come up with a method to adapt to any such change. The “classic” way to adapt to these changes in IP address is to use a “dynamic” DNS (DDNS) service. Most modern modem/router devices will give you the option to set up with an account from a company like no-ip.com, and there are plenty of tutorials on the web if you wish to go this route.

    However, I don’t want to pay a company like no-ip.com, or mess about with their “free-tier” service that requires you to e.g. confirm/renew the free service every month.

    Since I manage my domains with AWS Route 53, we can use a cronjob script to periodically check whether the IP Address assigned by the ISP is still the one that the AWS DNS server points to and, if it has changed, use the AWS CLI to update the record. The process is described here and the script that I am using is adapted from this gist. The only change I made was to extract the two key variables HOSTED_ZONE_ID and NAME into the script’s arguments (in order to allow me to run this same script for multiple domains).

    Once I had the script _update_ddns in place, I decided to run it every 10 minutes by opening crontab -e and adding the line:

    */10 * * * * /home/myname/ddns_update/_update_ddns [MY_HOSTED_ZONE] [DOMAIN] >/dev/null 2>&1
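
    For reference, a stripped-down sketch of what such a script does (this is not the gist verbatim; the TTL and record details are illustrative, and it assumes the AWS CLI is configured with credentials allowed to edit the hosted zone):

    #!/usr/bin/env bash
    HOSTED_ZONE_ID="$1"   # e.g. the zone id shown in the Route 53 console
    NAME="$2"             # e.g. www.yourdomain.com

    CURRENT_IP=$(curl -s http://checkip.amazonaws.com)
    RECORDED_IP=$(dig +short "$NAME" @8.8.8.8)

    # Only call AWS if we got a sane IP back and it differs from the DNS record
    if [ -n "$CURRENT_IP" ] && [ "$CURRENT_IP" != "$RECORDED_IP" ]; then
        aws route53 change-resource-record-sets \
            --hosted-zone-id "$HOSTED_ZONE_ID" \
            --change-batch "{
              \"Changes\": [{
                \"Action\": \"UPSERT\",
                \"ResourceRecordSet\": {
                  \"Name\": \"$NAME\",
                  \"Type\": \"A\",
                  \"TTL\": 300,
                  \"ResourceRecords\": [{\"Value\": \"$CURRENT_IP\"}]
                }
              }]
            }"
    fi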

    Port Forwarding

    Finally, we need to tell our modem/router device to forward requests submitted to our ssh, http and https ports onto the primary node’s wifi IP Address. Every router/modem device will be different depending on the brand and model, so you’ll have to poke around for the “port-forwarding” setup.

    In my case, I’m using an Arris router, and it was not too hard to find the port-forwarding settings. You then need to set up a bunch of rules that tell the device which packets coming from the external network on a given port (80 and 443 in the figure below) should be forwarded, and to what internal address and port those packets should be directed (192.168.0.51 in the figure below). Also add a rule if you want to be able to ssh into your primary node on the non-default port.

    Port forwarding on my home modem/router device.

    Make sure you have a server running, e.g. apache, when you test the URL you set up through route 53.

    And that’s it — we have a pretty secure server open to the internet.

  • RaspberryPi Cluster III: Network storage and backup

    Intro

    The goal in this section is to add a hard drive to our primary node, make it accessible to all nodes as a network drive, partition it so that each of the RPi3B+ nodes can have some disk space that is more reliable than the SD cards, configure them so that they log to the network drive, and set up a backup system for the whole cluster.

    Mounting an external drive

    I bought a 2TB external hard drive and connected it to one of the USB 3.0 slots in my primary node RPi4. (The other USB 3.0 slot is used for the external 1TB SSD drive.)

    Connecting a drive will make a Linux system aware of it. Running lsblk will show you all connected disks. In this case, my two disks show up as:

    NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    ...
    sda      8:0    0 931.5G  0 disk
    ├─sda1   8:1    0   256M  0 part /boot/firmware
    └─sda2   8:2    0 931.3G  0 part /
    sdb      8:16   0   1.8T  0 disk
    └─sdb1   8:17   0   1.8T  0 part

    The key thing here is that the SSD drive is “mounted” (as indicated by the paths under MOUNTPOINT in the rows for the two partitions of disk sda). If a disk is not mounted, then there are no read-write operations going on between the machine and the disk and it is therefore safe to remove it. If a disk is mounted then you can mess up its internal state if you suddenly remove it — or if, say, the machine loses power — during a read-write operation.

    To mount a disk, you need to create a location for it in the file system with mkdir. For temporary/exploratory purposes, it is conventional to create such a location in the /mnt directory. However, we will go ahead and create a mount point in the root directory since we intend to create a network-wide permanent mount point in the next section. In this case, I can run the following:

    sudo mkdir /networkshared 
    sudo mount /dev/sdb1 /networkshared 
    ls /networkshared

    At this point, I can read-write to the external 2TB hard drive from the primary node, and I risk corrupting it if I suddenly disconnect. To safely remove a disk, you need to unmount it first with the sudo umount path/to/mountpoint command.

    If you want to automatically mount a drive on start up, then you need to add an entry to the /etc/fstab file. In this case, you first need to determine the PARTUUID (partition universally unique id) with:

    ❯ lsblk -o NAME,PARTUUID,FSTYPE,TYPE /dev/sdb1
    NAME PARTUUID                             FSTYPE TYPE
    sdb1 a692fa77-01                          ext4   part

    … and then add a corresponding line to /etc/fstab (“File System TABle”) as follows:

    PARTUUID=a692fa77-01 /networkshared ext4 defaults 0 0

    This line basically reads to the system on boot up as “look for a disk partition of filesystem type ext4 with id a692fa77-01 and mount it to /networkshared”. (The last three fields (defaults 0 0) are default values for further parameters that determine how the disk will be mounted and how it will behave.)

    To test that this works, you can reboot the machine or, even easier, run sudo mount -a (for mount ‘all’ in fstab).

    Setting up a network drive

    Our goal here is not to just mount the 2TB hard drive for usage on the primary node, but to make it available to all nodes. To do that, we need to have the disk mounted on the primary node as we did in the last section, and we need the primary node to run a server whose task is to read/write to the disk on behalf of requests that will come from the network. There are a few different types of server out there that will do this.

    If you want to make the disk available to mixed kinds of devices on a network (Mac, Windows, Linux), then the common wisdom is to run a “Samba” server on the node on which the disk is mounted. This is a server/transfer-protocol designed primarily by/for Windows, but generally supported elsewhere.

    If you are just sharing files between linux machines, then the common wisdom is that it is better to use the linux-designed NFS (Network File System) server/transfer-protocol.

    Also, since we will be creating a permanent mount point usable by all machines in our cluster network, we’ll create a root folder with maximally permissive permissions:

    sudo mkdir /networkshared 
    sudo chown nobody:nogroup /networkshared
    sudo chmod 777 /networkshared 

    Now, on our primary node, we will run:

    sudo apt install nfs-kernel-server

    … to install and enable the NFS server service, and now we need to edit /etc/exports by adding the following line:

    /networkshared 10.0.0.0/24(rw,sync,no_subtree_check)

    … and then run the following commands to restart the server with these settings:

    sudo exportfs -a
    sudo systemctl restart nfs-kernel-server

    Now, on each of the RPi3B+ nodes we need to install software to make NFS requests across the network:

    sudo apt install nfs-common
    sudo mkdir /networkshared

    … and add this line to /etc/fstab:

    10.0.0.1:/networkshared /networkshared nfs defaults 0 0

    … and run sudo mount -a to activate it. You can now expect to find the external disk accessible on each node at /networkshared.
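
    A quick sanity check from any minor node:

    df -h /networkshared                          # should show 10.0.0.1:/networkshared as the filesystem
    touch /networkshared/hello-from-$(hostname)   # confirm that we can write
    ls -l /networkshared                          # the file should now be visible from every node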

    Testing Read/Write Speeds to Disk

    To see how quickly we can read/write to the network-mounted disk, we can use dd – a low-level command-line tool for copying devices/files. Running it on the primary node, which has direct access to the disk, yields:

    ❯ dd if=/dev/zero of=/networkshared/largefile1 bs=1M count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.95901 s, 217 MB/s

    This figure — 217 MB/s — is a reasonable write speed to a directly connected hard drive. When we try the same command from e.g. rpi2 writing to the network-mounted disk:

    ❯ dd if=/dev/zero of=/networkshared/largefile2 bs=1M count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 212.259 s, 5.1 MB/s

    … we get the terrible speed of 5.1 MB/s. Why is this so slow? Is it the CPU or RAM on rpi2? Is it the CPU or RAM on rpi1? Is the network the bottleneck?

    We can first crudely monitor the CPU/RAM status by restarting the dd command above on rpi2 and, while that is running, starting htop on both rpi2 and rpi1. On both machines, the CPU and RAM seemed to be lightly taxed by the dd process.

    What about network speed? In our case, we want to measure how fast data can be sent from a minor node like rpi2 to the primary node rpi1 where the disk is physically located.

    First, you can monitor the realtime network transfer speed using the cbm tool (installed with sudo apt install cbm) running on rpi1 while writing from rpi2. This reveals the transfer speed to be only in the 5-12 MB/s ballpark. Is that because the network can only go that fast, or is it due to something else?

    Another handy tool for measuring network capacity is iperf. We need to install it on both the source node and the target node (rpi2 and rpi1 respectively in this case) with sudo apt install iperf. Now, on the target node (rpi1) we start iperf in server mode with iperf -s. This will start a server listening on port 5001 on rpi1, awaiting a signal from another instance of iperf running in “client” mode. So on rpi2 we run the following command to tell iperf to ping the target host: iperf -c rpi1. iperf will then take a few seconds to stress and measure the network connection between the two instances and display the network speed. You can see a screen shot of the results here:

    Demo of iperf communicating between two nodes; rpi1 upper half; rpi2 lower half

    As you can see, it turns out that the max network throughput between these two nodes is only about ~95 Mbit/s, which is actually what this $9 switch is advertised as (10/100 Mbit/s). This corresponds to only ~(95/8) MB/s = ~12 MB/s. So the actual write speed of 5.1 MB/s is certainly within the order of magnitude of what the network switch will support. On a second run, I got the write speed up to 8.1 MB/s, and the difference between the maximum network speed (~12 MB/s) and the disk-write speeds (~5-8 MB/s) is likely due to overhead on both ends of the nfs connection.

    To test read speeds, you can simply swap the input for the output files like so on the rpi2:

    ❯ dd of=/dev/null if=/networkshared/largefile2 bs=1M count=1024
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 92.3027 s, 11.6 MB/s

    Read speeds were, as you can see from this output, likewise bottlenecked to ~12 MB/s by the network switch.

    In conclusion, the network switch is bottlenecking the read/write speeds, and would probably have been an order of magnitude faster if I’d just shelled out another $4 for a gigabit switch.

    An Aside on Network Booting

    Now, ideally, we would not be using SD cards on our RPi3B+ nodes to house their entire file systems. A better approach would be to boot each of these nodes with a network file system mounted on the external 2TB disk.

    (This would require getting DNSMasq to act as a TFTP server on the primary node, adjusting the /boot/firmware/cmdline.txt file on the minor nodes, and pointing them to a file system on the network drive. See here: https://docs.oracle.com/cd/E37670_01/E41137/html/ol-dnsmasq-conf.html)

    This is probably possible with these RPi models and Ubuntu server, and I hope to explore this in the future but, for right now, this is a bridge too far since I need to get the primary node open to the internet.

    Backing up the primary node

    My backup philosophy has long been not to try to preserve the exact state of a disk in order to restore things exactly as they were in the event of e.g. disk failure. Rather, I have long preferred to just keep copies of files so that, if my disk were to fail, I would be able to set up a fresh disk in the future and then simply copy over any specific files, or even subsets of files.

    Why? First of all, this just simplifies the backup process IMO. It’s conceptually easier to make copies of files than copies of disks. Second, restoration feels cleaner to me. Over time my computing practices tend to improve, and my files tend to get more disorderly. So I like to “start fresh” whenever I get a new machine, installing only what I currently think I’ll need, rather than copying over God-knows-what cluttersome programs and config files I conjured up in a past life. OK, that might mean that I need to do some more configuration, but I am happy to do so if it means that my files get some spring cleaning.

    Anyhow, in this instance, the minor nodes are simple enough that, were one to fail, it would not be too much work to restore one from scratch, especially given that I have recorded what I have been doing here.

    However, the primary node is already more complex and will soon become more so, so we need to think about making regular backups. There are two basic ways to do backups of a non-GUI linux server.

    One way is to set it up manually with cronjobs and rsync. You can see how that is done in this guide. It’s actually not as complicated as you might expect, and going through this guide gives you a sense of how incremental backups work under the hood.
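
    For a flavor of how those guides work, here is a minimal sketch of an incremental snapshot built around rsync's --link-dest option (the paths are illustrative, not what I actually run). Unchanged files are hard-linked against the previous snapshot, so each dated directory looks like a full copy but only changed files consume new space:

    #!/usr/bin/env bash
    SRC="/home/"
    DEST_ROOT="/networkshared/manual-backups"
    TODAY=$(date +%Y-%m-%d)
    LATEST="$DEST_ROOT/latest"     # symlink to the most recent snapshot

    mkdir -p "$DEST_ROOT/$TODAY"
    # --link-dest hard-links files that are unchanged since the last snapshot
    # (on the very first run rsync just warns that it does not exist yet)
    rsync -a --delete --link-dest="$LATEST" "$SRC" "$DEST_ROOT/$TODAY/"
    ln -sfn "$DEST_ROOT/$TODAY" "$LATEST"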

    The other way is to use a program or service that aims to abstract away the details of the underlying tools, such as rsync, which is what I decided to do here.

    The first tool I tried after some googling was bacula. However, after running it for several months, I realized that it was not backing things up incrementally (but, rather, performing full backups every day), and that the configuration scripts were so ridiculously convoluted that I would lose the will to live trying to get it to work for my very simple use case. So I decided to look for an alternative much closer to running simple snapshots wrapped around rsync.

    And that’s exactly what I found: rsnapshot. Unlike bacula, which when installed requires you to configure and manage two background services with insanely idiosyncratic config files and bconsole commands (don’t ask what that is), rsnapshot is a much simpler tool that runs no background service. Rather, you simply edit a config file for a tool that just wraps around rsync, and then you set up cronjobs to execute that tool as you prefer. And, in accordance with my backup philosophy, you just specify which files/dirs you want backed up, so restoration in the future involves simply copying/consulting these backed-up files when reestablishing your machine afresh.
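
    The cronjobs are the only moving part. A schedule matching the alpha/beta/gamma retain levels in the config below might look like this sketch (the exact times are arbitrary):

    # crontab -e (run as root, or with sudo crontab -e)
    0 2 * * *    /usr/bin/rsnapshot alpha    # daily snapshots
    0 3 * * 1    /usr/bin/rsnapshot beta     # weekly snapshots
    30 3 1 * *   /usr/bin/rsnapshot gamma    # monthly snapshots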

    (With bacula, you can’t just browse your backed-up files — no no no — the files are stored in some shitty byte-code file format which means that you can only restore those files by working with the god-awful bacula bconsole. Honestly, it’s tools like bacula that really suck the life out of you.)

    In fact, I was so happy with the utter simplicity of rsnapshot that I also installed it on one of my minor nodes in order to back up its files as well. For reference, here is my rsnapshot config file:

    # ...
    #######################
    # CONFIG FILE VERSION #
    #######################
    
    config_version	1.2
    
    ###########################
    # SNAPSHOT ROOT DIRECTORY #
    ###########################
    
    # All snapshots will be stored under this root directory.
    snapshot_root	/networkshared/rsnapshot
    
    
    # If no_create_root is enabled, rsnapshot will not automatically create the
    # snapshot_root directory. This is particularly useful if you are backing
    # up to removable media, such as a FireWire or USB drive.
    #
    no_create_root	1
    
    #################################
    # EXTERNAL PROGRAM DEPENDENCIES #
    #################################
    
    # LINUX USERS:   Be sure to uncomment "cmd_cp". This gives you extra features.
    # EVERYONE ELSE: Leave "cmd_cp" commented out for compatibility.
    #
    # See the README file or the man page for more details.
    #
    cmd_cp		/bin/cp
    
    # uncomment this to use the rm program instead of the built-in perl routine.
    #
    cmd_rm		/bin/rm
    
    # rsync must be enabled for anything to work. This is the only command that
    # must be enabled.
    #
    cmd_rsync	/usr/bin/rsync
    
    # Uncomment this to enable remote ssh backups over rsync.
    #
    #cmd_ssh	/usr/bin/ssh
    
    # Comment this out to disable syslog support.
    #
    cmd_logger	/usr/bin/logger
    
    # Uncomment this to specify the path to "du" for disk usage checks.
    # If you have an older version of "du", you may also want to check the
    # "du_args" parameter below.
    #
    cmd_du		/usr/bin/du
    
    # Uncomment this to specify the path to rsnapshot-diff.
    #
    cmd_rsnapshot_diff	/usr/bin/rsnapshot-diff
    
    # Specify the path to a script (and any optional arguments) to run right
    # before rsnapshot syncs files
    #
    #cmd_preexec	/path/to/preexec/script
    
    # Specify the path to a script (and any optional arguments) to run right
    # after rsnapshot syncs files
    #
    #cmd_postexec	/path/to/postexec/script
    
    # Paths to lvcreate, lvremove, mount and umount commands, for use with
    # Linux LVMs.
    #
    linux_lvm_cmd_lvcreate	/sbin/lvcreate
    linux_lvm_cmd_lvremove	/sbin/lvremove
    linux_lvm_cmd_mount	/bin/mount
    linux_lvm_cmd_umount	/bin/umount
    
    #########################################
    #     BACKUP LEVELS / INTERVALS         #
    # Must be unique and in ascending order #
    # e.g. alpha, beta, gamma, etc.         #
    #########################################
    
    # Days
    retain	alpha	6
    # Weeks
    retain	beta	6
    # Months
    retain	gamma	6
    
    ############################################
    #              GLOBAL OPTIONS              #
    # All are optional, with sensible defaults #
    ############################################
    
    # Verbose level, 1 through 5.
    # 1     Quiet           Print fatal errors only
    # 2     Default         Print errors and warnings only
    # 3     Verbose         Show equivalent shell commands being executed
    # 4     Extra Verbose   Show extra verbose information
    # 5     Debug mode      Everything
    #
    verbose		2
    
    # Same as "verbose" above, but controls the amount of data sent to the
    # logfile, if one is being used. The default is 3.
    # If you want the rsync output, you have to set it to 4
    #
    loglevel	3
    
    # If you enable this, data will be written to the file you specify. The
    # amount of data written is controlled by the "loglevel" parameter.
    #
    logfile	/var/log/rsnapshot.log
    
    # If enabled, rsnapshot will write a lockfile to prevent two instances
    # from running simultaneously (and messing up the snapshot_root).
    # If you enable this, make sure the lockfile directory is not world
    # writable. Otherwise anyone can prevent the program from running.
    #
    lockfile	/var/run/rsnapshot.pid
    
    # By default, rsnapshot check lockfile, check if PID is running
    # and if not, consider lockfile as stale, then start
    # Enabling this stop rsnapshot if PID in lockfile is not running
    #
    #stop_on_stale_lockfile		0
    
    # Default rsync args. All rsync commands have at least these options set.
    #
    #rsync_short_args	-a
    #rsync_long_args	--delete --numeric-ids --relative --delete-excluded
    
    # ssh has no args passed by default, but you can specify some here.
    #
    #ssh_args	-p 22
    
    # Default arguments for the "du" program (for disk space reporting).
    # The GNU version of "du" is preferred. See the man page for more details.
    # If your version of "du" doesn't support the -h flag, try -k flag instead.
    #
    du_args	-csh
    
    # If this is enabled, rsync won't span filesystem partitions within a
    # backup point. This essentially passes the -x option to rsync.
    # The default is 0 (off).
    #
    #one_fs		0
    
    # The include and exclude parameters, if enabled, simply get passed directly
    # to rsync. If you have multiple include/exclude patterns, put each one on a
    # separate line. Please look up the --include and --exclude options in the
    # rsync man page for more details on how to specify file name patterns.
    #
    #include	"/"
    #exclude	"/networkshared"
    
    # The include_file and exclude_file parameters, if enabled, simply get
    # passed directly to rsync. Please look up the --include-from and
    # --exclude-from options in the rsync man page for more details.
    #
    #include_file	/path/to/include/file
    #exclude_file	/path/to/exclude/file
    
    # If your version of rsync supports --link-dest, consider enabling this.
    # This is the best way to support special files (FIFOs, etc) cross-platform.
    # The default is 0 (off).
    #
    #link_dest	0
    
    # When sync_first is enabled, it changes the default behaviour of rsnapshot.
    # Normally, when rsnapshot is called with its lowest interval
    # (i.e.: "rsnapshot alpha"), it will sync files AND rotate the lowest
    # intervals. With sync_first enabled, "rsnapshot sync" handles the file sync,
    # and all interval calls simply rotate files. See the man page for more
    # details. The default is 0 (off).
    #
    #sync_first	0
    
    # If enabled, rsnapshot will move the oldest directory for each interval
    # to [interval_name].delete, then it will remove the lockfile and delete
    # that directory just before it exits. The default is 0 (off).
    #
    #use_lazy_deletes	0
    
    # Number of rsync re-tries. If you experience any network problems or
    # network card issues that tend to cause ssh to fail with errors like
    # "Corrupted MAC on input", for example, set this to a non-zero value
    # to have the rsync operation re-tried.
    #
    #rsync_numtries 0
    
    # LVM parameters. Used to backup with creating lvm snapshot before backup
    # and removing it after. This should ensure consistency of data in some special
    # cases
    #
    # LVM snapshot(s) size (lvcreate --size option).
    #
    #linux_lvm_snapshotsize	100M
    
    # Name to be used when creating the LVM logical volume snapshot(s).
    #
    #linux_lvm_snapshotname	rsnapshot
    
    # Path to the LVM Volume Groups.
    #
    #linux_lvm_vgpath	/dev
    
    # Mount point to use to temporarily mount the snapshot(s).
    #
    #linux_lvm_mountpath	/path/to/mount/lvm/snapshot/during/backup
    
    ###############################
    ### BACKUP POINTS / SCRIPTS ###
    ###############################
    
    # LOCALHOST
    # DWD: Careful -- you need to copy each line and modify, otherwise your tabs will be spaces!
    backup	/home/	localhost/
    backup	/etc/	localhost/
    backup	/usr/	localhost/
    backup	/var/	localhost/
    # You must set linux_lvm_* parameters below before using lvm snapshots
    #backup	lvm://vg0/xen-home/	lvm-vg0/xen-home/
    
    # EXAMPLE.COM
    #backup_exec	/bin/date "+ backup of example.com started at %c"
    #backup	root@example.com:/home/	example.com/	+rsync_long_args=--bwlimit=16,exclude=core
    #backup	root@example.com:/etc/	example.com/	exclude=mtab,exclude=core
    #backup_exec	ssh root@example.com "mysqldump -A > /var/db/dump/mysql.sql"
    #backup	root@example.com:/var/db/dump/	example.com/
    #backup_exec	/bin/date "+ backup of example.com ended at %c"
    
    # CVS.SOURCEFORGE.NET
    #backup_script	/usr/local/bin/backup_rsnapshot_cvsroot.sh	rsnapshot.cvs.sourceforge.net/
    
    # RSYNC.SAMBA.ORG
    #backup	rsync://rsync.samba.org/rsyncftp/	rsync.samba.org/rsyncftp/

  • Raspberry Pi Cluster Part II: Network Setup

    Introduction

    In the last post we got the hardware in order and made each of our 4 RPi nodes production ready with Ubuntu Server 20.04. We also established wifi connections between each node and the home router.

    In this post, I’m going to describe how to set up the “network topology” that will enable the cluster to become easily transportable. The primary RPi4 node will act as the gateway/router to the cluster. It will communicate with the home router on behalf of the whole network. If I move in the future, then I’ll only have to re-establish a wifi connection with this single node in order to restore total network access to each node. I also only need to focus on securing this node in order to expose the whole cluster to the internet. Here’s the schematic again:

    In my experience, it’s tough to learn hardware and networking concepts because the field is thick with jargon. I am therefore going to write as though to my younger self keenly interested in becoming self-reliant in the field of computer networking.

    Networking Fundamentals

    If you’re not confident with your network fundamentals, then I suggest you review the following topics by watching the linked explainer videos. (All these videos are made by the YouTube channel “Power Cert Animated Videos” and are terrific.)

    Before we get into the details of our cluster, let’s quickly review the three main things we need to think about when setting up a network: IP-address assignment, domain-name resolution, and routing.

    IP-Address Assignment

    At its core, networking is about getting fixed-length “packets” of 1s and 0s from one program running on a computer to another program running on any connected computer (including programs running on the same computer). For that to happen, each computer needs to have an address – an IP Address – assigned to it. As explained in the above video, the usual way in which that happens is by interacting with a DHCP server. (However, most computers nowadays run a process in the background that will attempt to negotiate an IP Address automatically in the event that no machine on its network identifies itself as a DHCP server.) In short, we’ll need to make sure that we have a DHCP server on our primary node in order to assign IP addresses to the other nodes.

    Domain-Name Resolution

    Humans do not like to write instructions as 1s and 0s, so we need each node in our network to be generally capable of translating a human-readable address (e.g. ‘www.google.com’, ‘rpi3’) into a binary IP address. This is where domain-name servers (DNS) and related concepts come in.

    The word “resolve” is used to describe the process of converting a human-readable address into an IP address. In general, an application that needs to resolve an IP address will interact with a whole bunch of other programs, networks and servers to obtain its target IP address. The term “resolver” is sometimes used to refer to this entire system of programs, networks and servers, and sometimes to a single element within such a system. (Context usually makes it clear.) From here on, we’ll use “resolver” to refer to a single element within such a system whose job is to convert strings of letters to an IP Address, and “resolver system” to refer to the whole system.

    Three types of resolver to understand here are “stub resolvers”, “recursive resolvers”, and “authoritative resolvers”. A stub resolver is a program that basically acts as a cache within the resolver system. If it has recently received a request to return an IP address in exchange for a domain name (and therefore has it in its cache), then it will return that IP address. Otherwise, it will pass the request onto another resolver (which might also be a stub resolver that has to just pass the buck on).

    A recursive resolver will also act as a cache and if it does not have all of the information needed to return a complete result, then it will pass on a request for information to another resolver. Unlike a stub resolver though, it might not receive back a final answer to its question but, rather, an address to another resolver that might have the final answer. The recursive resolver will keep following any such lead until it gets its final answer.

    An “authoritative” resolver is a server that does not pass the buck on. It’s the final link in the chain, and if it does not have the answer or suggestions for another server to consult, then the resolution will fail, and all of these resolvers will send back a failure message.

    In summary, domain-name resolution is all about finding a simple lookup table that associates a string (domain name) with a number (the IP Address). This entry in the table is called an “A Record” (A for Address).

    Routing

    Once a program has an IP Address to send data to, it needs to know where first to send the packet in order to get it relayed. In order for this to happen, each network interface needs to have a router (gateway) address applied to it when configured. You can see the router(s) on a Linux machine with route -n. In a home setup, this router will be the address of the wifi/modem box. Once the router address is determined, the application can just send packets there and the magic of Internet routing will take over.

    Ubuntu Server Networking Fundamentals

    Overview

    Ubuntu Server 20.04, which we’re using here, comes with several key services/tools that are installed/enabled by default or by common practice: systemd-resolved, systemd-networkd, NetworkManager and netplan.

    systemd-resolved

    You can learn the basics about it by running:

    man systemd-resolved

    This service is a stub resolver making it possible for applications running on the system to resolve hostnames. Applications running on the system can interact with it by issuing some low-level kernel jazz via their underlying C libraries, or by pinging the internal (“loopback”) network address 127.0.0.53. To see it in use as a stub server, you can run dig @127.0.0.53 www.google.com.

    You can check what DNS servers it is set up to consult by running resolvectl status. (resolvectl is a pre-installed tool that lets you interact with the running systemd-resolved service; see resolvectl --help to get a sense of what you can do with it.)

    Now we need to ask: how does systemd-resolved resolve hostnames? It does so by communicating over a network with a DNS server. How do you configure it so it knows which DNS servers to consult and in what order of priority?

    systemd-networkd

    systemd-networkd is a pre-installed and pre-enabled service on Ubuntu that acts as a DHCP client (listening on port 68 for signals from a DHCP server). So when you switch on your machine and this service starts up, it will negotiate the assignment of an IP Address on the network based upon DHCP broadcast signals. In the absence of a DHCP server on the network, it will negotiate with any other device. I believe it is also involved in the configuration of interfaces.

    NetworkManager

    This is an older service that does much the same as networkd. It is NOT enabled by default, but is so prominent that I thought it would be worth mentioning in this discussion. (Also, during my research to try and get the cluster configured the way I want it, I installed NetworkManager and messed with it only to ultimately conclude that this was unnecessary and confusing.)

    Netplan

    Netplan is a pre-installed tool (not a service) that, in theory, makes it easier to configure systemd-resolved and either networkd or NetworkManager. The idea is that you declare your desired network end state in a YAML file (/etc/netplan/50-cloud-init.yaml) so that after start up (or running netplan apply), it will do whatever needs to be done under the hood with the relevant services to get the network into your desired state.

    Other Useful Tools

    In general, when doing networking on linux machines, it’s useful to install a couple more packages:

    sudo apt install net-tools traceroute

    The net-tools package gives us a bunch of classic command-line utilities, such as netstat. I often use it (in an alias) to check what ports are in use on my machine: sudo netstat -tulpn.
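
    For example, a simple alias in ~/.bashrc (the name is arbitrary):

    alias ports='sudo netstat -tulpn'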

    traceroute is useful in making sense of how your network is presently set up. Right off the bat, running traceroute google.com will show you how you reach Google.

    Research References

    For my own reference, the research I am presenting here is derived in large part from the following articles:

    • This is the main article I consulted that shows someone using dnsmasq to set up a cluster very similar to this one, but using Raspbian instead of Ubuntu.
    • This article and this article on getting dnsmasq and system-resolved to handle single-word domain names.
    • Overview of netplan, NetworkManager, etc.
    • https://unix.stackexchange.com/questions/612416/why-does-etc-resolv-conf-point-at-127-0-0-53
    • This explains why you get the message “ignoring nameserver 127.0.0.1” when starting up dnsmasq.
    • Nice general intro to key concepts with linux
    • This aids understanding of systemd-resolved’s priorities when multiple DNS’s are configured on same system
    • https://opensource.com/business/16/8/introduction-linux-network-routing
    • https://www.grandmetric.com/2018/03/08/how-does-switch-work-2/
    • https://www.cloudsavvyit.com/3103/how-to-roll-your-own-dynamic-dns-with-aws-route-53/

    Setting the Primary Node

    OK, enough preliminaries, let’s get down to setting up our cluster.

    A chief goal is to try to set up the network so that as much of the configuration as possible is on the primary node. For example, if we want to be able to ssh from rpi2 to rpi3, then we do NOT want to have to go to each node and explicitly state where each hostname is to be found.

    So we want our RPi4 to operate as the single source of truth for domain-name resolution and IP-address assignment. We do this by running dnsmasq – a simple service that turns our node into a DNS and DHCP server:

    sudo apt install dnsmasq
    sudo systemctl status dnsmasq

    We configure dnsmasq with /etc/dnsmasq.conf. On this fresh install, this conf file will be full of fairly detailed notes. Still, it takes some time to get the hang of how it all fits together. This is the file I ended up with:

    # Choose the device interface to configure
    interface=eth0
    
    # We will listen on the static IP address we declared earlier
    # Note: this might be redundant
    listen-address=127.0.0.1
    
    # Enable addresses in range 10.0.0.1-128 to be leased out for 12 hours
    dhcp-range=10.0.0.1,10.0.0.128,12h
    
    # Assign static IPs to cluster members
    # Format = MAC:hostname:IP
    dhcp-host=ZZ:YY:XX:WW:VV:UU,rpi1,10.0.0.1
    dhcp-host=ZZ:YY:XX:WW:VV:UU,rpi2,10.0.0.2
    dhcp-host=ZZ:YY:XX:WW:VV:UU,rpi3,10.0.0.3
    dhcp-host=ZZ:YY:XX:WW:VV:UU,rpi4,10.0.0.4
    
    # Broadcast the router, DNS and netmask to this LAN
    dhcp-option=option:router,10.0.0.1
    dhcp-option=option:dns-server,10.0.0.1
    dhcp-option=option:netmask,255.255.255.0
    
    # Broadcast host-IP relations defined in /etc/hosts
    # And enable single-name domains
    # See here for more details
    expand-hosts
    domain=mydomain.net
    local=/mydomain.net/
    
    # Declare upstream DNS's; we'll just use Google's
    server=8.8.8.8
    server=8.8.4.4
    
    # Useful for debugging issues
    # Run 'journalctl -u dnsmasq' for resultant logs
    log-queries
    log-dhcp
    
    # These two are recommended default settings
    # though the exact scenarios they guard against 
    # are not entirely clear to me; see man for further details
    domain-needed
    bogus-priv

    Hopefully these comments are sufficient to convey what is going on here. Next, we make sure that the /etc/hosts file associates the primary node with its domain name, rpi1. It’s not clear to me why this is needed. The block of dhcp-host definitions above does succeed in enabling dnsmasq to resolve rpi2, rpi3, and rpi4, but the line for rpi1 does not work. I assume that this is because dnsmasq is not setting the IP address of rpi1, and this type of setting only works for hosts whose IP Address it assigns. (Why that is the case seems odd to me.)

    # /etc/hosts
    10.0.0.1 rpi1

    Finally, we need to configure the file /etc/netplan/50-cloud-init.yaml on the primary node in order to declare this node with a static IP Address on both the wifi and ethernet networks.

    network:
        version: 2
        ethernets:
            eth0:
                dhcp4: no
                addresses: [10.0.0.1/24]
        wifis:
            wlan0:
                optional: true
                access-points:
                    "MY-WIFI-NAME":
                        password: "MY-PASSWORD"
                dhcp4: no
                addresses: [192.168.0.51/24]
                gateway4: 192.168.0.1
                nameservers:
                    addresses: [8.8.8.8,8.8.4.4]

    Once these configurations are set up and rpi1 is rebooted, you can expect to find that ifconfig will show ip addresses assigned to eth0 and wlan0 as expected, and that resolvectl dns will read something like:

    Global: 127.0.0.1
    Link 3 (wlan0): 8.8.8.8 8.8.4.4 2001:558:feed::1 2001:558:feed::2
    Link 2 (eth0): 10.0.0.1

    Setting up the Non-Primary Nodes

    Next we jump into the rpi2 node and edit its /etc/netplan/50-cloud-init.yaml to:

    network:
        version: 2
        ethernets:
            eth0:
                dhcp4: true
                optional: true
                gateway4: 10.0.0.1
                nameservers:
                    addresses: [10.0.0.1]
        wifis:
            wlan0:
                optional: true
                access-points:
                    "MY-WIFI-NAME":
                        password: "MY-PASSWORD"
                dhcp4: no
                addresses: [192.168.0.52/24]
                gateway4: 192.168.0.1
                nameservers:
                    addresses: [8.8.8.8,8.8.4.4]

    This tells netplan to set up systemd-networkd to get its IP Address from a DHCP server on the ethernet network (which will be found to be on rpi1 when the broadcast event happens), and to route traffic and submit DNS queries to 10.0.0.1.

To reiterate, the wifi config isn’t part of the cluster’s network topology; it is an optional extra that makes life easier while setting things up, since you can ssh straight into any node over wifi. In my current setup, I assign the nodes static IP addresses on the wifi network, 192.168.0.51 through 192.168.0.54.

    Next, as described here, in order for our network to be able to resolve single-word domain names, we need to alter the behavior of systemd-resolved by linking these two files together:

    sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

This makes /etc/resolv.conf point at the DNS settings that systemd-resolved determines dynamically (in our case, the 10.0.0.1 nameserver that dnsmasq on rpi1 hands out over DHCP), rather than at the local stub resolver.

    After rebooting, and doing the same configuration on rpi3 and rpi4, we can run dig rpi1, dig rpi2, etc. on any of the non-primary nodes and expect to get the single-word hostnames resolved as we intend.
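
For example, based on the dhcp-host entries and the /etc/hosts line above, a quick check from one of the worker nodes should look something like:

dig +short rpi2     # expect 10.0.0.2 (from the dhcp-host entry)
dig +short rpi1     # expect 10.0.0.1 (via the /etc/hosts entry on rpi1)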

If we go to rpi1 and check the IP-address leases:

    cat /var/lib/misc/dnsmasq.leases

    … then we can expect to see that dnsmasq has successfully acted as a DHCP server. You can also check that dnsmasq has been receiving DNS queries by examining the system logs: journalctl -u dnsmasq.

    Routing All Ethernet Traffic Through the Primary Node

    Finally, we want all nodes to be able to connect to the internet by routing through the primary node. This is achieved by first uncommenting the line net.ipv4.ip_forward=1 in the file /etc/sysctl.conf and then running the following commands:

    sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
    sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
    sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT

    These lines mean something like the following:

    1. When doing network-address translation (-t nat), and just before the packet is to go out via the wifi interface (-A POSTROUTING = “append a postrouting rule”), replace the source ip address with the ip address of this machine on the outbound network
2. allow packets coming in from wifi to be forwarded out through ethernet, but only when they belong to an already-established connection (i.e. return traffic for requests that originated on the cluster)
    3. forward packets in from ethernet to go out through wifi

    For these rules to survive across reboots you need to install:

    sudo apt install iptables-persistent

    and agree to storing the rules in /etc/iptables/rules.v4. Reboot, and you can now expect to be able to access the internet from any node, even when the wifi interface is down (sudo ifconfig wlan0 down).
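
A few quick checks I find useful at this point (a sketch; interface names and file paths as set up above):

# on rpi1: confirm forwarding and the NAT rule are in place
sysctl net.ipv4.ip_forward              # expect: net.ipv4.ip_forward = 1
sudo iptables -t nat -L POSTROUTING -v  # expect the MASQUERADE rule on wlan0
cat /etc/iptables/rules.v4              # the rules saved by iptables-persistent

# on a non-primary node: prove traffic really goes via rpi1
sudo ifconfig wlan0 down
ping -c 3 8.8.8.8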

    Summary

    So there we have it – an easily portable network. If you move location then you only need to adjust the wifi-connection details in the primary node, and the whole network will be connected to the Internet.

    In the next part, we’ll open the cluster up to the internet through our home router and discuss security and backups.

  • WordPress Backup Restoration Practice

    Intro

    This is Part II of a multi-part series on creating and managing a WordPress-NextJs app. In the first part, we set up a WordPress (WP) instance on AWS EC2 with OpenLiteSpeed, MySql on RDS, and S3 for our media storage. The WP site will serve as a headless CMS for a hypothetical client to manage his/her content; it will act as the data source for nextJs builds to generate a pre-rendered site deployed on AWS Cloudfront.

    In this part, having set up our WP backend, we will practice restoring the site in the event that the EC2 instance gets bricked.

    Rationale

    In this section, I am going to argue for why we only need to be able to recreate a corrupted EC2 instance given our setup so far.

    One of the advantages of (essentially) storing all of our application’s state in AWS RDS and S3 is that these managed services provide backup functionality out of the box.

S3 has insane multi-AZ “durability” out of the box with, to quote a quick Google search, “at least 99.999999999% annual durability, or 11 nines… [; t]hat means that even with one billion objects, you would likely go a hundred years without losing a single one!” You could make your S3 content even more durable by replicating across regions, but this is plenty durable for anything I’ll be building any time soon.

    RDS durability requires a little more understanding. On a default setup, snapshots are taken of the DB’s disk storage every day, and retained for a week. RDS uses these backups to provide you with the means to restore the state of your DB to any second of your choosing between your oldest and most recent backups. This is your “retention period”. At extra cost, you can set the retention period to be as far back as 35 days, and you can also create manual snapshots to be stored indefinitely.

These RDS backups, while good and all, do not guard against certain worst-case scenarios. If, say, your DB instance crashes during a write operation and corrupts the physical file, then you need to notice and restore a backup before the corruption ages out of your retention period; if you do not also regularly check the WP site to make sure everything is in order, you could still lose data.

Perhaps even worse is the case where some mishap occurs with your data that does not show up in an immediate or obvious way when visiting your site. For example, suppose you or a client installs a crappy plugin that deletes a random bunch of your content; you are not going to know about that unless you do a thorough audit of the site (and who’s got time for that!).

    For this reason, you also really want to create regular logical backups of your DB and save them to e.g. S3 Glacier.
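
As a rough sketch of what such a logical backup might look like (the names here are placeholders; a cron job on any machine that can reach the RDS instance would do):

# dump the WP database and push it straight to Glacier-class storage
mysqldump --host RDS_ENDPOINT -u WPDB_USER -p'WPDB_PASSWORD' WPDB_NAME \
    | gzip > wp-backup-$(date +%F).sql.gz
aws s3 cp wp-backup-$(date +%F).sql.gz \
    s3://my-backup-bucket/wp/ --storage-class GLACIER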

    EC2 Restoration

    We’ll practice reconstructing a working EC2 instance to host our WP site. This section assumes that you have set up everything as prescribed in Part I of this series.

First, in order to create a more authentic restoration simulation, you might want to stop your existing/working EC2 instance. Before you do that, though, consider whether or not you have noted the details for connecting to your WP DB; if not, first note them down somewhere “safe”. Also be aware that if you did not associate an elastic-ip address with your EC2 instance, then its public IP address will change when you restart it later.

    Next, create a new EC2 Ubuntu 20.04 instance, open port 80 and 7080, ssh into it, and run the following:

    sudo apt update
    sudo apt upgrade -y
    curl -k https://raw.githubusercontent.com/litespeedtech/ols1clk/master/ols1clk.sh -o ols1clk.sh
    sudo bash ols1clk.sh -w

    The ols1clk.sh script will print to screen the details of the server and WP it is about to install. Save those details temporarily and proceed to install everything.

In the AWS RDS console, go to your active mysql instance, go to its security group, and add an inbound rule allowing connections from the security group attached to your new EC2 instance. Copy the RDS endpoint.

    Back in the EC2 instance, check that you can connect to the RDS instance by running:

    mysql --host RDS_ENDPOINT -u WPDB_USER -p

    … and entering the password for this user. Also make sure you can use the WP database within this instance.
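
A one-liner sanity check along these lines (WPDB_NAME being whatever you called the WP database):

mysql --host RDS_ENDPOINT -u WPDB_USER -p \
    -e "SHOW DATABASES; USE WPDB_NAME; SHOW TABLES;"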

If the connection works, then open the file /usr/local/lsws/wordpress/wp-config.php in your editor and replace the mysql connection details with those corresponding to your RDS instance and WP DB_NAME. You also need to add the following two lines to override the site base URL stored in the DB, which still points at your previous EC2 instance:

define('WP_HOME', 'http://[NEW_IP_ADDRESS]');
define('WP_SITEURL', 'http://[NEW_IP_ADDRESS]');

    … where NEW_IP_ADDRESS is taken from your new EC2 console.

Now you can go to NEW_IP_ADDRESS in a browser and expect to find a somewhat functioning WP site. If you try navigating to a post, though, you will get a 404 error. To fix this, you need to go to the OLS admin console on port 7080, log in using the new credentials generated by ols1clk.sh, go to the “Server Configuration” section and, under the “General” tab, in the “Rewrite Control” table, set the “Auto Load from .htaccess” field to “Yes”. Now you can expect to be able to navigate around posts.

    (Side note: it’s very surprising to me that ols1clk.sh, in installing WP, does not set this field to Yes.)

    The images will not work of course, because the DB records their location at the address of the old EC2 instance, which is not running. So in an actual restoration scenario, we would need to point the previous hostname from the old instance to the new instance, and then set up S3fs (which I am not going to test right now).

    Having gone through this backup restoration practice, you can switch back to the old EC2 instance and, since we had to play around with our login credentials on a non-SSL site, it is a good idea to update your WP login password.

  • Headless WordPress with OpenLiteSpeed using AWS EC2, RDS & S3

    Intro

    I’ve heard good things about Litespeed so decided to try setting it up on AWS and to perform some comparisons with Apache. I also decided that I would try AWS RDS and S3 for data persistence.

This article assumes general knowledge of setting up and administering an EC2 instance, and focuses on OpenLiteSpeed (OLS) setup.

    EC2 Setup

    I set up an EC2 instance with Ubuntu 20.04 LTS through the AWS Console. I chose 15GB of EBS storage, which I expect will be more than enough so long as this instance remains dedicated to one WordPress instance with data and media files stored externally (~5GB for OS, ~4GB for swap space, leaving ~5-6GB to spare). I’ve started off with 1GB RAM (i.e. the free-tier-eligible option).

Then you need to ssh into your EC2 instance, do the usual setup (add swap space, add configurations for vim, zsh, tmux, etc.), and, if you plan for this to be a production WordPress site, you’ll want to set up backups using the Lifecycle Manager.

    Installing OpenLiteSpeed

Once your EC2 instance is configured for general usage, we need to install OpenLiteSpeed. Although it claims to be “compatible” with Apache, there are a lot of differences in how you set it up and operate it. I used the official guide as well as this googled resource.

    NOTE: this section describes how to install OLS manually; see below for the option of installing OLS, php, wordpress, mysql, and LSCache through a convenience “one-click” script.

    First, to install “manually”, you need to add the relevant repository:

    wget -O - http://rpms.litespeedtech.com/debian/enable_lst_debian_repo.sh | sudo bash

    Then you can install the package:

    sudo apt install openlitespeed

This will install to /usr/local/lsws (lsws = “LiteSpeed Web Server”) and includes a service allowing you to run:

    sudo systemctl start|stop|restart|status|enable|disable lsws

OLS will be running and enabled upon installation. Installing OLS also pulls in a bunch of other packages, including a default PHP interpreter (lsphp73), so OLS is ready to work with php out of the box.

    Managing OpenLiteSpeed

Unlike Apache, where all configuration is performed by editing files within /etc/apache2, OLS comes with an admin interface allowing you to configure most things graphically. This interface runs on port 7080 (by default) so, to access it on your EC2 instance, you need to open up port 7080 in your security group and then navigate to your_ec2_ip_address:7080, where you will see a login form.

Now, the first time you do all of this, you will not have set up SSL certification, so any such login will be technically insecure. So my approach is to:

    Kick off with a weak/throwaway password to get stuff up initially, set up SSL, then switch to a proper/strong/long-term password, and hope that, during these few minutes, your ISP or some government power does not packet-sniff out your credentials and take over your OSL instance.

    To set up the initial credentials, run this CLI wizard:

    sudo /usr/local/lsws/admin/misc/admpass.sh

Then use those credentials to log in to the interface. Modifying settings here updates the config files within /usr/local/lsws, so in theory you never need to alter those files directly.

By default, OLS runs the actual server on port 8088. We want to change that to 80 to make sure things are working. So go to “Listeners”, click “View” on the Default listener, and edit the port to 80. Save and restart the server. Now you can go to your_ec2_ip_address in the browser to view the default server setup provided by OLS.

    This default code is provided in /usr/local/lsws/Example/html. Let’s create the file /usr/local/lsws/Example/html/temp.php with the contents:

    <?php
    echo "Is this working?";
    phpinfo();
    ?>

Then go to your_ec2_ip_address/temp.php to confirm things are working. If you’ve followed these instructions precisely, you can expect to see the echoed message followed by the standard phpinfo() output.

    A note on LiteSpeed & PHP

The first time I tried installing OLS, I was rather confused about what one needed to do to get php working with it. The instructions I had come across told me to run the following commands after installing OLS:

    sudo apt-get install lsphp74
    sudo ln -sf /usr/local/lsws/lsphp74/bin/lsphp /usr/local/lsws/fcgi-bin/lsphp5

This might have been necessary back on Ubuntu 18.04, but on Ubuntu 20.04 it is not: installing OLS already brings in the lsphp73 package, so you only need to install lsphp74 if you care about having php 7.4 over 7.3.

    It was also frustrating to be told to create the soft link given above without any explanation as to what it does or why it is needed. As far as I can discern, you need this soft link iff you want to specify the php interpreter to be used with fast-cgi scripts. But since I never deal with cgi stuff, I am pretty sure one can skip this.

Furthermore, the instructions I read were incomplete. To get OLS to recognize the lsphp74 interpreter, you need to perform the additional step of setting the path in the admin console. To do that, go to “Server Configuration” and then the “External App” tab. There you need to edit the settings for the “LiteSpeed SAPI App” entry and change the command field from “lsphp73/bin/lsphp” to “lsphp74/bin/lsphp”. Save, restart OLS, and check that the php version coming through in the temp.php page set up earlier is 7.4.

    SSL Setup

I followed the instructions here, though they’re slightly out of date for Ubuntu 20.04.

    Point a subdomain towards your EC2 instance; in this example, I’ll be using temp.rndsmartsolutions.com.

    Run sudo apt install certbot to install certbot.

Run sudo certbot certonly to kick off the certbot certificate wizard. When asked “How would you like to authenticate with the ACME CA?”, choose “Place files in webroot directory (webroot)”.

    Add your domain name(s) when prompted, then when it asks you to “Input the webroot for [your domain name]” , enter “/usr/local/lsws/Example/html“. This is the default dir that OLS comes with, and certbot will then know to add a temporary file there in order to coordinate with the CA server to verify that you control the server to which the specified domain name is pointing.

If successful, certbot will output the certificate files onto your server. You now have to use the OLS console to add those files to your server’s configuration. Go to the “Listeners” section and, under the “General” tab, change the “Port” field to 443 and the “Secure” field to Yes. In the SSL tab, set the “Private Key File” field to, in this example, /etc/letsencrypt/live/temp.rndsmartsolutions.com/privkey.pem and the “Certificate File” field to /etc/letsencrypt/live/temp.rndsmartsolutions.com/fullchain.pem. Now restart the server, navigate to the domain you just set up over https, and you can expect it to work.

    If SSL is working for the main server for the default Example virtual host on port 443, then you can use those same certificates for the WebAdmin server listening on port 7080. To do so, go to the “WebAdmin Settings > Listeners” section and view the “adminListener” entry. Under the SSL tab, set the “Private Key File” and “Certificate File” fields to the same values as above (i.e. pointing to our certbot-created certificates), and then save and restart the server. Now you can expect to be able to access the WebAdmin interface by visiting, in this example, temp.rndsmartsolutions.com:7080.

Now that we can securely access the WebAdmin interface without the threat of packet sniffing, we can set a new password to something strong and long-term by going to “WebAdmin Settings > General”; under the “Users” tab, the entry for our username includes a form for updating the password.

    Creating Further Virtual Hosts

    In general, we want to be able to add further web applications to our EC2 instance that funnel through the OSL server in one way or another. For that, we need to be able to set up multiple virtual hosts. Let’s start off with a super basic html one, and then explore the addition of more sophisticated apps.

    First, I’ll go to my DNS control panel (AWS Route 53) and add another record pointing the subdomain temp2.rndsmartsolutions.com to my EC2 instance.

Now, in the WebAdmin interface, go to the “Virtual Hosts” section and click “+” to add a new Virtual Host. Create a name in the “Virtual Host Name” field; this can be set to the text of the subdomain, which in this case is “temp2”. In the “Virtual Host Root” field, set the path to the directory that you plan to use for your content. You need to create this dir on your EC2 instance; I tend to put them in my home folder so, in this case, I am using “/home/ubuntu/temp2”. While you’re there, create a dir called “html” inside it and place a test index.html with some hello-world text. (You can determine the exact dir name to hold the root content by going to the “General” tab and setting the field “Document Root” to something other than “$VH_ROOT/html/”; in this case, we have set “$VH_ROOT” to “/home/ubuntu/temp2”.)

    Under the table titled “Security” within the “Basic” tab of “Virtual Hosts” section you probably want to also set things as depicted in the image below.

    Having created the Virtual Host entry for our new app, go to the “Listeners” section. Since earlier we changed the Default listener to have “Secure” value of “Yes” and the “Port” to have value “443”, we need to create a separate listener for port 80, that does not need to be secure. (At minimum we need the listener on port 80 in order for our next certbot call to be able to perform its verification steps.) So create such a listener and give it two “Virtual Host Mappings” to our two existing virtual hosts as depicted below.

We now have two listeners — one for 80 and one for 443 — and both virtual hosts can be reached at their respective domains. Going now to temp2.rndsmartsolutions.com is expected to show the hello-world index.html file created earlier.

In order for SSL to work with our new virtual host, we need to expand the domains covered by our certificate. So go to your EC2 SSH terminal and re-run certbot certonly as follows:

    sudo certbot certonly --cert-name temp.rndsmartsolutions.com --expand -d \
    temp.rndsmartsolutions.com,\
    temp2.rndsmartsolutions.com

    If all goes well, the certificates will get updated and once you restart the server you will be able to access your new virtual host at, in this example, https://temp2.rndsmartsolutions.com.

    (Note: you can add the SSL configurations to the virtual hosts rather than the listeners, but I prefer the latter.)

    Finally, we want to be able to tell OLS to redirect all traffic for a given virtual host from http to https. We’ll exemplify this here with the temp2 virtual host. Go to the “Virtual Hosts” section and view the temp2 virtual host. Under the “Rewrite” tab set the field “Enable Rewrite” to Yes in the “Rewrite Control” table. Then add the following apache-style rewrite syntax to the “Rewrite Rules” field:

    rewriteCond %{HTTPS} !on
    rewriteCond %{HTTP:X-Forwarded-Proto} !https
    rewriteRule ^(.*)$ https://%{SERVER_NAME}%{REQUEST_URI} [R,L]

Restart the OLS server and you can now expect to be redirected to https next time you visit http://temp2.rndsmartsolutions.com.

    Certbot Renewal Fix

This section was inserted several months later to fix a problem with the setup described in the last section. What I discovered was that adding these apache rewrite rules disrupted the way that certbot renews the certificate automatically: certbot checks that you still control the domain by adding temporary files under the webroot and then accessing them over http, but these rewrite rules redirect that request to https, which certbot doesn’t like (presumably because it doesn’t want to assume that https is working).

    The fix I came up with is to disable the rewrite rules and, if I want to have an application that can only be accessed over https, then I create an additional virtual host listening on port 80 for the same domain, and then I just make that application a single index.php with the following redirection logic (taken from here):

<?php
// index.php: redirect any plain-http request to the https version of the same URL
if (empty($_SERVER['HTTPS']) || $_SERVER['HTTPS'] === "off") {
    $location = 'https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $location);
    exit;
}

    So now, if you accidentally go to this domain, you will be redirected to the https listener, which will route you to the actual application you want users using at this domain. (And if someone accidentally goes to a non-root extension of this domain, then OLS will issue a “not found” error.)
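
With the rewrite rules disabled, you can confirm that automated renewal is happy again with a dry run:

sudo certbot renew --dry-run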

    Installing WordPress with One-Click Install

    I discovered (somewhat after the fact) that you can actually install OLS, php, mysql, wordpress, and LSCache with a convenient script found here. To use, download and execute with:

    curl https://raw.githubusercontent.com/litespeedtech/ols1clk/master/ols1clk.sh -o ols1clk.sh
    sudo bash ols1clk.sh -w

    … where the -w flag indicates that you also want wordpress installation. This will prompt you with all of the items and credentials that the script plans to make, including installing mysql. If you have already installed OLS, it will wipe out your existing admin user+password with the new one declared, and will likely disrupt your Example settings, and create confusion by adding virtual hosts and/or listeners that conflict with what you’ve already set up. In short, do NOT use this script if you have already set up OLS — just install wordpress manually; Digital Ocean provides thorough guides for accomplishing this; e.g. see here.

    Once you have cleanly run the ols1clk.sh script, a virtual host will be ready for you, so go to the domain/ip-address for this instance, and you will encounter the set up wizard for wordpress. (Obviously, ideally, you first go through the relevant steps as laid out already for setting up SSL before going through the wizard.)

However, before you do anything in the WP setup wizard, you need to change the mysql DB configuration unless you want to host the DB on your EC2 instance itself; we’ll outline that process in the next section.

    Installing WordPress Manually

Having installed OLS and lsphp manually already, and not wanting to disrupt my setup with the ols1clk.sh script, I installed WP following the instructions here. I deviated from these instructions slightly since they are for Apache and I am using OLS.

    One difference in particular is permissions … [TBC]

    AWS RDS Mysql Setup

AWS uses some confusing terminology here. I normally think of a mysql DB as the thing you create with the SQL command CREATE DATABASE name_of_db;, but when you click on “Create database” in the AWS console, what you are really setting up is the server (or servers) hosting what is, in general, a distributed mysql service. Contrary to AWS’s nomenclature, I shall refer to these managed servers as the mysql or RDS “instance”, and to the entities created therein via CREATE DATABASE as the “DBs”.

    Anyhow, click on “Create database” and go through the wizard to create a mysql instance. I am using the free-tier instance for now, with Mysql Community 8.0.23.

You need to create a master user and password. You do not need a super strong password since we will also be setting the connectivity such that the instance can only be accessed from AWS resources sharing the same (Default) VPC. Since I only intend to connect to this instance from EC2 instances in the same VPC that are themselves very secure (I SSH in via SSL certs), we do not need another layer of “full” security to have to note down somewhere. (Obviously, if you want to connect from outside AWS, then you need super strong credentials.) We also choose the default “Subnet group” and “VPC security group”.

    In the section “Additional Configuration” we have the choice to create an initial DB, but we will not since we will do that manually under a different non-admin username.

Note how confusing AWS is here in stating that it “does not create a database” when the wizard will in fact create what AWS also calls a “database”.

    After the console is done creating the instance, it will make available an endpoint of the form instance-name.ABCD.us-east-1.rds.amazonaws.com that we can use to test connecting from our EC2 instance.

First, we want to ensure that we control precisely which EC2 instances are allowed to connect. In the RDS console, select the instance you just created, select the “Connectivity and security” tab, and scroll down to the “Security group rules” table. This will show you all of the rules for inbound/outbound traffic determined by the settings within the security group assigned to the instance upon creation. You’ll want it to look like the following image:

    Click on that security group link to edit the associated inbound/outbound rules. For the sake of security, it’s sufficient to just limit all inbound traffic to the mysql instance, and leave all outbound traffic unrestricted. Here, I’ve limited the traffic to be inbound only from AWS resources using the specific security groups shown in the above image; these are associated with two separate EC2 instances that I set up.

    Back in the EC2 instance hosting my OLS server, install the mysql client with sudo apt install mysql-client. Then run:

    mysql --host END_POINT -u admin -p

    … where END_POINT is given to you in the RDS console for your mysql instance. You’ll be prompted for the admin password you created for the instance, and you can then expect to connect to the instance.

We set up the mysql instance without an initial database, so let’s now create one explicitly for our wordpress site. We also want to create a username-password combo specifically for this wordpress instance, with permissions only to read/write that DB. Run the SQL commands:

    CREATE DATABASE dbname;
    CREATE USER 'newuser'@'%' IDENTIFIED BY 'password';
    GRANT ALL PRIVILEGES ON dbname . * TO 'newuser'@'%';
    FLUSH PRIVILEGES;

    … replacing dbname, newuser, and password with values of your choice. (Note the syntax ‘username’@’%’ means “a user with username connecting from any host”.) You can now exit this mysql session and try logging in again with the user you just created to make sure that that user is able to connect to the RDS instance remotely.

Next, go to the dir in which you downloaded wordpress, and open the file wordpress/wp-config.php. Go down to the section for “MySQL settings” and enter the details for the user we just created, as well as the endpoint for the RDS instance in the DB_HOST slot. It also makes your WP more secure to change the $table_prefix variable to e.g. 'wp_sth_'.

    WordPress Troubleshooting

    If you have difficulty getting your wordpress to work, try adding the line:

    define( 'WP_DEBUG', true );
    

… to wp-config.php to get error output. In my case, I had trouble getting php to recognize mysqli_connect; I got it working by running sudo apt install lsphp74-mysql and restarting the OLS server. I also had some trouble getting php to recognize the php-curl module; I eventually got it working after running all sorts of installs (sudo apt install php-curl, etc.) and restarts, though I am not sure what the eventual solution was exactly (all I know is that I did not need to edit any OLS config files directly). After playing with OLS for several days now, I am tempted to say that you never want to edit an OLS config file directly; there is always a way to do things through the interface, or your WP/.htaccess files.

    Setting Up WordPress

    Once you have your WP interface working (i.e. you can login through a browser), you need to perform some essential set up.

First, go to Settings > Permalinks and select “Post name”. Make sure the REST API is working by going to /wp-json/wp/v2/. If it is not working, try /index.php/wp-json/wp/v2/. If that works, then you need to get OLS to perform rewrites to skip the index.php part. OLS does not read the .htaccess files (that WP supplies) by default, so to get OLS to recognize those files, go to your OLS admin, go to the “Server Configuration” section, set the “Auto Load from .htaccess” field in the “Rewrite Control” table to “Yes”, and restart. If your REST API is still not working, then, well, you’ve got some investigation to do.

    (I think it’s the case that because we are loading .htaccess files at the server level, OLS will read these files into memory upon first encounter, so subsequent use will be cached; if you set this setting at the virtual host level then OLS will consult the file system on each request.)

I have heard it said that it’s a good idea to prevent the web server from serving the wp-config.php file at all. I think the idea here is that, by default, all it takes is an accidental deletion of the first line <?php for the file to be treated as plain text and, therefore, for its contents (including your DB credentials) to be printed to the screen. The usual precaution against this on an Apache server is to add the following to the root .htaccess file:

    <Files wp-config.php>
        <IfModule mod_authz_core.c>
        	Require all denied
        </IfModule>
        <IfModule !mod_authz_core.c>
        	Order deny,allow
        	Deny from all
        </IfModule>
    </Files>

    However, this will not work with OLS because it “only supports .htaccess for rewrite rules, and not for directives”. We therefore need to add the following instead:

    RewriteEngine on
    RewriteRule ^wp-config\.php$ - [F,L]

    … and then confirm that you get a 403 Forbidden Error upon visiting /wp-config.php.
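
One way to check this from the command line (substituting your own domain for the example one):

curl -I https://temp.rndsmartsolutions.com/wp-config.php
# the status line of the response should report 403 Forbidden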

    Next, we want to install some vital plugins. I try to keep these plugins to as few as possible since I am wary of conflicts and bloat; our goal is to keep WP operating in a super-lean manner, but we still need some essentials.

First, we want LSCache, since this is the big motivation for using OLS. The plugin claims to be an all-round site accelerator, caching everything at every level. When you install it, it comes with default settings that already have all the site-acceleration features in place.

Next, we want WordFence (WF) to provide site protection against attacks, as well as free Two-Factor Authentication (2FA) logins. Install WF and enable auto-updates. WF will go into learning mode for a week or so.

In order to set up 2FA you need a smartphone with a client app. I will describe how to set up 2FA using the free “Duo Mobile” app on an iPhone. In the WF menu, go to “Login Security” and use your iPhone’s camera to scan the code; it will give you the option to open it in Duo Mobile. Then, back in the WF interface, input the latest code from Duo Mobile for your WP site to activate it. Also download the recovery codes and keep them safe. Under the WF “Settings” tab for “Login Security”, I also make 2FA required for all types of user (though I only plan to use the Administrator and maybe Editor roles for this headless CMS). You can also require recaptcha via WF, but this is overkill for my purposes.

    WF will also want you to enable “Extended Protection” mode. If you agree to this, then it will prompt you to download your old version of the site’s .htaccess file (presumably in case WF screws it up if/when you uninstall it later). I am a bit skeptical about this feature since it sounds like it would incur quite a performance hit. However, since the overall architecture I am building here aims to put all of the serious site load onto AWS Cloudfront — with WP just functioning as the headless CMS for the convenience of the client — I have opted for now to add this extra layer of security.

    For this feature to be enabled on Litespeed, you need to follow these instructions, namely, go to the “Virtual Hosts” section and in the entry for your WP site, go to the “General” tab and add the following to the field “php.ini Override”:

    php_value auto_prepend_file /path/to/wp/wordfence-waf.php

    You may also need to go to LSCache and purge all caches.

    Offloading Media to AWS S3

    We want to offload the serving of our media files to AWS S3 with Cloudfront. This will also ensure that we can scale to any amount of media storage. We also want to avoid having duplicates on our EC2 disk.

    At first I assumed the best way to go about this would be through a WP plugin, and I tried out W3 Total Cache. However, these free plugins always seem to have a downside. In this case, W3 Total Cache would not automatically delete the file on disk after uploading to S3.

    I therefore decided to pursue a different strategy using S3fs. This is an open source apt-installable package that lets you mount an S3 bucket to your disk. Writing to disk in such a mounted volume therefore has the effect of uploading directly to S3, and leaving no footprint on your EC2 storage. You also don’t need any WP plugins.

    To set up S3fs on Ubuntu 20.04, first install it along with the AWS CLI:

    sudo apt install s3fs awscli

    In the AWS console (or via terraform if you prefer), create a new S3 bucket, Cloudfront distribution, and ACM SSL certificate. You can see this post for guidance on those steps, but note that, in this case, we are going to create a user with only the permissions needed to edit this particular S3 bucket.

To create that user, go to the IAM interface in the AWS Console and click to “Add user”. Give the user “Programmatic Access”, then, under the “Set permissions” step, select “Attach existing policies directly” and then click on “Create policy”. Paste the following into the “JSON” tab:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:ListAllMyBuckets",
                    "s3:ListBucket"
                ],
                "Resource": "*"
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:GetObject",
                    "s3:GetObjectAcl",
                    "s3:DeleteObject"
                ],
                "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
            },
            {
                "Sid": "VisualEditor3",
                "Effect": "Allow",
                "Action": "cloudfront:ListDistributions",
                "Resource": "*"
            }
        ]
    }

    [Note to self: will need to update this policy when it comes time to enable user to invalidate Cloudfront distributions]

    Skip tags and save the policy with a name like “my-wps3-editor-policy”. Now, back in the user-creation wizard, search for and select the policy you just created. Skip tags and create the user. You will then be able to access the key and secret key for programmatic use of this user.

Back on the EC2 instance, run the following to store the credentials of the user who will mount the S3 bucket (replacing the keys):

    touch ${HOME}/.passwd-s3fs
    echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ${HOME}/.passwd-s3fs
    chmod 600 ${HOME}/.passwd-s3fs

    We will be mounting the S3 bucket to wp-content/uploads within your WP dir. Before mounting, we need to enable other users to read/write to the mounted dir so that WP can properly sync the files we upload. To enable that, you need to edit the file /etc/fuse.conf by simply uncommenting the line user_allow_other.

    Now we can mount the dir BUT, before you do that, check if you already have content in the uploads dir. If you do, move that content to /tmp, and then make sure the uploads dir is empty, and then run the following:

    s3fs BUCKET_NAME /path/to/wp_dir/wp-content/uploads -o allow_other -o passwd_file=${HOME}/.passwd-s3fs

    Now you can copy back the contents that you may have moved, and/or upload something new, and expect it to appear within the corresponding S3 bucket.

    (Optionally, you can also run `aws configure`, and enter the credentials for this user, if you want to interact with the S3 bucket from the command line.)

Finally, we want the dir to be re-mounted to S3 upon EC2 instance reboots, so add this to the /etc/fstab file:

    BUCKET_NAME /path/to/wp_dir/wp-content/uploads fuse.s3fs allow_other,passwd_file=/path/to/home/.passwd-s3fs 0 0

    To test that it works, unmount uploads and then run sudo mount -a. If it looks like it works, you can then try actually rebooting, but be careful since messed up fstab files can brick your OS.
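
Spelled out, the reboot-free test looks something like this (paths and bucket name as above; the last line assumes you ran aws configure):

sudo umount /path/to/wp_dir/wp-content/uploads
sudo mount -a                      # re-mounts everything listed in /etc/fstab
mount | grep s3fs                  # the uploads dir should show up here
touch /path/to/wp_dir/wp-content/uploads/fstab-test.txt
aws s3 ls s3://BUCKET_NAME/ | grep fstab-test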

    Here are some final notes on using S3/S3fs:

• You can set up a local image cache to speed up the performance of serving files from WP, but performance doesn’t matter here since this will only be used by the CMS admin.
    • s3fs does not sync the bucket between different clients connecting to it; so if you want to create a distributed WP-site then you might want to consider e.g. yas3fs.
    • Depending on the amount of content on your site, you might benefit from creating a lifecycle rule on your S3 bucket to move objects from “Standard” storage to “Standard Infrequent Access” (SIA) storage. However, SIA will charge you for each file smaller than 128kb as though it were 128kb; since WP tends to make multiple copies of images of different sizes, which are often smaller than 128kb, this might offset your savings.

    URL Rewrites

    The last thing to consider is getting the end-user interface to serve up content from the Cloudfront URL instead of the WP URL. If you use a WP Plugin to sync with S3, then it will do the rewrites for you.

    In my case though, I am going to avoid having to work with php and do all the rewrites within my nextJs frontend app. See the next part for setting that up.

    Next Part

    The next part in the series is on practicing backup restoration.

  • Python Relative Imports

    Intro

    I thought I understood python modules. To my embarrassment, I tried doing a relative import just now and, having researched the ensuing error, realized that I had still failed to understand some fundamentals of relative imports.

    Background

    I created the following file structure:

    .
    └── src
        ├── __init__.py
        ├── main.py
        └── temp
            └── __init__.py
    

    … with the following demo code:

    ### src/main.py
    from .temp import foo
    print('>>>'+foo)
    
    ### src/temp/__init__.py
    foo = 'bar'

    … and tried running python3 src/main.py expecting to get the simple message “>>>bar” printed out. Instead, I got the error:

    Traceback (most recent call last):
      File ".../src/main.py", line 1, in <module>
        from .temp import foo
    ImportError: attempted relative import with no known parent package

    So something about the relative import was not working. What really threw me off understanding the problem was the fact that VSCode (with Pylance) was resolving the variable foo in the import without any apparent problem or warning, so I figured that the command-line interpreter must just be struggling to know where to look.

VSCode threw me off by treating the file as a module, whereas the CLI interpreter was treating it as a top-level script!

    But no matter how many extra __init__.py files I scattered around the repo, or how many more directories I told the interpreter to look in for modules (viz. PYTHONPATH=".:src:src/temp" python src/main.py), I could not get this error about “no known parent package” to go away.

    TL;DR

When you kick off a python process, you can choose to run your entry-point script either as a “top-level script” (TLS) or as a “module”. If you run main.py as a TLS, then you cannot import stuff from elsewhere using relative imports; if you run it as a module, then you can, provided the module path you pass contains at least one dot (i.e. the script is not in the same directory from which you launch the interpreter). Please remind me why everyone loves python?

    # TLS: relative imports not ok
    python3 src/main.py
    
    # Module: relative imports ok
    python3 -m src.main

    Explanation

    This SO answer was key to my eventual understanding.​*​

    The key thing is that, in order to move up/down a “package hierarchy”, the interpreter needs to establish a position in a hierarchy, and it simply does not establish such a position if you start a process off as a TLS. You can see that that is the case by running print(">>>"+str(__package__)) within main.py and running it as a TLS (which prints “>>>None“).

    Now, yes, this is counter-intuitive (as everyone in the python blogosphere agrees), and, yes, the authors could have built the language with a common-sense decision tree so as to establish a TLS’s placement within a package,​†​ but they did not, and you just have to treat that as a brute fact to work around.

    And since your initial position within a package is not determined, you can’t perform relative imports up/down the package hierarchy. By contrast, if you start off the process by executing a module then the “src.main” dot syntax that is passed to the interpreter is used to establish a position within a package hierarchy that can be examined within the __package__ variable.​‡​
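
A quick way to see all of this for yourself, using the src/main.py layout from earlier (run from the directory containing src):

python3 src/main.py          # TLS: __package__ is None, so the relative import fails
python3 -m src.main          # module with a dotted path: __package__ is 'src', so it works
(cd src && python3 -m main)  # a module, but with no dots: relative imports fail again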

    Further Notes

    This topic involves a bunch of pesky details/subtleties that it’s worth trying to remain aware of.

    • “Relative imports” begin with a dot, indicating placement within a package hierarchy. “Absolute imports” begin with a package/module name with dots to indicate descent into subdirectories. Absolute imports rely on the contents of the sys.path list, which you can set/augment with the PYTHONPATH env var.
• If you perform an absolute import on a module, then the __name__ and __package__ variables of that script will reflect the position of the module in the package as it was called. For example, if you run from package1.subpackage.somefile import something, then the __name__ variable within somefile will be “package1.subpackage.somefile“, and the __package__ variable will be “package1.subpackage”; the interpreter knows how to use this information to then interpret imports relative to somefile. For example, you can run from . import something_else within somefile to access sibling modules within subpackage, or from .. import something_else to reach back up into package1.
• If you run a module from the same directory that you are in when starting the interpreter, even though you are specifying a module (e.g. python3 -m main), then no package structure has been communicated and, like with a TLS, you cannot perform relative imports right off the bat.
    • If you have a package with nested subdirectories then you do not need to have an __init__.py file in each of them in order to access the modules in a nested file/dir. The only real practical consequence of having an __init__.py file (that I can discern) is to turn the “directory itself” into a module, so you can import the dir by name without having to name any subdirectory or file.
    • I had thought that to be a package, a dir had to have an __init__.py file in it; perhaps that’s true in some sort of definitional sense but, practically speaking, a dir does NOT need to have an __init__.py file in order to function “like a package”, i.e. to just represent a collection of modules. In particular, as mentioned above, you do not need an __init__.py file in order to do imports from within a dir.
• In writing this article I worried that my explanations would sound circular since I sort of assumed that the reader knew what I meant by the term “hierarchical package” and, ideally, I would define this first before building off of the concept. However, part of the challenge here is that a “package hierarchy” is clearly a notion that must be defined with respect to the design of the interpreter, and trying to understand how the interpreter works with respect to package hierarchies is really the point of this article. So, apologies if you feel dissatisfied; however, it feels sufficiently clear in my head (right now at least) what is going on here to close this chapter.
    • Having researched this matter recently, I can report that a lot of people feel python’s package/module system is confusing and sub-optimally designed.

    Summary

    As a rule of thumb, you can only perform relative imports after performing an absolute import, or kicking things off as a module with hierarchical details (i.e. dots in the path to the entry script), since this info is used to establish placement within a package hierarchy.

    Relative imports are tricky and actually generally discouraged in the python community since absolute imports (e.g. from package1.some_module import something) tend to read clearer than relative imports (e.g. from ..some_module import something).

    So make sure that when you name packages in your code you are mindful not to introduce confusion/conflicts with packages installed from 3rd parties. (Sounds “imprecise”, but python just is far from perfect.)


    1. ​*​
      Beware, this response while thorough gets wrong the detail that python -m src.main will render __name__ as “__main__”, not src.main; you need to refer to the __package__ variable as your primary means for understanding how the interpreter determines placement within package hierarchy.
    2. ​†​
      E.g. “check if __init__.py is in same dir as TLS and, if so, make that dir the package name”
    3. ​‡​
      Interestingly, it seems you don’t even need a __init__.py file within src for the interpreter to treat it as a package in this special case
  • AWS Production Static Site Setup

    Intro

    I’ve fumbled through the process of setting up a production static site on AWS a few times now. These are my notes for the next time to get through the process faster.

    Overview

We want to be able to run a local script that wraps around the AWS CLI to upload assets to an AWS S3 bucket (using credentials for a user restricted to just the permissions needed). The S3 bucket is to be set up for static-site hosting and to serve as the origin of a Cloudfront distribution, which is itself aliased to a Route 53 hosted-zone record, all glued together with an ACM certificate.

    Finally, we need a script to copy over the contents of a directory for the static site to S3 in such a way as to compress all image files. In summary:

    • S3 Bucket
    • Cloudfront Instance
    • Certificate Manager Instance
    • Route 53 Configuration
    • AWS User for CLI uploads

    S3 Bucket

Setting up an S3 Bucket is quite straightforward to accomplish in the AWS GUI Console. When asked about “Block all public access”, just uncheck it, and don’t apply checks to any of the sub-options. (Everyone I’ve seen just seems to ignore these convoluted sub-options without explanation.)

    Under permissions you need to create a bucket policy that will allow anyone to access objects in the bucket. So copy the ARN for the bucket (e.g. “arn:aws:s3:::rnddotcom-my-site-s3-bucket”) and use the “Policy Generator” interface to generate some JSON text as depicted below.

    Note: under the “Actions” option you need to select just the “GetObject” option. Click “Add Statement” and “Generate Policy” to get the JSON. Copy/paste it into the bucket’s policy text field and save. The following JSON is confirmed to work.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AddPerm",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::rnddotcom-site-s3-bucket/*"
            }
        ]
    }

    Next, when you enable “Static website hosting”, you must specify the “Index document” since the S3 servers will not default to index.html.

    Upload Static Files (with gzip compression)

When developing, I always want to be able to re-upload/re-deploy my software with a script. For that, I use a bash script that wraps around the AWS CLI (which you can install on a Mac with Homebrew).

For an example of such a script, see my terraform-aws-modules repo here. For this to work, you need to have AWS credentials for a user with access to this bucket.

    A good practice is to create a user with just enough permissions for the resources you need to access. So go to the AWS IAM console, and create a user with “Programmatic Access”.

    In the permissions step, click on “Attach existing policies directly” and select — in this example — the “AmazonS3FullAccess” policy and click on “Next: Tags”.

    Skip through Tags, create the user, and copy the “Access key ID” and “Secret access key” items to somewhere safe. If you are using the script I shared above, then you can add these items directly to your local .env file. By sourcing the .env file, you give these credentials priority over those stored in ~/.aws/credentials (which is handy if you manage multiple AWS accounts.)

    export AWS_ACCESS_KEY_ID="..."
    export AWS_SECRET_ACCESS_KEY="..."

    Now you can run the above bash script that wraps around the AWS CLI to upload the contents of a local directory. The script also includes logic to pick out image files and compress them before uploading.
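
The linked repo has the full version; a stripped-down sketch of the idea (the bucket name and site directory are placeholders, and the image-compression step is omitted) might look like:

#!/usr/bin/env bash
set -euo pipefail
source .env                                # pulls in the AWS_* credentials above
BUCKET="s3://rnddotcom-site-s3-bucket"     # placeholder bucket from the policy above
SITE_DIR="./build"                         # placeholder local build directory

# upload everything, deleting remote files that no longer exist locally
aws s3 sync "$SITE_DIR" "$BUCKET" --delete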

    You now have a complete simple http static site, great for development, etc.

    Cloudfront I

    If you need a production site then you need to have SSL encryption (at minimum to look professional), CDN distribution, and a proper domain.

So next go to Cloudfront in the AWS GUI Console and create a new “Distribution”. There are a lot of options here (CDNs are complicated things after all), and you just have to go through each one and give it some thought. In most cases, you can just leave the defaults. A few notes are worth making:

• “Grant Read Permissions on Bucket”: No; we already set these up
• “Compress Objects Automatically”: Select yes; here is a list of the types of file that CloudFront will compress automatically
• “Alternate Domain Names (CNAMEs)”: Leave this blank; sort it out after creating the distribution
• “Default Root Object”: Make sure to set this to index.html
• “Viewer Protocol Policy”: Set this to “Redirect HTTP to HTTPS” (as is my custom)

    SSL Certification

Now we need to point the host name at the CloudFront distribution. Surprisingly, it seems you NEED to have SSL, and to have it set up first, for this to happen. So go to ACM and click on “Request a Certificate”. Select “Request a public certificate” and continue.

    Add your host names and click continue. Assuming you have access to the DNS servers, select “DNS Validation” and click ‘next’. Skip over tags and click on “Confirm and Request”.

    The next step will be to prove to AWS ACM that you do indeed control the DNS for the selected hosts you wish to certify. To do this, the AWS console will provide details to create DNS records whose sole purpose will be for ACM to ping in order to validate said control.


    You can either go to your DNS server console and add CNAME records manually, or, if you’re using Route 53, just click on “Create record in Route 53”, and it will basically do it automatically for you. Soon thereafter, you can expect the ACM entry to turn from “Pending validation” to “Success”.

    Cloudfront II

Now go back and edit your Cloudfront distribution. Add the hostname to the “Alternate Domain Names (CNAMEs)” field, choose “Custom SSL Certificate (example.com)”, select the certificate that you just requested, and save these changes.

    Route 53

    Finally, go to the hosted zone for your domain in Route 53, and click on “Create Record”. Leave the record type as “A” and toggle the “Alias” switch. This will transform the “Value” field to a drop down menu letting you select “Route traffic to”, in this case, “Alias to Cloudfront distribution”, and then a region, and then in the final drop down you can expect to be able to select the default url to the CloudFront instance (something like “docasfafsads.cloudfront.net”).

    Hit “Create records” and, in theory, you have a working production site.

    NextJs Routing

If you are using nextJs to generate your static files, then you will not be able to navigate straight to a page route because, I have discovered, the nextJs router will not pass you on to the correct page when the request falls back to index.js, as it would if you were using e.g. react router. There are two solutions to this problem, both expressed here.

    • Add trailing slash to all routes — simple but ugly solution IMO
• (Preferred) Create a copy of each .html file without the extension whenever you re-upload your site; this requires extra logic in your bash script (sketched below)
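
A rough sketch of that extra logic (assuming the exported site lives in ./out):

# give each page.html a sibling copy named page, so /page resolves on S3
find ./out -name '*.html' ! -name 'index.html' | while read -r f; do
    cp "$f" "${f%.html}"
done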

Troubleshooting

    • The terraform-launched S3 bucket comes with the setting “List” in the ACL section of permissions tab; it’s not clear to me what difference this makes.
• I was getting a lot of 504 errors at one point that had me befuddled. I noticed that they would go away if I first tried to access the site with http and then with https. I was saved by this post, and these notes that I was then prompted to find, which brought my attention to a setting that you cannot access in the AWS GUI Console called “Origin Protocol Policy”. Because I originally created the Cloudfront distribution with terraform, which can set this setting, and it had set it to “match-viewer”, the Cloudfront servers were trying to communicate with S3 using the same protocol that I was using. So when I tried to view the site with https and got a cache miss on a file, Cloudfront would try to access the origin with https; but the S3 website endpoint doesn’t handle https, so it would fail. When I tried using http, Cloudfront would successfully get the file from S3, so that the next time I tried with https I would get a cache hit. Now, since I don’t like using http in general, and in fact switched to redirecting all http requests to https, I was stuck until I modified my terraform module to change the value of Origin Protocol Policy to http-only. I do not know what the default value of Origin Protocol Policy is when you create a Cloudfront distribution through the Console; this might be a reason to always start off with terraform.
  • Installing an NVIDIA Gigabyte GTX 1060 Graphics Card into a Dell Precision T5810

    Might be easy. Might be hard. Never done it. Here are my notes.

    TL;DR

• Managed to get it working after a dozen or so reboots, driver installations, and placement permutations of the original and/or new graphics card
    • The only interesting/semi-innovative part of the project was having to pry off a strip of metal from the case’s side panel in order to get the new card to fit.

    Long Boring Version

    Warning: Ramblings Ahead

    I had hoped to go about this process in a slow/careful manner in order to be sure that I knew exactly what the keys to success are in installing a Gigabyte NVIDIA 1060 Geforce GTX into a Dell Precision T5810 running Windows 10.

    Nope. Sorry. I did get there eventually, but it was such a convoluted/confusing journey that, frankly, I really am not sure what the “magic” step(s) was/were. So if you’re trying to get help in achieving something similar, then all I can offer are my rambling recollections of a frustrating 1/2-day process.

    Background

    I wanted to try out photogrammetry (in particular meshroom and alicevision, which require Nvidia GPUs, with these recommended specs) so I posted on r/photogrammetry asking what the lowest-budget hardware is that I would need. Based on the advice I received from a random guy who sounded like he knew what he was talking about, I ended up getting a Dell T5810 from newegg.com. At $290+tax with 12 (logical) Xeon Cores and 64GB of RAM, it seemed like a phenomenal deal. (Tip: I suspect that this was a still-boxed return that newegg.com didn’t want sitting on their shelves so slashed the price; if you’re looking for a good deal, go to newegg.com and just try lots of combinations of specs; if you find a combo with a surprisingly low price — e.g. you add some RAM and price drops with “Only 1 in stock” — then you might be onto an awesome deal!)

    Now, to do photogrammetry, I’m told you need a decent GPU; the random guy recommended a GTX 1060. (Note ahead: he also warned that a lot of them are too big for the Dell T5810!)

    I’ve never been into gaming, so this was the first time in my life when I’d had to give graphics cards any attention. So this article is very much for the GPU newbies out there.

    After observing the ebay market for a week or two, I eventually put in a bid for a Gigabyte Geforce GTX 1060 with 3GB of RAM (the “1060” from hereon). It cost me about $90 + shipping. The box arrived in superb condition; the previous owner had evidently kept the box, wrappings, CD and start-up pamphlet, with the intention of selling the card on when upgrading in the future.

    My T5810 came with a basic Nvidia Quadro NVS 295 card. I tested it out by installing Steam and the free trial version of “Shadow of the Tomb Raider” (I literally haven’t played Tomb Raider since the late 90s!), and the poor little Quadro would not even start!

    Tomb Raider at 100% GPU, and even the Task Manager is burning up!

    I googled how to replace a graphics card and, of course, there were like a billion articles/discussions on Google, and so I could only peruse the first few I came across, which all conveyed the silly message that there’s nothing to it — just swap the cards, and you’re good to go. Yeah Right.

    Hardware Access

    So I powered off and opened up the Dell T5810. It was pretty easy to remove the Quadro — no screws to deal with, just a blue plastic clip held it in place. Here it is sitting alongside the 1060.

    A quick note on power delivery. The 1060 takes a 6-pin power input. (The Quadro does not; all of its power came via the PCIe slot.) My T5810 Power Supply Unit (PSU) is the default that comes from the manufacturer with the “minimum” 425W delivery. The 1060 says that it requires a PSU with 400W minimum. So part of my experiment is to test if this slender-sounding margin is truly sufficient. The Dell T5810 has two yellow 6-pin cables emanating from the PSU. According to this page, a 6-pin connector provides an additional 75W on top of the PCIe slot’s 75W.

    Setting the 1060 into place was only a tiny bit trickier. It was twice as wide as the Quadro, so it needed to be placed in a different PCIe slot with the extra slot cover removed.

    The bigger problem, though, involved the power connection. First, the 6-pin cable only just reached far enough to get to the top of the 1060 GTX where the power input is situated. Not wanting to put physical stress on the card or connection, I considered getting an extension, but decided in the end that the cable was “only just” long enough.

    Second, and much more seriously, the T5810’s side panel would not shut over the card, because the power cable connector, going into the top of the large 1060 card, collided with a ridiculous “crossbar” riveted to the inner side of the removable panel!

    The connector (red circle) collides with the “crossbar” (red ellipse) when trying to shut the side panel

    So I had to make a decision to either:

    • Sell the Gigabyte 1060 model (i.e. resell it on ebay), research what medium-range Nvidia models DO fit the T5810, and then buy that model from ebay/craigslist (at the cost of hours of work, shipping hassle, and weeks of waiting), or
    • Pry the riveted crossbar off the panel at risk of damaging the aesthetics of the Dell T5810.

    Since the Dell was already second hand, a little bit scuffed up, and not ever intended to be a flagship machine, and since the crossbar had no discernible structural role, I went with the latter. The de-riveting process was fairly easy — I just got a screwdriver and worked it under each rivet till it popped. You could see some minor dents on the outside; but overall I’m very pleased with the outcome.

    Some pics of the “de-riveting” process/outcome resulting in a closable computer case.

    Software Mayhem

    With the case in a closable state I powered on the T5810 having no idea what to expect. I did not install any drivers because (i) I was advised not to on some random google search, and (ii) I was not sure whether drivers would get installed automatically, so wanted to give Windows 10 the opportunity to impress me.

    To my (initial) pleasant surprise, the screen came on but, for better or worse, I thought that the image looked ‘crude’, and concluded that the signal was probably not arising from the GPU but from the Motherboard’s fallback signal, even though the signal was being channeled through the 1060 card to the DVI output. (My monitor is too old for an HDMI, but still has a great image.)

    So I decided to install the NVIDIA driver. I went to the Nvidia site, downloaded the relevant driver, and was just about to install it when I got a message in a dialog on my Windows 10 machine that said (paraphrased) that I needed to restart for the driver to take effect. Since I had not installed any driver yet, I concluded that Windows 10 must have initiated the driver download automatically, so I restarted with the naïve expectation that it would boot back again with the 1060 fully operational.

    Unfortunately, this did not happen. Instead, I got a blank screen. I left it for quite a while (1 hour?) and then killed the power and rebooted.

    Now I won’t bore you with the details, but I spent the rest of the day, and much of the next morning, jumping through hoops to get things working. I tried replacing the 1060 with the Quadro; I tried running them both together; I tried the HDMI output, the DVI outputs; I tried installing the driver I downloaded from nvidia.com; I tried installing the software that came with the CD. Sometimes the T5810 would boot; sometimes it wouldn’t. Sometimes it would boot and detect the 1060, sometimes it wouldn’t. I could not discern any rhyme or reason to what the heck was going on. The whole process was incredibly frustrating and opaque. What’s soooo annoying is that I worried it would be confusing and opaque, but neither the “Internet” (presented to me by Google), nor Windows 10, did anything to bring clarity to the situation. I was given no clue as to what was handled automatically, and what was not. I was given no advice to e.g. “give it several hours; it will boot on when ready”. Nothing.

    In the end, I got it working without the Quadro. However, I am not willing to say that “all’s well that ends well”, because I really wanted to understand the process, and have not been able to get the clarity that I desired. Windows 10 did nothing to make this process transparent. All it needed was a simple user experience along the lines of “A new graphics card has been detected. Would you like a driver to be installed automatically?”

    So if you’re looking for guidance in this area, all I can say is good luck. I still don’t understand why it is not advised to install the driver before installing the hardware. Next time I will.

    The good news is that the GPU seems to be performing very well now. It makes light work of Tomb Raider.

    Tomb Raider running at just 3% GPU.

  • Windows 10 for Unix Users

    Goals

    It’s been literally decades since I’ve owned a Windows machine; here’s a guide to setting up Windows 10 and getting orientated for those used to doing things the MacOS/Linux way. Some specific goals include:

    • Initial setup
    • Backups
    • Install winget package manager
    • Install WSL2 and Ubuntu
    • Install terminal

    Initial Setup & Orientation

    The most important tools and shortcuts for navigating round my Mac, and their W10 equivalents, are:​*​

    • CMD+Tab to switch programs
      • ALT+Tab
    • Cycle through programs (with CMD+Tab) to get to Finder
      • Unlike Finder, which is always open on a Mac, File Explorer is not necessarily open. So to get to it fast, I use WIN+N where N is the position of File Explorer in the Taskbar; I like to keep it at the first position, so WIN+1 does the trick
    • CMD+` to cycle through windows
      • Unfortunately, although lots of places online claim that the combo CTRL+` will cycle through windows of the same program, I have not found this to work on my W10 machine (W10M from here on). The best way I have found to achieve this is to use the WIN+N combo where N is the position of the program in the taskbar. If you have a program with two or more windows open, then repetitions of the WIN+N combo for that program will cycle through its windows. Since it is almost always the browser or File Explorer that has lots of windows open, I make sure to place these programs near the beginning of the taskbar.
    • CMD+zxcvwqas to perform “classic” short cut operations
      • CTRL+zxcvwqas
    • CTRL+CMD+left/right to move between tabs within Browser
      • CTRL+Tab, or
      • CTRL+PageUp/PageDown in Edge
    • CMD+left/right to move cursor to beginning/end of line
      • Home/End
    • Shift+CTRL+3/4 to take a screen shot
      • WIN+Shift+S to open the screen shot wizard; your image will initially be copied to the clipboard; to save it to a file, you need to click on the pop-up icon of that image and save it properly.

    Additionally, when we use ALT+Tab, we do NOT want to see all the various Edge browser tabs as this makes the visual field overwhelming. So we go to Settings > System > Multitasking and below the “Alt + Tab” section, choose “Open windows only” from the drop down menu.

    Windows File Structure

    The main difference between the W10 file structure and that of Unix is that, with Unix, everything is contained within a single tree rooted at '/'. With W10, each storage device gets assigned a letter, and each then acts as the root of its own file system. For a decent quick reference of the similarities/differences see here.
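
    As a rough sketch of the difference (the paths here are hypothetical, purely to illustrate the shape of each layout):

    # Unix: a single tree, rooted at /
    /home/dwd/projects/notes.txt

    # Windows: one tree per drive letter, each acting as its own root
    C:\Users\dwd\projects\notes.txt
    D:\backups\notes.txt
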

    The other thing worth noting is that the W10 File Explorer is a bit fiddly insofar as it does not readily provide a clear representation of the file structure; rather, it wants to provide you with a bunch of options for finding useful stuff. IMO, I’d prefer a simple tree-structured nested menu. I can get somewhat close to that by going to View > Options > Change Folder and Search Options > View and then checking/unchecking the following:

    • Checked
      • “Display the full path in the title bar”
      • “Show hidden folders, files and drives”
      • “Show all libraries”
    • Unchecked
      • “Hide empty drives”
      • “Show all folders”

    In the General tab, I also like to apply the “Single click to open an item” option.

    With these options applied, the File Explorer provides a fairly intuitive representation of the nested file systems. The only “quirk” is the entry labelled “This PC”. When selected, you don’t see a bunch of nested files/folders as is otherwise the case; rather, you see two sets of icons labelled “Folders” and “Devices and Drives”. The former lets you jump straight into the user’s private data, the latter lets you go to the root of a file system on a physical device. Although this is arguably inelegant in that it disrupts the simple pattern of having every view be a snapshot of one level within a nested structure, I can see that its purpose is well intentioned, and I can live with it.

    Installing Windows Subsystem for Linux 2

    There are two ways to install Windows Subsystem for Linux 2 (WSL2). The “simpler way” is to sign up to the Windows insider program where you get to install the latest non-stable version of W10. If you’re not willing to share your usage diagnostics with Microsoft or risk non-stable releases, then you need to do the following (based on this article) to enable WSL2. First, open a Powershell prompt with admin privileges and run this to enable the WSL feature itself:

    dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

    Then this to enable the Virtual Machine Platform that WSL2 runs on:

    dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

    … then restart. (Note: when I tried this my computer stalled on reboot; but it seemed to work after a further power-off-on cycle.) Next, download this WSL2 kernel update installer and run it, and then run this command to make WSL2 your default (as opposed to WSL1, I guess):

    wsl --set-default-version 2
    

    If you now run:

    wsl --list --verbose

    … then you’ll be told that:

    Windows Subsystem for Linux has no installed distributions.
    Distributions can be installed by visiting the Microsoft Store:
    https://aka.ms/wslstore

    So go to the Store and install the latest release of Ubuntu and, if you haven’t done so already, the new W10 Terminal. Now start/restart the Terminal and you can expect to find the option to start a new tab with the Ubuntu shell prompt. Run the usual:

    sudo apt update -y
    sudo apt upgrade -y

    … and you’re up to date and ready to go with WSL2!

    WSL2 Ubuntu Configuration

    OK, now that we have a bash shell working in Ubuntu, we want to be able to configure it nicely. In order to configure my working unix shells, I maintain an easy-to-install CLI called “myconfig” that you can see here. This allows me to quickly install up-to-date versions of tmux, vim, zsh, oh-my-zsh, nvm, powerlevel10K, etc., and to thereby enjoy a slick-looking shell experience packed with useful aliases, shell scripts and other tools.

    Anyhow, I was pleased to discover that my myconfig CLI has worked very well to date on WSL2 Ubuntu. The only extra thing I needed to do to get a nice zsh shell working was to configure the fonts for Powerlevel10k (P10K) to make the command prompt look snazzy. To do that, you need to manually download the four files specified in the P10k instructions to your W10M.

    Download these files to your W10 Machine

    Then double click on each file to open it and then click “install”.

    Then you need to press CTRL+, in your W10 Terminal app in order to open up the settings. These settings are configured as a JSON file, so you might need to specify the first time you open it that you want to open it in a text editor like notepad.

    Once open, find the section labelled “defaults” where you can specify the default look of the different shells available within Terminal, and add an entry telling Terminal to use the fonts we just installed on the W10M. (FYI: by installing these files, you make them system-findable; they end up under C:\Windows\Fonts as the “MesloLGS NF” family.)

    "defaults":
            {
                // Put settings here that you want to apply to all profiles.
    		//DWD ADDED
    		"fontFace":"MesloLGS NF"
            }

    And, presto, your Terminal will now have nice modern fonts applied that make P10k look fantastic. (If, unlike me, you don’t have P10k installed by a self-maintained CLI like myconfig then of course you’ll need to follow the instructions to set it up.)

    Again, nice work Microsoft, I honestly expected the Terminal to have rock-bottom 80s-level support for aesthetics.

    OK, this is not my zsh with P10k, but credit to Terminal that Hollywood works really well

    WSL2 SSH and Daemon Services

    What about running background services like an apache server from WSL2? One difference we have with WSL2 is that Ubuntu is not initiated with systemd, so to start/stop services, such as apache, you need to use the older SysV-style syntax of e.g. sudo service apache2 start or, equivalently:

    sudo /etc/init.d/apache2 start

    Of course, you need to ask yourself how you plan to use your W10M. If you plan to run something like a production server on it, then you also need to worry about things like restarts when you are out of town, forwarding requests between W10 and WSL2, etc. I am not going to explore that sort of thing now, but will make a note of this article and this article in case I do in the future.

    However, I will briefly cover ssh since, IMO, it’s not fun having to work directly with the W10M to accomplish something when I can work from a laptop.

    The best/simplest approach I have found so far is based on this article. The idea is to connect to the OpenSSH server that is easily installable on W10 via Powershell, and then to set the WSL2 bash as your default shell.

    As some quick background, if you open Powershell and then simply run bash, it will execute C:\Windows\System32\bash.exe (since that directory is on the path; cf. cmd /c path), which has the effect of turning the W10 Powershell session into a WSL2 Ubuntu bash shell. Similarly, when in a bash shell, you can run powershell.exe to convert the shell back into a Powershell (since /mnt/c/Windows/System32/WindowsPowerShell/v1.0 is in the path).

    So the approach we are taking is basically to ssh into Powershell and then run bash (or something similar under the hood).
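
    As a quick illustration of that round trip (the username and hostname below are hypothetical):

    # In a Powershell prompt:
    PS C:\Users\dwd> bash
    # ...we are now in the WSL2 Ubuntu shell; run powershell.exe to go back the other way:
    dwd@W10M:/mnt/c/Users/dwd$ powershell.exe
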

    First, check if you have ssh installed by running the following from an Administrator’s Powershell:

    Get-WindowsCapability -Online | ? Name -like 'OpenSSH*'
    
    Name  : OpenSSH.Client~~~~0.0.1.0
    State : Installed
    
    Name  : OpenSSH.Server~~~~0.0.1.0
    State : NotPresent

    To install an ssh server, run:

    Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0
    

    Now start the service and get it to start automatically:

    Start-Service sshd
    Set-Service -Name sshd -StartupType 'Automatic'

    Now we can set the default shell to bash with the following:

    New-ItemProperty -Path "HKLM:\SOFTWARE\OpenSSH" -Name DefaultShell -Value "C:\WINDOWS\System32\bash.exe" -PropertyType String -Force
    

    Of course, you can skip this step if you want to ssh into W10 with Powershell and then jump into bash as needed. Equally, with this setting, you ssh into bash and then can switch to Powershell with powershell.exe as described above.

    Now you need to go to Windows Defender Firewall > Advanced Settings > Inbound Rules and begin the “New Rule…” wizard. Select “Port” and click Next, then select TCP and enter “22” into the field labelled “Specific local ports” and click Next. Then select “Allow the connection” and click Next. Then uncheck all but the “Private” option and click Next. Then give this inbound rule a name like “SSH-Inbound-I” and click Finish. Do the same for an outbound rule, and now you can accept ssh connections from within your private network. (A Powershell equivalent for the inbound rule is sketched below if you prefer the command line.)
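
    For reference, here is a rough Powershell equivalent of that inbound rule using New-NetFirewallRule (run it from an admin Powershell; the rule name is just an example):

    New-NetFirewallRule -DisplayName "SSH-Inbound-I" `
        -Direction Inbound -Protocol TCP -LocalPort 22 `
        -Action Allow -Profile Private
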

    That’s it — now you can ssh into your W10M with your windows (not Ubuntu!) username/password combination, and get passed straight through to a WSL2 bash prompt.
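
    For example (hypothetical username and LAN address):

    ssh dwd@192.168.1.50
    # log in with the Windows account password; you land directly in a WSL2 bash prompt
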

    I would most definitely not recommend exposing your W10M to anything outside your private network!

    WSL2 Misc Notes

    What about graphical applications? The short story is that one can set up their W10M to display Linux GUI applications, but it involves a lot of setup. This article claims that Microsoft is working to make Linux applications run somewhat seamlessly on W10, so I plan to just wait for that (since there is no Linux GUI application in particular that I am keen to use on my W10M).

    How does the Ubuntu file system interact with the W10 file system? In the last few months there have been some improvements in this regard. When you start an Ubuntu shell, it lands you in your W10 home folder in what looks as though it’s a mounted volume! So you can peruse and edit the files on your NTFS file system from within your Linux shell! (About a year ago, you could do something similar, but it was nowhere near as smooth and “out of the box”.) I’m really pleasantly surprised by how much Microsoft seems to be investing in WSL2, and how well these things are working together.

    What about networking? Again, a year ago, if you tried starting an apache server in Ubuntu then, being within a virtual machine, it would not just “work” out of the box. But now when you do it, you can access that server from a W10 browser just by going to localhost in the URL bar. Awesome!

    What about ssh-ing into your Ubuntu shell?

    Intro to the new W10 Package Manager “winget”

    One of the best and most innovative things about Linux is the concept of the “package manager”. If you want to add software to your Linux machine, then, typically, you do not have to hunt down something to download, and then have that installer place something somewhere on your machine that may or may not be malware and/or completely impossible to remove (due to lack of centralized convention). Instead, you use a command line tool that speaks to a centralized remote “repository” of software. This CLI is called a package manager; it makes it incredibly easy to install/update/remove software and, since it’s organized by the same people in charge of your Linux distribution,​†​ you can have confidence that it won’t have any malware in it, and that it will be placed on your file system according to smart conventions so that it will not clash with other installed software, will have all the correct dependencies, etc., and thus will be easy to upgrade or remove in the future.

    As a rule of thumb, whenever you can install a program via a package manager, install it via a package manager. Conversely, if the only option to install software is via a third-party download, then first consider whether you really need it or if there is an alternative provided by a package manager. There is a lot to be said for keeping your machine lean and well organized, and key to this is avoiding third-party download installations, or keeping them to an absolute minimum. (I very rarely install programs other than through a package manager these days.)

    (Further aside: one of my dreaded memories of owning a Windows PC back in the 90s was the inexorable feeling that the machine would inevitably become bloated over the course of time as you download/install more and more software that — depending on the 3rd party’s competence and/or moral compass — would eventually bring your machine to a horrible grinding halt.)

    Thankfully, Microsoft have come round (after realizing the popularity of centralized repositories like the Apple App store, Google Play, Homebrew, and all the various Linux package managers). First, they implemented the Microsoft Store, which allows you to get plenty of games and GUI apps, and now they have the “winget” CLI. In fact, it is the offering of WSL2 and winget that made me decide to give W10 another chance after decades of lofty disdain.

    At some point winget will be built into W10. In the meantime, if you are not on the W10 insiders program, then you can install it by visiting the github release page for winget-cli, downloading the “*.appxbundle” file, and running it.
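
    If you prefer to install the downloaded bundle from Powershell rather than double-clicking it, something like the following should work (a sketch; it assumes the bundle is the only .appxbundle sitting in the current folder):

    # run from the folder you downloaded the release bundle to
    $bundle = Get-ChildItem .\*.appxbundle | Select-Object -First 1
    Add-AppxPackage -Path $bundle.FullName
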

    Once installed, you can run winget from a W10 Command Prompt or Powershell by running e.g. winget install vim.

    Exploring winget

    Our system thus far has three command line options:

    • The classic Windows “Command Prompt”
    • The Windows Powershell
    • Ubuntu WSL2

    Each of these can be launched through the Terminal app that we downloaded via the Microsoft Store. winget can be run from either of the Windows shells. To be clear, it is simply a safer and cleaner way to install programs on your W10M than by visiting myriad separate sites and downloading a graphical installer.
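
    A few basic winget commands to get a feel for it (run from a Powershell):

    winget search vim      # look up a package and its exact name/moniker
    winget show vim        # inspect the package details before installing
    winget install vim     # install it
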

    Here are some programs that I recommend you install right off the bat with winget:

    • winget install vim
    • winget install vscode-system-x86
    • winget install 'Google Chrome'
    • winget install Brave
    • winget install firefox
    • winget install blender

    vim of course is meant to be run directly from the command line but, unfortunately, winget does not configure the W10 path environment variable to make vim launchable straight after installation. To achieve this, search for “Advanced System Settings” in the W10 search field (bottom left of screen), and select the option associated with the Control Panel.

    Then click on “Environment Variables” and within the area labelled “System Variables” select path and click on edit:

    Then click on “New” and then type or “Browse” to the location "C:\Program Files\Vim\vim82". Click OK, etc. to apply these settings, then open a fresh Powershell or Command Prompt. There you can run path (from a Command Prompt) or echo $env:Path (from a Powershell) to check the new content of the Path environment variable and verify that the path we just added is recognized within the shell.
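
    If you’d rather skip the GUI, the same machine-wide change can be made from an admin Powershell. This is a sketch, and it assumes the default Vim 8.2 install location used above:

    # append the Vim folder to the machine-wide Path (run as Administrator)
    $machinePath = [Environment]::GetEnvironmentVariable('Path', 'Machine')
    [Environment]::SetEnvironmentVariable('Path', "$machinePath;C:\Program Files\Vim\vim82", 'Machine')
    # open a fresh shell afterwards to pick up the change
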

    As an alternative way to check if vim is findable from within a Command Prompt shell, you can run where vim, and from a Powershell, run cmd /c where vim.​‡​ Now we can run vim from the Windows shells to edit files directly.

    Likewise, GUI apps installed with winget can be found in the main applications menu, just like apps downloaded via the Microsoft Store. In some cases you can get a GUI app through either the Store or winget, such as blender.

    One shortcoming, IMO, of winget is the fact that you cannot perform upgrades or uninstalls right out of the box. These are considered “experimental features”. To enable them, you need to run winget settings, which opens a file in notepad, to which you then add the following block:

    {
        ...
        // DWD ADDED:
        "experimentalFeatures": {
            "uninstall": true,
            "upgrade": true
        }
    }

    Now you can run e.g. winget upgrade 'Google Chrome', etc. (IMO, this sort of functionality is central to what a package manager is all about, so having it hidden behind experimental flags does NOT inspire confidence.)

    Backing Up a W10 Machine

    … to be continued …


    1. ​*​
      I am not claiming these are the only or even the best ways to accomplish things; they are just part of my regular workflow, and so I want to be able to map these concepts from Mac to W10
    2. ​†​
      This is not always the case; sometimes you need to add access to additional remote repositories; when you do this, of course you have to make a judgement as to the reputability of the source.
    3. ​‡​
      cmd /c CMD just means to use the Command Prompt command to run CMD