Motivation
It’s important in the world of IT to know a scripting language for data processing, task automation, etc. For simple tasks (e.g. moving files, starting programs), my go-to scripting language is Bash. But when I need something with more tools and precision (e.g. parsing HTML), I use python3.
I decided in recent times to move towards coding in a more cross-platform manner, which basically means less bash, and more python. That meant that I needed to get more comfortable with python’s system of launching subprocesses. For years I’d been in the habit of copy/pasting code like this (which I probably grabbed from Stack Overflow originally), without really thinking through what was happening:
import subprocess
# Launch some shell script CMD
p = subprocess.Popen(
CMD,
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT
)
# Wait for it to finish
p.wait()
# Save the result to a str variable
result: str = p.stdout.read().decode('utf8')
Not only is this a lot of code compared to the Bash equivalent (i.e. just $CMD), but I also wasn’t very clear on whether all of these verbose arguments and methods were really needed (spoiler: they’re not!), and this had been a mental block to using python3 more liberally in the past. So, having now actually read the subprocess documentation, I figured I’d consolidate what I’d learnt there by trying to describe it all to a younger self.
Standard Streams Recap
Skip this section if you’re already well-versed in the “standard streams” (stdin, stdout, and stderr), which you’ll need to understand the code block above. If you are not well-versed in them, then here is a super swift crash course (with a focus on the two output streams, stdout and stderr).
What streams “are” is a bit abstract (think inner OS/kernel magic), and it’s easier to learn to think in terms of how you work with them. Let’s start by thinking in general terms about the sort of things one wants to be able to do when writing a program:
- receive input information
- send signals/data to other running processes
- interact with devices (files in storage, graphics cards, network cards, etc.)
- signal to developers/administrators the state and/or output of the program
- start child processes (that can also do all of the above)
The fourth item (signalling to developers/administrators the state and/or output of the program) is what output streams are all about, and will be our focus. When you write a program, you want to be able to get information about the state of the program to the user who will run it, but you want that end user to decide how that information is to be viewed/processed.* †
Since you do not know in advance what the user will want to do with the information that you will want the program to broadcast, the operating system (OS) provides you with the ability to send messages out of your program in a sort of “un-opinionated” manner. More specifically, the OS lets you place a label on the data you want to emit (viz. “standard” or “error”), but where such messages will go, and how they will be used, will not be decided at the moment you write the program. Rather, the user of the program will be in control as to where data output with the “standard” label will be sent, and where data output with the “error” label will be sent.‡
The purpose of providing the programmer with two labelled channels on which to publish information is to allow the end user to separate these messages by, for example, viewing the messages sent to stdout in the terminal and saving the messages sent to stderr to a file.
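To make this concrete, here is a minimal Python sketch (the file name streams_demo.py is just a hypothetical example) that writes one message to each output stream; the person who runs it, not the script itself, decides where each message ends up:
# streams_demo.py -- write one message to each output stream
import sys

sys.stdout.write('a routine status message\n')   # the "standard" label
sys.stderr.write('something went wrong\n')       # the "error" label
Running python3 streams_demo.py 1> out.txt would, for example, send the first message to the file out.txt while the second one still appears on the terminal.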
To see this in action, we’ll consider the simple command “ping www.google.com” run on a linux terminal. (I choose ping because it runs indefinitely, allowing us to examine the properties of this process before it ends.)
If your network connection is OK, this program will print to stdout every second. Now, the terminal is itself a program (a special one) designed to receive input and run other programs, and, being a program, it too can (and does) send messages to stdout and stderr.
Where do those messages “end up”? We can find the PID of the ping process (ps -ef | grep -Ei "PID|ping"), which is 5381 in this case, and then use that PID in the following command on linux:
sudo ls -l /proc/5381/fd
lrwx------ 1 root root 64 Sep 28 00:10 0 -> /dev/tty1
lrwx------ 1 root root 64 Sep 28 00:10 1 -> /dev/tty1
lrwx------ 1 root root 64 Sep 28 00:10 2 -> /dev/tty1
The file descriptors shown in this print out (/proc/5381/fd/0, /proc/5381/fd/1, and /proc/5381/fd/2) tell you that the stdin, stdout and stderr respectively for this process all “point to” /dev/tty1. This is a linux virtual device file that you can think of as a handle or interface to a driver for the (emulated) terminal. (This is the same terminal that ping was started in, which can be confirmed by running the command tty.) Since ping prints to stdout, and since stdout points to the terminal emulator, the data is sent there and displayed on the screen accordingly.
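Incidentally, you can do this kind of inspection from inside a program too. Here is a minimal, Linux-only Python sketch that asks the kernel, via the same /proc filesystem, where its own standard streams currently point:
# where_do_my_streams_point.py -- Linux-only sketch using /proc/self/fd
import os

for fd, name in [(0, 'stdin'), (1, 'stdout'), (2, 'stderr')]:
    print(name, '->', os.readlink(f'/proc/self/fd/{fd}'))
Run it from an interactive terminal and each line will typically show a tty device (e.g. /dev/tty1 or /dev/pts/0); run it as python3 where_do_my_streams_point.py 1> /tmp/out.txt and the output (now sitting in that file) will show stdout pointing at /tmp/out.txt instead.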
As stated earlier, the destination of messages sent to stdout and stderr is only determined at the moment the OS launches the program as a new process. In the case of a linux terminal, the processes started therein, such as ping, inherit by default the same standard-stream destinations as the terminal itself. This is why the file descriptors above all point to the terminal’s device file /dev/tty1 by default. But we can change what the file descriptors will point to when we start the process by using redirects.
For example, if we now begin a new process in the terminal with ping www.google.com 1> /dev/null, then we get a new PID (5816) and override the default value of the file descriptor (which would have been /proc/5816/fd/1 -> /dev/tty1), so that we won’t see any regular printout to the terminal. Examining the file descriptors again for the new ping process:
sudo ls -l /proc/5816/fd
lrwx------ 1 root root 64 Sep 28 00:10 0 -> /dev/tty1
lrwx------ 1 root root 64 Sep 28 00:10 1 -> /dev/null
lrwx------ 1 root root 64 Sep 28 00:10 2 -> /dev/tty1
...
… confirms that stdout is pointing to /dev/null — the linux “black hole” — so the messages now just get thrown away. Likewise, if we now redirect stderr to a file, and stdout to stderr when starting ping:
ping www.google.com 2> /tmp/temp.txt 1>&2
sudo ls -l /proc/5816/fd
lrwx------ 1 root root 64 Sep 28 01:25 0 -> /dev/pts/0
l-wx------ 1 root root 64 Sep 28 01:25 1 -> /tmp/temp.txt
l-wx------ 1 root root 64 Sep 28 01:25 2 -> /tmp/temp.txt
… then we get nothing printed to the terminal, and the output of ping ends up in /tmp/temp.txt, as expected.
A few notes are useful here if this is new-ish to you:
- The numbers around the redirect symbol > represent the standard streams as follows: stdin: 0, stdout: 1, stderr: 2. So 1>&2 means redirect stdout to stderr, etc.
- A redirect symbol > without a number in front of it is short for 1> (redirect stdout to something).
- You need an ampersand & after the > symbol whenever you redirect to a number representing a standard stream, otherwise the terminal will read e.g. 2>1 as “redirect stderr to a file named 1”. Don’t use an ampersand though if redirecting to a file path.
- The order of the redirects in the earlier example might seem counterintuitive at first. You might expect it to look like ping www.google.com 2>&1 1> /tmp/temp.txt, which looks as though it reads “redirect stderr to stdout, and stdout to a file”. But if you think of these redirects in terms of setting what the file descriptors point to, and read the command from left to right, then you see that at the moment the terminal reads 2>&1 it will set the file descriptor /proc/5816/fd/2 to point to the same destination held by /proc/5816/fd/1, which has not yet been changed from its default value; so this redirect will not have any effect, and stderr will still print to the screen. That is why you need to first set one of the streams to point to the file (e.g. /proc/5816/fd/1 -> /tmp/temp.txt), and then set the other stream to point to the same thing as the previous stream (e.g. /proc/5816/fd/2 -> /proc/5816/fd/1 -> /tmp/temp.txt). (See the Python sketch after these notes.)
You can also print messages to another terminal window by identifying its device file (tty), and then redirecting stdout to that device (e.g. echo hello > /dev/tty2).
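The shell typically implements these redirects with the dup2() system call: “2>&1” essentially means “copy whatever file descriptor 1 currently points to into file descriptor 2”. Here is a rough Python sketch of that left-to-right behaviour, reusing the example path /tmp/temp.txt from above:
# dup2_sketch.py -- what `2> /tmp/temp.txt 1>&2` does to this process's fds
import os

log = os.open('/tmp/temp.txt', os.O_WRONLY | os.O_CREAT | os.O_TRUNC)

os.dup2(log, 2)   # `2> /tmp/temp.txt`: fd 2 now points at the file
os.dup2(2, 1)     # `1>&2`: fd 1 now points wherever fd 2 currently points

os.write(1, b'a stdout line\n')   # ends up in /tmp/temp.txt
os.write(2, b'a stderr line\n')   # ends up in /tmp/temp.txt
os.close(log)
If you performed the dup2 calls in the opposite order (copying fd 1 into fd 2 while fd 1 still pointed at the terminal), fd 2 would simply keep pointing at the terminal, which is exactly the ordering pitfall described in the last note above.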
In summary: since most of us learn about programming and system admin in a terminal, it’s easy to come to think of the programs we’re used to launching there as being in some sense bound up with, or unusable without, the terminal. But the programs you launch from the terminal have no intrinsic tie to it; the terminal has simply been determining the default destinations for the standard streams of the programs you’ve been running in it. Once you realize this, you can begin to appreciate the need to explicitly set the streams of programs that are not launched by a terminal.
Python3 subprocess.run and subprocess.Popen
Now let’s go back to the python code I’d been copy/pasting for several years and see if we can understand and simplify what’s happening with the subprocess module.
import subprocess
# Launch CMD
p = subprocess.Popen(
CMD,
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT
)
# Wait for it to finish
p.wait()
# Save the result to result str
result: str = p.stdout.read().decode('utf8').strip()
The first thing I learned by reading the subprocess documentation is that (blush) I wasn’t even using the recommended method:
The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle.
The run function is a simplified wrapper around the more complete Popen interface I had been using, and you basically want to use it whenever you want to execute a synchronous§ process of finite duration. For example, you can list the contents of your current working directory with the following in the python3 REPL:
>>> import subprocess
>>> subprocess.run(['/bin/ls'])
temp1 temp2
(Note: if you launch python3 in a terminal then it will inherit the shell’s environment variables, including $PATH, which means that you often don’t need to spell out the full path to the executable as I’ve done here.)
Notice that the executed program, though launched as a separate process, still prints to the same terminal as the python3 REPL. We can discern why this happens by going through the same procedure we went through earlier, i.e. by launching an ongoing process like ping:
>>> import subprocess
>>> subprocess.run(['/bin/ping','www.google.com'])
64 bytes from 172.253.63.104: icmp_seq=1 ttl=101 time=1.31 ms
64 bytes from 172.253.63.104: icmp_seq=2 ttl=101 time=1.42 ms
...
… and, in a separate window, finding the PID and examining the file descriptors of that process:
❯ sudo ls -l /proc/18175/fd
lrwx------ 1 root root 64 Sep 28 21:29 0 -> /dev/tty1
lrwx------ 1 root root 64 Sep 28 21:29 1 -> /dev/tty1
lrwx------ 1 root root 64 Sep 28 21:29 2 -> /dev/tty1
The ping process evidently inherited the same file descriptors as its parent process (the python3 REPL), which itself inherited those descriptors from its parent, the terminal. So both python3 and ping will print to the same terminal.
Now, we want to be able to launch processes in python3 with something akin to redirection in the terminal. In particular, we want to be able to pipe the standard output streams of the process we launch with the subprocess module to the parent python3 process and to be able to capture that data as a python3 variable. We do that by providing stdout and stderr arguments, as shown in the following example:
>>> from subprocess import run, PIPE
>>> url = 'www.google.com'
>>> p = run(['/bin/ping','-c','2',url], stdout=PIPE, stderr=PIPE)
>>> p.stdout.decode('utf8')
'PING www.google.com (172.217.2.100) 56(84) bytes of data. ...'
Notice this time that we made ping run for only a finite duration by supplying the -c 2 (count) arguments, and that the process did not print to the terminal while running. This is because the stdout=PIPE argument has an effect similar to a terminal redirect (1>).
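As an aside, PIPE isn’t the only value these arguments accept: you can hand them an open file object (the analogue of 1> file) or subprocess.DEVNULL (the analogue of 1> /dev/null). A minimal sketch, with /tmp/ping.txt as an arbitrary example path:
# Send ping's stdout to a file and throw its stderr away
import subprocess

with open('/tmp/ping.txt', 'w') as f:
    subprocess.run(['/bin/ping', '-c', '2', 'www.google.com'],
                   stdout=f, stderr=subprocess.DEVNULL)
# Nothing prints to the terminal; the ping output is now in /tmp/ping.txt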
Where/how was ping‘s stdout redirected? We can investigate as before by rerunning the above code (but without ‘-c’, ‘2’ to make the process run indefinitely), finding the PID of the new ping process in another window, and examining that process’ file descriptors:
❯ sudo ls -l /proc/33463/fd/
lrwx------ 1 root root 64 Sep 28 20:58 0 -> /dev/tty1
l-wx------ 1 root root 64 Sep 28 20:58 1 -> 'pipe:[110288]'
l-wx------ 1 root root 64 Sep 28 20:58 2 -> 'pipe:[110289]'
...
As we can see, ping‘s stdout is now being directed to a device labelled 'pipe:[110288]' (and stderr to a device labelled 'pipe:[110289]'). This “pipe” is an OS in-memory “unnamed” device¶ whose purpose is to connect a write-able file descriptor of one process to a read-able file descriptor of another process. (Pipes connect a process to a process, redirects connect a process to a file.) The number 110288 is the ID for the inode of the pipe device file in the filesystem. You can get more information on the pipe device file with the lsof (“list open files”) utility:
❯ lsof | grep -E "PID|110288"
COMMAND PID ... FD TYPE DEVICE ... NODE NAME
python3 33462 ... 3r FIFO 0,13 ... 110288 pipe
ping 33463 ... 1w FIFO 0,13 ... 110288 pipe
Here we can see that the pipe shows up in relation to the python3 and ping processes, with PID 33462 and 33463 respectively. The FD column gives the file descriptor number for the pipe file, and the letters r and w refer to read and write permissions. Referring to the previous ls -l command, we can confirm here that /proc/33463/fd/1 does indeed point to this pipe device file, and it does have write-only permissions.
Let’s now look at the corresponding python3 file descriptors:
❯ ls -l /proc/33462/fd/
lrwx------ 1 dwd dwd 64 Sep 28 20:59 0 -> /dev/tty1
lrwx------ 1 dwd dwd 64 Sep 28 20:59 1 -> /dev/tty1
lrwx------ 1 dwd dwd 64 Sep 28 20:59 2 -> /dev/tty1
lr-x------ 1 dwd dwd 64 Sep 28 20:59 3 -> 'pipe:[110288]'
lr-x------ 1 dwd dwd 64 Sep 28 20:59 5 -> 'pipe:[110289]'
Here we can see that the python3 parent process has kept its standard streams pointing to /dev/tty1 (so you can still interact with it through the terminal). In addition, it has two new file descriptors (3 and 5) pointing to the two pipes created by our subprocess.run command (one for stdout, one for stderr). The file descriptor /proc/33462/fd/3, as we have seen, is the read-only end of the pipe emanating from the stdout file descriptor of the ping process. This “non-standard stream” file descriptor is created by the python3 process according to its underlying C code, which is responsible for marshaling the data emitted from the pipe into a python runtime variable; this is why we are able to see the result of the ping process in the python3 REPL as a string.
For reference, here is some relatively simple C code demonstrating inter-process communication through pipes, the sort of thing you’d find in python3’s source code.
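If you’d rather stay in Python, here is a rough Unix-only sketch of the same idea using os.pipe and os.fork: the parent creates a pipe, the child writes into the pipe’s write end, and the parent reads the message back from the read end.
# pipe_sketch.py -- parent/child communication over an unnamed pipe (Unix-only)
import os

read_fd, write_fd = os.pipe()              # returns (read end, write end)
pid = os.fork()

if pid == 0:                               # child process
    os.close(read_fd)                      # child only writes
    os.write(write_fd, b'hello from the child\n')
    os.close(write_fd)
    os._exit(0)
else:                                      # parent process
    os.close(write_fd)                     # parent only reads
    message = os.read(read_fd, 1024)
    os.close(read_fd)
    os.waitpid(pid, 0)
    print('parent received:', message.decode('utf8'))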
Let’s return to the subprocess module. The other argument worth taking special note of is shell (which defaults to False). When set to True, subprocess.run passes the first argument (which the documentation recommends be a string rather than a list of strings in this case) to the /bin/sh program for execution as a script. So python3 launches a single child process, the sh shell, which can in turn launch arbitrarily many further child processes. That obviously has the advantage of letting you write more complex sequences of commands in a script-like format, and lets you take advantage of shell features like setting/expanding environment variables. Piping of stdout and stderr works the same way: any process you invoke in your shell script that writes to either stream will contribute to the strings that become accessible via run(...).stdout.decode('utf8') and run(...).stderr.decode('utf8').
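For instance, here’s a minimal sketch of the shell=True form in the REPL (the echo/tr pipeline is just an illustration of shell features you couldn’t use otherwise, and the printed path will of course depend on your own $HOME):
>>> from subprocess import run, PIPE
>>> p = run('echo "$HOME" | tr a-z A-Z', shell=True, stdout=PIPE, stderr=PIPE)
>>> p.stdout.decode('utf8')
'/HOME/DWD\n'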
There are only two disadvantages that I can discern to using the shell=True argument:
- Overhead: it takes resources to start a new shell
- Complexity: there’s something to be said for keeping your calls to the outside world short, simple and infrequent
Finally, let’s review subprocess.Popen, which subprocess.run wraps. The main difference is that run is blocking, while Popen is not. That is, if you start a process with run, python will wait for it to finish before proceeding to the next line of python code. Again, this is great when you just want to get e.g. the stdout of a command-line tool dumped straight into a string variable:
>>> from subprocess import run
>>> date = run('date', capture_output=True).stdout.decode('utf8')
>>> date
'Mon Sep 28 23:44:49 EDT 2020\n'
Note: the argument (..., capture_output=True) is provided in python3.7+ as a shortcut for (..., stdout=PIPE, stderr=PIPE).
Popen, by contrast, is a class constructor that will start the subprocess, return an object that lets you communicate with that process, and then immediately move on to the next line of code. This is useful if you want to launch a lot of processes, like network requests, in parallel. You then control the timing of the processes with the Popen.wait() method. The Popen object also exposes lower-level data structures owing to its asynchronous nature, meaning that, for example, you have to read the output buffer with an intermediary .read() call before decoding it to a string. The equivalent code with Popen to the above run code is thus the more verbose pattern I had been using for so long:
>>> from subprocess import Popen, PIPE
>>> p = Popen('date', stdout=PIPE, stderr=PIPE)
>>> p.wait()
0
>>> date = p.stdout.read().decode('utf8')
>>> date
'Mon Sep 28 23:44:49 EDT 2020\n'
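Worth noting in passing: for this wait-then-read pattern the documentation also provides Popen.communicate(), which waits for the process and hands back the (stdout, stderr) bytes in a single call:
>>> from subprocess import Popen, PIPE
>>> p = Popen('date', stdout=PIPE, stderr=PIPE)
>>> out, err = p.communicate()
>>> out.decode('utf8')
'Mon Sep 28 23:44:49 EDT 2020\n'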
Summary
I expect I’ll be using the following patterns a lot going forward.
from subprocess import Popen, run, PIPE
### Simple stuff
run(['mv', 'temp', 'temp2'])
### Simple I/O stuff
date = run(['date'], capture_output=True).stdout.decode('utf8')
### Parallel stuff
p1 = Popen(['curl', '-o', 'foo.html', 'https://www.foo.com'])
p2 = Popen(['curl', '-o', 'bar.html', 'https://www.bar.com'])
p1.wait()
p2.wait()
(And, yes, I know there are native python3 equivalents to all these commands.)
Further reading
https://www.linusakesson.net/programming/tty/
https://www.informit.com/articles/article.aspx?p=2854374&seqNum=5
https://lucasfcosta.com/2019/04/07/streams-introduction.html
- *Is the message to be literally “viewed” (i.e. on the screen), is it to be “stored” (i.e. saved to a file on disk), is it to be “piped” (i.e. imbibed as input information by another program), or ignored (i.e. discarded)?
- †The “user” of your program could of course be someone who writes a program that calls your program
- ‡The key here is that you don’t need to know what this channel “is”, only that the user will be provided with a systematic means to determine where messages designated for that channel will end up.
- §I.e. your code will wait for it to finish
- ¶An unnamed device is one that does not show up in the /dev/ directory