Tag: python

  • Python Relative Imports

    Intro

    I thought I understood python modules. To my embarrassment, I tried doing a relative import just now and, having researched the ensuing error, realized that I had still failed to understand some fundamentals of relative imports.

    Background

    I created the following file structure:

    .
    └── src
        ├── __init__.py
        ├── main.py
        └── temp
            └── __init__.py
    

    … with the following demo code:

    ### src/main.py
    from .temp import foo
    print('>>>'+foo)
    
    ### src/temp/__init__.py
    foo = 'bar'

    … and tried running python3 src/main.py expecting to get the simple message “>>>bar” printed out. Instead, I got the error:

    Traceback (most recent call last):
      File ".../src/main.py", line 1, in <module>
        from .temp import foo
    ImportError: attempted relative import with no known parent package

    So something about the relative import was not working. What really threw me off understanding the problem was the fact that VSCode (with Pylance) was resolving the variable foo in the import without any apparent problem or warning, so I figured that the command-line interpreter must just be struggling to know where to look.

    VSCode threw me off by treating the file as a module, whereas the CLI interpreter was treating it as a top-level script!

    But no matter how many extra __init__.py files I scattered around the repo, or how many more directories I told the interpreter to look in for modules (viz. PYTHONPATH=".:src:src/temp" python src/main.py), I could not get this error about “no known parent package” to go away.

    TL;DR

    When you kick off a python process, you can choose to run your entry-point script either as a “top-level script” (TLS) or as a “module”. If you run main.py as a TLS, then you cannot import stuff from elsewhere using relative imports; if you run it as a module, then you can, provided the script is not in the same directory as the interpreter’s launch location. Please remind me why everyone loves python?

    # TLS: relative imports not ok
    python3 src/main.py
    
    # Module: relative imports ok
    python3 -m src.main

    Explanation

    This SO answer was key to my eventual understanding.​*​

    The key thing is that, in order to move up/down a “package hierarchy”, the interpreter needs to establish a position in that hierarchy, and it simply does not establish such a position if you start a process off as a TLS. You can see that this is the case by running print(">>>"+str(__package__)) within main.py and running it as a TLS (which prints “>>>None”).
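
    Concretely (a quick probe, assuming the src/ layout from the Background section):

    ### src/main.py: temporarily replace the import with this probe
    print('>>>' + str(__package__))
    
    # python3 src/main.py   ->  >>>None  (TLS: no position established)
    # python3 -m src.main   ->  >>>src   (module: position established)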

    Now, yes, this is counter-intuitive (as everyone in the python blogosphere agrees), and, yes, the authors could have built the language with a common-sense decision tree so as to establish a TLS’s placement within a package,​†​ but they did not, and you just have to treat that as a brute fact to work around.

    And since your initial position within a package is not determined, you can’t perform relative imports up/down the package hierarchy. By contrast, if you start off the process by executing a module then the “src.main” dot syntax that is passed to the interpreter is used to establish a position within a package hierarchy that can be examined within the __package__ variable.​‡​

    Further Notes

    This topic involves a bunch of pesky details/subtleties that it’s worth trying to remain aware of.

    • “Relative imports” begin with a dot, indicating placement within a package hierarchy. “Absolute imports” begin with a package/module name with dots to indicate descent into subdirectories. Absolute imports rely on the contents of the sys.path list, which you can set/augment with the PYTHONPATH env var.
    • If you perform an absolute import on a module, then the __name__ and __package__ variables of that module will reflect the position of the module in the package as it was called. For example, if you run from package1.subpackage.somefile import something, then the __name__ variable within somefile will be “package1.subpackage.somefile”, and the __package__ variable will be “package1.subpackage”; the interpreter knows how to use this information to then interpret imports relative to somefile. For example, you can run from .. import something_else within somefile to reach modules one level up, inside package1 (see the sketch after this list).
    • If you run a module from the same directory that you are in when starting the interpreter, even though you are specifying a module (e.g. python3 -m main), then no package structure has been communicated and, as with a TLS, you cannot perform relative imports right off the bat.
    • If you have a package with nested subdirectories then you do not need to have an __init__.py file in each of them in order to access the modules in a nested file/dir. The only real practical consequence of having an __init__.py file (that I can discern) is to turn the “directory itself” into a module, so you can import the dir by name without having to name any subdirectory or file.
    • I had thought that to be a package, a dir had to have an __init__.py file in it; perhaps that’s true in some sort of definitional sense but, practically speaking, a dir does NOT need to have an __init__.py file in order to function “like a package”, i.e. to just represent a collection of modules. In particular, as mentioned above, you do not need an __init__.py file in order to do imports from within a dir.
    • In writing this article I worried that my explanations would sound circular since I sort of assumed that the reader knew what I meant by the term “hierarchical package” and, ideally, I would define this first before building off of the concept. However, part of the challenge here is that a “package hierarchy” is clearly a notion that must be defined with respect to the design of the interpreter, and trying to understand how the interpreter works with respect to package hierarchies is really the point of this article. So, apologies if you feel dissatisfied; however, it feels sufficiently clear in my head (right now at least) what is going on here to close this chapter.
    • Having researched this matter recently, I can report that a lot of people feel python’s package/module system is confusing and sub-optimally designed.
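
    To make the second bullet above concrete, here is a small sketch; package1, subpackage, somefile, and something_else are the hypothetical names from that bullet, and helper and other_module are additional hypothetical modules (a sibling and an uncle, respectively):

    ### package1/subpackage/somefile.py
    print(__name__)      # 'package1.subpackage.somefile' when imported absolutely
    print(__package__)   # 'package1.subpackage'
    
    from . import helper                       # a sibling module inside subpackage
    from ..other_module import something_else  # one level up, inside package1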

    Summary

    As a rule of thumb, you can only perform relative imports after performing an absolute import, or after kicking things off as a module with hierarchical details (i.e. dots in the module path passed to -m), since this info is used to establish placement within a package hierarchy.

    Relative imports are tricky and generally discouraged in the python community, since absolute imports (e.g. from package1.some_module import something) tend to read more clearly than relative imports (e.g. from ..some_module import something).

    So make sure that when you name packages in your code you are mindful not to introduce confusion/conflicts with packages installed from 3rd parties. (Sounds “imprecise”, but python is just far from perfect.)


    1. ​*​
      Beware: this response, while thorough, gets one detail wrong; python -m src.main will render __name__ as “__main__”, not “src.main”. You need to refer to the __package__ variable as your primary means for understanding how the interpreter determines placement within the package hierarchy.
    2. ​†​
      E.g. “check if __init__.py is in same dir as TLS and, if so, make that dir the package name”
    3. ​‡​
      Interestingly, it seems you don’t even need an __init__.py file within src for the interpreter to treat it as a package in this special case
  • Launching Subprocesses in Python3

    Motivation

    It’s important in the world of IT to know a scripting language for data processing, task automation, etc. For simple tasks (e.g. moving files, starting programs), my go-to scripting language is Bash. But when I need something with more tools and precision (e.g. parsing html), I use python3.

    I decided in recent times to move towards coding in a more cross-platform manner, which basically means less bash, and more python. That meant that I needed to get more comfortable with python’s system of launching subprocesses. For years I’d been in the habit of copy/pasting code like this (which I probably grabbed from Stack Overflow originally), without really thinking through what was happening:

    import subprocess
    # Launch some shell script CMD
    p = subprocess.Popen(
        CMD,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT
    )
    # Wait for it to finish
    p.wait()
    # Save the result to a str variable
    result: str = p.stdout.read().decode('utf8')

    Not only is this a lot of code compared to the Bash equivalent (i.e. just $CMD), I also wasn’t very clear on whether all of these verbose arguments and methods were really needed (Spoiler: they’re not!), and this had been a mental block to using python3 more liberally in the past. So having actually now read the subprocess documentation, I figured I’d consolidate what I’d learnt there by trying to describe it all to a younger self.

    Standard Streams Recap

    Skip this section if you’re already well-versed with “Standard Streams” (stdin, stdout, and stderr) — you’ll need this to understand the above code block. If you are not well-versed with them then here is a super swift crash course (with focus on the two output streams: stdout, and stderr).

    What streams “are” is a bit abstract (think inner OS/kernel magic), and it’s easier to learn to think in terms of how you work with them. Let’s start by thinking in general terms about the sort of things one wants to be able to do when writing a program:

    • receive input information
    • send signals/data to other running processes
    • interact with devices (files in storage, graphics cards, network cards, etc.)
    • signal to developers/administrators the state and/or output of the program
    • start child processes (that can also do all of the above)

    The fourth item in the list (signaling the program’s state and/or output) is what output streams are all about, and will be our focus. When you write a program, you want to be able to get information about the state of the program to the user who will run the program, but you want that end user to get to decide how that information is to be viewed/processed.​*​ ​†​

    Since you do not know in advance what the user will want to do with the information that you will want the program to broadcast, the operating system (OS) provides you with the ability to send messages out of your program in a sort of “un-opinionated” manner. More specifically, the OS lets you place a label on the data you want to emit (viz. “standard” or “error”), but where such messages will go, and how they will be used, will not be decided at the moment you write the program. Rather, the user of the program will be in control as to where data output with the “standard” label will be sent, and where data output with the “error” label will be sent.​‡​

    The purpose of providing the programmer with two spaces to publish information is that it will allow the end user to separate these messages by, for example, viewing the messages sent to stdout on the terminal and saving messages sent to stderr to a file.
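
    A minimal python illustration (demo.py is a hypothetical file name):

    ### demo.py: publish data under both labels
    import sys
    sys.stdout.write('regular progress info\n')    # the "standard" label
    sys.stderr.write('something went wrong\n')     # the "error" label
    
    # the end user separates the labels at launch time, e.g.:
    #   python3 demo.py 2> err.log   (stdout to the screen, stderr to a file)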

    To see this in action, we’ll consider the simple command “ping www.google.com” run on a linux terminal. (I choose ping because it runs indefinitely, allowing us to examine the properties of this process before it ends.)

    If your network connection is ok, this program will print to stdout every second. Now, the terminal is itself a program (a special one, designed to receive input and run other programs) and, being a program, it too can (and does) send messages to stdout and stderr.

    Where do those messages “end up”? We can find the PID of the ping process (ps -ef | grep -Ei "PID|ping"), which is 5381 in this case, and then use that PID in the following command on linux:

    sudo ls -l /proc/5381/fd
    lrwx------ 1 root root 64 Sep 28 00:10 0 -> /dev/tty1
    lrwx------ 1 root root 64 Sep 28 00:10 1 -> /dev/tty1
    lrwx------ 1 root root 64 Sep 28 00:10 2 -> /dev/tty1

    The file descriptors shown in this print out (/proc/5381/fd/0, /proc/5381/fd/1, and /proc/5381/fd/2) tell you that the stdin, stdout and stderr respectively for this process all “point to” /dev/tty1. This is a linux virtual device file that you can think of as a handle or interface to a driver for the (emulated) terminal. (This is the same terminal that ping was started in, which can be confirmed by running the command tty.) Since ping prints to stdout, and since stdout points to the terminal emulator, the data is sent there and displayed on the screen accordingly.
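
    You can also ask for this destination from inside a process; for instance, in python:

    import os
    # which device does this process's stdin (fd 0) point to?
    print(os.ttyname(0))   # e.g. '/dev/tty1', matching the output of tty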

    As stated earlier, the destination of messages sent to stdout and stderr is only determined at the moment that the OS turns the program into a running process. In the case of a linux terminal, the processes that are started therein, such as ping, inherit by default the same standard-stream destinations as those of the terminal. This is why the file descriptors above all point to the terminal’s device, /dev/tty1, by default. But we can change what the file descriptors will point to when we start the process by using redirects.

    For example, if we now begin a new process in the terminal with ping www.google.com 1> /dev/null, then we get a new PID (5816), and override the default value of the file descriptor (which would have been /proc/5816/fd/1 -> /dev/tty1), so that we won’t see any regular printout to terminal. Examining the file descriptors again for the new ping process:

    sudo ls -l /proc/5816/fd
    lrwx------ 1 root root 64 Sep 28 00:10 0 -> /dev/tty1
    lrwx------ 1 root root 64 Sep 28 00:10 1 -> /dev/null
    lrwx------ 1 root root 64 Sep 28 00:10 2 -> /dev/tty1
    ...

    … confirms that stdout is pointing to /dev/null — the linux “black hole” — so the messages now just get thrown away. Likewise, if we now redirect stderr to a file, and stdout to stderr when starting ping:

    ping www.google.com 2> /tmp/temp.txt 1>&2 
    sudo ls -l /proc/5816/fd
    lrwx------ 1 root root 64 Sep 28 01:25 0 -> /dev/pts/0
    l-wx------ 1 root root 64 Sep 28 01:25 1 -> /tmp/temp.txt
    l-wx------ 1 root root 64 Sep 28 01:25 2 -> /tmp/temp.txt

    … then we get nothing printed to the terminal, and the output of ping ends up in /tmp/temp.txt, as expected.

    A few notes are useful here if this is new-ish to you:

    • The numbers around the redirect symbol > represent the standard streams as follows: stdin: 0, stdout: 1, stderr: 2. So 1>&2 means redirect stdout to stderr, etc.
    • A redirect symbol > without a number in front is short for 1> (redirect stdout to something)
    • You need an ampersand & after the > symbol whenever you redirect to a number representing a standard stream, otherwise the terminal will read e.g. 2>1 as “redirect stderr to a file named 1“. Don’t use an ampersand though if redirecting to a file path.
    • The order of the redirects in the earlier example might seem counterintuitive at first. You might expect it to look like ping www.google.com 2>&1 1> /tmp/temp.txt, which looks as though it reads “redirect stderr to stdout, and stdout to a file”. But if you think of these redirects in terms of setting what the file descriptors point to, and read the command from left to right, then you see that at the moment the terminal reads 2>&1 it will set the file descriptor /proc/5816/fd/2 to point to the same destination held by /proc/5816/fd/1, which has not yet been changed from its default value; so this redirect will have no effect, and stderr will still print to screen. That is why you need to first set one of the streams to point to the file (e.g. /proc/5816/fd/1 -> /tmp/temp.txt), and then set the other stream to point to the same thing as the first (e.g. /proc/5816/fd/2 -> /proc/5816/fd/1 -> /tmp/temp.txt). (The python sketch after this list shows the same idea.)
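
    Here is the same fd-by-fd logic reproduced from python with os.dup2 (the file path is arbitrary):

    import os
    
    log = os.open('/tmp/temp.txt', os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.dup2(log, 1)   # first: make fd 1 (stdout) point at the file
    os.dup2(1, 2)     # then: make fd 2 (stderr) point wherever fd 1 now points
    os.write(1, b'this lands in /tmp/temp.txt, not the terminal\n')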

    You can also print messages to another terminal window by identifying its device file (tty), and then redirecting stdout to that device (e.g. echo hello > /dev/tty2).

    In summary: since most of us learn about programming and system admin in a terminal, it’s easy to come to think of the programs we’re used to launching there as being in some sense bound up with, or unusable without, the terminal. But all of those programs have no intrinsic tie to the terminal; the terminal has simply been determining, conveniently, the default destinations for their standard streams. Once you realize this, you can begin to appreciate the need to be able to explicitly set the streams of programs that are not launched by a terminal.
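
    For instance, here is one way for a python program to launch a child with its streams pointed somewhere other than a terminal (a taste of the next section):

    import subprocess
    
    # the child's stdout and stderr land in a log file; no tty involved
    with open('/tmp/ping.log', 'w') as log:
        subprocess.run(['ping', '-c', '2', 'www.google.com'],
                       stdout=log, stderr=subprocess.STDOUT)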

    Python3 Subprocess.run and .Popen

    Now let’s go back to the python code I’d been pasting/copying for several years and see if we can understand and simplify what’s happening with the subprocess module.

    import subprocess
    # Launch CMD
    p = subprocess.Popen(
        CMD,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT
    )
    # Wait for it to finish
    p.wait()
    # Save the result to a str variable
    result: str = p.stdout.read().decode('utf8')

    The first thing I learned by reading the subprocess documentation is that (blush) I wasn’t even using the recommended method:

    The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle

    The run function is a simplified wrapper around the more complete Popen interface I had been using, and you basically want to use it whenever you want to execute a synchronous​§​ process of finite duration. For example, you can view your PWD contents with the following in the python3 REPL:

    >>> import subprocess
    >>> subprocess.run(['/bin/ls'])
    temp1    temp2

    (Note: if you launch python3 in a terminal then it will inherit the shell’s environment variables, including $PATH, which means that you often don’t need to spell out the full path to the executable as I’ve done here.)

    Notice that the executed program, though launched as a separate process, still prints to the same terminal as the python3 REPL. We can discern why this happens by going through the same procedure we went through earlier, i.e. by launching an ongoing process like ping:

    >>> import subprocess
    >>> subprocess.run(['/bin/ping','www.google.com'])
    64 bytes from 172.253.63.104: icmp_seq=1 ttl=101 time=1.31 ms
    64 bytes from 172.253.63.104: icmp_seq=2 ttl=101 time=1.42 ms
    ...

    … and, in a separate window, finding the PID and examining the file descriptors of that process:

    ❯ sudo ls -l /proc/18175/fd
    lrwx------ 1 root root 64 Sep 28 21:29 0 -> /dev/tty1
    lrwx------ 1 root root 64 Sep 28 21:29 1 -> /dev/tty1
    lrwx------ 1 root root 64 Sep 28 21:29 2 -> /dev/tty1
    

    The ping process evidently inherited the same file descriptors as its parent process (the python3 REPL), which itself inherited those descriptors from its parent, the terminal. So both python3 and ping will print to the same terminal.

    Now, we want to be able to launch processes in python3 with something akin to redirection in the terminal. In particular, we want to be able to pipe the standard output streams of the process we launch with the subprocess module to the parent python3 process and to be able to capture that data as a python3 variable. We do that by providing stdout and stderr arguments, as shown in the following example:

    >>> from subprocess import run, PIPE
    >>> url = 'www.google.com'
    >>> p = run(['/bin/ping','-c','2',url], stdout=PIPE, stderr=PIPE)
    >>> p.stdout.decode('utf8')
    'PING www.google.com (172.217.2.100) 56(84) bytes of data. ...'

    Notice this time that we made ping run for only a finite duration by supplying the -c 2 (count) arguments, and that the process did not print to terminal while running. This is because the stdout=PIPE argument has an effect similar to the terminal redirection command (1>).

    Where/how was ping‘s stdout redirected? We can investigate as before by rerunning the above code (but without ‘-c’, ‘2’ to make the process run indefinitely), finding the PID of the new ping process in another window, and examining that process’ file descriptors:

    ❯ sudo ls -l /proc/33463/fd/
    lrwx------ 1 root root 64 Sep 28 20:58 0 -> /dev/tty1
    l-wx------ 1 root root 64 Sep 28 20:58 1 -> 'pipe:[110288]'
    l-wx------ 1 root root 64 Sep 28 20:58 2 -> 'pipe:[110289]'
    ...

    As we can see, ping‘s stdout is now being directed to a device labelled 'pipe:[110288]' (and stderr to a device labelled 'pipe:[110289]'). This “pipe” is an OS in-memory “unnamed” device​¶​ whose purpose is to connect a write-able file descriptor of one process to a read-able file descriptor of another process. (Pipes connect a process to a process, redirects connect a process to a file.) The number 110288 is the ID for the inode of the pipe device file in the filesystem. You can get more information on the pipe device file with the lsof (“list open files”) utility:

    ❯ lsof | grep -E "PID|110288"
    COMMAND   PID   ... FD     TYPE    DEVICE ...    NODE    NAME
    python3   33462 ... 3r     FIFO    0,13   ...    110288  pipe
    ping      33463 ... 1w     FIFO    0,13   ...    110288  pipe

    Here we can see that the pipe shows up in relation to the python3 and ping processes, with PIDs 33462 and 33463 respectively. The FD column gives the file descriptor number for the pipe file, and the letters r and w refer to read and write permissions. Referring to the previous ls -l command, we can confirm here that /proc/33463/fd/1 does indeed point to this pipe device file, and it does have write-only permissions.

    Let’s now look at the corresponding python3 file descriptors:

    > ls -l /proc/33462/fd/
    lrwx------ 1 dwd dwd 64 Sep 28 20:59 0 -> /dev/tty1
    lrwx------ 1 dwd dwd 64 Sep 28 20:59 1 -> /dev/tty1
    lrwx------ 1 dwd dwd 64 Sep 28 20:59 2 -> /dev/tty1
    lr-x------ 1 dwd dwd 64 Sep 28 20:59 3 -> 'pipe:[110288]'
    lr-x------ 1 dwd dwd 64 Sep 28 20:59 5 -> 'pipe:[110289]'

    Here we can see that the python3 parent process has kept its standard streams pointing to /dev/tty1 (so you can still interact with it through the terminal). In addition, it has created two new file descriptors (3 and 5) pointing to the two pipes we created in our subprocess.run command (one for stdout, one for stderr). The file descriptor /proc/33462/fd/3, as we have seen, is the read-only end of the pipe emanating from the stdout file descriptor of the ping process. This “non-standard stream” file descriptor is created by the python3 process according to its underlying C code, which marshals the data emitted from the pipe into a python runtime variable; that is why we are able to see the result of the ping process printed out in the python3 REPL as a string.
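
    Incidentally, you can play with the same mechanism directly from python via os.pipe; a minimal sketch:

    import os
    
    r, w = os.pipe()                 # ask the kernel for an unnamed pipe: (read fd, write fd)
    os.write(w, b'hello through a pipe\n')
    os.close(w)                      # closing the write end signals EOF to the reader
    print(os.read(r, 1024))          # b'hello through a pipe\n'
    os.close(r)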

    For reference here is some relatively simple C code demonstrating inter-process communication through pipes, the sort of thing you’d find in python3‘s source code.

    Let’s return to the subprocess module. The other argument worth taking special note of is shell (which defaults to False). When set to True, subprocess.run passes the first argument (which the documentation recommends be a single string in this case, rather than a list of strings) to the /bin/sh program for execution as a script. So now python3 launches a single child process, the sh shell, which can launch arbitrarily many further child processes. That obviously has the advantage of letting you do more complex sequences of commands in a script-like written format, and lets you take advantage of shell features like setting/expanding env variables. Piping of stdout and stderr works the same: any process you invoke in your shell script that writes to either stream will contribute to the strings that become accessible in run().stdout.decode('utf8') and run().stderr.decode('utf8').
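
    A quick sketch of the shell=True form (the command string itself is arbitrary):

    from subprocess import run, PIPE
    
    # a single string handed to /bin/sh; env expansion and chaining work as in a script
    p = run('echo "home is $HOME" && date', shell=True, stdout=PIPE, stderr=PIPE)
    print(p.stdout.decode('utf8'))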

    There are only two disadvantages that I can discern of using the shell=True argument:

    • Overhead: it takes resources to start a new shell
    • Complexity: there’s something to be said for keeping your calls to the outside world short, simple and infrequent

    Finally, let’s review the subprocess.Popen interface that subprocess.run wraps around. The main difference is that run is blocking, while Popen is not. That is, if you start a process with run, python will wait for it to finish before proceeding to the next line of code. Again, this is great when you just want to get e.g. the stdout of a command-line tool dumped straight into a string variable:

    >>> from subprocess import run
    >>> date = run('date', capture_output=True).stdout.decode('utf8')
    >>> date
    'Mon Sep 28 23:44:49 EDT 2020\n'

    Note: the argument (..., capture_output=True) is provided in python3.7+ as a shortcut for (..., stdout=PIPE, stderr=PIPE).

    Popen, by contrast, is a class constructor that will start the subprocess and return an object that lets you communicate with that process, while your code immediately moves on to the next line. This is useful if you want to launch a lot of processes, like network requests, in parallel. You then control the timing of the processes with the Popen.wait() method. The Popen object also has more complex data structures owing to its asynchronous nature, meaning that, for example, you have to read the output through an intermediary .read() call before decoding it to a string. The equivalent code with Popen to the above run code is thus the more verbose pattern I had been using for so long:

    >>> from subprocess import Popen, PIPE
    >>> p = Popen('date', stdout=PIPE, stderr=PIPE)
    >>> p.wait()
    0
    >>> date = p.stdout.read().decode('utf8')
    >>> date
    'Mon Sep 28 23:44:49 EDT 2020\n'

    Summary

    I expect I’ll be using the following patterns a lot going forward.

    from subprocess import Popen, run, PIPE
    ### Simple stuff
    run(['mv', 'temp', 'temp2'])
    ### Simple I/O stuff
    date = run(['date'], capture_output=True).stdout.decode('utf8')
    ### Parallel stuff
    p1 = Popen(['curl', '-o', 'foo.html', 'https://www.foo.com'])
    p2 = Popen(['curl', '-o', 'bar.html', 'https://www.bar.com'])
    p1.wait()
    p2.wait()

    (And, yes, I know there are native python3 equivalents to all these commands.)

    Further reading

    https://www.linusakesson.net/programming/tty/

    https://www.informit.com/articles/article.aspx?p=2854374&seqNum=5

    https://lucasfcosta.com/2019/04/07/streams-introduction.html


    1. ​*​
      Is the message to be literally “viewed” (i.e. on the screen), is it to be “stored” (i.e. saved to a file on disk), is it to be “piped” (i.e. imbibed as input information by another program), or ignored (i.e. discarded)?
    2. ​†​
      The “user” of your program could of course be someone who writes a program that calls your program
    3. ​‡​
      The key here is that you don’t need to know what this channel “is”, only that the user will be provided with a systematic means to determine where messages designated for that channel will end up.
    4. ​§​
      I.e. your code will wait for it to finish
    5. ​¶​
      An unnamed device is one that does not show up in the /dev/ directory