Using Python for scripting¶

In this notebook, I'll cover some tricks for using Python to script jobs in the terminal.

The os module¶

The built-in module os contains lots of useful functionality. For example, here is a way to check if a file exists:

In [1]:
import os
os.path.exists("python_scripting.ipynb")
Out[1]:
True

Here is a way to check whether a directory exists and, if it doesn't, create it (calling mkdir on a directory that already exists raises an error):

In [2]:
directory = "example_directory"
if not os.path.exists(directory):
    os.mkdir(directory)
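As an alternative to the explicit check, os.makedirs accepts exist_ok=True, which creates the directory (and any missing parent directories) and does nothing if it is already there. Here is a minimal sketch; the temporary directory and the "nested" sub-directory are assumptions made so that the example does not touch any real files.

```python
import os
import tempfile

# Work inside a throwaway directory (an assumption for this sketch)
with tempfile.TemporaryDirectory() as tmpdir:
    # "nested" is a hypothetical sub-directory, used to show that parents are created too
    directory = os.path.join(tmpdir, "example_directory", "nested")

    os.makedirs(directory, exist_ok=True)  # creates both levels
    os.makedirs(directory, exist_ok=True)  # second call is a no-op, no error raised

    created = os.path.isdir(directory)

print(created)
```

This avoids the separate existence check, which is handy when several scripts may try to create the same directory.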

Reading/writing files¶

Python includes built-in methods to read and write files (see, e.g., this tutorial). However, for scientific work it is generally best to use dedicated software that understands the file format.

For example, a common way to store data is CSV (comma-separated values). Here, pandas is one of the best modules available and handles the task easily. First, let's create a data frame:

In [3]:
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6]))
df
Out[3]:
A B
0 1 4
1 2 5
2 3 6

Then we will write it to a file, test.csv, in the example_directory created above:

In [4]:
filename = "example_directory/test.csv"
df.to_csv(filename)

We can look at the contents of the file using the cat command (note that the "!" runs a shell command from within this notebook):

In [5]:
!cat example_directory/test.csv
,A,B
0,1,4
1,2,5
2,3,6

Then we can read it back in like this (note that index_col=0 is needed so that pandas treats the first column as the index rather than as data):

In [6]:
df = pd.read_csv(filename, index_col=0)
df
Out[6]:
A B
0 1 4
1 2 5
2 3 6
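If the row index carries no information, another option is to leave it out of the file altogether by passing index=False to to_csv; the file can then be read back without index_col. The sketch below assumes pandas is installed and writes to a temporary directory; the file name test_no_index.csv is hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6]))

# Write to a temporary directory so the sketch does not touch real files
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test_no_index.csv")  # hypothetical file name

    df.to_csv(path, index=False)  # do not write the row index

    with open(path) as f:
        text = f.read()  # raw file contents, for inspection

    df_back = pd.read_csv(path)  # no index_col needed this time

print(text)
print(df_back.equals(df))
```

Which approach to prefer depends on whether the index is meaningful (for example, a timestamp) or just a row counter.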

The subprocess module¶

subprocess is a built-in module that enables you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. In other words, you can run any executable from Python using subprocess.

For example, let's use the disk usage program du to find out the size of this notebook:

In [7]:
import subprocess
cmd = ["du", "-h", "python_scripting.ipynb"]
out = subprocess.run(cmd)
12K	python_scripting.ipynb

As you can see, the output gets printed in the notebook. If you want to capture the output instead, use the capture_output argument; the returned out (a [subprocess.CompletedProcess](https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess) instance) will then hold the output:

In [8]:
out = subprocess.run(cmd, capture_output=True)

# Print the captured output after converting from bytes to a string
print(out.stdout.decode("utf-8"))
12K	python_scripting.ipynb
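If you would rather get a string back directly, and have a failing command raise an exception, run also accepts text=True and check=True. Here is a minimal sketch; it runs the Python interpreter itself (via sys.executable) rather than du, purely as an assumption so that the example works on any platform.

```python
import subprocess
import sys

# Illustrative command (not the du example above): ask Python to print a word
cmd = [sys.executable, "-c", "print('hello')"]

# text=True decodes stdout/stderr to str; check=True raises
# subprocess.CalledProcessError if the return code is non-zero
out = subprocess.run(cmd, capture_output=True, text=True, check=True)

print(out.returncode)
print(out.stdout.strip())
```

With check=True, a script stops at the first failing command instead of silently carrying on.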

Finally, note that cmd above is a list that starts with the program and then includes any flags and arguments. I define it as a variable because it is often useful to print it before passing it to run, so you know what is happening!
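When printing the command, shlex.join (available from Python 3.8) turns the list into a single string that could be pasted into a shell, quoting any argument that needs it. A small sketch follows; the file name with a space in it is a made-up example, and the command is only printed, not run.

```python
import shlex

# Hypothetical command: the file name contains a space on purpose
cmd = ["du", "-h", "my file.ipynb"]

# Build a shell-safe, human-readable version of the command for logging
printable = shlex.join(cmd)
print(printable)
```

Note that the list form is still what gets passed to subprocess.run; the joined string is only for display.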

The glob module¶

glob finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. For example, let's find all the files in the example_directory:

In [9]:
import glob

files = glob.glob("example_directory/*")
print(files)
['example_directory/test.csv']

In the pattern passed to glob.glob you can use wildcards, for example to limit the results to CSV files only:

In [10]:
files = glob.glob("example_directory/*csv")
print(files)
['example_directory/test.csv']
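Patterns can also search sub-directories: with recursive=True, the "**" wildcard matches any number of directory levels. The sketch below builds a small throwaway directory tree (the file names notes.txt and sub/deep.csv are assumptions for illustration) and compares a top-level pattern with a recursive one.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a few empty, hypothetical files: two CSVs (one nested) and a text file
    os.makedirs(os.path.join(tmpdir, "sub"))
    for name in ["test.csv", "notes.txt", os.path.join("sub", "deep.csv")]:
        open(os.path.join(tmpdir, name), "w").close()

    # Only CSV files directly inside tmpdir
    top_level = glob.glob(os.path.join(tmpdir, "*.csv"))

    # CSV files in tmpdir and in any sub-directory
    recursive = glob.glob(os.path.join(tmpdir, "**", "*.csv"), recursive=True)

    top_names = sorted(os.path.basename(p) for p in top_level)
    all_names = sorted(os.path.basename(p) for p in recursive)

print(top_names)
print(all_names)
```

glob does not guarantee any particular ordering, which is why the results are sorted before printing.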

The returned files is a list of file names, which you can now iterate over, for example to check each file size with subprocess and du:

In [11]:
for file in files:
    cmd = ["du", "-h", file]
    subprocess.run(cmd)
4.0K	example_directory/test.csv
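If all you need is the size in bytes, the same loop can be written without an external program by using os.path.getsize. This is a sketch of an alternative, not a replacement for du: du reports the disk space used (in blocks, hence the 4.0K above), while getsize reports the length of the file contents. The temporary directory and file contents are assumptions for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a small, hypothetical CSV file in binary mode so the byte count is exact
    path = os.path.join(tmpdir, "test.csv")
    with open(path, "wb") as f:
        f.write(b",A,B\n0,1,4\n")

    # Map each matching file name to its size in bytes
    sizes = {}
    for file in glob.glob(os.path.join(tmpdir, "*.csv")):
        sizes[os.path.basename(file)] = os.path.getsize(file)

print(sizes)
```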