Using Python for scripting
In this notebook, I'll cover some tricks for using Python to script jobs in the terminal.
The os module
The built-in module os contains lots of useful functionality. For example, here is a way to check if a file exists:
import os
os.path.exists("python_scripting.ipynb")
True
Here is a way to check whether a directory exists and, if it doesn't, create it (calling os.mkdir on an existing directory would otherwise raise a FileExistsError):
directory = "example_directory"
if not os.path.exists(directory):
    os.mkdir(directory)
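The check-then-create pattern above can also be collapsed into a single call. A minimal sketch, assuming Python 3.2 or later (where os.makedirs accepts exist_ok); the temporary directory and the name demo_dir are only there so the example does not touch the notebook's own folder:

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "example_directory")

    # exist_ok=True makes the call a no-op if the directory already exists
    os.makedirs(demo_dir, exist_ok=True)
    os.makedirs(demo_dir, exist_ok=True)  # second call does not raise

    created = os.path.isdir(demo_dir)

print(created)
```

Unlike os.mkdir, os.makedirs also creates any missing parent directories, which may or may not be what you want in a script.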
Reading/writing files
Python includes built-in methods to read and write files (see, e.g., this tutorial). However, for scientific work it is generally best to use dedicated software that knows about the file format.
For example, a common way to store data is csv (comma-separated values). Here, pandas is one of the best modules available and easily handles the task. First, let's create a data frame:
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6]))
df
|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
Then we will write it to a file test.csv, which we put in the example_directory created above:
filename = "example_directory/test.csv"
df.to_csv(filename)
We can look at the contents of the file using the cat command (note that the leading "!" runs a shell command from within this notebook):
!cat example_directory/test.csv
,A,B
0,1,4
1,2,5
2,3,6
Then we can read it back in like this (note that index_col=0 is needed so that pandas treats the first column as the index rather than as data):
df = pd.read_csv(filename, index_col=0)
df
|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
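If the row index carries no information, an alternative is to leave it out of the file altogether, in which case index_col is not needed when reading. A small sketch of that variant, using a temporary directory and a hypothetical file name test_no_index.csv:

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6]))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this example
    demo_file = os.path.join(tmp, "test_no_index.csv")

    # index=False writes only the columns A and B, without the row labels
    df_demo.to_csv(demo_file, index=False)

    # No index_col needed: pandas builds a fresh default index on read
    df_back = pd.read_csv(demo_file)

print(df_back)
```

Which variant is preferable depends on whether the index is meaningful (for example, timestamps or sample IDs) or just a row counter.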
The subprocess module
subprocess is a built-in module that enables you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. In other words, you can run any executable from Python using subprocess.
For example, let's use the disk usage program du to find out the size of this notebook:
import subprocess
cmd = ["du", "-h", "python_scripting.ipynb"]
out = subprocess.run(cmd)
12K python_scripting.ipynb
As you can see, the output gets printed in the notebook. If you want to capture the output instead, you can use the capture_output argument; the returned out (a [`subprocess.CompletedProcess`](https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess) instance) will then hold the output:
out = subprocess.run(cmd, capture_output=True)
# Print the captured output after converting from bytes to a string
print(out.stdout.decode("utf-8"))
12K python_scripting.ipynb
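The same call can also decode the output for you and expose the return code mentioned earlier. A minimal sketch, assuming Python 3.7 or later (for capture_output and text); it runs the Python interpreter itself via sys.executable rather than du, purely so the example works on any platform:

```python
import subprocess
import sys

# A stand-in command: ask the current Python interpreter to print a line
demo_cmd = [sys.executable, "-c", "print('hello from a subprocess')"]

# text=True returns str instead of bytes; check=True raises
# subprocess.CalledProcessError if the command exits with a non-zero code
result = subprocess.run(demo_cmd, capture_output=True, text=True, check=True)

print(result.returncode)      # 0 indicates success
print(result.stdout.strip())
```

Passing check=True is a reasonable default in scripts, since a failing command then stops the script instead of being silently ignored.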
Finally, it is worth noting that the cmd above is a list, starting with the program and followed by any flags and arguments. I define it as a variable because it is often useful to print it before calling subprocess.run, so you know what is happening!
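One way to print the command in a form that could be pasted back into a shell is shlex.join. A small sketch, assuming Python 3.8 or later (where shlex.join was added); the helper name run_verbose is hypothetical, and the interpreter again stands in for a real program:

```python
import shlex
import subprocess
import sys


def run_verbose(cmd):
    """Print a command as a shell-style string, then run it (hypothetical helper)."""
    # shlex.join quotes arguments containing spaces or special characters
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True)


demo = run_verbose([sys.executable, "-c", "print('a b')"])
print(demo.stdout.strip())
```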
The glob module
glob finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. For example, let's find all the files in the example_directory:
import glob
files = glob.glob("example_directory/*")
print(files)
['example_directory/test.csv']
In the pattern passed to glob.glob you can use wildcard matching, for example to limit the results to only csv files:
files = glob.glob("example_directory/*csv")
print(files)
['example_directory/test.csv']
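Patterns can also descend into subdirectories. A minimal sketch, assuming Python 3.5 or later, where glob.glob accepts recursive=True and "**" matches any number of nested directories; the directory layout here is made up for the example:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a made-up layout: one csv at the top level, one in a subdirectory
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ["a.csv", os.path.join("sub", "b.csv"), "notes.txt"]:
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x\n")

    # "**" only spans directories when recursive=True is passed
    pattern = os.path.join(tmp, "**", "*.csv")
    found = sorted(glob.glob(pattern, recursive=True))

    # Report paths relative to the temporary directory
    csv_files = [os.path.relpath(p, tmp) for p in found]

print(csv_files)
```

Note that glob.glob makes no promise about the order of its results, which is why the sketch sorts them before use.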
The returned files is a list of file names, which you can now iterate over. For example, to check the file size of each with subprocess and du:
for file in files:
    cmd = ["du", "-h", file]
    subprocess.run(cmd)
4.0K example_directory/test.csv
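If only the size is needed, the standard library can report it directly without spawning a process. A sketch using os.path.getsize, which returns the size of the contents in bytes (du reports disk usage in blocks, so the two numbers generally differ); the file is a temporary one created for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")

    # newline="" avoids platform-specific newline translation on write
    with open(demo_path, "w", newline="") as f:
        f.write(",A,B\n0,1,4\n")

    # Size of the file contents in bytes
    size_bytes = os.path.getsize(demo_path)

print(size_bytes)
```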