Wrapping up - Day 2, afternoon

Everything is on the Web site

We’ve tried to put everything up on the Web site. Please ask us if you can’t find it!

Additional resources include those in U. Arizona – On Campus Resources, as well as using Google to look for answers, and checking out both StackOverflow and BioStars.

You might also consider attending office hours.

Posting IPython Notebooks

You can post IPython Notebooks on github, and view them statically via nbviewer.ipython.org (i.e. without loading them into a running IPython Notebook instance). For example, see:

https://github.com/swcarpentry/2013-04-az/blob/master/notebooks/10-introducing-bird-counting-FULL.ipynb

which can be viewed in “raw” form here:

https://github.com/swcarpentry/2013-04-az/raw/master/notebooks/10-introducing-bird-counting-FULL.ipynb

and which can be viewed in rendered form by pasting the ‘raw’ URL into http://nbviewer.ipython.org:

http://nbviewer.ipython.org/urls/github.com/swcarpentry/2013-04-az/raw/master/notebooks/10-introducing-bird-counting-FULL.ipynb

Moving around directories in the shell

Paths, directories, and file locations are a source of great confusing. Here are a few simple rules:

  1. Treat everything as a relative location.

    For example, if you’re in the 2013-04-az/scripts folder and you want to reference a file in the 2013-04-az/data folder, use:

    ../data/filename

    where the ‘..’ means “go up one level” and the ‘/data/’ means “go down into the data directory from there.”

  2. pwd will tell you what directory you’re in.

  3. cd will change your current directory.

  4. ls will list files in your current directory.

  5. TAB completion is your friend. TAB completion does two things: it makes you less typo-prone, AND it makes sure that the file actually exists (because you can only tab-complete on files that are actually there).

Shell scripts and pipelines

In the scripts subdirectory, we have a number of Python scripts. Turn your attention to three of them, please:

make-big-birdlist.py – creates a bunch of fake bird data. Run by giving it the name of the file that you want it to output, e.g.:

python make-big-birdlist.py counts.csv

make-birdcounts.py – parses the counts.csv file and converts dates into day-of-year, then produced a histogram of bird counts by day. This is then saved into a .dat file that can be loaded by the next script. Usage:

python make-birdcounts.py counts.csv counts.dat

plot-birdcounts.py – loads in the counts.dat file, produces a plot, and then saves the plot.

python plot-birdcounts.py counts.dat counts.pdf

These are three automated scripts that each do some smaller part of an overall larger set of tasks - essentially, a data analysis pipeline.

You could run all three scripts above by hand, but that’s error prone. You could also combine all three scripts above into one – there’s no reason why not – but that’s inflexible, especially if there are multiple ways to use the scripts (not so in the above case, but frequently true in real cases) or if some of the steps take a long time.

So what to do?

You can write a shell script, as in make-plot.sh. This can be run by typing:

bash make-plot.sh

All this is is a list of the commands you want run at the shell – simplicity itself, right? No hidden tricks here. It’s that simple.

Note that a good way to get a shell script started is to run a series of commands at the shell, then – when done – type ‘history’ to get a list of the commands you’ve run. Copy/paste from that history into a file, and you’ve got a shell script of sorts!

Remember to version control your shell scripts :)

Writing shell scripts is a good way to keep track of what it is you’ve actually run, as opposed to what you think you’ve run. And once they’re version controlled, well, now you’re really in good shape for your methods section...

Final point: if you look at the shell script, you’ll see that every time you run it, it runs all three commands. For large data sets, this can get time consuming, especially if you’re working to optimize one step, or doing parameter explanation. There’s a program called ‘make’ that can keep track of what has changed and what needs to be run again based on those changes, but you have to configure it a bit with a file called ‘Makefile’.

For an example from our own pipeline for the diginorm paper, see:

comments powered by Disqus



Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to Wrapping up - Day 2, afternoon on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.