Wrapping up - Day 2, afternoon
==============================

Everything is on the Web site
-----------------------------

We've tried to put everything up on the Web site. Please ask us if you
can't find it!

Additional resources include those in :doc:`resources`, as well as
using Google to look for answers, and checking out both `StackOverflow
<http://stackoverflow.com/>`__ and `BioStars <http://www.biostars.org/>`__.
You might also consider `attending office hours `__.

Posting IPython Notebooks
-------------------------

You can post IPython Notebooks on github and view them statically via
nbviewer.ipython.org (i.e. without loading them into a running IPython
Notebook instance). For example, see:

https://github.com/swcarpentry/2013-04-az/blob/master/notebooks/10-introducing-bird-counting-FULL.ipynb

which can be viewed in "raw" form here:

https://github.com/swcarpentry/2013-04-az/raw/master/notebooks/10-introducing-bird-counting-FULL.ipynb

and which can be viewed in rendered form by pasting the 'raw' URL into
http://nbviewer.ipython.org:

http://nbviewer.ipython.org/urls/github.com/swcarpentry/2013-04-az/raw/master/notebooks/10-introducing-bird-counting-FULL.ipynb

Moving around directories in the shell
--------------------------------------

Paths, directories, and file locations are a source of great confusion.
Here are a few simple rules:

1. Treat everything as a relative location. For example, if you're in
   the 2013-04-az/scripts folder and you want to reference a file in
   the 2013-04-az/data folder, use::

      ../data/filename

   where the '..' means "go up one level" and the '/data/' means "go
   down into the data directory from there."

2. pwd will tell you what directory you're in.

3. cd will change your current directory.

4. ls will list the files in your current directory.

5. TAB completion is your friend. TAB completion does two things: it
   makes you less typo-prone, AND it makes sure that the file actually
   exists (because you can only tab-complete on files that are actually
   there).

Shell scripts and pipelines
---------------------------

In the `scripts subdirectory `__, we have a number of Python scripts.
Turn your attention to three of them, please:

`make-big-birdlist.py `__ -- creates a bunch of fake bird data. Run it
by giving it the name of the file that you want it to output, e.g.::

   python make-big-birdlist.py counts.csv

`make-birdcounts.py `__ -- parses the counts.csv file, converts dates
into day-of-year, and produces a histogram of bird counts by day. This
histogram is then saved into a .dat file that can be loaded by the next
script. Usage::

   python make-birdcounts.py counts.csv counts.dat

`plot-birdcounts.py `__ -- loads in the counts.dat file, produces a
plot, and then saves the plot. Usage::

   python plot-birdcounts.py counts.dat counts.pdf

These are three automated scripts that each do some smaller part of an
overall larger set of tasks -- essentially, a data analysis pipeline.
You could run all three scripts above by hand, but that's error prone.
You could also combine all three scripts above into one -- there's no
reason why not -- but that's inflexible, especially if there are
multiple ways to use the scripts (not so in the above case, but
frequently true in real cases) or if some of the steps take a long
time.

So what to do? You can write a shell script, as in `make-plot.sh `__.
This can be run by typing::

   bash make-plot.sh

All it is is a list of the commands you want run at the shell --
simplicity itself, right? No hidden tricks here. It's that simple.
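For concreteness, a make-plot.sh for this pipeline could be as simple
as the following sketch, which just lists the three commands from above
in order (the actual script in the scripts subdirectory may differ
slightly)::

   #!/bin/bash
   # Run the whole bird-counting pipeline, start to finish.

   # Step 1: generate the fake bird data.
   python make-big-birdlist.py counts.csv

   # Step 2: convert dates to day-of-year and histogram the counts.
   python make-birdcounts.py counts.csv counts.dat

   # Step 3: plot the histogram and save it as a PDF.
   python plot-birdcounts.py counts.dat counts.pdf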
Note that a good way to get a shell script started is to run a series
of commands at the shell, then -- when done -- type 'history' to get a
list of the commands you've run. Copy/paste from that history into a
file, and you've got a shell script of sorts! Remember to version
control your shell scripts :)

Writing shell scripts is a good way to keep track of what you've
actually run, as opposed to what you think you've run. And once they're
version controlled, well, now you're really in good shape for your
methods section...

Final point: if you look at the shell script, you'll see that every
time you run it, it runs all three commands. For large data sets, this
can get time consuming, especially if you're working to optimize one
step, or doing parameter exploration. There's a program called 'make'
that can keep track of what has changed and what needs to be run again
based on those changes, but you have to configure it a bit with a file
called 'Makefile'. For an example from our own pipeline for the
diginorm paper, see:

https://github.com/ged-lab/2012-paper-diginorm/blob/master/pipeline/Makefile
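To give you the flavor, here is a minimal sketch of what a Makefile for
the bird-counting pipeline above might look like (an illustration only --
the diginorm Makefile linked above is the real-world example). Each
rule names a target file, the files it depends on, and the command that
builds it; make reruns a command only when a dependency is newer than
its target::

   # NB: the indented command lines below must start with a TAB.

   # Default target: build the final plot.
   all: counts.pdf

   # Generate the fake bird data (no dependencies).
   counts.csv:
   	python make-big-birdlist.py counts.csv

   # Rebuild the histogram if the data or the script changed.
   counts.dat: counts.csv make-birdcounts.py
   	python make-birdcounts.py counts.csv counts.dat

   # Rebuild the plot if the histogram or the script changed.
   counts.pdf: counts.dat plot-birdcounts.py
   	python plot-birdcounts.py counts.dat counts.pdf

Now typing 'make' runs only the steps that are out of date: edit
plot-birdcounts.py, type 'make' again, and only the plotting step
reruns.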