June 17th Technical Track Workshop: Bioinformatics pipeline development using NextFlow, Singularity, and Git.
BioFrontiers IT and Michael Smallegan (IQBiology/Rinnlab) will be presenting a one-day, hands-on exploration of what is possible using modern containerization and pipeline management technologies to standardize bioinformatics pipelines. We will be hosting this interactive session to ease the transition into more consistent and standardized data processing using Git, NextFlow, and Singularity. Attendees will leave the session with working knowledge and a real-world example of how these technologies can help ensure reproducibility in bioinformatics analyses. The session is designed to allow participants from any institution to build useful tools that will run on diverse computing resources. The event will be capped with testimonials and project descriptions from researchers using these tools to support their workflows. Note: the workshop is now full, but we will be covering similar topics at our conference, which still has space available! Register here!
For full computational reproducibility, we should aspire to build and publish analysis pipelines that transform raw data into manuscript figures. Doing this requires many moving parts: containers, version control systems, workflow management languages, and literate programming tools. While learning each tool on its own can seem daunting, the power of a holistic approach becomes clear once the tools are knitted together. In this workshop you will get experience with the full ensemble by running an end-to-end analysis pipeline for RNA-seq data in the cloud. You will get a hands-on introduction to each component (Singularity, Git, NextFlow, and RMarkdown) and to the resources available to continue your learning. Laptop required. Some experience with Unix preferred.
Schedule (tentative and subject to change):
9:00am-10:00am Optional pre-workshop coffee and help session: If you need help with any of the concepts we asked you to know ahead of time, come to this hour and work with us to get up to speed!
10:00am-10:30am Computational reproducibility - why does it matter, why is it hard, why are we here?: What problem are we trying to solve, and what happens if we don’t properly account for these types of problems? What are some examples of what went wrong in the past?
10:30am-11:00am Cloud Computing Basics: Logging in to cloud instances using SSH. Ensuring you can interact with the shell - a few shell basics will be presented. Moving files from your local machine to the cloud, and back again.
11:00am-12:00pm Git basics for everyone: Go through Git basics specific to this workshop. All Git work will be done on cloud instances. We will be forking and modifying our actual repository for the workshop in this portion!
12:00pm-1:00pm Lunch: Nom nom nom
1:00pm-1:45pm Singularity Discussion: Why containers, why now? What flexibility do containers provide, and how are we going to use them today? What will we NOT cover with regard to Singularity today? Why are we using Singularity instead of Docker?
1:45pm-2:45pm Nextflow Discussion: What are workflow management languages and why do we need them? What do they do and how do we use them? Why is Nextflow an excellent choice for computational biology? We will put together the pieces you learned in the morning to run an actual Nextflow pipeline on your cloud instance to analyze RNA-seq data.
3:00pm-5:00pm Data analysis: What is literate programming and how can we use it to round out our analysis pipelines for full reproducibility? How can your custom analyses be presented to your colleagues and the public? What kind of data integration fun can we have when we know that the data has been processed in a uniform way with standardized pipelines? We will examine the results from the pipeline that you ran and combine everyone’s results to gain a new perspective.