Intro to R for Biologists


The National Center for Genome Analysis Support (NCGAS) provided this course to help participants get started in R, so they’ll be able to read and write code, and figure out where to get help when needed.

The NCGAS was funded by the National Science Foundation under Grant Nos. DBI-1062432 (2011), ABI-1458641 (2015), and ABI-1759906 (2018) to Indiana University.

This workshop covers three major concepts in R:

  • The general syntax of the language, the basic data types, and how to manipulate them.
  • The two plotting paradigms in R (plot style and ggplot style), with GIS and ordination as examples of plotting different kinds of data.
  • How to read and write functions in R.

The course does not focus on any particular analysis, but uses DNA sequences as a case study to apply the material covered. We will also cover how to use Jetstream (the research cloud) to power analyses in RStudio. Use of personal installations on laptops is fine for the workshop; however, we will not troubleshoot individual installations during class.

Objectives

  • Navigate and use RStudio (on and off Jetstream)—load files, export graphs, etc.
  • Understand how to install, load, and use new libraries.
  • Become familiar with the Bioconductor project.
  • Understand basic data types, functions, objects, and classes in R.
  • Write and use a function.
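
As a rough illustration of the last two objectives, here is a minimal sketch in base R. It is not taken from the course materials: the variable names, the example sequence, and the `reverse_complement` helper are all made up for illustration.

```r
# Basic data types: class() reports how R classifies an object.
counts <- c(12, 7, 30)   # numeric vector (made-up values)
dna    <- "ATGCGT"       # character string (made-up sequence)
is_dna <- TRUE           # logical value

class(counts)
class(dna)
class(is_dna)

# Hypothetical helper: reverse complement of a DNA string.
# chartr() swaps each base for its complement, strsplit() breaks the
# string into single characters, and rev() reverses their order.
reverse_complement <- function(seq_string) {
  complemented <- chartr("ACGT", "TGCA", toupper(seq_string))
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

reverse_complement(dna)
```

Typing `?chartr` or `?strsplit` in the RStudio console opens the help page for each function used here.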

Prerequisites

  • Unix familiarity is a plus, but not required.
  • For in-person workshops, a laptop is required. If you do not have one, contact the organizers to borrow one.

Online Course Available via Expand

View Course in Expand

When in-person: Daily Agenda & Requirements

  • Day 1: Introduction
    The goal of this section is to get you acquainted with R, both the environment and the language. We’ll discuss data types, manipulation, the structure of commands, how to get help and more information, how to load packages, and how to use the environment. The hope is that you will use R more intuitively. We will discuss some common errors and troubleshooting during the recitation meeting.

    This section does not focus on any individual analysis or demonstration; rather, it focuses on reading and making sense of the language. This is very helpful for new users or anyone currently copying, pasting, and hoping the command will work.

    Requirements: There are no requirements for this section. Basic Unix skills (how variables work, cat, pwd, etc.) are helpful: we will not be using the command line, but we will reference these concepts throughout.
  • Introduction lab
    A guided activity to practice your skills from day 1. This will give you practice using R and working with sequence data/vectors with a bit more independence. We will answer questions and help troubleshoot the activity during the recitation meeting.

  • Day 2: Introduction to visualization
    We will build on the basic data types and syntax of R to explore visualization of geographic data. The two main families of plotting will be introduced (plot style and ggplot style), with examples of how to plot various types of data on geographical maps. This is a useful skill for ecologists and geneticists alike. During the online recitation meeting, we will further discuss options in graphing, troubleshoot setting up Google Maps, and share some helpful tutorials/cheat sheets for the plotting language in R.

    Requirements: This is a lab based on the material covered in day 1—familiarity with that material will be useful. Day 1 material will be available online.

  • Introduction to visualization lab
    This activity will extend the same plotting syntax to a different kind of data: ordination plots (PCA, PCoA, and nMDS) for exploring your own data. Microbiome, ecological, and population genetics data sets are common examples. We will discuss ordination, when to use different types, and some of the finer points in choosing packages during the recitation meeting.

  • Day 3: Making your own scripts and functions
    The goal of this section is to get a bit more in depth on how to read, understand, and troubleshoot R code by introducing classes and functions. Classes and functions are a large part of R, and therefore a large part of understanding the syntax and function of the language. We will walk through creating your own function for summarizing tables of data (both ecological and genetic data sets are available for use). We will discuss more tips for designing and writing code in R during the recitation.

    Requirements: This material assumes basic usage of R covered in the previous two days, or a moderate familiarity with R basics.

  • Making your own scripts and functions lab
    This activity builds on day 2's lab: you will create a function to graph a sliding-window plot of GC content. The activity is meant as practice in building functions, but this particular example can easily be applied to visualize variation across any continuous data, such as an ecological measure through time or population variation across a genome. We will help answer questions and troubleshoot this activity during the online recitation.
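
For readers who want a feel for the sliding-window idea before the lab, here is a minimal sketch in base R. It is an assumption about how such a function might look, not the lab's actual solution: the function name `gc_sliding_window`, its `window` and `step` parameters, and the example sequence are all hypothetical.

```r
# Hypothetical helper: fraction of G/C bases in each window along a DNA string.
gc_sliding_window <- function(dna, window = 10, step = 5) {
  bases <- strsplit(toupper(dna), "")[[1]]
  if (length(bases) < window) {
    stop("sequence is shorter than the window")
  }
  # Start position of every window that fits entirely inside the sequence
  starts <- seq(1, length(bases) - window + 1, by = step)
  gc <- vapply(starts, function(s) {
    mean(bases[s:(s + window - 1)] %in% c("G", "C"))
  }, numeric(1))
  data.frame(start = starts, gc = gc)
}

# Made-up sequence, for illustration only
dna <- "ATGCGCGCTTAATTAAGGCCGCGCATATATGCGC"
windows <- gc_sliding_window(dna, window = 10, step = 5)

# Base (plot style) graphics: GC fraction against window start position
plot(windows$start, windows$gc, type = "l",
     xlab = "Window start (bp)", ylab = "GC fraction")
```

Because the function returns a plain data frame, the same result could be passed to ggplot-style plotting instead, and the windowing logic carries over to any numeric vector by replacing the G/C test with another summary.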

Go to the archive of talks from the workshop

Intro to R for Biologists (on YouTube)

Go to the archive of print materials from the workshop

Intro to R for Biologists (.pdf file on Google Drive)

Intro to R for Biologists (.zip file on Google Drive)

The Supercomputing for Everyone Series (SC4ES) aims to bring more users into the realm of advanced computing, whether it be visualization, computation, analytics, storage, or any related discipline. Research Technologies can take you to the next level of computing.

Supercomputing for Everyone Series workshops and seminars are led by personnel from Research Technologies, a division of University Information Technology Services and a center in the Pervasive Technology Institute at Indiana University.