Solutions: Data Types

Materials adapted from Adrien Osakwe, Larisa M. Soto and Xiaoqi Xie.

1. Atomic Classes

Write a piece of code that stores a number in a variable and then checks whether it is greater than 5. Try to use comments!
Bonus: Is there a way to store the result after checking the number?

# Store a number in a variable
x <- 10
# Check whether the number is greater than 5
x > 5

[1] TRUE

# Bonus: store the result of the comparison in a new variable
y <- x > 5

print(y)

[1] TRUE
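
As a quick check on the atomic classes involved, the sketch below inspects both variables with typeof(). It only reuses the values from the solution above.

```r
# Store a number and the result of a comparison
x <- 10
y <- x > 5

# typeof() reports the atomic type of each value
typeof(x)  # "double": plain numbers are doubles unless written with an L suffix (10L)
typeof(y)  # "logical": the result of a comparison
```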

2. Vectors

Make a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z (hint: there is a built in vector called LETTERS).

x <- 1:26
x <- x * 2
names(x) <- LETTERS
print(x)

 A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z 
 2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52
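
Once a vector has names, its elements can be looked up by name as well as by position. A minimal sketch, rebuilding the same vector as above:

```r
# Rebuild the named vector from the solution
x <- 1:26 * 2
names(x) <- LETTERS

x["C"]          # the element named "C"
x[c("A", "Z")]  # several elements at once, by name
x[3]            # the same element as x["C"], by position
```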

3. Matrices

Make a matrix with the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behavior? Once you have figured it out, try to change the default. (hint: read the documentation for matrix)

# By default the matrix is filled by column. We can change this behavior with byrow = TRUE.
m <- matrix(1:50, ncol = 5, nrow = 10, byrow = TRUE)
print(m)

      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    3    4    5
 [2,]    6    7    8    9   10
 [3,]   11   12   13   14   15
 [4,]   16   17   18   19   20
 [5,]   21   22   23   24   25
 [6,]   26   27   28   29   30
 [7,]   31   32   33   34   35
 [8,]   36   37   38   39   40
 [9,]   41   42   43   44   45
[10,]   46   47   48   49   50

Bonus: Which of the following commands was used to generate the matrix below?

     [,1] [,2]
[1,]    4    1
[2,]    9    5
[3,]   10    7

matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)

# correct
matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    4    1
## [2,]    9    5
## [3,]   10    7

# others
matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
##      [,1] [,2]
## [1,]    4    5
## [2,]    1   10
## [3,]    9    7

matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    4    9
## [2,]   10    1
## [3,]    5    7

matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
##      [,1] [,2] [,3]
## [1,]    4   10    5
## [2,]    9    1    7

Note

The byrow Argument

The matrix() function works like a worker filling a grid of boxes. The byrow argument tells that worker whether to walk across the rows or down the columns.

byrow = FALSE (Default): The worker fills the first column from top to bottom, then moves to the second column. This is “Column-major order.”
byrow = TRUE: The worker fills the first row from left to right, then moves to the second row. This is “Row-major order.”
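
The two filling orders can be compared side by side on a small example. This is a minimal sketch; the variable names (values, by_col, by_row) are made up for illustration.

```r
values <- 1:6

# Default (byrow = FALSE): values run down each column
by_col <- matrix(values, nrow = 2)
# byrow = TRUE: values run across each row
by_row <- matrix(values, nrow = 2, byrow = TRUE)

by_col[1, ]  # first row is 1 3 5
by_row[1, ]  # first row is 1 2 3

# Filling a 3 x 2 matrix by column and transposing gives the row-filled 2 x 3 matrix
identical(by_row, t(matrix(values, nrow = 3)))
```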

4. Lists

Create a list of length two containing a character vector for each of the data sections: (1) Data types and (2) Data structures. Populate each character vector with the names of the data types and data structures, respectively.

dt <- c('double', 'complex', 'integer', 'character', 'logical')
ds <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
data.sections <- list(dt, ds)
print(data.sections)

[[1]]
[1] "double"    "complex"   "integer"   "character" "logical"  

[[2]]
[1] "data.frame" "vector"     "factor"     "list"       "matrix"
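
The difference between single and double brackets already matters for plain lists. A short sketch using the same list as above:

```r
dt <- c('double', 'complex', 'integer', 'character', 'logical')
ds <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
data.sections <- list(dt, ds)

length(data.sections)   # 2: one element per data section
data.sections[1]        # single bracket: still a list, of length one
data.sections[[1]]      # double bracket: the character vector itself
data.sections[[2]][3]   # third entry of the second vector
```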

5. Data frames

There are several subtly different ways to call variables, observations and elements from data frames. Try them all and discuss with your team what they return. (Hint: use the function typeof().)

iris[1]
iris[[1]]
iris$Species
iris["Species"]
iris[1,1]
iris[,1]
iris[1,]

# The single bracket [1] returns the first slice of the list, as another list. In this case it is a data frame containing only the first column.
head(iris[1])
##   Sepal.Length
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5.0
## 6          5.4

# The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column (Sepal.Length), a numeric vector of type double.
head(iris[[1]])
## [1] 5.1 4.9 4.7 4.6 5.0 5.4

# This example uses the $ operator to address items by name. Species is a factor: class() returns "factor", while typeof() returns "integer" because the levels are stored as integer codes.
head(iris$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

# A single bracket with the column name, ["Species"], instead of the index number also returns a one-column data frame, as in the first example.
head(iris["Species"])
##   Species
## 1  setosa
## 2  setosa
## 3  setosa
## 4  setosa
## 5  setosa
## 6  setosa

# Element in the first row and first column. The returned element is a double (numeric), not an integer.
iris[1,1]
## [1] 5.1

# First column. Returns a numeric vector (type double).
iris[,1]
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9

# First row. Returns a one-row data frame (typeof() is "list") with all the values in the first row.
iris[1,]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
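
Following the hint, the sketch below applies typeof() and class() to each form of subsetting. It only uses the built-in iris data set.

```r
typeof(iris[1])        # "list": a data frame is a list of columns
class(iris[1])         # "data.frame": single bracket keeps the data frame
typeof(iris[[1]])      # "double": the Sepal.Length values themselves
typeof(iris$Species)   # "integer": factors are stored as integer codes
class(iris$Species)    # "factor"
typeof(iris[1, 1])     # "double": a single numeric value
typeof(iris[, 1])      # "double": the first column as a vector
class(iris[1, ])       # "data.frame": a row keeps columns of mixed types
```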

6. Coercion

Take the list you created in exercise 4 and coerce it into a data frame. Then change the names of the columns to “dataTypes” and “dataStructures”.

df <- as.data.frame(data.sections)
colnames(df) <- c("dataTypes", "dataStructures")
print(df)

  dataTypes dataStructures
1    double     data.frame
2   complex         vector
3   integer         factor
4 character           list
5   logical         matrix
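
An alternative sketch, not part of the original solution: if the list elements are named before coercion, as.data.frame() uses those names as column names directly. This works here because both vectors have the same length (five entries each).

```r
dt <- c('double', 'complex', 'integer', 'character', 'logical')
ds <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
data.sections <- list(dt, ds)

# Name the list elements first; the names become the column names
names(data.sections) <- c("dataTypes", "dataStructures")
df <- as.data.frame(data.sections)

colnames(df)  # "dataTypes" "dataStructures"
dim(df)       # 5 rows, 2 columns
```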

Note

Common ways to change column names

colnames()

If you want to rename all the columns at once, this is the fastest method. You simply provide a vector of names that matches the number of columns.

# Create a dummy dataframe
df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

# Rename all columns
colnames(df) <- c("ID", "Treatment", "Response")

Indexing

If you only want to change one specific column, you can use its index (position). This is great for small tables but risky for large ones if the column order changes.

# Change only the 2nd column
colnames(df)[2] <- "Condition"
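
To reduce the risk from a changing column order, the position can be looked up by the current name instead of being hard-coded. A minimal sketch on a fresh copy of the dummy data frame:

```r
df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

# Select the column to rename by its current name rather than its position
colnames(df)[colnames(df) == "V2"] <- "Condition"

colnames(df)  # "V1" "Condition" "V3"
```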

dplyr::rename()

Many R users prefer this method because it is readable and does not depend on column position. You don’t need to know the index of the column, and you can pipe it into your analysis.

Syntax: new_name = old_name

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

df <- df %>% 
  rename(Patient_ID = ID, 
         Dosage = Response)

Note

The “Backtick” Trick: Column Names with Spaces

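When a data frame is created or read in with the default settings (for example data.frame() or read.csv() with check.names = TRUE), R replaces spaces in column names with a dot (.) because spaces can break code. However, you can still assign names that contain spaces and refer to them using Backticks (`).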

Important Distinction: Notice the difference between Single Quotes (') and Backticks (`).

Quotes (' '): Tell R that something is Text.

Backticks (` `): Tell R that something is a Name that contains “illegal” characters (like spaces or starting with a number).

# Names with spaces are assigned as ordinary quoted strings.
# Only the first two of the three columns are renamed here.
colnames(df)[1:2] <- c("Data Types", "Data Structures")
print(df)

  Data Types Data Structures Dosage
1          1               4      7
2          2               5      8
3          3               6      9

# To call this column later with $, you MUST use backticks:
df$`Data Types`

[1] 1 2 3
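
The automatic replacement of spaces can be seen, and switched off, with the check.names argument of data.frame(). A small sketch; the object names fixed and kept are made up for illustration.

```r
# Default (check.names = TRUE): the space is replaced with a dot
fixed <- data.frame(`Data Types` = 1:3)
names(fixed)  # "Data.Types"

# check.names = FALSE keeps the name exactly as written
kept <- data.frame(`Data Types` = 1:3, check.names = FALSE)
names(kept)   # "Data Types"

kept$`Data Types`     # backticks are needed with $
kept[["Data Types"]]  # a quoted string works with [[ ]]
```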

Why we avoid spaces in Bioinformatics

While R can handle spaces, they are generally discouraged in professional pipelines for several reasons:

Tab Completion: If you type df$d... and hit Tab, RStudio can instantly find data_types. If there is a space, you have to manually type the backticks every single time.
Compatibility: If you export your data to a colleague using Python or a command-line tool like awk, spaces in column names can cause their scripts to crash.
The “Snake Case” Standard: Most researchers prefer snake_case (e.g., gene_id) or camelCase (e.g., geneId).
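
Existing names can be converted to snake_case with base R string functions. The helper below, to_snake_case(), is a hypothetical name written for this illustration and not a built-in function; it is a minimal sketch that handles spaces, dots and simple camelCase only.

```r
# Hypothetical helper: convert column names to snake_case
to_snake_case <- function(x) {
  # Insert an underscore at lower-to-upper case boundaries (dataTypes -> data_Types)
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  # Replace runs of spaces, dots and other non-alphanumeric characters with one underscore
  x <- gsub("[^A-Za-z0-9]+", "_", x)
  tolower(x)
}

to_snake_case(c("Data Types", "dataStructures", "Sepal.Length"))
# "data_types" "data_structures" "sepal_length"
```

It could be applied to a whole data frame with colnames(df) <- to_snake_case(colnames(df)).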