Exploring Data Frames

Materials adapted from Adrien Osakwe, Larisa M. Soto and Xiaoqi Xie.

1. Adding Columns and Rows

We will still use the gapminder dataset as demo dataste.

# Dataset Preparation
if (!require("gapminder", quietly = TRUE))
    install.packages("gapminder")
library(gapminder)

aust <- gapminder[gapminder$country == "Australia",]
head(aust)
## # A tibble: 6 × 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1952    69.1  8691212    10040.
## 2 Australia Oceania    1957    70.3  9712569    10950.
## 3 Australia Oceania    1962    70.9 10794968    12217.
## 4 Australia Oceania    1967    71.1 11872264    14526.
## 5 Australia Oceania    1972    71.9 13177000    16789.
## 6 Australia Oceania    1977    73.5 14074100    18334.
dim(aust)
## [1] 12  6

1.1 Individually Adding Columns

You can add a new column by simply assigning values to a name that doesn’t exist yet using the $ operator or bracket [ , ] notation.

Important

Crucial Rule: The new column must have the exact same number of rows as the existing data frame.

Method A: Using the $ operator

# Method A: Using the $ operator
# Adding a random column for "mean_children"
mean_children <- sample(1:10, nrow(aust), replace = TRUE)
aust$mean_children <- mean_children
# the same as: aust$mean_children <- sample(1:10, nrow(aust), replace = TRUE)

# Adding a calculated column (GDP = population * GDP per capita)
aust$GDP <- aust$pop * aust$gdpPercap

head(aust)
# A tibble: 6 × 8
  country   continent  year lifeExp      pop gdpPercap mean_children         GDP
  <fct>     <fct>     <int>   <dbl>    <int>     <dbl>         <int>       <dbl>
1 Australia Oceania    1952    69.1  8691212    10040.             3     8.73e10
2 Australia Oceania    1957    70.3  9712569    10950.             8     1.06e11
3 Australia Oceania    1962    70.9 10794968    12217.             8     1.32e11
4 Australia Oceania    1967    71.1 11872264    14526.            10     1.72e11
5 Australia Oceania    1972    71.9 13177000    16789.             8     2.21e11
6 Australia Oceania    1977    73.5 14074100    18334.            10     2.58e11
Note

Understanding sample():

  • 1:10: The range of numbers to choose from.

  • nrow(aust): Tells R exactly how many numbers to generate (12 in this case).

  • replace = TRUE: Allows the same number to be picked more than once.

Method B: Using bracket notation

# Method B: Using bracket notation
mean_bikes <- sample(1:4, nrow(aust), replace = TRUE)
aust[, "mean_bikes"] <- mean_bikes

str(aust)
tibble [12 × 9] (S3: tbl_df/tbl/data.frame)
 $ country      : Factor w/ 142 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ continent    : Factor w/ 5 levels "Africa","Americas",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ year         : int [1:12] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp      : num [1:12] 69.1 70.3 70.9 71.1 71.9 ...
 $ pop          : int [1:12] 8691212 9712569 10794968 11872264 13177000 14074100 15184200 16257249 17481977 18565243 ...
 $ gdpPercap    : num [1:12] 10040 10950 12217 14526 16789 ...
 $ mean_children: int [1:12] 3 8 8 10 8 10 8 7 5 2 ...
 $ GDP          : num [1:12] 8.73e+10 1.06e+11 1.32e+11 1.72e+11 2.21e+11 ...
 $ mean_bikes   : int [1:12] 1 1 4 2 1 4 3 3 4 4 ...
Note

What happens if the lengths don’t match?

If you try to “force” a vector into a table that is a different height, R will protect the integrity of your data and throw an error.

mean_bikes <- sample(1:4, 5, replace = TRUE)
aust[, "mean_bikes"] <- mean_bikes
Error in `[<-`:
! Assigned data `mean_bikes` must be compatible with existing data.
✖ Existing data has 12 rows.
✖ Assigned data has 5 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 5 to size 12.

Why did this fail? R uses a concept called Recycling.

  1. If you provide 1 value: R repeats it for every row (this is allowed).

  2. If you provide 12 values: It’s a perfect fit (this is allowed).

  3. If you provide 5 values: R doesn’t know what to do with the remaining 7 rows, so it stops the code to prevent errors in your analysis.

Best Practice: Use nrow()

Always use nrow(your_dataframe) instead of typing a specific number (like 12). If you later filter your data or add more patients, your code will automatically adjust the length of your new columns!

1.2 Combining Data Frames by Columns

Method A: Using the cbind()

If you have a whole table of new data, you can “staple” it to the side of your original data using cbind() (column bind). Unlike adding columns one by one, cbind() allows you to merge two entire data frames into one.

aust <- gapminder[gapminder$country == "Australia", ]
# Create a separate data frame with two new variables
df <- data.frame(mean_children = sample(1:10, nrow(aust), replace = TRUE),
               mean_bikes = sample(1:4, nrow(aust), replace = TRUE))
head(df)
##   mean_children mean_bikes
## 1             2          2
## 2             8          1
## 3            10          3
## 4             3          3
## 5             6          4
## 6             9          4

aust <- cbind(aust,df)
head(aust)
##     country continent year lifeExp      pop gdpPercap mean_children mean_bikes
## 1 Australia   Oceania 1952   69.12  8691212  10039.60             2          2
## 2 Australia   Oceania 1957   70.33  9712569  10949.65             8          1
## 3 Australia   Oceania 1962   70.93 10794968  12217.23            10          3
## 4 Australia   Oceania 1967   71.10 11872264  14526.12             3          3
## 5 Australia   Oceania 1972   71.93 13177000  16788.63             6          4
## 6 Australia   Oceania 1977   73.49 14074100  18334.20             9          4
Warning

⚠️ A Critical Warning: The Row Order

While cbind() is powerful, it is also “blind.”

It assumes that Row 1 of your first table belongs with Row 1 of your second table.

In medical research, this is very dangerous if your tables aren’t sorted perfectly. If your first table has patients in the order P001, P002, P003 and your second table has them as P003, P001, P002, cbind() will mismatch the data!

Best Practice: Use merge() or left_join()

If you aren’t 100% sure the rows are in the same order, don’t use cbind(). Instead, use merge() (Base R) or left_join()(dplyr). These functions look at an ID column (like a patient ID) and automatically align the rows for you.

Method B: Using the merge()

To avoid mismatching data, we use merge(). This function looks at a specific Key Column (like Patient_ID or country) and aligns the rows automatically, even if they are in a different order.

# 1. Prepare a subset of data
countries <- gapminder[gapminder$country %in% c("Taiwan", "Canada"), ]

# 2. Create a secondary table with a "Key Column" (country)
# Note: This table could be in any order!
df_info <- data.frame(
  country = c("Canada", "Taiwan"),
  avg_coffee_price = c(4.50, 3.20)
)

# 3. Merge based on the "country" column
# R will find where "Taiwan" is in both tables and match them up
countries_merged <- merge(countries, df_info, by = "country")

head(countries_merged)
  country continent year lifeExp      pop gdpPercap avg_coffee_price
1  Canada  Americas 1952   68.75 14785584  11367.16              4.5
2  Canada  Americas 1957   69.96 17010154  12489.95              4.5
3  Canada  Americas 1962   71.30 18985849  13462.49              4.5
4  Canada  Americas 1967   72.13 20819767  16076.59              4.5
5  Canada  Americas 1972   72.88 22284500  18970.57              4.5
6  Canada  Americas 1977   74.21 23796400  22090.88              4.5
# 1. Prepare a subset of data
patient_info <- data.frame(
  patientID = c("P1", "P2", "P3", "P100"),
  BW = c(50, 62, 73, 85)
)

# 2. Create a secondary table
patient_BG <- data.frame(
  patientID = c("P1", "P100", "P3", "P2"),
  BG = c(7.1, 8.3, 9.5, 11)
)

# 3. Merge based on the "country" column
# R will find where "Taiwan" is in both tables and match them up
data_merged <- merge(patient_info, patient_BG, by = "patientID")

head(data_merged)
  patientID BW   BG
1        P1 50  7.1
2      P100 85  8.3
3        P2 62 11.0
4        P3 73  9.5
Note

💡 Why merge() is better than cbind()

  • Order Doesn’t Matter: In the example above, df_info lists Canada first, but in the countries dataset, the rows might appear in a different order. merge() finds the match regardless of where the row is located.

  • Flexible Lengths: Notice that countries has many years of data for each country, but df_info only has one price per country. merge() is smart enough to “broadcast” (repeat) that one coffee price across all the matching years for that country.

  • Pattern Matching: Instead of “mechanically” pasting columns, merge() looks for a specific pattern (the country name). If a country in your first table isn’t found in your second table, R can be told exactly how to handle that “missing” data.

1.3 Adding Rows (rbind)

Adding rows is slightly more complex because a row usually contains mixed data types (text, numbers, etc.).

Note

Why use a list for a new row? A data frame is a list of vectors. A single row crosses those vectors, so it must be able to hold different types of data at once. A list is the perfect container for this.

aust <- gapminder[gapminder$country == "Australia",]

# Create a new row as a list
new_row <- list(
  "country" = "Australia",
  "continent" = "Oceania",
  "year" = 2022,
  "lifeExp" = mean(aust$lifeExp),
  "pop" = mean(aust$pop),
  "gdpPercap" = mean(aust$gdpPercap)
)

# Append the row to the bottom
aust <- rbind(aust, new_row)

tail(aust)
## # A tibble: 6 × 6
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <dbl>   <dbl>     <dbl>     <dbl>
## 1 Australia Oceania    1987    76.3 16257249     21889.
## 2 Australia Oceania    1992    77.6 17481977     23425.
## 3 Australia Oceania    1997    78.8 18565243     26998.
## 4 Australia Oceania    2002    80.4 19546792     30688.
## 5 Australia Oceania    2007    81.2 20434176     34435.
## 6 Australia Oceania    2022    74.7 14649312.    19981.

2. Removing and Filtering Data

2.1 Removing Columns and Rows

We use negative indexing to drop data.

Method A: Removing by Index (Position)

If you know exactly where a column is located, you can drop it using its number.

aust <- gapminder[gapminder$country == "Australia",]
dim(aust)
## [1] 12  6
colnames(aust)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# 1. Remove the LAST column dynamically
# ncol(aust) finds the last number for you, so you don't have to count!
aust <- aust[, -ncol(aust)]
colnames(aust)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"

# 2. Remove multiple specific columns (e.g., the 1st and 3rd)
aust <- aust[, -c(1, 3)]
colnames(aust)
## [1] "continent" "lifeExp"   "pop"

dim(aust)
## [1] 12  3

Method B: Removing by Name (The Safer Way)

Using numbers like -3 is risky because if your dataset changes, the 3rd column might become something else! It is safer to remove columns by their actual Name.

You ask R to keep every column except the one using the “Not Equal To” operator (!=).

# Keep everything where the column name is NOT "pop"
colnames(aust)
## [1] "continent" "lifeExp"   "pop"
aust <- aust[, colnames(aust) != "pop"]

colnames(aust)
## [1] "continent" "lifeExp"

Method C: Removing Rows

The logic for rows is the same: use a negative sign before the row indices.

# Remove the first 10 rows
# Both of these are correct; the first is the most professional standard
aust_reduced <- aust[-c(1:10), ] 
## aust_reduced <- aust[-(1:10), ]

dim(aust_reduced)
[1] 2 2

2.2 Applying Filters (Logical Subsetting)

Filtering (or subsetting) is how you “query” your data to find specific samples or variables. In R, we use Logical Operators (like >=, ==, or %in%) to tell R exactly what to keep.

Method A: Filtering Columns by Name

If you have a table with 20,000 genes but only want to see two specific ones, you can use the %in% operator.

aust <- gapminder[gapminder$country == "Australia",]

# Keep only the columns "year" and "pop"
# which() identifies the position (index) of the names that match our list
aust_col <- aust[ , which(colnames(aust) %in% c("year", "pop"))]

colnames(aust_col)
## [1] "year" "pop"
Note

Why use which()?

You might notice that colnames(aust) %in% c("year", "pop") returns a list of TRUE and FALSE. Wrapping it in which() converts those into actual numbers (the column positions), which is often safer and faster when dealing with very large datasets.

colnames(aust)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

colnames(aust) %in% c("year", "pop")
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

which(colnames(aust) %in% c("year", "pop"))
## [1] 3 5

Method B: Filtering Rows by Condition

This is where you ask R to find samples that meet a specific clinical or biological threshold.

aust <- gapminder[gapminder$country == "Australia",]

# 1. Filter by a fixed threshold
# Keep only rows where Life Expectancy is 70 or higher
long_life <- aust[aust$lifeExp >= 70, ]
dim(long_life)
## [1] 11  6

# 2. Filter by a calculated threshold
# Keep only rows where GDP is above average
high_gdp <- aust[aust$gdpPercap >= mean(aust$gdpPercap), ]
dim(high_gdp)
## [1] 5 6
Note

💡 Understanding the “Comma” Rule

The most common mistake for beginners is forgetting the comma inside the brackets: [rows, columns].

  • aust[aust$lifeExp >= 70 ]Error! R doesn’t know if this is a row or column instruction.

  • aust[aust$lifeExp >= 70 , ]Success! The comma tells R: “Filter the rows based on this math, but keep all columns.”

3. Cleaning Your Data

Raw datasets often contain “noise”—such as duplicate entries from technical replicates or empty rows created during data entry.

3.1 Handling Duplicates

Duplicates can artificially inflate your sample size and skew your statistics. The unique() function scans your data frame and keeps only the first instance of any identical rows.

aust <- gapminder[gapminder$country == "Australia",]

# Create a dataset with duplicates (binding the data to itself)
aust_double <- rbind(aust, aust)
dim(aust_double) # Should show 24 rows
## [1] 24  6

# Remove duplicate rows
aust_unique <- unique(aust_double)
dim(aust_unique) # Back to 12 rows!
## [1] 12  6

3.2 Handling Missing Values (NA)

In R, missing data is represented by NA (Not Available). Most statistical functions (like mean() or sd()) will fail or return NA if even one value is missing, so clearing them is essential.

A. Identifying NAs: is.na()

You cannot use == NA to find missing data because NA is not a value; it is a placeholder. Instead, we use the logical function is.na().

B. Filtering Out Empty Rows

We use the “Logical NOT” operator (!) to tell R: “Keep only the rows where the column is NOT NA.”

# 1. Let's create an empty row for practice
# rep(NA, ncol(aust)) creates a row of NAs the same width as our table
na_row <- rep(NA, ncol(aust))
aust <- rbind(aust, na_row)
tail(aust) # You will see the last row is all NA
## # A tibble: 6 × 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1987    76.3 16257249    21889.
## 2 Australia Oceania    1992    77.6 17481977    23425.
## 3 Australia Oceania    1997    78.8 18565243    26998.
## 4 Australia Oceania    2002    80.4 19546792    30688.
## 5 Australia Oceania    2007    81.2 20434176    34435.
## 6 <NA>      <NA>         NA    NA         NA       NA

# 2. Remove rows where 'country' is NA
# We use !is.na() to keep the "Not NA" data
aust <- aust[!is.na(aust$country), ]

tail(aust) # The empty row is gone!
## # A tibble: 6 × 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1982    74.7 15184200    19477.
## 2 Australia Oceania    1987    76.3 16257249    21889.
## 3 Australia Oceania    1992    77.6 17481977    23425.
## 4 Australia Oceania    1997    78.8 18565243    26998.
## 5 Australia Oceania    2002    80.4 19546792    30688.
## 6 Australia Oceania    2007    81.2 20434176    34435.
Note

Remove any row that has even one NA anywhere in it: na.omit

If you want to remove any row that has even one NA anywhere in it, you can use the “nuclear option”:

aust_clean <- na.omit(aust)

Warning: Be careful with na.omit(). If a patient is missing just one minor piece of information (like a middle name), na.omit() will delete their entire record, including their vital clinical data! It is usually safer to filter by specific, essential columns using the !is.na() method.

3.3 Editing Specific Elements

If you find a typo in your data, you can target a single “cell” using [row, col].

# Add 1 year to the Life Expectancy of the first row
aust[1, "lifeExp"] <- aust[1, "lifeExp"] + 1
aust[1, ]
## # A tibble: 1 × 6
##   country   continent  year lifeExp     pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Australia Oceania    1952    70.1 8691212    10040.

# Specify the Life Expectancy of specific cell
aust[1, "lifeExp"] <- 50
aust[1, ]
## # A tibble: 1 × 6
##   country   continent  year lifeExp     pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Australia Oceania    1952      50 8691212    10040.