Data Import and Transformation

Published

July 10, 2025

Modified

August 4, 2025

In [1]:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import

Explicitly define variables and types

In [2]:
cols <- cols_only(
  Area = col_double(), Perim. = col_double(), Width = col_double(),
  Height = col_double(), Major = col_double(), Minor = col_double(),
  Circ. = col_double(), Feret = col_double(), MinFeret = col_double(),
  AR = col_double(), Round = col_double(), Solidity = col_double(),
  L_BlackBackground = col_double(), a_BlackBackground = col_double(),
  b_BlackBackground = col_double(), L_GreenBackground = col_double(),
  a_GreenBackground = col_double(), b_GreenBackground = col_double(),
  Color_sd = col_double(), ColorNature = col_factor(),
  Color_Name = col_factor(), Color_Category = col_factor(),
  Group = col_factor(), RF_use = col_logical(),
  File_Name = col_character(), Particle_Num = col_character()
)
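
As a minimal sketch of what this specification does, the toy example below reads a small inline CSV with `cols_only()`. The inline data and the `Notes` column are hypothetical and only serve to show that columns missing from the specification are skipped and that the listed columns get the declared types.

```r
library(readr)

# Hypothetical two-row export; "Notes" is not listed in the specification.
csv_text <- I("Area,Group,RF_use,Notes\n1.5,Foam,TRUE,keep\n2.0,Pellet,FALSE,drop\n")

spec <- cols_only(
  Area = col_double(),
  Group = col_factor(),
  RF_use = col_logical()
)

toy <- read_csv(csv_text, col_types = spec)

names(toy)          # "Notes" is absent: cols_only() skips unlisted columns
sapply(toy, class)  # each remaining column has the declared type
```

Declaring the types up front means every file is parsed the same way, instead of relying on per-file type guessing.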

Load all data sets and combine them

In [3]:
data <- list.files("data", pattern = "\\.csv$", full.names = TRUE) |> 
  map(\(path) read_csv(path, col_types = cols)) |>
  list_rbind()
In [4]:
data <- data |> janitor::clean_names()

All variable names are now in snake_case.
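
To illustrate the renaming, the sketch below applies `janitor::make_clean_names()`, the function `clean_names()` uses on column names, to four of the original headers. Only headers whose cleaned form appears later in this notebook are used.

```r
library(janitor)

# A few of the original column headers from the import specification
original <- c("ColorNature", "Group", "File_Name", "Particle_Num")

# Convert each header to snake_case
cleaned <- make_clean_names(original)

# Show the mapping from original name to cleaned name
setNames(cleaned, original)
```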

Transformations

Duplicated items

In [5]:
data |> count(file_name, particle_num, sort = TRUE)
# A tibble: 144,002 × 3
   file_name                particle_num     n
   <chr>                    <chr>        <int>
 1 SFS220112_VNG_T1_1_25000 1                2
 2 SFS220112_VNG_T1_1_25000 3                2
 3 GOMEX240405_T1_S_1_315   1                1
 4 GOMEX240405_T1_S_1_315   2                1
 5 GOMEX240405_T1_S_1_315   3                1
 6 GOMEX240405_T1_S_1_315   4                1
 7 GOMEX240405_T1_S_1_315   5                1
 8 GOMEX240405_T2_S_1_315   1                1
 9 GOMEX240405_T2_S_1_315   10               1
10 GOMEX240405_T2_S_1_315   11               1
# ℹ 143,992 more rows

Two particles from SFS220112_VNG_T1_1_25000 (particle_num 1 and 3) each appear twice.

In [6]:
data <- data |> distinct(file_name, particle_num, .keep_all = TRUE)

distinct() keeps the first row for each file_name and particle_num pair, so the duplicate rows are dropped from the database.
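
Before dropping rows, it can help to look at the duplicated records themselves. The sketch below does this on a small hypothetical tibble (the sample names and the `area` values are made up), then applies the same `distinct()` call used above.

```r
library(dplyr)

# Hypothetical data: particle "1" of sample_A is recorded twice
toy <- tibble(
  file_name = c("sample_A", "sample_A", "sample_A", "sample_B"),
  particle_num = c("1", "1", "2", "1"),
  area = c(10, 10, 12, 9)
)

# Rows whose (file_name, particle_num) key occurs more than once
dupes <- toy |> filter(n() > 1, .by = c(file_name, particle_num))

# Keep only the first row for each key, as in the notebook
deduped <- toy |> distinct(file_name, particle_num, .keep_all = TRUE)

dupes
deduped
```

Inspecting `dupes` first shows whether the repeated rows are exact copies or carry conflicting measurements, which matters because `distinct()` silently keeps whichever row comes first.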

Unique identifiers

In [7]:
data <- data |> mutate(
  id = str_c(file_name, "_", particle_num), .keep = "unused", .before = 1
)
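
The following sketch shows the effect of this step on a hypothetical three-row tibble: the two source columns are consumed by `.keep = "unused"`, the new `id` column is placed first, and a quick check confirms that the identifiers are unique.

```r
library(dplyr)
library(stringr)

# Hypothetical data with one row per particle
toy <- tibble(
  file_name = c("sample_A", "sample_A", "sample_B"),
  particle_num = c("1", "2", "1"),
  area = c(10, 12, 9)
)

with_id <- toy |> mutate(
  id = str_c(file_name, "_", particle_num), .keep = "unused", .before = 1
)

names(with_id)                    # file_name and particle_num are gone
anyDuplicated(with_id$id) == 0    # TRUE when every id is unique
```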

Factor variables

In [8]:
data <- data |> mutate(
  color_nature = color_nature |> fct_recode(
    opaque = "Opaque",
    translucent = "TT"
  ),
  
  group = group |> fct_collapse(
    turf = c("Artificial turf", "Artificial.turf"),
    filament = c("Filament"),
    film = c("Film.Sheet", "Film/Sheet"),
    foam = c("Foam"),
    fragment = c("Fragment"),
    pellet = c("Pellet"),
    spherule = c("Spherule.Microbead", "Spherule/Microbead"),
    other_level = "other"
  )
)
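
As a small illustration of the two recoding functions, the sketch below applies them to toy factors. The level "Fiber bundle" is hypothetical and stands in for any level not named in the call; with `other_level`, such levels are lumped into "other".

```r
library(forcats)

# fct_recode() renames levels one by one, using new = "old"
nature <- factor(c("Opaque", "TT"))
nature_recoded <- fct_recode(nature, opaque = "Opaque", translucent = "TT")

# fct_collapse() merges several spellings into one level;
# anything not listed falls into other_level
group <- factor(c("Film.Sheet", "Film/Sheet", "Foam", "Fiber bundle"))
group_collapsed <- fct_collapse(
  group,
  film = c("Film.Sheet", "Film/Sheet"),
  foam = "Foam",
  other_level = "other"
)

as.character(nature_recoded)
as.character(group_collapsed)
```

Collapsing both spellings of each group matters here because the source files apparently differ in how they write the separator ("." versus "/" or a space).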

Writing database

In [9]:
data |> write_rds("data/data.rds")
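
One reason to store the result as RDS rather than CSV is that column types, including factor levels, survive the round trip. The sketch below checks this on a hypothetical two-row tibble written to a temporary file, so it does not touch `data/data.rds`.

```r
library(readr)
library(tibble)

# Hypothetical data with a factor column
toy <- tibble(
  id = c("sample_A_1", "sample_B_1"),
  group = factor(c("foam", "pellet"))
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
restored <- read_rds(path)

class(restored$group)   # still a factor after reloading
levels(restored$group)
```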