Random selection with ransampler

Scenario 1: simple selection
Scenario 2: more groups
Scenario 3: No-shares
Scenario 4: More than one individual per type
Scenario 5: Not enough individuals!
- Strategy 1: the use_duplis option
- Strategy 2: The best of multiple runs
Prioritizing

Sometimes we need to randomly pick individuals from a study population, for example for sampling, measurements, or other activities. In the simplest cases, this just means haphazardly picking individuals on-site, but sometimes we need more control over the process. The R package ransampler can help in some of these cases. It can randomly sample individuals from multiple groups, prioritize some individuals over others, and enforce “no-share” conditions, such as avoiding picking two individuals of the same type within some group (for example, two individuals of the same family from the same study group).

In this quick demo, I’ll show some examples of how to use ransampler. First, install ransampler from GitHub (this requires the devtools package):

devtools::install_github("eiriksen/ransampler")

# see also the function's documentation:
?ransampler

The ransampler package comes with two example datasets, “table_salmon” and “table_salmon_small”. Each dataset describes a study population of fish, where each row is one individual and the columns denote various traits and experimental groups (ID, weight, tank, vgll3 genotype, mother ID, father ID, etc.).

library(ransampler)
library(tidyverse)
library(glue)
library(magrittr)
head(table_salmon)

## # A tibble: 6 × 11
##   ID        tank  pit     weight temp  geno_…¹ sex   popul…² ID_ma ID_pa ID_fa…³
##   <chr>     <chr> <chr>    <dbl> <chr> <chr>   <chr> <chr>   <chr> <chr> <chr>  
## 1 Offsp1377 Tank1 A00000…   987  hot   EE      F     NEV     Pare… Pare… NEV 54 
## 2 Offsp1384 Tank1 A00000…   240. hot   EE      M     NEV     Pare… Pare… NEV 2  
## 3 Offsp1262 Tank1 A00000…   966. hot   EL      F     NEV     Pare… Pare… NEV 4  
## 4 Offsp1333 Tank1 A00000…  1554. hot   LE      M     NEV     Pare… Pare… NEV 55 
## 5 Offsp1378 Tank1 A00000…   212. hot   EE      M     NEV     Pare… Pare… NEV 22 
## 6 Offsp0938 Tank1 A00000…   996. hot   EE      F     NEV     Pare… Pare… NEV 39 
## # … with abbreviated variable names ¹geno_vgll3, ²population, ³ID_family

Scenario 1: simple selection

Sampling one individual randomly from each tank:

ransampler(
  table = table_salmon,
  ofeach = "tank"
)

## Searching using the following table of combinations:

## # A tibble: 3 × 3
##   tank  n_ofeach n_options
##   <chr>    <dbl>     <int>
## 1 Tank1        1       136
## 2 Tank2        1       134
## 3 Tank3        1       142

## # A tibble: 3 × 15
##   ID        tank  pit     weight temp  geno_…¹ sex   popul…² ID_ma ID_pa ID_fa…³
##   <chr>     <chr> <chr>    <dbl> <chr> <chr>   <chr> <chr>   <chr> <chr> <chr>  
## 1 Offsp1313 Tank1 A00000…  1206. hot   LL      F     NEV     Pare… Pare… NEV 17 
## 2 Offsp1522 Tank2 A00000…  1186  hot   EL      M     NEV     Pare… Pare… NEV 4  
## 3 Offsp1972 Tank3 A00000…   902. hot   EE      F     NEV     Pare… Pare… NEV 34 
## # … with 4 more variables: internalPri <dbl>, ID_num <int>, ID_type <chr>,
## #   n_of_type <dbl>, and abbreviated variable names ¹geno_vgll3, ²population,
## #   ³ID_family

The function first prints a table, the combination table, to the console. This table shows what the sampling algorithm will be looking for. In this case: 1 individual from each of the 3 tanks, with roughly 130 to 140 individuals to pick from in each tank.

After searching, the function returns a new table with the results. This has the same structure as the original table, but includes only the selected individuals. We can see that it found 3 individuals, one from each tank.
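To make the idea concrete, here is a minimal base-R sketch of “one random individual per tank” on a made-up table. This is not how ransampler works internally; the toy data and the helper name pick_one_per_group are assumptions used only for illustration.

```r
# Toy data (made up for illustration; not the table_salmon dataset)
toy <- data.frame(
  ID   = paste0("Fish", 1:9),
  tank = rep(c("Tank1", "Tank2", "Tank3"), each = 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick one random row from each level of `group`
pick_one_per_group <- function(df, group) {
  rows_by_group <- split(seq_len(nrow(df)), df[[group]])
  # Index into each group's row numbers with sample.int(), which avoids
  # the sample(n) pitfall when a group contains a single row
  picked <- vapply(
    rows_by_group,
    function(rows) rows[sample.int(length(rows), 1)],
    integer(1)
  )
  df[picked, , drop = FALSE]
}

set.seed(1)  # only to make this sketch reproducible
picked <- pick_one_per_group(toy, "tank")
picked
```

The returned table has the same columns as the input and one row per tank, which mirrors the shape of the result described above.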

Scenario 2: more groups

Sampling one individual from each sex and vgll3 genotype from each tank:

ransampler(
  table = table_salmon,
  ofeach = c("tank","sex","geno_vgll3")
)

## Searching using the following table of combinations:

## # A tibble: 24 × 5
##    tank  sex   geno_vgll3 n_ofeach n_options
##    <chr> <chr> <chr>         <dbl>     <int>
##  1 Tank1 F     EE                1        26
##  2 Tank1 F     EL                1        18
##  3 Tank1 F     LE                1        17
##  4 Tank1 F     LL                1        12
##  5 Tank1 M     EE                1        23
##  6 Tank1 M     EL                1        22
##  7 Tank1 M     LE                1         7
##  8 Tank1 M     LL                1        11
##  9 Tank2 F     EE                1        21
## 10 Tank2 F     EL                1        25
## 11 Tank2 F     LE                1         5
## 12 Tank2 F     LL                1         9
## 13 Tank2 M     EE                1        28
## 14 Tank2 M     EL                1        20
## 15 Tank2 M     LE                1        15
## 16 Tank2 M     LL                1        11
## 17 Tank3 F     EE                1        30
## 18 Tank3 F     EL                1        20
## 19 Tank3 F     LE                1        13
## 20 Tank3 F     LL                1        12
## 21 Tank3 M     EE                1        27
## 22 Tank3 M     EL                1        21
## 23 Tank3 M     LE                1        11
## 24 Tank3 M     LL                1         8

## # A tibble: 24 × 15
##    ID        tank  pit    weight temp  geno_…¹ sex   popul…² ID_ma ID_pa ID_fa…³
##    <chr>     <chr> <chr>   <dbl> <chr> <chr>   <chr> <chr>   <chr> <chr> <chr>  
##  1 Offsp1265 Tank1 A0000…   346  hot   EE      F     NEV     Pare… Pare… NEV 34 
##  2 Offsp1247 Tank1 A0000…   298  hot   EL      F     NEV     Pare… Pare… NEV 20 
##  3 Offsp1371 Tank1 A0000…  1146  hot   LE      F     NEV     Pare… Pare… NEV 67 
##  4 Offsp1341 Tank1 A0000…   944. hot   LL      F     NEV     Pare… Pare… NEV 65 
##  5 Offsp1036 Tank1 A0000…  1124. hot   EE      M     NEV     Pare… Pare… NEV 34 
##  6 Offsp1269 Tank1 A0000…  1002. hot   EL      M     NEV     Pare… Pare… NEV 56 
##  7 Offsp1222 Tank1 A0000…  1044. hot   LE      M     NEV     Pare… Pare… NEV 55 
##  8 Offsp1327 Tank1 A0000…   780. hot   LL      M     NEV     Pare… Pare… NEV 17 
##  9 Offsp1603 Tank2 A0000…   730. hot   EE      F     NEV     Pare… Pare… NEV 30 
## 10 Offsp1458 Tank2 A0000…  1367  hot   EL      F     NEV     Pare… Pare… NEV 32 
## # … with 14 more rows, 4 more variables: internalPri <dbl>, ID_num <int>,
## #   ID_type <chr>, n_of_type <dbl>, and abbreviated variable names ¹geno_vgll3,
## #   ²population, ³ID_family

The combination table is rather long this time, but I’m still showing it to make clear what’s going on. The table shows that the algorithm will be looking for one individual of each combination of the columns “tank”, “sex”, and “geno_vgll3”. Now there are fewer options for each combination (or “type”). In the end, the function returns a table with 24 individuals, just as requested.
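If you want to preview the number of options per combination yourself, a table like this can be approximated with base R. The sketch below uses made-up data and only two grouping columns; the column names n_options and n_ofeach are copied from the console output above, but how ransampler actually builds its table is an assumption on my part.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  tank = c("Tank1", "Tank1", "Tank1", "Tank2", "Tank2", "Tank2"),
  sex  = c("F", "F", "M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Count the rows available for every tank x sex combination.
# table() also keeps combinations with zero rows, so empty types stay visible.
combos <- as.data.frame(
  table(tank = toy$tank, sex = toy$sex),
  responseName = "n_options",
  stringsAsFactors = FALSE
)
combos$n_ofeach <- 1  # how many individuals we would ask for per combination
combos
```

Combinations where n_options is smaller than n_ofeach are the ones that cannot be filled, whatever the sampling strategy.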

Scenario 3: No-shares

Here, let’s try the no-share option. We’ll do the same sampling as above, but now we specify that no two individuals within the same tank may come from the same mother or father (so that the selected individuals within each tank are more genetically distinct). Essentially, what we tell the function below is that “no individuals may have the same tank and the same mother, and no individuals may have the same tank and the same father”:

ransampler(
  table = table_salmon,
  ofeach = c("tank","sex","geno_vgll3"),
  no_share = list(c("ID_ma","tank"),c("ID_pa","tank"))
)

I’m not showing the output here, as it is more or less the same as above.
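Since the output looks much the same, one way to convince yourself that a no-share rule holds is to check a result table for duplicated combinations of the columns in the rule. The helper below, respects_no_share, is a hypothetical name of my own and not part of ransampler; the toy result is made up.

```r
# Hypothetical helper: TRUE if no two rows share the same values in all of `cols`
respects_no_share <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

# Toy "result" table (made up for illustration)
toy_result <- data.frame(
  ID    = c("A", "B", "C", "D"),
  tank  = c("Tank1", "Tank1", "Tank2", "Tank2"),
  ID_ma = c("Ma1", "Ma2", "Ma1", "Ma3"),
  ID_pa = c("Pa1", "Pa1", "Pa2", "Pa3"),
  stringsAsFactors = FALSE
)

# Ma1 appears twice, but in different tanks, so the mother rule holds
ok_mothers <- respects_no_share(toy_result, c("ID_ma", "tank"))
# Pa1 appears twice within Tank1, so the father rule is violated
ok_fathers <- respects_no_share(toy_result, c("ID_pa", "tank"))
c(mothers = ok_mothers, fathers = ok_fathers)
```

On a real result, rows for unfilled slots (where ID is NA) would need to be dropped before such a check.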

Scenario 4: More than one individual per type

We can use the n_ofeach argument to tell the algorithm that we want two individuals of each combination/type:

ransampler(
  table = table_salmon,
  ofeach = c("tank","sex","geno_vgll3"),
  no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
  n_ofeach = 2
)

Scenario 5: Not enough individuals!

Now let’s try table_salmon_small, which is a smaller version of the table_salmon dataset:

result <- ransampler(
  table = table_salmon_small,
  ofeach = c("tank","sex","geno_vgll3"),
  no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
  n_ofeach = 2
)

## Searching using the following table of combinations:

## # A tibble: 48 × 5
##    tank  sex   geno_vgll3 n_ofeach n_options
##    <chr> <chr> <chr>         <dbl>     <int>
##  1 T21   F     EE                2         7
##  2 T21   F     EL                2        10
##  3 T21   F     LE                2         7
##  4 T21   F     LL                2         4
##  5 T21   M     EE                2        10
##  6 T21   M     EL                2        10
##  7 T21   M     LE                2         3
##  8 T21   M     LL                2         4
##  9 T23   F     EE                2        14
## 10 T23   F     EL                2         5
## 11 T23   F     LE                2         7
## 12 T23   F     LL                2         4
## 13 T23   M     EE                2        10
## 14 T23   M     EL                2         8
## 15 T23   M     LE                2         0
## 16 T23   M     LL                2         4
## 17 T31   F     EE                2        12
## 18 T31   F     EL                2         9
## 19 T31   F     LE                2         5
## 20 T31   F     LL                2         6
## 21 T31   M     EE                2         8
## 22 T31   M     EL                2         6
## 23 T31   M     LE                2         5
## 24 T31   M     LL                2         2
## 25 Tank1 F     EE                2         7
## 26 Tank1 F     EL                2        10
## 27 Tank1 F     LE                2         1
## 28 Tank1 F     LL                2         3
## 29 Tank1 M     EE                2        14
## 30 Tank1 M     EL                2        10
## 31 Tank1 M     LE                2         6
## 32 Tank1 M     LL                2         5
## 33 Tank2 F     EE                2        14
## 34 Tank2 F     EL                2         8
## 35 Tank2 F     LE                2         5
## 36 Tank2 F     LL                2         1
## 37 Tank2 M     EE                2        11
## 38 Tank2 M     EL                2        11
## 39 Tank2 M     LE                2         3
## 40 Tank2 M     LL                2         4
## 41 Tank3 F     EE                2        10
## 42 Tank3 F     EL                2        12
## 43 Tank3 F     LE                2         1
## 44 Tank3 F     LL                2         4
## 45 Tank3 M     EE                2        10
## 46 Tank3 M     EL                2         7
## 47 Tank3 M     LE                2         6
## 48 Tank3 M     LL                2         3

Notice from the combination table that for some combinations/types, there are very few individuals to pick from (down to 1 or even 0 in some cases). Combined with the no-share rule, this could mean that for some categories no individuals get picked, because there are no eligible individuals left to pick from.

Counting how many missing individuals we have:

result %>% filter(is.na(ID)) %>% nrow()

## [1] 13

There are a few strategies to cope with low numbers of individuals:

Strategy 1: the use_duplis option

In the scenario above, we’re picking two individuals of each type. However, maybe we’re not planning on using both individuals; maybe we pick two because we want one as a backup. In that case, we can tell the sampler that we only plan on using one of each type. That way, it will not enforce the no_share rule on individuals of the same type, potentially opening up more usable individuals:

result_2 <- ransampler(
  table = table_salmon_small,
  ofeach = c("tank","sex","geno_vgll3"),
  no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
  n_ofeach = 2,
  use_dupli = FALSE
)

## Searching using the following table of combinations:

## # A tibble: 48 × 5
##    tank  sex   geno_vgll3 n_ofeach n_options
##    <chr> <chr> <chr>         <dbl>     <int>
##  1 T21   F     EE                2         7
##  2 T21   F     EL                2        10
##  3 T21   F     LE                2         7
##  4 T21   F     LL                2         4
##  5 T21   M     EE                2        10
##  6 T21   M     EL                2        10
##  7 T21   M     LE                2         3
##  8 T21   M     LL                2         4
##  9 T23   F     EE                2        14
## 10 T23   F     EL                2         5
## 11 T23   F     LE                2         7
## 12 T23   F     LL                2         4
## 13 T23   M     EE                2        10
## 14 T23   M     EL                2         8
## 15 T23   M     LE                2         0
## 16 T23   M     LL                2         4
## 17 T31   F     EE                2        12
## 18 T31   F     EL                2         9
## 19 T31   F     LE                2         5
## 20 T31   F     LL                2         6
## 21 T31   M     EE                2         8
## 22 T31   M     EL                2         6
## 23 T31   M     LE                2         5
## 24 T31   M     LL                2         2
## 25 Tank1 F     EE                2         7
## 26 Tank1 F     EL                2        10
## 27 Tank1 F     LE                2         1
## 28 Tank1 F     LL                2         3
## 29 Tank1 M     EE                2        14
## 30 Tank1 M     EL                2        10
## 31 Tank1 M     LE                2         6
## 32 Tank1 M     LL                2         5
## 33 Tank2 F     EE                2        14
## 34 Tank2 F     EL                2         8
## 35 Tank2 F     LE                2         5
## 36 Tank2 F     LL                2         1
## 37 Tank2 M     EE                2        11
## 38 Tank2 M     EL                2        11
## 39 Tank2 M     LE                2         3
## 40 Tank2 M     LL                2         4
## 41 Tank3 F     EE                2        10
## 42 Tank3 F     EL                2        12
## 43 Tank3 F     LE                2         1
## 44 Tank3 F     LL                2         4
## 45 Tank3 M     EE                2        10
## 46 Tank3 M     EL                2         7
## 47 Tank3 M     LE                2         6
## 48 Tank3 M     LL                2         3

Again, checking how many we are missing. (Note: in this specific scenario, this is not likely to improve the result much, as lack of family variation within tanks is not a big issue.)

result_2 %>% filter(is.na(ID)) %>% nrow()

## [1] 11

Strategy 2: The best of multiple runs

Each time you run the function, you will get a different result (since the picking is random). If you’re having trouble picking enough individuals for some combinations, you can run the function several times and keep the result that has the fewest missing individuals. This is done automatically with the “runs” parameter. In the example below, we run the sampler 5 times and keep the result with the fewest missing individuals.

result_3 <- ransampler(
  table = table_salmon_small,
  ofeach = c("tank","sex","geno_vgll3"),
  no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
  n_ofeach = 2,
  use_dupli = FALSE,
  runs = 5
)

## Searching using the following table of combinations:

## # A tibble: 48 × 5
##    tank  sex   geno_vgll3 n_ofeach n_options
##    <chr> <chr> <chr>         <dbl>     <int>
##  1 T21   F     EE                2         7
##  2 T21   F     EL                2        10
##  3 T21   F     LE                2         7
##  4 T21   F     LL                2         4
##  5 T21   M     EE                2        10
##  6 T21   M     EL                2        10
##  7 T21   M     LE                2         3
##  8 T21   M     LL                2         4
##  9 T23   F     EE                2        14
## 10 T23   F     EL                2         5
## 11 T23   F     LE                2         7
## 12 T23   F     LL                2         4
## 13 T23   M     EE                2        10
## 14 T23   M     EL                2         8
## 15 T23   M     LE                2         0
## 16 T23   M     LL                2         4
## 17 T31   F     EE                2        12
## 18 T31   F     EL                2         9
## 19 T31   F     LE                2         5
## 20 T31   F     LL                2         6
## 21 T31   M     EE                2         8
## 22 T31   M     EL                2         6
## 23 T31   M     LE                2         5
## 24 T31   M     LL                2         2
## 25 Tank1 F     EE                2         7
## 26 Tank1 F     EL                2        10
## 27 Tank1 F     LE                2         1
## 28 Tank1 F     LL                2         3
## 29 Tank1 M     EE                2        14
## 30 Tank1 M     EL                2        10
## 31 Tank1 M     LE                2         6
## 32 Tank1 M     LL                2         5
## 33 Tank2 F     EE                2        14
## 34 Tank2 F     EL                2         8
## 35 Tank2 F     LE                2         5
## 36 Tank2 F     LL                2         1
## 37 Tank2 M     EE                2        11
## 38 Tank2 M     EL                2        11
## 39 Tank2 M     LE                2         3
## 40 Tank2 M     LL                2         4
## 41 Tank3 F     EE                2        10
## 42 Tank3 F     EL                2        12
## 43 Tank3 F     LE                2         1
## 44 Tank3 F     LL                2         4
## 45 Tank3 M     EE                2        10
## 46 Tank3 M     EL                2         7
## 47 Tank3 M     LE                2         6
## 48 Tank3 M     LL                2         3

## Run 1 of 5

## Missing: 13

## Run 2 of 5

## Missing: 13

## Run 3 of 5

## Missing: 12

## Run 4 of 5

## Missing: 11

## Run 5 of 5

## Missing: 7

result_3 %>% filter(is.na(ID)) %>% nrow()

## [1] 11

A little better!
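The “best of several runs” idea itself is simple, and can be sketched in a few lines of base R. The sketch below uses a made-up stand-in sampler (toy_sampler) and a hypothetical helper (best_of_runs); it only illustrates the principle and makes no claim about how ransampler implements the runs parameter.

```r
# Hypothetical stand-in for a sampler: returns a table in which a random
# subset of the requested slots could not be filled (ID is NA)
toy_sampler <- function(n_requested = 10) {
  ids <- paste0("Fish", seq_len(n_requested))
  ids[runif(n_requested) < 0.3] <- NA
  data.frame(ID = ids, stringsAsFactors = FALSE)
}

# Run a sampler `runs` times and keep the result with the fewest missing IDs
best_of_runs <- function(sampler, runs) {
  results   <- lapply(seq_len(runs), function(i) sampler())
  n_missing <- vapply(results, function(res) sum(is.na(res$ID)), integer(1))
  list(best = results[[which.min(n_missing)]], n_missing = n_missing)
}

set.seed(42)  # only to make this sketch reproducible
out <- best_of_runs(toy_sampler, runs = 5)
out$n_missing            # missing count for each run
sum(is.na(out$best$ID))  # missing count in the result that was kept
```

More runs cost more time but can only keep or lower the best missing count, so this is mostly worth it when a single run is fast.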

Prioritizing

Finally, let’s have a quick look at the prioritizing option. Let’s say we want to prioritize individuals from large families (so as to not exhaust small families). We can do so by making a new column called “siblings”, which tells how many siblings each individual has in a tank, and then prioritize by the inverse of this (so that those with many siblings get prioritized first).

First, a function for counting siblings, and one for inverting the count:

# Count, for each individual, how many rows share its family and tank
# (the count includes the individual itself); used below
siblings_count <- function(df) {
  sibs <- df %>% apply(MARGIN = 1, FUN = function(x) {
    tfam <- x[["ID_family"]]
    ttank <- x[["tank"]]
    df %>%
      filter(ID_family == tfam & tank == ttank) %>%
      nrow()
  })
  df$sibs <- sibs
  df
}

# also used below
invert <- function(x) 
{
  (max(x) - x) + 1
}
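To see what these two helpers produce, here is a small worked example on made-up data. It uses base R’s ave() as a stand-in for the sibling count (same family and same tank, counting the individual itself) and repeats the invert function so that the sketch is self-contained; the toy table is an assumption for illustration.

```r
# Toy data (made up): three families of different sizes in one tank
toy <- data.frame(
  ID_family = c("Fam1", "Fam1", "Fam1", "Fam2", "Fam2", "Fam3"),
  tank      = "Tank1",
  stringsAsFactors = FALSE
)

# Number of rows sharing both family and tank (includes the individual itself)
toy$sibs <- ave(seq_len(nrow(toy)), toy$ID_family, toy$tank, FUN = length)

# Same definition as above: the largest count maps to 1
invert <- function(x) {
  (max(x) - x) + 1
}
toy$pri <- invert(toy$sibs)
toy
```

Here sibs is 3, 3, 3, 2, 2, 1 and pri is 1, 1, 1, 2, 2, 3, so individuals from the largest family end up with the lowest pri value.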

Then, let’s try this in practice:

table_salmon_sibs <- 
  table_salmon %>%
  siblings_count %>% 
  mutate( pri = invert(sibs) )

head(table_salmon_sibs)

## # A tibble: 6 × 13
##   ID    tank  pit   weight temp  geno_…¹ sex   popul…² ID_ma ID_pa ID_fa…³  sibs
##   <chr> <chr> <chr>  <dbl> <chr> <chr>   <chr> <chr>   <chr> <chr> <chr>   <int>
## 1 Offs… Tank1 A000…   987  hot   EE      F     NEV     Pare… Pare… NEV 54      5
## 2 Offs… Tank1 A000…   240. hot   EE      M     NEV     Pare… Pare… NEV 2       5
## 3 Offs… Tank1 A000…   966. hot   EL      F     NEV     Pare… Pare… NEV 4       7
## 4 Offs… Tank1 A000…  1554. hot   LE      M     NEV     Pare… Pare… NEV 55      4
## 5 Offs… Tank1 A000…   212. hot   EE      M     NEV     Pare… Pare… NEV 22      3
## 6 Offs… Tank1 A000…   996. hot   EE      F     NEV     Pare… Pare… NEV 39      3
## # … with 1 more variable: pri <dbl>, and abbreviated variable names
## #   ¹geno_vgll3, ²population, ³ID_family

Now we have one column, “sibs”, which tells how many siblings each individual has per tank, and one column, “pri”, which is the inverse of it. We’ll now do a new sampling where we prioritize using the column “pri”:

result_4 <- ransampler(
  table = table_salmon_sibs,
  ofeach = c("tank","sex","geno_vgll3"),
  no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
  n_ofeach = 2,
  pri_by = "pri"
)

This will return a table with selected individuals as before, but this time the individuals with many siblings should have been prioritized.