Sometimes we need to randomly pick individuals from a study population, for example for sampling, measurements, or other activities. In the simplest cases, this just means haphazardly picking individuals on-site, but sometimes we need more control over the process. The R package ransampler can help with some of these cases. It can randomly sample individuals from multiple groups, prioritize some individuals over others, and enforce “no-share” conditions, for example avoiding picking two individuals of the same type within some group (say, two individuals of the same family from the same study group).
In this quick demo, I’ll show some examples of how to use ransampler. First, install ransampler from GitHub (this needs devtools installed):
devtools::install_github("eiriksen/ransampler")
# see also the function's documentation:
?ransampler
The ransampler package comes with two example datasets, “table_salmon” and “table_salmon_small”. The datasets describe a study population of fish, where each row is one individual and the columns denote various traits and experimental groups (ID, weight, tank, vgll3 genotype, mother ID, father ID, etc.).
library(ransampler)
library(tidyverse)
library(glue)
library(magrittr)
head(table_salmon)
## # A tibble: 6 × 11
## ID tank pit weight temp geno_…¹ sex popul…² ID_ma ID_pa ID_fa…³
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Offsp1377 Tank1 A00000… 987 hot EE F NEV Pare… Pare… NEV 54
## 2 Offsp1384 Tank1 A00000… 240. hot EE M NEV Pare… Pare… NEV 2
## 3 Offsp1262 Tank1 A00000… 966. hot EL F NEV Pare… Pare… NEV 4
## 4 Offsp1333 Tank1 A00000… 1554. hot LE M NEV Pare… Pare… NEV 55
## 5 Offsp1378 Tank1 A00000… 212. hot EE M NEV Pare… Pare… NEV 22
## 6 Offsp0938 Tank1 A00000… 996. hot EE F NEV Pare… Pare… NEV 39
## # … with abbreviated variable names ¹geno_vgll3, ²population, ³ID_family
Sampling one individual randomly from each tank:
ransampler(
table = table_salmon,
ofeach = "tank"
)
## Searching using the following table of combinations:
## # A tibble: 3 × 3
## tank n_ofeach n_options
## <chr> <dbl> <int>
## 1 Tank1 1 136
## 2 Tank2 1 134
## 3 Tank3 1 142
## # A tibble: 3 × 15
## ID tank pit weight temp geno_…¹ sex popul…² ID_ma ID_pa ID_fa…³
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Offsp1313 Tank1 A00000… 1206. hot LL F NEV Pare… Pare… NEV 17
## 2 Offsp1522 Tank2 A00000… 1186 hot EL M NEV Pare… Pare… NEV 4
## 3 Offsp1972 Tank3 A00000… 902. hot EE F NEV Pare… Pare… NEV 34
## # … with 4 more variables: internalPri <dbl>, ID_num <int>, ID_type <chr>,
## # n_of_type <dbl>, and abbreviated variable names ¹geno_vgll3, ²population,
## # ³ID_family
The function first prints a table to the console: the combination table. This table shows what the sampling algorithm will be looking for. In this case: 1 individual from each of the 3 tanks, with roughly 130 to 140 individuals to pick from in each tank.
After searching, the function returns a new table with the results. It has the same columns as the original table, plus four extra columns added by the function (internalPri, ID_num, ID_type and n_of_type), but includes only the selected individuals. We can see that it found 3 individuals, one from each tank.
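To make the idea concrete, here is a minimal base-R sketch of "one random individual per group". It uses a small made-up table (`toy`, not one of the package datasets) and is only meant to illustrate the concept; it is not how ransampler is implemented internally.

```r
# Toy data: 9 fish spread over 3 tanks (hypothetical values, not table_salmon)
set.seed(1)
toy <- data.frame(
  ID   = paste0("Fish", 1:9),
  tank = rep(c("Tank1", "Tank2", "Tank3"), each = 3),
  stringsAsFactors = FALSE
)

# Split the row indices by tank, then draw one index at random per tank.
# Indexing with sample.int() avoids the sample() pitfall for length-1 vectors.
rows_by_tank <- split(seq_len(nrow(toy)), toy$tank)
picked_rows  <- vapply(
  rows_by_tank,
  function(idx) idx[sample.int(length(idx), 1)],
  integer(1)
)

# The result has the same columns as the input, one row per tank
picked <- toy[picked_rows, ]
picked
```

With a single grouping column and no further rules this is all there is to it; the package becomes useful once several grouping columns, no-share rules, or priorities are involved.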
Sampling one individual from each sex and vgll3 genotype from each tank:
ransampler(
table = table_salmon,
ofeach = c("tank","sex","geno_vgll3")
)
## Searching using the following table of combinations:
## # A tibble: 24 × 5
## tank sex geno_vgll3 n_ofeach n_options
## <chr> <chr> <chr> <dbl> <int>
## 1 Tank1 F EE 1 26
## 2 Tank1 F EL 1 18
## 3 Tank1 F LE 1 17
## 4 Tank1 F LL 1 12
## 5 Tank1 M EE 1 23
## 6 Tank1 M EL 1 22
## 7 Tank1 M LE 1 7
## 8 Tank1 M LL 1 11
## 9 Tank2 F EE 1 21
## 10 Tank2 F EL 1 25
## 11 Tank2 F LE 1 5
## 12 Tank2 F LL 1 9
## 13 Tank2 M EE 1 28
## 14 Tank2 M EL 1 20
## 15 Tank2 M LE 1 15
## 16 Tank2 M LL 1 11
## 17 Tank3 F EE 1 30
## 18 Tank3 F EL 1 20
## 19 Tank3 F LE 1 13
## 20 Tank3 F LL 1 12
## 21 Tank3 M EE 1 27
## 22 Tank3 M EL 1 21
## 23 Tank3 M LE 1 11
## 24 Tank3 M LL 1 8
## # A tibble: 24 × 15
## ID tank pit weight temp geno_…¹ sex popul…² ID_ma ID_pa ID_fa…³
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Offsp1265 Tank1 A0000… 346 hot EE F NEV Pare… Pare… NEV 34
## 2 Offsp1247 Tank1 A0000… 298 hot EL F NEV Pare… Pare… NEV 20
## 3 Offsp1371 Tank1 A0000… 1146 hot LE F NEV Pare… Pare… NEV 67
## 4 Offsp1341 Tank1 A0000… 944. hot LL F NEV Pare… Pare… NEV 65
## 5 Offsp1036 Tank1 A0000… 1124. hot EE M NEV Pare… Pare… NEV 34
## 6 Offsp1269 Tank1 A0000… 1002. hot EL M NEV Pare… Pare… NEV 56
## 7 Offsp1222 Tank1 A0000… 1044. hot LE M NEV Pare… Pare… NEV 55
## 8 Offsp1327 Tank1 A0000… 780. hot LL M NEV Pare… Pare… NEV 17
## 9 Offsp1603 Tank2 A0000… 730. hot EE F NEV Pare… Pare… NEV 30
## 10 Offsp1458 Tank2 A0000… 1367 hot EL F NEV Pare… Pare… NEV 32
## # … with 14 more rows, 4 more variables: internalPri <dbl>, ID_num <int>,
## # ID_type <chr>, n_of_type <dbl>, and abbreviated variable names ¹geno_vgll3,
## # ²population, ³ID_family
The combination table is rather long this time, but I’m still showing it to make clear what’s going on. It shows that the algorithm will be looking for one individual of each combination of the columns “tank”, “sex”, and “geno_vgll3”. Now there are fewer options for each combination (or “type”). In the end, the function returns a table with 24 individuals, just as requested.
We can use the n_ofeach parameter to tell the algorithm that we want two individuals of each combination/type. The call below also adds a no_share rule, here asking that no two picked individuals share a mother (ID_ma) or a father (ID_pa) within the same tank:
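If you want to inspect the number of options per combination before sampling, a similar count can be made with base R. The sketch below uses a small made-up table (`toy`) and mimics the layout of the combination table; the way it is built here is my own illustration, not necessarily how ransampler builds it.

```r
# Toy data (hypothetical): 6 fish with tank, sex and genotype
toy <- data.frame(
  tank       = c("Tank1", "Tank1", "Tank1", "Tank2", "Tank2", "Tank2"),
  sex        = c("F", "F", "M", "F", "M", "M"),
  geno_vgll3 = c("EE", "EE", "EE", "LL", "LL", "EE"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the three columns; table() also keeps combinations
# that have zero individuals, which is useful for spotting gaps
combos <- as.data.frame(
  table(toy[c("tank", "sex", "geno_vgll3")]),
  responseName = "n_options"
)

# Sort for readability and add the number requested per combination
combos <- combos[order(combos$tank, combos$sex, combos$geno_vgll3), ]
combos$n_ofeach <- 1
combos
```

Combinations with n_options below n_ofeach are the ones that cannot be filled completely, whatever the sampler does.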
ransampler(
table = table_salmon,
ofeach = c("tank","sex","geno_vgll3"),
no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
n_ofeach = 2
)
Let’s now try table_salmon_small, which is a smaller version of the table_salmon dataset:
result <- ransampler(
table = table_salmon_small,
ofeach = c("tank","sex","geno_vgll3"),
no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
n_ofeach = 2
)
## Searching using the following table of combinations:
## # A tibble: 48 × 5
## tank sex geno_vgll3 n_ofeach n_options
## <chr> <chr> <chr> <dbl> <int>
## 1 T21 F EE 2 7
## 2 T21 F EL 2 10
## 3 T21 F LE 2 7
## 4 T21 F LL 2 4
## 5 T21 M EE 2 10
## 6 T21 M EL 2 10
## 7 T21 M LE 2 3
## 8 T21 M LL 2 4
## 9 T23 F EE 2 14
## 10 T23 F EL 2 5
## 11 T23 F LE 2 7
## 12 T23 F LL 2 4
## 13 T23 M EE 2 10
## 14 T23 M EL 2 8
## 15 T23 M LE 2 0
## 16 T23 M LL 2 4
## 17 T31 F EE 2 12
## 18 T31 F EL 2 9
## 19 T31 F LE 2 5
## 20 T31 F LL 2 6
## 21 T31 M EE 2 8
## 22 T31 M EL 2 6
## 23 T31 M LE 2 5
## 24 T31 M LL 2 2
## 25 Tank1 F EE 2 7
## 26 Tank1 F EL 2 10
## 27 Tank1 F LE 2 1
## 28 Tank1 F LL 2 3
## 29 Tank1 M EE 2 14
## 30 Tank1 M EL 2 10
## 31 Tank1 M LE 2 6
## 32 Tank1 M LL 2 5
## 33 Tank2 F EE 2 14
## 34 Tank2 F EL 2 8
## 35 Tank2 F LE 2 5
## 36 Tank2 F LL 2 1
## 37 Tank2 M EE 2 11
## 38 Tank2 M EL 2 11
## 39 Tank2 M LE 2 3
## 40 Tank2 M LL 2 4
## 41 Tank3 F EE 2 10
## 42 Tank3 F EL 2 12
## 43 Tank3 F LE 2 1
## 44 Tank3 F LL 2 4
## 45 Tank3 M EE 2 10
## 46 Tank3 M EL 2 7
## 47 Tank3 M LE 2 6
## 48 Tank3 M LL 2 3
Notice from the combination table that for some combinations/types, there are very few individuals to pick from (down to 1, or even 0, in some cases). Because of the no-share rule, this could mean that for some categories no individuals get picked, because there are no eligible individuals left to pick from.
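To see what a no-share condition amounts to, here is a small base-R check that can be run on any table of picked individuals. The helper `shares_within()` is hypothetical (my own, not part of ransampler), and the data are made up.

```r
# Hypothetical helper (not part of ransampler): TRUE if at least two rows
# share the same value of `key` within the same value of `group`
shares_within <- function(df, key, group) {
  # Ignore rows where either value is missing (e.g. unfilled slots)
  df <- df[!is.na(df[[key]]) & !is.na(df[[group]]), ]
  any(duplicated(df[c(key, group)]))
}

# Made-up selection of four fish
picked <- data.frame(
  ID    = c("A", "B", "C", "D"),
  tank  = c("Tank1", "Tank1", "Tank2", "Tank2"),
  ID_ma = c("Ma1", "Ma2", "Ma1", "Ma3"),
  ID_pa = c("Pa1", "Pa1", "Pa2", "Pa3"),
  stringsAsFactors = FALSE
)

# Ma1 occurs twice, but in different tanks, so the rule is respected
shares_within(picked, "ID_ma", "tank")
# Pa1 occurs twice in Tank1, so the rule is broken
shares_within(picked, "ID_pa", "tank")
```

The more no-share rules apply and the fewer options a combination has, the more likely it is that every remaining candidate would break a rule.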
Counting how many missing individuals we have:
result %>% filter(is.na(ID)) %>% nrow()
## [1] 13
There are a few strategies to cope with low numbers of individuals:
In the scenario above, we’re picking two individuals of each type. However, maybe we’re not planning to use both; maybe we pick two because we want one as a backup. In that case, we can use the use_dupli parameter to tell the sampler that we only plan on using one of each type. That way, it will not enforce the no_share rule between individuals of the same type, potentially opening up more usable individuals:
result_2 <- ransampler(
table = table_salmon_small,
ofeach = c("tank","sex","geno_vgll3"),
no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
n_ofeach= 2,
use_dupli = FALSE
)
## Searching using the following table of combinations:
## # A tibble: 48 × 5
## tank sex geno_vgll3 n_ofeach n_options
## <chr> <chr> <chr> <dbl> <int>
## 1 T21 F EE 2 7
## 2 T21 F EL 2 10
## 3 T21 F LE 2 7
## 4 T21 F LL 2 4
## 5 T21 M EE 2 10
## 6 T21 M EL 2 10
## 7 T21 M LE 2 3
## 8 T21 M LL 2 4
## 9 T23 F EE 2 14
## 10 T23 F EL 2 5
## 11 T23 F LE 2 7
## 12 T23 F LL 2 4
## 13 T23 M EE 2 10
## 14 T23 M EL 2 8
## 15 T23 M LE 2 0
## 16 T23 M LL 2 4
## 17 T31 F EE 2 12
## 18 T31 F EL 2 9
## 19 T31 F LE 2 5
## 20 T31 F LL 2 6
## 21 T31 M EE 2 8
## 22 T31 M EL 2 6
## 23 T31 M LE 2 5
## 24 T31 M LL 2 2
## 25 Tank1 F EE 2 7
## 26 Tank1 F EL 2 10
## 27 Tank1 F LE 2 1
## 28 Tank1 F LL 2 3
## 29 Tank1 M EE 2 14
## 30 Tank1 M EL 2 10
## 31 Tank1 M LE 2 6
## 32 Tank1 M LL 2 5
## 33 Tank2 F EE 2 14
## 34 Tank2 F EL 2 8
## 35 Tank2 F LE 2 5
## 36 Tank2 F LL 2 1
## 37 Tank2 M EE 2 11
## 38 Tank2 M EL 2 11
## 39 Tank2 M LE 2 3
## 40 Tank2 M LL 2 4
## 41 Tank3 F EE 2 10
## 42 Tank3 F EL 2 12
## 43 Tank3 F LE 2 1
## 44 Tank3 F LL 2 4
## 45 Tank3 M EE 2 10
## 46 Tank3 M EL 2 7
## 47 Tank3 M LE 2 6
## 48 Tank3 M LL 2 3
Again, checking how many we are missing (note that in this specific scenario, this is unlikely to improve the result much, since lack of family variation within tanks is not a big issue):
result_2 %>% filter(is.na(ID)) %>% nrow()
## [1] 11
Each time you run the function, you will get a different result (since the picking is random). If you’re having trouble picking enough individuals for some combinations, you can run the function several times and keep the result with the fewest missing individuals. This is done automatically with the “runs” parameter. In the example below, we run the sampler 5 times and keep the result with the fewest missing individuals.
result_3 <- ransampler(
table = table_salmon_small,
ofeach = c("tank","sex","geno_vgll3"),
no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
n_ofeach= 2,
use_dupli = FALSE,
runs = 5
)
## Searching using the following table of combinations:
## # A tibble: 48 × 5
## tank sex geno_vgll3 n_ofeach n_options
## <chr> <chr> <chr> <dbl> <int>
## 1 T21 F EE 2 7
## 2 T21 F EL 2 10
## 3 T21 F LE 2 7
## 4 T21 F LL 2 4
## 5 T21 M EE 2 10
## 6 T21 M EL 2 10
## 7 T21 M LE 2 3
## 8 T21 M LL 2 4
## 9 T23 F EE 2 14
## 10 T23 F EL 2 5
## 11 T23 F LE 2 7
## 12 T23 F LL 2 4
## 13 T23 M EE 2 10
## 14 T23 M EL 2 8
## 15 T23 M LE 2 0
## 16 T23 M LL 2 4
## 17 T31 F EE 2 12
## 18 T31 F EL 2 9
## 19 T31 F LE 2 5
## 20 T31 F LL 2 6
## 21 T31 M EE 2 8
## 22 T31 M EL 2 6
## 23 T31 M LE 2 5
## 24 T31 M LL 2 2
## 25 Tank1 F EE 2 7
## 26 Tank1 F EL 2 10
## 27 Tank1 F LE 2 1
## 28 Tank1 F LL 2 3
## 29 Tank1 M EE 2 14
## 30 Tank1 M EL 2 10
## 31 Tank1 M LE 2 6
## 32 Tank1 M LL 2 5
## 33 Tank2 F EE 2 14
## 34 Tank2 F EL 2 8
## 35 Tank2 F LE 2 5
## 36 Tank2 F LL 2 1
## 37 Tank2 M EE 2 11
## 38 Tank2 M EL 2 11
## 39 Tank2 M LE 2 3
## 40 Tank2 M LL 2 4
## 41 Tank3 F EE 2 10
## 42 Tank3 F EL 2 12
## 43 Tank3 F LE 2 1
## 44 Tank3 F LL 2 4
## 45 Tank3 M EE 2 10
## 46 Tank3 M EL 2 7
## 47 Tank3 M LE 2 6
## 48 Tank3 M LL 2 3
## Run 1 of 5
## Missing: 13
## Run 2 of 5
## Missing: 13
## Run 3 of 5
## Missing: 12
## Run 4 of 5
## Missing: 11
## Run 5 of 5
## Missing: 7
result_3 %>% filter(is.na(ID)) %>% nrow()
## [1] 11
A little better!
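The "run several times, keep the best" idea is easy to sketch in plain R. In the example below, `one_run()` is a made-up stand-in for a single sampling run (it just marks some slots as missing at random), and `best_of()` is a hypothetical helper; neither is part of ransampler, and the package may implement its “runs” parameter differently.

```r
# Hypothetical stand-in for one sampling run: returns a result table in
# which some rows have ID = NA (a slot that could not be filled)
one_run <- function(n_slots = 10, p_missing = 0.3) {
  ids <- paste0("Fish", seq_len(n_slots))
  ids[runif(n_slots) < p_missing] <- NA_character_
  data.frame(slot = seq_len(n_slots), ID = ids, stringsAsFactors = FALSE)
}

# Number of unfilled slots in a result
count_missing <- function(result) sum(is.na(result$ID))

# Repeat the run and keep the candidate with the fewest missing IDs
best_of <- function(runs) {
  best <- NULL
  for (i in seq_len(runs)) {
    candidate <- one_run()
    if (is.null(best) || count_missing(candidate) < count_missing(best)) {
      best <- candidate
    }
  }
  best
}

set.seed(42)
best <- best_of(5)
count_missing(best)
```

More runs can only keep or lower the number of missing individuals, at the cost of a longer running time.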
Finally, let’s have a quick look at the prioritizing option. Let’s say we want to prioritize individuals from large families (so as not to exhaust small families). We can do so by making a new column called “sibs”, which tells how many siblings each individual has in its tank, and then prioritizing by the inverse of this (so that those with many siblings get prioritized first).
First, a function for counting siblings, and one for inverting the count:
# functions used for counting siblings (used below)
siblings_count <- function (df)
{
sibs <- df %>% apply(MARGIN = 1, FUN = function(x) {
tfam = x[["ID_family"]]
ttank = x[["tank"]]
sibs = df %>% filter(ID_family == tfam & ttank == tank) %>%
nrow()
sibs
})
df$sibs = sibs
df
}
# also used below
invert <- function(x)
{
(max(x) - x) + 1
}
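As an aside, the same per-tank family count can be obtained without looping over rows, using base R's ave(). This is a sketch on made-up data; `siblings_count_ave()` is a hypothetical name, and like the function above it counts the individual itself as part of its family group.

```r
# Sketch: family size within tank via ave(), assuming the columns
# ID_family and tank exist in df
siblings_count_ave <- function(df) {
  df$sibs <- ave(
    seq_len(nrow(df)),       # any vector of the right length will do
    df$ID_family, df$tank,   # grouping variables
    FUN = length             # group size, repeated for every member
  )
  df
}

# Made-up example: two families in two tanks
toy <- data.frame(
  ID        = paste0("Fish", 1:6),
  tank      = c("Tank1", "Tank1", "Tank1", "Tank2", "Tank2", "Tank2"),
  ID_family = c("Fam1", "Fam1", "Fam2", "Fam1", "Fam2", "Fam2"),
  stringsAsFactors = FALSE
)

toy <- siblings_count_ave(toy)
# Same inversion as invert(): large families get the lowest numbers
toy$pri <- (max(toy$sibs) - toy$sibs) + 1
toy
```

On large tables this tends to be considerably faster than filtering the whole table once per row.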
Then, let’s try this in practice:
table_salmon_sibs <-
table_salmon %>%
siblings_count %>%
mutate( pri = invert(sibs) )
head(table_salmon_sibs)
## # A tibble: 6 × 13
## ID tank pit weight temp geno_…¹ sex popul…² ID_ma ID_pa ID_fa…³ sibs
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 Offs… Tank1 A000… 987 hot EE F NEV Pare… Pare… NEV 54 5
## 2 Offs… Tank1 A000… 240. hot EE M NEV Pare… Pare… NEV 2 5
## 3 Offs… Tank1 A000… 966. hot EL F NEV Pare… Pare… NEV 4 7
## 4 Offs… Tank1 A000… 1554. hot LE M NEV Pare… Pare… NEV 55 4
## 5 Offs… Tank1 A000… 212. hot EE M NEV Pare… Pare… NEV 22 3
## 6 Offs… Tank1 A000… 996. hot EE F NEV Pare… Pare… NEV 39 3
## # … with 1 more variable: pri <dbl>, and abbreviated variable names
## # ¹geno_vgll3, ²population, ³ID_family
Now we have one column “sibs”, which tells how many siblings each individual has per tank, and one column “pri”, which is the inverse of it. We’ll now do a new sampling where we prioritize using the column “pri”:
result_4 <- ransampler(
table = table_salmon_sibs,
ofeach = c("tank","sex","geno_vgll3"),
no_share = list(c("ID_ma","tank"),c("ID_pa","tank")),
n_ofeach= 2,
pri_by = "pri"
)
This will return a table with selected individuals as before, but this time the individuals with many siblings should have been prioritized.
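For intuition, priority-based picking can be sketched in base R as "lowest pri value first, ties broken at random". Note the assumptions: the toy data are made up, and the rule that a lower “pri” value means "picked first" is my reading of the inversion step above, not something taken from ransampler's internals.

```r
# Made-up data: 6 fish in 2 tanks, each with a priority value
toy <- data.frame(
  ID   = paste0("Fish", 1:6),
  tank = rep(c("Tank1", "Tank2"), each = 3),
  pri  = c(1, 3, 1, 2, 2, 1),
  stringsAsFactors = FALSE
)

set.seed(7)
# Shuffle the rows first, so that ties in pri end up in random order
shuffled <- toy[sample.int(nrow(toy)), ]

# order() keeps tied rows in their current (shuffled) order
ranked <- shuffled[order(shuffled$tank, shuffled$pri), ]

# Keep the first (highest-priority) row of each tank
picked <- ranked[!duplicated(ranked$tank), ]
picked
```

In Tank1 the pick is either Fish1 or Fish3 (both have pri 1), while in Tank2 it is always Fish6, the only fish there with pri 1.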