Filter a dataframe using two-way criteria to increase connectedness

Traditional filtering (subsetting) of data is typically performed via some criteria based on the columns of the data.

In contrast, this function performs filtering of data based on the joint rows and columns of a matrix-view of two factors.

Conceptually, the idea is to re-shape two or three columns of a dataframe into a matrix, and then delete entire rows (or columns) of the matrix if there are too many missing cells in a row (or column).

The two most useful applications of two-way filtering are to:

Remove a factor level that has few interactions with another factor. This is especially useful in linear models to remove rare factor combinations.
Remove a factor level that has any missing interactions with another factor. This is especially useful with biplots of a matrix to remove rows or columns that have missing values.

A formula syntax is used to specify the two-way filtering criteria.

A few examples are the easiest way to understand the syntax.

dat <- data.frame(state=c("NE","NE", "IA", "NE", "IA"), year=c(1,2,2,3,3), value=11:15)

When the 'value' column is re-shaped into a matrix it looks like:

state/year | 1  | 2  | 3
NE         | 11 | 12 | 14
IA         |    | 13 | 15
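The matrix view above can be reproduced in base R. This is only an illustrative sketch of the reshaping idea, not necessarily how con_filter builds it internally; the object name `mat` is an assumption made for this example.

```r
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year  = c(1, 2, 2, 3, 3),
                  value = 11:15)

# One row per state, one column per year. Each observed cell holds a
# single value, so sum() just returns it; absent combinations are NA.
mat <- tapply(dat$value, list(state = dat$state, year = dat$year), sum)

# Number of observed (non-missing) years in each row of the matrix.
years_observed <- rowSums(!is.na(mat))
print(mat)
print(years_observed)
```

The row for IA has a missing cell for year 1, which is what makes IA a candidate for removal when a minimum number of years per state is required.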

Drop states with too many missing combinations. Keep only states with "at least 3 years per state":

con_filter(dat, ~ 3 * year / state)
#> NE 1 11
#> NE 2 12
#> NE 3 14

Keep only years with "at least 2 states per year":

con_filter(dat, ~ 2 * state / year)
#> NE 2 12
#> IA 2 13
#> NE 3 14
#> IA 3 15
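The counting rule behind these two calls can be imitated in base R. The sketch below is an approximation for illustration only; `keep_levels` is a hypothetical helper, not a function from the package, and the package's own implementation may differ.

```r
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year  = c(1, 2, 2, 3, 3),
                  value = 11:15)

# Hypothetical helper: return the levels of `group` that occur together
# with at least `n` distinct levels of `other`.
keep_levels <- function(group, other, n) {
  counts <- rowSums(table(group, other) > 0)
  names(counts)[counts >= n]
}

# "At least 3 years per state": only NE has all three years.
states_ok <- keep_levels(dat$state, dat$year, 3)
by_state <- dat[dat$state %in% states_ok, ]

# "At least 2 states per year": year 1 has only NE, so it is removed.
years_ok <- keep_levels(dat$year, dat$state, 2)
by_year <- dat[dat$year %in% years_ok, ]
```

Reading the formula as "n * counted factor / filtered factor" matches the helper's arguments: the factor after the slash is the one whose levels are kept or dropped.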

If the constant in the formula is less than 1.0, it is interpreted as a fraction. Keep only states with "at least 75% of years per state":

con_filter(dat, ~ .75 * year / state)
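A fractional threshold can be sketched in base R as follows. This assumes the fraction is taken relative to the total number of distinct years in the data, which is consistent with the examples on this page but is not confirmed here as the package's exact rule.

```r
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year  = c(1, 2, 2, 3, 3),
                  value = 11:15)

# Number of distinct years observed for each state.
years_per_state <- rowSums(table(dat$state, dat$year) > 0)

# Assumption: the denominator is the number of distinct years overall.
frac <- years_per_state / length(unique(dat$year))

# NE is observed in 3 of 3 years; IA in 2 of 3, which is below 0.75.
threshold <- 0.75
keep_states <- names(frac)[frac >= threshold]
filtered <- dat[dat$state %in% keep_states, ]
```

Under that assumption, IA is dropped and the three NE rows remain.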

It is possible to include another factor on either side of the slash "/". Suppose the data had another factor for political party called "party". Keep only states with "at least 2 combinations of party:year per state":

con_filter(dat, ~ 2 * party:year / state)
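One way to picture a combined term such as party:year is to treat each pair as a single level and count those pairs per state. The sketch below uses made-up data (the `party` values are invented for illustration) and base R's interaction(); it is not taken from the package's implementation.

```r
# Hypothetical data with a third factor, party.
dat <- data.frame(state = c("NE", "NE", "NE", "IA", "IA"),
                  year  = c(1, 1, 2, 2, 2),
                  party = c("A", "B", "A", "A", "A"),
                  value = 11:15)

# Treat each observed party:year pair as one combined level.
combo <- interaction(dat$party, dat$year, sep = ":", drop = TRUE)

# Count the distinct party:year combinations observed in each state.
combos_per_state <- rowSums(table(dat$state, combo) > 0)

# Keep states with at least 2 distinct combinations.
keep_states <- names(combos_per_state)[combos_per_state >= 2]
filtered <- dat[dat$state %in% keep_states, ]
```

Here NE has three distinct combinations (A:1, B:1, A:2), while IA has only one (A:2, observed twice), so only the NE rows are kept.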

If the formula contains a response variable, rows with a missing response are dropped first, and the two-way filtering is then applied to the remaining factor combinations:

con_filter(dat, value ~ 2 * state / year)
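The order of the two steps matters, as the following base R sketch suggests. It uses a modified copy of the toy data in which one response value has been set to NA (an assumption made purely for illustration) and imitates the described behavior rather than calling the package.

```r
# Toy data with one missing response (NE, year 2), for illustration only.
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year  = c(1, 2, 2, 3, 3),
                  value = c(11, NA, 13, 14, 15))

# Step 1: drop rows whose response is missing.
complete <- dat[!is.na(dat$value), ]

# Step 2: count distinct states per year using only the remaining rows.
states_per_year <- rowSums(table(complete$year, complete$state) > 0)

# Keep years with at least 2 states.
keep_years <- names(states_per_year)[states_per_year >= 2]
filtered <- complete[complete$year %in% keep_years, ]
```

Without step 1, year 2 would appear to have two states; once the missing value is removed, only year 3 has two states with data.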

Usage

con_filter(data, formula, verbose = TRUE, returndropped = FALSE)

Arguments

data: A dataframe
formula: A formula naming two factors in the dataframe that specifies the filtering criteria, like y ~ 2 * f1 / f2. The response (y) is optional.
verbose: If TRUE, print some diagnostic information about what data is being deleted. (Similar to the 'tidylog' package).
returndropped: If TRUE, return the dropped rows instead of the kept rows. Default is FALSE.

Value

The original dataframe is returned, minus the rows that are filtered out. If returndropped is TRUE, the dropped rows are returned instead.

References

None.

Author

Kevin Wright

Examples

dat <- data.frame(
  gen = c("G3", "G4", "G1", "G2", "G3", "G4", "G5",
          "G1", "G2", "G3", "G4", "G5",
          "G1", "G2", "G3", "G4", "G5",
          "G1", "G2", "G3", "G4", "G5"),
  env = c("E1", "E1", "E1", "E1", "E1", "E1", "E1",
          "E2", "E2", "E2", "E2", "E2",
          "E3", "E3", "E3", "E3", "E3",
          "E4", "E4", "E4", "E4", "E4"),
  yield = c(65, 50, NA, NA, 65, 50, 60,
            NA, 71, 76, 80, 82,
            90, 93, 95, 102, 97,
            98, 102, 105, 130, 135))

# How many observations are there for each combination of gen*env?
with( subset(dat, !is.na(yield)) , table(gen,env) )
#>     env
#> gen  E1 E2 E3 E4
#>   G1  0  0  1  1
#>   G2  0  1  1  1
#>   G3  2  1  1  1
#>   G4  2  1  1  1
#>   G5  1  1  1  1

# Note, if there is no response variable, the two-way filtering is based
# only on the presence of the factor combinations.
dat1 <- con_filter(dat, ~ 4*env / gen)
#> Deleted 0 of 22 rows of data.

# If there is a response variable, missing values are dropped first,
# then the two-way filtering is based on the factor combinations.

dat1 <- con_filter(dat, yield ~ 4*env/gen)
#> Dropping these 2 of 5 levels of gen:
#> [1] "G1" "G2"
#> Deleted 5 of 19 rows of data.
dat1 <- con_filter(dat, yield ~ 5*env/ gen)
#> Dropping these 5 of 5 levels of gen:
#> [1] "G1" "G2" "G3" "G4" "G5"
#> Deleted 19 of 19 rows of data.
#> Warning: No data remains.
dat1 <- con_filter(dat, yield ~ 6*gen/ env)
#> Dropping these 4 of 4 levels of env:
#> [1] "E1" "E2" "E3" "E4"
#> Deleted 19 of 19 rows of data.
#> Warning: No data remains.
dat1 <- con_filter(dat, yield ~ .8 *env / gen)
#> Dropping these 2 of 5 levels of gen:
#> [1] "G1" "G2"
#> Deleted 5 of 19 rows of data.
dat1 <- con_filter(dat, yield ~ .8* gen / env)
#> Dropping these 1 of 4 levels of env:
#> [1] "E1"
#> Deleted 5 of 19 rows of data.
dat1 <- con_filter(dat, yield ~ 7 * env / gen)
#> Dropping these 5 of 5 levels of gen:
#> [1] "G1" "G2" "G3" "G4" "G5"
#> Deleted 19 of 19 rows of data.
#> Warning: No data remains.