3 👆 Selecting and Filtering Data

Let’s practice importing data, from 2. Download the periodic table data now, and import it into R as periodic_table.

  • Remember to move this into your working directory: getwd()

  • Since this is a CSV file, use the From Text (readr) function.

  • Change the name of the data set to periodic_table before you import it. Remember that’s on the bottom left.

The first 10 rows of periodic_table are reproduced below.

TABLE 3.1: The Periodic Table
atomic_number symbol name name_origin group period block state_at_stp occurrence description atomic_weight aw_uncertainty any_stable_nuclides density density_predicted melting_point mp_predicted boiling_point bp_predicted heat_capacity electronegativity abundance
1 H Hydrogen composed of the Greek elements hydro- and -gen meaning ‘water-forming’ 1 1 s Gas Primordial Non-metal 1.008 NA Yes 0.0001 FALSE 14.010 FALSE 20.28 FALSE 14.304 2.20 1.40e+03
2 He Helium the Greek helios, ‘sun’ 18 1 s Gas Primordial Noble gas 4.003 2 Yes 0.0002 FALSE 0.956 FALSE 4.22 FALSE 5.193 NA 8.00e-03
3 Li Lithium the Greek lithos, ‘stone’ 1 2 s Solid Primordial Alkali metal 6.940 NA Yes 0.5340 FALSE 453.690 FALSE 1560.00 FALSE 3.582 0.98 2.00e+01
4 Be Beryllium beryl, a mineral 2 2 s Solid Primordial Alkaline earth metal 9.012 5 Yes 1.8500 FALSE 1560.000 FALSE 2742.00 FALSE 1.825 1.57 2.80e+00
5 B Boron borax, a mineral 13 2 p Solid Primordial Metalloid 10.810 NA Yes 2.3400 FALSE 2349.000 FALSE 4200.00 FALSE 1.026 2.04 1.00e+01
6 C Carbon the Latin carbo, ‘coal’ 14 2 p Solid Primordial Non-metal 12.011 NA Yes 2.2670 FALSE 3800.000 FALSE 4300.00 FALSE 0.709 2.55 2.00e+02
7 N Nitrogen the Greek nitron and ‘-gen’ meaning ‘niter-forming’ 15 2 p Gas Primordial Non-metal 14.007 NA Yes 0.0013 FALSE 63.150 FALSE 77.36 FALSE 1.040 3.04 1.90e+01
8 O Oxygen from the Greek oxy-, both ‘sharp’ and ‘acid’, and -gen, meaning ‘acid-forming’ 16 2 p Gas Primordial Non-metal 15.999 NA Yes 0.0014 FALSE 54.360 FALSE 90.20 FALSE 0.918 3.44 4.61e+05
9 F Fluorine the Latin fluere, ‘to flow’ 17 2 p Gas Primordial Halogen 18.998 6 Yes 0.0017 FALSE 53.530 FALSE 85.03 FALSE 0.824 3.98 5.85e+02
10 Ne Neon the Greek neos, meaning ‘new’ 18 2 p Gas Primordial Noble gas 20.180 6 Yes 0.0009 FALSE 24.560 FALSE 27.07 FALSE 1.030 NA 5.00e-03

So we have a big data frame of information about the periodic table, but what if we wanted to extract smaller pieces of the data frame?

3.1 Selecting a column using $

By default, R will organize character strings in alphabetical order. To see this, we’ll introduce two new features: the table function and the $ operator.

Let’s say I’m interested in elements in a gas, liquid, and solid state at room temperature. Looking at the periodic_table columns, I see that the column that gives me that information is state_at_stp.

periodic_table$state_at_stp
##   [1] "Gas"    "Gas"    "Solid"  "Solid"  "Solid"  "Solid" 
##   [7] "Gas"    "Gas"    "Gas"    "Gas"    "Solid"  "Solid" 
##  [13] "Solid"  "Solid"  "Solid"  "Solid"  "Gas"    "Gas"   
##  [19] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [25] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [31] "Solid"  "Solid"  "Solid"  "Solid"  "Liquid" "Gas"   
##  [37] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [43] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [49] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Gas"   
##  [55] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [61] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [67] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [73] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [79] "Solid"  "Liquid" "Solid"  "Solid"  "Solid"  "Solid" 
##  [85] "Solid"  "Gas"    "Solid"  "Solid"  "Solid"  "Solid" 
##  [91] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [97] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
## [103] "Solid"  NA       NA       NA       NA       NA      
## [109] NA       NA       NA       NA       NA       NA      
## [115] NA       NA       NA       NA

As you can see, to access specific variables inside a data frame, we can do so by entering the name of the data frame followed by $ and the name of the variable. (Note that spaces in variable names will not work. You’ll likely learn that the hard way, as I have.)

3.2 Selecting an element using [ , ]

Recall in 1.3 section that we could use the [ ] operator, to select elements from a single vector. We can extend these operations into a dataframe.

A single vector can be thought of as a dataframe with one row and \(n\) columns, which is why you just needed a single number to select an element. Now, since there are multiple rows and columns, you need a comma to differentiate the row and column index.

The [ ] operator allows you to select elements within a vector or a dataframe based on its row and column position. It starts by locating the row position, and then the column position.

3.2.1 Selecting a single element using [ , ]

Let’s say I wanted to select the name for Sodium, that is in row 11, column 3.

periodic_table[11, 3]
## # A tibble: 1 × 1
##   name  
##   <chr> 
## 1 Sodium
periodic_table$name[11]
## [1] "Sodium"

Notice that both commands above give me the desired value– Sodium– but the formats are different. + In the first command, R gives me back a table with 1 row and 1 column. This isn’t a coincidence, because you pulled directly from the table itself. It’s the difference between copying a “cell” in Excel versus double-clicking on the cell, and then copying the text inside. Here, you directly pulled from the table, so R returns back a value within a table. + The second gives me a single character vector. This makes sense– the $ operator selects a single vector name from the table, and then selects the 11th element. Generally, this is the way you want to select a single table.

Recall the use of [ ] when a vector was printed, to help us better understand where we were in printing a large vector. We can use this same tool to select the tenth to the twentieth elements of the periodic_table$name variable:

periodic_table$name[10:20]
##  [1] "Neon"       "Sodium"     "Magnesium"  "Aluminium" 
##  [5] "Silicon"    "Phosphorus" "Sulfur"     "Chlorine"  
##  [9] "Argon"      "Potassium"  "Calcium"

3.3 Selecting rows and columns using [ , ]

You have now seen how to select specific elements of a vector or a variable, but what if we want a subset of the values in the full data frame across both rows (observations) and columns (variables).?

We can use [ , ] where the spot before the comma corresponds to rows and the spot after the comma corresponds to columns. Let’s select rows 40 to 50 and columns 1, 2, and 4 from periodic_table:

periodic_table[41:50, c(1, 2, 4)]
## # A tibble: 10 × 3
##    atomic_number symbol name_origin                        
##            <dbl> <chr>  <chr>                              
##  1            41 Nb     Niobe, daughter of king Tantalus f…
##  2            42 Mo     the Greek molybdos meaning 'lead'  
##  3            43 Tc     the Greek tekhn??tos meaning 'arti…
##  4            44 Ru     Ruthenia, the New Latin name for R…
##  5            45 Rh     the Greek rhodos, meaning 'rose co…
##  6            46 Pd     the then recently discovered aster…
##  7            47 Ag     English word (argentum in Latin)   
##  8            48 Cd     the New Latin cadmia, from King Ka…
##  9            49 In     indigo                             
## 10            50 Sn     English word (stannum in Latin)

3.3.1 “Let’s eat Grandma!” vs. “Let’s eat, Grandma!”

The comma is very important. Let’s look at some examples:

#Returns all of column 2:
periodic_table[2]
## # A tibble: 118 × 1
##    symbol
##    <chr> 
##  1 H     
##  2 He    
##  3 Li    
##  4 Be    
##  5 B     
##  6 C     
##  7 N     
##  8 O     
##  9 F     
## 10 Ne    
## # … with 108 more rows
#Returns all of row 2:
periodic_table[2, ]
## # A tibble: 1 × 22
##   atomic_number symbol name  name_origin group period block
##           <dbl> <chr>  <chr> <chr>       <dbl>  <dbl> <chr>
## 1             2 He     Heli… the Greek …    18      1 s    
## # … with 15 more variables: state_at_stp <chr>,
## #   occurrence <chr>, description <chr>,
## #   atomic_weight <dbl>, aw_uncertainty <dbl>,
## #   any_stable_nuclides <chr>, density <dbl>,
## #   density_predicted <lgl>, melting_point <dbl>,
## #   mp_predicted <lgl>, boiling_point <dbl>,
## #   bp_predicted <lgl>, heat_capacity <dbl>, …
#Returns all of column 2:
periodic_table[ ,2]
## # A tibble: 118 × 1
##    symbol
##    <chr> 
##  1 H     
##  2 He    
##  3 Li    
##  4 Be    
##  5 B     
##  6 C     
##  7 N     
##  8 O     
##  9 F     
## 10 Ne    
## # … with 108 more rows
#Returns a single element in Row 2, Column 2:
periodic_table[2,2]
## # A tibble: 1 × 1
##   symbol
##   <chr> 
## 1 He

3.4 Replacing and Removing Data

Destructive operations are any operation where you override the original data. This is not necessairly a bad thing– sometimes, we don’t have a need for the original data, but want to manipulate it in some way. However, it means that if you made a mistake, the only way to revert the mistake is to completely reload the data from its original source, and repeat all your operations again. As in 1.2, you can alter your dataframe by using the storage operator <- to overwrite data.

3.4.1 Replacing one element

Let’s say that I accidentally typed ‘Abram’ as my friend’s name, and I want to replace it with ‘Abe’

friends$names[1]<- "Abe"
friends
##      names ages DC_Resident fav_number
## 1      Abe   34        TRUE       1.00
## 2   Bryant   35       FALSE       2.17
## 3  Colleen   32       FALSE      26.00
## 4    David   29        TRUE       7.00
## 5   Esther   30       FALSE      10.00
## 6 Jeremiah   30        TRUE       9.00

3.4.2 Replacing columns with a mathematical formula

I want to see what everybody’s age is in 50 years.

friends$ages<- friends$ages+50
friends
##      names ages DC_Resident fav_number
## 1      Abe   84        TRUE       1.00
## 2   Bryant   85       FALSE       2.17
## 3  Colleen   82       FALSE      26.00
## 4    David   79        TRUE       7.00
## 5   Esther   80       FALSE      10.00
## 6 Jeremiah   80        TRUE       9.00

3.4.3 Replacing entire columns with another vector

I want to add their last names.

friends$names <- c("Abram A", "Bryant B", "Colleen C", "David D", "Esther E", "Jeremiah J")
friends
##        names ages DC_Resident fav_number
## 1    Abram A   84        TRUE       1.00
## 2   Bryant B   85       FALSE       2.17
## 3  Colleen C   82       FALSE      26.00
## 4    David D   79        TRUE       7.00
## 5   Esther E   80       FALSE      10.00
## 6 Jeremiah J   80        TRUE       9.00

Note in the last example, I had to type the entire vector I planned to override. That’s because these are characters, and if you use the + operator, R will return an error saying that you can’t add “a” with “b”. We will discuss a better way to do this, below.

3.4.4 Dropping Rows

Or I added the first person by mistake to my Friends list, so I have to drop the first row that represents “Abram A”:

friends<- friends[-1, ]

friends
##        names ages DC_Resident fav_number
## 2   Bryant B   85       FALSE       2.17
## 3  Colleen C   82       FALSE      26.00
## 4    David D   79        TRUE       7.00
## 5   Esther E   80       FALSE      10.00
## 6 Jeremiah J   80        TRUE       9.00

3.5 🤔 Selecting Data Using Logical Conditions

3.5.1 Using %in%

Recall in 1.3.1 that we could use a logical vector, and the %in% operator, to select elements from a single vector. We can extend these operations into a dataframe.

At its most basic level, you can use the == operator to test whether one element exactly matches another. For example, "gas"=="gas" returns TRUE, but "gas"=="Gas" will return false.

Let’s say that I want to find any elements that are naturally gasses.

periodic_table[periodic_table$state_at_stp == "Liquid", ]
## # A tibble: 17 × 22
##    atomic_number symbol name    name_origin    group period
##            <dbl> <chr>  <chr>   <chr>          <dbl>  <dbl>
##  1            35 Br     Bromine the Greek bro…    17      4
##  2            80 Hg     Mercury the New Latin…    12      6
##  3            NA <NA>   <NA>    <NA>              NA     NA
##  4            NA <NA>   <NA>    <NA>              NA     NA
##  5            NA <NA>   <NA>    <NA>              NA     NA
##  6            NA <NA>   <NA>    <NA>              NA     NA
##  7            NA <NA>   <NA>    <NA>              NA     NA
##  8            NA <NA>   <NA>    <NA>              NA     NA
##  9            NA <NA>   <NA>    <NA>              NA     NA
## 10            NA <NA>   <NA>    <NA>              NA     NA
## 11            NA <NA>   <NA>    <NA>              NA     NA
## 12            NA <NA>   <NA>    <NA>              NA     NA
## 13            NA <NA>   <NA>    <NA>              NA     NA
## 14            NA <NA>   <NA>    <NA>              NA     NA
## 15            NA <NA>   <NA>    <NA>              NA     NA
## 16            NA <NA>   <NA>    <NA>              NA     NA
## 17            NA <NA>   <NA>    <NA>              NA     NA
## # … with 16 more variables: block <chr>,
## #   state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …

You’ll notice that this returns back a massive table of NAs, rather than just the two elements that exist in liquid state! Why?

Let’s examine the command periodic_table$state_at_stp=="Liquid" more carefully.

periodic_table$state_at_stp=="Liquid"
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [46] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [55] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [82] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [91] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE    NA    NA    NA    NA    NA
## [109]    NA    NA    NA    NA    NA    NA    NA    NA    NA
## [118]    NA
  • Every time the element’s state is not Liquid, R returns False, instead of ignoring the row, which means you get a logical vector of all 118 elements.
  • This gets passed onto the subset function, which, rather than ignore any row that says FALSE, returns each row as entirely NAs.

Although this technically does the job you want, it is not a very elegant solution, which is why the %in% operator is better.

periodic_table[periodic_table$state_at_stp %in% "Liquid", ]
## # A tibble: 2 × 22
##   atomic_number symbol name  name_origin group period block
##           <dbl> <chr>  <chr> <chr>       <dbl>  <dbl> <chr>
## 1            35 Br     Brom… the Greek …    17      4 p    
## 2            80 Hg     Merc… the New La…    12      6 d    
## # … with 15 more variables: state_at_stp <chr>,
## #   occurrence <chr>, description <chr>,
## #   atomic_weight <dbl>, aw_uncertainty <dbl>,
## #   any_stable_nuclides <chr>, density <dbl>,
## #   density_predicted <lgl>, melting_point <dbl>,
## #   mp_predicted <lgl>, boiling_point <dbl>,
## #   bp_predicted <lgl>, heat_capacity <dbl>, …

Much better. Notice that, with this call, we didn’t specify any columns for R to filter out, so it left those blank.

Now we can think about how to subset an entire data frame using the same sort of creation of two logical vectors (one for rows and one for columns):

periodic_table[ (periodic_table$name %in% c("Hydrogen", "Oxygen") ),
        c("atomic_weight", "state_at_stp")]
## # A tibble: 2 × 2
##   atomic_weight state_at_stp
##           <dbl> <chr>       
## 1          1.01 Gas         
## 2         16.0  Gas

The extra parentheses around periodic_table$name %in% c("Hydrogen", "Oxygen") are a good habit to get into as they ensure everything before the comma is used to select specific rows matching that condition. For the columns, we can specify a vector of the column names to focus on only those variables. The resulting table here gives the atomic_weight and state_at_stp for "Hydrogen" and then for "Oxygen".

There are many more complicated ways to subset a data frame and one can use the subset function built into R, but in my experience, whenever you want to do anything more complicated than what we have done here, it is easier to use the filter and select functions in the dplyr package.

3.5.2 Using Comparison Operators < and >

These operators are self explanatory. The < and > can be used to select rows based on a numeric condition.

!= is the not equals operator. This operator allows for both numeric and character types.

# Prints any row whose atomic number is less than 10.
periodic_table[periodic_table$atomic_number<10,]
## # A tibble: 9 × 22
##   atomic_number symbol name  name_origin group period block
##           <dbl> <chr>  <chr> <chr>       <dbl>  <dbl> <chr>
## 1             1 H      Hydr… composed o…     1      1 s    
## 2             2 He     Heli… the Greek …    18      1 s    
## 3             3 Li     Lith… the Greek …     1      2 s    
## 4             4 Be     Bery… beryl, a m…     2      2 s    
## 5             5 B      Boron borax, a m…    13      2 p    
## 6             6 C      Carb… the Latin …    14      2 p    
## 7             7 N      Nitr… the Greek …    15      2 p    
## 8             8 O      Oxyg… from the G…    16      2 p    
## 9             9 F      Fluo… the Latin …    17      2 p    
## # … with 15 more variables: state_at_stp <chr>,
## #   occurrence <chr>, description <chr>,
## #   atomic_weight <dbl>, aw_uncertainty <dbl>,
## #   any_stable_nuclides <chr>, density <dbl>,
## #   density_predicted <lgl>, melting_point <dbl>,
## #   mp_predicted <lgl>, boiling_point <dbl>,
## #   bp_predicted <lgl>, heat_capacity <dbl>, …
#Prints any row whose state at standard room temperature is not Solid.
periodic_table[periodic_table$state_at_stp != "Solid",]
## # A tibble: 28 × 22
##    atomic_number symbol name     name_origin   group period
##            <dbl> <chr>  <chr>    <chr>         <dbl>  <dbl>
##  1             1 H      Hydrogen composed of …     1      1
##  2             2 He     Helium   the Greek he…    18      1
##  3             7 N      Nitrogen the Greek ni…    15      2
##  4             8 O      Oxygen   from the Gre…    16      2
##  5             9 F      Fluorine the Latin fl…    17      2
##  6            10 Ne     Neon     the Greek ne…    18      2
##  7            17 Cl     Chlorine the Greek ch…    17      3
##  8            18 Ar     Argon    the Greek ar…    18      3
##  9            35 Br     Bromine  the Greek br…    17      4
## 10            36 Kr     Krypton  the Greek kr…    18      4
## # … with 18 more rows, and 16 more variables: block <chr>,
## #   state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …

3.5.3 Combining logical operators with “or”, “and” conditions

Finally, the & operator means and, and the | operator means or. You can fund | by pressing the shirt key and kitting the “reverse slash” key above the enter key on your keyboard.

It allows you to combine multiple logical statements together, which may be useful if you have multiple criteria to work on.

#Prints any element which is Non-metal AND whose  atomic number is less than 15
periodic_table[periodic_table$description != "Non-metal" & periodic_table$atomic_number <15,]
## # A tibble: 10 × 22
##    atomic_number symbol name      name_origin  group period
##            <dbl> <chr>  <chr>     <chr>        <dbl>  <dbl>
##  1             2 He     Helium    the Greek h…    18      1
##  2             3 Li     Lithium   the Greek l…     1      2
##  3             4 Be     Beryllium beryl, a mi…     2      2
##  4             5 B      Boron     borax, a mi…    13      2
##  5             9 F      Fluorine  the Latin f…    17      2
##  6            10 Ne     Neon      the Greek n…    18      2
##  7            11 Na     Sodium    the English…     1      3
##  8            12 Mg     Magnesium Magnesia, a…     2      3
##  9            13 Al     Aluminium from alumin…    13      3
## 10            14 Si     Silicon   from the La…    14      3
## # … with 16 more variables: block <chr>,
## #   state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …
#Prints any element which is Non-metal OR whose atomic number is less than 15
periodic_table[periodic_table$description != "Non-metal" | periodic_table$atomic_number <15,]
## # A tibble: 115 × 22
##    atomic_number symbol name      name_origin  group period
##            <dbl> <chr>  <chr>     <chr>        <dbl>  <dbl>
##  1             1 H      Hydrogen  composed of…     1      1
##  2             2 He     Helium    the Greek h…    18      1
##  3             3 Li     Lithium   the Greek l…     1      2
##  4             4 Be     Beryllium beryl, a mi…     2      2
##  5             5 B      Boron     borax, a mi…    13      2
##  6             6 C      Carbon    the Latin c…    14      2
##  7             7 N      Nitrogen  the Greek n…    15      2
##  8             8 O      Oxygen    from the Gr…    16      2
##  9             9 F      Fluorine  the Latin f…    17      2
## 10            10 Ne     Neon      the Greek n…    18      2
## # … with 105 more rows, and 16 more variables:
## #   block <chr>, state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …