3 👆 Selecting and Filtering Data

Let’s practice importing data, from 2. Download the periodic table data now, and import it into R as periodic_table.

Remember to move this into your working directory: getwd()
Since this is a CSV file, use the From Text (readr) function.
Change the name of the data set to periodic_table before you import it. Remember that’s on the bottom left.

The first 10 rows of periodic_table are reproduced below.

TABLE 3.1: The Periodic Table
atomic_number	symbol	name	name_origin	group	period	block	state_at_stp	occurrence	description	atomic_weight	aw_uncertainty	any_stable_nuclides	density	density_predicted	melting_point	mp_predicted	boiling_point	bp_predicted	heat_capacity	electronegativity	abundance
1	H	Hydrogen	composed of the Greek elements hydro- and -gen meaning ‘water-forming’	1	1	s	Gas	Primordial	Non-metal	1.008	NA	Yes	0.0001	FALSE	14.010	FALSE	20.28	FALSE	14.304	2.20	1.40e+03
2	He	Helium	the Greek helios, ‘sun’	18	1	s	Gas	Primordial	Noble gas	4.003	2	Yes	0.0002	FALSE	0.956	FALSE	4.22	FALSE	5.193	NA	8.00e-03
3	Li	Lithium	the Greek lithos, ‘stone’	1	2	s	Solid	Primordial	Alkali metal	6.940	NA	Yes	0.5340	FALSE	453.690	FALSE	1560.00	FALSE	3.582	0.98	2.00e+01
4	Be	Beryllium	beryl, a mineral	2	2	s	Solid	Primordial	Alkaline earth metal	9.012	5	Yes	1.8500	FALSE	1560.000	FALSE	2742.00	FALSE	1.825	1.57	2.80e+00
5	B	Boron	borax, a mineral	13	2	p	Solid	Primordial	Metalloid	10.810	NA	Yes	2.3400	FALSE	2349.000	FALSE	4200.00	FALSE	1.026	2.04	1.00e+01
6	C	Carbon	the Latin carbo, ‘coal’	14	2	p	Solid	Primordial	Non-metal	12.011	NA	Yes	2.2670	FALSE	3800.000	FALSE	4300.00	FALSE	0.709	2.55	2.00e+02
7	N	Nitrogen	the Greek nitron and ‘-gen’ meaning ‘niter-forming’	15	2	p	Gas	Primordial	Non-metal	14.007	NA	Yes	0.0013	FALSE	63.150	FALSE	77.36	FALSE	1.040	3.04	1.90e+01
8	O	Oxygen	from the Greek oxy-, both ‘sharp’ and ‘acid’, and -gen, meaning ‘acid-forming’	16	2	p	Gas	Primordial	Non-metal	15.999	NA	Yes	0.0014	FALSE	54.360	FALSE	90.20	FALSE	0.918	3.44	4.61e+05
9	F	Fluorine	the Latin fluere, ‘to flow’	17	2	p	Gas	Primordial	Halogen	18.998	6	Yes	0.0017	FALSE	53.530	FALSE	85.03	FALSE	0.824	3.98	5.85e+02
10	Ne	Neon	the Greek neos, meaning ‘new’	18	2	p	Gas	Primordial	Noble gas	20.180	6	Yes	0.0009	FALSE	24.560	FALSE	27.07	FALSE	1.030	NA	5.00e-03

So we have a big data frame of information about the periodic table, but what if we wanted to extract smaller pieces of the data frame?

3.1 Selecting a column using `$`

By default, R will organize character strings in alphabetical order. To see this, we’ll introduce two new features: the table function and the $ operator.

Let’s say I’m interested in elements in a gas, liquid, and solid state at room temperature. Looking at the periodic_table columns, I see that the column that gives me that information is state_at_stp.

periodic_table$state_at_stp

##   [1] "Gas"    "Gas"    "Solid"  "Solid"  "Solid"  "Solid" 
##   [7] "Gas"    "Gas"    "Gas"    "Gas"    "Solid"  "Solid" 
##  [13] "Solid"  "Solid"  "Solid"  "Solid"  "Gas"    "Gas"   
##  [19] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [25] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [31] "Solid"  "Solid"  "Solid"  "Solid"  "Liquid" "Gas"   
##  [37] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [43] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [49] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Gas"   
##  [55] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [61] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [67] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [73] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [79] "Solid"  "Liquid" "Solid"  "Solid"  "Solid"  "Solid" 
##  [85] "Solid"  "Gas"    "Solid"  "Solid"  "Solid"  "Solid" 
##  [91] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
##  [97] "Solid"  "Solid"  "Solid"  "Solid"  "Solid"  "Solid" 
## [103] "Solid"  NA       NA       NA       NA       NA      
## [109] NA       NA       NA       NA       NA       NA      
## [115] NA       NA       NA       NA

As you can see, to access specific variables inside a data frame, we can do so by entering the name of the data frame followed by $ and the name of the variable. (Note that spaces in variable names will not work. You’ll likely learn that the hard way, as I have.)

3.2 Selecting an element using `[ , ]`

Recall in 1.3 section that we could use the [ ] operator, to select elements from a single vector. We can extend these operations into a dataframe.

A single vector can be thought of as a dataframe with one row and $n$ columns, which is why you just needed a single number to select an element. Now, since there are multiple rows and columns, you need a comma to differentiate the row and column index.

The [ ] operator allows you to select elements within a vector or a dataframe based on its row and column position. It starts by locating the row position, and then the column position.

3.2.1 Selecting a single element using `[ , ]`

Let’s say I wanted to select the name for Sodium, that is in row 11, column 3.

periodic_table[11, 3]

## # A tibble: 1 × 1
##   name  
##   <chr> 
## 1 Sodium

periodic_table$name[11]

## [1] "Sodium"

Notice that both commands above give me the desired value– Sodium– but the formats are different. + In the first command, R gives me back a table with 1 row and 1 column. This isn’t a coincidence, because you pulled directly from the table itself. It’s the difference between copying a “cell” in Excel versus double-clicking on the cell, and then copying the text inside. Here, you directly pulled from the table, so R returns back a value within a table. + The second gives me a single character vector. This makes sense– the $ operator selects a single vector name from the table, and then selects the 11th element. Generally, this is the way you want to select a single table.

Recall the use of [ ] when a vector was printed, to help us better understand where we were in printing a large vector. We can use this same tool to select the tenth to the twentieth elements of the periodic_table$name variable:

periodic_table$name[10:20]

##  [1] "Neon"       "Sodium"     "Magnesium"  "Aluminium" 
##  [5] "Silicon"    "Phosphorus" "Sulfur"     "Chlorine"  
##  [9] "Argon"      "Potassium"  "Calcium"

3.3 Selecting rows and columns using `[ , ]`

You have now seen how to select specific elements of a vector or a variable, but what if we want a subset of the values in the full data frame across both rows (observations) and columns (variables).?

We can use [ , ] where the spot before the comma corresponds to rows and the spot after the comma corresponds to columns. Let’s select rows 40 to 50 and columns 1, 2, and 4 from periodic_table:

periodic_table[41:50, c(1, 2, 4)]

## # A tibble: 10 × 3
##    atomic_number symbol name_origin                        
##            <dbl> <chr>  <chr>                              
##  1            41 Nb     Niobe, daughter of king Tantalus f…
##  2            42 Mo     the Greek molybdos meaning 'lead'  
##  3            43 Tc     the Greek tekhn??tos meaning 'arti…
##  4            44 Ru     Ruthenia, the New Latin name for R…
##  5            45 Rh     the Greek rhodos, meaning 'rose co…
##  6            46 Pd     the then recently discovered aster…
##  7            47 Ag     English word (argentum in Latin)   
##  8            48 Cd     the New Latin cadmia, from King Ka…
##  9            49 In     indigo                             
## 10            50 Sn     English word (stannum in Latin)

3.3.1 “Let’s eat Grandma!” vs. “Let’s eat, Grandma!”

The comma is very important. Let’s look at some examples:

#Returns all of column 2:
periodic_table[2]

## # A tibble: 118 × 1
##    symbol
##    <chr> 
##  1 H     
##  2 He    
##  3 Li    
##  4 Be    
##  5 B     
##  6 C     
##  7 N     
##  8 O     
##  9 F     
## 10 Ne    
## # … with 108 more rows

#Returns all of row 2:
periodic_table[2, ]

## # A tibble: 1 × 22
##   atomic_number symbol name  name_origin group period block
##           <dbl> <chr>  <chr> <chr>       <dbl>  <dbl> <chr>
## 1             2 He     Heli… the Greek …    18      1 s    
## # … with 15 more variables: state_at_stp <chr>,
## #   occurrence <chr>, description <chr>,
## #   atomic_weight <dbl>, aw_uncertainty <dbl>,
## #   any_stable_nuclides <chr>, density <dbl>,
## #   density_predicted <lgl>, melting_point <dbl>,
## #   mp_predicted <lgl>, boiling_point <dbl>,
## #   bp_predicted <lgl>, heat_capacity <dbl>, …

#Returns all of column 2:
periodic_table[ ,2]

## # A tibble: 118 × 1
##    symbol
##    <chr> 
##  1 H     
##  2 He    
##  3 Li    
##  4 Be    
##  5 B     
##  6 C     
##  7 N     
##  8 O     
##  9 F     
## 10 Ne    
## # … with 108 more rows

#Returns a single element in Row 2, Column 2:
periodic_table[2,2]

## # A tibble: 1 × 1
##   symbol
##   <chr> 
## 1 He

3.4 Replacing and Removing Data

Destructive operations are any operation where you override the original data. This is not necessairly a bad thing– sometimes, we don’t have a need for the original data, but want to manipulate it in some way. However, it means that if you made a mistake, the only way to revert the mistake is to completely reload the data from its original source, and repeat all your operations again. As in 1.2, you can alter your dataframe by using the storage operator <- to overwrite data.

3.4.1 Replacing one element

Let’s say that I accidentally typed ‘Abram’ as my friend’s name, and I want to replace it with ‘Abe’

friends$names[1]<- "Abe"
friends

##      names ages DC_Resident fav_number
## 1      Abe   34        TRUE       1.00
## 2   Bryant   35       FALSE       2.17
## 3  Colleen   32       FALSE      26.00
## 4    David   29        TRUE       7.00
## 5   Esther   30       FALSE      10.00
## 6 Jeremiah   30        TRUE       9.00

3.4.2 Replacing columns with a mathematical formula

I want to see what everybody’s age is in 50 years.

friends$ages<- friends$ages+50
friends

##      names ages DC_Resident fav_number
## 1      Abe   84        TRUE       1.00
## 2   Bryant   85       FALSE       2.17
## 3  Colleen   82       FALSE      26.00
## 4    David   79        TRUE       7.00
## 5   Esther   80       FALSE      10.00
## 6 Jeremiah   80        TRUE       9.00

3.4.3 Replacing entire columns with another vector

I want to add their last names.

friends$names <- c("Abram A", "Bryant B", "Colleen C", "David D", "Esther E", "Jeremiah J")
friends

##        names ages DC_Resident fav_number
## 1    Abram A   84        TRUE       1.00
## 2   Bryant B   85       FALSE       2.17
## 3  Colleen C   82       FALSE      26.00
## 4    David D   79        TRUE       7.00
## 5   Esther E   80       FALSE      10.00
## 6 Jeremiah J   80        TRUE       9.00

Note in the last example, I had to type the entire vector I planned to override. That’s because these are characters, and if you use the + operator, R will return an error saying that you can’t add “a” with “b”. We will discuss a better way to do this, below.

3.4.4 Dropping Rows

Or I added the first person by mistake to my Friends list, so I have to drop the first row that represents “Abram A”:

friends<- friends[-1, ]

friends

##        names ages DC_Resident fav_number
## 2   Bryant B   85       FALSE       2.17
## 3  Colleen C   82       FALSE      26.00
## 4    David D   79        TRUE       7.00
## 5   Esther E   80       FALSE      10.00
## 6 Jeremiah J   80        TRUE       9.00

3.5 🤔 Selecting Data Using Logical Conditions

3.5.1 Using `%in%`

Recall in 1.3.1 that we could use a logical vector, and the %in% operator, to select elements from a single vector. We can extend these operations into a dataframe.

At its most basic level, you can use the == operator to test whether one element exactly matches another. For example, "gas"=="gas" returns TRUE, but "gas"=="Gas" will return false.

Let’s say that I want to find any elements that are naturally gasses.

periodic_table[periodic_table$state_at_stp == "Liquid", ]

## # A tibble: 17 × 22
##    atomic_number symbol name    name_origin    group period
##            <dbl> <chr>  <chr>   <chr>          <dbl>  <dbl>
##  1            35 Br     Bromine the Greek bro…    17      4
##  2            80 Hg     Mercury the New Latin…    12      6
##  3            NA <NA>   <NA>    <NA>              NA     NA
##  4            NA <NA>   <NA>    <NA>              NA     NA
##  5            NA <NA>   <NA>    <NA>              NA     NA
##  6            NA <NA>   <NA>    <NA>              NA     NA
##  7            NA <NA>   <NA>    <NA>              NA     NA
##  8            NA <NA>   <NA>    <NA>              NA     NA
##  9            NA <NA>   <NA>    <NA>              NA     NA
## 10            NA <NA>   <NA>    <NA>              NA     NA
## 11            NA <NA>   <NA>    <NA>              NA     NA
## 12            NA <NA>   <NA>    <NA>              NA     NA
## 13            NA <NA>   <NA>    <NA>              NA     NA
## 14            NA <NA>   <NA>    <NA>              NA     NA
## 15            NA <NA>   <NA>    <NA>              NA     NA
## 16            NA <NA>   <NA>    <NA>              NA     NA
## 17            NA <NA>   <NA>    <NA>              NA     NA
## # … with 16 more variables: block <chr>,
## #   state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …

You’ll notice that this returns back a massive table of NAs, rather than just the two elements that exist in liquid state! Why?

Let’s examine the command periodic_table$state_at_stp=="Liquid" more carefully.

periodic_table$state_at_stp=="Liquid"

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [46] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [55] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [82] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [91] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE    NA    NA    NA    NA    NA
## [109]    NA    NA    NA    NA    NA    NA    NA    NA    NA
## [118]    NA

Every time the element’s state is not Liquid, R returns False, instead of ignoring the row, which means you get a logical vector of all 118 elements.
This gets passed onto the subset function, which, rather than ignore any row that says FALSE, returns each row as entirely NAs.

Although this technically does the job you want, it is not a very elegant solution, which is why the %in% operator is better.

periodic_table[periodic_table$state_at_stp %in% "Liquid", ]

## # A tibble: 2 × 22
##   atomic_number symbol name  name_origin group period block
##           <dbl> <chr>  <chr> <chr>       <dbl>  <dbl> <chr>
## 1            35 Br     Brom… the Greek …    17      4 p    
## 2            80 Hg     Merc… the New La…    12      6 d    
## # … with 15 more variables: state_at_stp <chr>,
## #   occurrence <chr>, description <chr>,
## #   atomic_weight <dbl>, aw_uncertainty <dbl>,
## #   any_stable_nuclides <chr>, density <dbl>,
## #   density_predicted <lgl>, melting_point <dbl>,
## #   mp_predicted <lgl>, boiling_point <dbl>,
## #   bp_predicted <lgl>, heat_capacity <dbl>, …

Much better. Notice that, with this call, we didn’t specify any columns for R to filter out, so it left those blank.

Now we can think about how to subset an entire data frame using the same sort of creation of two logical vectors (one for rows and one for columns):

periodic_table[ (periodic_table$name %in% c("Hydrogen", "Oxygen") ),
        c("atomic_weight", "state_at_stp")]

## # A tibble: 2 × 2
##   atomic_weight state_at_stp
##           <dbl> <chr>       
## 1          1.01 Gas         
## 2         16.0  Gas

The extra parentheses around periodic_table$name %in% c("Hydrogen", "Oxygen") are a good habit to get into as they ensure everything before the comma is used to select specific rows matching that condition. For the columns, we can specify a vector of the column names to focus on only those variables. The resulting table here gives the atomic_weight and state_at_stp for "Hydrogen" and then for "Oxygen".

There are many more complicated ways to subset a data frame and one can use the subset function built into R, but in my experience, whenever you want to do anything more complicated than what we have done here, it is easier to use the filter and select functions in the dplyr package.

3.5.2 Using Comparison Operators `<` and `>`

These operators are self explanatory. The < and > can be used to select rows based on a numeric condition.

!= is the not equals operator. This operator allows for both numeric and character types.

# Prints any row whose atomic number is less than 10.
periodic_table[periodic_table$atomic_number<10,]

## # A tibble: 9 × 22
##   atomic_number symbol name  name_origin group period block
##           <dbl> <chr>  <chr> <chr>       <dbl>  <dbl> <chr>
## 1             1 H      Hydr… composed o…     1      1 s    
## 2             2 He     Heli… the Greek …    18      1 s    
## 3             3 Li     Lith… the Greek …     1      2 s    
## 4             4 Be     Bery… beryl, a m…     2      2 s    
## 5             5 B      Boron borax, a m…    13      2 p    
## 6             6 C      Carb… the Latin …    14      2 p    
## 7             7 N      Nitr… the Greek …    15      2 p    
## 8             8 O      Oxyg… from the G…    16      2 p    
## 9             9 F      Fluo… the Latin …    17      2 p    
## # … with 15 more variables: state_at_stp <chr>,
## #   occurrence <chr>, description <chr>,
## #   atomic_weight <dbl>, aw_uncertainty <dbl>,
## #   any_stable_nuclides <chr>, density <dbl>,
## #   density_predicted <lgl>, melting_point <dbl>,
## #   mp_predicted <lgl>, boiling_point <dbl>,
## #   bp_predicted <lgl>, heat_capacity <dbl>, …

#Prints any row whose state at standard room temperature is not Solid.
periodic_table[periodic_table$state_at_stp != "Solid",]

## # A tibble: 28 × 22
##    atomic_number symbol name     name_origin   group period
##            <dbl> <chr>  <chr>    <chr>         <dbl>  <dbl>
##  1             1 H      Hydrogen composed of …     1      1
##  2             2 He     Helium   the Greek he…    18      1
##  3             7 N      Nitrogen the Greek ni…    15      2
##  4             8 O      Oxygen   from the Gre…    16      2
##  5             9 F      Fluorine the Latin fl…    17      2
##  6            10 Ne     Neon     the Greek ne…    18      2
##  7            17 Cl     Chlorine the Greek ch…    17      3
##  8            18 Ar     Argon    the Greek ar…    18      3
##  9            35 Br     Bromine  the Greek br…    17      4
## 10            36 Kr     Krypton  the Greek kr…    18      4
## # … with 18 more rows, and 16 more variables: block <chr>,
## #   state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …

3.5.3 Combining logical operators with “or”, “and” conditions

Finally, the & operator means and, and the | operator means or. You can fund | by pressing the shirt key and kitting the “reverse slash” key above the enter key on your keyboard.

It allows you to combine multiple logical statements together, which may be useful if you have multiple criteria to work on.

#Prints any element which is Non-metal AND whose  atomic number is less than 15
periodic_table[periodic_table$description != "Non-metal" & periodic_table$atomic_number <15,]

## # A tibble: 10 × 22
##    atomic_number symbol name      name_origin  group period
##            <dbl> <chr>  <chr>     <chr>        <dbl>  <dbl>
##  1             2 He     Helium    the Greek h…    18      1
##  2             3 Li     Lithium   the Greek l…     1      2
##  3             4 Be     Beryllium beryl, a mi…     2      2
##  4             5 B      Boron     borax, a mi…    13      2
##  5             9 F      Fluorine  the Latin f…    17      2
##  6            10 Ne     Neon      the Greek n…    18      2
##  7            11 Na     Sodium    the English…     1      3
##  8            12 Mg     Magnesium Magnesia, a…     2      3
##  9            13 Al     Aluminium from alumin…    13      3
## 10            14 Si     Silicon   from the La…    14      3
## # … with 16 more variables: block <chr>,
## #   state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …

#Prints any element which is Non-metal OR whose atomic number is less than 15
periodic_table[periodic_table$description != "Non-metal" | periodic_table$atomic_number <15,]

## # A tibble: 115 × 22
##    atomic_number symbol name      name_origin  group period
##            <dbl> <chr>  <chr>     <chr>        <dbl>  <dbl>
##  1             1 H      Hydrogen  composed of…     1      1
##  2             2 He     Helium    the Greek h…    18      1
##  3             3 Li     Lithium   the Greek l…     1      2
##  4             4 Be     Beryllium beryl, a mi…     2      2
##  5             5 B      Boron     borax, a mi…    13      2
##  6             6 C      Carbon    the Latin c…    14      2
##  7             7 N      Nitrogen  the Greek n…    15      2
##  8             8 O      Oxygen    from the Gr…    16      2
##  9             9 F      Fluorine  the Latin f…    17      2
## 10            10 Ne     Neon      the Greek n…    18      2
## # … with 105 more rows, and 16 more variables:
## #   block <chr>, state_at_stp <chr>, occurrence <chr>,
## #   description <chr>, atomic_weight <dbl>,
## #   aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## #   density <dbl>, density_predicted <lgl>,
## #   melting_point <dbl>, mp_predicted <lgl>,
## #   boiling_point <dbl>, bp_predicted <lgl>, …

2 🔢 Creating Dataframes

4 ✨ Creating New Data