Dataframes
In what follows we assume Python has access to a dataframe called mtcars which is constructed on the same basis as the R one.
import pandas as pd
import numpy as np
mtcars = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", index_col=0)
Selecting rows
Selecting row by numeric index
R
mtcars[1,] #Select row number 1
mtcars[c(1,2),] #Select row numbers 1 and 2
This returns a 'data.frame': 1 obs. of 11 variables, or a 'data.frame': 2 obs. of 11 variables.
Python
mtcars.iloc[0,:] #Select row number 1 (zero based indexing)
mtcars.iloc[[0,1],:] #Select row numbers 1 and 2
This returns a pandas.core.series.Series if you select a single row, or a pandas.core.frame.DataFrame if you select multiple rows.
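To check this for yourself, here is a minimal sketch, assuming the mtcars dataframe loaded above:
type(mtcars.iloc[0,:]) #pandas.core.series.Series
type(mtcars.iloc[[0],:]) #pandas.core.frame.DataFrame - a list selector always returns a DataFrame, even with a single element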
Selecting row by named index
R
mtcars['Mazda RX4', ]
mtcars[c('Mazda RX4','Mazda RX4 Wag'), ]
This returns a 'data.frame': 1 obs. of 11 variables, or a 'data.frame': 2 obs. of 11 variables.
Python
mtcars.loc['Mazda RX4',:]
mtcars.loc[['Mazda RX4', 'Mazda RX4 Wag'],:]
This returns a pandas.core.series.Series if you select a single row, or a pandas.core.frame.DataFrame if you select multiple rows.
Selecting columns
Selecting column by numeric index
R
mtcars[,1] #Select column number 1
mtcars[,c(1,2)] #Select column numbers 1 and 2
If we select a single column, this returns an atomic vector. If we select multiple columns, it returns a 'data.frame': 32 obs. of 2 variables.
Python
mtcars.iloc[:,0] #Select column number 1 (zero based indexing)
mtcars.iloc[:,[0,1]] #Select column numbers 1 and 2
If we select a single column, this returns a pandas.core.series.Series. If we select multiple columns, it returns a pandas.core.frame.DataFrame.
Selecting individual ‘cells’ (items)
Selecting item by numeric index
R
mtcars[1,1]
This returns a numeric vector with a single item.
Python
mtcars.iloc[0,0]
This returns a float
Selecting item by named index
R
mtcars['Mazda RX4','mpg']
This returns a numeric atomic vector with a single item.
Python
mtcars.loc['Mazda RX4','mpg']
This returns a float.
Selecting items with a logical vector
It is possible to use a logical vector (a series of ‘truthy’ and ‘falsey’ values) to filter a dataframe.
For instance, if your rows are 1,2,3
a logical vector true, false, true
will select rows 1 and 3.
R
# The following df has 3 rows and 3 columns
df <- read.csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/62747b776c63c9bd367bcdc11fe7d6a3459eaa12/tiny.csv", row.names = 1)
logical_vector <- c(TRUE,FALSE, TRUE)
df[logical_vector,]
df[,logical_vector]
logical_vector <- c(TRUE, FALSE)
df[logical_vector,] #This works due to R's vector recycling
df[,logical_vector]
For more on vector recycling in R, see here.
Python
df = pd.read_csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/62747b776c63c9bd367bcdc11fe7d6a3459eaa12/tiny.csv", index_col=0)
logical_vector = [True, False, True]
df.loc[logical_vector,:]
df.loc[:,logical_vector]
logical_vector = [True, False]
df.loc[logical_vector,:] #Different behaviour to R - no recycling
df.loc[:,logical_vector]
Obtaining the index itself
R
names(mtcars)
rownames(mtcars)
These are both atomic vectors of strings.
Python
mtcars.index
mtcars.columns
These are both of type pandas.indexes.base.Index
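If you want plain Python lists rather than Index objects, a minimal sketch (assuming the mtcars dataframe loaded above) is:
list(mtcars.index) #Row names as a list of strings
list(mtcars.columns) #Column names as a list of strings
mtcars.columns.tolist() #Equivalent method on the Index object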
Edge cases
Things can get ambiguous in two situations:
- You have integer row or column names which aren’t just ascending integers starting at 0 (Python) or 1 (R). This creates an ambiguity between indexing by name or by number.
- You have an index with duplicate values.
Integer row/column names which aren’t just 1,2,3,…,n
In R, row names are always strings, which resolves this ambiguity.
Data for these examples are here
R
df <- read.csv("https://gist.githubusercontent.com/RobinL/7378abcb98d70f642a4047f4e01fd428/raw/5e26ef7581804308cd244a499737ead03eba66e8/f.csv", row.names = 1)
df[c(1,2,3),] # Selects rows in positions 1,2,3
df[c(1,2,3,7),] # Selects rows in positions 1,2,3 and a blank row
df[c("2","4","6"),] # #Selects rows named "2","4", and "6"" corresponding to position 1, 2 and 3
df[c("2","4","5","6"),] # Row "5" does not exist, so it creates a blank row in the results
Python
df = pd.read_csv("https://gist.githubusercontent.com/RobinL/7378abcb98d70f642a4047f4e01fd428/raw/5e26ef7581804308cd244a499737ead03eba66e8/f.csv", index_col=0)
df.iloc[[0,1,2],:] #Selects rows in positions 0, 1 and 2 (the first three rows)
df.iloc[[0,1,2,6],:] #IndexError: positional indexers are out-of-bounds
df.loc[[2,4,6],:] #Selects rows named 2,4, and 6 corresponding to position 0, 1 and 2
df.loc[[2,4,6,99],:] #Selects rows named 2,4, and 6 corresponding to position 0, 1 and 2, plus creates a row of NA with index 99
You can also use the pandas function ix for this, but I advise against it. See advice on the use of ix here.
Duplicated row or column names
Consider this table.
R
df <- read.csv("https://gist.githubusercontent.com/RobinL/05eea0870961a80814da5ae22f25f407/raw/bd6f6cd2cf8edd22763e2afa460bf9a7c9bdd79b/tiny_dup.csv", row.names=1)
#Error in read.table(file = file, header = header, sep = sep, quote = quote, :
# duplicate 'row.names' are not allowed
Python
df = pd.read_csv("https://gist.githubusercontent.com/RobinL/05eea0870961a80814da5ae22f25f407/raw/bd6f6cd2cf8edd22763e2afa460bf9a7c9bdd79b/tiny_dup.csv", index_col=0, mangle_dupe_cols=False)
df.loc[2, "b"] # Returns 2 rows and 2 columns
Writing to a dataframe
Writing to individual cells
In what follows I use name based indexing, but the following would also work with numeric (positional) indexing.
R
#Edit existing value
mtcars["Mazda RX4", "mpg"] = 21.1
#Create new value
mtcars["Future car", "mpg"] = 999 #A new row is created. All values are set to NA except for 'mpg' column
mtcars["Future car", "mpg"] = "Not known" #Note that this has the effect of coercing all the other values in the 'mpg' column into a string vector.
Python
#Edit existing value
mtcars.loc["Mazda RX4", "mpg"] = 21.1
#Create new value
mtcars.loc["Future Car", "mpg"] = 999 #Note that this has the effect of coercing all the other values in the 'mpg' column into a string vector.
mtcars.loc["Future Car", "mpg"] = "Not known" #This does not affect the other values in the 'mpg' column
Writing a new row with multiple values
R
# If you have all the values for the new row you can do this
mtcars["New car",] <- <- list(mpg = 100, cyl = 2, disp = NA, hp = NA, drat = NA, wt = NA, qsec = NA, vs = NA, am = NA, gear = NA, carb = NA)
# This is probably a bad idea because it recycles the list for you
mtcars["New car",] <- data.frame(list(1,2,3))
#You can also edit an existing row like this:
mtcars["Datsun 710",] <- list(mpg = 100, cyl = 2, disp = NA, hp = NA, drat = NA, wt = NA, qsec = NA, vs = NA, am = NA, gear = NA, carb = NA)
Python
#Write multiple values using a pandas series
mtcars.loc["New car",:] = pd.Series({"mpg": 100, "cyl": 2})
#You don't need a full row. Where values don't exist in the series, NAs will be inserted into the new row
mtcars.loc["Datsun 710",:] = pd.Series({"mpg": 100, "cyl": 2})
#Note that this overwrites all the unspecified columns to NA
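If you only want to update a few columns of an existing row, leaving the rest untouched, one option (a minimal sketch, assuming the mtcars dataframe loaded above) is to name the columns explicitly:
mtcars.loc["Datsun 710", ["mpg", "cyl"]] = [100, 2] #Only 'mpg' and 'cyl' change; the other columns keep their values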
Writing a new column with multiple values
R
df <- read.csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/62747b776c63c9bd367bcdc11fe7d6a3459eaa12/tiny.csv", row.names = 1)
#One option is
df[,"d"] = c(1,2,3)
#Another option is
df["d"] = c(1,2,3)
Python
df = pd.read_csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/c111232bb1fd564bae084be45c988d63053ab843/tiny.csv", index_col=0)
#One option is:
df.loc[:, "d"] = [1,2,3]
#Another option is
df["d"] = [1,2,3]
Subsetting and filtering
Simple rows filter
R
filter_vector <- mtcars["mpg"] > 20
mtcars[filter_vector, ]
Python
filter_vector = mtcars["mpg"] > 20
mtcars[filter_vector]
#An alternative is
mtcars.loc[filter_vector, :]
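Conditions can also be combined. A minimal sketch, assuming the mtcars dataframe loaded above:
mtcars[(mtcars["mpg"] > 20) & (mtcars["cyl"] == 4)] #Use & and | rather than 'and'/'or', with parentheses around each condition
mtcars.query("mpg > 20 and cyl == 4") #query() takes a string expression and is often more readable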
Operations on columns
Filter column names
R
mtcars[,!names(mtcars) %in% c("mpg", "cyl")]
Python
mtcars[[c for c in mtcars.columns if c not in ["mpg", "cyl"]]]
#Or if you wanted to stick closer to the R syntax you could do:
mtcars.loc[:,~np.in1d(mtcars.columns, ["mpg", "cyl" ])]
Sort order of columns
R
mtcars[,sort(names(mtcars))]
Python
mtcars[sorted(mtcars.columns)]
Computing new columns from existing columns
Simple addition
R
mtcars["kmpl"] <- mtcars["mpg"]/2.35214583
Python
mtcars["kmpl"] = mtcars["mpg"]/2.35214583
More complex operations
Where possible you should vectorise your computations.
R
mtcars["ifoutput"] <- ifelse(mtcars["mpg"] > 15, "yes", "no")
Python
mtcars["ifoutput"] = np.where(mtcars["mpg"] > 15, "yes", "no")
Application of custom function across data frame
Generally you should use vectorised operations to manipulate dataframes - this is much faster/more efficient.
But if you need to here’s how to apply a function row by row:
R
fn_to_apply <- function(row) {
#row is a named character vector - there is implicit type conversion happening here
if (row["mpg"] > 15) {
if (row["carb"] == 4) {
return("one thing")
}
}
return("another thing")
}
mtcars["new_variable"] <- apply(mtcars,MARGIN=1, fn_to_apply)
Python
def fn_to_apply(row):
    """
    Note that row is a pandas.core.series.Series
    representing a single row of the df
    """
    if row["mpg"] > 15:
        if row["carb"] == 4:
            return "one thing"
    return "another thing"

mtcars["new_variable"] = mtcars.apply(fn_to_apply, axis=1)
Here’s another way of doing this in R using the purrr package:
R
fn_to_apply <- function(row) {
#row is a named character vector - there is implicit type conversion happening here
if (row["mpg"] > 15) {
if (row["carb"] == 4) {
return("one thing")
}
}
return("another thing")
}
Other stuff
Piping
Check out this for 🐍.
Suppose we want to filter, then apply a function, then do a group by and a summary. We’ll use dplyr in R.
R
fn_to_apply <- function(row) {
#row is a named character vector - there is implicit type conversion happening here
if (row["mpg"] > 15) {
if (row["carb"] == 4) {
return("one thing")
}
}
return("another thing")
}
mtcars %>%
filter(mpg > 10) %>%
by_row(fn_to_apply, .to = "new_col", .collate = c("row")) %>%
group_by(new_col) %>%
summarise_each(funs(mean))
Python
def pipe_fn(df):
    """
    Pipe expects a function that takes a dataframe as argument and returns a dataframe.
    """
    def fn_to_apply(row):
        if row["mpg"] > 15:
            if row["carb"] == 4:
                return "one thing"
        return "another thing"

    df["new_col"] = df.apply(fn_to_apply, axis=1)
    return df
mtcars\
.query("mpg > 10")\
.pipe(pipe_fn)\
.groupby("new_col")\
.mean()
Transposition
When you transpose an R dataframe, implicit type conversion can catch you out. Pandas has an advantage here because columns can be of mixed type.
R
all(t(t(iris)) == iris) # returns false
Python
all(iris.T.T == iris) # returns true
Pivot tables/cross tabulations
This is quite tricky in R.
R
#Simple pivot table in the tidyverse
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, vs) %>%
summarise(count_hp = n()) %>%
spread(vs, count_hp)
#Pivot table with multiple values aggregated with different agg functions (see python version for clearer description)
library(data.table)
df <- merge(dcast(data.table(mtcars), cyl ~ vs, mean, value.var = "mpg"),
dcast(data.table(mtcars), cyl ~ vs, length, value.var = "hp"), by = "cyl")
names(df) <- c("cyl", "mean_vs_0", "mean_vs_1", "count_vs_0", "count_vs_1")
Python
#Simple Pivot table
mtcars.pivot_table(index="cyl", columns="vs", values = "hp", aggfunc="count")
#Note the benefits of multi indexing in the following example
df = mtcars.pivot_table(index="cyl", columns="vs", values = ["mpg", "hp"], aggfunc={"mpg":np.mean, "hp":"count"})
df.columns = df.columns.set_levels(['mpg_mean', 'hp_count'], level=0)
Setup
Here are some notes on running the code in this .Rmd.
To run the code, you’ll want to remove eval=FALSE from the code blocks.
In what follows we are going to use the mtcars dataset. If you want to run this code in Python, you’ll need access to the dataset. You can get this by doing pip install ggplot and then, in Python, from ggplot import mtcars.
Note that in order for the Python code to execute, you need to make sure that Sys.which('python') returns the path for Anaconda rather than the Mac OS default Python. See here, here and here. The solution I found was:
cd ~
echo "PATH=/Users/robinlinacre/anaconda/bin" >> .Renviron