Dataframes

In what follows we assume Python has access to a dataframe called mtcars which is constructed on the same basis as the R one.

import pandas as pd 
mtcars = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", index_col=0)

Selecting rows

Selecting row by numeric index

R

mtcars[1,]  #Select row number 1
mtcars[c(1,2),] #Select row numbers 1 and 2

This returns a 'data.frame': 1 obs. of 11 variables or a 'data.frame': 2 obs. of 11 variables

Python

mtcars.iloc[0,:]  #Select row number 1 (zero based indexing)
mtcars.iloc[[0,1],:]  #Select rows number 1 and 2.

This returns a pandas.core.series.Series if you select a single row, or a pandas.core.frame.DataFrame if you select multiple rows.

Selecting row by named index

R

mtcars['Mazda RX4', ]
mtcars[c('Mazda RX4','Mazda RX4 Wag'), ]

This returns a 'data.frame': 1 obs. of 11 variables or a 'data.frame': 2 obs. of 11 variables

Python

mtcars.loc['Mazda RX4',:]
mtcars.loc[['Mazda RX4', 'Mazda RX4 Wag'],:]

This returns a pandas.core.series.Series if you select a single row, or a pandas.core.frame.DataFrame if you select multiple rows.

Selecting columns

Selecting column by numeric index

R

mtcars[,1 ]
mtcars[,c(1,2)]

If we select a single column, this return an atomic vector. If we select multiple columns, it returns a 'data.frame': 32 obs. of 2 variables:

Python

mtcars.iloc[:,0]

If we select a single column, this returns a pandas.core.series.Series. If we select multiple columns, it returns a pandas.core.frame.DataFrame.

Selecting individual ‘cells’ (items)

Selecting item by numeric index

R

mtcars[1,1 ]

This returns a numeric vector with a single item

Python

mtcars.iloc[0,0]

This returns a float

Selecting item by named index

R

mtcars['Mazda RX4','mpg']

This returns a numeric atomic vector with a single item.

Python

mtcars.loc['Mazda RX4','mpg]

This returns a float.

Selecting items with a logical vector

It is possible to use a logical vector (a series of ‘truthy’ and ‘falsey’ values) to filter a dataframe.

For instance, if your rows are 1,2,3 a logical vector true, false, true will select rows 1 and 3.

R

# The following df has 3 rows and 3 columns
df <- read.csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/62747b776c63c9bd367bcdc11fe7d6a3459eaa12/tiny.csv", row.names = 1)
logical_vector <- c(TRUE,FALSE, TRUE)
df[logical_vector,]
df[,logical_vector]

logical_vector <- c(TRUE, FALSE)
df[logical_vector,] #This works due to R's vector recyclig 
df[,logical_vector]

For more on vector recycling in R, see here.

Python

df = pd.read_csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/62747b776c63c9bd367bcdc11fe7d6a3459eaa12/tiny.csv", index_col=0)
logical_vector = [True, False, True]
df.loc[logical_vector,:]
df.loc[:,logical_vector]

logical_vector = [True, False]
df.loc[logical_vector,:] #Different behaviour to R - no recycling
df.loc[:,logical_vector]

Obtaining the index itself

R

names(mtcars)
rownames(mtcars)

These are both atomic vectors of strings.

Python

mtcars.index
mtcars.columns

These are both of type pandas.indexes.base.Index

Edge cases

Things can get ambiguous in two situations:

  1. You have integer row or column names which aren’t just ascending integers starting at 0 (Python) or 1 (R). This creates an ambiguity between indexing my name or by number
  2. You have an index with duplicate values.

Integer row/columns names which aren’t just 1,2,3…,n

In R, row names are always strings, which resolves this ambiguity
Data for these examples are here

R

df <- read.csv("https://gist.githubusercontent.com/RobinL/7378abcb98d70f642a4047f4e01fd428/raw/5e26ef7581804308cd244a499737ead03eba66e8/f.csv", row.names = 1)
df[c(1,2,3),] # Selects rows in positions 1,2,3 
df[c(1,2,3,7),] # Selects rows in positions 1,2,3 and a blank row
df[c("2","4","6"),] #  #Selects rows named "2","4", and "6"" corresponding to position 1, 2 and 3
df[c("2","4","5","6"),] # Row "5" does not exist, so it creates a blank row in the results

Python

df = pd.read_csv("https://gist.githubusercontent.com/RobinL/7378abcb98d70f642a4047f4e01fd428/raw/5e26ef7581804308cd244a499737ead03eba66e8/f.csv", index_col=0)
df.iloc[[0,1,2],:] #Selects rows in position 1, 2 and 3
df.iloc[[0,1,2,6],:] #IndexError: positional indexers are out-of-bounds

df.loc[[2,4,6],:] #Selects rows named 2,4, and 6 corresponding to position 0, 1 and 2
df.loc[[2,4,6,99],:] #Selects rows named 2,4, and 6 corresponding to position 0, 1 and 2, plus creates a row of NA with index 99

You can also use the pandas function ix for this, but I advise against it. See advice on the use of ix here.

Duplicated row or column names

Consider this table.

R

df <- read.csv("https://gist.githubusercontent.com/RobinL/05eea0870961a80814da5ae22f25f407/raw/bd6f6cd2cf8edd22763e2afa460bf9a7c9bdd79b/tiny_dup.csv", row.names=1)
#Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
#  duplicate 'row.names' are not allowed

Python

df = pd.read_csv("https://gist.githubusercontent.com/RobinL/05eea0870961a80814da5ae22f25f407/raw/bd6f6cd2cf8edd22763e2afa460bf9a7c9bdd79b/tiny_dup.csv", index_col=0, mangle_dupe_cols=False)

df.loc[2, "b"] # Returns 2 rows and 2 columns

Writing to a dataframe

Writing to individual cells

In what follows I use name based indexing, but the following would also work with numeric (positional) indexing.

R

#Edit existing value
mtcars["Mazda RX4", "mpg"] = 21.1

#Create new value
mtcars["Future car", "mpg"] = 999 #A new row is created.  All values are set to NA except for 'mpg' column

mtcars["Future car", "mpg"] = "Not known"  #Note that this has the effect of coercing all the other values in the 'mpg' column into a string vector.

Python

#Edit existing value
mtcars.loc["Mazda RX4", "mpg"] = 21.1

#Create new value
mtcars.loc["Future Car", "mpg"] = 999 #Note that this has the effect of coercing all the other values in the 'mpg' column into a string vector.

mtcars.loc["Future Car", "mpg"] = "Not known" #This does not affect the other values in the 'mpg' column

0.0.0.1 Writing a new row with multiple values

R

# If you have all the values for the new row you can do this
mtcars["New car",] <- <- list(mpg = 100, cyl = 2, disp = NA, hp = NA, drat = NA, wt = NA, qsec = NA, vs = NA, am = NA, gear = NA, carb = NA)

# This is probably a bad idea because it recycles the list for you
mtcars["New car",] <- data.frame(list(1,2,3))

#You can also edit an existing row like this:
mtcars["Datsun 710",] <- list(mpg = 100, cyl = 2, disp = NA, hp = NA, drat = NA, wt = NA, qsec = NA, vs = NA, am = NA, gear = NA, carb = NA)

Python

#Write mutilple values using a pandas series
mtcars.loc["New car",:] = pd.Series({"mpg": 100, "cyl": 2})

#You don't need a full row.  Where values don't exist in the series, NAs will be inserted into the new row
mtcars.loc["Datsun 710",:] = pd.Series({"mpg": 100, "cyl": 2})
#Note that this overwrites all the unspecified coluns to NA

Writing a new column with multiple values

R

df <- read.csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/62747b776c63c9bd367bcdc11fe7d6a3459eaa12/tiny.csv", row.names = 1)

#One option is 
df[,"d"] = c(1,2,3)

#Another options is 
df["d"] = c(1,2,3)

Python

df = pd.read_csv("https://gist.githubusercontent.com/RobinL/dae60170438c5e0adcd0c68fbaa6abdf/raw/c111232bb1fd564bae084be45c988d63053ab843/tiny.csv", index_col=0)

#One option is:
df.loc[:, "d"] = [1,2,3]

#Another option is 
df["d"] = [1,2,3]

Subsetting and filtering

Simple rows filter

R

filter_vector <- mtcars["mpg"] > 20
mtcars[filter_vector, ]

Python

filter_vector =  mtcars["mpg"] > 20
mtcars[filter_vector]

#An alternative is
mtcars.loc[filter_vector, :]

Operations on columns

Filter column names

R

mtcars[,!names(mtcars) %in% c("mpg", "cyl")]

Python

mtcars[c for c in mtcars.columns if c not in ["mpg", "cyl"]]

#Or if you wanted to stick closer to the R syntax you could do:
mtcars.loc[:,~np.in1d(mtcars.columns, ["mpg", "cyl" ])]

Sort order of columns

R

mtcars[,sort(names(mtcars))]

Python

mtcars[sorted(mtcars.columns)]

Computing new columns from existing columns

Simple addition

R

mtcars["kmpl"] <- mtcars["mpg"]/2.35214583

Python

mtcars["kmpl"] =  mtcars["mpg"]/2.35214583

More complex operations

Where possible you should vectorise your computations.

R

mtcars["ifoutput"]  <- ifelse(mtcars["mpg"] > 15, "yes", "no")

Python

mtcars["ifoutput"] = np.where(mtcars["mpg"] > 15, "yes", "no")

Application of custom function across data frame

Generally you should use vectorised operations to manupulate dataframe - this is much faster/more efficient.

But if you need to here’s how to apply a function row by row:

R

fn_to_apply <- function(row) {
    #row is a named character vector - there is implicit type conversion happening here
    if (row["mpg"] > 15) {
        if (row["carb"] == 4) {
            return("one thing")
        }
    }
    return("another thing")
}

mtcars["new_variable"] <- apply(mtcars,MARGIN=1, fn_to_apply)

Python

def fn_to_apply(row):
    """
    Note that row is a pandas.core.series.Series
    representing a single row of the df
    """
    if row["mpg"] > 15:
        if row["carb"] == 4:
            return "one thing"
    return "another thing"

mtcars["new_variable"] = mtcars.apply(fn_to_apply,axis=1)

Here’s another way of doing this in r using the purr package:

R

fn_to_apply <- function(row) {
    #row is a named character vector - there is implicit type conversion happening here
    if (row["mpg"] > 15) {
        if (row["carb"] == 4) {
            return("one thing")
        }
    }
    return("another thing")
}

Other stuff

Piping

Checkout this for 🐍.

Suppose we want to filter then apply a function, then do a group by and a summary. We’ll use dplyr in R

R

fn_to_apply <- function(row) {
    #row is a named character vector - there is implicit type conversion happening here
    if (row["mpg"] > 15) {
        if (row["carb"] == 4) {
            return("one thing")
        }
    }
    return("another thing")
}

mtcars %>%
    filter(mpg > 10) %>%
    by_row(fn_to_apply, .to = "new_col", .collate = c("row")) %>% 
    group_by(new_col) %>%
    summarise_each(funs(mean))

Python

def pipe_fn(df):
    """
    Pipe expects a function that takes a dataframe as argument and returns a dataframe.
    """
    def fn_to_apply(row):
        if row["mpg"] > 15:
            if row["carb"] == 4:
                return "one thing"
        return "another thing"
    df["new_col"] = df.apply(fn_to_apply,axis=1)
    return df 

mtcars\
    .query("mpg > 10")\
    .pipe(pipe_fn)\
    .groupby("new_col")\
    .mean()

Transposition

When you transpose an R dataframe, implicit type conversion can get you Pandas has an advantage here because columns can be of mixed type.

R

all(t(t(iris)) == iris) # returns false

Python

all(iris.T.T == iris) # returns true

Pivot tables/cross tabulations

This is quite tricky in R

#Simple pivot table in the tidyverse
library(dplyr)
library(tidyr)
mtcars %>%
  group_by(cyl, vs) %>%
  summarise(count_hp = n()) %>%
  spread(vs, count_hp)

#Pivot table with multiple values aggregatedwith different agg functions (see python version for clearer description) 
library(data.frame)
df <- merge(dcast(data.table(mtcars), cyl ~ vs, mean, value.var = "mpg"),
      dcast(data.table(mtcars), cyl ~ vs, length, value.var = "hp"), by = "cyl")
names(df) <- c("cyl", "mean_vs_0", "mean_vs_1", "count_vs_0", "count_vs_1")

Python


#Simple Pivot table
mtcars.pivot_table(index="cyl", columns="vs", values = "hp", aggfunc="count")

#Note the benfits of multi indexing in the following example
df = mtcars.pivot_table(index="cyl", columns="vs", values = ["mpg", "hp"], aggfunc={"mpg":np.mean, "hp":"count"})
df.columns = df.columns.set_levels(['mpg_mean', 'hp_count'], level=0)

Setup

Here are some notes on running the code in this .Rmd.

To run the code, you’ll want to remove eval=FALSE from the code blocks.

In what follows we are going to use the mtcars dataset. If you want to run this code in Python, you’ll need access to the dataset. You can get this by doing: pip install ggplot and then in python from ggplot import mtcars

Note that in order for the Python code to execute, you need to make sure that Sys.which('python') returns the path for Anaconda rather than Mac OS default python. See here, here and here. The solution I found was

cd ~
echo .Renviron >> PATH=/Users/robinlinacre/anaconda/bin