EDA and basic Data Visualizations in R (and Python)

Ojas Pandey
Published in Analytics Vidhya
Jul 13, 2021

Anyone learning data science is probably well aware of the ongoing debate between Python and R. Both languages are excellent in their own way, and each has its pros and cons. Python is a general-purpose language that is easy to learn and highly interactive. One of its disadvantages is that its visualization features are not as extensive and elaborate as R’s. R, on the other hand, was made specifically for data analysis, so it has excellent tools for that purpose: a huge number of packages, easy installation, and guides to help you through most issues you might face. Its usual IDE, RStudio, also integrates with GitHub. R’s main disadvantage is that it is specialized rather than multipurpose like Python. The two also share some similarities: both are open source, both are interpreted languages you can drive from the command line, and both are used extensively in data science projects.

Personally, I believe that Python is better for data science as a whole, because it is easy to use and understand, but the statistical tooling R offers is stronger than Python’s. Then again, R was made primarily for statistical analysis and graphics. In short, Python is an excellent multipurpose language, while R offers better statistical analysis features and tools. The question, then, shouldn’t be which one is better but how to make the best use of both languages in your specific use case.

Enough theory, let’s get down to business. I’ll now walk through some Exploratory Data Analysis (EDA) and data visualization steps using both languages. My R code will be more elaborate, because I believe there are plenty of EDA and data viz code samples online for Python but not as many for R. (Again, I may be wrong here, but it doesn’t really matter.)

You can access the code, data and data dictionary using the link here.

Just a heads-up: here I’ll be explaining the R code, but the Python code is also available at the link mentioned above.

Libraries used

tidyverse

The tidyverse is one of the most widely used collections of R packages for data cleaning and exploration, and it offers an extensive variety of tools for that purpose. It is roughly R’s counterpart to Python’s pandas, and in my opinion the better of the two.

ggplot2

If you are familiar with visualization in Python, think of ggplot2 as matplotlib and seaborn combined, and more. It’s easy to use and lets you build complex visualizations with just a few lines of code, often fewer than the Python equivalent needs.

skimr

skimr is another useful R package. It produces summary statistics and handles a variety of data types.

e1071

This R package provides useful tools for statistical and probabilistic algorithms.

lattice

This is a powerful R package for high-level data visualization, with a focus on multivariate data.

summarytools

This package helps with data exploration and produces summary reports.

I’ve used two datasets for illustration: the mammals sleep dataset (msleep) and the cars dataset (mpg).

EDA techniques

After loading the datasets, I apply some EDA techniques to the data. Some examples are as follows:

str(): Short for “structure”, this function ships with base R (in the utils package). It gives a compact overview of the structure of the entire data frame.

glimpse(): This tidyverse function provides a sneak peek into the dataset along with the data type of each column.

skim(): Slightly more advanced and elaborate than the functions above is skim() from the skimr library. Its results include the min, max, number of missing values, whitespace counts, standard deviation, percentiles, number of unique values, etc. It also groups features into numerical and categorical.

summary(): Another base R function, summary() provides a brief summary of our data frame.

dfSummary(): Comparable to summary() and skim(), if not more advanced, is dfSummary() from the summarytools library. It provides a more in-depth analysis of each variable.
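To make concrete what these summary functions report, here is a minimal Python sketch using only the standard library. It is not the code from the linked repository and it does not reproduce the exact output of skim() or summary(). The rows are made up for illustration (they are not taken from msleep), and the helper name summarize is hypothetical.

```python
import statistics

# Made-up rows for illustration only; not taken from msleep or mpg.
rows = [
    {"name": "A", "diet": "herbi", "sleep_hours": 12.0},
    {"name": "B", "diet": "carni", "sleep_hours": 8.5},
    {"name": "C", "diet": None, "sleep_hours": None},
    {"name": "D", "diet": "herbi", "sleep_hours": 10.5},
]


def summarize(rows):
    """Return a per-column summary, split into numeric and categorical columns."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v is not None]
        info = {"n_missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric column: report range, mean and sample standard deviation.
            info.update(
                type="numeric",
                min=min(present),
                max=max(present),
                mean=statistics.mean(present),
                sd=statistics.stdev(present),
            )
        else:
            # Anything else is treated as categorical: report distinct values.
            info.update(type="categorical", n_unique=len(set(present)))
        summary[column] = info
    return summary


report = summarize(rows)
for column, info in report.items():
    print(column, info)
```

Each column gets a missing-value count, and the remaining fields depend on whether the column looks numeric or categorical, which is the same split skim() makes.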

After a little more analysis, which can be seen in the code provided, we move on to data visualization.

Plotting Categorical Variables

Barplot using qplot()

qplot() (quick plot) is a ggplot2 shortcut modeled on base R’s simple plot() function. It is not as extensive as ggplot(), but it is useful for producing simple plots for quick analysis.

Pie charts: We can also plot a simple pie chart with the pie() function in R for quick analysis, just as with qplot().
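Both a bar plot and a pie chart of a categorical variable are drawn from the same numbers: the count of each category, and its share of the total. A minimal Python sketch of that computation, using only the standard library and made-up labels (not taken from msleep), might look like this:

```python
from collections import Counter

# Made-up labels for illustration only; not taken from msleep or mpg.
diets = ["herbi", "carni", "herbi", "omni", "herbi", "carni", "omni", "herbi"]

counts = Counter(diets)  # bar heights in a bar plot
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}  # pie slice fractions

# Print a rough text version of the bar plot, with each category's share.
for label, count in counts.most_common():
    print(f"{label:<6} {'#' * count:<5} {count} ({shares[label]:.0%})")
```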

Plotting numerical/quantitative variables

Plots for univariate analysis

Histograms

hist() : This function from the graphics package gives a histogram with frequency on the y-axis.

histogram(): This function from the lattice library is similar to hist(); the major difference is that it shows percentage on the y-axis instead of frequency.

qplot(): Here too, we can use qplot() to plot a histogram quickly.
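The difference between a frequency histogram and a percentage histogram is only a rescaling of the bar heights. The following Python sketch (standard library only) illustrates this with made-up values. The bin edges and the half-open [lo, hi) binning rule are assumptions chosen for simplicity, not the defaults of hist() or histogram().

```python
# Made-up values for illustration only; not taken from msleep or mpg.
values = [2, 3, 5, 7, 8, 9, 11, 12, 14, 19]
edges = [0, 5, 10, 15, 20]  # four equal-width bins: [0,5), [5,10), [10,15), [15,20)


def frequencies(values, edges):
    """Count how many values fall into each half-open bin [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


counts = frequencies(values, edges)                 # frequency on the y-axis
percents = [100 * c / len(values) for c in counts]  # percent of total on the y-axis

print(counts)
print(percents)
```

The shape of the histogram is identical in both cases; only the y-axis label and scale change.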

Density related plots :

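A few examples of density-related plots are as follows:

Density plot using the densityplot() function from lattice.

Density plot using qplot().

Normal Q-Q plot using the qqnorm() function, which plots the sample quantiles against those of a normal distribution rather than a density curve.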
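What qqnorm() draws is a normal Q-Q plot: each sorted observation is paired with the quantile a normal distribution would have at the same position. Here is a minimal Python sketch of those points, using the standard library and a made-up sample. The plotting positions (i - 0.5) / n are one common convention and an assumption here; R picks its positions via ppoints(), so its values can differ slightly. The helper name normal_qq_points is hypothetical.

```python
from statistics import NormalDist

# Made-up sample for illustration only; not taken from msleep or mpg.
sample = [4.1, 5.0, 5.3, 5.9, 6.2, 6.8, 7.4, 9.5]


def normal_qq_points(sample):
    """Pair each sorted observation with a standard-normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


points = normal_qq_points(sample)
for theoretical, observed in points:
    print(f"{theoretical:+.3f}  {observed}")
```

If the points fall close to a straight line, the sample is roughly normally distributed.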

Boxplot:

This boxplot, made using qplot(), shows the highway miles per gallon for each class of car in the mpg.csv data.
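A grouped boxplot is essentially a picture of a five-number summary (min, first quartile, median, third quartile, max) computed per group. The Python sketch below uses only the standard library and made-up (class, highway mpg) pairs. Quartile conventions vary between tools, so the "inclusive" method used here is an assumption and plotted boxes may differ slightly. The helper name five_number_summary is hypothetical.

```python
import statistics
from collections import defaultdict

# Made-up (class, highway mpg) pairs for illustration only; not taken from mpg.csv.
cars = [
    ("compact", 29), ("compact", 31), ("compact", 26), ("compact", 33),
    ("suv", 17), ("suv", 20), ("suv", 15), ("suv", 18), ("suv", 25),
]


def five_number_summary(values):
    """Return min, quartiles and max for one group of values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


# Group the highway mpg values by car class.
by_class = defaultdict(list)
for car_class, hwy in cars:
    by_class[car_class].append(hwy)

summaries = {car_class: five_number_summary(hwy) for car_class, hwy in by_class.items()}
for car_class, summary in summaries.items():
    print(car_class, summary)
```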

A stacked bar chart made using qplot() shows the frequency of each drive type for each class of car.
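Behind a stacked bar chart sits a two-way frequency table: one bar per class, one segment per drive type. A minimal Python sketch of that table, using the standard library and made-up (class, drive) pairs rather than the real mpg data, could be:

```python
from collections import Counter

# Made-up (class, drive) pairs for illustration only; not taken from mpg.csv.
pairs = [
    ("compact", "f"), ("compact", "f"), ("compact", "4"),
    ("suv", "4"), ("suv", "4"), ("suv", "r"),
    ("pickup", "4"),
]

table = Counter(pairs)  # missing combinations count as 0
classes = sorted({car_class for car_class, _ in pairs})
drives = sorted({drive for _, drive in pairs})

# One bar per class; each bar is split into one segment per drive type.
stacks = {c: {d: table[(c, d)] for d in drives} for c in classes}
for car_class, segments in stacks.items():
    print(car_class, segments)
```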

Lastly, we can look at a colored dot plot (also called a scatter plot).

The best thing about all these plots is that each was made with just a single line of code. More advanced versions can be built with the ggplot() function, which, as I said earlier, is far more extensive. But for the sake of this article, which focuses on quick graphs and analysis that are easy to understand and implement, qplot() does the job.

I’ve also provided Python code for EDA and data visualization on the same datasets. As I said earlier, both R and Python are excellent individually and even better together when it comes to data science. I believe there is great benefit in learning both so that we can combine their individual strengths.

Also, at the end of the code I’ve included a few lines for another great R package, dlookr, which is used for data diagnosis, transformation and exploration. Make sure to check that out too.

Thanks. Hope this helps.
