EDA for dummies

Project description

JUVINI

A Comprehensive tool for EDA

Introduction
Requirement
Usage
Best Practices

Introduction

Plotting graphs is one of the most important aspects of EDA. Graphs give intuitive insights because they are processed by our natural neural networks, which have been trained and evolved nonstop for years. This tool is designed to let data science users focus on the plots themselves rather than spending time on code and on comparing the methods of each plotting library. The tool works at several levels. At the highest level, the user just passes in the entire data frame and the code produces plots for all column combinations based on their data types, much like the way pairplot works for numerical data types.

Requirement

The user should have some familiarity with Python. The tool can be run from Jupyter as well as from the Python console.
The user should have a good understanding of the different graph types, especially boxplot, scatterplot, barplot, countplot and distplot.
This is not a must, but if the user has a clear understanding of the data type associated with each column, then converting each column to that data type will make the graphs look better. For example, if a column contains the categorical values 1, 2, 3, 4, it is better to convert it to object or category so that the tool can recognize it. Otherwise the tool will assume the data type is numeric and draw numeric-style graphs.
The tool always treats the first input column as the X axis and the second input column as the Y axis. If the parameter hue_col is specified, it searches for that column in the rest of the dataframe.
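
The data type conversion recommended above can be done in pandas before calling the tool. The following is a minimal sketch that uses only pandas; the tiny data frame is made up for illustration and is not part of the package.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Tiny illustrative frame: "rating" holds category codes stored as integers.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 1.5, 1.7],
    "rating": [1, 0, 3, 1],
})

# As loaded, pandas infers an integer dtype, so a dtype-based tool
# would see "rating" as a numeric column.
was_numeric = is_numeric_dtype(df["rating"])
print(was_numeric)

# Casting to "category" makes the intent explicit.
df["rating"] = df["rating"].astype("category")
print(df["rating"].dtype)
```

Casting to object with astype(object) is the other option the text mentions; category is usually the more compact choice in pandas.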

Usage

Consider the standard IRIS dataset. Here we have modified it slightly by adding a numeric column, rating, whose values are 0, 1, 2 and 3. Even though it is categorical, we have purposely kept it as a numeric column to show some use cases that come up in later sections. The dataset consists of 6 columns:

sepal_length - numeric
sepal_width - numeric
petal_length - numeric
petal_width - numeric
species - categorical
rating - numeric ( in fact it is categorical in real scenario )

Sample data

sepal_length,sepal_width,petal_length,petal_width,species,rating
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,0
4.6,3.1,1.5,0.2,setosa,3
5.0,3.6,1.4,0.2,setosa,0
5.4,3.9,1.7,0.4,setosa,1
4.6,3.4,1.4,0.3,setosa,3
5.0,3.4,1.5,0.2,setosa,0
4.4,2.9,1.4,0.2,setosa,1
4.9,3.1,1.5,0.1,setosa,1
5.4,3.7,1.5,0.2,setosa,0
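
To follow along without the CSV file, the first few sample rows above can be loaded from an in-memory string. This is only a convenience sketch using pandas and the standard library; the examples in this document read iris_with_rating.csv instead.

```python
import io

import pandas as pd

# First five sample rows from the listing above, kept as CSV text.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species,rating
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,0
4.6,3.1,1.5,0.2,setosa,3
5.0,3.6,1.4,0.2,setosa,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred data types: "rating" is read as an integer column.
print(df.dtypes)
```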

NUMERIC vs NUMERIC - to plot graph where two columns are numeric.

Method : `num_num(df[[num_col1,num_col2]])`

Examples

simple numeric to numeric plotting

import pandas as pd
from juvini import num_num
df=pd.read_csv('iris_with_rating.csv')
num_num(df[['sepal_length','sepal_width']])

[Figure: numeric_numeric]

Wait, what if I do want to add a hue parameter to it? Just make sure to add the additional column species to the input dataframe and also pass the parameter hue_col='species'.

num_num(df[['sepal_length','sepal_width','species']],hue_col='species')

[Figure: numeric_numeric, with hue]

additional parameters

x_name='xvalue': the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. Defaults to the first column's name.
y_name='yvalue': same as x_name, but for the Y axis.
size_figure=(13,4): the figure size. Depending on the size of your screen you may want to change it. Defaults to (13,4) with a tight layout.
hue_col: the column to use as the hue. See the example above.

CATEGORICAL vs CATEGORICAL - to plot a graph where both columns are categorical.

Method : `cat_cat(df[[cat_col1,cat_col2]])`

Examples

This takes the top 5 categories of each column and plots them. You can change the value 5 using the parameters xcap and ycap, described below. For each value of X, it gives the countplot for the values in Y. The tool also takes care of the subplots, figure size and so on, so the user does not have to work out the sizing or the subplot grid.

import pandas as pd
from juvini import cat_cat
df=pd.read_csv('iris_with_rating.csv')
cat_cat(df[['species','rating']])

[Figure: categorical_categorical]

Similarly, interchanging the first and second columns swaps the axes:

cat_cat(df[['rating','species']])

[Figure: categorical_categorical_xy_changed]

But wait, did we just use a numerical column as a categorical column? Actually yes. If we know that a column is categorical, we do not have to change its data type or do any other extra work; the code takes care of converting it to a category.
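
The counts behind a categorical vs categorical plot (for each value of X, how often each value of Y occurs) can be checked as a table with pandas. This is a sketch of the underlying numbers only, not of how cat_cat is implemented; the rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows; "rating" is stored as integers but used as a category.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa", "versicolor", "versicolor"],
    "rating": [1, 1, 0, 3, 0],
})

# Rows are the X categories, columns are the Y categories,
# and each cell is the number of rows with that combination.
counts = pd.crosstab(df["species"], df["rating"])
print(counts)
```

Swapping the two arguments of crosstab transposes the table, which mirrors what interchanging the columns does to the plot axes.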

Fine, but what if there are too many categories and I simply need the gist of the top few? That is also supported. Simply provide the parameter xcap=<value>; the code sorts the categories by count and keeps the top n values, where n is the value provided.

cat_cat(df[['species','rating']],xcap=2)

[Figure: categorical_categorical_with_xcap]

Fine, what if I want to change the ycap rather than the xcap? We can do that as well. Simply set the parameter ycap=<value>, just like xcap.

How about the hue? Yes, that works here too. Provide it like cat_cat(df[['species','rating','hue_column']],hue_col='hue_column')
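
The capping described for xcap and ycap (sort the categories by count, keep the top n) can be sketched with pandas alone. The helper top_categories is hypothetical and written only for illustration; it is not part of the package, and the package may break ties or order values differently.

```python
import pandas as pd


def top_categories(series, cap=5):
    """Return the `cap` most frequent values of `series`, most frequent first."""
    return series.value_counts().head(cap).index.tolist()


# The rating values from the eleven sample rows shown earlier.
ratings = pd.Series([1, 1, 0, 3, 0, 1, 3, 0, 1, 1, 0], name="rating")

# Keep only the two most frequent ratings, as xcap=2 would.
keep = top_categories(ratings, cap=2)
capped = ratings[ratings.isin(keep)]

print(keep)
print(len(capped))
```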

additional parameters

x_name='xvalue': the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. Defaults to the first column's name.
y_name='yvalue': same as x_name, but for the Y axis.
size_figure=(13,4): the figure size. Depending on the size of your screen you may want to change it. Defaults to (13,4) with a tight layout.
xcap=5: caps the categories of the X axis (first column) at the top 5 by count. Defaults to 5.
ycap=5: same as xcap, but applies to the Y column.
hue_col: the column to use as the hue. See the example above.
scols=3: experimental feature, use with caution. This parameter controls how many plots are placed in one row. Defaults to 3.
others=True: experimental feature, use with caution. This parameter puts all values that do not fall within the top values into a category called 'restall'.

CATEGORICAL vs NUMERICAL - to plot a graph where x is categorical and y is numeric.

Method : `cat_num(df[[cat_col1,num_col2]])`

Examples

This takes the top 5 categories of the categorical column and plots the numerical column against them. You can change the value 5 using the parameter xcap, described below. For each value of X, it gives the boxplot of the numerical column for that category. Additionally, it gives the aggregate sum of the numerical values for each category.

It is up to the user to decide which is useful. The boxplot is always useful, whereas the sum aggregate helps if you are looking at something like total votes; for a column like sepal_width it may not be useful. Either way, there is no harm in showing both.

import pandas as pd
from juvini import cat_num
df=pd.read_csv('iris_with_rating.csv')
cat_num(df[['species','petal_length']])

[Figure: categorical_numerical]

Can we use a numerical column as the categorical column? Actually yes. If we know that a column is categorical, we do not have to change its data type or do any other extra work; the code takes care of converting it to a category as long as you provide it as the first column of the input.

How about the hue? Yes, that works here too. Provide it like cat_num(df[['species','petal_length','rating']],hue_col='rating')

[Figure: categorical_numerical_with_hue]
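
The two views described in this section, a per-category boxplot and a per-category sum, correspond to simple grouped statistics. The sketch below computes those numbers with pandas so they can be checked against the plots; it is not the package's own code, and the rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa", "versicolor", "versicolor"],
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5],
})

grouped = df.groupby("species")["petal_length"]

# Aggregate sum per category (the "total votes" style of view).
totals = grouped.sum()

# Quartiles, min and max per category: the numbers a boxplot draws.
summary = grouped.describe()

print(totals)
print(summary[["min", "25%", "50%", "75%", "max"]])
```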

additional parameters

x_name='xvalue': the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. Defaults to the first column's name.
y_name='yvalue': same as x_name, but for the Y axis.
size_figure=(13,4): the figure size. Depending on the size of your screen you may want to change it. Defaults to (13,4) with a tight layout.
xcap=5: caps the categories of the X axis (first column) at the top 5 by count. Defaults to 5.
hue_col: the column to use as the hue. See the example above.
others=True: experimental feature, use with caution. This parameter puts all values that do not fall within the top values into a category called 'restall'. For example, the ratings run from 0 to 3. If we cap them to the top 2, the rest of the ratings go into the 'restall' value.

cat_num(df[['rating','petal_length']],xcap=2,others=True)

[Figure: categorical_numerical_with_hue]
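
The effect of others=True, as described above, is that values outside the top categories are grouped under 'restall' instead of being dropped. A minimal pandas sketch of that idea follows; cap_with_rest is a hypothetical helper written for illustration, and the package's actual behavior may differ in detail.

```python
import pandas as pd


def cap_with_rest(series, cap, rest_label="restall"):
    """Keep the `cap` most frequent values and relabel everything else."""
    # Work on strings so the label and the kept values share one type.
    as_text = series.astype(str)
    keep = as_text.value_counts().head(cap).index
    return as_text.where(as_text.isin(keep), other=rest_label)


# The rating values from the eleven sample rows shown earlier.
ratings = pd.Series([1, 1, 0, 3, 0, 1, 3, 0, 1, 1, 0], name="rating")

capped = cap_with_rest(ratings, cap=2)
print(capped.value_counts())
```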

Single NUMERICAL - to plot graph with just a numerical column

Method : `single_num(df[[num_col1]])`

Examples

A plot does not always need two columns. What if I just need to see a boxplot of a numeric column, or its distribution? For that there is a method that gives both a boxplot and a distplot. It is usually used with the hue to give more insight.

import pandas as pd
from juvini import single_num
df=pd.read_csv('iris_with_rating.csv')
single_num(df[['sepal_length']])

[Figure: single_numerical]

How about the hue? Yes, that works here too. Provide it like single_num(df[['sepal_length','species']],hue_col='species')

[Figure: single_numerical_with_hue]

additional parameters

x_name='xvalue': the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. Defaults to the first column's name.
size_figure=(13,4): the figure size. Depending on the size of your screen you may want to change it. Defaults to (13,4) with a tight layout.
hue_col: the column to use as the hue. See the example above.

Single CATEGORICAL - to plot graph with just a categorical column

Method : `single_cat(df[[cat_col1]])`

Examples

A plot does not always need two columns. What if I just need to see how the values of a single categorical column are distributed? For that there is a method that plots the categorical column on its own. It is usually used with the hue to give more insight.

import pandas as pd
from juvini import single_cat
df=pd.read_csv('iris_with_rating.csv')
single_cat(df[['species']])

[Figure: single_categorical]

How about the hue? Yes, that works here too. Provide it like single_cat(df[['species','rating']],hue_col='rating')

[Figure: single_categorical_with_hue]

additional parameters

x_name='xvalue': the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. Defaults to the first column's name.
size_figure=(13,4): the figure size. Depending on the size of your screen you may want to change it. Defaults to (13,4) with a tight layout.
hue_col: the column to use as the hue. See the example above.
xcap=5: caps the categories of the X axis (first column) at the top 5 by count. Defaults to 5.

To make it even easier

Method : `xy_auto_plot(df[[col1,col2]])`

Examples

What if I do not even care what the data type is? I just want the code to decide based on the data types already present. Can I do that?

Yes. There is a method that does exactly this. You simply give it two columns. The first column is taken as the X variable and the second as the Y variable, and based on their data types it produces the appropriate graph.

import pandas as pd
from juvini import xy_auto_plot
df=pd.read_csv('iris_with_rating.csv')
xy_auto_plot(df[['sepal_length','rating']])

[Figure: xy_auto_plot]

Does it support hue? Yes, you can use the same parameter hue_col=<colname>, and if the graph can handle a hue, it will use it.

[Figure: xy_auto_plot_hue]

So what is the catch? Why go through all the graphs above if this takes care of everything?

Not exactly! The rating column is numeric, but it contains only categorical values. In such cases the code cannot identify the column as categorical and the plot may not look good. So it is always useful to have the charts broken down into the more specific methods. Apart from that, I do not see any issue in using the autoplot, since the very purpose of all this is to make life easier for data scientists.

[Figure: xy_auto_plot_issue]
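
The limitation described above comes from choosing the plot by data type alone. The sketch below shows that kind of decision using pandas type checks; guess_kind and guess_plot_family are hypothetical names written for illustration and are an assumption about the general approach, not the package's actual code.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def guess_kind(series):
    """Classify a column purely by its dtype."""
    return "numeric" if is_numeric_dtype(series) else "categorical"


def guess_plot_family(df, x, y):
    """Describe which family of plot a dtype-based tool would pick."""
    return f"{guess_kind(df[x])} vs {guess_kind(df[y])}"


df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6],
    "species": ["setosa", "setosa", "setosa", "setosa"],
    "rating": [1, 1, 0, 3],
})

# "rating" is stored as integers, so it is treated as numeric.
before = guess_plot_family(df, "sepal_length", "rating")

# After an explicit cast, the same column is recognized as categorical.
df["rating"] = df["rating"].astype("category")
after = guess_plot_family(df, "sepal_length", "rating")

print(before)
print(after)
```

This is why casting such columns first, or calling the specific method for the intended column types, tends to give better plots than relying on the automatic choice.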

Still better and most comfortable

Method : `juvini_profile(df[[list_of_cols]])`

Examples

This is the highest level: it combines all of the features above and gives the entire story with a single command.

import pandas as pd
from juvini import juvini_profile
df=pd.read_csv('iris_with_rating.csv')
juvini_profile(df,hue_col='species')

The output will contain 15 graphs

Analysis of numeric sepal_length and numeric sepal_length
Analysis of numeric sepal_length and numeric sepal_width
Analysis of numeric sepal_length and numeric petal_length
Analysis of numeric sepal_length and numeric petal_width
Analysis of numeric sepal_length and numeric rating
Analysis of numeric sepal_width and numeric sepal_width
Analysis of numeric sepal_width and numeric petal_length
Analysis of numeric sepal_width and numeric petal_width
Analysis of numeric sepal_width and numeric rating
Analysis of numeric petal_length and numeric petal_length
Analysis of numeric petal_length and numeric petal_width
Analysis of numeric petal_length and numeric rating
Analysis of numeric petal_width and numeric petal_width
Analysis of numeric petal_width and numeric rating
Analysis of numeric rating and numeric rating
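
The 15 titles listed above are every unordered pair of the five numeric columns, including each column paired with itself. The standard library can enumerate them; this sketch only reproduces the list of titles and makes no claim about how the package iterates internally.

```python
from itertools import combinations_with_replacement

numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "rating"]

# Unordered pairs with repetition: n * (n + 1) / 2 pairs for n columns.
pairs = list(combinations_with_replacement(numeric_cols, 2))

for x, y in pairs:
    print(f"Analysis of numeric {x} and numeric {y}")

print(len(pairs))
```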

Best Practices

When passing input to the tool, it is always advisable to leave out ID-like columns, which contain a large number of distinct categorical values.
If you ensure that all data types are properly mapped, you can make the best use of this tool.
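
One way to apply the first practice is to drop non-numeric columns in which almost every value is unique before handing the data frame to the tool. The helper drop_id_like and its 0.9 threshold are assumptions chosen for illustration; they are not part of the package, and numeric ID columns would still need to be removed by hand.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def drop_id_like(df, max_unique_ratio=0.9):
    """Drop non-numeric columns whose values are almost all distinct."""
    n_rows = max(len(df), 1)  # avoid dividing by zero on an empty frame
    keep = [
        col for col in df.columns
        if is_numeric_dtype(df[col])
        or df[col].nunique() / n_rows <= max_unique_ratio
    ]
    return df[keep]


df = pd.DataFrame({
    "flower_id": ["f001", "f002", "f003", "f004"],  # hypothetical ID column
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
})

trimmed = drop_id_like(df)
print(list(trimmed.columns))
```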

Final Notes

Feel free to comment or ask for any improvements. The motive behind this is to make the life of a data scientist easier; we should concentrate on the task more than on the coding.

Release history

1.0.8

May 12, 2021

1.0.7

Dec 12, 2020

1.0.6

Dec 8, 2020

1.0.5

Nov 29, 2020

1.0.4

Nov 29, 2020

This version

1.0.3

Nov 29, 2020

1.0.2

Nov 29, 2020

1.0.1

Nov 29, 2020

1.0.0

Nov 29, 2020

Download files

Source Distribution

juvini-1.0.3.tar.gz (15.5 kB)

Uploaded Nov 29, 2020 Source

Built Distribution

juvini-1.0.3-py3-none-any.whl (11.8 kB)

Uploaded Nov 29, 2020 Python 3

Hashes for juvini-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`747fc1408baf309fd20f251f454b6bfa5059bc05c64fbdd30f27bd11eef8c6c7`
MD5	`9dfd5bb25d9d08ce1c2da84345574521`
BLAKE2b-256	`8939684345b9ca3b6156192cd8fbeb023f19433d2dca646a7ec159f4ff5df6d6`

Hashes for juvini-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`15a50c42e3999933937385f60191597399d72abab55f79509e5acde9c5fd4bc7`
MD5	`5032825403cc7b8cbd6f28a3dae27b8e`
BLAKE2b-256	`633fff824705b8129da00ba7f1d849f625eac283a1533f0d5547cc8c85068222`
