EDA for dummies

Project description

JUVINI

A comprehensive graphing tool for EDA. It works like a profiler, but with graphs.

To install the package: `pip install juvini`

Introduction
Requirement
Usage
Best Practices

Introduction

Plotting graphs is one of the most important aspects of EDA. Graphs give intuitive insights because they are processed by our natural neural networks, which have been trained and evolved non-stop for years. This tool is designed to let data science users focus on reading the graphs rather than spend time writing the code to produce them. The tool works at several levels. At the highest level, the user only has to pass in the entire data frame, and the code produces the plots for all column combinations based on their data types, much like the way pairplot works for numerical data types.

Requirement

You should have some familiarity with Python. The tool can be run from Jupyter as well as from the Python console.
You should have a good understanding of the different graph types, especially boxplot, scatterplot, barplot, countplot and distplot.
Optional, but recommended: if you know the data type of each column, convert the columns to those types before plotting, because the graphs will look better. For example, if a column holds the categorical values 1, 2, 3, 4, convert it to object or category so that the tool can recognise it. Otherwise the tool assumes the column is numeric and draws numeric graphs for it. Note that juvini automatically treats a numeric column as categorical if it has fewer than 5 unique values.
The tool always treats the first column as the X axis and the second column as the Y axis. If the parameter hue_col is specified, it looks for that column among the remaining columns of the input dataframe.
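The data type advice above can be sketched with plain pandas. This is a minimal illustration, not juvini's own code: the helper `looks_categorical` and the toy values are assumptions made for the example, and only the threshold of 5 unique values comes from the text above.

```python
import pandas as pd

# Toy frame; "rating" holds category codes stored as integers.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "rating": [1, 1, 0, 3, 0, 1],
})

def looks_categorical(series, max_unique=5):
    """Hypothetical check: non-numeric, or numeric with fewer than max_unique values."""
    if not pd.api.types.is_numeric_dtype(series):
        return True
    return series.nunique() < max_unique

# Before conversion, "rating" only counts as categorical through the heuristic.
before = {col: looks_categorical(df[col]) for col in df.columns}

# Explicit conversion removes the guesswork.
df["rating"] = df["rating"].astype("category")
after_dtype = str(df["rating"].dtype)
```

Converting explicitly is the safer route when a categorical column has 5 or more distinct codes, since the heuristic would no longer catch it.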

Usage

Consider the standard IRIS dataset. Here it is modified slightly to add a numeric column rating with the values 0, 1, 2, 3. Even though rating is categorical, it is deliberately kept as a numeric column to show some use cases that come up in later sections. The dataset consists of 6 columns:

sepal_length - numeric
sepal_width - numeric
petal_length - numeric
petal_width - numeric
species - categorical
rating - numeric (in a real scenario it would be categorical)

Sample output

sepal_length,sepal_width,petal_length,petal_width,species,rating
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,0
4.6,3.1,1.5,0.2,setosa,3
5.0,3.6,1.4,0.2,setosa,0
5.4,3.9,1.7,0.4,setosa,1
4.6,3.4,1.4,0.3,setosa,3
5.0,3.4,1.5,0.2,setosa,0
4.4,2.9,1.4,0.2,setosa,1
4.9,3.1,1.5,0.1,setosa,1
5.4,3.7,1.5,0.2,setosa,0
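To see why rating ends up as a numeric column when the file is read, here is a small standard-library sketch that parses a few of the sample rows and labels each column by whether all its values parse as numbers. It only mimics the idea of type inference; the helper `is_number` is an assumption for the example and this is not how pandas or juvini decide.

```python
import csv
import io

# A few of the sample rows shown above.
SAMPLE = """sepal_length,sepal_width,petal_length,petal_width,species,rating
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,0
4.6,3.1,1.5,0.2,setosa,3
"""

def is_number(text):
    """Return True if the text can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# A column is labelled numeric only if every value in it parses as a number.
inferred = {
    name: "numeric" if all(is_number(row[name]) for row in rows) else "categorical"
    for name in rows[0]
}
print(inferred)
```

The rating column comes out as numeric here, which is exactly the situation the later sections deal with.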

NUMERIC vs NUMERIC - to plot graph where two columns are numeric.

Method : `num_num(df[[num_col1,num_col2]])`

Examples

Simple numeric-to-numeric plotting:

import pandas as pd
from juvini import num_num
df=pd.read_csv('iris_with_rating.csv')
num_num(df[['sepal_length','sepal_width']])

What if you want to add a hue? Add the extra column species to the input dataframe and pass the parameter hue_col='species'.

num_num(df[['sepal_length','sepal_width','species']],hue_col='species')
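The column convention used in this call (first column is X, second is Y, hue column passed along in the same frame) can be sketched as a small helper. The function `split_roles` is hypothetical and written only for illustration; juvini's internal handling may differ.

```python
def split_roles(columns, hue_col=None):
    """Return (x, y, hue) for a list of column names; y is None for a single column."""
    if hue_col is not None and hue_col not in columns:
        # The hue column has to be part of the frame that is passed in.
        raise ValueError(f"hue column {hue_col!r} must be part of the input frame")
    # Assumption: the hue column is set aside and the rest are read in order.
    plot_cols = [c for c in columns if c != hue_col]
    if not plot_cols:
        raise ValueError("at least one column besides the hue is required")
    x = plot_cols[0]
    y = plot_cols[1] if len(plot_cols) > 1 else None
    return x, y, hue_col

roles = split_roles(["sepal_length", "sepal_width", "species"], hue_col="species")
print(roles)
```

Forgetting to include the hue column in the selected frame is the most likely mistake, which is why the sketch checks for it first.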

additional parameters

x_name='xvalue', the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. By default the first column's name is used.
y_name='yvalue', same as x_name, but for the Y axis.
size_figure=(13,4), the figure size. Depending on the size of your screen you may want to change it. The default is (13,4) with a tight layout.
hue_col, the column to use for the hue. See the example above.

CATEGORICAL vs CATEGORICAL - to plot graph where both columns are categorical.

Method : `cat_cat(df[[cat_col1,cat_col2]])`

Examples

This takes the top 5 categories of each column and plots them. You can change the value 5 with the parameters xcap and ycap described below. For each value of X, it gives the countplot of the values in Y. The tool also takes care of the subplots and the figure size, so you do not have to work out the sizing or the subplot grid.

import pandas as pd
from juvini import cat_cat
df=pd.read_csv('iris_with_rating.csv')
cat_cat(df[['species','rating']])

Similarly, interchanging the first and second columns swaps the axes:

cat_cat(df[['rating','species']])

But wait, did we just use a numerical column as a categorical one? Yes. If you know a column is categorical, you do not have to change its data type first; the code takes care of converting it to a category.

What if there are too many categories and you only need the gist of the top few? That is also supported: provide the parameter xcap=<value>, and the code sorts the categories by count and keeps the top n.

additional parameters

x_name='xvalue', the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. By default the first column's name is used.
y_name='yvalue', same as x_name, but for the Y axis.
size_figure=(13,4), the figure size. Depending on the size of your screen you may want to change it. The default is (13,4) with a tight layout.
xcap=5, caps the categories of the X axis (first column) to the top 5 by count. The default is 5.
ycap=5, same as xcap, but applies to the Y column.
hue_col, the column to use for the hue. See the example above.
scols=3, an experimental feature, use with caution. Controls how many plots appear in one row. The default is 3.
others=True, an experimental feature, use with caution. Puts all values that are not among the top categories into a single category called 'restall'.
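The capping behaviour described for xcap, ycap and others can be sketched with the standard library. This is an assumption-laden illustration of the idea (sort by count, keep the top n, optionally fold the rest into 'restall'), not juvini's implementation; the function `cap_categories` and the toy ratings are invented for the example.

```python
from collections import Counter

def cap_categories(values, cap=5, others=False, rest_label="restall"):
    """Keep the `cap` most frequent values; drop or relabel everything else."""
    counts = Counter(values)
    top = [value for value, _ in counts.most_common(cap)]
    keep = set(top)
    if others:
        # Fold every value outside the top categories into one bucket.
        capped = [v if v in keep else rest_label for v in values]
    else:
        # Without `others`, values outside the top categories are left out.
        capped = [v for v in values if v in keep]
    return top, capped

# Toy ratings: 1 appears five times, 0 four times, 3 twice, 2 once.
ratings = [1, 1, 0, 3, 0, 1, 3, 0, 1, 1, 0, 2]
top, capped = cap_categories(ratings, cap=2, others=True)
print(top)
print(capped)
```

With others left at False, the same call would simply drop the three values that fall outside the top two categories.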

CATEGORICAL vs NUMERICAL - to plot graph where x is categorical and y is numeric.

Method : `cat_num(df[[cat_col1,num_col2]])`

Examples

This takes the top 5 categories of the categorical column and plots the numerical column against them. You can change the value 5 with the parameter xcap described below. For each value of X, it gives the boxplot of the numerical column for that category. Additionally, it gives the aggregate sum of the numerical values for each category.

It is up to you to decide which is useful. The boxplot is always useful, whereas the sum aggregate helps for quantities such as total votes but may not be meaningful for a measurement like sepal_width. Either way, there is no harm in showing both.

import pandas as pd
from juvini import cat_num
df=pd.read_csv('iris_with_rating.csv')
cat_num(df[['species','petal_length']])
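To make the difference between the two outputs concrete, here is a standard-library sketch that computes, per category, the sum that the aggregate plot would show next to a few of the statistics a boxplot summarises. The helper `summarize_by_category` and the toy values are assumptions for illustration, not part of juvini.

```python
from collections import defaultdict
from statistics import median

def summarize_by_category(pairs):
    """Group (category, value) pairs and return sum, median, min and max per category."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {
            "sum": sum(vals),
            "median": median(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for category, vals in groups.items()
    }

# Toy (species, petal_length) pairs.
data = [
    ("setosa", 1.4), ("setosa", 1.3), ("setosa", 1.5),
    ("versicolor", 4.7), ("versicolor", 4.5),
]
summary = summarize_by_category(data)
for species, stats in summary.items():
    print(species, stats)
```

For a measurement such as petal_length the median, min and max carry the story, while the sum mostly reflects how many rows each category has.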

Can a numerical column be used as the categorical column? Yes. If you know it is categorical, you do not have to change its data type first; the code converts it to a category as long as you pass it as the first column of the input.

How about the hue? That works here too. Provide it like this:

cat_num(df[['species','petal_length','rating']],hue_col='rating')

additional parameters

x_name='xvalue', the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. By default the first column's name is used.
y_name='yvalue', same as x_name, but for the Y axis.
size_figure=(13,4), the figure size. Depending on the size of your screen you may want to change it. The default is (13,4) with a tight layout.
xcap=5, caps the categories of the X axis (first column) to the top 5 by count. The default is 5.
hue_col, the column to use for the hue. See the example above.
others=True, an experimental feature, use with caution. Puts all values that are not among the top categories into a single category called 'restall'. For example, the ratings are 0-3; if you cap them to the top 2, the remaining ratings go into the "restall" value.

cat_num(df[['rating','petal_length']],xcap=2,others=True)

Single NUMERICAL - to plot graph with just a numerical column

Method : `single_num(df[[num_col1]])`

Examples

A plot does not always need two columns. What if you just need the boxplot or the distribution of one numeric column? This method gives a boxplot and a distplot. It is usually used with the hue to give more insight.

import pandas as pd
from juvini import single_num
df=pd.read_csv('iris_with_rating.csv')
single_num(df[['sepal_length']])
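As a rough guide to what the boxplot of a single numeric column conveys, the sketch below computes a five-number summary with the standard library, using the sepal_length values from the sample rows. The helper `five_number_summary` is hypothetical, and plotting libraries may use a slightly different quartile method, so treat the quartiles as approximate.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, q1, median, q3, max), the numbers a boxplot is built around."""
    ordered = sorted(values)
    # "inclusive" treats the data as the whole population when interpolating.
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

# sepal_length values from the sample rows shown earlier.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4]
low, q1, med, q3, high = five_number_summary(sepal_length)
print(f"min={low} q1={q1:.2f} median={med:.2f} q3={q3:.2f} max={high}")
```

Running the same summary once per hue value gives the per-group comparison that the hue adds to the plot.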

How about the hue? That works here too. Provide it like this:

single_num(df[['sepal_length','species']],hue_col='species')

additional parameters

x_name='xvalue', the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. By default the first column's name is used.
size_figure=(13,4), the figure size. Depending on the size of your screen you may want to change it. The default is (13,4) with a tight layout.
hue_col, the column to use for the hue. See the example above.

Single CATEGORICAL - to plot graph with just a categorical column

Method : `single_cat(df[[cat_col1]])`

Examples

A plot does not always need two columns. What if you just need to see how the values of one categorical column are distributed? This method covers that case. It is usually used with the hue to give more insight.

import pandas as pd
from juvini import single_cat
df=pd.read_csv('iris_with_rating.csv')
single_cat(df[['species']])

To show only the top 2 categories, pass xcap=2:

single_cat(df[['species']],xcap=2)

What if you want to change ycap rather than xcap? That works as well: pass the parameter ycap=<value> just like xcap.

How about the hue? That works here too. Provide it like single_cat(df[['species','hue_column']],hue_col='hue_column'):

single_cat(df[['species','rating']],hue_col='rating')

additional parameters

x_name='xvalue', the label to show on the X axis for the first column. Sometimes the column name differs from the name you want to see in the graph. By default the first column's name is used.
size_figure=(13,4), the figure size. Depending on the size of your screen you may want to change it. The default is (13,4) with a tight layout.
hue_col, the column to use for the hue. See the example above.
xcap=5, caps the categories of the X axis (first column) to the top 5 by count. The default is 5.

To make it even easier

Method : `xy_auto_plot(df[[col1,col2]])`

Examples

What if you do not care what the data type is and just want the code to decide based on the data types already present? Can you do that?

Yes, there is a method that does exactly this. Simply give it two columns: the first is taken as the X variable and the second as the Y variable, and it provides the appropriate graph based on their data types.

import pandas as pd
from juvini import xy_auto_plot
df=pd.read_csv('iris_with_rating.csv')
xy_auto_plot(df[['sepal_length','species']])
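The idea of choosing a plot from the data types can be sketched as a lookup table over the methods documented in this README. The function `choose_plot` is hypothetical. The README does not state which plot is drawn for a numeric X with a categorical Y (the order used in the call above), so the sketch deliberately leaves that case undefined rather than guess.

```python
def choose_plot(x_kind, y_kind=None):
    """Map a pair of column kinds to the name of the matching juvini method."""
    table = {
        ("numeric", "numeric"): "num_num",
        ("categorical", "categorical"): "cat_cat",
        ("categorical", "numeric"): "cat_num",
        ("numeric", None): "single_num",
        ("categorical", None): "single_cat",
    }
    # Assumption: combinations the README does not document return None.
    return table.get((x_kind, y_kind))

for kinds in [("numeric", "numeric"), ("categorical", "numeric"), ("categorical", None)]:
    print(kinds, "->", choose_plot(*kinds))
```

Calling the specific method yourself is the way to stay in control when the automatic choice is not the plot you wanted.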

Does it support hue? Yes, you can use the same parameter hue_col=<colname>, and if the graph can handle a hue, it will use it. As with the other methods, include the hue column in the input dataframe.

xy_auto_plot(df[['sepal_length','species','rating']],hue_col='rating')

cat_num(df[['rating','sepal_length']])

Still better and most comfortable

Method : `juvini_profile(df[[list_of_cols]])`

Examples

This is the highest level: it combines all the features above and gives the entire story with a single command.

import pandas as pd
from juvini import juvini_profile
df=pd.read_csv('iris_with_rating.csv')
juvini_profile(df,hue_col='species')

Numerical columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'rating']
Categorical columns: []

[Figures: one plot per pair of numeric columns, each titled "Analysis of numeric <x> and numeric <y>", covering every pairing of sepal_length, sepal_width, petal_length, petal_width and rating, including each column paired with itself (15 plots in total).]
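The order of the "Analysis of ..." titles in this output matches what you get from enumerating the numeric columns as unordered pairs with repetition. The sketch below reproduces that list with the standard library; it is an observation about the output, not a claim about how juvini_profile is written.

```python
from itertools import combinations_with_replacement

numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "rating"]

# Every unordered pair, including a column paired with itself.
pairs = list(combinations_with_replacement(numeric_cols, 2))

for x, y in pairs:
    print(f"Analysis of numeric {x} and numeric {y}")
```

The number of plots grows roughly with the square of the number of columns, so selecting a subset of columns first keeps the output manageable on wide data frames.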

An easier way to get only the graphs related to the dependent variable

In many cases you do not need every possible graph and are only interested in the graphs related to the target variable. To do this, use juvini_against_target(df[col_list],target_col=<target_variable>)

import pandas as pd
from juvini import juvini_against_target
df=pd.read_csv('iris_with_rating.csv')
juvini_against_target(df,target_col='species')
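Compared with the full profile, plotting against a target only needs one pair per remaining column. The sketch below shows that reduction; the helper `target_pairs` is hypothetical, and which of the two columns juvini places on the X axis is an assumption here.

```python
def target_pairs(columns, target_col):
    """Pair every non-target column with the target column."""
    if target_col not in columns:
        raise ValueError(f"target column {target_col!r} not found in the input columns")
    # Assumption: the feature comes first and the target second.
    return [(col, target_col) for col in columns if col != target_col]

columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species", "rating"]
pairs = target_pairs(columns, target_col="species")
for feature, target in pairs:
    print(f"{feature} against {target}")
```

For the six-column dataset this gives 5 plots instead of one for every combination of columns.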

Release history

1.0.8 - May 12, 2021 (this version)
1.0.7 - Dec 12, 2020
1.0.6 - Dec 8, 2020
1.0.5 - Nov 29, 2020
1.0.4 - Nov 29, 2020
1.0.3 - Nov 29, 2020
1.0.2 - Nov 29, 2020
1.0.1 - Nov 29, 2020
1.0.0 - Nov 29, 2020

