Data Analysis : Initial Steps For Any Dataset (Version 1)

Kishan Tongrao
5 min read · Jan 13, 2021
Photo by Luke Chesser on Unsplash

We often get confused when reading a book that has no index.

This article should be useful for beginners and for anyone already working in the data field. When I was learning data science, I started somewhere and kept going, only to find later that I needed an index to keep me on the path of effective and productive learning. That is why, if you want to enter the data field, my advice is to start with the basics and get a thorough grounding in data analysis first.

As the title says, these are the initial steps. Data analysis is all about finding meaning and solutions in large amounts of data, but before that we need to get to know the dataset itself. This article presents the information we can extract from a dataset in that initial phase.

“The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that’s going to be a hugely important skill in the next decades.” — Hal Varian, Chief Economist at Google

Below are 12 initial steps for analyzing any dataset:

Step 1 : Import required libraries

Step 2 : Load dataset

Step 3 : Get general information of dataset

Step 4 : Get statistical information of dataset

Step 5 : Missing data finding and management

Step 6 : Check data type of each column and change it if required

Step 7 : Display heat map to visualize the correlation between features

Step 8 : Calculate and interpret measure of central tendency

Step 9 : Calculate and interpret measure of dispersion

Step 10 : Calculate and interpret moments

Step 11 : State problem statements and solve them

Step 12 : Visualize the solution

Step 1 — Import required libraries

Below are the most widely used libraries. Import them up front, because you will end up using them sooner or later. Trust me 👻.
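
As an illustration, a typical import block might look like the sketch below. The exact set of libraries is an assumption on my part (NumPy, pandas and Matplotlib); seaborn and scikit-learn are other common additions.

```python
import numpy as np                # numerical arrays and math helpers
import pandas as pd               # tabular data (DataFrame, Series)
import matplotlib.pyplot as plt   # plotting

# Print the installed versions, which is handy when sharing a notebook.
print("numpy :", np.__version__)
print("pandas:", pd.__version__)
```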

Step 2 — Load dataset

We are using the dataset below for reference.

dataset.csv
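
Loading a CSV file with pandas can be sketched as follows. The rows and column names here are hypothetical stand-ins (the real contents of dataset.csv are not shown), and the text is read from memory so the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for dataset.csv.
csv_text = """name,age,city,salary
Asha,29,Pune,52000
Ben,41,Mumbai,
Chen,35,Pune,61000
Dia,,Delhi,48000
"""

# With a real file on disk this would be: df = pd.read_csv("dataset.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```

Empty fields in the CSV are read as missing values (NaN), which matters for the later steps.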

Step 3 — Get general information of dataset

Now it is time to look at general information about the dataset, such as column names, non-null counts, the data type of each column, memory usage, and the index range.
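
A minimal sketch of this step, using a small hypothetical DataFrame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [29, 41, 35, np.nan],
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "salary": [52000, np.nan, 61000, 48000],
})

df.info()                    # index range, column names, non-null counts, dtypes, memory usage
print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names as a plain list
print(df.head(2))            # a quick look at the first two rows
```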

Step 4 — Get statistical information of dataset

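Using the describe() method, we can get summary statistics for both the numerical and the non-numerical columns. By default describe() summarizes only the numerical columns; passing include='all' adds the non-numerical ones.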
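
Here is a small sketch of both kinds of summary, again on hypothetical data. Excluding the numeric types is one way to get the non-numerical summary on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [29, 41, 35, np.nan],
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "salary": [52000, np.nan, 61000, 48000],
})

# Numerical columns: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# Non-numerical columns: count, unique, top (most frequent value), freq.
text_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```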

Step 5 — Missing data finding and management

Before moving on, we should take care 👿 of any missing data in the dataset. The demonstration below shows how to find and handle missing data.
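
One common way to find missing data is to count the missing values per column. The sketch below assumes a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [29, 41, 35, np.nan],
    "salary": [52000, np.nan, 61000, 48000],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_per_column = df.isnull().sum()

# The mean of a True/False column is the fraction of True values.
missing_percent = df.isnull().mean() * 100

print(missing_per_column)
print(missing_percent)
```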

Other ways to find missing values:
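
A few alternatives, sketched on hypothetical data. Note that isna() is simply another name for isnull().

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [29, 41, 35, np.nan],
    "salary": [52000, np.nan, 61000, 48000],
})

has_missing = df.isna().any()                    # per column: is anything missing?
total_missing = int(df.isna().sum().sum())       # missing cells in the whole table
non_missing_counts = df.count()                  # non-missing values per column
rows_with_missing = df[df.isna().any(axis=1)]    # rows with at least one missing cell

print(has_missing)
print(total_missing)
print(non_missing_counts)
print(rows_with_missing)
```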

Let's handle missing values using a simple imputer.
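
Assuming "simple imputer" refers to scikit-learn's SimpleImputer, a minimal sketch looks like this. The data and the choice of the mean strategy are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one missing value per column.
df = pd.DataFrame({
    "age": [29, 41, 35, np.nan],
    "salary": [52000, np.nan, 61000, 48000],
})

# Replace each missing value with the mean of its column.
# Other strategies include "median", "most_frequent" and "constant".
imputer = SimpleImputer(strategy="mean")
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])

print(df)
```

The mean strategy only works for numerical columns; for text columns, "most_frequent" or "constant" is the usual choice.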

Below are some other ways to handle missing data.
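
The sketch below shows a few pandas-only options on hypothetical data: dropping rows or columns, filling with the column median, and forward filling. Which one is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [29, 41, 35, np.nan],
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "salary": [52000, np.nan, 61000, 48000],
})

dropped_rows = df.dropna()           # drop every row that has a missing value
dropped_cols = df.dropna(axis=1)     # drop every column that has a missing value

# Fill each numerical column with its own median.
filled_median = df.fillna(df.median(numeric_only=True))

# Forward fill: copy the last seen value downwards (mostly useful for ordered data).
filled_ffill = df.ffill()

print(dropped_rows)
print(filled_median)
```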

Step 6 — Check data type of each column and change it if required

Check the data type of each column and change it where appropriate. Choosing a more compact type, for example category for repeated strings or a smaller integer type, can reduce memory usage.
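
A minimal sketch of checking and converting types. The columns are hypothetical: numbers stored as text, a repeated label, and dates stored as text.

```python
import pandas as pd

# Hypothetical sample data where every column was read as text.
df = pd.DataFrame({
    "age": ["29", "41", "35", "38"],
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "joined": ["2020-01-15", "2019-06-01", "2021-03-20", "2018-11-05"],
})

print(df.dtypes)
print("bytes before:", df.memory_usage(deep=True).sum())

# Text -> smallest integer type that can hold the values.
df["age"] = pd.to_numeric(df["age"], downcast="integer")
# Repeated labels -> category (stores each distinct label once).
df["city"] = df["city"].astype("category")
# Text dates -> datetime, which also enables date arithmetic.
df["joined"] = pd.to_datetime(df["joined"])

print(df.dtypes)
print("bytes after:", df.memory_usage(deep=True).sum())
```

On a frame this tiny the difference is small; the savings usually become noticeable as the number of rows grows.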

Step 7 — Display heat map to visualize the correlation between features
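
As a rough sketch, the correlation matrix can be computed with pandas and drawn with Matplotlib. The data is hypothetical, and plain Matplotlib is my choice here; seaborn's heatmap function is a popular alternative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical features.
df = pd.DataFrame({
    "age": [29, 41, 35, 38, 52],
    "experience": [4, 15, 9, 12, 25],
    "salary": [52000, 78000, 61000, 70000, 95000],
})

corr = df.corr()  # pairwise Pearson correlation, values between -1 and 1

fig, ax = plt.subplots(figsize=(4, 3))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write the correlation value inside each cell.
for i in range(len(corr.index)):
    for j in range(len(corr.columns)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="correlation")
fig.tight_layout()
fig.savefig("correlation_heatmap.png")
```

Values close to 1 or -1 indicate a strong linear relationship between two features; values near 0 indicate little or none.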

Step 8 — Calculate and interpret measure of central tendency
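
The usual measures here are the mean, median and mode. A small sketch on a hypothetical salary column with one very large value:

```python
import pandas as pd

# Hypothetical salaries; the last value is an outlier.
salary = pd.Series([48000, 52000, 52000, 61000, 150000], name="salary")

mean = salary.mean()       # arithmetic average, sensitive to outliers
median = salary.median()   # middle value, robust to outliers
mode = salary.mode()       # most frequent value(s), returned as a Series

print("mean  :", mean)
print("median:", median)
print("mode  :", mode.tolist())
```

When the mean sits well above the median, as it does here, a few large values are pulling it up, and the median is often the better description of a typical value.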

Step 9 — Calculate and interpret measure of dispersion
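
Common measures of dispersion are the range, variance, standard deviation and interquartile range (IQR). A sketch on the same kind of hypothetical salary column:

```python
import pandas as pd

# Hypothetical salaries; the last value is an outlier.
salary = pd.Series([48000, 52000, 52000, 61000, 150000], name="salary")

value_range = salary.max() - salary.min()   # spread between the extremes
variance = salary.var()                     # sample variance (divides by n - 1)
std = salary.std()                          # sample standard deviation
q1, q3 = salary.quantile([0.25, 0.75])      # first and third quartiles
iqr = q3 - q1                               # spread of the middle 50% of values

print("range   :", value_range)
print("variance:", variance)
print("std     :", std)
print("IQR     :", iqr)
```

The range and standard deviation react strongly to the outlier, while the IQR barely notices it, so comparing them gives a hint about how much extreme values dominate the spread.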

Step 10 — Calculate and interpret moments
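
Taking "moments" to mean the mean, variance, skewness and kurtosis, a minimal pandas sketch on hypothetical data could look like this:

```python
import pandas as pd

# Hypothetical salaries; the last value is an outlier.
salary = pd.Series([48000, 52000, 52000, 61000, 150000], name="salary")

mean = salary.mean()       # first moment: location
variance = salary.var()    # second central moment: spread (sample version)
skewness = salary.skew()   # third standardized moment: asymmetry
kurtosis = salary.kurt()   # fourth standardized moment, as excess kurtosis (normal = 0)

print("mean    :", mean)
print("variance:", variance)
print("skewness:", skewness)
print("kurtosis:", kurtosis)
```

A positive skewness means a longer tail on the right (a few large values), a negative one a longer tail on the left. Excess kurtosis above 0 suggests heavier tails than a normal distribution.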

Step 11 — State problem statements and solve them

This is where you state your problem statements and write the logic and code to solve them.
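
As a purely hypothetical example, suppose the problem statement is "Which city has the highest average salary?". With made-up data, it could be answered like this:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Delhi", "Mumbai"],
    "salary": [52000, 78000, 61000, 48000, 95000],
})

# Average salary per city, highest first.
avg_salary_by_city = (
    df.groupby("city")["salary"].mean().sort_values(ascending=False)
)

top_city = avg_salary_by_city.idxmax()   # label of the largest value

print(avg_salary_by_city)
print("Highest average salary:", top_city)
```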

Photo by sebastiaan stam on Unsplash

Step 12 — Visualize the solution

Once we have solved the problem statement, we should try to present the solution visually. A good visual communicates the result quickly.
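
For instance, the answer to a hypothetical question such as "Which city has the highest average salary?" fits naturally into a bar chart. The data and labels below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Delhi", "Mumbai"],
    "salary": [52000, 78000, 61000, 48000, 95000],
})

avg = df.groupby("city")["salary"].mean().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(avg.index, avg.values)            # one bar per city
ax.set_xlabel("City")
ax.set_ylabel("Average salary")
ax.set_title("Average salary by city")
fig.tight_layout()
fig.savefig("avg_salary_by_city.png")
```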

Photo by Clay Banks on Unsplash

Thank you for your time 🙂

To connect : kishantongrao123@gmail.com/kishan.tongs@gmail.com
