Data Analysis : Initial Steps For Any Dataset (Version 1)

5 min read · Jan 13, 2021

Photo by Luke Chesser on Unsplash

We often get confused when reading a book that has no index.

This article should be useful for beginners and for anyone already working in the data field. When I was learning data science, I started somewhere at random and kept going, only to find later that I needed an index to keep me on the path of effective and productive learning. That's why, if you want to enter the field of data, my advice is to start with the very basics and get a solid grounding in data analysis first.

As the title says, these are the initial steps. Data analysis is all about finding meaning and solutions in large amounts of data, but before that we need to understand the dataset we are given. The information we can get from a dataset in this initial phase is presented here.

“The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that’s going to be a hugely important skill in the next decades.” — Hal Varian, Chief Economist at Google

Below are the 12 initial steps for analyzing any dataset:

Step 1 : Import pre-libraries
Step 2 : Load dataset
Step 3 : Get general information of dataset
Step 4 : Get statistical information of dataset
Step 5 : Missing data finding and management
Step 6 : Check data type of each column and change it if required
Step 7 : Display heat map to visualize the correlation between features
Step 8 : Calculate and interpret measures of central tendency
Step 9 : Calculate and interpret measures of dispersion
Step 10 : Calculate and interpret moments
Step 11 : State problem statements and solve them
Step 12 : Visualize the solution

Step 1 — Import pre-libraries

Below are some of the most widely used libraries. Just import them, because eventually you are going to use them, trust me 👻.
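The original import cell is not reproduced in this text, so the following is only a sketch of a typical set. It assumes pandas, NumPy, Matplotlib and seaborn are installed; the exact list the author used may differ.

```python
# Data handling and numerical work
import numpy as np
import pandas as pd

# Plotting libraries (assumed here; swap in whatever you prefer)
import matplotlib.pyplot as plt
import seaborn as sns

# Printing the versions makes the environment easier to reproduce
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```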

Step 2 — Load dataset

We are using below dataset for reference.

dataset.csv
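The contents of dataset.csv are not shown here, so this sketch reads a small invented CSV from memory instead. The column names and values are hypothetical; with a real file you would pass its path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for dataset.csv; columns and values are invented
csv_text = """Name,Age,Salary,Department
Asha,25,50000,HR
Ben,,62000,IT
Chen,31,,IT
Dev,42,58000,
"""

# pd.read_csv accepts a file path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text))

# Peek at the first rows to confirm the file was parsed as expected
print(dataset.head())
```

Empty fields in the CSV are read as missing values (NaN), which is convenient for the later steps.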

Step 3 — Get general information of dataset

Now it's time to look at the general information about the dataset, such as column names, non-null counts, the data type of each column, memory usage, and index range.
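A minimal sketch of how this information can be pulled out with pandas. The DataFrame is a hypothetical stand-in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are invented
dataset = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev"],
    "Age": [25, np.nan, 31, 42],
    "Department": ["HR", "IT", "IT", None],
})

# Number of rows and columns
print(dataset.shape)

# Column names as a plain list
print(dataset.columns.tolist())

# Index range, non-null counts, dtypes and memory usage in one report
dataset.info()
```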

Step 4 — Get statistical information of dataset

Using the describe() method, we can get summary statistics for the dataset. By default it covers only the numerical columns; passing include="all" adds the non-numerical ones.
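A short sketch of both calls, using a hypothetical DataFrame with one numerical and one text column:

```python
import pandas as pd

# Hypothetical sample data; column names and values are invented
dataset = pd.DataFrame({
    "Age": [25, 30, 35],
    "Department": ["HR", "IT", "IT"],
})

# Default: count, mean, std, min, quartiles and max for numerical columns
numeric_summary = dataset.describe()
print(numeric_summary)

# include="all" also reports count, unique, top and freq for text columns
full_summary = dataset.describe(include="all")
print(full_summary)
```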

Step 5 — Missing data finding and management

Before moving forward we should take care 👿 of any missing data in the dataset. The demonstration below shows how to find and handle missing data.
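As a starting point, here is a minimal sketch that counts missing values per column. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few gaps
dataset = pd.DataFrame({
    "Age": [25, np.nan, 31, 42],
    "Salary": [50000, 62000, np.nan, 58000],
    "Department": ["HR", "IT", "IT", None],
})

# Number of missing values in each column
missing_count = dataset.isnull().sum()
print(missing_count)

# Share of missing values in each column, as a percentage
missing_percent = dataset.isnull().mean() * 100
print(missing_percent)
```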

Other ways to find missing values.
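The original examples are not reproduced here. A few alternatives that pandas offers, sketched on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "Name" is complete, the other columns have gaps
dataset = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev"],
    "Age": [25, np.nan, 31, 42],
    "Salary": [50000, 62000, np.nan, 58000],
    "Department": ["HR", "IT", "IT", None],
})

# True/False per column: does it contain any missing value?
columns_with_missing = dataset.isna().any()
print(columns_with_missing)

# Total number of missing cells in the whole dataset
total_missing = int(dataset.isna().sum().sum())
print(total_missing)

# Only the rows that have at least one missing value
rows_with_missing = dataset[dataset.isna().any(axis=1)]
print(rows_with_missing)
```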

Let's handle missing values using a simple imputer.
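A minimal sketch, assuming "simple imputer" refers to scikit-learn's `SimpleImputer` and that the mean strategy is acceptable for the columns involved. The data is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one gap per column
dataset = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 40.0],
    "Salary": [50000.0, 62000.0, np.nan, 56000.0],
})

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
dataset[["Age", "Salary"]] = imputer.fit_transform(dataset[["Age", "Salary"]])

print(dataset)
```

Other strategies such as "median" or "most_frequent" can be passed in the same way when the mean is not a good fit.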

Below are some other ways we can handle missing data.
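The original list is not reproduced here. Two common options in pandas are dropping incomplete rows and filling gaps with a summary value, sketched on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps
dataset = pd.DataFrame({
    "Age": [25, np.nan, 31, 42],
    "Department": ["HR", "IT", "IT", None],
})

# Option 1: drop every row that has at least one missing value
dropped = dataset.dropna()
print(dropped)

# Option 2: fill gaps, using the median for numbers and the mode for text
filled = dataset.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Department"] = filled["Department"].fillna(filled["Department"].mode()[0])
print(filled)
```

Dropping rows is simple but loses data, so filling is usually preferable when only a few cells are missing.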

Step 6 — Check data type of each column and change it if required

We need to check the data type of each column and change it where required. Choosing a smaller or more suitable type can bring down memory usage.
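A sketch of checking and converting types on hypothetical data. The target types chosen here (`int8`, `category`, datetime) are assumptions that suit this sample; pick types that fit the value range of your own columns.

```python
import pandas as pd

# Hypothetical sample data; column names and values are invented
dataset = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Department": ["HR", "IT", "IT", "HR"],
    "Joined": ["2020-01-15", "2019-06-01", "2021-03-20", "2018-11-05"],
})

# Data type of each column and total memory before any change
print(dataset.dtypes)
print("bytes before:", dataset.memory_usage(deep=True).sum())

# Small integers fit in int8; repeated labels suit the category type;
# date strings become proper datetimes
dataset["Age"] = dataset["Age"].astype("int8")
dataset["Department"] = dataset["Department"].astype("category")
dataset["Joined"] = pd.to_datetime(dataset["Joined"])

print(dataset.dtypes)
print("bytes after:", dataset.memory_usage(deep=True).sum())
```

On a dataset this small the saving is minor; the effect is much more visible with many rows.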

Step 7 — Display heat map to visualize the correlation between features
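One way to draw such a heat map is with seaborn, sketched here on randomly generated, hypothetical data. The column names and the "coolwarm" colour map are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: experience and salary are built to rise with age
rng = np.random.default_rng(0)
age = rng.integers(20, 60, size=50)
dataset = pd.DataFrame({
    "Age": age,
    "Experience": age - 20 + rng.integers(0, 3, size=50),
    "Salary": 30000 + 1000 * age + rng.normal(0, 5000, size=50),
})

# Pairwise Pearson correlation between the numerical columns
corr = dataset.select_dtypes(include="number").corr()

# Values near 1 or -1 indicate a strong linear relationship
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between features")
fig.tight_layout()
# Call plt.show() in a script to display the figure
```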

Step 8 — Calculate and interpret measures of central tendency
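The usual measures are the mean, median and mode. A minimal sketch on a hypothetical salary column:

```python
import pandas as pd

# Hypothetical salaries with one very large value
salary = pd.Series([40000, 45000, 45000, 50000, 120000], name="Salary")

mean_salary = salary.mean()      # arithmetic average
median_salary = salary.median()  # middle value when sorted
mode_salary = salary.mode()[0]   # most frequent value

print(mean_salary, median_salary, mode_salary)
```

Here the mean (60000) sits well above the median (45000) because the single large salary pulls it upward, which is a hint that the median describes a typical value better for this column.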

Step 9 — Calculate and interpret measure of dispersion
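Common measures of dispersion are the range, variance, standard deviation and interquartile range (IQR). A sketch on a hypothetical salary column:

```python
import pandas as pd

# Hypothetical salaries with one very large value
salary = pd.Series([40000, 45000, 45000, 50000, 120000], name="Salary")

value_range = salary.max() - salary.min()  # spread between the extremes
variance = salary.var()                    # sample variance (ddof=1)
std_dev = salary.std()                     # square root of the variance

# IQR: spread of the middle 50% of the values
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

print(value_range, variance, std_dev, iqr)
```

The range is large because of the single extreme value, while the IQR stays small. When the two differ this much, the spread is being driven by outliers rather than by the bulk of the data.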

Step 10 — Calculate and interpret moments
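The first four moments are usually summarized as mean, variance, skewness and kurtosis. A sketch with pandas on a hypothetical salary column:

```python
import pandas as pd

# Hypothetical salaries with one very large value
salary = pd.Series([40000, 45000, 45000, 50000, 120000], name="Salary")

mean = salary.mean()      # first moment: location
variance = salary.var()   # second central moment: spread
skewness = salary.skew()  # third standardized moment: asymmetry
kurtosis = salary.kurt()  # fourth: tail weight, reported as excess kurtosis

print(mean, variance, skewness, kurtosis)
```

A positive skewness, as in this sample, means the distribution has a longer tail on the right. pandas reports excess kurtosis, so a normal distribution would score about 0.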

Step 11 — State problem statements and solve them

This is where you state your problem statements and write the logic and code to solve them.
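As an illustration only, here is a hypothetical problem statement, "Which department has the highest average salary?", solved with a group-by. The question and the data are invented.

```python
import pandas as pd

# Hypothetical sample data; column names and values are invented
dataset = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "Sales", "Sales", "HR"],
    "Salary": [40000, 60000, 70000, 50000, 52000, 44000],
})

# Average salary for each department
avg_salary = dataset.groupby("Department")["Salary"].mean()
print(avg_salary)

# Department whose average is the largest
top_department = avg_salary.idxmax()
print("Highest average salary:", top_department)
```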

Photo by sebastiaan stam on Unsplash

Step 12 — Visualize solution

When we are done solving the problem statement, we should try to present the solution with visuals, which communicate the result quickly.
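For example, a per-group result can be shown as a bar chart. This sketch uses Matplotlib with hypothetical average salaries per department; the numbers are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical result to present: average salary per department
avg_salary = pd.Series({"HR": 42000, "IT": 65000, "Sales": 51000})

# One bar per department makes the comparison easy to read
fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(avg_salary.index, avg_salary.values, color="steelblue")
ax.set_xlabel("Department")
ax.set_ylabel("Average salary")
ax.set_title("Average salary by department")
fig.tight_layout()
# Call plt.show() in a script to display the figure
```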

Photo by Clay Banks on Unsplash

Full code

thetongs/Data-analysis-initial-steps-version-1 (GitHub repository)

Thank you for your time 🙂

You may also like the Medium articles below.

Data Preprocessing In Python.

Hello, guys did you hear, see the above terminology data preprocessing? Those we want to be Data Scientist or Data…

Data slicing or indexing in python on datasets.

Hey guys this part is a very basic and important part. Before performing any action on the dataset we should know some…

Digital and Non-Digital PDF segregator — Python

There is one problem with the pdf extraction method. When we are using normal libraries like PyPDF2, Pdfplumber…

Document’s language identifier — Adv NLP and Python

There are different input files we have to work on in NLP based applications. So what about the identification of…

To connect : kishantongrao123@gmail.com/kishan.tongs@gmail.com

Written by Kishan Tongrao

Data Science Engineer
