Data preprocessing

Lesson 4/16 | Study Time: 0 Min

Course: INTRODUCTION TO DATA SCIENCE AND ANALYTICS

Class-labeling the observations

•This consists of arranging data by category, or labelling data points with the correct data type (e.g., for traditional data this can be numerical or categorical; for big data: text, digital image, digital audio).
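As a minimal sketch of labelling traditional data, the snippet below marks each column as numerical or categorical. The helper `label_column` and the sample columns are hypothetical, invented for illustration; real datasets usually need more careful rules (dates, IDs stored as numbers, and so on).

```python
def label_column(values):
    """Return 'numerical' if every value parses as a number, else 'categorical'."""
    try:
        for v in values:
            float(v)
        return "numerical"
    except (TypeError, ValueError):
        return "categorical"

# Hypothetical sample columns, invented for illustration.
columns = {
    "age": ["34", "27", "51"],
    "city": ["Lagos", "Nairobi", "Accra"],
}

labels = {name: label_column(vals) for name, vals in columns.items()}
print(labels)
```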

Data cleansing / data scrubbing

•Dealing with inconsistent data, such as misspelled categories and missing values.
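The following sketch shows one possible way to handle both problems. The records and the `corrections` table are assumptions made up for this example, and filling missing values with the most frequent category is only one of several options (dropping the record is another).

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = ["male", "Male", "femlae", "female", None, "female"]

# Assumed lookup table mapping known misspellings to the intended category.
corrections = {"femlae": "female"}

def clean(value):
    """Normalise case and whitespace, then fix known misspellings."""
    if value is None:
        return None
    value = value.strip().lower()
    return corrections.get(value, value)

cleaned = [clean(v) for v in raw]

# One simple choice for missing values: fill with the most frequent category.
most_common = Counter(v for v in cleaned if v is not None).most_common(1)[0][0]
filled = [v if v is not None else most_common for v in cleaned]
print(filled)
```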

Data balancing

•If the categories in the data contain unequal numbers of observations, the data may not be representative of the population.

•Balancing methods, such as extracting an equal number of observations from each category before processing, fix this issue.
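Extracting an equal number of observations per category can be sketched as random undersampling to the size of the smallest category. The function `undersample`, its `key` and `seed` parameters, and the sample records are hypothetical names chosen for this illustration.

```python
import random
from collections import defaultdict

def undersample(records, key, seed=0):
    """Keep an equal number of records per category (the size of the smallest one)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    n = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# Hypothetical data: 8 observations in category "a", 3 in category "b".
records = [("a", i) for i in range(8)] + [("b", i) for i in range(3)]
balanced = undersample(records, key=lambda r: r[0])
print(len(balanced))
```

Undersampling discards data from the larger categories, so it suits cases where even the smallest category has enough observations.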

Data shuffling

•Re-arranging data points to eliminate unwanted patterns and improve predictive performance further on.

•This is applied, for example, if the first 100 observations in the data are from the first 100 people who have used a website; the data isn’t randomized, and patterns due to sampling emerge.
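A minimal sketch of shuffling, assuming the observations arrive in sign-up order. The list of IDs and the seed value are made up for the example; the seed only makes the result repeatable.

```python
import random

# Hypothetical observation IDs, stored in the order users signed up.
observations = list(range(1, 11))

rng = random.Random(42)       # seeded generator for a repeatable shuffle
shuffled = observations[:]    # copy, so the original order is preserved
rng.shuffle(shuffled)         # in-place random permutation of the copy
print(shuffled)
```

Shuffling a copy keeps the original ordering available in case the arrival order is itself needed later.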

Data masking (big data)

•Aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight.

•Masking involves replacing the original data with random, false data, allowing the scientist to conduct their analyses without compromising private details.
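One way to picture masking is to swap each confidential value for a random pseudonym, reusing the same pseudonym for repeated values so that counts and groupings still work. The function `mask_column`, the `user_` prefix, and the sample names are hypothetical; production masking tools offer stronger guarantees than this sketch.

```python
import secrets

def mask_column(values):
    """Replace each distinct value with a random pseudonym, used consistently."""
    mapping = {}
    masked = []
    for value in values:
        if value not in mapping:
            # Random token that carries no information about the original value.
            mapping[value] = "user_" + secrets.token_hex(4)
        masked.append(mapping[value])
    return masked

# Hypothetical confidential column.
names = ["Alice", "Bob", "Alice"]
masked = mask_column(names)
print(masked)
```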

Xaviour Aluku

Product Designer
