Data preprocessing

Lesson 4/16



Class-labeling the observations

This consists of arranging data by category, or labeling data points with the correct data type (e.g., for traditional data this can be numerical or categorical; for big data it can be text, digital image, or digital audio).
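As a rough illustration of labeling columns by data type, the sketch below marks a column as numerical when every non-missing value parses as a number, and as categorical otherwise. The function name `infer_column_type` and the toy table are assumptions made for this example, not part of the lesson.

```python
def infer_column_type(values):
    """Label a column 'numerical' if all non-missing values parse as numbers,
    otherwise 'categorical'. Empty strings and None count as missing."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "unknown"
    try:
        for v in present:
            float(v)
    except (TypeError, ValueError):
        return "categorical"
    return "numerical"


# Hypothetical toy table: column name -> raw values as read from a file.
columns = {
    "age": ["34", "27", "", "51"],
    "city": ["Paris", "Lima", "Paris", "Oslo"],
}

column_types = {name: infer_column_type(vals) for name, vals in columns.items()}
print(column_types)
```

A real pipeline would usually need more labels (dates, free text, images, audio); this only covers the numerical / categorical split mentioned above.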

Data cleansing / data scrubbing

Dealing with inconsistent data, such as misspelled categories and missing values.
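One possible way to handle both problems is sketched below: misspelled categories are snapped to the closest known category with the standard-library `difflib`, and missing numeric values are filled with the median of the values that are present. The list `VALID_CATEGORIES`, the helper names, and the choice of median filling are assumptions for illustration; other strategies (dropping rows, mean filling) are equally common.

```python
import difflib
from statistics import median

# Hypothetical list of the categories that are considered correct.
VALID_CATEGORIES = ["red", "green", "blue"]


def fix_category(value, valid=VALID_CATEGORIES, cutoff=0.6):
    """Return the closest valid category, or None if nothing is similar enough."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    matches = difflib.get_close_matches(cleaned, valid, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def fill_missing(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    fill = median(present)
    return [fill if v is None else v for v in values]


colours = ["red", "Gren", "bleu ", "green"]
cleaned_colours = [fix_category(c) for c in colours]

heights = [1.70, None, 1.82, 1.64]
filled_heights = fill_missing(heights)

print(cleaned_colours)
print(filled_heights)
```

The `cutoff` value controls how aggressive the spelling correction is; a value that is too low can silently map an unrelated entry onto a valid category.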


Data balancing





If the categories in the data contain an unequal number of observations, they may not be representative of the population.

Balancing methods, such as extracting an equal number of observations from each category and preparing that subset for processing, fix this issue.
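The equal-count extraction described above is often called random undersampling. A minimal sketch, assuming each observation carries its category and that shrinking every category to the size of the smallest one is acceptable; the function name `undersample` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_of, seed=0):
    """Keep an equal number of rows per category by randomly sampling
    each category down to the size of the smallest one."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    if not groups:
        return []
    target = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))
    return balanced


# Hypothetical imbalanced data: 8 "no" observations and 2 "yes" observations.
rows = [(i, "no") for i in range(8)] + [(i, "yes") for i in range(8, 10)]

balanced_rows = undersample(rows, label_of=lambda row: row[1])
balanced_counts = Counter(label for _, label in balanced_rows)
print(balanced_counts)
```

Undersampling discards data from the larger categories, so it suits cases where the smallest category still has enough observations to be useful.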


Data shuffling


Re-arranging data points to eliminate unwanted patterns and improve predictive performance further on.


This is applied, for example, if the first 100 observations in the data are from the first 100 people who have used a website: the data isn't randomized, and patterns due to sampling emerge.
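To make the website example concrete, the sketch below shuffles a list of observations that is originally in sign-up order, so that the first 100 entries are no longer the first 100 users. The helper `shuffled_copy`, the `seed` parameter, and the user ids are assumptions for illustration.

```python
import random


def shuffled_copy(observations, seed=None):
    """Return a new list with the same observations in random order.
    Passing a seed makes the shuffle reproducible."""
    rng = random.Random(seed)
    result = list(observations)
    rng.shuffle(result)
    return result


# Hypothetical user ids, stored in the order in which people used the website.
visitors = list(range(1, 201))

shuffled = shuffled_copy(visitors, seed=42)
first_hundred = shuffled[:100]

# The shuffled data contains exactly the same observations, in a new order.
print(sorted(shuffled) == visitors)
```

Copying before shuffling leaves the original order available, which can matter if the order itself is meaningful (for example, in time series).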



Data masking (big data)

Aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight.

Masking involves concealing the original data with random and false data, allowing the scientist to conduct their analyses without compromising private details.
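One simple form of masking is sketched below, under the assumption that only a single identifying column needs to be hidden: each distinct value is replaced by a random token, and the same original value always maps to the same token, so grouping and counting still work. The function `mask_column` and the sample records are hypothetical; production systems typically rely on dedicated masking tools and policies.

```python
import secrets
import string


def mask_column(values, length=8):
    """Replace each distinct value with a random token.
    Identical inputs receive identical tokens, so the masked column
    can still be used for grouping or joining."""
    alphabet = string.ascii_uppercase
    mapping = {}
    used_tokens = set()
    masked = []
    for value in values:
        if value not in mapping:
            token = "".join(secrets.choice(alphabet) for _ in range(length))
            while token in used_tokens:  # avoid two values sharing a token
                token = "".join(secrets.choice(alphabet) for _ in range(length))
            used_tokens.add(token)
            mapping[value] = token
        masked.append(mapping[value])
    return masked


# Hypothetical records: the name is confidential, the purchase amount is not.
names = ["Alice", "Bob", "Alice", "Carol"]
purchases = [30, 12, 45, 8]

masked_names = mask_column(names)
masked_records = list(zip(masked_names, purchases))
print(masked_records)
```

The analysis (for example, total purchases per customer) gives the same result on the masked records, while the real names never appear in the data that is shared.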