Content
Class-labeling the observations
•This consists of arranging data by category, or labelling each data point with the correct data type (e.g., numerical or categorical for traditional data; text, digital image, or digital audio for big data).
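As a minimal sketch of labelling by data type, the snippet below marks a column as numerical when every value parses as a number and as categorical otherwise. The helper name label_column and the sample columns are assumptions made for illustration; real data usually needs more careful rules (dates, IDs stored as digits, and so on).

```python
def label_column(values):
    """Return 'numerical' if every value parses as a float, else 'categorical'."""
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return "categorical"
    return "numerical"


# Hypothetical sample columns, stored as raw strings.
columns = {
    "age": ["34", "27", "45"],
    "city": ["Paris", "Rome", "Paris"],
}

# Map each column name to its inferred label.
labels = {name: label_column(values) for name, values in columns.items()}
print(labels)
```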
Data cleansing / data scrubbing
•Dealing with inconsistent data, such as misspelled categories & missing values.
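One possible way to handle both problems, sketched with Python's standard library: misspelled categories are mapped to the closest entry in a list of known categories, and missing values are flagged and then dropped. The category list, the 0.6 similarity cutoff, and the choice to drop rather than fill missing values are all assumptions for this example.

```python
from difflib import get_close_matches

# Hypothetical list of valid categories.
KNOWN_CATEGORIES = ["red", "green", "blue"]


def clean_category(value, known=KNOWN_CATEGORIES, cutoff=0.6):
    """Return the closest known category, or None if the value is missing or unmatched."""
    if value is None or not value.strip():
        return None  # missing value
    candidate = value.strip().lower()
    matches = get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


raw = ["red", "Gren", " blue ", None, "bleu", ""]
cleaned = [clean_category(value) for value in raw]

# Drop the observations that could not be repaired.
complete = [value for value in cleaned if value is not None]
print(cleaned)
print(complete)
```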
Data balancing
•If the categories in the data contain an unequal number of observations, they may not be representative of the population.
•Balancing methods, such as extracting an equal number of observations from each category before processing, fix this issue.
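The equal-extraction approach can be sketched as random undersampling: every category is cut down to the size of the smallest one. The function name undersample, the "label" field, and the fixed seed are assumptions for illustration; other balancing methods (such as oversampling the smaller categories) exist and are not shown.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, key, seed=0):
    """Keep an equal number of randomly chosen rows from each category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Every category is reduced to the size of the smallest one.
    n = min(len(group) for group in groups.values())
    rng = random.Random(seed)

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n))
    return balanced


# Hypothetical unbalanced data: 8 observations of "a", 3 of "b".
rows = [{"label": "a", "x": i} for i in range(8)]
rows += [{"label": "b", "x": i} for i in range(3)]

balanced = undersample(rows, "label")
counts = Counter(row["label"] for row in balanced)
print(counts)
```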
Data shuffling
•Re-arranging data points to eliminate unwanted patterns and improve predictive performance further on.
•This is applied, for example, if the first 100 observations in the data are from the first 100 people who have used a website; the data isn’t randomized, and patterns due to sampling emerge.
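A small sketch of shuffling, assuming the observations fit in a list: the order is randomized on a copy so the original sequence stays available. The sample records, the helper name shuffled, and the seed (used only to make the result reproducible) are illustrative assumptions.

```python
import random


def shuffled(observations, seed=42):
    """Return a new list with the observations in random order."""
    rng = random.Random(seed)
    result = list(observations)  # copy, so the input is left untouched
    rng.shuffle(result)
    return result


# Hypothetical log: users listed in the order they first visited a website.
visits = [{"user_id": i} for i in range(1, 201)]
randomized = shuffled(visits)

# The first rows are no longer simply the earliest visitors.
print([row["user_id"] for row in randomized[:5]])
```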
Data masking (big data)
•Aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight.
•Masking involves concealing the original data with random and false data, allowing the scientist to conduct their analyses without compromising private details.
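One way to picture masking, as a sketch rather than a production-grade technique: each distinct confidential value is replaced with a random token, and the same original always gets the same token so that per-person analysis still works. The field name "email", the "user_" prefix, and the sample records are assumptions for illustration.

```python
import secrets


def mask_column(rows, field):
    """Replace each distinct value of `field` with a random, consistent token."""
    mapping = {}  # original value -> random token
    used = set()  # tokens already handed out
    masked = []
    for row in rows:
        original = row[field]
        if original not in mapping:
            token = "user_" + secrets.token_hex(4)
            while token in used:  # extremely unlikely, but avoid collisions
                token = "user_" + secrets.token_hex(4)
            used.add(token)
            mapping[original] = token
        # Copy the row and overwrite only the confidential field.
        masked.append({**row, field: mapping[original]})
    return masked


# Hypothetical records containing a confidential field.
records = [
    {"email": "ann@example.com", "purchase": 20},
    {"email": "bob@example.com", "purchase": 35},
    {"email": "ann@example.com", "purchase": 15},
]

masked = mask_column(records, "email")
print(masked)
```

The non-confidential fields are unchanged, so totals per (masked) customer can still be computed without exposing the real addresses.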