ML Data Cleaning.
-
Data cleaning in Machine Learning, is the process of preparing data for analysis, by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Unfortunately, it's not as simple as organizing some rows or erasing information to make space for new data.
-
There’s a reason data cleaning is the most important step if you want to create a data-culture.
Lets take a predictions. It involves:
1) Fixing spelling and syntax errors
2) Standardizing data sets
3) Correcting mistakes such as empty fields
4) Identifying duplicate data points
Majority of a data scientist's time is spent on cleaning, rather than machine learning. At least 45% of data scientist's time is spent on preparing data.
Common Data Cleaning Techniques
Handling Missing Values:
Missing data can occur for various reasons, such as errors in data collection or transfer. There are several ways to handle missing data, depending on the nature and extent of the missing values.
Imputation:
Here, you replace missing values with substituted values. The substituted value could be a central tendency measure like mean, median, or mode for numerical data or the most frequent category for categorical data. More sophisticated imputation methods include regression imputation and multiple imputation.
Deletion:
You remove the instances with missing values from the dataset. While this method is straightforward, it can lead to loss of information, especially if the missing data is not random.
Removing Duplicates:
Duplicate entries can occur for various reasons, such as data entry errors or data merging. These duplicates can skew the data and lead to biased results. Techniques for removing duplicates involve identifying these redundant entries based on key attributes and eliminating them from the dataset.
Data Type Conversion:
Sometimes, the data may be in an inappropriate format for a particular analysis or model. For instance, a numerical attribute may be recorded as a string. In such cases, data type conversion, also known as datacasting, is used to change the data type of a particular attribute or set of attributes. This process involves converting the data into a suitable format that machine learning algorithms can easily process.
Outlier Detection:
Outliers are data points that significantly deviate from other observations. They can be caused by variability in the data or errors. Outlier detection techniques are used to identify these anomalies. These techniques include statistical methods, such as the Z-score or IQR method, and machine learning methods, such as clustering or anomaly detection algorithms.
Data cleaning is a vital step in the data science pipeline. It ensures that the data used for analysis and modeling is accurate, consistent, and reliable, leading to more robust and reliable machine learning models.