4. Data Preprocessing in Machine Learning (Handling Missing Values)
2. Importing the Datasets
Now we need to import the dataset which we have collected for our machine learning project. Before importing it, we need to set the folder that contains the dataset as the current working directory, so that the file can be referenced by its name alone.
read_csv() function:
To import the dataset, we will use the read_csv() function of the pandas library, which reads a CSV file into a DataFrame. Using this function, we can read a CSV file from a local path as well as from a URL.
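As a minimal sketch of the two steps above, the snippet below writes a small throwaway CSV into a temporary folder, makes that folder the working directory, and loads the file with read_csv(). The folder, the file name Data.csv, and the column values are assumptions made up for illustration; in a real project you would point os.chdir() at the folder that holds your own dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical project folder and file, created here only so the example runs.
project_dir = tempfile.mkdtemp()
csv_text = "Country,Age,Salary\nIndia,34,72000\nFrance,,48000\nGermany,30,\n"
with open(os.path.join(project_dir, "Data.csv"), "w") as f:
    f.write(csv_text)

# Make the dataset folder the current working directory.
os.chdir(project_dir)

# A relative file name now resolves against the working directory.
df = pd.read_csv("Data.csv")
print(df)

# read_csv() also accepts a URL string in place of a local path.
```

The empty fields in the CSV are read as NaN, which is what the rest of this section deals with.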
Handling Missing data:
The next step of data preprocessing is to handle missing data in the dataset. Missing values can cause serious problems for a machine learning model: many algorithms cannot accept them as input, and those that can may produce biased results. Hence it is necessary to handle the missing values present in the dataset.
Operating on Null Values
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:
1. isnull(): Generate a boolean mask indicating missing values
2. notnull(): Opposite of isnull()
3. dropna(): Return a filtered version of the data
4. fillna(): Return a copy of the data with missing values filled or imputed
Detecting null values
Pandas data structures have two useful methods for detecting null data:
1. isnull()
2. notnull()
Either one will return a Boolean mask over the data.
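A small sketch of both methods on a Series, assuming made-up values that include one NaN and one None to show that Pandas flags both as missing:

```python
import numpy as np
import pandas as pd

# Illustrative data: one NaN and one None among ordinary values.
data = pd.Series([1, np.nan, "hello", None])

missing_mask = data.isnull()   # True where the value is missing
present_mask = data.notnull()  # True where the value is present

print(missing_mask.tolist())  # [False, True, False, True]
print(present_mask.tolist())  # [True, False, True, False]

# A Boolean mask can be used directly as an index to keep only valid entries.
print(data[present_mask].tolist())  # [1, 'hello']
```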
Find the number of missing values per column:
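One common way to do this is to sum the Boolean mask column by column, since True counts as 1. The DataFrame below and its column names are hypothetical, chosen only to make the sketch runnable:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in every column.
df = pd.DataFrame({
    "Age": [34, np.nan, 30, np.nan],
    "Salary": [72000, 48000, np.nan, 61000],
    "Country": ["India", "France", None, "Spain"],
})

# isnull() gives a Boolean DataFrame; sum() adds up the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```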
Dropping null values:
In addition to the masking used before, there are two convenience methods:
1. dropna() (which removes NA values)
2. fillna() (which fills in NA values)
We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame. By default, dropna() will drop all rows in which any null value is present:
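A minimal sketch of the default behaviour, assuming a small made-up DataFrame; the second call shows the column-wise variant selected with the axis keyword:

```python
import numpy as np
import pandas as pd

# Illustrative 3x3 frame with two missing entries.
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Default: drop every row that contains at least one null value.
rows_kept = df.dropna()
print(rows_kept)

# Same rule applied to columns instead of rows.
cols_kept = df.dropna(axis="columns")
print(cols_kept)
```

Only the middle row and the last column survive, because every other row and column contains a null.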
But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:
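The how and thresh parameters described above can be sketched as follows. The DataFrame is made up for illustration; its last column is entirely null so that how='all' has something to remove:

```python
import numpy as np
import pandas as pd

# Illustrative frame: column 3 contains only null values.
df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])

# how='all' drops only the columns in which every value is null.
all_null_cols_dropped = df.dropna(axis="columns", how="all")
print(all_null_cols_dropped)

# thresh=3 keeps only the rows that have at least three non-null values.
at_least_three = df.dropna(axis="rows", thresh=3)
print(at_least_three)
```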
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept; for example, dropna(thresh=3) keeps only the rows with at least three non-null values.
Filling null values
Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this replacement using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.
We can fill NaN entries with a single value, such as zero:
We can fill NaN entries with a single value, such as one:
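The two single-value fills above can be sketched on a small Series. The values and the index labels are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative Series with two missing entries.
s = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

filled_zero = s.fillna(0)  # every NaN becomes 0
filled_one = s.fillna(1)   # every NaN becomes 1

print(filled_zero.tolist())  # [1.0, 0.0, 2.0, 0.0, 3.0]
print(filled_one.tolist())   # [1.0, 1.0, 2.0, 1.0, 3.0]

# fillna() returns a new object; the original Series still has its NaNs.
print(int(s.isnull().sum()))  # 2
```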
We can specify a forward-fill to propagate the previous value forward:
We can specify a back-fill to propagate the next value backward:
Notice that if a previous value is not available during a forward fill or backward fill, the NA value remains.
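A short sketch of both directions, assuming a made-up Series that starts and ends with a missing value so the leftover NA is visible. It uses the ffill() and bfill() methods; older tutorials often write fillna(method='ffill') instead, a spelling that recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Illustrative Series with NaN at both ends and in the middle.
s = pd.Series([np.nan, 1, np.nan, 2, np.nan], index=list("abcde"))

forward = s.ffill()   # carry the previous valid value forward
backward = s.bfill()  # carry the next valid value backward

print(forward)
print(backward)

# The first entry has no previous value, so it stays NaN after a forward fill;
# the last entry has no next value, so it stays NaN after a backward fill.
```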
Replacing missing categorical data:
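The original example for this step is not shown, so the following is only a sketch under assumptions: a hypothetical Country column is filled either with its most frequent category or with an explicit placeholder label. Both the column and the label "Unknown" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"Country": ["India", "France", None, "India", np.nan]})

# Option 1: fill with the most frequent category (mode() ignores missing values).
most_common = df["Country"].mode()[0]
df["Country_mode"] = df["Country"].fillna(most_common)

# Option 2: fill with an explicit placeholder so the gap stays visible.
df["Country_flag"] = df["Country"].fillna("Unknown")

print(df)
```

Filling with the mode keeps the set of categories unchanged, while a placeholder label lets a model treat "missing" as its own category.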
Missing numerical values filled with the mean:
Missing numerical values filled with the mode:
Missing numerical values filled with the median:
Missing numerical values filled with the most frequent value (this is the same statistic as the mode):
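The strategies listed above can be sketched with fillna() and the matching summary statistic. The ages below are made-up values, and the original post may have used a different tool for this step:

```python
import numpy as np
import pandas as pd

# Illustrative numerical column with one missing entry.
ages = pd.Series([20, 30, 30, np.nan, 60])

# Each statistic is computed from the non-missing values only.
filled_mean = ages.fillna(ages.mean())       # mean of 20, 30, 30, 60 is 35.0
filled_median = ages.fillna(ages.median())   # median is 30.0
filled_mode = ages.fillna(ages.mode()[0])    # most frequent value is 30.0

print(filled_mean.tolist())    # [20.0, 30.0, 30.0, 35.0, 60.0]
print(filled_median.tolist())  # [20.0, 30.0, 30.0, 30.0, 60.0]
print(filled_mode.tolist())    # [20.0, 30.0, 30.0, 30.0, 60.0]
```

The median and mode are less sensitive to outliers than the mean, which is why the mean-filled value here is pulled up by the single large entry.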
Categorical values: label encoding (encoding categorical data):
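The post does not show which tool it used for label encoding, so this is only one possible sketch, staying within Pandas: converting a column to the category dtype and reading its integer codes. The colour values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical column.
colors = pd.Series(["red", "green", "blue", "green"])

# Converting to the category dtype assigns each distinct label an integer code.
as_category = colors.astype("category")
codes = as_category.cat.codes

print(list(as_category.cat.categories))  # ['blue', 'green', 'red']
print(codes.tolist())                    # [2, 1, 0, 1]
```

The codes follow the sorted order of the labels, so they carry no meaning beyond identifying the category. Missing values, if present, receive the code -1, which is one reason to fill them before encoding.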