Before Data Analysis, You Need Data Preparation
One of the prerequisites for any type of analytics in data science is data preparation. Raw data usually has several shortcomings in structure, format, and consistency, so first it has to be converted to a usable form.
These are some types of data preparation you can conduct to make your data useful for analysis.
Filtering Data
First you must ascertain if all raw data should be prepared or some should be filtered out. You may only be interested in a subset of the data. Filtering data is usually automated with some tool that provides options or settings for how to filter.
Fixing Data Types
All data needs an appropriate data type. Is date/time a more appropriate data type for a column that is defined as varchar data? Is integer a more appropriate data type for a column that is set to a string? Some of these transformations may be detected and performed automatically, as most data preparation tools are able to find and transform wrongly set data types.
Removing Duplicates
Duplicate values are not always erroneous data. You should only remove duplicates if it is ascertained that duplicated values do not conform to the data definition. Another factor to be considered is which data characteristics are significant. Is character or text data that is different only in terms of case sensitivity considered duplicate? You also should determine which occurrence of the duplicated data is to be removed. If id 100 occurs in row 1 and row 3, which of these do you want to keep?
Replacing Null Values
Having a null value in a nullable column is not an issue that needs to be fixed. Consequently, replacing nulls should not be performed in bulk, as some null values may be valid. Preferably, replace each null value separately with a user-supplied value.
Trimming Data
Trimming data involves removing the extra space at the start and end of a value. This may be performed in bulk with most tools.
Finding and Replacing Missing Data
Finding and replacing missing data is not as straightforward; it may not be obvious that data is missing. To indicate a missing value, a null value could have been added, or spaces, or a blank value with no extra space. To add to the complexity, single or multiple contiguous extra spaces may have gotten removed during trimming. Each missing value should be ascertained individually and replaced.
Fixing Erroneous Data
Fixing erroneous data is probably the most complex task, as you need detailed information about what a data value should be. But some erroneous data is easy to find, such as misspelled text, extra spaces within a data value (not at the beginning or end), a lowercase or uppercase interchange for case-sensitive data, and missing space within text.
Formatting Data
Certain data type values have an appropriate format. For example, an integer value 75005 may need to be formatted as 75,005, or a date value may need to have the year, month, and date in a certain order.
Some data preparation tasks can be performed in bulk while others have to be applied per data value, and these activities can take some time. But they’re necessary in order to make your data usable in analysis.