    Scaling your data before clustering

    You will often need to perform some form of preprocessing on your dataset before running a data clustering algorithm. In this post I will introduce one important preprocessing method that is often overlooked when clustering data. That method, my friends, is called data scaling.

    What is data scaling?

    As you may already know, clustering algorithms work by computing distances (i.e., dissimilarities) between data points and grouping together the points that are close to one another. The method used to calculate the distance depends on the algorithm. One trait most of these methods share is high sensitivity to attributes (also called features) whose values are numerically larger than those of the other attributes in the dataset. These features bias the distance calculations and can cause the clustering algorithm to produce subpar clusterings.

    To illustrate this, let’s say you have a dataset containing SAT and GPA scores for 5000 college students that you’d like to segment into 4 groups.

    Student   SAT    GPA
    1         1419   1.35
    2         1384   1.16
    3         784    3.58
    4         748    3.31
    5         1124   0.98
    ...
    4996      683    2.02
    4997      739    3.58
    4998      741    3.21
    4999      1088   1.01
    5000      595    1.60

    After running K-means on the dataset, you create a scatter plot showing the student scores and the clusters each student is assigned to.

    It’s immediately apparent from the plot that a large number of students were mislabeled by the algorithm. SAT scores have a much larger numerical scale (400 to 1600) than grade point averages (0 to 4), so they have a much larger influence on the distance measurement K-means uses to group the students.
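    To put a number on that imbalance, here is a minimal Python sketch that measures the Euclidean distance (the measure K-means conventionally uses) between students 1 and 3 from the table and works out how much of the squared distance comes from each feature. The variable names are made up for illustration.

```python
import math

# Students 1 and 3 from the table, as (SAT, GPA).
student_1 = (1419, 1.35)
student_3 = (784, 3.58)

# Straight-line (Euclidean) distance on the raw, unscaled values.
distance = math.dist(student_1, student_3)

# Share of the squared distance contributed by the SAT column.
sat_sq = (student_1[0] - student_3[0]) ** 2
gpa_sq = (student_1[1] - student_3[1]) ** 2
sat_share = sat_sq / (sat_sq + gpa_sq)

print(f"distance = {distance:.2f}")   # distance = 635.00
print(f"SAT share = {sat_share:.4%}")
```

    The distance is essentially just the SAT gap of 635 points; the 2.23-point GPA gap, which is more than half of the GPA scale, barely registers.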

    As the plot shows, K-means leaned heavily on the SAT scores to cluster the data. To ensure that GPA and SAT have equal weight in the clustering, both features need to be transformed so they are on the same scale. This is exactly what data scaling is for.

    Here’s the scatter plot we get after scaling the dataset and running K-means:

    [Figure: Student clusters after applying data scaling]
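    The effect can be reproduced on a toy example. The sketch below is a bare-bones K-means (Lloyd's algorithm) in plain Python. It runs on six hypothetical students rather than the 5000 in this post, with k=2 and the first two points as starting centroids. Those choices, and the `kmeans` helper itself, are assumptions made to keep the example short and deterministic. The scaling uses the 400 to 1600 and 0 to 4 score ranges.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical (SAT, GPA) pairs: the real split is low GPA vs. high GPA.
students = [(900, 1.0), (1100, 1.2), (950, 3.6),
            (1050, 3.4), (990, 1.1), (1010, 3.5)]

# Rescale both features to 0..1 using the known score ranges.
scaled = [((sat - 400) / 1200, gpa / 4) for sat, gpa in students]

raw_labels = kmeans(students, k=2)
scaled_labels = kmeans(scaled, k=2)
print(raw_labels)     # [0, 1, 0, 1, 0, 1]  (split by SAT)
print(scaled_labels)  # [0, 0, 1, 1, 0, 1]  (split by GPA)
```

    On the raw values the two clusters simply separate lower SAT scores from higher ones. After scaling, the low-GPA and high-GPA students end up in separate clusters.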

    Data Scaling Techniques

    There are two main methods for scaling data. The first method is called normalization and the second method is called standardization.

    Normalization

    Normalization uses the minimum and maximum of each feature to transform the features onto the same scale. Below is the formula used to normalize each feature:

        x' = (x - min(x)) / (max(x) - min(x))

    Here’s what the student scores dataset would look like when each feature is normalized:

    Student   SAT        GPA
    1         0.849167   0.3375
    2         0.820000   0.2900
    3         0.320000   0.8950
    4         0.290000   0.8275
    5         0.603333   0.2450
    ...
    4996      0.235833   0.5050
    4997      0.282500   0.8950
    4998      0.284167   0.8025
    4999      0.573333   0.2525
    5000      0.162500   0.4000
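    As a rough sketch, min-max normalization takes only a few lines of Python. The `normalize` helper and its `lo`/`hi` parameters are hypothetical names. The bounds of 400 to 1600 for SAT and 0 to 4 for GPA are an assumption, chosen because they reproduce the values in the table for the first five students.

```python
def normalize(values, lo=None, hi=None):
    """Min-max scale values to [0, 1].

    lo and hi default to the observed minimum and maximum.
    Note: this divides by zero if every value is identical.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]

# First five students from the table.
sat = [1419, 1384, 784, 748, 1124]
gpa = [1.35, 1.16, 3.58, 3.31, 0.98]

# Assumed bounds: SAT 400-1600, GPA 0-4.
sat_scaled = normalize(sat, lo=400, hi=1600)
gpa_scaled = normalize(gpa, lo=0, hi=4)

print([round(v, 6) for v in sat_scaled])  # [0.849167, 0.82, 0.32, 0.29, 0.603333]
print([round(v, 4) for v in gpa_scaled])  # [0.3375, 0.29, 0.895, 0.8275, 0.245]
```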

    Normalization rescales each feature to a fixed range, usually 0 to 1 (or sometimes -1 to 1). It is typically used when the distribution of the features is not known. Normalization fails, however, on datasets with outliers. Allow me to explain why.

    Suppose you have a small dataset containing the following points.

    X      Y
    0      1
    2      3
    5      3
    10     13
    15     17
    20     30
    22     23
    24     19
    990    25
    1000   27

    As you can see from the table, the last two data points are outliers. Using normalization, the data points will transform into the following:

    X       Y
    0.000   0.000
    0.002   0.069
    0.005   0.069
    0.010   0.414
    0.015   0.552
    0.020   1.000
    0.022   0.759
    0.024   0.621
    0.990   0.828
    1.000   0.897

    Even though normalization shifts the values between 0 and 1, the outliers still remain as outliers. If we were to compare these points using a distance function, the [] feature would have more influence on the measurement.
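    A short Python sketch makes the squeeze visible. It normalizes the X column from the table twice, once with the two outliers and once without, and compares how much of the 0 to 1 range the eight ordinary points occupy. Dropping the outliers here is only for comparison, not a recommendation from this post.

```python
# X column from the table, outliers (990 and 1000) last.
x = [0, 2, 5, 10, 15, 20, 22, 24, 990, 1000]

lo, hi = min(x), max(x)
x_scaled = [(v - lo) / (hi - lo) for v in x]

# With the outliers present, the eight ordinary points share a thin sliver.
ordinary = x_scaled[:8]
print(max(ordinary) - min(ordinary))  # 0.024

# Scaled on their own, the same eight points span the whole range.
trimmed = x[:8]
t_lo, t_hi = min(trimmed), max(trimmed)
trimmed_scaled = [(v - t_lo) / (t_hi - t_lo) for v in trimmed]
print(max(trimmed_scaled) - min(trimmed_scaled))  # 1.0
```

    With the outliers included, 80% of the points are packed into 2.4% of the scaled X range, so differences among them in X all but vanish.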

    Standardization

    Standardization uses the mean and standard deviation of each feature to transform a dataset. The formula for transforming each feature is shown below:

        z = (x - µ) / σ

    Here, µ is the mean of the feature and σ is the standard deviation of the feature. This is what the same student score dataset from earlier would look like when standardized.

    Student   SAT         GPA
    1          1.393781   -0.447397
    2          1.283359   -0.634040
    3         -0.609587    1.743211
    4         -0.723163    1.477980
    5          0.463083   -0.810861
    ...
    4996      -0.928232    0.210768
    4997      -0.751557    1.743211
    4998      -0.745248    1.379747
    4999       0.349506   -0.781391
    5000      -1.205864   -0.201813
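    Here is a minimal Python sketch of standardization using only the standard library. The `standardize` helper is a hypothetical name, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption. Because it runs on just the five SAT scores listed first in the table, its z-scores will not match the table, which was computed from the mean and standard deviation of all 5000 students.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population std dev
    return [(v - mu) / sigma for v in values]

# SAT scores of the first five students only.
sat = [1419, 1384, 784, 748, 1124]
z = standardize(sat)

# A standardized feature has mean ~0 and standard deviation ~1.
z_mean = statistics.fmean(z)
z_std = statistics.pstdev(z)
print(abs(z_mean) < 1e-9)        # True
print(abs(z_std - 1.0) < 1e-9)   # True
```

    Scores above the mean get positive z-scores and scores below it get negative ones, which is why the standardized table mixes signs.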

    Standardization is very helpful in cases where the features in the dataset are normally distributed, but your data doesn’t have to be normally distributed in order to use it. Unlike normalization, standardization does not squeeze your data into a bounded range, so it tends to cope better when there are outliers present, although extreme values still pull on the mean and standard deviation.

    That’s all, folks!

    There are many other data scaling techniques that can be used, but normalization and standardization are the two methods that are mainly used in practice. In my next post I will discuss fuzzy clustering.