Author: Daryle Serrant

Using K-Medians for Outlier Clustering.
Hey, readers! I’m back again with another blog post. If you’ve been following the series of posts I’ve been releasing thus far you will already be well familiar with the k-means clustering algorithm. In my last blog post I covered k-modes, a variation of k-means that can be used for clustering categorial data. In this blog post I will introduce you to another variation of k-means called k-medians.

The case for K-Medians

In the K-Means post, I mentioned that one of the limitations of the algorithm is its sensitivity to outliers. The mere existence of just one extreme value in the dataset can have a huge influence on the placement of the cluster centroids and could cause outliers to get their own clusters instead of being ignored.

One way to avoid this issue is to remove the outliers from the dataset prior to clustering. But what if that wasn’t an option for you? What if there’s just too many outliers in your dataset for you to simply remove? That’s where K-Medians comes in.

What makes K-Medians different from K-Means is two things:

First off, K-Medians computes the dissimilarity between data points using Manhattan distance instead of Euclidean distance. More on Manhattan distance later in this post.

Secondly, as the name suggests, K-Medians computes new cluster centroids using the median. The median is a robust measure of a dataset’s center and this therefore less sensitive to the existence of outliers in the dataset.

To demonstrate this fact, let’s suppose we have a small dataset of values:

1, 6, 9, 7, 12

The mean of the dataset is 7.If another data point with the value 7000 was added to the data the resulting mean would be 1172.5, which is drastically different from the original dataset.

If we computed the median of the dataset with and without the new data the result would be 7 and 8 respectively. The median of the dataset with the additional datapoint is only slightly higher than the median of the original dataset.

By using the median instead of mean for calculating new centers for each cluster, K-medians does not suffer from the outlier sensitivity that K-means has.

About the Manhattan Distance

Another name for the Manhattan distance is the city block distance.

Imagine each cell is a building and the grid lines are side walks. If one wanted to travel from point A to point B in the picture, he/she could not take a straight path. Both paths available in the image would require a turn to be taken. The Manhattan distance can be used to calculate the distance walked.

The equation for the Manhattan distance is the following:

Allow me to break this formula down and explain how the distance is calculated.

Given two data points, x and y the formula adds up the absolute difference between each component from x and y. As an example, computing the Manhattan distance between the data points (3,5,9,1) and (7,2,0,8) yields the following:

|3 – 7| + |5 – 2| + |9 – 0| + |1 – 8| = 23

The Algorithm

Now that you’re aware of the Manhattan distance and the median, here’s the breakdown of the K-Medians clustering algorithm
1. Select k data points at random to be initial center points
2. Use the Manhattan distance to calculate dissimilarities between each datapoint and the center points.
3. Assign each datapoint to the cluster with the smallest Manhattan distance.
4. Use the median to compute new center points for each cluster
5. Repeat steps 2, 3, and 4 until there’s no change in the cluster assignments.
Choosing K

As with K-means, the Sihouette method can be used to find the optimal value for k. A modified version of the elbow method can also be used. Instead of using the sum of squared error to calculate the total distance between each data point and it’s centroid, the absolute difference is used:

For each cluster the absolute error between the data point p_iand the centroid µ_jis computed. These differences are then summed up to get the total error for the clustering.

From there the process of producing the elbow plot is the same as previously described in my earlier posts.

What’s Next

That’s it folks. In my next post, I will cover yet another variation of k-means that is also robust to outliers. Until next time.
November 27, 2021

Using K-Modes to Cluster Categorical Data

If you’ve been following the series of posts I’ve written thus far, you will have already become familiar with the k-means clustering algorithm and how to determine the optimal number of clusters to produce using the algorithm. This clustering method, unfortunately, is only useful for numerical datasets. What happens if we want to cluster non-numerical data, such as names, locations, colors, and companies? The clustering algorithm I will cover is a variation of k-means that can be used on categorial data. This method is called K-Modes.

So, what is the K-Modes Algorithm?

The K-Modes clustering procedure is very similar to k-means, actually.

Lets say we have a small dataset of individual categorical data that we’d like to cluster into 3 groups:

	Education Level	Ethnicity	Occupation	Marital Status	State
P1	Associates	Hispanic	Analyst	Single	Florida
P2	Bachelors	Caucasian	Engineer	Divorced	California
P3	Masters	Native American	Engineer	Single	North Carolina
P4	Bachelors	Hispanic	Engineer	Divorced	California
P5	Associates	African American	Accountant	Married	Texas
P6	Associates	Caucasian	Engineer	Married	Texas
P7	Bachelors	Caucasian	Accountant	Widowed	Florida
P8	Associates	Asian	Teacher	Married	Texas
P9	Bachelors	African American	Accountant	Single	Florida
P10	Masters	Caucasian	Lawyer	Single	Texas

The process to cluster the dataset can be broken down to 4 steps.

Step 1: Select k data points at random to be initial center points

Just like with K-means, the method starts out with deciding the number of clusters desired and then choosing at random several points to be the initial center points for the algorithm.

From our little example dataset, we will choose P2, P6, and P9 as center points (centroids):

	Education Level	Ethnicity	Occupation	Marital Status	State
P2	Bachelors	Caucasian	Engineer	Divorced	California
P6	Associates	Caucasian	Engineer	Married	Texas
P9	Bachelors	African American	Accountant	Single	Florida

Step 2: Calculate dissimilarities between each datapoint and the center points

K-means used Euclidean distance to determine how similar (or dissimilar) a datapoint was to a centroid. K-modes uses a different method for determining similarity.

K-modes compares each datapoint to the centroids and count up the number of dissimilar attributes.

When we take P1 for example, and compare it to each centroid, here’s what we get:

P1 => P2

	Education Level	Ethnicity	Occupation	Marital Status	State
P1	Associates	Hispanic	Analyst	Single	Florida
P2	Bachelors	Caucasian	Engineer	Divorced	California

Number of Dissimilarities: 5

P1 => P6

	Education Level	Ethnicity	Occupation	Marital Status	State
P1	Associates	Hispanic	Analyst	Single	Florida

P6	Associates	Caucasian	Engineer	Married	Texas

Number of Dissimilarities: 4

P1 => P9

	Education Level	Ethnicity	Occupation	Marital Status	State
P1	Associates	Hispanic	Analyst	Single	Florida

P9	Bachelors	African American	Accountant	Single	Florida

Number of Dissimilarities: 3

Step 3: Assign data points to a cluster

As with K-means, each datapoint is assigned to the centroid with the smallest dissimilarity score.

Below is a table containing the dissimilarities between each data point and the centroids and its assigned cluster.

	Cluster 1 (P2)	Cluster 2 (P6)	Cluster 3 (P9)	Assigned Cluster
P1	5	4	3	Cluster 3
P2	0	3	5	Cluster 1
P3	4	5	4	Cluster 1
P4	1	4	4	Cluster 1
P5	5	2	3	Cluster 2
P6	4	0	5	Cluster 2
P7	3	4	1	Cluster 3
P8	5	2	5	Cluster 2
P9	3	5	0	Cluster 3
P10	4	3	4	Cluster 2

Step 4: Calculate new cluster centroids

This step is where k-modes gets its name from. We compute the new centroids for each cluster by using the mode. If there are values that have the same number of occurrence, one of the values can be chosen at random.

The new centroids for each cluster are:

Cluster 1

	Education Level	Ethnicity	Occupation	Marital Status	State
P2	Bachelors	Caucasian	Engineer	Divorced	California
P3	Masters	Native American	Engineer	Single	North Carolina
P4	Bachelors	Hispanic	Engineer	Divorced	California
New Centroid	Bachelors	Caucasian	Engineer	Divorced	California

Cluster 2

	Education Level	Ethnicity	Occupation	Marital Status	State
P5	Associates	African American	Accountant	Married	Texas
P6	Associates	Caucasian	Engineer	Married	Texas
P8	Associates	Asian	Teacher	Married	Texas
P10	Masters	Caucasian	Lawyer	Single	Texas
New Centroid	Associates	Caucasian	Accountant	Married	Texas

Cluster 3

	Education Level	Ethnicity	Occupation	Marital Status	State
P1	Associates	Hispanic	Analyst	Single	Florida
P7	Bachelors	Caucasian	Accountant	Widowed	Florida
P9	Bachelors	African American	Accountant	Single	Florida
New Centroid	Bachelors	Caucasian	Accountant	Single	Florida

Step 5: Repeat steps 2, 3 and 4

As with k-means, the algorithm reassigns data points to clusters based on their similarity to the centroids. New centroids are then computed, and the process repeats until no changes in the cluster assignments are made.

Determining the optimal ‘k’

Choosing the optimal number of clusters for k-modes can be done using a slightly modified version of the elbow method employed for k-means. Here’s the steps to follow:

Perform K modes clustering for a range of values of k.
After each clustering, compute the dissimilarity score between each datapoint and it’s assigned cluster centroid, and add them up to get the total dissimilarity score.
Create a line plot of the total dissimilarity score for each value of k.
Finally look at the plot for the value of k where the dissimilarity score begins to decrease linearly. This value is your optimal number of clusters for the dataset.

Now that you know the k-modes clustering algorithm, what datasets do you plan of using it to cluster? Share your plans in the comments below.

Until next time! 🙂

November 21, 2021

K-Means Parameter Selection
In my last post, I provided a 101 on the K-means clustering algorithm. In case you haven’t had a chance to read the post this link will take you to it. One question you might be wondering after reading the post is how to determine out the optimal number of clusters for k-means. That’s exactly what this post is all about. At the end of this post you will have two techniques in your back-pocket that you can use to determine the optimal value for k in K-means.

Elbow Method

The first technique is the simplest of the two and is called the Elbow method. The Elbow method involves the following steps:
1. Perform K means clustering for a range of values for k.
2. After each clustering, compute the sum of squared error ( or SSE for short) of each cluster. — More on that in a bit
3. Create a line plot of the SSE for each value of k. This plot is called the elbow plot
4. Look at the elbow plot and find the value of k where the SSE begins to decrease linearly. That value is the optimal number of clusters in your dataset.
So, what is the Sum of Squared Error?

The cluster sum of squared error is the average distance each data point is from the centroid. Calculating this involves the following steps:
1. For each cluster and centroid produced by k-means
  1. Calculate the difference between each data point and its cluster centroid. Square the difference. We’ll call the result e
  2. Compute the sum of all the differences. Let’s call the result t
2. Next we will add up the total distances computed for each cluster. The result of the calculation is the sum of squared error.
It is best summarized with this formula:

The Intuition of the Elbow Plot

When you plot the SSE for each value of k, you will immediately see why this this plot is called the Elbow plot. It’s because the plot looks like an elbow!

As the number of clusters increases, the sum of squared error decreases. Our optimal number of clusters for k means will be a number where the SSE is small. If this number is too small though, we will end up with too many clusters. If this number is too big, then we will end up with too few clusters to make sense of the data we have. The purpose of the elbow plot is to help us find the value of k where we start to see diminishing returns. This point is our optimal value for k. It just so happens that this point on the plot is the elbow part of the arm.

As you can see from the plot above, this elbow appears to occur at k = 3, making this the optimal number of clusters.

But what happens when you have a plot with a fairly smooth or linear curve and no obvious elbow point, such as the one below.

This is where the next method I will show you comes in handy.

Introducing the Sihouette Method

The Sihouette method involves computing coefficients for each point in the dataset. These coefficients measure how similar a point is to its assigned cluster compared to other clusters. The value of each coefficient range between -1 and 1. Coefficients on the high end of the range indicate that the data point is well matched to its assigned cluster and poorly matched to other clusters. Coefficients on the low end of the range indicate that the data point is poorly matched to its cluster compared to other clusters.

Let’s go through how to compute a sihouette coefficient for a simple data point:
1. First we calculate the average distance the data point is from all other points in the cluster. We’ll call this number a.
2. Next we calculate the average distance of that data point from points in the cluster closest to its assigned cluster. Let’s call this number b.
3. Finally we take the two measurements we computed and calculate the coefficient by subtracting a from b and dividing by the larger of the two numbers.
This process is best summarized with this formula:

If most of the data points have high coefficient values, then the number of clusters used is optimal. If, on the other hand there are many data point that have low coefficient values, this indicates that either two many or too few clusters.

A sihouette plot can be used to visually inspect how close each point in one cluster is to points in neighboring clusters.

The x axis in the plot displays the sihouette coefficient values. The y axis displays the labels for each cluster. The red dashed line is the average sihouette coefficient, computed by averaging all the coefficients in the dataset.

Here’s some rules of thumb to follow when analyzing a sihouette plot:
- If all the cluster plots is beyond the average sihoutte score, have mostly uniform thickness, and do not have wide fluctuation in size, that’s a good indication that the number of clusters used is optimal.
- The number of clusters is sub optimal if you observe either of the following:
1. The plot for a cluster falls below the average coefficient. Such as the plot from this link: https://vitalflux.com/wp-content/uploads/2020/09/Screenshot-2020-09-16-at-11.59.52-AM.png
2. There are wide fluctuations in the size and thickness of the cluster plots.
You have just learned two methods for choosing the optimal number of clusters to form for k-means. If you enjoyed this post, subscribe to my blog to keep abreast of more posts like this.
November 19, 2021
K-Means.. The Simple Clustering Algorithm
If you do a quick Google search for clustering algorithms, k-means will undoubtedly be the first algorithm that will be mentioned in the search results, and for good reason too. It’s the first algorithm that most professionals turn to for segmenting datasets. Compared to other clustering methods out there, k-means is both very simple to implement and understand. In this post I will explain how this algorithm works and explain some pros and cons that you should be aware of when using this algorithm. In future posts, I will delve into more technical aspects of k-means and also walk through a coded implementation of this algorithm.

Here goes…

So what exactly is k-means?

The k-means algorithm divides a dataset into groups. The datapoints in each group are in close proximity of each other (at least as close to each other as possible). The k in k-means comes from the choice in the number of clusters to create. When using this algorithm, the practitioner provides two things:
1. A dataset, and
2. The number of clusters to create. We call this number k.
The algorithm will then take the dataset and create the desired number of clusters. The means in k-means comes from the method in which the clusters are formed. The algorithm repeatedly estimates the center of each cluster by calculating the mean (average) of the data points in the cluster. Another name for this center point is centroid. Each data point in the dataset is assigned a cluster based on how close it is to a centroid. This will make more sense once we go through the next section.

Before K-Means

After K-Means

So how exactly does it work?

Here’s a breakdown of the clustering algorithm in all of it’s parts.

Choose initial centroids

Assign points to cluster

Calculate new centroids

Repeat until done
1. Choose k. The first step is to determine the number of clusters you’d like to create. We’ll call this number k
2. Select initial cluster centroids. In this step the algorithm chooses k data points from the dataset at random. These data points serve as center points (centroids) for the clusters.
3. Assign data points to a cluster. In this step the algorithm calculates the distance between each data point in the dataset and each centroid and assigns the data point to the centroid it is closest to.
4. Calculate new cluster centroids. The algorithm calculates the new center points for each cluster by computing the calculating the average of each data point in the cluster.
5. Repeat steps 3 and 4. The algorithm reassigns data points to clusters based on how close they are to the new centroids. After the reassignment new cluster centroids are computed. This process is repeated until no changes in the assignments are made.
But wait, how is distance between a data point and a centroid calculated?

There are many methods that can be used to calculate distance (i.e. similarity) between points, many of which I plan on covering at a later time. But the method most commonly used is Euclidean distance. The formula is as follows:

So, what are the advantages of k-means?

Very Simple to Implement. This is probably one of the biggest advantages of this algorithm. Not only is it very easy to implement, it is also very easy to interpret and visualize compared to other algorithms out there.

Very efficient. The k-means clustering algorithm is highly scalable and can efficiently cluster large datasets.

Highly adaptable to new data points. K-means clustering algorithm is extremely flexible and can adjust the clustering of the data as new data points are added to the dataset. This is really great for real-time applications where new data points can be inserted to the dataset on the fly.

And, what are the disadvantages of k-means?

K must be chosen manually. As you may have already surmised from the explanation of the algorithm, K-means is not capable of determining the optimal number of clusters in the dataset automatically. You must specify the number of clusters in advance. In a future post I will cover some strategies you can use to determine the optimal number of clusters for K-means

Very sensitive to noisy data and outliers. The presence of extreme data points in the dataset can have a huge impact on the quality of the clusters that are formed. The placement of cluster centroids, for instance can be heavily influenced by outliers. Or better yet, outliers might even get their own cluster instead of being ignored. To prevent any of these from occurring, make sure to remove these outliers from the dataset before clustering.

Dataset with outliers before clustering

Dataset with outliers after clustering

Results are inconsistent. Due to the fact that the initial center points are chosen at random, you can run k-means multiple times on the same dataset and get different answers. What people therefore do in practice is run k-means several times with different initial values and pick the one with the best result.

Works best with spherical datasets. One big assumption K-means has on the dataset is that it contains spherically shaped clusters. This is not always the case however. When given datasets such as the one below, K-means may produce a result that might not be what you’d expect.

Non spherical dataset before clustering

Non spherical dataset after clustering

Assumes clusters have similar density. Another big assumption that K-means makes on the dataset is that the clusters are even in size and density. When given dataset with uneven distributions of points per cluster, the clustering starts to break down.

What’s next?

You now have a high level understanding of k-means clustering and how it works. Other the next few posts, I will start diving into more technical details on the algorithm, including strategies for choosing the optimal number of clusters, and other distance methods you can use with k-means.
November 2, 2021
4 skills to focus on when learning Data Science
Python, R, Spark, TensorFlow, pandas, data visualization, etc., etc., etc. With so many technologies and concepts, the task of learning data science, especially for people who are learning on their own, can appear daunting. My goal in this article is to your data science journey less overwhelming. I’m going to show you four skills I believe you should focus on when starting out in order of importance.

Skill #1: Python

You will be writing code on a daily basis as a data scientist. Code for extracting and parsing data, code for building training machine learning models, and code for creating data visualizations to communicate your findings to stakeholders. You may also be asked to write modules that can integrate into software products that your company is building. Therefore knowing how to write code – good, clean code in particular – is important. Python is a great coding language to start out with because the syntax is simple and fairly intuitive compared to most languages and is a very popular language to use for Data Science as well as general software development tasks.

Why Python and not another language like R?

Even though R designed specifically for statistical analysis and data visualization, I would encourage people starting out to learn Python because it’s a much more versatile language. R is a favorite among researchers and scholars in academia and works really well for doing quick exploratory data analysis and early modeling. Where R falls short is in the ability to scale in a production environment.

Most mainstream developers you will be collaborating with as a data scientist also don’t know R. Considering that in your work may be part of a live software product, you’re better off starting with a more mainstream language like Python.

Skill #2: SQL

An activity common in many data science roles is retrieving data from a database. I’m order to do this, knowledge of SQL is essential.

As far as which database management system you should start with, choose any one you want. The structure and syntax of SQL is the same amongst all database management systems, save for a few minor differences in function names.

What about NoSQL?

A large majority of the data you will be working with as a data scientist will be structured and relational in nature. Relational databases and SQL are the go-to tools when working with this type of data.

NoSQL technology is used when dealing with unstructured data such as documents, audio, and geospatial data. NoSQL may also be used when an application requires a level of efficiency and scalability that can’t be provided by a relational database management system.

Many of the relational database concepts you will learn with SQL will help you better understand NoSQL. Therefore I would recommend you learn SQL first and then pick up NoSQL later when you really have a need to learn it.

Skill #3: Data Visualization

As a data scientist, you will be expected to communicate the results of your findings to business stakeholders. Data visualization is one of your primary tool for doing this. There’s a multitude of packages for data visualization available under Python. Two packages I’d recommend you start out with before exploring others are matplotlib and seaborn.

Skill #4: Machine Learning

Once you have a good grasp of probability, statistics, and linear algebra, it’s There are so many machine learning algorithms out there that it could be hard for ones starting out to figure out which ones to learn first. Pick five or six algorithms to study and understand and move on to more as you get comfortable with them. My recommendation would be:
- Linear Regression
- Logistic Regression
- Kmeans
- Decision Tree
- K nearest neighbors
- Support Vector Machines
Along with learning the algorithms understanding how to evaluate the predictions of the models is important as well.

Do you agree with my suggestions? What are your thoughts? Share in the comments section below.
September 4, 2021
Takeaways from 1 year of teaching coding to kids
Since the summer of last year I’ve been an online coding instructor for children and young adults for theCoderSchool. I’ve taught students 1-on-1, 2-on-1, and sometimes groups of five or more. In this blog post, I will be share a few takeaways I’ve gained from my teaching experiences over the past year.

Be aware of the curse of knowledge

One of the first takeaways I gained when I started teaching was how experience was how my software engineering experience impacts the quality of my instruction. While it has provided me a depth of topics and interesting insights to infuse into my lessons, at times it has also caused me to assume that people know things the same way that I do. I’ve been coding professionally for well over a decade. Coding has practically become a second language to me. I can think and reason in code so quickly that it’s unconscious. Because of this, I at times forget that my students are learning code for the first time and don’t grasp the concepts as deeply as I do.

There were many moments when I first started out that I would quickly zip through an explanation that in my mind was really simple, only to look up and meet blank look on my student’s face.

Here’s some things I do to avoid this:
1. Put myself in my student’s shoes – I try to remember what it was like when I was learning how to code. I think about what questions I might have about certain concepts and how I might understand them. I try to get inside their heads and think about how I can adjust my instruction.
2. I slow down – I take my time to explain coding concepts and make sure my students understand them before moving on. I avoid moving on to very abstract, complex concepts such as classes and objects too soon; this is especially the case for students under the age of 10, most of whom tend to lack the cognition to fully grasp the concept.
Patience is a Must

Becoming aware of the curse of knowledge has also made me realize the importance of patience in teaching. The greatest gift a teacher can give a student is lots of support and approval of where they are in their journey. It takes a lot of time and persistence to learn how to code. When a student feels like they can learn at their own pace without any expectation of being at a particular skill level they are more likely to continue engaging with the material. I’ve seen students who lost interest in learning to code because either the instructor moved too quickly through the material or the instructor got frustrated with the student’s lack of progress.

As an instructor it can be frustrating to see that after spending an exorbitant amount of time teaching and answering questions about a coding concept, a student still fail to grasp it. But when I pause to reflect on my own learning journey, I remember moments when I struggled to understand many computer science concepts, including the very ones my students struggle with. What I immediately realize is that learning is a process.

We often view the act of learning to be something akin to downloading a data file from the internet. I think a better analogy for learning is that of building an extension to a house. I’m not a builder so please excuse the bad analogies in advance. Learning is about finding the right building blocks that connect with the existing structure (i.e. a student’s current knowledge and understanding). Sometimes it might be necessary to knock down existing walls (i.e. false ideas and misconceptions) to make room for the new structures you want to build. And sometimes you might need to add new structural support to the house (i.e. giving students lots of practice with the concepts) to keep everything in place. Depending on the house and on the expansion, this can be easy or it can be a challenge. But that challenge is not a bad thing. It’s an essential part of the process. Because without facing that challenge, nothing will ever get built.

So now I have a lot more patience and understanding for anyone learning how to code. I don’t see their struggle as a problem. I see it as a sign that they are on the path to building something great!

Growing creative problem solvers

Probably the biggest realization I’ve had since I started teaching is that the value that I provide my students is more than merely teaching them how to code. I’m teaching them how to solve problems.

Software developers are probably some of the most creative problem solvers out there. Everyday they are tasked to solve technical problems that are complex enough to make the head of the average spin a thousand times over. The key is to their success is a structured approach that they use when tacking each problem. This same approach is highly applicable to areas outside of coding and can provide kids benefits for future career opportunities as well as help them find solutions to problems they encounter on a regular basis.

Something I’ve been taking on since making this realization is not just giving my students the answers to problems when they get stuck, but also showing them strategies that can be used to solve any problem and coaching them to use it as often as possible.

Are you a coding instructor? What are some of your takeaways from your experience? Feel free to share them in the comments below.
August 12, 2021
Binary Data Series Part 3: Floating Point Numbers
This article is part of the multi part series on binary data. If you’re knowledge of positive and negative integers and binary numbers is a bit flakey, I’d recommend that you read the part 1 and part 2 before you continue.

In this article we will discuss how fractions are represented in binary and some limitations you must keep in mind when developing software that require very precise calculations.

To start things off consider how we represent floating points in the decimal system.

Thousands Hundreds Tens Ones . Tenth Hundredth Thousandth Ten-Thousanth Hundred-Thousanth
10^3 10^2 10^1 10^0 . 10^-1 10^-2 10^-3 10^-4 10^-5
1 9 8 1 . 0 8 4 2 7

Each position to the right of the decimal point represents some fraction of 10. Each position to the right is 10 times smaller than the previous position.

This is very much how floats in binary is represented.

Eight Four Two One . One-Half One-Fourth One-Eighth One- Sixteenth One-Thirty Second
2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4 2^-5
1 9 8 1 . 0 8 4 2 7

Converting from binary floats to decimal fractions is the same as converting integers. Multiply each bit by it’s positional value and add them up. Observe:

101.011
- The first digit on the left is 1. It’s is in the 2^2 position, making 1 * 2^2 = 4
- The next digit is 0. It’s in the 2^1 position, meaning 0 * 2^1 = 0
- The next digit is 1. It’s in the 2^0 position, meaning 1 * 2^0 = 1
- The next digit is 0. It’s in the 2^-1 position, meaning 0 * 2^-1 = 0
- The next digit is 1. It’s in the 2^-2 position, meaning 1 * 2^-2 = 0.25
- The next digit it 1. It’s in the 2^-3 position, meaning 1 * 2^-3 = 0.125
Adding these numbers together we get 4 + 0 + 1 + 0 + 0.25 + 0.125 = 5.375

Now suppose we have a 12.25 and we want to convert this to binary. Here’s how it’s done:

First convert the integer part to binary:

12 therefore becomes 1100

Next we deal with the fractional part of the number. For this we repeatedly multiply the fractional part of number by 2 until we get 0, each time adding a binary 1 digit when the product is greater than or equal to 1 and a binary 0 when it is less than 1:

Multiplication by 2 Product Digit
0.25 * 2 0.5 0
0.5 * 2 1.0 1

The fractional part of the number is .01. Combine this with the integer part and we get 1100.01

Great, now that you we’ve gone over how to convert floating point from decimal to binary, let’s talk about how these are stored in a computer.

The IEEE Floating Point Standard

Most computer systems today follow the IEEE 754 standard for representing real numbers. In the technical standard, floating point numbers are represented using 3 components. These are:
1. The sign bit
2. The exponent
3. The mantissa
I’ll break each of these down and explain what they mean.

Do you remember learning about scientific notation in primary school? In case you haven’t heard of it before or need a refresher, scientific notation is a method for writing very large or small numbers. The numbers 890,000,000 and 0.00000051 or example is written in scientific notation as 8.9 x 10⁸ and 5.1 x 10^-7. A number in scientific notation is expressed as a number between 1 and 10 multiplied by a power of 10.

Binary real numbers can also be expressed in scientific notation. A binary number in scientific notation is expressed as a number 0 or 1 multiplied by a power of two. Let’s look at the decimal number 85.125 and express that in binary scientific notation:

First we convert 85.125 to binary. Following the steps outlined above, we will get 1010101.001. Now to express this in scientific notation we will move the decimal point six bits to the left and then multiply by 2⁶. The result will be 1.010101001 x 2^6.

If this number was stored as a single precision 32 bit float, here’s what it will look like:

Sign bit Exponent Mantissa
0 10000101 01010100100000000000000

Let’s unpack things a bit further…

The sign bit represents the sign of the number. If the float is positive than this bit will be 0 otherwise it is 1.

The exponent is 8 bits in length and represents the exponent part of the expression. You may have noticed that notice that exponent is not 00000110, which is 6 in binary. This is because 127 was added to the number. The 127 serves as a bias term and allows the computer to represent both positive and negative exponents.

The mantissa is 23 bits in length and represents the part of the number in scientific notation that is to the right of the decimal point.

This number can also be represented in a double precision 64 bit floating point. Here’s what it would look like:

Sign bit Exponent Mantissa
0 10000000101 0101010010000000000000000000000000000000000000000000

As you can see the double precision floating point type uses 1 bit to represent the sign of the number, 11 bits for the exponent with a bias of 1023, and 52 bits for the mantissa. Just as with the different types of integer types covered in the last article, single precision and double precision floating point can store different ranges of real numbers.

The range a single precision float can represent approximately +/- 10^38.25 while the range of a double precision float is +/- 10^308.25

With that large of a range you would think that a computer can represent any type of number you provide it accurately. But you would be mistaken. Allow me to explain.

The Limits of Floating Point Numbers

I’ll begin my explanation by first demonstrating the conversion of the real number 0.4 into binary.

Multiplication by 2 Product Digit
0.4 * 2 0.8 0
0.8 * 2 1.6 1
0.6 * 2 1.2 1
0.2 * 2 0.4 0
0.4 * 2 0.8 0
0.8 * 2 1.6 1
0.6 * 2 1.2 1
… … …

As you can see the number 0.4 in binary repeats infinitely. 0.011001100110011001100110011…

The number 0.3 and 0.1 are other examples of numbers where you will see this behavior. I’ll leave it up to you to convert them to binary and see for yourself.

There are many decimal floating points like these that cannot be represented exactly in binary. As a consequence, what you often get when you type floating point numbers into a computer is not the exact number, but an approximation of that number.

That is why when you run certain calculations on a computer like 0.1 + 0.2, you get 0.30000000000000004 instead of exactly 0.3.

In many cases, the approximation isn’t an issue. We lob off the trailing digits we don’t need and make it 0.3. But where it really becomes an issue is when we are dealing with applications that must make very precise calculations. Financial applications one case where precise calculations are crucial.

So what can be done for cases like these? In these situations, developers turn to Decimal data types, data types that are designed to represent decimal numbers precisely in a computer system. In the case of finance applications, developers may also choose to represent money in cents rather than in dollars, allowing them to use an integer data type instead of a floating point type to store the money.

That’s all, folks!

I’ve hoped you enjoyed the binary data series. Feel free to leave a comment if you have any questions.
July 6, 2021
Binary Data Series Part 2: Positive and Negative Integers

Let’s continue where we left off. In the last article I covered what are binary numbers and how to convert from binary to decimal. If you had a chance to read it, you’re now probably wondering how positive and negative numbers are represented in a computer. This article will cover this.

Adding binary numbers

Before I cover positive and negative numbers, I must first touch on a few things that will make some of the concepts I will bring up later easier to follow. The first thing I will touch on is how addition works in binary.

Binary addition follows the same rules as addition in the decimal system.

In the decimal system, when the result of an addition equals 10, we carry 1 over to the next digit on the left.

For example:

12 + 9 = 21

In binary addition, we carry 1 over when the sum of the values equal 2:

0 + 0 = 0

0 + 1 = 1

1 + 0 = 1

1 + 1 = 10, carry over 1, i.e. 10

Example:

      1 0 ¹0 1 ¹0 1

+      1    0 1   0 1

= 1 1    1 0   1 0

The value 2 in the binary system is equivalent to 10 in the decimal system.

Bits, bits, and more bits

The next topic I will touch on is how numbers are stored and organized in a computer system. A single binary digit is called a bit. Bits are the most basic units of information in a computer system. A bit by itself can only provide two values: 0 and 1. But as you seen in the last article, a computer can represent larger numeric values with more bits.

Programming languages typically provide a variety of numeric data types, each of which allow for a different range of numbers to be stored. Here are the types you will come across:

Data Type # Bits Number Range
Byte 8 0 – 256 (2^8)
Short/Word 16 (2 Bytes) 0 – 65536 (2^16)
Double Word 32 (4 Bytes) 0 – 4,294,967,296 (2^32)
Long 64 (8 Bytes) 0 – 18,446,744,073,709,551,616

Positive and Negative Integers

To represent negative numbers, one bit in the number is used to indicate whether or not the number is positive or negative. If this bit is 0 than the number is positive. If the bit is 1 than the number is negative. This bit is called the sign bit and is the leftmost bit in the binary number

Let’s say that we using 8 bits to represent positive and negative numbers. This is what the number 105 will look like:

Sign (High order bit) Low Order Bit
0 1 1 0 1 0 0 1

And this what -105 looks like:

Sign (High order bit) Low Order Bit
1 0 0 1 0 1 1 1

If you compare the two binary numbers you will notice that for every bit in the number 105 except for the right most bit, the 1s was exchanged for 0s and the 0s were exchanged for 1s. We have a special name for these type of numbers. It’s called two’s complement.

The two’s complement for any binary number can be computed by simply inverting the each bit (turning 0s into 1s and 1s into 0s) and adding 1 to the result.

As another example here’s the number 78 and its two’s complement:

78

Sign (High order bit) Low Order Bit
0 1 0 0 1 1 1 0

And -78

Sign (High order bit) Low Order Bit
1 0 1 1 0 0 1 0

As mentioned before most programming languages provide a number of numeric data types that allow the user to store numbers of varying ranges. Since 1 bit is used to represent negative numbers, the range of numbers that can be represented will be different

Data Type # Bits Number Range
Byte 8 -127 to 127
Short/Word 16 (2 Bytes) -32767 to 32767
Double Word 32 (4 Bytes) -2,147,483,648 – 2,147,483,647
Long 64 (8 Bytes) (-2^63)+1 – (2^63)-1

Some languages like C/C++ provide signed and unsigned integers. Signed integers which use one bit to represent negative numbers and unsigned integers which do not.

So now that you know about signed and unsigned integers, what do you think will happen if you perform a calculation like 2,147,483,647 + 1 and store the result in a signed integer? What do you think will happen if you store the result of 4,294,967,296 + 1 in an unsigned integer?

In the case of unsigned integers, what happens depends on the programming language and the compiler. Some languages will raise an error when this happens. Most however will either perform a modulo/wrap around operation or have some undefined behavior.

In the example above the let’s let UINT = 4,294,967,296 + 1. Then

UINT + 1 = 0

UINT + 2 = 1

UINT + 3 = 2

And so on..

In the case of signed integers the result might be even more surprising. Let’s let SINT = 2,147,483,647. Here’s what you can mostly expect:

SINT + 1 = -2,147,483,648

SINT + 2 = -2,147,483,647

SINT + 3 = -2,147,483,646

And so on..

The result becomes a negative number.

These cases, where the result of an integer operation does not fit within the allocated memory space, are called integer overflow errors. As you can imagine these errors are very dangerous because they often lead to unexpected behaviors.

Avoiding integer overflow errors usually comes down to checking the result of the addition. If the sum of two numbers is less than either of the operands, then you know for sure an integer overflow occurred.

What’s next…

In this article, you learned how positive and negative integers are represented in binary and special cases you should be aware of when adding integers. Next article we will discuss floating point numbers.

June 23, 2021
Binary Series Part 1: Binary Numbers
If you’ve ever…
- Wondered by 0.1 + 0.2 equals 0.30000000000000004
- Were curious why you may get an integer errors when adding up large numbers, or
- Why you might get a small number when adding two big numbers
Then read on. In this multi part article series, you will learn the answer to all of these questions. I guarantee you that if you take the time to read these articles, you will be a much better coder. Knowledge of how a computer represents numbers will help you design more precise and stable applications.

Let’s Begin!

Base 2 Number System

The base 10 number system, aka the decimal system, is the number system we use in our everyday life. As you all know in the decimal system each digit of a number can have a value ranging from 0 to 9. The places and positions of the numbers are based on powers of 10.

etc. Hundred Thousands Ten Thousands Thousands Hundreds Tens Ones
… 10⁵ 10⁴ 10³ 10² 10¹ 10⁰
… 4 7 6 9 4 5

Each number position is 10 times the value to the right of it.

A computer on the other hand uses a different number system for representing numbers. Every piece of data stored in a computer system is represented in base 2. This is also known as a binary number. Each position in a binary number are based on powers of 2.

etc. 256 128 64 32 16 8 4 2 1
… 2⁸ 2⁷ 2⁶ 2⁵ 2⁴ 2³ 2² 2¹ 2⁰
… 1 0 0 1 1 1 0 0 1

Each position in a binary number can have a value of either 0 or 1.

Here’s an example of a binary number:

100011

The binary number above represents the decimal number 35. We’ll get to how to convert from binary to decimal in a bit, but let’s quickly cover counting in binary.

Counting in Binary

Below is a table of numbers from 0 to 20 and their representation in binary. Can you see the pattern?

Number 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Binary 10100 10011 10010 10001 10000 01111 01110 01101 01100 01011 01010 01001 01000 00111 00110 00101 00100 00011 00010 00001 00000

Counting in binary is very similar to how we count in decimal. In decimal we start counting at 0 then count from 1 all the up to 9, then start at 0 again but add 1 to the digit on the left, resulting in the number 10. Counting in binary is the exact same process, except we reach 10 much sooner.

Converting from Binary to Decimal

To convert a binary number to a decimal number, multiply each digit by its binary positional value and then all together. Here’s an example to make things clearer:

100011
- The first digit on the left is 1. It’s is in the 2⁵ position, making 1 * 2⁵ = 32
- The next digit is 0. It’s in the 2⁴ position, meaning 0 * 2⁴ = 0
- The next digit is 0. It’s in the 2³ position, meaning 0 * 2³ = 0
- The next digit is 0. It’s in the 2² position, meaning 0 * 2² = 0
- The next digit is 1. It’s in the 2¹ position, meaning 1 * 2¹ = 2
- The next digit it 1. It’s in the 2⁰ position, meaning 1 * 2⁰ = 1
Adding these products together, we get 35.

Converting from Decimal to Binary

Converting from decimal to binary is accomplished by repeatedly dividing the decimal number by 2 until the quotient becomes 0. The remainder of the division at each iteration becomes a digit in the binary number.

Here’s an example:

Converting 13 to binary

Division by 2 Quotient Remainder
13/2 6 1
6/2 3 0
3/2 1 1
1/2 0 1

13 is 1101 in binary

That’s it for now. Next article in the series, we will talk about positive and negative numbers in binary.
June 14, 2021
Need help deciding which validation technique to use?

There are a number of techniques for testing the performance of newly developed machine learning models. Have you ever wondered which technique you should use for your data science project and when to use it? The goal of this post to provide some helpful guidelines.

First things first, how big is your dataset?

If you’re working with a very small dataset, consider using leave one out cross validation. In this method each observation is held out from the dataset for testing. The rest of the observations are used to train the model. This is repeated N times for each observation in the dataset and the test error is computed as the average of all N errors.

If you have a decently sized to large dataset then turn your attention to the next set of methods.

Say you’re just kicking off a data science project, are in a bit of a time crunch, and just need a quick and dirty model to serve as an starting point for your project. The holdout method is a good go to technique to use. This technique involves splitting the dataset into a “train” and “test” set. The training set is used to train the model and the test set is used to see how well the model performs on unseen data. A common approach is to use 80% of the data for training and 20% of the data for testing.

If you have more time on your hands K fold cross validation is the preferred method over hold out as it allows a better indication of how the model will perform on unseen data. In this approach the data is randomly split into k groups. One of the groups is used as the test set and the rest of the groups is used for training. This process is repeated until each group is used as the test set. The test error is computed as the average test error from all groups.

But regular K fold cross validation and hold is not ideal for datasets containing class imbalance. For cases like those, stratified cross validation is the way to go. In this approach the train-test splits are chosen to ensure that each set has the same percentage of examples of each target as the complete dataset.

If you choose K-fold cross validation and you intend to use it tune the hyperparameters for your model and also get an error estimate, there is much better approach my friend. Using nested cross validation for this task is suitable as it avoids overfitting. This article provides a detailed breakdown of the technique https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/.

If you’re working with time series data, where the goal is to produce of forecast of future events, none of the approaches discussed earlier will work. Time series cross validation is what you’ll need. In this approach a single observation is used as the test set and the training set is composed of observations that occurred prior to the test observation. This is repeated multiple observations used as test points. The forecast error is computed by averaging the errors for each test point. This approach can be extended to evaluate multi-step forecasts.

Well, that’s it. Feel free to share your thoughts in the comment section.

Take care

May 3, 2021

Thousands	Hundreds	Tens	Ones	.	Tenth	Hundredth	Thousandth	Ten-Thousanth	Hundred-Thousanth
10^3	10^2	10^1	10^0	.	10^-1	10^-2	10^-3	10^-4	10^-5
1	9	8	1	.	0	8	4	2	7

Eight	Four	Two	One	.	One-Half	One-Fourth	One-Eighth	One- Sixteenth	One-Thirty Second
2^3	2^2	2^1	2^0	.	2^-1	2^-2	2^-3	2^-4	2^-5
1	9	8	1	.	0	8	4	2	7

Sign bit	Exponent	Mantissa
0	10000101	01010100100000000000000

Multiplication by 2	Product	Digit
0.4 * 2	0.8	0
0.8 * 2	1.6	1
0.6 * 2	1.2	1
0.2 * 2	0.4	0
0.4 * 2	0.8	0
0.8 * 2	1.6	1
0.6 * 2	1.2	1
…	…	…

Data Type	# Bits	Number Range
Byte	8	0 – 256 (2^8)
Short/Word	16 (2 Bytes)	0 – 65536 (2^16)
Double Word	32 (4 Bytes)	0 – 4,294,967,296 (2^32)
Long	64 (8 Bytes)	0 – 18,446,744,073,709,551,616

etc.	Hundred Thousands	Ten Thousands	Thousands	Hundreds	Tens	Ones
…	10⁵	10⁴	10³	10²	10¹	10⁰
…	4	7	6	9	4	5

etc.	256	128	64	32	16	8	4	2	1
…	2⁸	2⁷	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰
…	1	0	0	1	1	1	0	0	1

Number	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
Binary	10100	10011	10010	10001	10000	01111	01110	01101	01100	01011	01010	01001	01000	00111	00110	00101	00100	00011	00010	00001	00000

Division by 2	Quotient	Remainder
13/2	6	1
6/2	3	0
3/2	1	1
1/2	0	1

Multiplication by 2	Product	Digit
0.25 * 2	0.5	0
0.5 * 2	1.0	1

Sign (High order bit)							Low Order Bit
0	1	1	0	1	0	0	1

Sign (High order bit)							Low Order Bit
1	0	0	1	0	1	1	1

Sign (High order bit)							Low Order Bit
0	1	0	0	1	1	1	0

Author: Daryle Serrant

The case for K-Medians

About the Manhattan Distance

The Algorithm

Choosing K

What’s Next

So, what is the K-Modes Algorithm?

Determining the optimal ‘k’

Elbow Method

So, what is the Sum of Squared Error?

The Intuition of the Elbow Plot

Introducing the Sihouette Method

So what exactly is k-means?

So how exactly does it work?

But wait, how is distance between a data point and a centroid calculated?

So, what are the advantages of k-means?

And, what are the disadvantages of k-means?

What’s next?