Author: Daryle Serrant

  • A Few Faces of Bias

    One of the biggest killers of any data analysis project is bias. Bias in data can appear in many forms. Today I will briefly describe three types of bias and how to avoid them.

    Algorithm Bias

    This is the kind of bias that results from selecting a machine learning model that is too simple for the problem at hand. A classic example is using a linear regression model on data where no linear relationship exists between the predictors and the attribute being predicted. The model is unable to capture all of the signal in the dataset. You’ll know you’re dealing with this kind of bias when you see high error rates on both your training and test sets.

    To avoid this error:

    Experiment with models that have more complexity. Use cross-validation techniques to measure the performance of your models and select a model that strikes a good balance between bias and variance.
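
    As a rough illustration of the diagnosis and the fix above, the sketch below fits a straight line and a more flexible k-nearest-neighbors regressor to synthetic data that follows a curve, then compares their errors with 5-fold cross-validation. The data, the helper names (fit_linear, fit_knn, cross_val_mse), and the choice of k-nearest neighbors are all assumptions made for this example, not a recommendation for any particular project.

```python
import random

def fit_linear(xs, ys):
    """Ordinary least squares for a straight line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_knn(xs, ys, k=3):
    """k-nearest-neighbors regression: average the targets of the k closest points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_mse(fit, xs, ys, folds=5):
    """Average held-out error over `folds` interleaved folds."""
    scores = []
    for f in range(folds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds != f]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds == f]
        model = fit([x for x, _ in train], [y for _, y in train])
        scores.append(mse(model, [x for x, _ in test], [y for _, y in test]))
    return sum(scores) / folds

# Synthetic data (an assumption for this demo): y = x^2 plus a little noise.
rng = random.Random(0)
xs = [-3 + 6 * i / 59 for i in range(60)]
ys = [x * x + rng.gauss(0, 0.3) for x in xs]

linear_train_mse = mse(fit_linear(xs, ys), xs, ys)
linear_cv_mse = cross_val_mse(fit_linear, xs, ys)
knn_cv_mse = cross_val_mse(fit_knn, xs, ys)

print(f"linear, training error:        {linear_train_mse:.2f}")
print(f"linear, cross-validated error: {linear_cv_mse:.2f}")
print(f"k-NN,   cross-validated error: {knn_cv_mse:.2f}")
```

    On this toy data the straight line should show a large error on the training data and under cross-validation alike, which is the signature of algorithm bias, while the more flexible model should score much lower.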

    Measurement Bias

    This kind of bias happens as a result of errors in the data collection process. A few examples of this kind of bias are:

    • An image classification system whose training data is collected from a camera with much higher image quality than the one that will be used in production.
    • Incorrectly labeled audio files for a project where the goal is to build a model that distinguishes male voices from female voices.
    • A survey containing leading questions that push answers in a particular direction.

    To avoid this error:

    • Compare the outputs of the measurement tools used for data collection to make sure they are consistent with the tools used in production.
    • Properly train labelers and annotation workers before putting them to work on data.
    • Follow survey best practices, such as avoiding leading questions.
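
    The first check in the list above can be sketched in a few lines. This is a minimal example, assuming you can compute the same numeric quality score for data from the collection tool and from the production tool. The sharpness scores, the function name standardized_mean_difference, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def standardized_mean_difference(collected, production):
    """Gap between the two means, in units of the pooled standard deviation."""
    pooled_sd = sqrt((stdev(collected) ** 2 + stdev(production) ** 2) / 2)
    if pooled_sd == 0:
        # No spread at all: the samples either match exactly or differ completely.
        return 0.0 if mean(collected) == mean(production) else float("inf")
    return abs(mean(collected) - mean(production)) / pooled_sd

# Hypothetical sharpness scores (0 to 1) for images from each camera.
lab_camera = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]
production_camera = [0.62, 0.58, 0.65, 0.60, 0.57, 0.63]

THRESHOLD = 0.5  # assumed cut-off; choose one that suits your own data
gap = standardized_mean_difference(lab_camera, production_camera)
needs_review = gap > THRESHOLD

print(f"standardized gap: {gap:.1f}")
print("measurement tools disagree, review before training" if needs_review
      else "measurement tools look consistent")
```

    A large gap does not say which tool is wrong, only that the training data and the production data are being measured differently enough to deserve a closer look.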

    Sample Bias

    This kind of bias is also a result of faults in the data collection process. Whereas measurement bias stems from errors in how the data is measured, sample bias arises when the data comes from a sample of individuals that is not representative of the population of interest.

    Amazon’s failed AI recruiting tool, which I wrote about in my last post, is one great example of this type of bias. Because the dataset used for the product came from resumes submitted mostly by male applicants, the AI tool had a strong preference for male candidates over female candidates.

    To avoid this error:

    • Clearly identify the goals of the project the data will be collected for and the audience the project is meant for.
    • Use random sampling techniques to make sure that every member of the population has an equal chance of being selected.
    • Double-check your work to make sure no mistakes were made.
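
    To make the random sampling advice concrete, here is a small sketch that compares a convenience sample with a simple random sample drawn using Python's random.sample. The population, its 50/50 split, and the helper name share_of_women are assumptions invented for this example.

```python
import random

# Hypothetical population: 500 men and 500 women, stored with the men listed first.
population = [{"id": i, "gender": "M" if i < 500 else "F"} for i in range(1000)]

def share_of_women(sample):
    """Fraction of the sample labeled 'F'."""
    return sum(1 for person in sample if person["gender"] == "F") / len(sample)

# Convenience sample: just take the first 100 records as they happen to be stored.
convenience_sample = population[:100]

# Simple random sample: every record has the same chance of being picked.
rng = random.Random(42)  # fixed seed so the example is repeatable
random_sample = rng.sample(population, 100)

print(f"women in the population:         {share_of_women(population):.0%}")
print(f"women in the convenience sample: {share_of_women(convenience_sample):.0%}")
print(f"women in the random sample:      {share_of_women(random_sample):.0%}")
```

    The convenience sample contains no women at all because of how the records happen to be ordered, while the random sample should land close to the population's actual split.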

  • Sexist AI Recruiting

    Employee staffing and recruiting is serious business. According to this source, staffing and recruitment in the United States was valued at over $150 billion in 2019. Many companies (especially organizations like Facebook and Google) spend millions scouring the country for talented professionals and interviewing them for opportunities. Given the amount of time, energy, and stress involved, it makes sense that some firms would want to automate some (or all) of the recruitment process.

    Many believe that AI can be a highly effective means of automatically identifying qualified candidates from the sea of resumes that often hit a company’s applicant tracking system. Some also believe that, unlike a human, who is fallible and has biases that can cause them to reject an otherwise qualified candidate, an AI has no such biases and can evaluate a candidate with lightning-quick, analytical precision.

    The truth is that AI, if not properly designed and managed, can magnify the biases of its creators, often in unintended ways.

    In late 2018, news broke that Amazon had shut down an AI product it had been using internally to automatically vet candidates for a number of roles. The reason: its AI models had a strong preference for male candidates.

    What went wrong?

    AI is not sexist, nor does it have any particular opinion about people or society. It merely finds the patterns in data that best correlate with the target specified by its creator.

    A number of articles have already been written on the subject, many of which rightly point out that the data used to develop the tool consisted of resumes submitted over a 10-year period, a majority of which were submitted by men.

    The biggest takeaway from this story is to avoid relying solely on AI to make business decisions.

  • The Obligatory Hello World Post

    Welcome to SegmentationPro!

    This is a blog that explores methods for clustering and analyzing data of all shapes, sizes, and types, with a focus on customer segmentation. Customer segmentation is a powerful tool that businesses employ to market their products and services effectively to new customers. It can also help businesses improve the quality of service provided to existing customers and promote customer loyalty and retention.

    Most of the data science and machine learning blog posts I found on the topic refer to K-Means clustering or a similar unsupervised learning algorithm. There are many more algorithms that can be employed for this purpose, and I wanted to create a resource for anyone looking for other methods to use in their projects.

    Whether you are a budding data analyst, an experienced professional, or a marketing professional looking to expand your skillset, my intention is to make SegmentationPro accessible to all interested parties. Subscribe to my blog to keep apprised of new posts and updates.