Cybersecurity and AI / ML bias – Security Boulevard



According to the press, vendor claims and blogs, cyber attackers and cyber defenders alike appear to be making increasing use of AI (artificial intelligence) and ML (machine learning). It therefore makes sense for cybersecurity professionals and researchers to gain a better understanding of the biases that affect the AI / ML pipeline. A recent article, “Biases in AI Systems,” by Ramya Srinivasan and Ajay Chander in the August 2021 issue of Communications of the ACM, does a great job of presenting the various biases and makes some suggestions on how to mitigate their negative impact.

The article is too detailed to be described in a short column, but I will list the biases it identifies. The CACM article presents an AI / ML pipeline, similar to the software development lifecycle (SDLC) used in software development. The AI / ML pipeline (AMP?) has the following sequential phases:

  • Data creation
  • Problem formulation
  • Data analysis
  • Validation and testing

I wonder whether the problem shouldn’t be formulated first, as in the requirements phase of the SDLC, with data creation and analysis following problem definition and requirements specification. Otherwise, the process looks like a “fishing expedition” in which the nature of the problem depends on the data available, rather than the data being sought after the problem has been defined. This is somewhat analogous to the point I make in my article “Accounting for Value and Uncertainty in Security Metrics,” in the November 2008 issue of the ISACA Review, where I note that the most useful metrics are often based on data that are harder and more expensive to obtain.

The biases within the AMP phases are identified in the article as follows:

Data creation bias

  • Sampling bias – due to selecting particular types of instances more often than others, making the dataset unrepresentative of the real world
  • Measurement bias – introduced by human measurement errors or by the intrinsic habits of those who capture the data
  • Label bias – associated with inconsistencies in the data labeling process due to the different styles and preferences of the labelers
  • Negative set bias – introduced by the lack of samples representative of the “rest of the world”
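
As a rough illustration of how sampling bias might be checked in practice, the sketch below compares the share of each category in a dataset with the share expected in the real world. It is not taken from the CACM article; the function name `representation_gaps`, the category names and all of the numbers are hypothetical and chosen only for the example.

```python
from collections import Counter


def representation_gaps(samples, reference_shares):
    """Return (observed share - reference share) for every category.

    A large positive value suggests the category is over-sampled,
    a large negative value that it is under-sampled.
    """
    total = len(samples)
    if total == 0:
        raise ValueError("samples must not be empty")
    counts = Counter(samples)
    categories = set(reference_shares) | set(counts)
    return {
        category: counts.get(category, 0) / total
        - reference_shares.get(category, 0.0)
        for category in categories
    }


# Hypothetical labels in a training set for a detection model.
training_labels = ["benign"] * 90 + ["phishing"] * 8 + ["ransomware"] * 2

# Hypothetical shares assumed to hold in real-world traffic.
expected_shares = {"benign": 0.70, "phishing": 0.20, "ransomware": 0.10}

gaps = representation_gaps(training_labels, expected_shares)
for category, gap in sorted(gaps.items()):
    print(f"{category}: {gap:+.2f}")
```

In this made-up case the benign class is over-represented by about 20 percentage points, while both attack classes are under-represented. The hard part, of course, is obtaining trustworthy reference shares in the first place.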

Problem formulation bias

  • Framing effect bias – based on how the problem is formulated and how the information is presented

Algorithm / data analysis bias

  • Sample selection bias – introduced by selecting individuals, groups or data for analysis in such a way that the samples are not representative of the population to be analyzed
  • Confounding bias – occurs if the algorithm learns the wrong relationships by not taking into account all the information contained in the data
  • Design bias – introduced or added solely by the algorithm

Evaluation / validation bias

  • Human evaluation bias – due to phenomena such as confirmation bias, the peak-end effect, prior beliefs (e.g., culture) and the amount of information that can be recalled (“recall bias”)
  • Sample processing bias – introduced by selectively subjecting certain groups of people to a type of treatment
  • Validation and test dataset bias – introduced by sample selection or label bias in the test and validation datasets, or resulting from the selection of inappropriate benchmarks / datasets for testing

The article states that “it may not be possible to eliminate all sources of bias,” but offers some guidelines, as follows:

  • Incorporate domain-specific knowledge (this is similar to what I advocate in my book “Engineering Safe and Secure Software Systems,” where I suggest including both information security and safety experts throughout the SDLC)
  • Understand which data features are considered sensitive depending on the application
  • Ensure data sets are representative of the actual population, to the extent possible
  • Establish appropriate standards for annotating (labeling) data
  • Include all variables that have dependencies with the target feature
  • Eliminate sources of confounding bias through appropriate data conditioning and randomization strategies in the selection of inputs
  • Take care not to introduce sample selection bias into the choice of data subsets to be analyzed
  • Guard against the introduction of a sample processing bias
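
To make the guideline on labeling standards a little more concrete, here is a minimal sketch of one common way to measure how consistently two annotators label the same items: Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The CACM article does not prescribe this measure; it is offered only as an example, and the annotator data and label names below are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1 - expected)


# Hypothetical labels assigned by two analysts to the same ten alerts.
analyst_1 = ["mal", "mal", "ben", "ben", "mal", "ben", "ben", "ben", "mal", "ben"]
analyst_2 = ["mal", "ben", "ben", "ben", "mal", "ben", "mal", "ben", "mal", "ben"]

kappa = cohens_kappa(analyst_1, analyst_2)
print(f"kappa = {kappa:.2f}")
```

Here the analysts agree on 8 of 10 alerts, but because 52% agreement would be expected by chance, kappa is only about 0.58. A low value would suggest that the annotation standard needs tightening before the labels are used for training.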

These guidelines make sense for AI / ML and should be considered when these technologies are applied to cybersecurity systems and services that incorporate AI / ML to some extent. It must be recognized, however, that many of these guidelines are “easier said than done.” Taking these biases and guidelines into account is also appropriate for security measures, and for investment in cybersecurity measures, in general.

*** This is a Security Bloggers Network syndicated blog written by C. Warren Axelrod. Read the original post at:


