Data Loss Prevention (DLP) Architecture – Deep Dive

One of the hot features of Office 365, and "by inheritance" of SharePoint 2016, is Data Loss Prevention (DLP).  DLP allows you to execute search queries on your content to find possible "Sensitive Data Types".  You can find many blog posts on how to set up DLP queries and how to find data, but those stay at a high level.  In this blog post, I get into the nitty-gritty of how it actually works under the covers.

High Level Steps:

  • User/App uploads a document
  • Search gather runs (likely every 15 minutes)
  • Search content indexes the content
    • Ceres engine executes flows; one such flow is called "Data Loss Prevention"
      • DLP Flow runs
      • DLP Flow looks for keywords
      • DLP Flow looks for regular expression matches
      • DLP excludes invalid "test" values
      • DLP Flow adds "SensitiveType" crawl property data
  • Ceres adds data to the index
  • User queries for DLP data using the "SensitiveType" managed property
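
For the last step, here is a minimal Python sketch of building such a query string. It assumes the `name|count range|confidence range` form that Microsoft's DLP query documentation describes for the SensitiveType property. The helper name `build_sensitive_type_query` is my own invention, so verify the syntax against your environment before relying on it.

```python
def build_sensitive_type_query(type_name, min_count=1, min_confidence=None):
    """Build a query fragment: SensitiveType:"<name>|<count range>|<confidence range>".

    Assumption: open-ended ranges are written as "N..", per the documented query form.
    """
    parts = [type_name, f"{min_count}.."]
    if min_confidence is not None:
        parts.append(f"{min_confidence}..")
    return 'SensitiveType:"' + "|".join(parts) + '"'


# At least 5 credit card numbers, each found with a confidence of 80 or higher.
print(build_sensitive_type_query("Credit Card Number", min_count=5, min_confidence=80))
```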

Deep Level Steps:

Most of the major work happens in the Ceres Flow.  It uses a "configuration" file of sorts from the "Microsoft.Ceres.DataLossPrevention" .NET assembly in the GAC.  You will find a "defaultCLassificationRules" resource file in this assembly.  This file is called a "RulePackage" and contains rule packs.  Currently for SharePoint 2016, there is only one rule pack; Office 365 may have more.  Opening this file, you will find the following:

  • Entities
    • High level container of a particular "end user" data type
      • US Driver's License Number – these are different in almost every state
      • Canada Driver's License Number – different in each of the provinces
    • Made up of patterns
  • Patterns
    • Made up of IDMatches and keyword matches
    • Patterns have confidence levels.  Most expect a confidence level of 75% or higher.
    • Patterns can enforce minMatches (must have at least one keyword, etc.)
  • RegEx Expressions
  • Keywords
    • Keywords may be case sensitive – most are not
  • Localization strings – these are used for the rendering and matching of keywords
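
To make the hierarchy above a little more concrete, here is a rough Python model of how an entity, its patterns, and minMatches could fit together. This is only a sketch of the idea, not the actual Ceres implementation. The class names, field names, confidence values, and keywords are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:
    confidence: int          # confidence reported when this pattern is satisfied
    id_regex: str            # id of the regex that must match (the IDMatch)
    keyword_lists: list = field(default_factory=list)  # each entry is a set of keywords
    min_matches: int = 0     # how many keyword lists must have at least one hit


@dataclass
class Entity:
    name: str
    patterns: list


def best_confidence(entity, matched_regex_ids, found_keywords):
    """Return the highest confidence among satisfied patterns, or 0 if none apply."""
    best = 0
    for pattern in entity.patterns:
        if pattern.id_regex not in matched_regex_ids:
            continue  # the IDMatch is mandatory
        hits = sum(1 for kws in pattern.keyword_lists if kws & found_keywords)
        if hits >= pattern.min_matches:
            best = max(best, pattern.confidence)
    return best


# Hypothetical entity: the regex alone gives low confidence, a keyword raises it.
phin = Entity("Canada PHIN", [
    Pattern(confidence=65, id_regex="Regex_canada_phin"),
    Pattern(confidence=85, id_regex="Regex_canada_phin",
            keyword_lists=[{"phin", "health"}], min_matches=1),
])

print(best_confidence(phin, {"Regex_canada_phin"}, {"health"}))  # 85
print(best_confidence(phin, {"Regex_canada_phin"}, set()))       # 65
```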

Patterns:

  • Credit Card Number
  • EU Debit Card Number
  • US Social Security Number
  • US Individual Taxpayer Id Number (ITIN)
  • Canada Social Insurance Number
  • UK NINO
  • UK Driver's License
  • German Driver's License Number
  • German Passport Number
  • UK NHS Number
  • France INSEE
  • France Driver's License
  • Canada Driver's License
  • US Driver's License
  • Japan Driver's License
  • Japan Resident Registration
  • Japan Social Insurance Number
  • Japan Passport Number
  • Japan Bank Account Number
  • France Passport Number
  • US/UK Passport Number
  • SWIFT Code
  • US Bank Account Number
  • ABA Routing Number
  • DEA Number
  • Australia Medical Account Number
  • Australia Tax File Number
  • Israel National ID Number
  • New Zealand Health Number
  • Spain SSN
  • Sweden National ID
  • Australia Bank Account Number
  • Australia Passport Number
  • Canada Bank Account Number
  • Canada Passport Number
  • Canada PHIN
  • Canada Health Service Number
  • France CNI
  • IP Address
  • IBAN
  • Israel Bank Account Number
  • Italy Driver's License Number
  • Saudi Arabia National ID
  • Sweden Passport Number
  • U.K. Electoral Number
  • Finnish National ID
  • Taiwanese National ID
  • Poland National ID (PESEL)
  • Poland Identity Card
  • Poland Passport Number

Regular Expressions:

Here are some of the regular expressions that the DLP flow looks for:

  • <Regex id="Regex_france_cni">(^|\s)(\d{12})($|\s|\.\s)</Regex>
  • <Regex id="Regex_uk_electoral">(^|\s)([a-zA-Z]{2}\d{1,4})($|\s|\.\s)</Regex>
  • <Regex id="Regex_canada_health_service_number">(^|\s)(\d{10})($|\s|\.\s)</Regex>
  • <Regex id="Regex_canada_phin">(^|\s)(\d{9})($|\s|\.\s)</Regex>
  • <Regex id="Regex_canada_passport_number">(^|\s)(\D{2})(\d{6})($|\s|\.\s)</Regex>
  • <Regex id="Regex_canada_bank_account_number">(^|\s)(\d{7})($|\s|\.\s)</Regex>
  • <Regex id="Regex_australia_passport_number">(^|\s)([A-Za-z]\d{7})($|\s|\.\s)</Regex>
  • <Regex id="Regex_australia_drivers_license_number">(?ix)(?:^|\s)(?=(?:[A-Z\d]{2}\d{2}[A-Z\d]{5})(?:$|\s|\.\s))(?=(?:[A-Z]{0,2}\d){4,9})(?!(?:\d{0,9}[A-Z]){3,9})[A-Z\d]{9}</Regex>
  • <Regex id="Regex_australia_bank_account_number">(^|\s)([0-9]{6,10})($|\s|\.\s)</Regex>
  • <Regex id="Regex_sweden_passport_number">(^|\s)(\d{8})($|\s|\.\s)</Regex>
  • <Regex id="Regex_italy_drivers_license_number">(^|\s)(\D{1}[^b-uw-zB-UW-Z])((\w{7})(\D))($|\s|\.\s)</Regex>
  • <Regex id="Regex_ipv4_address">(^|\s)((?:[0-9]\.)|(?:[0-9]))(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?!(?:\.[0-9])|(?:[0-9]))($|\s|\.\s)</Regex>
  • <Regex id="Regex_ipv6_address">(^|\s)((?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4})($|\s|\.\s)</Regex>
  • <Regex id="Regex_israel_bank_account_number">(^|\s)(\d{2}-\d{3}-\d{8}|\d{13})($|\s|\.\s)</Regex>
  • <Regex id="Regex_saudi_arabia_national_id">(^|\s)(\d{10})($|\s|\.\s)</Regex>
  • <Regex id="Regex_usa_bank_account_number">(^|\s)(\d{4,17})($|\s|\.\s)</Regex>
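
If you want to get a feel for how loose some of these expressions are, you can try them outside of SharePoint. The Python sketch below runs two of them with the backslash escapes written out (`\s`, `\d`, `\D`, `\.`), which is my reading of the rules above. Python's `re` module stands in for the .NET regex engine that Ceres uses, and `find_candidates` is a hypothetical helper of my own.

```python
import re

# Escaped forms of two rules from the list above (assumed reading).
CANADA_PASSPORT = re.compile(r"(^|\s)(\D{2})(\d{6})($|\s|\.\s)")
IPV6 = re.compile(r"(^|\s)((?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4})($|\s|\.\s)")


def find_candidates(pattern, text):
    """Return matched values with the surrounding whitespace and trailing period removed."""
    return [m.group(0).strip().rstrip(".") for m in pattern.finditer(text)]


print(find_candidates(CANADA_PASSPORT, "Passport AB123456 issued in Ottawa"))
print(find_candidates(IPV6, "host FE80:0000:0000:0000:0202:B3FF:FE1E:8329 is up"))
```

Note that a regex hit on its own is only a candidate. Something like nine digits in a row matches a lot of content that is not sensitive at all, which is why the keyword checks matter.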

Keywords:

Keywords are important for deciding the confidence level of a particular pattern.  If a keyword is found, then the confidence goes up.  Here are some examples:

  • UK NINO:
    • national insurance number,national insurance contributions,protection act,insurance,social security number,insurance application,medical application,social insurance,medical attention,social security,great britain,insurance
  • UK Driver's License:
    • DVLA,light vans,quadbikes,motor cars,125cc,sidecar,tricycles,motorcycles,photocard licence,learner drivers,licence holder,licence holders,driving licences,driving licence,dual control car
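
Here is a small Python sketch of how a keyword check could raise the confidence of a regex hit. The 300-character window, the two confidence values, and the function names are assumptions I made for illustration; the real flow takes its settings from the rule pack. The licence number in the example is a made-up sample.

```python
# A few keywords from the UK Driver's License list above.
UK_DRIVERS_KEYWORDS = ["DVLA", "driving licence", "photocard licence", "licence holder"]


def keywords_near(text, start, end, keywords, window=300):
    """Return the keywords found (case-insensitively) within `window` chars of text[start:end]."""
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    nearby = text[lo:hi].lower()
    # Plain substring test; a real engine would likely respect word boundaries.
    return [kw for kw in keywords if kw.lower() in nearby]


def confidence(text, start, end, keywords, base=65, boosted=85):
    """Report the boosted confidence only when at least one keyword is close to the match."""
    return boosted if keywords_near(text, start, end, keywords) else base


sample = "Please check the driving licence number MORGA657054SM9IJ on file."
start = sample.index("MORGA657054SM9IJ")
end = start + len("MORGA657054SM9IJ")
print(confidence(sample, start, end, UK_DRIVERS_KEYWORDS))
```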

Keep in mind that your SharePoint on-premises environment will always have a static rule pack unless you download and apply updates.  Office 365 will continually get updated rule packs to enhance the DLP engine features so you will be more protected against sensitive data "leaks".

Enjoy!
Chris