Summary
As teams expand and deepen their usage of Azure DevOps, Personally Identifiable Information (PII) tends to find its way into work items (user stories, bugs, test cases, etc.). This can introduce liability and privacy issues for organizations. PII can creep into Azure DevOps environments in forms such as the following (the list may not be all-inclusive):
- Field values in user stories, bugs, tasks, etc. (description, acceptance criteria, title, and other HTML or text fields)
- Test cases (title, description, test step descriptions, test step attachments)
- Attachments to any work item type
While there is no off-the-shelf solution to help with this, there are ways to leverage Azure to develop a utility that finds, manages, and notifies appropriately when PII exists in the Azure DevOps organization.
Approaches to Scanning
There are two main parts to this problem. First, the team needs to find any PII that already exists in Azure DevOps and take necessary action. Second, the team needs a method to detect PII as close to real time as possible when it is first introduced.
Full Scan of Azure DevOps
To “catch up” and detect PII that already exists in Azure DevOps, a comprehensive scan of Azure DevOps content is needed.
Below is a very high-level pseudo-code outline of performing such a scan. This scan takes into consideration all the aforementioned areas where PII could be present in Azure Boards or Azure Test Plans (the components of Azure DevOps that leverage work items).
I also built a sample (just a sample, only a sample) here in GitHub.
    Connect to Azure DevOps organization
    Foreach (project in organization) {
        Foreach (workItemType in project) {
            Get all work items for the current workItemType
            Foreach (workItem in workItemsByType) {
                Get all HTML and text field values that could contain PII
                Send HTML/text field values to Azure Text Analytics (in batch document)
                Foreach (valueWithPII in TextAnalyticsResults) {
                    Take some action (notification, redaction, removal)
                }
                Get attachments for the workItem
                Foreach (attachment in workItemAttachments) {
                    Send attachment content (supported format) to Azure Computer Vision
                    Send computerVisionResults to Azure Text Analytics
                    Foreach (attachmentWithPII in AttachmentAnalyticsResults) {
                        Take some action (notification, removal)
                    }
                }
                If (workItemType is a Test Case) {
                    Get all values of each test step
                    Send test step values to Azure Text Analytics (in batch document)
                    Foreach (testStepWithPII in TestStepAnalyticsResults) {
                        Take some action (notification, redaction, removal)
                    }
                    Foreach (attachment in TestSteps) {
                        Send attachment content (supported format) to Azure Computer Vision
                        Send computerVisionResults to Azure Text Analytics
                        Foreach (attachmentWithPII in AttachmentAnalyticsResults) {
                            Take some action (notification, removal)
                        }
                    }
                    Get any test case parameters
                    Send test case parameters to Azure Text Analytics (in batch document)
                    Foreach (paramWithPII in TestParametersAnalyticsResults) {
                        Take some action (notification, removal)
                    }
                }
            }
        }
    }
This solution could also be used for a periodic scan if real-time/triggered scans are prohibitive.
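The “in batch document” steps in the outline imply splitting long field values and grouping them into requests. Here is a minimal Python sketch of that preparation step. The two limits are assumptions on my part (check the current Text Analytics data limits before relying on them), and `to_documents` and `batches` are hypothetical helper names, not part of any SDK.

```python
from typing import Dict, Iterator, List, Tuple

# Assumed limits; confirm against the current Text Analytics documentation.
MAX_CHARS_PER_DOCUMENT = 5120
MAX_DOCUMENTS_PER_REQUEST = 5

Document = Dict[str, str]
Source = Tuple[int, str, int]  # (work item id, field name, start offset)


def to_documents(work_item_id: int,
                 fields: Dict[str, str]) -> Tuple[List[Document], Dict[str, Source]]:
    """Split each field value into documents small enough for one request.

    Returns the documents plus a lookup from document id back to the
    work item, field, and character offset the text came from.
    """
    documents: List[Document] = []
    sources: Dict[str, Source] = {}
    for field_name, value in fields.items():
        # Empty values produce no documents because the range is empty.
        for start in range(0, len(value), MAX_CHARS_PER_DOCUMENT):
            doc_id = str(len(documents))
            documents.append({
                "id": doc_id,
                "text": value[start:start + MAX_CHARS_PER_DOCUMENT],
            })
            sources[doc_id] = (work_item_id, field_name, start)
    return documents, sources


def batches(documents: List[Document]) -> Iterator[List[Document]]:
    """Yield groups of documents, one group per API request."""
    for i in range(0, len(documents), MAX_DOCUMENTS_PER_REQUEST):
        yield documents[i:i + MAX_DOCUMENTS_PER_REQUEST]
```

One caveat with fixed-size splitting: a PII value that straddles a chunk boundary could be missed, so a real implementation may want to split on whitespace or overlap the chunks slightly.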
Reference Documentation
- Get started with the REST APIs for Azure DevOps Services and Team Foundation Server – Azure DevOps Services REST API | Microsoft Docs
- Azure Cognitive Services Text Analytics client library for .NET – Azure for .NET Developers | Microsoft Docs
- What is Optical character recognition? – Azure Cognitive Services | Microsoft Docs
Incremental or Triggered Scan
Moving forward, teams will need to detect the introduction of PII into Azure DevOps as soon as possible. There are a couple of approaches to this more incremental or trigger-based scan.
First, the solution developed in “Full Scan of Azure DevOps” could be utilized here as well, parameterized to check only the most recent items for a given interval. For example, if the scan is to run every hour, filter work item querying to return only items with a ChangedDate in the last 60 minutes.
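As a rough illustration of that filter, the sketch below builds a WIQL request body that selects work items changed within the last N minutes. The function name is hypothetical. As far as I know, comparing `System.ChangedDate` at finer than day granularity also requires the `timePrecision=true` query parameter on the WIQL endpoint, so treat that as something to verify.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def build_recent_changes_query(interval_minutes: int,
                               now: Optional[datetime] = None) -> Dict[str, str]:
    """Return a WIQL request body selecting work items changed in the interval.

    `now` is injectable so the function can be tested; it is assumed to be UTC.
    """
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(minutes=interval_minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "query": (
            "SELECT [System.Id] FROM WorkItems "
            f"WHERE [System.ChangedDate] >= '{since}' "
            "ORDER BY [System.ChangedDate] DESC"
        )
    }
```

An hourly job would call `build_recent_changes_query(60)` and post the result to the WIQL endpoint. Using a slightly longer window than the schedule interval is one way to avoid missing items if a run starts late.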
Second, Azure Logic Apps could be used to trigger when work items are updated in Azure DevOps, providing detection within 1 minute of PII introduction. The Logic App would orchestrate the extraction of content to check, as well as any mitigation actions.
Below are a couple screenshots of basic examples of using a Logic App (steps are simplified for brevity).
While there are Logic App connectors for Azure DevOps, Text Analytics, and Computer Vision, Azure Functions would provide more granular control (and move the solution toward a microservices architecture). Create Azure Functions to:
- “Sanitize” HTML field values to plain text
- Manage collation and interaction with Azure Text Analytics for text values
- Manage OCR actions using Azure Computer Vision to extract text values from images and other attachments
- Conduct PII replacement, redaction, or removal
- Facilitate logging (to Azure storage, databases, or Azure Event Hubs)
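To make the first of these functions concrete, here is a minimal sketch of HTML “sanitizing” using only the Python standard library. The set of block-level tags and the whitespace handling are my own choices for illustration. Work item HTML in your organization may need more care (for example, embedded images or tables).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and inserts a space around block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags from an HTML field value and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())
```

Note that collapsing whitespace changes character offsets, so any offsets returned by the PII check refer to the plain text, not to the original HTML.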
Lastly, the “Full Scan” solution could be combined with the Azure Functions/microservices-style architecture to create more reusable components, allowing for easier updates, fixes, and scale. For example, create Functions for each of the above-bulleted capabilities, and leverage those Functions from the “Full Scan” solution as well as the “Incremental Scan” solution.
Azure Services Used
Below are Azure services that could potentially be used for this solution and are referenced in this document.
- Azure Cognitive Services: Azure Cognitive Services are cloud-based services with REST APIs and client library SDKs available to help you build cognitive intelligence into your applications. You can add cognitive features to your applications without having artificial intelligence (AI) or data science skills. Azure Cognitive Services comprise various AI services that enable you to build cognitive solutions that can see, hear, speak, understand, and even make decisions.
- Text Analytics: The Text Analytics API is a cloud-based service that provides Natural Language Processing (NLP) features for text mining and text analysis, including: sentiment analysis, opinion mining, key phrase extraction, language detection, and named entity recognition.
- Named Entity Recognition (NER): Finds entities in text and categorizes them (e.g., person, event). It also identifies and categorizes PII (e.g., phone number, email address, passport number). PHI can also be found with Text Analytics for Health (medication, diagnosis, dosage, delivery).
- Computer Vision: The cloud-based Computer Vision API provides developers with access to advanced algorithms for processing images and returning information. By uploading an image or specifying an image URL, Microsoft Computer Vision algorithms can analyze visual content in different ways based on inputs and user choices.
- Azure Logic Apps: Azure Logic Apps is a cloud-based platform for creating and running automated workflows that integrate your apps, data, services, and systems.
- Azure Functions: Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running. You focus on the pieces of code that matter most to you, and Azure Functions handles the rest.
- Azure Blob Storage: Azure Blob storage is Microsoft’s object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data.
- Azure Event Hubs: Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.
Development Considerations
While the conceptual approach to scanning Azure DevOps is straightforward, there are programming considerations to work through when attempting a comprehensive scan of Azure DevOps work items. These are a few that I discovered during my research.
- Dealing with attachments: Checking for PII in an attachment requires additional steps, depending on the file format.
- Text-based (.txt, .json, .xml, .html, etc.): The content can be read into memory as text and then sent to the Text Analytics API.
- Binary (.jpg, .doc, .pdf, .png, etc.): If the format is supported by the Read API, the attachment URL can be passed to the Read API directly, provided the identity used to run the Cognitive Services resource has access to Azure DevOps. Otherwise, the attachment needs to be downloaded first. Depending on the file type, additional steps are needed to convert the content into a format accepted by the OCR features in Azure Computer Vision (the Read API).
- The Read API has the following input requirements:
- Supported file formats: JPEG, PNG, BMP, PDF, and TIFF
- For PDF and TIFF files, up to 2000 pages (only first two pages for the free tier) are processed.
- The file size must be less than 50 MB (6 MB for the free tier) and dimensions at least 50 x 50 pixels and at most 10000 x 10000 pixels.
- As documented in the pseudo-code for the full scan approach, an additional check and loop is needed to iterate test steps in a Test Case. For any attachments on a test step, the above “dealing with attachments” considerations also apply.
- An individual Logic App can be triggered only by changes to a single project (the trigger action can be bound to only one Azure DevOps project).
- Work items: The API limits work item query results to 200.
- Narrower queries may be needed, for example iteratively “walking back” by querying for items changed in the last 1 day, then 2 days, and so on.
- OData feeds support more than 200 results, but don’t include text-based fields. Additional calls would have to be incorporated.
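Two of the considerations above can be sketched as small helpers: deciding how to handle an attachment, and splitting work item IDs into groups of 200. This is only a sketch. The extension sets, the interpretation of 50 MB as 50 x 1024 x 1024 bytes, and the function names are assumptions of mine, and the image dimension and page count checks are left out.

```python
import os
from typing import Iterator, List

# Assumed groupings based on the formats listed above.
TEXT_EXTENSIONS = {".txt", ".json", ".xml", ".html"}
READ_API_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".pdf", ".tif", ".tiff"}
READ_API_MAX_BYTES = 50 * 1024 * 1024  # paid tier; the free tier limit is lower
WORK_ITEM_BATCH_SIZE = 200


def route_attachment(file_name: str, size_bytes: int) -> str:
    """Return 'text', 'ocr', or 'convert' for an attachment.

    'text'    -> read the content and send it to Text Analytics
    'ocr'     -> send it to the Read API as-is
    'convert' -> needs conversion (or is too large) before OCR
    """
    extension = os.path.splitext(file_name)[1].lower()
    if extension in TEXT_EXTENSIONS:
        return "text"
    if extension in READ_API_EXTENSIONS and size_bytes < READ_API_MAX_BYTES:
        return "ocr"
    return "convert"


def id_batches(work_item_ids: List[int]) -> Iterator[List[int]]:
    """Yield work item IDs in groups no larger than the assumed API limit."""
    for i in range(0, len(work_item_ids), WORK_ITEM_BATCH_SIZE):
        yield work_item_ids[i:i + WORK_ITEM_BATCH_SIZE]
```

Keeping the routing decision in one place makes it easier to add formats later, such as converting .doc files to PDF before sending them for OCR.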
Actions Upon Detection
Regardless of the approach used to detect PII, the actions taken upon detection are most important. What to do depends on urgency, compliance, and trust.
Logging
Simply logging the detection may be good enough if proper reporting is all that is needed. Sending the detection event to Azure Event Hubs or Azure Event Grid provides an avenue for the event to be recorded in an analytics workspace or analysis engine.
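A detection event might look something like the sketch below. The field names are hypothetical, and the actual send to Event Hubs or Event Grid is left out. The one design point worth noting is that the event records where the PII was found and what category it was, not the PII text itself, so the log does not become a second place where PII lives.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_detection_event(work_item_id: int,
                          field_name: str,
                          category: str,
                          offset: int,
                          length: int,
                          detected_at: Optional[datetime] = None) -> str:
    """Serialize a PII detection as JSON without including the PII value."""
    detected_at = detected_at or datetime.now(timezone.utc)
    event = {
        "workItemId": work_item_id,
        "field": field_name,
        "category": category,   # for example, a category name returned by the PII check
        "offset": offset,       # position of the match within the field text
        "length": length,
        "detectedAt": detected_at.isoformat(),
    }
    return json.dumps(event)
```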
Notification
Notification can involve several methods:
- Email the user who introduced the PII, that person’s manager, or a compliance team.
- Post a message to Microsoft Teams.
- Place a tag on the work item to draw attention to it.
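For the tagging option, work item updates in the Azure DevOps REST API are expressed as a JSON Patch document. The sketch below builds one that sets `System.Tags`. The tag name `PII-Detected` and the function name are hypothetical. To avoid depending on how the service merges tags, the helper rebuilds the full tag list from the existing value, which is an approach I chose for illustration.

```python
from typing import Dict, List


def build_tag_patch(existing_tags: str, new_tag: str) -> List[Dict[str, str]]:
    """Build a JSON Patch body that adds new_tag to a work item's tags.

    existing_tags is the current System.Tags value, a semicolon-separated string.
    """
    tags = [t.strip() for t in existing_tags.split(";") if t.strip()]
    if new_tag not in tags:  # keep the operation idempotent
        tags.append(new_tag)
    return [{
        "op": "add",
        "path": "/fields/System.Tags",
        "value": "; ".join(tags),
    }]
```

The resulting list would be sent as the body of a work item update request with the `application/json-patch+json` content type.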
Mitigation
Mitigation involves taking direct action on the PII content. During this exercise, several options presented themselves. For example, if the following text was detected in the description field of a work item:
Parker Doe has repaid all of their loans as of 2020-04-25. Their SSN is 859-98-0987. To contact them, use their phone number 800-102-1100.
- PII deletion: Delete the PII content and save the work item.
- PII redaction: The content can be replaced with its redacted equivalent (Azure Text Analytics provides redaction automatically): ********** has repaid all of their loans as of **********. Their SSN is ***********. To contact them, use their phone number ************.
- Secure the PII: Move the PII content to a location that has proper RBAC, such as Azure Blob Storage. Replace the description field with a notice of the PII detection and the blob URL. Only those with RBAC access will be able to view it.
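If you would rather apply the redaction yourself (for example, to use a different mask character), the idea can be sketched as below, given the offset and length of each detected entity. The `redact` helper is hypothetical, and the offsets here are found with `str.index` purely for the demonstration. In practice they would come from the PII detection results. One thing to verify is the unit those offsets are expressed in, since Python indexes by code point.

```python
from typing import Iterable, Tuple


def redact(text: str, spans: Iterable[Tuple[int, int]], mask: str = "*") -> str:
    """Replace each (offset, length) span with mask characters of equal length."""
    chars = list(text)
    for offset, length in spans:
        end = min(offset + length, len(chars))  # guard against spans past the end
        for i in range(offset, end):
            chars[i] = mask
    return "".join(chars)


sample = ("Parker Doe has repaid all of their loans as of 2020-04-25. "
          "Their SSN is 859-98-0987. To contact them, use their phone number 800-102-1100.")
detected = ("Parker Doe", "2020-04-25", "859-98-0987", "800-102-1100")
spans = [(sample.index(value), len(value)) for value in detected]
print(redact(sample, spans))
```

Keeping the mask the same length as the original text preserves the offsets of any other entities in the same field.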
Associated Costs
Specific costs of this solution depend heavily on overall volume: frequency of execution, API calls, number of records checked, etc. Below is reference pricing for each service that could be utilized.
Logic Apps
- Price Per Execution: $0.000025 (First 4,000 actions free)
- Standard Connector: $0.000125 (Azure DevOps, Text Analytics)
Text Analytics
- 0-500,000 text records — $1 per 1,000 text records
- 0.5M-2.5M text records — $0.75 per 1,000 text records
- 2.5M-10.0M text records — $0.30 per 1,000 text records
- 10M+ text records — $0.25 per 1,000 text records
A text record is up to 1,000 characters of a request.
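To get a feel for the numbers, here is a small estimator based on the Text Analytics tiers listed above. It assumes graduated pricing, meaning each tier's rate applies only to the records that fall within that tier. That is my reading of the table, so confirm it against the current pricing page. The function names are hypothetical.

```python
import math

# (upper bound of tier in text records, price in USD per 1,000 records)
TIERS = [
    (500_000, 1.00),
    (2_500_000, 0.75),
    (10_000_000, 0.30),
    (math.inf, 0.25),
]


def text_records(character_count: int) -> int:
    """Number of 1,000-character text records needed for one document."""
    return math.ceil(character_count / 1000)


def estimate_cost(records: int) -> float:
    """Estimated cost in USD for a number of text records, tier by tier."""
    cost = 0.0
    lower = 0
    for upper, price in TIERS:
        if records <= lower:
            break
        in_tier = min(records, upper) - lower  # records billed at this tier's rate
        cost += in_tier / 1000 * price
        lower = upper
    return cost
```

For example, a 2,500-character description counts as 3 text records, and under these assumptions 600,000 records would come to 500 x $1.00 + 100 x $0.75 = $575.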
Computer Vision
- OCR pricing table
- 0-1M transactions — $1 per 1,000 transactions
- 1M-10M transactions — $0.65 per 1,000 transactions
- 10M-100M transactions — $0.60 per 1,000 transactions
- 100M+ transactions — $0.40 per 1,000 transactions
Disclaimer
The information in this document is designed to be a sample reference for the solution described. It does not imply an enterprise-ready solution architecture, nor represent a commitment to fully design and build the described solution.