Secure Information Ingestion from Office 365 using Azure Entra OAuth2
Information ingestion from MS Office 365 via MS Graph APIs using Python Selenium and Apache Hop
Introduction
The requirement for this task was to download attachments from a shared email address selectively, based on the attachment's file type, in an automated and scheduled manner, and then to parse and extract information from those files for staging on a database platform. In this document we discuss only the attachment-download part, designed and implemented with the Python Selenium library and Apache Hop pipelines and workflows.
To accomplish the task, we took a low-code (preferably no-code) approach, as it offers much more agility and flexibility for this implementation. I will describe all the implementation details in the following sections.
To meet the requirement of automating and scheduling the tasks, we decided to use an Apache Hop Docker container hosting Hop as a server, which wakes up to execute the flow and shuts down when processing is complete.
However, to keep this document readable, we will publish a follow-up article describing the implementation of automated and scheduled execution via Docker with Apache Hop and the other necessary libraries.
Azure OAuth2 Implementation
The first part of this task is to implement the Azure OAuth2 security protocol for MS Office 365 authentication.
The Azure Entra authentication flow is described at the following URL -
The steps of the OAuth2 authentication flow are -
- Send an HTTP request with the tenant_id, client_id, scope, and callback URL of the application registered on the Azure portal
- Next, Azure requests the user ID and password of the Azure account associated with the client_id by popping up a browser window
- Once the user ID and password are provided, Azure authenticates them, then prepares and sends a response to the registered callback URL
- The Azure response carries an Authorization Code in the browser's address bar, along with the callback URL and other parameters
- The client application needs to listen on the HTTP callback address for the Azure response and then extract the Authorization Code from the address bar of the response
- The application then sends another HTTP POST request with the client_id, client_secret, callback URL, and the Authorization Code received in the previous response
- Azure then sends another response with a JSON object that includes an access token of Bearer type along with its validity period
- Finally, the application parses, extracts, and stores the access token for all subsequent MS Graph API calls to access emails and download their attachments
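The request-building side of the steps above can be sketched in plain Python. The tenant, client, and redirect values below are placeholders for whatever your Azure app registration uses; the two `login.microsoftonline.com` v2.0 endpoints are the standard Microsoft identity platform ones.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical values -- substitute your own app registration's details.
TENANT_ID = "my-tenant-id"
CLIENT_ID = "my-client-id"
REDIRECT_URI = "http://localhost:8400/callback"
SCOPE = "offline_access Mail.Read"

def build_authorize_url():
    """Step 1: the URL the client opens to start the auth-code flow."""
    params = {
        "client_id": CLIENT_ID,
        "response_type": "code",
        "redirect_uri": REDIRECT_URI,
        "scope": SCOPE,
    }
    return (f"https://login.microsoftonline.com/{TENANT_ID}"
            f"/oauth2/v2.0/authorize?{urlencode(params)}")

def extract_auth_code(redirect_url):
    """Middle steps: pull the Authorization Code out of the redirect URL
    that lands in the browser's address bar."""
    return parse_qs(urlparse(redirect_url).query)["code"][0]

def build_token_request(auth_code, client_secret):
    """Final POST: form body sent to the /token endpoint in exchange
    for a Bearer access token."""
    url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
    body = {
        "client_id": CLIENT_ID,
        "client_secret": client_secret,
        "grant_type": "authorization_code",
        "code": auth_code,
        "redirect_uri": REDIRECT_URI,
    }
    return url, body
```

Only the URL and payload construction is shown; the actual HTTP round trips are performed by the shell/Python scripts described later.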
Authentication Process Flow Diagram
Implementation of this security handshake protocol via Apache Hop ETL is going to be
described in the following sections of this document.
The following diagram is a pictorial representation of acquiring an Azure access token -
Once an access token of Bearer type is obtained, it is stored and used for all subsequent Graph
API calls within its validity window. After that, it needs to be refreshed.
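The validity-window handling above can be sketched as follows. The sample response mirrors the shape of Azure's /token JSON (`access_token`, `expires_in` in seconds); the 60-second safety skew is an assumption, not anything mandated by Azure.

```python
import json
import time

def store_token(token_response_json, now=None):
    """Parse the /token JSON response and record when the token expires."""
    tok = json.loads(token_response_json)
    now = now if now is not None else time.time()
    return {
        "access_token": tok["access_token"],
        "expires_at": now + tok["expires_in"],  # end of validity window
    }

def needs_refresh(stored, now=None, skew=60):
    """Treat the token as stale a little early (skew seconds) so a call
    never goes out with a token that expires mid-flight."""
    now = now if now is not None else time.time()
    return now >= stored["expires_at"] - skew

# Hypothetical token response, shaped like Azure's actual JSON.
sample = json.dumps({"token_type": "Bearer", "expires_in": 3599,
                     "access_token": "eyJ...abc"})
stored = store_token(sample, now=1000.0)
```

Passing `now` explicitly keeps the functions testable; in production you would simply omit it.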
Actual Implementation of Azure OAuth2
I break the end-to-end workflow down into three stages, as described below -
- Register the app on the Azure portal, get credentials, and request an Auth Code
- Retrieve the Auth Code from the browser URL and request an access token
- Store the access token and use it with MS Graph API calls
Stages 1 and 2 are implemented inside an Apache Hop workflow by executing shell scripts, which in
turn invoke a Python script to accomplish the following automation -
- Open a web listener using the Python http.server library
- Make an HTTP request to microsoftonline.com
- Using the Selenium Chrome web driver, find the elements in the response web page for providing the email user ID and password; the application automatically fills in that information and sends it back to Azure
- Then switch to the browser address bar (URL location) and extract the authorization code from the Azure response's query parameters
- Finally, make another API call with the auth code and other necessary info; then parse the JSON response to extract the access token of Bearer type, to be used in subsequent Graph API calls
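The listener step can be demonstrated with nothing but the standard library. This sketch binds a one-shot http.server handler that captures the `code` query parameter of the callback; here the redirect is simulated with a local request, whereas the real flow receives it from the Selenium-driven browser (that part needs an actual Chrome driver and is omitted).

```python
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

captured = {}

class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Azure redirects here with ?code=... in the query string.
        qs = parse_qs(urlparse(self.path).query)
        captured["code"] = qs.get("code", [None])[0]
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"You may close this window.")

    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 lets the OS pick any free port.
server = HTTPServer(("127.0.0.1", 0), CallbackHandler)
port = server.server_address[1]
threading.Thread(target=server.handle_request, daemon=True).start()

# Simulate Azure's redirect to the callback URL (hypothetical code value).
urllib.request.urlopen(f"http://127.0.0.1:{port}/callback?code=ABC123").read()
server.server_close()
```

`handle_request()` serves exactly one request, which fits this flow: the listener only exists to catch the single redirect carrying the authorization code.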
Hop pipelines and workflows for downloading email attachments
Now we reach the Hop implementation for the actual email attachment downloads, which is
elaborated in the following sections -
The received access token is set as a variable for reuse in subsequent calls -
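In Python terms, reusing the token amounts to attaching it to every Graph request's Authorization header. The mailbox address below is a made-up example; `graph.microsoft.com/v1.0` is the standard Graph base URL.

```python
# Hypothetical token, obtained in the OAuth2 stage described earlier.
ACCESS_TOKEN = "eyJ...token"

def graph_headers(token):
    """Every Graph call carries the Bearer token in the Authorization header."""
    return {"Authorization": f"Bearer {token}",
            "Accept": "application/json"}

# Messages endpoint for the shared mailbox (address is a placeholder).
MESSAGES_URL = ("https://graph.microsoft.com/v1.0/users/"
                "shared.box@example.com/messages")
```

In Hop, the same idea is expressed by setting the token as a workflow variable and referencing it in each REST transform's header.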
Next, retrieve and store email information, as shown below, from the Office 365 cloud for a configurable
number of the latest emails received at the shared email address -
The attributes of each email (id, recvdDateTime, attachment flags, etc.) are extracted and stored in a
database table as shown above.
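The extraction step amounts to flattening the Graph `/messages` JSON into rows. The sample below follows the shape of a Graph response restricted with `$select=id,receivedDateTime,hasAttachments`; the ids and timestamps are invented.

```python
import json

# Hypothetical response, shaped like Graph's /messages payload.
sample_response = json.dumps({
    "value": [
        {"id": "AAMkAD01", "receivedDateTime": "2024-05-01T09:30:00Z",
         "hasAttachments": True},
        {"id": "AAMkAD02", "receivedDateTime": "2024-05-01T08:10:00Z",
         "hasAttachments": False},
    ]
})

def email_rows(response_json):
    """Flatten the Graph response into (id, recvdDateTime, has_attachment)
    tuples, ready for a bulk database insert."""
    return [(m["id"], m["receivedDateTime"], m["hasAttachments"])
            for m in json.loads(response_json)["value"]]

rows = email_rows(sample_response)
```

In the Hop pipeline this is done declaratively with a JSON Input transform feeding a Table Output, but the row shape is the same.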
Now we execute the attachment-download-main-job.hwf workflow -
Inside getting_email_ids.hpl, it reads the email_ids with the attachment flag set from the database table and
caches them in memory for use inside the following loop, which scans all the emails one by one -
Next, it enters a loop to process each email id from the cache, one after another -
Inside the do-for-each-email-id.hwf workflow, it first sets email_id as a variable to be retrieved by
the following workflow, makeAPICall-for-each-email-id.hwf, which executes in a loop over all the
email_ids stored in the cache.
Finally, the main pipeline makeEmailDetailsAPICall.hpl is executed to make a REST call for each
email_id having downloadable attachments.
The implementation requires an outer loop over email_ids and an inner loop over the multiple
attachment_ids of any email with multiple attachments. A Filter-Rows transform is used to
select certain attachment file types, while discarding jpg, png, etc.
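The Filter-Rows condition reduces to a predicate on the attachment file name. A minimal sketch, where the exact discard list is an assumption modelled on the jpg/png example above:

```python
# Hypothetical discard list -- image types we do not want to stage.
DISCARD_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def keep_attachment(name):
    """Mimic the Filter-Rows transform: keep a file unless its extension
    is on the discard list (case-insensitive)."""
    dot = name.rfind(".")
    ext = name[dot:].lower() if dot != -1 else ""
    return ext not in DISCARD_EXTENSIONS

attachments = ["report.pdf", "photo.JPG", "data.xlsx", "logo.png"]
kept = [a for a in attachments if keep_attachment(a)]
```

Filtering on the name avoids fetching the attachment body for files that would be discarded anyway.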
For emails with multiple attachments, I have chosen to download all the attachments in a
single API call, avoiding a separate call per attachment, and used the Enhanced JSON Output
transform instead.
Using the Group Key as the email id -
And the Fields as the attachment name and its content.
I store a single JSON file for each email; if an email has multiple attachments, their contents
are separated by unique attachment names.
This implementation has worked well so far, since most of the emails have a single attachment; it avoids an additional layer of individual API calls per attachment, which would increase the number of REST calls and the execution time. However, if the JSON file holds multiple large attachments and becomes too big, it might lead to memory issues. Users who hit memory issues with this approach may need to switch to one API call per attachment file.
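The grouping described above can be sketched directly: one JSON document per email, keyed by unique attachment names. Graph delivers a fileAttachment's body as base64 in `contentBytes`; everything else in this sketch (ids, file names, content) is invented for illustration.

```python
import base64
import json

def email_to_json(email_id, attachments):
    """Bundle all of one email's attachments into a single JSON document,
    keyed by (unique) attachment name -- mirroring the Enhanced JSON
    Output transform grouped on the email id."""
    doc = {"email_id": email_id, "attachments": {}}
    for att in attachments:
        # Graph returns fileAttachment content as base64 'contentBytes'.
        doc["attachments"][att["name"]] = att["contentBytes"]
    return json.dumps(doc)

# Hypothetical attachments for one email.
sample = [
    {"name": "invoice.pdf",
     "contentBytes": base64.b64encode(b"%PDF-1.4 ...").decode()},
    {"name": "terms.docx",
     "contentBytes": base64.b64encode(b"PK...").decode()},
]
doc = json.loads(email_to_json("AAMkAD01", sample))
```

Because the whole document is built in memory, this sketch shares the large-attachment caveat noted above; a streaming writer, or one call per attachment, would relieve that.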