Secure Information Ingestion from Office 365 using Azure Entra OAuth2
Information ingestion from MS Office 365 via MS Graph APIs using Python Selenium and Apache Hop
Introduction
The requirement for this task was to download attachments from a shared email address selectively, based on the attachment's file type, in an automated and scheduled manner, and then to parse and extract information from those files for staging on a database platform. In this document we discuss only the attachment-download part, designed and implemented with the Python Selenium library and Apache Hop pipelines and workflows.
To accomplish the task, we took a low-code (preferably no-code) approach, as it offers much more agility and flexibility for this implementation. I will describe all the implementation details in the following sections.
To meet the requirement of automating and scheduling the tasks, we decided to use an Apache Hop Docker container hosting Hop as a server, which wakes up to execute the flow and shuts down when processing is complete.
However, to keep this document readable, we will publish a follow-up article describing the implementation of automated and scheduled execution via Docker with Apache Hop and the other necessary libraries.
Azure OAuth2 Implementation
The first part of this task is to implement the Azure OAuth2 security protocol for MS Office 365 authentication.
The Azure Entra authentication flow is described at the following URL -
The steps of the OAuth2 authentication flow are -
- Send an HTTP request with the tenant_id, client_id, scope, and callback URL of the application registered on the Azure portal
- Next, Azure requests the user ID and password of the Azure account associated with the client_id by popping up a browser window
- Once the user ID and password are provided, Azure authenticates them, then prepares and sends a response to the registered callback URL
- The Azure response carries an Authorization Code in the browser's address bar, along with the callback URL and other parameters
- The client application needs to listen on the HTTP callback address for the Azure response and then extract the Authorization Code from the address bar of the response
- The application then sends another HTTP POST request with the client_id, client_secret, callback URL, and the Authorization Code received in the previous response
- Azure then sends another response with a JSON object that includes an access token of Bearer type along with its validity period
- Finally, the application parses, extracts, and stores the access token for all subsequent MS Graph API calls to access emails and download their attachments
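The request-building side of the steps above can be sketched in plain Python. The tenant, client, and redirect values below are placeholders for whatever your Azure app registration uses; the two `login.microsoftonline.com` v2.0 endpoints are the standard Microsoft identity platform ones.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical values -- substitute your own app registration's details.
TENANT_ID = "my-tenant-id"
CLIENT_ID = "my-client-id"
REDIRECT_URI = "http://localhost:8400/callback"
SCOPE = "offline_access Mail.Read"

def build_authorize_url():
    """Step 1: the URL the client opens to start the auth-code flow."""
    params = {
        "client_id": CLIENT_ID,
        "response_type": "code",
        "redirect_uri": REDIRECT_URI,
        "scope": SCOPE,
    }
    return (f"https://login.microsoftonline.com/{TENANT_ID}"
            f"/oauth2/v2.0/authorize?{urlencode(params)}")

def extract_auth_code(redirect_url):
    """Middle steps: pull the Authorization Code out of the redirect URL
    that lands in the browser's address bar."""
    return parse_qs(urlparse(redirect_url).query)["code"][0]

def build_token_request(auth_code, client_secret):
    """Final POST: form body sent to the /token endpoint in exchange
    for a Bearer access token."""
    url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
    body = {
        "client_id": CLIENT_ID,
        "client_secret": client_secret,
        "grant_type": "authorization_code",
        "code": auth_code,
        "redirect_uri": REDIRECT_URI,
    }
    return url, body
```

Only the URL and payload construction is shown; the actual HTTP round trips are performed by the shell/Python scripts described later.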
Authentication Process Flow Diagram
Implementation of this security handshake protocol via Apache Hop ETL is going to be
described in the following sections of this document.
The following diagram is a pictorial representation of acquiring an Azure access token -
Once an access token of Bearer type is obtained, it is stored and used for all subsequent Graph
API calls within its validity window. After that, it needs to be refreshed.
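The validity-window handling above can be sketched as follows. The sample response mirrors the shape of Azure's /token JSON (`access_token`, `expires_in` in seconds); the 60-second safety skew is an assumption, not anything mandated by Azure.

```python
import json
import time

def store_token(token_response_json, now=None):
    """Parse the /token JSON response and record when the token expires."""
    tok = json.loads(token_response_json)
    now = now if now is not None else time.time()
    return {
        "access_token": tok["access_token"],
        "expires_at": now + tok["expires_in"],  # end of validity window
    }

def needs_refresh(stored, now=None, skew=60):
    """Treat the token as stale a little early (skew seconds) so a call
    never goes out with a token that expires mid-flight."""
    now = now if now is not None else time.time()
    return now >= stored["expires_at"] - skew

# Hypothetical token response, shaped like Azure's actual JSON.
sample = json.dumps({"token_type": "Bearer", "expires_in": 3599,
                     "access_token": "eyJ...abc"})
stored = store_token(sample, now=1000.0)
```

Passing `now` explicitly keeps the functions testable; in production you would simply omit it.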
Actual Implementation of Azure OAuth2
I break the end-to-end workflow down into three stages, as described below -
- Register the app on the Azure portal, get credentials, and request an Auth Code
- Retrieve the Auth Code from the browser URL and request an access token
- Store the access token and use it with MS Graph API calls
Stages 1 and 2 are implemented inside an Apache Hop workflow by executing shell scripts, which in
turn invoke a Python script to accomplish the following automation -
- Open a web listener using the Python http.server library
- Make an HTTP request to microsoftonline.com
- Using the Selenium Chrome web driver, find the elements in the response web page for providing the email user ID and password; the application automatically fills in that information and sends it back to Azure
- Then switch to the browser address bar (URL location) and extract the authorization code from the Azure response's query parameters
- Finally, make another API call with the auth code and other necessary info; then parse the JSON response to extract the access token of Bearer type, to be used in subsequent Graph API calls
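The listener step can be demonstrated with nothing but the standard library. This sketch binds a one-shot http.server handler that captures the `code` query parameter of the callback; here the redirect is simulated with a local request, whereas the real flow receives it from the Selenium-driven browser (that part needs an actual Chrome driver and is omitted).

```python
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

captured = {}

class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Azure redirects here with ?code=... in the query string.
        qs = parse_qs(urlparse(self.path).query)
        captured["code"] = qs.get("code", [None])[0]
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"You may close this window.")

    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 lets the OS pick any free port.
server = HTTPServer(("127.0.0.1", 0), CallbackHandler)
port = server.server_address[1]
threading.Thread(target=server.handle_request, daemon=True).start()

# Simulate Azure's redirect to the callback URL (hypothetical code value).
urllib.request.urlopen(f"http://127.0.0.1:{port}/callback?code=ABC123").read()
server.server_close()
```

`handle_request()` serves exactly one request, which fits this flow: the listener only exists to catch the single redirect carrying the authorization code.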
Hop pipelines and workflows for downloading email attachments
Now we reach the Hop implementation for the actual email attachment downloads, which is
elaborated in the following sections -
The received access token is set as a variable for reuse in subsequent calls -
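In Python terms, reusing the token amounts to attaching it to every Graph request's Authorization header. The mailbox address below is a made-up example; `graph.microsoft.com/v1.0` is the standard Graph base URL.

```python
# Hypothetical token, obtained in the OAuth2 stage described earlier.
ACCESS_TOKEN = "eyJ...token"

def graph_headers(token):
    """Every Graph call carries the Bearer token in the Authorization header."""
    return {"Authorization": f"Bearer {token}",
            "Accept": "application/json"}

# Messages endpoint for the shared mailbox (address is a placeholder).
MESSAGES_URL = ("https://graph.microsoft.com/v1.0/users/"
                "shared.box@example.com/messages")
```

In Hop, the same idea is expressed by setting the token as a workflow variable and referencing it in each REST transform's header.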
Next, retrieve and store email information, as shown below, from the Office 365 cloud for a configurable
number of the latest emails received at the shared email address -
The attributes of each email (id, recvdDateTime, attachment flags, etc.) are extracted and stored in a
database table as shown above.
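The extraction step amounts to flattening the Graph `/messages` JSON into rows. The sample below follows the shape of a Graph response restricted with `$select=id,receivedDateTime,hasAttachments`; the ids and timestamps are invented.

```python
import json

# Hypothetical response, shaped like Graph's /messages payload.
sample_response = json.dumps({
    "value": [
        {"id": "AAMkAD01", "receivedDateTime": "2024-05-01T09:30:00Z",
         "hasAttachments": True},
        {"id": "AAMkAD02", "receivedDateTime": "2024-05-01T08:10:00Z",
         "hasAttachments": False},
    ]
})

def email_rows(response_json):
    """Flatten the Graph response into (id, recvdDateTime, has_attachment)
    tuples, ready for a bulk database insert."""
    return [(m["id"], m["receivedDateTime"], m["hasAttachments"])
            for m in json.loads(response_json)["value"]]

rows = email_rows(sample_response)
```

In the Hop pipeline this is done declaratively with a JSON Input transform feeding a Table Output, but the row shape is the same.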
Now we execute the attachment-download-main-job.hwf workflow -
Inside getting_email_ids.hpl, it reads the email_ids with the attachment flag set from the database table and
caches them in memory for use inside the following loop, which scans all the emails one by one -
Next, it enters a loop to process each email id from the cache, one after another -
Inside the do-for-each-email-id.hwf workflow, it first sets email_id as a variable to be retrieved by
the following workflow, makeAPICall-for-each-email-id.hwf, which executes in a loop over all the
email_ids stored in the cache.
Finally, the main pipeline makeEmailDetailsAPICall.hpl is executed to make a REST call for each
email_id having downloadable attachments.
The implementation requires an outer loop over email_ids and an inner loop over the multiple
attachment_ids of any email with multiple attachments. A Filter-Rows transform is used to
select certain attachment file types, while discarding jpg, png, etc.
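The Filter-Rows condition reduces to a predicate on the attachment file name. A minimal sketch, where the exact discard list is an assumption modelled on the jpg/png example above:

```python
# Hypothetical discard list -- image types we do not want to stage.
DISCARD_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def keep_attachment(name):
    """Mimic the Filter-Rows transform: keep a file unless its extension
    is on the discard list (case-insensitive)."""
    dot = name.rfind(".")
    ext = name[dot:].lower() if dot != -1 else ""
    return ext not in DISCARD_EXTENSIONS

attachments = ["report.pdf", "photo.JPG", "data.xlsx", "logo.png"]
kept = [a for a in attachments if keep_attachment(a)]
```

Filtering on the name avoids fetching the attachment body for files that would be discarded anyway.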
For emails with multiple attachments, I have chosen to download all the attachments in a
single API call, avoiding a separate call per attachment, and used the Enhanced JSON Output
transform instead.
Using the Group Key as the email id -
And the Fields as the attachment name and its content.
I store a single JSON file for each email; if an email has multiple attachments, their contents
are separated by unique attachment names.
This implementation has worked well so far, since most of the emails have a single attachment; it avoids an additional layer of individual API calls per attachment, which would increase the number of REST calls and the execution time. However, if the JSON file holds multiple large attachments and becomes too big, it might lead to memory issues. Users who hit memory issues with this approach may need to switch to one API call per attachment file.
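The grouping described above can be sketched directly: one JSON document per email, keyed by unique attachment names. Graph delivers a fileAttachment's body as base64 in `contentBytes`; everything else in this sketch (ids, file names, content) is invented for illustration.

```python
import base64
import json

def email_to_json(email_id, attachments):
    """Bundle all of one email's attachments into a single JSON document,
    keyed by (unique) attachment name -- mirroring the Enhanced JSON
    Output transform grouped on the email id."""
    doc = {"email_id": email_id, "attachments": {}}
    for att in attachments:
        # Graph returns fileAttachment content as base64 'contentBytes'.
        doc["attachments"][att["name"]] = att["contentBytes"]
    return json.dumps(doc)

# Hypothetical attachments for one email.
sample = [
    {"name": "invoice.pdf",
     "contentBytes": base64.b64encode(b"%PDF-1.4 ...").decode()},
    {"name": "terms.docx",
     "contentBytes": base64.b64encode(b"PK...").decode()},
]
doc = json.loads(email_to_json("AAMkAD01", sample))
```

Because the whole document is built in memory, this sketch shares the large-attachment caveat noted above; a streaming writer, or one call per attachment, would relieve that.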