Customizing an Apache Hop Docker Container for Automated Office 365 Authentication

Introduction

In my previous post, "Apache Hop implementation for selectively downloading email-attachments from MS Office365 Cloud", I explained an Azure Entra OAuth2 implementation using Apache Hop and Python. We chose to run the automated task in an Apache Hop Docker container triggered by a scheduler, following a ‘wake up, do the task, shut down’ approach, which the Hop documentation describes as the short-lived execution pattern.


In this article, I describe the steps for customizing the Docker container, which was necessary to support the execution of Bash and Python scripts integrated within Hop pipelines and workflows.

Hop Container with External Dependencies 

The Apache Hop container required the following external packages:


  • python3

  • py3-pip

  • wget

  • dpkg

  • chromium-chromedriver

  • chromium

  • git

  • selenium (and bs4) for Python 3
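Once the packages are installed, a short Python check can confirm that the expected executables are on the PATH. The following is a minimal sketch: the executable names in `REQUIRED_TOOLS` are assumptions about what each package provides (for example, that `py3-pip` provides `pip`), and `missing_tools` is a hypothetical helper, not part of Hop.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executable names assumed to be provided by the packages listed above.
REQUIRED_TOOLS = ["python3", "pip", "wget", "dpkg", "chromedriver", "chromium", "git"]


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools(REQUIRED_TOOLS)` inside the container returns an empty list when everything is in place. The `which` parameter exists only so the lookup can be replaced in tests.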

Making Hop Container Ready

The Apache Hop Docker image is built on a lightweight Alpine Linux OS.

Below are the steps to prepare the container for running the task described in the previous post.

Download and Run the Hop Docker Image

Pull the Hop Docker image

# docker pull apache/hop:development


List the downloaded images

# docker images (note down the image name and tag)


Start a container from the above image

# docker run -d --name customContainer <image name>:<tag>


Check the running container 


# docker ps (note container id, name)


Log in to the container as the root user

# docker exec -it -u root <container id> /bin/bash
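If these steps need to be repeated, the same commands can be assembled from a script. The sketch below only builds the argument lists for the pull, run and exec commands shown above; `docker_setup_commands` is a hypothetical helper, and it assumes the container is addressed by its name rather than its id.

```python
from typing import List


def docker_setup_commands(image: str, tag: str, name: str) -> List[List[str]]:
    """Build the pull / run / exec commands as argument lists."""
    ref = f"{image}:{tag}"
    return [
        ["docker", "pull", ref],
        ["docker", "run", "-d", "--name", name, ref],
        # Open a root shell in the running container.
        ["docker", "exec", "-it", "-u", "root", name, "/bin/bash"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as lists avoids shell quoting problems.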


Adding External Dependencies

Once the container is up and running, install the required dependencies.

Ensure that the Docker environment has at least 8 GB of memory and 2+ CPU cores, depending on workload.

Install Python and pip


# apk add --no-cache python3


# apk add --no-cache py3-pip

Install wget


# apk update && apk add --no-cache wget

Browser Setup for Headless Execution

Initially, I attempted to install Google Chrome using the following commands:


# apk add --no-cache --repository=http://dl-cdn.alpinelinux.org/alpine/edge/community google-chrome-stable

# apk add dpkg

# dpkg --add-architecture amd64

However, I found that Google Chrome is not fully compatible with Alpine Linux: Google's official Chrome builds target glibc-based distributions, whereas Alpine uses musl libc.
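Because the choice of browser depends on the base distribution, a setup script can read `/etc/os-release` before deciding what to install. This is a minimal sketch, assuming the standard `ID=alpine` entry in that file; `parse_os_release` and `browser_package_for` are hypothetical helper names.

```python
from typing import Dict


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines in the os-release format into a dictionary."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def browser_package_for(os_release_text: str) -> str:
    """Pick a browser package name: Chromium on Alpine, Google Chrome otherwise."""
    distro = parse_os_release(os_release_text).get("ID", "")
    return "chromium" if distro == "alpine" else "google-chrome-stable"
```

Inside the container, the input would come from `open("/etc/os-release").read()`.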

Instead, I installed Chromium, the open-source alternative, along with its WebDriver:

# apk add chromium-chromedriver

# apk add chromium
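With Chromium installed in place of Chrome, Selenium may need to be told explicitly where the browser and driver live. The sketch below looks them up on the PATH and passes them to Selenium 4. The executable names `chromium`, `chromium-browser` and `chromedriver` are assumptions about the Alpine packages, and both function names are hypothetical.

```python
import shutil
from typing import Callable, Dict, Optional


def find_chromium_paths(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Locate the Chromium browser and chromedriver executables on PATH."""
    browser = which("chromium") or which("chromium-browser")
    return {"browser": browser, "driver": which("chromedriver")}


def build_driver(paths: Dict[str, Optional[str]]):
    """Create a headless Chrome WebDriver bound to the given executables."""
    # Imported lazily so the path lookup works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.binary_location = paths["browser"]
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    service = Service(executable_path=paths["driver"])
    return webdriver.Chrome(service=service, options=options)
```

If both executables are already discoverable, `webdriver.Chrome(options=...)` alone may be enough. The explicit paths mainly help when the lookup fails.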

Install Git

I also installed Git for potential future use:

# apk info git

# apk add git

# apk list git

git-2.47.2-r0 x86_64 {git} (GPL-2.0-only) [installed]

Install Python Dependencies (Selenium and bs4)


While installing Selenium using pip, I encountered the following error:


error: externally-managed-environment

× This environment is externally managed

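This occurs because Alpine marks its system Python installation as externally managed (PEP 668): packages in the system environment are expected to be installed through apk, so pip refuses to modify it.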
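Whether pip will raise this error can be checked in advance. Under PEP 668, a distribution signals an externally managed environment by placing a file named `EXTERNALLY-MANAGED` in the standard library directory, and the restriction does not apply inside a virtual environment. The helper names below are hypothetical; this is a sketch of the check, not pip's own implementation.

```python
import sys
import sysconfig
from pathlib import Path


def is_externally_managed(stdlib_dir: str) -> bool:
    """True if the PEP 668 marker file exists in the given stdlib directory."""
    return (Path(stdlib_dir) / "EXTERNALLY-MANAGED").is_file()


def in_virtualenv() -> bool:
    """True when running inside a virtual environment created by venv."""
    return sys.prefix != sys.base_prefix


def pip_install_allowed() -> bool:
    """pip may install freely in a venv or when no marker file is present."""
    if in_virtualenv():
        return True
    return not is_externally_managed(sysconfig.get_path("stdlib"))
```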

There are two ways to resolve this:

Option 1: Use venv

Create a Python virtual environment, activate it, and install the package inside it:


# python3 -m venv ./py_envs

# cd py_envs

# . ./bin/activate

# python3 -m pip install selenium
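Activation is mainly a convenience for interactive shells. A scheduler or a Hop action can call the interpreter inside the virtual environment directly, and the packages installed there are picked up without activating anything. The sketch below assumes a POSIX layout (`bin/python`); the helper names and the example paths are hypothetical.

```python
from pathlib import Path
from typing import List


def venv_python(env_dir: str) -> Path:
    """Path of the interpreter inside a POSIX virtual environment."""
    return Path(env_dir) / "bin" / "python"


def venv_command(env_dir: str, script: str, *args: str) -> List[str]:
    """Argument list that runs `script` with the venv's interpreter."""
    return [str(venv_python(env_dir)), script, *args]
```

For example, `venv_command("/root/py_envs", "getCallbackUrlServer.py", "/root/email_scripts")` yields a list that can be handed to `subprocess.run`.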


Option 2: Override System Package Restrictions


# pip install selenium --break-system-packages

# pip install bs4 --break-system-packages


This approach generates warnings but works if you choose to bypass the restrictions.

Running the Authentication Script

After installing all dependencies, I was able to run the authentication script inside the Hop Docker container.
The script uses Chromium in headless mode along with Selenium to log in to Microsoft 365.

# ./retrieveAuthCodeScriptServer.sh /root/email_scripts shared_mailbox@mydomain.com <passwd for shared_mailbox@mydomain.com>

Sample Output


 /root/email_scripts

Downloading from shared mailbox:  shared_mailbox@mydomain.com

127.0.0.1 - - [28/Feb/2026 17:15:00] "GET /?code=1.ARxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&state=12345&session_s3a HTTP/1.1" 200 -

127.0.0.1 - - [28/Feb/2026 17:15:00] code 404, message File not found

127.0.0.1 - - [28/Feb/2026 17:15:00] "GET /favicon.ico HTTP/1.1" 404 -

Cleaning up background processes

240

Killing python http server process id 240
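The authorization code appears as the `code` query parameter of the callback request in the log above. As a minimal sketch, it can be extracted from a callback URL or request path with the standard library. `extract_auth_code` is a hypothetical helper, and the sample values in the usage note are placeholders.

```python
from typing import Optional
from urllib import parse


def extract_auth_code(callback: str) -> Optional[str]:
    """Return the `code` query parameter from a callback URL or request path."""
    query = parse.urlsplit(callback).query
    values = parse.parse_qs(query).get("code")
    return values[0] if values else None
```

For a request path such as `/?code=abc&state=12345` the function returns `abc`, and it returns `None` when no code is present, which makes a failed login easy to detect.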

The Bash script and Python file used for automated authentication handling inside the Docker container in headless mode are included in the Appendix.

Next Steps

In the next article, I will share the complete Apache Hop workflows and pipelines, along with the associated code used in this implementation. I will also include the Bash and Python scripts used for automated authentication handling within the Docker container in headless mode.

Appendix

The content of “retrieveAuthCodeScriptServer.sh” follows. The first argument is the working directory inside the container, the second is the shared mailbox address, and the third is the password for that mailbox.


#!/bin/bash

# Launch the HTTP server process in the background.
# stdout is discarded; the request log (stderr) still goes to the terminal.
python -m http.server 5000 2>&1 > /dev/null &

wd=$1
echo "$wd"
user=$2
echo "$user"
passwd=$3
#echo "$passwd"

# Run the Selenium login script and capture its output in the log file
python -u "$wd/getCallbackUrlServer.py" "$wd" "$user" "$passwd" > "$wd/log/auth_code.log"

# Kill the http server process once auth code is retrieved
echo "Cleaning up background processes"

# Note: this matches every process whose name contains "python"
PIDs=$(pgrep -l python | awk '{print $1}')
echo $PIDs

for pid in $PIDs
do
        if [[ -n "$pid" ]]; then
                echo "Killing python http server process id $pid"
                kill -9 "$pid"
        fi
done

exit 0
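The cleanup step above finds the server by process name, which also matches any other Python process in the container. An alternative, sketched here in Python under the assumption that the wrapper itself could be a Python script, keeps a handle on the child process and stops only that one. `background_process` and `http_server_command` are hypothetical names.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def background_process(cmd: List[str]) -> Iterator[subprocess.Popen]:
    """Start `cmd` in the background and stop it when the block exits."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    try:
        yield proc
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # Fall back to a hard kill if the process ignores SIGTERM.
            proc.kill()
            proc.wait()


def http_server_command(port: int) -> List[str]:
    """Command line for Python's built-in HTTP server on the given port."""
    return [sys.executable, "-m", "http.server", str(port)]
```

Used as `with background_process(http_server_command(5000)): ...`, the server lives only for the duration of the login step.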


The content of getCallbackUrlServer.py, invoked by the Bash script above, automates the Microsoft 365 login for the shared mailbox user with a password:


#!/usr/bin/python

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
from urllib import parse
import webbrowser
import json
import os
import sys

# Command-line arguments: working directory, mailbox address, password
folder_dir = str(sys.argv[1])
print(folder_dir + "\n")

email = str(sys.argv[2])
print(email + "\n")

password = str(sys.argv[3])
# Not printed: stdout is redirected to the log file by the calling script
# print(password + "\n")

# Authorization endpoint; replace the <...> placeholders with your own values
url = (
    "https://login.microsoftonline.com/<Application tenant id for the app registered on Azure portal>"
    "/oauth2/v2.0/authorize"
    "?client_id=<Application client id for the app registered on Azure portal>"
    "&response_type=code&response_mode=query"
    "&scope=https%3A%2F%2Fgraph.microsoft.com%2F.default&state=12345"
)

# Headless Chromium configuration for running inside the container
chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless=new")

driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.Chrome()

# Open the login page and wait for it to load
driver.get(url)
sleep(10)

# Enter the mailbox address and submit
elem = driver.find_element(By.NAME, "loginfmt")
elem.send_keys(email)
sleep(1)
elem.send_keys(Keys.RETURN)

# Enter the password and submit
sleep(5)
elem = driver.find_element(By.NAME, "passwd")
elem.send_keys(password)
sleep(1)
elem.send_keys(Keys.RETURN)

sleep(5)
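The fixed `sleep` calls in the script wait the full duration even when the page is ready sooner, and fail when the page takes longer. A polling helper is one alternative. The sketch below is a generic, dependency-free version with a hypothetical name, `wait_until`; Selenium also ships `WebDriverWait` together with `expected_conditions` for the same purpose.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 10.0, interval: float = 0.5) -> T:
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With the driver from the script, `wait_until(lambda: driver.find_elements(By.NAME, "passwd"))[0]` would return the password field as soon as it appears. `find_elements` is used because it returns an empty list rather than raising when nothing matches.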



                                                                                               
