Customizing an Apache Hop Docker Container
Customizing an Apache Hop Docker Container for Automated Office 365 Authentication
Introduction
In my previous blog, Apache Hop implementation for selectively downloading email-attachments from MS Office365 Cloud, I focused on explaining the Azure Entra OAuth2 implementation using Apache Hop and Python. We chose to run the automated task in an Apache Hop Docker container triggered by a scheduler, following a ‘wake-up, do task, shutdown’ approach, which the Hop documentation describes as a short-lived execution pattern.
In this article, I describe the steps for customizing the Docker container, which was necessary to support the execution of Bash and Python scripts integrated within Hop pipelines and workflows.
Hop Container with External Dependencies
The Apache Hop container required the following external packages:
python3
py3-pip
wget
dpkg
chromium-chromedriver
chromium
git
selenium (and bs4) for Python 3
Making Hop Container Ready
The Apache Hop Docker image is built on a lightweight Alpine Linux OS.
Below are the steps to prepare the container for running the task described in the previous blog.
Download and Run the Hop Docker Image
Pull the Hop Docker image
# docker pull apache/hop:development
List the downloaded images
# docker images (note down the image name and tag)
Start a container from the above image
# docker run -d --name customContainer <image name>:<tag>
Check the running container
# docker ps (note the container ID and name)
Log in to the container as the root user
# docker exec -it -u root <container id> /bin/bash
Adding External Dependencies
Once the container is up and running, install the required dependencies.
Ensure that the Docker environment has at least 8 GB of memory and 2+ CPU cores, depending on workload.
Install python and pip
# apk add --no-cache python3
# apk add --no-cache py3-pip
Install wget
# apk add --no-cache wget
Browser Setup for Headless Execution
Initially, I attempted to install Google Chrome using the following commands:
# apk add --no-cache --repository=http://dl-cdn.alpinelinux.org/alpine/edge/community google-chrome-stable
# apk add dpkg
# dpkg --add-architecture amd64
However, I found that Google Chrome is not fully compatible with Alpine Linux: Chrome's official Linux builds target glibc-based distributions, while Alpine uses musl libc.
Instead, I installed Chromium, the open-source alternative, along with its WebDriver:
# apk add chromium-chromedriver
# apk add chromium
Install Git
I also installed Git for potential future use:
# apk info git
# apk add git
# apk list git
git-2.47.2-r0 x86_64 {git} (GPL-2.0-only) [installed]
Install Python Dependencies (Selenium and bs4)
While installing Selenium using pip, I encountered the following error:
error: externally-managed-environment
× This environment is externally managed
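This occurs because the system Python on Alpine Linux is marked as externally managed (PEP 668): system-wide Python packages are expected to come from apk, so pip refuses to install into the system environment by default.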
There are two ways to resolve this:
Option 1: Use venv
Create a Python virtual environment, activate it, and install the package inside it:
# python3 -m venv ./py_envs
# cd py_envs
# . ./bin/activate
# python3 -m pip install selenium bs4
Option 2: Override System Package Restrictions
# pip install selenium --break-system-packages
# pip install bs4 --break-system-packages
This approach bypasses the restriction and installs directly into the system environment. pip prints warnings, but the installation works.
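Once the packages are installed, a quick check from inside the container can confirm that everything the scripts rely on is actually present. The following is a minimal sketch, not part of the original setup: the executable names (for example `chromium` and `chromedriver`) are assumptions about what the Alpine packages listed earlier put on the PATH, and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names for the Alpine packages listed earlier;
# adjust them if your image installs the binaries under other names.
REQUIRED_EXECUTABLES = ["python3", "pip", "wget", "dpkg", "git",
                        "chromium", "chromedriver"]
# Top-level Python modules used by the authentication script.
REQUIRED_MODULES = ["selenium", "bs4"]


def missing_executables(names):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the top-level modules the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def summarize():
    """Collect both checks into one dictionary for easy printing."""
    return {
        "missing_executables": missing_executables(REQUIRED_EXECUTABLES),
        "missing_modules": missing_modules(REQUIRED_MODULES),
    }
```

Calling `print(summarize())` with the same interpreter that Hop will use (the venv's `python3` under Option 1, the system one under Option 2) shows which option actually took effect.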
Running the Authentication Script
The script uses Chromium in headless mode along with Selenium to log in to Microsoft 365.
# ./retrieveAuthCodeScriptServer.sh /root/email_scripts shared_mailbox@mydomain.com <passwd for shared_mailbox@mydomain.com>
Sample Output
/root/email_scripts
Downloading from shared mailbox: shared_mailbox@mydomain.com
127.0.0.1 - - [28/Feb/2026 17:15:00] "GET /?code=1.ARxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&state=12345&session_s3a HTTP/1.1" 200 -
127.0.0.1 - - [28/Feb/2026 17:15:00] code 404, message File not found
127.0.0.1 - - [28/Feb/2026 17:15:00] "GET /favicon.ico HTTP/1.1" 404 -
Cleaning up background processes
240
Killing python http server process id 240
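The authorization code arrives as the `code` query parameter of the redirect request shown in the output above. As an illustration only (the original scripts may extract it differently), here is a minimal sketch that pulls the code out of one such request line using the standard library; the function name `extract_auth_code` is hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request part of an http.server log line,
# e.g. "GET /?code=...&state=12345 HTTP/1.1"
REQUEST_LINE = re.compile(r'"GET (?P<target>\S+) HTTP/[\d.]+"')


def extract_auth_code(log_line):
    """Return the value of the `code` query parameter, or None if absent."""
    match = REQUEST_LINE.search(log_line)
    if match is None:
        return None
    query = urlsplit(match.group("target")).query
    values = parse_qs(query).get("code")
    return values[0] if values else None
```

Lines without a `code` parameter, such as the `/favicon.ico` request, simply yield `None`.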
The Bash script and the Python file used for automated authentication handling inside the Docker container in headless mode are listed in the Appendix.
Next Steps
In the next article, I will share the complete Apache Hop workflows and pipelines, along with the associated code used in this implementation. I will also include the Bash and Python scripts used for automated authentication handling within the Docker container in headless mode.
Appendix
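The following is the content of “retrieveAuthCodeScriptServer.sh”. The first argument is the working directory inside the Docker container, the second is the shared mailbox address, and the third is the password for that mailbox.
#!/bin/bash
# Usage: retrieveAuthCodeScriptServer.sh <working dir> <shared mailbox> <password>

# Launch the local HTTP server that receives the OAuth2 redirect.
# Only stdout is discarded; the server's request log (stderr) still reaches the terminal.
python -m http.server 5000 2>&1 > /dev/null &

wd=$1
echo "$wd"
user=$2
echo "$user"
passwd=$3
#echo "$passwd"

# Run the Selenium login; quoting keeps arguments with spaces or special characters intact.
python -u "$wd/getCallbackUrlServer.py" "$wd" "$user" "$passwd" > "$wd/log/auth_code.log"

# Kill the HTTP server process once the auth code is retrieved.
# Note: pgrep matches every process whose name contains "python".
echo "Cleaning up background processes"
PIDs=$(pgrep -l python | awk '{print $1}')
echo $PIDs
for pid in $PIDs
do
    if [[ -n "$pid" ]]; then
        echo "Killing python http server process id $pid"
        kill -9 "$pid"
    fi
done
exit 0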
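The cleanup loop above stops every process whose name matches `python`, which is fine in a short-lived container but would also hit unrelated Python processes elsewhere. As an alternative sketch (not the approach used in this implementation), the server can be started as a child process and stopped by its own handle. The helper names `background_process` and `run_with_callback_server` are hypothetical, and unlike the Bash script this sketch discards the server's stderr as well as its stdout.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def background_process(command):
    """Start `command` as a child process and stop only that process on exit."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        yield process
    finally:
        process.terminate()              # ask the child to stop (SIGTERM)
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()               # force it if it does not exit in time
            process.wait()


def run_with_callback_server(task, port=5000):
    """Run `task()` while a local http.server listens on `port`."""
    server_command = [sys.executable, "-m", "http.server", str(port)]
    with background_process(server_command):
        return task()
```

Here `task` would be any callable that performs the Selenium login; the server is shut down even if the task raises an exception.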
The following is the content of getCallbackUrlServer.py, which the Bash script above invokes to automate the Microsoft 365 login for the shared mailbox user:
#!/usr/bin/python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
from urllib import parse
import webbrowser
import json
import os
import sys
# Command-line arguments: working directory, mailbox address, password
folder_dir = str(sys.argv[1])
print(folder_dir + "\n")
email = str(sys.argv[2])
print(email + "\n")
password = str(sys.argv[3])
#print(password + "\n")  # left disabled so the password is not written to log/auth_code.log
url = "https://login.microsoftonline.com/<Application tenant id for the app registered on Azure portal>/oauth2/v2.0/authorize?client_id=<Application client id for the app registered on Azure portal>&response_type=code&response_mode=query&scope=https%3A%2F%2Fgraph.microsoft.com%2F.default&state=12345"
# Configure Chromium for headless execution inside the container
chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless=new")
driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.Chrome()
# Open the Microsoft sign-in page and give it time to load
driver.get(url)
sleep(10)
# Enter the mailbox address and submit
elem = driver.find_element(By.NAME, "loginfmt")
elem.send_keys(email)
sleep(1)
elem.send_keys(Keys.RETURN)
sleep(5)
# Enter the password and submit
elem = driver.find_element(By.NAME, "passwd")
elem.send_keys(password)
sleep(1)
elem.send_keys(Keys.RETURN)
sleep(5)
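The authorize URL in the script is written out by hand, with the scope already percent-encoded. A possible alternative, shown here only as a sketch, is to let the standard library build and encode the query string. The function name `build_authorize_url` is hypothetical; the tenant and client IDs are placeholders for the values of the app registered on the Azure portal, and the parameters mirror those in the script.

```python
from urllib.parse import urlencode

AUTHORITY = "https://login.microsoftonline.com"


def build_authorize_url(tenant_id, client_id, state="12345"):
    """Assemble the OAuth2 authorize URL with a percent-encoded query string."""
    params = {
        "client_id": client_id,
        "response_type": "code",
        "response_mode": "query",
        "scope": "https://graph.microsoft.com/.default",
        "state": state,
    }
    return f"{AUTHORITY}/{tenant_id}/oauth2/v2.0/authorize?{urlencode(params)}"
```

Keeping the parameters in a dictionary makes it harder to lose a quote or mis-encode a character when the URL is edited later.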