This blog post discusses HDInsight Premium, which is currently in preview. HDInsight Premium adds the ability to domain-join HDInsight clusters and adds Apache Ranger, which can then be used to control access to databases and tables on HDInsight.
At the time of writing, the documentation for HDInsight is very poor and HDInsight Premium has a number of limitations and issues, most of which are not documented, so I hope this post will help others.
HDInsight Premium allows you to join clusters to Azure AD Domain Services (AAD DS) domains. This lets you use accounts from your on-premises domain in HDInsight, provided you are synchronising users and groups via AAD Connect and have enabled password hash synchronisation. Furthermore, you can then configure role-based access control for Hive using Apache Ranger.
At the time of writing HDInsight Premium is in preview and has not reached GA, which means it is not backed by a full SLA. The Premium SKU is only available for “Hadoop” clusters, which do not come with Spark. However, HDInsight Premium with Spark clusters is available in private preview to a limited number of customers.
The domain-joining feature relies on Azure AD Domain Services (AAD DS), which provisions a Microsoft-managed read-only domain controller. Until recently it was only possible to deploy AAD DS to a classic VNET, which then required a VNET peering connection to the ARM VNET containing your HDInsight cluster (this obviously requires that your VNETs are in the same region).
AD Connect and Password Synchronisation
- Firstly you must use Azure AD Connect to synchronise users and groups to Azure AD
- Secondly you need to enable password synchronisation.
- Password synchronisation will apply to all users that are being synchronised to Azure AD.
- Synchronisation traffic uses HTTPS
- When synchronizing passwords, the plain-text version of your password is not exposed to the password synchronization feature, to Azure AD, or any of the associated services.
- The original hash is not transmitted to Azure AD. Instead, the SHA256 hash of the original MD4 hash is transmitted. As a result, if the hash stored in Azure AD is obtained, it cannot be used in an on-premises pass-the-hash attack.
- On-premises to Azure AD synchronisation: this usually runs hourly unless you have a newer version of Azure AD Connect and have customised the synchronisation interval.
- Azure AD to AAD DS: the documentation states this takes 20 minutes, but in my experience this usually takes closer to 1 hour.
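To make the two hops concrete, the worst-case delay before an on-premises change shows up in AAD DS is roughly the sum of the two intervals. The sketch below is plain arithmetic on the figures quoted above. The function name and the simplification of adding the two delays together are my own assumptions, not something taken from the Microsoft documentation.

```python
def worst_case_lag_minutes(onprem_to_aad: int, aad_to_aadds: int) -> int:
    """Upper bound on propagation delay, assuming a change just misses each sync."""
    return onprem_to_aad + aad_to_aadds


# Documented figures: hourly AD Connect sync plus 20 minutes to AAD DS.
documented = worst_case_lag_minutes(60, 20)

# Figures from my own experience: the second hop takes closer to 1 hour.
observed = worst_case_lag_minutes(60, 60)

print(f"documented: {documented} min, observed: {observed} min")
```

If you have customised the AD Connect interval, substitute your own value for the first argument.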
Azure AD Domain Services
Enabling SSL/TLS for AAD DS
Cluster Domain Join Account
- Permissions to join machines to the domain
- Permissions to place the machines into the OU created for HDInsight clusters
- Permissions to create service principals within the OU
- Permissions to create reverse DNS entries
- Right-click the OU, select Delegate Control
- Click Next
- Click Add
- Select the account to be used for domain joining and click OK
- Click Next
- Select Delegate the following common tasks, and select Create, delete, and manage user accounts
- Click Next then click Finish
- From ADUC click View > Advanced Features
- Right-click the OU and click Properties
- Click the Security tab
- Grant the domain join account the following permissions
- Create all child objects
- Delete all child objects
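If you would rather script the last part of this than click through ADUC, the create/delete child permissions can probably be granted with the `dsacls` command-line tool. The snippet below is only a sketch that builds the argument list. The OU distinguished name and account name are hypothetical placeholders, and I have not confirmed that the result is identical to what the Security tab produces, so check it against `dsacls /?` before relying on it. It does not cover the Delegate Control wizard step for user accounts.

```python
from typing import List


def build_dsacls_args(ou_dn: str, account: str) -> List[str]:
    """Build a dsacls argument list granting create/delete child rights on an OU."""
    return [
        "dsacls",
        ou_dn,
        "/I:T",                   # apply to this object and its sub-objects
        "/G", f"{account}:CCDC",  # CC = create child, DC = delete child
    ]


# Hypothetical OU and domain join account, replace with your own values.
args = build_dsacls_args("OU=HDInsight,DC=example,DC=com", r"EXAMPLE\hdi-join")
print(args)
```

On a management machine with the AD tools installed, the list could be passed to `subprocess.run(args, check=True)`.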
Issues and Limitations
- HDInsight Premium is in public preview, which means that it is not subject to any SLAs
- The synchronisation lag can be quite large. In theory it should be at most 1 hour 20 minutes from on-premises AD to AAD DS, but in practice it is more like 2 hours. Keep this in mind when troubleshooting permission or access issues.
- The documentation for HDInsight is pretty bare bones and contains errors.
- For example, this article https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-domain-joined-configure-use-powershell#run-the-powershell-script links to a repo in GitHub that is supposed to do the AAD DS configuration for you. However, apart from a README.md file it is an empty repo;
- It does not explain the permissions required to domain join a cluster in enough detail e.g. on the OU, the exact DNS permissions, how to create reverse DNS zones (unless you are a DNS admin you won’t know this);
- There are special requirements for the username of the domain join account but these are not documented anywhere.
- If you delete a cluster it leaves behind the DNS entries (forward and reverse), computer accounts, as well as the user and service principal objects. This obviously clutters AAD DS but can also cause problems if you want to do CI/CD and the objects already exist.
- The components that are available with HDInsight are also not well documented e.g.
- Jupyter is currently not available, presumably because it is not trivial to integrate with Kerberos. You can use Zeppelin though.
- The Microsoft-provided Hue script action will not work because it does not support Kerberos, and adding that support would take a significant amount of effort. In light of this you would have to use Ambari Hive views.
- Oozie is not available on the cluster either.
- Applications are not supported – which means you cannot add edge nodes via an ARM template
- Other things that are not documented include
- If you are using Azure Data Factory (ADF) then Hive activities do not work.
- Spark activities with ADF do work, but you have to disable CSRF protection in the livy.conf configuration file (you can do this via Ambari), which isn’t a good idea from a security standpoint.
- Ranger policies are only provided for Hive/Spark – they do not cover HDFS. I believe this is because of the limitations with Azure Storage authorisation and authentication listed here https://hadoop.apache.org/docs/current3/hadoop-azure/index.html#Limitations
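On the Livy CSRF point above: my understanding is that when CSRF protection is enabled, Livy rejects POST requests that do not carry an `X-Requested-By` header, which is presumably what trips up ADF. For your own clients, sending that header is a better option than turning the protection off. The sketch below only builds such a request without sending it. The cluster URL, jar path and header value are hypothetical placeholders, and authentication is left out.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url: str, app_file: str) -> urllib.request.Request:
    """Build (but do not send) a Livy batch submission that passes the CSRF check."""
    body = json.dumps({"file": app_file}).encode("utf-8")
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Header checked by Livy when CSRF protection is enabled;
            # the value here is an arbitrary placeholder.
            "X-Requested-By": "example-client",
        },
        method="POST",
    )


# Hypothetical endpoint and application path.
req = build_livy_batch_request(
    "https://example-cluster.example.net/livy", "wasb:///example/app.jar"
)
print(req.get_method(), req.full_url)
```

This does not help with ADF itself, since you cannot control the headers it sends, but it does mean your own tooling doesn't need CSRF protection disabled.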