Amazon Kendra is an intelligent search service powered by machine learning (ML), enabling organizations to provide relevant information to customers and employees when they need it.
Amazon Kendra uses ML algorithms to let users search with natural language queries for information scattered across multiple data sources in an enterprise, including commonly used document storage systems like Microsoft OneDrive.
OneDrive is an online cloud storage service that allows you to host your content and have it automatically sync across multiple devices. Amazon Kendra can index document formats like Microsoft OneNote, HTML, PDF, Microsoft Word, Microsoft PowerPoint, Microsoft Excel, Rich Text, JSON, XML, CSV, XSLT, and plain text.
We’re excited to announce that we have updated the OneDrive connector for Amazon Kendra to add even more capabilities. For example, we have added support to search OneNote documents. Additionally, you can now choose to use identity or ACL information to make your searches more granular.
The connector indexes documents together with their access control information, so that search results are limited to the documents a user is allowed to access. To filter search results by access rights using only the user information, the connector provides an identity crawler that loads principal information, such as user and group mappings, into a principal store.
In this post, we demonstrate how to configure multiple data sources in Amazon Kendra to provide a central place to search across your document repository.
For our solution, we demonstrate how to index a OneDrive repository or folder using the Amazon Kendra connector for OneDrive. The solution consists of the following steps:
Create and configure an app on Microsoft Azure Portal and get the authentication credentials.
Create a OneDrive data source via the Amazon Kendra console.
Index the data in the OneDrive repository.
Run a sample query to get the information.
Filter the query by users or groups.
To try out the Amazon Kendra connector for OneDrive, you need the following:
A Microsoft Azure account with enough access permissions to set up an OAuth 2.0 data source.
An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Ensure that each document is unique within OneDrive and across any other data sources you plan to use with the same Amazon Kendra index. Document IDs are global to an index and must be unique per index.
Configure an Azure application and assign connection permissions
Before we set up the OneDrive data source, we need a few details about the OneDrive repository. Complete the following steps:
Log in to Azure.
After logging in with your account credentials, choose App registrations, then choose New registration.
Give an appropriate name to your application and register the application.
Collect the information about the client ID, tenant ID, and other details of the application.
To get a client secret, choose Add a certificate or secret under Client credentials.
Choose New client secret and provide the proper description and expiry.
Note the client-id, tenant-id, and secret-id values. We use these for authenticating the OAuth2 application.
Navigate to App, choose API permissions in the navigation pane, and choose Add a permission.
Choose Microsoft Graph.
Under Application permissions, enter File in the search bar and under Files, select Files.Read.All.
Choose Add permissions.
Similarly, add the following permissions on the Microsoft Graph option for the application you created:
On completion, the API permissions will look like the following screenshot.
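If you want to confirm that the client ID, tenant ID, and client secret work before handing them to Amazon Kendra, the following minimal sketch builds a standard OAuth 2.0 client-credentials token request against the Microsoft identity platform. It is an illustration only: the function names are hypothetical, and this post does not state exactly how the connector itself calls Azure.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" asks for the application permissions granted in the portal.
        "scope": "https://graph.microsoft.com/.default",
    }
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_access_token(request: urllib.request.Request) -> str:
    """Send the request and return the access token (makes a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling fetch_access_token(build_token_request(tenant_id, client_id, client_secret)) with your own values should return a token if the app registration and secret are valid.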
Configure the Amazon Kendra connector for OneDrive
To configure the Amazon Kendra connector, complete the following steps:
On the Amazon Kendra console, choose Create an Index.
For Index name, enter a name for the index (for example, my-onedrive-index).
Enter an optional description.
Choose Create a new role.
For Role name, enter an IAM role name.
Configure optional encryption settings and tags.
In the Configure user access control section, select Yes under Access control settings.
For Token type, choose JSON on the drop-down menu.
Leave the remaining values as their default values.
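For readers who prefer scripting, the console choices above map roughly onto the parameters of the CreateIndex API. The following sketch uses boto3; the token field names username and groups are assumptions that should match whatever you keep in the console defaults, and the helper names are hypothetical.

```python
def build_create_index_params(name: str, role_arn: str, description: str = "") -> dict:
    """Build CreateIndex parameters for a Developer edition index with JSON token access control."""
    params = {
        "Name": name,
        "Edition": "DEVELOPER_EDITION",
        "RoleArn": role_arn,
        # Filter results using a user token supplied at query time.
        "UserContextPolicy": "USER_TOKEN",
        "UserTokenConfigurations": [
            {
                "JsonTokenTypeConfiguration": {
                    # Assumed field names inside the JSON token.
                    "UserNameAttributeField": "username",
                    "GroupAttributeField": "groups",
                }
            }
        ],
    }
    if description:
        params["Description"] = description
    return params


def create_index(params: dict, region: str = "us-west-2") -> str:
    """Create the index and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    kendra = boto3.client("kendra", region_name=region)
    return kendra.create_index(**params)["Id"]
```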
Before we move to the next configuration step, we need to provide Amazon Kendra with a role that has the permissions necessary for connecting to the site. These include permission to get and decrypt the AWS Secrets Manager secret that contains the application ID and secret key necessary to connect to the OneDrive site.
Open another tab for the AWS account, and on the IAM console, navigate to the role that you created earlier (for example, AmazonKendra-us-west-2-onedrive).
Choose Add permissions and Create inline policy.
For Service, choose Kendra.
For Actions, choose Write and specify BatchPutDocument.
For Resources, choose All resources.
Choose Review policy.
For Name, enter a name (for example, BatchPutPolicy).
Choose Create policy.
Add this policy to the role you created.
Additionally, attach the SecretsManagerReadWrite AWS managed policy to the role.
Return to the Amazon Kendra tab.
Select Developer edition and choose Create.
This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
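The IAM steps above can also be sketched with boto3. The policy document below grants kendra:BatchPutDocument on all resources, matching the console choices; the function names are hypothetical, and you may want to scope Resource more tightly in a real deployment.

```python
import json


def build_batch_put_policy() -> dict:
    """Inline policy allowing Amazon Kendra BatchPutDocument on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "kendra:BatchPutDocument",
                "Resource": "*",
            }
        ],
    }


def attach_policies(role_name: str, policy_name: str = "BatchPutPolicy") -> None:
    """Add the inline policy and the SecretsManagerReadWrite managed policy to the role."""
    import boto3  # imported lazily; requires AWS credentials at call time

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_batch_put_policy()),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/SecretsManagerReadWrite",
    )
```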
Return to the Amazon Kendra console, choose Data sources in the navigation pane, and choose Add data source.
Under OneDrive connector V2.0, choose Add connector.
For Data source name, enter a name (for example, my-onedrive).
Enter an optional description.
For OneDrive Tenant ID, enter the tenant ID you gathered earlier.
For Configure VPC and security group, leave the default (No VPC).
Keep Identity crawler is on selected. This imports identity information into the index.
For IAM role, choose Create a new role.
Enter a role name, such as AmazonKendra-us-west-2-onedrive, then choose Next.
In the Authentication section, choose Create and add a secret.
Create a secret with clientId and clientSecret as keys.
Add their respective values with the information you collected earlier.
In the Configure sync settings section, add the OneDrive users whose documents you want to index.
Select the sync mode for the index. For this post, we select New, modified or deleted content sync.
Choose the frequency of indexing as Run on demand, then choose Next.
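The secret created in the Authentication section is a small JSON document with the keys clientId and clientSecret. As a rough sketch, assuming you want to create it outside the console, the same secret could be stored with boto3 as follows; the function names and the secret name you pass in are placeholders.

```python
import json


def build_secret_string(client_id: str, client_secret: str) -> str:
    """Serialize the OneDrive app credentials using the keys the connector expects."""
    return json.dumps({"clientId": client_id, "clientSecret": client_secret})


def create_onedrive_secret(name: str, client_id: str, client_secret: str,
                           region: str = "us-west-2") -> str:
    """Store the credentials in AWS Secrets Manager and return the secret ARN."""
    import boto3  # imported lazily; requires AWS credentials at call time

    secrets = boto3.client("secretsmanager", region_name=region)
    response = secrets.create_secret(
        Name=name,
        SecretString=build_secret_string(client_id, client_secret),
    )
    return response["ARN"]
```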
Field mappings allow you to set the searchability and relevance of fields. For example, the lastUpdatedAt field can be used to sort documents or boost their ranking based on how recently they were updated.
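As an illustration of what a mapped date field makes possible once the index is populated, the sketch below builds a Query request that sorts results by recency. It assumes lastUpdatedAt is mapped to the reserved index field _last_updated_at and that the field is sortable in your index; the function names are hypothetical.

```python
def build_recent_first_query(index_id: str, query_text: str) -> dict:
    """Build Query parameters that return the most recently updated documents first."""
    return {
        "IndexId": index_id,
        "QueryText": query_text,
        "SortingConfiguration": {
            # Assumed target of the lastUpdatedAt field mapping.
            "DocumentAttributeKey": "_last_updated_at",
            "SortOrder": "DESC",
        },
    }


def run_query(params: dict, region: str = "us-west-2") -> list:
    """Run the query and return the result items (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    kendra = boto3.client("kendra", region_name=region)
    return kendra.query(**params)["ResultItems"]
```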
Keep all the defaults in the Set field mappings section and choose Next.
On the review page, choose Add data source.
Choose Sync now.
The sync can take up to 30 minutes to complete.
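Because the sync can run for a while, it can be convenient to start it and poll its status from a script instead of watching the console. The following sketch uses the StartDataSourceSyncJob and ListDataSourceSyncJobs APIs through boto3; the helper names, polling interval, and attempt limit are arbitrary choices for illustration.

```python
import time

# Statuses that indicate the job has not finished yet.
IN_PROGRESS = {"SYNCING", "SYNCING_INDEXING", "STOPPING"}


def job_status(list_response: dict, execution_id: str):
    """Return the status of the given sync job from a ListDataSourceSyncJobs response, or None."""
    for job in list_response.get("History", []):
        if job.get("ExecutionId") == execution_id:
            return job.get("Status")
    return None


def sync_and_wait(index_id: str, data_source_id: str, region: str = "us-west-2",
                  poll_seconds: int = 60, max_polls: int = 60):
    """Start a sync job and poll until it leaves the in-progress states."""
    import boto3  # imported lazily; requires AWS credentials at call time

    kendra = boto3.client("kendra", region_name=region)
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]
    status = None
    for _ in range(max_polls):
        response = kendra.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id)
        status = job_status(response, execution_id)
        # A missing job (None) is treated as "not visible yet", so keep polling.
        if status is not None and status not in IN_PROGRESS:
            break
        time.sleep(poll_seconds)
    return status
```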
Test the solution
Now that you have indexed the content from OneDrive, you can test it by querying the index.
Go to your index on the Amazon Kendra console and choose Search indexed content in the navigation pane.
Enter a search term and press Enter.
Notice that without a token, the ACLs prevent a search result from being returned.
Expand Test query with an access token and choose Apply token.
Enter the appropriate token with a user who has permissions to read the file and choose Apply.
Search for information present in OneDrive again.
You can verify that Amazon Kendra presents the ranked results as expected.
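The same test can be approximated programmatically with the Query API by passing a user token in UserContext. In this sketch the token is a JSON string whose field names, username and groups, are assumptions that must match the JSON token configuration of your index; the function names are hypothetical.

```python
import json


def build_query_params(index_id: str, query_text: str, username=None, groups=None) -> dict:
    """Build Query parameters, optionally attaching a JSON user token for ACL filtering."""
    params = {"IndexId": index_id, "QueryText": query_text}
    if username:
        claims = {"username": username}
        if groups:
            claims["groups"] = list(groups)
        # The token carries the user identity that Amazon Kendra checks against the ACLs.
        params["UserContext"] = {"Token": json.dumps(claims)}
    return params


def search(params: dict, region: str = "us-west-2") -> list:
    """Run the query and return the result items (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    kendra = boto3.client("kendra", region_name=region)
    return kendra.query(**params)["ResultItems"]
```

Running the same query once without a username and once with a user who can read the file should reproduce the difference seen in the console.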
Congratulations, you have configured Amazon Kendra to index and search documents in OneDrive and to control access to them using ACLs.
With the Microsoft OneDrive V2 connector for Amazon Kendra, organizations can tap into commonly used enterprise document stores, securely using intelligent search powered by Amazon Kendra. You can enhance the search experience by integrating the data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion.
About the authors
Pravinchandra Varma is a Senior Customer Delivery Architect with the AWS Professional Services team and is passionate about applications of machine learning and artificial intelligence services.
Supratim Barat is a Software Developer Engineer with the AWS Kendra Yellowbadge Team and is a blockchain and cyber security enthusiast.