Experimenting with ML-powered search in Amazon Kendra
Apr 22, 2021
Alexander Junge
2 minute read

Amazon Kendra is a managed search service offered by AWS. Using machine learning/natural language processing, Kendra is able to “understand” both search queries and the documents searched to answer questions directly or to perform a keyword-based search.

Kendra is fully managed by AWS, which means that, as a developer, I do not need to worry about managing infrastructure. Open source alternatives such as Jina for neural search or Haystack for question answering, by contrast, require me to run that infrastructure myself.

In this short post, I want to give Kendra a spin and search a small dataset of podcast episode show notes and episode transcripts. Note that Kendra is really meant to power search in larger enterprises, which is reflected in its pricing model. However, there is a 30-day free trial that lets developers explore the tool and see what it is capable of, and that is what I am using here.

Data

I downloaded the show notes and transcripts for 20 episodes of Python-focused podcasts (Talk Python to Me and Python Bytes) and stored them as flat text files in an S3 bucket.
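
As a rough illustration of this step, the following sketch uploads a folder of text files to S3 with boto3. It is not the script used for this post: the folder name, bucket name and the `podcasts/` prefix are placeholders, and the S3 client is passed in so the function can be tried without AWS credentials.

```python
from pathlib import Path


def upload_text_files(s3_client, local_dir, bucket, prefix="podcasts/"):
    """Upload every .txt file in local_dir to the bucket and return the S3 keys."""
    uploaded_keys = []
    for path in sorted(Path(local_dir).glob("*.txt")):
        key = f"{prefix}{path.name}"
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(str(path), bucket, key)
        uploaded_keys.append(key)
    return uploaded_keys


# Usage (needs boto3 and AWS credentials; names below are placeholders):
#   import boto3
#   upload_text_files(boto3.client("s3"), "transcripts", "my-kendra-demo-bucket")
```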

Deployment

CloudFormation templates deploying Kendra, setting up an S3 bucket as a Kendra data source and defining the necessary AWS roles are available on GitHub.
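
For readers who prefer to see the moving parts in code, here is a minimal boto3 sketch of the data source step only. It is an alternative to the CloudFormation templates and not what they contain. It assumes a Kendra index and an IAM role with read access to the bucket already exist; the index ID, role ARN and data source name are placeholders.

```python
def add_s3_data_source(kendra_client, index_id, bucket, role_arn,
                       name="podcast-transcripts"):
    """Register an S3 bucket as a Kendra data source and start a first sync."""
    response = kendra_client.create_data_source(
        Name=name,
        IndexId=index_id,
        Type="S3",
        Configuration={"S3Configuration": {"BucketName": bucket}},
        RoleArn=role_arn,
    )
    data_source_id = response["Id"]
    # Documents only become searchable after a sync job has crawled the bucket.
    kendra_client.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    return data_source_id


# Usage (needs boto3 and AWS credentials; all identifiers are placeholders):
#   import boto3
#   add_s3_data_source(boto3.client("kendra"), "<index-id>",
#                      "my-kendra-demo-bucket", "<role-arn>")
```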

Results

Using Kendra’s default search interface, I tested both question- and keyword-based searches.

Overall, the results look useful and Kendra is able to surface relevant information. A logical next step would be to show this demo to potential users and stakeholders, define a search scope of interest together with them, and implement a larger proof-of-concept.
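
The same kind of search can also be run outside the console. The sketch below sends a query through the boto3 Kendra client and reduces each result to its type, title and excerpt. The index ID and the example question are placeholders, and the helper name `search` is my own; it was not used to produce the results above.

```python
def search(kendra_client, index_id, query_text, max_results=5):
    """Query a Kendra index and return a compact summary of the top results."""
    response = kendra_client.query(IndexId=index_id, QueryText=query_text)
    hits = []
    for item in response.get("ResultItems", [])[:max_results]:
        hits.append({
            # Type is e.g. "ANSWER" for a direct answer or "DOCUMENT" for a document match.
            "type": item.get("Type"),
            "title": item.get("DocumentTitle", {}).get("Text", ""),
            "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
        })
    return hits


# Usage (needs boto3 and AWS credentials; the index ID is a placeholder):
#   import boto3
#   for hit in search(boto3.client("kendra"), "<index-id>", "What is a context manager?"):
#       print(hit["type"], hit["title"], hit["excerpt"])
```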

Conclusion

Here is what I think about Amazon Kendra based on this small trial:

  • as a managed service, Kendra is able to deliver meaningful search results quickly - this is ideal for building a rapid prototype
  • loading data from S3 is easy and plenty of other connectors are available to make documents searchable
  • it supports file formats such as flat text files, PDFs, Word documents and PowerPoint presentations
  • document metadata and API access allow customization, for example faceted search or access restrictions for specific users/groups
  • expensive for small user groups with few queries - Kendra is clearly made to serve larger enterprises
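
To make the point about metadata and API access more concrete, here is a small sketch of how a faceted, filtered query could be assembled for the boto3 `query` call. I did not do this in the trial. The attribute name `podcast` is hypothetical: it would first have to be defined as a custom, facetable attribute on the index and supplied as document metadata.

```python
def build_filtered_query(index_id, query_text, attribute_key, attribute_value):
    """Build keyword arguments for a Kendra query restricted to one attribute value."""
    return {
        "IndexId": index_id,
        "QueryText": query_text,
        # Only return documents whose attribute equals the given string value.
        "AttributeFilter": {
            "EqualsTo": {
                "Key": attribute_key,
                "Value": {"StringValue": attribute_value},
            }
        },
        # Also ask Kendra for result counts per value of the same attribute.
        "Facets": [{"DocumentAttributeKey": attribute_key}],
    }


# Usage (needs boto3 and AWS credentials; index ID and attribute are placeholders):
#   import boto3
#   params = build_filtered_query("<index-id>", "type hints", "podcast", "Talk Python to Me")
#   response = boto3.client("kendra").query(**params)
```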

