Visualize an Amazon Comprehend analysis with a word cloud in Amazon QuickSight


Searching for insights in a repository of free-form text documents can be like finding a needle in a haystack. A traditional approach might be to use word counting or other basic analysis to parse documents, but with the power of Amazon AI and machine learning (ML) tools, we can gain a deeper understanding of the content.

Amazon Comprehend is a fully managed service that uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend develops insights by recognizing the entities, key phrases, sentiment, themes, and custom elements in a document. Amazon Comprehend can create new insights based on understanding the document structure and entity relationships. For example, with Amazon Comprehend, you can scan an entire document repository for key phrases.

Amazon Comprehend lets non-ML experts easily do tasks that normally take hours. It eliminates much of the time needed to clean, build, and train your own model. For building deeper custom models in NLP or any other domain, Amazon SageMaker enables you to build, train, and deploy models in a more conventional ML workflow.

In this post, we use Amazon Comprehend and other AWS services to analyze and extract new insights from a repository of documents. Then, we use Amazon QuickSight to generate a simple yet powerful word cloud visual to easily spot themes or trends.

Overview of solution

The following diagram illustrates the solution architecture.

[Figure: solution architecture diagram]

To begin, we gather the data to be analyzed and load it into an Amazon Simple Storage Service (Amazon S3) bucket in an AWS account. In this example, we use text-formatted files. Amazon Comprehend then analyzes the data and creates JSON-formatted output, which AWS Glue transforms into a database format. We verify the data and extract specifically formatted tables using Amazon Athena, then build a word cloud in a QuickSight analysis. For more information about visualizations, refer to Visualizing data in Amazon QuickSight.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Upload data to an S3 bucket

Upload your data to an S3 bucket. For this post, we use UTF-8 formatted text of the US Constitution as the input file. Then you’re ready to analyze the data and create visualizations.

Analyze data using Amazon Comprehend

Amazon Comprehend can process many types of text-based and image information. In addition to text files, Amazon Comprehend one-step classification and entity recognition can accept image files, PDF files, and Microsoft Word files as input; these are not discussed in this post.

To analyze your data, complete the following steps:

  1. On the Amazon Comprehend console, choose Analysis jobs in the navigation pane.
  2. Choose Create analysis job.
  3. Enter a name for your job.
  4. For Analysis type, choose Key phrases.
  5. For Language, choose English.
  6. For Input data location, specify the folder you created as a prerequisite.
  7. For Output data location, specify the folder you created as a prerequisite.
  8. Choose Create an IAM role.
  9. Enter a suffix for the role name.
  10. Choose Create job.

The job will run and the status will be displayed on the Analysis jobs page.
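If you prefer to script the console steps above, the following is a minimal sketch using the AWS SDK for Python (Boto3). It assumes a Boto3 Comprehend client and an existing IAM role that Amazon Comprehend can assume (the console creates one for you; here you must supply its ARN). The function names, bucket paths, and role ARN are hypothetical and only illustrate the shape of the request.

```python
from typing import Any, Dict


def build_key_phrases_job_request(
    job_name: str,
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    language_code: str = "en",
) -> Dict[str, Any]:
    """Assemble the keyword arguments for a key phrases detection job."""
    return {
        "JobName": job_name,
        "LanguageCode": language_code,
        # Role that lets Amazon Comprehend read the input and write the output.
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Treat each file as one document ("ONE_DOC_PER_LINE" is the alternative).
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def start_key_phrases_job(comprehend_client: Any, **request: Any) -> str:
    """Submit the job through a Boto3 Comprehend client and return the job ID."""
    response = comprehend_client.start_key_phrases_detection_job(**request)
    return response["JobId"]
```

To submit a job, you could create the client with `boto3.client("comprehend")`, build the request with your own bucket paths and role ARN, and pass both to `start_key_phrases_job`. Keeping the request builder separate from the call makes the parameters easy to inspect before anything is sent to AWS.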


Wait for the analysis job to complete. Amazon Comprehend will create a file and place it in the output data folder you provided. The file is in .gz or GZIP format.

This file needs to be downloaded and converted to a non-compressed format. You can download an object from the data folder or S3 bucket using the Amazon S3 console.

  1. On the Amazon S3 console, select the object and choose Download. If you want to download the object to a specific folder, choose Download on the Actions menu.
  2. After you download the file to your local computer, open the zipped file and save it as an uncompressed file.
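The decompression step can also be done with a short script instead of a desktop archive tool. The sketch below assumes the downloaded file is a gzip-compressed tar archive (for example, a file named output.tar.gz); if your file is a plain gzip file, Python's gzip module would be the better fit. The function name and paths are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path
from typing import List


def extract_comprehend_output(archive_path: str, dest_dir: str) -> List[Path]:
    """Extract every regular file from a .tar.gz archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted: List[Path] = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            source = archive.extractfile(member)
            if source is None:
                continue
            # Keep only the base name so archive entries cannot escape dest_dir.
            target = dest / Path(member.name).name
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return extracted
```

For example, `extract_comprehend_output("output.tar.gz", "uncompressed")` would return the paths of the extracted files, which you can then upload to the output folder.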

The uncompressed file must be uploaded to the output folder before the AWS Glue crawler can process it. For this example, we upload the uncompressed file into the same output folder that we use in later steps.

  1. On the Amazon S3 console, navigate to your S3 bucket and choose Upload.
  2. Choose Add files.
  3. Choose the uncompressed files from your local computer.
  4. Choose Upload.

After you upload the file, delete the original zipped file.

  1. On the Amazon S3 console, open the bucket, select the original zipped file, and choose Delete.
  2. Confirm the file name to permanently delete the file by entering the file name in the text box.
  3. Choose Delete objects.

This will leave one file remaining in the output folder: the uncompressed file.

Convert JSON data to table format using AWS Glue

In this step, you prepare the Amazon Comprehend output to be used as input into Athena. The Amazon Comprehend output is in JSON format. You can use AWS Glue to convert JSON into a database structure to ultimately be read by QuickSight.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Enter a name for your crawler.
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, select Not yet.
  6. Add a data source.
  7. For S3 path, enter the location of the Amazon Comprehend output data folder.

Be sure to add the trailing / to the path name. AWS Glue will search the folder path for all files.

  1. Choose Crawl all sub-folders.
  2. Choose Add an S3 data source.


  1. Create a new AWS Identity and Access Management (IAM) role for the crawler.
  2. Enter a name for the IAM role.
  3. Choose Update chosen IAM role to be sure the new role is assigned to the crawler.
  4. Choose Next to enter the output (database) information.
  5. Choose Add database.
  6. Enter a database name.
  7. Choose Next.
  8. Choose Create crawler.
  9. Choose Run crawler to run the crawler.

You can monitor the crawler status on the AWS Glue console.
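The crawler setup can be sketched in code as well. The example below assumes a Boto3 AWS Glue client, an existing IAM role for the crawler, and a database that has already been created; the function names and all argument values are hypothetical. It also normalizes the S3 path so that it always ends with the trailing / mentioned earlier.

```python
from typing import Any, Dict


def normalize_s3_folder(path: str) -> str:
    """Return the S3 folder path with exactly one trailing slash."""
    return path.rstrip("/") + "/"


def build_crawler_request(
    name: str, role: str, database_name: str, s3_path: str
) -> Dict[str, Any]:
    """Assemble the keyword arguments for creating an S3 crawler."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": normalize_s3_folder(s3_path)}]},
    }


def create_and_run_crawler(glue_client: Any, **request: Any) -> None:
    """Create the crawler through a Boto3 Glue client, then start it."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
```

With a client from `boto3.client("glue")`, you would build the request from your own crawler name, role, database, and output folder, then pass it to `create_and_run_crawler`.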

Use Athena to prepare tables for QuickSight

Athena will extract data from the database tables the AWS Glue crawler created to provide a format that QuickSight will use to create the word cloud.

  1. On the Athena console, choose Query editor in the navigation pane.
  2. For Data source, choose AwsDataCatalog.
  3. For Database, choose the database the crawler created.


To create a table compatible with QuickSight, the data must be unnested from the arrays.

  1. The first step is to create a temporary table with the relevant Amazon Comprehend data:
CREATE TABLE temp AS
SELECT keyphrases, nested
FROM output
CROSS JOIN UNNEST(output.keyphrases) AS t (nested)

  1. The following statement keeps only phrases of at least three words with a confidence score above 0.9, and groups them by frequency:
CREATE TABLE tableforquicksight AS
SELECT COUNT(*) AS count, nested.text
FROM temp
-- Word count = number of spaces + 1; keep phrases with more than two words
WHERE nested.Score > .9 AND
 length(nested.text) - length(replace(nested.text, ' ', '')) + 1 > 2
GROUP BY nested.text
ORDER BY count DESC
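To check what the two queries should return, you can reproduce the same logic locally on the uncompressed output file. The sketch below is not part of the Athena workflow; it assumes each line of the file is a JSON object with a KeyPhrases array whose items have Text and Score fields, and the function name and defaults are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def count_key_phrases(
    json_lines: Iterable[str], min_score: float = 0.9, min_words: int = 3
) -> List[Tuple[str, int]]:
    """Count confident multi-word key phrases, most frequent first."""
    counts: Counter = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        document = json.loads(line)
        # Equivalent of CROSS JOIN UNNEST: visit each phrase in the array.
        for phrase in document.get("KeyPhrases", []):
            text = phrase["Text"]
            # Same word count as the SQL: number of spaces plus one.
            word_count = text.count(" ") + 1
            if phrase["Score"] > min_score and word_count >= min_words:
                counts[text] += 1
    return counts.most_common()
```

For example, passing an open file handle for the uncompressed output to `count_key_phrases` yields (phrase, count) pairs that should match the text and count columns of tableforquicksight.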

Use QuickSight to visualize output

Finally, you can create the visual output from the analysis.

  1. On the QuickSight console, choose New analysis.
  2. Choose New dataset.
  3. For Create a dataset, choose From new data sources.
  4. Choose Athena as the data source.
  5. Enter a name for the data source and choose Create data source.


  1. Choose Visualize.


Make sure QuickSight has access to the S3 buckets where the Athena tables are stored.

  1. On the QuickSight console, choose the user profile icon and choose Manage QuickSight.


  1. Choose Security & permissions.
  1. Look for the section QuickSight access to AWS services.

By configuring access to AWS services, QuickSight can access the data in those services. Access by users and groups can be controlled through the options.

  1. Verify Amazon S3 is granted access.

Now you can create the word cloud.

  1. Choose the word cloud under Visual types.
  2. Drag text to Group by and count to Size.

Choose the options menu (three dots) in the visualization to access the edit options. For example, you might want to hide the term “other” from the display. You can also edit items such as the title and subtitle for your visual. To download the word cloud as a PDF, choose Download on the QuickSight toolbar.

Clean up

To avoid incurring ongoing charges, delete any unused data, processes, and resources you provisioned, using each service's console.

Conclusion

Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. You can use Amazon Comprehend to create new products based on understanding the structure of documents. For example, with Amazon Comprehend, you can scan an entire document repository for key phrases.

This post described the steps to build a word cloud in QuickSight that visualizes a text content analysis from Amazon Comprehend, using AWS Glue and Athena to prepare the data.

Let’s stay in touch via the comments section!


About the authors

Kris Gedman is the US East sales leader for Retail & CPG at Amazon Web Services. When not working, he enjoys spending time with his friends and family, especially summers on Cape Cod. Kris is a temporarily retired Ninja Warrior but he loves watching and coaching his two sons for now.

Clark Lefavour is a Solutions Architect leader at Amazon Web Services, supporting enterprise customers in the East region. Clark is based in New England and enjoys spending time architecting recipes in the kitchen.


