Orange Workflows

Text Preprocessing

Text mining requires careful preprocessing. Here’s a workflow that uses simple preprocessing for creating tokens from documents. First, it applies lowercase, then splits text into words, and finally, it removes frequent stopwords. Preprocessing is language specific, so change the language to the language of texts where required. Results of preprocessing can be observe in a Word Cloud.

Text Clustering

The workflow clusters Grimm’s tales corpus. We start by preprocessing the data and constructing the bag of words matrix. Then we compute cosine distances between documents and use Hierarchical Clustering, which displays the dendrogram. We observe how well the type of the tale corresponds to the cluster in the MDS.

Text Classification

We can use predictive models to classify documents by authorship, their type, sentiment and so on. In this workflow we classify documents by their Aarne-Thompshon-Uther index, that is the defining topic of the tale. We use two simple learners, Logistic Regression and Naive Bayes, both of which can be inspected in the Nomogram.

Twitter Data Analysis

Tweets are a valuable source of information, for social scientists, marketing managers, linguists, economists, and so on. In this workflow we retrieve data from Twitter, preprocess it, and uncover latent topics with topic modeling. We observe the topics in a Heat Map.

Story Arcs

In this workflow we explore story arcs in the Little Match Seller story. First we select the story from the corpus of Andersen tales. Then we create a table, where each sentence of the tale is a separate row. We use sentiment analysis to compute the sentiment of each sentence, then observe the emotional arcs through the story. We also inspect sentences with similar scores in the Heat Map and Corpus Viewer.

Load Text Corpus from the Server Repository

The workflow loads the corpus from the text repository on the server. The repository contains documents with raw text and associated YAML files with meta-features. We here use some pre-processing and then display the most frequent words in a word cloud. This workflow could work on your repository: just change the URL in the Import Documents widget.

Semantic Document Map

Document maps may reveal clusters of documents with semantically similar content. Here we show a workflow that loads the corpus, performs some text preprocessing and embeds the documents in the vector space using the fastText deep model. The t-SNE widget reveals the document map, where we can select a set of documents and then explore them in Corpus Viewer or characterize them in the display of the most frequent words in the Word Cloud.

Semantic Word Map

We can find clusters of semantically related words either by hierarchical clustering or t-SNE visualizations. Here, we show a workflow that loads the documents, extracts frequent words, embeds them in a vector space, and explores word clusters.

Semantic search

We can find relevant parts of a document by searching for exact words or parts of documents with similar meanings. The Semantic Viewer widget in this workflow searches for pertinent sentences of the documents by comparing the meaning of words from the Word List widget to the meaning of sentences in the text via SBERT text embeddings. The widget ranks documents by scores measuring their relevance, shows selected documents, and highlights the relevant parts.

Keyword Extraction from a Set of Text Documents

The Extract Keywords widget can characterize a set of textual documents. In this workflow, we load the documents from the server, preprocess them and embed them in the vector space, and display a semantic document map in the t-SNE widget. In this widget, we can select a set of similar documents and then characterize them through keyword extraction. Extract keywords support different inference techniques, including TF-IDF and deep network-based characterization.