This tutorial is based on the very thorough and prescient workshop conducted by Stefan Sinclair at DH2015 where he introduced the newly released Voyant 2.0.
- Installation and Setting Up
- A Quick Sample
- Exploring the Interface
- Importing: Bring Your Own Texts
- Exploring: Getting to Know the Tools
- Even More Tools
- Advanced: Search
- Exporting: URLs, Tools & Data
Installation, Variants and Setting Up
At the outset, it is important to note that there are two ways to use Voyant. The first relies on the hosted version provided at https://voyant-tools.org. This is always accessible through your own web browser and is generally the most up-to-date version.
The second takes advantage of the fact that the authors make an installable version available for your own machine, which ensures that your data remains local and that you have full control over the environment. This downloadable version of Voyant Tools is the one that we will be using today (and it has been pre-installed on your workstation). As the authors note, this means of using Voyant offers several potential advantages:
- You can keep your texts confidential as they will not be cached on our server.
- You can restart the server if it slows down or crashes.
- You can handle larger texts without the connection timing out.
- You can work offline (without an Internet connection).
- You can have participants in a group (like in this workshop) run their own instance without encountering load issues on our server.
What we have done for you to facilitate this workshop is:
If you would like to install Voyant on your own computer you can do so from the above link.
To launch this local Voyant server, all you need to do is double-click on VoyantServer.jar. Helpfully, there is an icon provided on your desktop.
Note: If you decide to run Voyant on your Mac, because of security restrictions on applications that aren’t signed and approved by Apple, you may need to Ctrl-click on the VoyantServer.jar file, select open from the menu, and then click open (not the default button) in the next
You can find more information about Running VoyantServer, including tips in case of problems. If you’re unable to run VoyantServer (because of a problem with your machine or because you’re using a tablet, or for any other reason), you should be able to follow along using the following URLs:
Note: We have installed Voyant Server on the lab workstations for this session. However, there are security restrictions interfering with at least one of the frameworks used. If you run into an issue when clicking on the ‘local’ link at any stage in this tutorial please try the beta link which accesses Voyant on the server referenced above.
A Quick Sample
To jump in with both feet and get a quick look at what Voyant is all about, click the local link below and hopefully you will see something similar to the word cloud below.
- What text do you think it is a cloud for?
- What features are metrical (based on measuring the text in some way)? How are the other features generated?
- What words are missing?
Exploring the Interface
Voyant Tools is much more than merely a word cloud generator and is
- Cirrus: a simple wordcloud that displays the highest frequency terms in the corpus (that aren’t in the stopword list)
- Reader: an infinite scrolling reader for the actual text in the corpus (this fetches the next part of the text as needed)
- Trends: a visualization of word frequency across the corpus or within each document (depending on the mode)
- Summary: a high-level summary of data from the corpus
- Contexts: a list of occurrences of a specified word (this is sometimes called a concordance or a keyword in context)
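To make the idea behind the Contexts tool more concrete, here is a minimal keyword-in-context sketch in Python. It is not Voyant's implementation: the function name keyword_in_context, the simple word tokenizer and the context parameter are all assumptions made for illustration.

```python
import re

def keyword_in_context(text, keyword, context=5):
    """Return (left, match, right) tuples for each occurrence of keyword.

    Illustrative only: tokenization here is a plain \\w+ split,
    which is an assumption and not how Voyant tokenizes text.
    """
    tokens = re.findall(r"\w+", text)
    target = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive comparison
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            rows.append((left, token, right))
    return rows

sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")

# Print one row per occurrence, with the keyword in the middle column.
for left, word, right in keyword_in_context(sample, "a", context=3):
    print(f"{left:>30} | {word} | {right}")
```

Changing the context argument plays roughly the same role as the context slider in the tool: it controls how many words appear on each side of the keyword.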
Explore the visible tools for the next five minutes and we will discuss them as a group (we’ll come back to the other tools later):
- what happens when you hover over the help icon? what if you click it?
- which tools trigger responses from which other tools?
- what scale is each tool (entire corpus, entire document, part of a document, etc.)?
- what is the visualization in the bottom of the Reader (middle-top) panel?
- try a simple search in the Reader panel
- what is relative frequency in the Trends tool?
- what are vocabulary density and distinctive words in the Summary tool?
- what does the plus icon do in the Contexts tool?
- what is the difference between context and expand in the Contexts tool?
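If you would like to check your answers about relative frequency and vocabulary density, the following Python sketch shows one common way of defining them. Treat it as an assumption-laden illustration: the function names are hypothetical, the tokenizer is deliberately naive, and Voyant may scale or round its figures differently.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def relative_frequency(term, tokens):
    """Occurrences of term divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)

def vocabulary_density(tokens):
    """Number of unique words (types) divided by total words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

tokens = tokenize("the cat sat on the mat and the dog sat too")
print("tokens:", len(tokens))
print("types:", len(set(tokens)))
print("relative frequency of 'the':", relative_frequency("the", tokens))
print("vocabulary density:", vocabulary_density(tokens))
```

A raw count tells you how often a word occurs. A relative frequency makes counts comparable between documents of different lengths.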
Bring Your Own Texts
One primary strength of Voyant Tools is that you can use an existing corpus (such as the Austen corpus we used above), or you can create your own corpus from the home page [local, beta]. There are three primary ways of creating a corpus:
- type or paste text into the large box (you can copy-and-paste text from a webpage or word processor, for instance) – in this case you’ll be creating a corpus with one document
- type or paste URLs into the large box, one URL per line – this will create a corpus with as many documents as you have URLs; Voyant will try to fetch the content from the specified locations (so they can’t be behind a password or restrictive firewall), and the URLs can point to documents in various supported formats (see below)
- click the upload button and select one or more files to upload – the files can be in a variety of formats, including plain text, HTML, XML, RTF, MSWord, and PDF, or a Zip (archive) file containing documents in one of the supported formats
- Try downloading the following text file VoyantSampleText and saving it on the desktop on your workstation. You can then choose to upload this to Voyant by selecting Upload in the Voyant start screen. Note you can also simply point Voyant at this URL and it will extract the zip file and load these texts for our workshop.
Note: It is possible to use XML very powerfully by clicking on the options icon (when hovering in the Add Texts header) and defining XPath expressions to documents, body content and metadata such as title and author.
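To give a feel for what XPath expressions for documents, body content and metadata might look like, here is a small Python sketch using the standard library's xml.etree.ElementTree module, which supports only a limited subset of XPath. The XML layout, the element names and the extract_documents helper are all hypothetical, and Voyant's own XPath handling may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented purely for this illustration.
XML_SOURCE = """
<library>
  <book>
    <meta><title>Emma</title><author>Jane Austen</author></meta>
    <body>Emma Woodhouse, handsome, clever, and rich</body>
  </book>
  <book>
    <meta><title>Persuasion</title><author>Jane Austen</author></meta>
    <body>Sir Walter Elliot, of Kellynch Hall</body>
  </book>
</library>
""".strip()

def extract_documents(xml_text, documents, title, author, content):
    """Split one XML file into documents using path expressions.

    documents selects the element that wraps each document; the other
    three expressions are evaluated relative to that element.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(documents):
        rows.append({
            "title": node.findtext(title, default=""),
            "author": node.findtext(author, default=""),
            "content": node.findtext(content, default=""),
        })
    return rows

docs = extract_documents(
    XML_SOURCE,
    documents=".//book",
    title="meta/title",
    author="meta/author",
    content="body",
)
for doc in docs:
    print(doc["title"], "by", doc["author"])
```

The point of the example is the division of labour: one expression identifies the documents, and the others pick out the title, author and text within each one.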
Note: When uploading files, you can now select multiple files at once by using the Ctrl and/or Shift keys.
Getting to Know the Tools
Each of the several tools in Voyant has its own particularities and peculiarities, but here are some general principles that apply to several tools.
Options. Many of the tools provide parameters that are directly visible (usually in the bottom part of the tool). The Contexts tool for instance (bottom right-hand corner of the default skin) has options for searching, for the context size (how many words to show on each side of the keyword in the table), and for expand size (how many words to show on each side of the keyword when you expand the occurrence by clicking on the plus icon in the first column of the row). In addition to these visible options, some tools have further options that can be accessed through the options icon in the top header. The Cirrus tool, for instance, has an option for modifying the stopword list.
Stopwords. The stopword list contains words that are very common in most texts and usually carry less meaning, such as determiners (“the”, “a”) and prepositions (“to”, “in”, “from”). One person’s stopword is another person’s treasure, and it may be worth looking at the list of words to see if there are ones you’d prefer to show or if there are words that you don’t want to show and that should be added to the stopword list. You can edit the list by clicking on the options icon (in Cirrus, for instance) and clicking the edit button. Note that you can apply the newly selected or edited list to the current tool only or globally to all tools that support stopwords (globally is the default).
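The effect of a stopword list on a frequency-based view such as Cirrus can be sketched in a few lines of Python. This is only an illustration: the tiny STOPWORDS set and the top_terms function are assumptions, not Voyant's actual list or code.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; Voyant's real lists are much longer.
STOPWORDS = {"the", "a", "to", "in", "from", "of", "and"}

def top_terms(text, stopwords=STOPWORDS, limit=5):
    """Return the most frequent terms that are not in the stopword list."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(limit)

text = "the cat and the dog and the cat"
print("with stopwords removed:", top_terms(text))
print("with no stopword list: ", top_terms(text, stopwords=set()))
```

Comparing the two lines of output shows why the choice of list matters: without it, the most frequent terms are dominated by function words.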
Table/Grid Headers. The column headers in table/grid views include functionality that may not be obvious. First, a help tip will appear when you hover over most column headers to briefly explain what that column is showing. Next, a down arrow will appear in the right part of the column header, and clicking on it will allow you to sort by that column (when possible) and to toggle the visibility of columns. Finally, if a column is sortable, you can also click on the header to toggle between ascending and descending order for sorting the table by that column.
Infinite Scrolling Tables/Grids. Tables can sometimes contain a huge number of logical items (for instance, tens of thousands of terms in a document), which would be impractical to load at once. In Voyant, items are loaded on demand as the user scrolls through the table – in most cases that should happen fairly seamlessly.
Corpus/Document Modes. Some of the tools can operate at variable scale, either showing data at the corpus level or at the individual document level – this can be a bit confusing if you’re not sure what you’re seeing. For instance, by default Cirrus shows top frequency terms for the entire corpus, but you can also generate a Cirrus from the terms of an individual document – one way to do this is to click on the Documents tab in the lower left-hand panel and click on one of the document rows. The Cirrus that appears will be for just one document, and if you want to revert to Corpus mode you can click on the “reset” button that appears in the lower right-hand corner of the Cirrus tool.
Resizing. The individual tool panels are resizable: the mouse pointer should change to a resize icon when you hover over the inner borders between tools, and you can drag the border to resize. Similarly, the columns in table/grid tools are resizable.
Exploring More Tools
In addition to the five tools that are displayed by default (Cirrus, Reader, Trends, Summary and Contexts), each of the five panels makes it easy to access additional tools, some of which we’ve mentioned already. Here are the other tools available from the tabs:
- Corpus Terms: displays frequency and distribution information for terms (types or unique words) in the corpus
- Links: displays a network graph of the collocates of keywords (the highest frequency terms that occur close to the specified search terms) – you can click on individual terms to fetch more terms and you can drag terms off the tool to remove them
- Collocates: similar to Links, but this presents collocates of search terms in a table form
- Documents: lists the documents in the corpus, including some metadata (where available, such as title and author), as well as counts of words/tokens, types and a ratio of types to tokens
- Phrases: lists the recurring phrases in the corpus (though any phrase must be repeated in a document before it is counted at the corpus level); this is a new tool in Voyant 2.0 and one of the most useful functions can be to see the longest repeating phrases (without having to specify a search query); note that there are different options for handling overlapping phrases
- Bubblelines: this is another representation of the distribution within each document in the corpus, it can be helpful for perceiving where different terms appear together (overlap)
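The notion of collocates used by the Links and Collocates tools (terms that occur close to a search term) can be illustrated with a simple window-based count. The sketch below is an assumption about the general technique, not a description of how Voyant computes collocates; the collocates function and its window parameter are hypothetical names.

```python
import re
from collections import Counter

def collocates(text, keyword, window=5, stopwords=frozenset()):
    """Count the words that occur within `window` tokens of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Collect the neighbours on both sides of this occurrence.
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        counts.update(t for t in neighbours
                      if t not in stopwords and t != target)
    return counts

text = "time enough to think and time enough to rest"
near_time = collocates(text, "time", window=2)
print(near_time.most_common())
```

A table view such as Collocates corresponds to listing these counts, while a network view such as Links draws each keyword and its collocates as connected nodes.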
All of these tools can be accessed through the tabs, but they can also be invoked from the tool switching menu (a windows-like icon) that appears when you hover over the header of any tool.
If you click on the tool switching icon a nested menu will appear. The first items will be a list of one or more tools that fit most naturally in that tool panel, but you can also navigate tools by scale (corpus or document) or by tool type (visualizations, tables/grids, other).
The skin header (the blue bar at the top) also has a tool switching menu which allows you to replace the entire page with one tool. This is also a convenient way to access the ScatterPlot tool, which provides a visualization of Correspondence Analysis or Principal Component Analysis (more complex analysis of how terms are shared between documents).
Advanced Search Functionality
Help with the search syntax is displayed when you hover over the question mark icon in a search box. The hovering tip box will disappear after a few seconds, and you can click on the question mark to have a dialog box appear until you dismiss it.
Search functionality is fairly consistent in all tools that support search. For experimentation, let’s work in the Corpus Terms tool (which is the second tab in the upper left-hand panel where the Cirrus wordcloud is displayed by default). These examples use the Austen corpus [local, workshop, beta].
- exact match: think this searches the exact word (though it’s case insensitive, there’s currently no way to perform a case-sensitive search)
- wildcard match: think* this matches the root of a word and includes variants as a single term (think, thinks, thinking, etc.); note that for now wildcards can’t be used at the beginning of words and produce inconsistent results when used in the middle of words
- expanding wildcard match: ^think* this is similar to the previous wildcard match but this time each variant is counted and displayed as a separate term (this can be useful for seeing what terms are actually included in a wildcard match)
- multiple matches: think*, ^think* you can search multiple terms (two or more) by separating them by commas – a simple search might be for exact matches think, thinking, but you can also use more complex searches like think*, ^think* to get the best of both worlds from wildcard matches (counting the total wildcard matches as one term and also seeing the individual matches).
- combined matches: think|thinking use a combined match to merge two or more search terms into one result – this might be useful for counting singular and plural forms of a word, but not all wildcard forms (time|times but not timely, etc.)
- phrase match: “time enough” this matches an exact phrase or sequence of words – note the use of quotes (if you exclude the quotes you’re essentially performing a combined match for time|enough, though that may change in the future)
- proximity match: “time enough”~10 this is essentially a NEAR match, where the terms in quotes (there can be more than two) must occur within a specified number of words (in this case within 10 words, but you can specify a different number for the proximity); note that words can appear in any order, so enough might occur before time; it’s not possible to expand the match with the ^ operator like with wildcard searches, but you can use the Contexts tool to see the actual occurrences that are being matched
- multiple matches: time*, time|times, “time enough”~10 it’s possible to mix and match the different syntaxes, as with this example that has a wildcard match, multiple matches, combined matches, and a proximity match
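To see how the different syntaxes affect what gets counted, here is a rough Python emulation of the exact, wildcard, expanding wildcard, combined and comma-separated forms described above. It is a sketch for building intuition only: run_query is a hypothetical function, phrase and proximity matches are left out, and the behaviour is inferred from the descriptions in this tutorial rather than from Voyant's code.

```python
from collections import Counter

def run_query(query, tokens):
    """Return {label: count} for a comma-separated query over tokens."""
    counts = Counter(t.lower() for t in tokens)  # case-insensitive
    results = {}
    for part in (p.strip().lower() for p in query.split(",")):
        if not part:
            continue
        expand = part.startswith("^")
        pattern = part.lstrip("^")
        if pattern.endswith("*"):
            # Trailing wildcard: match every term starting with the prefix.
            prefix = pattern[:-1]
            matches = {t: n for t, n in counts.items() if t.startswith(prefix)}
            if expand:
                results.update(matches)  # one row per variant
            else:
                results[part] = sum(matches.values())  # one merged row
        elif "|" in pattern:
            # Combined match: merge the alternatives into one row.
            results[part] = sum(counts[alt] for alt in pattern.split("|"))
        else:
            results[part] = counts[pattern]  # exact match
    return results

tokens = "think thinks thinking time times timely think".split()
for query in ["think", "think*", "^think*", "time|times", "think*, ^think*"]:
    print(f"{query!r:20} -> {run_query(query, tokens)}")
```

The last query illustrates the "best of both worlds" case: the merged wildcard total appears alongside the individual variants.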
Exporting URLs, Tools & Data
A distinguishing feature of Voyant Tools is its ability to generate URLs that can be bookmarked or shared and that point to a specific corpus with specific parameters.
The URL in the browser location bar will now update automatically after you create a corpus – you can bookmark or share this URL directly.
To export the URL from the current skin (combination of tools, not just one tool), click on the export icon from the top blue header bar.
This will cause a dialog box to appear with various export options, the first of which is a simple link that can be copied into the clipboard or clicked to open the URL in a new window.
The same basic process works for individual tool panels as well (if you just want to export or share, say, the Cirrus visualization), except that additional parameters are usually included with the tool panels (specific search terms that have been selected, for instance).
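If you ever need to produce such links in bulk, they can be assembled with a few lines of Python. The sketch below assumes only the corpus and view query parameters that appear in the exported snippet in this tutorial; the voyant_url helper is hypothetical, and the names of any further tool-specific parameters should be copied from a URL that Voyant itself has exported rather than guessed.

```python
from urllib.parse import urlencode

def voyant_url(base, corpus, view=None, **params):
    """Build a shareable URL for a corpus, optionally for a single tool.

    `params` is a placeholder for tool-specific parameters; their names
    are not defined here and must come from a real exported URL.
    """
    query = {"corpus": corpus}
    if view:
        query["view"] = view
    query.update(params)
    return f"{base.rstrip('/')}/?{urlencode(query)}"

# The whole default skin for the Austen corpus.
print(voyant_url("https://voyant-tools.org", "austen"))

# Just the Cirrus tool for the same corpus.
print(voyant_url("https://voyant-tools.org", "austen", view="Cirrus"))
```

Using urlencode rather than string concatenation means that spaces and punctuation in parameter values are escaped correctly.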
In addition to exporting a URL, you can also generate a bibliographic entry for Voyant Tools (if you wish to cite it) or export a live, dynamic tool panel. The exported tool works much like a YouTube clip that can be embedded into any website: it pulls interactive content from a remote site. For both of these options, expand the “Export View” menu (see the image above).
The HTML snippet for a live tool might look something like this:
<!-- Exported from Voyant Tools: http://voyant-tools.org/.
Feel free to change the height and width values below: -->
<iframe style='width: 100%; height: 400px' src='http://voyant-tools.org:80/?corpus=austen&view=Cirrus'></iframe>
Which should produce a live tool like this: