This tutorial will lead you through an exploration of OpenRefine and demonstrate how it can be used to interrogate and clean your dataset.
It freely acknowledges the wonderful introductory tutorial: Seth van Hooland, Ruben Verborgh, and Max De Wilde, “Cleaning Data with OpenRefine,” Programming Historian (05 August 2013), as well as the basic tutorials provide at OpenRefine.org and their curated tutorial list.
There is a more in-depth tutorial available at DataCarpentry which I would also recommend should you wish to sharpen your OpenRefine skills.
This tutorial uses a sample dataset available at http://digitalnomad.ie/ACIrishBrews2018.tsv
Note that this tutorial assumes that you have OpenRefine installed on your workstation. If you do not it can be found (along with installation instructions) at the OpenRefine Site.
So, let’s get started…
Tutorial 1 (approximately 45 minutes)
It is important to note before beginning this tutorial that OpenRefine is not necessarily a ‘normal’ application that you run on your workstation. It is a local web service that runs locally using your web browser to interact through.
Although the installer creates a clickable icon, when you do click on it, you may find that it doesn’t seem to present a window or any other user components to interact with it. Hmmmm.
As mentioned, it creates a web service that you access through your browser, via a specific address – in this case: http://127.0.0.1:3333.
If you open your web browser and direct it to this, we will see something a little more promising, and we are ready to go.
The first step is to provide data for OpenRefine to work with.
Create Project and select
Get data from
Web Addresses (URL).
2. Input the URL
3. Click the
Next>> button at the foot of the screen to load the data into OpenRefine.
4. OpenRefine gives you a preview – a chance to show you it understood the file. If, for example, your file was tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click
Update Preview (bottom left). If this is the wrong file, click
<<Start Over (upper left). There are also options to indicate whether the dataset has column headers included and whether OpenRefine should skip some rows before reading the data.
5. OpenRefine inspects the datafile and before finishing the import process (in this case there are 1372 records, and it is about less than 1Mb in size. From this screen, we can set some parameters. 6. In this case, change the ‘Quotation marks are used to enclose …” and also give the project a more meaningful name. Note: the ‘encoding’ tab lets you see in realtime what happens if you change it – see ‘funny’ characters – try this one – Make sure it is set to UTF-8.
7. Click the Create Project… button.
OpenRefine then imports the file – noting that it is stored locally on your own machine.
8. What we should now see is our dataset presented in all its glory.
A Quick Note on Experimenting
Before we go any further, it’s nice to know that OpenRefine is all about trying things out and playing with data. It’s a very forgiving environment.
Once you load your data into OpenRefine as we have above, we are no longer working with the original datafile. Anything we do is entirely within OpenRefine and non-destructive. Moreover, anything you do can be easily undone.
- In our example file, one of the columns contains long data and results in us seeing fewer records than we may want to or in a way similar Excel. Why not? Let’s just make that column disappear. Choose Edit column>Remove this column from the Naming Origin column.
- When the column is removed, suddenly all the rows become far more compact, and you can see more data.
- However, that column contained data valuable to our analysis, so we want to get it back. As mentioned, OpenRefine monitors everything we do and allows us to turn back time. Choose the Undo / Redo tab on the left panel.
- This panel shows a list of all the actions we have done. We will discuss more about this later, but for now, let’s step back to where we created the project. Click on ‘0. Create project’ and more that we are now back to the dataset as created. At any point during this exercise you are free to try other things and know that you can always return to the state of the data before any previous action – so feel free to experiment. So let’s take a look at this dataset.
Explore Your Data
Here we will use faceting to get to know our dataset. It contains a collection of information about craft beers brewed in Ireland over the past decade. So what sort of beer is being brewed? Who is brewing it and where is it being brewed?
- Note that we can know the
totalnumber of records being browsed.
- A ‘Facet’ groups all the like values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time.
Let’s take a look at the data in the
3. Click the down arrow and choose
4. In the left panel, you’ll now see a box containing every unique value in the
type column along with a number representing how many times that value occurs in the column.
Try sorting this facet by name and by count.
Facetting data allows us to see all the various unique values. If we know what we are specifically interested in we can filter the data for specific values.
5. On the brewName column, select the dropdown menu and choose Text Filter.
6. Like facetting, a new panel opens on the right allowing us to type in a search value. Try typing in ‘Oatmeal’ to see what beers have this in their name.
7. OpenRefine now displays only the records that match the filter parameter, and the facets apply to beers with oatmeal in their name.
Edit Your Data
To see all the records in your dataset, close the Facet and the Filter panels by clicking the small ‘x’ in the top left corners. The count should now indicate the full 1,372 records.
- Choose Facet/Text Facet on the Type column.
- Do you see anything wrong with this the beer types listed in the type panel? From a cursory examination, you might note that one of the Altbiers seems to have been misspelled. Small spelling mistakes in large datasets can become readily apparent even from this simple facetting.
3. Notice that when you hover over the misspelled Altbeir, an ‘edit’ button is revealed. If you click on it, you will get a window to correct the error. Choose edit and the fix the spelling. It should read Altbier.
4. Doing so will reveal that we have two beers of the Altbier type and both are now correctly spelled.
You could edit each cell individually in this fashion, but this could become very time consuming with a large number of records. Thankfully there is a better way.
Cluster editing in OpenRefine takes advantage of powerful data matching algorithms that suggest values that may need correction.
5. Suspecting that we may have more type errors in our dataset, select Edit cells/Cluster and edit.
6. The variety of algorithms yield different results and changing the Methods and Functions show the various suggestions. In our case, the first screen suggests that Imperial Red Ale and Red Imperial Ale may be the same thing. They are, and this is a useful suggestion so choose Merge Selected and Recluster.
7. Try changing the Keying Function to metaphone3. Although OpenRefine identifies possible clusters to merge, take a closer look and see if these clusters do represent misspelt or misconstructed values referring to the same thing.
8. In fact, these are different values that should remain different. Try changing Method to ‘nearest neighbour’ and Distance Function to ‘PPM.’ The suggestions may look relatively similar, but in fact, the third suggestion has detected that we have 1 American Amber Lager and 1 Amercian Amber Lager with a spelling error. All of the other differences in values should be different. Selecting the checkbox on the third value.
Process Your Data
OpenRefine gives you the ability to transform complete columns of data based on a variety of precoded routines to manipulate your data and allow it to be processed or viewed in different manners.
- One of the more common transformations is to change the ‘type’ of a column so the OpenRefine (or other programmes) can better know what the data represents. In the dataset, we are working with the data in the column introduced represents the date the particular beer was launched. When we loaded the dataset OpenRefine coded it as text data, but if we change the type to date as it truly is, it will be far more useful. From the dropdown menu for introduced, choose the Edit cells/Common transforms/To date.
2. When you successfully carry out this transformation, all the values in the transformed column will be shown in green to indicate the success of the operation. Any red values could not be successfully transformed.
Find Bad Dates
3. Now that we have a date column, we can facet it as dates and not as textual values. As we know the context, we can better appreciate the values. Choose Facet/Timeline Facet – note: if this were text we could not do this.
4. The timeline shows us the date on the x-axis and shows the number of records sharing that date on the y-axis. You may note that most of the dates cluster toward the right suggesting that we may want to explore the outliers to the left.
5. The timeline facet allows us to select records by dragging the reticule to encompass the records of interest. Select the outlier by focusing on it. You will note there is only one record to the left. It has no date in the introduction column. Beneath the timeline, you will see that you can indicate those that have a time value and those that are blank.
For this exercise and to see the value of the timeline for getting to know our data, we will remove the records that do not have a date of introduction. We have 1286 records with valid dates and 97 without dates.
6. Deselect the ‘Time’ checkbox. This will leave our found set containing only records without dates.
7. We can now choose to remove these records by selecting the dropdown on All and Edit rows/Remove all matching rows.
Script Data Process
When working with datasets, it is often the case that similar sets of routines need to be carried out on multiple datasets. OpenRefine allows us to save the actions we carry out on one dataset so they can be applied to additional sets.
- Choose Undo/Redo the top of the left column. When you do this, you will see a list of all the actions we have carried out on the dataset since we began.
You can undo any activity that you have undertaken simply by clicking the state of the data you want to return to. More specifically you can export your activities as a script to apply to other datasets.
2. Choose Extract … and you will be presented with a JSON file that details the operations you have carried out on the current dataset. Note that you can exclude specific actions an retain only the ones you want. As you become more proficient and your data demands it you can edit the raw JSON to change the way your data is modified.
You copy and paste this data into a text editor to save or edit it.
Probably most typically, you will want to export the data that you have manipulated in OpenRefine for use in other applications.
- On the top right of the screen select the Export drop-down menu and note you can choose from a variety of different formats. Originally we imported our data as a tab separated values format.
- Choose Excel (.xls) from the dropdown and save the file to your desktop.
3. If you now go to your desktop, you can double-click on the file, and it will be opened within Excel so you can see the results of all your edits.
You may note that there is another option at the top of the Export menu in OpenRefine. It allows you to save the full contents of your session, the data along with all the operations that you have undertaken so that you can return to it at any time or share it with someone else.
4. Choose Export project from the menu and save it to the desktop.
5. Now return to the dashboard screen of OpenRefine. Note that you can Import Project. Click the Choose file and aim it at the file you saved to your desktop.
6. Doing so will reload your entire project, and you will be right back to where you were working with the same data and still having all the various past actions available to you.
This concludes this first exercise. For further tutorials, you may want to consult: