In this new series, we interview the Fellows of OpenForum Academy about their current work and projects, and how their research relates to the wider ‘openness agenda’. The first interviewee is Peter Murray-Rust, a chemist working at the University of Cambridge and interested in Open Access and Open Data.
Hello Peter and welcome to this first installment of ‘Meet the Fellows’. Can you start by briefly presenting yourself for people who don’t know you and talk a little bit about what your current research interests are?
Hi there, I’m Peter Murray-Rust. I’m a scientist who for the last 20 years has worked with information in chemistry, and now I’ve moved on to other sciences as well. I believe that information, publicly available to anybody, is at the core of scientific research. It’s not just what you do in your laboratory, it’s also what other people have done and how that can be integrated into a better view of science, technology and medicine. What I have found over 20 years is that it’s often incredibly difficult to find scientific information that ought to be available. Sometimes this is due to laziness, sometimes it’s due to lack of funding, but often it’s because people want to keep this information to themselves, and aren’t prepared to release it – and that’s true of both academics and companies. So my current goal is to liberate scientific information, on a massive scale, using technology. I’ve been very fortunate that for the last year I’ve been funded by the Shuttleworth Foundation, and as a Fellow of that organisation, I and my team are building software which is going to automatically read published science from journals and other sources and turn it into semantic form so that machines can understand it, and then make it available to everybody.
How would you say that this area – open access, if I may call it that – relates to other areas of openness? This is a topic that we’re interested in at OpenForum Academy: to see how different Fellows, each in their respective fields, have a common perspective on their work.
“‘Open’ isn’t just a passive thing, it’s not just about consuming information, it’s about how you carry out the operation.”
That’s one of the really exciting things. Very shortly after I came to Cambridge, 13 years ago, I met up with Rufus Pollock, who founded the Open Knowledge Foundation, now known as just Open Knowledge, which explores the question of what is open, why is open valuable, and how can we use open to change the world. Open Knowledge has covered a wide range of things like copyright, the value of the public domain, through to Open Government, how can we make government processes transparent, open medicine, open clinical trials, and my own area of open science, which is how can we make science available to everybody. Open isn’t just the permission to read something, it is the permission for everybody to use and re-use, and increasingly that means machines, so I’m looking at how machines can read scientific information, digest it, and then help humans make best use of it. But further upstream is the question of how the process is governed, how the materials are created. So ‘open’ isn’t just a passive thing, it’s not just about consuming information, it’s about how you carry out the operation, whether it be scientific research, government, medicine, so that we know what’s going on, we have transparency, we can measure that transparency, we have an input when things go right or wrong, and we can get the information as soon as it’s made available (which is often at a much earlier stage in the process than simply publishing the whole lot at the end in some dense set of documents).
You talked about liberating scientific information using software. This is often referred to as text and data mining, but I think you prefer to call it ‘content mining’. Can you expand on that distinction and on how, for example, mining a text database might differ from mining other types of information?
“People read PDFs, which is a format reasonably developed for sighted human beings, but is no use for blind people or for machines.”
Historically, science and much other communication has been through print and science is still communicated through objects that look like printed pages. Let’s call it ‘e-paper’, rather than fully semantic information. People read PDFs, which is a format reasonably developed for sighted human beings, but is no use for blind people or for machines. So we’ve developed technology which can read everything that is published, not just the PDFs as we’re also very interested in images and other types of data. There’s been too much concentration on the paper, textual representation, which is where the phrase ‘text and data mining’ comes from, and not on images and audio and things like that. I should say that I work on images, I don’t work on audio which is much harder, but I campaign for the rights for other people to extract information from it.
Now the problem with this is that technically this is covered by copyright. Copyright is one of the most complex and, I think, broken systems that we have at the moment. There’s often a special role given to images as being creative works, and in some cases they have more protection than text. I believe that scientific images are often the only way of communicating something, so if you photograph a cell, that is primary scientific information, and I would call that data rather than a creative work. Similarly, there are expressions within text that can only be understood in conjunction with the diagrams in the document, and the diagrams are also fundamental science. So we’ve developed the term ‘content mining’ rather than simply text and data mining because we wanted to make clear that there’s no line to be drawn between scientific text and scientific diagrams and images, and that all of this is legitimately minable.
“I believe that scientific images are often the only way of communicating something, so if you photograph a cell, that is primary scientific information, and I would call that data rather than a creative work.”
You mentioned copyright – are there any legal impediments preventing the take-up of content mining, and if so, what would be your policy recommendations?
The answer is yes. Copyright has been described by MEP Julia Reda as “incredibly complex” and she is right. It is something which has evolved over the years and, in many people’s view, is no longer fit for purpose; it actually seriously stands in the way of using modern technology for creating and disseminating knowledge. The problem is that most operations dealing with electronic information involve copying at some stage, either downloading from a site on the Internet or transmitting it to other people in your community. Technically, if that information is protectable by copyright, you have to ask the question: “Am I violating copyright in carrying out this act of copying?”
Let’s say I have a journal article, an article I have the right to read because it’s in the public domain, it’s covered by an open licence, or, most problematically, it is something that I subscribed to. Most scientific information is only available in published form as journal articles through subscriptions. I can read it because I am at Cambridge University, which has a large number of subscriptions. But even in downloading it from the publisher’s website I am making a copy – that is allowed by the contract with the publishers because otherwise I couldn’t download it. But if I am now going to mine it, that involves using that copy, and the publishers have, I think, very narrow-mindedly and counter-productively decided that mining this information requires their permission. There is nothing in the law that backs this up, but the law does support copyright in copying materials, so there is a grey area here: I can copy it and read it, but the publisher says that I can’t copy it and run my software over it. I have produced the mantra: “The right to read is the right to mine”; in other words, if you have the right to read something you have the right to mine it. A lot of people support this idea. Most recently it has been supported by LIBER, the association of European research libraries, and we have come out with a declaration from a meeting in The Hague just before Christmas where, among other things, we have asserted that the right to read is the right to mine (Editor’s note: LIBER has produced a useful factsheet on the importance of mining, available here). Unfortunately, a large number of conventional publishers have opposed this, saying that you require additional permission and in some cases additional funding to be able to mine this content.
“A number of actions in copying are now legal in the UK without the permission of the copyright owner, including copying for data analytics interpreted as text and data mining.”
In the UK, the government has taken very proactive action and has created an exemption to copyright which was proposed by Professor Hargreaves – let’s call it the Hargreaves exemption – and in June 2014 it came into law in the UK as an additional statutory instrument. What it says is that a number of actions in copying are now legal in the UK without the permission of the copyright owner, and they include things like copying for archives or by libraries, copying for format shifting, copying for parody, and, in the particular case of my community, copying for data analytics interpreted as text and data mining. We are allowed to do this for research purposes and for non-commercial use. This has been passed in UK law; it hasn’t been tested in court, but I believe that this legitimises what I do and we are going to go ahead and do it. A number of European governments and organisations are pushing for the European Parliament to legislate in a similar manner, and this is what Julia Reda proposed earlier this year in a very balanced and valuable paper on copyright and the European Parliament, where she proposed a number of actions which are similar to those of the UK government. Some of them go beyond and one or two don’t go far enough, but they are basically very similar policy proposals. There is now fairly intense lobbying in Brussels from both sides on this issue, and she said she has had a lot of pressure from rights holders who are lobbying against this reform.
You mentioned the Hargreaves exemptions, and of course, as you said, this is also being discussed at the EU level. Some of the policy proposals would go for a wide exemption for text and data mining – or content mining – while others are probably more aligned with the current UK framework and would restrict exemptions exclusively to research and non-commercial purposes. In other areas, the non-commercial clause especially has generated much discussion and some concerns. I was wondering whether you think that this is also applicable to your domain, and whether there are any issues with distinguishing commercial from non-commercial uses and pure research from other purposes?
“Clearly, if the courts themselves have to debate about if something is of commercial use or not, it is going to be very difficult for ordinary people to do it.”
That is a very valuable question. My understanding is that the reason that the UK government chose the non-commercial/research exemption was that they could then enforce this change through a statutory instrument which didn’t need a full act of Parliament, whereas if they were to get a complete exemption for commercial use as well it would have to be debated in the House of Commons and then in the House of Lords. The current legislation went through with ratification from just the House of Lords. I think it was a pragmatic political action, but I clearly don’t think it goes far enough. The first point is that non-commercial can be very restricting – for instance, there was a recent case in Germany about whether teaching is a commercial use or not. Early last year, a German court found that the use of material for teaching was commercial, but later that same year a higher court overturned that decision. Clearly, if the courts themselves have to debate about if something is of commercial use or not, it is going to be very difficult for ordinary people to do it. The uncertainty is extremely restrictive: different people will take different views, and the more uncertainty there is here, the less progress there will be, because people are afraid. The second thing is that the UK government has given us a paradox, in that the reason for creating this copyright exemption was to generate greater digital wealth in the country and to promote industries which were going to be wealth-producing. Now if you only allow this for non-commercial purposes, it’s very difficult to create an industry which is going to be able to rely on that and not end up in court at the first sign of any conflict with other rights holders.
I am taking the view – and I say this very carefully – that I am doing research, personal research, and as a good scientist I have to read the whole literature to find out what might be pertinent, and I also have to publish all the facts that I extract because only then can I give a complete scientific record of what I’ve done. I plan to do research and publish the facts as the correct, ethical procedure. Lots of people will challenge that and I don’t know when or where these challenges are going to come from, but large commercial publishers have been lobbying against it with a variety of FUD, saying that nobody wants to do it, that there’s no call for it and that as soon as people start doing it, it will overload their servers. This is all rubbish. It’s not going to overload their servers: studies by PLOS (the Open Access publisher) have shown that something like ten to the minus seven of the daily load will come from text and data mining. So it’s a complete red herring, but it’s a sort of deliberate misinformation that is being put out there.
Are there any additional points that we haven’t covered yet and you would like to discuss? Now is also your chance to recommend any papers or online resources that you think our readers might be interested to take a look at.
“Publishers have started to make APIs available where you get permission to access this in a way that they tell you is the best way to do it.”
I don’t think so, other than to say that it is very important that people engage on this, because if we do not assert our rights, they are likely to disappear by default and they will be increasingly hard to recover. There are technical control measures here, so the publishers often cut people off for downloading too much, even if it is otherwise allowed in the contract. So the publishers can control what is downloaded, and we have to fight against this. I’d also say that publishers have started to make APIs available where you get permission to access this in a way that they tell you is the best way to do it. First of all, I dispute that, and I’ve shown why. But it also means that they can monitor every use of this, so that they end up with a very large amount of metadata as to who mines what and for what purpose. They are potentially violating privacy, and they can also use the API to control what you have access to, what types of information, how comprehensive it is and so forth. Most publishers will only allow you to look at the text and not at the figures, for example. In finishing, I would like to say how valuable it is for me to be a Fellow of OpenForum Academy and how very critical and valuable the organisation is. One of the important papers is the one we published with you about two years ago on content mining, which gives a reasonable overview of the subject. More generally, I have a number of slides on this topic available on slideshare.net, many of which are well-suited to discussing this. If you go to contentmine.org, we are organising the slides there as well, and there’s an excellent presentation by Charles Oppenheim, which he did with us in September 2014, about what the Hargreaves exemption allows; it’s very clear and balanced.
Thanks a lot, Peter!