Posted by: bluesyemre | August 19, 2022

AI in Academic Libraries, Part 1: Areas of Activity, Big Players and the Automation of Indexing

Interview with Frank Seeliger (TH Wildau) and Anna Kasprzik (ZBW)

What are the most promising areas of activity for the use of artificial intelligence in academic libraries? (How) can libraries hold their own against the commercial big players? Do they even have to? Will the intellectual preparation of metadata soon be superfluous due to intelligent search engines? The first part of the three-part interview series with Anna Kasprzik and Frank Seeliger focuses on these questions.

We recently had an intense discussion with Anna Kasprzik (ZBW) and Frank Seeliger (Technical University of Applied Sciences Wildau) on the use of artificial intelligence (AI) in academic libraries. Both of them were also recently involved in two wide-ranging articles: “On the Promising Use of AI in Libraries: Discussion Stage of a White Paper in Progress – Part 1” (German) and “Part 2” (German). This slightly shortened, three-part series has been drawn up from our spoken interview. The following two articles are also part of the series:

  • Part 2: Interesting Projects, the Future of Chatbots and Discrimination Through AI
  • Part 3: Prerequisites and Conditions for Successful Use

We will link them here as soon as the texts are online.

An interview with Dr Anna Kasprzik (ZBW – Leibniz Information Centre for Economics) and Dr Frank Seeliger (University Library of the Technical University of Applied Sciences Wildau).

What are the most promising areas of activity for the use of AI in academic libraries?

Frank Seeliger: Time and again, reports crop up about how great the automation potential of different job profiles is. This also applies to libraries: In the case of the management of an institution, automation using AI is minimal, but for the specialists for media and information services (FaMI in German), it could be up to 50%.

In the course of automation and digitalisation, it’s largely about changing process chains and automating so that users can borrow or return media autonomously in the libraries – outside opening hours or during rush hour – essentially as an interaction between human and machine.

Even the display of availabilities in the catalogue is a consequence of the use of automation and digitalisation of services in libraries. Users can check at home whether a medium is available. Services in this area – those dealing with how to access a service outside the immediate vicinity and opening hours – are certainly increasing, for example in the context of asking a question or using something during the evening, including via remote access. This process continues and also includes internal procedures such as leave requests or budget planning. These processes run completely differently in comparison to 15 years ago.

One of the first areas of activity for libraries is automatic letter and number recognition, including for older works, cimelia and early printed books, and more generally in the context of digitalisation projects. This is one area of library expertise: layout, identification and recognition. The other is the question of indexing. Many years ago, libraries worked almost exclusively with printed works, keywording them and indexing their content. Nowadays detection systems have tables of contents and work with what are known as “component parts of a bibliographically independent work”, i.e. articles that are co-documented in discovery tools or search engines. The question is always: “How should we prepare this knowledge so that it can be found using completely different approaches?” Competitors such as Wikipedia and Google set the pace to some extent. We try to keep up or go into niche fields where we have different expertise, another perspective. These are definitely the first areas of activity in the field of operations, search activities or indexing and digitalisation, where AI is helping us to go further than before.

It has thereby been possible for many libraries to offer services at lower personnel cost even beyond the opening hours of public libraries (Open Level concept). Not round the clock, but for several more hours – even if no-one is in the building.

We need to make sure that we provide students with relatively high-quality information at different places and different times in their various locations. This is why chatbots for example (there’s more to come about this in part 2 of this article series) are such an exciting development, because students do not necessarily work when libraries are open or when our service times are available, but rather during the evenings, at the weekend or on public holidays. Libraries have the urgent task of providing them with sufficient and quality-checked information. We need to position ourselves where the modern technologies are.

Anna Kasprzik: Perhaps I’m biased because I’m working in the field but for me it’s very important to differentiate: I am specialised in the field of automation of subject indexing in academic libraries; the core task is to process and provide information intelligently. For me, this is the most interesting field. However, I sometimes get the impression that some libraries are falling into a trap: they want to do “something with AI” because it’s cool at the moment and then just end up dabbling in it.

But it’s really important to tackle the core tasks and thus prove that libraries can stay relevant. These days, core tasks such as subject indexing are impossible to imagine without automation. Previously this work was done intellectually by people, often even by people with doctorates. But because the tasks are changing and the quantity of digital publications is growing so rapidly, humans can only achieve a fraction of what is required. This is why we need to automate and successively find ways to combine humans and machines more intelligently. In machine learning, we speak of the “human in the loop”. By this, we mean the various ways in which humans and machines can work together to solve problems. We really need to focus on the core tasks. And we need to apply methods of artificial intelligence and not just do explorative projects that might be interesting in the short-term but are not thought through at a sustainable level.

Frank Seeliger: The challenge is that, even when you have a very narrow field that you are trying to research and describe, it’s difficult to stay up to date with all relevant articles. You need tools such as the Open Research Knowledge Graph (ORKG). With its help, articles that use the same methods or report similar facts can be compared without reading each one in full, which naturally takes time and energy. It’s impossible to read 20 scientific articles a day. But that’s how many are produced in some fields. That’s why you need to develop intelligent tools that help scientists to get a fast overview of which articles to prioritise for reading, absorbing and reflecting on.

But it goes even further. In the authors’ group of the “White Papers in progress” (German), which met for one year, we asked ourselves what search of the future would be like: Will we still search for keywords? We’re familiar with a different approach from plagiarism detection software, into which entire documents are entered. The software checks whether there is a match with other publications and whether non-cited text is used without permission. But you can also turn the whole thing around by saying: I have written something; have I forgotten a significant, current contribution in science? As a result, you get a semantic, ontological hint that there is already an article on the topic you have explored which you should reflect on and incorporate. This is a perspective for us, because we assume that today one can hardly stay on top of the literature, even when one has an interdisciplinary focus or is exploring a new field. It would also be exciting to find a way in via a graphic analysis that ensures that you have not forgotten anything important.

(How) can libraries keep up with big players such as Google, Amazon or Facebook? Do they even have to?

Frank Seeliger: We’ve had some very intensive disagreements about this and come to the conclusion that libraries will never have the staffing power that these corporations have, even if we were able to have one single world library. Even then it would be questionable whether we would be able to establish a parallel world (and whether we would even want this). After all, others cater for other target groups. But even in the case of Google Scholar, the target group is quite clearly defined.

Our expertise lies in the respective field that we have licensed and to which we have access. Every higher education institution has different points of focus for its own teaching and research. For this, it ensures very privileged, exclusive access which is used to reflect precisely on what is in the full text or is licensed and what can be accessed by going to the shelves. This is and remains the task.

Although it is also changing. How will things develop, for example, if a very high percentage of publications are published in Open Access and the data becomes freely accessible? There are semantic search engines that are experimenting with this. Examples are YEWNO at Bayerische Staatsbibliothek (Bavarian State Library) or iris.ai, a company that has a headquarters in Prague, among other places. They work a lot with Open Access literature and try to process it differently on a scientific level than before. So in this respect, tasks also change.

Libraries need to reposition themselves if they want to stay in the race. But it’s clear that our core task is first of all to process the material that we have licensed, and for which we pay a lot of money, in the best possible way. The aim must be that our users, i.e. students or researchers, find the information they need relatively quickly and not after the 30th hit.

One of the ways in which libraries are intrinsically different to the big players lies in how they deal with personal data. The relationship to personal data when using services is diametrically opposed to the offers of the big players, because values such as trustworthiness, transparency etc. play an enormously important role for the services of libraries.

Do students even start their search in library catalogues? Don’t they go directly to the general internet search engines?

Anna Kasprzik: They use Google relatively often. At the ZBW, we are actually currently analysing the routes via which users enter our research portal. It’s often Google hits. But I don’t see that as a problem because the research portal of a library is only one reuse scenario of metadata that libraries create. You can also make it available for reuse as Linked Open Data. And what’s more: Google uses a lot of this data, so it is already integrated into Google.

And to respond to the other question, we have also discussed this in the paper, at least in the early draft. The fact that libraries are publicly funded means that they have a very different set of ethics when dealing with the personal data of users. And this has many advantages because they don’t constantly try to milk the users according to their needs or requirements. Libraries simply want to provide the best-prepared information possible. This is a strong moral advantage, which we can utilise to our benefit. But libraries do not sell this advantage, at least not very much.

There is also an age-old disagreement about this (which has nothing to do with AI, however): many students and PhD candidates do not realise that in their everyday lives, they are using data that a library has prepared and made available for them. They call up a paper at the university and do not notice that its link has been made available via their library and that the library has paid for this. And then, there are two factions: some people say that the users shouldn’t notice, that it must occur as smoothly as possible. The others believe that, actually, there should be a big fat notice stating “provided by your library” so that people can’t miss it.

Frank Seeliger: The visualisation of the library work that is reused by third parties is a great challenge and must be properly championed because otherwise, if it is no longer visible, people will start asking why they are giving money to libraries at all. The results are visible but not who has financed them, and/or people don’t notice that they are actually commercial products.

Another aspect that we discussed was the issue of transparency and freedom from advertising. We organised a virtual Open Access Week (German) from November 2021 to March 2022. We made video recordings of each ninety-minute session. Then we asked ourselves: Should we use YouTube for publication or the non-commercial video portal of the TIB Leibniz Information Centre for Science and Technology and University Library (TIB AV Portal)? We made a clear-cut decision to use the TIB AV portal and they have accepted us there. We decided in favour of the portal precisely because there are no advertisements, no overlays and no pop-up windows. If we work with discovery tools, we try to advertise the fact that you really don’t get any advertising and reach your goal with your very first hit. Therefore, several aspects differentiate us significantly from commercial providers. We are having that discussion right now; it’s an important difference.

Will the intellectual creation of metadata soon become superfluous because intelligent search engines will take over this task?

Anna Kasprzik: This is a fundamental issue for me. I say: “no”, or perhaps “yes and no”. What we are doing at the moment via our automation of subject indexing with machine learning methods is an attempt to imitate the intellectual subject indexing one-to-one, just the same way it has always been done. But for me this is only a way for us to get our foot in the door technologically. In the next few years, we will address this and start designing the interplay between human knowledge organisation expertise and machines in a more intelligent way – reorganise it completely. I can imagine that we will not necessarily need to do the intellectual subject indexing in advance in the same way that we are currently doing it. Instead, intelligent search engines can try to index content resources taking the context into account.

But even if they are able to do this from the context ad hoc, those engines require a certain amount of underlying semantic structuring. And this structuring needs to exist in advance. It will therefore always be necessary to prepare information so that the pattern recognition algorithms can access them in the first place. If you merely dive into the raw data, the result is chaos, because the available metadata is fuzzy. You need structuring that pulls the whole thing more sharply into focus, even if it only accommodates the machine to a partial extent and not completely. There exist completely different ways of interconnecting search queries and retrieval results. But intelligent search engines still have to have something up their sleeve, and that something is organised knowledge. This knowledge organisation requires human expertise as input at certain points. The question is: at which points?

Frank Seeliger: There is also the opposing view of TIB director Prof. Dr Sören Auer, who says that data collection is overvalued. Certainly also meant as a provocation or simply to test how far one can go. In the future, it may not be necessary to have as many colleagues working in the field of intellectual indexing.

For example, we have 16,000 graduate theses held in the TH Wildau library; all their tables of contents are being scanned and made OCR-compatible. The question is, can you systematise them according to the Regensburger Verbundklassifikation (RVK, Regensburger Association Classification; a classification scheme for academic libraries), perhaps with the Annif tool? This means that I don’t have to look at each dissertation and say, this one belongs in the field of engineering, etc., independently of the study courses in which they were written. Instead, here is the RVK graph, there are the tables of contents, and the two are matched according to certain algorithms. This is a different approach to when I, as a specialist, take a look at every work and index it correspondingly for keywords, the Integrated Authority File (GND; a service facilitating the collaborative use and administration of authority data) and so on, running through all the procedures. I see this as a new way of mastering the masses, because a great deal is published and because we have taken over responsibilities that did not use to be covered by libraries, such as the indexing of articles, i.e. component parts of a bibliographically independent work, besides bibliographically independent works. It’s definitely a great help.
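As a rough illustration of what "matching tables of contents against a classification" can mean, here is a minimal sketch in Python. It is not how Annif works internally and does not use Annif's API; it simply compares word counts with cosine similarity. The class notations (`X-ENG` etc.), their labels and the sample table of contents are all made up for the example and are not real RVK entries.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_classes(toc_text, class_labels, top_n=2):
    """Rank classification entries by similarity to a table of contents."""
    toc_vec = Counter(tokenize(toc_text))
    scored = [
        (cosine(toc_vec, Counter(tokenize(label))), notation)
        for notation, label in class_labels.items()
    ]
    scored.sort(reverse=True)
    return [(notation, round(score, 3)) for score, notation in scored[:top_n]]


# Hypothetical class notations and labels, NOT real RVK entries.
CLASSES = {
    "X-ENG": "engineering mechanics materials machines design",
    "X-ECO": "economics markets business management finance",
    "X-INF": "computer science software algorithms data networks",
}

# Hypothetical OCR output of a thesis table of contents.
toc = "1 Introduction 2 Software architecture 3 Algorithms for data processing 4 Evaluation"

suggestions = suggest_classes(toc, CLASSES)
print(suggestions)
```

In this toy setup the computing class ranks first because it shares the most words with the table of contents. A real system would need far richer class descriptions, training data and human checks of the suggestions.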

However, I cannot imagine that humans will no longer intervene in such algorithms at all and no longer offer the pre-structuring according to which the algorithms must act. Up to now, we have needed a lot of human intervention to tune and optimise these systems so that the results are indexed 99% correctly. That’s one objective. This requires control, pre-structuring and a close look at the training data, for example in calligraphy, when you check whether a letter has been recognised correctly. Checking and handling by human beings is still necessary.

Anna Kasprzik: Exactly – I mentioned the concept earlier: the “human in the loop”, i.e. that people can be involved at various levels. This can start out quite trivially: with the fact that training data or our knowledge organisation systems are generated by humans. Or the fact that you can use automatically generated keywords as suggestions – machine-assisted subject indexing.

There are also concepts such as online learning and active learning. Online learning means that the machine receives regular feedback from the indexer on how good its output was, and is retrained on that basis. Active learning is where the machine can interactively decide at certain points: I now need a person as an oracle for a partial decision. The machine initiates this, saying: “Human, I am pushing a few part-decisions that I need into the queue here – please work through them.” People and machines tend to toss the ball back and forth here, rather than doing it separately in two blocks.
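The "queue of part-decisions" can be sketched in a few lines of Python. This is a simplified illustration and not the ZBW's actual system: it assumes the machine attaches a confidence value to each suggested keyword and hands anything below a threshold to a human. The function names, the threshold of 0.8 and the sample data are all hypothetical.

```python
from collections import deque


def route_suggestions(suggestions, threshold=0.8):
    """Split machine suggestions into auto-accepted ones and a human review queue.

    suggestions: list of (document_id, keyword, confidence) tuples.
    """
    accepted = []
    review_queue = deque()
    for doc_id, keyword, confidence in suggestions:
        if confidence >= threshold:
            accepted.append((doc_id, keyword))
        else:
            # The machine asks the human "oracle" for this partial decision.
            review_queue.append((doc_id, keyword, confidence))
    return accepted, review_queue


def work_through_queue(review_queue, oracle):
    """Let the human oracle decide each queued item; collect the feedback."""
    feedback = []
    while review_queue:
        doc_id, keyword, _ = review_queue.popleft()
        feedback.append((doc_id, keyword, oracle(doc_id, keyword)))
    return feedback


# Hypothetical machine output: document id, suggested keyword, confidence.
machine_output = [
    ("doc1", "monetary policy", 0.95),
    ("doc1", "inflation", 0.55),
    ("doc2", "labour market", 0.40),
]


def indexer(doc_id, keyword):
    """Stand-in for a human decision; here only 'inflation' is confirmed."""
    return keyword == "inflation"


accepted, queue = route_suggestions(machine_output)
feedback = work_through_queue(queue, indexer)
print(accepted)
print(feedback)
```

The collected feedback is what would flow back into retraining in an online-learning setup; that retraining step is left out here because it depends entirely on the model being used.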

Thank you for the interview, Anna and Frank.

In part 2 of the interview on “AI in Academic Libraries” we explore exciting projects regarding the future of chatbots and discrimination through AI.
Part 3 of the interview on “AI in Academic Libraries” focuses on prerequisites and conditions for successful use.
We’ll share the link here as soon as the post is published.

This text has been translated from German.

