Session 1_11 June_ cleaned transcript #12
flower1430 started this conversation in General
Session 1_transcript
This article celebrates the centenary of S. R. Ranganathan, the father of library science. It is complex and ambitious, and there are many themes to it, so we will indicate these as we go through. The first, and probably the most important thing for library science, is that it describes semantic publication: an idea which has been around for probably 50 years, promoted particularly by Tim Berners-Lee in 1994, but almost completely ignored by the scholarly publishing community. We believe passionately that scholarly publishing must become semantic if it is not to be totally overtaken by new methods which have been developed, often labelled AI and so on.
So, this will outline the structure and purpose of semantic publication.
To do it, and to exemplify it, we'll be talking about climate. Climate is the most important issue in the world at the moment, even though in many elections it's not mentioned. If one follows the climate discourse, there are very serious predictions about how the world is going to change, almost all for the worse, and about how we have to adapt and how we have to mitigate. Those changes are going to happen whether or not politicians do anything immediate about them. We are lucky that we have very good material to work with in climate, and it's an ideal starting point to show what semantics are.
So, the structure of this article is going to be partially semantic. The ideal would be to make the whole article semantic. That would probably be the equivalent of 500 published pages, including multimedia of various sorts, because that is the scale at which semantic publication operates at the moment.
However, we have to condense it into something readable. So there will be a narrative in HTML, very possibly derived from this monologue, which will cover all the main points. The monologue will then be hyperlinked into richer semantic components as we go. The monologue will probably be about 20-30 pages when printed, and it will then be converted into a PDF for publication in the NS article. Our intention is that it will be a high-level guide to what we're doing, but heavily hyperlinked, so that people can get an immediate idea of what a particular component, or a particular philosophy, is at any stage. So we will start with semantic publishing. We are sure that if Ranganathan were working today he would be hyperactive in making publishing semantic.
He worked in a time when the flow of information was very slow and was primarily from author to reader by printed material.
Picking out the important concepts for today, we have inclusiveness: he was passionate about including people outside the traditional library readership, and in our case this means people outside traditional scholarly publishing. So he mentioned the infirm, people with language disabilities, people who did not have access to books, and so on.
And I am sure that he would be promoting all of these ideas very actively.
Among his laws we will pick out "Save the time of the reader" and that is very much the theme of this article here.
We are making articles easier to read, and we will add to that "save the time of the author", because many readers are now also authors.
In the multi-dimensional web, you can author at many levels: primary material, and also comments on the work of others, and so on.
And I am sure that he would have been active in this. He would, I think, have been very disappointed with PDF as the main medium of scholarly communication. He promoted semantic publishing, and in library classification the Colon Classification, which is a forerunner of today's NoSQL databases, many of which work on a faceted approach to classification and discovery. And I am sure that he would have seen data and text as simply parts of a spectrum.
Rather than the two separate components we have at the moment, where you publish the text, you publish the data, and then you make some weak hyperlink between them.
Data and text are simply parts of a continuum which includes discourse, multimedia and much else. We are inspired by the work of Tim Berners-Lee, who came up with the concept of the semantic web in about 1990. Tim really launched this at the 1994 World Wide Web Conference, where he gave a vision of the semantic web as an integrated system in which all information was semantic and you could discover it and understand it with a range of tools.
Many people adopted this concept of semantic publishing and built systems. In the early 2000s there was a considerable level of discourse about it; tools were created, exemplars were created. David Shotton created a marked-up article, and Henry Rzepa and I did this for chemistry. There were various prototypes of semantic material, and one of the useful concepts at that stage was nanopublication: the idea that we could break publications down into small components, each of which was semantic and could be understood by machines.
The other thing that happened at about the same time was the rise of open. Open is a philosophy of universal access to knowledge, both its consumption and its creation. There were many efforts to systematize this. For example, open source and free software were championed by Linus Torvalds and Richard Stallman, and those concepts have been seminal in changing the way that software is developed in a community, with universally available tools for the whole process. Free and open source software (FOSS) has been enormously successful; most software which is generated outside corporations is FOSS, and organizations such as the European Commission and other national bodies have taken on board the requirement to make software open.
What open means is that a piece of information is free to use, reuse and redistribute without any permission. That is encapsulated by the Open Knowledge Foundation's Open Definition, which I have just quoted. It was expected in the 2000s that the idea of open would spread to other aspects of human knowledge, and about 25 years ago there was a major commitment in the Budapest Open Access Initiative, which defines what open access to published knowledge should be. It takes the view of the Open Definition that this is free to use, reuse and redistribute without permission. The purpose of published knowledge is for use by the world, as well as for recording what particular people and organisations have done. Unfortunately, that vision has failed. Open access is now largely confined to the global north, is largely bureaucratic, has a complex business model, and does not serve the ideals of reuse and global ownership.
The model of open, however, has been adopted by the Global South, particularly in Latin America, which is also spreading this vision. In the work of Ariana Beck and colleagues, a co-author on this article, the model which is practised is that scholarly publishing is a public good and should be available to the whole world. That means that anyone can access it and reuse it, and the reuse can actually be repurposing, so that we can make it semantic for easier reuse in science. Open science, again, is a concept which was advocated; it is not universal, it is patchy. One of the adherents to this full open model was open notebook science, which was championed by Jean-Claude Bradley, and we adopt the idea of open notebook science: we make everything that we do available immediately to the whole world.
Now, this is not at the moment automatic because the tools do not exist but we are in the process of generating tools which will take the work that we do, including developing this paper, this article, and put it out to the world as we do it.
It builds on the concept of the Memex from Vannevar Bush in 1945, when he conceived a machine which would record everything you do and make it available for posterity. Unfortunately, we are seeing this model developed by closed corporations over which we have no control and very little insight, and which is done not for the benefit of humanity but for the benefit of the corporations. We have to try to create, however we can do it, an equivalent which is open, so that as the world creates knowledge it publishes it immediately in semantic form. That is the theme of this article.
Now, we are going to use a particular set of examples to show this, and these examples are not hypothetical: we have done them. We have built semantic knowledge products in the area of climate. Climate is the most important issue in the world at the moment. Many people think we have only a few years until we reach irreversible tipping points, feedback loops and human disaster. So it's immediate. It's technically very complex. It covers every domain: mathematics, physics, astronomy, chemistry, biology, psychology, politics, law; every discipline is involved in climate. Much of the work so far has been global-north-centric, so it is written from the point of view of English-speaking nations, and often with a neo-colonial agenda.
But at the centre of our efforts we espouse what the United Nations is doing in three areas: the records of United Nations meetings, that is the UNFCCC; the handbook on climate; and the IPCC reports and the IPCC glossary, which we are turning into semantic form.
It's multidisciplinary, as I say, which means that we will almost always be interacting with disciplines in which we have not been trained and of which we have no knowledge, so we have to be able to read and understand them very quickly. What Semantic Climate can help with immediately is extracting the concepts and the terminology from the IPCC reports. We deal primarily with textual and graphical discourse; we are not dealing with data at the moment. That's deliberate: there are a lot of people who work with data, and again, it is complex to discover. So we are working with what we call documents.
We now come on to the model of our publication. This is semantic publication, and it involves a wide range of things that have to be done before something is semantic. We'll have a mental model here of a document: a scholarly publication, or a report from a company, government or NGO. The first thing we have to do is to structure it: what are the components of the document? A lot has been done for the bibliography of the document, so that is very well researched and supported by the academic and other communities. Less has been done with the actual content of the document. Conceptually, a document often has a header, a body and footers. The header describes what is in the document, often called metadata, and that is described in great detail by the JATS suite, which has about 250 terms describing the components of a scholarly publication. It is applicable to other domains as well. So it covers things like publisher, authors, subjects, abstracts, references and things of that sort. The body of a publication is much less well defined and varies a great deal; it is often domain specific.
Then we come to the microstructure of the document, where we go down to the individual nano components. Nanopublication was developed by Barend Mons and Jan Velterop, and ideally comes down to statements which are about the size of a sentence and which can be taken out of context if they are properly annotated.
The IPCC reports, and other reports, are actually in many cases collections of nanopublications, and very good ones: the statements are often sufficiently well stated that they can be taken out of context, or that, given a context, they can be understood without the rest of the document.
Nanopublications are not a new idea. They are well seen in the analysis and presentation of religious texts and creative works, such as collections of plays or poetry, and I would welcome examples from Indian culture. I may mention things which are in the English and Christian canon, but only because I am familiar with them and they are very well described in research. So, religious texts are often represented as a collection of books.
The books have chapters, the chapters have verses, and the verses are micro-addressable. Very often all that is quoted is the verse and its address, which is normally something like chapter and verse. Now, to support this we have the elements of HTML, and we see HTML as the tool which will support all of our work except for specialist subjects. HTML has a rich range of structuring tools. We will use head and body, and within body we will use div, a division, which can contain either more divisions, or paragraphs, lists or tables. Those are the main micro-addressable levels.
Then, within a paragraph, we have inline markup, which is spans: a paragraph may consist of a list of spans, which is very useful. It may also have a elements, which are anchors for hyperlinks, bi-directional hyperlinks; those are what we will use to represent our semantic material. On top of those elements, as they are called in XML, we have attributes, which sit in the start tag of elements, and these particularly include classes. The class attribute gives the role of an element and may also be used to give information on its display. We strictly separate content from presentation, which is a major semantic philosophy: there is no presentation information in what we create. But you can present it in different ways by using tools such as Cascading Style Sheets, which allow you to discover the HTML classes and render them in different ways, including hiding them or aggregating them. HTML contains all of the management tools that we need, and we would strongly recommend that all publication be in semantic HTML.
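The structure just described (head/body, nested divisions, sentence-level spans with class attributes carrying the role, and no inline presentation) can be sketched in code. This is a minimal illustration, not the project's actual tooling; the element ids and class names are illustrative assumptions.

```python
# Sketch of building semantic HTML: structure and roles only,
# with styling deferred to an external CSS file keyed on classes.
import xml.etree.ElementTree as ET

def make_semantic_section(section_id, title, sentences):
    """Build a <div> whose markup is purely semantic.
    Each sentence becomes a micro-addressable <span> with its own id."""
    div = ET.Element("div", {"id": section_id, "class": "section"})
    h = ET.SubElement(div, "h2", {"class": "section-title"})
    h.text = title
    p = ET.SubElement(div, "p", {"class": "para"})
    for i, s in enumerate(sentences):
        span = ET.SubElement(
            p, "span",
            {"id": f"{section_id}.s{i + 1}", "class": "sentence"})
        span.text = s
    return div

section = make_semantic_section(
    "ch1.sec2", "Semantic publishing",
    ["Scholarly publishing must become semantic.",
     "Content is strictly separated from presentation."])
html = ET.tostring(section, encoding="unicode")
print(html)
```

A stylesheet can then select `.sentence` or `.section-title` to render, hide or aggregate these elements, without any presentation information appearing in the document itself.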
Semantics at this level has two dimensions.
One is the role of the components. This is well seen in RDF, the Resource Description Framework, which was also developed to be compatible with HTML. We promote the idea of RDF as the vehicle in which roles will be embedded.
A typical role is: Renu Kumari is an author of this article. Peter Murray-Rust is an author of this article. Renu Kumari is employed by NIPGR. Peter Murray-Rust is a member of staff at Cambridge University. Cambridge University is a university. Cambridge University is located in Cambridge. Cambridge is located in the UK. (By the way, there is more than one Cambridge in the world.) And NIPGR is located in Delhi. NIPGR is a scientific organization. Those are all stated as triples, where we have subject, predicate, object (subject, property, object). It is extremely powerful, and it is a major tool for how we represent the semantics.
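The statements above can be written down directly as triples. The toy sketch below uses plain Python tuples rather than an RDF library, and the property names are illustrative assumptions; it shows how pattern-matching over subject/property/object recovers facts.

```python
# The facts from the text, each as a (subject, property, object) triple.
triples = [
    ("Renu Kumari", "isAuthorOf", "this article"),
    ("Peter Murray-Rust", "isAuthorOf", "this article"),
    ("Renu Kumari", "isEmployedBy", "NIPGR"),
    ("Peter Murray-Rust", "isMemberOf", "Cambridge University"),
    ("Cambridge University", "isA", "university"),
    ("Cambridge University", "isLocatedIn", "Cambridge"),
    ("Cambridge", "isLocatedIn", "UK"),
    ("NIPGR", "isLocatedIn", "Delhi"),
    ("NIPGR", "isA", "scientific organization"),
]

def match(subj=None, prop=None, obj=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [(s, p, o) for (s, p, o) in triples
            if subj in (None, s) and prop in (None, p) and obj in (None, o)]

# Who are the authors of this article?
authors = [s for s, _, _ in match(prop="isAuthorOf", obj="this article")]
print(authors)  # → ['Renu Kumari', 'Peter Murray-Rust']
```

A real system would use RDF serialisations (Turtle, RDFa) and URIs rather than bare strings, but the subject-property-object shape is the same.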
However, there's a second dimension to semantics, which is ontologies. Ontologies can become very complex and we use a very lightweight approach to ontologies.
So ontologies will allow you to discover the meaning of a relationship or the meaning of an object, and ontologies are often structured with a hierarchy. We will find that a plant has a binomial name: it has a species and a genus. A genus is a subset of a family. You can see the hierarchy developing, and why scientists use hierarchical classifications a great deal.
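A lightweight hierarchical ontology of the kind described can be sketched as a simple parent map. The taxon names below are a real botanical lineage used purely for illustration; the code is an assumption about one minimal way to represent the hierarchy, not the project's implementation.

```python
# A lightweight hierarchy: each term points to its parent rank.
parent = {
    "Oryza sativa": "Oryza",   # species -> genus
    "Oryza": "Poaceae",        # genus -> family
    "Poaceae": "Poales",       # family -> order
}

def lineage(term):
    """Walk up the hierarchy from a term to the root."""
    chain = [term]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

print(lineage("Oryza sativa"))
# → ['Oryza sativa', 'Oryza', 'Poaceae', 'Poales']
```

Faceted classification in the Ranganathan tradition adds further independent axes (place, time, form) on top of such a single hierarchy.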
We will see in our dictionary that we have multiple relations within the dictionary: a term can be related to many other terms. Annotation is extremely important. Annotation is very well presented in Wikimedia products, particularly Wikipedia, where terms within the discourse are highlighted as hyperlinks, normally displayed as blue underlined text. If you have a term like NIPGR, it will give you a dynamic hyperlink which links you to the Wikipedia page on NIPGR; that page will in its turn contain a large number of links to other concepts, and so on. Wikimedia contains literally over a hundred million different concepts, all of which are linkable. The presentation is usually with inline annotation: hyperlinks which are described by an HTML a element with an href, which points outward, and an a with an id, which receives incoming annotations. The more complex alternative is stand-off annotation, but this requires software and we do not use stand-off annotation at the moment.
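Wikipedia-style inline annotation can be sketched as a dictionary lookup that wraps known terms in outward-pointing `<a href>` elements. This is a toy assumption about one way to do it (real annotators handle overlapping terms, escaping and ambiguity); the target URL is illustrative.

```python
# Toy inline annotator: wrap dictionary terms in hyperlinks.
import re

links = {
    "NIPGR": "https://en.wikipedia.org/wiki/"
             "National_Institute_of_Plant_Genome_Research",
}

def annotate(text):
    """Wrap each known term in an outward-pointing <a href> element."""
    for term, url in links.items():
        text = re.sub(rf"\b{re.escape(term)}\b",
                      f'<a href="{url}">{term}</a>', text)
    return text

out = annotate("Renu Kumari works at NIPGR in Delhi.")
print(out)
```

The complementary inward direction, an `<a id="...">` anchor that receives incoming links, is what makes the annotation bi-directional.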
Knowledge is multimedia and it should be highly integrated. Scholarly knowledge consists of text, images and diagrams, audio and video streams (which we shall not deal with in this publication), and specialist domains such as maths and chemistry. All of those are part of the complete publication, and I am sure Ranganathan would have been delighted with the tools available to do that. However, to make it a reality we need the tools to do it and the will to do it. Although it was potentially there 15 years ago, there has been virtually no progress in the public domain since then.
We now come on to tooling.
We cannot be completely semantic without tools which manage our semantics, and that has been a major drawback: without the will there are no tools, and without the tools it is very difficult to create the will. So what Semantic Climate is doing is blazing a trail here, to try to show what the tools are. What we need primarily is the will to develop them. If we have the will, then we can develop the tools.
Now, true semantic publication will be dynamic, where information is always changing because the world changes day to day. We see reports on climate which we want to annotate. We see comments on previous reports which suggest that they need revision. We see new ideas, so we would want to re-annotate those previous works, and so forth. That is hard; it is possible, but it is hard. We take a static approach at the moment, so all of our semantic material is essentially static: it is created once, and it is archived and available, but it is not dynamically updated. That is not to say that we do not ourselves work with dynamic information; we do, and we do this using versions. All of our software is heavily versioned: every time we make a change to the software, we have a new version. We believe that should be the norm for scholarly information: that things should be revised and annotated, and that each revision should be a new version.
Tools exist to do it; the will does not. Before we can do this, we need identifiers. Identifiers are really important: they are things which allow us to refer to a precise object. This really came in, probably, with Ranganathan and his idea of classification. He was not the first to do this; the idea of classification and versioning has been with us for hundreds of years. But in terms of versioning of publications, very little has been done.
In terms of dynamic versions, the academic community's idea of the "version of record", which is what we are publishing here, is highly counterproductive, because it suggests that there is a golden moment at which one can say "this is the state of knowledge at this time", often for the future as well. That might work for, say, a mathematical proof, but it does not work for scientific information; it does not work for global information, which literally changes by the minute.
So, we use identifiers for all our material, and these identifiers are both semantic and non-semantic. A semantic identifier is one where the structure of the identifier string explains what it is and often how to find it. The typical semantic identifier which we use in the IPCC reports is report, chapter, section, subsection, subsubsection, paragraph, sentence. These can all be combined into a single identifier, and that identifier should be permanent for static information.
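The hierarchical identifier just described can be sketched as a join over named fields. The delimiter, field names and the example report label are assumptions for illustration, not the project's actual identifier scheme.

```python
# Sketch of a hierarchical semantic identifier:
# report / chapter / section / ... / sentence joined into one string.
FIELDS = ["report", "chapter", "section", "subsection",
          "subsubsection", "paragraph", "sentence"]

def make_id(**parts):
    """Join the supplied parts, in hierarchical order, into one identifier.
    Missing intermediate levels are simply omitted."""
    return ".".join(str(parts[f]) for f in FIELDS if f in parts)

pid = make_id(report="AR6-WG1", chapter=3, section=2,
              paragraph=4, sentence=1)
print(pid)  # → AR6-WG1.3.2.4.1
```

Because the string encodes the path through the document, a reader (or a machine) can locate the exact sentence from the identifier alone, which is what makes it semantic rather than opaque.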
Where we get our material.
Now, because there is no semantic publication, we have to convert material from what is called legacy information, and legacy is horrible. Legacy means it is not semantic. It is often not structured. It is very difficult for machines to understand automatically, and it is often impenetrable to people who do not speak the language in which it is written or who are not experts in the discipline in which it is written. There is a hierarchy of legacy. The lowest is publications as bitmaps; the worst is handwriting, and we do not deal with that. Then there is typeset material, which is sometimes recoverable by optical character recognition if it is clear. If the text has come from a machine it is often recoverable, and that links to what we do for PDF. Most scholarly publications, including this article, are in PDF, and PDF is extremely difficult to make semantic. Underneath, it is only four things: characters, images, graphic curves and hyperlinks. That is all you get. There are no words. There are no circles. There are simply these primitives in the PDF, and the software has to try to do its best to make some sense of it. In its most primitive form, PDF is simply a set of characters on the page, and that is what people might use for, let us say, flyers or political slogans. But in scholarly articles it is usually easier to do something better than that. The raw material of a publication is usually a document authored in Word or LaTeX and submitted to a journal in that form. In its most tractable form it is single column, so the text spans the whole of the page; it does not have images embedded within the text; and it has separate sections for different concepts. This is what is seen in academic theses, and theses are far better for semantic purposes than scholarly publications, including this one.
In PDF, therefore, we deal with characters which have coordinates, a geometry, and styling, which covers things like fonts, colours and other aspects of style. Images are ideally bitmaps in the PDF, and we can also get curves out where the material has been published in semantic or implicitly semantic graphic form, such as SVG. Those curves can sometimes be turned into higher-level objects such as chemical formulae or phylogenetic trees. We have to turn the characters into text, and to do this we have to use heuristics to tell which are words. The size of a space matters, and we have to know what a line end means: is the text continuing onto the next line, or is it a list? That requires a lot of heuristics, and it is often publisher-dependent, so different publishers need different approaches. Another problem we have to contend with is that there are probably 200 major scholarly publishers and publications, and each of them uses a different type of publication and mixes up content with non-content such as advertising, author scores and things of that sort, making it incredibly difficult. We may have to come down to having a per-publisher semantifier.
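The "size of a space matters" heuristic can be illustrated with a toy word-grouper: characters on one text line are joined, and a horizontal gap wider than a threshold becomes a word break. This is an assumption about a typical approach, not the project's real extraction code, and the coordinates and threshold are invented for the example.

```python
# Toy heuristic: group PDF characters into words by inter-character gaps.
def chars_to_words(chars, gap=2.0):
    """chars: list of (character, x_start, x_end) on one text line.
    A gap wider than `gap` between characters starts a new word."""
    words, current = [], ""
    prev_end = None
    for ch, x0, x1 in chars:
        if prev_end is not None and x0 - prev_end > gap:
            words.append(current)
            current = ""
        current += ch
        prev_end = x1
    if current:
        words.append(current)
    return words

line = [("s", 0, 5), ("a", 5, 10), ("v", 10, 15), ("e", 15, 20),
        ("t", 25, 30), ("i", 30, 33), ("m", 33, 40), ("e", 40, 45)]
print(chars_to_words(line))  # → ['save', 'time']
```

Real extractors must also normalise the threshold by font size and handle kerning, ligatures and hyphenation at line ends, which is why the heuristics end up publisher-dependent.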
So that is discovery: we have talked about what we need and what the tooling is, and we will now go on to our examples. I see that it has come up to time.