Analysis of Relations among Texts
Martin Winkler
 
Abstract

    The classification of texts by register is both a result and a prerequisite of human communication. We possess knowledge about registers and are thus able to judge the similarity of texts. In this investigation, the relation between two texts is defined on the basis of intuitive text characteristics. The degree of the relation is measured using only language-independent text indices. The results obtained with these metrics are promising and suggest applying the analysis of text relations in the field of information retrieval.
 
Kurzfassung (English translation)

    The classification of texts by text type is both a result and a prerequisite of human communication. We possess knowledge of text types, with the help of which we can judge texts to be more or less similar to one another. In this study, the relatedness of two texts is defined on the basis of intuitive text features. Building on language-independent text indices, the degree of text relatedness is measured. The results of this metric are promising and suggest applying text-relatedness analysis in information retrieval.
 

    If you have any questions about the details of my diploma thesis, please send me an e-mail, or simply download a copy of it (zipped *.pdf, about 1140 KByte).
 


Quotes I used for writing my diploma thesis
 
"It is a common observation that human progress is most difficult in those fields which do not belong exclusively under one of the accepted major branches of knowledge. With respect to such fields we are frequently in the position of the six blind men and the elephant. According to this story, one of the blind men, who got hold of the elephant's leg, asserted that he was like a pillar; a second, who had the animal by the tail, said that he was like a rope; another, who was up against the elephant's side, claimed that he was like a wall; while the remaining three, who touched the elephant's ear, trunk, and tusk, maintained with equal stoutness that he was like a sail, a hose, and a spear respectively."
    Haskell B. Curry, Some Logical Aspects of Grammatical Structure, in: Roman Jakobson (ed.), Proceedings of Symposia in Applied Mathematics, Volume XII, Structure of Language and its Mathematical Aspects, American Mathematical Society, Providence, 1961
 
"One dark night a policeman comes upon a drunk. The man is on his knees, obviously searching for something under a lamppost. He tells the officer that he is looking for his keys, which he says he lost 'over there', pointing out into the darkness. The policeman asks him, 'Why, if you lost the keys over there, are you looking for them under the streetlight?' The drunk answers, 'Because the light is so much better here.' That is the way that science proceeds, too."
    Joseph Weizenbaum, Computer Power and Human Reason, Freeman 1976
 
"The deepest qualitative knowledge is a byproduct of some quantitative knowledge."
    Mario Bunge, Exploring the World 1983
 
"The Reader may here observe the Force of Numbers, which can be successfully applied, even to those things, which one would imagine are subject to no Rules. There are very few things which we know, which are not capable of being reduc'd to a Mathematical Reasoning; and when they cannot it's a sign our knowledge of them is very small and confus'd; and when a Mathematical Reasoning can be had it's as great a folly to make use of any other, as to grope for a thing in the dark, when you have a Candle standing by you."
    John Arbuthnot, Of the Laws of Chance, 1692
 
"Through and through the world is infested with quantity: To talk sense is to talk quantities. It is no use saying the nation is large... How large? It is no use saying the radium is scarce... How scarce? You cannot evade quantity. You may fly to poetry and music, and quantity and number will face you in your rhythms and your octaves."
    Alfred N. Whitehead, in J. R. Newman: The World of Mathematics, New York, Simon and Schuster, 1956
 
"Whatever you can, count."
    Francis Galton, in J. R. Newman: The World of Mathematics, New York, Simon and Schuster, 1956
 
"I know, indeed, and can conceive of no pursuit so antagonistic to the cultivation of the oratorical faculty ... as the study of Mathematics. An eloquent mathematician must, from the nature of things, ever remain as rare a phenomenon as a talking fish, and it is certain that the more anyone gives himself up to the study of oratorical effect the less he will find himself in a fit state to mathematicize."
    James J. Sylvester
 
"A start in mathematization or mathematical modelling, however unrealistic, is better than either a prolix but unenlightening description or grandiose verbal sketch."
    Mario Bunge, Exploring the World 1983
 
"Errors using inadequate data are much less than using no data at all."
    Charles Babbage
 
"Ein anderer [vom Volksgesundheitsverein "Balkenbuchstabe"] kam ihm nach und erzählte folgendes: Wenn er durch die Straßen gehe - noch viel aufregender sei es aber, wenn man auf der Elektrischen fährt -, zähle er schon seit Jahren an den großen lateinischen Buchstaben der Geschäftsschilder die Balken (A bestehe zum Beispiel aus dreien, M aus vieren) und dividiere ihre Zahl durch die Anzahl der Buchstaben. Bisher sei das durchschnittliche Ergebnis gleichbleibend zweieinhalb gewesen, ersichtlich sei dies aber keineswegs unverbrüchlich und könne sich mit jeder neuen Straße ändern: so wird man von großer Sorge bei Abweichungen, von großer Freude beim Zutreffen erfüllt, was den läuternden Wirkungen ähnle, die man der Tragödie zuschreibt. Wenn man dagegen die Buchstaben selbst zähle, so sei, wovon sich der Herr nur überzeugen möge, die Teilbarkeit durch drei ein großer Glücksfall, weshalb die meisten Aufschriften geradezu ein Gefühl der Nichtbefriedigung hinterlassen, das man deutlich bemerkt, bis auf jene, die aus Massenbuchstaben, das heißt, aus solchen mit vier Balken, bestehn, zum Beispiel WEM, die unter allen Umständen ganz besonders glücklich machen. Was daraus folge, fragte der Besucher. Nichts anderes, als daß das Ministerium für Volksgesundheit eine Verordnung herausgeben müsse, die bei Firmenbezeichnungen die Wahl von vierbalkigen Buchstabenfolgen begünstige und die Verwendung einbalkiger wie O, S, I, C möglichst unterdrücke, denn sie machten durch ihre Unergiebigkeit traurig!"
    Robert Musil, Der Mann ohne Eigenschaften
    Robert Musil's ideas about text indices
 

"'Zum Verständnis der Lyrik,
von Dr. J. Evans Prichett, Doktor der Philosophie

Um Lyrik vollständig zu verstehen, müssen wir zuerst Versform, Reim und Ausdrucksweise vollkommen beherrschen. Dazu stellen sich zwei Fragen:
Wie kunstvoll wurde die Zielsetzung des Gedichtes erfüllt? und zweitens
Wie wichtig ist diese Zielsetzung?
Frage 1 bewertet die Perfektion des Gedichtes und Frage 2 seine Bedeutsamkeit. Wenn wir diese Fragen beantwortet haben, läßt sich die dichterische Größe eines Gedichtes relativ einfach ersehen.
Die Maßzahl eines Gedichtes läßt sich anhand eines Diagramms festlegen. Auf der y-Achse tragen wir die Perfektion ein und seine Bedeutsamkeit auf der x-Achse. Die Flächenberechnung zwischen Perfektion und Bedeutsamkeit ergibt die Maßzahl der dichterischen Größe.
Ein Sonnett von Byron würde auf der y-Achse eine hohe Punktzahl erhalten, wäre auf der x-Achse allerdings nur Durchschnitt. Andererseits würde ein Sonnett von Shakespeare sowohl auf der x-Achse als auch auf der y-Achse sehr weit außen sein. Damit würde veranschaulicht, wieviel dichterische Größe dieses Gedicht aufweist.
Wenn Sie die Lyrik in diesem Buch durcharbeiten, verwenden Sie bitte diese Bewertungsmethode. In dem Maße, wie Ihre Fähigkeit zur Bewertung von Gedichten wächst, werden auch Freude und Verständnis für Lyrik wachsen.'
‚Exkrement. Das denke ich über Mr. J. Evans Prichett. Wir sind keine Klempner, wir haben es hier mit Lyrik zu tun. Man kann doch nicht Gedichte bemessen wie amerikanische Charts!'"
    Touchstone Pictures: "Der Club der toten Dichter" von Peter Weir, Robin Williams als Englischlehrer John Keating.
    John Keating's ("Dead Poets Society") opinion about text indices
 
"Je mehr ich über die Sprache nachdenke, desto mehr wundert es mich, daß sich die Menschen überhaupt je verstehen."
"The more I think about language, the more I wonder that humans are able to understand themselves at all."
    Kurt Gödel zu Karl Menger, am Heimweg nach einer Sitzung des Wiener Kreises
 
"We crossed a walk to the other part of the academy, where, as I have already said, the projectors in speculative learning resided.
The first professor I saw was in a very large room, with forty pupils about him. After salutation, observing me to look earnestly upon a frame, which took up the greatest part of both the length and breadth of the room; he said, perhaps I might wonder to see him employed in a project for improving speculative knowledge by practical and mechanical operations. But the world would soon be sensible of its usefulness; and he flattered himself, that a more noble exalted thought never sprang in any other man's head. Every one knew how laborious the usual method is of attaining to arts and sciences; whereas by his contrivance, the most ignorant person at a reasonable charge, and with a little bodily labour, may write books in philosophy, poetry, politicks, law, mathematicks and theology, without the least assistance from genius or study. He then led me to the frame, about the sides whereof all his pupils stood in ranks. It was twenty foot square, placed in the middle of the room. The superficies was composed of several bits of wood, about the bigness of a dye, but some larger than others. They were all linked together by slender wires. These bits of wood were covered on every square with paper pasted on them; and, on these papers were written all the words of their language in their several moods, tenses, and declensions, but without any order. The professor then desired me to observe, for he was going to set his engine at work. The pupils at his command took each of them hold of an iron handle, whereof there were forty fixed round the edges of the frame; and giving them a sudden turn, the whole disposition of the words was entirely changed. He then commanded six and thirty of the lads to read the several lines softly as they appeared upon the frame; and where they found three or four words together that might make part of a sentence, they dictated to the four remaining boys who were scribes.
This work was repeated three or four times, and at every turn the engine was so contrived, that the words shifted into new places, as the square bits of wood moved upside down.
Six hours a-day the young students were employed in this labour; and the professor shewed me several volumes in large folio already collected, of broken sentences, which he intended to piece together; and out of those rich materials to give the world a compleat body of all arts and sciences; which, however might be still improved, and much expedited, if the publick would raise a fund for making and employing five hundred such frames in Lagado, and oblige the managers to contribute in common their several collections.
He assured me, that this invention had employed all his thoughts from his youth; that he had emptied the whole vocabulary into his frame, and made the strictest computation of the general proportion there is in books between the numbers of particles, nouns, and verbs, and other parts of speech.
I made my humblest acknowledgments to this illustrious person for his great communicativeness; and promised, if ever I had the good fortune to return to my native country, that I would do him justice, as the sole inventor of this wonderful machine; the form and contrivance of which I desired leave to delineate upon paper as in the figure here annexed."

    Jonathan Swift, Gulliver's Travels
 
"Jeder Intellektuelle hat eine ganz spezielle Verantwortung. [Er schuldet] es seinen Mitmenschen [...], die Ergebnisse seines Studiums in der einfachsten und klarsten und bescheidensten Form darzustellen."
"Every intellectual has a special responsibility. [He owes it to] his fellow beings [...], to present the results of his study in the simplest, clearest and humblest way."
    Karl Popper, Plädoyer für intellektuelle Redlichkeit
 
"Definieren wir nun die Vollkommenheit des sprachlichen Ausdrucks in der Weise, daß er deutlich sei - es gibt nämlich ein gewisses Indiz dafür, daß, wenn die Rede einen Sachverhalt nicht klar darlegt, sie die von ihr geforderte Aufgabe nicht erfüllt."
    Aristotle
 
"Man brauche gewöhnliche Worte und sage ungewöhnliche Dinge."
"One should use ordinary words and say extraordinary things."
    Arthur Schopenhauer, Über Schriftstellerei und Stil, in: Parerga und Paralipomena, 1851
 
"Allgemein aber gilt: Das Geschriebene muß sich leicht vortragen lassen."
"Generally, the statement holds: What is written must be easily recitable."
    Aristotle, Rhetorik, Buch III, München 1980
 
"Verständnis ist die Grundlage und Ursache gut geschriebener Texte."
"Understanding is basis and reason of well written texts."
    Horatius, Ars Poetica
 
"I will be sufficiently rewarded if when telling it to others you will not claim the discovery as your own, but will say it was mine."
    Thales, in H. Eves: In Mathematical Circles, Boston, Prindle, Weber and Schmidt
 

 

 

 
