Chapter 60 - Case: Proofreading
How Information Technology Is Conquering the World:
Workplace, Private Life, and Society
Professor Kai A. Olsen, Universitetet i Bergen and Høgskolen i Molde
A semantic proofreading tool for all languages based on a text repository
Kai A. Olsen
Molde University College and Department of Informatics, University of Bergen
Norway
Bård Indredavik
Technical Manager, Oshaug Metall AS, Molde, Norway
Kai A. Olsen, 19.04.23 3
Proofreading is important
When we write in a foreign language
If we are not proficient in our own language
To find typos and other mistakes
Errors can make the text unreadable and give a very bad impression:

  I am a student of MSc Logistics and Supply Chain Management from Westminitser University, London. Last weel I had the presentation regarding Molde College University and I heart that you are the module leader of Management of value. I am wondering if you may write me back more about that module, because it not really clear for me? In particular, when I am considering to go foe the second semestr to Molde. I will be really approciate for it.
Manual proofreading
When we are in doubt about an expression we could ask a language-proficient colleague
However,
  we may not have anybody to ask
  it may be too much to ask somebody to proofread everything that we write
Can we do it automatically?
Automatic language processing
An important research area since the 1960s
The results have been far from what many envisioned
Natural languages seem to be too complex to be formalized (some argue that you have to be a human being to understand natural language)
But, thanks to faster computers, we have workable spelling checkers, and studies of syntax have given us grammar checkers that handle at least some types of mistakes
Still, there are clear limitations, e.g., the language tools in Office 2003 will not find these errors:
  “I have a red far”
  “A forest has many threes”
  “I live at London”
  “We had ice cream for desert”
For our student
If she had used the spelling and grammar checker in Office, only a few mistakes would have been found:
Another approach
Instead of asking another person to proofread, we could ask the whole world
That is, use the Web as a text repository and compare our sentences to those of everyone else
For example, by using Google:
  “we live at the west coast”: 0
  “we live on the west coast”: 3,500,000
  “we live in the west coast”: 5,960,000
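The comparison of hit counts can be sketched in a few lines of Python. This is only an illustration: the counts are copied from the slide, and the function name `rank_phrasings` is a made-up helper, not part of any search engine API. A real tool would have to fetch the counts itself.

```python
# Hit counts copied from the slide; fetching them from a search engine
# is deliberately left out of this sketch.
hit_counts = {
    "we live at the west coast": 0,
    "we live on the west coast": 3_500_000,
    "we live in the west coast": 5_960_000,
}


def rank_phrasings(counts):
    """Return the phrasings sorted from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


ranking = rank_phrasings(hit_counts)
best = ranking[0]
print(best)
```

Raw frequency alone picks the most common phrasing, which is why the counts are best read as a suggestion rather than a verdict.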
Background paper (2004)
Journal of the American Society for Information Science and Technology, Volume 55, Issue 11, September 2004
What if the alternatives are unknown?
We can use a wild card (*)
Example: “we live * the west coast”
Study the alternatives, and check the complete sentence with each candidate to get a frequency number
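A wild card search can be imitated on a local text collection. The sketch below is an assumption about how such a lookup might work, not the search engine's own mechanism: `wildcard_candidates` is a hypothetical helper, the four-line corpus is invented for the example, and only a single `*` standing for one word is supported.

```python
import re
from collections import Counter


def wildcard_candidates(pattern, corpus):
    """Count the words that fill the single '*' in pattern across corpus lines."""
    before, after = pattern.split("*")
    regex = re.compile(
        re.escape(before.strip()) + r"\s+(\w+)\s+" + re.escape(after.strip()),
        re.IGNORECASE,
    )
    counts = Counter()
    for line in corpus:
        for match in regex.finditer(line):
            counts[match.group(1).lower()] += 1
    return counts


# Invented mini-corpus, standing in for the Web.
corpus = [
    "We live on the west coast of Norway.",
    "They said we live on the west coast.",
    "We live in the west coast region.",
    "We live near the west coast.",
]
candidates = wildcard_candidates("we live * the west coast", corpus)
print(candidates.most_common())
```

Each candidate found this way can then be put back into the full sentence to obtain its frequency.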
A tedious process
Disadvantages
A lot of work
We have to know where we are in doubt
It can be difficult to find all the alternatives
But we can make a tool that can do this job automatically
Prototype
Consists of:
1. A spider that collects text from the Web
2. An index builder that creates an index structure
3. An analyzing program that finds alternatives for each word in the user’s sentence
1. Spider
Starts with a list of seeds, e.g., links to Web sites of universities, newspapers, state organizations, etc.
Retrieves text from these sites
“Cleans” the text of formatting data
Stores all links that are found (.html, .pdf and .doc) if these have not been encountered previously
Follows html-links recursively (we have separate spiders to parse .pdf and .doc files)
Stores the text in files, numbered consecutively
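The spider steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the "Web" is an in-memory dictionary (`FAKE_WEB`) so the example runs without network access, `crawl` and `PageParser` are hypothetical names, links are followed with a queue rather than by recursion, and .pdf/.doc links are only recorded, since the slides assign them to separate spiders.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the visible text and the href links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seeds, fetch):
    """Visit pages breadth-first; return cleaned texts and skipped non-HTML links."""
    seen = set(seeds)
    queue = list(seeds)
    texts, other_docs = [], []
    while queue:
        url = queue.pop(0)
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        # The position in `texts` plays the role of the consecutive file number.
        texts.append(" ".join(parser.text_parts))
        for link in parser.links:
            if link in seen:
                continue  # store each link only once
            seen.add(link)
            if link.endswith((".pdf", ".doc")):
                other_docs.append(link)  # left to a separate spider
            else:
                queue.append(link)
    return texts, other_docs


# Invented two-page "Web" used instead of real HTTP requests.
FAKE_WEB = {
    "a.html": '<html><body><h1>Home</h1><a href="b.html">B</a> '
              '<a href="paper.pdf">P</a></body></html>',
    "b.html": '<p>We live on the <b>west</b> coast.</p><a href="a.html">back</a>',
}
texts, other_docs = crawl(["a.html"], FAKE_WEB.__getitem__)
print(texts, other_docs)
```

Passing the fetch function as a parameter keeps the crawl logic separate from the download step, so a real HTTP client could be swapped in later.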
2. Index builder
For each word we get the files that contain at least O occurrences of the word. If O is 1, all words are included, but we may use a higher value to avoid (at least some) misspelled words.
  Word → File
For each file we have a list of all words in the file, each word giving the lines in the file where the word occurs.
  File → Word → Lines
All structures are represented as Boolean arrays stored as .txt files.
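The two index structures can be illustrated with ordinary Python dictionaries. This sketch makes several assumptions that differ from the prototype: it keeps everything in memory instead of Boolean arrays in .txt files, `build_index` and `min_occurrences` (standing in for O) are hypothetical names, and O is interpreted as the word's total frequency across all files.

```python
import re
from collections import defaultdict


def build_index(files, min_occurrences=1):
    """Build (word -> files) and (file -> word -> lines) from {file_no: text}."""
    file_to_word_lines = {}
    totals = defaultdict(int)
    for file_no, text in files.items():
        word_lines = defaultdict(list)
        for line_no, line in enumerate(text.splitlines(), start=1):
            for word in re.findall(r"[a-z']+", line.lower()):
                word_lines[word].append(line_no)
                totals[word] += 1
        file_to_word_lines[file_no] = dict(word_lines)

    # Words below the threshold (often misspellings) get no word -> files entry.
    word_to_files = defaultdict(set)
    for file_no, word_lines in file_to_word_lines.items():
        for word in word_lines:
            if totals[word] >= min_occurrences:
                word_to_files[word].add(file_no)
    return dict(word_to_files), file_to_word_lines


# Invented sample files; file 3 contains the misspelling "liev".
files = {
    1: "I live in London\nWe live in Molde",
    2: "I live in Bergen\nLondon is big",
    3: "I liev in Oslo",
}
word_to_files, file_to_word_lines = build_index(files, min_occurrences=2)
print(sorted(word_to_files))
```

With a threshold of 2 the one-off misspelling "liev" drops out of the word index, which is the effect the slide describes for higher values of O.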
In English
2.5 GB text
2,500 files (1 MB each) for raw text
200,000 words (O=10, includes only words with a frequency of 10 or higher) and the same number of text files to show in which files the word occurs
43 million text files with line references (for each word in each file)
No problem for Windows 7
In Norwegian
1 GB text
10,000 files (0.1 MB each) for raw text
550,000 words (O=1, all words) and the same number of text files to show in which files the word occurs
42 million text files with line references (for each word in each file)
3. The analyzer
Finds the frequency of the complete sentence (N words) offered by the user
Parses the files where at least N-1 words of the sentence occur
Replaces one word at a time with a wild card
Collects alternatives
Checks the frequency of each alternative
Calculates a confidence value based on the ratio of frequencies and the similarity between the original word and the alternative (Hirschberg’s algorithm)
Suggests improvements where the alternative sentence gets a higher score than the original
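The scoring step might look roughly like the sketch below. The slides do not give the formula, so everything here is an assumption: `confidence`, `suggest` and the weight `sim_weight` are hypothetical, the frequencies are invented, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in similarity measure instead of Hirschberg’s algorithm.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for an alignment-based measure."""
    return SequenceMatcher(None, a, b).ratio()


def confidence(orig_word, alt_word, orig_freq, alt_freq, sim_weight=0.4):
    """Combine the frequency ratio with word similarity (hypothetical formula)."""
    total = orig_freq + alt_freq
    if total == 0:
        return 0.0
    ratio = alt_freq / total  # 0.5 when both sentences are equally frequent
    return ratio * ((1 - sim_weight) + sim_weight * similarity(orig_word, alt_word))


def suggest(sentence, position, alternatives, orig_freq):
    """Return the improved sentence, or None if no alternative beats the original."""
    if not alternatives:
        return None
    words = sentence.split()
    scored = [
        (confidence(words[position], alt, orig_freq, freq), alt)
        for alt, freq in alternatives.items()
    ]
    score, best = max(scored)
    # The original scores 0.5 against itself (equal frequency, similarity 1).
    if score <= 0.5:
        return None
    words[position] = best
    return " ".join(words)


# Invented frequencies for illustration only.
fixed_1 = suggest("I live at London", 2, {"in": 900}, orig_freq=1)
fixed_2 = suggest("We had ice cream for desert", 5,
                  {"dessert": 120, "lunch": 300}, orig_freq=2)
kept = suggest("I live in London", 2, {"at": 1}, orig_freq=900)
print(fixed_1, "|", fixed_2, "|", kept)
```

The similarity term is what lets "dessert" beat the more frequent but unrelated "lunch", while the frequency term alone is enough to replace "at" with "in".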
Analyzer (example)
I live at London
changed to:
I live in London
Analyzer (example 2)
We had ice cream for desert
changed to:
We had ice cream for dessert
What kind of errors can be found
Typos, as in:
  I have a red far
Spelling, using the wrong word:
  e.g., mixing desert and dessert
Grammar, using the wrong preposition, verb, etc.:
  e.g., mixing in/at/on
Facts:
  Beethoven was born in 1970, corrected to 1770
Punctuation
That is, most types of mistakes that we make when writing.
When the system fails
Examples:
  “We eat avocado” may be corrected to “we eat apples”
  “Neptune is the outer planet in the solar system” may be corrected to “Pluto is the outer planet…”
  When we have date-specific data, as in the sentence “the prime minister of Great Britain is”
In practice these failures will seldom be problematic, as they often address an area where the user is competent. Also,
  a learning system can reduce some of these cases
  a system that takes dates into consideration should help
The prototype
Is only a prototype:
  1 or 2.5 GB is not enough to get a wide range of sentences
  Collecting data from the Web gives a repository with many spelling and grammar errors (and a lot of repeated text)
  The system works too slowly to handle many users
Still, it can correct many types of mistakes, e.g., all the examples that we used in our 2004 paper.
What we need in order to
improve the text repository:
  A text quality checker that ignores text with too many errors
  Or, perhaps better, text repositories based on books, company reports, government reports, scientific papers, …
improve speed:
  A site with many thousands (millions) of simple computers (i.e., a “Google” setup)
  The task is ideal for parallel computing
Parallel computing: MapReduce
A programming model introduced by Dean and Ghemawat at Google
Idea: algorithms that work in parallel on large data sets
In our case:
  The map operation could be applied to each file, offering the frequency of each alternative sentence (one computer can work on one file at a time)
  The reduce operation could take these intermediate results in order to compute the final frequencies
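The map and reduce steps described here can be imitated on one machine. This is a single-process sketch with invented file contents, not a distributed implementation and not Google's API: `map_file` and `reduce_counts` are hypothetical names, and the built-in `map` stands in for sending each file to a separate computer.

```python
from collections import Counter
from functools import reduce


def map_file(text, candidates):
    """Map step: count how often each candidate sentence occurs in one file."""
    lowered = text.lower()
    return Counter({c: lowered.count(c.lower()) for c in candidates})


def reduce_counts(partials):
    """Reduce step: sum the per-file counts into the final frequencies."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Invented file contents; each string stands for one repository file.
files = [
    "I live in London. She said: I live in London too.",
    "I live at London",
    "We live in Molde",
]
candidates = ["I live in London", "I live at London"]

# Every map call only needs its own file, so the calls are independent
# and could run on different machines.
partials = map(lambda text: map_file(text, candidates), files)
totals = reduce_counts(partials)
print(totals)
```

Because the map calls share no state, the work splits naturally by file, which is what makes the task a good fit for parallel computing.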
Discussion
Do we want to write as the majority?
  Yes, when we write in a foreign language
  When we are not too proficient writers
Can we leave everything to the proofing tool?
  No, as with other types of proofing tools, what we get is a suggestion only
  What the tool really does is help the user apply reading competency when writing
Will the system find examples of all sentences?
  No
Why do Google and others not offer this tool?
  Perhaps because it would be very resource demanding (or because they are not smart enough)
What about false positives?
  This (the system indicating expressions that are correct) may be a problem.
Conclusion
With a multicomputer setup and a large repository, many mistakes can be indicated
Works in any language that can be digitized
Can be an offline or online tool (perhaps online will be achievable some time in the future?)
We could have repositories that reflect style (academic, business, social…)?
Big data is becoming important
To analyze buying patterns of customers
Recommendation systems
Traffic patterns for planning new flights or new roads (Norwegian to Molde)
In science (meteorology, medicine, physics, astronomy…)
In many areas
Data is available
From the Web
From user actions on the Web (keywords entered for searching, pages visited…)
From automatic sensors, modern equipment (such as better telescopes), online activities, cameras…
The computers and software are here to analyze the data
That is
BIG DATA can be used to understand many complex processes
Will become an important issue in the next ten years of computing