
Page 1: Chapter 23 Text Processing Bjarne Stroustrup

Chapter 23: Text Processing

Bjarne Stroustrup
www.stroustrup.com/Programming

Page 2: Chapter 23 Text Processing Bjarne Stroustrup

Overview

• Application domains
• Strings
• I/O
• Maps
• Regular expressions

Stroustrup/PPP - Oct'11

Page 3: Chapter 23 Text Processing Bjarne Stroustrup

Now you know the basics

• Really! Congratulations!
• Don’t get stuck with a sterile focus on programming language features
• What matters are programs, applications, what good can you do with programming
  • Text processing
  • Numeric processing
  • Embedded systems programming
  • Banking
  • Medical applications
  • Scientific visualization
  • Animation
  • Route planning
  • Physical design

Page 4: Chapter 23 Text Processing Bjarne Stroustrup

Text processing

• “all we know can be represented as text”
  • And often is
• Books, articles
• Transaction logs (email, phone, bank, sales, …)
• Web pages (even the layout instructions)
• Tables of figures (numbers)
• Mail
• Programs
• Measurements
• Historical data
• Medical records
• …

Example text shown on the slide:

    Amendment I
    Congress shall make no law respecting an establishment of religion,
    or prohibiting the free exercise thereof; or abridging the freedom
    of speech, or of the press; or the right of the people peaceably to
    assemble, and to petition the government for a redress of grievances.

Page 5: Chapter 23 Text Processing Bjarne Stroustrup

String overview

• Strings
  • std::string
    • <string>
    • s.size()
    • s1==s2
  • C-style string (zero-terminated array of char)
    • <cstring> or <string.h>
    • strlen(s)
    • strcmp(s1,s2)==0
  • std::basic_string<Ch>, e.g. Unicode strings
    • typedef std::basic_string<char> string;
  • Proprietary string classes
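
The two main string flavors listed above can be put side by side in a short sketch. The helper names same_text and string_overview_demo are made up for this illustration; the library calls (size(), ==, strlen, strcmp) are the ones named on the slide.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: compare two std::string values.
bool same_text(const std::string& a, const std::string& b)
{
    return a == b;                    // compares the characters
}

// Hypothetical helper: compare two C-style strings.
bool same_text(const char* a, const char* b)
{
    return std::strcmp(a, b) == 0;    // a == b would compare the pointers instead
}

// The expected results are written as assertions.
void string_overview_demo()
{
    std::string s = "text";
    const char* cs = "text";

    assert(s.size() == 4);            // std::string knows its length
    assert(std::strlen(cs) == 4);     // C-style: count up to the terminating zero
    assert(same_text(s, std::string("text")));
    assert(same_text(cs, "text"));
    assert(s == cs);                  // std::string can be compared to a C-style string
}
```

The point of the comparison is that == on two char pointers tests addresses, which is why the C-style interface needs strcmp.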

Page 6: Chapter 23 Text Processing Bjarne Stroustrup

String conversion

Simple to_string

    template<class T> string to_string(const T& t)
    {
        ostringstream os;
        os << t;
        return os.str();
    }

For example:

    string s1 = to_string(12.333);
    string s2 = to_string(1+5*6-99/7);
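
A self-contained version of the conversion above might look as follows. This is a sketch, not the book's code: the namespace name sketch is an assumption, added only so the template cannot be confused with std::to_string, which C++11 later put into <string> and which formats floating-point values differently.

```cpp
#include <cassert>
#include <sstream>
#include <string>

namespace sketch {   // hypothetical namespace; keeps this apart from std::to_string

// Convert any streamable value to its text form.
template<class T>
std::string to_string(const T& t)
{
    std::ostringstream os;
    os << t;              // format t exactly as operator<< would print it
    return os.str();
}

}   // namespace sketch
```

With default stream formatting, sketch::to_string(12.333) yields "12.333" and sketch::to_string(1+5*6-99/7) yields "17" (99/7 is integer division, so the expression is 1+30-14).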

Page 7: Chapter 23 Text Processing Bjarne Stroustrup

String conversion

Simple extract from string

    template<class T> T from_string(const string& s)
    {
        istringstream is(s);
        T t;
        if (!(is >> t)) throw bad_from_string();
        return t;
    }

For example:

    double d = from_string<double>("12.333");
    Matrix<int,2> m = from_string< Matrix<int,2> >("{ {1,2}, {3,4} }");
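
The slide does not show the exception type, so the sketch below assumes a minimal bad_from_string derived from std::bad_cast. The helper converts_to_int is also an assumption, added to make the failure case easy to observe.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <typeinfo>   // std::bad_cast

// Assumed definition; the slide only uses the name.
struct bad_from_string : std::bad_cast { };

// Read a T from the start of s; throw if no T can be read.
template<class T>
T from_string(const std::string& s)
{
    std::istringstream is(s);
    T t;
    if (!(is >> t)) throw bad_from_string();
    return t;
}

// Hypothetical helper: report success or failure instead of throwing.
bool converts_to_int(const std::string& s)
{
    try {
        (void)from_string<int>(s);
        return true;
    }
    catch (const bad_from_string&) {
        return false;
    }
}
```

Note that this version only checks that a value can be read from the front of the string: from_string<int>("12abc") returns 12 and silently ignores "abc".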

Page 8: Chapter 23 Text Processing Bjarne Stroustrup

General stream conversion

    template<typename Target, typename Source>
    Target lexical_cast(Source arg)
    {
        std::stringstream ss;
        Target result;

        if (!(ss << arg)                  // write arg into stream
            || !(ss >> result)            // read result from stream
            || !(ss >> std::ws).eof())    // stuff left in stream?
            throw bad_lexical_cast();

        return result;
    }

    string s = lexical_cast<string>(lexical_cast<double>(" 12.7 "));    // ok

    // works for any type that can be streamed into and/or out of a string:
    XX xx = lexical_cast<XX>(lexical_cast<YY>(XX(whatever)));    // !!!
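
To try the conversion out, the function needs an exception type, which the slide does not define. The sketch below assumes a minimal bad_lexical_cast and adds a hypothetical helper, can_cast, to probe which conversions succeed.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <typeinfo>   // std::bad_cast

// Assumed definition; the slide only uses the name.
struct bad_lexical_cast : std::bad_cast { };

template<typename Target, typename Source>
Target lexical_cast(Source arg)
{
    std::stringstream ss;
    Target result;

    if (!(ss << arg)                  // write arg into the stream
        || !(ss >> result)            // read the result back out
        || !(ss >> std::ws).eof())    // anything but whitespace left over?
        throw bad_lexical_cast();

    return result;
}

// Hypothetical helper: true if the conversion succeeds.
template<typename Target, typename Source>
bool can_cast(Source arg)
{
    try {
        (void)lexical_cast<Target>(arg);
        return true;
    }
    catch (const bad_lexical_cast&) {
        return false;
    }
}
```

Unlike a plain read from a string stream, the final eof() test rejects trailing characters, so "12abc" does not convert to int. One consequence of reading with >> is that a string containing interior whitespace, such as "two words", does not convert to std::string either.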

Page 9: Chapter 23 Text Processing Bjarne Stroustrup

I/O overview

[Diagram: stream class hierarchy. ifstream and istringstream derive from istream; ofstream and ostringstream derive from ostream; iostream derives from both istream and ostream; fstream and stringstream derive from iostream.]

Stream I/O

    in >> x          Read from in into x according to x’s format
    out << x         Write x to out according to x’s format
    in.get(c)        Read a character from in into c
    getline(in,s)    Read a line from in into the string s
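
The four operations in the table can be combined on a single line of input. The record layout (a number, one separator character, then the rest of the line) and the names Parsed, read_record and write_record are assumptions made for this sketch.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical record: "<number><separator><rest of line>"
struct Parsed {
    int number;
    char separator;
    std::string rest;
};

Parsed read_record(std::istream& in)
{
    Parsed p{};
    in >> p.number;               // formatted read: skips whitespace, parses an int
    in.get(p.separator);          // reads exactly one character, whatever it is
    std::getline(in, p.rest);     // reads the remainder of the line (newline discarded)
    return p;
}

std::string write_record(const Parsed& p)
{
    std::ostringstream out;
    out << p.number << p.separator << p.rest;   // formatted writes
    return out.str();
}
```

Because every stream class in the hierarchy shares these operations, the same two functions work unchanged on a file stream or on cin.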

Page 10: Chapter 23 Text Processing Bjarne Stroustrup

Map overview

• Associative containers
  • <map>, <set>, <unordered_map>, <unordered_set>
  • map
  • multimap
  • set
  • multiset
  • unordered_map
  • unordered_multimap
  • unordered_set
  • unordered_multiset
• The backbone of text manipulation
  • Find a word
  • See if you have already seen a word
  • Find information that corresponds to a word
• See example in Chapter 23

Page 11: Chapter 23 Text Processing Bjarne Stroustrup

Map overview

[Diagram: a Mail_file holds a vector<Message>; a multimap<string,Message*> maps names (“John Doe”, “John Doe”, “John Q. Public”) to the corresponding Message objects in the vector.]
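
The arrangement in the diagram, a vector that owns the messages plus a multimap of pointers into it, can be sketched as follows. The Message fields and the function names are assumptions for illustration; the transcript shows only the container types and the names used as keys.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical message type: only the fields this sketch needs.
struct Message {
    std::string sender;
    std::string subject;
};

using Sender_index = std::multimap<std::string, const Message*>;

// Build an index from sender name to message.
// The pointers stay valid only as long as the vector is not modified.
Sender_index index_by_sender(const std::vector<Message>& messages)
{
    Sender_index index;
    for (const Message& m : messages)
        index.insert({m.sender, &m});
    return index;
}

// Collect the subjects of all messages from one sender.
std::vector<std::string> subjects_from(const Sender_index& index,
                                       const std::string& sender)
{
    std::vector<std::string> subjects;
    auto range = index.equal_range(sender);   // all entries with this key
    for (auto p = range.first; p != range.second; ++p)
        subjects.push_back(p->second->subject);
    return subjects;
}
```

A multimap rather than a map is needed because one sender ("John Doe" in the diagram) can appear several times.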

Page 12: Chapter 23 Text Processing Bjarne Stroustrup

A problem: Read a ZIP code

• U.S. state abbreviation and ZIP code
  • two letters followed by five digits

    string s;
    while (cin>>s) {
        if (s.size()==7
            && isalpha(s[0]) && isalpha(s[1])
            && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4])
            && isdigit(s[5]) && isdigit(s[6]))
            cout << "found " << s << '\n';
    }

• Brittle, messy, unique code

Page 13: Chapter 23 Text Processing Bjarne Stroustrup

A problem: Read a ZIP code

• Problems with simple solution
  • It’s verbose (4 lines, 8 function calls)
  • We miss (intentionally?) every ZIP code number not separated from its context by whitespace
    • "TX77845", TX77845-1234, and ATM77845
  • We miss (intentionally?) every ZIP code number with a space between the letters and the digits
    • TX 77845
  • We accept (intentionally?) every ZIP code number with the letters in lower case
    • tx77845
  • If we decided to look for a postal code in a different format we have to completely rewrite the code
    • CB3 0DS, DK-8000 Arhus

Page 14: Chapter 23 Text Processing Bjarne Stroustrup

TX77845-1234

• 1st try: wwddddd
• 2nd (remember -1234): wwddddd-dddd
• What’s “special”?
• 3rd: \w\w\d\d\d\d\d-\d\d\d\d
• 4th (make counts explicit): \w2\d5-\d4
• 5th (and “special”): \w{2}\d{5}-\d{4}
• But -1234 was optional?
• 6th: \w{2}\d{5}(-\d{4})?
• We wanted an optional space after TX
• 7th (invisible space): \w{2} ?\d{5}(-\d{4})?
• 8th (make space visible): \w{2}\s?\d{5}(-\d{4})?
• 9th (lots of space – or none): \w{2}\s*\d{5}(-\d{4})?
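
The final pattern can be checked against a few candidate strings. The sketch below uses std::regex from C++11 instead of the TR1 and Boost versions the slides were written for; the function name is_zip is an assumption.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole string has the form of the 9th pattern:
// two word characters, optional whitespace, five digits, optional -dddd.
bool is_zip(const std::string& s)
{
    static const std::regex pat(R"(\w{2}\s*\d{5}(-\d{4})?)");
    return std::regex_match(s, pat);
}
```

One thing the experiment brings out: \w matches digits and underscores as well as letters, so is_zip("1277845") is also true. If only state abbreviations should be accepted, a stricter class such as [A-Z]{2} would be one option.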

Page 15: Chapter 23 Text Processing Bjarne Stroustrup

Regex library – availability

• Not part of C++98 standard
• Part of “Technical Report 1” 2004
• Part of C++0x
• Ships with
  • VS 9.0 C++, use <regex>, std::tr1::regex
  • GCC 4.3.0, use <tr1/regex>, std::tr1::regex
  • www.boost.org, use <boost/regex.hpp>, boost::regex

Page 16: Chapter 23 Text Processing Bjarne Stroustrup

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>
    #include <fstream>
    using namespace std;

    int main()
    {
        ifstream in("file.txt");    // input file
        if (!in) cerr << "no file\n";

        boost::regex pat("\\w{2}\\s*\\d{5}(-\\d{4})?");    // ZIP code pattern
        cout << "pattern: " << pat << '\n';

        // …
    }

Page 17: Chapter 23 Text Processing Bjarne Stroustrup

  

    int lineno = 0;
    string line;    // input buffer
    while (getline(in,line)) {
        ++lineno;
        boost::smatch matches;    // matched strings go here
        if (boost::regex_search(line, matches, pat)) {
            cout << lineno << ": " << matches[0] << '\n';    // whole match
            if (1<matches.size() && matches[1].matched)
                cout << "\t: " << matches[1] << '\n';    // sub-match
        }
    }
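
For readers with a C++11 compiler, the same loop can be written against std::regex and wrapped in a function that reads from any input stream. The Hit struct and the function name find_zip_codes are assumptions for this sketch; it collects results instead of printing them so they can be inspected.

```cpp
#include <cassert>
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result record for one matching line.
struct Hit {
    int lineno;
    std::string whole;     // matches[0]: the whole match
    std::string suffix;    // matches[1]: the optional -dddd part, empty if absent
};

// Return the first ZIP-code-like match on each line of in.
std::vector<Hit> find_zip_codes(std::istream& in)
{
    const std::regex pat(R"(\w{2}\s*\d{5}(-\d{4})?)");
    std::vector<Hit> hits;
    std::string line;
    int lineno = 0;
    while (std::getline(in, line)) {
        ++lineno;
        std::smatch matches;    // matched strings go here
        if (std::regex_search(line, matches, pat))
            hits.push_back({lineno,
                            matches[0].str(),
                            matches[1].matched ? matches[1].str() : std::string()});
    }
    return hits;
}
```

As in the original loop, regex_search reports only the first match on a line, so a second ZIP code later on the same line goes unreported.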

Page 18: Chapter 23 Text Processing Bjarne Stroustrup

Results

Input:

    address TX77845
    ffff tx 77843 asasasaa
    ggg TX3456-23456
    howdy
    zzz TX23456-3456sss ggg TX33456-1234
    cvzcv TX77845-1234 sdsas
    xxxTx77845xxx
    TX12345-123456

Output:

    pattern: "\w{2}\s*\d{5}(-\d{4})?"
    1: TX77845
    2: tx 77843
    5: TX23456-3456
        : -3456
    6: TX77845-1234
        : -1234
    7: Tx77845
    8: TX12345-1234
        : -1234

Page 19: Chapter 23 Text Processing Bjarne Stroustrup

Regular expression syntax

• Regular expressions have a thorough theoretical foundation based on state machines
  • You can mess with the syntax, but not much with the semantics
• The syntax is terse, cryptic, boring, useful
  • Go learn it
• Examples

    Xa{2,3}                     // Xaa Xaaa
    Xb{2}                       // Xbb
    Xc{2,}                      // Xcc Xccc Xcccc Xccccc …
    \w{2}-\d{4,5}               // \w is a word character (letter, digit, or underscore); \d is a digit
    (\d*:)?(\d+)                // 124:1232321 :123 123
    Subject: (FW:|Re:)?(.*)     // . (dot) matches any character
    [a-zA-Z][a-zA-Z_0-9]*       // identifier
    [^aeiouy]                   // not an English vowel
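
Each example can be tried directly. The sketch below assumes std::regex with its default (ECMAScript) grammar; full_match is a hypothetical convenience wrapper, and it recompiles the pattern on every call, which is fine for experiments but wasteful in a loop.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: does the whole text match the pattern?
bool full_match(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// A few of the examples above, with expected results written as assertions.
void syntax_demo()
{
    assert(full_match("Xa{2,3}", "Xaa"));
    assert(!full_match("Xa{2,3}", "Xaaaa"));           // at most three a's
    assert(full_match("Xc{2,}", "Xccccc"));            // two or more c's
    assert(full_match(R"((\d*:)?(\d+))", ":123"));     // the prefix digits are optional
    assert(full_match("[a-zA-Z][a-zA-Z_0-9]*", "line_2"));
    assert(!full_match("[a-zA-Z][a-zA-Z_0-9]*", "2nd")); // must start with a letter
    assert(!full_match("[^aeiouy]", "a"));
}
```

Raw string literals (R"(...)") avoid doubling every backslash, which is the main source of noise when regular expressions are written as ordinary C++ string literals.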

Page 20: Chapter 23 Text Processing Bjarne Stroustrup

Searching vs. matching

• Searching for a string that matches a regular expression in an (arbitrarily long) stream of data
  • regex_search() looks for its pattern as a substring in the stream
• Matching a regular expression against a string (of known size)
  • regex_match() looks for a complete match of its pattern and the string
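
The difference shows up as soon as the same pattern is given to both functions. This sketch assumes std::regex; the pattern (five digits) and the function names are chosen only for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// One shared pattern: exactly five digits in a row.
const std::regex& five_digits()
{
    static const std::regex pat(R"(\d{5})");
    return pat;
}

// Searching: is there a five-digit run anywhere in the text?
bool contains_five_digits(const std::string& text)
{
    return std::regex_search(text, five_digits());
}

// Matching: does the text consist of exactly five digits and nothing else?
bool is_five_digits(const std::string& text)
{
    return std::regex_match(text, five_digits());
}
```

For the text "call 77845 now", searching succeeds and matching fails; for "77845", both succeed. A six-digit string such as "778456" is found by the search (its first five digits match) but rejected by the match.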

Page 21: Chapter 23 Text Processing Bjarne Stroustrup

Table grabbed from the web

(Column headings are Danish: KLASSE = class, ANTAL DRENGE = number of boys, ANTAL PIGER = number of girls, ELEVER IALT = pupils in total; "Alle klasser" = all classes.)

    KLASSE          ANTAL DRENGE    ANTAL PIGER     ELEVER IALT
    0A              12              11              23
    1A              7               8               15
    1B              4               11              15
    2A              10              13              23
    3A              10              12              22
    4A              7               7               14
    4B              10              5               15
    5A              19              8               27
    6A              10              9               19
    6B              9               10              19
    7A              7               19              26
    7G              3               5               8
    7I              7               3               10
    8A              10              16              26
    9A              12              15              27
    0MO             3               2               5
    0P1             1               1               2
    0P2             0               5               5
    10B             4               4               8
    10CE            0               1               1
    1MO             8               5               13
    2CE             8               5               13
    3DCE            3               3               6
    4MO             4               1               5
    6CE             3               4               7
    8CE             4               4               8
    9CE             4               9               13
    REST            5               6               11
    Alle klasser    184             202             386

• Numeric fields
• Text fields
• Invisible field separators
• Semantic dependencies
  • i.e. the numbers actually mean something
  • first numeric column + second numeric column == third numeric column (boys + girls == total)
  • The last line holds the column sums

Page 22: Chapter 23 Text Processing Bjarne Stroustrup

Describe rows

(In the patterns below, the whitespace right after each "(" is a single, invisible tab character.)

• Header line
  • Regular expression: ^[\w ]+(	[\w ]+)*$
  • As string literal: "^[\\w ]+(	[\\w ]+)*$"
• Other lines
  • Regular expression: ^([\w ]+)(	\d+)(	\d+)(	\d+)$
  • As string literal: "^([\\w ]+)(	\\d+)(	\\d+)(	\\d+)$"
• Aren’t those invisible tab characters annoying?
  • Define a tab character class
• Aren’t those invisible space characters annoying?
  • Use \s
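
One way to make the tab visible is to write it as the escape \t inside the pattern, which the default std::regex grammar accepts. This is a swapped-in alternative to the literal tabs on the slide, not the slide's own notation, and the function names are assumptions.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Header line: tab-separated fields made of word characters and spaces.
bool is_header(const std::string& line)
{
    static const std::regex pat(R"(^[\w ]+(\t[\w ]+)*$)");
    return std::regex_match(line, pat);
}

// Data line: a text field followed by exactly three tab-separated numbers.
bool is_row(const std::string& line)
{
    static const std::regex pat(R"(^([\w ]+)(\t\d+)(\t\d+)(\t\d+)$)");
    return std::regex_match(line, pat);
}
```

Because the separators must be tabs, a line whose fields are separated by spaces ("0A 12 11 23") is rejected by is_row even though it looks the same on screen.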

Page 23: Chapter 23 Text Processing Bjarne Stroustrup

Simple layout check

    int main()
    {
        ifstream in("table.txt");    // input file
        if (!in) error("no input file\n");

        string line;    // input buffer
        int lineno = 0;

        // \t is the tab character that separates the fields
        regex header( "^[\\w ]+(\t[\\w ]+)*$");    // header line
        regex row( "^([\\w ]+)(\t\\d+)(\t\\d+)(\t\\d+)$");    // data line

        // … check layout …
    }

Page 24: Chapter 23 Text Processing Bjarne Stroustrup

Simple layout check

    int main()
    {
        // … open files, define patterns …

        if (getline(in,line)) {    // check header line
            smatch matches;
            if (!regex_match(line, matches, header)) error("no header");
        }

        while (getline(in,line)) {    // check data line
            ++lineno;
            smatch matches;
            if (!regex_match(line, matches, row))
                error("bad line", to_string(lineno));
        }
    }

Page 25: Chapter 23 Text Processing Bjarne Stroustrup

Validate table

    int boys = 0;    // column totals
    int girls = 0;

    while (getline(in,line)) {    // extract and check data
        smatch matches;
        if (!regex_match(line, matches, row)) error("bad line");

        int curr_boy = from_string<int>(matches[2]);    // check row
        int curr_girl = from_string<int>(matches[3]);
        int curr_total = from_string<int>(matches[4]);
        if (curr_boy+curr_girl != curr_total) error("bad row sum");

        if (matches[1]=="Alle klasser") {    // last line; check columns:
            if (curr_boy != boys) error("boys don't add up");
            if (curr_girl != girls) error("girls don't add up");
            return 0;
        }

        boys += curr_boy;
        girls += curr_girl;
    }
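
The header check, layout check and sum check can be combined into one function that reads from any stream. The sketch below makes several assumptions that differ from the slides: it uses std::regex and std::stoi instead of the book's error() and from_string helpers, it reports problems through its return value, and it treats a table without an "Alle klasser" line as an error.

```cpp
#include <cassert>
#include <istream>
#include <regex>
#include <sstream>
#include <string>

// Returns an empty string if the table is consistent,
// otherwise a short description of the first problem found.
std::string validate_table(std::istream& in)
{
    const std::regex header(R"(^[\w ]+(\t[\w ]+)*$)");
    const std::regex row(R"(^([\w ]+)(\t\d+)(\t\d+)(\t\d+)$)");

    std::string line;
    if (!std::getline(in, line) || !std::regex_match(line, header))
        return "no header";

    int boys = 0;     // column totals so far
    int girls = 0;
    while (std::getline(in, line)) {
        std::smatch m;
        if (!std::regex_match(line, m, row)) return "bad line";

        // Each captured number starts with its tab; std::stoi skips leading whitespace.
        int curr_boy = std::stoi(m[2].str());
        int curr_girl = std::stoi(m[3].str());
        int curr_total = std::stoi(m[4].str());
        if (curr_boy + curr_girl != curr_total) return "bad row sum";

        if (m[1] == "Alle klasser") {    // last line: check the column sums
            if (curr_boy != boys) return "boys don't add up";
            if (curr_girl != girls) return "girls don't add up";
            return "";
        }
        boys += curr_boy;
        girls += curr_girl;
    }
    return "no totals line";    // assumption: a missing totals line is an error
}
```

Taking a std::istream& rather than opening a file inside the function makes it possible to run the checks on a string stream as easily as on table.txt.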

Page 26: Chapter 23 Text Processing Bjarne Stroustrup

Application domains

• Text processing is just one domain among many
  • Or even several domains (depending how you count)
  • Browsers, Word, Acrobat, Visual Studio, …
• Image processing
• Sound processing
• Data bases
  • Medical
  • Scientific
  • Commercial
  • …
• Numerics
• Financial
• …