Real World Haskell: Lecture 7 Bryan O’Sullivan 2009-12-09


Page 1: Real World Haskell: Lecture 7

Real World Haskell: Lecture 7

Bryan O’Sullivan

2009-12-09

Page 2: Real World Haskell: Lecture 7

Getting things done

It’s great to dwell so much on purity, but we’d like to maybe use Haskell for practical programming some time.

This leaves us concerned with talking to the outside world.

Page 3: Real World Haskell: Lecture 7

Word count

    import System.Environment (getArgs)
    import Control.Monad (forM_)

    countWords path = do
      content <- readFile path
      let numWords = length (words content)
      putStrLn (show numWords ++ " " ++ path)

    main = do
      args <- getArgs
      mapM_ countWords args

Page 4: Real World Haskell: Lecture 7

New notation!

There was a lot to digest there. Let’s run through it all, from top to bottom.

    import System.Environment (getArgs)

“Import only the thing named getArgs from System.Environment.”

Without an explicit (comma separated) list of names to import, everything that a module exports is imported into this one.
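The two import forms can be put side by side. This is only a sketch: the modules Data.List and Data.Char and the helper name shout are chosen for illustration and do not appear in the lecture.

```haskell
import System.Environment (getArgs)  -- only getArgs
import Data.List (sort, nub)         -- a comma-separated list of names
import Data.Char                     -- no list: everything exported

-- Hypothetical helper: de-duplicate, sort, join and upper-case.
-- toUpper is available because Data.Char was imported without a list.
shout :: [String] -> String
shout args = map toUpper (unwords (sort (nub args)))

main :: IO ()
main = do
  args <- getArgs
  putStrLn (shout args)
```

Listing names explicitly makes it easier to see where each identifier comes from, at the cost of a longer import line.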

Page 5: Real World Haskell: Lecture 7

The do block

Notice that this function’s body starts with the keyword do:

    countWords path = do
      ...

That keyword introduces a series of actions. Each action is somewhat similar to a statement in C or Python.

Page 6: Real World Haskell: Lecture 7

Executing an action and using its result

The first line of our function’s body:

    countWords path = do
      content <- readFile path

This performs the action “readFile path”, and assigns the result to the name “content”.

The special notation “<-” makes it clear that we are executing an action, i.e. not applying a pure function.

Page 7: Real World Haskell: Lecture 7

Applying a pure function

We can use the let keyword inside a do block, and it applies a pure function, but the code that follows does not need to start with an in keyword.

    let numWords = length (words content)
    putStrLn (show numWords ++ " " ++ path)

With both let and <-, the result is immutable as usual, and stays in scope until the end of the do block.

Page 8: Real World Haskell: Lecture 7

Executing an action

This line executes an action, and ignores its return value:

    putStrLn (show numWords ++ " " ++ path)

Page 9: Real World Haskell: Lecture 7

Compare and contrast

Wonder how different imperative programming in Haskell is from other languages?

    def count_words(path):
        content = open(path).read()
        num_words = len(content.split())
        print(repr(num_words) + " " + path)

    countWords path = do
      content <- readFile path
      let numWords = length (words content)
      putStrLn (show numWords ++ " " ++ path)

Page 10: Real World Haskell: Lecture 7

A few handy rules

When you want to introduce a new name inside a do block:

- Use name <- action to perform an action and keep its result.

- Use let name = expression to evaluate a pure expression, and omit the in.
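Both rules, plus a bare action, can be seen together in one small do block. This is a sketch under assumptions: the scratch file name demo.txt and the helper summarise are made up for the example.

```haskell
-- Pure helper (a hypothetical name): no I/O happens here.
summarise :: String -> String
summarise content = show (length (words content)) ++ " words"

main :: IO ()
main = do
  -- A bare action: run it and ignore its result.
  writeFile "demo.txt" "one two three\n"
  -- name <- action: run the action and keep its result.
  content <- readFile "demo.txt"
  -- let name = expression: name a pure value, with no `in`.
  let summary = summarise content
  putStrLn summary
```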

Page 11: Real World Haskell: Lecture 7

More adventures with ghci

If we load our source file into ghci, we get an interesting type signature:

    *Main> :type countWords
    countWords :: FilePath -> IO ()

See the result type of IO ()? That means “this is an action that performs I/O, and which returns nothing useful when it’s done.”
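To see what the () in IO () stands for, it may help to compare with an action that does return something useful. In this sketch, wordCount is a hypothetical variant that is not in the lecture, return is the standard way to produce a result from a do block, and demo.txt is an assumed scratch file.

```haskell
-- Performs I/O and returns nothing useful: the result type is IO ().
countWords :: FilePath -> IO ()
countWords path = do
  content <- readFile path
  putStrLn (show (length (words content)) ++ " " ++ path)

-- Hypothetical variant: performs I/O and returns the count, so IO Int.
wordCount :: FilePath -> IO Int
wordCount path = do
  content <- readFile path
  return (length (words content))

main :: IO ()
main = do
  writeFile "demo.txt" "a b c d\n"
  countWords "demo.txt"      -- nothing to bind
  n <- wordCount "demo.txt"  -- bind the Int result
  print n
```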

Page 12: Real World Haskell: Lecture 7

Main

In Haskell, the entry point to an executable is named main. You are shocked by this, I am sure.

    main = do
      args <- getArgs
      mapM_ countWords args

Instead of main being passed its command line arguments as in C, it uses the getArgs action to retrieve them.
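Because getArgs is an ordinary action, it can be tried out without a real command line. The sketch below uses withArgs, also from System.Environment, to substitute an argument list; the helper describeArgs and the file names are invented for the example.

```haskell
import System.Environment (getArgs, withArgs)

-- Hypothetical pure helper that formats an argument list.
describeArgs :: [String] -> String
describeArgs args = show (length args) ++ " argument(s): " ++ unwords args

main :: IO ()
main = do
  -- The arguments the program was really started with.
  args <- getArgs
  putStrLn (describeArgs args)
  -- withArgs runs an action with a substituted argument list.
  withArgs ["a.txt", "b.txt"] (do
    fake <- getArgs
    putStrLn (describeArgs fake))
```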

Page 13: Real World Haskell: Lecture 7

What’s this mapM business?

The map function can only call pure functions, so it has an equivalent named mapM that maps an impure action over a list of arguments and returns the list of results.

The mapM function has a cousin, mapM_, that throws away the result of each action it performs.

In other words, this is one way to perform a loop over a list in Haskell.

“mapM_ countWords args” means “apply countWords to every element of args in turn, and throw away each result.”
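The difference between mapM and mapM_ shows up most clearly with an action whose result is worth keeping. A minimal sketch, where announce is a made-up action for the demonstration:

```haskell
-- Hypothetical action: print a word and return its length.
announce :: String -> IO Int
announce w = do
  putStrLn w
  return (length w)

main :: IO ()
main = do
  -- mapM collects the result of each action into a list.
  lens <- mapM announce ["purity", "is", "great"]
  print lens
  -- mapM_ runs the same actions but throws the results away.
  mapM_ announce ["purity", "is", "great"]
```

When the results are all (), as with countWords, there is nothing worth collecting, which is why mapM_ is the natural choice in main.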

Page 14: Real World Haskell: Lecture 7

Compare and contrast II, electric boogaloo

These don’t look as similar as their predecessors:

    def main():
        for name in sys.argv[1:]:
            count_words(name)

    main = do
      args <- getArgs
      mapM_ countWords args

I wonder if we could change that.

Page 15: Real World Haskell: Lecture 7

Idiomatic word count in Python

If we were writing “real” Python code, it would look more like this:

    def main():
        for path in sys.argv[1:]:
            c = open(path).read()
            print(len(c.split()), path)

Page 16: Real World Haskell: Lecture 7

Meet forM

In the Control.Monad module, there are two functions named forM and forM_. They are nothing more than mapM and mapM_ with their arguments flipped.

In other words, these are identical:

    mapM_ countWords args
    forM_ args countWords

That seems a bit gratuitous. Why should we care?
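The flipped-argument relationship can be spelled out directly. In this sketch, myForM_ is a hypothetical re-definition specialised to lists and IO, written only to make the point; the real forM_ in Control.Monad is more general.

```haskell
import Control.Monad (forM_)

-- Hypothetical re-definition: mapM_ with its arguments swapped.
myForM_ :: [a] -> (a -> IO b) -> IO ()
myForM_ xs f = mapM_ f xs

main :: IO ()
main = do
  let names = ["a.txt", "b.txt"]
  mapM_ putStrLn names    -- function first, list second
  forM_ names putStrLn    -- list first, function second
  myForM_ names putStrLn  -- same behaviour as forM_ here
```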

Page 17: Real World Haskell: Lecture 7

Function application as an operator

In our last lecture, we were introduced to function composition:

    f . g = \x -> f (g x)

We can also write a function to apply a function:

    f $ x = f x

This operator has a very low precedence, so we can use it to get rid of parentheses. Sometimes this makes code easier to read:

    putStrLn (show numWords ++ " " ++ path)
    putStrLn $ show numWords ++ " " ++ path
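A short experiment shows the effect of the low precedence. The values of numWords and path below are placeholders chosen for the example.

```haskell
main :: IO ()
main = do
  let numWords = 3 :: Int
      path = "demo.txt"
  -- These two lines are equivalent: ++ binds tighter than $,
  -- so everything to the right of $ becomes putStrLn's argument.
  putStrLn (show numWords ++ " " ++ path)
  putStrLn $ show numWords ++ " " ++ path
  -- Chains of $ group to the right: f $ g $ x means f (g x).
  print $ length $ words "one two three"
```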

Page 18: Real World Haskell: Lecture 7

Idiomatic word counting in Haskell

See what’s different about this word counting?

    main = do
      args <- getArgs
      forM_ args $ \arg -> do
        content <- readFile arg
        let len = length (words content)
        putStrLn (show len ++ " " ++ arg)

Doesn’t that use of forM_ look remarkably like a for loop in some other language? That’s because it is one.

Page 19: Real World Haskell: Lecture 7

The reason for the $

Notice that the body of the forM_ loop is an anonymous function of one argument.

We put the $ in there so that we wouldn’t have to either wrap the entire function body in parentheses, or split it out and give it a name.

Page 20: Real World Haskell: Lecture 7

The good

Here’s our original code, using the $ operator:

    forM_ args $ \arg -> do
      content <- readFile arg
      let len = length (words content)
      putStrLn (show len ++ " " ++ arg)

Page 21: Real World Haskell: Lecture 7

The bad

If we omit the $, we could use parentheses:

    forM_ args (\arg -> do
      content <- readFile arg
      let len = length (words content)
      putStrLn (show len ++ " " ++ arg))

Page 22: Real World Haskell: Lecture 7

And the ugly

Or we could give our loop body a name:

    let body arg = do
          content <- readFile arg
          let len = length (words content)
          putStrLn (show len ++ " " ++ arg)

    forM_ args body

Giving such a trivial single-use function a name seems gratuitous.

Nevertheless, it should be clear that all three pieces of code are identical in their operation.

Page 23: Real World Haskell: Lecture 7

Trying it out

Let’s assume we’ve saved our source file as WC.hs, and give it a try:

    $ ghc --make WC
    [1 of 1] Compiling Main ( WC.hs, WC.o )
    Linking WC ...

    $ du -h ascii.txt
    58M ascii.txt

    $ time ./WC ascii.txt
    9873630 ascii.txt

    real 0m8.043s

Page 24: Real World Haskell: Lecture 7

Comparison shopping

How does the performance of our WC program compare with the system’s built-in wc command?

    $ export LANG=C
    $ time wc -w ascii.txt
    9873630 ascii.txt

    real 0m0.447s

Ouch! The C version is almost 18 times faster.

Page 25: Real World Haskell: Lecture 7

A second try

Does it help if we recompile with optimisation?

    $ ghc -fforce-recomp -O --make WC
    $ time ./WC ascii.txt
    9873630 ascii.txt

    real 0m7.696s

So that made our code 5% faster. Ugh.

Page 26: Real World Haskell: Lecture 7

What’s going on here?

Remember that in Haskell, a string is a list. And a list is represented as a linked list.

This means that every character gets its own list element, and list elements are not allocated contiguously. When each element is a large data structure, the per-element list overhead is negligible, but for individual characters, it’s a total killer.
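That a string really is an ordinary list can be checked directly. A minimal sketch:

```haskell
main :: IO ()
main = do
  -- A String is a list of Char, built one cons cell per character.
  let s = 'w' : 'c' : []
  print (s == "wc")
  -- Ordinary list functions therefore work on strings unchanged.
  print (length "words")
  print (reverse "words")
```

Each (:) here corresponds to one separately allocated list cell, which is the per-character overhead described above.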

So what’s to be done?

Enter the bytestring.

Page 27: Real World Haskell: Lecture 7

The original code

    main = do
      args <- getArgs
      forM_ args $ \arg -> do
        content <- readFile arg
        let len = length (words content)
        putStrLn (show len ++ " " ++ arg)

Page 28: Real World Haskell: Lecture 7

The bytestring code

A bytestring is a contiguously-allocated array of bytes. Because there’s no pointer-chasing overhead, this should be faster.

    import Control.Monad (forM_)
    import System.Environment (getArgs)
    import qualified Data.ByteString.Char8 as B

    main = do
      args <- getArgs
      forM_ args $ \arg -> do
        content <- B.readFile arg
        let len = length (B.words content)
        putStrLn (show len ++ " " ++ arg)

Notice the import qualified: this allows us to write B instead of Data.ByteString.Char8 wherever we want to use a name imported from that module.
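Qualified names let the bytestring functions coexist with the Prelude functions of the same name. The sketch below works on an in-memory value rather than a file; countWordsBS is a hypothetical helper, and B.pack is used only to build sample data.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper: count the words in a bytestring.
-- B.words splits the bytestring; the unqualified length is the
-- Prelude's list function, applied to the resulting list of words.
countWordsBS :: B.ByteString -> Int
countWordsBS = length . B.words

main :: IO ()
main = do
  let bytes = B.pack "enter the bytestring"
  print (countWordsBS bytes)  -- number of words
  print (B.length bytes)      -- number of bytes
```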

Page 29: Real World Haskell: Lecture 7

So is it faster?

How does this code perform?

    $ time ./WC ascii.txt
    9873630 ascii.txt

    real 0m8.043s

    $ time ./WC-BS ascii.txt
    9873630 ascii.txt

    real 0m1.434s

Not bad! We’re 6x faster than the String code, and now just 3x slower than the C code.
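The ratios quoted so far can be recomputed from the wall-clock times in the transcripts. This is a small sketch; speedup is a hypothetical helper, and the three timings are copied from the runs shown earlier.

```haskell
import Text.Printf (printf)

-- Ratio of a slower wall-clock time to a faster one, both in seconds.
speedup :: Double -> Double -> Double
speedup slow fast = slow / fast

main :: IO ()
main = do
  let stringSecs     = 8.043  -- ./WC (String)
      byteStringSecs = 1.434  -- ./WC-BS (ByteString)
      wcSecs         = 0.447  -- wc -w
  printf "wc vs String:         %.1fx\n" (speedup stringSecs wcSecs)
  printf "ByteString vs String: %.1fx\n" (speedup stringSecs byteStringSecs)
  printf "wc vs ByteString:     %.1fx\n" (speedup byteStringSecs wcSecs)
```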

Page 30: Real World Haskell: Lecture 7

Seriously? Bytes for text?

There is, of course, a snag to using bytestrings: they’re strings of bytes, not characters.

This is the 21st century, and everyone should be using Unicode now, right?

Our answer to this problem in Haskell is to use a package named text, which provides the Data.Text module.

Page 31: Real World Haskell: Lecture 7

Unicode-aware word count

    import Control.Monad (forM_)
    import System.Environment (getArgs)
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8)
    import qualified Data.ByteString.Char8 as B

    main = do
      args <- getArgs
      forM_ args $ \arg -> do
        bytes <- B.readFile arg
        let content = decodeUtf8 bytes
            len = length (T.words content)
        putStrLn (show len ++ " " ++ arg)

Page 32: Real World Haskell: Lecture 7

What happens here?

Notice that we still use bytestrings to read the initial data in.

Now, however, we use decodeUtf8 to turn the raw bytes from UTF-8 into the Unicode representation that Data.Text uses internally.

We then use Data.Text’s words function to split the big string into a list of words.
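The gap between bytes and characters is easy to see on a small sample. In the sketch below, encodeUtf8 stands in for reading a file, the Greek phrase is made-up sample data, and countWordsUtf8 is a hypothetical helper. Note that decodeUtf8 raises an exception if its input is not valid UTF-8.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- Hypothetical helper: decode UTF-8 bytes, then count the words.
countWordsUtf8 :: B.ByteString -> Int
countWordsUtf8 = length . T.words . decodeUtf8

main :: IO ()
main = do
  -- Stand-in for file contents: two Greek words as UTF-8 bytes.
  let bytes = encodeUtf8 (T.pack "καλημέρα κόσμε")
      content = decodeUtf8 bytes
  print (B.length bytes)        -- number of bytes
  print (T.length content)      -- number of characters
  print (countWordsUtf8 bytes)  -- number of words
```

Each Greek letter takes more than one byte in UTF-8, so the byte count is larger than the character count, which is why the 196M file in the next comparison holds far fewer characters than bytes.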

Page 33: Real World Haskell: Lecture 7

Comparing Unicode performance

For comparison, let’s first try a Unicode-aware word count in C, on a file containing 112.6 million characters of UTF-8-encoded Greek:

    $ du -h greek.txt
    196M greek.txt

    $ export LANG=en_US.UTF-8
    $ time wc -w greek.txt
    16917959 greek.txt

    real 0m8.306s

    $ time ./WC-T greek.txt
    16917959 greek.txt

    real 0m7.350s

Page 34: Real World Haskell: Lecture 7

What did we just see?

Wow! Our tiny Haskell program is actually 13% faster than the system’s wc command!

This suggests that if we choose the right representation, we can write real-world code that is both brief and highly efficient.

This ought to be immensely cheering.