Starting from Lazy I/O (Haskell Notes 6)

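1. Lazy I/O and buffering
In Haskell, I/O can be lazy as well. For example:

import System.IO

readThisFile = withFile "./data/lines.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    putStr contents
  )

Reading a file from disk this way does not load the whole file into memory at once; the contents are streamed in piece by piece as they are consumed. How large each piece is depends on the handle's buffering mode: with line-buffering one line is read at a time, and with block-buffering one chunk is read at a time, where the chunk size depends on the operating system. The default mode is implementation-dependent; typically terminals are line-buffered and files on disk are block-buffered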

Line-buffering and block-buffering are represented by values of type BufferMode:

data BufferMode
  = NoBuffering
  | LineBuffering
  | BlockBuffering (Maybe Int)
    -- Defined in ‘GHC.IO.Handle.Types’

BufferMode has three constructors, NoBuffering, LineBuffering and BlockBuffering (Maybe Int), meaning no buffering, line-buffering and block-buffering respectively. The Maybe Int is the chunk size in bytes; passing Nothing uses the system's default chunk size. NoBuffering reads one character at a time, which hits the disk very frequently, so it is rarely used

The BufferMode can be set by hand, for example:

readThisFileInBlockMode = withFile "./data/lines.txt" ReadMode (\handle -> do
    hSetBuffering handle $ BlockBuffering (Just 1024)
    contents <- hGetContents handle
    putStr contents
  )

This reads 1024 bytes (1 KB) at a time. The type of hSetBuffering is:

hSetBuffering :: Handle -> BufferMode -> IO ()

It takes a file handle and a BufferMode value and returns an I/O action that yields ()

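Where there is a buffer, there has to be a way to flush it, hence hFlush:

hFlush :: Handle -> IO ()

It forces the data buffered for a handle to be written out immediately, instead of waiting until the buffer fills up or some automatic flush is triggered (under line-buffering, for instance, a newline triggers a flush)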
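
As a rough sketch of how hSetBuffering and hFlush fit together (the helper name writeFlushed and the 1024-byte buffer size are made up for illustration), the following writes one line through a block-buffered handle, flushes it by hand, and reports the buffering mode that was in effect:

```haskell
import System.IO

-- Write one line through a block-buffered handle and flush it manually.
-- Returns the buffering mode reported by the handle.
-- (writeFlushed is a hypothetical helper, not part of System.IO.)
writeFlushed :: FilePath -> String -> IO BufferMode
writeFlushed path msg =
  withFile path WriteMode $ \handle -> do
    hSetBuffering handle (BlockBuffering (Just 1024))
    hPutStrLn handle msg          -- may only land in the buffer for now
    hFlush handle                 -- push the buffered bytes out immediately
    hGetBuffering handle          -- query the mode that was set above
```

withFile closes (and therefore flushes) the handle anyway, so the explicit hFlush matters mostly for long-lived handles, such as printing a prompt on stdout before waiting for input.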

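P.S. There is a vivid, if not very elegant, analogy:

Your toilet flushes automatically once the tank holds a gallon of water. So you keep pouring water in until it reaches a gallon, the toilet flushes by itself, and the data in the water gets seen. But you can also press the flush button by hand, which flushes whatever water is there at the moment. That act of flushing is what the name hFlush means.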

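2. Data.ByteString
If reading a file from the system calls for buffering for the sake of performance, what about after the data is in memory? How should it be stored, and how should it be manipulated?

ByteString looks like a new data type, but don't we already have String?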

Lazy lists
String is an alias for a list of Char, and lists are lazy, so:

str = "abc"
charList = ['a', 'b', 'c']
charList' = 'a' : 'b' : 'c' : []

> str == charList && charList == charList'
True

Declaring the string "abc" is only a promise that we will have a list of Char. So when do we actually have (or create) that list?

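When it has to be evaluated, for example by the == comparison above:

instance (Eq a) => Eq [a] where
  {-# SPECIALISE instance Eq [Char] #-}
  []     == []     = True
  (x:xs) == (y:ys) = x == y && xs == ys
  _xs    == _ys    = False

(excerpted from GHC.Classes)

Pattern matching walks the two lists from left to right, comparing elements one at a time. Each step takes the head of the list, and only at that point is the list really needed and "created"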

In JS, which is not lazy, this would look like:

function unshift(x, xs) {
  return [x].concat(xs);
}
const str = 'abc';
const charList = unshift('a', unshift('b', unshift('c', [])));
function eq(s, a) {
  if (!s.length && !a.length) return true;
  return s[0] == a[0] && eq(s.slice(1), a.slice(1));
}
// test
eq(str, charList);

But unlike eagerly evaluated JS, Haskell is lazy, so what actually happens is closer to:

const EMPTY_LIST = {
  value: Symbol.for('_EMPTY_LIST_'),
  tail: () => EMPTY_LIST
};
function unshift(x, xs) {
  return { value: x, tail: () => xs };
}
function sugar(str) {
  return str.split('')
    .reduceRight((a, v) => a.concat([v]), [])
    .reduce((a, v) => unshift(v, a), EMPTY_LIST);
}
const str = sugar('abc');
const charList = unshift('a', unshift('b', unshift('c', EMPTY_LIST)));
function eq(s, a) {
  if (s === EMPTY_LIST && a === EMPTY_LIST) return true;
  return s.value == a.value && eq(s.tail(), a.tail());
}
// test
eq(str, charList);

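A "lazy" linked list simulates a list that is only created when it is really needed. In the same way, 'a' : 'b' : 'c' : [] "promises" a list starting with 'a'; how long that list is and how much space it takes are unknown until it actually has to be evaluated (and need not be known, which is why infinitely long lists are allowed without any worry about how to store them)

This laziness is not free, though. The direct cost is efficiency, especially with very long lists (reading a file, for example). Handling a single "promise" (tail() in the simulation) may be cheap, but once a large pile of promises has accumulated, the cost of working through them becomes noticeable and real-world performance suffers. To address this, just as foldl' was introduced as the strict (non-lazy) version of foldl, ByteString was introduced

P.S. The "promise" mentioned above has a proper name in Haskell: thunk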
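
The foldl / foldl' comparison can be made concrete. The sketch below is only illustrative (the names sumLazy and sumStrict are made up here): both functions compute the same sum, but without optimisation foldl first builds a chain of unevaluated (+) thunks, while foldl' forces the accumulator at every step.

```haskell
import Data.List (foldl')

-- Lazy left fold: the accumulator can grow into a chain of thunks,
-- ((((0 + 1) + 2) + 3) + ...), that is only evaluated at the very end.
sumLazy :: [Integer] -> Integer
sumLazy = foldl (+) 0

-- Strict left fold: the accumulator is forced at each step,
-- so no chain of thunks builds up.
sumStrict :: [Integer] -> Integer
sumStrict = foldl' (+) 0
```

On short lists the two are indistinguishable; the difference in memory use only shows up on very long inputs, and an optimising compiler may remove it altogether.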

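ByteString
Each element of a ByteString is one byte (8 bits). It comes in two flavours, lazy and strict:

Lazy: Data.ByteString.Lazy. It is still lazy, but a little more eager than a list: thunks are per chunk rather than per element (64K per chunk is the figure quoted here; the exact default is an implementation detail of the bytestring library), which reduces the number of thunks produced to some extent

Strict: Data.ByteString. It produces no thunks at all and represents a contiguous sequence of bytes, so there is no such thing as an infinite strict bytestring, and it lacks the memory advantage of a lazy list

A lazy bytestring is like a list of chunks (each element of the list being a strict bytestring of up to 64K). It reduces the performance cost of laziness while keeping the memory advantage of laziness, so the lazy version is used most of the time

P.S. The 64K size is chosen for a reason:

64K has a high probability of fitting into your CPU's L2 cache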

Common functions
A ByteString is effectively another kind of list, so most list functions have same-named counterparts for ByteString, for example:

head, tail, init, null, length, map, reverse, foldl, foldr, concat, takeWhile, filter

So the first thing to do is avoid name clashes by importing qualified:

-- lazy ByteString
import qualified Data.ByteString.Lazy as B
-- strict ByteString
import qualified Data.ByteString as S

Creating a ByteString:

-- list of Word8 to ByteString
B.pack :: [GHC.Word.Word8] -> ByteString
-- strict ByteStrings to a lazy ByteString
B.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString

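Word8 is like an Int with a smaller range (0 to 255; like Int, it is an instance of Num). For example:

> B.pack [65, 66, 67]
"ABC"
> B.fromChunks [S.pack [65, 66, 67], S.pack [97, 98, 99]]
"ABCabc"

Note that fromChunks strings the given strict bytestrings together into a list of chunks; it does not first concatenate them and then pack the result into 64K blocks. If you have a bunch of small strict bytestrings and do not want to concatenate them into one block of memory, this is a way to chain them together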
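
The chunk structure can be inspected with B.toChunks. The following is a small sketch (the names chained, chunkLengths and flattened are invented for the example) showing that the two strict pieces stay separate inside the lazy bytestring until they are explicitly copied into one with B.toStrict:

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as B

-- Two small strict bytestrings chained into one lazy bytestring.
chained :: B.ByteString
chained = B.fromChunks [S.pack [65, 66, 67], S.pack [97, 98, 99]]

-- Length of each underlying chunk; the chunks are not merged.
chunkLengths :: [Int]
chunkLengths = map S.length (B.toChunks chained)

-- Copy everything into one contiguous strict bytestring.
flattened :: S.ByteString
flattened = B.toStrict chained
```

B.toStrict has to allocate and copy, so it is best kept for the point where a single contiguous buffer is really needed.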

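Inserting an element:

B.cons :: GHC.Word.Word8 -> B.ByteString -> B.ByteString
B.cons' :: GHC.Word.Word8 -> B.ByteString -> B.ByteString

cons is the counterpart of the list operator :, used to prepend an element. It is lazy as well (it adds a new chunk even if the first chunk has room for the new element), whereas cons' is its strict version, which fills the remaining space of the first chunk first. The difference looks like this:

> Prelude.foldr B.cons B.empty [50..60]
Chunk "2" (Chunk "3" (Chunk "4" (Chunk "5" (Chunk "6" (Chunk "7" (Chunk "8" (Chunk "9" (Chunk ":" (Chunk ";" (Chunk "<" Empty))))))))))
> Prelude.foldr B.cons' B.empty [50..60]
Chunk "23456789:;<" Empty

P.S. Older versions show the difference as above. From 0.10.0.1 onwards (a bytestring library version number) the Show instance prints something like a string literal, so the difference is no longer visible. See Haskell: Does ghci show “Chunk .. Empty”?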

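Reading and writing files:

-- strict: reads the whole file into memory at once
S.readFile :: FilePath -> IO S.ByteString
-- lazy: reads chunk by chunk, on demand
B.readFile :: FilePath -> IO B.ByteString
-- strict: writes everything in one go
S.writeFile :: FilePath -> S.ByteString -> IO ()
-- lazy: writes chunk by chunk
B.writeFile :: FilePath -> B.ByteString -> IO ()

In practice, ByteString and String can be converted into each other easily in most situations, so it is fine to start with String and switch to ByteString only where performance turns out to be a problem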
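
One common way to do that conversion is the Data.ByteString.Char8 module, which exposes the same strict ByteString type through a Char-based interface. A minimal sketch follows; the helper names toBytes, fromBytes and countLines are invented for the example, and note that Char8 keeps only the low 8 bits of each Char, so it is only safe for ASCII or Latin-1 text:

```haskell
import qualified Data.ByteString.Char8 as C

-- String -> strict ByteString (each Char is truncated to 8 bits).
toBytes :: String -> C.ByteString
toBytes = C.pack

-- Strict ByteString -> String.
fromBytes :: C.ByteString -> String
fromBytes = C.unpack

-- Count lines using the ByteString version of lines.
countLines :: C.ByteString -> Int
countLines = length . C.lines
```

For text that is not limited to 8-bit characters, an explicit encoding step (UTF-8, for instance) would be needed instead of Char8.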

P.S. For more ByteString functions, see Data.ByteString

3. Command-line arguments
Besides interactive input and reading files, command-line arguments are another important way to get user input:

-- readWhat.hs
import System.Environment
import System.IO

main = do
  args <- getArgs
  contents <- readFile (args !! 0)
  putStr contents

Trying it out:

$ ghc --make ./readWhat.hs
[1 of 1] Compiling Main             ( readWhat.hs, readWhat.o )
Linking readWhat ...
$  ./readWhat ./data/lines.txt
hoho, this is xx.
who's that ?
$ ./readWhat ./data/that.txt
contents in that file
another line
last line

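That gives us the basic functionality of cat. The type of getArgs is:

getArgs :: IO [String]

It lives in the System.Environment module and returns the command-line arguments as a list of String, wrapped in an I/O action. Similar functions include:

-- get the program name (the name of the executable)
getProgName :: IO String
-- get the absolute path of the current executable
getExecutablePath :: IO FilePath
-- set an environment variable
setEnv :: String -> String -> IO ()
-- get an environment variable
getEnv :: String -> IO String

P.S. For more environment-related functions, see System.Environment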

For example:

import System.IO
import System.Environment

main = do
  progName <- getProgName
  args <- getArgs
  pwd <- getExecutablePath
  setEnv "NODE_ENV" "production"
  nodeEnv <- getEnv "NODE_ENV"
  putStrLn pwd
  putStrLn ("NODE_ENV " ++ nodeEnv)
  putStrLn (progName ++ (foldl (++) "" $ map (" " ++) args))

Trying it out:

$ ghc --make ./testArgs
[1 of 1] Compiling Main             ( testArgs.hs, testArgs.o )
Linking testArgs ...
$ ./testArgs -a --p path
/absolute/path/to/testArgs
NODE_ENV production
testArgs -a --p path

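P.S. Besides compiling with ghc --make sourceFile and then running the executable, the source can also be run directly:

$ runhaskell testArgs.hs -b -c
/absolute/path/to/ghc-8.0.1/bin/ghc
NODE_ENV production
testArgs.hs -b -c

In this case getExecutablePath returns the absolute path of ghc (the executable)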
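
One thing the readWhat example glosses over is that args !! 0 fails when no argument is given. A possible way to handle that, sketched here with invented helper names (pickFile, fileFromArgs, demo), is to pattern-match on the argument list and fall back to a usage message:

```haskell
import System.Environment (getArgs, getProgName, withArgs)

-- Pure helper: pick the first argument, or build a usage message.
-- (pickFile is a name invented for this sketch.)
pickFile :: String -> [String] -> Either String FilePath
pickFile _    (path:_) = Right path
pickFile prog []       = Left ("usage: " ++ prog ++ " FILE")

-- Read the real command line and apply the helper.
fileFromArgs :: IO (Either String FilePath)
fileFromArgs = pickFile <$> getProgName <*> getArgs

-- withArgs runs an action with a substituted argument list,
-- which makes the behaviour easy to try out in GHCi.
demo :: IO (Either String FilePath)
demo = withArgs ["./data/lines.txt"] fileFromArgs
```

Keeping the decision in a pure function such as pickFile means it can be checked without touching the real command line.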

4. Random numbers
Apart from I/O, another scenario that is bound to be impure is random numbers. So can a pure function produce random numbers?

Pseudo-random numbers are feasible. As in C, a "seed" has to be supplied:

random :: (Random a, RandomGen g) => g -> (a, g)

The instances of Random and of RandomGen (the seed type) are:

instance Random Word -- Defined in ‘System.Random’
instance Random Integer -- Defined in ‘System.Random’
instance Random Int -- Defined in ‘System.Random’
instance Random Float -- Defined in ‘System.Random’
instance Random Double -- Defined in ‘System.Random’
instance Random Char -- Defined in ‘System.Random’
instance Random Bool -- Defined in ‘System.Random’
instance RandomGen StdGen -- Defined in ‘System.Random’
data StdGen
  = System.Random.StdGen {-# UNPACK #-}GHC.Int.Int32
                         {-# UNPACK #-}GHC.Int.Int32
    -- Defined in ‘System.Random’

P.S. Word is an unsigned integer type with the same width as Int (fixed-width variants are Word8, Word16, Word32 and Word64). See Int vs Word in common use?

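Numbers, characters, booleans and so on can all have random values. The seed has to be produced with the dedicated function mkStdGen :: Int -> StdGen, for example:

> random (mkStdGen 7) :: (Int, StdGen)
(5401197224043011423,33684305 2103410263)
> random (mkStdGen 7) :: (Int, StdGen)
(5401197224043011423,33684305 2103410263)

It really is a pure function, so two calls give exactly the same result (not because they are made back to back; calling it again weeks later gives the same result). The type annotation tells random which type of random value is expected. Let's try some others:

> random (mkStdGen 7) :: (Bool, StdGen)
(True,320112 40692)
> random (mkStdGen 7) :: (Float, StdGen)
(0.34564054,2071543753 1655838864)
> random (mkStdGen 7) :: (Char, StdGen)
('\279419',320112 40692)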

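Each call to random also returns the next seed, so we can do this:

import System.Random

random3 i = collectNext $ collectNext $ [random $ mkStdGen i]
  where collectNext xs@((i, g):_) = xs ++ [random g]

Trying it out:

> random3 100
[(-3633736515773289454,693699796 2103410263),(-1610541887407225575,136012003 1780294415),(-1610541887407225575,136012003 1780294415)]
> (random3 100) :: [(Bool, StdGen)]
[(True,4041414 40692),(False,651872571 1655838864),(False,651872571 1655838864)]
> [b | (b, g) <- (random3 100) :: [(Bool, StdGen)]]
[True,False,False]

P.S. Note that (random3 100) :: [(Bool, StdGen)] only constrains the return type of random3; the compiler is able to infer that random $ mkStdGen i must have type (Bool, StdGen)

This starts to look (pseudo-)random. Because random is a pure function, the only way to get a different return value is to pass a different seed. Note, however, that the second and third entries above are identical: collectNext pattern-matches on the first element of the list, so both calls reuse the generator returned by the first draw. To get a fresh value each time, the generator returned by the most recent draw has to be passed on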
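
The idea of passing the newest generator on to the next draw can be sketched without depending on the random package. In the block below, Gen and nextInt are hypothetical stand-ins for StdGen and random (a textbook linear congruential generator, not the algorithm System.Random uses, and assuming a 64-bit Int), and draws threads each returned generator into the following step with unfoldr:

```haskell
import Data.List (unfoldr)

-- Hypothetical stand-in for StdGen: the generator state is a plain Int.
newtype Gen = Gen Int deriving (Eq, Show)

-- Stand-in for random: returns a value together with the next generator.
-- This is a textbook linear congruential generator, not System.Random's algorithm.
nextInt :: Gen -> (Int, Gen)
nextInt (Gen s) = ((s' `div` 65536) `mod` 32768, Gen s')
  where s' = (s * 1103515245 + 12345) `mod` 2147483648

-- Draw n values, feeding the newest generator into each following draw.
draws :: Int -> Gen -> [Int]
draws n = take n . unfoldr (Just . nextInt)
```

With System.Random the same shape should apply: replace nextInt with random and Gen with StdGen, so that every draw starts from the generator returned by the previous one.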

There is in fact a simpler way:

random3' i = take 3 $ randoms $ mkStdGen i

> random3' 100 :: [Bool]
[True,False,False]

Here randoms :: (Random a, RandomGen g) => g -> [a] takes a RandomGen and returns an infinite sequence of Random values

Other commonly used functions:

-- returns a random value in the range [min, max]
randomR :: (Random a, RandomGen g) => (a, a) -> g -> (a, g)
-- like randomR, but returns an infinite sequence
randomRs :: (Random a, RandomGen g) => (a, a) -> g -> [a]

For example:

> randomR ('a', 'z') (mkStdGen 1)
('x',80028 40692)
> take 24 $ randomRs (1, 6) (mkStdGen 1)
[6,5,2,6,5,2,3,2,5,5,4,2,1,2,5,6,3,3,5,5,1,4,3,3]

P.S. For more random-number functions, see System.Random

Dynamic seeds
A hard-coded seed returns the same sequence of random numbers every time, which is not very useful, so we need a dynamic seed (the system time, for instance):

getStdGen :: IO StdGen

When the program runs, getStdGen asks the system for a random generator and stores it as the global generator

For example:

import System.Random

main = do
  g <- getStdGen
  print $ take 10 (randoms g :: [Bool])

Trying it out:

$ ghc --make rand.hs
[1 of 1] Compiling Main             ( rand.hs, rand.o )
Linking rand ...
$ ./rand
[False,False,True,False,False,True,False,True,False,False]
$ ./rand
[True,False,False,False,True,False,False,False,True,True]
$ ./rand
[True,True,True,False,False,True,True,False,False,True]

Note that in GHCi, getStdGen always returns the same seed, just as repeated calls to getStdGen within one program would, so the same sequence of random values comes back every time:

> getStdGen
1661435168 1
> getStdGen
1661435168 1
> main
[False,False,False,False,True,False,False,False,True,True]
> main
[False,False,False,False,True,False,False,False,True,True]

You can either take a later part of the infinite sequence by hand, or use the function newStdGen :: IO StdGen:

> newStdGen
1018152561 2147483398
> newStdGen
1018192575 40691

newStdGen splits the current global generator into two random generators, installs one of them as the global generator and returns the other. So:

> getStdGen
1661435170 1655838864
> getStdGen
1661435170 1655838864
> newStdGen
1018232589 1655838863
> getStdGen
1661435171 2103410263

As the example shows, newStdGen not only returns a new random generator but also resets the global generator

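5. Exception handling
Up to now we have run into plenty of errors (non-exhaustive pattern matches, missing type annotations, taking the head of an empty list, division by zero and so on) and know that once an exception occurs the program reports an error and exits immediately, but we have never tried catching one

In fact, like other mainstream languages, Haskell has a complete exception-handling mechanism

I/O exceptions
I/O code calls for more careful exception handling because, compared with internal logic, the external environment is much less controllable and less trustworthy:

When opening a file, for example, the file may be locked, it may have been removed, or the whole disk may have been unplugged

In such cases an exception has to be thrown to tell the program that something went wrong and things did not run as expected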

I/O exceptions can be caught with catchIOError:

import System.IO.Error
catchIOError :: IO a -> (IOError -> IO a) -> IO a

It takes an I/O action and a matching exception handler and returns an I/O action of the same type. The mechanism resembles try-catch: the handler runs only if the I/O action throws, and its result then becomes the result of the whole action. For example:

import System.IO
import System.IO.Error
import Control.Monad
import System.Environment

main = do
  args <- getArgs
  when (not . null $ args) (do
    contents <- catchIOError (readFile (head args)) (\err -> do return "Failed to read this file!")
    putStr contents
    )

When the file cannot be found, or readFile fails for some other reason, the message is printed:

$ runhaskell ioException.hs ./xx
Failed to read this file!

This simply swallows every exception; it is better to tell them apart:

main = do
  args <- getArgs
  when (not . null $ args) (do
    contents <- catchIOError (readFile (head args)) errorHandler
    putStr contents
    )
  where errorHandler err
          | isDoesNotExistError err = do return "File not found!"
          | otherwise = ioError err

where isDoesNotExistError and ioError are:

isDoesNotExistError :: IOError -> Bool
ioError :: IOError -> IO a

The former is a predicate that tests whether the given IOError was caused by the target (file) not existing; the latter is the equivalent of throw in JS and rethrows the exception

Other predicates on IOError:

isAlreadyExistsError
isAlreadyInUseError
isFullError
isEOFError
isIllegalOperation
isPermissionError
isUserError

isUserError tests for exceptions created by hand with userError :: String -> IOError

Getting error information
To print the user input that caused the exception, one might write:

exists = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (\err -> do
      return ("File " ++ file ++ " not found!\n")
      )
    putStr contents
    )

Trying it out:

> exists
./xx
File ./xx not found!
> exists
./io.hs
main = print "hoho"

This works as expected. A lambda is used here, so it can access the outer variable file. If the handler is large, that becomes less convenient, for example:

exists' = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (errorHandler file)
    putStr contents
    )
  where errorHandler file = \err -> do (return ("File " ++ file ++ " not found!\n"))

To pass file into errorHandler we had to add an extra layer of wrapping, which looks clumsy, and the amount of context that can be preserved this way is limited

So, as in other languages, some error information can be extracted from the exception object itself, for example:

exists'' = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (\err ->
      case ioeGetFileName err of Just path -> return ("File at " ++ path ++ " not found!\n")
                                 Nothing -> return ("File at somewhere not found!\n")
      )
    putStr contents
    )

ioeGetFileName extracts the file path from an IOError (these helper functions all start with ioe):

ioeGetFileName :: IOError -> Maybe FilePath

P.S. For more functions of this kind, see Attributes of I/O errors
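
The predicates and the ioe accessors can be combined into a single handler. The following is a sketch only; describe and safeRead are names invented for the example, and the message texts are arbitrary:

```haskell
import System.IO.Error

-- Turn an IOError into a message using the predicates and ioe* accessors.
-- (describe and safeRead are names invented for this sketch.)
describe :: IOError -> String
describe err
  | isDoesNotExistError err = "not found: " ++ target
  | isPermissionError err   = "permission denied: " ++ target
  | isUserError err         = "user error: " ++ ioeGetErrorString err
  | otherwise               = "I/O error: " ++ show err
  where target = maybe "<unknown path>" id (ioeGetFileName err)

-- Read a file, returning Left with a description instead of throwing.
safeRead :: FilePath -> IO (Either String String)
safeRead path = (Right <$> readFile path) `catchIOError` (return . Left . describe)
```

Because readFile is lazy, this only catches errors raised while opening the file; an error that occurs later, while the contents are being consumed, would surface outside safeRead.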

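Exceptions in pure functions
Exceptions are not specific to I/O, for example:

> 1 `div` 0
*** Exception: divide by zero
> head []
*** Exception: Prelude.head: empty list

Pure functions can raise exceptions too, such as the division by zero and head-of-empty-list exceptions above. There are two ways to deal with them:

Use Maybe or Either

Use try :: Exception e => IO a -> IO (Either e a) (in the Control.Exception module)

For example:

> import Data.Maybe
> case listToMaybe [] of Nothing -> ""; Just first -> first
""
> case listToMaybe ["a", "b"] of Nothing -> ""; Just first -> first
"a"

listToMaybe :: [a] -> Maybe a takes the first element of a list and wraps it in a Maybe (an empty list gives Nothing)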

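For division by zero, either check by hand that the divisor is not 0, or use evaluate to move the computation into I/O and catch the exception with try:

> import Control.Exception
> first <- try $ evaluate $ 1 `div` 0 :: IO (Either ArithException Integer)
> first
Left divide by zero

The division-by-zero exception is the DivideByZero constructor of the ArithException type, which is exported from Control.Exception:

data ArithException
  = Overflow
  | Underflow
  | LossOfPrecision
  | DivideByZero
  | Denormal
  | RatioZeroDenominator
    -- Defined in ‘GHC.Exception’

If the concrete exception type is unknown (in this case it really was unclear, and could not be guessed even from the source), or if exceptions of every type should be caught, SomeException can be used:

> first <- try $ evaluate $ head [] :: IO (Either SomeException ())
> first
Left Prelude.head: empty list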
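
The two approaches for division by zero can be placed side by side. This is a minimal sketch with invented names (safeDiv, tryDiv): the first checks the divisor up front and stays pure, the second lets the exception happen and catches it in I/O:

```haskell
import Control.Exception (ArithException (..), evaluate, try)

-- Approach 1: check the divisor by hand and return a Maybe.
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- Approach 2: force the division inside IO and catch ArithException.
tryDiv :: Int -> Int -> IO (Either ArithException Int)
tryDiv x y = try (evaluate (x `div` y))
```

evaluate matters here: without it the division would stay an unevaluated thunk, and the exception would only be raised later, outside the scope of try.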

P.S. For more on the four approaches to error handling, see Handling errors in Haskell

References
How to catch a divide by zero error in Haskell?

Exception handling in Haskell
