Counting Lines in a File with Python or Bash: Methods and Performance Tests

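When counting the lines of a large file, convenience matters, but so does performance (mainly elapsed time). A web search turns up many ways to do it, so which one should you choose? First, the context: the result is needed inside Python, so besides Python's own methods, calling Bash is also a good option, since Bash comes with many fast text-processing tools.

The search turned up six Bash methods and three Python methods.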

Python methods

Reference: http://sagerblog.github.io/blog/2013/01/21/python-file-line-count/

According to the results reported in that article, the last of its methods performs best:

def linecount_3():
    count = 0
    with open('data.sql', 'rb') as thefile:
        while True:
            buffer = thefile.read(65536)
            if not buffer:
                break
            count += buffer.count(b'\n')  # count newline bytes (file is opened in binary mode)
    return count
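
One detail worth keeping in mind: counting newline bytes follows the same convention as wc -l, so a final line that lacks a trailing newline is not counted. The sketch below illustrates this on a small temporary file. The helper names count_newlines and count_lines_iter are hypothetical and are not taken from the referenced article; the second one is simply a common Python idiom shown for comparison.

```python
import os
import tempfile


def count_newlines(path, chunk_size=65536):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_lines_iter(path):
    """Count lines by iterating; a final line without a newline is included."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Demo file: three lines of text, the last one without a trailing newline.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\nb\nc")
demo_path = tmp.name

newline_count = count_newlines(demo_path)
iter_count = count_lines_iter(demo_path)
os.remove(demo_path)
print(newline_count, iter_count)
```

On this demo file the chunked version returns 2 and the iterating version returns 3. For files that end with a newline, both give the same number.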

Bash methods

This article ( https://blog.csdn.net/blackmanren/article/details/60756410 ) gives six ways to count lines but reports no performance figures, so I wrote a script to time the six methods:

awk '{print NR}' a|tail -n1
awk 'END{print NR}' a
grep -n "" a|awk -F: '{print $1}'|tail -n1
sed -n '$=' a
wc -l a|awk '{print $1}'
cat a |wc -l

The output is as follows:

Protocol 1:
18323592
Use time: 7
Protocol 2:
18323592
Use time: 2
Protocol 3:
18323592
Use time: 10
Protocol 4:
18323592
Use time: 2
Protocol 5:
18323592
Use time: 0
Protocol 6:
18323592
Use time: 1

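The last two sometimes report 1 and sometimes 0, so at this resolution they are indistinguishable. A more precise measurement with time is needed: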

time wc -l a |awk '{print $1}'
# Output
real    0m0.455s
user    0m0.181s
sys     0m0.274s

time cat a | wc -l
# Output
real    0m0.699s
user    0m0.211s
sys     0m1.071s

So the conclusion is that method 5 is the fastest.
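
The timing script used for the six commands is not shown in this post. As a rough sketch of how such a comparison could be run from Python with sub-second resolution, the code below times each pipeline with time.perf_counter and keeps the best of a few runs. The function names time_shell_command and build_commands are hypothetical, and the sketch assumes awk, grep, sed, tail, cat, and wc are available on the PATH.

```python
import shlex
import subprocess
import sys
import time


def time_shell_command(command, repeats=3):
    """Run a shell pipeline several times; return (stdout, best elapsed seconds)."""
    best = float("inf")
    output = ""
    for _ in range(repeats):
        start = time.perf_counter()
        result = subprocess.run(
            command, shell=True, check=True,
            stdout=subprocess.PIPE, text=True,
        )
        best = min(best, time.perf_counter() - start)
        output = result.stdout.strip()
    return output, best


def build_commands(path):
    """Return the six line-counting pipelines with the file name filled in."""
    q = shlex.quote(path)  # quote the name so spaces do not break the shell command
    return [
        "awk '{print NR}' " + q + " | tail -n1",
        "awk 'END{print NR}' " + q,
        "grep -n \"\" " + q + " | awk -F: '{print $1}' | tail -n1",
        "sed -n '$=' " + q,
        "wc -l " + q + " | awk '{print $1}'",
        "cat " + q + " | wc -l",
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <file>
    for number, command in enumerate(build_commands(sys.argv[1]), start=1):
        output, seconds = time_shell_command(command)
        print("Method %d: %s lines, best of 3: %.3f s" % (number, output, seconds))
```

Taking the best of several runs reduces noise from the file cache and from other processes, but the absolute numbers will still depend on the machine and the file.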

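The next question: from within Python, which is faster, calling the Bash commands and reading the line count through a pipe, or counting the lines in native Python? I wrote a script to test this: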

import sys
import time
import subprocess
import shlex

fn = sys.argv[1]  # file to count, given on the command line


def timeit(method):
    # Decorator: print the wall-clock time of each call in milliseconds
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        print('%r  %.2f ms' % (method.__name__, (te - ts) * 1000))
        return result
    return timed

@timeit
def linecount_3():
    count = 0
    with open(fn, 'r') as thefile:  # the with block closes the file
        while True:
            buffer = thefile.read(65536)
            if not buffer:
                break
            count += buffer.count('\n')  # count newline characters
    return count

@timeit
def use_bash():
    cmd1 = ["wc", "-l", fn]  # argument list, so file names with spaces work
    cmd2 = "awk '{print $1}'"
    proc1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    proc2 = subprocess.Popen(shlex.split(cmd2), stdin=proc1.stdout, stdout=subprocess.PIPE)
    proc1.stdout.close()  # let proc1 receive SIGPIPE if proc2 exits early
    return int(proc2.communicate()[0])  # output of the last process in the pipe

print(linecount_3())
print(use_bash())

'linecount_3'  1414.43 ms
18323592
'use_bash'  463.55 ms
18323592

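In the end the gap is modest in absolute terms: about 1.4 s for the pure Python version versus about 0.46 s for the wc pipeline on this file of roughly 18 million lines, a factor of about three. One pain point is that pipes in Python are not as convenient as in Bash, so connecting two simple programs takes a fair amount of code.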
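
One way to cut down on that boilerplate, offered here as a sketch rather than as the method used for the timings above, is to drop awk altogether. When wc -l reads from standard input it prints only the number, so a single subprocess call is enough. The function name wc_line_count is hypothetical, and the sketch assumes wc is available on the PATH.

```python
import subprocess
import sys


def wc_line_count(path):
    """Return the line count reported by wc -l, feeding the file through stdin."""
    with open(path, "rb") as f:
        # With input on stdin, wc prints just the count and no file name,
        # so there is nothing for awk to strip.
        out = subprocess.check_output(["wc", "-l"], stdin=f)
    return int(out.decode().strip())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <file>
    print(wc_line_count(sys.argv[1]))
```

This keeps the counting in wc while leaving the parsing to Python, and it avoids both the second process and the shlex handling.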