>>> from env_helper import info; info()
Page last updated: 2024-01-15 21:16:22
Runtime environment:
    Linux distribution: Debian GNU/Linux 12 (bookworm)
    OS kernel: Linux-6.1.0-16-amd64-x86_64-with-glibc2.36
    Python version: 3.11.2

1.5. Parallel word segmentation

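How it works: the target text is split by line, the lines are distributed across multiple Python processes for parallel segmentation, and the results are then merged, which gives a considerable increase in segmentation speed.

Parallel mode is built on Python's standard multiprocessing module and does not currently support Windows.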
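
The split, distribute, and merge steps described above can be sketched with the standard library alone. This is only an illustration of the idea, not jieba's actual implementation: `toy_cut` is a hypothetical stand-in tokenizer that splits on whitespace, and `parallel_cut` is an assumed helper name.

```python
from multiprocessing import Pool


def toy_cut(line):
    """Hypothetical stand-in tokenizer: split one line on whitespace."""
    return line.split()


def parallel_cut(text, processes=2):
    """Split text by line, tokenize the lines in worker processes, merge in order."""
    lines = text.splitlines()
    with Pool(processes) as pool:
        # Pool.map preserves input order, so the merged tokens keep document order.
        per_line = pool.map(toy_cut, lines)
    # Flatten the per-line token lists into one result list.
    return [token for tokens in per_line for token in tokens]


if __name__ == "__main__":
    print(parallel_cut("a b c\nd e\nf\n"))
```

Because the work is divided by line, a text that consists of one very long line would give the worker processes nothing to share in this sketch.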

Usage:

  • jieba.enable_parallel(4) enables parallel segmentation mode; the argument is the number of parallel processes

  • jieba.disable_parallel() disables parallel segmentation mode
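
One way to keep the two calls paired is a small context manager. The sketch below is an assumption-laden convenience wrapper, not part of jieba: `parallel_mode` is a hypothetical name, and the module is passed in as a parameter so the helper relies only on the two functions listed above.

```python
import contextlib
import os


@contextlib.contextmanager
def parallel_mode(tokenizer_module, processes=4):
    """Hypothetical helper: enable parallel mode only inside a with-block."""
    if os.name == "nt":
        # Windows is not supported by parallel mode, so stay in serial mode.
        yield False
        return
    tokenizer_module.enable_parallel(processes)
    try:
        yield True
    finally:
        # Always switch back, even if segmentation raised an exception.
        tokenizer_module.disable_parallel()


# Usage, assuming jieba is installed and `text` holds the input:
#   import jieba
#   with parallel_mode(jieba, 4) as active:
#       words = list(jieba.cut(text))  # consume the generator inside the block
```

The result is consumed inside the block because `jieba.cut` returns a generator, so the actual work happens only when it is iterated.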

Note: parallel segmentation supports only the default tokenizers, jieba.dt and jieba.posseg.dt.

>>> import time
>>> import jieba
>>>
>>> # Serial baseline: make sure parallel mode is off
>>> jieba.disable_parallel()
>>>
>>> url = 'article.txt'
>>> content = open(url,"rb").read()
>>> t1 = time.time()
>>> words = "/ ".join(jieba.cut(content))
>>>
>>> t2 = time.time()
>>> tm_cost = t2-t1
>>>
>>> log_f = open("out.txt","wb")
>>> log_f.write(words.encode('utf-8'))
>>> log_f.close()
>>>
>>> print('Parallel mode disabled: speed %s bytes/second' % (len(content)/tm_cost))
Parallel mode disabled: speed 159173.04799865268 bytes/second

>>> import time
>>> import jieba
>>>
>>> # Same measurement with parallel mode on, using 3 worker processes
>>> jieba.enable_parallel(3)
>>>
>>> url = 'article.txt'
>>> content = open(url,"rb").read()
>>> t1 = time.time()
>>> words = "/ ".join(jieba.cut(content))
>>>
>>> t2 = time.time()
>>> tm_cost = t2-t1
>>>
>>> log_f = open("out.txt","wb")
>>> log_f.write(words.encode('utf-8'))
>>> log_f.close()
>>>
>>> print('Parallel mode enabled: speed %s bytes/second' % (len(content)/tm_cost))
Parallel mode enabled: speed 306005.9902867216 bytes/second
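
The two runs repeat the same timing logic, so it may help to factor it out and compare the reported rates directly. The sketch below is a suggestion rather than part of the original benchmark: `throughput` and `speedup` are hypothetical helper names, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time


def throughput(cut, content):
    """Return bytes processed per second for one full pass of cut(content)."""
    start = time.perf_counter()
    # Iterate to the end: a generator-based cut does no work until consumed.
    for _ in cut(content):
        pass
    elapsed = time.perf_counter() - start
    return len(content) / elapsed


def speedup(serial_rate, parallel_rate):
    """Ratio of the parallel rate to the serial rate."""
    return parallel_rate / serial_rate


# Rates reported by the two runs above (bytes/second).
print(round(speedup(159173.04799865268, 306005.9902867216), 2))  # prints 1.92
```

With jieba installed, `throughput(jieba.cut, content)` could be called once after `jieba.disable_parallel()` and once after `jieba.enable_parallel(3)`. On the figures above, three processes give a speedup of about 1.92 times, not a full threefold gain.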