生信下机数据基础知识(3)

R和SVG的较量 2020-06-29

397

如果给你一个fq文件，如何确定它是哪个质量体系的？

生信下机数据基础知识（1）中提到了fq下机数据的质量体系有好多种：

ASCII码值（10进制）范围	质量体系	测序平台
33-93	33体系	Sanger/Illumina1.8
64-104	64体系	Illumina1.3/Illumina1.5
59-104	64体系	Solexa

如果，我们在网上下载了一批fq原始数据，且不知道他是什么平台的下机数据，那么，我们如何最快的知道这个数据是什么质量体系的呢？
这里，小编写了一个小脚本，你只需要使用该脚本测试运行一下你的fq文件，即可快速知道你的数据是什么质量体系了，文件script.py：

#!/bin/usr/python -w
import sys
import time #计算程序消耗s回见
import gzip #打开gzip压缩格式文件
from datetime import datetime #记录程序开始结束时间


#定义判断质量体系子函数Guess_33_64
def Guess_33_64(quality_str):


        #ranges are 33-93 (Sanger/Illumina1.8), 64-104 (Illumina1.3 or Illumina1.5) and 59-104 (Solexa). Similarly FastQC assumes that anything with some scores in the 33-63 range is Sanger and that the rest is Illumina1.3-1.5 (it doesn't know about Solexa scores).
        #In brief,if in 33-58 and not in 94-104,33;elif 64;elif in 33-58 and in 94-104,bad file.


        uniq_qual_numbers = []
        for s in quality_str:
                s_n = ord(s)
                if s_n not in uniq_qual_numbers:


                        uniq_qual_numbers.append(s_n)




        q_33 = 0
        q_64 = 0


        for qual in uniq_qual_numbers:


                if qual >= 33 and qual <= 58:
                        q_33 = 1
                elif qual >= 94 and qual<= 104:
                        q_64 = 1
                elif qual > 104 or qual < 33:
                        return("UnSure")
        ret = ""
        if q_33 == 0 and q_64 == 1:
                ret = 64
        elif q_33 == 1 and q_64 == 0:
                ret = 33
        else:
                ret = "UnSure"


        return(ret)
        
#主题函数，判断输入参数是否满足需求
if __name__ == '__main__':
    if len(sys.argv) <3:
        print("Usage: %s <InPutfq no gzip compressed><Output result>(random lines)" % sys.argv[0])
        sys.exit(1)
#这里读取的是压缩的fastq.gz文件




threshold=int(sys.argv[3]) #将输入的str类型数值转换成int数值
Qphred=''   #定义一个str变量






file_out=open(sys.argv[2],'w')
start_time = time.perf_counter() #开始计算时间
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'),file=file_out) #记录开始时间日期 年-月-日 时-分-秒


Qphred = os.popen("less %s | head -%s | sed -n '4~4p'" % (sys.argv[1],threshold)).read()
Qphred = Qphred.replace('\n','').replace('\r','')
#截取fq文件前特定行数的质量值行


m=Guess_33_64(Qphred)
print("Qphred quality is %s" % m,file=file_out)
end_time = time.perf_counter() 计算结束时间
print("Calculation time(sec): %.3f" % (end_time-start_time),file=file_out)
#记录总共消耗时间
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')，file=file_out)
#记录结束时间
复制

使用方法：

python3 script.py fastq.gz  file_out 200
#运行时间:
#2020-06-29 19:43:16
#2020-06-29 19:43:58
#Calculation time(sec): 41.954
#运行结果：
#Qphred quality is 33
复制