The data in my txt file is formatted as shown below.
ScanHeader # 1
position = 1, start_mass= 2.000000, end_mass = 535.010058
start_time = 0.034048, end_time = 0.000000, packet_type = 24
num_readings = 114, integ_intens = 14276257.301926, data packet pos = 1026
uScanCount = 0, PeakIntensity = 6799450.500000, PeakMass = 18.045876
Scan Segment = 0, Scan Event = 0
Precursor Mass
Collision Energy
Isolation width
Polarity positive, Cenrtoid Data, Full Scan Type, MS Scan
SourceFragmentation Any, Type Ramp, Values = 0, Mass Ranges = 0
Turbo Scan Any, IonizationMode ElectronImpact, Corona Any
Detector Any, Value = 0.00, ScanTypeIndex = -1
DataPeaks
Packet # 0, intensity = 3691.226074, mass/position = 2.112536
saturated = 0, fragmented = 0, merged = 0
Packet # 1, intensity = 42881.203125, mass/position = 3.466080
saturated = 0, fragmented = 0, merged = 0
Packet # 2, intensity = 3006256.000000, mass/position = 4.184193
saturated = 0, fragmented = 0, merged = 0
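Ideally, the output would be a csv file like the one shown below:
I have tried regular expressions as well as the read_csv option, but neither gave me the output I wanted. The closest I got was with regular expressions, where I managed to extract all the data I need, but I could not get it into a data frame. The code looks like this: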
from tabulate import tabulate
import re
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data = re.findall(r'\d*last_scan = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data1 = re.findall(r'\d* start_time = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data2 = re.findall(r'\d* end_time = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data3 = re.findall(r'\d*low_mass = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data4 = re.findall(r'\d*high_mass = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data5 = re.findall(r'\d*ScanHeader # \d', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data6 = re.findall(r'\d*Packet # \d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data7 = re.findall(r'\d* intensity = \d*\d.\d*', newfile.read())
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    data8 = re.findall(r'\d* mass/position = \d*\d.\d*', newfile.read())
import pandas as pd
data = {'Scanheader': [data5],
        'Packet Number': [data6],
        'Intensity': [data7],
        'Mass/Position': [data8]
        }
df = pd.DataFrame(data)
df.to_csv('2020-06-23-Didecylamine-deriv-0,1uL.csv', index=False)
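The output I get with this code looks like this:
I know there are ways to make this code less complicated, but I am still a beginner and have not yet found a way to simplify it. Any tips would be greatly appreciated :)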
Answer from a uj5u.com user:
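You should open the file only once. You can first use re with the flags re.MULTILINE | re.DOTALL to match the whole text of each ScanHeader.
Then iterate over these matches and extract the Header # and the time.
Finally, iterate over the packets (found within the previous match) to extract the other columns: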
import re
import pandas as pd

data = []
# Each match spans one ScanHeader block, up to the next ScanHeader or the end of the text.
scanHeader_pattern = re.compile(r'ScanHeader.*?(?=ScanHeader|\Z)', flags=re.MULTILINE | re.DOTALL)
# Each match spans one packet, up to the next Packet or the end of the block.
packet_pattern = re.compile(r'Packet.*?(?=Packet|\Z)', flags=re.MULTILINE | re.DOTALL)
header_nb_pattern = re.compile(r'ScanHeader # (\d+)')
time_pattern = re.compile(r'start_time = (\d+\.\d+)', re.MULTILINE)
packet_nb_pattern = re.compile(r'Packet # (\d+)')
intensity_pattern = re.compile(r'intensity = (\d+\.\d+)')
mass_pos_pattern = re.compile(r'mass/position = (\d+\.\d+)')

# Open and read the file only once.
with open('2020-06-23-Didecylamine-deriv-0,1uL.txt') as newfile:
    text = newfile.read()

for sh in re.findall(scanHeader_pattern, text):
    h_nb = int(re.search(header_nb_pattern, sh).group(1))
    t = float(re.search(time_pattern, sh).group(1))
    for p in re.findall(packet_pattern, sh):
        p_nb = int(re.search(packet_nb_pattern, p).group(1))
        intensity = float(re.search(intensity_pattern, p).group(1))
        mass_pos = float(re.search(mass_pos_pattern, p).group(1))
        # One row per packet, repeating the header number and time of its scan.
        data.append(
            {'Scanheader': h_nb,
             'Packet Number': p_nb,
             'Time': t,
             'Intensity': intensity,
             'Mass/Position': mass_pos
             }
        )
df = pd.DataFrame(data)
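To try the idea without the original file, here is a minimal self-contained sketch of the same two-level matching (blocks first, then packets inside each block). It is only an illustration under some assumptions: the names SAMPLE, parse_scans and rows_to_csv are made up for this example, the second ScanHeader block in SAMPLE is invented to exercise the loop, and every packet line is assumed to follow the exact "Packet # N, intensity = X, mass/position = Y" layout from the question. It uses the standard-library csv module so that it runs without third-party packages; passing the returned rows to pd.DataFrame(rows) should give the same kind of table as the code above.

```python
import csv
import io
import re

# Abbreviated sample in the question's layout. The second ScanHeader block
# is invented for this example; it is not taken from the real file.
SAMPLE = """\
ScanHeader # 1
position = 1, start_mass= 2.000000, end_mass = 535.010058
start_time = 0.034048, end_time = 0.000000, packet_type = 24
DataPeaks
Packet # 0, intensity = 3691.226074, mass/position = 2.112536
saturated = 0, fragmented = 0, merged = 0
Packet # 1, intensity = 42881.203125, mass/position = 3.466080
saturated = 0, fragmented = 0, merged = 0
ScanHeader # 2
start_time = 0.050000, end_time = 0.000000, packet_type = 24
DataPeaks
Packet # 0, intensity = 100.500000, mass/position = 2.100000
saturated = 0, fragmented = 0, merged = 0
"""

FIELDS = ['Scanheader', 'Packet Number', 'Time', 'Intensity', 'Mass/Position']

# One match per scan: the header number, then everything up to the next
# line that starts with "ScanHeader # " or the end of the text.
HEADER_BLOCK = re.compile(
    r'^ScanHeader # (\d+)(.*?)(?=^ScanHeader # |\Z)',
    re.MULTILINE | re.DOTALL,
)
START_TIME = re.compile(r'start_time = (\d+\.\d+)')
# Assumes the three packet fields always sit on one line in this order.
PACKET = re.compile(
    r'Packet # (\d+), intensity = (\d+\.\d+), mass/position = (\d+\.\d+)'
)


def parse_scans(text):
    """Return one dict per packet, tagged with its scan header and time."""
    rows = []
    for block in HEADER_BLOCK.finditer(text):
        header_nb = int(block.group(1))
        body = block.group(2)
        time_match = START_TIME.search(body)
        start_time = float(time_match.group(1)) if time_match else None
        for packet in PACKET.finditer(body):
            rows.append({
                'Scanheader': header_nb,
                'Packet Number': int(packet.group(1)),
                'Time': start_time,
                'Intensity': float(packet.group(2)),
                'Mass/Position': float(packet.group(3)),
            })
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_scans(SAMPLE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

For the real data, read the file once, call parse_scans on its contents, and write the result to disk. Values in a different number format (for example integers without a decimal point, or scientific notation) would not match these patterns and would need a looser expression.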