Analyse Mysqldump File with Python

  1. Preface
  2. Background
  3. Key points
    1. Record Format
    2. Parse Tool
  4. Parse code
  5. Summary

I recently worked on a project that required loading file data into HBase. This post introduces a few key points from that work, which I hope are helpful.

Preface

Nowadays big data technology is applied across many industrial and technical areas. At a financial technology company, there are tons of data to process every day, and many tools for this have been developed by internet companies such as Google, Amazon, Microsoft, and Netflix.

This post will not introduce those big data platforms; they may be covered in a later series. Here I will focus on processing file data with Python.

Background

The data to be processed is stored in many files created with the mysqldump tool, totaling about 500 GB. My work focuses on parsing the files and extracting the INSERT records, which are then processed with MapReduce.


As we know, a mysqldump file is plain text and contains a lot of miscellaneous information that we do not need to process, so I had to work out how to parse out just the useful information. The key points are below.

Key points

Record Format

As we know, a record in a mysqldump file has the form INSERT INTO ... VALUES ..., and one INSERT statement may contain multiple rows. The key point is to work out how many fields follow the VALUES keyword and how many rows one INSERT statement contains.
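For illustration, here is a hypothetical dump line of that shape (the table and values are made up). Everything after the VALUES keyword is the part the parser has to split into rows and fields:

```python
# A hypothetical single INSERT statement carrying two rows.
line = "INSERT INTO `users` VALUES (1,'alice'),(2,'bob');"

# The part after the VALUES keyword is what needs row/field splitting.
values = line.partition("VALUES ")[2]
print(values)  # (1,'alice'),(2,'bob');
```

The remaining work is to split that string on the `),(` row boundaries while respecting single-quoted string fields, which is where the csv module comes in below.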

Parse Tool

The first language I thought of was Python, which makes it convenient to build a parsing program.


The dump files are gzip-compressed and several gigabytes each, so machine memory is one of the constraints the program has to respect.
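The lazy-reading idea can be sketched like this (the function name and path here are illustrative, not from the original program): gzip.open in text mode yields decompressed lines one at a time, so memory use stays flat no matter how large the file is.

```python
import gzip


def iter_insert_lines(path):
    """Yield INSERT statements from a gzipped dump, one line at a time."""
    # "rt" decompresses lazily as a text stream, so only one line is
    # held in memory at a time.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO"):
                yield line
```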


If you want to accelerate parsing, you can parallelize the work across files.
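One caveat: because of CPython's GIL, CPU-bound parsing usually scales better with processes than with threads. A minimal sketch using multiprocessing.Pool, where count_rows is a hypothetical stand-in for the real per-chunk parser:

```python
import multiprocessing


def count_rows(values):
    # Trivial stand-in worker: count the rows in one VALUES clause.
    return values.count("),(") + 1


def parse_in_parallel(chunks, workers=4):
    # Fan the chunks out to a pool of worker processes.
    with multiprocessing.Pool(workers) as pool:
        return pool.map(count_rows, chunks)


if __name__ == "__main__":
    print(parse_in_parallel(["(1,'a'),(2,'b');", "(3,'c');"]))  # [2, 1]
```

In practice the unit of parallelism would more likely be one dump file per worker, since the files are independent of each other.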

Parse code

The parsing program is written in Python.

#!/usr/bin/env python
import csv
import sys
import gzip


# raise the csv module's field size limit so very long rows do not fail
csv.field_size_limit(sys.maxsize)


def check_insert(line):
    """Check whether the line is an INSERT statement."""
    return line.startswith('INSERT INTO')


def get_line_values(line):
    """Extract the part of the statement after the VALUES keyword."""
    return line.partition('VALUES ')[2]


def check_values_style(values):
    """Check whether the values part is well formed (a leading `(` is required)."""
    return bool(values) and values[0] == '('


def parse_line(values):
    """Create a csv reader over the VALUES clause and print one row per line."""
    latest_row = []

    reader = csv.reader([values], delimiter=',',
                        doublequote=False,
                        escapechar='\\',
                        quotechar="'",
                        strict=True)

    for reader_row in reader:
        for column in reader_row:
            # map empty fields and SQL NULL to an empty string
            if len(column) == 0 or column == 'NULL':
                latest_row.append("")
                continue
            # a field starting with `(` opens a new row, which means the
            # statement contains multiple rows
            if column[0] == "(":
                new_row = False
                if len(latest_row) > 0:
                    # a trailing `)` closes the previous row
                    if latest_row[-1][-1] == ")":
                        latest_row[-1] = latest_row[-1][:-1]
                        if latest_row[-1] == "NULL":
                            latest_row[-1] = ""
                        new_row = True
                # the previous row is complete, so print it
                if new_row:
                    print("%s<{||}>" % "}}}{{{".join(latest_row))
                    latest_row = []
                # strip the leading `(` of the new row
                if len(latest_row) == 0:
                    column = column[1:]

            latest_row.append(column)
        # `);` marks the end of the whole statement
        if latest_row[-1][-2:] == ");":
            latest_row[-1] = latest_row[-1][:-2]
            if latest_row[-1] == "NULL":
                latest_row[-1] = ""

    print("%s<{||}>" % "}}}{{{".join(latest_row))


def main():
    filename = sys.argv[1]
    try:
        # read lines lazily from the gz file
        with gzip.open(filename, "rt") as f:
            for line in f:
                if check_insert(line):
                    values = get_line_values(line)
                    if check_values_style(values):
                        parse_line(values)
    except KeyboardInterrupt:
        sys.exit(0)


if __name__ == "__main__":
    main()

Summary

Generally speaking, the parsing program uses Python's csv module to analyse the INSERT records of a dump file; if a dump file is not in the standard format, parsing will fail. If there are tons of dump files, consider using multiple processes or threads.


If you want to reprint this article, please credit the original author. If you have any questions about it, feel free to comment here or email ballwql@163.com.


Article Title: Analyse Mysqldump File with Python

Article Author: zendwind

Publish Time: 2019-07-26, 23:00:00

Last Updated: 2019-07-28, 21:14:55

Original Link: http://zendwind.com/2019/07/26/python-read-dump/

Copyright: "Signature-Non commercial-Reservation 4.0". Any reprint must credit the original author, thank you!
