An example where import_flatfile_into_database
does a better job than pandas at converting a flat file into a database table.
import random
import pandas

text = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
# 10000 random (name, number) records
data = [{"name": text[random.randint(0, 9)], "number": random.randint(0, 99)}
        for i in range(0, 10000)]
df = pandas.DataFrame(data)
df.head(n=3)
 | name | number
---|---|---
0 | one | 75
1 | two | 73
2 | three | 77
df.to_csv("flatfile.txt", sep="\t", encoding="utf8", header=True, index=False)
dfr = pandas.read_csv("flatfile.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
 | name | number
---|---|---
0 | one | 75
1 | two | 73
2 | three | 77
Let's now assume the data contains an extra tabulation inside a value.
datatab = [ {"name": " one\ttab", "number":100 } ] + data
df = pandas.DataFrame(datatab)
df.head(n=3)
 | name | number
---|---|---
0 | one\ttab | 100
1 | one | 75
2 | two | 73
df.to_csv("flatfile_tab.txt", sep="\t", encoding="utf8", header=True, index=False)
dfr = pandas.read_csv("flatfile_tab.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
 | name | number
---|---|---
0 | one\ttab | 100
1 | one | 75
2 | two | 73
It works well because we used pandas both to save the dataframe and to restore it. In the file flatfile_tab.txt, the value containing the extra tab is surrounded by double quotes, and pandas interprets those quotes as text delimiters when it reads the file back. However, most of the time the flat file is produced by another tool and the quotes are not present.
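This quoting behaviour can be checked in isolation; a small self-contained sketch, independent of the files generated above:

```python
import io
import pandas

# A value that contains the separator itself (a tab).
df = pandas.DataFrame([{"name": " one\ttab", "number": 100}])

buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
raw = buf.getvalue()
# to_csv protected the field with double quotes:
print(repr(raw))

# read_csv honours the quotes, so the round trip is lossless:
back = pandas.read_csv(io.StringIO(raw), sep="\t")
print(back)
```

When the file is written by another tool that does not add such quotes, this protection disappears; the next step simulates that by stripping them.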
# remove the quotes to mimic a file produced without text delimiters
with open("flatfile_tab.txt", "r", encoding="utf8") as f:
    content = f.read()
content = content.replace('"', '')
with open("flatfile_tab2.txt", "w", encoding="utf8") as f:
    f.write(content)
dfr = pandas.read_csv("flatfile_tab2.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
 | name | number
---|---|---
one | tab | 100.0
one | 75 | NaN
two | 73 | NaN
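This misaligned result can be reproduced without the generated file; a minimal sketch, assuming nothing but pandas:

```python
import io
import pandas

# Same shape as the unquoted file: the very first data row carries an
# extra tab, hence three fields against two column names.
raw = "name\tnumber\none\ttab\t100\ntwo\t73\n"

dfr = pandas.read_csv(io.StringIO(raw), sep="\t")
# pandas silently treats the first field of each row as an index
# instead of raising an error:
print(dfr)
```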
It failed! The data is no longer aligned, and pandas did not raise any exception. If we move the line with the extra tab to the second position, we get:
datatab = data[:1] + [ {"name": " one\ttab", "number":100 } ] + data[1:]
df = pandas.DataFrame(datatab)
df.to_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8", header=True, index=False)
# again remove the quotes from the saved file
with open("flatfile_tab_pos2.txt", "r", encoding="utf8") as f:
    content = f.read()
content = content.replace('"', '')
with open("flatfile_tab_pos2.txt", "w", encoding="utf8") as f:
    f.write(content)
dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-8-473d60fe4f7d> in <module>()
----> 1 dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8")
      2 dfr.head(n=3)
c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in parser_f(...)
c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
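The error can also be reproduced without the generated files; a minimal sketch of the same situation:

```python
import io
import pandas

# Minimal reproduction: the extra tab sits on the third line, after a
# first data row that fixed the expected field count at two.
raw = "name\tnumber\none\t75\n one\ttab\t100\ntwo\t73\n"

err = None
try:
    pandas.read_csv(io.StringIO(raw), sep="\t")
except pandas.errors.ParserError as e:
    err = e
    # "Error tokenizing data. C error: Expected 2 fields in line 3, saw 3"
    print(err)
```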
As suggested in the Stack Overflow question Python Pandas Error tokenizing data, we could add the parameter error_bad_lines=False
or skiprows=N,
but we would still lose the bad lines. So we use the function import_flatfile_into_database instead.
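As a side note, error_bad_lines has since been deprecated; recent pandas versions (1.3 and later) expose the same behaviour through on_bad_lines. A quick sketch:

```python
import io
import pandas

raw = "name\tnumber\none\t75\n one\ttab\t100\ntwo\t73\n"

# on_bad_lines="skip" drops the misaligned row instead of raising.
dfr = pandas.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")
print(dfr)
```

The misaligned row is dropped silently, and it is precisely the row we wanted to keep, hence the import into a database below.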
from pyensae.sql import import_flatfile_into_database
import_flatfile_into_database("flatfile_tab_pos2.db3", "flatfile_tab_pos2.txt")
TextFile: opening file flatfile_tab_pos2.txt
TextFile.guess_columns: processing file flatfile_tab_pos2.txt
TextFile: opening file flatfile_tab_pos2.txt
TextFile.guess_columns: using 101 lines
TextFile: closing file flatfile_tab_pos2.txt
TextFile.guess_columns: sep '\t' nb cols 2 bestnb 100 more {('\t', 1): 100, ('\t', 2): 1}
TextFile.guess_columns: header True columns {0: ('name', <class 'str'>), 1: ('number', <class 'int'>)}
compiling ^(?P<name>.*)\t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
TextFile.guess_columns: regex ^(?P<name>.*)\t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
TextFile.guess_columns: header True columns {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
[_guess_columns] sep=['\t']
TextFile: closing file flatfile_tab_pos2.txt
[_guess_columns] columns_name=None
guess with 1001 lines
count_types {0: {<class 'str'>: 1000}, 1: {<class 'int'>: 999, <class 'str'>: 1}}
columns {0: ('name', <class 'str'>), 1: ('number', <class 'int'>)}
guess {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
SQL 'CREATE TABLE flatfile_tab_pos2(name TEXT,' ' number INTEGER);'
column_has_space False ['name', 'number'] changes {}
TextFileColumns (2): regex: {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
TextFile.guess_columns: processing file flatfile_tab_pos2.txt
TextFile: opening file flatfile_tab_pos2.txt
TextFile.guess_columns: using 101 lines
TextFile: closing file flatfile_tab_pos2.txt
TextFile.guess_columns: sep '\t' nb cols 2 bestnb 100 more {('\t', 1): 100, ('\t', 2): 1}
TextFile.guess_columns: header True columns {0: ('name', <class 'str'>), 1: ('number', <class 'int'>)}
compiling ^(?P<name>.*)\t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
TextFile.guess_columns: regex ^(?P<name>.*)\t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
TextFile.guess_columns: header True columns {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
TextFile: opening file flatfile_tab_pos2.txt
^(?P<name>.*)\t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
error regex 0
unable to interpret line 10002 : ''
TextFile: closing file flatfile_tab_pos2.txt
10001 lines imported
'flatfile_tab_pos2'
We check that the line with the extra tab made it into the database:
from pyensae.sql import Database
db = Database("flatfile_tab_pos2.db3")
db.connect()
df = db.to_df("SELECT * FROM flatfile_tab_pos2")
db.close()
df.head()
SQL 'SELECT * FROM flatfile_tab_pos2'
 | name | number
---|---|---
0 | one | 75
1 | one\ttab | 100
2 | two | 73
3 | three | 77
4 | six | 38