.. _pyensaetext2tablerst: ========================================== Not so clean text to tables (pandas fails) ========================================== .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/pyensae_text2table.ipynb|*` An example where ``import_flatfile_into_database`` does a better job than pandas about converting a flat file to a table. .. code:: ipython3 import random, pandas text = [ "one","two","three","four","five","six","seven","eight","nine","ten" ] data = [ { "name": text[random.randint(0,9)], "number": random.randint(0,99)} \ for i in range(0,10000) ] df = pandas.DataFrame(data) df.head(n=3) .. raw:: html
name number
0 one 75
1 two 73
2 three 77
.. code:: ipython3 df.to_csv("flatfile.txt", sep="\t", encoding="utf8", header=True, index=False) dfr = pandas.read_csv("flatfile.txt", sep="\t", encoding="utf8") dfr.head(n=3) .. raw:: html
name number
0 one 75
1 two 73
2 three 77
Le’s assume now we introduce extra tabulations. .. code:: ipython3 datatab = [ {"name": " one\ttab", "number":100 } ] + data df = pandas.DataFrame(datatab) df.head(n=3) .. raw:: html
name number
0 one\ttab 100
1 one 75
2 two 73
.. code:: ipython3 df.to_csv("flatfile_tab.txt", sep="\t", encoding="utf8", header=True, index=False) dfr = pandas.read_csv("flatfile_tab.txt", sep="\t", encoding="utf8") dfr.head(n=3) .. raw:: html
name number
0 one\ttab 100
1 one 75
2 two 73
It works well because we use ``pandas`` to save the dataframe, and we use ``pandas`` to restore it. In the file *flatfile_tab.txt*, it looks like ``"on\te"``. ``pandas`` interprets the quotes as a delimiter. However most of the times, the flat file is produced in a different way and the quotes are not present. .. code:: ipython3 with open("flatfile_tab.txt", "r", encoding="utf8") as f: content = f.read() content = content.replace('"','') with open("flatfile_tab2.txt", "w", encoding="utf8") as f: f.write(content) dfr = pandas.read_csv("flatfile_tab2.txt", sep="\t", encoding="utf8") dfr.head(n=3) .. raw:: html
name number
one tab 100.0
one 75 NaN
two 73 NaN
It failed! Data is not aligned and it did not raise an exception. If we move the extra tab in second position, we get: .. code:: ipython3 datatab = data[:1] + [ {"name": " one\ttab", "number":100 } ] + data[1:] df = pandas.DataFrame(datatab) df.to_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8", header=True, index=False) with open("flatfile_tab_pos2.txt","r",encoding="utf8") as f: content = f.read() content = content.replace('"','') with open("flatfile_tab_pos2.txt","w",encoding="utf8") as f: f.write(content) .. code:: ipython3 dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8") dfr.head(n=3) :: --------------------------------------------------------------------------- ParserError Traceback (most recent call last) in () ----> 1 dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8") 2 dfr.head(n=3) c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 653 skip_blank_lines=skip_blank_lines) 654 --> 655 return _read(filepath_or_buffer, kwds) 656 657 parser_f.__name__ = name c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds) 409 410 try: --> 411 data = parser.read(nrows) 412 finally: 413 parser.close() c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows) 1003 raise ValueError('skipfooter not supported for iteration') 1004 -> 1005 ret = self._engine.read(nrows) 1006 1007 if self.options.get('as_recarray'): c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows) 1746 def read(self, nrows=None): 1747 try: -> 1748 data = self._reader.read(nrows) 1749 except StopIteration: 1750 if self._first_chunk: pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)() pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)() ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3 As suggested in `Python Pandas Error tokenizing data `__, we could add the parameter ``error_bad_lines=False`` or ``skiprows=N`` but we would still lose those bad lines. So we use function `import_flatfile_into_database `__. .. code:: ipython3 from pyensae.sql import import_flatfile_into_database import_flatfile_into_database("flatfile_tab_pos2.db3", "flatfile_tab_pos2.txt") .. parsed-literal:: TextFile: opening file flatfile_tab_pos2.txt TextFile.guess_columns: processing file flatfile_tab_pos2.txt TextFile: opening file flatfile_tab_pos2.txt TextFile.guess_columns: using 101 lines TextFile: closing file flatfile_tab_pos2.txt TextFile.guess_columns: sep '\t' nb cols 2 bestnb 100 more {('\t', 1): 100, ('\t', 2): 1} TextFile.guess_columns: header True columns {0: ('name', ), 1: ('number', )} compiling ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: regex ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: header True columns {0: ('name', (, 10)), 1: ('number', )} [_guess_columns] sep=['\t'] TextFile: closing file flatfile_tab_pos2.txt [_guess_columns] columns_name=None guess with 1001 lines count_types {0: {: 1000}, 1: {: 999, : 1}} columns {0: ('name', ), 1: ('number', )} guess {0: ('name', (, 10)), 1: ('number', )} SQL 'CREATE TABLE flatfile_tab_pos2(name TEXT,' ' number INTEGER);' column_has_space False ['name', 'number'] changes {} TextFileColumns (2): regex: {0: ('name', (, 10)), 1: ('number', )} TextFile.guess_columns: processing file flatfile_tab_pos2.txt TextFile: opening file flatfile_tab_pos2.txt TextFile.guess_columns: using 101 lines TextFile: closing file flatfile_tab_pos2.txt TextFile.guess_columns: sep '\t' nb cols 2 bestnb 100 more {('\t', 1): 100, ('\t', 2): 1} TextFile.guess_columns: header True columns {0: ('name', ), 1: ('number', )} compiling ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: regex ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: header True columns {0: ('name', (, 10)), 1: ('number', )} TextFile: opening file flatfile_tab_pos2.txt ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ error regex 0 unable to interpret line 10002 : '' TextFile: closing file flatfile_tab_pos2.txt 10001 lines imported .. parsed-literal:: 'flatfile_tab_pos2' We check that we got the inserted line in the dataframe: .. code:: ipython3 from pyensae.sql import Database db = Database("flatfile_tab_pos2.db3") db.connect() df = db.to_df("SELECT * FROM flatfile_tab_pos2") db.close() df.head() .. parsed-literal:: SQL 'SELECT * FROM flatfile_tab_pos2' .. raw:: html
name number
0 one 75
1 one\ttab 100
2 two 73
3 three 77
4 six 38