.. _pyensaetext2tablerst: ========================================== Not so clean text to tables (pandas fails) ========================================== .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/pyensae_text2table.ipynb|*` An example where ``import_flatfile_into_database`` does a better job than pandas about converting a flat file to a table. .. code:: ipython3 import random, pandas text = [ "one","two","three","four","five","six","seven","eight","nine","ten" ] data = [ { "name": text[random.randint(0,9)], "number": random.randint(0,99)} \ for i in range(0,10000) ] df = pandas.DataFrame(data) df.head(n=3) .. raw:: html

	name	number
0	one	75
1	two	73
2	three	77

.. code:: ipython3 df.to_csv("flatfile.txt", sep="\t", encoding="utf8", header=True, index=False) dfr = pandas.read_csv("flatfile.txt", sep="\t", encoding="utf8") dfr.head(n=3) .. raw:: html

	name	number
0	one	75
1	two	73
2	three	77

Le’s assume now we introduce extra tabulations. .. code:: ipython3 datatab = [ {"name": " one\ttab", "number":100 } ] + data df = pandas.DataFrame(datatab) df.head(n=3) .. raw:: html

	name	number
0	one\ttab	100
1	one	75
2	two	73

.. code:: ipython3 df.to_csv("flatfile_tab.txt", sep="\t", encoding="utf8", header=True, index=False) dfr = pandas.read_csv("flatfile_tab.txt", sep="\t", encoding="utf8") dfr.head(n=3) .. raw:: html

	name	number
0	one\ttab	100
1	one	75
2	two	73

It works well because we use ``pandas`` to save the dataframe, and we use ``pandas`` to restore it. In the file *flatfile_tab.txt*, it looks like ``"on\te"``. ``pandas`` interprets the quotes as a delimiter. However most of the times, the flat file is produced in a different way and the quotes are not present. .. code:: ipython3 with open("flatfile_tab.txt", "r", encoding="utf8") as f: content = f.read() content = content.replace('"','') with open("flatfile_tab2.txt", "w", encoding="utf8") as f: f.write(content) dfr = pandas.read_csv("flatfile_tab2.txt", sep="\t", encoding="utf8") dfr.head(n=3) .. raw:: html

	name	number
one	tab	100.0
one	75	NaN
two	73	NaN

It failed! Data is not aligned and it did not raise an exception. If we move the extra tab in second position, we get: .. code:: ipython3 datatab = data[:1] + [ {"name": " one\ttab", "number":100 } ] + data[1:] df = pandas.DataFrame(datatab) df.to_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8", header=True, index=False) with open("flatfile_tab_pos2.txt","r",encoding="utf8") as f: content = f.read() content = content.replace('"','') with open("flatfile_tab_pos2.txt","w",encoding="utf8") as f: f.write(content) .. code:: ipython3 dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8") dfr.head(n=3) :: --------------------------------------------------------------------------- ParserError Traceback (most recent call last) in () ----> 1 dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8") 2 dfr.head(n=3) c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 653 skip_blank_lines=skip_blank_lines) 654 --> 655 return _read(filepath_or_buffer, kwds) 656 657 parser_f.__name__ = name c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds) 409 410 try: --> 411 data = parser.read(nrows) 412 finally: 413 parser.close() c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows) 1003 raise ValueError('skipfooter not supported for iteration') 1004 -> 1005 ret = self._engine.read(nrows) 1006 1007 if self.options.get('as_recarray'): c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows) 1746 def read(self, nrows=None): 1747 try: -> 1748 data = self._reader.read(nrows) 1749 except StopIteration: 1750 if self._first_chunk: pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)() pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)() ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3 As suggested in `Python Pandas Error tokenizing data `__, we could add the parameter ``error_bad_lines=False`` or ``skiprows=N`` but we would still lose those bad lines. So we use function `import_flatfile_into_database `__. .. code:: ipython3 from pyensae.sql import import_flatfile_into_database import_flatfile_into_database("flatfile_tab_pos2.db3", "flatfile_tab_pos2.txt") .. parsed-literal:: TextFile: opening file flatfile_tab_pos2.txt TextFile.guess_columns: processing file flatfile_tab_pos2.txt TextFile: opening file flatfile_tab_pos2.txt TextFile.guess_columns: using 101 lines TextFile: closing file flatfile_tab_pos2.txt TextFile.guess_columns: sep '\t' nb cols 2 bestnb 100 more {('\t', 1): 100, ('\t', 2): 1} TextFile.guess_columns: header True columns {0: ('name', ), 1: ('number', )} compiling ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: regex ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: header True columns {0: ('name', (, 10)), 1: ('number', )} [_guess_columns] sep=['\t'] TextFile: closing file flatfile_tab_pos2.txt [_guess_columns] columns_name=None guess with 1001 lines count_types {0: {: 1000}, 1: {: 999, : 1}} columns {0: ('name', ), 1: ('number', )} guess {0: ('name', (, 10)), 1: ('number', )} SQL 'CREATE TABLE flatfile_tab_pos2(name TEXT,' ' number INTEGER);' column_has_space False ['name', 'number'] changes {} TextFileColumns (2): regex: {0: ('name', (, 10)), 1: ('number', )} TextFile.guess_columns: processing file flatfile_tab_pos2.txt TextFile: opening file flatfile_tab_pos2.txt TextFile.guess_columns: using 101 lines TextFile: closing file flatfile_tab_pos2.txt TextFile.guess_columns: sep '\t' nb cols 2 bestnb 100 more {('\t', 1): 100, ('\t', 2): 1} TextFile.guess_columns: header True columns {0: ('name', ), 1: ('number', )} compiling ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: regex ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ TextFile.guess_columns: header True columns {0: ('name', (, 10)), 1: ('number', )} TextFile: opening file flatfile_tab_pos2.txt ^(?P.*)\t(?P([-]?[1-9][0-9]*?)|(0?))$ error regex 0 unable to interpret line 10002 : '' TextFile: closing file flatfile_tab_pos2.txt 10001 lines imported .. parsed-literal:: 'flatfile_tab_pos2' We check that we got the inserted line in the dataframe: .. code:: ipython3 from pyensae.sql import Database db = Database("flatfile_tab_pos2.db3") db.connect() df = db.to_df("SELECT * FROM flatfile_tab_pos2") db.close() df.head() .. parsed-literal:: SQL 'SELECT * FROM flatfile_tab_pos2' .. raw:: html

	name	number
0	one	75
1	one\ttab	100
2	two	73
3	three	77
4	six	38