module df.dataframe_io_helpers

Short summary

module pandas_streaming.df.dataframe_io_helpers

Saves and reads a dataframe into a zip file.

source on GitHub

Classes

class

truncated documentation

JsonIterator2Stream

Transforms an iterator on JSON items into a stream which returns an item as a string every time method …

JsonPerRowsStream

Reads a JSON stream and adds ',', '[', ']' to convert a stream containing one JSON object …

Functions

function

truncated documentation

enumerate_json_items

Enumerates items from a JSON file or string.

flatten_dictionary

Flattens a dictionary with nested structure to a dictionary with no hierarchy.

Methods

method

truncated documentation

__init__

__init__

__iter__

Iterate on each row.

getvalue

Returns the whole stream content.

read

Reads the next item and returns it as a string.

read

Reads characters, adds ',', '[', ']' if needed. So the number of read characters is not necessarily …

readline

Reads a line, adds ',', '[', ']' if needed. So the number of read characters is not necessarily the …

write

The class does not write.

Documentation

Saves and reads a dataframe into a zip file.

class pandas_streaming.df.dataframe_io_helpers.JsonIterator2Stream(it, **kwargs)[source]

Bases: object

Transforms an iterator on JSON items into a stream which returns an item as a string every time method read is called. The iterator could be one returned by enumerate_json_items.

Reshape a json file

The function enumerate_json_items reads any JSON, even if a record is split over multiple lines. Class JsonIterator2Stream exposes this iterator as a stream in which each row is a single item.

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items, JsonIterator2Stream

text_json = '''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in JsonIterator2Stream(enumerate_json_items(text_json)):
    print(item)

>>>

    {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":[{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}]}}}
    {"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":[{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}]}}}}

Parameters
  • it – iterator

  • kwargs – arguments to json.dumps
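To make the role of the two parameters concrete, here is a minimal sketch of the idea using only the standard library. The class IteratorAsStream is a hypothetical stand-in written for this illustration, not the pandas_streaming implementation; in particular, returning an empty string once the iterator is exhausted is an assumption borrowed from file objects.

```python
import json


class IteratorAsStream:
    """Hypothetical stand-in: serializes one item per call to read()."""

    def __init__(self, it, **kwargs):
        self.it = iter(it)
        self.kwargs = kwargs  # forwarded unchanged to json.dumps

    def read(self):
        # Serialize the next item; an empty string signals the end
        # (assumption: mimics the behavior of an exhausted file object).
        try:
            item = next(self.it)
        except StopIteration:
            return ""
        return json.dumps(item, **self.kwargs)

    def __iter__(self):
        # Iterate on the remaining rows, one JSON string per item.
        for item in self.it:
            yield json.dumps(item, **self.kwargs)

    def write(self, *args):
        raise NotImplementedError("read-only stream")


records = [{"a": 1, "b": {"c": 2}}, {"a": 3, "b": {"c": 4}}]
stream = IteratorAsStream(records, separators=(",", ":"))
first = stream.read()   # one item, compact separators
rest = list(stream)     # the remaining items
```

Passing separators=(",", ":") through kwargs is what produces compact rows in this sketch; any other json.dumps argument could be forwarded the same way.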

__init__(it, **kwargs)[source]
Parameters
  • it – iterator

  • kwargs – arguments to json.dumps

__iter__()[source]

Iterate on each row.

read()[source]

Reads the next item and returns it as a string.

write()[source]

The class does not write.

class pandas_streaming.df.dataframe_io_helpers.JsonPerRowsStream(st)[source]

Bases: object

Reads a JSON stream and adds ',', '[', ']' to convert a stream containing one JSON object per row into a single JSON object. It only implements method readline.

Parameters

st – stream
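The conversion described above can be sketched with the standard library alone. RowsToArrayStream is a hypothetical class written for this illustration and is not the code of JsonPerRowsStream; skipping blank rows and returning '[]' for an empty input are assumptions of the sketch.

```python
import io
import json


class RowsToArrayStream:
    """Hypothetical sketch: exposes a one-object-per-row stream as a JSON array."""

    def __init__(self, st):
        self.st = st
        self.begin = True   # no row emitted yet
        self.done = False   # closing bracket already emitted

    def readline(self):
        if self.done:
            return ""
        line = self.st.readline()
        while line and not line.strip():
            line = self.st.readline()  # skip blank rows (assumption)
        if not line:
            self.done = True
            return "[]" if self.begin else "]"
        # The first row opens the array, every other row is preceded by a comma,
        # so the returned string is longer than the underlying line.
        prefix = "[" if self.begin else ","
        self.begin = False
        return prefix + line

    def read(self):
        # Concatenate lines until readline returns the empty string.
        return "".join(iter(self.readline, ""))


rows = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
text = RowsToArrayStream(rows).read()
parsed = json.loads(text)
```

Because a character is added in front of each row, a reader of such a stream receives more characters than the wrapped stream delivered, which is the behavior the read and readline entries warn about.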

__init__(st)[source]
Parameters

st – stream

getvalue()[source]

Returns the whole stream content.

read(size=-1)[source]

Reads characters and adds ',', '[', ']' if needed, so the number of characters read is not necessarily the requested one and could be greater.

readline(size=-1)[source]

Reads a line and adds ',', '[', ']' if needed, so the number of characters read is not necessarily the requested one and could be greater.

pandas_streaming.df.dataframe_io_helpers.enumerate_json_items(filename, encoding=None, lines=False, flatten=False, fLOG=None)[source]

Enumerates items from a JSON file or string.

Parameters
  • filename – filename or string or stream to parse

  • encoding – encoding

  • lines – one record per row

  • flatten – call flatten_dictionary

  • fLOG – logging function

Returns

iterator on records at first level.

It assumes the syntax follows the format: [ {"id":1, ...}, {"id": 2, ...}, ...]. However, if option lines is true, the function considers that the stream or file has one record per row, as follows:

{"id":1, ...}
{"id": 2, ...}
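The difference between the two layouts can be illustrated with the standard json module. The helper iter_records below is hypothetical and loads the whole text in memory; it only mirrors the meaning of the lines option and does not reproduce how enumerate_json_items parses its input.

```python
import json

# Layout 1: a single JSON array holding every record.
array_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

# Layout 2: one record per row, no enclosing brackets, no commas.
lines_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'


def iter_records(text, lines=False):
    """Hypothetical helper: yields the first-level records of either layout."""
    if lines:
        for row in text.splitlines():
            if row.strip():          # ignore empty rows
                yield json.loads(row)
    else:
        yield from json.loads(text)  # the array is parsed in one go


records_from_array = list(iter_records(array_text))
records_from_lines = list(iter_records(lines_text, lines=True))
```

Both calls yield the same two dictionaries; only the way the text is split into records differs.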

Processes a json file by streaming.

The module ijson can read a JSON file by streaming. This module is needed because a record can be written on multiple lines. This function leverages it and produces the following results.

<<<

from pandas_streaming.df.dataframe_io_helpers import enumerate_json_items

text_json = '''
    [
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": [{
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }]
            }
        }
    },
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": [{
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }]
                }
            }
        }
    }
    ]
'''

for item in enumerate_json_items(text_json):
    print(item)

>>>

    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': [{'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}]}}}
    {'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': [{'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}]}}}}
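For readers who want to see why records spread over several lines need more than a line-by-line parser, the following sketch yields first-level records one at a time with json.JSONDecoder.raw_decode from the standard library. It is an illustration only, not what enumerate_json_items does: the function iter_top_level_records is hypothetical, it assumes every record is a JSON object, and it keeps the whole text in memory, whereas ijson reads from the file incrementally.

```python
import json


def iter_top_level_records(text):
    """Hypothetical sketch: lazily yields the objects of a top-level array."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # Skip whitespace and the array punctuation found between records.
        if text[pos].isspace() or text[pos] in "[,]":
            pos += 1
            continue
        # raw_decode parses one complete value starting at pos and returns
        # the value together with the index where it stopped.
        record, pos = decoder.raw_decode(text, pos)
        yield record


text = '''
[
  {"id": 1,
   "tags": ["a", "b"]},
  {"id": 2,
   "tags": []}
]
'''

records = iter_top_level_records(text)
first = next(records)        # only the first record has been parsed so far
remaining = list(records)
```

Nested arrays inside a record are consumed by raw_decode itself, so the punctuation skipping only applies between records.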

pandas_streaming.df.dataframe_io_helpers.flatten_dictionary(dico, sep='_')[source]

Flattens a dictionary with nested structure to a dictionary with no hierarchy.

Parameters
  • dico – dictionary to flatten

  • sep – string to separate dictionary keys by

Returns

flattened dictionary

Inspired from flatten_json.
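As an illustration of what flattening means here, the sketch below joins nested keys with the separator. The function flatten is hypothetical and written for this example; numbering list elements by their index and dropping empty containers are assumptions of the sketch, and flatten_dictionary may treat those cases differently.

```python
def flatten(dico, sep="_"):
    """Hypothetical sketch: flattens nested dictionaries and lists."""
    flat = {}

    def visit(value, prefix):
        if isinstance(value, dict):
            for key, sub in value.items():
                visit(sub, f"{prefix}{sep}{key}" if prefix else str(key))
        elif isinstance(value, list):
            # Assumption: list elements are keyed by their position.
            for index, sub in enumerate(value):
                visit(sub, f"{prefix}{sep}{index}" if prefix else str(index))
        else:
            flat[prefix] = value  # leaf value, stored under the joined key

    visit(dico, "")
    return flat


nested = {"id": 1, "user": {"name": "ada", "tags": ["x", "y"]}}
flat = flatten(nested)
flat_dot = flatten(nested, sep=".")
```

With the default separator the nested key path user / name becomes the single key user_name, which is the shape a dataframe column name needs.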
