Can I Speedup Yaml?

I made a little test case to compare YAML and JSON speed:

import json
import yaml
from datetime import datetime
from random import randint

NB_ROW=1024

print 'Does yaml is using

Solution 1:

You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

What's happening is that Python's json library encodes Python's built-in data types more or less directly into text, replacing ' with " and dropping a , here and there (to oversimplify a bit).

On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.

The same kind of stuff has to happen backwards when loading.
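To make that intermediate graph visible, here is a minimal sketch using PyYAML's yaml.compose, which stops after building the node graph and before any Python objects are constructed. The sample document is made up for illustration, and the sketch assumes PyYAML is installed.

```python
import yaml

# A made-up sample document; any small YAML text would do.
text = "name: test\nvalues: [1, 2, 3]\n"

# compose() runs the scanner, parser and composer, and returns the root
# node of the representation graph instead of plain Python objects.
root = yaml.compose(text)
print(type(root).__name__, root.tag)

# A mapping node stores its entries as (key_node, value_node) pairs.
for key_node, value_node in root.value:
    print(key_node.value, "->", type(value_node).__name__)

# safe_load() goes one step further and turns the graph into Python objects.
data = yaml.safe_load(text)
print(data)
```

Every scalar, sequence and mapping gets its own node object first, which is part of the work json never has to do.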

The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would bring a huge leap in performance, unless you're willing to write your own single-purpose, sort-of-YAML parser, taking the following comment into consideration:

YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know that no object is repeated and only basic types appear, you can use a JSON serialiser; the output will still be valid YAML.
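A small sketch of what "multiple references to the same object" means in practice, using only the standard library. The data is made up; the point is that JSON writes a shared list out twice and returns two independent copies, which is harmless only if you never relied on the sharing.

```python
import json

shared = [1, 2, 3]
data = {"a": shared, "b": shared}  # the same list object referenced twice

text = json.dumps(data)
print(text)  # {"a": [1, 2, 3], "b": [1, 2, 3]}

restored = json.loads(text)

# JSON has no notion of references: the list is written out twice and
# comes back as two independent objects.
print(data["a"] is data["b"])          # True
print(restored["a"] is restored["b"])  # False
```

For basic types like these, the JSON text is also a flow-style YAML document, so a YAML parser should be able to read it back unchanged.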

-- UPDATE

What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml uses the pure-Python parser. You have to tell it explicitly that you want the C-based parser and emitter (CLoader and CDumper, which are bindings to LibYAML).

You can do it this way:

import yaml
from yaml import CLoader as Loader, CDumper as Dumper

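# dummy_data is the object to serialise; fh is an open file handle.
# With a stream argument, yaml.dump() writes to it and returns None.
yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
data = yaml.load(fh, Loader=Loader)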

For this to work, you need the LibYAML C library and its headers installed (the libyaml-dev package on Debian and Ubuntu), for instance with apt-get:

$ apt-get install libyaml-dev

PyYAML must also be built with LibYAML support. But that's already the case based on your output.

I can't test it right now because I'm running OS X and brew has some trouble installing the library, but the PyYAML documentation is pretty clear that performance will be much better.
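Since the C classes only exist when PyYAML was built against LibYAML, a common pattern is to try the C import and fall back to pure Python. The sketch below is one way to do that, assuming PyYAML is installed; it swaps in the safe variants (CSafeLoader/CSafeDumper) for the plain Loader/Dumper used above, and the sample data is made up.

```python
import io
import yaml

# Prefer the LibYAML-backed classes; fall back to pure Python if PyYAML
# was built without the C extension.
try:
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
except ImportError:
    from yaml import SafeLoader, SafeDumper

# PyYAML records whether the C extension could be imported.
print("LibYAML bindings available:", yaml.__with_libyaml__)

# Made-up sample data, round-tripped through an in-memory text buffer.
sample = {"rows": [{"id": i, "name": "row%d" % i} for i in range(3)]}

buffer = io.StringIO()
yaml.dump(sample, buffer, default_flow_style=False, Dumper=SafeDumper)

buffer.seek(0)
data = yaml.load(buffer, Loader=SafeLoader)
print(data == sample)
```

The rest of the program uses SafeLoader and SafeDumper without caring which implementation was picked, so the same code runs on machines with and without LibYAML.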

Solution 2:

For reference, I compared a couple of human-readable formats, and indeed Python's yaml reader is by far the slowest. (Note the log scaling in the plot below.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:

[Figure: benchmark plot comparing load times of the readers listed in the code below]


Code to reproduce the plot:

import numpy
import perfplot

import json
import ujson
import orjson
import toml
import yaml
from yaml import Loader, CLoader
import pandas


def setup(n):
    numpy.random.seed(0)
    data = numpy.random.rand(n, 3)

    with open("out.yml", "w") as f:
        yaml.dump(data.tolist(), f)

    with open("out.json", "w") as f:
        json.dump(data.tolist(), f, indent=4)

    with open("out.dat", "w") as f:
        numpy.savetxt(f, data)

    with open("out.toml", "w") as f:
        toml.dump({"data": data.tolist()}, f)


def yaml_python(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=Loader)
    return out


def yaml_c(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=CLoader)
    return out


def json_load(arr):
    with open("out.json", "r") as f:
        out = json.load(f)
    return out


def ujson_load(arr):
    with open("out.json", "r") as f:
        out = ujson.load(f)
    return out


def orjson_load(arr):
    with open("out.json", "rb") as f:
        out = orjson.loads(f.read())
    return out


def loadtxt(arr):
    with open("out.dat", "r") as f:
        out = numpy.loadtxt(f)
    return out


def pandas_read(arr):
    out = pandas.read_csv("out.dat", header=None, sep=" ")
    return out.values


def toml_load(arr):
    with open("out.toml", "r") as f:
        out = toml.load(f)
    return out["data"]


perfplot.save(
    "out.png",
    setup=setup,
    kernels=[
        yaml_python,
        yaml_c,
        json_load,
        loadtxt,
        pandas_read,
        toml_load,
        ujson_load,
        orjson_load,
    ],
    n_range=[2 ** k for k in range(18)],
)

Solution 3:

Yes, I also noticed that JSON is way faster. So a reasonable approach would be to convert YAML to JSON first. If you don't mind Ruby, you can get a big speedup and ditch the yaml install altogether:

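import json
import subprocess

def load_yaml_file(fn):
    # Let Ruby parse the YAML and print it as JSON; the path goes in via ARGV to avoid quoting issues.
    ruby = "puts YAML.load_file(ARGV[0]).to_json"
    out = subprocess.check_output(["ruby", "-ryaml", "-rjson", "-e", ruby, fn])
    return json.loads(out)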

Here is a comparison for 100K records:

load_yaml_file: 0.95 s
yaml.load: 7.53 s

And for 1M records:

load_yaml_file: 11.55 s
yaml.load: 77.08 s

If you insist on using yaml.load anyway, remember to put it in a virtualenv to avoid conflicts with other software.
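If the same YAML file is read on every run, the convert-to-JSON idea can also be applied as a cache: parse the YAML once, keep a JSON copy next to it, and read the JSON as long as it is newer than the source. The sketch below is only an illustration of that idea; the function name, the ".json" suffix and the parse_yaml callback are all hypothetical, and it assumes the data contains only JSON-compatible types.

```python
import json
import os


def load_with_json_cache(yaml_path, parse_yaml):
    """Load yaml_path, reusing a JSON copy next to it when it is up to date.

    parse_yaml is any callable that takes an open text file and returns
    plain dicts, lists, strings and numbers (e.g. a wrapper around yaml.load).
    """
    cache_path = yaml_path + ".json"  # hypothetical naming scheme

    # Fast path: the JSON copy exists and is at least as new as the YAML file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(yaml_path)):
        with open(cache_path, "r") as f:
            return json.load(f)

    # Cache missing or stale: pay for the slow YAML parse once, then save JSON.
    with open(yaml_path, "r") as f:
        data = parse_yaml(f)
    with open(cache_path, "w") as f:
        json.dump(data, f)
    return data
```

It could be called as load_with_json_cache("config.yml", lambda f: yaml.load(f, Loader=yaml.SafeLoader)). One caveat: JSON object keys are always strings, so a mapping with integer keys would come back differently from the cache than from the YAML parser.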
