Can I Speedup Yaml?
Solution 1:
You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.
What's happening is that Python's json library encodes Python's built-in datatypes directly into text chunks, replacing ' with " and deleting , here and there (to oversimplify a bit).
On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string. The same kind of work has to happen in reverse when loading.
The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you're willing to write your own single-purpose, sort-of-YAML parser, taking the following comment into consideration:
YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know no object is repeated and only basic types appear, you can use a json serialiser; the output will still be valid YAML.
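To illustrate the quoted comment, here is a minimal sketch, assuming PyYAML is installed and the data holds only basic types. The names `record` and `shared` are made up for the example. It writes with the json module and reads the result back with yaml.safe_load, then shows the anchor and alias that PyYAML emits when the same list object appears twice, which is the case the representation graph exists for.

```python
import json
import yaml

# Hypothetical sample data: only basic types, no object referenced twice.
record = {"name": "example", "values": [1, 2.5, True, None], "nested": {"k": "v"}}

# Serialise with the fast json module; the output also parses as YAML.
text = json.dumps(record)
assert yaml.safe_load(text) == record

# When one object is referenced twice, YAML records the sharing with an
# anchor (&id001) and an alias (*id001); JSON simply writes the list twice.
shared = [1, 2]
as_yaml = yaml.dump({"a": shared, "b": shared})
as_json = json.dumps({"a": shared, "b": shared})

reloaded = yaml.safe_load(as_yaml)
print(as_yaml)
print(as_json)
# The shared identity survives the YAML round trip.
print(reloaded["a"] is reloaded["b"])
```

If your data never relies on that kind of sharing, nothing is lost by writing it with json.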
-- UPDATE
What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml uses the pure-Python parser. You have to tell it that you want to use PyYAML's C parser.
You can do it this way:
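import yaml
from yaml import CLoader as Loader, CDumper as Dumper

# Writing: fh is a file opened for writing; with a stream, yaml.dump returns None.
yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
# Reading: fh is a file opened for reading.
data = yaml.load(fh, Loader=Loader)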
For this to work, LibYAML and its development headers must be installed, for instance with apt-get (on Debian and Ubuntu the package is libyaml-dev):
$ apt-get install libyaml-dev
PyYAML must also be built with the LibYAML bindings, but based on your output that's already the case.
I can't test it right now because I'm running OS X and brew has some trouble installing the library, but the PyYAML documentation is pretty clear that performance will be much better.
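If the C classes might not be present on every machine, a common pattern (it is the one shown in the PyYAML documentation) is to fall back to the pure-Python classes. A minimal sketch, assuming PyYAML is installed; `sample` is a made-up document used only to show a round trip:

```python
import yaml

# Prefer the LibYAML-backed classes; fall back to pure Python if PyYAML
# was built without the C extension.
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

# True when this PyYAML build includes the LibYAML bindings.
print("LibYAML bindings available:", yaml.__with_libyaml__)

sample = {"a": [1, 2, 3], "b": {"c": "text"}}  # hypothetical data
text = yaml.dump(sample, Dumper=Dumper, default_flow_style=False)
data = yaml.load(text, Loader=Loader)
```

The rest of the code can then use Loader and Dumper without caring which implementation it got.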
Solution 2:
For reference, I compared a couple of human-readable formats, and Python's yaml reader is indeed by far the slowest. (Note the log scaling in the plot below.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:
Code to reproduce the plot:
import numpy
import perfplot

import json
import ujson
import orjson
import toml
import yaml
from yaml import Loader, CLoader
import pandas


def setup(n):
    numpy.random.seed(0)
    data = numpy.random.rand(n, 3)

    with open("out.yml", "w") as f:
        yaml.dump(data.tolist(), f)

    with open("out.json", "w") as f:
        json.dump(data.tolist(), f, indent=4)

    with open("out.dat", "w") as f:
        numpy.savetxt(f, data)

    with open("out.toml", "w") as f:
        toml.dump({"data": data.tolist()}, f)


def yaml_python(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=Loader)
    return out


def yaml_c(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=CLoader)
    return out


def json_load(arr):
    with open("out.json", "r") as f:
        out = json.load(f)
    return out


def ujson_load(arr):
    with open("out.json", "r") as f:
        out = ujson.load(f)
    return out


def orjson_load(arr):
    with open("out.json", "rb") as f:
        out = orjson.loads(f.read())
    return out


def loadtxt(arr):
    with open("out.dat", "r") as f:
        out = numpy.loadtxt(f)
    return out


def pandas_read(arr):
    out = pandas.read_csv("out.dat", header=None, sep=" ")
    return out.values


def toml_load(arr):
    with open("out.toml", "r") as f:
        out = toml.load(f)
    return out["data"]


perfplot.save(
    "out.png",
    setup=setup,
    kernels=[
        yaml_python,
        yaml_c,
        json_load,
        loadtxt,
        pandas_read,
        toml_load,
        ujson_load,
        orjson_load,
    ],
    n_range=[2 ** k for k in range(18)],
)
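If perfplot and the other third-party readers are not available, a rough comparison can still be made with the standard library's timeit. This is only a sketch under my own assumptions: the helper name `time_loaders`, the sample size `n_rows`, and the repeat count are made up, it needs PyYAML, and the numbers will differ from machine to machine.

```python
import json
import random
import timeit

import yaml


def time_loaders(n_rows=200, repeat=3):
    """Return best-of-`repeat` load times in seconds for the same data."""
    random.seed(0)
    data = [[random.random() for _ in range(3)] for _ in range(n_rows)]
    as_json = json.dumps(data)
    as_yaml = yaml.dump(data)

    loaders = {
        "json.loads": lambda: json.loads(as_json),
        "yaml.load (Loader)": lambda: yaml.load(as_yaml, Loader=yaml.Loader),
    }
    # CLoader only exists when PyYAML was built with the LibYAML bindings.
    if yaml.__with_libyaml__:
        loaders["yaml.load (CLoader)"] = lambda: yaml.load(as_yaml, Loader=yaml.CLoader)

    # Run each loader once per repetition and keep the fastest run.
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in loaders.items()
    }


if __name__ == "__main__":
    for name, seconds in sorted(time_loaders().items(), key=lambda kv: kv[1]):
        print(f"{name:22s} {seconds:.5f} s")
```

Unlike the perfplot script, this parses in-memory strings rather than files, so it measures parsing only and leaves disk I/O out.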
Solution 3:
Yes, I also noticed that JSON is way faster. So a reasonable approach would be to convert YAML to JSON first. If you don't mind Ruby, you can get a big speedup and ditch the yaml install altogether:
import json
import subprocess

def load_yaml_file(fn):
    # Let Ruby parse the YAML and print JSON, then decode the JSON in Python.
    ruby = "puts YAML.load_file(ARGV[0]).to_json"
    out = subprocess.check_output(["ruby", "-ryaml", "-rjson", "-e", ruby, fn])
    return json.loads(out)
Here is a comparison for 100K records:
load_yaml_file: 0.95 s
yaml.load: 7.53 s
And for 1M records:
load_yaml_file: 11.55 s
yaml.load: 77.08 s
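The same convert-first idea can be done without Ruby by paying the YAML cost once and caching the result as JSON. A minimal sketch, assuming PyYAML is installed, the file holds only JSON-compatible types (string keys, no dates), and a sidecar .json file next to the YAML file is acceptable. `load_yaml_cached` is a hypothetical helper, not part of any library:

```python
import json
import os

import yaml


def load_yaml_cached(path):
    """Load a YAML file, reusing a JSON sidecar cache when it is fresh."""
    cache = path + ".json"

    # Use the cache only if it is at least as new as the YAML source.
    if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
        with open(cache, "r", encoding="utf-8") as f:
            return json.load(f)

    # Cache miss: parse the YAML once (slow) and write the JSON copy.
    with open(path, "r", encoding="utf-8") as f:
        data = yaml.safe_load(f)
    with open(cache, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

Only the first load after each change to the YAML file goes through the YAML parser; later loads go through json.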
If you insist on using yaml.load anyway, remember to put it in a virtualenv to avoid conflicts with other software.