From initial tests of the alvavm_train module, I found that a surprising amount of time was spent on 'preprocessing' (IO and cleaning the dataset). I found two main culprits: one was pd.read_sql, which had an extremely high memory load, and the other was double_sales. Here I have performed a non-exhaustive, ad-hoc exploration of alternatives to pandas.read_sql to reduce both execution time and peak memory load.
Inspired by this issue, we first explore the effect of the DB connector.
Now let's define five methods to compare:
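To give a feel for the approach, here is a minimal sketch of the temp-file variant. I am assuming a psycopg2 connection here; the name read_sql_tmpfile matches the variant benchmarked below, but this body is my reconstruction, not the exact implementation:

```python
import tempfile

import pandas as pd


def read_sql_tmpfile(query, conn):
    """Stream the query result to a temporary file with PostgreSQL COPY,
    then parse it with pd.read_csv.

    `conn` is assumed to be a psycopg2 connection.
    """
    copy_sql = f"COPY ({query}) TO STDOUT WITH CSV HEADER"
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        with conn.cursor() as cur:
            cur.copy_expert(copy_sql, tmp)  # server streams CSV into the file
        tmp.seek(0)
        return pd.read_csv(tmp)
```

The point of the temp file is that the full result set never has to live in Python objects: the server writes CSV bytes to disk, and pandas' C parser reads them back.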
The current method for reading SQL performs OK both with respect to memory and execution time, and the connection type does not have a major impact. Note, however, that the standard deviations of these measurements are large, and drawing any conclusions from the above is difficult. That said, from running these tests many times, the raw connection consistently performs better (I have no insight into why).
Please note that the methods below go through pd.read_csv, so we can now set dtypes and usecols, something which was not directly available with read_sql.
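For instance, with a made-up schema (the column names here are illustrative, not from the real tables):

```python
import io

import pandas as pd

# Illustrative CSV standing in for the COPY output; the real schema differs.
csv_data = io.StringIO(
    "id,price,sold_date,comment\n"
    "1,1200000,2019-01-15,ok\n"
    "2,950000,2019-02-03,fine\n"
)

df = pd.read_csv(
    csv_data,
    usecols=["id", "price", "sold_date"],       # skip columns we never use
    dtype={"id": "int32", "price": "float32"},  # avoid the 64-bit defaults
    parse_dates=["sold_date"],                  # datetime64 instead of object
)
print(df.dtypes)
```

Both options cut memory at parse time: usecols means the dropped columns are never materialized, and dtype avoids pandas defaulting to int64/float64/object.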
Update: I did not include the parse_dates and date_parser options here, but they are included in the implementation in alvavm_train. Note that there is no memory gain from converting from object type to …
Now we are seeing some substantial and consistent improvements in both execution time and memory load. With the two best versions, including read_sql_tmpfile_dtype_memory_map, we are seeing ~2x faster execution times compared to alva_default, and perhaps more importantly a 30x reduction in memory load.
We can also see that writing to memory with StringIO is fast, but it still yields only a 2x improvement in memory load.
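For reference, the in-memory variant looks roughly like this (again assuming psycopg2; this is a sketch, not the exact benchmarked code). The whole CSV text is buffered in RAM before parsing, which is consistent with the smaller memory win:

```python
import io

import pandas as pd


def read_sql_stringio(query, conn):
    """COPY the query result into an in-memory text buffer, then parse it.

    Faster than a temp file (no disk round trip), but the full CSV text
    lives in RAM alongside the resulting DataFrame.
    """
    buf = io.StringIO()
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY ({query}) TO STDOUT WITH CSV HEADER", buf)
    buf.seek(0)
    return pd.read_csv(buf)
```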
Please keep in mind that these results were acquired from testing on my laptop with an SSD drive. On a server, which probably reads from a "spinning disk", the I/O bandwidth might be more limiting and could favor the compressed version.
To better understand where to direct additional efforts, we measure the COPY execution and the pd.read_csv parsing separately.
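The split can be measured along these lines (same psycopg2 assumptions as before; `time_copy_and_parse` is a name I made up for this sketch):

```python
import tempfile
import time

import pandas as pd


def time_copy_and_parse(query, conn):
    """Time the database COPY and the pd.read_csv parse separately."""
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        t0 = time.perf_counter()
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY ({query}) TO STDOUT WITH CSV HEADER", tmp)
        t_copy = time.perf_counter() - t0

        tmp.seek(0)
        t0 = time.perf_counter()
        df = pd.read_csv(tmp)
        t_parse = time.perf_counter() - t0
    return df, t_copy, t_parse
```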
We could most likely see significant improvements in the CSV timings if one wrote a C++ or Cython extension for reading CSV data and inferring types. However, I found this to be too much work compared to the potential gain. For future reference, see the following for C++ and Cython, or the very promising project feather.
We now focus on potential optimizations of the COPY from the database. I therefore changed the SQL query to only include the columns we are interested in.
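The change itself is just in the SQL passed to COPY, something along these lines (the column list is hypothetical; the real one would come from alvavm_train's feature set):

```python
# Hypothetical feature columns; the real list comes from the model code.
columns = ["id", "price", "sold_date"]

# Only the selected columns are serialized and shipped over the wire,
# instead of every column in the table.
query = f"SELECT {', '.join(columns)} FROM market_transaction"
copy_sql = f"COPY ({query}) TO STDOUT WITH CSV HEADER"
print(copy_sql)
```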
In the above test we did not observe any significant differences in either memory or execution time. That being said, the market_transaction dataset might not be the best benchmark, as we include almost all of its columns. One of the two runners-up is read_sql_tmpfile_dtype_selective_quary, and to find a winner we will also check the memory load and execution times for the address table (the most memory demanding in the current version) and ape_per_fylke (the smallest table).
For ape_per_fylke I had to redefine the functions to allow setting na_values = "ingen data" (Norwegian for "no data").
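A small self-contained illustration of the effect, with toy data:

```python
import io

import pandas as pd

# Toy stand-in for the ape_per_fylke CSV output.
csv_data = io.StringIO("fylke,apes\nOslo,12\nViken,ingen data\n")

# Without na_values the mixed column would parse as object (strings);
# mapping "ingen data" to NaN lets it parse as a numeric column.
df = pd.read_csv(csv_data, na_values="ingen data")
```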
Wow, what an improvement! Going from 14 minutes and 22 seconds to 42 seconds is a 20x improvement in execution time, with a much lower memory load. I only write "much lower memory load" because I do not trust the measurements here; the df alone should be ~300 MiB after setting types.
Now comparing on a very small table, we still get much faster execution. Note that all functions now initiate the database connection inside the function body, allowing for a fair comparison. Here, however, creating the connector dominates the execution time. Let's try initiating the connections outside the functions.
Even with the database connector created outside the function, we still observe an improvement. In total, I judge the "only copy relevant columns" version the winner.
When testing on a 'real world' example, calling with double_sales=False, we get the following:
| Measure | New version | Old version | Ratio |
| --- | --- | --- | --- |
| Size of table | X | X | X |
This is a significant improvement and would allow us to scale better for future, larger data sets with even more features.
Another, indirect, improvement might be that further manipulation and computation based on the DataFrames could be faster, as the frames are smaller and the IO from memory is lower. However, I have not properly tested this aspect.
Of note is that neither the execution measurements nor the memory increments are very consistent. However, when executing a cell in the notebook, the methods always rank the same even though the absolute values change.
The memory measurements do not correspond to the size of the dataset. It might be worth testing other methods for measuring memory, either with tracemalloc, which you can read about here, or with a more manual approach like this (credit to Mikael Mortensen):
```python
from os import getpid
import subprocess


def getMemoryUsage(rss=True):
    """Return this process's memory usage in MiB, read from `ps`."""
    mypid = str(getpid())
    field = "rss" if rss else "vsz"
    process = subprocess.Popen(["ps", "-o", field, mypid],
                               stdout=subprocess.PIPE)
    out, _ = process.communicate()
    mymemory = out.split()[-1]  # last token is the value; the first is the header
    return int(mymemory) / 1024
```
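For comparison, a minimal tracemalloc sketch would look like this. Note that tracemalloc only tracks allocations made through Python's allocator (NumPy does register its allocations with it), so it can under-report compared to `ps`:

```python
import tracemalloc

import numpy as np
import pandas as pd

tracemalloc.start()

# Allocate a ~8 MiB frame and read off current and peak traced usage.
df = pd.DataFrame(np.zeros((100_000, 10)))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
```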
That being said, the exact memory usage is not too important in this context; the order of magnitude is what matters. I therefore did not explore any differences between the measurement methods.
Through working on this problem, I have noticed that we establish a range of connections to the database instead of reusing them, and they are never closed. This is a common source of memory leaks and is something to pay attention to.
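The pattern I would aim for is roughly the following sketch (generic DB-API code, not the actual alvavm_train implementation): open one connection, reuse it for every query, and guarantee that it is closed:

```python
from contextlib import closing


def run_queries(connect, queries):
    """Run all queries over a single connection and always close it.

    `connect` is any zero-argument callable returning a DB-API connection
    (e.g. a pre-configured psycopg2.connect); `queries` maps names to SQL.
    """
    results = {}
    # closing() guarantees conn.close() even if a query raises.
    with closing(connect()) as conn:
        for name, sql in queries.items():
            cur = conn.cursor()
            cur.execute(sql)
            results[name] = cur.fetchall()
            cur.close()
    return results
```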
Here I have only focused on serial execution times, but I expect these results to hold for parallel runs as well. It would be interesting to test how mendo or dask could potentially speed up parallel simulations.