y_serial – warehouse compressed Python objects with SQLite

Intro and quick example

In about ten minutes, you should be able to simply give some labels to any Python object and save it to a database file; then get back a set of objects by specifying a portion of their labels.

Here’s a quick EXAMPLE for a typical situation. After downloading the y_serial module, create an instance which is associated with a regular file:

import y_serial_v060 as y_serial
demo = y_serial.Main( '/tmp/agency.sqlite' )
#  ^instance          ^include suitable path for database file

#         ... now we do some work producing an object obj, e.g.
obj = 911

That object could have been a dictionary with a complex structure, but let’s continue on for the insert:

demo.insert( obj, "#plan agent007 #london", 'goldfinger' )
#                 ^notes                    ^table

We label each object with “notes”: arbitrarily long text (UTF-8 is fine) containing keywords or tags, but excluding commas. Within that file we specify a “table” (merely an organizational subsector). Some time later, perhaps in another script, we will want to select some object:

eg1 = demo.select( "agent00[1-7],#plan", 'goldfinger' )
#                   ^search values are space-sensitive
#                                  and comma separated;
#                    arbitrarily many permitted in string.

print "Example 1:  ", eg1
#            That reveals the _latest_ goldfinger plan
#            which involves any one of the top seven agents
#            anywhere in the world including London.

That’s it... only a few lines of Python code to store compressed serialized objects in a database, and to selectively retrieve them (with optional regular expressions and, remarkably, without writing any SQL commands). DEAD SIMPLE – with only one module imported. Hopefully you see how widely this is applicable...

Installation and license

Sole requirement: Python version 2.x where x is 5 or greater.

Download the latest version of the module at http://sourceforge.net/projects/yserial and put it where your Python can find it. No tar.gz or eggs here ;-) The module includes the tutorial documentation within itself. You are free to use y_serial under the BSD license. No monetary charge at all; however, if you are a developer, please critically review the code. More eyeballs lead to increased scrutiny, and thus greater reliability.

Overview

The purpose of y_serial is to keep data persistent. It is based on a key/value model, where the conceptual key is

  • filename + table_name + primary_key + timestamp + notes

and the value is some object. (Ironically this is implemented using a SQL database. ;-) Our goal is to have the simplest possible user interface in the foreground, yet have near optimal computing in the background.

By “objects” we generally mean Python objects, but we include support for files (binary, image, etc.) and URL content (e.g. webpages). Python objects include strings, dictionaries, lists, tuples, classes, and instances. Objects are inserted into a hierarchy: database file, table within that database, row within that table. Moreover, each object is annotated for reference by “notes” and automatically timestamped.

So what is happening in the background? To minimize storage size, each object is compressed. But before compression, we serialize the object. The processing between y_serial and the database is handled by the sqlite3 module (all dependencies are standard issue modules). Your program’s interaction with y_serial normally will not require writing SQL, although subqueries can be customized.
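
Conceptually, that background step is just pickle-then-zlib. Here is a minimal standalone sketch of the round trip (y_serial handles this internally; its exact pickle protocol and compression level may differ):

import cPickle, zlib

obj  = {'agent': 7, 'city': 'London'}
blob = zlib.compress( cPickle.dumps( obj, 2 ), 9 )   # serialize, then compress
back = cPickle.loads( zlib.decompress( blob ) )      # decompress, then unpickle
assert back == obj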

y_serial is written as a single Python module which reads like a working tutorial and includes many tips and references. It’s instructive in the way it unifies the standard batteries:

  • sqlite3 (as of Python v2.5)
  • zlib (for compression)
  • cPickle (for serializing objects)
  • [the “re” module is not used; instead we access SQLite’s much faster regular-expression functions]

Technicalities aside, you are spared from explicitly spelling out many of the difficult protocol details: cursor/connection, SQL/DB-API implementation, serialization, compression, search algorithm, etc. These are tuned to work together for speed, security, and concurrency, yet are handled transparently.

Our module is faster than comparable approaches under PostgreSQL. Among serialization methods, we found cPickle to be one of the fastest, and so we have used it in a secure manner. Try y_serial with a few million objects.

We recommend SQLite because it requires neither separate installation nor a server process; also, each database is a single ordinary file (easy to back up or send), not an elaborate filesystem. Moreover, in comparison to similar applications with MySQL or PostgreSQL, SQLite is extremely fast and suits most purposes wonderfully. Should you later decide to migrate out of SQLite, y_serial can help port your objects elsewhere, including other NoSQL implementations.

The means for insertion, organization by annotation, and finally retrieval are designed to be simple to use. The notes effectively label the objects placed into the database. We can later query the database, for example by regex (regular expression) search on the notes, placing the qualified objects in a dictionary. The keys of this dictionary correspond to the unique primary keys used in the database. If necessary we can access the timestamp for each object. We can thus use Python code to process the contents of the qualified dictionary, in effect a data subset. If the objects in that dictionary are themselves dictionaries, we are essentially dealing with schema-less data (see the compelling FriendFeed case study in the module’s Endnotes).

To illustrate, let’s continue our example by adding an object for agent006:

obj = 411
demo.insert( obj, "agent006 #paris #plan", 'goldfinger' )

#     We now can get a dictionary of objects
#     which matches our search values:
#
eg2 = demo.selectdic( "agent00[1-7],#plan", 'goldfinger' )
print "Example 2:  ", eg2
#
#          which should look like:
#     {1: [1257874696, u'#plan agent007 #london', 911],
#      2: [1257874696, u'agent006 #paris #plan', 411]   }

Notice that we used a different method called selectdic, which produces a dictionary whose keys are the unique primary keys automatically assigned in the database. Inside each list are the (unix) epoch timestamp, followed by the (unicode) notes, then the object. This means that we can work with flexible data subsets using Python code rather than cumbersome SQL.
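
For instance, that dictionary can be post-processed with plain Python; a tiny sketch, continuing with eg2 from above:

for key in sorted( eg2 ):
    tstamp, notes, obj = eg2[key]
    print key, tstamp, notes, obj
    #      ^primary key, timestamp, notes, object per row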

Other features

Instead of using comma-separated values, as in our example so far, we could have crafted a custom subquery and used a method called dicsub.
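
A sketch of that idea follows; note that the subquery text, the column names in it, and the argument order are our assumptions for illustration, so RTM for the exact contract which dicsub expects:

#  HYPOTHETICAL subquery -- table layout and argument order
#  are assumptions; see the module for the real contract:
sub = "SELECT * FROM goldfinger WHERE notes LIKE '%#paris%'"
eg5 = demo.dicsub( sub, 'goldfinger' )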

Or we could just skip any subquery altogether. Here we pick out the n-th most recent entry (0 means the latest):

eg3 = demo.select( 0, 'goldfinger' )
print "Example 3:  ", eg3

The method called “view” will verbosely pretty-print deeply nested structures:

demo.view( 5, 'goldfinger' )
#          ^view the last m inserts, here m=5 (or use a search string argument).

y_serial can also act like a persistent QUEUE. Whatever is retrieved can be deleted thereafter by appending “POP=True” to any applicable method:

eg4 = demo.select( 0, 'goldfinger', POP=True )
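
Used that way, a table behaves like a simple job queue. Here is a minimal worker-loop sketch; it assumes that select returns None once nothing matches, so verify that detail against the module:

while True:
    job = demo.select( 0, 'goldfinger', POP=True )
    if job is None:
        break                    # ^assumed empty-queue signal; RTM to confirm
    print "processing:", job     #  ... replace with real work ...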

Object(s) can of course be deleted directly:

demo.delete( "agent00?", 'goldfinger' )
#                    ^where notes mention any single-digit agent.

To get rid of stale data we could freshen a table and vacuum the entire database via clean:

demo.clean( 365.5 , 'goldfinger' )
#           ^retain only entries at most one year old, measured from the last insert.

To delete the entire table:

demo.droptable( 'goldfinger' )

Other useful methods are available:

  • insert any external file (via infile). This is handy for working with thousands of image files.
  • insert anything on the web by URL (via inweb).
  • insert in batches by generator (via ingenerator). This can be used to rapidly store a series of computationally intense results for quick retrieval at a later time (see the sketch after this list).
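
For example, a batch insert by generator might look like the following sketch; the yield format is our assumption for illustration, so RTM for what ingenerator actually expects:

def results():
    #  HYPOTHETICAL generator: we assume (notes, object) pairs;
    #  check the module for the exact form ingenerator expects.
    for k in range( 1000 ):
        yield ( "#batch run%d" % k, k*k )

demo.ingenerator( results(), 'goldfinger' )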

For concurrency we can easily code for a farm of databases using the module’s copy functions. In general, your program can have multiple interacting instances which control distinct database files.
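
For instance, using nothing beyond the Main class already shown, writes can be spread across several database files (the file names and the hashing scheme here are just illustrations):

NUMBARNS = 4
barns = [ y_serial.Main( '/tmp/barn%d.sqlite' % i )
          for i in range( NUMBARNS ) ]

def put( obj, notes, key, table ):
    #  route each write to a barn chosen by hashing the key
    barns[ hash(key) % NUMBARNS ].insert( obj, notes, table )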

[What’s in beta? In heavy concurrent situations (say, hundreds of near-simultaneous writes per second), SQLite has a problem because of the way it locks the database. We can alleviate the problem by diffusing such operations across many databases (called “barns” via class Farm), and then writing back to the target database as a batch (which is far more efficient than single writes) over time. The harvest time to reap the batch is stochastic (see the “plant” method). That introduces some latency in accessing the newest objects – the cost for scaling up concurrency.]

The class Main should be stable for all practical purposes. If you run across any bugs, please kindly report them to the project Tracker. For group discussions, see the project page at SourceForge.

For other specialized features, please RTM (“read the module”) for tips on usage.

Summary

y_serial = serialization + persistence. In a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. A highly useful NoSQL module, built entirely from the standard library, for storing schema-less data.



For project details on y_serial, including DOWNLOAD with support and discussion for users and developers, please visit us at http://sourceforge.net/projects/yserial