Author Archive

Run a Google Cloud Datalab instance on your computer

11 August 2017

In the official Google Cloud Datalab quickstart, Google gives you detailed steps for starting a GCP instance running the Jupyter notebook, where you'll experiment with all the functionality of Datalab.

But perhaps you don't want to pay the price of the instance. You don't need the cloud for that, since you have your own computer. In this case, to get a Datalab instance on your computer, you just need Docker.

docker run -it -p "127.0.0.1:8081:8080" -v $PWD:"/content" gcr.io/cloud-datalab/datalab:local

But suppose you have a BigQuery dataset you want to play with. To access that data easily from the Datalab notebook, as if you were on a dedicated instance, you'll have to:

  1. stop the running Datalab instance
  2. read https://developers.google.com/identity/protocols/application-default-credentials#howtheywork and get a credentials.json
  3. if you have started the Datalab instance at least once, you’ll have a datalab folder. Copy the credentials.json to the datalab/.config folder
  4. export GOOGLE_APPLICATION_CREDENTIALS=/content/datalab/.config/credentials.json
  5. once again, docker run -it -p "127.0.0.1:8081:8080" -v $PWD:"/content" gcr.io/cloud-datalab/datalab:local
  6. open your favourite browser to the address that has been printed to the console
  7. In the first code cell, type %projects set yourproject

Now you’re ready to play with your dataset. For example:

  1. add a code cell:
     %%sql --module records
     SELECT field1, field2, field3, field4
     FROM dataset.table
  2. add another code cell:
     import datalab.bigquery as bq
     df = bq.Query(records).to_dataframe()

Congratulations! You now have a working pandas DataFrame 😉

Categories: Uncategorized

The devil’s in the details

8 August 2017

You know, I always start a new BigQuery query by copying and pasting an existing one, then modifying it.

SELECT
record.attribute,
target.kind, target.uid, target.subType, target.className
context.contextType, context.contextId, context.contextSubType, context.contextClassName, context.parent.uid
FROM
TABLE_DATE_RANGE([dataset.raw_data_], TIMESTAMP("2017-08-08"), TIMESTAMP("2017-08-08"))
WHERE
record.attribute == 'value' AND
context.parent.uid is not null
order by createdAt desc
LIMIT 1000

Then, you execute it.

And you realize that the result types don't match what you expect. In the above query, all the fields holding a uid look like integers in the results. But in my data there's no integer at all. Worse than that: I have the impression that contextId and contextKind have been switched. Since my code is responsible for collecting this data, I turn to the unit tests. I check them and their expected results, and I stop at a breakpoint, just to check. Everything's fine.
I start telling myself that I must have mixed up the function calls in some cases. You know, Python is dynamically typed, and a wrong value could easily slip through. My code is pretty tolerant, and in the end all the data is stored as strings. So: breakpoints and logging instructions everywhere, part 2. I even discover that deferred.defer doesn't trace anything when you're on a queue other than the default one. Wow.
At last, I find myself logging the HTTP requests. Everything's fine on my side. OK.

Let's exclude a bug in BigQuery itself. What have I done wrong? It must be my SELECT. Let's select everything. Looks fine; my fields have the right types as far as I can see. Then let's have a look at the initial query. I start browsing the columns. record, OK. target, … Well, where has target.className gone? OK, I get it now. It's not a matter of switched columns. It's a matter of what BigQuery accepts as a valid query, and what its UI is able to parse.

One hour later, I'm happy. It was just a question of a missing comma.
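For the record: in the query above, the line listing the target fields ended without a comma, so BigQuery presumably treated the first context field on the next line as an alias for target.className instead of a column to select, which explains both the vanished className column and the apparently shuffled values. The corrected SELECT list:

```sql
SELECT
  record.attribute,
  target.kind, target.uid, target.subType, target.className,  -- the comma that was missing
  context.contextType, context.contextId, context.contextSubType, context.contextClassName, context.parent.uid
FROM
  TABLE_DATE_RANGE([dataset.raw_data_], TIMESTAMP("2017-08-08"), TIMESTAMP("2017-08-08"))
WHERE
  record.attribute == 'value' AND
  context.parent.uid IS NOT NULL
ORDER BY createdAt DESC
LIMIT 1000
```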

Categories: Uncategorized

Speeding up a GAE standard application's automated tests

29 May 2017

 

If you're developing on the Google App Engine standard environment, you know how slow the dev_appserver is. You have surely experienced its long setup times before the first request is served (probably because of the SQLite-based datastore implementation), and also its long shutdown times (when the search indexes are written to disk).

LumApps had an automated test suite, split up into scenarios. Each scenario performed a set of requests against a (new) dev_appserver instance.
Isolation was achieved by restoring the datastore and search index before playing a scenario. A specific scenario was in charge of recreating the reference datastore and search index. To complete the isolation, the instance was rebooted between two scenarios.

In this situation, the scenarios took 46 minutes to complete. We had no idea how much code was being covered; we just counted the number of public endpoints being called (which gave a rough estimate). Debugging the server to get more information about what was going on was also kind of prehistoric.

And last, on my system the tests took even longer than 46 minutes. Much longer. And I was unable to test the impact of my changes.

The journey begins

At first, I started writing unit tests. Since I was new to Python and GAE, that allowed me to discover coverage.py, pytest, mock, and the GAE testbed. I was delighted by their maturity and the functionality they offered. In particular: thanks Ned, thanks Alex.

After a few weeks of writing unit tests, my thoughts went back to the existing automated tests. I knew I could do something to speed them up.
How do they work? They're LumRest scenarios (LumRest is an open-source project). You put a list of commands in a YAML file, where each command corresponds to an endpoint to be called. Each command has a body (in the form of a JSON file, inline JSON, or a list of fields and the corresponding values). Within the scope of a command, a few keywords allow you to eval Python code or JSONPath expressions, just before or just after emitting the request. You can save a response in order to reuse it. And you can check that the response matches a given model/status code.
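Conceptually, such a scenario boils down to a loop like the following toy sketch. The names, the in-memory "endpoints", and the $reference convention are mine, for illustration only; real LumRest scenarios are YAML files driving HTTP calls:

```python
# A "scenario" is a list of commands; each names an endpoint and a body,
# and may save the response for later reuse or check it against expectations.
def run_scenario(scenario, endpoints, saved=None):
    saved = saved if saved is not None else {}
    for command in scenario:
        handler = endpoints[command["endpoint"]]
        # bodies may reference previously saved responses via "$name"
        body = {
            k: saved[v[1:]] if isinstance(v, str) and v.startswith("$") else v
            for k, v in command.get("body", {}).items()
        }
        response = handler(body)
        if "save_as" in command:
            saved[command["save_as"]] = response
        for key, expected in command.get("check", {}).items():
            assert response[key] == expected, (key, response[key], expected)
    return saved

# hypothetical in-memory endpoints standing in for real HTTP handlers
endpoints = {
    "user/create": lambda body: {"uid": "u1", "name": body["name"]},
    "user/get": lambda body: {"uid": body["uid"], "name": "alice"},
}

# what a parsed YAML scenario would look like
scenario = [
    {"endpoint": "user/create", "body": {"name": "alice"}, "save_as": "created"},
    {"endpoint": "user/get", "body": {"uid": "u1"}, "check": {"name": "alice"}},
]
```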

Discovering endpoints and messages

The first step I took was to discover endpoints and messages. When addressing Google Cloud Endpoints, you have to provide a typed request, and you receive a typed response. Type validation takes place when the endpoints are queried. Discovering the endpoints was pretty easy, using the get_api_classes() method.

our_endpoints = endpoints.api(name='application',
                              version='v1',
                              description="Application APIs",
                              documentation="http://api.company.com/application/",
                              allowed_client_ids=CLIENT_IDS)

def get_endpoints_map(endpoints_def):
    api_classes = endpoints_def.get_api_classes()
    paths = {}
    for cls in api_classes:
        base_path = cls.api_info._ApiInfo__path
        for _, method_desc in cls._ServiceClass__remote_methods.items():
            method_key = '{}/{}'.format(base_path, method_desc.method_info._MethodInfo__path)
            paths[method_key] = (cls, method_desc)
    assert paths
    return paths

api_map = get_endpoints_map(our_endpoints)
# > api_map
# {'user/list': <function>,
#  'user/get': <function>,
#  [..]}
This function hasn't evolved at all since then, which is a sign that it was good enough to get its job done. The discovery doesn't take long, and it is executed only once, at the beginning of the tests.

Call the endpoints…

Once the endpoints were known, how could I use them? I had the function to call, and with it the request/response types. But our implementation was just passing JSON objects in and getting JSON objects back.

On the one side, I searched the Google code for the classes that transform a JSON request into a Message, but after a while I decided that implementing a simple recursive algorithm would take less time. I was probably wrong, because I kept modifying this function until the last days. But at about 50 lines of code, today everything seems to work.

def process_value(value_type, value, is_repeated, contextualize=None):
    current = value
    variant = value_type.variant
    if is_repeated and not isinstance(value, list):
        current = [value]
    if is_repeated and value is None:
        current = []
    if variant == Variant.ENUM:
        current = value_type.type(value)
    if variant == Variant.STRING and isinstance(value, int):
        current = unicode(value)
    if variant == Variant.INT32 and isinstance(value, basestring):
        current = int(value)
    if variant == Variant.MESSAGE:
        if is_repeated:
            current = []
            if isinstance(value, list):
                current.extend(process_value(value_type, item, False, contextualize) for item in value)
            elif isinstance(value, dict):
                list_elem = value_type.type()
                for key, item in value.items():
                    if hasattr(list_elem, key):
                        current = [process_value(
                            getattr(value_type.type, key), item, value_type.type.repeated, contextualize
                        )]
            else:
                raise ValueError('unexpected type {} for value'.format(type(value)))
        else:
            current = value_type.type()
            for key, item in value.items():
                if hasattr(current, key):
                    subtype = getattr(value_type.type, key)
                    setattr(current, key, process_value(subtype, item, subtype.repeated, contextualize))
                else:
                    context = contextualize() if contextualize else ''
                    logger.warning("%s the request type <%s> lacks a '%s' attribute", context, value_type, key)
    return current

def call_endpoint(target_class, method_desc, contextualize=None, **kwargs):
    request_type = method_desc.remote.request_type
    response_type = method_desc.remote.response_type
    request = request_type()
    if kwargs:
        for key, value in kwargs.items():
            if hasattr(request, key):
                value_type = getattr(request_type, key, None)
                if value_type:
                    setattr(request, key, process_value(value_type, value, value_type.repeated, contextualize))
                else:
                    setattr(request, key, value)
            else:
                context = contextualize() if contextualize else ''
                logger.warning("%s the request type <%s> lacks a '%s' attribute", context, request_type, key)

    instance = target_class()
    if isinstance(instance, Service):
        instance.initialize_request_state(FakeHttpRequestState())

    response = method_desc(instance, request)
    assert isinstance(response, response_type)
    return response

… and get something back

Then came the time to serialize the response. In this case, I was so dissatisfied with my implementation that after a few days I searched the Google code more in depth, finally finding ProtoJson. This is probably not the code used by the appserver (because the serialization sometimes differs, when it comes to nested empty dictionaries/lists).

def typed_response_to_dict(instance):
    converted = instance
    if isinstance(instance, Message):
        original_instance = copy.deepcopy(instance)
        converted = json.loads(ProtoJson().encode_message(instance))
        # fixette: to pass the workflow tests (dictionaries which contain only None
        # values are dropped, up to the root); this is not true for the dev_appserver
        original_properties = getattr(original_instance, 'properties', {})
        properties = getattr(instance, 'properties', {})
        if original_properties and not properties:
            converted['properties'] = {}
    elif isinstance(instance, BaseEndpointsModel):
        logger.warning('We are receiving a BaseEndpointsModel instead of a protorpc.messages.Message')
        converted = instance.to_dict_full()
    return converted

Stubbing out the dev_appserver — a rapid introduction

The pitch of this dissertation was the sluggishness of the dev_appserver. So, how can we make it faster?
When you're unit-testing a GAE application, you can use the testbed. It's a great piece of code.
My knowledge of the dev_appserver is still small, but: it uses a set of stubs to fulfill its tasks. In the production environment, these stubs are replaced with real services, queried through an api_proxy. In the local environment, the dev_appserver uses SQLite for the datastore stub, and the RAM for the memcache and search_index stubs.
In the unit-test context, you will be using an alternative datastore stub (based on simple pickling/unpickling of objects to the filesystem) and the same stubs for the search_index and memcache.
You may also want to use the urlfetch stub (when consuming data from Google Cloud Storage, for example). It's good to know that you will have to initialize the blobstore stub along with the urlfetch stub:

self.testbed.init_blobstore_stub()
self.testbed.init_urlfetch_stub()

And, if your application is made up of several modules, you will also need the modules stub. I suggest you read http://stackoverflow.com/a/28228867 to learn how to initialize all the modules required by your application.
And, at last, if your application uses deferred tasks and/or background tasks, you will have to initialize the taskqueue stub, specifying the path to the folder containing the queue.yaml file.
I haven't mentioned the mail stub or the app_identity stub (or all the other stubs you could need for your tests). It's better to read the official documentation; there's always a useful option you can profit from.

Persist data

If you'd like to persist the data at the end of a test, you can use the datastore_file=path, save_changes=True options of init_datastore_v3_stub. For the search index stub, you will have to get the stub and use its Write method.
We use this technique for our 'generator' scenario.

At the test setup

self.testbed = testbed.Testbed()
self.testbed.activate()
self.testbed.init_memcache_stub()
self.testbed.init_datastore_v3_stub(datastore_file=self.DATASTORE_FILE, save_changes=True)
[..]
from google.appengine.ext.testbed import SEARCH_SERVICE_NAME
if not enable:
    self.testbed._disable_stub(SEARCH_SERVICE_NAME)
    return
from google.appengine.api.search import simple_search_stub
if simple_search_stub is None:
    from google.appengine.ext.testbed import StubNotSupportedError
    raise StubNotSupportedError('Could not initialize search API')
stub = simple_search_stub.SearchServiceStub(index_file=self.SEARCH_INDEX_FILE)
self.testbed._register_stub(SEARCH_SERVICE_NAME, stub)

At the test teardown

# nothing to do with the datastore stub (thanks to the save_changes kwarg)
self.search_stub.Write()

LumRest grammar support

Since the topic of this document is performance, I won't give you details. In order to execute the existing tests, I had to support the DSL they were written in. It took a certain amount of time, and it's not fully supported yet. The commands supported today allow my layer to execute 99% of the tests (and to get hints about what's going wrong or what could be improved).

Background tasks execution

The testbed doesn't provide any kind of task runner. It's up to you to decide whether to execute the tasks that have been queued during a unit test (or just check that they're there).
The official Google documentation gives an example of how to execute deferred tasks. But old applications probably use task handlers.
A task handler is registered as a special route at application startup.
In the end, a task is just a method of an HTTP request handler, and in order to interact with a task you have to provide a specially crafted HTTP request. I have already spoken about reimplementing the serialization/deserialization of requests and responses, but this time it was definitely simpler. In fact, I kept some bugs in the execution logic till the last days, just to spice up my experience 🙂
The whole task-runner logic takes about 80 lines of code, and will be the longest excerpt in this dissertation.

class AggregateException(Exception):
    def __init__(self, message, errors):
        super(AggregateException, self).__init__(message)
        self.errors = errors

class FakeHttpRequestState(object):
    def __init__(self, **kwargs):
        self.headers = kwargs

class FakeSessionStore(object):
    def __init__(self):
        self.config = {'cookie_args': {}}

    def get_session(self, factory=None):
        return factory('mock', self).get_session()

    def get_secure_cookie(self, *args, **kwargs):
        return ''

class TaskRunner(object):
    def __init__(self):
        self.routes_patterns = []
        for route in routes:  # your web application routes
            pattern = re.compile(route[0])
            self.routes_patterns.append((pattern, route[1]))

    @staticmethod
    def __init_handler(handler, task):
        environ = {}
        method = task.method.upper()
        url = task.url
        if task.payload:
            args = {method: task.payload}
        else:
            args = {}

        handler.request = Request.blank(url, environ=environ, headers=task.headers, **args)
        handler.session_store = FakeSessionStore()

    def run_task(self, task):
        if task.url == '/_ah/queue/deferred':
            deferred.run(task.payload)
        else:
            handler_cls = None
            for route in self.routes_patterns:
                if route[0].match(task.url):
                    handler_cls = route[1]
                    break
            if not handler_cls:
                raise ValueError("handler not found for task: {}/{}".format(task.url, task.payload))
            handler = handler_cls()
            self.__init_handler(handler, task)
            method = getattr(handler, task.method.lower())
            method()

    def safe_run_task(self, task):
        try:
            self.run_task(task)
        except Exception as err:
            task_desc = task.url
            if task.url == '/_ah/queue/deferred':
                import pickle
                task_unpickled = pickle.loads(task.payload)
                task_desc = task_unpickled[1][:2]
                if len(task_desc) == 2:
                    task_desc = u'{}.{}'.format(type(task_desc[0]).__name__, task_desc[1])
                else:
                    task_desc = repr(task_unpickled[0].func_code)
            logger.exception("caught exception during the execution of the task '%s': %s", task_desc, err)
            return err

    def run_tasks(self, tasks):
        exceptions = []
        for task in tasks:
            val = self.safe_run_task(task)
            if val:
                exceptions.append(val)
        if exceptions:
            raise AggregateException(
                'caught one or more exceptions during the execution of background tasks', exceptions
            )

Stubbing out the HTTP communications

The application was still interacting with third-party APIs. This was a pain in the neck, because of sluggish/unstable network connections (yeah, they still exist in 2017).
For this reason, at some point we started using vcrpy. This kind of tool replaces all the classes/methods responsible for communicating with a remote server via HTTP. The replacements record all the exchanges (on their first execution). And if a scenario has already been recorded, vcrpy uses the recorded exchanges to simulate the dialog between our application and the third-party server.
This way of proceeding is safe unless the third-party APIs undergo a breaking change.
For our tests, that meant decorating all the methods with the attribute:

@vcr.use_cassette(cassette_library_dir=os.path.join(os.path.dirname(os.path.dirname(__file__)), 'data/cassettes'))
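The record/replay idea itself fits in a few lines. Here is a toy, pure-Python rendition of the concept; this is not vcrpy's implementation or API, just the mechanism it automates for HTTP:

```python
import json
import os

class Cassette(object):
    """Toy record/replay cache: real fetch on the first run, replay afterwards."""

    def __init__(self, path, fetch):
        self.path = path    # where the recorded exchanges live
        self.fetch = fetch  # the real network call
        self.recorded = {}
        if os.path.exists(path):
            with open(path) as fp:
                self.recorded = json.load(fp)

    def request(self, url):
        if url not in self.recorded:
            # first execution: perform the real exchange and record it
            self.recorded[url] = self.fetch(url)
            with open(self.path, "w") as fp:
                json.dump(self.recorded, fp)
        # afterwards: replay the recorded response, no network involved
        return self.recorded[url]
```

As with vcrpy cassettes, the recording stays valid only as long as the remote API doesn't change its responses.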

Conclusion

What’s strange, today

  • The GAE DatastoreFileStub seems to have some concurrency-related bugs. One of our tests was failing because its results were not consistent. Mocking the threading.Thread() start and join methods allowed us to get past these buggy behaviors.
  • The sort order doesn't seem to work for datetimes: when querying objects from the datastore (ndb) and sorting them by their createdAt date, we get results that are not sorted.

Results met

  • The tests execute faster. They take only 20% of the time they took initially.
  • We are able to get branch-level code coverage indicators. We know which portions of the code can be changed confidently.
  • We are able to debug a test using pdb (or pydev).
  • I am able to evaluate confidently what I'm breaking! And it doesn't take an entire night 😉

What could be improved

  • drop the support for the LumRest grammar and write the tests directly in Python. To accomplish this aim, we should be able to execute the tests on the stubs AND on the dev_appserver (like LumRest did). This would allow us to detect misbehaviors in the Google App Engine communication layers. The best way to do this kind of test would be to use a dedicated, deployed test environment (identical to the production one). Advantage of this solution: you don't need to learn LumRest to write a test.
  • surely, the current lumrest-stubs implementation could be made faster yet.


Categories: Uncategorized

Exchange 2010 and Organizational Forms Libraries

26 February 2010

With Exchange 2010 you have the possibility to create a Public Folder Database (PFDB), in order to keep backward compatibility with Outlook 2003 and Entourage clients. You can create such a PFDB by answering yes to a question during setup, or later with a specific procedure in the Exchange Management Console. Once such a PFDB is available, you can create an Organizational Forms Library.

Concerning these Organizational Forms Libraries (used to store Outlook forms since at least Exchange 2000), an official procedure existed for Exchange 2007: http://support.microsoft.com/kb/933358

Trying to complete this procedure, you'll find out that the PR_URL_NAME property is not available for an Exchange 2010 Organizational Forms Library. So here are the missing steps, which you should execute right after step 2.g of the official Microsoft procedure:

  • On the Property Pane menu, click on Modify “Extra” Properties
  • Click Add, then type in the Property Tag textbox the value 0x6707001E
  • In the Property Type textbox the value PT_STRING8 appears. In the Property Name box the PR_URL_NAME string appears along with some other characters
  • Click OK twice
  • Double click on the PR_URL_NAME property.
  • Type in the Ansi String textbox the value /NOM_IPM_SUBTREE/EFORMS REGISTRY/[The Name Of Your Organizational Forms Library]
  • Click OK
  • Now you can complete the official procedure, starting at step 2.h

A ‘very pushed technical test’ before your hiring

17 November 2009

I recently took my employer's 'development test'. The 2009 version of this test looks a lot like first-year university programming tests (optimize memory/CPU consumption, create your own data types and handling functions, avoid the C++ standard library). The test looked like this text.

This was the first step of a hiring process which included a second technical test. This further test was taken in a 'solo' condition, which excluded the use of the Internet/computers/manuals. It included some code samples that I had to correct, a couple of small programs performing basic operations (OOP, memory allocation, and developer logic being the examined subjects), and some general computing knowledge questions (what certain technologies are, how to perform a typical action on a Unix system, et cetera).

(somebody calls this test, or its equivalent of ten years ago, a 'test technique assez poussé', that is, a 'fairly advanced technical test'… http://emploi.journaldunet.com/magazine/1168/)

Categories: Uncategorized

expose yourself to ghostscript

16 October 2009

ghostscript is a powerful, yet slightly complicated tool.

  • it allows you to change the paper format or resolution of a pdf,
  • it allows you to convert it to other electronic formats, among which tiff
  • it allows you to change its properties, and to optimize it for different usages

in the last weeks, I've used gs to change the resolution of an input file while converting it to tiff. then I've discovered other options, among which the one that allows me to force the output paper format. on the web you find a lot of spam about proprietary tools that allow you to do this same operation, veiling the underlying gs technology.
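the heart of such a conversion is a single gs invocation. here is a small python sketch that builds one; the file names, the resolution, and the tiffg4 device are only examples, see gs(1) for the real breadth of options:

```python
import subprocess

def build_gs_command(src, dest, paper="a4", dpi=300):
    """Build a ghostscript command line converting src (a pdf) to a tiff,
    forcing the given paper format and output resolution."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=tiffg4",               # monochrome G4-compressed tiff output
        "-r{}".format(dpi),              # output resolution in dpi
        "-sPAPERSIZE={}".format(paper),  # force the output paper format
        "-dFIXEDMEDIA",                  # keep the requested media size
        "-o", dest,
        src,
    ]

# to actually run it (gs must be installed):
# subprocess.check_call(build_gs_command("input.pdf", "output.tiff"))
```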

thus I've published a bash script, which assumes the presence of gs in your execution environment, and which converts any common paper format to any other. I've named it letter2a4, since its default behavior is to change the paper format of the input document to a4.

feel free to download it. you’ll only need to copy aligntop.ps to /usr/local/share or wherever you like. feel equally free to read the man page for gs(1). 🙂

Update: this morning I have also had the time to write a gs wrapper to merge all the files (in the given directory|passed as arguments). Download it, if you like.

(thanks, mehul, for this idea)

Converter.jar: an electronic format converter

29 September 2009

Converter.jar is an applet to convert every kind of file to every other format (well, almost…). This applet uses the Esker Web Services (thanks a lot!) to convert the input file to the PDF format (by default) or to the format you specify.

You will need an EskerOndemand account to use it, but format conversions are free (thanks again, Esker!)

Converter can also convert all the files contained in a given folder to the desired format; have a look at its README 🙂

The folder where you can download its binaries and a pair of sample command lines: http://nilleb.com/pub/opencode/Converter/. If you want to have a (look at|copy of) the sources, simply ask (or have a look at the referenced web services' online help).

Categories: Uncategorized