# Python's Summer of Code 2016 Updates

## September 06, 2016

### chrisittner (pgmpy)

#### Feature summary of BN structure learning in python pgm libraries

This is a (possibly already outdated) summary of structure learning capabilities of existing Python libraries for general Bayesian networks.

### libpgm.pgmlearner

• Discrete MLE Parameter estimation
• Discrete constraint-based Structure estimation
• Linear Gaussian MLE Parameter estimation
• Linear Gaussian constraint-based Structure estimation

Version 1.1, released 2012, Python 2
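To make "discrete MLE parameter estimation" concrete, here is a minimal pure-Python sketch, not the libpgm API: the function name `mle_cpt` and the toy data are assumptions made for illustration. The maximum-likelihood estimate of each conditional probability is simply a ratio of counts.

```python
from collections import Counter


def mle_cpt(rows, child, parents):
    """Estimate P(child | parents) from complete discrete data by counting."""
    joint = Counter()          # counts of (parent configuration, child value)
    parent_totals = Counter()  # counts of each parent configuration
    for row in rows:
        config = tuple(row[p] for p in parents)
        joint[(config, row[child])] += 1
        parent_totals[config] += 1
    # MLE: N(parents, child) / N(parents)
    return {
        (config, value): count / parent_totals[config]
        for (config, value), count in joint.items()
    }


# Hypothetical toy data set: does rain make the grass wet?
data = [
    {'rain': 1, 'wet': 1},
    {'rain': 1, 'wet': 1},
    {'rain': 1, 'wet': 0},
    {'rain': 0, 'wet': 0},
]
cpt = mle_cpt(data, 'wet', ['rain'])
```

Parent configurations that never occur in the data get no entry at all, which is one reason real libraries often add smoothing or priors.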

### bnfinder (also here)

• Discrete & Continuous score-based Structure estimation
• scores: MDL/BIC (default), BDeu, K2
• supports restriction to subset of data set, per node
• supports restrictions of parents set, per node
• allows restricting the search space (max number of parents)
• search method??
• Command line tool

Version 2, 2011-2014?, Python 2
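As a rough illustration of score-based structure estimation with a BIC-style score, the sketch below scores one node for a given candidate parent set. It is not bnfinder's implementation: the function name `bic_score` is hypothetical, and counting free parameters from the states and parent configurations that are actually observed is a simplifying assumption.

```python
import math
from collections import Counter


def bic_score(rows, child, parents):
    """Log-likelihood of child given parents minus a BIC complexity penalty."""
    joint = Counter()
    parent_counts = Counter()
    child_states = set()
    for row in rows:
        config = tuple(row[p] for p in parents)
        joint[(config, row[child])] += 1
        parent_counts[config] += 1
        child_states.add(row[child])
    log_likelihood = sum(
        count * math.log(count / parent_counts[config])
        for (config, _), count in joint.items()
    )
    # Assumption: (r - 1) free parameters per observed parent configuration.
    free_parameters = (len(child_states) - 1) * len(parent_counts)
    return log_likelihood - 0.5 * math.log(len(rows)) * free_parameters


# Hypothetical data in which 'wet' is fully determined by 'rain'.
data = [{'rain': 1, 'wet': 1}] * 4 + [{'rain': 0, 'wet': 0}] * 4
with_parent = bic_score(data, 'wet', ['rain'])
without_parent = bic_score(data, 'wet', [])
```

A score-based search would compare such scores across candidate parent sets and keep the best one; here the set containing `rain` wins despite its larger penalty.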

### pomegranate

• Discrete MLE Parameter estimation
• Can be used to estimate missing values in incomplete data sets prior to model parametrization

Version 0.4, 2016, Python 2, possibly Python 3

Further relevant libraries include PyMC, BayesPy, and the Python Bayes Network Toolbox. Also check out the bnlearn R package for more functionality.

## July 25, 2016

### Riddhish Bhalodia (dipy)

#### Brain Extraction Explained!

As promised I will outline the algorithm we are following for the brain extraction using a template, which is actually a combination of elements taken from [1] and [2].

### Step 1

Read the input data, input affine information, template data, template data mask, and template affine information.

### Step 2

We register the template data onto the input data. This involves two sub-steps.

#### (2.a) Affine Registration

Perform the affine registration of the template onto the input and get the transformation matrix, which will be used in the next step.

#### (2.b) Non-Linear Registration (Diffeomorphic Registration)

Using the above affine transform matrix as the pre-align information, we perform the diffeomorphic registration of the template over the input.

These two steps get most of it done! (This is also followed in [2].)

### Step 3

We use the transformed template and the input data in a non-local patch similarity method to assign labels to the input data. This part is taken from [1].
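To give a feel for the non-local patch similarity idea, here is a heavily simplified one-dimensional sketch in pure Python. It is not the code in the branch and not the BEaST implementation: the function names, the Gaussian weighting with parameter `h`, the border clamping, and the toy signals are all assumptions. Each input voxel gets a label by a similarity-weighted vote over the mask labels of nearby template voxels.

```python
import math


def patch(signal, center, radius):
    """Return the patch around center, clamping indices at the borders."""
    last = len(signal) - 1
    return [signal[min(max(i, 0), last)]
            for i in range(center - radius, center + radius + 1)]


def nonlocal_labels(image, template, template_mask,
                    patch_radius=1, search_radius=2, h=0.5):
    """Label each voxel by a patch-similarity-weighted vote of template labels."""
    labels = []
    for i in range(len(image)):
        target = patch(image, i, patch_radius)
        vote = 0.0
        weight_sum = 0.0
        start = max(0, i - search_radius)
        stop = min(len(template) - 1, i + search_radius)
        for j in range(start, stop + 1):
            candidate = patch(template, j, patch_radius)
            dist2 = sum((a - b) ** 2 for a, b in zip(target, candidate))
            weight = math.exp(-dist2 / (h * h))  # similar patches weigh more
            vote += weight * template_mask[j]
            weight_sum += weight
        labels.append(1 if vote / weight_sum >= 0.5 else 0)
    return labels


# Hypothetical registered template, its brain mask, and a noisy input.
template = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
template_mask = [0, 0, 1, 1, 1, 0, 0]
image = [0.1, 0.0, 0.9, 1.0, 1.1, 0.1, 0.0]
labels = nonlocal_labels(image, template, template_mask)
```

The real method works on 3D patches and a library of templates; the sketch only shows why a good registration matters, since the vote is restricted to a small search window around each voxel.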

This is it! The branch for brain extraction is here

## Experiments and Results

I am currently experimenting with NITRC IBSR data which has a manual brain extraction given with it. This will help me to validate the correctness of the algorithm.

## Next Up…

• Functional tests for the brain extraction process
• More datasets, even the harder ones
• Refining the code
• Better measure for validation

## References

[1] “BEaST: Brain extraction based on nonlocal segmentation technique”
Simon Fristed Eskildsen, Pierrick Coupé, Vladimir Fonov, José V. Manjón, Kelvin K. Leung, Nicolas Guizard, Shafik N. Wassef, Lasse Riis Østergaard and D. Louis Collins, NeuroImage, Volume 59, Issue 3, pp. 2362–2373.
http://dx.doi.org/10.1016/j.neuroimage.2011.09.012

[2] “Optimized Brain Extraction for Pathological Brains (OptiBET)”
Evan S. Lutkenhoff, Matthew Rosenberg, Jeffrey Chiang, Kunyu Zhang, John D. Pickard, Adrian M. Owen, Martin M. Monti
December 16, 2014 PLOS http://dx.doi.org/10.1371/journal.pone.0115551

### Aakash Rajpal (italian mars society)

#### Oculus working yaay

Hey everyone, sorry for this late post.

I was busy setting up the Oculus. It has been a pain, but in the end a sweet one :p. A week ago I was down, even thinking of quitting the program. I had my code ready to run, but it just wouldn’t show up on the Oculus. I was lost, but somewhere inside I knew I could do it. So I got up one last time and sat through the day tweaking my code, tweaking the Blender Game Engine, and changing the configuration for the Oculus, and at last, Bazzingaa.

Thank God, I said to myself, and eventually my code was running on the Oculus :p.

Here is a link to the DEMO VIDEO GSOC

### Levi John Wolf (PySAL)

#### A Post-SciPy Chicago Update

After a bit of a whirlwind, going to SciPy and then relocating to Chicago for a bit, I figure I’ve collected enough thoughts to update on my summer of code project, as well as some of the discussion we’ve had in the library recently.

I’ve actually seen a lot of feedback on quite a bit of my postings since my post on handling burnout as a graduate student. But, I’ve been forgetting to tag posts so that they’d show up in the GSOC aggregator! Bummer!

# The Great Divide

Right before SciPy, a contributor suggested that it might be a reasonable idea to split the library up into independent packages. Ostensibly motivated by this conversation on twitter, the suggestion highlighted a few issues (I think) with how PySAL operates on a normative level, on a procedural level, and in our code. This is an interesting suggestion, and I think it has a few very strong benefits.

### Lower Maintenance Surface

Chief among the benefits is that minimizing the maintenance burden makes academic developers much more productive. This is something I’m actually baffled by in our current library. I understand that technical debt is hard to overcome and that some parts of the library may not exist had we started now rather than five years ago. But, it’s so much easier to swap in ecosystem-standard packages than it is to continue maintaining code that few people understand. This is also much more true when you recognize that our library does, in many places, exhibit effective use of duck typing. The barrier to us using something like pygeoif or shapely as a computational geometry core is primarily mental, and conversion of the library to drop/wrap unnecessary code in cg, weights, and core would take less than a week of full-time work. And, it’d strongly lower the maintenance footprint of the library, which I think is a central benefit of the split package suggestion.

### Clearer Academic Crediting

Plus, splitting up the library into many more loosely-coupled packages seems like a stroke towards the R-style ecosystem, which is exactly what the linked twitter thread suggests. But, I think that R actually has some comfy structural incentives for the drivers of its ecosystem to do what they do. Since an academic can make a barely-maintained package that does some unique statistical operation and get a Journal of Statistical Software article out of it, the academic-heavy ecosystem in R is angled towards this kind of development. And, indeed, with a very small maintenance surface, these tiny packages get shipped, placed on a CV, and occasionally updated. Thus, the social incentives align to generate a particular technical structure, something I think Hadley overstates in that brief conversation as a product of object oriented programming. While OO isn’t a perfect abstraction, I’m kind of done with blaming OO for everything I don’t like, and I think that the claim that OO encourages monolithic packages is, on its face, not a necessary conclusion. It comes down to defining efficient interfaces between classes and exposing a consistent, formal API. I don’t really think it matters whether that API is populated or driven using functions & immutable data or objects & bound methods. Closures & Objects are two sides of the same coin, really. Mostly, though, thinking that the social & technical differences in R and Python package development can be explained through quick recourse to OO vs. FP (when I bet the majority of academic package developers don’t even deeply understand OO or FP) is flippant at best. I really think more of it is the structure of academic rewards, and the predominance of academics in the R ecosystem.

But that’s an aside. More generally, fragmenting the library would make it easier for new contributors to derive academic credit from their contributions.

### Cleaner Dependency Logic

I think many of the library developers also feel limited by the strict adherence to a minimal set of dependencies, namely scipy and numpy. By splitting the package up into separate modules with potentially different dependency requirements, we legitimate contributors who want to provide new stuff with flashy new packages.

To be clear, I think the way we do this right now is somewhat frustrating. If a contribution is done using only SciPy & Numpy and is sufficiently integrated into the rest of the library, it gets merged into “core” pysal. If it uses “extra” libraries but is still relevant to the project, we merge it into a module, contrib. This catch-all module contains some totally complete code from younger contributors, like the spint module for spatial interaction models or my handler module for formula-based spatial regression interfaces, as well as code from long-standing contributors, like the viz module. But, it also contains incomplete remnants of prior projects, put in contrib to make sure they weren’t forgotten. And, to make matters worse, none of the stuff in contrib is used in our continuous integration framework. So, even if an author writes test suites, they’re not run routinely, meaning that the compatibility clock is ticking every time code is committed to the module. Since it’s not unittested and documentation & quality standards aren’t the same as the code in core, it’s often easier to write from scratch when something breaks. Thus, fragmenting the package would “liberate” packages in contrib that meet standards of quality for introduction to core but have extra dependencies.

### But why is this necessary?

Of course, we can do much of what fragmentation provides technologically using soft dependencies. At the module level, it’s actually incredibly easy. But, I have also built tooling to do this at the class/function level, and it works great. So, this particular idea about having multiple packages doesn’t solve what I think is fundamentally a social/human problem.
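A function-level soft dependency can be sketched with a small decorator. This is not the PySAL tooling mentioned above, only an illustration under assumptions: the decorator name `requires` and the example functions are hypothetical, and only top-level module names are handled.

```python
import functools
import importlib.util


def requires(*modules):
    """Swap a function for a helpful failure if optional modules are missing."""
    missing = [name for name in modules
               if importlib.util.find_spec(name) is None]

    def decorator(func):
        if not missing:
            return func  # all optional dependencies are importable

        @functools.wraps(func)
        def unavailable(*args, **kwargs):
            raise ImportError(
                '{} needs the optional package(s): {}'.format(
                    func.__name__, ', '.join(missing)))
        return unavailable
    return decorator


@requires('json')  # standard library, so always available
def to_text(record):
    import json
    return json.dumps(record, sort_keys=True)


@requires('surely_not_installed_pkg_xyz')  # hypothetical missing dependency
def fancy_plot(data):
    return 'plotted'
```

Importing the module that defines `fancy_plot` still succeeds; the error only appears when someone calls the function without the extra package installed.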

The rules we’ve built around contribution do not actively support using the best tools for the job. Indeed, the social structure of two-tiered contribution, where the second tier has incredibly heterogeneous quality, intent, and no support for coverage/continuous integration testing, inhibits code reuse and magnifies not-invented-here syndrome intensely. We can’t exploit great packages like cytools, have largely avoided merging code that leverages improved computational runtimes (using numba & cython), and haven’t really (until my GSOC) programmed around pandas as a valid interaction method to the library.

Most of the barriers to this are, as I mentioned above, mental and social, not technical. Our code can be well-architected, even though we’ve implemented special structures to do things that are more commonly (sometimes more efficiently) solved in other packages or using other techniques.

And, there’s some freaking cool stuff going on involving PySAL. Namely, the thing that’s been animating me is its use in Carto’s Crankshaft, which integrates some PySAL tooling into a PL/Python plugin for Postgres. They’ll be exposing our API (or a subset of it) to users through this wrapper, and that feels super cool! So, we’ve got good things going for our library. But, I think that continued progress needs to address these primarily social concerns, because the code, technologically, I think is more sound than one could expect from full-time academic authors.

## July 24, 2016

### shrox (Tryton)

#### Refactoring

Right now I am working on refactoring the code that I have written till now.

I need to simplify functions, make them easier to understand, and make sure that my code conforms to the standards of the Tryton and Relatorio codebases.

### mr-karan (coala)

#### GSoC Week8,9 Updates

For the last two weeks I have been busy making some more bears for coala. I found:

• write-good, which helps in writing good English documentation and checks text files for common English mistakes. I really liked this tool, so I decided to wrap it in a linter bear and implemented WriteGoodLintBear.

You can see it in action here:

• happiness: as the name suggests, happiness is a tool which lints JS files for common syntax and semantic errors and conforms to a style which is well defined in their docs and is actually the one I like for JavaScript files. It’s a fork of Standard, which is another style guide, but happiness has a few better changes, so I wrapped it and implemented HappinessLintBear.

You can see it in action here:

• httpolice, which is a linter for HTTP requests and responses. It can be used on a HAR file: go to the Developer Tools of your browser, head over to the Network tab, right click and save the request as a HAR file. Then this tool can be used to lint that file. We didn’t have anything of this kind in coala-bears yet, so I decided to wrap it and implement HTTPoliceLintBear. There have been some issues with lxml dependencies on AppVeyor, and I’m figuring out how to solve them so that the tests pass and this PR also gets merged.

Future Work

• Syntax Highlighting: It has become clearer how to implement this. Until now I have used the highlight class from Pygments, but I now plan to write my own class, ConsoleText, which will help set specific attributes on certain parts of the text, like separating the bullet marks from the string and so on. I plan to work extensively on this so that I can complete the task by the end of this week.

• coala-bears website: I’ll be starting off with the prototype of this website and also some basic functionality like filtering the bears based on the parameters apart from Language they support.

Happy Coding!

## July 23, 2016

### Raffael_T (PyPy)

#### Progress async and await

It's been some time, but I made quite some progress on the new async feature of Python 3.5! There is still a bit to be done though, and the end of this year's Google Summer of Code is pretty close already. Whether I can finish in time will mostly be a matter of luck, since I don't know how much I will still have to do in order for asyncio to work. The module depends on many new features from Python 3.3 up to 3.5 that have not been implemented in PyPy yet.

Does async and await work already?
Not quite. PyPy now accepts async and await though, and checks pretty much all places where it is allowed and where it is not. In other words, the parser is complete and has been tested.
The code generator is complete as well, so the right opcodes get executed in all cases.
The new bytecode instructions I need to handle are: GET_YIELD_FROM_ITER, GET_AWAITABLE, GET_AITER, GET_ANEXT, BEFORE_ASYNC_WITH and SETUP_ASYNC_WITH.
These opcodes do not work with regular generators, but with coroutine objects. Those are based on generators; however, they do not implement __iter__ and __next__ and therefore cannot be iterated over. Also, plain generators cannot yield from coroutines, although generator-based coroutines (@asyncio.coroutine in asyncio) can. [1]
I started implementing the opcodes, but I can only finish them after asyncio is working, as I need to test them constantly and can only do that with asyncio, because I am unsure what values normally lie on the stack. The same applies to some functions of coroutine objects. Coroutine objects are working; however, they are missing a few functions needed for the async/await syntax feature.
These two things are all that is left for me to do, though; everything else is tested and should therefore work.
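The difference between coroutine objects and generators described above can be checked directly on any interpreter that implements PEP 492. This is a small sketch; the coroutine name `answer` is made up for the example.

```python
async def answer():
    return 42


coro = answer()

# Coroutine objects expose __await__ but not the iterator protocol.
has_iter = hasattr(coro, '__iter__')
has_next = hasattr(coro, '__next__')
has_await = hasattr(coro, '__await__')

try:
    iter(coro)
    iterable = True
except TypeError:
    iterable = False

# They still support the generator-style send(), which is how an
# event loop drives them; the return value arrives in StopIteration.
try:
    coro.send(None)
    result = None
except StopIteration as stop:
    result = stop.value
```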

What else has been done?
Only implementing async and await would have been too easy I guess. With it comes a problem I already mentioned, and that is the missing dependencies of Python 3.3 up to 3.5.
The module sre (offers support for regular expressions) was missing a macro named MAXGROUPS (from Python 3.3), the magic number standing for the number of constants had to be updated as well. The memoryview objects also got an update from Python 3.3 that is needed for an import. It has a function called “cast” now, which converts memoryview objects to any other predefined format.
I just finished implementing this as well, now I am at the point where it says inside threading.py:
AttributeError: 'module' object has no attribute '_set_sentinel'
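For reference, this is roughly what the memoryview "cast" function mentioned above does on CPython 3.3 and later. It is a usage sketch of the standard API, not PyPy's implementation; the byte values are arbitrary.

```python
data = bytes(range(6))
flat = memoryview(data)

# Reinterpret the same six bytes as a 2 x 3 grid of unsigned bytes.
grid = flat.cast('B', shape=[2, 3])
rows = grid.tolist()

# Reinterpret unsigned bytes as signed bytes without copying.
signed = memoryview(b'\xff\x01').cast('b').tolist()
```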

What to do next?
My next goal is that asyncio works and the new opcodes are implemented. Hopefully I can write about success in my next blog post, because I am sure I will need some time to test everything afterwards.

A developer tip for execution of asyncio in pyinteractive (--withmod)
(I only write that as a hint because it gets easily skipped in the PyPy doc, or at least it happened to me. The PyPy team already thought about a solution for that though :) )
Asyncio needs some modules in order to work which are not loaded by default in pyinteractive. If someone stumbles across the problem where PyPy cannot find these modules, --withmod does the trick [2]. For now, --withmod-thread and --withmod-select are required.

[1] https://www.python.org/dev/peps/pep-0492/
[2] http://doc.pypy.org/en/latest/getting-started-dev.html#pyinteractive-py-options

Update (23.07.): asyncio can be imported and works! Well that went better than expected :)
For now only the @asyncio.coroutine way of creating coroutines is working, so for example the following code would work:

import asyncio
@asyncio.coroutine
def my_coroutine(seconds_to_sleep=3):
    print('my_coroutine sleeping for: {0} seconds'.format(seconds_to_sleep))
    yield from asyncio.sleep(seconds_to_sleep)
loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(my_coroutine())
)
loop.close()

(from http://www.giantflyingsaucer.com/blog/?p=5557)

And to illustrate my goal of this project, here is an example of what I want to work properly:

import asyncio

async def coro(name, lock):
    print('coro {}: waiting for lock'.format(name))
    async with lock:
        print('coro {}: holding the lock'.format(name))
        await asyncio.sleep(1)
        print('coro {}: releasing the lock'.format(name))

loop = asyncio.get_event_loop()
lock = asyncio.Lock()
coros = asyncio.gather(coro(1, lock), coro(2, lock))
try:
    loop.run_until_complete(coros)
finally:
    loop.close()

(from https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-492)

The async keyword replaces the @asyncio.coroutine decorator, and await is written instead of yield from. "async with" and "async for" are additional features: they allow suspending execution in the "enter" and "exit" methods (= asynchronous context manager) and iterating through asynchronous iterators, respectively.
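A minimal sketch of both additional features on CPython follows. The class names `Countdown` and `Resource` are invented for the example, and the plain (non-coroutine) `__aiter__` assumes Python 3.5.2 or later.

```python
import asyncio


class Countdown:
    """Asynchronous iterator: __anext__ may suspend between items."""

    def __init__(self, start):
        self.current = start

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.current <= 0:
            raise StopAsyncIteration
        self.current -= 1
        await asyncio.sleep(0)  # suspension point
        return self.current + 1


class Resource:
    """Asynchronous context manager: enter and exit may suspend."""

    def __init__(self, log):
        self.log = log

    async def __aenter__(self):
        await asyncio.sleep(0)
        self.log.append('enter')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.sleep(0)
        self.log.append('exit')


async def main():
    log = []
    async with Resource(log):
        async for value in Countdown(3):
            log.append(value)
    return log


loop = asyncio.new_event_loop()
try:
    events = loop.run_until_complete(main())
finally:
    loop.close()
```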

### Ramana.S (Theano)

#### Second Month Blog

Hello there,
The GraphToGPU optimizer work was finally merged into Theano master, giving the bleeding edge an approximately 2-3x speedup. Well, that is a huge thing. After that, the compilation time for the graph in FAST_COMPILE mode still had one small bottleneck, which came from local_cut_gpu_transfers. The nodes introduced into the graphs followed the patterns host_from_gpu(gpu_from_host(host_from_gpu(Variable))) and gpu_from_host(host_from_gpu(gpu_from_host(Variable))). These slowed down local_cut_gpu_transfers, and when we investigated where the patterns came from, we found that they were created by one of the AbstractConv2d optimizers. We (Fred and I) spent some time trying to filter out these patterns, but we finally concluded that the speedup wouldn't be worth the effort and dropped the idea for now.
Some work was done on caching Op instances in the base Op class, so that constructing an Op with the same parameters as an existing instance returns that instance instead of creating a new one. I tried to implement the caching in the Op class using a singleton, and I verified that instances with the same parameters are not recreated. But there are a few problems which require some higher-level refactoring. Currently the __call__ method of an Op is implemented in PureOp, which, when calling make_node, does not identify and pass all the parameters correctly. This parameter-passing issue would hopefully be resolved if all the Ops in Theano supported __props__, which would make it convenient to access the _props_dict and pass the parameters from there instead of relying on the generic *args and **kwargs. Currently, most of the Ops in the old backend do not implement __props__ and so cannot make use of the _props_dict. There are a few roadblocks to this. Instances of Elemwise require a dict to be passed as a parameter, which is unhashable, so __props__ could not be implemented for it. Early this week, work will begin on making that parameter hashable, paving the way for both of these PRs to get merged. Once they are merged, there should be at least a 0.5X speedup in the optimization time.
Finally, work has begun on implementing a CGT-style optimizer. This new optimizer applies optimizations in topological order. In Theano, it is being implemented as a local optimizer, aimed at replacing the canonicalize phase. Currently Theano optimizes each node only "once". The main advantage of the new optimizer is that it can optimize a node more than once, trying all the possible optimizations on the node until none of them applies: it applies an optimization to a node, then tries all the optimizations again on the new (modified) node, and so on. There is one drawback to this approach. After two optimizations have been applied, the replacement node no longer has the fgraph attribute, so optimizations that require this attribute cannot be tried. An example of how the new optimizer works is shown below.
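The caching idea, and why an unhashable parameter gets in the way, can be sketched in plain Python. This is not Theano code: `CachedOp` and `Scale` are hypothetical classes, and `__props__` is used here only loosely in the spirit of Theano's attribute of the same name.

```python
class CachedOp:
    """Hypothetical base class: one instance per (class, parameters) pair."""

    __props__ = ()
    _instances = {}

    def __new__(cls, *args):
        key = (cls, args)  # building the lookup key requires hashable args
        try:
            return cls._instances[key]
        except KeyError:
            instance = super().__new__(cls)
            cls._instances[key] = instance
            return instance

    def __init__(self, *args):
        for name, value in zip(self.__props__, args):
            setattr(self, name, value)


class Scale(CachedOp):
    __props__ = ('factor',)


same = Scale(2) is Scale(2)        # the second call reuses the first instance
different = Scale(2) is Scale(3)   # other parameters give another instance

# A dict parameter cannot be part of a cache key.
try:
    Scale({'a': 1})
    dict_accepted = True
except TypeError:
    dict_accepted = False

# One possible workaround: pass a hashable view of the same information.
frozen = Scale(tuple(sorted({'a': 1}.items())))
```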

Current theano master:
x ** 4 = T.sqr(x ** 2)

This branch :

x ** 4 =  T.sqr(T.sqr(x))

The drawback of this branch is that we won't be able to do this type of speedup for x ** 8 onwards. When profiled with the SBRNN network, the initial version of the draft seems to give an approx. 20 sec speedup. Isn't that a good start? :D
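The "keep trying every optimization until none applies" idea can be sketched on a toy expression tree. This is an illustration only: expressions are nested tuples rather than Theano graphs, the rule and function names are made up, and the toy does not model the fgraph limitation, so unlike the branch it also rewrites x ** 8.

```python
def rewrite_pow(node):
    """Local rule: rewrite even powers in terms of sqr, else return None."""
    if isinstance(node, tuple) and node[0] == 'pow':
        _, base, exponent = node
        if exponent == 2:
            return ('sqr', base)
        if exponent > 2 and exponent % 2 == 0:
            return ('sqr', ('pow', base, exponent // 2))
    return None


RULES = [rewrite_pow]


def optimize(node):
    """Optimize inputs first, then reapply rules until none of them fires."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(optimize(child) for child in node[1:])
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            replacement = rule(node)
            if replacement is not None:
                # The replacement is a new node, so optimize it again too.
                node = optimize(replacement)
                changed = True
                break
    return node


fourth = optimize(('pow', 'x', 4))
eighth = optimize(('pow', 'x', 8))
cube = optimize(('pow', 'x', 3))
```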

That's it for now folks! :)

### mkatsimpris (MyHDL)

#### Week 9

This week we had a lot of problems defining the interfaces between the frontend and the backend parts. For my part, I changed the outputs from parallel to serial in order to communicate with the backend part without problems. Moreover, I synthesized the frontend part, and some latches are inferred from the design, which cause a lot of timing problems. I changed some of

## July 22, 2016

### Avishkar Gupta (ScrapingHub)

#### Formalising the Benchmark Suite, Some More Unit Tests and Backward Compatibility Changes

In the past two weeks I focused my efforts on finalizing the benchmarking suite and improving test coverage.

From what Codecov says, we’re 83% of the way there regarding test coverage. As far as the performance of the new signals is concerned, from what the testing shows I gathered that the new signal API always takes less than half the time that is required by the old signal API for both signal connection and the actual sending of the signal.

This is mostly because the old API ran a combination of the getAllReceivers and liveReceivers functions together every time, which took up a huge amount of time and was the bottleneck of the process. As it currently stands, we’re not using the caching mechanism of the library, i.e. use_caching is always set to false, because receivers that connect to all senders rather than to a specific one would require me to find a suitable key for them that can be weakly referenced in order to make the entry in the WeakKeyDictionary. But enough about that, back to benchmarking.

For the benchmarking process: djangobench, the Django benchmarking library, does not currently benchmark signals, and doing so is still on the TODO list of the project. It did, however, provide me with some excellent modules that I used to write the scrapy benchmarking suite for signals. I would leave a link to it here, but I’m currently discussing with my mentor where to include the benchmarks, as including them in the repo would require that we keep pyDispatcher as a dependency, since it is needed to perform a raw apples-to-apples comparison of the signal code. In this post I’m also sharing results that I got using Robert Kern’s line_profiler module.
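The weak-reference constraint on cache keys can be seen with the standard library alone. The sketch below is not the scrapy or Django dispatcher code; `Spider`, `AnySender` and the receiver names are hypothetical stand-ins.

```python
import gc
import weakref


class Spider:
    """Stand-in sender; instances of ordinary classes can be weakly referenced."""


receivers_cache = weakref.WeakKeyDictionary()

spider = Spider()
receivers_cache[spider] = ['on_item_scraped']
cached_before = len(receivers_cache)

# Once the sender is gone, its cache entry disappears with it.
del spider
gc.collect()
cached_after = len(receivers_cache)

# None, an obvious marker for "any sender", cannot be weakly referenced.
try:
    receivers_cache[None] = ['on_any_sender']
    none_is_valid_key = True
except TypeError:
    none_is_valid_key = False


class AnySender:
    """Hypothetical sentinel type whose single instance can serve as a key."""


ANY = AnySender()
receivers_cache[ANY] = ['on_any_sender']
```

A module-level sentinel like `ANY` is one possible key for receivers connected to all senders, since it stays alive for the lifetime of the process.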

As for the compatibility changes this cycle, I added support for the old-style scrapy signals, which were just standard Python objects. In similar fashion to how I implemented backward compatibility for receivers without keyword arguments, I proxied the signals through the signal manager to implement backward compatibility for the objects. With that, the new signals can be safely integrated into scrapy with no worries about breaking legacy code. In the coming weeks, I plan on finishing test coverage, maybe adding some signal benchmarks to scrapy bench, and doing documentation.

### Nelson Liu (scikit-learn)

#### (GSoC Week 8) MAE PR #6667 Reflection: 15x speedup from beginning to end

If you've been following this blog, you'll notice that I've been talking a lot about the weighted median problem, as it is intricately related to optimizing the mean absolute error (MAE) impurity criterion. The scikit-learn pull request I was working on to add the aforementioned criterion to the DecisionTreeRegressor, PR #6667, has received approval from several reviewers for merging. Now that the work for this PR is complete, I figure that it's an apt time to present a narrative of the many iterations it took to converge to our current solution for the problem.

# Iteration One: Naive Sorting

The Criterion object that is the superclass of MAE has a variety of responsibilities during the process of decision tree construction, primarily evaluating the impurity of the current node, and evaluating the impurity of all the possible children to find the best next split. In the first iteration, every time we wanted to calculate the impurity of a set of samples (either a node, or a possible child), we would sort this set of samples and extract the median from it.
After implementing this, I ran some benchmarks to see how fast it was compared to the Mean Squared Error (MSE) criterion currently implemented in the library. I used both the classic Boston housing price dataset and a larger, synthetic dataset with 1000 samples and 100 features each to compare. Training was done on 0.75 of the total dataset, and the other 0.25 was used as a held-out test set for evaluation.
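The sort-then-extract approach of this first iteration can be sketched in pure Python. This is not the Cython code from the PR: the function names are invented, and taking the lower weighted median (the first value whose cumulative weight reaches half the total) is an assumption about tie handling.

```python
def weighted_median(values, weights):
    """Sort the samples and walk the cumulative weight up to the halfway mark."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value


def mae_impurity(values, weights):
    """Weighted mean absolute deviation from the weighted median."""
    median = weighted_median(values, weights)
    total = sum(weights)
    return sum(w * abs(v - median) for v, w in zip(values, weights)) / total


targets = [1, 2, 3, 4, 100]
unweighted = weighted_median(targets, [1, 1, 1, 1, 1])
heavy_outlier = weighted_median(targets, [1, 1, 1, 1, 10])
impurity = mae_impurity(targets, [1, 1, 1, 1, 1])
```

Every call sorts its whole sample set, which is what makes this approach expensive once it is repeated for each node and each candidate child.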

### Boston Housing Dataset Benchmarks: Iter. 1

MSE time: 105 function calls in 0.004 seconds
MAE time:  105 function calls in 0.175 seconds

Mean Squared Error of Tree Trained w/ MSE Criterion: 32.257480315
Mean Squared Error of Tree Trained w/ MAE Criterion: 29.117480315

Mean Absolute Error of Tree Trained w/ MSE Criterion: 3.50551181102
Mean Absolute Error of Tree Trained w/ MAE Criterion: 3.36220472441


### Synthetic Dataset Benchmarks: Iter. 1

MSE time: 105 function calls in 0.089 seconds
MAE time:  105 function calls in 15.419 seconds

Mean Squared Error of Tree Trained w/ MSE Criterion: 0.702881265958
Mean Squared Error of Tree Trained w/ MAE Criterion: 0.66665916831

Mean Absolute Error of Tree Trained w/ MSE Criterion: 0.650976429446
Mean Absolute Error of Tree Trained w/ MAE Criterion: 0.657671579992


This sounds reasonable enough, but after looking at the numbers we quickly discovered that it was intractable. While sorting is quite fast in general, sorting in the process of finding the children was completely unrealistic: for a sample set of size n, we would divide it into n-1 partitions of left and right child and sort each one, at every node. The larger dataset made MSE take 22.25x more time, but it made MAE take 88.11x (!) more time. This result was obviously unacceptable, so we began thinking of how to optimize; this led us to our second development iteration.
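To see where the cost comes from, here is a toy version of the naive split search for a single feature, with unweighted samples already ordered by that feature. The function names and data are made up; the point is only that each of the n-1 candidate splits sorts both children from scratch.

```python
def median_abs_error(values):
    """Total absolute deviation from the (lower) median; sorts every time."""
    ordered = sorted(values)
    median = ordered[(len(ordered) - 1) // 2]
    return sum(abs(v - median) for v in ordered)


def naive_best_split(targets):
    """Try all n-1 partitions, re-sorting the left and right child for each."""
    best_position, best_cost = None, float('inf')
    for position in range(1, len(targets)):
        cost = (median_abs_error(targets[:position])
                + median_abs_error(targets[position:]))
        if cost < best_cost:
            best_position, best_cost = position, cost
    return best_position, best_cost


best = naive_best_split([1, 1, 2, 10, 11, 12])
```

With n samples this performs roughly 2(n-1) sorts per feature at every node, which matches the blow-up seen on the larger dataset.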

# Iteration 2: MinHeap to Calculate Weighted Median

In iteration two, we implemented the algorithm / methodology I discussed in my week 6 blog post. With this method, we did away with the time associated with sorting every sample set for every node and possible child and instead "saved" sorts, using a modified bubblesort to insert elements into and remove elements from the left and right child heaps efficiently. This algorithm had a substantial impact on performance: rerunning the benchmarks we used earlier yielded the following results (MSE timings differ only by run-to-run variation; accuracy is unchanged and is thus omitted):
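The idea of "saving" sorts can be sketched by keeping both children sorted and moving one sample at a time from right to left. Note the substitution: this sketch uses the standard `bisect` module on plain lists rather than the modified bubblesort and heap structures from the PR, ignores sample weights, and uses invented names.

```python
import bisect


class SortedSamples:
    """Keeps values sorted so that the median never needs a full re-sort."""

    def __init__(self, values=()):
        self.values = sorted(values)

    def push(self, value):
        bisect.insort(self.values, value)

    def remove(self, value):
        index = bisect.bisect_left(self.values, value)
        if index == len(self.values) or self.values[index] != value:
            raise ValueError('value not present')
        del self.values[index]

    def median(self):
        return self.values[(len(self.values) - 1) // 2]


def medians_for_all_splits(targets):
    """Return (left median, right median) for every candidate split."""
    left = SortedSamples()
    right = SortedSamples(targets)  # the only full sort
    result = []
    for value in targets[:-1]:
        right.remove(value)
        left.push(value)
        result.append((left.median(), right.median()))
    return result


medians = medians_for_all_splits([5, 1, 4, 2])
```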

### Boston Housing Dataset Benchmarks: Iter. 2

MSE time: 105 function calls in 0.004s (was: 0.004s)
MAE time:  105 function calls in 0.276s (was: 0.175s)


### Synthetic Dataset Benchmarks: Iter. 2

MSE time: 105 function calls in 0.065s (was: 0.089s)
MAE time:  105 function calls in 5.952s (was: 15.419s)


After this iteration, MAE is still quite a bit slower than MSE, but it's a definite improvement over naive sorting (especially when using a large dataset). I found it interesting that the new method is actually a little bit slower than the naive method we first implemented on the relatively small Boston dataset (0.276s vs 0.175s, respectively). My mentors and I hypothesized that this might be due to the time cost associated with creating the WeightedMedianCalculators (the objects that handle the new median calculation), though their efficiency in calculation is supported by the speed increase from 15.419s to 5.952s on the larger randomly generated dataset. 5.952 seconds on a dataset with 1000 samples is still slow though, so we kept going.

# Iteration 3: Pre-allocation of objects

We suspected that there could be a high cost associated with spinning up the objects used to calculate the weighted median. This is especially important because the majority of the tree code in scikit-learn is written in Cython and runs without the Python GIL (global interpreter lock), which disallows the use of Python objects and functions. The GIL is a mutex that prevents multiple native threads from executing Python bytecodes at once, so running without the GIL makes our code a lot faster. However, because our WeightedMedianCalculators are Python objects, we unfortunately need to reacquire the GIL to instantiate them. We predicted that this could be a major source of the bottleneck. As a result, I implemented a reset function in the objects to clear them back to their state at construction, which can be executed without the GIL. Whenever the C-level constructor runs (it is run at every node, as opposed to the Python constructor, which is run only once), we check whether the WeightedMedianCalculators have been created; if they have not, we reacquire the GIL and create them. If they have, we simply reset them. This allowed us to reacquire the GIL only once throughout the algorithm, which, as predicted, led to substantial speedups. Running the benchmarks again displayed:
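The create-once-then-reset pattern can be sketched in plain Python. Pure Python cannot show the GIL being released or reacquired, so this only illustrates the object lifecycle; the class and method names are hypothetical and are not those used in the PR.

```python
class MedianCalculator:
    """Toy stand-in for a per-output calculator with a preallocated buffer."""

    created = 0  # counts how many times the expensive constructor ran

    def __init__(self, capacity):
        MedianCalculator.created += 1
        self.values = [0.0] * capacity
        self.size = 0

    def reset(self):
        """Return to the freshly constructed state without reallocating."""
        self.size = 0

    def push(self, value):
        self.values[self.size] = value
        self.size += 1


class Criterion:
    def __init__(self):
        self.calculators = None

    def init_node(self, n_outputs, capacity):
        """Called for every node: construct on first use, reset afterwards."""
        if self.calculators is None:
            self.calculators = [MedianCalculator(capacity)
                                for _ in range(n_outputs)]
        else:
            for calculator in self.calculators:
                calculator.reset()


criterion = Criterion()
for _ in range(100):  # pretend the tree has 100 nodes
    criterion.init_node(n_outputs=2, capacity=8)
    criterion.calculators[0].push(1.5)
```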

### Boston Housing Dataset Benchmarks: Iter. 3

MSE time: 105 function calls in 0.009s (was: 0.004s, 0.004s)
MAE time:  105 function calls in 0.038s (was: 0.276s, 0.175s)


### Synthetic Dataset Benchmarks: Iter. 3

MSE time: 105 function calls in 0.065s (was: 0.065s, 0.089s)
MAE time:  105 function calls in 0.978s (was: 5.952s, 15.419s)


Based on the speed improvement from the most recent changes, it's reasonable to conclude that a large amount of time was spent re-acquiring the GIL. With this approach, we cut down the time spent reacquiring the GIL by quite a significant amount since we only need to do it once, but ideally we'd like to do it zero times. This led us to our fourth iteration.

# Iteration 4: Never Re-acquire the GIL

Constructing the WeightedMedianCalculators requires two pieces of information, n_outputs (the number of outputs to predict) and n_node_samples (the number of samples in this node). We need to create a WeightedMedianCalculator for each output to predict, and the internal size of each should be equal to n_node_samples.
We first considered whether we could allocate the WeightedMedianCalculators at the Splitter level (the splitter is in charge of finding the best splits, and uses the Criterion to do so). In splitter.pyx, the __cinit__ function (Python-level constructor) only exposes the value of n_node_samples and we lack the value of n_outputs. The opposite case is true in criterion.pyx, where the __cinit__ function is only shown the value of n_outputs and does not get n_node_samples until C-level init time, hence why we previously were constructing the WeightedMedianHeaps in the init function and cannot completely do it in __cinit__. If we could do it completely in the __cinit__, we would not have to reacquire the GIL because the __cinit__ operates on the Python level in the first place.
As a result, we simply modified the __cinit__ of the Criterion objects to expose the value of n_node_samples, allowing us to do all of the allocation of the objects at the Python-level without having to specifically reacquire the GIL. We reran the benchmarks on this, and saw minor improvements in the results:

### Boston Housing Dataset Benchmarks: Iter. 4

MSE time: 105 function calls in 0.003s (was: 0.009s, 0.004s, 0.004s)
MAE time:  105 function calls in 0.032s (was: 0.038s, 0.276s, 0.175s)


### Synthetic Dataset Benchmarks: Iter. 4

MSE time: 105 function calls in 0.065s (was: 0.065s, 0.065s, 0.089s)
MAE time:  105 function calls in 0.961s (was: 0.978s, 5.952s, 15.419s)


# Conclusion

So after these four iterations, we managed to get a respectable 15x speed improvement. There's still a lot of work to be done, especially with regards to speed on larger datasets; however, as my mentor Jacob commented, "Perfect is the enemy of good", and those enhancements will come in future (very near future) pull requests.

If you have any questions, comments, or suggestions, you're welcome to leave a comment below.

Thanks to my mentors Raghav RV and Jacob Schreiber for their input on this problem; we've run through several solutions together, and they are always quick to point out errors and suggest improvements.

You're awesome for reading this! Feel free to follow me on GitHub if you want to track the progress of my Summer of Code project, or subscribe to blog updates via email.

### aleks_ (Statsmodels)

#### Bugs, where art thou?

The latest few weeks were all about searching for bugs. The two main bugs (both related to the estimation of parameters) showed up in two cases:
• When including seasonal terms and a constant deterministic term in the vector error correction model (VECM), the estimation for the constant term differed from the one produced by the reference software JMulTi, which is written by Lütkepohl. Interestingly, my results did equal those printed in the reference book (also written by Lütkepohl), so I believe that JMulTi has gotten a corresponding update between the release of the book and the release of the software - which would also mean that the author of the reference book made the same mistake as I ;) Basically the error was the result of a wrong construction of the matrix holding the dummy variables. Instead of using the following pattern as is (assuming four seasons, e.g. quarterly data):
[[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..., 0, 1, 0, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, ..., 0, 0, 1, 0]]
1 / (number of seasons) had to be subtracted from each element in the matrix above. This isn't described in the book and it is not the way to define dummy variables I learned in my lecture about regression analysis. And because the estimation for the seasonal parameters was actually correct (the wrong matrix only had side effects on the constant terms), I kept searching for the bug in a lot of different places...
• A small deviation in certain parameters occurred when deterministic linear trends were assumed to be present. Thanks to an R package handling VECM, called tsDyn, I could see that my results exactly matched those produced by tsDyn when specifying linear trends to be inside the cointegration relation in the R package. On the other hand, the tsDyn output equaled that of JMulTi when tsDyn did not treat the linear trend as part of the cointegration relation. After I had seen that, the reimplementation to produce the JMulTi output in Python was easy. But here, too, I had searched a lot for bugs before.
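The corrected dummy matrix from the first bug can be sketched in a few lines of plain Python. This is a minimal illustration, not the statsmodels code; the function name `centered_seasonal_dummies` is made up for this example.

```python
def centered_seasonal_dummies(n_obs, n_seasons):
    """Build (n_seasons - 1) rows of seasonal dummies, each shifted by -1/n_seasons."""
    rows = []
    for season in range(n_seasons - 1):
        row = []
        for t in range(n_obs):
            plain_dummy = 1.0 if t % n_seasons == season else 0.0
            # Subtracting 1 / n_seasons is the step that was missing.
            row.append(plain_dummy - 1.0 / n_seasons)
        rows.append(row)
    return rows


dummies = centered_seasonal_dummies(n_obs=8, n_seasons=4)
for row in dummies:
    print(row)
```

With four seasons every entry is either 0.75 or -0.25, so each row sums to zero over a full year. That is consistent with the observation above that only the constant term was affected.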
Now I am happy that the code works, even if the bugs have put me behind schedule. I expect that coding will now continue much more smoothly.
The only thing that makes me worry is the impression the bug hunting made on my supervisors. Being unable to push code while searching for bugs may have looked as if I was not doing anything, though I spent hours and hours reading my code and checking it against Lütkepohl's book. So while I was working much more than 40 hours per week in the last few weeks (once even without a single day off), it may have looked completely different.
What counts now is that I will continue to give my best in the remaining weeks of GSoC even if I don't get a passing grade from my supervisors. After all, it's not about the money, it's about being proud of the end product and about knowing that one has given his very best : )

### ghoshbishakh (dipy)

#### Google Summer of Code Progress July 22

It has been about 3 weeks after the midterm evaluations. The dipy website is gradually heading towards completion!

### Progress so far

The documentation has been completely integrated with the website and it is synced automatically from the github repository where the docs are hosted.

The honeycomb gallery on the home page has been replaced with a carousel of images with content overlays that will allow us to display important announcements at the top.

The news feed now has sharing options for Facebook, Google Plus and Twitter.

Google analytics has been integrated for monitoring traffic.

There are many performance optimizations like introducing a layer of cache and enabling GZipping middleware. Now the google page speed score is even higher than the older website of dipy.

All pages of the website have meta tags for search engine optimization.

And of course there have been lots of bug fixes, and the website scales a lot better on mobile devices.

The current pull request is #13

You can visit the site under development at http://dipy.herokuapp.com/

### Documentation Integration

The documentation is now generated and uploaded to the dipy_web repository using a script. Previously the HTML version of the docs was built, but this script builds the JSON docs, which allows us to integrate them within the Django templates very easily. Then, using the GitHub API, the list of documentation versions is synced with the Django website models.

Now the admins with proper permissions can select which documentation versions to display on the website. The selected versions are displayed in the navbar dropdown menu. This is done by passing the selected docs into the template context through a context processor.

Now when a user requests a documentation page, the doc in JSON format is retrieved from GitHub and parsed, and the URLs in the docs are processed so that they work properly within the Django site. Then the docs are rendered in a template.

### Cache

Processing the JSON documentation every time a page is requested is an overhead. Also, on the home page the social network feeds are fetched on every request, which is not required. So a cache is used to reduce the overhead. In Django adding a cache is really simple. All we need to do is set up the cache settings and add some decorators to the views.

For now we are using local memory cache, but in production it will be replaced with memcached.

We are keeping the documentation view and the main page view in cache for 30 minutes.

But this creates a problem. When we change some section or news or publications in the admin panel then the changes are not reflected in the views and we need to wait for 30 minutes to see the changes. In order to solve the issue the cache is cleared whenever some changes are made to the sections, news etc.
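The behaviour described above (serve a cached page for 30 minutes, but drop the cache as soon as content is edited) can be modelled in plain Python. This is a sketch of the idea only, not the site's code: in Django itself the same effect is usually obtained with the `cache_page` decorator and a call to `cache.clear()`, and the `TimedCache` class below is a hypothetical stand-in.

```python
import time


class TimedCache:
    """Tiny time-based cache: entries expire after `timeout_seconds`."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.timeout:
            return entry[1]  # still fresh: skip the expensive work
        value = compute()
        self._store[key] = (now, value)
        return value

    def clear(self):
        # Called whenever sections, news, etc. are changed in the admin panel.
        self._store.clear()


page_cache = TimedCache(timeout_seconds=30 * 60)
html = page_cache.get_or_compute("home", lambda: "<h1>rendered home page</h1>")
```

Clearing everything on any change is the simplest invalidation strategy; it trades a few extra re-renders for never showing stale content.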

### Search Engine Optimizations

One of the most important steps for SEO is adding proper meta tags to every page of the website. These also include the Open Graph tags and the Twitter card tags, so that when a page is shared on a social network it is properly rendered with the correct title, description, thumbnail etc.

The django-meta app provides a very useful template that can be included to render the meta tags properly, provided a meta object is passed in the context. Ideally every page should have its own unique meta tags, but there must be a fallback so that if no meta attributes are specified then some default values are used.

So in order to generate the meta objects we have this function:

And in settings.py we can specify some default values:
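As a rough illustration of the fallback idea, here is a sketch that uses a plain dictionary in place of the django-meta object. The names `DEFAULT_META` and `build_meta` and the placeholder values are hypothetical and are not taken from the dipy website code.

```python
# Placeholder defaults; in the real project these would live in settings.py.
DEFAULT_META = {
    "title": "Default site title",
    "description": "Default site description",
    "image": None,
}


def build_meta(defaults, **overrides):
    """Return page meta attributes, falling back to defaults for missing ones."""
    meta = dict(defaults)
    for key, value in overrides.items():
        if value is not None:  # None means "not specified for this page"
            meta[key] = value
    return meta


news_meta = build_meta(DEFAULT_META, title="News", description=None)
```

A page that specifies only a title still gets the default description, which is the behaviour the fallback is meant to guarantee.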

### Google Analytics

Adding Google Analytics is very simple. All we need to do is put a code snippet in every template, or just in the base template that is extended by all other templates. But in order to make it easier to customize, I have kept it as a context processor that takes the Tracking ID from settings.py and generates the code snippet in the templates.

### What’s next

We have to add more documentation versions (the older ones) and add a hover button in the documentation pages to hop from one documentation version to another just like the django documentations.

We have to design a gallery page that will contain images, videos and tutorials.

I am currently working on a github data visualization page for visualization of dipy contributors and activity in the dipy repository.

Will be back with more updates soon! :)

### liscju (Mercurial)

#### Coding Period - VII-VIII Week

The main thing I managed to do in the last two weeks was to make new clients (with the redirection feature) put files in the redirection location themselves instead of pushing them to the main repository. To obtain the redirection destination, the new client asks the main repo server for this information and then communicates directly with the redirection server. This behaviour is correct because it guarantees that the push transaction only succeeds when the client has successfully put all large files in the redirection destination. Otherwise the transaction fails, so the main repo server will never have a revision whose large files are missing from the redirection destination.

The second thing I did was to add and tweak test cases for the redirection module.

The next thing I did was to research what functionality the redirection server should have. There was some discussion about whether the server should be thin or rich in functionality, but the general conclusion is that it should be thin: it should support only getting files and pushing files. The one thing we demand from the redirection server is that it checks that a pushed large file has the proper hash, because that is the only way to be sure that subsequent clients will download files with the proper content.
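The hash check demanded from the redirection server could look roughly like the sketch below. It assumes that a large file is identified by the SHA-1 hex digest of its content; the function name `accept_upload` is hypothetical and not part of any existing server.

```python
import hashlib


def accept_upload(content, expected_hash):
    """Accept a pushed file only if its content matches the hash it is stored under."""
    actual_hash = hashlib.sha1(content).hexdigest()
    return actual_hash == expected_hash


good = accept_upload(b"abc", hashlib.sha1(b"abc").hexdigest())
bad = accept_upload(b"tampered", hashlib.sha1(b"abc").hexdigest())
```

Rejecting mismatching uploads at push time means a later download never has to be trusted blindly.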

The last thing I managed to do was to stop files pushed by old clients from being saved temporarily before they are sent to the redirection server. So far those files were saved because, when an old client pushes files, the main repo server doesn't know the size of the file and as a result doesn't know how to set Content-Length in the request to the redirection server. This was overcome by using chunked transfer encoding. This feature of the HTTP 1.1 protocol enables sending files chunk by chunk, knowing only the size of the single chunk being sent. You can read more about this on Wikipedia:

https://en.wikipedia.org/wiki/Chunked_transfer_encoding
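To make the wire format concrete, here is a small sketch of how a body is framed with chunked transfer encoding. It is not the code used in the extension; `encode_chunked` is a hypothetical helper, and trailers are left out.

```python
def encode_chunked(chunks):
    """Yield HTTP/1.1 chunked-transfer-encoded pieces for an iterable of byte strings."""
    for chunk in chunks:
        if not chunk:
            # A zero-length chunk would signal the end of the body, so skip empty reads.
            continue
        # Each chunk is framed as: size in hexadecimal, CRLF, the data, CRLF.
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # A zero-sized chunk terminates the body (no trailers in this sketch).
    yield b"0\r\n\r\n"


body = b"".join(encode_chunked([b"Wiki", b"pedia"]))
```

Because only the size of the current chunk is needed, the sender can forward data as it arrives without ever knowing the total length.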

### Abhay Raizada (coala)

#### week full of refactor

My project has grown a lot now: we are officially going to support C, C++, Python 3, JS, CSS and Java with our generic algorithms, though they’ll still be experimental owing to the nature of the bears.

The past two weeks were heavily concentrated on refactoring the algorithms of the AnnotationBear and the IndentationBear. The IndentationBear received only small fixes, while the AnnotationBear had to undergo a change of algorithm. The new and improved algorithm also adds the ability to distinguish between single-line and multi-line strings, whereas earlier there were just strings.

The IndentationBear is close to completion, barring basic things like:

• It still messes up your doc strings/ multi-line strings.
• Still no support for keyword indents.

The next weeks’ efforts will go into introducing various indentation styles into the bear and fixing these issues, before we move on to the LineBreakBear and the FormatCodeBear.

### Prayash Mohapatra (Tryton)

#### Few methods left

Well yes, according to my Trello board, I am just a couple of methods away from completely porting the Import/Export feature from Python (GTK) to JavaScript (sao). The journey now feels rewarding, especially since I just learnt that GNU Health uses Tryton as their framework too.

There have been no real problems in the last two weeks. I made the predefined exports usable: they can be created, saved and removed. I can now get the records selected in the tab and fetch the relevant data from the ‘export_data’ RPC call. I have gained confidence in making RPC calls in general.

Feeling comfortable around promises. Now I smile at the times when the folks at my college club would use promise for every concurrency issue, and I would be staring at them poker-faced.

Would soon move into writing the tests for the feature, something I am waiting eagerly for. Have a nice weekend.

### Ravi Jain (MyHDL)

#### Started Receive Engine!

It’s been a long time since my last post (2 weeks, phew)! Sorry for the slump. Anyway, during this period I successfully merged the Transmit Engine after my mentor’s review. I later realised that I had missed adding the client underrun functionality, which is used to corrupt the current frame transmission. I shall make sure to add that in the next merge.

Next I started looking at GMII, which partly stalled my work because I was unable to clearly understand what I had to do for it. So I decided to move on and complete the Receive Engine with the Address Filter first. So far I have finished receiving the destination address from the data stream and filtering it by matching it against the address table. If there is a match, the receiver starts forwarding the stream to the client side; otherwise it just ignores it.
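The filtering decision itself can be modelled in a few lines of plain Python. This is a behavioural sketch only, not MyHDL hardware code; it assumes the frame starts at the destination address (preamble already stripped), and the names `destination_address` and `should_forward` are hypothetical.

```python
ADDRESS_LENGTH = 6  # an Ethernet MAC address is six bytes long


def destination_address(frame):
    """The destination address is the first six bytes of the frame."""
    return bytes(frame[:ADDRESS_LENGTH])


def should_forward(frame, address_table):
    """Forward the frame to the client only if its destination is in the table."""
    if len(frame) < ADDRESS_LENGTH:
        return False  # too short to even contain a destination address
    return destination_address(frame) in address_table


address_table = {bytes.fromhex("aabbccddeeff")}
frame = bytes.fromhex("aabbccddeeff") + bytes.fromhex("112233445566") + b"payload"
forward = should_forward(frame, address_table)
```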

Next I look forward to adding error-check functionality to be able to assert Good/Bad Frame at the end of the transmission.

### Utkarsh (pgmpy)

#### Google Summer of Code week 7 and 8

My PR for the No-U-Turn Sampler (NUTS) and NUTS with dual averaging has been merged (PR #706). Apart from that, I made a slight change to the API of all the continuous sampling algorithms. Earlier I had made the grad_log_pdf argument optional: you could either pass in a custom implementation, or otherwise the gradient function of the model object would be used. But it was a poor design choice. Not only did it make the code look ugly, it also made things more complex by adding a number of useless checks. The other issue was that if a user implements a custom model, it will not necessarily have the gradient method. It is rightly said that simpler is better :). The current API looks like this:

>>> from pgmpy.inference.continuous import NoUTurnSamplerDA as NUTSda, GradLogPDFGaussian
>>> from pgmpy.factors import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([1, -100])
>>> covariance = np.array([[-12, 45], [45, -10]])
>>> model = JGD(['a', 'b'], mean, covariance)
>>> sampler = NUTSda(model=model, grad_log_pdf=GradLogPDFGaussian)
>>> samples = sampler.generate_sample(initial_pos=np.array([12, -4]), num_adapt=10,
...                                   num_samples=10, stepsize=0.1)
>>> samples
<generator object NoUTurnSamplerDA.generate_sample at 0x7f4fed46a4c0>
>>> samples_array = np.array([sample for sample in samples])
>>> samples_array
array([[ 11.89963386,  -4.06572636],
       [ 10.3453755 ,  -7.5700289 ],
       [-26.56899659, -15.3920684 ],
       [-29.97143077, -12.0801625 ],
       [-29.97143077, -12.0801625 ],
       [-33.07960829,  -8.90440347],
       [-55.28263496, -17.31718524],
       [-55.28263496, -17.31718524],
       [-56.63440044, -16.03309364],
       [-63.880094  , -19.19981944]])


Also, since the PR that implemented JointGaussianDistribution has been merged, I made changes to accommodate it. As you can see in the example, I have imported JointGaussianDistribution from factors instead of models :P. Now pgmpy supports continuous models and inference on these models using sampling algorithms. I don’t have any specific plans for next week. I have begun working on the content for introductory blog posts and IPython notebooks, and I might finish it within the next fortnight.

### jbm950 (PyDy)

#### GSoC Week 8 & 9

Last week I did not end up writing a blog post, so I am combining that week’s post with this week’s. Last week I attended the SciPy 2016 conference and was able to meet my mentor, and many other contributors to SymPy, in person. I was also able to help out with the PyDy tutorial. During my time at the conference (and this current week) I was able to flesh out the remaining details of the different portions of the project. I have updated PR #353 to reflect the API decisions for SymbolicSystem (previously eombase.EOM).

In line with trying to put the finishing touches on implementation details before diving in to code, Jason and I met with someone who has actually implemented the algorithm in the past to help us with details surrounding Featherstone’s method. He also pointed me to a different description of the same algorithm that may be easier to implement.

This week I also worked on rewriting the docstrings in physics/mechanics/body.py because I found the docstrings currently there to be somewhat confusing. I also did a review on one of Jason’s PR’s where he reduces the amount of work that *method.rhs() has to do when inverting the mass matrix by pulling out the kinematical information before the inversion takes place.

### Future Directions

With the work of these past two weeks focused on working out the details of the different parts of the project, I will start implementing these parts next week. I will first work on finishing off the SymbolicSystem object and then move towards implementing the OrderNMethod. This work should be very straightforward given all the work that has been put into planning the APIs.

### PR’s and Issues

• (Merged) Speeds up the linear system solve in KanesMethod.rhs() PR #10965
• (Open) Docstring cleanup of physics/mechanics/body.py PR #11416
• (Open) [WIP] Created a basis on which to discuss EOM class PR #353

## July 21, 2016

### srivatsan_r (MyHDL)

#### Clarity on the Project

It has been a long time since I have updated my blogs. I posted a block diagram in the previous post saying that this is what I will be doing next. My mentor then told me that completing and making the RISC-V core functional will itself take a lot of time, so video streaming is not required at the moment.

So, for the rest of my GSoC period I will be working on the RISC-V core. I will be doing the project along with another GSoC participant. After reviewing a lot of implementations of RISC-V, my partner and our mentor chose the V-Scale RISC-V processor core (which is a Verilog version of the Z-Scale RISC-V processor).

My partner has already completed the Decoder of the processor during the first half of GSoC. He was getting a Return type mismatch error when trying to convert the decoder module to verilog. We couldn’t figure out why this error was coming. Then after reviewing the code carefully I was able to spot the error in the code.

def fun(value, sig):
    if value == 0:
        return sig[4:2]
    else:
        return sig[5:2]

The function above cannot be converted to Verilog. This is because the length of the intbv returned by the function differs between the branches of the if-else. This cannot be modelled by a Verilog function, which must have a definite length for its return type, hence the error.

To solve this error I had to inline the code and remove the function. After rectifying this error the code was converting to verilog correctly. I then made a pull request to the dev branch of the repo meetshah1995/riscv.

The code for ALU was ported to MyHDL by my partner and I created some tests for the module. While doing this I learnt some basic stuff like in verilog ‘>>’ denotes logical right shift and ‘>>>’ denotes arithmetic right shift, but in python ‘>>’ denotes arithmetic right shift. There is a catch here in MyHDL –

a >> 2 # If a is of type Signal/intbv '>>' works as logical right shift.
a >> 2 # If a is of type int '>>' works as arithmetic right shift.

That is an interesting fact, isn’t it?
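The difference between the two kinds of shift is easy to see with plain Python integers. The sketch below does not use MyHDL; the helper names and the fixed bit width are assumptions made for illustration.

```python
def arithmetic_shift_right(value, shift):
    """Python's >> on ints keeps the sign, i.e. it is an arithmetic shift."""
    return value >> shift


def logical_shift_right(value, shift, width):
    """Treat `value` as an unsigned `width`-bit pattern, then shift in zeros."""
    mask = (1 << width) - 1
    return (value & mask) >> shift


print(arithmetic_shift_right(-8, 2))          # sign is preserved
print(logical_shift_right(-8, 2, width=8))    # 0b11111000 becomes 0b00111110
```

For non-negative values both functions agree; they differ only when the top bit of the pattern is set.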

## July 20, 2016

### Yen (scikit-learn)

#### How to set up 32bit scikit-learn on Mac without additional installation

Sometimes you may want to know how scikit-learn behaves when it’s running on 32-bit Python. This blog post tries to give the simplest solution.

## Step by Step

Below I’ll go through the procedure step by step:

I. Type the following command and make sure it outputs 2147483647.

arch -32 /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python -c "import sys; print sys.maxint"


II. Modify line 5 of the Makefile in the root directory of scikit-learn so that it becomes:

PYTHON ?= arch -32 /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python

and modify *line 11* to:

BITS := $(shell $(PYTHON) -c 'import struct; print(8 * struct.calcsize("P"))')

III. Type sudo make in the root directory of scikit-learn and you are good to go!

## Verification

You can verify that the 32-bit version of scikit-learn was built successfully by typing:

arch -32 /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python

to enter the 32-bit Python shell. After that, type:

import sklearn

to check whether sklearn can now run on 32-bit Python.

Hope this helps!

### tushar-rishav (coala)

#### Python f-strings

Hey there! How are you doing? :)

For the past couple of days I’ve been attending the EuroPython conference at Bilbao, Spain and it has been an increíble experience so far! There are over a dozen amazing talks with something new to share every day, and the super fun lightning talks at the end of the day. If for some reason you weren’t able to attend the conference, you may watch the talks live on the EuroPython YouTube channel.

In this blog I would like to talk briefly about PEP498 - Literal String Interpolation in Python. Python supports multiple ways to format text strings (%-formatting, format formatting and Templates). Each of these is useful in some ways, but they are lacking in other aspects. For example, the simplest version of the format style is too verbose.

Clearly, there is redundancy: place is being used multiple times. Similarly, %-formatting is limited in the types (int, str, double) that can be parsed.

f-strings are proposed in PEP498. f-strings are basically literal strings with ‘f’ or ‘F’ as a prefix. They embed expressions using braces, which are evaluated at runtime. Let’s see some simple examples:

I think that’s simpler and better than the other string formatting options. If this feature interests you and you want to learn more about it, then I recommend checking out the PEP498 documentation.

Cheers!

#### Week 6

Last week coala-html was released and my visa application for attending EuroPython, Spain got accepted. It was a joyful week.
\o/

Now, post mid-term, I’ve started working on a new website for coala. Besides having a new UI, it will also have an integrated editor where users can upload a code snippet (presently Python, JavaScript, Perl, Java and PHP only) and let coala run static code analysis on the file. Eventually, the user gets the feedback. Some of the desired features are:

• Let the user choose from the various available bears for their language.
• Autofix the code snippet based on the patch produced.

The features appear cool but should be a little tricky to implement. The challenging part is to automatically generate the coafile based on the settings and also apply the patch upon analysis. Also, I shall implement more features as I progress. May the CSS be with me! :)

Also, in the coming week I shall be conducting an online workshop, mostly to cover version control and workflow. Many newcomers struggle in this area. I am hopeful that this workshop will help them learn the skills that are much needed and get them started quickly. Details to follow. :)

Stay tuned!

### John Detlefs (MDAnalysis)

#### SciPy 2016!

Last week I went to Austin, TX to Scipy2016. I wasn’t sure what to expect. How would people communicate? Would I fit in, what talks would interest me? Fortunately the conference was a huge success. I came away a far more confident and motivated programmer than when I went in.

## So what were the highlights of my experience at Scipy?

On a personal level, I got to meet some of my coworkers, the members of the Beckstein Lab. Dr. Oliver Beckstein, David Dotson, and Sean Seyler are brilliant physicists and programmers who I have been working with on MDAnalysis and datreant. It was surreal to meet the people you have been working with over the internet for 3 months and get an idea of how they communicate and what they enjoy outside of work. It was the modern day equivalent of meeting penpals for the first time.
I especially appreciated that David Dotson and Sean Seyler, both approximately four years my senior, provided invaluable advice to a recent graduate. If you’re reading this, thanks guys.

The most valuable moments were the conversations I had in informal settings. There is a huge diversity in career trajectories among those attending Scipy; everyone has career advice and technical knowledge to impart upon a young graduate as long as you are willing to ask. I had excellent conversations with people from Clover Health, Apple data scientists, Andreas Klockner (Keynote Speaker), Brian Van de Ven (Bokeh Dev), Ana Ruvalcaba at Jupyter, the list goes on…

## Fascinating, Troubling, and Unexpected Insights

• Scipy doubled in size in the last year!
• So many free shirts (and stickers), don’t even bother coming with more than one shirt, also nobody wears professional attire.
• Overheard some troubling comments made by men at Scipy, e.g. “Well, all the women are getting the jobs I’m applying for…” (said in a hallway group, this is not appropriate even if it was a joke)
• The amount of beer involved in social events is kind of nuts; this probably comes with the territory of professional programming.
• There are a lot of apologists for rude people, someone can be extremely nonverbally dismissive and when you bring it up to other people they will defend him (yes, always him) saying something to the effect of ‘he has been really busy recently’. Oliver Beckstein is a shining example of someone who is very busy and makes a conscious effort to always be thoughtful and kind.
• Open source does not always imply open contribution, some companies represented at Scipy maintain open source projects while making the barriers to contribution prohibitively high.
• A lot of people at Scipy apologize for their job (half-seriously) if they aren’t someone super-special like a matplotlib core developer or the inventor of Python. Your jobs are awesome, people!
• It is really hot in Austin.
• git pull is just git fetch + git merge.
• A lot of women in computing have joined and left male dominated organizations not because people are necessarily mean, but because they’ve been asked out too much or harassed in a similar fashion. Stay professional folks.
• Cows turn inedible corn into edible steak.
• As a young professional you have to work harder and take every moment more seriously than those older than you in order to get ahead.
• Breakfast tacos are delicious.
• Being able to get out of your comfort zone is a professional asset.
• Slow down, take a breath, read things over, don’t make simple mistakes.

## Here are some talks I really enjoyed

### Datashader!

### Dating!

### Loo.py!

### Dask!

## July 19, 2016

### Sheikh Araf (coala)

#### [GSoC16] Week 8 update

Time flies and it’s been an astonishingly quick 8 weeks. I’ve finished work for the first coala Eclipse release, and the plug-in will be released with coala 0.8 in the next few days.

Most of the coala team is at EuroPython so the development speed has slowed down. Nevertheless there are software development sprints this weekend and coala will be participating too.

We also plan on having a mini-conference of our own, and will have lightning talks from GSoC students and other coala community members.

As I’m nearing the end of my GSoC project, I’ve started reading up some material in order to get started with implementing the coafile editor. Currently I’m planning to extend either the AbstractTextEditor class or the EditorPart class, or something similar maybe.

Cheers.

### Kuldeep Singh (kivy)

#### After Mid-Term Evaluation

Hello guys! It’s been a month since I wrote a blog. I passed the Mid-Term and my mentors wrote a nice review for me. After my last blog post I have worked on a couple of features. Have a look at my PRs (Pull Requests) on the project Plyer.

I am quite happy with my work and hope to do more in future.

Visit my previous blog here.

#### GSoC week 8 roundup

@cfelton wrote:

There has been a little bit of a slump after the midterms; hopefully this will not continue throughout the rest of the program.

The end will approach quickly.

• August 15th: coding period ends.
• August 15-20th: students submit final code and evaluations.
• August 23-25th: mentors submit final evaluations.

Overall the progress being made is satisfactory. I am looking forward to the next stage of the projects now that the majority of the implementation is complete: analysis of the designs, clean-up, and documentation.

One topic I want to stress: a program like GSoC is very different from much of the work that is completed by an undergrad student. This effort is the student's exposition for this period of time (which isn't insignificant - (doh double negative)). Meaning, the goal isn't simply to show us you can get something working; you are publishing your work to the public. Users should easily be able to use the cores developed and the subblocks within the cores. Developers, reviewers, and contributors should feel comfortable reading the code. The code should feel clean [1]. You (the students) are publishing something into the public domain that carries your name, so take great pride in your work: design, code, documentation, etc.

As well as the readability stated above, the code should be analyzed for performance, efficiency, resource usage, etc. This information should be summarized in the blogs and final documentation.

Student week8 summary (last blog, commits, PR):

jpegenc: health 88%, coverage 97%
• @mkatsimpris: 10-Jul, >5, Y
• @Vikram9866: 25-Jun, >5, Y

riscv: health 96%, coverage 91%
• @meetsha1995: 14-Jul, 1, N

hdmi: health 94%, coverage 90%
• @srivatsan: 02-Jul, 0, N

gemac: health 93%, coverage 92%
• @ravijain056: 04-Jul, 2, N

pyleros: health missing, coverage 70%
• @formulator: 26-Jun, 0, N

Links to the student blogs and repositories:

• Merkourious, @mkatsimpris: gsoc blog, github repo
• Vikram, @Vikram9866: gsoc blog, github repo
• Meet, @meetshah1995: gsoc blog, github repo
• Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
• Ravi, @ravijain056: gsoc blog, github repo
• Pranjal, @forumulator: gsoc blog, github repo

### Yashu Seth (pgmpy)

#### The Canonical Factor

Hey, nice to have you all back.
This post is about the canonical factors used to represent pure Gaussian relations between random variables.

We already have a generalized continuous factor class and even a joint Gaussian distribution class to handle Gaussian random variables, so why do we need another class?

Well, at an abstract level the introduction of continuous random variables is not difficult. As we have seen in the ContinuousFactor class, we can use a range of different methods to represent the probability density functions. We can multiply factors, which in this case corresponds to multiplying the multidimensional continuous functions representing the factors; and we can marginalize out variables in a factor, which in this case is done using integration rather than summation. It is not difficult to show that, with these operations in hand, the sum-product inference algorithms that we used in the discrete case can be applied without change, and are guaranteed to lead to correct answers.

But a closer look at the ContinuousFactor methods reveals that these implementations are not at all efficient. These methods can be used directly only on certain toy examples. Once the number of variables increases, we cannot guarantee that they will run in a feasible amount of time.

In order to provide a better solution, we restrict our variables to the Gaussian universe and hence bring the class JointGaussianDistribution into the picture. While this representation is useful for certain sampling algorithms, a closer look reveals that it cannot be used directly in the sum-product algorithms either. Why? Because operations like product and reduce involve matrix inversions at each step. For a detailed study of these operations, you can refer to Products and Convolutions of Gaussian Probability Density Functions.
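To make the cost of those inversions concrete, here is a minimal NumPy sketch (not pgmpy code) of the product of two Gaussian densities over the same variables, ignoring the normalization constant. The helper names `product_moment_form` and `product_information_form` and the example numbers are made up for illustration. The first helper works on means and covariances and has to invert matrices; the second works on a precision matrix `K` (the inverse covariance) and a vector `h = K @ mean`, where the product reduces to addition.

```python
import numpy as np

def product_moment_form(mu1, cov1, mu2, cov2):
    """Unnormalized product of two Gaussians given as (mean, covariance).

    Every product needs matrix inversions to combine the covariances.
    """
    prec1 = np.linalg.inv(cov1)
    prec2 = np.linalg.inv(cov2)
    cov = np.linalg.inv(prec1 + prec2)
    mu = cov @ (prec1 @ mu1 + prec2 @ mu2)
    return mu, cov

def product_information_form(K1, h1, K2, h2):
    """Same product when each Gaussian is stored as (K, h): plain addition."""
    return K1 + K2, h1 + h2

# Illustrative numbers only.
mu1, cov1 = np.array([1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu2, cov2 = np.array([0.0, 2.0]), np.array([[1.0, -0.3], [-0.3, 1.5]])

mu, cov = product_moment_form(mu1, cov1, mu2, cov2)

# Convert once, multiply by addition, convert back only at the very end.
K1, K2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
K, h = product_information_form(K1, K1 @ mu1, K2, K2 @ mu2)

print(np.allclose(np.linalg.inv(K), cov))     # both routes give the same covariance
print(np.allclose(np.linalg.solve(K, h), mu)) # and the same mean
```

In a long chain of factor products the additive form avoids repeating the inversions at every step, which is the motivation for the representation discussed next.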
So, in order to compactly describe the intermediate factors in a Gaussian network without the costly matrix inversions at each step, a simple parametric representation known as the Canonical Factor is used. This representation is closed under the basic operations used in inference: factor product, factor division, factor reduction, and marginalization. Thus, we can define a set of simple data structures that allow the inference process to be performed. Moreover, the integration operation required by marginalization is guaranteed to produce a finite integral under certain conditions; when it is well defined, it has a simple analytical solution.

### The Canonical Form Representation

The simplest representation used in this setting represents the intermediate result as a log-quadratic form exp(Q(x)), where Q is some quadratic function. In the inference setting, it is useful to make the components of this representation more explicit. The canonical factor representation is characterized by the three parameters K, h and g, which are the coefficients of the quadratic, linear and constant terms of Q: C(x; K, h, g) = exp(-1/2 xᵀKx + hᵀx + g). For details on the representation and these parameters, refer to Section 14.2.1.1 of the book Probabilistic Graphical Models: Principles and Techniques.

### The CanonicalFactor class

Similar to the JointGaussianDistribution class, the CanonicalFactor class is also derived from the ContinuousFactor class, but with its own implementations of the methods required for the sum-product algorithms. These are much more efficient than its parent class methods. Let us have a look at the API of a few methods in this class.

API of the _operate method that is used to define the product and divide methods.
```python
>>> import numpy as np
>>> from pgmpy.factors import CanonicalFactor
>>> phi1 = CanonicalFactor(['x1', 'x2', 'x3'],
...                        np.array([[1, -1, 0], [-1, 4, -2], [0, -2, 4]]),
...                        np.array([[1], [4], [-1]]), -2)
>>> phi2 = CanonicalFactor(['x1', 'x2'], np.array([[3, -2], [-2, 4]]),
...                        np.array([[5], [-1]]), 1)
>>> phi3 = phi1 * phi2
>>> phi3.K
array([[ 4., -3.,  0.],
       [-3.,  8., -2.],
       [ 0., -2.,  4.]])
>>> phi3.h
array([ 6.,  3., -1.])
>>> phi3.g
-1
>>> phi4 = phi1 / phi2
>>> phi4.K
array([[-2.,  1.,  0.],
       [ 1.,  0., -2.],
       [ 0., -2.,  4.]])
>>> phi4.h
array([-4.,  5., -1.])
>>> phi4.g
-3
```

This class also has a method, to_joint_gaussian, to convert the canonical representation back into the joint Gaussian distribution.

```python
>>> import numpy as np
>>> from pgmpy.factors import CanonicalFactor
>>> phi = CanonicalFactor(['x1', 'x2'], np.array([[3, -2], [-2, 4]]),
...                       np.array([[5], [-1]]), 1)
>>> jgd = phi.to_joint_gaussian()
>>> jgd.variables
['x1', 'x2']
>>> jgd.covariance
array([[ 0.5  ,  0.25 ],
       [ 0.25 ,  0.375]])
>>> jgd.mean
array([[ 2.25 ],
       [ 0.875]])
```

Other than these methods, the class has the usual methods like marginalize, reduce and assignment. Details of the entire class can be found here.

So with this I come to the end of this post. Thanks once again for going through it. Hope to see you next time. Bye :-)

## July 17, 2016

### Yen (scikit-learn)

#### Using Function Pointer to Maximize Code Reusability in Cython

When writing C, function pointers are extremely useful because they let us define a callback function, i.e., a way to parametrize a function. This means that some part of the function's behavior is not hard-coded into the function itself, but into the callback function provided by the user. Callers can make the function behave differently by passing different callback functions. A classic example is qsort() from the C standard library, which takes its sorting criterion as a pointer to a comparison function.
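For readers more at home in Python than C, the same callback idea can be sketched with the standard library. This is only a Python analogue of the qsort() pattern, not C or Cython code; `sort_with`, `compare_ascending` and `compare_descending` are hypothetical names chosen for the illustration, while `functools.cmp_to_key` is the standard adapter for qsort-style comparison functions.

```python
from functools import cmp_to_key

def compare_ascending(a, b):
    """qsort-style comparator: negative, zero or positive."""
    return (a > b) - (a < b)

def compare_descending(a, b):
    """Same contract, opposite ordering."""
    return (b > a) - (b < a)

def sort_with(compare, values):
    """The sorting routine is fixed; its criterion is supplied by the caller."""
    return sorted(values, key=cmp_to_key(compare))

data = [3, 1, 2]
ascending = sort_with(compare_ascending, data)
descending = sort_with(compare_descending, data)
print(ascending, descending)
```

The caller changes the behavior of `sort_with` purely by passing a different function, which is the role the comparison function pointer plays for qsort().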
Besides the benefit above, we can also use function pointers straightforwardly to avoid redundant control flow code such as if, else.

In this blog post, I'm going to explain how we can combine function pointers and Cython fused types in an easy way to make function pointers more powerful than ever, and therefore maximize code reusability in Cython.

## Function Pointer

Let's start with why function pointers can help us address the code duplication issue.

Suppose we have the following two functions, one adding one and the other adding two to the function argument:

```c
float add_one(float x) {
    return x+1;
}

float add_two(float x) {
    return x+2;
}
```

Now close your eyes and try your best to imagine that the operation x+1 performed in add_one and the operation x+2 performed in add_two are costly, so they must be implemented in C or they will take several hours to complete.

Okay, based on the imagined reason above, we indeed need to import the C functions above to speed up our Cython function, which will return (x+1)*2+1 if x is an odd number, or (x+2)*2+2 if x is an even number:

```cython
cdef float linear_transform(float x):
    """
    This function will return
    (x+1)*2+1 if x is odd
    (x+2)*2+2 if x is even
    """
    cdef float ans

    if x % 2 == 1:  # x is odd
        ans = add_one(x)
    else:  # x is even
        ans = add_two(x)

    ans *= 2

    # Code duplication
    if x % 2 == 1:
        ans = add_one(ans)
    else:
        ans = add_two(ans)

    return ans
```

As one can see, there is code duplication at the end of this function, because we have to check again whether we need to apply add_one or add_two.

To address this issue, we can define a function pointer and assign it once we know whether x is odd or even. Until the end of the function, we don't have to write the annoying if, else again.
The above code snippet can be reduced to:

```cython
ctypedef float (*ADD)(float x)

cdef float linear_transform(float x):
    """
    This function will return
    (x+1)*2+1 if x is odd
    (x+2)*2+2 if x is even
    """
    cdef ADD add
    cdef float ans

    if x % 2 == 1:  # x is odd
        add = add_one
    else:  # x is even
        add = add_two

    ans = add(x)
    ans *= 2
    ans = add(ans)

    return ans
```

This code snippet is more readable. Function pointers do make our code look neat!

Note: Although there is only one duplication in the above example, there may be many in real code, where the value of function pointers shows more clearly.

## Function Pointer's Limitation

However, function pointers are not omnipotent. Although they provide a good way to write generic code, unfortunately they don't provide you with type generality.

What do I mean?

Suppose we now have the following two C functions that both add 1 to the argument variable, one for type float and one for type double:

```c
float add_one_float(float x) {
    return x+1;
}

double add_one_double(double x) {
    return x+1;
}
```

Now repeat the imagination process described in our first example. For the same reason, we indeed need to load these extern C functions to speed up the following Cython function linear_transform:

```cython
from cython cimport floating

cdef floating linear_transform(floating x):
    """
    This function will return (x+1)*2+1
    with the same type as input argument
    """
    cdef floating ans

    if floating is float:
        ans = add_one_float(x)
    elif floating is double:
        ans = add_one_double(x)

    ans *= 2

    if floating is float:
        ans = add_one_float(ans)
    elif floating is double:
        ans = add_one_double(ans)

    return ans
```

Don't be scared if you haven't seen floating before. In brief, floating here refers to either type float or type double. It is a feature called fused types in Cython, which basically serves the same role as templates in C++ or generics in Java.

Note that now we can't define a function pointer and assign it like we did in our first example, because the C functions add_one_float and add_one_double have different function signatures.
Since C is a strongly typed language, it's hard to define a function pointer that can point to functions with different types (which is why, for example, the standard library qsort still requires a function that takes void* pointers).

NOTE: Usage of void* pointers in C is beyond the scope of this blog post; you can find a simple introduction here. But remember, it's dangerous.

## Function Pointer + Fused Types

Fortunately, fused types are here to rescue us. With this useful tool, we can actually define a fused-type function pointer to solve the above problem!

```cython
ctypedef floating (*ADD)(floating x)

cdef floating linear_transform(floating x):
    """
    This function will return (x+1)*2+1
    with the same type as input argument
    """
    cdef ADD add_one
    cdef floating ans

    if floating is float:
        add_one = add_one_float
    elif floating is double:
        add_one = add_one_double

    ans = add_one(x)    # (x+1)
    ans *= 2            # (x+1)*2
    ans = add_one(ans)  # (x+1)*2+1

    return ans
```

Note that since floating can represent either float or double, a function pointer of type floating achieves type generality, which was not available before we combined fused types with function pointers.

Finally, we are going to demystify this magic trick performed by Cython and make sure that it works properly.

## Demystifying How It Works

In order to know how a Cython fused-type function pointer works, let's become a ninja and dive deep into the C code generated by Cython.

In the generated C code of the above Cython function, there is no if floating is float: anymore. Actually, to accommodate the fused type floating, Cython generates one version of the function for float and another for double.
In the float version of the function, it directly assigns the function pointer we declared to the actual C function that will be called if x is of type float:

```c
__pyx_v_add_one = add_one_float;
```

Likewise, the double version generates

```c
__pyx_v_add_one = add_one_double;
```

which directly assigns the function pointer to the correct function.

The C compiler can then optimize this further, since it can identify variables that remain unchanged within a function. It would find out that the function pointer __pyx_v_add_one is only set once to a constant, i.e., an extern function. Hence, after the object code is linked, calls through __pyx_v_add_one can go directly to the C function.

In contrast, the Python interpreter can provide only little static analysis and code optimization, since the language design doesn't have a separate compile phase.

In sum, always implement your computation heavy code in Cython instead of Python.

## Summary

Combining function pointers with fused types raises their power to another level. It is a generalized version of the original function pointers, and can be used in lots of places to make our code more readable and cleaner.

Also, it is often a good idea to check the C code generated by Cython to make sure it's doing what you hoped.

See you next time!

### mkatsimpris (MyHDL)

#### Week 8

This week passed with a lot of work. The frontend part of the encoder and the zig-zag module were merged into the main repository. So, at this stage the frontend part is ready, and the backend part of the encoder is still missing in order to create the complete JPEG encoder. As discussed with Christopher, I have to write documentation for each module and not leave it for the last days.
Moreover, this week

## July 16, 2016

### Riddhish Bhalodia (dipy)

#### Registration experiments

## What…

Data registration (in our case, of 3D MRI scans) is the process of bringing two different datasets (different in structure, distortion, resolution…) into one coordinate frame, so that any further processing involving both becomes much simpler.

## Why..

As to why registration is important for us, there is a simple answer. We are aiming at template-assisted brain extraction, so the first step is to align (register) the input data to the template data. That will allow for the further processing of the images.

## How..

There are different methods for registration, and some of them are available in DIPY. The most important are affine registration and diffeomorphic (non-linear) registration, so we will be using these routines for several experiments, in combination with simple brain extraction by median otsu.

## Datasets..

## Experiments

I ran several experiments to test different registration combinations and see which would give the best results.

[A] Affine registration with raw input, raw template

levels = [10000,1000,100]

We can see that we need a slight correction, so we follow this up with the next experiment, using non-linear registration on the images to correct the alignment even more.

[B] Non-Linear Registration with affine transformed input and identity pre_align

With default parameters and levels = [10,10,5], the raw data and the affine transformed template (along with the skull) are given as inputs to the diffeomorphic registration.

The above figure does not show much difference from the affine result, so let's view it the other way.
The above figure shows some changes from the affine result and looks a little better. We can see that the skull of the template is causing some problems, so we will also look at the result which uses just the extracted brain from the template.

[C] Non-Linear Registration with pre_align = affine transformation

Here too the levels = [10,10,5], but the inputs are the raw input data and the raw template data.

Again we can't see much difference, so let's see the other image.

Again we see that the skull is causing a problem for the diffeomorphic correction, which leads to the next experiment.

[D] Using only the brain from the template, we repeat part C

It seems a great fit in the above figure, just slightly too large. Now for the alternate representation.

The above registration seems nice. There are a few corrections still needed, so I started tuning the parameters of the non-linear registration.

Another thing to notice is that the boundary of the transformed template output is a little skewed from the brain in the input data, so we have a patch-based method which should correct this! The results of using it will be posted in the next blog.

## Next Up..

I will have an immediate next post about how this registration helped me to get to a good brain extraction algorithm, and describe the algorithm as well.

Then I will have to test it on several datasets to check its correctness, and compare with median otsu.

## July 15, 2016

### sahmed95 (dipy)

#### Model fitting in Python - A basic tutorial on least square fitting, unit testing and plotting Magnetic Resonance images

# Fitting a model using scipy.optimize.leastsq

I am halfway through my Google Summer of Code project with Dipy under the Python Software Foundation. I have published a few short posts about the project before, but in this post I am going to walk through the entire project from the start.
This can also be a good tutorial if you want to learn about curve fitting in Python, unit testing and getting started with Dipy, a diffusion imaging library in Python. The first part of this post will cover general curve fitting with leastsq, plotting and writing unit tests. The second part will focus on Dipy and the specific model I am working on: the Intravoxel Incoherent Motion model.

#### Development branch for the IVIM model in Github : (https://github.com/nipy/dipy/pull/1058)

## Curve fitting with leastsq

Mathematical models proposed to fit observed data can be characterized by several parameters. The simplest model is a straight line characterized by the slope (m) and intercept (C).

## y = mx + C

Given a set of data points, we can determine the parameters m and C by using least squares regression, which minimizes the error between the actual data points and those predicted by the model function. To demonstrate how to use leastsq for fitting, let us generate some data points by importing the library numpy, defining a function and making a scatter plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
# This is a magic function that renders the plots in the IPython notebook itself
%matplotlib inline
```

Let us define a model function which is an exponential decay curve and depends on two parameters.

### y = A exp(-x * B)

Documentation is important, and we should write what each function does and specify the input and output parameters for each function.

```python
# Define the model function
def func(params, x):
    """
    Model function. The function is defined by the following equation.
    y = A exp(-x * B)

    Parameters
    ----------
    params : array (2,)
        An array with two elements, A and B
    x : array (N,)
        The independent variable

    Returns
    -------
    y : array (N,)
        The values of the dependent variable
    """
    A, B = params[0], params[1]
    return A*np.exp(-x*B)
```

Use numpy's arange to generate an array of "x" and apply the function to get a set of data points. We can add some noise by using the random number generator in numpy.

```python
x = np.arange(0,400,10)
N = x.shape[0]
params = 1., 0.01
Y = func(params, x)
noise = np.random.rand(N)
y = Y + noise/5.
```

Let us make a scatter plot of our data.

```python
plt.scatter(x, y, color='blue')
plt.title("Plot")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

Now that we have a set of data points, we can go ahead with the fitting using leastsq. Leastsq is a wrapper around the lmdif and lmder algorithms of the FORTRAN package MINPACK. It uses the Levenberg-Marquardt algorithm for finding the best fit parameters. Leastsq requires an error function which gives the difference between the observed target data (y) and a (non-linear) function of the parameters f(x, params). You also specify initial guesses for the parameters using the variable x0. For more reference check out the documentation at http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html

```python
def error_fun(params, x, y):
    """
    Error function for model function.

    Parameters
    ----------
    params : array (2,)
        Parameters for the function
    x : array (N,)
        The independent variable
    y : array (N,)
        The observed variable

    Returns
    -------
    residual : array (N,)
        The difference between observed data and predicted data
        from the model function
    """
    residual = func(params, x) - y
    return residual
```

Now, we are ready to use leastsq for getting the fit. The call to leastsq returns a tuple whose first element is the set of parameters obtained, while the other elements give more information about the fitting.
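To make the shape of that return value concrete, here is a small self-contained sketch on noiseless data. It redefines the residual inline (the name `residual` is illustrative) so it can run on its own. By default leastsq returns the solution and an integer status flag; with `full_output=True` it returns five items.

```python
import numpy as np
from scipy.optimize import leastsq

def residual(params, x, y):
    """Difference between the exponential decay model and the data."""
    A, B = params
    return A * np.exp(-x * B) - y

x = np.arange(0, 400, 10)
y = 1.0 * np.exp(-x * 0.01)  # noiseless data for A=1, B=0.01

# Default call: (solution, integer status flag)
solution, ier = leastsq(residual, x0=[0.5, 0.05], args=(x, y))

# full_output=True: (solution, cov_x, infodict, mesg, ier)
solution_full, cov_x, infodict, mesg, ier_full = leastsq(
    residual, x0=[0.5, 0.05], args=(x, y), full_output=True)

print("solution:", solution)
print("function evaluations:", infodict["nfev"])
print("message:", mesg)
```

Here `infodict["fvec"]` holds the residuals at the solution, and `cov_x` can be `None` when the problem is singular, so it is worth checking before using it.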
```python
fit = leastsq(error_fun, x0=[0.5, 0.05], args=(x,y))
fitted_params = fit[0]
print ("Parameters after leastsq fit are :", fitted_params)
```

```
Parameters after leastsq fit are : [ 1.03770447  0.00783634]
```

Let us compare our fitting and plot the results.

```python
y_predicted = func(fitted_params, x)
plt.scatter(x, y_predicted, color="red", label="Predicted signal")
plt.scatter(x, y, color="blue", label="Actual data")
plt.title("Plot results")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```

Congratulations! You have done your first model fitting with Python. You can now look at other fitting routines such as scipy.optimize.minimize, use different types of model functions and try to write code for more complicated functions and data. But at the heart of it, it's as simple as this basic example.

# Writing unit tests

Unit testing is a software testing method by which individual units of source code, or sets of one or more modules together, are tested to determine whether they are fit for use. Unit tests are usually very simple to write and involve very elementary tests on the expected output of your code. In the current context, it can be as simple as checking if your fit gives back the same parameters you used to generate the data. We will be using numpy's unit testing suite for our test.

```python
from numpy.testing import (assert_array_equal, assert_array_almost_equal)

def test_fit():
    """
    Test the implementation of leastsq fitting
    """
    x = np.arange(0,400,10)
    N = x.shape[0]
    params = 1., 0.01
    y = func(params, x)

    fit = leastsq(error_fun, x0=[0.5, 0.05], args=(x,y))
    fitted_params = fit[0]
    predicted_signal = func(fitted_params, x)

    # Check if fit returns parameters for noiseless simulated data
    assert_array_equal(params, fitted_params)
    assert_array_almost_equal(y, predicted_signal)

test_fit()
# Usually, the package nosetests is used for unit testing. You can run the command nosetests test_file.py to run
# all the tests for a module.
```
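Exact comparison of floating-point fit results can be brittle, so a tolerance-based variant of this test may be more robust. The following is a minimal sketch, assuming the same exponential model and starting guess as above; the names `model`, `residual` and `test_fit_with_tolerance`, and the tolerance `rtol=1e-3`, are illustrative choices rather than part of the original notebook.

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy.optimize import leastsq

def model(params, x):
    """Exponential decay y = A exp(-x * B)."""
    A, B = params
    return A * np.exp(-x * B)

def residual(params, x, y):
    """Difference between the model prediction and the data."""
    return model(params, x) - y

def test_fit_with_tolerance():
    """Fit noiseless data and compare with the true parameters up to a tolerance."""
    x = np.arange(0, 400, 10)
    true_params = (1.0, 0.01)
    y = model(true_params, x)

    fitted_params, ier = leastsq(residual, x0=[0.5, 0.05], args=(x, y))

    # Relative tolerance instead of exact equality (rtol is an assumed value).
    assert_allclose(fitted_params, true_params, rtol=1e-3)
    assert_allclose(model(fitted_params, x), y, atol=1e-6)

test_fit_with_tolerance()
```

For noisy data the tolerance would have to be loosened to match the noise level, which is the situation explored next.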
So your tests pass for a noiseless fit and you get back your original parameters. But what if there is noise in the data? Let us write a more extreme version of the test and see if it fails.

```python
def test_fit_with_noise():
    """
    Test the fitting in presence of noise
    """
    x = np.arange(0,400,10)
    N = x.shape[0]
    params = 1., 0.01
    Y = func(params, x)

    noise = np.random.rand(N)
    y = Y + noise/5.

    fit = leastsq(error_fun, x0=[0.5, 0.05], args=(x,y))
    fitted_params = fit[0]
    predicted_signal = func(fitted_params, x)

    # Check if fit returns parameters for noiseless simulated data
    assert_array_equal(params, fitted_params)
    assert_array_almost_equal(y, predicted_signal)

test_fit_with_noise()
```

```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-146-dd7240a6a422> in <module>()
     20     assert_array_almost_equal(y, predicted_signal)
     21
---> 22 test_fit_with_noise()

<ipython-input-146-dd7240a6a422> in test_fit_with_noise()
     17
     18     # Check if fit returns parameters for noiseless simulated data
---> 19     assert_array_equal(params, fitted_params)
     20     assert_array_almost_equal(y, predicted_signal)
     21

/usr/local/lib/python3.4/dist-packages/numpy/testing/utils.py in assert_array_equal(x, y, err_msg, verbose)
    737     """
    738     assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
--> 739                          verbose=verbose, header='Arrays are not equal')
    740
    741 def assert_array_almost_equal(x, y, decimal=6, err_msg='', verbose=True):

/usr/local/lib/python3.4/dist-packages/numpy/testing/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision)
    663                                 names=('x', 'y'), precision=precision)
    664         if not cond :
--> 665             raise AssertionError(msg)
    666     except ValueError as e:
    667         import traceback

AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([ 1.  ,  0.01])
 y: array([ 1.079017,  0.007812])
```

## Thus, unit tests can make sure that your code is working properly and it behaves as it is expected to behave.

# The Intravoxel incoherent motion model

The Intravoxel incoherent motion model assumes that biological tissue includes a volume fraction f of water flowing in perfused capillaries, with a perfusion coefficient D*, and a fraction (1-f) of static (diffusion only) intra- and extracellular water, with a diffusion coefficient D. Magnetic resonance imaging can be used to reconstruct a model of the brain (or any other tissue) using these parameters.

Dipy is a Python library for the analysis of diffusion MRI data. Let us get started by importing dipy, loading an image dataset and visualizing a slice of the brain. Dipy has "data fetchers" which can be used to download different datasets, read an image and load other attributes. Other datasets have their own fetchers (for example, read_stanford_hardi for the Stanford HARDI dataset); this example uses an IVIM dataset. To get its fetcher, read_ivim, you have to pull the ivim_dev branch from dipy. The dataset is provided by Eric Peterson. For more details refer to https://figshare.com/articles/IVIM_dataset/3395704

```python
from dipy.data import read_ivim
img, gtab = read_ivim()
```

```
Dataset is already in place. If you want to fetch it again please first remove the folder /home/shahnawaz/.dipy/ivim
```

The function read_ivim downloads the dataset, reads it and returns two objects: an image and a gradient table. The gradient table has b-values and b-vectors which can be accessed by calling gtab.bvals and gtab.bvecs. The b-value is a factor that reflects the strength and timing of the gradients used to generate diffusion-weighted images. The higher the b-value, the stronger the diffusion effects.

```python
print("b-values are :\n", gtab.bvals)
```

```
b-values are :
 [ 0. 10. 20. 30. 40. 60. 80. 100. 120. 140. 160. 180. 200. 300. 400. 500. 600. 700. 800. 900. 1000.]
```

"img" is a NIFTI image which can be explored using the nibabel library.

```python
print(img)
```

```
<class 'nibabel.nifti1.Nifti1Image'>
data shape (256, 256, 54, 21)
affine:
[[ -0.9375 -0. 0. 120.03099823]
 [ -0. 0.9375 -0. -91.56149292]
 [ 0. 0. 2.5 -75.15000153]
 [ 0. 0. 0. 1. ]]
metadata:
<class 'nibabel.nifti1.Nifti1Header'> object, endian='<'
sizeof_hdr : 348
data_type : b''
db_name : b''
extents : 0
session_error : 0
regular : b'r'
dim_info : 0
dim : [ 4 256 256 54 21 1 1 1]
intent_p1 : 0.0
intent_p2 : 0.0
intent_p3 : 0.0
intent_code : none
datatype : float32
bitpix : 32
slice_start : 0
pixdim : [-1. 0.9375 0.9375 2.5 1. 0. 0. 0. ]
vox_offset : 0.0
scl_slope : nan
scl_inter : nan
slice_end : 0
slice_code : unknown
xyzt_units : 2
cal_max : 0.0
cal_min : 0.0
slice_duration : 0.0
toffset : 0.0
glmax : 0
glmin : 0
descrip : b''
aux_file : b''
qform_code : aligned
sform_code : scanner
quatern_b : -0.0
quatern_c : 1.0
quatern_d : 0.0
qoffset_x : 120.03099822998047
qoffset_y : -91.56149291992188
qoffset_z : -75.1500015258789
srow_x : [ -0.9375 -0. 0. 120.03099823]
srow_y : [ -0. 0.9375 -0. -91.56149292]
srow_z : [ 0. 0. 2.5 -75.15000153]
intent_name : b''
magic : b'n+1'
```

For our use, we just need the data, which can be read by calling the get_data() method of the img object.

```python
data = img.get_data()
```

The data has 54 slices, with 256-by-256 voxels in each slice. The fourth dimension corresponds to the b-values in the gtab.

```python
print (data.shape)
```

```
(256, 256, 54, 21)
```

Let us take a slice of the data at a particular b-value and height (z) and visualize it.

```python
plt.imshow(data[:,:,31, 0], cmap='gray', origin ="lower")
plt.show()
```

Thus, for each b-value we have a 3D matrix which gives us the signal value in a 256x256x54 box. To see how the signal at one of the points in this volume (one voxel) varies with b-value, let us consider a random point from the data slice.
Say we want the data in a 20x20 square at the height z=31 for all the b-values. We specify the range of indices from our array using ":" as follows:

```python
x1, x2 = 100, 120
y1, y2 = 150, 170
z = 31
data_slice = data[x1:x2, y1:y2, z, :]
print(data_slice.shape)
```

```
(20, 20, 21)
```

The data slice is now a 20x20 portion, with the third dimension specifying the b-values. We can visualize this slice at a particular b-value, and extract the x and y data for the voxel located at (0,0) for all bvals (:).

```python
# Show the 20x20 patch at the first b-value
plt.imshow(data_slice[:, :, 0], origin="lower", cmap="gray")
plt.show()
```

```python
xdata = gtab.bvals
ydata = data_slice[0,0,:]
plt.scatter(xdata, ydata)
plt.xlabel("b-values")
plt.ylabel("Signal")
plt.show()
```

The fitting method remains the same now. Let us try to fit the IVIM model function on this data and get IVIM parameters for this voxel. The IVIM model function is a bi-exponential curve with 4 parameters, S0, f, D_star and D, and is defined as follows:

## S(b) = S0(f e^(- b D_star) + (1 - f) e^(-b D))

```python
def ivim_function(params, bvals):
    r"""The Intravoxel incoherent motion (IVIM) model function.

    S(b) = S_0[f*e^{(-b*D\*)} + (1-f)e^{(-b*D)}]

    S_0, f, D\* and D are the IVIM parameters.

    Parameters
    ----------
    params : array
        parameters S0, f, D_star and D of the model
    bvals : array
        bvalues

    References
    ----------
    .. [1] Le Bihan, Denis, et al. "Separation of diffusion and perfusion
           in intravoxel incoherent motion MR imaging." Radiology 168.2
           (1988): 497-505.
    .. [2] Federau, Christian, et al. "Quantitative measurement of brain
           perfusion with intravoxel incoherent motion MR imaging."
           Radiology 265.3 (2012): 874-881.
    """
    S0, f, D_star, D = params
    S = S0 * (f * np.exp(-bvals * D_star) + (1 - f) * np.exp(-bvals * D))
    return S


def _ivim_error(params, bvals, signal):
    """Error function to be used in fitting the IVIM model
    """
    return (signal - ivim_function(params, bvals))
```

Once we have our error function, we can use the same procedure to fit an IVIM model by using the leastsq function from scipy.optimize.

```python
x0 = [1000., 0.1, 0.01, 0.001]
fit = leastsq(_ivim_error, x0, args=(xdata, ydata))
estimated_parameters = fit[0]
print('Parameters estimated using leastsq :', estimated_parameters)
```

```
Parameters estimated using leastsq : [ 4.44684213e+03 1.41584992e-02 -1.88101978e-03 1.10032569e-03]
```

```python
predicted_signal = ivim_function(estimated_parameters, xdata)
plt.plot(xdata, predicted_signal, label="Estimated signal")
plt.scatter(xdata, ydata, color="red", label="Observed data")
plt.xlabel("b-values")
plt.ylabel("Signal")
plt.legend()
plt.show()
```

### This forms the core of the project and once you get the hang of this, you can easily implement any new model for fitting data.

#### Fitting models using scipy and unit testing

Hi everyone, this is my first blog after the start of the coding period, and the past two weeks have been quite busy and eventful. We now have working code to get model parameters, which has been tested with simulated data. This is my first time with software testing and it has been quite a learning experience.

So, let me describe the work so far. Although we are fitting the IVIM (Intravoxel incoherent motion) model to dMRI signals here, the techniques and code developed so far can be used for any kind of model fitting.

The equation we want to fit :

S = S0 [f e^{-b*D_star} + (1 - f) e^{-b*D}]

but first let us try some basic fitting using Scipy's leastsq.
# Fitting a model using scipy.optimize.leastsq

I am halfway through my Google Summer of Code project with Dipy under the Python Software Foundation. I have published a few short posts about the project before, but in this post I am going to walk through the entire project from the start. This can also be a good tutorial if you want to learn about curve fitting in Python, unit testing, constructing Jacobians for faster fitting and getting started with Dipy, a diffusion imaging library in Python.

The first part of this post will cover general curve fitting with leastsq, writing a Jacobian and writing unit tests. The second part will focus on Dipy and the specific model I am working on: the Intravoxel Incoherent Motion model.

## Curve fitting with leastsq

Mathematical models proposed to fit observed data can be characterized by several parameters. For example, the simplest model is a straight line, which is characterized by its slope and intercept. If we consider "x" as the independent variable and "y" as the dependent variable, the relationship between them is controlled by the model parameters m and C when we propose the following model:

y = mx + C

Given a set of data points, we can determine the parameters m and C by least squares regression, which is simply minimization of the error between the actual data points and those predicted by the model function. To demonstrate how to use leastsq for fitting, let us generate some data points by importing the library numpy, defining a model function and making a scatter plot.

In [47]:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.optimize import leastsq
    # This is a magic function that renders the plots in the IPython notebook itself
    %matplotlib inline

Let us define a model function which is an exponential decay curve and depends on two parameters.

### y = A exp(-x * B)

Documentation is important, so we will specify the input parameters and the output of our function.
In [19]:

    # Define the model function (an exponential decay)
    def func(params, x):
        """
        Model function. The function is defined by the following equation.

            y = A exp(-x * B)

        Parameters
        ----------
        params : array (2,)
            An array with two elements, A and B
        x : array (N,)
            The independent variable

        Returns
        -------
        y : array (N,)
            The values of the dependent variable
        """
        A, B = params[0], params[1]
        return A*np.exp(-x*B)

Use numpy's arange to generate an array of "x" and apply the function to get a set of data points. We can add some noise by using the random number generator in numpy.

In [68]:

    x = np.arange(0,400,10)
    N = x.shape[0]
    params = 1., 0.01
    Y = func(params, x)
    noise = np.random.rand(N)
    y = Y + noise/5.

Let us make a scatter plot of our data.

In [69]:

    plt.scatter(x, y, color='blue')
    plt.title("Plot of x vs y")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()

Now that we have a set of data points, we can go ahead with the fitting using leastsq. leastsq is a wrapper around the lmdif and lmder algorithms of the FORTRAN package MINPACK. It uses the Levenberg-Marquardt algorithm for finding the best fit parameters. leastsq requires an error function which gives the difference between the observed target data (y) and a (non-linear) function of the parameters f(x, params). You can also specify initial guesses for the parameters using the variable x0. For more details, check out the documentation at http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html

In [70]:

    def error_fun(params, x, y):
        """
        Error function for model function.

        Parameters
        ----------
        params : array (2,)
            Parameters for the function
        x : array (N,)
            The independent variable
        y : array (N,)
            The observed variable

        Returns
        -------
        residual : array (N,)
            The difference between observed data and predicted data
            from the model function
        """
        residual = func(params, x) - y
        return residual

Now, we are ready to use leastsq for getting the fit.
The call to leastsq returns a tuple whose first element is the set of parameters obtained; the other elements give more information about the fitting.

In [71]:

    fit = leastsq(error_fun, x0=[0.5, 0.05], args=(x,y))
    fitted_params = fit[0]
    print("Parameters after leastsq fit are :", fitted_params)

    Parameters after leastsq fit are : [ 0.98855093 0.00679147]

Let us compare our fitting and plot the results.

In [74]:

    y_predicted = func(fitted_params, x)
    # x goes on the horizontal axis so that the data match the axis labels
    plt.scatter(x, y_predicted, color="red", label="Predicted signal")
    plt.scatter(x, y, color="blue", label="Actual data")
    plt.title("Plot results")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()

Congratulations! You have done your first model fitting with Python. You can now look at other fitting routines such as scipy.optimize.minimize, use different types of model functions and try to write code for more complicated functions and data. But at the heart of it, it's as simple as this basic example.

The IVIM model function is a bi-exponential curve with the independent variable "b" and parameters f, D_star, D and S0. The parameters are the perfusion fraction (f), the pseudo-diffusion constant (D_star), the tissue diffusion constant (D) and the non-gradient signal value (S0) respectively. We intend to extract these parameters given some data and the bvalues (b) associated with the data. We follow test-driven development, and hence the first task was to simulate some data, write a fitting function and see if we are getting the correct results. An IPython notebook tests a basic implementation of the fitting routine by generating a signal and plotting the results. We have the following plots for two different signals generated with Dipy's multi_tensor function and the fitting by our code.

https://github.com/sahmed95/dipy/blob/ipythonNB/dipy/reconst/ivim_dev.ipynb

Once a basic fitting routine was in place, we moved on to incorporate unit tests for the model and fitting functions. Writing a test is pretty simple.
Create a file named test_ivim.py and define the test functions as "test_ivim():". Inside the test, generate a signal by taking a set of bvalues and passing the IVIM parameters to the multi_tensor function. Then, instantiate an IvimModel and call its fit method to get model_parameters. The parameters obtained from the fit should be nearly the same as the parameters used to generate the signal. This can be checked with numpy's testing functions assert_array_equal() and assert_array_almost_equal(). We use nosetests for testing, which makes running a test as simple as "nosetests test_ivim.py". It is necessary to build the package first by running the setup.py file with the arguments build_ext --inplace.

The next step is to implement a two-stage fitting where we will consider a bvalue threshold and carry out a fitting of D while assuming that the perfusion fraction is negligible. This will simplify the model to S = S0 e^{-b*D}. The D value thus obtained will be used as a guess for the complete fit. You can find the code so far here : https://github.com/nipy/dipy/pull/1058

A little note about the two fitting routines we explored: scipy's leastsq and minimize. "minimize" is the more flexible method: it allows bounds to be passed while fitting and lets the user specify particular fitting methods, while leastsq doesn't have that flexibility and uses MINPACK's lmdif written in FORTRAN. However, leastsq performed better on our tests than minimize. A possible reason might be that leastsq calculates the Jacobian for fitting and may be better suited to fit bi-exponential functions. It might also turn out that implementing the Jacobian improves the performance of minimize for fitting. All these questions will be explored later, and for now we have included both fitting routines in our code. However, as Ariel pointed out, minimize is not available in older versions of scipy, and to have backward compatibility we might go ahead with leastsq.
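The two-stage idea described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the pull request: the threshold `split_b=200`, the starting guess for f and the rule of thumb "D_star is about ten times D" are assumptions made for the example, and `two_stage_fit` is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import leastsq


def ivim_function(params, bvals):
    """Bi-exponential IVIM signal for params = (S0, f, D_star, D)."""
    S0, f, D_star, D = params
    return S0 * (f * np.exp(-bvals * D_star) + (1 - f) * np.exp(-bvals * D))


def ivim_error(params, bvals, signal):
    """Residuals between the observed signal and the IVIM prediction."""
    return signal - ivim_function(params, bvals)


def two_stage_fit(bvals, signal, split_b=200.0):
    """Estimate (S0, f, D_star, D) using a linear first stage.

    split_b and the guesses for f and D_star are illustrative assumptions.
    """
    # Stage 1: above the threshold, treat the signal as S0 * exp(-b * D),
    # so log(S) is a straight line in b with slope -D.
    high = bvals >= split_b
    slope, intercept = np.polyfit(bvals[high], np.log(signal[high]), 1)
    D_guess = -slope
    S0_guess = np.exp(intercept)

    # Stage 2: full bi-exponential fit seeded with the stage 1 estimates.
    x0 = [S0_guess, 0.1, 10.0 * D_guess, D_guess]
    fitted, _ = leastsq(ivim_error, x0, args=(bvals, signal))
    return fitted


bvals = np.array([0., 10., 20., 30., 40., 60., 80., 100., 120., 140., 160.,
                  180., 200., 300., 400., 500., 600., 700., 800., 900., 1000.])
true_params = np.array([100., 0.2, 0.008, 0.0009])
signal = ivim_function(true_params, bvals)  # noiseless synthetic signal
print("Two-stage estimate:", two_stage_fit(bvals, signal))
```

On real, noisy data the first stage mainly serves to keep the full fit away from poor starting points; the threshold would have to be chosen for the acquisition at hand.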
Next up, we will implement a two-stage fitting as discussed in Le Bihan's original 1988 paper on IVIM [1]. Meanwhile, feel free to fork the development version of the code and play around with the IVIM data provided by Eric Peterson here : https://figshare.com/s/4733b6fc92d4977c2ee1

1. Le Bihan, Denis, et al. "Separation of diffusion and perfusion in intravoxel incoherent motion MR imaging." Radiology 168.2 (1988): 497-505.

#### leastsq vs minimize ! Leastsq wins with an almost 100 times faster fit.

So, after some profiling and testing, I found that the difference in performance between the two fitting routines from scipy.optimize, "minimize" and "leastsq", is huge! Minimize seems to be at least 10 times slower than leastsq on the same data (about 36 times slower in the run reported below). Have a look at the code below for more details. I used a bi-exponential curve, which is a sum of two exponentials, and the model function has 4 parameters to fit: S, f, D_star and D. All default parameters for fitting were used and the fitting function is the IVIM model function:

S [f e^(-x * D_star) + (1 - f) e^(-x * D)]

Using the time module we got the following results (in seconds):

Time taken by leastsq : 0.0003180503845214844
Time taken by minimize : 0.011617898941040039

We were in favor of minimize earlier since it allowed setting bounds for the parameters and gave the user different options for the fitting algorithm, such as L-BFGS-B, Truncated Newton etc. But due to the huge difference in performance, we have decided to use leastsq for fitting and use a "hack" to implement the bounds while continuing to use leastsq. The absence of a bounded leastsq method was a long-standing issue in Scipy, which was resolved in version 0.17 by the least_squares function, which takes bounds and uses a "Trust Region Reflective" algorithm for fitting.
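For readers on scipy 0.17 or newer, a bounded fit with least_squares might look like the sketch below. It is not what the project uses; the bounds (all parameters non-negative, f and the two diffusion constants at most 1), the starting point and the helper name `bounded_ivim_fit` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import least_squares


def ivim_function(params, bvals):
    """Bi-exponential IVIM signal for params = (S0, f, D_star, D)."""
    S0, f, D_star, D = params
    return S0 * (f * np.exp(-bvals * D_star) + (1 - f) * np.exp(-bvals * D))


def ivim_error(params, bvals, signal):
    """Residuals between the observed signal and the IVIM prediction."""
    return signal - ivim_function(params, bvals)


def bounded_ivim_fit(bvals, signal, x0):
    """Fit the IVIM model with box constraints (bounds are assumptions)."""
    lower = [0.0, 0.0, 0.0, 0.0]
    upper = [np.inf, 1.0, 1.0, 1.0]
    result = least_squares(
        ivim_error,
        x0,
        bounds=(lower, upper),
        method="trf",    # Trust Region Reflective, supports bounds
        x_scale="jac",   # rescale parameters that differ by orders of magnitude
        args=(bvals, signal),
    )
    return result.x


bvals = np.array([0., 10., 20., 30., 40., 60., 80., 100., 120., 140., 160.,
                  180., 200., 300., 400., 500., 600., 700., 800., 900., 1000.])
true_params = np.array([100., 0.2, 0.008, 0.0009])
signal = ivim_function(true_params, bvals)  # noiseless synthetic signal
estimate = bounded_ivim_fit(bvals, signal, x0=[90., 0.1, 0.01, 0.001])
print("Bounded estimate:", estimate)
```

The starting point has to lie inside the bounds, otherwise least_squares raises an error.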
This also meant a change to the Jacobian, which works out as follows:

# Jacobian for IVIM function (biexponential)

## Shahnawaz Ahmed

Let $i$ index the data points to fit and $j$ index the parameters. In this case $i$ runs over the bvalues. Let $x_i$ be the independent variable (bvalue). Let $\beta$ denote the vector of parameters, $\beta = (S_0, f, D^*, D)$. Thus we have $j = 0, 1, 2, 3$.

The IVIM signal is:

$$S(x_i, \beta) = \beta_0 \left[ \beta_1 e^{-x_i \beta_2} + (1 - \beta_1) e^{-x_i \beta_3} \right]$$

Here $\beta_0 = S_0$, $\beta_1 = f$, $\beta_2 = D^*$, $\beta_3 = D$ and $x_i = b$.

The residual that we need to find the Jacobian for is:

$$r_i = y_i - S(x_i, \beta)$$

The Jacobian is defined as:

$$\frac{\partial r_i}{\partial \beta_j}$$

Thus we will have:

$$J_{i,j} = -\frac{\partial S}{\partial \beta_j}$$

The various terms are thus:

$$\frac{\partial S}{\partial \beta_0} = \beta_1 e^{-x_i \beta_2} + (1 - \beta_1) e^{-x_i \beta_3}$$

$$\frac{\partial S}{\partial \beta_1} = \beta_0 \left[ e^{-x_i \beta_2} - e^{-x_i \beta_3} \right]$$

$$\frac{\partial S}{\partial \beta_2} = \beta_0 \left[ -x_i \beta_1 e^{-x_i \beta_2} \right]$$

$$\frac{\partial S}{\partial \beta_3} = \beta_0 \left[ -x_i (1 - \beta_1) e^{-x_i \beta_3} \right]$$

This should give us the Jacobian.

More discussion on the issue can be found here : http://stackoverflow.com/questions/6779383/scipy-difference-between-optimize-fmin-and-optimize-leastsq

Code to compare the performance of leastsq and minimize:

    import numpy as np
    from scipy.optimize import minimize, leastsq
    from time import time


    def ivim_function(params, bvals):
        r"""The Intravoxel incoherent motion (IVIM) model function.

        S(b) = S_0[f*e^{(-b*D\*)} + (1-f)e^{(-b*D)}]

        S_0, f, D\* and D are the IVIM parameters.

        Parameters
        ----------
        params : array
            parameters S0, f, D_star and D of the model
        bvals : array
            bvalues

        References
        ----------
        .. [1] Le Bihan, Denis, et al. "Separation of diffusion and perfusion
               in intravoxel incoherent motion MR imaging." Radiology 168.2
               (1988): 497-505.
        .. [2] Federau, Christian, et al. "Quantitative measurement of brain
               perfusion with intravoxel incoherent motion MR imaging."
               Radiology 265.3 (2012): 874-881.
        """
        S0, f, D_star, D = params
        S = S0 * (f * np.exp(-bvals * D_star) + (1 - f) * np.exp(-bvals * D))
        return S


    def _ivim_error(params, bvals, signal):
        """Error function to be used in fitting the IVIM model"""
        return (signal - ivim_function(params, bvals))


    def sum_sq(params, bvals, signal):
        """Sum of squares of the errors. This function is minimized"""
        return np.sum(_ivim_error(params, bvals, signal)**2)


    x0 = np.array([100., 0.20, 0.008, 0.0009])
    bvals = np.array([0., 10., 20., 30., 40., 60., 80., 100.,
                      120., 140., 160., 180., 200., 220., 240.,
                      260., 280., 300., 350., 400., 500., 600.,
                      700., 800., 900., 1000.])
    data = ivim_function(x0, bvals)

    # Time the fit with minimize
    optstart = time()
    opt = minimize(sum_sq, x0, args=(bvals, data))
    optend = time()
    time_taken = optend - optstart
    print("Time taken for opt:", time_taken)

    # Time the fit with leastsq
    lstart = time()
    lst = leastsq(_ivim_error, x0, args=(bvals, data),)
    lend = time()
    time_taken = lend - lstart
    print("Time taken for leastsq :", time_taken)

    print('Parameters estimated using minimize :', opt.x)
    print('Parameters estimated using leastsq :', lst[0])

## July 14, 2016

### meetshah1995 (MyHDL)

#### RISC-V Core review

The last two weeks have mostly been spent debugging and reviewing cores for our project. I had gathered all the RISC-V implementations and reviewed them this week. Here is a list of the cores I reviewed. I was also struggling with my health last week and thus could not post on the blog. I am also currently working on debugging the decoder as a side task.
Core Review

# Cores :

• Yarvi
• Zscale
• Shakti
  • E Class
  • C Class
• PicoRV32

## Yarvi

• Type : In-order, 3 stage pipeline
• Source : SystemVerilog
• ISA : RV32I
• Cache : None
• Branch Predictor : None
• Instruction Mem : 4KB
• Data Mem : 4KB
• Memory Interface : Avalon-MM
• RISCV toolchain : riscv32-unknown-elf-gcc
• Simulation support : Yes
• In development : No
• Future updates : No

## Zscale

• Source : Chisel and Verilog (vscale)
• Type : Single-issue, In-order, 3 stage pipeline
• ISA : RV32
• Cache : None
• Branch Predictor : None
• Instruction Mem : 4KB
• Data Mem : 4KB
• Memory Interface : AHB-Lite bus
• In development : Yes
• Future updates : Yes
• RISCV toolchain : riscv32
• Simulation support : Yes

## SHAKTI Cores

### E Class

• Source : Bluespec
• Type : In-order, 3 stage pipeline
• ISA : RV32I
• Cache : L2
• Branch Predictor : Custom
• Instruction Mem : 4KB
• Data Mem : 4KB
• Memory Interface : AXI
• In development : Yes
• Future updates : Yes
• RISCV toolchain : riscv32
• Simulation support : No

### C Class

• Source : Bluespec
• Type : In-order, 8 stage pipeline
• ISA : RV32I
• Cache : L2
• Branch Predictor : Custom
• Instruction Mem : 4KB
• Data Mem : 4KB
• Memory Interface : AXI
• In development : Yes
• Future updates : Yes
• RISCV toolchain : None
• Simulation support : No

## PicoRV32

• Source : Verilog
• Type : In-order, 8 stage pipeline
• ISA : RV32IMC
• Cache : L2
• Branch Predictor : Custom
• Instruction Mem : 4KB
• Data Mem : 4KB
• Memory Interface : AXI Lite
• In development : Yes
• Future updates : Yes
• RISCV toolchain : rv32
• Simulation support : No

See you next week!

MS.

### Aron Barreira Bordin (ScrapingHub)

#### Scrapy-Streaming [4/5] - Scrapy With R Language

Hello,

In these weeks, I’ve added support for R Language on Scrapy Streaming. Now, you can develop scrapy spiders easily using the scrapystreaming R package.

## About

It’s a helper library to help the development process of external spiders in R.
It allows you to create the scrapy streaming json messages using R commands. ## Docs You can read the official docs here: http://gsoc2016.readthedocs.io/en/latest/r.html ## Examples I’ve added a few examples about it, and a quickstart section in the documentation. ## PRs Thanks for reading, Aron. Scrapy-Streaming [4/5] - Scrapy With R Language was originally published by Aron Bordin at GSoC 2016 on July 13, 2016. ## July 13, 2016 ### ghoshbishakh (dipy) #### Building VTK with Python3 Wrappers In my previous post I tried to give a short guide on how to build VTK with python wrappers. But then I tried to build it for python3 and found that my own guide does not work :P . So here is an updated version of the guide. I am using Arch Linux. But this should work for most linux distros. ## Fetch the source code of VTK We have two options here 1. Download the latest release from VTK Website 2. Clone the git repository from gitlab We will clone the source from GitLab here This will create a directory named vtk ## Build VTK We will build the source from a different folder. Now configure cmake properly. This is the most important part. We will see a screen like this: Now press c to configure. After configuring several options should appear. Press t to toggle to Advanced mode. ### Now change the following: Toggle BUILD_TESTING on. Toggle VTK_WRAP_PYTHON on. Toggle VTK_WRAP_TCL on. Change CMAKE_INSTALL_PREFIX to /usr Change VTK_PYTHON_VERSION to 3.5 (your python version) Change PYTHON_EXECUTABLE to your python executable. For me it was /usr/local/bin/python . You can check it by typing which python in your terminal. Change PYTHON_INCLUDE_DIR to the directory where the python libraries are installed. In my case it is /usr/lib/python3.5/ ### Configure and Build Press c to configure. Now press g to generate Makefile. 
( You may have to press c and g again as it sometimes does not work properly ) Then install VTK and the Wrappers And that should be it :) Check your vtk installation: Open python and run If that did not raise any error then you should be good to go! Please leave a comment if you have any issues. ### fiona (MDAnalysis) #### Wrapping the Universe Hello again! I was away most of last week, hence the delay for this post, but I’m finally here to bring you your next dose of explanatory kitties! This post I’m going to talk about what I’ve been doing for the second part of my project: a container for dealing with Umbrella Sampling (US) simulations (if you can’t remember what US is all about, you can go back and read about it in this post). Originally, I envisioned this as a more specific ‘Umbrella Class’ for storing the simulation for each umbrella window and the associated restraint constants, restrained values, etc. However, it was pointed out that a more general way of storing a collection of Universes and associated metadata could be useful in a range of situations, including for US. Metadata here will likely largely be other data describing a simulation, but not associated with individual frames (as for ‘auxiliary’ data) - like the value of the reaction coordinate restrained to in US. So here’s the general idea of what we want our general ‘Universe collection’ container to do: • Store a set of MDAnalysis Universes, allowing iteration over all, access to each by index or name, and to add or remove. In the US case, there’ll be one Universe for each of our umbrella windows. • Deal with auxiliary data: check what auxiliaries all Universes have in common (e.g. to check all US universes have a ‘pull force’ auxiliary before we try running WHAM), and allow to add or alter (auxiliaries can be altered directly through each universe, but where set-up for each Universe is the same it will be convenient to do at once). 
• Deal with ‘metadata’: as for auxiliaries, check what metadata all Universes have in common (again, checking for analysis), and allow to add/alter We already have a place for auxiliary data, but we still need somewhere to store the metadata. We considered a couple of options, listed below. I made a quick example of how each might work in practice, which you can check out here. 1. Directly in the container, as an attribute (alongside a dictionary of universes). We’d need to make sure we could match up the data with it’s universe, and we can’t associate metadata with a universe independently of the container. 2. Part of the Universe, in a new ‘data’ attribute. We’d now be able to use metadata without a Universe collection container, though we’ll need to make sure our naming is consistent within it. 3. In a newly-written wrapper for Universe, to avoid cluttering up the universe itself. We’d have a new class that would basically store a Universe and its set of metadata. The Universe collection container would now store a dictionary of wrapped-universes. 4. In an existing wrapper: MDSynthesis* Sims, which would allow storage of a universe and associated data (in a categories dictionary) without us having to write a wrapper ourselves. With MDSynthesis, we’d also be able to store our set of Sims as a Bundle, which already provides a lot of the functionality we want for our container. However, it requires MDSynthesis be installed to work. *MDSynthesis is built on top of datreant, a Python library that allows us to identify, filter or group files across our file system from a Pythonic interface, with the help of tags and categories we can assign. It has persistent storage, so we can access our files/tags/categories again any time we start a new Python session. 
MDSynthesis builds on this for MD simulations, allowing us to directly reload an MDAnalysis universe without having to respecify the appropriate files and other arguments every session, and generally useful for keeping track of all our simulations. I recommend checking it out!. Of these, using MDSynthesis seems the most attractive option: the existing features do everything we want (and more). In fact, we needn’t bother with making a container at all - we can just work directly with the existing Bundles. But if MDSynthesis does everything, what is there still to do? Firstly, since auxiliaries are something new that I’m trying to add, MDSynthesis doesn’t deal with them yet - we can still add and read auxiliaries from a Sim’s universe directly, but they won’t be stored with the rest of the universe so we’ll have to manual reload them each time. I’ve already added a couple of methods to the auxiliary stuff that’ll allow us to, from a Universe, collect the necessary information from each added auxiliary that would allow us to replicate it. Next, we can add to MDSynthesis to be able to store this and reload our auxiliaries in any subsequent sessions. I’m also working on a function to let us get our Bundle set up more easily - passing in our simulations as trajectory files, Universes or Sims, and providing any auxiliaries or metadata we want to add straight way, and returning the appropriate Bundle. You can go have a look at the WIP on Github! I’ll be back next post with another What I Learned About Python, so look forward to that coming soon! ## July 12, 2016 ### Adhityaa Chandrasekar (coala) #### GSoC '16: Weeks 5-7 updates This week was admittedly slow. I have been a bit pre-occupied with a couple of other things, including EuroPython preparation. 
But a huge chunk is already completed, so I guess that gives me a little cushion :)

The coala-quickstart pull request is still under review - it is an incredibly big thing with over 1000 lines of code spread out over ~16 commits, so it's understandable. Big props to Lasse for taking the time to patiently review it.

Two (major) developments in this period:

• coala-utils - this is a utility package that is designed to contain various small helper modules that can be used globally by coala (and everyone else!). It is currently just a collection of tools sourced from various sections of the coala-analyzer, but it has so much scope for expansion. The idea behind this partitioning is that there are other coala projects under construction by other (awesome) GSoCers, and code redundancy is an evil. Another side benefit is the removal of possible cyclic reference issues between these tools and coalib.

There was a challenge I was not able to overcome: integration of prompt-toolkit with pytest. After much effort (and believe me, I spent a full day looking at 3-4 lines of code), I gave up. The gist of the issue is that pytest attempts some magic with imports. I have a suspicion this is for coverage, but I may be wrong here. Anyway, this messes with prompt-toolkit's imports and the whole thing breaks. So I had to fall back to the rudimentary input() method for the implementation of ask_question. While prompt-toolkit is really fancy and has a ton of amazing features, tests are more important. So we chose pytest, but I really hope I can find the time to change the module to use prompt-toolkit in the future.

Anyway, that was that.

Adhityaa

:wq

### SanketDG (coala)

#### textwrap's my man!

So I just discovered textwrap! Turns out, it's the perfect thing I need for formatting my documentation under a certain line limit. Remember how I talked about supporting two different documentation styles in my previous post?
Well both kinds of formatting needs to have a upper line length limit. So a typical documentation comment consists of a lot of descriptions. These can be of one line or extend to paragraphs. But it’s important that the length of these descriptions be kept in check. So what we need to do, is to somehow “wrap” a paragraph of text to n lines, where n could have a default value or could be specified by the user. Turns out you can do that to paragraphs using the textwrap module. I have recorded it below: This really simplifies my work for the DocumentationFormatBear. ### Karan_Saxena (italian mars society) #### Updates Updates Updates!! This will be a short update post. Verbose post with instructions coming soon. 1) PyTango is working :D 2) I am able to create and read the device over Jive. 3) Point Cloud Library Stable for VS2013 has been released. Read more here. 4) JSON dumps of skeleton points is working. Yay!! All that is left for me is to combine these things now. More updates soon. Onwards and upwards!! ### Levi John Wolf (PySAL) #### Using Soft Dependencies Effectively I wrote a little guide for my fellow library devs about how to use soft dependencies effectively. This moves in tandem with some work on enabling PySAL to have a consistent user experience while leveraging the newest & best libraries for scientific computation. Since most of the work I’ve been doing involves using soft dependencies with the library, it’s important to get this right. ## July 11, 2016 ### Adhityaa Chandrasekar (coala) #### EuroPython! I'll be going to EuroPython 2016 this week! I can't even begin to describe how ecstatic I am! It's going to be held in Bilbao, Spain - a truly beautiful city, I've heard. I'm also delighted to say I've been sponsored full accommodation by the wonderful folks at EuroPython. I'll be holding a workshop on Git and other things along with Tushar a fellow GSoCer at coala. 
If you'll be around, do attend :) As a GSoCer, I'll also get to talk about my project in something called a Lightning Talk, where each person is given exactly 5 minutes on stage to show something. They go on for about an hour and are really interesting because of the sheer number of different things one gets to learn about. And for the rest of the weekdays, there are talks/demos/workshops by other amazing people. There are so many interesting things happening at once, I can't decide which one to be at! I've kind of worked out a schedule, but so many amazing talks will be missed :( I also look forward to meeting the coala developers in person - I've only talked to them over Gitter and I'm really excited to meet them in person (plus Lasse is bringing chocolates!). To finish off the conference, we have two whole days of sprints - two whole days of pure development. It'll be really interesting to have all the coala developers in one room, making something awesome! This will be my first time in Europe and, boy am I excited about it! Adhityaa :wq ### Pulkit Goyal (Mercurial) #### Print Function One major change that was introduced in Python 3 is that print is no more a statement, instead its a function now. 
Yes, writing print foo will yield you a syntax error in Python 3 saying SyntaxError: Missing parentheses in call to 'print' #### GSoC week 7 roundup @cfelton wrote: Student week7 summary (last blog, commits, PR): jpegenc: health 88%, coverage 95% @mkatsimpris: 03-Jul, >5, Y @Vikram9866: 25-Jun, >5, Y riscv: health 96%, coverage 91% @meetsha1995: 24-Jun, 0, N hdmi: health 94%, coverage 90% @srivatsan: 02-Jul, 0, N gemac: health 87%, coverage 89% @ravijain056, 04-Jul, >5, Y pyleros: health missing, 70% @formulator, 26-Jun, 0, N Links to the student blogs and repositories: Merkourious, @mkatsimpris: gsoc blog, github repo Vikram, @Vikram9866: gsoc blog, github repo Meet, @meetshah1995, gsoc blog: github repo Srivatsan, @srivatsan-ramesh: gsoc blog, github repo Ravi @ravijain056: gsoc blog, github repo Pranjal, @forumulator: gsoc blog, github repo Posts: 1 Participants: 1 Read full topic ### udiboy1209 (kivy) #### An Update Long Overdue It has been a little over a month since I last posted a blog about my updates on the work I’ve been doing. I admit I’ve been a bit lazy these past few weeks and pushed a few deadlines ahead, but I’ve caught up with my planned time line now after a few night-outs :D. Stuff not working and dealing with futile attempts at optimization are to blame for my temporary loss of enthusiasm. I’ve also been reading A Song of Ice and Fire which is the book series Game of Thrones is based on, and those books are difficult to put down once you’ve started. I was going to work on a TMX parser for directly loading maps from a TMX file. Implementing a parser was easy because I had planned on using an existing python library. I ended up using python-tmx which is a pretty lightweight and straightforward library with the minimal set of features our module requires unlike the feature rich and heavy PyTMX which I had mentioned last time. 
## What the parser does

All python-tmx does is parse the tmx file’s xml and load the data into python objects arranged hierarchically according to the xml nesting. So my job just involved reading data from those python objects and initializing all the referenced components. The code is so simple to understand that I was able to submit a patch to this library for animation frames support.

First we need to load the tilesets as textures in KivEnt and register them in the texture_manager. Kivy, and even KivEnt, use image atlases for loading multiple textures from a single source image, which is faster because we load and process one large image file instead of multiple smaller image files. The atlas file contains the texture name and its position and size in the source image, so it's pretty obvious how to extract the texture from the image. In graphics rendering terms you only need to modify the mesh matrix without the overhead of loading an image. More details can be found in this Gamasutra article.

## The Tiled way of assigning images to tiles

Tiled uses the fact that each tile is of the same size to its advantage in storing atlas-like information in a concise manner. Each tile in a tileset has an index assigned to it, starting with the top-left tile as 0 and incrementing as we move right, row by row. So a 3x3 tileset would have tile indices like so:

    0 1 2
    3 4 5
    6 7 8

Each tileset has a unique id called the firstgid or first global id. The intention is that each tile of each tileset should have a unique id or gid, which is the firstgid + index for each tile. So if we have 3 tilesets of 3x3, 2x2 and 2x3 tiles respectively and the firstgid of the first is 1, then the gids of the tiles in that first tileset are 1, 2, ..., 9. Hence, to maintain uniqueness, the firstgid of the second tileset is 10, and similarly that of the third is 14. Now all you need to store is the firstgid for each tileset and you know the gid of each tile.

The next thing we have to do is to arrange these tiles on the grid to make our desired map.
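The gid scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from python-tmx or KivEnt; `assign_firstgids` and `locate_tile` are hypothetical helper names.

```python
from bisect import bisect_right


def assign_firstgids(tile_counts, start=1):
    """Give each tileset a firstgid so that all gids are unique."""
    firstgids = []
    next_gid = start
    for count in tile_counts:
        firstgids.append(next_gid)
        next_gid += count  # the next tileset starts after this one ends
    return firstgids


def locate_tile(gid, firstgids):
    """Return (tileset number, index inside that tileset) for a gid."""
    # The owning tileset is the last one whose firstgid is <= gid.
    tileset = bisect_right(firstgids, gid) - 1
    return tileset, gid - firstgids[tileset]


# Three tilesets of 3x3, 2x2 and 2x3 tiles, as in the example above.
firstgids = assign_firstgids([9, 4, 6])
print(firstgids)                   # [1, 10, 14]
print(locate_tile(9, firstgids))   # (0, 8): last tile of the first tileset
print(locate_tile(10, firstgids))  # (1, 0): first tile of the second tileset
```

A gid of 0 is not covered by this sketch, since it is reserved for "no tile".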
The way to do it is now extremely simple. Each position on the map (x,y) will have a gid corresponding to the tile we place there, or 0 to denote no tile or empty space (notice how gids are always at least 1 because the firstgid of the first tileset was 1). For multiple layers you just need to have this data for each layer. You can play around with the tilesets and tiles to add properties like vertical or horizontal flipping, animation frames, opacity, etc. by adding a tile property element in the tileset with the index of the tile and the property you want to add.

## Rendering it all on our game canvas

The map grid data for layers is stored as one big sequence of gids. Because we know the size of the map grid, we can easily find the (x,y) coordinates from the index in this sequence, where

    x = index % width
    y = index // width

(the second line is an integer division). Once we know the (x,y) coordinate, we also know the actual pixel position on the canvas where this tile should be drawn, because we know the tile size. We also know what texture to draw because of the gid. I have fixed the textures and models to be loaded with the name ‘tile_<gid>’.

To pass to the tile_manager we need a three-dimensional array spanning rows x columns, each element being a list of dictionaries with the model and texture name and layer index for that grid location. It is easy to form this array of dicts from the layer data once we have fixed the naming convention for textures and models. This also lets us form a list of all the gids we will use, so that we load only those.

Once we have this list of gids we have to load, we will create an atlas dict for each tileset which has the texture name in our defined format and the position of that texture in the image calculated from the index of the tile. We pass this atlas dict to the texture_manager, which then handles the rest of the job. Once we have the textures loaded we load the models using the same naming convention.
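The index arithmetic from the steps above can be sketched as follows. The dict layout here is hypothetical and is not KivEnt's actual atlas format; the sketch also assumes that tile positions are measured from the top-left corner of the tileset image, with no margin or spacing between tiles.

```python
def grid_position(index, map_width):
    """Convert an index in the flat gid sequence to (x, y) on the map grid."""
    return index % map_width, index // map_width


def tile_region(local_index, columns, tile_width, tile_height):
    """Pixel region (x, y, width, height) of a tile inside its tileset image."""
    col = local_index % columns
    row = local_index // columns
    return (col * tile_width, row * tile_height, tile_width, tile_height)


def build_atlas(firstgid, tile_count, columns, tile_width, tile_height):
    """Map 'tile_<gid>' names to pixel regions (hypothetical layout)."""
    return {
        "tile_%d" % (firstgid + i): tile_region(i, columns, tile_width, tile_height)
        for i in range(tile_count)
    }


# A 2x2 tileset of 32x32 pixel tiles whose firstgid is 10.
atlas = build_atlas(firstgid=10, tile_count=4, columns=2,
                    tile_width=32, tile_height=32)
print(atlas["tile_13"])     # (32, 32, 32, 32)
print(grid_position(7, 3))  # (1, 2)
```

If the texture loader counts y from the bottom of the image instead, the row would have to be flipped before computing the pixel offset.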
And then we load the tile_map using that array of dicts we generated before. This order is necessary because map_manager loads data for textures and models using the registered names, and for that the names need to be registered with the texture and model managers first.

The next task is to initialize each tile as an entity so that it gets rendered on the screen. Each layer requires a different renderer if we need to control the z-index of the tiles. And for animated tiles each renderer requires its own animation system. I made a util function for easily initializing multiple renderers and animation systems at once and adding them to the gameworld. And I tweaked the init_entity_from_map function from before to support layers and use the correct renderer names.

The result is pretty amazing :D

## Futile optimizations

I tried researching the use of a single renderer with z-index support for rendering layers without the need for multiple systems. But there are a lot of problems which would arise when batching. It is better performance-wise to let Kivy handle z-index in terms of the order of drawing on the canvas.

Currently the util functions are all pure python functions because the api to init entities and register textures and models is python-based. A cython api doesn’t exist and hence we have the python overhead of creating dicts to pass as init args. Kovak says there are some more tweaks required in the base of KivEnt to correctly implement a cython api, and that he has been considering it for months now. So for now, I have to stick to using the python api.

## What next

Next is implementing hexagonal and isometric tiles support, which is essentially just figuring out the pixel position of the tiles on the canvas differently. I think the toughest part will be to create a hexagonal and isometric map but you never know :P.

## July 10, 2016

### Shridhar Mishra (italian mars society)

#### Update! on 10 July-Tango installation procedure for windows 10.
Tango installation on Windows can be a bit confusing since the documentation on the website is old and there are a few changes with the new versions of MySQL. Here's the installation procedure that can be helpful.

1 : Installation of MySQL

In this installation process MySQL 5.7.11 is installed. Newer versions can also be installed. During installation we cannot select a manual destination folder, and MySQL is installed in C:\Program Files (x86)\MySQL by default. During installation it is mandatory to set a password of at least 4 characters for root, which wasn't the case for previous versions. This is against the recommendation from tango-controls, which was specific to older versions of MySQL.

2 : Installation of TANGO

You can download a ready to use binary distribution on this site: http://www.tango-controls.org/downloads/binary/

Execute the installer. You should specify the destination folder created before: 'c:\tango'. After installation you can edit the MySQL password.

3 : Configuration

• 3-1 TANGO-HOST: Define a TANGO_HOST environment variable containing both the name of the host on which you plan to run the TANGO database device server (e.g. myhost) and its associated port (e.g. 20000). With the given example, TANGO_HOST should contain myhost:20000. On Windows you can do it simply by editing the properties of your system. Select 'Advanced', 'Environment variables'.
• 3-2 Create environment variables: 2 new environment variables have to be created to run create_db.bat.
• 3-2-1 MYSQL_USER: this should be root.
• 3-2-2 MYSQL_PASSWORD: fill in the password which was used during the MySQL installation.
• 3-3 MODIFY PATH: Add this to the Windows path for running SQL queries: C:\Program Files (x86)\MySQL\MySQL Server 5.7\bin
• 3-4 Create the TANGO database tables: Be sure the MySQL server is running, normally it should be. Execute %TANGO_ROOT%\share\tango\db\create_db.bat.
• 3-5 Start the TANGO database: execute %TANGO_ROOT%\bin\start-db.bat -v4. The console should show "Ready to accept request" on successful installation.
• 3-6 Start JIVE: Now you can test TANGO with the JIVE tool, from the Tango menu, or by typing the following command in a DOS window: %TANGO_ROOT%\bin\start-jive.bat

Ref: http://www.tango-controls.org/resources/howto/how-install-tango-windows/

### Preetwinder (ScrapingHub)

#### GSoC-3

Hello,

This post continues my updates on my work porting frontera to python2/3 dual support. During the past few days my activity decreased because I took a small break. Before the mid term I had completed my work on porting single process mode. I haven’t yet submitted some of the PRs; I’ll do that when my pending PRs are merged. During the past few days I have begun work on tests for distributed mode components. I will be writing tests for frontera workers, the hbase backend, the kafka message bus, and some other components. I have begun work on strategy and db worker tests, and will submit the code in the next few days. My work is going as expected and at a decent pace.

GSoC-3 was originally published by preetwinder at preetwinder on July 10, 2016.

### Valera Likhosherstov (Statsmodels)

#### GSoC 2016 #3

## Improving the code

During May and June I've been working hard, producing thousands of lines of code, implementing Markov switching state space logic and tests, and making sure that everything works correctly. As of the midterm evaluation I had already implemented the Kim filter, the switching MLE model and Markov switching autoregression, all generally working and passing basic tests. So this was a nice moment to take a break and look closer at the existing code. Since the primary concern of the project is its usability and maintainability after the summer, detailed documentation, comments covering the hard mathematical calculations, and architectural enhancements are even more important than producing another model.
Here are the items completed so far on the way to perfect code.

## Refactoring

Several architectural improvements were made to decompose functionality into logical modules and match Statsmodels state space idioms. The initial architecture of the regime_switching module wasn't anything sophisticated, just something that worked for a start:

As you can see, the KimFilter class aggregated the entire regime switching state space functionality like a bubble of code, which is an obvious candidate for splitting into parts. Another inconvenient thing about KimFilter was its complex state architecture. That is, to perform filtering, you first need to bind some data to the filter, optionally select a way of initializing regime probabilities and the unobserved state, and then call the filter method; only after that are attributes such as filtered_regime_probs filled with useful data. This is inconvenient, because you have to keep track of whether the current state is still relevant by yourself.

This is how regime_switching looks after the completed refactoring iteration:

Responsibilities of different kinds are now divided between an increased number of entities:

• SwitchingRepresentation handles the switching state space model, that is, it aggregates KalmanFilter instances for every regime and stores a regime transition probability matrix. FrozenSwitchingRepresentation is an immutable snapshot of the representation.
• The KimFilter class is related to filtering, but it neither performs the actual filtering nor stores any filtered data, it only controls the process. The first thing is handled by the private _KimFilter class, while the second is handled by KimFilterResults, which is returned from the KimFilter.filter method.
• Smoothing is organized in a mirrored way, as you can see from the diagram: KimSmoother, KimSmootherResults and _KimSmoother classes.

The MLE model wasn't touched by any major changes, except that the private ssm attribute is now a KimSmoother class instance, rather than KimFilter.
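The split between a controlling class, a private worker and a results object can be sketched as follows. This is a deliberately simplified, hypothetical skeleton and not the Statsmodels code: only the three class names come from the text above, while the constructor arguments, the `run` method and the recursion itself are assumptions. The placeholder recursion is a basic discrete regime probability update (predict with the transition matrix, weight by per-regime likelihoods, normalize), not the full Kim filter, which also runs Kalman filter steps per regime.

```python
class KimFilterResults:
    """Stores filtered output only; holds no filtering logic."""

    def __init__(self, filtered_regime_probs):
        self.filtered_regime_probs = filtered_regime_probs


class _KimFilter:
    """Private worker that performs a single filtering pass (placeholder math)."""

    def __init__(self, transition, likelihoods, initial_probs):
        self.transition = transition        # transition[i][j] = P(regime j at t | regime i at t-1)
        self.likelihoods = likelihoods      # likelihoods[t][j] = p(y_t | regime j), assumed given
        self.initial_probs = initial_probs

    def run(self):
        probs = list(self.initial_probs)
        k = len(probs)
        history = []
        for lik in self.likelihoods:
            # Prediction step: propagate regime probabilities one period ahead.
            predicted = [sum(probs[i] * self.transition[i][j] for i in range(k))
                         for j in range(k)]
            # Update step: weight by the observation likelihoods and normalize.
            joint = [predicted[j] * lik[j] for j in range(k)]
            total = sum(joint)
            probs = [value / total for value in joint]
            history.append(probs)
        return history


class KimFilter:
    """Public controller: delegates the work and returns a results object."""

    def __init__(self, transition):
        self.transition = transition

    def filter(self, likelihoods, initial_probs):
        worker = _KimFilter(self.transition, likelihoods, initial_probs)
        return KimFilterResults(worker.run())


kim_filter = KimFilter([[0.9, 0.1], [0.2, 0.8]])
results = kim_filter.filter(likelihoods=[[0.8, 0.2]], initial_probs=[0.5, 0.5])
```

The point of the design is visible in the last two lines: the controller keeps no filtered data of its own, so there is no stale state to look after, and everything a caller needs lives on the returned results object.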
## Docstrings

An iteration of documenting was also done. It touched all main entities and the testing code. This process also had some educational advantages for me personally, because I often find it hard to express my thoughts and ideas to other people (e.g. my classmates) when it comes to very abstract things like coding or math. So this was nice practice. Moreover, documenting helped me to make the code clearer and more concise; sometimes it even helped me to find bugs.

## Comments

When it comes to an optimal implementation of mathematical algorithms with a lot of matrix manipulations, code becomes quite unreadable. This is where inline comments help a lot. I tried to comment almost every logical block inside every method; the densest comments are in the _KimFilter and _KimSmoother classes, which do all the hard computational work.

## What's next?

I will continue to enhance the written code. There is some interface functionality to be added and covered by smoke tests. Only after that will I switch back to model implementation (MS-DFM and MS-TVP).

### mkatsimpris (MyHDL)

#### Week 7

The PR with the zig-zag module is waiting to be reviewed and merged by Christopher. However, a new branch has been created which contains the front-end part of the encoder. This part consists of the color-space conversion, dct-2d and zig-zag modules for 8x8 blocks. From a short discussion with Christopher, we decided that hardware utilization with different configurations of the modules should be

### Upendra Kumar (Core Python)

#### Tkinter with Multiprocessing

Tkinter is mainly based on a single-threaded event model. The mainloop(), callbacks, event handlers and the raising of tkinter exceptions are all handled in a single thread. That is why it happens quite often that a tkinter GUI becomes unresponsive (or is at least considered unresponsive by the user, which is a bad user experience) when event handlers try to do long blocking operations.
In order to avoid this and improve the user experience, it is generally advised that long-running background tasks should be shifted to other threads. But we have two choices for incorporating parallelism in an application using Python:

• Multithreading
• Multiprocessing

But Python’s threading module is notorious because of the GIL (Global Interpreter Lock) in Python. (Google the GIL and you will find out about it.) It prevents multithreaded Python applications from taking full advantage of multiple processors. I used the threading module with my application, with no significant improvement in performance.

However, if we have used basic multithreading like:

    new_thread = threading.Thread(target=some_function)
    new_thread.start()

it is very easy to convert it to the multiprocessing module by replacing threading.Thread with multiprocessing.Process. And if we have used queues to send messages from the secondary threads to the tkinter mainloop() (primary thread), then those queues should be replaced with multiprocessing-safe queues: multiprocessing.Queue().

But it is important to remember that only objects which can be pickled can be passed from process X to process Y. The general rule of thumb I came to know is that nearly all native Python objects like lists, tuples, strings, integers and dictionaries are picklable, but complex classes or objects often can’t be pickled. I found out about it here. We need to ask ourselves these questions to decide which objects can be pickled:

• Can you capture the state of the object by reference (i.e. a function defined in __main__ versus an imported function)? [Then, yes]
• Does a generic __getstate__/__setstate__ rule exist for the given object type? [Then, yes]
• Does it depend on a Frame object (i.e. rely on the GIL and global execution stack)? Iterators are now an exception to this, by “replaying” the iterator on unpickling. [Then, no]
• Does the object instance point to the wrong class path (i.e.
due to being defined in a closure, in C-bindings, or other __init__ path manipulations)? [Then, no]

A better option is to check the docs here (a more definite answer). These objects can be pickled:

• None, True, and False
• integers, long integers, floating point numbers, complex numbers
• normal and Unicode strings
• tuples, lists, sets, and dictionaries containing only picklable objects
• functions defined at the top level of a module
• built-in functions defined at the top level of a module
• classes that are defined at the top level of a module
• instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section The pickle protocol for details).

My next post will be just a code sample on how I used multiprocessing with Tkinter.

## July 09, 2016

### mr-karan (coala)

#### GSoC Week6,7 Updates

## Week 6

The task from Week 6 onwards was to implement UI changes in the coala application. Week 6 was the planning stage, where I also started off my prototype for Syntax Highlighting. I created a small prototype where the syntax highlighting works for the linter bears; I now have to extend this functionality to the diff result and also make the function more generic. It needs some refactoring, which I will be doing this week.

## Week 7

I started off on completing the PostCSS bear, as coala is nearing another release and I think it’ll be neat to have this bear. I am also working on moving the status bar to a common function in coala-utils so that, for future work, if someone needs it, it’s not reimplemented and can be pulled easily from this module.

The future work comprises completing the above PRs and also starting off with the coala-bears website, which I am excited to work on. I will also begin work on integrating Tests in coala-bears-create.
My coding activity took a hit in Week 7 because of university-related work, but now that it’s over, I’ll resume and make up for the slowdown this week, and the coming weeks will see finished PRs for the above tasks.

Happy Coding!

## July 08, 2016

### tsirif (Theano)

#### Multi GPU in Python

This week’s blog post concerns the development of python bindings for the multi-gpu collective operations support of libgpuarray. As described in the previous blog post, I have included support for collective operations in libgpuarray, exposing an API that is agnostic of the gpu computational framework. So far, this supports only NCCL on CUDA devices.

## PyGPU collectives module

So now I am going to describe the API which a user will use by importing the collectives module of the pygpu package.
It is composed of two classes: GpuCommCliqueId and GpuComm. It depends on the gpuarray module for using a GpuContext instance, which describes a GPU process to be executed on a single GPU. For binding with libgpuarray and interfacing with CPython code, Cython is utilized.

GpuCommCliqueId is used to create a unique id in a host to be shared among separate processes, each of which manages a GPU. All GPUs corresponding to these processes are intended to be grouped into a communicating clique. Another framework for interprocess communication must be used in order to communicate the contents of this clique id, such as mpi4py, which binds the MPI C-API in python. To instantiate a GpuCommCliqueId one must provide the GpuContext in which it will be used. By default, a unique id is created using libgpuarray’s API and saved upon creation, but the user may choose to provide a predefined bytearray id to be contained in this structure. As of now, GpuCommCliqueId exposes the Python buffer interface, so it can be used by numpy or other buffer consumers to get zero-copy access to the internal char[GA_COMM_ID_BYTES] array containing the id. This means that an instance of this class can be passed as is to mpi4py for broadcasting to other participating processes.

In order to create a multi-gpu communicator, one must pass a GpuCommCliqueId instance, the number of participating GPUs and a user-defined rank of this process’s GPU in the clique as arguments. Collective operations for this GpuComm’s participating GPUs are methods of this instance:

• Reduce

    def reduce(self, GpuArray src not None, op, GpuArray dest=None, int root=-1):
        """Reduce collective operation for ranks in a communicator world.

        Parameters
        ----------
        src: :ref:GpuArray
            Array to be reduced.
        op: string
            Key indicating operation type.
        dest: :ref:GpuArray, optional
            Array to collect reduce operation result.
        root: int
            Rank in GpuComm which will collect result.

        Notes
        -----
        * root is necessary when invoking from a non-root rank. Root caller
          does not need to provide root argument.
        * Not providing dest argument for a root caller will result in
          creating a new compatible :ref:GpuArray and returning result in it.
        """

The Reduce operation needs a src array to be reduced and a Python string op for the operation to be executed across GPUs. The dest array can be omitted. In this case, if the caller is the root rank (either the root argument is not provided or the root argument is the same as the caller’s GpuComm rank), a result array consistent with the src array will be created and returned.

• AllReduce

    def all_reduce(self, GpuArray src not None, op, GpuArray dest=None):
        """AllReduce collective operation for ranks in a communicator world.

        Parameters
        ----------
        src: :ref:GpuArray
            Array to be reduced.
        op: string
            Key indicating operation type.
        dest: :ref:GpuArray, optional
            Array to collect reduce operation result.

        Notes
        -----
        * Not providing dest argument for a root caller will result in
          creating a new compatible :ref:GpuArray and returning result in it.
        """

The AllReduce operation needs a src array to be reduced and a Python string op for the operation to be executed across GPUs, as in the Reduce operation. If the dest array is omitted, a src-like result array will be created and returned.

• ReduceScatter

    def reduce_scatter(self, GpuArray src not None, op, GpuArray dest=None):
        """ReduceScatter collective operation for ranks in a communicator world.

        Parameters
        ----------
        src: :ref:GpuArray
            Array to be reduced.
        op: string
            Key indicating operation type.
        dest: :ref:GpuArray, optional
            Array to collect reduce operation scattered result.

        Notes
        -----
        * Not providing dest argument for a root caller will result in
          creating a new compatible :ref:GpuArray and returning result in it.
        """

The ReduceScatter operation needs a src array to be reduced and a Python string op for the operation to be executed across GPUs, as in the Reduce operation. If the dest array is omitted, then a proper dest array will be created and returned. The result array will be shortened in comparison to src in a single dimension (according to C/F contiguity and clique size), and if that dimension has size equal to 1, then it will be omitted.

• Broadcast

    def broadcast(self, GpuArray array not None, int root=-1):
        """Broadcast collective operation for ranks in a communicator world.

        Parameters
        ----------
        array: :ref:GpuArray
            Array to be broadcast.
        root: int
            Rank in GpuComm which broadcasts its array.

        Notes
        -----
        * root is necessary when invoking from a non-root rank. Root caller
          does not need to provide root argument.
        """

As usual, the user must provide the array to be broadcast across all GPUs in the clique.

• AllGather

    def all_gather(self, GpuArray src not None, GpuArray dest=None, unsigned int nd_up=1):
        """AllGather collective operation for ranks in a communicator world.

        Parameters
        ----------
        src: :ref:GpuArray
            Array to be gathered.
        dest: :ref:GpuArray, optional
            Array to receive all gathered arrays from ranks in GpuComm.
        nd_up: unsigned int
            Used when creating result array. Indicates how many extra
            dimensions user wants result to have. Default is 1, which means
            that the result will store each rank's gathered array in one
            extra new dimension.

        Notes
        -----
        * Providing nd_up == 0 means that gathered arrays will be appended to
          the dimension with the largest stride.
        """

The AllGather operation needs a src array to be collected by all GPUs in the clique. The dest array can be omitted, but then a result array will be created and returned according to the clique size and src array contiguity. The returned array in this case will have as many dimensions as the src array plus the nd_up argument provided. The extra dimension will be used to contain information from each GPU. In case nd_up is equal to 0, all arrays will be stored in rank sequence across the dimension with the largest stride (which depends on C/F contiguity). nd_up is by default equal to 1.
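To make the semantics of the five collectives concrete, here is a small CPU-only simulation in plain Python. It is not pygpu code and calls nothing from libgpuarray: each "rank" is just a 1-D list, the function names (`sim_reduce` and friends) are invented for this sketch, and the operation keys `'sum'`, `'prod'`, `'max'` and `'min'` are assumed spellings for illustration. Only `nd_up` values 0 and 1 are modelled.

```python
from functools import reduce as fold
import operator

# Assumed operation keys; the real keys accepted by `op` may be spelled differently.
OPS = {'sum': operator.add, 'prod': operator.mul, 'max': max, 'min': min}


def elementwise(op, arrays):
    """Combine equally sized 1-D lists element by element."""
    return [fold(OPS[op], column) for column in zip(*arrays)]


def sim_reduce(srcs, op, root):
    """Only the root rank receives the combined array; others get None."""
    result = elementwise(op, srcs)
    return [result if rank == root else None for rank in range(len(srcs))]


def sim_all_reduce(srcs, op):
    """Every rank receives its own copy of the combined array."""
    result = elementwise(op, srcs)
    return [list(result) for _ in srcs]


def sim_reduce_scatter(srcs, op):
    """The combined array is split evenly; rank r keeps the r-th chunk."""
    result = elementwise(op, srcs)
    chunk = len(result) // len(srcs)
    return [result[r * chunk:(r + 1) * chunk] for r in range(len(srcs))]


def sim_broadcast(srcs, root):
    """Every rank ends up with a copy of the root's array."""
    return [list(srcs[root]) for _ in srcs]


def sim_all_gather(srcs, nd_up=1):
    """nd_up=1 stacks arrays in a new dimension; nd_up=0 appends them in rank order."""
    if nd_up == 0:
        gathered = [value for src in srcs for value in src]
    else:
        gathered = [list(src) for src in srcs]
    return [list(gathered) for _ in srcs]


# Two simulated ranks, each holding one source array.
srcs = [[1, 2, 3, 4], [10, 20, 30, 40]]
reduced = sim_reduce(srcs, 'sum', root=0)
scattered = sim_reduce_scatter(srcs, 'sum')
flat = sim_all_gather(srcs, nd_up=0)
stacked = sim_all_gather(srcs, nd_up=1)
```

In this toy run the ReduceScatter result on each rank is half the length of src, which mirrors the "shortened in a single dimension" behaviour described above, while AllGather with nd_up=1 gains one extra dimension indexed by rank.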
Right now, I am testing the code in question and I expect it to be checked and merged soon.

Till then, keep on coding
Tsirif

### Ramana.S (Theano)

#### Third Fortnight blog post

Hello there,

The PR of the new optimizer is about to be merged, now that all the cleanup tasks are done. Also, there is some progress on the CleanUp PR that I started last fortnight. In the CleanUp PR, the op_lifter has been tested with TopoOptimizer. The op_lifter seemed to work well with the TopoOptimizer, paving the way for a possible implementation of the backward pass.

A quick summary of the work done over the last fortnight:

1) On the new_graph2gpu PR

• I did some cleanups and addressed all the comments regarding cleanups, refactoring and optimizations.
• Pascal helped in fixing the TestDnnConv2d test by figuring out that the get_scalar_constant_value method does not handle a SharedVariable of dimension (1, 1) that is broadcastable.
• I fixed the failing Gpucumsum Op's test by replacing the flatten() operation with a corresponding call to GpuReshape.
• Made a few changes to fix local_gpua_eye (handled that optimization similarly to local_gpuaalloc) and its test.
• Applied the interface changes needed after the merge of the dilation PR, by making the test cases inside theano/gpuarray/dnn test with the new filter dilation parameter.
• Line profiled the more time-consuming local_gpua_careduce. I initially thought a call to as_gpuarray_variable caused this, until Fred pointed out the actual reason, which is a call to gpuarray.GpuKernel. I am currently trying to find a fix for that.

2) On the CleanUp PR

• Replaced the calls to HostFromGpu with transfer.
• Added a register_topo decorator to op_lifter. Created a new LocalGroupDB instance and registered in it all the optimizers to which op_lifter is applied. Finally, registered this LocalGroupDB into the gpu_seqopt.
• I had also tried creating a new TopoOptDB, but I had done this implementation wrong.
I had created it similar to LocalGroupDB and that didn't seem to work. I tried a few more ways of implementing it, similar to SequenceDB, and those didn't work out either.
• Reverted local_gpua_subtensor to its previous version (as in the current master), as the new one caused some expected transfers to the GPU not to happen.
• Removed all the separate caching methods, so that caching is integrated with the __props__ of the class.

3) On the Remove_ShapeOpt PR

• I was able to add exceptions at only one place, to completely ignore fgraph's shape_feature. There are a few optimizers which mandatorily need it, which I have commented on in the PR.
• Skipped all the tests that test infer_shape or contain MakeVector, Shape_i, T.second, T.fill and other optimizations done by ShapeFeature.
• The profiling results didn't seem to show a significant improvement in optimization time, as more work needs to be done on this case.

That's it for now!

Cheers,

### TaylorOshan (PySAL)

#### Spatial weights for flow data

This week work on spatial weights for flow data began. So far it has resulted in three new functions. The first two functions refer to two different types of weights, while the third is a helper function. The first type is origin-destination spatial contiguity weights, which denote two flows to be neighbors if their origins or their destinations are spatially contiguous. This type of weight is created by the function ODW(), which takes as input an O x O W object denoting origin spatial contiguity and a D x D W object denoting destination contiguity, and then returns the O*D x O*D origin-destination W object, where O = # of origins and D = # of destinations. The second type, network-based weights, interprets associations among flows to mean that they share nodes in an abstract network representation of the flow system.
To this end, two flows may be neighbors if they share any single node, if they share an origin node, if they share a destination node, or if they share an origin or a destination node. Each scenario produces somewhat different spatial weights capturing slightly different notions of association. This function, netW(), takes as input a list of tuples, (origin, destination), that represent the edges of a network, as well as a parameter to denote which type of nodal association to use, and returns an O*D x O*D W object. Finally, the third function was designed to convert a matrix or array that denotes flows or network edges into a list of tuples to be used as input to the netW() function described above. Still to come is a distance-based weight that uses all four dimensions available when comparing two flows (origin 1, destination 1, origin 2, destination 2).

### Aakash Rajpal (italian mars society)

#### Oculus A Guide Week 2

The next week started with a Skype call from my mentors. I told them my problems and they understood very well. The mentors gave me hope: they told me to try something else, something not in their documentation for the Oculus setup on Blender. They were very understanding and told me that if I wanted to switch to Unity, we could see to that as well.

That call gave me the motivation to go for it one more time, to try again. And well, as I write this post, I have hope. I found a solution online, talked about it with my mentor, and woah, it was agreed upon.

Although I have lost a lot of time in setting up, now I can look forward.

Thanks

#### Oculus A Guide – Week 1

Midterms are over and I passed with a good evaluation from my mentors. Now it was time to move on to the second part of the project, the main part, i.e. the Oculus Rift DK2. I had some trouble setting up the Oculus during the community bonding period and I thought I would resolve those issues during the coding period after mid-terms. But heck, I was wrong.
The issues were mainly related to Oculus support for Linux and Blender. I had to use Linux as the organisation’s environment was set up on it, and Blender as well for the same reason. There were docs provided by my mentor from the Italian Mars Society, but they were insufficient as they were meant for the Rift DK1 and I had a DK2. But that shouldn’t be such a problem, right? Wrong, completely wrong.

I spent days trying to figure out how to set up the Oculus for my project. My mentors were trying to help, but they weren’t sure either, as they hadn’t used the DK2 on Ubuntu with Blender. I kept trying for a week and was stuck, literally stuck. Frankly, I had lost hope and was very disappointed.

### Redridge (coala)

#### Week 6 - 7 - coala-utils

## Recap

In the last post I described a bit the need for bear creation and packaging utilities. Setting all that aside, I worked on the creation of coala-utils, a collection of coala libraries, and here is why: while extending an already existing bear creation script made by a fellow coalanian, I was alerted by my mentor (co-mentor actually) that there was possible code duplication between the script that I was working on and another GSoC project. Basically, both projects were implementing a so-called ask_question method that would prompt the user for input, each using different libs. That is how the idea for coala-utils sprouted.

## coala-utils

coala's main library, coalib, contains a mix of core coala components and libs that don't necessarily require coala. For example, a packaging utility could very much benefit "from some coalib", but all of coala is too much, hence the need for coala-utils. There was a package that contained some decorators for coala, called coala-decorators. Basically, coala-utils is a python package that contains the old coala-decorators package, some libraries extracted from coala's coalib, and another library containing the ask_question method needed by both projects.
That way we can have a unified prompting method that can be used in tools without listing coala as a dependency (listing coala-utils instead). The idea here is to slowly move some coalib files (those that are not necessarily coala related and can be used in other tools) into coala-utils so that we don't need coala as a dependency. As an added bonus, the coala-utils package can be used in other projects completely unrelated to coala, since it is open source and its contents are not tied to coala in any way.

## Wrap up

coala is now updated to work with coala-utils (as you may have guessed, a lot of seding has been done). Now with coala-utils finished I can finally merge my extension of coala-bears-create (basically a remake, because I have to replace all of the old prompts too). Also, for those who are wondering, the cover photo is from my 5 day vacation in Greece.

### Prayash Mohapatra (Tryton)

#### Working on Exports

Mid-terms went well. Having completed the CSV Import feature as per plan, I have moved on to implementing the second half of the project, i.e. the CSV Export feature. I was able to share most of the code with CSV Import and could quickly get the views for exports running. I found that more methods could be shared between the Window.Import and Window.Export classes. I am implementing them directly in sao, and have filed an issue for doing the same in tryton.

This is how the export dialog looks now:

I was finally able to use upload.py to submit the code to codereview. I was stuck due to an uncommitted change. upload.py showed me:

Got error status from ‘git show 016505c960eb7cdf2b2850a5f88fe3b5e79fc189’

I tried searching for this error, but ended up at links to the source code of upload.py itself 😆 Then I tried removing index.html’s uncommitted changes with git checkout -- index.html, and then upload.py worked! I use the tryton-sao.js file instead of the minified file so that live-reload works properly; it enables me to see changes without refreshing the entire page.
From now on I use git stash to remove the changes before calling upload.py.

It was quite exciting getting feedback on my code from mentors. I had missed some obvious things and made the necessary changes.

Another thing that I am trying to do is to switch to vim from sublime text. I want to replicate my existing workflow with vim. I am not quite there yet. I am unable to make custom keybindings work. I will write more about it when I finally get it working.

Happy Coding everyone!

### liscju (Mercurial)

#### Coding Period - V, VI Week

In the last two weeks I was working on making the redirection location the main store for the large files of a repository.

The first try was to make this work using http redirection: the http server sends the client a 303 (See Other) response with a Location header that tells it where the location is. This could be a pretty good solution because nothing needs to be done in the client to make it work; httplib in python handles this pretty well. Unfortunately this is not an acceptable solution, because the client can also connect to the server using the ssh protocol, and ssh doesn't have anything similar that wouldn't require changing the client.

The second (and final) try was to make the server redirect the connection by itself: when an old client (without the redirection feature) wants to get a file from the server, the server opens a connection to the redirection location and, as it downloads the file chunk by chunk, sends it on to the client. This solution limits the space needed in the main repository because it doesn't need to have the files locally. From the other point of view, this doesn't limit the bandwidth needed for handling connections from old clients, but this is the best that can be done right now.

Another problem with making the redirection destination the main store was dealing with the user cache issue. When the client commits, the file is put in the store and in the user cache as well. Before pushing a file to the server, the client checks which files the server has.
When the server and client share the same user cache, the server finds the file in the user cache and reports that to the client. The client, knowing that the server has the given file, doesn't send it. As a result, the server never stores the large file that is part of the changeset the client pushed!

The solution for now is to use the pretxnchangegroup hook. This hook is run after a group of changesets has been brought into the local repository from another, but before the transaction that will make the changes permanent in the repository completes. In this hook we check whether all files in the given changesets exist in the redirection destination; if not, we push them. This way we make sure that all files are pushed to the redirection destination.

Apart from this, I extended the test suite for the redirection module; it needs that before I move on to later and probably harder stuff.

### Leland Bybee (Statsmodels)

#### Progress

There are two primary changes that have been made since the last check-in. The first is that everything has been rewritten so that all the fit methods are contained within their own class, DistributedModel. The name may change, but the structure is much cleaner now. The second major change is that “true” distributed estimation has been added. Previously, we only supported sequential estimation, but parallel estimation is now supported as well through joblib. Dask support is to come. I’m also working on ironing out a couple of examples, which will be updated here when done.

### mike1808 (ScrapingHub)

#### GSOC 2016 #3: Rectangles and Error Handling

The beauty of Splash is that it uses the right tools for the right things. In particular, I am talking about its choice of programming languages. Splash uses the following languages:

• Python - backend; an agile, powerful and elegant language
• Lua - scripts; very powerful and simple
• JavaScript - used in the Web engine

Currently, I am using all of these languages, and in my recent task I’ve faced some interesting challenges.
## getBoundingClientRect() vs getClientRects()[0]

I am working on the HTMLElement class, which allows you to manipulate DOM HTML elements, and I ran into the following problem: I need to somehow get the coordinates of an element. The most straightforward and portable solution (Splash is going to change its Web engine) is to use JavaScript. In JS there are two ways to get the coordinates of an element: Element.getBoundingClientRect() and Element.getClientRects().

The getClientRects() method returns the list of all CSS layout border boxes of the element. This list may contain rectangles whose width and height are zero.

The getBoundingClientRect() method returns the smallest rectangle that includes all of the rectangles in the list returned by getClientRects() whose height or width is not zero.

In most cases the first element of the list returned by getClientRects() will be the same as the result of getBoundingClientRect(). However, there are some situations where they can differ, e.g. for elements whose computed display CSS property is inline. Inline elements, such as <span>s, create a box for each line they take up. Let’s look at Fig. 1.

Fig. 1

In this example I’ve highlighted the boxes for the <a> element “What Every Computer Scientist Should Know About Floating-Point Arithmetic”. The red-bordered boxes are the rectangles returned by getClientRects(), and the rectangle with the blue background is the result of getBoundingClientRect(). As this anchor element has the inline display mode, it creates a box for each line that it takes up. Hence, it has two client rectangles.

Both methods are useful and have their use cases. For example, if you want to get the coordinates of an element in order to click on it, you had better use getClientRects()[0]. If, on the other hand, you want to take a screenshot of the element, it’s better to use getBoundingClientRect().

## Exceptions vs Flags

Another interesting thing that I have to deal with is error handling in Lua.
In the world of Lua programming there are two ways to handle errors: exceptions and boolean flags. A flag is the first return value of the called function, which shows whether it did its work without issues or not.

local ok, result = func()
if not ok then
    -- do something when the operation was unsuccessful
else
    -- we got no problems with func
end

In Lua, exceptions are thrown when the error is caused by the user, no sane result can be produced, and the error could easily have been avoided, for example when the user passes numbers to a function that requires string arguments. On the other hand, if the error is the result of some interaction which cannot be avoided, let’s say the user tries to open a file which doesn’t exist, it can be returned as a boolean flag.

In my particular case, I’m throwing an exception if the user tries to get an HTML element with an invalid CSS selector, and returning a false flag if, let’s say, the user tries to click on an element which has been removed from the DOM.

The official Lua guide has an excellent article about error handling.

## My work

I’m almost finished with the HTMLElement class; there are small bugs to fix, and documentation and tests still to be written. Next week I’m going to create HTMLElements, which is intended for manipulating collections of HTMLElement objects.

### jbm950 (PyDy)

#### GSoC Week 7

This week the focus was on the support code for the Featherstone method. I finished adding examples to the docstrings of every function I made. I then wrote test code for all of the new functions, primarily focusing on expected outputs but also including some expected error messages for one of the functions. Lastly, I coded up the functions themselves. This work can be seen in PR #11331.

I continued following the PR I reviewed last week and gave suggestions as its author worked on it. The PR is now, in my opinion, ready to be merged and is a beneficial addition to the sympy codebase.
The last thing I did this week was have a meeting about the presentation that I will be helping with on Monday. After the meeting I spent some time looking over my portions of the presentation and making sure I am prepared to speak.

### Future Directions

Next week is the SciPy conference, where I will be helping with PyDy’s tutorial. I will also be meeting with my mentor in person, and during our time there I suspect we will work on a variety of things, from Featherstone’s articulated body method to the base class.

### PR’s and Issues

• (Open) Added docstrings to ast.py PR #11333
• (Open) [WIP] Featherstones EOM support PR #11331

## July 07, 2016

### Avishkar Gupta (ScrapingHub)

#### Benchmarking and Unit Testing

Hi, sorry this post comes a couple of days later than warranted, but there was some work that needed taking care of. Since the end of the midterm evaluations, efforts have been concentrated on issues such as compatibility with the older API of scrapy, unit testing, benchmarking the signal performance with the new API, and looking for places where further optimizations are required.

One such area I identified was that the use_caching functionality of the dispatch lib was always going unused because of the presence of NoneType objects when a receiver wants to be triggered on signals sent by any sender. I’m looking into how we can make that usable, since with caching we can make the Signals perform even better: the majority of the time taken by the send_catch_log function is spent in the _liveReceivers method of the Signal class, and that method would be a constant-time lookup if the cache were in use.

Here are some preliminary results of the benchmarks, obtained by running the benchmarking spider with --profile as follows:

scrapy bench --profile -o new_profile.cprofile

Here’s the visualizations of the output obtained using pyprof2calltree:

The old API:

As we can see, the difference between the two APIs in terms of performance is clear: the older API takes significantly more time and many more cycles than the new API, and the helper methods of the old dispatcher also take their fair share of time (results not in picture). Such is not the case anymore. Next, I seek to formalize these results in the form of a benchmarking suite and look for ways to improve the performance even further. Until next time.

### Shridhar Mishra (italian mars society)

#### mid-term update!

Work done:

• Tango now publishes skeleton coordinates directly instead of saving them to JSON.
• Tango devices created to publish the data.

Work in progress:

• Ways to convert a list to any of the Tango-compatible data types.
• Perfecting Tango communication.

Work to do:

• Running Unity on an Ubuntu machine.
• Plotting the points in the Unity game engine.

### Riddhish Bhalodia (dipy)

#### Tying up loose ends!

This was a wrapping-up week. I worked on both the Adaptive Denoising PR and the Local PCA Denoising PR to finish the little work that was left and bring them really close to merging. Along with this, I have been reading about brain extraction and affine registration for the next phase of my GSOC.

In this short blog post I will describe both the adaptive denoising PR and the local PCA PR, and provide the tutorials (which are also going into the DIPY repository) here.

Note: Due to problems in nlmeans usage, we have reverted to the older voxelwise implementation. We have, however, deprecated that function and introduced a new non_local_means function, which performs the blockwise averaging method and gives better results.

## Adaptive Denoising Tutorial

### A] The new non_local_means function

Using the non-local means filter [Coupe08] and [Coupe11], you can denoise 3D or 4D images and boost the SNR of your datasets. You can also decide between modeling the noise as Gaussian or Rician (default).

In order to call non_local_means, you first need to estimate the standard deviation of the noise. We use N=4 since the Sherbrooke dataset was acquired on a 1.5T Siemens scanner with a 4 array head coil.

Calling the main function non_local_means

We show the axial slice and how it has been denoised

[Coupe08] P. Coupe, P. Yger, S. Prima, P. Hellier, C. Kervrann, C. Barillot,
“An Optimized Blockwise Non Local Means Denoising Filter for 3D Magnetic
Resonance Images”, IEEE Transactions on Medical Imaging, 27(4):425-441, 2008.

[Coupe11] Pierrick Coupe, Jose Manjon, Montserrat Robles, Louis Collins.
“Adaptive Multiresolution Non-Local Means Filter for 3D MR Image Denoising”
IET Image Processing, Institution of Engineering and Technology, 2011

### B] Adaptive soft coefficient matching

This is for the function ascm

Using the non-local means based adaptive denoising [Coupe11], you can denoise 3D or
4D images and boost the SNR of your datasets.

Choose one of the datasets from DIPY

The ascm function takes two denoised inputs, one smoother than the other. To generate these inputs we will use non_local_means denoising. In order to call non_local_means, you first need to estimate the standard deviation of the noise. We use N=4 since the Sherbrooke dataset was acquired on a 1.5T Siemens scanner with a 4 array head coil.

Non-local means with a smaller patch size which implies less smoothing, more sharpness

Non-local means with larger patch size which implies more smoothing, less sharpness

Now we perform the adaptive soft coefficient matching. Empirically we set the parameter h in ascm to the average of the local noise variance, which in this case is sigma itself.

Plot the axial slice of the data, its denoised output and the residual

[Coupe11] Pierrick Coupe, Jose Manjon, Montserrat Robles, Louis Collins.
“Adaptive Multiresolution Non-Local Means Filter for 3D MR Image Denoising”
IET Image Processing, Institution of Engineering and Technology, 2011

## Local PCA Tutorial

Using the local PCA based denoising for diffusion images [Manjon2013] we can obtain state-of-the-art results. The advantage of local PCA over the other denoising methods is that it takes into account the directional information of the diffusion data as well.

Let’s load the necessary modules

Load one of the datasets, it has 21 gradients and 1 b0 image

We use a special noise estimation method to get the sigma to be used in the local PCA algorithm. This method is also proposed in [Manjon2013]. It takes both the data and the gradient table object as inputs and returns an estimate of the local noise standard deviation as a 3D array

Perform the localPCA using the function localpca.

The localpca algorithm [Manjon2013] takes into account the directional
information in the diffusion MR data. It performs PCA on a local 4D patch and
then thresholds it using the local variance estimate from the noise estimation
function; performing PCA reconstruction on the result then gives us the denoised
estimate.

We have a fast implementation localpca and a slower one (which has less
memory consumption) localpca_slow

Let us plot the axial slice of the original and denoised data.
We visualize all the slices (22 in total)

And the denoised output of the fast algorithm

Following are the images for the input and output of the local PCA function.

[Manjon2013] Manjon JV, Coupe P, Concha L, Buades A, Collins DL
“Diffusion Weighted Image Denoising Using Overcomplete Local PCA” 2013

Few more things and we can tie up this PR real nicely

## Brain Extraction… First Steps

Well as mentioned in the initial GSOC proposal I also have to improve the brain extraction technique for DIPY. Here is the idea which we have formulated

• Have template (labelled) data
• Then we align the input data (for which we have to extract the brain) to the template data using affine or non-linear registration available
• Transform the label map of the template data
• Use the transformed label map and the median_otsu to improve the input image brain extraction

In the coming blogposts I will focus more on the brain extraction part of the things and add finishing touches to the local PCA PR.

The template image (one slice) is shown as follows

Thank You!

### Ranveer Aggarwal (dipy)

#### A Circular Slider: Let's Make Things Futuristic!

A lot of work was done this week, and hence the second blog post. A new UI element was introduced and I wanted to have a separate post with its details. So a circular slider it is, kind of like a volume rocker. How did I make one? Well, here goes.

### Implementation

#### The Idea

So I wished to have some kind of an orbit on which the disk that denotes the value of the slider moves around. Why would anyone use anything other than a circle (more like a ring)? So I used a disk. The average of the outer and inner radius of the disk defines a virtual circle on which the value disk moves.

#### Value Disk Positioning

Now, once a user clicks the value disk, it’s highly improbable (and also nearly impossible) for the user to follow the exact circle while sliding the value disk. Therefore, the previous method of capturing mouse coordinates and setting the center of the value disk to the position of the mouse coordinates while constraining the movement on a line won’t work here. We need to constrain the movement on a circle.
It’s time for some math.
Mathematically speaking, we have a circle, centered at (x1, y1). We have a click position, say (x2, y2). Now, the new position of the center of the value slider should lie on the line joining this point to the center of the circle and the circle itself. This is a simple case of a line-circle intersection!
I wrote down the equation of the circle and the line, used my algebra skills to get the following formula for finding the x coordinate of the intersections (two, since it’s a circle and the line passes through the center; the one nearer to the mouse coordinate makes more sense):

x = x1 +/- r(x2-x1)/(sqrt((x2-x1)^2 + (y2-y1)^2))


Where r is the radius of the constraint circle.

I plugged these values of x into the line equation to get the corresponding y values. The disk position is now this.
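The constraint described above can be sketched in a few lines of Python. This is only an illustration, not the code from the DIPY PR: the function name `snap_to_circle` and the fallback for a click exactly on the center are assumptions. Instead of plugging x back into the line equation, it scales the direction vector for both coordinates, which gives the same point and also handles vertical lines.

```python
import math


def snap_to_circle(center, radius, click):
    """Return the point on the constraint circle nearest to the click.

    Hypothetical helper: center and click are (x, y) tuples.
    """
    x1, y1 = center
    x2, y2 = click
    dx, dy = x2 - x1, y2 - y1
    dist = math.hypot(dx, dy)  # sqrt((x2-x1)^2 + (y2-y1)^2)
    if dist == 0:
        # A click exactly on the center has no direction; falling back to
        # angle zero is an arbitrary choice made for this sketch.
        return (x1 + radius, y1)
    # The '+' branch of the formula is the intersection on the mouse side.
    return (x1 + radius * dx / dist, y1 + radius * dy / dist)


print(snap_to_circle((0, 0), 5, (30, 40)))
```

Picking the '+' sign always selects the intersection on the same side as the mouse, so no explicit distance comparison between the two candidates is needed.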

#### Text (Percentage) Update

For this, I used angles. Same line, same circle as above. This time, I used the angle that the line makes with the line parallel to the X-axis passing through the center of the circle. The percentage completion is this angle divided by 360.
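A minimal sketch of the angle-to-percentage step, assuming the angle is measured counter-clockwise from the positive X direction with the Y axis pointing up (the post does not say which direction or origin the real slider uses). The name `slider_percentage` is hypothetical.

```python
import math


def slider_percentage(center, position):
    """Map the value disk's position on the circle to a 0-100 percentage."""
    x1, y1 = center
    x2, y2 = position
    # atan2 returns an angle in (-180, 180] degrees; wrap it into [0, 360).
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 360.0
    return 100.0 * angle / 360.0


print(slider_percentage((0, 0), (0, 1)))
```

Using atan2 rather than a plain arctangent of the slope keeps the quadrant information, so positions on the left half of the circle map to angles above 90 degrees as expected.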

### The result

Here’s how it looks.

The Final Result

Working as expected, this is the crude form of the circular slider we’ll use finally. Some animation and colors would probably make it tend more towards a futuristic slider.

#### The Slider Works!

For the past two weeks I have been stuck with the slider, with a weird bug here and a weird bug there, VTK/Python version issues, OS issues, etc. I installed Ubuntu to find out more about it, and changed to Python2.7.11-VTK6.3.0, like everyone else.
It’s all been fixed, except the one in the last blog post. It seems like a VTK-7 bug or change of API. I have posted a question on Stackoverflow regarding the same, I’ll fix it as soon as I find how to do it.

Another modification to the slider is the text actor. It now is a standalone object, catering specifically to the Slider. Doing so also fixed a bug that I had with VTK-6.

The final modification puts the final nail in the coffin of the line clicking issue (someone had faced it previously too). I did this by constructing a thin rectangle using polygons. As correctly pointed out by my mentors, the issue probably arises because a line is composed of two points and OpenGL events probably work with triangles.

Here’s how it looks now.

The Slider

We’re finally done with the slider and it’s time to move on to the next task.

PS: I passed the mid-term review. It was a team effort ^_^.

### Anish Shah (Core Python)

#### GSoC'16: Week 5 and 6

Thank you to my mentors, I passed my mid-term review :) This blog post is about my work in the last two weeks.

## Docker

During the first week of Community Bonding, I told you all about Docker and how we are using it to set up b.p.o in an Ubuntu Docker container. For the last two weeks, I have been working on adding support for setting up b.p.o on Fedora. I had never used Fedora before, so this was my first experience with it. :) Here are some of the things that I think are important for someone who is using Fedora Docker containers :)

### Basic commands

The Fedora container does not have basic commands like find and which, or the gcc compiler. I had to install these packages using DNF. If I’m correct, the Ubuntu container already has these commands.

### DNF

DNF, or Dandified Yum, is the package manager for Fedora, just like apt-get is for Ubuntu. DNF is really clean compared to apt-get. But I found DNF a bit slow, as I think it looks for packages in a lot of repositories (please correct me if I’m wrong :)).

### Systemd

Systemd doesn’t run inside your container if you do not tell it to. Since we didn’t want to use systemd, it becomes kind of tricky to start the Postgres database service. :) My mentor suggested looking at the systemd scripts and starting postgresql using the commands in that script instead of using systemd. For Postgres, the systemd script is located at /usr/lib/systemd/system/postgresql.service. The content of this script is something like this -


# It's not recommended to modify this file in-place, because it will be
# overwritten during package upgrades.  It is recommended to use systemd
# "dropin" feature;  i.e. create file with suffix .conf under
# /etc/systemd/system/UNITNAME.service.d directory overriding the
# unit's defaults.  Look at systemd.unit(5) manual page for more info.

[Unit]
Description=PostgreSQL database server
After=network.target

[Service]
Type=forking

User=postgres
Group=postgres

# Where to send early-startup messages from the server (before the logging
# options of postgresql.conf take effect)
# This is normally controlled by the global default set by systemd
# StandardOutput=syslog

# Disable OOM kill on the postmaster
# ... but allow it still to be effective for child processes
# (note that these settings are ignored by Postgres releases before 9.5)

# Maximum number of seconds pg_ctl will wait for postgres to start.  Note that
# PGSTARTTIMEOUT should be less than TimeoutSec value.
Environment=PGSTARTTIMEOUT=270

Environment=PGDATA=/var/lib/pgsql/data

ExecStartPre=/usr/libexec/postgresql-check-db-dir %N

# Use convenient postgresql-ctl wrapper instead of directly pg_ctl.  See the
# postgresql-ctl file itself for more info.

ExecStart=/usr/libexec/postgresql-ctl start -D ${PGDATA} -s -w -t ${PGSTARTTIMEOUT}
ExecStop=/usr/libexec/postgresql-ctl stop -D ${PGDATA} -s -m fast
ExecReload=/usr/libexec/postgresql-ctl reload -D ${PGDATA} -s

# Give a reasonable amount of time for the server to start up/shut down.
# Ideally, the timeout for starting PostgreSQL server should be handled more
# nicely by pg_ctl in ExecStart, so keep its timeout smaller than this value.
TimeoutSec=300

[Install]
WantedBy=multi-user.target


ExecStart is assigned the command to start the Postgres service. If we run this command manually, the postgres DB starts without using any systemd command. :)
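To make the "run ExecStart by hand" idea concrete, here is a small Python sketch that pulls the ExecStart line out of a unit file and expands the Environment= variables it refers to. It is an illustration only, not part of the b.p.o Docker setup: the helper name `exec_start_command` is made up, the embedded unit text is a shortened copy of the file above, and real unit files support more syntax (quoting, multiple assignments per Environment= line, prefixes such as "-") than this handles.

```python
from string import Template

# Shortened copy of the relevant lines from postgresql.service.
UNIT_TEXT = """\
[Service]
Environment=PGSTARTTIMEOUT=270
Environment=PGDATA=/var/lib/pgsql/data
ExecStart=/usr/libexec/postgresql-ctl start -D ${PGDATA} -s -w -t ${PGSTARTTIMEOUT}
"""


def exec_start_command(unit_text):
    """Return the ExecStart command with ${VAR} references expanded."""
    env = {}
    exec_start = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as [Service].
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, _, value = line.partition("=")
        if key == "Environment":
            name, _, val = value.partition("=")
            env[name] = val
        elif key == "ExecStart":
            exec_start = value
    if exec_start is None:
        raise ValueError("no ExecStart line found")
    return Template(exec_start).substitute(env)


print(exec_start_command(UNIT_TEXT))
```

Since the unit sets User=postgres, the resulting command would presumably have to be run as the postgres user inside the container.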

You can follow my work here

Thank you for reading this blogpost. Let me know if I was wrong somewhere or you have some other suggestions. See you later. :)

## July 06, 2016

### chrisittner (pgmpy)

#### Examples for basic BN learning,

I’ll soon finish basic score-based structure estimation for BayesianModels. Below is the current state of my PR, with two examples.

## Changes in pgmpy/estimators/

I rearranged the estimator classes to inherit from each other like this:

.                                    MaximumLikelihoodEstimator
                                   /
               ParameterEstimator -- BayesianEstimator
             /
BaseEstimator                        ExhaustiveSearch
            | \                    /
            |   StructureEstimator -- HillClimbSearch
            |                      \
            |                        ConstraintBasedEstimator
            |
            |
            |                BayesianScore
            |              /
            StructureScore -- BicScore


BaseEstimator takes a data set and optionally state_names and a flag for how to handle missing values. ParameterEstimator and its subclasses additionally take a model. All *Search-classes are initialized with a StructureScore-instance (or by default BayesianScore) in addition to the data set.

## Example

Given a data set with 5 or fewer variables, we can search through all BayesianModels and find the best-scoring one using ExhaustiveSearch (currently 5 variables already take a few minutes, but this can be made faster):

import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch

# create random data sample with 3 variables, where B and C are identical:
data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 2)), columns=list('AB'))
data['C'] = data['B']

est = ExhaustiveSearch(data)

best_model = est.estimate()
print(best_model.nodes())
print(best_model.edges())

print('\nall scores:')
for score, model in est.all_scores():
    print(score, model.edges())


The example first prints nodes and edges of the best-fitting model and then the scores for all possible BayesianModels for this data set:

['A','B','C']
[('B','C')]

all scores:
-24243.15030635083 [('A', 'C'), ('A', 'B')]
-24243.149854387288 [('A', 'B'), ('C', 'A')]
-24243.149854387288 [('A', 'C'), ('B', 'A')]
-24211.96205284525 [('A', 'B')]
-24211.96205284525 [('A', 'C')]
-24211.961600881707 [('B', 'A')]
-24211.961600881707 [('C', 'A')]
-24211.961600881707 [('C', 'A'), ('B', 'A')]
-24180.77379933967 []
-16603.134367431743 [('A', 'C'), ('A', 'B'), ('B', 'C')]
-16603.13436743174 [('A', 'C'), ('A', 'B'), ('C', 'B')]
-16603.133915468195 [('A', 'B'), ('C', 'A'), ('C', 'B')]
-16603.133915468195 [('A', 'C'), ('B', 'A'), ('B', 'C')]
-16571.946113926162 [('A', 'C'), ('B', 'C')]
-16571.94611392616 [('A', 'B'), ('C', 'B')]
-16274.052597732147 [('A', 'B'), ('B', 'C')]
-16274.052597732145 [('A', 'C'), ('C', 'B')]
-16274.0521457686 [('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'B'), ('B', 'A')]
-16274.0521457686 [('C', 'A'), ('C', 'B')]
-16274.0521457686 [('C', 'A'), ('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('C', 'B'), ('B', 'A')]
-16242.864344226566 [('B', 'C')]
-16242.864344226564 [('C', 'B')]


There is a big jump in score between those models where B and C influence each other (~-16274) and the rest (~-24211), as expected since they are correlated.
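The listing above has 25 lines because there are exactly 25 directed acyclic graphs on three labeled nodes. The sketch below counts DAGs with Robinson's recurrence to show how quickly the search space of an exhaustive search grows. It is independent of pgmpy, and the function name `count_dags` is chosen here for illustration.

```python
from math import comb


def count_dags(n):
    """Number of DAGs on n labeled nodes, via Robinson's recurrence:

    a(m) = sum_{k=1..m} (-1)^(k+1) * C(m, k) * 2^(k*(m-k)) * a(m-k)
    """
    counts = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]


for n in range(1, 6):
    print(n, count_dags(n))
```

With five variables there are already 29281 candidate structures to score, which is consistent with the exhaustive search taking minutes at that size.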

## Example 2

I tried the same with the Kaggle titanic data set:

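import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch

# Assumption: Kaggle's Titanic training file has been downloaded as "train.csv"
# into the working directory (the original snippet omitted the loading step).
titanic_data = pd.read_csv("train.csv")
titanic_data = titanic_data[["Survived", "Sex", "Pclass"]]

est = ExhaustiveSearch(titanic_data)
for score, model in est.all_scores():
    print(score, model.edges())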


Output:

-2072.9132364404695 []
-2069.071694164769 [('Pclass', 'Sex')]
-2069.0144197068785 [('Sex', 'Pclass')]
-2025.869489762676 [('Survived', 'Pclass')]
-2025.8559302273054 [('Pclass', 'Survived')]
-2022.0279474869753 [('Survived', 'Pclass'), ('Pclass', 'Sex')]
-2022.0143879516047 [('Pclass', 'Survived'), ('Pclass', 'Sex')]
-2021.9571134937144 [('Sex', 'Pclass'), ('Pclass', 'Survived')]
-2017.5258065853768 [('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1941.3075053892835 [('Survived', 'Sex')]
-1941.2720031713893 [('Sex', 'Survived')]
-1937.4304608956886 [('Sex', 'Survived'), ('Pclass', 'Sex')]
-1937.4086886556925 [('Survived', 'Sex'), ('Sex', 'Pclass')]
-1937.3731864377983 [('Sex', 'Survived'), ('Sex', 'Pclass')]
-1934.134485060888 [('Survived', 'Sex'), ('Pclass', 'Sex')]
-1894.2637587114903 [('Survived', 'Sex'), ('Survived', 'Pclass')]
-1894.2501991761196 [('Survived', 'Sex'), ('Pclass', 'Survived')]
-1894.228256493596 [('Survived', 'Pclass'), ('Sex', 'Survived')]
-1891.0630673606006 [('Sex', 'Survived'), ('Pclass', 'Survived')]
-1887.2215250849 [('Sex', 'Survived'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1887.1642506270096 [('Sex', 'Survived'), ('Sex', 'Pclass'), ('Pclass', 'Survived')]
-1887.0907383830947 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Pclass', 'Sex')]
-1887.077178847724 [('Survived', 'Sex'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1885.9200755341908 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1885.8845733162966 [('Survived', 'Pclass'), ('Sex', 'Survived'), ('Sex', 'Pclass')]


Here it didn’t work as I hoped. [('Sex', 'Survived'), ('Pclass', 'Survived')] has the best score among models with two or fewer edges, but every model with three edges scores better. I haven’t had a closer look at the dataset yet, but a weak dependency between Sex and Pclass would explain this.

### Sheikh Araf (coala)

#### [GSoC16] Post mid-term evaluation

Mid-term evaluations are over and I passed. Yay!

Coming to my project, most of the stuff is implemented and the plug-in is almost ready for a beta release. I’m fixing the last few bugs and that should be done in the next few weeks.

The coala Eclipse plug-in will be released with coala 0.8.

Here is a video demo of the plug-in in action:

### Abhay Raizada (coala)

#### tabs spaces tabs

Last time I was able to come up with an algorithm to indent Python code, or basically code without un-indent specifiers. This time the challenge was tackling hanging indentation.

Now hanging indents in terms of word processing occur when all the lines except the first line are indented. In code terms it is something like this:

some_function(
    param1,
    param2,
    param3)

Here param1, param2 and param3 are indented while some_function is not.

Again, the algorithm I use to do hanging indentation is pretty straightforward. In simple terms it is:

• Check if there is text to the right of the parenthesis.
• If there is, indent all lines up to the closing parenthesis to the column right after the opening parenthesis.
• Otherwise, calculate the indentation relative to some_function and indent all later params to that level.
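The three steps above can be sketched as a function that returns the column at which the continuation lines should start. This is a simplified illustration and not the IndentationBear code: the function names are hypothetical, tabs are assumed to be already expanded to spaces, and parentheses inside strings or comments are not treated specially.

```python
def last_unclosed_paren(line):
    """Return the column of the last '(' that is not closed on this line."""
    open_columns = []
    for column, char in enumerate(line):
        if char == "(":
            open_columns.append(column)
        elif char == ")" and open_columns:
            open_columns.pop()
    return open_columns[-1] if open_columns else None


def hanging_indent_column(opening_line, indent_width=4):
    """Column at which the lines following opening_line should be indented."""
    line = opening_line.rstrip()
    paren = last_unclosed_paren(line)
    if paren is None:
        raise ValueError("no unclosed '(' on this line")
    if paren < len(line) - 1:
        # Text follows the parenthesis: align with the column right after it.
        return paren + 1
    # Nothing follows: indent one level relative to the opening line.
    base = len(line) - len(line.lstrip())
    return base + indent_width


print(hanging_indent_column("some_function(param1,"))
print(hanging_indent_column("some_function("))
```

The first call yields the column right after the parenthesis, the second one a single indent level relative to some_function.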

This is the broad algorithm I use.

The difficult part, though, was actually aligning the file. Now I align my files to the correct indentation levels (barring hanging indents) as follows:

• Remove all whitespace to the left of every line.

I have a list called indentation_levels which holds the indentation level for each line number, and a variable insert which is either ‘\t’ or tab_width*’ ‘, where tab_width is the number of spaces to indent.

• To each line, add insert*indentation_levels[line] to the left of the line.

Basically line -> insert*indentation_levels[line] + line.
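The strip-then-prefix step can be written as a one-line list comprehension. The sketch below reuses the names insert and indentation_levels from the text; the function name `reindent` and its parameters are assumptions made for illustration, and lines are assumed to carry no trailing newline.

```python
def reindent(lines, indentation_levels, use_tabs=False, tab_width=4):
    """Strip leading whitespace and re-apply the computed indentation."""
    insert = "\t" if use_tabs else " " * tab_width
    return [
        insert * level + line.lstrip()
        for line, level in zip(lines, indentation_levels)
    ]


source = ["  def f():", "return 1"]
print(reindent(source, [0, 1]))
print(reindent(source, [0, 1], use_tabs=True))
```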

Now the problem was to insert absolute_indentation_levels in between the normal indentation.

So what’s the problem? Just add the number of spaces along with the normal indentation, right? WRONG!

There’s a difference between

\t        \t

and

\t\t

real life example:

class {
\tfunction(param1{
\t        \t do_something();}
\t         param2)
}

I was able to accomplish this by:

• storing the previous indentation of a block
• then adding previous indent + hanging indent + indent level of that line.
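A small sketch of how such a prefix could be assembled, so that the hanging-indent spaces sit between the block's tabs and the inner indentation instead of being merged with them. The function and parameter names are hypothetical and not taken from the IndentationBear.

```python
def build_prefix(block_level, hanging_spaces, inner_level,
                 use_tabs=True, tab_width=4):
    """Prefix = previous block indent + hanging indent + indent of the line."""
    insert = "\t" if use_tabs else " " * tab_width
    return insert * block_level + " " * hanging_spaces + insert * inner_level


# One tab for the enclosing block, 8 alignment spaces, one tab for the body.
prefix = build_prefix(1, 8, 1)
print(repr(prefix))
print(prefix == "\t\t")
```

Keeping the three parts separate is what distinguishes "\t        \t" from "\t\t": the alignment spaces survive regardless of how wide a tab is rendered.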

Do tell me if you find any fault in these algorithms. My work has recently been merged and can be found as the IndentationBear in the master branch of the coala-bears repository.

### Nelson Liu (scikit-learn)

#### (GSoC Week 6) Efficient Calculation of Weighted Medians

In my previous blog post, I discussed a method for using two heaps to efficiently find the median for use in the MAE criterion for finding the best split. However, the post did not include information on how to extend the method used to efficiently calculate the weighted median. This post will fill those holes, and give a detailed explanation of how I am extending the median calculation method shown in the previous post to deal with the important case of weighted data.

# Recap of the weighted median problem

In my last blog post, I briefly explained the intuition behind the weighted median problem; it's quoted below for your reading convenience:

A common way to define the median of a set of samples is as the value such that the number of samples above and below the median are both less than or equal to half the total number of samples. The weighted median is thus similar, but we seek to find a value such that the total weights of the samples above and below the median are both less than or equal to half the total weight of all samples. If this seems a bit strange, don't worry! Examples are provided a bit further below.

# Calculating the Weighted Median, given sorted values and weights

Before diving into the intricacies of how to calculate the weighted median of a running stream of numbers, it's important to have an algorithm to calculate the weighted median given a sorted array of values and the associated weights. For example, let's call the array of values Y and its associated weights W. If we had the data Y = [4,1,6] and W = [3,1,2], then we have a sample with a value of 4 and a weight of 3, another sample with a value of 1 and a weight of 1, and a last sample with a value of 6 and a weight of 2. When sorted by value, the data would look like:

Y = [1,4,6]
W = [1,3,2]


Notice that the W array is not sorted, but is rather arranged such that the weight of Y[i] is W[i], for all i in the python expression range(len(Y)). Given this data, to calculate the weighted median, we divide the calculation into two cases.

## Case One: The "Split Median" Case

When calculating the unweighted median, there are cases where it's necessary to take the average of two elements in order to find the median --- I refer to this as the "split median" case, because you need to "split" to find the median. Namely, this is necessary when the number of elements is even. For example, the median of [1,2,3,4] is the average of 2 and 3, or 2.5. In a more nuanced case, the median of [3,4,4,5] is the average of 4 and 4, which is still 4.

In the weighted median calculation, there is also a "split median" case. For example, given Y = [3,4,5,6] and W = [1,2,2,1], you're essentially calculating the median of Y = [3,4,4,5,5,6] with weights W = [1,1,1,1,1,1]. This comes out as a "split median", the value 4.5 (the average of 4 and 5). However, the rule for determining whether a given set of data will need a split median is not the same as in the unweighted case; that is, just because sum(W) % 2 == 0 (sum of weights is even) does not mean that the median will be a split median. This seems slightly unintuitive, especially considering that our previous example of converting the weighted median calculation into an unweighted one simply involved creating a new data array of length sum(W).

However, the critical catch is when using non-whole weights. If you're given the data Y = [1,2] with weights W = [1.5,1.5], it's quite obvious that this will lead to a split median (1.5, the average of 1 and 2). However, sum(W) = 3.0, which is not an even number. We cannot easily convert a problem like the above to an unweighted median problem, because we cannot express a value occurring with a frequency of 1.5. We could use tricks like simplifying the weights to W = [1,1] (divide by the common factor 1.5) or W = [15,15] (turn decimals into whole numbers), but this gets unwieldy very quickly when dealing with weights that can be arbitrary floats.

As a result, we need a different way of determining whether a weighted median calculation will involve taking the average of two values. To do so, it's useful to refer back to the definition of the weighted median problem. We're looking for the value such that the sums of the weights above and below it are equal. Essentially, we want to find whether there is a position that exactly splits the weights in halves. In other words, is there a value of k in range(1, len(W)) such that sum(W[0:k]) = sum(W[k:len(W)]) = sum(W) / 2? (Remember that the notation W[0:k] denotes the elements W[0], W[1], ... W[k-1].)

Revisiting the examples above with the definition we just introduced, we can verify that it holds.
Given the data:

Y = [3,4,5,6]
W = [1,2,2,1]


we can verify that for the value k = 2, sum(W[0:2]) = 1 + 2 = 3 and sum(W[2:4]) = 2 + 1 = 3, with additional verification in the fact that sum(W) / 2 = 3. You can verify for yourself that valid values of k exist that fulfill the property presented above on the datasets with Y=[1,2] and W=[1.5,1.5], as well as Y=[1,2,3] and W=[2,1,1], indicating that they will indeed involve an average calculation.

Now that we know when a weighted median calculation will involve an average, how do we actually use it to calculate the weighted median? It's actually quite simple -- once we have figured out the value of k that indicates that the median will require an averaging operation, the median is simply calculated as $$\frac{Y[k-1] + Y[k]}{2}$$.

Thus, for the example above (Y = [3,4,5,6] and W=[1,2,2,1], with k = 2), we calculate the median as $$\frac{Y[2-1]+Y[2]}{2} = \frac{Y[1]+Y[2]}{2} = \frac{4+5}{2} = 4.5$$.

In the second example above (Y=[1,2] and W=[1.5,1.5], with k=1), the median is calculated as $$\frac{Y[1-1]+Y[1]}{2} = \frac{Y[0]+Y[1]}{2} = \frac{1+2}{2} = 1.5$$. Try calculating the weighted median with these steps for the third example, and verify that the weighted median is equal to 1.5.

## Case Two: The "Whole" Median Case

When the weighted or unweighted median calculation does not require taking the average of two values, I refer to the calculation process as calculating a "whole" median. For example, the unweighted median of [1,2,3] is 2, a conclusion we can arrive at without taking any averages. In the weighted case, given Y=[3,4,5,6] with W=[1,2,1,1], the weighted median is 4, a calculation that also does not require the averaging operation. Our criterion for when a median calculation is a "whole median" case is simple: if the median calculation is not "split", it is whole. In other words, if there is NO value of k in range(1, len(W)) such that sum(W[0:k]) = sum(W[k:len(W)]) = sum(W) / 2, the median is a "whole median". In that case, assuming positive weights, there must be a value of k in range(1, len(W) + 1) such that sum(W[0:k]) > sum(W) / 2. Calculating the value of the median in this "whole" case requires knowing the smallest value of k that satisfies this condition, namely that sum(W[0:k]) > sum(W) / 2.

Looking back at the earlier weighted median example, we can calculate the value of k. Given the data Y=[3,4,5,6] with W=[1,2,1,1], we can see that:

sum(W) / 2 = 2.5
when k = 1, sum(W[0:1]) = W[0] = 1
when k = 2, sum(W[0:2]) = W[0] + W[1] = 1+2 = 3
when k = 3, sum(W[0:3]) = W[0] + W[1] + W[2] = 1+2+1 = 4
...


As a result, the correct value of k would be 2, since it is the smallest value such that sum(W[0:k]) > sum(W) / 2. With this value of k in hand, it's trivial to calculate the median; the median is simply Y[k-1] = Y[1] = 4.
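The two cases can be combined into a single pass over the prefix sums. The following is a minimal sketch, assuming Y is already sorted, W[i] is the weight of Y[i], and all weights are positive. The function name weighted_median is chosen here for illustration only and is not part of the MedianHeap code.

```python
from itertools import accumulate


def weighted_median(Y, W):
    """Weighted median of sorted values Y with positive weights W."""
    half = sum(W) / 2
    # prefix is sum(W[0:k]) for k = 1, 2, ..., len(W)
    for k, prefix in enumerate(accumulate(W), start=1):
        if prefix == half:
            # "split median": the weights are cut exactly in half after Y[k-1]
            return (Y[k - 1] + Y[k]) / 2
        if prefix > half:
            # "whole median": smallest k with sum(W[0:k]) > sum(W) / 2
            return Y[k - 1]
    raise ValueError("Y and W must not be empty")
```

Note that the split test compares floats for exact equality, which works for the examples in this post but may need a tolerance for weights that are not exactly representable.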

## So how does this help us?

Now that we have methods to calculate the weighted median from sorted instances of Y and its associated weights W, we're quite close to having a working (albeit a bit naive) solution to the weighted median problem. If we modify the min and max heaps to internally be sorted arrays (such that in a min heap, the array is sorted in decreasing order and the array is sorted in increasing order in the max heap), we can calculate the median at any time step in O(n) time with roughly the following algorithm (in pseudo-pythonic code):

def calculate_weighted_median(med_heap):
    total_weight = sum(med_heap.min_heap) + sum(med_heap.max_heap)
    k, sum_of_w_0_to_w_k_plus_1 = calculate_k(med_heap, total_weight)
    if sum_of_w_0_to_w_k_plus_1 == total_weight / 2: # split median case
        # resize the value of k for accessing elements in each heap
        # return (Y[k-1] + Y[k])/2
    else: # whole median case
        # resize the value of k for accessing elements in each heap
        # return Y[k-1]

def calculate_k(med_heap, total_weight):
    sum_of_w_0_to_w_k_plus_1 = 0
    # k runs from 1 to len(...) inclusive so that every element is visited
    for k in range(1, len(med_heap.max_heap) + 1):
        # go from left to right in the sorted array, so
        # start with the max heap and move to the min heap
        sum_of_w_0_to_w_k_plus_1 += med_heap.max_heap[k-1]
        # true division, so that odd or fractional totals are handled
        if sum_of_w_0_to_w_k_plus_1 >= total_weight / 2:
            # we have found the value of k
            return (k, sum_of_w_0_to_w_k_plus_1)

    # if the for loop above terminates, we move to the min heap
    for k in range(1, len(med_heap.min_heap) + 1):
        sum_of_w_0_to_w_k_plus_1 += med_heap.min_heap[k-1]
        if sum_of_w_0_to_w_k_plus_1 >= total_weight / 2:
            # we have found k, but be sure to add len(med_heap.max_heap)
            # since we restarted iteration from 1
            return (k + len(med_heap.max_heap), sum_of_w_0_to_w_k_plus_1)


So in theory, we're done! We can just calculate the median at each time step, and have a functionally correct solution. However, we'd prefer not to calculate the median at each time step because it's fairly expensive, and thus we want to save the values of k and sum_of_w_0_to_w_k_plus_1, and simply update them each time a new element is added to the set of values and weights we are considering.

## Speeding up calculation by iterative updating

If you start off with an empty list (the first iteration of the running weighted median calculation), and insert an element 2 with weight 2, you start with k = 1 and sum_of_w_0_to_w_k_plus_1 = 2 (NOTE: this is the value of W[0] + W[1] + ... + W[k-1], henceforth referred to as simply sum_w_k for brevity). This makes sense, because the value of sum_w_k is equal to sum(W[0:k]) = sum(W[0:1]) = W[0], with k being 1. Using the principles from earlier, we can see that sum_w_k > (total_weight / 2), thus indicating that the median is at Y[k-1], or Y[0], because it is a "whole median".

At the end of this first time step, you're left with the following state:

Y = [2]
W = [2]
k = 1
sum_w_k = 2
total_weight = 2


### Second Time Step - inserting below the median

Let's say that in the second time step, you insert the value 1 with an associated weight of 1. Now, Y = [1,2] and W=[1,2]. We need to update k, sum_w_k, and total_weight to reflect the new insertion. Updating total_weight is trivial, as we just add 1 for a new total_weight = 3. Since we inserted something LESS than our original median at Y[k-1], we INCREMENT k by one so it now has a value of 2. Let's make original_k the value of our k at the first time step, namely 1. Similarly, we update sum_w_k to reflect our new k, so now it is equal to sum(W[0:k]) = sum(W[0:2]) = W[0]+W[1]=3.

Given these new values, we have an issue: we cannot yet calculate the median, because we may have broken the condition that k must be the SMALLEST number possible such that sum(W[0:k]) >= total_weight / 2. Since we inserted something less than Y[original_k] = 2, we know that the median cannot have increased. As a result, our value of k can only decrease. We therefore want to see if k is still the smallest value such that sum(W[0:k]) >= total_weight / 2. To do this, we iteratively try smaller values of k in the following fashion:

# recall that sum_w_k represents sum(W[0:k])
while(sum_w_k - W[k-1] >= total_weight / 2 and k != 1):
    sum_w_k = sum_w_k - W[k-1]
    k = k - 1


After this while loop completes, it is guaranteed that k is the smallest value in range(1, len(Y) + 1) such that sum(W[0:k]) >= total_weight / 2, and that sum_w_k is updated accordingly. In this case the loop body never runs, because sum_w_k - W[k-1] = 3 - 2 = 1 is less than total_weight / 2 = 1.5. As a result, after the second time step, our state looks like:

Y = [1,2]
W = [1,2]
k = 2
sum_w_k = 3
total_weight = 3


If we wanted to find the median at this time step, we would first determine whether it is a split or whole median by checking if sum_w_k == total_weight / 2; since it is not (3 != 1.5), we know that it is a whole median, and the median has the value Y[k-1] = Y[1] = 2. In this second time step, we covered the case of inserting a value less than the current median.

### Third and Fourth Time Step - inserting above the median

In the third time step, we check the case of inserting a value greater than the current median, say the value 5 with a weight of 3. Now, Y = [1,2,5] and W = [1,2,3]. Since we inserted something above the median, we do not change the value of k and sum_w_k. However, we face the opposite problem from the one we had when inserting below the median. When inserting below the median, you lose the guarantee that k is the SMALLEST number possible such that sum(W[0:k]) >= total_weight / 2. On the other hand, when inserting above the median, you lose the guarantee that sum(W[0:k]) >= total_weight / 2 in the first place! Our new total weight is 6. We can see that sum(W[0:k]) = sum(W[0:2]) = 3 >= total_weight / 2 still holds true, so we do not need to update anything further. However, when calculating the median, since sum_w_k == total_weight / 2, it must be a split median. As a result, we apply the formula $$\frac{Y[k-1] + Y[k]}{2} = \frac{Y[2-1] + Y[2]}{2} = \frac{Y[1] + Y[2]}{2} = \frac{2 + 5}{2} = 3.5$$ to calculate the correct split weighted median.

After the 3rd time step, the state is:

Y = [1,2,5]
W = [1,2,3]
k = 2
sum_w_k = 3
total_weight = 6


In the fourth time step, I'll demonstrate the case of inserting a value greater than the current median and having to update k. Let's say that we insert the value 8 with a weight of 5. As a result, our Y = [1,2,5,8] and our W = [1,2,3,5]. Once again, since we inserted above the median, we initially do not have to change the value of k and sum_w_k. We need to check whether our current value of k = 2 still fulfills sum(W[0:k]) >= total_weight / 2. total_weight is now equal to 11, since we add 5. Because we inserted something above the median, the median can only increase; thus, we look for higher values of k. Note that sum(W[0:2]) = 3, which IS NOT greater than or equal to total_weight / 2 = 11/2 = 5.5. To look for the new value of k, we iteratively try higher and higher values until the condition that sum(W[0:k]) >= total_weight / 2 is met. To illustrate:

iteration 1: Try k = 3 (2+1)
set sum_w_k = sum(W[0:3]) = W[0]+W[1]+W[2] = 6
is sum_w_k >= total_weight / 2? yes, since 6 > 5.5
We have found the lowest value of k fulfilling our target condition, and thus we break


As a result, the new value of k = 3. With this, we can calculate the new median after inserting the new value. Since sum(W[0:k]) > total_weight / 2, we know the median is whole. Thus, the median is simply Y[k-1] = Y[3-1] = Y[2] = 5. This is the correct answer, given the following state after step 4:

Y = [1,2,5,8]
W = [1,2,3,5]
k = 3
sum_w_k = 6
total_weight = 11


### Step 5 - Inserting at the median

Obviously, inserting a value at the median will not change the median. However, it's important to ensure that we properly modify the other variables in order to facilitate correct future insertions. Given that the median in the previous state is 5, we will insert the value 5 with the weight 1. As a result our Y = [1,2,5,5,8] and W = [1,2,3,1,5]. We accordingly update total_weight = 12. Now, we have to update the value of k and sum_w_k. As you can see here, inserting at the median is just a special case of inserting above the median (notice how the 5 we inserted is placed after the original 5). As a result, we update k in the same manner as we do when inserting above the median; that is, we keep increasing it until it is the minimum value such that sum_w_k >= total_weight / 2. In this case, k is currently 3 and sum_w_k = 6. Since sum_w_k >= total_weight / 2, we do not have to update k.

Calculating the median is trivial; we see that sum_w_k == total_weight / 2 (6 == 6), so this is a split median and the median is (Y[k-1] + Y[k]) / 2, where k remained unchanged at 3. Thus, the median is still (Y[2] + Y[3]) / 2 = (5 + 5) / 2 = 5. The state after step 5 is:

Y = [1,2,5,5,8]
W = [1,2,3,1,5]
k = 3
sum_w_k = 6
total_weight = 12


### Step 6 - Post-Median Insertion Sanity Check

As a last step, we will insert another value just to ensure that we performed the proper actions in step 5. In this case, we'll choose to insert an extreme value below the median: the value 0 with a weight of 14. Now, Y = [0,1,2,5,5,8] and W = [14,1,2,3,1,5]. We update total_weight = 26, k = 4, and sum_w_k = 20. Since we inserted below the median, we need to search k values that are closer to 0. Our search is detailed below, starting from our base value of k=4:

# iteration 1, verify that the current k = 4 fulfills the condition
is sum_w_k >= total_weight / 2? yes, since 20 > 13.

# iteration 2, check if k = 3 also fulfills the condition
set sum_w_k = sum(W[0:3]) = W[0]+W[1]+W[2] = 14+1+2 = 17.
is sum_w_k >= total_weight / 2? yes, since 17 > 13

# iteration 3, check if k = 2 also fulfills the condition
set sum_w_k = sum(W[0:2]) = W[0]+W[1] = 14+1 = 15.
is sum_w_k >= total_weight / 2? yes, since 15 > 13

# iteration 4, check if k = 1 also fulfills the condition
set sum_w_k = sum(W[0:1]) = W[0] = 14.
is sum_w_k >= total_weight / 2? yes, since 14 > 13

# since k = 1 is the minimum value, we break.


With our value of k=1, we can calculate the new weighted median. We see that sum_w_k > total_weight / 2, thus this is a "whole median", and we use Y[k-1] = Y[0] to get a weighted median result of 0, which is correct. After step 6, the program state is:

Y = [0,1,2,5,5,8]
W = [14,1,2,3,1,5]
k = 1
sum_w_k = 14
total_weight = 26


We can insert more arbitrary numbers and weights, and this system will be able to handle it.
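The update rules from the six steps can be collected into one small class. This is only a sketch under stated assumptions: it keeps Y and W in plain sorted lists (so each insertion is O(n)) rather than in the two heaps of the real MedianHeap, it assumes positive weights, and the class name RunningWeightedMedian is made up for illustration. Ties are placed after existing equal values with bisect_right, which matches the placement used in step 5.

```python
import bisect


class RunningWeightedMedian:
    """List-based sketch of the iterative k / sum_w_k update."""

    def __init__(self):
        self.Y, self.W = [], []
        self.k = 0             # invariant: smallest k with sum(W[0:k]) >= total / 2
        self.sum_w_k = 0       # always equal to sum(W[0:k])
        self.total_weight = 0

    def insert(self, value, weight):
        pos = bisect.bisect_right(self.Y, value)
        self.Y.insert(pos, value)
        self.W.insert(pos, weight)
        self.total_weight += weight
        if pos < self.k:
            # inserted below the median: the new element is inside W[0:k]
            self.k += 1
            self.sum_w_k += weight
        half = self.total_weight / 2
        # inserted above (or at) the median: grow k until the condition holds
        while self.sum_w_k < half:
            self.sum_w_k += self.W[self.k]
            self.k += 1
        # inserted below the median: shrink k while a smaller k still works
        while self.k > 1 and self.sum_w_k - self.W[self.k - 1] >= half:
            self.sum_w_k -= self.W[self.k - 1]
            self.k -= 1

    def median(self):
        if self.sum_w_k == self.total_weight / 2:   # split median
            return (self.Y[self.k - 1] + self.Y[self.k]) / 2
        return self.Y[self.k - 1]                   # whole median
```

Feeding it the six insertions from the walkthrough, (2, 2), (1, 1), (5, 3), (8, 5), (5, 1) and (0, 14), reproduces the medians 2, 2, 3.5, 5, 5 and 0 and ends in the state k = 1, sum_w_k = 14, total_weight = 26.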

## Conclusion

In this blog post, I talked about an empirical method I am using to quickly calculate the weighted median given a running stream of numbers. This will be used as a special case of the MedianHeap implementation already in place to handle unweighted values. I'm struggling to find a formal mathematical proof that this works, and would highly appreciate it if you could comment below if you have any thoughts, especially regarding its correctness. It seems intuitive to me and appears to work for a wide variety of cases, but that doesn't mean it's correct in all cases.

If you have any questions, comments, or suggestions, you're welcome to leave a comment below.

Thanks to my mentors Raghav RV and Jacob Schreiber for their input on this problem; we've run through several solutions together, and they are always quick to point out errors and suggest improvements.

You're awesome for reading this! Feel free to follow me on GitHub if you want to track the progress of my Summer of Code project, or subscribe to blog updates via email.

## July 04, 2016

### Upendra Kumar (Core Python)

#### Unittest for tkinter applications

This week I went through a lot of codebases, docs and Python books to learn about unittest. It was a difficult task because we need to think of our application in a different way: we need to think about the possible use cases of different methods, verify their functionality and logic, and check for exceptions. All this becomes difficult and tricky when we need to create unit tests for a tkinter application.

The main difficulty with a tkinter application is that we can’t check its functionality in the normal way. The call to root.mainloop() is a blocking call. Once we call root.mainloop() in the setUpClass() method of a unittest.TestCase, it blocks the further execution of tests, which is not desired.

OK, then as a normal person you will try to run root.mainloop() in a thread other than the main thread in which the unit tests run. This trick also fails, as neither tkinter nor the unittest module is very handy when it comes to multithreading. root.mainloop() can only be run in the main thread, therefore the unit tests fail when we attempt to run root.mainloop() in a secondary thread.

Therefore, finally, we can do one thing. root.mainloop() is essentially just a loop that periodically checks for changes in GUI elements. We can simulate the functionality of root.mainloop() ourselves by calling the root.update() method whenever we change a GUI element programmatically. One more thing: a GUI event can only be injected into a tkinter GUI through the invoke() or event_generate() methods. They behave just like user input.

We can have an alternate philosophy of testing our GUI applications. It can be applied to each and every GUI application. We can test a Tkinter application without using the Tkinter library. Basically, remove all the non-Tkinter code from the classes that handle the GUI and shunt that to another class. This way, we can easily test the logic that is being implemented without actually using Tkinter.
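As a rough illustration of this second approach, the sketch below keeps the application logic in a class that never imports tkinter, so it can be tested with plain unittest. CounterModel and its methods are hypothetical names invented for this example; they are not taken from the project.

```python
import unittest


class CounterModel:
    """Hypothetical non-GUI logic that a tkinter button callback would call."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

    def label_text(self):
        # The GUI class would copy this string into a Label widget.
        return "Count: {}".format(self.count)


class CounterModelTest(unittest.TestCase):
    def test_starts_at_zero(self):
        self.assertEqual(CounterModel().label_text(), "Count: 0")

    def test_increment_updates_label(self):
        model = CounterModel()
        model.increment()
        model.increment()
        self.assertEqual(model.label_text(), "Count: 2")
```

The tests can be run with `python -m unittest` and never need a display or a call to root.mainloop(); the GUI class is reduced to thin glue that wires widgets to the model.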

I have planned to use both methods for testing as they represent different perspectives.

### Ravi Jain (MyHDL)

#### CRC32 : Transmit Engine

I completed the first draft of the Transmit Engine implementation. The implementation was fairly straightforward, barring the calculation of the CRC32 (Cyclic Redundancy Check) for the Frame Check Sequence.

It stalled me for a day or two, requiring patience while reading and understanding the different types of implementations. A very painless tutorial for understanding CRC32 and its implementation from the ground up can be found here. This implementation in C also helped.
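For readers who want to see the calculation itself, here is a minimal bit-by-bit sketch in Python of the reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), which is the same algorithm that zlib.crc32 implements. It is a software illustration only, not the MyHDL hardware implementation from the pull request, and the function name crc32_bitwise is made up for this example.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Reflected CRC-32, processing one bit at a time."""
    crc = 0xFFFFFFFF                      # initial register value
    for byte in data:
        crc ^= byte                       # feed the next byte into the low bits
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320   # shift and apply the polynomial
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # final inversion


# Cross-check against the standard library implementation.
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789")
```

Table-driven versions trade this inner loop for a 256-entry lookup table, and a hardware version typically unrolls the same shift-and-XOR steps into combinational logic.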

Now I have opened a pull request for code review.

### Utkarsh (pgmpy)

#### Google Summer of Code week 5 and 6

Mid-term results are out. Congratulations to all fellow GSoCers who successfully made it through the first half! My PR #702, which dealt with the first half of my proposed project, has been merged. During week 5, I started working on the No-U-Turn Sampler (NUTS). NUTS is an extension of Hamiltonian Monte Carlo that eliminates the need to set the trajectory length. NUTS recursively builds a tree in the forward and backward directions, proposing a set of likely candidates for new values of position and momentum, and stops automatically when the proposed values are no longer useful (the trajectory starts doubling back). With the dual-averaging algorithm the stepsize can be adapted on the fly, making it possible to run NUTS without any hand tuning at all :) .
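To make the "doubling back" idea concrete, here is a small pure-Python sketch of a leapfrog step and the U-turn stopping check on a standard normal target. It only simulates a single forward trajectory; it does not build the recursive tree and it is not the pgmpy implementation. All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leapfrog(theta, r, grad_log_pdf, stepsize):
    """One leapfrog step for position theta and momentum r."""
    g = grad_log_pdf(theta)
    r_half = [ri + 0.5 * stepsize * gi for ri, gi in zip(r, g)]
    theta_new = [ti + stepsize * ri for ti, ri in zip(theta, r_half)]
    g_new = grad_log_pdf(theta_new)
    r_new = [ri + 0.5 * stepsize * gi for ri, gi in zip(r_half, g_new)]
    return theta_new, r_new


def is_u_turn(theta_minus, r_minus, theta_plus, r_plus):
    """True once the two ends of the trajectory start moving towards each other."""
    delta = [p - m for p, m in zip(theta_plus, theta_minus)]
    return dot(delta, r_minus) < 0 or dot(delta, r_plus) < 0


def steps_until_u_turn(theta0, r0, grad_log_pdf, stepsize, max_steps=1000):
    theta, r = list(theta0), list(r0)
    for n in range(1, max_steps + 1):
        theta, r = leapfrog(theta, r, grad_log_pdf, stepsize)
        if is_u_turn(theta0, r0, theta, r):
            return n
    return max_steps


def std_normal_grad(theta):
    # gradient of log N(0, I) with respect to theta
    return [-t for t in theta]


n_steps = steps_until_u_turn([0.0], [1.0], std_normal_grad, stepsize=0.1)
```

Here the trajectory length n_steps is found by the stopping check instead of being chosen by hand, which is the property that NUTS exploits.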

I tried implementing the following algorithms from the paper [1]:

• Algorithm 3: Efficient No-U-Turn Sampler

• Algorithm 6: No-U-Turn Sampler with Dual Averaging

The proposed API is similar to what we have for Hamiltonian Monte Carlo. Here is an example of how to use NUTS:

>>> from pgmpy.inference.continuous import NoUTurnSampler as NUTS, LeapFrog
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([1, 2, 3])
>>> covariance = np.array([[4, 0.1, 0.2], [0.1, 1, 0.3], [0.2, 0.3, 8]])
>>> model = JGD(['x', 'y', 'z'], mean, covariance)
>>> sampler = NUTS(model=model, grad_log_pdf=None, simulate_dynamics=LeapFrog)
>>> samples = sampler.sample(initial_pos=np.array([0.1, 0.9, 0.3]), num_samples=20000,stepsize=0.4)
>>> samples
rec.array([(0.1, 0.9, 0.3),
(-0.27303886844752756, 0.5028580705249155, 0.2895768065049909),
(1.7139810571103862, 2.809135711303245, 5.690811523613858), ...,
(-0.7742669710786649, 2.092867703984895, 6.139480724333439),
(1.3916152816323692, 1.394952482021687, 3.446906546649354),
(-0.2726336476939125, 2.6230854954595357, 2.923948403903159)],
dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])


and NUTS with dual averaging

>>> from pgmpy.inference.continuous import NoUTurnSamplerDA as NUTSda
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([-1, 12, -3])
>>> covariance = np.array([[-2, 7, 2], [7, 14, 4], [2, 4, -1]])
>>> model = JGD(['x', 'v', 't'], mean, covariance)
>>> sampler = NUTSda(model=model)
>>> samples = sampler.sample(initial_pos=np.array([0, 0, 0]), num_adapt=10, num_samples=10, stepsize=0.25)
>>> samples
rec.array([(0.0, 0.0, 0.0),
(0.06100992691638076, -0.17118088764170125, 0.14048470935160887),
(0.06100992691638076, -0.17118088764170125, 0.14048470935160887),
(-0.7451883138013118, 1.7975387358691155, 2.3090698721374436),
(-0.6207457594500309, 1.4611049498441024, 2.5890867012835574),
(0.24043604780911487, 1.8660976216530618, 3.2508715592645347),
(0.21509819341468212, 2.157760225367607, 3.5749582768731476),
(0.20699150582681913, 2.0605044285377305, 3.8588980251618135),
(0.20699150582681913, 2.0605044285377305, 3.8588980251618135),
(0.085332419611991, 1.7556171374575567, 4.49985082288814)],
dtype=[('x', '<f8'), ('v', '<f8'), ('t', '<f8')])


Performance-wise, NUTS is slower than a fine-tuned HMC method for a simple model like a joint Gaussian distribution, where gradients are easy to compute, because of the increased number of inner products. In terms of memory, NUTS also needs to store more values of position and momentum during recursion (when we recursively build the tree). But for complex models and models with large data (high dimensionality), NUTS is faster than a tuned HMC method.

During week 6, apart from working on NUTS with dual-averaging, I also used profiling to look for possible optimizations in my current implementation. The profiling results weren’t helpful. I’ll try to think of different ways to reduce the number of gradient computations by re-using them.

For the next week I’ll write tests for NUTS and NUTS with dual-averaging.

## July 03, 2016

### mkatsimpris (MyHDL)

#### Week 6

This week, the 2d-dct and 1d-dct modules were merged into the main repository. Moreover, a new branch (zig-zag) was created, which contains the zig-zag module and its unit tests. Christopher re-organized the contributed code and gave us some pointers on writing in a more compact style. In the following days, the code will be refactored according to Christopher's recommendations.

### Raffael_T (PyPy)

#### Unpacking done! Starting with Coroutines

It took a bit longer than anticipated, but the additional unpacking generalizations are finally completed.

Another good thing: I am now finally ready to work full time and make up for the time I lost, as I don't have to invest time into studying anymore.

The unpackings are done slightly differently than in CPython, because in PyPy I get objects with a different structure for the maps. So I had to filter them for the keys and values and do a manual type check to see whether each one really is a dict. For the map unpack with call, I had to implement the intersection check. For that I just check whether a key is already stored in the dict that gets returned.
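At the Python level, the intersection check can be pictured roughly as follows. This is only an illustrative sketch of the idea, not PyPy's interpreter code; the function name and the error messages are invented for the example.

```python
def map_unpack_with_call(*mappings):
    """Merge keyword mappings, rejecting non-dicts and duplicate keys."""
    result = {}
    for mapping in mappings:
        # manual type check: only real dicts are accepted in this sketch
        if not isinstance(mapping, dict):
            raise TypeError(
                "argument after ** must be a dict, not {}".format(
                    type(mapping).__name__))
        for key, value in mapping.items():
            # intersection check: the key must not already be in the result
            if key in result:
                raise TypeError(
                    "got multiple values for keyword argument {!r}".format(key))
            result[key] = value
    return result
```

This mirrors the behaviour of a call such as `f(**{'a': 1}, **{'a': 2})`, which raises a TypeError because the keyword `a` is supplied twice.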

Now it's time to implement coroutines with async and await syntax. There is still a problem with the translation of PyPy, which is connected to the missing coroutines syntax feature. I will need to get this working as well, starting with implementing async in the parser.

## July 02, 2016

### Ranveer Aggarwal (dipy)

#### Progressing the Slider

In my previous blog post I had made a simple slider that worked with a click. This week, I also implemented dragging. It required changing the way I was sending the UI elements to the renderer again. It worked well, but I am stuck with a bug which seems pretty difficult to solve.

### The Problem

VTK has actors - 2D actors and 3D actors. Now, when you click on the screen, we use a picker (a built in VTK object) which takes in the click coordinates and returns the actor at that coordinate. This is how everything was working for the past 6 weeks. All the UI elements I have are 2D Actors. I have some problem picking the right 2D actor. This seems like a long known yet unsolved issue with VTK-Python :/

We’re working on it right now and if it doesn’t work out, we’ll have to write our own picker.

#### Past mid-term

Mid-term evaluation was just two weeks ago. And work is yet to be done.

### What’s next?

Currently the uploadTool is almost final. It has endured tons of refactoring and huge changes in the past week and is almost good to go. However, there are still tests to be written!

But that’s not all! We still need the installation tool, which would be a huge addition to coala. The uploadTool only uploads all bears to PyPI. But that’s not enough! They have no requirements, no way to work! That’s exactly what the installation tool handles. It generates the installation commands for all the requirements of each bear, and upon installation it simply installs all the dependencies and therefore the bear itself.

### Is it hard?

Probably. Probably not. I don’t think that making this tool is actually hard, but designing it is, because it has to be great. It has not only to work, but to be actually usable and amazing, so that people would choose to use it with pleasure.

### srivatsan_r (MyHDL)

#### Moving to the RISC-V project

For the remaining part of GSoC I will be creating a system something like this –

I have already completed the grey coloured block on the right side of the image, which is the HDMI core. The remaining blocks of the system have to be completed.

This week I was reading the RISC-V Instruction Set Architecture, and was trying to understand the project better.

I have to figure out a way to interface the HDMI core with the RISC-V processor.

### TaylorOshan (PySAL)

#### Exploratory data analysis for spatial interaction

Over the past two weeks I have worked on some exploratory tools for spatial interaction data. First, I have coded up a recently proposed spatial autocorrelation statistic for vectors. The statistic itself is a variation of Moran's I, though it requires unique randomization techniques to carry out significance testing because the underlying distribution of vectors is unknown. The original paper put forth two randomization techniques, which I have tested here and which give very different results. More work will need to be done here to decide the best way to carry out hypothesis testing of the vector Moran's I.

I also created helper functions for the Gravity, Production, and Attraction classes, which carry out origin/destination specific gravity models so that local statistics and parameters can be obtained and mapped to explore potential non-stationarity in data-generating processes. In each case, the helper function is called local, though it works a bit differently for Gravity models, which can be origin-specific or destination-specific in comparison to constrained models which can only be either origin-specific or destination-specific. If a user tries to use the local function with a doubly-constrained model then they get a not implemented error since it is not possible to compute location-specific doubly constrained models due to a lack of degrees of freedom.

Looking forward, the next two weeks will focus on building functions to produce weighting functions that consider the spatial proximity of both origin and destination neighbors, which will be useful for exploratory analysis and also for specifying autoregressive/spatial filter models.

## July 01, 2016

### mkatsimpris (MyHDL)

#### Zig-Zag Core

The 2d-dct and 1d-dct modules were merged into the main repository by Christopher. Now, it's time for the zig-zag core to be implemented. In the following days I will create a new branch with the zig-zag core and its unit test.

### jbm950 (PyDy)

#### GSoC Week 6

The main theme of this week is Featherstone’s method. I have finished reading all of the text book that I feel I need to in order to finish my project. After reading I realize that I have been improper about addressing my project. Instead of saying I am introducing Featherstone’s method to SymPy, it would be more accurate to say that I am introducing one of Featherstone’s methods. The book introduced two equations of motion generation methods for open loop kinematic trees and one method for a closed loop kinematic tree (I stopped reading after chapter 8 and so there may have been even more methods). For my project I have decided to focus on the articulated body method of equation of motion generation for kinematic trees. This method is presented as being more efficient than the composite body method and the closed loop method seems rather complicated.

With this in mind I began digging deeper into the articulated body method and better learning how it works. With this mindset I went over the three passes that the method uses and looked for places where code would be needed that isn’t specifically part of the method. I compiled a list of these functions and have written docstrings and presented them in PR #11331. The support code for the method includes operations for spatial (6D) vectors and a function and library for extracting relevant joint information.

This week I reviewed PR #11333. The pull request adds docstrings to methods that did not have them previously, which is a big plus, but the docstrings that were added were minimal and vague. I asked the contributor to add more information to the docstrings and he said he will get to it.

### Future Directions

Next week I plan on furthering my work on the articulated body method. I hope to have the support functions completely written up and to begin writing the equation of motion generator itself. These plans may be set aside, however, as my most active mentor will be coming back from traveling next week and so work may resume on the base class.

### PR’s and Issues

• (Open) Added docstrings to ast.py PR #11333
• (Open) [WIP] Featherstones EOM support PR #11331

## June 29, 2016

### Ravi Jain (MyHDL)

#### Started Transmit Engine!

Yay readers, good news. I got through the mid-terms and received the payment. Feels good!

About the project: I started off with the implementation of the transmit engine. Sweeping changes had to be made in the interfaces of the sub-blocks. Notable changes:

• Removal of Client Sub-block and interfacing the FIFOs directly with Engine.
• Addition of intrafaces, i.e., interfaces between the sub-blocks.
• Moving the configregs (Configuration Registers) and addrtable (Address Table) out of the management block to the main gemac block to extend their scope to the other sub-blocks. Now they are accessed by the management block through ports.

As a result of changing the ports of the management block I had to edit test_management to reflect the change. I had an independent instantiation of the management block in every test, which was redundant. I then looked into pytest fixtures, which enabled me to have a common function that is run before every test, thus removing the redundancy. It also makes it convenient to change the block port definitions in the future if needed.

Now I am working on implementing its features. A little about the Transmit Engine:

“Accepts Ethernet frame data from the Client Transmitter interface,
adds preamble to the start of the frame, add padding bytes and frame
check sequence. It ensures that the inter-frame spacing between successive
frames is at least the minimum specified. The frame is then converted
into a format that is compatible with the GMII and sent to the GMII Block.”
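The framing steps in this description can be sketched in software as follows. This is an assumption-laden illustration, not the MyHDL design: it uses the usual Ethernet conventions of a 7-byte 0x55 preamble followed by the 0xD5 start-of-frame delimiter, zero padding up to 60 bytes before the frame check sequence, and a CRC-32 appended least significant byte first. The function and constant names are made up, and the inter-frame gap and GMII conversion are left out.

```python
import struct
import zlib

PREAMBLE = bytes([0x55] * 7)      # assumed standard Ethernet preamble
SFD = bytes([0xD5])               # assumed start-of-frame delimiter
MIN_FRAME_WITHOUT_FCS = 60        # 64-byte minimum frame minus the 4-byte FCS


def build_transmit_frame(client_frame: bytes) -> bytes:
    """Add preamble, padding and frame check sequence to a client frame."""
    # pad short frames with zero bytes up to the minimum length
    padded = client_frame.ljust(MIN_FRAME_WITHOUT_FCS, b"\x00")
    # the FCS is the CRC-32 over the padded frame (preamble and SFD excluded)
    fcs = struct.pack("<I", zlib.crc32(padded) & 0xFFFFFFFF)
    return PREAMBLE + SFD + padded + fcs
```

A 14-byte client frame therefore grows to 8 + 60 + 4 = 72 bytes on the wire, while a frame that is already 60 bytes or longer only gains the 8 leading bytes and the 4-byte FCS.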

### Yashu Seth (pgmpy)

#### Custom Discretizers

Hello, everyone. The midterm evaluations are over and after a successful first half of my project I have begun working on the latter parts of my proposal.

Before I move on to the topic of this post, let me give you a glimpse of the planned implementation road map for the coming weeks. This week I am working on a special Continuous Factor class known as the CanonicalFactor. I will keep it concise, with just an introduction.

The intermediate factors in a Gaussian network can be described compactly using a simple parametric representation called the canonical form. This representation is closed under the basic operations used in inference: factor product, factor division, factor reduction, and marginalization. Thus, we can define a set of simple data structures that allow the inference process to be performed. Using these canonical factors and a fairly straightforward modification of the discrete sum-product algorithms (whether variable elimination or clique tree) we can have an exact inference algorithm for Gaussian networks.
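As a rough illustration of why the canonical form is closed under these operations, here is a one-variable sketch. It assumes the usual parametrization $C(x; K, h, g) = \exp(-\frac{1}{2} K x^2 + h x + g)$; the `Canonical` tuple and the helper names are hypothetical and are not the pgmpy CanonicalFactor API.

```python
import math
from collections import namedtuple

# C(x; K, h, g) = exp(-0.5 * K * x**2 + h * x + g), one-variable case.
Canonical = namedtuple("Canonical", ["K", "h", "g"])


def from_gaussian(mean, variance):
    """Canonical parameters of the density N(mean, variance)."""
    K = 1.0 / variance
    h = mean / variance
    g = -0.5 * mean ** 2 / variance - 0.5 * math.log(2 * math.pi * variance)
    return Canonical(K, h, g)


def product(a, b):
    # Multiplying two exponentials adds their exponents term by term.
    return Canonical(a.K + b.K, a.h + b.h, a.g + b.g)


def divide(a, b):
    # Dividing subtracts the exponents.
    return Canonical(a.K - b.K, a.h - b.h, a.g - b.g)


def evaluate(c, x):
    return math.exp(-0.5 * c.K * x ** 2 + c.h * x + c.g)


a = from_gaussian(1.0, 4.0)
b = from_gaussian(-2.0, 0.5)
ab = product(a, b)
# The product factor agrees pointwise with multiplying the two factors.
print(abs(evaluate(ab, 0.3) - evaluate(a, 0.3) * evaluate(b, 0.3)) < 1e-12)
```

With several variables, $K$ becomes a matrix and $h$ a vector, and the factors must first be extended to a common scope, which this sketch leaves out.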

Now moving on to the main content of this post. As promised, I will be discussing how we can leverage the class-based architecture of the discretizing functionality in pgmpy to plug in our own custom-made discretizing algorithms.

All the discretizing algorithms in pgmpy are represented as separate classes derived from a base class, BaseDiscretizer. They have an abstract method, get_discrete_values, that gives discrete probability masses in the form of a list, a Factor, a TabularCPD, or maybe some other representation, depending upon the particular discretization method.

Now, let us implement our own algorithm to discretize a ContinuousNode object. I am defining an upper discretization method that unlike the RoundingDiscretizer takes into account only the upper half of the probability density function between two consecutive points. (The RoundingDiscretizer takes the entire interval into calculation.)

    import numpy as np
    from pgmpy.discretizers import BaseDiscretizer

    class UpperDiscretizer(BaseDiscretizer):
        def get_discrete_values(self):
            # Width of one discretization step.
            step = (self.high - self.low) / self.cardinality

            # for x=[low, low+step, low+2*step, ........., high-step]
            # (that is `cardinality` points, spaced `step` apart)
            points = np.linspace(self.low, self.high - step, self.cardinality)
            # Probability mass between each point and half a step above it.
            discrete_values = [self.factor.cdf(i + step/2) - self.factor.cdf(i) for i in points]

            return discrete_values


As we can see, without much hassle we can work with whichever algorithm is best suited for our use case. These custom methods can also be applied to ContinuousFactor objects since, just like the ContinuousNode class, ContinuousFactor has a discretize method that takes a BaseDiscretizer subclass and returns the output of that method.
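To see what the two schemes compute without needing pgmpy at all, here is a standalone sketch for a standard normal distribution. The helper names are made up, and `rounding_masses` follows the description above (the whole interval around each point) rather than pgmpy’s actual RoundingDiscretizer code, which may treat the end points differently.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, written with math.erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def grid(low, high, cardinality):
    """The points low, low+step, ..., high-step and the step between them."""
    step = (high - low) / float(cardinality)
    return [low + k * step for k in range(cardinality)], step


def rounding_masses(cdf, low, high, cardinality):
    # Mass of the full interval [x - step/2, x + step/2] around each point.
    points, step = grid(low, high, cardinality)
    return [cdf(x + step / 2) - cdf(x - step / 2) for x in points]


def upper_masses(cdf, low, high, cardinality):
    # Mass of the upper half only, [x, x + step/2].
    points, step = grid(low, high, cardinality)
    return [cdf(x + step / 2) - cdf(x) for x in points]


points, _ = grid(-3.0, 3.0, 6)
full = rounding_masses(normal_cdf, -3.0, 3.0, 6)
upper = upper_masses(normal_cdf, -3.0, 3.0, 6)
for x, f, u in zip(points, full, upper):
    print("x=%+.1f  rounding=%.4f  upper=%.4f" % (x, f, u))
```

At x = 0 the distribution is symmetric, so the upper mass is exactly half of the rounding mass; elsewhere the two halves differ.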

With this I come to an end of the post. Hope you enjoyed it. Stay tuned for more :-) .

### John Detlefs (MDAnalysis)

#### Principal Component Analysis

My next subject for bloggery is Principal Component Analysis (PCA) (its sibling Multidimensional scaling has been left out for a future post, but it is just as special, don’t worry). If I were to give a talk on PCA, the slides would be roughly ordered as follows:

• A very short recap of dimension reduction
• PCA, what it stands for, rough background, history
• Eigenvectors (what are those?!)
• Covariance (Because variance matrix didn’t sound cool enough)
• The very fancy sounding method of Lagrange Multipliers (why they aren’t that hard)
• Explain the PCA Algorithm
• Random Walks: What are they, how are they taken on a configuration space
• Interpreting the results after applying PCA on MD simulation data

In reality I am not going to follow these bullet points. If you want information pertaining to the first two points, please read some of my previous posts. The last two points are going to be the subject of a post next week.

Here are some good sources for those seeking to acquaint themselves with Linear Algebra and Statistics: Multivariate Statistics and PCA (Lessons two through 10) and the Feynman Lectures on Physics: Probability. The Feynman lectures on physics are so good and so accessible. Richard Feynman certainly had his flaws but teaching was not one of them. If you’re too busy to read those, here’s a quick summary of some important ideas I will be using.

## What is a Linear Transformation, John?

Glad you asked, friend! Let’s just stick to linearity. For a function to be linear, it must satisfy $f(a+b) = f(a) + f(b)$ (and, for any constant $c$, $f(ca) = c f(a)$). As an example of a non-linear function, consider $y = x^{3}$. Plugging in some numbers shows this is non-linear: $(1+1)^{3} = 2^{3} \neq 1^{3} + 1^{3}$.

A transformation at its most abstract is the description of an algorithm that gets an object A to become an object B. In linear algebra the transformation is being done on vectors belonging to a domain (where vectors exist before the transformation) space $V$ and a range (where vectors exist after the transformation) space $W$. For the purposes of our work, these are both $R^{n}$, the Cartesian product n-times of the real line. (The standard (x,y) coordinate system is the Cartesian product of the real line twice, or $R^{2}$)

When dealing with vector spaces, our linear transformation can be represented by an $m$-by-$n$ matrix, where $m$ is the dimension of the vector space in which our original vector (or set of vectors) exists, and $n$ is the dimension of the space we are sending it into (for dimension reduction, $n$ is less than or equal to $m$). So if we have some set of $k$ vectors being transformed, the matrices will have row-by-column sizes: $$[k \times m]\,[m \times n] = [k \times n]$$ These maps can be scalings, rotations, shearings and more.
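A small sketch of the size bookkeeping, using plain Python lists so nothing beyond the standard library is assumed. The `matmul` and `shape` helpers and the example map `drop_z` are illustrative names only.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must agree"
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def shape(m):
    return (len(m), len(m[0]))


# k = 2 vectors living in m = 3 dimensions, one vector per row.
vectors = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]

# An m-by-n = 3-by-2 map that keeps x and y and drops z.
drop_z = [[1.0, 0.0],
          [0.0, 1.0],
          [0.0, 0.0]]

projected = matmul(vectors, drop_z)
print(shape(vectors), shape(drop_z), shape(projected))  # (2, 3) (3, 2) (2, 2)
```

Because the map is a matrix, it is automatically linear: transforming the sum of the two rows gives the same result as summing the two transformed rows.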

## What is a vector, what is an eigenvector?

Good question! A vector has a magnitude (which is just some positive number for anything that we are doing) and a direction (a property that is drawn from the vector space to which the vector belongs). Being told to walk 10 paces due north is to follow a vector with magnitude 10 and direction north. Vectors are presented as if they are anchored at the origin, and their head reflects their magnitude and direction. This allows some consistency when discussing things: when we are given the vector $(1,1)$ in $R^2$ we know it is anchored at the origin, and thus has magnitude (from the distance formula) of $\sqrt{2}$ and a direction 45 degrees from the horizontal axis. An eigenvector sounds a lot scarier than it is; the purpose of an eigenvector is to answer the question, ‘what vectors don’t change direction under a given linear transformation?’

This picture is stolen from Wikipedia, but it should be clear that the blue vector is an eigenvector of this transformation, while the red vector is not.

The standard equation given when an eigenvalue problem is posed is: $$Mv = \lambda v$$

$M$ is some linear transformation, $v$ is an eigenvector we are trying to find, and $\lambda$ is the corresponding eigenvalue.

From this equation, we can see that eigenvectors are not unique; direction is not a unique property of a vector. If we find a vector $v$ that satisfies this equation for our linear transformation $M$, then scaling the vector by some nonzero constant $\alpha$ gives another solution with the same eigenvalue, since $M(\alpha v) = \alpha M v = \lambda (\alpha v)$. The vector $(1,1)$ in $R^2$ has the same direction as $(2,2)$ in $R^2$. If one of these vectors isn’t subject to a direction change (and is therefore an eigenvector), then the other must be one as well, because the eigenvector-ness (yes, I just coined this phrase) applies to all vectors with the same direction.
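A quick numerical check of $Mv = \lambda v$ for a small symmetric matrix. The matrix and the helper names are arbitrary choices for illustration.

```python
def apply(matrix, vector):
    """Matrix-vector product M v for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def eigenvalue_if_eigenvector(matrix, vector, tol=1e-12):
    """Return lambda if M v = lambda v holds, otherwise None."""
    image = apply(matrix, vector)
    # Estimate lambda from the first nonzero coordinate of v ...
    index = next(i for i, x in enumerate(vector) if abs(x) > tol)
    lam = image[index] / vector[index]
    # ... and check that the same lambda works for every coordinate.
    if all(abs(y - lam * x) <= tol for x, y in zip(vector, image)):
        return lam
    return None


M = [[2.0, 1.0],
     [1.0, 2.0]]

print(eigenvalue_if_eigenvector(M, [1.0, 1.0]))   # 3.0
print(eigenvalue_if_eigenvector(M, [2.0, 2.0]))   # 3.0, same direction, same eigenvalue
print(eigenvalue_if_eigenvector(M, [1.0, 0.0]))   # None, the direction changes
```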

For those of you more familiar with Linear Algebra, this should not be confused with the fact that a linear transformation can have degenerate eigenvalues. An eigenvalue is degenerate when it is repeated, so that more than one independent direction shares it. For the transformations we work with here I will assume the eigenvalues are distinct, so we can ignore this.

## Statistics in ~300 words

Sticking to one dimension, plenty of data seems to be random while also favoring a central point. Consider the usual example of height across a population of people. Height can be thought of as a random variable. But this isn’t random in the way that people might think about randomness without some knowledge of statistics. There is a central point where the heights of people tend to cluster, and the likelihood of someone being taller or shorter than this central point decreases on a ‘bell-curve’.

This is called a normal distribution. Many datasets can be thought of as describing a set of outcomes for some random variable in nature. These outcomes are distributed in some fashion. In our example, the mean of the data is the average height over the entire population. The variance of our data describes how far the heights are spread out from the mean. When the distribution of outcomes follows a bell-curve, as it does in our example, the distribution is referred to as normal. (There are some more technical details; one is that, as for any probability distribution, the total area under the bell-curve is equal to one.)

When the data we want to describe reflects more than one random variable, this is a multivariate distribution. Statistics introduces the concept of covariance to describe the relationship that random variables have with one another. The magnitude of a covariance indicates in some fashion the strength of the relationship between random variables; it is not easily interpretable without some set of constraints on the covariances, which will come up in Principal Component Analysis. The sign of the covariance between two random variables (X,Y) indicates whether the two variables are inversely or directly related. A negative covariance between X and Y means that Y tends to decrease as X increases, while a positive covariance means that Y tends to increase as X increases. The normalized version of covariance is called the correlation coefficient and might be more familiar to those previously acquainted with statistics.
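A short sketch of covariance and correlation on made-up numbers, to show how the sign behaves. The data and variable names are invented for illustration only.

```python
import math


def mean(values):
    return sum(values) / float(len(values))


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


heights = [150.0, 160.0, 170.0, 180.0, 190.0]   # made-up data, in cm
shoe_sizes = [36.0, 39.0, 41.0, 44.0, 46.0]     # rises with height
seat_room = [30.0, 26.0, 24.0, 19.0, 17.0]      # falls with height

print(covariance(heights, shoe_sizes) > 0)   # True: directly related
print(covariance(heights, seat_room) < 0)    # True: inversely related
print(correlation(heights, shoe_sizes))      # close to +1
```

Note that `covariance(xs, xs)` is just the variance of `xs`, which is why the diagonal of a covariance matrix holds the variances.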

## Constrained Optimization

Again, please do look at the Feynman Lectures. If you’re unfamiliar with statistics, look at the Penn State material to better understand the ideas I just went over. The last subject I want to broach before getting into the details of PCA is optimization with Lagrange Multipliers.

Lagrange Multipliers are one method of solving a constrained optimization problem. Such a problem requires an objective function to be optimized and constraints to optimize against. An objective function is any function that we wish to maximize or minimize to reflect some target quantity achieving an optimum value. In short, the method of Lagrange Multipliers creates a system of equations such that we can solve for a term $\lambda$ that shows when the objective function achieves a maximum subject to the constraints. In the case of PCA, the function we want to maximize is $M^{T} cov(X) M$, describing the covariance of our data matrix $X$.
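A sketch of how the method plays out for a single unit vector $a$, assuming only that $cov(X)$ is symmetric (every covariance matrix is) and writing $\Sigma$ as shorthand for it. We maximize $a^{T} \Sigma a$ subject to $a^{T} a = 1$:

$$\mathcal{L}(a, \lambda) = a^{T} \Sigma a - \lambda (a^{T} a - 1)$$

$$\frac{\partial \mathcal{L}}{\partial a} = 2 \Sigma a - 2 \lambda a = 0 \quad \Longrightarrow \quad \Sigma a = \lambda a$$

$$a^{T} \Sigma a = \lambda \, a^{T} a = \lambda$$

So every stationary point is an eigenvector of $\Sigma$, the value of the objective there is the corresponding eigenvalue, and the maximum is reached at the eigenvector with the largest eigenvalue.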

## PCA

Although it is not quite a fortuitous circumstance, the principal components of the covariance matrix are precisely its eigenvectors. For a multivariate dataset generated by $n$ random variables, PCA will return a sequence of $n$ eigenvectors, each describing more covariance in the dataset than the next. The picture above presents an example of the eigenvectors reflecting the covariance of a 2-dimensional multivariate dataset.

To be perfectly honest, I don’t know a satisfying way to explain why the eigenvectors are the principal components. The best explanation I can come up with is that the algorithm for PCA is in correspondence with the algorithm for eigenvalue decomposition. It’s one of those things where I should be able to provide a proof for why the two problems are the same, but I cannot at the moment. (Commence hand-waving…)

Let’s look at the algorithm for Principal Component Analysis to better understand things. PCA is an iterative method that seeks to create a sequence of vectors that describe the covariance in a collection of data, each vector describing more than those that follow. This introduces an optimization problem, an issue that I referenced earlier. In order to guarantee uniqueness of our solution, this optimization is subject to constraints, handled using Lagrange Multipliers. Remember, the video provides an example of a constrained optimization problem from calculus. Principal Component Analysis finds this set of vectors by creating a linear transformation $M$ that maximizes the covariance objective function given below.

PCA’s objective function is $trace(M^{T} cov(X) M)$, and we are looking to maximize it. From the Penn State lectures:

Earlier in the course we defined the total variation of X as the trace of the variance-covariance matrix, or if you like, the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues.

In the first step of PCA, we can think of our objective function as: $$a^T cov(X) a$$

$a$ in this case is a single vector

We seek to maximize this term over $a$, subject to the sum of the squares of the coefficients of $a$ being equal to one. (In math terms this is saying that the $L^2$ norm is one.) This constraint is introduced to ensure that a unique answer is obtained (remember, eigenvectors are not unique; this is the same process one would undertake to get a unique sequence of eigenvectors from a decomposition!). In the second step, the objective function is: $$B^T cov(X) B$$

$B$ consists of $a$ and a new vector $b$

We look to choose $b$ such that it explains covariance not previously explained by the first vector. For this reason we introduce an optimization problem with the following constraints:

• $a$ remains the same
• The sum of the squares of the coefficients of $b$ equals one
• None of the covariance explained by vector $a$ is explained by $b$ (the two vectors are orthogonal, another property of eigenvectors!)

This process is repeated for the entire transformation $M$. This gives us a sequence of eigenvalues, each reflecting some fraction of the total variance; these fractions sum to one. For our previously mentioned n-dimensional multivariate dataset, generally some number $k \lt n$ of eigenvectors explain the total variance well enough to ignore the $n - k$ vectors remaining. This gives a set of principal components to investigate.
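The step-by-step description above can be sketched in plain Python. This version uses power iteration with an orthogonal projection against earlier components, which is one simple way to realize the constraints and is chosen here only for illustration; it is not how MDAnalysis or any particular library computes PCA. The data set and function names are made up.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(matrix, v):
    return [dot(row, v) for row in matrix]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def covariance_matrix(rows):
    """Sample covariance matrix; each row is one observation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / float(n) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def next_component(cov, previous, iterations=200):
    """Unit vector with the largest a^T cov a among those orthogonal to `previous`."""
    a = normalize([1.0 + i for i in range(len(cov))])  # arbitrary start
    for _ in range(iterations):
        a = matvec(cov, a)
        for p in previous:
            # Constraint: explain nothing already explained by p.
            overlap = dot(a, p)
            a = [x - overlap * y for x, y in zip(a, p)]
        a = normalize(a)  # constraint: squared coefficients sum to one
    return a, dot(a, matvec(cov, a))


def pca(rows, k):
    cov = covariance_matrix(rows)
    components, variances = [], []
    for _ in range(k):
        a, variance = next_component(cov, components)
        components.append(a)
        variances.append(variance)
    return components, variances


data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
components, variances = pca(data, 2)
total = sum(variances)
print([v / total for v in variances])  # fraction of variance per component
```

For this strongly correlated two-dimensional data set, almost all of the variance lands on the first component, which is the situation in which keeping only $k \lt n$ components is reasonable.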

It might follow that this corresponds to an eigenvalue problem: $$cov(X) M = M \Lambda$$ Here $\Lambda$ is the diagonal matrix of eigenvalues, so each column $v$ of $M$ satisfies $cov(X) v = \lambda v$.

If it doesn’t appear to be clear, let’s step back and look again at eigenvectors. As I said earlier, eigenvectors provide insight into what vectors don’t change direction under a linear map. We are trying to find a linearly independent set of vectors that provide insight into the structure of the covariance of our multivariate distribution. For a symmetric matrix such as a covariance matrix, eigenvectors belonging to different eigenvalues are orthogonal. The algorithm we described is precisely an algorithm one could use to find the eigenvectors of any symmetric linear transformation, not just the one used in Principal Component Analysis.

## Chemistry application and interpretation:

Before PCA is done on an MD simulation, we have to consider what goals we have for the analysis of results. We want to arrange the data output by PCA such that it gives us intuition into some physical behavior of our system. This is usually done on a single structure, and in order to focus on relative structural change rather than translational motion of the entire structure, the entire trajectory must be aligned to some structure of interest by minimizing the Root Mean Square Distance prior to analysis. After RMS alignment, any variance in the data should be variance due to changes in structure.

Cecilia Clementi has this quote in her Free Energy Landscapes paper:

Essentially, PCA computes a hyperplane that passes through the data points as best as possible in a least-squares sense. The principal components are the tangent vectors that describe this hyperplane

How do we interpret n-dimensional physical data from the tangent vectors of a k-dimensional hyperplane embedded in this higher dimensional space? This is all very mathematical and abstract. What we do is reduce the analysis to some visually interpretable subset of components, and see if there is any indication of clustering.

Remember, we have an explicit linear map relating the higher-dimensional space to the lower-dimensional space. By taking our trajectory and projecting it onto one of the eigenvector components of our analysis, we can extract embeddings in different ways. From Clementi again:

So, the first principal component corresponds to the best possible projection onto a line, the first two correspond to the best possible projection onto a plane, and so on. Clearly, if the manifold of interest is inherently non-linear the low-dimensional embedding obtained by means of PCA is severely distorted… The fact that empirical reaction coordinates routinely used in protein folding studies can not be reduced to a linear combination of the Cartesian coordinates underscores the inadequacy of linear dimensionality reduction techniques to characterize a folding landscape.

Again, this is all esoteric for the lay reader. What is a manifold, a reaction coordinate, a linear combination of Cartesian coordinates? All we should know is that PCA is a limited investigational tool for complex systems. The variance that the principal components explain should not necessarily be interpreted as physical parameters governing the behavior of a system.
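
To make the "explicit linear map" a bit more concrete, here is a minimal NumPy sketch of PCA as a projection. It is not the MDAnalysis implementation; the function name `pca_project` and the synthetic "trajectory" are assumptions made up for illustration, and a real analysis would start from RMS-aligned coordinates.

```python
import numpy as np

def pca_project(coords, n_components=2):
    """Project frames onto the leading principal components.

    coords: (n_frames, n_features) array, assumed to be already aligned.
    """
    # Centre the data so the covariance describes fluctuations around the mean.
    centered = coords - coords.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh is appropriate because a covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; sort them descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    # The linear map from the high-dimensional to the low-dimensional space.
    projection = centered @ components
    explained = eigvals[:n_components] / eigvals.sum()
    return projection, explained

# Synthetic stand-in for a trajectory: 500 frames, 6 coordinates,
# with most of the variance along the first coordinate.
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 6)) * np.array([5.0, 1.0, 0.5, 0.2, 0.1, 0.1])
projection, explained = pca_project(frames)
print(projection.shape)
print(explained)
```

Plotting the two columns of `projection` against each other is the kind of "visually interpretable subset of components" in which one would look for clustering.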

My mentor Max has a great Jupyter notebook up demonstrating PCA done on MD simulations here. All of these topics are covered in the notebook and should be relatively accessible if you understand what I’ve said so far. In my next post I will write about how I will be implementing PCA as a module in MDAnalysis.

-John

## June 28, 2016

### aleks_ (Statsmodels)

#### The GSoC story: to be continued : )

Hi everybody!

Yesterday evening I received the message that my summer of code will be continued. Many thanks to my mentors Kevin and Josef for believing in me! I will do my best to make a valuable contribution to the statsmodels project.

It also means that more posts will follow here - so stay tuned ;)

### John Detlefs (MDAnalysis)

#### A Note from the Author

The most common critique I received of my last blog post was that I alienated myself from most of my potential audience. In an email to my Summer of Code mentor Max Linke, I expressed my problem:

My number one worry in all of these matters is coming off as unrigorous or pseudo-scientific, and I think I probably overcompensate by being borderline inaccessible. I think this stems from some time spent enjoying a lack of rigor and the fun that is being pseudo-scientific.

For a while, after perusing blogs and social media, I thought, “Hey, I strongly identify with this ‘impostor syndrome’ thing.” Now I realize that’s a pretty ignorant and borderline insulting view to have. From what I can tell, I may have insecurities, but the difference between my anxieties and those of people who struggle with true ‘impostor syndrome’ is that someone has to have experienced tangible evidence that they are an outsider. As a straight white male, I don’t have these problems. So I guess in the future I will refrain from letting these anxieties be confused with something more serious. I have it pretty easy.

Going further, as a student and a tutor I noticed that far too often when people were in over their heads, they would get quiet and close off to the outside world. Especially in my math classes; a professor could be explaining Jordan Normal Forms, reciting proofs and corollaries and lemmas as if they were gospel, and although everyone was baffled, they would stay quiet. Nobody likes it when someone dominates a lecture with their own questions and at the same time a lot of people have missed fundamentals out of fear of sounding stupid. If I’m ever asked in a job interview to give a personal strength, it would be that I ask questions that might seem stupid with reckless abandon.

These posts are intended for people working to teach themselves some difficult topics. I apologize for being obtuse, abstract and abstruse earlier. I will do my best to teach things from an intuition-first standpoint from here on and provide resources for refreshing on math and statistics topics. Please get in touch with me if something I say is unclear or wrong; this blog is as much for my own education as it is for others.

-John

## June 27, 2016

### Yen (scikit-learn)

#### scikit-learn KMeans Now Supports Fused Types

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). This common technique is used in many fields, including image analysis and unsupervised document classification. In scikit-learn, clustering of unlabeled data can be performed with the module sklearn.cluster. However, in the current implementation of scikit-learn, one of the most popular clustering algorithms, KMeans, only supports float64 input data and will therefore implicitly convert other input data types, e.g., float32, into float64, which may cause serious memory waste.

Below, I’ll briefly introduce the KMeans algorithm and go through the work I’ve done during GSoC to make it memory-efficient.

## KMeans

KMeans is probably one of the most well-known clustering algorithms, since it is both effective and easy to implement.

To understand the KMeans algorithm, I think it is good to start from these figures, which clearly illustrate how KMeans works.

Training examples are shown as dots, and cluster centroids are shown as crosses.

• (a) Original dataset.
• (b) Random initial cluster centroids.
• (c-f) Illustration of running two iterations of k-means. In each iteration, we

1. Assign each training example to the closest cluster centroid (shown by “painting” the training examples the same color as the cluster centroid to which it is assigned)
2. Move each cluster centroid to the mean of the points assigned to it.

For more details, Andrew Ng’s course note is a good reference.
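
The two steps above can be written down in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation; the function name `kmeans`, the random initialization from data points, and the synthetic two-blob data set are assumptions chosen for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Plain k-means: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Random initial centroids, picked among the training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each example to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated synthetic blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```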

In scikit-learn, KMeans implements the algorithm described above, and MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.

## Memory Wasting Issues

However, the original implementations of these two algorithms in scikit-learn are not memory-efficient: they convert input data into np.float64, since the Cython implementation only supports double input data.

Here’s a simple test script which can help us identify the memory wasting issues:

import numpy as np
from scipy import sparse as sp
from sklearn.cluster import KMeans

# `profile` is injected by memory_profiler when the script is run via mprof
@profile
def fit_est():
    estimator.fit(X)

np.random.seed(5)
X = np.random.rand(200000, 20)

# Toggle the following comment to test np.float32 data
# X = np.float32(X)

# Toggle the following comment to test sparse data
# X = sp.csr_matrix(X)

estimator = KMeans()
fit_est()


You can run

mprof run <script>
mprof plot


to see the memory profiling results.

To save your effort, below is the result of memory profiling on my own computer:

• Dense np.float32 data & Dense np.float64 data

Unsurprisingly, these two kinds of input data have the same memory usage, which means that there is a huge waste when we pass np.float32 data into the original KMeans of scikit-learn, because it requires the same memory space as np.float64 data.

To solve this problem, we can introduce Cython fused types to avoid data copying.
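
To put a number on the waste, the following small NumPy snippet (independent of scikit-learn) compares the size of a float32 array of the same shape as in the test script with the float64 copy that an implicit conversion produces. The variable names are just for illustration.

```python
import numpy as np

# Same shape as the test script above, stored as float32.
X32 = np.random.rand(200000, 20).astype(np.float32)

# Converting to float64 allocates a second array that is twice as large,
# and the original float32 array still exists alongside it.
X64 = X32.astype(np.float64)

print(X32.nbytes / 1e6)  # 16.0 (MB)
print(X64.nbytes / 1e6)  # 32.0 (MB)
```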

## Enhanced Results

After PR #6846, both KMeans and MiniBatchKMeans support fused types and can therefore use np.float32 data as input directly.

Below are the memory profiling results comparison:

• Dense np.float32 data, before enhancement
• Dense np.float32 data, after enhancement
• Sparse np.float32 data, before enhancement
• Sparse np.float32 data, after enhancement

As one can see, introducing Cython fused types drastically reduces the memory usage of KMeans in scikit-learn, which can help us avoid unexpected memory waste.

Note that in the sparse case, I directly transform the dense array used before into a sparse array, which results in higher memory usage since the sparse format uses more space per nonzero value.

## Summary

Both KMeans and MiniBatchKMeans now support np.float32 input data in a memory-efficient way. Go grab your huge dataset and feed it into KMeans of scikit-learn now!

### Ravi Jain (MyHDL)

#### GSoC: Mid-Term Summary

Well, four weeks of GSoC are over, and it's time for the mid-term summary and replanning.

Mid-Term Summary:

• Studied MACs and how they work. Chose Xilinx User Guide 144 (1-GEMAC) as the interface guide and the reference Verilog design as the features guide.
• Completed setup of main repo providing the modular base for further development.
• Implemented Management Sub-block.
• Set up the repo with travis-ci build, landscape linting, coveralls.

So, comparing with the timeline in the proposal, I have achieved the targets of the first four weeks, with the management and TX engine modules switched.

Further Plans:

• Take three other sub-blocks mentioned in the proposal timeline and try and implement them in a week each (rather than two weeks as proposed).
• Implement wrapper blocks including FIFO and client in the next week.
• In the remaining weeks, hardware testing, refactoring code if necessary, setup of readthedocs shall be done.

#### Reaching mid-term

Mid-term is almost here (27th of June) and a lot has been completed, but there’s much yet to be done.

### What’s next?

The uploadTool is almost complete and does its job. However, many bears are still missing their requirements: they run scripts or other tools that are not easy to set up, since people use a lot of platforms that we simply cannot make bears for.

### How are we going to do this?

For the pip, npm and gem requirements, we will simply give the command to the user, upload the bears to PyPI and try to run that command once the user has installed the bear. This should work regardless of platform, provided you have pip, npm or gem installed.

### What about the others?

For the others there is conda (http://conda.pydata.org/docs/). Conda packages are much like PyPI packages and work on any platform. We can simply pack those scripts or compiled dependencies of bears into conda packages for all three important platforms (OS X, Linux, Windows), and when the user grabs a bear that has a conda requirement, it will be installed automatically for them. This conda requirement is going to be a class much like PipRequirement or GoRequirement.
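
As a rough idea of what such requirement classes could look like, here is a small sketch. Everything in it is hypothetical: the base class `PackageRequirement`, the method `install_command` and the package name `some-linter` are invented for illustration, and the `PipRequirement` shown here is a stand-in, not coala's actual class.

```python
class PackageRequirement:
    """Hypothetical base class: a package that a bear needs at runtime."""

    manager = None  # name of the package manager executable

    def __init__(self, package, version=None):
        self.package = package
        self.version = version

    def install_command(self):
        raise NotImplementedError


class PipRequirement(PackageRequirement):
    """Stand-in for a pip-based requirement (not coala's real class)."""

    manager = "pip"

    def install_command(self):
        # pip pins versions with "==".
        spec = self.package
        if self.version is not None:
            spec = f"{self.package}=={self.version}"
        return [self.manager, "install", spec]


class CondaRequirement(PackageRequirement):
    """Hypothetical conda-based requirement."""

    manager = "conda"

    def install_command(self):
        # conda pins versions with a single "="; --yes skips the prompt.
        spec = self.package
        if self.version is not None:
            spec = f"{self.package}={self.version}"
        return [self.manager, "install", "--yes", spec]


requirement = CondaRequirement("some-linter", "1.2")
print(" ".join(requirement.install_command()))
# conda install --yes some-linter=1.2
```

Returning the command as a list keeps it usable both for showing it to the user and for running it with `subprocess`.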

Once this is done, this process should be ready, and I can move on to the installation part of the bears, with the nice installation tool and so on.

### sahmed95 (dipy)

#### Two stage fitting, 6 decimal accuracy and the power of the Jacobian

Hi, so the past few days have been busy with testing the code and writing a basic example for the module so far. You can check out the IPython notebook to get a better idea of how to use the module.

We implemented a two-stage fitting of the data, which is the method followed in both of the papers we have referred to in this project (Le Bihan and Federau). The basic idea is that the bi-exponential function of the IVIM model can be approximated by a single exponential as the b-values go higher. In particular, we already have a DTI tensor fitting model in dipy, which has been used to approximate the value of the parameter D for b-values greater than 200. We also approximate the value of the perfusion fraction f by calculating the intercept from the exponential decay signal and then using the IVIM function to get our f_guess as (1 - S_intercept/S(b=0)).
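
The first stage can be sketched as follows. Note the swapped-in technique: instead of dipy's DTI tensor model, this sketch uses a plain log-linear fit with `np.polyfit` on the high b-values, which is a simplification for a single scalar signal. The function names, the b-values and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def ivim_signal(b, S0, f, D_star, D):
    """Bi-exponential IVIM model."""
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

def first_stage_guess(bvals, signal, split_b=200.0):
    """Estimate D and f from the high b-value part of the signal."""
    high = bvals > split_b
    # For high b the perfusion term has decayed, so
    # log S ~ log(S0 * (1 - f)) - b * D, a straight line in b.
    slope, intercept = np.polyfit(bvals[high], np.log(signal[high]), 1)
    D_guess = -slope
    S_intercept = np.exp(intercept)
    S_b0 = signal[bvals == 0].mean()
    f_guess = 1 - S_intercept / S_b0
    return f_guess, D_guess

bvals = np.array([0, 10, 20, 30, 40, 60, 80, 100, 200,
                  300, 400, 500, 600, 700, 800, 1000], dtype=float)
signal = ivim_signal(bvals, S0=1.0, f=0.1, D_star=0.02, D=0.001)
f_guess, D_guess = first_stage_guess(bvals, signal)
print(f_guess, D_guess)
```

On this noiseless synthetic signal the guesses land close to the values used to generate it (f = 0.1, D = 0.001), which is what makes them useful starting points for the second, full fit.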

This week saw the frustrating problem of getting test cases to pass for the minimize function from scipy.optimize. Unless "good" parameters were selected to generate the data, the tests were not passing up to 6 decimal places. However, leastsq always seemed to perform better and generated parameters matching up to 6 decimals. This could be due to the difference in how leastsq and minimize arrive at the minimum. leastsq uses the Levenberg-Marquardt algorithm as implemented in MINPACK, while minimize gives the option of selecting from a variety of bounded and unbounded minimization algorithms such as L-BFGS-B and truncated Newton (TNC).

We are in favour of using minimize since it gives the flexibility to select bounds and specify fitting algorithms. After playing around with the values of parameters for generating our test signal, the tests passed.
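
The difference between the two interfaces can be seen in a small side-by-side sketch. This is not the dipy code: the parameter order, the starting point `x0`, the bounds and the synthetic signal are assumptions made for the example, and no claim is made here about how many decimals either fit recovers.

```python
import numpy as np
from scipy.optimize import leastsq, minimize

def ivim(params, b):
    S0, f, D_star, D = params
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

def residuals(params, b, signal):
    # leastsq expects the vector of residuals.
    return ivim(params, b) - signal

def cost(params, b, signal):
    # minimize expects a scalar, here the sum of squared residuals.
    return np.sum(residuals(params, b, signal) ** 2)

bvals = np.array([0, 10, 20, 30, 40, 60, 80, 100, 200,
                  300, 400, 500, 600, 700, 800, 1000], dtype=float)
true_params = np.array([1.0, 0.1, 0.02, 0.001])
signal = ivim(true_params, bvals)
x0 = np.array([1.1, 0.15, 0.03, 0.0012])

# Levenberg-Marquardt (MINPACK); no bounds can be given.
fit_lsq, _ = leastsq(residuals, x0, args=(bvals, signal))

# L-BFGS-B accepts box bounds on every parameter.
bounds = [(0.0, None), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
result = minimize(cost, x0, args=(bvals, signal),
                  method="L-BFGS-B", bounds=bounds)
fit_min = result.x

print(fit_lsq)
print(fit_min)
```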

The next step is to implement a Jacobian for faster fitting. A math gist with the Jacobian worked out can be seen here : http://mathb.in/64905?key=774f1d2b7c71358b4cf6dd0e6e4f5de3a5b5fbe3
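
For reference, here is a sketch of an analytic Jacobian for the bi-exponential model written as S = S0 (f exp(-b D*) + (1 - f) exp(-b D)), together with a finite-difference check. The parameter order (S0, f, D*, D) and the function names are assumptions for this example and may differ from the parameterization used in the gist or in dipy.

```python
import numpy as np

def ivim(params, b):
    S0, f, D_star, D = params
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

def ivim_jacobian(params, b):
    """Partial derivatives of the signal, one column per parameter."""
    S0, f, D_star, D = params
    fast = np.exp(-b * D_star)
    slow = np.exp(-b * D)
    return np.column_stack([
        f * fast + (1 - f) * slow,   # dS/dS0
        S0 * (fast - slow),          # dS/df
        -S0 * f * b * fast,          # dS/dD*
        -S0 * (1 - f) * b * slow,    # dS/dD
    ])

def numerical_jacobian(params, b, step=1e-6):
    """Central finite differences, used only to check the analytic form."""
    params = np.asarray(params, dtype=float)
    columns = []
    for i in range(len(params)):
        shift = np.zeros_like(params)
        shift[i] = step
        diff = ivim(params + shift, b) - ivim(params - shift, b)
        columns.append(diff / (2 * step))
    return np.column_stack(columns)

bvals = np.array([0, 20, 50, 100, 200, 400, 800, 1000], dtype=float)
params = np.array([1.0, 0.1, 0.02, 0.001])
analytic = ivim_jacobian(params, bvals)
numeric = numerical_jacobian(params, bvals)
print(np.max(np.abs(analytic - numeric)))
```

Both `scipy.optimize.leastsq` (through its `Dfun` argument) and `scipy.optimize.minimize` (through `jac`, which expects the gradient of the scalar cost) can use analytic derivatives in place of their own finite-difference estimates.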

#### GSoC week 5 roundup

@cfelton wrote:

Last week we had the mid-term reviews. Unfortunately, there was
a communication error and many of our mentors are not marked as
mentors in the GSoC system. @mentors: in the future, we need to get
our reviews in 96 hours before the GSoC deadline. PSF requires
48 hours before (for review) and I require 48 hours for review.
Please be respectful of everyone's time and don't wait until
the last minute to do the reviews.

Consistent progress was made on the projects this week by all
students.

Student week 5 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 26-Jun, >5, N
@Vikram9866: 25-Jun, >5, N

riscv:
health 96%, coverage 91%
@meetshah1995: 24-Jun, >5, Y

hdmi:
health 94%, coverage 90%
@srivatsan: 11-Jun, >5, Y

gemac:
health 87%, coverage 89%
@ravijain056: 17-Jun, 3, Y

pyleros:
health missing, coverage 70%
@forumulator: 26-Jun, >5, Y

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

### Upendra Kumar (Core Python)

#### Multithreading with tkinter

Recently, I got stuck on a problem that was very new to me: updating the GUI in Tkinter while long-running processes are executing (like running a time-consuming loop, waiting for a process to complete and return, or fetching something from a URL). For processes that take a long time to complete, Tkinter blocks other GUI events. Because of this, updates to GUI elements only happen once the process returns after completing execution.

Earlier, I didn’t know about the ‘thread safe’ property of Python's Tkinter. The main caveat is that we can’t update GUI elements from multiple threads. Once the main thread starts mainloop(), we can never use another thread to update the GUI. However, we can easily run background processes in other threads. Even then, we need to invoke a GUI function whenever the process in the background thread finishes and returns its result, and a GUI function can only be invoked from the thread executing mainloop().

Therefore, after reading some online resources on StackOverflow and other Python blogs, I came to know about a design pattern used to solve this problem. Instead of invoking a GUI function whenever the result is returned, we maintain a shared queue. The contents of the queue are shared between the thread executing mainloop() and the thread running the background process. When the background process ends, its thread puts the result in the queue. On the other side, the thread executing mainloop() periodically checks the contents of the shared queue. In my case, I couldn’t understand this concept of a ‘shared queue’ just by reading about it, so let’s go through a small piece of code to understand it better.

import queue
import threading
import tkinter as tk


def runloop(thread_queue=None):
    '''
    After result is produced put it in queue
    '''
    result = 0
    for i in range(10000000):
        # Do something with result
        result += 1
    thread_queue.put(result)


class MainApp(tk.Tk):

    def __init__(self):
        ####### Do something ######
        super().__init__()
        # Queue shared between the GUI thread and the background thread
        self.thread_queue = queue.Queue()
        self.myframe = tk.Frame(self)
        self.myframe.grid(row=0, column=0, sticky='nswe')
        self.mylabel = tk.Label(self.myframe)  # Element to be updated
        self.mylabel.config(text='No message')
        self.mylabel.grid(row=0, column=0)
        self.mybutton = tk.Button(
            self.myframe,
            text='Change message',
            command=self.update_text)
        self.mybutton.grid(row=1, column=0)

    def update_text(self):
        '''
        Spawn a new thread for running long loops in background
        '''
        self.mylabel.config(text='Running loop')
        self.new_thread = threading.Thread(
            target=runloop,
            kwargs={'thread_queue': self.thread_queue})
        self.new_thread.start()
        self.after(100, self.listen_for_result)

    def listen_for_result(self):
        '''
        Check if there is something in the queue
        '''
        try:
            # Non-blocking read: raises queue.Empty if nothing is there yet
            self.res = self.thread_queue.get(block=False)
            self.mylabel.config(text='Loop terminated')
        except queue.Empty:
            self.after(100, self.listen_for_result)


if __name__ == "__main__":
    main_app = MainApp()
    main_app.mainloop()


Here, we may need to disable the button while the loop is running, because clicking mybutton repeatedly would create multiple new threads.
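
The shared-queue pattern itself does not depend on Tkinter, so it can also be tried without a GUI. In the sketch below, the polling loop with `time.sleep` stands in for the repeated `after(100, ...)` calls of the main loop; the function names `worker` and `poll` are made up for this example.

```python
import queue
import threading
import time

def worker(result_queue):
    # Background thread: compute something, then hand the result over.
    total = sum(range(1000))
    result_queue.put(total)

def poll(result_queue, interval=0.01, max_polls=1000):
    # "Main loop" side: check the queue periodically without blocking.
    for _ in range(max_polls):
        try:
            return result_queue.get_nowait()
        except queue.Empty:
            time.sleep(interval)
    raise TimeoutError("worker did not finish in time")

result_queue = queue.Queue()
thread = threading.Thread(target=worker, args=(result_queue,))
thread.start()
result = poll(result_queue)
thread.join()
print(result)  # 499500
```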

## June 26, 2016

### Ramana.S (Theano)

#### Second Fortnight update

The second fortnight blog post update:
It's almost a month into the coding phase of GSoC. The new global optimizer is built and the clean-up work on the PR (pull request) is also done. The PR should be merged next week, and there are a few follow-up tasks on the current PR.
There is another significant improvement over the profiling results that I shared earlier. After a few simplifications in the computation of convolutional operators, there is a 10 sec improvement in the optimizer, and the optimization time for training SBRNN is now ~20 sec.
Currently, there are a few clean-up tasks on this PR. If a node is on the CPU, the output variables of that node are on the CPU, and they are the input variables of other nodes. Since the input variables of those next nodes are not on the GPU, the nodes are not transferred to the GPU, and the same then holds for all the nodes up to the graph's output node, which makes the compilation time large. There are two ways to fix this. The first is to be aggressive, meaning transferring all the nodes to the GPU irrespective of whether their input variables are GPUVariables or not. The second is to do a backward pass on the graph, lifting nodes to the GPU if their Ops have an implementation on the GPU, and continuing the transfer from the node that hasn't been transferred. The current thinking is to adopt one method in the fast_compile mode and the other in the fast_run mode.
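
The problem with a node stuck on the CPU can be illustrated with a toy graph. This is not Theano code: the tuple-based graph representation, the two function names and the rule "lift only if all inputs are already lifted" are simplifying assumptions made for this sketch, and only the aggressive strategy is shown as the alternative.

```python
def lift_conservative(nodes):
    """Lift a node only if it has a GPU implementation and all of its
    inputs were already lifted. `nodes` is a list of
    (name, inputs, has_gpu_impl) tuples in topological order."""
    on_gpu = set()
    for name, inputs, has_gpu_impl in nodes:
        if has_gpu_impl and all(i in on_gpu for i in inputs):
            on_gpu.add(name)
    return on_gpu

def lift_aggressive(nodes):
    """Lift every node that has a GPU implementation, regardless of
    where its inputs live."""
    return {name for name, _, has_gpu_impl in nodes if has_gpu_impl}

# A chain in which one op in the middle has no GPU implementation.
graph = [
    ("conv", [], True),
    ("custom_op", ["conv"], False),
    ("elemwise", ["custom_op"], True),
    ("reduce", ["elemwise"], True),
]

print(sorted(lift_conservative(graph)))  # ['conv']
print(sorted(lift_aggressive(graph)))    # ['conv', 'elemwise', 'reduce']
```

In the conservative pass, the single CPU-only node keeps everything downstream of it on the CPU, which is the situation described above.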

Here is a comparison of the efficiency of the new optimizer at the end of the last fortnight and at the end of last week:

The result of last fortnight
361.601003s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 40438) - 0.000s
GraphToGPUOptimizer          gpuarray_graph_optimization
time io_toposort 1.066s
Total time taken by local optimizers 344.913s
times - times applied - Node created - name:
337.134s - 455 - 9968 - local_abstractconv_cudnn_graph
7.127s - 1479 - 1479 - local_gpua_careduce
0.451s - 15021 - 19701 - local_gpu_elemwise
0.119s - 12 - 36 - local_gpuaalloc
0.044s - 4149 - 4149 - local_gpua_dimshuffle
0.020s - 84 - 84 - local_gpua_incsubtensor
0.015s - 1363 - 1642 - local_gpua_subtensor_graph
0.001s - 194 - 239 - local_gpureshape
0.000s - 6 - 6 - local_gpua_split
0.000s - 9 - 9 - local_gpua_join
0.000s - 1 - 2 - local_gpua_crossentropysoftmaxargmax1hotwithbias
0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
0.002s - in 2 optimization that were not used (display only those with a runtime > 0)
0.001s - local_lift_abstractconv2d
0.001s - local_gpua_shape

The result by the end of last week,
25.080994s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 31624) - 0.000s
GraphToGPUOptimizer          gpuarray_graph_optimization
time io_toposort 1.204s
Total time taken by local optimizers 7.658s
times - times applied - Node created - name:
7.059s - 1479 - 1479 - local_gpua_careduce
0.498s - 14507 - 21118 - local_gpu_elemwise
0.038s - 2761 - 2761 - local_gpua_dimshuffle
0.022s - 84 - 84 - local_gpua_incsubtensor
0.020s - 455 - 455 - local_lift_abstractconv2d_graph
0.012s - 533 - 533 - local_gpua_shape_graph
0.004s - 57 - 114 - local_gpua_mrg1
0.002s - 104 - 104 - local_gpua_subtensor_graph
0.001s - 194 - 194 - local_gpureshape
0.001s - 12 - 24 - local_gpuaalloc
0.000s - 147 - 147 - local_gpua_dot22
0.000s - 6 - 6 - local_gpua_split
0.000s - 9 - 9 - local_gpua_join
0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
0.000s - 1 - 1 - local_gpua_crossentropysoftmaxargmax1hotwithbias
0.000s - in 1 optimization that were not used (display only those with a runtime > 0)

The improvement in the time taken by the optimizer is immense! I line-profiled all the functions that local_gpu_elemwise calls and found that the slow-down happens only with the high verbosity flag, because of a call to a print method, without any error being raised. Fixing it gave a great speedup of the optimizer!
The plan for implementing a new global (AND or OR) local optimizer in Theano for node replacement is almost done, and the implementation will begin soon. This will mainly target speeding up the 'fast_run' flag.
Finally, I'll also be working on removing ShapeOptimizer from the fast_compile phase. The work will go in parallel with building the new optimizer, on top of the GraphToGPU optimizer.

That's it for now!

### Raffael_T (PyPy)

#### Progress summary of additional unpacking generalizations

Currently there's only so much to tell about my progress. I fixed a lot of errors and progressed quite a bit on the unpacking task. The problems regarding the AST generator and handler are solved. There's still an error when trying to run PyPy, though. Debugging and looking for errors is quite a tricky undertaking because there are many ways to check what's wrong. What I have already done is check and review the whole code for this task, and it is as good as ready as soon as that (hopefully) last error is fixed. This will probably be done by comparing the bytecode instructions of CPython and PyPy, but I still need a bit more info.

As a short description of what I implemented: until now the order of arguments allowed in function calls was predefined. The order was: positional args, keyword args, * unpack, ** unpack. The reason for this was simplicity, because people unfamiliar with this concept might get confused otherwise. Now everything is allowed, setting aside that earlier concern about confusion (as described in PEP 448). So what I had to do was check arguments for unpackings manually, first going through positional and then keyword arguments. Of course some sort of priority has to stay intact, so it is defined that "positional arguments precede keyword arguments and * unpacking; * unpacking precedes ** unpacking" (PEP 448). Pretty much all changes needed for this task are implemented; there's only one more fix and a bytecode instruction (map unpack with call), which is less important than the others, still to be done.
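
Here is a short example of the call syntax that PEP 448 enables, runnable on Python 3.5 or later. The helper name `collect` is made up for this example. The last part shows that one ordering rule still holds: a plain positional argument after a keyword argument is rejected by the compiler.

```python
def collect(*args, **kwargs):
    """Return the arguments exactly as the function received them."""
    return args, kwargs

# Several * and ** unpackings in one call, mixed with plain arguments.
result = collect(*[1, 2], 3, *(4,), **{'a': 1}, b=2, **{'c': 3})
print(result)  # ((1, 2, 3, 4), {'a': 1, 'b': 2, 'c': 3})

# A positional argument after a keyword argument is still a syntax error.
try:
    compile("collect(a=1, 2)", "<example>", "eval")
    still_rejected = False
except SyntaxError:
    still_rejected = True
print(still_rejected)  # True
```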

As soon as it works, I will write the next entry in this blog. Next in line are asyncio coroutines with async and await syntax.

Short Update (25.06.): Because of the changes I had to make in PyPy, pretty much all tests failed for some time, as function calls weren't handled properly. I managed to reduce the number of failing tests by about 1/3 by fixing a lot of errors, so only a little is missing for the whole thing to work again.

Update 2 (26.06.): And the errors are all fixed! As soon as all opcodes are implemented (that's gonna be soon), I will write the promised next blog entry.

### Pranjal Agrawal (MyHDL)

#### Week 5 Summary - Code cleanup and test coverage

I am back again with my weekly updates. This week was the midterm evaluation week. Thus, the first half of the week went into code cleanup and increasing code health. This was in order to make a pull request from the core branch to master, which will be evaluated for the midterms. After that, starting Thursday, I worked on the dev-exp branch, where I focused on writing more tests and eliminating errors, module by module.

### Test Coverage

Test coverage measures how much of the code is exercised by the tests written. Loosely, this corresponds to how many of the lines of code or possible paths the tests cover. Test coverage is absolutely vital before deploying any code, because, once it's all integrated together, if a subtle error occurs, it becomes very difficult to determine where the bug is. Good test coverage helps avoid that by testing each small part of the code individually, which can help us 'catch' bugs.

In a TDD based development flow, we write some of the tests first, to get an idea of the interface to whatever we are going to code, and also an idea of how the actual code would be. However, it becomes difficult sometimes to test all the possible paths the code is going to take before we write the code itself.

Thus, I mainly wrote the tests for the fetch/decode and execute modules of pyleros. It is a bit tricky to write individual tests, because the pipeline stages were only designed to work with one another (synchronously) and not separately. In the design, the data flows automatically from fedec to execute and vice versa. Thus, to design tests for, say, fedec, I had to somewhat emulate parts of the other module (execute) in my test code. For example, to test the add instruction, I initialize all the IN/OUT signals required, and then instantiate the myhdl.block for fedec. I also initialise the alu. Then, I pass the OUT signals of the fedec to the IN of the alu, and finally make an assertion on the result. At the input of the fedec, I give the corresponding instructions, and then test one by one whether they produce correct results from the alu. This tests whether the fetching and decoding of the instruction is happening correctly. In this way, the instruction set is divided into different classes, and each class is tested. A similar procedure is used for the execute module.
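
The wiring idea (feed instructions into the decoder, pass its outputs to the ALU, assert on the result) can be shown with a plain-Python toy. This is neither MyHDL nor the pyleros code: the 16-bit encoding, the opcode values and the function names are all invented for this sketch.

```python
# Invented encoding for this sketch: high byte = opcode, low byte = immediate.
OP_ADD = 0x08
OP_SUB = 0x0C

def decode(instruction):
    """Stand-in for the fetch/decode stage: produce control signals."""
    return {"op": instruction >> 8, "imm": instruction & 0xFF}

def alu(acc, control):
    """Stand-in for the ALU in the execute stage (16-bit accumulator)."""
    if control["op"] == OP_ADD:
        return (acc + control["imm"]) & 0xFFFF
    if control["op"] == OP_SUB:
        return (acc - control["imm"]) & 0xFFFF
    raise ValueError("unknown opcode")

def run(program, acc=0):
    # Wire the decoder's outputs straight into the ALU's inputs.
    for instruction in program:
        acc = alu(acc, decode(instruction))
    return acc

def test_add_class():
    # One class of instructions: add immediate.
    assert run([(OP_ADD << 8) | 5]) == 5
    assert run([(OP_ADD << 8) | 5, (OP_ADD << 8) | 7]) == 12

def test_sub_class():
    # Another class: subtract immediate, including wrap-around.
    assert run([(OP_ADD << 8) | 5, (OP_SUB << 8) | 2]) == 3
    assert run([(OP_SUB << 8) | 1]) == 0xFFFF

test_add_class()
test_sub_class()
```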

### Future Plan

In the next week, I plan to:
1.) Further increase test coverage, ideally taking it > 95%.
2.) Write examples for the simulation and test that they work correctly.
3.) Refactor the code to use interfaces and some of the other higher level features of myhdl.

This should have my simulation part ready in the next week. The week after that will be focused on hardware setup and testing.

All in all, the work is going pretty well and is thoroughly enjoyable.

# Midterm Summary

This midterm was full of surprises and challenges for me. During it I was working on two features that I should add to Splash:

• splash:with_timeout
• Element class

# with_timeout

## Several return values

splash:with_timeout is the first API that I’ve implemented this summer. Originally, it was planned to be a simple flag for splash:go, but it became something bigger and more practical. You can read more about it in my previous post.

The other challenge that occurred during development was the ability to return several values from the passed function. Let’s look at this example:

function main(splash)
    local ok, result1, result2, result3 = splash:with_timeout(function()
        return 1, 2, 3
    end, 0.1)

    return result1, result2, result3
end


Without any change this script will return [[1, 2, 3], None, None]. This happens because the result values of the function passed to splash:with_timeout are converted to a Python list, which is then passed back to Lua, where it becomes a table.

To fix that I wrote a Lua wrapper which packs the result values of the function and then unpacks them again so that they look as they did originally. Before doing this task I didn’t have much experience with Lua or with how the Lua-to-Python and Python-to-Lua conversion works in Splash. That’s why I spent more than a week on it.

## Docs

Another thing in which I didn’t have much experience is writing documentation. In Splash (as in any other open-source project) you should write comprehensive documentation for the API you implement, with examples and explanations of how it works.

I want to mention one aspect of Splash. Its Lua scripting engine is implemented using a custom-written event loop. Because of that, splash:with_timeout may not stop the running function when the timeout expires (you can do some blocking operation which will stop the entire event loop). That aspect should be covered in the docs and explained as simply as possible, which is not very easy.

# Element

The Element class is supposed to be a wrapper for a DOM element with utility methods. Last week I started working on it. One of the interesting parts of this API is how the DOM element is stored in Python and Lua. I’ve decided to do it in the following way:

1. When the JS window object of the page is created I assign to it a special object for storing DOM elements.
2. An Element is created using a CSS selector. The selector is passed to JS, and the DOM element is retrieved using document.querySelector and stored under a UUID in the storage created in step 1.
3. That UUID is passed to Python, where it is assigned to a property of the element object.
4. The further operations are performed using that UUID.

I have only implemented the steps described above, and I’m going to implement the Lua interface for performing those steps.
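
The idea of referring to an element only through a UUID can be sketched in a few lines of Python. This is not Splash code: in Splash the storage lives on the JS window object, whereas the class `ElementStorage` and the dictionary standing in for a DOM element below are invented for illustration.

```python
import uuid

class ElementStorage:
    """Toy stand-in for the per-page storage of DOM elements."""

    def __init__(self):
        self._elements = {}

    def store(self, element):
        # Generate a fresh identifier and keep the element under it.
        element_id = str(uuid.uuid4())
        self._elements[element_id] = element
        return element_id

    def get(self, element_id):
        # Later operations only need the identifier.
        return self._elements[element_id]

storage = ElementStorage()
element_id = storage.store({"tag": "a", "href": "/next"})
print(element_id)
print(storage.get(element_id))
```

Only the string identifier has to cross the boundary between the languages; the element itself stays where it was created.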

Thank you for reading. See you next time :wink:

## June 25, 2016

### Vikram Raigur (MyHDL)

#### GSoC Mid Term summary

This post gives a brief summary of my GSoC experience so far.

I finished a Run Length Encoder by my mid-term evaluation, and the Quantizer module is still in progress. I have made a divider for the Quantizer module; I still have to make a top-level module to finish the Quantizer.

I made a PR for my work in the main repo this week. The PR had 40 commits.

The Run Length Encoder takes 8×8 pixel data and outputs the run-length-encoded data.
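
As a software illustration of the (runlength, size, amplitude) triples, here is a simplified Python sketch for a single block. It is not the MyHDL module. The convention that a run of 15 zeros is emitted as (15, 0, 0) is inferred from the sample output in this post (baseline JPEG's ZRL symbol stands for 16 zeros), and the sketch ignores the DC differencing between blocks and the encoding of negative amplitudes. The input is the first sample block, red_pixels_1, from the example that follows.

```python
def run_length_encode(block, max_run=15):
    """Encode one block as (runlength, size, amplitude) triples."""
    # Index of the last non-zero value; the zeros after it are covered
    # by the end-of-block marker.
    last = max((i for i, v in enumerate(block) if v != 0), default=-1)
    triples = []
    run = 0
    for value in block[:last + 1]:
        if value == 0:
            run += 1
            if run == max_run:
                # A full run of zeros without a non-zero value after it.
                triples.append((max_run, 0, 0))
                run = 0
        else:
            # size = number of bits needed for the amplitude.
            triples.append((run, abs(value).bit_length(), value))
            run = 0
    triples.append((0, 0, 0))  # end-of-block marker
    return triples

red_pixels_1 = [
    1, 12, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 10, 2, 3, 4, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0,
]

encoded = run_length_encode(red_pixels_1)
for runlength, size, amplitude in encoded:
    print("runlength", runlength, "size", size, "amplitude", amplitude)
```

For this block the sketch emits a single end-of-block triple at the end, whereas the module's sample output lists (0, 0, 0) twice.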

Example output for Run Length Encoder Module:

Sample input:

red_pixels_1 = [
1, 12, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 10, 2, 3, 4, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0
]

red_pixels_2 = [
0, 12, 20, 0, 0, 2, 3, 4,
0, 0, 2, 3, 4, 5, 1, 0,
0, 0, 0, 0, 0, 0, 90, 0,
0, 0, 0, 10, 0, 0, 0, 9,
1, 1, 1, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
]

green_pixels_1 = [
11, 12, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 10, 2, 3, 4, 0,
0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0
]

green_pixels_2 = [
13, 12, 20, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 32, 4, 2
]

blue_pixels_1 = [
11, 12, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1
]

blue_pixels_2 = [
16, 12, 20, 0, 0, 2, 3, 4,
0, 0, 2, 3, 4, 5, 1, 0,
0, 0, 0, 0, 0, 0, 90, 0,
0, 0, 0, 10, 0, 0, 0, 9,
1, 1, 1, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 1,
1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 32, 4, 2
]

Sample output:

============================
runlength 0 size 1 amplitude 1
runlength 0 size 4 amplitude 12
runlength 15 size 0 amplitude 0
runlength 2 size 4 amplitude 10
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 7 size 1 amplitude 1
runlength 0 size 0 amplitude 0
runlength 0 size 0 amplitude 0
============================
runlength 0 size 1 amplitude -2
runlength 0 size 4 amplitude 12
runlength 0 size 5 amplitude 20
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 7 size 7 amplitude 90
runlength 4 size 4 amplitude 10
runlength 3 size 4 amplitude 9
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 0 amplitude 0
runlength 0 size 0 amplitude 0
=============================

runlength 0 size 4 amplitude 11
runlength 0 size 4 amplitude 12
runlength 15 size 0 amplitude 0
runlength 2 size 4 amplitude 10
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 5 size 1 amplitude 1
runlength 5 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 14 size 1 amplitude 1
runlength 0 size 0 amplitude 0
runlength 0 size 0 amplitude 0
==============================
runlength 0 size 2 amplitude 2
runlength 0 size 4 amplitude 12
runlength 0 size 5 amplitude 20
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 7 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 3 size 1 amplitude 1
runlength 0 size 6 amplitude 32
runlength 0 size 3 amplitude 4
runlength 0 size 2 amplitude 2
runlength 0 size 0 amplitude 0
==============================
runlength 0 size 4 amplitude 11
runlength 0 size 4 amplitude 12
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 3 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 15 size 0 amplitude 0
runlength 2 size 1 amplitude 1
runlength 0 size 0 amplitude 0
==============================
runlength 0 size 3 amplitude 5
runlength 0 size 4 amplitude 12
runlength 0 size 5 amplitude 20
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 7 size 7 amplitude 90
runlength 4 size 4 amplitude 10
runlength 3 size 4 amplitude 9
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 1 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 6 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 3 size 1 amplitude 1
runlength 0 size 6 amplitude 32
runlength 0 size 3 amplitude 4
runlength 0 size 2 amplitude 2
runlength 3 size 4 amplitude 9
==============================

If the module counts more than 15 consecutive zeros, it stalls the inputs.
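As a software reference, here is a minimal sketch of the (runlength, size, amplitude) encoding. It follows the usual baseline JPEG conventions as an assumption: size is the bit length of the absolute amplitude, a run longer than 15 is broken up with a (15, 0, 0) symbol standing for 16 zeros, and (0, 0, 0) marks the end of a block that finishes in zeros. The hardware module's output, including its handling of the first coefficient, may differ in detail, so this is not expected to reproduce the listing above line for line. The function name is hypothetical.

```python
def run_length_encode(coefficients):
    """Encode a sequence of integers as (runlength, size, amplitude) triples."""
    symbols = []
    run = 0  # number of zeros seen since the last non-zero value
    for value in coefficients:
        if value == 0:
            run += 1
            continue
        # A run field is only 4 bits wide, so long runs are split up.
        while run > 15:
            symbols.append((15, 0, 0))  # stands for 16 consecutive zeros
            run -= 16
        symbols.append((run, abs(value).bit_length(), value))
        run = 0
    if run > 0:
        symbols.append((0, 0, 0))  # end of block: only zeros remain
    return symbols


short_example = run_length_encode([12, 0, 0, 10, 0, 0])
long_run_example = run_length_encode([0] * 17 + [3])
```

Here `short_example` is `[(0, 4, 12), (2, 4, 10), (0, 0, 0)]` and `long_run_example` is `[(15, 0, 0), (1, 2, 3)]`.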

I tried to git rebase my repo and messed things up. Now everything seems fine.

As per my timeline, I have to finish the Quantizer and Run Length Encoder by the 28th of this month. I hope to finish them on time.

New Checkpoints:

Quantizer : 30th June

Huffman : 7th July

Byte Stuffer : 15th July

JFIF Header Generator : 25th July

Control Unit and Documentation : Remaining time.

Today I learned that, when indexing a range, Python excludes the upper bound whereas Verilog includes it.
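A small sketch of the difference, using only plain Python. `verilog_part_select` is a hypothetical helper written for this illustration; it mimics a Verilog part select such as `value[3:0]`, where both bounds are included.

```python
def verilog_part_select(value, msb, lsb):
    """Mimic Verilog's value[msb:lsb], where both msb and lsb are included."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


data = [10, 11, 12, 13, 14, 15, 16, 17]
first_four = data[0:4]  # Python: indices 0, 1, 2, 3 (index 4 is excluded)

byte = 0b10110110
low_nibble = verilog_part_select(byte, 3, 0)  # Verilog style: bits 3, 2, 1, 0
```

Both expressions select four items, but the Python slice is written with an upper bound of 4 and the Verilog-style select with an upper bound of 3. MyHDL follows the Python convention, so the same four bits of an intbv are written as x[4:0].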

I added a generic parameter to the RLE module, so that it can take any number of pixels.

The RLE module has two major parts:

1. RLE Core
2. RLE Double Buffer

The RLE Core processes the data and stores it in the RLE Double Buffer. While the Huffman module reads from one buffer, we can write into the second buffer.

This week I set up the Travis builder for my repo. Things didn’t work well with the Travis builder initially, because I imported the FIFO from the RHEA folder.

As soon as cfelton released a version of the RHEA files that uses the MyHDL 1.0 block decorator, the Travis build worked fine.

The RLE core had a negative number issue initially, which I finally fixed.

The code coverage for the RLE module is 100 percent as per pytest.

Landscape gives the code around 90 percent health.

Coveralls gives around 100 percent coverage for the code.

I added conversion tests for all the modules. I made a dummy wrapper around each module so that I can check whether the module converts or not.

I was facing an issue with nested interfaces: I came to know that MyHDL has no support for nested interfaces as ports. It assigns them as reg or wire but not as input or output.

So, I made a few modifications to my interfaces.

Now, about the Quantizer module.

The module has a divider at its heart, built from a multiplier. We send the divisor to a ROM and get back its reciprocal, which is stored in the ROM. We multiply the reciprocal by the dividend and hence get the output.
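The reciprocal trick can be sketched in software as follows. This is an illustrative model under stated assumptions, not the module's implementation: the 16-bit fixed-point precision, the ROM size of 256 entries, the rounding step, and all names are choices made for this example, and only non-negative dividends are considered.

```python
RECIPROCAL_BITS = 16  # assumed fixed-point precision of the stored reciprocals

# ROM indexed by divisor. Entry 0 is a placeholder, since 1/0 is undefined.
RECIPROCAL_ROM = [0] + [round((1 << RECIPROCAL_BITS) / d) for d in range(1, 256)]


def divide(dividend, divisor):
    """Approximate round(dividend / divisor) with one lookup and one multiply."""
    reciprocal = RECIPROCAL_ROM[divisor]  # ROM lookup replaces the division
    product = dividend * reciprocal       # the multiplier does the real work
    # Add half of the fixed-point unit so the shift rounds to nearest.
    return (product + (1 << (RECIPROCAL_BITS - 1))) >> RECIPROCAL_BITS


quotient_a = divide(100, 10)
quotient_b = divide(90, 16)
```

With these inputs `quotient_a` is 10 and `quotient_b` is 6 (90 / 16 = 5.625, rounded to nearest). A multiplier plus a small ROM is usually cheaper in hardware than a general divider, which is the motivation for this design.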

The module is already written and will most likely be uploaded to GitHub by tomorrow.

I have been following the same architecture as the reference design for the Quantizer module.

Also, I made separate clock and reset modules in the common folder so that I can access things easily, and I added a reference implementation in the common folder as well.

I will finish the modules as per the checkpoints.

Stay tuned for the next update.

#### GSoC Third week

I have been suffering from viral fever, so I was not able to work properly the whole week. Over the course of the week I made the first version of the Run Length Encoder.

Things worth discussing :

To convert a number to unsigned in Python (MyHDL), all we have to do is mask it with a bitwise AND:

a = a & 0xFF

This converts a to an unsigned 8-bit value.
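A short sketch of the masking, generalised to any width. Note that the mask needs the bitwise `&` operator; the keyword `and` is a logical operator and does not mask bits. `to_unsigned` is a hypothetical helper name.

```python
def to_unsigned(value, width=8):
    """Reinterpret a two's-complement integer as an unsigned value of `width` bits."""
    return value & ((1 << width) - 1)


minus_one = to_unsigned(-1)      # all eight bits set
minus_128 = to_unsigned(-128)    # only the top bit set
unchanged = to_unsigned(5)       # values already in range are untouched

# The logical operator simply returns its second operand when the first is
# truthy, so it does not behave like a mask.
logical_result = 5 and 0xFF
```

Here `minus_one` is 255, `minus_128` is 128, `unchanged` is 5, and `logical_result` is 255 rather than 5.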

Also, I got familiar with pylint and flake8. Flake8 uses the PEP8 coding guidelines to check your code. They really help a lot in making the code look good. Also, Chris told me to make the code more modular, so that we can run-length encode blocks of any size.

Overall, the week was a decent one.

### SanketDG (coala)

#### Summer of Code Midterm Updates

Talking about short updates, I have successfully completed parsing routines for Python and Java. My next task would be to use the coalang functionality to implement parsing routines which are completely implicit and to provide support for C and C++ documentation styles.

So instead of passing the parameter and return symbols as strings through a function, they would be extracted from the coalang files. A strong API to access coalang files would help here.

Support for multiple params also needs to be kept in mind. A documentation style can support many types of formats.

After this is done, I will start working on the big thing i.e. DocumentationBear! The first feature of capitalizing sentences has been already implemented (needs a little bit of improvement.)

The second thing to do is to implement checking the docs against a specified style. This can be done in two ways:

• The first one involves the user supplying a regex to check against the documentation.

• Another way would be to define some predefined styles that are generally followed as conventions in most projects, and then check them against the documentation. For example for python docstrings, two conventions seem to rule:

:param x:          blablabla
:param muchtolong: blablabla

:param x:
    blablabla
:param muchtolong:
    blablabla


Supporting these two conventions as predefined styles would avoid most projects writing a complex regex.
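A rough sketch of how two such predefined styles could be recognised. This is not coala's API: the pattern names, the style labels, and `detect_param_style` are hypothetical and only illustrate the idea of shipping ready-made patterns instead of asking users to write their own.

```python
import re

# Hypothetical predefined styles; names and patterns are illustrative only.
# Description on the same line as the parameter name.
SAME_LINE = re.compile(r"^\s*:param \w+:[ \t]+\S.*$")
# Parameter name alone on its line, description on the following line.
NEXT_LINE = re.compile(r"^\s*:param \w+:[ \t]*$")


def detect_param_style(docstring):
    """Return the set of parameter styles found in a docstring."""
    styles = set()
    for line in docstring.splitlines():
        if SAME_LINE.match(line):
            styles.add("same_line")
        elif NEXT_LINE.match(line):
            styles.add("next_line")
    return styles


aligned = ":param x:          blablabla\n:param muchtolong: blablabla"
stacked = ":param x:\n    blablabla\n:param muchtolong:\n    blablabla"
aligned_styles = detect_param_style(aligned)
stacked_styles = detect_param_style(stacked)
```

A checker built on this could report a docstring that mixes both styles, or one whose style differs from the one configured for the project.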

Then I would go forward with more functionality like indentation checking and wrapping of characters in a long line to subsequent lines. I will also check for grammar within documentation!

If there is time available after all this, I would go forward with refactoring all the classes related to documentation extraction and improve the parsing routines to make them more dynamic. I would also like to tackle the problem where languages and docstyles have different fields to extract, not only the current three (description, parameters and return values).

On a final note, I have an issue tracker at GitLab. Also, to help me organize my work, I have opened a public Trello board. The board is empty right now, but I will start filling it up from tomorrow.

### meetshah1995 (MyHDL)

#### Let's Silicon

As the title may suggest, most of the next part of my GSoC will be spent making hardware modules for RISC-V cores and interfacing them to make a processor!

I already have a working and tested myHDL-based decoder in place. I am now in discussions with my mentor to finalize a RISC-V module which I can port to myHDL. This will kick off the next phase of my coding in GSoC.

We will be selecting an RV32I-based core to implement in the coming weeks, as the HDL decoder fully supports RV32I at present.

I have also shifted my development to the dev branch, keeping my master up to date with the main repository.

See you next week.
MS

### tsirif (Theano)

#### Midterm GSoC Updates

This week I am going to present in detail my pull request in Theano’s libgpuarray project. As I mentioned in my previous blog post, this pull request will provide multi-gpu collectives support in libgpuarray. The API exposed for this purpose is described in two header files: collectives.h and buffer_collectives.h.

## Some libgpuarray API

Before explaining the collectives API, I must refer to some libgpuarray structures that the user has to handle in order to develop functioning software.

• gpucontext: This structure is declared in buffer.h.

As the name suggests, this describes a GPU context. A GPU context is a concept which represents a process running on the GPU. In general, a context can be “pushed” to a GPU, and all kernel operations scheduled while that context is active will be executed accordingly. A context keeps track of state related to a GPU process (distinct memory address space, allocations, kernel definitions). A context is “popped” when the user does not want to use it anymore. In libgpuarray, a gpucontext is assigned to a single GPU on creation and is also used to refer to the GPU which will be programmed. A call to gpucontext_init creates an instance, and at least one call is necessary to make use of the rest of the library.

gpucontext* gpucontext_init(const char* name, int dev, int flags, int* ret);

• gpudata: This structure is declared in buffer.h.

It represents allocated data on a device which is handled by a single gpucontext. A call to gpudata_alloc returns an allocated gpudata which refers to a buffer of size sz (in bytes) on the GPU selected through the ctx provided. Optionally, a pointer data to host memory can be provided, along with GA_BUFFER_INIT in flags, to copy sz bytes from the host to the newly allocated buffer on the GPU.

gpudata* gpudata_alloc(gpucontext* ctx, size_t sz, void* data, int flags, int* ret);

• GpuArray: This structure is declared in array.h.

It represents an ndarray on the GPU. It is a container, similar to NumPy’s, which attaches specific vector space attributes to a gpudata buffer. It contains the number and size of dimensions, strides, the offset from the original device pointer in gpudata, the data type, and flags which indicate whether a GpuArray is aligned, contiguous and well-behaved. It can be created in 4 ways: as an empty array, as an array filled with zeros, using previously allocated gpudata, or using an existing host ndarray. All of them need information about the number and size of dimensions, strides (the first two through data order) and data type. We will use the following two:

int GpuArray_empty(GpuArray* a, gpucontext* ctx, int typecode,
unsigned int nd, const size_t* dims,
ga_order ord);
int GpuArray_copy_from_host(GpuArray *a, gpucontext *ctx, void *buf, int typecode,
unsigned int nd, const size_t *dims,
const ssize_t *strides);


## Collectives API on GPU buffers

I will now explain how to use the buffer-level API declared in buffer_collectives.h. For convenience, I am going to do this by presenting the test code as an example.

First of all, since we are going to examine a multi-gpu example, a parallel framework is used, because NCCL requires that some of the API be called in parallel for each GPU to be used. In this example I am going to use MPI. I will omit the initialization of MPI and its ranks and use MPI_COMM_WORLD. Each process will handle a single GPU device, and in this example the rank of an MPI process will be used to select a hardware device number.

gpucontext* ctx = gpucontext_init("cuda", rank, 0, NULL);
gpucommCliqueId comm_id;
gpucomm_gen_clique_id(ctx, &comm_id);


A gpucontext is initialized and a unique id for gpu communicators is produced with gpucomm_gen_clique_id.

MPI_Bcast(&comm_id, GA_COMM_ID_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
gpucomm* comm;
gpucomm_new(&comm, ctx, comm_id, num_of_devs, rank);


The unique id is broadcast using MPI so that it is the same among GPU communicators. A gpucomm instance is created, which represents the communicator of a single GPU in a group of GPUs which will participate in collective operations. It is declared in buffer_collectives.h. gpucomm_new needs to know the ctx to be used and the user-defined rank of ctx’s device in the newly created group. Rank in a GPU group is user-defined and is independent of the hardware device number or MPI process rank. For convenience, in this test example they are equal.

int* A = calloc(1024, sizeof(char));
int i, count = SIZE / sizeof(int);
for (i = 0; i < count; ++i)
A[i] = comm_rank + 2;
int* RES = calloc(1024, sizeof(char));
int* EXP = calloc(1024, sizeof(char));

gpudata* Adev = gpudata_alloc(ctx, 1024, A, GA_BUFFER_INIT, &err);
gpudata* RESdev = gpudata_alloc(ctx, 1024, NULL, 0, &err);


Initialize buffers for input, expected and actual output.

gpucomm_reduce(Adev, 0, RESdev, 0, count, GA_INT, GA_PROD, 0, comm);
MPI_Reduce(A, EXP, count, MPI_INT, MPI_PROD, 0, MPI_COMM_WORLD);


For convenience, all collective operations are checked against the results of the corresponding MPI collective operations. All collectives require a gpucomm as an argument and sync implicitly, so every gpucomm that participates in a GPU group must call the collective function. Collective operations and documentation exist in buffer_collectives.h. In that file you will also find the definition of _gpucomm_reduce_ops, one of which is GA_PROD in this example. Notice the similarity between the MPI and gpucomm signatures.

int gpucomm_reduce(gpudata* src, size_t offsrc, gpudata* dest,
size_t offdest, size_t count, int typecode,
int opcode, int root, gpucomm* comm);
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root,
MPI_Comm comm);


Currently supported collective operations are all operations supported by nccl, as of now:

• gpucomm_reduce
• gpucomm_all_reduce
• gpucomm_reduce_scatter
• gpucomm_broadcast
• gpucomm_all_gather

if (rank == 0) {
  // Reading from RESdev gpudata to RES host pointer
  gpudata_read(RES, RESdev, 0, 1024);

  int res;
  MAX_ABS_DIFF(RES, EXP, count, res);
  if (!(res == 0)) {
    PRINT(RES, count);  // print RES array
    PRINT(EXP, count);  // print EXP array
    ck_abort_msg("gpucomm_reduce with GA_INT type and GA_PROD op produced max "
                 "abs err %d", res);
  }
}


Result from root’s GPU is copied back to host and then the expected and actual results are compared.
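To make the expected values concrete, here is a small pure-Python model of what a reduce with a product operation computes. It does not call libgpuarray, NCCL or MPI; `simulate_reduce` and the sizes used are hypothetical and exist only for illustration. Each simulated rank fills its buffer with rank + 2, as in the test, and the root ends up with the elementwise product over all ranks.

```python
from functools import reduce
import operator


def simulate_reduce(buffers, op):
    """Combine one buffer per rank elementwise, as the root would receive it."""
    # zip(*buffers) groups the i-th element of every rank's buffer together.
    return [reduce(op, column) for column in zip(*buffers)]


num_of_devs = 3  # assumed number of participating GPUs
count = 4        # assumed number of elements per buffer

# Every rank contributes a buffer filled with rank + 2.
buffers = [[rank + 2] * count for rank in range(num_of_devs)]
expected = simulate_reduce(buffers, operator.mul)
```

With three ranks every element of `expected` is 2 * 3 * 4 = 24, which is the kind of value the comparison against the MPI result checks for.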

free(A);
free(RES);
free(EXP);
gpudata_release(RESdev);
gpucomm_free(comm);
gpucontext_deref(ctx);


Finally, resources are released.

Complete testing code can be found in the main.c, device.c, communicator.c and check_buffer_collectives.c files. The libcheck framework is used for C testing. The actual testing code contains setup and teardown functions, as well as preprocessor macros and tricks for easily testing all data and operation types. Error checking, which is crucial in real code, is omitted from the example above for brevity.

## Collectives API on GPU ndarrays

gpucontext* ctx = gpucontext_init("cuda", rank, 0, NULL);
gpucommCliqueId comm_id;
gpucomm_gen_clique_id(ctx, &comm_id);

MPI_Bcast(&comm_id, GA_COMM_ID_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
gpucomm* comm;
gpucomm_new(&comm, ctx, comm_id, num_of_devs, rank);

int(*A)[16];
A = (int(*)[16])calloc(32, sizeof(*A));
int(*RES)[16];
RES = (int(*)[16])calloc(32, sizeof(*RES));
int(*EXP)[16];
EXP = (int(*)[16])calloc(32, sizeof(*EXP));

size_t indims[2] = {32, 16};
size_t outdims[2] = {32, 16};
const ssize_t instrds[ND] = {sizeof(*A), sizeof(int)};
const ssize_t outstrds[ND] = {sizeof(*RES), sizeof(int)};
size_t outsize = outdims[0] * outstrds[0];
size_t i, j;
for (i = 0; i < indims[0]; ++i)
for (j = 0; j < indims[1]; ++j)
A[i][j] = comm_rank + 2;

GpuArray_copy_from_host(&Adev, ctx, A, GA_INT, ND, indims, instrds);
GpuArray RESdev;
GpuArray_empty(&RESdev, ctx, GA_INT, ND, outdims, GA_C_ORDER);


First create a gpucomm as before. Then initialize the host and device arrays to be used in the test. The code above may seem difficult to read, or a pain to write explicitly every time an array must be made, but the pygpu Python interface to libgpuarray makes it easy and readable.

if (rank == 0) {
  GpuArray_reduce(&Adev, &RESdev, GA_SUM, 0, comm);
} else {
  GpuArray_reduce_from(&Adev, GA_SUM, 0, comm);
}
MPI_Reduce(A, EXP, 32 * 16, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0) {
  // Reading from RESdev gpudata to RES host pointer
  int res;
  COUNT_ERRORS(RES, EXP, 32, 16, res);
  ck_assert_msg(res == 0,
                "GpuArray_reduce with GA_SUM op produced errors in %d places",
                res);
}


As before, results are checked against the MPI collectives’ results. Collective operations for GpuArrays and their documentation exist in collectives.h. In this example, GpuArray_reduce is the function used to perform the reduce collective operation on GpuArrays, while GpuArray_reduce_from is a function which can be used by non-root gpucomm ranks to participate in this collective.

int GpuArray_reduce_from(const GpuArray* src, int opcode,
                         int root, gpucomm* comm);
int GpuArray_reduce(const GpuArray* src, GpuArray* dest,
                    int opcode, int root, gpucomm* comm);


Currently supported collective operations on GpuArrays:

• GpuArray_reduce_from
• GpuArray_reduce
• GpuArray_all_reduce
• GpuArray_reduce_scatter
• GpuArray_broadcast
• GpuArray_all_gather

GpuArray_clear(&RESdev);
free(A);
free(RES);
free(EXP);
gpucomm_free(comm);
gpucontext_deref(ctx);


Finally, resources are released again.

## In general and near future

Using this part of libgpuarray requires having nccl installed, as well as CUDA >= v7.0 and GPUs of at least Kepler architecture, as suggested on nccl’s github page. Currently, as there is no collectives framework for OpenCL, collective operations are supported only for CUDA gpucontexts. If nccl exists in a default path in your system (one whose bin directory is contained in the PATH environment variable), then it will be built automatically when invoking, for example, make relc. Otherwise, you need to specify its location through the variable NCCL_ROOT_DIR.

If you want to test, you need to have MPI and libcheck installed, and the Makefile.conf file properly set up to declare how many and which GPUs you want to use, in order to test across many GPUs in your machine.

I want to note that testing with MPI and libcheck gave me a headache when trying to execute the test binaries for the first time. MPI processes signaled a segmentation fault, reporting that the memory address space was not correct. For anybody who may attempt a similar approach for multi-process testing: I did not know that libcheck forks and runs the tests in a subprocess. That subprocess is not the “registered” MPI process, so an error is raised when an MPI command is issued with the expected MPI comm. To solve this, I turned off forking before running the tests through the libcheck API. See this.

Right now I am working on completing Python support for the libgpuarray collectives API in pygpu. There will be a follow-up blog post as soon as I finish.

Till then, have fun coding!
Tsirif, 24/06/2016

## June 24, 2016

### shrox (Tryton)

#### Working ODT file

Hurray! I can now generate an ODT file that works just fine and can be opened in LibreOffice without having to repair it! This is a great step forward for me in my project and I am really happy!

Since I wrote the last blog post, I have indeed come a long way. I have cleaned up my code, for one. It now looks like code should look and is fairly human readable.

Next I have also successfully generated the manifest.xml file that is associated with ODT files. This file keeps a list of all the files, the various xml files as well as the images that the final odt needs to display.

Two very useful, handy additions to my code are the StringIO and zipfile libraries. StringIO lets me make an “in memory” folder. Earlier I used to generate the files in the folder where my .py file was located. I used to create the folders there, along with the xml files and the images. Then I would manually zip the folder and rename the result to odt. But now my Python program does all of that without touching the file system at all. Hence, I do not even need to ‘import os’ in my program to access the file system :)
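The in-memory approach can be sketched roughly as below. This is not the project's code: `build_odt` is a hypothetical helper, the XML strings are placeholders rather than a valid document, and on Python 3 `io.BytesIO` takes the place of StringIO for binary data. The sketch also follows the OpenDocument packaging convention that the mimetype entry comes first and is stored uncompressed.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def build_odt(content_xml, manifest_xml):
    """Assemble the ODT archive in memory and return it as bytes."""
    buffer = io.BytesIO()  # the "in memory" file, nothing touches the disk
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", ODT_MIMETYPE,
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("content.xml", content_xml)
        archive.writestr("META-INF/manifest.xml", manifest_xml)
    return buffer.getvalue()


# Placeholder payloads: a real document needs proper ODF XML here.
odt_bytes = build_odt("<placeholder-content/>", "<placeholder-manifest/>")
```

The returned bytes can be written to a single .odt file at the very end, or handed straight to whatever consumes the report.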

### Shubham_Singh (italian mars society)

As planned in the schedule, I developed the GUI for the project first and later linked the front end with the database designed using MongoDB.
Various files were developed at different stages of the project; some of them are:

a) health_index.ui: This is the basic user interface file generated using qt-designer, which describes the layout of the application’s interface.

b) health_index.py: This file is the Python equivalent of health_index.ui, generated with pyuic4 using the command:
pyuic4 health_index.ui -o health_index.py

c) output.py: This is the executable Python program, generated using the command:
pyuic4 health_index.ui -x -o output.py

To run the GUI, use the following commands:
a) Start the MongoDB daemon:
sudo mongod

b) Navigate to the health_index directory and start the application using the command below:
python test_output.py

If all the dependencies are installed correctly, we can see the initial GUI window, which consists of different LineEdit and Textbox widgets for accepting input.
Time Stamp and Time Interval are the two inputs required for plotting the graph.
To enter the input values and parameters, click on the "select the parameter" button.

Selecting the parameter and its value can be done through the input dialog window; the time stamp is fetched from the system time and can be changed if needed.
All the required inputs are then inserted into the designed database:

db.hi_values.insert_one(
    {
        'parameter': dataset[0],
        'value': dataset[1],
        'timestamp': dataset[2],
        'timeinterval': dataset[3]
    }
)
The values are similarly fetched from the database in the summary tab.
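The mapping from the input list to a database document can be sketched as a small helper. `build_record`, `save_record` and `load_records` are hypothetical names, not part of the project; the last two assume a pymongo collection object such as db.hi_values and use only its insert_one and find methods.

```python
FIELDS = ("parameter", "value", "timestamp", "timeinterval")


def build_record(dataset):
    """Turn the four GUI inputs into a document with named fields."""
    if len(dataset) != len(FIELDS):
        raise ValueError("expected %d values, got %d" % (len(FIELDS), len(dataset)))
    return dict(zip(FIELDS, dataset))


def save_record(collection, dataset):
    """Insert one document; `collection` is assumed to be a pymongo collection."""
    record = build_record(dataset)
    collection.insert_one(record)
    return record


def load_records(collection, parameter):
    """Fetch every stored document for one parameter, e.g. for the summary tab."""
    return list(collection.find({"parameter": parameter}))


# Example values are made up for illustration.
example = build_record(["heart_rate", 72, "2016-06-24 10:00:00", 5])
```

Validating the length up front gives a clear error message instead of an IndexError raised deep inside the GUI callback.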

I also worked on the project documentation, covering the prerequisites, the dependencies and their installation. It also describes the steps to run the GUI and the flow of control between the different files in the project.
From next week I will start with the HI calculations and graph plotting with pyqtgraph.

Cheers !
Shubham

### srivatsan_r (MyHDL)

#### Completed HDMI Cores!

Finally, after working for almost 3 weeks, I have completed coding the HDMI cores and they are working fine. The Receiver core took more time to complete because it has lots of modules. Testing them after writing the code was again very tedious. I had to trace each and every signal’s waveform and find where it was going wrong.

MyHDL allows only certain classes and functions to be converted into verilog code. So one doesn’t have the freedom to use all the cool functionality of python in MyHDL if the code has to be convertible. After I had coded the cores in MyHDL, I faced many errors when I tried to convert them to verilog. For example, I was not able to compare strings inside a function decorated with MyHDL decorators. My mentor suggested that I compare the strings outside the function and assign the result to a boolean variable, which can then be tested inside the function.

So, after debugging all the errors, the code was successfully converted to verilog. Initially I was trying to use the verify_convert() function of MyHDL to check the credibility of the generated verilog code. But since my code contained many xilinx primitives, I was not able to compile it, as the libraries were not available in the icarus verilog simulator. So, I had to drop that idea and just use the convert() function to check that the code converts successfully. The converted verilog code contained more than 3000 lines of code!!

#### Badges of Honour!

Most of the Open source projects hosted on Github will contain some badges in their README file. These badges are images provided by some integration tools displaying their corresponding stats or scores.

Some important integration tools are Travis CI, landscape.io, coveralls.io and readthedocs.

Travis CI

landscape.io

Landscape integrates with your existing development process to give you continuous code metrics seamlessly. Each code push is checked automatically, making it easy to keep on top of code quality. You have to just signup at landscape.io and then get your badge’s Markdown and add it to the README file. The badge will show the health of your project. You can configure landscape also with a YAML file, you can make it ignore some warning checks. Here is a configuration file from my HDMI-Source-Sink-Modules project.

coveralls.io

Coveralls integrates with your development process and tells you how many lines of your code are actually covered by your test files, giving you the percentage of code covered. You just have to sign up on their website and add your project repository. You have to modify a line in your .travis.yml file to make this work, and you have to run your tests under the coverage command. Here is a Travis CI configuration file from my HDMI-Source-Sink-Modules project. The badge provided will show the percentage of code covered by the test files.

Readthedocs lets you host the documentation of the project online. Again you have to signup and add your repository. You can use Sphinx for generating the documentation of your project. The autodoc feature of Sphinx allows you to generate the documentation of your code from the docstrings that are rST texts. For python projects which use Google style docstrings i.e. which are not written in rST format, there is a package called napoleon which automatically parses the Google style docstring into rST and Sphinx uses it. Readthedocs checks for the conf.py file generated by Sphinx and creates the documentation according to it.

All these badges help any developer get a quick summary of the status of your project.

### Karan_Saxena (italian mars society)

Time flies!!

It looks as if it was only yesterday that coding period started. Even my exams came and almost (2 more to go :P) passed by :D

So here are the updates:
1) PyKinect2 is [finally] working. Woosh!!
2) I am now able to ping the hardware via the .py script. Yay.
3) The coordinates are being dumped in a temporary file.

See this in action
 My body being tracked

Finally, big thanks to my sub-org admin Antonio and my mentor Ambar for allowing me to accommodate my exams in between.

Here's to next 2 months working on the project full time (y)

Onwards and upwards!!

### ghoshbishakh (dipy)

#### Google Summer of Code Progress June 24

This is midterm period and the dipy website has a proper frontend now! And more improvements coming.

### Progress so far

The custom content management system is improved and has a better frontend now.

Social network feeds for twitter and google plus are now added and a hexagonal gallery is placed in the home page.

The documentation generation script is also updated to upload documentation of different versions separately in github. And the django site is updated by checking the contents of the github repository through github API.

The current pull requests are #9 and #1082

You can visit the site under development at http://dipy.herokuapp.com/

### Details of content management system

The custom CMS now allows maximum flexibility. Almost all content on the website can be edited: fixed sections, documentation versions, pages, publications, gallery images, news feeds and carousel images can all be edited from the admin panel.

One of the most important additions in the CMS is that now we can create any number of pages and we will get a url for that page. So this allows us to create any custom page and link it from anywhere. Also with a single click that page can be put in the nav bar.

To make the nav bar dynamic I had to pass the same context to every template. Thankfully this can be achieved with “context_processors” in a DRY way. This also allows the documentation links to be changed without changing the template.
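A context processor is just a function that takes the request and returns a dictionary. The sketch below shows the idea; the module name, the `PAGES` list (a stand-in for a database query) and the `nav_bar` function are all hypothetical and are not taken from the dipy website code.

```python
# context_processors.py: a hypothetical module, shown as a sketch

# Stand-in for a database query; a real project would read pages from a model.
PAGES = [
    {"title": "Home", "url": "/", "show_in_nav": True},
    {"title": "Publications", "url": "/publications/", "show_in_nav": True},
    {"title": "Draft page", "url": "/draft/", "show_in_nav": False},
]


def nav_bar(request):
    """Return context variables that Django merges into every template."""
    nav_pages = [page for page in PAGES if page["show_in_nav"]]
    return {"nav_pages": nav_pages}
```

Such a function is activated by adding its dotted path to the `context_processors` list inside the `OPTIONS` of the `TEMPLATES` setting, after which every template can loop over `nav_pages`.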

For now the documentations are hosted in the dipy_web github repository. Different versions of documentations are linked automatically into django by checking the content of the repository using github API.

There is also option of excluding some documentation versions from the admin panel.

### Details of the frontend

Although most parts of the website now have a basic styling, the frontend is still under constant improvement. The progress can be best visualized through some screenshots:

### What’s next

We have to automate the documentation generation process. A build server will be triggered whenever there is a new commit, and the documentation will be updated automatically on the website. There are also some command line tools for which docs must be generated.

I have to include a Facebook feed on the home page. Also, the honeycomb gallery is cool, but a carousel with up-to-date content like upcoming events and news would be more useful. The styling of all parts of the website can be improved.

Then I have to clean up the code a bit and add some more documentation and start testing the internals. After that we have to think of deployment and things like search engine optimization and caching etc.

Will be back with more updates soon! :)

### Ranveer Aggarwal (dipy)

#### Making a Slider

The next UI element is a slider. It’s not as difficult as it looks since I already have most of the framework set up.
Basically, this time when the interactor gets the coordinates of the click, I need to pass these to the UI element in some way so that the callback can pick it up. And this is already set up as ui_params, which I implemented in the case of the text box.

Now, the slider itself is composed of two actors - a line and a disk. The disk would be moving (sliding) on the line. For doing so, I had to change DIPY’s renderer class a bit. The change now allows for nested UI elements, much like the slider - which has a sliderDisk and a sliderLine.
The click is handled by the line (the callback belongs to the sliderLine) and the disk moves.

In addition, the slider has a text box which displays the position of the disk on the line as a percentage.

It currently only works with a click.
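The percentage shown in the text box can be derived from the click position alone. Here is a minimal sketch of that calculation, assuming a horizontal line slider; the function and parameter names are made up for illustration and are not DIPY's API.

```python
def slider_percentage(click_x, line_start_x, line_end_x):
    """Map the x coordinate of a click to a slider position in percent."""
    span = line_end_x - line_start_x
    if span <= 0:
        raise ValueError("the slider line must have positive length")
    ratio = (click_x - line_start_x) / float(span)
    # Clicks slightly outside the line are clamped to its ends.
    ratio = min(1.0, max(0.0, ratio))
    return 100.0 * ratio


position = slider_percentage(150, 100, 300)  # a quarter of the way along the line
```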

### Enhancements and Improvements

These are things left to be done:

• Explore ways to make a circular slider
• Make it draggable
• Left key press should also increase the slider

And other enhancements to make it more futuristic. This is a line slider, there can be many more and they need to be explored too.

### This Week

I’ll be continuing my work on this slider and incorporating the above points.

### Sheikh Araf (coala)

#### Eclipse Plug-in: Dynamic menu with click listener

I’m writing an Eclipse plug-in for coala and recently I had to add a list of available bears so that the user can choose the bear to use for code analysis.

I struggled for a few days, experimenting with different approaches. In this post I’ll share the method that I finally settled on.

First you have to add a dynamic menuContribution item to the org.eclipse.ui.menus extension point. So your plugin.xml should look like this:

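<extension point="org.eclipse.ui.menus">
  <menuContribution locationURI="...">
    <menu label="Run coala with">
      <dynamic class="..." id="...">
      </dynamic>
    </menu>
  </menuContribution>
</extension>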


The class field points to the Java class that populates the dynamic menuContribution item. This class should extend org.eclipse.jface.action.ContributionItem. So let’s implement this class. The code looks like the following:

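import org.eclipse.jface.action.ContributionItem;
import org.eclipse.swt.SWT;
import org.eclipse.swt.events.SelectionAdapter;
import org.eclipse.swt.events.SelectionEvent;
import org.eclipse.swt.widgets.Menu;
import org.eclipse.swt.widgets.MenuItem;

public class BearMenu extends ContributionItem {

  public BearMenu() {
  }

  public BearMenu(String id) {
    super(id);
  }

  @Override
  public void fill(Menu menu, int index) {
    String[] bears = getBears();

    for (final String bear : bears) {
      // One menu entry per bear, each with its own click listener.
      MenuItem menuItem = new MenuItem(menu, SWT.PUSH, index++);
      menuItem.setText(bear);
      menuItem.addSelectionListener(new SelectionAdapter() {
        @Override
        public void widgetSelected(SelectionEvent event) {
          System.out.println("You clicked " + bear);
          // Analyze with coala here.
        }
      });
    }
  }

  private String[] getBears() {
    // returns an array containing names of bears
    return new String[] {}; // placeholder until the real lookup is added
  }
}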


This way you can dynamically add items to the menu and assign them click listeners.

There are other approaches as well, e.g. a dynamic menu with a corresponding handler, etc. but I find this approach to be the easiest.

### liscju (Mercurial)

#### Coding Period - IV Week

The most important goal of this week was to change HTTP redirection so that it reuses Mercurial's existing code for communicating over HTTP. Last week the redirection used the httplib library directly, but there were a couple of reasons to reuse the existing code. First, Mercurial can communicate using either httplib or httplib2, depending on which library is available. Second, there is existing code for things like authentication, connection reuse etc. In Mercurial, communication with an HTTP peer is done mainly in:

https://selenic.com/hg/file/tip/mercurial/httppeer.py

The main function for sending a request and getting the response is _callstream:

https://selenic.com/hg/file/tip/mercurial/httppeer.py#l92

The goal was accomplished and redirection reuses this communication.

The other goal of this week was to send and receive data in chunks when communicating with the HTTP peer. Because largefiles is designed to deal with really large files, sending or receiving all the data at once leads to errors when communicating over the network. I found that this is already implemented as httpsendfile, a file wrapper created for dividing data into chunks. It can be passed to the HTTP request builder, and that's all that is needed for sending chunked data. You can see how it looks here:

https://bitbucket.org/liscju/hg-largefiles-gsoc/src/f34101d68fcc5b5e8fc7cf3d4727a9e2e08e599d/hgext/largefiles/redirection.py?at=default&fileviewer=file-view-default#redirection.py-241

Dividing the file stream received from the server into chunks was also already implemented; this functionality is located in util.filechunkiter. Its parameters are the stream, the chunk size and the length.
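To illustrate the idea of reading a stream chunk by chunk, here is a small standalone sketch. It is not Mercurial's util.filechunkiter; the function name `iter_chunks` and its parameter names are chosen for this example only.

```python
import io


def iter_chunks(stream, chunk_size=65536, limit=None):
    """Yield successive chunks from a file-like object.

    At most `limit` bytes are read in total when a limit is given.
    """
    remaining = limit
    while True:
        size = chunk_size if remaining is None else min(chunk_size, remaining)
        if size == 0:
            break  # the requested number of bytes has been delivered
        data = stream.read(size)
        if not data:
            break  # end of stream
        if remaining is not None:
            remaining -= len(data)
        yield data


chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
limited = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4, limit=6))
```

Because the function is a generator, only one chunk is held in memory at a time, which is the property that matters for very large files.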

Another thing I did this week was to allow the redirection URL to be generated dynamically by a user-provided application. I decided to use hooks for this. A hook is an external application that is run when the repository performs certain actions. For example, you can use hooks to send an email on commit. You can find a better description here:

https://www.mercurial-scm.org/wiki/Hook

For this project we expect that the hook (an external application) will generate the redirection target and write it to the .hg/redirectiondst file, from which the feature will read it. To see how this works you can take a look here:

https://bitbucket.org/liscju/hg-largefiles-gsoc/src/f34101d68fcc5b5e8fc7cf3d4727a9e2e08e599d/hgext/largefiles/redirection.py?at=default&fileviewer=file-view-default#redirection.py-90
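The hand-off through the file can be sketched as follows. Only the .hg/redirectiondst path comes from the description above; the function names, the file format (a single line) and the example URL are assumptions made for this illustration and may differ from the linked code.

```python
import os
import tempfile

REDIRECTION_FILE = os.path.join(".hg", "redirectiondst")  # path named in the post


def write_redirection_target(repo_root, target_url):
    """What a hook could do: store the generated redirection target."""
    path = os.path.join(repo_root, REDIRECTION_FILE)
    with open(path, "w") as handle:
        handle.write(target_url + "\n")


def read_redirection_target(repo_root):
    """What the extension side could do: read the target back."""
    path = os.path.join(repo_root, REDIRECTION_FILE)
    with open(path) as handle:
        return handle.read().strip()


# Demonstration in a throwaway directory standing in for a repository.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, ".hg"))
write_redirection_target(demo_root, "https://files.example.com/largefiles")  # hypothetical URL
demo_target = read_redirection_target(demo_root)
```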

Additionally I did some code cleanup and added documentation.

Apart from the project, I sent a patch to pull a bookmark with 'pull -B .'; it is already merged here:

https://selenic.com/hg/rev/113d0b23321a

Until now, largefiles was asking twice when cloning a repository with largefiles. My patch to deal with this was merged as well:

https://selenic.com/hg/rev/fc777c855d66

### Pulkit Goyal (Mercurial)

#### Absolute and Relative Imports

While switching from Python 2 to 3, there are a few more things you should care about. In this blog post we will be talking about absolute and relative imports.

### Prayash Mohapatra (Tryton)

The last two weeks have been great. I am finally enjoying both reading and writing code. I realise that most of the problems I had when I was stuck could be solved by taking a break and reading the code calmly. There were times when I just sat back, opened two panes, one on the left and one on the right, and read the code over and over again until I understood what I was doing wrong.

These two weeks I have been working on completing the views for the web client and making the action buttons functional, keeping the code similar to the GTK client. We discussed which CSV parsing library to use for sao and ended up choosing PapaParse. Later I extended PapaParse to support a custom quote character, which is a feature we support in the GTK client, and the contribution was merged upstream :D

I did some refactoring, which I previously felt wasn’t necessary, and re-implemented some parts of the views in a better manner. The best thing I learnt during all this was using Chrome Dev Tools to map the source files, so that instead of a refresh followed by 9 clicks every time I make a change, I need just 3 clicks! I am also using breakpoints to inspect the values inside a function.

This is what I have achieved so far. Currently working on the auto-detection of multi-level field names from an import file.

### Preetwinder (ScrapingHub)

#### GSoC-2

Hello,
This post continues my updates for my work on porting frontera to python2/3 dual support.
The first coding phase is about to end. The task I had to accomplish during this part was Python 3 support for single process mode. I have completed this task and will soon be making my pull requests. First I had to make the syntactic changes, which don’t change the code’s operation but only replace syntax with equivalents that have the same effect in both versions. I did this using the modernize script, which uses the six library to make the code work in both versions. After that I made some more syntactic changes which the modernize script is unable to cover (things like changed class variable names etc.). The next step was to define a precise data model for the type of strings (unicode or bytes) to be used in the API, and the necessary conversions to be performed in different parts of the code. For this I have mostly followed the approach of using native strings (unicode in Python 3 and bytes in Python 2) everywhere. After these changes I proceeded to make all the test cases work in both Python 2 and 3. I was mostly successful in this, apart from tests related to the distributed mode (which I am yet to work on) and a pending URL issue which hasn’t yet been addressed. Once I make the PRs I am sure I’ll have to address a few more issues, but apart from that this part of my work is mostly done.
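The native string approach usually comes down to one small conversion helper that is applied at the API boundaries. The following is a minimal sketch of such a helper; the name `to_native_str` and the UTF-8 default are assumptions for this example and are not taken from the frontera code base.

```python
import sys


def to_native_str(value, encoding="utf-8"):
    """Return `value` as the native `str` type of the running interpreter.

    On Python 3 that is unicode text, on Python 2 it is a byte string.
    """
    if isinstance(value, str):
        return value  # already native on either version
    if sys.version_info[0] >= 3:
        return value.decode(encoding)  # bytes -> str on Python 3
    return value.encode(encoding)  # unicode -> str on Python 2


native = to_native_str(b"frontier")
```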

GSoC-2 was originally published by preetwinder at preetwinder on June 24, 2016.

### Aron Barreira Bordin (ScrapingHub)

#### Scrapy-Streaming [3] - Binary Encoding and Error Handling

Hi ! In the third week of the project, I implemented the from_response_request.

This allows external spiders to create a request using a response from another request.

## Binary Responses

To be able to serialize binary responses into json messages, such as images, videos, and files, I added the base64 parameter to the request message.

Now, external spiders are able to download and check binary data using scrapy streaming.
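JSON cannot carry raw bytes, which is why a base64 step is needed. The sketch below shows the general technique; apart from the `base64` flag mentioned above, the message layout and the function names are assumptions for this illustration and not the Scrapy Streaming message format.

```python
import base64
import json


def build_response_message(url, body):
    """Serialize a binary response body into a JSON message."""
    return json.dumps({
        "type": "response",
        "url": url,
        "base64": True,
        # base64 output is plain ASCII, so it is safe inside a JSON string.
        "body": base64.b64encode(body).decode("ascii"),
    })


def read_body(message):
    """Recover the original bytes on the receiving side."""
    payload = json.loads(message)
    if payload.get("base64"):
        return base64.b64decode(payload["body"])
    return payload["body"].encode("utf-8")


png_header = b"\x89PNG\r\n\x1a\n"
message = build_response_message("http://example.com/logo.png", png_header)
```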

## Error Handling

I’ve implemented the exception message, which catches internal exceptions and sends them to the external spider.

There are two kinds of issues: errors and exceptions.

Errors are raised when there is a problem in the communication channel, such as an invalid request, an invalid field, and so on.

Exceptions represent problems in the Scrapy Streaming runtime, such as an invalid URL to request, invalid incoming data, etc.

## Docs and PRs

These modifications have been documented in the docs PR: https://github.com/scrapy-plugins/scrapy-streaming/pull/7

And the modification in the communication channel can be found at: https://github.com/scrapy-plugins/scrapy-streaming/pull/5

## Examples

I’ve added new Scrapy Streaming examples here: https://github.com/scrapy-plugins/scrapy-streaming/pull/4

These examples may help new developers implement their own spiders in any programming language; each example shows a basic feature of Scrapy Streaming.

Scrapy-Streaming [3] - Binary Encoding and Error Handling was originally published by Aron Bordin at GSoC 2016 on June 23, 2016.

### Levi John Wolf (PySAL)

#### Partially Applied classes with __new__

Python’s got some pretty cool ways to enable unorthodox behavior. For my project, I’ve found myself writing a lot of closures around our existing class init functions, and have decided it might be easier & more consistent to express this as what it really is: partial application.

Partial application is pretty simple to enable for Python class constructors, since the separate __new__ method allows you to construct closures around initialization routines.

Since embedding math & code directly has been a pain on Tumblr recently, I’ll just link to the example notebook and (eventually) move this blog to gh-pages.
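As a rough illustration of the idea (this is not the code from the notebook, and the class and argument names are invented), a class can return a partially applied constructor from `__new__` whenever the data is still missing:

```python
import functools


class Regression(object):
    """Toy model class that supports partial application via __new__."""

    def __new__(cls, data=None, **options):
        if data is None:
            # Not enough information to build an instance yet: hand back a
            # partially applied constructor that remembers the options.
            return functools.partial(cls, **options)
        return super(Regression, cls).__new__(cls)

    def __init__(self, data, **options):
        self.data = data
        self.options = options


configured = Regression(alpha=0.5)  # no data yet, so this is a partial
model = configured([1, 2, 3])       # supplying data builds the instance
```

Since the object returned by `__new__` in the first call is not an instance of `Regression`, Python skips `__init__` for it, so the missing `data` argument never causes an error.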

# Current Results

I’ve made a great deal of headway in the last two weeks. The distributed estimation code is up and running correctly, as is the necessary testing code. I rearranged the format somewhat: fit_distributed is no longer called within fit_regularized; instead it is now part of an entirely separate module. The PR has a good summary of the current setup:

https://github.com/statsmodels/statsmodels/pull/3055

# To Do

There is still work to be done before I move on to the inference portion, but things are getting much closer. First, I need to fix the GLM implementation and implement the WLS/GLS version. Second, I need to put together more examples; currently I have an example using simulated data and OLS, but it would be good to expand on these.

### jbm950 (PyDy)

#### GSoC Week 5

Well I started this week off by getting sick and as such productivity took a little bit of a hit. The majority of this week was spent reading Featherstone’s text book. The example documentation showcasing the base class API still hasn’t been reviewed and so that part of the project will just have to be set aside until later. Overall the project will not suffer, however, because of my progress in learning Featherstone’s method.

I’ve done a couple of other things this week as well. In a discussion with Jason it was determined that the LagrangesMethod class must have access to the dynamic system’s bodies through the Lagrangian input. Upon research, the Lagrangian turned out not to be an object instance but rather a function that simply returns the Lagrangian for the input bodies. This meant that LagrangesMethod did not in fact have access to the dynamic system’s bodies. Because of this I decided that an easy way to give LagrangesMethod body information would be to add an optional keyword argument for it. This way LagrangesMethod can have an API more similar to KanesMethod. This change can be found in PR #11263.

This week I reviewed PR #10856 which claimed to fix Issue #10855. Upon review it seemed that the “fix” was to just not run tests that were failing. When researched it looks like a whole module has not been updated for Python 3.X and is failing its relative imports. When run in Python 2.X it’s still not working either but rather is throwing up many KeyError flags. I think this has not been caught sooner due to the module being a component directly dealing with another project (pyglet) thus the tests are not run by TravisCI.

Lastly there were some test errors in the example documentation for the base class on PyDy. I was not too worried about these because the PR is not currently awaiting merging and is simply a discussion PR. The failing tests, however, were not related to the changes in the PR and so a PyDy member submitted a PR that fixed the tests and asked me to review it. After I looked it over and determined that the fix addressed the issue correctly he merged the PR.

### Future Directions

Next week I plan to continue forward with reading Featherstone’s book and, if possible, begin implementing one of the methods outlined in the book. Also I plan on beginning work on mirroring Jason’s overhaul of KanesMethod on LagrangesMethod.

### PR’s and Issues

• (Open) Added support for a bodies attribute to LagrangesMethod PR #11263
• (Open) Added a depencency on older version of ipywidgets PR #100
• (Open) Blacklisted pygletplot from doctests when pyglet is installed PR #10856
• (Open) sympy.doctest(“plotting”) fails in python 3.5 Issue #10855
• (Merged) Fix multiarray import error on appveyor PR #354

## June 23, 2016

### What’s done

In the time that has passed since I last posted one of these, I managed to get a prototype of scrapy to work using the new Signals API. This introduced two very significant API changes into Scrapy.

• All Signals now need to be objects of the scrapy.dispatch.Signal class instead of the generic python object
• All signal handlers must now receive **kwargs

The first change would not affect the existing extensions and third-party plugins much, since declaring new signals is not something extensions do for the most part, and using PyDispatcher to call the signals instead of the SignalManager class has long been deprecated in Scrapy. To accommodate this, the Scrapy SignalManager has not yet been phased out and would still be functional, although possibly deprecated depending on how the performance benchmarks work out and whether avoiding the overhead of the method calls is required.

The second of these changes, however, affects the majority of these extensions and requires that we accommodate them in some way. The solution required supporting the RobustApply method of PyDispatcher in Scrapy. This method, however, would considerably affect the performance of the module, so in order to have the faster signals one would be required to use handlers with keyword arguments.
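The reason handlers need `**kwargs` becomes clear from a stripped-down dispatcher. The class below is only a toy stand-in written for this post, not the scrapy.dispatch.Signal implementation, and the signal and handler names are purely illustrative.

```python
class Signal(object):
    """A tiny stand-in for a Django-style signal (illustration only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **named):
        # Every receiver gets the same keyword arguments, whether it uses
        # them or not, so no per-receiver signature inspection is needed.
        return [(receiver, receiver(signal=self, sender=sender, **named))
                for receiver in self._receivers]


item_scraped = Signal()  # illustrative signal object


def count_fields(item, **kwargs):
    # Only `item` is used; `signal`, `sender` and `spider` land in kwargs.
    return len(item)


item_scraped.connect(count_fields)
responses = item_scraped.send(sender="engine", item={"title": "x"}, spider="demo")
```

A handler declared as `def count_fields(item)` would fail here with a TypeError, which is the kind of mismatch that signature inspection in the dispatcher otherwise has to paper over at a performance cost.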

The API was also modified to allow Twisted deferred objects to be returned, and the error handling was changed to use the Failure class from twisted.python.failure.

The new module has also been unit tested for the most part, with some tests borrowed from Django since they’re the original authors of this signals API. I’m currently working on the benchmark suite, writing spiders that use non-standard signals and calls to test the performance. Earlier, signaling through send_catch_log used to be the biggest bottleneck, requiring 5X the time required for HTML parsing. Any improvement we can make on that is welcome, although ideally I would like signals to no longer be the bottleneck for crawling.

The following section is under construction.

### What needs to be done

Following the midterms, the highest priority would be to complete the benchmark suite so we know the viability of the approach we have used thus far and where to proceed from here. If the results obtained are satisfactory, we shall then continue to make backward compatibility fixes and rewrite algorithms that are still not as efficient as they can be, looking to maximize performance. We can continue on to provide full backward compatibility with object()-like signals; however, that would come with the trade-off that their performance would be more or less the same as what was previously achievable with the API.

Another major requirement would be for me to write good documentation of these parts, since these are essential to anybody writing an extension. We would also need to be on the lookout for regressions, if any.

~ Avishkar

### chrisittner (pgmpy)

#### Score-based Structure Learning BNs

With a bit of delay, I am now working on a basic PR for score-based structure estimation for Bayesian Networks. It comes with two ingredients:

• The StructureScore-Class and its subclasses BayesianScore and BICScore. They are initialized with a data set and provide a score-method to compute how well a given BayesianModel can be fitted to the data, according to different criteria. Since those scores are decomposable for BNs, a local_score-method is also exposed for node-by-node computation.
• The StructureSearch-Class and its subclasses ExhaustiveSearch and HCSearch. They are initialized with a StructureScore-instance and optimize that score over all BayesianModels. The latter subclass has a number of optional search enhancements.

So far BayesianScore supports BDeu and K2 priors; I still need to think of a good interface to specify other prior weights. With K2 priors the score is given by the following form:

$$score^{K2}_D(m) = \log(P(m)) + \sum_{X\in nodes(m)} local\_score^{K2}_D(X, parents_m(X))$$

where $$P(m)$$ is an optional structure prior that is quite negligible in practice. $$local\_score^{K2}$$ is computed for each node as follows:

$$local\_score^{K2}_D(X, P_X) = \sum_{j=1}^{q(P_X)} (\log(\frac{(r-1)!}{(N_j+r-1)!}) + \sum_{k=1}^r \log(N_{jk}!))$$

Where $$r$$ is the cardinality of the variable $$X$$, $$q(P_X)$$ is the product of the cardinalities of the parents of $$X$$ (= the possible states of $$P_X$$) and $$N_{jk}$$ is the number of times that variable $$X$$ is in state $$k$$ while parents are in state $$j$$ in the data sample. Finally, $$N_j:=\sum_{k=1}^r N_{jk}$$.
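The local score formula translates almost directly into code. Here is a standalone sketch that evaluates it from a table of counts, using `lgamma` for the log-factorials; the function name and the nested-list input format are choices made for this example and are not the pgmpy interface.

```python
from math import lgamma, log


def k2_local_score(counts):
    """Evaluate the K2 local score from state counts.

    counts[j][k] is N_jk: how often the variable is in state k while
    its parents are in configuration j.
    """
    score = 0.0
    for row in counts:
        r = len(row)      # cardinality of the variable
        n_j = sum(row)    # N_j, the total count for this parent configuration
        # log((r-1)! / (N_j + r - 1)!), using log(n!) == lgamma(n + 1)
        score += lgamma(r) - lgamma(n_j + r)
        # sum over k of log(N_jk!)
        score += sum(lgamma(n_jk + 1) for n_jk in row)
    return score


# One parent configuration, binary variable observed once in each state:
# log(1!/3!) + log(1!) + log(1!) = log(1/6)
example = k2_local_score([[1, 1]])
```

Parent configurations that never occur in the data contribute zero, so it does not matter whether they are listed in `counts` or left out.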

PR will follow shortly.

### Aakash Rajpal (italian mars society)

#### Midterms are here!

Hey all, midterms are coming this week and the first part of my proposal is about done. Some documentation work is all that’s remaining.

Well, my first part involved integrating the Leap Python API and the Blender Python API. Initially it was tough to integrate them, as the Leap API is designed for Python 2.7 whereas Blender only supports Python 3.5. Since I wasn’t able to integrate them directly at first, I came up with another solution: sending data from the Python 2.7 script to the Blender Python API over a socket connection. This worked well, however it was somewhat inefficient and slow.

Hence I thought I would try to generate a Python 3.5 wrapper for the Leap SDK. This sounded easy but was anything but. I found little support online; there were pages that were meant to help, but most of them were for Windows and very few for Ubuntu. However, after days of trying I found a solution with a little help from the community and was able to generate a Python 3.5 wrapper for the Leap SDK through SWIG. The wrapper worked perfectly fine and thus I successfully integrated the two APIs.

I talked to my mentor about the gesture support required for the project and added more gesture support.

I am now documenting my code and preparing myself for the second part of my project.

### Ravi Jain (MyHDL)

#### Maintain A Clean History!

Finally I have completed and merged the management module. Last time I posted, the things I needed to do before merging were to add the doc-strings, set up coveralls, and resolve conflicts with the master branch (rebase).

Adding doc-strings was the easiest but still took time as it gets a little boring (duh!). I used this example provided by my mentor as reference.

Next came the coveralls setup, which I must say is a little more complex compared to the others. I got a lot of help from referencing an already set up repo, test_jpeg, on which a fellow GSoCer is currently working. It got a little tricky in between as I stumbled upon an end-of-line character problem. Before this I didn’t even know that the “type of enter” can cause problems when running scripts. It consumed one whole day. It bugged me when I was trying to edit the file in Notepad on Windows. This post later helped me get over it. More on the coveralls setup in my next post!

Next came rebasing and resolving the conflicts between my dev branch and the master branch. When I started, my master branch was a few commits ahead (setting up of badges) and thus there were conflicts. A rebase was also required because my mentors suggested maintaining a clean history in the main branch. It took me a lot of experiments to finally understand how to go about rebasing my branches. The structure of my repo:

• origin/master
• origin/dev
• dev

So I have a local dev branch in which I develop my code and constantly push to the remote origin/dev branch for code reviews by my mentor. This leads to a lot of commits containing lots of small changes and fixes for silly issues. But when I make a pull request and merge onto the origin/master branch I wish to have a cleaner commit history.

An interactive rebase helps to modify that history using pick (keep the commit), squash (merge into the previous commit while editing the commit description) and fixup (merge into the previous commit, keeping the previous commit description intact). Understanding this required a lot of experiments with my branch, which is dangerous. So I had made a copy of my dev branch, which I suggest you do right now before continuing.

To rebase your local branch onto the origin/master branch use “git rebase -i <base branch>”. Warning: avoid moving commits up or down if they are working on the same file, as this may cause conflicts. Once it starts, resolving conflicts is a lot of pain because it triggers other conflicts as well if not done properly.

After rebasing comes the trickier part. Your local branch has brand new rebased commits and your remote has the old commits. You need to use “git push --force”. It will overwrite the commits on the remote branch, after which you can open a pull request onto origin/master. Don’t do it if there are other branches based on this branch. In that case merge directly onto master, the downside being that you won’t be able to make a pull request, which is essential for code discussions.

After all this my code was ready to merge and I got the go-ahead (after a day of internet cut, hate that) from my mentor to merge it. So I finally completed the second merge onto my main branch, implementing the management block and setting up coveralls.

### aleks_ (Statsmodels)

#### Hello testing!

Hello everyone!

Today I am going to share with you a neat little trick I have learned about during the first weeks of GSoC. I will show it in form of a small example which clarifies its use. You can also find a short description in the NumPy/SciPy Testing Guidelines.

Let's say you have a function calculating different things and returning them as a dictionary (e.g. the function is estimating different parameters, say alpha and beta, of a statistical model):

    def estimate(data, model_assumption):
        # do some calculations
        # alpha_est = ...
        # beta_est = ...
        return {'alpha': alpha_est, 'beta': beta_est}

Now we want to test this function for different data sets and different model assumptions. To do this we first create a separate test file called test_estimate.py. Inside this file we place a setup() function which takes care of loading the data and results:

    datasets = [d1, d2]
    model_assumptions = ['no deterministic terms', 'linear trend']
    results_ref = {}  # dict holding the results of the reference software
    results_sm = {}   # dict holding the results of our program
                      # (sm stands for statsmodels)

    def setup():
        for ds in datasets:
            load_data(ds)  # read in the data set
            results_ref[ds] = load_results_ref(ds)  # parse the reference's output
            results_sm[ds] = load_results_statsmodels(ds)  # calculate our results

Now that all results are accessible via the results_XXX dictionaries they only need to be compared. Note that the load_results_XXX(ds) functions return dictionaries such that results_XXX[ds] is dictionary as well. It is of the form

    {'no deterministic terms': results_no_deterministic_terms,
     'linear trend': results_linear_trend, ...}

and results_model_assumption is again a dict looking like {'alpha': alpha_est, 'beta': beta_est}.

Phew, this probably sounds a little complicated. So, why all these nested dictionaries? Well, it makes the actual testing very easy. To check whether our result for alpha is the same as in the reference software, we just do (assuming alpha is a numpy array):

    def test_alpha():
        for ds in datasets:
            for ma in model_assumptions:
                err_msg = build_err_msg(ds, ma, "alpha")
                obtained = results_sm[ds][ma]["alpha"]
                desired = results_ref[ds][ma]["alpha"]
                yield assert_allclose, obtained, desired, rtol, atol, False, err_msg

This code will now produce tests for alpha for all different combinations of data sets / model assumptions. So by adding data sets or model assumptions to the corresponding list the generated tests will multiply resulting in a nice set of tests. Failing tests can easily be identified by the error message given to numpy's assert_allclose in which the data set, the model assumption and the parameter that isn't calculated correctly are mentioned. If you have questions regarding this method of testing, check out the NumPy/SciPy Testing Guidelines or leave a comment.
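To see how the combinations multiply, here is a self-contained toy version of the same pattern that runs without any test framework. The data set names, the numbers and the `iter_alpha_checks` helper are invented for this sketch, and plain floats with `math.isclose` stand in for numpy arrays and `assert_allclose`.

```python
import itertools
import math

datasets = ["d1", "d2"]
model_assumptions = ["no deterministic terms", "linear trend"]

# Toy nested result dictionaries: results[ds][ma]["alpha"]
results_ref = {ds: {ma: {"alpha": 0.5} for ma in model_assumptions}
               for ds in datasets}
results_sm = {ds: {ma: {"alpha": 0.5 + 1e-12} for ma in model_assumptions}
              for ds in datasets}


def iter_alpha_checks():
    """Yield one (data set, assumption, obtained, desired) tuple per combination."""
    for ds, ma in itertools.product(datasets, model_assumptions):
        yield ds, ma, results_sm[ds][ma]["alpha"], results_ref[ds][ma]["alpha"]


checks = list(iter_alpha_checks())
failures = [(ds, ma) for ds, ma, obtained, desired in checks
            if not math.isclose(obtained, desired, rel_tol=1e-7)]
```

Two data sets and two model assumptions give four checks; adding a third entry to either list adds a whole row or column of checks without touching the test function.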

With that, thanks for reading! : )

#### About the Synopsis category

@cfelton wrote:

This category is used to post weekly GSoC student project summaries.

Posts: 1

Participants: 1

## Markov switching autoregression

If you studied statistics and remember the basics of time series analysis, you should be familiar with the autoregressive model, usually denoted as AR(p):
Here y is an AR process, e is a white noise term, nu is the mean of the process. Phi is a polynomial of order p:
L is the lag operator, which, applied to a time series element, gives the previous element. So (1) can actually be rewritten in the following explicit form:
Since the process definition (1) is essentially a linear equation between lagged process values and the error, it can be put in a state space form, as shown in [1], chapter 3.3.
Now let's extend equation (1) by adding an underlying discrete Markov process St of changing regimes:
Notice that the mean, error variance, and lag polynomial become dependent on the switching regime value. This is the so-called Markov switching autoregressive model (MS AR). Where can it be used in practice? Let's look at the example from [2], chapter 4.4, which I also used for testing my code:
This is a sequence of U.S. real GDP. Looking at the data, two regimes are noticeable: expansion and recession. Using maximum likelihood estimation, we can fit this data to a two-regime switching mean AR model to describe the law of real GDP changes quantitatively. The authors use an AR(4) model, and so do we. The next picture displays the smoothed (that is, conditional on the whole dataset) probabilities of being in the recession regime:

Peaks of probability accurately correspond to recession periods, which proves that Markov switching AR provides a sophisticated tool for analyzing an underlying structure of time process.
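For reference, the model described in this section can be written out as follows. This is a sketch in the standard mean-adjusted notation; the numbering (1) to (4) is matched to the way the text refers to the equations and is an assumption, as is the symbol $$\sigma^2_{S_t}$$ for the regime-dependent error variance.

$$\phi(L)(y_t - \nu) = e_t \qquad (1)$$

$$\phi(L) = 1 - \phi_1 L - \dots - \phi_p L^p \qquad (2)$$

$$y_t = \nu + \phi_1 (y_{t-1} - \nu) + \dots + \phi_p (y_{t-p} - \nu) + e_t \qquad (3)$$

$$\phi_{S_t}(L)(y_t - \nu_{S_t}) = e_t, \quad e_t \sim N(0, \sigma^2_{S_t}) \qquad (4)$$

Expanding (4) in the same way as (1) gives $$y_t = \nu_{S_t} + \phi_{1,S_t}(y_{t-1} - \nu_{S_{t-1}}) + \dots + \phi_{p,S_t}(y_{t-p} - \nu_{S_{t-p}}) + e_t$$, so the observation at time $$t$$ depends on the regimes $$S_t, S_{t-1}, \dots, S_{t-p}$$, that is on $$p+1$$ regime values at once.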

## Implementation

Markov switching autoregression is implemented in the ms_ar.py file in my PR to Statsmodels. This file contains the MarkovAutoregression class, which extends RegimeSwitchingMLEModel. This class "translates" equation (4) to the state space "language".
It was quite entertaining to express the ideas explained in chapter 3.3 of [1] in Python code. One thing I had to be very careful about was that, for an AR(p) model with k regimes, the state space representation should carry k^(p+1) regimes, since switching means occur in (4) with different regime indices. Thus, every state space regime represents p+1 lagged AR regimes.
Such a big number of regimes leads to longer computation time, which caused some problems. For example, Kim filtering of the aforementioned real GDP model took 25 seconds, which is too slow when we are doing a lot of BFGS iterations to find the likelihood maximum. Luckily I found a way to optimize the Kim filter, which was quite straightforward, in fact. If you remember a previous blog post, a Kim filter iteration consists of a heavy-weight Kalman filter step, where a Kalman filtering iteration is applied many (k^(2(p+1)) for MS AR!) times, followed by summing the results with weights equal to the joint probabilities of being in the current and previous regime. The thing is that with a sparse regime transition matrix, which is the case for the MS AR model, these joint probabilities are rarely non-zero, and we don't need to run Kalman filtering for the zero ones! This optimization decreased the Kim filter running time dramatically, to 2-3 seconds on my machine (which is not very powerful, by the way).
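The size of the saving can be checked with a small standalone sketch that enumerates only the reachable pairs of extended regimes. It is not the code from the PR; the function name, the tuple representation of an extended regime and the example transition probabilities are assumptions made for this illustration.

```python
from itertools import product


def extended_transitions(base, order):
    """Yield (prev, curr, prob) for the non-zero extended-regime transitions.

    `base[i][j]` is the probability of moving from regime i to regime j.
    An extended regime is the tuple (S_t, S_{t-1}, ..., S_{t-order}).
    """
    k = len(base)
    for prev in product(range(k), repeat=order + 1):
        for regime in range(k):
            prob = base[prev[0]][regime]
            if prob > 0.0:
                # The new regime enters at the front, the oldest one drops out.
                curr = (regime,) + prev[:-1]
                yield prev, curr, prob


base = [[0.9, 0.1], [0.25, 0.75]]  # made-up two-regime transition matrix
pairs = list(extended_transitions(base, order=4))

regimes = len(base) ** (4 + 1)  # k^(p+1) extended regimes
all_pairs = regimes ** 2        # k^(2(p+1)) pairs if none are skipped
```

With two regimes and AR(4) this leaves 64 pairs to filter instead of 1024, since each extended regime has only k possible successors.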

## EM-algorithm

The MarkovAutoregression class also provides an EM-algorithm. The Markov switching autoregressive model defined by (4) is, with some approximations, a regression with switching parameters and lagged observations as regressors. Such a model, as shown in chapter 4.3.5 of [2], has a simple closed-form solution for the EM iteration. The EM-algorithm is a great device for reaching very fast convergence. For example, in the comments to my PR I copied a debug output with the following numbers:
#0 Loglike - -1941.85536159
#1 Loglike - -177.181731435
Here #0 indicates the likelihood of the random starting parameters, and #1 indicates the likelihood of the parameters after one iteration of the EM-algorithm. A very significant improvement, isn't it?
MarkovAutoregression has two public methods to run the EM-algorithm: fit_em and fit_em_with_random_starts. The first just performs a number of EM iterations for given starting parameters, while the second generates a set of random starting parameters, applies the EM-algorithm to all of them, and finally chooses the one with the best likelihood.
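The random-starts idea can be sketched generically as follows. This is only an illustration of the strategy, not the signature or the body of fit_em_with_random_starts; the helper names and the toy "likelihood" are placeholders invented for the example.

```python
import random


def fit_with_random_starts(em_step, loglike, draw_start, n_starts=50, n_iter=10):
    """Run a fixed number of EM-like iterations from several random starts.

    Returns the parameters with the highest log-likelihood and that value.
    """
    best_params, best_ll = None, float("-inf")
    for _ in range(n_starts):
        params = draw_start()
        for _ in range(n_iter):
            params = em_step(params)
        ll = loglike(params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll


# Toy stand-ins (assumptions for the demo, not a real Markov switching model):
# a concave "log-likelihood" with its maximum at 3.0 and a step that moves
# halfway towards that maximum.
rng = random.Random(0)


def toy_loglike(x):
    return -(x - 3.0) ** 2


def toy_step(x):
    return x + 0.5 * (3.0 - x)


best, best_ll = fit_with_random_starts(
    toy_step, toy_loglike, lambda: rng.uniform(-10.0, 10.0)
)
print(best, best_ll)
```

The defaults of 50 starts and 10 iterations mirror the numbers mentioned in the testing section; everything else is a simplification.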

## Testing

Right now there are two test files for the MarkovAutoregression class, each based on one model: test_ms_ar_hamilton1989.py and test_ms_ar_garcia_perron1996.py. Besides formal functional tests, which check that filtering, smoothing and maximum likelihood estimation give correct values against this and this Gauss code sample, these files test the EM-algorithm in its typical usage scenario, where the user knows nothing about the correct parameters but wants to estimate something close to the global likelihood maximum. This task is handled by the already mentioned fit_em_with_random_starts method, which, by default, runs 50 sessions of the EM-algorithm from random starts, each session consisting of 10 iterations.

## What's next?

I hope that the hardest part of the project, that is, developing the Kim filter and Markov autoregression, is behind me. Two more models remain: the dynamic factor model and the time-varying parameters model with regime switching. There will also be a lot of refactoring of already written code, so some articles are going to be all about coding.

## Literature

[1] "Time Series Analysis by State Space Methods", Second Edition, by J. Durbin and S.J. Koopman.
[2] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

## Week 3

The task of Week 3 was to create a neat little warning message if the user runs a wrong coala bear. What do I mean by wrong? Well, PyLintBear is a wrong bear for Ruby code, right? I wanted to add a warning message if such a thing happens. The main things I needed to do to achieve this were:

• Get a list of all bears.
• Figure out the language of the file (e.g. Python/Java/JavaScript/CSS …).
• Trigger a warning message if the language of the bear being executed doesn’t match the language of file.

The first task was relatively easy, as I had done the same thing earlier for the Week 2 task, coala-bears-create, and the third part is as simple as it can get. The whole difficulty lay in the second part. I explored different tools and libraries like python-magic and mimetype, but none of them were accurate enough.

Language detection is a very difficult task and these tools weren’t able to solve the problem. I also tried to use Linguist, but its Python port wasn’t compatible with Python 3. Seeing all other options fail, I decided to use hypothesist’s dictionary, which maps file extensions to language names. That is fair enough for some basic checking, but it is not going to help in cases like C header files, Python 2 vs. 3, etc.
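To give an idea of what such an extension-based check looks like, here is a minimal sketch. The table below is a tiny hand-written stand-in for the real dictionary, and the function names are invented for this example; none of it is coala API.

```python
import os

# Hypothetical, deliberately tiny extension table (an assumption for the demo).
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".rb": "Ruby",
    ".js": "JavaScript",
    ".css": "CSS",
    ".java": "Java",
}


def guess_language(filename):
    """Return a language name based on the file extension, or None if unknown."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_LANGUAGE.get(extension.lower())


def language_mismatch_warning(bear_name, bear_languages, filename):
    """Return a warning string if the bear does not support the guessed language."""
    language = guess_language(filename)
    if language is None or language in bear_languages:
        return None  # unknown extension or matching language: stay silent
    return "{} does not support {} files such as {}".format(
        bear_name, language, filename
    )


print(language_mismatch_warning("PyLintBear", {"Python"}, "script.rb"))
print(language_mismatch_warning("PyLintBear", {"Python"}, "module.py"))
```

The weakness is visible right away: anything the table cannot decide (a .h file, Python 2 vs. 3) either gets no warning or the wrong one.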

But I had to start somewhere, and out of all the options this was the most feasible one, so I went ahead with it and began coding. However, as it turned out in the discussion on the issue, language detection is not that accurate, and it is more likely to fail than to succeed. So the status of the issue has been changed to not happening/won'tfix.

• The PR for the same can be seen here https://github.com/coala-analyzer/coala/pull/2310
• You can still see it in action, if you want:

## Week 4

The task of Week 4 was to clean up everything I had done so far, as MidSem evaluations were approaching. I hadn’t had any of my work reviewed until then and I was in a mess that week. With the deadline approaching fast, I had to clean up my stuff, do bug testing and get my work reviewed by my mentors and co-mentors. The reviewing took 4 long days, as there were around 6 commits and most of the code had to be redone. Some of the changes that were made during the review process are:

• Dynamically generating the list of bears rather than using a static list. Also used @lru_cache() to cache the results for some performance improvements.
• Removed a few extra prompts, such as the result message and the prerequisite fail message, which were just optional values. The template should be kept to the bare minimum needed, so following that philosophy it was decided to remove them.
• Support for multiple languages added in the dropdown.
• Logging exceptions in a better way by using logging module.
• Add some doctests, wherever possible.
• Use coala API to include StringConverter for easy str to set conversions.
• Add a gitlab-ci for triggering automatic builds running on Gitlab.
• General code cleanup, Minor bug fixing, refactoring some variables and formatting changes.
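The caching change from the first bullet can be illustrated with a small sketch. The discovery function below is a placeholder that just returns a few bear names; the real application collects them dynamically, and its function names are not shown here.

```python
from functools import lru_cache

DISCOVERY_CALLS = 0  # counts how often the (pretend) expensive discovery runs


@lru_cache()
def list_bears():
    """Pretend to discover all bears; the result is cached after the first call."""
    global DISCOVERY_CALLS
    DISCOVERY_CALLS += 1
    # Placeholder names; a tuple is returned so the cached value cannot be mutated.
    return ("PyLintBear", "RuboCopBear", "CSSLintBear")


first = list_bears()
second = list_bears()     # served from the cache, discovery is not repeated
print(DISCOVERY_CALLS)    # 1
print(first is second)    # True
```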

With great support from my mentor and co-mentors, I got my work accepted into the master branch. A sigh of relief, it was for me! You can also check it out with pip install coala-bears-create and let me know your feedback.

Watch it in action here:

I have taken away a huge lesson from my mistake of not getting my work reviewed earlier: early review saves everyone’s time and avoids last-minute panic attacks over code not working. To keep this from happening again, I plan to keep a devlog that I’ll update daily, and I will also get a mentor or someone from the coala community to review each week’s work.

## Future plans

I have started working on some UI changes in the coala application using Python Prompt Toolkit, and I am currently researching that. I am also reading up on unit tests, as I have to add them to my coala-bears-create application.

Happy Coding!

### fiona (MDAnalysis)

#### Timeline update and a sweet treat

Hello again! Sorry for the long wait since last post – I’m falling behind my proposed timeline so I’ve been focusing more on trying to catch that up a bit.

I’ve finished work on ‘add_auxiliary’ for now - the general framework and specific case for reading xvg files are basically done, though we’ll see in the coming stages if there are any bugs still to iron out. I’ll probably include in a future post a demonstration of the various features!

Let’s take a look at a revised timeline:

You can see I’m not quite as far along as I was hoping back at the start - building all the add_auxiliary stuff took longer than I expected! I also did some things I originally planned to do later, but in retrospect made more sense to do now. This includes in particular getting the documentation and unit tests nicely done up for AuxReader, and now that I have a good idea how both work, documentation and testing for future parts will hopefully go a lot quicker and smoother! I’m hoping I can stay roughly on track from here - but various bits can be simplified/dropped if need be, and I should still have a nice foundation, which can be further built upon later, by the end of GSoC.

So what next? I’m starting work on part 2 - an ‘Umbrella class’, a framework for storing the trajectories and associated information of a set of Umbrella Sampling simulations. The next post will focus on my plan for this in more detail - but before I leave, something a bit different!

### And now for something completely different

All the way back in my first post I mentioned I’m a keen baker. To make consuming sugary snacks even more exciting, I’ve done in the past a couple of ‘edible voting competitions’ within my research group – SBCB, at the University of Oxford – to decide important matters like the fate of New Zealand’s flag design by eating cookies.

I thought this time I’d try something relevant to my GSoC project, so may I present:

Let’s meet our contestants (the selection of which was heavily biased towards the tools that I use and had ‘logos’ I could reasonably approximate with coloured fondant):

1. GROMACS: Software for performing and analysing Molecular Dynamics (MD) simulations.

2. VMD (Visual Molecular Dynamics): A program for visualising and analysing molecular structures and simulation trajectories.

3. Tcl (and Tk): A programming language; the command-line interface for VMD uses Tcl/Tk, so writing Tcl scripts lets us automate loading and analysing structures/trajectories in VMD.

4. Python: another programming language – as you’re hopefully aware by now, since it’s what I’m using for my GSoC project! In SBCB, it is often used when writing scripts to automate setup, running and/or analysis of simulations when direct interaction with VMD isn’t required.

5. MDAnalysis: A Python library for loading and analysing MD simulations – again, as you hopefully know by now, since it’s what I’m working on for GSoC!

6. Git: A version control system, allowing you to keep track of changes to a set of files in a ‘repository’ (so when you find everything is broken after you made a bunch of changes, you don’t have to spend ages tracking them all down to revert to the working version).

7. Github: A web-based hosting service for Git repositories, allowing sharing of code and collaboration on projects. MDAnalysis is there, I have a page too, and it’s where I’ve been pushing all my proposed changes/additions (see add_auxiliary here) so other people can check them over!

(There are alternatives for each of these that perform more or less the same functions – but the above are those largely used in SBCB).

We started off with three of each ‘logo’, and the rule was that each time you took a biscuit, you took the one you use the least, like the least, or have never even heard of – ideally leaving us with SBCB’s favourite MD tool!

So how did the vote go? (*drumroll*)

So congratulations Python – you’re the best tool for MD*, as voted by SBCB!

(*Disclaimer: unfortunately this otherwise highly rigorous ‘scientific study’ was somewhat biased by the participation of several non-MD or non-computational personnel and, on a couple of occasions, disregard for the rules in favour of the cookie that was closest or most aesthetically pleasing.)

And on that triumphant note, I’ll sign off here, and see you all next post! In the meantime, if you’re disappointed you didn’t get to eat an (unofficial) MDAnalysis cookie, why not go buy an (official, though inedible) MDAnalysis sticker to show off instead?

## June 22, 2016

### Aakash Rajpal (italian mars society)

#### Midterms coming and I am good to go

Hey all, midterms are coming this week and the first part of my proposal is about done. Some documentation work is all that remains.

Well, my first part involved integrating the Leap Python API with the Blender Python API. Initially it was tough to integrate them, as the Leap API is designed for Python 2.7 whereas Blender only supports Python 3.5. After days of trying, I found a solution with a little help from the community and was able to generate a Python 3.5 wrapper for the Leap SDK through SWIG. The wrapper worked perfectly fine, and thus I successfully integrated the two APIs.

I talked to my mentor about the gesture support required for the project and added more gesture support.

I am now documenting my code and preparing myself for the second part of my project.

### tushar-rishav (coala)

#### Beta release

So finally we released the coala-html beta version. At present, coala-html generates an interactive webpage using the results obtained from coala analysis. Users can search across results, browse the files and also see the particular code lines where errors were produced, similar to a coverage tool that displays the lines being missed. At present we support the Linux platform only and will add more cool features in coming releases.

We would love to hear from you. If you have any feature proposal or if you find any bugs, please let us know.

Now, with coala-html released I’ve started working on coala website. Further updates in next blog!
:)

### TaylorOshan (PySAL)

#### Testing...one...two...

In the last week or so I have refined existing unit tests and added new ones to extend coverage to all of the user classes (Gravity, Production, Attraction, and Doubly) as well as to the BaseGravity class. Instead of testing every user class for every possible parameterization, the BaseGravity class is tested over different parameterizations, so that the tests for the user classes can focus primarily on the building of dummy variables for the respective model formulations. In contrast, the BaseGravity tests cover different cost functions (power or exponential) and will also be used for different variations that occur across the user classes. Unit tests were also added for the CountModel class, which serves as a dispatcher between the gravity models and more specific linear models/estimation routines. Finally, unit tests were added for the GLM class, which is currently being used for all estimation on all existing spatial interaction models within SpInt. This is expected to change when gradient-based optimization is used for estimation of zero-inflated models in an MLE framework instead of the IWLS estimation currently used in the GLM framework.

In addition to unit tests, code was also completed for handling overdispersion. First, several tests were added for testing the hypothesis of overdispersion within a Poisson model. Second, the QuasiPoisson family was added to the GLM framework. This is essentially the same as the Poisson family, except that a scale parameter, phi (also known as the dispersion parameter), is estimated using a chi-squared statistic approximation and used to correct the covariance, standard errors, etc. so that inference is more conservative in the face of overdispersion. QuasiPoisson capabilities were then added to the gravity model user classes as a boolean flag that defaults to false, so one can easily adopt a quasi-MLE Poisson modeling approach if the dispersion tests indicate significant overdispersion. It was decided to push the development of the zero-inflated Poisson model to the end of the Summer of Code schedule, which is also where gradient-based optimization now resides. This makes sense, since these go hand-in-hand.
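As a rough sketch of the dispersion correction (using the standard Pearson chi-squared estimate, which may differ in detail from what the GLM code does), phi can be computed from observed counts and fitted means. The function names and the toy numbers are made up for illustration and are not part of SpInt.

```python
from math import sqrt


def pearson_dispersion(observed, fitted, n_params):
    """Estimate phi as the Pearson chi-squared statistic over residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)


def quasi_standard_error(poisson_se, phi):
    """Scale a Poisson standard error by sqrt(phi), as a quasi-Poisson fit would."""
    return poisson_se * sqrt(phi)


# Toy data (assumed for the demo): four counts, a constant fitted mean, one parameter.
flows = [2, 4, 9, 1]
fitted_means = [4.0, 4.0, 4.0, 4.0]

phi = pearson_dispersion(flows, fitted_means, n_params=1)
print(phi)                             # > 1 suggests overdispersion
print(quasi_standard_error(0.5, phi))  # inflated standard error
```

With phi above 1 the corrected standard errors grow, which is why the quasi-Poisson results are the more cautious ones.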

Next up on the agenda is an exploratory data analysis technique to test for spatial autocorrelation in vectors and a helper function to calibrate origin/destination specific models so that the results can be mapped to explore possible non-stationarities.

## Recap

Last week I managed to merge the code I was working on for the last 4 weeks. It was meant to bring coala language independence support, as far as the bears are concerned. As I explained in the previous posts the developer will still need to write a bit of python because the functionality is implemented as a python decorator. A minimum of 3-4 lines of code is necessary to write the wrapper using the implemented decorator.

With all that being said, I am proud to announce that bears can now officially be written in languages other than python. Is this a daring cover picture? Yes, yes it is. Have I tested coala with each language represented there? No, no I haven't. It is worth mentioning that there are a lot more features that can be added to the decorator and I will definitely try to add as many as possible.

## I don't want to write python

First of all, you should ask yourself why. Secondly, the next part of my project revolves around creating and packaging utilities. Let's explain better.

###### Creating

For the users that don't feel comfortable with python, there should be some kind of script that creates the wrapper automatically for them by asking some questions and then filling in a standard template. This is my goal for this week: building such a utility.

###### Packaging

Bears in coala come in a separate python package called coala-bears, which includes all bears developed by the coala community so far. There is another GSoC project that aims to make bears decentralized so that one can download only the bears that one needs. It is the goal of my project to develop a utility that will let you package your bear (presumably as a pypi package) so that distribution becomes much easier.

## Conclusion

To sum it up, the language independence part is pretty much done. Now I am working on making it easier to grasp onto and actually use. I never actually know how to end these blogs and they get kind of awkward at the end so...

### Nelson Liu (scikit-learn)

#### (GSoC Week 4) MAE and Median Calculation

In the first part of my project, I am implementing the Mean Absolute Error criterion for the scikit-learn DecisionTreeRegressor. In this blog post, I'll talk about what the criterion does, as well as a technical / design challenge I faced while implementing it and two solutions I used.

# Criterion and impurity

When growing the tree from data, we want to find the best split at each possible level in the tree. To do so, we minimize the impurity (such that we want all of our nodes to be pure). A pure node in classification is one where all of the training samples are of a single class, and a pure node in regression is one where all of the training samples have the same value (for a review of how decision trees work, see my week 2 post or the plethora of resources on the web). For the rest of the post, I'll be focusing on regression (as it is the domain of choice for my project).

There are a myriad of ways to define the node impurity, which can drastically affect the way the tree is grown. One of the most common impurity criteria is mean squared error (MSE, closely related to the sum of squared errors). When using mean squared error, the tree makes splits to optimize $$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - f(x_{i}))^2$$

This formula might be a little abstract, so I'll go over what each of the terms means. In this case, $$n$$ is the weighted number of samples in the current node (sum of all the weights). $$x_i$$ denotes the features of the sample $$i$$. $$y_i$$ denotes the correct value of sample $$i$$. $$f(x_i)$$ denotes the predicted value of sample $$i$$ (think of the function $$f$$ as taking in a sample and returning a prediction). As a result, the MSE sums up all of the square of the "errors" (difference between predicted and real values).

In MAE, we want to minimize
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - f(x_i)|$$
where $$y_i$$ is again the true value and $$f(x_i)$$ is the predicted value. To minimize the MAE, you need to use the median of the $$y_i$$s, due to the fact that you are forced to make the same prediction for all samples in a node.
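A quick numeric check makes this concrete. The snippet below is only a toy illustration in plain Python (the real criterion lives in Cython inside scikit-learn); the sample values are made up.

```python
from statistics import mean, median


def mean_absolute_error(values, prediction):
    """MAE when the same prediction is used for every sample in a node."""
    return sum(abs(v - prediction) for v in values) / len(values)


y = [1, 2, 3, 10]  # toy node values with one outlier

mae_at_median = mean_absolute_error(y, median(y))  # prediction 2.5 -> MAE 2.5
mae_at_mean = mean_absolute_error(y, mean(y))      # prediction 4   -> MAE 3.0
print(mae_at_median, mae_at_mean)
```

Predicting the median gives the smaller MAE, whereas the mean is pulled towards the outlier.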

As a result, the weighted median problem is a large part of this criterion, as we need to find the weighted median of the samples and use that as our prediction $$f(x_i)$$. In this blog post, we will temporarily ignore the weights and assume that all samples are equally weighted. There isn't much in the literature about efficient calculation of weighted medians, so my mentors and I are still working on methods to do it acceptably.

A common way to define the median of a set of samples is as the value such that the number of samples above and below the median are both less than or equal to half the total number of samples. The weighted median is thus similar, but we seek to find a value such that the total weights of the samples above and below the median are both less than or equal to half the total weight of all samples. If this seems a bit strange, don't worry! Examples are provided a bit further below.
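For the weighted case, a simple (and not particularly efficient) sketch follows directly from the definition: sort by value and walk along the cumulative weight until half of the total weight is reached. This is an illustration only, not the approach used in the project; when two values qualify it returns the lower one instead of averaging them.

```python
def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            # Weight strictly below `value` is < half, weight above is <= half.
            return value
    raise ValueError("need at least one sample")


print(weighted_median([1, 2, 3], [1, 1, 1]))  # equal weights: the ordinary median
print(weighted_median([1, 2, 3], [1, 1, 5]))  # the heavy sample pulls the median up
```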

# Efficiently calculating the median

The speed of the mean absolute error criterion depends heavily on the implementation of the median calculation algorithm. In the rest of the post, I will go over two methods I used to calculate the median for any range of samples in a node --- one naive initial implementation, and a solution specifically optimized for the DecisionTreeRegressor.

More formally, the problem is as follows: given a sorted array X denoting the samples and an array Y denoting the values corresponding to the samples in X, calculate the medians of the Y values on either side of every possible split point in X. This is a bit confusing, so I will give an example below.

For example, define the samples X = [[1], [3], [7], [8]] and the values Y = [3, 1, 6, 9]. Define the start index as 0, and the end index as 4. We will use the variable pos to denote intermediate split points that divide the set of values into the two subsets we need. The problem is to quickly calculate the medians of samples[start:pos] and samples[pos:end] for every value of pos between start and end (exclusive).

So in this case, the first iteration would be:

start = 0, pos = 1, end = 4:
Find the median of samples[start:pos]:
Y = [3]
Thus, the median of the values is 3.

Find the median of samples[pos:end]:
Y = [1, 6 , 9]
Thus, the median of the values is 6.


In the second iteration, we increment pos by one, and calculate the new medians.

start = 0, pos = 2, end = 4:
Find the median of samples[start:pos]:
Y = [3, 1]
Thus, the median of the values is 2.

Find the median of samples[pos:end]:
Y = [6 , 9]
Thus, the median of the values is 7.5.


Moving on, you'd increment pos one last time to calculate the medians between samples[0:3] and samples[3:4], giving you 3 and 9 (see if you can verify this for yourself!).

This problem may seem quite contrived and random, but it is exactly the process of finding the best split from a given node. The node needs to split at some value of pos between start and end that minimizes the impurity. As a result, it simply tries all values of pos between start and end to find the minimum and performs the split at that value of pos. The samples from start to pos is a candidate left child to split on, and the samples from pos to end is a candidate right child.

## Naive Implementation

Initially, I implemented an extremely naive version of the median calculation because I did not realize just how many splits it would be searching over. In the naive implementation, I put the values for each position of pos in a new array, and then sorted this new array to find the median. As a result, I was sorting an array of variable size every time a candidate split was to be evaluated; this was incredibly inefficient.
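In plain Python (the actual code is Cython, so treat this as a sketch with made-up names), the naive approach looks roughly like this, using the Y values from the example above:

```python
from statistics import median


def naive_split_medians(values, start, end):
    """Recompute both medians from scratch for every candidate split position."""
    results = []
    for pos in range(start + 1, end):
        left = sorted(values[start:pos])   # a fresh sort for every pos
        right = sorted(values[pos:end])    # and another one for the other side
        results.append((pos, median(left), median(right)))
    return results


Y = [3, 1, 6, 9]
for pos, left_median, right_median in naive_split_medians(Y, 0, 4):
    print(pos, left_median, right_median)
```

Every candidate split pays for two full sorts, which is exactly the waste described above.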

In trying to refactor this, it was important to take into consideration the choice of data structure and the idea that re-sorting the whole array when shifting over pos by one would be wasteful. This eventually led me to the solution implemented in my project.

## MedianHeaps Implementation

After discussion with my mentors Jacob and Raghav, we realized that this problem was the same as finding the median of a running stream of numbers. As a result, I decided to implement a modified MedianHeap, a structure used to solve this sort of problem.

A heap is a data structure that takes values and orders them such that the maximum (in a max heap) or minimum (in a min heap) is always at the top. Heaps are usually implemented internally as an array-represented tree, where the maximum or minimum value of all the data in the heap is the root node of the tree. The specific implementation details of the heap are widely documented, and are thus out of the scope of this blog post. I implemented a MinMaxHeap object in Cython, which could be used as either a min or max heap (based on a parameter passed in at construction). The MinMaxHeap had methods is_empty() (check if there are any elements in the heap), size() (return the number of elements in the heap), push(DOUBLE_t) (add a value to the heap), remove(DOUBLE_t) (remove a specific value from the heap), pop(DOUBLE_t*) (remove the top value from the heap and store it in a pointer of DOUBLE_t), and peek(DOUBLE_t*) (store the top value in the heap in a pointer of DOUBLE_t).
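For readers who want to play with heaps without writing Cython, Python's standard heapq module gives the same behaviour. This is only an analogy to the MinMaxHeap described above, not its implementation; heapq provides a min heap, and a max heap can be emulated by storing negated values.

```python
import heapq

values = [3, 1, 6, 9]

# Min heap: the smallest value is always at index 0.
min_heap = []
for v in values:
    heapq.heappush(min_heap, v)
smallest = min_heap[0]              # "peek" without removing

# Max heap emulated with negated values: the largest value is -max_heap[0].
max_heap = []
for v in values:
    heapq.heappush(max_heap, -v)
largest = -max_heap[0]

popped = heapq.heappop(min_heap)    # "pop" removes and returns the top value
print(smallest, largest, popped, len(min_heap))
```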

With this MinMaxHeap class, I was able to build another MedianHeap class. A MedianHeap uses a max heap and a min heap to efficiently calculate the median of a dataset at any given time, and allows quick insertion and deletion (because you don't have to re-sort the entire dataset on each modification). To understand how the MedianHeap works, it's useful to visualize the median as the center point of two halves of the data. On the left half, containing all the values less than the current median, is a max heap. On the right half, containing all the values greater than the current median, is a min heap. This configuration allows us to solve for the median quickly --- the median is at the top of whatever heap is larger, and is the average of the top of both heaps if they are the same size.

Maintaining this nice property in median calculation comes with some extra complications in insertion and removal, though. When inserting a value into the MedianHeap, it's important to pick the correct internal heap to add to (either the min heap on the right or the max heap on the left). If the value to be inserted is greater than the median, it goes in the right (and thus to the min heap). Similarly, if the value to be inserted is less than the median, it is inserted to the left (the max heap). After inserting, it's important to rebalance the MedianHeap; this involves moving values from the left to the right, or vice versa. This ensures the property that the two MinMaxHeaps have sizes within one of each other. Removal follows a similar procedure to insertion (to decide which heap to remove from), and rebalancing is also needed afterwards.

To efficiently solve the problem outlined earlier, we use two MedianHeaps --- one for the left (samples[start:pos]) and one for the right (samples[pos:end]). Initially, we add all samples except one to the MedianHeap on the right, and calculate the medians accordingly. When moving pos, we don't have to re-sort and recalculate the median anymore! We can simply remove the value corresponding to the old value of pos from the right MedianHeap, and add it to the MedianHeap on the left. In this manner, we can efficiently calculate the impurity for all possible children nodes at split time.
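The whole scheme can be sketched in pure Python. This is a simplified stand-in for the Cython classes, built on the standard heapq module: names and details are my own, removal here is a linear-time list operation rather than a proper heap removal, and for simplicity all values start in the right-hand heap.

```python
import heapq


class MedianHeap:
    """Running median backed by a max heap (lower half) and a min heap (upper half)."""

    def __init__(self):
        self._low = []   # max heap, stored as negated values
        self._high = []  # min heap

    def push(self, value):
        if self._low and value <= -self._low[0]:
            heapq.heappush(self._low, -value)
        else:
            heapq.heappush(self._high, value)
        self._rebalance()

    def remove(self, value):
        # Simplification: list.remove + heapify is O(n), unlike a real heap removal.
        if self._low and value <= -self._low[0]:
            self._low.remove(-value)
            heapq.heapify(self._low)
        else:
            self._high.remove(value)
            heapq.heapify(self._high)
        self._rebalance()

    def _rebalance(self):
        # Keep the two heap sizes within one of each other.
        if len(self._low) > len(self._high) + 1:
            heapq.heappush(self._high, -heapq.heappop(self._low))
        elif len(self._high) > len(self._low) + 1:
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if len(self._low) > len(self._high):
            return -self._low[0]
        if len(self._high) > len(self._low):
            return self._high[0]
        return (-self._low[0] + self._high[0]) / 2


def split_medians(values):
    """Medians of values[:pos] and values[pos:] for every pos, moving one value at a time."""
    left, right = MedianHeap(), MedianHeap()
    for v in values:
        right.push(v)
    results = []
    for pos in range(1, len(values)):
        moved = values[pos - 1]
        right.remove(moved)   # the value leaves the right child ...
        left.push(moved)      # ... and joins the left child
        results.append((left.median(), right.median()))
    return results


print(split_medians([3, 1, 6, 9]))
```

For Y = [3, 1, 6, 9] this yields the same medians as the worked example earlier (3 and 6, then 2 and 7.5, then 3 and 9), but each step only touches the one value that moves.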

Using two MedianHeaps in tandem resulted in a massive increase in speed for training --- more than 10x on large datasets! While MAE is still quite slow compared to MSE (mainly because there is no proxy function for calculating the median, thus at least some amount of sorting is needed at some time step), implementing a calculation scheme with MedianHeaps speeds it up enough and makes it viable for practical use.

If you have any questions, comments, or suggestions, you're welcome to leave a comment below :)

Thanks to my mentors Raghav RV and Jacob Schreiber for guiding me through any questions I have and helping me work through them.

You're awesome for reading this! Feel free to follow me on GitHub if you want to track the progress of my Summer of Code project, or subscribe to blog updates via email.

## June 21, 2016

### Pranjal Agrawal (MyHDL)

#### Midterm report and future plan

The mid-term evaluations are here. For this, I am required to submit a report of my work so far, and list the plan for the future weeks. So here goes.

### The work till now

In the month since GSoC coding period started, I have :

Week 1 - 2 : Created the tools (simulator and assembler) for the core, to better understand the design and the instruction set.
Week 3 - 4 : Wrote tests for and coded the main modules of the processor.

With respect to the timeline detailed in my GSoC proposal, I have met most of my deadlines. The core and tools have been coded. Tests have been written and are passing for the most part (some tests that are not yet passing are marked @pytest.mark.xfail, to be fixed next).

A PR has been opened from the main development branch, core, to the master branch of the repo, which can be seen at:

https://github.com/forumulator/pyLeros/pull/1

### Issues

Unfortunately, I had to take a couple of unplanned trips urgently, due to which my work, and more importantly my work flow, suffered in the first couple of weeks. But I worked extra during the next two weeks to make up for the slow start, and now I am almost at my midterm goals.

Work wise, the one major thing that I planned and that has been shifted to post-midterm is setting up the hardware and testing the core on the Atlys and Basys FPGA boards, both of which I own. Unfortunately, this is not the simplest task. Subtle issues in the code manifest themselves in actual hardware execution that do not appear during simulation. For example, there's the issue of delta delays that occur between simulation steps but are not present in the hardware, which can lead to subtle differences. Further, setting up I/O properly for the boards is a significant task. This makes building for hardware different from building for simulation.

### Plan for the coming weeks

In the next couple of weeks, I plan to have a completely working processor, including on the hardware. Further, the code will be refactored to take advantage of some of the advanced features of myHDL, including interfaces. That leaves me with enough time to devote to working on the SoC design and the comparison of the VHDL and myHDL versions of the core.

Week 5: Clean up the code and add documentation wherever missing. Make sure that all the tests pass and the simulation of the processor is working
Week 6: Add I/O, reusing uart from rhea if possible. Refactor the code to use interfaces. Write small examples for the instruction set.
Week 7: Set up the Atlys and Basys boards. Make sure that the processor works on FPGAs, along with all the examples. Add I/O for the hardware. Write a script to build for the two boards.

In conclusion, I worked, had issues, completed almost all goals for the midterm evals, and hope to resolve the issues in the coming weeks. I'm really enjoying this experience.

#### Week 1-2 Summary : Assembler and Simulator

This post is a little late in coming, I know. As I mentioned in the earlier post, I was completely cut off from the internet for the first couple of weeks, and the communication part of the project has been a little weak.

Anyway, this is about the work done in the first 2 weeks. The first 2 weeks were dedicated to studying the design of the processor and creating the tools, including the simulator and the assembler. Creating the assembler and linker helped me get thoroughly familiar with the instruction set, while the simulator helps in understanding the data paths that need to be built in the actual processor. Plus, these tools are useful for quickly writing examples to test on the actual core.

What follows is a description of both the tools.

### Simulator:

An instruction set simulator, or simply a simulator, for those who don't know, is a piece of software that does what the processor would do, given the same input. We are 'simulating' the behaviour of hardware on a piece of software. The mechanism, of course, is completely different.

An ISS is usually built for a processor to model how it would behave in different situations. Compared to describing the entire datapath of the processor, a simulator is much simpler to code. Consider, for example, the instruction:

0x08 0x12 #ADD r1

Here 0x08 is the opcode, and 0x12 is the address of the register described by the identifier r1. Since the design of Leros is accumulator based, one of the operands (the accumulator) is implicit, and this instruction adds the content of memory location r1 to the contents of the accumulator, storing the result back in the accumulator. The actual processor would involve a decoder.

On a simulator, this can easily be modelled by a decoder function containing if-else statements that do the same job, for example,

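opcode = instr >> 8          # high 8 bits select the operation
if opcode == 0x08:           # ADD
    addr = instr & 0xff      # low 8 bits: address of the operand
    acc += data_mem[addr]    # data_mem: a list modelling the data memory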

The storage units, for example the accumulator, the register file, or the data/instruction memory, are modelled by variables. And that's pretty much all there is to a simulator.
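To make the idea concrete, here is a minimal sketch of such a simulator. It is not the project's actual code: only the ADD opcode (0x08) is taken from the example above, while the LOAD and STORE opcode values, the class name and the memory size are assumptions chosen for illustration.

```python
# Minimal accumulator-machine simulator sketch.
# Only OP_ADD = 0x08 comes from the post; the other opcodes are hypothetical.
OP_ADD = 0x08
OP_LOAD = 0x20   # hypothetical value
OP_STORE = 0x30  # hypothetical value


class Simulator:
    def __init__(self, program, mem_size=256):
        self.program = list(program)     # instruction memory: 16-bit words
        self.data_mem = [0] * mem_size   # data memory / register file
        self.acc = 0                     # accumulator
        self.pc = 0                      # program counter

    def step(self):
        """Decode and execute one instruction."""
        instr = self.program[self.pc]
        opcode = (instr >> 8) & 0xFF     # high 8 bits
        addr = instr & 0xFF              # low 8 bits
        if opcode == OP_ADD:
            self.acc = (self.acc + self.data_mem[addr]) & 0xFFFF
        elif opcode == OP_LOAD:
            self.acc = self.data_mem[addr]
        elif opcode == OP_STORE:
            self.data_mem[addr] = self.acc
        else:
            raise ValueError("unknown opcode 0x{:02x}".format(opcode))
        self.pc += 1

    def run(self):
        while self.pc < len(self.program):
            self.step()


# LOAD 0x10; ADD 0x12; STORE 0x11
sim = Simulator([0x2010, 0x0812, 0x3011])
sim.data_mem[0x10] = 5
sim.data_mem[0x12] = 7
sim.run()
print(sim.acc, sim.data_mem[0x11])
```

The if-else chain plays the role of the hardware decoder, and plain Python attributes stand in for the storage units.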

### Assembler:

Before assembly code can be simulated, it needs to be assembled into binary for a particular instruction set, and that is the job of the assembler. The major difference between an assembler and a compiler is that most assembly code is just a human-readable version of the binary that the processor executes. The major jobs of an assembler are:

1. Process assembler directives for data declaration, like a_: DA 2, which assigns an array of two bytes to a.
2. Convert identifiers to actual memory locations.
3. Convert instructions fully into binary.

When the program is split into multiple files, there are often external references, which are resolved by the linker. The linker's job is to take two assembled files, resolve the external references, and combine them into a single memory image for loading.
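The three jobs listed above map naturally onto a two-pass design. The sketch below is only an illustration under assumptions: the directive syntax is simplified to `name: DA n`, and the opcode table is hypothetical apart from ADD = 0x08, which is the value used in the simulator example.

```python
# Two-pass assembler sketch. The opcode table is hypothetical,
# except ADD = 0x08, which is the value used in the post.
OPCODES = {"ADD": 0x08, "LOAD": 0x20, "STORE": 0x30}


def assemble(lines):
    symbols = {}        # identifier -> data memory address
    instructions = []   # (mnemonic, operand) pairs in program order
    next_addr = 0

    # Pass 1: handle "name: DA n" directives and collect the instructions.
    for line in lines:
        line = line.split("#")[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if ":" in line:
            name, directive = line.split(":")
            kind, size = directive.split()
            if kind != "DA":
                raise ValueError("unknown directive " + kind)
            symbols[name.strip()] = next_addr
            next_addr += int(size)
        else:
            mnemonic, operand = line.split()
            instructions.append((mnemonic, operand))

    # Pass 2: replace identifiers with addresses and emit 16-bit words.
    words = []
    for mnemonic, operand in instructions:
        addr = symbols[operand] if operand in symbols else int(operand, 0)
        words.append((OPCODES[mnemonic] << 8) | (addr & 0xFF))
    return words


program = ["a: DA 2", "b: DA 1", "LOAD a", "ADD b", "STORE 0x10"]
print([hex(word) for word in assemble(program)])
```

Collecting every identifier before emitting any word is what lets an instruction refer to a name that is declared later in the file.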

### Leros instruction set and tools

Since the Leros instruction set is of constant length (16 bits) and uses only one operand (the other, the accumulator, being implicit), the job was greatly simplified. The first pass only has to maintain a list of all the identifiers. There are no complex instructions as in the 8085 instruction set, nor a complex encoding as in the MIPS instruction set.

The high 8 bits represent the opcode, with the lowest opcode bit indicating whether the instruction is immediate. The next two bits describe the ALU operation, which can be arithmetic, like ADD, or logical, like OR, AND, XOR, SHR. Data is read from and written to memory using the load and store instructions. The addressing can be either direct or immediate, with the first 256 (2^8) words of the memory directly accessible through the address given in the lower 8 bits of the instruction, instr(7 downto 0). Higher addresses can be accessed using indirect loads and stores, in which an 8-bit offset is added to a base address that is itself retrieved from memory using a load.
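As a small illustration of this layout, the following sketch splits a 16-bit word into the fields named in the text. The function name and the returned tuple are assumptions made for the example; the ALU-operation bits are left out because the text does not pin down their exact position.

```python
def decode(instr):
    """Split a 16-bit instruction word into (opcode, immediate, operand)."""
    opcode = (instr >> 8) & 0xFF      # high 8 bits
    immediate = bool(opcode & 0x01)   # lowest opcode bit flags an immediate
    operand = instr & 0xFF            # low 8 bits, instr(7 downto 0)
    return opcode, immediate, operand


print(decode(0x0812))
```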

Finally, branching is done using the BRANCH, BRZ, BRNZ, BRP and BRN instructions, which respectively mean unconditional branch, branch if zero, branch if non-zero, branch if positive, and branch if negative.
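In a simulator, the branch decision can be modelled as a simple lookup. This sketch assumes that the condition is tested on the accumulator, interpreted as a signed 16-bit value, and that zero counts as positive; both are assumptions of the example rather than statements from the Leros documentation.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_taken(mnemonic, acc):
    """Return True if the given branch instruction would be taken."""
    acc = to_signed16(acc)
    conditions = {
        "BRANCH": True,      # unconditional
        "BRZ": acc == 0,
        "BRNZ": acc != 0,
        "BRP": acc >= 0,     # assumption: zero counts as positive
        "BRN": acc < 0,
    }
    return conditions[mnemonic]


print(branch_taken("BRN", 0xFFFF))
```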

I/O can be specified by the IN and OUT instructions, with the I/O address given as the lower 8 bits of the instruction.

That's the end of that. Stay tuned for more!

### TaylorOshan (PySAL)

#### Sparse Categorical Variables Bottleneck

This post is a note about the function I wrote to create a sparse matrix of categorical variables to add fixed effects to constrained variants of spatial interaction models. The current function is quite fast, but may be able to be improved upon. This link contains a gist to a notebook that explains the current function code, along with some ideas about how it might become faster.

## Converting patches to GitHub pull request

In the last blog post, I told you about this feature where the patch submitted by the developer should be converted to a GitHub pull request. This feature consists of three tasks:

### 1. Create a branch

First, we need to get the Python version affected by the bug. Then, I use Python’s subprocess module to check out a new branch from that version’s branch. The contents of the patch are saved as text in the postgres ‘file’ table. I create a file from this content and apply the patch to the new branch. Finally, I commit and push the new branch to GitHub.
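A rough sketch of this step with subprocess might look as follows. The function names, the branch names and the remote name origin are placeholders, and the exact git invocations used in the project may differ (for example, some patches need to be applied with different options).

```python
import subprocess


def branch_commands(base_branch, new_branch, patch_path, message):
    """Return the git commands that turn a patch file into a pushed branch."""
    return [
        ["git", "checkout", "-b", new_branch, base_branch],
        ["git", "apply", patch_path],
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", new_branch],
    ]


def create_branch(repo_dir, base_branch, new_branch, patch_path, message):
    """Run the commands inside repo_dir, stopping at the first failure."""
    for command in branch_commands(base_branch, new_branch, patch_path, message):
        # check=True raises CalledProcessError if git exits with an error
        subprocess.run(command, cwd=repo_dir, check=True)


commands = branch_commands("3.5", "issue-1234", "fix.patch", "Apply patch")
print(commands[0])
```

Keeping the command list separate from the code that runs it makes the sequence easy to inspect and test without touching a real repository.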

### 2. Create a pull request

The GitHub API provides an endpoint to create a pull request. I first set up an access token to authorize the requests to GitHub. Then, I just need to provide the base and head branches and a title for the pull request as data.
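For illustration, the request can be built with the standard library alone. The sketch below only constructs the request and does not send it; the owner, repository, token and branch names are placeholders, and the helper name is made up for this example.

```python
import json
import urllib.request


def pull_request_request(owner, repo, token, title, head, base):
    """Build (but do not send) a request that would create a pull request."""
    url = "https://api.github.com/repos/{}/{}/pulls".format(owner, repo)
    payload = json.dumps({"title": title, "head": head, "base": base})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = pull_request_request(
    "example-owner", "example-repo", "TOKEN", "Fix issue", "issue-1234", "master"
)
# Sending it would be urllib.request.urlopen(request); not executed here.
print(request.get_method(), request.full_url)
```

On success, the JSON response describes the new pull request, including fields such as html_url and state.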

### 3. List Pull Request on Issue’s page

The response from a successful API request contains all the information about the new GitHub PR. I save the URL and state of the PR in the database and link it to the issue.

That is it for this blog post, thank you for reading. Let me know if you have any questions.

## June 20, 2016

### Upendra Kumar (Core Python)

#### Page Screenshots in tkinter GUI

1. Install from PyPI :
2. Update Page
3. Install from Requirement Page
4. Install from Local Archive
5. Install from PyPI

These are some screenshots of the GUI application developed so far. In the next few days, I need to fix and give a final touch to these functionalities and write test modules for the GUI application.

Further work for next week involves making the application multithreaded, implementing advanced features and improving the GUI experience. It is essential for me to complete the basic GUI application, so that I can get user feedback when this application is released with Python 3.6 in the first week of July.

### Pranjal Agrawal (MyHDL)

#### GSoC development progress and the first blog

June 20, 2016

I got selected for GSoC 2016 for the Leros Microprocessor project under the myHDL organization, which is a sub-org of Python. The project consists of porting and refactoring the code for the Leros microprocessor from VHDL, in which it was originally developed by Martin Schoeberl (https://github.com/schoeberl), to Python and myHDL. This will then be used to build small SoC designs and test the performance on real hardware, on the Atlys and Basys development boards. The other advantage of Leros is that it is optimized for minimal hardware usage on low-cost FPGA boards. The architecture, the instruction set and the pipelines have been constructed with this as the primary aim.

The original Github for the VHDL version is available at: https://github.com/schoeberl/leros , and the documentation with the details at: https://github.com/schoeberl/leros/blob/master/doc/leros.pdf

### The situation so far

The GSoC coding period began on 22 May 2016 and ends on 27 August 2016. Today is 20 June 2016. It has been almost a month since the start of the coding period, and due to unfortunate circumstances the work, I'm sorry to say, was a little slow in the first couple of weeks. On top of that, I have not blogged about my progress all that frequently, so the situation looked quite bleak a couple of weeks ago. However, week 3 saw a dramatic rise in the amount of work being done, and thanks to the extra week I reserved before the midterm, I have almost completely caught up with all my goals for the midterm. The blogging is still lagging a little, but I will make up for that with posts describing my weekly progress for the first 4 weeks.

### Summary of weekly work

The summary of my weekly work is as follows:

Community bonding period: Wrote code samples and got familiar with the myHDL design process.
Week 1: Studied the design of Leros thoroughly and made the major design decisions for the Python version. Started on the instruction set simulator.

Week 2:  Finished with the instruction set simulator.

Week 3: Wrote a crude assembler and linker to complement the simulator which has a high level version of the processor. Started on the actual core with the tests.

Week 4: Integration and continued work on the actual core. The core is more or less where it should be according to my timeline.

As mentioned earlier, I will be following up with blog posts detailing the work of each of the weeks described above.

### Further work and midterms eval

TO DO: The major thing that I have not been able to do is set up the processor on actual hardware (the Atlys and Basys boards), as planned before the midterm. That has been shifted to the week after the midterms.

The work for this week, before the midterm evaluation is to clean up the code in the development branches and make sure the tests pass, then give a PR to the master which I will be showing for the midterm evaluations.

I will also be writing a midterm blog post detailing the complete work and report for the evaluation.

I am immensely enjoying my work so far.

### Yashu Seth (pgmpy)

#### The Continuous Factors

We are reaching the mid semester evaluations soon. Since I started my work a couple of weeks early I am almost through the half way mark of my project. The past few weeks have been great. I also got the first part of my project pushed to the main repository. Yeah!!

First of all, some clarifications related to the confusion surrounding ContinuousNode and ContinuousFactor. Which one does what? After long discussions in my community we have come to the conclusion that we will have two separate classes - ContinuousNode and ContinuousFactor. ContinuousNode is a subclass of scipy.stats.continuous_rv and would inherit all its methods along with a special method discretize. I have discussed the details about this in my post, Support for Continuous Nodes in pgmpy. On the other hand the ContinuousFactor class will behave as a base class for all the continuous factor representations for the multivariate distributions in pgmpy. It will also have a discretize method that would support any discretization algorithm for multivariate distributions.

The past two weeks were almost entirely dedicated to the continuous factor classes, ContinuousFactor and JointGaussianDistribution. Although I had not planned for a separate base class in my timeline, it turned out to be a necessity. Despite its inclusion I have managed to stay on schedule, and I am looking forward to the mentor reviews for this PR.

Now, I will discuss the basic features of the two classes in this post.

## ContinuousFactor

As already mentioned this class will behave as a base class for the continuous factor representations. We need to specify the variable names and a pdf function to initialize this class.

>>> import numpy as np
>>> from scipy.special import beta
# Two-variable Dirichlet distribution with alpha = (1, 2)
>>> def dirichlet_pdf(x, y):
...     return (np.power(x, 1)*np.power(y, 2))/beta(x, y)
>>> from pgmpy.factors import ContinuousFactor
>>> dirichlet_factor = ContinuousFactor(['x', 'y'], dirichlet_pdf)
>>> dirichlet_factor.scope()
['x', 'y']
>>> dirichlet_factor.assignment(5,6)
226800.0


The class supports methods like marginalize and reduce, just as the discrete classes do.

>>> import numpy as np
>>> from scipy.special import beta
>>> def custom_pdf(x, y, z):
...     return z*(np.power(x, 1)*np.power(y, 2))/beta(x, y)
>>> from pgmpy.factors import ContinuousFactor
>>> custom_factor = ContinuousFactor(['x', 'y', 'z'], custom_pdf)
>>> custom_factor.variables
['x', 'y', 'z']
>>> custom_factor.assignment(1, 2, 3)
24.0

>>> custom_factor.reduce([('y', 2)])
>>> custom_factor.variables
['x', 'z']
>>> custom_factor.assignment(1, 3)
24.0
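To give an intuition for what reduce does to the pdf, here is a toy stand-in that fixes some arguments of the function and keeps the rest. It is not pgmpy's implementation; the class name ToyContinuousFactor and the helper beta_fn (written with math.gamma to avoid a SciPy dependency) are made up for this sketch.

```python
import math


class ToyContinuousFactor:
    """Illustrative stand-in, not pgmpy's ContinuousFactor."""

    def __init__(self, variables, pdf):
        self.variables = list(variables)
        self.pdf = pdf

    def assignment(self, *values):
        return self.pdf(*values)

    def reduce(self, values):
        """Fix some variables; values is a list of (name, value) pairs."""
        fixed = dict(values)
        old_variables = self.variables
        old_pdf = self.pdf
        remaining = [v for v in old_variables if v not in fixed]

        def reduced_pdf(*args):
            free = dict(zip(remaining, args))
            # Rebuild the full argument list in the original order.
            full = [fixed[v] if v in fixed else free[v] for v in old_variables]
            return old_pdf(*full)

        self.variables = remaining
        self.pdf = reduced_pdf


def beta_fn(x, y):
    """Beta function expressed through the gamma function."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


def custom_pdf(x, y, z):
    return z * (x ** 1 * y ** 2) / beta_fn(x, y)


factor = ToyContinuousFactor(["x", "y", "z"], custom_pdf)
factor.reduce([("y", 2)])
print(factor.variables, factor.assignment(1, 3))
```

Marginalizing is harder in general, because it requires integrating the pdf over the removed variable instead of just fixing its value.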


Just like the ContinuousNode class, the ContinuousFactor class also has a discretize method that takes a Discretizer class as input. It will output a list of discrete probability masses, or a Factor or TabularCPD object, depending on the discretization method used. We do not have built-in discretization algorithms for multivariate distributions for now, but users can always define their own Discretizer class by subclassing the BaseDiscretizer class. I will soon write a post describing how this can be done.

## JointGaussianDistribution

In its most common representation, a multivariate Gaussian distribution over X1, ..., Xn is characterized by an n-dimensional mean vector μ and a symmetric n x n covariance matrix Σ. The JointGaussianDistribution class provides this representation and is derived from ContinuousFactor. We need to specify the variable names, a mean vector and a covariance matrix for its initialization. It will automatically compute the pdf function given these parameters.

>>> import numpy as np
>>> from pgmpy.factors import JointGaussianDistribution as JGD
>>> dis = JGD(['x1', 'x2', 'x3'], np.array([[1], [-3], [4]]),
...             np.array([[4, 2, -2], [2, 5, -5], [-2, -5, 8]]))
>>> dis.variables
['x1', 'x2', 'x3']
>>> dis.mean
array([[ 1],
[-3],
[ 4]])
>>> dis.covariance
array([[4, 2, -2],
[2, 5, -5],
[-2, -5, 8]])
>>> dis.pdf([0,0,0])
0.0014805631279234139
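The pdf value in the example above can be reproduced from the mean vector and covariance matrix with plain NumPy. This is a sketch of the standard multivariate normal density, not the code used inside pgmpy; the function name gaussian_pdf is an assumption.

```python
import numpy as np


def gaussian_pdf(x, mean, covariance):
    """Density of a multivariate normal distribution at the point x."""
    x = np.asarray(x, dtype=float).reshape(-1)
    mean = np.asarray(mean, dtype=float).reshape(-1)
    covariance = np.asarray(covariance, dtype=float)
    n = mean.size
    diff = x - mean
    # Solve a linear system instead of inverting the covariance explicitly.
    exponent = -0.5 * diff @ np.linalg.solve(covariance, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(covariance))
    return float(np.exp(exponent) / norm)


mean = [1, -3, 4]
covariance = [[4, 2, -2], [2, 5, -5], [-2, -5, 8]]
print(gaussian_pdf([0, 0, 0], mean, covariance))
```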


It inherits methods like marginalize and reduce, but they have been re-implemented here, since both of them have a special closed form for Gaussians.

>>> import numpy as np
>>> from pgmpy.factors import JointGaussianDistribution as JGD
>>> dis = JGD(['x1', 'x2', 'x3'], np.array([[1], [-3], [4]]),
...             np.array([[4, 2, -2], [2, 5, -5], [-2, -5, 8]]))
>>> dis.variables
['x1', 'x2', 'x3']
>>> dis.mean
array([[ 1],
[-3],
[ 4]])
>>> dis.covariance
array([[ 4,  2, -2],
[ 2,  5, -5],
[-2, -5,  8]])

>>> dis.marginalize(['x3'])
>>> dis.variables
['x1', 'x2']
>>> dis.mean
array([[ 1],
[-3]])
>>> dis.covariance
array([[4, 2],
[2, 5]])

>>> dis = JGD(['x1', 'x2', 'x3'], np.array([[1], [-3], [4]]),
...             np.array([[4, 2, -2], [2, 5, -5], [-2, -5, 8]]))
>>> dis.variables
['x1', 'x2', 'x3']
>>> dis.mean
array([[ 1.],
[-3.],
[ 4.]])
>>> dis.covariance
array([[ 4.,  2., -2.],
[ 2.,  5., -5.],
[-2., -5.,  8.]])

>>> dis.reduce([('x1', 7)])
>>> dis.variables
['x2', 'x3']
>>> dis.mean
array([[ 0.],
[ 1.]])
>>> dis.covariance
array([[ 4., -4.],
[-4.,  7.]])
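The numbers in these examples follow from the standard closed forms for Gaussians: marginalizing drops the corresponding rows and columns, and reducing (conditioning) on observed variables o with kept variables k gives mean μ_k + Σ_ko Σ_oo⁻¹ (x_o − μ_o) and covariance Σ_kk − Σ_ko Σ_oo⁻¹ Σ_ok. The sketch below uses index lists and function names chosen for illustration; it is not pgmpy's code.

```python
import numpy as np


def marginalize_gaussian(mean, cov, drop):
    """Drop the variables at the given indices."""
    keep = [i for i in range(len(mean)) if i not in drop]
    return mean[keep], cov[np.ix_(keep, keep)]


def reduce_gaussian(mean, cov, observed, values):
    """Condition on the variables at indices `observed` taking `values`."""
    keep = [i for i in range(len(mean)) if i not in observed]
    s_kk = cov[np.ix_(keep, keep)]
    s_ko = cov[np.ix_(keep, observed)]
    s_oo = cov[np.ix_(observed, observed)]
    gain = s_ko @ np.linalg.inv(s_oo)
    shift = np.asarray(values, dtype=float) - mean[observed]
    new_mean = mean[keep] + gain @ shift
    new_cov = s_kk - gain @ s_ko.T
    return new_mean, new_cov


mean = np.array([1.0, -3.0, 4.0])
cov = np.array([[4.0, 2.0, -2.0], [2.0, 5.0, -5.0], [-2.0, -5.0, 8.0]])
print(marginalize_gaussian(mean, cov, [2]))
print(reduce_gaussian(mean, cov, [0], [7.0]))
```

With x1 = 7 this gives the mean (0, 1) and the covariance [[4, -4], [-4, 7]] shown in the reduce example.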


This class has a method to_canonical_factor that converts a JointGaussianDistribution object into a CanonicalFactor object. The CanonicalFactor class forms the latter part of my project.

## The Future

With my current PR dealing with JointGaussianDistribution and ContinuousFactor almost in its last stages, I will soon begin my work on LinearGaussianCPD, followed by the CanonicalFactor class.

Hope you enjoyed this post and are looking forward to my future posts. Thanks again. I will be back soon. Have a nice time meanwhile. :-)

### Utkarsh (pgmpy)

#### Google Summer of Code week 3 and 4

In terms of progress, week 3 turned out to be dull. I had been having doubts about the representation since the start of the coding period, and somehow I always forgot to have them cleared in the meetings. I wasted a lot of time reading theory to resolve this doubt, and until the middle of week 4 it remained unclear.

During week 3, I restructured my code into different parts. I removed BaseHamiltonianMC and created a HamiltonianMC class which returns samples using simple Hamiltonian Monte Carlo. This class is then inherited by HamiltonianMCda, which returns samples using Hamiltonian Monte Carlo with dual averaging. I wrote functions for some sections of overlapping code and renamed some parameters to make their context clearer. Apart from that, I was experimenting a bit with the API and how samples should be returned.

As discussed in the last post, the parameterization of the model was still unclear to me, but after discussing it with my mentor and other members I found that we had already finalized a representation for continuous factors and joint distributions. I wasted a lot of time on this matter and laughed at my silly mistake. If I had cleared my doubts at the start, I would already have finished this work. No frets, though; it was a good learning experience. So I rewrote parts of my code to take this parameterization into account.

In the week 4 meeting we decided, upon my suggestion, to use numpy.recarray objects instead of pandas.DataFrame, as pandas.DataFrame was adding a dependency and was also slower than numpy.recarray objects. I also improved the documentation of my code during week 4, which earlier wasn’t consistent with my examples. I was allowing the user to pass any n-dimensional array instead of the 1d array mentioned in the documentation; I thought it would provide more flexibility, but it was actually making things ambiguous.

At the end of week 4 the code looks really different from what it was at the start. I wrote a _sample method which runs a single iteration of sampling using Hamiltonian Monte Carlo. The code now returns samples in one of two types: if the user has pandas installed, it returns a pandas.DataFrame; otherwise it returns a numpy.recarray object. This is how the output looks now:

• If the user doesn’t have an installation of pandas in the environment
>>> from pgmpy.inference.continuous import HamiltonianMC as HMC, LeapFrog
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([-3, 4])
>>> covariance = np.array([[3, 0.7], [0.7, 5]])
>>> model = JGD(['x', 'y'], mean, covariance)
>>> sampler = HMC(model=model, grad_log_pdf=None, simulate_dynamics=LeapFrog)
>>> samples = sampler.sample(initial_pos=np.array([1, 1]), num_samples = 10000,
...                          trajectory_length=2, stepsize=0.4)
>>> samples
array([(5e-324, 5e-324), (-2.2348941964735225, 4.43066330647519),
(-2.316454719617516, 7.430291195678112), ...,
(-1.1443831048872348, 3.573135519428842),
(-0.2325915892988598, 4.155961788010201),
(-0.7582492446601238, 3.5416519297297056)],
dtype=[('x', '<f8'), ('y', '<f8')])

>>> samples = np.array([samples[var_name] for var_name in model.variables])
>>> np.cov(samples)
array([[ 3.0352818 ,  0.71379304],
[ 0.71379304,  4.91776713]])
>>> sampler.accepted_proposals
9932.0
>>> sampler.acceptance_rate
0.9932

• If user has a pandas installation
>>> from pgmpy.inference.continuous import HamiltonianMC as HMC, GradLogPDFGaussian, ModifiedEuler
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([1, -1])
>>> covariance = np.array([[1, 0.2], [0.2, 1]])
>>> model = JGD(['x', 'y'], mean, covariance)
>>> sampler = HMC(model=model)
>>> samples = sampler.sample(np.array([1, 1]), num_samples = 5,
...                          trajectory_length=6, stepsize=0.25)
>>> samples
x              y
0  4.940656e-324  4.940656e-324
1   1.592133e+00   1.152911e+00
2   1.608700e+00   1.315349e+00
3   1.608700e+00   1.315349e+00
4   6.843856e-01   6.237043e-01


This is in contrast to the earlier output, which was just a list of numpy.array objects:

>>> from pgmpy.inference.continuous import HamiltonianMC as HMC, GradLogPDFGaussian, ModifiedEuler
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([1, -1])
>>> covariance = np.array([[1, 0.2], [0.2, 1]])
>>> model = JGD(['x', 'y'], mean, covariance)
>>> samples = sampler.sample(np.array([1, 1]), num_samples = 5,
...                          trajectory_length=6, stepsize=0.25)
>>> samples
[array([[1],
[1]]),
array([[1],
[1]]),
array([[ 0.62270104],
[ 1.04366093]]),
array([[ 0.97897949],
[ 1.41753311]]),
array([[ 1.48938348],
[ 1.32887231]])]
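The pandas-or-recarray behaviour described above can be sketched as an optional import with a NumPy fallback. The helper name as_samples and its signature are assumptions for this illustration and are not part of pgmpy.

```python
import numpy as np

try:
    import pandas as pd
except ImportError:      # pandas is an optional dependency
    pd = None


def as_samples(values, names):
    """Wrap an (n_samples, n_vars) array so columns can be accessed by name."""
    values = np.asarray(values, dtype=float)
    if pd is not None:
        return pd.DataFrame(values, columns=names)
    # NumPy-only fallback: a record array with one float field per variable
    return np.rec.fromarrays(list(values.T), names=names)


samples = as_samples([[1.0, 2.0], [3.0, 4.0]], ["x", "y"])
print(samples["x"])
```

Either return type supports `samples["x"]`, so code that only reads columns by name works the same with or without pandas.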


Next week I’ll try to make the changes my mentor requested on my PR. I’ll also write more test cases that test each function individually instead of testing the overall implementation. After my PR gets merged, I’ll try to write introductory blog posts on Markov Chain Monte Carlo and Hamiltonian Monte Carlo, and will work on No-U-Turn Sampling.

## June 19, 2016

### Upendra Kumar (Core Python)

#### Design Patterns: How to write reusable and tidy software?

Referred from : Python Unlocked 2015

Hello everyone. In this blog post I am going to talk about an important concept in software development which I came across, called “Design Patterns”.

In software engineering, problems requiring similar solutions are very common. Therefore people tend to come up with a repeatable design specification to deal with such common problems. Studying design patterns gives one a basic idea of the existing solutions to such problems.

A few advantages of design patterns are:

1. They speed up the development process by providing tested and robust paradigms for solving a problem.
2. They improve code readability for programmers.
3. Documenting the code also becomes easier, as a lot of solutions are based on common design patterns. Therefore, less effort is required to document the code.

Let’s come to the different design patterns used in software engineering. The main ones are:

1. Observer pattern
2. Strategy pattern
3. Singleton pattern
4. Template pattern
7. Flyweight pattern
8. Command pattern
9. Abstract factory
10. Registry pattern
11. State pattern

Let’s have a brief overview of the above-mentioned design patterns :

1. Observer Pattern: The key to the observer pattern is “spreading information to all listeners”. In other words, when we have many listeners (each waiting for a particular event to occur), we need to keep track of them and inform them when the event occurs (for example, a change in the state of a variable). The code snippet below makes this more concrete:
class Notifier():
    """
    Provider of notifications to other objects
    """
    def __init__(self, name):
        self.name = name
        self._observers = set()

    def register_observer(self, observer):
        """
        Function to attach other observers to this notifier
        """
        self._observers.add(observer)
        print("observer {0} now listening on {1}".format(observer.name, self.name))

    def notify_observers(self, msg):
        """
        transmit event to all interested observers
        """
        print("subject notifying observers about {}".format(msg,))

        for observer in self._observers:
            observer.notify(self, msg)

class Observer():

    def __init__(self, name):
        self.name = name

    def start_observing(self, subject):
        """
        register for getting event for a subject
        """
        subject.register_observer(self)

    def notify(self, subject, msg):
        """
        receive an event from a subject
        """
        print("{0} got msg from {1} that {2}".format(self.name, subject.name,msg))

The above code snippet provides a very simple implementation of the Observer pattern. There is a notifier object which provides a method to register the listeners, and in the listeners (the Observer objects) there is a start_observing function to register with the notifier.
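The same idea also works with plain functions as listeners. The following compact variant is a sketch with made-up names (EventSource, subscribe, publish); it adds an unsubscribe step to show that listeners can come and go without the source knowing anything about them.

```python
class EventSource:
    """Keeps a set of callbacks and calls each one when an event is published."""

    def __init__(self, name):
        self.name = name
        self._listeners = set()

    def subscribe(self, callback):
        self._listeners.add(callback)

    def unsubscribe(self, callback):
        self._listeners.discard(callback)

    def publish(self, msg):
        # Iterate over a copy so callbacks may unsubscribe while being notified.
        for callback in list(self._listeners):
            callback(self.name, msg)


received = []


def log_event(source, msg):
    received.append((source, msg))


source = EventSource("temperature")
source.subscribe(log_event)
source.publish("changed to 25")
source.unsubscribe(log_event)
source.publish("changed to 30")   # nobody is listening any more
print(received)
```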

### Riddhish Bhalodia (dipy)

#### Speeding Up!

I have been working on speeding up the local PCA algorithm by converting it to Cython code.

## Currently..

Let me describe the time division of the current implementation. This is for the data of size (176,176,21,22)

| Code Section | Time Taken |
| --- | --- |
| Noise Estimate MUBE/SIBE | 39.23 seconds |
| Local PCA | 140.98 seconds |

The localPCA function has two main bottlenecks,

1] Computing the local covariance matrix
2] Projecting the data on PCA basis
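To show where these two costs come from, here is a plain NumPy sketch of the two steps for a single local neighbourhood. It is a simplification under assumptions: the function name is made up, and keeping a fixed number of components stands in for whatever noise-based selection rule the actual localPCA code applies.

```python
import numpy as np


def local_pca_project(patch, n_components):
    """patch: (n_voxels, n_directions) signals from one local neighbourhood."""
    mean = patch.mean(axis=0)
    centered = patch - mean
    # 1] local covariance matrix across the directions
    cov = centered.T @ centered / patch.shape[0]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    basis = eigenvectors[:, -n_components:]   # components with largest variance
    # 2] project on the PCA basis and map back to signal space
    return centered @ basis @ basis.T + mean


rng = np.random.RandomState(0)
patch = rng.rand(27, 22)      # e.g. a 3x3x3 neighbourhood with 22 volumes
full = local_pca_project(patch, n_components=22)
reduced = local_pca_project(patch, n_components=5)
print(np.allclose(full, patch), reduced.shape)
```

Both steps are repeated for every voxel of the volume, which is why they dominate the running time.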

## What is Cython?

It is a static compiler which makes writing C extensions for Python really easy. These C extensions help us get a huge speed improvement. Cython allows C/C++ functions to be called natively from Python code. Because of this, we chose it to improve the performance of the current localPCA implementation in Python.

## New method for covariance computation!

Omar and Eleftherios have recently written a paper [1] which gives us a new improved method to compute the integrals in rectangles, without worrying much about memory considerations. Using this method for local covariance computation we expect the performance to improve significantly.
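For background, the classic way to compute sums over fixed-size rectangles is a summed-area table (integral image), sketched below in 2D with NumPy. This is not the method of [1]: the table stores a full extra copy of the data, which is the kind of memory cost the paper is described as avoiding. The function name box_sums is made up for the example.

```python
import numpy as np


def box_sums(image, height, width):
    """Sum of every height x width rectangle, via a summed-area table."""
    table = np.zeros((image.shape[0] + 1, image.shape[1] + 1))
    table[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    # Four table lookups per rectangle, independent of the rectangle size.
    return (table[height:, width:] - table[:-height, width:]
            - table[height:, :-width] + table[:-height, :-width])


image = np.arange(12.0).reshape(3, 4)
sums = box_sums(image, 2, 2)
print(sums)
```

Local covariance entries are sums of products over a neighbourhood, so the same trick applies to them once the products have been formed.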

## New Code, New Results

We cythonized the localPCA code, also incorporating the new covariance computation [1]. The LocalPCA time dropped from 140.98 seconds to 82.72 seconds, an improvement of about 40%, without affecting the accuracy of the code!

## Update, nlmeans_block optimization

Improved on the cython implementation of the nlmeans_block and the improvements are drastic. Tested on data of size (128,128,60), with patch radius = 1 and block radius = 1

Previous time = 7.56 seconds
Time after optimization = 0.64 seconds

😀

## Next Up…

1. Little more optimization of the local PCA cython code
2. Documentations and tutorials for local PCA and adaptive denoising
3. Code formatting and validation via phantom
4. Optimization of Adaptive Denoising code
5. Cythonize the noise estimation process
6. Incorporate suggestions from mentors

## References

[1] On the computation of integrals over fixed-size rectangles of arbitrary dimension
Omar Ocegueda, Oscar Dalmau, Eleftherios Garyfallidis, Maxime Descoteaux, Mariano Rivera. Pattern Recognition Letters, 2016

## June 18, 2016

### Ranveer Aggarwal (dipy)

#### Finishing Up on the Text Box

In my previous blog post, I had begun working on a text box, and I had discussed a lot of issues. I used the text box myself and ended up hating it. So I started from scratch.

Again, it wasn’t easy and I had to go through multiple iterations.

### Method 0: Why reinvent the wheel?

Google. Stackoverflow? Someone? Help! Ooh a random forum.

Something similar happened.

### Method 1: Cmd+C, Cmd+V from the previous blog post

For making a dynamic multi line text box, I stored a number which describes the number of current lines. If this number is exceeded by

(length of the text) / (number of characters in a line)


I simply add a newline character to the text variable and increase the current number of lines.
For hiding overflows, I kept a variable for archive text and dynamically updated it.

This was complicated, the code couldn’t be clean with this method. Debugging corner cases would have been difficult. So, Cmd+A, Delete.

### Method 2: Hello Windows

While I stuck to the same OS, I had this idea of using a window.

An intermediate text looks like (the |’s are the window boundaries. The input field size is 1x5)

|abc|


When a character is added

|abcd|


When a character is removed

|abc|


When further characters are added, one after the other

|abcde|


a|bcdef|


Yep, the window moves. Cool, right? So now the visibility of the text is controlled by this so called window. The caret plays nicely into this window. I simply use a caret position as 0 initially and keep adding or removing one as I add or remove characters. The 0 is always the left window position and the caret moves relative to it. Similarly change the caret position when left/right keys are pressed.

Say this is now the intermediate text

this is som|e text|


My caret is currently before e (the 0 for the window) and I press the left key. Here’s what happens.

this is so|me tex|t


Similarly, I shift the window right when required.
Next, I wrote down all the possible cases (corner cases at the boundaries, for example) and coded them all up.

I thought my work was done, until — bugs!

And I proceeded to rewrite everything once again.

### Method 3: The Final Method

The method didn’t change much from above, except that the caret position was now absolute - the 0 is the 0 of the text always and the window moves relatively.
I grouped similar code together into functions and ended up with a clean (and hopefully bug free) implementation of a text box. I’ll write an independent blog post on it soon.
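A minimal sketch of this final scheme, with an absolute caret and a window that follows it, could look like this. The class and method names are made up for illustration and are not the actual dipy UI code.

```python
class TextWindow:
    """Fixed-width view onto a longer string; the caret position is absolute."""

    def __init__(self, width):
        self.width = width
        self.text = ""
        self.caret = 0        # index into text, between 0 and len(text)
        self.left = 0         # index of the first visible character

    def _scroll(self):
        # Move the window just enough to keep the caret inside it.
        if self.caret < self.left:
            self.left = self.caret
        elif self.caret > self.left + self.width:
            self.left = self.caret - self.width

    def add_char(self, char):
        self.text = self.text[:self.caret] + char + self.text[self.caret:]
        self.caret += 1
        self._scroll()

    def remove_char(self):
        if self.caret > 0:
            self.text = self.text[:self.caret - 1] + self.text[self.caret:]
            self.caret -= 1
            self._scroll()

    def move(self, delta):
        self.caret = max(0, min(len(self.text), self.caret + delta))
        self._scroll()

    def visible(self):
        return self.text[self.left:self.left + self.width]


box = TextWindow(5)
for char in "abcdef":
    box.add_char(char)
print(box.visible())
```

Typing "abcdef" into a 5-character field leaves "bcdef" visible, matching the a|bcdef| picture from Method 2.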

### Results

Here’s what it looks like currently:

A better text box.

### Conclusion

Building a UI can be tougher than it sounds. When we use something like HTML and simply include a text box using a simple tag, we never think what went behind its making. The method I currently use seems efficient (every operation is O(1)), but I am sure there must be several implementations (maybe even better and cleaner) out there. I shall incorporate a better method if I find any.

### Next Steps

Next, I’ll get started on a slider element. I don’t know how to do it right now, so this week will probably be spent exploring ways to do it.

#### GSoC week 4 roundup

@cfelton wrote:

All the midterm reviews are due by the 25th, the reviews open up
next week and primary mentors are encouraged to complete the
reviews as soon as possible.

Students, try and complete your milestones by Monday :), write
a longer midterm summary blog, and in the blog if you are not on
schedule include a modified proposal with new milestones.

Student week4 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 15-Jun, >5, Y
@Vikram9866: 03-Jun, 5, Y

riscv:
health 96%, coverage unknown
@meetshah1995: 12-Jun, 3, N

hdmi:
health 94%, coverage 90%
@srivatsan: 11-Jun, >5, N

gemac:
health 57%, coverage 0%
@ravijain056: 17-Jun, >5, N

pyleros:
health missing, coverage missing
@forumulator: 23-Mar, 0, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo


### udiboy1209 (kivy)

#### Cython Needs A Lint Tool

My love for Cython has kept increasing for the past few weeks. It feels like by the end of my GSoC I might switch to Cython entirely, if that’s possible. Cython lacks a few key development tools though: a code testing tool and a lint tool. I felt a great need for the latter while cleaning up unused variables in the latest PR. Manually doing stuff just doesn’t cut it for an automator :D.

This week was mainly focused on getting animated tiles implemented and working. Implementing animation was easy. I just had to store an extra pointer in the Tile object for the animation FrameList, and add a Python API to access it. Animations can now be specified for a tile during map creation by giving the name of the animation to load in the dict. The MapManager takes care of fetching the FrameList pointer and map_utils takes care of initialising the animation system.

## Debugging is inevitable

Have I mentioned before that no code works on the first go? Not even something as simple as making a tile animated, which frankly just involved copying and modifying code from the last example I had built for testing the animation. After breaking my head over debugging this for almost a day, I asked Kovak about it. It turned out to be a bug in the animation system!

KivEnt’s renderer does a very neat trick to improve efficiency. It batches all the entities which have to be rendered from the same image so that they are processed together and the same image doesn’t need to be loaded repeatedly. In AnimationSystem, we keep updating the texture of an entity over time, so it is important that we move those entities to the batch that corresponds to the new texture. Sometimes entities don’t need to be “rebatched” because they are already in the batch they would end up in after the texture change. There was a bug in the test condition for this which made it always return True. In the very basic sense this is what was happening: old_batch == old_batch :P, while it should have been old_batch == new_batch. Because the condition was always true, entities were never rebatched on a texture change, so animations built from different images were never rendered correctly. A few code fixes and updates to the twinkling stars example so that animations use different image files led to this:

Isn’t this even more beautiful than the previous example :D. Because I had to remove a lot of code for this fix, there were a lot of unused variables lying around. That is why I so desperately wanted a good linting tool that I could import into vim to automate the boring process of manually finding unused variables. Also, because Cython code is largely Python-like, there needs to be a pep8-like standard for Cython, along with a checker tool.

I later found out that the Cython compiler can be configured to issue warnings for unused variables. Someone needs to make a good tool out of it, probably just a vim plugin for that matter. I might do it myself if I get enough motivation.

Anyway, that fix led to one more thing, animated tiles started working! Here, have a look:

## Next in line

Now that a basic map creation pipeline is in place for KivEnt, I can move on to actually parsing a Tiled map file, i.e. a TMX file, and creating a map from it. Parsing TMX should be easy as there are a lot of feature-rich TMX parsers for Python. PyTMX is one such module I have in mind for this job. The rest of the task is just loading textures, models and animations from the TMX file and then assigning those values to individual tiles to create entities. I have been trying to create my own maps in Tiled to get familiar with it, and the first thing I did was create a Pokemon map, because why not :P. Obviously, I won’t be able to use this in KivEnt examples because I don’t think the tileset is open source :P. So I’ll have to find an open source tileset and make another Tiled map for testing. Have a look at the Pokemon map:

## June 17, 2016

### jbm950 (PyDy)

#### GSoC Week 4

I started off this week writing the example code for a pendulum defined by x and y coordinates instead of an angle, theta. This was to show how the eombase.EOM class would handle a differential algebraic system. I also altered the simple pendulum example I made early on in the project to show how it would look as an eombase.EOM example. Of the examples I have made for the base class, this one stands out as currently being the only one that makes use of the equations of motion generators (the other two have the equations of motion entered by hand). While addressing comments on the PR, it was mentioned that a more traditional documentation approach would give greater visibility to the desired results of the code, as the output could be shown explicitly. I agreed, moved all three examples to a single .rst document in PyDy, and changed the code to follow the documentation format rather than the example code format. At this point I made a list of all of the attributes and methods I thought the base class should have and made sure they were represented in the example documentation. In addition I included multiple ways I thought error messages should be raised for incorrect uses of the base class. This information is currently awaiting review.

In addition to the work on the base class I had to fix the Kane benchmark I made early on in the project. At some point in the last few months the input order for kane.kanes_equations() was flipped, and this meant the benchmark could not run on previous versions of SymPy. My fix was to use a try/except clause to catch the error produced by the older versions of SymPy and alter the input order based on whether or not the error was produced. This code sits at PR #29 and it too is awaiting review/approval.
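
The try/except pattern described above can be sketched as follows. This is only an illustration of the idea, not the code in the PR: the helper `call_with_either_order`, the stand-in functions `new_style` and `old_style`, and the choice of `TypeError` as the exception to catch are all assumptions, since the post does not say which error the older SymPy versions raise.

```python
def call_with_either_order(func, first, second, errors=(TypeError,)):
    """Call func(first, second); on one of `errors`, retry with the arguments swapped."""
    try:
        return func(first, second)
    except errors:
        return func(second, first)


def _check(bodies, loads):
    # Stand-in validation: loads must be (point, force) tuples.
    if not all(isinstance(load, tuple) for load in loads):
        raise TypeError("loads must be (point, force) tuples")
    return {"bodies": list(bodies), "loads": list(loads)}


def new_style(bodies, loads):
    # Hypothetical stand-in for an API that takes bodies first.
    return _check(bodies, loads)


def old_style(loads, bodies):
    # Hypothetical stand-in for an API that takes loads first.
    return _check(bodies, loads)


bodies = ["pendulum_bob"]
loads = [("bob_point", "gravity")]
print(call_with_either_order(new_style, bodies, loads))
print(call_with_either_order(old_style, bodies, loads))
```

Catching only the specific exception types that the wrong order produces keeps unrelated failures visible instead of silently retrying them.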

While I have been waiting for review of the base class PR, I have begun reading through Roy Featherstone’s book, “Rigid Body Dynamics Algorithms”. I have spent time going through Jason’s overhaul of KanesMethod as well and trying to provide as much useful feedback as I can.

Lastly I reviewed PR #11209 this week. The PR correctly alters code that tests for the presence of a key in a dictionary. It also altered the indentation of the code that immediately followed. I eventually came to the conclusion that this was a correct alteration because the variable eq_no is set in the dictionary key test and is used in the code that follows the test. I commented that the PR looks good to me and another member of SymPy merged it. This makes me slightly worried that too much value may be attached to my opinion as I still feel like a beginner.

### Future Directions

I will continue reading through Featherstone’s book until I receive feedback on the proposed base class API, at which time I will address the reviewers’ comments and hopefully begin work on the base class itself.

### PR’s and Issues

• (Open) Improved the explanation of the 5 equations in the Kane’s Method docs PR #11183
• (Open) Created a basis on which to discuss EOM class PR #353
• (Open) Minor fix in KanesMethod’s docstring PR #11186
• (Open) Fixed kane benchmark for different input order PR #29
• (Merged) Fix issue #8193 PR #11209

## June 15, 2016

### Sheikh Araf (coala)

#### [GSoC16] Week 3 update

Another week has passed by in this journey and so far it is going great. This week I’ve been busy adding important functionality to the Eclipse plug-in.

The most important feature in the works is the ability to select the bears to use for code analysis. I also had to make some design decisions, and one thing I’ve learned is that designing is more difficult than programming (of course, take this with a grain of salt).

Mid-term evaluations are coming and I expect to have a usable plug-in by then.

The major task now is to introduce some basic user-interface elements that make using the plug-in intuitive and easy. I have begun planning out the GUI with help from my mentor and will be adding some parts of it in the coming week.

The plug-in currently uses the common Problems View for the marker elements. This will change and the plug-in will now have a separate view for coala issues. Another new element will be the Annotations that will help better visualize the analysis results.

Cheers!

### liscju (Mercurial)

#### Coding Period - III Week

This week I planned to implement redirection to a simple HTTP server. The idea is to make a GET/POST request to a URL with the path /file/REVHASH, where REVHASH is the hash of the file to get or put. To test it I wrote a simple HTTP server in Python that handles such requests. You can take a look here:

https://bitbucket.org/liscju/hg-largefiles-gsoc/src/e8afcf299ea4bf5714859bf231d62aea0c663d3b/contrib/lfredirection-http-server.py?at=dev&fileviewer=file-view-default

So far I haven’t managed to integrate this server into the Mercurial test framework, but I have started working on it. One thing worth noticing is how easy it is to write an HTTP server in Python: just a couple of lines and that’s it.

The second thing I did was to refactor the solution a bit to distinguish between different types of redirection. At the moment there are only two types, local file server and HTTP server, but in the future there will be other options, so this was a good moment to make the dispatch flexible. The current solution looks like this:

    _redirectiondstpeer_provider = {
        'file': _localfsredirectiondstpeer,
        'http': _httpredirectiondstpeer,
    }

    def openredirectiondstpeer(location, hash):
        match = _scheme_re.match(location)
        if not match:               # regular filesystem path
            scheme = 'file'
        else:
            scheme = match.group(1)
        try:
            redirectiondstpeer = _redirectiondstpeer_provider[scheme]
        except KeyError:
            raise error.Abort(_('unsupported URL scheme for redirection peer %r')
                              % scheme)
        return redirectiondstpeer(location, hash)

The beginning of the location holds the protocol (scheme), which is extracted and looked up among the supported types. If it is not supported, an error is raised.

To connect to the redirection server and send HTTP requests I used httplib (https://docs.python.org/2/library/httplib.html), but I’m working on reusing the existing code in Mercurial to open and close connections. Another thing I’m still working on is sending and receiving files from the HTTP server in chunks rather than all at once. This is especially important given that this solution will transfer large files.
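
Transferring in chunks means never holding the whole file in memory. Here is a minimal sketch of the idea using plain file-like objects; the function name `copy_in_chunks` and the chunk size are assumptions for illustration, and this is not Mercurial's own code.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy from one file-like object to another, one chunk at a time.

    Returns the number of bytes copied.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 200000)
target = io.BytesIO()
print(copy_in_chunks(source, target))
```

The same loop works when `src` is an HTTP response object and `dst` is a local file, since both expose `read` and `write`.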

Apart from fixing HTTP connection issues, this week I’m going to work on generating the redirection location on the fly. The idea is that the user specifies a script/hook that generates the redirection location and saves it in a file in the .hg directory, and the feature reads the location from this file.

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-June/085244.html

Another thing I was working on was adding the ability to pull the active bookmark with "hg pull -B .". The beginning of the patch series is here:

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-June/085232.html

While working on the solution I noticed that some of the abort messages are not translated. I sent a patch for this, which you can browse here:

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-June/085251.html

### Abhay Raizada (coala)

#### Python, indentation and white-space

At the time of the last update I was able to do basic indentation whenever a start and an end indent specifier were provided. This time around I’m working on the case where the end-indent specifier is not provided, for example in languages like Python:

    def func(x):
        indent-level1
    indent-level2

Here we can see that there is no specifier announcing that an unindent is going to occur, so how do I figure out which lines are part of one block?

Well, the answer is very simple actually: I look for the start indent specifier, which in the case of Python is the very famous ‘:’. After I find the start indent specifier, the next step is to find an unindent. In the previous example the line containing ‘indent-level2’ unindents, and voila, we have our block, running from the indent specifier to the first unindent. Easy, right? The answer to that is NO, nothing’s that easy.

## Python doesn’t care about white-space:

Well, as we all know, this isn’t true. Python does care about white-space, but not as much as we thought. Python only cares about white-space to figure out indentation; anywhere else it is pretty much ignored. For example:

    def func(x):
        a = [1, 2,
    3, 4, 5]
        if x in a:
            print(x)

This is perfectly valid Python code, which prints x if x is an integer between 1 and 5. What is odd about this example is that, as we know, in Python everything has to be indented, right? And this breaks that rule! Go ahead, try it on your own, it works! So no, even in Python not everything has to be indented. A simpler example could have been:

    def func(x):
        a = 1
    # This comment is not indented
        print(a)

Does it matter if this comment is not indented? Absolutely not! This is perfectly valid Python code as well.

## The Problem:

So how is all this related to my algorithm? As you can see in the second example, the line ‘# This comment is not indented’ unindents, and my algorithm is searching for unindents. This breaks the algorithm, as it would think that the block starts at ‘def func(x):’ and ends at ‘# This comment is not indented’. Likewise, in the first example it would find that the line ‘3, 4, 5]’ unindents, which would again break the algorithm.

## The Solution:

The solution is quite simple in theory: just be aware of these cases. But that changes the algorithm completely. It goes from:

• Find the first unindent
• Report the block as running from the line containing the specifier to the line which unindents

To

• Find the next line that unindents
• Check if this line is a comment
• Check if this line is inside a multi-line comment
• Check if this line is inside parentheses () or square brackets []
• If any of these is true, repeat from the first step
• Otherwise, report the block

So the final algorithm is my working solution, at least until we find some problem in it as well. You can follow all the code related to this algorithm in my PR.
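
The refined algorithm can be sketched in a few lines of Python. This is a simplified illustration and not the code from the PR: the function name `find_block_end` is hypothetical, bracket counting ignores brackets inside string literals, and only `"""` is treated as a multi-line comment delimiter.

```python
def _indent(line):
    return len(line) - len(line.lstrip())


def _code_part(line):
    # Naive: drops everything after '#', even if the '#' is inside a string.
    return line.split("#", 1)[0]


def find_block_end(lines, start):
    """Return the index of the first line that really unindents after lines[start]."""
    base = _indent(lines[start])
    depth = 0          # nesting level of () and []
    in_string = False  # inside a triple-quoted multi-line comment
    for i in range(start, len(lines)):
        line = lines[i]
        stripped = line.strip()
        # State as it was before this line: is it a continuation line?
        continuation = depth > 0 or in_string
        was_in_string = in_string
        if stripped.count('"""') % 2 == 1:
            in_string = not in_string
        if not was_in_string and not in_string:
            code = _code_part(line)
            depth += sum(code.count(c) for c in "([")
            depth -= sum(code.count(c) for c in ")]")
        if i == start:
            continue
        if not stripped or stripped.startswith("#") or continuation:
            continue  # blank line, comment, or inside brackets/multi-line comment
        if _indent(line) <= base:
            return i
    return len(lines)


source = [
    "def func(x):",
    "    a = [1, 2,",
    "3, 4, 5]",
    "# This comment is not indented",
    "    if x in a:",
    "        print(x)",
    "print('done')",
]
print(find_block_end(source, 0))
```

Here the block opened by `def func(x):` is reported as ending at the final `print('done')` line, even though two earlier lines start at column zero.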

## Next steps:

Next steps are absolute indentation, hanging indents, keyword indents and an all new bear in the form of the LineLengthBear.

All of this looks really exciting as I see my once-planned project come to life. I really hope all of this is useful someday and people actually use my code to solve their indentation problems.

### John Detlefs (MDAnalysis)

#### Diffusion Maps in Molecular Dynamics Analysis

It occurs to me that in my previous post I didn’t thoroughly explain the motivation for dimension reduction in general. When we have a data matrix $X$ with $n$ samples, each sample having $m$ features, the number $m$ can be very large. This data contains information that we want to extract; in the case of molecular dynamics simulations these are parameters describing how the dynamics are occurring. But the data could just as well be features that distinguish faces from others in a dataset, handwritten letters and numbers from other numbers, etc. As it is so eloquently put by Porte and Herbst at Arizona:

The breakdown of common similarity measures hampers the efficient organisation of data, which, in turn, has serious implications in the field of pattern recognition. For example, consider a collection of n × m images, each encoding a digit between 0 and 9. Furthermore, the images differ in their orientation, as shown in Fig.1. A human, faced with the task of organising such images, would likely first notice the different digits, and thereafter that they are oriented. The observer intuitively attaches greater value to parameters that encode larger variances in the observations, and therefore clusters the data in 10 groups, one for each digit

Here we’ve been introduced to the idea of pattern recognition and ‘clustering’, the latter will be discussed in some detail later. Continuing on…

On the other hand, a computer sees each image as a data point in $R^{nm}$, an nm-dimensional coordinate space. The data points are, by nature, organised according to their position in the coordinate space, where the most common similarity measure is the Euclidean distance.

The idea of the data being in an $nm$-dimensional space is introduced by the authors. The important part is that a computer has no knowledge of the patterns inside this data. The human brain is excellent at plenty of algorithms, but dimension reduction is one it is especially good at.

## Start talking about some chemistry John!

Fine! Back to the matter at hand. Dimension reduction is an invaluable tool in modern computational chemistry because of the massive dimensionality of molecular dynamics simulations. To my knowledge, the biggest things being studied by MD currently are on the scale of the HIV-1 capsid at 64 million atoms! Of course, these studies are being done on supercomputers, and for the most part studies are run on a much smaller number of atoms. For a thorough explanation of how MD simulations work, my Summer of Code colleague Fiona Naughton has an excellent and cat-filled post explaining MD and umbrella sampling. Why do we care about dynamics? As Dr. Cecilia Clementi mentions in her slides, ‘Crystallography gives structures, but function requires dynamics!’

A molecular dynamics simulation can be thought of as a diffusion process subject to drag (from the interactions of molecules) and random forces (Brownian motion). This means that the time evolution of the probability density of a molecule occupying a point in the configuration space, $P(x,t)$, satisfies the Fokker-Planck equation (this is some complex math from statistical mechanics). The important thing to note is that the Fokker-Planck equation has a discrete eigenspectrum, and that there usually exists a spectral gap reflecting the ‘intrinsic dimensionality’ of the system it is modeling. A diffusion process is by definition Markovian, in this case a continuous Markov process, which means the state at time $t$ depends solely on the state at the instant before it. This is easier to see in the discrete problems of MD simulation, where the state at time $t$ is determined only by the state at time $t-1$.

Diffusion maps in MD try to find a discrete approximation of the eigenspectrum of the Fokker-Planck equation by taking the following steps. First, we can think of changes in configuration as random walks on an infinite graph defined by the configuration space. From Porte again:

The connectivity between two data points, x and y, is defined as the probability of jumping from x to y in one step of the random walk, and is

It is useful to express this connectivity in terms of a non-normalised likelihood function, k, known as the diffusion kernel:

The kernel defines a local measure of similarity within a certain neighbourhood. Outside the neighbourhood, the function quickly goes to zero. For example, consider the popular Gaussian kernel:
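
A minimal sketch of these definitions, assuming the usual Gaussian form $k(x,y)=\exp(-\|x-y\|^2/\epsilon)$ with Euclidean distance between points. The function names are illustrative only, and the one-step transition probabilities are obtained by normalising each row of the kernel matrix.

```python
import math


def gaussian_kernel(x, y, epsilon):
    """Diffusion kernel k(x, y) = exp(-|x - y|^2 / epsilon)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / epsilon)


def transition_matrix(points, epsilon):
    """Row-normalise the kernel so that each row is a probability distribution."""
    kernel = [[gaussian_kernel(p, q, epsilon) for q in points] for p in points]
    return [[k / sum(row) for k in row] for row in kernel]


points = [(0.0,), (1.0,), (5.0,)]
P = transition_matrix(points, epsilon=1.0)
for row in P:
    print(row)
```

With these three points, a jump from the first point to its near neighbour is far more likely than a jump to the distant third point, which is the sense in which the kernel goes to zero outside a neighbourhood.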

Coifman and Lafon provide a dense but extremely thorough explanation of diffusion maps in their seminal paper. This quote screams molecular dynamics:

Now, since the sampling of the data is generally not related to the geometry of the manifold, one would like to recover the manifold structure regardless of the distribution of the data points. In the case when the data points are sampled from the equilibrium distribution of a stochastic dynamical system, the situation is quite different as the density of the points is a quantity of interest, and therefore, cannot be gotten rid of. Indeed, for some dynamical physical systems, regions of high density correspond to minima of the free energy of the system. Consequently, the long-time behavior of the dynamics of this system results in a subtle interaction between the statistics (density) and the geometry of the data set.

In this paper, the authors acknowledge that oftentimes an isotropic kernel is not sufficient to understand the relationships in the data. They pose the question:

In particular, what is the influence of the density of the points and of the geometry of the possible underlying data set over the eigenfunctions and spectrum of the diffusion? To address this type of question, we now introduce a family of anisotropic diffusion processes that are all obtained as small-scale limits of a graph Laplacian jump process. This family is parameterized by a number $\alpha$ which can be tuned up to specify the amount of influence of the density in the infinitesimal transitions of the diffusion. The crucial point is that the graph Laplacian normalization is not applied on a graph with isotropic weights, but rather on a renormalized graph.

The derivation from here requires a few more steps:

• Form a new kernel from anisotropic diffusion term: Let
Where
• Apply weighted graph Laplacian normalization:
• Define anisotropic transition kernel from this term
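
A sketch of the construction, written in discrete form with sums over the sampled points and assuming the notation of Coifman and Lafon, where $k_\epsilon$ is the isotropic kernel with scale $\epsilon$. The symbols $q_\epsilon$, $d_\epsilon^{(\alpha)}$ and $p_{\epsilon,\alpha}$ are taken from that convention and should be checked against the paper:

$$q_\epsilon(x) = \sum_{y} k_\epsilon(x,y), \qquad k_\epsilon^{(\alpha)}(x,y) = \frac{k_\epsilon(x,y)}{q_\epsilon(x)^{\alpha}\, q_\epsilon(y)^{\alpha}}$$

$$d_\epsilon^{(\alpha)}(x) = \sum_{y} k_\epsilon^{(\alpha)}(x,y)$$

$$p_{\epsilon,\alpha}(x,y) = \frac{k_\epsilon^{(\alpha)}(x,y)}{d_\epsilon^{(\alpha)}(x)}$$

The first line forms the new kernel by dividing out a power $\alpha$ of the density estimate $q_\epsilon$, the second is the weighted graph Laplacian normalization, and the third is the resulting transition kernel. Because $d_\epsilon^{(\alpha)}(x)$ depends only on $x$, $p_{\epsilon,\alpha}(x,y)$ is in general not equal to $p_{\epsilon,\alpha}(y,x)$.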

This was all kinds of painful, but what this means for diffusion maps in MD is that a meaningful diffusion map will have an anisotropic (and therefore asymmetric) kernel. Coifman and Lafon go on to prove that for $\alpha$ equal to $\frac{1}{2}$ this anisotropic kernel is an effective approximation for the Fokker-Planck equation! This is a really cool result that is in no way obvious.

Originally, when I studied diffusion maps while applying for the Summer of Code, I was completely unaware of Fokker-Planck and the anisotropic kernel. Of course, learning these topics takes time, but I was under the impression that diffusion kernels were symmetric across the board, which is just dead wrong. This of course changes how eigenvalue decomposition can be performed on the matrix and requires a routine like Singular Value Decomposition instead of Symmetric Eigenvalue Decomposition. If I had spent more time researching the literature on my own I think I could have figured this out. With that being said, there are 100+ dense pages given in the citations below.

So where are we at? Quick recap about diffusion maps:

• Start taking random walks on a graph
• There are different costs for different walks based on the likelihood of the walk happening
• We established a kernel based on all these different walks
• For MD we manipulate this kernel so it is anisotropic!

Okay, so what do we have left to talk about…

• How is epsilon determined?
• What if we want to take a random walk of more than one jump?
• Hey John, we’re not actually taking random walks!
• What do we do once we get an eigenspectrum?
• What do we use this for?

## Epsilon Determination

Epsilon determination is kind of funky. First off, Dr. Andrew L. Ferguson notes that division by epsilon retains ‘only short pairwise distances on the order of $\sqrt{2\epsilon}$’. In addition, Dr. Clementi in her slides on diffusion maps notes that the neighborhood determined by epsilon should be locally flat. For a free-energy surface, this means that it is potentially advantageous to define a unique epsilon for every single element of a kernel based on the nearest neighbors to that point in terms of value. This can get painful. Most researchers seem to use a constant epsilon determined from some sort of guess-and-check method based on clustering.

For my GSoC pull request that is up right now, the plan is to have an API for an Epsilon class that must return a matrix whose $ij$-th entry is $\frac{d(i,j)^2}{\epsilon_{ij}}$. From here, given weights for the anisotropy of the kernel, we can form the anisotropic kernel to be eigenvalue-decomposed. Any researcher who cares to do some complex choice of epsilon based on nearest neighbors is probably a good enough hacker to handle implementation of this API in a quick script.

## Length $t$ Walks

Nowhere in the construction of our diffusion kernel are we actually taking random walks. What we are doing is taking all possible walks, where two vertices on the graph are close if $d(x,y)$ is small and far apart if $d(x,y)$ is large. This accounts for all possible one-step walks across our data. In order to get a good idea of transitions that occur over larger timesteps, we take multiple steps. To construct this set of walks, we must multiply our transition matrix $P$ by itself $t$ times, where $t$ is the number of steps in the walk across the graph. From Porte again (stealing is the best form of flattery, no?):

With increased values of t (i.e. as the diffusion process “runs forward”), the probability of following a path along the underlying geometric structure of the data set increases. This happens because, along the geometric structure, points are dense and therefore highly connected (the connectivity is a function of the Euclidean distance between two points, as discussed in Section 2). Pathways form along short, high probability jumps. On the other hand, paths that do not follow this structure include one or more long, low probability jumps, which lowers the path’s overall probability.
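
As a small illustration of running the diffusion forward, here is a sketch that raises a one-step transition matrix to the power $t$ by repeated multiplication. It uses plain nested lists and hypothetical helper names; a real implementation would more likely use a numerical library.

```python
def mat_mul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def t_step_transitions(p, t):
    """Return p multiplied by itself t times (the identity matrix for t = 0)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, p)
    return result


P = [[0.5, 0.5],
     [0.25, 0.75]]
print(t_step_transitions(P, 2))
```

Each entry of the result is the probability of getting from one point to another in exactly $t$ steps, summed over every possible intermediate path, and each row still sums to one.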

I said something blatantly wrong in my last post. I’m a fool, but still, things do get a little complicated when analyzing time series data with diffusion maps. We want both to investigate walks at different timescales from the diffusion maps and to be able to project a snapshot from a trajectory at a given timestep onto the corresponding set of eigenvectors describing the lower dimensional order parameters.

From Ferguson:

The diffusion map embedding is defined as the mapping of the ith snapshot into the ith components of each of the top k non-trivial eigenvectors of the $M$ matrix.

Here the $M$ matrix is our anisotropic kernel. So from a spectral decomposition of our kernel (remember that it is generated by a particular timescale walk), we get a set of eigenvectors onto which we project our snapshot (what we have been calling both a trajectory frame and a sample, sorry), which exists at a particular timestep in our MD trajectory. This can create some overly similar notation, so I’m just going to avoid it and hope that it makes more sense without notation.

## Using Diffusion Maps in MDAnalysis

Alright, this has been a lot to digest, but hopefully you are still with me. Why are we doing this? There are plenty of reasons, and I am going to list a few:

• Dr. Ferguson used diffusion maps to investigate the assembly of polymer subunits in this paper
• Also for the order parameters in alkane chain dynamics
• Also for umbrella sampling
• Dr. Clementi used this for protein folding order parameters here
• Also, Dr. Clementi used this for polymerization reactions here
• Dr. Clementi also created a variant that treats epsilon determination very carefully with LSD
• There are more listed in my works cited

The first item in that list is especially cool; instead of using a standard RMSD metric, they abstracted a cluster-matching problem into a graph matching problem, using an algorithm called Isorank to find an approximate ‘greedy’ solution.

There are some solid ‘greedy’ vs. ‘dynamic’ explanations here. The example I remember getting is to imagine you are a programmer for a GPS direction provider. We can consider two ways of deciding an optimal route, one with a greedy algorithm and the other with a dynamic algorithm. At each gridpoint on a map, a greedy algorithm will take the fastest route at that point. A dynamic algorithm will branch ahead, look into the future, and possibly avoid short-term gain for long term drive-time savings. The greedy algorithm might have a better best-case performance, but a much poorer worst-case performance.

In any case, we want to allow for the execution of a diffusion map algorithm where a user can provide their own metric, tune the choice of epsilon, the choice of timescale, and project the original trajectory timesteps onto the new dominant eigenvector, eigenvalue pairs.

## Let’s talk API/ Actual Coding (HOORAY!)

DistMatrix

• Does frame by frame analysis on the trajectory, implements the _prepare and _single_frame methods of the BaseAnalysis class
• User selects a subset of atoms in the trajectory here
• This is where user provides their own metric, cutoff for when metric is equal, weights for weighted metric calculation, and a start, stop, step for frame analysis

Epsilon

• We will have some premade classes inheriting from Epsilon, but all the API will require is to return the manipulated DistMatrix, where each term has now been divided by some scale parameter epsilon
• These operations should be done in place on the original DistMatrix; under no circumstances should we have two possibly large matrices sitting in memory

DiffusionMap

• Accepts DistMatrix (initialized), Epsilon (uninitialized) with default a premade EpsilonConstant class, timescale t with default = 1, weights of anisotropic kernel as parameters
• Performs BaseAnalysis conclude method, wherein it exponentiates the negative of each term given by Epsilon.scaledMatrix, performs the procedure for the creation of the anisotropic kernel above, and raises the anisotropic kernel to the power of the timescale t by repeated matrix multiplication.
• Finally, eigenvalue decomposes the anisotropic kernel and holds onto the eigenvectors and eigenvalues as attributes.
• Should contain a method DiffusionMap.embedding(timestep), that projects a timestep to its diffusion embedding at the given timescale t.
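
To illustrate the in-place requirement on Epsilon, here is a rough sketch using nested lists in place of a real distance matrix. The class `EpsilonConstant` is named in the plan above, but its `scale` method and the helper `kernel_in_place` are hypothetical; nothing here is the actual MDAnalysis API.

```python
import math


class EpsilonConstant:
    """Scale every squared distance by one constant epsilon (illustrative sketch)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon

    def scale(self, dist_matrix):
        # Overwrite each d(i, j) with d(i, j)**2 / epsilon; no second matrix is allocated.
        for row in dist_matrix:
            for j, d in enumerate(row):
                row[j] = d * d / self.epsilon
        return dist_matrix


def kernel_in_place(scaled_matrix):
    """Overwrite each scaled term s with exp(-s), again without copying."""
    for row in scaled_matrix:
        for j, s in enumerate(row):
            row[j] = math.exp(-s)
    return scaled_matrix


distances = [[0.0, 2.0],
             [2.0, 0.0]]
kernel = kernel_in_place(EpsilonConstant(4.0).scale(distances))
print(kernel is distances, kernel)
```

Returning the same object that was passed in makes it explicit that the caller's matrix has been modified, which is the trade-off for not keeping two large matrices in memory.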

## Works Cited:

#### Diffusion Maps in Molecular Dynamics Analysis

It occurs to me in my previous post I didn’t thoroughly explain the motivation for dimension reduction in general. When we have th