Python's Summer of Code Updates

March 27, 2015

Varun Sharma
(GNU Mailman student)

January 25, 2015

Varun Sharma
(GNU Mailman student)

October 08, 2014

Julia Medina
(Scrapy student)


I'm Julia Medina, a software developer and a computer science student who is about to graduate. In the following blog posts I will write about my progress in this year's Google Summer of Code application.

by Julia Medina at October 08, 2014 08:25 PM

September 23, 2014

Roy Xue
(Theano student)

Theano Installation Guide in Chinese

Theano Wiki: Theano Windows installation guide, Chinese version


  1. Install MinGW and upgrade gcc. You may need to update gcc to 4.7.x or later to avoid some compilation errors; if so, open MinGW and run the following commands:
    % mingw-get update
    % mingw-get upgrade gcc
    # this is also needed or it will not find cc1plus.exe
    % mingw-get upgrade g++
    # we also need Fortran for building BLAS
    mingw-get install gcc-fortran
  2. Install a Python distribution (a Python build that bundles many of the dependency packages). If you prefer not to install one, see step 3.
  3. If you did not install such a distribution, install the following software:
    • Install Python 2.x
    • Install pip, Python's tool for installing and managing packages.
    • Use pip to install NumPy and SciPy:
      pip install numpy
      pip install scipy


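  1. Install the base release:
    pip install theano
  2. Install the latest version (requires git):
    # Use Git Bash Shell
    git clone git://
    Then add (or edit) the Theano folder in the PYTHONPATH environment variable, restart the command-line window, and use the following command to check that the installation succeeded:
    C:\Users\login>echo %PYTHONPATH%
    (this should print the Theano directory)
  3. Then, in your home directory (e.g., C:\Users\<you>), create a file called .theanorc (or .theanorc.txt) with the following contents:
    ldflags =
    # ldflags = -lopenblas # placeholder for openblas support
    If running Theano produces error: ‘assert’ was not declared in this scope, also add the following:
    cxxflags = -IC:\MinGW\include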

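Using the GPU (works only with Visual Studio):

  1. Install CUDA:
    • CUDA 6.5 (32-bit on 32-bit Windows, 64-bit on 64-bit Windows).
    • Optional: The CUDA GPU Computing SDK (32-bit/64-bit, matching your Python).
  2. Install Visual Studio (or Visual C++).
  3. Follow the CUDA Getting Started Guide on the Nvidia website to compile CUDA code with Visual Studio. (If this step does not succeed, Theano will probably not be able to compile GPU code.)
  4. Edit Theano's configuration file and add the following (adjust it to match your Python and Visual Studio versions):
    compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin
  5. Open the Visual Studio command prompt (in the “Visual Studio Tools” folder) and run “import theano.sandbox.cuda” in Python. This compiles the first CUDA file and should finish without errors.
    For a simple test of GPU computation, first edit Theano's configuration file:
    device = gpu
    floatX = float32
  6. Related recommendations: PyCUDA, OpenCL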
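
The configuration settings scattered through the steps above can be collected into one file. The following is only a sketch: the guide as reproduced here does not show which section each option belongs to, so the section names ([blas], [gcc], [global], [nvcc]) are an assumption based on how Theano's configuration file is usually laid out, and build_theanorc is a hypothetical helper, not part of Theano.

```python
import configparser
import io


def build_theanorc(use_gpu=False):
    """Return the text of a .theanorc file with the options listed in the guide."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as floatX case-sensitive

    # Assumed section names; the guide only lists the option lines themselves.
    parser["blas"] = {"ldflags": ""}
    parser["gcc"] = {"cxxflags": r"-IC:\MinGW\include"}
    if use_gpu:
        parser["global"] = {"device": "gpu", "floatX": "float32"}
        parser["nvcc"] = {
            "compiler_bindir": r"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"
        }

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the file contents; save them as .theanorc in your home directory.
    print(build_theanorc(use_gpu=True))
```

Adjust the compiler path to your own Visual Studio version before using the output.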


If you have any questions about this page, contact:

by xljroy at September 23, 2014 06:17 AM

September 14, 2014

(BinPy student)

About me - My First blog post !

Hi... This is my first blog post ...

I am a third-year undergrad from SVCE under Anna University, India ... I major in electronics and communications, yet most of my interests are skewed towards coding and open source contributions, and more coding ( :P )

About my interests ... I'm mad about Linux ... A big fan of open source ... I love programming ... luv toying with vi ... doing some cool stuff using machine learning ... algos ... building some circuits and stuff like that ...

Things I can boast about include my programming skills ( especially my Python skills ) ... bash scripting skills ... and my ability to listen to a mundane lecture for hours together without getting bored ( or rather, without showing the boredom ! )

When I am not coding, building some circuit or sleeping, I read ... I am a big fan of Tolstoy, Conan Doyle, Agatha Christie,  and Dickens....

You can find me by my handle "raghavrv" ( or "rvraghav93" ) at github, irc, stackoverflow, fb,  twitter, linkedin etc ...

That's all about me !!

by Raghav R V at September 14, 2014 08:31 AM

August 22, 2014

(SunPy student)

SunPy Database Browser Final Report

It gives me great pleasure to write this post about the SunPy Database Browser, which I developed as part of Google Summer of Code '14 for the SunPy organization. It has been a wonderful experience, and I am delighted to say that the first version is ready. I developed it as a plug-in for Ginga, an astronomical FITS file viewer, to enable side-by-side viewing and manipulation of solar FITS files. The plug-in is packaged as a Python package, 'ginga-sunpy', which can be installed with the following steps:

  1. First install Ginga if you don't have it already.
  2. Download the code (or clone this repo).
  3. To install
    • For general use, from the ginga-sunpy directory run python setup.py install.
    • For developers, from the ginga-sunpy directory run python setup.py develop.
    • Run ginga-sunpy from your command-line and Ginga will open with the SunPy plug-in running.

    The following database parameters are needed by the plug-in to connect to the database:
  • Driver: The database type, such as mysql, oracle or postgresql, optionally combined with the name of a DBAPI driver, such as psycopg2, pyodbc or cx_oracle.
  • Database Name: Name of the database to which to connect. (Can also specify the path to the database here)
  • User: Username of the user
  • Password: Password of the User

Optional Parameters

  1. Default Wavelength Unit
  2. Set Database as default


  1. Connect: Connects to a database based on passed arguments. Defaults to sqlite:///sunpydb
  2. Add file to Database: Adds new entries to database based on the selected file from a File Dialog Box
  3. View Database: Opens a new tab with tabular display of database entries
  4. Open Database: Connects to a sqlite database selected from a File Dialog Box
  5. Commit to Database: Commits newly added entries and any changes made to the database
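
To illustrate how the connection parameters above might combine into a single database URL such as the default sqlite:///sunpydb, here is a minimal sketch. It assumes the common dialect+driver://user:password@host/database layout; build_database_url and its host argument are hypothetical and not taken from the plug-in's code.

```python
from urllib.parse import quote_plus


def build_database_url(driver="sqlite", database="sunpydb",
                       user=None, password=None, host="localhost"):
    """Assemble a database URL from the plug-in's connection parameters."""
    if driver.startswith("sqlite"):
        # SQLite is file based: no credentials or host, only a path.
        return "{}:///{}".format(driver, database)

    credentials = ""
    if user:
        credentials = quote_plus(user)
        if password:
            # Escape characters such as '@' that would break the URL.
            credentials += ":" + quote_plus(password)
        credentials += "@"
    return "{}://{}{}/{}".format(driver, credentials, host, database)


if __name__ == "__main__":
    print(build_database_url())
    print(build_database_url("postgresql+psycopg2", "solar", "alice", "p@ss"))
```

With no arguments the helper reproduces the default mentioned under Connect.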

Database Table

  • Entries are displayed one-per row
  • The following attributes are displayed
    • id
    • path
    • observation_time_start
    • observation_time_end
    • instrument
    • min_wavelength
    • max_wavelength
    • is_starred
  • On clicking on any database entry's row, the FITS file associated to that entry opens in Ginga
  • Can display only the starred entries by selecting the 'Show starred entries only' checkbox below the table

    The repository can be found at

by Rajul Srivastava at August 22, 2014 04:42 AM

August 21, 2014

Manoj Kumar
(scikit-learn student)

GSoC : The end of another journey

I had been postponing this final post until the last of my pull requests was merged. Now that it has been merged, I have no reason left to procrastinate. This is the work that I have done over the summer, with a short description of each item.

(Just in case you were wondering why the “another” in the title, )

1. Improved memory management in the coordinate descent code.
Status: merged
Pull Request:
Changing the backend from multiprocessing to threading by releasing the GIL, and replacing the function calls with pure cblas. A huge 3x – 4x improvement in memory usage was seen without compromising much on speed.

2. Randomised coordinate descent
Status: merged
Pull Request:
Updating a randomly chosen feature (sampled with replacement) instead of cycling through all features can make the descent converge more quickly.

3. Logistic Regression CV
Status: merged
Pull Request:
Fitting a cross validation path across a grid of Cs, with new solvers based on newton_cg and lbfgs. For high dimensional data, the warm start makes these solvers converge faster.

4. Multinomial Logistic Regression
Status: merged
Pull Request:
Minimising the cross-entropy loss instead of doing a one-vs-all (OvA) fit across all classes. This results in better probability estimates of the predicted classes.

5. Strong Rules for coordinate descent
Status: Work in Progress
Pull Request:
Rules which help skip over non-active features. I am working on this and it should be open for review in a few days.
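
The idea behind randomised coordinate descent (item 2 above) can be sketched in plain Python. This is an illustration only, not scikit-learn's Cython implementation: it assumes the lasso objective (1/2n)·||y − Xw||² + alpha·||w||₁, and the function names are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def random_coordinate_descent(X, y, alpha, n_updates, seed=0):
    """Lasso by coordinate descent, picking one random feature per update."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n_samples))
              for j in range(n_features)]

    for _ in range(n_updates):
        j = rng.randrange(n_features)  # sampled with replacement
        if col_sq[j] == 0.0:
            continue
        # Correlation of feature j with the residual that ignores feature j.
        rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j])
                  for i in range(n_samples))
        new_wj = soft_threshold(rho, alpha * n_samples) / col_sq[j]
        delta = new_wj - w[j]
        if delta != 0.0:
            for i in range(n_samples):
                residual[i] -= X[i][j] * delta
            w[j] = new_wj
    return w


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]  # generated with w = [1, 2]
    print(random_coordinate_descent(X, y, alpha=0.0, n_updates=200))
```

A cyclic variant would replace the random draw with a loop over all features in order.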

Apart from these I have worked on a good number of minor bug fixes and enhancements, including exposing the n_iter parameter across all estimators, fixing incomplete download of newsgroup datasets, and soft coding the max_iter param in liblinear.

I would like to thank my mentor Alex, who is the best mentor one can possibly have (I’m not just saying this because I hope he will pass me :P), Jaidev, Olivier, Vlad, Arnaud, Andreas, Joel, Lars, and the entire scikit-learn community for helping me to complete an important project to a satisfying extent. (It is amazing how people manage to contribute so much in spite of having other full-time jobs.) I will be contributing to scikit-learn full-time till December at least as part of my internship.
EDIT: And of course Gael (how did I forget), the awesome project manager who is always full of enthusiasm and encouragement.

As they say one journey ends for the other to begin. The show must go on.

by Manoj Kumar at August 21, 2014 11:31 PM

(pgmpy student)

The final GSOC post

Since I delayed the last post, I am combining two of my blog posts and using this one to give you an update on everything that has happened in this GSOC (with special emphasis on the last 4 weeks).

So first question is, where am I right now? In Mumbai. In my college. A stupid answer !
I have finished all the weekly goals of the GSOC project apart from a few things :-

1) I wanted to implement at least one of the optimal triangulation algorithms (using the research paper). The particular algorithm to implement was decided pretty late, and I didn't find time to implement it towards the end, so that is still pending. I hope to do it soon.

2) The alpha-expansion algorithm is due. I have written the code, but there seems to be a bug. I will debug it and push the code for alpha-beta sampling soon.
Also, some code reviews are pending and hence I am expecting some work on the code improvement part. But other than that, the work which I was officially expected to do in GSOC is over. 
Now that I am done with my status, let me talk a little bit about what happened since my last blog post (since the virtual time of my blog post, if you may). For those who read my last blog post, I was slightly stuck with integrating the factor product code with Cython. However, I soon realized that Cython can take Python objects directly (duh), so I easily modified the factor product to work seamlessly with the extra data stored in factors (this extra data is the assignment of the eliminated variables, i.e. the variables that were eliminated while maximizing, corresponding to every combination of the existing variables).
Anyway, once I was done with that, the next task was the implementation of specific special-setting inference algorithms. One of them was the "Graph-cut algorithm for MAP in pairwise binary MRFs with submodular potentials ". This particular algorithm requires a max-flow-min-cut algorithm and I was really surprised to see that networkx didn't have a function which returns the exact cut-edges which I needed for this algorithm. I could have used external python libraries for this but this would have created an extra dependency just for one function. So we (I and mentors) decided that it would be best to implement the max-flow algorithm on my own which could then be used for this algorithm. So I implemented this and then implemented the algorithm.
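
To show what "returning the exact cut edges" means, here is a small sketch of a max-flow/min-cut routine. It uses the Edmonds-Karp approach (breadth-first augmenting paths) as an assumption; it is not the code that went into pgmpy, and min_cut_edges is a hypothetical name.

```python
from collections import deque


def min_cut_edges(capacity, source, sink):
    """Return (max_flow, cut_edges) for a graph given as {u: {v: capacity}}."""
    # Residual graph: every edge also gets a reverse edge of capacity 0.
    residual = {source: {}, sink: {}}
    for u, neighbours in capacity.items():
        for v, cap in neighbours.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    def reachable_from_source():
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = reachable_from_source()
        if sink not in parents:
            break  # no augmenting path left, so the flow is maximal
        # Walk back from the sink to find the bottleneck capacity.
        bottleneck, v = float("inf"), sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push the bottleneck along the path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Cut edges go from the source side (still reachable) to the sink side.
    source_side = set(parents)
    cut = sorted((u, v) for u, neighbours in capacity.items()
                 for v, cap in neighbours.items()
                 if cap > 0 and u in source_side and v not in source_side)
    return flow, cut


if __name__ == "__main__":
    graph = {"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}}
    print(min_cut_edges(graph, "s", "t"))
```

In the graph-cut MAP setting, the two sides of the cut correspond to the two labels of the binary variables.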

However, I was slightly busy for a week after this with resume submission deadlines and so on at the institute, and after that I came back to the GSOC work. I implemented the Gibbs sampling algorithm first and then moved on to the alpha-expansion algorithm. While implementing it I suddenly realized that the alpha-expansion algorithm might not be too helpful for practical purposes, because there are just too many constraints (all the variables must have equal cardinality). Anyway, I implemented it and have not pushed it yet, as there are a few issues to resolve in this function.
Once I was done with that, I spent some time on code review, examples, test cases and documentation. (Well, this is the part that I didn't like. I mean, while writing examples, I just had to copy stuff from the examples, change the format etc., and it was one of the most boring things which I did as a part of GSOC (which kind of tells you how interesting GSOC was for me, in general). I don't know if there is a better way of doing this, but copy-pasting examples, removing "self." and formatting them looks like the kind of work that should have alternatives in the CS world.) I tried to find out more about it, but couldn't find better methods. If you know of something, please tell me too.

I plan to write another blog about the stuff which I learned from GSOC. However, I think it is best written after a week or so of reflection. :)

That's all for now.

by Navin Chandak at August 21, 2014 12:04 PM

Rishabh Sharma
(SunPy student)

The end of beautiful journey.

Well, it is over. GSOC comes to an end, but not my relationship with SunPy.
The Unified Downloader (my project) is in its final phase of review.
I presented a small demo in our weekly meeting, and it was much appreciated.
Everyone liked that it sticks to the style of use people were already familiar with.
I hope to get this merged soon and then help with the next project (LightCurve Refactor).

by rishabhsharmagunner at August 21, 2014 08:00 AM

August 20, 2014

Mabry Cervin
(Astropy student)

Wrapping Up [Part 3]

One of the original goals of UQuantity was to make it usable anywhere Quantity was currently being used. A large part of this is maintaining compatibility with Numpy's features. Part of this is making sure that ndarray subclasses instantiate properly under normal Numpy usage, and part of it is enabling Numpy's universal functions ('ufuncs').

The old way of properly handling ufuncs was to use the methods __array_prepare__ and __array_wrap__. This method was clunky and difficult to do, so in the recent versions of Numpy a new method is being used, __numpy_ufunc__. The new method allows a class to intercept the usage of a ufunc on that object and handle the class specifics, returning an object as the final result of the ufunc.

In UQuantity's case the method __numpy_ufunc__ is perfect for making sure that both the uncertainties and the units are handled properly. For the uncertainties side, the package provides a very convenient higher-order function, wrap(), that takes a function (which takes and returns a float) and returns a function (which takes and returns a Variable object). The implementation of wrap() handles the derivatives necessary to make uncertainty propagation work.
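
As a rough illustration of what such a higher-order function does, the sketch below wraps a float function so that it also propagates a standard deviation, using a numerical derivative and first-order (linear) error propagation. The Measurement class and wrap_with_uncertainty are hypothetical stand-ins written for this example; they are not the package's Variable class or its wrap() implementation.

```python
import math


class Measurement:
    """A value with a standard deviation (stand-in for a Variable object)."""

    def __init__(self, value, std):
        self.value = value
        self.std = std


def wrap_with_uncertainty(func, step=1e-6):
    """Turn a float -> float function into a Measurement -> Measurement one."""
    def wrapped(measurement):
        x = measurement.value
        # Central-difference estimate of df/dx at x.
        derivative = (func(x + step) - func(x - step)) / (2.0 * step)
        # Linear propagation: sigma_f = |df/dx| * sigma_x.
        return Measurement(func(x), abs(derivative) * measurement.std)
    return wrapped


if __name__ == "__main__":
    sin_with_errors = wrap_with_uncertainty(math.sin)
    result = sin_with_errors(Measurement(0.0, 0.1))
    print(result.value, result.std)
```

Applying the wrapped function to the value and standard deviation, and the plain function to the units, gives the two halves that then get recombined.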

The units side of things is likewise trivial to handle. Units from Astropy's units package can be operated on the same as numbers. In addition to making them useful for dimensional analysis, this makes it easy to handle the units separate from the values. All together this means that UQuantity simply has to apply the ufunc to the value and standard deviation via wrap() and apply it to the units and then recombine them.

This all comes together to solve the problem in my previous post. Previously, my issue was that super() was broken due to the specifics of how it handles type comparisons and my metaclass. I wanted to avoid having to write long functions for each standard Python mathematical operation (goes back to the original motivation for multiple inheritance), which would generally be solved by using super() to pass the operation up the inheritance tree. With Numpy operations working, however, I simply needed to write Python's functions (__add__, __mul__, etc) in terms of Numpy's functions (np.add(), np.multiply(), etc) and let __numpy_ufunc__ handle the tedious work.

by Epitrochoid at August 20, 2014 12:24 AM

August 19, 2014

(BinPy student)

GSoC 2014 - End of GSoC - Beginning of new relationships

It is hard to believe that GSoC is finally coming to an end.
I would rather call this the beginning of a new relationship with the open source world.

I first of all wish to thank Jay Rambhia, Sudhanshu Misra, Salil Kapur, Sarwar Chahal and the team for letting me be a part of the wonderful library and their rich community of people. They almost made me feel as if I was a part of their BITS-ian community.

I can't resist saying that GSoC is the best and most awesome thing that has ever happened in my life.

Okay enough bragging ...

My project involved developing the core of BinPy, including new features, incorporating better ideas etc ...

Though I did not achieve all that I said I would do in this summer, I did do my level best in integrating some good concepts which I feel were important to our BinPy.

This is a summary of all the work that I did for my GSoC 2014 ( in no particular order ):

  • The Linker updation thread, which maintains the Connections and updates their state in the background. ( This one is my personal favourite )
  • The StepperMotor Simulation ( Later moved to BinPyDesk )
  • Refactoring the code base to conform to PEP8 naming conventions.
  • Analog Buffer with attenuation support.
  • Analog Signal Generators ( with capability of modulated output based on external triggers )
  • IPython notebook examples and Sphinx documentation
  • ASCII Based Oscilloscope
  • Multivibrator module. With Astable / Monostable and Bi Stable signal generation
  • A rudimentary expression module ( Rajat assembled this into a module, and provided truth table synthesis )
  • ASCII based IC state drawer.
  • Pin Class as a container for the IC logic.
  • Latches and Flipflops etc ...
These are the things I am currently working on or will be working on post-GSoC:
  • Binary multiplication algorithms like Karatsuba's, Booth's, Robertson's, Toom-3 and SSA, and algorithms for binary division: restoring and non-restoring. [ This one is almost done. I am kinda stuck on Toom-3, where Bodrato's sequence evaluation, which requires negative intermediate results, is causing some trouble... ]

  • Microprocessor modules ( I initiated this work and felt there is a lot more core stuff that needs to be implemented to realize microprocessors. For instance, the expr module needs to be fine-tuned. This will help realize the microprocessors easily, like we do in Verilog. So not much work was done here. I wish to continue the work once I am done with the expression module )

  • Expression module - Tree parsed boolean expression module with KMap, Truthtable, Quine McCluskey's methods incorporated into the same.

  • PyQt based Workbench and  Oscilloscope.
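
As an illustration of the first multiplication algorithm named in the list above, here is a minimal Karatsuba sketch on Python integers. It is written for this post's readers and is not BinPy's implementation; the cut-off of 16 for falling back to ordinary multiplication is an arbitrary choice.

```python
def karatsuba(x, y):
    """Multiply two non-negative integers with three recursive products."""
    if x < 16 or y < 16:
        return x * y  # small operands: plain multiplication

    # Split both numbers at the same bit position.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_high, x_low = x >> half, x & mask
    y_high, y_low = y >> half, y & mask

    low = karatsuba(x_low, y_low)
    high = karatsuba(x_high, y_high)
    # (x_low + x_high)(y_low + y_high) - low - high = cross terms.
    cross = karatsuba(x_low + x_high, y_low + y_high) - low - high

    return (high << (2 * half)) + (cross << half) + low


if __name__ == "__main__":
    print(karatsuba(12345678, 87654321))
```

The saving comes from replacing four half-size multiplications with three, at the cost of a few extra additions and shifts.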
Overall this was a great experience to me :)

Thanks to everyone involved and the Python community ( To Terri, meflin, etc ... )

And finally as the BITS community would say - Take lite ;)

by Raghav R V at August 19, 2014 02:16 PM

Michael Mueller
(Astropy student)

Week 13

This was the final week of Google Summer of Code, and since last Monday was the suggested "pencils down" date, I spent the week focusing on getting the main pull request ready for merging. I began by testing the new fast converter for unusual input, then handled issues Erik noted with the PR, filed an issue with Pandas, and began work on a new branch which implements a different memory scheme in the tokenizer. The PR seems to be in a final review stage, so hopefully it'll be merged by next week.
After testing out xstrtod(), I noticed a couple problems with extreme input values and fixed them; the most notable problem was an inability to handle subnormals (values with exponent less than -308). As of now, the converter seems to work pretty well for a wide range of input, and the absolute worst-case error seems to be around 3.0 ULP. Interestingly, when I reported the problems with the old xstrtod() as a bug in Pandas, the response I received was that the current code should remain, but a new parameter float_precision might be added to allow for more accurate conversion. Both Tom and I found this response a little bizarre, since the issues with xstrtod() seem quite buggy, but in any case I have an open PR to implement this in Pandas.
Aside from this, Erik pointed out some suggestions and concerns about the PR, which I dealt with in new commits. For example, he suggested that I use the mmap module in Python rather than dealing with platform-dependent memory mapping in C, which seems to make more sense for the sake of portability. He also pointed out that the method FileString.splitlines(), which returns a generator yielding lines from the memory-mapped file, was inefficient due to repeated calls to chr(). I ultimately rewrote it in C, and although its performance is really only important for commented-header files with the header line deep into the file, I managed to get more than a 2x speedup on a 10,000-line integer file with a commented header line in the last row with the new approach.
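
For readers unfamiliar with the mmap module, a generator that yields lines from a memory-mapped file can be sketched as below. This is a pure-Python illustration with a hypothetical name, not the FileString.splitlines() code from the PR (which ended up in C); it searches for newline bytes instead of inspecting characters one by one.

```python
import mmap


def iter_mapped_lines(path, encoding="ascii"):
    """Yield the lines of a non-empty file through a read-only memory map."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; mmap raises ValueError on empty files.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start, size = 0, len(mapped)
            while start < size:
                end = mapped.find(b"\n", start)
                if end == -1:  # final line without a trailing newline
                    end = size
                yield mapped[start:end].decode(encoding)
                start = end + 1


if __name__ == "__main__":
    import sys
    for line in iter_mapped_lines(sys.argv[1]):
        print(line)
```

The sketch only splits on "\n", so files with Windows line endings would keep a trailing "\r" on each line.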
Although it won't be a part of the main PR, I've also been working on a separate branch change-memory-layout which changes the storage of output in memory in the tokenizer. The main purpose of this branch is to reduce the memory footprint of parsing, as the peak memory usage is almost twice that of Pandas; the basic idea is that instead of storing output in char **output_cols, it's stored instead in a single string char *output and an array of pointers, char **line_ptrs, records the beginning of each line for conversion purposes. While I'm still working on memory improvements, I actually managed to get a bit of a speed boost with this approach. Pure floating-point data is now slightly quicker to read with io.ascii than with Pandas, even without multiprocessing enabled!
Since today is the absolute pencils down date, this marks the official end of the coding period and the end of my blog posts. I plan to continue responding to the review of the main PR and finish up the work in my new branch, but the real work of the summer is basically over. It's been a great experience, and I'm glad I was able to learn a lot and get involved in Astropy development!

by Michael Mueller at August 19, 2014 04:13 AM

Roy Xue
(Theano student)

GSoC Final Summary #GSoC2014_Theano


The final pencils-down day has finally arrived. The past three months have been an amazing journey for me; this is the first time I have been able to work for an open source community. The Theano community is really great. My mentors, Fred, James and Arnaud, gave me great help and taught me a lot of useful knowledge. It has been a precious opportunity for me to work with them and learn from them.

For the GSoC part, I finished the "Lower the max memory usage" part, which consisted of several pieces. First, I created two new variables that track the order in which nodes are executed and the order in which they are cleared, because the order previously used in memory profiling is no longer the one in use. After that, I wrote the memory counting method and made it work for both orders (I kept the previous order, and its results are printed in brackets after the results for the current order). I also created the test_profiling file so that this part is covered by unit tests. Then I wrote an algorithm to find the minimum peak memory over all execution orders. For example, our simple test file has 11 nodes, so there are up to 11! (11 factorial) candidate orders; I have to find the valid ones and compute the minimum peak memory among them. The first version of the algorithm took 7 hours to finish, but it now takes about 2-3 minutes. While searching for the best algorithm, we found a fault in the previous memory counting method: node execution generates some view variables, and memory counting has to take these into account. So we came up with a new algorithm to fix this fault.
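
The search for the minimum peak can be pictured with a brute-force sketch: enumerate every ordering of the nodes, keep those that respect the dependencies, and track the memory that is live after each step. This is a toy model under stated assumptions (each node produces one output of a known size, and an output is freed as soon as its last consumer has run); the function names are hypothetical and the code is not Theano's profiler, which needs something much faster than enumerating all orders.

```python
from itertools import permutations


def is_valid(order, inputs):
    """An order is valid if every node runs after all of its inputs."""
    seen = set()
    for node in order:
        if any(parent not in seen for parent in inputs[node]):
            return False
        seen.add(node)
    return True


def peak_memory(order, inputs, size):
    """Peak live memory when nodes run in the given order."""
    # Number of consumers still waiting for each node's output.
    remaining = {n: sum(1 for m in inputs if n in inputs[m]) for n in inputs}
    live = peak = 0
    for node in order:
        live += size[node]  # allocate this node's output
        peak = max(peak, live)
        for parent in set(inputs[node]):
            remaining[parent] -= 1
            if remaining[parent] == 0:
                live -= size[parent]  # last consumer ran: free the input
    return peak


def min_peak_order(inputs, size):
    """Return (peak, order) with the smallest peak over all valid orders."""
    best = None
    for order in permutations(inputs):
        if is_valid(order, inputs):
            peak = peak_memory(order, inputs, size)
            if best is None or peak < best[0]:
                best = (peak, order)
    return best


if __name__ == "__main__":
    inputs = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
    size = {"a": 4, "b": 4, "c": 1, "d": 1, "e": 1}
    print(min_peak_order(inputs, size))
```

In this small graph, running c right after a frees a before b is allocated, which is what lowers the peak.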

Furthermore, after GSoC I have decided to continue as a contributor to Theano. I will finish the “reduce the number of allocation/reuse allocation of ndarray” part first, and then do other work for Theano.

I think my Python coding skills have improved a lot during GSoC. There is still a lot to learn, and I have a long way to go.

I would especially like to thank Fred. I had a great time learning from him, and he told me a lot about further study at his university.

Also, thanks to Google for providing this amazing program, which opened an “open source” door for me. I really enjoyed this summer.

by xljroy at August 19, 2014 03:27 AM

Hamzeh Alsalhi
(scikit-learn student)

Google Summer of Code 2014 Final Summary

Now at the end of this GSoC I have contributed four pull requests that have been merged into the code base. There is one planned pull request that has not been started and another pull request nearing its final stages. The list below gives details of each pull request and what was done or needs to be done in the future.

This GSoC has been an excellent experience. I want to thank the members of the scikit-learn community, most of all Vlad, Gael, Joel, Oliver, and my mentor Arnaud, for their guidance and input which improved the quality of my projects immeasurably.

Sparse Input for Ensemble Methods

PR #3161 - Sparse Input for AdaBoost
Status: Completed and Merged
Summary of the work done: The ensemble/weighted_boosting class was edited to avoid densifying the input data and to simply pass along sparse data to the base classifiers to allow them to proceed with training and prediction on sparse data. Tests were written to validate correctness of the AdaBoost classifier and AdaBoost regressor when using sparse data by making sure training and prediction on sparse and dense formats of the data gave identical results, as well as verifying the data remained in sparse format when the base classifier supported it. Go to the AdaBoost blog post to see the results of sparse input with AdaBoost visualized.

PR - Sparse input Gradient Boosted Regression Trees (GBRT)
Status: To be started
Summary of the work to be done: Very similar to sparse input support for AdaBoost, the classifier will need modification to support passing sparse data to its base classifiers and similar tests will be written to ensure correctness of the implementation. The usefulness of this functionality depends on the sparse support for decision trees which is a pending mature pull request here PR #3173.

Sparse Output Support

PR #3203 - Sparse Label Binarizer
Status: Completed and Merged
Summary of the work done: The label binarizing function in scikit-learn's label code was modified to support conversion from sparse formats, and its helper functions in the utils module were modified to detect the representation type of the target data when it is in sparse format. Read about the workings of the label binarizer.

PR #3276 - Sparse Output One vs. Rest
Status: Completed and Merged
Summary of the work done: The fit and predict functions for one vs. rest classifiers were modified to detect sparse target data and handle it without densifying the entire matrix at once. Instead, the fit function iterates over densified columns of the target data and fits an individual classifier for each column, and the predict function uses binarization on the results from each classifier individually before combining the results into a sparse representation. A test was written to ensure that classifier accuracy was within a suitable range when using sparse target data.

PR #3438 - Sparse Output Dummy Classifier
Status: Completed and Merged
Summary of the work done: The fit and predict functions were adjusted to accept the sparse format target data. To reproduce the same behavior of prediction on dense target data first a sparse class distribution function was written to get the classes of each column in the sparse matrix, second a random sampling function was created to provide a sparse matrix of randomly drawn values from a user specified distribution. Read the blog post to see detailed results of the sparse output dummy pull request.

PR #3350 - Sparse Output KNN Classifier
Status: Nearing Completion
Summary of the work done: In the predict function of the classifier the dense target data is indexed one column at a time. The main improvement made here is to leave the target data in sparse format and only convert a column to a dense array when it is necessary. This results in a lower peak memory consumption, the improvement is proportional to the sparsity and overall size of the target matrix.
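
The "densify one column at a time" idea can be sketched without any library by using the three arrays of a compressed sparse column (CSC) layout directly. This is a simplified illustration with hypothetical function names, not the code from the pull request; in practice SciPy's sparse matrices provide this storage.

```python
def dense_column(data, indices, indptr, n_rows, j):
    """Expand column j of a CSC matrix into a plain list of length n_rows."""
    column = [0] * n_rows
    # Entries of column j live in data[indptr[j]:indptr[j + 1]].
    for k in range(indptr[j], indptr[j + 1]):
        column[indices[k]] = data[k]
    return column


def column_label_counts(data, indices, indptr, n_rows):
    """Count label values per column while holding only one dense column."""
    counts = []
    for j in range(len(indptr) - 1):
        column = dense_column(data, indices, indptr, n_rows, j)
        tally = {}
        for label in column:
            tally[label] = tally.get(label, 0) + 1
        counts.append(tally)
    return counts


if __name__ == "__main__":
    # CSC form of the 3 x 2 matrix [[1, 0], [0, 0], [0, 2]].
    data, indices, indptr = [1, 2], [0, 2], [0, 1, 2]
    print(column_label_counts(data, indices, indptr, n_rows=3))
```

Peak memory is then bounded by one dense column plus the sparse arrays, rather than by the full dense matrix.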

Future Directions 

It is my goal for the Fall semester to support the changes I have made to the scikit-learn code base the best I can. I also hope to see myself finalize the remaining two pull requests.

by Hamzeh at August 19, 2014 01:28 AM

August 18, 2014

Julia Medina
(Scrapy student)

Final Summary

Summer of code has finally come to an end. I’ve managed to develop all the ideas in my original proposal, although their implementation drifted from what was planned at the beginning. The API remained essentially the same, but details in the execution were adjusted given previously unconsidered matters and new ideas that came along for improving the consistency, simplicity and user-friendliness of the interface.

One of these considerations was dropping backward compatibility for rarely used features, when keeping their functionality would clutter and bloat the codebase. The most important implication of this decision was that it freed the design of the API from several constraints, which improved the clarity and straightforwardness of the implementation. The list of dropped features can be seen in the description of the clean up pull request.

I think the major highlight of the cleanup process (and actually what the other changes revolve around) is the update of the Crawler class. We modified its dependencies by taking a required Spider class at initialization time, effectively linking each crawler to a single spider definition. Its functionality has been unified in two methods: its creation (with the already mentioned spider class and the configuration of its execution) that initializes all the components needed to start crawling, and the crawl method, that instantiates a spider object from the crawler’s spider class and sets the crawling engine in motion.

Since even this distinction is usually not necessary, a new class, CrawlerRunner, was introduced to deal with configuring and starting crawlers without user intervention. This class handles multiple crawler jobs and provides convenient helpers to control them. This functionality was moved from an already implemented helper, CrawlerProcess, which was left in charge of Scrapy's execution details, such as Twisted’s reactor configuration and hooking system signals.

Finally, per-spider settings development didn’t diverge significantly from the proposal. Each spider has a custom_settings class attribute with settings that will be populated by the update_settings method. The latter was made available so users can override the default population behavior (By default, they are set with a custom ‘spider’ priority).
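
How a 'spider' priority lets per-spider values override project settings can be shown with a toy model. ToySettings and ToySpider below are simplified stand-ins written for this example, not Scrapy's Settings or Spider classes, and the numeric priority levels are assumptions chosen only to give 'spider' a rank between 'project' and 'cmdline'.

```python
# Assumed priority levels: a higher number wins.
PRIORITIES = {"default": 0, "project": 20, "spider": 30, "cmdline": 40}


class ToySettings:
    """Stores each setting together with the priority it was set at."""

    def __init__(self):
        self._values = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority="project"):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        # Only overwrite when the new priority is at least as high.
        if current is None or level >= current[1]:
            self._values[name] = (value, level)

    def get(self, name, default=None):
        return self._values.get(name, (default, None))[0]


class ToySpider:
    custom_settings = None  # subclasses may provide a dict

    @classmethod
    def update_settings(cls, settings):
        # Default population behaviour: apply with 'spider' priority.
        for name, value in (cls.custom_settings or {}).items():
            settings.set(name, value, priority="spider")


class SlowSpider(ToySpider):
    custom_settings = {"DOWNLOAD_DELAY": 2}


if __name__ == "__main__":
    settings = ToySettings()
    settings.set("DOWNLOAD_DELAY", 0.5, priority="project")
    SlowSpider.update_settings(settings)
    print(settings.get("DOWNLOAD_DELAY"))
```

Overriding update_settings in a subclass is where a user would change this default population behaviour.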

A full list of implemented features can be seen in the description of the API clean-up pull request, along with the Per-spider settings and first half’s Settings clean-up pull requests.

I had to learn a little bit of Twisted’s basics to understand how concurrency is dealt with in Scrapy, and I found the concept of deferreds (the core element of Twisted’s application model, inspired by the concept of futures) quite intuitive and a great alternative for achieving concurrent execution in Python.

By exposing the deferreds returned by delayed routines and providing convenient single-purpose helpers we came up with a flexible interface.

For instance, a single spider inside a project can be run this way:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Now it's possible to run spiders outside projects too (which allows Scrapy to be used as a library instead of a framework, one of the goals of this GSoC), in a similar manner:

from twisted.internet import reactor
from scrapy.spider import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class MySpider(Spider):
# Your spider definition

settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

We can run multiple spiders sequentially as before, but with a much simpler approach:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['', '']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

Or we can run them simultaneously (something that previously wasn’t viable) by relying on the deferreds interface:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['', '']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

Each usage example and interface details are carefully documented in the referenced pull requests.

My work is currently under evaluation by the Scrapy developers, and I’m fixing the issues raised during review. There aren’t any apparent critical issues with the implementation, but I plan on addressing further suggestions so the changes can be merged into the main repository.

There are some ideas that I’d love to work on given the time to improve other aspects of the Scrapy API, but I can gladly say that I've delivered every point of my proposal.

After being through this process, I’d like to bring out the importance of writing documentation to closely contemplate design decisions. Good practices and high standards in this open source project were great guidelines that assure the quality of the developed work. Code is extensively tested, documented and backported (within reasonable expectations). Even keeping a clean commit history is a sensitive matter.

I got to know Github features that were really useful for discussing decisions with the community. The Travis continuous integration system integrated with this project made submitting patches that fit the minimum expectation of successfully completing the test suite easier. Finally, being reviewed by such skilled developers was definitely the experience that I enjoyed and valued the most, and I appreciate them for helping me in this process.

I want to wrap this post up by saying that I really loved this program. Actually, I would have wanted to hear about it sooner so I could have applied before. It was great to get involved in the open source world and get to know these developers, and I recommend this experience to any student who is looking to participate in a real project, face an interesting challenge and contribute to open source while doing so.

by Julia Medina at August 18, 2014 10:17 PM

Asra Nizami
(Astropy student)

And GSoC is over..

This has been a really hard blog post for me to write because every time I sat down to write something, I'd feel sad about my Summer of Code ending. I've had such a great time working on my project for the summer! It was different from anything I'd ever worked on before, it was challenging but not so much that I'd want to pull my hair out in frustration and, on top of that, it was relevant to my own academic interests.

Overall, I've pretty much finished what I set out in my proposal. The one thing I never got to was working with NDData, an Astropy class somewhat similar to NumPy arrays; since NDData is still under development, we decided not to focus on it. Other than that, I worked on adding a lot of other features and fixed bugs as we discovered them, something I didn't anticipate in the beginning but which took up a significant part of my summer (in a good way of course). WCSAxes is now in a pretty good place - we did our first release in early July and I just released a second version last week with some bug fixes and new features. I also have write access to the repository now! :D Integration of WCSAxes into APLpy is also pretty much finished. There is one non-trivial issue we need to find a solution to, but it's low priority so it's fine.

There have been some roadblocks along the way, but nothing too unmanageable. If I did get stuck, my mentor was there to help me figure things out. This is where I should thank my mentor, Tom, for always being available and supportive, explaining things clearly and just being great to work with! :) 

I'm also planning to continue working on WCSAxes once GSoC is over or maybe see if I can contribute to something else for Astropy or an Astropy affiliated package, but that also depends on how much work there is for me to do and if it fits me. I really hope I do find things to work on because I've learned so much about the tools available in Python for astronomers by working with Astropy. 

Well that's it, I guess. I hope you've enjoyed reading about my summer!

by Asra Nizami at August 18, 2014 08:45 PM

Mabry Cervin
(Astropy student)

Wrapping Up [Part 2]

In the previous post I explained that UQuantity would use multiple inheritance to solve problems that had been met with writing a wrapper. Unfortunately multiple inheritance is best used with classes that are written to work together (cooperative multiple inheritance), whereas the two classes I was working with were from separate packages. There were some things that worked out, but there were also issues with the approach.

The luckiest thing is that the class Quantity is a subclass of NumPy's ndarray. Due to the specifics of how ndarrays are initialized, subclasses put their initialization code in the methods __new__ and __array_finalize__, while Variable used __init__. This allowed me to call each parent class's initialization code separately without having to resort to wrapping each class to make them compatible via super(). Right off the bat, however, I got the error "TypeError: multiple bases have instance lay-out conflict."

This particular error is due to the field __slots__ being defined in the inheritance tree of both parents. __slots__ is a special field that causes the interpreter to give instances of any class that defines it a fixed, statically laid-out set of attributes rather than a per-instance dictionary, with significant memory savings. Because this happens at a very low level, there is no way to reconcile having two parent definitions of the field.
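The conflict is easy to reproduce in isolation. The sketch below uses two made-up classes, Left and Right, standing in for the real parents (they are not Quantity or Variable), and is written for Python 3:

```python
class Left(object):
    __slots__ = ('x',)   # instances get a fixed slot instead of a __dict__


class Right(object):
    __slots__ = ('y',)


def try_to_combine():
    """Try to inherit from both slotted classes; return the error text, if any."""
    try:
        class Both(Left, Right):
            pass
    except TypeError as exc:
        return str(exc)
    return None


message = try_to_combine()
print(message)                        # the interpreter's lay-out conflict message
print(hasattr(Left(), '__dict__'))    # slotted instances carry no per-instance dict
```

Both parents declare a non-empty __slots__, so the interpreter cannot merge their instance layouts and refuses to create the combined class.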

The immediate (and simple) solution is to remove the __slots__ declaration from one of the parent classes, and this is what I did at first. While this solution works well, it would require shipping a very slightly modified copy of the parent within the class, something that would be best avoided. The other simple solution would be to inherit one of the parents and wrap the other, but then I would end up with all the problems from the first post that I sought to avoid.

The solution that finally worked for me was to use a metaclass. In Python metaclasses describe the class of a class; in the way an object is instantiated from a class, a class is instantiated from a metaclass. Writing my own metaclass gave me a very powerful tool, a type constructor (in fact the base metaclass is called type) that would be called before the class is defined and its inheritance set. The metaclass that I wrote is below.

The __new__ method acts as a type constructor and is passed the parameter "bases", which is a tuple of the parents of the class. The __new__ method simply copies each base, checks its dictionary for __slots__, and removes it from the dictionary if found. At the end it takes the new tuple of bases and constructs the class (not an instance, but the class itself) being meta'd using type().
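As an illustration of the technique just described, here is a minimal sketch in Python 3. It is not the metaclass from the project: the names NoSlotsMeta and strip_slots and the classes A, B and C are hypothetical, and the sketch ignores corner cases such as name-mangled private slots.

```python
def strip_slots(base):
    """Return a rebuilt copy of `base` without __slots__ (hypothetical helper)."""
    namespace = dict(vars(base))
    slots = namespace.pop('__slots__', None)
    if slots is None:
        return base                      # nothing to strip, keep the original class
    if isinstance(slots, str):
        slots = (slots,)
    for name in slots:
        namespace.pop(name, None)        # drop the member descriptor created for each slot
    # Descriptors tied to the old class must not be carried over to the copy.
    namespace.pop('__dict__', None)
    namespace.pop('__weakref__', None)
    return type(base.__name__, base.__bases__, namespace)


class NoSlotsMeta(type):
    """Metaclass that replaces every slotted base with a slot-free copy."""

    def __new__(mcls, name, bases, namespace, **kwargs):
        bases = tuple(strip_slots(base) for base in bases)
        return super().__new__(mcls, name, bases, namespace, **kwargs)


class A(object):
    __slots__ = ('a',)

    def __init__(self):
        self.a = 1


class B(object):
    __slots__ = ('b',)


class C(A, B, metaclass=NoSlotsMeta):    # would raise TypeError without the metaclass
    pass


c = C()
c.b = 2
print(c.a, c.b)
print(issubclass(C, A))   # False: C inherits from a copy of A, not from A itself
```

The last line shows the side effect of this approach: because the parents are rebuilt, the new class is no longer related to the original parent classes as far as issubclass() and isinstance() are concerned.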

The metaclass works perfectly for its intended purpose: I was able to inherit from two parents even though both had __slots__ defined in their inheritance tree. It did create a subtle issue that would make later coding difficult, however. When it modifies the parent class, it changes the type id of that class at runtime. This causes any calls to super() that would touch that parent to fail, on account of super()'s use of isinstance(). While UQuantity is a subtype of Variable in every practical way, this modification to the type id causes isinstance(UQuantity, Variable) to fail.

In the next (and likely final) post I will detail how breaking super() was less than optimal, and how I got mathematical operations working without it (in a way that fortuitously adds additional features, no less).

by Epitrochoid at August 18, 2014 08:28 PM

Alan Leggitt
(MNE-Python student)

Pencils Down

Welp, that's the end of Google Summer of Code 2014, but not the end of this project. I didn't blog as much as I would have liked to, but I think I accomplished a lot overall.

Regarding the last post, the problem I encountered was that the source spaces were in "head" coordinates, which happens when the forward solution is computed. "Head" coordinates refers to the coordinate space of the MEG or EEG sensors. To resolve this, I had to transform the grid of mri voxels to "head" coordinates as well when mri_resolution=True. This wasn't an issue when mri_resolution=False because the volume source space was also converted to "head" coordinates.

Now the class SourceSpaces has a method called export_volume, which saves the source spaces as a nifti or mgz file that can be viewed in freeview. This only works for mixed source spaces with at least one volume source space, since the volume source space is responsible for setting up the 3d grid.

The source estimate can also be computed from a mixed source space. I wasn't able to implement code to view the source estimate as a 4d image, but that will build largely on the export_volume code previously described.

In addition, I created an example file to generate mixed source spaces. This example outputs the following figures.

The first figure shows the cortical surface with the additional volume source space of the left cerebellum. The locations of the dipoles are in yellow. The second figure shows the .nii file in freeview, where source spaces are in red.

Future work includes creating these visualizations for source estimates, adding options to fix the orientation of surface but not volume dipoles, and continuing to test the accuracy of these combined source spaces using simulated data.

by Alan Leggitt at August 18, 2014 06:51 PM

Saurabh Kathpalia
(MoinMoin student)

GSoC Journey ends......

This week I mainly worked on fixing bugs, mostly in the UI but also some in the backend, including bugs that were reported in my repo. Here is the list of bugs that were solved this week.

  1. Now user is redirected to updated ticket page on clicking ticket update button in ticket modify view - commit
  2. Moved login link to right, increased spacing between EDSP in ticket view and some more css changes - commit
  3. Now user can see unsubscribe link in blog view if he is subscribed to that blog in modernized theme - commit
  4. Now user can select a filter(open/closed/all) and then choose sort option and only tickets relevant to that filter will be displayed - commit
  5. Now user can check the issues assigned to a particular user - commit
  6. Added link to depends_on and superseded_by in ticket modify view - commit
  7. Made ticket-create button in +tickets view working with a workaround - commit
  8. Now Clicking the All, Open, or Closed buttons in +tickets view clears search field and selected tags - commit
  9. Top-right search query field is now aligned properly in modernized theme - commit
  10. Removed overlapping of create-ticket button with the heading underlining - commit
  11. Now hspacing between All/Open/Closed and Sort is same in +tickets view - commit
  12. Fixed traceback which comes when a superuser views the Admin > Users and Admin > Groups reports - commit
  13. Changed format of ticket submit and modify view- moved comments to right and metadata to left - commit
  14. Now blog entry heading and discussion page link at the bottom are left aligned - commit
  15. Added spacing between create-blog-entry button and there are not blog entries message and also changed the tip for sorting by columns in +tickets view - commit
  16. Removed padding for table headers not having on click sorting feature - commit
Finally the 3 month long journey has come to an end. This was a great experience for me; I learnt many new things and also got exposure to working with Open Source. I would like to thank my mentors - Dmitrijs, Roger and Thomas - for their constant support throughout this time period. They have been really very helpful throughout the project.
Looking forward to continuing to contribute to Open Source.

by saurabh kathpalia at August 18, 2014 04:47 PM

(SunPy student)

The Final Push

All good things must come to an end.

~ Some Dude on the Internet

And so, we have reached the close of the code-fest that is Google Summer of Code. It has been an enlightening, thoroughly enjoyable and skill-building journey. One that has laid the foundation of my future career and one that I will never forget.

In any program like GSoC, the mentors matter the most – they are the ones who enable a student to reach their potential and deliver something of note. I would like to thank all my SunPy mentors – Stuart, David, Steven, Nabil – for ensuring that I never had any uncleared doubts. Stuart, especially, has been available throughout the duration of the program, and was always around for a screenshare session, or to explain something that I didn’t understand. He was open to conversation even when he was out of town. I am very thankful for their active support! I would also like to thank Google and Melange for ensuring a smooth and wonderful experience. 

The last few weeks have gone mostly in testing and documentation. Stuart and I introduced some analytical tests to check on the accuracy of the coordinate transforms. The tests were mostly passing, except for one parameter which seems to be giving us the wrong values, i.e., expected values don’t match the output. Those interested in the mathematics behind this kind of testing may access the relevant document here, and those who wish to see the test code in action may check it out here

The final (in GSoC terms) code can be accessed here and the PR is on GitHub, of course! Over the next few days, I will be completing the SunPy Enhancement Proposal to get this pull request fully accepted. I also gave a presentation on Hangouts to explain the purpose of my module, whose IPython notebook can be accessed here.

Besides all this, I have been learning a bunch of web frameworks to aid in my quest for a job. It was Stuart’s suggestion to start learning Flask, and now that I am learning Django as well, I can see why he suggested it. Everything is so…simple in Flask. But then, it would be to my benefit to learn both to some extent. Miguel Grinberg’s Mega Tutorial on Flask has been extremely helpful, and I would recommend it to those who wish to learn web development in Python. Hopefully, with my Flask knowledge, and a bit of Bootstrap, I will be able to make a nice web framework for SunPy in the near future, or at least I plan to!

I am signing off for now, but I may post some more stuff in the future. Stay tuned!

by xpritish at August 18, 2014 03:53 PM

Ajitesh Gupta(:randomax)
(MoinMoin student)

Week 13 - End of the journey

It's the end of week 13 now and also the "Firm Pencils Down" date. Three months have passed since I commenced work on this project. It's been a long and eventful summer and I got to learn a lot from it.

The last week has been quite hectic, with a flurry of pull requests to fix existing bugs and bugs that surfaced recently, and to make changes according to feedback received from my mentors. Here is a list of tasks I completed in the last week -

1. Fixed issue #451 - Now the modify page prompts users if they try to leave the page without saving the changes, in all themes. Also it gives the prompt only when the user has made a change.

2. Fixed issue #454 - Fixed the Global Index to show the alphabetical filters to sort the items

3. Finally fixed quicklinks and their tooltips - Fixed the overflowing quicklinks text in the basic theme. Also fixed the quicklink tooltips which now show both wikititle and full url to the item instead of just the wikititle.

4. Added css for smileys in basic theme which was missing earlier - Commit

5. Replaced important hex color values with variables in basic theme's theme.less. We should stick to using variables as much as possible as it makes it easier to understand and use and also make changes quickly - Commit

6. Fixed the invisible links in meta view in basic theme - The itemlinks in the meta section in the basic theme had the same color as the background color and hence became invisible. Changed the color of the links in this patch to fix the issue. - Commit

7. Added capability to edit acl string in item acl report view itself - Commit

8. Fixed the overflow of item names in the index view - Long item names used to break the css and overflow into a second line - Commit

9. Removed full stops from view titles - Commit

10. Fixed erroneous input element css in basic theme - There was unnecessary shortening of input textboxes - Commit

11. Fixed overlapping links in orphan view - The links did not have enough vspace between them. Added css to fix that - Commit

12. Right aligned time column in modernized theme history view - Commit

13. Made default acl string green in item acl report and made submit acl button smaller to make it look neater - Commit

14. Increased font size in global tags view in Modernized theme - The tag cloud in this theme earlier had very tiny font which was very hard to read - Commit

15. Removed extra commas and spaces between links in the meta view in basic theme - Commit

16. Made section heading size uniform across all sections in basic theme - Commit

17. Shifted Login button to the right in modify view in basic theme - Commit

18. Added title to icons in global history view and added padding under date titles - Commit

19. Fixed faulty footer which used to rise to the top at low resolutions in basic theme - Commit

20. Fixed failing modify view - The modify view was unable to make changes to an item because it could not retrieve the Item ID - Commit

Phew!! Finally this has come to an end. It was a whole new experience and a whole new level of exposure to real-time, real-world work. I would like to thank both my mentors - Roger and Thomas - for helping me and guiding me throughout the project and, more importantly, for giving me an opportunity to work on such a big project. I would also like to thank PSF and Google, because without them this would not have been as good as it was. I hope to keep contributing to open-source projects. Cheers.

by Ajitesh Gupta at August 18, 2014 03:51 PM

Rishabh Raj
(scikit-image student)

Last but not least

Towards the later part we focused on streamlining the gallery as much as possible with respect to real world usage, which even involved fixing positions for buttons. The most visible among such changes was extending the setup, which initially allowed editing only one snippet at a time to handle the case when multiple snippets are present on a page, for ex –

We added a configuration file which enables setting up different parameters for the server, such as how many lines of STDOUT/ERR do we want to return with each request, the maximum size of the queue, etc.

We now also have cleaner code as well as better documentation at the repo – as well as a small demo (link in the docs) which should ease the reuse of this code by other projects should they have similar aims in mind

The docs were moved over to Github @ which is strikingly more reliable for serving HTTP than the SimpleHTTPServer which we were using previously, as expected ;)

Needless to say, I would love to continue to remain associated with the scikit-image organisation and PSF as a whole too. I had a good time.

PS – /me is a big goodie fan. Does PSF / scikit-image give away any? <insert_cute_cat_picture>

by sharky932014 at August 18, 2014 03:25 PM

M S Suraj
(Vispy student)

GSoC 2014 Final Summary

GSoC has finally come to an end. It was an awesome experience and I would like to thank the amazing vispy developers – Luke, Eric, Almar, Cyrille and Nicolas for their guidance without which this wouldn’t have been possible.

I delayed this post a bit so that I could complete a working example of a RectPolygon visual with rounded corners. Here’s a sample – 


As you can see, you can also specify a different radius of curvature for each of the four corners. The corner vertices are also generated linearly with the curvature – the more the curvature, the more the no. of vertices. Although I stumbled upon this after implementing it, here’s a link which shows how the radius is defined –

The current PR is being reviewed and I hope to get it merged by tonight before GSoC ends completely.

Here’s a summary of the work completed so far:

  •  Implemented triangulation – this one is partial and is still broken. Post-GSoC I will work with my mentor Luke to wrap it up with tests.
  • Expanded the visuals library with PolygonVisual, EllipseVisual, RegularPolygonVisual
  • Added tests for each of the visuals and integrated them with the Color module by Eric
  • Modified Visuals to be reactive and added tests for the same
  • Soon to be completed – RectPolygon

Post-GSoC work:

  •  Write tests for RectPolygon
  • Wrap up triangulation
  • Expand on Cyrille’s GPU ray-tracer example –

That’s all Folks!

by mssurajkaiga at August 18, 2014 06:59 AM

Rajeev S
(GNU Mailman student)

GSoC(); exit(0);

The most exciting summer of my life has now come to an end. It's the GSoC deadline and I can't believe how fast the time flies.

My project has made significant progress and has covered almost all deliverables mentioned in my proposal. The Mailman CLI now has a pretty usable and useful command line interface.

The command line tools were completed before the mid-term evaluations. After the mid-term evaluations, most of the time was directed towards the Mailman shell. However, a few changes and improvements were made to the command line tools too. One of the major additions to the tools was the backup and restore functionality. An export-to-CSV feature is also included with all the display commands, a.k.a. show commands. Apart from these, a few general changes also affect the command tools, largely due to feedback from Steve. The major changes made in response to Steve's feedback include the following:

  • Refactoring the get_listing method, which was repeated in every class. The method was made a generic method and moved to the lib/utils. The class specific code in the method is handled by passing the attribute list to the method and the corresponding value was obtained by using the getattr function.
  • Code was refactored at many instances where unnecessary indent was created due to unwise use of conditionals and try-catch blocks.
  • The coloring part is made more configurable by setting the colors as module level constants and moving the colors to a separate file.
  • The connection verification was previously expensive, probably due to a costly database query; it was replaced by a quick, database-independent method.
  • The CLI configuration file was dropped as the details were already stored in Mailman config. 
  • The connection method is now unified and global, and reads the credentials from the mailman config. The connection verification part is now handled by this method
  • Exceptions were verified before reporting the error, which previously `assumed` the error message.
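The get_listing refactoring in the list above can be sketched as follows. This is only an illustration of the idea of passing an attribute list and reading values with getattr; the signature, the returned row format and the Domain and User records are assumptions, not the actual code in lib/utils.

```python
from collections import namedtuple


def get_listing(objects, attributes):
    """Build a table (header row plus one row per object) from an attribute list."""
    rows = [list(attributes)]
    for obj in objects:
        # The class-specific part is reduced to the attribute names passed in.
        rows.append([getattr(obj, name, None) for name in attributes])
    return rows


# Hypothetical stand-ins for the objects each command would list.
Domain = namedtuple('Domain', 'mail_host base_url')
User = namedtuple('User', 'display_name user_id')

domain_table = get_listing([Domain('example.org', 'http://example.org')],
                           ['mail_host', 'base_url'])
user_table = get_listing([User('Ann', 1)], ['display_name', 'user_id'])
print(domain_table)
print(user_table)
```

With a helper of this shape, each class only has to supply its own list of attribute names instead of repeating the listing loop.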
The next part of the project, which filled the post-mid-term phase, was the development of the command shell, which gives Mailman a custom shell. This part of the project was rewritten several times due to poor planning, but in the end the shell was built in the best way I could find. Initially the shell used bare string processing and array operations to parse the commands for arguments. This code looked ugly and was error-prone and inflexible. It was replaced by a better, but still not ideal, approach: writing a regular expression for each command. I completed this task and even built regexes for my most complex command, the update preference command. I discussed this approach on the Python IRC channel, asking for a good error reporting method for regex matching; my method was called a "recipe for failure" and I was advised to use a parser.

I began researching the various parser libraries available for Python and landed on the PLY module, which seemed interesting and simple to use. My prior experience with lex and YACC proved handy in building my parser. In a week's time, I was done porting my shell from regex to YACC. The error reporting is handled beautifully by PLY and, in addition, the command usage help string is printed in case of an error.

The filtering is handled by a separate class named Filter that supports various kinds of filters, such as equality, regular expression matching and list searching. The Filter class makes it easy to add new filters in the future.
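A filter class along these lines could look like the sketch below. Only the class name Filter and the three kinds of filters come from the description above; the operator spellings ('=', 'like', 'in'), the method names and the MailingList record are assumptions made for illustration.

```python
import re
from collections import namedtuple

# One entry per supported filter; adding a new kind of filter means adding an entry.
FILTER_OPS = {
    '=': lambda actual, wanted: str(actual) == wanted,                          # equality
    'like': lambda actual, wanted: re.search(wanted, str(actual)) is not None,  # regex match
    'in': lambda actual, wanted: wanted in actual,                              # list search
}


class Filter(object):
    """Keep only the objects whose attribute `key` satisfies `op` against `value`."""

    def __init__(self, key, op, value):
        if op not in FILTER_OPS:
            raise ValueError('unknown filter operator: %r' % (op,))
        self.key, self.op, self.value = key, op, value

    def matches(self, obj):
        return FILTER_OPS[self.op](getattr(obj, self.key), self.value)

    def apply(self, objects):
        return [obj for obj in objects if self.matches(obj)]


# Hypothetical record standing in for the objects the shell would filter.
MailingList = namedtuple('MailingList', 'fqdn_listname members')
lists = [
    MailingList('dev@example.org', ['ann@example.org', 'bob@example.org']),
    MailingList('users@example.org', ['cat@example.org']),
]

by_name = Filter('fqdn_listname', 'like', r'^dev@').apply(lists)
by_member = Filter('members', 'in', 'cat@example.org').apply(lists)
print(by_name)
print(by_member)
```

Keeping the operators in a single table is one way to get the extensibility mentioned above, since the Filter class itself does not change when a new operator is added.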

Unit tests were written for the work until now and they are executed using nosetests. The CLI is now install-able using the python dev-tools, ie, python install. The installation creates a new executable command named mmclient, which can be used to run the commands or the shell.

The final task was to make the docs better by making them Sphinx compatible. The task involved ordering and verifying the documentation. Inline code highlighting and the file hierarchy of the docs were fixed in the last revision, r72, made on 17/08/2014. The docs now look pretty decent and useful.

I would be continuing the development of the CLI by adding more features like PostgreSQL support for the backup and restore tool. Also the shell environment has some scope for improvement. Apart from that, the project is fully complete and functional.

Finally, I would like to express my heartfelt thanks to Google for hosting such a great event and giving students such a great opportunity to be a part of huge projects under the mentorship of great people who are geographically apart. I would like to thank my primary mentor, Stephen J Turnbull, who, in spite of his busy life, found room for my project right from the proposal period, scrutinized my code extensively and provided feedback on it. I would like to thank Abhilash Raj, with whom I regularly discussed my work, code and roadblocks, usually via IRC. I would like to thank Barry Warsaw, who shed light on what my project should look like from a user's perspective. He is one of the people who responded to almost every one of my mails to the mmdev mailing list. I would like to thank Terri Oda and Florian Fuchs, the org admin and co-admin of the Python Software Foundation for this year's Summer of Code, for their support and the great work they did in managing the PSF's Summer of Code.

by Rajeev S at August 18, 2014 06:26 AM

Elana Hashman
(OpenHatch student)

Final Blog Post and End of the Summer

Why not end the semester the same way we started it? Today we had another regular OpenHatch sprint!

End of Term Sprint

At today's sprint, we cleaned up a number of loose ends: all of my outstanding pull requests were merged and I polished up our deliverables in order to "put a bow on" our final product. I completed the usual chores of reviewing and merging others' pull requests.

I also had the pleasure of mentoring a new friend and colleague, Kristina Foster, who I met at a Women Who Code meetup! Kristina is generally awesome, and works as a designer for a local company. We spent some time getting her Windows laptop set up with the oh-mainline code, learned a bit of git, and by the end of the day, I helped walk her through making her first contribution to an open source project! I'm so happy that I was able to help her achieve this. She did so well!

I am really bad at using GIMP
Figure 1: Kristina's contribution


As we reached the end of the development period, Asheesh and I worked on the most advanced features of the project and wrapped up our final deliverables. I think I speak for both of us when I say I am very pleased with the final product!

You can try it out any time at the OpenHatch website—it's been deployed to production. You'll need an OpenHatch account to create a bug set, but anyone can view them.

gsoc14.13 and gsoc14.14 deliverables

I worked on a number of different features for these two weeks. I finished up the create and edit screens, upgraded my code to be compatible with django 1.5, and fixed the django-inplaceedit permissions, which now allow any public user to modify AnnotatedBug objects while denying access to the rest of the database. I also paywalled the create and edit screens, and added a creation process for AnnotatedBugs through django forms, which resulted in a bit of an accidental reinvention of the ModelForm wheel, and also allowed for integration with our main database, which makes the process "smart."

Our issues were migrated to GitHub during this time, so they all received new numbers.

gsoc14.15 deliverables

I spent this milestone working on the most advanced feature of the project: real-time updates for the list view screen. This involved writing some javascript that asynchronously updated all the editable fields on the page, such that if another user edits, the current user will be able to view their edit in near real-time. Thus, a number of users all viewing the same bugset will be able to concurrently view and edit it without confusion.

I also updated the main view with edit links when logged in, and a notice to log in order to create and edit sets when not logged in.

Figure 2: Edit links for authenticated users, such as testuser


Over the last blog post period, I've encountered new and exciting obstacles to conquer. The primary issues have been the django upgrade, final exams, and the GitHub issue migration.

Some time in the past month, we decided to upgrade django out of the stone ages, and thus, the django 1.5 migration was born. However, this broke a number of my tests and dependencies, and generally confused me. But with the support of my mentor, and a south upgrade, we were able to smoothly sail through these treacherous waters.

I also had final exams over the past week, which was the usual stressful fiasco that is. My marks come out tonight. I can't say I'm anticipating them with great joy. Of course, my CS 499R mark is great :)

The last obstacle was the GitHub issues migration. I had some objections to the migration, both functional and ideological, but I don't have a better solution so I've accepted this outcome. I had a difficult time finding my new, renumbered issues and reassigning them, navigating the (imho) overly minimal tracker model, and generally getting used to the new system.

In the end, all was overcome. The project must go on. And so it has.

Extending the project

As this is my final blog post, I wanted to document the outstanding issues with my newly developed application, as well as some areas for future extension. It is likely that I will have a chance to work on these things in the near future, though they are not in the scope of my Google Summer of Code project.

Remaining issues

Future extensions/to-dos

  • Filtering by status (unclaimed, claims, needs-review, resolved)
  • We currently pull some metadata from the main OpenHatch bug database to create AnnotatedBug objects. It would be nice to be able to refresh this metadata, perhaps by clicking a button.
  • Field testing
  • User guide and additional documentation

by Elana Hashman at August 18, 2014 03:00 AM

Vytautas Jančauskas
(TARDIS-SN student)

GSOC 2014 - final summary

To recap: this year's GSoC was a complete success. We successfully rewrote TARDIS's Monte Carlo routines in C, gaining approximately a 25% performance increase. Furthermore, the core routines were significantly restructured, using good structured programming principles as opposed to the wall of code they were before. Hopefully this will increase maintainability and simplify both future debugging and profiling. This was not an insignificant job, since the original Cython code was well over a thousand lines. A lot of effort has gone into breaking up the code into its logical parts and assembling relevant data into structures. In the process, I have not only further developed my C and Cython programming skills but also learned about TARDIS, its inner workings and the physics behind it. Hopefully my efforts will benefit anyone working with it in the future.

by Vytautas Jančauskas at August 18, 2014 12:59 AM

August 17, 2014

Gaurav Trivedi
(Kivy student)

Kivy wrap for the summer

As I conclude my summer work on Kivy and Plyer, here’s a post to summarize all the contributions I have made. It would also be useful to start from here when I wish to revisit any of this in future.

To draw a comparison to the current state of Plyer development, this table shows a list of supported facades before the summer started:

Platform Android < 4.0 Android > 4.0 iOS Windows OSX Linux
Accelerometer X X X
Camera (taking picture) X X
Notifications X X X X X
Text to speech X X X X X
Email (open mail client) X

If you have been following the updates, you would have come across my weekly progress posts over the last couple of months. Here’s a list of all such posts since mid-summer for easy access (also check out my mid-summer summary post):

  1. I can haz commit access and other updates
  2. Maintenance work in progress
  3. Plyer on iOS
  4. More, more facades

And in comparison to the table above, this is what Plyer support looks like as of today after all these changes:

Platform Android < 4.0 Android > 4.0 iOS Windows OSX Linux
Accelerometer X X X X X
Camera (taking picture) X X
Notifications X X X X X
Text to speech X X X X X X
Email (open mail client) X X X X X
Vibrator X
Sms (send messages) X X
Compass X X X
Unique ID (IMEI or SN) X X X X X X
Gyroscope X X X
Battery X X X X X X

Of course there’s more than meets the eye. A lot of background work went into writing these facades. This included understanding the individual platform APIs and working with the other Kivy projects that support this work, Pyjnius and Pyobjus. Some of these changes called for a rewrite of old facades in order to follow a consistent approach. Since Plyer is at an early stage of development, I also contributed some maintenance code and wrote build scripts.

In the beginning of August, I took a break from facade development for two weeks and made recommendations on making Kivy apps more accessible. I looked into existing projects that could be useful for us and pointed at a possible candidate that we could adapt for our purposes. Here are the two posts summarizing my investigations:

  1. Towards Making Kivy Apps Accessible
  2. Towards Making Kivy Apps Accessible – 2

At this point, I would also like to include a thank you note to everyone on #kivy and #plyer on freenode for helping me out whenever I got stuck. This was the first time I actively participated in IRC discussions over an extended period. I also tried to return the favor by offering help, when I could, to other new users. Apart from getting a chance to work with the Kivy community from all around the world (with so many timezones!), there were a couple of other firsts as well that I experienced while working on the project. Those served as good learning experiences and a motivation for making contributions to open source.

Overall, it was quite a fun experience contributing to Kivy over the summer, and I hope to continue doing so every now and then. Now that Kivy is gaining more popularity every day, I hope to see many more users diving into writing code for it and becoming a part of this community. I hope these posts can also serve to point them to relevant development opportunities.

by gtrivedi at August 17, 2014 10:57 PM

Brigitta Sipőcz
(Astropy student)

Final summary

It's strange that the summer is already over, and so is GSoC.

Read more »

by Brigitta Sipocz at August 17, 2014 10:56 PM

Vighnesh Birodkar
(scikit-image student)

GSoC 2014 – Signing off

This year's GSoC coding period has nearly come to an end. This post aims to briefly summarize everything that happened during the last three months. My task was to implement Region Adjacency Graph based segmentation algorithms for scikit-image. This post provides a good explanation of them. Below I list my major contributions.


Region Adjacency Graphs

Fixing the API for RAGs was very important, since it was going to directly affect everything else that followed. After a long discussion and some benchmarks, we finally decided to have NetworkX as a dependency. This helped a lot, since many graph algorithms were already implemented for me. The file implements the RAG class and the RAG construction methods. I also implemented threshold_cut, a function which segments images by simply thresholding edge weights. To know more, you can visit RAG Introduction.
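To make "thresholding edge weights" concrete, here is a minimal sketch in plain Python. It is not the scikit-image code: the function name, the edge-list format and the union-find bookkeeping are my own assumptions for illustration. Regions joined by an edge lighter than the threshold end up with the same label.

```python
def threshold_cut_sketch(n_regions, edges, thresh):
    """Merge regions connected by edges whose weight is below thresh.

    edges is a list of (region_a, region_b, weight) tuples (a hypothetical
    format chosen for this sketch). Returns one label per region.
    """
    parent = list(range(n_regions))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, weight in edges:
        if weight < thresh:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller label as the representative.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_regions)]


# Four regions in a chain; only the middle edge is heavy.
labels = threshold_cut_sketch(4, [(0, 1, 2.0), (1, 2, 30.0), (2, 3, 1.5)], 10)
print(labels)
```

With a threshold of 10, the two light edges are merged and the heavy middle edge keeps the two halves apart.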

Normalized Cut

The function cut_normalized implements the Normalized Cut algorithm for RAGs. You can visit Normalized Cut on RAGs to know more. See the videos at the end to get a quick idea of how NCut works. Also see A closer look at NCut, where I have benchmarked the function and indicated bottlenecks.

Drawing Region Adjacency Graphs

In my posts, I had been using a small piece of code I had written to display RAGs. This Pull Request implements the same functionality for scikit-image. This would be immensely useful for anyone who is experimenting with RAGs. For a more detailed explanation, check out Drawing RAGs.

Hierarchical Merging of Region Adjacency Graphs

This Pull Request implements a simple form of Hierarchical Merging. For more details, see Hierarchical Merging of Region Adjacency Graphs. This post also contains videos at the end; do check them out. This can also be easily extended to a boundary map based approach, which I plan to do post-GSoC.


Final Comments

The most important thing for me is that I am a better Python programmer than I was before GSoC began this year. I was able to see how some graph based segmentation methods work at their most basic level. Although GSoC has come to an end, I don’t think my contributions to scikit-image have. Contributing to it has been a tremendous learning experience and I plan to continue doing so. I have been fascinated with Image Processing since my friends and I wrote an unholy piece of Matlab code about 3 years ago to achieve this. And as far as I can see, it’s a fascination I will have for the rest of my life.

Finally, I would like to thank my mentors Juan, Johannes Schönberger and Guillaume Gay. I would also like to thank Stefan for reviewing my Pull Requests.



by Vighnesh Birodkar at August 17, 2014 07:15 PM

(Statsmodels student)

State space modeling in Python

A baseline version of state space models in Python is now ready as a pull request to the Statsmodels project. Before describing it, here is the general form of a state space model (see Durbin and Koopman 2012 for all notation):

$$ \begin{align} y_t & = d_t + Z_t \alpha_t + \varepsilon_t \qquad & \varepsilon_t \sim N(0, H_t)\\ \alpha_{t+1} & = c_t + T_t \alpha_t + R_t \eta_t & \eta_t \sim N(0, Q_t) \end{align} $$

Integrating state space modeling into Python required three elements (so far):

  1. An implementation of the Kalman filter
  2. A Python wrapper for easily building State space models to be filtered
  3. A Python wrapper for Maximum Likelihood estimation of state space models based on the likelihood evaluation performed as a byproduct of the Kalman filter.

These three are implemented in separate files in the pull request. The first is a Cython implementation of the Kalman filter, which does all of the heavy lifting. By taking advantage of static typing, compilation to C, and direct calls to the underlying BLAS and LAPACK libraries, it achieves speeds that are an order of magnitude above a straightforward implementation of the Kalman filter in Python (at least in the test cases I have performed so far).
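For readers who have not seen the recursions, here is roughly what a "straightforward implementation in Python" looks like in the simplest scalar case. This is a sketch, not code from the pull request: it assumes a local level model, i.e. $Z_t = T_t = R_t = 1$, $d_t = c_t = 0$, $H_t = \sigma_\varepsilon^2$ and $Q_t = \sigma_\eta^2$ in the notation above, and the function and argument names are made up for the example.

```python
import math


def local_level_filter(y, sigma2_eps, sigma2_eta, a1=0.0, p1=1e7):
    """Scalar Kalman filter for y_t = alpha_t + eps_t, alpha_{t+1} = alpha_t + eta_t.

    a1 and p1 are the mean and variance of the initial state; a large p1
    is a common stand-in for a diffuse prior. Returns the Gaussian
    log-likelihood and the one-step-ahead state predictions.
    """
    a, p = a1, p1
    loglike = 0.0
    predictions = []
    for obs in y:
        predictions.append(a)
        v = obs - a                # forecast error
        f = p + sigma2_eps         # forecast error variance
        k = p / f                  # Kalman gain
        loglike += -0.5 * (math.log(2.0 * math.pi * f) + v * v / f)
        a = a + k * v              # predicted state for the next period
        p = p * (1.0 - k) + sigma2_eta
    return loglike, predictions


loglike, predictions = local_level_filter([1.0, 1.2, 0.9, 1.4], 0.5, 0.1)
print(loglike, predictions)
```

The log-likelihood accumulated inside the loop is the "byproduct" that the third element uses for maximum likelihood estimation. The matrix version replaces the scalar divisions with linear solves, which is where the BLAS and LAPACK calls come in.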

The second handles setting and updating the state space representation matrices ($d_t, Z_t, H_t, c_t, T_t, R_t, Q_t$) , maintaining appropriate dimensions, and making sure that underlying datatypes are consistent (a requirement for the Cython Kalman filter - if the wrong datatype is passed directly to one of the underlying BLAS or LAPACK functions, it will cause an error in the best case and a segmentation fault in the worst).

The third introduces the idea that the representation matrices are composed of two types of elements: those that are known (often ones and zeros) and those that are unknown: parameters. It defines an interface for retrieving start parameters, transforming parameters (for example to induce stationarity in ARIMA models), and updating the corresponding elements of the state space matrices. Finally, it takes advantage of the updating structure to define a fit method which uses scipy.optimize methods to perform maximum likelihood estimation.
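The transform-then-optimize idea can be shown without any of the state space machinery. The sketch below is only an illustration under my own assumptions: it estimates a single variance for zero-mean Gaussian data, squares the unconstrained value to keep the variance positive, and uses a crude pattern search in place of the scipy.optimize routines that the real fit method relies on. All names are hypothetical.

```python
import math


def transform(unconstrained):
    # Optimizer scale -> model scale: squaring keeps variances positive.
    return [u * u for u in unconstrained]


def untransform(constrained):
    # Model scale -> optimizer scale.
    return [math.sqrt(c) for c in constrained]


def neg_loglike(variance, data):
    # Negative log-likelihood of zero-mean Gaussian data.
    n = len(data)
    return 0.5 * (n * math.log(2.0 * math.pi * variance)
                  + sum(x * x for x in data) / variance)


def fit_variance(data, start=0.1, step=0.5, iters=200):
    """Minimize neg_loglike over an unconstrained scalar by pattern search."""
    u = untransform([start])[0]
    best = neg_loglike(transform([u])[0], data)
    for _ in range(iters):
        improved = False
        for cand in (u + step, u - step):
            variance = transform([cand])[0]
            if variance == 0.0:
                continue  # the log-likelihood is undefined at zero variance
            value = neg_loglike(variance, data)
            if value < best:
                u, best, improved = cand, value, True
        if not improved:
            step /= 2.0
    return transform([u])[0]


data = [1.0, -2.0, 0.5, 1.5]
print(fit_variance(data))  # close to the mean of squares, 1.875
```

The optimizer is free to wander over the whole real line, yet the model only ever sees positive variances.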


The pull request contains, right now, one example of a fully-fledged econometric model estimatable via state space methods. The Seasonal Autoregressive Integrated Moving Average with eXogenous regressors model is implemented in the file. The bulk of the file is in describing the specific form of the state space matrices for the SARIMAX model, defining methods for finding good starting parameters, and updating the matrices appropriately when new parameters are tried.

In descending from the Model class (3, above), it is able to ignore the intricacies of the actual optimization calls and the construction of standard estimation output like the variance / covariance matrix, summary tables, etc.

In descending from the Representation class (2, above), it directly has a filter method to apply the Kalman filter, it is able to ignore worries about dimensions and datatypes, and it gets all of the filter output "for free". For example, the loglikelihood, residuals, and fitted values come directly from output from the filter. Finally, all prediction, dynamic prediction, and forecasting are performed in the generic representation results class and can be painlessly used by the SARIMAX model.

Two example notebooks using the resultant SARIMAX class:

Extensibility: Local Linear Trend Model

Of course, Statsmodels already has an ARIMAX class, so the marginal contribution of the SARIMAX model is mostly the ability to work with Seasonal (or arbitrary lag polynomial) models, and the ability to work with missing values. However it is just an example of the kind of models that can be easily produced from the given framework, which was specifically designed to be extensible.

As an example, I will present below the code for a full implementation of the Local Linear Trend model. This model has the form (see Durbin and Koopman 2012, Chapter 3.2 for all notation and details):

$$ \begin{align} y_t & = \mu_t + \varepsilon_t \qquad & \varepsilon_t \sim N(0, \sigma_\varepsilon^2) \\ \mu_{t+1} & = \mu_t + \nu_t + \xi_t & \xi_t \sim N(0, \sigma_\xi^2) \\ \nu_{t+1} & = \nu_t + \zeta_t & \zeta_t \sim N(0, \sigma_\zeta^2) \end{align} $$

It is easy to see that this can be cast into state space form as:

$$ \begin{align} y_t & = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} \mu_t \\ \nu_t \end{pmatrix} + \varepsilon_t \\ \begin{pmatrix} \mu_{t+1} \\ \nu_{t+1} \end{pmatrix} & = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{pmatrix} \mu_t \\ \nu_t \end{pmatrix} + \begin{pmatrix} \xi_t \\ \zeta_t \end{pmatrix} \end{align} $$
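As a quick sanity check of this casting, the snippet below steps the model once using the scalar equations and once using the matrix form and compares the results. It is a plain-Python sketch with made-up numbers; the function names are mine, not part of the pull request.

```python
def step_scalar(mu, nu, xi, zeta):
    # mu_{t+1} = mu_t + nu_t + xi_t and nu_{t+1} = nu_t + zeta_t
    return mu + nu + xi, nu + zeta


def step_matrix(state, shock):
    # alpha_{t+1} = T alpha_t + shock, with T = [[1, 1], [0, 1]]
    T = [[1.0, 1.0], [0.0, 1.0]]
    return [sum(T[i][j] * state[j] for j in range(2)) + shock[i]
            for i in range(2)]


def observe(state, eps):
    # y_t = Z alpha_t + eps_t, with Z = (1 0)
    Z = [1.0, 0.0]
    return sum(z * s for z, s in zip(Z, state)) + eps


mu, nu = 1.0, 0.5
xi, zeta = 0.25, -0.125
print(step_scalar(mu, nu, xi, zeta))      # (1.75, 0.375)
print(step_matrix([mu, nu], [xi, zeta]))  # [1.75, 0.375]
print(observe([mu, nu], 0.0))             # 1.0
```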

Notice that much of the state space representation is composed of known values; in fact the only parts in which parameters to be estimated appear are in the variance / covariance matrices:

$$ \begin{align} H_t & = \begin{bmatrix} \sigma_\varepsilon^2 \end{bmatrix} \\ Q_t & = \begin{bmatrix} \sigma_\xi^2 & 0 \\ 0 & \sigma_\zeta^2 \end{bmatrix} \end{align} $$
In [2]:
"""
Univariate Local Linear Trend Model

Author: Chad Fulton
License: Simplified-BSD
"""
from __future__ import division, absolute_import, print_function

import numpy as np
from sm.tsa.statespace.model import Model, StatespaceResults

class LocalLinearTrend(Model):
    def __init__(self, endog, *args, **kwargs):
        # Model order
        k_states = k_posdef = 2

        # Initialize the statespace
        super(LocalLinearTrend, self).__init__(
            endog, k_states=k_states, k_posdef=k_posdef, *args, **kwargs
        )

        # Initialize the matrices = np.r_[1, 0]
        self.transition = np.array([[1, 1],
                                    [0, 1]])
        self.selection = np.eye(k_states)

        # Initialize the state space model as approximately diffuse
        # Because of the diffuse initialization, burn first two
        # loglikelihoods
        self.loglikelihood_burn = 2
        # Cache some indices
        self._obs_cov_idx = np.diag_indices(k_posdef)

    _latex_names = ['$\\sigma_\\varepsilon^2$', '$\\sigma_\\xi^2$', '$\\sigma_\\zeta^2$']
    _names = ['sigma2.measurement', 'sigma2.level', 'sigma2.trend']
    def _get_model_names(self, latex=False):
        return self._latex_names if latex else self._names

    def start_params(self):
        # Simple start parameters: just set as 0.1
        return np.r_[0.1, 0.1, 0.1]

    def transform_params(self, unconstrained):
        # Parameters must all be positive for likelihood evaluation.
        # This transforms parameters from unconstrained parameters
        # returned by the optimizer to ones that can be used in the model.
        return unconstrained**2

    def untransform_params(self, constrained):
        # This transforms parameters from constrained parameters used
        # in the model to those used by the optimizer
        return constrained**0.5

    def update(self, params, *args, **kwargs):
        # The base Model class performs some nice things like
        # transforming the params and saving them
        params = super(LocalLinearTrend, self).update(params, *args, **kwargs)

        # Extract the parameters
        measurement_variance = params[0]
        level_variance = params[1]
        trend_variance = params[2]

        # Observation covariance
        obs_cov = self.obs_cov.real.astype(params.dtype)
        obs_cov[0] = measurement_variance
        self.obs_cov = obs_cov

        # State covariance
        state_cov = self.state_cov.real.astype(params.dtype)
        state_cov[self._obs_cov_idx] = np.array(
            [level_variance, trend_variance]
        ).reshape((2, 1))
        self.state_cov = state_cov

Using this simple model, we can estimate the parameters of a local linear trend model. The following example is from Commandeur and Koopman (2007), Section 3.4, modeling motor vehicle fatalities in Finland.

In [5]:
import pandas as pd
from datetime import datetime
# Load Dataset
df = pd.read_table('ck_data/NorwayFinland.txt', skiprows=1, header=None)
df.columns = ['date', 'nf', 'ff']
df.index = pd.date_range(start=datetime(df.iloc[0, 0], 1, 1), end=datetime(df.iloc[-1, 0], 1, 1), freq='AS')

# Log transform
df['lff'] = np.log(df['ff'])

# Setup the model
mod = LocalLinearTrend(df['lff'])

# Fit it using MLE (recall that we are fitting the three variance parameters)
res =
                           Statespace Model Results                           
Dep. Variable:                    lff   No. Observations:                   34
Model:               LocalLinearTrend   Log Likelihood                  26.740
Date:                Tue, 19 Aug 2014   AIC                            -47.480
Time:                        14:34:41   BIC                            -42.901
Sample:                    01-01-1970   HQIC                           -45.919
                         - 01-01-2003                                         
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
sigma2.measurement     0.0032      0.001      3.033      0.002         0.001     0.005
sigma2.level        1.079e-11   9.83e-07    1.1e-05      1.000     -1.93e-06  1.93e-06
sigma2.trend           0.0015      0.001      1.762      0.078        -0.000     0.003
In [26]:
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,4))

# Perform dynamic prediction and forecasting
ndynamic = 20
predict, cov, ci, idx = res.predict(alpha=0.05, dynamic=df['lff'].shape[0]-ndynamic)

# Plot the results
ax.plot(df.index, df['lff'], 'k.', label='Observations');
ax.plot(idx[:-ndynamic], predict[0][:-ndynamic], label='One-step-ahead Prediction');
ax.plot(idx[:-ndynamic], ci[0, :-ndynamic], 'k--', alpha=0.5);

ax.plot(idx[-ndynamic:], predict[0][-ndynamic:], 'r', label='Dynamic Prediction');
ax.plot(idx[-ndynamic:], ci[0, -ndynamic:], 'k--', alpha=0.5);

# Cleanup the image
ax.set_ylim((5.5, 8));
legend = ax.legend(loc='upper right');


Commandeur, Jacques J. F., and Siem Jan Koopman. 2007.
An Introduction to State Space Time Series Analysis.
Oxford ; New York: Oxford University Press.

Durbin, James, and Siem Jan Koopman. 2012.
Time Series Analysis by State Space Methods: Second Edition.
Oxford University Press.

by Chad Fulton at August 17, 2014 06:58 PM

Vighnesh Birodkar
(scikit-image student)

Hierarchical Merging of Region Adjacency Graphs

Region Adjacency Graphs model regions in an image as nodes of a graph, with edges between adjacent regions. Superpixel methods tend to over-segment images, i.e., divide them into more regions than necessary. Performing a Normalized Cut and thresholding edge weights are two ways of extracting a better segmentation out of this. What if we could combine two small regions into a bigger one? If we keep combining small similar regions into bigger ones, we will end up with bigger regions which are significantly different from their adjacent ones. Hierarchical Merging explores this possibility. The current working code can be found at this Pull Request.
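As a reminder of what the graph looks like, here is a minimal sketch that builds the edge set of a RAG from a small label image. It uses nested lists and 4-connectivity, both of which are simplifying assumptions on my part; the scikit-image implementation works on NumPy arrays and also attaches weights to the edges.

```python
def build_rag_edges(labels):
    """Return the set of (a, b) pairs, a < b, of labels that touch.

    labels is a 2D list of integer region labels. Two regions are
    adjacent when they share a horizontal or vertical pixel border.
    """
    rows, cols = len(labels), len(labels[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            # Look at the right and bottom neighbors only, so each
            # pixel pair is visited exactly once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = labels[r][c], labels[rr][cc]
                    if a != b:
                        edges.add((min(a, b), max(a, b)))
    return edges


labels = [[0, 0, 1],
          [0, 2, 1],
          [2, 2, 1]]
print(sorted(build_rag_edges(labels)))  # [(0, 1), (0, 2), (1, 2)]
```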

Code Example

The merge_hierarchical function performs hierarchical merging on a RAG. It picks the edge with the smallest weight and combines the regions connected by it. The new region is adjacent to all previous neighbors of the two combined regions. The weights are updated accordingly. It continues doing so until the minimum edge weight in the graph is more than the supplied thresh value. The function takes a RAG as input where smaller edge weights imply similar regions. Therefore, we use the rag_mean_color function with the default "distance" mode for RAG construction. Here is a minimal code snippet.

from skimage import graph, data, io, segmentation, color

img =
labels = segmentation.slic(img, compactness=30, n_segments=400)
g = graph.rag_mean_color(img, labels)
labels2 = graph.merge_hierarchical(labels, g, 40)
g2 = graph.rag_mean_color(img, labels2)

out = color.label2rgb(labels2, img, kind='avg')
out = segmentation.mark_boundaries(out, labels2, (0, 0, 0))

I arrived at the threshold 40 after some trial and error. Here is the output.


The drawback here is that the right value for the thresh argument can vary significantly from image to image.
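The merging loop itself can be sketched in a few lines of plain Python. This is a toy version under stated assumptions, not the code in the Pull Request: each region is described by a scalar mean intensity and a pixel count, the weight of an edge is the absolute difference of the two means, and a merged region gets the size-weighted mean of its parts. The function name and data layout are hypothetical.

```python
def merge_hierarchical_sketch(means, sizes, edges, thresh):
    """Repeatedly merge the two regions joined by the lightest edge.

    means, sizes: dicts mapping region label -> mean intensity / pixel count.
    edges: iterable of (a, b) label pairs.
    Returns a dict mapping every original label to its final region.
    """
    means, sizes = dict(means), dict(sizes)
    adjacent = dict((r, set()) for r in means)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    final = dict((r, r) for r in means)

    while True:
        # Find the edge with the smallest weight.
        best = None
        for a in adjacent:
            for b in adjacent[a]:
                if a < b:
                    weight = abs(means[a] - means[b])
                    if best is None or weight < best[0]:
                        best = (weight, a, b)
        # Stop once the lightest edge is heavier than the threshold.
        if best is None or best[0] > thresh:
            break

        _, a, b = best
        # Merge b into a and update the mean, hence all weights around a.
        total = sizes[a] + sizes[b]
        means[a] = (means[a] * sizes[a] + means[b] * sizes[b]) / total
        sizes[a] = total
        for n in adjacent.pop(b):
            adjacent[n].discard(b)
            if n != a:
                adjacent[n].add(a)
                adjacent[a].add(n)
        del means[b], sizes[b]
        for region in final:
            if final[region] == b:
                final[region] = a
    return final


means = {0: 10.0, 1: 12.0, 2: 50.0, 3: 53.0}
sizes = {0: 1, 1: 1, 2: 1, 3: 1}
print(merge_hierarchical_sketch(means, sizes, [(0, 1), (1, 2), (2, 3)], 5))
```

In this example the two dark regions merge, the two bright regions merge, and the remaining edge between them is far heavier than the threshold, so the loop stops with two regions.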

Comparison with Normalized Cut

Loosely speaking, the Normalized Cut follows a top-down approach, whereas hierarchical merging follows a bottom-up approach. Normalized Cut starts with the graph as a whole and breaks it down into smaller parts. Hierarchical merging, on the other hand, starts with individual regions and merges them into bigger ones until a criterion is met. The Normalized Cut, however, is much more robust and requires little tuning of its parameters as images change. Hierarchical merging is a lot faster, even though most of its computation logic is written in Python.

Effect of change in threshold

Setting a very low threshold will not merge any regions and will give us back the original image. A very large threshold, on the other hand, would merge all regions and return the image as just one big blob. The effect is illustrated below.











Hierarchical Merging in Action

With this modification, the following code can output all the intermediate segmentations produced during each iteration.

from skimage import graph, data, io, segmentation, color
import time
from matplotlib import pyplot as plt

img =
labels = segmentation.slic(img, compactness=30, n_segments=400)
g = graph.rag_mean_color(img, labels)
labels2 = graph.merge_hierarchical(labels, g, 60)

c = 0

out = color.label2rgb(graph.graph_merge.seg_list[-10], img, kind='avg')
for label in graph.graph_merge.seg_list:
    out = color.label2rgb(label, img, kind='avg')
    out = segmentation.mark_boundaries(out, label, (0, 0, 0))
    io.imsave('/home/vighnesh/Desktop/agg/' + str(c) + '.png', out)
    c += 1

I then used avconv -f image2 -r 3 -i %d.png -r 20 car.mp4 to output a video. Below are a few examples.

In each of these videos, at every frame, a boundary disappears. This means that the two regions separated by that boundary are merged. The frame rate is 5 FPS, so more than one region might be merged at a time.

Coffee Image


Car Image


Baseball Image


by Vighnesh Birodkar at August 17, 2014 05:52 PM

(scikit-learn student)

GSoC 2014 Final summary

I posted the final summary of my work in GSoC 2014 in my other blog :

Thank you.

by Issam Laradji at August 17, 2014 01:31 PM

Saimadhav A Heblikar
(Core Python student)

GSoC 2014 Summary blog post

Three months of coding, Python, code reviews, IRC meetings, emails sent back and forth, filing bug reports, getting bug reports, submitting patches and I could just go on and on! The past three months have been the most exciting and productive three months, in terms of programming, EVER! GSoC 2014 has been an eye-opening experience for me. I can finally claim to understand *some* of the internal workings of Python.

Read more »

by Saimadhav Heblikar at August 17, 2014 07:08 AM

(scikit-learn student)

Performance comparison among LSH Forest, ANNOY and FLANN

Finally, it is time to compare the performance of Locality Sensitive Hashing Forest (the approximate nearest neighbor search implementation in scikit-learn), Spotify Annoy and FLANN.


Synthetic datasets of different sizes (varying n_samples and n_features) are used for this evaluation. For each dataset, the following measures were calculated.

  1. Index building time of each ANN (Approximate Nearest Neighbor) implementation.
  2. Accuracy of nearest neighbors queries with their query times.

Python code used for this evaluation can be found in this Gist. Parameters of LSHForest (n_estimators=10 and n_candidates=50) are kept fixed during this experiment. Accuracies can be raised by tuning these parameters.
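The Gist is not reproduced here, so as a rough guide, accuracy in this kind of evaluation can be computed as the fraction of the true nearest neighbors that the approximate method also returned. The sketch below shows that calculation in plain Python. The helper names and the use of Euclidean distance for the brute-force reference are assumptions made for the example.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def exact_neighbors(data, query, k):
    """Indices of the k points closest to query, found by brute force."""
    order = sorted(range(len(data)), key=lambda i: euclidean(query, data[i]))
    return order[:k]


def accuracy(approx, exact):
    """Fraction of the exact neighbors that the approximate result contains."""
    return len(set(approx) & set(exact)) / float(len(exact))


data = [[0, 0], [1, 0], [0, 2], [5, 5]]
truth = exact_neighbors(data, [0, 0], 2)
print(truth)                    # [0, 1]
print(accuracy([0, 3], truth))  # 0.5
```

Averaging this ratio over many random queries gives one accuracy figure per dataset and per implementation.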


For each dataset, two graphs have been plotted according to the measures described in the section above. The datasets are:

  • n_samples=1000, n_features=100
  • n_samples=1000, n_features=500
  • n_samples=6000, n_features=3000
  • n_samples=10000, n_features=100
  • n_samples=10000, n_features=500
  • n_samples=10000, n_features=1000
  • n_samples=10000, n_features=6000
  • n_samples=50000, n_features=5000
  • n_samples=100000, n_features=1000

It is evident that the index building times of LSH Forest and FLANN are almost incomparable to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy by large margins with respect to accuracy and query speed. These graphs suggest that LSH Forest is competitive with FLANN for large datasets.

August 17, 2014 12:00 AM

A demonstration of the usage of Locality Sensitive Hashing Forest in approximate nearest neighbor search

This is a demonstration of how to use the approximate nearest neighbor search implementation based on locality sensitive hashing in scikit-learn, and an illustration of how nearest neighbor queries behave as the parameters vary. This implementation has an API which is essentially the same as that of the NearestNeighbors module, since approximate nearest neighbor search is used to speed up queries at the cost of accuracy when the database is very large.

Before beginning the demonstration, background has to be set. First, the required modules are loaded and a synthetic dataset is created for testing.

import time
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import LSHForest
from sklearn.neighbors import NearestNeighbors

# Initialize size of the database, iterations and required neighbors.
n_samples = 10000
n_features = 100
n_iter = 30
n_neighbors = 100
rng = np.random.RandomState(42)

# Generate sample data
X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                  centers=10, cluster_std=5, random_state=0)

There are two main parameters which affect queries in the LSH Forest implementation.

  1. n_estimators : Number of trees in the LSH Forest.
  2. n_candidates : Number of candidates chosen from each tree for distance calculation.
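To give a feel for what these two parameters control, here is a toy forest in plain Python. It is not the scikit-learn implementation: it assumes random-hyperplane hashes and cosine distance, and it ranks entries by a linear scan over prefix matches instead of searching sorted arrays. The class name, n_bits and seed are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def common_prefix(a, b):
    # Length of the shared leading part of two bit strings.
    n = 0
    for bit_a, bit_b in zip(a, b):
        if bit_a != bit_b:
            break
        n += 1
    return n


class ToyLSHForest(object):
    def __init__(self, n_estimators=10, n_candidates=50, n_bits=16, seed=0):
        self.n_estimators = n_estimators
        self.n_candidates = n_candidates
        self.n_bits = n_bits
        self.rng = random.Random(seed)

    def _hash(self, planes, x):
        # One bit per hyperplane: which side of the plane x falls on.
        return "".join(
            "1" if sum(p * v for p, v in zip(plane, x)) >= 0 else "0"
            for plane in planes)

    def fit(self, X):
        self.X = X
        n_features = len(X[0])
        # Each "tree" has its own set of random hyperplanes.
        self.planes = [[[self.rng.gauss(0, 1) for _ in range(n_features)]
                        for _ in range(self.n_bits)]
                       for _ in range(self.n_estimators)]
        self.tables = [[self._hash(planes, x) for x in X]
                       for planes in self.planes]
        return self

    def kneighbors(self, query, n_neighbors=1):
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            q = self._hash(planes, query)
            # Longest shared hash prefix first; keep n_candidates per tree.
            ranked = sorted(range(len(table)),
                            key=lambda i: -common_prefix(q, table[i]))
            candidates.update(ranked[:self.n_candidates])
        # Only the pooled candidates get a real distance calculation.
        ranked = sorted(candidates,
                        key=lambda i: cosine_distance(query, self.X[i]))
        return ranked[:n_neighbors]


rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
forest = ToyLSHForest(n_estimators=3, n_candidates=5).fit(X)
print(forest.kneighbors(X[7], n_neighbors=3))
```

More trees and more candidates per tree both enlarge the pool that receives exact distance calculations, which raises accuracy and query time together.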

In the first experiment, average accuracies are measured as the value of n_estimators varies. n_candidates is kept fixed. sklearn.neighbors.NearestNeighbors is used to obtain the true neighbors so that the returned approximate neighbors can be compared against them.

# Set `n_estimators` values
n_estimators_values = np.linspace(1, 30, 5).astype(
accuracies_trees = np.zeros(n_estimators_values.shape[0], dtype=float)

# Calculate average accuracy for each value of `n_estimators`
for i, n_estimators in enumerate(n_estimators_values):
    lshf = LSHForest(n_candidates=500, n_estimators=n_estimators,
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
    for j in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        neighbors_approx = lshf.kneighbors(query, return_distance=False)
        neighbors_exact = nbrs.kneighbors(query, return_distance=False)

        intersection = np.intersect1d(neighbors_approx,
        ratio = intersection/float(n_neighbors)
        accuracies_trees[i] += ratio

    accuracies_trees[i] = accuracies_trees[i]/float(n_iter)

Similarly, average accuracy vs n_candidates is also measured.

# Set `n_candidate` values
n_candidates_values = np.linspace(10, 500, 5).astype(
accuracies_c = np.zeros(n_candidates_values.shape[0], dtype=float)

# Calculate average accuracy for each value of `n_candidates`
for i, n_candidates in enumerate(n_candidates_values):
    lshf = LSHForest(n_candidates=n_candidates, n_neighbors=n_neighbors)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
    # Fit the Nearest neighbor models
    for j in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        # Get neighbors
        neighbors_approx = lshf.kneighbors(query, return_distance=False)
        neighbors_exact = nbrs.kneighbors(query, return_distance=False)

        intersection = np.intersect1d(neighbors_approx,
        ratio = intersection/float(n_neighbors)
        accuracies_c[i] += ratio

    accuracies_c[i] = accuracies_c[i]/float(n_iter)

You can get a clear view of the behavior of queries from the following plots.

Figure: accuracies_c_l

The next experiment demonstrates the behavior of queries for different database sizes (n_samples).

# Initialize the range of `n_samples`
n_samples_values = [10, 100, 1000, 10000, 100000]
average_times = []
# Calculate the average query time
for n_samples in n_samples_values:
    X, labels_true = make_blobs(n_samples=n_samples, n_features=10,
                                centers=10, cluster_std=5,
    # Initialize LSHForest for queries of a single neighbor
    lshf = LSHForest(n_candidates=1000, n_neighbors=1)

    average_time = 0

    for i in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        t0 = time.time()
        approx_neighbors = lshf.kneighbors(query,
        T = time.time() - t0
        average_time = average_time + T

    average_time = average_time/float(n_iter)

The n_samples space is defined as [10, 100, 1000, 10000, 100000]. The query time for a single neighbor is measured for these different values of n_samples.

Figure: query_time_vs_n_samples

August 17, 2014 12:00 AM

August 16, 2014

Asish Panda
(SunPy student)

Final Summary

The end is here! After three months of thinking and coding, it has finally come to an end. I must say this was the best thing I have ever done in terms of programming. Though I don’t plan to leave it at the best thing for long! :P
My project involved integrating astropy into sunpy. Though I can’t say I achieved what I hoped I would, I created three PRs: one is merged, another is pending, and the third requires me to work with my fellow GSoC student in a common branch (which involves both of our work). I plan to work with SunPy further to make any suitable changes.

1) Units

2) Spectra

3) Maps

by kaichogami at August 16, 2014 01:27 PM

August 15, 2014

(SciPy/NumPy student)

The ringing deadlines remind me that 3 months have flown past in a jiffy. Time to bid sayonara! Good times always seem to roll by too soon; sad thing! Nevertheless, it has been a great experience under brilliant mentoring. This post has come much later than the previous one. Things have taken shape since then.

1. Ellipsoidal Harmonics:
Ellipsoidal harmonic functions of the first kind, also known as Lamé functions, have been implemented in Cython and integrated with ufuncs.
The ellipsoidal harmonic function of the second kind and the calculation of the normalization constant for Lamé functions were implemented as .pyx files due to the involvement of global variables. The normalization constant was calculated in two different ways: using integration and using recurrence. Though recurrence seemed the more basic and faster approach, its numerical stability wasn't that good, so we adopted integration.
The process involved many things that were new to me, like integrating the awesome LAPACK library, the speed and awesomeness of Cython, etc.!
The pull request is here:
2. Hypergeometric functions:
The present implementation of hypergeometric functions is buggy. For real values of x, the C implementation from the Cephes library has a few errors, while the FORTRAN implementation for complex values of x suffers from errors over a much wider domain. There has been an attempt to re-implement the function in Cython and make it less error-prone. Though the shortage of time prevented a bug-free implementation, a few bugs have been removed successfully.
The implementation so far has been posted here

I would yet again stress that the flipping of bits this summer has been great fun as well as a great skill and knowledge booster. Never was my summer so productive!

Signing off with loads of great memories and experiences

by janani padmanabhan at August 15, 2014 06:56 PM

August 12, 2014

Mustafa Furkan Kaptan
(Vispy student)

Finishing Backends

Hello reader,

Before we begin, here is how GSoC is going for me:

I have finished the static backend. In this backend, we aim to display a PNG image on IPython notebook. This is just the basic version of our VNC backend. We used IPython's `display_png()` function for displaying the PNG. I can say it is ready to merge.

On the other side, we solved the timer issue that I have mentioned in my previous post. We used a JavaScript timer with `setInterval()` function of JS window. We are sending 'poll' events from front-end (JS) to backend (python). When the backend receives these events, it generates a 'timer' event. Therefore, on the vispy canvas object that we created in IPython notebook, we can connect a callback function (like on_timer) and use it without any problem. This means we are able to do animations now. That is great!
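The poll-to-timer flow can be pictured with a tiny event dispatcher. This is only a sketch of the idea: the class names, the message format and the connect/emit methods are invented for the illustration and are not the vispy API.

```python
class SketchCanvas(object):
    """Minimal stand-in for a canvas that dispatches named events."""

    def __init__(self):
        self._callbacks = {}

    def connect(self, event_name, callback):
        self._callbacks.setdefault(event_name, []).append(callback)

    def emit(self, event_name, **data):
        for callback in self._callbacks.get(event_name, []):
            callback(data)


class SketchBackend(object):
    """Turns 'poll' messages from the front end into 'timer' events."""

    def __init__(self, canvas):
        self.canvas = canvas
        self.iteration = 0

    def handle_message(self, message):
        if message.get("type") == "poll":
            self.iteration += 1
            self.canvas.emit("timer", iteration=self.iteration)


seen = []


def on_timer(event):
    # An animation would update and redraw the scene here.
    seen.append(event["iteration"])


canvas = SketchCanvas()
canvas.connect("timer", on_timer)
backend = SketchBackend(canvas)

# In the notebook, the JavaScript setInterval() callback sends these.
for _ in range(3):
    backend.handle_message({"type": "poll"})

print(seen)  # [1, 2, 3]
```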

More good news: we managed to hide the vispy canvas! That was a big problem from the beginning. Eric and Almar, two of the vispy-devs, proposed a solution: showing the canvas once and hiding it immediately. I had tested this solution before, but I couldn't get it to run. After that, Almar took over the issue and solved it like a boss! The problem was (as far as I understood) that Qt refuses to draw when there is no visible canvas. So we are forcing the draw manually and voilà! Problem solved!

See you next post!

by Mustafa Kaptan at August 12, 2014 08:55 PM

Michael Mueller
(Astropy student)

Week 12

There's not too much to report for this week, as I basically worked on making some final changes and double-checking the writing code to make sure it works with the entire functionality of the legacy writer. After improving performance issues related to tokenization and string conversion, I created a final version of the IPython notebook for reading. Since IPython doesn't work well with multiprocessing, I wrote a separate script to test the performance of the fast reader in parallel and output the results in an HTML file; here are the results on my laptop. Parallel reading seems to work well for very large input files, and I guess the goal of beating Pandas (at least for huge input and ordinary data) is basically complete! Writing is still a little slower than the Pandas method to_csv, but I fixed an issue involving custom formatting; the results can be viewed here.
I also wrote up a separate section in the documentation for fast ASCII I/O, although there's still the question of how to incorporate IPython notebooks in the documentation. For now I have the notebooks hosted in a repo called ascii-profiling, but they may be moved to a new repo called astropy-notebooks. More importantly, Tom noticed that there must actually be something wrong with the fast converter (xstrtod()), since increasing the number of significant figures seems to scale the potential conversion error linearly. After looking over xstrtod() and reading more about IEEE floating-point arithmetic, I found a reasonable solution by forcing xstrtod() to stop parsing digits after the 17th digit (since doubles can only have a maximum precision of 17 digits) and by correcting an issue in the second half of xstrtod(), where the significand is scaled by a power of ten. I tested the new version of xstrtod() in the conversion notebook and found that low-precision values are now guaranteed to be within 0.5 ULP, while high-precision values are within 1.0 ULP about 90% of the time with no linear growth in error.
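To make the ULP figures concrete, here is a rough sketch of how such a conversion error can be measured. It is not the actual xstrtod(): `toy_strtod` is a hypothetical toy parser that only handles unsigned decimals without an exponent, keeps at most 17 digits, and scales the significand by a power of ten. The error is measured against the exact rational value, and `math.ulp` requires Python 3.9 or later.

```python
import math
from fractions import Fraction

MAX_DIGITS = 17  # digits kept before the rest are ignored


def toy_strtod(text, max_digits=MAX_DIGITS):
    """Parse an unsigned decimal such as '123.456' (no exponent part)."""
    int_part, _, frac_part = text.partition(".")
    exponent = -len(frac_part)
    digits = (int_part + frac_part).lstrip("0")
    if len(digits) > max_digits:
        # Stop reading digits after max_digits and adjust the power of ten.
        exponent += len(digits) - max_digits
        digits = digits[:max_digits]
    significand = float(int(digits or "0"))
    if exponent < 0:
        return significand / 10.0 ** -exponent
    return significand * 10.0 ** exponent


def ulp_error(text):
    """Distance between toy_strtod(text) and the exact value, in ULPs."""
    approx = toy_strtod(text)
    exact = Fraction(text)  # exact rational value of the decimal string
    return float(abs(Fraction(approx) - exact) / Fraction(math.ulp(approx)))


for sample in ["1.5", "0.1", "123.456", "0.12345678901234567890"]:
    print(sample, ulp_error(sample))
```

Short inputs stay within 0.5 ULP because both the significand and the power of ten are exactly representable, so only the final division rounds. Longer inputs pick up extra error from truncation and from rounding the significand.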
Once I commit the new xstrtod(), my PR should be pretty close to merging--at this point I'll probably write some more tests just to make sure everything works okay. Today is the suggested "pencils down" date of Google Summer of Code, so I guess it's time to wrap up.

by Michael Mueller at August 12, 2014 02:10 PM

August 11, 2014

Richard Tsai
(SciPy/NumPy student)

GSoC2014: Recent progress

Hi! It has been several weeks since I last talked about my work. In the past several weeks I mainly worked on the optimization of cluster.hierarchy.

The SLINK Algorithm

The most important optimization is the SLINK algorithm [1] for single linkage. The naive hierarchical agglomerative clustering (HAC) algorithm has \(O(n ^ 3)\) time complexity, while SLINK is \(O(n ^ 2)\) and very easy to implement (even easier than the naive algorithm).

The Pointer Representation

SLINK is very good in performance but requires a special linkage representation – the pointer representation, which is different from what cluster.hierarchy is using. The pointer representation can be described as follows.

  • A cluster is represented by the member with the largest index
  • \(\Pi[i] (i \in [0, n - 1])\) is the first cluster that cluster \(i\) joins, \(\Lambda[i] (i \in [0, n - 1])\) is the distance between cluster \(i\) and cluster \(\Pi[i]\) when they join

For example, the pointer representation of the following dendrogram is

  • \(\Pi[i] = \{6, 3, 3, 5, 9, 9, 8, 9, 9, 9\}\)
  • \(\Lambda[i] = \{1.394, 0.419, 0.831, 1.561, 3.123, 10.967, 1.633, 1.198, 4.990, \infty\}\)
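One way to read a pointer representation is to list the joins in order of increasing distance. The short sketch below does that for the example values above; `merge_order` is a hypothetical helper written for illustration, not part of cluster.hierarchy.

```python
PI = [6, 3, 3, 5, 9, 9, 8, 9, 9, 9]
LAMBDA = [1.394, 0.419, 0.831, 1.561, 3.123, 10.967, 1.633, 1.198, 4.990,
          float("inf")]


def merge_order(pi, lam):
    """Return (i, pi[i], lam[i]) for every cluster but the last, by distance."""
    order = sorted(range(len(pi) - 1), key=lambda i: lam[i])
    return [(i, pi[i], lam[i]) for i in order]


for i, target, dist in merge_order(PI, LAMBDA):
    print("cluster %d joins cluster %d at distance %.3f" % (i, target, dist))
```

The last entry is skipped because the cluster with the largest index never joins anything, which is why its \(\Lambda\) value is infinite.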


The implementation of SLINK is very simple. There’s pseudo-code in the original paper. The following Cython code needs two pre-defined functions: condensed_index, which calculates the index of element (i, j) in the condensed form of a square distance matrix, and from_pointer_representation, which converts the pointer representation to what you need.

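import numpy as np

# NPY_INFINITYF is assumed to be declared elsewhere in the module
# (e.g. through a cdef extern block for numpy/npy_math.h).

def slink(double[:] dists, double[:, :] Z, int n):
    cdef int i, j
    # M[j] holds the distance between point j and the newly added point i.
    cdef double[:] M = np.ndarray(n, dtype=np.double)
    cdef double[:] Lambda = np.ndarray(n, dtype=np.double)
    cdef int[:] Pi = np.ndarray(n, dtype=np.int32)

    Pi[0] = 0
    Lambda[0] = NPY_INFINITYF
    for i in range(1, n):
        # Point i starts as its own cluster.
        Pi[i] = i
        Lambda[i] = NPY_INFINITYF

        for j in range(i):
            M[j] = dists[condensed_index(n, i, j)]

        for j in range(i):
            if Lambda[j] >= M[j]:
                M[Pi[j]] = min(M[Pi[j]], Lambda[j])
                Lambda[j] = M[j]
                Pi[j] = i
            else:
                M[Pi[j]] = min(M[Pi[j]], M[j])

        # Relabel clusters that now end in the new point i.
        for j in range(i):
            if Lambda[j] >= Lambda[Pi[j]]:
                Pi[j] = i

    from_pointer_representation(Z, Lambda, Pi, n)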
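For readers without a Cython toolchain, the same recurrence can be tried in plain Python. This is a sketch under assumptions rather than the SciPy code: `condensed_index` here uses the usual row-major upper-triangle layout of a condensed distance array, `slink_pointer` returns the pointer representation directly instead of filling Z, and `condensed_from_points` is a hypothetical helper for 1-D test data.

```python
import math


def condensed_index(n, i, j):
    """Index of the pair (i, j), i != j, in a condensed array of n points."""
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


def slink_pointer(dists, n):
    """Return (pi, lam), the pointer representation of single linkage."""
    pi = list(range(n))       # every point starts as its own cluster
    lam = [math.inf] * n
    m = [0.0] * n             # m[j]: distance from point j to the new point i
    for i in range(1, n):
        for j in range(i):
            m[j] = dists[condensed_index(n, i, j)]
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam


def condensed_from_points(xs):
    """Condensed pairwise distances for points on a line (illustration only)."""
    n = len(xs)
    return [abs(xs[i] - xs[j]) for i in range(n) for j in range(i + 1, n)]


points = [0, 10, 1, 11]
pi, lam = slink_pointer(condensed_from_points(points), len(points))
print(pi, lam)
```

For these four points the finite \(\Lambda\) values are the three single-linkage merge heights: the two close pairs join at distance 1, and the two pairs join each other at distance 9.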


On an N = 2000 dataset, the improvement is significant.

In [20]: %timeit _hierarchy.slink(dists, Z, N)
10 loops, best of 3: 29.7 ms per loop

In [21]: %timeit _hierarchy.linkage(dists, Z, N, 0)
1 loops, best of 3: 1.87 s per loop

Other Attempts

I’ve also tried some other optimizations, some of which succeeded while others failed.

I used binary search in cluster_maxclust_monocrit and there was a bit of improvement (though it is not a time-consuming function in most cases).

Before (N = 2000):

In [14]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
10 loops, best of 3: 35.6 ms per loop

After (N = 2000):

In [11]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
100 loops, best of 3: 5.86 ms per loop

Besides, I tried an algorithm similar to SLINK but for complete linkage – the CLINK algorithm [2]. However, it seems that CLINK is not the complete linkage that we use today. In my implementation it did not always produce the best linkage. Perhaps I have misunderstood some things in that paper.

At the suggestion of Charles, I tried an optimized HAC algorithm using a priority queue. It has \(O(n^2 \log n)\) time complexity. However, it didn’t work as well as expected. It was slower even when N = 3000. The algorithm needs to delete a non-root node in the priority queue, so when a binary heap is used, it needs to keep track of the index of every node, which increases the time constant. Some other priority queue algorithms might perform better but I haven’t tried them.

  1. Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34. 

  2. Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366. 

by Richard at August 11, 2014 12:36 PM

Wenzhu Man
(PyPy student)

only zero gc pointers not the whole nursery

As mentioned in the earlier post, our original idea was to divide the nursery into two parts that grow in opposite directions.

However, after I implemented this idea and adapted the GC, I started to think that maybe it's not the best solution: instead of simplifying the nursery, we made it more complicated.
Since the problem is that following uninitialized GC pointers will crash the program, we can just zero the fields whose type is GcPtr.

So I proposed this solution and after discussion with my mentors, we chose this solution instead of the two-end nursery.

After coding for a month and a half, here is my progress.
1. The GC is fully adapted and tested so that new objects are all allocated in uninitialized memory.
This simplified the C code that is generated, and there is now only one end of the nursery (it used to have two ends: top/real top).
 Here is the generated c code:


As shown above, the newly generated C code gets rid of the expensive OP_RAW_MEMCLEAR
and the C code is simplified.
2. As malloc is now non-clearing, all the clear operations and methods are fully refactored, e.g. malloc_fixed_size_clear to malloc_fixed_size.

3. Rewrote the logic in gcwrapper for choosing the right malloc function between different GCs.

4. For GcStruct and GcArray, we insert a special operation after malloc: zero_gc_pointers_inside. This operation is now
implemented but doesn't fully work (tons of debugging :( ).

by Wenzhu Man at August 11, 2014 07:47 AM

August 10, 2014

Gaurav Trivedi
(Kivy student)

More, more facades

This week I added many more facade implementations in Plyer. It was only a few days ago that I had started working on iOS and I am happy that the list has grown quite a bit this week.

I also added Plyer in the kivy-ios tool-chain, i.e. it is now a part of the build-all script and would be available for use in apps packaged with Kivy for iOS.

Apart from that I also did a couple of maintenance fixes to close the holes that I noticed with the checked in code and fix style problems with other contributions.

Although this update was a short one, it did involve a considerable amount of coding effort.

As the summer is coming to a close, I will be spending the next week wrapping up my work, polishing the rough edges in the contributions till now, and of course writing the “obvious” bits and pieces that I may have ignored from the documentation till now.

by gtrivedi at August 10, 2014 09:36 PM

Ajitesh Gupta(:randomax)
(MoinMoin student)

Week 8, 9, 10, 11, 12 Work

OK, I guess it has been quite long since my last post.

Week 8

Just like week 7, week 8 was also about ACLs. This week I had to work on the Group ACL visualisation. In order to do that I first had to create a view to list all the groups present in the wiki, just like the "Userbrowser" view that is already present in the admin section. We discussed the design of the view on the etherpad and came up with this view -

The Group View

The names of the groups and their member users and groups are listed along with links to the ACL reports for each of the groups. The list is sortable and is by default sorted by group name; the names of group members are also sorted by default. Here is the commit for that. After this came the making of the Group ACL Report. For each group we decided to show only those items in the ACL reports which specifically mention the group name in their ACL string. Here is what the view looks like -

The Group ACL Report

It lists the Item names or Item IDs in case of a nameless item. The names/IDs are hyperlinks to the modify view of the items so that the admin can directly go and modify the permissions of the item. Here is the commit for that.

I also worked on and committed the old patches - #445, Userwise ACL Reports, Metadata View in basic theme, Item ACL Report View.

Week 9

In this week I had to work on providing the capability to edit the ACL string for an item in the Item ACL Report so that the admin does not have to go to the modify item view of each item in order to change the ACL rights. So I created an extra text field in the Item ACL Report itself which contains the current ACL string for an item and the admin can modify the ACL there itself. The "default" tag specifies that the string for that item is same as the default ACL in the configuration. The codereview for that is here. Here is what it looks like -

The new Item ACL report with editing functionality

Also this week I had to work on removing the auto-computed metadata from the modify meta view. There was no use in giving them in the edit view as it would either have no effect or would give an error in case the user deliberately tried to change it. Here is the commit.

Week 10 and 11

This week I had to improve UI elements and color schemes. Firstly I had to fix a few existing bugs in the UI. The first one was #394, where there was an ugly mouseover and a validation error in the modernized theme. The namespace shown along with the location line used to show an ugly dropdown menu on mouseover, and it also gave a validation error as a span element is not allowed to contain a ul element. So we did away with showing the namespace there. Instead we made a "Namespace" section in the "User" view itself. Hence it solved both problems together. Here is the commit for that. Here is how it looks -

The Namespaces section in User view

Then came the #425 where the basic theme used to give a 404 not found error for common.css in the background as there was no common.css in the basic theme setup. So in order to fix this we decided to keep a uniform naming system for the stylesheets in all the themes. So I changed the names of basic.less to theme.less and basic.css to theme.css in basic theme and main.styl to theme.styl and common.css to theme.css in modernized and foobar themes. Here is the commit for that.

Also in one of the meetings Thomas pointed out that the URL for the User ACL Report is untidy due to unnecessary values being passed in the GET request. I removed those values to make it cleaner. He also asked me to reorder the options in the Admin Menu so that they are properly organised. Here is the commit regarding both of those issues.

Since I had a lot of time left in week 10 I decided to jump ahead and proceed to week 11 work, in week 10 itself. The first thing to do there was to add a background color to the sidebar in the basic theme so that it would look different from the content. So I made up a few samples and we finally decided that shades of blue would be the best choice among all as it would match the theme too. Here is the commit and the screenshot -

The new blue Sidebar

Then the Itemsize view and the Interwiki names view both needed a proper tabular view as they were rather clumsily arranged. Also the item sizes in the Itemsize view were not human readable and were rather the size of the items in bytes. I added bootstrap tables to both the views and also made the item sizes human readable. Here is the commit for them. The old Interwiki and Itemsize views were like -

And the new ones are here -

Also then I added CSS classes for making 2, 3 and 4 column lists. Here is the commit for them. Here is one sample usage -

Old single column list

New multi-column list

Next was to address issue #66 in sharky93's repo, which was about reloading the page when a new theme is selected so that the new theme gets loaded automatically. I added a line of JavaScript in common.js - "location.reload(true)" - at the end of the form submission and processing so that the current location is reloaded. The "true" argument is to force the page to reload from the server rather than from the cache. Here is the commit for that.

Also I made the textarea in the modify form expand/contract as the page is expanded or contracted, similar to the way it happens in the modernized and foobar themes, by setting the width of the textarea to 100%. Here is the commit for that.

Last but not least, I removed the subitems sidebar in the modernized theme as it was not working as intended, and removing it makes the theme consistent with the other themes. Here is the commit for that.

Week 12

In this week I was supposed to add the capability for users to add their own custom CSS to the themes using wiki items as the source in the "User CSS URL" in the "Appearance" settings. We had planned to use a "raw" view to render the raw CSS and use that as the source. But Thomas pointed out that this capability would cause potential security threats as users could make calls to malicious JavaScript code using the url tag in the CSS. So this plan was dropped. Instead we planned on working on whatever is currently there and improving it. The plan was also to find flaws, if any, in the work done by me till now by practically using the wiki.

So first up this week was to finish open issues in my own repo and to report and solve new findings. The first one was to add CSS to the ACL reports in the foobar and modernized themes. I added the "zebra" class to the tables in both and modified a bit of css in order to make them look better. Here is the commit and the screenshot -

A small fix was to add an h1 header to the global index view in the basic theme to make it consistent with the other views. Here is the commit and screenshot -

A bigger task this week was to improve the global index view in the basic theme as it was poorly organized. I added bootstrap icons and tables to improve the view. Here are the commit and screenshots -

The old Global History view

The new Global History view

I also revisited the overflowing quicklinks problem. I removed the delete quicklink "X" icon and instead gave the user the same option in the User Actions tab, as suggested by Thomas. Since the icon is removed, I also increased the maximum characters in a quicklink to 20 as there was more space. Furthermore the tooltip now shows as "<wikiname> : <full url>". Here is the codereview for the changes.

The next week would involve finishing the project after finding and fixing whatever bugs I can find. Hope that all goes well :D

by Ajitesh Gupta at August 10, 2014 01:21 PM

Saurabh Kathpalia
(MoinMoin student)

GSoC Update: August 10

Week 12:

This week I mainly focussed on getting the pending crs committed and finalized.

Finalized and committed stuff

Improved UI of Blog Items in all the themes:
  • Modernized Theme: codereview and commit to my repo
  • Foobar Theme: codereview and commit to my repo. Also added itemviews such as Add quicklink, modify etc in blog view by defining itemviews.
  • Basic Theme: codereview and commit to my repo.

Moved comments to right and meta-data information to the left in ticket show/modify view and also removed the dependency on item_name in this template, used summary/fqname instead of fqname. Here is the codereview and commit to my repo.

Added ticket-create button and blog-entry-create button in +tickets view and blog view respectively in all the themes with a TODO of creating tickets without any initial fqname. Here is the codereview and commit to my repo.

Now shortened fqname is shown in trail and also used fqname instead of item_name in location_breadcrumbs in foobar theme. Here is the codereview and commit to my repo.

Also did the same thing for basic theme. Here is the codereview and commit to my repo.

Fixed #326 - Removed duplicate "Add comment" and "message", just used "Add comment". Here is the codereview and commit to my repo.

Now only tags specific to ticket items are shown in +tickets view and ticket show/modify view. For this I added a function that takes ITEMTYPE as argument and returns the tags associated with items of such itemtype. Here is the codereview and commit to my repo.

For the coming week my main focus will be to check for more bugs and try to fix them, improve the documentation and make some UI changes.

by saurabh kathpalia at August 10, 2014 09:46 AM

August 09, 2014

Saimadhav A Heblikar
(Core Python student)

GSoC 2014 - Overview of workflow and resources

In this blogpost, I will link to all GSoC 2014 related resources. If you intend to regularly check my GSoC progress, it is advisable that you bookmark this page. It will link to all resources.


by Saimadhav Heblikar at August 09, 2014 02:38 PM

August 07, 2014

Navin Chandak
(pgmpy student)

Two weeks after mid-term eval

Oh man, Florian Fuchs is really going to kill me for this, but this blog post had completely slipped my mind. Anyway, so here I am writing this post long after it was supposed to be written. So I will just write about the experiences which I had till the 13th. (They were interesting experiences for sure.) I will also pretend in the blog as if it is the 13th today :p

So, starting from where I had finished off in the last post: I was left with a broken factor function and some frustration arising from that. I had posted the issue on our group, but Shashank and Ankur were busy with other stuff and could not find time to solve it. So in the last GSoC chat, Shashank told me to write a new factor function myself. He thought that it would be a good exercise in Cython to do that on my own, and also that it would be very difficult for me to parse the code which Shashank had initially written for the factor product function.

Also, I decided to include a new list to store the MAP values along with the numpy.array which we had for storing the potential values. (Corresponding to each value, there would be the values which the other variables had taken during elimination.) There were a lot of tough design decisions here (how to store it: list vs dictionary? Is a list fine?). In the end, I implemented each of the entries as a list of the eliminated variables and stored the entire thing in a list. I am not really sure what to do about the huge expense of storing it in a list. For later.

After this, I pinged Shashank again about the product module since it was making testing very difficult for me. So Shashank told me to implement it myself since it would be a very good exercise in learning Cython for me too. So I implemented this in Cython and it was working fine. However, there was one big issue: how to handle the eliminated variable values in the Cython function, because they were stored in lists and we could only pass C++ data structures to the Cython functions. I am on the lookout for possible ways to handle this.

Anyway, that's all for now. (from the point of view of 13th August).

by Navin Chandak at August 07, 2014 09:10 PM

Alan Leggitt
(MNE-Python student)

Day 56: Checking In

It's been a while since I wrote a post. There haven't been a lot of changes to the actual source localization, but I've spent a lot of time trying to integrate my work into the existing MNE repository.

For the past 2 weeks or so, I've been struggling with visualizing source spaces. This is an integral part of visualizing the source estimates. While I've been having success plotting the exact locations of source dipoles in down-sampled source spaces (e.g. volumes with 5 mm spacing, or surfaces with 6 mm spacing), when I try to interpolate onto a higher resolution image, I can't seem to get the right transformations for the surface sources.

For instance, the image below shows the source spaces for the cerebellum (blue) and the cortex (white). They look pixellated because they're lower resolution than the MRI image.

But when I try to create higher resolution images of the source spaces, the cerebellum lines up but the cortex does not.

The goal now is to go through the code and find out exactly what coordinate frames the surfaces and volumes are generated in and then figure out how to transform from one coordinate system to another.

If you're interested in more details, the nitty gritty details have been on this pull request on github.

by Alan Leggitt at August 07, 2014 12:00 AM

August 03, 2014

Simon Liedtke
(Astropy student)

Astroquery -- New Service Atomic Line List

"Atomic Line List" is a collection of more than 900,000 atomic transitions in the range from 0.5 Å to 1000 µm (source). By adding support for this service in astroquery, it will be possible to access these records easily with the Python programming language.

The class AtomicLineList has only 2 public methods: query_object and query_object_async. The latter one only gives us an HTTP response object, whereas the former one converts the HTTP response into an AstroPy table. So let's take a look at query_object: So far, it only has four parameters (all optional): wavelength_range, wavelength_type, wavelength_accuracy and element_spectrum. The respective web form for Atomic Line List can be found online. As you can see there, the first form fields are "Wavelength range" and "Unit". The AstroPy package offers a very handy unit package, so I decided to support this type instead of passing plain strings. Therefore, even more units are supported than the ones given in the web form! The parameter wavelength_range is a 2-tuple where each item is a scalar multiplied with an AstroPy unit. Behind the scenes, both values will be converted to Angstrom (arbitrarily chosen from the web form's dropdown menu).

In the following Python session you can see the atomic package in action. Note that Hz is actually not a unit supported by Atomic Line List; the atomic package takes care of supporting all spectral units.

>>> from astropy import units as u
>>> from astroquery.atomic import AtomicLineList
>>> alist = AtomicLineList()
>>> wavelength_range = (15 * u.nm, 1.5e+16 * u.Hz)
>>> alist.query_object(wavelength_range, wavelength_type='Air', wavelength_accuracy=20, element_spectrum='C II-IV')
<Table rows=3 names=('LAMBDA VAC ANG','SPECTRUM','TT','TERM','J J','LEVEL ENERGY  CM 1')>
array([(196.8874, 'C IV', 'E1', '2S-2Po', '1/2-*', '0.00 -   507904.40'),
       (197.7992, 'C IV', 'E1', '2S-2Po', '1/2-*', '0.00 -   505563.30'),
       (199.0122, 'C IV', 'E1', '2S-2Po', '1/2-*', '0.00 -   502481.80')],
      dtype=[('LAMBDA VAC ANG', '<f8'), ('SPECTRUM', 'S4'), ('TT', 'S2'), ('TERM', 'S6'), ('J J', 'S5'), ('LEVEL ENERGY  CM 1', 'S18')])
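As a sanity check on the mixed-unit range used in that session, the conversion to Angstrom can be worked out with plain arithmetic. This sketch does not use astropy.units (which is what the package relies on); `to_angstrom` and `wavelength_range_in_angstrom` are hypothetical helpers, and a frequency is converted with wavelength = c / frequency.

```python
SPEED_OF_LIGHT = 299792458.0   # m/s, exact by definition
METERS_PER_ANGSTROM = 1e-10
LENGTH_UNITS_IN_METERS = {"Angstrom": 1e-10, "nm": 1e-9, "um": 1e-6}


def to_angstrom(value, unit):
    """Convert a wavelength, or a frequency in Hz, to Angstrom."""
    if unit == "Hz":
        return SPEED_OF_LIGHT / value / METERS_PER_ANGSTROM
    return value * LENGTH_UNITS_IN_METERS[unit] / METERS_PER_ANGSTROM


def wavelength_range_in_angstrom(bounds):
    """Convert a pair of (value, unit) bounds and order them low to high."""
    converted = [to_angstrom(value, unit) for value, unit in bounds]
    return min(converted), max(converted)


low, high = wavelength_range_in_angstrom([(15, "nm"), (1.5e16, "Hz")])
print(low, high)
```

15 nm is 150 Å and 1.5e16 Hz corresponds to roughly 199.86 Å, which is consistent with the three C IV lines between 196.9 Å and 199.0 Å returned in the session above.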

by Simon Liedtke at August 03, 2014 10:00 PM

August 02, 2014

Jaspreet Singh
(pgmpy student)

The one with PomdpXWriter Class

Hi there!

After completing reading from the PomdpX file format, I now had to add support for writing to the same format. This class, when given model data containing a network description, creates an object containing XML for the PomdpX file format. The XML tags were created with the help of the lxml module. Some of the functions used throughout the class were
etree.Element(tag_name, attrib={})
etree.SubElement(xml_tag, tag_name, attrib={})
Some of the functions were finished quickly but others consumed more time due to their complexity, especially the one dealing with parameters, which had to be supported with recursive functions. The most difficult part of coding this class was getting the exact XML for the file format, with the indentation, spacing and sequence of the tags correct. So, I had to spend a lot of time debugging the code to get each and every function working properly. Finally all the tests passed and this issue ended with a sigh of relief. Good bye guys! See you soon :)
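A minimal sketch of how Element and SubElement fit together is shown below. To stay self-contained it uses the standard library's xml.etree.ElementTree, which offers the same two calls, instead of lxml. The tag names and the `build_pomdpx` and `to_pretty_xml` helpers are only illustrative and are not the writer class from pgmpy; `indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as etree


def build_pomdpx(description, discount):
    """Build a tiny XML tree; the tag names are illustrative assumptions."""
    root = etree.Element("pomdpx", attrib={"version": "1.0"})
    desc = etree.SubElement(root, "Description", attrib={})
    desc.text = description
    disc = etree.SubElement(root, "Discount", attrib={})
    disc.text = str(discount)
    return root


def to_pretty_xml(root):
    """Serialize with indentation so the tag order and nesting are easy to diff."""
    etree.indent(root, space="    ")
    return etree.tostring(root, encoding="unicode")


model_xml = to_pretty_xml(build_pomdpx("demo model", 0.95))
print(model_xml)
```

Because SubElement appends children in call order, the sequence of tags in the output is fixed by the order of the calls, which matters for a format where the tag sequence is part of the specification.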

by jaspreet singh at August 02, 2014 03:50 PM

August 01, 2014

Rishabh Raj
(scikit-image student)

Final destinación and the road ahead

The last two weeks with our project involved adding a bunch of new features which we had in mind which sort of brings things together.

The gallery now feels more intuitive to use: click on the code to start editing, press `Esc` to get out of edit mode, there are visual indications for when the code is being run, editing is not allowed during that time, etc.

As in the past, this time too we faced a strange issue: AJAX requests to the server side for executing the code worked perfectly fine in Chrome/ium, but Firefox just wouldn't budge. It always resulted in a transmission error, signalling a cross-origin policy issue, that requests need to be from the same domain or CORS needs to be enabled, but wait.. we already had CORS enabled server-side, which is why it was working in Chrome/ium!

Everything seemed to be alright, response to the first OPTIONS request (this is sent prior to an AJAX call, specifying the properties of the communication which is to follow), was identical in Chrome and Firefox, but the fox just refused to proceed with further communication (the actual request). But what kept me going was this ..

When you have eliminated the impossible, whatever remains, however improbable, must be the truth


It was clear that it surely had something to do with the response which we were getting for the first OPTIONS request. After a close look it started becoming somewhat clear, one response looked like `Access-Control-Allow-Headers: Access-Control-Allow-Headers: Origin, Content-Type, X-Requested-With, Accept` ..

Access-Control-Allow-Headers is present in this preflight response to indicate which headers can be used to make the actual request, but wait, there it was repeated twice. I’d read about browsers being too defensive with such stuff and how it could easily break things, so with a sigh I tried changing that, and voilà, the fox did not disappoint.
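The mistake is easy to reproduce and to guard against. The sketch below is not the project's server code: `preflight_headers`, `value_repeats_name` and `malformed_headers` are hypothetical helpers, and the allowed methods are an assumption. It builds the preflight headers as a dictionary and flags any header whose value starts with its own name.

```python
ALLOWED_REQUEST_HEADERS = ["Origin", "Content-Type", "X-Requested-With", "Accept"]


def preflight_headers(origin):
    """Headers for the response to the OPTIONS preflight request."""
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "POST, OPTIONS",  # assumed methods
        "Access-Control-Allow-Headers": ", ".join(ALLOWED_REQUEST_HEADERS),
    }


def value_repeats_name(name, value):
    """True if the header value starts with the header's own name and a colon."""
    return value.strip().lower().startswith(name.lower() + ":")


def malformed_headers(headers):
    """Names of headers whose value accidentally repeats the header name."""
    return [name for name, value in headers.items()
            if value_repeats_name(name, value)]


good = preflight_headers("http://example.org")
bad = dict(good)
bad["Access-Control-Allow-Headers"] = (
    "Access-Control-Allow-Headers: Origin, Content-Type, X-Requested-With, Accept"
)
print(malformed_headers(good), malformed_headers(bad))
```

A check like this is cheap to run in a unit test, so the doubled name is caught before a browser ever sees it.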

With this major hurdle out of the way, we added timestamps to the images which are generated, so when you hover over the image your code generated, it will show the server time at which it was generated.

Another interesting thing added recently was the idea of a ‘queue’. At times when the load on the server is high (given by the number of container instances it is running), we don’t process more requests but simply respond that the server is busy and to try again later, just to make the experience more interactive.
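The busy-or-run decision can be sketched with a counting semaphore. This is only an illustration of the idea, not the gallery's server code: `ContainerPool` and `try_run` are hypothetical names, and a plain callable stands in for starting a container.

```python
import threading


class ContainerPool:
    """Runs jobs while free slots remain; otherwise answers 'busy' at once."""

    def __init__(self, max_containers):
        self._slots = threading.BoundedSemaphore(max_containers)

    def try_run(self, job):
        # Non-blocking acquire: never make the user wait for a free slot.
        if not self._slots.acquire(blocking=False):
            return {"status": "busy",
                    "message": "Server is busy, please try again later."}
        try:
            return {"status": "ok", "result": job()}
        finally:
            self._slots.release()


pool = ContainerPool(max_containers=1)


def job_that_triggers_second_request():
    # While this job holds the only slot, a second request is turned away.
    return pool.try_run(lambda: "never runs")


outer = pool.try_run(job_that_triggers_second_request)
later = pool.try_run(lambda: 42)
print(outer, later)
```

Rejecting immediately instead of queueing keeps the response time predictable, which fits the goal of an interactive experience.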

Future work in this involves making this whole server setup configurable, currently one can configure the maximum amount of output (the number of lines of output) which we want to send back to the user running the code as well as the maximum number of containers it can run simultaneously, and of course, it goes without saying, the documentation :)

Link to live demo –



Initial view of one of the examples in the gallery


Editor active


Editor active, showing the Run button

Code has been sent for execution; the grayed-out editor indicates editing is disabled, and the wheel beside where the 'Run' button was is an indicator of execution in progress.


Shows the timestamp of the image generated as per server time.


by sharky932014 at August 01, 2014 04:17 AM

July 31, 2014

Vytautas Jančauskas
(TARDIS-SN student)

Four Weeks in to the Second Half Report

We have finally merged the C port of the Monte Carlo routines into the main TARDIS repository and made it the default. All the final bugs were resolved and hopefully this will let us collect user feedback and gather any information about bugs that were introduced and performance improvements that were achieved. This more or less concludes the coding part of the GSoC and all that is left is to fix any remaining bugs, write documentation and react to user requests. Another potential area of issues is the difference between the GCC and clang compilers. For example, previously we had issues with clang treating the inline keyword differently. Currently I am working on using TARDIS to fit synthetic spectra to observed supernova spectra using various optimization algorithms, which also provides a good test for the implemented functionality.

by Vytautas Jančauskas at July 31, 2014 06:58 PM

July 29, 2014

Tarun Gaba
(PyDy student)

GSoC 14: First Week!

First week of GSoC 14 has ended. It has been a week full of discussions and brainstorming over how to handle the project and the collaboration. Most of the time was spent taking crucial design decisions. As decided, I will be publishing these weekly blog posts in A.O.I format (Accomplishments, Objectives and Issues).

Accomplishments

The main accomplishment of this week was finalizing a stable API for the generic visualizer. Discussions were held with Adam and it was decided that a fork of MGView will be used for developing the new visualizer. It will have a UI look similar to MGView, but with additional features pertaining to PyDy and related enhancements as well. Apart from that, another main aim was to flesh out an API for the visualizer. The generic visualizer will be made up of the following modules:

  • Parser Module: to parse the JSON and save it in JS objects/variables.
  • SceneGenerator Module: to take the relevant data and information from the parsed JSON and create a scene on the canvas.
  • SceneEditor Module: using GUI controls to edit the scene and save it in a JSON file.
  • ParamsEditor Module: using GUI widgets to modify the simulation parameters, and send/save them as relevant.

Objectives

The objectives of the upcoming week are:

  • To develop the Parser Module to be able to consume JSON from both PyDy and MG and parse it into relevant JavaScript objects.
  • To develop methods on the PyDy side to generate the JSON in the form that MotionView can consume.
  • To test some benchmark examples to check this workflow: output from PyDy --> JSON --> consumed by Parser Module.

Issues

Since actual coding work has not started yet, no technical issues have been encountered so far. I will keep this blog updated with the regular advancements in the project. Happy Coding!

July 29, 2014 04:11 PM

July 28, 2014

Manoj Kumar
(scikit-learn student)

Scikit-learn: Logistic Regression CV

Hi, it has been a long time since I last posted something on my blog. I had the opportunity to participate in the scikit-learn sprint recently, with the majority of the core developers. The experience was awesome, but most of the time I had no idea what people were talking about, and I realised I have a lot to learn. I read somewhere that if you want to keep improving in life, you need to make sure you are the worst person at the job, and if that is true, I’m well on the right track.

Anyhow, on a more positive note, one of my biggest pull requests recently got merged, and we shall have a quick look at the background, what it can do and what it cannot.

1. What is Logistic Regression?
A Logistic Regression is a regression model that uses the logistic sigmoid function to predict classification. The basic idea is to find the coefficient vector \(w\) such that the model fits the logistic function \(\frac{1}{1 + e^{-w'X}}\). A quick look at the graph (taken from Wikipedia) shows that when y is one, we need our estimator to predict w'X to be infinity, and vice versa.


Now if we want to fit labels [-1, 1], the sigmoid function becomes \(\frac{1}{1 + e^{-y w'X}}\). The logistic loss function is given by \(\log(1 + e^{-y w'X})\). Intuitively this seems correct, because when y is 1 we need our estimator to predict w'X to be infinity in order to suffer zero loss. Similarly, when y is -1, we need our estimator to predict w'X to be negative infinity. Our basic focus is to minimize this loss.
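The loss above, together with its gradient (the quantity an lbfgs-style solver needs), can be sketched in a few lines. This is a minimal pure-Python illustration, not the scikit-learn implementation: the function names are made up for this example, there is no intercept and no penalty term, and the labels are assumed to be -1 or +1.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def margin(w, x_i, y_i):
    """y * w'x for a single sample."""
    return y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))


def logistic_loss(w, X, y):
    """Sum over samples of log(1 + exp(-y * w'x)), labels in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        m = margin(w, x_i, y_i)
        # log(1 + exp(-m)), written so that exp() never overflows
        if m > 0:
            total += math.log1p(math.exp(-m))
        else:
            total += -m + math.log1p(math.exp(m))
    return total


def logistic_gradient(w, X, y):
    """Gradient of logistic_loss with respect to w."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        # d/dw log(1 + exp(-m)) = -y * x * sigmoid(-m)
        coeff = -y_i * sigmoid(-margin(w, x_i, y_i))
        for j, x_j in enumerate(x_i):
            grad[j] += coeff * x_j
    return grad
```

With all coefficients at zero every sample contributes log(2) to the loss, and the loss shrinks towards zero as y * w'x grows, which matches the intuition in the paragraph above.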

2. How can this be done?
This can be done either using block coordinate descent methods, like lightning does, or using the inbuilt solvers that scipy provides, like newton-cg and lbfgs. For the newton-cg solver we need the Hessian, or more simply the matrix of second derivatives of the loss, and for the lbfgs solver we need the gradient vector. If you are too lazy to do the math (like me?), you can obtain the values from here, Hessian

3. Doesn’t scikit-learn have a Logistic Regression already?
Oh well it does, but it is dependent on an external library called Liblinear. There are two major problems with this.
a] Warm start: one cannot warm start with liblinear, since it does not have a coefficient parameter, unless we patch the shipped liblinear code.
b] Penalization of intercept. Penalization is done so that the estimator does not overfit the data, however the intercept is independent of the data (which can be considered analogous to a column of ones), and so it does not make much sense to penalize it.

4. Things that I learnt
Apart from adding a warm start (there seems to be a sufficient gain on large datasets) and not penalizing the intercept,
a] refit parameter – generally after cross-validating, we take the average of the scores obtained across all folds, and the final fit is done according to the hyperparameter (in this case C) that corresponds to the best score. However, Gael suggested that one could take the best hyperparameter across every fold (in terms of score) and average these coefficients and hyperparameters. This would avoid the final refit.
b] Parallel OvA – For each label, we perform an OvA, that is, we convert y into 1 for the label in question and into -1 for the other labels. There is a Parallel loop across all labels and folds, and this is supposed to make it faster.
c] Class weight support: The easiest way to do it is to convert to a per-sample weight and multiply it with the loss for each sample. But we faced a small problem when the following three conditions occur together: a class weight dict, the liblinear solver and a multiclass problem, since liblinear does not support sample weights.
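Points b] and c] come down to two small transformations of the label vector. The sketch below shows both in plain Python; the helper names are hypothetical, and defaulting unlisted classes to a weight of 1.0 is an assumption of this example rather than something stated in the post.

```python
def one_vs_all_labels(y, positive_label):
    """Map the label in question to +1 and every other label to -1."""
    return [1 if label == positive_label else -1 for label in y]


def sample_weights_from_class_weight(y, class_weight):
    """Expand a {label: weight} dict into one weight per sample.

    Labels missing from the dict get weight 1.0 (assumption of this sketch).
    Each sample's loss would then be multiplied by its weight.
    """
    return [class_weight.get(label, 1.0) for label in y]


y = ["cat", "dog", "bird", "cat"]
binary_targets = {label: one_vs_all_labels(y, label) for label in sorted(set(y))}
weights = sample_weights_from_class_weight(y, {"cat": 0.5, "bird": 2.0})
```

Each entry of `binary_targets` is an independent binary problem, which is what makes a parallel loop over labels (and folds) possible.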

5. Problems that I faced.
a] The fit_intercept=True case turned out to be considerably slower than the fit_intercept=False case. Gael's hunch was that this is because the intercept varies differently compared to the data. We tried different things, such as preconditioning the intercept, i.e. dividing the initial coefficient by the square root of the diagonal of the Hessian, but it did not work, and it took one and a half days of time.

b] Using liblinear as an optimiser or a solver for the OvA case.

i] If we use liblinear as a solver, it means supplying the multi-label problem directly to liblinear.train. This would affect the parallelism, and we are not sure if liblinear internally works the same way as we think it does. So after a hectic day of refactoring code, we finally decided (sigh) that using liblinear as an optimiser is better (i.e. we convert the labels to 1 and -1). For more details about Gael's comment, you can have a look at this

Phew, this was a long post and I’m not sure if I typed everything as I wanted to. This is what I plan to accomplish in the coming month
1. Finish work on Larsmans PR
2. Look at glmnet for further improvements in the cd_fast code.
3. ElasticNet regularisation from Lightning.

by Manoj Kumar at July 28, 2014 11:36 PM

Asra Nizami
(Astropy student)

Note to self: testing is really important

It's been an interesting couple of weeks. As I mentioned in my last blog post, I've been working on integrating WCSAxes into APLpy. It was going pretty well in the beginning but now things are starting to get complicated (just to be clear, I'm not complaining!). APLpy isn't a very big package but it still has a lot of features which I need to make sure work as I integrate WCSAxes into it. *THAT* is the tricky part. This means extensive testing after small re-writes of code to make sure new features work and I haven't broken something that was working previously. Well, that's the ideal workflow - I should have been doing extensive testing this way but sigh, I didn't. I've almost (I feel like the coyote who's *almost* caught the roadrunner as I say this) made most of the new code work but over the past few days I got so wrapped up in cleaning bits of code which I didn't test out fully that I broke a pretty important part of APLpy, plotting data cubes. I'm not sure where I broke it and my commit history is a mess (because I kept committing changes without testing them out fully again) so git bisect didn't help me identify where the problem started either.

This week is Eid, an Islamic holiday (Eid Mubarak to any Muslims reading this!!) so I won't have that much time to figure out where I introduced this bug, but once I do, I'll probably stock up on Pepsi, blare loud music in the background and carefully comb through my branch to examine all the code changes. That's it for now! :)

by Asra Nizami at July 28, 2014 12:28 PM

Rajeev S
(GNU Mailman student)

The YACC Cleans up my Code!

Interesting week! Even in my wildest dreams, I never hoped that I would get my hands wet again with the beautiful craft of writing a compiler. I did this as part of my BTech curriculum, but did not expect it to pop up during my Summer of Code.

My project included building a query interface and thus, obviously, a query language. Since the intended language was simple and easy, I planned to implement a basic parsing technique like a recursive descent parser. Once I started coding, I found that recursive descent parsing was painful, and I ended up writing naive code to process a string array. This was error prone and very hard to manage, as it led to far too many index errors and, more dangerously, silently ignored extra parameters.

My next option was to use a regular expression based parser, which I built successfully and which worked fine. I used the config.ini file to store the regular expressions for each command and used them to validate the command. This approach did not break anywhere, but the code looked rather ugly, with far too many `pops`. Another huge drawback was the lack of good error reporting, as I could not report where the command went wrong. All I reported was that the syntax was wrong, and I printed the command usage string. I asked in the Python IRC channel about a way to print better error messages for failed regular expressions, and it was there that I got the suggestion to use a parser. Since this is a PSF project, I could not quite ignore the suggestion from the Python IRC. Further, the regular expression approach only performed the validation part; I had to hard code the command parsing part. This can create difficulties in extending the project in the future.

I did some research on the parser libraries available for Python and settled on PLY, a project that has good documentation and plenty of examples. I tried a few sample grammars and began writing the parser for my project. I used the class approach of PLY. I had to choose between a common parser for the whole project and a separate parser for each command; I settled for the latter, on the assumption that it would be cleaner and easier to manage and extend. In two days' time, I completed the parser for each command and also rebuilt the environment variable management part. I completely stashed the decorators used for the command validation and pre-processing.
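The main gain of tokenizing and parsing over matching one big regular expression is that the parser knows which token it was looking at when things went wrong. The sketch below illustrates that idea with a tiny hand-written tokenizer and parser using only the standard library; it does not use PLY, and the `show ... where key = "value"` command syntax is a hypothetical example, not the project's actual grammar.

```python
import re

# One alternative per token kind; lastgroup tells us which one matched.
TOKEN_RE = re.compile(r'(?P<WORD>[A-Za-z_]+)|(?P<STRING>"[^"]*")|(?P<EQ>=)')


class CommandSyntaxError(Exception):
    """Raised with a message that says where the command went wrong."""


def tokenize(text):
    """Return a list of (kind, text, column) tuples; columns are 1-based."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise CommandSyntaxError(
                "unexpected character %r at column %d" % (text[pos], pos + 1))
        tokens.append((match.lastgroup, match.group(), pos + 1))
        pos = match.end()
    return tokens


def parse_show(text):
    """Parse the hypothetical command: show <scope> where <key> = "<value>"."""
    expected = [("WORD", "show"), ("WORD", None), ("WORD", "where"),
                ("WORD", None), ("EQ", None), ("STRING", None)]
    tokens = tokenize(text)
    for index, (kind, literal) in enumerate(expected):
        wanted = literal or kind
        if index >= len(tokens):
            raise CommandSyntaxError(
                "unexpected end of command, expected %s" % wanted)
        tok_kind, tok_text, column = tokens[index]
        if tok_kind != kind or (literal is not None and tok_text != literal):
            raise CommandSyntaxError(
                "expected %s at column %d, got %r" % (wanted, column, tok_text))
    if len(tokens) > len(expected):
        # Extra parameters are rejected instead of being silently ignored.
        _, tok_text, column = tokens[len(expected)]
        raise CommandSyntaxError(
            "unexpected extra input %r at column %d" % (tok_text, column))
    return {"scope": tokens[1][1], "key": tokens[3][1],
            "value": tokens[5][1][1:-1]}
```

A parser generator such as PLY replaces the hand-written `expected` list with grammar rules, but the error-reporting benefit is the same: the failing token and its position are available when the error is raised.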

The code is now a lot cleaner and more readable than before. Errors are beautifully reported and handled, and all works fine. Time is running fast, and I plan to complete my tasks at least a week before the deadline of 11/08/2014. I have announced on the mm-dev list that I am working with 6th August as my soft deadline.

Last but not the least, blogging from the Kanyakumari Banglore Island Express, thanks to the Indian Railways!

by Rajeev S at July 28, 2014 05:47 AM

(SunPy student)

Nearing the Close

Hello folks.

While admittedly there has not been too much happening on the online front, a lot has been happening on the offline front around here!

There has been talk of introducing some more FrameAttributes to support the HeliographicStonyhurst to Heliocentric and vice-versa transformations. The previous part with the introduction of the new get_hpc_distance() method paid off handsomely – my mentor’s calculations got the HGS to HP transform to work properly. The next step is to fix some bugs and introduce the new FrameAttributes.

I have been learning Flask from Miguel Grinberg’s Mega Tutorial on the advice of my mentor and it has been a very satisfying journey so far. I have created a repo on GitHub to learn while committing. Flask is a microframework for Python based on Werkzeug and Jinja2. It is less complex in comparison with Django, and learning it really is fun. There are also Flask extensions to help one with the job of interfacing with other apps such as SQLAlchemy.

I have appeared for a couple of job interviews for some good startups in these two weeks. The GSoC project definitely has helped me gain some traction. It gets me noticed where I would earlier not have had a chance.

That’s all for now! My apologies if this post is too small, peeps.

by xpritish at July 28, 2014 05:37 AM

Elana Hashman
(OpenHatch student)

Security Holes and Django Forms

"Hey—I have this crazy idea. What if the POSTs django-inplaceedit processes are totally exploitable?"

Security Hilarity

And so our journey down the rabbit hole began. It turns out that you should probably never include little-used open source software that modifies entries in your database while making the assumption that it's secure.

I believe the following code speaks for itself:

def test_will_inplaceedit_allow_us_to_pwn_ourselves(self):
    # Asheesh: "total cost of pwnership: 1 test"
    # note: user paulproteus has poor password hygiene
    u = User.objects.create(username='paulproteus', password='password')

    # assumption: INPLACEEDIT_SAVE_URL stands for django-inplaceedit's
    # save endpoint; the actual URL is not shown in this post
    self.client.post(
        INPLACEEDIT_SAVE_URL,
        {
            "app_label": "auth",      # the django app
            "module_name": "user",    # the django table
            "field_name": "username", # the field name
            "obj_id": u.pk,           # the pk
            "value": '"LOLPWNED"'     # new value
        })

    self.assertEqual(User.objects.get(pk=u.pk).username, "LOLPWNED")

Carefully crafting this payload involved very little effort; watching this test pass was a little bit horrifying.

The problem stems from the way django-inplaceedit sets up user permissions; their docs give an example of only allowing superusers edit access, which would suggest an internal or authenticated use case. But we want our bugsets to be publicly editable... and given that this is the way we set up our permissions, we also implicitly gave the entire public internet the ability to edit our entire database. Wow.

Lesson learned: trust no one. Or, at least restrict access to the bugsets app tables. For now, we've disabled this by turning off edit access in production. We also plan to make an upstream pull request to the django-inplaceedit docs to help others avoid this kind of security issue in the future. We also created (and addressed) an issue to track this fun adventure.
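Restricting access to specific tables amounts to checking every incoming (app, table, field) triple against an explicit allowlist before anything is written. The sketch below shows that check as a plain function; it is not django-inplaceedit's permission API, and the field names listed for the bugsets app are hypothetical placeholders.

```python
# Hypothetical allowlist: only these fields may be edited in place.
# The field names are placeholders, not the real AnnotatedBug fields.
EDITABLE_FIELDS = {
    ("bugsets", "annotatedbug"): {"status", "assigned_to"},
}


def is_edit_allowed(app_label, module_name, field_name,
                    editable=EDITABLE_FIELDS):
    """Return True only for fields that were explicitly opened up."""
    allowed_fields = editable.get((app_label, module_name), set())
    return field_name in allowed_fields


# The payload from the test above would be rejected:
payload = {"app_label": "auth", "module_name": "user",
           "field_name": "username"}
rejected = not is_edit_allowed(payload["app_label"],
                               payload["module_name"],
                               payload["field_name"])
```

The point of the design is that anything not listed is denied by default, so adding a new model to the project never silently makes it publicly editable.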


Development for this period started off a little delayed due to lack of Asheesh availability, and development on the main project was delayed slightly by the GIANT SECURITY HOLE django-inplaceedit introduced, but we sure wrote a lot of code in the end.

gsoc14.10 and gsoc14.11 deliverables

These two weeks saw the end of the long-standing issue995:

  • Fix django-inplaceedit CSS to make the screen usable
  • YAGNI simplification of AnnotatedBug fields

This milestone was pushed back again due to difficulty with CSS and issues that continually arose every time existing ones were fixed. Here's a code sample that reflects the majority of these two weeks' development.

Figure 1: CSS successfully coerced into exhibiting beauty

gsoc14.12 deliverables

This milestone saw major development of the create/edit view (tracking issue).

Asheesh and I fought with Django forms, many tests, and mostly emerged victorious.

Figure 2: Create a bug set

Figure 3: You can't inject javascript!

What's next

There are a number of features we've discussed implementing for the last few weeks. But here's what we have planned:

  • Finishing up the edit view
  • Update list view to include edit links (while logged in)
  • Simultaneous editing: async notification if loaded page values have changed
  • Security fixes to re-enable editing in production
  • Refresh button for associated metadata we have in the OpenHatch bug db
  • User testing
  • Documentation


There have been two main issues the past few weeks: CSS issues and availability. As I've lamented about CSS in my last post, I'll discuss the availability problem in a bit more detail.

Since returning from OSBridge, Asheesh and I have been a bit busy playing catchup with school and work. In particular, since Asheesh' availability has been even less than that of the dreaded "working adult with a life outside of job," we've had a lot less time to sync up and pair. And due to catching up on assignments and course wrapup, I was able to complete less GSoC work than I might have if Asheesh was babysitting me in some pairing sessions.

In regard to the CSS, in the end I managed to figure it out! A big thanks to Rachelle Saunders, who I met at a local Women Who Code meetup, and was able to solve all my woes in all of 5 fateful minutes. You rock, lady! (A moment of silence for all those hours I could have spent fighting with this on my own. Tears shed for the loss of this valuable learning experience.)

by Elana Hashman at July 28, 2014 03:00 AM

July 27, 2014

M S Suraj
(Vispy student)

Visuals tests – GSoC Week 9

Mandatory blog post –

This week has been filled with writing tests for the Visuals.

  • Wrote a sort of test-bed for Visuals in the form of a TestingCanvas object.
  • Used it to write tests for Ellipse, Polygon and RegularPolygon visuals.
  • The tests draw a visual on the canvas, retrieve an appropriate reference image from a test-data repository and compare the two.
  • Minor allowances have been made to ignore small differences in the corners.
  • The work is available here – .
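The comparison step described in the list above can be sketched as follows. This is a simplified pure-Python illustration, not the Vispy test code: images are assumed to be 2D lists of grayscale integers, "corners" is interpreted as the corners of the image, and the function name and parameters are made up for this example.

```python
def images_match(image, reference, corner=1, tolerance=0):
    """Compare two same-sized grayscale images, skipping the corners.

    corner    -- side length (in pixels) of the square ignored in each corner
    tolerance -- largest per-pixel difference that still counts as equal
    """
    rows, cols = len(reference), len(reference[0])
    if len(image) != rows or any(len(row) != cols for row in image):
        return False
    for r in range(rows):
        for c in range(cols):
            near_row_edge = r < corner or r >= rows - corner
            near_col_edge = c < corner or c >= cols - corner
            if near_row_edge and near_col_edge:
                continue  # inside one of the four ignored corner squares
            if abs(image[r][c] - reference[r][c]) > tolerance:
                return False
    return True
```

A test would render the visual, load the reference image, and assert that `images_match` returns True.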

Although it took a lot of iterations to get this right, it was fun to code them and learn how tests are written, how Travis, the continuous integration platform, works and so on.

Next target – Make the visuals reactive!

by mssurajkaiga at July 27, 2014 11:21 PM

Brigitta Sipőcz
(Astropy student)

Hamzeh Alsalhi
(scikit-learn student)

Sparse Output Dummy Classifier

The Scikit-learn dummy classifier is a simple way to get naive predictions based only on the target data of your dataset. It has four strategies of operation.

  • constant - always predict a value manually specified by the user
  • uniform - label each example with a label chosen uniformly at random from the target data given
  • stratified  - label the examples with the class distribution seen in the training data
  • most-frequent - always predict the mode of the target data
The dummy classifier has built-in support for multilabel-multioutput data. This week I made pull request #3438, which introduces support for sparsely formatted output data. This is useful because memory consumption can be vastly improved when the data is highly sparse. Below is a benchmark of these changes, with two memory consumption results graphed for each of the four strategies: once with sparsely formatted target data and once with densely formatted data as the control.

Benchmark and Dataset

I used the Eurlex eurovoc dataset available here in libsvm format for use with the following script.  The benchmark script will let you recreate the results in this post easily. When run with the python module memory_profiler it measures the total memory consumed when doing an initialization of a dummy classifier, along with a fit and predict on the Eurlex data.

The dataset used has approximately 17,000 samples and 4000 classes for the training target data, and the test data is similar. They both have sparsity of 0.001.
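To get a feeling for why the format matters at this size, the storage needed for one target matrix can be estimated with simple arithmetic. The sketch below assumes that 0.001 is the fraction of non-zero entries, that values are stored as 8-byte floats, and that the sparse format is CSR with 4-byte indices; these are assumptions of the example, and the figures are for a single matrix, not the total process memory measured in the benchmark.

```python
def dense_bytes(rows, cols, itemsize=8):
    """Bytes needed to store every entry of a dense matrix."""
    return rows * cols * itemsize


def csr_bytes(rows, cols, density, itemsize=8, index_size=4):
    """Approximate bytes for a CSR matrix: data + column indices + row pointers."""
    nnz = int(round(rows * cols * density))
    return nnz * (itemsize + index_size) + (rows + 1) * index_size


n_samples, n_classes, density = 17000, 4000, 0.001
dense = dense_bytes(n_samples, n_classes)         # 544,000,000 bytes
sparse = csr_bytes(n_samples, n_classes, density)  # 884,004 bytes
print("dense:  %.1f MiB" % (dense / 2.0 ** 20))
print("sparse: %.2f MiB" % (sparse / 2.0 ** 20))
```

Under these assumptions a single dense copy of the targets takes roughly 519 MiB, while the sparse copy stays under 1 MiB.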

Results Visualized

Constant Results: Dense 1250 MiB, Sparse 300 MiB

The constants used in the fit have a level of sparsity similar to the data because they were chosen as an arbitrary row from the target data.

Uniform Results: Dense 1350 MiB, Sparse 1200 MiB

Stratified Results: Dense 2300 MiB, Sparse 1350 MiB

Most-Frequent Results: Dense 1300 MiB, Sparse 300 MiB


We can see that in all cases except for Uniform we get significant memory improvements by supporting sparse matrices. The sparse matrix implementation for uniform is not useful because of the dense nature of the output, even when the input shows high levels of sparsity. It is possible this case will be revised to warn the user or even throw an error.

Remaining Work

There is work to be done on this pull request to make the predict function faster in the stratified and uniform cases when using sparse matrices. Although the uniform case is not important in itself, the underlying code for generating sparse random matrices is also used in the stratified case. Any improvements to uniform will therefore come for free if the stratified case speed is improved.

Another upcoming focus is to return to the sparse output knn pull request and make some improvements. There will be code written in the sparse output dummy pull request for gathering a class distribution from a sparse target matrix that can be abstracted to a utility function and will be reusable in the knn pull request.

by Hamzeh at July 27, 2014 08:14 PM

(Statsmodels student)

Tempita and {S,D,C,Z} BLAS Functions


Developing a fast version of the multivariate Kalman filter for Statsmodels has required dipping into Cython, for fast loops, direct access to memory, and the ability to directly call the Fortran BLAS libraries.

Once you do this, you have to start worrying about the datatype that you're working with. Numpy and Scipy typically do this worrying for you, so that you can, for example, take the dot product of a single precision array with a double complex array, and no problems will result.

In [1]:
import numpy as np
x = np.array([1,2,3,4], dtype=np.float32)
y = np.array([1,2,3,4], dtype=np.complex128) + 1j
z = np.dot(x, y)
print z, z.dtype
(30+10j) complex128

Whereas if you use the scipy direct calls to the BLAS libraries with wrong or differing datatypes, in the best case scenario it will perform casts to the required datatype. Notice the warning, below, and the truncation of the complex part.

In [2]:
from scipy.linalg.blas import ddot
z = ddot(x,y)
print z
-c:2: ComplexWarning: Casting complex values to real discards the imaginary part

The scipy.linalg.blas functions do some checking to prevent the BLAS library from getting an argument with the wrong datatype, but in the worst case, if something slips through, it could crash Python with a segmentation fault.

Types and the Kalman filter

This matters for the Cython-based Kalman filter for two reasons. The first is that we likely want to behave as nicely as numpy dot and allow the filter to run on any datatype. The second is that numerical derivatives in Statsmodels are computed via complex step differentiation, which requires the function to be able to deal with at least the double complex case.
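Complex step differentiation approximates a derivative as f'(x) ≈ Im(f(x + ih)) / h, which is why the function being differentiated must accept complex input. A minimal standard-library sketch of the idea is shown below; it illustrates the technique in general and is not the Statsmodels implementation.

```python
import cmath


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) for a real x using a tiny imaginary perturbation.

    f must accept and return complex numbers. Unlike finite differences,
    no subtraction of nearly equal values occurs, so h can be extremely small.
    """
    return f(complex(x, h)).imag / h


def cube(z):
    return z * z * z


d_exp = complex_step_derivative(cmath.exp, 1.0)  # close to e
d_cube = complex_step_derivative(cube, 2.0)      # close to 3 * 2**2 = 12
```

If the function internally cast its input to a real type, the imaginary part would be discarded and the result would be zero, which is the failure mode a double-complex code path avoids.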

This means that all of the Cython functions need to be duplicated four times.

Fortunately, all of the underlying BLAS functions are structured with a prefix at the beginning indicating the datatype, followed by the call. For example, dgemm performs matrix multiplication on double precision arrays, whereas zgemm performs matrix multiplication on double precision complex arrays.

The need for relatively simple duplication means that this is a great place for templating, and it turns out that Cython has a great, simple templating engine built in: Tempita.

As an example, take a look at the code below. This generates four functions: sselect_state_cov, dselect_state_cov, cselect_state_cov, and zselect_state_cov which handle part of the Kalman filtering operations for the four different datatypes.

{{py:

TYPES = {
    "s": ("np.float32_t", "np.float32", "np.NPY_FLOAT32"),
    "d": ("np.float64_t", "float", "np.NPY_FLOAT64"),
    "c": ("np.complex64_t", "np.complex64", "np.NPY_COMPLEX64"),
    "z": ("np.complex128_t", "complex", "np.NPY_COMPLEX128"),
}

}}

{{for prefix, types in TYPES.items()}}
{{py:cython_type, dtype, typenum = types}}

# ### Selected state covariance matrix
cdef int {{prefix}}select_state_cov(int k_states, int k_posdef,
                                    {{cython_type}} * tmp,
                                    {{cython_type}} * selection,
                                    {{cython_type}} * state_cov,
                                    {{cython_type}} * selected_state_cov):
    cdef:
        {{cython_type}} alpha = 1.0
        {{cython_type}} beta = 0.0

    # #### Calculate selected state covariance matrix  
    # $Q_t^* = R_t Q_t R_t'$
    # Combine the selection matrix and the state covariance matrix to get
    # the simplified (but possibly singular) "selected" state covariance
    # matrix (see e.g. Durbin and Koopman p. 43)

    # `tmp0` array used here, dimension $(m \times r)$  

    # $\\#_0 = 1.0 * R_t Q_t$  
    # $(m \times r) = (m \times r) (r \times r)$
    {{prefix}}gemm("N", "N", &k_states, &k_posdef, &k_posdef,
          &alpha, selection, &k_states,
                  state_cov, &k_posdef,
          &beta, tmp, &k_states)
    # $Q_t^* = 1.0 * \\#_0 R_t'$  
    # $(m \times m) = (m \times r) (m \times r)'$
    {{prefix}}gemm("N", "T", &k_states, &k_states, &k_posdef,
          &alpha, tmp, &k_states,
                  selection, &k_states,
          &beta, selected_state_cov, &k_states)

{{endfor}}


Of course this merely provides the capability to support multiple datatypes; wrapping that in a user-friendly way is handled in another part of the project.
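The effect of the template loop can be imitated with the standard library to see what gets generated. The sketch below uses `string.Template` as a stand-in for Tempita (so the placeholder syntax differs), and the shortened function signature is only illustrative.

```python
from string import Template

# BLAS prefix -> Cython type, mirroring the first column of TYPES above.
TYPES = {
    "s": "np.float32_t",
    "d": "np.float64_t",
    "c": "np.complex64_t",
    "z": "np.complex128_t",
}

# Shortened, illustrative signature; the real function takes more arguments.
SIGNATURE = Template(
    "cdef int ${prefix}select_state_cov(${cython_type} * selection, "
    "${cython_type} * state_cov)"
)


def render_all(template, types):
    """Render the template once per datatype prefix, in sorted prefix order."""
    return [template.substitute(prefix=prefix, cython_type=cython_type)
            for prefix, cython_type in sorted(types.items())]


for line in render_all(SIGNATURE, TYPES):
    print(line)
```

Each rendered line corresponds to one of the four generated functions, with the prefix selecting the matching BLAS routine (for example `dgemm` for the `d` variant).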

by Chad Fulton at July 27, 2014 09:20 AM

Mustafa Furkan Kaptan
(Vispy student)

VNC Backend

Hello reader,

Last week we finished the Javascript part of our IPython-VNC backend. One of the tricky parts was importing the Javascript code into the IPython notebook as soon as the user creates a vispy canvas. After solving this issue, we are now able to handle user events like mouse move, key press and mouse wheel.

About the implementation: we listen to the HTML canvas in the IPython notebook with Javascript. As soon as an event is detected, we generate an event with the proper name and type according to our JSON spec. The generated event is sent to vispy with the widget's `this.send()` method. This method allows us to send a message immediately from the frontend to the backend. When a message is received from the frontend, the backend generates the appropriate vispy event. For instance, if we receive a mousepress event from the frontend, we generate a vispy_mousepress event in the backend. That way, the user can use the connected `on_mouse_press` callback in the vispy canvas.
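The backend half of this flow, turning a JSON message into a call to the right callback, can be sketched as below. The message layout (`type`, `pos`) and the `FakeCanvas` class are hypothetical stand-ins for the project's JSON spec and for a vispy canvas; only the `on_mouse_press` callback name comes from the description above.

```python
import json

# Hypothetical mapping from frontend event type to canvas callback name.
HANDLERS = {
    "mousepress": "on_mouse_press",
    "mousemove": "on_mouse_move",
    "keypress": "on_key_press",
}


class FakeCanvas(object):
    """Stand-in for a vispy canvas; records the events it receives."""

    def __init__(self):
        self.calls = []

    def on_mouse_press(self, event):
        self.calls.append(("press", event["pos"]))


def dispatch(canvas, message):
    """Decode one JSON message and forward it to the matching callback.

    Returns True if a callback was invoked, False if the event type is
    unknown or the canvas does not define the callback.
    """
    event = json.loads(message)
    handler_name = HANDLERS.get(event.get("type"))
    handler = getattr(canvas, handler_name, None) if handler_name else None
    if handler is None:
        return False
    handler(event)
    return True
```

Keeping the mapping in one table means a new event type only needs one new entry plus a callback on the canvas.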

We embedded an IPython DOMWidget into our VNC backend to ease the communication between JS and Python. We want this backend to be very easy to use, so the user never needs to deal with listening for events, sending them to Python, creating an IPython widget or even coding in Javascript.

There are still some problems though. For the mouse move event, Javascript captures every position of the mouse, i.e. every single pixel the mouse was on. So generating and sending mouse move events takes a lot of time. For example, if we try to do a drag operation, a lag occurs because of this. Also, it is nearly impossible to capture the screen, convert it to PNG, encode it with base64 and send it through the websocket at the speed of mouse movement. This is another reason why we have this lag.
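One common way to reduce this kind of lag is to throttle the move events, forwarding at most one per time interval and dropping the rest. The post does not say whether the backend does this, so the sketch below is only an illustration of the general technique; the class name and the millisecond timestamps are assumptions of the example.

```python
class MoveThrottle(object):
    """Let through at most one event per `min_interval` time units."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_sent = None

    def should_send(self, timestamp):
        """Return True if an event arriving at `timestamp` should be forwarded."""
        if (self._last_sent is None
                or timestamp - self._last_sent >= self.min_interval):
            self._last_sent = timestamp
            return True
        return False


# Timestamps in milliseconds: only events at least 50 ms apart are forwarded.
throttle = MoveThrottle(min_interval=50)
forwarded = [t for t in (0, 10, 40, 50, 60, 100) if throttle.should_send(t)]
```

The trade-off is that a drag looks slightly coarser, in exchange for far fewer render, encode and send cycles.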

Another problem is that we cannot use a Python timer. In vispy we use the backend's timer (QTimer for Qt, etc.). But here, it is not possible to have a QTimer and IPython's event loop at the same time. We have to think of a different way to have a timer, or else we have to let go of the timer-based animation option.

See you next post!

by Mustafa Kaptan at July 27, 2014 12:45 AM

July 23, 2014

Mainak Jas
(MNE-Python student)

mne report PR merged

I added a __repr__ string for the inverse operator by creating an InverseOperator class which inherits from dict in a separate PR. This was integrated into the mne report and @agramfort pitched in to help me in fixing py3k bugs. They weren't easy as we had to finally hack into the tempita code. There was a nasty bug whereby Chrome was modifying the report by replacing all relative paths with absolute path names. It took a while to track it down and we fixed it by using some javascript in this commit. The tests were improved and coverage went up to 85% from 46% and the test time went down to 6 seconds. This was achieved by using decorators and an environment variable so that the tests would not call the visualization functions which are already tested in another module. All of this taken together, the mne report PR got merged finally. A report generated from the command for the sample dataset can be found here. Here is a screenshot of the html:
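The idea of a dict subclass that only adds a readable `__repr__` can be sketched as follows. The summary format and the keys used here are made up for illustration and are not what MNE-Python actually prints.

```python
class InverseOperator(dict):
    """A dict that prints a short summary instead of its full contents."""

    def __repr__(self):
        keys = ", ".join(sorted(self))
        return "<InverseOperator | %d entries: %s>" % (len(self), keys)


inv = InverseOperator(method="MNE", nsource=10)
summary = repr(inv)
```

Because the class still inherits from dict, existing code that indexes the operator by key keeps working unchanged.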

Next, I focussed on using joblib to make the rendering parallel. This was addressed in the PR here. It seemed that the processing was IO bound rather than CPU bound, which suggested that we should process the files in batches rather than one file at a time. Nonetheless, we could achieve an improvement of up to 2x in speed for the *.fif files, but the largest improvements in speed came from MRI, which burned all the cores requested.

Finally, there were a few bugs with whitespace in section names in the add_section method of the Report class. This was addressed in this PR.

I wrote to the MNE mailing list asking neuroscientists for feedback on this new tool. One important suggestion was to reorder sections so that they followed a natural ordering, i.e., raw -> events -> evoked -> cov -> trans -> MRI -> forward -> inverse. This was rather easy to do. I addressed this in this PR.
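Sorting sections into this natural order is a matter of ranking each name against a fixed list. The sketch below shows one way to do it; the function name is hypothetical, matching is assumed to be case-insensitive, and placing unknown sections at the end is a choice made for this example.

```python
SECTION_ORDER = ["raw", "events", "evoked", "cov", "trans",
                 "mri", "forward", "inverse"]


def sort_sections(sections):
    """Return sections in the natural order; unknown names go last.

    sorted() is stable, so unknown sections keep their relative order.
    """
    rank = {name: index for index, name in enumerate(SECTION_ORDER)}
    return sorted(sections,
                  key=lambda name: rank.get(name.lower(), len(SECTION_ORDER)))


ordered = sort_sections(["inverse", "MRI", "raw", "custom", "cov"])
```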

Finally, @dengemann reported some corner cases where the ordering of sections was inconsistent if the saving was done twice. This I fixed in this PR.

The next steps are as follows:
  • Allow add_section to take mlab figures, i.e., Mayavi figures and also text.
  • Introduce interactivity in the report.
  • Get as much feedback as possible and fix bugs.
The mne report will be included in the next release 0.8, which will happen rather soon.

by Mainak at July 23, 2014 07:41 PM

July 15, 2014

Rishabh Sharma
(SunPy student)

Near Completion

Well an update on the work.
Unified Downloader
All the major code appears to have been written down. The Unified Downloader retains the ability to run semi-SQL queries (VSO type). It has a new response object: a UnifiedResponse object is now returned from the query method.
No major code base problems were faced; I cannot even remember facing any git issues.

JSOC Attributes:
My work is in the final stage of submission here. PR reviewing is also nearly complete.

Other PRs:
Some minuscule issues, otherwise everything is up.


by rishabhsharmagunner at July 15, 2014 03:06 PM

July 14, 2014

Mainak Jas
(MNE-Python student)

mpld3, python3.3, topomaps and more

We are approaching a merge for the 3000 line PR for the mne report command. One of the features that I incorporated last week was an interactive plot_events using mpld3, which combines the power of matplotlib and d3.js. Unfortunately, a recent PR breaks things, as mpld3 has not yet implemented ax.set_yticklabels() in its latest release. So we decided to leave out interactivity for now and push for a merge before the 0.8 release of MNE-Python.

Travis is not yet happy because of Python3.3 errors. I am going to use the six package to implement methods that have changed from Python2.7 to Python3.3. There were issues with os.path.abspath() as it does not recognize symlinks and therefore we replaced it with os.path.realpath().
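The difference between the two functions is easy to see with a temporary symlink: `os.path.abspath()` only normalizes the path string, while `os.path.realpath()` follows the link to its target. The sketch below assumes a platform where creating symlinks is permitted; the file names are arbitrary.

```python
import os
import tempfile


def compare_path_functions():
    """Return (abspath_equal, realpath_equal) for a symlink and its target."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.fif")
        open(target, "w").close()
        link = os.path.join(tmp, "link.fif")
        os.symlink(target, link)  # may require privileges on Windows
        abspath_equal = os.path.abspath(link) == os.path.abspath(target)
        realpath_equal = os.path.realpath(link) == os.path.realpath(target)
        return abspath_equal, realpath_equal


abspath_equal, realpath_equal = compare_path_functions()
```

Here `abspath_equal` is False because the link keeps its own name, while `realpath_equal` is True because both paths resolve to the same file.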

Finally, thinking of possible use cases, we realized that including the topographic maps (see figure below) of the evoked response would be useful for the report.

 The show parameter for the plotting function was somehow missing, so I fixed it in this PR. Finally, the plots were ordered by section rather than figure which was addressed in this commit.

The next steps are as follows:
  • Speed up the tests by using a decorator for the tests which uses a small string for the plot instead of making the actual plot. That way, we do not end up testing the plotting functions twice.
  • A separate PR for a repr for the inverse operator.
  • Miscellaneous fixes for merging the mega-PR: Python3.3 errors, not allowing multiple saves by throwing errors etc.
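The first item in the list, a decorator that swaps the real plot for a small string during tests, could look roughly like this. The environment variable name, the placeholder string and the decorated function are all hypothetical; the sketch only shows the mechanism.

```python
import functools
import os

TESTING_ENV_VAR = "REPORT_TESTING"  # hypothetical variable name


def placeholder_when_testing(func):
    """Skip the wrapped plotting call when the testing variable is set."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if os.environ.get(TESTING_ENV_VAR) == "true":
            return "<plot skipped during tests>"
        return func(*args, **kwargs)
    return wrapper


@placeholder_when_testing
def render_events_plot(n_events):
    # Stand-in for an expensive plotting call.
    return "<figure with %d events>" % n_events
```

The report tests can then set the variable once and exercise the surrounding logic quickly, leaving the plotting functions to be tested in their own module.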

by Mainak at July 14, 2014 01:03 PM

Richard Tsai
(SciPy/NumPy student)

Rewrite scipy.cluster.hierarchy

After rewriting cluster.vq, I am rewriting the underlying implementation of cluster.hierarchy in Cython now.

Some Concerns about Cython

The reason why I use Cython to rewrite these modules is that Cython code is more maintainable, especially when using NumPy’s ndarray. Cython provides some easy-to-use and efficient mechanisms to access Python buffer memory, such as Typed Memoryview. However, if you need to iterate through the whole array, it is a bit slow. This is because Cython will translate A[i, j] into something like *(A.data + i * A.strides[0] + j * A.strides[1]), i.e. it needs to calculate the offset in the array data buffer on every array access. Consider the following C code.

int i, j;
double s;
double *current_row;

/* method 1: walk a row pointer, one pointer update per row */
s = 0;
current_row = (double *)A.data;
for(i = 0; i < A.shape[0]; ++i) {
    for(j = 0; j < A.shape[1]; ++j)
        s += current_row[j];
    current_row += A.shape[1];
}

/* method 2: recompute the full offset on every access */
s = 0;
for(i = 0; i < A.shape[0]; ++i)
    for(j = 0; j < A.shape[1]; ++j)
        s += *((double *)A.data + i * A.shape[1] + j);

The original C implementation uses method 1 shown above, which is much more efficient than method 2, which is similar to the C code that Cython generates for memoryview accesses. Of course method 1 can be adopted in Cython, but the neatness and maintainability of the Cython code would suffer. In fact, that is just how I implemented _vq last month. But _vq has only two public functions while _hierarchy has 14, and the algorithms in _hierarchy are more complicated than those in _vq. It would be unwise to just translate all the C code into Cython with loads of pointer operations. Fortunately, the time complexities of most functions in _hierarchy are \(O(n)\). I think the performance loss of these functions is not a big problem and they can just use memoryview to keep maintainability.

The New Implementation

The following table is a speed comparison of the original C implementation and the new Cython implementation. With the use of memoryviews, most functions have about a 30% performance loss. I applied some optimization strategies to a few functions, and those run faster than the original version. The most important function, linkage, has a 2.5x speedup on a dataset with 2000 points.


The new implementation is not finished yet. All tests pass, but there may still be some bugs, and it lacks documentation.

by Richard at July 14, 2014 09:19 AM

July 13, 2014

Asish Panda
(SunPy student)

Progress, 2 weeks after mid term

Hey all. Again a mandatory post. The PR I created is done and only needs merging now. I made a mess of my local git repo when a commit got corrupted. It did not end there, as it was followed by a weird error even after I removed the corrupted empty files. I tried to fix it but failed, so I had no choice but to clone the repo once again. That again led to an error while compiling, “python.h cannot compile”, even after installing all python-dev updates. I don’t remember exactly how, but it was a problem with pyc files hanging around. I removed them manually and it worked.

As for what I will be doing next, I think I will go with the wcs module. I have yet to start on it, and I think it will be much more interesting than my previous tasks.

I still have to merge the changes to maps and spectra, but I think that can wait until I get comments on how to improve them.

I guess that's all for now. I am not good at writing about the stuff I do, so check out the PR or my forked repo for details!


by kaichogami at July 13, 2014 07:48 AM

July 12, 2014

Shailesh Ahuja
(Astropy student)

Using specutils for reading and writing FITS files

I have just finished developing the readers and writers for FITS IRAF format described here. I am going to give a small tutorial on how the specutils package (affiliated with Astropy) can be used for the same, and also give some details of how it works.

To read a FITS file, simply use the read_fits.read_fits_spectrum1d method. This method might later be moved to the Spectrum1D class. The following code snippet gives an example:

from import read_fits
spectra = read_fits.read_fits_spectrum1d('example.fits')

Depending on the type of FITS file, the returned object can be different. For linear or log-linear formats, the object returned will be a Spectrum1D object. For multispec formats, the object will be a list of Spectrum1D objects. The dispersion and the flux of a spectrum can be accessed using spectrum1Dobject.dispersion and spectrum1Dobject.flux respectively. The dispersion is automatically computed based on the type of dispersion function defined in the header. Other methods, such as slicing the dispersion for a particular set of flux coordinates, will be supported soon.

To write a FITS file, use the write_fits.write method. Like the reader, this method may be moved to Spectrum1D later. The following code snippet gives an example:

from import write_fits
write_fits.write(spectra, 'example.fits')

The spectra object passed can be a list of Spectrum1D objects or just one Spectrum1D object. Depending on the type of spectra, the writer will write in either multispec or linear format respectively. If the spectra were read in using the reader, then the additional information from the read file will also be written.

There will be some more operations supported, but the focus will be on keeping things simple, and handling the complexities in the code itself.

by Shailesh Ahuja at July 12, 2014 01:23 PM

July 05, 2014

(SunPy student)

SunPy Database Browser Meeting on 5th July, 2014

Nabil, Stuart and I had a discussion regarding the SunPy Database Browser on the 5th of July, 2014. Various points were discussed during the meeting:
1. EIT files seem to have a problem being loaded into the SunPy database. This is because the SunPy database apparently does not allow the WAVEUNIT attribute to be None, and the sample EIT file has WAVEUNIT as None, so it cannot be loaded into the database. This can be due to one of the following two problems:
a. The header of SunPy's sample EIT file is faulty
b. The SunPy database module does not have the functionality to deal with a None WAVEUNIT, and hence with EIT files
So, Nabil has assigned me to download more EIT test files and check their headers. If the header of SunPy's sample file is faulty then we will replace SunPy's test file, else we will amend sunpy.database
2. Stuart has created a pull request for solar co-ordinate handling in Ginga. I shall pull in this PR, try to render SunPy's sample solar FITS files, and take screenshots to send to Eric.
3. Start work on the Search Box feature in the Database GUI

by Rajul Srivastava at July 05, 2014 05:51 PM

July 04, 2014

Tarun Gaba
(PyDy student)

GSoC 14: Midterm Evaluations have ended!

It has been a week since the midterm evaluations ended, and I am back to work after a small break (with permission from my mentor, of course!). I have been working on writing a test suite for the Dynamics Visualizer. This is the wrapping-up part of the visualizer for this GSoC. [Here](/blog/visualization/index.html?load=samples/scene_desc.json) is a visualization of a rolling disc that I prepared (it is slightly buggy, though). To view the animation, allow the visualization to load in the webpage (it should load automatically), and then hit the `Play Animation` button.

After writing some tests for the visualizer, I am going to start fleshing out the API for the module that provides IPython support to the visualizer. The main aim of this module is to make the visualizer interactive, in the sense that a user should be able to change all the variables from the GUI (which is rendered inside the notebook's output cell) and then rerun the simulations without having to write any code or execute any of the code manually. The data of the new simulations will be automatically fed into the visualizer, where it can be viewed as an animation.

This whole workflow will be very convenient for existing PyDy users as well as new ones. It will be particularly convenient for those who just want to play around with existing systems by changing the system variables and seeing how that affects the resulting animations. With the development of this module, as well as ongoing improvements in the other PyDy modules (by my fellow GSoC'ers from PyDy), we should be able to perform lightning-fast simulations of a system and view them on a canvas. I will keep posting about the new work I do, in more detail (once I actually start implementing new stuff!).

July 04, 2014 06:11 PM

June 30, 2014

Tyler Wade
(PyPy student)

UTF-8 Progress Update and Examples of Index Caching

Progress update:  I'm still working on replacing unicodes in PyPy with a UTF-8 implementation.  Things are progressing more slowly than I'd hoped they might.  At this point I have most of the codecs working and most of the application-level unicode tests passing, but there are still a lot of little corner cases to fix and I don't even want to think about regex right now.  That said, the summer is only about half over.

Since I wanted to have something interesting to write about, I've done some looking into how other implementations that use a UTF-8 representation handle index caching (if at all) and thought I'd share what I've found.

Relatively few languages with built-in unicode support use a UTF-8 representation.  The majority seem to prefer UTF-16, for better or for worse.  The implementations that use UTF-8 I found which are worth mentioning are Perl and wxWidgets.

wxWidgets by default uses UTF-16 or UTF-32 depending on the size of wchar_t, but it has an option to use UTF-8.  When UTF-8 is enabled, wxWidgets uses an extremely simple index caching system.  It caches the byte position of the last accessed character.  When indexing a string, it starts searching from the cached byte position if the index being looked up is higher than the cached index, otherwise it searches from the start of the string.  It will never search backwards. The idea here is to make in-order traversal -- like a typical for(int i = 0; i < s.size(); i++) loop -- O(n). Unfortunately, a backwards traversal is still O(n^2) and random access is still O(n).
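The single-position cache described above can be sketched in a few lines of Python. This is only an illustration of the idea, not wxWidgets code; the class name and the `steps` counter are made up so that the cost of each traversal order becomes visible.

```python
class CachedUtf8:
    """UTF-8 bytes plus one cached (character index, byte offset) pair."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self._cached_char = 0
        self._cached_byte = 0
        self.steps = 0  # code points stepped over so far

    def _next(self, pos):
        """Return the byte offset just past the code point starting at pos."""
        lead = self.data[pos]
        if lead < 0x80:
            return pos + 1
        if lead < 0xE0:
            return pos + 2
        if lead < 0xF0:
            return pos + 3
        return pos + 4

    def _byte_offset(self, index):
        if index < 0:
            raise IndexError(index)
        if index >= self._cached_char:
            char, pos = self._cached_char, self._cached_byte
        else:
            char, pos = 0, 0  # never search backwards: restart from the start
        while char < index:
            if pos >= len(self.data):
                raise IndexError(index)
            pos = self._next(pos)
            char += 1
            self.steps += 1
        if pos >= len(self.data):
            raise IndexError(index)
        self._cached_char, self._cached_byte = index, pos
        return pos

    def __getitem__(self, index):
        start = self._byte_offset(index)
        return self.data[start:self._next(start)].decode("utf-8")


text = "h\u00e9llo w\u00f6rld"  # 11 code points, two of them 2 bytes long
forward = CachedUtf8(text)
forward_chars = [forward[i] for i in range(len(text))]
backward = CachedUtf8(text)
backward_chars = [backward[i] for i in reversed(range(len(text)))]
```

For this 11-character string the forward loop steps over 10 code points in total, while the backward loop restarts from the beginning on almost every access and steps over 55.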

Perl's index caching is much more elaborate.  Perl caches two character/byte-index pairs.  When indexing a string, the search will start from the closest of the cached indices, the start, or the end of the string. It can count backwards as well as forwards through the string, assuming that counting backwards is approximately twice as slow as counting forward.

Perl's method for selecting the cached indices is also more sophisticated than wx's. Initially, the first two indices accessed are used as the cached indices. After the first two are cached, when a new index is accessed, Perl considers replacing the current pair of cached indices with one of the two possible pairs made of one old index and the new index.  It does so by computing the root-mean-square distance between the start of the string, the first index, the second index and the end of the string for each of the 3 possible pairs of indices.  That's probably as clear as mud, so maybe this excerpt from the Perl source will help:
#define THREEWAY_SQUARE(a,b,c,d) \
((float)((d) - (c))) * ((float)((d) - (c))) \
+ ((float)((c) - (b))) * ((float)((c) - (b))) \
+ ((float)((b) - (a))) * ((float)((b) - (a)))

The pair that minimizes THREEWAY_SQUARE(start, low index, high index, end) is kept.

This method seems to be better than wx's for almost all cases except in-order traversal. I'm not actually sure what the complexity here would be;  I think it's still O(n) for random access.
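As a rough illustration of the pair selection, the macro and the "keep the minimizing pair" rule can be transcribed into Python. This is a simplified sketch, not Perl's actual cache code; `choose_cached_pair` is a hypothetical helper and all positions are character indices.

```python
def threeway_square(a, b, c, d):
    """Python transcription of the THREEWAY_SQUARE macro."""
    return float(d - c) ** 2 + float(c - b) ** 2 + float(b - a) ** 2


def choose_cached_pair(length, cached, new_index):
    """Pick which two indices to keep among the old pair and the new index."""
    low, high = sorted(cached)
    candidates = [
        (low, high),
        tuple(sorted((low, new_index))),
        tuple(sorted((high, new_index))),
    ]
    # Keep the pair that splits [0, length] most evenly.
    return min(candidates, key=lambda p: threeway_square(0, p[0], p[1], length))


kept_after_far_access = choose_cached_pair(100, (10, 20), 70)
kept_when_already_even = choose_cached_pair(99, (33, 66), 50)
```

With cached indices 10 and 20 in a 100-character string, an access at 70 replaces 10 (the scores are 6600, 4600 and 3800 for the three pairs), whereas an already evenly spread pair such as 33 and 66 is left alone.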

by Tyler Wade at June 30, 2014 05:43 AM

June 25, 2014

(SciPy/NumPy student)

With those mid-sem bells chiming, it is time for another update.
The following checkpoints have been reached:

1. The implementation of the ellipsoidal harmonic function (also known as Lamé's function): The first kind.
The following is the link to the pull request:
The implementation is in Cython and calls a LAPACK subroutine. It is based on the Python implementation by Knepley and Bardhan given here
Further, since hardly any other library implements Lamé's function, preparing an extensive test suite is a challenge. At present the output of the function is tested for a certain range of inputs. The immediate next plan is to improve the test suite.
This will be followed by the implementation of Ellipsoidal harmonic function: The second kind.

2. Before this, the spherical harmonic functions were improved by reimplementing them in Cython rather than Python, which improved the speed.
The details were covered at length in the previous post, so they are omitted here to spare readers the redundancy. The pull request can be accessed from

Thanks to the constant support of my brilliant mentors Pauli, Ralf and Stefan (and, I suppose, a few miracles!), the progress is on schedule. Hoping that this pace will be maintained or, even better, quickened after the mid-sem evaluation!

Signing off till next time,

by janani padmanabhan at June 25, 2014 05:59 PM

(scikit-learn student)

June 23, 2014

Akhil Nair
(Kivy student)

A lot and a little

This summer of code started on 19th May and today it is 23rd June. What did I do the whole month, you ask? A lot and a little.

This month was as action packed as a climax scene of a Transformers movie!
 I'm sure the usual - had exams, had project submissions - wouldn't interest you, so let's talk about my project and how buildozer has fared and how I project it would fare.

The aim of the project is clear. Have an automated process that packages any Kivy application for a target Operating System.
My goal was to have target codes for linux by the mid term evaluation. Am I there? Well, almost.

At the beginning my mentors recommended pyinstaller as a tool I could use to package the application. I spent a considerable part of the past month trying to get pyinstaller to work, but failed. The problem was that after following all the procedures and converting the application to an executable, the application would give an import error on a PIL library. The problem also occurred at a basic level, when I tried to package a small application which does nothing but import the pygame library. Since I had already spent a major part of my month on this and no solution was in sight, I moved on to the basic packaging methodology used by Debian, which I had mentioned in my proposal. After piecing together various processes I was able to successfully package an application for Debian. Small functional aspects such as dependency resolution still remain, but I'm optimistic that they can be solved.

The whole process can be optimised and divided into the following steps:

1: Place the application in an empty directory (Yes there is a reason for that).

2: Create an file in the application directory, so that python reads the application as a package.

3: Create a file outside the application directory.

4: The should be in the following format.
from distutils.core import setup

#This is a list of files to install, and where
#(relative to the 'root' dir, where is)
#You could be more specific.
files = ["*","sounds/*.wav","icons/*.png"] #List of static directories and files in your application

setup(name = "application_name",
    version = " ", #example 1.0
    description = "A small description",
    author = "myself and I",
    author_email = "",
    url = "whatever",
    #Name the folder where your packages live:
    #(If you have other packages (dirs) or modules (py files) then
    #put them into the package directory - they will be found
    packages = ['application_name'],
    #'package' package must contain files (see list above)
    #I called the package 'package' thus cleverly confusing the whole issue...
    #This dict maps the package name =to=> directories
    #It says, package *needs* these files.
    package_data = {'application_name' : files },
    #'runner' is in the root.
    scripts = ["script_name"], #We will use this to call the application as a program from the command line. Ideally your application_name
    long_description = """Really long text here.""",
    #This next part is for the Cheese Shop, look a little down the page.
    #classifiers = []
)


5: Create another file named as the script_name from the

6: This file will ideally look something like this.
from application import main #Here application being the application directory.  #Ideally your main function


7: Now run the command "python sdist". This will create a directory named dist with a tarball of your application.

8: Traverse to the tarball and run the command
"py2dsc your_application_tar_file"

9: This will then create a directory named deb_dist. It will contain a series of files, but we want the directory named after your application. It should look like application_name-1.0 (1.0 or any other version you specified in your file)

10: Go into that directory; you will find a directory named debian. Just run the command "debuild". Now go back to the deb_dist directory. You will have a Debian package waiting for you. Try installing it.
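The command sequence in steps 7 to 10 can be written down as a small dry-run plan. This is a sketch under assumptions, not the buildozer target code: it assumes the build script is the conventional `setup.py`, that sdist produces a `.tar.gz` named `name-version`, and that py2dsc is run from inside `dist`; `packaging_plan` is a hypothetical helper that only lists commands and runs nothing.

```python
import os


def packaging_plan(app_name, version, project_dir="."):
    """Return (working_directory, command) pairs for steps 7-10."""
    dist_dir = os.path.join(project_dir, "dist")
    tarball = "%s-%s.tar.gz" % (app_name, version)  # assumed sdist file name
    source_dir = os.path.join(
        dist_dir, "deb_dist", "%s-%s" % (app_name, version)
    )
    return [
        # Step 7: build the source tarball (setup.py is an assumption).
        (project_dir, ["python", "setup.py", "sdist"]),
        # Step 8: convert the tarball into a Debian source tree.
        (dist_dir, ["py2dsc", tarball]),
        # Steps 9-10: build the package inside the generated directory.
        (source_dir, ["debuild"]),
    ]


plan = packaging_plan("application_name", "1.0", "proj")
```

Each pair could then be handed to something like `subprocess.check_call(command, cwd=working_directory)` once the plan looks right.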

There are still certain things that need to be sorted out such as key signing and dependency issues. But I will get to that soon. As you might have seen, the is the most important thing. Also, currently I am using distutils library but going to shift to setuptools as setuptools has functionalities to package directly to rpm. (And maybe windows too).

Currently writing target code to implement this whole process. Will push for review as soon as is over.

by AkhilNair at June 23, 2014 04:17 PM

June 22, 2014

Anurag Goel
(Mercurial student)

Still a long way to go and more milestones to cover

This post is the midterm summary update. The last two weeks were quite busy for me.  As I mentioned in the previous post, I will try to cover things quickly. Below is the work progress up to the midterm.

So far I have mainly done two tasks.

Task 1 is about gathering timing info of test files. In this task, I calculated two main things.
1) User time taken by child processes.
2) System time taken by child processes.

This task has already been described in detail in the previous post. The patch for this task is in its 4th revision of review and is getting better with every revision. You can find more patch details here.
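For readers who want to see what "user time" and "system time" of child processes look like in practice, here is a minimal sketch using the standard library. It is not the patch under review; `time_child` is a hypothetical helper, and on platforms that do not report child times the values are simply zero.

```python
import os
import subprocess
import sys


def time_child(argv):
    """Run argv and return (user_seconds, system_seconds) spent in children."""
    before = os.times()
    subprocess.call(argv)
    after = os.times()
    return (
        after.children_user - before.children_user,
        after.children_system - before.children_system,
    )


# Time a tiny child Python process as a stand-in for one test file.
user_time, system_time = time_child(
    [sys.executable, "-c", "sum(range(200000))"]
)
```

The difference of two `os.times()` snapshots covers every child that finished in between, so it should be taken around exactly one test run.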

Task 2 is to “Set Up a tool able to plot timing information for each file”.
In this task I first introduced a new ‘--json’ option in “”. If the user enables the ‘--json’ option while testing, the timing data gets stored in JSON format in a newly created ‘report.json’ file. The patch for this task is in its 2nd revision. You can find more patch details here.

After that, I wrote an HTML/JavaScript file which accesses this report.json file to plot a graph of test name vs. test time. The Mercurial (hg) buildbot runs test cases periodically, so the main aim of this task is to give users a graphical view of test results.
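The round trip from timing data to plottable points might look roughly like the sketch below. The layout of the JSON file is an assumption made for illustration (one entry per test with `user` and `system` keys), not the format produced by the actual patch, and both helper names are hypothetical.

```python
import json
import os
import tempfile


def write_report(timings, path):
    """Store per-test timing data as JSON (layout assumed for this sketch)."""
    with open(path, "w") as handle:
        json.dump(timings, handle, indent=2, sort_keys=True)


def load_plot_points(path):
    """Return (test name, total seconds) pairs, ready to feed a plot."""
    with open(path) as handle:
        data = json.load(handle)
    return sorted(
        (name, entry["user"] + entry["system"]) for name, entry in data.items()
    )


report_path = os.path.join(tempfile.mkdtemp(), "report.json")
write_report(
    {
        "test-a.t": {"user": 0.5, "system": 0.25},
        "test-b.t": {"user": 1.0, "system": 0.5},
    },
    report_path,
)
points = load_plot_points(report_path)
```

A JavaScript page would do the equivalent of `load_plot_points` before drawing the graph.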

Apart from the above two tasks, most of my time went into fixing a regression. This regression was recently introduced by refactoring of which includes
1) produce error on running a failing test
2) produce '!' mark after running a failing test
3) skipped test should not produce 'i' mark while retesting
4) fixes the number of tests ran when '--retest' is enabled
5) checks behaviour of test on failure while testing
6) fixes the '--interactive' option error
There are several other milestones to cover which mainly includes
1) Designing of annotation format
2) Highlight test section with annotation format
3) Declare dependency between sections

The regression fixing above was quite a warm-up exercise. It helped me understand how things work in "". This will definitely help me cover the milestones mentioned above quickly.

Stay tuned for more updates :)

by Anurag Goel at June 22, 2014 10:10 PM

Chinmay Joshi
(Mercurial student)

Midterm Update

The midterm evaluation period has arrived and I am summarizing my GSoC journey. Until now, I have been going through various files in Mercurial to find operations which touch the filesystem, adding those operations to vfs, and updating the related filesystem operations to be invoked via vfs.

During this time, I made several advancements in this process. I am currently working on mercurial/, mercurial/, mercurial/ and mercurial/ A few filesystem operations like lexists and unlinkpath are already merged (see the progress wiki for more details on the WindowsUTF8 plan) and some others I sent are under review. I intend to send more from my queue after refining them based on the feedback I receive on the patches already sent. Fujiwara Katsunori has played a crucial role and helped me a lot by providing valuable feedback. Updated users should not only be functional but also efficient. This is a large part of the project, and more operations still need to be added and replaced accordingly.

Regarding the very basic filesystem operations which do not rely on a specific file path, I recently had a discussion with Giovanni Gherdovich about adding file operations which do not rely on a base in the filesystem (e.g. os.getcwd, os.path.basename, os.path.join, etc.). As per his suggestion, we decided to try implementing them as classmethods so they can be accessed directly without vfs objects. I will be sending this series as an RFC to the Mercurial developer mailing list for feedback from experts on this project.
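The classmethod idea can be pictured with a small stand-alone class. This is only a sketch of the design being discussed, not Mercurial's vfs; `SketchVfs` and its method set are hypothetical.

```python
import os


class SketchVfs(object):
    """Toy stand-in for a vfs: some operations need a base, some do not."""

    def __init__(self, base):
        self.base = base

    def join(self, path):
        # Depends on the base directory, so it needs an instance.
        return os.path.join(self.base, path)

    def lexists(self, path):
        return os.path.lexists(self.join(path))

    @classmethod
    def basename(cls, path):
        # Independent of any base: callable without a vfs object.
        return os.path.basename(path)

    @classmethod
    def getcwd(cls):
        return os.getcwd()


name_without_instance = SketchVfs.basename(os.path.join("a", "b.txt"))
joined_with_instance = SketchVfs("repo").join("store")
```

The split keeps base-dependent calls on instances while still routing base-independent ones through the same class, so a subclass could override both kinds in one place.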

For the part of the project which is about accessing the filesystem on Windows, I asked on the mailing list for clues. Following that discussion, the proposed utf8vfs should fetch filenames primarily by calling Python's API with unicode objects and should convert the results back to UTF-8 encoding. However, on Windows Mercurial uses the win32 API to implement a few crucial filesystem operations, such as unlink and removing a directory tree. For these I will need to use the Win32 W API. Since I am absolutely new to the win32 API, during the last week I got my hands on it alongside the work of adding users to vfs and updating filesystem operation users to use vfs. I decided to allocate a portion of my time to this even though it is scheduled for a later stage in the timeline, because these are the things which could consume a huge amount of time later if I get stuck on something. I have also experimented with some operations on unicode objects by adding them to utf8vfs.
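The "unicode in, UTF-8 bytes out" behaviour described for utf8vfs can be sketched as follows. This is an illustration only, assuming the caller hands over UTF-8 encoded byte paths; `SketchUtf8Vfs` is a hypothetical class and not the proposed Mercurial implementation.

```python
import os
import tempfile


class SketchUtf8Vfs(object):
    """Toy vfs that speaks UTF-8 bytes to callers and unicode to the OS."""

    def __init__(self, base):
        self.base = base  # UTF-8 encoded bytes

    def listdir(self):
        # Call Python's API with a unicode path so names come back as unicode,
        # then convert the results back to UTF-8 bytes.
        names = os.listdir(self.base.decode("utf-8"))
        return sorted(name.encode("utf-8") for name in names)


demo_dir = tempfile.mkdtemp()
demo_name = "stra\u00dfe.txt"  # a non-ASCII file name
with open(os.path.join(demo_dir, demo_name), "w") as handle:
    handle.write("demo")

listed = SketchUtf8Vfs(demo_dir.encode("utf-8")).listdir()
```

The point of going through unicode objects is that the operating system, not the local code page, decides how the name is stored.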

Stay tuned for more updates!

by Chinmay Joshi at June 22, 2014 04:30 PM

June 21, 2014

Shailesh Ahuja
(Astropy student)

Analysing IRAF multispec format FITS files

Today I am going to analyze a couple of FITS files which have been stored in the IRAF multispec specification described here. I am going to focus on Legendre and Chebyshev polynomial dispersion functions.
First, take a quick look at the FITS headers in the following files:
  1. Legendre dispersion file headers:
  2. Chebyshev dispersion file headers:

NAXIS defines the number of dimensions. In multispec format, there are always two dimensions. Multiple one-dimensional spectra are stored in this format. These headers have CTYPE1 and CTYPE2 equal to `MULTISPE`. This is necessary to indicate that the spectra are stored in multispec format. NAXIS1 tells us the size of the data (length of the flux array) in each spectrum, and NAXIS2 tells us the number of such one-dimensional spectra. In both files, there are 51 spectra stored.

One of the most important header keywords is WAT2_XXX. This keyword stores the information needed to compute the dispersion at each point, for each spectrum. `specK` holds the information for the Kth spectrum. There are various numbers separated by spaces within each `specK`. These numbers describe the exact function to be used to compute the dispersion values. The following list explains these numbers in order:

  1. Aperture number: According to Wikipedia, aperture number is directly or inversely proportional to the exposure time. This entry always holds an integer value. For both the files, this number goes from 1 to 51. This value has no significance on the calculation of dispersion.
  2. Beam number: I am not sure what this means, but it is always an integer value. For the Legendre file, this decreases from 88 to 38 and for Chebyshev file this increases from 68 to 118. 
  3. Dispersion Type: This can be 0 (linear dispersion), 1 (log-linear dispersion) or 2 (non-linear dispersion). As both these files define non-linear polynomial functions, this is always 2.
  4. Dispersion start: This value indicates the dispersion at the first physical pixel. This value is not used for computation; however, it can be used to verify whether the function is giving the correct output at the first pixel. Unfortunately, this value is the same for all 51 spectra in both files, which implies that it hasn't been stored correctly. The value matches the output returned by the dispersion function of the 51st spectrum.
  5. Average dispersion delta: This value is equal to the mean of the difference between consecutive dispersion values. Again, this value is not used for computation, but can be used to verify the function output. Similar to the previous value, this has been stored incorrectly in both these files. It is only correct for the 51st spectrum.
  6. Number of pixels: This value indicates the length of the flux array of this spectrum. This value can be at most the value of NAXIS1. This value should be equal to PMAX - PMIN (defined later).
  7. Doppler factor (Z): Due to relative motion of the object and the observer, Doppler effect can alter the dispersion values. This factor can be used to compute the adjusted dispersion values, by using the formula below:

                                 Adjusted dispersion = Dispersion / (1 + Z)
  8. Aperture low: This value is for information only. Stores the lower limit of spatial axis used to compute this dispersion.
  9. Aperture high: Again, this value is for information only. It stores the upper limit of spatial axis used to compute this dispersion.

    From this point, the function descriptors start. There can be more than one function too. In that case these descriptors are repeated starting from weight. These descriptors determine the function output. The final dispersion is calculated as:

                                Final dispersion = Sum of all function outputs
  10. Weight:  The weight of the function gives the multiplier for the dispersion calculated. Its use becomes more obvious in the formula below.
  11. Zero point offset: The value to be added to all the dispersion values. Combined with the weight, and the Doppler factor, the function output can be calculated as:

        Final function output = Weight * (Zero point offset + function output) / (1 + Z)

    In the files given, there is only one function. The Doppler factor and the zero point offset are zero, and the weight is one. So the final dispersion is equal to the function output.
  12. Function type code: Until this point, we know how to calculate the final dispersion if we know the function output. This value stores the type of the function that will be used to compute the output at any given pixel. There are six possibilities:
    1 => Chebyshev polynomial, 2 => Legendre polynomial, 3 => Cubic spline,
    4 => Linear spline, 5 => Pixel coordinate array, and 6 => Sampled coordinate array

    Starting from this point, the numbers may mean different things for different functions. I am explaining the descriptors for Legendre and Chebyshev.
  13. Order (O): The order of the Legendre or Chebyshev function.
  14. Minimum pixel value (Pmin): The lower limit of the range of the physical pixel coordinates.
  15. Maximum pixel value (Pmax): The upper limit of the range of the physical pixel coordinates. In combination with the lower limit, they determine the domain of the function. This domain should be mapped to [-1, 1].
  16. Coefficients: There are O coefficients that follow. These coefficients define the Legendre or the Chebyshev functions.
And that's it. It's a bit tedious to understand, but the format enables so much information to be stored. The documentation is not very clear, and I hope this post helped you understand what these parameters stand for.
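To tie the descriptors together, here is a minimal sketch of how a single Chebyshev or Legendre function could be evaluated from them. It is not the specutils code; the function names are made up, the pixel range is mapped linearly onto [-1, 1], and the polynomials are built with their standard recurrences.

```python
def basis_terms(kind, x, order):
    """Return the first `order` Chebyshev or Legendre polynomials at x."""
    if kind not in ("chebyshev", "legendre"):
        raise ValueError("unsupported function type: %r" % (kind,))
    terms = [1.0, x][:order]
    for n in range(2, order):
        if kind == "chebyshev":
            terms.append(2.0 * x * terms[n - 1] - terms[n - 2])
        else:
            terms.append(
                ((2 * n - 1) * x * terms[n - 1] - (n - 1) * terms[n - 2]) / n
            )
    return terms


def dispersion(pixel, kind, pmin, pmax, coefficients,
               weight=1.0, offset=0.0, z=0.0):
    """Weight * (zero point offset + function output) / (1 + Z) at one pixel."""
    # Map the physical pixel range [pmin, pmax] onto [-1, 1].
    x = (2.0 * pixel - (pmax + pmin)) / (pmax - pmin)
    terms = basis_terms(kind, x, len(coefficients))
    value = sum(c * t for c, t in zip(coefficients, terms))
    return weight * (offset + value) / (1.0 + z)


first_pixel = dispersion(1, "chebyshev", 1, 101, [5000.0, 100.0])
last_pixel = dispersion(101, "chebyshev", 1, 101, [5000.0, 100.0])
```

With more than one function in a `specK` entry, the final dispersion would be the sum of such calls, one per function.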

by Shailesh Ahuja at June 21, 2014 09:27 AM

June 19, 2014

Milan Oberkirch
(Core Python student)


I spent the last days improving my smtpd patch and reading some (proposed) standards which have their very own way of sounding boring in almost every sentence (they have to be specific so don’t blame the authors). For example the following sentence can be found in nearly every RFC published since 1997:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

After reading all those RFCs I feel like I SHOULD get back coding right now.

The most relevant RFCs so far:

  • 6531: SMTPUTF8
    already implemented in and in progress in smtplib. I will come back to this after dealing with imaplib and #15014 is closed.
  • 6532: Internationalized Headers
    the one I started with in the first week, affects the email package. I’ll come back to that when everything else is done (I discussed some of the reasons for this in my last post).

and new this week:

  • 6855: UTF-8 support for IMAP
    This concerns the imaplib. I will get started with that today.
  • 6856: same thing for POP3
    concerns poplib (obviously). I already proposed a patch for this one but left out the new ‘LANG’ command (used to change the language of messages from the server) for now. It seemed quite irrelevant, but I MAY add it sometime in the future.
  • 6857: Post-Delivery Message Downgrading for Internationalized Email Messages
    I MAY look into this after I have implemented the things above.

So I have many new playgrounds now. Working on many things in parallel is also an attempt to make the whole review and patch-improvement process more efficient.

by milan at June 19, 2014 06:30 PM

Jaspreet Singh
(pgmpy student)

The One With PomdpX Reader Class

Hi folks!
There were a lot of decisions to be made before I could start hacking on pgmpy. First of all, I had to decide which of the file formats to start with. I decided to choose the PomdpX file format since it is based on XML, similar to the XMLBIF format which had already been completed by Ankur Ankan. This eased the task. The Reader class was to be implemented first. The aim was to parse a PomdpX file and convert all the information it contains about the model into structured data. I read the etree tutorial and learned about the functions that would be needed for the task. The ElementTree API provides different kinds of functions to get this done. I'll provide an example. Suppose the XML looks something like this

<pomdpx version="0.1" id="rockSample">
<Description> · · · </Description>
<Discount> · · · </Discount>
<Variable> · · · </Variable>
<InitialStateBelief> · · · </InitialStateBelief>
<StateTransitionFunction> · · · </StateTransitionFunction>
<ObsFunction> · · · </ObsFunction>
<RewardFunction> · · · </RewardFunction>
</pomdpx>

If the above XML is contained in a string, then etree.fromstring(string) will return the root element of the tree, i.e. pomdpx in this case. Functions like find() and findall() can then be used to obtain the elements contained within the root. Attributes are obtained using the get() method, and the value inside a tag using the text attribute. All of these combined helped me form structured data out of the file format.
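A small, self-contained sketch of those calls is shown below, using the standard library's ElementTree. The contents of the tags (the discount value and the `StateVar` children with their attributes) are placeholders chosen for illustration, not taken from a real PomdpX file or from the pgmpy reader.

```python
import xml.etree.ElementTree as etree

# Illustrative document; the inner elements are assumptions for this sketch.
POMDPX = """<pomdpx version="0.1" id="rockSample">
  <Description>Sample model</Description>
  <Discount>0.95</Discount>
  <Variable>
    <StateVar vnamePrev="rover_0" vnameCurr="rover_1"/>
    <StateVar vnamePrev="rock_0" vnameCurr="rock_1"/>
  </Variable>
</pomdpx>"""

root = etree.fromstring(POMDPX)  # the <pomdpx> element

model = {
    "id": root.get("id"),                            # attribute lookup
    "version": root.get("version"),
    "discount": float(root.find("Discount").text),   # text inside a tag
    "state_vars": [
        var.get("vnameCurr")
        for var in root.find("Variable").findall("StateVar")
    ],
}
```

Collecting everything into plain dictionaries and lists like this keeps the reader independent of any other class.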

The most difficult part was the parameter tag of type Decision Diagram where I had to parse tree type of format. I could not use any other class object inside the Reader Class to make it independent. I finally came up with a recursive parsing solution to it. I was able to come up with it by using the idea from the TreeCPD class present in the Factors module of the pgmpy codebase.
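The recursive idea can be sketched as follows. The nested `Node`/`Edge`/`Terminal` layout used here is an assumption about how a decision diagram parameter is written, and `parse_node` is a hypothetical helper, not the function in the pgmpy reader.

```python
import xml.etree.ElementTree as etree


def parse_node(node):
    """Recursively turn a Node/Edge/Terminal tree into nested dictionaries."""
    edges = {}
    for edge in node.findall("Edge"):
        terminal = edge.find("Terminal")
        if terminal is not None:
            # Leaf: the edge ends in a numeric value.
            edges[edge.get("val")] = float(terminal.text)
        else:
            # Inner node: descend into the child Node.
            edges[edge.get("val")] = parse_node(edge.find("Node"))
    return {node.get("var"): edges}


DIAGRAM = """<Node var="rover_0">
  <Edge val="s0"><Terminal>1.0</Terminal></Edge>
  <Edge val="s1">
    <Node var="rock_0">
      <Edge val="good"><Terminal>0.5</Terminal></Edge>
      <Edge val="bad"><Terminal>0.5</Terminal></Edge>
    </Node>
  </Edge>
</Node>"""

tree = parse_node(etree.fromstring(DIAGRAM))
```

Because each call only looks at its own direct `Edge` children, the depth of the diagram does not matter.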

I then wrote tests for each of the functions so that every likely mistake is handled and no known bugs remain in the code. Some docstrings were also added. Finally the Reader class was complete, along with the tests, and ready for me to send a pull request.

by jaspreet singh at June 19, 2014 07:05 AM

June 16, 2014

Frank Cheng
(Statsmodels student)

How to test MICE?

MICE is a method that imputes missing data using simulation from an inferred posterior distribution. See the Appendix of to see the exact way we simulated data.

With this in mind, MICE is a random method that will yield differing results every time it is run. Therefore, a typical testing procedure where we try to get exact agreement between our implementation and an existing, mature package is not possible. However, we should get roughly the same results for any given instance, and more importantly, the distribution of results should be very similar. So what we will aim to do is really a comparison of simulations between our implementation and those of R and Stata. Right now, predictive mean matching is directly comparable between all implementations, but there are some details in our other asymptotically Gaussian approach that are different in the other packages. I will write a follow up post shortly describing the different approaches taken by different packages. For now, the takeaway for testing is that we will perform simulation studies to assess the similarity between several MICE implementations.
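A simulation-based comparison of this kind might be sketched as below. This is only an illustration of the testing idea, not the statsmodels test suite: the three "imputers" are hypothetical stand-ins that draw a single estimate from a normal distribution, and only the means are compared, within a generous number of standard errors.

```python
import math
import random
import statistics


def simulate(imputer, runs, seed):
    """Run a stochastic procedure many times with a fixed seed."""
    rng = random.Random(seed)
    return [imputer(rng) for _ in range(runs)]


def means_agree(sample_a, sample_b, n_sigma=5.0):
    """True if the two sample means differ by at most n_sigma standard errors."""
    std_err = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return abs(statistics.mean(sample_a) - statistics.mean(sample_b)) <= n_sigma * std_err


# Hypothetical stand-ins for "our implementation", a reference package,
# and an implementation with a systematic bias.
def ours(rng):
    return rng.gauss(10.0, 2.0)


def reference(rng):
    return rng.gauss(10.0, 2.0)


def biased(rng):
    return rng.gauss(11.0, 2.0)


same = means_agree(simulate(ours, 2000, 1), simulate(reference, 2000, 2))
different = means_agree(simulate(ours, 2000, 1), simulate(biased, 2000, 3))
```

A fuller study would compare more than the mean (for example the spread or quantiles of the estimates), but the structure stays the same: many seeded runs per implementation, then a tolerance-based comparison.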

by Frank Cheng at June 16, 2014 06:04 AM

Tyler Wade
(PyPy student)

Beginning UTF-8 Implementation

Over the last week and a half, I've begun working on implementing UTF-8 based unicodes in PyPy.

As I've stated in a previous blog post, my goal is to replace the use of the RPython unicode type in PyPy with a PyPy-specific UTF-8 implementation.  The motivation for doing this is to simplify PyPy's unicode support by providing a single implementation and application-level interface irrespective of the platform.  Presently, the size of the system's wchar_t type determines which representation is used, from both perspectives.  The new implementation will represent the unicode strings internally as an RPython string of bytes, but the user will interact with the application-level unicode string on a code point basis.  In other words, the user will have a "UTF-32" interface.

Since Python supports converting unicode objects into UTF-8 byte strings and vice-versa, RPython helper functions already exist for a lot of the handling of UTF-8. I am largely reusing these with minimal modifications.  As a result, most of my work is a matter of fixing the various places in PyPy where the built-in/RPython unicode type is expected and replacing unicode literals.  That said, I plan on leaving unicode literals in (the interpreter-level) tests in place. Replacing them serves little useful purpose and would make the tests more annoying to read and to write.

For the time being I'm not paying very much attention to optimization (with the exception of special cases for ASCII only strings.)  Things will be slow for a little while; random access before any optimizations are added will be O(n).  However, the goal is correctness first and this won't be merged if it creates any serious performance regressions.  That said, I already have some ideas about optimizations and hopefully I'll be far enough along in my work to be able to write more about those next time.
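To make the O(n) random access concrete, here is a small sketch in plain Python (not RPython, and not the PyPy code) of looking up a code point by index in a UTF-8 byte string. The function names codepoint_at and char_width are made up for this example.

```python
def char_width(data, pos):
    """Return the byte length of the UTF-8 sequence starting at pos."""
    if pos >= len(data):
        raise IndexError("code point index out of range")
    first = data[pos]
    if first < 0x80:
        return 1                      # ASCII
    if first >> 5 == 0b110:
        return 2
    if first >> 4 == 0b1110:
        return 3
    if first >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 leading byte")


def codepoint_at(data, index):
    """Return the code point at position index of UTF-8 encoded data.

    The bytes are walked from the start, so the cost is O(n), like the
    unoptimized random access described above.
    """
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    pos = 0
    for _ in range(index):
        pos += char_width(data, pos)
    width = char_width(data, pos)
    return data[pos:pos + width].decode("utf-8")


encoded = "a\u00e9\u20ac\U0001F600z".encode("utf-8")
print([codepoint_at(encoded, i) for i in range(5)])
```

The caller indexes by code point while the storage stays a byte string, which is the "UTF-32 interface" over UTF-8 data described earlier in the post.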

by Tyler Wade at June 16, 2014 12:45 AM

June 14, 2014

Akhil Nair
(Kivy student)

Packaging- The Debian Way

In an earlier post I mentioned how I was stuck with PyInstaller. Despite several attempts, the roadblock remained. So earlier this month I was at a dead end with nowhere to turn.

After some soul searching and the inspirational season finale of Silicon Valley, I decided to try a different approach. And I must say, this looks promising.

What PyInstaller does is convert a program into an executable; we have to take it the rest of the way.
So I thought, why not try to package directly using some basic Debian build tools?

And I can say that I was fairly successful.

This method has two parts: the setup creation and the package creation.

A combination of Part1 and Part2.
There are some steps in between but nothing big.

Part1 deals with making a Kivy application an installable program that can be run by issuing a command in the terminal.
This step is important because this functionality is a fundamental part of any Debian package.

These are the steps in part one:

- Create a file outside the application directory and enclose this combination into another directory which we may refer to as root.
The file will contain information about the author, static files, package, etc. 

- We must also make the application directory a package that can be imported as a module. So we create a file (it doesn't matter if it's blank) so that Python treats the directory as a module.

- Create a Manifest file which specifies all the other (non-Python) files of the application such as sounds, images, readme, etc.

- Run the command "python sdist". This will now create a dist directory which will have a tar file. You may check whether the application can be installed or not by running the "python -install" command.  If the above part works well then the second part should work without hiccups.
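The layout described in these steps can be sketched as follows. The file names did not survive in the post, so everything here is an assumption based on the usual distutils/setuptools conventions: setup.py, __init__.py and MANIFEST.in are the conventional names, while the application name myapp, the README.txt file and the file patterns are invented for the example. The script only writes a skeleton into a temporary directory; it does not build anything.

```python
import os
import tempfile

# Assumed contents of the setup script that sits next to the app directory.
SETUP_PY = '''\
from setuptools import setup

setup(
    name="myapp",               # hypothetical application name
    version="0.1",
    author="Your Name",
    packages=["myapp"],
    include_package_data=True,  # also install data files matched by MANIFEST.in
)
'''

# Assumed manifest template listing the non-Python files to ship.
MANIFEST_IN = '''\
include README.txt
recursive-include myapp *.kv *.png *.wav
'''


def scaffold(root):
    """Write the skeleton under root and return the created relative paths."""
    pkg = os.path.join(root, "myapp")
    os.makedirs(pkg)
    files = {
        os.path.join(root, "setup.py"): SETUP_PY,
        os.path.join(root, "MANIFEST.in"): MANIFEST_IN,
        os.path.join(root, "README.txt"): "My Kivy app\n",
        os.path.join(pkg, "__init__.py"): "",   # may be empty
    }
    for path, content in files.items():
        with open(path, "w") as handle:
            handle.write(content)
    return sorted(os.path.relpath(path, root) for path in files)


root = tempfile.mkdtemp()
created = scaffold(root)
print(created)
```

One detail worth checking when static files go missing: as far as I know, distutils reads the template from a file named MANIFEST.in, whereas a file named plain MANIFEST is treated as a literal list of file names.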

The second part deals with creating a deb package from the resulting tarball.

Static files are not getting included in the tarball. Following the MANIFEST file in the Python distutils tutorial only results in the creation of an empty directory named "recursive-include".

If some of these problems can be ironed out, this can work wonderfully.
Also, the assumption is that the first part would be the same for rpm conversion, so it may be possible to have a single target code for both deb and rpm.

Final thoughts:
Since I have the attention span of a sparrow and this post is getting long, I will elaborate on the second part in the next post. Those of you who couldn't understand a thing can go to the link I have provided above; what I am trying to explain is pretty much the same thing, only they explain it better.

Feedback and advice will be appreciated, and I will try to send chocolates to anyone who can help me out with this (just kidding).

by AkhilNair at June 14, 2014 05:47 AM

June 13, 2014

Wenzhu Man
(PyPy student)

two end nursery GC

Things have been kind of complicated over the past four weeks.
My original project changed to a new one, because someone in PyPy had been working on the original project (adding pinning to the GC) for months, but none of my mentors knew until two weeks had passed.

So my new project is:
two end nursery GC.
As mentioned in a previous post, the current incremental minimark GC in PyPy has two generations: the nursery generation and the old generation. There is also an external_malloc() to raw-allocate objects (these objects are outside GC control).

The current nursery allocation logic is clear: there are three pointers, nursery-free, nursery-top and nursery-real-top. Between nursery-free and nursery-top there is a piece of zeroed memory (drawn in grey in the original diagram), and between nursery-top and nursery-real-top there is non-zeroed memory containing random garbage (drawn in blue in the original diagram).

The nursery is cleaned up step by step: when nursery-free reaches nursery-top, nursery-top moves forward and the memory is zeroed for one more step (the step size is defined by nursery-cleanup).

From this mechanism we can see that every object is allocated in memory that is full of zeros.
This is very useful for some objects. Take an object that contains a field which is a GC pointer. It needs to be zero-initialized, because otherwise, if the program doesn't quickly fill the pointer, garbage is left in this field. At the next minor or major collection, trying to follow this garbage-valued pointer would crash.
However, not all objects need such zero-allocation. For any object not containing any GC pointers, it is not useful from the point of view of the GC. The point is that we would save part of the time needed to zero out the nursery if we could choose to allocate zeroed or non-zeroed objects.

So the idea of the project is to allocate objects that don't have GC pointers starting from nursery-real-top and growing towards nursery-top. The nursery can then allocate objects from both ends.

This saves time: instead of two writes (zeroing and allocation) for every object, there will be just one write for some of the objects.

The current allocation functions are malloc_fixedsize_clear() and malloc_varsize_clear(). I added two new pointers: nursery-second-part-free (grows from nursery-real-top) and nursery-second-part-end (points to nursery-top).
I also implemented malloc_fixedsize() as the third allocation function in the GC, to allocate objects in the opposite direction from the other end of the nursery.

The next step will be to modify the GCtransformer to replace malloc_fixedsize_clear() with malloc_fixedsize() when appropriate.
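The two-ended allocation scheme can be modelled with a toy bump allocator. This is a simplified Python simulation written for this explanation, not the RPython GC code: the class name, method names, sizes and the 0xAA garbage marker are all assumptions.

```python
class TwoEndNursery:
    """Toy model of a nursery that hands out memory from both ends."""

    def __init__(self, size, cleanup_step):
        self.mem = bytearray(b"\xAA" * size)  # 0xAA stands in for garbage
        self.cleanup_step = cleanup_step      # plays the role of nursery-cleanup
        self.free = 0                         # nursery-free
        self.top = 0                          # nursery-top: end of the zeroed area
        self.second_free = size               # nursery-second-part-free, starts at nursery-real-top

    def malloc_clear(self, nbytes):
        """Allocate zeroed memory from the low end, zeroing step by step."""
        while self.free + nbytes > self.top:
            new_top = min(self.top + self.cleanup_step, self.second_free)
            if new_top == self.top:
                raise MemoryError("nursery full; a minor collection would run here")
            self.mem[self.top:new_top] = bytes(new_top - self.top)
            self.top = new_top
        addr = self.free
        self.free += nbytes
        return addr

    def malloc_noclear(self, nbytes):
        """Allocate from the high end, growing downward, without zeroing."""
        if self.second_free - nbytes < self.top:
            raise MemoryError("nursery full; a minor collection would run here")
        self.second_free -= nbytes
        return self.second_free


nursery = TwoEndNursery(size=64, cleanup_step=16)
with_pointers = nursery.malloc_clear(8)       # zeroed, safe for GC pointers
without_pointers = nursery.malloc_noclear(8)  # left as garbage, one write saved
print(with_pointers, without_pointers)
```

In the model, objects from the high end never trigger the zeroing loop, which is where the saved write comes from.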

by Wenzhu Man at June 13, 2014 05:47 PM

June 12, 2014

Milan Oberkirch
(Core Python student)

First things first…

After working on the email package for two weeks, my mentors and I decided to work on the smtpd and smtplib modules (and RFC 6531) first. This makes sense for several reasons. One of them is task scheduling: the SMTP-related modules are much smaller, so dealing with them first ensures that as many tasks as possible are done as early as possible (with high probability).

I moved the old development repo to and started a new one at for that. The patch for smtpd is ready for review at The smtplib part is also nearly done, I’m waiting for #15014 (refactoring of the authentication logic) to be resolved to finish that. Another thing that needs to get merged is support for Internationalized Domain Names (IDN). I will have to do some research on how to handle those in a SMTPUTF8 environment (see #20083, #11783).

After having finished the SMTP part I will research whether there are standardised ways to teach IMAP (the imaplib module in Python) Unicode. RFC 6857 (Section 1.1) does not sound very promising, so I will most likely return to the email package at the end of next week.

That’s it for now. I’m looking forward to blog more frequently in the future!

by milan at June 12, 2014 08:38 PM

June 02, 2014

Edwin Marshall
(Kivy student)

Too Many Moving Parts

The title really speaks to two issues, one personal, and one technical. As most of you reading this are concerned only with the technical aspects of the project, I'll focus on that first.

Technically Speaking

The awesome thing about the core kivy project is that it is organized very well and thoroughly documented. While the Kivy Architecture Diagram is a little scary when first encountered, once you take a peek at the directory structure, everything becomes much clearer. For instance, understanding how SDL ties into everything merely requires that a developer study kivy/core. In fact, one could add new providers piecemeal by selecting which of the core modules to implement first. For example, if a developer wanted an SFML or GLFW-based window, he would wander into kivy/core/window.

Unfortunately, I can't say that things have been as pleasant in python-for-android. For example, in attempting to understand how SDL 1.2 ties in to the project, I've had to examine numerous files:
Responsible for cross-compiling python and any other dependencies for Android
Explain how each of those dependencies should be built. Some of them are deceptively simple, until you realize that JNI is involved.
An Android activity that sets up a sound thread, loads the libraries built by as well as the native application and sdl_main shared objects.
Various bits of native wrappers responsible for not only loading the sdl shared libraries, but also providing a user's python code with an entry point.

All this indirection makes it rather easy for a simple guy like myself to get disoriented, so more time is spent understanding the code and following its logic trail than actually writing code.

Speaking of Writing Code

I've been apprehensive about breaking things, hence the lack of pushes, but that will change tonight. Yesterday I spent a lot of time reorganizing my workspace. Specifically, I made sure to properly set up my git remotes so I didn't push to the wrong branch. I've also forked to my own repo so that I never have to worry about messing up the master branch.

Today I plan to do a bit of liposuction to see if things that seem a bit redundant are necessarily so, or, like my beer gut, just another thing to judge the code by that needn't be there. I'm hoping this week will be the week that you guys see that I am actually a productive individual.

A Little Personal

On the personal side of things, the last couple of weeks have been interesting because I've recently quit my job and been scrabbling for new work. I typically would have done the smart thing and waited for a new job before quitting, but my last job was taking its toll on me.

Due to the amount of repetitive motion (and not having to do manual labor itself), my body has started to ache in ways it never has in all my 13 years of legally working. My stress-induced eczema was acting up, I had pain in joints all over, and I found myself more lethargic during my off days than I had ever been.

I also found myself more depressed than usual. What's ironic about that is that prior to this job, I used to find myself working excessively as a way to cope with-- or perhaps more accurately, as a distraction from-- my depression. Here, things were different; I was constantly reminded of how easily talents get ignored and stripped of an identity; I was just another worker. If I was going to be nobody, I needed to be nobody of my own accord and with the opportunity to change.

While I haven't been successful in my job search yet, I've gotten some responses from some freelance opportunities which I'll be continuing to explore this week. As such, I apologize if I seem somewhat distant.

I suppose that's enough about my last story. I hope everyone else's life has been less complicated and wish the other students happy coding!

June 02, 2014 07:26 PM

June 01, 2014

Chinmay Joshi
(Mercurial student)

Dive in code during the First two weeks

At the end of the first two weeks of the GSoC coding period, I am really enjoying working on my project. Working with an open source project community is an experience I have never had before. I keep discovering new things each and every day. I thought I had done enough basic research about my project, but it continuously proves me wrong. One can't really understand the building blocks before diving into the code. Working on the code base, I am discovering the various ways developers have used in Mercurial to provide maximum optimization.

In spite of stumbling and being frustrated in the beginning because of the slow pace of my work, I sent a small patch series in Week 2. I added some filesystem functions to vfs like lexists(), lstat() and unlinkpath() (still in the queue to send) and updated some users in and Some of my patches were queued and merged, and some got me very important feedback. Based on the feedback, I learned to check the callers and origins of functions before updating them to use the vfs interface. I also tried to clear up my doubts about repository related and current working directory (cwd) related paths. Discussing things with the community helped me identify the pitfalls of this task. Checking all callers and origins of a function before replacing it is a task I found very time consuming and tedious, but a bare necessity. Switching from a plain editor to a slightly more advanced IDE is giving me a helping hand in this task.
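For readers unfamiliar with the idea, a vfs object wraps filesystem calls so that callers pass paths relative to a base directory instead of joining paths themselves. The class below is a toy written for this explanation, not Mercurial's implementation; SimpleVfs and its write and unlink helpers are hypothetical names, and only lexists and lstat echo the function names mentioned above.

```python
import os
import tempfile


class SimpleVfs:
    """Toy base-relative filesystem wrapper (not Mercurial's vfs class)."""

    def __init__(self, base):
        self.base = base

    def join(self, path):
        """Resolve a relative path against the base directory."""
        return os.path.join(self.base, path)

    def lexists(self, path):
        # True even for broken symlinks, like os.path.lexists.
        return os.path.lexists(self.join(path))

    def lstat(self, path):
        # stat without following symlinks.
        return os.lstat(self.join(path))

    def write(self, path, data):
        with open(self.join(path), "wb") as handle:
            handle.write(data)

    def unlink(self, path):
        os.unlink(self.join(path))


vfs = SimpleVfs(tempfile.mkdtemp())
vfs.write("hello.txt", b"hi")
print(vfs.lexists("hello.txt"), vfs.lstat("hello.txt").st_size)
```

Updating a "user" then means replacing a direct os call on a joined path with the matching method call on the vfs object, which is why every caller has to be checked first.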

I am preparing another series in which I am adding more functions to vfs and updating more users to use the new API. This first phase of the project, adding functions and updating users, looks like quite a mechanical process at first sight, but it needs proper attention for maximum compatibility and optimization. During this process, I encountered a few filesystem functions from Python's os module which I had never used in my usual college projects. I also liked Mercurial's util, written in C, for file manipulation. These util.* functions, especially the ones about tree links, are very new to me. Talking and discussing things with mentors and the community always helps me clear up my mind and gives me direction. I am currently focusing on this add functions / update users task, as it is essential for the rest of the process.

Stay tuned for further updates!

by Chinmay Joshi at June 01, 2014 07:35 AM

Anurag Goel
(Mercurial student)

Slowly and Steadily reaching towards the goal.

It has been two weeks now, and the experience has been great so far. Initially I was looking at the project as a big picture: how things get done and how much time it will take.
But that approach did not give me a clear idea of the project and even kept me from moving one step forward. After communicating with mentors and crew members, I realised that I should focus on one task at a time, and that before doing a task I should break it into small steps. Following this approach helped me complete the first task. It has not been reviewed yet, but if I have to make further changes, I will do them side by side.

Task 1 is about gathering timing info for test files. In this task, I calculated two main things:
1) Children's user time taken by a process.
2) Children's system time taken by a process.

I used the "os.times()" function to get the above info. Unfortunately, on Windows "os.times()" only reports times for the parent process. Therefore, my contribution only works for Linux users as of now.
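As a rough illustration of the measurement (not the actual patch), the children's CPU times can be read before and after running a child process and then subtracted. The helper name child_cpu_time and the sample command are assumptions made for this sketch.

```python
import os
import subprocess
import sys


def child_cpu_time(cmd):
    """Run cmd and return (user, system) CPU seconds spent in child processes."""
    before = os.times()
    subprocess.call(cmd)   # waits for the child, so its times get accounted
    after = os.times()
    return (after.children_user - before.children_user,
            after.children_system - before.children_system)


cuser, csys = child_cpu_time([sys.executable, "-c", "sum(range(10**6))"])
print("children user: %.3f s, children system: %.3f s" % (cuser, csys))
```

On Windows the children fields are reported as zero, which matches the limitation described above.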

According to my proposal timeline, I planned to complete the first two tasks in the first two weeks, but so far I have only been able to complete the first one. Now that I have found the right approach, I will try to catch up quickly.

The biggest challenge in doing a task is working out the small steps you have to follow. This is only possible when you communicate with your mentor and crew members as much as possible. With every conversation things become clearer, and this helps in building a greater understanding of the project.

by Anurag Goel at June 01, 2014 04:25 AM

May 31, 2014

Terri Oda
(PSF Org admin, Mailman mentor)

You can leave academia, but you can't get the academic spam out of your inbox

When I used to do research on spam, I wound up spending a lot of time listening to people's little pet theories. One that came up plenty was "oh, I just never post my email address on the internet" which is fine enough as a strategy depending on what you do, but is rather infeasible for academics who want to publish, as custom says we've got to put our email addresses on the paper. This leads to a lot of really awesome contacts with other researchers around the world, but sometimes it leads to stuff like the email I got today:

Dear Terri,

As stated by the Carleton University's electronic repository, you authored the work entitled "Simple Security Policy for the Web" in the framework of your postgraduate degree.

We are currently planning publications in this subject field, and we would be glad to know whether you would be interested in publishing the above mentioned work with us.

LAP LAMBERT Academic Publishing is a member of an international publishing group, which has almost 10 years of experience in the publication of high-quality research works from well-known institutions across the globe.

Besides producing printed scientific books, we also market them actively through more than 80,000 booksellers.

Kindly confirm your interest in receiving more detailed information in this respect.

I am looking forward to hearing from you.

Best regards,
Sarah Lynch
Acquisition Editor

LAP LAMBERT Academic Publishing is a trademark of OmniScriptum
GmbH & Co. KG

Heinrich-Böcking-Str. 6-8, 66121, Saarbrücken, Germany
s.lynch(at) / www. lap-publishing .com

Handelsregister Amtsgericht Saarbrücken HRA 10356
Identification Number (Verkehrsnummer): 13955
Partner with unlimited liability: VDM Management GmbH
Handelsregister Amtsgericht Saarbrücken HRB 18918
Managing director: Thorsten Ohm (CEO)

Well, I guess it's better than the many misspelled emails I get offering to let me buy a degree (I am *so* not the target audience for that, thanks), and at least it's not incredibly crappy conference spam. In fact, I'd never heard of this before, so I did a bit of searching.

Let's just post a few of the summaries from that search:

From wikipedia:
The Australian Higher Education Research Data Collection (HERDC) explicitly excludes the books by VDM Verlag and Lambert Academic Publishing from ...

From the well-titled Lambert Academic Publishing (or How Not to Publish Your Thesis):
Lambert Academic Publishing (LAP) is an imprint of Verlag Dr Muller (VDM), a publisher infamous for selling cobbled-together "books" made ...

And most amusingly, the reason I've included the phrase "academic spam" in the title:
I was contacted today by a representative of Lambert Academic Publishing requesting that I change the title of my blog post "Academic Spam", ...

So yeah, no. My thesis is already published, thanks, and Simple Security Policy for the Web is freely available on the web for probably obvious reasons. I never did convert the darned thing to html, though, which is mildly unfortunate in context!


May 31, 2014 05:19 AM

PlanetPlanet vs iPython Notebook [RESOLVED: see below]

Short version:

I'd like some help figuring out why RSS feeds that include iPython notebook contents (or more specifically, the CSS from iPython notebooks) are showing up as really messed up in the PlanetPlanet blog aggregator. See the Python summer of code aggregator and search for a MNE-Python post to see an example of what's going wrong.

Bigger context:

One of the things we ask of Python's Google Summer of Code students is regular blog posts. This is a way of encouraging them to be public about their discoveries and share their process and thoughts with the wider Python community. It's also very helpful to me as an org admin, since it makes it easier for me to share and promote the students' work. It also helps me keep track of everyone's projects without burning myself out trying to keep up with a huge number of mailing lists for each "sub-org" under the Python umbrella. Python sponsors students to work not only on the language itself, but also on projects that make heavy use of Python. In 2014, we have around 20 sub-orgs, so that's a lot of mailing lists!

One of the tools I use is PlanetPlanet, software often used for making free software "planets" or blog aggregators. It's easy to use and run, and while it's old, it doesn't require me to install and run an entire larger framework which I would then have to keep up to date. It's basically making a static page using a shell script run by a cron job. From a security perspective, all I have to worry about is that my students will post something terrible that then gets aggregated, but I'd have to worry about that no matter what blogroll software I used.

But for some reason, this year we've had some problems with some feeds, and it *looks* like the problem is specifically that PlanetPlanet can't handle iPython notebook formatted stuff in a blog post. This is pretty awkward, as iPython notebook is an awesome tool that I think we should be encouraging students to use for experimenting in Python, and it really irks me that it's not working. It looks like Chrome and Firefox parse the feed reasonably, which makes me think that somehow PlanetPlanet is the thing that's losing a <style> tag somewhere. The blogs in question seem to be on blogger, so it's also possible that it's google that's munging the stylesheet in a way that planetplanet doesn't parse.

I don't suppose this bug sounds familiar to anyone? I did some quick googling, but unfortunately the terms are all sufficiently popular when used together that I didn't find any reference to this bug. I was hoping for a quick fix from someone else, but I don't mind hacking PlanetPlanet myself if that's what it takes.

Anyone got a suggestion of where to start on a fix?

Edit: Just because I saw someone linking this on twitter, I'll update in the main post: tried Mary's suggestion of Planet Venus (see comments below) out on Monday and it seems to have done the trick, so hurrah!


May 31, 2014 03:53 AM

May 23, 2014

Frank Cheng
(Statsmodels student)

coding structure first pass

Kerby and I have gotten the structure of MICE to a reasonable point; it's a good time to make an update! Here is the current user interface:

>>> import pandas as pd  # (1)
>>> import statsmodels.api as sm  # (2)
>>> from statsmodels.sandbox.mice import mice  # (3)
>>> data = pd.read_csv('directory_here')  # (4)
>>> impdata = mice.ImputedData(data)  # (5)
>>> m1 = impdata.new_imputer("x2")  # (6)
>>> m2 = impdata.new_imputer("x3")  # (7)
>>> m3 = impdata.new_imputer("x1", model_class=sm.Logit)  # (8)
>>> impcomb = mice.MICE("x1 ~ x2 + x3", sm.Logit, [m1,m2,m3])  # (9)
>>> p1 = impcomb.combine(iternum=20, skipnum=10)  # (10)

Now here is what's going on, step by step:

1) Our base data type is going to be a pandas DataFrame. Currently, the data must be in a form that supports pd.DataFrame(data).

2) Import our statsmodels API.

3) Import our mice module.

4) Read in our data.

5) mice.ImputedData is a class that stores both the underlying data and its missing data attributes. Right now the key attribute is which indices for each variable are missing. Later we will be making changes directly to the underlying dataset, so we don't want to lose this information as we start filling in values. As soon as these indices are saved, we modify the underlying data by filling in as an initial imputation all the column-wise means for the missing values. ImputedData also contains a helper function that allows us to fill in values and a function that allows us to construct Imputers (described below). Note that changes to the data are within the scope of ImputedData, so your actual data will be safely preserved after all the MICE carnage is done :)

6-8) For each variable with missing values that we want to impute, we initialize an Imputer. These Imputers contain two simulation methods that will help us impute the specific variable of interest: impute_asymptotic_bayes (described in my last post) and impute_pmm (a new one, predictive mean matching). There will be more simulation methods later on. Each Imputer's job is to impute one variable given a formula and model, specified by the user. ImputedData.new_imputer defaults to OLS and all other variables as predictors. Note that here we are implicitly assuming Missing At Random since, conditional on the predictors, the missing value is completely random (and asymptotically Gaussian).

9) Initialize a MICE instance by specifying the model and formula that we are really interested in. The previous imputation models are simply conditional models that we think will do a good job at predicting the variable's missing values; this new model is an analysis model that we want to fit to the already-imputed datasets. We pass in a list containing all the Imputers from steps 6-8; these Imputers will do the work of imputation for us.

10) Here's where we specify an iteration number (iternum) and number of imputations to skip between iterations (skipnum). Mechanically, what is happening is that once MICE.combine is called, we start stringing imputations together, with the variable with the least number of missing observations being imputed first. ImputerChain just makes sure every Imputer in the list is used once to make one fully-imputed dataset. However, one imputation for each variable may not be enough to ensure a stable conditional distribution, so we want to skip a number of datasets between actually fitting the analysis model. So we run through all the Imputers ad nauseam until we have skipped the set number of fully imputed datasets, then fit the model. This is one iteration; we repeat until we have iternum fitted models. To be clear: if we specify iternum=10 and skipnum=5, we will go through a total of 50 imputation iterations (one iteration is a series of imputations for all specified Imputers) and only fit the analysis model for imputation number 5, 10, 15, etc.
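The arithmetic in step 10 can be checked with a small counting sketch. It only models the schedule; the function name analysis_schedule is made up and no imputation or model fitting happens here.

```python
def analysis_schedule(iternum, skipnum):
    """Return (total imputation cycles, cycles at which the model is fit)."""
    fit_points = []
    cycle = 0
    for _ in range(iternum):
        for _ in range(skipnum):
            cycle += 1              # one pass through every Imputer
        fit_points.append(cycle)    # the analysis model is fit on this dataset
    return cycle, fit_points


total, fits = analysis_schedule(iternum=10, skipnum=5)
print(total, fits)
```

With iternum=10 and skipnum=5 this gives 50 cycles, with fits at cycles 5, 10, 15 and so on up to 50.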

All this skipping and fitting happens in AnalysisChain. MICE.combine then takes all these fitted models (actually, only the parameters we care about: params, cov_params, and scale) and combines them using Rubin's rule, and finally the combined parameters are stuffed into a fitted analysis model instance.
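For a single scalar parameter, Rubin's rule can be sketched as below. This is a textbook version written for illustration, with made-up input numbers; the statsmodels code works on parameter vectors and covariance matrices and may differ in detail.

```python
import statistics


def rubin_combine(estimates, variances):
    """Combine m point estimates and their variances using Rubin's rule.

    Returns the pooled estimate and its total variance:
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)    # sample variance, m - 1 denominator
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up estimates and variances from three imputed datasets.
pooled, total_var = rubin_combine([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(pooled, total_var)
```

The between-imputation term is what makes the pooled variance larger than the average of the individual variances, reflecting the uncertainty due to the missing data.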

Whew, that's a mouthful! That's the gist of it; if you'd like to get under the hood (and I hope people do!) the GitHub is here. There are some minor things I did to statsmodels.api and the working directory just so I could work from my local GitHub folder/machine, so don't try to install it directly before changing those back.

Next step is making the user interface simpler (hopefully the user will just pass the data object directly into MICE and not have to deal with initializing Imputers and ImputedData) and also adding logic that lets the user specify which simulation method to use for imputation. Hopefully I'll get some feedback and make more improvements before my next post!

by Frank Cheng at May 23, 2014 06:51 AM

May 18, 2014

Edwin Marshall
(Kivy student)

GSoC '14 End of Community Bonding Period

Project Scope

One thing that has been frustrating about this experience is the fact that I think there has been some disparity between what I proposed and what is expected to be implemented. In particular, I don't make mention of mobile platforms in my original proposal, but as I communicate with my mentors it is rather apparent that my focus early on should be not only getting SDL2 to play nicely with Android in Kivy's ecosystem, but that I am expected to refactor the runtime system in general.

While I certainly voiced how overwhelming such a turn of direction has been for me to one of the core developers, I haven't protested because I honestly believe that this new direction will more positively impact the community. Also, while I was speaking specifically of kivy.core, when talking about having a solid foundation, the same logic can be applied to mobile platforms, so I don't think it is unreasonable to expect the same. After all, I've often read that the only way to become better (at anything) is to do something beyond your current capabilities.

The Community

The kivy community is as friendly and helpful as always. As I get acclimated to the Android ecosystem, I've had plenty of questions, all of which have been answered rather quickly. As I continue to feel my way around the intricacies of the current Android run time, I have no doubt that as I run into issues, they'll be able to offer assistance where necessary. My only real concern is IRC. While it is certainly more effective than email, I find that simply having an IRC client open distracts me from actually working. I find myself consumed by the side conversations or constantly checking to see if I've missed any important messages. As such, I might make it a point to be on IRC at specific times of the day and keep it closed otherwise.

The Way Forward

I look forward to this summer as a way to give back to a great open source project and grow as a developer in the process. In fact, I'm considering leaving my part-time gig so that I can focus on this exclusively, since I've found the context switching (warehouse work to coding) to be a bit draining. I'm already thinking about code all day, so I might as well make use of all that potential energy. I'll also need some kind of structure to keep me on task, given that this is the first time I've done any work from home. Given that I attend school online as well, I think that should be less of a puzzle to work out. However, I think it will be important simply because I don't have a lot of time to do something great, so every minute counts.


May 18, 2014 11:11 PM

May 17, 2014

Simon Liedtke
(Astropy student)

Bonding Period is Over

The last week has officially been the so-called "community bonding period". I used it to get more familiar with the codebase, hang around on #astropy and follow the development via the mailing list and the updates of the astroquery repository at GitHub. My next plans are:

  • to finally finish the pull request I started, which switches from the library lxml to BeautifulSoup.
  • to fix the issue #49

Today I will have a hangout with my mentor Adam Ginsburg to discuss the next steps and how to implement them. At least that's what I imagine, actually we only arranged a time for a hangout.

by Simon Liedtke at May 17, 2014 10:00 PM