Python's Summer of Code Updates

July 29, 2014

Michael Mueller
(Astropy student)

Week 10

I spent this week mostly working on the multiprocessing branch, fixing up some issues from last week and adding a bit more functionality. Most importantly, I finally switched to a more efficient scheme for reading data when given a filename as input; I'd been meaning to deal with this previously, since profiling indicated that the naïve method of simply reading file data into a single Python string took up something like 15% of processing time, so it's nice to have a more efficient method.
One idea I had considered previously was to split the file reading across the processes themselves, but my initial changes showed a speed decrease, which made sense after I came across this comment on StackOverflow explaining that it's best from a performance standpoint to access file data sequentially. I then switched over to reading separate chunks for each process in the main process at the beginning of parsing and then passing each chunk to its respective process, which seems more promising but still runs into issues with memory management. I've been trying to find a better solution today, and I think I should be able to figure it out by tomorrow.
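A rough sketch of that chunking idea (plain Python, not the actual Cython reader): the main process reads the file sequentially, cuts it at newline boundaries, and hands each chunk to a worker process. The file name and chunk count here are just placeholders.

    import multiprocessing as mp

    def split_chunks(text, n_chunks):
        """Split text into at most n_chunks pieces, each ending on a newline."""
        chunks, start, size = [], 0, max(1, len(text) // n_chunks)
        for _ in range(n_chunks - 1):
            end = text.find('\n', start + size) + 1   # extend to the next newline
            if end == 0:                              # no newline left
                break
            chunks.append(text[start:end])
            start = end
        chunks.append(text[start:])
        return chunks

    def parse_chunk(chunk):
        # stand-in for the real Cython tokenizer/converter
        return [line.split() for line in chunk.splitlines()]

    if __name__ == '__main__':
        with open('data.txt') as f:        # hypothetical input file
            text = f.read()
        with mp.Pool(8) as pool:
            results = pool.map(parse_chunk, split_chunks(text, 8))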
Another issue I looked into was finding a faster algorithm for converting strings to doubles, based on Pandas' superior performance. After looking into what Pandas does, I found that a fairly simple conversion function xstrtod() replaces strtod() from the standard library; from what I was told by a Pandas developer, it seems that Pandas considers the speed gains more important than retaining the relatively slow correction loop in strtod() for high-precision values and is therefore willing to have values off by more than 0.5 ULP (units in the last place). numpy's conversion routine doesn't seem to offer a clear benefit (I tried replacing strtod() with numpy's conversion and got mixed results), so I'm not sure if any optimization is possible.
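To illustrate the trade-off, here is a Python transcription of the general xstrtod-style approach (not pandas' actual C code): digits are accumulated and a power of ten is applied at the end, skipping strtod()'s slow correction loop, which is why results can be off by more than 0.5 ULP for high-precision input.

    def fast_atof(s):
        s = s.strip()
        i, sign = 0, 1.0
        if i < len(s) and s[i] in '+-':
            sign = -1.0 if s[i] == '-' else 1.0
            i += 1
        value = 0.0
        while i < len(s) and s[i].isdigit():          # integer part
            value = value * 10.0 + (ord(s[i]) - ord('0'))
            i += 1
        if i < len(s) and s[i] == '.':                # fractional part
            i += 1
            frac = 0
            while i < len(s) and s[i].isdigit():
                value = value * 10.0 + (ord(s[i]) - ord('0'))
                frac += 1
                i += 1
            value /= 10.0 ** frac
        if i < len(s) and s[i] in 'eE':               # exponent
            i += 1
            exp_sign = 1
            if i < len(s) and s[i] in '+-':
                exp_sign = -1 if s[i] == '-' else 1
                i += 1
            exp = 0
            while i < len(s) and s[i].isdigit():
                exp = exp * 10 + (ord(s[i]) - ord('0'))
                i += 1
            value *= 10.0 ** (exp_sign * exp)
        return sign * value

    # fast_atof("1.2345678901234567e-5") may differ from float() in the last bits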
I then added a parallel parameter for reading and patched up the failing tests (due to multi-line quoted values, which Tom suggested should be an acceptable but documented limitation of parallelized reading). I also fixed a unicode issue with string conversion in Python 3, which arose because the 'S' dtype in numpy corresponds to bytes in Python 3 rather than str. Interestingly, I found it to be a fairly trivial extension of the file reading code to implement memory mapping for reading on non-Windows systems, so I tentatively added it with a memory_map keyword and will test it more thoroughly to see whether the performance boost is significant. This does not enable a memory-mapped view of the file inside the output Table; it simply memory-maps the file to a char * pointer for faster reading and then un-maps it after processing is done.
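A minimal sketch of the memory_map idea in pure Python (the real code hands the mapped buffer to the tokenizer as a char * in C):

    import mmap

    def read_with_mmap(filename):
        with open(filename, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                data = mm[:]        # in C this buffer would be used directly for tokenizing
            finally:
                mm.close()          # unmap once processing is done
        return data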
Next week I'll be working on Erik's idea of creating custom numpy dtypes, and I'll also add documentation, make some final fixes, and otherwise get the engine ready for merging in the meantime.

by Michael Mueller (noreply@blogger.com) at July 29, 2014 04:07 AM

July 28, 2014

Manoj Kumar
(scikit-learn student)

Scikit-learn: Logistic Regression CV

Hi, it has been a long time since I posted something on my blog. I recently had the opportunity to participate in the scikit-learn sprint, with the majority of the core developers. The experience was awesome, but most of the time I had no idea what people were talking about, and I realised I have a lot to learn. I read somewhere that if you want to keep improving in life, you need to make sure you are the worst person on the job, and if that is true, I'm well on the right track.

Anyhow, on a more positive note, one of my biggest pull requests recently got merged ( https://github.com/scikit-learn/scikit-learn/pull/2862 ), and we shall have a quick look at the background: what it can do and what it cannot.

1. What is Logistic Regression?
Logistic Regression is a regression model that uses the logistic sigmoid function to predict class membership. The basic idea is to find the coefficient vector w such that it fits the logistic function \frac{1}{1 + e^{-w'*X}}. A quick look at the graph (taken from Wikipedia) shows that when y is one, we need our estimator to predict w'X to be infinity, and vice versa.

[Figure: the logistic sigmoid curve (Logistic-curve.svg, from Wikipedia)]

Now, if we want to fit labels in [-1, 1], the sigmoid function becomes \frac{1}{1 + e^{-y*w'*X}}. The logistic loss function is given by log(1 + e^{-y*w'*X}). Intuitively this seems correct, because when y is 1 we need our estimator to predict w'X to be infinity to suffer zero loss. Similarly, when y is -1, we need our estimator to predict w'X to be minus infinity. Our basic focus is to minimize this loss.

2. How can this be done?
This can be done either with block coordinate descent methods, as lightning does, or with the solvers that scipy provides, such as newton-cg and lbfgs. For the newton-cg solver we need the Hessian, i.e. the matrix of second derivatives of the loss, and for the lbfgs solver we need the gradient vector. If you are too lazy to do the math (like me?), you can obtain the values from here: Hessian
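As a minimal sketch of the lbfgs route (not scikit-learn's implementation; the data and penalty here are made up), one can write the loss and its gradient and hand them to scipy:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    def logistic_loss_and_grad(w, X, y, alpha):
        # y in {-1, +1}; alpha is an L2 penalty on the coefficients
        z = y * X.dot(w)
        loss = np.sum(np.logaddexp(0, -z)) + 0.5 * alpha * w.dot(w)
        p = expit(z)                                  # sigmoid of y * w'X
        grad = -X.T.dot(y * (1.0 - p)) + alpha * w
        return loss, grad

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)
    y = np.where(rng.randn(100) > 0, 1.0, -1.0)
    res = minimize(logistic_loss_and_grad, np.zeros(5), args=(X, y, 1.0),
                   jac=True, method='L-BFGS-B')
    print(res.x)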

3. Doesn’t scikit-learn have a Logistic Regression already?
Oh well it does, but it is dependent on an external library called Liblinear. There are two major problems with this.
a] Warm start: one cannot warm start with liblinear, since it does not accept an initial coefficient parameter, unless we patch the shipped liblinear code.
b] Penalization of the intercept. Penalization is done so that the estimator does not overfit the data; however, the intercept is independent of the data (it can be considered analogous to a column of ones), so it does not make much sense to penalize it.

4. Things that I learnt
Apart from adding a warm start (there seems to be a sufficient gain on large datasets) and not penalizing the intercept, here are some other things that were added:
a] refit parameter – generally, after cross-validating, we take the average of the scores obtained across all folds, and the final fit is done with the hyperparameter (in this case C) that corresponds to the best score. However, Gael suggested that one could instead take the best hyperparameter of each fold (in terms of score) and average the corresponding coefficients and hyperparameters. This avoids the final refit.
b] Parallel OvA – for each label, we perform a one-vs-all (OvA) fit, that is, we convert y into 1 for the label in question and -1 for the other labels. There is a Parallel loop across all labels and folds, and this is supposed to make it faster.
c] Class weight support: the easiest way to do this is to convert class weights into per-sample weights and multiply the loss of each sample by its weight. But we faced a small problem when the following three conditions occur together: a class-weight dict, the liblinear solver, and a multiclass problem, since liblinear does not support sample weights.

5. Problems that I faced.
a] The fit_intercept=True case turned out to be considerably slower than the fit_intercept=False case. Gael's hunch was that this is because the intercept varies on a different scale than the data. We tried different things, such as preconditioning the intercept, i.e. dividing the initial coefficients by the square root of the diagonal of the Hessian, but it did not work, and it took one and a half days.

b] Using liblinear as an optimiser or a solver for the OvA case.

i] If we use liblinear as a solver, it means supplying the multi-label problem directly to liblinear.train. This would affect the parallelism, and we are not sure if liblinear internally works the same way as we think it does. So after a hectic day of refactoring code, we finally decided (sigh) that using liblinear as an optimiser is better (i.e. we convert the labels to 1 and -1). For more details, you can have a look at Gael's comment here: https://github.com/scikit-learn/scikit-learn/pull/2862#issuecomment-49450285

Phew, this was a long post and I’m not sure if I typed everything as I wanted to. This is what I plan to accomplish in the coming month
1. Finish work on Larsmans PR
2. Look at glmnet for further improvements in the cd_fast code.
3. ElasticNet regularisation from Lightning.


by Manoj Kumar at July 28, 2014 11:36 PM

Julia Medina
(Scrapy student)

Four weeks into 2nd half

I've opened the #816 Scrapy's pull request with the progress I've made on the API refactoring. I'm keeping a list in the description of the PR with the current status of the tasks that I have assigned, along with the backward incompatible changes that were introduced.

I've updated the Spider class's methods, the Crawler's initialization and the SpiderManager's interface, following the design that was explained in my proposal. After reviewing the complexity added to support deprecated features, a relaxation of strict backward compatibility in favour of code simplicity is currently under way.

Over the next weeks I'll wrap up the changes, adding missing tests and documentation and getting the PR to a stable and mergeable state. I also plan to address further issues concerning API problems if I have spare time until the final deadline.

by Julia Medina (noreply@blogger.com) at July 28, 2014 06:53 PM

Roy Xue
(Theano student)

Algorithm Speeding Up#GSoC2014-Theano


Recently I have mainly been working on speeding up the algorithm that calculates the memory. At first it felt very hard for me, because I hadn't learned much about algorithms before, and sometimes I find it difficult to think of a faster algorithm.

Problem

To generate the valid orders for counting the minimum memory peak, I use a Python generator, which is very fast. But the problem is that there are too many nodes; the test file, for example, has 11 nodes, which means there are 11! possible orders in total, a huge number. To generate the valid orders, the code may need to call the check_node_state() function millions of times, because at each step we need to check whether a node can actually be "executed" at that point in the order. After this process we get about 20k valid orders, and then we have to count the memory of each order and find the minimum one. This is the main part that costs a lot of running time.

Solution

The solution we use has two parts. First, before running the generator, we create a set of the Apply nodes that can currently be executed. This step reduces the number of check_node_state() calls and also removes some generator loops.

Second, I wrote a new, simpler count_min_memory() method, because the previous memory-counting method does a lot of unnecessary work in this situation. My first idea was actually to count the memory inside the generator loop itself, but after thinking it through in detail I found it's not a good idea, because it would cost much more time and it's also difficult to generate post_thunk_of_old_storage for each node.
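To make the idea concrete, here is a rough pure-Python sketch (hypothetical names, not Theano's actual profiling code) of enumerating only the dependency-respecting orders from a set of currently executable nodes, together with a much-simplified peak counter:

    def valid_orders(nodes, deps):
        # deps[n] is the set of nodes that must run before n
        def rec(order, done, executable):
            if len(order) == len(nodes):
                yield list(order)
                return
            for node in executable:
                new_done = done | {node}
                # nodes that become executable once `node` has run
                newly = {n for n in nodes if n not in new_done and deps[n] <= new_done}
                for o in rec(order + [node], new_done, (executable | newly) - {node}):
                    yield o
        start = {n for n in nodes if not deps[n]}
        return rec([], set(), start)

    def min_peak_memory(orders, node_mem):
        # simplified stand-in for count_min_memory(): minimum over orders of the peak
        # running sum (the real code also accounts for memory that gets freed)
        best = float('inf')
        for order in orders:
            running = peak = 0
            for node in order:
                running += node_mem[node]
                peak = max(peak, running)
            best = min(best, peak)
        return best

    # e.g. nodes = ['a', 'b', 'c']; deps = {'a': set(), 'b': {'a'}, 'c': {'a'}}
    # yields the two valid orders a,b,c and a,c,b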

Result

Using the command python -m cProfile -s cumulative test_profiling.py:

Before speed up

2262689 function calls (2018667 primitive calls) in 3.566 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.001    0.001    3.579    3.579 test_profiling.py:4(<module>)
    1    0.000    0.000    2.993    2.993 test_profiling.py:13(test_profiling)
    1    0.000    0.000    2.720    2.720 profiling.py:1002(summary)
    1    0.000    0.000    2.717    2.717 profiling.py:602(summary_memory)
    1    0.786    0.786    2.716    2.716 profiling.py:696(count_minimum_peak)
19802    0.690    0.000    1.244    0.000 profiling.py:641(count_running_memory)
    1    0.002    0.002    0.488    0.488 __init__.py:23(<module>)
   19    0.024    0.001    0.465    0.024 __init__.py:1(<module>)
261450/19801    0.305    0.000    0.432    0.000 profiling.py:782(min_memory_generator)
    1    0.001    0.001    0.301    0.301 __init__.py:29(<module>)
    1    0.001    0.001    0.300    0.300 scan_opt.py:3(<module>)
    1    0.000    0.000    0.255    0.255 function.py:15(function)
    1    0.000    0.000    0.255    0.255 pfunc.py:336(pfunc)
    1    0.000    0.000    0.254    0.254 function_module.py:1262(orig_function)
439476    0.249    0.000    0.249    0.000 {getattr}
19802    0.193    0.000    0.223    0.000 link.py:504(gc_helper)
    1    0.002    0.002    0.219    0.219 basic.py:1(<module>)
178301    0.215    0.000    0.215    0.000 {method 'index' of 'list' objects}
    1    0.001    0.001    0.212    0.212 elemwise.py:1(<module>)
    ......

After speed up

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.001    0.001    3.272    3.272 test_profiling.py:4(<module>)
    1    0.000    0.000    2.791    2.791 test_profiling.py:13(test_profiling)
    1    0.000    0.000    2.514    2.514 profiling.py:1015(summary)
    1    0.000    0.000    2.511    2.511 profiling.py:602(summary_memory)
    1    0.749    0.749    2.509    2.509 profiling.py:696(count_minimum_peak)
19800    0.575    0.000    0.890    0.000 profiling.py:809(count_min_memory)
261450/19801    0.290    0.000    0.397    0.000 profiling.py:785(min_memory_generator)
   19    0.018    0.001    0.396    0.021 __init__.py:1(<module>)
    1    0.002    0.002    0.395    0.399 __init__.py:23(<module>)
    1    0.001    0.001    0.283    0.283 __init__.py:29(<module>)
    1    0.001    0.001    0.282    0.282 scan_opt.py:3(<module>)
    1    0.000    0.000    0.259    0.259 function.py:15(function)
    1    0.000    0.000    0.259    0.259 pfunc.py:336(pfunc)
    1    0.000    0.000    0.258    0.258 function_module.py:1262(orig_function)
 439476    0.238    0.000    0.238    0.000 {getattr}
 19802    0.183    0.000    0.212    0.000 link.py:504(gc_helper)
    1    0.002    0.002    0.211    0.211 basic.py:1(<module>)
178301    0.207    0.000    0.207    0.000 {method 'index' of 'list' objects}
......

We can see that the time cost has been reduced.

But when I tried it with machine learning examples like logistic_sgd.py, the result was not very good; it costs a lot of time. So I guess I still need to think about other ideas, or make counting the minimum peak an optional parameter of the function.

The problem is that for N Apply nodes in our function, there are N! total orders from which we must find the valid ones. When N is big, N! is really huge, so the number of valid orders is also very big, and counting the minimum memory over too many orders costs too much time. So maybe removing similar orders is a solution.

by xljroy at July 28, 2014 04:19 PM

Asra Nizami
(Astropy student)

Note to self: testing is really important

It's been an interesting couple of weeks. As I mentioned in my last blog post, I've been working on integrating WCSAxes into APLpy. It was going pretty well in the beginning but now things are starting to get complicated (just to be clear, I'm not complaining!). APLpy isn't a very big package but it still has a lot of features which I need to make sure work as I integrate WCSAxes into it. *THAT* is the tricky part. This means extensive testing after small re-writes of code to make sure new features work and I haven't broken something that was working previously. Well, that's the ideal workflow - I should have been doing extensive testing this way but sigh, I didn't. I've almost (I feel like the coyote who's *almost* caught the roadrunner as I say this) made most of the new code work but over the past few days I got so wrapped up in cleaning bits of code which I didn't test out fully that I broke a pretty important part of APLpy, plotting data cubes. I'm not sure where I broke it and my commit history is a mess (because I kept committing changes without testing them out fully again) so git bisect didn't help me identify where the problem started either.

This week is Eid, an Islamic holiday (Eid Mubarak to any Muslims reading this!!) so I won't have that much time to figure out where I introduced this bug, but once I do, I'll probably stock up on Pepsi, blare loud music in the background and carefully comb through my branch to examine all the code changes. That's it for now! :)


by Asra Nizami (noreply@blogger.com) at July 28, 2014 12:28 PM

Rajeev S
(GNU Mailman student)

The YACC Cleans up my Code!

An interesting week! Even in my wildest dreams, I never expected that I would get my hands wet again with the beautiful kind of programming that is writing a compiler. I had done this as part of my BTech curriculum, but did not expect it to pop up during my Summer of Code.

My project includes building a query interface and thus, obviously, a query language. Since the intended language was simple, I had planned to implement some basic parsing technique like a recursive descent parser. Once I started coding, I found that recursive descent parsing was painful, and I ended up writing naive code that processes a string array. This was error prone and very hard to manage, as it led to far too many index errors and, more dangerously, ignored extra parameters.

My next option was a regular-expression-based parser, which I successfully built and which worked fine. I used the config.ini file to store the regular expressions for each command and used them to validate the commands. This approach did not break anywhere, but the code looked rather ugly, with far too many `pops`. Another huge drawback was the lack of good error reporting, as I could not report where a command went wrong; all I could report was that the syntax was wrong, followed by the command usage string. I asked on the Python IRC about a way to print better error messages for failed regular expressions, and it was there that I got the suggestion to use a parser. Since this is a PSF project, I could not quite ignore a suggestion from the Python IRC. Furthermore, the regular expression approach only performed validation; I still had to hand-code the command parsing, which could make extending the project difficult in the future.

I did some research on the parser libraries available for Python and settled on PLY, a project with good documentation and plenty of examples. I tried a few sample grammars and began writing the parser for my project, using PLY's class-based approach. I had to choose between a common parser for the whole project and a separate parser for each command; I settled on the latter, on the assumption that it would be cleaner and easier to manage and extend. In two days' time I completed the parser for each command and also rebuilt the environment variable management part. I completely removed the decorators that were used for command validation and pre-processing.
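For flavour, a tiny PLY grammar for a made-up command of the form show <object> where <key> = "<value>" might look like the sketch below; the project's real query grammar, class-based setup and error handling are of course more involved.

    import ply.lex as lex
    import ply.yacc as yacc

    reserved = {'show': 'SHOW', 'where': 'WHERE'}
    tokens = ['EQUALS', 'WORD', 'STRING'] + list(reserved.values())

    t_EQUALS = r'='
    t_ignore = ' \t'

    def t_STRING(t):
        r'"[^"]*"'
        t.value = t.value[1:-1]
        return t

    def t_WORD(t):
        r'[a-zA-Z_][a-zA-Z0-9_]*'
        t.type = reserved.get(t.value, 'WORD')   # keywords vs. identifiers
        return t

    def t_error(t):
        raise SyntaxError('Illegal character %r' % t.value[0])

    def p_command(p):
        'command : SHOW WORD filter'
        p[0] = ('show', p[2], p[3])

    def p_filter(p):
        '''filter : WHERE WORD EQUALS STRING
                  | empty'''
        p[0] = (p[2], p[4]) if len(p) == 5 else None

    def p_empty(p):
        'empty :'

    def p_error(p):
        # pinpointing the offending token is the big win over the regex approach
        raise SyntaxError('Syntax error at %r' % (p.value if p else 'end of input'))

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse('show users where name = "foo"'))
    # -> ('show', 'users', ('name', 'foo'))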

The code is now a lot cleaner and more readable than before. Errors are nicely reported and handled, and everything works fine. Time is running fast, and I plan to complete my tasks at least a week before the deadline of 11/08/2014; I have announced on the mm-dev list that I am working with 6th August as my soft deadline.

Last but not least, I am blogging from the Kanyakumari–Bangalore Island Express, thanks to the Indian Railways!



by Rajeev S (noreply@blogger.com) at July 28, 2014 05:47 AM

PritishC
(SunPy student)

Nearing the Close

Hello folks.

While admittedly there has not been too much happening on the online front, a lot has been happening on the offline front around here!

There has been talk of introducing some more FrameAttributes to support the HeliographicStonyhurst to Heliocentric and vice-versa transformations. The previous part with the introduction of the new get_hpc_distance() method paid off handsomely – my mentor’s calculations got the HGS to HP transform to work properly. The next step is to fix some buggies and introduce the new FrameAttributes.

I have been learning Flask from Miguel Grinberg’s Mega Tutorial on the advice of my mentor and it has been a very satisfying journey so far. I have created a repo on GitHub to learn while committing. Flask is a microframework for Python based on Werkzeug and Jinja2. It is less complex in comparison with Django, and learning it really is fun. There are also Flask extensions to help one with the job of interfacing with other apps such as SQLAlchemy.

I have appeared for a couple of job interviews for some good startups in these two weeks. The GSoC project definitely has helped me gain some traction. It gets me noticed where I would earlier not have had a chance.

That’s all for now! My apologies if this post is too small, peeps.


by xpritish at July 28, 2014 05:37 AM

Elana Hashman
(OpenHatch student)

Security Holes and Django Forms

"Hey—I have this crazy idea. What if the POSTs django-inplaceedit processes are totally exploitable?"

Security Hilarity

And so our journey down the rabbit hole began. It turns out that you should probably never include little-used open source software that modifies entries in your database while making the assumption that it's secure.

I believe the following code speaks for itself:

def test_will_inplaceedit_allow_us_to_pwn_ourselves(self):
    # Asheesh: "total cost of pwnership: 1 test"
    # note: user paulproteus has poor password hygiene
    u = User.objects.create(username='paulproteus', password='password')
    u.save()

    self.client.post(
        '/inplaceeditform/save/',
        {
            "app_label": "auth",      # the django app
            "module_name": "user",    # the django table
            "field_name": "username", # the field name
            "obj_id": u.pk,           # the pk
            "value": '"LOLPWNED"'     # new value
        })

    self.assertEqual(User.objects.get(pk=u.pk).username, "LOLPWNED")

Carefully crafting this payload involved very little effort; watching this test pass was a little bit horrifying.

The problem stems from the way django-inplaceedit sets up user permissions; their docs give an example of only allowing superusers edit access, which would suggest an internal or authenticated use case. But we want our bugsets to be publicly editable... and given that this is the way we set up our permissions, we also implicitly gave the entire public internet the ability to edit our entire database. Wow.

Lesson learned: trust no one. Or, at least restrict access to the bugsets app tables. For now, we've disabled this by turning off edit access in production. We also plan to make an upstream pull request to the django-inplaceedit docs to help others avoid this kind of security issue in the future. We also created (and addressed) an issue to track this fun adventure.

Development

Development for this period started off a little delayed due to lack of Asheesh availability, and development on the main project was delayed slightly by the GIANT SECURITY HOLE django-inplaceedit introduced, but we sure wrote a lot of code in the end.

gsoc14.10 and gsoc14.11 deliverables

These two weeks saw the end of the long-standing issue995:

  • Fix django-inplaceedit CSS to make the screen usable
  • YAGNI simplification of AnnotatedBug fields

This milestone was pushed back again due to difficulty with CSS and issues that continually arose every time existing ones were fixed. Here's a code sample that reflects the majority of these two weeks' development.


Figure 1: CSS successfully coerced into exhibiting beauty

gsoc14.12 deliverables

This milestone saw major development of the create/edit view (tracking issue).

Asheesh and I fought with Django forms, many tests, and mostly emerged victorious.


Figure 2: Create a bug set



Figure 3: You can't inject javascript!

What's next

There are a number of features we've discussed implementing for the last few weeks. But here's what we have planned:

  • Finishing up the edit view
  • Update list view to include edit links (while logged in)
  • Simultaneous editing: async notification if loaded page values have changed
  • Security fixes to re-enable editing in production
  • Refresh button for associated metadata we have in the OpenHatch bug db
  • User testing
  • Documentation

Obstacles

There have been two main issues the past few weeks: CSS issues and availability. As I've lamented about CSS in my last post, I'll discuss the availability problem in a bit more detail.

Since returning from OSBridge, Asheesh and I have been a bit busy playing catch-up with school and work. In particular, since Asheesh's availability has been even less than that of the dreaded "working adult with a life outside of job," we've had a lot less time to sync up and pair. And due to catching up on assignments and course wrap-up, I was able to complete less GSoC work than I might have if Asheesh had been babysitting me in some pairing sessions.

In regard to the CSS, in the end I managed to figure it out! A big thanks to Rachelle Saunders, who I met at a local Women Who Code meetup, and was able to solve all my woes in all of 5 fateful minutes. You rock, lady! (A moment of silence for all those hours I could have spent fighting with this on my own. Tears shed for the loss of this valuable learning experience.)

by Elana Hashman at July 28, 2014 03:00 AM

Gaurav Trivedi
(Kivy student)

Towards Making Kivy Apps Accessible – 2

In this second part of my post on making Kivy apps accessible, I would like to describe some of the existing libraries and APIs we can model our accessibility features upon. Last week we identified that the main obstacle to creating accessible Kivy apps is a missing module that could communicate widget states to screen readers. I explored other frameworks that may have tried to solve these problems before us, and will discuss one such project in particular.

Kivy includes pygame as one of the supported window providers. While looking for accessible apps in Python I found a GUI engine for pygame called OcempGUI. As a part of this project they also worked upon an accessibility module named Papi:

Papi, the Python Accessibility Programming Interface, is a Python wrapper around the GNOME ATK toolkit. It allows a developer to make python objects and applications easily accessibility aware without the need to install PyGTK and the GNOME accessibility components. Instead it only depends on ATK and – on the developers behalf – the ATK/AT-SPI bridge shipped with AT-SPI.

There is also some support for Microsoft Active Accessibility (MSAA) on Microsoft Windows.

Papi is not limited to the graphical user interfaces you can build with OcempGUI; it can help support accessibility for any Python object. Here's an example accessible app using Papi:


I created a gist, as I couldn't find a way to embed a SourceForge file, but you can find the complete project there. You can also read more about Papi and OcempGUI on their website. Unfortunately, it looks like the project is no longer in active development; its last release was in 2008.

Assuming that we can build upon this module, we'd be able to support accessible Kivy apps on Windows and Linux. That still leaves OS X accessibility, and the mobile devices.

Mac OS has very good accessibility support, but only when you are writing Cocoa or Carbon apps. To provide a similar level of accessibility support, you would likely need to hack around a bit. With input from the folks at #macdev on freenode, I set out to do just that. They suggested that I could subclass NSApplication and implement the NSAccessibility protocol within the Kivy app. This would involve creating a hierarchy of fake UI objects that provide accessibility implementations, as in a Cocoa app. I did make some progress by using our in-house project, PyObjus, to access AppKit frameworks and subclass Objective-C classes, but the situation became a little too overwhelming for me to handle within this one week, and I haven't succeeded in creating a working proof of concept as yet.

Fortunately, Apple has recently launched a new API, starting with OS X 10.10, that includes an NSAccessibilityElement class. I am hoping this will help avoid creating fake UI objects to implement their accessibility protocols. There are a couple of examples demonstrating this, but I haven't tried them out yet; I need to get access to the beta versions first. You can also watch the WWDC 2014 session videos on Accessibility on OS X describing it.

That leaves the mobile platforms. Here are links to the relevant docs pages for Android (http://developer.android.com/guide/topics/ui/accessibility/apps.html#custom-views) and iOS (https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/iPhoneAccessibility/Accessibility_on_iPhone/Accessibility_on_iPhone.html). I would like to use these bookmarks when I dive deeper into the mobile accessibility APIs. This week, however, I plan to go back to my original plan and continue the Plyer work from where I left off.

 

by gtrivedi at July 28, 2014 12:36 AM

July 27, 2014

M S Suraj
(Vispy student)

Visuals tests – GSoC Week 9

Mandatory blog post -

This week has been filled with writing tests for the Visuals.

  • Wrote a sort of test-bed for Visuals in the form of a TestingCanvas object.
  • Used it to write tests for Ellipse, Polygon and RegularPolygon visuals.
  • The tests draw a visual on the canvas, retrieve an appropriate reference image from a test-data repository, and compare the two.
  • Minor allowances have been made to ignore small differences in the corners.
  • The work is available here – https://github.com/vispy/vispy/pull/327 .

Although it took a lot of iterations to get this right, it was fun to code them and learn how tests are written, how Travis (the continuous integration platform) works, and so on.

Next target – Make the visuals reactive!


by mssurajkaiga at July 27, 2014 11:21 PM

Brigitta Sipőcz
(Astropy student)

RAGHAV R V
(BinPy student)

GSoC 2014 - BinPy - Binary Multiplication Algorithms in Python - Part 1 / 2

The past week I've been trying to implement Binary Multiplication Algorithms for study / analysis / implementation of the same in ALUs.

Here's a short review of the three algorithms that I have implemented so far.

Naive approach to binary multiplication.

A naive register-level implementation of binary multipliers works by repeatedly adding the multiplicand to the product register (initialized to 0) until an auxiliary register holding the multiplier is decremented to 0. This is based on the fact that multiplication is repeated addition.

The problem with this algorithm is that the time complexity is of the order O(n^2), because it requires n^2 multiplication operations, and the results of these n^2 multiplications are added up to give the final product.

Practical primary school multiplication.

Practically, without much thought, one can simply multiply binary numbers just the way we do with decimal (base 10) numbers, using the traditional method.

That is, multiply each bit of the second number individually with all the bits of the first number, then add all the intermediate results to give the final product.

[Image of the traditional long-multiplication method, taken from Google Images]

Though this is supposed to be of the order O(n^2) (16 steps in this example),

our brain intuitively reduces it to just 4 steps.

This is primarily due to the fact that in binary multiplication the only two possible results of multiplying a binary number with a single bit (1 or 0) are either all 0s or the number itself.

Based on this approach, the time complexity is drastically reduced to O(n).

Robertson's Multiplication algorithm.

Robertson's multiplication algorithm applies such an approach to multiplying two signed binary numbers represented in 2's complement. In this approach the sign bit is taken into consideration in the same way as the magnitude bits.

In essence, the product of two binary numbers is typically calculated by a series of shift-and-add steps, which can be represented mathematically as

P_i = P_i + X_i * Y
P_(i+1) = P_i * 2^(-1)

where P_i is the partial product.

This, along with some caveats and a correction step for the case when the multiplier alone is negative, makes up Robertson's algorithm for multiplying 2's complement signed binary numbers.


Python Code to implement the Robertson's multiplication :
[Code]
Booth's Multiplication algorithm

Booth's multiplication algorithm is a slightly more efficient algorithm which scans 2 adjacent bits at a time (unlike Robertson's algorithm, which scans only one bit at a time).

Based on the two adjacent bits, operations are carried out on the partial product.

Intuitively this can be understood as skipping sequences of 1s and 0s (since if the 2 adjacent bits are 00 or 11, neither addition nor subtraction takes place).

This reduces the number of additions/subtractions at the cost of increased hardware to implement the logic. But it should be noted that the worst-case time complexity remains of order O(n).

Python code to implement Booth's Multiplication algorithm is slightly simpler :
[Code]

Karatsuba Multiplication Algorithm
The Karatsuba multiplication algorithm is based on the divide-and-conquer principle. It breaks the multiplication down into smaller parts and computes the full product from these smaller partial products.

It is a really efficient algorithm and is used in the core of Python for multiplying very large numbers.

To understand the algorithm, consider two n-bit (assume n is even) binary numbers X (the multiplicand) and Y (the multiplier).

These can be written as (in base 2):
X = X1 * 2^(n/2) + X0
Y = Y1 * 2^(n/2) + Y0

Now the multiplication of X and Y can be carried out as:

X * Y = (X1 * 2^(n/2) + X0) * (Y1 * 2^(n/2) + Y0)

      = X1*Y1 * 2^n + (X0*Y1 + X1*Y0) * 2^(n/2) + X0*Y0

Optimization of the middle term is the key to the speed of Karatsuba's multiplication algorithm.

Upon simple observation we can note that the middle term consists of 2 new product terms (i.e. 2 additional multiplication operations).

In Karatsuba's algorithm we expand this using a fundamental algebraic relation:
X0*Y1 + X1*Y0 = (X0 + X1) * (Y0 + Y1) - X1*Y1 - X0*Y0

Note that since the last two terms are precomputed the middle term now can be computed using only one additional multiplication  operation.

This is the key optimization, done in this algorithm, that makes it efficient.

The time complexity can now be calculated from the recurrence T(n) = 3T(n/2) + O(n).
Solving the above, we get the order of the algorithm as O(n^1.59).
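As a rough illustration, here is a minimal recursive Karatsuba sketch on Python integers, shown in base 10 for readability (this is not BinPy's register-level implementation):

    def karatsuba(x, y):
        # base case: single-digit operands
        if x < 10 or y < 10:
            return x * y
        half = max(len(str(x)), len(str(y))) // 2
        high_x, low_x = divmod(x, 10 ** half)
        high_y, low_y = divmod(y, 10 ** half)
        z0 = karatsuba(low_x, low_y)
        z2 = karatsuba(high_x, high_y)
        # the single extra multiplication that yields the middle term
        z1 = karatsuba(low_x + high_x, low_y + high_y) - z2 - z0
        return z2 * 10 ** (2 * half) + z1 * 10 ** half + z0

    assert karatsuba(1234, 5678) == 1234 * 5678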

The below graph shows a graphical performance comparison between Karatsuba's algorithm and Standard Multiplication algorithm :
 


Python Code to implement Karatsuba's algorithm ( Needs a little optimization and cleaning ):
[Code]

by Raghav R V (noreply@blogger.com) at July 27, 2014 09:47 PM

Hamzeh Alsalhi
(scikit-learn student)

Sparse Output Dummy Classifier

The Scikit-learn dummy classifier is a simple way to get naive predictions based only on the target data of your dataset. It has four strategies of operation.

  • constant - always predict a value manually specified by the user
  • uniform - label each example with a label chosen uniformly at random from the target data given
  • stratified  - label the examples with the class distribution seen in the training data
  • most-frequent - always predict the mode of the target data
The dummy classifier has built-in support for multilabel-multioutput data. This week I made pull request #3438, which introduces support for sparsely formatted output data. This is useful because memory consumption can be vastly improved when the data is highly sparse. Below is a benchmark of these changes, with two memory consumption results for each of the four strategies: once with sparsely formatted target data and once with densely formatted target data as the control.

Benchmark and Dataset

I used the Eurlex eurovoc dataset, available here in libsvm format, with the following script. The benchmark script lets you recreate the results in this post easily. When run with the Python module memory_profiler, it measures the total memory consumed by initializing a dummy classifier and performing a fit and predict on the Eurlex data.

The dataset used has approximately 17,000 samples and 4000 classes for the training target data, and the test data is similar. They both have sparsity of 0.001.
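For a feel of the measurement, a rough stand-in for the benchmark might look like the following (not the author's actual script; the synthetic target matrix only mimics the Eurlex shape and sparsity, and sparse multilabel targets rely on the support added in #3438):

    # run with: python -m memory_profiler bench_dummy_sparse.py
    import numpy as np
    import scipy.sparse as sp
    from memory_profiler import profile
    from sklearn.dummy import DummyClassifier

    @profile
    def bench(strategy='most_frequent', sparse_output=True):
        rng = np.random.RandomState(0)
        X = np.zeros((17000, 1))                     # features are ignored by the dummy
        Y = sp.random(17000, 4000, density=0.001,    # ~0.1% dense multilabel targets
                      format='csc', random_state=rng)
        Y = (Y > 0).astype(int)
        if not sparse_output:
            Y = Y.toarray()
        clf = DummyClassifier(strategy=strategy)
        clf.fit(X, Y)
        clf.predict(X)

    if __name__ == '__main__':
        bench()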

Results Visualized

Constant Results: Dense 1250 MiB, Sparse 300 MiB

The constants used in the fit have a level of sparsity similar to the data because they were chosen as an arbitrary row from the target data.

Uniform Results: Dense 1350 MiB, Sparse 1200 MiB


Stratified Results: Dense 2300 MiB, Sparse 1350 MiB


Most-Frequent Results: Dense 1300 MiB, Sparse 300 MiB


Conclusions 


We can see that in all cases except for uniform we get significant memory improvements by supporting sparse matrices. The sparse matrix implementation for uniform is not useful because of the dense nature of its output, even when the input is highly sparse. It is possible this case will be revised to warn the user or even raise an error.

Remaining Work

There is work to be done on this pull request to make the predict function faster in the stratified and uniform cases when using sparse matrices. Although the uniform case is not important in itself, the underlying code for generating sparse random matrices is also used in the stratified case; any improvements to uniform will come for free if the stratified case's speed is improved.

Another upcoming focus is to return to the sparse output knn pull request and make some improvements. There will be code written in the sparse output dummy pull request for gathering a class distribution from a sparse target matrix that can be abstracted to a utility function and will be reusable in the knn pull request.

by Hamzeh (noreply@blogger.com) at July 27, 2014 08:14 PM

cfulton
(Statsmodels student)

Tempita and {S,D,C,Z} BLAS Functions


Developing a fast version of the multivariate Kalman filter for Statsmodels has required dipping into Cython, for fast loops, direct access to memory, and the ability to directly call the Fortran BLAS libraries.

Once you do this, you have to start worrying about the datatype that you're working with. Numpy and Scipy typically do this worrying for you, so that you can, for example, take the dot product of a single precision array with a double complex array, and no problems will result.

In [1]:
import numpy as np
x = np.array([1,2,3,4], dtype=np.float32)
y = np.array([1,2,3,4], dtype=np.complex128) + 1j
z = np.dot(x, y)
print z, z.dtype
(30+10j) complex128

Whereas if you use the scipy direct calls to the BLAS libraries with wrong or differing datatypes, in the best case scenario it will perform casts to the required datatype. Notice the warning, below, and the truncation of the complex part.

In [2]:
from scipy.linalg.blas import ddot
z = ddot(x,y)
print z
30.0
-c:2: ComplexWarning: Casting complex values to real discards the imaginary part

The scipy.linalg.blas functions do some checking to prevent the BLAS library from getting an argument with the wrong datatype, but in the worst case, if something slips through, it could crash Python with a segmentation fault.

Types and the Kalman filter

This matters for the Cython-based Kalman filter for two reasons. The first is that we likely want to behave as nicely as numpy dot and allow the filter to run on any datatype. The second is that numerical derivatives in Statsmodels are computed via complex step differentiation, which requires the function to be able to deal with at least the double complex case.

This means that all of the Cython functions need to be duplicated four times.

Fortunately, all of the underlying BLAS functions are structured with a prefix at the beginning indicating the datatype, followed by the call. For example, dgemm performs matrix multiplication on double precision arrays, whereas zgemm performs matrix multiplication on double precision complex arrays.

The need for relatively simple duplication means that this is a great place for templating, and it turns out that Cython has a great, simple templating engine built in: Tempita.

As an example, take a look at the below code. This generates four functions: sselect_state_cov, dselect_state_cov, cselect_state_cov, and zselect_state_cov which handle part of the Kalman filtering operations for the four different datatypes.

TYPES = {
    "s": ("np.float32_t", "np.float32", "np.NPY_FLOAT32"),
    "d": ("np.float64_t", "float", "np.NPY_FLOAT64"),
    "c": ("np.complex64_t", "np.complex64", "np.NPY_COMPLEX64"),
    "z": ("np.complex128_t", "complex", "np.NPY_COMPLEX128"),
}

{{for prefix, types in TYPES.items()}}
{{py:cython_type, dtype, typenum = types  # unpack each entry so cython_type is defined for the template body below}}

# ### Selected state covariance matrice
cdef int {{prefix}}select_state_cov(int k_states, int k_posdef,
                                    {{cython_type}} * tmp,
                                    {{cython_type}} * selection,
                                    {{cython_type}} * state_cov,
                                    {{cython_type}} * selected_state_cov):
    cdef:
        {{cython_type}} alpha = 1.0
        {{cython_type}} beta = 0.0

    # #### Calculate selected state covariance matrix  
    # $Q_t^* = R_t Q_t R_t'$
    # 
    # Combine the selection matrix and the state covariance matrix to get
    # the simplified (but possibly singular) "selected" state covariance
    # matrix (see e.g. Durbin and Koopman p. 43)

    # `tmp0` array used here, dimension $(m \times r)$  

    # $\\#_0 = 1.0 * R_t Q_t$  
    # $(m \times r) = (m \times r) (r \times r)$
    {{prefix}}gemm("N", "N", &k_states, &k_posdef, &k_posdef,
          &alpha, selection, &k_states,
                  state_cov, &k_posdef,
          &beta, tmp, &k_states)
    # $Q_t^* = 1.0 * \\#_0 R_t'$  
    # $(m \times m) = (m \times r) (m \times r)'$
    {{prefix}}gemm("N", "T", &k_states, &k_states, &k_posdef,
          &alpha, tmp, &k_states,
                  selection, &k_states,
          &beta, selected_state_cov, &k_states)

{{endfor}}

Of course this merely provides the capability to support multiple datatypes; wrapping that in a user-friendly way is another part of the project.

by Chad Fulton at July 27, 2014 09:20 AM

Mustafa Furkan Kaptan
(Vispy student)

VNC Backend

Hello reader,

Last week we finished the Javascript part of our IPython-VNC backend. One of the tricky parts was to import the Javascript code into the IPython notebook as soon as the user creates a vispy canvas. After solving this issue, we are now able to handle user events like mouse move, key press and mouse wheel.

As for the implementation, we listen to the HTML canvas in the IPython notebook with Javascript. As soon as an event is detected, we generate an event with the proper name and type according to our JSON spec. The generated event is sent to vispy with the widget's `this.send()` method, which allows us to send a message immediately from the frontend to the backend. When a message is received from the frontend, the backend generates the appropriate vispy event; for instance, if we receive a mousepress event from the frontend, we generate a vispy_mousepress event in the backend. That way, the user can use the connected `on_mouse_press` callback of the vispy canvas.
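For a feel of the Python side, here is a very rough sketch using today's ipywidgets names, meant to be run inside a notebook (the 2014 IPython widget API differed, and this is not vispy's actual backend code):

    from ipywidgets import DOMWidget

    class VNCEventWidget(DOMWidget):
        """Receives the JSON events that the JS side emits with this.send()."""
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.on_msg(self._handle_js_event)       # called for every frontend message

        def _handle_js_event(self, widget, content, buffers):
            # `content` follows the JSON spec; dispatch to a vispy-style event name
            if content.get('type') == 'mousepress':
                self._gen_event('vispy_mousepress', content)

        def _gen_event(self, name, payload):
            # stand-in for the real backend's event generation
            print(name, payload)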

We embedded an IPython DOMWidget into our VNC backend to ease the communication between JS and Python. We want this backend to be very easy to use, so the user never needs to deal with listening for events, sending them to Python, creating an IPython widget, or even coding in Javascript.

There are still some problems, though. For mouse move events, Javascript captures every position of the mouse, I mean every single pixel the mouse was on, so generating and sending mouse move events takes a lot of time. For example, a drag operation lags because of this. Also, it is nearly impossible to capture the screen, convert it to PNG, encode it with base64 and send it through the websocket at the speed of mouse movement, which is another reason for the lag.

Another problem is that we cannot use a Python timer. In vispy we use the backend's timer (QTimer for Qt, etc.), but here it is not possible to have a QTimer and IPython's event loop running at the same time. We have to think of a different way to have a timer, or else let go of the timer-based animation option.

See you next post!

by Mustafa Kaptan (noreply@blogger.com) at July 27, 2014 12:45 AM

July 26, 2014

Saimadhav A Heblikar
(Core Python student)

GSoC 2014 - End of week 10 update

"It is faster to make a four-inch mirror and then a six-inch mirror than to make a six-inch mirror." -- Bill McKeenan

A quote which I recently came across, is proven true by my recent GSoC endeavors.



by Saimadhav Heblikar (noreply@blogger.com) at July 26, 2014 05:36 PM

July 24, 2014

maheshakya
(scikit-learn student)

Testing LSH Forest

There are two types of tests to perform in order to ensure the correct functionality of the LSH Forest.

  1. Tests for individual functions of the LSHForest class.
  2. Tests for accuracy variation with parameters of implementation.

scikit-learn provides a nice set of testing tools for this task; they are elaborated in the utilities for developers section. I have used the following assertions, imported from sklearn.utils.testing. Note that numpy is imported as np.

  1. assert_array_equal - Compares each element in an array.
  2. assert_equal - Compares two values.
  3. assert_raises - Checks whether the given type of error is raised.

Testing individual functions

Testing fit function

The requirement of this test is to ensure that the fit function does not work without the necessary arguments and that it produces correct attributes in the class object (in the sense of the dimensions of the arrays).

Suppose we initialize a LSHForest as lshf = LSHForest()

If the estimator is not fitted with proper data, it will raise a ValueError, which is tested as follows:

    X = np.random.rand(samples, dim)
    lshf = LSHForest(n_trees=n_trees)
    lshf.fit(X)

We define the sample size and the dimension of the dataset as samples and dim respectively and the number of trees in the LSH forest as n_trees.

    # Test whether a value error is raised when X=None
    assert_raises(ValueError, lshf.fit, None)

Then after fitting the estimator, following assertions should hold true.

    # _input_array = X
    assert_array_equal(X, lshf._input_array)
    # A hash function g(p) for each tree
    assert_equal(n_trees, lshf.hash_functions_.shape[0])
    # Hash length = 32
    assert_equal(32, lshf.hash_functions_.shape[1])
    # Number of trees in the forest
    assert_equal(n_trees, lshf._trees.shape[0])
    # Each tree has entries for every data point
    assert_equal(samples, lshf._trees.shape[1])
    # Original indices after sorting the hashes
    assert_equal(n_trees, lshf._original_indices.shape[0])
    # Each set of original indices in a tree has entries for every data point
    assert_equal(samples, lshf._original_indices.shape[1])

All the other tests for functions also contain the tests for valid arguments, therefore I am not going to describe them in those sections.

Testing kneighbors function

kneighbors tests are based on the number of neighbors returned, neighbors for a single data point and multiple data points.

We define the required number of neighbors as n_neighbors and create an LSHForest.

    n_neighbors = np.random.randint(0, samples)
    point = X[np.random.randint(0, samples)]
    neighbors = lshf.kneighbors(point, n_neighbors=n_neighbors,
                                return_distance=False)
    # Desired number of neighbors should be returned.
    assert_equal(neighbors.shape[1], n_neighbors)

For multiple data points, we define number of points as n_points:

    points = X[np.random.randint(0, samples, n_points)]
    neighbors = lshf.kneighbors(points, n_neighbors=1,
                                return_distance=False)
    assert_equal(neighbors.shape[0], n_points)

The above tests ensure that the maximum hash length is exhausted, because the data points used in the queries come from the same data set. But to ensure that hash lengths less than the maximum hash length also get exercised, there is another test.

    # Test random point(not in the data set)
    point = np.random.randn(dim)
    lshf.kneighbors(point, n_neighbors=1,
                    return_distance=False)

Testing distances of the kneighbors function

Returned distances should be in sorted order; this is tested as follows. Suppose distances holds the distances returned by the kneighbors function when the return_distance parameter is set to True.

    assert_array_equal(distances[0], np.sort(distances[0]))

Testing insert function

Testing insert is somewhat similar to testing fit, because what we have to ensure are dimensions and sample sizes. The following assertions should hold true after fitting the LSHForest.

    # Insert wrong dimension
    assert_raises(ValueError, lshf.insert,
                  np.random.randn(dim-1))
    # Insert 2D array
    assert_raises(ValueError, lshf.insert,
                  np.random.randn(dim, 2))

    lshf.insert(np.random.randn(dim))

    # size of _input_array = samples + 1 after insertion
    assert_equal(lshf._input_array.shape[0], samples+1)
    # size of _original_indices[1] = samples + 1
    assert_equal(lshf._original_indices.shape[1], samples+1)
    # size of _trees[1] = samples + 1
    assert_equal(lshf._trees.shape[1], samples+1)

Testing accuracy variation with parameters

The accuracy of the results obtained from the queries depends on two major parameters.

  1. c value - c.
  2. number of trees - n_trees.

Separate tests have been written to ensure that the accuracy variation is correct with these parameter variation.

Testing accuracy against c variation

Accuracy should be in an increasing order as the value of c is raised. Here, the average accuracy is calculated for different c values.

    c_values = np.array([10, 50, 250])

Then the calculated accuracy for each c value is stored in an array. The following assertion should hold true in order to make sure that the higher the c value, the higher the accuracy.

    # Sorted accuracies should be equal to original accuracies
    assert_array_equal(accuracies, np.sort(accuracies))

Testing accuracy against n_trees variation

This is almost the same as the above test. First, the n_trees values are stored in ascending order.

    n_trees = np.array([1, 10, 100])

After calculating average accuracies, the assertion is performed.

    # Sorted accuracies should be equal to original accuracies
    assert_array_equal(accuracies, np.sort(accuracies))

What is left?

In addition to the above tests, precision should also be tested against c values and n_trees. But this has already been tested in the prototyping stage, and those tests consume a reasonably large amount of time, which makes them unsuitable for the scikit-learn test suite. Therefore, a separate benchmark will be done on this topic.

About Nose

As described in the guidelines on the scikit-learn contributors' page, the Nose testing framework has been used to automate the testing process.

July 24, 2014 12:00 AM

July 23, 2014

Mainak Jas
(MNE-Python student)

mne report PR merged

I added a __repr__ string for the inverse operator by creating an InverseOperator class which inherits from dict in a separate PR. This was integrated into the mne report and @agramfort pitched in to help me in fixing py3k bugs. They weren't easy as we had to finally hack into the tempita code. There was a nasty bug whereby Chrome was modifying the report by replacing all relative paths with absolute path names. It took a while to track it down and we fixed it by using some javascript in this commit. The tests were improved and coverage went up to 85% from 46% and the test time went down to 6 seconds. This was achieved by using decorators and an environment variable so that the tests would not call the visualization functions which are already tested in another module. All of this taken together, the mne report PR got merged finally. A report generated from the command for the sample dataset can be found here. Here is a screenshot of the html:



Next, I focussed on using joblib to make the rendering parallel. This was addressed in the PR here. It seemed that the processing was IO bound rather than CPU bound, which suggested that we should process the files in batches rather than single files at a time. Nonetheless, we could achieve an improvement of up to 2x in speed for the *.fif files, but the largest improvements in speed came from the MRI, which burned all the cores requested.

Finally, there were a few bugs with whitespace in section names in the add_section method of the Report class. This was addressed in this PR.

I wrote to the MNE mailing list asking neuroscientists for feedback on this new tool. One important suggestion was to reorder sections so that they followed a natural ordering, i.e., raw -> events -> evoked -> cov -> trans -> MRI -> forward -> inverse. This was rather easy to do. I addressed this in this PR.

Finally, @dengemann reported some corner cases where the ordering of sections was inconsistent if the saving was done twice. This I fixed in this PR.

The next steps are as follows:
  • Allow add_section to take mlab figures, i.e., Mayavi figures and also text.
  • Introduce interactivity in the report.
  • Get as much feedback as possible and fix bugs.
The mne report will be included in the next release 0.8, which will happen rather soon.

by Mainak (noreply@blogger.com) at July 23, 2014 07:41 PM

July 22, 2014

Michael Mueller
(Astropy student)

Week 9

This week was a pretty straightforward one--I spent most of it working on implementing parallelization in the Cython reading algorithm (in a new branch, multiprocessing). After Tuesday's discussion about what problem to pursue next, we decided that I should spend a week or two on parallelization and then begin to work on creating custom numpy dtypes, which will be an extension of Erik's work and should be applicable to FITS files as well as ASCII to some extent.
In short, the new strategy I implemented for reading involves splitting input data into N chunks (I currently have N=8, but I'll have to find an optimal N at some point) and performing both tokenization and conversion in a different process for each chunk. There are a few subtleties involved; for example, chunks need to be separated by newlines, certain data has to be copied for each process, etc. One irritating issue I came across was one with conversion in which one process discovers string data and another discovers numeric data in the same column. The basic idea at the end of the algorithm is to take the resulting data from each process and concatenate the chunks together, using widening conversions (i.e. int to float to str) when chunks have different data types. However, if a column contains integers in the first chunk, floats in the second, and strings in the third, concatenating the first two would turn 5 to 5.0 and then concatenating the first two chunks with the third would result in "5.0" instead of the desired "5". The obvious solution is to check explicitly for string data in any of the chunks and immediately convert all the columns to strings, but this still results in improper behavior for numbers in scientific notation, for example (1.3e2 -> 13 -> "13"). I ended up simply using a multiprocessing.Queue to force processes to reconvert columns as strings in this case, discarding the original converted data altogether. Luckily this seems like a pretty unlikely corner case for large data files, so performance shouldn't really take a hit.
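A toy sketch of the widening logic (plain numpy, not the actual Cython implementation) might look like:

    import numpy as np

    ORDER = {'i': 0, 'f': 1, 'U': 2}                    # int -> float -> str
    WIDEN = {'i': np.int64, 'f': np.float64, 'U': str}

    def concatenate_chunks(converted, raw):
        """converted: per-process column arrays; raw: the original string tokens."""
        widest = max((c.dtype.kind for c in converted), key=ORDER.get)
        if widest == 'U':
            # a string chunk exists: reconvert every chunk from the raw text, so a
            # value like 1.3e2 stays "1.3e2" instead of becoming "130.0"
            return np.concatenate([np.array(r, dtype=str) for r in raw])
        return np.concatenate([c.astype(WIDEN[widest]) for c in converted])

    # example: one chunk parsed as ints, one as floats, one as strings
    chunks = [np.array([1, 2]), np.array([3.5]), np.array(['x'])]
    raw = [['1', '2'], ['3.5'], ['x']]
    print(concatenate_chunks(chunks, raw))   # -> ['1' '2' '3.5' 'x']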
A more serious problem, and one I haven't quite decided how to handle, is how to deal with quoted fields spanning multiple lines. Since my approach involves splitting data into chunks by lines, an input string like 'a b c\n1 2 "3\n4"' will not parse correctly because one chunk will try parsing the line 1 2 "3 (noticing an unfinished quote sequence) and another will try parsing 4" (noticing an unexpected quote character and too few data values in the row). Unfortunately no good solution comes to mind, since explicitly searching through the input for quote characters would basically negate the performance gains of multiprocessing in the first place. I plan on asking my mentors what they think, since I'm not sure how commonly this sort of case occurs.
All tests are passing with the new multiprocessing algorithm except for those involving the quoting issue, and there is a speed gain over the old method. Unfortunately it's a little less than I hoped it would be (Tom warned that this could happen, citing Amdahl's Law), but still fairly nice; on a sample file that took the old algorithm about 0.11 seconds (the same file I've been using for a few weeks...I should probably start trying other ones), the new algorithm takes about 0.075 seconds, a speed gain of about 45%. I've been working on improving the reader's performance with reading from file, since inputting the file into a single string takes up about 0.012 seconds of the reading time, although I'm not yet finished. Table.__init__() is also taking about 0.01 seconds, which is probably unavoidable. Pandas is taking about 0.055 seconds on the same file. It's getting trickier to cut down time at this point, but hopefully I can edge out Pandas in the end!

by Michael Mueller (noreply@blogger.com) at July 22, 2014 04:20 AM

July 21, 2014

Rishabh Raj
(scikit-image student)

Almost There!

Having successfully set up the basic backend to run and interact with Docker containers, the next intuitive step was to integrate it with our image gallery and thus allow people to play around with this interactive image processing toolkit, online!

We had a discussion to finalize what sort of interface we would like to integrate this with. This could have been done in two ways: one way would be to use what we named ‘external’ integration, and the other would subsequently be known as ‘inline’ integration. By this time you might have gotten an intuition about what I mean to say. For further info, there’s a summary here @ https://github.com/scikit-image/skimage-docker/issues/6

One day, while testing out some code snippets on this service, I managed to stumble upon a problem. Every now and then, some Docker container would crash abruptly, without an error message. Now this was bad, and it just shows the importance of testing. Further digging through the code showed that there was a race condition between a container stopping after executing some statements and then restarting. `docker wait` (which makes the execution of the next command wait till the container exits) to the rescue, and the service went back to being hale and hearty again.
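A minimal sketch of this kind of fix, assuming the service shells out to the docker CLI via subprocess (the helper name and exact commands are illustrative, not the actual skimage-docker backend):

import subprocess

def run_snippet(image, command):
    # Start a one-off container for the snippet; -d prints its id immediately.
    container_id = subprocess.check_output(
        ["docker", "run", "-d", image, "sh", "-c", command]).strip()
    # `docker wait` blocks until the container has really exited, so the next
    # run cannot race with a container that is still shutting down.
    subprocess.check_call(["docker", "wait", container_id])
    output = subprocess.check_output(["docker", "logs", container_id])
    subprocess.check_call(["docker", "rm", container_id])
    return output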

We set up a live version of the gallery for everyone to try out @ http://ci.scipy.org:8000/; if you want to go to the examples directly, this is where you need to head to – http://ci.scipy.org:8000/auto_examples/index.html

These docs have been built on a Mac, which for some time now has not had clean doc generation, hence you might find a few pictures missing. But the rest works fine.

We had a few more discussions on how to improve the experience as a whole; there are ideas to put in spam detection, better error messages etc. We’re tracking it all here @ https://github.com/scikit-image/skimage-docker/issues Feel free to send in feedback on the ideas we are already tracking, or even file bugs or feature requests. The idea is to support this with solid documentation too, so that any other project can use it to set up something similar for themselves!


by sharky932014 at July 21, 2014 10:41 PM

Gaurav Trivedi
(Kivy student)

Towards Making Kivy Apps More Accessible

After last week‘s discussion with tshirtman about making Kivy apps more accessible, I spent this week exploring the accessibility options on the devices I own. The idea was to investigate and plan a set of common features that we could implement for Kivy.

I ended up playing with all the settings that I could find under the accessibility menus on an Android phone and a Macbook. I will soon repeat this exercise with an iPhone as well.

In comparison to my Galaxy running Android 4.4.2, these features were easier to use and more responsive to interact with on my Mac 10.9. Even the tutorials were simpler to follow and avoided statements like:

“When this feature is on, the response time could be slowed down in Phone, Calculator and other apps.”

I will keep my rantings aside for now and would instead like to focus on how we could support these features within Kivy apps. Following Apple’s way of organizing these features, you can group accessibility features into three broad categories:

TalkBack on Android reading out my PIN to me!

  • Seeing: This includes display settings such as colors and contrast, making text larger and zooming the screen. A lot of these settings come free of cost to the app developers. However, an important feature in this category includes the screen reader app (Example: VoiceOver and TalkBack). This is the main area of interest to us as the onus of making this feature work correctly rests with us and the app developers. I will get back to this after we finish classifying the features.
  • Hearing: We must make sure all our alert widgets in Kivy are accompanied by visual cues (such as a screen flash) along with audio indicators. Another feature that might be of interest to us would be to ensure that we provide a subtitles option in our video player widget.
  • Interacting: It includes features like the assistant menu on Android to provide easy access to a limited number of commonly accessed functions. Most of these features are again managed and controlled by the operating systems themselves. Apart from a few design choices to make sure that app doesn’t interfere with their functions, there’s nothing that the app developers have to do to support them.

So the most important missing piece in the Kivy framework is support for screen-reading apps. These apps are designed to provide spoken feedback for what the users select (touch or point at) and activate. For example, if I hover over a menu item or select a button on the screen, the app must describe what it is and also say what it does, if possible.

While settings such as zoom, larger font sizes etc. are already taken care of by the frameworks that Kivy builds upon, we must provide explicit support for screen readers. Here’s an example of a hello world Kivy app and how it is seen by screen readers. There is nothing apart from the main window buttons, i.e. “Close”, “Minimize” and “Zoom”, that is available for the VoiceOver program to describe:

There is no meta information available to the screen-readers to describe the interface to the user.

The main challenge here is that none of the Kivy widgets are selectable at this point. Not only does that make it difficult to use them with screen readers, it also makes it impossible to navigate them using a keyboard. We must provide screen-reader support along with our planned focus support for Kivy widgets.

Along with on_focus, we must send descriptive feedback that the screen readers can use as a description of the widget. While most of the native OS UI elements have such content descriptors already defined, our Kivy widgets lack this information. We must make an effort to ensure that common widgets such as text inputs, select options etc. have a description consistent with the one natively provided by the platform.
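As a purely hypothetical sketch of what such a content descriptor could look like (the accessibility_description property is not part of Kivy's API; it is only meant to illustrate the idea), a widget could simply carry a short description that a screen-reader bridge would speak on focus:

from kivy.app import App
from kivy.uix.button import Button
from kivy.properties import StringProperty

class AccessibleButton(Button):
    # Hypothetical attribute: a short, human-readable description that a
    # platform screen reader (VoiceOver, TalkBack) could be handed on focus.
    accessibility_description = StringProperty("")

class HelloApp(App):
    def build(self):
        return AccessibleButton(
            text="Hello world",
            accessibility_description="Button. Says hello when activated.")

if __name__ == '__main__':
    HelloApp().run()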

So a major portion of this implementation must be dealt with within Kivy itself. While we have previously adopted an approach of implementing the lowest common set of features in other Plyer facades, it would be a good idea to implement the content descriptors (or accessibility attributes) in a way that covers all our target platforms. Plyer could then take over from there to make the platform-specific calls to support the native screen reading apps. We would need to provide a mechanism to transform these attributes into a form relevant to each of these platforms separately. We could safely ignore the extra set of attributes at this point.

In this post I have only outlined the first steps that we need to take for making Kivy apps more accessible. I will work on figuring out the finer details about implementing them in the coming weeks. Hopefully this would help in triggering off a meaningful discussion within the community and we can have a chance to listen to other opinions on this as well.

by gtrivedi at July 21, 2014 01:56 AM

July 20, 2014

Elana Hashman
(OpenHatch student)

Failboats

Found in oh-mainline:

def catch_me(request):
    failboat  # NameError

by Elana Hashman at July 20, 2014 11:00 PM

July 18, 2014

Rajeev S
(GNU Mailman student)

Mailman Gets a Custom Shell!

Had a busy week, but I am happy that I have been able to keep up with the proposal timeline. Now that things have cooled down a bit, I believe I can blog. The lack of documentation for the Cmd module gave me a hard time in the beginning, but I feel a lot better now.

The mailman shell now has most of the functionalities covered by the command line tools. During this phase of Summer of Code, I have completed the following features for the shell.

  1. Show objects

    I would call the `show` object functionality one of the coolest parts of the shell. It supports different kinds of filters, such as equality, regular expression search and list membership checking. This has been made possible by building a custom Filter class that determines the type of filter to be applied based on the user's choice. The Filter class makes it possible to add new filters easily (a minimal sketch of such a class appears after this list).
    Further, the show function supports using multiple filters together in conjunction, giving a great deal of filtering ability over the data set.
    The show object reuses the long list feature that is used in the command line tools.

  2. Create objects
    The create object commands are used to create Mailman objects, just as in the command line tools, and use almost identical commands.

  3. Delete objects
    The delete object function follows the show object syntax, that is, it supports all the available filters. This makes mass deletion easy, especially deletion of test data that follows a certain pattern.

  4. Environment variables
    The shell supports shell variables that can be set, reset or unset. The user can declare variables that can be reused in commands. Further, a few special variables like list, domain etc. are appended automatically to the respective commands if their values are set.
    A user can see the environment variables that have been set using the show variable command.

  5. Disable / Enable environment
    The usage of environment variables can be suppressed explicitly by using the environment disable command, and enabled again using the enable command.

  6. Subscribe / unsubscribe
    Users can be subscribed to or unsubscribed from lists using the shell commands.
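
A minimal sketch of such a Filter class (hypothetical code, not the actual Mailman shell implementation), where the operator chosen by the user decides whether equality, regular expression search or list membership is applied:

import re

class Filter(object):
    def __init__(self, key, op, value):
        self.key, self.op, self.value = key, op, value

    def match(self, obj):
        field = getattr(obj, self.key)
        if self.op == '=':                      # equality
            return field == self.value
        elif self.op == 'like':                 # regular expression search
            return re.search(self.value, str(field)) is not None
        elif self.op == 'in':                   # list membership check
            return self.value in field
        raise ValueError('unknown filter operator: %s' % self.op)

def apply_filters(objects, filters):
    # Multiple filters are combined in conjunction, as described above.
    return [o for o in objects if all(f.match(o) for f in filters)]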

Apart from the above features, the shell supports general conveniences like auto-completion, which makes using the shell simpler. Errors are caught and displayed at a single point, so unexpected failures and exits due to exceptions are avoided. A few built-in functions of the Python cmd module have been customized and overridden to suit the shell, which helps reduce a lot of junk code.

The immediate tasks remaining in the shell are the development of help commands and documenting the functions written, which I expect to finish in a couple of revisions. The other functional tasks remaining are the preferences-related functions.




by Rajeev S (noreply@blogger.com) at July 18, 2014 06:56 PM

Saurabh Kathpalia
(MoinMoin student)

GSoC Update: July 18 2014

Last week's summary:

Added css for the +tickets view for the modernized and foobar themes.
For modernized theme: codereview and commit to my repo.
Here is the screenshot


For foobar theme: codereview and commit to my repo.
Here is the screenshot



Added some more metadata information (specific to blog items, such as ptime and supertags) to be displayed in the modify view of blog items. Here is the codereview and commit to my repo.

Also finalized the solution of #433 - Added User Actions, Item Actions bar, Show, History, Modify and other important links in view template of nameless items in basic theme. Here is the codereview and commit to my repo.

Also made the summary field in blog entries a required field, so that only the summary is shown on the blog entry view instead of the name. Here is the codereview

Now I am working on Improving UI of blog items in all the themes.

by saurabh kathpalia (noreply@blogger.com) at July 18, 2014 04:12 PM

July 15, 2014

Rishabh Sharma
(SunPy student)

Near Completion

Well an update on the work.
Unified Downloader
All the major code appears to have been written. The Unified Downloader retains the ability to handle semi-SQL-like queries (VSO type). It has a new response object: a UnifiedResponse object is now returned from the query method.
No major code base problems were faced; I cannot even remember any git issues.

JSOC Attributes:
My work here is in the final part of submission. PR reviewing is also nearly complete.

Other PRs:
Some minuscule issues, otherwise everything is up.

Thanks


by rishabhsharmagunner at July 15, 2014 03:06 PM

Julia Medina
(Scrapy student)

Two weeks into second half

The work I've done in the last month and a half has been merged into the master branch of Scrapy's main repository, and it's part of the new changes in the recent 0.24 release. I'm happy that I was able to finish my contribution in time and get it to a stable stage, wrapping up the first half of my assignment.

I've started the changes to the API with the Spider modifications. Since these changes are mainly additions to the Spider base class, it was a reasonable task to start with. A new from_crawler class method was introduced to handle passing the current context of execution (Crawler internals) to each spider at initialization time. This was implemented as non-intrusively as possible, since maintaining backward compatibility in this key component of every Scrapy project is a major concern.
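A simplified sketch of that hook (close in spirit to, though not necessarily identical to, the actual Scrapy code): the running Crawler is handed to the spider at initialization time and kept for later access to settings and signals, without changing existing __init__ signatures.

class Spider(object):

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

    def _set_crawler(self, crawler):
        # Keep a reference to the crawler so the spider can reach settings,
        # signals and stats later on.
        self.crawler = crawler
        self.settings = crawler.settings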

After that, I decided to focus on some of the changes in the Crawler class. I began adjusting the init and crawl methods to the new parameters that were previously defined in my proposal. Changing signatures in public functions is a sensitive matter, since this could easily lead to breaking backward compatibility. Python doesn't support function overloading, so we have to work around taking different inputs for our routines.

Taking this into account, and making note that its usage has to be as simple as possible to actually be helpful, I made the following decorator to allow different positional arguments in the same function:

import inspect
import warnings
from functools import wraps

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None,
                         warning_msg="Calling {fname} with the given "
                                     "arguments is no longer supported, please refer "
                                     "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                call_bindings = inspect.getcallargs(old_implementation,
                                                    *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_implementation

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError

An issue with this approach is that we need a reference to the old_implementation function object, which is not reachable if we are decorating a method and its old implementation is another method of the same class, since methods are created only after the whole class is defined. A more flexible and convenient implementation allows dynamically loading both old_implementation and the types of the arguments:

from importlib import import_module


def deprecated_signature(old_implementation, types=None, module='',
                         warning_msg="Calling {fname} with the given "
                                     "arguments is no longer supported, please refer "
                                     "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.

    If `old_implementation` is a string, that reference will be dynamically
    loaded from the `module` provided. If that last argument is an empty string
    (default) it will be expanded to the module of the decorated routine. If
    it's the None value, the module will be stripped out of the
    `old_implementation` string.

    Values from the `types` dictionary can also be dynamically loaded, but it
    will always be assumed that each string contains the full path for the
    object.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                old_call = _load_or_return_object(old_implementation,
                                                  module or func.__module__)
                call_bindings = inspect.getcallargs(old_call, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_call

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    types = {k: _load_or_return_object(v) for k, v in six.iteritems(types)}
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError


def _load_or_return_object(path_or_object, module=None):
    if isinstance(path_or_object, six.string_types):
        if module is None:
            module, path_or_object = path_or_object.rsplit('.', 1)
        return _load_object(path_or_object, module)
    return path_or_object


def _load_object(path, module):
    parent = import_module(module)
    for children in path.split('.'):
        parent = getattr(parent, children)
    return parent

The next step is to merge CrawlerProcess into Crawler, but I judged that it was best to leave it for later since it would probably break a lot of tests, and I won't be able to verify following changes until I have fixed them.

I proceeded to work on the scrapy.cfg config file. I've deprecated two settings, SPIDER_MODULES and SPIDER_MANAGER_CLASS, in favor of options in this config file. Currently I'm working on populating them from the settings file for backward compatibility, while doing a simple check on required options and expansion of defaults.

I haven't created a pull request yet since I'm constantly rewriting the commit history, but my progress can be tracked on my local repository, in the api-cleanup branch.

by Julia Medina (noreply@blogger.com) at July 15, 2014 03:53 AM

July 14, 2014

Hamzeh Alsalhi
(scikit-learn student)

Sparse Target Data for One vs. Rest Classifiers

Going forward with sparse support for various classifiers, I have been working on a pull request for sparse one vs. rest classifiers that will allow for sparse target data formats. This will result in a significant improvement in memory usage when working with large amounts of sparse target data; a benchmark is given below to measure the improvement. Ultimately, what this means for users is that, using the same amount of system memory, it will be possible to train and predict with an OvR classifier on a larger target data set. A big thank you to both Arnaud and Joel for the close inspection of my code so far and the suggestions for improving it!

Implementation

The One vs. Rest classifier works by binarizing the target data and fitting an individual classifier for each class. The implementation of sparse target data support improves memory usage because it uses a sparse binarizer to produce a binary target data matrix that is highly space efficient.

By avoiding a dense binarized matrix, we can slice out the one column at a time required for each classifier and densify only when necessary. At no point will the entire dense matrix be present in memory. The benchmark that follows illustrates this.
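A minimal sketch of the idea (not the scikit-learn implementation itself): keep the binarized targets sparse and densify only one column at a time, so the full dense indicator matrix never exists in memory.

from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

y = [0, 2, 1, 2, 0]
Y = sparse.csc_matrix(LabelBinarizer().fit_transform(y))   # sparse indicator matrix

for i in range(Y.shape[1]):
    column = Y.getcol(i).toarray().ravel()   # dense only for this single column
    # ...fit the i-th binary classifier on (X, column) here...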

Benchmark

A significant part of the work on this pull request has involved devising benchmarks to validate intuition about the improvements provided; Arnaud has contributed the benchmark that is presented here to showcase the memory improvements.

By using the module memory_profiler we can see how the fit and predict functions of the ovr classifier affect the memory consumption. In the following examples we initialize a classifier and fit it to the train dataset we provide in one step, then we predict on a test dataset. We first run a control benchmark which shows the state of one vs. rest classifiers as they are without this pull request. The second benchmark repeats the same steps but instead of using dense target data it passes the target data to the fit function in a sparse format.

The dataset used is generated with scikit-learn's make_multilabel_classification, using the following call:
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse=True, return_indicator=True,
                                      n_samples=20000, n_features=100,
                                      n_classes=4000, n_labels=4,
                                      random_state=0)

This results in a densely formatted target dataset with a sparsity of about 0.001

Control Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 179.824 MB
 
est.predict(X)
consumes -73.969 MB. The negative value indicates that data has been deleted from memory.

Sparse OvR PR Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 27.426 MB 

est.predict(X)
consumes 0.180 MB 

Improvement

Considering the memory consumption for each case as 180 MB and 30 MB we see a 6x improvement in peak memory consumption with the data set we benchmarked.

Upcoming Pull Request

The next focus in my summer of code after finalizing the sparse one vs. rest classifier will be to introduce sparse target data support for the knn and dummy classifiers which have built in support for multiclass target data. I have begun the knn pull request here. Implementing a row wise mode calculation for sparse matrices will be the main challenge of the knn PR.

by Hamzeh (noreply@blogger.com) at July 14, 2014 02:21 PM

Mainak Jas
(MNE-Python student)

mpld3, python3.3, topomaps and more


We are approaching a merge of the 3000-line PR for the mne report command. One of the features that I incorporated last week was an interactive plot_events using mpld3, which combines the power of matplotlib and d3.js. Unfortunately, a recent PR breaks things, as mpld3 has not yet implemented ax.set_yticklabels() in its latest release. So, we decided to leave out interactivity for now and push for a merge before the 0.8 release of MNE-Python.

Travis is not yet happy because of Python 3.3 errors. I am going to use the six package to handle the methods that have changed from Python 2.7 to Python 3.3. There were issues with os.path.abspath() as it does not resolve symlinks, and therefore we replaced it with os.path.realpath().
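For example (a generic illustration of six, not the specific MNE-Python change), names that moved between Python 2.7 and 3.3 can be accessed through a single interface:

import six
from six.moves import zip  # itertools.izip on Python 2, the builtin zip on Python 3

def is_text(value):
    # six.string_types is (str,) on Python 3 and (basestring,) on Python 2
    return isinstance(value, six.string_types)

print(is_text(u"evoked"), list(zip([1, 2], [3, 4])))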

Finally, thinking of possible use cases, we realized that including the topographic maps (see figure below) of the evoked response would be useful for the report.

 The show parameter for the plotting function was somehow missing, so I fixed it in this PR. Finally, the plots were ordered by section rather than figure which was addressed in this commit.

The next steps are as follows:
  • Speed up the tests by using a decorator for the tests which uses a small string for the plot instead of making the actual plot. That way, we do not end up testing the plotting functions twice.
  • A separate PR for a repr for the inverse operator.
  • Miscellaneous fixes for merging the mega-PR: Python3.3 errors, not allowing multiple saves by throwing errors etc.

by Mainak (noreply@blogger.com) at July 14, 2014 01:03 PM

Richard Tsai
(SciPy/NumPy student)

Rewrite scipy.cluster.hierarchy

After rewriting cluster.vq, I am rewriting the underlying implementation of cluster.hierarchy in Cython now.

Some Concerns about Cython

The reason why I use Cython to rewrite these modules is that Cython code is more maintainable, especially when using NumPy’s ndarray. Cython provides some easy-to-use and efficient mechanisms to access Python buffer memory, such as Typed Memoryviews. However, if you need to iterate through the whole array, it is a bit slow. This is because Cython will translate A[i, j] into something like *(A.data + i * A.strides[0] + j * A.strides[1]), i.e. it needs to calculate the offset into the array data buffer on every array access. Consider the following C code.

int i, j;
double s;
double *current_row;

/* method 1 */
s = 0;
current_row = (double *)A.data;
for(i = 0; i < A.shape[0]; ++i) {
    for(j = 0; j < A.shape[1]; ++j)
        s += current_row[j];
    current_row += A.shape[1];
}

/* method 2 */
s = 0;
for(i = 0; i < A.shape[0]; ++i)
    for(j = 0; j < A.shape[1]; ++j)
        s += *(A.data + i * A.shape[1] + j);

The original C implementation uses method 1 shown above, which is much more efficient than method 2, which is similar to the C code that Cython generates for memoryview accesses. Of course method 1 can be adopted in Cython, but the neatness and maintainability of the Cython code would suffer. In fact, that is just how I implemented _vq last month. But _vq has only two public functions while _hierarchy has 14, and the algorithms in _hierarchy are more complicated than those in _vq. It would be unwise to just translate all the C code into Cython with loads of pointer operations. Fortunately, the time complexities of most functions in _hierarchy are \(O(n)\). I think the performance loss of these functions is not a big problem and they can just use memoryviews to keep maintainability.

The New Implementation

The following table is a speed comparison of the original C implementation and the new Cython implementation. With the use of memoryviews, most functions have about a 30% performance loss. I used some optimization strategies for certain functions and they now run faster than the original version. The most important function, linkage, has a 2.5x speedup on a dataset with 2000 points.

function                         old          new
calculate_cluster_sizes          4.273 us     6.996 us
cluster_dist                     68.224 us    108.234 us
cluster_in                       78.020 us    102.400 us
cluster_maxclust_dist            32.913 ms    1.174 ms
cluster_maxclust_monocrit        35.539 ms    1.454 ms
cluster_monocrit                 113.104 us   46.170 us
cophenetic_distances             13.052 ms    13.458 ms
get_max_Rfield_for_each_cluster  30.679 us    42.969 us
get_max_dist_for_each_cluster    27.729 us    42.977 us
inconsistent                     105.663 us   146.673 us
leaders                          33.941 us    47.849 us
linkage                          4.476 s      1.794 s
prelist                          29.833 us    41.334 us

The new implementation is not finished yet. All tests pass, but there may still be some bugs, and it lacks documentation for now.

by Richard at July 14, 2014 09:19 AM

PritishC
(SunPy student)

Break – Fix – Repeat!

Cadair commented 9 hours ago

@VaticanCameos I broke it

This quote seems to sum up all that we’ve done in the past few weeks. So far, the Coordinates Framework PR has 80 commits. We are to have a meeting on Tuesday to gauge progress and set the path for the rest of the month.

It has been quite an experience. Having complete ownership of a module isn’t easy. One has to think of design and implementation, testing and experimentation, etc. all in quick succession.  On top of that, the repository does not stay the same all the time – one is always on the bleeding-edge of things. Having said that, my mentor has been really supportive. I have migrated to a new laptop, and on his insistence, tried to install ArchLinux but failed due to some problems with the Radeon graphics card. Nevertheless, I am on Linux Mint for the time being!

A summary of things -:

  1. Some coordinate frames have undergone several changes – Heliographic and Helioprojective, more precisely. Helioprojective ‘Cartesian’ is in fact a spherical representation masquerading as Cartesian, so it was decided that it would be better to change the representation to spherical, but keep the attribute names as Tx and Ty.
  2. Astropy have introduced the RepresentationMapping class, which is a useful encapsulation of the mapping of frame representation attributes to the names that the designer wishes to specify, as well as their units. RM objects take a 3-tuple, thus.
  3. Helioprojective frames now have a new property – ‘zeta’ – which was earlier being mapped to the distance attribute. On reviewing the code, we found that distance should actually be ‘d’ (which was a FrameAttribute then) and it is better to have zeta as a property which returns the value of ‘D0 – d’. ‘zeta’ can be used in the constructor to generate a value for ‘d’. This ultimately fixed most of the transformation issues we had from Helioprojective to other classes, except for a few corner cases which we are now working on.
  4. A lot has been learned about how to default attributes in the frame constructor. Careful consideration of the various input cases was necessary. Heliographic frames now default ‘rad’ and a FrameAttribute called ‘dateobs’. The empty frame case was the biggest thorn – breaking a lot of things. Since Astropy’s code purposefully generates an empty frame of the same class to check certain things, it was necessary to keep empty frame design clear in mind. Helioprojective frames can default ‘D0′ and ‘d’.
  5. SphericalRepresentation was subclassed to create Spherical180WrapRepresentation. This was done after Stuart noticed that the ‘hlon’ attribute actually spans a range of -180 to +180 degrees. Creating a negative valued longitude was not possible with ordinary Longitude objects. Thus, a new representation had to be created. All frames which were earlier on spherical representations have been shifted to this new frame.
  6. Currently, there are a few corner cases that have to be fixed. These include transformations from Helioprojective to Heliographic when all three attributes are specified as zero and transformation from Heliographic Stonyhurst to Heliographic Carrington which is breaking, presumably because of the dateobs parameter. The dateobs parameter will also be used to generate the Carrington offset for transformation.
  7. Lastly, the coordinates framework has, for the most part, successfully been integrated into the Ginga project! It is now used to model solar coordinates and visualize SunPy maps. This is great news, and it will also push us to work harder on keeping the framework bug free.

In other news, I have started delving into other projects, such as working on an IRC bot, some competitive programming (really lagging on this) and – of course – looking for jobs! The main thing though, is the side project that I have acquired from my mentor to build #sunpy’s irc logs website in Flask. It’s going to take a while as I am completely nil in web dev, but it will be quite worth it!

Oh, and here, have a gif which signifies what Germany have been like this World Cup . Over and out!

Germanic Slaughter


by xpritish at July 14, 2014 06:12 AM

July 13, 2014

Brigitta Sipőcz
(Astropy student)

in the second half

As it was mentioned in earlier posts, the aperture photometry API is going through a major refactoring this summer, and in the past two weeks I mainly worked on that.

Read more »

by Brigitta Sipocz (noreply@blogger.com) at July 13, 2014 11:24 PM

Vytautas Jančauskas
(TARDIS-SN student)

Two Weeks in to the Second Half of GSoC

Since the bulk of the work is finished now I concentrated mainly on the boring, long and tedious bits. Enforcing structured programming and object-oriented techniques like encapsulation are some of the things I was doing. Also there are plans to start integrating the changes in to the main code base and help people implement their physics in to the new infrastructure. One interesting issue I currently have and was weirded out by is this:

I have this code in my fork:

  if (x_insert > x[imin] || x_insert < x[imax])
    {
      ret_val = TARDIS_ERROR_BOUNDS_ERROR;
      // Crashes without this exit; this doesn't make sense.
      exit(1);
    }

This logical branch never gets executed unless something goes terribly wrong. I checked that this branch is never run with the data I use. However, for some reason, if I comment out the exit(1) line the program crashes. It would be very interesting if someone was able to shed some light here or at least give a lead. Other than that I cannot pinpoint any roadblocks, except maybe that due to the tedious nature of the current tasks (mostly maintenance and code improvement with no noticeable immediate practical effect) the enthusiasm is winding down quite a bit.

by Vytautas Jančauskas (noreply@blogger.com) at July 13, 2014 10:07 PM

Saimadhav A Heblikar
(Core Python student)

GSoC 2014 - Overview of workflow and resources

In this blogpost, I will link to all GSoC 2014 related resources. If you intend to regularly check my GSoC progress, it is advisable that you bookmark this page. It will link to all resources.

Read more »

by Saimadhav Heblikar (noreply@blogger.com) at July 13, 2014 05:33 PM

Asra Nizami
(Astropy student)

Weeks 7-8

I just counted and so far I have 14 merged pull requests which makes me really happy! Not all of them are huge PRs, a few of them are minor documentation changes but some of them took a bit more time to work on as I was implementing some new feature in WCSAxes or fixing some bug.

The latest thing I've been working on is integrating WCSAxes into APLpy. So APLpy is an Astropy affiliated package that also plots astronomical images and uses matplotlib for all the plotting tasks. My current task is to make changes to APLpy so that it uses WCSAxes to locate and plot all the ticks, labels, grids etc instead. So far things have been going pretty smoothly. Most of the functionality of APLpy is in place - there are a few small things here and there but I'm not too worried now that it's mostly working. Things have been going so smoothly in fact that I started to get a little bored with the work and consequently, my productivity decreased significantly.

Tom, my mentor, was at a conference this week so I haven't had a Hangout with him and discuss my latest work, which might also be why I haven't been very productive this week. I think my brain perceives our weekly Hangouts as deadlines where I have to report on my progress and that keeps me motivated. I like deadlines - there I said it! Well maybe having stressful deadlines all the time wouldn't be very pleasant but deadlines are kind of fun. They keep you focused and there's that edgy feeling you get close to the deadline 'can I finish this in time?!' Usually I just set my own personal deadlines like 'I have to get this thing done in two days, maximum four days' etc and that keeps me happy but like I said, I got a little bored.

Towards the end of the week, Tom said I could try implementing another feature in WCSAxes if I wanted so I decided to focus on that instead. Currently WCSAxes only plots major ticks but it would be useful if it could plot minor ticks as well.


(In case someone doesn't know about major and minor ticks from matplotlib, the red ticks are minor ticks and the black ones are major ticks. Also this is just an image produced with matplotlib, not WCSAxes)


This is a bit tricky because so far I haven't had to actually understand the math behind WCSAxes' calculations but now might be the right time to understand how it works. I currently have a very faulty implementation of minor ticks but it's fun to work on and hopefully I'll figure it out! :)

by Asra Nizami (noreply@blogger.com) at July 13, 2014 09:49 AM

Asish Panda
(SunPy student)

Progress, 2 weeks after mid term

Hey all. Again a mandatory post. The PR I created is done, and only needs merging now. I made a mess of my local git repo when a commit was corrupted. It did not end there, as it was followed by a weird error even after removing the corrupted empty files. I tried to fix it but failed, so I had no choice but to clone it once again. That again led to an error while compiling (“python.h cannot compile”) even after installing all python-dev updates. I don’t remember exactly how, but it was a problem with pyc files hanging around; I removed them manually and it worked.

As for now what I will be doing next, I think I will go with the wcs module. I have yet to start that and I think it will be much more interesting than my previous tasks.

But still I have yet to merge the changes of maps and spectra, but I think it can wait till I get comments on how to improve it.

I guess thats all for now. I am not good at writing about the stuff I do, so check out the PR or my forked repo for details!

Thanks!


by kaichogami at July 13, 2014 07:48 AM

July 12, 2014

Shailesh Ahuja
(Astropy student)

Using specutils for reading and writing FITS files

I have just finished developing the readers and writers for FITS IRAF format described here. I am going to give a small tutorial on how the specutils package (affiliated with Astropy) can be used for the same, and also give some details of how it works.

To read a FITS file, just simply use specutils.io.read_fits.read_fits_spectrum1d() method. This method might be later moved to Spectrum1D class. The following code snippet gives an example:

from specutils.io import read_fits
spectra = read_fits.read_fits_spectrum1d('example.fits')

Depending on the type of FITS file, the object can be different. For linear or log-linear formats, the object returned will be a Spectrum1D object. For multispec formats, the object will be a list of Spectrum1D objects. The dispersion and the flux of the Spectrum can be accessed using spectrum1Dobject.dispersion and spectrum1Dobject.flux respectively. The dispersion is automatically computed based on the type of dispersion function defined in the header. Other methods such as slicing dispersion for a particular set of flux coordinates will be supported soon.
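For example, assuming the `spectra` object from the snippet above is a single Spectrum1D read from a linear-dispersion file:

print(spectra.dispersion)   # computed from the dispersion function in the header
print(spectra.flux)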

To write a FITS file, use the specutils.io.write_fits.write() method. Like the reader, this method may be moved to Spectrum1D later. The following code snippet gives an example:

from specutils.io import write_fits
write_fits.write(spectra, 'example.fits')

The spectra object passed can be a list of Spectrum1D objects or just one Spectrum1D object. Depending on the type of spectra, the writer will write either in multispec or linear format respectively. If the spectra was read in using the reader, then the additional information from the read file will also be written.

There will be some more operations supported, but the focus will be on keeping things simple, and handling the complexities in the code itself.

by Shailesh Ahuja (noreply@blogger.com) at July 12, 2014 01:23 PM

M S Suraj
(Vispy student)

Ellipse, Regular polygons, and tests – GSoC Week 7

I successfully passed the mid-term evaluation of GSoC. Yay! Coming to work done so far, I have implemented EllipseVisual, RegularPolygon visual, and am in the process of writing image processing based tests for them.

I implemented EllipseVisual as a generic object which can be used to draw arcs and circles, and which can be reused to draw regular polygons. Given a position, radii and span angles, the vertices are computed and drawn as a triangle fan. Drawing using GL_TRIANGLE_FAN made it really easy to implement the span angle part, so arcs can be easily drawn now. The code can be viewed here – https://github.com/vispy/vispy/blob/master/vispy/scene/visuals/ellipse.py . Here are some examples:

ellipse2

ellipse1

ellipse3

Similarly changing the number of segments of the triangle fan to match the number of sides resulted in regular polygons. Here’s the code: https://github.com/vispy/vispy/blob/master/vispy/scene/visuals/regular_polygon.py

regular_polygon1
regular_polygon2
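A minimal sketch of the vertex computation described above (plain numpy, not the vispy code itself): the centre goes first, and with GL_TRIANGLE_FAN every consecutive pair of rim points forms a triangle with it, which also gives arcs when the span angle is less than a full circle and regular polygons when `segments` equals the number of sides.

import numpy as np

def ellipse_fan_vertices(center, radius_x, radius_y,
                         start_angle=0.0, span_angle=2 * np.pi, segments=100):
    theta = np.linspace(start_angle, start_angle + span_angle, segments + 1)
    xs = center[0] + radius_x * np.cos(theta)
    ys = center[1] + radius_y * np.sin(theta)
    return np.vstack([center, np.column_stack([xs, ys])])

verts = ellipse_fan_vertices((0.0, 0.0), 1.0, 0.5, span_angle=np.pi / 2)  # a quarter arc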

Currently no tests exist for any of the visuals. Tests are crucial for keeping the visual accuracy of the visuals in check. I’m currently trying out image matching algorithms to see which one suits the need. I will publish a write-up on it once I have finished writing the tests.


by mssurajkaiga at July 12, 2014 12:29 AM

July 10, 2014

Alan Leggitt
(MNE-Python student)

Day 37: Aligning source spaces

Notebook

In case you're wondering what I've been up to all week, I got some feedback from the mne community about how
to incorporate source spaces based on volumes of interest into the existing mne code. We decided to implement
volumes of interest using the function mne.setup_volume_source_space. If you check out my pull request, you
can see the various steps I've been through to make this possible. The biggest hurdle was incorporating an
interpolation matrix. This matrix transforms vertices based on their individual vertex numbers in 3d source
space into mri space.

I won't go into detail about the process, but I've included a plot below showing how 3 different source spaces
align in mri space. The surfaces of the left and right hemispheres are in blue and green, respectively, the
whole brain volume is in red, and a volume of interest based on the left cortical white matter is in black.

In [1]:
%matplotlib inline
%load transform_workspace.py
In [3]:
import mne
from mne.datasets import sample
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

mne.set_log_level(False)
data_path = sample.data_path()
subject = 'sample'
subjects_dir = data_path + '/subjects'
aseg_fname = subjects_dir + '/sample/mri/aseg.mgz'

# create a whole brain volume source
vol_src = mne.setup_volume_source_space(subject, mri=aseg_fname,
                                        subjects_dir=subjects_dir)

# create left and right hemisphere cortical surface sources
srf_src = mne.setup_source_space(subject, subjects_dir=subjects_dir,
                                 overwrite=True)

# create a left hemisphere white matter source
lwm_src = mne.setup_volume_source_space(subject, mri=aseg_fname,
                                        volume_label='Left-Cerebral-White-Matter')

# get the coordinates from all sources
vol_rr = vol_src[0]['rr']
ix = vol_src[0]['inuse'].astype('bool')
lh_rr = srf_src[0]['rr']
rh_rr = srf_src[1]['rr']
lwm_rr = lwm_src[0]['rr']

# plot the source spaces in 3 dimensions
ax = plt.axes(projection='3d')
ax.plot(vol_rr[ix, 0], vol_rr[ix, 1], vol_rr[ix, 2], 'r.', alpha=0.1)
ax.plot(lh_rr[:, 0], lh_rr[:, 1], lh_rr[:, 2], 'b.', alpha=0.1)
ax.plot(rh_rr[:, 0], rh_rr[:, 1], rh_rr[:, 2], 'g.', alpha=0.1)
ix = lwm_src[0]['inuse'].astype('bool')
ax.plot(lwm_rr[ix, 0], lwm_rr[ix, 1], lwm_rr[ix, 2], 'k.', alpha=0.5)

plt.show()

The add_dist parameter to mne.setup_source_space currently defaults to False, but the default will change to True in release 0.9. Specify the parameter explicitly to avoid this warning.

by Alan Leggitt (noreply@blogger.com) at July 10, 2014 10:29 PM

Roy Xue
(Theano student)

Next Step #GSoC2014-Theano

Next Step


I just finished my final exams, so it is time to continue my previous tasks (actually, by the time I post this, I've already started).

Fred is on vacation, so these days I will contact Arnaud for help. Through this I have started to see why GSoC requires more than one mentor: not only to cover one mentor's temporary absence, but also because different mentors can provide advice from different perspectives, which helps me learn more and makes the code better.

Now I have completed some updates to my previous PR, mainly simplifying its structure and cleaning it up to make it better. It’s waiting to be merged now: Compute minimum peak #1934

The following is the description of my next part (thanks Fred for writing this):


This is for the GSoC work by Roy. Roy, wait until you finish the current step to check this (or at least until you've fixed the important part), as it will distract you. But I would appreciate feedback now from other people:

For this, I would start by working only with the “vm” linker, when there is no lazy variable. I would also restrict it at first to ndarrays of 0 dimensions. That pushes the interaction with the shape feature to later. That would simplify making the base work. What needs to be done is to take the execution order of nodes currently used and keep that order, but change how we build the storage_map and add an extra dependency check in the garbage collection. I would split this into those steps:

1) use the linker vm(not the default one cvm) with graph that don’t have lazy op and with allow_gc=False. In that case, modify the method VM_Linker.make_all() (file gof/vm.py) to change the storage_map that get created to have those properties:

  • The old storage_map have a different storage for each variable in the graph. This need to change.
  • You need to have some variable share the same storage. But those variable need to be carefully selected. For this: for node in nodes_execution_order: check if current node inputs variable that would be gc if the gc was enabled. for those variables for now keep only those that have ndim ==0 for those variables, check in future nodes to execute, those that have an output with ndim ==0. Check that you didn’t already changed the storage_map of that output variable. For that pair of (input variable of the current node to be gc normally, variable that will be outputed later), make sure that the storage_map point to the same place for both variable.

The fact that the output variable is available when a node get executed, will allow that node to reuse the existing memory for its output, instead of allocating new memory. This will lower the memory used when allow_gc=False, as the memory used won’t be the sum of all the intermediate memory as it is now

1.2) Only after 1.1) is working, generalize it to make it work when allow_gc=True. For this, you need to modify the condition where Theano checks what to gc after a node gets executed, to only gc it if no other new nodes will use that memory in the future. This will need changes in the variable post_thunk_clear passed to LoopGC.init, to correctly delay the gc of the variables you changed in 1.1.

There is other stuff to do on this, but I think we can delay the detailed plan for them.


 

I will now start to look for information and think about how to implement my next step.

by xljroy at July 10, 2014 07:16 PM

July 09, 2014

Ajitesh Gupta(:randomax)
(MoinMoin student)

Week 7 Work - ACLs

This week the first task was to find and solve bugs related to Access Control Lists referred to as ACLs hereafter. The access control list in simple metaphorical terms would be equivalent to a guest list at a big party. If your name is on the invited guests' list then you may attend the party else you stay out of it :D . So the access control lists basically specify what permissions a user has w.r.t to a particular object. A user may read, write, create, destroy, admin an item depending on the ACL corresponding to that item.

The first one was an already reported bug - #417 - whereby the userbrowser view in the admin section used to crash if there were any users who had created accounts but not confirmed them via the emails sent to their addresses. This happens because when a user registers on the wiki, their email address is stored under the "EMAIL_UNVALIDATED" key until they confirm the account; once they confirm it, the email is stored under the "EMAIL" key. The earlier code only looked for the email under the "EMAIL" key while retrieving user info, and so it crashed. I changed the code to look first for "EMAIL" and, if that does not exist, to look for "EMAIL_UNVALIDATED". Here is the commit for that change. Now you might wonder where ACLs figure in all this. This is related to ACLs because one of the new ACL views I created this week is linked from the userbrowser view, where all the users are listed. I'll talk more about that later.

The second bug that I came across was #444, whereby if a user tried to access something which they are not allowed to read (like gatecrashing that party I mentioned ;) ), they would see an error as mentioned in the bug report instead of a proper "Access Denied" page. This happens because somehow the registered 403 error handler does not spring into action when a 403 (AccessDenied) error is raised in this case. I am still working on this issue; you can find the related code reviews here.

Another bug was #445, whereby if a user changed the ACL of an item such that they were no longer allowed to read it, they would receive the error reported in the bug (I wonder who'd strike their own name off the guests' list :D ). Basically, the self.revs object being used there gets its value from the get_items_last_revisions() function. This function fails to find the item when the user is not authorized to read it, and hence returns an empty list. This empty list caused the error. Hence I put in a check for empty lists and displayed an Access Denied error in such cases using the abort(403) function. The code review is here.
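A minimal sketch of that check (hypothetical view code, not the exact MoinMoin change; get_items_last_revisions is the function mentioned above):

from flask import abort

revs = get_items_last_revisions()   # empty when the user lacks read rights
if not revs:
    abort(403)                      # proper Access Denied instead of a crash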

Yet another bug was #446 wherein a user could modify the ACLs of a particular object even if he had just the read and write permission, whereas according to the design only a person with admin rights should be able to do that. In the basic theme this error was caused because there was no check on the user's ACL capabilities before displaying the modify form, or in metaphorical terms there was a guest list defined but there was no one to cross check the people entering the hall. So I added a check for the user, and only if they have admin rights can they access the Modify ACL form. Here is the commit regarding that.

Finally the last bug I worked on was #447 which again like #417 was not directly involved in ACLs but was related to the view I needed for further ACL views. Here the previous developer had not kept in mind that the rev.meta[NAME] attribute would be a list as items can have multiple names and as such it could not be used as a key for a dictionary. So I unrolled the previous complex list comprehension and used for loops instead to take care of the list. Here is the commit regarding that.

Now come the new views I worked on for easy visualization of ACLs. The first one is the User ACL Info view. This view contains information about the ACL capabilities of a user with respect to each item in the wiki. It can only be accessed by the superuser(s) specified in the wiki configuration, through the userbrowser view in the admin section, where there is a list containing various information about each user (Ah! Now you see why I solved #417 and #447). Clicking on the item names in the list takes the superuser to the Modify Item page, where they can modify the ACL of that item as they like (that is, if they have admin rights for that item, as per #446). The tablesorter js library was used to make an easily sortable table. Here is the code review for that.


The User ACL Info view

Another view that I made was the Item ACL Report. This again can be accessed only by the superuser. Here the superuser sees a list of all items in the wiki and the corresponding ACLs for each item. Clicking on an item name takes the superuser to the Modify Item view. Tablesorter js was used here too. Also, in both new views, if the item is nameless, the Item ID is shown instead. Here is the code review for that.


Item ACL Report view

In the continuing couple of weeks I would be working on the show Groups view and Group ACL Reports view to enable the superuser to visualize the ACLs and to easily change them.

by Ajitesh Gupta (noreply@blogger.com) at July 09, 2014 12:48 PM

Saurabh Kathpalia
(MoinMoin student)

GSoC Update: July 9 2014

Last week's summary:

Added 'Subscribe' and 'Add quicklink' option in ticket's modify template. Now users can subscribe to ticket items and also quicklink them in basic theme. Here is the codereview and commit to my repo.

Made nameless items downloadable. Previously, users couldn't download nameless items because the filename used for downloading was derived from the name of the item, which raised an error for nameless items. I removed this dependency on the name and defined a separate function (get_download_file_name) for getting the filename: it returns the name of the item if there is one, otherwise it returns the field-value as the download file name. Here is the codereview and commit to my repo.

Also made blog item updates work and added useful links (modify, subscribe, modify meta etc.) on the blog page. Here is the codereview for my changes. One TODO is to make blog posts nameless, which will be done once we have a way to create items without any initial fqname.

Fixed #433 - Now User Actions, Item Actions, Show, History, Modify etc are also displayed in the view of nameless items in basic theme. Here is the codereview for my changes . I am yet to commit it to my repo.

Also tried to write tests for the +tickets view but couldn't complete them, as we need the latest pytest version (2.5.2), which is currently not compatible with moin, so I will come back to it later. Here is the codereview for the above.

by saurabh kathpalia (noreply@blogger.com) at July 09, 2014 11:08 AM

July 08, 2014

RAGHAV R V
(BinPy student)

GSoC 2014 - Week 7

The past two weeks went quite well.

I implemented the SignalGenerator module and also made a few minor bug fixes along the way.

The SignalGenerator module is a multi-threaded implementation which can connect to external inputs ( for modulating the output signal to AM wave or FM wave ) using the linker module. The outputs can also be connected to other modules.

The source of the file can be found here.

The code to generate an AM signal using 2 SignalGenerator Blocks is :




The output of the same :



The Pull Request that I had raised for the same : #249

To finish the Microprocessor implementation, a few more things need to be done, especially tidying up the expression modules and integrating the Espresso heuristic logic minimizer.

The idea here is to make a Finite State Machine to act as a block which accepts inputs and generates an output with the least possible delay based on a predetermined equation.

I hope to start work on this and finish it within the next two weeks. Once this is done, adding further ICs is a breeze: all the leftover ICs can be implemented using this block in at most a single day. If possible I will also re-implement the existing ICs using it. The completion of this and the matplotlib-based oscilloscope will be my next priority.

by Raghav R V (noreply@blogger.com) at July 08, 2014 12:29 AM

July 07, 2014

maheshakya
(scikit-learn student)

Optimizations on Locality sensitive hashing forest data structure

In order to get the LSH forest approximate nearest neighbor search method into a competitive and useful state, optimization was a necessity. These optimizations target both the LSH forest algorithm itself and the implementation of that algorithm. There are two general kinds of bottlenecks where the need for optimization arises.

  1. Memory usage of the data structure.
  2. Speed of queries.

Let's discuss these two cases separately in detail. Remember that there will always be a trade off between memory and speed.

Optimizations on memory usage

In the previous post about the implementation of the LSH forest, I explained the basic data structure and its functions. Some of the optimizations caused obvious changes to those functions. First of all, I need to mention that this data structure stores the entire fitted data set, because it is required to calculate the actual distances of the candidates. So the memory consumption is usually dominated by the size of the data set, and the LSH Forest data structure has no control over this. The only thing that can be controlled from the data structure is the set of hashes of the data points.

The current data structure maintains a fixed hash length (which the user can define at initialization time). Hashes are composed of binary values (1s and 0s). At the moment, these hashes are stored as binary strings.

One of the optimizations that can be done to maintain a more compact index of hashes is to use the integer equivalent of the binary strings. For example, the string 001110 can be represented as 14. For very large data sets with very high dimensions, this technique has little effect, as the improvement in the memory usage of the hash indices is insignificant compared to the memory required to store the data set.

As I have mentioned earlier, there is always a cost to every optimization. Here we have to trade off speed for improved memory consumption. The LSH function is applied to a single data point as many times as the required hash length. In the default case, there will be 32 hashes per data point in a single tree, and those hashed values have to be concatenated in order to create \(g(p)\) (refer to the LSH Forest paper). Because the hashing is done separately for the individual elements of \(g(p)\), they are stored in an array in the first place. In order to create a single hash from the values in this array, they are first concatenated as a string. This is the point where the memory/speed dilemma occurs: these string hashes can be converted into integer hashes for a more compact index, but it costs extra time.
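A tiny illustration of this conversion:

bits = ['0', '0', '1', '1', '1', '0']   # per-projection hash values
hash_string = ''.join(bits)             # g(p) as a string: '001110'
hash_int = int(hash_string, 2)          # stored as the integer 14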

In the next section, I will explain how this extra overhead can be minimized using a trick, but the fact that "you cannot eliminate the memory speed trade-off completely" holds.

Optimizations on query speeds

(I strongly recommend you to go through the LSH forest paper before reading the following section because the description involves terms directly taken from the paper)

I was hapless at the beginning, because the query speeds of the implemented LSH forest structure were not at a competitive state to challenge the other approximate nearest neighbor search implementations. But with the help of my project mentors, I was able to get past this issue.

Optimizations on the algorithm

In the initial implementation, the binary hash \(g(p)\) for each tree was computed during the synchronous ascending phase, at the moment it was required. This causes a great overhead, because each time it is computed the LSH function has to be applied to the query vector; for random projections that is a dot product, which is an expensive operation for large vectors. Computing the binary hashes for each tree in advance and storing them is a reliable solution to this issue. The descend algorithm needs to be performed on each tree to find the longest matching hash length, so it iterates over the trees in the forest. For each iteration, the binary hash of the query is calculated and stored as follows. This does not consume much memory because the number of trees is small (typically around 10).

def descend(query, hash_functions, trees):
    bin_queries = []
    max_depth = 0
    for i in range(len(trees)):
        bin_query = do_hash(query, hash_functions[i])   # hash the query once per tree
        k = find_longest_prefix_match(trees[i], bin_query)
        if k > max_depth:
            max_depth = k
        bin_queries.append(bin_query)
    return max_depth, bin_queries

It is an obvious fact that as the number of candidates for the nearest neighbors grows, the number of actual distance calculations grows proportionally. This computation is also expensive compared with the other operations in the LSH forest structure, so limiting the number of candidates is another reasonable optimization at the algorithm level. In the synchronous ascend phase, one of the terminating conditions of the candidate search is the maximum depth reaching 0 (the while loop runs as long as x > 0). But it need not run until the maximum depth reaches 0, because we are looking for the most viable candidates. It is a known fact that for smaller matching hash lengths it is unlikely that two data points are very similar; therefore, as the considered hash length decreases, the eligibility of the candidates we get also decreases. Thus having a lower bound on the maximum depth bounds the candidates we collect and decreases the probability of collecting irrelevant candidates. This can be done by setting x > lower_bound in the while loop, for some value of lower_bound > 0 (see the sketch below). But there is a risk of not retrieving the required number of neighbors for very small data sets, because the candidate search may terminate before it acquires the required number of candidates. Therefore the user should be aware of a suitable lower_bound for the queries at initialization time.
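The following is only a rough sketch of that idea, not the actual scikit-learn code; get_candidates, m (the number of required candidates) and lower_bound are placeholders:

def synchronous_ascend(trees, bin_queries, max_depth, m, lower_bound):
    """Collect candidates from all trees, shortening the matched prefix one
    bit at a time, but stop at lower_bound instead of descending to 0."""
    candidates = []
    x = max_depth
    while x > lower_bound and len(candidates) < m:
        for tree, bin_query in zip(trees, bin_queries):
            # get_candidates stands in for the prefix lookup within one tree
            candidates.extend(get_candidates(tree, bin_query, x))
        x -= 1
    return candidates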

Optimizations on the implementation

I have indicated before that the bisect operations that come with Python are faster for small lists, but as the size of the list grows numpy's searchsorted function becomes faster. In my previous implementation, I used an altered version of the bisect_right function, because the original does not fulfill the requirement of finding the right-most index for a particular hash length in a sorted array (please refer to my previous post if things are not clear). But we cannot create an alternative version of numpy's searchsorted function, so a suitable transformation is required on the hash itself.

Suppose we have a binary hash item = 001110. What we need is the largest number whose first 4 bits are the same as item's; 001111 satisfies this requirement. So the transformation needed is replacing the last 2 bits with 1s.

def transform_right(item, h, hash_size):
    transformed_item = item[:h] + "".join(['1' for i in range(hash_size - h)])
    return transformed_item

The transformation needed to get the left-most index is simpler. It is 001100, which is the last two bits replaced by 0s; for searching purposes this is the same as having 0011, so a string slicing operation item[:4] will do the job.

It gets more complicated when it comes to integer hashes. The integer has to be converted to a string, transformed, and re-converted to an integer.

def integer_transform(int_item, hash_size, h):
    str_item = ('{0:0' + str(hash_size) + 'b}').format(int_item)
    transformed_str = transform_right(str_item, h, hash_size)
    transformed_int = int(transformed_str, 2)
    return transformed_int

But this is a very expensive operation at query time. For indexing, only int(binary_hash, 2) is required, and this does not have much effect because applying the LSH algorithm to the entire data set dominates that operation completely. But for a query this is a significant overhead, therefore we need an alternative.

The required integer representations for the left and right transformations can be obtained by performing bitwise \(AND\) and \(OR\) operations with suitable masks. The masks are generated by the following function.

def _generate_masks(hash_size):
    left_masks, right_masks = [], []        
    for length in range(hash_size+1):
        left_mask  = int("".join(['1' for i in range(length)])
                         + "".join(['0' for i in range(hash_size-length)]), 2)
        left_masks.append(left_mask)
        right_mask = int("".join(['0' for i in range(length)])
                         + "".join(['1' for i in range(hash_size-length)]), 2)
        right_masks.append(right_mask)
    return left_masks, right_masks

These masks are generated at indexing time. Then the masks are applied to the integer hashes.

def apply_masks(item, left_masks, right_masks, h):
    item_left = item & left_masks[h]
    item_right = item | right_masks[h]
    return item_left, item_right

The left most and right most indices of a sorted array can be obtained in the following fashion.

import numpy as np
def find_matching_indices(sorted_array, item_left, item_right):
    left_index = np.searchsorted(sorted_array, item_left)
    right_index = np.searchsorted(sorted_array, item_right, side = 'right')
    return np.arange(left_index, right_index)

As the masks are precomputed, the speed overhead at query time is minimal in this approach. But there is still a small overhead, because the original hashed binary numbers are stored in an array, and those numbers need to be concatenated and converted to obtain the corresponding integers. If the integers are cached, this overhead is eliminated.

Cached hash

This is a method which guarantees a significant speed-up in the queries at the expense of index building speed. At indexing time, we can create a dictionary whose keys are the arrays of hashed bits and whose values are the corresponding integers. The number of items in the dictionary is the number of bit combinations for a particular hash length.

Example:

Suppose the hash length is 3. Then the bit combinations are 000, 001, 010, 011, 100, 101, 110 and 111, which map to the integers 0 through 7.

Then it will be only a matter of a dictionary look-up. Once an integer is required for a particular hash bit array, it can be retrieved directly from the dictionary.

The implementation of this type of hash is a bit tricky. Typically in LSH forests, the maximum hash length will be 32, so the size of the dictionary would be \(2^n = 4294967296\) where \(n = 32\). This is an extremely infeasible size to hold in memory (it may require tens of gigabytes). But when \(n = 16\), the size becomes \(65536\), which can easily be stored in memory. Therefore we use two caches of hash length 16.

First a list for digits 0 and 1 is created.

digits = ['0', '1']

Then the cache is created as follows.

import itertools

cache_N = 32 // 2          # each cache covers half of the 32-bit hash
c = 2 ** cache_N           # compute once

cache = {
   tuple(x): int("".join([digits[y] for y in x]), 2)
   for x in itertools.product((0, 1), repeat=cache_N)
}

Suppose item is a list of length 32 with binary values (0s and 1s). Then we can obtain the corresponding integer for that item as follows:

xx = tuple(item)
int_item = cache[xx[:16]] * c + cache[xx[16:]]

This technique can be used in queries to obtain a significant improvement in speed assuming that the user is ready to sacrifice a portion of the memory.

The performance of all these techniques was measured using Python's line_profiler and memory_profiler. In my next post, I will explain concisely how these tools have been used to evaluate these implementations.

July 07, 2014 12:00 AM

July 06, 2014

Vighnesh Birodkar
(scikit-image student)

scikit-image RAG Introduction

Humans possess an incredible ability to identify objects in an image. Image processing algorithms are still far behind this ability. Segmentation is the process of dividing an image into meaningful regions. All pixels belonging to a region should get a unique label in an ideal segmentation.

The current segmentation functions in scikit-image are too fine grained and fall closer to superpixel methods, providing a starting point for segmentation. Region Adjacency Graphs (RAGs) are a common data structure for many segmentation algorithms. As part of GSoC this year I am implementing RAGs for scikit-image. The current HEAD of scikit-image’s master branch contains my RAG implementation based on Networkx from my recent Pull Request. In the example below, we will see how Region Adjacency Graphs (RAGs) attempt to solve the segmentation problem. Please note that you need the latest master branch of scikit-image to run the following code.

Getting Started

We define the function show_img in preference to the standard call to imshow to set nice default size parameters.
We start with coffee, a nice fresh image of a coffee cup.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
from skimage.measure import regionprops
from skimage import draw
import numpy as np


def show_img(img):
    width = 10.0
    height = img.shape[0]*width/img.shape[1]
    f = plt.figure(figsize=(width, height))
    plt.imshow(img)

img = data.coffee()
show_img(img)

[Image: rag_demo_2_0]

Over Segmentation

We segment the image using the SLIC algorithm. The SLIC algorithm will
assign a unique label to each region. This is a
localized cluster of pixels sharing some similar property, in this case their
color. The label of each pixel is stored in the labels array.

regionprops helps us compute various features of these regions. We will be
using the centroid, purely for visualization.

labels = segmentation.slic(img, compactness=30, n_segments=400)
labels = labels + 1  # So that no labelled region is 0 and ignored by regionprops
regions = regionprops(labels)

The label2rgb function assigns a specific color to all pixels belonging to one
region (having the same label). In this case, in label_rgb each pixel is
replaced with the average RGB color of its region.

label_rgb = color.label2rgb(labels, img, kind='avg')
show_img(label_rgb)

[Image: rag_demo_6_0]
Just for clarity, we use mark_boundaries to highlight the region boundaries.
You will notice that the image is divided into more regions than required. This
phenomenon is called over-segmentation.

label_rgb = segmentation.mark_boundaries(label_rgb, labels, (0, 0, 0))
show_img(label_rgb)

[Image: rag_demo_8_0]

Enter, RAGs

Region Adjacency Graphs, as the name suggests, represent the adjacency of regions
with a graph. Each region in the image is a node in the graph. There is an edge
between every pair of adjacent regions (regions whose pixels are adjacent). The
weight between every two nodes can be defined in a variety of ways. For this
example, we will use the difference of average color between two regions as
their edge weight. The more similar the regions, the lesser the weight between
them. Because we are using the difference in mean color to compute the edge weight,
the method has been named rag_mean_color.

rag = graph.rag_mean_color(img, labels)

For our visualization, we are also adding an additional property to each node: the
coordinates of its centroid.

for region in regions:
    rag.node[region['label']]['centroid'] = region['centroid']

display_edges is a function to draw the edges of a RAG on its corresponding
image. It draws edges as green lines and centroids as yellow dots.
It also accepts a threshold argument; we only draw edges with weight below this threshold.

def display_edges(image, g, threshold):
    """Draw edges of a RAG on its image

    Returns a modified image with the edges drawn. Edges are drawn in green
    and nodes are drawn in yellow.

    Parameters
    ----------
    image : ndarray
        The image to be drawn on.
    g : RAG
        The Region Adjacency Graph.
    threshold : float
        Only edges in `g` below `threshold` are drawn.

    Returns
    -------
    out : ndarray
        Image with the edges drawn.
    """
    image = image.copy()
    for edge in g.edges_iter():
        n1, n2 = edge

        # use the graph passed in as an argument, not a global variable
        r1, c1 = map(int, g.node[n1]['centroid'])
        r2, c2 = map(int, g.node[n2]['centroid'])

        line = draw.line(r1, c1, r2, c2)
        circle = draw.circle(r1, c1, 2)

        if g[n1][n2]['weight'] < threshold:
            image[line] = 0, 1, 0
        image[circle] = 1, 1, 0

    return image

We call the function with threshold = infinity so that all edges are drawn. I
myself was surprised by the beauty of the following output.

edges_drawn_all = display_edges(label_rgb, rag, np.inf )
show_img(edges_drawn_all)

[Image: rag_demo_16_0]

Let’s see what happens by setting the threshold to 29, a value I arrived at with
some trial and error.

edges_drawn_29 = display_edges(label_rgb, rag, 29 )
show_img(edges_drawn_29)

[Image: index2]

Alas, the graph is cut

As you can see above, the RAG is now divided into disconnected regions. If you
notice, the table above and to the right of the dish is one big connected
component.

Threshold Cut

The function cut_threshold keeps only the edges with weights below a specified threshold and then
labels each resulting connected component as one region. Once the RAG is constructed, many similar
and more sophisticated strategies can improve the initial segmentation.

final_labels = graph.cut_threshold(labels, rag, 29)
final_label_rgb = color.label2rgb(final_labels, img, kind='avg')
show_img(final_label_rgb)

[Image: index]
Not perfect, but not that bad I’d say. My next steps will be to implement better algorithms to process the RAG after the initial segmentation. These include the merging predicates mentioned here and N-cut.


by Vighnesh Birodkar at July 06, 2014 03:46 PM

Mustafa Furkan Kaptan
(Vispy student)

Testing Screenshots

Hello reader,

This week I updated and tested the screenshot ability of Vispy.

Normally, Vispy utilizes the `read_pixel()` function, which basically reads pixels from a GL context. This function returns a numpy array, but somehow an upside-down one. So what I did was to flip that array, which is extremely easy in Python, and add a test for it. The test was very simple too: we draw a white square at the top left of a black screen. Then, after reading the pixels, we check whether the image is flipped or not by calculating the sum of the pixels in the corners. If the top-left corner's pixel values are greater than 0 and the rest are equal to 0, it means we took the screenshot correctly. A rough sketch of the flip and the corner check is below. You should know that I had never dealt with drawing something with GL functions before; that was my first time.
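This is only an illustrative sketch of the idea with numpy, not Vispy's actual test code; the pixel array is assumed to come from the read-pixels call:

import numpy as np

def flip_image(pixels):
    # the raw read comes out upside-down, so flip it vertically
    return np.flipud(pixels)

def corners_ok(img):
    # top-left corner should contain the drawn white square,
    # the other three corners should remain black
    h, w = img.shape[:2]
    top_left = img[:2, :2].sum()
    others = (img[:2, w-2:].sum() +
              img[h-2:, :2].sum() +
              img[h-2:, w-2:].sum())
    return top_left > 0 and others == 0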

Now I am making another PR with Luke Campagnola's (a Vispy dev) pure Python PNG writer. This will help us remove another image library dependency, like PIL. More updates will come after it has been merged.

See you next post!

by Mustafa Kaptan (noreply@blogger.com) at July 06, 2014 12:31 PM

July 05, 2014

Rajul
(SunPy student)

SunPy Database Browser Meeting on 5th July, 2014

Nabil, Stuart and I had a discussion regarding the SunPy Database Browser on the 5th of July, 2014. Various points were discussed during the meeting:
1. EIT files seem to have a problem with being loaded into the SunPy database. This is because the SunPy database apparently does not allow the WAVEUNIT attribute to be None, and the EIT file in sunpy.data.sample has WAVEUNIT as None, so it cannot be loaded into the database. This can be due to one of the following two problems:
a. The header of SunPy's sample EIT file is faulty
b. The SunPy database module apparently does not have the functionality to deal with a None WAVEUNIT, and hence with EIT files
So Nabil has assigned me to download more EIT test files and check their headers. If the header of SunPy's sample file is faulty, we will replace SunPy's test file; otherwise we will amend sunpy.database.
2. Stuart has created a pull request for solar coordinate handling in Ginga. I shall try to pull in this PR, render SunPy's sample solar FITS files and take screenshots to send to Eric.

3. Start work on the Search Box feature in the Database GUI

by Rajul Srivastava (noreply@blogger.com) at July 05, 2014 05:51 PM

July 04, 2014

Tarun Gaba
(PyDy student)

GSoC 14: Midterm Evaluations have ended!

It has been a week since the midterm evaluations ended, and I am back to work after a small break (with permission from my mentor, of course!). I have been working on writing a test suite for the Dynamics Visualizer; this is the wrapping-up part of the visualizer for this GSoC. [Here](/blog/visualization/index.html?load=samples/scene_desc.json) is a visualization of a rolling disc (it is slightly buggy though) that I prepared. To view the animation, allow the visualization to load in the webpage (it loads automatically), and then hit the `Play Animation` button. After writing some tests for the visualizer, I am going to start fleshing out the API for the module that provides IPython support to the visualizer. The main aim of this module is to make the visualizer interactive, in the sense that a user should be able to change all the variables from the GUI (which is rendered inside the notebook's output cell) and then rerun the simulations without having to write or execute any code manually. The data of the new simulations will be automatically fed into the visualizer, and then it can be viewed as an animation. This whole workflow will be very convenient for existing PyDy users as well as new ones. It will be particularly convenient for those who just want to play around with existing systems by changing the system variables and seeing how that affects the resulting animations. With the development of this module, as well as the ongoing improvements in the other PyDy modules (by my fellow GSoC'ers from PyDy), we should be able to perform lightning-fast simulations of a system, as well as view them on a canvas. I will keep posting about the new work I will be doing, with better details (once I actually start implementing new stuff!).

July 04, 2014 06:11 PM

July 03, 2014

Alan Leggitt
(MNE-Python student)

Day 32: Reading and writing volume names to fif file


Today I added functionality to read and write segmentation information into .fif files. This is handy for distinguishing
between multiple source spaces, each corresponding to a particular subcortical volume.

For an idea of the changes needed in the MNE code, check out the latest commit to my pull request.

Below is a verification that everything worked.

In [21]:
import mne
from mne.datasets import spm_face

mne.set_log_level(False)

data_path = spm_face.data_path()

# setup the cortical surface space
src_fname = data_path + '/MEG/spm/combined-src.fif'
src = mne.setup_source_space('spm', overwrite=True)

The add_dist parameter to mne.setup_source_space currently defaults to False, but the default will change to True in release 0.9. Specify the parameter explicitly to avoid this warning.

In [22]:
# add two subcortical volumes
src = mne.source_space.add_subcortical_volumes(src, ['Left-Amygdala', 'Right-Amygdala'])
mne.write_source_spaces(src_fname, src)

/home/alan/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/shape_base.py:834: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
return c.reshape(shape)
/home/alan/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/shape_base.py:834: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
return c.reshape(shape)
/home/alan/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/shape_base.py:834: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
return c.reshape(shape)

In [23]:
# display volume names
src[2]['seg_name'], src[3]['seg_name']
Out[23]:

('Left-Amygdala', 'Right-Amygdala')
In [24]:
# read saved version of combined source space
src_read = mne.read_source_spaces(src_fname)
In [25]:
# display volume names
src_read[2]['seg_name'], src_read[3]['seg_name']
Out[25]:

(u'Left-Amygdala', u'Right-Amygdala')

by Alan Leggitt (noreply@blogger.com) at July 03, 2014 01:01 AM

June 30, 2014

Ajitesh Gupta(:randomax)
(MoinMoin student)

Week 5 & 6 work

This week I started work on the meta-data section of the UI. Earlier the meta-data section of the UI was displayed as raw data. Below is the screenshot -


The earlier raw metadata view

As you can see, the meta-data section just shows a dictionary with HTML markup. In this code the item metadata was converted to a Python dictionary and then to text using json.dumps() in the function "meta_dict_to_text()". It was then HTML-escaped and put in a <pre> tag by the function named "_render_meta" in "Moinmoin/items/__init__.py". This was then returned to the "show_item_meta" function in the views section (Moinmoin/apps/frontend/views.py). This function would carry out the HTML markup of the input using the flask Markup() function. This was passed on to the template and displayed as such.

I first made another function, "meta_to_dict()", to just convert the metadata to a Python dictionary, so that I could use that dictionary in the template to extract data and customize its display easily. I further removed the duplicated part of "meta_dict_to_text()" and instead called "meta_to_dict()" for the necessary processing. I also changed the name of the function from "_render_meta()" to "_meta_info", as it was no longer rendering info but rather returning a dictionary, and changed the variable "meta_rendered" to "meta" for the same reason. In the template I show the user-friendly content type along with the detailed one present earlier, using the function "shorten_ctype()" added in the search UI patch. I also added hyperlinks and a visualisation of the validity of the links: if a particular item pointed to by a link in the "Item Links" list or "Item Transclusions" list exists, the link appears blue, else it appears red. I also changed the "Modified Time" to a user-readable format, and did the same with the "Item Size". Here is how it looks with my patch - https://codereview.appspot.com/110210043/


 The new metadata view

I also worked on fixing the footer to the bottom of the page, which earlier used to come up to the middle of the page in case of small content. Here is an example -

The footer midway up the page

I added the classes "moin-header", "moin-content" and "moin-footer" to the layout.html template in the basic theme for easy CSS selection. In the CSS I changed the html and body elements to have a height of 100%, and the container section to have a min-height of 100% and relative position. The moin-content was changed to have a bottom padding of 100px, in accordance with the size of the footer. The moin-footer now has bottom set to 0, absolute position, and a width of 100%. Here is how it looks now - 


The new footer at the bottom

CRs for this week :
 
Also this week I finally got to commit the previous CRs-






I also realised that I should be making small patches, so that they are easier to review, get reviewed faster, and as such get committed faster. Due to the large size of my previous patches, they took too long to review and were thus committed late.

by Ajitesh Gupta (noreply@blogger.com) at June 30, 2014 06:04 PM

Tyler Wade
(PyPy student)

UTF-8 Progress Update and Examples of Index Caching

Progress update: I'm still working on replacing unicodes in PyPy with a UTF-8 implementation. Things are progressing more slowly than I'd hoped they might. At this point I have most of the codecs working and most of the application-level unicode tests passing, but there are still a lot of little corner cases to fix and I don't even want to think about regex right now. That said, the summer is only about half over.

Since I wanted to have something interesting to write about, I've done some looking into how other implementations that use a UTF-8 representation handle index caching (if at all) and thought I'd share what I've found.

Relatively few languages with built-in unicode support use a UTF-8 representation. The majority seem to prefer UTF-16, for better or for worse. The implementations that use UTF-8 which I found worth mentioning are Perl and wxWidgets.

wxWidgets by default uses UTF-16 or UTF-32, depending on the size of wchar_t, but it has an option to use UTF-8. When UTF-8 is enabled, wxWidgets uses an extremely simple index caching system: it caches the byte position of the last accessed character. When indexing a string, it starts searching from the cached byte position if the index being looked up is higher than the cached index, otherwise it searches from the start of the string. It will never search backwards. The idea here is to make in-order traversal -- like a typical for(int i = 0; i < s.size(); i++) loop -- O(n). Unfortunately, a backwards traversal is still O(n^2) and random access is still O(n). A toy illustration of the idea is below.
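The following is only a toy Python illustration of that caching scheme (it is neither wxWidgets' nor PyPy's actual code); it remembers only the byte offset of the last character accessed and never searches backwards:

class UTF8String(object):
    """Toy wxWidgets-style cache over a UTF-8 byte string."""
    def __init__(self, data):
        self.data = data          # UTF-8 encoded bytes
        self.last_char = 0        # character index of the cached position
        self.last_byte = 0        # byte offset of that character

    def _advance(self, byte_pos, n_chars):
        # move forward over n_chars characters, skipping continuation bytes
        for _ in range(n_chars):
            byte_pos += 1
            while byte_pos < len(self.data) and \
                    (ord(self.data[byte_pos:byte_pos + 1]) & 0xC0) == 0x80:
                byte_pos += 1
        return byte_pos

    def byte_index(self, char_index):
        if char_index >= self.last_char:
            start_byte, start_char = self.last_byte, self.last_char
        else:
            start_byte, start_char = 0, 0      # never search backwards
        byte_pos = self._advance(start_byte, char_index - start_char)
        self.last_char, self.last_byte = char_index, byte_pos
        return byte_pos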

Perl's index caching is much more elaborate.  Perl caches two character/byte-index pairs.  When indexing a string, the search will start from the closest of the cached indices, the start, or the end of the string. It can count backwards as well as forwards through the string, assuming that counting backwards is approximately twice as slow as counting forward.

Perl's method for selecting the cached indices is also more sophisticated than wx's. Initially, the first two indices accessed are used as the cached indices. After the first two are cached, when a new index is accessed, Perl considers replacing the current two cached indices with one of the two possible pairs made of one of the old indices and the new index. It does so by computing a root-mean-square-like distance between the start of the string, the first index, the second index and the end of the string for each of the 3 possible pairs of indices. That's probably as clear as mud, so maybe this excerpt from the Perl source will help:
#define THREEWAY_SQUARE(a,b,c,d) \
((float)((d) - (c))) * ((float)((d) - (c))) \
+ ((float)((c) - (b))) * ((float)((c) - (b))) \
+ ((float)((b) - (a))) * ((float)((b) - (a)))

The pair that minimizes THREEWAY_SQUARE(start, low index, high index, end) is kept.
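For illustration, here is a rough Python rendering of that selection rule (assumed names, not Perl's actual code): it evaluates the three candidate pairs and keeps the one with the smallest three-way square.

def threeway_square(a, b, c, d):
    # Python version of the THREEWAY_SQUARE macro above
    return float(d - c) ** 2 + float(c - b) ** 2 + float(b - a) ** 2

def choose_cached_pair(start, end, cached, new_index):
    """Try replacing either of the old cached indices with the newly
    accessed one; keep the pair minimizing the three-way square."""
    low, high = cached
    candidates = [
        tuple(sorted((low, high))),
        tuple(sorted((low, new_index))),
        tuple(sorted((high, new_index))),
    ]
    return min(candidates,
               key=lambda pair: threeway_square(start, pair[0], pair[1], end))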

This method seems to be better than wx's for almost all cases except in-order traversal. I'm not actually sure what the complexity here would be; I think it's still O(n) for random access.

by Tyler Wade (noreply@blogger.com) at June 30, 2014 05:43 AM

June 27, 2014

Varun Sharma
(GNU Mailman student)

Moving the project to Python's abc (abstract base classes)

While expanding the project, I noticed that I was writing a lot of boilerplate code and not following the DRY principle, so I decided to either switch to zope interfaces or Python's abc. After doing some research and reading the docs of both, I decided to switch to abc. The reasons for preferring abc over zope interfaces were:
1. ABCs will help me make my project more DRY, unlike zope interfaces, which would make the design more explicit but would not help me reduce the boilerplate code.
2. I think that not using zope, which at this point is not actually required due to the small size of the project, will help reduce the dependencies of my project.

So, now I am using a base module which stores all the abstract base classes, and I will try to declare a single base class per .py file to reduce ambiguity. A minimal sketch of the pattern is below.
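This is only a minimal, hypothetical sketch of the abc pattern (BaseRunner and its methods are not the project's actual class names): shared boilerplate lives in the base class, while subclasses are forced to implement the abstract parts.

from abc import ABCMeta, abstractmethod


class BaseRunner(object):
    """Hypothetical abstract base class holding the shared boilerplate."""
    __metaclass__ = ABCMeta          # Python 2 style metaclass declaration

    def run(self):
        self.setup()                 # common code shared by all subclasses
        return self.execute()

    def setup(self):
        pass                         # default no-op, subclasses may override

    @abstractmethod
    def execute(self):
        """Each concrete subclass must provide its own execute()."""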

by Varun Sharma (noreply@blogger.com) at June 27, 2014 09:50 AM

June 25, 2014

Janani
(SciPy/NumPy student)

With those mid-sem bells chiming, it is time for another update.
The following checkpoints have been reached:

1. The implementation of the ellipsoidal harmonic function (also known as Lame's function) of the first kind.
The following is the link to the pull request:
https://github.com/scipy/scipy/pull/3753
The implementation is in Cython and calls a LAPACK subroutine. It is based on the python implementation by Knepley and Bardhan given here.
Further, given the absence of Lame's function implementations in many other libraries, preparing an extensive test suite is a challenge. At present the output of the function is tested for a certain range of inputs. The immediate next plan is to try to improve the test suite.
This will be followed by the implementation of the ellipsoidal harmonic function of the second kind.


2. Before this, the spherical harmonic functions were improved by reimplementing them in Cython rather than Python, thus improving the speed.
The details, having been elaborately touched on in the previous post, are omitted here, saving people from the boredom of redundancy. The pull request can be accessed at https://github.com/scipy/scipy/pull/3692

Thanks to the constant support of my brilliant mentors Pauli, Ralf and Stefan (and, I suppose, a few miracles!), the progress is as per schedule. Hoping that this pace will be maintained or, even better, quickened post the mid-sem evaluation!

Signing off till next time,
Janani

by janani padmanabhan (noreply@blogger.com) at June 25, 2014 05:59 PM

Rishabh Sharma
(SunPy student)

Unified Downloader

First the two interesting things I came across:

1) I read the Perl introduction. I could not make up my first impression, and I am currently thinking about how I should improve at it. College used to give assignments which promote self-exploration, but I don't have anything to aim at with Perl. I would love some suggestions from my readers.
2) I read the Apache httpd server introduction. They have very good docs at apache.

Unified Downloader:
Progress made: I have finished separating the download code for the EVE, LYRA and NoRH sources. They (custom download clients) have been structured to inherit from GenericClient. The custom clients override _get_url_for_timerange, as these are unique to each client and have to be separate from GenericClient. The GenericClient has a query method which returns a QueryResponse object. This QueryResponse object is a container (a modified list) of QueryRequest blocks. These QueryRequest blocks aim to give information about the answer to the query (Source…). The GenericClient also has a get method which is given the QueryResponse object as input. This method returns a Result (defined in vso/vso.py) object. The aim of the get method is to download the files. These clients have been made to ape the VSO module as closely as possible, providing continuity of interface to users.

I have also set up the structure (a basic factory) for the main unified downloader client to dispatch the queries. This functions as a wrapper over the aforementioned clients; e.g. the main unified downloader client has a query method which basically finds the appropriate client for servicing the query and calls that client's query method with the given query. An illustrative sketch is below.
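The following is only an illustrative sketch of that factory idea; the class names, the can_handle_query predicate and the back-reference from the response to its client are assumptions for the example, not necessarily the real SunPy code:

class UnifiedDownloaderClient(object):
    def __init__(self, clients):
        self.clients = clients                     # e.g. [EVEClient(), LYRAClient(), NoRHClient()]

    def query(self, *query):
        client = self._get_registered_client(*query)
        return client.query(*query)                # returns a QueryResponse-like object

    def get(self, query_response):
        client = query_response.client             # assumed back-reference to the servicing client
        return client.get(query_response)          # downloads the files, returns a Results object

    def _get_registered_client(self, *query):
        for client in self.clients:
            if client.can_handle_query(*query):    # hypothetical predicate on each client
                return client
        raise ValueError("No client can service this query")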

Remaining work:
Port all the sources.
Better query dispatching (deciding whether to query the vso or jsoc clients).
Tests.

Thanks for reading


by rishabhsharmagunner at June 25, 2014 12:26 PM

Issamou
(scikit-learn student)

June 24, 2014

Rajul
(SunPy student)

SunPy Ginga Database Browser Status

I have been able to get a local plugin to work in the main interface of Ginga, and it is now able to open a new window and display the database in a separate window. I am attaching a few of the screenshots. I shall now also try to display the database as a Ginga tab, but I am afraid that it will be slightly cluttered if we do that; however, I will try it anyway. I shall also explore whether we can easily add a separate side box under the main viewing area of Ginga (like the debugging box in most IDEs) and use it. To get the plugin working, follow these instructions:

  1. Copy the file SunPyDatabase.py to $HOME/.ginga/plugins
  2. Run from your terminal "sudo ginga --plugins=SunPyDatabase"
  3. The Ginga window opens; select SunPyDatabase by clicking on the Operation button, which is present at the bottom of the central image view area.
I am also attaching a few screenshots that I have obtained.


by Rajul Srivastava (noreply@blogger.com) at June 24, 2014 05:20 PM

Rishabh Raj
(scikit-image student)

The Glass Is Half Filled!

We’re almost halfway through and I’m happy to say we’ve covered a lot of ground.

The very first of the issues which we were trying to solve was to find a way to connect to the Docker container running the code and talk to it: submit code, fetch the output, etc. After discussions with devs on the Docker channel, we came to the conclusion that using the Python subprocess module to attach to the process running the `docker attach` command is indeed an elegant way to do this.

The flow goes like this: our code interface submits a JSON to the server, which then fires up a Docker container. The next phase involves using the subprocess module from Python to launch the `docker attach` command in a process, which returns a handle that can be used to communicate with it. When the container is fired up, it is launched with the Python shell, so sending in code through the handle we procured earlier works. As you might be wondering, the communicate method just returns to us the STDOUT and STDERR of the running code, but one step at a time, we take that with open arms. This output is then sent back to the interface as a JSON response to the AJAX call which started it all in the first place. A rough sketch of this step is below.
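A rough sketch of that step, under the assumptions described above (the container is already running with a Python shell as its entrypoint, and error handling and timeouts are omitted):

import subprocess

def run_code_in_container(container_id, code):
    """Attach to the running container, feed it the submitted code and
    collect whatever it prints on stdout/stderr."""
    proc = subprocess.Popen(['docker', 'attach', container_id],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate(code)   # send the code, read the output
    return out, err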

Here’s a sample video to see this simple thing in action – https://www.youtube.com/watch?v=XTdfwV8nwJM

Okay, so we can submit code and fetch its output, all the time running it inside a Docker container, cool. But we need more. Our use case deals with images being generated as output, sometimes multiple times in the code. A clean solution for this would be to somehow write the output images into files, fetch those files and send them back as the response.

So, the next problem we tackled was the transfer of a binary file from inside the Docker container to the user’s browser. The idea was to ‘uuencode’ the binary, then transfer it and decode it on the client side. We made use of the `docker cp` command to copy files from inside the Docker container’s filesystem to the webserver’s proximity. For the time being, we just run a query to fetch the list of files with a particular extension and copy them one by one. This is again done by capturing the STDOUT of a ‘folder-content-listing’ snippet which prints the name of each file of interest on one line. We then run a `docker cp` command to copy each file. Interestingly, the stream yields the file ‘tar’ed; we process the stream using ‘tarfile’ (a module in Python) and then we have the binary file. We uuencode it and send the string back as a JSON response. In case of multiple files, the JSON simply has multiple entries in the format ‘filename’ : ‘string’. A sketch of the extraction step is below.
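A sketch of the extraction step, assuming (as described above) that the copy hands us the file as a tar stream already held in memory as tar_bytes:

import io
import tarfile
import uu

def encode_copied_file(tar_bytes):
    """Pull the single member out of the tar stream and uuencode it
    so it can be placed in the JSON response."""
    tar = tarfile.open(fileobj=io.BytesIO(tar_bytes))
    member = tar.getmembers()[0]
    encoded = io.BytesIO()
    uu.encode(tar.extractfile(member), encoded, name=member.name)
    return member.name, encoded.getvalue()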

Here’s a sample video showing this at work – https://www.youtube.com/watch?v=7s2ORC_ZZok
but for lack of good client side ‘uudecode’, this just shows the output of ‘uuencoding’ whatever was produced by running the code.

Then comes the most interesting problem to be solved: our use case targets people running code which in the end generates image plots or figures, but we don’t expect them to write those images following a specific protocol so that we could fetch them back using the setup we have in place. Hence the need to modify the backend which generates all these figures and plots, which is the matplotlib backend. This is indeed an awesome library which allows us to suppress the use of a GUI to render plots / output images and provides us with helper functions to iterate through all figures generated by the code. We basically use the ‘agg’ backend (link for more info), iterate through the figure handles and save them one by one on the isolated container filesystem.
Once this is done, we basically recreate the process of transferring a binary file as mentioned in the second half above. We exploit the fact that the `<img>` tag can render images directly using the base64 version of the binary, and use that instead of `uuencode`ing it, which is difficult to decode on the client. Once we have the base64 versions of the output images, we add them as the source of the `<img>` tag and they spring to life on the webpage. An experiment we ran suggested the ratio of the size of the uuencoded image to the base64 version to be ~1.03, so we’re not really losing anything. A sketch of the figure-collection step is below.
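This is only a minimal sketch of the figure-collection idea (function name and the in-memory variant are assumptions; the real setup saves the figures on the container filesystem):

import base64
import io

import matplotlib
matplotlib.use('Agg')          # no GUI needed inside the container
import matplotlib.pyplot as plt

def collect_figures_as_base64():
    """After the user code has run, walk over every open figure,
    render it to an in-memory PNG and base64-encode it for the <img> tag."""
    images = {}
    for num in plt.get_fignums():
        fig = plt.figure(num)
        buf = io.BytesIO()
        fig.savefig(buf, format='png')
        images['figure_%d.png' % num] = base64.b64encode(buf.getvalue())
    return images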

Here’s a sample video showing this at work – https://www.youtube.com/watch?v=8390QVef1-E

We moved to a new location on https://github.com/scikit-image/skimage-docker (officially under the skimage banner, happy to be a part of team docker).

Next we work on imposing CPU and memory constraints on the running task, and then the deployment on an actual server begins. Exciting times ahead.


by sharky932014 at June 24, 2014 03:59 AM

June 23, 2014

Akhil Nair
(Kivy student)

A lot and a little

This Summer of Code started on 19th May and today it is 23rd June. What did I do the whole month, you ask? A lot and a little.

This month was as action packed as the climax scene of a Transformers movie!
 I'm sure the usual - had exams, had project submissions - wouldn't interest you, so let's talk about my project, how buildozer has fared and how I project it will fare.

The aim of the project is clear: have an automated process that packages any Kivy application for a target operating system.
My goal was to have target code for Linux by the midterm evaluation. Am I there? Well, almost.

At the beginning my mentors recommended pyinstaller as a tool which I could use to package the application. I spent a considerable amount of time over the past month trying to get pyinstaller to work, but failed. The problem was that after following all the procedures and converting the application to an executable, the application would give an import error on a PIL library. When tested, the problem also occurred at a basic level where I tried to package a small application which does nothing but import the pygame library. Because I had spent a major part of my month on this and a solution wasn't visible, I moved on to the basic packaging methodology used by Debian, which I had mentioned in my proposal. After piecing together various processes I was able to successfully package an application for Debian. Small functional aspects such as resolving dependencies still remain, but I'm optimistic that they can be solved.

The whole process can be organised and divided into the following steps:

1: Place the application in an empty directory (Yes there is a reason for that).

2: Create an __init__.py file in the application directory, so that python reads the application as a package.

3: Create a setup.py file outside the application directory.

4: The setup.py should be in the following format.
"
from distutils.core import setup

#This is a list of files to install, and where
#(relative to the 'root' dir, where setup.py is)
#You could be more specific.
files = ["*", "sounds/*.wav", "icons/*.png"] #List of static directories and files in your application

setup(name = "application_name",
    version = " ", #example 1.0
    description = "A small description",
    author = "myself and I",
    author_email = "email@someplace.com",
    url = "whatever",
    #Name the folder where your packages live:
    #(If you have other packages (dirs) or modules (py files) then
    #put them into the package directory - they will be found
    #recursively.)
    packages = ['application_name'],
    #The 'application_name' package must contain the files listed above.
    #This dict maps the package name =to=> directories
    #It says, the package *needs* these files.
    package_data = {'application_name' : files},
    #We will use this script to call the application as a program from
    #the command line. Ideally name it after your application.
    scripts = ["script_name"],
    long_description = """Really long text here.""",
    #
    #This next part is for the Cheese Shop.
    #classifiers = []
    )

"

5: Create another file named as the script_name from the setup.py

6: This file will ideally look something like this.
"
#!/usr/bin/python
from application import main #Here 'application' is the application directory.

main.Classname.run()  #Ideally your application's main entry point

"

7: Now run the command "python setup.py sdist". This will create a directory named dist with a tarball of your application.

8: Traverse to the tarball and run the command
"py2dsc your_application_tar_file"

9: This will then create a directory named deb_dist. It will have a series of files, but we want the one named after your application. It should look similar to application_name-1.0 (1.0 or whatever other version you specified in your setup.py file).

10: Go into that directory and you will find a directory named debian. Just run the command "debuild", then go back to the deb_dist directory. You will have a Debian package waiting for you. Try installing it.

There are still certain things that need to be sorted out, such as key signing and dependency issues, but I will get to that soon. As you might have seen, the setup.py is the most important thing. Also, I am currently using the distutils library but am going to shift to setuptools, as setuptools has functionality to package directly to RPM (and maybe Windows too). A hypothetical setuptools version is sketched below.
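As a rough sketch of what that shift might look like (the metadata is carried over from the setup.py above; this is not the project's final code):

# Hypothetical setuptools version of the same setup.py
from setuptools import setup, find_packages

setup(
    name="application_name",
    version="1.0",
    description="A small description",
    author="myself and I",
    packages=find_packages(),
    package_data={'application_name': ["*", "sounds/*.wav", "icons/*.png"]},
    scripts=["script_name"],
)

With this in place, "python setup.py sdist" still works as before, and the standard "python setup.py bdist_rpm" command can build an RPM directly.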

I am currently writing the target code to implement this whole process, and will push it for review as soon as it is done.

by AkhilNair (noreply@blogger.com) at June 23, 2014 04:17 PM

Varun Sharma
(GNU Mailman student)

Mid-term summary

So, today 5 weeks are over since GSoC started, and I am running behind the scheduled timeline by around 1 week. I have written the server part (trying the patch on the server) of the ci-tool, but there is more to be done. Right now it works for a single project, but I have to think and get advice from the mentors on how to do it for multiple projects. Right now, I am using twisted's Protocol class to create the client and server TCP connection and send the diff file from client to server. Next, I need to implement a bidirectional TCP connection, so that I can first send the project name(s) and then the diff file, or send a metadata file along with the diff which can help identify the project(s).

Summarizing what I have done in the last 5 weeks: I have implemented the pull, test and try (for a single project) commands of the ci-tool.

The source of the project is at http://bazaar.launchpad.net/~varun/mailman/ci_tool/files/

by Varun Sharma (noreply@blogger.com) at June 23, 2014 09:53 AM

cfulton
(Statsmodels student)

Kalman Filter Initialization - The Stationary Case

The initialization problem

The Kalman filter is a recursion for optimally making inferences about an unknown state variable given a related observed variable. In particular, if the state variable at time $t$ is represented by $\alpha_t$, then the (linear, Gaussian) Kalman filter takes as input the mean and variance of that state conditional on observations up to time $t-1$ and provides as output the filtered mean and variance of the state at time $t$ and the predicted mean and variance of the state at time $t+1$.

More concretely, we denote (see Durbin and Koopman (2012) for all notation)

$$ \begin{align} \alpha_t | Y_{t-1} & \sim N(a_t, P_t) \\ \alpha_t | Y_{t} & \sim N(a_{t|t}, P_{t|t}) \\ \alpha_{t+1} | Y_{t} & \sim N(a_{t+1}, P_{t+1}) \\ \end{align} $$

Then the inputs to the Kalman filter recursion are $a_t$ and $P_t$ and the outputs are $a_{t|t}, P_{t|t}$ (called filtered values) and $a_{t+1}, P_{t+1}$ (called predicted values).

This process is done for $t = 1, \dots, n$. While the predicted values as outputs of the recursion are available as inputs to subsequent iterations, an important question is initialization: what values should be used as inputs to start the very first recursion.

Specifically, when running the recursion for $t = 1$, we need as inputs $a_1, P_1$. These values define, respectively, the expectation and variance / covariance matrix for the initial state $\alpha_1 | Y_0$. Here, though, $Y_0$ denotes the observation of no data, so in fact we are looking for the unconditional expectation and variance / covariance matrix of $\alpha_1$. The question is how to find these.

In general this is a rather difficult problem (for example for non-stationary processes), but for stationary processes an analytic solution can be found.

Stationary processes

A (covariance) stationary process is, very roughly speaking, one for which the mean and covariances are not time-dependent. What this means is that we can solve for the unconditional expectation and variance explicitly (this section follows Hamilton (1994), Chapter 13).

The state equation for a state-space process (to which the Kalman filter is applied) is

$$ \alpha_{t+1} = T \alpha_t + \eta_t $$

Below I set up the elements of a typical state equation like that which would be found in the ARMA case, where the transition matrix $T$ is a sort-of companion matrix. I'm setting it up in such a way that I'll be able to adjust the dimension of the state, so we can see how some of the below methods scale.

In [3]:
import numpy as np
from scipy import linalg

def state(m=10):
    T = np.zeros((m, m), dtype=complex)
    T[0,0] = 0.6 + 1j
    idx = np.diag_indices(m-1)
    T[(idx[0]+1, idx[1])] = 1
    
    Q = np.eye(m)
    
    return T, Q

Unconditional mean

Taking unconditional expectations of both sides yields:

$$ E[\alpha_{t+1}] = T E[ \alpha_t] + E[\eta_t] $$

or $(I - T) E[\alpha_t] = 0$ and given stationarity this means that the unique solution is $E[\alpha_t] = 0$ for all $t$. Thus in initializing the Kalman filter, we set $a_t = E[\alpha_t] = 0$.

Unconditional variance / covariance matrix

Slightly more tricky is the variance / covariance matrix. To find it (as in Hamilton) post-multiply by the transpose of the state and take expectations:

$$ E[\alpha_{t+1} \alpha_{t+1}'] = E[(T \alpha_t + \eta_t)(\alpha_t' T' + \eta_t')] $$

This yields an equation of the form (denoting by $\Sigma$ and $Q$ the variance / covariance matrices of the state and disturbance):

$$ \Sigma = T \Sigma T' + Q $$

Hamilton then shows that this equation can be solved as:

$$ vec(\Sigma) = [I - (T \otimes T)]^{-1} vec(Q) $$

where $\otimes$ refers to the Kronecker product. There are two things that jump out about this equation:

  1. It can be easily solved. In Python, it would look something like:
    m = T.shape[0]
    Sigma = np.linalg.inv(np.eye(m**2) - np.kron(T, T)).dot(Q.reshape(Q.size, 1)).reshape(m, m)

  2. It will scale very poorly (in terms of computational time) with the dimension of the state space ($m$). In particular, you have to take the inverse of an $m^2 \times m^2$ matrix.

Below I take a look at the timing for solving it this way using the code above (direct_inverse) and using the built-in scipy direct method (which uses a linear solver rather than taking the inverse, so it is a bit faster).

In []:
def direct_inverse(A, Q):
    n = A.shape[0]
    return np.linalg.inv(np.eye(n**2) - np.kron(A,A.conj())).dot(Q.reshape(Q.size, 1)).reshape(n,n)

def direct_solver(A, Q):
    return linalg.solve_discrete_lyapunov(A, Q)

# Example
from numpy.testing import assert_allclose
np.set_printoptions(precision=10)
sol1 = direct_inverse(T, Q)
sol2 = direct_solver(T, Q)

assert_allclose(sol1,sol2)
In [6]:
# Timings for m=1
T, Q = state(1)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
10000 loops, best of 3: 58.8 µs per loop
10000 loops, best of 3: 73.4 µs per loop

In [7]:
# Timings for m=5
T, Q = state(5)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
10000 loops, best of 3: 152 µs per loop
10000 loops, best of 3: 128 µs per loop

In [8]:
# Timings for m=10
T, Q = state(10)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
100 loops, best of 3: 1.71 ms per loop
1000 loops, best of 3: 1.03 ms per loop

In [11]:
# Timings for m=50
T, Q = state(50)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
1 loops, best of 3: 12.6 s per loop
1 loops, best of 3: 3.5 s per loop

Lyapunov equations

As you can notice by looking at the name of the scipy function, the equation describing the unconditional variance / covariance matrix, $\Sigma = T \Sigma T' + Q$ is an example of a discrete Lyapunov equation.

One place to turn to improve performance on matrix-related operations is to the underlying Fortran linear algebra libraries: BLAS and LAPACK; if there exists a special-case solver for discrete time Lyapunov equations, we can call that function and be done.

Unfortunately, no such function exists, but what does exist is a special-case solver for Sylvester equations (*trsyl), which are equations of the form $AX + XB = C$. Furthermore, the continuous Lyapunov equation, $AX + XA^H + Q = 0$, is a special case of a Sylvester equation. Thus if we can transform the discrete to a continuous Lyapunov equation, we can then solve it quickly as a Sylvester equation.

The current implementation of the scipy discrete Lyapunov solver does not do that, although their continuous solver solve_lyapunov does call solve_sylvester which calls *trsyl. So, we need to find a transformation from discrete to continuous and directly call solve_lyapunov which will do the heavy lifting for us.

It turns out that there are several transformations that will do it; see Gajic, Z., and M.T.J. Qureshi (2008) for details. Below I present two bilinear transformations and show their timings.

In [34]:
def bilinear1(A, Q):
    A = A.conj().T
    n = A.shape[0]
    eye = np.eye(n)
    B = np.linalg.inv(A - eye).dot(A + eye)
    res = linalg.solve_lyapunov(B.conj().T, -Q)
    return 0.5*(B - eye).conj().T.dot(res).dot(B - eye)

def bilinear2(A, Q):
    A = A.conj().T
    n = A.shape[0]
    eye = np.eye(n)
    AI_inv = np.linalg.inv(A + eye)
    B = (A - eye).dot(AI_inv)
    C = 2*np.linalg.inv(A.conj().T + eye).dot(Q).dot(AI_inv)
    return linalg.solve_lyapunov(B.conj().T, -C)

# Example:
T, Q = state(3)
sol3 = bilinear1(T, Q)
sol4 = bilinear2(T, Q)

assert_allclose(sol1,sol3)
assert_allclose(sol3,sol4)
In [36]:
# Timings for m=1
T, Q = state(1)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1000 loops, best of 3: 202 µs per loop
1000 loops, best of 3: 219 µs per loop

In [37]:
# Timings for m=5
T, Q = state(5)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1000 loops, best of 3: 257 µs per loop
1000 loops, best of 3: 316 µs per loop

In [38]:
# Timings for m=10
T, Q = state(10)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1000 loops, best of 3: 268 µs per loop
1000 loops, best of 3: 298 µs per loop

In [39]:
# Timings for m=50
T, Q = state(50)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
100 loops, best of 3: 2.31 ms per loop
100 loops, best of 3: 2.66 ms per loop

Notice that this method does so well we can even try $m=500$.

In [40]:
# Timings for m=500
T, Q = state(500)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1 loops, best of 3: 1.51 s per loop
1 loops, best of 3: 1.67 s per loop

Final thoughts

The first thing to notice is how much better the bilinear transformations do as $m$ grows large. They are able to take advantage of the special formulation of the problem so as to avoid many calculations that a generic inverse (or linear solver) would have to do. Second, though, for small $m$, the original analytic solutions are actually better.

I have submitted a pull request to Scipy to augment the solve_discrete_lyapunov for large $m$ ($m >= 10$) using the second bilinear transformation to solve it as a Sylvester equation.

Finally, below is a graph of the timings.

[Figure: plot of the timings]

by Chad Fulton at June 23, 2014 07:06 AM

June 22, 2014

Anurag Goel
(Mercurial student)

Still a long way to go and more milestones to cover

This blog post is the midterm summary update. The last two weeks were quite busy for me. As I mentioned in the previous post, I will try to cover things quickly. Below is the work progress up to the midterm.

Until now I have completed mainly two tasks.

Task 1 is about gathering timing info for test files. In this task, I calculated two main things.
1) User time taken by child processes.
2) System time taken by child processes.

This task has already been described in detail in the previous blog post. The patch for this task is under its 4th revision of review and is getting better with every revision. You can find more patch details here. A minimal illustration of obtaining child-process times is below.
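For illustration only (not necessarily how run-tests.py gathers the data): on Unix, os.times() already exposes the accumulated user and system CPU time of child processes of the current process.

import os

def child_times():
    times = os.times()
    return {'user': times[2],    # children's user time
            'sys': times[3]}     # children's system time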

Task 2 is about setting up a tool able to plot timing information for each file.
In this task I first introduced new '--json' functionality in "run-tests.py". When testing, if the user enables the '--json' option, the timing data gets stored in JSON format in a newly created 'report.json' file. The patch for this task is under its 2nd revision. You can find more patch details here.

After that, I wrote an html/javascript file which accesses this report.json file to plot a graph of test name vs. test time. The Mercurial (hg) buildbot runs the test cases periodically, so the main aim of this task is to give users a graphical view of the test results.

Apart from the above two tasks, the work on which I spent most of my time was fixing regressions recently introduced by the refactoring of run-tests.py. This includes
1) producing an error on running a failing test
2) producing a '!' mark after running a failing test
3) making skipped tests not produce an 'i' mark while retesting
4) fixing the number of tests run when '--retest' is enabled
5) checking the behaviour of a test on failure while testing
6) fixing the '--interactive' option error
 
There are several other milestones to cover, which mainly include
1) Designing the annotation format
2) Highlighting test sections with the annotation format
3) Declaring dependencies between sections

The above regression fixing was quite a warm-up exercise. It helped me understand how things work in "run-tests.py", which will definitely help me quickly cover the above-mentioned milestones.

Stay tune for more updates :)

by Anurag Goel (noreply@blogger.com) at June 22, 2014 10:10 PM

Vytautas Jančauskas
(TARDIS-SN student)

GSoC Mid-term Results and Achievements

By the mid-term we had succeeded in fulfilling most of the goals of the overall program. The core Monte Carlo routines are now implemented in C. They were significantly restructured to allow better readability and run around 35% faster as of late. There are still areas in want of improvement, but the general feel of the code is a lot better. The code can be found here and will be merged into the main fork some time soon. The new changes will hopefully let us expand the code with new physics easily. This effectively does the job described in the GSoC proposal; however, there are still areas where improvements need to be made. These include error handling, logging, nicer interfaces, and code and user documentation. As far as improving the performance is concerned, in my opinion we are close to reaching a dead end. The code is currently very basic and uses mostly atomic operations. If further improvement in program running time is wanted, smarter algorithms will need to be developed, although currently I do not have a clear idea of what those would have to be.

by Vytautas Jančauskas (noreply@blogger.com) at June 22, 2014 09:13 PM

Tarun Gaba
(PyDy student)

GSoC 14: Second Week!

{% include JB/setup %} [ <-Back to posts ](/gsoc14) Second week of GSoC '14 has ended. It has been a very hectic week. A good part of my time was spent in studying and tweaking the MGView code(whose fork we had decided to work on). The new _generic-visualizer_ has to be compatible with both PyDy as well as MG(MotionGenesis), so great care is being taken to maintain the backwards compatibility of the package, as well to add the support for PyDy visualization data. I have also worked on Python side, to modify the JSON data generated for the visualizations(from PyDy side). ###Accomplishments: There have been three major accomplishments this week: - **Studying MGView code:** This has to be the most time consuming part! I had to go over a module spanning multiple files to study the workflow, and how each methods are linked together. Some conversations with Adam Leeper(who wrote the original package) have been very useful in this regard. - **Modifying JSON generated from PyDy:** Certain modifications have been done on the JSON generated from the PyDy backend, so that they are compatible with MotionView. A major change is that the generated JSON data is split over two files, one `scene_desc.json` shall contain all the parameters required to create a static scene, and another file `simulation_data.json` which shall contain all the simulation data(in the form of 4x4 matrices). It would be very helpful in un-cluttering the data. The API changes in Python side are totally backward compatible, which means no previous code shall break once these changes are merged. - **Tweaking MGView code:** I have started working on MGView code, to provide the support for PyDy generated JSON data. This part will span the coming weeks too. ###Objectives: The objectives of the upcoming week are to be able to produce some visualizations on the frontend side. For that I will need to tweak into the MGView code to allow support for PyDy JSONs. ###Issues: The main issues I am facing is the inhomogenities in the output produced by PyDy and MG. Since the visualizer (MOtionView) must be able to visualize data generated from both of these packages, without any bias it has to be designed like so. Since It is already built to specifically visualize data generated from MotionGenesis software, There are certain parts of code which need to be totally rewritten, and it has to be done without breaking the workflow of the visualizer. So it can get somewhat messy to do so. I am working on to find a clean way to do so. The upcoming two weeks are somewhat crucial to the project timeline. I have to work really hard in the coming weeks to keep up with my deliverables before the midterm evaluations. I think I shall get back to coding now! :) [ <-Back to posts ](/gsoc14)

June 22, 2014 07:11 PM

Asish Panda
(SunPy student)

Progress mid term

After a month into the GSoC coding period, I have learnt many new things about Python and git, and also how an OSS community works. This has been a great experience for me, working with people who are so willing to help a beginner!
Currently I have created a PR to merge various files that have been quantified (converted to use Astropy units); a small illustration of what that means follows the list. The files include:

1) sunpy/image/coalignment.py

2) sunpy/image/tests/test_coalignment.py

3) sunpy/instr/rhessi.py

4) sunpy/physics/transforms/differential_rotation.py

5) sunpy/physics/transforms/tests/test_differential_rotation.py

6) sunpy/sun/sun.py

7) sunpy/sun/tests/test_sun.py

8) sunpy/util/tests/test_unit_conversion.py [will be removed soon]

9) sunpy/util/unit_conversion.py [will be removed soon]
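A small illustration (not the actual SunPy code) of what "quantifying" means in practice: values that used to be bare floats with implicit units become Astropy Quantities, so the units travel with the numbers.

    from astropy import units as u

    def angular_radius(distance):
        # Before quantification, distance was assumed to be a float in km;
        # now it carries its unit and the result is itself a Quantity.
        radius = 695508 * u.km   # solar radius, illustrative value
        return ((radius / distance).decompose() * u.rad).to(u.arcsec)

    print(angular_radius(1.496e8 * u.km))   # roughly 959 arcsec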

There were many roadblocks and mistakes that consumed many hours, but they were interesting. I messed up and pushed an error branch to master which could have deleted all my progress. I learnt the hard way that I should not mess with git without thinking!

Also, the spectra integration (which I have done; I will create a PR soon) was quite tough for me to understand. On the bright side, I learnt many new features of Python while going through that code!
My next tasks will be to integrate maps and WCS with Astropy.

PR link


by kaichogami at June 22, 2014 05:20 PM

Chinmay Joshi
(Mercurial student)

Midterm Update

The midterm evaluation period has approached and I am summarizing my GSoC journey. Until now, I have been working on various files in Mercurial, finding operations which touch the filesystem, adding those operations to vfs, and updating the related filesystem operations to be invoked via vfs.

During this time, I made several advancements in this process. I am currently working on mercurial/patch.py, mercurial/hg.py, mercurial/commands.py and mercurial/simplemerge.py. A few filesystem operations like lexists and unlinkpath are already merged (see the progress wiki for more details on the WindowsUTF8 plan) and some others I sent are under review. I intend to send more from my queue after refining them based on the feedback I receive on the currently sent patches. Fujiwara Katsunori has played a crucial role and helped me a lot by providing valuable feedback. Updating users should be not only functional but efficient. This is a large part of the project and still needs more operations to be added and replaced accordingly.

Regarding the very basic filesystem operations which do not rely on a specific file path, I recently had a discussion with Giovanni Gherdovich about adding file operations which do not rely on a base in the filesystem (e.g. os.getcwd, os.path.basename, os.path.join, etc.), and as per his suggestion we decided to try implementing them as classmethods so they can be accessed directly without vfs objects. I will be sending this series as an RFC to the Mercurial developer mailing list for feedback from the experts on this project. A rough sketch of the general vfs pattern is shown below.
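A rough sketch (not actual Mercurial code) of the kind of change this project is about: instead of touching the filesystem directly through os/os.path, callers go through a vfs object rooted at the repository, so that a UTF-8-aware vfs can later be swapped in on Windows.

    import os

    class vfs(object):
        def __init__(self, base):
            self.base = base

        def join(self, path):
            return os.path.join(self.base, path)

        def lexists(self, path):
            return os.path.lexists(self.join(path))

        def unlinkpath(self, path):
            os.unlink(self.join(path))

    # before:  os.path.lexists(os.path.join(repo.root, f))
    # after:   repo.vfs.lexists(f)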

For the part of the project about accessing the filesystem on Windows, I asked on the mailing list for clues. Following the discussion from the mailing list, the proposed utf8vfs should fetch filenames by primarily calling Python's API with unicode objects and should convert the results back to UTF-8 encoding. However, in Mercurial there are some operations which use the win32 API on Windows to implement a few crucial filesystem operations such as unlinking a file, removing a directory tree, etc. For these I will need to use the Win32 W API. Considering the fact that I am absolutely new to the win32 API, I laid my hands on it during the last week, alongside the work of adding operations to vfs and updating filesystem operation users to use vfs. I decided to allocate a proportion of my time to this even though it is scheduled at a later stage in the timeline, because these are the things which could consume a huge amount of my time later if I get stuck on something. I have also experimented with some operations on unicode objects by adding them to utf8vfs.

Stay tuned for more updates!

by Chinmay Joshi (noreply@blogger.com) at June 22, 2014 04:30 PM

Vighnesh Birodkar
(scikit-image student)

GSoC – Mid Term Status Update

So far

For the story so far see –  the Introduction, 2nd week update, Graph Data Structures Comparison .

After much debate on the mailing list with many scikit-image developers we finally decided to use the NetworkX Graph Class for our Region Adjacency Graph ( RAG ). It comes with a lot of well-tested functionality and would speed up the GSoC progress. It is also pure Python, and shares a lot of its dependencies with scikit-image.

Constructing the RAG

To construct a RAG, we need to iterate over the entire labeled image, looking for adjacent pixels with distinct labels. Initially I wrote a special-case Cython loop for 2D and 3D, much like this one. But the scikit-image developers suggested, and rightly so, a more general n-dimensional approach. I looked for a function which would iterate over an entire array and call a given function. As it turns out, generic_filter does exactly that. I have used it here with the callable defined here.

The footprint is created such that only the elements which are 2nd and 3rd along all axes are set, according to the connectivity. The rest are forced to zero. In the 2D case with connectivity 2 it would be the bottom-right elements (a[1,1], a[1,2], a[2,1], a[2,2]) which are set. This ensures that a pair of adjacent pixels is not processed again when the filter window moves ahead.

The footprint ensures that the first element in the array passed to the callable is the central element in the footprint. All other elements are adjacent to the central element and an edge is created between them and the central element.
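A minimal sketch of this construction (assuming scipy and networkx; not the actual pull-request code), with the footprint and callable written out for the 2D, connectivity-2 case:

    import numpy as np
    import networkx as nx
    from scipy import ndimage as ndi

    def _add_edges(values, graph):
        # values[0] is the central pixel's label (the footprint guarantees this);
        # the remaining entries are the chosen neighbours.
        center = int(values[0])
        for neighbour in values[1:]:
            if int(neighbour) != center:
                graph.add_edge(center, int(neighbour))
        return 0.0   # generic_filter needs a numeric return value; it is unused

    labels = np.array([[1, 1, 2],
                       [1, 2, 2],
                       [3, 3, 2]])

    # Only the bottom-right 2x2 block of the 3x3 window is active, so each
    # adjacent pair is visited exactly once as the window slides.
    footprint = np.array([[0, 0, 0],
                          [0, 1, 1],
                          [0, 1, 1]])

    rag = nx.Graph()
    ndi.generic_filter(labels, _add_edges, footprint=footprint,
                       mode='nearest', extra_arguments=(rag,))
    print(rag.edges())   # e.g. [(1, 2), (1, 3), (2, 3)]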

Pruning the RAG

I implemented a skeletal RAG algorithm following Juan’s suggestion. It takes the labeled image as input. This is typically an over-segmented image obtained by algorithms like SLIC or watershed. Each region in the labeled image is represented by a node in the graph. Nodes of adjacent regions are joined by an edge whose weight is the magnitude of the difference in mean color. Once the graph is constructed, nodes joined by edges with weights less than a given threshold are combined into one region.
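A simplified sketch of that merging step (the names here are illustrative, not the actual scikit-image API): keep only the edges below the threshold and collapse each connected component of the resulting subgraph into a single output region.

    import numpy as np
    import networkx as nx

    def threshold_cut(labels, rag, thresh):
        # Subgraph containing only the "cheap" edges.
        cheap = [(u, v) for u, v, d in rag.edges(data=True) if d['weight'] < thresh]
        merge_graph = nx.Graph()
        merge_graph.add_nodes_from(rag.nodes())
        merge_graph.add_edges_from(cheap)

        # Every connected component becomes one region in the output.
        label_map = {}
        for new_label, component in enumerate(nx.connected_components(merge_graph)):
            for old_label in component:
                label_map[old_label] = new_label

        out = np.empty_like(labels)
        for old_label, new_label in label_map.items():
            out[labels == old_label] = new_label
        return out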

You can view the Pull Request here.

Results

Below are the test results of executing the example code on two scikit-image sample images. The threshold value is different for each result and was found by trial and error. Typically, a higher threshold value gives fewer regions in the final image.

Coffee Image

Original Image

coffee

SLIC Segmentation

coffee_slic

Threshold Graph Cut

coffee_threshold_cut

 

Coins

Original Image

coins

SLIC Segmentation

coins_slic

Threshold Graph Cut

coins_threshold_cut

Conclusion

The mechanism for RAGs is in place now. The segmentation is pretty satisfactory, considering the simple logic. The major drawback, however, is that it’s not fully automatic. Over the next few weeks I will implement more sophisticated merging predicates, including N-cut.


by Vighnesh Birodkar at June 22, 2014 02:14 PM

June 21, 2014

Shailesh Ahuja
(Astropy student)

Analysing IRAF multispec format FITS files

Today I am going to analyze a couple of FITS files which have been stored in the IRAF multispec specification described here. I am going to focus on the Legendre and Chebyshev polynomial dispersion functions.
First, take a quick look at the FITS headers in the following files:
  1. Legendre dispersion file headers: http://pastebin.com/FqnSYwGe
  2. Chebyshev dispersion file headers: http://pastebin.com/NDKqNj6n

NAXIS defines the number of dimensions. In multispec format, there are always two dimensions. Multiple one-dimensional spectra are stored in this format. These headers have CTYPE1 and CTYPE2 equal to `MULTISPE`. This is necessary to indicate that the spectra are stored in multispec format. NAXIS1 tells us the size of the data (the length of the flux array) in each spectrum, and NAXIS2 tells us the number of such one-dimensional spectra. In both files, there are 51 spectra stored.

One of the most important header keywords is WAT2_XXX. This keyword stores the information needed to compute the dispersion at each point, for each spectrum. `specK` holds the information for the Kth spectrum. There are various numbers separated by spaces within each `specK`. These numbers describe the exact function to be used to compute the dispersion values. The following list explains these numbers in order:

  1. Aperture number: According to Wikipedia, the aperture number is directly or inversely proportional to the exposure time. This entry always holds an integer value. For both files, this number goes from 1 to 51. This value has no bearing on the calculation of the dispersion.
  2. Beam number: I am not sure what this means, but it is always an integer value. For the Legendre file, this decreases from 88 to 38 and for Chebyshev file this increases from 68 to 118. 
  3. Dispersion Type: This can be 0 (linear dispersion), 1 (log-linear dispersion) or 2 (non-linear dispersion). As both these files define non-linear polynomial functions, this is always 2.
  4. Dispersion start: This value indicates the dispersion at the first physical pixel. This value is not used for computation, however, this value can be used to verify whether the function is giving the correct output at the first pixel. Unfortunately, this value is the same for all 51 spectra in both the files, which implies that this value hasn't been correctly stored. The value matches the output returned by the 51st spectra dispersion function.
  5. Average dispersion delta: This value is equal to the mean of the difference between consecutive dispersion values. Again, this value is not used for computation, but can be used to verify the function output. Similar to the previous value, this has been stored incorrectly in both these files. It is only correct for the 51st spectra.
  6. Number of pixels: This value indicates the length of the flux array of this spectra. This value can be at most the value of NAXIS1. This value should be equal to PMAX - PMIN (defined later).
  7. Doppler factor (Z): Due to relative motion of the object and the observer, Doppler effect can alter the dispersion values. This factor can be used to compute the adjusted dispersion values, by using the formula below:

                                 Adjusted dispersion = Dispersion / (1 + Z)
     
  8. Aperture low: This value is for information only. Stores the lower limit of spatial axis used to compute this dispersion.
  9. Aperture high: Again, this value is for information only. It stores the upper limit of spatial axis used to compute this dispersion.

    From this point, the function descriptors start. There can be more than one function too. In that case these descriptors are repeated starting from weight. These descriptors determine the function output. The final dispersion is calculated as:

                                Final dispersion = Sum of all function outputs
     
  10. Weight: The weight of the function gives the multiplier for the calculated dispersion. Its use becomes more obvious in the formula below.
  11. Zero point offset: The value to be added to all the dispersion values. Combined with the weight, and the Doppler factor, the function output can be calculated as:

        Final function output = Weight * (Zero point offset + function output) / (1 + Z)

    In the files given, there is only one function. The Doppler factor and the zero point offset are zero, and the weight is one. So the final dispersion is equal to the function output.
     
  12. Function type code: Until this point, we know how to calculate the final dispersion if we know the function output. This value stores the type of the function that will be used to compute the output at any given pixel. There are six possibilities:
    1 => Chebyshev polynomial, 2 => Legendre polynomial, 3 => Cubic spline,
    4 => Linear spline, 5 => Pixel coordinate array, and 6 => Sampled coordinate array

    Starting from this point, the numbers may mean different things for different functions. I am explaining the descriptors for Legendre and Chebyshev.
     
  13. Order (O): The order of the Legendre or Chebyshev function.
  14. Minimum pixel value (Pmin): The lower limit of the range of the physical pixel coordinates.
  15. Maximum pixel value (Pmax): The upper limit of the range of the physical pixel coordinates. In combination with the lower limit, they determine the domain of the function. This domain should be mapped to [-1, 1].
  16. Coefficients: There are O coefficients that follow. These coefficients define the Legendre or the Chebyshev function. (A small sketch of the full evaluation follows this list.)
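A minimal sketch of how the descriptors above fit together for the Legendre and Chebyshev cases (illustrative code, not the actual reader implementation; the parameter names are mine):

    import numpy as np
    from numpy.polynomial import chebyshev, legendre

    def multispec_dispersion(coeffs, pmin, pmax, func_type,
                             weight=1.0, offset=0.0, z=0.0):
        pixels = np.arange(pmin, pmax + 1, dtype=float)
        # Map the physical pixel range [pmin, pmax] onto [-1, 1].
        middle = 0.5 * (pmax + pmin)
        half_range = 0.5 * (pmax - pmin)
        n = (pixels - middle) / half_range
        if func_type == 1:       # Chebyshev polynomial
            output = chebyshev.chebval(n, coeffs)
        elif func_type == 2:     # Legendre polynomial
            output = legendre.legval(n, coeffs)
        else:
            raise ValueError("only Chebyshev (1) and Legendre (2) are handled here")
        # Apply the weight, zero point offset and Doppler factor Z, as above.
        return weight * (offset + output) / (1.0 + z)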
And that's it. It's a bit tedious to understand, but the format enables so much information to be stored. The documentation is not very clear, and I hope this post helped you understand what these parameters stand for.

by Shailesh Ahuja (noreply@blogger.com) at June 21, 2014 09:27 AM

June 20, 2014

Richard Tsai
(SciPy/NumPy student)

GSoC2014: The First Month

The first month of my GSoC has passed and it is about time to make a summary.

The PRs

I’ve made 4 Pull-Requests for my GSoC project, 3 of which have been merged.

  • #3636 MAINT: cluster: some clean up Some cleanups. Mainly for docs.

  • #3683 ENH: cluster: rewrite and optimize vq in Cython The most challenging part. Discussed in this post already.

  • #3702 BUG: cluster: _vq unable to handle large features When I started to work on #3712, I noticed that the new version of _vq I wrote in the previous PR (#3683) may give wrong results when the values of the input features are very large. I thought it might be an overflow issue since there was a matrix multiplication in the new algorithm, but I was wrong: it was caused by an initialization issue. I used the C macro NPY_INFINITY to initialize the output distances vector. However, ndarray.fill would regard NPY_INFINITY as the “integral infinity” and fill the float (or double) vector with \(2 ^ {31} - 1\). When the actual minimum distance was greater than such an “infinity”, the result would be an uninitialized value. The correct way to initialize the output distances vector is outdists.fill(np.inf).

  • #3712 ENH: cluster: reimplement the update-step of K-means in Cython This PR is my recent work and hasn’t been merged yet.

    An iteration of K-means requires two steps, the assignment step and the update step. The assignment step, implemented as vq.vq, has been optimized in the previous PRs. But the update step is still time-consuming, especially for low-dimensional data. The current implementation of the update step in SciPy looks like:

    for j in range(nc):
        mbs = np.where(label == j)
        code[j] = np.mean(data[mbs], axis=0)
    

    However, np.where is slow. A much more efficient algorithm would be:

    for j in range(nobs):
        code[labels[j], :] += data[j, :]
        counts[labels[j]] += 1
    for j in range(nc):
        code[j] /= counts[j]
    

    But of course looping through the observation matrix in pure Python would be extremely slow. So I implemented it in Cython and achieved decent speedups, especially for low-dimensional data.

Achievement

First of all, I rewrote the underlying part of cluster.vq in Cython and it should be easier to maintain than the original C implementation.

Then, the performance. The new algorithms and optimization tricks work well. I built SciPy with ATLAS and tested vq.kmeans(obs, k) on an i5-3470 machine, and I got the following performance statistics.

dataset                         origin       #3683        #3712       speedup
30000 x 100, k = 10, float64    28636.2594   16950.3884   12687.425      2.26
30000 x 100, k = 10, float32    13575.0532    6483.4734    5272.2        2.57
10000 x 30, k = 10, float64      1923.3946    1146.7348     752.193      2.56
10000 x 30, k = 10, float32      1491.879      989.209      658.1064     2.27
100000 x 3, k = 4, float64       1663.143     1181.5912     529.0992     3.14
100000 x 3, k = 4, float32       1278.033      967.4396     431.0248     2.97
10000 x 1, k = 3, float64         117.6908      94.0622      41.6186     2.83
10000 x 1, k = 3, float32         110.2874      98.6826      41.9904     2.63
1000 x 3, k = 4, float64           29.23        28.697        9.8542     2.97
1000 x 3, k = 4, float32           27.5242      29.2488      12.4894     2.20

What’s Next

There’re still about two months to go. First, I will continue to work on #3712. This PR still has some issues; there are some build warnings which seem to be caused by Cython fused types on some machines. Then, I will start to work on cluster.hierarchy. The hierarchy module will be more complicated and challenging than vq and it may take me about a month to rewrite it in Cython. After that I’ll try to implement the SLINK and CLINK algorithms.

At last, I want to thank my mentors Ralf, Charles and David, and of course everyone else in the SciPy community, who have helped me a lot. I’m very glad to be working with you!

by Richard at June 20, 2014 05:07 PM

June 19, 2014

Milan Oberkirch
(Core Python student)

Multitasking

I spent the last days improving my smtpd patch and reading some (proposed) standards which have their very own way of sounding boring in almost every sentence (they have to be specific so don’t blame the authors). For example the following sentence can be found in nearly every RFC published since 1997:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

After reading all those RFCs I feel like I SHOULD get back coding right now.

The most relevant RFCs so far:

  • 6531: SMTPUTF8
    already implemented in smtpd.py and in progress in smtplib. I will come back to this after dealing with imaplib and #15014 is closed.
  • 6532: Internationalized Headers
    the one I started with in the first week; it affects the email package. I’ll come back to that when everything else is done (I discussed some of the reasons for this in my last post).

and new this week:

  • 6855: UTF-8 support for IMAP
    This concerns the imaplib. I will get started with that today.
  • 6856: same thing for POP3
    concerns poplib (obviously). I already proposed a patch for this one but left out the new ‘LANG’ command (used to change the language of messages from the server) for now. It seemed quite irrelevant, but I MAY add it sometime in the future.
  • 6857: Post-Delivery Message Downgrading for Internationalized Email Messages
    I MAY look into this after I implemented the things above.

So I have many new playgrounds now. Working on many things in parallel is also an attempt to make the whole review and patch-improvement process more efficient.


by milan at June 19, 2014 06:30 PM

Jaspreet Singh
(pgmpy student)

The One With PomdpX Reader Class

Hi folks!
There were a lot of decisions to be made before I could start hacking for pgmpy. First of all, I had to decide which of the file formats I would start with. I decided to choose the PomdpX file format since it is based on XML, similar to the XMLBIF format which had already been completed by Ankur Ankan. This eased the task. The Reader class was to be implemented first. The aim was to parse a PomdpX file and convert all the information it contains about the model into structured data. I read the etree tutorial and learned about the functions which would be used to achieve the task. The ElementTree API provides different kinds of functions to get this done. I'll provide an example. Suppose the XML looks something like this:

<pomdpx version="0.1" id="rockSample"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="pomdpx.xsd">
<Description> · · · </Description>
<Discount> · · · </Discount>
<Variable> · · · </Variable>
<InitialStateBelief> · · · </InitialStateBelief>
<StateTransitionFunction> · · · </StateTransitionFunction>
<ObsFunction> · · · </ObsFunction>
<RewardFunction> · · · </RewardFunction>
</pomdpx>

If the above XML is contained in a string, then etree.fromstring(string) will return the root element of the tree, i.e. pomdpx in this case. Now functions like find() and findall() can be used to obtain the elements contained within the root. Attributes are obtained using the get() method and the value inside a tag using the text attribute. All of these combined helped me to form structured data out of the file format.
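A short example of those calls on a trimmed-down version of the snippet above (using the standard library's ElementTree; the tag contents are made up for illustration):

    from xml.etree import ElementTree as etree

    string = '''<pomdpx version="0.1" id="rockSample">
        <Description>a toy model</Description>
        <Discount>0.95</Discount>
    </pomdpx>'''

    root = etree.fromstring(string)
    print(root.tag)                    # 'pomdpx'
    print(root.get('id'))              # 'rockSample'  (attribute access)
    print(root.find('Discount').text)  # '0.95'        (text inside a tag)
    print([el.tag for el in root.findall('Description')])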

The most difficult part was the parameter tag of type Decision Diagram, where I had to parse a tree-like format. To keep the Reader class independent, I could not use any other class object inside it. I finally came up with a recursive parsing solution, using the idea from the TreeCPD class present in the Factors module of the pgmpy codebase.

I then wrote tests for each of the functions so that every possible mistake could be handled and no bugs remained in the code. Some docstrings were also added. Finally the Reader class was complete, along with the tests, and I could send a pull request.

by jaspreet singh (noreply@blogger.com) at June 19, 2014 07:05 AM

June 18, 2014

Simon Liedtke
(Astropy student)

New service for Astroquery coming: xMatch

xMatch is a service that makes it possible to cross-identify (thus the name) sources between very large catalogues. One of these catalogues may also be a user-uploaded list of positions. For astroquery, I added a new class, XMatch, that has the methods get_available_tables, is_table_available, and query. get_available_tables returns a listing of all tables that can be queried by this service. The resulting format can be either JSON or plain text. In the sample ipython session below one can see that the number of returned lines is 13259 if the format is plain text, which means that this many tables can be queried by the xMatch service. The following command prints the first 5 tables. The utility method is_table_available simply checks whether the passed string is one of those that get_available_tables returns in the plain text format. The most interesting method, though, is query. It receives the following parameters: cat1, cat2, max_distance, format, colRA1, colDec1, colRA2, colDec2. cat1 and cat2 identify the first and second table to be queried, respectively. max_distance denotes the maximum distance (in arcsec) that will be used to look for counterparts. The desired format can be either VOTable, JSON, CSV, or TSV. The last four parameters indicate the names of the position columns of the respective tables.

The ipython session below shows the xmatch package "in action":

In [1]: from astroquery.xmatch import XMatch
WARNING: ConfigurationDefaultMissingWarning: Could not determine version of package astroquery Cannot install default profile. If you are importing from source, this is expected. [astroquery._astropy_init]
WARNING: AstropyDeprecationWarning: ConfigurationItem has been deprecated in astropy 0.4. Use ConfigItem objects as members of ConfigNamespace subclasses instead.  See ConfigNamespace for an example. [astropy.config.configuration]

In [2]: xm = XMatch()

In [3]: len(xm.get_available_tables('txt').splitlines())
Out[3]: 13259

In [4]: xm.get_available_tables('txt').splitlines()[:5]
Out[4]:
[u'J/MNRAS/276/33/tables',
 u'J/A+A/552/A8/tablea1',
 u'J/AJ/80/972/table',
 u'J/A+AS/99/291/table1',
 u'J/ApJS/150/1/table5a']

In [5]: xm.is_table_available(u'J/AJ/80/972/table')
Out[5]: True

In [6]: from StringIO import StringIO

In [7]: csv_in = StringIO('''ra,dec
267.22029,-20.35869
274.83971,-25.42714
275.92229,-30.36572
283.26621,-8.70756
306.01575,33.86756
322.493,12.16703''')

In [8]: csv_out = xm.query(csv_in, 'vizier:II/246/out', 5, 'csv', 'ra', 'dec')

In [9]: print csv_out
angDist,ra,dec,2MASS,RAJ2000,DEJ2000,errHalfMaj,errHalfMin,errPosAng,Jmag,Hmag,Kmag,e_Jmag,e_Hmag,e_Kmag,Qfl,Rfl,X,MeasureJD
1.352044,267.22029,-20.35869,17485281-2021323,267.220049,-20.358990,0.150,0.110,16,9.931,8.822,7.550,0.239,0.241,,EEU,226,2,2450950.8609
1.578188,267.22029,-20.35869,17485288-2021328,267.220348,-20.359125,0.140,0.120,158,8.868,7.784,8.530,,,0.128,UUB,662,2,2450950.8609
3.699368,267.22029,-20.35869,17485264-2021294,267.219344,-20.358171,0.130,0.120,33,9.562,8.525,9.445,,,0.119,UUB,662,2,2450950.8609
3.822922,267.22029,-20.35869,17485299-2021279,267.220825,-20.357754,0.080,0.070,7,10.756,9.725,9.287,0.139,0.127,0.103,EBA,222,2,2450950.8609
4.576677,267.22029,-20.35869,17485255-2021326,267.218994,-20.359064,0.140,0.130,87,10.431,9.348,7.926,0.159,0.316,,CEU,226,2,2450950.8609
0.219609,274.83971,-25.42714,18192154-2525377,274.839773,-25.427162,0.060,0.060,90,9.368,8.431,7.919,0.024,0.044,0.036,AAA,211,0,2451407.5033
1.633225,275.92229,-30.36572,18234133-3021582,275.922233,-30.366171,0.080,0.080,55,12.947,12.334,12.145,0.156,0.221,0.127,EEE,222,2,2451021.7212
0.536998,283.26621,-8.70756,18530390-0842276,283.266284,-8.707690,0.060,0.060,45,12.182,11.534,11.380,0.057,0.071,0.063,AAA,222,0,2451301.7945
1.178542,306.01575,33.86756,20240382+3352021,306.015944,+33.867275,0.060,0.060,90,13.575,12.684,12.321,0.025,0.027,0.026,AAA,222,0,2450948.9708
0.853178,322.493,12.16703,21295836+1210007,322.493171,+12.166862,0.100,0.080,179,9.798,9.339,9.176,0.109,0.150,0.100,EEA,222,0,2451080.6935
4.503950,322.493,12.16703,21295861+1210023,322.494242,+12.167332,0.100,0.080,1,10.057,9.720,9.483,0.077,0.136,0.088,EEE,222,0,2451080.6935

Note that the status of this package is that the pull request is still open and things will change because I haven't reacted to all the comments yet.

What's next?

I started working on the next service, SkyView. There is not yet enough to open a new PR for it, but I have some concrete plans for how to continue. Stay tuned!

by Simon Liedtke at June 18, 2014 10:00 PM

Navin
(pgmpy student)

Three weeks into Coding - Progress / Roadblocks

Well, this was supposed to be written when I was two weeks into coding, but it went off my mind and my mentor has kindly reminded me to write it. It's about to be four weeks now, but I will just talk about the experiences of my first three weeks.

So when I had written the last blog post, I hadn't even begun the coding for GSoC and hence was all excited to start coding. And now, it's been three weeks. I can't use the clichéd phrase "It's been a long journey", but it has been a nice journey. Let me first describe the things I have done so far before going into more philosophical discussions.

Beginning with Week 1. The task for me for the week was to implement the framework of the Markov network. I had to define all those little functions for adding nodes, removing edges, setting observations, accessing potentials, etc. I had to take care to ensure that we don't pass integers as node names, that we don't access an observation that hasn't been set, and so on. Well, to be fair to the previous developers of pgmpy, I copied quite a few things from the code for BayesianModels. So, thanks, previous developers of pgmpy!

One mistake which I made in this process was writing my own class for representing factors, when it already existed in the one module which I had overlooked while going through the code. So Ankur told me about that, and hence I had to change my code to work with the existing Factor class.

The next week was supposed to be spent building the independence relations for Markov networks. However, I felt like working on triangulation and junction trees and so on, and hence started working on that particular module. The coding part for this was very interesting as it involved implementing algorithms. So I started working on it and implemented a lot of things, starting from checking whether a graph is triangulated, to triangulation heuristics, to implementing the JunctionTree class and writing functions to create the junction tree from the triangulated graph.

So there is a long series of events which happened when I submitted the PR. First, I had not commented the code or added docstrings because I had expected a lot of comments about changing the interface, and hence I thought it would be best to postpone the commenting part until after the interface and code were fixed. However, I received a (rather angry) mail from Abinash asking me why I hadn't commented the code. I gave him my reasons. Also, he told me that many of the functions which I had written were already in the standard networkx package. Ah, and I was heart-broken. All the effort spent in writing those maximum spanning trees, finding the cliques and checking if the graph was triangulated had just gone to waste. I might accept my mistake in not using the maximum spanning tree code of networkx (but then it was 15 lines and I would have written that code at least 3 times before), but who would imagine networkx hiding such specific functionalities (finding cliques in chordal graphs, finding the tree width of chordal graphs, etc.) inside its deep seas of code, disguised as functions such that you wouldn't even find them if you looked for methods of a graph object (see the small example below)? Anyway, all that code had to be replaced by these library functions for maintainability.
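For reference, these are the kinds of networkx helpers I mean (a tiny example on a small chordal graph):

    import networkx as nx

    # A chordal graph: every cycle of length four or more has a chord.
    G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)])

    print(nx.is_chordal(G))                    # True
    print(list(nx.chordal_graph_cliques(G)))   # maximal cliques of the chordal graph
    print(nx.chordal_graph_treewidth(G))       # tree width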

Once I was done with this, I started commenting the code, adding docstrings and writing test cases. When I did that, I had a big realization: "Copying code was easy (the Bayesian model framework to the Markov model), but documentation and testing won't be easy", as I will actually have to do all of that. (Well, just because I am mentioning copying the code, don't start assuming that I copied all the code. I only copied small functions.)

This took quite some part of week 3. Oh man, documentation and testing have two really bad features: 1. they take a lot of time, and 2. they are slightly boring to do. But then, as Ankur said, "I will realize the importance of testing and commenting".

Anyway, once I was done with that I started working on the message passing algorithm, and setting up the entire framework and I am still on it. The next task after this will be to write up various inference algorithms using Junction Trees (and yes, documenting and testing too :P ). Hope I have a good time ahead, as I have been having till now.

Signing off for now.
Navin


by Navin Chandak (noreply@blogger.com) at June 18, 2014 08:14 PM

June 16, 2014

Frank Cheng
(Statsmodels student)

How to test MICE?

MICE is a method that imputes missing data using simulation from an inferred posterior distribution. See the Appendix of http://www.statcan.gc.ca/ads-annonces/12-001-x/5857-eng.pdf to see the exact way we simulated data.

With this in mind, MICE is a random method that will yield differing results every time it is run. Therefore, a typical testing procedure where we try to get exact agreement between our implementation and an existing, mature package is not possible. However, we should get roughly the same results for any given instance, and more importantly, the distribution of results should be very similar. So what we will aim to do is really a comparison of simulations between our implementation and those of R and Stata. Right now, predictive mean matching is directly comparable between all implementations, but there are some details in our other asymptotically Gaussian approach that are different in the other packages. I will write a follow up post shortly describing the different approaches taken by different packages. For now, the takeaway for testing is that we will perform simulation studies to assess the similarity between several MICE implementations.

by Frank Cheng (noreply@blogger.com) at June 16, 2014 06:04 AM

Tyler Wade
(PyPy student)

Beginning UTF-8 Implementation

Over the last week and a half, I've begun working on implementing UTF-8 based unicodes in PyPy.

As I've stated in a previous blog post, my goal is to replace the use of the RPython unicode type in PyPy with a PyPy-specific UTF-8 implementation.  The motivation for doing this is to simplify PyPy's unicode support by providing a single implementation and application-level interface irrespective of the platform.  Presently, the size of the system's wchar_t type determines which representation is used, from both perspectives.  The new implementation will represent the unicode strings internally as an RPython string of bytes, but the user will interact with the application-level unicode string on a code point basis.  In other words, the user will have a "UTF-32" interface.

Since Python supports converting unicode objects into UTF-8 byte strings and vice-versa, RPython helper functions already exist for a lot of the handling of UTF-8. I am largely reusing these with minimal modifications.  As a result, most of my work is a matter of fixing the various places in PyPy where the built-in/RPython unicode type is expected and replacing unicode literals.  That said, I plan on leaving unicode literals in (the interpreter-level) tests in place. Replacing them serves little useful purpose and would make the tests more annoying to read and to write.

For the time being I'm not paying very much attention to optimization (with the exception of special cases for ASCII-only strings). Things will be slow for a little while; random access before any optimizations are added will be O(n). However, the goal is correctness first, and this won't be merged if it creates any serious performance regressions. That said, I already have some ideas about optimizations and hopefully I'll be far enough along in my work to be able to write more about those next time.
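A toy Python 3 sketch (not PyPy code) of why random access by code point is O(n) in a UTF-8 byte string: to find the i-th code point you have to walk the bytes from the start, skipping over multi-byte sequences.

    def codepoint_at(utf8_bytes, index):
        pos = 0
        for _ in range(index):
            first = utf8_bytes[pos]
            if first < 0x80:      # 1-byte sequence (ASCII)
                pos += 1
            elif first < 0xE0:    # 2-byte sequence
                pos += 2
            elif first < 0xF0:    # 3-byte sequence
                pos += 3
            else:                 # 4-byte sequence
                pos += 4
        end = pos + 1
        while end < len(utf8_bytes) and (utf8_bytes[end] & 0xC0) == 0x80:
            end += 1              # skip continuation bytes of this code point
        return utf8_bytes[pos:end].decode('utf-8')

    s = 'naïve ☃'.encode('utf-8')
    print(codepoint_at(s, 2))   # 'ï'
    print(codepoint_at(s, 6))   # '☃'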

by Tyler Wade (noreply@blogger.com) at June 16, 2014 12:45 AM

June 14, 2014

Akhil Nair
(Kivy student)

Packaging- The Debian Way

In the earlier post I had mentioned how I was stuck with PyInstaller. Despite several efforts the roadblock remained, so earlier this month I was at a dead end with nowhere to turn.

After some soul searching and the inspirational season finale of Silicon Valley, I decided to try a different approach. And I must say, this looks promising.

What Pyinstaller does is it converts a program into an executable. We have to take it the rest of the way.
So I thought why not try to directly package using some basic debian build tools.

And I can say that I was fairly successful.

So this method has two parts. The setup creation and the package creation.

A combination of Part1 and Part2.
There are some steps in between but nothing big.

Part1 deals with making a Kivy Application an installable program that can be run by firing a command on the terminal.
This step is important because this functionality is a fundamental part of any debian package.

These are the steps in part one:

- Create a setup.py file outside the application directory and enclose this combination in another directory which we may refer to as root.
The setup.py file will contain information about the author, static files, the package, etc. (A minimal example is sketched after these steps.)

- We must also make the application directory a package which can be read as a module. So we create an __init__.py file (it doesn't matter if it's blank) so that Python treats the directory as a module.

- Create a MANIFEST file which specifies all the other (non-Python) files of the application, such as sounds, images, the readme, etc.

- Run the command "python setup.py sdist". This will create a dist directory which will contain a tar file. You may check whether the application can be installed by running the "python setup.py install" command. If the above part works well then the second part should work without hiccups.
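A minimal setup.py along the lines of the steps above (the names and paths are illustrative, not from the actual project):

    from distutils.core import setup

    setup(
        name='myapp',
        version='0.1',
        author='Author Name',
        packages=['myapp'],          # the application directory with __init__.py
        scripts=['bin/myapp'],       # the command fired from the terminal
        package_data={'myapp': ['*.kv', 'images/*.png', 'sounds/*.wav']},
    )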

The second part deals with the creation of deb package using the obtained tarball.



Roadblock:
                Static files are not getting included in the tarball. Following the MANIFEST instructions in the Python distutils tutorial only results in the creation of an empty directory named "recursive-include".


If some of the problems can be ironed out, this can work wonderfully.
Also the assumption is that the first part would be same for rpm conversion, so it can be possible to have a single target code for both deb and rpm.


Final thoughts:
                           Since I have the attention span of a sparrow and this post is getting long, I would like to elaborate on the second part in the next post. Those of you who couldn't understand a thing can go to the link I have provided above; what I am trying to explain is pretty much the same thing, only they explain it better.

Feedback and advice will be appreciated, and I will try to send chocolates to anyone who can help me out with this (just kidding).
 



by AkhilNair (noreply@blogger.com) at June 14, 2014 05:47 AM

June 13, 2014

Wenzhu Man
(PyPy student)

two end nursery GC

So things have been kind of complicated in the past four weeks.
And my original project changed to a new one, as someone in PyPy had been working on the original project (adding pinning to the GC) for months, but none of my mentors knew until two weeks had passed.

So my new project is: a two-end nursery GC.
As mentioned in a previous post, the current incremental minimark GC in PyPy has two generations: the nursery generation and the old generation. There is also an external_malloc() to raw-allocate objects (these objects are outside of GC control).

The current nursery allocation logic is clear: there are three pointers, nursery-free, nursery-top and nursery-real-top. Between nursery-free and nursery-top there is a piece of zeroed memory (the grey area), and between nursery-top and nursery-real-top there is non-zeroed memory containing random garbage (the blue area).


The nursery is cleaned up step by step: when nursery-free reaches nursery-top, nursery-top moves forward and the memory is zeroed for one more step (the step size is defined by nursery-cleanup).

From this mechanism we can see that every object is allocated in memory full of zeros. This is very useful for some objects. Take an object that contains a field which is a GC pointer: it needs to be zero-initialized, because otherwise, if the program doesn't quickly fill the pointer, garbage is left in this field, and at the next minor or major collection, trying to follow this garbage-valued pointer would crash.
However, not all objects need such zero-allocation. For any object not containing any GC pointers, the zeroing is not useful from the point of view of the GC. The point is that we would save part of the time needed to zero out the nursery if we could choose to allocate zeroed or non-zeroed objects.

So the idea of the project is to allocate objects that don't contain GC pointers from nursery-real-top, growing towards nursery-top, so that the nursery can allocate objects from both ends.

This will save time: instead of two writes (zeroing and allocation) for every object, there will be just one write for some of the objects.

The current allocation functions are malloc_fixedsize_clear() and malloc_varsize_clear(). I added two new pointers, nursery-second-part-free (which grows from nursery-real-top) and nursery-second-part-end (which points to nursery-top), and implemented malloc_fixedsize() as a third allocation function in the GC to allocate objects in the opposite direction, from the other end of the nursery. A toy illustration of the idea is sketched below.
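A toy, pure-Python illustration of the two-end idea (not PyPy/RPython code; overflow handling is omitted): zeroed allocations bump a pointer upwards at the low end of the nursery, while non-zeroed allocations bump a second pointer downwards from the high end, skipping the zeroing write entirely.

    class TwoEndNursery:
        def __init__(self, size, cleanup_step=64):
            self.memory = bytearray(b'\xff' * size)   # pretend "random garbage"
            self.free = 0              # grows upwards, into zeroed memory
            self.top = 0               # end of the zeroed region so far
            self.second_free = size    # grows downwards, non-zeroed
            self.cleanup_step = cleanup_step

        def malloc_clear(self, nbytes):
            # Zeroed allocation: zero more of the nursery if needed, step by step.
            while self.free + nbytes > self.top:
                new_top = min(self.top + self.cleanup_step, self.second_free)
                self.memory[self.top:new_top] = b'\x00' * (new_top - self.top)
                self.top = new_top
            addr = self.free
            self.free += nbytes
            return addr

        def malloc(self, nbytes):
            # Non-zeroed allocation from the other end: just a pointer bump.
            self.second_free -= nbytes
            return self.second_free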

 The next step will be to modify the GCtransformer to replace malloc_fixedsize_clear() with malloc_fixedsize() where appropriate.







by Wenzhu Man (noreply@blogger.com) at June 13, 2014 05:47 PM

Manoj Kumar
(scikit-learn student)

GSoC update – Fixing precompute and LogReg CV

Hi, the first part of the last few days has been the most productive, and the last part, well, not so much.

1. Fixing precompute for ElasticNetCV
The function argument precompute=”auto” was being unused in ElasticNetCV, as mentioned in my previous post. Setting precompute to ”auto” uses a Gram variant of the input matrix, which according to the documentation is np.dot(X.T, X). This theoretically helps the descent algorithm converge faster (however, at the time of writing, I do not know exactly how). Practically, however, and after testing using the line profiler, it seems to be a bit slower, since the computation of the Gram matrix takes quite a bit of time. So with the advice of ogrisel, I split it across three Pull Requests; all three are essentially easy fixes (a small illustration of the precompute option follows the list).

1. https://github.com/scikit-learn/scikit-learn/pull/3247 – This ensures that the Gram variant is used if precompute is set to True, or to auto when n_samples > n_features
2. https://github.com/scikit-learn/scikit-learn/pull/3248 – Remove precompute from Multi Task models since it is being unused,
3. https://github.com/scikit-learn/scikit-learn/pull/3249 – This is a WIP, and changes default precompute from auto to False
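A small illustration (assuming scikit-learn's ElasticNet API) of what the precompute option controls; with precompute=True the estimator builds and reuses the Gram matrix np.dot(X.T, X) internally:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X = rng.randn(200, 10)
    y = rng.randn(200)

    # precompute=True asks the solver to work from the Gram matrix X.T @ X
    model = ElasticNet(alpha=0.1, precompute=True).fit(X, y)
    print(model.coef_)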

2. Threading backend for Linear Models.
I have successfully changed the backend from multiprocessing to threading after releasing the GIL for all four variants. After a final round of review it can be merged:
a] Simple coordinate descent
b] Sparse coordinate descent
c] Gram variant
d] MultiTask variant
There is a huge memory gain and speed is almost the same (if not slightly higher by this Pull Request). https://github.com/scikit-learn/scikit-learn/pull/3102

3. Logistic Regression CV
Reading someone else’s magically vectorised NumPy code isn’t an easy task and I somehow crawled my way through it (which explains the more productive first part).

I fixed a bug in the code that computes the Hessian when the intercept is true. I've also fixed sparse matrix support, added multiple tests, and confirmed that the newton-cg and lbfgs solvers give the exact same result. The liblinear result differs slightly due to the penalisation of the intercept.

However, benchmarking gives ambiguous results. On standard datasets such as the newsgroups and digits data, the liblinear solver is almost always the fastest. However, on datasets generated with make_classification, lbfgs seems to be the faster solver.

Right now, my job is just to wait for comments from Alex and Olivier and make the necessary changes. I shall come up with a more detailed description on Log Reg CV next week.


by Manoj Kumar at June 13, 2014 08:05 AM

June 12, 2014

Milan Oberkirch
(Core Python student)

First things first…

After working on the email package for two weeks my mentors and I decided to work on the smtpd and smtplib modules (and RFC 6531) first. This makes sense for several reasons, one of them is task scheduling: the SMTP related modules are much smaller and therefore dealing with them first ensures that as many tasks as possible are done as early as possible (with high probability).

I moved the old development repo to https://bitbucket.org/zvyn/cpython-email and started a new one at https://bitbucket.org/zvyn/cpython for that. The patch for smtpd is ready for review at bugs.python.org. The smtplib part is also nearly done, I’m waiting for #15014 (refactoring of the authentication logic) to be resolved to finish that. Another thing that needs to get merged is support for Internationalized Domain Names (IDN). I will have to do some research on how to handle those in a SMTPUTF8 environment (see #20083, #11783).

After having finished the SMTP part I will research whether there are standardised ways to teach IMAP (the imaplib module in Python) Unicode. RFC 6857 (Section 1.1) does not sound very promising, so I will most likely return to the email package at the end of next week.

That’s it for now. I’m looking forward to blog more frequently in the future!


by milan at June 12, 2014 08:38 PM

June 05, 2014

Janani
(SciPy/NumPy student)

2 weeks over! That was fast..
And I have sent my pull request finishing the first of 4 sub-parts of my project.
The link to the pull request is https://github.com/scipy/scipy/pull/3692.
Here the function to implement spherical harmonics has been rewritten in Cython. It uses the lpmv Fortran subroutine (with the help of the existing wrapper function pmv_wrap) to calculate the associated Legendre function.
Initially the function to calculate spherical harmonics was written in Python and called the lpmn function to calculate the associated Legendre function, which in turn uses the lpmn Fortran routine. That routine iteratively evaluates the associated Legendre function for all orders and degrees m and n, while the lpmv routine avoids the iteration by using recursion.
The use of a different routine and rewriting the code in Cython has given a huge leap in the speed of the function, as shown in a simple benchmark performed by Pauli:
Before:
 In [4]: %timeit sph_harm(50, 50, 0.5, 0.9)
 10000 loops, best of 3: 81.3 µs per loop
 In [5]: x = np.linspace(0, 1, 2000)
 In [6]: %timeit sph_harm(50, 50, 0.5, x)
 10 loops, best of 3: 127 ms per loop

After:
 In [2]: %timeit sph_harm(50, 50, 0.5, 0.9)
 100000 loops, best of 3: 4.52 µs per loop
 In [3]: x = np.linspace(0, 1, 2000)
 In [4]: %timeit sph_harm(50, 50, 0.5, x)
 1000 loops, best of 3: 1.87 ms per loop

Now the next part of the project is to implement the ellipsoidal harmonic functions. It is kind of challenging, as they are complex and they haven't been implemented much. My next 2-3 weeks are dedicated to the implementation of both kinds of ellipsoidal harmonic functions, thus meeting my mid-sem deadline.

by janani padmanabhan (noreply@blogger.com) at June 05, 2014 04:52 PM

June 04, 2014

Mabry Cervin
(Astropy student)

Weeks One and Two Summary and Plans for the Future

So far astropy's Quantity class and the uncertainties package have been playing quite well together. I have been implementing the standard math functions (__add__, __sub__, etc.) so that the value, unit, and uncertainty fields are properly updated on each operation. I was prepared to have to deal with precision errors when testing automatically propagated uncertainties against hand-propagated ones, but so far I have not run into any problems using Python's math module for the calculations.

At my previous meeting with one of my mentors it was brought up that having uquantity inherit from Quantity and uncertainties should become a priority. Inheriting from both of those will give uquantity all of the machinery to be used as an ndarray (from the numpy package) and allow astropy users to use uquantity just as they would Quantity. Actually achieving this has been teaching me a ton about portions of Python that I have not encountered yet.

Ndarray is an impressively flexible class: it allows instantiation of an ndarray explicitly, via a different ndarray subclass, and through operations like slicing. Actually implementing this flexibility, however, requires declaring the class differently than I have always done it.

Typically when a class is defined, initialization code is placed in the __init__ method. The issue with this in ndarray is that __init__ is not actually run in all supported cases of instantiation. Instead it uses a combination of __new__ and __array_finalize__ for initialization code. Luckily astropy has multiple subclasses of Quantity for me to look at. Using those as templates, combined with the official ndarray documentation, should make inheriting from Quantity not too difficult.
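A minimal sketch of that numpy subclassing pattern (the class and attribute names here are illustrative, not the actual uquantity code):

    import numpy as np

    class UQuantity(np.ndarray):
        def __new__(cls, value, uncertainty=0.0):
            # Explicit construction: view the input data as our subclass
            # and attach the extra attribute.
            obj = np.asarray(value).view(cls)
            obj.uncertainty = uncertainty
            return obj

        def __array_finalize__(self, obj):
            # Called for every way an instance comes into existence:
            # explicit construction, view casting, and new-from-template (slicing).
            if obj is None:
                return
            self.uncertainty = getattr(obj, 'uncertainty', 0.0)

    q = UQuantity([1.0, 2.0, 3.0], uncertainty=0.1)
    print(q[1:], q[1:].uncertainty)   # the slice keeps the attribute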

by Epitrochoid (noreply@blogger.com) at June 04, 2014 03:30 PM

June 03, 2014

Simon Liedtke
(Astropy student)

Debugging BeautifulSoup Pull Request

I started working on a pull request which replaces the XML library lxml with beautifulsoup to have a pure-Python solution. The problem is that some of the remote tests fail, thus the pull request cannot be merged yet.

When I run py.test --remote-data astroquery/eso/tests on the master branch, all tests pass fine. But when I call the same command on my branch which uses beautifulsoup instead of lxml, I get 3 failing tests, each of them raising the exception InconsistentTableError. Time to find out more and debug! Simplified, the stacktraces look like this:

  1. test_Sgr_Astar -> EsoClass.query_instrument -> raise InconsistentTableError
  2. test_empty_return -> EsoClass.query_survey -> astropy.io.ascii.read -> astropy.io.ui._guess -> raise InconsistentTableError
  3. test_SgrAstar_remotevslocal -> EsoClass.query_instrument -> raise InconsistentTableError

The exception InconsistentTableError is a subclass of ValueError and is defined in astropy.io.ascii.core. It has the following docstring:

Indicates that an input table is inconsistent in some way.

The default behavior of BaseReader is to throw an instance of this class if a data row doesn't match the header.

So the tests fail because some table is created with inconsistent rows. I presume this is because astropy.io.ascii_read receives the web search form instead of the generated CSV-formatted string for some reason. At least this was the case when I had trouble before. Let's check it. The testing framework py.test supports debugging failed tests easily via the --pdb option, so I run the tests now like this: py.test --remote-data --pdb astroquery/eso/tests. This will launch the Python debugger pdb.

> /home/derdon/repos/python/3rdparty/astroquery/astroquery/eso/core.py(333)query_instrument()
-> raise ex
(Pdb) ''.join(content.splitlines()[:3])
'<html><HEAD><TITLE>ESO Science Archive - MIDI Data Products</TITLE>'

Yup, looks like content contains the web form instead of a CSV-formatted string as it should (content is the value that is passed to Table.read and thus passed on to astropy.io.ascii_read). I think the failure is in the method _activate_form, so I set a breakpoint via import pdb; pdb.set_trace() at the end of the method body, directly before returning. The code contains the statement if form_elem.name in inputs: where form_elem is created via the loop for form_elem in form.find_all(['input', 'select', 'textarea']):. The following pdb checks show that this if-statement can never evaluate to True:

(Pdb) inputs
{'wdbo': 'csv/download', 'max_rows_returned': 50, 'target': 'Sgr A*'}
(Pdb) [form_elem.name for form_elem in form.find_all(['input', 'select', 'textarea'])]
[u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'select', u'select', u'input', u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'select', u'input', u'select', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'input', u'select', u'select', u'input', u'input']

I git checkout now back to master and use pdb there to see how it should be working.

(Pdb) inputs
{'wdbo': 'csv/download', 'max_rows_returned': 50, 'target': 'Sgr A*'}
(Pdb) [form_input.name for form_input in form.inputs]
[None, None, 'wdbo', 'max_rows_returned', None, None, 'target', 'resolver', 'coord_sys', 'coord1', 'coord2', 'box', 'format', 'tab_wdb_input_file', 'wdb_input_file', 'tab_night', 'night', 'stime', 'starttime', 'etime', 'endtime', 'tab_prog_id', 'prog_id', 'tab_prog_type', 'prog_type', 'tab_obs_mode', 'obs_mode', 'tab_pi_coi', 'pi_coi', 'pi_coi_name', 'tab_prog_title', 'prog_title', 'tab_dp_id', 'dp_id', 'tab_ob_id', 'ob_id', 'tab_obs_targ_name', 'obs_targ_name', 'tab_dp_cat', 'dp_cat', 'tab_dp_type', 'dp_type', 'tab_interferometry', 'tab_dp_tech', 'dp_tech', 'tab_tpl_name', 'tpl_name', 'tab_tpl_nexp', 'tpl_nexp', 'tab_tpl_start', 'tpl_start', 'tab_del_ft_sensor', 'del_ft_sensor', 'tab_del_ft_status', 'del_ft_status', 'tab_det_ntel', 'det_ntel', 'tab_iss_conf_station1', 'iss_conf_station1', 'tab_iss_conf_station2', 'iss_conf_station2', 'tab_iss_conf_station3', 'iss_conf_station3', 'tab_ocs_obs_mode', 'ocs_obs_mode', 'tab_ocs_obs_specconf', 'ocs_obs_specconf', 'tab_ocs_obs_type', 'ocs_obs_type', 'tab_ins_grat1_wlen', 'ins_grat1_wlen', 'tab_fwhm_avg', 'fwhm_avg', 'tab_airmass_range', 'airmass_range', 'tab_night_flag', 'night_flag', 'tab_moon_illu', 'moon_illu', 'order', 'extra_columns', 'force_tabular_mode', 'full_screen_mode']
(Pdb) 'wdbo' in [form_input.name for form_input in form.inputs]
True

Zing! The last expression is always False on my branch, which is problematic. Now I realize what the problem is! With lxml, form_input.name gives me the value of the name attribute of the node form_input; e.g. if the node form_input has the source <input name=foo>, then form_input.name will have the value 'foo'. But using BeautifulSoup, the value of form_input.name will be 'input', because it returns the name of the tag as a string!
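A quick illustration of that difference on the BeautifulSoup side (a tiny standalone snippet, not the astroquery code):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<form><input name="wdbo"></form>')
    elem = soup.find('input')

    print(elem.name)          # 'input' -- the tag name in BeautifulSoup
    print(elem.get('name'))   # 'wdbo'  -- the value of the "name" attribute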

After having fixed the code according to my new knowledge, there is only one remote failing test left, yay! Until next time, I think this is enough for one blog post :-)

by Simon Liedtke at June 03, 2014 10:00 PM

June 02, 2014

Edwin Marshall
(Kivy student)

Too Many Moving Parts

The title really speaks to two issues, one personal, and one technical. As most of you reading this are concerned only with the technical aspects of the project, I'll focus on the that first.

Technically Speaking

The awesome thing about the core Kivy project is that it is organized very well and thoroughly documented. While the Kivy Architecture Diagram is a little scary when first encountered, once you take a peek at the directory structure, everything becomes much clearer. For instance, understanding how SDL ties into everything merely requires a developer to study kivy/core. In fact, one could add new providers piecemeal by selecting which of the core modules to implement first. For example, if a developer wanted an SFML or GLFW-based window, he would wander into kivy/core/window.

Unfortunately, I can't say that things have been as pleasant in python-for-android. For example, in attempting to understand how SDL 1.2 ties in to the project, I've had to examine numerous files:

distribute.sh
Responsible for cross-compiling python and any other dependencies for Android
recipes
Explain how each of those dependencies should be built. Some of them are deceptively simple, until you realize that JNI is involved.
src/src/org/renpy/android/PythonActivity
An Android activity that sets up a sound thread, loads the libraries built by distribute.sh as well as the native application and sdl_main shared objects.
src/jni/{sdl,sdl_main,application}
Various bits of native wrappers responsible for not only loading the sdl shared libraries, but also providing a user's python code with an entry point.

All this indirection makes it rather easy for a simple guy like myself to get disoriented, so more time is spent understanding the code and following its logic trail than actually writing code.

Speaking of Writing Code

I've been apprehensive about breaking things, hence the lack of pushes, but that will change tonight. Yesterday I spent a lot of time reorganizing my workspace. Specifically, I made sure to properly set up my git remotes so I didn't push to the wrong branch. I've also forked to my own repo so that I never have to worry about messing up the master branch.

Today I plan to do a bit of liposuction to see if things that seem a bit redundant are necessarily so, or, like my beer gut, just another thing to judge the code by that needn't be there. I'm hoping this week will be the week that you guys see that I am actually a productive individual.

A Little Personal

On the personal side of things, the last couple of weeks have been interesting because I've recently quit my job and have been scrabbling for new work. I typically would have done the smart thing and waited for a new job before quitting, but my last job was taking its toll on me.

Due to the amount of repetitive motion (and not the manual labor itself), my body has started to ache in ways it never has in all my 13 years of legally working. My stress-induced eczema was acting up, I had pain in joints all over, and I found myself more lethargic during my off days than I had ever been.

I also found myself more depressed than usual. What's ironic about that is that prior to this job, I used to find myself working excessively as a way to cope with-- or perhaps more accurately, as a distraction from-- my depression. Here, things were different; I was constantly reminded of how easily talents get ignored and identities get stripped away; I was just another worker. If I was going to be nobody, I needed to be nobody of my own accord and with the opportunity to change.

While I haven't been successful in my job search yet, I've gotten some responses from freelance opportunities which I'll continue to explore this week. As such, I apologize if I seem somewhat distant.

I suppose that's enough about my last story. I hope everyone else's life has been less complicated and wish the other students happy coding!

June 02, 2014 07:26 PM

June 01, 2014

Issamou
(scikit-learn student)


I am announcing my progress for the GSoC 2014 project "Extending Neural Networks Module for scikit-learn" here:
http://issamlaradji.blogspot.com/?view=classic

Thanks!

by Issam Laradji (noreply@blogger.com) at June 01, 2014 10:51 PM

Chinmay Joshi
(Mercurial student)

Dive in code during the First two weeks

At the end of the first two weeks of the GSoC coding period, I am really enjoying my time working on my project. Working with an open source project community is a never-before experience in my life, and I keep discovering new things each and every day. I thought I had done enough basic research about my project, but it continuously proves me wrong: one can't really understand the building blocks before submerging into the code. Working in the code base, I keep discovering the various ways Mercurial developers have squeezed out maximum optimization.

In spite of stumbling and getting frustrated at the beginning because of the slow pace of my work, I sent a small patch series in week 2. I added some filesystem functions to vfs, like lexists(), lstat() and unlinkpath() (still in the queue to send), and updated some users in bundlerepo.py and patch.py. Some of my patches were queued and merged, and some got me very important feedback. Based on that feedback, I learned to check the callers and origins of functions before updating them to the vfs interface. I also tried to clear up my doubts about repository-related and current-working-directory (cwd) related paths. Discussing things with the community helped me identify the cautions this task requires. Checking all callers and origins of a function before replacing it is a task I found very time consuming and tedious, but a bare necessity. Switching to a slightly more advanced IDE than a plain editor is giving me a helping hand with this task.

I am preparing another series in which I am adding more functions to vfs and updating more users to use the new API. This first phase of the project, adding functions and updating their users, looks like a quite mechanical process at first sight, but it needs proper attention to preserve compatibility and performance. During this process, I encountered a few filesystem functions from Python's os module which I had never used in my usual college projects. I also liked Mercurial's util helpers for file manipulation, written in C; these util.* functions, especially the ones about tree links, are very new to me. Talking and discussing things with mentors and the community always helps clear my mind and gives me direction. I am currently focusing on this add-functions / update-users task, as it is essential for the rest of the project.
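To make that concrete, here is a rough illustrative sketch (not Mercurial's actual source; the class and attribute names are simplified stand-ins) of the kind of change involved: a vfs-style wrapper gains a new filesystem helper, and callers stop joining paths by hand and go through the wrapper instead.

import os

class vfs(object):
    # simplified stand-in for Mercurial's vfs abstraction
    def __init__(self, base):
        self.base = base

    def join(self, path):
        return os.path.join(self.base, path)

    def lexists(self, path):
        # new helper: test a repo-relative path without following symlinks
        return os.path.lexists(self.join(path))

# A caller then changes from building the path itself:
#     os.path.lexists(os.path.join(repo.root, "some/file"))
# to going through the filesystem abstraction:
#     repo.vfs.lexists("some/file")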

Stay tuned for further updates!

by Chinmay Joshi (noreply@blogger.com) at June 01, 2014 07:35 AM

Anurag Goel
(Mercurial student)

Slowly and Steadily reaching towards the goal.

It has been two weeks now, and the experience has been great so far. At first I was viewing the project as one big picture, thinking about how things would get done and how much time they would take.
But that approach did not give me a clear idea of the project and even kept me from moving a single step forward. After communicating with my mentors and crew members, I realised that I should focus on one task at a time, and that before starting a task I should break it into small steps. Following this approach helped me complete the first task. It has not been reviewed yet, but if further changes are needed I will make them as I go.

Task 1 is about gathering timing info for test files. In this task, I calculated two main things:
1) Children's user time taken by a process.
2) Children's system time taken by a process.

I used "os.times()" module to get the above info. Unfortunately, in windows "os.times()" module works only for parent process. Therefore, my contribution only works for linux user as of now.    

According to my proposal timeline, I planned to complete the first two tasks in the first two weeks, but I have only been able to complete the first task so far. Now that I have the right approach, I will try to catch up quickly.

The biggest challenge in doing a task is working out the small steps you need to get through it. That is only possible when you communicate with your mentor and crew members as much as possible. With every conversation things get clearer, and that helps build a greater understanding of the project.

by Anurag Goel (noreply@blogger.com) at June 01, 2014 04:25 AM

May 31, 2014

Terri Oda
(PSF Org admin, Mailman mentor)

You can leave academia, but you can't get the academic spam out of your inbox

When I used to do research on spam, I wound up spending a lot of time listening to people's little pet theories. One that came up plenty was "oh, I just never post my email address on the internet" which is fine enough as a strategy depending on what you do, but is rather infeasible for academics who want to publish, as custom says we've got to put our email addresses on the paper. This leads to a lot of really awesome contacts with other researchers around the world, but sometimes it leads to stuff like the email I got today:


Dear Terri,

As stated by the Carleton University's electronic repository, you authored the work entitled "Simple Security Policy for the Web" in the framework of your postgraduate degree.

We are currently planning publications in this subject field, and we would be glad to know whether you would be interested in publishing the above mentioned work with us.

LAP LAMBERT Academic Publishing is a member of an international publishing group, which has almost 10 years of experience in the publication of high-quality research works from well-known institutions across the globe.

Besides producing printed scientific books, we also market them actively through more than 80,000 booksellers.

Kindly confirm your interest in receiving more detailed information in this respect.

I am looking forward to hearing from you.


Best regards,
Sarah Lynch
Acquisition Editor

LAP LAMBERT Academic Publishing is a trademark of OmniScriptum
GmbH & Co. KG

Heinrich-Böcking-Str. 6-8, 66121, Saarbrücken, Germany
s.lynch(at)lap-publishing.com / www. lap-publishing .com

Handelsregister Amtsgericht Saarbrücken HRA 10356
Identification Number (Verkehrsnummer): 13955
Partner with unlimited liability: VDM Management GmbH
Handelsregister Amtsgericht Saarbrücken HRB 18918
Managing director: Thorsten Ohm (CEO)


Well, I guess it's better than the many misspelled emails I get offering to let me buy a degree (I am *so* not the target audience for that, thanks), and at least it's not incredibly crappy conference spam. In fact, I'd never heard of this before, so I did a bit of searching.

Let's just post a few of the summaries from that search:

From wikipedia:
The Australian Higher Education Research Data Collection (HERDC) explicitly excludes the books by VDM Verlag and Lambert Academic Publishing from ...


From the well-titled Lambert Academic Publishing (or How Not to Publish Your Thesis):
Lambert Academic Publishing (LAP) is an imprint of Verlag Dr Muller (VDM), a publisher infamous for selling cobbled-together "books" made ...


And most amusingly, the reason I've included the phrase "academic spam" in the title:
I was contacted today by a representative of Lambert Academic Publishing requesting that I change the title of my blog post "Academic Spam", ...


So yeah, no. My thesis is already published, thanks, and Simple Security Policy for the Web is freely available on the web for probably obvious reasons. I never did convert the darned thing to html, though, which is mildly unfortunate in context!


May 31, 2014 05:19 AM

PlanetPlanet vs iPython Notebook [RESOLVED: see below]

Short version:

I'd like some help figuring out why RSS feeds that include iPython notebook contents (or more specifically, the CSS from iPython notebooks) are showing up as really messed up in the PlanetPlanet blog aggregator. See the Python summer of code aggregator and search for a MNE-Python post to see an example of what's going wrong.

Bigger context:

One of the things we ask of Python's Google Summer of Code students is regular blog posts. This is a way of encouraging them to be public about their discoveries and share their process and thoughts with the wider Python community. It's also very helpful to me as an org admin, since it makes it easier for me to share and promote the students' work. It also helps me keep track of everyone's projects without burning myself out trying to keep up with a huge number of mailing lists for each "sub-org" under the Python umbrella. Python sponsors not only students working on the language itself, but also students working on projects that make heavy use of Python. In 2014, we have around 20 sub-orgs, so that's a lot of mailing lists!

One of the tools I use is PlanetPlanet, software often used for making free software "planets" or blog aggregators. It's easy to use and run, and while it's old, it doesn't require me to install and run an entire larger framework which I would then have to keep up to date. It's basically making a static page using a shell script run by a cron job. From a security perspective, all I have to worry about is that my students will post something terrible that then gets aggregated, but I'd have to worry about that no matter what blogroll software I used.

But for some reason, this year we've had some problems with some feeds, and it *looks* like the problem is specifically that PlanetPlanet can't handle iPython notebook formatted stuff in a blog post. This is pretty awkward, as iPython notebook is an awesome tool that I think we should be encouraging students to use for experimenting in Python, and it really irks me that it's not working. It looks like Chrome and Firefox parse the feed reasonably, which makes me think that somehow PlanetPlanet is the thing that's losing a <style> tag somewhere. The blogs in question seem to be on blogger, so it's also possible that it's google that's munging the stylesheet in a way that planetplanet doesn't parse.

I don't suppose this bug sounds familiar to anyone? I did some quick googling, but unfortunately the terms are all sufficiently popular when used together that I didn't find any reference to this bug. I was hoping for a quick fix from someone else, but I don't mind hacking PlanetPlanet myself if that's what it takes.

Anyone got a suggestion of where to start on a fix?

Edit: Just because I saw someone linking this on twitter, I'll update in the main post: tried Mary's suggestion of Planet Venus (see comments below) out on Monday and it seems to have done the trick, so hurrah!


May 31, 2014 03:53 AM

May 29, 2014

Wenzhu Man
(PyPy student)

"hg push -f" accident today summary

First of all: never, ever use hg push -f. Never force push.

It has been a relatively rough day.
I was given write access recently, so I made my first push today. And oops, some changesets that shouldn't have been pushed ended up in "default".

what I did:
1. I pulled every new commit (hg pull, then hg update -C to get rid of uncommitted changes, but I forgot about the committed changesets!!! Also, hg purge will get rid of untracked local files).

("In Mercurial, pull is the opposite of push: it syncs the local repo with a remote repo without touching the working copy" but you can use hg pull -u  == git pull
Note this is different from git :  git pull == git fetch(equals hg pull) + git merge )

2. Then I ran hg push and got a warning that I would create two new branches (gc-pinning and test-ast-module). The test-ast-module branch had been closed remotely after being merged into default, but I still had it locally, so I turned on the mq plug-in and used the hg strip command to get rid of the local branch.
(Maybe this unsafe step created the crisis.)
Is there any safe way to delete a local branch? (The safest way I have found so far is to clone up to the commit before the branch was created and pull from there.)


After the crisis:
Well, what's done is done, and sorry is the most useless thing to say. I had to deal with the three heads I had created myself, after realizing there is no such thing as simply deleting heads or deleting changesets.

So I merged the three heads (two at a time) and manually resolved the conflicts.

Things seem to be fine now.

Well, at least I learnt several things:
1. Before pushing, use hg out to see what is going to be pushed.
2. hg diff is useful to check whether you are truly out of trouble (I ran hg diff against the commit before I pushed, nothing came out, so I know things are fine now; thanks to Brian Kearns).

Last words: write access to a repo is a sword; you can use it to slay a dragon or yourself...


by Wenzhu Man (noreply@blogger.com) at May 29, 2014 03:16 AM

May 23, 2014

Frank Cheng
(Statsmodels student)

coding structure first pass

Kerby and I have gotten the structure of MICE to a reasonable point; it's a good time to make an update! Here is the current user interface:

>>> import pandas as pd (1)
>>> import statsmodels.api as sm (2)
>>> from statsmodels.sandbox.mice import mice (3)
>>> data = pd.read_csv('directory_here') (4)
>>> impdata = mice.ImputedData(data) (5)
>>> m1 = impdata.new_imputer("x2") (6)
>>> m2 = impdata.new_imputer("x3") (7)
>>> m3 = impdata.new_imputer("x1", model_class=sm.Logit) (8)
>>> impcomb = mice.MICE("x1 ~ x2 + x3", sm.Logit, [m1,m2,m3]) (9)
>>> p1 = impcomb.combine(iternum=20, skipnum=10) (10)

Now here is what's going on, step by step:

1) Our base data type is going to be a pandas DataFrame. Currently, the data must be in a form that supports pd.DataFrame(data).

2) Import our statsmodels API.

3) Import our mice module.

4) Read in our data.

5) mice.ImputedData is a class that stores both the underlying data and its missing data attributes. Right now the key attribute is which indices for each variable are missing. Later we will be making changes directly to the underlying dataset, so we don't want to lose this information as we start filling in values. As soon as these indices are saved, we modify the underlying data by filling in as an initial imputation all the column-wise means for the missing values. ImputedData also contains a helper function that allows us to fill in values and a function that allows us to construct Imputers (described below). Note that changes to the data are within the scope of ImputedData, so your actual data will be safely preserved after all the MICE carnage is done :)

6-8) For each variable with missing values that we want to impute, we initialize an Imputer. These Imputers contain two simulation methods that will help us impute the specific variable of interest: impute_asymptotic_bayes (described in my last post) and impute_pmm (a new one, predictive mean matching). There will be more simulation methods later on. Each Imputer's job is to impute one variable given a formula and model, specified by the user. ImputedData.new_imputer defaults to OLS and all other variables as predictors. Note that here we are implicitly assuming Missing At Random since, conditional on the predictors, the missing value is completely random (and asymptotically Gaussian).

9) Initialize a MICE instance by specifying the model and formula that we are really interested in. The previous imputation models are simply conditional models that we think will do a good job at predicting the variable's missing values; this new model is an analysis model that we want to fit to the already-imputed datasets. We pass in a list containing all the Imputers from steps 6-8; these Imputers will do the work of imputation for us.

10) Here's where we specify an iteration number (iternum) and the number of imputations to skip between iterations (skipnum). Mechanically, what is happening is that once MICE.combine is called, we start stringing imputations together, with the variable with the least number of missing observations being imputed first. ImputerChain just makes sure every Imputer in the list is used once to make one fully-imputed dataset. However, one imputation for each variable may not be enough to ensure a stable conditional distribution, so we want to skip a number of fully imputed datasets between fits of the analysis model. So we run through all the Imputers ad nauseam until we have skipped the set number of fully imputed datasets, then fit the model. This is one iteration; we repeat until we have iternum fitted models. To be clear: if we specify iternum=10 and skipnum=5, we will go through a total of 50 imputation iterations (one iteration is a series of imputations for all specified Imputers) and only fit the analysis model for imputation number 5, 10, 15, etc.
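A tiny sketch of that bookkeeping may help (the impute() and fit_analysis() calls below are hypothetical placeholders, not the actual statsmodels internals):

def run_mice(imputers, fit_analysis, iternum=10, skipnum=5):
    fits = []
    rounds = 0
    while len(fits) < iternum:
        for imputer in imputers:       # one round imputes every variable once
            imputer.impute()
        rounds += 1
        if rounds % skipnum == 0:      # keep only every skipnum-th round
            fits.append(fit_analysis())
    return fits                        # iternum fitted analysis models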

All this skipping and fitting happens in AnalysisChain. MICE.combine then takes all these fitted models (actually, only the parameters we care about: params, cov_params, and scale) and combines them using Rubin's rule, and finally the combined parameters are stuffed into a fitted analysis model instance.
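For reference, here is a rough sketch of that combining step for the parameters and their covariance, assuming each fitted model exposes params and cov_params() the way a statsmodels results object does (the scale handling is left out):

import numpy as np

def combine_rubin(fits):
    m = len(fits)
    params = np.array([f.params for f in fits])        # shape (m, k)
    covs = np.array([f.cov_params() for f in fits])    # shape (m, k, k)
    q_bar = params.mean(axis=0)                        # pooled estimate
    u_bar = covs.mean(axis=0)                          # within-imputation variance
    diffs = params - q_bar
    b = diffs.T.dot(diffs) / (m - 1)                   # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b                    # total variance
    return q_bar, t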

Whew, that's a mouthful! That's the gist of it; if you'd like to get under the hood (and I hope people do!), the GitHub repo is here. There are some minor things I did to statsmodels.api and the working directory just so I could work from my local GitHub folder/machine, so don't try to install it directly before changing those back.

The next step is making the user interface simpler (hopefully the user will just pass the data object directly into MICE and not have to deal with initializing Imputers and ImputedData), and also adding logic that lets the user specify which simulation method to use for imputation. Hopefully I'll get some feedback and make more improvements before my next post!

by Frank Cheng (noreply@blogger.com) at May 23, 2014 06:51 AM

May 18, 2014

Edwin Marshall
(Kivy student)

GSoC '14 End of Community Bonding Period

Project Scope

One thing that has been frustrating about this experience is that there seems to be some disparity between what I proposed and what is expected to be implemented. In particular, I don't mention mobile platforms in my original proposal, but as I communicate with my mentors it is rather apparent that my early focus should be not only on getting SDL2 to play nicely with Android in Kivy's ecosystem, but also on refactoring the runtime system in general.

While I certainly voiced to one of the core developers how overwhelming such a change of direction has been for me, I haven't protested, because I honestly believe this new direction will have a more positive impact on the community. Also, while I was speaking specifically of kivy.core when talking about having a solid foundation, the same logic can be applied to the mobile platforms, so I don't think the expectation is unreasonable. After all, I've often read that the only way to become better (at anything) is to do something beyond your current capabilities.

The Community

The Kivy community is as friendly and helpful as always. As I get acclimated to the Android ecosystem, I've had plenty of questions, all of which have been answered rather quickly. As I continue to feel my way around the intricacies of the current Android runtime, I have no doubt that when I run into issues, they'll be able to offer assistance where necessary. My only real concern is IRC. While it is certainly more effective than email, I find that simply having an IRC client open distracts me from actually working. I find myself consumed by the side conversations or constantly checking to see if I've missed any important messages. As such, I might make it a point to be on IRC at specific times of the day and keep it closed the rest of the time.

The Way Forward

I look forward to this summer as a way to give back to a great open source project and grow as a developer in the process. In fact, I'm considering leaving my part-time gig so that I can focus on this exclusively, since I've found the context switching (warehouse work to coding) to be a bit draining. I'm already thinking about code all day, so I might as well make use of all that potential energy. I'll also need some kind of structure to keep me on task, given that this is the first time I've done any work from home. Given that I attend school online as well, I think that should be less of a puzzle to work out. Still, that structure will be important simply because I don't have a lot of time to do something great, so every minute counts.

Cheers!

May 18, 2014 11:11 PM

Jaspreet Singh
(pgmpy student)

The Bonding Period

Hi Everyone!

I have been selected for GSoC '14 with the Python Software Foundation, under pgmpy. Hurray!! The title of my project is "Parsing from and writing to standard PGM formats". I will be posting all the details about my project work here, along with my experiences and any information that may encourage and help future contributors to this open source organization. There will be weekly updates about my work. Two meetings were conducted in this period, mostly discussing how to approach the project. I have a new IDE, PyCharm, installed and set up to work with; it serves as a good environment for Python development. The fact that one can use the terminal alongside the editor in the same window is new to me. There will be regular weekly meetings with the mentors.

by jaspreet singh (noreply@blogger.com) at May 18, 2014 06:42 PM

Navin
(pgmpy student)

The beginning of the Summer of Code

This is my first post, so I begin by congratulating myself on getting selected for the Google Summer of Code. Let me describe the organization I got selected for: pgmpy. The "pgm" in pgmpy stands for probabilistic graphical models and the "py" stands for Python. As you may have guessed by now, it is an organization under the PSF (Python Software Foundation) and works on probabilistic graphical models. Here is the link to the organization: https://github.com/pgmpy/pgmpy . Sadly it still doesn't have a web page of its own, but the GitHub page will do. It is a relatively new organization with very little code right now (compared to many other organizations; a lot of previous GSoCers said it took them quite a while to get familiar with just the part of the code that was relevant to them).

Anyway, let's get to what I will be doing. My task is basically to implement Markov networks. After implementing the basic architecture, I will implement a few algorithms. Take a look here for the current version of the timeline (we decided to make a few changes here and there to the timeline I had initially proposed, though that would hardly be interesting for you): https://docs.google.com/document/d/1WRuiEUgiuwWMuxezTy-YXqSiVILP2Q1lQ50zU6kib6g/edit?usp=sharing .

Now, the big question is what I have been doing from the 19th of April until today (approximately a month). We were not supposed to start coding yet, as it is still the community bonding period: we are supposed to familiarize ourselves with the code, become friendly with the mentors, get to know people in the org, and do any preparatory work necessary to begin coding. So I haven't started coding things up yet, but I have been doing some cool stuff.

To begin with, I had my end-semester exams until the 30th of April, so I was quite inactive until then. We had an IRC chat on the 28th of April, where we had a brief talk about things and set a few plans for the community bonding period. We also chatted a bit about IDEs, Cython and testing techniques, among other things. After the exams, I lazed away for a while and then, around the 3rd of May, started reading up on the algorithms I would be implementing. I did a course on Advanced Machine Learning this semester; however, I thought it would be a good idea to go through the algorithms from the book (the standard book on the subject of course, the one and only Probabilistic Graphical Models by Koller and Friedman).

We had the next IRC chat on the 10th of May (we had decided in the previous chat that we would have these chats every Saturday, starting from the 10th). That one was more of a progress report and a chance to clear up any doubts, and we discussed plans for the subsequent weeks. As per the plan, I went through the software preparation part of my project: I familiarized myself with Cython and the PyCharm IDE, and configured Vim to work well with Python (that was quite a struggle, but then, anything for Vim :D ). I also took a look at the existing code so that I could use parts of it for my implementation. (I also think I found a bug, for which I sent a pull request.)

And that brings us to today. We had another IRC chat yesterday, and my task for the next week is to set up the Markov network class, write the basic class implementation, test it, and so on (see the timeline). But the good news is that coding starts now. :)


by Navin Chandak (noreply@blogger.com) at May 18, 2014 06:12 PM

May 17, 2014

Mabry Cervin
(Astropy student)

Beginning Directions

Start of the coding period is in three days and I'm very excited. The plan for the uncertainties framework is for it to be developed as an Astropy affiliated package. Doing it this way, rather than trying to merge changes into Astropy core right off the bat, has some huge advantages. Astropy uses a well-planned release cycle with regular feature freezes before large releases, and trying to coordinate new feature development alongside such a process would be an unnecessary burden on both the uncertainties development and core Astropy. Furthermore, having it as an affiliated package will allow me to work more closely with my mentors by having a dedicated repo. Astropy has amazing support for affiliated packages; the full list includes some really cool stuff like astroML, a package for astrophysics-related machine learning.

The first milestone for the package is getting Astropy's Quantity class to work with the existing Python package Uncertainties. The Uncertainties package propagates error using a technique called automatic differentiation, which calculates the derivative of every operation alongside the operation's return value. This method is computationally very efficient. It is restrictive with respect to much astronomical data, but it will make an excellent starting point for a properly flexible framework.
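As a tiny illustration of what that propagation looks like from the user's side (a sketch assuming the package's ufloat(nominal_value, std_dev) constructor, not part of the planned Astropy work itself):

from uncertainties import ufloat

x = ufloat(10.0, 0.5)   # 10.0 with a standard deviation of 0.5
y = ufloat(3.0, 0.1)
z = x * y + 2.0         # derivatives are tracked through each operation

print(z.nominal_value)  # 32.0
print(z.std_dev)        # ~1.80, from first-order (linear) error propagation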

by Epitrochoid (noreply@blogger.com) at May 17, 2014 01:29 AM