Python's Summer of Code Updates

July 22, 2014

Michael Mueller
(Astropy student)

Week 9

This week was a pretty straightforward one--I spent most of it working on implementing parallelization in the Cython reading algorithm (in a new branch, multiprocessing). After Tuesday's discussion about what problem to pursue next, we decided that I should spend a week or two on parallelization and then begin to work on creating custom numpy dtypes, which will be an extension of Erik's work and should be applicable to FITS files as well as ASCII to some extent.
In short, the new strategy I implemented for reading involves splitting input data into N chunks (I currently have N=8, but I'll have to find an optimal N at some point) and performing both tokenization and conversion in a different process for each chunk. There are a few subtleties involved; for example, chunks need to be separated by newlines, certain data has to be copied for each process, etc. One irritating issue I came across involves conversion when one process discovers string data and another discovers numeric data in the same column. The basic idea at the end of the algorithm is to take the resulting data from each process and concatenate the chunks together, using widening conversions (i.e. int to float to str) when chunks have different data types. However, if a column contains integers in the first chunk, floats in the second, and strings in the third, concatenating the first two would turn 5 into 5.0, and then concatenating the first two chunks with the third would result in "5.0" instead of the desired "5". The obvious solution is to check explicitly for string data in any of the chunks and immediately convert all the columns to strings, but this still results in improper behavior for numbers in scientific notation, for example (1.3e2 -> 13 -> "13"). I ended up simply using a multiprocessing.Queue to force processes to reconvert columns as strings in this case, discarding the original converted data altogether. Luckily this seems like a pretty unlikely corner case for large data files, so performance shouldn't really take a hit.
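To make the chunking idea concrete, here is a rough pure-Python sketch of the approach (the real implementation lives in Cython/C; process_chunk, the chunk count and the file name are placeholders):

import multiprocessing

def split_on_newlines(text, n_chunks):
    """Split text into roughly equal chunks that end on newline boundaries."""
    chunks, start = [], 0
    target = max(1, len(text) // n_chunks)
    while start < len(text):
        end = text.find('\n', min(start + target, len(text) - 1))
        end = len(text) if end == -1 else end + 1
        chunks.append(text[start:end])
        start = end
    return chunks

def process_chunk(chunk):
    # Stand-in for the real tokenization + type conversion step.
    return [line.split() for line in chunk.splitlines()]

if __name__ == '__main__':
    with open('data.txt') as f:      # 'data.txt' is a placeholder file name
        text = f.read()
    pool = multiprocessing.Pool(processes=8)
    results = pool.map(process_chunk, split_on_newlines(text, 8))
    # ...concatenate the per-chunk columns here, widening dtypes when they differ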
A more serious problem, and one I haven't quite decided how to handle, is how to deal with quoted fields spanning multiple lines. Since my approach involves splitting data into chunks by lines, an input string like 'a b c\n1 2 "3\n4"' will not parse correctly because one chunk will try parsing the line 1 2 "3 (noticing an unfinished quote sequence) and another will try parsing 4" (noticing an unexpected quote character and too few data values in the row). Unfortunately no good solution comes to mind, since explicitly searching through the input for quote characters would basically negate the performance gains of multiprocessing in the first place. I plan on asking my mentors what they think, since I'm not sure how commonly this sort of case occurs.
All tests are passing with the new multiprocessing algorithm except for those involving the quoting issue, and there is a speed gain over the old method. Unfortunately it's a little less than I hoped it would be (Tom warned that this could happen, citing Amdahl's Law), but still fairly nice; on a sample file that took the old algorithm about 0.11 seconds (the same file I've been using for a few weeks...I should probably start trying other ones), the new algorithm takes about 0.075 seconds, a speed gain of about 45%. I've been working on improving the reader's performance when reading from a file, since reading the file into a single string takes about 0.012 seconds of the total time, although I'm not yet finished. Table.__init__() is also taking about 0.01 seconds, which is probably unavoidable. Pandas is taking about 0.055 seconds on the same file. It's getting trickier to cut down time at this point, but hopefully I can edge out Pandas in the end!

by Michael Mueller (noreply@blogger.com) at July 22, 2014 04:20 AM

July 21, 2014

Rishabh Raj
(scikit-image student)

Almost There!

Having successfully set up the basic backend to run and interact with Docker containers, the next intuitive step was to integrate it with our image gallery and thus allow people to play around with this interactive image processing toolkit, online!

We had a discussion to finalize what sort of interface we would like to integrate this with. This could have been done in two ways: one way would be to use what we named ‘external’ integration, and the other would subsequently be known as ‘inline’ integration. By this time you might have an intuition of what I mean. For further info – there’s a summary here @ https://github.com/scikit-image/skimage-docker/issues/6

One day, while testing out some code snippets on this service, I managed to stumble upon a problem. Every now and then, some Docker container would crash abruptly, without an error message. Now this was bad, and it just goes to show the importance of testing. Digging further into the code showed that there was a race condition between a container stopping after executing some statements and then restarting. `docker wait` (which makes the execution of the next command wait till the container exits) to the rescue, and the service went back to being hale and hearty again.
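For illustration, the fix essentially boils down to blocking on the container before it gets removed and restarted. A simplified sketch of what that looks like when driving the Docker CLI from Python (the container and image names here are made up):

import subprocess

NAME = "snippet-sandbox"       # hypothetical container name
IMAGE = "skimage-sandbox-img"  # hypothetical image name

def run_snippet(snippet):
    # Launch a short-lived container to execute the user's snippet.
    subprocess.check_call(["docker", "run", "-d", "--name", NAME, IMAGE,
                           "python", "-c", snippet])
    # `docker wait` blocks until the container has exited; without it,
    # removing and restarting the container races with the still-running one.
    subprocess.check_call(["docker", "wait", NAME])
    subprocess.check_call(["docker", "rm", NAME])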

We set up a live version of the gallery for everyone to try out @ http://ci.scipy.org:8000/; if you want to go to the examples directly, this is where you need to head to – http://ci.scipy.org:8000/auto_examples/index.html

These docs were built on a Mac, which for some time now has not produced clean doc builds, so you might find a few pictures missing. The rest works fine.

We had a few more discussions on how to improve the experience as a whole; there are ideas to put in spam detection, better error messages etc. We’re tracking it all here @ https://github.com/scikit-image/skimage-docker/issues. Feel free to send in feedback on the ideas we’re already tracking, or file bugs and feature requests. The idea is to support this with solid documentation too, so that any other project can use it to set up something similar for themselves!


by sharky932014 at July 21, 2014 10:41 PM

Gaurav Trivedi
(Kivy student)

Towards Making Kivy Apps More Accessible

After last week‘s discussion with tshirtman about making Kivy apps more accessible, I spent this week exploring the accessibility options on the devices I own. The idea was to investigate and plan a set of common features that we could implement for Kivy.

I ended up playing with all the settings that I could find under the accessibility menus on an Android phone and a Macbook. I will soon repeat this exercise with an iPhone as well.

In comparison to my Galaxy running Android 4.4.2, these features were easier to use and more responsive to interact with on my Mac running OS X 10.9. Even the tutorials were simpler to follow and avoided statements like:

“When this feature is on, the response time could be slowed down in Phone, Calculator and other apps.”

I will set my rantings aside for now and instead focus on how we could support these features within Kivy apps. Following Apple’s way of organizing these features, you can group accessibility features into three broad categories:

Here's TalkBack on Android reading out my PIN to me!

  • Seeing: This includes display settings such as colors and contrast, making text larger, and zooming the screen. A lot of these settings come free of cost to app developers. However, an important feature in this category is the screen reader app (for example, VoiceOver and TalkBack). This is the main area of interest to us, as the onus of making this feature work correctly rests with us and the app developers. I will get back to this after we finish classifying the features.
  • Hearing: We must make sure all our alert widgets in Kivy are accompanied by visual cues (such as a screen flash) along with audio indicators. Another feature that might be of interest to us would be to ensure that we provide a subtitles option in our video player widget.
  • Interacting: This includes features like the assistant menu on Android, which provides easy access to a limited number of commonly accessed functions. Most of these features are again managed and controlled by the operating systems themselves. Apart from a few design choices to make sure that the app doesn’t interfere with their functions, there’s nothing that app developers have to do to support them.

So the most important missing piece in the Kivy framework is support for screen-reading apps. These apps are designed to provide spoken feedback on what the user selects (touches or points at) and activates. For example, if I hover over a menu item or select a button on the screen, the app must describe what it is and also say what it does, if possible.

While settings such as zoom, larger font sizes etc. are already taken care of by the frameworks that Kivy builds upon, we must provide explicit support for screen readers. Here’s an example of a hello world Kivy app and how it is seen by the screen readers. There is nothing apart from the main window’s buttons, i.e. “Close”, “Minimize” and “Zoom”, that is available to the VoiceOver program to describe:

There is no meta information available to the screen-readers to describe the interface to the user.
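For reference, a minimal Kivy “hello world” app of the kind described above looks roughly like this (a sketch, not the exact app behind the screenshot):

from kivy.app import App
from kivy.uix.button import Button

class HelloWorldApp(App):
    def build(self):
        # A single button fills the window. To a screen reader the widget tree
        # is currently invisible; only the window chrome ("Close", "Minimize",
        # "Zoom") gets described.
        return Button(text="Hello, world!")

if __name__ == "__main__":
    HelloWorldApp().run()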

The main challenge here is that none of the Kivy widgets are selectable at this point. Not only does that make it difficult to use them with screen readers, it also makes it impossible to navigate them using a keyboard. We must provide screen-reader support along with our planned feature for providing focus support for Kivy widgets.

Along with on_focus, we must send descriptive feedback that the screen readers can use as a description of the widget. While most of the native OS UI elements have such content descriptors already defined, our Kivy widgets lack this information. We must make an effort to ensure that common widgets such as text inputs, select options etc. have a description consistent with the one natively provided by the platform.

Here are links to the relevant docs pages for Android (http://developer.android.com/guide/topics/ui/accessibility/apps.html#custom-views) and iOS (https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/iPhoneAccessibility/Accessibility_on_iPhone/Accessibility_on_iPhone.html).

So a major portion of this implementation must be dealt with within Kivy itself. While we have previously adopted an approach of implementing the lowest common set of features in other Plyer facades, it would be a good idea to implement the content descriptors (or accessibility attributes) in a way that covers all our target platforms. Plyer could then take over from there to make the platform-specific calls to support the native screen reading apps. We would need to provide a mechanism to transform these attributes into a form relevant on each of these platforms separately. We could safely ignore the extra set of attributes at this point.
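To make that a bit more concrete, a content-descriptor layer could look something like the sketch below. Every name in it is hypothetical; nothing like this exists in Kivy or Plyer yet:

class AccessibleWidgetMixin(object):
    """Hypothetical mixin holding platform-neutral content descriptors."""
    accessibility_label = ""   # short name, e.g. "Send button"
    accessibility_hint = ""    # what activating the widget does

    def on_focus(self, instance, focused):
        # When the widget gains focus, hand the descriptors to the platform
        # layer, which would map them to TalkBack/VoiceOver and so on.
        if focused:
            announce(self.accessibility_label, self.accessibility_hint)

def announce(label, hint):
    # Placeholder for the platform-specific call a Plyer facade would make.
    print("screen reader says: %s. %s" % (label, hint))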

In this post I have only outlined the first steps that we need to take to make Kivy apps more accessible. I will work on figuring out the finer details of implementing them in the coming weeks. Hopefully this will help trigger a meaningful discussion within the community, and we will have a chance to hear other opinions on this as well.

by gtrivedi at July 21, 2014 01:56 AM

July 20, 2014

Elana Hashman
(OpenHatch student)

Failboats

Found in oh-mainline:

def catch_me(request):
    failboat  # NameError

by Elana Hashman at July 20, 2014 11:00 PM

July 18, 2014

Rajeev S
(GNU Mailman student)

Mailman Gets a Custom Shell!

Had a busy week, but happy that I have been able to keep up with the proposal timeline. Now that things have cooled down a bit, I believe I can blog. The lack of documentation for the Cmd module gave me a hard time in the beginning, but I feel a lot better now.

The Mailman shell now covers most of the functionality of the command line tools. During this phase of Summer of Code, I have completed the following features for the shell.

  1. Show objects

    I would call the `show` object functionality one of the coolest parts of the shell. The show object supports different kinds of filters, like equality, regular expression search and list membership checks. This has been made possible by building a custom Filter class that determines the type of filter to be applied based on the user's choice. The Filter class makes it possible to add new filters easily.
    Further, the show function supports using multiple filters in conjunction, giving a great deal of filtering ability over the data set.
    The show object reuses the long list feature that has been used in the command line tools.

  2. Create objects
    The create object commands are used to create Mailman objects, just as in the command line tools, and use almost identical commands.

  3. Delete objects
    The delete object function follows the show object syntax, that is, it supports all the available filters. This makes mass deletion easy, especially deletion of test data that follows a certain pattern.

  4. Environment variables
    The shell supports shell variables that can be set, reset or unset. The user can declare variables that can be reused in commands. Further, a few special variables like list, domain etc. are appended automatically to the respective commands if their values are set.
    A user can see the environment variables that have been set using the show variable command.

  5. Disable/Enable environment
    The usage of environment variables can be suppressed explicitly using an environment disable command. It can be re-enabled using an enable command.

  6. Subscribe/Unsubscribe
    Users can be subscribed to or unsubscribed from lists using the shell commands.

Apart from the above features, the shell supports general features like auto-complete, which makes the shell simpler to use. Errors are caught and displayed at a single point, so unexpected failures and exits due to exceptions are avoided. A few built-in functions of the Python cmd module have been customized and overridden to suit the shell. This helps reduce a lot of junk code.
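For readers unfamiliar with the cmd module, here is a minimal sketch (not the actual Mailman shell code) of how a custom shell with a show command and auto-completion is built on top of cmd.Cmd:

import cmd

class MailmanShell(cmd.Cmd):
    """A toy shell in the spirit of the one described above."""
    prompt = '>>> '
    objects = ['list', 'domain', 'user']

    def do_show(self, line):
        """show <object> [filters]: display objects, optionally filtered."""
        print('showing %s' % (line or 'everything'))

    def complete_show(self, text, line, begidx, endidx):
        # Auto-complete object names for the show command.
        return [o for o in self.objects if o.startswith(text)]

    def do_EOF(self, line):
        return True              # exit the loop on Ctrl-D

if __name__ == '__main__':
    MailmanShell().cmdloop()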

Immediate tasks remaining in the shell are the development of help commands and documentation of the functions written, which is expected to be done in a couple of revisions. Other functional tasks remaining in the shell are the Preferences-related functions.




by Rajeev S (noreply@blogger.com) at July 18, 2014 06:56 PM

Saurabh Kathpalia
(MoinMoin student)

GSoC Update: July 18 2014

Last week's summary:

Added CSS for the +tickets view for the modernized and foobar themes.
For modernized theme: codereview and commit to my repo.
Here is the screenshot


For foobar theme: codereview and commit to my repo.
Here is the screenshot



Added some more meta-data information (specific to blog items, such as ptime, supertags) to be displayed in the modify view of blog items. Here is the codereview and commit to my repo.

Also finalized the solution of #433: added User Actions, Item Actions bar, Show, History, Modify and other important links in the view template of nameless items in the basic theme. Here is the codereview and commit to my repo.

Also made the summary field in blog entries a required field, so that the summary is shown on the blog entry view instead of the name. Here is the codereview

Now I am working on improving the UI of blog items in all the themes.

by saurabh kathpalia (noreply@blogger.com) at July 18, 2014 04:12 PM

July 15, 2014

Rishabh Sharma
(SunPy student)

Near Completion

Well an update on the work.
Unified Downloader
All the major code appears to have been written. The Unified Downloader retains the semi-SQL query ability (VSO type). It has a new response object: a UnifiedResponse object is now returned from the query method.
No major code base problems were faced; I cannot even remember any git issues.

JSOC Attributes:
My work here is in the final stage of submission. PR review is also nearly complete.

Other PRs:
Some minuscule issues remain; otherwise everything is up.

Thanks


by rishabhsharmagunner at July 15, 2014 03:06 PM

Julia Medina
(Scrapy student)

Two weeks into second half

The work I've done in the last month and a half has been merged into the master branch of Scrapy's main repository, and it's part of the new changes in the last 0.24 release. I'm happy that I was able to finish my contribution in time and get it to a stable stage, wrapping up the first half of my assignment.

I've started the changes to the API with the Spider modifications. Since these changes are mainly additions to the Spider base class, it was a reasonable task to start with. A new from_crawler class method was introduced to handle passing the current context of execution (Crawler internals) to each spider at initialization time. This was implemented as non-intrusively as possible, since maintaining backward compatibility in this key component of every Scrapy project is a major concern.

After that, I decided to focus on some of the changes in the Crawler class. I began adjusting the __init__ and crawl methods to the new parameters that were previously defined in my proposal. Changing signatures of public functions is a sensitive matter, since it could easily break backward compatibility. Python doesn't support function overloading, so we have to work around taking different inputs for our routines.

Taking this into account, and making note that its usage has to be as simple as possible to actually be helpful, I made the following decorator to allow different positional arguments in the same function:

import inspect
import warnings
from functools import wraps

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None,
                         warning_msg="Calling {fname} with the given "
                                     "arguments is no longer supported, please refer "
                                     "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                call_bindings = inspect.getcallargs(old_implementation, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_implementation

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError
An issue with this approach is that we need a reference to the old_implementation function object, which is not reachable if we are decorating a method and its old implementation is another method in the same class, since methods are created only after the whole class is defined. A more flexible and convenient implementation allows dynamically loading both old_implementation and the types of the arguments:

from importlib import import_module   # in addition to the imports above


def deprecated_signature(old_implementation, types=None, module='',
                         warning_msg="Calling {fname} with the given "
                                     "arguments is no longer supported, please refer "
                                     "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.

    If `old_implementation` is a string, that reference will be dynamically
    loaded from the `module` provided. If that last argument is an empty string
    (default) it will be expanded to the module of the decorated routine. If
    it's the None value, the module will be stripped out of the
    `old_implementation` string.

    Values from the `types` dictionary can also be dynamically loaded, but it
    will always be assumed that each string contains the full path for the
    object.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                old_call = _load_or_return_object(old_implementation,
                                                  module or func.__module__)
                call_bindings = inspect.getcallargs(old_call, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_call

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    types = {k: _load_or_return_object(v) for k, v in six.iteritems(types)}
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError


def _load_or_return_object(path_or_object, module=None):
    if isinstance(path_or_object, six.string_types):
        if module is None:
            module, path_or_object = path_or_object.rsplit('.', 1)
        return _load_object(path_or_object, module)
    return path_or_object


def _load_object(path, module):
    parent = import_module(module)
    for children in path.split('.'):
        parent = getattr(parent, children)
    return parent
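To illustrate how such a decorator might be applied, here is a hypothetical usage example (not actual Scrapy code), assuming crawl() used to take a spider name string and now takes a spider class:

class Crawler(object):

    def _old_crawl(self, spider_name, settings=None):
        pass  # legacy behaviour: look the spider up by name, then delegate

    @deprecated_signature('Crawler._old_crawl', types={'spider_name': str})
    def crawl(self, spidercls, *args, **kwargs):
        pass  # new behaviour: receive the spider class directly

With this, calling crawler.crawl('myspider') matches the old signature (a string spider_name), so it emits a ScrapyDeprecationWarning and dispatches to _old_crawl, while crawler.crawl(MySpiderClass) falls through to the new implementation.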

The next step is to merge CrawlerProcess into Crawler, but I judged it best to leave that for later, since it would probably break a lot of tests, and I won't be able to verify subsequent changes until I have fixed them.

I proceeded to work on the scrapy.cfg config file. I've deprecated two settings, SPIDER_MODULES and SPIDER_MANAGER_CLASS, in favor of options in this config file. Currently I'm working on populating them from the settings file for backward compatibility, while doing a simple check on required options and expansion of defaults.

I haven't created a pull request yet since I'm constantly rewriting the commit history, but my progress can be tracked on my local repository, in the api-cleanup branch.

by Julia Medina (noreply@blogger.com) at July 15, 2014 03:53 AM

Michael Mueller
(Astropy student)

Week 8

After improving the fast writer and fixing up some lingering issues last week, I spent this week looking into porting my reading implementation to numpy and reading the memory mapping code in io.fits to learn more about memory mapping. This felt like a bit of a slow week, since I was sick on Thursday and had trouble figuring out a problem with numpy's build system on Friday, but I think my work on adapting my implementation to genfromtxt() should definitely come in useful (at least at some point when I have time to finish the work).
I tidied up the code and made a few final PRs on Monday and Tuesday, and then at Tom's suggestion I opened a PR for review (https://github.com/astropy/astropy/pull/2716#issuecomment-48749423). Since it's a big PR, I expect it'll take some time to receive an extensive code review, but it's nice to have a roughly final implementation that passes on Travis CI. I also opened a PR in astropy-benchmarks for the io.ascii benchmarks I wrote at the beginning of the summer (https://github.com/astropy/astropy-benchmarks/pull/7). From there I went on to look at the memory mapping in io.fits and read about memory mapping in general, and I'm sure this will be something to discuss in the next hangout; I'm curious whether the Python mmap module will work with an underlying C layer, or how else memory mapping might be achieved.
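For reference, memory-mapping a file from pure Python is only a few lines (a minimal sketch with a placeholder file name); whether such a mapping can be shared with the underlying C tokenizer is exactly the open question:

import mmap

with open('data.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_line = mm[:mm.find(b'\n')]   # pages are faulted in lazily by the OS
    mm.close()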
More importantly, I spent much of the week changing my text reading code to work in numpy as an engine for genfromtxt(). After Tom pointed me to a thread on the numpy discussion list, I forked numpy and started writing a new version of genfromtxt() in a separate branch: https://github.com/amras1/numpy/tree/faster-genfromtxt. I also spoke to a numpy developer on IRC who participated in the mailing list thread, and he believed that my implementation might be of use. He also explained that a particularly important problem with loadtxt() and genfromtxt() is their broken handling of unicode with Python 3, which necessitates a full rewrite of the two functions. After that, I worked on altering the C tokenizing code to handle UTF-8 rather than plain ASCII and wrote related code to have the Cython/C parser work underneath genfromtxt(). As of now, genfromtxt() is working with relatively simple sample data, although there's still more to do with handling the function's parameters (e.g. missing_values and usemask).
Adapting the Cython/C reader to numpy has been a little more difficult than I thought, but I'm glad to have learned a lot of useful information about unicode and the differences between Python 2.x and 3.x. I'm not really sure what I'll be doing next week, but I'll know after the hangout meeting tomorrow.

by Michael Mueller (noreply@blogger.com) at July 15, 2014 03:44 AM

Gaurav Trivedi
(Kivy student)

Kivy Idle Week

Bionoid has graciously agreed to ship an iOS development device to my place (our Kivy community is awesome!). While I was waiting for it, I took a break this week to attend a couple of interesting lectures happening in the city. Although I have otherwise been sticking closely to the plan, I will put in extra effort once the device reaches me to make up for this.

Also, I have decided to work on facades interfacing with accessibility systems on the various platforms. I didn’t have that as a part of my original summer plan but it seems like an interesting idea as suggested by tshirtman. I will investigate the common features and work on a plan to implement this.

by gtrivedi at July 15, 2014 01:26 AM

July 14, 2014

Hamzeh Alsalhi
(scikit-learn student)

Sparse Target Data for One vs. Rest Classifiers

Going forward with sparse support for various classifiers, I have been working on a pull request for sparse one vs. rest classifiers that will allow for sparse target data formats. This will result in a significant improvement in memory usage when working with large amounts of sparse target data; a benchmark is given below to measure the improvement. Ultimately, what this means for users is that, using the same amount of system memory, it will be possible to train and predict with an OvR classifier on a larger target data set. A big thank you to both Arnaud and Joel for the close inspection of my code so far and the suggestions for improving it!

Implementation

The One vs. Rest classifier works by binarizing the target data and fitting an individual classifier for each class. The implementation of sparse target data support improves memory usage because it uses a sparse binarizer to give a binary target data matrix that is highly space efficient.

By avoiding a dense binarized matrix, we can slice out the one column at a time required for each classifier and densify only when necessary. At no point will the entire dense matrix be present in memory. The benchmark that follows illustrates this.
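As a toy sketch of that idea (using scipy.sparse directly, not the actual scikit-learn code):

import numpy as np
from scipy import sparse

# Keep the binarized target matrix sparse and densify only the single
# column needed by each underlying binary classifier.
Y = sparse.csc_matrix(np.array([[1, 0, 0],
                                [0, 1, 0],
                                [1, 0, 1]]))      # (n_samples, n_classes) indicator
for i in range(Y.shape[1]):
    y_column = Y.getcol(i).toarray().ravel()      # dense 1-D targets for class i only
    # ...fit the i-th binary estimator on (X, y_column) here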

Benchmark

A significant part of the work on this pull request has involved devising benchmarks to validate intuition about the improvements provided, Arnaud has contributed the benchmark that is presented here to showcase the memory improvements.

By using the module memory_profiler we can see how the fit and predict functions of the ovr classifier affect the memory consumption. In the following examples we initialize a classifier and fit it to the train dataset we provide in one step, then we predict on a test dataset. We first run a control benchmark which shows the state of one vs. rest classifiers as they are without this pull request. The second benchmark repeats the same steps but instead of using dense target data it passes the target data to the fit function in a sparse format.

The dataset used is generated with scikit-learn's make_multilabel_classification, with the following call:
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse=True, return_indicator=True,
                                      n_samples=20000, n_features=100,
                                      n_classes=4000, n_labels=4,
                                      random_state=0)

This results in a densely formatted target dataset with a sparsity of about 0.001

Control Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 179.824 MB
 
est.predict(X)
consumes -73.969 MB. The negative value indicates that data has been deleted from memory.

Sparse OvR PR Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 27.426 MB 

est.predict(X)
consumes 0.180 MB 

Improvement

Considering the memory consumption for each case as 180 MB and 30 MB we see a 6x improvement in peak memory consumption with the data set we benchmarked.

Upcoming Pull Request

The next focus in my summer of code, after finalizing the sparse one vs. rest classifier, will be to introduce sparse target data support for the knn and dummy classifiers, which have built-in support for multiclass target data. I have begun the knn pull request here. Implementing a row-wise mode calculation for sparse matrices will be the main challenge of the knn PR.
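As a rough illustration of why that is non-trivial, a row-wise mode for a CSR matrix has to account for the implicit zeros as well as the stored values; a toy sketch (not the eventual scikit-learn implementation) might look like this:

from collections import Counter

import numpy as np
import scipy.sparse as sp

def sparse_row_mode(csr):
    """Toy row-wise mode of a CSR matrix, counting the implicit zeros."""
    modes = np.zeros(csr.shape[0], dtype=csr.dtype)
    for i in range(csr.shape[0]):
        row = csr.getrow(i)
        n_zeros = csr.shape[1] - row.nnz
        if row.nnz == 0:
            continue                              # all-zero row: the mode is 0
        value, count = Counter(row.data).most_common(1)[0]
        if count > n_zeros:
            modes[i] = value                      # a stored value outnumbers the zeros
    return modes

print(sparse_row_mode(sp.csr_matrix(np.array([[1, 1, 0, 2],
                                              [0, 0, 0, 3]]))))  # -> [1 0]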

by Hamzeh (noreply@blogger.com) at July 14, 2014 02:21 PM

Mainak Jas
(MNE-Python student)

mpld3, python3.3, topomaps and more


We are approaching a merge of the 3000-line PR for the mne report command. One of the features that I incorporated last week was an interactive plot_events using mpld3, which combines the power of matplotlib and d3.js. Unfortunately, a recent PR breaks things, as mpld3 has not yet implemented ax.set_yticklabels() in its latest release. So, we decided to leave out interactivity for now and push for a merge before the 0.8 release of MNE-Python.

Travis is not yet happy because of Python 3.3 errors. I am going to use the six package to handle methods that have changed from Python 2.7 to Python 3.3. There were issues with os.path.abspath() as it does not resolve symlinks, so we replaced it with os.path.realpath().

Finally, thinking of possible use cases, we realized that including the topographic maps (see figure below) of the evoked response would be useful for the report.

The show parameter for the plotting function was somehow missing, so I fixed it in this PR. Finally, the plots were ordered by section rather than by figure, which was addressed in this commit.

The next steps are as follows:
  • Speed up the tests by using a decorator for the tests which uses a small string for the plot instead of making the actual plot. That way, we do not end up testing the plotting functions twice.
  • A separate PR for a repr for the inverse operator.
  • Miscellaneous fixes for merging the mega-PR: Python3.3 errors, not allowing multiple saves by throwing errors etc.

by Mainak (noreply@blogger.com) at July 14, 2014 01:03 PM

Richard Tsai
(SciPy/NumPy student)

Rewrite scipy.cluster.hierarchy

After rewriting cluster.vq, I am rewriting the underlying implementation of cluster.hierarchy in Cython now.

Some Concerns about Cython

The reason why I use Cython to rewrite these modules is that Cython code is more maintainable, especially when using NumPy’s ndarray. Cython provides some easy-to-use and efficient mechanisms to access Python buffer memory, such as Typed Memoryview. However, if you need to iterate through the whole array, it is a bit slow. This is because Cython will translate A[i, j] into something like *(A.data + i * A.strides[0] + j * A.strides[1]), i.e. it needs to calculate the offset in the array data buffer on every array access. Consider the following C code.

int i, j;
double s;
double *current_row;

/* method 1 */
s = 0;
current_row = (double *)A.data;
for(i = 0; i < A.shape[0]; ++i) {
    for(j = 0; j < A.shape[1]; ++j)
        s += current_row[j];
    current_row += A.shape[1];
}

/* method 2 */
s = 0;
for(i = 0; i < A.shape[0]; ++i)
    for(j = 0; j < A.shape[1]; ++j)
        s += *((double *)A.data + i * A.shape[1] + j);

The original C implementation uses method 1 shown above, which is much more efficient than method 2, which is similar to the C code that Cython generates for memoryview accesses. Of course method 1 can be adopted in Cython, but the neatness and maintainability of the Cython code will suffer. In fact, that is just how I implemented _vq last month. But _vq has only two public functions while _hierarchy has 14, and the algorithms in _hierarchy are more complicated than those in _vq. It would be unwise to just translate all the C code into Cython with loads of pointer operations. Fortunately, the time complexities of most functions in _hierarchy are \(O(n)\). I think the performance loss of these functions is not a big problem, and they can just use memoryview to keep maintainability.

The New Implementation

The following table is a speed comparison of the original C implementation and the new Cython implementation. With the use of memoryview, most functions have about 30% performance loss. I used some optimization strategies for some functions and they run faster than the original version. The most important function, linkage, has a 2.5x speedup on a dataset with 2000 points.

function                           old         new
calculate_cluster_sizes            4.273 us    6.996 us
cluster_dist                       68.224 us   108.234 us
cluster_in                         78.020 us   102.400 us
cluster_maxclust_dist              32.913 ms   1.174 ms
cluster_maxclust_monocrit          35.539 ms   1.454 ms
cluster_monocrit                   113.104 us  46.170 us
cophenetic_distances               13.052 ms   13.458 ms
get_max_Rfield_for_each_cluster    30.679 us   42.969 us
get_max_dist_for_each_cluster      27.729 us   42.977 us
inconsistent                       105.663 us  146.673 us
leaders                            33.941 us   47.849 us
linkage                            4.476 s     1.794 s
prelist                            29.833 us   41.334 us

The new implementation is not finished yet. All tests pass, but there may still be some bugs, and it lacks documentation for now.

by Richard at July 14, 2014 09:19 AM

PritishC
(SunPy student)

Break – Fix – Repeat!

Cadair commented 9 hours ago

@VaticanCameos I broke it

This quote seems to sum up all that we’ve done in the past few weeks. So far, the Coordinates Framework PR has 80 commits. We are to have a meeting on Tuesday to gauge progress and set the path for the rest of the month.

It has been quite an experience. Having complete ownership of a module isn’t easy. One has to think of design and implementation, testing and experimentation, etc. all in quick succession.  On top of that, the repository does not stay the same all the time – one is always on the bleeding-edge of things. Having said that, my mentor has been really supportive. I have migrated to a new laptop, and on his insistence, tried to install ArchLinux but failed due to some problems with the Radeon graphics card. Nevertheless, I am on Linux Mint for the time being!

A summary of things:

  1. Some coordinate frames have undergone several changes – Heliographic and Helioprojective, more precisely. Helioprojective ‘Cartesian’ is in fact a spherical representation masquerading as Cartesian, so it was decided that it would be better to change the representation to spherical, but keep the attribute names as Tx and Ty.
  2. Astropy have introduced the RepresentationMapping class, which is a useful encapsulation of the mapping of frame representation attributes to the names that the designer wishes to specify, as well as their units. RM objects take a 3-tuple, thus.
  3. Helioprojective frames now have a new property – ‘zeta’ – which was earlier being mapped to the distance attribute. On reviewing the code, we found that distance should actually be ‘d’ (which was a FrameAttribute then) and it is better to have zeta as a property which returns the value of ‘D0 – d’. ‘zeta’ can be used in the constructor to generate a value for ‘d’. This ultimately fixed most of the transformation issues we had from Helioprojective to other classes, except for a few corner cases which we are now working on.
  4. A lot has been learned about how to default attributes in the frame constructor. Careful consideration of the various input cases was necessary. Heliographic frames now default ‘rad’ and a FrameAttribute called ‘dateobs’. The empty frame case was the biggest thorn – breaking a lot of things. Since Astropy’s code purposefully generates an empty frame of the same class to check certain things, it was necessary to keep empty frame design clear in mind. Helioprojective frames can default ‘D0′ and ‘d’.
  5. SphericalRepresentation was subclassed to create Spherical180WrapRepresentation. This was done after Stuart noticed that the ‘hlon’ attribute actually spans a range of -180 to +180 degrees. Creating a negative valued longitude was not possible with ordinary Longitude objects. Thus, a new representation had to be created. All frames which were earlier on spherical representations have been shifted to this new representation.
  6. Currently, there are a few corner cases that have to be fixed. These include transformations from Helioprojective to Heliographic when all three attributes are specified as zero and transformation from Heliographic Stonyhurst to Heliographic Carrington which is breaking, presumably because of the dateobs parameter. The dateobs parameter will also be used to generate the Carrington offset for transformation.
  7. Lastly, the coordinates framework has, for the most part, successfully been integrated into the Ginga project! It is now used to model solar coordinates and visualize SunPy maps. This is great news, and it will also push us to work harder on keeping the framework bug free.

In other news, I have started delving into other projects, such as working on an IRC bot, some competitive programming (really lagging on this) and – of course – looking for jobs! The main thing though, is the side project that I have acquired from my mentor to build #sunpy’s irc logs website in Flask. It’s going to take a while as I am completely nil in web dev, but it will be quite worth it!

Oh, and here, have a gif which signifies what Germany have been like this World Cup. Over and out!

Germanic Slaughter


by xpritish at July 14, 2014 06:12 AM

Elana Hashman
(OpenHatch student)

OSBridge and CSS

BEES
Figure 1: Strong sticker yields for the 2014-2015 conference season

The past three weeks have been centered around conference activities and struggles with front-end development. The contrast leaves a wide gap: the biggest highlight of the summer thus far was definitely attending Open Source Bridge, while the lowest of the lows has been fighting with CSS in the hopes it might make things pretty.

Open Source Bridge

Attending OSBridge has been one of the best conference experiences I've ever had. It's also one of the most unique conference experiences I've ever had. I have heard that the conference draws from a philosophy described as "radical inclusivity", and having experienced it now myself it is absolutely something I can get behind:

  • All food served was vegan, unless specified otherwise (then it was "just" vegetarian)
  • Lanes for the flow of movement and strong mobility accommodation were provided
  • All bathrooms had gender-inclusive labeling
  • I was never the only woman in the room (except while staying at my overnight accommodations)
  • A great balance was struck between "hard" technical content, business and open source culture, and social issues/community problems

Some of the most fun I had was meeting up with other OpenHatchers, some for the first time. We went out for a lovely dinner and had a great time working together in person, for a change! The people were the highlight of the conference, and as most likely the youngest conference attendee, I had a little bit of difficulty getting over some squee-minor celebrity moments. Julie told us on the first day (at 9AM, even) to kill our heroes so I tried to take that to heart. But on the other hand, FIREBEES STICKERS?!

So many smiling faces
Figure 2: OpenHatchers go out for dinner

Though videos have not yet been posted, where applicable, here were some of my favorite talks and sessions:

And I'd also like to bring attention to Shauna and Britta's talks, since the other OpenHatchers deserve a spotlight on my blog:

I even gave a talk of my own! It was less than five minutes long and gave an overview of my project. It was not filmed, but you can check out the (very sparse) slides here. Those in attendance claimed I was "calm and competent." I'm not sure I would agree on either account but I don't have video evidence to refute the claim with :)

Development

The last two weeks should have seen the delivery of more workable edit capability. But due to delays with the editing application we chose to use, django-inplaceedit, things haven't been moving quite as quickly or smoothly as we'd desire. CSS bugs abound and not every editing task has been as easy as we'd like.

gsoc14.8 deliverables

The intended deliverables for this week were as follows, still following this tracking item:

  • Make remaining fields editable on the bug set view.
  • Tweak django-inplaceedit CSS to look better.

But work for this week was pushed back due to the Canada Day and July 4th long weekends and troubles with the CSS fixes, which were assumed to be much more straightforward than they turned out to be. It seems that CSS is *complicated*. Not to leave an impression that I didn't accomplish anything this week, here is a code sample, not committed until the following week due to delays on content. I also have attached lots of fun screenshots!


Figure 3: Edit a bug title



Figure 4: Change a bug's status



Figure 5: Successful edit

gsoc14.9 deliverables

Initially, this milestone had a single deliverable of "Create/Edit Bug Set Screen."

However, at this point the milestone most likely requires a rescope, as the work from gsoc14.8 is still not quite complete. It's incredibly frustrating, but at least we left time to work out the kinks.

What's next

At this point, it is effectively a list:

Obstacles

CSS has been the bane of my existence, the scourge of my livelihood, the pox on my $redacted for the last two weeks. I have never before been stuck with the challenge of integrating two disjoint web applications with slightly broken CSS that might look pretty on its own independently in each case but clash like red and green when brought together. I am not exaggerating when I say this was the worst experience I have had with programming in my entire brief career. (At least I now have further assurance that I have no desire to become a web developer...)

Asheesh has been great with helping me out, but we're still a bit delayed. He sent me a fun CSS Diner tutorial which helped along my progress with selectors quite a bit. Part of the difficulty is determining what is "good enough" for the end product, as the OpenHatch team involves many experienced designers and web developers that might be able to accomplish easily what I am struggling with so much. There is also a commitment to a minimal viable product, but we must balance this with usability—a minimum requirement for release. So it makes it difficult deciding exactly how great this needs to be before we can move on.

School and time management continue to haunt me reliably like the little demons they are, but until someone actually implements sudo service schoold stop, I don't think there's much to be done in the meantime. It is a reality that must be faced.

by Elana Hashman at July 14, 2014 01:00 AM

July 13, 2014

Brigitta Sipőcz
(Astropy student)

in the second half

As it was mentioned in earlier posts, the aperture photometry API is going through a major refactoring this summer, and in the past two weeks I mainly worked on that.

Read more »

by Brigitta Sipocz (noreply@blogger.com) at July 13, 2014 11:24 PM

Vytautas Jančauskas
(TARDIS-SN student)

Two Weeks in to the Second Half of GSoC

Since the bulk of the work is finished now, I concentrated mainly on the boring, long and tedious bits. Enforcing structured programming and object-oriented techniques like encapsulation are some of the things I was doing. Also, there are plans to start integrating the changes into the main code base and help people implement their physics in the new infrastructure. One interesting issue I currently have and was weirded out by is this:

I have this code in my fork:

  if (x_insert > x[imin] || x_insert < x[imax])
    {
      ret_val = TARDIS_ERROR_BOUNDS_ERROR;
      // Crashes without this exit, this doesn't make sense.
      exit(1);
    }

This logical branch never gets executed unless something goes terribly wrong, and I checked that it is never run with the data I use. However, for some reason, if I comment out the exit(1) line the program crashes. It would be very interesting if someone was able to shed some light here or at least give a lead. Other than that I cannot pinpoint any roadblocks, except maybe that, due to the tedious nature of the current tasks (mostly maintenance and code improvement with no noticeable immediate practical effect), the enthusiasm is winding down quite a bit.

by Vytautas Jančauskas (noreply@blogger.com) at July 13, 2014 10:07 PM

Saimadhav A Heblikar
(Core Python student)

GSoC 2014 - Overview of workflow and resources

In this blogpost, I will link to all GSoC 2014 related resources. If you intend to regularly check my GSoC progress, it is advisable that you bookmark this page. It will link to all resources.

Read more »

by Saimadhav Heblikar (noreply@blogger.com) at July 13, 2014 05:33 PM

Asra Nizami
(Astropy student)

Weeks 7-8

I just counted and so far I have 14 merged pull requests which makes me really happy! Not all of them are huge PRs, a few of them are minor documentation changes but some of them took a bit more time to work on as I was implementing some new feature in WCSAxes or fixing some bug.

The latest thing I've been working on is integrating WCSAxes into APLpy. So APLpy is an Astropy affiliated package that also plots astronomical images and uses matplotlib for all the plotting tasks. My current task is to make changes to APLpy so that it uses WCSAxes to locate and plot all the ticks, labels, grids etc instead. So far things have been going pretty smoothly. Most of the functionality of APLpy is in place - there are a few small things here and there but I'm not too worried now that it's mostly working. Things have been going so smoothly in fact that I started to get a little bored with the work and consequently, my productivity decreased significantly.

Tom, my mentor, was at a conference this week so I haven't had a Hangout with him to discuss my latest work, which might also be why I haven't been very productive this week. I think my brain perceives our weekly Hangouts as deadlines where I have to report on my progress and that keeps me motivated. I like deadlines - there I said it! Well maybe having stressful deadlines all the time wouldn't be very pleasant but deadlines are kind of fun. They keep you focused and there's that edgy feeling you get close to the deadline 'can I finish this in time?!' Usually I just set my own personal deadlines like 'I have to get this thing done in two days, maximum four days' etc and that keeps me happy but like I said, I got a little bored.

Towards the end of the week, Tom said I could try implementing another feature in WCSAxes if I wanted so I decided to focus on that instead. Currently WCSAxes only plots major ticks but it would be useful if it could plot minor ticks as well.


(In case someone doesn't know about major and minor ticks from matplotlib, the red ticks are minor ticks and the black ones are major ticks. Also this is just an image produced with matplotlib, not WCSAxes)


This is a bit tricky because so far I haven't had to actually understand the math behind WCSAxes' calculations but now might be the right time to understand how it works. I currently have a very faulty implementation of minor ticks but it's fun to work on and hopefully I'll figure it out! :)

by Asra Nizami (noreply@blogger.com) at July 13, 2014 09:49 AM

Asish Panda
(SunPy student)

Progress, 2 weeks after mid term

Hey all. Again a mandatory post. The PR I created is done; it only needs merging now. I made a mess of my local git repo when a commit was corrupted. It did not end there, as it was followed by a weird error even after removing the corrupted empty files. I tried to fix it but failed, so I had no choice but to clone it once again. That again led to a “python.h cannot compile” error while compiling, even after installing all python-dev updates. I don’t remember exactly how, but it was a problem with pyc files hanging around. I removed them manually and it worked.

As for now what I will be doing next, I think I will go with the wcs module. I have yet to start that and I think it will be much more interesting than my previous tasks.

I still have yet to merge the changes to maps and spectra, but I think that can wait till I get comments on how to improve them.

I guess that’s all for now. I am not good at writing about the stuff I do, so check out the PR or my forked repo for details!

Thanks!


by kaichogami at July 13, 2014 07:48 AM

July 12, 2014

Shailesh Ahuja
(Astropy student)

Using specutils for reading and writing FITS files

I have just finished developing the readers and writers for FITS IRAF format described here. I am going to give a small tutorial on how the specutils package (affiliated with Astropy) can be used for the same, and also give some details of how it works.

To read a FITS file, simply use the specutils.io.read_fits.read_fits_spectrum1d() method. This method might later be moved to the Spectrum1D class. The following code snippet gives an example:

from specutils.io import read_fits
spectra = read_fits.read_fits_spectrum1d('example.fits')

Depending on the type of FITS file, the returned object can be different. For linear or log-linear formats, the object returned will be a Spectrum1D object. For multispec formats, the object will be a list of Spectrum1D objects. The dispersion and the flux of a spectrum can be accessed using spectrum1Dobject.dispersion and spectrum1Dobject.flux respectively. The dispersion is automatically computed based on the type of dispersion function defined in the header. Other methods, such as slicing the dispersion for a particular set of flux coordinates, will be supported soon.

To write a FITS file, use the specutils.io.write_fits.write() method. Like the reader, this method may be moved to Spectrum1D later. The following code snippet gives an example:

from specutils.io import write_fits
write_fits.write(spectra, 'example.fits')

The spectra object passed can be a list of Spectrum1D objects or just one Spectrum1D object. Depending on the type of spectra, the writer will write in either multispec or linear format respectively. If the spectra were read in using the reader, then the additional information from the read file will also be written.

There will be some more operations supported, but the focus will be on keeping things simple, and handling the complexities in the code itself.

by Shailesh Ahuja (noreply@blogger.com) at July 12, 2014 01:23 PM

Saimadhav A Heblikar
(Core Python student)

GSoC 2014 - Updates for weeks 6 to 8

A look at what I have been upto during Weeks 6, 7 and 8 in my IDLE Improvements project as a part of Google Summer of Code 2014.
I have mainly been working on adding functionality to IDLE so that one can analyze the file being edited from within IDLE.

Read more »

by Saimadhav Heblikar (noreply@blogger.com) at July 12, 2014 01:11 PM

M S Suraj
(Vispy student)

Ellipse, Regular polygons, and tests – GSoC Week 7

I successfully passed the mid-term evaluation of GSoC. Yay! Coming to the work done so far, I have implemented the EllipseVisual and RegularPolygon visuals, and am in the process of writing image-processing-based tests for them.

I implemented EllipseVisual as a generic object which can be used to draw arcs and circles, as well as be reused to draw regular polygons. Given a position, radii and span angles, the vertices are computed and drawn as triangle_fans. Drawing using GL_TRIANGLE_FAN made it really easy to implement the span angle part, so that arcs can be easily drawn now. The code can be viewed here – https://github.com/vispy/vispy/blob/master/vispy/scene/visuals/ellipse.py . Here are some examples:

ellipse2

ellipse1

ellipse3

Similarly changing the number of segments of the triangle fan to match the number of sides resulted in regular polygons. Here’s the code: https://github.com/vispy/vispy/blob/master/vispy/scene/visuals/regular_polygon.py

regular_polygon1
regular_polygon2

Currently, no tests exist for any of the visuals. Tests are crucial for keeping the visual accuracy of the visuals in check. At the moment, I'm trying out image matching algorithms to see which one suits the need. I will publish a write-up on it once I have finished writing the tests.


by mssurajkaiga at July 12, 2014 12:29 AM

July 10, 2014

Alan Leggitt
(MNE-Python student)

Day 37: Aligning source spaces


In case you're wondering what I've been up to all week, I got some feedback from the mne community about how
to incorporate source spaces based on volumes of interest into the existing mne code. We decided to implement
volumes of interest using the function mne.setup_volume_source_space. If you check out my pull request, you
can see the various steps I've been through to make this possible. The biggest hurdle was incorporating an
interpolation matrix. This matrix transforms vertices based on their individual vertex numbers in 3d source
space into mri space.

I won't go into detail about the process, but I've included a plot below showing how 3 different source spaces
align in mri space. The surfaces of the left and right hemispheres are in blue and green, respectively, the
whole brain volume is in red, and a volume of interest based on the left cortical white matter is in black.

In [1]:
%matplotlib inline
%load transform_workspace.py
In [3]:
import mne
from mne.datasets import sample
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

mne.set_log_level(False)
data_path = sample.data_path()
subject = 'sample'
subjects_dir = data_path + '/subjects'
aseg_fname = subjects_dir + '/sample/mri/aseg.mgz'

# create a whole brain volume source
vol_src = mne.setup_volume_source_space(subject, mri=aseg_fname,
                                        subjects_dir=subjects_dir)

# create left and right hemisphere cortical surface sources
srf_src = mne.setup_source_space(subject, subjects_dir=subjects_dir,
                                 overwrite=True)

# create a left hemisphere white matter source
lwm_src = mne.setup_volume_source_space(subject, mri=aseg_fname,
                                        volume_label='Left-Cerebral-White-Matter')

# get the coordinates from all sources
vol_rr = vol_src[0]['rr']
ix = vol_src[0]['inuse'].astype('bool')
lh_rr = srf_src[0]['rr']
rh_rr = srf_src[1]['rr']
lwm_rr = lwm_src[0]['rr']

# plot the source spaces in 3 dimensions
ax = plt.axes(projection='3d')
ax.plot(vol_rr[ix, 0], vol_rr[ix, 1], vol_rr[ix, 2], 'r.', alpha=0.1)
ax.plot(lh_rr[:, 0], lh_rr[:, 1], lh_rr[:, 2], 'b.', alpha=0.1)
ax.plot(rh_rr[:, 0], rh_rr[:, 1], rh_rr[:, 2], 'g.', alpha=0.1)
ix = lwm_src[0]['inuse'].astype('bool')
ax.plot(lwm_rr[ix, 0], lwm_rr[ix, 1], lwm_rr[ix, 2], 'k.', alpha=0.5)

plt.show()

The add_dist parameter to mne.setup_source_space currently defaults to False, but the default will change to True in release 0.9. Specify the parameter explicitly to avoid this warning.

by Alan Leggitt (noreply@blogger.com) at July 10, 2014 10:29 PM

Roy Xue
(Theano student)

Next Step #GSoC2014-Theano

Next Step


I just finished my final exams. So it is time to continue my previous tasks. (Actually, when I post this, I've already started.)

Fred is on his vacation, so these days I will contact Arnaund for help. Through this, I have started to understand why GSoC requires more than one mentor: not only to cover one mentor’s temporary absence, but also because different mentors can provide advice from different perspectives, which helps me learn more and makes the code better.

I have now completed some updates to my previous PR, mainly simplifying its structure and doing a lot of cleanup work, just to make it better. It’s waiting to be merged now: Compute minimum peak #1934

The following is the description of my next part (thanks Fred for writing this):


This is for the GSoC work by Roy. Roy, wait until you finish the current step to check this (or at least have fixed the important part), as it will distract you. But I would appreciate feedback now from other people:

For this, I would start by working only with the “vm” linker, when there is no lazy variable. I would also restrict it at first to ndarrays of 0 dimensions. That pushes the interaction with the shapefeature to later. That would simplify getting the base working. What needs to be done is to take the execution order of nodes currently used and keep that order, but change how we build the storage_map and add an extra dependency check in the garbage collection. I would split this into these steps:

1) Use the vm linker (not the default cvm) with graphs that don't have lazy ops and with allow_gc=False. In that case, modify the method VM_Linker.make_all() (file gof/vm.py) so that the storage_map that gets created has these properties:

  • The old storage_map has a different storage for each variable in the graph. This needs to change.
  • You need to have some variables share the same storage, but those variables need to be carefully selected. For this: for each node in nodes_execution_order, check whether the current node has input variables that would be garbage collected if gc were enabled. Of those variables, for now keep only the ones with ndim == 0. For each such variable, look among the future nodes to execute for one that has an output with ndim == 0, checking that you haven't already changed the storage_map of that output variable. For that pair (the input variable of the current node that would normally be gc'd, and the variable that will be output later), make sure the storage_map points to the same place for both variables.

The fact that the output variable's storage is already available when a node gets executed will allow that node to reuse the existing memory for its output instead of allocating new memory. This will lower the memory used when allow_gc=False, as the memory used won't be the sum of all the intermediate memory as it is now.

1.2) Only after 1.1) is working, generalize it to work when allow_gc=True. For this, you need to modify the condition Theano checks when deciding what to gc after a node gets executed, so that it only gc's a variable if no other future node will use that memory. This will need changes in the variable post_thunk_clear passed to LoopGC.init, to correctly delay the gc of the variables you changed in 1.1.

There is other stuff to do on this, but I think we can delay the detailed plan for them.
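A very rough sketch of the storage-sharing idea in step 1, using hypothetical helper names rather than Theano's actual VM_Linker code, might look like this:

def share_scalar_storage(nodes_execution_order, storage_map, inputs_to_gc):
    """Make a soon-to-be-freed 0-d input share storage with a later 0-d
    output, so that output reuses the memory instead of allocating anew.

    `inputs_to_gc(node)` stands in for "inputs of `node` that would be
    garbage collected at this point if allow_gc were enabled".
    """
    remapped = set()
    for i, node in enumerate(nodes_execution_order):
        for var in inputs_to_gc(node):
            if var.ndim != 0:
                continue
            out = _find_free_scalar_output(nodes_execution_order[i + 1:],
                                           remapped)
            if out is not None:
                storage_map[out] = storage_map[var]  # both now share one cell
                remapped.add(out)
    return storage_map


def _find_free_scalar_output(future_nodes, remapped):
    # first 0-d output whose storage has not been remapped yet
    for node in future_nodes:
        for out in node.outputs:
            if out.ndim == 0 and out not in remapped:
                return out
    return None

The real implementation would of course work on the storage_map built in VM_Linker.make_all() and respect the gc bookkeeping described in 1.2).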


 

I will now start to look for information and think about how to implement my next step.

by xljroy at July 10, 2014 07:16 PM

July 09, 2014

Ajitesh Gupta(:randomax)
(MoinMoin student)

Week 7 Work - ACLs

This week the first task was to find and solve bugs related to Access Control Lists, referred to as ACLs hereafter. An access control list, in simple metaphorical terms, is equivalent to a guest list at a big party: if your name is on the invited guests' list then you may attend the party, else you stay out of it :D . So access control lists basically specify what permissions a user has with respect to a particular object. A user may read, write, create, destroy, or admin an item depending on the ACL corresponding to that item.

The first one was an already reported bug - #417, whereby the userbrowser view in the admin section used to crash if there were any users who had created accounts but not confirmed them via the emails sent to them. This happens because when a user registers on the wiki, their email address is stored under the "EMAIL_UNVALIDATED" key until they confirm the account; once they confirm it, the email is stored under the "EMAIL" key. The earlier code only looked for the email under the "EMAIL" key while retrieving user info, and so it crashed. I changed the code to look first for "EMAIL" and, if that does not exist, for "EMAIL_UNVALIDATED". Here is the commit for that change. Now you might wonder where ACLs figure in all this. This is related to ACLs because one of the new ACL views I created this week is linked from the userbrowser view, where all the users are listed. I'll talk more about that later.
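The fix boils down to a fallback lookup along these lines (illustrative only; `user_meta` is a stand-in for the per-user metadata mapping, not the exact MoinMoin code):

# prefer the confirmed address, fall back to the unconfirmed one
email = user_meta.get("EMAIL") or user_meta.get("EMAIL_UNVALIDATED")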

The second bug I came across was #444, whereby if users tried to access something they are not allowed to read (like gatecrashing that party I mentioned ;) ), they would see the error mentioned in the bug report instead of a proper "Access Denied" page. This happens because the registered 403 error handler somehow does not spring into action when a 403 (AccessDenied) error is raised in this case. I am still working on this issue; you can find the related code reviews here.

Another bug was #445, whereby if a user changed the ACL of an item such that they were no longer allowed to read that object, they would receive the error reported in the bug (I wonder who'd strike their own name off the guests' list :D ). Basically, the self.revs object used there gets its value from the "get_items_last_revisions()" function. This function fails to find the item when the user is not authorized to read it, and hence returns an empty list. This empty list caused the error, so I put in a check for empty lists and displayed an AccessDenied error in that case using the abort(403) function. The code review is here.

Yet another bug was #446, wherein a user could modify the ACLs of a particular object even with just read and write permissions, whereas according to the design only a person with admin rights should be able to do that. In the basic theme this error was caused because there was no check on the user's ACL capabilities before displaying the modify form, or in metaphorical terms, there was a guest list defined but no one to cross-check the people entering the hall. So I added a check for the user, and only if they have admin rights can they access the Modify ACL form. Here is the commit regarding that.
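Conceptually the guard looks something like this (a minimal sketch with placeholder names for the permission API, not the actual MoinMoin view code):

from flask import abort

def guard_modify_acl(user, item):
    """Abort with 403 unless `user` holds admin rights on `item`."""
    if not user.may.admin(item.fqname):  # placeholder permission check
        abort(403)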

Finally, the last bug I worked on was #447, which, like #417, was not directly about ACLs but was related to the view I needed for further ACL views. Here the previous developer had not kept in mind that the rev.meta[NAME] attribute is a list, since items can have multiple names, and as such it could not be used as a key for a dictionary. So I unrolled the previous complex list comprehension and used for loops instead to take care of the list. Here is the commit regarding that.

Now come the new views I worked on for easy visualization of ACLs. The first one was the User ACL Info view. This view contains information about the ACL capabilities of a user with respect to each item in the wiki. It can only be accessed by the superuser(s) specified in the wiki configuration, through the userbrowser view in the admin section, where there is a list containing various information about each user (Ah! Now you see why I solved #417 and #447). Clicking on the item names in the list takes the superuser to the Modify Item page, where they can modify the ACL of that item as they like (that is, if they have admin rights for that item as per #446). The tablesorter js library was used to make an easily sortable table. Here is the code review for that.


The User ACL Info view

Another view I made was the Item ACL Report. This again can be accessed only by the superuser. Here the superuser sees a list of all items in the wiki and the corresponding ACLs for each item. Clicking on the item name takes the superuser to the Modify Item view. Tablesorter js was used here too. Also, in both new views, if the item is nameless, the Item Id is shown instead. Here is the code review for that.


Item ACL Report view

In the coming couple of weeks I will be working on the Show Groups view and the Group ACL Report view, to enable the superuser to visualize the ACLs and easily change them.

by Ajitesh Gupta (noreply@blogger.com) at July 09, 2014 12:48 PM

Saurabh Kathpalia
(MoinMoin student)

GSoC Update: July 9 2014

Last week's summary:

Added 'Subscribe' and 'Add quicklink' options to the ticket modify template. Now users can subscribe to ticket items and also quicklink them in the basic theme. Here is the codereview and commit to my repo.

Made nameless items downloadable. Previously users couldn't download nameless items, because the filename used for downloading was based on the name of the item, which raised an error for nameless items. So I removed this dependency on the name and defined a separate function for getting the filename (get_download_file_name). It returns the name of the item if there is one, else it returns a field-value pair as the download file name. Here is the codereview and commit to my repo.
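The behaviour can be sketched roughly like this (illustrative attribute names, not the exact MoinMoin code):

def get_download_file_name(item):
    # named item: keep using its name as the download file name
    if item.name:
        return item.name
    # nameless item: build the file name from a field-value pair instead
    return u"{0}-{1}".format(item.field, item.value)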

Also made updating blog items work and added useful links for modify, subscribe, modify meta, etc. on the blog page. Here is the codereview for my changes. One TODO is to make blog posts nameless, which will be done once we have a way to create items without any initial fqname.

Fixed #433 - Now User Actions, Item Actions, Show, History, Modify, etc. are also displayed in the view of nameless items in the basic theme. Here is the codereview for my changes. I am yet to commit it to my repo.

Also tried to write tests for the +tickets view but couldn't complete them, as we need the latest pytest version (2.5.2), which is currently not compatible with moin, so I will come back to it later. Here is the codereview for the above.

by saurabh kathpalia (noreply@blogger.com) at July 09, 2014 11:08 AM

July 08, 2014

RAGHAV R V
(BinPy student)

GSoC 2014 - Week 7

The past two weeks went quite well.

I implemented the SignalGenerator module and also made a few minor bug fixes along the way.

The SignalGenerator module is a multi-threaded implementation which can connect to external inputs (for modulating the output signal into an AM or FM wave) using the linker module. The outputs can also be connected to other modules.

The source of the file can be found here.

The code to generate an AM signal using 2 SignalGenerator blocks is:




The output of the same:



The Pull Request that I raised for the same: #249

To finish the microprocessor implementation, a few more things need to be done, especially tidying up the expression modules and integrating the Espresso heuristic logic minimizer.

The idea here is to make a Finite State Machine to act as a block which accepts inputs and generates an output with the least possible delay based on a predetermined equation.

I hope to start work on this and finish it within the next two weeks. Once it is done, adding further ICs is a breeze: all the leftover ICs can be implemented using this block in at most a single day. If possible I will also re-implement the existing ICs using it. Completing this and the matplotlib-based oscilloscope will be my next priority.

by Raghav R V (noreply@blogger.com) at July 08, 2014 12:29 AM

July 07, 2014

maheshakya
(scikit-learn student)

Optimizations on Locality sensitive hashing forest data structure

In order to get the LSH forest approximate nearest neighbor search method into a competitive and useful state, optimization was a necessity. These optimizations target both the LSH forest algorithm itself and its implementation. There are two general kinds of bottleneck where the need for optimization arises.

  1. Memory usage of the data structure.
  2. Speed of queries.

Let's discuss these two cases separately in detail. Remember that there will always be a trade-off between memory and speed.

Optimizations on memory usage

In the previous post about the implementation of the LSH forest, I explained the basic data structure and its functions. Some of the optimizations caused obvious changes in those functions. First of all, I need to mention that this data structure stores the entire fitted data set, because it is required to calculate the actual distances of the candidates. So the memory consumption is usually dominated by the size of the data set, and the LSH Forest data structure has no control over this. The only thing that can be controlled from the data structure is the set of hashes of the data points.

The current data structure maintains a fixed-length hash (though the user can define that length at initialization time). Hashes are composed of binary values (1s and 0s). At the moment, these hashes are stored as binary strings.

One of the optimizations that can be done to maintain a more compact index for hashes is to use the integer equivalent of the binary strings. For example, the string 001110 can be represented as 14. For very large data sets with very high dimensions, this technique has little effect, as the improvement in the memory usage of the hash indices is insignificant compared to the memory required to store the data set.

As I mentioned earlier, every optimization has a cost; here we trade speed for improved memory consumption. The LSH algorithm is applied to a single data point as many times as the hash length requires. In the default case, there are 32 hash bits per data point in a single tree, and those hashed values have to be concatenated in order to create \(g(p)\) (refer to the LSH Forest paper). Because the hashing is done separately for the individual elements of \(g(p)\), they are stored in an array in the first place. To create a single hash from the values in this array, they are first concatenated as a string. This is where the memory-speed dilemma occurs: these string hashes can be converted into integer hashes for compact indices, but it costs extra time.
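For instance, the conversion described above amounts to something like this (a minimal sketch, not the actual scikit-learn code):

import numpy as np

bits = np.array([0, 0, 1, 1, 1, 0])       # g(p) stored as an array of hash bits

string_hash = "".join(map(str, bits))     # '001110' -- the string form of the hash
integer_hash = int(string_hash, 2)        # 14 -- more compact, but an extra conversion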

In the next section, I will explain how this extra overhead can be minimized using a trick, but the fact remains that you cannot eliminate the memory-speed trade-off completely.

Optimizations on query speeds

(I strongly recommend going through the LSH forest paper before reading the following section, because the description involves terms taken directly from the paper.)

I was hapless at the beginning, because the query speeds of the implemented LSH forest structure were not competitive with the other approximate nearest neighbor search implementations. But with the help of my project mentors, I was able to get over this issue.

Optimizations on the algorithm

In the initial implementation, the binary hash \(g(p)\) for each tree was computed during the synchronous ascending phase, at the moment it was required. This causes a great overhead, because each time it is computed, the LSH function has to be applied to the query vector; for random projections that is a dot product, which is an expensive operation for large vectors. Computing the binary hashes for each tree in advance and storing them is a reliable solution to this issue. The descend algorithm needs to be performed on each tree to find the longest matching hash length, so it iterates over the trees in the forest. For each tree, the binary hash of the query is calculated and stored as follows. This does not consume much memory because the number of trees is small (typically around 10).

def descend(query, hash_functions, trees):
    bin_queries = []
    max_depth = 0
    for i in range(len(trees)):
        bin_query = do_hash(query, hash_functions[i])
        k = find_longest_prefix_match(trees[i], bin_query)
        if k > max_depth:
            max_depth = k
        bin_queries.append(bin_query)
    return max_depth, bin_queries

It is an obvious fact that as the number of candidates for the nearest neighbors grows, the number of actual distance calculations grows proportionally. This computation is also expensive compared with the other operations in the LSH forest structure, so limiting the number of candidates is another reasonable optimization that can be done at the algorithm level. In the synchronous ascending phase (the candidate search), one of the terminating conditions is the maximum depth reaching 0 (the while loop runs while x > 0). But the search does not necessarily have to run until the maximum depth reaches 0, because we are looking for the most viable candidates. It is a known fact that for smaller matching hash lengths, two data points are unlikely to have a large similarity; therefore, as the considered hash length decreases, the eligibility of the candidates we get also decreases. Thus, having a lower bound on the maximum depth bounds the set of candidates we collect and decreases the probability of collecting irrelevant candidates. This can be done by using x > lower_bound in the while loop for some value of lower_bound > 0. But there is a risk of not retrieving the required number of neighbors for very small data sets, because the candidate search may terminate before it acquires the required number of candidates. Therefore the user should be aware of a suitable lower_bound for their queries at initialization time.

Optimizations on the implementation

I have indicated before that the bisect operations that come with Python are faster for small lists, but as the size of the list grows, numpy's searchsorted function becomes faster. In my previous implementation, I used an alternative version of the bisect_right function, because the standard one does not fulfill the requirement of finding the rightmost index for a particular hash length in a sorted array (please refer to my previous post if things are not clear). But we cannot create an alternative version of numpy's searchsorted function; therefore a suitable transformation is required on the hash itself.

Suppose we have a binary hash item = 001110. What we need is the largest number whose first 4 bits are the same as item's; 001111 satisfies this requirement. So the transformation needed is to replace the last 2 bits with 1s.

def transform_right(item, h, hash_size):
    # keep the first h bits and set the remaining bits to 1
    transformed_item = item[:h] + "".join(['1' for i in range(hash_size - h)])
    return transformed_item

The transformation needed to get the leftmost index is simpler. It is 001100, i.e. the last two bits replaced by 0s, which is the same as just having 0011. Therefore a simple string slicing operation, item[:4], will do the job.
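Spelled out in the same style as transform_right, purely for symmetry (the slicing alone is enough in practice):

def transform_left(item, h):
    # a prefix sorts before any string that extends it,
    # so searching for item[:h] yields the leftmost index
    return item[:h]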

It gets more complicated when it comes to integer hashes: the integer has to be converted to a string, transformed, and re-converted to an integer.

def integer_transform(int_item, hash_size, h):
    str_item = ('{0:0' + str(hash_size) + 'b}').format(int_item)
    transformed_str = transform_right(str_item, h, hash_size)
    transformed_int = int(transformed_str, 2)
    return transformed_int

But this is a very expensive operation at query time. For indexing, only int(binary_hash, 2) is required, and this has little effect because hashing the data set dominates that operation completely. For a query, however, it is a significant overhead, so we need an alternative.

The required integer representations for the left and right operations can be obtained by performing bitwise \(AND\) and \(OR\) operations with a suitable mask. The masks are generated by the following function.

def _generate_masks(hash_size):
    left_masks, right_masks = [], []        
    for length in range(hash_size+1):
        left_mask  = int("".join(['1' for i in range(length)])
                         + "".join(['0' for i in range(hash_size-length)]), 2)
        left_masks.append(left_mask)
        right_mask = int("".join(['0' for i in range(length)])
                         + "".join(['1' for i in range(hash_size-length)]), 2)
        right_masks.append(right_mask)
    return left_masks, right_masks

These masks will be generated at the indexing time. Then the masks will be applied with the integer hashes.

def apply_masks(item, left_masks, right_masks, h):
    item_left = item & left_masks[h]
    item_right = item | right_masks[h]
    return item_left, item_right

The leftmost and rightmost indices in a sorted array can then be obtained in the following fashion.

import numpy as np
def find_matching_indices(sorted_array, item_left, item_right):
    left_index = np.searchsorted(sorted_array, item_left)
    right_index = np.searchsorted(sorted_array, item_right, side = 'right')
    return np.arange(left_index, right_index)

As the masks are precomputed, the speed overhead at query time is minimal in this approach. Still, there is a little overhead, because the original hashed binary digits are stored in an array, and they need to be concatenated and converted to obtain the corresponding integers. If the integers are cached, this overhead is eliminated.

Cached hash

This is a method which guarantees a significant speed-up in queries at the expense of index building speed. At indexing time, we can create a dictionary whose keys are arrays of hashed bits and whose values are the corresponding integers. The number of items in the dictionary is the number of bit combinations for a particular hash length.

Example:

Suppose the hash length is 3. Then the bit combinations are 000, 001, 010, 011, 100, 101, 110 and 111, which map to the integers 0 through 7.

Then it will be only a matter of a dictionary look-up. Once an integer is required for a particular hash bit array, it can be retrieved directly from the dictionary.

The implementation of this type of hash is a bit tricky. Typically in LSH forests the maximum hash length will be 32, so the size of the dictionary would be \(2^n = 4294967296\) where \(n = 32\). This is an infeasible size to hold in memory (it may require tens of gigabytes). But when \(n = 16\), the size becomes \(65536\), which can easily be stored in memory. Therefore we use two caches of 16 bits each.

First a list for digits 0 and 1 is created.

digits = ['0', '1']

Then the cache is created as follows.

import itertools

cache_N = 32 / 2
c = 2 ** cache_N # compute once

cache = {
   tuple(x): int("".join([digits[y] for y in x]),2)
   for x in itertools.product((0,1), repeat=cache_N)
}

Suppose item is a list of length 32 with binary values (0s and 1s). Then we can obtain the corresponding integer for that item as follows:

xx = tuple(item)
int_item = cache[xx[:16]] * c + cache[xx[16:]]

This technique can be used in queries to obtain a significant improvement in speed assuming that the user is ready to sacrifice a portion of the memory.

The performance of all these techniques was measured using Python's line_profiler and memory_profiler. In my next post, I will explain concisely how these tools were used to evaluate these implementations.

July 07, 2014 07:00 AM

July 06, 2014

Vighnesh Birodkar
(scikit-image student)

scikit-image RAG Introduction

Humans possess an incredible ability to identify objects in an image. Image processing algorithms are still far behind this ability. Segmentation is the process of dividing an image into meaningful regions. All pixels belonging to a region should get a unique label in an ideal segmentation.

The current segmentation functions in scikit-image are too fine-grained and fall closer to superpixel methods, providing a starting point for segmentation. Region Adjacency Graphs (RAGs) are a common data structure for many segmentation algorithms. As part of GSoC this year I am implementing RAGs for scikit-image. The current HEAD of scikit-image's master branch contains my RAG implementation based on Networkx from my recent Pull Request. In the example below, we will see how Region Adjacency Graphs (RAGs) attempt to solve the segmentation problem. Please note that you need the latest master branch of scikit-image to run the following code.

Getting Started

We define the function show_img in preference to the standard call to imshow to set nice default size parameters.
We start with coffee, a nice fresh image of a coffee cup.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
from skimage.measure import regionprops
from skimage import draw
import numpy as np


def show_img(img):
    width = 10.0
    height = img.shape[0]*width/img.shape[1]
    f = plt.figure(figsize=(width, height))
    plt.imshow(img)

img = data.coffee()
show_img(img)

rag_demo_2_0

Over Segmentation

We segment the image using the SLIC algorithm. SLIC assigns a unique label to each
region, where a region is a localized cluster of pixels sharing some similar property,
in this case their color. The label of each pixel is stored in the labels array.

regionprops helps us compute various features of these regions. We will be
using the centroid, purely for visualization.

labels = segmentation.slic(img, compactness=30, n_segments=400)
labels = labels + 1  # So that no labelled region is 0 and ignored by regionprops
regions = regionprops(labels)

The label2rgb function assigns a specific color to all pixels belonging to one
region (having the same label). In this case, in label_rgb each pixel is
replaced with the average RGB color of its region.

label_rgb = color.label2rgb(labels, img, kind='avg')
show_img(label_rgb)

rag_demo_6_0
Just for clarity, we use mark_boundaries to highlight the region boundaries.
You will notice that the image is divided into more regions than required. This
phenomenon is called over-segmentation.

label_rgb = segmentation.mark_boundaries(label_rgb, labels, (0, 0, 0))
show_img(label_rgb)

rag_demo_8_0

Enter, RAGs

Region Adjacency Graphs, as the name suggests, represent the adjacency of regions
with a graph. Each region in the image is a node in the graph. There is an edge
between every pair of adjacent regions (regions whose pixels are adjacent). The
weight between two nodes can be defined in a variety of ways. For this
example, we will use the difference of average color between two regions as
their edge weight: the more similar the regions, the lower the weight between
them. Because we are using the difference in mean color to compute the edge weight,
the method has been named rag_mean_color.

rag = graph.rag_mean_color(img, labels)

For our visualization, we also add an additional property to each node: the
coordinates of its centroid.

for region in regions:
    rag.node[region['label']]['centroid'] = region['centroid']

display_edges is a function to draw the edges of a RAG on its corresponding
image. It draws edges as green lines and centroids as yellow dots.
It also accepts a threshold argument; only edges with weight below this threshold are drawn.

def display_edges(image, g, threshold):
    """Draw the edges of a RAG on its image.

    Edges are drawn in green and centroids in yellow.

    Parameters
    ----------
    image : ndarray
        The image to be drawn on.
    g : RAG
        The Region Adjacency Graph.
    threshold : float
        Only edges in `g` with weight below `threshold` are drawn.

    Returns
    -------
    out : ndarray
        Image with the edges drawn.
    """
    image = image.copy()
    for edge in g.edges_iter():
        n1, n2 = edge

        r1, c1 = map(int, g.node[n1]['centroid'])
        r2, c2 = map(int, g.node[n2]['centroid'])

        line = draw.line(r1, c1, r2, c2)
        circle = draw.circle(r1, c1, 2)

        if g[n1][n2]['weight'] < threshold:
            image[line] = 0, 1, 0
        image[circle] = 1, 1, 0

    return image

We call the function with the threshold set to infinity so that all edges are drawn. I
myself was surprised by the beauty of the following output.

edges_drawn_all = display_edges(label_rgb, rag, np.inf)
show_img(edges_drawn_all)

rag_demo_16_0

Let's see what happens when we set the threshold to 29, a value I arrived at with
some trial and error.

edges_drawn_29 = display_edges(label_rgb, rag, 29)
show_img(edges_drawn_29)

index2

Alas, the graph is cut

As you can see above, the RAG is now divided into disconnected regions. If you
notice, the table above and to the right of the dish is one big connected
component.

Threshold Cut

The function cut_threshold keeps only the edges with weight below a specified threshold
and then labels each resulting connected component as one region. Once the RAG is
constructed, many similar and more sophisticated strategies can improve the initial segmentation.

final_labels = graph.cut_threshold(labels, rag, 29)
final_label_rgb = color.label2rgb(final_labels, img, kind='avg')
show_img(final_label_rgb)

index
Not perfect, but not that bad, I'd say. My next steps will be to implement better algorithms to process the RAG after the initial segmentation. These include the merging predicates mentioned here and N-cut.


by Vighnesh Birodkar at July 06, 2014 03:46 PM

Mustafa Furkan Kaptan
(Vispy student)

Testing Screenshots

Hello reader,

This week I updated and tested the screenshot ability of Vispy.

Normally, Vispy uses the `read_pixel()` function, which basically reads pixels from a GL context. This function returns a numpy array, but somehow an upside-down one. So what I did was to flip that array, which is extremely easy in Python, and add a test for it. The test is very simple too: we draw a white square at the top left of a black screen. Then, after reading the pixels, we check whether the image is flipped or not by calculating the sum of the pixels in the corners. If the top left corner's pixel values are greater than 0 and the rest are equal to 0, it means we took the screenshot correctly. You should know that I had never dealt with drawing something with GL functions before; this was my first time.
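The flip and the corner check can be sketched like this (shapes and names are illustrative, not Vispy's actual test code):

import numpy as np

def screenshot(raw_pixels):
    """Flip the upside-down array returned by the GL read into image order."""
    return np.flipud(raw_pixels)

def looks_correct(img):
    # a white square was drawn in the top-left quadrant of a black screen
    h, w = img.shape[:2]
    top_left = img[:h // 2, :w // 2].sum()
    rest = img.sum() - top_left
    return top_left > 0 and rest == 0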

Now I am making another PR with Luke Campagnola's (a Vispy dev) pure Python PNG writer. This will help us remove another image library dependency, like PIL. More updates will come after it has been merged.

See you next post!

by Mustafa Kaptan (noreply@blogger.com) at July 06, 2014 12:31 PM

July 05, 2014

Rajul
(SunPy student)

SunPy Database Browser Meeting on 5th July, 2014

Nabil, Stuart and I had a discussion regarding the SunPy Database Browser on the 5th of July, 2014. Various points were discussed during the meeting:
1. EIT files seem to have a problem with being loaded into the SunPy database. This is because the SunPy database apparently does not allow the WAVEUNIT attribute to be None, and the EIT file in sunpy.data.sample has WAVEUNIT as None, so it cannot be loaded into the database. This can be due to one of the following two problems:
a. The header of SunPy's sample EIT file is faulty
b. The SunPy database module apparently does not have the functionality to deal with a None WAVEUNIT, and hence with EIT files
So Nabil has assigned me to try to download more EIT test files and check their headers. If the header of SunPy's sample file is faulty then we will replace SunPy's test file, else we will amend sunpy.database.
2. Stuart has created a pull request for solar coordinate handling in Ginga. I shall try to pull in this PR, render SunPy's sample solar FITS files, and take screenshots to send to Eric.

3. Start work on the Search Box feature in the Database GUI

by Rajul Srivastava (noreply@blogger.com) at July 05, 2014 05:51 PM

July 04, 2014

Mustafa Furkan Kaptan
(Vispy student)

IPython Backend

Hello reader,

In the past weeks, I have been trying to add interactivity in IPython. Since both IPython and our backend (glut, qt, etc.) need their own event loops, we haven't managed to handle user events yet.

Our only option was to enter an event loop whenever we capture an event from the user, for example a mouse press. This approach seems nearly impossible in GLUT (though not in freeglut), because `glutMainLoop()` never returns. Using Qt's `processEvents()`, we managed this:



Using this method we are nearly finished with the static part: we process events only whenever the user sends an event. For animation, we are still working on it.
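Roughly, the idea can be sketched like this (placeholder names; it assumes a Qt backend and is not Vispy's actual code):

from PyQt4 import QtGui

app = QtGui.QApplication.instance() or QtGui.QApplication([])

def handle_notebook_event(canvas, event):
    # placeholder dispatch: hand the notebook event to the canvas, then let
    # Qt process its pending events instead of blocking in its own main loop
    canvas.process_event(event)
    app.processEvents()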

Also, this week one of the Vispy-dev's, Almar Klein, has finished the base code of IPython backend in vispy/app. That means we are going to use IPython notebook officially! :)

See you next post!

by Mustafa Kaptan (noreply@blogger.com) at July 04, 2014 10:19 PM

Tarun Gaba
(PyDy student)

GSoC 14: Midterm Evaluations have ended!

It has been a week since the midterm evaluations ended, and I am back to work after a small break (with permission from my mentor, of course!). I have been working on writing a test suite for the Dynamics Visualizer; this is the wrapping-up part of the visualizer for this GSoC. [Here](/blog/visualization/index.html?load=samples/scene_desc.json) is a visualization of a rolling disc (it is slightly buggy though) that I prepared. To view the animation, allow the visualization to load in the webpage (it loads automatically), and then hit the `Play Animation` button. After writing some tests for the visualizer, I am going to start fleshing out the API for the module that provides IPython support to the visualizer. The main aim of this module is to make the visualizer interactive, in the sense that a user should be able to change all the variables from the GUI (which is rendered inside the notebook's output cell) and then rerun the simulations without having to write or execute any code manually. The data of the new simulations will be automatically fed into the visualizer, and it can then be viewed as an animation. This whole workflow will be very convenient for existing PyDy users as well as new ones. It will be particularly convenient for those who just want to play around with existing systems by changing the system variables and seeing how that affects the resulting animations. With the development of this module, as well as the ongoing improvements in the other PyDy modules (by my fellow GSoC'ers from PyDy), we should be able to perform lightning-fast simulations of a system, as well as view them on a canvas. I will keep posting about the new work I do, with better details (once I actually start implementing new stuff!).

July 04, 2014 06:11 PM

July 03, 2014

Mainak Jas
(MNE-Python student)

Plotting BEM contours

Forward modelling in MEG requires three surfaces in the brain to be known: the inner skull, the outer skull, and the skin. These surfaces can be found by taking an anatomical MRI scan of the person and applying the watershed algorithm to segment out these surfaces as Boundary Element Meshes (BEMs).

For successful forward modelling, the surfaces must be as accurate as possible. This is one more quality measure that is important for MEG analysis and will be included in the MNE-Report.

This was addressed in my PR here. The user can provide the orientation of the slices to plot and the number of slices (defaults to 12). The function signature is something like this:

plot_bem(subject='sample', subjects_dir=subjects_dir, orientation='sagittal')

and the plot looks like this:


That's it for today !

by Mainak (noreply@blogger.com) at July 03, 2014 08:25 AM

Alan Leggitt
(MNE-Python student)

Day 32: Reading and writing volume names to fif file


Today I added functionality to read and write segmentation information into .fif files. This is handy for distinguishing
between multiple source spaces, each corresponding to a particular subcortical volume.

For an idea of the changes needed in the MNE code, check out the latest commit to my pull request.

Below is a verification that everything worked.

In [21]:
import mne
from mne.datasets import spm_face

mne.set_log_level(False)

data_path = spm_face.data_path()

# setup the cortical surface space
src_fname = data_path + '/MEG/spm/combined-src.fif'
src = mne.setup_source_space('spm', overwrite=True)

The add_dist parameter to mne.setup_source_space currently defaults to False, but the default will change to True in release 0.9. Specify the parameter explicitly to avoid this warning.

In [22]:
# add two subcortical volumes
src = mne.source_space.add_subcortical_volumes(src, ['Left-Amygdala', 'Right-Amygdala'])
mne.write_source_spaces(src_fname, src)

/home/alan/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/shape_base.py:834: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
return c.reshape(shape)
/home/alan/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/shape_base.py:834: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
return c.reshape(shape)
/home/alan/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/shape_base.py:834: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
return c.reshape(shape)

In [23]:
# display volume names
src[2]['seg_name'], src[3]['seg_name']
Out[23]:

('Left-Amygdala', 'Right-Amygdala')
In [24]:
# read saved version of combined source space
src_read = mne.read_source_spaces(src_fname)
In [25]:
# display volume names
src_read[2]['seg_name'], src_read[3]['seg_name']
Out[25]:

(u'Left-Amygdala', u'Right-Amygdala')

by Alan Leggitt (noreply@blogger.com) at July 03, 2014 01:01 AM

July 01, 2014

Asra Nizami
(Astropy student)

Weeks 5+6

You know that feeling when you think a task is going to be super hard but once you get to it, it turns out to be easier than you expected? And once you've finished it, you feel like you're the best programmer ever? I've been on that high for the past few days so I think I'll make a blog post while the feeling lasts.

Over the past two weeks or so I've been working on three tasks. The first thing I worked on was implementing different coordinate systems in WCSAxes. Previously, WCSAxes only allowed users to convert coordinates to the FK5 and Galactic systems, but since there are a lot more coordinate systems out there, WCSAxes should be able to handle them. Like I said, this turned out to be pretty easy. The hardest part was probably refactoring the code to make it clean and efficient; other than that I just had to use the Astropy 'lookup_name' method to look up the correct coordinate class. However, since this method is only available in Astropy 0.4, someone using WCSAxes with older versions of Astropy is limited to the old coordinate systems.

The second thing I worked on was a bug with the way axis labels were positioned on images. Previously, WCSAxes always tried to place axis labels as close as possible to the axis without intersecting with the tick labels.

but if the tick labels are in the way, it pushes the axis label further away
This could get annoying in certain situations, and people preferred that axis labels always be positioned below the tick labels, whether or not they intersect. I had to read up on matplotlib's code to figure out how it positions labels and tried to implement something similar. It was a bit tricky since matplotlib and WCSAxes handle ticks, labels and axes very differently, but luckily I was able to figure it out! And the results are

See? The axis label is now below the tick label by default! Users now also have the ability to adjust the padding of the axis label so they can move it as close or far away from the axis as they wish :)

The third thing that I've been working on is writing a wrapper for the package APLpy that uses WCSAxes. There are several packages out there that are used to plot astronomical images, and one of them is APLpy, a package that my mentor helped write. The way APLpy does the actual plotting of images isn't all that great, so we're trying to write a wrapper for it that uses WCSAxes to do the actual plotting, which will allow people to keep using the package. So far work on the wrapper has been going well and I haven't hit any major roadblocks. Hopefully it'll continue to be this straightforward - or maybe not, things get interesting when they don't go the way you expected them to.

Now what I'm really excited about is the first release of WCSAxes! The package is almost ready for release except for one tiny feature I have to add. The sooner I finish this post, get to actual work and finish my task, the sooner we can do our first release! 

by Asra Nizami (noreply@blogger.com) at July 01, 2014 05:27 PM

June 30, 2014

Ajitesh Gupta(:randomax)
(MoinMoin student)

Week 5 & 6 work

This week I started work on the meta-data section of the UI. Earlier the meta-data section of the UI was displayed as raw data. Below is the screenshot -


The earlier raw metadata view

As you can see, the meta-data section just shows a dictionary with HTML markup. In this code the item metadata was converted to a Python dictionary and then to text using json.dumps() in the function "meta_dict_to_text()". It was then HTML-escaped and put in a <pre> tag by the function "_render_meta" in "Moinmoin/items/__init__.py". This was then returned to the "show_item_meta" function in the views section (Moinmoin/apps/frontend/views.py), which would mark up the input as HTML using flask's Markup() function. The result was passed on to the template and displayed as such.

I first made another function, "meta_to_dict()", to just convert the metadata to a Python dictionary, so that I could use that dictionary in the template to extract data and customize its display easily. I then removed the duplicated part of "meta_dict_to_text()" and instead called "meta_to_dict()" for the necessary processing. I also changed the name of the function from "_render_meta()" to "_meta_info", as it no longer renders info but rather returns a dictionary, and renamed the variable "meta_rendered" to "meta" for the same reason. In the template I show the user-friendly content type along with the detailed one present earlier, using the "shorten_ctype()" function added in the search UI patch. I also added hyperlinks and a visualisation of the validity of the links: if a particular item pointed to by a link in the "Item Links" or "Item Transclusions" list exists, the link appears blue in color, else it appears red. I also changed the "Modified Time" to a user-readable format, and did the same for the "Item Size". Here is how it looks with my patch - https://codereview.appspot.com/110210043/


 The new metadata view

I also worked on fixing the footer to the bottom of the page, which earlier used to come up to the middle of the page when the content was small. Here is an example -

The footer midway up the page

I added the classes "moin-header", "moin-content" and "moin-footer" to the layout.html template in the basic theme for easy CSS selection. In the CSS I changed the html and body elements to have a height of 100%, and the container section to have a min-height of 100% and relative positioning. The moin-content was changed to have a bottom padding of 100px, in accordance with the size of the footer, and the moin-footer now has bottom set to 0, absolute positioning, and a width of 100%. Here is how it looks now -


The new footer at the bottom

CRs for this week :
 
Also this week I finally got to commit the previous CRs-






I also realised that I should be making small patches, so that they are easier to review, get reviewed faster, and as such get committed faster. Due to the large size of my previous patches, they took too long to review and thus were committed late.

by Ajitesh Gupta (noreply@blogger.com) at June 30, 2014 06:04 PM

Tyler Wade
(PyPy student)

UTF-8 Progress Update and Examples of Index Caching

Progress update: I'm still working on replacing unicode in PyPy with a UTF-8 implementation. Things are progressing more slowly than I'd hoped they might. At this point I have most of the codecs working and most of the application-level unicode tests passing, but there are still a lot of little corner cases to fix, and I don't even want to think about regex right now. That said, the summer is only about half over.

Since I wanted to have something interesting to write about, I've done some looking into how other implementations that use a UTF-8 representation handle index caching (if at all) and thought I'd share what I've found.

Relatively few languages with built-in unicode support use a UTF-8 representation; the majority seem to prefer UTF-16, for better or for worse. The UTF-8 implementations I found that are worth mentioning are Perl and wxWidgets.

wxWidgets by default uses UTF-16 or UTF-32, depending on the size of wchar_t, but it has an option to use UTF-8. When UTF-8 is enabled, wxWidgets uses an extremely simple index caching system: it caches the byte position of the last accessed character. When indexing a string, it starts searching from the cached byte position if the index being looked up is higher than the cached index; otherwise it searches from the start of the string. It never searches backwards. The idea here is to make in-order traversal -- like a typical for(int i = 0; i < s.size(); i++) loop -- O(n). Unfortunately, a backwards traversal is still O(n^2) and random access is still O(n).
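A toy model of this scheme in Python (not wxWidgets' or PyPy's code) could look like the following:

class Utf8String(object):
    """UTF-8 string with a single cached character/byte index pair."""

    def __init__(self, data):
        self.data = bytearray(data)  # assumed to be valid UTF-8
        self._cached_char = 0        # character index of the cached position
        self._cached_byte = 0        # byte offset of that character

    @staticmethod
    def _is_continuation(byte):
        return (byte & 0xC0) == 0x80

    def byte_index(self, char_index):
        # search forward from the cache when possible, else from the start;
        # backwards lookups always rescan from the beginning, as in wx
        if char_index >= self._cached_char:
            byte_pos, chars = self._cached_byte, self._cached_char
        else:
            byte_pos, chars = 0, 0
        while chars < char_index:
            byte_pos += 1
            while (byte_pos < len(self.data) and
                   self._is_continuation(self.data[byte_pos])):
                byte_pos += 1
            chars += 1
        self._cached_char, self._cached_byte = char_index, byte_pos
        return byte_pos

An in-order loop over the characters then only ever advances the cached position, which is what makes that particular access pattern linear.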

Perl's index caching is much more elaborate.  Perl caches two character/byte-index pairs.  When indexing a string, the search will start from the closest of the cached indices, the start, or the end of the string. It can count backwards as well as forwards through the string, assuming that counting backwards is approximately twice as slow as counting forward.

Perl's method for selecting the cached indices is also more sophisticated than wx's. Initially, the first two indices accessed are used as the cached indices. After that, when a new index is accessed, Perl considers replacing the current two cached indices with one of the two possible pairs formed from one of the old indices and the new index. It does so by computing a root-mean-square-like distance between the start of the string, the first index, the second index, and the end of the string for each of the 3 possible pairs of indices. That's probably as clear as mud, so maybe this excerpt from the Perl source will help:
#define THREEWAY_SQUARE(a,b,c,d) \
((float)((d) - (c))) * ((float)((d) - (c))) \
+ ((float)((c) - (b))) * ((float)((c) - (b))) \
+ ((float)((b) - (a))) * ((float)((b) - (a)))

The pair that minimizes THREEWAY_SQUARE(start, low index, high index, end) is kept.
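In Python, the selection rule reads roughly like this (a sketch of the heuristic, not Perl's actual code):

def threeway_square(a, b, c, d):
    # direct rendering of the THREEWAY_SQUARE macro above
    return float(d - c) ** 2 + float(c - b) ** 2 + float(b - a) ** 2

def choose_cached_pair(start, end, old_low, old_high, new):
    # keep whichever pair of indices minimizes the spread measure
    candidates = [
        (old_low, old_high),
        tuple(sorted((old_low, new))),
        tuple(sorted((old_high, new))),
    ]
    return min(candidates,
               key=lambda p: threeway_square(start, p[0], p[1], end))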

This method seems to be better than wx's in almost all cases except in-order traversal. I'm not actually sure what the complexity here would be; I think it's still O(n) for random access.

by Tyler Wade (noreply@blogger.com) at June 30, 2014 05:43 AM

June 28, 2014

maheshakya
(scikit-learn student)

An illustration of the functionality of the LSH Forest

Before digging into more technical detail on the implementation of LSH forest, I thought it would be useful to provide an insightful illustration on how the best candidates are selected from the fitted data points.

For this task, I decided to use a randomly generated dataset with a sample size of 10000 in two-dimensional space. The reason for using two dimensions is that it is convenient to depict the vectors in 2D space, and a normal human being can easily grasp the idea of convergence in 2D.

The following configuration of the LSH forest has been selected for this illustration:

  • Number of trees = 1
  • Hash length= 32
  • c = 1
  • Lower bound of hash length at termination = 4
  • Expecting number of neighbors = 10

You can get an idea of these parameters from my previous article on the LSH forest.

The following illustrations show the convergence of data points towards the query point as the considered hash length is increased. The important fact I want to emphasize here is that the candidates chosen by matching the most significant hash bits converge towards the actual data point we are interested in. This happens because of the amplification property of the locality sensitive hashing algorithm.

(Beware! The query point is in RED.) The figures show the candidate sets at hash lengths 0, 1, 3, 4, 5, 7, 8, 24 and 32.

Only if the required number of candidates is not reached during the early sweeps will the algorithm search for more candidates at smaller matching hash lengths. The best neighbors are then selected from that set of candidates by calculating the actual distances.

In my next post, I will enlighten you about the subtle optimizations done on the LSH forest data structure to find the best candidates at a maximum speed.

June 28, 2014 07:00 AM

June 27, 2014

Varun Sharma
(GNU Mailman student)

Moving the project to python's abc(abstract base classes)

While expanding the project, I noticed that I was writing a lot of boilerplate code and not following the DRY principle, so I decided to either switch to zope interfaces or Python's abc. After doing some research and reading the docs for both of them, I decided to switch to abc. The reasons for preferring abc over zope interface were:
1. ABC will help me make my project more DRY, unlike zope interface, which helps make the design more explicit but won't help me reduce the boilerplate code.
2. If I don't use zope, which at this point is not actually required due to the small size of the project, it will help reduce the dependencies of my project.

So now I am using a base module which stores all the abstract base classes, and I will try to declare a single base class per .py file to reduce ambiguity.
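The pattern looks roughly like this (hypothetical class names, not the project's actual code; Python 2 style, matching the codebase of the time):

from abc import ABCMeta, abstractmethod

class BaseClient(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def check(self):
        """Subclasses must implement the actual check."""

    def run(self):
        # shared boilerplate lives here instead of in every subclass
        return self.check()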

by Varun Sharma (noreply@blogger.com) at June 27, 2014 09:50 AM

June 26, 2014

Roy Xue
(Theano student)

Midterm Summary#GSoC2014-Theano

Midterm Summary


Summary of GSoC 2014 Python Software Foundation Theano Community

A few hours ago, I submitted the midterm evaluation of GSoC 2014 on Melange.

From Start

When applying for GSoC, I submitted my first PR to Theano, adding cumprod/cumsum to _tensor_variable: Add a_tensor_variable.cumprod/cumsum() interface #1769 (Merged)

Then, when I received the email from Google saying I was chosen for GSoC 2014, it was a really lucky day. I spent my time reading some Theano/Numpy documentation, chatting with my mentors, and bonding with the community.

On May 19th, exactly the day after my birthday (May 18th), I started coding.

Till Now

As planned in the proposal, the work up to the midterm is Task Step 1, "Lower Memory Usage". This step is divided into 2 phases.

For phase 1, the aim is to extend the Theano memory profiler to print the max memory usage of the current implementation. This is needed for comparison.

My PR for this phase is: Add a list store executed node order #1854 (Merged)

In this part, I first added node_executed_order and node_cleared_order to the class Stack, to store the order in which Theano currently executes and clears nodes. Then I rewrote the memory counting as a new method, so we can just call it twice to calculate memory usage for the previous and the current order, and print the statistics in a "current(previous)" format. Then I wrote a test file for this part. Fred and I had discussions about the details of my code, like how to make it better or how to make it work better. Though I can describe this part of the work easily now, I can still remember how I felt when doing it at first: kind of no idea how to start, and I made some mistakes.

For phase 2, the aim is to implement a function that finds the node execution order that minimizes the memory usage during compilation. This will also be done in the Theano profiler, so all shape information is known. It can be done by checking all valid execution orders and keeping the one that has the lowest max memory usage.

My PR for this phase is Compute minimum peak #1934 (Almost finished)

In this part, I started by writing pseudo code first, to discuss with Fred. He said that I could use a backtracking algorithm, and the first thing that occurred to me was the magic Python generators I had used before. The goal of the algorithm is to generate all valid orders, so there has to be a method to check whether a node is valid. The way I used here is from vm.py, which uses node.inputs and compute_map to check whether the node has its input variables available. Then I applied the pseudo code in Theano and modified some details to make it work. The most confusing problem was global variables inside a def, but Fred helped me solve it in the end. For the test file of this part, because we print the minimum peak in profile_memory, it can use the same test file as phase 1.
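The generator-based backtracking can be sketched like this (simplified, with a plain dependency mapping instead of Theano's compute_map):

def valid_orders(nodes, depends_on):
    """Yield every valid execution order of `nodes`.

    `depends_on[n]` is the set of nodes that must execute before `n`.
    """
    def backtrack(order, done):
        if len(order) == len(nodes):
            yield list(order)
            return
        for node in nodes:
            if node not in done and depends_on[node] <= done:
                order.append(node)
                done.add(node)
                for full_order in backtrack(order, done):
                    yield full_order
                done.remove(node)
                order.pop()

    return backtrack([], set())

Each generated order can then be fed to the memory counter from phase 1, keeping the order with the lowest peak.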

This PR is nearly finished; there are some details I need to check and modify to make the method work better.

Future Plan

For the next step, I will "Add a compile argument that uses the function above to order the function for such lower memory usage." I think this won't be hard.

After this, I have two exams on July 1st and 2nd, so I may restart my GSoC work on July 3rd.

From that time I will start the step "Reduce the number of allocations / reuse allocations of ndarrays". It also includes 2 phases:

  1. Keep the current order of execution that lowers the memory usage. Then modify the storage_map to have multiple nodes share the same output storage space. Start with only 0-d ndarrays; that way, we are sure they have the same size.
  2. Extend it to n-d arrays by checking for nodes that have the same shape, using the information in the shape feature that keeps track of shape information in the graph.

I will first read some documents about this, read the related parts of the code, come up with some ideas, and then have a discussion with my mentors about my ideas and the solution.

Other Aspects

Fred is really a nice mentor; he is very patient and willing to help me. I have learnt a lot from him, and if I ever meet him, I would like to give him a big hug to show my appreciation :). Thanks also to James and Arnaund for answering my questions immediately and with great help. I think I have made up my mind to continue contributing to Theano after GSoC.

I am so lucky to be chosen for the GSoC ten-year Reunion, but whether I can go depends on whether I can earn my flight tickets before it starts.

It's really nice to have joined the Chinese GSoC group and met some fantastic guys; we will host an online sharing session soon, with the topic of introducing our GSoC projects.

by xljroy at June 26, 2014 10:42 AM

June 25, 2014

Janani
(SciPy/NumPy student)

With those mid-sem bells chiming, it is time for another update.
The following checkpoints have been reached:

1. The implementation of the ellipsoidal harmonic function (also known as Lame's function) of the first kind.
The following is the link to the pull request:
https://github.com/scipy/scipy/pull/3753
The implementation is in Cython and calls a LAPACK subroutine. It is based on the Python implementation by Knepley and Bardhan given here.
Further, since hardly any other libraries implement Lame's function, preparing an extensive test suite is a challenge. At present the output of the function is tested for a certain range of inputs. The immediate next plan is to try to improve the test suite.
This will be followed by the implementation of Ellipsoidal harmonic function: The second kind.


2. Before this, the spherical harmonic functions were improved by reimplementing them in Cython rather than Python, thus improving their speed.
The details, having been covered at length in the previous post, are omitted here to save people from the boredom of redundancy. The pull request can be accessed at https://github.com/scipy/scipy/pull/3692

Thanks to the constant support of my brilliant mentors Pauli, Ralf and Stefan (and, I suppose, a few miracles!), the progress is on schedule. Hoping that this pace will be maintained, or even better, quickened, post mid-sem evaluation!

Signing off till next time,
Janani

by janani padmanabhan (noreply@blogger.com) at June 25, 2014 05:59 PM

Rishabh Sharma
(SunPy student)

Unified Downloader

First the two interesting things I came across:

1) I read the Perl introduction. I could not make up my first impression, and I am currently thinking about how I should improve at it. College used to give assignments which promote self-exploration, but I don't have anything to aim at with Perl. I would love some suggestions from my readers.
2) I read the Apache httpd server introduction. They have very good docs at Apache.

Unified Downloader:
Progress made: I have finished separating the download code for the EVE, LYRA and NoRH sources. They (the custom download clients) have been structured to inherit from GenericClient. The custom clients override _get_url_for_timerange, as these are unique to each client and have to be separate from GenericClient. The GenericClient has a query method which returns a QueryResponse object. This QueryResponse object is a container (a modified list) of QueryRequest blocks, which aim to give information about the answer to the query (Source…). The GenericClient also has a get method which takes the QueryResponse object as input and returns a Result object (defined in vso/vso.py); the aim of the get method is to download the files. These clients have been made to ape the VSO module as closely as possible, providing continuity of interface for users.

I have also set up the structure (a basic Factory) for the main unified downloader client to dispatch the queries. This functions as a wrapper over the aforementioned clients. E.g. the main unified downloader client has a query method which basically finds the appropriate client for servicing the query and calls that client's query method with the given query.
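The dispatch idea can be sketched like this (illustrative names, not the actual SunPy implementation):

class UnifiedDownloader(object):
    def __init__(self, clients):
        self.clients = clients  # e.g. [EVEClient(), LYRAClient(), NoRHClient()]

    def query(self, *query):
        # hand the query to the first client that claims it can service it
        for client in self.clients:
            if client.can_handle(*query):  # placeholder capability check
                return client.query(*query)
        raise ValueError("No client can service this query")

    def get(self, query_response):
        # delegate the download to the client that produced the response
        return query_response.client.get(query_response)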

Remaining work:
Port all the sources.
Better query dispatch (deciding whether to query the VSO or JSOC clients).
Tests.

Thanks for reading


by rishabhsharmagunner at June 25, 2014 12:26 PM

Issamou
(scikit-learn student)

June 24, 2014

Rajul
(SunPy student)

SunPy Ginga Database Browser Status

I have been able to get a local plugin to work in the main interface of Ginga, and it is now able to open a new window and display the database in a separate window. I am attaching a few of the screenshots. I shall now also try to display the database as a Ginga tab, but I am afraid that it will be slightly cluttered if we do that. However, I will try to do it anyway. I shall also explore whether we can easily add a separate side box under the main viewing area of Ginga (like the debugging box in most IDEs) and use it. To get the plugin working, follow these instructions:

  1. Copy the file SunPyDatabase.py to $HOME/.ginga/plugins
  2. Run from your terminal "sudo ginga --plugins=SunPyDatabase"
  3. The Ginga window opens; select SunPyDatabase by clicking on Operation button, which is present at the bottom of the central Image view area.
I am also attaching a few screenshots that I have obtained.

by Rajul Srivastava (noreply@blogger.com) at June 24, 2014 05:20 PM

Rishabh Raj
(scikit-image student)

The Glass Is Half Filled!

We’re almost halfway through and I’m happy to say we’ve covered a lot of ground.

The very first of the issues we were trying to solve was to find a way to connect to the Docker container running the code and talk to it: submit code, fetch the output, etc. After discussions with devs on the Docker channel, we came to the conclusion that using the Python subprocess module to attach to the process running the `docker attach` command is indeed an elegant way to do this.

The flow goes like this: our code interface submits a JSON to the server, which then fires up a Docker container. The next phase involves using the subprocess module from Python to launch the `docker attach` command in a process, which returns a handle that can be used to communicate with it. When the container is fired up, it is launched with the Python shell, so sending in code through the handle we procured earlier works. As you might be wondering, the communicate method just returns to us the STDOUT and STDERR of the running code. But, one step at a time, we take that with open arms. This output is then sent back to the interface as a JSON response to the AJAX call which started it all in the first place.
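
A minimal sketch of that flow (container name and code string are hypothetical; this is not the project's exact code):

    import subprocess

    # Attach to the already-running container, which was started with a Python shell.
    proc = subprocess.Popen(['docker', 'attach', 'sandbox_container'],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

    # Feed the submitted code to the container's Python shell and collect
    # whatever it printed on STDOUT/STDERR.
    stdout, stderr = proc.communicate('print(2 + 2)\n')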

Here’s a sample video to see this simple thing in action – https://www.youtube.com/watch?v=XTdfwV8nwJM

Okay, so we can submit code and fetch its output, all the time running it inside a Docker container, cool. But we need more. Our use case deals with images being generated as output, sometimes multiple times in the code. A clean solution for this would be to somehow write the output images into files and fetch those files and send them back as response.

So, the next problem we tackled was the transfer of a binary file from inside the Docker container to the user’s browser. The idea was to ‘uuencode’ the binary, then transfer it and decode it on the client side. We made use of the `docker cp` command to copy files from inside the Docker container’s filesystem to the webserver’s proximity. For the time being, we just run a query to fetch the list of files with a particular extension and copy them one by one. This is again done by capturing the STDOUT of a ‘folder-content-listing’ snippet which prints the name of each file of interest on one line. We then run a `docker cp` command to copy each file. Interestingly, the stream yields the file ‘tar’ed. We process the stream using ‘tarfile’ (a module in Python) and then we have the binary file. We uuencode it and send the string as a JSON response back. In the case of multiple files, the JSON simply has multiple entries in the format ‘filename’ : ‘string’.

Here’s a sample video showing this at work – https://www.youtube.com/watch?v=7s2ORC_ZZok
but for lack of good client side ‘uudecode’, this just shows the output of ‘uuencoding’ whatever was produced by running the code.

Then comes the most interesting problem to be solved: our use case targets people running code which in the end generates image plots or figures, but we don't expect them to write those images following a specific protocol so that we could fetch them back using the setup we have in place. Hence the need to modify the backend which generates all these figures and plots, which is the matplotlib backend. This is indeed an awesome library which allows us to suppress the use of a GUI to render plots / output images and provides us with helper functions to iterate through all figures generated by the code. We basically use the ‘agg’ backend (Link for more info), iterate through the figure handles and save them one by one on the isolated container filesystem.
Once this is done, we basically recreate the process of transfer of a binary file as mentioned in the second half above. We exploit the fact that the `<img>` tag can render images directly using the base64 version of the binary, and use that instead of `uuencode`ing it, which is difficult to decode. Once we have the base64 versions of output images, we add them as the source to the `<img>` tag and it springs to life on the webpage. An experiment we ran suggested the ratio of size of the uuencoded image to the base64 version to be ~ 1.03, so we’re not really losing anything.
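
A rough sketch of that last step (paths are hypothetical; this is not the exact server code):

    import base64
    import matplotlib
    matplotlib.use('Agg')              # suppress any GUI backend
    import matplotlib.pyplot as plt

    def collect_figures(prefix='/tmp/output'):
        encoded = {}
        for num in plt.get_fignums():  # iterate over every open figure handle
            filename = '%s_%d.png' % (prefix, num)
            plt.figure(num).savefig(filename)
            with open(filename, 'rb') as f:
                encoded[filename] = base64.b64encode(f.read())
        return encoded                 # {'filename': base64 string, ...}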

Here’s a sample video showing this at work – https://www.youtube.com/watch?v=8390QVef1-E

We moved to a new location on https://github.com/scikit-image/skimage-docker (officially under the skimage banner, happy to be a part of team docker).

Next we will work on imposing CPU and memory constraints on the running task, and then the deployment on an actual server begins. Exciting times ahead.


by sharky932014 at June 24, 2014 03:59 AM

June 23, 2014

Akhil Nair
(Kivy student)

A lot and a little

This summer of code started on 19th May and today it is 23rd June. What did I do the whole month, you ask? A lot and a little.

This month was as action packed as a climax scene of a Transformers movie!
 I'm sure the usual - had exams, had project submissions - wouldn't interest you, so let's talk about my project and how buildozer has fared and how I project it would fare.

The aim of the project is clear. Have an automated process that packages any Kivy application for a target Operating System.
My goal was to have target codes for linux by the mid term evaluation. Am I there? Well, almost.

At the beginning my mentors recommended using pyinstaller as a tool to package the application. I spent a considerable amount of time over the past month trying to get pyinstaller to work, but failed. The problem was that after following all the procedures and converting the application to an executable, the application would give an import error on a PIL library. When tested, the problem also occurred at a basic level, where I tried to package a small application which does nothing but import the pygame library. Since I had spent a major part of my month on this and a solution wasn't in sight, I moved on to the basic packaging methodology used by Debian, which I had mentioned in my proposal. After piecing together various processes I was able to successfully package an application for Debian. Small functional aspects such as resolving dependencies still remain, but I'm optimistic that they can be solved.

The whole process can be optimised and divided into the following steps:

1: Place the application in an empty directory (Yes there is a reason for that).

2: Create an __init__.py file in the application directory, so that python reads the application as a package.

3: Create a setup.py file outside the application directory.

4: The setup.py should be in the following format.
"
from distutils.core import setup

#This is a list of files to install, and where
#(relative to the 'root' dir, where setup.py is)
#You could be more specific.
files = ["*","sounds/*.wav","icons/*.png"] #List of static directories and files in your application

setup(name = "application_name",
    version = " ", #example 1.0
    description = "A small description",
    author = "myself and I",
    author_email = "email@someplace.com",
    url = "whatever",
    #Name the folder where your packages live:
    #(If you have other packages (dirs) or modules (py files) then
    #put them into the package directory - they will be found
    #recursively.)
    packages = ['application_name'],
    #'package' package must contain files (see list above)
    #I called the package 'package' thus cleverly confusing the whole issue...
    #This dict maps the package name =to=> directories
    #It says, package *needs* these files.
    package_data = {'application_name' : files },
    #'runner' is in the root.
    scripts = ["script_name"], #We will use this to call the application as a program from the command line. Ideally your application_name
    long_description = """Really long text here.""",
    #
    #This next part is for the Cheese Shop, look a little down the page.
    #classifiers = []
    )

"

5: Create another file, named after the script_name given in setup.py.

6: This file will ideally look something like this.
"
#!/usr/bin/python
from application import main #Here application being the application directory.

main.Classname.run()  #Ideally your main function

"

7: Now run the command "python setup.py sdist". This will create a directory named dist with a tarball of your application.

8: Traverse to the tarball and run the command
"py2dsc your_application_tar_file"

9: This will then create a directory named deb_dist. It will have a series of files, but we want the one named after your application; it should look like application_name-1.0 (1.0 or whichever version you specified in your setup.py file).

10: Go into that directory; you will find a directory named debian. Just run the command "debuild". Now go back to the deb_dist directory. You will have a Debian package waiting for you. Try installing it.

There are still certain things that need to be sorted out, such as key signing and dependency issues, but I will get to that soon. As you might have seen, setup.py is the most important thing. Also, I am currently using the distutils library but am going to shift to setuptools, as setuptools can package directly to RPM (and maybe Windows too).
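
As a rough sketch of that planned switch (hypothetical values, not the final buildozer code), the setuptools version would look almost the same, and `python setup.py bdist_rpm` could then build an RPM directly:

    from setuptools import setup, find_packages

    setup(
        name='application_name',
        version='1.0',
        description='A small description',
        packages=find_packages(),
        package_data={'application_name': ['*', 'sounds/*.wav', 'icons/*.png']},
        scripts=['script_name'],
    )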

I am currently writing the target code to implement this whole process and will push it for review as soon as it is over.

by AkhilNair (noreply@blogger.com) at June 23, 2014 04:17 PM

Varun Sharma
(GNU Mailman student)

Mid-term summary

So, today 5 weeks are over since GSoC started, and I am behind my scheduled timeline by around 1 week. I have written the server part (trying the patch on the server) of the ci-tool, but there is more to be done. Right now it works for a single project, but I have to think, and get advice from my mentors, about how to handle multiple projects. Currently I am using Twisted's Protocol class to create the client and server TCP connection and send the diff file from client to server. Next I need to implement a bidirectional TCP connection, so that I can first send the project name(s) and then the diff file, or send a metadata file along with the diff which helps identify the project(s).
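
As a minimal sketch (not the actual ci-tool code; the file name and port are made up), the Twisted pieces look roughly like this:

    from twisted.internet import protocol, reactor

    class DiffSender(protocol.Protocol):
        def connectionMade(self):
            # Push the locally generated diff to the server, then disconnect.
            with open('change.diff', 'rb') as f:
                self.transport.write(f.read())
            self.transport.loseConnection()

    class DiffReceiver(protocol.Protocol):
        def connectionMade(self):
            self.chunks = []

        def dataReceived(self, data):
            self.chunks.append(data)

        def connectionLost(self, reason):
            # The server can now try applying the received patch.
            diff = b''.join(self.chunks)
            print('received %d bytes of diff' % len(diff))

    class DiffReceiverFactory(protocol.Factory):
        def buildProtocol(self, addr):
            return DiffReceiver()

    # Server: reactor.listenTCP(8750, DiffReceiverFactory()); reactor.run()
    # Client: factory = protocol.ClientFactory(); factory.protocol = DiffSender
    #         reactor.connectTCP('localhost', 8750, factory); reactor.run()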

Summarizing what I have done in the last 5 weeks: I have implemented the pull, test and try (for a single project) commands of the ci-tool.

The source of the project is at http://bazaar.launchpad.net/~varun/mailman/ci_tool/files/

by Varun Sharma (noreply@blogger.com) at June 23, 2014 09:53 AM

cfulton
(Statsmodels student)

Kalman Filter Initialization - The Stationary Case

The initialization problem

The Kalman filter is a recursion for optimally making inferences about an unknown state variable given a related observed variable. In particular, if the state variable at time $t$ is represented by $\alpha_t$, then the (linear, Gaussian) Kalman filter takes as input the mean and variance of that state conditional on observations up to time $t-1$ and provides as output the filtered mean and variance of the state at time $t$ and the predicted mean and variance of the state at time $t$.

More concretely, we denote (see Durbin and Koopman (2012) for all notation)

$$ \begin{align} \alpha_t | Y_{t-1} & \sim N(a_t, P_t) \\ \alpha_t | Y_{t} & \sim N(a_{t|t}, P_{t|t}) \\ \alpha_{t+1} | Y_{t} & \sim N(a_{t+1}, P_{t+1}) \\ \end{align} $$

Then the inputs to the Kalman filter recursion are $a_t$ and $P_t$ and the outputs are $a_{t|t}, P_{t|t}$ (called filtered values) and $a_{t+1}, P_{t+1}$ (called predicted values).

This process is done for $t = 1, \dots, n$. While the predicted values as outputs of the recursion are available as inputs to subsequent iterations, an important question is initialization: what values should be used as inputs to start the very first recursion.

Specifically, when running the recursion for $t = 1$, we need as inputs $a_1, P_1$. These values define, respectively, the expectation and variance / covariance matrix for the initial state $\alpha_1 | Y_0$. Here, though, $Y_0$ denotes the observation of no data, so in fact we are looking for the unconditional expectation and variance / covariance matrix of $\alpha_1$. The question is how to find these.

In general this is a rather difficult problem (for example for non-stationary processes), but for stationary processes an analytic solution can be found.

Stationary processes

A (covariance) stationary process is, very roughly speaking, one for which the mean and covariances are not time-dependent. What this means is that we can solve for the unconditional expectation and variance explicitly (this section follows Hamilton (1994), Chapter 13).

The state equation for a state-space process (to which the Kalman filter is applied) is

$$ \alpha_{t+1} = T \alpha_t + \eta_t $$

Below I set up the elements of a typical state equation like that which would be found in the ARMA case, where the transition matrix $T$ is a sort-of companion matrix. I'm setting it up in such a way that I'll be able to adjust the dimension of the state, so we can see how some of the below methods scale.

In [3]:
import numpy as np
from scipy import linalg

def state(m=10):
    T = np.zeros((m, m), dtype=complex)
    T[0,0] = 0.6 + 1j
    idx = np.diag_indices(m-1)
    T[(idx[0]+1, idx[1])] = 1
    
    Q = np.eye(m)
    
    return T, Q

Unconditional mean

Taking unconditional expectations of both sides yields:

$$ E[\alpha_{t+1}] = T E[ \alpha_t] + E[\eta_t] $$

or $(I - T) E[\alpha_t] = 0$ and given stationarity this means that the unique solution is $E[\alpha_t] = 0$ for all $t$. Thus in initializing the Kalman filter, we set $a_t = E[\alpha_t] = 0$.

Unconditional variance / covariance matrix

Slightly more tricky is the variance / covariance matrix. To find it (as in Hamilton) post-multiply by the transpose of the state and take expectations:

$$ E[\alpha_{t+1} \alpha_{t+1}'] = E[(T \alpha_t + \eta_t)(\alpha_t' T' + \eta_t')] $$

This yields an equation of the form (denoting by $\Sigma$ and $Q$ the variance / covariance matrices of the state and disturbance):

$$ \Sigma = T \Sigma T' + Q $$

Hamilton then shows that this equation can be solved as:

$$ vec(\Sigma) = [I - (T \otimes T)]^{-1} vec(Q) $$

where $\otimes$ refers to the Kronecker product. There are two things that jump out about this equation:

  1. It can be easily solved. In Python, it would look something like:
    m = T.shape[0]
    Sigma = np.linalg.inv(np.eye(m**2) - np.kron(T, T)).dot(Q.reshape(Q.size, 1)).reshape(m, m)
    
  2. It will scale very poorly (in terms of computational time) with the dimension of the state-space ($m$). In particular, you have to take the inverse of an $m^2 \times m^2$ matrix.

Below I take a look at the timing for solving it this way using the code above (direct_inverse) and using the built-in scipy direct method (which uses a linear solver rather than taking the inverse, so it is a bit faster).

In []:
def direct_inverse(A, Q):
    n = A.shape[0]
    return np.linalg.inv(np.eye(n**2) - np.kron(A,A.conj())).dot(Q.reshape(Q.size, 1)).reshape(n,n)

def direct_solver(A, Q):
    return linalg.solve_discrete_lyapunov(A, Q)

# Example (using T, Q from state(3), so the results can be compared with the
# bilinear solutions further below)
T, Q = state(3)
from numpy.testing import assert_allclose
np.set_printoptions(precision=10)
sol1 = direct_inverse(T, Q)
sol2 = direct_solver(T, Q)

assert_allclose(sol1,sol2)
In [6]:
# Timings for m=1
T, Q = state(1)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
10000 loops, best of 3: 58.8 µs per loop
10000 loops, best of 3: 73.4 µs per loop

In [7]:
# Timings for m=5
T, Q = state(5)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
10000 loops, best of 3: 152 µs per loop
10000 loops, best of 3: 128 µs per loop

In [8]:
# Timings for m=10
T, Q = state(10)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
100 loops, best of 3: 1.71 ms per loop
1000 loops, best of 3: 1.03 ms per loop

In [11]:
# Timings for m=50
T, Q = state(50)
%timeit direct_inverse(T, Q)
%timeit direct_solver(T, Q)
1 loops, best of 3: 12.6 s per loop
1 loops, best of 3: 3.5 s per loop

Lyapunov equations

As you can notice by looking at the name of the scipy function, the equation describing the unconditional variance / covariance matrix, $\Sigma = T \Sigma T' + Q$ is an example of a discrete Lyapunov equation.

One place to turn to improve performance on matrix-related operations is to the underlying Fortran linear algebra libraries: BLAS and LAPACK; if there exists a special-case solver for discrete time Lyapunov equations, we can call that function and be done.

Unfortunately, no such function exists, but what does exist is a special-case solver for Sylvester equations (*trsyl), which are equations of the form $AX + XB = C$. Furthermore, the continuous Lyapunov equation, $AX + XA^H + Q = 0$, is a special case of a Sylvester equation. Thus if we can transform the discrete Lyapunov equation into a continuous one, we can then solve it quickly as a Sylvester equation.

The current implementation of the scipy discrete Lyapunov solver does not do that, although their continuous solver solve_lyapunov does call solve_sylvester which calls *trsyl. So, we need to find a transformation from discrete to continuous and directly call solve_lyapunov which will do the heavy lifting for us.

It turns out that there are several transformations that will do it. See Gajic and Qureshi (2008) for details. Below I present two bilinear transformations and show their timings.

In [34]:
def bilinear1(A, Q):
    A = A.conj().T
    n = A.shape[0]
    eye = np.eye(n)
    B = np.linalg.inv(A - eye).dot(A + eye)
    res = linalg.solve_lyapunov(B.conj().T, -Q)
    return 0.5*(B - eye).conj().T.dot(res).dot(B - eye)

def bilinear2(A, Q):
    A = A.conj().T
    n = A.shape[0]
    eye = np.eye(n)
    AI_inv = np.linalg.inv(A + eye)
    B = (A - eye).dot(AI_inv)
    C = 2*np.linalg.inv(A.conj().T + eye).dot(Q).dot(AI_inv)
    return linalg.solve_lyapunov(B.conj().T, -C)

# Example:
T, Q = state(3)
sol3 = bilinear1(T, Q)
sol4 = bilinear2(T, Q)

assert_allclose(sol1,sol3)
assert_allclose(sol3,sol4)
In [36]:
# Timings for m=1
T, Q = state(1)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1000 loops, best of 3: 202 µs per loop
1000 loops, best of 3: 219 µs per loop

In [37]:
# Timings for m=5
T, Q = state(5)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1000 loops, best of 3: 257 µs per loop
1000 loops, best of 3: 316 µs per loop

In [38]:
# Timings for m=10
T, Q = state(10)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1000 loops, best of 3: 268 µs per loop
1000 loops, best of 3: 298 µs per loop

In [39]:
# Timings for m=50
T, Q = state(50)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
100 loops, best of 3: 2.31 ms per loop
100 loops, best of 3: 2.66 ms per loop

Notice that this method does so well we can even try $m=500$.

In [40]:
# Timings for m=500
T, Q = state(500)
%timeit bilinear1(T, Q)
%timeit bilinear2(T, Q)
1 loops, best of 3: 1.51 s per loop
1 loops, best of 3: 1.67 s per loop

Final thoughts

The first thing to notice is how much better the bilinear transformations do as $m$ grows large. They are able to take advantage of the special formulation of the problem so as to avoid many calculations that a generic inverse (or linear solver) would have to do. Second, though, for small $m$, the original analytic solutions are actually better.

I have submitted a pull request to Scipy to augment the solve_discrete_lyapunov for large $m$ ($m >= 10$) using the second bilinear transformation to solve it as a Sylvester equation.

Finally, below is a graph of the timings.

[Figure: plot comparing the timings of direct_inverse, direct_solver, bilinear1 and bilinear2 as the state dimension m grows]

by Chad Fulton at June 23, 2014 07:06 AM

Julia Medina
(Scrapy student)

Mid-term summary

Last week has been quiet as I was doing my last assignment in college, and it has kept me busier than planned. I’ve been working on top of the pull request that I mentioned in my previous post, extending the functionality of the new settings API, but I’m waiting on its review to initiate new pull requests.

These two months of researching and coding have been great. The initial overwhelming feeling over the size and complexity of the problem when I was dealing with my GSoC application has disappeared as I got familiar with Scrapy's implementation details and code base. It’s definitely easier now to come up with new feature ideas and integrate them consistently to existing work.

Working on an open source project has surely been a test for my coding skills. Knowing that my work can be looked at by anyone and it will be especially reviewed by skilled developers has raised my standards of what makes code “good enough” for me. I’m more aware of the importance of maintaining a well tested, documented and supported base, as these good practices greatly aid to contribute to and adopt open source projects.

Although I made some changes to the schedule that I detailed in March, I’m satisfied with the pace of the work that I've done, and I’m even more confident that the entire job will be finished by August, provided that I end my college classes. The most difficult task so far has been to keep balance between college assignments and GSoC work, so I’m relieved to know that I can concentrate on one thing from now on.

The settings API is practically finished (There are two simple features still missing), which is roughly half the work on my proposal, and I’ll start with the backend clean up soon.

by Julia Medina (noreply@blogger.com) at June 23, 2014 04:43 AM

June 22, 2014

Brigitta Sipőcz
(Astropy student)

Anurag Goel
(Mercurial student)

Still a long way to go and more milestones to cover

This post is my midterm summary update. The last two weeks were quite busy for me. As I mentioned in the previous post, I will try to cover things quickly. Below is my work progress up to the midterm.

So far I have completed mainly two tasks.

Task 1 is about gathering timing information for test files. In this task I calculated two main things:
1) the user time taken by child processes, and
2) the system time taken by child processes.

This task has already been described in detail in the previous post. The patch for this task is under its 4th revision of review and is getting better with every revision. You can find more patch details here.

Task 2 is to “set up a tool able to plot timing information for each file”.
In this task I first introduced new ‘--json’ functionality in “run-tests.py”. While testing, if the user enables the ‘--json’ option, the timing data gets stored in JSON format in a newly created ‘report.json’ file. The patch for this task is under its 2nd revision. You can find more patch details here.

After that, I wrote an HTML/JavaScript file which reads this report.json file to plot a graph of test name vs. test time. The Mercurial (hg) buildbot runs the test cases periodically, so the main aim of this task is to give users a graphical view of the test results.

Apart from the above two tasks, the work on which I spent most of my time was fixing regressions. These regressions were recently introduced by the refactoring of run-tests.py; the fixes include:
1) produce error on running a failing test
2) produce '!' mark after running a failing test
3) skipped test should not produce 'i' mark while retesting
4) fixes the number of tests ran when '--retest' is enabled
5) checks behaviour of test on failure while testing
6) fixes the '--interactive' option error
 
There are several other milestones to cover which mainly includes
1) Designing of annotation format
2) Highlight test section with annotation format
3) Declare dependency between sections

The regression fixing above was quite a warm-up exercise. It helped me understand how things work in "run-tests.py", which will definitely help me quickly cover the above-mentioned milestones.

Stay tuned for more updates :)

by Anurag Goel (noreply@blogger.com) at June 22, 2014 10:10 PM

Vytautas Jančauskas
(TARDIS-SN student)

GSoC Mid-term Results and Achievements

By the mid-term we succeeded in fulfilling most of the goals of the overall program. The core Monte Carlo routines are now implemented in C. They were significantly restructured to allow better readability and, as of late, run around 35% faster. There are still areas in want of improvement, but the general feel of the code is a lot better. The code can be found here and will be merged into the main fork some time soon. The new changes will hopefully let us expand the code with new physics easily. This effectively does the job described in the GSoC proposal; however, there are still areas where improvements will need to be made. These include error handling, logging, nicer interfaces, and code and user documentation. As far as improving performance is concerned, in my opinion we are close to reaching a dead end. The code is currently very basic and uses mostly atomic operations. If further improvement in program running time is wanted, smarter algorithms will need to be developed, although currently I do not have a clear idea of what those would have to be.

by Vytautas Jančauskas (noreply@blogger.com) at June 22, 2014 09:13 PM

Tarun Gaba
(PyDy student)

GSoC 14: Second Week!

Second week of GSoC '14 has ended. It has been a very hectic week. A good part of my time was spent in studying and tweaking the MGView code (whose fork we had decided to work on). The new generic visualizer has to be compatible with both PyDy and MG (MotionGenesis), so great care is being taken to maintain the backwards compatibility of the package as well as to add support for PyDy visualization data. I have also worked on the Python side to modify the JSON data generated for the visualizations (from PyDy).

Accomplishments: There have been three major accomplishments this week:
- Studying MGView code: This has to be the most time consuming part! I had to go over a module spanning multiple files to study the workflow and how the methods are linked together. Some conversations with Adam Leeper (who wrote the original package) have been very useful in this regard.
- Modifying the JSON generated from PyDy: Certain modifications have been made to the JSON generated from the PyDy backend so that it is compatible with MotionView. A major change is that the generated JSON data is split over two files: one, `scene_desc.json`, contains all the parameters required to create a static scene, and the other, `simulation_data.json`, contains all the simulation data (in the form of 4x4 matrices). This is very helpful in un-cluttering the data. The API changes on the Python side are totally backward compatible, which means no previous code shall break once these changes are merged.
- Tweaking MGView code: I have started working on the MGView code to provide support for PyDy-generated JSON data. This part will span the coming weeks too.

Objectives: The objective of the upcoming week is to be able to produce some visualizations on the frontend side. For that I will need to tweak the MGView code to allow support for PyDy JSONs.

Issues: The main issue I am facing is the inhomogeneity in the output produced by PyDy and MG. Since the visualizer (MotionView) must be able to visualize data generated from both of these packages without any bias, it has to be designed like so. Since it was originally built to visualize data generated specifically from the MotionGenesis software, there are certain parts of the code which need to be totally rewritten, and this has to be done without breaking the workflow of the visualizer, so it can get somewhat messy. I am working on finding a clean way to do so.

The upcoming two weeks are somewhat crucial to the project timeline. I have to work really hard in the coming weeks to keep up with my deliverables before the midterm evaluations. I think I shall get back to coding now! :)

June 22, 2014 07:11 PM

Asish Panda
(SunPy student)

Progress mid term

After a month into the GSoC coding period, I have learnt many new things about Python and git, and also how the OSS community works. This has been a great experience for me, working with people who are so willing to help a beginner!
Currently I have created a PR to merge various files that have been quantified, which include:

1) sunpy/image/coalignment.py

2) sunpy/image/tests/test_coalignment.py

3) sunpy/instr/rhessi.py

4) sunpy/physics/transforms/differential_rotation.py

5) sunpy/physics/transforms/tests/test_differential_rotation.py

6) sunpy/sun/sun.py

7) sunpy/sun/tests/test_sun.py

8) sunpy/util/tests/test_unit_conversion.py [will be removed soon]

9) sunpy/util/unit_conversion.py [will be removed soon]

There were many roadblocks and mistakes that consumed many hours, but they were interesting. I messed up and pushed an erroneous branch to master, which could have deleted all my progress. I learnt the hard way that I should not mess with git without thinking!

Also, the spectra integration (which I have done, and I will create a PR for soon) was quite tough for me to understand. On the bright side, I learnt many new features of Python going through that code!
My next tasks will be to integrate maps and wcs with Astropy.

PR link


by kaichogami at June 22, 2014 05:20 PM

Rajeev S
(GNU Mailman student)

Half Way through the GSoC

Half of the GSoC is done and I am happy that I have been able to cope up with the deadlines I have set in my proposal. The Mailman CLI is almost fully done and can be used comfortably to manage a mailman installation.

At the end of the first phase of the project, the CLI supports 11 commands in 4 scopes, resulting in almost 25 actions that can be performed on Mailman.

Here is a snapshot of the functionalities that can now be performed using the CLI
  1. Creating mailing lists, domains and users
  2. Deleting mailing lists, domains and users
  3. Subscribing and Unsubscribing a user to a mailing list
  4. List the mailing lists, users and domains in the installation, supporting a detailed and non detailed mode
  5. Describe a user, mailing list or domain, showing the important properties of the object
  6. Adding and removing moderators
  7. Adding and removing owners
  8. Display and update preferences of users, addresses, members and globally

It was suggested to me to follow TDD, and I liked the approach too. But since most of the newer commands print output rather than performing an action, TDD became painful, consumed too much of my time and proved to be awfully boring. I spent days writing test code rather than the functionality of the CLI. Finally, I dropped the test writing and continued with the project. I will be writing tests, as planned in my proposal, during the week of the mid-term evaluation and during the last week of the project.

As part of the TDD process, a lot of code refactoring was done to make the code compatible with the unit tests. The most important decision was about how errors were handled. Until then, error messages were printed directly to stdout, so there was no way to use these modules in an external project, as the errors could not be caught. The unittest module's assertRaises method seemed to be the easiest way to test many of the functionalities, hence I replaced all the `print errors` with raise expressions. Along with this, custom exception classes were built for each scope. All these custom error messages are caught at a single point and printed to stderr (just changed from stdout to stderr, a change to be applied from r61).

All other messages which are not part of the output of commands, like errors and confirmation messages, are printed in cute shades: red for errors, yellow for warnings, and green for confirmations and emphasized messages. These help the messages stand out in a terminal. The colors were added very easily using the corresponding escape codes for each color. This functionality is encapsulated in a class named Colorize, which contains methods like warn, error, emphasize and confirm.
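
A minimal sketch of such a helper (not the CLI's actual code, just the idea with the method names mentioned above):

    class Colorize:
        RED = '\033[91m'
        YELLOW = '\033[93m'
        GREEN = '\033[92m'
        BOLD = '\033[1m'
        RESET = '\033[0m'

        def error(self, message):
            return self.RED + message + self.RESET

        def warn(self, message):
            return self.YELLOW + message + self.RESET

        def confirm(self, message):
            return self.GREEN + message + self.RESET

        def emphasize(self, message):
            return self.BOLD + message + self.RESET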

The project now enables the users to manage a mailman installation with ease, allowing them to make changes with a single command. Most of these actions previously required the user to write scripts on the python shell. Now most of the common actions on Mailman is a single command away.

Other interesting stuff used in the project include the tabulate module, which handles the headache of output formatting, especially when the output is huge, as is the case of many commands in the CLI. I had used the difflib.get_close_matches function to suggest the possible keys for preferences, but removed it on finding a better alternative. The nosetests + unittest suite manages all my test cases and test execution process. This was my first encounter with the unittest module and unittest'ing.

The argparse module takes care of all my command line arguments, usage help for commands and proper error reporting. I had used optparse before, but not on this scale. Last but not least, the bzr+launchpad duo, which again is new to me, but a lot similar to git. However, bzr lacks many of the handy features that git has, like git grep, and the log is inverted by default.

The next phase of the project is even more interesting: it involves building a custom shell, which in effect involves building a query language. I guess I will make use of a recursive descent parser for that purpose. From the looks of it, the commands will be almost the same in both the CLI and the shell. I will have to come up with a fair design for the parser in the coming week, in addition to writing tests for the work done so far.

by Rajeev S (noreply@blogger.com) at June 22, 2014 05:16 PM

Chinmay Joshi
(Mercurial student)

Midterm Update

The midterm evaluation period has arrived and I am summarizing my GSoC journey. Until now, I have been working on various files in Mercurial, finding operations which touch the filesystem, adding those operations to vfs, and updating the related filesystem operations to be invoked via vfs.

During this time, I made several advancements in this process. I am currently working on mercurial/patch.py, mercurial/hg.py, mercurial/commands.py and mercurial/simplemerge.py. A few filesystem operations like lexists and unlinkpath are already merged (see the progress wiki for more details on the WindowsUTF8 plan) and some others I sent are under review. I intend to send more from my queue after refining them based on the feedback I receive on the currently submitted patches. Fujiwara Katsunori has played a crucial role and helped me a lot by providing valuable feedback. Updating users should not only be functional but efficient. This is a large part of the project, and more operations still need to be added and replaced accordingly.

Regarding the very basic filesystem operations which do not rely on a specific file path, I recently had a discussion with Giovanni Gherdovich about adding file operations which do not rely on a base in the filesystem (e.g. os.getcwd, os.path.basename, os.path.join, etc.), and as per his suggestion we decided to try implementing them as classmethods so they can be accessed directly without vfs objects. I will be sending this series as an RFC to the Mercurial developer mailing list for feedback from the experts on this project.

For the part of the project about accessing the filesystem on Windows, I asked on the mailing list for clues. Following the discussion on the mailing list, the proposed utf8vfs should fetch filenames primarily by calling Python's API with unicode objects and should convert the results back to utf-8 encoding. However, in Mercurial there are some operations which use the Win32 API on Windows to implement a few crucial filesystem operations such as unlinking a file, removing a directory tree, etc. For these I will need to use the Win32 W API. Considering that I am absolutely new to the Win32 API, I got my hands on it last week alongside the work of adding users to vfs and updating filesystem operation users to use vfs. I decided to allocate a proportion of time to this even though it is scheduled at a later stage in the timeline, because these are things which could consume a huge amount of my time later if I get stuck on something. I have also experimented with some operations on unicode objects by adding them to utf8vfs.

Stay tuned for more updates!

by Chinmay Joshi (noreply@blogger.com) at June 22, 2014 04:30 PM

PritishC
(SunPy student)

The Midterms Approach

It has been a long, long two months for me, folks. I have finally completed my last set of semester exams, and graduated from college. Celebratory dances and jigs apart, this also meant that I could devote full time to the work, and that is all that matters!

I had not been sitting idle for that matter during my exams, either. Following on from the issue that Stuart and I posted on Astropy, regarding SkyCoord not recognizing frames in Cartesian representation, a lot has changed since then. Tom Aldcroft, a collaborator with Astropy, began working on adding the requisite changes to the SkyCoord class, with guidance from other Astropy members such as Erik Tollerud (one of the main collaborators who worked on APE5). I summarize some of the changes here as follows -:

  1. The concept of frame_attr_names was removed in favour of FrameAttribute and TimeFrameAttribute objects. For example, if I am to specify an attribute ‘d’ which defaults to 1AU (Astronomical Unit, for the uninitiated), I would write -:
    d = FrameAttribute(default=(1*u.au).to(u.km))

    Not only does this make the code easier to read, it is an example of good design principles in object-oriented programming. Additionally, such attributes are specifiable in the constructor during frame initialization.

  2. Getters and setters were added for the representation property of a frame class. One may now change the representation of a frame coordinate simply by setting the representation property.
  3. _frame_specific_representation_info was added as a nested dict which stores ‘overrides’ of representation information. For example, if I want a frame’s Cartesian representation units to be in degrees, I would specify it in this dict. I could then access the overridden representation by writing frame_coord.cartesian, for example.
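
To make this concrete, here is a minimal sketch (a hypothetical frame, not SunPy's actual code, written against the Astropy API of the time) showing how the pieces above combine in a frame class:

    import astropy.units as u
    from astropy.coordinates.baseframe import BaseCoordinateFrame, FrameAttribute
    from astropy.coordinates.representation import SphericalRepresentation

    class SomeSolarFrame(BaseCoordinateFrame):
        # Default representation used when none is specified.
        default_representation = SphericalRepresentation
        # A frame attribute defaulting to 1 AU, expressed in km; it can also
        # be overridden in the frame's constructor.
        d = FrameAttribute(default=(1 * u.au).to(u.km))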

I contributed to this pull request by adding tests for the builtin_frames.py and sky_coordinate.py files. It was an exercise in git rebasing, and I learned that the git push that follows a rebase must be done with the -f (forced) parameter. Once the SkyCoord PR was merged to Astropy-master, SkyCoord was SunPy ready. It was rather unfortunate that I fell sick in between, and I could not contribute as many tests as I would’ve liked.

The following week, my Map Tests PR was finally merged into SunPy-master. It underwent a lot of scrutiny before Stuart finally decided to merge it because it had been left unmerged for too long! It feels nice to have contributed something substantial, but I knew that the real work still remained: the coordinate frames!

We began by specifying some Coordinates API tests for SunPy, as a roadmap to the features we wanted to achieve. They can be found here. Initially, of course, these tests will fail, but as features are added to the codebase, each one of those tests is supposed to pass. They were modelled after the API tests in the Astropy codebase. These will also help in completing the SunPy Enhancement Proposal which I’d filed as a PR earlier, to the sunpy-SEP repository.

Then, we began work on the coordinate frames. The PR here captures the work done so far on the coordinates API. Four frames have been defined for use with SunPy – Heliographic (Stonyhurst and Carrington realizations), Heliocentric and Helioprojective. Each of their representations and frame attributes have been defined as given in Thompson. Additionally, a transformation framework based on Astropy’s frame_transform_graph has also been created. Since each frame subclasses astropy.coordinates.baseframe.BaseCoordinateFrame, each of them gets registered into Astropy’s frame registry. The transformation framework makes use of the mathemagic given in sunpy.wcs. This was a re-implementation project after all, and this is how we plan to phase out sunpy.wcs in favour of sunpy.coordinates! I feel like a lot of things are finally coming into place.

Once I’m done with the low-level frame classes, I will look to possibly patching the high-level SkyCoord object to work properly with SunPy frames.

I’d like to add a closing thanks to my mentors – Stuart, David, Steven, Nabil and the SunPy-Astropy combine for helping me come this far. It’s been fun so far, I’m looking forward to more!


by xpritish at June 22, 2014 03:59 PM

Vighnesh Birodkar
(scikit-image student)

GSoC – Mid Term Status Update

So far

For the story so far see -  the Introduction, 2nd week update, Graph Data Structures Comparison .

After much debate on the mailing list with many scikit-image developers we finally decided to use the NetworkX Graph Class for our Region Adjacency Graph ( RAG ). It comes with a lot of well-tested functionality and would speed up the GSoC progress. It is also pure Python, and shares a lot of its dependencies with scikit-image.

Constructing the RAG

To construct a RAG, we need to iterate over the entire labeled image, looking for adjacent pixels with distinct labels. Initially I wrote a special-case Cython loop for 2D and 3D, much like this one. But the scikit-image developers suggested, and rightly so, a more general n-dimensional approach. I looked for a function which would iterate over an entire array and call a given function; as it turns out, generic_filter does exactly that. I have used it here with the callable defined here.

The footprint is created such that only the elements which are 2nd and 3rd along all axes are set, according to the connectivity. The rest are forced to zero. In the 2D case with connectivity 2 it would be the bottom-right elements (a[1,1], a[1,2], a[2,1], a[2,2]) which are set. This ensures that a pair of adjacent pixels is not processed again when the filter window moves ahead.

The footprint ensures that the first element in the array passed to the callable is the central element in the footprint. All other elements are adjacent to the central element and an edge is created between them and the central element.
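
A rough sketch of that construction for the 2D, connectivity-2 case (not the exact PR code):

    import numpy as np
    import networkx as nx
    from scipy import ndimage

    def build_rag(labels):
        graph = nx.Graph()

        def add_edges(values):
            # values[0] is the central pixel's label; the rest are the
            # "forward" neighbours selected by the footprint below.
            center = values[0]
            for neighbour in values[1:]:
                if neighbour != center:
                    graph.add_edge(int(center), int(neighbour))
            return 0.0  # generic_filter requires a scalar return value

        # Only the central element and its bottom-right neighbours are set, so
        # each adjacent pair is visited exactly once as the window moves.
        footprint = np.array([[0, 0, 0],
                              [0, 1, 1],
                              [0, 1, 1]], dtype=bool)
        ndimage.generic_filter(labels.astype(float), add_edges,
                               footprint=footprint, mode='nearest')
        return graph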

Pruning the RAG

I implemented a skeletal RAG algorithm following Juan’s suggestion. It takes the labeled image as input; this is typically an over-segmented image obtained by algorithms like SLIC or watershed. Each region in the labeled image is represented by a node in the graph, and nodes of adjacent regions are joined by an edge. The weight of this edge is the magnitude of the difference in mean color. Once the graph is constructed, nodes joined by edges with weights less than a given threshold are combined into one region, as sketched below.
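
A sketch of that merging step (assuming a NetworkX graph whose edges carry a 'weight' attribute and a label image whose values match the node ids; not the PR's exact code):

    import networkx as nx

    def merge_below_threshold(graph, labels, thresh):
        # Keep only the edges whose weight is below the threshold, then merge
        # every group of regions that stays connected through those edges.
        low = nx.Graph()
        low.add_nodes_from(graph.nodes())
        low.add_edges_from((u, v) for u, v, d in graph.edges(data=True)
                           if d['weight'] < thresh)
        new_labels = labels.copy()
        for new_id, component in enumerate(nx.connected_components(low)):
            for node in component:
                new_labels[labels == node] = new_id
        return new_labels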

You can view the Pull Request here.

Results

Below are the test results of executing the example code on two scikit-image sample images. The threshold value differs between the two results and was found by trial and error. Typically, a higher threshold value gives fewer regions in the final image.

Coffee Image

Original Image

coffee

SLIC Segmentation

coffee_slic

Threshold Graph Cut

coffee_threshold_cut

 

Coins

Original Image

coins

SLIC Segmentation

coins_slic

Threshold Graph Cut

coins_threshold_cut

Conclusion

The mechanism for RAGs is in place now. The segmentation is pretty satisfactory, considering the simple logic. The major drawback however is that it’s not fully automatic. Over the next few weeks I will implement more sophisticated merging predicates, including N-cut.


by Vighnesh Birodkar at June 22, 2014 02:14 PM

Hamzeh Alsalhi
(scikit-learn student)

Google Summer of Code Midterm Progress Summary

The first half of my summer of code has resulted in the implementation of just under half of my goals, one of the planned six pull requests has been completely finalized, two are on their way to being finalized, and two of the remaining three will be started shortly. The final pull request is a more independent feature which I aim to start as soon as I am confident the others are close to being wrapped up.

Thank you Arnaud, Joel, Oliver, Noel, and Lars for the time taken to give the constructive criticism that has vastly improved my implementation of these changes to the code base.

Sparse Input for Ensemble Methods

Gradient Boosted Regression Trees is the one remaining ensemble method that needs work before all of scikit-learn's ensemble classifiers support sparse input. The first pull request I made this summer was for sparse input on the AdaBoost ensemble method, and it was merged. After AdaBoost sparse input support was completed, I skipped ahead to later goals, with the intention of coming back to pick up work on GBRT when a pending pull request for sparse input decision trees is merged, which will make it easier to continue work on sparse input for that ensemble method.

PR #3161 - Sparse Input for AdaBoost
Status: Completed and Merged
Summary of the work done: The ensemble/weighted_boosting class was edited to avoid densifying the input data and to simply pass along sparse data to the base classifiers to allow them to proceed with training and prediction on sparse data. Tests were written to validate correctness of the AdaBoost classifier and AdaBoost regressor when using sparse data by making sure training and prediction on sparse and dense formats of the data gave identical results, as well verifying the data remained in sparse format when the base classifier supported it. Go to the AdaBoost blog post to see the results of sparse input with AdaBoost visualized.

PR - Sparse input Gradient Boosted Regression Trees (GBRT)
Status: To be started
Summary of the work to be done: Very similar to sparse input support for AdaBoost, the classifier will need modification to support passing sparse data to its base classifiers and similar tests will be written to ensure correctness of the implementation. The usefulness of this functionality depends heavily on the sparse support for decision trees which is a pending mature pull request here PR #3173.

Sparse Output Support

The Sparse Label Binarizer pull request has gone through numerous revisions after being based off existing code written in an earlier PR, and it contains a large part of the work necessary to support sparse output for One vs. Rest classification. With this support in place, many of the binary classifiers in scikit-learn can be used in a one-vs-all fashion on sparse target data. Support for sparse target data in the multilabel metrics will be implemented to provide users with metrics while avoiding the need to densify the target data. Finally, in an attempt to push support for sparse target data past one-vs-rest methods, I will work on sparse target data support for decision trees.

PR #3203 - Sparse Label Binarizer
Status: Nearing Completion
Summary of the work done: The label binarizing function in scikit-learn's label code was modified to support conversion from sparse formats, and the helper functions it uses from the utils module were modified to detect the representation type of the target data when it is in a sparse format. Read about the workings of the label binarizer.

PR #3276 - Sparse Output One vs. Rest
Status: Work In Progress
Summary of the work done: The fit and predict functions for one-vs-rest classifiers were modified to detect sparse target data and handle it without densifying the entire matrix at once; instead, the fit function iterates over densified columns of the target data and fits an individual classifier for each column, and predict uses binarization on the results from each classifier individually before combining the results into a sparse representation. A test was written to ensure that classifier accuracy is within a suitable range when using sparse target data.
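
The idea, roughly sketched (this is a simplification, not scikit-learn's actual implementation):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.base import clone

    def fit_ovr_sparse(estimator, X, Y):
        """Fit one binary classifier per column of a sparse target matrix Y."""
        Y = sp.csc_matrix(Y)                 # cheap column slicing
        estimators = []
        for i in range(Y.shape[1]):
            # Densify only one column at a time instead of the whole matrix.
            y_col = np.asarray(Y[:, i].todense()).ravel()
            estimators.append(clone(estimator).fit(X, y_col))
        return estimators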

PR - Sparse Metrics
Status: To be started
Summary of the work to be done: Modify the metrics and some misc tools to support sparse target data so that sparsity can be maintained throughout the entire learning cycle. The tools to be modified include precision score, accuracy score, parameter search, and the other metrics listed in scikit-learn's model evaluation documentation under the classification metrics header.

PR - Decision Tree and Random Forest Sparse Output
Status: To be started
Summary of the work to be done: Make revisions in the tree code to support sparsely formatted target data and update the random forest ensemble method to use the new sparse target data support.

Plan for the Coming Weeks

In hopes that the sparse label binarizer will be merged soon after making final revisions, early next week I will begin to respond to the reviews of the sparse One vs. Rest pull request and we will also see the beginnings of the sparse metrics pull request which should be wrapped up and ready for reviews in the same week. Following that the next focus will be rewinding to sparse input for ensemble methods and putting a week of work into sparse support for GBRT. Finally the sparse output decision tree pull request will be started when the remaining goals are nearing completion.

by Hamzeh (noreply@blogger.com) at June 22, 2014 03:48 AM

June 21, 2014

M S Suraj
(Vispy student)

Polygon Visuals – GSoC Mid-term

GSoC midterm is almost here and, as expected, I have picked up a good pace. Most of the triangulation is complete; a bit of polishing remains. I have also refactored it to be part of the geometry module. I have been testing and fixing bugs in my implementation by trying it out on different data. One thing I have realised is that it doesn’t work on self-intersecting polygons; this is not supported by the algorithm itself and will probably require the use of a clipping library before triangulation.

An example of triangulation (without recursive legalization):

triangulation1

I have also implemented PolygonVisual in vispy with triangulation and borders. Unexpectedly, at least for me, this also happens to be the first compound visual being added to vispy. Good thing that MeshVisual and LineVisual were already implemented which form the basis of this. I also added a mode attribute so that the LineVisual can be drawn using GL_LINES or GL_LINE_STRIP (whichever format the data is provided in).

As of now, I am using Delaunay triangulation from scipy.spatial in PolygonVisual till the vispy.geometry.triangulation is tested and fixed completely. Once done, it will become part of the geometry module.
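
The stop-gap itself is straightforward (a small sketch, not vispy's actual code):

    import numpy as np
    from scipy.spatial import Delaunay

    # Hypothetical polygon vertices.
    vertices = np.array([[0, 0], [1, 0], [1, 1], [0.5, 1.5], [0, 1]], dtype=float)
    tri = Delaunay(vertices)
    triangles = tri.simplices      # (n_triangles, 3) indices into vertices,
                                   # ready to feed a triangle mesh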

Here are few example of PolygonVisual:

polygonvisual1 polygonvisual2

Although this wasn’t in my proposal, after discussing with my mentor Luke I have also added polygon data handling features to the geometry module so that the visuals module remains uncluttered. Next week, I will add triangulation tests and more geometry module features like a bounding rectangle. Till then, cheers!


by mssurajkaiga at June 21, 2014 08:19 PM

Shailesh Ahuja
(Astropy student)

Analysing IRAF multispec format FITS files

Today I am going to analyze a couple of FITS files, which have been stored in the IRAF multispec specification described here. I am going to focus to Legendre and Chebyshev polynomial dispersion functions.
First, take a quick look at the FITS headers in the following files:
  1. Legendre dispersion file headers: http://pastebin.com/FqnSYwGe
  2. Chebyshev dispersion file headers: http://pastebin.com/NDKqNj6n

NAXIS defines the number of dimensions. In multispec format, there are always two dimensions. Multiple one-dimensional spectra are stored in this format. These headers have CTYPE1 and CTYPE2 equal to `MULTISPE`. This is necessary to indicate that the spectra is stored in multispec format. NAXIS1 tells us the size of the data (length of the flux array) in each spectra, and NAXIS2 tells us the number of such one dimensional spectra. In both the files, there are 51 spectra stored.

One of the most important header keywords is WAT2_XXX. This keyword stores the information needed to compute the dispersion at each point, for each spectrum. `specK` holds the information for the Kth spectrum. There are various numbers separated by spaces within each `specK`; these numbers describe the exact function to be used to compute the dispersion values. The following list explains them in order:

  1. Aperture number: According to Wikipedia, aperture number is directly or inversely proportional to the exposure time. This entry always holds an integer value. For both the files, this number goes from 1 to 51. This value has no significance on the calculation of dispersion.
  2. Beam number: I am not sure what this means, but it is always an integer value. For the Legendre file, this decreases from 88 to 38 and for Chebyshev file this increases from 68 to 118. 
  3. Dispersion Type: This can be 0 (linear dispersion), 1 (log-linear dispersion) or 2 (non-linear dispersion). As both these files define non-linear polynomial functions, this is always 2.
  4. Dispersion start: This value indicates the dispersion at the first physical pixel. This value is not used for computation, however, this value can be used to verify whether the function is giving the correct output at the first pixel. Unfortunately, this value is the same for all 51 spectra in both the files, which implies that this value hasn't been correctly stored. The value matches the output returned by the 51st spectra dispersion function.
  5. Average dispersion delta: This value is equal to the mean of the difference between consecutive dispersion values. Again, this value is not used for computation, but can be used to verify the function output. Similar to the previous value, this has been stored incorrectly in both these files. It is only correct for the 51st spectra.
  6. Number of pixels: This value indicates the length of the flux array of this spectra. This value can be at most the value of NAXIS1. This value should be equal to PMAX - PMIN (defined later).
  7. Doppler factor (Z): Due to relative motion of the object and the observer, Doppler effect can alter the dispersion values. This factor can be used to compute the adjusted dispersion values, by using the formula below:

                                 Adjusted dispersion = Dispersion / (1 + Z)
     
  8. Aperture low: This value is for information only. Stores the lower limit of spatial axis used to compute this dispersion.
  9. Aperture high: Again, this value is for information only. It stores the upper limit of spatial axis used to compute this dispersion.

    From this point, the function descriptors start. There can be more than one function too. In that case these descriptors are repeated starting from weight. These descriptors determine the function output. The final dispersion is calculated as:

                                Final dispersion = Sum of all function outputs
     
  10. Weight: The weight of the function gives the multiplier for the calculated dispersion. Its use becomes more obvious in the formula below.
  11. Zero point offset: The value to be added to all the dispersion values. Combined with the weight, and the Doppler factor, the function output can be calculated as:

        Final function output = Weight * (Zero point offset + function output) / (1 + Z)

    In the files given, there is only one function. The Doppler factor and the zero point offset are zero, and the weight is one. So the final dispersion is equal to the function output.
     
  12. Function type code: Up to this point, we know how to calculate the final dispersion if we know the function output. This value stores the type of the function that will be used to compute the output at any given pixel. There are six possibilities:
    1 => Chebyshev polynomial, 2 => Legendre polynomial, 3 => Cubic spline,
    4 => Linear spline, 5 => Pixel coordinate array, and 6 => Sampled coordinate array

    Starting from this point, the numbers may mean different things for different functions. I am explaining the descriptors for Legendre and Chebyshev.
     
  13. Order (O): The order of the Legendre or Chebyshev function.
  14. Minimum pixel value (Pmin): The lower limit of the range of the physical pixel coordinates.
  15. Maximum pixel value (Pmax): The upper limit of the range of the physical pixel coordinates. In combination with the lower limit, they determine the domain of the function. This domain should be mapped to [-1, 1].
  16. Coefficients: There are O coefficients that follow. These coefficients define the Legendre or the Chebyshev function; a small evaluation sketch follows this list.
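To make the descriptors concrete, here is a minimal Python sketch of how one spectrum's dispersion could be evaluated from items 7 and 10-16, using NumPy's standard Chebyshev and Legendre series and the [-1, 1] pixel mapping described in item 15. The function name and the sample coefficients below are made up for illustration, and only a single function term is handled (the real format allows several, whose outputs are summed):

import numpy as np
from numpy.polynomial import chebyshev, legendre

def dispersion(pixels, ftype, order, pmin, pmax, coeffs,
               weight=1.0, zero_offset=0.0, z=0.0):
    # Map the physical pixel range [Pmin, Pmax] onto [-1, 1] (items 14-15).
    p = np.asarray(pixels, dtype=float)
    x = (2.0 * p - (pmax + pmin)) / (pmax - pmin)
    # Evaluate the polynomial named by the function type code (items 12-13, 16).
    if ftype == 1:
        out = chebyshev.chebval(x, coeffs[:order])
    elif ftype == 2:
        out = legendre.legval(x, coeffs[:order])
    else:
        raise NotImplementedError("only Chebyshev/Legendre are sketched here")
    # Combine weight, zero point offset and Doppler factor (items 7, 10-11).
    return weight * (zero_offset + out) / (1.0 + z)

# Hypothetical example: a 4-coefficient Chebyshev dispersion over 2048 pixels.
wavelengths = dispersion(np.arange(1, 2049), ftype=1, order=4,
                         pmin=1, pmax=2048,
                         coeffs=[5000.0, 300.0, 2.0, 0.1])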
And that's it. It's a bit tedious to understand, but the format enables a great deal of information to be stored. The documentation is not very clear, and I hope this post helped you understand what these parameters stand for.

by Shailesh Ahuja (noreply@blogger.com) at June 21, 2014 09:27 AM

June 20, 2014

Richard Tsai
(SciPy/NumPy student)

GSoC2014: The First Month

The first month of my GSoC has passed and it is about time to make a summary.

The PRs

I’ve made 4 Pull-Requests for my GSoC project, 3 of which have been merged.

  • #3636 MAINT: cluster: some clean up Some cleanups. Mainly for docs.

  • #3683 ENH: cluster: rewrite and optimize vq in Cython The most challenging part. Discussed in this post already.

  • #3702 BUG: cluster: _vq unable to handle large features When I started to work on #3712, I noticed that the new version of _vq I wrote in the previous PR (#3683) may give wrong results when the values of the input features are very large. I thought it might be an overflow issue since there was a matrix multiplication in the new algorithm. But I was wrong. It was caused by an initialization issue. I used the C macro NPY_INFINITY to initialize the output distances vector. However, ndarray.fill would regard NPY_INFINITY as the “integral infinity” and fill the float (or double) vector with \(2 ^ {31} - 1\). When the actual minimum distance was greater than such an “infinity”, the result would be an uninitialized value. The correct way to initialize the output distances vector is outdists.fill(np.inf).

  • #3712 ENH: cluster: reimplement the update-step of K-means in Cython This PR is my recent work and hasn’t been merged yet.

    An iteration of K-means requires two steps, the assignment step and the update step. The assignment step, implemented as vq.vq, has been optimized in the previous PRs. But the update step is still time-consuming, especially for low-dimensional data. The current implementation of the update step in SciPy looks like:

    for j in range(nc):
        mbs = np.where(label == j)
        code[j] = np.mean(data[mbs], axis=0)
    

    However, np.where is slow. A much more efficient algorithm would be:

    for j in range(nobs):
        code[labels[j], :] += data[j, :]
        counts[labels[j]] += 1
    for j in range(nc):
        code[j] /= counts[j]
    

    But of course looping through the observation matrix in pure Python would be extremely slow, so I implemented it in Cython and achieved decent speedups, especially for low-dimensional data. A runnable NumPy illustration of the same accumulation idea is sketched below.
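    As a rough illustration only (the actual PR implements the loop in Cython; np.add.at here stands in for the typed loop, and the data is synthetic):

    import numpy as np

    def update_step(data, labels, nc):
        # Accumulate the observations belonging to each centroid, then divide
        # by the member counts (assumes every cluster has at least one member).
        nobs, nfeat = data.shape
        code = np.zeros((nc, nfeat))
        counts = np.zeros(nc)
        np.add.at(code, labels, data)
        np.add.at(counts, labels, 1)
        return code / counts[:, None]

    rng = np.random.RandomState(0)
    data = rng.randn(1000, 3)
    labels = rng.randint(0, 4, size=1000)
    centroids = update_step(data, labels, nc=4)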

Achievement

First of all, I rewrote the underlying part of cluster.vq in Cython and it should be easier to maintain than the original C implementation.

Then, the performance. The new algorithms and optimization tricks work well. I built SciPy with ATLAS and tested vq.kmeans(obs, k) on a i5-3470 machine, and I got the following performance statistics.

dataset                          origin       #3683        #3712       speedup
30000 x 100, k = 10, float64     28636.2594   16950.3884   12687.425   2.26
30000 x 100, k = 10, float32     13575.0532    6483.4734    5272.2     2.57
10000 x 30, k = 10, float64       1923.3946    1146.7348     752.193   2.56
10000 x 30, k = 10, float32       1491.879      989.209      658.1064  2.27
100000 x 3, k = 4, float64        1663.143     1181.5912     529.0992  3.14
100000 x 3, k = 4, float32        1278.033      967.4396     431.0248  2.97
10000 x 1, k = 3, float64          117.6908      94.0622      41.6186  2.83
10000 x 1, k = 3, float32          110.2874      98.6826      41.9904  2.63
1000 x 3, k = 4, float64            29.23        28.697        9.8542  2.97
1000 x 3, k = 4, float32            27.5242      29.2488      12.4894  2.20

What’s Next

There’re still about two months to go. First, I will continue to work on #3712. This PR still has some issues: there are some build warnings which seem to be caused by Cython fused types on some machines. Then I will start to work on cluster.hierarchy. The hierarchy module will be more complicated and challenging than vq and it may take me about a month to rewrite it in Cython. After that I’ll try to implement the SLINK and CLINK algorithms.

At last, I want to thank my mentors Ralf, Charles and David, and of course everyone else in the SciPy community, who have helped me a lot. I’m very glad to be working with you!

by Richard at June 20, 2014 05:07 PM

June 19, 2014

Milan Oberkirch
(Core Python student)

Multitasking

I spent the last days improving my smtpd patch and reading some (proposed) standards which have their very own way of sounding boring in almost every sentence (they have to be specific so don’t blame the authors). For example the following sentence can be found in nearly every RFC published since 1997:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

After reading all those RFCs I feel like I SHOULD get back coding right now.

The most relevant RFCs so far:

  • 6531: SMTPUTF8
    already implemented in smtpd.py and in progress in smtplib. I will come back to this after dealing with imaplib and #15014 is closed.
  • 6532: Internationalized Headers
    the one I started with in the first week, affects the email package. I’ll come back to that when everything else is done (I discussed some of the reasons for this in my last post).

and new this week:

  • 6855: UTF-8 support for IMAP
    This concerns the imaplib. I will get started with that today.
  • 6856: same thing for POP3
    concerns poplib (obviously). I already proposed a patch for this one but left out the new ‘LANG’ command (used to change the language of messages from the server) for now. It seemed quite irrelevant, but I MAY add it sometime in the future.
  • 6857: Post-Delivery Message Downgrading for Internationalized Email Messages
    I MAY look into this after I implemented the things above.

So I have many new playgrounds now. Working on many things in parallel is also an attempt to make the whole review and patch-improvement process more efficient.


by milan at June 19, 2014 06:30 PM

Jaspreet Singh
(pgmpy student)

The One With PomdpX Reader Class

Hi folks!
There were a lot of decisions to be made before I could start hacking for pgmpy. First of all, I had to decide which of the file formats I would start with. I decided to choose the PomdpX file format since it is based on XML, similar to the XMLBIF format which was already completed by Ankur Ankan. This eased the task. The Reader class was to be implemented first. The aim was to parse a PomdpX file and convert all the information it contains about the model into structured data. I read the etree tutorial and learned about the functions which would be used to achieve the task. The ElementTree API provides different kinds of functions to get this done. I'll provide an example. Suppose the xml looks something like this

<pomdpx version="0.1" id="rockSample"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="pomdpx.xsd">
<Description> · · · </Description>
<Discount> · · · </Discount>
<Variable> · · · </Variable>
<InitialStateBelief> · · · </InitialStateBelief>
<StateTransitionFunction> · · · </StateTransitionFunction>
<ObsFunction> · · · </ObsFunction>
<RewardFunction> · · · </RewardFunction>
</pomdpx>

If the above xml is contained in a string, then etree.fromstring(string) will return the root element of the tree, i.e. pomdpx in this case. Now functions like find() and findall() can be used to obtain the elements contained within the root. Attributes were obtained using the get() method and the value inside a tag using the text attribute. All of these combined helped me to form structured data out of the file format.
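As a minimal sketch of these calls (using the standard library's ElementTree, which exposes the same fromstring/find/findall/get interface; the snippet is a made-up fragment, not the real rockSample file):

from xml.etree import ElementTree as etree

xml_string = """
<pomdpx version="0.1" id="rockSample">
    <Description>toy example</Description>
    <Discount>0.95</Discount>
</pomdpx>
"""

root = etree.fromstring(xml_string)          # the root <pomdpx> element
print(root.get('version'), root.get('id'))   # attributes via get()
print(root.find('Discount').text)            # a child's content via .text
for child in root.findall('Description'):    # all matching children
    print(child.text)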

The most difficult part was the parameter tag of type Decision Diagram, where I had to parse a tree-like format. I could not use any other class object inside the Reader class, in order to keep it independent. I finally came up with a recursive parsing solution, using the idea from the TreeCPD class present in the Factors module of the pgmpy codebase.

I then wrote tests for each of the functions so that each and every possibility of mistake can be handled and no bugs remained in the code. Some docstrings were also added. Finally the Reader Class was complete along with the tests for me to send a pull request.

by jaspreet singh (noreply@blogger.com) at June 19, 2014 07:05 AM

June 18, 2014

Simon Liedtke
(Astropy student)

New service for Astroquery coming: xMatch

xMatch is a service that makes it possible to cross-identify (thus the name) sources between very large catalogues. One of these catalogues may also be a user-uploaded list of positions. For astroquery, I added a new class, XMatch, that has the methods get_available_tables, is_table_available, and query. get_available_tables returns a listing of all tables that can be queried by this service. The resulting format can be either JSON or plain text. In the sample ipython session below one can see that the number of returned lines is 13259 if the format is plain text. This means that this many tables can be queried by the xMatch service. The following command prints the first 5 tables. The utility method is_table_available simply checks whether the passed string is one of those that get_available_tables returns in the plain text format. The most interesting method, though, is query. It receives the following parameters: cat1, cat2, max_distance, format, colRA1, colDec1, colRA2, colDec2. cat1 and cat2 identify the first and the second table to be queried, respectively. max_distance denotes the maximum distance (in arcsec) that will be used to look for counterparts. The desired format can be either VOTable, JSON, CSV, or TSV. The last four parameters indicate the names of the RA and Dec columns of the respective tables.

The ipython session below shows the xmatch package "in action":

In [1]: from astroquery.xmatch import XMatch
WARNING: ConfigurationDefaultMissingWarning: Could not determine version of package astroquery Cannot install default profile. If you are importing from source, this is expected. [astroquery._astropy_init]
WARNING: AstropyDeprecationWarning: ConfigurationItem has been deprecated in astropy 0.4. Use ConfigItem objects as members of ConfigNamespace subclasses instead.  See ConfigNamespace for an example. [astropy.config.configuration]

In [2]: xm = XMatch()

In [3]: len(xm.get_available_tables('txt').splitlines())
Out[3]: 13259

In [4]: xm.get_available_tables('txt').splitlines()[:5]
Out[4]:
[u'J/MNRAS/276/33/tables',
 u'J/A+A/552/A8/tablea1',
 u'J/AJ/80/972/table',
 u'J/A+AS/99/291/table1',
 u'J/ApJS/150/1/table5a']

In [5]: xm.is_table_available(u'J/AJ/80/972/table')
Out[5]: True

In [6]: from StringIO import StringIO

In [7]: csv_in = StringIO('''ra,dec
267.22029,-20.35869
274.83971,-25.42714
275.92229,-30.36572
283.26621,-8.70756
306.01575,33.86756
322.493,12.16703''')

In [8]: csv_out = xm.query(csv_in, 'vizier:II/246/out', 5, 'csv', 'ra', 'dec')

In [9]: print csv_out
angDist,ra,dec,2MASS,RAJ2000,DEJ2000,errHalfMaj,errHalfMin,errPosAng,Jmag,Hmag,Kmag,e_Jmag,e_Hmag,e_Kmag,Qfl,Rfl,X,MeasureJD
1.352044,267.22029,-20.35869,17485281-2021323,267.220049,-20.358990,0.150,0.110,16,9.931,8.822,7.550,0.239,0.241,,EEU,226,2,2450950.8609
1.578188,267.22029,-20.35869,17485288-2021328,267.220348,-20.359125,0.140,0.120,158,8.868,7.784,8.530,,,0.128,UUB,662,2,2450950.8609
3.699368,267.22029,-20.35869,17485264-2021294,267.219344,-20.358171,0.130,0.120,33,9.562,8.525,9.445,,,0.119,UUB,662,2,2450950.8609
3.822922,267.22029,-20.35869,17485299-2021279,267.220825,-20.357754,0.080,0.070,7,10.756,9.725,9.287,0.139,0.127,0.103,EBA,222,2,2450950.8609
4.576677,267.22029,-20.35869,17485255-2021326,267.218994,-20.359064,0.140,0.130,87,10.431,9.348,7.926,0.159,0.316,,CEU,226,2,2450950.8609
0.219609,274.83971,-25.42714,18192154-2525377,274.839773,-25.427162,0.060,0.060,90,9.368,8.431,7.919,0.024,0.044,0.036,AAA,211,0,2451407.5033
1.633225,275.92229,-30.36572,18234133-3021582,275.922233,-30.366171,0.080,0.080,55,12.947,12.334,12.145,0.156,0.221,0.127,EEE,222,2,2451021.7212
0.536998,283.26621,-8.70756,18530390-0842276,283.266284,-8.707690,0.060,0.060,45,12.182,11.534,11.380,0.057,0.071,0.063,AAA,222,0,2451301.7945
1.178542,306.01575,33.86756,20240382+3352021,306.015944,+33.867275,0.060,0.060,90,13.575,12.684,12.321,0.025,0.027,0.026,AAA,222,0,2450948.9708
0.853178,322.493,12.16703,21295836+1210007,322.493171,+12.166862,0.100,0.080,179,9.798,9.339,9.176,0.109,0.150,0.100,EEA,222,0,2451080.6935
4.503950,322.493,12.16703,21295861+1210023,322.494242,+12.167332,0.100,0.080,1,10.057,9.720,9.483,0.077,0.136,0.088,EEE,222,0,2451080.6935

Note that the pull request for this package is still open and things will change, because I haven't reacted to all the comments yet.

What's next?

I started working on the next service, SkyView. It is not yet far enough along to open a new PR, but I have some concrete plans on how to continue with it. Stay tuned!

by Simon Liedtke at June 18, 2014 10:00 PM

Navin
(pgmpy student)

Three weeks into Coding - Progress / Roadblocks

Well, this was supposed to be written when I was two weeks into coding. But, it went off my mind and my mentor has kindly reminded me to write this one. Its about to be four weeks but I will just talk about the experiences of my first three weeks.

So when I had written the last blog, I hadn't even begun the coding for GSOC and hence was all excited to start coding. And now, its been three weeks. I can't use the cliched phrase "Its been a long journey", but it has been a nice journey. Let me first describe the things I have done by now before going into more philosophical discussions.

Beginning with Week 1. The task for the week was to implement the framework of Markov networks. I had to define all those little functions for adding nodes, removing edges, setting observations, accessing potentials, etc. I had to take care to ensure that we don't pass integers as node names, that we don't access observations which haven't been set, and so on. Well, to be fair to the previous developers of pgmpy, I copied quite a few things from the code for BayesianModels. So, thanks, previous developers of pgmpy!

One mistake which I did in this process was making my own class for representing factors, when it existed already in the only package which I had overlooked while going through the code. So Ankur told me about that and hence I had to change my code to work with the existing Factor class.

The next week was supposed to be spent building the independence relations for Markov networks. However, I felt like working on triangulation and junction trees, and so I started working on that module instead. The coding part for this was very interesting as it involved implementing algorithms. I implemented a lot of things, starting from checking whether a graph is triangulated, to triangulation heuristics, to implementing the JunctionTree class and writing functions to create the junction tree from the triangulated graph.

There is a long series of events which happened when I submitted the PR. First, I had not commented the code or added docstrings, because I expected a lot of comments about changing the interface and thought it would be best to postpone that until the interface and code were fixed. However, I received a (rather angry) mail from Abinash asking me why I hadn't commented the code. I gave him my reasons. He also told me that many of the functions I had written were already in the standard networkx package. Ah, and I was heart-broken. All the effort spent writing those maximum spanning trees, finding the cliques and checking whether the graph was triangulated had gone to waste. I might accept my mistake in not using the maximum spanning tree code of networkx (but then it was 15 lines and I would have written that code at least 3 times before), but who would imagine networkx hiding such specific functionality (finding cliques in chordal graphs, finding the tree width of chordal graphs, etc.) inside its deep seas of code, disguised as functions you wouldn't even find if you looked for methods of a graph object. Anyway, all that code had to be replaced by these library functions for maintainability.

Once I was done with this, I started commenting the code, adding docstrings and writing test cases. When I did that, I had a big realization: "Copying code was easy (the Bayesian Model framework to the Markov Model), but documentation and testing won't be easy", as I will actually have to do all of that myself. (Well, just because I am mentioning copying the code, don't start assuming that I copied all the code. I only copied small functions.)

This took quite some part of week3. Oh man, documentation and testing have two really bad features : 1. They take a lot of time 2. They are slightly boring to do, but then as Ankur said, "I will realize the importance of testing and commenting".

Anyway, once I was done with that I started working on the message passing algorithm, and setting up the entire framework and I am still on it. The next task after this will be to write up various inference algorithms using Junction Trees (and yes, documenting and testing too :P ). Hope I have a good time ahead, as I have been having till now.

Signing off for now.
Navin


by Navin Chandak (noreply@blogger.com) at June 18, 2014 08:14 PM

June 17, 2014

RAGHAV R V
(BinPy student)

GSoC 2014 - 4th Week

I finally coded the long pending Analog to Digital and Digital to Analog Converter Modules.

Those were quite time consuming, especially since they revealed a lot of the underlying flaws of the Threaded implementation of the BinPy Linker Module. This linker seems to be fine for now after quite a few bug fixes.

I started off with AnalogBuffer module, which gives a way to attenuate a bank of connectors and also to implement tristated input / output connections to modules. The outputs can be virtually disconnected from the inputs by setting the external enable signal to be logic low. The module also includes attenuation support and Analog / Digital interoperability.

The attenuation  will be given as voltage attenuation in decibel value.

The A2D and D2A modules are quite elaborately designed with Support for 4 Bit, 8 Bit, 16 Bit regular digital to analog conversions ( and vice versa ) and  include functionality to convert to and from IEEE 754 Standard values.
The Analog to Digital Conversion is with reference to 2 Analog inputs which can be linked from 2 connectors or 2 Busses of width one each. ( VREF+ and VREF- ) Typically in most practical applications VREF- is taken to be at GND level and VREF+ is impressed with the VCC.

This reference voltage along with the bit-width of the Digital output ( or input, in the case of Digital to Analog Converter ) decides the resolution of the A2D ( or D2A ).

A2D conversion takes place in Successive Approximation  Type.

The resolution can be obtained as a parameter since I have decorated it with @property.

IEEE 754 Standard Single Precision ( 32 bit ) and IEEE 754 Standard Double Precision ( 64 bit ) formats find a lot of application in Micro processor implementations.

The A2D usage has been elaborated in the docstring.

Finally I would like to explain the enhancement made to linker module of BinPy.

The philosophy of the linker was primarily to provide a means of automatically updating connection states and to avoid the infinite recursion troubles that arise when recursively calling the trigger() method. Well, the trigger does not seem to be going away from BinPy any time soon. It had to be re-introduced, but implemented in an entirely different way. The triggers now act as the transfer function block of a module: they transform the inputs to the desired outputs. The inputs and outputs are served by the linker thread. Callbacks are now used to bind this transfer function of a block with its state. Every connection is bound to a function that will be called by the linker module before updating. The trigger block is typically bound to the connection.

The add_link() parameters of the AutoUpdater after the change :


[embedded code snippet not rendered in this digest]


The change in linker module to callback the bound function :


[embedded code snippet not rendered in this digest]



Here's how any module can be implemented using the Buffers and the new implementation of the linkers ...


[embedded code snippet not rendered in this digest]

I've also split the linker from the AutoUpdater class. The AutoUpdater class is now a container for the DiGraph which stores the connectors, implementing methods to update them.

These are the PRs I've been working on:

1. Analog and Linker Enhancement - PR #245

This one is almost complete ( I've got stuck on this linker thread delay problem when working with large inputs ( 4x64bit connections ) and am trying to think of ways to make this efficient and reduce time delay without leaving out call back binding )

2. MicroProcessor Modules - PR #244

I've been working on modules to implement microprocessors in parallel (the code is still a stub so I haven't pushed anything yet, but I'm done with Booth's and Robertson's algorithms and am now onto division algorithms...)

by Raghav R V (noreply@blogger.com) at June 17, 2014 04:37 PM

June 16, 2014

Frank Cheng
(Statsmodels student)

How to test MICE?

MICE is a method that imputes missing data using simulation from an inferred posterior distribution. See the Appendix of http://www.statcan.gc.ca/ads-annonces/12-001-x/5857-eng.pdf to see the exact way we simulated data.

With this in mind, MICE is a random method that will yield differing results every time it is run. Therefore, a typical testing procedure where we try to get exact agreement between our implementation and an existing, mature package is not possible. However, we should get roughly the same results for any given instance, and more importantly, the distribution of results should be very similar. So what we will aim to do is really a comparison of simulations between our implementation and those of R and Stata. Right now, predictive mean matching is directly comparable between all implementations, but there are some details in our other asymptotically Gaussian approach that are different in the other packages. I will write a follow up post shortly describing the different approaches taken by different packages. For now, the takeaway for testing is that we will perform simulation studies to assess the similarity between several MICE implementations.

by Frank Cheng (noreply@blogger.com) at June 16, 2014 06:04 AM

Tyler Wade
(PyPy student)

Beginning UTF-8 Implementation

Over the last week and a half, I've begun working on implementing UTF-8 based unicodes in PyPy.

As I've stated in a previous blog post, my goal is to replace the use of the RPython unicode type in PyPy with a PyPy-specific UTF-8 implementation.  The motivation for doing this is to simplify PyPy's unicode support by providing a single implementation and application-level interface irrespective of the platform.  Presently, the size of the system's wchar_t type determines which representation is used, from both perspectives.  The new implementation will represent the unicode strings internally as an RPython string of bytes, but the user will interact with the application-level unicode string on a code point basis.  In other words, the user will have a "UTF-32" interface.

Since Python supports converting unicode objects into UTF-8 byte strings and vice-versa, RPython helper functions already exist for a lot of the handling of UTF-8. I am largely reusing these with minimal modifications.  As a result, most of my work is a matter of fixing the various places in PyPy where the built-in/RPython unicode type is expected and replacing unicode literals.  That said, I plan on leaving unicode literals in (the interpreter-level) tests in place. Replacing them serves little useful purpose and would make the tests more annoying to read and to write.

For the time being I'm not paying very much attention to optimization (with the exception of special cases for ASCII-only strings). Things will be slow for a little while; random access before any optimizations are added will be O(n). However, the goal is correctness first, and this won't be merged if it creates any serious performance regressions. That said, I already have some ideas about optimizations and hopefully I'll be far enough along in my work to be able to write more about those next time.
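To illustrate why random access is O(n) before any indexing optimizations are added, here is a toy Python 3 sketch of walking a UTF-8 byte string one code point at a time; this is only an illustration of the idea (it assumes valid UTF-8 input), not PyPy's RPython code:

def nth_codepoint(utf8_bytes, index):
    # Walk the byte string from the start, reading sequence lengths from the
    # lead bytes, until the requested code point is reached.
    offset = 0
    for i in range(index + 1):
        lead = utf8_bytes[offset]
        if lead < 0x80:          # 1-byte sequence (ASCII)
            length = 1
        elif lead < 0xE0:        # 2-byte sequence
            length = 2
        elif lead < 0xF0:        # 3-byte sequence
            length = 3
        else:                    # 4-byte sequence
            length = 4
        if i == index:
            return utf8_bytes[offset:offset + length].decode("utf-8")
        offset += length

print(nth_codepoint(u"na\u00efve \u2603".encode("utf-8"), 6))  # the snowman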

by Tyler Wade (noreply@blogger.com) at June 16, 2014 12:45 AM

June 14, 2014

Akhil Nair
(Kivy student)

Packaging- The Debian Way

In the earlier post I had mentioned how I was stuck with PyInstaller. Despite several efforts the roadblock remained, so earlier this month I found myself at a dead end with nowhere to turn.

After some soul searching and the inspirational season finale of Silicon Valley, I decided to try a different approach. And I must say, this looks promising.

What Pyinstaller does is it converts a program into an executable. We have to take it the rest of the way.
So I thought why not try to directly package using some basic debian build tools.

And I can say that I was fairly successful.

So this method has two parts. The setup creation and the package creation.

A combination of Part1 and Part2.
There are some steps in between but nothing big.

Part1 deals with making a Kivy Application an installable program that can be run by firing a command on the terminal.
This step is important because this functionality is a fundamental part of any debian package.

These are the steps in part one:

- Create a setup.py file outside the application directory and enclose this combination into another directory which we may refer to as root.
The setup.py file will contain information about the author, static files, package, etc. 

- We must also make the application directory as a package which can be read as a module. So we create a __init__.py file (doesn't matter if it's blank) so that python refers to it as a module.

- Create a Manifest file which specifies all the other files(non-python) of the application such as sounds, images, readme, etc.

- Run the command "python setup.py sdist". This will create a dist directory containing a tarball. You may check whether the application can be installed by running the "python setup.py install" command. If this part works well, then the second part should work without hiccups. A minimal setup.py sketch follows below.
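For reference, here is a minimal, hypothetical setup.py for this first part; the package name "myapp", the data patterns and the bin/myapp launcher script are placeholders to adapt to your own project:

# setup.py (placed in the root directory, next to the "myapp" package)
from distutils.core import setup

setup(
    name='myapp',
    version='0.1',
    author='Your Name',
    packages=['myapp'],                  # the directory containing __init__.py
    package_data={'myapp': ['*.kv', 'images/*.png', 'sounds/*.wav']},
    scripts=['bin/myapp'],               # installed as a terminal command
)

A matching MANIFEST.in template would list the non-Python files with lines such as "include README" and "recursive-include myapp/images *.png"; note that recursive-include takes a directory followed by file patterns and belongs in MANIFEST.in, the template from which distutils generates the MANIFEST file.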

The second part deals with the creation of deb package using the obtained tarball.



Roadblock:
                Static files are not getting included in the tarball. Following the MANIFEST file instructions in the Python distutils tutorial only results in the creation of an empty directory named "recursive-include".


If some of the problems can be ironed out, this can work wonderfully.
Also the assumption is that the first part would be same for rpm conversion, so it can be possible to have a single target code for both deb and rpm.


Final thoughts:
                           Since I have the attention span of a sparrow and that this post is getting larger, I would like to elaborate on the second part in the next post. Those of you who couldn't understand a thing can go to the link I have provided above, what I am trying to explain is pretty much the same thing, only that they explain it better.

Feedback and advice will be appreciated, and I will try to send chocolates to anyone who can help me out with this (just kidding).
 



by AkhilNair (noreply@blogger.com) at June 14, 2014 05:47 AM

June 13, 2014

Wenzhu Man
(PyPy student)

two end nursery GC

So things have been kind of complicated in the past four weeks.
My original project changed to a new one, as someone in PyPy had already been working on the original project (adding pinning to the GC) for months, but none of my mentors knew until two weeks had passed.

So my new project is: a two-end nursery GC.
As mentioned in a previous post, the current incremental minimark GC in PyPy has two generations: the nursery generation and the old generation. There is also an external_malloc() to raw-allocate objects (these objects are out of GC control).

The current nursery allocation logic is clear: there are three pointers, nursery-free, nursery-top and nursery-real-top. Between nursery-free and nursery-top there is a piece of zeroed memory (the grey area in the figure), and between nursery-top and nursery-real-top there is non-zeroed memory containing random garbage (the blue area).


The nursery is cleaned up step by step: when nursery-free reaches nursery-top, nursery-top moves forward and the memory is zeroed for one more step (defined by nursery-cleanup).

From this mechanism we can see that every object is allocated in memory that is full of zeros. This is very useful for some objects. Take an object that contains a field which is a GC pointer: it needs to be zero-initialized, because otherwise, if the program doesn't quickly fill the pointer, garbage is left in this field, and at the next minor or major collection, trying to follow this garbage-valued pointer would crash.
However, not all objects need such zero-allocation. For any object not containing any GC pointers, it is not useful from the point of view of the GC. The point is that we would save part of the time needed to zero out the nursery if we could choose to allocate zeroed or non-zeroed objects.

So the idea of the project is to allocate objects that don't contain GC pointers starting from nursery-real-top and growing towards nursery-top, so that the nursery can allocate objects from both ends.

This will save time: instead of two writes (zeroing and allocation) for every object, there will be just one write for some of the objects.

The current allocation functions are malloc_fixedsize_clear() and malloc_varsize_clear(). I added two new pointers, nursery-second-part-free (which grows from nursery-real-top) and nursery-second-part-end (which points to nursery-top).
I also implemented malloc_fixedsize() as a third allocation function in the GC, to allocate objects in the opposite direction from the other end of the nursery. A toy model of the idea is sketched below.
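To make the scheme concrete, here is a toy Python model of a nursery with two bump pointers; this is only an illustration of the idea (sizes and names are made up), not PyPy's RPython code:

class TwoEndNursery:
    def __init__(self, size, step=64):
        self.memory = bytearray(b'\xaa' * size)  # pretend "random garbage"
        self.step = step
        self.free = 0            # first end: grows upward, memory zeroed first
        self.top = 0             # boundary of the already-zeroed area
        self.second_free = size  # second end: grows downward, never zeroed

    def malloc_clear(self, nbytes):
        # For objects containing GC pointers: zero the nursery step by step
        # until the request fits, then bump the "free" pointer.
        while self.free + nbytes > self.top:
            new_top = min(self.top + self.step, self.second_free)
            if new_top == self.top:
                raise MemoryError("nursery full")
            self.memory[self.top:new_top] = bytes(new_top - self.top)
            self.top = new_top
        start, self.free = self.free, self.free + nbytes
        return start

    def malloc(self, nbytes):
        # For objects without GC pointers: bump downward, skip the zeroing.
        if self.second_free - nbytes < self.top:
            raise MemoryError("nursery full")
        self.second_free -= nbytes
        return self.second_free

nursery = TwoEndNursery(1024)
a = nursery.malloc_clear(32)   # zero-filled slot near nursery-free
b = nursery.malloc(16)         # non-zeroed slot near nursery-real-top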

The next step will be to modify the GCtransformer to replace malloc_fixedsize_clear() with malloc_fixedsize() where appropriate.







by Wenzhu Man (noreply@blogger.com) at June 13, 2014 05:47 PM

Manoj Kumar
(scikit-learn student)

GSoC update – Fixing precompute and LogReg CV

Hi, The first part of the last few days of have been the most productive and the last part well not so much.

1. Fixing precompute for ElasticNetCV
The function argument precompute=”auto” was being unused in ElasticNetCV, as mentioned in my previous post. Setting precompute to ”auto” uses a Gram variant of the input matrix, which according to the documentation is np.dot(X.T, X). This theoretically helps the descent algorithm to converge faster (however, at the time of writing, I do not know exactly how). Practically, however (and after testing using the line profiler), it seems to be a bit slower, since the computation of the Gram matrix takes quite a bit of time. So with the advice of ogrisel, I split it across three pull requests, all of which are essentially easy fixes; a small usage sketch of the precompute option follows the list below.

1. https://github.com/scikit-learn/scikit-learn/pull/3247 – This ensures that the Gram variant is used if precompute is set to True, or to ”auto” when n_samples > n_features.
2. https://github.com/scikit-learn/scikit-learn/pull/3248 – Removes precompute from the MultiTask models since it is unused.
3. https://github.com/scikit-learn/scikit-learn/pull/3249 – This is a WIP, and changes the default precompute from ”auto” to False.
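For readers unfamiliar with the option, here is a small usage sketch of the precompute flag as exposed by ElasticNetCV (synthetic data, illustrative only; whether the Gram path pays off depends on the shape of X, which is exactly what the profiling above is about):

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(200, 50)
y = X.dot(rng.randn(50)) + 0.1 * rng.randn(200)

model_gram = ElasticNetCV(precompute=True).fit(X, y)    # force the Gram variant
model_plain = ElasticNetCV(precompute=False).fit(X, y)  # plain coordinate descent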

2. Threading backend for Linear Models.
I have successfully changed the backend from multiprocessing to threading after releasing the GIL for all four variants. After a final round of review it can be merged:
a] Simple coordinate descent
b] Sparse coordinate descent
c] Gram variant
d] MultiTask variant
There is a huge memory gain and speed is almost the same (if not slightly higher by this Pull Request). https://github.com/scikit-learn/scikit-learn/pull/3102

3. Logistic Regression CV
Reading someone else’s magically vectorised NumPy code isn’t an easy task and I somehow crawled my way through it (which explains the more productive first part).

I fixed a bug in the code to compute the Hessian when the intercept is true. I've also fixed sparse matrix support, added multiple tests, and confirmed that the newton-cg and lbfgs solvers give exactly the same result. The liblinear result has a slight difference due to the penalisation of the intercept.

However benchmarking gives ambiguous results. On standard datasets such as the newsgroup and digits data, almost always the lib-linear solver is the fastest. However in datasets using make_classification, lbfgs seems to be the faster solver.

Right now, my job is just to wait for comments from Alex and Olivier and make the necessary changes. I shall come up with a more detailed description on Log Reg CV next week.


by Manoj Kumar at June 13, 2014 08:05 AM

June 12, 2014

Milan Oberkirch
(Core Python student)

First things first…

After working on the email package for two weeks my mentors and I decided to work on the smtpd and smtplib modules (and RFC 6531) first. This makes sense for several reasons, one of them is task scheduling: the SMTP related modules are much smaller and therefore dealing with them first ensures that as many tasks as possible are done as early as possible (with high probability).

I moved the old development repo to https://bitbucket.org/zvyn/cpython-email and started a new one at https://bitbucket.org/zvyn/cpython for that. The patch for smtpd is ready for review at bugs.python.org. The smtplib part is also nearly done, I’m waiting for #15014 (refactoring of the authentication logic) to be resolved to finish that. Another thing that needs to get merged is support for Internationalized Domain Names (IDN). I will have to do some research on how to handle those in a SMTPUTF8 environment (see #20083, #11783).

After having finished the SMTP part I will research whether there are standardised ways to teach IMAP (the imaplib module in Python) Unicode. RFC 6857 (Section 1.1) does not sound very promising, so I will most likely return to the email package at the end of next week.

That’s it for now. I’m looking forward to blog more frequently in the future!


by milan at June 12, 2014 08:38 PM

June 05, 2014

Janani
(SciPy/NumPy student)

2 weeks over! That was fast..
And I have sent my pull request finishing the first of 4 sub-parts of my project.
The link to the pull request is https://github.com/scipy/scipy/pull/3692.
Here the function to implement spherical harmonics has been rewritten in Cython. It uses lpmv fortran sub-routine (with the help of the existing wrapper function pmv_wrap) to calculate associated Legendre Function.
Initially the function to calculate spherical harmonics was written in python which called lpmn function to calculate the associated Legendre function. Lpmn function in turn uses lpmn fortran routine to calculate the function. The routine iteratively evaluates associated Legendre function for all orders and degrees m and n. While the routine lpmv avoids the iteration by using recursion.
The usage of a different routine and rewriting the code in cython has given a huge leap in the speed of the function which is shown in a simple benchmark performed by Pauli,
Before:
In [4]: %timeit sph_harm(50, 50, 0.5, 0.9)
10000 loops, best of 3: 81.3 µs per loop
In [5]: x = np.linspace(0, 1, 2000)
In [6]: %timeit sph_harm(50, 50, 0.5, x)
10 loops, best of 3: 127 ms per loop

After:
In [2]: %timeit sph_harm(50, 50, 0.5, 0.9)
100000 loops, best of 3: 4.52 µs per loop
In [3]: x = np.linspace(0, 1, 2000)
In [4]: %timeit sph_harm(50, 50, 0.5, x)
1000 loops, best of 3: 1.87 ms per loop

Now the next part of the project is to implement the ellipsoidal harmonic functions. It is kind of challenging as they are complex and haven't been implemented much elsewhere. My next 2-3 weeks are dedicated to the implementation of both kinds of ellipsoidal functions, thus meeting my mid-term deadline.

by janani padmanabhan (noreply@blogger.com) at June 05, 2014 04:52 PM

June 04, 2014

Mabry Cervin
(Astropy student)

Weeks One and Two Summary and Plans for the Future

So far astropy's Quantity class and the uncertainties package have been playing quite well together. I have been implementing the standard math functions (__add__, __sub__, etc.) so that the value, unit, and uncertainty fields are properly updated on each operation. I was prepared to deal with precision errors when testing automatically propagated uncertainties against hand-propagated ones, but so far I have not run into any problems using Python's math module for the calculations.

At my previous meeting with one of my mentors it was brought up that uquantity inheriting from Quantities and uncertainties should become a priority. Inheriting from both of those will give uquantity all of the machinery to be used as ndarrays (from the package numpy) and allow astropy users to use uquantity just as they would Quantity. Actually achieving this has been teaching me a ton about portions of Python that I have not encountered yet.

Ndarray is an impressively flexible class: it allows instantiation of an ndarray explicitly, via a different ndarray subclass, and through operations like slicing. Actually implementing this flexibility, however, requires declaring the class differently than I have always done it.

Typically when a class is defined, initialization code is placed in the __init__ method. The issue with ndarray is that __init__ is not actually run in all supported cases of instantiation. Instead it uses a combination of __new__ and __array_finalize__ for initialization code. Luckily astropy has multiple subclasses of Quantity for me to look at; using those as templates, combined with the official ndarray documentation, should make inheriting from Quantity not too difficult.
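For illustration, here is the pattern in miniature: a hypothetical ndarray subclass carrying an extra attribute (not the actual uquantity code), following the __new__/__array_finalize__ recipe from the ndarray subclassing documentation:

import numpy as np

class UncertainArray(np.ndarray):
    def __new__(cls, input_array, uncertainty=None):
        obj = np.asarray(input_array).view(cls)  # explicit construction
        obj.uncertainty = uncertainty
        return obj

    def __array_finalize__(self, obj):
        # Runs for explicit construction, view casting and slicing alike,
        # which is why the attribute is (re)attached here rather than in __init__.
        if obj is None:
            return
        self.uncertainty = getattr(obj, 'uncertainty', None)

a = UncertainArray([1.0, 2.0, 3.0], uncertainty=[0.1, 0.1, 0.2])
print(a[1:].uncertainty)   # the attribute survives slicing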

by Epitrochoid (noreply@blogger.com) at June 04, 2014 03:30 PM

June 03, 2014

Simon Liedtke
(Astropy student)

Debugging BeautifulSoup Pull Request

I started working on a pull request which replaces the XML library lxml with beautifulsoup to have a pure-Python solution. The problem is that some of the remote tests fail, thus the pull request cannot be merged yet.

When I run py.test --remote-data astroquery/eso/tests on the master branch, all tests pass fine. But when I call the same command on my branch which uses beautifulsoup instead of lxml, I get 3 failing tests, each of them raising the exception InconsistentTableError. Time to find out more and debug! Simplified, the stacktraces look like this:

  1. test_Sgr_Astar -> EsoClass.query_instrument -> raise InconsistentTableError
  2. test_empty_return -> EsoClass.query_survey -> astropy.io.ascii.read -> astropy.io.ui._guess -> raise InconsistentTableError
  3. test_SgrAstar_remotevslocal -> EsoClass.query_instrument -> raise InconsistentTableError

The exception InconsistentTableError is a subclass of ValueError and is defined in astropy.io.ascii.core. It has the following docstring:

Indicates that an input table is inconsistent in some way.

The default behavior of BaseReader is to throw an instance of this class if a data row doesn't match the header.

So the failing tests fail because some table is created with inconsistent rows. I presume this is because astropy.io.ascii_read receives the web search form instead of the generated CSV formatted string for some reason. At least this was the case when I had trouble before. Let's check it. The testing framework py.test supports debugging failed tests easily via the --pdb option. So I run the tests now like this: py.test --remote-data --pdb astroquery/eso/tests. This will launch the Python debugger pdb.

> /home/derdon/repos/python/3rdparty/astroquery/astroquery/eso/core.py(333)query_instrument()
-> raise ex
(Pdb) ''.join(content.splitlines()[:3])
'<html><HEAD><TITLE>ESO Science Archive - MIDI Data Products</TITLE>'

Yup, looks like content contains the web form instead of the CSV-formatted string it should (content is the value that is passed to Table.read and thus passed on to astropy.io.ascii_read). I think the failure is in the method _activate_form and therefore set a breakpoint via import pdb; pdb.set_trace() at the end of this method body, directly before returning. The code contains the statement if form_elem.name in inputs: where form_elem is created via the loop for form_elem in form.find_all(['input', 'select', 'textarea']):. The following pdb checks show that this if-statement can never evaluate to True:

(Pdb) inputs
{'wdbo': 'csv/download', 'max_rows_returned': 50, 'target': 'Sgr A*'}
(Pdb) [form_elem.name for form_elem in form.find_all(['input', 'select', 'textarea'])]
[u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'select', u'select', u'input', u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'input', u'input', u'select', u'input', u'select', u'input', u'select', u'input', u'select', u'input', u'select', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'input', u'select', u'input', u'input', u'select', u'select', u'input', u'input']

I git checkout now back to master and use pdb there to see how it should be working.

(Pdb) inputs
{'wdbo': 'csv/download', 'max_rows_returned': 50, 'target': 'Sgr A*'}
(Pdb) [form_input.name for form_input in form.inputs]
[None, None, 'wdbo', 'max_rows_returned', None, None, 'target', 'resolver', 'coord_sys', 'coord1', 'coord2', 'box', 'format', 'tab_wdb_input_file', 'wdb_input_file', 'tab_night', 'night', 'stime', 'starttime', 'etime', 'endtime', 'tab_prog_id', 'prog_id', 'tab_prog_type', 'prog_type', 'tab_obs_mode', 'obs_mode', 'tab_pi_coi', 'pi_coi', 'pi_coi_name', 'tab_prog_title', 'prog_title', 'tab_dp_id', 'dp_id', 'tab_ob_id', 'ob_id', 'tab_obs_targ_name', 'obs_targ_name', 'tab_dp_cat', 'dp_cat', 'tab_dp_type', 'dp_type', 'tab_interferometry', 'tab_dp_tech', 'dp_tech', 'tab_tpl_name', 'tpl_name', 'tab_tpl_nexp', 'tpl_nexp', 'tab_tpl_start', 'tpl_start', 'tab_del_ft_sensor', 'del_ft_sensor', 'tab_del_ft_status', 'del_ft_status', 'tab_det_ntel', 'det_ntel', 'tab_iss_conf_station1', 'iss_conf_station1', 'tab_iss_conf_station2', 'iss_conf_station2', 'tab_iss_conf_station3', 'iss_conf_station3', 'tab_ocs_obs_mode', 'ocs_obs_mode', 'tab_ocs_obs_specconf', 'ocs_obs_specconf', 'tab_ocs_obs_type', 'ocs_obs_type', 'tab_ins_grat1_wlen', 'ins_grat1_wlen', 'tab_fwhm_avg', 'fwhm_avg', 'tab_airmass_range', 'airmass_range', 'tab_night_flag', 'night_flag', 'tab_moon_illu', 'moon_illu', 'order', 'extra_columns', 'force_tabular_mode', 'full_screen_mode']
(Pdb) 'wdbo' in [form_input.name for form_input in form.inputs]
True

Zing! The last expression is always False on my branch, which is the problem. Now I realize what the problem is! With lxml, form_input.name gives me the value of the name attribute of the node form_input. E.g. if the node form_input has the source <input name=foo>, then form_input.name will have the value 'foo'. But using BeautifulSoup, the value of form_input.name will be 'input', because it returns the name of the tag as a string!
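A tiny sketch of the pitfall (BeautifulSoup only): the attribute has to be looked up explicitly, since .name is taken by the tag name.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input name="wdbo" value="csv/download">', 'html.parser')
tag = soup.find('input')
print(tag.name)          # 'input' -> the name of the tag itself
print(tag.get('name'))   # 'wdbo'  -> the "name" attribute the code wants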

After having fixed the code according to my new knowledge, there is only one remote failing test left, yay! Until next time, I think this is enough for one blog post :-)

by Simon Liedtke at June 03, 2014 10:00 PM

June 02, 2014

Edwin Marshall
(Kivy student)

Too Many Moving Parts

The title really speaks to two issues, one personal, and one technical. As most of you reading this are concerned only with the technical aspects of the project, I'll focus on that first.

Technically Speaking

The awesome thing about the core kivy project is that it is organized very well and thoroughly documented. While the Kivy Architecture Diagram is a little scary when first encountered, once you take a peek at the directory structure, everything becomes much clearer. For instance, understanding how SDL ties into everything merely requires a developer study kivy/core. In fact, one could add new providers piece-meal by selecting which of the core modules to implement first. For example, if a developer wanted an SFML or GLFW-based window, he would wander into kivy/core/window.

Unfortunately, I can't say that things have been as pleasant in python-for-android. For example, in attempting to understand how SDL 1.2 ties in to the project, I've had to examine numerous files:

distribute.sh
Responsible for cross-compiling python and any other dependencies for Android
recipes
Explain how each of those dependencies should be built. Some of them are deceptively simple, until you realize that JNI is involved.
src/src/org/renpy/android/PythonActivity
An Android activity that sets up a sound thread, loads the libraries built by distribute.sh as well as the native application and sdl_main shared objects.
src/jni/{sdl,sdl_main,application}
Various bits of native wrappers responsible for not only loading the sdl shared libraries, but also providing a user's python code with an entry point.

All this indirection makes it rather easy for a simple guy like myself to get disoriented, so more time is spent understanding the code and following its logic trail than actually writing code.

Speaking of Writing Code

I've been apprehensive about breaking things, hence the lack of pushes, but that will change tonight. Yesterday I spent a lot of time reorganizing my workspace. Specifically, I made sure to properly set up my git remotes so I didn't push to the wrong branch. I've also forked to my own repo so that I never have to worry about messing up the master branch.

Today I plan to do a bit of liposuction to see if things that seem a bit redundant are necessarily so, or, like my beer gut, just another thing to judge the code by that needn't be there. I'm hoping this week will be the week that you guys see that I am actually a productive individual.

A Little Personal

On the personal side of things, The last couple of weeks have been interesting because I've recently quit my job and been scrabbling for new work. I typically would have done the smart thing and waited for a new job before quiting, but my last job was taking its toll on me.

Due to the amount of repetitive motion (and not the manual labor itself), my body has started to ache in ways it never has in all my 13 years of legally working. My stress-induced eczema was acting up, I had pain in joints all over, and I found myself more lethargic during my off days than I had ever been.

I also found myself more depressed than usual. What's ironic about that is that prior to this job, I used to find myself working excessively as a way to cope with-- or perhaps more accurately, as a distraction from-- my depression. Here, things were different; I was constantly reminded of how easily talents get ignored and stripped of an identity; I was just another worker. If I was going to be nobody, I needed to be nobody of my own accord and with the opportunity to change.

While I haven't been successful in my job search yet, I've gotten some responses from some freelance opportunities which I'll be continuing to explore this week. As such, I apologize if I seem somewhat distant.

I suppose that's enough about my last story. I hope everyone else's life has been less complicated and wish the other students happy coding!

June 02, 2014 07:26 PM

cfulton
(Statsmodels student)

Building Python Modules

Getting Started with Python Packaging

The first two weeks of the state-space project have been dedicated to introducing the Kalman filter - which was written in Cython with calls to the BLAS and LAPACK libraries linked to in Scipy - into the Statsmodels build process. A future post may describe why it was not just written in pure Python (in brief, it is because the Kalman filter is a recursive algorithm with a loop over the number of entries in a dataset, where each loop involves many matrix operations on relatively small matrices). For now, though, the source kalman_filter.pyx needs to be "Cythonized" into kalman_filter.c and then compiled into (e.g.) kalman_filter.so, either when the package is installed using pip, or from source (e.g. python setup.py install).

The first thing to figure out was the state of Python's packaging. I've had a vague sense of some of the various tools of Python packaging for a while (especially since it used to be recommended to specify --distribute when making a new virtualenv), but I had built all my Cython packages either via python setup.py build_ext --inplace (from the Cython quickstart) or via IPython magic.

The recommended setup.py file from Cython quickstart is:

from distutils.core import setup
from Cython.Build import cythonize

setup(
  name = 'Hello world app',
  ext_modules = cythonize("hello.pyx"),
)

as you can see, this uses the distutils package. However, while distutils is part of base Python and is standard for packaging, from what I could tell, distribute was the up-and-coming way to proceed. Would that it were that simple; it turns out that Python packaging is not for the faint of heart. A wonderful StackOverflow answer describes the state of the art (hopefully) as of October 2013. It comes to the conclusion that setuptools is probably the way to go, unless you only need basic packaging, in which case you should use distutils.

Setuptools

So it appeared that the way to go was to use setuptools (and more than personal preference, Statsmodels uses setuptools). Unfortunately, I have always previously used the above snippet which is distutils based, and as it turns out, the magic that makes that bit of code possible is not available in setuptools. You can read this mailing list conversation from September 2013 for a fascinating back-and-forth about what should be supported where, leading to the surprising conclusion that to make Setuptools automatically call Cython to build *.pyx files, one should trick it into believing there was a fake Pyrex installation.

This approach can be seen at the repository for the existing Kalman filter code, or at https://github.com/njsmith/scikits-sparse (in both cases, look for the "fake_pyrex" directory in the project root).

It's often a good idea, though, to look at NumPy and SciPy for how it should be done, and it turns out that neither of them uses a fake Pyrex directory, and neither relies on setuptools (or distutils) to Cythonize the *.pyx files. Instead, they use a direct subprocess call to cythonize. Why do this, though?

NumPy and SciPy

Although at first it seemed like an awfully Byzantine and arbitrary mish-mash of roll-your-owns, where no two parties do things the same way, it turns out that the NumPy / SciPy approach agrees, in spirit, with the latest Cython documentation on compilation. The idea is that Cython should not be a required dependency for installation, and thus the already Cythonized *.c files should be included in the distributed package. These will be cythonized during the python setup.py sdist process.

So the end result is that setuptools is not required to cythonize *.pyx files; it only needs to compile and link *.c files (which it has no problem with - no fake Pyrex directory needed). Then the question is, how to cythonize the files? It turns out that the common way, as mentioned above, is to use a subprocess call to the cythonize binary directly (see Statsmodels, NumPy, SciPy).
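As a rough sketch of that approach (not the actual Statsmodels build code; it assumes the cython command-line tool is installed and on the PATH, and the file name is illustrative):

import subprocess
from glob import glob

def cythonize_sources(pattern):
    # Generate the .c files before building the sdist by calling the cython
    # command-line tool on every matching .pyx file.
    for pyx in glob(pattern):
        retcode = subprocess.call(["cython", pyx])
        if retcode != 0:
            raise RuntimeError("Cythonizing %s failed" % pyx)

if __name__ == "__main__":
    cythonize_sources("kalman_filter.pyx")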

by Chad Fulton at June 02, 2014 06:50 AM

June 01, 2014

Issamou
(scikit-learn student)


I am announcing my progress for GSoC 2014:Extending Neural Networks Module for Scikit learn project here,
http://issamlaradji.blogspot.com/?view=classic

Thanks!

by Issam Laradji (noreply@blogger.com) at June 01, 2014 10:51 PM

Chinmay Joshi
(Mercurial student)

Dive in code during the First two weeks

At the end of the first two weeks of the GSoC coding period, I am really enjoying working on my project. Working with an open source project community is a never-before experience in my life. I keep on discovering new things each and every day. I thought I had done enough basic research about my project, but it continuously proves me wrong. One can't really understand the building blocks before submerging into the code. Working on the code base, I discover the various ways developers have used in Mercurial to provide maximum optimization.

 In spite of stumbling and getting frustrated in the beginning because of the slow pace of my work, I sent a small patch series in Week 2. I added some filesystem functions to vfs, like lexists(), lstat() and unlinkpath() (still in the queue to send), and updated some users in bundlerepo.py and patch.py. Some of my patches were queued and merged and some got me very important feedback. Based upon the feedback I learned to check for the callers and origins of functions before updating them with the vfs interface. I also tried to clear my doubts about repository-relative and current-working-directory (cwd) relative paths. Discussing things with the community helped me to identify the cautions for this task. Checking for all callers and origins of a function before replacing it is a task I found very time consuming and tedious, but a bare necessity. Switching to a slightly more advanced IDE than a normal editor is giving me a helping hand in this task.

I am preparing another series in which I am adding more functions to vfs and updating more users to use the new API. The first phase of this project, adding functions and updating users, looks like a quite mechanical process at first sight, but it needs proper attention for maximum compatibility and optimization. During this process, I encountered a few filesystem functions from the os module of Python which I had never used in my usual college projects. I also liked Mercurial's util, written in C for file manipulation. These util.* functions, especially the ones about tree links, are very new to me. Talking and discussing things with mentors and the community always helps to clear up my mind and gives me directions. I am currently focusing on this add-functions/update-users task as it is essential for the further process.

Stay tuned for further updates!

by Chinmay Joshi (noreply@blogger.com) at June 01, 2014 07:35 AM

Anurag Goel
(Mercurial student)

Slowly and Steadily reaching towards the goal.

It has been two weeks now. The experience has been great so far. At the initial point I was viewing the project in the big picture, like how things get done and the time it will take.
But that approach did not give me a clear idea about the project and even restricted me from moving one step forward. After communicating with mentors and crew members, I realised that I should focus on one task at a time. And before doing that task, I should break it into small steps. Following this approach helped me in completing the first task. Although it has not been reviewed yet, if I have to make some changes further, I will do them side by side.

Task 1 is about gathering timing info of test files. In this task, I calculated two main things.
1) Children's user time taken by a process.
2) Children's system time taken by a process.

I used "os.times()" module to get the above info. Unfortunately, in windows "os.times()" module works only for parent process. Therefore, my contribution only works for linux user as of now.    

According to proposal timeline, i planned to complete first two tasks in first two weeks. But i am only able to complete first task yet. Now as i got the approach, i will try to cover up things quickly.       

The biggest challenge in doing a task is working out the small steps you then have to get through. That is only possible when you communicate with your mentor and crew members as much as possible. With every conversation things get clearer, and that builds a greater understanding of the project.

by Anurag Goel (noreply@blogger.com) at June 01, 2014 04:25 AM

May 31, 2014

Manoj Kumar
(scikit-learn student)

GSoC 2nd week: Roadblocks and progress

Hi. It has been another slow week, and the heat and laziness at home (holiday season) aren't quite helping. However, I did manage to do a bit of work, and this post will serve as a reminder for me to complete all the work I have accumulated by the end of next week.

1. Got the memory profiler working.
memory_profiler (https://github.com/fabianp/memory_profiler) is a wonderful tool built by Fabian that gives a line-by-line report of the memory being used. You can install it like any other Python package, by doing

sudo python setup.py install

a] You can use it by simply importing it at the top of the file,

from memory_profiler import profile

and adding @profile above the function you want to profile.
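For example, a tiny script using the decorator could look like this (the allocate function is just a made-up example; running the script prints the line-by-line report for it):

from memory_profiler import profile

@profile
def allocate():
    # build a large list so the line-by-line report shows a clear jump
    data = [0] * (10 ** 6)
    return sum(data)

if __name__ == "__main__":
    allocate()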
As mentioned in the last post, this was giving ambiguous results, since it wasn't taking the child processes into account. There is a workaround for this: using the mprof -c command directly, which produces plots showing how much memory is being used. (Thanks to Olivier for helping me out with this.)

[Plot: mprof memory profile using multiprocessing (sparse_benchmp)]

[Plot: mprof memory profile using threading (sparse_bench)]

We can see that threading has a considerable advantage with respect to memory. This is because with threading the input data is shared by all the worker threads, while with multiprocessing each worker process needs its own copy.
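A rough way to see this for yourself with joblib (which scikit-learn uses for its parallelism); the workload here is made up and the exact memory numbers will vary:

import numpy as np
from joblib import Parallel, delayed

def col_norm(col):
    return np.linalg.norm(col)

if __name__ == "__main__":
    X = np.random.randn(2000, 2000)

    # threads share X in-process; with the multiprocessing backend each
    # dispatched column is pickled and copied to a worker, so the peak
    # memory footprint is higher
    Parallel(n_jobs=4, backend="threading")(
        delayed(col_norm)(X[:, i]) for i in range(X.shape[1]))
    Parallel(n_jobs=4, backend="multiprocessing")(
        delayed(col_norm)(X[:, i]) for i in range(X.shape[1]))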

2. While I was trying to release the GIL for the Gram case, I found that the optimisation algorithm cd_fast.enet_coordinate_descent_gram wasn't being used at all! I confirmed with Alexandre that it was indeed a refactoring bug, so I've sent a Pull Request to fix it: https://github.com/scikit-learn/scikit-learn/pull/3220 .

The Gram matrix, which is just np.dot(X.T, X), is precomputed (and coordinate_descent_gram used) in either of these two cases:
1. When precompute is set to True.
2. When precompute is set to "auto" and n_features is greater than n_samples (the default).
However, contrary to intuition, it actually slows things down (maybe because of the time taken to compute the Gram matrix), and hence I've changed the default to False in the PR.
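A quick, hypothetical way to compare the two code paths yourself (the shapes are arbitrary and timings will depend on the data):

import time
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 2000)   # n_features > n_samples, the "auto" Gram case
y = rng.randn(200)

for precompute in (True, False):
    start = time.time()
    Lasso(alpha=0.1, precompute=precompute).fit(X, y)
    print(precompute, time.time() - start)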

3. Started with the LogisticRegression CV PR
I started reading the source code of the Logistic Regression CV Pull Request, and I could understand some of it. I pushed a commit to parallelize a computation involving the regularization parameter, but found that it actually slows things down (:-X).

By the way, thanks to the people at Rackspace Cloud for providing an account to help me with my benchmark sessions and a perfect opportunity to learn vim(!).

Things to be done by next week:
1. Change the LogisticRegressionCV PR from a WIP to MRG.
2. Get the releasing GIL PR merged, with the help of other mentors.

P.S.: For those who follow cricket, whenever I'm really slow in code, I take inspiration from M.S. Dhoni (the Indian captain), who starts off really slow but then finishes in style every time (well, almost).


by Manoj Kumar at May 31, 2014 03:44 PM

Terri Oda
(PSF Org admin, Mailman mentor)

You can leave academia, but you can't get the academic spam out of your inbox

When I used to do research on spam, I wound up spending a lot of time listening to people's little pet theories. One that came up plenty was "oh, I just never post my email address on the internet" which is fine enough as a strategy depending on what you do, but is rather infeasible for academics who want to publish, as custom says we've got to put our email addresses on the paper. This leads to a lot of really awesome contacts with other researchers around the world, but sometimes it leads to stuff like the email I got today:


Dear Terri,

As stated by the Carleton University's electronic repository, you authored the work entitled "Simple Security Policy for the Web" in the framework of your postgraduate degree.

We are currently planning publications in this subject field, and we would be glad to know whether you would be interested in publishing the above mentioned work with us.

LAP LAMBERT Academic Publishing is a member of an international publishing group, which has almost 10 years of experience in the publication of high-quality research works from well-known institutions across the globe.

Besides producing printed scientific books, we also market them actively through more than 80,000 booksellers.

Kindly confirm your interest in receiving more detailed information in this respect.

I am looking forward to hearing from you.


Best regards,
Sarah Lynch
Acquisition Editor

LAP LAMBERT Academic Publishing is a trademark of OmniScriptum
GmbH & Co. KG

Heinrich-Böcking-Str. 6-8, 66121, Saarbrücken, Germany
s.lynch(at)lap-publishing.com / www. lap-publishing .com

Handelsregister Amtsgericht Saarbrücken HRA 10356
Identification Number (Verkehrsnummer): 13955
Partner with unlimited liability: VDM Management GmbH
Handelsregister Amtsgericht Saarbrücken HRB 18918
Managing director: Thorsten Ohm (CEO)


Well, I guess it's better than the many misspelled emails I get offering to let me buy a degree (I am *so* not the target audience for that, thanks), and at least it's not incredibly crappy conference spam. In fact, I'd never heard of this before, so I did a bit of searching.

Let's just post a few of the summaries from that search:

From wikipedia:
The Australian Higher Education Research Data Collection (HERDC) explicitly excludes the books by VDM Verlag and Lambert Academic Publishing from ...


From the well-titled Lambert Academic Publishing (or How Not to Publish Your Thesis):
Lambert Academic Publishing (LAP) is an imprint of Verlag Dr Muller (VDM), a publisher infamous for selling cobbled-together "books" made ...


And most amusingly, the reason I've included the phrase "academic spam" in the title:
I was contacted today by a representative of Lambert Academic Publishing requesting that I change the title of my blog post "Academic Spam", ...


So yeah, no. My thesis is already published, thanks, and Simple Security Policy for the Web is freely available on the web for probably obvious reasons. I never did convert the darned thing to html, though, which is mildly unfortunate in context!


May 31, 2014 05:19 AM

PlanetPlanet vs iPython Notebook [RESOLVED: see below]

Short version:

I'd like some help figuring out why RSS feeds that include iPython notebook contents (or more specifically, the CSS from iPython notebooks) are showing up as really messed up in the PlanetPlanet blog aggregator. See the Python summer of code aggregator and search for a MNE-Python post to see an example of what's going wrong.

Bigger context:

One of the things we ask of Python's Google Summer of Code students is regular blog posts. This is a way of encouraging them to be public about their discoveries and share their process and thoughts with the wider Python community. It's also very helpful to me as an org admin, since it makes it easier for me to share and promote the students' work. It also helps me keep track of everyone's projects without burning myself out trying to keep up with a huge number of mailing lists for each "sub-org" under the Python umbrella. Python sponsors students not only to work on the language itself, but also on projects that make heavy use of Python. In 2014, we have around 20 sub-orgs, so that's a lot of mailing lists!

One of the tools I use is PlanetPlanet, software often used for making free software "planets" or blog aggregators. It's easy to use and run, and while it's old, it doesn't require me to install and run an entire larger framework which I would then have to keep up to date. It's basically making a static page using a shell script run by a cron job. From a security perspective, all I have to worry about is that my students will post something terrible that then gets aggregated, but I'd have to worry about that no matter what blogroll software I used.

But for some reason, this year we've had some problems with some feeds, and it *looks* like the problem is specifically that PlanetPlanet can't handle iPython notebook formatted stuff in a blog post. This is pretty awkward, as iPython notebook is an awesome tool that I think we should be encouraging students to use for experimenting in Python, and it really irks me that it's not working. It looks like Chrome and Firefox parse the feed reasonably, which makes me think that somehow PlanetPlanet is the thing that's losing a <style> tag somewhere. The blogs in question seem to be on blogger, so it's also possible that it's google that's munging the stylesheet in a way that planetplanet doesn't parse.

I don't suppose this bug sounds familiar to anyone? I did some quick googling, but unfortunately the terms are all sufficiently popular when used together that I didn't find any reference to this bug. I was hoping for a quick fix from someone else, but I don't mind hacking PlanetPlanet myself if that's what it takes.

Anyone got a suggestion of where to start on a fix?

Edit: Just because I saw someone linking this on twitter, I'll update in the main post: tried Mary's suggestion of Planet Venus (see comments below) out on Monday and it seems to have done the trick, so hurrah!


May 31, 2014 03:53 AM

May 29, 2014

Wenzhu Man
(PyPy student)

"hg push -f" accident today summary

First of all: never, ever use hg push -f. Never force push.

It has been a relatively rough day.
As I was recently given write access, I made my first push today. And oops, some changesets were pushed into "default" that shouldn't have been.

What I did:
1. I pulled all the new commits (1. hg pull, 2. hg update -C to get rid of uncommitted changes, but I forgot about my committed changesets!!! Also, hg purge will get rid of untracked local files.)

("In Mercurial, pull is the opposite of push: it syncs the local repo with a remote repo without touching the working copy" but you can use hg pull -u  == git pull
Note this is different from git :  git pull == git fetch(equals hg pull) + git merge )

2. Then I ran hg push and got a warning that I would create two new branches (gc-pinning and test-ast-module). The test-ast-module branch had been closed remotely after being merged into default, but I still had it locally, so I turned on the mq extension and used the hg strip command to get rid of the local branch.
(Maybe this unsafe step created the crisis.)
Is there any safe way to delete a local branch? (The safest way I have found so far is to clone up to the commit before the branch was created and pull from there?)


After the crisis:
Well, what's done is done, and sorry is the most useless thing to say. I had to deal with the three heads I created myself, after realizing there is no such thing as simply deleting heads or deleting changesets.

So I merged the three heads (two by two) and manually resolved the conflicts.

Things seems to be fine now.

Well, at least I learnt several things:
1. Before pushing, use hg out to see what is going to be pushed.
2. hg diff is useful to check whether you are truly out of trouble (I ran hg diff against the commit before I pushed, nothing came out, so I know things are fine now; thanks to Brian Kearns).

Last words: write access to a repo is a sword; you can use it to slay the dragon or yourself...


by Wenzhu Man (noreply@blogger.com) at May 29, 2014 03:16 AM

May 23, 2014

Frank Cheng
(Statsmodels student)

coding structure first pass

Kerby and I have gotten the structure of MICE to a reasonable point; it's a good time to post an update! Here is the current user interface:

>>> import pandas as pd (1)
>>> import statsmodels.api as sm (2)
>>> from statsmodels.sandbox.mice import mice (3)
>>> data = pd.read_csv('directory_here') (4)
>>> impdata = mice.ImputedData(data) (5)
>>> m1 = impdata.new_imputer("x2") (6)
>>> m2 = impdata.new_imputer("x3") (7)
>>> m3 = impdata.new_imputer("x1", model_class=sm.Logit) (8)
>>> impcomb = mice.MICE("x1 ~ x2 + x3", sm.Logit, [m1,m2,m3]) (9)
>>> p1 = impcomb.combine(iternum=20, skipnum=10) (10)

Now here is what's going on, step by step:

1) Our base data type is going to be a pandas DataFrame. Currently, the data must be in a form that supports pd.DataFrame(data).

2) Import our statsmodels API.

3) Import our mice module.

4) Read in our data.

5) mice.ImputedData is a class that stores both the underlying data and its missing data attributes. Right now the key attribute is which indices for each variable are missing. Later we will be making changes directly to the underlying dataset, so we don't want to lose this information as we start filling in values. As soon as these indices are saved, we modify the underlying data by filling in as an initial imputation all the column-wise means for the missing values. ImputedData also contains a helper function that allows us to fill in values and a function that allows us to construct Imputers (described below). Note that changes to the data are within the scope of ImputedData, so your actual data will be safely preserved after all the MICE carnage is done :)

6-8) For each variable with missing values that we want to impute, we initialize an Imputer. These Imputers contain two simulation methods that will help us impute the specific variable of interest: impute_asymptotic_bayes (described in my last post) and impute_pmm (a new one, predictive mean matching). There will be more simulation methods later on. Each Imputer's job is to impute one variable given a formula and model, specified by the user. ImputedData.new_imputer defaults to OLS and all other variables as predictors. Note that here we are implicitly assuming Missing At Random since, conditional on the predictors, the missing value is completely random (and asymptotically Gaussian).

9) Initialize a MICE instance by specifying the model and formula that we are really interested in. The previous imputation models are simply conditional models that we think will do a good job at predicting the variable's missing values; this new model is an analysis model that we want to fit to the already-imputed datasets. We pass in a list containing all the Imputers from steps 6-8; these Imputers will do the work of imputation for us.

10) Here's where we specify an iteration number (iternum) and the number of imputations to skip between iterations (skipnum). Mechanically, what happens is that once MICE.combine is called, we start stringing imputations together, with the variable with the fewest missing observations being imputed first. ImputerChain just makes sure every Imputer in the list is used once to make one fully-imputed dataset. However, one imputation for each variable may not be enough to ensure a stable conditional distribution, so we want to skip a number of datasets between actual fits of the analysis model. So we run through all the Imputers ad nauseam until we have skipped the set number of fully imputed datasets, then fit the model. This is one iteration; we repeat until we have iternum fitted models. To be clear: if we specify iternum=10 and skipnum=5, we will go through a total of 50 imputation iterations (one iteration is a series of imputations for all specified Imputers) and only fit the analysis model at imputation numbers 5, 10, 15, etc.
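In rough pseudocode (illustrative only, with made-up names, not the actual statsmodels.sandbox.mice implementation), the schedule looks something like this:

def mice_fit_schedule(imputers, fit_analysis_model, iternum, skipnum):
    # One full cycle imputes every variable once; the analysis model is
    # fit only after every skipnum-th cycle, iternum times in total.
    fits = []
    for _ in range(iternum):
        for _ in range(skipnum):
            for imputer in imputers:
                imputer.impute()
        fits.append(fit_analysis_model())
    return fits  # iternum fitted models, later pooled with Rubin's rules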

All this skipping and fitting happens in AnalysisChain. MICE.combine then takes all these fitted models (actually, only the parameters we care about: params, cov_params, and scale) and combines them using Rubin's rule, and finally the combined parameters are stuffed into a fitted analysis model instance.

Whew, that's a mouthful! That's the gist of it; if you'd like to get under the hood (and I hope people do!), the GitHub repo is here. There are some minor things I did to statsmodels.api and the working directory just so I could work from my local GitHub folder/machine, so don't try to install it directly before changing those back.

The next step is making the user interface simpler (hopefully the user will just pass the data object directly into MICE and not have to deal with initializing Imputers and ImputedData) and also adding logic that lets the user specify which simulation method to use for imputation. Hopefully I'll get some feedback and make more improvements before my next post!

by Frank Cheng (noreply@blogger.com) at May 23, 2014 06:51 AM

May 18, 2014

Edwin Marshall
(Kivy student)

GSoC '14 End of Community Bonding Period

Project Scope

One thing that has been frustrating about this experience is that there seems to be some disparity between what I proposed and what is expected to be implemented. In particular, I didn't mention mobile platforms in my original proposal, but as I communicate with my mentors it has become rather apparent that my early focus should be not only on getting SDL2 to play nicely with Android in Kivy's ecosystem, but also on refactoring the runtime system in general.

While I certainly told one of the core developers how overwhelming such a change of direction has been for me, I haven't protested, because I honestly believe this new direction will have a more positive impact on the community. Also, while I was speaking specifically of kivy.core when talking about having a solid foundation, the same logic applies to mobile platforms, so I don't think the expectation is unreasonable. After all, I've often read that the only way to become better (at anything) is to do something beyond your current capabilities.

The Community

The Kivy community is as friendly and helpful as always. As I get acclimated to the Android ecosystem, I've had plenty of questions, all of which have been answered rather quickly. As I continue to feel my way around the intricacies of the current Android runtime, I have no doubt that as I run into issues, they'll be able to offer assistance where necessary. My only real concern is IRC. While it is certainly more effective than email, I find that simply having an IRC client open distracts me from actually working. I find myself consumed by the side conversations or constantly checking to see if I've missed any important messages. As such, I might make it a point to be on IRC at specific times of the day and keep it closed the rest of the time.

The Way Forward

I look forward to this summer as a way to give back to a great open source project and grow as a developer in the process. In fact, I'm considering leaving my part-time gig so that I can focus on this exclusively, since I've found the context switching (warehouse work to coding) to be a bit draining. I'm already thinking about code all day; I might as well make use of all that potential energy. I'll also need some kind of structure to keep me on task, given that this is the first time I've done any work from home. Given that I attend school online as well, I think that should be less of a puzzle to work out. Still, structure will be important simply because I don't have a lot of time to do something great, so every minute counts.

Cheers!

May 18, 2014 11:11 PM

Jaspreet Singh
(pgmpy student)

The Bonding Period

Hi Everyone!

I was selected for GSoC '14 with the Python Software Foundation, under pgmpy. Hurray!! The title of my project is "Parsing from and writing to standard PGM formats". I will be posting all the details about my project work here, along with my experiences and any information that may encourage and help future contributors to this open source organization; there will be weekly updates about my work. Two meetings were held during this period, mostly discussing how to approach the project. I have a new IDE, PyCharm, installed and set up to work with. It serves as a good environment for development in Python, and the fact that one can use a terminal alongside the editor in the same window is new to me. There will be regular weekly meetings with the mentors.

by jaspreet singh (noreply@blogger.com) at May 18, 2014 06:42 PM

Navin
(pgmpy student)

The beginning of the Summer of Code

This is my first post, so I begin by congratulating myself on getting selected for the Google Summer of Code. Let me start by describing the organization I got selected for: pgmpy. The pgm in pgmpy stands for probabilistic graphical models and the py stands for Python. As you will have guessed by now, it is an organization under the PSF (Python Software Foundation) and works on probabilistic graphical models. Here is the link to the organization: https://github.com/pgmpy/pgmpy . Sadly it doesn't yet have a web page of its own, but the GitHub page will do. It is a relatively new organization with very little code so far (compared to many other organizations; a lot of previous GSoCers said it took them quite a while to get familiar with just the part of the code relevant to them).

Anyway, let's get to what I will be doing. My task is basically to implement Markov networks. After implementing the basic architecture, I will implement a few algorithms. Take a look here for the current version of the timeline (we decided to make some changes to the initial timeline I had proposed, a few tweaks here and there, though they would hardly be interesting to you): https://docs.google.com/document/d/1WRuiEUgiuwWMuxezTy-YXqSiVILP2Q1lQ50zU6kib6g/edit?usp=sharing .

Now, the big question is what I have been doing from the 19th of April until today (which is approximately a month). We were not supposed to start coding yet, as it was still the community bonding period: we are supposed to familiarize ourselves with the code, get to know the mentors and the people in the org, and do any preparatory work necessary to begin coding. So I haven't started coding up stuff yet. However, I have been doing some cool things.

To begin with, I had my end-semester exams until the 30th of April, so I was quite inactive until then. We had an IRC chat on the 28th of April where we talked briefly about things and set a few plans for the community bonding period; we also chatted a bit about IDEs, Cython and testing techniques, among other things. After the exams, I lazed around for a while and, from about the 3rd of May, started reading up on the algorithms I will be implementing. I did a course on Advanced Machine Learning this semester; however, I thought it would be a good idea to go through the algorithms from the book (the standard book on probabilistic graphical models of course, the one and only Probabilistic Graphical Models by Koller and Friedman).

We had the next IRC chat on the 10th of May (we had decided in the previous chat that we would have these chats every Saturday, starting from the 10th). That one was more of a progress report and a chance to clear up any doubts, and we discussed plans for the subsequent weeks. As per the plan, I went through the software preparation part of my project: I familiarized myself with Cython and the PyCharm IDE, and configured Vim to work well with Python (that was quite a struggle, but then, anything for Vim :D). I also took a look at the existing code so that I could use parts of it in my implementation (and I believe I found a bug, for which I sent a pull request).

And that brings us to today. We had another IRC chat yesterday, and my task for the next week will be setting up the Markov networks class: writing the basic class implementation, testing it, and so on (see the timeline). But the good news is that coding starts now. :)


by Navin Chandak (noreply@blogger.com) at May 18, 2014 06:12 PM

May 17, 2014

Mabry Cervin
(Astropy student)

Beginning Directions

The start of the coding period is in three days and I'm very excited. The plan for the uncertainties framework is for it to be developed as an Astropy affiliated package. Doing it this way, rather than trying to merge changes into Astropy core right off the bat, has some huge advantages. Astropy uses a well planned release cycle with regular feature freezes before large releases, and trying to coordinate new feature development alongside such a process would be an unnecessary burden on both the uncertainties development and core Astropy. Furthermore, having it as an affiliated package will allow me to work more closely with my mentors through a dedicated repo. Astropy has amazing support for affiliated packages; the full list includes some really cool stuff like astroML, a package for astrophysics-related machine learning.

The first milestone for the package is getting Astropy's Quantity class to work with the existing Python package Uncertainties. The Uncertainties package propagates error using a technique called automatic differentiation, which calculates the derivative of every operation alongside the operation's return value. This method is computationally very efficient. It has limitations for much astronomical data, but will make an excellent starting point for a properly flexible framework.
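For anyone who hasn't seen the Uncertainties package, a minimal example of what that propagation looks like in practice (the values here are arbitrary):

from uncertainties import ufloat

# each quantity carries a nominal value and a standard deviation
x = ufloat(10.0, 0.3)
y = ufloat(2.0, 0.1)

z = x / y + x ** 2
print(z.nominal_value, z.std_dev)  # derivatives are tracked automatically

The goal of the first milestone is essentially to make values like these play nicely inside Astropy Quantity objects, units and all.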

by Epitrochoid (noreply@blogger.com) at May 17, 2014 01:29 AM