Searchable Datasets in Python: Images across Domains, Experiments, Algorithms, and Learning

March 3, 2017

When you hear about searching, chances are the first image that pops into your mind is a web browser and a web search engine. From ancient AltaVista to Google, querying capabilities using web crawlers and indexers have shaped the way we pursue and retrieve information: “If you don’t know something, you Google it.” As of 2017, this simple action comes with a caveat: there are billions of websites in the World Wide Web, totaling approximately half a zettabyte, or 5 × 10^20 bytes = 500,000,000,000,000,000,000 bytes. If you are looking for popular items, such as the CD cover of your favorite band, you will often get to it in seconds. However, if you are looking for a specific scientific image, it may feel like looking for a needle in a haystack. This is why a team at the Berkeley Institute for Data Science (BIDS) decided to tackle this issue and create a tool tailored to scientific datasets that sit in databases not readily discoverable on the WWW.

pyCBIR is a new Python tool for content-based image retrieval (CBIR) capable of searching for relevant items in large databases given unseen samples. While much work in CBIR has targeted ads and recommendation systems, pyCBIR allows general-purpose investigation across image domains and experiments. pyCBIR also offers different distance metrics and several feature-extraction techniques, including a convolutional neural network (CNN).

Problem: 500 Exabyte Haystack

Image capture has turned into a ubiquitous activity in our daily lives, but mechanisms to organize and retrieve images based on their content are available only to a few people or for very specific problems. With significant improvements in image-processing speeds and the availability of large storage systems, developing methods to query and retrieve images is fundamental to activities ranging from simple cataloguing to complex research, such as synthesizing materials. CBIR systems use computer vision techniques to describe images in terms of their properties, so that a search takes an image as the query, instead of keywords, and returns similar samples. For this reason, the system works independently of annotations, which can be time consuming to produce and even impossible in some scenarios (e.g., with high-throughput imaging instruments).

Data Science to the Rescue

We proposed a CBIR tool called pyCBIR, written in the Python programming language. This tool is composed of six feature-extraction methods and 10 distances (see Figure 1). Searches take a single image (or a set of images) as the query, and pyCBIR then retrieves and ranks the most similar images according to user-selected parameters.
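To make the query-and-rank step concrete, here is a minimal sketch in plain Python. It is not pyCBIR's code: the function names, the three distances shown, and the toy two-dimensional feature vectors are all assumptions chosen for illustration, and a real system would work with much longer feature vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """One minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical registry of user-selectable distances.
DISTANCES = {"euclidean": euclidean, "cityblock": cityblock, "cosine": cosine}


def rank_images(query, database, metric="euclidean", top_k=3):
    """Return up to top_k (name, distance) pairs, most similar first.

    query    -- feature vector of the query image
    database -- dict mapping image name to its feature vector
    """
    distance = DISTANCES[metric]
    scored = sorted((distance(query, feats), name)
                    for name, feats in database.items())
    return [(name, dist) for dist, name in scored[:top_k]]


# Toy usage: three "images" described by made-up feature vectors.
toy_database = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
for name, dist in rank_images([0.9, 0.0], toy_database, top_k=2):
    print(name, round(dist, 3))
```

Keeping the distance functions in a dictionary mirrors the idea of letting the user pick the metric at query time; the choice of metric can change the ranking, which is why offering several is useful.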

Regarding the feature-extraction methods presented in Figure 1, pyCBIR calculates the following sets of attributes: gray-level co-occurrence matrices (GLCM), histograms of oriented gradients (HOGs), first-order texture features (FOTFs), and local binary patterns (LBPs). We also implemented a CNN-based scheme for image characterization; it uses a CNN without the last layer (i.e., the classification layer), retaining the convolution results as features. This is a common approach among new CBIR systems. Next, we can compute different distances between feature vectors and return the most similar results. A graphical user interface presents all options for the feature-extraction methods, distances, and the number of images for the ranked output, as presented in Figure 2.
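As a rough illustration of what two of these handcrafted descriptors compute, the sketch below builds a basic 3x3 LBP histogram and a few first-order statistics using only the Python standard library. The function names, the neighbor ordering, and the particular statistics (mean, variance, entropy) are assumptions for this example; pyCBIR's own definitions and parameters may differ, and a practical implementation would normally rely on an array library for speed.

```python
import math
from collections import Counter


def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 LBP codes.

    image is a list of rows of gray levels. Border pixels are skipped
    because they lack a full 3x3 neighborhood.
    """
    # Neighbor offsets, clockwise from the top-left pixel (an arbitrary
    # but fixed ordering; each neighbor contributes one bit of the code).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def first_order_features(image):
    """Mean, variance, and Shannon entropy (bits) of the gray levels."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    counts = Counter(pixels)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return [mean, variance, entropy]


# Toy usage: concatenate both descriptors into one feature vector.
toy_image = [
    [10, 10, 10, 10],
    [10, 50, 60, 10],
    [10, 70, 80, 10],
    [10, 10, 10, 10],
]
feature_vector = lbp_histogram(toy_image) + first_order_features(toy_image)
print(len(feature_vector))
```

Vectors produced this way are what the distance step compares: two images with similar local texture yield similar LBP histograms, regardless of any keywords attached to them.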

Preliminary Experiments 

We carried out several experiments using classical image databases for CBIR problems, such as the Flickr Material Database (FMD), and several scientific datasets containing microscopic images of cervical cells, X-ray microtomography, and thin-film X-ray scattering of different materials. Previous experiments pointed out that descriptors like HOG and LBP are very sensitive to the choice of parameters and that no single set of parameters works well in all databases. The proposed CNN scheme for feature extraction requires only two parameters: number of epochs and learning rate. Figure 3 shows the results of the retrieval process using the CNN to extract the features of cervical cell images.

Opening Opportunities for Enhanced Geek Exploration

By studying the important visual-exploration mechanisms pathologists and material scientists use when observing relevant microstructures, we developed a new recommendation system for scientific images. The underlying inferential engine ranks data using feature-extraction methods for data representation and enables users to quickly retrieve the top matches within a particular image set.

Being able to detect materials’ properties in real time will add an entirely new level of experimental capability, including triage, quality control, and prioritization. Tying this capability to control systems at the beamlines may enable machines to steer experiments in response to specific structures present in the sample. For example, when a feature of interest is identified, the imaging process may be temporarily enhanced so the equipment collects a magnified image from the region of interest before resuming the process. Currently, a major challenge is the inability to adjust experimental parameters fast enough for optimal data collection, thereby hindering users from customizing the acquisition of detailed features to specific regions.

Future capabilities can be developed to support automated data curation that will enable individuals to extrapolate labeled datasets to classify unseen samples and automatically create metadata while including their respective degrees of uncertainty. As time passes, we intend to improve the pyCBIR data management module to handle billion-sized image collections.
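One simple way such label extrapolation could work is a nearest-neighbor vote over already-labeled feature vectors, with the share of agreeing votes serving as a crude confidence value. The sketch below is only an assumption about how this might look, not a description of pyCBIR; the function name, the labels, and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def knn_label(query, labeled, k=3):
    """Suggest a label for query from its k nearest labeled neighbors.

    labeled is a list of (feature_vector, label) pairs. Returns the
    majority label and the fraction of the k votes it received, which
    serves as a rough (uncalibrated) confidence value.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


# Toy usage with made-up feature vectors and hypothetical labels.
toy_labeled = [
    ([0.0, 0.0], "normal"),
    ([0.0, 1.0], "normal"),
    ([5.0, 5.0], "abnormal"),
    ([5.0, 6.0], "abnormal"),
]
label, confidence = knn_label([0.2, 0.1], toy_labeled, k=3)
print(label, round(confidence, 2))
```

A low vote share flags samples whose suggested metadata deserves human review, which is one way to carry uncertainty along with automatically generated labels.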


The authors are members of BIDS, where they have designed and developed research on computer vision, machine learning, and searchable datasets.

Dani Ushizima, PhD, is a Data Scientist Fellow at @BIDS (UC Berkeley), and leads research within the Center for Advanced Mathematics for Energy Research Applications (CAMERA) as a Staff Scientist at the Computational Research Division @LBNL. She is also a recipient of the 2015 DOE Early Career Award, with a focus on image analytics across domains, experiments, algorithms, and learning. Science domain areas include: (1) emerging algorithms for dealing with complex and large datasets; (2) pattern recognition and machine learning applied to scientific data; (3) advances in evolving computer architectures.

Prof. Romuere Silva is a visiting scholar at BIDS and LBNL. He is also a PhD candidate at the Federal University of Ceara and a professor at the Federal University of Piaui, Brazil. His research focuses on image analysis, feature extraction, and searchable images for biomedical data.

Prof. Flavio Araujo is a professor at the Federal University of Piaui and a PhD student at the Federal University of Ceara. He is also a visiting scholar at BIDS and LBNL. He has a master’s degree in computer science from the Federal University of Piaui, Brazil. His research interests include medical image processing, machine learning, and data mining.