  • Scraping Alchemist: Celery, Selenium, PhantomJS and TOR

    Sep 19, 2015

    Nowadays people scrape for fun and for profit alike. Scraping is a means of collecting data from various websites. This data is then often used for analysis, and sometimes the content is republished. Tools like Selenium WebDriver and CasperJS allow automated emulation of a real user interacting with a browser, without much effort.

    This post provides an architectural overview and avoids code snippets. The scraper explained in this post went through multiple iterations. Changing goals forced the product to prove its mettle, adapt and transform. The product evolved as demand increased.

    ...more
  • Understanding Java Wrappers and Collections Memory Usage

    Jul 14, 2015
    ...more
  • Monster : A Centralized Monitoring System With OpenTSDB

    Apr 29, 2015

    Systems are rapidly becoming distributed in nature. Nowadays, systems are also built from components that are each designed to perform a specific task. As the number of components grows, they spread across different machines and different operating systems, and they differ from one another in nature.

    In general, if something goes down or performs poorly, one needs to investigate the root cause of the issue in order to tune or fix the system. In complex systems, such analysis can become a complex task that consumes a lot of resources. As multiple components are involved, multiple teams are involved as well, which further slows down the investigation. In the worst case, the business might be impacted by a poorly performing system for a long time. As a solution, one can implement a central monitoring system which collects critical metrics from each of the components. With this system in place, if something goes wrong, one can correlate all the events at a single location, reducing the unnecessary overhead of analyzing each component separately.

    ...more
  • #IssueFix: Missing artifact jdk.tools:jdk.tools:jar:1.6 in Eclipse

    Mar 7, 2015

    Most of us face this issue when working with Hadoop-related source code. Maven starts reporting a missing dependency.

    ...more
  • Hive Partitioning: Tips and Hows

    Mar 6, 2015

    After having used Hive for some time now, I can really say it has provided a serious productivity boost. Not only that, it is really easy to maintain, and most things are transparent to the developer. Really huge amounts of data can be crunched efficiently using Hive!

    ...more
  • HDFS: Explained as Comic!

    Mar 4, 2015

    I always look for content delivered in a visual medium whenever I try to learn new things. After watching videos, going through lengthy articles and ending up writing one such article myself, I found a very interesting comic for learning about HDFS protocols and internals.

    ...more
  • #IssueFix : Too many Hive Staging Directories everywhere

    Mar 4, 2015

    Issue Description:

    While working with Hive, we noticed too many directories named .hive-staging_hive_yyyy-MM-dd_HH-mm-ss_SSS_xxxx-x in the table location directories. A new directory was added to the location with the execution of each query.

    At the same time, a few users were facing access permission issues on table directories, even though each table had read access for all users and groups.

    Error while compiling statement: FAILED: RuntimeException Cannot create staging directory ‘hdfs://namenode:8020/path/to/table/hive-staging_hive_yyyy-MM-dd_HH-mm-ss_SSS_xxxx-x’: Permission denied: user=uname, access=WRITE, inode=”hdfs://namenode:8020/path/to/table”:uname2:hive:drwxrwxr-x at…….

    ...more
  • HDFS: How a file is written!

    Mar 4, 2015

    HDFS is a distributed file system capable of storing very large files without much effort. HDFS spans multiple machines on multiple racks. There are two types of nodes in HDFS:

    • DataNode
    • NameNode

    The nodes responsible for storing files and handling I/O are DataNodes, whereas the NameNode is responsible for keeping the fs image and for the upkeep of HDFS. HDFS is robust, highly flexible, fast, fault tolerant, easy to maintain, easy to scale out and reliable. Various decisions made while designing HDFS are responsible for these features and make HDFS suitable for many use cases. Let's have a brief look at how HDFS is designed and how a file is written internally.

    ...more
  • JavaScript: Equality !== Truthyness

    Feb 22, 2015

    JavaScript, like other languages, has branching statements. These branching statements decide the flow of execution when provided with specific conditions (i.e. stimuli). These conditions are Boolean in nature. Besides Boolean values and equality operators, every value in JS has a Boolean (truthy/falsy) interpretation associated with it. In addition, there are two types of equality operator: the strict equality operator and the coercion-based equality operator. Understanding each of these can save a lot of pain while writing complex JS code.
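
    As a small illustration of the distinction (a sketch, not taken from the post itself; the helper name isTruthy and the variable names are made up for this example), the snippet below contrasts how a value behaves in a condition with how it compares under the two equality operators:

```javascript
// Truthiness: how a value behaves in a condition such as `if (value)`.
// `isTruthy` is a hypothetical helper name used only for this example.
function isTruthy(value) {
  return Boolean(value);
}

// Common falsy values; everything else (including [] and "0") is truthy.
const falsyValues = [false, 0, "", null, undefined, NaN];
const allFalsy = falsyValues.every((v) => !isTruthy(v));

// Coercion-based equality (==) converts operands before comparing;
// strict equality (===) compares type and value without conversion.
const looseZeroEmpty = 0 == "";   // "" is coerced to the number 0
const strictZeroEmpty = 0 === ""; // number vs string, no coercion

// Equality and truthiness are separate questions:
const emptyArrayIsTruthy = isTruthy([]);    // objects are always truthy
const emptyArrayEqualsFalse = [] == false;  // [] -> "" -> 0 and false -> 0
const nullIsTruthy = isTruthy(null);        // null is falsy
const nullEqualsFalse = null == false;      // null only loosely equals undefined

console.log({
  allFalsy,
  looseZeroEmpty,
  strictZeroEmpty,
  emptyArrayIsTruthy,
  emptyArrayEqualsFalse,
  nullIsTruthy,
  nullEqualsFalse,
});
```

    An empty array is truthy yet loosely equal to false, while null is falsy yet not loosely equal to false, which is why a truthiness check and an equality check cannot be used interchangeably.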

    ...more
  • CasperJS and Navigation Parallelism

    Aug 25, 2014

    This tutorial describes how CasperJS can be used to scrape/test multiple pages at a time. CasperJS is a navigation scripting and testing utility. Its execution is sequential: one navigation step executes after another. For a small number of steps, this behavior is perfectly fine. But as the number of steps increases, the time consumed can become very large. This problem can be solved by introducing parallelism in the execution of navigation steps.
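
    The difference between the two execution styles can be sketched in plain JavaScript. This is not CasperJS code and not necessarily the approach the tutorial takes: visitPage is a hypothetical stand-in that simulates one navigation step with a timer, and the page names are placeholders.

```javascript
// Hypothetical stand-in for one navigation step: resolves after `delayMs`.
function visitPage(url, delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`done: ${url}`), delayMs);
  });
}

// Sequential: each step starts only after the previous one has finished,
// so the total time grows with the number of steps.
async function visitSequentially(urls, delayMs) {
  const results = [];
  for (const url of urls) {
    results.push(await visitPage(url, delayMs));
  }
  return results;
}

// Parallel: all steps start at once, so the total time is roughly that of
// the slowest single step. Promise.all keeps results in input order.
function visitInParallel(urls, delayMs) {
  return Promise.all(urls.map((url) => visitPage(url, delayMs)));
}

// Placeholder page names, used only for illustration.
const pages = ["page-1", "page-2", "page-3"];
visitInParallel(pages, 50).then((results) => console.log(results));
```

    With three simulated steps of 50 ms each, the sequential version needs about 150 ms in total, while the parallel version finishes in about 50 ms.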

    ...more

© 2013 - 2016 Santosh Pingale, powered by Hexo and apollo.