Monday, June 29, 2015

Hadoop with Cloudera + Maven project in IntelliJ

How to write Hadoop MR programs using IntelliJ

We will use the Cloudera distribution of Hadoop and use Maven to define all the dependencies.

Though I am using IntelliJ, you could use Eclipse and create a similar project.

1. Create a new project in IntelliJ

You can use a Maven archetype to create this project. Use the following archetype:
archetypeGroupId=org.apache.maven.archetypes
archetypeArtifactId=maven-archetype-quickstart

Use at least Java 7. This will generate a Maven project with a pom.xml.

2. Add the Cloudera repo to the pom.xml


<repositories>
  <repository>
    <id>cloudera-releases</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
</repositories>


3. Add Hadoop dependencies for writing a client project


<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>


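At the time of writing, the latest Cloudera Hadoop client version is 2.6.0-mr1-cdh5.5.0-SNAPSHOT. Define hadoop.version in the <properties> section of the pom.xml with the version you want to use. Note that the repository entry above disables snapshots, so a -SNAPSHOT version will only resolve if you enable snapshots for that repository or pick a released version instead.

You might also want to add the Maven Central repo:
http://repo1.maven.org/maven2

4. Now try to build the project.
The project should build, and all the dependencies should be downloaded.
Now you are good to go. Write all the MapReduce programs you know :)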

Tuesday, June 9, 2015

HTTP Headers useful for REST APIs

Access-Control-* headers

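Preflight: for cross-origin requests that are not "simple", the browser first sends an OPTIONS request and checks these headers in the response.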

When writing a simple CORS filter, the following headers can be set on the response:

response.setHeader("Access-Control-Allow-Origin", "*");
response.setHeader("Access-Control-Allow-Methods", "POST, GET, OPTIONS, DELETE");
response.setHeader("Access-Control-Max-Age", "3600");
response.setHeader("Access-Control-Allow-Headers", "x-requested-with");

A CORS filter containing the above code adds these Access-Control-* headers to every response.

Access-Control-Allow-Origin - the origins allowed to read the response; "*" allows all origins.
Access-Control-Allow-Methods - the HTTP methods allowed for cross-origin requests.
Access-Control-Max-Age - how long, in seconds, the browser may cache the result of the preflight request.
Access-Control-Allow-Headers - the request headers the client may send on the actual CORS call. Here it tells the browser that x-requested-with is allowed as a request header.
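To make the header set easier to experiment with, here is a minimal sketch that needs only the JDK, not the servlet API. The class name CorsHeaders and the helper isMethodAllowed are hypothetical names chosen for this example; in a real filter the same values would go through response.setHeader as shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of any framework.
class CorsHeaders {
    // Builds the same four Access-Control-* headers as the filter snippet.
    static Map<String, String> build() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Access-Control-Allow-Origin", "*");
        headers.put("Access-Control-Allow-Methods", "POST, GET, OPTIONS, DELETE");
        headers.put("Access-Control-Max-Age", "3600");
        headers.put("Access-Control-Allow-Headers", "x-requested-with");
        return headers;
    }

    // True when the given method is listed in Access-Control-Allow-Methods.
    static boolean isMethodAllowed(String method) {
        String allowed = build().get("Access-Control-Allow-Methods");
        for (String candidate : allowed.split(",")) {
            if (candidate.trim().equals(method)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        build().forEach((name, value) -> System.out.println(name + ": " + value));
        System.out.println("PUT allowed? " + isMethodAllowed("PUT"));
    }
}
```

With this header set a preflight for PUT would be rejected by the browser, because PUT is not in the allowed methods list.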


HTTP Media type headers

(http://www.newmediacampaigns.com/blog/browser-rest-http-accept-headers)
When a web browser makes a request, it sends information to the server about what it is looking for in headers. One of these is the Accept header, which tells the server what file formats, or more correctly MIME types, the browser is looking for.
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

 Quality factors allow the user or user agent to indicate the relative degree of preference for that media-range, using the qvalue scale from 0 to 1 (section 3.9). The default value is q=1.

Ordered by preference value, descending, the header above reads:

1: text/html, application/xhtml+xml

0.9: application/xml

0.8: */*

When two media ranges have the same preference, the more specific one comes first. For example, if both application/xml and */* had a preference of 0.9, application/xml would still come first. Firefox chooses to make it explicit that */* is less preferred by giving it a preference of 0.8. Firefox's Accept header is sensible and well thought out. Opera's is too. Other browsers: not so much.
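The ordering described above can be sketched in a few lines of Java. This is a simplified illustration, not a full Accept parser: the class AcceptHeader and its MediaRange type are hypothetical names, and the code assumes well-formed input and ignores every parameter other than q.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class AcceptHeader {
    // One media range from an Accept header together with its q value.
    static final class MediaRange {
        final String type;
        final double q;

        MediaRange(String type, double q) {
            this.type = type;
            this.q = q;
        }

        // 0 for */*, 1 for type/*, 2 for a full type/subtype.
        int specificity() {
            if (type.equals("*/*")) {
                return 0;
            }
            return type.endsWith("/*") ? 1 : 2;
        }
    }

    static List<MediaRange> parse(String header) {
        List<MediaRange> ranges = new ArrayList<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            double q = 1.0; // default when no q parameter is given
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            ranges.add(new MediaRange(pieces[0].trim(), q));
        }
        // Highest q first; on a tie the more specific range wins.
        // List.sort is stable, so header order is kept otherwise.
        ranges.sort((a, b) -> {
            int byQ = Double.compare(b.q, a.q);
            return byQ != 0 ? byQ : Integer.compare(b.specificity(), a.specificity());
        });
        return ranges;
    }

    public static void main(String[] args) {
        String header = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        for (MediaRange range : parse(header)) {
            System.out.println(range.q + ": " + range.type);
        }
    }
}
```

The tie-break on specificity is what keeps application/xml ahead of */* even when both carry the same q value.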

Twitter's REST API doesn't use the Accept header for content negotiation; it uses the extensions '.json' and '.xml' on the URL.

==============

When the Content-Type header is missing or not supported, the REST service returns

415 Unsupported Media Type
"message":"Content type 'null' not supported"
"message":"Content type 'application/vnd.dmi-v2+xfd' not supported"

When the Content-Type is correct but the Accept header asks for a type the service cannot produce, the service returns

406 Not Acceptable
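The two status codes can be told apart with a check like the one below. This is only a sketch under simplifying assumptions: the class MediaTypeCheck is a hypothetical name, the service is assumed to read and write application/json only, and the headers are assumed to hold a single media type without parameters.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only.
class MediaTypeCheck {
    // Assumption: the only media type this example service supports.
    static final List<String> SUPPORTED = Arrays.asList("application/json");

    // Returns the HTTP status the service would answer with.
    static int status(String contentType, String accept) {
        // The request body is in a format the service cannot read.
        if (contentType == null || !SUPPORTED.contains(contentType)) {
            return 415;
        }
        // The client asks for a response format the service cannot write.
        if (accept != null && !accept.equals("*/*") && !SUPPORTED.contains(accept)) {
            return 406;
        }
        return 200;
    }

    public static void main(String[] args) {
        System.out.println(status(null, "application/json"));
        System.out.println(status("application/json", "text/csv"));
        System.out.println(status("application/json", "application/json"));
    }
}
```

In short, 415 is about what the client sent and 406 is about what the client is willing to receive.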

HTTP Caching headers

Caching applies mainly to GET and HEAD requests. OPTIONS is also a safe method, but responses to it are not cacheable.
HTTP 304 Not Modified is returned when the client's cached copy is still valid, so the client can reuse it.

Request header      should match   Response header   E.g. value
If-Modified-Since   =              Last-Modified     HTTP-date (Sat, 29 Oct 1994 19:43:31 GMT)
If-None-Match       =              ETag              "123-abef8r3dw"

 In 200 (OK) responses to GET or HEAD, an origin server:

   o  SHOULD send an entity-tag validator unless it is not feasible to
      generate one.

   o  MAY send a weak entity-tag instead of a strong entity-tag, if
      performance considerations support the use of weak entity-tags, or
      if it is unfeasible to send a strong entity-tag.

   o  SHOULD send a Last-Modified value if it is feasible to send one.

   In other words, the preferred behavior for an origin server is to
   send both a strong entity-tag and a Last-Modified value in successful
   responses to a retrieval request.

   A client:

   o  MUST send that entity-tag in any cache validation request (using
      If-Match or If-None-Match) if an entity-tag has been provided by
      the origin server.

   o  SHOULD send the Last-Modified value in non-subrange cache
      validation requests (using If-Modified-Since) if only a
      Last-Modified value has been provided by the origin server.

   o  MAY send the Last-Modified value in subrange cache validation
      requests (using If-Unmodified-Since) if only a Last-Modified value
      has been provided by an HTTP/1.0 origin server.  The user agent
      SHOULD provide a way to disable this, in case of difficulty.
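On the server side, validating a cached copy comes down to comparing the request validators with the current ETag and Last-Modified values. The sketch below is one way to do that with the JDK only. The class name ConditionalGet is hypothetical, entity-tags are assumed not to contain commas, and dates are assumed to be in the HTTP-date format shown in the table above.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper class for illustration only.
class ConditionalGet {
    // Returns 304 when the client's copy is still valid, otherwise 200.
    // Header arguments are null when the header is absent.
    static int status(String ifNoneMatch, String ifModifiedSince,
                      String etag, String lastModified) {
        if (ifNoneMatch != null) {
            // If-None-Match takes precedence; If-Modified-Since is ignored.
            return matchesAny(ifNoneMatch, etag) ? 304 : 200;
        }
        if (ifModifiedSince != null && lastModified != null) {
            DateTimeFormatter httpDate = DateTimeFormatter.RFC_1123_DATE_TIME;
            ZonedDateTime since = ZonedDateTime.parse(ifModifiedSince, httpDate);
            ZonedDateTime modified = ZonedDateTime.parse(lastModified, httpDate);
            return modified.isAfter(since) ? 200 : 304;
        }
        return 200;
    }

    // Weak comparison of the current ETag against a comma-separated list.
    static boolean matchesAny(String ifNoneMatch, String etag) {
        if (etag == null) {
            return false;
        }
        if (ifNoneMatch.trim().equals("*")) {
            return true;
        }
        for (String candidate : ifNoneMatch.split(",")) {
            if (stripWeak(candidate.trim()).equals(stripWeak(etag))) {
                return true;
            }
        }
        return false;
    }

    // Drops the W/ prefix so weak and strong tags compare by opaque value.
    static String stripWeak(String tag) {
        return tag.startsWith("W/") ? tag.substring(2) : tag;
    }

    public static void main(String[] args) {
        System.out.println(status("\"123-abef8r3dw\"", null, "\"123-abef8r3dw\"", null));
        System.out.println(status(null, "Sat, 29 Oct 1994 19:43:31 GMT",
                null, "Sun, 30 Oct 1994 08:00:00 GMT"));
    }
}
```

A 304 response carries no body, which is where the bandwidth saving comes from.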

Useful Cache-Control response headers include:
  • max-age=[seconds] — specifies the maximum amount of time that a representation will be considered fresh. Similar to Expires, this directive is relative to the time of the request, rather than absolute. [seconds] is the number of seconds from the time of the request you wish the representation to be fresh for.
  • s-maxage=[seconds] — similar to max-age, except that it only applies to shared (e.g., proxy) caches.
  • public — marks authenticated responses as cacheable; normally, if HTTP authentication is required, responses are automatically private.
  • private — allows caches that are specific to one user (e.g., in a browser) to store the response; shared caches (e.g., in a proxy) may not.
  • no-cache — forces caches to submit the request to the origin server for validation before releasing a cached copy, every time. This is useful to assure that authentication is respected (in combination with public), or to maintain rigid freshness, without sacrificing all of the benefits of caching.
  • no-store — instructs caches not to keep a copy of the representation under any conditions.
  • must-revalidate — tells caches that they must obey any freshness information you give them about a representation. HTTP allows caches to serve stale representations under special conditions; by specifying this header, you’re telling the cache that you want it to strictly follow your rules.
  • proxy-revalidate — similar to must-revalidate, except that it only applies to proxy caches.
When both Cache-Control and Expires are present, Cache-Control takes precedence.
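To see how a cache might read these directives, here is a rough sketch that works out for how many seconds a response may be served without revalidation. The class name CacheControl and the return conventions (0 for "do not serve without asking the origin", -1 for "no explicit lifetime") are assumptions made for this example, and quoted directive values are not handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration only.
class CacheControl {
    // Splits "public, max-age=3600" into {public="", max-age="3600"}.
    static Map<String, String> parse(String header) {
        Map<String, String> directives = new HashMap<>();
        for (String part : header.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv[0].isEmpty()) {
                continue;
            }
            String name = kv[0].toLowerCase(Locale.ROOT);
            directives.put(name, kv.length > 1 ? kv[1].trim() : "");
        }
        return directives;
    }

    // Seconds the response may be reused; 0 means never without
    // contacting the origin, -1 means the header gives no lifetime.
    static long freshnessSeconds(String header, boolean sharedCache) {
        Map<String, String> d = parse(header);
        if (d.containsKey("no-store") || d.containsKey("no-cache")) {
            return 0;
        }
        if (sharedCache && d.containsKey("private")) {
            return 0; // shared caches may not store private responses
        }
        if (sharedCache && d.containsKey("s-maxage")) {
            return Long.parseLong(d.get("s-maxage")); // overrides max-age
        }
        if (d.containsKey("max-age")) {
            return Long.parseLong(d.get("max-age"));
        }
        return -1;
    }

    public static void main(String[] args) {
        String header = "public, max-age=3600, s-maxage=60";
        System.out.println("browser: " + freshnessSeconds(header, false));
        System.out.println("proxy:   " + freshnessSeconds(header, true));
    }
}
```

The example header lets a browser keep the response for an hour while a shared proxy has to refresh it every minute.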


Essentially, the Vary response header lets caches know which request headers to use to figure out whether they have a valid cached response for a request. If a cache were a giant key-value store, the header names listed in Vary add the values of those request headers to the key, thus changing which requests are considered valid matches for what exists in the cache.
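The key-value picture can be made concrete with a small sketch. The class VaryCacheKey and the key layout are invented for this example, and the request headers are assumed to be stored under lower-case names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration only.
class VaryCacheKey {
    // Builds a cache key from the method, the URL and the request headers
    // named in the Vary value of the stored response.
    static String key(String method, String url, String vary,
                      Map<String, String> requestHeaders) {
        StringBuilder key = new StringBuilder(method).append(' ').append(url);
        if (vary == null || vary.trim().isEmpty()) {
            return key.toString();
        }
        for (String name : vary.split(",")) {
            String header = name.trim().toLowerCase(Locale.ROOT);
            String value = requestHeaders.get(header);
            key.append('|').append(header).append('=')
               .append(value == null ? "" : value);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("accept", "application/json");
        System.out.println(key("GET", "/api/article/1234", "Accept", headers));
        headers.put("accept", "application/xml");
        System.out.println(key("GET", "/api/article/1234", "Accept", headers));
    }
}
```

Two requests for the same URL with different Accept values end up under different keys, so the JSON response is never served to a client that asked for XML.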

What is the correct way to version my API?

The "URL" way
A commonly used way to version your API is to add a version number in the URL. For instance:
/api/v1/article/1234
To "move" to another API, one could increase the version number:
/api/v2/article/1234

The hypermedia way
GET /api/article/1234 HTTP/1.1
Accept: application/vnd.api.article+xml; version=1.0
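On the server, either style means pulling a version out of the request before routing it. The following is a minimal sketch of both lookups; the class ApiVersion and its two regular expressions are assumptions based only on the example URLs and Accept header shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class ApiVersion {
    // Matches the "v1" in /api/v1/article/1234.
    private static final Pattern URL_VERSION = Pattern.compile("^/api/v(\\d+)/");
    // Matches the "1.0" in "...; version=1.0".
    private static final Pattern ACCEPT_VERSION =
            Pattern.compile(";\\s*version=([0-9.]+)");

    // Returns the version from the URL path, or null if there is none.
    static String fromUrl(String path) {
        Matcher m = URL_VERSION.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    // Returns the version parameter of the Accept header, or null.
    static String fromAccept(String accept) {
        Matcher m = ACCEPT_VERSION.matcher(accept);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("/api/v2/article/1234"));
        System.out.println(fromAccept("application/vnd.api.article+xml; version=1.0"));
    }
}
```

With the Accept header approach the URL of the article stays the same across versions, which is the main argument for it.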

References 
http://restcookbook.com/Basics/versioning/#sthash.jRlZVJ0L.dpuf

https://www.safaribooksonline.com/library/view/rest-api-design/9781449317904/ch04.html


https://devcenter.heroku.com/articles/jax-rs-http-caching

Connect to Cloudera VM installed on your Mac using the Mac terminal

I recently started working with Cloudera Hadoop and installed the Cloudera VM from their website. I am using VirtualBox to run the Cloudera Hadoop VM.

Working inside the VM is slow and tedious. You can configure your Mac so that you can ssh into the VM from the Mac's terminal.

Here are the steps:
1. Set up a host-only network in VirtualBox for the Cloudera VM:
Open VirtualBox.
Go to File-->Preferences-->Network and click "Add Host-only network (Ins)".
This automatically creates a "vboxnet0" network.
Click OK to save the changes.

2. In a terminal window on the Cloudera Hadoop VM, run ifconfig.
This prints the settings for eth0 and eth1.
Note the inet addr associated with eth1.
3. Go to your Mac's terminal window and type
$ ssh training@<inet addr ip from prev step>
4. ssh will ask whether you want to continue connecting to the VM. Type "yes"
and you are in. You can check by typing hadoop fs at the prompt.
If it runs, you are connected.
