The chain rule in multi-variable calculus: Generalized

Now we’ll discuss the chain rule for n-nested functions. For example, an n-nested function would be g=f_1(f_2(\dots(f_n(t))\dots)). What would \frac{\partial g}{\partial t} be?

We know that

g(t+h)-g(t)=f_1(f_2(\dots(f_n(t+h))\dots))-f_1(f_2(\dots(f_n(t))\dots)).

If every f_i is differentiable, then

g(t+h)-g(t)=\frac{\partial f_1}{\partial f_2}.\left[f_2(\dots(f_n(t+h))\dots)-f_2(\dots(f_n(t))\dots)\right]+g_1 such that \lim\limits_{h\to 0}\frac{g_1}{h}=0. Here \frac{\partial f_1}{\partial f_2} is the derivative of f_1 evaluated at f_2(\dots(f_n(t))\dots).

In turn

f_2(\dots(f_n(t+h))\dots)-f_2(\dots(f_n(t))\dots)=\frac{\partial f_2}{\partial f_3}.\left[f_3(\dots(f_n(t+h))\dots)-f_3(\dots(f_n(t))\dots)\right]+g_2

such that \lim\limits_{h\to 0}\frac{g_2}{h}=0.

Hence, we have

g(t+h)-g(t)=\frac{\partial f_1}{\partial f_2}.(\frac{\partial f_2}{\partial f_3}.\left[f_3(\dots(f_n(t+h))\dots)-f_3(\dots(f_n(t))\dots)\right]+g_2)+g_1

Continuing like this, and using f_n(t+h)-f_n(t)=\frac{\partial f_n}{\partial t}.h+g_n at the innermost level, we get the formula

g(t+h)-g(t)=\frac{\partial f_1}{\partial f_2}.(\frac{\partial f_2}{\partial f_3}.(\dots(\frac{\partial f_n}{\partial t}.h+g_n)\dots)+g_2)+g_1

such that \lim\limits_{h\to 0}\frac{g_i}{h}=0 for all i\in \{1,2,3,\dots,n\}.

Dividing the above formula by h and letting h\to 0, we get

\frac{\partial g}{\partial t}=\lim\limits_{h\to 0}\frac{g(t+h)-g(t)}{h}=\frac{\partial f_1}{\partial f_2}.\frac{\partial f_2}{\partial f_3}.\dots\frac{\partial f_n}{\partial t}
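The product of derivatives can be checked numerically. What follows is only a sketch under assumptions: the three nested functions (sin, exp and squaring), the point t=0.5 and the step h are all chosen by me for illustration and do not come from any book.

```python
import math

# Hypothetical nested functions g(t) = f_1(f_2(f_3(t))) and their derivatives.
fs = [math.sin, math.exp, lambda u: u * u]        # f_1, f_2, f_3
dfs = [math.cos, math.exp, lambda u: 2.0 * u]     # f_1', f_2', f_3'

def compose(t):
    """Evaluate g(t) by applying f_3 first and f_1 last."""
    for f in reversed(fs):
        t = f(t)
    return t

def chain_rule(t):
    """Multiply the derivatives, each evaluated at its own inner value."""
    # args = [t, f_3(t), f_2(f_3(t))]: the inputs of f_3, f_2 and f_1.
    args = [t]
    for f in reversed(fs[1:]):
        args.append(f(args[-1]))
    product = 1.0
    for df, a in zip(reversed(dfs), args):
        product *= df(a)
    return product

t, h = 0.5, 1e-6
difference_quotient = (compose(t + h) - compose(t)) / h
print(difference_quotient, chain_rule(t))
```

The two printed numbers should agree to several decimal places; the small gap that remains is what the correction terms g_i contribute before the limit is taken.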

Multi-variable differentiation.

There are very many bad books on multivariable calculus. “A Second Course in Calculus” by Serge Lang is the rare good book in this area. Succinct, thorough, and rigorous. This is an attempt to re-create some of the more orgasmic portions of the book.

In \Bbb{R}^n space, should differentiation be defined as \lim\limits_{H\to 0}\frac{f(X+H)-f(X)}{H}? No, as division by a vector (H) is not defined. Then \lim\limits_{\|H\|\to 0}\frac{f(X+H)-f(X)}{\|H\|}? We’re not sure. Let us see how it goes.

Something that is easy to define is f(X+H)-f(X), which can be written as

f(x_1+h_1,x_2+h_2,\dots,x_n+h_n)-f(x_1,x_2,\dots,x_n) (H is the n-tuple (h_1,h_2,\dots,h_n)).

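This expression in turn can be written as

\left[f(x_1+h_1,x_2+h_2,\dots,x_n+h_n)-f(x_1,x_2+h_2,\dots,x_n+h_n)\right]+\left[f(x_1,x_2+h_2,\dots,x_n+h_n)-f(x_1,x_2,x_3+h_3,\dots,x_n+h_n)\right]+\dots+\left[f(x_1,x_2,\dots,x_{n-1},x_n+h_n)-f(x_1,x_2,\dots,x_n)\right].
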
Here, we can use the Mean Value Theorem, one coordinate at a time. Let us suppose s_1 lies between x_1 and x_1+h_1, so that (s_1,x_2+h_2,\dots,x_n+h_n) is a point on the segment joining (x_1+h_1,x_2+h_2,\dots,x_n+h_n) and (x_1,x_2+h_2,\dots,x_n+h_n),

or in general

s_k lies between x_k and x_k+h_k, so that (x_1,x_2,\dots,s_k,\dots,x_n+h_n) is a point on the segment joining (x_1,x_2,\dots,x_k+h_k,\dots,x_n+h_n) and (x_1,x_2,\dots,x_k,\dots,x_n+h_n). Then

f(x_1+h_1,x_2+h_2,\dots,x_n+h_n)-f(x_1,x_2,\dots,x_n)=\sum\limits_{k=1}^n{D_{x_k}f(x_1,x_2,\dots,s_k,\dots,x_n+h_n).h_k}.

No correction factor. Just this.

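What follows is that a function

g_k=D_{x_k}f(x_1,x_2,\dots,s_k,\dots,x_n+h_n)-D_{x_k}f(x_1,x_2,\dots,x_n)

is assigned for every k\in\{1,2,3,\dots,n\}.

Hence, the expression becomes

f(x_1+h_1,x_2+h_2,\dots,x_n+h_n)-f(x_1,x_2,\dots,x_n)=\sum\limits_{k=1}^n {\left[D_{x_k}f(x_1,x_2,\dots,x_n)+g_k\right].h_k}

If the partial derivatives are continuous at (x_1,x_2,\dots,x_n), it is easy to determine that \lim\limits_{H\to 0}g_k=0.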

The more interesting question to ask here is: why did we use the Mean Value Theorem? Why could we not have used the formula f(x_1+h_1,x_2+h_2,\dots,x_n+h_n)-f(x_1,x_2,\dots,x_n)=\sum\limits_{k=1}^n {\left[D_{x_k}f(x_1,x_2,\dots,x_k,\dots,x_n+h_n)+g_k(x_1,x_2,\dots,x_k,\dots,x_n+h_n,h_k)\right].h_k},

where \lim\limits_{h_k\to 0}g_k(x_1,x_2,\dots,x_k,\dots,x_n+h_n,h_k)=0?

This is because g_k(x_1,x_2,\dots,x_k,\dots,x_n+h_n,h_k) may not be defined at the point (x_1,x_2,\dots,x_n). If in fact every g_k is continuous at (x_1,x_2,\dots,x_n), then we wouldn’t have to use the Mean Value Theorem.

Watch this space for some more expositions on this topic.

One passing note as I end this article.

A function is differentiable at X if it can be expressed in this manner: f(X+H)-f(X)=(\operatorname{grad}f(X)).H+\|H\|g(X,H) such that \lim\limits_{\|H\|\to 0}g(X,H)=0. This is a necessary and sufficient condition; the definition of differentiability. It does not have a derivation. I spent a very long time trying to derive it before realising what a fool I had been.
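To get a feel for this definition, one can compute g(X,H) directly and watch it shrink with \|H\|. This is a rough sketch only: the function f(x,y)=x^2y+\sin y, the point X=(1,2) and the direction of H are my own assumptions, picked for illustration.

```python
import math

def f(x, y):
    """A hypothetical smooth function of two variables."""
    return x * x * y + math.sin(y)

def grad_f(x, y):
    """Gradient of f, computed by hand."""
    return (2 * x * y, x * x + math.cos(y))

def remainder(X, H):
    """g(X, H) = (f(X+H) - f(X) - grad f(X).H) / ||H||."""
    (x, y), (h1, h2) = X, H
    gx, gy = grad_f(x, y)
    norm_h = math.hypot(h1, h2)
    return (f(x + h1, y + h2) - f(x, y) - (gx * h1 + gy * h2)) / norm_h

X = (1.0, 2.0)
remainders = [abs(remainder(X, (s, -s))) for s in (1e-1, 1e-2, 1e-3, 1e-4)]
print(remainders)
```

Each entry should be roughly a tenth of the previous one, which is the behaviour \lim\limits_{\|H\|\to 0}g(X,H)=0 asks for.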

Continuity decoded

The definition of continuity was framed after decades of deliberation and mathematical squabbling. The current notation we have is due to a German mathematician by the name of Weierstrass. It states that

“If f:\Bbb{R}\to \Bbb{R} is continuous at point a, then for every \epsilon>0, \exists\delta>0 such that for |x-a|<\delta, |f(x)-f(a)|<\epsilon.”

Now let us try and interpret the statement and break it down into simpler statements, in order to give us a strong visual feel.

Can \epsilon be very large? Of course! It can be 1,000,000 for example. Does there exist a \delta such that |x-a|<\delta\implies |f(x)-f(a)|<1,000,000, even if the function is not continuous? Yes. An example would be

f(x)=x for x\in(-\infty,a) and f(x)=x+1 for x\in[a,\infty)

Does this mean that we have proved a discontinuous function to be continuous? NO.

\epsilon should take up the values of all positive real numbers. The f(x) defined above fails for every \epsilon\leq 1: for x slightly less than a we have |f(x)-f(a)|=1+(a-x)>1, however small \delta is.
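The step example can be probed numerically. The sketch below assumes a=0 and only tests finitely many sample points inside each interval, so it is evidence rather than proof; the helper name delta_works is mine.

```python
a = 0.0

def f(x):
    """The step example: f(x) = x for x < a and f(x) = x + 1 for x >= a."""
    return x if x < a else x + 1

def delta_works(delta, epsilon, samples=1000):
    """Check |f(x) - f(a)| < epsilon at sample points with 0 < |x - a| < delta."""
    for i in range(1, samples + 1):
        offset = delta * i / (samples + 1)
        for x in (a - offset, a + offset):
            if abs(f(x) - f(a)) >= epsilon:
                return False
    return True

# A huge epsilon is satisfied easily, even though f jumps at a.
large_ok = delta_works(delta=1.0, epsilon=1_000_000)

# For epsilon = 0.5, no delta among 10^-1, ..., 10^-11 survives the jump.
small_ok = any(delta_works(delta=10.0 ** -k, epsilon=0.5) for k in range(1, 12))
print(large_ok, small_ok)
```

The huge \epsilon passes and the small one fails for every \delta tried, which is exactly why one large \epsilon proves nothing about continuity.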

Let us suppose for some \epsilon>0, we have |f(x)-f(a)|<\epsilon if |x-a|<\delta. Let x_1 and x_2 be two points in B(a,\delta) with f(x_1)\neq f(x_2), so that f(x_1) and f(x_2) are two points in B(f(a),\epsilon). Let us now make \epsilon=\frac{|f(x_1)-f(x_2)|}{2}. Will the value of \delta also have to decrease? Can it in fact increase?

The value of \delta cannot increase because the bigger interval will contain x_1 and x_2. If both f(x_1) and f(x_2) were within \frac{|f(x_1)-f(x_2)|}{2} of f(a), the triangle inequality would give |f(x_1)-f(x_2)|<|f(x_1)-f(x_2)|, which is absurd; so at least one of them violates the new bound. Can \delta remain the same? No (for the same reasons, as the interval will still contain x_1 and x_2). Hence, \delta most definitely has to decrease in this case.

However, does it always have to decrease? No. A case in point is a constant function like y=b.

We have now come to the most important aspect of continuity. The smaller we make \epsilon, the smaller the value of \delta. Does continuity also imply that the smaller we make \delta, the smaller the value of \epsilon? YES! How? When we decrease \delta, \epsilon obviously can’t get bigger. Moreover, we know that there do exist values of \delta which make smaller \epsilon possible. Say, for |f(x)-f(a)|<\epsilon/2, it is necessary that |x-a|<\delta/5. Hence, if we decrease the radius of the interval on the x-axis from \delta to \delta/5, the value of \epsilon (or the bound of the mappings of the points) also decreases to \epsilon/2.

In summation, a continuous function is such that

           decrease in value of \epsilon\Longleftrightarrow decrease in value of \delta

One may ask: how does knowing this help?

It has become very easy to prove that differentiable functions are continuous, and a host of other properties of continuous functions.

A doubt that one may face here: does this imply that all continuous functions are differentiable? No. “decrease in value of \epsilon\Longleftrightarrow decrease in value of \delta” only controls |f(x)-f(a)|; it says nothing about whether the difference quotient \frac{f(x)-f(a)}{x-a} converges. In order for a function to be differentiable at a, the difference quotient has to converge, and to the same limit, along every sequence of x converging to a. This is not implied by the aforementioned condition.
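A quick illustration of limits along different sequences disagreeing, using f(x)=|x| at a=0. The choice of function and of the sequences \pm\frac{1}{n} is my own; it is the standard example of a continuous function that is not differentiable.

```python
def difference_quotient(f, a, x):
    """(f(x) - f(a)) / (x - a) for x different from a."""
    return (f(x) - f(a)) / (x - a)

a = 0.0
# Approach a = 0 along x = 1/n from the right and x = -1/n from the left.
right = [difference_quotient(abs, a, 1.0 / n) for n in range(1, 6)]
left = [difference_quotient(abs, a, -1.0 / n) for n in range(1, 6)]
print(right, left)
```

Along the first sequence every quotient is 1, along the second every quotient is -1, so no single derivative exists at 0 even though |x| is continuous there.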

An attempted generalization of the Third Isomorphism Theorem.

I recently posted this question online.

My assertion was “Let G be a group with three normal subgroups K_1,K_2 and H such that K_1,K_2\leq H. Then (G/H)\cong (G/K_1)/(H/K_2). This is a generalization of the Third Isomorphism Theorem, which states that (G/H)\cong (G/K)/(H/K), where K\leq H.”

What was my rationale behind asking this question? Let G be a group and H its normal subgroup. Then G/H contains elements of the form g+H, where g+H=(g+\alpha h)+H for every h\in H and every \alpha\in \Bbb{Z}.

Now let K_1,K_2 be two normal subgroups of G such that K_1,K_2\leq H. Then G/K_1 contains elements of the form g+K_1 and H/K_2 contains elements of the form h+K_2. Now consider (G/K_1)/(H/K_2). One coset of this would be \{[(g+ all elements of K_1)+(h_1+all elements of K_2)],[(g+ all elements of K_1)+(h_2+all elements of K_2)],\dots,[(g+ all elements of K_1)+(h_{|H/K_2|}+all elements of K_2)]\}. We are effectively adding every element of G/K_1 to all elements of H. The most important thing to note here is that every element of K_1 is also present in H.

Every element of the form (g+ any element in K_1) in G will give the same element in G/K_1, and by extension in (G/K_1)/(H/K_2). Let g and g+h be two elements in G (h\in H) such that both are not in K_1. Then they will not give the same element in G/K_1. However, as every element of H is individually added to them in (G/K_1)/(H/K_2), they will give the same element in the latter. If g and g' form different cosets in G/H, then they will also form different cosets in (G/K_1)/(H/K_2). This led me to conclude that (G/H)\cong (G/K_1)/(H/K_2).

This reasoning is however flawed, mainly because H/K_2 need not be a subgroup of G/K_1. Hence, in spite of heavy intuition into the working of cosets, I got stuck on technicalities.

Generalizing dual spaces- A study on functionals.

A functional is a map from a vector space to its scalar field, like \Bbb{R} or \Bbb{C}. If X is the vector space under consideration, then the vector space \{f_i\} of linear functionals f_i:X\to \Bbb{R} (or f_i:X\to\Bbb{C}) is referred to as the algebraic dual space X^*. Similarly, the vector space of linear functionals f'_i:X^*\to \Bbb{R} (or f'_i:X^*\to\Bbb{C}) is referred to as the second algebraic dual space. It is also referred to as X^{**}.

How should one imagine X^{**}? Imagine a bunch of functionals being mapped to \Bbb{R}. One way to do it is to evaluate all of them at one particular x\in X. Hence, g_x:X^*\to \Bbb{R} such that g_x(f)=f(x). Another such mapping is g_y. The map x\mapsto g_x embeds X in X^{**}; when X is finite-dimensional it is an isomorphism, so that X^{**} is isomorphic to X.
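Here is a small sketch of such a mapping, reading g_x as "evaluate every functional at x", that is g_x(f)=f(x). I assume X=\Bbb{R}^3 and model a linear functional as a tuple of three coefficients; the names apply and evaluation_at are mine.

```python
def apply(f, x):
    """Value of the functional with coefficients f at the vector x."""
    return sum(fi * xi for fi, xi in zip(f, x))

def evaluation_at(x):
    """Return g_x, the element of X** defined by g_x(f) = f(x)."""
    return lambda f: apply(f, x)

x = (1.0, 2.0, 3.0)
g_x = evaluation_at(x)

f1 = (1.0, 0.0, -1.0)
f2 = (0.5, 2.0, 0.0)
combo = tuple(2 * p + 3 * q for p, q in zip(f1, f2))   # the functional 2*f1 + 3*f2

lhs = g_x(combo)                    # g_x applied to 2*f1 + 3*f2
rhs = 2 * g_x(f1) + 3 * g_x(f2)     # the same combination of the values
print(lhs, rhs)
```

Both numbers agree, which is the linearity that makes g_x a legitimate element of X^{**}.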

My book only talks about X, X^* and X^{**}. I shall talk about X^{***}, X^{****}, and X^{**\dots *}. Generalization does indeed help the mind figure out the complete picture.

Say we have X^{n*} (n asterisks). Imagine a mapping X^{n*}\to \Bbb{R}. Under what conditions is this mapping well-defined? When we have only one image for each element of X^{n*}. Notice that each mapping f:X^{n*}\to \Bbb{R} is an element of the vector space X^{(n+1)*}. To make f a well-defined mapping, we select any one element a\in X^{(n-1)*}, and determine the value of each element of X^{n*} at a. One must note here that a is itself a mapping (a: X^{(n-2)*}\to\Bbb{R}). Which element of X^{(n-2)*} it is to be evaluated at should be mentioned in advance. Similarly, every element in X^{(n-2)*} is also a mapping, and which element of X^{(n-3)*} it is to be evaluated at should also be pre-stated.

Hence, for every element in X^{n*}, one element each from X^{(n-2)*}, X^{(n-3)*},X^{(n-4)*},\dots ,X should be pre-stated. For every such element in X^{n*}, this (n-2)-tuple can be different. To define a well-defined mapping f:X^{n*}\to \Bbb{R}, we choose one particular element b\in X^{(n-1)*}, and call the mapping f_b. Hence,

f_b(X^{n*})=X^{n*}(b, rest of the  (n-2)-tuple ),

f_c(X^{n*})=X^{n*}(c, rest of the (n-2)-tuple), and so on.



By

f_b(X^{n*})=X^{n*}(b, rest of the (n-2)-tuple),

we mean the value of every element of X^{n*} at (b, rest of the (n-2)-tuple).

Some facts, better explained, from Atiyah-Macdonald

Today we shall discuss some interesting properties of elements of a ring.

1. If a\in R is not a unit, then it is present in some maximal ideal of the ring R. Self-explanatory.

2. If a is present in every maximal ideal, then 1+xa is a unit for all x\in R. Proof: Let 1+xa not be a unit. Then it is present in some maximal ideal (from 1). Let 1+xa=m, where m is an element from the maximal ideal 1+xa is a part of. Then 1=m-xa. As a lies in every maximal ideal, xa lies in this one too. Hence, 1 is also a member of the maximal ideal, which is absurd.

Let’s break down this theorem into elementary steps, and present a better proof (than given on pg.6 of “Commutative Algebra” by Atiyah-Macdonald). If x\in M_1 for some maximal ideal M_1, then 1\pm xy\notin M_1 for all y\in R. Similarly, If x\in M_2 for some maximal ideal M_2, then 1\pm xy\notin M_2 for all y\in R. This argument can then be extended to the fact that if x\in all maximal ideals, then 1\pm xy\notin any maximal ideal for all y\in R. An element not there in any maximal ideal is a unit. Hence, 1\pm xy is a unit.

3. If 1-xy is a unit for all y\in R, then x is part of every maximal ideal in R. Proof: Let us assume x is not part of some maximal ideal M. Then the ideal generated by M and x is all of R, as M is maximal. Hence there exist some m\in M and y\in R such that m+xy=1. This implies that m=1-xy, which is impossible as 1-xy is a unit and a unit cannot lie in a maximal ideal. The same argument can be used for 1+xy.
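Facts 2 and 3 together say that x lies in every maximal ideal exactly when 1-xy is a unit for all y. As a sanity check, here is a brute-force sketch in the ring \Bbb{Z}/12\Bbb{Z}; the choice of ring is mine, and the code relies on the known fact that the maximal ideals of \Bbb{Z}/n\Bbb{Z} are generated by the prime divisors of n.

```python
from math import gcd

n = 12
ring = range(n)
units = {u for u in ring if gcd(u, n) == 1}

# Maximal ideals of Z/nZ are generated by the prime divisors of n.
primes = [p for p in range(2, n + 1)
          if n % p == 0 and all(p % q for q in range(2, p))]
maximal_ideals = [{(p * k) % n for k in ring} for p in primes]

# Elements lying in every maximal ideal.
in_every_maximal = {x for x in ring if all(x in m for m in maximal_ideals)}

# Elements x such that 1 - x*y is a unit for every y.
one_minus_xy_units = {x for x in ring
                      if all((1 - x * y) % n in units for y in ring)}

print(in_every_maximal, one_minus_xy_units)
```

Both sets come out as \{0,6\}, the intersection of the ideals (2) and (3).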


On pg.6 of Atiyah-Macdonald, it is mentioned that if a,b,c are ideals, then a\cap (b+c)=a\cap b+a\cap c if b\subseteq a or c\subseteq a. It is not elaborated in the book, and the hurried reader may be confused. I’d like to elaborate on this concept.

a\cap (b+c) consists of those elements in b+c which are also there in a. Now elements of a\cap (b+c)=[(b+c) wrt those elements of both b and c that are in a] \bigcup [(b+c) wrt those elements of b that are in a and those elements of c that are not] \bigcup [(b+c) wrt those elements of b that are not in a and those elements of c that are] \bigcup [(b+c) wrt those elements of both b and c that are not in a]

The second and the third terms are null sets, as can be easily seen.

The fourth term is NOT necessarily empty. However, it becomes an empty set if b,c\subseteq a. It may also become an empty set under other conditions which ensure that if b_1\in b and c_1\in c are both not in a, then b_1+c_1\notin a.

In summation, a\cap (b+c)=a\cap b+a\cap c is definitely true when b,c\subseteq a. However, it is also true under other conditions which ensure that the fourth term is an empty set.


I have extended a small paragraph in Atiyah-Macdonald to a full-fledged exposition: (a,b,c and d are ideals in commutative ring R)

1. a\cup (b+c)\subseteq(a\cup b)+(a\cup c)– Both sides contain all elements of a and b+c. Remember that b\cup c\subseteq b+c. However, the right hand side also contains elements of the form a+b and a+c, which the left hand side does not contain.

2. a\cap (b+c)– This has already been explained above.

3. a+(b\cup c)=(a+b)\cup (a+c)– Both are exactly the same.

4. a+ (b\cap c)\subseteq (a+b)\cap (a+c)– There might be b_1,c_1\notin b\cap c such that a'+b_1=a''+c_1. However, any element in a+(b\cap c) will definitely be present in (a+ b)\cap (a+c).

5. a(b\cup c)\supseteq ab\cup ac– LHS contains elements of the form a'b_1+a''c_1, which the RHS doesn’t. In fact, LHS is an ideal while the RHS isn’t. You might wonder how LHS is an ideal. I have just extended the algorithm used to make AB an ideal when A and B are both ideals to situations in which A is an ideal and B is any subset of R.

6. a(b\cap c)\subseteq ab\cap ac– The RHS may contain elements of the form a'b_1=a''c_1 for b_1,c_1\notin b\cap c.

7. a(b+c)=ab+ac– Easy enough to see.

8. (a+b)(c\cap d)=(c\cap d)a+(c\cap d)b\subseteq(ac\cap ad)+(bc\cap bd)

From this formula, with c=a and d=b, we have (a+b)(a\cap b)\subseteq (a^2\cap ab)+(ab\cap b^2)\subseteq ab.
This fact is mentioned on pg.7 of Atiyah-Macdonald.
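In the ring \Bbb{Z} this inclusion can be checked by hand, and it turns out to be an equality. The sketch below assumes the usual dictionary for ideals of \Bbb{Z}: (m)+(n) is generated by the gcd, (m)\cap(n) by the lcm, and (m)(n) by mn. The sample pairs are arbitrary.

```python
from math import gcd

def lcm(m, n):
    """Least common multiple; generates the intersection (m) ∩ (n) in Z."""
    return m * n // gcd(m, n)

def sum_times_intersection(m, n):
    """Generator of ((m) + (n)) * ((m) ∩ (n)) in Z."""
    return gcd(m, n) * lcm(m, n)

pairs = [(4, 6), (9, 12), (5, 7), (8, 8)]
checks = [sum_times_intersection(m, n) == m * n for m, n in pairs]
print(checks)
```

Every check is True because gcd times lcm is the product. In a general ring only the inclusion is guaranteed.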

9. (a+b)(c\cup d)= (c\cup d)a+(c\cup d)b= (ca\cup da)+(cb\cup db).

From this formula, we have (a+b)(a\cup b)=(a^2\cup ab)+(b^2\cup ab)\supseteq ab.

The existence or inexistence of a maximal element

Have you ever wondered why the real number line does not have a maximal element?
Take \Bbb{R}. Define an element \alpha. Declare that \alpha is greater than any element in \Bbb{R}. Can we do that? Surely! We’re defining it thus. In fact, \alpha does not even have to be a real number! It can just be some mysterious object that we declare to be greater than every real number. Note that \alpha is greater than every real number, but it is not the maximal element of \Bbb{R}, as for that it will have to be a part of \Bbb{R}. Why can’t \alpha be a part of \Bbb{R}? We’ll see in the next paragraph.

However, it is when we assert that \alpha has to be a real number that we begin to face problems. If \alpha is a real number, then so is \alpha+1. Thereby, we reach a contradiction, showing that no real number can exist which is greater than all other real numbers.

Another, purely heuristic, approach is to imagine the sum of all positive real numbers. Let that sum be \mathfrak{S}, which would be greater than any one real number. However, if \mathfrak{S} were a real number, it would be smaller than the real number \mathfrak{S}+1. Note that the field axioms only guarantee that finite sums of real numbers are real numbers, so \mathfrak{S} is not actually a real number; this is what lets us imagine an object greater than all real numbers. The same heuristic would work if we were to multiply all real numbers greater than 1.


Now I’d like to draw your attention to the proof of the fact that every non-zero ring must have a maximal ideal, as given on pg. 4 of the book “Commutative Algebra” by Atiyah-Macdonald. The gist of the proof is: take any chain of ideals which are proper subsets of the ring, and find its union. This union is again a proper ideal, as it does not contain 1, and hence is an upper bound of the chain. Zorn’s lemma then gives a maximal ideal.

Why this proof works is that we wouldn’t know which element to add to make the ideal bigger. If we could construct a bigger ideal for any ideal we choose, we can prove that no maximal ideal exists. But whatever element we choose to add to the previous ideal, we have no reason to suspect that that element does not already exist in it.

Let us generalize this argument. Let us take a set of elements, define an order between the elements, and then declare the existence of a maximal element which is part of the set. If we cannot prove that a bigger element exists, then there is no contradiction, and hence that element is indeed a maximal element of the set. This argument works if we were to prove that every ring has a maximal ideal, and does not if we were to prove that \Bbb{R} has a maximal element.


Breaking down Zorn’s lemma

Today I’m going to talk about Zorn’s lemma. No. I’m not going to prove that it is equivalent to the Axiom of Choice. All I’m going to do is talk about what it really is. Hopefully, I shall be able to create a visually rich picture so that you may be able to understand it well.

First, the statement.

“Suppose a partially ordered set P has the property that every chain (i.e. totally ordered subset) has an upper bound in P. Then the set P contains at least one maximal element.”

Imagine chains of elements. Like plants. These may be intersecting or not. Imagine a flat piece of land, and lots of plants growing out of it. These may grow straight, or may grow in a crooked fashion, intersecting. These plants are totally ordered chains of elements. Now as all such chains have upper bounds, imagine being able to see the tops of each of these plants. Note three things: 1. Each plant may have multiple tops (or upper bounds). 2. There may be multiple points of intersection between any two plants. 3. Different plants may have the same top.

Moreover, there may be small bits of such plants lying on the ground. These are elements that are not part of any chain. If any such bit exists on the ground, then we have a maximal element. Proof: If it could be compared to any other element, it would be on a chain. If it can’t be compared to any other element, it’s not smaller than any element.

Let us suppose no such bits of plants exist. Then a maximal element of any chain will be the maximal element of the whole set! Proof: It is not smaller than any element in its own chain. It can’t be compared with the chains which do not intersect with this chain. And as for chains that intersect with this chain, if the maximal element is the same, then we’re done. If the maximal elements are not the same, then too the two maximal elements can’t be compared. Hence, every distinct maximal element is a maximal element of the whole set.

Assuming that the set is non-empty, at least one plant bit or chain has to exist. Hence, every partially ordered set in which each chain has an upper bound has at least one maximal element. The possible candidates are plant bits (elements not in any chain) and plant tops (maximal elements of chains).
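For a finite partially ordered set the picture can be computed outright, since every chain is finite and so has an upper bound. The sketch below uses an example of my own choosing: the numbers 2 to 12 ordered by divisibility. The function names are mine as well.

```python
def maximal_elements(elements, less_than):
    """Elements that are not strictly below any other element."""
    return [x for x in elements
            if not any(less_than(x, y) for y in elements)]

def properly_divides(x, y):
    """Strict order: x is below y when x divides y and x != y."""
    return x != y and y % x == 0

P = list(range(2, 13))
tops = maximal_elements(P, properly_divides)
print(tops)   # [7, 8, 9, 10, 11, 12]
```

Here 8, 9, 10 and 12 are plant tops (for instance 2, 4, 8 is a chain ending at 8), while 7 and 11 are plant bits: they cannot be compared with any other element of the set.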

The mysterious linear bounded mapping

What exactly is a linear bounded mapping? The definition says a linear mapping T is called bounded if there is a constant c such that \|Tx\|/\|x\|\leq c for every non-zero x. When you hear the word “bounded”, the first thing that strikes you is that the mappings can’t exceed a particular value. That all image points are within a finite outer covering. That, unfortunately, is not implied by a linear bounded mapping.

The image points can lie anywhere in infinite space. It is just that the vectors with norm 1 are mapped to vectors whose norms have a finite upper bound (here the upper bound is c. One may also refer to it as \|T\|). Say a is a vector which is mapped to Ta. Then sa will be mapped to sTa, where s is a scalar. This is how the mapping is both bounded (for vectors with norm 1) and linear (T(\alpha x+\beta y)=\alpha Tx+\beta Ty).
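A tiny numerical illustration of this point, under assumptions: I take T to be the map (x,y)\mapsto(3x,y) on \Bbb{R}^2 with the Euclidean norm, and sample 360 unit vectors. None of these choices come from a textbook.

```python
import math

def T(v):
    """A hypothetical linear map on R^2, chosen for illustration."""
    x, y = v
    return (3.0 * x, 1.0 * y)

def norm(v):
    return math.hypot(*v)

# Sample the unit circle and record the largest image norm found.
unit_vectors = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
                for k in range(360)]
c = max(norm(T(u)) for u in unit_vectors)

# A long vector: its image is huge, but the ratio ||Tx|| / ||x|| is still at most c.
big = tuple(1e6 * t for t in unit_vectors[0])
ratio = norm(T(big)) / norm(big)
print(c, norm(T(big)), ratio)
```

The image of the long vector has norm in the millions, so the image points are not bounded at all; only the ratio \|Tx\|/\|x\| is.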

Could this concept be generalized? Of course! We could have quadratic bounded mappings: vectors with norm 1 are mapped in a similar way as linear bounded mappings, and T(\alpha x)=\alpha^2 Tx. What about T(\alpha x+\beta y)? Let r be a scalar such that \|r(\alpha x+\beta y)\|=1. Then T(\alpha x+\beta y)=\frac{1}{r^2} T(r(\alpha x+\beta y)). Similarly we could have cubic bounded mappings, etc.

Then why are linear bounded mappings so important? Why haven’t we come across quadratic bounded mappings, or the like? This is because linear bounded mappings are definitely continuous, whilst not much can be said about other bounded mappings. Proof: by linearity, \|T(x-x_0)\|\leq c\|x-x_0\|\implies \|Tx-Tx_0\|\leq c\|x-x_0\|, which implies that linear bounded mappings are continuous. Does this definitely prove that quadratic bounded mappings are not continuous? No. All that is shown here is that we can’t use this method to prove that quadratic or other bounded mappings are continuous.

The “supremum” norm

Today I shall speak on a topic that I feel is important.

All of us have encountered the “\sup” norm in Functional Analysis. In C[a,b], \|f_1-f_2\|=\sup\limits_{x\in [a,b]}|f_1(x)-f_2(x)|. In the space B(X,Y) of bounded linear operators, \|T_1-T_2\|=\sup\limits_{x\in X, x\neq 0}\frac{\|(T_1-T_2)x\|}{\|x\|}. What is the utility of this “sup” norm? Why can’t our norm be based on “inf” or the infimum? Or even \frac{\sup+\inf}{2}?

First we’ll talk about C[a,b]. Let us take a straight line on the X-Y plane, and a sequence of continuous functions converging to it. How do we know they’re converging? Through the norm. Had the norm been \inf, convergence would only have to be shown at one point. For example, according to this norm, the sequence f_n=|x|+\frac{1}{n} converges to y=0. This does not appeal to our aesthetic sense, as we’d want the shapes of the graphs to gradually start resembling y=0.  
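The difference between the two candidate "norms" shows up clearly in numbers. This sketch approximates \sup and \inf of |f_n(x)-0| on a grid; the interval [-1,1] and the grid size are my assumptions, since no interval is fixed above.

```python
def sup_and_inf_distance(f, g, a=-1.0, b=1.0, samples=2001):
    """Approximate sup and inf of |f(x) - g(x)| over [a, b] on a grid."""
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    gaps = [abs(f(x) - g(x)) for x in xs]
    return max(gaps), min(gaps)

def zero(x):
    """The line y = 0."""
    return 0.0

results = []
for n in (1, 10, 100, 1000):
    f_n = lambda x, n=n: abs(x) + 1.0 / n
    results.append(sup_and_inf_distance(f_n, zero))

for n, (sup_gap, inf_gap) in zip((1, 10, 100, 1000), results):
    print(n, sup_gap, inf_gap)
```

The \inf column shrinks like \frac{1}{n}, while the \sup column never drops below 1, so only the \sup norm notices that the graphs of f_n do not flatten onto y=0.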

Now let’s talk about B(X,Y). If we had the \inf norm, then we might not have been able to take every point x\in X and show that the sequence (T_n x), n\in\Bbb{N} is a Cauchy sequence. So what if we cannot take every x\in X and say (T_n x) is a Cauchy sequence? This would crush the proof. How? Because then \lim\limits_{n\to\infty} T_n would not resemble the terms of the Cauchy sequence at all points in X, and hence we wouldn’t be able to comment on whether \lim\limits_{n\to\infty} T_n is linear for all x\in X or not. Considering that B(X,Y) contains only bounded linear operators, the limit of the Cauchy sequence (T_n) not being a part of B(X,Y) would prove that B(X,Y) is not complete. Hence, in order for us to be able to prove that \lim\limits_{n\to\infty} T_n is linear and that B(X,Y) is complete, we need to use the \sup norm.