Initially, caches were very small segments of memory kept on or close to the execution unit in order to retain information known or expected to be used soon or reused frequently. As programmers gained experience with caches (registers can be considered first-level caches in many respects, especially since newer-generation processors have so many of them), caches came to be seen as increasingly important to the efficient execution of time-critical code.
There's an obvious trade-off here: programmers would like the entire memory to be available within a cycle or two of its request, while hardware architects know that the more memory you have close to the processor, the higher the cost of the entire system. Various compromises have been worked out, and one of them is the multi-level cache: segments of memory located at increasing access distance from the processor, each with a size in rough proportion to its distance. That is, the longer you're willing to wait, the more memory you'll be able to draw from. This allows for quite complicated schemes that attempt to keep as much as possible as close as possible, with the decision about which data should live in which level of the caching hierarchy driven by an intelligent estimate of what will be needed in the near future.
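The size-versus-distance trade-off can be sketched with a toy model. The code below is only an illustration under stated assumptions, not a description of any real processor: the class name CacheLevel, the function access, the capacities, and the cycle counts are all made up for the example, and least-recently-used replacement is just one simple stand-in for the "intelligent regard" for near-future needs described above. Each level is searched in order of distance, and data found farther away is copied into every nearer level on the way back.

```python
from collections import OrderedDict


class CacheLevel:
    """One hypothetical cache level: fixed capacity, fixed latency, LRU eviction."""

    def __init__(self, name, capacity, latency):
        self.name = name
        self.capacity = capacity      # number of addresses this level can hold
        self.latency = latency        # cycles to probe this level (illustrative)
        self.lines = OrderedDict()    # oldest entry first, most recent last
        self.hits = 0

    def lookup(self, addr):
        """Return True on a hit and mark the address as most recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.hits += 1
            return True
        return False

    def fill(self, addr):
        """Store the address, evicting the least recently used one if full."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def access(levels, memory_latency, addr):
    """Return the cycles spent fetching addr, probing the nearest level first."""
    cycles = 0
    found_at = len(levels)            # assume main memory until a level hits
    for i, level in enumerate(levels):
        cycles += level.latency
        if level.lookup(addr):
            found_at = i
            break
    if found_at == len(levels):
        cycles += memory_latency      # missed everywhere: go to main memory
    for nearer in levels[:found_at]:
        nearer.fill(addr)             # keep a copy in every nearer level
    return cycles


# Assumed sizes and latencies: a tiny fast level and a larger, slower one.
levels = [CacheLevel("L1", capacity=2, latency=1),
          CacheLevel("L2", capacity=4, latency=10)]
costs = [access(levels, 100, addr) for addr in [0, 0, 1, 2, 0]]
print(costs)  # [111, 1, 111, 111, 11]
```

In this trace the second request for address 0 costs a single cycle because it is still in the nearest level. The last request finds that address pushed out of the small level but still held in the larger one, so it costs 11 cycles instead of the 111 needed to go all the way to main memory.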