Extract Text from Image Left-to-Right and Top-to-Bottom with Keras-OCR

Oct 26, 2022

Make computers read text in a more ‘human’ way.

What is OCR? OCR stands for Optical Character Recognition. It recognizes text within a digital image. In this article, I will discuss how I made improvements to a library called Keras-OCR in order to return text in an ordered, human-readable format (left to right, top to bottom).

The library's documentation is linked below:

keras-ocr

This is a slightly polished and packaged version of the Keras CRNN implementation and the published CRAFT text…

pypi.org

A simple implementation, following the library's documentation:

import keras_ocr

def detect_w_keras(image_path):
    """Function returns detected text from image"""

    # Initialize pipeline
    pipeline = keras_ocr.pipeline.Pipeline()

    # Read in image path
    read_image = keras_ocr.tools.read(image_path)

    # prediction_groups holds one list of (word, box) tuples per input image
    prediction_groups = pipeline.recognize([read_image])

    # Only one image was passed in, so return its predictions
    return prediction_groups[0]

A sample prediction, returned as a (word, box) tuple with the bounding box coordinates:

# In the order of...
# (word, ([[top-left], [top-right], [bottom-right], [bottom-left]]))
('those',
 array([[299.41794 ,  82.824036],
        [483.91843 ,  86.465485],
        [482.73495 , 146.42897 ],
        [298.23447 , 142.78752 ]], dtype=float32))
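To make the (word, box) structure concrete, here is a minimal sketch that unpacks the sample prediction above. A plain nested list stands in for the NumPy array that keras-ocr returns, and `box_size` is a hypothetical helper name used only for illustration; it is not part of the library.

```python
import math

# Sample prediction from above; a nested list stands in for the float32 array.
prediction = ('those', [[299.41794, 82.824036],
                        [483.91843, 86.465485],
                        [482.73495, 146.42897],
                        [298.23447, 142.78752]])

def box_size(box):
    """Return (width, height) of a box given as four (x, y) corners.

    Corners are assumed to be ordered top-left, top-right,
    bottom-right, bottom-left, as in the sample above.
    """
    top_left, top_right, bottom_right, _bottom_left = box
    width = math.dist(top_left, top_right)        # length of the top edge
    height = math.dist(top_right, bottom_right)   # length of the right edge
    return width, height

word, box = prediction
width, height = box_size(box)
print(word, round(width, 1), round(height, 1))
```

The box is slightly rotated, so the edge lengths are measured as point-to-point distances rather than simple differences in x or y.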

A typical bounding box usually looks like…

After using Keras-OCR to extract any detectable text in an image, I used the Pythagorean Theorem (hello middle-school) to order the bounding boxes. Each bounding box’s center has a distance from the origin at (0,0). The detections are first grouped into rows, and the words within each row are then sorted by that distance. Note: Matplotlib displays images with the y-axis inverted, so the origin is at the top-left corner and y grows downward. This is normal in computer vision.
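A quick sketch shows why the distance from the origin is not enough on its own and why rows have to be separated first. The center coordinates below are made-up values for illustration, not output from Keras-OCR.

```python
import math

# Made-up box centers (x, y) in image coordinates, where y grows downward.
centers = {
    'those': (391, 115),   # first row, left
    'words': (600, 117),   # first row, right
    'below': (390, 180),   # second row, left
}

# Hypotenuse of the triangle formed with the origin at (0, 0).
distances = {word: math.hypot(x, y) for word, (x, y) in centers.items()}

# Sorting by distance alone mixes the two rows together.
by_distance = sorted(distances, key=distances.get)
print(by_distance)
```

Here 'below' lands ahead of 'words' even though it sits on the next row, because a word at the start of a lower row can be closer to the origin than a word at the end of the row above it.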

Follow the yellow triangle to see the pattern.

Now to put these yellow triangles into code: if the triangle gets wider, the word is in the same row; if the triangle gets taller by more than the specified threshold, the word starts a new row. First, compute the distance from the origin for each bounding box. The results are stored in a list of dictionaries, one per detection, each with multiple (key, value) pairs.

import math

def get_distance(predictions):
    """
    Function returns a list of dictionaries, one per detection, with (key,value):
        * text : detected text in image
        * center_x : center of bounding box (x)
        * center_y : center of bounding box (y)
        * distance_from_origin : hypotenuse
        * distance_y : distance between y and origin (0,0)
    """

    # Point of origin
    x0, y0 = 0, 0

    # Generate dictionary
    detections = []
    for group in predictions:
        # Get center point of bounding box
        # (corners are ordered top-left, top-right, bottom-right, bottom-left,
        # so the bottom-right corner is at index 2)
        top_left_x, top_left_y = group[1][0]
        bottom_right_x, bottom_right_y = group[1][2]
        center_x = (top_left_x + bottom_right_x) / 2
        center_y = (top_left_y + bottom_right_y) / 2

        # Use the Pythagorean Theorem to solve for distance from origin
        distance_from_origin = math.dist([x0, y0], [center_x, center_y])

        # Calculate difference between y and origin to get unique rows
        distance_y = center_y - y0

        # Append all results
        detections.append({
            'text': group[0],
            'center_x': center_x,
            'center_y': center_y,
            'distance_from_origin': distance_from_origin,
            'distance_y': distance_y
        })

    return detections

Next, split the detections into rows. Each sublist is a new row. The threshold is the largest vertical gap, in image coordinates, allowed between neighboring detections before a new row starts, and it may need to be adjusted depending on how spaced out the text is in the original image. The default value is 15, which works well for most text in images.

def distinguish_rows(lst, thresh=15):
    """Function to help distinguish unique rows"""

    # Nothing to group
    if not lst:
        return

    # Compare neighbors from top to bottom
    lst = sorted(lst, key=lambda d: d['distance_y'])

    # Start the first row with the first detection so it is never dropped
    sublists = [lst[0]]
    for i in range(0, len(lst)-1):
        if lst[i+1]['distance_y'] - lst[i]['distance_y'] <= thresh:
            sublists.append(lst[i+1])
        else:
            yield sublists
            sublists = [lst[i+1]]
    yield sublists
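To get a feel for how the threshold changes the result, here is a small sketch that applies the same gap rule to bare y-coordinates. `split_rows` and the sample values are hypothetical and exist only for this illustration.

```python
def split_rows(ys, thresh=15):
    """Group y-coordinates into rows; a gap above `thresh` starts a new row."""
    rows = []
    for y in sorted(ys):
        # Compare against the last value placed in the current row
        if rows and y - rows[-1][-1] <= thresh:
            rows[-1].append(y)
        else:
            rows.append([y])
    return rows

# Made-up center y-values for six detections
center_ys = [40, 44, 52, 70, 73, 120]

rows_15 = split_rows(center_ys, thresh=15)
rows_20 = split_rows(center_ys, thresh=20)
print(rows_15)
print(rows_20)
```

With a threshold of 15, the gap of 18 between 52 and 70 starts a new row, giving three rows. With a threshold of 20, the same gap is tolerated and the first two rows merge, so tightly spaced text generally needs a smaller threshold.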

Final results:

def main(image_path, thresh=15, order='yes'):
    """
    Function returns predictions in human readable order
    from left to right & top to bottom
    """

    predictions = detect_w_keras(image_path)
    predictions = get_distance(predictions)
    predictions = list(distinguish_rows(predictions, thresh))

    # Remove all empty rows
    predictions = list(filter(lambda x: x != [], predictions))

    # Order text detections in human readable format
    ordered_preds = []
    ylst = ['yes', 'y']
    for pr in predictions:
        # Sort each row left to right; otherwise keep the detection order
        row = sorted(pr, key=lambda x: x['distance_from_origin']) if order in ylst else pr
        for each in row:
            ordered_preds.append(each['text'])

    return ordered_preds
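The whole idea can be tried without running the OCR model. The sketch below is a compact, self-contained approximation of the steps above, fed with hand-made (word, box) tuples in shuffled order. `make_box`, `order_words`, and the coordinates are all hypothetical and are not part of Keras-OCR or the source code linked below.

```python
import math

def make_box(x, y, w=80, h=30):
    """Build an axis-aligned box: top-left, top-right, bottom-right, bottom-left."""
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]

def order_words(predictions, thresh=15):
    """Return words top-to-bottom, then left-to-right within each row."""
    items = []
    for word, box in predictions:
        (x1, y1), (x2, y2) = box[0], box[2]    # top-left and bottom-right corners
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2  # center of the box
        items.append((cy, math.hypot(cx, cy), word))
    items.sort(key=lambda item: item[0])       # top to bottom by center y

    # Start a new row whenever the vertical gap exceeds the threshold
    rows = []
    for item in items:
        if rows and item[0] - rows[-1][-1][0] <= thresh:
            rows[-1].append(item)
        else:
            rows.append([item])

    # Within each row, sort by distance from the origin (left to right)
    ordered = []
    for row in rows:
        ordered.extend(word for _, _, word in sorted(row, key=lambda item: item[1]))
    return ordered

# Hand-made detections, deliberately out of reading order
fake_predictions = [
    ('fox', make_box(200, 62)),
    ('quick', make_box(110, 12)),
    ('the', make_box(20, 10)),
    ('brown', make_box(20, 60)),
    ('jumps', make_box(290, 58)),
]

ordered_words = order_words(fake_predictions)
print(' '.join(ordered_words))
```

The small vertical offsets between words on the same line (10 versus 12, or 58 to 62) stay well under the threshold, so they end up in the same row, while the jump of roughly 50 between the two lines starts a new one.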

Source code and Jupyter Notebook can be accessed at: https://github.com/shegocodes/keras-ocr. Let me know if you run into any issues with my code. Feel free to contact me here.

Thank you so much for making it to the end of this page. If you found me helpful, please feel free to support me and Shegocodes by giving me a follow here on Medium and/or buying me a cup of coffee so I can continue to contribute to open source work and build. Happy Coding!


Written by Shegocodes
