Extract URL Preview Content with PHP and jQuery
This post will guide you through extracting the contents of a URL the way sites like Facebook, Twitter, and Google do, retrieving the title, description, and preview image for any submitted URL.

We will be creating the following files:
- index.php, contains the HTML form that lets us submit a URL for extraction.
- extract-contents.php, contains the code that fetches the required data from the submitted URL.
- javascript.js, contains the code that sends the AJAX request to extract-contents.php.
- style.css, contains all the style formatting for our HTML page and the URL preview box.
To extract the URL preview content, extract-contents.php does the main job:
- Prepare a regular expression to validate the URL.
- Validate the URL and fetch its content (a minimal sketch of these first two steps follows this list).
- Create a new DOM document and load the fetched content into it.
- Search the content for the first image and for the title and description tags.
- Prepare the HTML preview container and return the response.
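Before looking at the full files, here is a minimal, self-contained sketch of those first two steps. It is only an illustration, not the code used later in this post: filter_var() stands in for the regular expression, and the example.com URL and the five-second timeout are placeholder values.

<?php
// A minimal sketch of validating and fetching a URL (placeholder URL and timeout values)
$url = 'https://www.example.com';

// filter_var() is a built-in alternative to the regular expression used in extract-contents.php
if (filter_var($url, FILTER_VALIDATE_URL) === false) {
    die('Invalid URL submitted.');
}

// Fetch the page with a timeout so a slow host cannot hang the request
$context = stream_context_create(['http' => ['timeout' => 5]]);
$content = @file_get_contents($url, false, $context);

if ($content === false) {
    die('Error fetching the submitted URL.');
}
echo strlen($content) . " bytes fetched\n";
?>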
index.php
<!DOCTYPE html>
<html>
<head>
    <title>Extract URL Contents with PHP and jQuery - Demo</title>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
    <script type="text/javascript" src="js/jquery-3.1.1.min.js"></script>
    <script type="text/javascript" src="js/javascript.js"></script>
    <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
    <link rel="stylesheet" href="css/style.css" />
</head>
<body>
    <div class="container">
        <div class="extract-wrapper">
            <label>Enter an absolute URL like https://www.codestacked.info</label>
            <form class="url-extract-form">
                <div class="input-group">
                    <input type="url" class="form-control url-input" value="" required="required" placeholder="Enter a URL to extract contents" />
                    <button type="submit" class="btn btn-green">Extract</button>
                </div>
                <div class="loader">
                    <i class="fa fa-spinner fa-spin"></i>
                </div>
            </form>
            <div class="content-wrapper" id="content-wrapper"></div>
        </div>
    </div>
</body>
</html>
extract-contents.php
<?php
if ($_POST) {
    $post = $_POST;
    $url = strtolower($post['url']);
    $url = str_starts_with($url, 'http') ? $url : 'https://' . $url;
    // Regular expression to validate the URL
    $regex = '/^((https?|ftp):\/\/)(www\.)?[\w\-]+\.[a-z]{2,4}\/?[\w\/\-]*(\.[a-z]{2,4})?$/';
    // Check that the URL is valid; $hostname captures the match so it can be
    // used later as the base for relative image paths
    if (preg_match($regex, $url, $hostname)) {
        // Get the contents of the URL
        $content = @file_get_contents($url);
        // If fetching the contents failed, show an error
        if (!$content) {
            die('<div class="error">Error parsing the submitted URL.</div>');
        }
        $title = $description = $image = "";
        $images_arr = [];
        // Create a new DOM document object
        $dom = new DOMDocument('1.0', 'UTF-8');
        // Load the URL content into the DOM document
        @$dom->loadHTML($content);
        // Get all images from the DOM document
        $images = $dom->getElementsByTagName('img');
        // Loop through the images and push their sources to the images array
        foreach ($images as $img) {
            $src = parse_url($img->getAttribute('src'));
            if (!empty($src['path'])) {
                $images_arr[] = $img->getAttribute('src');
            }
        }
        // Create an XPath object for the current DOM document
        $xPath = new DOMXPath($dom);
        $og_title         = $xPath->query('//meta[@property="og:title"]');
        $og_description   = $xPath->query('//meta[@property="og:description"]');
        $og_image         = $xPath->query('//meta[@property="og:image"]');
        $meta_description = $xPath->query('//meta[@name="description"]');
        $meta_title       = $xPath->query('//title');
        // Prepare the title of the document
        if ($og_title->length) {
            $title = $og_title->item(0)->getAttribute('content');
        } elseif ($meta_title->length) {
            $title = $meta_title->item(0)->textContent;
        }
        // Prepare the description of the document
        if ($og_description->length) {
            $description = $og_description->item(0)->getAttribute('content');
        } elseif ($meta_description->length) {
            $description = $meta_description->item(0)->getAttribute('content');
        }
        // Prepare the image of the document, falling back to the first page image
        if ($og_image->length) {
            $image = $og_image->item(0)->getAttribute('content');
        } elseif (!empty($images_arr)) {
            $image = reset($images_arr);
        }
        ?>
        <div class="url-info-box">
            <?php
            if (!empty($image)) {
                // Keep absolute and protocol-relative image URLs; prefix relative paths with the submitted URL
                $image = (preg_match('/^(https?)/', $image)) || (preg_match('/^(\/\/)/', $image))
                    ? $image
                    : $hostname[0] . $image;
                $size   = @getimagesize($image);
                $width  = $size[0] ?? '';
                $height = $size[1] ?? '';
                ?>
                <div class="image">
                    <img src="<?=$image;?>" class="img-responsive" width="<?=$width?>" height="<?=$height?>" alt=""/>
                </div>
            <?php } ?>
            <div class="data">
                <div class="title">
                    <?=$title;?>
                </div>
                <div class="description"><?=$description;?></div>
            </div>
        </div>
        <?php
    } else {
        echo '<div class="error">Invalid URL submitted.</div>';
    }
}
?>
In extract-contents.php we first build a regular expression to validate the submitted URL. If the URL is valid, we fetch its contents, create a new DOM document, and load the fetched content into it as HTML. The title, description, and image start out empty. We also collect an array of the page's images so that, if no Open Graph image is set on the document, we can use the first image found on the submitted page.
After that we look for the three values we need. We check the Open Graph meta tags first; if they exist, we use them for the title, description, and image. Otherwise we fall back to the document's meta tags for the title and description, and to the first image on the submitted page for the image. A new DOMXPath object is used to access elements of the loaded DOM document with XPath queries.
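A side note on the @ operator used before loadHTML(): it merely hides parser warnings. If you prefer to keep the output clean without suppressing errors this way, libxml's internal error buffer can be used instead. Below is a minimal, self-contained sketch of that approach combined with the same kind of XPath lookup; the inline $html string is placeholder markup for illustration only.

<?php
// A minimal sketch (not part of the files above): read an Open Graph tag without
// the @ error-suppression operator by buffering libxml parse warnings instead.
libxml_use_internal_errors(true);

// Placeholder markup purely for illustration
$html = '<html><head><meta property="og:title" content="Example title"/></head><body></body></html>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html);   // malformed markup no longer prints warnings
libxml_clear_errors();   // discard whatever errors were collected

$xPath = new DOMXPath($dom);
$og_title = $xPath->query('//meta[@property="og:title"]');
echo $og_title->length ? $og_title->item(0)->getAttribute('content') : 'No og:title found';
?>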
javascript.js
$(document).ready(function(){
    $(".url-extract-form").on("submit", function(e){
        e.preventDefault();
        var url = $(".url-input").val();
        $(".content-wrapper").hide();
        if(url != ''){
            $(".loader").fadeIn();
            $.ajax({
                url: "extract-contents.php",
                type: "POST",
                data: {
                    url: url
                },
                success: function(data){
                    $(".content-wrapper").html(data).slideDown();
                    $(".loader").fadeOut();
                }
            });
        }
    });
});
style.css
* {
    box-sizing: border-box;
}
html, body {
    margin: 0;
    padding: 0;
}
body {
    background-color: #f6f6f6;
    font-family: "Segoe UI", "Roboto", "Helvetica", sans-serif;
    font-size: 15px;
    font-weight: normal;
    font-style: normal;
}
.container {
    max-width: 1024px;
    margin: 0 auto;
    padding-left: 15px;
    padding-right: 15px;
}
.url-extract-form {
    position: relative;
    margin-bottom: 1rem;
}
.extract-wrapper label {
    display: inline-block;
    margin-bottom: 0.25rem;
}
.input-group {
    position: relative;
    display: flex;
    flex-wrap: wrap;
    align-items: stretch;
    width: 100%;
}
.form-control {
    border: 1px solid #ddd;
    padding: 10px;
    position: relative;
    font-size: inherit;
    flex: 1 1 auto;
    width: 1%;
    min-width: 0;
}
.form-control:focus {
    border-color: #00c0ef;
    outline: 0;
}
.loader {
    position: absolute;
    inset: 0;
    font-size: 1.75rem;
    background: rgba(150, 150, 150, 0.5);
    z-index: 5;
    padding: 0px 10px;
    display: none;
    color: #006699;
    text-align: center;
}
.url-extract-form button {
    display: inline-block;
    padding: 5px 10px;
    cursor: pointer;
    font: inherit;
    background: #00a65a;
    border: 1px solid #009549;
    color: #fff;
    margin-left: -1px;
}
.content-wrapper .error {
    padding: 10px;
    background: #e95454;
    color: #fff;
}
.url-info-box {
    background: #fefefe;
    border: 1px solid #fefefe;
    overflow: hidden;
    font-size: 13px;
    max-width: 300px;
}
.img-responsive {
    max-width: 100%;
    height: auto;
    display: block;
    margin: 0 auto;
}
.url-info-box .data {
    padding: 15px;
    background: #efefef;
}
.url-info-box .title {
    font-weight: bold;
    max-height: 35px;
    overflow: hidden;
    color: #3778cd;
}